Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a CNN/LSTM-based blind auxiliary vision system that uses artificial intelligence to overcome the inconvenience of the traditional guide stick, guide dog, and guide glasses and to enrich the perception of the blind: the guide not only prompts the distance of obstacles but also conveys the sensed world, taking care of the psychological feelings of the blind. The system has the advantages of a light and handy structure, convenience, strong real-time performance, high precision, low cost, easy carrying, no need for networking, and the like.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a CNN/LSTM-based blind-aided vision system, comprising:
the image acquisition device acquires images around the user in real time;
the control system, which runs a deep neural network pre-trained on a large number of labeled pictures and translates the information contained in a scene in real time;
and the voice broadcasting system broadcasts the information in a voice form.
The image acquisition device is a portable camera. The image acquisition device, the control system, and the voice broadcasting system are integrated into a single unit, and the control system adopts an embedded chip.
The translated scene information consists of the objects in the current scene and the relationships among them, and is output in the form of text.
The deep neural network uses a deep convolutional neural network. The network is trained on a labeled data set, optimized with the Dropout algorithm, and its pooling layers are replaced with dilated convolutions. A long short-term memory network (LSTM) translates the feature map output by the deep convolutional neural network; the LSTM unit parameters are updated with the BPTT algorithm, and the network finally outputs the text with which the pictures in the data set are labeled, yielding a pre-trained deep neural network capable of translating picture information into text.
The deep convolutional neural network is VGG16, and the data set is a Microsoft COCO data set.
In the training stage, the deep convolutional neural network processes the portion of the data set used as the training set and converts each image into a feature vector of fixed length; the Dropout algorithm is used to optimize the convolutional network and accelerate the convergence of the deep convolutional neural network, and dilated convolutions replace the pooling layers by inserting zeros between the convolution kernel elements during convolution. The feature map output by the deep convolutional neural network is concatenated with the word embedding vector to form a multimodal feature, which is fed into the long short-term memory network LSTM for translation.
In the testing stage, the invention uses the remaining data in the data set as the test set to evaluate the trained network. The LSTM generates a probability matrix for the word sequence; the word corresponding to the maximum value in each probability vector of the matrix is the predicted word, and the words are combined in order to generate the description sentence.
A word is represented as a one-hot vector, i.e. one element has the value 1 and the remaining elements are 0. Each word in the dictionary is assigned a number, and the length of the vector is equal to the length of the dictionary. Because the pictures used for training carry labels, the words in a label are converted into one-hot vectors, the vectors are spliced into a one-dimensional long vector, and this long vector is combined with the feature map of the corresponding picture to form a multimodal feature.
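As an illustration of this encoding, the following minimal NumPy sketch builds one-hot vectors for the words of a caption and concatenates them with an image feature; the toy dictionary and the 4096-dimensional feature vector are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

# Toy dictionary: every word is assigned a number (index).
dictionary = {"a": 0, "dog": 1, "runs": 2, "on": 3, "grass": 4}

def one_hot(word, dictionary):
    """One-hot vector: the element at the word's index is 1, all others are 0."""
    vec = np.zeros(len(dictionary))
    vec[dictionary[word]] = 1.0
    return vec

caption = ["a", "dog", "runs", "on", "grass"]
word_vectors = [one_hot(w, dictionary) for w in caption]

# Splice the one-hot vectors into a single one-dimensional long vector.
long_vector = np.concatenate(word_vectors)

# Hypothetical image feature from the CNN (e.g. a 4096-d VGG16 vector).
image_feature = np.random.rand(4096)

# Multimodal feature: image feature combined with the caption encoding.
multimodal_feature = np.concatenate([image_feature, long_vector])
print(multimodal_feature.shape)   # (4121,)
```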
The voice broadcasting system uses text-to-speech software to convert the text information translated by the deep neural network into voice, which is played through a loudspeaker or an earphone.
Compared with the prior art, the invention translates the information contained in a scene in real time using a deep neural network pre-trained on a large number of labeled images. The deep neural network outputs text describing the things in the current scene (including people, animals, and other objects) and the relationships among them, and the voice broadcasting system then converts this text into voice information the blind person can understand. The method has the advantages of reliability, strong real-time performance, small size, low cost, high precision, and the like, and can help the blind quickly acquire the environmental information of their current position in voice form.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
The invention relates to a blind auxiliary vision system based on CNN/LSTM, comprising:
the image acquisition device acquires images around a user in real time and can be realized by adopting a small portable camera;
the control system can adopt an embedded chip; it carries a deep neural network pre-trained on a large number of labeled pictures, translates in real time the things contained in the scene (people, animals, other common objects, and the like) and the relationships among them, and outputs this information in the form of text;
the voice broadcasting system uses speech synthesis software (for example, the iFLYTEK speech synthesis software can be selected) to broadcast the text information in a voice form the blind person can understand; through a loudspeaker or an earphone, the blind person obtains the information contained in the environment in real time, quickly learning about surrounding people, vehicles, obstacles, and other environmental information and the associations between the various things.
The image acquisition device, the control system and the voice broadcasting system can be integrated into a whole.
In the invention, the deep neural network uses the deep convolutional neural network VGG16. The labeled Microsoft COCO data set is used to train VGG16, the Dropout algorithm is used to optimize it, and dilated convolutions are then used to replace the pooling layers. A long short-term memory network (LSTM) translates the feature map output by VGG16; the LSTM unit parameters are updated with the BPTT algorithm, and the network finally outputs the text with which the pictures in the data set are labeled, yielding a pre-trained deep neural network capable of translating picture information into text.
The VGG convolutional neural network is a model proposed by Oxford University in 2014; its simplicity and practicality have produced very good results in image classification and object detection tasks. VGG16 is the 16-layer variant, which replaces the 7 × 7 convolution kernels of conventional convolutional neural networks with stacks of three 3 × 3 convolutional layers; with the receptive field unchanged, this gives better feature extraction with fewer parameters. Because of its depth, its nonlinear modeling capacity is also very good. The whole network model consists of five convolutional blocks of 3 × 3 convolutions, each followed by 2 × 2 max pooling, with three fully connected layers added at the end for further processing of the features. The Dropout algorithm randomly deactivates the activation value of a neuron with a certain probability during the forward pass of the neural network, which effectively prevents overfitting of the network and reduces training time. In convolutional neural networks, pooling layers are often used to extract picture features. However, as the network deepens and the pooling layers keep acting, the picture size becomes smaller and smaller, and much detail information is inevitably lost. Therefore, dilated convolutions are adopted to replace the pooling layers, widening the receptive field while keeping the number of parameters unchanged and preserving the detailed information of the picture.
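The following PyTorch-style sketch illustrates this replacement: a VGG-style 3 × 3 convolution block in which the 2 × 2 max-pooling layer is swapped for a dilated convolution, preserving the feature-map size while enlarging the receptive field. The channel sizes and dilation rate are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

# VGG-style block with conventional 2x2 max pooling: spatial size is halved.
block_with_pooling = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# Same block with the pooling replaced by a dilated 3x3 convolution:
# the receptive field grows but the spatial size is preserved.
block_with_dilation = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 64, 56, 56)
print(block_with_pooling(x).shape)   # torch.Size([1, 128, 28, 28])
print(block_with_dilation(x).shape)  # torch.Size([1, 128, 56, 56])
```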
After being processed by the VGG16 network, the pictures are sent to the LSTM. The LSTM is a special recurrent neural network (RNN) proposed to solve problems such as vanishing gradients in traditional RNNs. In a conventional RNN, the gradient at later nodes gradually decreases during backpropagation as time increases, making it difficult to form an effective update for earlier nodes. LSTM cells are therefore designed with a memory cell and a number of gates added at each time step to control how information is stored and when and how the state of the memory cell is updated. The CNN extracts image features, and the LSTM memorizes word sequences in sentences; through the CNN + LSTM structure, images can be described and translated quickly and accurately. The optimized neural network converges faster and is more stable, providing a reliable core software framework for the blind auxiliary vision system.
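A compact PyTorch-style sketch of this CNN + LSTM idea is given below. All dimensions, the vocabulary size, and the use of torchvision's VGG16 are assumptions for illustration; the pre-trained network of the invention itself is not reproduced here:

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionNet(nn.Module):
    """Encoder-decoder sketch: VGG16 extracts image features, an LSTM generates words."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        vgg = models.vgg16(weights=None)            # pre-trained weights would be loaded in practice
        self.encoder = vgg.features                 # convolutional feature extractor
        self.project = nn.Linear(512 * 7 * 7, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images)                          # (B, 512, 7, 7) for 224x224 input
        feats = self.project(feats.flatten(1)).unsqueeze(1)   # (B, 1, embed_dim)
        words = self.embed(captions)                          # (B, T, embed_dim)
        inputs = torch.cat([feats, words], dim=1)             # image feature + word embeddings
        hidden, _ = self.lstm(inputs)                         # (B, T+1, hidden_dim)
        return self.classifier(hidden)                        # per-step word scores

model = CaptionNet()
scores = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(scores.shape)   # torch.Size([2, 13, 10000])
```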
The Microsoft COCO data set consists of 123287 images, of which 80% are used for training and the remaining 20% serve as the test set. Each picture includes at least 5 manually annotated reference description sentences describing the things appearing in the picture, the associations between them, and other information. Using this data set helps the neural network learn picture features and acquire the information in a picture quickly, accurately, and in real time. When a new picture is input, the things appearing in it can be distinguished well and translated into text. Because the training set is huge, the output of the neural network is very reliable.
In the training stage, the invention uses the deep convolutional neural network VGG16 to process the portion of the Microsoft COCO data set used as the training set, converting each image into a feature vector of fixed length; the Dropout algorithm is used to optimize the convolutional network and accelerate the convergence of the deep convolutional neural network. The convolution calculation process is as follows:
φ_{j,k} = σ( Σ_{l=0}^{y-1} Σ_{m=0}^{y-1} W_{l,m} · I_{j+l,k+m} + bias )
wherein I_{j,k}, j∈[0,x), k∈[0,x) represents the input image, W_{l,m}, l∈[0,y), m∈[0,y) represents the weights of the convolution, x is the size of the input layer, y is the size of the convolution kernel, j and k are the position coordinates of a pixel on the image, l and m are the positions of the weights within the convolution kernel, σ is the rectified linear unit (ReLU) activation function, φ is the output value of one convolution calculation, and bias is the bias;
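A minimal NumPy rendering of this calculation is shown below; the input and kernel sizes are arbitrary illustrative values:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)          # sigma: rectified linear unit

def conv2d_valid(I, W, bias=0.0):
    """phi_{j,k} = relu( sum_{l,m} W[l,m] * I[j+l, k+m] + bias )"""
    x, y = I.shape[0], W.shape[0]       # input size x, kernel size y
    out = np.zeros((x - y + 1, x - y + 1))
    for j in range(x - y + 1):
        for k in range(x - y + 1):
            out[j, k] = relu(np.sum(W * I[j:j + y, k:k + y]) + bias)
    return out

I = np.random.rand(6, 6)                # input image
W = np.random.rand(3, 3)                # convolution kernel
print(conv2d_valid(I, W).shape)         # (4, 4)
```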
the method for replacing the pooling layer by the cavity convolution is to fill spaces between convolution kernel elements during convolution, and the specific formula is as follows:
n=y+(y-1)*(d-1)
wherein d is a hyper-parameter (the dilation rate), (d-1) is the number of zeros inserted between kernel elements, and n is the size of the convolution kernel after dilation;
o = ⌊(i + 2p − n) / s⌋ + 1
wherein i is the size of the input to the dilated convolution, s is the stride, p is the number of padded pixels, n is the dilated kernel size obtained above, and o is the size of the feature map after the dilated convolution;
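Plugging the two formulas into a short helper illustrates the sizes; the example numbers are arbitrary:

```python
def dilated_output_size(i, y, d, s=1, p=0):
    """Size of the feature map after a dilated convolution."""
    n = y + (y - 1) * (d - 1)           # dilated kernel size
    o = (i + 2 * p - n) // s + 1        # output feature-map size
    return n, o

# 3x3 kernel with dilation d=2 on a 56x56 input, stride 1, padding 2:
print(dilated_output_size(i=56, y=3, d=2, s=1, p=2))   # (5, 56) -> size preserved
```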
The feature map output by the deep convolutional neural network is concatenated with the word embedding vector to form a multimodal feature, which is fed into the long short-term memory network LSTM for translation. The LSTM unit is a special recurrent neural network that solves the problem that, when the time sequence is too long, later nodes in a traditional recurrent neural network have difficulty obtaining effective information from earlier nodes. The specific functions of the LSTM are as follows:
wherein f istTo forget the gate, the degree to which the state of the last cell was forgotten is controlled, σ is the ReLU activation function, itIn order to input the information into the gate,a new candidate vector generated for tan h, and itTogether with controlling how much new information is added, ctFor a new state of the memory cell, otIs an output gate for controlling how much of the current cell state is filtered, htAs an output of the unit, Wf,Wh,Wi,Wc,WoAre the weight of each door, bf,bi,bc,boBiasing each gate;
The LSTM unit parameters are updated using the BPTT algorithm; the specific formulas are as follows:
ΔW_kj = η Σ_p δ_pk · y_pj(t)
δ_pj = g′(y_pj(t)) Σ_k δ_pk · W_kj
ΔV_ji = η Σ_p δ_pj · x_pi(t)
ΔU_jm = η Σ_p δ_pj · y_pm(t−1)
wherein p denotes the p-th sample, x_i(t) denotes the sample features input to the network at time t, y_j(t) is the output of the hidden layer at time t, y_k(t) denotes the network output, and z_k is the target output. δ_pk = (z_pk − y_pk) · g′(y_pk) is the output residual of the p-th sample, and g′(y_pk) is the derivative of the output network function for the p-th sample. δ_pj is the hidden-layer residual, expressed as the weighted sum of the output residuals propagated from the output layer into the hidden layer with the weights of that layer. ΔW_kj is the update of the weight between the output layer and the hidden layer, ΔV_ji is the update of the weight between the input layer and the hidden layer, ΔU_jm is the update of the recurrent weight between hidden layers, and η is a learning-rate constant;
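In practice such BPTT updates are usually obtained by automatic differentiation; the following hedged PyTorch sketch shows one update step on a toy sequence (all sizes and the optimizer choice are assumptions, not values from the patent):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 5)
optimizer = torch.optim.SGD(list(lstm.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(4, 7, 10)                      # batch of 4 sequences, 7 time steps
target = torch.randint(0, 5, (4, 7))           # target word index at every step

hidden, _ = lstm(x)                            # forward through all time steps
logits = head(hidden)                          # (4, 7, 5)
loss = nn.functional.cross_entropy(logits.reshape(-1, 5), target.reshape(-1))

optimizer.zero_grad()
loss.backward()                                # backpropagation through time
optimizer.step()                               # update the LSTM unit parameters
```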
Through the transformation of the Softmax function at the top of the LSTM, a probability vector matrix of the word sequence is generated and converted into the corresponding word sequence. The formula of the Softmax function is:
P(w_j) = exp(s_j) / Σ_{w∈V} exp(s_w)
wherein w_j represents a certain word in the word list, V represents the word list, and s_j is the score the LSTM assigns to that word; the formula means that the Softmax value of a word equals the ratio of the exponential of its score to the sum of the exponentials of the scores of all words. Through this formula, the probability vector of the output j-th word over all words in the word list is obtained;
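A small numeric illustration of the Softmax transformation; the word list and scores are made up for the example:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

vocab = ["dog", "cat", "street", "car"]
scores = np.array([2.0, 0.5, 1.0, -1.0])  # LSTM scores for the j-th output position
probs = softmax(scores)
print(dict(zip(vocab, probs.round(3))))   # probability of each word in the word list
```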
the distance between the generated word sequence matrix and the word sequence matrix in the reference sentence is calculated by using a distance function
Wherein, wkK is the total number of word matrices for the weight used in fusing the kth stage.
In the testing stage, the invention uses the remaining data in the data set as the test set to evaluate the trained network. The LSTM generates a probability matrix for the word sequence; the word corresponding to the maximum value in each probability vector of the matrix is the predicted word, and the words are combined in order to generate the description sentence.
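The decoding step described here amounts to an argmax over each probability vector; a minimal sketch with a toy probability matrix and word list (both made up for illustration):

```python
import numpy as np

vocab = ["a", "dog", "runs", "on", "grass", "<end>"]
# Toy probability matrix: one probability vector per generated position.
prob_matrix = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.03, 0.02],
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.10, 0.60, 0.10, 0.10, 0.05],
    [0.05, 0.05, 0.10, 0.60, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.10, 0.60, 0.10],
])

predicted = [vocab[int(np.argmax(p))] for p in prob_matrix]   # word with maximum probability
sentence = " ".join(predicted)
print(sentence)   # "a dog runs on grass"
```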
Finally, iFLYTEK speech synthesis software is adopted to convert the text information translated by the neural network from the images into voice.
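The patent names iFLYTEK's synthesis software; as a generic, hedged stand-in, the offline pyttsx3 library can play a caption through a loudspeaker or earphone:

```python
import pyttsx3   # stand-in text-to-speech engine, not the iFLYTEK software named in the patent

engine = pyttsx3.init()
caption = "A dog runs on the grass in front of you."
engine.say(caption)      # queue the translated text
engine.runAndWait()      # speak it through the loudspeaker or earphone
```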
In the invention, when the blind person enters an unfamiliar scene and needs to acquire scene information, he or she can press the switch to control the camera to take photos. The photos are sent in sequence to the deep neural network in the main control chip. The trained neural network extracts and translates the features of the pictures, outputs text describing the scenes contained in them, and finally sends the text to the voice broadcasting system, where speech synthesis software converts it into voice for the person with limited vision or visual impairment, so that the blind person obtains environmental information in real time. The device has the advantages of portability, light weight, low cost, high precision, strong reliability, and the like.
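A high-level sketch of this workflow; every class and function here is hypothetical glue code standing in for the real subsystems, not part of the patent:

```python
class Camera:
    def capture(self):
        return "photo bytes"               # stand-in for a real camera frame

class CaptionModel:
    def describe(self, photo):
        return "A person walks toward you on the left."   # stand-in for the CNN/LSTM output

class Speaker:
    def speak(self, text):
        print("[voice]", text)             # stand-in for speech synthesis playback

def on_switch_pressed(camera, model, speaker):
    """Workflow triggered each time the user presses the switch."""
    photo = camera.capture()               # image acquisition device takes a photo
    caption = model.describe(photo)        # control system translates the scene to text
    speaker.speak(caption)                 # voice broadcasting system reads it out

on_switch_pressed(Camera(), CaptionModel(), Speaker())
```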
In this way, under the blind person's active selection, voice broadcasting helps him or her quickly learn about the various obstacles and objects around his or her position, enriching the blind person's perception and allowing the blind to sense the world as normal people do.