Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a CNN/LSTM-based blind auxiliary vision system that uses artificial intelligence to overcome the inconvenience of the traditional guide stick, guide dog, and guide glasses and to enrich the perception of the blind: the guide not only prompts the distance of obstacles but also conveys the sensed world, taking care of the psychological feelings of the blind. The system has the advantages of a light and handy structure, convenience, strong real-time performance, high precision, low cost, easy carrying, no need for networking, and the like.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a CNN/LSTM-based blind-aided vision system, comprising:
the image acquisition device acquires images around the user in real time;
the control system, which runs a deep neural network pre-trained on a large number of labeled pictures and translates the information contained in a scene in real time;
and the voice broadcasting system broadcasts the information in a voice form.
The image acquisition device is a portable camera. The image acquisition device, the control system, and the voice broadcasting system are integrated into a single unit, and the control system adopts an embedded chip.
The translated scene information consists of the objects in the current scene and the relationships among them, and is output in the form of text.
The deep neural network uses a deep convolutional neural network. The network is trained on a labeled data set, optimized with the Dropout algorithm, and its pooling layers are replaced with dilated convolutions. A long short-term memory network (LSTM) translates the feature map output by the deep convolutional neural network; the LSTM unit parameters are updated with the BPTT algorithm, and the network finally outputs the text with which the pictures in the data set are labeled, yielding a pre-trained deep neural network capable of translating picture information into text.
The deep convolutional neural network is VGG16, and the data set is a Microsoft COCO data set.
In the training stage, the deep convolutional neural network processes the portion of the data set used as the training set and converts each image into a feature vector of fixed length; the Dropout algorithm is used to optimize the convolutional network and accelerate the convergence of the deep convolutional neural network, and dilated convolutions replace the pooling layers by inserting zeros between the convolution kernel elements during convolution. The feature map output by the deep convolutional neural network is concatenated with the word embedding vector to form a multimodal feature, which is fed into the long short-term memory network LSTM for translation.
In the testing stage, the invention uses the remaining data in the data set as the test set to evaluate the trained network. The LSTM generates a probability matrix for the word sequence; the word corresponding to the maximum value in each probability vector of the matrix is the predicted word, and the words are combined in order to generate the description sentence.
A word is represented as a one-hot vector, i.e. one element has the value 1 and the remaining elements are 0. Each word in the dictionary is assigned a number, and the length of the vector is equal to the length of the dictionary. Because the pictures used for training carry labels, the words in a label are converted into one-hot vectors, the vectors are spliced into a one-dimensional long vector, and this long vector is combined with the feature map of the corresponding picture to form a multimodal feature.
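As an illustration of this encoding, the following minimal NumPy sketch builds one-hot vectors for the words of a caption and concatenates them with an image feature; the toy dictionary and the 4096-dimensional feature vector are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

# Toy dictionary: every word is assigned a number (index).
dictionary = {"a": 0, "dog": 1, "runs": 2, "on": 3, "grass": 4}

def one_hot(word, dictionary):
    """One-hot vector: the element at the word's index is 1, all others are 0."""
    vec = np.zeros(len(dictionary))
    vec[dictionary[word]] = 1.0
    return vec

caption = ["a", "dog", "runs", "on", "grass"]
word_vectors = [one_hot(w, dictionary) for w in caption]

# Splice the one-hot vectors into a single one-dimensional long vector.
long_vector = np.concatenate(word_vectors)

# Hypothetical image feature from the CNN (e.g. a 4096-d VGG16 vector).
image_feature = np.random.rand(4096)

# Multimodal feature: image feature combined with the caption encoding.
multimodal_feature = np.concatenate([image_feature, long_vector])
print(multimodal_feature.shape)   # (4121,)
```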
The voice broadcasting system uses text-to-speech software to convert the text information translated by the deep neural network into voice, which is played through a loudspeaker or an earphone.
Compared with the prior art, the invention translates the information contained in a scene in real time using a deep neural network pre-trained on a large number of labeled images. The deep neural network outputs text describing the things in the current scene (including people, animals, and other objects) and the relationships among them, and the voice broadcasting system then converts this text into voice information the blind person can understand. The method has the advantages of reliability, strong real-time performance, small size, low cost, high precision, and the like, and can help the blind quickly acquire the environmental information of their current position in voice form.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
The invention relates to a blind auxiliary vision system based on CNN/LSTM, comprising:
the image acquisition device acquires images around a user in real time and can be realized by adopting a small portable camera;
the control system can adopt an embedded chip; it carries a deep neural network pre-trained on a large number of labeled pictures, translates in real time the things contained in the scene (people, animals, other common objects, and the like) and the relationships among them, and outputs this information in the form of text;
the voice broadcasting system uses speech synthesis software (for example, the iFLYTEK speech synthesis software can be selected) to broadcast the text information in a voice form the blind person can understand; through a loudspeaker or an earphone, the blind person obtains the information contained in the environment in real time, quickly learning about surrounding people, vehicles, obstacles, and other environmental information and the associations between the various things.
The image acquisition device, the control system and the voice broadcasting system can be integrated into a whole.
In the invention, the deep neural network uses the deep convolutional neural network VGG16. The labeled Microsoft COCO data set is used to train VGG16, the Dropout algorithm is used to optimize it, and dilated convolutions are then used to replace the pooling layers. A long short-term memory network (LSTM) translates the feature map output by VGG16; the LSTM unit parameters are updated with the BPTT algorithm, and the network finally outputs the text with which the pictures in the data set are labeled, yielding a pre-trained deep neural network capable of translating picture information into text.
The VGG convolutional neural network is a model proposed by Oxford University in 2014; its simplicity and practicality have produced very good results in image classification and object detection tasks. VGG16 is the 16-layer variant, which replaces the 7 × 7 convolution kernels of conventional convolutional neural networks with stacks of three 3 × 3 convolutional layers; with the receptive field unchanged, this gives better feature extraction with fewer parameters. Because of its depth, its nonlinear modeling capacity is also very good. The whole network model consists of five convolutional blocks of 3 × 3 convolutions, each followed by 2 × 2 max pooling, with three fully connected layers added at the end for further processing of the features. The Dropout algorithm randomly deactivates the activation value of a neuron with a certain probability during the forward pass of the neural network, which effectively prevents overfitting of the network and reduces training time. In convolutional neural networks, pooling layers are often used to extract picture features. However, as the network deepens and the pooling layers keep acting, the picture size becomes smaller and smaller, and much detail information is inevitably lost. Therefore, dilated convolutions are adopted to replace the pooling layers, widening the receptive field while keeping the number of parameters unchanged and preserving the detailed information of the picture.
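The following PyTorch-style sketch illustrates this replacement: a VGG-style 3 × 3 convolution block in which the 2 × 2 max-pooling layer is swapped for a dilated convolution, preserving the feature-map size while enlarging the receptive field. The channel sizes and dilation rate are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

# VGG-style block with conventional 2x2 max pooling: spatial size is halved.
block_with_pooling = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# Same block with the pooling replaced by a dilated 3x3 convolution:
# the receptive field grows but the spatial size is preserved.
block_with_dilation = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 64, 56, 56)
print(block_with_pooling(x).shape)   # torch.Size([1, 128, 28, 28])
print(block_with_dilation(x).shape)  # torch.Size([1, 128, 56, 56])
```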
After being processed by the VGG16 network, the pictures are sent to the LSTM. The LSTM is a special recurrent neural network (RNN) proposed to solve problems such as vanishing gradients in traditional RNNs. In a conventional RNN, the gradient at later nodes gradually decreases during backpropagation as time increases, making it difficult to form an effective update for earlier nodes. LSTM cells are therefore designed with a memory cell and a number of gates added at each time step to control how information is stored and when and how the state of the memory cell is updated. The CNN extracts image features, and the LSTM memorizes word sequences in sentences; through the CNN + LSTM structure, images can be described and translated quickly and accurately. The optimized neural network converges faster and is more stable, providing a reliable core software framework for the blind auxiliary vision system.
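A compact PyTorch-style sketch of this CNN + LSTM idea is given below. All dimensions, the vocabulary size, and the use of torchvision's VGG16 are assumptions for illustration; the pre-trained network of the invention itself is not reproduced here:

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionNet(nn.Module):
    """Encoder-decoder sketch: VGG16 extracts image features, an LSTM generates words."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        vgg = models.vgg16(weights=None)            # pre-trained weights would be loaded in practice
        self.encoder = vgg.features                 # convolutional feature extractor
        self.project = nn.Linear(512 * 7 * 7, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images)                          # (B, 512, 7, 7) for 224x224 input
        feats = self.project(feats.flatten(1)).unsqueeze(1)   # (B, 1, embed_dim)
        words = self.embed(captions)                          # (B, T, embed_dim)
        inputs = torch.cat([feats, words], dim=1)             # image feature + word embeddings
        hidden, _ = self.lstm(inputs)                         # (B, T+1, hidden_dim)
        return self.classifier(hidden)                        # per-step word scores

model = CaptionNet()
scores = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(scores.shape)   # torch.Size([2, 13, 10000])
```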
The Microsoft COCO data set consists of 123287 images, of which 80% are used for training and the remaining 20% serve as the test set. Each picture includes at least 5 manually annotated reference description sentences describing the things appearing in the picture, the associations between them, and other information. Using this data set helps the neural network learn picture features and acquire the information in a picture quickly, accurately, and in real time. When a new picture is input, the things appearing in it can be distinguished well and translated into text. Because the training set is huge, the output of the neural network is very reliable.
In the training stage, the invention uses the deep convolutional neural network VGG16 to process the portion of the Microsoft COCO data set used as the training set, converting each image into a feature vector of fixed length; the Dropout algorithm is used to optimize the convolutional network and accelerate the convergence of the deep convolutional neural network. The convolution calculation process is as follows:
φ_{j,k} = σ( Σ_{l=0}^{y-1} Σ_{m=0}^{y-1} W_{l,m} · I_{j+l,k+m} + bias )
wherein I_{j,k}, j∈[0,x), k∈[0,x) represents the input image, W_{l,m}, l∈[0,y), m∈[0,y) represents the weights of the convolution, x is the size of the input layer, y is the size of the convolution kernel, j and k are the position coordinates of a pixel on the image, l and m are the positions of the weights within the convolution kernel, σ is the rectified linear unit (ReLU) activation function, φ is the output value of one convolution calculation, and bias is the bias;
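A minimal NumPy rendering of this calculation is shown below; the input and kernel sizes are arbitrary illustrative values:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)          # sigma: rectified linear unit

def conv2d_valid(I, W, bias=0.0):
    """phi_{j,k} = relu( sum_{l,m} W[l,m] * I[j+l, k+m] + bias )"""
    x, y = I.shape[0], W.shape[0]       # input size x, kernel size y
    out = np.zeros((x - y + 1, x - y + 1))
    for j in range(x - y + 1):
        for k in range(x - y + 1):
            out[j, k] = relu(np.sum(W * I[j:j + y, k:k + y]) + bias)
    return out

I = np.random.rand(6, 6)                # input image
W = np.random.rand(3, 3)                # convolution kernel
print(conv2d_valid(I, W).shape)         # (4, 4)
```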
the method for replacing the pooling layer by the cavity convolution is to fill spaces between convolution kernel elements during convolution, and the specific formula is as follows:
n=y+(y-1)*(d-1)
wherein d is a hyper-parameter (the dilation rate), (d-1) is the number of zeros inserted between kernel elements, and n is the size of the convolution kernel after dilation;
o = ⌊(i + 2p − n) / s⌋ + 1
wherein i is the size of the input to the dilated convolution, s is the stride, p is the number of padded pixels, n is the dilated kernel size obtained above, and o is the size of the feature map after the dilated convolution;
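Plugging the two formulas into a short helper illustrates the sizes; the example numbers are arbitrary:

```python
def dilated_output_size(i, y, d, s=1, p=0):
    """Size of the feature map after a dilated convolution."""
    n = y + (y - 1) * (d - 1)           # dilated kernel size
    o = (i + 2 * p - n) // s + 1        # output feature-map size
    return n, o

# 3x3 kernel with dilation d=2 on a 56x56 input, stride 1, padding 2:
print(dilated_output_size(i=56, y=3, d=2, s=1, p=2))   # (5, 56) -> size preserved
```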
The feature map output by the deep convolutional neural network is concatenated with the word embedding vector to form a multimodal feature, which is fed into the long short-term memory network LSTM for translation. The LSTM unit is a special recurrent neural network that solves the problem that, when the time sequence is too long, later nodes in a traditional recurrent neural network have difficulty obtaining effective information from earlier nodes. The specific functions of the LSTM are as follows:
wherein f istTo forget the gate, the degree to which the state of the last cell was forgotten is controlled, σ is the ReLU activation function, itIn order to input the information into the gate,a new candidate vector generated for tan h, and itTogether with controlling how much new information is added, ctFor a new state of the memory cell, otIs an output gate for controlling how much of the current cell state is filtered, htAs an output of the unit, Wf,Wh,Wi,Wc,WoAre the weight of each door, bf,bi,bc,boBiasing each gate;
The LSTM unit parameters are updated using the BPTT algorithm; the specific formulas are as follows:
ΔW_kj = η Σ_p δ_pk · y_pj(t)
δ_pj = g′(y_pj(t)) Σ_k δ_pk · W_kj
ΔV_ji = η Σ_p δ_pj · x_pi(t)
ΔU_jm = η Σ_p δ_pj · y_pm(t−1)
wherein p denotes the p-th sample, x_i(t) denotes the sample features input to the network at time t, y_j(t) is the output of the hidden layer at time t, y_k(t) denotes the network output, and z_k is the target output. δ_pk = (z_pk − y_pk) · g′(y_pk) is the output residual of the p-th sample, and g′(y_pk) is the derivative of the output network function for the p-th sample. δ_pj is the hidden-layer residual, expressed as the weighted sum of the output residuals propagated from the output layer into the hidden layer with the weights of that layer. ΔW_kj is the update of the weight between the output layer and the hidden layer, ΔV_ji is the update of the weight between the input layer and the hidden layer, ΔU_jm is the update of the recurrent weight between hidden layers, and η is a learning-rate constant;
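In practice such BPTT updates are usually obtained by automatic differentiation; the following hedged PyTorch sketch shows one update step on a toy sequence (all sizes and the optimizer choice are assumptions, not values from the patent):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 5)
optimizer = torch.optim.SGD(list(lstm.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(4, 7, 10)                      # batch of 4 sequences, 7 time steps
target = torch.randint(0, 5, (4, 7))           # target word index at every step

hidden, _ = lstm(x)                            # forward through all time steps
logits = head(hidden)                          # (4, 7, 5)
loss = nn.functional.cross_entropy(logits.reshape(-1, 5), target.reshape(-1))

optimizer.zero_grad()
loss.backward()                                # backpropagation through time
optimizer.step()                               # update the LSTM unit parameters
```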
Through the transformation of the Softmax function at the top of the LSTM, a probability vector matrix of the word sequence is generated and converted into the corresponding word sequence. The formula of the Softmax function is:
P(w_j) = exp(s_j) / Σ_{w∈V} exp(s_w)
wherein w_j represents a certain word in the word list, V represents the word list, and s_j is the score the LSTM assigns to that word; the formula means that the Softmax value of a word equals the ratio of the exponential of its score to the sum of the exponentials of the scores of all words. Through this formula, the probability vector of the output j-th word over all words in the word list is obtained;
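A small numeric illustration of the Softmax transformation; the word list and scores are made up for the example:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

vocab = ["dog", "cat", "street", "car"]
scores = np.array([2.0, 0.5, 1.0, -1.0])  # LSTM scores for the j-th output position
probs = softmax(scores)
print(dict(zip(vocab, probs.round(3))))   # probability of each word in the word list
```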
the distance between the generated word sequence matrix and the word sequence matrix in the reference sentence is calculated by using a distance function
Wherein, wkK is the total number of word matrices for the weight used in fusing the kth stage.
In the testing stage, the invention uses the remaining data in the data set as the test set to evaluate the trained network. The LSTM generates a probability matrix for the word sequence; the word corresponding to the maximum value in each probability vector of the matrix is the predicted word, and the words are combined in order to generate the description sentence.
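The decoding step described here amounts to an argmax over each probability vector; a minimal sketch with a toy probability matrix and word list (both made up for illustration):

```python
import numpy as np

vocab = ["a", "dog", "runs", "on", "grass", "<end>"]
# Toy probability matrix: one probability vector per generated position.
prob_matrix = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.03, 0.02],
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.10, 0.60, 0.10, 0.10, 0.05],
    [0.05, 0.05, 0.10, 0.60, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.10, 0.60, 0.10],
])

predicted = [vocab[int(np.argmax(p))] for p in prob_matrix]   # word with maximum probability
sentence = " ".join(predicted)
print(sentence)   # "a dog runs on grass"
```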
Finally, iFLYTEK speech synthesis software is adopted to convert the text information translated by the neural network from the images into voice.
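The patent names iFLYTEK's synthesis software; as a generic, hedged stand-in, the offline pyttsx3 library can play a caption through a loudspeaker or earphone:

```python
import pyttsx3   # stand-in text-to-speech engine, not the iFLYTEK software named in the patent

engine = pyttsx3.init()
caption = "A dog runs on the grass in front of you."
engine.say(caption)      # queue the translated text
engine.runAndWait()      # speak it through the loudspeaker or earphone
```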
In the invention, when the blind person enters an unfamiliar scene and needs to acquire scene information, he or she can press the switch to control the camera to take photos. The photos are sent in sequence to the deep neural network in the main control chip. The trained neural network extracts and translates the features of the pictures, outputs text describing the scenes contained in them, and finally sends the text to the voice broadcasting system, where speech synthesis software converts it into voice for the person with limited vision or visual impairment, so that the blind person obtains environmental information in real time. The device has the advantages of portability, light weight, low cost, high precision, strong reliability, and the like.
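A high-level sketch of this workflow; every class and function here is hypothetical glue code standing in for the real subsystems, not part of the patent:

```python
class Camera:
    def capture(self):
        return "photo bytes"               # stand-in for a real camera frame

class CaptionModel:
    def describe(self, photo):
        return "A person walks toward you on the left."   # stand-in for the CNN/LSTM output

class Speaker:
    def speak(self, text):
        print("[voice]", text)             # stand-in for speech synthesis playback

def on_switch_pressed(camera, model, speaker):
    """Workflow triggered each time the user presses the switch."""
    photo = camera.capture()               # image acquisition device takes a photo
    caption = model.describe(photo)        # control system translates the scene to text
    speaker.speak(caption)                 # voice broadcasting system reads it out

on_switch_pressed(Camera(), CaptionModel(), Speaker())
```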
In this way, under the blind person's active selection, voice broadcasting helps him or her quickly learn about the various obstacles and objects around his or her position, enriching the blind person's perception and allowing the blind to sense the world as normal people do.