
CN109753900A - A kind of blind person's auxiliary vision system based on CNN/LSTM - Google Patents

A kind of blind person's auxiliary vision system based on CNN/LSTM Download PDF

Info

Publication number
CN109753900A
CN109753900A (application CN201811573815.3A)
Authority
CN
China
Prior art keywords
lstm
neural network
output
word
cnn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811573815.3A
Other languages
Chinese (zh)
Other versions
CN109753900B (en)
Inventor
潘红光
雷心宇
黄向东
温帆
张奇
米文毓
苏涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Science and Technology
Original Assignee
Xian University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Science and Technology filed Critical Xian University of Science and Technology
Priority to CN201811573815.3A priority Critical patent/CN109753900B/en
Publication of CN109753900A publication Critical patent/CN109753900A/en
Application granted granted Critical
Publication of CN109753900B publication Critical patent/CN109753900B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

A blind-aided vision system based on CNN/LSTM, comprising: an image acquisition device that captures images of the user's surroundings in real time; a control system carrying a deep neural network pre-trained on a large number of labeled images, which translates the information contained in the current scene in real time; and a voice broadcast system that announces this information as speech. The invention overcomes the inconvenience of traditional white canes, guide dogs, and guide glasses, and also enriches the senses of the blind user: guidance is no longer limited to reporting the distance of obstacles but provides a perceptible picture of the world, attending to the user's psychological experience. The system is lightweight, convenient, highly real-time, accurate, low-cost, portable, and requires no network connection.

Description

Blind person auxiliary vision system based on CNN/LSTM
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to an auxiliary vision system, and particularly relates to a blind auxiliary vision system based on CNN/LSTM.
Background
In modern society, people who are blind, or temporarily blinded by illness, still make up a significant portion of the population. Most blind people rely on traditional aids such as white canes and guide dogs to sense surrounding information, including obstacles. These traditional approaches are often inefficient. A cane is inconvenient to use: it depends heavily on the user's own sense of touch, its reach is limited, it can only detect obstacles in a small area directly ahead, and it cannot convey the complete external environment. A guide dog can help a blind person avoid obstacles quickly and prevent certain emergencies, but a guide dog must be a specially selected breed and undergo professional training before it can work; this training is tedious, time-consuming, and too expensive for most individuals to bear. When a blind person walks with a guide dog, disturbance to other pedestrians is also unavoidable; furthermore, canine instincts are difficult to eliminate through training, and the dog must be fed, trained, and controlled during daily activities. Guide dogs therefore cannot fully meet the daily guidance needs of blind people.
A further blind-guiding tool on the market is a pair of guide glasses based on integrated-circuit ultrasonics. It consists of an electronics box and a pair of glasses: the glasses carry two ultrasonic transducers and an earpiece, the transducers emit ultrasonic pulses forward and receive their reflections, and the wearer perceives obstacles ahead through changes in the sound from the earpiece. These guide glasses are compact, responsive, and directionally accurate. However, they are expensive, they sense only the distance to an obstacle ahead, and they cannot identify the obstacle's position or nature. Moreover, because the warning is delivered as varying sounds in the earpiece, the user must listen to the earpiece constantly while walking, and the resulting loss of attention can itself create danger. The glasses also require a distance range to be selected before use, and different settings detect different distances, yet a blind user sometimes cannot accurately judge which setting suits the current situation. Ultrasonic guide glasses therefore also have many inconveniences.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a CNN/LSTM-based blind-aided vision system that uses artificial intelligence to remove the inconveniences of the traditional cane, guide dog, and guide glasses, and that enriches the senses of the blind user: guidance is no longer limited to reporting the distance of obstacles but provides a perceptible picture of the world, attending to the user's psychological experience. The system is lightweight, convenient, highly real-time, accurate, low-cost, portable, and requires no network connection.
In order to achieve the purpose, the invention adopts the technical scheme that:
a CNN/LSTM-based blind-aided vision system, comprising:
the image acquisition device acquires images around the user in real time;
the control system carries a deep neural network pre-trained on a large number of labeled images and translates the information contained in the scene in real time;
and the voice broadcast system announces the information in voice form.
The image acquisition device is a portable camera; the image acquisition device, the control system, and the voice broadcast system are integrated into a single unit, and the control system uses an embedded chip.
The information translated from the scene consists of the objects present in the current scene and the relationships among them, and is output as text.
The deep neural network uses a deep convolutional neural network. The network is trained on a labeled data set, optimized with the Dropout algorithm, and its pooling layers are replaced with dilated (hole) convolutions. The feature map output by the convolutional network is translated by a long short-term memory network (LSTM), whose unit parameters are updated with the BPTT algorithm; the network finally outputs the text with which the images in the data set are labeled, yielding a pre-trained deep neural network that can translate image information into text.
The deep convolutional neural network is VGG16, and the data set is a Microsoft COCO data set.
In the training stage, the deep convolutional neural network processes the portion of the data set used as the training set and converts each image into a fixed-length feature vector. The Dropout algorithm optimizes the convolutional network and speeds up its convergence, and dilated convolution replaces the pooling layers by inserting gaps between the convolution kernel elements during convolution. The feature map output by the convolutional network is spliced with the word embedding vector to form a multimodal feature, which is fed into the long short-term memory network LSTM for translation.
In the testing stage, the invention uses the remaining data in the data set as the test set to evaluate the trained network. The LSTM generates a probability matrix over word sequences; in each probability vector of the matrix, the word with the largest probability is the predicted word, and the predicted words are combined in order to form the descriptive sentence.
A word is represented as a one-hot vector: exactly one element is 1 and all remaining elements are 0. Each word in the dictionary is assigned a number, and the length of the vector equals the size of the dictionary. Because the training images carry labels, the words in each label are converted into one-hot vectors, the vectors are spliced into a single long one-dimensional vector, and this vector is combined with the feature map of the corresponding image to form a multimodal feature.
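For illustration, a minimal NumPy sketch of this encoding and splicing step might look as follows (the toy dictionary, the assumed 4096-dimensional image feature, and helper names such as `to_one_hot` and `build_multimodal_feature` are introduced only for the example and are not part of the patent):

```python
import numpy as np

# Toy dictionary: every word gets a number; vector length equals dictionary size.
dictionary = {"a": 0, "dog": 1, "on": 2, "the": 3, "street": 4}

def to_one_hot(word, dictionary):
    """One-hot vector: the element at the word's index is 1, the rest are 0."""
    vec = np.zeros(len(dictionary))
    vec[dictionary[word]] = 1.0
    return vec

def build_multimodal_feature(caption_words, image_feature, dictionary):
    """Splice the one-hot vectors of a caption into one long vector and
    concatenate it with the (flattened) image feature, as described above."""
    word_vecs = [to_one_hot(w, dictionary) for w in caption_words]
    long_vec = np.concatenate(word_vecs)          # one-dimensional long vector
    return np.concatenate([image_feature, long_vec])

image_feature = np.random.randn(4096)             # e.g. a VGG16 feature (size assumed)
feature = build_multimodal_feature(["a", "dog", "on", "the", "street"],
                                   image_feature, dictionary)
print(feature.shape)                               # (4096 + 5 * 5,) = (4121,)
```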
The voice broadcast system uses text-to-speech software to convert the text translated by the deep neural network into speech, which is played through a loudspeaker or earphone.
Compared with the prior art, the invention translates the information contained in a scene in real time using a deep neural network pre-trained on a large number of labeled images. The network outputs text describing the things in the current scene (people, animals, and other objects) and the relationships among them, and the voice broadcast system then converts this text into speech that the blind user can understand. The system is reliable, highly real-time, small, low-cost, and accurate, and helps blind users quickly obtain, in spoken form, the environmental information of their current location.
Drawings
FIG. 1 is a flow chart of an assisted vision implementation of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
The invention relates to a blind auxiliary vision system based on CNN/LSTM, comprising:
the image acquisition device acquires images around a user in real time and can be realized by adopting a small portable camera;
the control system can use an embedded chip; it carries a deep neural network pre-trained on a large number of labeled images, translates in real time the things contained in the scene (people, animals, and other common objects) and the relationships among them, and outputs this information as text;
the voice broadcast system uses speech synthesis software (for example, the iFLYTEK speech synthesis engine can be chosen) to broadcast the text as speech the blind user can understand; through a loudspeaker or earphone the user obtains, in real time, the information contained in the environment, quickly learning about nearby people, vehicles, obstacles, and other things and the associations between them.
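As a rough illustration of this broadcasting step, the following sketch uses the open-source pyttsx3 engine as a stand-in for the commercial speech-synthesis software named above; the speaking-rate value and the example sentence are assumptions:

```python
import pyttsx3

def speak(text: str) -> None:
    """Read the translated scene description aloud through the speaker/earphone."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 160)   # speaking rate; the value is an assumption
    engine.say(text)
    engine.runAndWait()

speak("A person is walking a dog on the street ahead of you.")
```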
The image acquisition device, the control system and the voice broadcasting system can be integrated into a whole.
In the invention, the deep neural network uses the deep convolutional neural network VGG16. VGG16 is trained on the labeled Microsoft COCO data set, optimized with the Dropout algorithm, and its pooling layers are replaced with dilated convolutions. The feature map output by VGG16 is translated by a long short-term memory network (LSTM), whose unit parameters are updated with the BPTT algorithm; the network finally outputs the text with which the images in the data set are labeled, yielding a pre-trained deep neural network that can translate image information into text.
The VGG convolutional neural network is a model proposed by the University of Oxford in 2014; its simplicity and practicality have produced very good results in image classification and object detection tasks. VGG16 is the 16-layer variant: it replaces the 7 × 7 convolution kernels of earlier convolutional networks with stacks of three 3 × 3 convolution layers, which gives better feature extraction and fewer parameters for the same receptive field, and its depth also provides strong non-linearity. The model consists of five convolutional blocks of 3 × 3 convolutions, each followed by 2 × 2 max pooling, with three fully connected layers at the end for further processing of the features. The Dropout algorithm randomly deactivates a neuron's activation with a certain probability during the forward pass, which effectively prevents overfitting and reduces training time. Convolutional networks usually rely on pooling layers to condense image features, but as the network deepens, repeated pooling shrinks the feature map and inevitably discards much detail. Dilated convolution is therefore used in place of pooling: it widens the receptive field without adding parameters while preserving the image's detail.
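A minimal PyTorch sketch of these ideas, assuming a VGG16 encoder and one pooling stage swapped for a dilated 3 × 3 convolution with Dropout (the channel counts, dilation rate, and Dropout probability are assumptions, not the patent's exact configuration):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# VGG16 convolutional stack as the image encoder. Weights are untrained here;
# the patent uses a pre-trained VGG16, so weights would be loaded in practice.
encoder = vgg16().features

# A pooling stage replaced by a dilated ("hole") 3x3 convolution: with dilation
# d the effective kernel size is n = y + (y - 1)(d - 1), so a 3x3 kernel with
# d = 2 covers a 5x5 area without adding parameters or shrinking the map.
dilated_block = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, dilation=2, padding=2),  # keeps spatial size
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),   # Dropout to curb overfitting, as in the patent
)

x = torch.randn(1, 3, 224, 224)
feat = dilated_block(encoder(x))
print(feat.shape)   # torch.Size([1, 512, 7, 7])
```

Here the dilated convolution keeps the 7 × 7 feature map at its original size, whereas a further pooling stage would halve it and discard detail.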
After processing by the VGG16 network, the image features are fed into the LSTM. The LSTM is a special recurrent neural network (RNN) proposed to address the vanishing-gradient problem of conventional RNNs: in a conventional RNN, gradients at later nodes shrink during back-propagation as the sequence grows, so earlier nodes are hard to update effectively. Each LSTM unit therefore adds a memory cell and several gates at every time step to control how information is stored and when and how the memory cell's state is updated. The CNN extracts image features and the LSTM remembers the word order within a sentence; with the CNN + LSTM structure, an image can be described and translated quickly and accurately. The optimized network converges faster and is more stable, providing a reliable core software framework for the blind-aided vision system.
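The CNN + LSTM arrangement can be sketched as a small caption decoder. The following PyTorch snippet is only a schematic of this structure: all dimensions, the class name `CaptionDecoder`, and the choice to feed the image feature as the first input step are assumptions made for the example, not the patent's exact network.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal CNN+LSTM captioner sketch: a CNN feature vector conditions an
    LSTM that predicts the caption word by word (all sizes are assumptions)."""
    def __init__(self, feat_dim=4096, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)     # image feature -> first "token"
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word embedding vectors
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)  # Softmax applied later

    def forward(self, img_feat, captions):
        img_token = self.img_proj(img_feat).unsqueeze(1)       # (B, 1, E)
        word_tokens = self.embed(captions)                     # (B, T, E)
        inputs = torch.cat([img_token, word_tokens], dim=1)    # multimodal input sequence
        hidden, _ = self.lstm(inputs)
        return self.to_vocab(hidden)                           # (B, T+1, V) word scores

model = CaptionDecoder()
scores = model(torch.randn(2, 4096), torch.randint(0, 10000, (2, 12)))
print(scores.shape)   # torch.Size([2, 13, 10000])
```

During training the word scores at each step would be passed through Softmax and compared against the labeled caption; at test time the most probable word at each step is taken, as described below.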
The Microsoft COCO data set consists of 123,287 images, of which 80% are used for training and the remaining 20% form the test set. Each image carries at least five human-annotated reference captions describing the things that appear in the image, the associations between them, and other information. This data set helps the neural network learn image features and retrieve the information in an image quickly, accurately, and in real time: when a new image arrives, the network can distinguish the things appearing in it and translate them into text. Because the training set is so large, the network's output is highly reliable.
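An 80/20 split of the image identifiers can be produced along these lines (a sketch only; the annotation file path and the fixed random seed are assumptions):

```python
import json, random

# Split the COCO image ids 80/20 into training and test sets, as described
# above (the annotation file path is an assumption).
with open("annotations/captions.json") as f:
    coco = json.load(f)

image_ids = [img["id"] for img in coco["images"]]   # 123,287 images in total
random.seed(0)
random.shuffle(image_ids)

split = int(0.8 * len(image_ids))
train_ids, test_ids = image_ids[:split], image_ids[split:]
print(len(train_ids), len(test_ids))
```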
In the training stage, the invention uses the deep convolutional neural network VGG16 to process the portion of the Microsoft COCO data set used as the training set, converting each image into a fixed-length feature vector; the Dropout algorithm optimizes the convolutional network and accelerates its convergence. The convolution is computed as follows:
I_{j,k}, \quad j \in [0, x),\ k \in [0, x)
W_{l,m}, \quad l \in [0, y),\ m \in [0, y)
\phi_{j,k} = \sigma\Big( \sum_{l=0}^{y-1} \sum_{m=0}^{y-1} I_{j+l,k+m}\, W_{l,m} + bias \Big)
where I_{j,k} is the input image, W_{l,m} is the weight of the convolution kernel, x is the size of the input layer, y is the size of the convolution kernel, j and k are the coordinates of a pixel on the image, l and m are the positions of the kernel weights, σ is the rectified linear unit (ReLU) activation function, φ is the output value of one convolution calculation, and bias is the bias;
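The formula can be written out directly; the following NumPy sketch implements a single-channel "valid" convolution with the ReLU activation exactly as defined above (the toy input and kernel sizes are assumptions):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def conv2d_valid(I, W, bias=0.0):
    """Single-channel 'valid' convolution implementing
    phi[j, k] = ReLU( sum_l sum_m I[j+l, k+m] * W[l, m] + bias )."""
    x, y = I.shape[0], W.shape[0]          # input size x, kernel size y
    out = np.zeros((x - y + 1, x - y + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            out[j, k] = relu(np.sum(I[j:j + y, k:k + y] * W) + bias)
    return out

I = np.random.randn(6, 6)   # toy 6x6 input image
W = np.random.randn(3, 3)   # toy 3x3 convolution kernel
print(conv2d_valid(I, W).shape)   # (4, 4)
```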
The pooling layer is replaced by dilated (hole) convolution, which inserts gaps between the convolution kernel elements during convolution; the specific formulas are:
n = y + (y - 1)(d - 1)
where d is a hyper-parameter, (d - 1) is the number of inserted gaps, and n is the kernel size after the gaps are added;
o = \lfloor (i + 2p - n) / s \rfloor + 1
where i is the input size of the dilated convolution, s is the stride, p is the number of padded pixels, and o is the size of the feature map after the dilated convolution;
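A small helper reproducing the two formulas, assuming the example values of a 3 × 3 kernel with dilation 2 applied to a 7 × 7 feature map:

```python
def dilated_output_size(i, y, d, s=1, p=0):
    """Effective kernel size and output size of a dilated convolution:
    n = y + (y - 1)(d - 1),  o = floor((i + 2p - n) / s) + 1."""
    n = y + (y - 1) * (d - 1)
    o = (i + 2 * p - n) // s + 1
    return n, o

# A 3x3 kernel with dilation 2 behaves like a 5x5 kernel; with padding 2 a
# 7x7 feature map keeps its size, so no detail is discarded by pooling.
print(dilated_output_size(i=7, y=3, d=2, s=1, p=2))   # (5, 7)
```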
and splicing the feature graph output by the deep convolutional neural network and the word embedded vector together to form a multi-mode feature, and sending the multi-mode feature into a long-time memory network LSTM for translation. The LSTM unit is a special recurrent neural network, and can solve the problem that when the time sequence is too long, the subsequent nodes are difficult to acquire effective information from the previous nodes in the traditional recurrent neural network, and the specific functions of the LSTM are as follows:
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(c_t)
where f_t is the forget gate, which controls how much of the previous cell state is forgotten; σ is the gate activation function; i_t is the input gate; \tilde{c}_t is the new candidate vector produced by tanh, which together with i_t controls how much new information is added; c_t is the new state of the memory cell; o_t is the output gate, which controls how much of the current cell state is passed on; h_t is the output of the unit; W_f, W_h, W_i, W_c, W_o are the weights of the gates; and b_f, b_i, b_c, b_o are the biases of the gates;
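One step of an LSTM unit following these gate equations can be sketched in NumPy as below (the logistic sigmoid is assumed as the gate activation σ, per the standard LSTM formulation, and the toy sizes are arbitrary):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.
    W holds the weights W_f, W_i, W_c, W_o; b holds the biases b_f, b_i, b_c, b_o."""
    z = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])                # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])                # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])              # candidate vector
    c_t = f_t * c_prev + i_t * c_hat                  # new memory-cell state
    o_t = sigmoid(W["o"] @ z + b["o"])                # output gate
    h_t = o_t * np.tanh(c_t)                          # unit output
    return h_t, c_t

H, X = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((H, H + X)) for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, c = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H), W, b)
print(h.shape, c.shape)   # (4,) (4,)
```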
The LSTM unit parameters are updated with the BPTT algorithm; the specific formulas are as follows:
\delta_{pk} = (z_{pk} - y_{pk})\, g'(y_{pk})
\delta_{pj} = g'(y_{pj}) \sum_k \delta_{pk}\, W_{kj}
\Delta W_{kj} = \eta \sum_p \sum_t \delta_{pk}\, y_j(t), \quad \Delta V_{ji} = \eta \sum_p \sum_t \delta_{pj}\, x_i(t), \quad \Delta U_{jl} = \eta \sum_p \sum_t \delta_{pj}\, y_l(t-1)
where p denotes the p-th sample, x_i(t) is the sample feature input to the network at time t, y_j(t) is the output of the hidden layer at time t, y_k(t) is the network output, and z_k is the target output. \delta_{pk} = (z_{pk} - y_{pk}) g'(y_{pk}) is the output residual of the p-th sample, and g'(y_{pk}) is the derivative of the output network function for the p-th sample. \delta_{pj} is the hidden-layer residual, expressed as the weighted sum of the output residuals propagated from the output layer into the m-th hidden layer with the weights of that layer. ΔW_{kj} is the update to the weight between the output layer and the hidden layer, ΔV_{ji} is the update to the weight between the input layer and the hidden layer, ΔU_{jl} is the update to the weight between hidden layers, and η is a constant (the learning rate);
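As a simplified illustration of the residuals and weight updates described above, the following is a plain backpropagation sketch for one sample and one time step, assuming a logistic output nonlinearity; in practice, frameworks such as PyTorch perform BPTT through the unrolled LSTM automatically via autograd:

```python
import numpy as np

def dg(y):
    """Derivative of the output nonlinearity g; a logistic unit is assumed."""
    return y * (1.0 - y)

def bptt_step(x_t, y_hidden, y_out, z_target, W_out, eta=0.01):
    """Residuals and weight updates for one sample and one time step, in the
    form described above (a plain backprop sketch, not the full LSTM BPTT)."""
    delta_out = (z_target - y_out) * dg(y_out)            # output residual
    delta_hidden = dg(y_hidden) * (W_out.T @ delta_out)   # hidden-layer residual
    dW = eta * np.outer(delta_out, y_hidden)              # output <- hidden update
    dV = eta * np.outer(delta_hidden, x_t)                # hidden <- input update
    return dW, dV

x_t = np.random.randn(3)           # input features at time t
y_hidden = np.random.rand(4)       # hidden-layer output at time t
y_out = np.random.rand(2)          # network output
z_target = np.array([1.0, 0.0])    # target output
dW, dV = bptt_step(x_t, y_hidden, y_out, z_target, np.random.randn(2, 4))
print(dW.shape, dV.shape)          # (2, 4) (4, 3)
```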
Through the Softmax function at the top of the LSTM, the output is transformed into a probability vector matrix over the word sequence, which is then converted into the corresponding words. The Softmax function is
p(w_j) = \frac{e^{w_j}}{\sum_{w \in V} e^{w}}
where w_j denotes the score of a word in the word list and V denotes the word list; the softmax value of a word equals the ratio of the exponential of its score to the sum of the exponentials of all words' scores, yielding the probability vector of the output j-th word over all words in the word list;
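A direct NumPy rendering of the Softmax transform over a toy three-word vocabulary (subtracting the maximum score is a standard numerical-stability step, not part of the formula above):

```python
import numpy as np

def softmax(scores):
    """p_w = exp(s_w) / sum over the word list of exp(s_w')."""
    e = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])        # LSTM scores for a 3-word toy vocabulary
print(softmax(scores))                    # probabilities summing to 1
```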
The distance between the generated word-sequence matrix and the word-sequence matrix of the reference sentence is computed with a distance function
D = \sum_{k=1}^{K} w_k\, d_k
where d_k is the distance for the k-th word matrix, w_k is the weight used when fusing the k-th stage, and K is the total number of word matrices.
In the testing stage, the invention uses the remaining data in the data set as the test set to evaluate the trained network. The LSTM generates a probability matrix over word sequences; in each probability vector of the matrix, the word with the largest probability is the predicted word, and the predicted words are combined in order to form the descriptive sentence.
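The decoding step can be sketched as follows (the toy index-to-word table and the identity probability matrix are assumptions for the example):

```python
import numpy as np

index_to_word = {0: "a", 1: "dog", 2: "on", 3: "the", 4: "street"}

def decode(prob_matrix, index_to_word):
    """Each row is the probability vector for one time step; the word with the
    largest probability is the prediction, and the words are joined in order."""
    words = [index_to_word[int(np.argmax(row))] for row in prob_matrix]
    return " ".join(words)

# Toy 5-step x 5-word probability matrix standing in for the LSTM output.
probs = np.eye(5)
print(decode(probs, index_to_word))   # "a dog on the street"
```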
Finally, the text translated from the image by the neural network is converted into speech using iFLYTEK speech synthesis software.
In the invention, when a blind user enters an unfamiliar scene and needs scene information, he or she presses a switch that triggers the camera to take several photos. The photos are sent in sequence to the deep neural network in the main control chip, which extracts and translates their features; the text describing the scene contained in the photos is produced by the network and passed to the voice broadcast system, where speech synthesis software converts it into speech for the user with limited or impaired vision, so that the environment's information is obtained in real time. The device is portable, lightweight, low-cost, accurate, and reliable.
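An end-to-end sketch of this flow, assuming a hypothetical `model.caption` method standing in for the CNN/LSTM pipeline and pyttsx3 standing in for the speech-synthesis software:

```python
import cv2
import pyttsx3

def describe_scene(model, capture_index=0):
    """Grab a frame when the switch is pressed, run the captioning network,
    and read the sentence aloud (model.caption is a hypothetical method)."""
    cap = cv2.VideoCapture(capture_index)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return
    sentence = model.caption(frame)        # image -> text, as described above
    engine = pyttsx3.init()
    engine.say(sentence)
    engine.runAndWait()
```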
In this way, on the user's own initiative, voice broadcasting helps a blind person quickly learn about the obstacles and objects around his or her current position, enriching the user's senses and letting the blind perceive the world much as sighted people do.

Claims (9)

1. A CNN/LSTM-based blind aided vision system, comprising:
the image acquisition device acquires images around the user in real time;
the control system is used for carrying a deep neural network pre-trained by using a large number of pictures with labels and translating information contained in a scene in real time;
and the voice broadcasting system broadcasts the information in a voice form.
2. The CNN/LSTM-based blind-aided vision system as claimed in claim 1, wherein the image capturing device is a portable camera, the image capturing device, the control system and the voice broadcasting system are integrated into a whole, and the control system is an embedded chip.
3. The CNN/LSTM-based blind-aided vision system as claimed in claim 1, wherein the information translated from the scene consists of the things in the current scene and the relationships among them, and is output in the form of text.
4. The CNN/LSTM-based blind-aided vision system of claim 1, wherein the deep neural network uses a deep convolutional neural network; the network is trained with a labeled data set, optimized with the Dropout algorithm, and its pooling layers are replaced with dilated (hole) convolutions; the feature map output by the deep convolutional neural network is translated by a long short-term memory network (LSTM), whose unit parameters are updated with the BPTT algorithm; and the network finally outputs the text with which the images in the data set are labeled, yielding a pre-trained deep neural network capable of translating image information into text.
5. The CNN/LSTM-based blind-aided vision system of claim 4, wherein the deep convolutional neural network is VGG16 and the dataset is a Microsoft COCO dataset.
6. The CNN/LSTM-based blind-aided vision system of claim 4 or claim 5, wherein the deep convolutional neural network processes the portion of the data set used as the training set, converts each image into a fixed-length feature vector, and is optimized with the Dropout algorithm to accelerate its convergence; dilated convolution replaces the pooling layers by inserting gaps between the convolution kernel elements during convolution; and the feature map output by the deep convolutional neural network is spliced with the word embedding vector to form a multimodal feature, which is fed into the long short-term memory network LSTM for translation.
7. The CNN/LSTM-based blind-aided vision system of claim 6, wherein the convolution calculation process is as follows:
I_{j,k}, \quad j \in [0, x),\ k \in [0, x)
W_{l,m}, \quad l \in [0, y),\ m \in [0, y)
\phi_{j,k} = \sigma\Big( \sum_{l=0}^{y-1} \sum_{m=0}^{y-1} I_{j+l,k+m}\, W_{l,m} + bias \Big)
where I_{j,k} is the input image, W_{l,m} is the weight of the convolution kernel, x is the size of the input layer, y is the size of the convolution kernel, j and k are the coordinates of a pixel on the image, l and m are the positions of the kernel weights, σ is the rectified linear unit (ReLU) activation function, φ is the output value of one convolution calculation, and bias is the bias;
the specific formulas for inserting gaps between the convolution kernel elements during convolution are as follows:
n = y + (y - 1)(d - 1)
where d is a hyper-parameter, (d - 1) is the number of inserted gaps, and n is the kernel size after the gaps are added;
o = \lfloor (i + 2p - n) / s \rfloor + 1
where i is the input size of the dilated convolution, s is the stride, p is the number of padded pixels, and o is the size of the feature map after the dilated convolution;
the specific function of the LSTM is as follows:
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(c_t)
where f_t is the forget gate, which controls how much of the previous cell state is forgotten; σ is the gate activation function; i_t is the input gate; \tilde{c}_t is the new candidate vector produced by tanh, which together with i_t controls how much new information is added; c_t is the new state of the memory cell; o_t is the output gate, which controls how much of the current cell state is passed on; h_t is the output of the unit; W_f, W_h, W_i, W_c, W_o are the weights of the gates; and b_f, b_i, b_c, b_o are the biases of the gates;
the LSTM unit parameters are updated with the BPTT algorithm; the specific formulas are as follows:
\delta_{pk} = (z_{pk} - y_{pk})\, g'(y_{pk})
\delta_{pj} = g'(y_{pj}) \sum_k \delta_{pk}\, W_{kj}
\Delta W_{kj} = \eta \sum_p \sum_t \delta_{pk}\, y_j(t), \quad \Delta V_{ji} = \eta \sum_p \sum_t \delta_{pj}\, x_i(t), \quad \Delta U_{jl} = \eta \sum_p \sum_t \delta_{pj}\, y_l(t-1)
where p denotes the p-th sample, x_i(t) is the sample feature input to the network at time t, y_j(t) is the output of the hidden layer at time t, y_k(t) is the network output, and z_k is the target output. \delta_{pk} = (z_{pk} - y_{pk}) g'(y_{pk}) is the output residual of the p-th sample, and g'(y_{pk}) is the derivative of the output network function for the p-th sample. \delta_{pj} is the hidden-layer residual, expressed as the weighted sum of the output residuals propagated from the output layer into the m-th hidden layer with the weights of that layer. ΔW_{kj} is the update to the weight between the output layer and the hidden layer, ΔV_{ji} is the update to the weight between the input layer and the hidden layer, ΔU_{jl} is the update to the weight between hidden layers, and η is a constant (the learning rate);
through the Softmax function at the top of the LSTM, the output is transformed into a probability vector matrix over the word sequence, which is then converted into the corresponding words; the Softmax function is
p(w_j) = \frac{e^{w_j}}{\sum_{w \in V} e^{w}}
where w_j denotes the score of a word in the word list and V denotes the word list; the softmax value of a word equals the ratio of the exponential of its score to the sum of the exponentials of all words' scores, yielding the probability vector of the output j-th word over all words in the word list;
the distance between the generated word-sequence matrix and the word-sequence matrix of the reference sentence is computed with a distance function
D = \sum_{k=1}^{K} w_k\, d_k
where d_k is the distance for the k-th word matrix, w_k is the weight used when fusing the k-th stage, and K is the total number of word matrices.
8. The CNN/LSTM-based blind-aided vision system of claim 6, wherein the trained network is tested using the remaining data in the data set as the test set; the LSTM generates a probability matrix over word sequences, the word with the largest probability in each probability vector of the matrix is the predicted word, and the predicted words are combined in order to generate the descriptive sentence.
9. The CNN/LSTM-based blind aided vision system of claim 1, wherein the voice broadcasting system adopts text-to-speech software to convert text information translated by the deep neural network into voice and broadcast the voice through a loudspeaker or an earphone.
CN201811573815.3A 2018-12-21 2018-12-21 Blind person auxiliary vision system based on CNN/LSTM Active CN109753900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811573815.3A CN109753900B (en) 2018-12-21 2018-12-21 Blind person auxiliary vision system based on CNN/LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811573815.3A CN109753900B (en) 2018-12-21 2018-12-21 Blind person auxiliary vision system based on CNN/LSTM

Publications (2)

Publication Number Publication Date
CN109753900A true CN109753900A (en) 2019-05-14
CN109753900B CN109753900B (en) 2020-06-23

Family

ID=66402962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811573815.3A Active CN109753900B (en) 2018-12-21 2018-12-21 Blind person auxiliary vision system based on CNN/LSTM

Country Status (1)

Country Link
CN (1) CN109753900B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490087A (en) * 2019-07-25 2019-11-22 中山大学 A kind of vision-aided system based on depth learning technology
CN110852171A (en) * 2019-10-14 2020-02-28 清华大学深圳国际研究生院 Scene description robot system and method for online training
CN111329735A (en) * 2020-02-21 2020-06-26 北京理工大学 Method, device and system for guiding the blind
CN111858989A (en) * 2020-06-09 2020-10-30 西安工程大学 An attention-based spiking convolutional neural network for image classification
CN112381052A (en) * 2020-12-01 2021-02-19 創啟社會科技有限公司 System and method for identifying visually impaired users in real time
CN113091747A (en) * 2021-04-09 2021-07-09 北京深睿博联科技有限责任公司 Blind person navigation method and device based on auxiliary information
CN114049553A (en) * 2021-11-02 2022-02-15 北京科技大学顺德研究生院 Offline blind person vision assisting method and device
CN114469661A (en) * 2022-02-24 2022-05-13 沈阳理工大学 A visual content guidance assistance system and method based on coding and decoding technology
CN114699287A (en) * 2022-03-02 2022-07-05 北京工业大学 Daily trip assisting method for blind person based on mobile terminal rapid deep neural network
CN115273810A (en) * 2022-07-04 2022-11-01 成都理工大学 Multimodal image speech interpretation method and system based on deep learning
CN116071648A (en) * 2023-01-31 2023-05-05 武汉工程大学 Blind person assisting method, wearable blind person assisting device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650789A (en) * 2016-11-16 2017-05-10 同济大学 Image description generation method based on depth LSTM network
CN107153812A (en) * 2017-03-31 2017-09-12 深圳先进技术研究院 A kind of exercising support method and system based on machine vision
US20170364766A1 (en) * 2014-12-22 2017-12-21 Gonzalo Vaca First-Person Camera Based Visual Context Aware System
CN108245384A (en) * 2017-12-12 2018-07-06 清华大学苏州汽车研究院(吴江) Binocular vision apparatus for guiding blind based on enhancing study

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170364766A1 (en) * 2014-12-22 2017-12-21 Gonzalo Vaca First-Person Camera Based Visual Context Aware System
CN106650789A (en) * 2016-11-16 2017-05-10 同济大学 Image description generation method based on depth LSTM network
CN107153812A (en) * 2017-03-31 2017-09-12 深圳先进技术研究院 A kind of exercising support method and system based on machine vision
CN108245384A (en) * 2017-12-12 2018-07-06 清华大学苏州汽车研究院(吴江) Binocular vision apparatus for guiding blind based on enhancing study

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WANG Xiaoxin: "A Complete Overview of Deep Learning Methods for Semantic Segmentation: From FCN and SegNet to the DeepLab Variants", QbitAI (量子位) *
XIN Yang et al.: "Principles and Practice of Big Data Technology", Beijing University of Posts and Telecommunications Press, 31 March 2018 *
CHEN Qiangpu: "Research on Deep Neural Network Models for Image Captioning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490087B (en) * 2019-07-25 2022-08-05 中山大学 A visual aid system based on deep learning technology
CN110490087A (en) * 2019-07-25 2019-11-22 中山大学 A kind of vision-aided system based on depth learning technology
CN110852171A (en) * 2019-10-14 2020-02-28 清华大学深圳国际研究生院 Scene description robot system and method for online training
CN111329735A (en) * 2020-02-21 2020-06-26 北京理工大学 Method, device and system for guiding the blind
CN111858989A (en) * 2020-06-09 2020-10-30 西安工程大学 An attention-based spiking convolutional neural network for image classification
CN111858989B (en) * 2020-06-09 2023-11-10 西安工程大学 An image classification method based on attention mechanism and spiking convolutional neural network
CN112381052A (en) * 2020-12-01 2021-02-19 創啟社會科技有限公司 System and method for identifying visually impaired users in real time
CN113091747A (en) * 2021-04-09 2021-07-09 北京深睿博联科技有限责任公司 Blind person navigation method and device based on auxiliary information
CN114049553A (en) * 2021-11-02 2022-02-15 北京科技大学顺德研究生院 Offline blind person vision assisting method and device
CN114469661A (en) * 2022-02-24 2022-05-13 沈阳理工大学 A visual content guidance assistance system and method based on coding and decoding technology
CN114469661B (en) * 2022-02-24 2023-10-03 沈阳理工大学 Visual content blind guiding auxiliary system and method based on coding and decoding technology
CN114699287A (en) * 2022-03-02 2022-07-05 北京工业大学 Daily trip assisting method for blind person based on mobile terminal rapid deep neural network
CN115273810A (en) * 2022-07-04 2022-11-01 成都理工大学 Multimodal image speech interpretation method and system based on deep learning
CN116071648A (en) * 2023-01-31 2023-05-05 武汉工程大学 Blind person assisting method, wearable blind person assisting device and storage medium

Also Published As

Publication number Publication date
CN109753900B (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN109753900B (en) Blind person auxiliary vision system based on CNN/LSTM
CN115329779B (en) A multi-person conversation emotion recognition method
CN115311389B (en) A multimodal visual cueing technology representation learning method based on pre-trained models
Lu et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning
CN111226224B (en) Method and electronic device for translating speech signals
KR102601848B1 (en) Device and method of data recognition model construction, and data recognition devicce
WO2022161298A1 (en) Information generation method and apparatus, device, storage medium, and program product
KR20210095208A (en) Video caption creation method, device and apparatus, and storage medium
WO2017168870A1 (en) Information processing device and information processing method
CN106846306A (en) A kind of ultrasonoscopy automatic describing method and system
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
WO2023207541A1 (en) Speech processing method and related device
CN116185182B (en) A controllable image description generation system and method integrating eye movement attention
Rastgoo et al. All you need in sign language production
CN111597341A (en) Document level relation extraction method, device, equipment and storage medium
CN113948060A (en) Network training method, data processing method and related equipment
US9525841B2 (en) Imaging device for associating image data with shooting condition information
CN111092798B (en) Wearable system based on spoken language understanding
CN114238587A (en) Reading understanding method and device, storage medium and computer equipment
CN115273810A (en) Multimodal image speech interpretation method and system based on deep learning
Patra et al. Exploring Bengali image descriptions through the combination of diverse CNN architectures and transformer decoders
Mishra et al. Environment descriptor for the visually impaired
Saini et al. Artificial intelligence inspired fog-cloud-based visual-assistance framework for blind and visually-impaired people
CN115731917A (en) Voice data processing method, model training method, device and storage medium
Kuang et al. Research on an end-to-end assistive model for the visually impaired population

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared