
WO2020186484A1 - Automatic image description generation method and system, electronic device, and storage medium - Google Patents


Info

Publication number
WO2020186484A1
WO2020186484A1 (application PCT/CN2019/078915, CN2019078915W)
Authority
WO
WIPO (PCT)
Prior art keywords
loss function
neural network
text
model
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/078915
Other languages
French (fr)
Chinese (zh)
Inventor
王娜
吕锦涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to PCT/CN2019/078915 priority Critical patent/WO2020186484A1/en
Publication of WO2020186484A1 publication Critical patent/WO2020186484A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/0495: Quantised networks; Sparse networks; Compressed networks
    • G06N 3/08: Learning methods
    • G06N 3/092: Reinforcement learning
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features

Definitions

  • The present invention relates to the technical field of picture processing, and in particular to a method, system, electronic device and storage medium for automatically generating descriptions for pictures.
  • The existing neural network model is generally an encoder-decoder model: features are extracted by a CNN (convolutional neural network) and used as the initial state of an LSTM (long short-term memory network), and the LSTM then generates a passage of text describing the picture.
  • The first aspect of the present invention provides a method for automatically generating descriptions for pictures, including: using a loss function, a mobilenet convolutional neural network, and a long short-term memory network to construct a model that automatically generates descriptions for pictures; recording the network parameters when constructing the model; inputting a picture into the model; and the model outputting, according to the input picture and the network parameters, a passage of text that describes the picture.
  • Using the mobilenet convolutional neural network, image features can be extracted with few parameters and little computation, which effectively improves efficiency.
  • With the long short-term memory network, text describing the image features can be generated, and the loss function is used overall to backpropagate and update the network parameters, refining the model. The trained model therefore generates more accurate picture descriptions, unifying the training standard of the model with the evaluation standard.
  • FIG. 1 is a schematic flow diagram of the method for automatically generating descriptions for pictures according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the matrix decomposition of the Mobilenet model in the method for automatically generating descriptions for pictures according to an embodiment of the present invention;
  • FIG. 3 is a schematic structural block diagram of the system for automatically generating descriptions for pictures according to an embodiment of the present invention;
  • FIG. 4 is a schematic structural block diagram of an electronic device according to an embodiment of the present invention.
  • The new feature vector is input into the long short-term memory network, which generates text carrying the picture information according to the new feature vector; the derivative of the loss function is taken, and the text is input into the differentiated loss function; whether the loss function converges after receiving the text is judged; if the loss function converges, the network parameters at training time are saved; if the loss function does not converge, the mobilenet convolutional neural network, matrix transformation, and long short-term memory network continue to generate text with picture information, which is input into the differentiated loss function, until the loss function converges.
  • Extracting the picture feature vector using the pre-trained mobilenet convolutional neural network includes: after the picture is input into the pre-trained mobilenet convolutional neural network, saving the feature vector output by the average pooling layer of the network.
  • The second aspect of the present invention provides a system for automatically generating descriptions for pictures, including: a model training module, used to construct a model that automatically generates descriptions for pictures using a loss function, a mobilenet convolutional neural network, and a long short-term memory network; a network parameter recording module, used to record the network parameters when the model training module constructs the model; a picture receiving module, used to receive pictures input into the model trained by the model training module; and a text generation module, used to make the model output, according to the picture received by the picture receiving module and the network parameters recorded by the network parameter recording module, a passage of text that describes the picture.
  • The neural network pre-training module includes: a neural network construction unit, used to construct the mobilenet convolutional neural network; and a neural network parameter update unit, used to pre-train the mobilenet convolutional neural network constructed by the neural network construction unit on an existing picture data set and to update the parameters of the network.
  • The model training module includes: a loss function construction unit, used to design and construct the loss function following the idea of reinforcement learning; and a feature vector extraction unit, used to extract the feature vector of a sample picture using the pre-trained mobilenet convolutional neural network.
  • The attention mechanism introduction unit is used to introduce an attention mechanism via matrix transformation, fusing the feature vector with the original state vector extracted from the pre-trained long short-term memory network to obtain a new feature vector. The text generation unit is used to input the new feature vector obtained by the attention mechanism introduction unit into the long short-term memory network, which generates text with picture information according to the new feature vector.
  • The loss function derivation unit is used to take the derivative of the loss function constructed by the loss function construction unit and to input the text into the differentiated loss function; the loss function convergence judging unit is used to judge whether the differentiated loss function converges after receiving the text; and the network parameter storage unit is used to save the network parameters at training time after the convergence judging unit determines that the loss function has converged.
  • The feature vector extraction unit includes: a pooling layer output saving subunit, used to save the feature vector output by the average pooling layer of the mobilenet convolutional neural network after the picture is input into the pre-trained network.
  • A third aspect of the present invention provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein when the processor executes the computer program, any one of the above methods is implemented.
  • A fourth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method described in any one of the above is implemented.
  • A method for automatically generating descriptions for pictures includes: S1, using a loss function, a mobilenet convolutional neural network, and a long short-term memory network to build a model that automatically generates descriptions for pictures; S2, recording the network parameters when building the model; S3, inputting a picture into the model; S4, the model outputting, according to the input picture and the network parameters, a passage of text that describes the picture.
  • Using the loss function, mobilenet convolutional neural network, and long short-term memory network to build the model includes: designing and constructing the loss function following the idea of reinforcement learning; using the pre-trained mobilenet convolutional neural network to extract the picture feature vector; introducing an attention mechanism via matrix transformation to fuse the feature vector with the original state vector extracted from the pre-trained long short-term memory network, obtaining a new feature vector; inputting the new feature vector into the long short-term memory network, which generates text with picture information according to the new feature vector; taking the derivative of the loss function and inputting the text into the differentiated loss function; judging whether the loss function converges after receiving the text; if the loss function converges, saving the network parameters at training time; if the loss function does not converge, continuing to use the mobilenet convolutional neural network, matrix transformation, and long short-term memory network to generate text with picture information and inputting it into the differentiated loss function, until the loss function converges.
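The training flow above (extract features, fuse with attention, generate text, evaluate the loss, repeat until convergence) can be sketched as a generic convergence loop. This is a hypothetical skeleton under our own naming, not the patent's implementation; the step function is a stand-in that merely returns a decaying loss.

```python
def train_until_converged(step_fn, tol=1e-4, max_iters=1000):
    """Run step_fn (one full optimization step returning the current loss)
    until the loss stops changing by more than tol, then return the final
    loss and the number of iterations used."""
    prev = float("inf")
    for i in range(max_iters):
        loss = step_fn()
        if abs(prev - loss) < tol:  # loss converged: stop and save parameters
            return loss, i
        prev = loss
    return prev, max_iters

# Hypothetical stand-in for one training step: loss decays toward 1.0.
state = {"loss": 5.0}
def fake_step():
    state["loss"] = 1.0 + 0.5 * (state["loss"] - 1.0)
    return state["loss"]

final_loss, iters = train_until_converged(fake_step)
print(round(final_loss, 3))  # ~1.0, reached in well under max_iters steps
```

In the patent's flow, `step_fn` would correspond to one pass of feature extraction, attention fusion, LSTM text generation, and loss backpropagation; the convergence check decides whether to save the network parameters or loop again.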
  • Using the pre-trained mobilenet convolutional neural network to extract the image feature vector includes: inputting the image into the pre-trained mobilenet convolutional neural network, and saving the feature vector output by the average pooling layer of the mobilenet convolutional neural network.
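As a rough illustration of the pooling step (a toy sketch in plain Python, not mobilenet's actual layer), global average pooling collapses each channel's spatial map to a single value, yielding one feature per channel:

```python
def global_average_pool(feature_map):
    """Collapse each channel's H x W map to its mean.

    feature_map: list of channels, each a list of rows of floats.
    Returns a flat feature vector with one value per channel.
    """
    vector = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        vector.append(total / count)
    return vector

# A toy 2-channel, 2x2 feature map.
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],   # channel 0 -> mean 2.5
    [[0.0, 0.0], [0.0, 8.0]],   # channel 1 -> mean 2.0
]
print(global_average_pool(fmap))  # [2.5, 2.0]
```

The saved vector (one scalar per channel) is what the method feeds onward as the picture's feature vector.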
  • Mobilenet focuses on compressing the model while ensuring accuracy.
  • The idea is to decompose a standard convolution into a depthwise convolution and a 1x1 ordinary convolution (also called a pointwise convolution).
  • A simple way to understand this is as a factorization of the matrix.
  • the specific steps are shown in Figure 2.
  • Suppose the input image feature map has size DF*DF with M channels, the filter has size DK*DK with N output channels, and the stride is 1.
  • A standard convolution then requires DK*DK*M*N*DF*DF multiply operations, with DK*DK*M*N convolution kernel parameters.
  • The convolution in Mobilenet requires DK*DK*M*DF*DF + M*N*DF*DF multiply operations, with DK*DK*M + M*N kernel parameters. Since the process of convolution is mainly one of reducing the spatial dimension while increasing the channel dimension, i.e. N > M, we have DK*DK*M*N much greater than DK*DK*M + M*N. The depthwise separable convolution therefore compresses both the model size and the amount of computation considerably, so that the model is fast, the computational overhead is low, and the accuracy remains good.
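The operation and parameter counts can be checked with a small calculation. This is a back-of-the-envelope sketch; the layer sizes below are illustrative (a typical MobileNet-style layer), not taken from the patent:

```python
def standard_conv_cost(df, dk, m, n):
    """Multiply count and parameter count of a standard convolution.

    df: spatial size of the (square) input feature map
    dk: spatial size of the (square) kernel
    m:  input channels, n: output channels (stride 1, 'same' padding)
    """
    ops = dk * dk * m * n * df * df
    params = dk * dk * m * n
    return ops, params

def depthwise_separable_cost(df, dk, m, n):
    """Same counts for a depthwise conv followed by a 1x1 pointwise conv."""
    ops = dk * dk * m * df * df + m * n * df * df
    params = dk * dk * m + m * n
    return ops, params

# Illustrative layer: 14x14 feature map, 3x3 kernel, 512 -> 512 channels.
std_ops, std_params = standard_conv_cost(14, 3, 512, 512)
sep_ops, sep_params = depthwise_separable_cost(14, 3, 512, 512)
print(round(std_ops / sep_ops, 2))  # ≈ 8.84x fewer multiplies
```

The ratio works out to DK*DK*N / (DK*DK + N), so for a 3x3 kernel the separable form needs roughly 8-9 times fewer multiplies whenever N is large.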
  • The long short-term memory network uses the sentence generated greedily from the picture as the baseline and a sentence composed of randomly sampled words as the reward source, and constructs the loss function from the difference between the reward and the baseline. Using this loss function and the new state vector, the long short-term memory network is trained again through backpropagation to obtain a new long short-term memory network, and the training parameters of the new network are updated.
  • The training goal is to minimize the negative expected reward L(θ) = -E[r(w^s)], where w^s is a word sequence sampled from the model distribution p_θ.
  • Using a single sample, L(θ) can be approximated as L(θ) ≈ -r(w^s).
  • The baseline b can be any function, as long as it does not depend on the action w^s; introducing it does not change the expected value of the gradient.
  • The gradient can then be expressed as: ∇_θ L(θ) ≈ -(r(w^s) - b) ∇_θ log p_θ(w^s).
  • In this reinforcement learning scheme, the reward of the sentence generated by the current model in the testing phase is used as the baseline, and the gradient becomes: ∇_θ L(θ) ≈ -(r(w^s) - r(ŵ)) ∇_θ log p_θ(w^s).
  • Here r(ŵ), with ŵ_t = argmax_{w_t} p(w_t | h_t), is the reward of the sentence produced in the test phase by greedy decoding, i.e. always taking the highest-probability word; r(w^s) is the reward of a sentence whose words are sampled according to their probabilities: if the most probable word currently has probability 60%, it is picked with probability 60%, rather than always, as in greedy decoding.
  • The meaning of the formula is: if the currently sampled sentence is better than the one generated in the test phase, the advantage r(w^s) - r(ŵ) is positive, so minimizing the loss raises the log-probability (score) of the sampled words; if it is worse, the advantage is negative and the scores of those words are reduced.
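A minimal sketch of this self-critical loss, assuming, as in the text, that the greedy caption's reward serves as the baseline (function and variable names are our own, not the patent's):

```python
import math

def scst_loss(sampled_logprobs, sampled_reward, greedy_reward):
    """Self-critical sequence-training loss for one sampled caption.

    sampled_logprobs: log p(w_t) for each word of the sampled caption w^s
    sampled_reward:   r(w^s), e.g. an evaluation-metric score of the sample
    greedy_reward:    r(w^hat), score of the greedy test-time caption,
                      used as the baseline
    Minimizing this loss pushes probability toward samples that beat
    the greedy baseline and away from samples that fall short of it.
    """
    advantage = sampled_reward - greedy_reward
    return -advantage * sum(sampled_logprobs)

# Log-probs are negative, so a positive advantage gives a positive loss
# whose gradient raises the sampled words' probabilities.
logps = [math.log(0.6), math.log(0.3)]
print(scst_loss(logps, sampled_reward=0.8, greedy_reward=0.5) > 0)  # True
```

Because the baseline is the model's own test-time output, training directly optimizes the same metric used at evaluation, which is the unification of training and evaluation standards the patent aims at.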
  • FIG. 3 shows a system for automatically generating descriptions for pictures, including: a model training module 1, a network parameter recording module 2, a picture receiving module 3, and a text generation module 4. The model training module 1 is used to build a model that automatically generates descriptions for pictures using a loss function, a mobilenet convolutional neural network, and a long short-term memory network; the network parameter recording module 2 is used to record the network parameters when the model training module 1 builds the model; the picture receiving module 3 is used to receive pictures input into the model trained by the model training module 1; and the text generation module 4 is used to make the model output, according to the picture received by the picture receiving module 3 and the network parameters recorded by the network parameter recording module 2, a passage of text that describes the picture.
  • The model training module 1 includes: a loss function construction unit, a feature vector extraction unit, an attention mechanism introduction unit, a text generation unit, a loss function derivation unit, a loss function convergence judgment unit, a network parameter storage unit, and a loop unit. The loss function construction unit is used to design and construct the loss function following the idea of reinforcement learning; the feature vector extraction unit is used to extract the feature vector of a sample picture using the pre-trained mobilenet convolutional neural network; the attention mechanism introduction unit is used to introduce an attention mechanism via matrix transformation, fusing the feature vector with the original state vector extracted from the pre-trained long short-term memory network to obtain a new feature vector; the text generation unit is used to input the new feature vector obtained by the attention mechanism introduction unit into the long short-term memory network, which generates text with picture information according to the new feature vector; the loss function derivation unit is used to take the derivative of the loss function constructed by the loss function construction unit and to input the text into the differentiated loss function; and the loss function convergence judgment unit is used to judge whether the differentiated loss function converges after receiving the text.
  • the feature vector extraction unit includes: a pooling layer output saving subunit, which is used to save the feature vector output by the average pooling layer of the mobilenet convolutional neural network after the picture is input into the pre-trained mobilenet convolutional neural network.
  • An embodiment of the present application provides an electronic device. Please refer to FIG. 4.
  • The electronic device includes: a memory 601, a processor 602, and a computer program stored in the memory 601 and runnable on the processor 602; when the processor 602 executes the computer program, the method for automatically generating descriptions for pictures described above is implemented.
  • the electronic device further includes: at least one input device 603 and at least one output device 604.
  • the aforementioned memory 601, processor 602, input device 603, and output device 604 are connected via a bus 605.
  • the input device 603 may specifically be a camera, a touch panel, a physical button or a mouse, etc.
  • the output device 604 may specifically be a display screen.
  • The memory 601 may be a high-speed random access memory (RAM, Random Access Memory), or a non-volatile memory, such as a disk memory.
  • the memory 601 is used to store a group of executable program codes, and the processor 602 is coupled with the memory 601.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • The computer-readable storage medium may be provided in the electronic device of each of the foregoing embodiments, and may be the aforementioned memory 601.
  • A computer program is stored on the computer-readable storage medium, and when the program is executed by the processor 602, the method for automatically generating descriptions for pictures described in the foregoing embodiments is realized.
  • The computer storage medium may also be a USB flash drive, a mobile hard disk, a read-only memory (ROM, Read-Only Memory), a RAM, a magnetic disk, an optical disk, or any other medium that can store program code.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • The division of the modules is only a logical functional division; in actual implementation there may be other divisions. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or modules, and may be in electrical, mechanical or other forms.
  • Modules described as separate components may or may not be physically separated, and components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • The functional modules in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.
  • The invention provides a method, system, electronic device and storage medium for automatically generating descriptions for pictures, which solves the technical problem in the prior art that the training standard of the model and the evaluation standard are not unified.


Abstract

Disclosed are an automatic image description generation method and system, an electronic device, and a storage medium, for generating a text description for an image and resolving the prior-art issue in which a trained model does not match the assessment criteria. The method comprises: using a loss function, a mobilenet convolutional neural network, and a long short-term memory network to construct an automatic image description generation model; recording a network parameter during construction of the model; inputting an image into the model; and the model outputting, according to the input image and the network parameter, a piece of text describing the image. The automatic image description generation method provided by the present application develops a novel loss function by means of reinforcement learning, such that the model's training standard matches the assessment criteria.

Description

Method, system, electronic device and storage medium for automatically generating descriptions for pictures

Technical Field

The present invention relates to the technical field of picture processing, and in particular to a method, system, electronic device and storage medium for automatically generating descriptions for pictures.

Background Art

Since the beginning of the 21st century, Internet storage capacity and computer computing power have experienced a huge leap, and the number of smartphone users has also grown substantially. Through smart devices such as mobile terminals and PCs, users share a large amount of picture data on the Internet every day. These huge picture data resources can reflect the objective world more precisely; the visual variations they exhibit carry rich semantic information and provide an ample source of information for perceiving the real world.

With the development of neural networks and deep learning, picture understanding has evolved from early picture processing based on low-level visual features toward high-level processing based on picture semantics and semantic understanding. The existing neural network model is generally an encoder-decoder model: features are extracted by a CNN (convolutional neural network) and used as the initial state of an LSTM (long short-term memory network), and the LSTM generates a passage of text describing the picture.

Technical Problem

Existing models are mainly trained with the cross-entropy loss function, while after training is completed the model must be evaluated with metrics such as BLEU. Models trained with the cross-entropy loss therefore commonly suffer from a mismatch between the training standard and the evaluation standard.
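For reference, the core of BLEU is clipped n-gram precision against a reference caption. A simplified unigram sketch (ignoring the brevity penalty and higher-order n-grams, so this is an illustration, not the full metric):

```python
from collections import Counter

def bleu1(candidate, reference):
    """Clipped unigram precision, the core of BLEU-1.

    candidate, reference: lists of word tokens.
    Each candidate word is credited at most as many times as it
    appears in the reference (the 'clipping').
    """
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return clipped / max(len(candidate), 1)

cand = "a dog runs on the grass".split()
ref = "a dog is running on the grass".split()
print(round(bleu1(cand, ref), 2))  # 0.83
```

Because this score is computed on discrete word matches, it cannot be differentiated directly, which is why cross-entropy training and BLEU-style evaluation pull in different directions, the mismatch the patent addresses with a reinforcement-learning loss.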

Technical Solution

The first aspect of the present invention provides a method for automatically generating descriptions for pictures, including: using a loss function, a mobilenet convolutional neural network, and a long short-term memory network to construct a model that automatically generates descriptions for pictures; recording the network parameters when constructing the model; inputting a picture into the model; and the model outputting, according to the input picture and the network parameters, a passage of text that describes the picture.

Beneficial Effects

Using the mobilenet convolutional neural network, image features can be extracted with few parameters and little computation, which effectively improves efficiency. With the long short-term memory network, text describing the image features is generated, and the loss function is used overall to backpropagate and update the network parameters, refining the model. The trained model therefore generates more accurate picture descriptions, unifying the training standard of the model with the evaluation standard.

Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative work.

FIG. 1 is a schematic flow diagram of the method for automatically generating descriptions for pictures according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the matrix decomposition principle of the Mobilenet model in the method for automatically generating descriptions for pictures according to an embodiment of the present invention;

FIG. 3 is a schematic structural block diagram of the system for automatically generating descriptions for pictures according to an embodiment of the present invention;

FIG. 4 is a schematic structural block diagram of an electronic device according to an embodiment of the present invention.

Best Mode for Carrying Out the Invention

The first aspect of the present invention provides a method for automatically generating descriptions for pictures, including: using a loss function, a mobilenet convolutional neural network, and a long short-term memory network to construct a model that automatically generates descriptions for pictures; recording the network parameters when constructing the model; inputting a picture into the model; and the model outputting, according to the input picture and the network parameters, a passage of text that describes the picture.

Further, using the loss function, mobilenet convolutional neural network, and long short-term memory network to construct the model includes: designing and constructing the loss function following the idea of reinforcement learning; using the pre-trained mobilenet convolutional neural network to extract the picture feature vector; introducing an attention mechanism via matrix transformation to fuse the feature vector with the original state vector extracted from the pre-trained long short-term memory network, obtaining a new feature vector; inputting the new feature vector into the long short-term memory network, which generates text with picture information according to the new feature vector; taking the derivative of the loss function and inputting the text into the differentiated loss function; judging whether the loss function converges after receiving the text; if the loss function converges, saving the network parameters at training time; if the loss function does not converge, continuing to use the mobilenet convolutional neural network, matrix transformation, and long short-term memory network to generate text with picture information and inputting the text into the differentiated loss function, until the loss function converges after the text is input.

Further, extracting the picture feature vector using the pre-trained MobileNet convolutional neural network comprises: after the picture is input into the pre-trained MobileNet network, saving the feature vector output by the network's average pooling layer.

A second aspect of the present invention provides a system for automatically generating descriptions of pictures, comprising: a model training module, configured to construct a model for automatically generating picture descriptions using a loss function, a MobileNet convolutional neural network, and an LSTM neural network; a network parameter recording module, configured to record the network parameters obtained when the model training module trains the model; a picture receiving module, configured to receive pictures input into the model trained by the model training module; and a text generation module, configured to cause the model to output, according to the picture received by the picture receiving module and the network parameters recorded by the network parameter recording module, a passage of text that describes the picture.

Further, the neural network pre-training module comprises: a neural network construction unit, configured to construct the MobileNet convolutional neural network; and a neural network parameter update unit, configured to pre-train the MobileNet network constructed by the neural network construction unit on an existing picture dataset, updating the parameters of the MobileNet network.

Further, the model training module comprises: a loss function construction unit, configured to design and construct a loss function based on the idea of reinforcement learning; a feature vector extraction unit, configured to extract the feature vectors of sample pictures using a pre-trained MobileNet convolutional neural network; an attention mechanism introduction unit, configured to introduce an attention mechanism via matrix transformations so as to fuse the feature vector with the original state vector extracted from a pre-trained LSTM network, obtaining a new feature vector; a text generation unit, configured to input the new feature vector obtained by the attention mechanism introduction unit into the LSTM network, which generates text with picture information according to the new feature vector; a loss function differentiation unit, configured to differentiate the loss function constructed by the loss function construction unit and feed the text into the differentiated loss function; a loss function convergence judgment unit, configured to judge whether the differentiated loss function converges; a network parameter saving unit, configured to save the network parameters from training after the loss function convergence judgment unit judges that the loss function has converged; and a loop unit, configured to, after the loss function convergence judgment unit judges that the loss function has not converged, continue generating text with picture information using the feature vector extraction unit, the attention mechanism introduction unit, and the text generation unit, and feed the text into the differentiated loss function obtained by the loss function differentiation unit, until the loss function converges after the text is fed in.

Further, the feature vector extraction unit comprises: a pooling layer output saving subunit, configured to save the feature vector output by the average pooling layer of the MobileNet convolutional neural network after the picture is input into the pre-trained network.

A third aspect of the present invention provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, any one of the methods described above is implemented.

A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, any one of the methods described above is implemented.

Embodiments of the Invention

To make the objectives, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Referring to Figure 1, a method for automatically generating descriptions of pictures comprises: S1, constructing a model for automatically generating picture descriptions using a loss function, a MobileNet convolutional neural network, and an LSTM neural network; S2, recording the network parameters obtained when constructing the model; S3, inputting a picture into the model; S4, the model outputting, according to the input picture and the network parameters, a passage of text that describes the picture.

Constructing the model for automatically generating picture descriptions using a loss function, a MobileNet convolutional neural network, and an LSTM neural network comprises: designing and constructing a loss function based on the idea of reinforcement learning; extracting picture feature vectors with a pre-trained MobileNet convolutional neural network; introducing an attention mechanism via matrix transformations to fuse the feature vector with the original state vector extracted from a pre-trained LSTM network, obtaining a new feature vector; inputting the new feature vector into the LSTM network, which generates text with picture information according to the new feature vector; differentiating the loss function and feeding the generated text into the differentiated loss function; judging whether the loss function converges after receiving the text; if it converges, saving the network parameters from training; if it does not converge, continuing to generate text with picture information using the MobileNet network, the matrix transformations, and the LSTM network, and feeding the text into the differentiated loss function, until the loss function converges after the text is fed in.
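As an illustrative sketch only (the names `train_until_convergence` and `toy_step` are hypothetical placeholders, not the patent's implementation), the generate-text / check-convergence loop described above can be written as:

```python
def train_until_convergence(one_training_pass, tol=1e-4, max_iters=1000):
    """Repeat one training pass (features -> attention -> LSTM -> loss)
    until the change in the loss falls below `tol`."""
    prev_loss = None
    for i in range(max_iters):
        loss = one_training_pass(i)
        if prev_loss is not None and abs(prev_loss - loss) < tol:
            return i, loss  # converged: here the network parameters would be saved
        prev_loss = loss
    raise RuntimeError("loss did not converge within max_iters")

# Toy stand-in whose loss shrinks like 1/(i+1); a real pass would run the
# MobileNet feature extractor, the attention fusion, and the LSTM decoder.
toy_step = lambda i: 1.0 / (i + 1)

iters, final_loss = train_until_convergence(toy_step, tol=1e-3)
print(iters, round(final_loss, 4))  # -> 32 0.0303
```

The convergence test on the loss difference stands in for the patent's "judge whether the loss function converges" step; any other stopping criterion could be substituted.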

Extracting picture feature vectors with the pre-trained MobileNet convolutional neural network comprises: after the picture is input into the pre-trained network, saving the feature vector output by its average pooling layer.
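To make the saved quantity concrete: the average pooling layer collapses each channel's DF × DF activation map to its mean, so the saved vector has one value per channel. A minimal pure-Python sketch of that pooling step (the rest of MobileNet is omitted; this is an illustration, not the patent's code):

```python
def global_average_pool(feature_map):
    """feature_map: a list of channels, each a DF x DF grid of activations.
    Returns the per-channel means -- the vector saved as the picture feature."""
    vector = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        vector.append(total / count)
    return vector

# Two channels of a 2x2 map pool down to a 2-dimensional feature vector.
fmap = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0: mean 4.0
        [[2.0, 2.0], [2.0, 2.0]]]   # channel 1: mean 2.0
print(global_average_pool(fmap))    # -> [4.0, 2.0]
```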

MobileNet focuses on compressing the model while preserving accuracy. Its idea is to factor a standard convolution into a depthwise convolution followed by an ordinary 1×1 convolution (also called a pointwise convolution) — in simple terms, a factorization of the convolution. The specific steps are shown in Figure 2. Suppose the input feature map has spatial size DF × DF with M channels, the filters have spatial size DK × DK with N output channels, and the stride is 1. The original convolution then requires DK·DK·M·N·DF·DF multiplications, with DK·DK·M·N kernel parameters. The factored convolution in MobileNet requires DK·DK·M·DF·DF + M·N·DF·DF multiplications, with DK·DK·M + M·N kernel parameters. Since convolution is mainly a process of reducing the spatial dimensions while increasing the channel dimension (N > M), and DK·DK·N > DK·DK + N for any DK ≥ 2, it follows that DK·DK·M·N > DK·DK·M + M·N. The depthwise-separable convolution therefore compresses both the model size and the computation substantially, making the model fast, cheap to compute, and accurate.
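The operation counts above can be checked numerically. A small sketch using the same symbols DF, DK, M, N as the text (the layer sizes below are an arbitrary example, not taken from the patent):

```python
def standard_conv_mults(DF, DK, M, N):
    # N filters, each DK x DK x M, applied at DF x DF output positions
    return DK * DK * M * N * DF * DF

def separable_conv_mults(DF, DK, M, N):
    depthwise = DK * DK * M * DF * DF  # one DK x DK filter per input channel
    pointwise = M * N * DF * DF        # 1x1 convolution mixing the channels
    return depthwise + pointwise

# Example layer: 14x14 map, 3x3 kernel, 256 input and 512 output channels.
DF, DK, M, N = 14, 3, 256, 512
ratio = separable_conv_mults(DF, DK, M, N) / standard_conv_mults(DF, DK, M, N)
print(f"separable / standard = {ratio:.3f}")  # -> separable / standard = 0.113
```

The ratio reduces algebraically to 1/N + 1/DK², so with a 3×3 kernel the separable form costs roughly one ninth of the standard convolution regardless of the channel counts.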

Using the new state vector as the input to the LSTM network's next-time-step state, and retraining the pre-trained LSTM network with reinforcement learning and backpropagation to obtain a new LSTM network, comprises: taking the sentence that the pre-trained LSTM network generates from the picture as the baseline, taking a sentence composed of randomly sampled words as the reward, and constructing the loss function as the baseline minus the reward; then training the LSTM network again via backpropagation using this loss function and the new state vector, obtaining a new LSTM network and updating its training parameters.

On the reinforcement learning method: the sequence generation problem is cast as a reinforcement learning problem:

Agent: the LSTM;

Environment state: the words and picture features;

Action: predicting the next word;

State: the states of the LSTM cell and hidden layer;

Reward: the CIDEr score.

The training objective is to minimize the negative expected reward L(θ):

L(θ) = −E_{ws∼pθ}[r(ws)] = −∑_{ws} r(ws) pθ(ws);

where ws = (ws1, …, wsT) is the generated sentence.

In practice, ws can be sampled according to the probabilities pθ (rather than always choosing the highest-probability word), so L(θ) can be approximated by a single sample:

L(θ) ≈ −r(ws), ws ∼ pθ;

The gradient of L with respect to θ is:

∇θL(θ) = −E_{ws∼pθ}[r(ws) ∇θ log pθ(ws)];

A baseline b is then introduced to reduce the variance:

∇θL(θ) = −E_{ws∼pθ}[(r(ws) − b) ∇θ log pθ(ws)];

The baseline can be any function, as long as it does not depend on the action ws; introducing it does not change the expected value of the gradient.
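As a one-step check (standard policy-gradient algebra, supplied here rather than in the original), the baseline term vanishes in expectation because the probabilities sum to 1:

```latex
\mathbb{E}_{w^{s}\sim p_{\theta}}\!\left[\,b\,\nabla_{\theta}\log p_{\theta}(w^{s})\,\right]
  = b \sum_{w^{s}} p_{\theta}(w^{s})\,\frac{\nabla_{\theta} p_{\theta}(w^{s})}{p_{\theta}(w^{s})}
  = b\,\nabla_{\theta} \sum_{w^{s}} p_{\theta}(w^{s})
  = b\,\nabla_{\theta} 1
  = 0,
```

provided b does not depend on the sampled sentence ws, which is exactly the condition stated above.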

In practice, the gradient ∇θL(θ) can be approximated with a single sample:

∇θL(θ) ≈ −(r(ws) − b) ∇θ log pθ(ws);

Applying the chain rule, the gradient with respect to the softmax input st at time step t can be expressed as (1_{wst} denotes the one-hot vector of the sampled word wst):

∂L(θ)/∂st ≈ (r(ws) − b)(pθ(wt|ht) − 1_{wst});

The self-critical reinforcement learning idea is to use the reward of the sequence generated by the current model at test time as the baseline, so the gradient becomes:

∂L(θ)/∂st ≈ (r(ws) − r(ŵ))(pθ(wt|ht) − 1_{wst});

where ŵ is the sentence obtained by greedy decoding at test time, ŵt = argmax_wt p(wt|ht), i.e. the highest-probability word is taken at every step; ws, by contrast, is sampled according to the probabilities: if the current highest-probability word has probability 60%, it is chosen with probability 60%, rather than with certainty as in greedy decoding.

The formula means: if the currently sampled word is better than the word generated at test time, then along that word's dimension the whole expression is negative (because the second factor, pθ(wt|ht) − 1, must be negative for the sampled word), so a gradient-descent step raises that word's score; for all other words the second factor is positive, so their scores are lowered.
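That sign behaviour can be seen directly in a single-time-step sketch (pure Python; the reward values are made-up stand-ins for CIDEr scores, not values from the patent):

```python
def scst_grad_step(probs, sampled_idx, r_sample, r_greedy):
    """Self-critical gradient w.r.t. the softmax input s_t:
    (r(ws) - r(w_hat)) * (p_theta - one_hot(sampled word))."""
    advantage = r_sample - r_greedy
    one_hot = [1.0 if i == sampled_idx else 0.0 for i in range(len(probs))]
    return [advantage * (p - oh) for p, oh in zip(probs, one_hot)]

probs = [0.6, 0.3, 0.1]  # current softmax over a toy 3-word vocabulary
grad = scst_grad_step(probs, sampled_idx=1, r_sample=0.9, r_greedy=0.5)
# The sampled word beat the greedy baseline (advantage +0.4), so its gradient
# component is negative: a gradient-descent step raises that word's score,
# while the positive components push the other words' scores down.
print([round(g, 2) for g in grad])  # -> [0.24, -0.28, 0.04]
```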

Referring to Figure 3, a system for automatically generating descriptions of pictures comprises: a model training module 1, a network parameter recording module 2, a picture receiving module 3, and a text generation module 4. The model training module 1 is configured to construct the model for automatically generating picture descriptions using a loss function, a MobileNet convolutional neural network, and an LSTM neural network; the network parameter recording module 2 is configured to record the network parameters obtained when the model training module 1 trains the model; the picture receiving module 3 is configured to receive pictures input into the model trained by the model training module 1; the text generation module 4 is configured to cause the model to output, according to the picture received by the picture receiving module 3 and the network parameters recorded by the network parameter recording module 2, a passage of text that describes the picture.

The model training module 1 comprises: a loss function construction unit, a feature vector extraction unit, an attention mechanism introduction unit, a text generation unit, a loss function differentiation unit, a loss function convergence judgment unit, a network parameter saving unit, and a loop unit. The loss function construction unit is configured to design and construct a loss function based on the idea of reinforcement learning; the feature vector extraction unit is configured to extract the feature vectors of sample pictures with a pre-trained MobileNet convolutional neural network; the attention mechanism introduction unit is configured to introduce an attention mechanism via matrix transformations to fuse the feature vector with the original state vector extracted from a pre-trained LSTM network, obtaining a new feature vector; the text generation unit is configured to input the new feature vector obtained by the attention mechanism introduction unit into the LSTM network, which generates text with picture information according to the new feature vector; the loss function differentiation unit is configured to differentiate the loss function constructed by the loss function construction unit and feed the text into the differentiated loss function; the loss function convergence judgment unit is configured to judge whether the differentiated loss function converges; the network parameter saving unit is configured to save the network parameters from training after the loss function convergence judgment unit judges that the loss function has converged; the loop unit is configured to, after the loss function convergence judgment unit judges that the loss function has not converged, continue generating text with picture information using the feature vector extraction unit, the attention mechanism introduction unit, and the text generation unit, and feed the text into the differentiated loss function obtained by the loss function differentiation unit, until the loss function converges after the text is fed in.

The feature vector extraction unit comprises a pooling layer output saving subunit, configured to save the feature vector output by the average pooling layer of the MobileNet convolutional neural network after the picture is input into the pre-trained network.

An embodiment of the present application provides an electronic device; referring to Figure 4, the electronic device comprises a memory 601, a processor 602, and a computer program stored in the memory 601 and executable on the processor 602. When the processor 602 executes the computer program, the method for automatically generating picture descriptions described above is implemented.

Further, the electronic device also comprises at least one input device 603 and at least one output device 604.

The memory 601, processor 602, input device 603, and output device 604 are connected via a bus 605.

The input device 603 may specifically be a camera, a touch panel, a physical key, a mouse, or the like. The output device 604 may specifically be a display screen.

The memory 601 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 601 is used to store a set of executable program code, and the processor 602 is coupled to the memory 601.

Further, an embodiment of the present application also provides a computer-readable storage medium, which may be provided in the electronic device of any of the above embodiments, and may be the memory 601 described above. The computer-readable storage medium stores a computer program which, when executed by the processor 602, implements the method for automatically generating picture descriptions described in the foregoing embodiments.

Further, the computer-readable storage medium may also be a USB flash drive, a removable hard disk, a read-only memory (ROM), a RAM, a magnetic disk, an optical disc, or any other medium capable of storing program code.

In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in actual implementations — for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.

The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated modules may be implemented in the form of hardware or in the form of software functional modules.

It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of action combinations; however, those skilled in the art will appreciate that the present invention is not limited by the described order of actions, because according to the present invention certain steps may be performed in a different order or simultaneously. Those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily all required by the present invention.

In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

The above describes the method, system, electronic device, and storage medium for automatically generating picture descriptions provided by the present invention. Those skilled in the art may, following the ideas of the embodiments of the present invention, make changes in both the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Industrial Applicability

The method, system, electronic device, and storage medium for automatically generating picture descriptions provided by the present invention solve the technical problem in the prior art that the trained model and the evaluation metric are not aligned.

Claims (8)

1. A method for automatically generating a description of a picture, characterized by comprising: constructing a model for automatically generating picture descriptions using a loss function, a MobileNet convolutional neural network, and a long short-term memory (LSTM) neural network; recording the network parameters obtained when constructing the model; inputting a picture into the model; and the model outputting, according to the input picture and the network parameters, a passage of text that describes the picture.

2. The method for automatically generating a description of a picture according to claim 1, characterized in that constructing the model for automatically generating picture descriptions using a loss function, a MobileNet convolutional neural network, and an LSTM neural network comprises: designing and constructing a loss function based on the idea of reinforcement learning; extracting a picture feature vector using a pre-trained MobileNet convolutional neural network; introducing an attention mechanism via matrix transformations to fuse the feature vector with the original state vector extracted from a pre-trained LSTM network, obtaining a new feature vector; inputting the new feature vector into the LSTM network, which generates text with picture information according to the new feature vector; differentiating the loss function and feeding the text into the differentiated loss function; judging whether the loss function converges after receiving the text; if the loss function converges, saving the network parameters from training; if the loss function does not converge, continuing to generate text with picture information using the MobileNet network, the matrix transformations, and the LSTM network, and feeding the text into the differentiated loss function, until the loss function converges after the text is fed in.

3. The method for automatically generating a description of a picture according to claim 2, characterized in that extracting the picture feature vector using the pre-trained MobileNet convolutional neural network comprises: after the picture is input into the pre-trained MobileNet network, saving the feature vector output by the network's average pooling layer.
4. A system for automatically generating descriptions of pictures, characterized by comprising: a model training module, configured to construct a model for automatically generating picture descriptions using a loss function, a MobileNet convolutional neural network, and an LSTM neural network; a network parameter recording module, configured to record the network parameters obtained when the model training module trains the model; a picture receiving module, configured to receive pictures input into the model trained by the model training module; and a text generation module, configured to cause the model to output, according to the picture received by the picture receiving module and the network parameters recorded by the network parameter recording module, a passage of text that describes the picture.
5. The system for automatically generating descriptions of pictures according to claim 4, characterized in that the model training module comprises: a loss function construction unit, configured to design and construct a loss function based on the idea of reinforcement learning; a feature vector extraction unit, configured to extract the feature vectors of sample pictures using a pre-trained MobileNet convolutional neural network; an attention mechanism introduction unit, configured to introduce an attention mechanism via matrix transformations so as to fuse the feature vector with the original state vector extracted from a pre-trained LSTM network, obtaining a new feature vector; a text generation unit, configured to input the new feature vector obtained by the attention mechanism introduction unit into the LSTM network, which generates text with picture information according to the new feature vector; a loss function differentiation unit, configured to differentiate the loss function constructed by the loss function construction unit and feed the text into the differentiated loss function; a loss function convergence judgment unit, configured to judge whether the differentiated loss function converges; a network parameter saving unit, configured to save the network parameters from training after the loss function convergence judgment unit judges that the loss function has converged; and a loop unit, configured to, after the loss function convergence judgment unit judges that the loss function has not converged, continue generating text with picture information using the feature vector extraction unit, the attention mechanism introduction unit, and the text generation unit, and feed the text into the differentiated loss function obtained by the loss function differentiation unit, until the loss function converges after the text is fed in.

6. The system for automatically generating descriptions of pictures according to claim 5, characterized in that the feature vector extraction unit comprises: a pooling layer output saving subunit, configured to save the feature vector output by the average pooling layer of the MobileNet convolutional neural network after the picture is input into the pre-trained network.

7. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, the method according to any one of claims 1 to 3 is implemented.

8. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, the method according to any one of claims 1 to 3 is implemented.
PCT/CN2019/078915 2019-03-20 2019-03-20 Automatic image description generation method and system, electronic device, and storage medium Ceased WO2020186484A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/078915 WO2020186484A1 (en) 2019-03-20 2019-03-20 Automatic image description generation method and system, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/078915 WO2020186484A1 (en) 2019-03-20 2019-03-20 Automatic image description generation method and system, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020186484A1 true WO2020186484A1 (en) 2020-09-24

Family

ID=72518888

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/078915 Ceased WO2020186484A1 (en) 2019-03-20 2019-03-20 Automatic image description generation method and system, electronic device, and storage medium

Country Status (1)

Country Link
WO (1) WO2020186484A1 (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180204112A1 (en) * 2014-11-14 2018-07-19 Google Llc Generating Natural Language Descriptions of Images
CN107665356A (en) * 2017-10-18 2018-02-06 北京信息科技大学 A kind of image labeling method
CN108052512A (en) * 2017-11-03 2018-05-18 同济大学 A kind of iamge description generation method based on depth attention mechanism
CN108898639A (en) * 2018-05-30 2018-11-27 湖北工业大学 A kind of Image Description Methods and system

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036192A (en) * 2020-09-25 2020-12-04 北京小米松果电子有限公司 Ancient poetry generating method, device and storage medium
CN112765339A (en) * 2021-01-21 2021-05-07 山东师范大学 Personalized book recommendation method and system based on reinforcement learning
CN112801234A (en) * 2021-04-12 2021-05-14 中国人民解放军国防科技大学 Image poetry description generation method, device and equipment based on neural network
CN113342168A (en) * 2021-06-10 2021-09-03 中国水利水电第七工程局有限公司 Multi-mode intelligent large-scale equipment mounting and dismounting training system
CN113342168B (en) * 2021-06-10 2023-09-22 中国水利水电第七工程局有限公司 Multi-mode intelligent large-scale equipment installation and disassembly training system
CN113780350A (en) * 2021-08-10 2021-12-10 上海电力大学 Image description method based on ViLBERT and BiLSTM
CN113780350B (en) * 2021-08-10 2023-12-19 上海电力大学 ViLBERT and BiLSTM-based image description method
CN114049475A (en) * 2021-08-30 2022-02-15 成都华栖云科技有限公司 Image processing method and system based on automatic test framework of AI mobile terminal
CN115936019A (en) * 2022-10-13 2023-04-07 北京中科睿鉴科技有限公司 Controllable news image description generation method based on man-machine cooperation
CN115508830A (en) * 2022-10-19 2022-12-23 西安电子科技大学 Electromagnetic target intelligent identification method and system based on feature fusion
CN116933773A (en) * 2023-07-10 2023-10-24 深圳云天励飞技术股份有限公司 Road disease interface abstract generation method and device, electronic equipment and storage medium
CN118690643A (en) * 2024-06-18 2024-09-24 深圳市嘉年印务有限公司 Intelligent generation method, device, computer equipment and storage medium for packaging design

Similar Documents

Publication Publication Date Title
WO2020186484A1 (en) Automatic image description generation method and system, electronic device, and storage medium
CN115438176B (en) Method and equipment for generating downstream task model and executing task
CN116894880A A training method, model, device and electronic device for a text-to-image generation model
US20250005053A1 (en) Answer feedback method and apparatus applied to large language model
US12386891B2 (en) Information search method and device, electronic device, and storage medium
CN109313650A (en) Generate responses in automated chats
CN109978139B (en) Method, system, electronic device and storage medium for automatically generating description of picture
CN114913590B (en) Data emotion recognition method, device and equipment and readable storage medium
CN110458242A (en) An image description generation method, apparatus, device and readable storage medium
CN112949818A (en) Model distillation method, device, equipment and storage medium
CN118354164B (en) Video generation method, electronic device and computer readable storage medium
CN115952317A (en) Video processing method, device, equipment, medium and program product
US20230214688A1 (en) Method, Apparatus for Determining Answer to Question, Device, Storage Medium and Program Product
CN113342948A (en) Intelligent question and answer method and device
CN115496734A (en) Video content quality evaluation method, network training method and device
WO2025176122A1 (en) Answer generation
CN117332068A (en) Human-computer interaction methods, devices, electronic equipment and storage media
CN118861218A (en) Sample data generation method, device, electronic device and storage medium
WO2025067533A1 (en) Text processing method and apparatus, and computer device, storage medium and computer program product
CN119205988A (en) Image generation method, device, electronic device and medium
CN109299231B (en) Dialog state tracking method, system, electronic device and storage medium
CN108874789B (en) Statement generation method, device, storage medium and electronic device
CN111931510A (en) Intention identification method and device based on neural network and terminal equipment
CN115292467B (en) Information processing and model training methods, devices, equipment, media and program products
EP4322066A1 (en) Method and apparatus for generating training data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19920299

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as the address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 04/02/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19920299

Country of ref document: EP

Kind code of ref document: A1