CN112116685A - Image caption generation method with multi-attention fusion network based on multi-granularity reward mechanism - Google Patents
- Publication number
- CN112116685A (application number CN202010974467.1A)
- Authority
- CN
- China
- Prior art keywords
- reward
- network
- word
- image
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING OR CALCULATING; COUNTING
  - G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; G06T11/00—2D [Two Dimensional] image generation
    - G06T11/60—Editing figures and text; Combining figures or text
  - G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks; G06N3/04—Architecture, e.g. interconnection topology
    - G06N3/044—Recurrent networks, e.g. Hopfield networks
    - G06N3/045—Combinations of networks
    - G06N3/047—Probabilistic or stochastic networks
    - G06N3/048—Activation functions
    - G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    - G06N3/08—Learning methods
- H—ELECTRICITY; H04—ELECTRIC COMMUNICATION TECHNIQUE; H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
  - H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]; H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; H04N21/47—End-user applications; H04N21/488—Data services, e.g. news ticker
    - H04N21/4884—Data services, e.g. news ticker for displaying subtitles
  - H04N5/00—Details of television systems; H04N5/222—Studio circuitry; Studio devices; Studio equipment; H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects
    - H04N5/278—Subtitling
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an image caption generation method using a multi-attention fusion network based on a multi-granularity reward mechanism, which addresses the problem that, in image caption generation methods based on a reinforcement learning reward mechanism, each generated word has a different importance. For the first time, the invention proposes a multi-attention fusion network based on a multi-granularity reward mechanism for image caption generation; it comprises a multi-attention fusion model, a word importance re-evaluation network, and a label retrieval network. The multi-attention fusion model serves as the baseline of the reinforcement-learning-based image captioning method; the word importance re-evaluation network is used for reward re-evaluation by estimating the different importance of each word in the generated caption; and the label retrieval network retrieves the corresponding ground-truth labels from a batch of captions as a retrieval reward. The network is then trained to maximize these rewards so as to generate better captions. Extensive experiments on the MSCOCO dataset verify the invention and yield highly competitive evaluation results.
Description
Technical Field
The invention belongs to the field of automatic image caption generation and relates to the technical fields of computer vision and natural language processing.
Background Art
The goal of image captioning is to automatically generate a natural language description of a given image. This task currently faces substantial challenges: on the one hand, the computer must comprehensively understand the image content from multi-level visual features; on the other hand, the caption generation algorithm must progressively refine coarse semantic concepts into human-like natural language descriptions. In recent years, advances in deep learning techniques, including attention mechanisms and reinforcement learning, have significantly improved the quality of caption generation, with the encoder-decoder framework becoming the mainstream approach. Vinyals et al. generated captions from spatially pooled CNN feature maps, compressing the entire image into a static representation; later work used an attention mechanism that learns to adaptively focus on image regions to improve caption quality, but a single LSTM served as both the visual information processor and the language generator, so the language generator was weakened by simultaneously handling visual processing. Peter Anderson et al. proposed a top-down architecture with two independent LSTM layers: the first LSTM layer acts as a top-down visual attention model, and the second acts as the language generator. All of the image captioning methods mentioned above use only the high-level visual features of the last convolutional layer of a CNN as the image encoder and ignore low-level visual features, which in fact also help in understanding the image. Because multi-level features are complementary, fusing them can further improve image captioning; however, early fusion methods were not very effective, and how to integrate multi-level visual features into a captioning model is a question worth considering. Image captioning models are typically trained by maximizing the likelihood under a cross-entropy (XE) objective, which makes them sensitive to unusual captions rather than optimizing toward the human consensus on suitable captions for stable output. In addition, captioning models are usually evaluated by computing metrics such as BLEU, ROUGE, METEOR, and CIDEr on a test set. The mismatch between the training objective and the evaluation metrics adversely affects image captioning models; this problem can be addressed with reinforcement learning (RL), e.g. Policy Gradient and Actor-Critic methods. Reinforcement learning can optimize non-differentiable sequence-based evaluation metrics; using a Policy Gradient method, the authors of SCST applied CIDEr as the reward, producing captions more consistent with human language consensus.
In SCST, every word receives the same reward as its gradient weight. However, not all words in a sentence should receive an equal reward; different words may have different importance. Yu et al. used Monte Carlo rollouts in SeqGAN to estimate the importance of each word, but this requires generating a large number of sentences, which leads to expensive time complexity. Based on the Actor-Critic strategy, Dzmitry Bahdanau et al. adopted a value estimation network to evaluate words, but the evaluation metrics (e.g., CIDEr, BLEU) cannot be optimized directly. The invention proposes to use word-level rewards to optimize an RL-trained image captioning model, aiming to address the different importance of each generated word.
Using evaluation metrics (e.g., CIDEr, BLEU) as reward signals is an intuitive way to generate more human-like captions in RL training; however, these metrics are not the only criteria for judging the quality of generated captions. The quality of a generated caption can also be assessed by whether it can retrieve its corresponding labels in a retrieval system. From the perspective of information utilization, the traditional CIDEr reward makes full use of the matched label information, whereas a retrieval reward benefits from the additional label information; the retrieval loss can therefore also be used as a reward.
The invention proposes a hierarchical attention fusion (HAF) model for image captioning, which integrates the multi-level feature maps of a ResNet with hierarchical attention and serves as the baseline of the RL-based captioning method. In addition, multi-granularity rewards are introduced in the RL stage to refine the proposed HAF. Specifically, a word importance re-evaluation network (REN) is used for reward re-evaluation by estimating the different importance of each word in the generated caption; the re-evaluated reward is obtained by weighting the CIDEr score with weights computed by the REN, and can be regarded as a word-level reward. To benefit from the additional labels, a label retrieval network (RN) is implemented to retrieve the corresponding labels from a batch of captions as a retrieval reward, which can be regarded as a sentence-level reward.
Summary of the Invention
The purpose of the invention is to solve the problem that, in image caption generation methods based on a reinforcement learning reward mechanism, each generated word has a different importance, so as to produce sentences that better match human language consensus: not all words in a sentence should receive an equal reward, and different words may have different importance.
The technical solution adopted by the invention to solve the above technical problem is:
S1. Build a multi-attention fusion model.
S2. Build a word importance re-evaluation network based on a reinforcement learning reward mechanism.
S3. Build a label retrieval network in combination with the reinforcement learning reward mechanism.
S4. Combine the model from S1, the word importance re-evaluation network from S2, and the label retrieval network from S3 to build a multi-attention fusion network architecture based on a multi-granularity reward mechanism.
S5. Train the multi-attention fusion network based on the multi-granularity reward mechanism and generate captions.
The multi-attention fusion model (HAF) serves as the baseline for RL training of image captioning. It attends to the hierarchical visual features of the CNN and makes full use of multi-level visual information: in addition to using the last-layer convolutional representation of the image and a single attention model that focuses on a specific region of the image at each time step, we also fuse attention models for captioning and feed the attention-derived image features into the cell nodes of the language LSTM. We adopt a classical network structure that produces normalized attention weights α_t from the LSTM hidden state h_t at each time step t; α_t is used to attend over the different spatial locations Att of the image features to form the final image representation A:
α_t = softmax(a_t)    (2)
where W_a and U_a are learned parameters.
Here, h_2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence. The process of producing h_2 can be given as follows:
Finally, the probability of the output word is given by a nonlinear softmax function. A code sketch of the attention step is given below.
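A minimal PyTorch sketch of the attention step in formula (2): the attention weights α_t are computed from the LSTM hidden state h_t and a spatial feature map, and are then used to pool the features into the attended image representation A. The layer names and dimensions are illustrative assumptions; HAF applies one such module per feature level (e.g. conv4 and conv5) and feeds the fused result to the language LSTM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=512, att_dim=512):
        super().__init__()
        self.W_a = nn.Linear(feat_dim, att_dim)    # projects the spatial features
        self.U_a = nn.Linear(hidden_dim, att_dim)  # projects the LSTM hidden state h_t
        self.w = nn.Linear(att_dim, 1)             # scores each spatial location

    def forward(self, feats, h_t):
        # feats: (batch, num_regions, feat_dim); h_t: (batch, hidden_dim)
        a_t = self.w(torch.tanh(self.W_a(feats) + self.U_a(h_t).unsqueeze(1))).squeeze(-1)
        alpha_t = F.softmax(a_t, dim=1)                  # formula (2)
        A = (alpha_t.unsqueeze(-1) * feats).sum(dim=1)   # attended image representation A
        return A, alpha_t
```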
The word importance re-evaluation network (REN) is built on the reinforcement learning reward mechanism and re-evaluates the metric-based reward by automatically estimating the importance of the different words in the generated caption. First, the REN takes the generated sentence S as input; the sentence is then processed by an RNN equipped with an attention network and an average pooling layer, and the attended sentence embedding is concatenated with the pooled sentence embedding as a comprehensive representation of the generated caption. Two fully connected layers followed by a sigmoid transform then produce the weights W_t of the different words. In particular, the captioning model pre-trained with the CIDEr reward (the rl-model) serves as the baseline b, which significantly reduces variance without changing the expected gradient. We construct the word-level reward Wr_t as in formula (10), so that only samples from the model that outperform the current test model (the rl-model) are given positive weights, while inferior samples are suppressed. Mathematically, the loss function can be formalized as formula (11), with the word-level reward given by formula (10):
Wr_t = R·W_t + (R - b)    (10)
where W_i is the output weight of the REN and θ denotes the parameters of the image captioning network; the remaining symbols denote the different words of the generated sentence.
To exploit the metric-based reward (CIDEr) and constrain the sentence space, word-level rewards are used to fine-tune the captioning network after the CIDEr optimization. In addition, to optimize the REN simultaneously, we define the update of the REN as another RL process with reward R - b. We observe that R - b is too small and leads to weak gradients for the REN, so a hyperparameter γ is set to amplify the gradient. Similarly, the REN can be updated by the reinforcement learning algorithm with a corresponding loss function; a sketch of the word-level reward follows.
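A minimal PyTorch sketch of the word-level reward described above. It assumes a simple REN built from a GRU, a one-layer attention, mean pooling, two fully connected layers, and a sigmoid, and implements Wr_t = R·W_t + (R - b) from formula (10); the exact layer sizes and the GRU choice are illustrative assumptions rather than the patent's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardReEvaluationNet(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.att = nn.Linear(hidden_dim, 1)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, word_ids):
        # word_ids: (batch, T) indices of the sampled caption
        h, _ = self.rnn(self.embed(word_ids))              # (batch, T, hidden)
        alpha = F.softmax(self.att(h), dim=1)              # attention over words
        attended = alpha * h                               # attended per-word states
        pooled = h.mean(dim=1, keepdim=True).expand_as(h)  # mean-pooled sentence vector
        W_t = torch.sigmoid(self.fc(torch.cat([attended, pooled], dim=-1)))
        return W_t.squeeze(-1)                             # weights W_t in (0, 1), shape (batch, T)

def word_level_reward(R, b, W_t):
    # formula (10): Wr_t = R * W_t + (R - b), with R the CIDEr score of the
    # sampled caption and b the score of the greedy rl-model baseline
    return R * W_t + (R - b)
```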
The label retrieval network (RN) is also built on the reinforcement learning reward mechanism. To strengthen the metric-based reward (CIDEr) and make use of both the matched labels and the other unmatched labels, a label retrieval network is introduced so that a generated caption should match its corresponding labels. Following the cross-media retrieval method proposed by Fartash Faghri et al., we construct a sentence retrieval model with two LSTM networks. First, the RN is pre-trained to convergence on the different labels of the images (each image has five different labels); we encode the labels and the generated captions into features in the same embedding space of the RN:
s_i = LSTM(C_i)    (13)
g_j = LSTM(G_j)    (14)
where C and G denote the generated captions and the labels, and s_i and g_j denote their respective embedded features. The similarity between s and g is computed with the cosine similarity:
The score of a matched pair is specified to be higher than that of any unmatched pair; the loss of the RN is computed with a hinge loss:
where the former denotes a correctly matched pair and the latter an unmatched pair. The hinge loss, combined with CIDEr, serves as the sentence-level reward in RL training, which encourages the captioning model to generate captions that best match the given labels.
Formula (17) is the loss function used to optimize the captioning model with the sentence-level reward, where β is a hyperparameter that balances the hinge loss and CIDEr. It is worth noting that retrieval is performed within each mini-batch, since retrieving over the entire dataset is time-consuming.
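A hedged PyTorch sketch of the sentence-level retrieval reward: captions and labels are assumed to have already been encoded by the two LSTMs of formulas (13)-(14), their cosine similarities are computed, and a hinge loss over in-batch negatives penalizes unmatched pairs that score higher than the matched pair. The in-batch negative sampling and the default margin are assumptions consistent with the description above, not the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def retrieval_hinge_loss(cap_emb, label_emb, margin=0.2):
    # cap_emb, label_emb: (batch, dim) sentence embeddings from the two LSTM encoders
    cap_emb = F.normalize(cap_emb, dim=-1)
    label_emb = F.normalize(label_emb, dim=-1)
    sim = cap_emb @ label_emb.t()                   # cosine similarities S_ij
    pos = sim.diag().unsqueeze(1)                   # scores of matched caption/label pairs
    cost_c = (margin + sim - pos).clamp(min=0)      # a caption ranks an unmatched label too high
    cost_l = (margin + sim - pos.t()).clamp(min=0)  # a label ranks an unmatched caption too high
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_c = cost_c.masked_fill(mask, 0)
    cost_l = cost_l.masked_fill(mask, 0)
    return cost_c.sum(dim=1) + cost_l.sum(dim=0)    # per-sample hinge loss

# The negative of this loss, scaled by the hyperparameter beta and combined with
# CIDEr, can serve as the sentence-level reward during RL training.
```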
The multi-attention fusion network based on the multi-granularity reward mechanism proposed by the invention comprises a multi-attention fusion model (HAF), a word importance re-evaluation network (REN), and a label retrieval network (RN).
Finally, the training method of the multi-attention fusion network based on the multi-granularity reward mechanism is as follows:
All models are pre-trained with the cross-entropy loss and then trained to maximize the different RL rewards. The encoder uses a pre-trained ResNet-101 to obtain image representations; for each image, we extract the outputs of the conv4 and conv5 convolutional layers of the ResNet and map them to 1024-dimensional vectors as the input of the HAF. For the HAF, the image feature embedding dimension, the LSTM hidden state dimension, and the word embedding dimension are all set to 512. The baseline model is trained under the XE objective using the ADAM optimizer with an initial learning rate of 10^-4. At each epoch we evaluate the model and take the best CIDEr as the baseline score. Reinforcement training starts from the 30th epoch to optimize the CIDEr metric with a learning rate of 10^-5.
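A brief sketch of the multi-level feature extraction described above, assuming a torchvision ResNet-101 whose layer3 (conv4_x) and layer4 (conv5_x) outputs are projected to 1024-dimensional region features for the HAF; the projection layers and all names here are illustrative assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet101, ResNet101_Weights

class MultiLevelEncoder(nn.Module):
    def __init__(self, out_dim=1024):
        super().__init__()
        backbone = resnet101(weights=ResNet101_Weights.IMAGENET1K_V1)
        # layer3 yields the conv4_x feature map (1024 channels),
        # layer4 yields the conv5_x feature map (2048 channels)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1, backbone.layer2)
        self.conv4 = backbone.layer3
        self.conv5 = backbone.layer4
        self.proj4 = nn.Linear(1024, out_dim)
        self.proj5 = nn.Linear(2048, out_dim)

    def forward(self, images):
        # images: (batch, 3, H, W)
        x = self.stem(images)
        f4 = self.conv4(x)                     # (batch, 1024, h4, w4)
        f5 = self.conv5(f4)                    # (batch, 2048, h5, w5)
        f4 = f4.flatten(2).transpose(1, 2)     # (batch, h4*w4, 1024)
        f5 = f5.flatten(2).transpose(1, 2)     # (batch, h5*w5, 2048)
        return self.proj4(f4), self.proj5(f5)  # two sets of 1024-d region features
```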
In the word-level reward training phase, the image captioning model is first trained with the CIDEr reward for 20 epochs and then with the word-level reward for 10 epochs. For sentence-level reward training, the RN is pre-trained for 10 epochs on the different labels of each image. The word embedding and LSTM hidden sizes are set to 512, the joint embedding size is set to 1024, and the margin hyperparameter α is set to 0.2. In addition, the captioning model used as the baseline b is trained with cross-entropy for 30 epochs, and the number of epochs for sentence-level reward training is set to 30.
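For clarity, a minimal sketch of the SCST-style policy-gradient update used in the RL stages above. The advantage may be the plain CIDEr score minus the greedy rl-model score, the REN-weighted word-level reward Wr_t, or CIDEr combined with the β-weighted retrieval reward; the function name and signature are assumptions, not the patent's notation.

```python
def policy_gradient_loss(log_probs, advantage, mask):
    # log_probs: (batch, T) log-probabilities of the sampled words
    # advantage: (batch, 1) sentence-level or (batch, T) word-level advantage,
    #            e.g. CIDEr(sample) - CIDEr(greedy), Wr_t, or CIDEr + beta * retrieval reward
    # mask:      (batch, T) 1 for real words, 0 for padding
    return -(advantage.detach() * log_probs * mask).sum() / mask.sum()

# Training schedule stated above: XE pre-training (ADAM, lr 1e-4); SCST with the
# CIDEr reward from epoch 30 (lr 1e-5); 20 epochs of CIDEr reward followed by
# 10 epochs of word-level reward fine-tuning; RN pre-trained for 10 epochs,
# then 30 epochs of sentence-level reward training.
```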
Compared with the prior art, the beneficial effects of the invention are:
1. The invention proposes the hierarchical attention fusion (HAF) model as the baseline for RL training of image captioning. The HAF attends to the hierarchical visual features of the CNN multiple times and can make full use of multi-level visual information.
2. The invention proposes the word importance re-evaluation network (REN) to facilitate the computation of the re-evaluated reward; during the RL training phase it automatically assigns different importance to the words generated in a sentence.
3. The invention proposes the label retrieval network (RN) to obtain a sentence-level retrieval reward. The RN drives the generated captions to match their corresponding labels rather than other sentences.
Description of the Drawings
Figure 1 is a schematic diagram of the structure of the multi-attention fusion network based on the multi-granularity reward mechanism.
Figure 2 is a schematic diagram of the hierarchical attention fusion (HAF) model.
Figure 3 is a schematic diagram of the structure of the word importance re-evaluation network (REN).
Figure 4 is a schematic diagram of the structure of the label retrieval network (RN).
Figure 5 compares captions generated by the multi-attention fusion network based on the multi-granularity reward mechanism with captions generated by the top-down method, captions generated by the hierarchical attention fusion model alone, and the ground-truth captions.
Detailed Description of the Embodiments
The drawings are for illustration only and should not be construed as limiting the patent.
The invention is further described below with reference to the accompanying drawings and embodiments.
Figure 1 is a schematic diagram of the structure of the multi-attention fusion network based on the multi-granularity reward mechanism. As shown in Figure 1, the network produces a word-level reward and a sentence-level reward: on the left, the word-level reward is produced by adaptively re-evaluating the importance of each word; on the right, the sentence-level reward is formed from the retrieval loss, which is computed from the retrieval similarity S.
Figure 2 is a schematic diagram of the hierarchical attention fusion (HAF) model. As shown in Figure 2, the mean features of conv4 and conv5 are used, X is the one-hot encoding of the input word, and E is the word embedding matrix of the vocabulary. We adopt a classical network structure that produces normalized attention weights α_t from the LSTM hidden state h_t at each time step t; α_t is used to attend over the different spatial locations Att of the image features to form the final image representation A:
α_t = softmax(a_t)    (2)
where W_a and U_a are learned parameters.
Here, h_2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence. The process of producing h_2 can be given as follows:
Finally, the probability of the output word is given by a nonlinear softmax function:
Figure 3 is a schematic diagram of the structure of the word importance re-evaluation network (REN). As shown in Figure 3, the REN embeds the generated sentence and outputs the reward weights W; S denotes the sigmoid function, and rl-model is the captioning model pre-trained with the CIDEr reward. First, the REN takes the generated sentence S as input; the sentence is then processed by an RNN equipped with an attention network and an average pooling layer, and the attended sentence embedding is concatenated with the pooled sentence embedding as a comprehensive representation of the generated caption. Two fully connected layers followed by a sigmoid transform then produce the weights W_t of the different words. Mathematically, the loss function can be formalized as formula (11), with the word-level reward given by formula (10):
Wr_t = R·W_t + (R - b)    (10)
where W_i is the output weight of the REN and θ denotes the parameters of the image captioning network; the remaining symbols denote the different words of the generated sentence.
To exploit the metric-based reward (CIDEr) and constrain the sentence space, word-level rewards are used to fine-tune the captioning network after the CIDEr optimization. In addition, to optimize the REN simultaneously, we define the update of the REN as another RL process with reward R - b. We observe that R - b is too small and leads to weak gradients for the REN, so a hyperparameter γ is set to amplify the gradient. Similarly, the REN can be updated by the reinforcement learning algorithm with the following loss function:
Figure 4 is a schematic diagram of the structure of the label retrieval network (RN). As shown in Figure 4, text-to-text retrieval over the matched and unmatched labels forms the sentence-level reward for RL training; we encode the labels and the generated captions into features in the same embedding space of the RN:
s_i = LSTM(C_i)    (13)
g_j = LSTM(G_j)    (14)
where C and G denote the generated captions and the labels, and s_i and g_j denote their respective embedded features. The similarity between s and g is computed with the cosine similarity:
The score of a matched pair is specified to be higher than that of any unmatched pair; the loss of the RN is computed with a hinge loss:
where the former denotes a correctly matched pair and the latter an unmatched pair. The hinge loss, combined with CIDEr, serves as the sentence-level reward in RL training, which encourages the captioning model to generate captions that best match the given labels.
Formula (17) is the loss function used to optimize the captioning model with the sentence-level reward, where β is a hyperparameter that balances the hinge loss and CIDEr. It is worth noting that retrieval is performed within each mini-batch, since retrieving over the entire dataset is time-consuming.
Figure 5 compares captions generated by the multi-attention fusion network based on the multi-granularity reward mechanism with captions generated by the top-down method, captions generated by the hierarchical attention fusion model alone, and the ground-truth captions. As shown in Figure 5, the sentences generated by the multi-attention fusion network based on the multi-granularity reward mechanism are more accurate and more human-like than those of the other models in the figure.
The invention proposes a word importance re-evaluation network and a label retrieval network based on the reinforcement learning reward mechanism, and on this basis proposes an image caption generation method using a multi-attention fusion network based on a multi-granularity reward mechanism. The network framework comprises a multi-attention fusion model (HAF), a word importance re-evaluation network (REN), and a label retrieval network (RN). The hierarchical attention fusion (HAF) model serves as the baseline for RL training of image captioning; it attends to the hierarchical visual features of the CNN multiple times and makes full use of multi-level visual information. Meanwhile, the word importance re-evaluation network (REN) facilitates the computation of the re-evaluated reward and automatically assigns different importance to the words generated in a sentence during the RL training phase. The label retrieval network (RN) encourages the generated captions to match their corresponding labels rather than other sentences. After training, the generated image captions are accurate and fluent and reflect the content of the images well.
Finally, the details of the above examples are given only to illustrate the invention; for those skilled in the art, any modification, improvement, or replacement of the above embodiments shall fall within the protection scope of the claims of the invention.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010974467.1A CN112116685A (en) | 2020-09-16 | 2020-09-16 | Image caption generation method with multi-attention fusion network based on multi-granularity reward mechanism |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010974467.1A CN112116685A (en) | 2020-09-16 | 2020-09-16 | Image caption generation method with multi-attention fusion network based on multi-granularity reward mechanism |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN112116685A (en) | 2020-12-22 |
Family
ID=73803138
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010974467.1A Pending CN112116685A (en) | 2020-09-16 | 2020-09-16 | Image caption generation method with multi-attention fusion network based on multi-granularity reward mechanism |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112116685A (en) |
Patent Citations (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170098153A1 (en) * | 2015-10-02 | 2017-04-06 | Baidu Usa Llc | Intelligent image captioning |
| US20170127016A1 (en) * | 2015-10-29 | 2017-05-04 | Baidu Usa Llc | Systems and methods for video paragraph captioning using hierarchical recurrent neural networks |
| US10467274B1 (en) * | 2016-11-10 | 2019-11-05 | Snap Inc. | Deep reinforcement learning-based captioning with embedding reward |
| US20180143966A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Spatial Attention Model for Image Captioning |
| CN107608943A (en) * | 2017-09-08 | 2018-01-19 | 中国石油大学(华东) | Merge visual attention and the image method for generating captions and system of semantic notice |
| CN109711464A (en) * | 2018-12-25 | 2019-05-03 | 中山大学 | Image Description Method Based on Hierarchical Feature Relation Graph |
| KR20200104663A (en) * | 2019-02-27 | 2020-09-04 | 한국전력공사 | System and method for automatic generation of image caption |
| KR20200106115A (en) * | 2019-02-27 | 2020-09-11 | 한국전력공사 | Apparatus and method for automatically generating explainable image caption |
| CN110135567A (en) * | 2019-05-27 | 2019-08-16 | 中国石油大学(华东) | Image Caption Generation Method Based on Multi-Attention Generative Adversarial Network |
| CN110347860A (en) * | 2019-07-01 | 2019-10-18 | 南京航空航天大学 | Depth image based on convolutional neural networks describes method |
| CN110473267A (en) * | 2019-07-12 | 2019-11-19 | 北京邮电大学 | Social networks image based on attention feature extraction network describes generation method |
| US10699129B1 (en) * | 2019-11-15 | 2020-06-30 | Fudan University | System and method for video captioning |
| CN111046966A (en) * | 2019-12-18 | 2020-04-21 | 江南大学 | Image caption generation method based on metric attention mechanism |
Non-Patent Citations (3)
| Title |
|---|
| CHUNLEI WU ET AL: "Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards", IEEE Access, pages 57943-57951 * |
| DU HAIJUN ET AL: "Image caption generation method incorporating constraint learning" (融合约束学习的图像字幕生成方法), Journal of Image and Graphics, pages 0333-0342 * |
| YUAN SHAOZU ET AL: "Video scene recognition based on multi-granularity video information and attention mechanism" (基于多粒度视频信息和注意力机制的视频场景识别), Computer Systems & Applications, pages 252-256 * |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113408430A (en) * | 2021-06-22 | 2021-09-17 | 哈尔滨理工大学 | Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework |
| CN113408430B (en) * | 2021-06-22 | 2022-09-09 | 哈尔滨理工大学 | Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework |
| CN113918754A (en) * | 2021-11-01 | 2022-01-11 | 中国石油大学(华东) | Image caption generation method based on scene graph update and feature stitching |
| CN114090815A (en) * | 2021-11-12 | 2022-02-25 | 海信电子科技(武汉)有限公司 | An image description model training method and training device |
| CN116501859A (en) * | 2023-06-26 | 2023-07-28 | 中国海洋大学 | Paragraph retrieval method, device and medium based on refrigerator field |
| CN116501859B (en) * | 2023-06-26 | 2023-09-01 | 中国海洋大学 | Paragraph retrieval method, equipment and medium based on refrigerator field |
| CN120281862A (en) * | 2025-04-02 | 2025-07-08 | 中国石油大学(华东) | Underwater image subtitle generation method and system based on multi-modal information fusion |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN115310560B (en) | A multimodal sentiment classification method based on modal space assimilation and contrastive learning | |
| CN114842267B (en) | Image classification method and system based on label noise domain adaptation | |
| CN112116685A (en) | Image caption generation method with multi-attention fusion network based on multi-granularity reward mechanism | |
| WO2022057669A1 (en) | Method for pre-training knowledge graph on the basis of structured context information | |
| US11948387B2 (en) | Optimized policy-based active learning for content detection | |
| CN113157919B (en) | Sentence Text Aspect-Level Sentiment Classification Method and System | |
| CN111062489A (en) | Knowledge distillation-based multi-language model compression method and device | |
| CN111401928B (en) | Method and device for determining semantic similarity of text based on graph data | |
| CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
| CN116109978B (en) | Unsupervised video description method based on self-constrained dynamic text features | |
| CN111402365B (en) | Method for generating picture from characters based on bidirectional architecture confrontation generation network | |
| CN117058673A (en) | Text generation image model training method and system and text generation image method and system | |
| CN112561064A (en) | Knowledge base completion method based on OWKBC model | |
| Yu et al. | Cgt-gan: Clip-guided text gan for image captioning | |
| CN117370736A (en) | Fine granularity emotion recognition method, electronic equipment and storage medium | |
| CN110188819A (en) | A high-level semantic understanding method for CNN and LSTM images based on information gain | |
| CN113312919A (en) | Method and device for generating text of knowledge graph | |
| CN114881125A (en) | Label-noisy image classification method based on graph consistency and semi-supervised model | |
| CN113408430B (en) | Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework | |
| CN113140023A (en) | Text-to-image generation method and system based on space attention | |
| WO2024159132A1 (en) | Lifelong pretraining of mixture-of-experts neural networks | |
| CN118427608A (en) | Multi-modal image language model combined prompt learning method and device | |
| CN118334463A (en) | Adversarial knowledge distillation method, system and device for visual language model | |
| CN118298422A (en) | Nuclear detection method of pathological image based on visual language big model | |
| CN116595222A (en) | Short video multi-label classification method and device based on multi-modal knowledge distillation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20201222 |
| | WD01 | Invention patent application deemed withdrawn after publication | |