
CN112116685A - Image caption generation method with multi-attention fusion network based on multi-granularity reward mechanism

Info

Publication number: CN112116685A
Application number: CN202010974467.1A
Authority: CN (China)
Prior art keywords: reward, network, word, image, sentence
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王雷全, 袁韶祖, 段海龙, 吴杰, 路静, 吴春雷
Current Assignee: China University of Petroleum East China
Original Assignee: China University of Petroleum East China
Application filed by China University of Petroleum East China
Priority to CN202010974467.1A
Publication of CN112116685A


Classifications

    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/60 Editing figures and text; Combining figures or text
    • G06N 3/02 Neural networks
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/048 Activation functions
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods
    • H04N 21/4884 Data services, e.g. news ticker, for displaying subtitles
    • H04N 5/278 Subtitling


Abstract

The invention discloses an image caption generation method using a multi-attention fusion network based on a multi-granularity reward mechanism, which addresses the problem that, in reinforcement-learning-based image captioning, every generated word is given the same reward even though different words have different importance. The invention proposes, for the first time, a multi-attention fusion network with a multi-granularity reward mechanism for image caption generation, comprising a multi-attention fusion model, a word importance re-evaluation network and a label retrieval network. The multi-attention fusion model serves as the baseline for the reinforcement-learning-based captioning method; the word importance re-evaluation network re-weights the reward by estimating the importance of each word in the generated caption; the label retrieval network retrieves the corresponding ground-truth captions from a batch of captions and uses the retrieval result as a retrieval reward, and the network is then trained to maximize these rewards so as to generate better captions. Extensive experiments on the MSCOCO dataset show highly competitive evaluation results.

Description

Image caption generation method using a multi-attention fusion network based on a multi-granularity reward mechanism

Technical Field

The invention relates to a method for automatically generating image captions, and falls within the technical fields of computer vision and natural language processing.

Background Art

The goal of image captioning is to automatically generate a natural-language description of a given image. The task is challenging: on the one hand, the computer must understand the image content comprehensively from multiple levels of visual features; on the other hand, the captioning algorithm must progressively refine coarse semantic concepts into human-like natural language. In recent years, advances in deep learning, including attention mechanisms and reinforcement learning, have significantly improved caption quality, and the encoder-decoder framework has become the mainstream approach to image caption generation. Vinyals et al. generated captions from spatially pooled CNN feature maps, compressing the entire image into a static representation; attention mechanisms were later used to improve caption quality by learning to adaptively focus on image regions, but a single LSTM served as both the visual information processor and the language generator, so the language generator was weakened by having to handle the visual processing at the same time. Peter Anderson et al. proposed a top-down architecture with two separate LSTM layers: the first acts as a top-down visual attention model and the second as a language generator. All of the image captioning methods mentioned above use the high-level visual features of the last CNN convolutional layer as the image encoder and ignore low-level visual features, even though low-level features are also useful for understanding an image. Because multi-level features are complementary, fusing them can further improve captioning; however, early fusion methods have not worked well, and how to integrate multi-level visual features into a captioning model is a problem worth considering. Captioning models are typically trained by maximizing the likelihood under a cross-entropy (XE) objective, which makes them sensitive to outlier captions instead of being optimized toward the human consensus on suitable captions for stable outputs. Moreover, captioning models are usually evaluated by computing metrics such as BLEU, ROUGE, METEOR and CIDEr on a test set. The mismatch between the training objective and the evaluation metrics can hurt the captioning model; this problem can be addressed with reinforcement learning (RL) methods such as Policy Gradient and Actor-Critic, which can optimize non-differentiable sequence-based evaluation metrics. Using a Policy Gradient method, the authors of SCST applied CIDEr as the reward and produced captions more consistent with human linguistic consensus.

In SCST, every word receives the same reward as its gradient weight. However, not all words in a sentence should be rewarded equally; different words may have different importance. Yu et al. used Monte Carlo rollouts in SeqGAN to estimate the importance of each word, but this requires sampling many sentences and therefore has a high time cost. Based on the Actor-Critic strategy, Dzmitry Bahdanau et al. used a value estimation network to evaluate words, but the evaluation metrics (e.g., CIDEr, BLEU) cannot be optimized directly. Here, it is proposed to optimize the RL-trained image captioning model with word-level rewards, aiming to account for the different importance of each generated word.

Computing evaluation metrics (e.g., CIDEr, BLEU) as reward signals is the intuitive way to make RL training generate more human-like captions; however, these metrics are not the only criteria for judging the quality of a generated caption. The quality of a generated caption can also be assessed by whether it can retrieve its corresponding ground-truth caption in a retrieval system. From the perspective of information utilization, the traditional CIDEr reward makes full use of the matched ground-truth captions, whereas a retrieval reward also benefits from the additional unmatched captions, so the retrieval loss can serve as a reward signal as well.

A Hierarchical Attention Fusion (HAF) model for image captioning is therefore proposed, which integrates the multi-level feature maps of a ResNet through hierarchical attention and serves as the baseline for the RL-based captioning method. In addition, multi-granularity rewards are introduced in the RL stage to refine the proposed HAF. Specifically, a word importance re-evaluation network (REN) is used for reward re-evaluation by estimating the importance of each word in the generated caption: the re-evaluated reward is obtained by weighting the CIDEr score with per-word weights computed by the REN, and can be regarded as a word-level reward. To benefit from the additional ground-truth captions, a label retrieval network (RN) is implemented to retrieve the corresponding ground-truth caption from a batch of captions, and the retrieval result serves as a sentence-level reward.

Summary of the Invention

The purpose of the present invention is to solve the problem that, in reinforcement-learning-based image caption generation, every generated word is treated as equally important, so as to produce sentences that better match human linguistic consensus. Not all words in a sentence should receive the same reward; different words may have different importance.

The technical scheme adopted by the present invention to solve the above technical problem is as follows:

S1. Build a multi-attention fusion model.

S2. Build a word importance re-evaluation network based on a reinforcement-learning reward mechanism.

S3. Build a label retrieval network in combination with the reinforcement-learning reward mechanism.

S4. Combine the model from S1, the word importance re-evaluation network from S2 and the label retrieval network from S3 into a multi-attention fusion network architecture based on the multi-granularity reward mechanism.

S5. Train the multi-attention fusion network based on the multi-granularity reward mechanism and generate captions.

The multi-attention fusion model (HAF), serving as the baseline for RL training of the captioning model, attends to the hierarchical visual features of the CNN and thus makes full use of multi-level visual information. Besides using the last convolutional representation of the image and a single attention model that focuses on a specific image region at each time step, we also fuse attention models for captioning and feed the attention-derived image features into the cell of the language LSTM. We adopt a classical attention structure that produces normalized attention weights α_t from the LSTM hidden state h_t at each time step t; α_t is used to attend over the spatial image features, yielding the attended feature Att as the final representation of the image A:

a_t = w_a^T tanh(W_a A + U_a h_t)   (1)

α_t = softmax(a_t)   (2)

Att = Σ_i α_{t,i} A_i   (3)

where W_a, U_a and w_a are learned parameters.

[Equation (4), shown as an image in the original publication, is not reproduced here.]

where h_2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence. The process of producing h_2 is given by:

[Equations (5)-(8), shown as images in the original publication, define the two-layer LSTM update that produces h_2 from the attended multi-level image features and the previously generated words; they are not reproduced here.]

Finally, the probability of the output word is given by a nonlinear softmax function:

p(y_t | y_{1:t-1}) = softmax(W_p h_2)   (9)

where W_p is a learned output projection.
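
For illustration only, the following is a minimal PyTorch sketch of one decoding step of an HAF-style decoder. It assumes the standard soft attention of Eqs. (1)-(3) applied separately to the conv4 and conv5 feature maps, and an Up-Down-style two-layer LSTM in the spirit of Eqs. (5)-(9); all class names, dimensions and wiring details are illustrative assumptions rather than the exact architecture of the disclosure.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    # Eqs. (1)-(3): score each spatial location, normalize, and pool the features.
    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.W_a = nn.Linear(feat_dim, att_dim)   # projects image features
        self.U_a = nn.Linear(hid_dim, att_dim)    # projects the LSTM hidden state
        self.w_a = nn.Linear(att_dim, 1)          # scores each spatial location

    def forward(self, A, h):                      # A: (B, N, feat_dim), h: (B, hid_dim)
        a = self.w_a(torch.tanh(self.W_a(A) + self.U_a(h).unsqueeze(1))).squeeze(-1)  # Eq. (1)
        alpha = F.softmax(a, dim=-1)                                                  # Eq. (2)
        return (alpha.unsqueeze(-1) * A).sum(dim=1)                                   # Eq. (3): Att

class HAFDecoderStep(nn.Module):
    # One time step: attention LSTM -> attention over conv4 and conv5 -> language LSTM -> softmax.
    def __init__(self, feat_dim=1024, hid_dim=512, emb_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.att4 = SoftAttention(feat_dim, hid_dim, hid_dim)            # attention over conv4 features
        self.att5 = SoftAttention(feat_dim, hid_dim, hid_dim)            # attention over conv5 features
        self.lstm1 = nn.LSTMCell(hid_dim + feat_dim + emb_dim, hid_dim)  # attention LSTM
        self.lstm2 = nn.LSTMCell(hid_dim + 2 * feat_dim, hid_dim)        # language LSTM
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, word, A4, A5, state1, state2):
        a_mean = torch.cat([A4, A5], dim=1).mean(dim=1)                  # global image context
        h1, c1 = self.lstm1(torch.cat([state2[0], a_mean, self.embed(word)], dim=-1), state1)
        att = torch.cat([self.att4(A4, h1), self.att5(A5, h1)], dim=-1)  # fuse both feature levels
        h2, c2 = self.lstm2(torch.cat([att, h1], dim=-1), state2)
        return F.log_softmax(self.out(h2), dim=-1), (h1, c1), (h2, c2)   # Eq. (9): word probabilities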

The word importance re-evaluation network (REN) is built on the reinforcement-learning reward mechanism and re-evaluates the metric-based reward by automatically estimating the importance of the different words in the generated caption. First, the REN takes the generated sentence S as input; the sentence is processed by an RNN with an attention network and an average-pooling layer, and the attended sentence embedding is concatenated with the pooled sentence embedding to form a comprehensive representation of the generated caption. Two fully connected layers followed by a sigmoid then produce a weight W_t for each word. In particular, the captioning model pre-trained with the CIDEr reward (the rl-model) serves as the baseline b, which significantly reduces the variance without changing the expected gradient. The word-level reward Wr_t is constructed as in Eq. (10), so that only samples from the model that outperform the current test model (the rl-model) receive positive weights, while inferior samples are suppressed. Mathematically, the loss function can be written as Eq. (11):

Wr_t = R·W_t + R - b   (10)

L(θ) = -Σ_i Wr_i · log p_θ(w_i^s | w_{1:i-1}^s)   (11)

where W_i is the output weight of the REN, θ denotes the parameters of the image captioning network, and w_i^s denotes the i-th word of the generated sentence.

To exploit the metric-based reward (CIDEr) while constraining the sentence space, the word-level reward is used to fine-tune the captioning network after CIDEr optimization. In addition, to optimize the REN at the same time, the update of the REN is defined as another RL process with reward R - b. We observe that R - b is too small and leads to weak gradients for the REN, so a hyperparameter γ is set to strengthen the gradient; similarly, the REN can be updated by the reinforcement learning algorithm with the following loss function:

[Equation (12), shown as an image in the original publication, gives this REN update loss with the scaled reward γ(R - b); it is not reproduced here.]
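
For illustration only, the following is a minimal PyTorch sketch of the word-level reward re-evaluation described above. The REN structure follows the text (an RNN with attention and average pooling, two fully connected layers and a sigmoid), while the per-word conditioning, the exact form of the REN update loss (a stand-in for Eq. (12)) and all names are assumptions.

import torch
import torch.nn as nn

class REN(nn.Module):
    # Embeds a generated sentence and returns a per-word weight W_t in (0, 1).
    def __init__(self, vocab_size=10000, emb_dim=512, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.score = nn.Linear(hid_dim, 1)                # attention scores over time steps
        self.fc = nn.Sequential(nn.Linear(3 * hid_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 1))

    def forward(self, words):                             # words: (B, T) token ids
        h, _ = self.rnn(self.embed(words))                # (B, T, hid_dim)
        alpha = torch.softmax(self.score(h), dim=1)       # attention over time steps
        attended = (alpha * h).sum(dim=1, keepdim=True)   # attended sentence embedding
        pooled = h.mean(dim=1, keepdim=True)              # average-pooled sentence embedding
        sent = torch.cat([attended, pooled], dim=-1).expand(-1, h.size(1), -1)
        return torch.sigmoid(self.fc(torch.cat([h, sent], dim=-1))).squeeze(-1)  # (B, T) weights W_t

def word_level_losses(log_probs, words, R, b, ren, gamma=10.0):
    # log_probs: (B, T) log p(w_t) of the sampled caption; R, b: (B,) CIDEr of the sample / baseline.
    W = ren(words)                                        # per-word importance weights
    Wr = R.unsqueeze(1) * W + (R - b).unsqueeze(1)        # Eq. (10): word-level reward Wr_t
    caption_loss = -(Wr.detach() * log_probs).sum(dim=1).mean()                         # Eq. (11), sketch
    ren_loss = -(gamma * (R - b).unsqueeze(1) * torch.log(W + 1e-8)).sum(dim=1).mean()  # Eq. (12), assumed form
    return caption_loss, ren_loss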

The label retrieval network (RN) is also built on the reinforcement-learning reward mechanism. To strengthen the metric-based reward (CIDEr) and exploit both the matched and the unmatched ground-truth captions, a label retrieval network is introduced so that a generated caption should match its corresponding ground-truth captions. Following the cross-media retrieval approach proposed by Fartash Faghri et al., we build a sentence retrieval model with two LSTM networks. First, the RN is pre-trained to convergence with the different ground-truth captions of the images, since each image has five different ground-truth captions; the ground-truth captions and the generated captions are encoded as features in the same embedding space of the RN:

s_i = LSTM(C_i)   (13)

g_j = LSTM(G_j)   (14)

where C and G denote the generated captions and the ground-truth captions, and s_i and g_j denote their respective embedded features. The similarity between s and g is computed as the cosine similarity:

S(s_i, g_j) = (s_i · g_j) / (||s_i|| ||g_j||)   (15)

The score S(s_i, g_i) of a matched pair is required to be higher than the score S(s_i, g_j) of any mismatched pair, and the loss of the RN is computed with a hinge loss:

L_hinge = Σ_{j≠i} max(0, α - S(s_i, g_i) + S(s_i, g_j)) + Σ_{j≠i} max(0, α - S(s_i, g_i) + S(s_j, g_i))   (16)

where (s_i, g_i) is a correct (matched) pair, (s_i, g_j) with j ≠ i is an incorrect (mismatched) pair, and α is the margin. The hinge loss, combined with the CIDEr score, acts as a sentence-level reward in RL training, which encourages the captioning model to generate captions that best match the given ground-truth captions.

[Equation (17), shown as an image in the original publication, is the sentence-level RL loss that combines the CIDEr reward with the retrieval (hinge-loss) reward weighted by β; it is not reproduced here.]

Equation (17) is the loss function used to optimize the captioning model with the sentence-level reward, where β is a hyperparameter that balances the hinge loss and CIDEr. Note that the retrieval is performed within each mini-batch, because retrieving over the entire dataset is time-consuming.
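
For illustration only, the following is a minimal PyTorch sketch of the mini-batch retrieval reward. It assumes a bidirectional hinge loss with margin α over cosine similarities, in the spirit of Faghri et al., and a simple additive combination of CIDEr and the negated hinge loss weighted by β; these choices and all names are assumptions rather than the exact formulation of the disclosure.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalNet(nn.Module):
    # Encodes generated captions C and ground-truth captions G into one embedding space (Eqs. (13)-(14)).
    def __init__(self, vocab_size=10000, emb_dim=512, joint_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cap_lstm = nn.LSTM(emb_dim, joint_dim, batch_first=True)   # encoder for generated captions
        self.gt_lstm = nn.LSTM(emb_dim, joint_dim, batch_first=True)    # encoder for ground-truth captions

    def forward(self, captions, labels):
        _, (s, _) = self.cap_lstm(self.embed(captions))   # s: (1, B, joint_dim)
        _, (g, _) = self.gt_lstm(self.embed(labels))
        return F.normalize(s.squeeze(0), dim=-1), F.normalize(g.squeeze(0), dim=-1)

def retrieval_hinge_loss(s, g, margin=0.2):
    # Bidirectional hinge loss over a mini-batch; the diagonal entries are the matched pairs.
    sim = s @ g.t()                                       # Eq. (15): cosine similarities (B, B)
    pos = sim.diag().unsqueeze(1)                         # S(s_i, g_i)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_s = (margin - pos + sim).clamp(min=0).masked_fill(mask, 0)      # caption -> label direction
    cost_g = (margin - pos.t() + sim).clamp(min=0).masked_fill(mask, 0)  # label -> caption direction
    return cost_s.sum(dim=1) + cost_g.sum(dim=0)          # (B,) hinge loss per sample

def sentence_level_reward(cider, s, g, beta=1.0):
    # A lower retrieval loss means a better match, so it is subtracted from the CIDEr score.
    return cider - beta * retrieval_hinge_loss(s, g)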

The multi-attention fusion network based on the multi-granularity reward mechanism proposed by the present invention comprises a multi-attention fusion model (HAF), a word importance re-evaluation network (REN) and a label retrieval network (RN).

Finally, the training method of the multi-attention fusion network based on the multi-granularity reward mechanism is as follows:

All models are pre-trained with the cross-entropy loss and then trained to maximize the different RL rewards. The encoder uses a pre-trained ResNet-101 to obtain image representations; for each image, we extract the outputs of the conv4 and conv5 convolutional blocks from the ResNet and map them to 1024-dimensional vectors as the input of the HAF. For the HAF, the image feature embedding dimension, the LSTM hidden state dimension and the word embedding dimension are all set to 512. The baseline model is trained under the XE objective using the ADAM optimizer with an initial learning rate of 10^-4. At each epoch, we evaluate the model and keep the best CIDEr as the baseline score. Reinforcement training starts from the 30th epoch to optimize the CIDEr metric, with a learning rate of 10^-5.

In the word-level reward training stage, the image captioning model is first trained with the CIDEr reward for 20 epochs and then with the word-level reward for 10 epochs. For sentence-level reward training, the RN is pre-trained for 10 epochs with the different ground-truth captions of each image; the word embedding and LSTM hidden sizes are set to 512, the joint embedding size is set to 1024, and the margin hyperparameter α is set to 0.2. In addition, the captioning model used as the baseline b is trained with cross-entropy for 30 epochs, and the sentence-level reward training also runs for 30 epochs.
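
For reference, the hyperparameters of the training procedure described above can be collected into a single configuration. The dictionary below is a sketch whose key names are illustrative; the values follow the text.

TRAIN_CONFIG = {
    "encoder": "resnet101",          # pre-trained; conv4 and conv5 outputs are used
    "feature_dim": 1024,             # dimension the CNN features are mapped to
    "embed_dim": 512,                # word embedding size
    "hidden_dim": 512,               # LSTM hidden state size
    "xe_epochs": 30,                 # cross-entropy pre-training
    "xe_lr": 1e-4,                   # ADAM learning rate for the XE stage
    "rl_lr": 1e-5,                   # learning rate once RL training starts (epoch 30)
    "cider_rl_epochs": 20,           # CIDEr-reward training before the word-level reward
    "word_reward_epochs": 10,        # word-level reward fine-tuning
    "rn_pretrain_epochs": 10,        # RN pre-training on the ground-truth captions
    "sentence_reward_epochs": 30,    # sentence-level (retrieval) reward training
    "joint_embed_dim": 1024,         # RN joint embedding size
    "margin_alpha": 0.2,             # hinge-loss margin
}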

Compared with the prior art, the beneficial effects of the present invention are:

1. The present invention proposes the Hierarchical Attention Fusion (HAF) model as the baseline for RL training of image captioning. The HAF attends to the hierarchical visual features of the CNN several times and can make full use of multi-level visual information.

2. The present invention proposes the word importance re-evaluation network (REN) to facilitate the computation of the re-evaluated reward, which automatically assigns different importance to the generated words in a sentence during the RL training stage.

3. The present invention proposes the label retrieval network (RN) to obtain a sentence-level retrieval reward. The RN drives the generated captions to match their corresponding ground-truth captions rather than other sentences.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the multi-attention fusion network based on the multi-granularity reward mechanism.

Figure 2 is a schematic diagram of the Hierarchical Attention Fusion (HAF) model.

Figure 3 is a schematic diagram of the word importance re-evaluation network (REN).

Figure 4 is a schematic diagram of the label retrieval network (RN).

Figure 5 compares the captions generated by the multi-attention fusion network based on the multi-granularity reward mechanism with the captions generated by the top-down method, the captions generated by the hierarchical attention fusion model alone, and the ground-truth captions.

Detailed Description of the Embodiments

The drawings are for illustrative purposes only and should not be construed as limiting this patent.

The present invention is further described below with reference to the accompanying drawings and embodiments.

Figure 1 is a schematic diagram of the multi-attention fusion network based on the multi-granularity reward mechanism. As shown in Figure 1, the framework produces a word-level reward and a sentence-level reward: on the left, the word-level reward is produced by adaptively re-evaluating the importance of each word; on the right, the sentence-level reward is constructed from the retrieval loss, which is computed from the retrieval similarity S.
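
For illustration only, the following is a minimal PyTorch sketch of how a reward of either granularity enters an SCST-style policy-gradient update: a sentence-level reward is a single scalar broadcast over all words, while a word-level reward such as Wr_t of Eq. (10) assigns each word its own weight. All names are illustrative assumptions.

def policy_gradient_loss(log_probs, reward, baseline=None):
    # log_probs: (B, T) log-probabilities of the sampled caption (torch tensors).
    # reward: (B,) sentence-level reward, or (B, T) word-level reward such as Wr_t of Eq. (10),
    # which already contains its own baseline term R - b.
    # baseline: (B,) score subtracted from a sentence-level reward to reduce variance.
    if reward.dim() == 1:                        # sentence-level: one advantage broadcast to every word
        advantage = (reward - baseline).unsqueeze(1).expand_as(log_probs)
    else:                                        # word-level: each word has its own advantage
        advantage = reward
    return -(advantage.detach() * log_probs).sum(dim=1).mean()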

Figure 2 is a schematic diagram of the Hierarchical Attention Fusion (HAF) model. As shown in Figure 2, the averaged conv4 and conv5 features serve as the global image context, X is the one-hot encoding of the input word, and E is the word embedding matrix of the vocabulary. We adopt a classical attention structure that produces normalized attention weights α_t from the LSTM hidden state h_t at each time step t; α_t is used to attend over the spatial image features, yielding the attended feature Att as the final representation (A) of the image:

a_t = w_a^T tanh(W_a A + U_a h_t)   (1)

α_t = softmax(a_t)   (2)

Att = Σ_i α_{t,i} A_i   (3)

where W_a, U_a and w_a are learned parameters.

[Equation (4), shown as an image in the original publication, is not reproduced here.]

where h_2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence. The process of producing h_2 is given by:

[Equations (5)-(8), shown as images in the original publication, define the two-layer LSTM update that produces h_2 from the attended multi-level image features and the previously generated words; they are not reproduced here.]

Finally, the probability of the output word is given by a nonlinear softmax function:

p(y_t | y_{1:t-1}) = softmax(W_p h_2)   (9)

where W_p is a learned output projection.

Figure 3 is a schematic diagram of the word importance re-evaluation network (REN). As shown in Figure 3, the REN embeds the generated sentence and outputs the reward weights W, S denotes the sigmoid, and rl-model is the captioning model pre-trained with the CIDEr reward. First, the REN takes the generated sentence S as input; the sentence is processed by an RNN with an attention network and an average-pooling layer, the attended sentence embedding and the pooled sentence embedding are concatenated as a comprehensive representation of the generated caption, and two fully connected layers followed by a sigmoid produce the weight W_t of each word. Mathematically, the loss function can be written as Eq. (11):

Wr_t = R·W_t + R - b   (10)

L(θ) = -Σ_i Wr_i · log p_θ(w_i^s | w_{1:i-1}^s)   (11)

where W_i is the output weight of the REN, θ denotes the parameters of the image captioning network, and w_i^s denotes the i-th word of the generated sentence.

To exploit the metric-based reward (CIDEr) while constraining the sentence space, the word-level reward is used to fine-tune the captioning network after CIDEr optimization. In addition, to optimize the REN at the same time, the update of the REN is defined as another RL process with reward R - b. We observe that R - b is too small and leads to weak gradients for the REN, so a hyperparameter γ is set to strengthen the gradient; similarly, the REN can be updated by the reinforcement learning algorithm with the following loss function:

[Equation (12), shown as an image in the original publication, gives this REN update loss with the scaled reward γ(R - b); it is not reproduced here.]

Figure 4 is a schematic diagram of the label retrieval network (RN). As shown in Figure 4, text-to-text retrieval over the matched and unmatched ground-truth captions forms the sentence-level reward for RL training. The ground-truth captions and the generated captions are encoded as features in the same embedding space of the RN:

s_i = LSTM(C_i)   (13)

g_j = LSTM(G_j)   (14)

where C and G denote the generated captions and the ground-truth captions, and s_i and g_j denote their respective embedded features. The similarity between s and g is computed as the cosine similarity:

S(s_i, g_j) = (s_i · g_j) / (||s_i|| ||g_j||)   (15)

The score S(s_i, g_i) of a matched pair is required to be higher than the score S(s_i, g_j) of any mismatched pair, and the loss of the RN is computed with a hinge loss:

L_hinge = Σ_{j≠i} max(0, α - S(s_i, g_i) + S(s_i, g_j)) + Σ_{j≠i} max(0, α - S(s_i, g_i) + S(s_j, g_i))   (16)

where (s_i, g_i) is a correct (matched) pair, (s_i, g_j) with j ≠ i is an incorrect (mismatched) pair, and α is the margin. The hinge loss, combined with the CIDEr score, acts as a sentence-level reward in RL training, which encourages the captioning model to generate captions that best match the given ground-truth captions.

[Equation (17), shown as an image in the original publication, is the sentence-level RL loss that combines the CIDEr reward with the retrieval (hinge-loss) reward weighted by β; it is not reproduced here.]

Equation (17) is the loss function used to optimize the captioning model with the sentence-level reward, where β is a hyperparameter that balances the hinge loss and CIDEr. Note that the retrieval is performed within each mini-batch, because retrieving over the entire dataset is time-consuming.

Figure 5 compares the captions generated by the multi-attention fusion network based on the multi-granularity reward mechanism with the captions generated by the top-down method, the captions generated by the hierarchical attention fusion model alone, and the ground-truth captions. As shown in Figure 5, the sentences generated by the multi-attention fusion network based on the multi-granularity reward mechanism are more accurate and more human-like than those of the other models in the figure.

The present invention proposes a word importance re-evaluation network and a label retrieval network based on the reinforcement-learning reward mechanism, and on this basis proposes an image caption generation method using a multi-attention fusion network based on the multi-granularity reward mechanism. The framework comprises a multi-attention fusion model (HAF), a word importance re-evaluation network (REN) and a label retrieval network (RN). The Hierarchical Attention Fusion (HAF) model serves as the baseline for RL training of image captioning; it attends to the hierarchical visual features of the CNN several times and can make full use of multi-level visual information. At the same time, the word importance re-evaluation network (REN) facilitates the computation of the re-evaluated reward by automatically assigning different importance to the generated words in a sentence during the RL training stage. The label retrieval network (RN) encourages the generated captions to match their corresponding ground-truth captions rather than other sentences. Through training, the generated image captions are accurate and fluent and reflect the image content well.

Finally, the details of the above examples are provided only to explain the present invention. Any modification, improvement or replacement of the above embodiments made by those skilled in the art shall fall within the protection scope of the claims of the present invention.

Claims (6)

1. The method for generating image captions with a multi-attention fusion network based on a multi-granularity reward mechanism, characterized by comprising the following steps:
S1, constructing a multi-attention fusion model;
S2, constructing a word importance re-evaluation network based on a reinforcement learning reward mechanism;
S3, constructing a label retrieval network in combination with the reinforcement learning reward mechanism;
S4, combining the model in S1, the word importance re-evaluation network in S2 and the label retrieval network in S3 to construct a multi-attention fusion network architecture based on the multi-granularity reward mechanism;
S5, training the multi-attention fusion network based on the multi-granularity reward mechanism and generating captions.
2. The method for generating image captions based on a multi-granularity reward mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S1 is as follows:
A classical attention structure is adopted, which produces a normalized attention weight α_t from the LSTM hidden state h_t at each time step t; α_t is used to attend over the spatial image features, yielding the attended feature Att as the final representation (A) of the image:
a_t = w_a^T tanh(W_a A + U_a h_t)   (1)
α_t = softmax(a_t)   (2)
Att = Σ_i α_{t,i} A_i   (3)
where W_a, U_a and w_a are learned parameters.
[Equation (4), shown as an image in the original publication, is not reproduced here.]
where h_2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence; the process of producing h_2 is given by Eqs. (5)-(8), shown as images in the original publication and not reproduced here.
Finally, the probability of the output word is given by a nonlinear softmax function:
p(y_t | y_{1:t-1}) = softmax(W_p h_2)   (9)
where W_p is a learned output projection.
3. The method for generating image captions based on a multi-granularity reward mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S2 is as follows:
The REN takes the generated sentence S as input; the sentence is processed by an RNN with an attention network and an average-pooling layer, the attended sentence embedding and the pooled sentence embedding are concatenated as a comprehensive representation of the generated caption, and two fully connected layers followed by a sigmoid produce the weight W_t of each word. Mathematically, the loss function can be written as Eq. (11):
Wr_t = R·W_t + R - b   (10)
L(θ) = -Σ_i Wr_i · log p_θ(w_i^s | w_{1:i-1}^s)   (11)
where W_i is the output weight of the REN, θ denotes the parameters of the image captioning network, and w_i^s denotes the i-th word of the generated sentence.
To exploit the metric-based reward (CIDEr) and constrain the sentence space, the word-level reward is used to fine-tune the captioning network after CIDEr optimization. Furthermore, to optimize the REN at the same time, the update of the REN is defined as another RL process with reward R - b. Since R - b is too small and leads to weak gradients for the REN, a hyperparameter γ is set to strengthen the gradient; similarly, the REN may be updated by the reinforcement learning algorithm with the loss function of Eq. (12), shown as an image in the original publication and not reproduced here.
4. The method for generating image captions based on a multi-granularity reward mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S3 is as follows:
The RN is pre-trained to convergence with the different ground-truth captions of the images, each image having five; the ground-truth captions and the generated captions are encoded as features in the same embedding space of the RN:
s_i = LSTM(C_i)   (13)
g_j = LSTM(G_j)   (14)
where C and G denote the generated captions and the ground-truth captions, and s_i and g_j denote their respective embedded features. The cosine similarity between s and g is computed as:
S(s_i, g_j) = (s_i · g_j) / (||s_i|| ||g_j||)   (15)
The score of a matched pair is required to be higher than that of any mismatched pair, and the loss of the RN is computed with a hinge loss (Eq. (16), shown as an image in the original publication), where (s_i, g_i) is a correct pair and (s_i, g_j), j ≠ i, is an incorrect pair. The hinge loss, combined with CIDEr, acts as a sentence-level reward in RL training, which encourages the captioning model to generate captions that best match the given ground-truth captions.
Equation (17), shown as an image in the original publication, is the loss function used to optimize the captioning model with the sentence-level reward, where β is a hyperparameter that balances the hinge loss and CIDEr. Note that the retrieval is performed in each mini-batch, since retrieval over the entire dataset is time-consuming.
5. The method for generating image captions based on a multi-granularity reward mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S4 is as follows:
The multi-attention fusion network based on the multi-granularity reward mechanism comprises a multi-attention fusion model (HAF), a word importance re-evaluation network (REN) and a label retrieval network (RN).
6. The method for generating image captions based on a multi-granularity reward mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S5 is as follows:
All models are pre-trained with the cross-entropy loss and then trained to maximize the different RL rewards. The encoder uses a pre-trained ResNet-101 to obtain image representations; for each image, the outputs of the conv4 and conv5 convolutional blocks are extracted from the ResNet and mapped to 1024-dimensional vectors as the input of the HAF. For the HAF, the image feature embedding dimension, the LSTM hidden state dimension and the word embedding dimension are all set to 512. The baseline model is trained under the XE objective using the ADAM optimizer with an initial learning rate of 10^-4. At each epoch, the model is evaluated and the best CIDEr is kept as the baseline score. Reinforcement training starts from the 30th epoch to optimize the CIDEr metric, with a learning rate of 10^-5.
In the word-level reward training stage, the image captioning model is first trained with the CIDEr reward for 20 epochs and then with the word-level reward for 10 epochs. For sentence-level reward training, the RN is pre-trained for 10 epochs with the different ground-truth captions of each image; the word embedding and LSTM hidden sizes are set to 512, the joint embedding size is set to 1024, and the margin hyperparameter α is set to 0.2. In addition, the captioning model used as the baseline b is trained with cross-entropy for 30 epochs, and the sentence-level reward training runs for 30 epochs.
CN202010974467.1A 2020-09-16 2020-09-16 Image caption generation method with multi-attention fusion network based on multi-granularity reward mechanism Pending CN112116685A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010974467.1A CN112116685A (en) 2020-09-16 2020-09-16 Image caption generation method with multi-attention fusion network based on multi-granularity reward mechanism

Publications (1)

Publication Number Publication Date
CN112116685A true CN112116685A (en) 2020-12-22

Family

ID=73803138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010974467.1A Pending CN112116685A (en) 2020-09-16 2020-09-16 Image caption generation method with multi-attention fusion network based on multi-granularity reward mechanism

Country Status (1)

Country Link
CN (1) CN112116685A (en)


Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098153A1 (en) * 2015-10-02 2017-04-06 Baidu Usa Llc Intelligent image captioning
US20170127016A1 (en) * 2015-10-29 2017-05-04 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
US10467274B1 (en) * 2016-11-10 2019-11-05 Snap Inc. Deep reinforcement learning-based captioning with embedding reward
US20180143966A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial Attention Model for Image Captioning
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN109711464A (en) * 2018-12-25 2019-05-03 中山大学 Image Description Method Based on Hierarchical Feature Relation Graph
KR20200104663A (en) * 2019-02-27 2020-09-04 한국전력공사 System and method for automatic generation of image caption
KR20200106115A (en) * 2019-02-27 2020-09-11 한국전력공사 Apparatus and method for automatically generating explainable image caption
CN110135567A (en) * 2019-05-27 2019-08-16 中国石油大学(华东) Image Caption Generation Method Based on Multi-Attention Generative Adversarial Network
CN110347860A (en) * 2019-07-01 2019-10-18 南京航空航天大学 Depth image based on convolutional neural networks describes method
CN110473267A (en) * 2019-07-12 2019-11-19 北京邮电大学 Social networks image based on attention feature extraction network describes generation method
US10699129B1 (en) * 2019-11-15 2020-06-30 Fudan University System and method for video captioning
CN111046966A (en) * 2019-12-18 2020-04-21 江南大学 Image caption generation method based on metric attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chunlei Wu et al., "Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards", IEEE Access, pages 57943-57951 *
Du Haijun et al., "Image caption generation method incorporating constraint learning", Journal of Image and Graphics, pages 0333-0342 *
Yuan Shaozu et al., "Video scene recognition based on multi-granularity video information and attention mechanism", Computer Systems & Applications, pages 252-256 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408430A (en) * 2021-06-22 2021-09-17 哈尔滨理工大学 Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework
CN113408430B (en) * 2021-06-22 2022-09-09 哈尔滨理工大学 Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
CN113918754A (en) * 2021-11-01 2022-01-11 中国石油大学(华东) Image caption generation method based on scene graph update and feature stitching
CN114090815A (en) * 2021-11-12 2022-02-25 海信电子科技(武汉)有限公司 An image description model training method and training device
CN116501859A (en) * 2023-06-26 2023-07-28 中国海洋大学 Paragraph retrieval method, device and medium based on refrigerator field
CN116501859B (en) * 2023-06-26 2023-09-01 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field
CN120281862A (en) * 2025-04-02 2025-07-08 中国石油大学(华东) Underwater image subtitle generation method and system based on multi-modal information fusion


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20201222)