
CN116701696A - Picture description method based on pulse Transformer model - Google Patents

Picture description method based on pulse Transformer model

Info

Publication number
CN116701696A
Authority
CN
China
Prior art keywords
plmp
unit
pulse
self
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310682762.3A
Other languages
Chinese (zh)
Inventor
梁秀波
张璇
王宏志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310682762.3A priority Critical patent/CN116701696A/en
Publication of CN116701696A publication Critical patent/CN116701696A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a picture description method based on a pulse Transformer model, which comprises the following steps: first, a novel spiking neuron, PLMP, is designed that better reflects the biological diversity of neurons and has learnable membrane potential time constants and voltage thresholds; this neuron mitigates the vanishing-gradient problem in training spiking models. The ordinary self-attention mechanism in the Transformer is then converted into a pulse self-attention mechanism using PLMP neurons, and a pulse Transformer is constructed with the pulse self-attention mechanism for training the picture description model. Finally, an energy-saving pulse Transformer model that is applicable to the picture description field and capable of generating high-quality picture descriptions is obtained.

Description

A picture description method based on a pulse Transformer model

Technical Field

The invention relates to the application of spiking neural networks to the field of picture description, and in particular to a picture description method based on a pulse Transformer model.

Background

In April 2022, Google released PaLM, a language model based on a general AI architecture. On November 30, 2022, OpenAI announced ChatGPT, a new conversational AI model fine-tuned from the GPT-3.5 series of large language models; it can not only conduct natural multi-turn dialogue and efficient, precise question answering, but also generate program code, e-mails, papers, novels and other kinds of text. The popularity of ChatGPT has set off a wave of exploration of large models at home and abroad. As large language models mature, more and more technology companies and researchers are focusing on multimodal large models. On March 6, 2023, Google launched PaLM-E, the largest vision-language multimodal model to date, which demonstrated remarkable performance in robot manipulation, visual question answering, picture description and pure language tasks. On March 14, 2023, OpenAI released GPT-4, a multimodal large language model with a larger scale, a richer knowledge base and stronger context understanding. Domestic academia and technology companies have also announced or will launch similar dialogue models, such as Baidu's Wenxin Yiyan, Alibaba's Tongyi Qianwen, Huawei's Pangu model and Tencent's Hunyuan model. Large models have become the general trend. However, training a large model is extremely expensive; as parameter counts keep growing, computing power and training cost remain bottlenecks. At the Artificial Intelligence Large Model Technology Summit Forum held on April 8, 2023, Tian Qi, head of Huawei's large model effort, stated that developing and training a large model once costs about 12 million US dollars, of which 7.2 million US dollars is spent on electricity. Cost reduction for large models therefore has two aspects, optimizing computing power and optimizing electricity consumption, and the latter leaves huge room for savings.

Spiking Neural Networks (SNNs), as the third generation of artificial neural networks, have the advantages of low computational cost, low power consumption and fast information transmission. Traditional Artificial Neural Networks (ANNs) are computing models based on the information-processing style of biological nervous systems. Picture description, as a sub-field of multimodality and a sub-task of multimodal large models, can provide useful information for visually impaired people and can be used for automatic image and video annotation, so it has great research value. In the future, neuromorphic computing may, as an efficient approach, make it possible to reduce the cost and improve the efficiency of multimodal large-model training through the co-evolution of neuromorphic chips and SNN algorithms. The invention studies an energy-efficient spiking picture description model as an exploratory application of energy-efficient multimodal models. However, because spiking activity is binary and non-differentiable, directly training SNNs can suffer from severe gradient vanishing and network degradation. Moreover, the spiking neurons commonly used today are of a single type, and the membrane potential time constant and membrane voltage threshold must be specified as hyperparameters by experience or optimization, which runs counter to the biological diversity of neurons.

Summary of the Invention

The object of the present invention is to address the deficiencies of the prior art by providing a picture description method based on a pulse Transformer model.

The object of the present invention is achieved through the following technical solution: a picture description method based on a pulse Transformer model, comprising the following steps:

(1) Design a spiking neuron PLMP, where each PLMP unit contains multiple parallel LIF units with different membrane potential time constants and voltage thresholds;

(2) Implement a pulse self-attention mechanism based on the PLMP unit;

(3) Construct a Transformer model; the Transformer model adopts an encoder-decoder framework, where the encoder consists of a Swin Transformer and N refinement encoder blocks, and the decoder consists of N decoder blocks;

(4) Convert the Transformer model of step (3) into a pulse Transformer model based on the PLMP unit and the pulse self-attention mechanism;

(5) Obtain a dataset from the picture description field and divide it into a training set, a validation set and a test set; the training set is used to train the pulse Transformer model; the validation set is used to select the optimal pulse Transformer model; the test set is fed into the optimal pulse Transformer model, which outputs picture descriptions.

Further, in step (1), after receiving an input, each parallel LIF unit updates its membrane potential according to its own membrane potential time constant; if the membrane potential exceeds the voltage threshold of that parallel LIF unit, the unit emits a spike. The output of the PLMP unit is the union of the spikes produced by all parallel LIF units.

Further, the forward process of the spiking neuron PLMP is as follows:

Vth_k = tanh(z_k)

The PLMP unit introduces two trainable parameters, m and z, representing the learnable membrane potential time constant and the learnable membrane voltage threshold, respectively. z_k is the learnable membrane voltage parameter of the k-th LIF unit in each PLMP unit; passing it through the hyperbolic tangent function yields Vth_k, the membrane voltage threshold of the k-th LIF unit. m_k is the learnable membrane potential time constant parameter of the k-th LIF unit, computed from τ_k, the membrane potential time constant of the k-th LIF unit in each PLMP unit. p(n-1) is the number of neurons in layer (n-1). The forward process further involves the presynaptic input of the i-th neuron in layer n at time t, the synaptic weight from the j-th neuron in layer (n-1) to the i-th neuron in layer n together with the corresponding bias, the membrane potential vector and output vector of the k-th LIF unit of the i-th neuron in layer n at time t+1, and the final output of the i-th PLMP unit in layer n at time t+1, which participates in the computation of all neurons in layer n+1 connected to this unit.
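For orientation, a minimal sketch of update rules consistent with these definitions, assuming the standard hard-reset discrete LIF dynamics; the symbols and the way m_k enters the decay are illustrative assumptions rather than the patent's exact equations:

$$ I_i^{t,n}=\sum_{j=1}^{p(n-1)} w_{ij}^{n}\,o_j^{t,n-1}+b_i^{n},\qquad u_{i,k}^{t+1,n}=m_k\,u_{i,k}^{t,n}\bigl(1-o_{i,k}^{t,n}\bigr)+I_i^{t,n}, $$

$$ o_{i,k}^{t+1,n}=H\bigl(u_{i,k}^{t+1,n}-\mathrm{Vth}_k\bigr),\qquad o_i^{t+1,n}=\max_{k=1,\dots,K} o_{i,k}^{t+1,n}, $$

where H is the Heaviside step function and the maximum over the K parallel LIF units realizes the union of their spikes.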

Further, in step (2), Query, Key and Value are converted into spikes by the spiking neuron PLMP, according to the following formulas:

Q_i = PLMP(BN(X W_i^Q))

K_i = PLMP(BN(X W_i^K))

V_i = PLMP(BN(X W_i^V))

head_i = Q_i K_i^T V_i

S' = Concat(head_1, ..., head_h)

SpikingMSA(Q, K, V) = PLMP(BN(Linear(S')))

X is the input of the self-attention mechanism, and W_i^Q, W_i^K, W_i^V are learnable linear matrices. i = 1, 2, ..., h, where h is the number of attention heads. V_i is the input feature vector of the i-th attention head, and Q_i, K_i are the feature vectors used by the i-th attention head to compute the attention weights. BN is the Batch Normalization operation. head_i is the output of the i-th attention head. S' is the multi-head result obtained by concatenating the outputs of the h attention heads. SpikingMSA is the resulting pulse self-attention mechanism.
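As a concrete illustration, a minimal PyTorch-style sketch of this pulse self-attention is given below; it assumes a PLMP activation module is available via plmp_factory, and all class and argument names are illustrative rather than the patent's reference implementation:

import torch
import torch.nn as nn

class SpikingMSA(nn.Module):
    """Pulse multi-head self-attention: Q, K, V are binarized by PLMP neurons,
    so head_i = Q_i K_i^T V_i needs no Softmax (spike products are non-negative)."""

    def __init__(self, dim, num_heads, plmp_factory):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim)
        self.q_bn, self.k_bn = nn.BatchNorm1d(dim), nn.BatchNorm1d(dim)
        self.v_bn, self.out_bn = nn.BatchNorm1d(dim), nn.BatchNorm1d(dim)
        # plmp_factory() returns a spiking activation (parallel LIF units with
        # learnable time constants and thresholds); it is assumed, not defined here.
        self.q_plmp, self.k_plmp = plmp_factory(), plmp_factory()
        self.v_plmp, self.out_plmp = plmp_factory(), plmp_factory()

    def _bn(self, bn, x):
        # x: (batch, tokens, dim); BatchNorm1d normalizes the feature dimension
        return bn(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x):
        b, n, d = x.shape
        q = self.q_plmp(self._bn(self.q_bn, self.q_proj(x)))   # Q = PLMP(BN(X W^Q))
        k = self.k_plmp(self._bn(self.k_bn, self.k_proj(x)))   # K = PLMP(BN(X W^K))
        v = self.v_plmp(self._bn(self.v_bn, self.v_proj(x)))   # V = PLMP(BN(X W^V))
        # split into heads: (batch, heads, tokens, head_dim)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        s = (q @ k.transpose(-2, -1)) @ v                       # head_i = Q_i K_i^T V_i
        s = s.transpose(1, 2).reshape(b, n, d)                  # S' = Concat(head_1..head_h)
        return self.out_plmp(self._bn(self.out_bn, self.out_proj(s)))

Because Q, K and V are binary spike tensors after PLMP, the products Q K^T and (Q K^T) V involve only spike-gated additions, and no Softmax or scaling is applied.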

Further, in step (3), the Swin Transformer is used to extract grid features from the input image, and average pooling of the grid features yields a global feature; the refinement encoder captures the internal relationships between grid features and grid features, and between grid features and the global feature. The relationship among grid features is captured with SW/W-MSA; the relationship between grid features and the global feature is captured with MSA, in which the global feature serves as the Key of the self-attention mechanism. Each refinement encoder block in the encoder first feeds the obtained grid features and global feature into the self-attention mechanisms, adds the input of each self-attention mechanism to its output, normalizes the sum and passes it to a feed-forward network, and finally outputs after a residual sum and normalization, yielding refined global and grid features.

Further, the formulas of each refinement encoder block in the encoder are as follows:

FeedForward(x) = W_2 ReLU(W_1 x)

The formulas involve the output grid features and the output global feature of the l-th encoder block, the learnable parameters W_1 and W_2, a concatenation of the grid features with the global feature, the result of adding the input and output of the sliding-window multi-head self-attention mechanism followed by layer normalization, and the result of adding the input and output of the ordinary multi-head self-attention mechanism followed by layer normalization.

Further, the refined global feature is first fused into the decoder input, and global visual context information is captured through a first multimodal interaction; the Language Masked MSA module then captures the word-to-word intra-modal relationships in the result.

The quantities involved are: the input of the (l-1)-th refinement encoder, whose output serves as the input of the l-th refinement encoder at time t; W_f, a learnable parameter of a linear layer; two further learnable parameter matrices; and the embedding vector of the word generated at time (t-1), each word computing attention maps over the words generated before it.

Finally, the Cross MSA module models the multimodal relationship between this representation and the refined grid features, capturing local visual context information to generate the picture description.

Further, the Cross MSA module is implemented as follows:

Here, W_x and a further matrix are learnable parameters. One term is the output of the Cross MSA, for which the output of the Language Masked MSA serves as the Query and the refined grid features serve as the Key and Value; the last term is the final output of the Cross MSA module.
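Read together with the surrounding description, this corresponds to standard cross-attention with the language stream as Query and the refined grid features as Key and Value. A hedged sketch using PyTorch's built-in attention follows; the residual-plus-normalization arrangement and all names are assumptions, since the patent's exact formulas are not reproduced here:

import torch.nn as nn

class CrossMSAModule(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)   # plays the role of the learnable W_x
        self.norm = nn.LayerNorm(dim)

    def forward(self, y_masked, grid_feats):
        # y_masked: output of the Language Masked MSA (used as Query)
        # grid_feats: refined grid features from the encoder (Key and Value)
        c, _ = self.cross_attn(y_masked, grid_feats, grid_feats)
        return self.norm(y_masked + self.proj(c))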

Further, in step (4), the pulse self-attention mechanism replaces the self-attention mechanism, and the PLMP unit replaces the ReLU unit in the FeedForward unit; the pulse FeedForward unit is implemented as follows:

SpikingFeedForward(x) = W_2 PLMP(W_1 x).

Further, the datasets in the picture description field include MSCOCO 2014, Flickr30K, Flickr8K, VizWiz, TextCaps, Fashion Captioning and CUB-200.

The beneficial effects of the present invention are:

1. A new spiking activation unit, PLMP, is proposed. Compared with the LIF unit commonly used in SNN models, it has multi-level learnable membrane potential time constants and thresholds. This effectively mitigates the vanishing-gradient problem, weakens the influence of initial parameter settings, accelerates training, and gives a better activation effect.

2. A new spiking self-attention mechanism, Spiking-MSA/W-MSA/SW-MSA, is implemented based on PLMP. Using Query, Key and Value in sparse spike form avoids multiplications in the computation and effectively reduces the amount of calculation.

3. The spiking neural network and the Transformer model are innovatively combined and applied to the picture description field with competitive results, yielding an energy-saving pulse Transformer model that is suitable for picture description and can generate high-quality descriptions; moreover, this is the first application of spiking neural networks to the picture description field.

Brief Description of the Drawings

Fig. 1 is the overall flowchart of the present invention;

Fig. 2 is a schematic diagram of the structure of the pulse self-attention mechanism;

Fig. 3 is a schematic diagram of the structure of the pulse Transformer model.

Detailed Description of the Embodiments

Specific embodiments of the present invention are described in further detail below with reference to the drawings. The drawings are used to illustrate the present invention, but not to limit its scope.

Embodiment 1

Referring to Fig. 1, Fig. 2 and Fig. 3, Fig. 1 is a flowchart of a picture description algorithm based on a pulse Transformer model provided by an example of the present invention. It comprises the following steps:

Step 1: design a new type of spiking neuron, PLMP (Parallel LIF with Multistage Learnable Parameters).

Each PLMP unit contains multiple parallel LIF units with different membrane potential time constants and thresholds. After receiving an input, each LIF unit updates its membrane potential according to its own membrane potential time constant; if the membrane potential exceeds the threshold of that unit, the LIF unit emits a spike. The output of the PLMP unit is the union of the spikes produced by all parallel LIF units. In this process the membrane potential time constants and thresholds are optimized automatically during training, rather than being set manually as hyperparameters before training. The forward process of a PLMP neuron is as follows:

Vth_k = tanh(z_k)

The PLMP unit introduces two trainable parameters, m and z, representing the learnable membrane potential time constant and the learnable membrane voltage threshold, respectively. z_k is the learnable membrane voltage parameter of the k-th LIF unit in each PLMP unit; the hyperbolic tangent of z_k gives Vth_k, the membrane voltage threshold of the k-th LIF unit. m_k is the learnable membrane potential time constant parameter of the k-th LIF unit, computed from τ_k, the membrane potential time constant of the k-th LIF unit in each PLMP unit. n and p(n-1) denote the numbers of neurons in layer n and layer (n-1), respectively. The forward process further involves the presynaptic input of the i-th neuron in layer n at time t, the synaptic weight from the j-th neuron in layer (n-1) to the i-th neuron in layer n together with the corresponding bias, the membrane potential vector and output vector of the k-th LIF unit of the i-th neuron in layer n at time t (with K the total number of LIF units), and the final output of the i-th PLMP unit in layer n at time t, which participates in the computation of all neurons in layer n+1 connected to this unit.

This embodiment takes the case where each PLMP unit contains three LIF units as an example. The initial membrane voltage thresholds of the LIF units are 0.6, 1.6 and 2.6, respectively; the initial membrane potential time constants are all 0.25; and the TimeStep is 4.
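Under these settings, a minimal PyTorch-style sketch of a PLMP unit is given below; the surrogate gradient, the direct storage of thresholds and time constants as learnable tensors, and the collapsing of the time dimension are illustrative assumptions rather than the patent's reference code. The resulting module could serve as the plmp_factory target in the attention sketch given earlier:

import torch
import torch.nn as nn

class SpikeFn(torch.autograd.Function):
    # Heaviside spike with a rectangular surrogate gradient (an assumed choice;
    # the patent does not state which surrogate function it uses).
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v >= 0.0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        return grad_out * (v.abs() < 0.5).float()

class PLMP(nn.Module):
    """Parallel LIF units with multistage learnable parameters (sketch).

    The input is treated as a constant input current presented for `time_steps`
    steps; the output is a spike map of the same shape, taken as the union over
    time steps and over the K parallel LIF units. The patent describes
    Vth_k = tanh(z_k) and a decay derived from tau_k; here thresholds and decay
    factors are stored directly as learnable tensors for simplicity."""

    def __init__(self, init_vth=(0.6, 1.6, 2.6), init_tau=0.25, time_steps=4):
        super().__init__()
        self.time_steps = time_steps
        self.vth = nn.Parameter(torch.tensor(init_vth, dtype=torch.float32))
        self.tau = nn.Parameter(torch.full((len(init_vth),), float(init_tau)))

    def forward(self, x):
        u = torch.zeros(*x.shape, self.vth.numel(), device=x.device)  # membrane potentials
        fired = torch.zeros_like(x)
        i_t = x.unsqueeze(-1)                                  # same current at every step
        for _ in range(self.time_steps):
            u = self.tau * u + i_t                             # leaky integration
            s = SpikeFn.apply(u - self.vth)                    # per-unit spikes
            u = u * (1.0 - s)                                  # hard reset after a spike
            fired = torch.maximum(fired, s.amax(dim=-1))       # union over units and time
        return fired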

Step 2: implement the pulse self-attention mechanism based on the PLMP unit.

Query, Key and Value are converted into spikes by the spiking neuron PLMP to avoid floating-point matrix multiplications; the operations between matrices can then be completed with logical AND operations and additions. Since the attention matrix computed from spike-form Q, K and V is naturally non-negative, Softmax is not needed to keep the attention matrix non-negative. The specific formulas are as follows:

Q_i = PLMP(BN(X W_i^Q))

K_i = PLMP(BN(X W_i^K))

V_i = PLMP(BN(X W_i^V))

head_i = Q_i K_i^T V_i

S' = Concat(head_1, ..., head_h)

SpikingMSA(Q, K, V) = PLMP(BN(Linear(S')))

X is the input of the self-attention mechanism, and W_i^Q, W_i^K, W_i^V are learnable linear matrices. i = 1, 2, ..., h, where h is the number of attention heads. V_i is the input feature vector of the i-th attention head, and Q_i, K_i are the feature vectors used by the i-th attention head to compute the attention weights. BN is the Batch Normalization operation. head_i is the output of the i-th attention head. S' is the multi-head result obtained by concatenating the outputs of the h attention heads. SpikingMSA is the resulting pulse self-attention mechanism.

Step 3: construct the Transformer model.

The widely used encoder-decoder framework is adopted. The encoder consists of a Swin Transformer and 3 refinement encoder blocks, and the decoder consists of 3 decoder blocks. The pre-trained Swin Transformer extracts grid features from the input image, and average pooling of the grid features yields the global feature; the refinement encoder captures the internal relationships between grid features and grid features, and between grid features and the global feature. The relationship among grid features is captured with SW/W-MSA. The relationship between grid features and the global feature is captured with MSA, in which the global feature serves as the Key of the self-attention mechanism. Each refinement encoder block first feeds the obtained grid features and global feature into the self-attention mechanisms; a residual structure is introduced here to address the vanishing-gradient problem in deep networks: the input is added to the output of the attention mechanism, normalized and passed to a feed-forward network, and the final output is produced after residual summation and normalization. The decoder uses the refined image grid features to generate the caption word by word by capturing the relationships between words and the image grid features. The formulas of each refinement encoder block in the encoder are as follows:

FeedForward(x) = W_2 ReLU(W_1 x)

The formulas involve the output grid features and the output global feature of the l-th encoder block, the learnable parameters W_1 and W_2, a concatenation of the grid features with the global feature, the result of adding the input and output of the sliding-window multi-head self-attention mechanism followed by layer normalization, and the result of adding the input and output of the ordinary multi-head self-attention mechanism followed by layer normalization.

In the decoder, the refined global feature is first fused into the decoder input, and global visual context information is captured through a first multimodal interaction; the Language Masked MSA module then captures the word-to-word intra-modal relationships in the result.

The quantities involved are: the input of the (l-1)-th refinement encoder, whose output serves as the input of the l-th refinement encoder at time t; W_f, a learnable parameter of a linear layer; two further learnable parameter matrices; and the embedding vector of the word generated at time (t-1), each word being allowed to compute attention maps only over the words generated before it.

Finally, the Cross MSA module models the multimodal relationship between this representation and the grid features, capturing local visual context information to generate the picture description. The two multimodal interactions, between the global feature and the sentence and between the grid features and the sentence, enhance the reasoning ability. The Cross MSA module is implemented as follows:

Here, W_x and a further matrix are learnable parameters. One term is the output of the Cross MSA, for which the output of the Language Masked MSA serves as the Query and the grid features serve as the Key and Value; the final output of the Cross MSA module is computed from it according to the formulas above.
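Putting the encoder and decoder together, a coarse skeleton of the captioning model described in Step 3 is sketched below; the refinement encoder blocks and decoder blocks are passed in as pre-built modules with assumed interfaces, and the feature dimension is illustrative, so this is a sketch rather than the patent's implementation:

import torch.nn as nn

class PulseTransformerCaptioner(nn.Module):
    """Swin backbone -> 3 refinement encoder blocks -> 3 decoder blocks -> word logits."""

    def __init__(self, swin_backbone, encoder_blocks, decoder_blocks, vocab_size, dim=512):
        super().__init__()
        self.backbone = swin_backbone                 # pre-trained Swin Transformer
        self.encoder = nn.ModuleList(encoder_blocks)  # refinement encoder blocks
        self.decoder = nn.ModuleList(decoder_blocks)  # decoder blocks
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, images, captions):
        grid = self.backbone(images)                   # grid features (batch, tokens, dim)
        global_feat = grid.mean(dim=1, keepdim=True)   # average pooling -> global feature
        for block in self.encoder:
            grid, global_feat = block(grid, global_feat)   # SW/W-MSA + MSA refinement
        y = self.word_embed(captions)
        for block in self.decoder:
            y = block(y, grid, global_feat)            # masked MSA + Cross MSA per block
        return self.classifier(y)                      # per-position word logits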

Step 4: convert the Transformer model of step (3) into a pulse Transformer model based on the PLMP unit and the pulse self-attention mechanism.

Based on the pulse self-attention mechanism, the whole Transformer model is converted to a spiking form. The self-attention mechanism is replaced with the pulse self-attention mechanism of Step 2, and the ReLU unit in the FeedForward unit is replaced with the PLMP unit of Step 1; image information and words are represented as event-driven spikes, and the whole network is then trained to obtain an energy-saving pulse Transformer model suitable for the picture description field that can generate high-quality picture descriptions. The implementation of pulse self-attention follows Step 2, and the pulse FeedForward unit is implemented as follows:

SpikingFeedForward(x) = W_2 PLMP(W_1 x)
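A minimal sketch of this unit as a module, reusing a PLMP activation such as the one sketched under Step 1 (class and parameter names are illustrative):

import torch.nn as nn

class SpikingFeedForward(nn.Module):
    # SpikingFeedForward(x) = W_2 PLMP(W_1 x)
    def __init__(self, dim, hidden_dim, plmp_factory):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim)
        self.plmp = plmp_factory()
        self.w2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.w2(self.plmp(self.w1(x)))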

Step 5: obtain the dataset and divide it into a training set, a validation set and a test set; the training set is used to train the pulse Transformer model, the validation set is used to select the optimal pulse Transformer model, and the test set is fed into the optimal pulse Transformer model, which outputs picture descriptions.

Train the pulse Transformer model. Training is carried out on the MSCOCO 2014 dataset, which contains 123287 images, each with 5 reference captions; MSCOCO is re-partitioned following the "Karpathy" split, with 113287 images used for training, 5000 images used for tuning hyperparameters and selecting the optimal pulse Transformer model, and 5000 images used for offline evaluation and outputting descriptions of the images. The present invention can also be trained on other datasets commonly used in the picture description field, such as Flickr30K, Flickr8K, VizWiz, TextCaps, Fashion Captioning and CUB-200.
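As a point of reference, the "Karpathy" re-partition mentioned here is commonly distributed as a dataset_coco.json annotation file; the sketch below assumes that file layout (the file name and fields are assumptions, not part of the patent):

import json
from collections import defaultdict

def load_karpathy_split(path="dataset_coco.json"):
    """Group MSCOCO 2014 images into train/val/test following the Karpathy split."""
    with open(path) as f:
        data = json.load(f)
    splits = defaultdict(list)
    for img in data["images"]:
        # 'restval' images are conventionally folded into training,
        # giving the 113287 / 5000 / 5000 partition described above.
        name = "train" if img["split"] in ("train", "restval") else img["split"]
        splits[name].append(img)
    return splits

splits = load_karpathy_split()
print({k: len(v) for k, v in splits.items()})  # expected: 113287 train, 5000 val, 5000 test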

The final effect of the present invention is that, given an input picture, a textual description of the picture is obtained; the effect is shown in Fig. 3.

The above are only preferred embodiments of the present invention and are not intended to limit it; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A picture description method based on a pulse Transformer model, characterized by comprising the following steps:
(1) Designing a pulse neuron PLMP, wherein each PLMP unit comprises a plurality of parallel LIF units with different membrane potential time constants and voltage thresholds;
(2) Realizing a pulse self-attention mechanism based on PLMP unit design;
(3) Constructing a Transformer model; the Transformer model is realized by adopting an encoder-decoder framework, wherein the encoder consists of one Swin Transformer and N refinement encoder blocks, and the decoder consists of N decoder blocks;
(4) Changing the Transformer model in step (3) to a pulse Transformer model based on the PLMP unit and the pulse self-attention mechanism;
(5) Acquiring a data set in the field of picture description, and dividing the data set into a training set, a verification set and a test set, wherein the training set is used for training a pulse Transformer model; the verification set is used for selecting an optimal pulse Transformer model; and inputting the test set into the optimal pulse Transformer model, and outputting a picture description.
2. The method of claim 1, wherein in the step (1), after receiving the input, each parallel LIF unit updates the membrane potential according to the respective membrane potential time constant, and if the membrane potential exceeds the voltage threshold corresponding to the parallel LIF unit, the parallel LIF unit generates a peak value; the output of PLMP is the set of peaks produced by all parallel LIF units.
3. The picture description method based on the pulse Transformer model according to claim 1, wherein the forward process of the pulse neuron PLMP is as follows:
Vth_k = tanh(z_k)
the PLMP unit introduces two trainable parameters m and z, which respectively represent a learnable membrane potential time constant parameter and a learnable membrane voltage parameter; z_k represents the learnable membrane voltage parameter of the k-th LIF unit in each PLMP unit, and Vth_k, the membrane voltage threshold of the k-th LIF unit, is obtained from it through the hyperbolic tangent function; m_k represents the learnable membrane potential time constant parameter of the k-th LIF unit in each PLMP unit and is calculated from τ_k, the membrane potential time constant of the k-th LIF unit in each PLMP unit; p(n-1) represents the number of neurons of the (n-1)-th layer; the forward process further involves the presynaptic input of the i-th neuron of the n-th layer at time t, the synaptic weight from the j-th neuron of the (n-1)-th layer to the i-th neuron of the n-th layer and the corresponding bias, the membrane potential vector and output vector of the k-th LIF unit of the i-th neuron of the n-th layer at time t+1, and the final output of the i-th PLMP unit of the n-th layer at time t+1, which participates in the calculation of all neurons of the (n+1)-th layer connected to that unit.
4. The picture description method based on the pulse Transformer model according to claim 1, wherein in the step (2), the Query, Key and Value are converted into pulses by the pulse neurons PLMP, and the formulas are as follows:
Q_i = PLMP(BN(X W_i^Q))
K_i = PLMP(BN(X W_i^K))
V_i = PLMP(BN(X W_i^V))
head_i = Q_i K_i^T V_i
S' = Concat(head_1, ..., head_h)
SpikingMSA(Q, K, V) = PLMP(BN(Linear(S')))
X is the input of the self-attention mechanism, and W_i^Q, W_i^K, W_i^V are learnable linear matrices; i = 1, 2, ..., h, where h represents that the attention mechanism has h heads; V_i represents the input feature vector of the i-th attention head, and Q_i, K_i are the feature vectors used by the i-th attention head to calculate the attention weights; BN is the Batch Normalization operation; head_i represents the output of the i-th attention head; S' is the multi-head result obtained by concatenating the outputs of the h attention heads; SpikingMSA is the modified pulse self-attention mechanism.
5. The picture description method based on the pulse Transformer model according to claim 1, wherein in the step (3), the Swin Transformer is used to extract grid features from an input image, and the grid features are average-pooled to obtain a global feature; the refinement encoder is used to capture the internal relations between grid features and grid features, and between grid features and the global feature; the relation between grid features and grid features is captured by SW/W-MSA; the relation between grid features and the global feature is captured by MSA, wherein the global feature is used as the Key in the self-attention mechanism; each refinement encoder block in the encoder first sends the obtained grid features and global feature into the self-attention mechanisms, sums the input of each self-attention mechanism with its output, normalizes the sum and sends it into a feedforward neural network, and finally outputs after residual summation and normalization, obtaining refined global and grid features.
6. The picture description method based on the pulse Transformer model according to claim 5, wherein the formula of each refinement encoder block in the encoder is as follows:
FeedForward(x)=W 2 ReLU(W 1 x)
the formulas involve the output grid features and the output global feature of the l-th encoder block, the learnable parameters W_1 and W_2, a concatenation operation on the grid features and the global feature, the result of adding the input and output of the sliding-window multi-head self-attention mechanism and then performing layer normalization, and the result of adding the input and output of the ordinary multi-head self-attention mechanism and then performing layer normalization.
7. The picture description method based on the pulse Transformer model according to claim 5, wherein the refined global feature is first fused into the input of the decoder, and global visual context information is captured through a first multimodal interaction; the Language Masked MSA module then captures the word-to-word intra-modal relations in the result;
the quantities involved are: the input of the (l-1)-th refinement encoder, whose output serves as the input of the l-th refinement encoder at time t; W_f, a learnable parameter of the linear layer; two further learnable parameter matrices; and the embedding vector corresponding to the word generated at time (t-1), each word calculating attention maps over the words generated before it;
finally, the Cross MSA Module models the multimodal relationship between this representation and the refined grid features, capturing local visual context information to generate the picture description.
8. The picture description method based on the pulse Transformer model according to claim 7, wherein the Cross MSA Module is implemented as follows:
wherein W_x and a further matrix are learnable parameters; one term is the output of the Cross MSA, for which the output of the Language Masked MSA serves as the Query and the refined grid features serve as the Key and Value; the last term is the final output of the Cross MSA Module.
9. The picture description method based on the pulse Transformer model according to claim 1, wherein in the step (4), a pulse self-attention mechanism is used to replace the self-attention mechanism, a PLMP unit is used to replace a ReLU unit in a FeedForward unit, and the implementation of the pulse FeedForward unit is as follows:
SpikingFeedForward(x) = W_2 PLMP(W_1 x).
10. The picture description method based on the pulse Transformer model according to claim 1, wherein the data sets in the picture description field include MSCOCO 2014, Flickr30K, Flickr8K, VizWiz, TextCaps, Fashion Captioning and CUB-200.
CN202310682762.3A 2023-06-09 2023-06-09 Picture description method based on pulse Transformer model Pending CN116701696A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310682762.3A CN116701696A (en) 2023-06-09 Picture description method based on pulse Transformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310682762.3A CN116701696A (en) 2023-06-09 Picture description method based on pulse Transformer model

Publications (1)

Publication Number Publication Date
CN116701696A true CN116701696A (en) 2023-09-05

Family

ID=87840584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310682762.3A Pending CN116701696A (en) 2023-06-09 2023-06-09 Picture description method based on pulse transducer model

Country Status (1)

Country Link
CN (1) CN116701696A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118966127A (en) * 2024-07-22 2024-11-15 浙江大学 Modeling method of 4-level pulse amplitude modulation high-speed transmitter based on non-autoregressive Transformer
CN118966127B (en) * 2024-07-22 2025-09-09 浙江大学 Modeling method of 4-level pulse amplitude modulation high-speed transmitter based on non-autoregressive transducer
CN119962593A (en) * 2025-01-23 2025-05-09 广东工业大学 A pulse transformer self-attention time domain interaction enhancement method and device
CN119962593B (en) * 2025-01-23 2025-10-21 广东工业大学 A method and device for enhancing the temporal domain interaction of pulse transformer self-attention

Similar Documents

Publication Publication Date Title
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
Xiang et al. A convolutional neural network-based linguistic steganalysis for synonym substitution steganography
CN109284506B (en) User comment emotion analysis system and method based on attention convolution neural network
Liu et al. Time series prediction based on temporal convolutional network
CN107092959B (en) Pulse neural network model construction method based on STDP unsupervised learning algorithm
CN106980683B (en) Blog text abstract generating method based on deep learning
CN112232087B (en) Specific aspect emotion analysis method of multi-granularity attention model based on Transformer
CN114998659B (en) Image data classification method using spiking neural network model trained online over time
CN114090780A (en) A fast image classification method based on cue learning
CN112860856B (en) Intelligent problem solving method and system for arithmetic application problem
CN113157919B (en) Sentence Text Aspect-Level Sentiment Classification Method and System
CN116579347A (en) Comment text emotion analysis method, system, equipment and medium based on dynamic semantic feature fusion
CN119692484B (en) Case question and answer method, medium and equipment based on large language model
CN116701696A (en) Picture description method based on pulse transducer model
CN116804999A (en) A robust visual question answering method based on debiased labels and global context
Yang et al. Recurrent neural network-based language models with variation in net topology, language, and granularity
CN111158640A (en) One-to-many demand analysis and identification method based on deep learning
CN116561314A (en) Text classification method for selecting self-attention based on self-adaptive threshold
CN111027292B (en) A method and system for generating a limited sampling text sequence
Zhong et al. Recurrent attention unit
CN115017314A (en) A text classification method based on attention mechanism
Yan et al. CQ+ Training: Minimizing Accuracy Loss in Conversion From Convolutional Neural Networks to Spiking Neural Networks
CN114548293A (en) Video-text cross-modal retrieval method based on cross-granularity self-distillation
Song et al. Transformer: A Survey and Application
CN112269876A (en) A text classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination