CN116701696A - Picture description method based on pulse transducer model - Google Patents
Picture description method based on pulse transducer model
- Publication number
- CN116701696A CN202310682762.3A CN202310682762A
- Authority
- CN
- China
- Prior art keywords
- plmp
- unit
- pulse
- self
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Library & Information Science (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Image Processing (AREA)
Abstract
Description
Technical Field
The present invention relates to the application of spiking neural networks to the field of image captioning (picture description), and in particular to an image captioning method based on a spiking Transformer model.
Background Art
In April 2022, Google released PaLM, a language model based on a general-purpose AI architecture. On November 30, 2022, OpenAI announced ChatGPT, a new conversational AI model fine-tuned from the GPT-3.5 series of large language models; it can not only hold natural multi-turn conversations and answer questions accurately and efficiently, but also generate program code, e-mails, papers, novels, and many other kinds of text. The popularity of ChatGPT has sparked enthusiasm for exploring large models both at home and abroad. As large language models mature, more and more technology companies and researchers are focusing on multimodal large models. On March 6, 2023, Google introduced PaLM-E, the largest vision-language multimodal model to date, which demonstrated remarkable performance on robot manipulation, visual question answering, image captioning, and pure language tasks. On March 14, 2023, OpenAI released GPT-4, a multimodal large language model with a larger scale, a richer knowledge base, and stronger context understanding. Domestic academia and technology companies have likewise announced, or will soon launch, similar conversational models, such as Baidu's Wenxin Yiyan (ERNIE Bot), Alibaba's Tongyi Qianwen, Huawei's Pangu model, and Tencent's Hunyuan model. Large models have become the general trend. However, training a large model is enormously expensive, and as parameter counts keep growing, computing power and training cost remain bottlenecks. At the Artificial Intelligence Large Model Technology Summit Forum held on April 8, 2023, Tian Qi, head of Huawei's large-model effort, stated that one round of development and training of a large model costs 12 million U.S. dollars, of which 7.2 million is spent on electricity. Cutting the cost and raising the efficiency of large models therefore has two aspects, optimizing computing power and optimizing power consumption, and the latter offers huge room for savings.
Spiking Neural Networks (SNNs), the third generation of artificial neural networks, have the advantages of low computational cost, low power consumption, and fast information transmission. Traditional Artificial Neural Networks (ANNs) are computational models based on the way biological nervous systems process information. Image captioning, a sub-field of multimodality and a sub-task of multimodal large models, can provide useful information for visually impaired people and can be used for automatic annotation of images and videos, so it has great research value. In the future, neuromorphic computing may, as an efficient approach, make it possible to reduce the cost and improve the efficiency of training multimodal large models through the co-evolution of neuromorphic chips and SNN algorithms. The present invention studies an energy-efficient spiking image-captioning model as an exploratory application of energy-efficient multimodal models. However, because spike activity is binary and non-differentiable, directly training SNNs can suffer from severe gradient vanishing and network degradation. Moreover, the spiking neurons in common use today are of a single type, and their membrane potential time constants and membrane voltage thresholds have to be specified as hyperparameters based on experience or optimization, which goes against the biological diversity of neurons.
Summary of the Invention
In view of the deficiencies of the prior art, the object of the present invention is to provide an image captioning method based on a spiking Transformer model.
The object of the present invention is achieved through the following technical solution: an image captioning method based on a spiking Transformer model, comprising the following steps:
(1) Design a spiking neuron, PLMP, in which each PLMP unit contains multiple parallel LIF units with different membrane potential time constants and voltage thresholds;
(2) Implement a spiking self-attention mechanism based on the PLMP unit;
(3) Build a Transformer model; the Transformer model adopts an encoder-decoder framework, in which the encoder consists of a Swin Transformer and N refinement encoder blocks and the decoder consists of N decoder blocks;
(4) Convert the Transformer model of step (3) into a spiking Transformer model based on the PLMP unit and the spiking self-attention mechanism;
(5) Obtain a data set in the image captioning field and split it into a training set, a validation set, and a test set; the training set is used to train the spiking Transformer model, the validation set is used to select the optimal spiking Transformer model, and the test set is fed into the optimal spiking Transformer model to output image captions. An overall assembly of these steps is sketched below.
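Before the individual steps are detailed, the following PyTorch-style sketch shows one way the pieces could be assembled end to end; the module names, interfaces, and shapes are illustrative assumptions rather than the implementation disclosed here.

```python
import torch.nn as nn

class SpikingCaptioner(nn.Module):
    """High-level assembly of steps (1)-(5) (sketch): a Swin backbone produces grid
    features, refinement encoder blocks refine grid/global features, and decoder
    blocks generate the caption word by word."""
    def __init__(self, backbone, enc_blocks, dec_blocks, vocab_size, dim):
        super().__init__()
        self.backbone = backbone                  # e.g. a pretrained Swin Transformer
        self.encoder = nn.ModuleList(enc_blocks)  # N refinement encoder blocks
        self.decoder = nn.ModuleList(dec_blocks)  # N decoder blocks
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, image, tokens):
        grid = self.backbone(image)               # (B, N, dim) grid features
        glob = grid.mean(dim=1, keepdim=True)     # average pooling -> global feature
        for blk in self.encoder:
            grid, glob = blk(grid, glob)          # refinement encoder blocks
        x = self.embed(tokens)                    # (B, T, dim) word embeddings
        for blk in self.decoder:
            x = blk(x, grid, glob)                # decoder blocks
        return self.head(x)                       # next-word logits
```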
Further, in step (1), after receiving an input, each parallel LIF unit updates its membrane potential according to its own membrane potential time constant; if the membrane potential exceeds the voltage threshold corresponding to that parallel LIF unit, the unit emits a spike. The output of the PLMP unit is the collection of the spikes produced by all of its parallel LIF units.
Further, the forward process of the spiking neuron PLMP is as follows:
Vth_k = tanh(z_k)
The PLMP unit introduces two trainable parameters, m and z, which represent the learnable membrane potential time constant parameter and the learnable membrane voltage parameter, respectively. z_k is the learnable membrane voltage parameter of the k-th LIF unit in each PLMP unit; passing it through the hyperbolic tangent function gives Vth_k, the membrane voltage threshold of the k-th LIF unit. m_k is the learnable membrane potential time constant parameter of the k-th LIF unit, from which τ_k, the membrane potential time constant of the k-th LIF unit in each PLMP unit, is computed. p(n-1) denotes the number of neurons in layer (n-1). The remaining quantities in the forward equations are the presynaptic input of the i-th neuron in layer n at time t; the synaptic weight from the j-th neuron in layer (n-1) to the i-th neuron in layer n and the corresponding bias; the membrane potential vector and output vector of the k-th LIF unit of the i-th neuron in layer n at time t+1; and the final output of the i-th PLMP unit in layer n at time t+1, which participates in the computation of all the neurons in layer (n+1) connected to that unit.
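A plausible form of the full forward pass, consistent with the description above and with standard LIF dynamics, is the following; the symbol names (I, w, b, u, o), the sigmoid mapping from m_k to τ_k, the hard reset, and the summation over the K parallel units are assumptions rather than text taken from this document.

```latex
% Plausible reconstruction of the PLMP forward pass (notation assumed):
\begin{align*}
\tau_k &= \operatorname{sigmoid}(m_k), \qquad Vth_k = \tanh(z_k) \\
I_i^{t+1,n} &= \sum_{j=1}^{p(n-1)} w_{ij}^{n}\, o_j^{t+1,n-1} + b_i^{n}
  && \text{presynaptic input} \\
u_{i,k}^{t+1,n} &= \tau_k\, u_{i,k}^{t,n}\,\bigl(1 - o_{i,k}^{t,n}\bigr) + I_i^{t+1,n}
  && \text{leaky integration with reset} \\
o_{i,k}^{t+1,n} &= \Theta\!\bigl(u_{i,k}^{t+1,n} - Vth_k\bigr)
  && \text{spike if the threshold is exceeded} \\
o_i^{t+1,n} &= \textstyle\sum_{k=1}^{K} o_{i,k}^{t+1,n}
  && \text{aggregate of the $K$ parallel LIF units}
\end{align*}
```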
Further, in step (2), the Query, Key, and Value are converted into spikes by the spiking neuron PLMP according to the following formulas:
Q_i = PLMP(BN(X W_i^Q))
K_i = PLMP(BN(X W_i^K))
V_i = PLMP(BN(X W_i^V))
head_i = Q_i K_i^T V_i
S' = Concat(head_1, ..., head_h)
SpikingMSA(Q, K, V) = PLMP(BN(Linear(S')))
X is the input of the self-attention mechanism, and W_i^Q, W_i^K, W_i^V are learnable linear matrices. i = 1, 2, ..., h, where h is the number of heads of the attention mechanism. V_i is the input-feature vector of the i-th attention head, and Q_i and K_i are the feature vectors used by the i-th attention head to compute the attention weights. BN is the Batch Normalization operation. head_i is the output of the i-th attention head. S' is the result of concatenating the outputs of the h attention heads. SpikingMSA is the modified spiking self-attention mechanism.
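A minimal PyTorch-style sketch of this spiking multi-head self-attention follows; the PLMP activation is taken as a factory argument (a concrete PLMP sketch appears in Embodiment One), the simulation time dimension is omitted for brevity, and all tensor shapes are assumptions made for illustration.

```python
import torch.nn as nn

class SpikingMSA(nn.Module):
    """Spiking multi-head self-attention (sketch): Q, K, V are binarized by PLMP
    neurons, so Q K^T V involves only sparse spike tensors and no softmax."""
    def __init__(self, dim, heads, plmp):  # plmp: callable returning a spiking activation module
        super().__init__()
        self.heads, self.d = heads, dim // heads
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        self.proj = nn.Linear(dim, dim)
        self.bn_q, self.bn_k, self.bn_v, self.bn_o = (nn.BatchNorm1d(dim) for _ in range(4))
        self.sn_q, self.sn_k, self.sn_v, self.sn_o = plmp(), plmp(), plmp(), plmp()

    def _spike(self, x, lin, bn, sn):
        # x: (B, N, dim) -> linear -> BN over channels -> spiking activation
        y = bn(lin(x).transpose(1, 2)).transpose(1, 2)
        return sn(y)

    def forward(self, x):
        B, N, _ = x.shape
        q = self._spike(x, self.wq, self.bn_q, self.sn_q)
        k = self._spike(x, self.wk, self.bn_k, self.sn_k)
        v = self._spike(x, self.wv, self.bn_v, self.sn_v)
        split = lambda t: t.view(B, N, self.heads, self.d).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        # spike-form attention: naturally non-negative, no softmax or scaling needed
        head = (q @ k.transpose(-2, -1)) @ v              # (B, heads, N, d)
        s = head.transpose(1, 2).reshape(B, N, -1)        # concatenate the heads
        out = self.bn_o(self.proj(s).transpose(1, 2)).transpose(1, 2)
        return self.sn_o(out)                             # PLMP(BN(Linear(S')))
```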
Further, in step (3), the Swin Transformer extracts grid features from the input image, and average pooling over the grid features yields a global feature. The refinement encoder captures the internal relationships between grid features and between the grid features and the global feature. The relationships among grid features are captured with SW-/W-MSA; the relationships between the grid features and the global feature are captured with MSA, in which the global feature serves as the Key in the self-attention mechanism. Each refinement encoder block in the encoder first feeds the obtained grid features and global feature into the self-attention mechanisms, sums the input of each self-attention mechanism with its output, passes the normalized result into a feed-forward network, and finally sums the residual and normalizes it again to output the refined global feature and grid features.
Further, the formulas for each refinement encoder block in the encoder are as follows:
FeedForward(x) = W_2 ReLU(W_1 x)
Here, the two outputs of the l-th encoder block are its refined grid features and its refined global feature. W_1 and W_2 are learnable parameters. The grid features and the global feature are concatenated before being fed to the attention. The remaining intermediate quantities are, respectively, the result of adding the input and output of the shifted-window multi-head self-attention and applying layer normalization, and the result of adding the input and output of the ordinary multi-head self-attention and applying layer normalization.
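The sketch below shows one plausible shape of a refinement encoder block under this description. The window-attention module is assumed to be supplied externally; the hidden width of the feed-forward net, the use of the global feature as both Key and Value, and the way the global feature itself is refreshed are all assumptions made for this sketch.

```python
import torch.nn as nn

class RefineEncoderBlock(nn.Module):
    """One refinement encoder block (sketch): grid-to-grid relations via a
    (shifted-)window attention module, grid-to-global relations via ordinary MSA
    with the global feature on the Key/Value side, then a feed-forward network.
    Residual + LayerNorm follows every sub-layer, as in a standard Transformer."""
    def __init__(self, dim, heads, window_attn):
        super().__init__()
        self.win_attn = window_attn                          # SW-MSA / W-MSA module, assumed provided
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

    def forward(self, grid, glob):
        # grid: (B, N, dim) grid features, glob: (B, 1, dim) pooled global feature
        grid = self.norm[0](grid + self.win_attn(grid))       # grid <-> grid, windowed attention
        attn, _ = self.cross(query=grid, key=glob, value=glob)
        grid = self.norm[1](grid + attn)                      # grid <-> global, plain MSA
        grid = self.norm[2](grid + self.ffn(grid))            # FeedForward(x) = W2 ReLU(W1 x)
        glob = self.norm[3](glob + grid.mean(dim=1, keepdim=True))  # refresh the global feature (assumed)
        return grid, glob
```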
Further, the refined global feature is first fused into the input of the decoder, and global visual context information is captured through a first multimodal interaction; a Language Masked MSA module then captures the word-to-word intra-modal relations in the resulting representation.
The output of the (l-1)-th decoder block serves as the input of the l-th decoder block at time step t. W_f is a learnable parameter of a linear layer; the remaining projection matrices are likewise learnable parameters. The embedding vector corresponding to the word generated at time step (t-1) is also used, and each word computes its attention map only over the words generated before it.
Finally, a Cross MSA Module models the multimodal relations between this representation and the refined grid features, capturing local visual context information to generate the image caption.
Further, the Cross MSA Module is implemented as follows:
Here, W_x and the other projection matrices are learnable parameters. The output of the Cross MSA is computed with the language-side representation from the previous module as the Query of the Cross MSA and the refined grid features as its Key and Value; the final output of the Cross MSA Module is then obtained from this Cross MSA output.
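The sketch below illustrates one decoder block consistent with this description: the refined global feature is fused into the word stream, masked self-attention runs over previously generated words, and a cross-attention then attends to the refined grid features. The fusion layer, the hidden width, and all shapes are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class CaptionDecoderBlock(nn.Module):
    """One decoder block (sketch): fuse the refined global feature into the word
    stream, apply masked self-attention over previously generated words, then
    cross-attend to the refined grid features to pick up local visual context."""
    def __init__(self, dim, heads):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)                 # W_f-style fusion of words and global feature
        self.masked_msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, words, grid, glob):
        # words: (B, T, dim) embeddings of words generated so far
        # grid:  (B, N, dim) refined grid features; glob: (B, 1, dim) refined global feature
        x = self.fuse(torch.cat([words, glob.expand(-1, words.size(1), -1)], dim=-1))
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1)
        a, _ = self.masked_msa(x, x, x, attn_mask=causal)    # each word attends only to earlier words
        x = self.norm[0](x + a)
        c, _ = self.cross_msa(x, grid, grid)                 # words attend to grid features
        x = self.norm[1](x + c)
        return self.norm[2](x + self.ffn(x))
```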
Further, in step (4), the spiking self-attention mechanism is used to replace the self-attention mechanism, and the PLMP unit is used to replace the ReLU unit in the FeedForward unit; the spiking FeedForward unit is implemented as follows:
SpikingFeedForward(x) = W_2 PLMP(W_1 x).
Further, the data sets in the image captioning field include MSCOCO 2014, Flickr30K, Flickr8K, VizWiz, TextCaps, Fashion Captioning, and CUB-200.
The beneficial effects of the present invention are:
1. A new spiking activation unit, PLMP, is proposed. Compared with the LIF unit commonly used in SNN models, it has multi-level learnable membrane potential time constants and thresholds. This can effectively alleviate the vanishing-gradient problem, weaken the influence of the initial parameter settings, accelerate the training process, and provide a better activation effect.
2. A new spiking self-attention mechanism, Spiking-MSA/W-MSA/SW-MSA, is implemented on the basis of PLMP. Using Query, Key, and Value in sparse spike form avoids multiplications in the computation and effectively reduces the amount of calculation.
3. The spiking neural network and the Transformer model are innovatively combined and applied to image captioning, achieving competitive results: an energy-efficient spiking Transformer model suited to the image captioning field that can generate high-quality captions. Moreover, this is the first application of spiking neural networks to image captioning.
Brief Description of the Drawings
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is a schematic diagram of the structure of the spiking self-attention mechanism;
Fig. 3 is a schematic diagram of the structure of the spiking Transformer model.
Detailed Description of the Embodiments
The specific embodiments of the present invention are described in further detail below in combination with the accompanying drawings and specific implementations. The following drawings are used to illustrate the present invention, but not to limit its scope.
Embodiment One
Referring to Fig. 1, Fig. 2, and Fig. 3: Fig. 1 is a flowchart of an image captioning algorithm based on a spiking Transformer model provided by an example of the present invention. The method comprises the following steps:
Step 1: Design a new type of spiking neuron, PLMP (Parallel LIF with Multistage Learnable Parameters);
Each PLMP unit contains multiple parallel LIF units with different membrane potential time constants and thresholds. After receiving the input, each LIF unit updates its membrane potential according to its own membrane potential time constant; if the membrane potential exceeds the threshold corresponding to that unit, the LIF unit emits a spike. The output of the PLMP unit is the collection of the spikes produced by all parallel LIF units. In this process the membrane potential time constants and thresholds are optimized automatically during training and no longer need to be set manually as hyperparameters before training. The forward process of the PLMP neuron is as follows:
Vth_k = tanh(z_k)
The PLMP unit introduces two trainable parameters, m and z, which represent the learnable membrane potential time constant parameter and the learnable membrane voltage threshold parameter, respectively. z_k is the learnable membrane voltage parameter of the k-th LIF unit in each PLMP unit; the hyperbolic tangent function gives Vth_k, the membrane voltage threshold of the k-th LIF unit. m_k is the learnable membrane potential time constant parameter of the k-th LIF unit, from which τ_k, the membrane potential time constant of the k-th LIF unit in each PLMP unit, is computed. n and p(n-1) denote the number of neurons in the n-th layer and the (n-1)-th layer, respectively. The remaining quantities in the forward equations are the presynaptic input of the i-th neuron in layer n at time t; the synaptic weight from the j-th neuron in layer (n-1) to the i-th neuron in layer n and the corresponding bias; the membrane potential vector and output vector of the k-th LIF unit of the i-th neuron in layer n at time t, with K the total number of LIF units; and the final output of the i-th PLMP unit in layer n at time t, which participates in the computation of all the neurons in layer (n+1) connected to that unit.
In this embodiment, each PLMP unit contains three LIF units as an example. The initial membrane voltage thresholds of the three LIF units are 0.6, 1.6, and 2.6, respectively; the initial membrane potential time constants are all 0.25; and the TimeStep is 4.
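A minimal PyTorch-style sketch of a PLMP unit with these settings follows; the surrogate gradient, the hard reset, learning the thresholds directly rather than through tanh(z), and summing the parallel spike trains are simplifying assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a rectangular surrogate gradient (an assumption)."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v >= 0).float()
    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        return grad_out * (v.abs() < 0.5).float()

class PLMP(nn.Module):
    """Parallel LIF with Multistage Learnable Parameters (sketch).
    Three parallel LIF units with learnable thresholds (init 0.6/1.6/2.6),
    learnable time constants (init 0.25), simulated for T = 4 time steps."""
    def __init__(self, thresholds=(0.6, 1.6, 2.6), tau=0.25, time_steps=4):
        super().__init__()
        # The document derives Vth_k = tanh(z_k); here the thresholds are learned
        # directly for simplicity, a deviation made only for this sketch.
        self.vth = nn.Parameter(torch.tensor(thresholds))
        self.tau = nn.Parameter(torch.full((len(thresholds),), float(tau)))
        self.T = time_steps

    def forward(self, x):
        # x: (T, B, ...) presynaptic input per time step
        u = torch.zeros(*x.shape[1:], self.vth.numel(), device=x.device)
        outs = []
        for t in range(self.T):
            u = self.tau * u + x[t].unsqueeze(-1)   # leaky integration per parallel unit
            s = SpikeFn.apply(u - self.vth)         # spike where membrane exceeds threshold
            u = u * (1.0 - s)                       # hard reset of fired units
            outs.append(s.sum(dim=-1))              # aggregate the K parallel spike trains
        return torch.stack(outs)                    # (T, B, ...) spike counts
```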
Step 2: Implement the spiking self-attention mechanism based on the PLMP unit
Converting Query, Key, and Value into spikes through the PLMP spiking neuron avoids floating-point matrix multiplications; the operations between matrices can be completed with logical AND operations and additions. Because the attention matrix computed from Q, K, and V in spike form is naturally non-negative, no Softmax is needed to keep the attention matrix non-negative. The specific formulas are as follows:
Q_i = PLMP(BN(X W_i^Q))
K_i = PLMP(BN(X W_i^K))
V_i = PLMP(BN(X W_i^V))
head_i = Q_i K_i^T V_i
S' = Concat(head_1, ..., head_h)
SpikingMSA(Q, K, V) = PLMP(BN(Linear(S')))
X is the input of the self-attention mechanism, and W_i^Q, W_i^K, W_i^V are learnable linear matrices. i = 1, 2, ..., h, where h is the number of heads of the attention mechanism. V_i is the input-feature vector of the i-th attention head, and Q_i and K_i are the feature vectors used by the i-th attention head to compute the attention weights. BN is the Batch Normalization operation. head_i is the output of the i-th attention head. S' is the result of concatenating the outputs of the h attention heads. SpikingMSA is the modified spiking self-attention mechanism.
Step 3: Build the Transformer model
The model is implemented with the widely used encoder-decoder framework. The encoder consists of a Swin Transformer and 3 refinement encoder blocks, and the decoder consists of 3 decoder blocks. The pre-trained Swin Transformer extracts grid features from the input image, and average pooling over the grid features yields the global feature. The refinement encoder captures the internal relationships between grid features and between the grid features and the global feature. The relationships among grid features are captured with SW-/W-MSA; the relationships between the grid features and the global feature are captured with MSA, with the global feature serving as the Key in the self-attention mechanism. Each refinement encoder block of the encoder first feeds the obtained grid features and global feature into the self-attention mechanisms; a residual structure is introduced here to address the vanishing-gradient problem in deep networks, so the input is summed with the output of the attention mechanism, normalized, and passed into a feed-forward network, and finally the residual sum is normalized to produce the output. The decoder uses the refined image grid features and generates the caption word by word by capturing the interrelationship between the text and the image grid features. The formulas for each refinement encoder block in the encoder are as follows:
FeedForward(x) = W_2 ReLU(W_1 x)
Here, the two outputs of the L-th encoder block are its refined grid features and its refined global feature. W_1 and W_2 are learnable parameters. The grid features and the global feature are concatenated before being fed to the attention. The remaining intermediate quantities are, respectively, the result of adding the input and output of the shifted-window multi-head self-attention and applying layer normalization, and the result of adding the input and output of the ordinary multi-head self-attention and applying layer normalization.
In the decoder, the refined Global Feature is first fused into the input of the decoder, and global visual context information is captured through a first multimodal interaction; a Language Masked MSA module then captures the word-to-word intra-modal relations in the resulting representation.
The output of the (l-1)-th decoder block serves as the input of the l-th decoder block at time step t. W_f is a learnable parameter of a linear layer; the remaining projection matrices are likewise learnable parameters. The embedding vector corresponding to the word generated at time step (t-1) is also used, and each word is only allowed to compute its attention map over the words generated before it.
Finally, a Cross MSA Module models the multimodal relations between this representation and the Grid Features, capturing local visual context information to generate the image caption. Through the two multimodal interactions between the sentence and the Global Feature and Grid Features, the reasoning ability is enhanced. The Cross MSA Module is implemented as follows:
Here, W_x and the other projection matrices are learnable parameters. The output of the Cross MSA is computed with the language-side representation from the previous module as the Query of the Cross MSA and the refined grid features as its Key and Value; the final output of the Cross MSA Module is then obtained from this Cross MSA output.
Step 4: Based on the PLMP unit and the spiking self-attention mechanism, convert the Transformer model of step (3) into a spiking Transformer model;
Based on the spiking self-attention mechanism, the whole Transformer model is converted into a spiking one. The self-attention mechanism is replaced with the spiking self-attention mechanism from Step 2, the ReLU unit in the FeedForward unit is replaced with the PLMP unit from Step 1, image information and words are represented as event-driven spikes, and the whole network is then trained to obtain an energy-efficient spiking Transformer model that is suited to the image captioning field and can generate high-quality captions. For the implementation of the spiking self-attention, refer to Step 2; the spiking FeedForward unit is implemented as follows:
SpikingFeedForward(x) = W_2 PLMP(W_1 x)
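A minimal sketch of this replacement, reusing a PLMP module such as the one sketched in Step 1, is shown below; the hidden width is an assumption.

```python
import torch.nn as nn

class SpikingFeedForward(nn.Module):
    """FeedForward with the ReLU replaced by a PLMP spiking activation (sketch):
    SpikingFeedForward(x) = W2 * PLMP(W1 * x)."""
    def __init__(self, dim, hidden, plmp):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.plmp = plmp                 # a PLMP instance, e.g. the sketch from Step 1
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        # x carries a leading time dimension so the PLMP can integrate over the T steps
        return self.w2(self.plmp(self.w1(x)))
```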
Step 5: Obtain a data set and split it into a training set, a validation set, and a test set; the training set is used to train the spiking Transformer model, the validation set is used to select the optimal spiking Transformer model, and the test set is fed into the optimal spiking Transformer model to output image captions.
Train the spiking Transformer model. Training is performed on the MSCOCO 2014 data set, which contains 123,287 images, each with 5 reference captions. The "Karpathy" split is followed to re-partition MSCOCO: 113,287 images are used for training; 5,000 images are used to pick hyperparameters and select the optimal spiking Transformer model; and 5,000 images are used for offline evaluation, outputting descriptions of the images. The present invention can also be trained on other data sets commonly used in the image captioning field, such as Flickr30K, Flickr8K, VizWiz, TextCaps, Fashion Captioning, and CUB-200.
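As a sketch of the data preparation described here, the Karpathy split can be loaded roughly as follows; the file name and JSON field names assume the commonly distributed "dataset_coco.json" split file rather than anything specified in this text.

```python
import json
from collections import defaultdict

def load_karpathy_split(path="dataset_coco.json"):
    """Partition MSCOCO 2014 into train / val / test following the Karpathy split
    (113,287 / 5,000 / 5,000 images, 5 reference captions each)."""
    splits = defaultdict(list)
    with open(path) as f:
        data = json.load(f)
    for img in data["images"]:
        split = "train" if img["split"] in ("train", "restval") else img["split"]
        captions = [s["raw"] for s in img["sentences"]]
        splits[split].append({"filename": img["filename"], "captions": captions})
    return splits["train"], splits["val"], splits["test"]

# train_set trains the spiking Transformer, val_set selects the optimal model,
# and test_set is used for offline caption evaluation.
train_set, val_set, test_set = load_karpathy_split()
```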
The final effect of the present invention is that, given an input picture, a textual description of that picture is obtained; the effect is shown in Fig. 3.
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, various modifications and changes may be made to the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310682762.3A CN116701696A (en) | 2023-06-09 | 2023-06-09 | Picture description method based on pulse transducer model |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310682762.3A CN116701696A (en) | 2023-06-09 | 2023-06-09 | Picture description method based on pulse transducer model |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116701696A true CN116701696A (en) | 2023-09-05 |
Family
ID=87840584
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310682762.3A Pending CN116701696A (en) | 2023-06-09 | 2023-06-09 | Picture description method based on pulse transducer model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116701696A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118966127A (en) * | 2024-07-22 | 2024-11-15 | 浙江大学 | Modeling method of 4-level pulse amplitude modulation high-speed transmitter based on non-autoregressive Transformer |
| CN119962593A (en) * | 2025-01-23 | 2025-05-09 | 广东工业大学 | A pulse transformer self-attention time domain interaction enhancement method and device |
- 2023-06-09: CN application CN202310682762.3A published as CN116701696A (status: active, Pending)
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118966127A (en) * | 2024-07-22 | 2024-11-15 | 浙江大学 | Modeling method of 4-level pulse amplitude modulation high-speed transmitter based on non-autoregressive Transformer |
| CN118966127B (en) * | 2024-07-22 | 2025-09-09 | 浙江大学 | Modeling method of 4-level pulse amplitude modulation high-speed transmitter based on non-autoregressive transducer |
| CN119962593A (en) * | 2025-01-23 | 2025-05-09 | 广东工业大学 | A pulse transformer self-attention time domain interaction enhancement method and device |
| CN119962593B (en) * | 2025-01-23 | 2025-10-21 | 广东工业大学 | A method and device for enhancing the temporal domain interaction of pulse transformer self-attention |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110163299B (en) | Visual question-answering method based on bottom-up attention mechanism and memory network | |
| Xiang et al. | A convolutional neural network-based linguistic steganalysis for synonym substitution steganography | |
| CN109284506B (en) | User comment emotion analysis system and method based on attention convolution neural network | |
| Liu et al. | Time series prediction based on temporal convolutional network | |
| CN107092959B (en) | Pulse neural network model construction method based on STDP unsupervised learning algorithm | |
| CN106980683B (en) | Blog text abstract generating method based on deep learning | |
| CN112232087B (en) | Specific aspect emotion analysis method of multi-granularity attention model based on Transformer | |
| CN114998659B (en) | Image data classification method using spiking neural network model trained online over time | |
| CN114090780A (en) | A fast image classification method based on cue learning | |
| CN112860856B (en) | Intelligent problem solving method and system for arithmetic application problem | |
| CN113157919B (en) | Sentence Text Aspect-Level Sentiment Classification Method and System | |
| CN116579347A (en) | Comment text emotion analysis method, system, equipment and medium based on dynamic semantic feature fusion | |
| CN119692484B (en) | Case question and answer method, medium and equipment based on large language model | |
| CN116701696A (en) | Picture description method based on pulse transducer model | |
| CN116804999A (en) | A robust visual question answering method based on debiased labels and global context | |
| Yang et al. | Recurrent neural network-based language models with variation in net topology, language, and granularity | |
| CN111158640A (en) | One-to-many demand analysis and identification method based on deep learning | |
| CN116561314A (en) | Text classification method for selecting self-attention based on self-adaptive threshold | |
| CN111027292B (en) | A method and system for generating a limited sampling text sequence | |
| Zhong et al. | Recurrent attention unit | |
| CN115017314A (en) | A text classification method based on attention mechanism | |
| Yan et al. | CQ+ Training: Minimizing Accuracy Loss in Conversion From Convolutional Neural Networks to Spiking Neural Networks | |
| CN114548293A (en) | Video-text cross-modal retrieval method based on cross-granularity self-distillation | |
| Song et al. | Transformer: A Survey and Application | |
| CN112269876A (en) | A text classification method based on deep learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |