CN117892175A - SNN multi-mode target identification method, system, equipment and medium - Google Patents
- Publication number
- CN117892175A (application CN202410066331.9A)
- Authority
- CN
- China
- Prior art keywords
- multimodal
- output
- auditory
- visual
- unimodal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/24: Pattern recognition > Analysing > Classification techniques
- G06F18/10: Pattern recognition > Pre-processing; Data cleansing
- G06F18/213: Pattern recognition > Analysing > Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation > Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/241: Pattern recognition > Analysing > Classification techniques > Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/0464: Computing arrangements based on biological models > Neural networks > Architecture, e.g. interconnection topology > Convolutional networks [CNN, ConvNet]
- G06N3/049: Computing arrangements based on biological models > Neural networks > Architecture, e.g. interconnection topology > Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/09: Computing arrangements based on biological models > Neural networks > Learning methods > Supervised learning
- G06F2218/04: Aspects of pattern recognition specially adapted for signal processing > Preprocessing > Denoising
- G06F2218/08: Aspects of pattern recognition specially adapted for signal processing > Feature extraction
- G06F2218/12: Aspects of pattern recognition specially adapted for signal processing > Classification; Matching
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular to an SNN multimodal target recognition method, system, device, and medium.
Background
In today's era of information explosion, multimodal learning has become an important branch of artificial intelligence. By integrating information from different modalities, it aims to build models with better performance and stronger robustness. Multimodal models draw on multiple information sources such as text, images, and sound, and are applied in complex tasks such as visual question answering, sentiment analysis, medical image processing, and cross-modal retrieval. With the continuous development of training algorithms for spiking neural networks (SNNs), the applicability of SNNs in various scenarios has steadily grown, and their high biological interpretability offers certain advantages for multimodal problems. On the one hand, the neurons in an SNN simulate the dynamic behavior of real neurons: they learn not only the spatial features of multimodal data but also exploit the temporal information of each modality efficiently, making them particularly suitable for data with a temporal dimension such as video and speech. On the other hand, the event-driven nature of SNNs keeps them efficient on sparse modal data, which can significantly reduce the energy consumption of multimodal processing. Researchers have used SNNs to process event, visual, and audio data for tasks such as lip reading, speech recognition, and object recognition. Reference 1 fuses information from a dynamic vision sensor (DVS) and a dynamic audio sensor (DAS) to perform lip reading, filtering noise events by computing the cross-correlation between the event streams of different modalities; this aligns the modalities and resolves the temporal offset in audio-visual event-stream recordings. Reference 2 borrows gating logic and composes a supermodal layer from excitatory and inhibitory synaptic connections to couple cross-modal information. Reference 3 uses a convolutional spiking neural network and a recurrent spiking neural network to process DVS and DAS data respectively, and applies an attention mechanism to weight the outputs of the unimodal networks for selective fusion. SNN-based multimodal algorithms have significantly improved overall model performance across a variety of tasks.
Compared with unimodal models, the complementary information carried by the different modalities of a multimodal model can usually help optimize model performance and improve robustness to noise and missing data. However, existing SNN multimodal algorithms still suffer from modality imbalance and time-scale mismatch, and these challenges limit the effectiveness of SNNs in multimodal tasks.
The modality imbalance problem refers to the phenomenon that, in joint multimodal learning, the best unimodal model outperforms the multimodal model, and the degradation grows more pronounced as the number of input streams increases; even when joint multimodal training beats unimodal training overall, the individual unimodal branches rarely match the performance of unimodal models trained alone. Modality imbalance prevents the unimodal branches from fully converging, weakening the multimodal model's ability to exploit information from each modality. From the perspective of the optimization process, the problem is mainly caused by differences in the convergence speed of the modalities: when a multimodal model is trained jointly with a single shared learning rate, the faster-converging modality may dominate the optimization while the other modalities remain under-trained. Existing algorithms for modality imbalance therefore mostly regulate learning by monitoring the convergence of each modality, slowing the optimization of the dominant modality to relieve the underfitting of the others and thereby improve overall performance. For example, Reference 4 first proposed a gradient-blending offline adjustment algorithm that reweights each modality's loss by the ratio of its overfitting to its generalization; however, it only works offline and cannot adjust in real time. Reference 5 introduced an adaptive tracking factor that adjusts each modality's learning rate on the fly, so that modalities closer to convergence receive lower learning rates. To measure how much a multimodal model relies on each modality, Reference 6 proposed the notion of conditional utilization rate and relaxed it to a conditional learning speed to enable online regulation. Reference 7 proposed an on-the-fly gradient modulation algorithm that dynamically monitors the difference in each modality's contribution to the learning objective during training, adaptively modulates the gradients accordingly, and injects Gaussian noise to enhance generalization. However, when the accuracy gap between modalities is large, on-the-fly gradient modulation cannot regulate the gradients well, limiting the achievable gain from multimodal fusion.
When SNNs are applied to multimodal tasks, beyond the modality imbalance caused by differing convergence speeds, the mismatched time scales and differing event sparsity of the modalities pose further challenges to fully exploiting multimodal information. The time-scale mismatch arises because the modalities differ in their sensitivity to time: static visual data has no intrinsic temporal dimension, and repeating it over multiple time steps can create redundancy; dynamic visual data is inherently high in temporal resolution and variability and needs a certain number of time steps to capture fast, fine-grained changes in the event stream; and the auditory modality usually carries rich temporal information and needs comparatively many time steps to capture the evolution of sound accurately, especially in complex audio scenes. Existing SNN multimodal algorithms, however, usually assign the same number of time steps to all modalities and do not handle the time-scale mismatch between them. This leads to over-expression or loss of information in some modalities, further complicating the modality imbalance problem.
References:
Reference 1. Li, X., Neil, D., Delbruck, T. & Liu, S.-C. Lip reading deep network exploiting multi-modal spiking visual and auditory sensors. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS) 1-5 (IEEE, 2019).
Reference 2. Wysoski, S. G., Benuskova, L. & Kasabov, N. Brain-like evolving spiking neural networks for multimodal information processing. In Brain-Inspired Information Technology 15-27 (Springer, 2010).
Reference 3. Liu, Q., Xing, D., Feng, L., Tang, H. & Pan, G. Event-based multimodal spiking neural network with attention mechanism. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 8922-8926 (IEEE, 2022).
Reference 4. Wang, W., Tran, D. & Feiszli, M. What makes training multi-modal classification networks hard? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 12695-12705 (2020).
Reference 5. Sun, Y., Mai, S. & Hu, H. Learning to balance the learning rates between various modalities via adaptive tracking factor. IEEE Signal Process. Lett. 28, 1650-1654 (2021).
Reference 6. Wu, N., Jastrzebski, S., Cho, K. & Geras, K. J. Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks. In International Conference on Machine Learning 24043-24055 (PMLR, 2022).
Reference 7. Peng, X., Wei, Y., Deng, A., Wang, D. & Hu, D. Balanced multimodal learning via on-the-fly gradient modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 8238-8247 (2022).
Summary of the Invention
The purpose of the present invention is to provide an SNN multimodal target recognition method, system, device, and medium that solve the problems of multimodal convergence imbalance and time-scale mismatch.
To achieve the above object, the present invention provides the following solutions:
An SNN multimodal target recognition method, comprising:
extracting key features from multimodal information with feature extractors, where the key features include visual information and auditory information, and the feature extractors include a visual feature extractor and an auditory feature extractor;
adjusting the time scales of the key features of the different unimodal inputs with convolutional time alignment modules to determine time-aligned unimodal key features, where the convolutional time alignment modules include a visual convolutional time alignment module and an auditory convolutional time alignment module;
simulating forward propagation by matrix multiplication of each time-aligned unimodal key feature with the multimodal classifier parameters to obtain unimodal outputs and a multimodal output, evaluating from these outputs the proportion that each unimodal output contributes to the target task, and determining the modality modulation factors;
determining the cross-entropy losses of the unimodal outputs and the multimodal output, dynamically adjusting the loss function according to the modality modulation factors and the cross-entropy losses, and determining the final loss of the spiking neural network for recognizing the spiking neural network's multimodal targets, where the final loss is the optimization objective of the spiking neural network.
Optionally, before the key features are extracted from the multimodal information with the feature extractors, the method further comprises:
encoding the visual information and the auditory information respectively to generate encoded visual information and encoded auditory information;
constructing a multimodal data set from the encoded visual information and the encoded auditory information, where the multimodal data set includes the multimodal information.
Optionally, extracting the key features from the multimodal information with the feature extractors specifically comprises:
passing the visual information through the visual feature extractor, composed of three spiking fully connected layers, to generate visual features over Te time steps;
passing the auditory information through the auditory feature extractor, composed of a three-layer recurrent spiking network, to generate auditory features over Ta time steps.
Optionally, adjusting the time scales of the key features of the different unimodal inputs with the convolutional time alignment modules to determine the time-aligned unimodal key features specifically comprises: the visual convolutional time alignment module includes T convolution kernels of size 1×1×Te, and the auditory convolutional time alignment module includes T convolution kernels of size 1×1×Ta;
processing the visual features with the visual convolutional time alignment module to generate visually aligned features over T time steps;
processing the auditory features with the auditory convolutional time alignment module to generate auditorily aligned features over T time steps.
Optionally, simulating forward propagation by matrix multiplication of each time-aligned unimodal key feature with the multimodal classifier parameters to obtain the unimodal outputs and the multimodal output, and evaluating from these outputs the proportion that each unimodal output contributes to the target task to determine the modality modulation factors, specifically comprises:
determining the multimodal output according to the formula $\hat{y}^f = W[\tilde{x}^e; \tilde{x}^a] + b$, where $\hat{y}^f$ is the multimodal output, $W$ the fully connected layer parameters, $\tilde{x}^e$ the visually aligned features, $\tilde{x}^a$ the auditorily aligned features, and $b$ the bias term;
determining the visual modality output according to the formula $\hat{y}^e = W_e \tilde{x}^e + \frac{b}{2}$, where $\hat{y}^e$ is the visual modality output and $W_e$ the visual classifier parameters;
determining the auditory modality output according to the formula $\hat{y}^a = W_a \tilde{x}^a + \frac{b}{2}$, where $\hat{y}^a$ is the auditory modality output and $W_a$ the auditory classifier parameters;
inputting the visual modality output and the auditory modality output into a softmax function to determine the visual modality scores and the auditory modality scores;
determining, from the visual modality scores and the auditory modality scores, the visual modality's contribution proportion to the target task and the auditory modality's contribution proportion to the target task;
determining the modality modulation factors from the visual contribution proportion and the auditory contribution proportion.
Optionally, the cross-entropy losses of the unimodal outputs include a cross-entropy loss for the visual modality and a cross-entropy loss for the auditory modality;
the cross-entropy loss $L_e$ for the visual modality is $L_e = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}\mathbf{1}(j = y_i)\log s^e_{i,j}$, where $C$ is the number of sample classes, $j$ the class index, $N$ the number of samples, $y_i$ the true class label of the $i$-th sample, and $s^e_i$ the softmax score of the visual modality output;
the cross-entropy loss $L_a$ for the auditory modality is $L_a = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}\mathbf{1}(j = y_i)\log s^a_{i,j}$;
the cross-entropy loss $L_f$ for the multimodal output is $L_f = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}\mathbf{1}(j = y_i)\log s^f_{i,j}$.
Optionally, the final loss $L$ is $L = L_f + \beta\,(k^a L_a + k^e L_e)$, where $\beta$ is a hyperparameter, $k^a$ the auditory modulation factor, and $k^e$ the visual modulation factor.
An SNN multimodal target recognition system, comprising:
a feature extraction module for extracting key features from multimodal information with feature extractors, where the key features include visual information and auditory information, and the feature extractors include a visual feature extractor and an auditory feature extractor;
a convolutional time alignment module for adjusting the time scales of the key features of the different unimodal inputs to determine time-aligned unimodal key features;
a simulated feedforward module for simulating forward propagation by matrix multiplication of each time-aligned unimodal key feature with the multimodal classifier parameters to obtain the unimodal outputs and the multimodal output, and for evaluating from these outputs the proportion that each unimodal output contributes to the target task to determine the modality modulation factors;
an online loss adjustment module for determining the cross-entropy losses of the unimodal outputs and the multimodal output, dynamically adjusting the loss function according to the modality modulation factors and the cross-entropy losses, and determining the final loss of the spiking neural network, where the final loss is the optimization objective of the spiking neural network.
An electronic device, comprising a memory and a processor, where the memory stores a computer program and the processor runs the computer program so that the electronic device executes the above SNN multimodal target recognition method.
A computer-readable storage medium storing a computer program that, when executed by a processor, implements the above SNN multimodal target recognition method.
According to the specific embodiments provided by the present invention, the following technical effects are disclosed:
The present invention modifies the traditional multimodal loss function by introducing additional cross-entropy losses for the unimodal outputs. These unimodal losses supervise unimodal feature extraction, so that the invention attends not only to the quality of multimodal classification but also to the representation quality of each unimodal feature extractor, supervising the model to fully learn the information within each modality. In addition, since some algorithms can only address modality imbalance offline, the present invention borrows from the on-the-fly gradient modulation algorithm the method of evaluating each modality's contribution to the objective, places a modality modulation factor in front of each unimodal loss, and lowers the learning rate of the faster-converging dominant modality online, thereby alleviating the modality imbalance problem. The multimodal model involved in the present invention is an SNN multimodal model.
The present invention also proposes a convolutional time alignment module: by appending a fixed number of convolution kernels after each modality's feature extractor, it adjusts the time scales of the key features of the different unimodal inputs, determines time-aligned unimodal key features, and accommodates the time-step differences between modal data.
By adopting the two key techniques of online loss adjustment and convolutional time alignment, the present invention effectively resolves the two main problems of multimodal convergence imbalance and time-scale mismatch, thereby significantly improving the overall performance of the multimodal model in target recognition tasks.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the SNN multimodal target recognition method provided by the present invention;
FIG. 2 is a structural diagram of the SNN multimodal target recognition system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The purpose of the present invention is to provide an SNN multimodal target recognition method, system, device, and medium that can effectively solve the problems of multimodal convergence imbalance and time-scale mismatch and significantly improve the overall performance of the multimodal model in target recognition tasks.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1
As shown in FIG. 1, the present invention provides an SNN multimodal target recognition method, comprising:
Step 101: extracting key features from multimodal information with feature extractors; the key features include visual information and auditory information; the feature extractors include a visual feature extractor and an auditory feature extractor.
In practical applications, step 101 is preceded by encoding and preprocessing the input multimodal information to accommodate the input requirements of multiple modalities.
Visual information encoding: In this multimodal target recognition system, the visual modality uses the N-MNIST data set as input. The data set was recorded with an event sensor while the original MNIST handwritten images were displayed one by one on a slowly moving monitor. N-MNIST contains 60,000 training samples and 10,000 test samples, each 34×34 pixels. Visual event information is encoded as follows: the event stream of total count N is divided into Te segments of approximately equal event count, and the events within each segment are integrated into one frame, finally yielding a visual input of Te time steps. The input format of each frame is defined as (t, p, h, w), where t is the current time step, p the event polarity, and (h, w) the frame size, giving a visual input of 2312×Te. In this experiment the number of time steps Te is set to 5.
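By way of illustration, this event-to-frame integration can be sketched in Python as follows (the function name, the (t, p, x, y) row layout, and the equal-count split via np.array_split are assumptions of this sketch, not the patent's exact implementation):

```python
import numpy as np

def events_to_frames(events, T_e=5, H=34, W=34):
    """Integrate an N-MNIST event stream into T_e frame vectors.

    events: array-like of (t, p, x, y) rows sorted by time, where p is
    the polarity (0 or 1). Each of the T_e segments holds an
    approximately equal share of the events, and the events of a
    segment are accumulated into one 2-channel frame.
    """
    events = np.asarray(events)
    frames = np.zeros((T_e, 2, H, W), dtype=np.float32)
    # Split the event indices into T_e nearly equal-count segments.
    for t, idx in enumerate(np.array_split(np.arange(len(events)), T_e)):
        for _, p, x, y in events[idx]:
            frames[t, int(p), int(y), int(x)] += 1.0
    # Flatten each frame: 2 * 34 * 34 = 2312, matching the input size.
    return frames.reshape(T_e, -1)
```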
Auditory information encoding: To keep the multimodal classes consistent, the auditory modality of this multimodal target recognition system uses the digit (0-9) audio samples of the Google Speech Commands data set. The data set contains 38,908 samples in total, split into training and test sets at a ratio of 9:1. During preprocessing, each auditory sample is first converted into a mel spectrogram whose number of frames equals the given auditory time-step count Ta; third-order differences are computed from the spectrogram to form 3-channel input data, which is then normalized across time steps. The resulting auditory input format is 120×Ta. In this experiment the time step Ta is set to 20, and the time step T of the subsequent multimodal fusion module is also set to 20.
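A sketch of this auditory encoding is given below, using librosa (an assumed library choice; the patent names no tool). Reading the three channels as the log-mel spectrogram plus its first- and second-order deltas reproduces the stated 120×Ta format with 40 mel bands, since 3×40 = 120; the patent may stack the differences differently:

```python
import librosa
import numpy as np

def encode_audio(path, T_a=20, n_mels=40):
    """Encode a speech sample as a (3, n_mels, T_a) auditory input."""
    y, sr = librosa.load(path, sr=None)
    hop = max(1, len(y) // T_a)              # aim for roughly T_a frames
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop)
    mel = librosa.power_to_db(mel)[:, :T_a]  # clip to exactly T_a frames
    d1 = librosa.feature.delta(mel)          # first-order difference
    d2 = librosa.feature.delta(mel, order=2) # second-order difference
    x = np.stack([mel, d1, d2])              # 3 channels: (3, n_mels, T_a)
    # Normalize across time steps, as described above.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
```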
Multimodal data set construction: First, a class label $y_i$ is selected at random from 0-9; then a visual sample $x^e_i$ and an auditory sample $x^a_i$ are drawn at random from the data of the corresponding class in the visual and auditory modalities respectively. This process builds the training set $D = \{x_i, y_i\}$, $i = 1, 2, \ldots, N$, where $x_i = (x^e_i, x^a_i)$. The test set is constructed in the same way. In the experimental verification of the present invention, the training set contains 10,000 samples and the test set 2,000 samples. This step ensures that the system can effectively process and recognize sample data from different modalities, verifying the system's practicality and effectiveness.
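The label-matched pairing can be sketched as follows (the dictionary names are illustrative; any indexing of the encoded samples by class label works):

```python
import random

def build_pairs(visual_by_label, audio_by_label, n_samples=10000, n_classes=10):
    """Build D = {((x_e, x_a), y)} by pairing samples that share a label."""
    dataset = []
    for _ in range(n_samples):
        y = random.randrange(n_classes)          # random class label 0-9
        x_e = random.choice(visual_by_label[y])  # random N-MNIST sample
        x_a = random.choice(audio_by_label[y])   # random speech sample
        dataset.append(((x_e, x_a), y))
    return dataset
```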
Feature extraction is performed on the encoded information to obtain the visual features $\hat{x}^e$ and the auditory features $\hat{x}^a$. The feature extraction operation comprises the following steps:
In the present invention, the visual feature extractor $\phi^e$ consists of three spiking fully connected layers of sizes 2312-800-128, whose parameters are denoted by the learnable network parameters $\theta^e$. After the visual information passes through the feature extractor, the visual features over Te time steps are obtained as follows: $\hat{x}^e_i = \phi^e(\theta^e, x^e_i)$.
The auditory feature extractor $\phi^a$ consists of a three-layer recurrent spiking network of sizes 120-240-128, whose parameters are denoted by the learnable network parameters $\theta^a$. After the auditory information passes through the feature extractor, the auditory features over Ta time steps are obtained as follows: $\hat{x}^a_i = \phi^a(\theta^a, x^a_i)$.
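A minimal PyTorch sketch of a spiking fully connected stack in the spirit of $\phi^e$ follows. It covers the forward pass only: the leaky integrate-and-fire dynamics (decay constant, hard reset, threshold) are illustrative choices, and training such a layer would additionally require a surrogate gradient for the non-differentiable threshold, which is omitted here:

```python
import torch
import torch.nn as nn

class LIFLayer(nn.Module):
    """Linear projection followed by leaky integrate-and-fire neurons."""
    def __init__(self, d_in, d_out, tau=2.0, v_th=1.0):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.decay, self.v_th = 1.0 - 1.0 / tau, v_th

    def forward(self, x):                  # x: (B, T, d_in)
        B, T, _ = x.shape
        v = torch.zeros(B, self.fc.out_features, device=x.device)
        spikes = []
        for t in range(T):                 # step the membrane over time
            v = self.decay * v + self.fc(x[:, t])
            s = (v >= self.v_th).float()   # fire on threshold crossing
            v = v * (1.0 - s)              # hard reset after a spike
            spikes.append(s)
        return torch.stack(spikes, dim=1)  # (B, T, d_out)

# A 2312-800-128 stack matching the visual extractor's layer sizes.
visual_extractor = nn.Sequential(LIFLayer(2312, 800), LIFLayer(800, 128))
```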
Step 102: adjusting the time scales of the key features of the different unimodal inputs with the convolutional time alignment modules to determine the time-aligned unimodal key features; the convolutional time alignment modules include a visual convolutional time alignment module and an auditory convolutional time alignment module.
In practical applications, modality features with different time scales are aligned, finally yielding the visually aligned features $\tilde{x}^e$ and the auditorily aligned features $\tilde{x}^a$. The convolutional time alignment operation comprises the following steps:
Temporal alignment of the visual features: the visual features are processed by the visual convolutional time alignment module to obtain the visually aligned features over T time steps. The module consists of T convolution kernels of size 1×1×Te: $\tilde{x}^e_i = \mathrm{Conv}(\theta^e_c, \hat{x}^e_i)$, where $\theta^e_c$ denotes the model parameters of the visual convolutional time alignment module.
Temporal alignment of the auditory features: the auditory features are processed by the auditory convolutional time alignment module to obtain the auditorily aligned features over T time steps. The module consists of T convolution kernels of size 1×1×Ta: $\tilde{x}^a_i = \mathrm{Conv}(\theta^a_c, \hat{x}^a_i)$, where $\theta^a_c$ denotes the model parameters of the auditory convolutional time alignment module.
The time alignment in this step ensures the consistency of the different modality features on the time scale, supporting the effective operation of the multimodal target recognition system.
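Since each of the T output steps is a learned linear combination of the Te (or Ta) input steps applied identically at every feature position, the 1×1×Te kernels can be realized as a 1D convolution whose channels are time steps. A sketch under that reading:

```python
import torch
import torch.nn as nn

class ConvTimeAlign(nn.Module):
    """Map T_in feature time steps to T_out aligned time steps."""
    def __init__(self, T_in, T_out):
        super().__init__()
        # T_out kernels of size 1x1xT_in = a channel-mixing 1x1 convolution.
        self.mix = nn.Conv1d(T_in, T_out, kernel_size=1)

    def forward(self, x):   # x: (B, T_in, D) features over time
        return self.mix(x)  # (B, T_out, D)

# Align T_e = 5 visual steps and T_a = 20 auditory steps to T = 20.
align_e, align_a = ConvTimeAlign(5, 20), ConvTimeAlign(20, 20)
```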
Step 103: simulating forward propagation by matrix multiplication of each time-aligned unimodal key feature with the multimodal classifier parameters to obtain the unimodal outputs and the multimodal output, and evaluating from these outputs the proportion that each unimodal output contributes to the target task to determine the modality modulation factors.
In practical applications, the multimodal output and the unimodal outputs are obtained by simulating the feedforward process, and the target contribution of each modality is computed, through the following steps:
Computing the multimodal output: the model classifier consists of a 256-10 spiking fully connected layer, with $W$ the fully connected layer parameters and $b$ the bias term. The multimodal output is first obtained by feedforward: $\hat{y}^f_i = W[\tilde{x}^e_i; \tilde{x}^a_i] + b$.
Simulating unimodal feedforward: to estimate the contribution of each modality to the objective, the original classifier is split into a 128-10 visual classifier and a 128-10 auditory classifier, $W = [W_e, W_a]$, where $W_e$ and $W_a$ are the parameters of the visual and auditory classifiers respectively. Simulating visual and auditory feedforward classification yields the visual and auditory modality outputs $\hat{y}^e_i = W_e \tilde{x}^e_i + \frac{b}{2}$ and $\hat{y}^a_i = W_a \tilde{x}^a_i + \frac{b}{2}$.
Computing the multimodal contribution proportions: the unimodal outputs are passed through the softmax function to obtain the modality scores $s^e_i = \mathrm{softmax}(\hat{y}^e_i)$ and $s^a_i = \mathrm{softmax}(\hat{y}^a_i)$. The visual and auditory contribution proportions of the current batch, $\rho^e_t$ and $\rho^a_t$, are then computed from the modality scores as $\rho^e_t = \frac{\sum_{i \in B_t} \sum_{j=1}^{C} \mathbf{1}(j = y_i)\, s^e_{i,j}}{\sum_{i \in B_t} \sum_{j=1}^{C} \mathbf{1}(j = y_i)\, s^a_{i,j}}$ and $\rho^a_t = 1 / \rho^e_t$, where $C$ is the number of sample classes, $j$ the class index, $y_i$ the true class label of the $i$-th sample, $B_t$ the sample set of the current batch, and $\mathbf{1}(\cdot)$ the indicator function, which equals 1 when $j$ equals the true class $y_i$ of sample $i$ and 0 otherwise; it ensures that only the correct-class score is counted, i.e., only the $y_i$-th component of the score vector enters the sums.
Computing the modality modulation factor: if the contribution proportion of the current modality is greater than 1, the modality contributes more to the overall objective and is the dominant modality, so its optimization is slowed by the modulation factor; otherwise the modality is a weak modality and is not suppressed. The modulation factor is obtained as $k^u_t = 1 - \tanh(\alpha\,\rho^u_t)$ if $\rho^u_t > 1$, and $k^u_t = 1$ otherwise, where $u \in \{e, a\}$ denotes the modality and $\alpha$ is a hyperparameter controlling the degree of modulation; the larger $\alpha$, the stronger the suppression of the dominant modality.
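The contribution ratios and modulation factors can be sketched as below; the tanh-based suppression mirrors the on-the-fly gradient modulation of Reference 7, on which this step builds, and the default value of alpha is illustrative:

```python
import torch
import torch.nn.functional as F

def modulation_factors(y_e, y_a, labels, alpha=0.5):
    """Compute k^e and k^a for one batch from simulated unimodal logits."""
    idx = torch.arange(len(labels))
    s_e = F.softmax(y_e, dim=1)[idx, labels]   # correct-class visual scores
    s_a = F.softmax(y_a, dim=1)[idx, labels]   # correct-class auditory scores
    rho_e = s_e.sum() / s_a.sum()              # visual contribution ratio
    rho_a = 1.0 / rho_e                        # auditory contribution ratio
    one = torch.tensor(1.0)
    k_e = 1.0 - torch.tanh(alpha * rho_e) if rho_e > 1 else one
    k_a = 1.0 - torch.tanh(alpha * rho_a) if rho_a > 1 else one
    return k_e, k_a
```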
Step 104: determining the cross-entropy losses of the unimodal outputs and the multimodal output, dynamically adjusting the loss function according to the modality modulation factors and the cross-entropy losses, and determining the final loss of the spiking neural network for recognizing the spiking neural network's multimodal targets; the final loss is the optimization objective of the spiking neural network.
In practical applications, the cross-entropy losses of the unimodal outputs and the multimodal output are computed and regulated by the modulation factors through the following steps:
Loss computation: from the visual modality output $\hat{y}^e$, the auditory modality output $\hat{y}^a$, and the multimodal output $\hat{y}^f$, the corresponding cross-entropy losses $L_e$, $L_a$, and $L_f$ are computed as $L_u = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}\mathbf{1}(j = y_i)\log s^u_{i,j}$ for $u \in \{e, a, f\}$.
Loss adjustment: the final model loss $L$ is computed with the contribution-based weighting $L = L_f + \beta\,(k^a L_a + k^e L_e)$, where $\beta$ is a manually set hyperparameter representing the share of the unimodal supervision losses in the overall loss function, $k^a$ is the auditory modulation factor, and $k^e$ the visual modulation factor. During training, the loss $L$ is used as the overall optimization objective of the model.
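Putting the pieces together, the adjusted objective can be sketched as follows (the value of beta is illustrative; detaching the factors so that the modulation weights themselves receive no gradient is an assumption of this sketch):

```python
import torch.nn.functional as F

def total_loss(y_f, y_e, y_a, labels, k_e, k_a, beta=0.5):
    """L = L_f + beta * (k^e * L_e + k^a * L_a)."""
    L_f = F.cross_entropy(y_f, labels)  # multimodal classification loss
    L_e = F.cross_entropy(y_e, labels)  # visual supervision loss
    L_a = F.cross_entropy(y_a, labels)  # auditory supervision loss
    return L_f + beta * (k_e.detach() * L_e + k_a.detach() * L_a)
```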
The present invention has the following advantages:
(1) Novelty: the present invention is the first to propose a solution to the multimodal imbalance problem in spiking neural networks, effectively overcoming the degradation of unimodal feature extraction that joint multimodal training can cause.
(2) Efficiency: by integrating the two key techniques of the online loss adjustment module and the convolutional time alignment module, the present invention achieves comprehensive optimization of the feature extractors of the different modalities. In the target recognition task on the example data set, accuracy improves by 2.3%, verifying the efficiency of the present invention in multimodal learning.
(3) Online adjustment: the online loss adjustment module of the present invention relies on the simulated unimodal feedforward process to monitor each modality's contribution to the objective, effectively weakening the dominance of the strong modality in the overall optimization while relieving the suppression of the other modalities, achieving real-time loss adjustment.
(4) Time synchronization: the convolutional time alignment module of the present invention allows the system to handle multiple modalities with different time scales, adaptively learning modality information across time scales and effectively aligning the modal time scales, with good scalability.
(5) Information complementarity: target recognition experiments on the example data set show that auditory-modality accuracy improves by 40.65% over the original system and exceeds the training accuracy of the auditory unimodal model. This result demonstrates that the modulation algorithm of the present invention can efficiently exploit cross-modal information and achieve complementarity.
Embodiment 2
To execute the method corresponding to Embodiment 1 and achieve the corresponding functions and technical effects, an SNN multimodal target recognition system is provided below.
As shown in FIG. 2, an SNN multimodal target recognition system comprises:
a feature extraction module for extracting key features from multimodal information with feature extractors; the key features include visual information and auditory information; the feature extractors include a visual feature extractor and an auditory feature extractor;
a convolutional time alignment module for adjusting the time scales of the key features of the different unimodal inputs to determine time-aligned unimodal key features;
a simulated feedforward module for simulating forward propagation by matrix multiplication of each time-aligned unimodal key feature with the multimodal classifier parameters to obtain the unimodal outputs and the multimodal output, and for evaluating from these outputs the proportion that each unimodal output contributes to the target task to determine the modality modulation factors;
an online loss adjustment module for determining the cross-entropy losses of the unimodal outputs and the multimodal output, dynamically adjusting the loss function according to the modality modulation factors and the cross-entropy losses, and determining the final loss of the spiking neural network; the final loss is the optimization objective of the spiking neural network.
Embodiment 3
An electronic device comprises a memory and a processor, where the memory stores a computer program and the processor runs the computer program so that the electronic device executes the SNN multimodal target recognition method described above.
A computer-readable storage medium stores a computer program that, when executed by a processor, implements the SNN multimodal target recognition method described above.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to one another. Since the disclosed system corresponds to the disclosed method, its description is relatively brief; for relevant details, see the description of the method.
Specific examples are used herein to explain the principles and implementations of the present invention; the above embodiments are only intended to help understand the method and core ideas of the present invention. Meanwhile, those of ordinary skill in the art may, following the ideas of the present invention, make changes to the specific implementations and scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410066331.9A CN117892175A (en) | 2024-01-16 | 2024-01-16 | SNN multi-mode target identification method, system, equipment and medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410066331.9A CN117892175A (en) | 2024-01-16 | 2024-01-16 | SNN multi-mode target identification method, system, equipment and medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117892175A true CN117892175A (en) | 2024-04-16 |
Family
ID=90647127
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410066331.9A Pending CN117892175A (en) | 2024-01-16 | 2024-01-16 | SNN multi-mode target identification method, system, equipment and medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117892175A (en) |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118072114A (en) * | 2024-04-19 | 2024-05-24 | | A crack detection model, method and system based on pulse neural network |
| CN118072114B (en) * | 2024-04-19 | 2024-07-12 | | A crack detection model, method and system based on pulse neural network |
| CN118503148A (en) * | 2024-07-17 | 2024-08-16 | | Multimodal data management system, method, electronic device and storage medium |
| CN118982727A (en) * | 2024-07-23 | 2024-11-19 | | Training method, detection method and system of multimodal information detection model |
| CN118982727B * | 2024-07-23 | 2025-11-11 | 中国科学院自动化研究所 | Training method, detection method and system for multi-mode information detection model |
| CN119227752A (en) * | 2024-08-28 | 2024-12-31 | | Multimodal pulse signal recognition method and device embedded in biological network element structure |
| CN120105345A (en) * | 2025-05-06 | 2025-06-06 | | Optimization method and system for adaptive multi-level fine-tuning of multi-modal large models |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN117892175A (en) | SNN multi-mode target identification method, system, equipment and medium | |
| CN110738984B (en) | Artificial intelligence CNN, LSTM neural network speech recognition system | |
| CN111931795B (en) | Multimodal emotion recognition method and system based on subspace sparse feature fusion | |
| Deng et al. | Amae: Adaptive motion-agnostic encoder for event-based object classification | |
| CN111428789A (en) | Network traffic anomaly detection method based on deep learning | |
| CN112801019B (en) | Method and system for eliminating unsupervised vehicle re-identification bias based on synthetic data | |
| Hu et al. | A two-stage spatiotemporal attention convolution network for continuous dimensional emotion recognition from facial video | |
| CN113283334A (en) | Classroom concentration analysis method and device and storage medium | |
| CN107945210B (en) | Target tracking method based on deep learning and environment self-adaption | |
| CN107301376B (en) | A Pedestrian Detection Method Based on Deep Learning Multi-layer Stimulation | |
| CN111145145B (en) | Image surface defect detection method based on MobileNet | |
| CN108830170B (en) | End-to-end target tracking method based on layered feature representation | |
| CN112446253A (en) | Skeleton behavior identification method and device | |
| CN115273814A (en) | Pseudo-voice detection method, device, computer equipment and storage medium | |
| CN112597979B (en) | A face recognition method with real-time update of cosine angle loss function parameters | |
| Vayadande et al. | Lipreadnet: A deep learning approach to lip reading | |
| CN114120456A (en) | Learning concentration detection method, computer equipment and readable medium | |
| Zhao et al. | Human action recognition based on improved fusion attention CNN and RNN | |
| Liu et al. | Bird song classification based on improved Bi-LSTM-DenseNet network | |
| CN118172787A (en) | Lightweight document layout analysis method | |
| CN113239866A (en) | Face recognition method and system based on space-time feature fusion and sample attention enhancement | |
| CN116701696A (en) | Picture description method based on pulse transducer model | |
| CN116052254A (en) | Visual continuous emotion recognition method based on extended Kalman filtering neural network | |
| CN116433552A (en) | Method and related device for constructing focus image detection model in dyeing scene | |
| CN116071825B (en) | Action behavior recognition method, system, electronic equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination |