WO2024216857A1

WO2024216857A1 - Sparse spiking neural network accelerator based on ping-pong architecture

Info

Publication number: WO2024216857A1
Application number: PCT/CN2023/121949
Authority: WO
Inventors: 王源
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2023-04-17
Filing date: 2023-09-27
Publication date: 2024-10-24
Anticipated expiration: 2025-10-17
Also published as: CN116663626B; CN116663626A

Abstract

The present application provides a sparse spiking neural network accelerator based on a ping-pong architecture. Compressed weight values are transmitted to a compressed weight calculation module, and a sparse spike detection module is used for extracting an effective spike index from a spike input signal, thereby preventing subsequent spike signals from all participating in an operation, and reducing the amount of calculation; and the compressed weight calculation module accumulates a non-zero value among the compressed weight values to a membrane potential of a neuron according to the effective spike index, and finally decides whether to emit a spike or not. Compared with a traditional technical solution that all synapses in a synaptic crossbar array are activated and participate in an operation, in the present application, only a synaptic weight corresponding to the effective spike index is activated, and other synapses do not participate in the operation, thereby reducing the amount of calculation, reducing the operating power consumption of the whole chip, and improving the operating speed, energy efficiency and area efficiency of a spiking neural network.

Description

Sparse spiking neural network accelerator based on ping-pong architecture

相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS

本申请要求于2023年04月17日提交的申请号为202310410779.3，名称为“基于乒乓架构的稀疏脉冲神经网络加速器”的中国专利申请的优先权，其通过引用方式全部并入本文。This application claims priority to Chinese patent application No. 202310410779.3, filed on April 17, 2023, and entitled “Sparse Pulse Neural Network Accelerator Based on Ping-Pong Architecture”, which is incorporated herein by reference in its entirety.

Technical Field

本申请涉及人工智能技术领域，尤其涉及一种基于乒乓架构的稀疏脉冲神经网络加速器。The present application relates to the field of artificial intelligence technology, and in particular to a sparse pulse neural network accelerator based on a ping-pong architecture.

Background Art

脉冲神经网络(Spiking Neural Network，SNN)由于其具有的低功耗、高并发的特性，能够带来机器算力的提升，是一种极具潜力的计算模式，被认为是人工智能研究的未来。Spiking Neural Network (SNN) is a computing model with great potential, which can improve the computing power of machines due to its low power consumption and high concurrency. It is considered to be the future of artificial intelligence research.

由于脉冲神经网络中神经元的信号传导机制与传统的冯诺依曼计算机体系架构并不吻合，因此急需为脉冲神经网络设计合适的硬件加速器来运行脉冲神经网络。目前的神经形态加速器往往采用结构规则且大小固定的突触交叉阵列直接存储SNN模型的突触连接矩阵，而无论突触稀疏性如何，所有的突触都会参与运算，增加了计算量，这导致SNN模型的空间稀疏性无法在这种神经形态加速器上体现。另一种加速器设计方案是采用位图形式缓存输入脉冲，这会导致硬件需要判断输入脉冲向量中每一位的脉冲有效性，增加了计算时间，其结果是无法利用脉冲信号的时间稀疏性来提高硬件的运算速度。Since the signal conduction mechanism of neurons in spiking neural networks does not match the traditional von Neumann computer architecture, it is urgent to design a suitable hardware accelerator for spiking neural networks to run spiking neural networks. Current neuromorphic accelerators often use a synaptic cross array with a regular structure and fixed size to directly store the synaptic connection matrix of the SNN model. Regardless of the synaptic sparsity, all synapses will participate in the calculation, increasing the amount of calculation, which makes the spatial sparsity of the SNN model unable to be reflected on this neuromorphic accelerator. Another accelerator design scheme is to cache input pulses in the form of a bitmap, which will cause the hardware to judge the pulse validity of each bit in the input pulse vector, increasing the calculation time. As a result, the time sparsity of the pulse signal cannot be used to improve the hardware's calculation speed.

可见，目前的神经网络加速器还不能充分发挥SNN模型的潜在性能优势，导致功耗不够低，运算速度不够快，对于SNN模型的运行能效不够高。It can be seen that the current neural network accelerators cannot fully utilize the potential performance advantages of the SNN model, resulting in insufficient power consumption, insufficient computing speed, and insufficient energy efficiency for the operation of the SNN model.

发明内容Summary of the invention

本申请提供一种基于乒乓架构的稀疏脉冲神经网络加速器，用以解决现有技术中脉冲神经网络加速器中要么所有突触都参与运算要么输入的脉冲信号中的每一个符号都要参与运算从而导致芯片功耗大、计算量大的缺陷，实现脉冲神经网络的低功耗、低延迟运行。The present application provides a sparse pulse neural network accelerator based on a ping-pong architecture to solve the problem that either all synapses are involved in the operation or the input Every symbol in the pulse signal must participate in the calculation, which leads to the defects of high chip power consumption and large amount of calculation, thus realizing the low power consumption and low latency operation of the pulse neural network.

本申请提供一种基于乒乓架构的稀疏脉冲神经网络加速器，包括脉冲输入接口、权值和神经元参数输入接口、稀疏脉冲检测模块、压缩权重计算模块、泄漏积分发放模块；其中，The present application provides a sparse pulse neural network accelerator based on a ping-pong architecture, including a pulse input interface, a weight and neuron parameter input interface, a sparse pulse detection module, a compression weight calculation module, and a leakage integral issuance module; wherein,

所述脉冲输入接口，用于接收脉冲输入信号，并将所述脉冲输入信号输入至稀疏脉冲检测模块；The pulse input interface is used to receive a pulse input signal and input the pulse input signal to the sparse pulse detection module;

所述权值和神经元参数输入接口，用于接收压缩权重值，并将所述压缩权重值输入至所述压缩权重计算模块；The weight and neuron parameter input interface is used to receive the compressed weight value and input the compressed weight value into the compressed weight calculation module;

所述稀疏脉冲检测模块，用于从所述脉冲输入信号中提取有效脉冲索引；所述有效脉冲索引用于表征所述脉冲输入信号中非零值的位置；The sparse pulse detection module is used to extract a valid pulse index from the pulse input signal; the valid pulse index is used to characterize the position of a non-zero value in the pulse input signal;

所述压缩权重计算模块，用于根据所述有效脉冲索引，对所述压缩权重值进行解压得到有效权值矩阵；计算所述有效权值矩阵与所述脉冲输入信号的加权和，得到每一神经元上的膜电位增量；利用所述每一神经元上的膜电位增量更新与所述每一神经元对应的膜电位累积量；The compression weight calculation module is used to decompress the compression weight value according to the effective pulse index to obtain an effective weight matrix; calculate the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment on each neuron; and use the membrane potential increment on each neuron to update the membrane potential accumulation corresponding to each neuron;

所述泄漏积分发放模块，用于判断更新后的膜电位累积量与预设阈值的大小关系，根据所述大小关系确定与所述每一神经元对应的输出脉冲结果。The leaky integral issuing module is used to determine the magnitude relationship between the updated membrane potential accumulation amount and a preset threshold value, and determine the output pulse result corresponding to each neuron according to the magnitude relationship.

根据本申请提供的一种基于乒乓架构的稀疏脉冲神经网络加速器，还包括脉冲缓存模块组；所述脉冲缓存模块组包括第一脉冲缓存模块和第二脉冲缓存模块；A sparse pulse neural network accelerator based on a ping-pong architecture provided by the present application further includes a pulse buffer module group; the pulse buffer module group includes a first pulse buffer module and a second pulse buffer module;

所述脉冲缓存模块组，用于以乒乓切换的方式控制每一缓存周期内第一脉冲缓存模块和第二脉冲缓存模块的读写状态，以使每一缓存周期内其中一个脉冲缓存模块处于读状态，另一个脉冲缓存模块处于写状态。The pulse buffer module group is used to control the read and write states of the first pulse buffer module and the second pulse buffer module in each buffer cycle in a ping-pong switching manner, so that one pulse buffer module is in a read state and the other pulse buffer module is in a write state in each buffer cycle.

根据本申请提供的一种基于乒乓架构的稀疏脉冲神经网络加速器，还包括权重缓存模块组；所述权重缓存模块组包括第一权重缓存模块和第二权重缓存模块；A sparse pulse neural network accelerator based on a ping-pong architecture provided by the present application also includes a weight cache module group; the weight cache module group includes a first weight cache module and a second weight cache module;

所述权重缓存模块组，用于以乒乓切换的方式控制每一缓存周期内第一权重缓存模块和第二权重缓存模块的读写状态，以使每一缓存周期内其中一个权重缓存模块处于读状态，另一个权重缓存模块处于写状态。 The weight cache module group is used to control the read and write status of the first weight cache module and the second weight cache module in each cache cycle in a ping-pong switching manner, so that one of the weight cache modules is in a read state and the other weight cache module is in a write state in each cache cycle.

根据本申请提供的一种基于乒乓架构的稀疏脉冲神经网络加速器，还包括神经元参数缓存模块组；所述神经元参数缓存模块组包括第一神经元参数缓存模块和第二神经元参数缓存模块；According to a sparse pulse neural network accelerator based on a ping-pong architecture provided by the present application, it also includes a neuron parameter cache module group; the neuron parameter cache module group includes a first neuron parameter cache module and a second neuron parameter cache module;

所述神经元参数缓存模块组，用于以乒乓切换的方式控制每一缓存周期内第一神经元参数缓存模块和第二神经元参数缓存模块的读写状态，以使每一缓存周期内其中一个神经元参数缓存模块处于读状态，另一个神经元参数缓存模块处于写状态。The neuron parameter cache module group is used to control the read and write states of the first neuron parameter cache module and the second neuron parameter cache module in each cache cycle in a ping-pong switching manner, so that one of the neuron parameter cache modules is in a read state and the other neuron parameter cache module is in a write state in each cache cycle.

根据本申请提供的一种基于乒乓架构的稀疏脉冲神经网络加速器，所述稀疏脉冲检测模块，进一步用于将所述脉冲输入信号对应的脉冲输入序列分为多组子序列；According to a sparse pulse neural network accelerator based on a ping-pong architecture provided by the present application, the sparse pulse detection module is further used to divide the pulse input sequence corresponding to the pulse input signal into a plurality of subsequences;

依次将每一组子序列与其自身进行按位或操作，得到按位或操作结果；若所述按位或操作结果为全0，则结束当前组子序列的运算；Performing a bitwise OR operation on each group of subsequences with itself in turn to obtain a bitwise OR operation result; if the bitwise OR operation result is all 0, then the operation of the current group of subsequences is terminated;

若所述按位或操作结果不为全0，则将所述按位或操作结果作为当前待检测序列，对当前待检测序列进行多轮检测，每一轮检测中，将当前待检测序列减1后得到差值；将所述差值与当前待检测序列进行按位与操作后得到按位与操作结果；将所述按位与操作结果与当前待检测序列进行按位异或操作后得到有效脉冲独热码；将所述有效脉冲独热码进行二进制转换后得到有效脉冲索引；判断所述按位与操作结果是否为全0，若是，则结束当前待检测序列的检测，并返回所述依次将每一组子序列与其自身进行按位或操作，得到按位或操作结果的步骤；若否，则将所述按位与操作结果作为当前待检测序列，返回所述将当前检测序列减1后得到差值的步骤。If the bitwise OR operation result is not all 0, the bitwise OR operation result is used as the current sequence to be detected, and multiple rounds of detection are performed on the current sequence to be detected. In each round of detection, the current sequence to be detected is subtracted by 1 to obtain a difference; the difference is bitwise ANDed with the current sequence to be detected to obtain a bitwise AND operation result; the bitwise AND operation result is bitwise XORed with the current sequence to be detected to obtain a valid pulse one-hot code; the valid pulse one-hot code is binary-converted to obtain a valid pulse index; it is determined whether the bitwise AND operation result is all 0. If so, the detection of the current sequence to be detected is terminated, and the step of performing a bitwise OR operation on each group of subsequences with itself in turn to obtain a bitwise OR operation result is returned; if not, the bitwise AND operation result is used as the current sequence to be detected, and the step of subtracting 1 from the current detection sequence to obtain a difference is returned.

根据本申请提供的一种基于乒乓架构的稀疏脉冲神经网络加速器，所述压缩权重计算模块包括行偏移计算模块、列索引权值计算单元、列索引编码模块、非零权值模块、权值分配器和处理单元阵列；其中，According to a sparse pulse neural network accelerator based on a ping-pong architecture provided by the present application, the compression weight calculation module includes a row offset calculation module, a column index weight calculation unit, a column index encoding module, a non-zero weight module, a weight distributor and a processing unit array; wherein,

所述行偏移运算单元，用于获取当前行的行偏移和与当前行相邻的下一行的行偏移；The row offset calculation unit is used to obtain the row offset of the current row and the row offset of the next row adjacent to the current row;

所述列索引权值运算单元，用于解析所述当前行的行偏移和所述与当前行相邻的下一行的行偏移得到所述列索引编码模块和所述非零权值模块的起始地址和结束地址；根据所述起始地址和所述结束地址，从所述列索引编码模块和所述非零权值模块中获取非零权值和列索引Delta编码；The column index weight calculation unit is used to parse the row offset of the current row and the row offset of the next row adjacent to the current row to obtain the starting address and the ending address of the column index encoding module and the non-zero weight module; according to the starting address and the ending address, The index coding module and the non-zero weight module obtain non-zero weight and column index Delta coding;

所述权值分配器，包括预设数量的加法器构成的加法器链，所述加法器链用于将所述非零权值和所述Delta编码作为处理单元阵列的输入；The weight distributor comprises an adder chain composed of a preset number of adders, wherein the adder chain is used to use the non-zero weight and the Delta code as inputs of a processing unit array;

所述处理单元阵列，包括预设数量的处理单元，每个处理单元利用加法器进行运算，根据所述加法器的运算结果更新所述膜电位累积量。The processing unit array includes a preset number of processing units, each processing unit uses an adder to perform calculations, and the membrane potential accumulation amount is updated according to the calculation results of the adder.

根据本申请提供的一种基于乒乓架构的稀疏脉冲神经网络加速器，所述列索引权值运算单元，还用于在所述起始地址超过所述结束地址的情况下，结束当前行的处理。According to a sparse pulse neural network accelerator based on a ping-pong architecture provided by the present application, the column index weight calculation unit is also used to end the processing of the current row when the starting address exceeds the ending address.

根据本申请提供的一种基于乒乓架构的稀疏脉冲神经网络加速器，每个处理单元中还包括权值掩码生成模块、多路选择器和加法器；According to a sparse pulse neural network accelerator based on a ping-pong architecture provided by the present application, each processing unit also includes a weight mask generation module, a multiplexer and an adder;

所述权值掩码生成模块，用于根据权值分布状态生成权值掩码；The weight mask generating module is used to generate a weight mask according to the weight distribution state;

所述处理单元，还用于将所述权值掩码、所述非零权值作为多路选择器的输入，使得所述多路选择器根据所述权值掩码过滤后得到有效非零值，并将所述有效非零值和所述膜电位累积量作为加法器的输入，得到所述加法器输出的所述更新后的膜电位累积量。The processing unit is also used to use the weight mask and the non-zero weight as inputs of a multiplexer, so that the multiplexer obtains a valid non-zero value after filtering according to the weight mask, and uses the valid non-zero value and the membrane potential accumulation as inputs of an adder to obtain the updated membrane potential accumulation output by the adder.

根据本申请提供的一种基于乒乓架构的稀疏脉冲神经网络加速器，所述泄漏积分发放模块，进一步用于将所述更新后的膜电位累积量与预设泄漏值相加后得到泄漏积分值；若所述泄漏积分值大于所述预设阈值，则确定所述输出脉冲结果为发放脉冲信号；若所述泄漏积分值小于或等于预设阈值，则确定所述脉冲结果为不发放脉冲。According to a sparse pulse neural network accelerator based on a ping-pong architecture provided by the present application, the leakage integral issuance module is further used to add the updated membrane potential accumulation amount to the preset leakage value to obtain a leakage integral value; if the leakage integral value is greater than the preset threshold, the output pulse result is determined to be a pulse signal; if the leakage integral value is less than or equal to the preset threshold, the pulse result is determined to be no pulse issuance.

根据本申请提供的一种基于乒乓架构的稀疏脉冲神经网络加速器，还包括脉冲输出接口；A sparse pulse neural network accelerator based on a ping-pong architecture provided by the present application also includes a pulse output interface;

所述泄漏积分发放模块，进一步用于在判断所述更新后的膜电位累积量大于所述预设阈值的情况下，向所述脉冲输出接口发送所述更新后的膜电位累积量，并将所述膜电位累积量复位。The leakage integral issuance module is further used to send the updated membrane potential accumulation amount to the pulse output interface and reset the membrane potential accumulation amount when it is determined that the updated membrane potential accumulation amount is greater than the preset threshold.

本申请提供的基于乒乓架构的稀疏脉冲神经网络加速器，通过传输压缩权重值至压缩权重计算模块，使用稀疏脉冲检测模块从脉冲输入信号中提取有效脉冲索引，避免了后续每一位脉冲信号都参与运算，减少了计算量，压缩权重计算模块根据有效脉冲索引将上述压缩权重值中的非零值累加至神经元的膜电位上，最终决定是否发放脉冲或不发放脉冲。与传统的突触交叉阵列中所有突触都被激活并参与运算相比，本申请中仅对有效脉冲索引对应的突触权重进行激活，其他突触都不参与运算，从而减少了计算量，降低了整个芯片的运行功耗，提高了脉冲神经网络的运行速度、能效和面积效率。The sparse pulse neural network accelerator based on the ping-pong architecture provided in this application transmits the compression weight value to the compression weight calculation module, and uses the sparse pulse detection module to extract the effective pulse index from the pulse input signal, thereby avoiding the subsequent participation of each pulse signal in the calculation, reducing the amount of calculation. The compression weight calculation module accumulates the non-zero value in the above compression weight value to the membrane potential of the neuron according to the effective pulse index, and finally decides whether to issue a pulse or not. Compared with the traditional Compared with the case where all synapses in the synaptic crossbar array are activated and participate in the calculation, in the present application, only the synaptic weights corresponding to the valid pulse index are activated, and other synapses do not participate in the calculation, thereby reducing the amount of calculation, reducing the operating power consumption of the entire chip, and improving the operating speed, energy efficiency and area efficiency of the pulse neural network.

BRIEF DESCRIPTION OF THE DRAWINGS

为了更清楚地说明本申请或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作一简单地介绍，显而易见地，下面描述中的附图是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the present application or the prior art, a brief introduction will be given below to the drawings required for use in the embodiments or the description of the prior art. Obviously, the drawings described below are some embodiments of the present application. For ordinary technicians in this field, other drawings can be obtained based on these drawings without paying any creative work.

图1是本申请提供的基于乒乓架构的稀疏脉冲神经网络加速器的功能模块示意图；FIG1 is a schematic diagram of functional modules of a sparse pulse neural network accelerator based on a ping-pong architecture provided by the present application;

图2是本申请提供的行压缩矩阵的矩阵要素示意图；FIG2 is a schematic diagram of matrix elements of a row compression matrix provided by the present application;

图3是本申请提供的LIF神经元的原理示意图；FIG3 is a schematic diagram of the principle of LIF neurons provided by the present application;

图4是本申请提供的乒乓运行方法的流程示意图；FIG4 is a schematic diagram of a flow chart of a ping-pong operation method provided by the present application;

图5是本申请提供的稀疏脉冲检测模块的硬件电路示意图；FIG5 is a schematic diagram of a hardware circuit of a sparse pulse detection module provided in the present application;

图6是本申请提供的压缩权重计算模块的结构原理示意图；FIG6 is a schematic diagram of the structural principle of a compression weight calculation module provided by the present application;

图7是本申请提供的压缩权重计算模块的功能模块示意图；FIG7 is a schematic diagram of the functional modules of the compression weight calculation module provided by the present application;

图8是本申请提供的列索引权值运算单元的硬件电路结构示意图；FIG8 is a schematic diagram of the hardware circuit structure of the column index weight calculation unit provided by the present application;

图9是本申请提供的加法器链解码示意图；FIG9 is a schematic diagram of adder chain decoding provided by the present application;

图10是本申请提供的PE阵列的结构示意图。FIG10 is a schematic diagram of the structure of the PE array provided in the present application.

DETAILED DESCRIPTION

为使本申请的目的、技术方案和优点更加清楚，下面将结合本申请中的附图，对本申请中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。In order to make the purpose, technical solutions and advantages of this application clearer, the technical solutions in this application will be clearly and completely described below in conjunction with the drawings in this application. Obviously, the described embodiments are part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of this application.

下面结合图1-图10描述本申请的具体实施方式：The specific implementation of the present application is described below in conjunction with Figures 1 to 10:

本申请实施例提供的基于乒乓架构的稀疏脉冲神经网络加速器，其功能模块如图1所示，包括脉冲输入接口101、权值和神经元参数输入接口102、稀疏脉冲检测模块103、压缩权重计算模块104、泄漏积分发放模块105；其中，The sparse pulse neural network accelerator based on the ping-pong architecture provided in the embodiment of the present application has The functional module is shown in FIG1 , including a pulse input interface 101, a weight and neuron parameter input interface 102, a sparse pulse detection module 103, a compression weight calculation module 104, and a leakage integral issuance module 105; wherein,

脉冲输入接口101，用于接收脉冲输入信号，并将该脉冲输入信号输入至稀疏脉冲检测模块103；The pulse input interface 101 is used to receive a pulse input signal and input the pulse input signal to the sparse pulse detection module 103;

具体地，脉冲输入信号通过脉冲输入接口101传输进入稀疏脉冲检测模块103。Specifically, the pulse input signal is transmitted to the sparse pulse detection module 103 through the pulse input interface 101 .

权值和神经元参数输入接口102，用于接收压缩权重值，并将所述压缩权重值输入至所述压缩权重计算模块104；The weight and neuron parameter input interface 102 is used to receive the compressed weight value and input the compressed weight value to the compressed weight calculation module 104;

其中，压缩权重值，是指处于稀疏矩阵存储格式下的权重值。由于脉冲神经网络模型中的权重参数非常多，使用稀疏的数据存储格式可以节省大量的存储空间且加快计算速度。本申请使用的稀疏矩阵表示格式为CSR(Compressed Sparse Row，行压缩)格式。Among them, the compressed weight value refers to the weight value in the sparse matrix storage format. Since there are many weight parameters in the pulse neural network model, using a sparse data storage format can save a lot of storage space and speed up the calculation. The sparse matrix representation format used in this application is CSR (Compressed Sparse Row, row compression) format.

具体地，以CSR格式存储的压缩权重值通过权值和神经元参数输入接口102输入至压缩权重计算模块104。Specifically, the compression weight value stored in the CSR format is input to the compression weight calculation module 104 through the weight and neuron parameter input interface 102 .

稀疏脉冲检测模块103，用于从上述脉冲输入信号中提取有效脉冲索引；所述有效脉冲索引用于表征所述脉冲输入信号中非零值的位置。The sparse pulse detection module 103 is used to extract a valid pulse index from the pulse input signal; the valid pulse index is used to characterize the position of a non-zero value in the pulse input signal.

由于脉冲输入信号为稀疏向量/张量(所谓稀疏，是指具有大量的零)，在计算过程中需要将上述稀疏向量/张量与突触权重相乘，也就意味着大多数突触权重都乘以“0”，其结果仍然为0，为了节约算力，本申请不需要从内存中读取脉冲输入信号中的每一位，只需读取有效脉冲(即非零值的脉冲信号)。本申请为了降低芯片整体功耗，仅对突触交叉矩阵中需要激活的神经元进行激活，即根据脉冲输入信号中非零值的位置激活突触交叉阵列中的相应神经元。为了决定哪些位置的神经元需要激活，本申请使用稀疏脉冲检测模块103从上述脉冲输入信号中提取有效脉冲索引，其中，有效脉冲索引用于表示脉冲输入信号中非零值的位置。Since the pulse input signal is a sparse vector/tensor (the so-called sparse means having a large number of zeros), the above sparse vector/tensor needs to be multiplied by the synaptic weight during the calculation process, which means that most of the synaptic weights are multiplied by "0", and the result is still 0. In order to save computing power, this application does not need to read each bit in the pulse input signal from the memory, but only needs to read the valid pulse (i.e., the pulse signal with a non-zero value). In order to reduce the overall power consumption of the chip, this application only activates the neurons that need to be activated in the synaptic cross matrix, that is, activates the corresponding neurons in the synaptic cross array according to the position of the non-zero value in the pulse input signal. In order to determine which positions of neurons need to be activated, this application uses a sparse pulse detection module 103 to extract a valid pulse index from the above pulse input signal, wherein the valid pulse index is used to represent the position of the non-zero value in the pulse input signal.

压缩权重计算模块104，用于根据上述有效脉冲索引，对上述从片外接收到的压缩权重值进行解压得到有效权值矩阵；计算有效权值矩阵与脉冲输入信号的加权和，得到每一神经元上的膜电位增量；利用每一神经元上的膜电位增量更新与每一神经元对应的膜电位累积量。The compression weight calculation module 104 is used to decompress the compression weight value received from the chip to obtain an effective weight matrix according to the effective pulse index; calculate the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment on each neuron; and use each neuron to calculate the effective weight matrix. The membrane potential increment on the neuron updates the accumulated membrane potential corresponding to each neuron.

其中，有效权值矩阵如图2中左边的矩阵所示，该有效权值矩阵是根据上述有效脉冲索引对压缩权重值进行还原得到的矩阵。压缩权重值如图2中的“非零值”所示，压缩权重值中的每个元素都是有效权值矩阵中的非零值。由于片上存储空间有限，且神经网络往往具有数量庞大的参数、权重，为了节省存储空间，本申请使用CSR(Compressed Row Storage，压缩行矩阵)格式存储权重，CSR存储格式使用三个一维数组(包括行偏移、列索引和非零值)来表示一个稀疏矩阵(即有效权值矩阵)，其中，行索引数组用于存储当前行与之前行的非零值的累积计数，例如行偏移数组中第m个元素表示有效权值矩阵中第m行上方的所有非零值的数量，例如图2中行偏移数组中第2个元素“4”表示有效权值矩阵中第2行(行序号从0开始编码)上方所有行中的非零值的总数为4(分别为1、7、2、8)，依此类推，只需将行偏移数组中的相邻两元素相减即可知道有效权值矩阵中每行所包含的非零值，例如对于有效权值矩阵中第i行，只需读取行偏移数组中第i+1个元素值与第i个元素值，将两者相减即可计算出有效权值矩阵中第i行中的非零值数量；列索引数组用于存储非零值数组中每个元素的列索引，例如非零值数组中的“7”对应的列索引是1，表示7位于有效权值矩阵中的第1列(从0开始编码索引)。Among them, the effective weight matrix is shown as the matrix on the left in Figure 2, and the effective weight matrix is a matrix obtained by restoring the compressed weight value according to the above-mentioned effective pulse index. The compressed weight value is shown as the "non-zero value" in Figure 2, and each element in the compressed weight value is a non-zero value in the effective weight matrix. Due to the limited on-chip storage space and the fact that neural networks often have a large number of parameters and weights, in order to save storage space, this application uses the CSR (Compressed Row Storage) format to store weights. The CSR storage format uses three one-dimensional arrays (including row offsets, column indexes, and non-zero values) to represent a sparse matrix (i.e., the effective weight matrix), where the row index array is used to store the cumulative count of non-zero values of the current row and the previous row. For example, the mth element in the row offset array represents the number of all non-zero values above the mth row in the effective weight matrix. For example, the second element "4" in the row offset array in Figure 2 represents the effective weight matrix The total number of non-zero values in all rows above the 2nd row (row numbers are encoded starting from 0) is 4 (1, 7, 2, 8 respectively), and so on. You only need to subtract two adjacent elements in the row offset array to know the non-zero values contained in each row of the effective weight matrix. For example, for the i-th row in the effective weight matrix, you only need to read the i+1th element value and the i-th element value in the row offset array, and subtract the two to calculate the number of non-zero values in the i-th row in the effective weight matrix; the column index array is used to store the column index of each element in the non-zero value array. For example, the column index corresponding to "7" in the non-zero value array is 1, indicating that 7 is located in the 1st column of the effective weight matrix (index is encoded starting from 0).

具体地，压缩权重计算模块104，对CSR格式存储的压缩权重值进行解压得到有效权值矩阵；计算有效权值矩阵与所述脉冲输入信号的加权和，得到每一神经元上的膜电位增量；利用所述每一神经元上的膜电位增量累加至所述每一神经元对应的膜电位累积量，得到更新后的膜电位累积量。Specifically, the compressed weight calculation module 104 decompresses the compressed weight values stored in the CSR format to obtain an effective weight matrix; calculates the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment on each neuron; and uses the membrane potential increment on each neuron to accumulate the membrane potential accumulation corresponding to each neuron to obtain an updated membrane potential accumulation.

泄漏积分发放模块105，用于判断更新后的膜电位累积量与预设阈值的大小关系，根据所述大小关系确定与所述每一神经元对应的输出脉冲结果。The leakage integral issuing module 105 is used to determine the magnitude relationship between the updated membrane potential accumulation amount and a preset threshold value, and determine the output pulse result corresponding to each neuron according to the magnitude relationship.

具体地，如图3所示，图3为本申请所使用的脉冲神经网络算法结构图。本申请所支持的脉冲神经网络的神经元为线性的LIF(Leaky Integrate-and-Fire，泄露积分发放)模型，在累积脉冲时序序列输入后，进行线性的泄漏运算、阈值比较和脉冲发射，LIF神经元模型的膜电位动力学方程如下：
V_j(t+1)＝V_j(t)+∑x_iw_ij-λ_j (1)Specifically, as shown in FIG3 , FIG3 is a structure diagram of the spiking neural network algorithm used in this application. The neurons of the spiking neural network supported by this application are linear LIF (Leaky Integrate-and-Fire) models. After the accumulated pulse timing sequence is input, Performing linear leakage operation, threshold comparison and pulse firing, the membrane potential dynamics equation of the LIF neuron model is as follows:
V _j (t+1)＝V _j (t)+∑x _i w _ij -λ _j (1)

其中V_j(t)为神经元j在t时刻的膜电位累积值，w_ij为神经元j对应的第i个突触权值，x_i为第i个突触的脉冲输入信号值，λ_j为神经元j的线性泄漏。Where V _j (t) is the accumulated value of the membrane potential of neuron j at time t, w _ij is the i-th synaptic weight corresponding to neuron j, _xi is the pulse input signal value of the i-th synapse, and λ _j is the linear leakage of neuron j.

泄漏积分发放模块105判断更新后的膜电位累积量V_j(t+1)与预设阈值的大小关系，若大于预设阈值，则发射神经脉冲信号，并将该神经元上的膜电位累积量复位。The leakage integral issuing module 105 determines the relationship between the updated membrane potential accumulation amount V _j (t+1) and the preset threshold value. If it is greater than the preset threshold value, a neural pulse signal is emitted and the membrane potential accumulation amount on the neuron is reset.

上述实施例，通过传输压缩权重值至压缩权重计算模块，使用稀疏脉冲检测模块从脉冲输入信号中提取有效脉冲索引，避免了后续每一位脉冲信号都参与运算，减少了计算量，压缩权重计算模块根据有效脉冲索引将上述压缩权重值中的非零值累加至神经元的膜电位上，最终决定是否发放脉冲或不发放脉冲。与传统的突触交叉阵列中所有突触都被激活并参与运算相比，本申请中仅对有效脉冲索引对应的突触权重进行激活，从而减少了计算量，降低了整个芯片的运行功耗，提高了脉冲神经网络的运行速度、能效和面积效率。In the above embodiment, by transmitting the compression weight value to the compression weight calculation module, the sparse pulse detection module is used to extract the effective pulse index from the pulse input signal, thereby avoiding that each subsequent pulse signal is involved in the calculation, reducing the amount of calculation, and the compression weight calculation module adds the non-zero value in the above compression weight value to the membrane potential of the neuron according to the effective pulse index, and finally decides whether to emit a pulse or not. Compared with the traditional synaptic cross array in which all synapses are activated and participate in the calculation, in this application, only the synaptic weight corresponding to the effective pulse index is activated, thereby reducing the amount of calculation, reducing the operating power consumption of the entire chip, and improving the operating speed, energy efficiency and area efficiency of the pulse neural network.

在一实施例中，如图1所示，上述基于乒乓架构的稀疏脉冲神经网络加速器中还包括脉冲缓存模块组106，该脉冲缓存模块组106包括第一脉冲缓存模块和第二脉冲缓存模块；其中，脉冲缓存模块组106，用于以乒乓切换的方式控制每一缓存周期内第一脉冲缓存模块和第二脉冲缓存模块的读写状态，以使每一缓存周期内其中一个脉冲缓存模块处于读状态，另一个脉冲缓存模块处于写状态。In one embodiment, as shown in FIG1 , the sparse pulse neural network accelerator based on the ping-pong architecture further includes a pulse cache module group 106, and the pulse cache module group 106 includes a first pulse cache module and a second pulse cache module; wherein the pulse cache module group 106 is used to control the read and write states of the first pulse cache module and the second pulse cache module in each cache cycle in a ping-pong switching manner, so that in each cache cycle, one of the pulse cache modules is in a read state and the other pulse cache module is in a write state.

具体地，基于乒乓架构的稀疏脉冲神经网络加速器内部设置有两块脉冲缓存模块RAM(Random Access Memory，随机存取存储器)，用于将从片外接收到的脉冲输入信号进行编码得到适于计算的脉冲输入编码，在编码过程中，整个系统可以在其中一块RAM完成计算的同时，进行另一块RAM的更新工作。相应地，输出的脉冲编码也需要译码为输出脉冲信号，在译码过程中，也采用乒乓运行方法，一块RAM完成计算的同时，进行另一块RAM的更新工作。 Specifically, the sparse pulse neural network accelerator based on the ping-pong architecture is internally provided with two pulse buffer modules RAM (Random Access Memory) for encoding the pulse input signal received from outside the chip to obtain the pulse input code suitable for calculation. During the encoding process, the entire system can update the other RAM while one RAM completes the calculation. Correspondingly, the output pulse code also needs to be decoded into the output pulse signal. During the decoding process, the ping-pong operation method is also adopted. While one RAM completes the calculation, the other RAM is updated.

上述实施例，利用脉冲缓存模块组实现对脉冲输入信号编码或脉冲输出信号解码过程中的乒乓缓存，加速同时存在I/O操作以及数据处理操作，提高加速器的吞吐量。In the above-mentioned embodiment, the pulse buffer module group is used to implement ping-pong buffering in the process of encoding pulse input signals or decoding pulse output signals, thereby accelerating simultaneous I/O operations and data processing operations and improving the throughput of the accelerator.

在一实施例中，如图1所示，上述基于乒乓架构的稀疏脉冲神经网络加速器中还包括权重缓存模块组107，该权重缓存模块组107包括第一权重缓存模块和第二权重缓存模块；其中，权重缓存模块组107，用于以乒乓切换的方式控制每一缓存周期内第一权重缓存模块和第二权重缓存模块的读写状态，以使每一缓存周期内其中一个权重缓存模块处于读状态，另一个权重缓存模块处于写状态。In one embodiment, as shown in FIG1 , the sparse pulse neural network accelerator based on the ping-pong architecture further includes a weight cache module group 107, which includes a first weight cache module and a second weight cache module; wherein the weight cache module group 107 is used to control the read and write states of the first weight cache module and the second weight cache module in each cache cycle in a ping-pong switching manner, so that one of the weight cache modules is in a read state and the other weight cache module is in a write state in each cache cycle.

具体地，基于乒乓架构的稀疏脉冲神经网络加速器内部设置有两块权重缓存模块RAM，用于将从片外接收到的CSR格式的压缩权重值进行解压得到有效权值矩阵，在解压过程中，整个系统可以在其中一块RAM完成解压的同时，进行另一块RAM的更新工作。Specifically, the sparse pulse neural network accelerator based on the ping-pong architecture is internally provided with two weight cache module RAMs, which are used to decompress the compressed weight values in the CSR format received from outside the chip to obtain the effective weight matrix. During the decompression process, the entire system can update the other RAM while one RAM completes the decompression.

上述实施例，利用权重缓存模块组实现对CSR格式的压缩权重值解压过程中的乒乓缓存，加速同时存在I/O操作以及数据处理操作，提高加速器的吞吐量。The above-mentioned embodiment utilizes the weight cache module group to implement ping-pong cache in the process of decompressing the compressed weight values in the CSR format, accelerates the simultaneous I/O operations and data processing operations, and improves the throughput of the accelerator.

在一实施例中，上述加速器还包括神经元参数缓存模块组108；所述神经元参数缓存模块组108包括第一神经元参数缓存模块和第二神经元参数缓存模块；所述神经元参数缓存模块组108，用于以乒乓切换的方式控制每一缓存周期内第一神经元参数缓存模块和第二神经元参数缓存模块的读写状态，以使每一缓存周期内其中一个神经元参数缓存模块处于读状态，另一个神经元参数缓存模块处于写状态。In one embodiment, the accelerator further includes a neuron parameter cache module group 108; the neuron parameter cache module group 108 includes a first neuron parameter cache module and a second neuron parameter cache module; the neuron parameter cache module group 108 is used to control the read and write states of the first neuron parameter cache module and the second neuron parameter cache module in each cache cycle in a ping-pong switching manner, so that in each cache cycle, one of the neuron parameter cache modules is in a read state and the other neuron parameter cache module is in a write state.

具体地，基于乒乓架构的稀疏脉冲神经网络加速器内部设置有两块神经元参数缓存模块RAM，用于将从片外接收到的神经元参数进行解码得到适于计算的神经元参数，在解码过程中，整个系统可以在其中一块RAM完成解压的同时，进行另一块RAM的更新工作。Specifically, the sparse pulse neural network accelerator based on the ping-pong architecture is internally provided with two neuron parameter cache module RAMs, which are used to decode the neuron parameters received from outside the chip to obtain neuron parameters suitable for calculation. During the decoding process, the entire system can update the other RAM while one RAM is completing decompression.

进一步地，本申请中所使用的乒乓运行算法包含三个维度的控制，即每个时间步长的输入脉冲包含多少个时隙，每组神经元需要计算多少个时间步长以及每层网络需要划分为多少组分批计算。Furthermore, the ping-pong operation algorithm used in the present application includes three dimensions of control, namely, how many time slots the input pulse of each time step contains, how many time steps each group of neurons needs to calculate, and how many groups each layer of the network needs to be divided into for batch calculation.

如图4所示，以一个两层1024-512-256的全连接脉冲神经网络为例，加速器会在上电后从外部接收脉冲RAM#0、权重RAM#0和神经元参数RAM#0的数据。当收到第一次全局同步指令“Sync_all”后，加速器的权重RAM#1和神经元参数RAM#1从外部接收数据，脉冲RAM#0、权重RAM#0和神经元参数RAM#0的数据被送入核心计算单元，计算结果发送给脉冲RAM#1，此时加速器在计算第一层的前256个神经元。当收到第二次全局同步指令“Sync_all”后，加速器的权重RAM#0和神经元参数RAM#0从外部接收数据，脉冲RAM#0、权重RAM#1和神经元参数RAM#1的数据被送入核心计算单元，计算结果发送给脉冲RAM#1，此时加速器在计算第一层的后256个神经元。当收到第三次全局同步指令“Sync_all”后，加速器的权重RAM#1和神经元参数RAM#1从外部接收数据，脉冲RAM#1、权重RAM#0和神经元参数RAM#0的数据被送入核心计算单元，计算结果发送给脉冲RAM#0和加速器外部，此时加速器在计算第二层的256个神经元，至此便完成了全部计算过程。As shown in Figure 4, a two-layer 1024-512-256 fully connected spiking neural network is used as For example, the accelerator will receive data from the pulse RAM#0, weight RAM#0, and neuron parameter RAM#0 from the outside after power-on. After receiving the first global synchronization instruction "Sync_all", the accelerator's weight RAM#1 and neuron parameter RAM#1 receive data from the outside, and the data of pulse RAM#0, weight RAM#0, and neuron parameter RAM#0 are sent to the core computing unit, and the calculation results are sent to pulse RAM#1. At this time, the accelerator is calculating the first 256 neurons of the first layer. After receiving the second global synchronization instruction "Sync_all", the accelerator's weight RAM#0 and neuron parameter RAM#0 receive data from the outside, and the data of pulse RAM#0, weight RAM#1, and neuron parameter RAM#1 are sent to the core computing unit, and the calculation results are sent to pulse RAM#1. At this time, the accelerator is calculating the last 256 neurons of the first layer. After receiving the third global synchronization instruction "Sync_all", the accelerator's weight RAM#1 and neuron parameter RAM#1 receive data from the outside, and the data of pulse RAM#1, weight RAM#0 and neuron parameter RAM#0 are sent to the core computing unit. The calculation results are sent to pulse RAM#0 and the outside of the accelerator. At this time, the accelerator is calculating the 256 neurons in the second layer, and the entire calculation process is completed.

上述实施例，利用神经元参数缓存模块组实现对神经元参数在解码过程中的乒乓缓存，加速同时存在I/O操作以及数据处理操作，提高加速器的吞吐量。In the above-mentioned embodiment, the neuron parameter cache module group is used to implement ping-pong cache of neuron parameters during the decoding process, thereby accelerating simultaneous I/O operations and data processing operations and improving the throughput of the accelerator.

在一实施例中，上述稀疏脉冲检测模块105，进一步用于将脉冲输入信号对应的脉冲输入序列分为多组子序列；In one embodiment, the sparse pulse detection module 105 is further used to divide the pulse input sequence corresponding to the pulse input signal into a plurality of subsequences;

若所述按位或操作结果不为全0，则将所述按位或操作结果作为当前待检测序列，对当前待检测序列进行多轮检测，每一轮检测中，将当前待检测序列减1后得到差值；将所述差值与当前待检测序列进行按位与操作后得到按位与操作结果；将所述按位与操作结果与当前待检测序列进行按位异或操作后得到有效脉冲独热码；将所述有效脉冲独热码进行二进制转换后得到有效脉冲索引；判断所述按位与操作结果是否为全0，若是，则结束当前待检测序列的检测，并返回所述依次将每一组子序列与其自身进行按位或操作，得到按位或操作结果的步骤；若否，则将所述按位与操作结果作为当前待检测序列，返回所述将当前检测序列减1 后得到差值的步骤。If the bitwise OR operation result is not all 0, the bitwise OR operation result is used as the current sequence to be detected, and multiple rounds of detection are performed on the current sequence to be detected. In each round of detection, the current sequence to be detected is subtracted by 1 to obtain a difference; the difference is bitwise ANDed with the current sequence to be detected to obtain a bitwise AND operation result; the bitwise AND operation result is bitwise XORed with the current sequence to be detected to obtain a valid pulse one-hot code; the valid pulse one-hot code is binary-converted to obtain a valid pulse index; it is determined whether the bitwise AND operation result is all 0. If so, the detection of the current sequence to be detected is terminated, and the step of performing a bitwise OR operation on each group of subsequences with itself in turn to obtain a bitwise OR operation result is returned; if not, the bitwise AND operation result is used as the current sequence to be detected, and the step of subtracting 1 from the current detection sequence is returned. Then get the difference step.

其中，脉冲输入序列是对脉冲输入信号进行编码后得到的序列。有效脉冲索引是指脉冲输入序列中非零值所在的位置。The pulse input sequence is a sequence obtained by encoding the pulse input signal. The valid pulse index refers to the position of the non-zero value in the pulse input sequence.

具体地，如图5所示，图5展示了稀疏脉冲检测模块105的内部电路结构图，包括或门、多路选择器、D触发器、加法器、与门、异或门等逻辑门电路。稀疏脉冲检测模块105主要负责提取输入脉冲序列的有效脉冲索引。为了降低关键路径延时，本申请将输入脉冲序列分组，例如1024bit的输入脉冲序列被分割成16组64bit的子序列，每组64bit的子序列先与其自身进行按位或操作，得到按位或操作结果，如果该按位或操作结果为全0，说明本组子序列中的全部64bit均为0，则结束本组子序列的运算，即跳过该64bit子序列的计算。Specifically, as shown in FIG5 , FIG5 shows the internal circuit structure diagram of the sparse pulse detection module 105, including logic gate circuits such as an OR gate, a multiplexer, a D flip-flop, an adder, an AND gate, and an XOR gate. The sparse pulse detection module 105 is mainly responsible for extracting the effective pulse index of the input pulse sequence. In order to reduce the critical path delay, the present application groups the input pulse sequence. For example, a 1024-bit input pulse sequence is divided into 16 groups of 64-bit subsequences. Each group of 64-bit subsequences is first bitwise ORed with itself to obtain a bitwise OR operation result. If the bitwise OR operation result is all 0, it means that all 64 bits in this group of subsequences are 0, then the operation of this group of subsequences is terminated, that is, the calculation of the 64-bit subsequence is skipped.

如果按位或操作结果不是全0，则需要进一步检测当前组子序列中非零值的位置，检测方法具体包括：将上述按位或操作结果作为当前待检测序列，对当前待检测序列进行多轮检测，每一轮检测中，先把当前待检测序列(例如上述64bit的子序列)减1后得到差值，然后将差值与当前待检测序列自身进行按位与操作，得到按位与操作结果；将该按位与操作结果与上述减1之前的数据(即当前待检测序列自身)进行按位异或操作后即可得到有效脉冲独热码；同时，判断上述按位与操作结果是否为全0，若是，则结束本组子序列检测，进入下一组子序列的检测，在本实施例的例子中，即进入下一组64bit的子序列的检测；若否，则将所述按位与操作结果作为当前待检测序列，返回“将当前待检测序列减1后得到差值”的步骤。依此循环，直至每组子序列检测完成，即可得到若干个有效脉冲独热码，并通过独热码转二进制码的译码单元，可以得到脉冲输入信号对应的二进制码形式的有效脉冲索引。If the result of the bitwise OR operation is not all 0, it is necessary to further detect the position of the non-zero value in the current group of subsequences. The detection method specifically includes: taking the above-mentioned bitwise OR operation result as the current sequence to be detected, performing multiple rounds of detection on the current sequence to be detected, in each round of detection, first subtracting 1 from the current sequence to be detected (for example, the above-mentioned 64-bit subsequence) to obtain a difference, and then performing a bitwise AND operation on the difference and the current sequence to be detected itself to obtain a bitwise AND operation result; performing a bitwise XOR operation on the bitwise AND operation result and the data before the above-mentioned subtraction of 1 (that is, the current sequence to be detected itself) to obtain a valid pulse unique hot code; at the same time, judging whether the above-mentioned bitwise AND operation result is all 0, if so, ending the detection of this group of subsequences and entering the detection of the next group of subsequences, in the example of the present embodiment, that is, entering the detection of the next group of 64-bit subsequences; if not, taking the bitwise AND operation result as the current sequence to be detected, and returning to the step of "subtracting 1 from the current sequence to be detected and obtaining a difference". This cycle is repeated until each subsequence is detected, and a number of valid pulse one-hot codes can be obtained. Through a one-hot code to binary code decoding unit, a valid pulse index in binary code form corresponding to the pulse input signal can be obtained.

上述实施例，通过多个逻辑门电路进行组合得到稀疏脉冲检测模块105，能够实现稀疏脉冲的有效索引检测。In the above embodiment, the sparse pulse detection module 105 is obtained by combining multiple logic gate circuits, which can realize effective index detection of sparse pulses.

在一实施例中，如图6所示，图6展示了上述压缩权重计算模块104的结构原理图，该压缩权重计算模块104包括行偏移模块(Row Offset Module，ROM)、列索引Delta编码模块(Column Delta Module，CDM)、非零权值模块(Nonzero Weight Value Module，NWVM)、PE (Processing Element，处理单元)阵列，其中，每个PE单元都用于完成膜电位增量的计算，即脉冲输入信号与突触权值的点积，也即公式(1)中的x_iw_ij。图6中以1024×256的突触交叉阵列为例进行说明，1024表示1024个扇入(即1024个轴突)，256表示256个硬件神经元，即每个硬件神经元都有1024个扇入，但并不是每个扇入都有有效的脉冲信号。从稀疏脉冲检测模块传递来的有效脉冲索引会激活与该有效脉冲索引对应的行，然后行偏移模块(ROM)计算出相应的行偏移，列索引Delta编码模块(CDM)根据行偏移读取出该行的非零值的列索引，非零权值模块(NWVM)根据列索引把相应的非零值累加至膜电位上。当输入脉冲序列中所有的有效脉冲都被处理完毕后，所有膜电位累积值都会送入泄漏积分发放神经元动力学行为模块进行LIF运算。In one embodiment, as shown in FIG. 6, FIG. 6 shows a structural schematic diagram of the compression weight calculation module 104, which includes a row offset module (Row Offset Module, ROM), a column index Delta coding module (Column Delta Module, CDM), a nonzero weight value module (Nonzero Weight Value Module, NWVM), and a PE (Processing Element, processing unit) array, in which each PE unit is used to complete the calculation of membrane potential increment, that is, the dot product of the pulse input signal and the synaptic weight, that is, x _i w _ij in formula (1). In Figure 6, a 1024×256 synaptic crossbar array is used as an example for explanation. 1024 represents 1024 fan-in (i.e., 1024 axons), and 256 represents 256 hardware neurons, that is, each hardware neuron has 1024 fan-in, but not every fan-in has a valid pulse signal. The valid pulse index transmitted from the sparse pulse detection module will activate the row corresponding to the valid pulse index, and then the row offset module (ROM) calculates the corresponding row offset, the column index Delta encoding module (CDM) reads the column index of the non-zero value of the row according to the row offset, and the non-zero weight module (NWVM) accumulates the corresponding non-zero value to the membrane potential according to the column index. When all valid pulses in the input pulse sequence are processed, all membrane potential accumulation values will be sent to the leaky integral release neuron dynamics behavior module for LIF calculation.

由于有效权值矩阵会被三种参数来表示，即行偏移(Row Offset)、列索引(Column Indices)和非零值(Value)，如图2所示。相应地，为了根据上述三种参数还原出有效权值矩阵，如图7所示，加速器内对于上述三种参数的存储分别设置：行偏移存储模块(Row Offset Module，ROM，其中又细分为Even ROM和Odd ROM)，列索引Delta编码存储模块(Column Delta Module，CDM)和非零值存储模块(Nonzero Weight Value Module，NWVM)。为了流水线的进行，行偏移存储模块被进一步划分为偶数行偏移存储模块(Even ROM)和奇数行偏移存储模块(Odd ROM)。压缩权重计算过程的流水线示意图如图7所示，对于指向第i行的输入脉冲，CSR译码器同时从Even ROM(偶数行偏移模块)和Odd ROM(奇数行偏移模块)中分别读取第i行的行偏移和第i+1行的行偏移，将两者相减即可知道第i行脉冲对应的非零值数量，从而得到对列索引Delta编码存储模块(CDM)和非零值存储模块(NWVM)的访存地址以及访存轮数。由于对于第i行的脉冲，只需要一次行偏移访存和运算，但对于第i行，可能会需要超过一次列索引和权值访存和运算，这会带来两类数据结构处理流水线的吞吐率不匹配，因此为了避免低效的不均匀流水线停顿，电路设计中将CSR解码器解耦合为行偏移运算单元和列索引权值运算单元，二者之间使用FIFO(First Input First Output，先进先出)存储器进行同步，如图7中所示。 Since the effective weight matrix is represented by three parameters, namely, row offset, column index and non-zero value, as shown in Figure 2. Accordingly, in order to restore the effective weight matrix according to the above three parameters, as shown in Figure 7, the storage of the above three parameters in the accelerator is respectively set: row offset storage module (Row Offset Module, ROM, which is further divided into Even ROM and Odd ROM), column index Delta encoding storage module (Column Delta Module, CDM) and non-zero value storage module (NWVM). For the pipeline, the row offset storage module is further divided into even row offset storage module (Even ROM) and odd row offset storage module (Odd ROM). The pipeline diagram of the compression weight calculation process is shown in FIG7. For the input pulse pointing to the i-th row, the CSR decoder reads the row offset of the i-th row and the row offset of the i+1-th row from the Even ROM (even row offset module) and the Odd ROM (odd row offset module) respectively. The number of non-zero values corresponding to the pulse of the i-th row can be obtained by subtracting the two, thereby obtaining the memory access address and the number of memory access rounds for the column index Delta coding storage module (CDM) and the non-zero value storage module (NWVM). Since only one row offset memory access and calculation is required for the pulse of the i-th row, but more than one column index and weight memory access and calculation may be required for the i-th row, which will lead to a mismatch in the throughput of the two types of data structure processing pipelines. Therefore, in order to avoid inefficient uneven pipeline pauses, the CSR decoder is decoupled into a row offset calculation unit and a column index weight calculation unit in the circuit design, and the two are synchronized using a FIFO (First Input First Output) memory, as shown in FIG7.

列索引权值运算单元的电路结构如图8所示，对获取的第i行的行偏移RO_i和第i+1行的行偏移RO_i+1解析后分别得到列索引Delta编码存储模块(CDM)和非零值存储模块(NWVM)的访存起始地址Addr_i与结束地址Addr_i+1，在MAE(Masked Autoencoder，掩码自编码器)输出信号有效时，每个时钟周期根据Addr_i从列索引Delta编码存储模块(CDM)和非零值存储模块(NWVM)中分别读取到Delta编码格式的列索引ΔCI_j和非零权值(WV)发送到后续电路单元(即相应的PE单元)。当Addr_i超过了Addr_i+1时，列索引权值运算单元会结束第i行的处理，从行偏移存储模块中获取新一行的行偏移。The circuit structure of the column index weight calculation unit is shown in FIG8 . After parsing the obtained row offset RO _i of the i-th row and the row offset RO _i+1 of the i+1th row, the access start address Addr _i and the end address Addr _i+1 of the column index Delta coding storage module (CDM) and the non-zero value storage module (NWVM) are obtained respectively. When the output signal of the MAE (Masked Autoencoder) is valid, the column index ΔCI _j and the non-zero weight (WV) in the Delta coding format are read from the column index Delta coding storage module (CDM) and the non-zero value storage module (NWVM) according to Addr _i in each clock cycle and sent to the subsequent circuit unit (i.e., the corresponding PE unit). When Addr _i exceeds Addr _i+1 , the column index weight calculation unit will end the processing of the i-th row and obtain the row offset of a new row from the row offset storage module.

如图9所示，每个时钟周期从列索引Delta编码存储模块(CDM)中读出的Delta编码的列索引(即ΔCI_j)通过加法器链解码为列索引(即CI_j)。由于每个时钟周期可能从列索引Delta编码模块(CDM)中读出不属于当前行的ΔCI(即列索引Delta编码)，因此通过MUX(多路选择器)滤除这些不属于当前行的ΔCI得到过滤后的Delta编码ΔCI^′。如果当前时钟周期是该行运算的第1个周期，ΔCI′₀会作为本周期的CI₀；如果当前时钟周期不是该行运算的第1个周期，会在上个周期的CI₃₁的基础上加ΔCI′₀。而其他的CI_j＝CI_j-1+ΔCI′_j。As shown in FIG9 , the column index (i.e., ΔCI _j ) of the Delta code read out from the column index Delta code storage module (CDM) in each clock cycle is decoded into the column index (i.e., CI _j ) through the adder chain. Since the ΔCI (i.e., column index Delta code) that does not belong to the current row may be read out from the column index Delta code module (CDM) in each clock cycle, the ΔCI that does not belong to the current row is filtered out through the MUX (multiplexer) to obtain the filtered Delta code ΔCI ^′ . If the current clock cycle is the first cycle of the row operation, ΔCI′ ₀ will be used as CI ₀ of this cycle; if the current clock cycle is not the first cycle of the row operation, ΔCI′ ₀ will be added to CI ₃₁ of the previous cycle. The other CI _j = CI _j-1 + ΔCI′ _j .

PE阵列的结构如图10所示，图10展示了包含32个PE单元的PE阵列；每个PE单元包含了1个16位加法器与2个MUX(多路选择器)，所有树突的膜电位累积值输入到每个PE的MUX1(多选1的多路选择器)。在得到32个成对的CI_j(列索引)与WV(非零权值)后，权值分配器将每对CI_j(第j个列索引)与WV_j(第j个非零权值)作为对应的PE_j的输入：其中，CI_j会作为MUX1的多路选择信号，将对应的(第j个列索引对应的神经元的膜电位累积值)作为加法器的一个操作数；WV_j会输入到MUX2的2个输入端之一，经过权值掩码的过滤后，有效的WV_j会作为加法器的另一个操作数。权值分配器根据加法器计算的结果更新将更新后的作为下一个周期的累积值。The structure of the PE array is shown in Figure 10, which shows a PE array containing 32 PE units; each PE unit contains a 16-bit adder and 2 MUXs (multiple selectors), and the accumulated values of the membrane potentials of all dendrites are input to the MUX1 (multiple selector of multiple selection 1) of each PE. After obtaining 32 pairs of CI _j (column index) and WV (non-zero weight), the weight distributor uses each pair of CI _j (jth column index) and WV _j (jth non-zero weight) as the input of the corresponding PE _j : Among them, CI _j will be used as the multiple selection signal of MUX1, and the corresponding (The accumulated value of the membrane potential of the neuron corresponding to the jth column index) is used as an operand of the adder; WV _j will be input to one of the two input terminals of MUX2. After filtering by the weight mask, the effective WV _j will be used as the other operand of the adder. The weight distributor is updated according to the result calculated by the adder. The updated as the cumulative value for the next cycle.

权值掩码的作用是滤除无效的非零权值WV，避免产生运算错误。在计算中存在无效非零权值WV的原因有以下两点：(1)在访存NWVM时，在NWVM的同一个内存地址但不属于该行的权值也会被读出作为非零权值WV；(2)为了实现扇入扩展功能，权值矩阵的128列被分为多个共享扇入树突簇，在访存NWVM时，不属于接收当前输入脉冲向量的共享扇入树突簇的其他树突的权值也会被一起读出作为非零权值WV。The role of the weight mask is to filter out invalid non-zero weights WV to avoid calculation errors. There are two reasons why invalid non-zero weights WV exist in the calculation: (1) When accessing NWVM, the weights at the same memory address of NWVM but not belonging to the row will also be read out as invalid. Zero weight WV; (2) In order to realize the fan-in expansion function, the 128 columns of the weight matrix are divided into multiple shared fan-in dendrite clusters. When accessing NWVM, the weights of other dendrites that do not belong to the shared fan-in dendrite cluster receiving the current input pulse vector will also be read out as non-zero weight WV.

上述实施例，通过设置奇数行偏移模块和偶数行偏移模块，进一步提高了流水线的处理水平。In the above embodiment, the processing level of the pipeline is further improved by providing an odd-numbered row offset module and an even-numbered row offset module.

以上所描述的装置实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下，即可以理解并实施。The device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Ordinary technicians in this field can understand and implement it without paying creative labor.

通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件。基于这样的理解，上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在计算机可读存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that each implementation method can be implemented by means of software plus a necessary general hardware platform, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solution is essentially or the part that contributes to the prior art can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a disk, an optical disk, etc., including a number of instructions for a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in each embodiment or some parts of the embodiments.

最后应说明的是：以上实施例仅用以说明本申请的技术方案，而非对其限制；尽管参照前述实施例对本申请进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。 Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, rather than to limit it. Although the present application has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the aforementioned embodiments, or make equivalent replacements for some of the technical features therein. However, these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present application.

Claims

A sparse pulse neural network accelerator based on a ping-pong architecture includes a pulse input interface, a weight and neuron parameter input interface, a sparse pulse detection module, a compression weight calculation module, and a leakage integral issuance module; wherein,

The pulse input interface is used to receive a pulse input signal and input the pulse input signal to the sparse pulse detection module;

The weight and neuron parameter input interface is used to receive the compressed weight value and input the compressed weight value into the compressed weight calculation module;

The sparse pulse detection module is used to extract a valid pulse index from the pulse input signal; the valid pulse index is used to characterize the position of a non-zero value in the pulse input signal;

The compression weight calculation module is used to decompress the compression weight value according to the effective pulse index to obtain an effective weight matrix; calculate the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment on each neuron; and use the membrane potential increment on each neuron to update the membrane potential accumulation corresponding to each neuron;

The leaky integral issuing module is used to determine the magnitude relationship between the updated membrane potential accumulation amount and a preset threshold value, and determine the output pulse result corresponding to each neuron according to the magnitude relationship.

The sparse pulse neural network accelerator based on the ping-pong architecture according to claim 1 further comprises a pulse buffer module group; the pulse buffer module group comprises a first pulse buffer module and a second pulse buffer module;

The pulse buffer module group is used to control the read and write states of the first pulse buffer module and the second pulse buffer module in each buffer cycle in a ping-pong switching manner, so that one pulse buffer module is in a read state and the other pulse buffer module is in a write state in each buffer cycle.

The sparse pulse neural network accelerator based on the ping-pong architecture according to claim 2 further comprises a weight cache module group; the weight cache module group comprises a first weight cache module and a second weight cache module;

The weight cache module group is used to control the read and write states of the first weight cache module and the second weight cache module in each cache cycle in a ping-pong switching manner, so that one of the weight cache modules is in a read state and the other weight cache module is in a write state in each cache cycle. state.

The sparse pulse neural network accelerator based on the ping-pong architecture according to claim 3 further comprises a neuron parameter cache module group; the neuron parameter cache module group comprises a first neuron parameter cache module and a second neuron parameter cache module;

The neuron parameter cache module group is used to control the read and write states of the first neuron parameter cache module and the second neuron parameter cache module in each cache cycle in a ping-pong switching manner, so that one of the neuron parameter cache modules is in a read state and the other neuron parameter cache module is in a write state in each cache cycle.

The sparse pulse neural network accelerator based on ping-pong architecture according to claim 1, wherein the sparse pulse detection module is further used to divide the pulse input sequence corresponding to the pulse input signal into a plurality of subsequences;

Performing a bitwise OR operation on each group of subsequences with itself in turn to obtain a bitwise OR operation result; if the bitwise OR operation result is all 0, then the operation of the current group of subsequences is terminated;

If the bitwise OR operation result is not all 0, the bitwise OR operation result is used as the current sequence to be detected, and multiple rounds of detection are performed on the current sequence to be detected. In each round of detection, the current sequence to be detected is subtracted by 1 to obtain a difference; the difference is bitwise ANDed with the current sequence to be detected to obtain a bitwise AND operation result; the bitwise AND operation result is bitwise XORed with the current sequence to be detected to obtain a valid pulse one-hot code; the valid pulse one-hot code is binary-converted to obtain a valid pulse index; it is determined whether the bitwise AND operation result is all 0. If so, the detection of the current sequence to be detected is terminated, and the step of performing a bitwise OR operation on each group of subsequences with itself in turn to obtain a bitwise OR operation result is returned; if not, the bitwise AND operation result is used as the current sequence to be detected, and the step of subtracting 1 from the current detection sequence to obtain a difference is returned.

According to the sparse pulse neural network accelerator based on the ping-pong architecture of claim 1, wherein the compression weight calculation module includes a row offset operation unit, a column index weight operation unit, a row offset storage module, a column index Delta encoding storage module, a non-zero weight storage module, a weight distributor and a processing unit array; wherein,

The row offset calculation unit is used to read the row offset of the current row and the row offset of the next row adjacent to the current row from the row offset storage module;

The column index weight calculation unit is used to analyze the row offset of the current row and the The row offset of the next row adjacent to the current row obtains the starting address and the ending address of the column index encoding module and the non-zero weight module; according to the starting address and the ending address, the column index Delta code and the non-zero weight are obtained from the column index encoding module and the non-zero weight module respectively;

The weight distributor comprises an adder chain composed of a preset number of adders, wherein the adder chain is used to use the non-zero weight and the Delta code as inputs of a processing unit array;

The processing unit array includes a preset number of processing units, each processing unit uses an adder to perform calculations, and the membrane potential accumulation amount is updated according to the calculation results of the adder.

According to the ping-pong architecture-based sparse pulse neural network accelerator of claim 6, wherein the column index weight calculation unit is further used to end the processing of the current row when the starting address exceeds the ending address.

According to the ping-pong architecture-based sparse pulse neural network accelerator of claim 6, each processing unit further includes a weight mask generation module, a multiplexer and an adder;

The weight mask generating module is used to generate a weight mask according to the weight distribution state;

The processing unit is also used to use the weight mask and the non-zero weight as inputs of a multiplexer, so that the multiplexer obtains a valid non-zero value after filtering according to the weight mask, and uses the valid non-zero value and the membrane potential accumulation as inputs of an adder to obtain the updated membrane potential accumulation output by the adder.

According to the sparse pulse neural network accelerator based on the ping-pong architecture of claim 1, wherein the leakage integral issuance module is further used to add the updated membrane potential accumulation amount to the preset leakage value to obtain a leakage integral value; if the leakage integral value is greater than the preset threshold, the output pulse result is determined to be a pulse signal; if the leakage integral value is less than or equal to the preset threshold, the pulse result is determined to be no pulse issuance.

The sparse pulse neural network accelerator based on the ping-pong architecture according to claim 1 further comprises a pulse output interface;

The leakage integral issuance module is further used to send the updated membrane potential accumulation amount to the pulse output interface and reset the membrane potential accumulation amount when it is determined that the updated membrane potential accumulation amount is greater than the preset threshold.