CN116663626A - Sparse Spiking Neural Network Accelerator Based on Ping-Pong Architecture - Google Patents
Info
- Publication number
- CN116663626A CN116663626A CN202310410779.3A CN202310410779A CN116663626A CN 116663626 A CN116663626 A CN 116663626A CN 202310410779 A CN202310410779 A CN 202310410779A CN 116663626 A CN116663626 A CN 116663626A
- Authority
- CN
- China
- Prior art keywords
- pulse
- module
- weight
- sparse
- ping
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/061—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Complex Calculations (AREA)
- Manipulation Of Pulses (AREA)
- Particle Accelerators (AREA)
Abstract
The application provides a sparse pulse neural network accelerator based on a ping-pong architecture. Compression weight values are transmitted to a compression weight calculation module, and a sparse pulse detection module extracts valid pulse indices from the pulse input signal, so that the remaining bits of the pulse signal are kept out of the computation and the amount of calculation is reduced; the compression weight calculation module then accumulates the non-zero values among the compression weight values onto the neuron membrane potentials according to the valid pulse indices, which finally determines whether a pulse is fired. Compared with a conventional synaptic crossbar array, in which all synapses are activated and take part in the computation, only the synaptic weights corresponding to the valid pulse indices are activated while the other synapses stay idle, which reduces the amount of computation, lowers the operating power consumption of the whole chip, and improves the operating speed, energy efficiency and area efficiency of the pulse neural network.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a sparse pulse neural network accelerator based on a ping-pong architecture.
Background
The pulse neural network (Spiking Neural Network, SNN), owing to its low power consumption and high concurrency, can raise the computing power available to machines; it is a computing paradigm with great potential and is widely regarded as the future of artificial intelligence research.
Because the signaling mechanism of neurons in a pulse neural network does not fit the conventional von Neumann computer architecture, dedicated hardware accelerators must be designed to run pulse neural networks. Current neuromorphic accelerators usually adopt a regularly structured, fixed-size synaptic crossbar array to store the synaptic connection matrix of an SNN model directly; no matter how sparse the synapses are, all of them take part in the computation, which increases the amount of calculation, and the spatial sparsity of the SNN model is therefore not reflected in the neuromorphic accelerator. Another accelerator design buffers the input pulses as a bitmap, which forces the hardware to check the validity of every bit of the input pulse vector and increases the computation time; as a result, the temporal sparsity of the pulse signal cannot be used to speed up the hardware.
Therefore, current neural network accelerators cannot fully exploit the potential performance advantages of the SNN model: the power consumption is not low enough, the operation is not fast enough, and the energy efficiency of running the SNN model is not high enough.
Disclosure of Invention
The application provides a sparse pulse neural network accelerator based on a ping-pong architecture, to overcome the defect of the prior art that the chip power consumption and the amount of calculation are large because all synapses, or every bit of the input pulse signal, take part in the computation, and to realize low-power, low-latency operation of the pulse neural network.
The application provides a sparse pulse neural network accelerator based on a ping-pong architecture, which comprises a pulse input interface, a weight and neuron parameter input interface, a sparse pulse detection module, a compression weight calculation module and a leaky integrate-and-fire module; wherein:

the pulse input interface is used for receiving a pulse input signal and inputting the pulse input signal to the sparse pulse detection module;

the weight and neuron parameter input interface is used for receiving compression weight values and inputting them to the compression weight calculation module;

the sparse pulse detection module is used for extracting valid pulse indices from the pulse input signal, a valid pulse index representing the position of a non-zero value in the pulse input signal;

the compression weight calculation module is used for decompressing the compression weight values according to the valid pulse indices to obtain an effective weight matrix, calculating the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment of each neuron, and updating the accumulated membrane potential of each neuron with that increment;

the leaky integrate-and-fire module is used for judging the magnitude relation between the updated accumulated membrane potential and a preset threshold, and determining the output pulse result of each neuron according to that relation.
The sparse pulse neural network accelerator based on the ping-pong architecture provided by the application further comprises a pulse cache module group; the pulse buffer module group comprises a first pulse buffer module and a second pulse buffer module;
the pulse buffer module group is used for controlling the read-write states of the first pulse buffer module and the second pulse buffer module in each buffer period in a ping-pong switching mode so that one pulse buffer module is in a read state and the other pulse buffer module is in a write state in each buffer period.
The sparse pulse neural network accelerator based on the ping-pong architecture provided by the application further comprises a weight cache module group; the weight buffer module group comprises a first weight buffer module and a second weight buffer module;
the weight buffer module group is used for controlling the read-write states of the first weight buffer module and the second weight buffer module in each buffer period in a ping-pong switching mode so that one weight buffer module is in a read state and the other weight buffer module is in a write state in each buffer period.
The sparse pulse neural network accelerator based on the ping-pong architecture provided by the application further comprises a neuron parameter cache module group; the neuron parameter cache module group comprises a first neuron parameter cache module and a second neuron parameter cache module;
the neuron parameter buffer module group is used for controlling the read-write states of the first neuron parameter buffer module and the second neuron parameter buffer module in each buffer period in a ping-pong switching mode so that one neuron parameter buffer module is in a read state and the other neuron parameter buffer module is in a write state in each buffer period.
According to the sparse pulse neural network accelerator based on the ping-pong architecture, the sparse pulse detection module is further used for dividing a pulse input sequence corresponding to the pulse input signal into a plurality of groups of subsequences;
each group of subsequences is bitwise-ORed with itself in turn to obtain a bitwise-OR result; if the bitwise-OR result is all zeros, the operation on the current group of subsequences ends;

if the bitwise-OR result is not all zeros, the bitwise-OR result is taken as the current sequence to be detected and several rounds of detection are performed on it. In each round, 1 is subtracted from the current sequence to be detected to obtain a difference; a bitwise AND of the difference and the current sequence to be detected gives a bitwise-AND result; a bitwise XOR of the bitwise-AND result and the current sequence to be detected gives a valid-pulse one-hot code; the one-hot code is converted to binary to obtain a valid pulse index. It is then judged whether the bitwise-AND result is all zeros: if so, the detection of the current sequence to be detected ends and the flow returns to the step of bitwise-ORing each group of subsequences with itself in turn to obtain a bitwise-OR result; if not, the bitwise-AND result is taken as the current sequence to be detected and the flow returns to the step of subtracting 1 from the current sequence to be detected to obtain a difference.
According to the sparse pulse neural network accelerator based on the ping-pong architecture, the compression weight calculation module comprises a row offset operation unit, a column index weight calculation unit, a column index coding module, a non-zero weight module, a weight distributor and a processing unit array; wherein:
the row offset operation unit is used for acquiring the row offset of the current row and the row offset of the next row adjacent to the current row;
the column index weight calculation unit is used for analyzing the row offset of the current row and the row offset of the next row adjacent to the current row to obtain the starting address and the ending address of the column index coding module and the non-zero weight module; acquiring a non-zero weight and a column index Delta code from the column index coding module and the non-zero weight module according to the starting address and the ending address;
the weight distributor comprises an adder chain formed by a preset number of adders, and the adder chain is used for converting the non-zero weights and the column index Delta codes into the inputs of the processing unit array;

the processing unit array comprises a preset number of processing units, each processing unit performs its operation with an adder, and the membrane potential accumulation amount is updated according to the operation result of the adder.
According to the sparse pulse neural network accelerator based on the ping-pong architecture provided by the application, the column index weight calculation unit is further used for ending the processing of the current row under the condition that the starting address exceeds the ending address.
According to the sparse pulse neural network accelerator based on the ping-pong architecture, each processing unit further comprises a weight mask generating module, a multiplexer and an adder;
the weight mask generation module is used for generating a weight mask according to the weight distribution state;
the processing unit is further configured to use the weight mask and the non-zero weight as inputs of a multiplexer, so that the multiplexer obtains an effective non-zero value after filtering according to the weight mask, and uses the effective non-zero value and the accumulated amount of the membrane potential as inputs of an adder to obtain the updated accumulated amount of the membrane potential output by the adder.
According to the sparse pulse neural network accelerator based on the ping-pong architecture, the leaky integrate-and-fire module is further used for adding a preset leakage value to the updated membrane potential accumulation amount to obtain a leaky integration value; if the leaky integration value is larger than the preset threshold, the output pulse result is that a pulse is fired; and if the leaky integration value is smaller than or equal to the preset threshold, the output pulse result is that no pulse is fired.
The sparse pulse neural network accelerator based on the ping-pong architecture provided by the application further comprises a pulse output interface;
the leaky integrate-and-fire module is further configured to send the updated accumulated membrane potential to the pulse output interface and to reset the accumulated membrane potential if the updated accumulated membrane potential is determined to be greater than the preset threshold.
According to the sparse pulse neural network accelerator based on the ping-pong architecture, the compression weight values are transmitted to the compression weight calculation module, and the sparse pulse detection module extracts the valid pulse indices from the pulse input signal, so that the remaining bits of the pulse signal do not take part in the computation and the amount of calculation is reduced; the compression weight calculation module accumulates the non-zero values among the compression weight values onto the neuron membrane potentials according to the valid pulse indices, and finally decides whether a pulse is fired. Compared with a conventional synaptic crossbar array in which all synapses are activated and take part in the computation, the application activates only the synaptic weights corresponding to the valid pulse indices while the other synapses stay idle, thereby reducing the amount of computation, lowering the operating power consumption of the whole chip, and improving the operating speed, energy efficiency and area efficiency of the pulse neural network.
Drawings
In order to illustrate the technical solutions of the application or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the application; a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of a functional module of a sparse pulse neural network accelerator based on a ping-pong architecture provided by the application;
FIG. 2 is a schematic diagram of matrix elements of a row compression matrix provided by the present application;
FIG. 3 is a schematic diagram of LIF neurons provided by the present application;
FIG. 4 is a schematic flow chart of a ping-pong operation method provided by the application;
FIG. 5 is a schematic diagram of a hardware circuit of the sparse pulse detection module provided by the present application;
FIG. 6 is a schematic diagram of the compression weight calculation module according to the present application;
FIG. 7 is a schematic diagram of a functional block of the compression weight calculation module according to the present application;
FIG. 8 is a schematic diagram of a hardware circuit of a column index weight calculation unit according to the present application;
FIG. 9 is a schematic diagram of adder chain decoding provided by the present application;
fig. 10 is a schematic structural diagram of a PE array according to the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Specific embodiments of the present application are described below in conjunction with fig. 1-10:
The functional modules of the sparse pulse neural network accelerator based on the ping-pong architecture provided by the embodiment of the application are shown in FIG. 1; the accelerator comprises a pulse input interface 101, a weight and neuron parameter input interface 102, a sparse pulse detection module 103, a compression weight calculation module 104 and a leaky integrate-and-fire module 105; wherein:
a pulse input interface 101 for receiving a pulse input signal and inputting the pulse input signal to a sparse pulse detection module 103;
specifically, the pulse input signal is transmitted through the pulse input interface 101 into the sparse pulse detection module 103.
A weight and neuron parameter input interface 102 for receiving a compression weight value and inputting the compression weight value to the compression weight calculation module 104;
the compression weight value refers to a weight value in a sparse matrix storage format. Because the weight parameters in the impulse neural network model are very many, the sparse data storage format can save a large amount of storage space and accelerate the calculation speed. The sparse matrix representation format used in the present application is CSR (Compressed Sparse Row line compression) format.
Specifically, the compression weight values stored in CSR format are input to the compression weight calculation module 104 through the weight and neuron parameter input interface 102.
A sparse pulse detection module 103, configured to extract a valid pulse index from the pulse input signal; the valid pulse index is used to characterize the location of non-zero values in the pulse input signal.
Since the pulse input signal is a sparse vector/tensor (sparse means having a large number of zeros), the sparse vector/tensor needs to be multiplied by the synaptic weight in the calculation process, which means that most of the synaptic weights are multiplied by "0", and the result is still 0. In order to reduce the overall power consumption of the chip, the application activates only the neurons needing to be activated in the synaptic crossover matrix, namely activates the corresponding neurons in the synaptic crossover array according to the non-zero value position in the pulse input signal. To determine which neurons need to be activated, the present application uses the sparse pulse detection module 103 to extract a valid pulse index from the pulse input signal, where the valid pulse index is used to represent the location of non-zero values in the pulse input signal.
The compression weight calculation module 104 is configured to decompress the compression weight values received from outside the chip according to the valid pulse indices to obtain an effective weight matrix; to calculate the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment of each neuron; and to use the membrane potential increment of each neuron to update that neuron's accumulated membrane potential.
The effective weight matrix, shown as the left matrix in FIG. 2, is the matrix restored from the compression weight values according to the valid pulse indices. The compression weight values are shown as the "non-zero values" in FIG. 2: each of their elements is a non-zero element of the effective weight matrix. Because on-chip storage is limited while a neural network often has a huge number of parameters and weights, the application stores the weights in the CSR (Compressed Sparse Row) format to save storage space. The CSR format represents a sparse matrix (i.e. the effective weight matrix) with three one-dimensional arrays: row offsets, column indices and non-zero values. The row offset array stores the accumulated count of non-zero values of the preceding rows; the m-th element of the row offset array is the number of non-zero values above row m of the effective weight matrix. For example, the 2nd element "4" of the row offset array in FIG. 2 indicates that there are 4 non-zero values (1, 7, 2 and 8) above row 2 of the effective weight matrix (rows are numbered from 0), and so on. Subtracting two adjacent elements of the row offset array therefore gives the number of non-zero values contained in each row; for row i, it suffices to read the i-th and (i+1)-th elements of the row offset array and subtract them. The column index array stores the column index of each element of the non-zero value array; for example, the column index corresponding to "7" in the non-zero value array is 1, indicating that 7 is located in column 1 of the effective weight matrix (columns are numbered from 0).
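As an illustration of the CSR bookkeeping described above, the following sketch builds the three CSR arrays for a small hypothetical matrix (not the matrix of FIG. 2) and shows how adjacent row offsets give the per-row non-zero counts; the helper names are illustrative assumptions rather than names used in the accelerator.

```python
import numpy as np

def dense_to_csr(matrix):
    """Build the three CSR arrays: row offsets, column indices, non-zero values."""
    row_offsets, col_indices, values = [0], [], []
    for row in matrix:
        for col, v in enumerate(row):
            if v != 0:
                col_indices.append(col)
                values.append(v)
        row_offsets.append(len(values))   # accumulated non-zero count up to this row
    return row_offsets, col_indices, values

# Small hypothetical matrix, used only to illustrate the format.
dense = np.array([[1, 7, 0, 0],
                  [0, 2, 8, 0],
                  [5, 0, 0, 9],
                  [0, 0, 6, 0]])

ro, ci, nz = dense_to_csr(dense)
print(ro)   # [0, 2, 4, 6, 7] -> ro[i+1] - ro[i] non-zeros in row i
print(ci)   # [0, 1, 1, 2, 0, 3, 2]
print(nz)   # [1, 7, 2, 8, 5, 9, 6]
```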
Specifically, the compression weight calculation module 104 decompresses the compression weight values stored in the CSR format to obtain the effective weight matrix, calculates the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment of each neuron, and adds the membrane potential increment of each neuron to that neuron's accumulated membrane potential to obtain the updated accumulated membrane potential.
The leaky integrate-and-fire module 105 is configured to determine the magnitude relation between the updated accumulated membrane potential and a preset threshold, and to determine the output pulse result of each neuron according to this relation.
Specifically, as shown in FIG. 3, FIG. 3 illustrates the structure of the pulse neural network algorithm used in the present application. The neurons of the pulse neural network supported by the application are linear LIF (leaky integrate-and-fire) models: after the accumulated pulse time sequence is input, linear leakage, threshold comparison and pulse emission are carried out. The membrane potential dynamics of the LIF neuron model are:
V_j(t+1) = V_j(t) + Σ_i x_i·w_ij − λ_j    (1)

wherein V_j(t) is the accumulated membrane potential of neuron j at time t, w_ij is the i-th synaptic weight of neuron j, x_i is the pulse input signal value of the i-th synapse, and λ_j is the linear leakage of neuron j.
The leaky integrate-and-fire module 105 determines the magnitude relation between the updated accumulated membrane potential V_j(t+1) and a preset threshold; if it exceeds the preset threshold, a neural pulse signal is emitted and the accumulated membrane potential of the neuron is reset.
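The following sketch evaluates equation (1) for one time step and then applies the threshold comparison and reset performed by the leaky integrate-and-fire module; the function name, array shapes and example numbers are illustrative assumptions, not part of the hardware design.

```python
import numpy as np

def lif_step(v_mem, x, w, leak, threshold):
    """One time step of equation (1) followed by threshold comparison and reset.

    v_mem: accumulated membrane potential V_j(t), one entry per neuron j
    x:     pulse input vector x_i (0/1 per synapse)
    w:     synaptic weight matrix w_ij, shape (num_synapses, num_neurons)
    leak:  linear leakage lambda_j
    """
    v_mem = v_mem + x @ w - leak          # V_j(t+1) = V_j(t) + sum_i x_i*w_ij - lambda_j
    spikes = v_mem > threshold            # fire where the threshold is exceeded
    v_mem = np.where(spikes, 0.0, v_mem)  # reset the membrane potential of firing neurons
    return spikes.astype(np.uint8), v_mem

# Illustrative numbers only.
x = np.array([1, 0, 1, 0])                                    # sparse pulse input
w = np.array([[0.5, 1.0], [2.0, 0.0], [1.5, 0.5], [0.0, 3.0]])
spikes, v = lif_step(np.zeros(2), x, w, leak=0.1, threshold=1.5)
print(spikes, v)   # [1 0] [0.  1.4]
```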
In the above embodiment, the compression weight values are transmitted to the compression weight calculation module, and the sparse pulse detection module extracts the valid pulse indices from the pulse input signal, so that the remaining bits of the pulse signal do not take part in the computation and the amount of calculation is reduced; the compression weight calculation module accumulates the non-zero values among the compression weight values onto the neuron membrane potentials according to the valid pulse indices and finally decides whether a pulse is fired. Compared with a conventional synaptic crossbar array in which all synapses are activated and take part in the computation, the application activates only the synaptic weights corresponding to the valid pulse indices, thereby reducing the amount of computation, lowering the operating power consumption of the whole chip, and improving the operating speed, energy efficiency and area efficiency of the pulse neural network.
In an embodiment, as shown in fig. 1, the sparse pulse neural network accelerator based on the ping-pong architecture further includes a pulse buffer module group 106, where the pulse buffer module group 106 includes a first pulse buffer module and a second pulse buffer module; the pulse buffer module group 106 is configured to control the read-write states of the first pulse buffer module and the second pulse buffer module in each buffer period in a ping-pong switching manner, so that one pulse buffer module is in a read state and the other pulse buffer module is in a write state in each buffer period.
Specifically, two pulse buffer RAMs (Random Access Memory) are arranged in the sparse pulse neural network accelerator based on the ping-pong architecture. They are used for encoding the pulse input signals received from outside the chip into pulse input codes suitable for computation; during encoding, the whole system can update one RAM while the other RAM is used for computation. Correspondingly, the output pulse codes must also be decoded back into output pulse signals; the decoding likewise uses the ping-pong operation, updating one RAM while the other is being used for computation.
In this embodiment, the pulse buffer module group provides ping-pong buffering while the pulse input signal is encoded or the pulse output signal is decoded, so that I/O operations and data processing proceed simultaneously and the throughput of the accelerator is improved.
In an embodiment, as shown in fig. 1, the sparse pulse neural network accelerator based on the ping-pong architecture further includes a weight buffer module group 107, where the weight buffer module group 107 includes a first weight buffer module and a second weight buffer module; the weight buffer module group 107 is configured to control the read-write states of the first weight buffer module and the second weight buffer module in each buffer period in a ping-pong switching manner, so that one weight buffer module is in a read state and the other weight buffer module is in a write state in each buffer period.
Specifically, two weight buffer RAMs are arranged in the sparse pulse neural network accelerator based on the ping-pong architecture; they are used for decompressing the CSR-format compression weight values received from outside the chip to obtain the effective weight matrix, and during decompression the whole system can decompress from one RAM while updating the other.
In this embodiment, the weight buffer module group provides ping-pong buffering while the CSR-format compression weight values are decompressed, so that I/O operations and data processing proceed simultaneously and the throughput of the accelerator is improved.
In one embodiment, the accelerator further includes a set of neuron parameter cache modules 108; the set of neuron parameter cache modules 108 includes a first neuron parameter cache module and a second neuron parameter cache module; the set of neuron parameter buffer modules 108 is configured to control the read-write states of the first neuron parameter buffer module and the second neuron parameter buffer module in each buffer cycle in a ping-pong switching manner, so that one of the neuron parameter buffer modules is in a read state and the other one of the neuron parameter buffer modules is in a write state in each buffer cycle.
Specifically, two neuron parameter buffer RAMs are arranged in the sparse pulse neural network accelerator based on the ping-pong architecture; they are used for decoding the neuron parameters received from outside the chip into neuron parameters suitable for computation, and during decoding the whole system can update one RAM while the other RAM is in use.
Further, the ping-pong operation used in the present application involves control over three dimensions: how many time slots each time step of the input pulse contains, how many time steps each group of neurons needs to compute, and how many groups each layer of the network is divided into.
As shown in FIG. 4, taking a two-layer 1024-512-256 fully-connected pulse neural network as an example, the accelerator receives the data of pulse RAM#0, weight RAM#0 and neuron parameter RAM#0 from outside after power-up. When the first global synchronization command "sync_all" is received, weight RAM#1 and neuron parameter RAM#1 receive data from outside, the data of pulse RAM#0, weight RAM#0 and neuron parameter RAM#0 are sent to the core calculation unit, and the calculation result is sent to pulse RAM#1; at this point the accelerator computes the first 256 neurons of the first layer. When the second global synchronization command "sync_all" is received, weight RAM#0 and neuron parameter RAM#0 receive data from outside, the data of pulse RAM#0, weight RAM#1 and neuron parameter RAM#1 are sent to the core calculation unit, and the calculation result is sent to pulse RAM#1; at this point the accelerator computes the last 256 neurons of the first layer. When the third global synchronization command "sync_all" is received, weight RAM#1 and neuron parameter RAM#1 receive data from outside, the data of pulse RAM#1, weight RAM#0 and neuron parameter RAM#0 are sent to the core calculation unit, and the calculation result is sent to pulse RAM#0 and off-chip; at this point the accelerator computes the 256 neurons of the second layer, completing the whole calculation process.
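The following simplified sketch illustrates the ping-pong switching idea behind this schedule: on every synchronization, one RAM of each pair is read by the core calculation unit while the other is refilled from outside. It swaps every buffer pair on each synchronization, whereas the schedule above keeps the pulse RAM on the same side for all groups of a layer; all names are illustrative assumptions.

```python
class BufferPair:
    """One ping-pong pair: while the core reads one RAM, the other is written."""
    def __init__(self):
        self.ram = [None, None]    # RAM#0 and RAM#1
        self.read_sel = 0          # which RAM the core reads in this cycle

    def write_side(self):
        return 1 - self.read_sel   # RAM currently open for external writes

    def swap(self):                # performed on a global "sync_all"
        self.read_sel = 1 - self.read_sel


def sync_all(pulse, weight, neuron, new_weight, new_neuron, core_compute):
    """One synchronization step: load the write-side RAMs from outside while the
    core computes from the read-side RAMs, then swap every pair."""
    weight.ram[weight.write_side()] = new_weight
    neuron.ram[neuron.write_side()] = new_neuron
    result = core_compute(pulse.ram[pulse.read_sel],
                          weight.ram[weight.read_sel],
                          neuron.ram[neuron.read_sel])
    pulse.ram[pulse.write_side()] = result       # output pulses go to the other pulse RAM
    for buf in (pulse, weight, neuron):
        buf.swap()


pulse, weight, neuron = BufferPair(), BufferPair(), BufferPair()
pulse.ram[0], weight.ram[0], neuron.ram[0] = "spikes#0", "weights#0", "params#0"
sync_all(pulse, weight, neuron, "weights#1", "params#1",
         core_compute=lambda s, w, n: f"spikes out of ({s}, {w}, {n})")
print(pulse.ram[pulse.read_sel])   # newly produced pulses, readable in the next cycle
```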
In this embodiment, the neuron parameter buffer module group provides ping-pong buffering of the neuron parameters during decoding, so that I/O operations and data processing proceed simultaneously and the throughput of the accelerator is improved.
In an embodiment, the sparse pulse detection module 103 is further configured to divide the pulse input sequence corresponding to the pulse input signal into a plurality of groups of subsequences;

each group of subsequences is bitwise-ORed with itself in turn to obtain a bitwise-OR result; if the bitwise-OR result is all zeros, the operation on the current group of subsequences ends;

if the bitwise-OR result is not all zeros, the bitwise-OR result is taken as the current sequence to be detected and several rounds of detection are performed on it. In each round, 1 is subtracted from the current sequence to be detected to obtain a difference; a bitwise AND of the difference and the current sequence to be detected gives a bitwise-AND result; a bitwise XOR of the bitwise-AND result and the current sequence to be detected gives a valid-pulse one-hot code; the one-hot code is converted to binary to obtain a valid pulse index. It is then judged whether the bitwise-AND result is all zeros: if so, the detection of the current sequence to be detected ends and the flow returns to the step of bitwise-ORing each group of subsequences with itself in turn to obtain a bitwise-OR result; if not, the bitwise-AND result is taken as the current sequence to be detected and the flow returns to the step of subtracting 1 from the current sequence to be detected to obtain a difference.
The pulse input sequence is a sequence obtained by encoding a pulse input signal. The valid pulse index refers to the position in the pulse input sequence where the non-zero value is located.
Specifically, as shown in FIG. 5, FIG. 5 shows the internal circuit structure of the sparse pulse detection module 103, which consists of logic elements such as OR gates, multiplexers, D flip-flops, adders, AND gates and XOR gates. The sparse pulse detection module 103 is mainly responsible for extracting the valid pulse indices of the input pulse sequence. To shorten the critical path delay, the application divides the input pulse sequence into 64-bit subsequences; for example, a 1024-bit input pulse sequence is divided into 16 groups. Each 64-bit subsequence is first bitwise-ORed with itself to obtain a bitwise-OR result; if the result is all zeros, all 64 bits of that subsequence are 0, so the operation on that group ends, i.e. the computation for that 64-bit subsequence is skipped.
If the bitwise-OR result is not all zeros, the positions of the non-zero values in the current group must be detected further. The detection proceeds as follows: the bitwise-OR result is taken as the current sequence to be detected, and several rounds of detection are performed on it. In each round, 1 is subtracted from the current sequence to be detected (e.g. the 64-bit subsequence) to obtain a difference; a bitwise AND of the difference and the current sequence to be detected gives a bitwise-AND result; a bitwise XOR of the bitwise-AND result and the data before the subtraction (i.e. the current sequence to be detected) gives a valid-pulse one-hot code. Meanwhile it is judged whether the bitwise-AND result is all zeros: if so, detection of this group of subsequences ends and detection of the next group (in this example, the next 64-bit subsequence) begins; if not, the bitwise-AND result becomes the current sequence to be detected and the flow returns to the step of subtracting 1. This loop continues until every group of subsequences has been detected, yielding a set of valid-pulse one-hot codes; a one-hot-to-binary decoding unit then converts them into the valid pulse indices, in binary form, corresponding to the pulse input signal.
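The bit manipulations above correspond to the classic "clear the lowest set bit" trick; the following sketch applies it to one subsequence word and converts each isolated one-hot code into a binary index. The function name and arguments are illustrative assumptions.

```python
def valid_pulse_indices(spike_word, base=0):
    """Extract the positions of the set bits of one subsequence word using the
    subtract-1 / AND / XOR steps described above. `base` is the bit offset of
    this subsequence within the full input pulse sequence."""
    indices = []
    seq = spike_word
    while seq != 0:                   # an all-zero word is skipped entirely
        cleared = seq & (seq - 1)     # bitwise-AND result: lowest set bit cleared
        one_hot = cleared ^ seq       # bitwise-XOR result: isolated one-hot code
        indices.append(base + one_hot.bit_length() - 1)  # one-hot -> binary index
        seq = cleared                 # continue with the remaining bits
    return indices

# Example: 64-bit subsequence with pulses at bit positions 3, 17 and 40.
word = (1 << 3) | (1 << 17) | (1 << 40)
print(valid_pulse_indices(word))      # [3, 17, 40]
```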
In the above embodiment, the sparse pulse detection module 103 is built from a combination of simple logic gates, which is sufficient to realize valid-index detection for sparse pulses.
In one embodiment, as shown in FIG. 6, FIG. 6 shows the structure of the compression weight calculation module 104, which includes a row offset module (Row Offset Module, ROM), a column index Delta coding module (Column Delta Module, CDM), a non-zero weight module (Nonzero Weight Value Module, NWVM), and an array of PEs (Processing Elements). Each PE unit computes a membrane potential increment, i.e. the dot product of the pulse input signal and the synaptic weights, i.e. the term x_i·w_ij in formula (1). FIG. 6 illustrates a 1024×256 synaptic crossbar array, where 1024 is the number of fan-ins (i.e. 1024 axons) and 256 is the number of hardware neurons; that is, each hardware neuron has 1024 fan-ins, but not every fan-in carries a valid pulse signal. The valid pulse index passed from the sparse pulse detection module activates the corresponding row; the Row Offset Module (ROM) then computes the corresponding row offset, the column index Delta coding module (CDM) reads the column indices of the non-zero values of that row according to the row offset, and the non-zero weight module (NWVM) adds the corresponding non-zero values to the membrane potentials according to the column indices. When all valid pulses in the input pulse sequence have been processed, all accumulated membrane potential values are sent to the leaky integrate-and-fire neuron dynamics module for the LIF operation.
As shown in FIG. 2, the effective weight matrix is represented by three parameters: Row Offsets, Column Indexes and non-zero Values. Accordingly, to restore the effective weight matrix from these three parameters, the accelerator provides a dedicated memory for each of them, as shown in FIG. 7: a row offset memory module (Row Offset Module, ROM), a column index Delta-encoded memory module (Column Delta Module, CDM) and a non-zero value memory module (Nonzero Weight Value Module, NWVM). To improve pipelining, the row offset memory is further divided into an even row offset memory (Even ROM) and an odd row offset memory (Odd ROM). As shown in FIG. 7, for an input pulse pointing to row i, the CSR decoder reads the row offset of row i and the row offset of row i+1 from the Even ROM and the Odd ROM respectively, and subtracts them to obtain the number of non-zero values corresponding to the row-i pulse, which in turn gives the memory addresses and the number of entries to read from the column index Delta-encoded memory module (CDM) and the non-zero value memory module (NWVM). Since only one row offset access is required for the pulse of row i, whereas more than one column index and weight access may be required for that row, the two kinds of data structure would have mismatched pipeline throughput; to avoid inefficient, uneven pipeline stalls, the circuit decouples the CSR decoder into a row offset operation unit and a column index weight calculation unit, which are synchronized through FIFO (First In First Out) memories, as shown in FIG. 7.
The circuit structure of the column index weight calculation unit is shown in FIG. 8. After parsing the row offset RO_i of row i and the row offset RO_{i+1} of row i+1, the start address Addr_i and end address Addr_{i+1} for accessing the column index Delta-encoded memory module (CDM) and the non-zero value memory module (NWVM) are obtained. While the MAE output signal is active, in each clock cycle the Delta-encoded column index ΔCI_j and the non-zero weight (WV) are read, starting from Addr_i, from the column index Delta-encoded memory module (CDM) and the non-zero value memory module (NWVM) respectively and passed to the subsequent circuit units (i.e. the corresponding PE units). When Addr_i exceeds Addr_{i+1}, the column index weight calculation unit finishes processing row i and fetches the row offsets of a new row from the row offset memory module.
As shown in FIG. 9, the Delta-encoded column indices (ΔCI_j) read from the column index Delta-encoded memory module (CDM) in each clock cycle are decoded into column indices (CI_j) by an adder chain. Because ΔCIs that do not belong to the current row may also be read from the CDM in a clock cycle, these are filtered out by a MUX (multiplexer), giving the filtered Delta codes ΔCI'. If the current clock cycle is the first cycle of the row operation, ΔCI'_0 becomes CI_0 of the current cycle; if it is not the first cycle, ΔCI'_0 is added onto CI_31 of the previous cycle. The other indices are obtained as CI_j = CI_{j-1} + ΔCI'_j.
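The following sketch mimics the adder-chain decoding of one clock cycle's worth of Delta-encoded column indices, with the MUX filtering modelled as a validity mask; a filtered-out code contributes a zero increment, and its duplicated index would later be discarded by the weight mask. Names and the per-cycle group size are illustrative assumptions.

```python
def decode_column_indices(delta_codes, valid_mask, last_ci_prev_cycle=None):
    """Adder-chain style decoding of one cycle of Delta-encoded column indices.
    Codes whose mask bit is 0 do not belong to the current row and are treated
    as a zero increment, mirroring the MUX filtering."""
    filtered = [d if m else 0 for d, m in zip(delta_codes, valid_mask)]
    indices = []
    for j, d in enumerate(filtered):
        if j == 0:
            # first element of the cycle: CI_0 of the row, or carried on from CI_31
            base = 0 if last_ci_prev_cycle is None else last_ci_prev_cycle
            indices.append(base + d)
        else:
            indices.append(indices[j - 1] + d)   # CI_j = CI_{j-1} + delta'_j
    return indices

# Example: Deltas 2, 3, 1, 5 decode to column indices 2, 5, 6, 11.
print(decode_column_indices([2, 3, 1, 5], [1, 1, 1, 1]))   # [2, 5, 6, 11]
```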
The structure of the PE array is shown in FIG. 10, which shows a PE array of 32 PE units. Each PE unit contains one 16-bit adder and two MUXes (multiplexers), and the accumulated membrane potentials of all dendrites are fed to MUX1 (the many-to-one multiplexer) of every PE. After 32 pairs of CI_j (column indices) and WV_j (non-zero weights) are obtained, the weight distributor supplies each pair, i.e. the j-th column index and the j-th non-zero weight, as the inputs of the corresponding PE_j: CI_j acts as the select signal of MUX1, so that the accumulated membrane potential of the neuron addressed by the j-th column index becomes one operand of the adder; WV_j is fed to one of the two inputs of MUX2 and, after filtering by the weight mask, the valid WV_j becomes the other operand of the adder. The adder output updates that neuron's accumulated membrane potential, and the updated value is used as the accumulated value in the next cycle.
The weight mask filters out invalid non-zero weights WV to avoid computation errors. Invalid non-zero weights WV appear in the computation for two reasons: (1) when the NWVM is accessed, weights that sit at the same NWVM memory address but do not belong to the current row are read out together as non-zero weights WV; (2) to realize fan-in extension, the 128 columns of the weight matrix are divided into several shared fan-in dendrite clusters, and when the NWVM is accessed, weights of dendrites that do not belong to the shared fan-in dendrite cluster receiving the current input pulse vector are also read out as non-zero weights WV.
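The following sketch shows, at a behavioural level, how the PE array accumulates the mask-approved non-zero weights onto the membrane potentials; the function name, data types and example values are illustrative assumptions.

```python
import numpy as np

def pe_array_update(v_mem, col_indices, weights, weight_mask):
    """One cycle of the PE array: each (column index, non-zero weight) pair is
    routed to a PE; the weight mask suppresses weights that were read out of the
    NWVM but do not belong to the current row or dendrite cluster."""
    for ci, wv, valid in zip(col_indices, weights, weight_mask):
        if valid:                      # MUX2 passes only mask-approved weights
            v_mem[ci] += wv            # adder: membrane potential + non-zero weight
    return v_mem

v = np.zeros(8, dtype=np.int32)
# The third pair is masked out, e.g. it belongs to another row at the same address.
pe_array_update(v, col_indices=[2, 5, 6], weights=[7, -3, 9], weight_mask=[1, 1, 0])
print(v)   # [ 0  0  7  0  0 -3  0  0]
```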
In this embodiment, providing the odd row offset module and the even row offset module further improves the degree of pipelining.
The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which a person of ordinary skill in the art can understand and implement without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (10)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310410779.3A CN116663626B (en) | 2023-04-17 | 2023-04-17 | Sparse pulse neural network accelerator based on ping-pong architecture |
| PCT/CN2023/121949 WO2024216857A1 (en) | 2023-04-17 | 2023-09-27 | Sparse spiking neural network accelerator based on ping-pong architecture |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310410779.3A CN116663626B (en) | 2023-04-17 | 2023-04-17 | Sparse pulse neural network accelerator based on ping-pong architecture |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116663626A true CN116663626A (en) | 2023-08-29 |
| CN116663626B CN116663626B (en) | 2025-11-07 |
Family
ID=87721335
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310410779.3A Active CN116663626B (en) | 2023-04-17 | 2023-04-17 | Sparse pulse neural network accelerator based on ping-pong architecture |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN116663626B (en) |
| WO (1) | WO2024216857A1 (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118798276A (en) * | 2024-09-11 | 2024-10-18 | 电子科技大学 | A block-by-block vector-zero-value sparsity-aware convolutional neural network accelerator |
| WO2024216857A1 (en) * | 2023-04-17 | 2024-10-24 | 北京大学 | Sparse spiking neural network accelerator based on ping-pong architecture |
| CN119358606A (en) * | 2024-09-25 | 2025-01-24 | 鹏城实验室 | Pulse neural network weight gradient calculation method and related equipment |
| CN120874919A (en) * | 2025-09-28 | 2025-10-31 | 山东云海国创云计算装备产业创新中心有限公司 | Integrated deposit and calculation device, deposit and calculation method, processing device, tile module and accelerator |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119474626B (en) * | 2024-11-05 | 2025-09-26 | 南京大学 | Multifunctional linear convolution accelerator |
| CN119719720B (en) * | 2024-12-24 | 2025-10-03 | 广东工业大学 | A computing system for spiking recurrent neural networks |
| CN119721153B (en) * | 2025-03-04 | 2025-05-23 | 浪潮电子信息产业股份有限公司 | A reinforcement learning accelerator, acceleration method and electronic device |
| CN120234515B (en) * | 2025-05-29 | 2025-08-15 | 兰州大学 | Event-driven CSR parser supporting multiple computing modes |
| CN120255642B (en) * | 2025-06-06 | 2025-09-19 | 南京大学 | Sensing and computing integrated system and method based on SPAD and pulse neural network |
| CN120562489B (en) * | 2025-07-31 | 2025-09-19 | 苏州元脑智能科技有限公司 | Convolution acceleration method, device, equipment and medium for pulse neural network |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105049056A (en) * | 2015-08-07 | 2015-11-11 | 杭州国芯科技股份有限公司 | One-hot code detection circuit |
| CN107239823A (en) * | 2016-08-12 | 2017-10-10 | 北京深鉴科技有限公司 | A kind of apparatus and method for realizing sparse neural network |
| US20180157969A1 (en) * | 2016-12-05 | 2018-06-07 | Beijing Deephi Technology Co., Ltd. | Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network |
| CN111445013A (en) * | 2020-04-28 | 2020-07-24 | 南京大学 | Non-zero detector for convolutional neural network and method thereof |
| CN112732222A (en) * | 2021-01-08 | 2021-04-30 | 苏州浪潮智能科技有限公司 | Sparse matrix accelerated calculation method, device, equipment and medium |
| CN113537488A (en) * | 2021-06-29 | 2021-10-22 | 杭州电子科技大学 | Neural network accelerator based on sparse vector matrix calculation and acceleration method |
| CN114860192A (en) * | 2022-06-16 | 2022-08-05 | 中山大学 | FPGA-based sparse dense matrix multiplication array with high multiplier utilization rate for graph neural network |
| CN115440226A (en) * | 2022-09-20 | 2022-12-06 | 南京大学 | A low-power system for speech keyword recognition based on spiking neural network |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115169523A (en) * | 2021-04-02 | 2022-10-11 | 华为技术有限公司 | Impulse neural network circuit and computing method based on impulse neural network |
| CN116663626B (en) * | 2023-04-17 | 2025-11-07 | 北京大学 | Sparse pulse neural network accelerator based on ping-pong architecture |
-
2023
- 2023-04-17 CN CN202310410779.3A patent/CN116663626B/en active Active
- 2023-09-27 WO PCT/CN2023/121949 patent/WO2024216857A1/en active Pending
Patent Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105049056A (en) * | 2015-08-07 | 2015-11-11 | 杭州国芯科技股份有限公司 | One-hot code detection circuit |
| CN107239823A (en) * | 2016-08-12 | 2017-10-10 | 北京深鉴科技有限公司 | A kind of apparatus and method for realizing sparse neural network |
| US20180046895A1 (en) * | 2016-08-12 | 2018-02-15 | DeePhi Technology Co., Ltd. | Device and method for implementing a sparse neural network |
| US20180157969A1 (en) * | 2016-12-05 | 2018-06-07 | Beijing Deephi Technology Co., Ltd. | Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network |
| CN111445013A (en) * | 2020-04-28 | 2020-07-24 | 南京大学 | Non-zero detector for convolutional neural network and method thereof |
| CN112732222A (en) * | 2021-01-08 | 2021-04-30 | 苏州浪潮智能科技有限公司 | Sparse matrix accelerated calculation method, device, equipment and medium |
| CN113537488A (en) * | 2021-06-29 | 2021-10-22 | 杭州电子科技大学 | Neural network accelerator based on sparse vector matrix calculation and acceleration method |
| CN114860192A (en) * | 2022-06-16 | 2022-08-05 | 中山大学 | FPGA-based sparse dense matrix multiplication array with high multiplier utilization rate for graph neural network |
| CN115440226A (en) * | 2022-09-20 | 2022-12-06 | 南京大学 | A low-power system for speech keyword recognition based on spiking neural network |
Non-Patent Citations (3)
| Title |
|---|
| YISONG KUANG等: "An Event-driven Spiking Neural Network Accelerator with On-chip Sparse Weight", 《 2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS)》, 11 November 2022 (2022-11-11), pages 3468 - 3472 * |
| YISONG KUANG等: "ESSA: Design of a Programmable Efficient Sparse Spiking Neural Network Accelerator", 《 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS》, vol. 30, no. 11, 11 November 2022 (2022-11-11), pages 1631 - 1641, XP011924800, DOI: 10.1109/TVLSI.2022.3183126 * |
| 余成宇等: "一种高效的稀疏卷积神经网络加速器的设计与实现", 《智能系统学报》, vol. 15, no. 2, 31 March 2020 (2020-03-31), pages 323 - 333 * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024216857A1 (en) * | 2023-04-17 | 2024-10-24 | 北京大学 | Sparse spiking neural network accelerator based on ping-pong architecture |
| CN118798276A (en) * | 2024-09-11 | 2024-10-18 | 电子科技大学 | A block-by-block vector-zero-value sparsity-aware convolutional neural network accelerator |
| CN119358606A (en) * | 2024-09-25 | 2025-01-24 | 鹏城实验室 | Pulse neural network weight gradient calculation method and related equipment |
| CN119358606B (en) * | 2024-09-25 | 2025-11-04 | 鹏城实验室 | Methods and related equipment for calculating weight gradients in spiking neural networks |
| CN120874919A (en) * | 2025-09-28 | 2025-10-31 | 山东云海国创云计算装备产业创新中心有限公司 | Integrated deposit and calculation device, deposit and calculation method, processing device, tile module and accelerator |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116663626B (en) | 2025-11-07 |
| WO2024216857A1 (en) | 2024-10-24 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN116663626B (en) | Sparse pulse neural network accelerator based on ping-pong architecture | |
| Kalchbrenner et al. | Efficient neural audio synthesis | |
| JP7366484B2 (en) | Quantum error correction decoding system, method, fault tolerant quantum error correction system and chip | |
| CN111783973B (en) | Nerve morphology processor and equipment for liquid state machine calculation | |
| Evans et al. | JPEG-ACT: Accelerating deep learning via transform-based lossy compression | |
| Wang et al. | Learning efficient binarized object detectors with information compression | |
| Kadetotad et al. | Efficient memory compression in deep neural networks using coarse-grain sparsification for speech applications | |
| CN112884149B (en) | Random sensitivity ST-SM-based deep neural network pruning method and system | |
| Wang et al. | LSMCore: a 69k-synapse/mm 2 single-core digital neuromorphic processor for liquid state machine | |
| Qin et al. | Diagonalwise refactorization: An efficient training method for depthwise convolutions | |
| CN115022637B (en) | Image encoding method, image decompression method and device | |
| CN113962371B (en) | An image recognition method and system based on a brain-like computing platform | |
| CN116663627A (en) | Digital neuromorphic computing processor and computing method | |
| Liu et al. | Spiking-diffusion: Vector quantized discrete diffusion model with spiking neural networks | |
| CN112598119B (en) | An On-Chip Memory Compression Method for Neuromorphic Processors for Liquid State Machines | |
| Lee et al. | TT-SNN: tensor train decomposition for efficient spiking neural network training | |
| Wu et al. | A 3.89-GOPS/mW scalable recurrent neural network processor with improved efficiency on memory and computation | |
| CN109495113A (en) | A kind of compression method and device of EEG signals | |
| Dang et al. | An efficient software-hardware design framework for spiking neural network systems | |
| Guo et al. | A high-efficiency FPGA-based accelerator for binarized neural network | |
| Watkins et al. | Image data compression and noisy channel error correction using deep neural network | |
| CN113222160A (en) | Quantum state conversion method and device | |
| CN115170916B (en) | Image reconstruction method and system based on multi-scale feature fusion | |
| Li et al. | An efficient sparse lstm accelerator on embedded fpgas with bandwidth-oriented pruning | |
| Chen et al. | An Edge Neuromorphic Processor With High-Accuracy On-Chip Aggregate-Label Learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |