
CN116663626A - Sparse Spiking Neural Network Accelerator Based on Ping-Pong Architecture - Google Patents

Sparse Spiking Neural Network Accelerator Based on Ping-Pong Architecture Download PDF

Info

Publication number
CN116663626A
CN116663626A
Authority
CN
China
Prior art keywords
pulse
module
weight
sparse
ping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310410779.3A
Other languages
Chinese (zh)
Other versions
CN116663626B (en)
Inventor
王源
王梓霖
钟毅
崔小欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202310410779.3A priority Critical patent/CN116663626B/en
Publication of CN116663626A publication Critical patent/CN116663626A/en
Priority to PCT/CN2023/121949 priority patent/WO2024216857A1/en
Application granted granted Critical
Publication of CN116663626B publication Critical patent/CN116663626B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)
  • Manipulation Of Pulses (AREA)
  • Particle Accelerators (AREA)

Abstract

The application provides a sparse spiking neural network accelerator based on a ping-pong architecture. Compressed weight values are transmitted to a compressed weight calculation module, and a sparse pulse detection module extracts valid pulse indices from the pulse input signal, so that not every bit of the subsequent pulse signal has to take part in the computation and the amount of computation is reduced. According to the valid pulse indices, the compressed weight calculation module accumulates the non-zero values among the compressed weights onto the membrane potentials of the neurons, which finally determines whether a pulse is fired. Compared with a conventional synaptic crossbar array in which every synapse is activated and takes part in the computation, only the synaptic weights corresponding to the valid pulse indices are activated while the remaining synapses stay idle, which reduces the amount of computation, lowers the operating power consumption of the whole chip, and improves the operating speed, energy efficiency and area efficiency of the spiking neural network.

Description

Sparse spiking neural network accelerator based on a ping-pong architecture
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a sparse spiking neural network accelerator based on a ping-pong architecture.
Background
Spiking neural networks (SNNs), with their low power consumption and high concurrency, promise to improve machine computing power; they are regarded as a highly promising computing paradigm and as the future of artificial intelligence research.
Because the signaling mechanism of the neurons in a spiking neural network does not fit the conventional von Neumann computer architecture, dedicated hardware accelerators are needed to run spiking neural networks. Current neuromorphic accelerators usually store the synaptic connection matrix of an SNN model directly in a regularly structured, fixed-size synaptic crossbar array, so every synapse takes part in the computation regardless of how sparse the connections are; this increases the amount of computation, and the spatial sparsity of the SNN model is not reflected in the accelerator. Another accelerator design buffers the input pulses as a bitmap, which forces the hardware to check the validity of every bit of the input pulse vector; this increases the computation time, so the temporal sparsity of the pulse signals cannot be exploited to raise the operating speed of the hardware.
As a result, current neural network accelerators cannot fully exploit the potential performance advantages of SNN models: their power consumption is not low enough, their operation is not fast enough, and the energy efficiency of running SNN models is not high enough.
Disclosure of Invention
The application provides a sparse spiking neural network accelerator based on a ping-pong architecture, which addresses the drawbacks of prior-art spiking neural network accelerators, namely the large chip power consumption and large amount of computation caused by every synapse, or every symbol of the input pulse signal, taking part in the computation, and which enables low-power, low-latency operation of spiking neural networks.
The application provides a sparse spiking neural network accelerator based on a ping-pong architecture, comprising a pulse input interface, a weight and neuron parameter input interface, a sparse pulse detection module, a compressed weight calculation module and a leaky integrate-and-fire module; wherein
the pulse input interface is used for receiving a pulse input signal and inputting the pulse input signal to the sparse pulse detection module;
the weight and neuron parameter input interface is used for receiving compressed weight values and inputting the compressed weight values to the compressed weight calculation module;
the sparse pulse detection module is used for extracting valid pulse indices from the pulse input signal, a valid pulse index characterizing the position of a non-zero value in the pulse input signal;
the compressed weight calculation module is used for decompressing the compressed weight values according to the valid pulse indices to obtain an effective weight matrix, computing the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment of each neuron, and using the membrane potential increment of each neuron to update the accumulated membrane potential corresponding to that neuron;
the leaky integrate-and-fire module is used for judging the magnitude relation between the updated accumulated membrane potential and a preset threshold and determining, according to that relation, the output pulse result corresponding to each neuron.
The sparse spiking neural network accelerator based on the ping-pong architecture provided by the application further comprises a pulse buffer module group, the pulse buffer module group comprising a first pulse buffer module and a second pulse buffer module;
the pulse buffer module group is used for controlling the read and write states of the first and second pulse buffer modules in each buffer cycle in a ping-pong switching manner, so that in each buffer cycle one pulse buffer module is in the read state while the other is in the write state.
The sparse spiking neural network accelerator based on the ping-pong architecture provided by the application further comprises a weight buffer module group, the weight buffer module group comprising a first weight buffer module and a second weight buffer module;
the weight buffer module group is used for controlling the read and write states of the first and second weight buffer modules in each buffer cycle in a ping-pong switching manner, so that in each buffer cycle one weight buffer module is in the read state while the other is in the write state.
The sparse spiking neural network accelerator based on the ping-pong architecture provided by the application further comprises a neuron parameter buffer module group, the neuron parameter buffer module group comprising a first neuron parameter buffer module and a second neuron parameter buffer module;
the neuron parameter buffer module group is used for controlling the read and write states of the first and second neuron parameter buffer modules in each buffer cycle in a ping-pong switching manner, so that in each buffer cycle one neuron parameter buffer module is in the read state while the other is in the write state.
According to the sparse spiking neural network accelerator based on the ping-pong architecture provided by the application, the sparse pulse detection module is further used for dividing the pulse input sequence corresponding to the pulse input signal into several groups of subsequences;
each group of subsequences is in turn ORed bitwise with itself to obtain a bitwise OR result; if the bitwise OR result is all zeros, the processing of the current group of subsequences ends;
if the bitwise OR result is not all zeros, the bitwise OR result is taken as the current sequence under detection and several rounds of detection are performed on it. In each round, 1 is subtracted from the current sequence under detection to obtain a difference; a bitwise AND of the difference and the current sequence under detection gives a bitwise AND result; a bitwise XOR of the bitwise AND result and the current sequence under detection gives a valid pulse one-hot code; the valid pulse one-hot code is converted to binary to obtain a valid pulse index. It is then judged whether the bitwise AND result is all zeros: if so, detection of the current sequence ends and the procedure returns to the step of ORing each group of subsequences bitwise with itself to obtain a bitwise OR result; if not, the bitwise AND result becomes the current sequence under detection and the procedure returns to the step of subtracting 1 from it to obtain a difference.
According to the sparse spiking neural network accelerator based on the ping-pong architecture provided by the application, the compressed weight calculation module comprises a row offset operation unit, a column index weight operation unit, a row offset storage module, a column index Delta encoding storage module, a non-zero weight storage module, a weight allocator and a processing unit array; wherein
the row offset operation unit is used for reading from the row offset storage module the row offset of the current row and the row offset of the next row adjacent to the current row;
the column index weight operation unit is used for parsing the row offset of the current row and the row offset of the next row adjacent to the current row to obtain a start address and an end address within the column index Delta encoding storage module and the non-zero weight storage module, and for obtaining, according to the start address and the end address, the column index Delta codes and the non-zero weights from the column index Delta encoding storage module and the non-zero weight storage module respectively;
the weight allocator comprises an adder chain formed from a preset number of adders, the adder chain being used to feed the non-zero weights and the Delta codes to the processing unit array as its inputs;
the processing unit array comprises a preset number of processing units, each processing unit performing its operation with an adder and updating the accumulated membrane potential according to the operation result of the adder.
According to the sparse spiking neural network accelerator based on the ping-pong architecture provided by the application, the column index weight operation unit is further used for ending the processing of the current row when the start address exceeds the end address.
According to the sparse spiking neural network accelerator based on the ping-pong architecture provided by the application, each processing unit further comprises a weight mask generation module, a multiplexer and an adder;
the weight mask generation module is used for generating a weight mask according to the weight distribution state;
the processing unit is further configured to feed the weight mask and the non-zero weight to the multiplexer, so that the multiplexer filters them according to the weight mask and yields an effective non-zero value, and to feed the effective non-zero value and the accumulated membrane potential to the adder, the adder outputting the updated accumulated membrane potential.
According to the sparse spiking neural network accelerator based on the ping-pong architecture provided by the application, the leaky integrate-and-fire module is further used for adding a preset leak value to the updated accumulated membrane potential to obtain a leaky integration value; if the leaky integration value is greater than the preset threshold, the output pulse result is determined to be that a pulse signal is fired; if the leaky integration value is less than or equal to the preset threshold, the output pulse result is determined to be that no pulse is fired.
The sparse spiking neural network accelerator based on the ping-pong architecture provided by the application further comprises a pulse output interface;
the leaky integrate-and-fire module is further configured, when it judges that the updated accumulated membrane potential is greater than the preset threshold, to send the updated accumulated membrane potential to the pulse output interface and to reset the accumulated membrane potential.
In the sparse spiking neural network accelerator based on the ping-pong architecture provided by the application, the compressed weight values are transmitted to the compressed weight calculation module, and the sparse pulse detection module extracts the valid pulse indices from the pulse input signal, so that not every bit of the subsequent pulse signal has to take part in the computation and the amount of computation is reduced; the compressed weight calculation module accumulates the non-zero values among the compressed weights onto the membrane potentials of the neurons according to the valid pulse indices and finally decides whether a pulse is fired. Compared with a conventional synaptic crossbar array in which every synapse is activated and takes part in the computation, the application activates only the synaptic weights corresponding to the valid pulse indices while the remaining synapses stay idle, which reduces the amount of computation, lowers the operating power consumption of the whole chip, and improves the operating speed, energy efficiency and area efficiency of the spiking neural network.
Drawings
In order to illustrate the application or the technical solutions of the prior art more clearly, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. The drawings described below are clearly only some embodiments of the application, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of the functional modules of the sparse spiking neural network accelerator based on a ping-pong architecture provided by the application;
Fig. 2 is a schematic diagram of the matrix elements of a compressed sparse row matrix provided by the application;
Fig. 3 is a schematic diagram of the LIF neuron provided by the application;
Fig. 4 is a schematic flow chart of the ping-pong operation method provided by the application;
Fig. 5 is a schematic diagram of the hardware circuit of the sparse pulse detection module provided by the application;
Fig. 6 is a schematic structural diagram of the compressed weight calculation module provided by the application;
Fig. 7 is a schematic diagram of the functional blocks of the compressed weight calculation module provided by the application;
Fig. 8 is a schematic diagram of the hardware circuit of the column index weight operation unit provided by the application;
Fig. 9 is a schematic diagram of the adder chain decoding provided by the application;
Fig. 10 is a schematic structural diagram of the PE array provided by the application.
Detailed Description
In order to make the objects, technical solutions and advantages of the application clearer, the technical solutions of the application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are clearly only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the application.
Specific embodiments of the application are described below with reference to Figs. 1-10.
The functional modules of the sparse spiking neural network accelerator based on a ping-pong architecture provided by an embodiment of the application are shown in Fig. 1; the accelerator comprises a pulse input interface 101, a weight and neuron parameter input interface 102, a sparse pulse detection module 103, a compressed weight calculation module 104 and a leaky integrate-and-fire module 105; wherein
a pulse input interface 101, used for receiving a pulse input signal and inputting the pulse input signal to the sparse pulse detection module 103;
Specifically, the pulse input signal is transmitted through the pulse input interface 101 into the sparse pulse detection module 103.
A weight and neuron parameter input interface 102, used for receiving compressed weight values and inputting the compressed weight values to the compressed weight calculation module 104;
A compressed weight value is a weight value stored in a sparse-matrix storage format. Because a spiking neural network model contains a very large number of weight parameters, a sparse storage format saves a large amount of storage space and speeds up the computation. The sparse matrix format used in the application is the CSR (Compressed Sparse Row) format.
Specifically, the compressed weight values stored in CSR format are input to the compressed weight calculation module 104 through the weight and neuron parameter input interface 102.
A sparse pulse detection module 103, configured to extract valid pulse indices from the pulse input signal; a valid pulse index characterizes the position of a non-zero value in the pulse input signal.
The pulse input signal is a sparse vector/tensor (sparse meaning that it contains a large number of zeros) that has to be multiplied by the synaptic weights during the computation, which means that most synaptic weights would be multiplied by 0 and the result would still be 0. To reduce the overall power consumption of the chip, the application activates only the neurons in the synaptic crossbar array that actually need to be activated, i.e. the neurons corresponding to the positions of the non-zero values in the pulse input signal. To determine which neurons need to be activated, the application uses the sparse pulse detection module 103 to extract the valid pulse indices from the pulse input signal, a valid pulse index representing the position of a non-zero value in the pulse input signal.
The compressed weight calculation module 104 is configured to decompress the compressed weight values received from outside the chip according to the valid pulse indices to obtain an effective weight matrix; to compute the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment of each neuron; and to use the membrane potential increment of each neuron to update the accumulated membrane potential corresponding to that neuron.
The effective weight matrix, shown as the left-hand matrix in Fig. 2, is the matrix restored from the compressed weights according to the valid pulse indices. The compressed weight values are shown as the "non-zero values" in Fig. 2; every element of the compressed weight values is a non-zero value of the effective weight matrix. Because on-chip storage is limited while a neural network often has a huge number of parameters and weights, the application stores the weights in CSR (Compressed Sparse Row) format to save storage space. The CSR format represents a sparse matrix (i.e. the effective weight matrix) with three one-dimensional arrays: row offsets, column indices and non-zero values. The row offset array stores the accumulated count of non-zero values of the rows above each row: the m-th element of the row offset array is the number of non-zero values above row m of the effective weight matrix. For example, the element '4' of the row offset array in Fig. 2 indicates that there are 4 non-zero values above row 2 of the effective weight matrix (rows are numbered from 0), namely 1, 7, 2 and 8. Subtracting two adjacent elements of the row offset array therefore gives the number of non-zero values in each row: for row i of the effective weight matrix, only the i-th and (i+1)-th elements of the row offset array have to be read and subtracted. The column index array stores the column index of each element of the non-zero value array; for example, the column index corresponding to '7' in the non-zero value array is 1, indicating that 7 is located in column 1 of the effective weight matrix (columns are also numbered from 0).
Specifically, the compressed weight calculation module 104 decompresses the compressed weight values stored in CSR format to obtain the effective weight matrix, computes the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment of each neuron, and adds the membrane potential increment of each neuron to the accumulated membrane potential of that neuron to obtain the updated accumulated membrane potential.
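The following Python sketch models this CSR lookup. It is a minimal illustration only; the array names are stand-ins for the on-chip storage modules, and the example numbers are invented so as to be consistent with the Fig. 2 narrative (rows 0 and 1 together holding the non-zeros 1, 7, 2 and 8), not the actual contents of the figure.

```python
# Minimal software model of the CSR (compressed sparse row) lookup described above.

def row_nonzeros(row_offset, col_index, values, i):
    """Return the (column, weight) pairs of row i of the effective weight matrix."""
    start, end = row_offset[i], row_offset[i + 1]   # addresses of row i's non-zero entries
    return list(zip(col_index[start:end], values[start:end]))

row_offset = [0, 2, 4, 6]        # element m = number of non-zeros above row m
col_index  = [0, 1, 1, 2, 0, 3]  # column of each non-zero value (columns numbered from 0)
values     = [1, 7, 2, 8, 5, 9]  # the non-zero weights themselves

print(row_nonzeros(row_offset, col_index, values, 1))   # row 1 -> [(1, 2), (2, 8)]
```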
The leaky integrate-and-fire module 105 is configured to judge the magnitude relation between the updated accumulated membrane potential and a preset threshold, and to determine the output pulse result corresponding to each neuron according to that relation.
Specifically, Fig. 3 shows the structure of the spiking neural network algorithm used in the application. The neurons of the spiking neural network supported by the application follow a linear LIF (leaky integrate-and-fire) model: after the input pulse sequence has been accumulated, linear leakage, threshold comparison and pulse emission are performed. The membrane potential dynamics of the LIF neuron model are:
V_j(t+1) = V_j(t) + Σ_i x_i · w_ij + λ_j    (1)
where V_j(t) is the accumulated membrane potential of neuron j at time t, w_ij is the i-th synaptic weight corresponding to neuron j, x_i is the pulse input signal value of the i-th synapse, and λ_j is the linear leak of neuron j.
The leaky integrate-and-fire module 105 judges the magnitude relation between the updated accumulated membrane potential V_j(t+1) and a preset threshold; if it is greater than the preset threshold, a pulse signal is emitted and the accumulated membrane potential of the neuron is reset.
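A minimal software sketch of one time step of this behavior is given below, combining the accumulation of Eq. (1) with the threshold comparison and reset just described. The function signature, the CSR triple and the use of a single scalar leak are simplifying assumptions made for illustration, not the accelerator's interface.

```python
import numpy as np

def lif_step(v, active_rows, weights_csr, threshold, leak):
    """One time step of the linear LIF dynamics of Eq. (1) for a group of neurons.

    v           -- accumulated membrane potentials, one entry per neuron (numpy array)
    active_rows -- valid pulse indices (rows i with x_i = 1)
    weights_csr -- (row_offset, col_index, values) of the effective weight matrix
    """
    row_offset, col_index, values = weights_csr
    # Integration: only rows addressed by a valid pulse contribute, so the
    # multiply-by-zero of the dense formulation never has to be evaluated.
    for i in active_rows:
        for k in range(row_offset[i], row_offset[i + 1]):
            v[col_index[k]] += values[k]
    v = v + leak                    # linear leakage (scalar here; per-neuron λ_j in Eq. (1))
    spikes = v > threshold          # threshold comparison
    v[spikes] = 0                   # fired neurons reset their membrane potential
    return v, spikes
```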
In this embodiment, the compressed weight values are transmitted to the compressed weight calculation module, and the sparse pulse detection module extracts the valid pulse indices from the pulse input signal, so that not every bit of the subsequent pulse signal has to take part in the computation and the amount of computation is reduced; the compressed weight calculation module accumulates the non-zero values among the compressed weights onto the membrane potentials of the neurons according to the valid pulse indices and finally decides whether a pulse is fired. Compared with a conventional synaptic crossbar array in which every synapse is activated and takes part in the computation, the application activates only the synaptic weights corresponding to the valid pulse indices, which reduces the amount of computation, lowers the operating power consumption of the whole chip, and improves the operating speed, energy efficiency and area efficiency of the spiking neural network.
In an embodiment, as shown in Fig. 1, the sparse spiking neural network accelerator based on the ping-pong architecture further comprises a pulse buffer module group 106, the pulse buffer module group 106 comprising a first pulse buffer module and a second pulse buffer module; the pulse buffer module group 106 is used for controlling the read and write states of the first and second pulse buffer modules in each buffer cycle in a ping-pong switching manner, so that in each buffer cycle one pulse buffer module is in the read state while the other is in the write state.
Specifically, two pulse buffer RAMs (Random Access Memory) are provided in the sparse spiking neural network accelerator based on the ping-pong architecture. They are used for encoding the pulse input signals received from outside the chip into pulse input codes suitable for computation; during encoding, the system can update one RAM while the other RAM is being used for computation. Correspondingly, the output pulse codes also have to be decoded into output pulse signals; during decoding the same ping-pong scheme is used, updating one RAM while the other is being computed on.
In this embodiment, the pulse buffer module group provides ping-pong buffering while the pulse input signals are encoded and the pulse output signals are decoded, so that I/O operations and data processing overlap and the throughput of the accelerator is improved.
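As a rough software analogue of this discipline, which is shared by the pulse, weight and neuron parameter buffer groups described below, the sketch keeps two buffers and swaps their read/write roles on every buffer cycle. The class and method names are illustrative only.

```python
class PingPongBuffers:
    """Two RAMs whose read/write roles swap every buffer cycle (illustrative model)."""

    def __init__(self, depth):
        self.ram = [[0] * depth, [0] * depth]
        self.read_sel = 0                  # which RAM the compute core reads this cycle

    @property
    def read_ram(self):                    # read by the core computation unit
        return self.ram[self.read_sel]

    @property
    def write_ram(self):                   # refilled from off-chip at the same time
        return self.ram[1 - self.read_sel]

    def switch(self):                      # ping-pong switch at the end of the cycle
        self.read_sel = 1 - self.read_sel
```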
In an embodiment, as shown in Fig. 1, the sparse spiking neural network accelerator based on the ping-pong architecture further comprises a weight buffer module group 107, the weight buffer module group 107 comprising a first weight buffer module and a second weight buffer module; the weight buffer module group 107 is used for controlling the read and write states of the first and second weight buffer modules in each buffer cycle in a ping-pong switching manner, so that in each buffer cycle one weight buffer module is in the read state while the other is in the write state.
Specifically, two weight buffer RAMs are provided in the sparse spiking neural network accelerator based on the ping-pong architecture. They are used for decompressing the CSR-format compressed weight values received from outside the chip into the effective weight matrix; during decompression, the system can finish decompressing one RAM while updating the other.
In this embodiment, the weight buffer module group provides ping-pong buffering while the CSR-format compressed weight values are decompressed, so that I/O operations and data processing overlap and the throughput of the accelerator is improved.
In an embodiment, the accelerator further comprises a neuron parameter buffer module group 108; the neuron parameter buffer module group 108 comprises a first neuron parameter buffer module and a second neuron parameter buffer module; the neuron parameter buffer module group 108 is used for controlling the read and write states of the first and second neuron parameter buffer modules in each buffer cycle in a ping-pong switching manner, so that in each buffer cycle one neuron parameter buffer module is in the read state while the other is in the write state.
Specifically, two neuron parameter buffer RAMs are provided in the sparse spiking neural network accelerator based on the ping-pong architecture. They are used for decoding the neuron parameters received from outside the chip into neuron parameters suitable for computation; during decoding, the system can update one RAM while the other is being processed.
Further, the ping-pong operation method used in the application is controlled along three dimensions: how many time slots each time step of the input pulse contains, how many time steps each group of neurons has to compute, and how many groups each layer of the network is divided into.
As shown in Fig. 4, taking a fully connected spiking neural network with layer sizes 1024-512-256 (two weight layers) as an example, after power-up the accelerator first receives the data of pulse RAM#0, weight RAM#0 and neuron parameter RAM#0 from outside. When the first global synchronization command "sync_all" is received, weight RAM#1 and neuron parameter RAM#1 of the accelerator receive data from outside, the data of pulse RAM#0, weight RAM#0 and neuron parameter RAM#0 are sent to the core computation unit, and the computation result is written to pulse RAM#1; the accelerator is now computing the first 256 neurons of the first layer. When the second global synchronization command "sync_all" is received, weight RAM#0 and neuron parameter RAM#0 receive data from outside, the data of pulse RAM#0, weight RAM#1 and neuron parameter RAM#1 are sent to the core computation unit, and the computation result is written to pulse RAM#1; the accelerator is now computing the last 256 neurons of the first layer. When the third global synchronization command "sync_all" is received, weight RAM#1 and neuron parameter RAM#1 receive data from outside, the data of pulse RAM#1, weight RAM#0 and neuron parameter RAM#0 are sent to the core computation unit, and the computation result is written to pulse RAM#0 and sent outside the accelerator; the accelerator is now computing the 256 neurons of the second layer, which completes the whole computation.
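The control flow of this example can be restated as the sketch below: the weight and neuron parameter RAMs swap on every "sync_all", while the pulse RAMs swap once per layer (each layer reads its input spikes from one pulse RAM and writes its output spikes to the other). The function names, the stubs and the tuple describing the layers are placeholders rather than the accelerator's actual interface, and the initial filling of pulse RAM#0 at power-up is omitted.

```python
def prefetch_weights(into):      # placeholder for the off-chip fill of the idle RAMs
    pass

def sync_all():                  # placeholder for the global synchronization command
    pass

def compute_group(layer, group, spikes_from, weights_from, spikes_to):
    print(f"layer {layer}, group {group}: pulses RAM#{spikes_from} + "
          f"weights RAM#{weights_from} -> pulses RAM#{spikes_to}")

def run_schedule(layer_groups=((0, 2), (1, 1))):    # (layer index, number of 256-neuron groups)
    weight_buf = 0                                  # weight / neuron parameter RAM selector
    pulse_in, pulse_out = 0, 1                      # pulse RAM selectors
    prefetch_weights(into=weight_buf)               # fill weight / parameter RAM#0 after power-up
    for layer, n_groups in layer_groups:
        for group in range(n_groups):
            sync_all()                              # global synchronization command
            prefetch_weights(into=1 - weight_buf)   # refill the idle RAMs from outside ...
            compute_group(layer, group,             # ... while the core computes this group
                          spikes_from=pulse_in,
                          weights_from=weight_buf,
                          spikes_to=pulse_out)
            weight_buf = 1 - weight_buf             # weight / parameter RAMs swap on every sync
        pulse_in, pulse_out = pulse_out, pulse_in   # the next layer reads what this layer wrote

run_schedule()   # prints the three compute steps of the 1024-512-256 example
```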
In this embodiment, the neuron parameter buffer module group provides ping-pong buffering while the neuron parameters are decoded, so that I/O operations and data processing overlap and the throughput of the accelerator is improved.
In an embodiment, the sparse pulse detection module 103 is further used for dividing the pulse input sequence corresponding to the pulse input signal into several groups of subsequences;
each group of subsequences is in turn ORed bitwise with itself to obtain a bitwise OR result; if the bitwise OR result is all zeros, the processing of the current group of subsequences ends;
if the bitwise OR result is not all zeros, the bitwise OR result is taken as the current sequence under detection and several rounds of detection are performed on it. In each round, 1 is subtracted from the current sequence under detection to obtain a difference; a bitwise AND of the difference and the current sequence under detection gives a bitwise AND result; a bitwise XOR of the bitwise AND result and the current sequence under detection gives a valid pulse one-hot code; the valid pulse one-hot code is converted to binary to obtain a valid pulse index. It is then judged whether the bitwise AND result is all zeros: if so, detection of the current sequence ends and the procedure returns to the step of ORing each group of subsequences bitwise with itself; if not, the bitwise AND result becomes the current sequence under detection and the procedure returns to the step of subtracting 1 from it.
The pulse input sequence is the sequence obtained by encoding the pulse input signal. A valid pulse index is the position of a non-zero value in the pulse input sequence.
Specifically, Fig. 5 shows the internal circuit structure of the sparse pulse detection module 103, which consists of logic elements such as OR gates, multiplexers, D flip-flops, adders, AND gates and XOR gates. The sparse pulse detection module 103 is mainly responsible for extracting the valid pulse indices of the input pulse sequence. In order to reduce the critical-path delay, the application divides the input pulse sequence into groups of 64-bit subsequences; a 1024-bit input pulse sequence, for example, is divided into 16 groups. Each 64-bit subsequence is first ORed bitwise with itself to obtain a bitwise OR result; if the bitwise OR result is all zeros, all 64 bits of the subsequence are 0, the processing of this group of subsequences ends, and the computation for this 64-bit subsequence is skipped.
If the bitwise OR result is not all zeros, the positions of the non-zero values in the current group of subsequences have to be detected further. The detection proceeds as follows: the bitwise OR result is taken as the current sequence under detection and several rounds of detection are performed on it. In each round, 1 is subtracted from the current sequence under detection (the 64-bit subsequence in this example) to obtain a difference; a bitwise AND of the difference and the current sequence under detection gives a bitwise AND result; a bitwise XOR of the bitwise AND result and the value before the subtraction (i.e. the current sequence under detection) gives a valid pulse one-hot code. At the same time it is judged whether the bitwise AND result is all zeros: if so, detection of this group of subsequences ends and detection of the next group (the next 64-bit subsequence in this example) begins; if not, the bitwise AND result becomes the current sequence under detection and the procedure returns to the step of subtracting 1 from it. This is repeated until every group of subsequences has been examined, yielding a number of valid pulse one-hot codes; a decoding unit that converts one-hot codes to binary codes then produces the valid pulse indices, in binary form, corresponding to the pulse input signal.
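The loop above can be modeled in a few lines of Python; the hardware realizes the same steps with OR/AND/XOR gates, a subtractor and a one-hot-to-binary decoder. The 64-bit word width and 16-group split follow the example in the text; everything else in the sketch is illustrative.

```python
def valid_pulse_indices(pulse_words, word_bits=64):
    """pulse_words: the 16 integers holding the 64-bit slices of a 1024-bit input vector."""
    indices = []
    for g, word in enumerate(pulse_words):
        if word == 0:                          # bitwise OR of the word with itself is all zeros:
            continue                           # the whole 64-bit subsequence is skipped
        current = word
        while current:
            cleared = current & (current - 1)  # subtract 1, then bitwise AND
            one_hot = cleared ^ current        # bitwise XOR isolates the lowest set bit (one-hot)
            indices.append(g * word_bits + one_hot.bit_length() - 1)  # one-hot -> binary index
            current = cleared                  # repeat until the AND result is all zeros
    return indices

# Example: bits 3 and 70 of the 1024-bit input are set.
words = [0] * 16
words[0] |= 1 << 3
words[1] |= 1 << 6
print(valid_pulse_indices(words))              # [3, 70]
```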
In the above embodiment, the sparse pulse detection module 103 is built by combining a small number of logic gates, which makes valid-index detection of sparse pulses possible.
In an embodiment, Fig. 6 shows the structure of the compressed weight calculation module 104, which comprises a row offset module (Row Offset Module, ROM), a column index Delta encoding module (Column Delta Module, CDM), a non-zero weight module (Nonzero Weight Value Module, NWVM) and an array of PEs (Processing Elements). Each PE unit computes a membrane potential increment, i.e. the dot product of the pulse input signal and the synaptic weights, the term x_i · w_ij in formula (1). Fig. 6 illustrates a 1024×256 synaptic crossbar array, where 1024 is the number of fan-ins (i.e. 1024 axons) and 256 is the number of hardware neurons; each hardware neuron therefore has 1024 fan-ins, but not every fan-in carries a valid pulse signal. The valid pulse index delivered by the sparse pulse detection module activates the corresponding row; the Row Offset Module (ROM) then computes the corresponding row offset, the column index Delta encoding module (CDM) reads the column indices of the non-zero values of that row according to the row offset, and the non-zero weight module (NWVM) adds the corresponding non-zero values to the membrane potentials according to the column indices. When all valid pulses of the input pulse sequence have been processed, all accumulated membrane potentials are sent to the leaky integrate-and-fire module, which models the dynamic behavior of the neurons, for the LIF operation.
As shown in Fig. 2, the effective weight matrix is represented by three parameters: row offsets (Row Offset), column indices (Column Indices) and non-zero values (Value). To restore the effective weight matrix from these three parameters, the accelerator provides a storage module for each of them, as shown in Fig. 7: a row offset storage module (Row Offset Module, ROM), a column index Delta encoding storage module (Column Delta Module, CDM) and a non-zero value storage module (Nonzero Weight Value Module, NWVM). To advance the pipeline, the row offset storage module is further split into an even row offset storage module (Even ROM) and an odd row offset storage module (Odd ROM). As shown in Fig. 7, for an input pulse pointing at row i, the CSR decoder reads the row offset of row i and the row offset of row i+1 from the Even ROM and the Odd ROM respectively, and subtracts the two to obtain the number of non-zero values corresponding to the pulse of row i, and hence the memory addresses and the number of entries to be read from the column index Delta encoding storage module (CDM) and the non-zero value storage module (NWVM). Since the pulse of row i needs only one row offset operation, whereas row i may need more than one column index and weight operation, the two kinds of data structure are processed by pipelines with mismatched throughput. To avoid inefficient, uneven pipeline stalls, the circuit design decouples the CSR decoder into a row offset operation unit and a column index weight operation unit, which are synchronized with a FIFO (first-in, first-out) memory, as shown in Fig. 7.
The circuit structure of the column index weight operation unit is shown in Fig. 8. After the row offset RO_i of row i and the row offset RO_i+1 of row i+1 have been parsed, the start address Addr_i and the end address Addr_i+1 for accessing the column index Delta encoding storage module (CDM) and the non-zero value storage module (NWVM) are obtained. While the MAE output signal is active, in every clock cycle the column indices ΔCI_j in Delta encoding format and the non-zero weights (WV) are read, starting from Addr_i, from the column index Delta encoding storage module (CDM) and the non-zero value storage module (NWVM) respectively and passed to the subsequent circuit units (i.e. the corresponding PE units). When Addr_i exceeds Addr_i+1, the column index weight operation unit finishes the processing of row i and fetches the row offsets of a new row from the row offset storage module.
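A small sketch of how the (start, end) address pair can be derived from the split row offset storage is given below. The text does not spell out how the even and odd ROMs are indexed, so the index arithmetic here is an assumption made for illustration only.

```python
def row_address_range(even_rom, odd_rom, i):
    """Return (Addr_i, Addr_i+1), the CDM/NWVM address range of row i.

    even_rom holds the row offsets of even rows and odd_rom those of odd rows, so the
    two offsets needed for any row can be read in the same cycle (assumed indexing).
    """
    if i % 2 == 0:
        ro_i, ro_i1 = even_rom[i // 2], odd_rom[i // 2]
    else:
        ro_i, ro_i1 = odd_rom[i // 2], even_rom[(i + 1) // 2]
    return ro_i, ro_i1      # start and end addresses into the CDM and the NWVM
```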
As shown in Fig. 9, the Delta-encoded column indices (i.e. ΔCI_j) read from the column index Delta encoding storage module (CDM) in each clock cycle are decoded into column indices (i.e. CI_j) by an adder chain. Because ΔCI values that do not belong to the current row may be read out of the CDM in a given clock cycle, these are filtered out by a MUX (multiplexer), giving the filtered Delta codes ΔCI'. If the current clock cycle is the first cycle of the row operation, ΔCI'_0 becomes the CI_0 of the current cycle; if it is not the first cycle, ΔCI'_0 is added to the CI_31 of the previous cycle. For the remaining positions, CI_j = CI_j-1 + ΔCI'_j.
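A behavioral model of this adder chain is sketched below: the deltas read in one cycle are accumulated into absolute column indices, with deltas that the MUX has flagged as not belonging to the current row contributing zero. The function signature is illustrative, not the hardware interface.

```python
def decode_column_indices(delta_ci, belongs_to_row, prev_ci31=None):
    """delta_ci       -- the 32 ΔCI values read from the CDM in this clock cycle
    belongs_to_row -- True where the delta belongs to the current row (the MUX filter)
    prev_ci31      -- CI_31 of the previous cycle, or None in the first cycle of a row"""
    running = 0 if prev_ci31 is None else prev_ci31
    column_indices = []
    for delta, ok in zip(delta_ci, belongs_to_row):
        running += delta if ok else 0    # ΔCI' after filtering; CI_j = CI_j-1 + ΔCI'_j
        column_indices.append(running)
    return column_indices
```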
The structure of the PE array is shown in Fig. 10, which depicts a PE array of 32 PE units. Each PE unit contains one 16-bit adder and two MUXes (multiplexers), and the accumulated membrane potentials of all dendrites are fed to MUX1 (a many-to-one multiplexer) of each PE. After the 32 pairs of CI_j (column index) and WV_j (non-zero weight) have been obtained, the weight allocator applies each pair of CI_j (the j-th column index) and WV_j (the j-th non-zero weight) as the inputs of the corresponding PE_j: CI_j serves as the select signal of MUX1, so that the accumulated membrane potential of the neuron addressed by the j-th column index becomes one operand of the adder; WV_j is applied to one of the two inputs of MUX2, and the WV_j that remains valid after filtering by the weight mask becomes the other operand of the adder. The updated accumulated membrane potential is then written back and serves as the accumulated value for the next cycle.
The weight mask serves to filter out invalid non-zero weights WV and thus avoid computation errors. Invalid non-zero weights WV arise in the computation for two reasons: (1) when the NWVM is accessed, weights that sit at the same NWVM memory address but do not belong to the current row are read out together as non-zero weights WV; (2) to implement the fan-in extension function, the 128 columns of the weight matrix are divided into several dendrite clusters with shared fan-in, and when the NWVM is accessed, the weights of dendrites that do not belong to the shared-fan-in dendrite cluster receiving the current input pulse vector are read out together as non-zero weights WV.
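Putting the preceding paragraphs together, the per-cycle behavior of the 32 PE units can be modeled as below. This is a behavioral sketch only (in hardware each PE uses one 16-bit adder and two multiplexers), and the argument names are illustrative.

```python
def pe_array_cycle(v_mem, column_indices, weights, weight_mask):
    """One update cycle of the PE array.

    v_mem          -- accumulated membrane potentials of the neuron group (mutated in place)
    column_indices -- the 32 decoded CI_j values
    weights        -- the 32 non-zero weights WV_j read from the NWVM
    weight_mask    -- False where a WV_j is one of the invalid weights described above
    """
    for ci, wv, ok in zip(column_indices, weights, weight_mask):
        if not ok:          # MUX2 drops a masked weight, so the accumulator is unchanged
            continue
        v_mem[ci] += wv     # MUX1 selects the accumulator; the 16-bit adder updates it
    return v_mem
```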
In this embodiment, providing separate odd and even row offset modules further raises the degree of pipelining.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over several network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1.一种基于乒乓架构的稀疏脉冲神经网络加速器,其特征在于,包括脉冲输入接口、权值和神经元参数输入接口、稀疏脉冲检测模块、压缩权重计算模块、泄漏积分发放模块;其中,1. A sparse spiking neural network accelerator based on a ping-pong architecture, characterized in that it includes a pulse input interface, a weight and a neuron parameter input interface, a sparse pulse detection module, a compression weight calculation module, and a leakage integral distribution module; wherein, 所述脉冲输入接口,用于接收脉冲输入信号,并将所述脉冲输入信号输入至稀疏脉冲检测模块;The pulse input interface is used to receive a pulse input signal, and input the pulse input signal to a sparse pulse detection module; 所述权值和神经元参数输入接口,用于接收压缩权重值,并将所述压缩权重值输入至所述压缩权重计算模块;The weight and neuron parameter input interface is used to receive the compressed weight value, and input the compressed weight value to the compressed weight calculation module; 所述稀疏脉冲检测模块,用于从所述脉冲输入信号中提取有效脉冲索引;所述有效脉冲索引用于表征所述脉冲输入信号中非零值的位置;The sparse pulse detection module is configured to extract an effective pulse index from the pulse input signal; the effective pulse index is used to characterize the position of the non-zero value in the pulse input signal; 所述压缩权重计算模块,用于根据所述有效脉冲索引,对所述压缩权重值进行解压得到有效权值矩阵;计算所述有效权值矩阵与所述脉冲输入信号的加权和,得到每一神经元上的膜电位增量;利用所述每一神经元上的膜电位增量更新与所述每一神经元对应的膜电位累积量;The compression weight calculation module is used to decompress the compression weight value according to the effective pulse index to obtain an effective weight matrix; calculate the weighted sum of the effective weight matrix and the pulse input signal to obtain each The membrane potential increment on the neuron; using the membrane potential increment on each neuron to update the membrane potential accumulation corresponding to each neuron; 所述泄漏积分发放模块,用于判断更新后的膜电位累积量与预设阈值的大小关系,根据所述大小关系确定与所述每一神经元对应的输出脉冲结果。The leakage integral distribution module is used to judge the magnitude relationship between the updated membrane potential accumulation and the preset threshold, and determine the output pulse result corresponding to each neuron according to the magnitude relationship. 2.根据权利要求1所述的基于乒乓架构的稀疏脉冲神经网络加速器,其特征在于,还包括脉冲缓存模块组;所述脉冲缓存模块组包括第一脉冲缓存模块和第二脉冲缓存模块;2. the sparse pulse neural network accelerator based on ping-pong architecture according to claim 1, is characterized in that, also comprises pulse buffer module group; Described pulse buffer module group comprises the first pulse buffer module and the second pulse buffer module; 所述脉冲缓存模块组,用于以乒乓切换的方式控制每一缓存周期内第一脉冲缓存模块和第二脉冲缓存模块的读写状态,以使每一缓存周期内其中一个脉冲缓存模块处于读状态,另一个脉冲缓存模块处于写状态。The pulse buffer module group is used to control the read and write states of the first pulse buffer module and the second pulse buffer module in each buffer cycle in a ping-pong switching manner, so that one of the pulse buffer modules is in the read and write state in each buffer cycle. state, the other pulse buffer module is in the write state. 3.根据权利要求2所述的基于乒乓架构的稀疏脉冲神经网络加速器,其特征在于,还包括权重缓存模块组;所述权重缓存模块组包括第一权重缓存模块和第二权重缓存模块;3. the sparse pulse neural network accelerator based on ping-pong architecture according to claim 2, is characterized in that, also comprises weight cache module group; Said weight cache module group comprises a first weight cache module and a second weight cache module; 所述权重缓存模块组,用于以乒乓切换的方式控制每一缓存周期内第一权重缓存模块和第二权重缓存模块的读写状态,以使每一缓存周期内其中一个权重缓存模块处于读状态,另一个权重缓存模块处于写状态。The weight cache module group is used to control the read and write states of the first weight cache module and the second weight cache module in each cache cycle in a ping-pong manner, so that one of the weight cache modules is in the read and write state in each cache cycle. 
state, another weight cache module is in write state. 4.根据权利要求3所述的基于乒乓架构的稀疏脉冲神经网络加速器,其特征在于,还包括神经元参数缓存模块组;所述神经元参数缓存模块组包括第一神经元参数缓存模块和第二神经元参数缓存模块;4. the sparse spiking neural network accelerator based on ping-pong architecture according to claim 3, is characterized in that, also comprises neuron parameter cache module set; Described neuron parameter cache module set comprises the first neuron parameter cache module and the second neuron parameter cache module Two neuron parameter cache modules; 所述神经元参数缓存模块组,用于以乒乓切换的方式控制每一缓存周期内第一神经元参数缓存模块和第二神经元参数缓存模块的读写状态,以使每一缓存周期内其中一个神经元参数缓存模块处于读状态,另一个神经元参数缓存模块处于写状态。The neuron parameter cache module group is used to control the read and write states of the first neuron parameter cache module and the second neuron parameter cache module in each cache cycle in a ping-pong manner, so that in each cache cycle, the One neuron parameter cache module is in the read state, and the other neuron parameter cache module is in the write state. 5.根据权利要求1所述的基于乒乓架构的稀疏脉冲神经网络加速器,其特征在于,5. the sparse spiking neural network accelerator based on ping-pong architecture according to claim 1, characterized in that, 所述稀疏脉冲检测模块,进一步用于将所述脉冲输入信号对应的脉冲输入序列分为多组子序列;The sparse pulse detection module is further configured to divide the pulse input sequence corresponding to the pulse input signal into multiple groups of subsequences; 依次将每一组子序列与其自身进行按位或操作,得到按位或操作结果;若所述按位或操作结果为全0,则结束当前组子序列的运算;performing a bitwise OR operation on each group of subsequences with itself in turn to obtain a bitwise OR operation result; if the bitwise OR operation result is all 0, the operation of the current group of subsequences is ended; 若所述按位或操作结果不为全0,则将所述按位或操作结果作为当前待检测序列,对所述当前待检测序列进行多轮检测,每一轮检测中,将当前待检测序列减1后得到差值;将所述差值与当前待检测序列进行按位与操作后得到按位与操作结果;将所述按位与操作结果与当前待检测序列进行按位异或操作后得到有效脉冲独热码;将所述有效脉冲独热码进行二进制转换后得到有效脉冲索引;判断所述按位与操作结果是否为全0,若是,则结束当前待检测序列的检测,并返回所述依次将每一组子序列与其自身进行按位或操作,得到按位或操作结果的步骤;若否,则将所述按位与操作结果作为当前待检测序列,返回所述将当前检测序列减1后得到差值的步骤。If the bitwise OR operation result is not all 0, the bitwise OR operation result is used as the current sequence to be detected, and multiple rounds of detection are performed on the current sequence to be detected. In each round of detection, the current sequence to be detected is The difference is obtained after subtracting 1 from the sequence; performing a bitwise AND operation on the difference with the current sequence to be detected to obtain a bitwise AND operation result; performing a bitwise XOR operation on the bitwise AND operation result with the current sequence to be detected Obtain effective pulse one-hot code afterwards; Obtain effective pulse index after binary conversion of described effective pulse one-hot code; Judge whether described bitwise AND operation result is all 0, if so, then end the detection of current sequence to be detected, and Return to the step of performing a bitwise OR operation on each group of subsequences with itself in turn to obtain a bitwise OR operation result; if not, use the bitwise AND operation result as the current sequence to be detected, and return the current The step of obtaining the difference after subtracting 1 from the detection sequence. 6.根据权利要求1所述的基于乒乓架构的稀疏脉冲神经网络加速器,其特征在于,所述压缩权重计算模块包括行偏移运算单元、列索引权值运算单元、行偏移存储模块、列索引Delta编码存储模块、非零权值存储模块、权值分配器和处理单元阵列;其中,6. 
The sparse pulse neural network accelerator based on ping-pong architecture according to claim 1, wherein the compression weight calculation module comprises a row offset operation unit, a column index weight operation unit, a row offset storage module, a column Index Delta encoding storage module, non-zero weight storage module, weight allocator and processing unit array; wherein, 所述行偏移运算单元,用于从所述行偏移存储模块读取当前行的行偏移和与当前行相邻的下一行的行偏移;The row offset operation unit is used to read the row offset of the current row and the row offset of the next row adjacent to the current row from the row offset storage module; 所述列索引权值运算单元,用于解析所述当前行的行偏移和所述与当前行相邻的下一行的行偏移得到所述列索引编码模块和所述非零权值模块的起始地址和结束地址;根据所述起始地址和所述结束地址,分别从所述列索引编码模块和所述非零权值模块中获取列索引Delta编码和非零权值;The column index weight calculation unit is configured to analyze the row offset of the current row and the row offset of the next row adjacent to the current row to obtain the column index encoding module and the non-zero weight module The start address and the end address; according to the start address and the end address, obtain the column index Delta code and the non-zero weight from the column index encoding module and the non-zero weight module respectively; 所述权值分配器,包括预设数量的加法器构成的加法器链,所述加法器链用于将所述非零权值和所述Delta编码作为处理单元阵列的输入;The weight allocator includes an adder chain composed of a preset number of adders, and the adder chain is used to use the non-zero weight and the Delta code as the input of the processing unit array; 所述处理单元阵列,包括预设数量的处理单元,每个处理单元利用加法器进行运算,根据所述加法器的运算结果更新所述膜电位累积量。The processing unit array includes a preset number of processing units, and each processing unit uses an adder to perform calculations, and updates the accumulated membrane potential according to the calculation results of the adders. 7.根据权利要求6所述的基于乒乓架构的稀疏脉冲神经网络加速器,其特征在于,所述列索引权值运算单元,还用于在所述起始地址超过所述结束地址的情况下,结束当前行的处理。7. The sparse spiking neural network accelerator based on ping-pong architecture according to claim 6, wherein the column index weight calculation unit is also used for when the start address exceeds the end address, End processing of the current row. 8.根据权利要求6所述的基于乒乓架构的稀疏脉冲神经网络加速器,其特征在于,每个处理单元中还包括权值掩码生成模块、多路选择器和加法器;8. the sparse spiking neural network accelerator based on ping-pong architecture according to claim 6, is characterized in that, also comprises weight mask generation module, demultiplexer and adder in each processing unit; 所述权值掩码生成模块,用于根据权值分布状态生成权值掩码;The weight mask generation module is used to generate a weight mask according to the weight distribution state; 所述处理单元,还用于将所述权值掩码、所述非零权值作为多路选择器的输入,使得所述多路选择器根据所述权值掩码过滤后得到有效非零值,并将所述有效非零值和所述膜电位累积量作为加法器的输入,得到所述加法器输出的所述更新后的膜电位累积量。The processing unit is further configured to use the weight mask and the non-zero weight as inputs to a multiplexer, so that the multiplexer obtains effective non-zero weights after filtering according to the weight mask value, and using the effective non-zero value and the accumulated membrane potential as the input of an adder to obtain the updated accumulated membrane potential output by the adder. 9.根据权利要求1所述的基于乒乓架构的稀疏脉冲神经网络加速器,其特征在于,所述泄漏积分发放模块,进一步用于将所述更新后的膜电位累积量与预设泄漏值相加后得到泄漏积分值;若所述泄漏积分值大于所述预设阈值,则确定所述输出脉冲结果为发放脉冲信号;若所述泄漏积分值小于或等于预设阈值,则确定所述脉冲结果为不发放脉冲。9. 
9. The sparse spiking neural network accelerator based on a ping-pong architecture according to claim 1, characterized in that the leaky integrate-and-fire module is further configured to add the updated membrane potential accumulation to a preset leak value to obtain a leaky integration value; if the leaky integration value is greater than the preset threshold, the output spike result is determined to be firing a spike signal; if the leaky integration value is less than or equal to the preset threshold, the spike result is determined to be not firing a spike.

10. The sparse spiking neural network accelerator based on a ping-pong architecture according to claim 1, characterized by further comprising a spike output interface;
the leaky integrate-and-fire module is further configured to, when it is determined that the updated membrane potential accumulation is greater than the preset threshold, send the updated membrane potential accumulation to the spike output interface and reset the membrane potential accumulation.
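Claim 4, like the weight cache pair above it, relies on ping-pong buffering: in every cache cycle one buffer of the pair is read while the other is written, and the roles swap at the cycle boundary. A minimal Python sketch of that role swap; the class name, buffer depth and swap() call are illustrative assumptions, not part of the disclosure:

    class PingPongBuffers:
        """Two buffers whose read/write roles alternate every cache cycle."""

        def __init__(self, depth):
            self.buffers = [[0] * depth, [0] * depth]
            self.read_sel = 0  # index of the buffer currently being read

        def swap(self):
            # Called at each cache-cycle boundary: the buffer that was being
            # written becomes readable, and vice versa.
            self.read_sel ^= 1

        def read_buffer(self):
            return self.buffers[self.read_sel]

        def write_buffer(self):
            return self.buffers[self.read_sel ^ 1]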
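The detection loop in claim 5 is the familiar lowest-set-bit trick: subtracting 1 and ANDing clears the least-significant '1' of the sequence, and XORing that result with the original value isolates the cleared bit as a one-hot code whose position is the valid-spike index. A sketch in Python, assuming 32-bit subsequence words and zero-based indices (neither is fixed by the claim):

    def detect_valid_spikes(subsequences, word_bits=32):
        """Return the indices of all '1' bits (valid spikes) across the
        subsequence words, following the subtract-1 / AND / XOR loop of claim 5."""
        indices = []
        for word_idx, word in enumerate(subsequences):
            current = word | word              # bitwise OR with itself: all-zero test
            if current == 0:
                continue                       # no spikes in this subsequence
            while True:
                diff = current - 1                      # subtract 1
                and_result = current & diff             # clears the lowest set bit
                one_hot = and_result ^ current          # valid-spike one-hot code
                local_index = one_hot.bit_length() - 1  # one-hot -> binary index
                indices.append(word_idx * word_bits + local_index)
                if and_result == 0:            # every spike in this word consumed
                    break
                current = and_result           # keep detecting the remaining bits
        return indices

For example, detect_valid_spikes([0b0000, 0b1010]) returns [33, 35]: the all-zero word is skipped outright and only the two active bits of the second word are visited, so the work grows with the number of spikes rather than with the sequence length.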
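Claims 6 and 7 describe a compressed-sparse-row style walk over the weight matrix: two consecutive row offsets bound the column index Delta codes and non-zero weights of one row, the adder chain turns the Delta codes back into absolute column indices, and each processing unit adds its weight into the membrane potential accumulation. A behavioural Python sketch, assuming the first Delta code of a row is taken relative to column 0 and collapsing the adder chain and processing unit array into a sequential loop:

    def accumulate_compressed_row(row, row_offsets, delta_codes, nonzero_weights, membrane):
        """Accumulate the non-zero weights of one compressed row into the
        membrane potential array, decoding Delta-encoded column indices."""
        start = row_offsets[row]       # start address for this row
        end = row_offsets[row + 1]     # end address (row offset of the adjacent next row)
        if start > end:                # claim 7: end processing of the current row
            return
        column = 0
        for k in range(start, end):
            column += delta_codes[k]                 # adder chain: Delta code -> absolute column
            membrane[column] += nonzero_weights[k]   # processing unit adds the non-zero weight

Only the non-zero entries of the addressed row are ever fetched or added, which is where the sparsity saving over a dense row scan comes from.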
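Within a processing unit, claim 8 adds a weight mask so the multiplexer passes only the non-zero weights that actually belong to that unit before the adder updates the membrane potential accumulation. A small sketch, treating the mask as a per-slot valid bit over the distributed weight group (an assumption; the claim does not fix the mask encoding):

    def processing_unit_update(weight_mask, nonzero_weights, membrane_acc):
        """Filter the distributed non-zero weights with the weight mask (mux select)
        and fold the surviving values into the membrane potential accumulation."""
        for valid, weight in zip(weight_mask, nonzero_weights):
            if valid:                      # multiplexer passes only masked-in weights
                membrane_acc += weight     # adder output is the updated accumulation
        return membrane_acc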
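Claims 9 and 10 give the leaky integrate-and-fire rule: add a preset leak value to the updated membrane potential accumulation, fire a spike when the result exceeds the preset threshold, and reset the accumulation on firing. A sketch of one update step; returning a fire flag, keeping the leaked value when no spike is fired, and resetting to zero are assumptions the claims leave open:

    def leaky_integrate_fire(membrane_acc, leak, threshold):
        """One leak-integrate-fire step; returns (fired, new_membrane_acc)."""
        leaky_value = membrane_acc + leak  # claim 9: add the preset leak value
        if leaky_value > threshold:        # strictly greater than the threshold
            return True, 0                 # claim 10: report the event and reset
        return False, leaky_value          # at or below threshold: no spike this step

A negative leak value makes the accumulation decay between time steps, which is the usual choice, although the claims do not constrain its sign.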
CN202310410779.3A 2023-04-17 2023-04-17 Sparse pulse neural network accelerator based on ping-pong architecture Active CN116663626B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202310410779.3A CN116663626B (en) 2023-04-17 2023-04-17 Sparse pulse neural network accelerator based on ping-pong architecture
PCT/CN2023/121949 WO2024216857A1 (en) 2023-04-17 2023-09-27 Sparse spiking neural network accelerator based on ping-pong architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310410779.3A CN116663626B (en) 2023-04-17 2023-04-17 Sparse pulse neural network accelerator based on ping-pong architecture

Publications (2)

Publication Number Publication Date
CN116663626A true CN116663626A (en) 2023-08-29
CN116663626B CN116663626B (en) 2025-11-07

Family

ID=87721335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310410779.3A Active CN116663626B (en) 2023-04-17 2023-04-17 Sparse pulse neural network accelerator based on ping-pong architecture

Country Status (2)

Country Link
CN (1) CN116663626B (en)
WO (1) WO2024216857A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118798276A (en) * 2024-09-11 2024-10-18 电子科技大学 A block-by-block vector-zero-value sparsity-aware convolutional neural network accelerator
WO2024216857A1 (en) * 2023-04-17 2024-10-24 北京大学 Sparse spiking neural network accelerator based on ping-pong architecture
CN119358606A (en) * 2024-09-25 2025-01-24 鹏城实验室 Pulse neural network weight gradient calculation method and related equipment
CN120874919A (en) * 2025-09-28 2025-10-31 山东云海国创云计算装备产业创新中心有限公司 Integrated deposit and calculation device, deposit and calculation method, processing device, tile module and accelerator

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119474626B (en) * 2024-11-05 2025-09-26 南京大学 Multifunctional linear convolution accelerator
CN119719720B (en) * 2024-12-24 2025-10-03 广东工业大学 A computing system for spiking recurrent neural networks
CN119721153B (en) * 2025-03-04 2025-05-23 浪潮电子信息产业股份有限公司 A reinforcement learning accelerator, acceleration method and electronic device
CN120234515B (en) * 2025-05-29 2025-08-15 兰州大学 Event-driven CSR parser supporting multiple computing modes
CN120255642B (en) * 2025-06-06 2025-09-19 南京大学 Sensing and computing integrated system and method based on SPAD and pulse neural network
CN120562489B (en) * 2025-07-31 2025-09-19 苏州元脑智能科技有限公司 Convolution acceleration method, device, equipment and medium for pulse neural network

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105049056A (en) * 2015-08-07 2015-11-11 杭州国芯科技股份有限公司 One-hot code detection circuit
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN111445013A (en) * 2020-04-28 2020-07-24 南京大学 Non-zero detector for convolutional neural network and method thereof
CN112732222A (en) * 2021-01-08 2021-04-30 苏州浪潮智能科技有限公司 Sparse matrix accelerated calculation method, device, equipment and medium
CN113537488A (en) * 2021-06-29 2021-10-22 杭州电子科技大学 Neural network accelerator based on sparse vector matrix calculation and acceleration method
CN114860192A (en) * 2022-06-16 2022-08-05 中山大学 FPGA-based sparse dense matrix multiplication array with high multiplier utilization rate for graph neural network
CN115440226A (en) * 2022-09-20 2022-12-06 南京大学 A low-power system for speech keyword recognition based on spiking neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115169523A (en) * 2021-04-02 2022-10-11 华为技术有限公司 Impulse neural network circuit and computing method based on impulse neural network
CN116663626B (en) * 2023-04-17 2025-11-07 北京大学 Sparse pulse neural network accelerator based on ping-pong architecture

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105049056A (en) * 2015-08-07 2015-11-11 杭州国芯科技股份有限公司 One-hot code detection circuit
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network
US20180046895A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Device and method for implementing a sparse neural network
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN111445013A (en) * 2020-04-28 2020-07-24 南京大学 Non-zero detector for convolutional neural network and method thereof
CN112732222A (en) * 2021-01-08 2021-04-30 苏州浪潮智能科技有限公司 Sparse matrix accelerated calculation method, device, equipment and medium
CN113537488A (en) * 2021-06-29 2021-10-22 杭州电子科技大学 Neural network accelerator based on sparse vector matrix calculation and acceleration method
CN114860192A (en) * 2022-06-16 2022-08-05 中山大学 FPGA-based sparse dense matrix multiplication array with high multiplier utilization rate for graph neural network
CN115440226A (en) * 2022-09-20 2022-12-06 南京大学 A low-power system for speech keyword recognition based on spiking neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YISONG KUANG et al.: "An Event-driven Spiking Neural Network Accelerator with On-chip Sparse Weight", 2022 IEEE International Symposium on Circuits and Systems (ISCAS), 11 November 2022 (2022-11-11), pages 3468-3472 *
YISONG KUANG et al.: "ESSA: Design of a Programmable Efficient Sparse Spiking Neural Network Accelerator", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 11, 11 November 2022 (2022-11-11), pages 1631-1641, XP011924800, DOI: 10.1109/TVLSI.2022.3183126 *
YU Chengyu et al.: "Design and Implementation of an Efficient Sparse Convolutional Neural Network Accelerator", CAAI Transactions on Intelligent Systems, vol. 15, no. 2, 31 March 2020 (2020-03-31), pages 323-333 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024216857A1 (en) * 2023-04-17 2024-10-24 北京大学 Sparse spiking neural network accelerator based on ping-pong architecture
CN118798276A (en) * 2024-09-11 2024-10-18 电子科技大学 A block-by-block vector-zero-value sparsity-aware convolutional neural network accelerator
CN119358606A (en) * 2024-09-25 2025-01-24 鹏城实验室 Pulse neural network weight gradient calculation method and related equipment
CN119358606B (en) * 2024-09-25 2025-11-04 鹏城实验室 Methods and related equipment for calculating weight gradients in spiking neural networks
CN120874919A (en) * 2025-09-28 2025-10-31 山东云海国创云计算装备产业创新中心有限公司 Integrated deposit and calculation device, deposit and calculation method, processing device, tile module and accelerator

Also Published As

Publication number Publication date
CN116663626B (en) 2025-11-07
WO2024216857A1 (en) 2024-10-24

Similar Documents

Publication Publication Date Title
CN116663626B (en) Sparse pulse neural network accelerator based on ping-pong architecture
Kalchbrenner et al. Efficient neural audio synthesis
JP7366484B2 (en) Quantum error correction decoding system, method, fault tolerant quantum error correction system and chip
CN111783973B (en) Nerve morphology processor and equipment for liquid state machine calculation
Evans et al. JPEG-ACT: Accelerating deep learning via transform-based lossy compression
Wang et al. Learning efficient binarized object detectors with information compression
Kadetotad et al. Efficient memory compression in deep neural networks using coarse-grain sparsification for speech applications
CN112884149B (en) Random sensitivity ST-SM-based deep neural network pruning method and system
Wang et al. LSMCore: a 69k-synapse/mm 2 single-core digital neuromorphic processor for liquid state machine
Qin et al. Diagonalwise refactorization: An efficient training method for depthwise convolutions
CN115022637B (en) Image encoding method, image decompression method and device
CN113962371B (en) An image recognition method and system based on a brain-like computing platform
CN116663627A (en) Digital neuromorphic computing processor and computing method
Liu et al. Spiking-diffusion: Vector quantized discrete diffusion model with spiking neural networks
CN112598119B (en) An On-Chip Memory Compression Method for Neuromorphic Processors for Liquid State Machines
Lee et al. TT-SNN: tensor train decomposition for efficient spiking neural network training
Wu et al. A 3.89-GOPS/mW scalable recurrent neural network processor with improved efficiency on memory and computation
CN109495113A (en) A kind of compression method and device of EEG signals
Dang et al. An efficient software-hardware design framework for spiking neural network systems
Guo et al. A high-efficiency FPGA-based accelerator for binarized neural network
Watkins et al. Image data compression and noisy channel error correction using deep neural network
CN113222160A (en) Quantum state conversion method and device
CN115170916B (en) Image reconstruction method and system based on multi-scale feature fusion
Li et al. An efficient sparse lstm accelerator on embedded fpgas with bandwidth-oriented pruning
Chen et al. An Edge Neuromorphic Processor With High-Accuracy On-Chip Aggregate-Label Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant