CN116663626A - Sparse Spiking Neural Network Accelerator Based on Ping-Pong Architecture - Google Patents
Info
- Publication number
- CN116663626A CN116663626A CN202310410779.3A CN202310410779A CN116663626A CN 116663626 A CN116663626 A CN 116663626A CN 202310410779 A CN202310410779 A CN 202310410779A CN 116663626 A CN116663626 A CN 116663626A
- Authority
- CN
- China
- Prior art keywords
- pulse
- module
- weight
- sparse
- ping
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/061—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Complex Calculations (AREA)
- Manipulation Of Pulses (AREA)
- Particle Accelerators (AREA)
Abstract
The application provides a sparse pulse neural network accelerator based on a ping-pong architecture. Compression weight values are transmitted to a compression weight calculation module, and a sparse pulse detection module extracts valid pulse indices from the pulse input signal, so that the remaining bits of the pulse signal are kept out of the computation and the amount of calculation is reduced; the compression weight calculation module then accumulates the non-zero values among the compression weight values onto the neuron membrane potentials according to the valid pulse indices, which finally determines whether a pulse is fired. Compared with a conventional synaptic crossbar array, in which all synapses are activated and take part in the computation, only the synaptic weights corresponding to the valid pulse indices are activated while the other synapses stay idle, which reduces the amount of computation, lowers the operating power consumption of the whole chip, and improves the operating speed, energy efficiency and area efficiency of the pulse neural network.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a sparse pulse neural network accelerator based on a ping-pong architecture.
Background
The pulse neural network (Spiking Neural Network, SNN), owing to its low power consumption and high concurrency, can raise the computing power available to machines; it is a computing paradigm with great potential and is widely regarded as the future of artificial intelligence research.
Because the signaling mechanism of neurons in a pulse neural network does not fit the conventional von Neumann computer architecture, dedicated hardware accelerators must be designed to run pulse neural networks. Current neuromorphic accelerators usually adopt a regularly structured, fixed-size synaptic crossbar array to store the synaptic connection matrix of an SNN model directly; no matter how sparse the synapses are, all of them take part in the computation, which increases the amount of calculation, and the spatial sparsity of the SNN model is therefore not reflected in the neuromorphic accelerator. Another accelerator design buffers the input pulses as a bitmap, which forces the hardware to check the validity of every bit of the input pulse vector and increases the computation time; as a result, the temporal sparsity of the pulse signal cannot be used to speed up the hardware.
Therefore, current neural network accelerators cannot fully exploit the potential performance advantages of the SNN model: the power consumption is not low enough, the operation is not fast enough, and the energy efficiency of running the SNN model is not high enough.
Disclosure of Invention
The application provides a sparse pulse neural network accelerator based on a ping-pong architecture, to overcome the defect of the prior art that the chip power consumption and the amount of calculation are large because all synapses, or every bit of the input pulse signal, take part in the computation, and to realize low-power, low-latency operation of the pulse neural network.
The application provides a sparse pulse neural network accelerator based on a ping-pong architecture, which comprises a pulse input interface, a weight and neuron parameter input interface, a sparse pulse detection module, a compression weight calculation module and a leaky integrate-and-fire module; wherein:

the pulse input interface is used for receiving a pulse input signal and inputting the pulse input signal to the sparse pulse detection module;

the weight and neuron parameter input interface is used for receiving compression weight values and inputting them to the compression weight calculation module;

the sparse pulse detection module is used for extracting valid pulse indices from the pulse input signal, a valid pulse index representing the position of a non-zero value in the pulse input signal;

the compression weight calculation module is used for decompressing the compression weight values according to the valid pulse indices to obtain an effective weight matrix, calculating the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment of each neuron, and updating the accumulated membrane potential of each neuron with that increment;

the leaky integrate-and-fire module is used for judging the magnitude relation between the updated accumulated membrane potential and a preset threshold, and determining the output pulse result of each neuron according to that relation.
The sparse pulse neural network accelerator based on the ping-pong architecture provided by the application further comprises a pulse cache module group; the pulse buffer module group comprises a first pulse buffer module and a second pulse buffer module;
the pulse buffer module group is used for controlling the read-write states of the first pulse buffer module and the second pulse buffer module in each buffer period in a ping-pong switching mode so that one pulse buffer module is in a read state and the other pulse buffer module is in a write state in each buffer period.
The sparse pulse neural network accelerator based on the ping-pong architecture provided by the application further comprises a weight cache module group; the weight buffer module group comprises a first weight buffer module and a second weight buffer module;
the weight buffer module group is used for controlling the read-write states of the first weight buffer module and the second weight buffer module in each buffer period in a ping-pong switching mode so that one weight buffer module is in a read state and the other weight buffer module is in a write state in each buffer period.
The sparse pulse neural network accelerator based on the ping-pong architecture provided by the application further comprises a neuron parameter cache module group; the neuron parameter cache module group comprises a first neuron parameter cache module and a second neuron parameter cache module;
the neuron parameter buffer module group is used for controlling the read-write states of the first neuron parameter buffer module and the second neuron parameter buffer module in each buffer period in a ping-pong switching mode so that one neuron parameter buffer module is in a read state and the other neuron parameter buffer module is in a write state in each buffer period.
According to the sparse pulse neural network accelerator based on the ping-pong architecture, the sparse pulse detection module is further used for dividing a pulse input sequence corresponding to the pulse input signal into a plurality of groups of subsequences;
each group of subsequences is bitwise-ORed with itself in turn to obtain a bitwise-OR result; if the bitwise-OR result is all zeros, the operation on the current group of subsequences ends;

if the bitwise-OR result is not all zeros, the bitwise-OR result is taken as the current sequence to be detected and several rounds of detection are performed on it. In each round, 1 is subtracted from the current sequence to be detected to obtain a difference; a bitwise AND of the difference and the current sequence to be detected gives a bitwise-AND result; a bitwise XOR of the bitwise-AND result and the current sequence to be detected gives a valid-pulse one-hot code; the one-hot code is converted to binary to obtain a valid pulse index. It is then judged whether the bitwise-AND result is all zeros: if so, the detection of the current sequence to be detected ends and the flow returns to the step of bitwise-ORing each group of subsequences with itself in turn to obtain a bitwise-OR result; if not, the bitwise-AND result is taken as the current sequence to be detected and the flow returns to the step of subtracting 1 from the current sequence to be detected to obtain a difference.
According to the sparse pulse neural network accelerator based on the ping-pong architecture, the compression weight calculation module comprises a row offset operation unit, a column index weight calculation unit, a column index coding module, a non-zero weight module, a weight distributor and a processing unit array; wherein:
the row offset operation unit is used for acquiring the row offset of the current row and the row offset of the next row adjacent to the current row;
the column index weight calculation unit is used for analyzing the row offset of the current row and the row offset of the next row adjacent to the current row to obtain the starting address and the ending address of the column index coding module and the non-zero weight module; acquiring a non-zero weight and a column index Delta code from the column index coding module and the non-zero weight module according to the starting address and the ending address;
the weight distributor comprises an adder chain formed by a preset number of adders, and the adder chain is used for converting the non-zero weights and the column index Delta codes into the inputs of the processing unit array;

the processing unit array comprises a preset number of processing units, each processing unit performs its operation with an adder, and the membrane potential accumulation amount is updated according to the operation result of the adder.
According to the sparse pulse neural network accelerator based on the ping-pong architecture provided by the application, the column index weight calculation unit is further used for ending the processing of the current row under the condition that the starting address exceeds the ending address.
According to the sparse pulse neural network accelerator based on the ping-pong architecture, each processing unit further comprises a weight mask generating module, a multiplexer and an adder;
the weight mask generation module is used for generating a weight mask according to the weight distribution state;
the processing unit is further configured to use the weight mask and the non-zero weight as inputs of a multiplexer, so that the multiplexer obtains an effective non-zero value after filtering according to the weight mask, and uses the effective non-zero value and the accumulated amount of the membrane potential as inputs of an adder to obtain the updated accumulated amount of the membrane potential output by the adder.
According to the sparse pulse neural network accelerator based on the ping-pong architecture, the leaky integrate-and-fire module is further used for adding a preset leakage value to the updated membrane potential accumulation amount to obtain a leaky integration value; if the leaky integration value is larger than the preset threshold, the output pulse result is that a pulse is fired; and if the leaky integration value is smaller than or equal to the preset threshold, the output pulse result is that no pulse is fired.
The sparse pulse neural network accelerator based on the ping-pong architecture provided by the application further comprises a pulse output interface;
the leaky integrate-and-fire module is further configured to send the updated accumulated membrane potential to the pulse output interface and to reset the accumulated membrane potential if the updated accumulated membrane potential is determined to be greater than the preset threshold.
According to the sparse pulse neural network accelerator based on the ping-pong architecture, the compression weight values are transmitted to the compression weight calculation module, and the sparse pulse detection module extracts the valid pulse indices from the pulse input signal, so that the remaining bits of the pulse signal do not take part in the computation and the amount of calculation is reduced; the compression weight calculation module accumulates the non-zero values among the compression weight values onto the neuron membrane potentials according to the valid pulse indices, and finally decides whether a pulse is fired. Compared with a conventional synaptic crossbar array in which all synapses are activated and take part in the computation, the application activates only the synaptic weights corresponding to the valid pulse indices while the other synapses stay idle, thereby reducing the amount of computation, lowering the operating power consumption of the whole chip, and improving the operating speed, energy efficiency and area efficiency of the pulse neural network.
Drawings
In order to illustrate the technical solutions of the application or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the application; a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of a functional module of a sparse pulse neural network accelerator based on a ping-pong architecture provided by the application;
FIG. 2 is a schematic diagram of matrix elements of a row compression matrix provided by the present application;
FIG. 3 is a schematic diagram of LIF neurons provided by the present application;
FIG. 4 is a schematic flow chart of a ping-pong operation method provided by the application;
FIG. 5 is a schematic diagram of a hardware circuit of the sparse pulse detection module provided by the present application;
FIG. 6 is a schematic diagram of the compression weight calculation module according to the present application;
FIG. 7 is a schematic diagram of a functional block of the compression weight calculation module according to the present application;
FIG. 8 is a schematic diagram of a hardware circuit of a column index weight calculation unit according to the present application;
FIG. 9 is a schematic diagram of adder chain decoding provided by the present application;
fig. 10 is a schematic structural diagram of a PE array according to the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Specific embodiments of the present application are described below in conjunction with fig. 1-10:
The functional modules of the sparse pulse neural network accelerator based on the ping-pong architecture provided by the embodiment of the application are shown in FIG. 1; the accelerator comprises a pulse input interface 101, a weight and neuron parameter input interface 102, a sparse pulse detection module 103, a compression weight calculation module 104 and a leaky integrate-and-fire module 105; wherein:
a pulse input interface 101 for receiving a pulse input signal and inputting the pulse input signal to a sparse pulse detection module 103;
specifically, the pulse input signal is transmitted through the pulse input interface 101 into the sparse pulse detection module 103.
A weight and neuron parameter input interface 102 for receiving a compression weight value and inputting the compression weight value to the compression weight calculation module 104;
the compression weight value refers to a weight value in a sparse matrix storage format. Because the weight parameters in the impulse neural network model are very many, the sparse data storage format can save a large amount of storage space and accelerate the calculation speed. The sparse matrix representation format used in the present application is CSR (Compressed Sparse Row line compression) format.
Specifically, the compression weight values stored in CSR format are input to the compression weight calculation module 104 through the weight and neuron parameter input interface 102.
A sparse pulse detection module 103, configured to extract a valid pulse index from the pulse input signal; the valid pulse index is used to characterize the location of non-zero values in the pulse input signal.
Since the pulse input signal is a sparse vector/tensor (sparse means having a large number of zeros), the sparse vector/tensor needs to be multiplied by the synaptic weight in the calculation process, which means that most of the synaptic weights are multiplied by "0", and the result is still 0. In order to reduce the overall power consumption of the chip, the application activates only the neurons needing to be activated in the synaptic crossover matrix, namely activates the corresponding neurons in the synaptic crossover array according to the non-zero value position in the pulse input signal. To determine which neurons need to be activated, the present application uses the sparse pulse detection module 103 to extract a valid pulse index from the pulse input signal, where the valid pulse index is used to represent the location of non-zero values in the pulse input signal.
The compression weight calculation module 104 is configured to decompress the compression weight values received from outside the chip according to the valid pulse indices to obtain an effective weight matrix; to calculate the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment of each neuron; and to use the membrane potential increment of each neuron to update that neuron's accumulated membrane potential.
The effective weight matrix, shown as the left matrix in FIG. 2, is the matrix restored from the compression weight values according to the valid pulse indices. The compression weight values are shown as the "non-zero values" in FIG. 2: each of their elements is a non-zero element of the effective weight matrix. Because on-chip storage is limited while a neural network often has a huge number of parameters and weights, the application stores the weights in the CSR (Compressed Sparse Row) format to save storage space. The CSR format represents a sparse matrix (i.e. the effective weight matrix) with three one-dimensional arrays: row offsets, column indices and non-zero values. The row offset array stores the accumulated count of non-zero values of the preceding rows; the m-th element of the row offset array is the number of non-zero values above row m of the effective weight matrix. For example, the 2nd element "4" of the row offset array in FIG. 2 indicates that there are 4 non-zero values (1, 7, 2 and 8) above row 2 of the effective weight matrix (rows are numbered from 0), and so on. Subtracting two adjacent elements of the row offset array therefore gives the number of non-zero values contained in each row; for row i, it suffices to read the i-th and (i+1)-th elements of the row offset array and subtract them. The column index array stores the column index of each element of the non-zero value array; for example, the column index corresponding to "7" in the non-zero value array is 1, indicating that 7 is located in column 1 of the effective weight matrix (columns are numbered from 0).
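As an illustration of the CSR bookkeeping described above, the following sketch builds the three CSR arrays for a small hypothetical matrix (not the matrix of FIG. 2) and shows how adjacent row offsets give the per-row non-zero counts; the helper names are illustrative assumptions rather than names used in the accelerator.

```python
import numpy as np

def dense_to_csr(matrix):
    """Build the three CSR arrays: row offsets, column indices, non-zero values."""
    row_offsets, col_indices, values = [0], [], []
    for row in matrix:
        for col, v in enumerate(row):
            if v != 0:
                col_indices.append(col)
                values.append(v)
        row_offsets.append(len(values))   # accumulated non-zero count up to this row
    return row_offsets, col_indices, values

# Small hypothetical matrix, used only to illustrate the format.
dense = np.array([[1, 7, 0, 0],
                  [0, 2, 8, 0],
                  [5, 0, 0, 9],
                  [0, 0, 6, 0]])

ro, ci, nz = dense_to_csr(dense)
print(ro)   # [0, 2, 4, 6, 7] -> ro[i+1] - ro[i] non-zeros in row i
print(ci)   # [0, 1, 1, 2, 0, 3, 2]
print(nz)   # [1, 7, 2, 8, 5, 9, 6]
```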
Specifically, the compression weight calculation module 104 decompresses the compression weight values stored in the CSR format to obtain the effective weight matrix, calculates the weighted sum of the effective weight matrix and the pulse input signal to obtain the membrane potential increment of each neuron, and adds the membrane potential increment of each neuron to that neuron's accumulated membrane potential to obtain the updated accumulated membrane potential.
The leaky integrate-and-fire module 105 is configured to determine the magnitude relation between the updated accumulated membrane potential and a preset threshold, and to determine the output pulse result of each neuron according to this relation.
Specifically, as shown in FIG. 3, FIG. 3 illustrates the structure of the pulse neural network algorithm used in the present application. The neurons of the pulse neural network supported by the application are linear LIF (leaky integrate-and-fire) models: after the accumulated pulse time sequence is input, linear leakage, threshold comparison and pulse emission are carried out. The membrane potential dynamics of the LIF neuron model are:
V_j(t+1) = V_j(t) + Σ_i x_i·w_ij − λ_j    (1)

wherein V_j(t) is the accumulated membrane potential of neuron j at time t, w_ij is the i-th synaptic weight of neuron j, x_i is the pulse input signal value of the i-th synapse, and λ_j is the linear leakage of neuron j.
The leaky integrate-and-fire module 105 determines the magnitude relation between the updated accumulated membrane potential V_j(t+1) and a preset threshold; if it exceeds the preset threshold, a neural pulse signal is emitted and the accumulated membrane potential of the neuron is reset.
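The following sketch evaluates equation (1) for one time step and then applies the threshold comparison and reset performed by the leaky integrate-and-fire module; the function name, array shapes and example numbers are illustrative assumptions, not part of the hardware design.

```python
import numpy as np

def lif_step(v_mem, x, w, leak, threshold):
    """One time step of equation (1) followed by threshold comparison and reset.

    v_mem: accumulated membrane potential V_j(t), one entry per neuron j
    x:     pulse input vector x_i (0/1 per synapse)
    w:     synaptic weight matrix w_ij, shape (num_synapses, num_neurons)
    leak:  linear leakage lambda_j
    """
    v_mem = v_mem + x @ w - leak          # V_j(t+1) = V_j(t) + sum_i x_i*w_ij - lambda_j
    spikes = v_mem > threshold            # fire where the threshold is exceeded
    v_mem = np.where(spikes, 0.0, v_mem)  # reset the membrane potential of firing neurons
    return spikes.astype(np.uint8), v_mem

# Illustrative numbers only.
x = np.array([1, 0, 1, 0])                                    # sparse pulse input
w = np.array([[0.5, 1.0], [2.0, 0.0], [1.5, 0.5], [0.0, 3.0]])
spikes, v = lif_step(np.zeros(2), x, w, leak=0.1, threshold=1.5)
print(spikes, v)   # [1 0] [0.  1.4]
```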
In the above embodiment, the compression weight values are transmitted to the compression weight calculation module, and the sparse pulse detection module extracts the valid pulse indices from the pulse input signal, so that the remaining bits of the pulse signal do not take part in the computation and the amount of calculation is reduced; the compression weight calculation module accumulates the non-zero values among the compression weight values onto the neuron membrane potentials according to the valid pulse indices and finally decides whether a pulse is fired. Compared with a conventional synaptic crossbar array in which all synapses are activated and take part in the computation, the application activates only the synaptic weights corresponding to the valid pulse indices, thereby reducing the amount of computation, lowering the operating power consumption of the whole chip, and improving the operating speed, energy efficiency and area efficiency of the pulse neural network.
In an embodiment, as shown in fig. 1, the sparse pulse neural network accelerator based on the ping-pong architecture further includes a pulse buffer module group 106, where the pulse buffer module group 106 includes a first pulse buffer module and a second pulse buffer module; the pulse buffer module group 106 is configured to control the read-write states of the first pulse buffer module and the second pulse buffer module in each buffer period in a ping-pong switching manner, so that one pulse buffer module is in a read state and the other pulse buffer module is in a write state in each buffer period.
Specifically, two pulse buffer RAMs (Random Access Memory) are arranged in the sparse pulse neural network accelerator based on the ping-pong architecture. They are used for encoding the pulse input signals received from outside the chip into pulse input codes suitable for computation; during encoding, the whole system can update one RAM while the other RAM is used for computation. Correspondingly, the output pulse codes must also be decoded back into output pulse signals; the decoding likewise uses the ping-pong operation, updating one RAM while the other is being used for computation.
In this embodiment, the pulse buffer module group provides ping-pong buffering while the pulse input signal is encoded or the pulse output signal is decoded, so that I/O operations and data processing proceed simultaneously and the throughput of the accelerator is improved.
In an embodiment, as shown in fig. 1, the sparse pulse neural network accelerator based on the ping-pong architecture further includes a weight buffer module group 107, where the weight buffer module group 107 includes a first weight buffer module and a second weight buffer module; the weight buffer module group 107 is configured to control the read-write states of the first weight buffer module and the second weight buffer module in each buffer period in a ping-pong switching manner, so that one weight buffer module is in a read state and the other weight buffer module is in a write state in each buffer period.
Specifically, two weight buffer RAMs are arranged in the sparse pulse neural network accelerator based on the ping-pong architecture; they are used for decompressing the CSR-format compression weight values received from outside the chip to obtain the effective weight matrix, and during decompression the whole system can decompress from one RAM while updating the other.
In this embodiment, the weight buffer module group provides ping-pong buffering while the CSR-format compression weight values are decompressed, so that I/O operations and data processing proceed simultaneously and the throughput of the accelerator is improved.
In one embodiment, the accelerator further includes a set of neuron parameter cache modules 108; the set of neuron parameter cache modules 108 includes a first neuron parameter cache module and a second neuron parameter cache module; the set of neuron parameter buffer modules 108 is configured to control the read-write states of the first neuron parameter buffer module and the second neuron parameter buffer module in each buffer cycle in a ping-pong switching manner, so that one of the neuron parameter buffer modules is in a read state and the other one of the neuron parameter buffer modules is in a write state in each buffer cycle.
Specifically, two neuron parameter buffer RAMs are arranged in the sparse pulse neural network accelerator based on the ping-pong architecture; they are used for decoding the neuron parameters received from outside the chip into neuron parameters suitable for computation, and during decoding the whole system can update one RAM while the other RAM is in use.
Further, the ping-pong operation used in the present application involves control over three dimensions: how many time slots each time step of the input pulse contains, how many time steps each group of neurons needs to compute, and how many groups each layer of the network is divided into.
As shown in FIG. 4, taking a two-layer 1024-512-256 fully-connected pulse neural network as an example, the accelerator receives the data of pulse RAM#0, weight RAM#0 and neuron parameter RAM#0 from outside after power-up. When the first global synchronization command "sync_all" is received, weight RAM#1 and neuron parameter RAM#1 receive data from outside, the data of pulse RAM#0, weight RAM#0 and neuron parameter RAM#0 are sent to the core calculation unit, and the calculation result is sent to pulse RAM#1; at this point the accelerator computes the first 256 neurons of the first layer. When the second global synchronization command "sync_all" is received, weight RAM#0 and neuron parameter RAM#0 receive data from outside, the data of pulse RAM#0, weight RAM#1 and neuron parameter RAM#1 are sent to the core calculation unit, and the calculation result is sent to pulse RAM#1; at this point the accelerator computes the last 256 neurons of the first layer. When the third global synchronization command "sync_all" is received, weight RAM#1 and neuron parameter RAM#1 receive data from outside, the data of pulse RAM#1, weight RAM#0 and neuron parameter RAM#0 are sent to the core calculation unit, and the calculation result is sent to pulse RAM#0 and off-chip; at this point the accelerator computes the 256 neurons of the second layer, completing the whole calculation process.
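The following simplified sketch illustrates the ping-pong switching idea behind this schedule: on every synchronization, one RAM of each pair is read by the core calculation unit while the other is refilled from outside. It swaps every buffer pair on each synchronization, whereas the schedule above keeps the pulse RAM on the same side for all groups of a layer; all names are illustrative assumptions.

```python
class BufferPair:
    """One ping-pong pair: while the core reads one RAM, the other is written."""
    def __init__(self):
        self.ram = [None, None]    # RAM#0 and RAM#1
        self.read_sel = 0          # which RAM the core reads in this cycle

    def write_side(self):
        return 1 - self.read_sel   # RAM currently open for external writes

    def swap(self):                # performed on a global "sync_all"
        self.read_sel = 1 - self.read_sel


def sync_all(pulse, weight, neuron, new_weight, new_neuron, core_compute):
    """One synchronization step: load the write-side RAMs from outside while the
    core computes from the read-side RAMs, then swap every pair."""
    weight.ram[weight.write_side()] = new_weight
    neuron.ram[neuron.write_side()] = new_neuron
    result = core_compute(pulse.ram[pulse.read_sel],
                          weight.ram[weight.read_sel],
                          neuron.ram[neuron.read_sel])
    pulse.ram[pulse.write_side()] = result       # output pulses go to the other pulse RAM
    for buf in (pulse, weight, neuron):
        buf.swap()


pulse, weight, neuron = BufferPair(), BufferPair(), BufferPair()
pulse.ram[0], weight.ram[0], neuron.ram[0] = "spikes#0", "weights#0", "params#0"
sync_all(pulse, weight, neuron, "weights#1", "params#1",
         core_compute=lambda s, w, n: f"spikes out of ({s}, {w}, {n})")
print(pulse.ram[pulse.read_sel])   # newly produced pulses, readable in the next cycle
```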
In this embodiment, the neuron parameter buffer module group provides ping-pong buffering of the neuron parameters during decoding, so that I/O operations and data processing proceed simultaneously and the throughput of the accelerator is improved.
In an embodiment, the sparse pulse detection module 103 is further configured to divide the pulse input sequence corresponding to the pulse input signal into a plurality of groups of subsequences;

each group of subsequences is bitwise-ORed with itself in turn to obtain a bitwise-OR result; if the bitwise-OR result is all zeros, the operation on the current group of subsequences ends;

if the bitwise-OR result is not all zeros, the bitwise-OR result is taken as the current sequence to be detected and several rounds of detection are performed on it. In each round, 1 is subtracted from the current sequence to be detected to obtain a difference; a bitwise AND of the difference and the current sequence to be detected gives a bitwise-AND result; a bitwise XOR of the bitwise-AND result and the current sequence to be detected gives a valid-pulse one-hot code; the one-hot code is converted to binary to obtain a valid pulse index. It is then judged whether the bitwise-AND result is all zeros: if so, the detection of the current sequence to be detected ends and the flow returns to the step of bitwise-ORing each group of subsequences with itself in turn to obtain a bitwise-OR result; if not, the bitwise-AND result is taken as the current sequence to be detected and the flow returns to the step of subtracting 1 from the current sequence to be detected to obtain a difference.
The pulse input sequence is a sequence obtained by encoding a pulse input signal. The valid pulse index refers to the position in the pulse input sequence where the non-zero value is located.
Specifically, as shown in FIG. 5, FIG. 5 shows the internal circuit structure of the sparse pulse detection module 103, which consists of logic elements such as OR gates, multiplexers, D flip-flops, adders, AND gates and XOR gates. The sparse pulse detection module 103 is mainly responsible for extracting the valid pulse indices of the input pulse sequence. To shorten the critical path delay, the application divides the input pulse sequence into 64-bit subsequences; for example, a 1024-bit input pulse sequence is divided into 16 groups. Each 64-bit subsequence is first bitwise-ORed with itself to obtain a bitwise-OR result; if the result is all zeros, all 64 bits of that subsequence are 0, so the operation on that group ends, i.e. the computation for that 64-bit subsequence is skipped.
If the bitwise-OR result is not all zeros, the positions of the non-zero values in the current group must be detected further. The detection proceeds as follows: the bitwise-OR result is taken as the current sequence to be detected, and several rounds of detection are performed on it. In each round, 1 is subtracted from the current sequence to be detected (e.g. the 64-bit subsequence) to obtain a difference; a bitwise AND of the difference and the current sequence to be detected gives a bitwise-AND result; a bitwise XOR of the bitwise-AND result and the data before the subtraction (i.e. the current sequence to be detected) gives a valid-pulse one-hot code. Meanwhile it is judged whether the bitwise-AND result is all zeros: if so, detection of this group of subsequences ends and detection of the next group (in this example, the next 64-bit subsequence) begins; if not, the bitwise-AND result becomes the current sequence to be detected and the flow returns to the step of subtracting 1. This loop continues until every group of subsequences has been detected, yielding a set of valid-pulse one-hot codes; a one-hot-to-binary decoding unit then converts them into the valid pulse indices, in binary form, corresponding to the pulse input signal.
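The bit manipulations above correspond to the classic "clear the lowest set bit" trick; the following sketch applies it to one subsequence word and converts each isolated one-hot code into a binary index. The function name and arguments are illustrative assumptions.

```python
def valid_pulse_indices(spike_word, base=0):
    """Extract the positions of the set bits of one subsequence word using the
    subtract-1 / AND / XOR steps described above. `base` is the bit offset of
    this subsequence within the full input pulse sequence."""
    indices = []
    seq = spike_word
    while seq != 0:                   # an all-zero word is skipped entirely
        cleared = seq & (seq - 1)     # bitwise-AND result: lowest set bit cleared
        one_hot = cleared ^ seq       # bitwise-XOR result: isolated one-hot code
        indices.append(base + one_hot.bit_length() - 1)  # one-hot -> binary index
        seq = cleared                 # continue with the remaining bits
    return indices

# Example: 64-bit subsequence with pulses at bit positions 3, 17 and 40.
word = (1 << 3) | (1 << 17) | (1 << 40)
print(valid_pulse_indices(word))      # [3, 17, 40]
```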
In the above embodiment, the sparse pulse detection module 103 is built from a combination of simple logic gates, which is sufficient to realize valid-index detection for sparse pulses.
In one embodiment, as shown in FIG. 6, FIG. 6 shows the structure of the compression weight calculation module 104, which includes a row offset module (Row Offset Module, ROM), a column index Delta coding module (Column Delta Module, CDM), a non-zero weight module (Nonzero Weight Value Module, NWVM), and an array of PEs (Processing Elements). Each PE unit computes a membrane potential increment, i.e. the dot product of the pulse input signal and the synaptic weights, i.e. the term x_i·w_ij in formula (1). FIG. 6 illustrates a 1024×256 synaptic crossbar array, where 1024 is the number of fan-ins (i.e. 1024 axons) and 256 is the number of hardware neurons; that is, each hardware neuron has 1024 fan-ins, but not every fan-in carries a valid pulse signal. The valid pulse index passed from the sparse pulse detection module activates the corresponding row; the Row Offset Module (ROM) then computes the corresponding row offset, the column index Delta coding module (CDM) reads the column indices of the non-zero values of that row according to the row offset, and the non-zero weight module (NWVM) adds the corresponding non-zero values to the membrane potentials according to the column indices. When all valid pulses in the input pulse sequence have been processed, all accumulated membrane potential values are sent to the leaky integrate-and-fire neuron dynamics module for the LIF operation.
As shown in FIG. 2, the effective weight matrix is represented by three parameters: Row Offsets, Column Indexes and non-zero Values. Accordingly, to restore the effective weight matrix from these three parameters, the accelerator provides a dedicated memory for each of them, as shown in FIG. 7: a row offset memory module (Row Offset Module, ROM), a column index Delta-encoded memory module (Column Delta Module, CDM) and a non-zero value memory module (Nonzero Weight Value Module, NWVM). To improve pipelining, the row offset memory is further divided into an even row offset memory (Even ROM) and an odd row offset memory (Odd ROM). As shown in FIG. 7, for an input pulse pointing to row i, the CSR decoder reads the row offset of row i and the row offset of row i+1 from the Even ROM and the Odd ROM respectively, and subtracts them to obtain the number of non-zero values corresponding to the row-i pulse, which in turn gives the memory addresses and the number of entries to read from the column index Delta-encoded memory module (CDM) and the non-zero value memory module (NWVM). Since only one row offset access is required for the pulse of row i, whereas more than one column index and weight access may be required for that row, the two kinds of data structure would have mismatched pipeline throughput; to avoid inefficient, uneven pipeline stalls, the circuit decouples the CSR decoder into a row offset operation unit and a column index weight calculation unit, which are synchronized through FIFO (First In First Out) memories, as shown in FIG. 7.
The circuit structure of the column index weight calculation unit is shown in FIG. 8. After parsing the row offset RO_i of row i and the row offset RO_{i+1} of row i+1, the start address Addr_i and end address Addr_{i+1} for accessing the column index Delta-encoded memory module (CDM) and the non-zero value memory module (NWVM) are obtained. While the MAE output signal is active, in each clock cycle the Delta-encoded column index ΔCI_j and the non-zero weight (WV) are read, starting from Addr_i, from the column index Delta-encoded memory module (CDM) and the non-zero value memory module (NWVM) respectively and passed to the subsequent circuit units (i.e. the corresponding PE units). When Addr_i exceeds Addr_{i+1}, the column index weight calculation unit finishes processing row i and fetches the row offsets of a new row from the row offset memory module.
As shown in FIG. 9, the Delta-encoded column indices (ΔCI_j) read from the column index Delta-encoded memory module (CDM) in each clock cycle are decoded into column indices (CI_j) by an adder chain. Because ΔCIs that do not belong to the current row may also be read from the CDM in a clock cycle, these are filtered out by a MUX (multiplexer), giving the filtered Delta codes ΔCI'. If the current clock cycle is the first cycle of the row operation, ΔCI'_0 becomes CI_0 of the current cycle; if it is not the first cycle, ΔCI'_0 is added onto CI_31 of the previous cycle. The other indices are obtained as CI_j = CI_{j-1} + ΔCI'_j.
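The following sketch mimics the adder-chain decoding of one clock cycle's worth of Delta-encoded column indices, with the MUX filtering modelled as a validity mask; a filtered-out code contributes a zero increment, and its duplicated index would later be discarded by the weight mask. Names and the per-cycle group size are illustrative assumptions.

```python
def decode_column_indices(delta_codes, valid_mask, last_ci_prev_cycle=None):
    """Adder-chain style decoding of one cycle of Delta-encoded column indices.
    Codes whose mask bit is 0 do not belong to the current row and are treated
    as a zero increment, mirroring the MUX filtering."""
    filtered = [d if m else 0 for d, m in zip(delta_codes, valid_mask)]
    indices = []
    for j, d in enumerate(filtered):
        if j == 0:
            # first element of the cycle: CI_0 of the row, or carried on from CI_31
            base = 0 if last_ci_prev_cycle is None else last_ci_prev_cycle
            indices.append(base + d)
        else:
            indices.append(indices[j - 1] + d)   # CI_j = CI_{j-1} + delta'_j
    return indices

# Example: Deltas 2, 3, 1, 5 decode to column indices 2, 5, 6, 11.
print(decode_column_indices([2, 3, 1, 5], [1, 1, 1, 1]))   # [2, 5, 6, 11]
```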
The structure of the PE array is shown in FIG. 10, which shows a PE array of 32 PE units. Each PE unit contains one 16-bit adder and two MUXes (multiplexers), and the accumulated membrane potentials of all dendrites are fed to MUX1 (the many-to-one multiplexer) of every PE. After 32 pairs of CI_j (column indices) and WV_j (non-zero weights) are obtained, the weight distributor supplies each pair, i.e. the j-th column index and the j-th non-zero weight, as the inputs of the corresponding PE_j: CI_j acts as the select signal of MUX1, so that the accumulated membrane potential of the neuron addressed by the j-th column index becomes one operand of the adder; WV_j is fed to one of the two inputs of MUX2 and, after filtering by the weight mask, the valid WV_j becomes the other operand of the adder. The adder output updates that neuron's accumulated membrane potential, and the updated value is used as the accumulated value in the next cycle.
The weight mask filters out invalid non-zero weights WV to avoid computation errors. Invalid non-zero weights WV appear in the computation for two reasons: (1) when the NWVM is accessed, weights that sit at the same NWVM memory address but do not belong to the current row are read out together as non-zero weights WV; (2) to realize fan-in extension, the 128 columns of the weight matrix are divided into several shared fan-in dendrite clusters, and when the NWVM is accessed, weights of dendrites that do not belong to the shared fan-in dendrite cluster receiving the current input pulse vector are also read out as non-zero weights WV.
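The following sketch shows, at a behavioural level, how the PE array accumulates the mask-approved non-zero weights onto the membrane potentials; the function name, data types and example values are illustrative assumptions.

```python
import numpy as np

def pe_array_update(v_mem, col_indices, weights, weight_mask):
    """One cycle of the PE array: each (column index, non-zero weight) pair is
    routed to a PE; the weight mask suppresses weights that were read out of the
    NWVM but do not belong to the current row or dendrite cluster."""
    for ci, wv, valid in zip(col_indices, weights, weight_mask):
        if valid:                      # MUX2 passes only mask-approved weights
            v_mem[ci] += wv            # adder: membrane potential + non-zero weight
    return v_mem

v = np.zeros(8, dtype=np.int32)
# The third pair is masked out, e.g. it belongs to another row at the same address.
pe_array_update(v, col_indices=[2, 5, 6], weights=[7, -3, 9], weight_mask=[1, 1, 0])
print(v)   # [ 0  0  7  0  0 -3  0  0]
```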
In this embodiment, providing the odd row offset module and the even row offset module further improves the degree of pipelining.
The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which a person of ordinary skill in the art can understand and implement without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (10)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310410779.3A CN116663626B (en) | 2023-04-17 | 2023-04-17 | Sparse pulse neural network accelerator based on ping-pong architecture |
| PCT/CN2023/121949 WO2024216857A1 (en) | 2023-04-17 | 2023-09-27 | Sparse spiking neural network accelerator based on ping-pong architecture |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310410779.3A CN116663626B (en) | 2023-04-17 | 2023-04-17 | Sparse pulse neural network accelerator based on ping-pong architecture |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116663626A true CN116663626A (en) | 2023-08-29 |
| CN116663626B CN116663626B (en) | 2025-11-07 |
Family
ID=87721335
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310410779.3A Active CN116663626B (en) | 2023-04-17 | 2023-04-17 | Sparse pulse neural network accelerator based on ping-pong architecture |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN116663626B (en) |
| WO (1) | WO2024216857A1 (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118798276A (en) * | 2024-09-11 | 2024-10-18 | 电子科技大学 | A block-by-block vector-zero-value sparsity-aware convolutional neural network accelerator |
| WO2024216857A1 (en) * | 2023-04-17 | 2024-10-24 | 北京大学 | Sparse spiking neural network accelerator based on ping-pong architecture |
| CN119358606A (en) * | 2024-09-25 | 2025-01-24 | 鹏城实验室 | Pulse neural network weight gradient calculation method and related equipment |
| CN120874919A (en) * | 2025-09-28 | 2025-10-31 | 山东云海国创云计算装备产业创新中心有限公司 | Integrated deposit and calculation device, deposit and calculation method, processing device, tile module and accelerator |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119474626B (en) * | 2024-11-05 | 2025-09-26 | 南京大学 | Multifunctional linear convolution accelerator |
| CN119719720B (en) * | 2024-12-24 | 2025-10-03 | 广东工业大学 | A computing system for spiking recurrent neural networks |
| CN119721153B (en) * | 2025-03-04 | 2025-05-23 | 浪潮电子信息产业股份有限公司 | A reinforcement learning accelerator, acceleration method and electronic device |
| CN120234515B (en) * | 2025-05-29 | 2025-08-15 | 兰州大学 | Event-driven CSR parser supporting multiple computing modes |
| CN120255642B (en) * | 2025-06-06 | 2025-09-19 | 南京大学 | Sensing and computing integrated system and method based on SPAD and pulse neural network |
| CN120562489B (en) * | 2025-07-31 | 2025-09-19 | 苏州元脑智能科技有限公司 | Convolution acceleration method, device, equipment and medium for pulse neural network |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105049056A (en) * | 2015-08-07 | 2015-11-11 | 杭州国芯科技股份有限公司 | One-hot code detection circuit |
| CN107239823A (en) * | 2016-08-12 | 2017-10-10 | 北京深鉴科技有限公司 | A kind of apparatus and method for realizing sparse neural network |
| US20180157969A1 (en) * | 2016-12-05 | 2018-06-07 | Beijing Deephi Technology Co., Ltd. | Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network |
| CN111445013A (en) * | 2020-04-28 | 2020-07-24 | 南京大学 | Non-zero detector for convolutional neural network and method thereof |
| CN112732222A (en) * | 2021-01-08 | 2021-04-30 | 苏州浪潮智能科技有限公司 | Sparse matrix accelerated calculation method, device, equipment and medium |
| CN113537488A (en) * | 2021-06-29 | 2021-10-22 | 杭州电子科技大学 | Neural network accelerator based on sparse vector matrix calculation and acceleration method |
| CN114860192A (en) * | 2022-06-16 | 2022-08-05 | 中山大学 | FPGA-based sparse dense matrix multiplication array with high multiplier utilization rate for graph neural network |
| CN115440226A (en) * | 2022-09-20 | 2022-12-06 | 南京大学 | A low-power system for speech keyword recognition based on spiking neural network |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115169523A (en) * | 2021-04-02 | 2022-10-11 | 华为技术有限公司 | Impulse neural network circuit and computing method based on impulse neural network |
| CN116663626B (en) * | 2023-04-17 | 2025-11-07 | 北京大学 | Sparse pulse neural network accelerator based on ping-pong architecture |
-
2023
- 2023-04-17 CN CN202310410779.3A patent/CN116663626B/en active Active
- 2023-09-27 WO PCT/CN2023/121949 patent/WO2024216857A1/en active Pending
Patent Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105049056A (en) * | 2015-08-07 | 2015-11-11 | 杭州国芯科技股份有限公司 | One-hot code detection circuit |
| CN107239823A (en) * | 2016-08-12 | 2017-10-10 | 北京深鉴科技有限公司 | A kind of apparatus and method for realizing sparse neural network |
| US20180046895A1 (en) * | 2016-08-12 | 2018-02-15 | DeePhi Technology Co., Ltd. | Device and method for implementing a sparse neural network |
| US20180157969A1 (en) * | 2016-12-05 | 2018-06-07 | Beijing Deephi Technology Co., Ltd. | Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network |
| CN111445013A (en) * | 2020-04-28 | 2020-07-24 | 南京大学 | Non-zero detector for convolutional neural network and method thereof |
| CN112732222A (en) * | 2021-01-08 | 2021-04-30 | 苏州浪潮智能科技有限公司 | Sparse matrix accelerated calculation method, device, equipment and medium |
| CN113537488A (en) * | 2021-06-29 | 2021-10-22 | 杭州电子科技大学 | Neural network accelerator based on sparse vector matrix calculation and acceleration method |
| CN114860192A (en) * | 2022-06-16 | 2022-08-05 | 中山大学 | FPGA-based sparse dense matrix multiplication array with high multiplier utilization rate for graph neural network |
| CN115440226A (en) * | 2022-09-20 | 2022-12-06 | 南京大学 | A low-power system for speech keyword recognition based on spiking neural network |
Non-Patent Citations (3)
| Title |
|---|
| YISONG KUANG等: "An Event-driven Spiking Neural Network Accelerator with On-chip Sparse Weight", 《 2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS)》, 11 November 2022 (2022-11-11), pages 3468 - 3472 * |
| YISONG KUANG等: "ESSA: Design of a Programmable Efficient Sparse Spiking Neural Network Accelerator", 《 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS》, vol. 30, no. 11, 11 November 2022 (2022-11-11), pages 1631 - 1641, XP011924800, DOI: 10.1109/TVLSI.2022.3183126 * |
| 余成宇等: "一种高效的稀疏卷积神经网络加速器的设计与实现", 《智能系统学报》, vol. 15, no. 2, 31 March 2020 (2020-03-31), pages 323 - 333 * |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2024216857A1 (en) * | 2023-04-17 | 2024-10-24 | 北京大学 | Sparse spiking neural network accelerator based on ping-pong architecture |
| CN118798276A (en) * | 2024-09-11 | 2024-10-18 | 电子科技大学 | A block-by-block vector-zero-value sparsity-aware convolutional neural network accelerator |
| CN119358606A (en) * | 2024-09-25 | 2025-01-24 | 鹏城实验室 | Pulse neural network weight gradient calculation method and related equipment |
| CN119358606B (en) * | 2024-09-25 | 2025-11-04 | 鹏城实验室 | Methods and related equipment for calculating weight gradients in spiking neural networks |
| CN120874919A (en) * | 2025-09-28 | 2025-10-31 | 山东云海国创云计算装备产业创新中心有限公司 | Integrated deposit and calculation device, deposit and calculation method, processing device, tile module and accelerator |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116663626B (en) | 2025-11-07 |
| WO2024216857A1 (en) | 2024-10-24 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN116663626B (en) | Sparse pulse neural network accelerator based on ping-pong architecture | |
| Kalchbrenner et al. | Efficient neural audio synthesis | |
| JP7366484B2 (en) | Quantum error correction decoding system, method, fault tolerant quantum error correction system and chip | |
| CN111783973B (en) | Nerve morphology processor and equipment for liquid state machine calculation | |
| Evans et al. | JPEG-ACT: Accelerating deep learning via transform-based lossy compression | |
| Wang et al. | Learning efficient binarized object detectors with information compression | |
| Kadetotad et al. | Efficient memory compression in deep neural networks using coarse-grain sparsification for speech applications | |
| CN112884149B (en) | Random sensitivity ST-SM-based deep neural network pruning method and system | |
| Wang et al. | LSMCore: a 69k-synapse/mm 2 single-core digital neuromorphic processor for liquid state machine | |
| Qin et al. | Diagonalwise refactorization: An efficient training method for depthwise convolutions | |
| CN115022637B (en) | Image encoding method, image decompression method and device | |
| CN113962371B (en) | An image recognition method and system based on a brain-like computing platform | |
| CN116663627A (en) | Digital neuromorphic computing processor and computing method | |
| Liu et al. | Spiking-diffusion: Vector quantized discrete diffusion model with spiking neural networks | |
| CN112598119B (en) | An On-Chip Memory Compression Method for Neuromorphic Processors for Liquid State Machines | |
| Lee et al. | TT-SNN: tensor train decomposition for efficient spiking neural network training | |
| Wu et al. | A 3.89-GOPS/mW scalable recurrent neural network processor with improved efficiency on memory and computation | |
| CN109495113A (en) | A kind of compression method and device of EEG signals | |
| Dang et al. | An efficient software-hardware design framework for spiking neural network systems | |
| Guo et al. | A high-efficiency FPGA-based accelerator for binarized neural network | |
| Watkins et al. | Image data compression and noisy channel error correction using deep neural network | |
| CN113222160A (en) | Quantum state conversion method and device | |
| CN115170916B (en) | Image reconstruction method and system based on multi-scale feature fusion | |
| Li et al. | An efficient sparse lstm accelerator on embedded fpgas with bandwidth-oriented pruning | |
| Chen et al. | An Edge Neuromorphic Processor With High-Accuracy On-Chip Aggregate-Label Learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |