Disclosure of Invention
The present invention provides a storage-and-computation integrated architecture systolic array design method suitable for multi-precision neural networks, aiming to solve the problem that storage-and-computation integrated architectures in the prior art adopt high-bit AD/DA modules, which results in large power consumption and area overhead.
The technical scheme adopted by the invention for solving the problems is as follows:
in a first aspect, an embodiment of the present invention provides a storage-and-computation integrated architecture systolic array design method suitable for a multi-precision neural network, where the method includes:
acquiring an original digital signal, and splitting the original digital signal into a plurality of 1-bit digital signals, the number of which is equal to the bit width of the original digital signal;
generating a plurality of input analog signals according to the plurality of 1-bit digital signals, wherein the plurality of input analog signals respectively correspond to different 1-bit digital signals;
respectively carrying out analog-to-digital conversion on the input analog signals to obtain a plurality of target digital signals;
and generating a convolution calculation result corresponding to the original digital signal according to the plurality of target digital signals.
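The four steps above can be sketched end-to-end with a minimal numeric model; the function names and the toy operand values below are illustrative assumptions, not part of the claims:

```python
def split_to_bits(x, n_bits):
    """Split an unsigned integer into n_bits 1-bit signals, LSB first."""
    return [(x >> i) & 1 for i in range(n_bits)]

def reconstruct(per_bit_results):
    """Shift-accumulate the per-bit partial results back to full precision."""
    return sum(r << i for i, r in enumerate(per_bit_results))

# Toy run: a 4-bit original signal 11 (0b1011) multiplied by a stored weight 5.
bits = split_to_bits(11, 4)        # four 1-bit "input signals": [1, 1, 0, 1]
partials = [b * 5 for b in bits]   # each 1-bit pass is a cheap analog multiply-add
print(reconstruct(partials))       # 55 == 11 * 5
```

Splitting shifts the precision burden away from the AD/DA converters and onto this cheap digital shift-accumulate stage, which is the trade-off the method exploits.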
In one embodiment, the performing analog-to-digital conversion on a plurality of input analog signals to obtain a plurality of target digital signals respectively includes:
and respectively inputting the input analog signals into a plurality of memory operation units to obtain a plurality of target digital signals.
In one embodiment, each of the memory operation units includes an array of memory operation devices arranged in a systolic array, and the inputting a plurality of the input analog signals into a plurality of the memory operation units respectively to obtain a plurality of the target digital signals includes:
for each of the plurality of input analog signals, determining a plurality of activated rows in a target memory operation device array corresponding to the input analog signal;
inputting the input analog signal into a plurality of first memory operation devices which are in one-to-one correspondence with the plurality of activated rows, wherein the plurality of first memory operation devices are respectively positioned at the initial bits of the plurality of activated rows;
and determining a target digital signal corresponding to the input analog signal based on single-column multiply-add operation results output by a plurality of second memory operation devices respectively, wherein the plurality of second memory operation devices are respectively positioned at the stop bit of each column in the target memory operation device array.
In one embodiment, the determining a target digital signal corresponding to the input analog signal based on the result of the multiply-add operation for a single column respectively output by the second memory operation devices includes:
and respectively inputting the single-column multiplication and addition operation results output by the second memory operation devices into a shared comparator to obtain the target digital signal.
In one embodiment, the generating a convolution calculation result corresponding to the original digital signal according to the plurality of target digital signals includes:
carrying out shift accumulation operation on a plurality of target digital signals to obtain target shift accumulation data;
and generating a convolution calculation result corresponding to the original digital signal according to the target shift accumulated data.
In one embodiment, the performing a shift accumulation operation on a plurality of target digital signals to obtain target shift accumulation data includes:
shifting the weight bits of the target digital signals respectively to obtain a plurality of first shift data;
accumulating the first shift data to obtain first shift accumulated data;
shifting the input precision bits of the first shift accumulated data to obtain second shift data;
and accumulating the second shift data to obtain the target shift accumulated data.
In one embodiment, the generating a convolution calculation result corresponding to the original digital signal according to the target shift accumulation data includes:
inputting the target shift accumulated data into an activation layer, inputting the output data of the activation layer into a computer interface, and inputting the output data of the computer interface into a convolutional layer;
and acquiring output data of the convolutional layer to obtain the convolution calculation result.
In one embodiment, a plurality of the memory operation units are arranged in a systolic array to form a processing unit.
In one embodiment, a plurality of the processing units are arranged in a pipeline configuration to form a processing unit array.
In a second aspect, the present invention further provides a computer-readable storage medium, on which a plurality of instructions are stored, wherein the instructions are adapted to be loaded and executed by a processor to implement the steps of the storage-integrated architecture systolic array design method for a multi-precision neural network described above.
The invention has the following beneficial effects. According to the embodiments of the invention, an original digital signal is acquired and split into a plurality of 1-bit digital signals whose number equals the bit width of the original digital signal; a plurality of input analog signals are generated from the 1-bit digital signals, each input analog signal corresponding to a different 1-bit digital signal; the input analog signals are respectively subjected to analog-to-digital conversion to obtain a plurality of target digital signals; and a convolution calculation result corresponding to the original digital signal is generated from the target digital signals. Because the original digital signal is split into 1-bit digital signals, the storage-and-computation integrated architecture can be built with low-bit AD/DA modules, effectively solving the high power consumption and large area overhead caused by the high-bit AD/DA modules used in prior-art storage-and-computation integrated architectures.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
It should be noted that, where directional indications (such as up, down, left, right, front, and back) are involved in the embodiments of the present invention, these indications are only used to explain the relative positional relationship and movement of components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indications change accordingly.
At present, the field of artificial intelligence is rapidly developed, the application of artificial intelligence is rapidly increased, and the requirements on complex network operation and data processing speed are more and more strict. Under the current computing framework, how to efficiently utilize the deep learning neural network to process data, and developing a new generation of high energy efficiency neural network accelerator is one of the core problems of research and development in the academic world and the industrial world.
The traditional von Neumann architecture separates data processing from data storage, so deep learning workloads must frequently exchange data with memory, which consumes a large amount of energy. According to research, the energy cost of data movement is 4 to 1000 times that of a floating-point operation. As semiconductor processes advance, overall power consumption decreases, but the share of power consumed by data movement grows.
The storage-and-computation integrated architecture is a key technology for breaking the memory-wall limitation and overcoming the energy-efficiency bottleneck of AI computing. Its core idea is to move part or all of the computation into the memory module, i.e., to integrate the computation unit and the memory unit on the same chip. However, most existing chips based on this architecture have the following problems:
to process digital information, a storage-and-computation integrated architecture using analog operations inevitably requires digital-to-analog and analog-to-digital conversion of its input and output data. The bottleneck of existing designs is that the AD/DA modules occupy too large a share of the whole system's area and energy, typically about 70%-90%. Most existing analog in-memory computing schemes are single-precision, so to satisfy different application requirements they often adopt high-precision AD/DA modules, which limits the energy efficiency of the whole system; some schemes are further constrained by their architecture design and cannot share AD/DA modules, degrading overall performance even more.
In short, the storage and computation integrated architecture in the prior art adopts a high-bit AD/DA module, which results in large power consumption and area overhead of the storage and computation integrated architecture.
In view of the foregoing drawbacks of the prior art, the present invention provides a storage-and-computation integrated architecture systolic array design method suitable for a multi-precision neural network, including: acquiring an original digital signal, and splitting it into a plurality of 1-bit digital signals whose number equals the bit width of the original digital signal; generating a plurality of input analog signals from the 1-bit digital signals, each input analog signal corresponding to a different 1-bit digital signal; respectively performing analog-to-digital conversion on the input analog signals to obtain a plurality of target digital signals; and generating a convolution calculation result corresponding to the original digital signal from the target digital signals. Because the original digital signal is split into 1-bit digital signals, the storage-and-computation integrated architecture can be built with low-bit AD/DA modules, effectively solving the high power consumption and large area overhead caused by the high-bit AD/DA modules used in prior-art storage-and-computation integrated architectures.
As shown in fig. 1, the method comprises the steps of:
step S100, obtaining an original digital signal, and splitting the original digital signal into a plurality of 1-bit digital signals with the number of bits equal to that of the original digital signal.
In a conventional in-memory computing array, an ADC of up to 8 bits is used for analog-to-digital conversion of the output signals, and a high-bit DAC is used for digital-to-analog conversion of the input signals. The high-bit DAC not only increases the power consumption and area of the DAC itself but also increases the pressure on the input data buffer, so its energy consumption grows exponentially with DAC precision. Therefore, in the present invention, the high-bit original digital signal is split into a plurality of 1-bit digital signals, so that only low-bit conversion modules are needed, which solves the energy and area overhead caused by the high-bit DAC in a conventional in-memory computing array. Compared with the DAC module, scaling the precision of the ADC module has a less pronounced effect on overall performance, and research and practice show that the energy and area overhead introduced by shift accumulation is smaller than that of the ADC. In short, since this embodiment only needs to store and compute 1-bit digital signals, no high-bit AD/DA module is required, and the power consumption and area overhead of the architecture can be effectively reduced.
For example, when the original digital signal is a 4-bit digital signal, the original digital signal can be split into 4 1-bit digital signals.
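As a sketch of this splitting and of the trivial 1-bit digital-to-analog mapping it enables (the voltage levels are illustrative assumptions, not from the specification):

```python
V_HIGH, V_LOW = 0.8, 0.0  # assumed drive voltages for the two 1-bit states

def split_signal(value, width):
    """Split a width-bit digital signal into width 1-bit signals, LSB first."""
    return [(value >> i) & 1 for i in range(width)]

def to_analog(bit):
    """A 1-bit signal has only two states, so it maps directly to a drive
    voltage; no multi-level DAC is required."""
    return V_HIGH if bit else V_LOW

bits = split_signal(0b1101, 4)           # a 4-bit signal yields 4 one-bit signals
levels = [to_analog(b) for b in bits]    # [0.8, 0.0, 0.8, 0.8]
```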
As shown in fig. 1, the method further comprises the steps of:
step S200, generating a plurality of input analog signals according to the plurality of 1-bit digital signals, wherein the plurality of input analog signals respectively correspond to different 1-bit digital signals.
Specifically, since a 1-bit digital signal has only a high state and a low state, the corresponding input analog signal can be generated directly, without a digital-to-analog conversion step. Alternatively, a DAC may still be used to convert the time-sequenced digital values into a continuous analog signal.
As shown in fig. 1, the method further comprises the steps of:
and step S300, respectively carrying out analog-to-digital conversion on the plurality of input analog signals to obtain a plurality of target digital signals.
Specifically, in order to implement both the storage function and the calculation function of the storage-and-computation integrated architecture, this embodiment performs analog-to-digital conversion on the acquired input analog signals respectively to obtain the target digital signal corresponding to each input analog signal. It can be understood that, since all input analog signals in this embodiment are generated from 1-bit digital signals, only a low-bit ADC is required for the analog-to-digital conversion, which greatly reduces the power consumption and area overhead of the storage-and-computation integrated architecture.
In one implementation, the step S300 specifically includes the following steps:
step S301, inputting the input analog signals into a plurality of memory operation units, respectively, to obtain a plurality of target digital signals.
Specifically, in order to perform analog-to-digital conversion on each input analog signal, a plurality of memory operation units are provided in advance in the present embodiment, and target digital signals corresponding to each input analog signal can be obtained by inputting each input analog signal into one memory operation unit.
For example, as shown in fig. 6, in a 4-bit precision-recombination design, every four columns form one basic compute unit, so a 256 × 256 array contains 64 basic compute units. Each array calculation is then equivalent to a multiply-add operation of 64 sets of 1-bit inputs with 4-bit weights.
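A minimal model of one such basic compute unit follows; it assumes, for illustration, that the four adjacent columns hold the four bit-planes of each row's weight and that the column sums are combined by shifting:

```python
def unit_mac(input_bits, weights, w_bits=4):
    """One basic compute unit: column j accumulates bit-plane j of every
    active row's weight (the per-column analog sum), and the columns are
    then combined with binary shifts."""
    col_sums = [
        sum(x * ((w >> j) & 1) for x, w in zip(input_bits, weights))
        for j in range(w_bits)
    ]
    return sum(s << j for j, s in enumerate(col_sums))

# Three rows driven by 1-bit inputs, each row storing a 4-bit weight:
print(unit_mac([1, 0, 1], [9, 5, 3]))   # 1*9 + 0*5 + 1*3 = 12
```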
In one implementation, each of the memory operation units includes a memory operation device array arranged as a systolic array, and the step S301 specifically includes the following steps:
step S3011, for each input analog signal in the plurality of input analog signals, determining a plurality of activated rows in a target memory operation device array corresponding to the input analog signal;
step S3012, inputting the input analog signal into a plurality of first memory operation devices corresponding to the plurality of activated rows one to one, where the plurality of first memory operation devices are located at start bits of the plurality of activated rows, respectively;
step S3013, determining the target digital signal corresponding to the input analog signal based on the single-column multiply-add operation results output by the second memory operation devices, where the second memory operation devices are located at the stop bit of each column in the target memory operation device array.
Specifically, the memory operation unit in this embodiment is a memory operation device array composed of a plurality of memory operation devices arranged as a systolic array. A systolic array is a pipelined, high-throughput computing structure arranged according to a fixed interconnection rule, in which data advance synchronously along their respective directions between the devices. Taking one input analog signal as an example: the input analog signal is fed into a memory operation unit, which simultaneously receives a selection signal from the distributor. The selection signal reflects the precision of the ADC, so it determines which word lines in the memory operation device array need to be activated; the activated word lines are the activated rows. The input analog signal is input to the memory operation device at the start bit of each activated row, and the data flow begins. Because each memory operation device is pre-programmed with a conductance determined (written in binary) by its preset bit weight, each device completes a multiplication of the input data according to Ohm's law, and the results within the same column are accumulated according to Kirchhoff's current law. The single-column multiply-add result of each column is thus obtained from the second memory operation device at that column's stop bit, and the target digital signal corresponding to the input analog signal is determined from the single-column results of all columns. This embodiment can support mixed-precision calculation with 2-bit, 4-bit, and 8-bit integers.
For example, as shown in fig. 4, in a single calculation of the memory operation device array, the selection signal activates n rows, so the output of each column has at most n + 1 different states, which is equivalent to log2(n + 1) bits of data.
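This relationship between the number of activated rows and the required ADC resolution can be checked with a short helper; it sketches only the counting argument, not any circuit:

```python
import math

def adc_bits_needed(n_active_rows):
    """With n activated rows and 1-bit inputs/cells, a column current takes
    one of n + 1 discrete levels, so ceil(log2(n + 1)) ADC bits suffice."""
    return math.ceil(math.log2(n_active_rows + 1))

print(adc_bits_needed(7))    # 8 levels  -> 3-bit ADC
print(adc_bits_needed(15))   # 16 levels -> 4-bit ADC
```

The selection signal thus trades parallelism (more active rows) against ADC precision, which is why it is said to reflect the precision of the ADC.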
In one implementation, the memory operation device may be a memristor, a novel non-volatile device.
In one implementation, the determining the target digital signal corresponding to the input analog signal based on the result of the multiply-add operation for a single column respectively output by the second memory operation devices includes: and respectively inputting the single-column multiplication and addition operation results output by the second memory operation devices into a shared comparator to obtain the target digital signal.
Specifically, in order to implement analog-to-digital conversion, this embodiment sequentially inputs the single-column multiply-add result output by each second memory operation device into a shared comparator (as shown in fig. 3), i.e., a comparator shared within the current memory operation unit. It should be noted that, besides the single-column multiply-add result, the input of the shared comparator also requires a multi-level reference voltage and an asynchronous clock signal. The multi-level reference voltage is preset and serves as the reference against which the input single-column multiply-add result is compared (as shown in fig. 5), so as to output a binary multi-bit digital signal, i.e., the target digital signal. The asynchronous clock signal is determined by the size of the memory operation unit and the system clock, and is used to reduce system latency.
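The comparison against the preset reference ladder can be modelled as follows; this is a flash-ADC-style sketch, and the reference values are illustrative assumptions:

```python
def shared_comparator(column_result, references):
    """Sketch of the shared comparator: the single-column multiply-add
    result is compared against each preset reference level, and the output
    code is the number of levels it exceeds."""
    return sum(1 for ref in references if column_result > ref)

refs = [0.5, 1.5, 2.5]                 # assumed multi-level reference voltages
print(shared_comparator(2.2, refs))    # exceeds two levels -> code 2 (0b10)
```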
As shown in fig. 1, the method further comprises the steps of:
and step S400, generating a convolution calculation result corresponding to the original digital signal according to a plurality of target digital signals.
Specifically, when the integrated storage and computation architecture in this embodiment is applied to a multi-precision neural network, in this embodiment, after obtaining a plurality of target digital signals, convolution operation needs to be performed according to the target digital signals to obtain a convolution computation result corresponding to an original digital signal.
In one implementation, the step S400 specifically includes the following steps:
step S401, carrying out shift accumulation operation on a plurality of target digital signals to obtain target shift accumulation data;
and S402, generating a convolution calculation result corresponding to the original digital signal according to the target shift accumulated data.
Due to the quantization error problem of the low-bit memory operation device and the low-bit AD/DA module, it is necessary to improve the accuracy and reduce the error in the calculation. Specifically, in the present embodiment, the shift accumulation operation is performed on a plurality of target digital signals to realize the precision reconstruction, and the convolution calculation is performed based on the target shift accumulation data obtained after the shift accumulation operation to obtain the final convolution calculation result.
In an implementation manner, the step S401 specifically includes the following steps:
step S4011, shifting the weight bits of the plurality of target digital signals respectively to obtain a plurality of first shift data;
step S4012, accumulating the plurality of first shift data to obtain first shift accumulated data;
step S4013, shifting the input precision bits of the first shift accumulation data to obtain second shift data;
and S4014, accumulating the second shift data to obtain the target shift accumulated data.
Briefly, as shown in fig. 6, this embodiment performs two shift accumulation operations on the target digital signals to complete the precision reconstruction. First, each target digital signal is shifted by its weight bit, and the resulting first shift data are accumulated by a first accumulator to obtain the first shift accumulated data. The first shift accumulated data are then shifted by the input precision bits to obtain the second shift data, which are accumulated by a second accumulator to obtain the final target shift accumulated data. Since precision reconstruction is a weighted accumulation process, each asynchronous comparator clock produces only one multi-bit number, i.e., only one datum undergoes the shift accumulation operation at any given time. By exploiting this systolic data flow, the scheme reduces the number of shift accumulators: only 2 shift accumulators are needed to complete the precision recombination of the weight precision and the input precision, respectively.
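The two-stage shift-accumulate can be sketched as below, where codes[i][j] stands for the ADC output obtained with input bit-plane i and weight bit-plane j (this indexing convention is an assumption made for illustration):

```python
def precision_reconstruct(codes):
    """Stage 1: shift each row of ADC codes over the weight bits and
    accumulate. Stage 2: shift the stage-1 sums over the input bits and
    accumulate to recover the full-precision result."""
    stage1 = [sum(c << j for j, c in enumerate(row)) for row in codes]
    return sum(s << i for i, s in enumerate(stage1))

# 2-bit input 3 (bits [1, 1]) times 2-bit weight 2 (bits [0, 1]):
codes = [[0, 1],   # input bit 0: products with weight bits 0 and 1
         [0, 1]]   # input bit 1: products with weight bits 0 and 1
print(precision_reconstruct(codes))   # 6 == 3 * 2
```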
In an implementation manner, the step S4011 specifically includes: and aiming at each target digital signal, inputting the target digital signal into a first-stage shifter, and taking output data of the first-stage shifter as first shift data corresponding to the target digital signal.
In an implementation manner, the step S4013 specifically is: and inputting the output data of the first-stage shifter into a second-stage shifter, and taking the output data of the second-stage shifter as second shift data.
In an implementation manner, the step S402 specifically includes the following steps:
step S4021, accumulating a plurality of target shift accumulated data to obtain a target multiplication result;
step S4022, inputting the target multiplication result into an activation layer, inputting the output data of the activation layer into a computer interface, and inputting the output data of the computer interface into a convolutional layer;
step S4023, obtaining the output data of the convolution layer to obtain the convolution calculation result.
Specifically, in order to implement integration of storage and calculation in the multi-precision neural network, that is, to cover input, output, and calculation processes in the multi-precision neural network, after the target shift accumulated data is obtained, the target shift accumulated data needs to be input into an activation layer, a computer interface (I/O), and a convolution layer, which are connected in sequence, and a result output by the convolution layer is a convolution calculation result corresponding to the original digital signal.
In one implementation, a number of the memory arithmetic units are arranged in a systolic array to form a processing unit.
In summary, this embodiment also provides a processing unit, which includes a plurality of memory operation units arranged in a systolic array. Specifically, as shown in fig. 7, a systolic design is adopted between the memory operation units, i.e., data flow between the memory operation units in a systolic manner: the weight data are pre-stored in the memory operation units and remain fixed, while the partial-sum data (i.e., the shifted-and-accumulated calculation results of the memory operation units) are passed between the units.
In one implementation, in order to reduce the number of shifters and accumulators and optimize layout, the whole processing unit is divided into a plurality of blocks, and each block is composed of memory operation unit groups representing the same weight bits, so that the accuracy of the ADC can be properly improved, and the overall computation throughput and performance can be improved.
In one implementation, a number of the processing units are arranged in a pipeline structure to form an array of processing units.
Specifically, as shown in fig. 8, this embodiment adopts a pipeline design for the processing unit array: after data are output from one processing unit, they directly enter the next processing unit for calculation, without first being written back to global memory and then read out again. This can reduce power consumption by tens of times, effectively alleviate data congestion, and improve operation speed.
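The difference in dataflow can be illustrated with a toy pipeline in which each unit's output is forwarded directly to the next, with no intermediate write-back; the stage functions are placeholders, and this models only the forwarding, not cycle-level overlap:

```python
def pipeline(units, data):
    """Forward data through the processing units directly, without writing
    intermediate results back to a global memory between stages."""
    for unit in units:
        data = unit(data)
    return data

# Placeholder stages standing in for two processing units' computations:
stages = [lambda x: x * 2, lambda x: x + 1]
print(pipeline(stages, 5))   # (5 * 2) + 1 = 11
```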
By way of example, fig. 2 includes a processing unit array with a pipelined design, the systolic design of the processing units and memory operation units, the memory operation device array within a memory operation unit, and the single-device computation and control logic. The memory operation device array shares a low-precision 1-bit ADC and completes the operation output over multiple cycles.
Based on the above embodiments, the present invention further provides a terminal, and a schematic block diagram thereof may be as shown in fig. 9. The terminal comprises a processor, a memory, a network interface and a display screen which are connected through a system bus. Wherein the processor of the terminal is configured to provide computing and control capabilities. The memory of the terminal comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the terminal is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a storage-and-computation integrated architecture systolic array design method suitable for multi-precision neural networks. The display screen of the terminal can be a liquid crystal display screen or an electronic ink display screen.
It will be understood by those skilled in the art that the block diagram of fig. 9 is a block diagram of only a portion of the structure associated with the inventive arrangements and is not intended to limit the terminals to which the inventive arrangements may be applied, and that a particular terminal may include more or less components than those shown, or may have some components combined, or may have a different arrangement of components.
In one implementation, one or more programs are stored in the memory of the terminal and configured to be executed by one or more processors, the one or more programs including instructions for performing the storage-and-computation integrated architecture systolic array design method suitable for a multi-precision neural network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
In summary, the present invention discloses a storage-and-computation integrated architecture systolic array design method suitable for a multi-precision neural network, the method including: acquiring an original digital signal, and splitting it into a plurality of 1-bit digital signals whose number equals the bit width of the original digital signal; generating a plurality of input analog signals from the 1-bit digital signals, each input analog signal corresponding to a different 1-bit digital signal; respectively performing analog-to-digital conversion on the input analog signals to obtain a plurality of target digital signals; and generating a convolution calculation result corresponding to the original digital signal from the target digital signals. Because the original digital signal is split into 1-bit digital signals, the storage-and-computation integrated architecture can be built with low-bit AD/DA modules, effectively solving the high power consumption and large area overhead caused by the high-bit AD/DA modules used in prior-art storage-and-computation integrated architectures.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.