
CN113743600A - Design method of systolic array with integrated storage and computing architecture suitable for multi-precision neural network - Google Patents

Design method of systolic array with integrated storage and computing architecture suitable for multi-precision neural network

Info

Publication number
CN113743600A (application CN202110988635.7A)
Authority
CN
China
Prior art keywords
target, digital signals, data, input analog signals
Prior art date
Legal status
Granted
Application number
CN202110988635.7A
Other languages
Chinese (zh)
Other versions
CN113743600B (en)
Inventor
刘定邦
周浩翔
韩宇亮
周俊卓
黄耿斌
满昌海
申奥
毛伟
余浩
Current Assignee
Shenzhen Maitexin Technology Co ltd
Original Assignee
Southern University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Southern University of Science and Technology
Priority to CN202110988635.7A
Publication of CN113743600A
Application granted
Publication of CN113743600B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Analogue/Digital Conversion (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The invention discloses a design method for a systolic array with an integrated storage and computation architecture suitable for multi-precision neural networks, comprising: acquiring an original digital signal and splitting it into a number of 1-bit digital signals equal to the bit width of the original digital signal; generating a number of input analog signals from the 1-bit digital signals, where the input analog signals each correspond to a different 1-bit digital signal; performing analog-to-digital conversion on the input analog signals respectively to obtain a number of target digital signals; and generating, from the target digital signals, the convolution calculation result corresponding to the original digital signal. By splitting the original digital signal into multiple 1-bit digital signals, the invention makes it possible to construct the integrated storage and computation architecture with low-bit AD/DA modules, effectively solving the prior-art problem that high-bit AD/DA modules cause large power consumption and area overhead in such architectures.

Figure 202110988635

Description

Storage and computation integrated architecture systolic array design method suitable for multi-precision neural network
Technical Field
The invention relates to the field of mixed-signal circuits, and in particular to a storage and computation integrated architecture systolic array design method suitable for a multi-precision neural network.
Background
The existing storage and computation integrated architecture uses analog operations to process digital information, which inevitably requires corresponding digital-to-analog/analog-to-digital conversion of the input and output data. However, most existing analog in-memory computing schemes adopt a single precision, and a high-precision AD/DA module is often required to meet different application requirements, so the power consumption and area overhead of the storage and computation integrated architecture are large.
Thus, there is still a need for improvement and development of the prior art.
Disclosure of Invention
The present invention provides a storage and computation integrated architecture systolic array design method suitable for a multi-precision neural network, aiming at solving the problem that the storage and computation integrated architecture in the prior art adopts a high-bit AD/DA module, which results in large power consumption and area overhead of the storage and computation integrated architecture.
The technical scheme adopted by the invention for solving the problems is as follows:
in a first aspect, an embodiment of the present invention provides a storage-integration architecture systolic array design method suitable for a multi-precision neural network, where the method includes:
acquiring an original digital signal, and splitting the original digital signal into a plurality of 1-bit digital signals with the number of bits equal to that of the original digital signal;
generating a plurality of input analog signals according to the plurality of 1-bit digital signals, wherein the plurality of input analog signals respectively correspond to different 1-bit digital signals;
respectively carrying out analog-to-digital conversion on the input analog signals to obtain a plurality of target digital signals;
and generating a convolution calculation result corresponding to the original digital signal according to the plurality of target digital signals.
In one embodiment, the performing analog-to-digital conversion on a plurality of input analog signals to obtain a plurality of target digital signals respectively includes:
and respectively inputting the input analog signals into a plurality of memory operation units to obtain a plurality of target digital signals.
In one embodiment, each of the memory operation units includes an array of memory operation devices arranged in a systolic array, and the inputting a plurality of the input analog signals into a plurality of the memory operation units respectively to obtain a plurality of the target digital signals includes:
determining a plurality of activated rows in a target memory operation device array corresponding to a plurality of input analog signals for each of the plurality of input analog signals;
inputting the input analog signal into a plurality of first memory operation devices which are in one-to-one correspondence with the plurality of activated rows, wherein the plurality of first memory operation devices are respectively positioned at the initial bits of the plurality of activated rows;
and determining a target digital signal corresponding to the input analog signal based on single-column multiply-add operation results output by a plurality of second memory operation devices respectively, wherein the plurality of second memory operation devices are respectively positioned at the stop bit of each column in the target memory operation device array.
In one embodiment, the determining a target digital signal corresponding to the input analog signal based on the result of the multiply-add operation for a single column respectively output by the second memory operation devices includes:
and respectively inputting the single-column multiplication and addition operation results output by the second memory operation devices into a shared comparator to obtain the target digital signal.
In one embodiment, the generating a convolution calculation result corresponding to the original digital signal according to a number of the target digital signals includes:
carrying out shift accumulation operation on a plurality of target digital signals to obtain target shift accumulation data;
and generating a convolution calculation result corresponding to the original digital signal according to the target shift accumulated data.
In one embodiment, the performing a shift accumulation operation on a plurality of target digital signals to obtain target shift accumulation data includes:
shifting the weight bits of the target digital signals respectively to obtain a plurality of first shift data;
accumulating the first shift data to obtain first shift accumulated data;
shifting the input precision bits of the first shift accumulated data to obtain second shift data;
and accumulating the second shift data to obtain the target shift accumulated data.
In one embodiment, the generating a convolution calculation result corresponding to the original digital signal according to the target shift accumulation data includes:
inputting the target shift accumulated data into an activation layer, inputting the output data of the activation layer into a computer interface, and inputting the output data of the computer interface into a convolutional layer;
and acquiring output data of the convolutional layer to obtain the convolution calculation result.
In one embodiment, a number of the memory arithmetic units are arranged in a systolic array to form a processing unit.
In one embodiment, a number of the processing units are arranged in a pipeline configuration to form an array of processing units.
In a second aspect, the present invention further provides a computer-readable storage medium, on which a plurality of instructions are stored, wherein the instructions are adapted to be loaded and executed by a processor to implement the steps of the storage-integrated architecture systolic array design method for a multi-precision neural network described above.
The invention has the beneficial effects that: according to the embodiment of the invention, an original digital signal is obtained, and the original digital signal is split into a plurality of 1-bit digital signals with the number equal to the number of bits of the original digital signal; generating a plurality of input analog signals according to the plurality of 1-bit digital signals, wherein the plurality of input analog signals respectively correspond to different 1-bit digital signals; respectively carrying out analog-to-digital conversion on the input analog signals to obtain a plurality of target digital signals; and generating a convolution calculation result corresponding to the original digital signal according to the plurality of target digital signals. According to the invention, the original digital signal is split into a plurality of 1-bit digital signals, and a storage and calculation integrated framework can be constructed by adopting a low-bit AD/DA module, so that the problems of high power consumption and large area overhead of the storage and calculation integrated framework caused by the adoption of a high-bit AD/DA module in the storage and calculation integrated framework in the prior art are effectively solved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a storage-integration architecture systolic array design method suitable for a multi-precision neural network according to an embodiment of the present invention.
Fig. 2 is an overall architecture diagram provided by an embodiment of the present invention.
Fig. 3 is a design diagram of an array of memory computing devices according to an embodiment of the present invention.
Fig. 4 is a reference diagram of precision reconstruction provided by an embodiment of the present invention.
Fig. 5 is a design diagram of a comparator according to an embodiment of the present invention.
Fig. 6 is a design diagram of a 1-bit-based 4-bit precision reconstruction PE-Slice according to an embodiment of the present invention.
FIG. 7 is a diagram of a systolic array design provided by an embodiment of the present invention.
Fig. 8 is a pipeline layout diagram provided by an embodiment of the present invention.
Fig. 9 is a functional block diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
It should be noted that, if directional indications (such as up, down, left, right, front, and back) are involved in the embodiments of the present invention, they are only used to explain the relative positional relationship and movement of components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indication changes accordingly.
At present, the field of artificial intelligence is rapidly developed, the application of artificial intelligence is rapidly increased, and the requirements on complex network operation and data processing speed are more and more strict. Under the current computing framework, how to efficiently utilize the deep learning neural network to process data, and developing a new generation of high energy efficiency neural network accelerator is one of the core problems of research and development in the academic world and the industrial world.
The traditional von Neumann architecture separates data processing from storage, so deep learning workloads must frequently exchange data with memory, which consumes a large amount of energy. According to research, the energy cost of data movement is 4 to 1000 times that of a floating-point calculation. As semiconductor processes advance, overall power consumption decreases, but the share consumed by data movement grows ever larger.
The storage and computation integrated architecture is a key technology for breaking the limitation of a storage wall and breaking through the bottleneck of AI computing energy efficiency. The core idea of the storage and computation integrated architecture is to transfer part or all of computation to a memory module, i.e. a computation unit and a memory unit are integrated on the same chip. However, most of the existing chips based on the storage and computation integrated architecture have the following problems:
To process digital information, a storage and computation integrated architecture that uses analog operations inevitably requires corresponding digital-to-analog/analog-to-digital conversion of input and output data. The bottleneck of existing integrated storage-computation designs is that the AD/DA modules occupy too large a share of the whole system's area and energy consumption, generally about 70%-90%. Because most existing analog in-memory computing schemes adopt a single precision, high-precision AD/DA modules are often used to meet different application requirements, limiting the energy efficiency of the whole system; some schemes are additionally constrained by their architecture design and cannot share AD/DA modules, which further degrades overall performance.
In short, the storage and computation integrated architecture in the prior art adopts a high-bit AD/DA module, which results in large power consumption and area overhead of the storage and computation integrated architecture.
In view of the foregoing drawbacks of the prior art, the present invention provides a storage-integration architecture systolic array design method suitable for a multi-precision neural network, including: acquiring an original digital signal, and splitting the original digital signal into a plurality of 1-bit digital signals with the number of bits equal to that of the original digital signal; generating a plurality of input analog signals according to the plurality of 1-bit digital signals, wherein the plurality of input analog signals respectively correspond to different 1-bit digital signals; respectively carrying out analog-to-digital conversion on the input analog signals to obtain a plurality of target digital signals; and generating a convolution calculation result corresponding to the original digital signal according to the plurality of target digital signals. According to the invention, the original digital signal is split into a plurality of 1-bit digital signals, and a storage and calculation integrated framework can be constructed by adopting a low-bit AD/DA module, so that the problems of high power consumption and large area overhead of the storage and calculation integrated framework caused by the adoption of a high-bit AD/DA module in the storage and calculation integrated framework in the prior art are effectively solved.
As shown in fig. 1, the method comprises the steps of:
step S100, obtaining an original digital signal, and splitting the original digital signal into a plurality of 1-bit digital signals with the number of bits equal to that of the original digital signal.
In the conventional memory computing array, an ADC of at most 8 bits is typically adopted for analog-to-digital conversion of the output signals, and a high-bit DAC is adopted for digital-to-analog conversion of the input signals. The high-bit DAC not only increases DAC power consumption and area but also increases the pressure on the data input buffer, so energy consumption grows exponentially with DAC precision. In the present invention, the high-bit original digital signal is therefore split into a plurality of 1-bit digital signals, so that low-bit digital-to-analog conversion suffices, solving the problems of increased energy consumption and area overhead caused by the high-bit DAC in the conventional memory computing array. Compared with the DAC module, expanding the precision of the ADC module affects overall performance less severely, and research and practice show that the energy and area overhead introduced by shift accumulation is smaller than that of the ADC. In short, since this embodiment only needs to store and compute 1-bit digital signals, high-bit AD/DA modules are unnecessary, and the power consumption and area overhead of the architecture can be effectively reduced.
For example, when the original digital signal is a 4-bit digital signal, the original digital signal can be split into 4 1-bit digital signals.
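As a hypothetical sketch (not part of the patent text; the function name `split_bits` is our own), the splitting step can be illustrated in Python:

```python
def split_bits(value, num_bits):
    """Split an unsigned num_bits-wide value into 1-bit signals, LSB first."""
    return [(value >> i) & 1 for i in range(num_bits)]

# A 4-bit original digital signal, e.g. 0b1011 (decimal 11),
# yields four 1-bit digital signals, one per bit position.
bits = split_bits(0b1011, 4)
print(bits)  # [1, 1, 0, 1] (LSB first)

# Weighted re-accumulation recovers the original signal, which is what
# the later shift-accumulate stage relies on.
assert sum(b << i for i, b in enumerate(bits)) == 11
```

Each 1-bit signal can then drive the array directly, since it has only two states.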
As shown in fig. 1, the method further comprises the steps of:
step S200, generating a plurality of input analog signals according to the plurality of 1-bit digital signals, wherein the plurality of input analog signals respectively correspond to different 1-bit digital signals.
Specifically, since a 1-bit digital signal has only a high state and a low state, the corresponding input analog signal can be generated directly, without a digital-to-analog conversion step. Alternatively, a DAC may be used to convert a time sequence of digital values into a continuous analog signal.
As shown in fig. 1, the method further comprises the steps of:
and step S300, respectively carrying out analog-to-digital conversion on the plurality of input analog signals to obtain a plurality of target digital signals.
Specifically, in order to implement the storage and calculation functions of the storage and computation integrated architecture, this embodiment performs analog-to-digital conversion on the acquired input analog signals respectively to obtain a target digital signal corresponding to each input analog signal. It can be understood that, since all input analog signals in this embodiment are generated from 1-bit digital signals, only a low-bit ADC is required for the analog-to-digital conversion, which greatly reduces the power consumption and area overhead of the storage and computation integrated architecture.
In one implementation, the step S300 specifically includes the following steps:
step S301, inputting the input analog signals into a plurality of memory operation units, respectively, to obtain a plurality of target digital signals.
Specifically, in order to perform analog-to-digital conversion on each input analog signal, a plurality of memory operation units are provided in advance in the present embodiment, and target digital signals corresponding to each input analog signal can be obtained by inputting each input analog signal into one memory operation unit.
For example, as shown in fig. 6, in a 4-bit precision reconstruction design, every four columns form one basic compute unit, and a 256 × 256 array contains 64 basic compute units. Each array calculation is then equivalent to 64 sets of multiply-add operations of 1-bit inputs by 4-bit weights.
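The grouping arithmetic in this example can be checked with a small sketch (our own illustration, using only the figures quoted above):

```python
array_cols = 256      # columns in the 256 x 256 memory operation device array
cols_per_unit = 4     # one column per weight bit in the 4-bit design
basic_units = array_cols // cols_per_unit
print(basic_units)    # 64 groups of 1-bit-input x 4-bit-weight multiply-adds
assert basic_units == 64
```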
In one implementation, each of the memory operation units includes a memory operation device array arranged as a systolic array, and the step S301 specifically includes the following steps:
step S3011, for each input analog signal in the plurality of input analog signals, determining a plurality of activated rows in a target memory operation device array corresponding to the input analog signal;
step S3012, inputting the input analog signal into a plurality of first memory operation devices corresponding to the plurality of activated rows one to one, where the plurality of first memory operation devices are located at start bits of the plurality of activated rows, respectively;
step S3013, determining the target digital signal corresponding to the input analog signal based on the single-column multiply-add operation results output by the second memory operation devices, where the second memory operation devices are located at the stop bit of each column in the target memory operation device array.
Specifically, the memory operation unit in this embodiment is a memory operation device array composed of a plurality of memory operation devices arranged as a systolic array. A systolic array is a pipelined, high-throughput computing structure arranged according to a fixed interconnection rule, in which data advances synchronously between devices along its respective direction during operation. Taking one input analog signal as an example: the signal is input into a memory operation unit, which simultaneously receives a selection signal from the distributor. The selection signal reflects the precision of the ADC, so it determines which word lines in the memory operation device array need to be activated; the activated word lines are the activated rows. The input analog signal is applied to the first memory operation device at the start bit of each activated row, and data flow begins. Each memory operation device is preset with a resistance value determined (written in binary) by its bit-weight, so each device completes a multiplication of the input data based on Ohm's law, and the results within the same column are accumulated based on Kirchhoff's current law. The single-column multiply-add result of each column is therefore available at the second memory operation device on that column's stop bit, and the target digital signal corresponding to the input analog signal is determined from the single-column multiply-add results of all columns. This embodiment can support mixed-precision calculation with 2-bit, 4-bit, and 8-bit integers.
For example, as shown in FIG. 4, in a single calculation of the memory computing device array, the selection signal activates n rows, so the output of each column has at most n + 1 different states, which is equivalent to log2(n + 1) bits of data.
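The information content per column follows directly from the state count (our own arithmetic check, taking n = 15 activated rows as an example):

```python
import math

n = 15                  # activated rows in one calculation
states = n + 1          # distinct levels a column output can take (0..n)
bits = math.log2(states)
print(bits)             # 4.0 -> each column output carries 4 bits of data
```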
In one implementation, the memory operation device may be a memristor, a novel non-volatile device.
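The per-column multiply-add described above can be modelled behaviourally (a sketch of Ohm's-law multiplication and Kirchhoff's-law summation, not circuit-accurate; the function name is ours):

```python
def crossbar_column_mac(inputs, weights):
    """Behavioural model of one memory operation device array.

    inputs  -- 1-bit values driven onto the activated rows (voltages)
    weights -- rows x cols matrix of pre-written 1-bit conductance states
    Returns the per-column accumulated 'currents', i.e. each column's
    single-column multiply-add result.
    """
    num_cols = len(weights[0])
    return [sum(x * w_row[c] for x, w_row in zip(inputs, weights))
            for c in range(num_cols)]

# Three activated rows, two columns of 1-bit weights.
inputs = [1, 0, 1]
weights = [[1, 1],
           [1, 0],
           [0, 1]]
print(crossbar_column_mac(inputs, weights))  # [1, 2]
```

With n activated rows, each column result falls in 0..n, matching the n + 1 output states discussed above.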
In one implementation, the determining the target digital signal corresponding to the input analog signal based on the result of the multiply-add operation for a single column respectively output by the second memory operation devices includes: and respectively inputting the single-column multiplication and addition operation results output by the second memory operation devices into a shared comparator to obtain the target digital signal.
Specifically, to implement the analog-to-digital conversion, this embodiment sequentially inputs the single-column multiply-add result output by each second memory operation device into a shared comparator (as shown in fig. 3), that is, a comparator shared within the current memory operation unit. It should be noted that, in addition to the single-column multiply-add result, the input of the shared comparator needs a multi-level reference voltage and an asynchronous clock signal. The multi-level reference voltage is preset and serves as reference data to be compared with the incoming single-column multiply-add result (as shown in fig. 5), so as to output a binary multi-bit digital signal, i.e., the target digital signal. The asynchronous clock signal is determined by the size of the memory operation unit and the system clock, and is used to reduce system latency.
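A behavioural sketch of the shared comparator (our own simplification: the multi-level reference voltage is modelled as an ascending list of thresholds, and the asynchronous clocking is ignored):

```python
def shared_comparator(column_result, reference_levels):
    """Count how many reference levels the analog column result reaches,
    yielding a thermometer-style code that maps to a multi-bit output."""
    return sum(1 for ref in reference_levels if column_result >= ref)

# Four reference levels quantize a column output into one of five codes.
refs = [0.5, 1.5, 2.5, 3.5]
print(shared_comparator(2.0, refs))  # 2
```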
As shown in fig. 1, the method further comprises the steps of:
and step S400, generating a convolution calculation result corresponding to the original digital signal according to a plurality of target digital signals.
Specifically, when the integrated storage and computation architecture in this embodiment is applied to a multi-precision neural network, in this embodiment, after obtaining a plurality of target digital signals, convolution operation needs to be performed according to the target digital signals to obtain a convolution computation result corresponding to an original digital signal.
In one implementation, the step S400 specifically includes the following steps:
step S401, carrying out shift accumulation operation on a plurality of target digital signals to obtain target shift accumulation data;
and S402, generating a convolution calculation result corresponding to the original digital signal according to the target shift accumulated data.
Due to the quantization error problem of the low-bit memory operation device and the low-bit AD/DA module, it is necessary to improve the accuracy and reduce the error in the calculation. Specifically, in the present embodiment, the shift accumulation operation is performed on a plurality of target digital signals to realize the precision reconstruction, and the convolution calculation is performed based on the target shift accumulation data obtained after the shift accumulation operation to obtain the final convolution calculation result.
In an implementation manner, the step S401 specifically includes the following steps:
step S4011, shifting the weight bits of the plurality of target digital signals respectively to obtain a plurality of first shift data;
step S4012, accumulating the plurality of first shift data to obtain first shift accumulated data;
step S4013, shifting the input precision bits of the first shift accumulation data to obtain second shift data;
and S4014, accumulating the second shift data to obtain the target shift accumulated data.
Briefly, as shown in fig. 6, this embodiment performs two shift-accumulate operations on the target digital signals to complete the precision reconstruction. First, each target digital signal is shifted according to its weight-bit position, and the resulting first shift data are accumulated by a first accumulator to obtain the first shift accumulated data. The first shift accumulated data are then shifted according to the input-precision bit position to obtain the second shift data, which a second accumulator accumulates into the final target shift accumulated data. Since precision reconstruction is a weighted accumulation process, each asynchronous comparator clock produces only one multi-bit number, i.e., only one datum undergoes the shift-accumulate operation at any given moment. By exploiting this systolic data-flow pattern, the scheme reduces the number of shift accumulators: only 2 shift accumulators are used, completing the precision reconstruction of the weight precision and of the input precision respectively.
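The two-stage reconstruction can be sketched end to end (our illustration; `partials[i][j]` stands for the 1-bit-input by 1-bit-weight column result for input bit i and weight bit j, both LSB first):

```python
def precision_reconstruct(partials):
    """Two shift-accumulate stages rebuild the full-precision product."""
    # Stage 1: shift each partial by its weight-bit position and accumulate.
    stage1 = [sum(p << j for j, p in enumerate(row)) for row in partials]
    # Stage 2: shift each stage-1 sum by its input-bit position and accumulate.
    return sum(s << i for i, s in enumerate(stage1))

# Check against ordinary multiplication: 11 (4-bit input) x 13 (4-bit weight).
x, w = 11, 13
partials = [[((x >> i) & 1) * ((w >> j) & 1) for j in range(4)]
            for i in range(4)]
print(precision_reconstruct(partials))  # 143
assert precision_reconstruct(partials) == x * w
```

Stage 1 corresponds to the first shift accumulator (weight precision) and stage 2 to the second (input precision).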
In an implementation manner, the step S4011 specifically includes: and aiming at each target digital signal, inputting the target digital signal into a first-stage shifter, and taking output data of the first-stage shifter as first shift data corresponding to the target digital signal.
In an implementation manner, the step S4013 specifically is: and inputting the output data of the first-stage shifter into a second-stage shifter, and taking the output data of the second-stage shifter as second shift data.
In an implementation manner, the step S402 specifically includes the following steps:
step S4021, accumulating a plurality of target shift accumulated data to obtain a target multiplication result;
step S4022, inputting the target multiplication result into an active layer, inputting the output data of the active layer into a computer interface, and inputting the output data of the computer interface into a convolution layer;
step S4023, obtaining the output data of the convolution layer to obtain the convolution calculation result.
Specifically, in order to implement the integration of storage and calculation in the multi-precision neural network, that is, to cover the input, output, and calculation processes of the multi-precision neural network, after the target shift accumulated data is obtained, it is input into an activation layer, a computer interface (I/O), and a convolution layer connected in sequence; the result output by the convolution layer is the convolution calculation result corresponding to the original digital signal.
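The post-accumulation data path of steps S4021 to S4023 can be sketched as follows (a hypothetical illustration; the use of ReLU as the activation and the scalar `conv_layer` callable are assumptions for the sketch, not specified by the embodiment):

```python
def convolution_result(shift_acc_list, conv_layer,
                       activation=lambda x: max(x, 0)):
    """Sketch of steps S4021-S4023 for one output value."""
    total = sum(shift_acc_list)   # S4021: accumulate the target shift accumulated data
    act_out = activation(total)   # S4022: activation layer (ReLU assumed)
    io_out = act_out              # computer interface (I/O) forwards the value
    return conv_layer(io_out)     # S4023: output of the convolution layer
```

For instance, `convolution_result([4, -1, 3], conv_layer=lambda x: 2 * x)` accumulates to 6, passes ReLU unchanged, and returns 12.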
In one implementation, a plurality of the memory operation units are arranged in a systolic array to form a processing unit.
Building on the above, the present embodiment also provides a processing unit, which comprises a plurality of memory operation units arranged in a systolic array. Specifically, as shown in fig. 7, a systolic design is adopted between the memory operation units, that is, data flows between the memory operation units in a systolic manner: the weight data is pre-stored in the memory operation units and kept stationary, while the partial-sum data (i.e., the shift-accumulated calculation results of the memory operation units) is passed from one memory operation unit to the next.
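The weight-stationary, partial-sum-passing behaviour described above can be modelled functionally as follows (a behavioural sketch with hypothetical names; it reproduces what the array computes, not its cycle-by-cycle timing):

```python
def systolic_column(inputs, column_weights):
    """One column of units: each unit holds a stationary weight and adds
    its product to the partial sum arriving from the unit above."""
    psum = 0
    for x, w in zip(inputs, column_weights):
        psum += x * w          # partial sum handed to the next unit down
    return psum

def systolic_matmul(inputs, weights):
    """inputs: M x K activations, weights: K x N values pre-stored in the array."""
    cols = list(zip(*weights))  # column n collects weights[k][n] for all k
    return [[systolic_column(row, col) for col in cols] for row in inputs]
```

For example, `systolic_matmul([[1, 2]], [[3], [4]])` yields `[[11]]`, matching the ordinary matrix product.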
In one implementation, in order to reduce the number of shifters and accumulators and to optimize the layout, the whole processing unit is divided into a plurality of blocks, each block consisting of a group of memory operation units representing the same weight bits; the precision of the ADC can thereby be increased appropriately, improving the overall computational throughput and performance.
In one implementation, a plurality of the processing units are arranged in a pipeline structure to form a processing unit array.
Specifically, as shown in fig. 8, the present embodiment adopts a pipeline design for the processing unit array: after data is output from one processing unit, it directly enters the next processing unit for calculation, without first being written back to the global memory and then read out again. This reduces power consumption by tens of times, effectively alleviates data congestion, and improves the operation speed.
By way of example, FIG. 2 includes a processing unit array in a pipelined design, the systolic design of the processing units and memory operation units, the memory operation device array inside each memory operation unit, and the single-device calculation and control logic. The memory operation device array shares a low-precision 1-bit ADC, and the operation output is completed over multiple cycles.
Based on the above embodiments, the present invention further provides a terminal, a schematic block diagram of which may be as shown in fig. 9. The terminal comprises a processor, a memory, a network interface, and a display screen connected through a system bus. The processor of the terminal is configured to provide computing and control capabilities. The memory of the terminal comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the terminal is used for connecting and communicating with an external terminal through a network. The computer program, when executed by the processor, implements the storage-and-computation integrated architecture systolic array design method suitable for multi-precision neural networks. The display screen of the terminal can be a liquid crystal display screen or an electronic ink display screen.
It will be understood by those skilled in the art that the block diagram of fig. 9 shows only a portion of the structure associated with the inventive arrangements and does not limit the terminals to which the inventive arrangements may be applied; a particular terminal may include more or fewer components than those shown, combine certain components, or arrange the components differently.
In one implementation, one or more programs are stored in the memory of the terminal and configured to be executed by one or more processors, and include instructions for performing the storage-and-computation integrated architecture systolic array design method suitable for a multi-precision neural network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
In summary, the present invention discloses a storage-and-computation integrated architecture systolic array design method suitable for a multi-precision neural network. The method includes: acquiring an original digital signal and splitting it into a number of 1-bit digital signals equal to the number of bits of the original digital signal; generating a plurality of input analog signals according to the plurality of 1-bit digital signals, the input analog signals respectively corresponding to different 1-bit digital signals; performing analog-to-digital conversion on the input analog signals respectively to obtain a plurality of target digital signals; and generating a convolution calculation result corresponding to the original digital signal according to the plurality of target digital signals. By splitting the original digital signal into a plurality of 1-bit digital signals, the storage-and-computation integrated architecture can be constructed using low-bit AD/DA modules, effectively solving the problems of high power consumption and large area overhead caused by the high-bit AD/DA modules used in prior-art storage-and-computation integrated architectures.
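As a behavioural sketch of the bit-plane decomposition and recombination underlying the method (hypothetical helper names; in the actual architecture the intermediate conversion and multiply-add steps occur on analog signals):

```python
def split_bits(value, n_bits):
    """Split an n-bit digital value into n 1-bit signals, LSB first."""
    return [(value >> i) & 1 for i in range(n_bits)]

def recombine(bit_planes):
    """Shift-accumulate 1-bit results back into the full-precision value."""
    return sum(b << i for i, b in enumerate(bit_planes))
```

For example, `split_bits(11, 4)` yields `[1, 1, 0, 1]`, and `recombine` restores the original value, which is why low-bit converters suffice per bit plane.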
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. A storage and computation integrated architecture systolic array design method suitable for a multi-precision neural network is characterized by comprising the following steps:
acquiring an original digital signal, and splitting the original digital signal into a plurality of 1-bit digital signals with the number of bits equal to that of the original digital signal;
generating a plurality of input analog signals according to the plurality of 1-bit digital signals, wherein the plurality of input analog signals respectively correspond to different 1-bit digital signals;
respectively carrying out analog-to-digital conversion on the input analog signals to obtain a plurality of target digital signals;
and generating a convolution calculation result corresponding to the original digital signal according to the plurality of target digital signals.
2. The method of claim 1, wherein the performing analog-to-digital conversion on the input analog signals to obtain target digital signals comprises:
and respectively inputting the input analog signals into a plurality of memory operation units to obtain a plurality of target digital signals.
3. The method of claim 2, wherein each of the memory operation units comprises a memory operation device array arranged as a systolic array, and the inputting of the input analog signals into the plurality of memory operation units respectively to obtain a plurality of target digital signals comprises:
for each of the plurality of input analog signals, determining a plurality of activated rows corresponding to the input analog signal in a target memory operation device array;
inputting the input analog signal into a plurality of first memory operation devices which are in one-to-one correspondence with the plurality of activated rows, wherein the plurality of first memory operation devices are respectively positioned at the initial bits of the plurality of activated rows;
and determining a target digital signal corresponding to the input analog signal based on single-column multiply-add operation results output by a plurality of second memory operation devices respectively, wherein the plurality of second memory operation devices are respectively positioned at the stop bit of each column in the target memory operation device array.
4. The method according to claim 3, wherein the determining the target digital signal corresponding to the input analog signal based on the result of the single-column multiply-add operation output by each of the second memory operation devices comprises:
and respectively inputting the single-column multiplication and addition operation results output by the second memory operation devices into a shared comparator to obtain the target digital signal.
5. The method of claim 1, wherein the generating a convolution calculation result corresponding to the original digital signal according to a plurality of target digital signals comprises:
carrying out shift accumulation operation on a plurality of target digital signals to obtain target shift accumulation data;
and generating a convolution calculation result corresponding to the original digital signal according to the target shift accumulated data.
6. The method of claim 5, wherein the performing shift accumulation operation on a plurality of target digital signals to obtain target shift accumulation data comprises:
shifting the weight bits of the target digital signals respectively to obtain a plurality of first shift data;
accumulating the first shift data to obtain first shift accumulated data;
shifting the input precision bits of the first shift accumulated data to obtain second shift data;
and accumulating the second shift data to obtain the target shift accumulated data.
7. The method of claim 5, wherein the generating the convolution calculation result corresponding to the original digital signal according to the target shift accumulated data comprises:
inputting the target shift accumulated data into an activation layer, inputting the output data of the activation layer into a computer interface, and inputting the output data of the computer interface into a convolutional layer;
and acquiring output data of the convolutional layer to obtain the convolution calculation result.
8. The method of claim 3, wherein a plurality of the memory operation units are arranged in a systolic array to form a processing unit.
9. The method of claim 8, wherein a plurality of the processing units are arranged in a pipeline structure to form a processing unit array.
10. A computer readable storage medium having stored thereon a plurality of instructions adapted to be loaded and executed by a processor to perform the steps of the method of any of claims 1-9 for a computational integrated architecture systolic array design for a multi-precision neural network.
CN202110988635.7A 2021-08-26 2021-08-26 Storage and calculation integrated architecture pulse array design method suitable for multi-precision neural network Active CN113743600B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110988635.7A CN113743600B (en) 2021-08-26 2021-08-26 Storage and calculation integrated architecture pulse array design method suitable for multi-precision neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110988635.7A CN113743600B (en) 2021-08-26 2021-08-26 Storage and calculation integrated architecture pulse array design method suitable for multi-precision neural network

Publications (2)

Publication Number Publication Date
CN113743600A true CN113743600A (en) 2021-12-03
CN113743600B CN113743600B (en) 2022-11-11

Family

ID=78733124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110988635.7A Active CN113743600B (en) 2021-08-26 2021-08-26 Storage and calculation integrated architecture pulse array design method suitable for multi-precision neural network

Country Status (1)

Country Link
CN (1) CN113743600B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427171A (en) * 2019-08-09 2019-11-08 复旦大学 Expansible fixed-point number matrix multiply-add operation deposits interior calculating structures and methods
CN112836813A (en) * 2021-02-09 2021-05-25 南方科技大学 A Reconfigurable Systolic Array System for Mixed-Precision Neural Network Computing
US20210175893A1 (en) * 2019-12-09 2021-06-10 Technion Research & Development Foundation Limited Analog-to-digital converter using a pipelined memristive neural network
CN113298245A (en) * 2021-06-07 2021-08-24 中国科学院计算技术研究所 Multi-precision neural network computing device and method based on data flow architecture

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023116314A1 (en) * 2021-12-23 2023-06-29 哲库科技(上海)有限公司 Neural network acceleration apparatus and method, and device and computer storage medium
CN114118387A (en) * 2022-01-25 2022-03-01 深圳鲲云信息科技有限公司 Data processing method, data processing apparatus, and computer-readable storage medium
CN114741021A (en) * 2022-04-18 2022-07-12 北京知存科技有限公司 Storage and computing integrated chip
CN115906735A (en) * 2023-01-06 2023-04-04 上海后摩智能科技有限公司 Multi-bit-number storage and calculation integrated circuit based on analog signals, chip and calculation device
CN115906735B (en) * 2023-01-06 2023-05-05 上海后摩智能科技有限公司 Multi-bit number storage and calculation integrated circuit, chip and calculation device based on analog signals
CN117289896A (en) * 2023-11-20 2023-12-26 之江实验室 A basic computing device integrating storage and calculation
CN117289896B (en) * 2023-11-20 2024-02-20 之江实验室 A basic computing device integrating storage and calculation

Also Published As

Publication number Publication date
CN113743600B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN113743600A (en) Design method of systolic array with integrated storage and computing architecture suitable for multi-precision neural network
Sridharan et al. X-former: In-memory acceleration of transformers
KR102780371B1 (en) Method for performing PIM (PROCESSING-IN-MEMORY) operations on serially allocated data, and related memory devices and systems
US12056599B2 (en) Methods of performing processing-in-memory operations, and related devices and systems
KR102780370B1 (en) Method for performing PIM (PROCESSING-IN-MEMORY) operations, and related memory devices and systems
Kang et al. S-FLASH: A NAND flash-based deep neural network accelerator exploiting bit-level sparsity
US20240143541A1 (en) Compute in-memory architecture for continuous on-chip learning
CN116011362A (en) A Matrix Multiply-Add Calculation Acceleration System with Optimized Bandwidth and Reduced Shared Cache Overhead
US20240028869A1 (en) Reconfigurable processing elements for artificial intelligence accelerators and methods for operating the same
CN117435159A (en) Method for compensating calculation result offset in integrated storage and calculation device and integrated storage and calculation device
Dhingra et al. Atleus: Accelerating Transformers on the Edge Enabled by 3D Heterogeneous Manycore Architectures
CN113313251A (en) Deep separable convolution fusion method and system based on data stream architecture
US20250362875A1 (en) Compute-in-memory devices and methods of operating the same
Yang et al. ISARA: An Island-Style Systolic Array Reconfigurable Accelerator Based on Memristors for Deep Neural Networks
CN118469795A (en) Address data writing method and device, electronic equipment and storage medium
WO2024032220A1 (en) In-memory computing circuit-based neural network compensation method, apparatus and circuit
Shivanandamurthy et al. ODIN: A bit-parallel stochastic arithmetic based accelerator for in-situ neural network processing in phase change RAM
CN119357123B (en) Memory architecture of basic linear algebra subroutine and implementation method thereof
US20250285664A1 (en) Integrated in-memory compute configured for efficient data input and reshaping
CN119678144A (en) Compact Computer-in-Memory Architecture
Fang et al. Low-Power and Area-Efficient CIM: An SRAM-based fully-digital computing-in-memory hardware acceleration processor with approximate adder tree for multi-precision sparse neural networks
US20250028946A1 (en) Parallelizing techniques for in-memory compute architecture
US20250321684A1 (en) Time multiplexing and weight duplication in efficient in-memory computing
US20250028674A1 (en) Instruction set architecture for in-memory computing
CN119816822A (en) Compact Computer-in-Memory Architecture

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240117

Address after: 518000, Building 307, Building 2, Nanshan Zhiyuan Chongwen Park, No. 3370 Liuxian Avenue, Fuguang Community, Taoyuan Street, Nanshan District, Shenzhen, Guangdong Province

Patentee after: Shenzhen Maitexin Technology Co.,Ltd.

Address before: No.1088 Xueyuan Avenue, Taoyuan Street, Nanshan District, Shenzhen, Guangdong 518055

Patentee before: Southern University of Science and Technology

PE01 Entry into force of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A design method for a systolic array architecture in an in-memory computing system applicable to multi-precision neural networks

Granted publication date: 20221111

Pledgee: Shenzhen Branch of China Merchants Bank Co.,Ltd.

Pledgor: Shenzhen Maitexin Technology Co.,Ltd.

Registration number: Y2025980030087