Disclosure of Invention
In view of the foregoing, and to at least partially solve the above technical problems, embodiments of the present invention provide a high-performance, low-power data processing method, together with a system, an apparatus, and a medium capable of implementing the method.
In one aspect, a technical solution of the present application provides a data processing method, including the following steps:
acquiring a sensor signal, and sending the sensor signal to a read bit line of an SRAM array, where the sensor signal is a voltage signal obtained by binarizing a pixel of an input feature map;
determining that the voltage on the read bit line has stabilized, and enabling a read word line of the SRAM array;
acquiring, via the read word line, a weight value, acquiring the sensor signal on the read bit line, multiplying the sensor signal by the weight value to obtain a first output voltage, and outputting the first output voltage to a shared bit line; the first output voltage is the voltage signal of a pixel of the output feature map;
and converting the first output voltage to obtain second output data, and obtaining a convolution operation result from the second output data.
In a possible embodiment of the present disclosure, the read bit line includes a first read bit line and a second read bit line, and the step of acquiring the weight value via the read word line, acquiring the sensor signal on the read bit line, multiplying the sensor signal by the weight value to obtain the first output voltage, and outputting the first output voltage to the shared bit line includes:
determining that the weight value is a first value, turning off the first transistor, turning on the second transistor, and sending the sensor signal on the first read bit line to the shared bit line;
determining that the weight value is a second value, turning on the first transistor, turning off the second transistor, and sending the sensor signal on the second read bit line to the shared bit line;
and determining that the weight value is a third value, turning off both the first transistor and the second transistor, and sending the sensor signals on the first read bit line and the second read bit line to the shared bit lines.
In a possible embodiment of the present disclosure, the step of converting the first output voltage to obtain second output data, and obtaining a convolution operation result from the second output data includes:
and subtracting the second output data obtained from the second shared bit line from the second output data obtained from the first shared bit line, to obtain the convolution operation result.
In a possible embodiment of the present disclosure, the data processing method further includes the following steps:
storing, on a first capacitor, the first output voltage obtained by the multiplication operation when its value is positive;
and storing, on a second capacitor, the first output voltage obtained by the multiplication operation when its value is negative.
In a possible embodiment of the present disclosure, the convolution operation result satisfies the following formula:

$$\mathrm{OUT} = \sum_{i=1}^{N} W_i \cdot V_i$$

where OUT is the result of the convolution operation, W_i is a weight value, V_i is the voltage value of the sensor signal, N is the number of SRAM arrays, i = 1, 2, 3, …, N, and N is a positive integer.
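As a brief numeric illustration of this formula (values are hypothetical, chosen only to show the arithmetic):

$$N = 3,\quad W = (1,\,-1,\,0),\quad V = (0.4,\,0.3,\,0.5) \;\Rightarrow\; \mathrm{OUT} = 1\cdot 0.4 + (-1)\cdot 0.3 + 0\cdot 0.5 = 0.1$$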
On the other hand, the technical solution of the present application further provides a data processing system, which includes:
the sensor storage array is used for acquiring a sensor signal and sending the sensor signal to a read bit line of the SRAM array; the sensor signal is a voltage signal obtained by binarizing a pixel of an input feature map;
the readout computing circuit is used for acquiring, via the read word line, a weight value, acquiring the sensor signal on the read bit line, multiplying the sensor signal by the weight value to obtain a first output voltage, and outputting the first output voltage to the shared bit line; the first output voltage is the voltage signal of a pixel of the output feature map;
and the analog-to-digital conversion circuit is used for converting the first output voltage to obtain second output data and obtaining a convolution operation result from the second output data.
In a possible embodiment of the present disclosure, the sensor storage array is a ternary SRAM; the ternary static random access memory includes a first ternary inverter and a second ternary inverter, where the output terminal of the first ternary inverter is connected to the input terminal of the second ternary inverter, and the output terminal of the second ternary inverter is connected to the input terminal of the first ternary inverter.
In a possible embodiment of the present disclosure, the first ternary inverter includes a thin gate NMOS transistor, a thick gate NMOS transistor, a thin gate PMOS transistor, and a thick gate PMOS transistor;
when the input terminal of the first ternary inverter receives a high-level signal, the thick-gate NMOS transistor and the thin-gate NMOS transistor are turned on, and the output terminal of the first ternary inverter outputs a low-level signal;
when the input terminal of the first ternary inverter receives a low-level signal, the thick-gate PMOS transistor and the thin-gate PMOS transistor are turned on, and the output terminal of the first ternary inverter outputs a high-level signal;
when the signal amplitude at the input terminal of the first ternary inverter is half of the high-level amplitude, the thin-gate NMOS transistor and the thin-gate PMOS transistor are turned on, and the signal amplitude at the output terminal of the first ternary inverter is half of the high-level amplitude;
The second ternary inverter has the same structure as the first ternary inverter.
On the other hand, the technical solution of the present application further provides a data processing apparatus, which includes:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one program causes the at least one processor to implement the data processing method according to the first aspect.
On the other hand, the technical solution of the present application further provides a storage medium storing a processor-executable program; when executed by a processor, the program performs the data processing method according to the first aspect.
Advantages and benefits of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
The technical solution of the present application is based on a sensor-oriented compute-in-memory convolution acceleration architecture. Data can be fed directly from the sensor without the usual chain of analog-to-digital and digital-to-analog conversion steps, which greatly reduces the hardware overhead, power consumption, and latency of data transfer. By adopting an SRAM capable of storing multiple weight values, the accuracy of the neural network algorithm is effectively improved while the latency and power consumption caused by data movement are reduced. The solution thus meets the neural network's demand for a hardware implementation architecture with reduced cost, including power consumption and hardware overhead, while providing high performance such as low latency and high bandwidth.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In the intelligent era, science and technology represented by Artificial Intelligence (AI) have greatly advanced human society. In recent years, AI technology has developed rapidly and has been widely applied in the consumer and industrial fields, for example in image recognition, industrial robots, autonomous driving, the metaverse, and medical image analysis. Meanwhile, as the curtain rises on the era of the intelligent Internet of Things, more and more data flow among the cloud, the edge, and end devices. Exponentially growing data volumes place higher demands on the computing power and power consumption of existing computing architectures. Due to the Memory Wall and the Power Wall, the limitations of the von Neumann computing architecture are increasingly prominent. Therefore, a new computing architecture is needed to address the challenges of future application scenarios. Against this background, the concept of integrating storage and computation has re-entered the view of academia and industry.
The success of artificial neural network algorithms and breakthroughs in the underlying hardware have together driven the rapid development of the artificial intelligence revolution. In recent years, artificial neural networks have shown great advantages in many application scenarios such as object detection, wearable devices, and natural language processing. At the software level, artificial neural network algorithms have achieved great success; to support efficient deployment of artificial neural network models from the cloud to edge devices, researchers in academia and industry have begun to design hardware accelerators dedicated to artificial neural networks. Currently, the mainstream platform for accelerating artificial neural network algorithms is the Graphics Processing Unit (GPU), which offers high computational accuracy and very flexible programming. Training of artificial neural network algorithms is typically done on GPU clusters, which are very energy-hungry, consuming hundreds to thousands of watts. To improve energy efficiency for both large data centers and edge devices, researchers have begun designing Application Specific Integrated Circuit (ASIC) architectures, such as Google's Tensor Processing Unit (TPU). However, the real problem of artificial neural network accelerators is the frequent data movement between the computing unit and the memory unit, i.e., the memory-wall problem of the traditional von Neumann architecture. Most of the operations in artificial neural network processing are Vector-Matrix Multiplications (VMM) between input vectors and weight matrices, which essentially perform multiply-accumulate (MAC) operations. Compute-in-memory (CIM) is therefore considered the most efficient solution with the potential to break the von Neumann architecture bottleneck.
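To make this concrete, a vector-matrix multiplication decomposes into one multiply-accumulate reduction per output; the following NumPy sketch (illustrative only, not part of the claimed system) shows the equivalence:

```python
import numpy as np

# A vector-matrix multiplication (VMM) is one multiply-accumulate (MAC)
# reduction per output column: out[j] = sum_i x[i] * W[i, j].
x = np.array([1.0, -1.0, 1.0])               # input vector (e.g., activations)
W = np.array([[ 1.0, -1.0],
              [ 0.0,  1.0],
              [-1.0,  1.0]])                 # weight matrix

out_vmm = x @ W                              # the VMM in one step
out_mac = np.array([np.sum(x * W[:, j]) for j in range(W.shape[1])])
assert np.allclose(out_vmm, out_mac)         # identical up to float rounding
```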
Furthermore, sensor systems are an important component of artificial intelligence devices. Traditional sensor systems have become increasingly unsuitable for smart devices, because their energy consumption cannot support long-term continuous data acquisition. In conventional solutions, the sensor system is physically separated from the computing unit, since their functional requirements and manufacturing technologies differ: sensors work primarily in the noisy analog domain, while computing units are typically implemented digitally on a traditional von Neumann computing architecture. The sensor terminal collects a large amount of raw data locally and then transmits it to the computing unit of the local system. As shown in fig. 1, in a conventional intelligent system, analog data collected by a sensor is first converted into a digital signal by an analog-to-digital converter (ADC), temporarily stored in memory, and then fetched from memory by the processing unit for processing. In other words, between data collection at the sensor and processing at the computing unit there is a chain of data conversion and data transmission steps, and published data show that the ADC and data storage dominate the power consumption of the overall system. This system architecture therefore inevitably causes significant problems in energy consumption, processing speed, and communication bandwidth.
Based on the foregoing theoretical basis, as shown in fig. 2, the technical solution of the present application provides a near-sensor, compute-in-memory convolution acceleration architecture. In a first aspect, an embodiment of the present application provides a data processing system based on this architecture; the system mainly includes a sensor storage array, a readout computing circuit, and an analog-to-digital conversion circuit.
The overall architecture of the system is shown in fig. 3. The storage array is composed of SRAM; it acquires a sensor signal through the connected sensor and sends the signal to a read bit line of the SRAM array, and it also stores the weights of the neural network model. The readout computing circuit determines that the voltage on the read bit line has stabilized and enables a read word line of the SRAM array; it then obtains the weight value via the read word line, multiplies the sensor signal by the weight value to obtain a first output voltage, and outputs the first output voltage to the shared bit line. The analog-to-digital conversion circuit converts the first output voltage into second output data and subtracts the second output data to obtain the convolution operation result.
In some alternative embodiments, the sensor storage array in the system is a ternary SRAM; the ternary static random access memory includes a first ternary inverter and a second ternary inverter, where the output terminal of the first ternary inverter is connected to the input terminal of the second ternary inverter, and the output terminal of the second ternary inverter is connected to the input terminal of the first ternary inverter.
In the related art, the SRAM used in mainstream compute-in-memory (CIM) architectures can store only 2 weight values W_i ∈ {1, -1}: the high level V_dd represents a weight of 1 and the low level V_GND represents a weight of -1. By contrast, as shown in fig. 3, the embodiment system of the present application adopts a ternary SRAM that can store 3 weight values W_i. In fig. 3, the storage node Q at the high level V_dd represents a weight of -1, Q at the low level V_GND represents a weight of 1, and Q at 1/2 V_dd represents a weight of 0. The power consumption of the ternary SRAM is slightly higher than that of an ordinary six-transistor SRAM, but the additional weight value markedly improves the recognition accuracy of the neural network model.
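The storage-node encoding of fig. 3 can be written as a small behavioral model; this is a sketch under a normalized-voltage assumption, and the function name is ours, not the embodiment's:

```python
VDD = 1.0    # high level (normalized)
VGND = 0.0   # low level

def q_voltage_to_weight(v_q: float) -> int:
    """Map the ternary SRAM storage-node voltage Q to the stored weight:
    Q = Vdd -> -1, Q = VGND -> +1, Q = Vdd/2 -> 0 (per fig. 3)."""
    if v_q == VDD:
        return -1
    if v_q == VGND:
        return 1
    if v_q == VDD / 2:
        return 0
    raise ValueError("Q must be one of the three legal ternary levels")
```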
In some alternative embodiments, as shown in fig. 4, the configuration of the ternary inverter (including the first ternary inverter and the second ternary inverter) in the embodiment system mainly includes a thin gate NMOS transistor, a thick gate NMOS transistor, a thin gate PMOS transistor, and a thick gate PMOS transistor.
When the input terminal of the ternary inverter receives a high-level signal, the thick-gate NMOS transistor and the thin-gate NMOS transistor are turned on, and the output terminal outputs a low-level signal; when the input terminal receives a low-level signal, the thick-gate PMOS transistor and the thin-gate PMOS transistor are turned on, and the output terminal outputs a high-level signal; when the input signal amplitude is half of the high-level amplitude, the thin-gate NMOS transistor and the thin-gate PMOS transistor are turned on, and the output signal amplitude is half of the high-level amplitude.
As shown in fig. 5, the SRAM in the embodiment can store three weight values thanks to the design of the ternary inverter (STI); the embodiment system uses multi-threshold CMOS technology in the STI to realize the ternary switching operation. The thick-gate NMOS transistor conducts only when its gate voltage is V_dd, and the thick-gate PMOS transistor conducts only when its gate voltage is V_GND; the thinner-gate NMOS transistor conducts when its gate voltage is V_dd or 1/2 V_dd, and the thinner-gate PMOS transistor conducts when its gate voltage is V_GND or 1/2 V_dd. When the input voltage In is V_dd, the output voltage Out is V_GND; when the input voltage In is V_GND, the output voltage Out is V_dd; when the input voltage In is 1/2 V_dd, the output voltage Out is 1/2 V_dd.
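The STI transfer characteristic described above amounts to a three-level inversion; a minimal behavioral sketch (normalized voltages assumed; `sti_out` is an illustrative name):

```python
VDD = 1.0
VGND = 0.0

def sti_out(v_in: float) -> float:
    """Standard ternary inverter (STI): Vdd -> VGND, VGND -> Vdd, Vdd/2 -> Vdd/2."""
    if v_in == VDD:
        return VGND        # thick- and thin-gate NMOS conduct, pulling the output low
    if v_in == VGND:
        return VDD         # thick- and thin-gate PMOS conduct, pulling the output high
    if v_in == VDD / 2:
        return VDD / 2     # only the thin-gate pair conducts; the mid level is held
    raise ValueError("input must be a legal ternary level")
```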
More specifically, as shown in fig. 6, when the system provided in the present application performs convolution calculation, the voltage value V_i can be input directly from the sensor; the data collected by the sensor need not pass through the intermediate chain of analog-to-digital converter (ADC), digital-to-analog converter (DAC), and memory. This saves most of the energy consumption, greatly reduces the latency of intermediate data, and improves the processing speed of the overall system. The convolution calculation (multiply-accumulate, MAC) is then performed in the CIM architecture.
Furthermore, based on the data processing system with the near-sensor compute-in-memory architecture proposed in the first aspect, the present application further provides a data processing method; the method mainly completes the multiply-accumulate computation of the convolutional layers and fully connected layers of a binarized convolutional neural network (CNN). The artificial neural network model on which the embodiment is based is the LeNet-5 neural network model, and a binarized neural network model is constructed by applying a binarization operation to LeNet-5. Although a full-precision floating-point CNN can provide high recognition accuracy, the cost behind the high recognition rate is a huge amount of computation, high power consumption, and high hardware overhead, which is burdensome for low-power, hardware-resource-constrained embedded edge devices. From the viewpoint of hardware friendliness, the CNN is binarized using a sign function that determines the result from the sign of the floating-point number: positive numbers binarize to 1 and negative numbers to -1. The specific binarization formula is as follows:

$$W_b = \operatorname{sign}(W) = \begin{cases} +1, & W \ge 0 \\ -1, & W < 0 \end{cases}$$
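In code, the binarization is a one-liner; a sketch (mapping zero to +1 follows the usual binarized-network convention and is our assumption, since the text only specifies strictly positive and negative inputs):

```python
import numpy as np

def binarize(w: np.ndarray) -> np.ndarray:
    """Sign-function binarization: positive -> +1, negative -> -1."""
    return np.where(w >= 0, 1.0, -1.0)

print(binarize(np.array([0.37, -1.2, 0.0])))   # [ 1. -1.  1.]
```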
Regarding the network structure, the LeNet-5 neural network model is a CNN model for handwriting recognition. Excluding the input layer and the output layer, it has 6 layers in total: C1 and C3 are convolutional layers, S2 and S4 are pooling layers, and F5 and F6 are fully connected layers. Binarizing the LeNet-5 model greatly reduces the amount of computation and the hardware overhead. In the whole LeNet-5 model, most of the computation occurs in the convolutional layers and the fully connected layers, and the operation of these two kinds of layers is essentially a multiply-accumulate (MAC) operation. Therefore, completing these two kinds of layers in the compute-in-memory (CIM) architecture proposed herein effectively reduces the power consumption of the whole system and accelerates the whole neural network model. Based on the foregoing theoretical basis, as shown in fig. 7, the embodiment method includes steps S100-S400:
S100, acquiring a sensor signal and sending the sensor signal to a read bit line of the SRAM array, where the sensor signal is a voltage signal obtained by binarizing a pixel of an input feature map.
Illustratively, in an embodiment, the near-sensor memory-bank architecture includes 64 SRAM cells in a row, meaning that 64 multiplications can be done in parallel and their results accumulated. First, the embodiment sends the voltage value V_i output by the sensor to the read bit lines of the SRAM array.
S200, determining that the voltage on the read bit line has stabilized, and enabling a read word line of the SRAM array.
In an embodiment, after the voltage applied to the read bit line has stabilized, the read word line (RWL) of the SRAM is enabled to begin reading the weight value W_i pre-stored in the SRAM.
S300, acquiring, via the read word line, a weight value, acquiring the sensor signal on the read bit line, multiplying the sensor signal by the weight value to obtain a first output voltage, and outputting the first output voltage to a shared bit line; the first output voltage is the voltage signal of a pixel of the output feature map.
In an embodiment, the read bit lines include a first read bit line and a second read bit line (RBL and RBLB, respectively), and the step of obtaining the weight value via the read word line and multiplying by the weight value to obtain the first output voltage on the shared bit line may include steps S310-S330:
S310, determining that the weight value is a first value, turning off the first transistor, turning on the second transistor, and sending the sensor signal on the first read bit line to the shared bit line.
As shown in fig. 6, when the weight value W_i is 1, i.e., the voltage at Q is V_GND and the voltage at QB is V_dd, the transistor N4 (i.e., the second transistor) is turned on and the voltage on RBLB is discharged to ground, while the transistor N3 (i.e., the first transistor) is turned off and the voltage on RBL remains unchanged. The EN_p switch is then closed and the voltage on RBL is transferred to the shared bit line V_p, completing the multiplication 1 × V_i and yielding the first output voltage.
S320, determining that the weight value is a second value, turning on the first transistor, turning off the second transistor, and sending the sensor signal on the second read bit line to the shared bit line.
As shown in fig. 6, when the weight value W_i is -1, i.e., the voltage at Q is V_dd and the voltage at QB is V_GND, the transistor N3 is turned on and the voltage on RBL is fully discharged to ground, while the transistor N4 is turned off and the voltage on RBLB remains unchanged. The EN_n switch is then closed and the voltage on RBLB is transferred to the shared bit line V_n, completing the multiplication -1 × V_i and yielding the first output voltage.
S330, determining that the weight value is a third value, turning off both the first transistor and the second transistor, and sending the sensor signals on the first read bit line and the second read bit line to the shared bit lines.
As shown in fig. 6, when the weight value W_i is 0, i.e., the voltages at both Q and QB are 1/2 V_dd, the transistors N3 and N4 are both turned off and the voltages on RBL and RBLB remain unchanged. The EN_p and EN_n switches are then closed, so the voltage on RBL is transferred to the shared bit line V_p and the voltage on RBLB is transferred to the shared bit line V_n, completing the multiplication 0 × V_i and yielding the first output voltage.
In an embodiment, the step of converting the first output voltage to obtain second output data and obtaining a convolution operation result from the second output data includes subtracting the second output data obtained from the second shared bit line from the second output data obtained from the first shared bit line, to obtain the convolution operation result.
In the embodiment, although the voltages on the two read bit lines are transferred to the shared bit lines V_p and V_n respectively when the weight is 0, their difference is 0, thereby achieving the effect of the 0 × V_i multiplication.
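Steps S310-S330 and the differential readout can be condensed into a behavioral model. The sketch below is illustrative only (ideal switches and lossless charge transfer are assumed, and the function name is ours):

```python
def ternary_multiply(w: int, v_i: float) -> tuple[float, float]:
    """Behavioral model of one ternary SRAM cell's multiplication step.
    Returns the contributions placed on the shared bit lines (V_p, V_n):
    w = +1: N4 on, N3 off -> RBL holds v_i; EN_p closes, v_i reaches V_p.
    w = -1: N3 on, N4 off -> RBLB holds v_i; EN_n closes, v_i reaches V_n.
    w =  0: N3 and N4 off -> v_i reaches both lines and cancels out."""
    if w == 1:
        return v_i, 0.0
    if w == -1:
        return 0.0, v_i
    if w == 0:
        return v_i, v_i
    raise ValueError("weight must be -1, 0, or 1")

# The differential result of one multiplication is W_i * V_i:
for w in (1, -1, 0):
    v_p, v_n = ternary_multiply(w, 0.5)
    print(w, v_p - v_n)   # 0.5, -0.5, 0.0
```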
In some possible embodiments, after each multiplication result is obtained, the voltages on the read bit lines (RBL, RBLB) are transferred to the shared bit lines V_p and V_n based on capacitive coupling and charge-sharing schemes, and the accumulated voltages are stored on the capacitors C_p and C_n: the capacitor C_p on the shared bit line V_p holds the positive multiplication results, and the capacitor C_n on the shared bit line V_n holds the negative multiplication results.
S400, converting the first output voltage to obtain second output data, and obtaining a convolution operation result from the second output data.
Specifically, in the embodiment, the voltage value on each shared bit line is converted into a binary digital result by the ADC, and the two results are then subtracted to obtain the final result of the convolution operation:

$$\mathrm{OUT} = \sum_{i=1}^{N} W_i \cdot V_i$$

where OUT is the result of the convolution operation, W_i is a weight value, V_i is the voltage value of the sensor signal, N is the number of SRAM arrays, i = 1, 2, 3, …, N, and N is a positive integer.
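Putting the row together, the accumulate-convert-subtract flow can be sketched end to end. Everything below is an illustrative assumption (an idealized 8-bit ADC, a 64-cell row, normalized voltages); it models the embodiment's arithmetic, not its circuit:

```python
import numpy as np

def adc(v: float, v_ref: float = 64.0, bits: int = 8) -> int:
    """Idealized ADC: quantize an accumulated shared-bit-line voltage to a code."""
    return round((v / v_ref) * (2**bits - 1))

def row_mac(weights: np.ndarray, v_in: np.ndarray) -> float:
    """Accumulate per-cell products on C_p / C_n, convert each, then subtract."""
    c_p = float(v_in[weights >= 0].sum())   # w = +1 and w = 0 charge C_p
    c_n = float(v_in[weights <= 0].sum())   # w = -1 and w = 0 charge C_n
    lsb = 64.0 / (2**8 - 1)                 # volts per ADC code
    return (adc(c_p) - adc(c_n)) * lsb      # OUT ~= sum_i W_i * V_i

rng = np.random.default_rng(0)
w = rng.choice([-1, 0, 1], size=64)         # one row of ternary weights
v = rng.choice([0.0, 1.0], size=64)         # binarized sensor voltages
print(row_mac(w, v), float(w @ v))          # agree up to quantization error
```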
On the other hand, the technical solution of the present application further provides a data processing apparatus, which includes:
at least one processor; at least one memory for storing at least one program; and, when the at least one program is executed by the at least one processor, the at least one processor is caused to execute the data processing method according to the first aspect.
An embodiment of the present invention further provides a storage medium storing a corresponding executable program; when the program is executed by a processor, the data processing method of the first aspect is implemented.
From the above implementation process, it can be concluded that, compared with the prior art, the technical solution provided by the present invention has the following advantages:
The technical solution of the present application provides a sensor-oriented compute-in-memory convolution acceleration architecture. First, at the input layer, data processed by the CIM architecture are input directly from the sensor, without a chain of analog-to-digital and digital-to-analog conversion steps, greatly reducing the hardware overhead, power consumption, and latency of data transfer. Second, by adopting an SRAM capable of storing three weight values, the accuracy of the neural network algorithm is effectively improved. In addition, the CIM structure breaks the bottleneck of the traditional von Neumann architecture: the multiply-accumulate (MAC) operation is completed directly in the SRAM, which reduces the latency and power consumption caused by data movement. This sensor-oriented compute-in-memory convolution acceleration architecture is well suited as a hardware implementation architecture for the convolution operations of neural network algorithms; it meets the neural network's demand for reduced cost, including power consumption and hardware overhead, while providing high performance such as low latency and high bandwidth.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise specified to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be understood that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer given the nature, function, and interrelationships of the modules. Accordingly, those of ordinary skill in the art will be able to practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is to be determined from the appended claims along with their full scope of equivalents.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.