WO2021223547A1 - Subunit, MAC array, and bit-width reconfigurable analog-digital hybrid in-memory computing module - Google Patents
Subunit, MAC array, and bit-width reconfigurable analog-digital hybrid in-memory computing module
- Publication number
- WO2021223547A1 (application PCT/CN2021/084032, CN2021084032W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- calculation
- type mos
- capacitor
- mac
- mos transistor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/544—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
- G06F7/5443—Sum of products
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G06N3/065—Analogue means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C11/00—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
- G11C11/21—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
- G11C11/34—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
- G11C11/40—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
- G11C11/41—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger
- G11C11/412—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger using field-effect transistors only
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C11/00—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
- G11C11/21—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
- G11C11/34—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
- G11C11/40—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
- G11C11/41—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger
- G11C11/413—Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction
- G11C11/417—Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction for memory cells of the field-effect type
- G11C11/419—Read-write [R-W] circuits
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C11/00—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
- G11C11/54—Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11C—STATIC STORES
- G11C7/00—Arrangements for writing information into, or reading information out from, a digital store
- G11C7/10—Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
- G11C7/1006—Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present invention relates to the field of analog-digital hybrid in-memory computing, and more specifically, to a subunit, a MAC array, and a bit-width reconfigurable analog-digital hybrid in-memory computing module.
- the digital circuit occupies a large chip area and consumes a lot of power, making it difficult to realize a large-scale neural network with high energy efficiency.
- the data exchange bottleneck between the memory and the central processing unit caused by the von Neumann structure used in traditional digital circuits will severely limit the computing energy efficiency and computing speed under the large-scale data handling in DNN applications.
- the analog circuit implementation of MAC has the advantages of simple structure and low power consumption, so analog and analog-digital mixed-signal calculations have the potential to achieve high energy efficiency.
- in-memory computing, which has become a research hotspot in recent years, essentially cannot be realized purely with digital circuits and requires the assistance of analog circuits.
- meanwhile, because DNNs tolerate calculation errors, including those caused by circuit noise, relatively well, DNN application-specific integrated circuits (ASICs) are regaining attention.
- the addition stage uses charge sharing.
- Each 1-bit calculation unit of the above 1-bit MAC calculation has 10 transistors.
- the problems of the prior art in Paper 1 and Paper 2 are: (1) for every addition operation, the transmission gate in every calculation unit is driven unconditionally, so the sparsity of the input data cannot be exploited to save energy; (2) every arithmetic unit that performs a 1-bit multiplication is equipped with an independent capacitor, and the metal-oxide-metal (MOM) capacitors of the successive-approximation (SAR) analog-to-digital converter (ADC) are placed outside the static random access memory (SRAM) calculation array because there is no space inside the array, which reduces area efficiency; (3) the charge-sharing addition stage must connect to the top plates of the capacitors that store the XNOR results.
- this circuit topology makes the addition susceptible to non-ideal effects such as charge injection, clock feedthrough, the nonlinear parasitic capacitance at the drain or source of the transmission-gate transistors, and leakage of the transistors connected to the capacitor top plates, which cause calculation errors.
- the mismatch between the arithmetic capacitors and the capacitors of the digital-to-analog converter inside the ADC, caused by physical-layout mismatch, also causes calculation errors.
- Paper 3 proposes a computing module that only supports binary neural networks (BNN) with binarized weights and activation values.
- the shortcomings of the computing module in Paper 3 are: (1) the architecture only supports BNNs and cannot be used for large DNN models for vision applications such as object detection, so its scope of application is small; (2) the multiplication stage of the 1-bit MAC calculation requires at least one OR gate, two XNOR gates, two exclusive-OR (NOR) gates, and a latch, so many transistors are used and the area occupied is large.
- Paper 4 proposes an energy-efficient SRAM with embedded convolution computation.
- the shortcomings of the SRAM in Paper 4 are: (1) each 1-bit computing SRAM cell has 10 transistors, and the more transistors per cell, the lower the storage density; (2) the parasitic capacitance on the bit lines is used to store charge for the subsequent averaging operation.
- the paper's solution uses a ramp-based ADC that takes up to 2^N − 1 steps (N is the ADC resolution) to converge, which lowers the analog-to-digital conversion speed and leads to lower computational throughput; (5) the array input uses an additional DAC circuit to convert the input data X in (usually a feature map) from a digital representation to an analog representation.
- the non-ideal characteristics of the DAC circuit lead to further loss of accuracy as well as area and energy overhead.
- the calculation units that perform 1-bit multiplication in prior-art MAC arrays use many transistors; the capacitors that store the multiplication results for accumulation correspond one-to-one to the storage cells, i.e., the number of capacitors equals the number of storage cells, and a capacitor is generally much larger than an SRAM cell, especially in advanced process nodes, so the MAC array occupies a large area; at the same time, transistors are driven unconditionally during the multiply-add operation, resulting in low computing energy efficiency; in addition, the high calculation-error rate limits the applicable scenarios.
- the present invention provides a subunit, a MAC array, and a bit-width reconfigurable analog-digital hybrid in-memory computing module.
- to reduce calculation errors, a differential implementation of the MAC array is also provided.
- the present invention adopts the following technical solutions:
- an in-memory analog-digital hybrid calculation subunit including: a storage module, a calculation capacitor, and a control module;
- the storage module includes two cross-coupled CMOS inverters and a complementary transmission gate; the two cross-coupled CMOS inverters store a 1-bit filter parameter, the gate of the N-type MOS transistor of the complementary transmission gate is connected to the input signal, the gate of the P-type MOS transistor of the complementary transmission gate is connected to the complementary input signal, the output terminal of one of the CMOS inverters is connected to the input terminal of the complementary transmission gate, and the output terminal of the complementary transmission gate is connected to the bottom plate of the calculation capacitor and to the control module;
- a plurality of sub-units are used to form a calculation unit, and each sub-unit in the same calculation unit shares the same control module and a calculation capacitor.
- a 1-bit filter parameter or weight w is written and stored in two cross-coupled CMOS inverters.
- the input signal A is connected to the gate of the N-type MOS transistor of the complementary transmission gate, the gate of the P-type MOS transistor of the complementary transmission gate is connected to the complementary input signal nA, and the multiplication result of the input signal A and the weight w is stored as the voltage of the capacitor bottom plate.
- the multiple subunits form a calculation unit, each subunit in the same calculation unit shares the same control module and calculation capacitor, and the subunits are arranged in a feasible way such as 2×2 or 4×2. Intuitively, this solution reduces the number of control modules composed of MOS transistors; taking 2×2 subunits as an example, 3 control modules and 3 calculation capacitors are saved.
- the control module includes a first N-type MOS transistor, a second N-type MOS transistor, and a P-type MOS transistor.
- the gate of the first N-type MOS transistor is connected to the signal B.
- the level of the complementary input signal nA is the same as the signal B during calculation.
- the output terminal of one of the two cross-coupled CMOS inverters is connected to the input terminal of the complementary transmission gate.
- the source of the second N-type MOS transistor is grounded to Gnd, the gate is connected to a bit line, the source of the P-type MOS transistor is connected to Vdd, and the gate is connected to another complementary bit line.
- Such a topological structure can avoid unconditional driving of complementary transmission gates and improve energy efficiency.
- the complementary transmission gate is not connected to the top plate of the calculation capacitor used for charge accumulation, which minimizes calculation errors, in particular the errors caused by clock feedthrough when the MOS transistors switch, charge injection when they turn from on to off, the nonlinear parasitic capacitance at the drain/source of the complementary transmission gate, and transistor leakage.
- the connection mode between the MOS transistors in the control module is changed, and the connection to the computing capacitor bottom plate is changed.
- the second N-type MOS transistor and the P-type MOS transistor are connected in series to form a first CMOS inverter.
- the source of the P-type MOS transistor of the first CMOS inverter is connected to Vdd, and the source of the second N-type MOS transistor of the first CMOS inverter is connected to the drain of the first N-type MOS transistor;
- the source of the first N-type MOS transistor is grounded (Gnd), and the signal at its gate is at the same level during operation as the signal connected to the gate of the P-type MOS transistor of the complementary transmission gate;
- the input of the first CMOS inverter is connected to a bit line, and its output is connected to the bottom plate of the calculation capacitor.
- the first N-type MOS transistor and the P-type MOS transistor are retained in the control unit, and the second N-type MOS transistor is removed: the drain of the first N-type MOS transistor is connected to the drain of the P-type MOS transistor and to the bottom plate of the calculation capacitor, and the source of the first N-type MOS transistor is connected to the source of the P-type MOS transistor and to a bit line.
- the gate of the first N-type MOS transistor is connected to a control word line, whose level during calculation is the same as the gate level of the P-type MOS transistor in the complementary transmission gate.
- the gate of the P-type MOS transistor is connected to another control word line.
- the more subunits share the MOS transistors and the calculation capacitor of the control unit, the more evenly the number and area of these devices are amortized over the subunits, and the closer the number of transistors needed per subunit gets to six.
- the subunits in the calculation unit are activated in a time-division multiplexed manner, that is, when one subunit is activated, the other subunits in the same calculation unit are deactivated, and in each calculation unit the signal applied to the gate of the first N-type MOS transistor is at the same level as the gate of the P-type MOS transistor of the complementary transmission gate of whichever subunit is in the working state at that moment.
- the filter parameters stored in the other subunits of the same calculation unit can be used for in-memory calculation immediately, without having to move data in from outside and store it in the subunit before calculating, which improves calculation speed and data throughput and reduces energy and area consumption.
- a MAC array including the first aspect and its possible implementations, which performs multiply-add operations, including: multiple calculation units, where the output terminals of the complementary transmission gates of all subunits in each calculation unit are connected to the bottom plate of the same calculation capacitor, the top plates of the calculation capacitors of all calculation units in the same column are connected to the same accumulation bus, and the voltage of each accumulation bus corresponds to the accumulated sum of that column's multiplications.
- per unit area, the MAC array includes more cross-coupled CMOS inverters, so more neural-network filter parameters can be stored at one time to reduce data movement.
- the MAC array further includes a second CMOS inverter and a differential calculation capacitor.
- the output terminals of the complementary transmission gates of all subunits are connected to the input terminal of the same second CMOS inverter, and the output terminal of the second CMOS inverter is connected to the bottom plate of the differential calculation capacitor; the top plates of all differential calculation capacitors in the same column are connected to the same differential accumulation bus.
- a bit-width reconfigurable analog-digital hybrid MAC calculator including: the MAC array of the second aspect or any possible implementation of the second aspect, whose column-wise accumulation result after calculation is expressed as an analog voltage; a filter/ifmap module, which provides the filter parameters, or the activation values calculated by the upper layer of the neural network, that are written into and stored in the MAC array; an ifmap/filter module, which provides the input of the MAC array to be multiplied and added with the filter parameters or the activation values calculated by the upper layer of the neural network; an analog-to-digital conversion module, which converts the analog voltage obtained after the MAC into a digital representation; and a digital processing module, which performs multi-bit fusion, offset, scaling, or non-linear operations on the digital representation output by the analog-to-digital conversion module, the output result being a partial sum or an activation value that can be used as the input of the next layer of the network.
- the filter parameters or activation values calculated in the upper layer of the neural network are written and stored in the MAC array through the filter/ifmap module, so that the two cross-coupled CMOS inverters in the subunit store logic 1 or 0, and multiply and add with the input provided by the ifmap/filter module.
- the multiplication between the stored value in each subunit and the input is a digital operation, which is equivalent to an AND operation.
- the result of the multiplication is stored in the calculation capacitor.
- in the addition stage, because the top plates of all calculation capacitors in the same column are connected together by the same accumulation bus, the charges stored in the different calculation capacitors are shared through the accumulation bus, and the multiplication results accumulated in the column direction are stored as an analog voltage.
- the analog result is converted into a digital representation by an analog-to-digital conversion module, and finally the digital representation is processed, and the output result is a partial sum or an activation value that can be used for the input of the next layer of the network.
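As a behavioral illustration of this flow (not part of the original disclosure), the Python sketch below models one column end to end under idealized assumptions; all function and variable names, the 3-bit ADC, and the toy data are illustrative choices, and the analog charge sharing is treated as an exact average.

```python
import numpy as np

VDD = 1.0  # supply voltage (illustrative)

def mac_column_analog(weights, inputs, n_caps):
    """Ideal charge-sharing model: each 1-bit product drives one equal
    calculation capacitor to VDD or 0; the shared top plate settles at
    the average of the bottom-plate voltages."""
    products = weights & inputs            # 1-bit AND multiplication
    return VDD * products.sum() / n_caps   # analog accumulation voltage

def adc(v_top, n_bits, v_ref=VDD):
    """Idealized ADC: quantize the accumulation voltage to n_bits."""
    return int(round(v_top / v_ref * (2 ** n_bits - 1)))

def digital_post(code, scale=1.0, offset=0.0):
    """Digital processing stage: offset/scale (bit fusion and the
    non-linearity would also live here)."""
    return scale * code + offset

# toy example: one column of 8 calculation units
w = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # stored filter bits
a = np.array([1, 1, 0, 1, 0, 1, 1, 0])   # 1-bit inputs
v = mac_column_analog(w, a, n_caps=len(w))
print(adc(v, n_bits=3), digital_post(adc(v, n_bits=3)))
```

The only point of the sketch is the division of labor the text describes: a digital 1-bit multiply, an analog charge-sharing accumulation, and a digital post-processing step.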
- MAC consumes a lot of energy.
- MAC adopts analog-digital mixed operation, which can greatly reduce energy consumption, and the realization of low area of MAC array can improve energy efficiency and calculation speed.
- the combination of different calculation methods is adopted for different stages of the entire neural network calculation, which makes great use of the different advantages of analog and digital calculations, and ensures the realization of low power consumption, high energy efficiency, high speed and high precision in the calculation process.
- the analog-to-digital conversion module adopts a SAR ADC, which is specifically a SAR ADC with a binary weighted capacitor structure.
- the sparsity of the input values and of the values stored in the MAC array can keep some capacitors in the SAR DAC from being switched, thereby obtaining higher energy efficiency and a higher ADC conversion speed.
- the bit width of each column of SAR ADC in the MAC array can be determined in real time by the input value and the sparsity of the stored value.
- the MAC DAC and the SAR DAC can be connected together.
- the MAC DAC refers to a column of calculation capacitors in the MAC array, that is, a column of capacitors in the MAC array is connected in parallel with the capacitors in the SAR DAC.
- the MAC DAC may be multiplexed as the SAR DAC through bottom-plate sampling, so that the same capacitor array is used to implement both the MAC operation and the analog-to-digital conversion, avoiding the mismatch and accuracy loss caused by differences between the capacitor array used in the MAC operation and that used in the conversion stage; it also allows a fully differential SAR ADC to be realized, which better addresses the problem of common-mode-dependent comparator input offset voltage.
- Figure 1a is a schematic diagram of a subunit in an embodiment of the present invention.
- Figure 1b is a schematic diagram of a 6T structure in a subunit in an embodiment of the present invention.
- Figure 2a is a schematic diagram of a subunit structure in an embodiment of the present invention.
- Figure 2b is a schematic diagram of a subunit structure in another embodiment of the present invention.
- Figure 2c is a schematic diagram of a subunit structure in another embodiment of the present invention.
- Figure 2d is a schematic diagram of a truth table of a 1-bit multiplier unit in an embodiment of the present invention.
- Figure 3a is a schematic diagram of the arrangement of sub-units in a computing unit in an embodiment of the present invention.
- Figure 3b is a schematic diagram of a computing unit composed of multiple subunits in an embodiment of the present invention.
- Figure 3c is a schematic diagram of a computing unit composed of multiple subunits in another embodiment of the present invention.
- Figure 3d is a schematic diagram of a computing unit composed of multiple subunits in another embodiment of the present invention.
- Fig. 3e is a truth table of subunits in a calculation unit in an embodiment of the present invention.
- FIG. 4 is a schematic diagram of a MAC array including a calculation unit in an embodiment of the present invention.
- Figure 5 is a schematic diagram of calculating the bottom and top plate voltages of a capacitor in an embodiment of the present invention.
- Figure 6a is a schematic diagram of a calculation unit connected to a second CMOS inverter and a differential calculation capacitor in an embodiment of the present invention.
- Figure 6b is a schematic diagram of a calculation unit connected to a second CMOS inverter and a differential calculation capacitor in another embodiment of the present invention.
- Figure 6c is a schematic diagram of a calculation unit connected to a second CMOS inverter and a differential calculation capacitor in another embodiment of the present invention.
- FIG. 7 is a schematic diagram of forming a MAC array under a differential system in an embodiment of the present invention.
- FIG. 8 is a schematic diagram of an in-memory calculation module in an embodiment of the present invention.
- FIG. 9 is a schematic diagram of an analog-to-digital conversion module in an embodiment of the present invention.
- FIG. 10 is a schematic diagram of an analog-to-digital conversion module in another embodiment of the present invention.
- FIG. 11 is a schematic diagram of an analog-to-digital conversion module in another embodiment of the present invention.
- FIG. 12 is a schematic diagram of an analog-to-digital conversion module in another embodiment of the present invention.
- Figure 13 is a schematic diagram of the differential structure of an analog-to-digital conversion module in another embodiment of the present invention.
- FIG. 14 is a schematic diagram of an architecture for reducing the energy consumption of analog-to-digital conversion in an embodiment of the present invention.
- the bit-width reconfigurable analog-digital hybrid computing module provided by the embodiments of the present invention can be applied in visual and acoustic DNN architectures, more specifically, to achieve object detection, low-power acoustic feature extraction, and the like.
- the data to be processed is convolved with a filter composed of weights in the feature extractor, and then the corresponding feature map/activation value is output.
- Different filter selections will result in different extracted features.
- the convolution operation of the data to be processed and the filter requires the highest energy consumption, and it is necessary to avoid the energy consumption caused by unconditional driving of the circuit, especially when the data to be processed is a sparse matrix.
- This application proposes a sub-unit for in-memory calculation, as shown in Figure 1a, including: a storage module, a calculation capacitor, and a control module;
- the storage module includes two cross-coupled CMOS inverters and a complementary transmission gate; the two cross-coupled CMOS inverters store a 1-bit filter parameter, the gate of the N-type MOS transistor of the complementary transmission gate is connected to the input signal A, and the gate of the P-type MOS transistor of the complementary transmission gate is connected to the complementary input signal nA;
- the output terminal of one of the CMOS inverters is connected to the input terminal of the complementary transmission gate, and the output terminal of the complementary transmission gate is connected to the bottom plate of the calculation capacitor and to the control module.
- the complementary transmission gate is a bidirectional device, and the input terminal of the complementary transmission gate in the present invention refers to the end connected to the output terminal of one of the CMOS inverters.
- the gates of the N-side and P-side transistors of the complementary transmission gate can be connected respectively to the word-line signals WL and nWL to control writing or reading the weight w of the subunit, and can also be connected respectively to the input signal A and the complementary input signal nA to take part in the 1-bit multiplication.
- in Fig. 1a, w is the abbreviation of weight and nW of negative weight, denoting the true and complementary weight storage nodes, respectively.
- the multiplication result of the input signal A and the weight w is stored as the voltage V btm of the capacitor bottom plate.
- a memory module composed of two cross-coupled CMOS inverters connected to a complementary transmission gate is called a 6T structure (6T sub-cell, including 6 transistors), see Figure 1b.
- each subunit in the same calculation unit shares the same control module and a calculation capacitor.
- Figure 2a is a schematic diagram of an embodiment of a subunit structure for 1-bit multiplication, in which the control module includes a first N-type MOS transistor, a second N-type MOS transistor, and a P-type MOS transistor.
- the source of the first N-type MOS transistor is grounded (Gnd), and its drain is connected to the drain of the second N-type MOS transistor, the drain of the P-type MOS transistor, and the output terminal of the complementary transmission gate, all of which connect to the same bottom plate of the calculation capacitor.
- the gate of the N-type MOS transistor of the complementary transmission gate is connected to the input signal A, and the signal B applied to the gate of the first N-type MOS transistor is at the same level as the complementary input signal nA connected to the gate of the P-type MOS transistor of the complementary transmission gate; the source of the second N-type MOS transistor is grounded (Gnd) and its gate is connected to a bit line BL k ; the source of the P-type MOS transistor is connected to Vdd and its gate is connected to another bit line nBL k .
- the complementary input signal nA and signal B can share a node, since they provide the same level.
- the signal B applied to the gate of the first N-type MOS transistor is not only used to reset the bottom-plate voltage of the calculation capacitor so that the subunit can be used for the next calculation, but also participates in the 1-bit multiplication, as shown in the truth table of Figure 2d.
- the written filter parameter or weight w stored in the two cross-coupled CMOS inverters undergoes a 1-bit multiplication with the input signal A at the N terminal of the complementary transmission gate, and the result of the multiplication is stored as the voltage of the bottom plate of the calculation capacitor.
- the complementary transmission gate is used to control writing of w, and the word-line signals received by the N-terminal and P-terminal gates of the complementary transmission gate in the subunit must ensure that the complementary transmission gate is turned on: the word line WL connected to the N-terminal gate of the complementary transmission gate is set to Vdd, and the other word line nWL connected to the P-terminal gate of the complementary transmission gate is set to 0.
- to write "0", the bit lines BL k and nBL k are both set to high level, so the second N-type MOS transistor is turned on and the P-type MOS transistor is not; to write "1", the bit lines BL k and nBL k are both set to low level, so the second N-type MOS transistor is not turned on and the P-type MOS transistor is.
- BL k is set to low level and nBL k to high level to ensure that the multiplication result of the in-memory calculation passed by the complementary transmission gate is stored in the calculation capacitor.
- based on the control module in the subunit, the bit-line design is optimized; see Figure 2b, which is a schematic diagram of the optimized subunit.
- the control module includes a first N-type MOS transistor, a second N-type MOS transistor, and a P-type MOS transistor; the second N-type MOS transistor and the P-type MOS transistor are connected in series to form a first CMOS inverter.
- the source of the P-type MOS transistor of the first CMOS inverter is connected to Vdd, and the source of the second N-type MOS transistor of the first CMOS inverter is connected to the drain of the first N-type MOS transistor; the source of the first N-type MOS transistor is grounded (Gnd), and the input signal at its gate is at the same level during operation as the signal connected to the gate of the P-type MOS transistor of the complementary transmission gate; the input of the first CMOS inverter is connected to a bit line, and its output is connected to the bottom plate of the calculation capacitor.
- to write "0", the bit line BL k and the signal B are both set to high level; to write "1", the bit line BL k is set low.
- during calculation, BL k is set to a high level, and the signal B is at the same level as the complementary input signal nA.
- the second N-type MOS transistor of the two embodiments of FIG. 2a and FIG. 2b is removed; see FIG. 2c for a schematic diagram of a subunit whose control unit has only two transistors.
- the drain of the first N-type MOS transistor is connected to the drain of the P-type MOS transistor and to the bottom plate of the calculation capacitor.
- the source of the first N-type MOS transistor and the source of the P-type MOS transistor are connected to a bit line.
- the gate of the first N-type MOS transistor is connected to the word-line signal B, whose level during calculation is the same as that of the gate of the P-type MOS transistor of the complementary transmission gate in the 6T structure.
- the gate of the P-type MOS transistor is connected to another word-line signal nB.
- to write "0", the bit line BL k is set to low level and B is set to high level; to write "1", the bit line BL k is set to a high level and nB is set to a low level.
- during calculation, BL k is set to low level, B is at the same level as the complementary input signal nA, and nB is set to high level. It can be understood that the subunit performing a 1-bit multiplication in this embodiment only needs 8 transistors.
- the process for the subunit to perform one-bit multiplication calculation is as follows:
- the bottom-plate voltage V btm of the calculation capacitor is either kept at 0 or driven to Vdd.
- the output of the multiplication is the bottom-plate voltage V btm of the calculation capacitor, expressed as Vdd ⋅ w ⋅ A.
- connecting the transmission gate to the bottom plate of the calculation capacitor minimizes calculation errors compared with the prior-art solution of connecting to the top plate of the calculation capacitor, in particular the errors caused by clock feedthrough when the MOS transistors switch, by charge injection when they turn from on to off, by the nonlinear parasitic capacitance at the drain/source of the complementary transmission-gate transistors, and by the leakage of the transistors themselves; a minimal behavioral sketch of this multiply follows below.
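For illustration only, here is a minimal Python model of the 1-bit multiply described above, assuming the idealized switching of Fig. 2a during the calculation phase; the function name and the assertion are editorial, not the patent's notation.

```python
VDD = 1.0

def subunit_multiply(w, A, B):
    """Behavioral model of the Fig. 2a subunit during calculation:
    B is held at the same level as the complementary input nA.
    When A = 1 (nA = B = 0) the complementary transmission gate passes
    the stored weight w to the capacitor bottom plate; when A = 0
    (nA = B = 1) the first N-type MOS transistor pulls the bottom plate
    to ground. The result is V_btm = VDD * w * A."""
    nA = 1 - A
    assert B == nA, "during calculation B tracks the complementary input nA"
    return VDD * w * A

# truth-table check: V_btm is VDD only when w = A = 1
for w in (0, 1):
    for A in (0, 1):
        print(w, A, subunit_multiply(w, A, B=1 - A))
```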
- the multiple subunits in the foregoing embodiment are used to form a computing unit, and the subunits are arranged in a feasible manner such as 2 ⁇ 2, 4 ⁇ 2, and the like.
- intuitively, this solution reduces the number of calculation capacitors and of MOS transistors that make up the control module; taking 2×2 subunits as an example, 3 control modules and 3 calculation capacitors are saved, as shown in Fig. 3b, Fig. 3c, and Fig. 3d.
- each subunit in the same calculation unit retains its own 6T structure, multiple calculation subunits form one calculation unit, and each subunit in the same calculation unit shares the same control module and one calculation capacitor; it can be understood that there is only one control module and one calculation capacitor per calculation unit. Intuitively, this sharing reduces the number of control modules and calculation capacitors required to implement the same number of independent subunits. Taking the calculation unit in Figure 3b as an example, multiple subunits share the same calculation capacitor, one first N-type MOS transistor, one second N-type MOS transistor, and one P-type MOS transistor.
- the output terminal of the complementary transmission gate of each subunit is connected to the drain of the same first N-type MOS transistor, the bottom plate of the same calculation capacitor, the drain of the same P-type MOS transistor, and the drain of the same second N-type MOS transistor.
- the control module is generally composed of transistors, so the more subunits share the transistors of the control module, the closer the number of transistors amortized per subunit gets to the number of transistors in the storage module, that is, the six transistors of the 6T structure.
- the subunits share devices, that is, multiple subunits used for 1-bit multiplication share one capacitor to store the calculation results.
- compared with connecting a capacitor to every 1-bit multiplication subunit to store the calculation result, this greatly increases the storage capacity within a given area, that is, more filter parameters or weights can be stored in the same area at one time than in the prior art.
- the subunits in the same computing unit are activated in a time-division multiplexing manner, that is, when one subunit is activated, other subunits in the same computing unit are deactivated. After the subunit is activated, perform the one-bit multiplication operation as described above.
- W 0a in the figure represents the value stored at the weight node of subunit a in the 0th calculation unit of a column.
- V btm0 represents the bottom-plate voltage of the 0th calculation unit in a column.
- the signal B i at the gate of the first N-type MOS transistor and the complementary input signal nA ij at the P-terminal gate of the complementary transmission gate of each subunit are controlled separately; under time-division multiplexing, although nA ij and B i of the subunit that is active at a given moment are at the same level, they can no longer share a node as before.
- when n subunits are combined into a calculation unit, the number of calculation capacitors and of control modules required is each reduced by n−1, that is, the numbers of calculation capacitors, first N-type MOS transistors, second N-type MOS transistors, and P-type MOS transistors are each reduced by n−1, and the structure that completes one 1-bit multiplication in the calculation unit approaches 6 transistors per subunit.
- the area occupied by the calculation capacitor is several times that of the storage structure of the subunit 6T.
- the shared method reduces the number of capacitors per unit area and can increase the storage capacity of the module composed of the calculation unit.
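As a back-of-the-envelope check of this sharing argument (a reconstruction assuming the three-transistor control module of Fig. 2a/2b and one shared calculation capacitor per calculation unit), the transistor count per 1-bit multiplier when n subunits share one control module is

$$T(n) = 6 + \frac{3}{n}, \qquad T(1) = 9,\quad T(4) = 6.75,\quad T(8) = 6.375, \quad T(n) \to 6 \text{ as } n \to \infty,$$

which is consistent with the 9-transistor standalone subunit of the abstract and the "close to six transistors" claim in the text.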
- the filter parameters stored in other subunits included in the same calculation unit can be used for in-memory calculations immediately, without the need to move data from the outside and store it in the subunit before performing calculations. Improved calculation speed.
- a MAC array is obtained by combining the sub-units and calculation units of the first aspect.
- the MAC array includes multiple calculation units, and the top plates of all calculation capacitors in the same column are connected to the same accumulation bus.
- each calculation unit includes at least one of the subunits, the output terminal of each complementary transmission gate in a calculation unit is connected to the bottom plate of the same calculation capacitor, and the voltage of each accumulation bus corresponds to the accumulated sum of that column's calculations.
- the MAC array can store more neural network parameters or values calculated by the upper layer network in the calculation unit that adopts the mode of sharing capacitors and transistors. Specifically, the calculation unit completes the calculation of 1-bit multiplication and stores the calculation result in the calculation capacitor, and the calculation units in the same column in the MAC array accumulate their respective 1-bit multiplication results through the same accumulation bus connected to the top plate of the calculation capacitor.
- the top plates of all calculation capacitors in the same column are connected together by an accumulation bus, whose voltage is V top (the accumulation-bus voltage).
- the MAC array performs multiplication and addition operations in the following "mode one":
- the filter parameters (or the activation values calculated by the upper layer of the network) are first written to each subunit according to the writing process and are stored in the two cross-coupled CMOS inverters of the subunit;
- the top-plate voltage V top of the calculation capacitors is reset to V rst through the reset switch S rst on the accumulation bus, and V rst can be 0;
- the bottom-plate voltage V btmi of each calculation capacitor is then either kept at 0 or driven to Vdd.
- the charge is redistributed in a column of calculation capacitors, similar to the charge redistribution in the capacitors of the SAR DAC. If the parasitic capacitance and other non-idealities are not considered, the analog output voltage V top of a column of calculation capacitors represents the cumulative result of the following formula, as shown in Figure 5.
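The formula referred to here is not reproduced in this excerpt. Under the stated idealizations (a column of n equal calculation capacitors, no parasitic capacitance, top plate floating after the reset switch S rst opens), charge redistribution gives the following reconstruction, offered as an editorial aid rather than the patent's exact expression:

$$V_{\mathrm{top}} = V_{\mathrm{rst}} + \frac{1}{n}\sum_{i=1}^{n} V_{\mathrm{btm},i} = V_{\mathrm{rst}} + \frac{V_{dd}}{n}\sum_{i=1}^{n} w_i A_i .$$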
- the MAC array can be operated according to the following "Method 2":
- the signals A ij and nA ij are activated in a time-division multiplexed manner.
- the V btmi of each calculation capacitor will either remain at 0 or be driven to Vdd. Then S rst is disconnected, the bottom-plate voltages V btmi are set to 0 or Vdd, and the MOS switches in the control module of each calculation unit run a successive-approximation algorithm for analog-to-digital conversion. Taking all V btmi set to 0 as an example, the voltage V top can be expressed by the reconstruction below.
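The expression is again missing from this excerpt. One plausible reconstruction, assuming the top plate is held at V rst through S rst while the products settle on the bottom plates, and a column of n equal capacitors with no parasitics, follows from charge conservation once S rst opens and all bottom plates are pulled to 0:

$$V_{\mathrm{top}} = V_{\mathrm{rst}} - \frac{1}{n}\sum_{i=1}^{n} V_{\mathrm{btm},i} = V_{\mathrm{rst}} - \frac{V_{dd}}{n}\sum_{i=1}^{n} w_i A_i .$$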
- the MAC array described in the second aspect or the second aspect can be used for the calculation of multi-bit weights.
- the calculation units of each column perform a bit-wise MAC operation, and the digital representations after analog-to-digital conversion are shifted and added.
- this operation obtains the output result for multi-bit weights.
- each column performs a bit-wise MAC: for example, the first column handles the lowest bit, i.e., the MAC of bit 0 of the weights with the input signal, and the k-th column handles the highest bit, i.e., the MAC of bit k−1 of the weights with the input signal.
- each column is thus equivalent to a MAC over one bit of a multi-bit binary weight.
- the MAC results obtained by all k participating columns contain k elements; finally, after analog-to-digital conversion, the k elements are shifted and added in the digital domain, as sketched below.
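A sketch of the digital bit-fusion step described above, assuming column j of the array holds bit j of a k-bit weight and y_cols[j] is that column's ADC output; the function name and the unsigned-weight assumption are illustrative, not the patent's specification.

```python
def fuse_multibit_weights(y_cols):
    """Shift-and-add the per-column MAC results in the digital domain.
    y_cols[0] is the column handling the least-significant weight bit,
    y_cols[k-1] the most-significant one (unsigned weights assumed)."""
    return sum(y << j for j, y in enumerate(y_cols))

# example: 3-bit weights, per-column digitized partial sums
print(fuse_multibit_weights([5, 2, 7]))  # 5*1 + 2*2 + 7*4 = 37
```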
- the MAC array further includes a second CMOS inverter and a differential calculation capacitor; see Figures 6a, 6b, and 6c.
- each calculation unit of the MAC array is connected to a second CMOS inverter and a differential calculation capacitor to obtain the differential architecture of the MAC array.
- the complementary transmission gate output terminals of all sub-units in each calculation unit are connected to the input terminal of the same second CMOS inverter, and the output terminal of the second CMOS inverter is connected to the bottom plate of the differential calculation capacitor;
- the top plates of all differential calculation capacitors in the same column are connected to the same differential accumulation bus.
- the calculation unit composed of the sub-units in the above embodiments is connected with a second CMOS inverter and a differential calculation capacitor as a differential unit.
- the subunits in the same differential unit share the same first N-type MOS transistor, second N-type MOS transistor (FIG. 6a, FIG. 6b), P-type MOS transistor, differential calculation capacitor, and second CMOS inverter; the subunits in the differential unit are likewise activated in the time-division multiplexed manner.
- Figure 7 shows the MAC array composed of the aforementioned differential unit.
- the output terminal of each complementary transmission gate in a differential unit is connected to the bottom plate of the same calculation capacitor, the top plates of all calculation capacitors in the same column are connected to the same accumulation bus, and the top plates of all differential calculation capacitors are connected to the same differential accumulation bus.
- a bit-width reconfigurable analog-digital hybrid computing module is provided; see FIG. 8, including: the MAC array of the second aspect or any possible implementation of the second aspect,
- whose column-wise accumulation result is expressed as an analog voltage, namely the capacitor top-plate voltage V top of the above embodiments; a filter/ifmap module, which provides the filter parameters written into and stored in the MAC array;
- the values written into and stored in the MAC array can also be the values output after the calculation of the upper layer of the network is completed;
- an ifmap/filter module, which provides the input of the MAC array, specifically the input of the complementary transmission gates in the calculation units, to be multiplied and added with the filter parameters or the activation values of the upper network layer;
- an analog-to-digital conversion module, which converts the analog voltage obtained by the MAC operation into a digital representation;
- and a digital processing module, which performs at least operations such as multi-bit fusion, offset, scaling, or non-linearity on the digital representation output by the analog-to-digital conversion module, the output result being a partial sum or an activation value that can be used directly as the input of the next layer of the network.
- when the module of the present application is used for the MAC calculation of a neural network, in general, for the same area the module includes more memory cells, that is, more pairs of cross-coupled CMOS inverters, whose storage property can be used to load filter parameters (weights) in advance at one time.
- the output partial sums and/or activation values (feature maps) that will be used for the calculation of the next network layer can immediately undergo MAC calculation with the filter parameters (weights) pre-loaded and stored in the module, which reduces the waiting time and power consumption of off-chip data transfer.
- the module offers both large throughput and large on-chip storage capacity.
- the storage units also store, in the MAC array, the activation values (feature maps) output by the network.
- the calculation units also share some of the transistors and other devices involved in the analog-to-digital conversion and digital processing.
- the analog-to-digital conversion module can be a SAR ADC with a parallel capacitor structure, which converts the top plate voltage V top output by the column-direction calculation unit into a digital representation, including MAC DAC, SAR DAC, comparator, switch sequence and SAR logic .
- SAR ADCs that use a parallel capacitor structure can make full use of the existing structure of the present invention, and achieve the effects of saving devices and reducing area.
- the MAC DAC is composed of capacitors of a column of calculation units in the aforementioned MAC array in parallel. It should be understood that the output voltage of the MAC DAC is V top .
- the reference voltage ratios that can be assigned from MSB to LSB are 1/2, 1/4, 1/8, and so on;
- the capacitance of the redundant capacitor C U is C/4.
- one end of each of the B capacitors and of the redundant capacitor are connected together in parallel;
- the other end of each of the B capacitors is connected to the switch sequence,
- while the other end of the redundant capacitor is always grounded (Gnd).
- the free end of each switch in the sequence can connect either to a VDD terminal or to a grounded Gnd terminal, and the SAR logic controls the switch sequence.
- the output voltage V top of the MAC DAC is used as the positive input V + of the comparator; the output V SAR of the SAR DAC is used as the negative input V - of the comparator, and the SAR logic controls the switch sequence so that the negative input V - successively approximates the positive input V + ; the final SAR logic output is the digital representation of V + , as modeled in the sketch below.
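An idealized software model of the successive-approximation loop just described (a plain binary search on V-; this is not the patent's circuit, and it ignores the redundant and half-LSB capacitors).

```python
def sar_convert(v_plus, n_bits, v_ref=1.0):
    """Each cycle the SAR logic switches the next capacitor's bottom plate
    so that V_SAR (= V-) approaches V+; the bit is kept if the comparator
    sees V+ >= V-, otherwise it is cleared."""
    code = 0
    v_sar = 0.0
    for b in range(n_bits - 1, -1, -1):
        trial = v_sar + v_ref / (2 ** (n_bits - b))  # add this bit's weight
        if v_plus >= trial:          # comparator decision
            v_sar = trial
            code |= 1 << b
    return code

print(sar_convert(0.40, n_bits=3))  # -> 3, since 0.40 is between 3 and 4 LSBs
```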
- the activation sparsity of the MAC array can prevent some capacitors in the SAR DAC from switching, thereby achieving higher energy efficiency and ADC conversion speed.
- if the number of MAC capacitors whose bottom-plate voltage V btmi is VDD is less than 25%, that is, if in a column of the MAC array the calculation units performing the 1-bit multiplications 1×0, 0×0, and 0×1 leave fewer than 1/4 of the units in the 1×1 case, then the first two capacitors of the SAR DAC can be left unswitched, namely the switches S B-1 and S B-2 of the switch sequence corresponding to C B-1 and C B-2 are set to the grounded Gnd terminal, instead of unconditionally activating all the capacitors in the SAR DAC for the digital-to-analog conversion, thereby saving energy, as illustrated in the toy sketch below.
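A toy generalization of the 25% example above, assuming a column of n_units capacitors; the patent only states the specific two-capacitor case, and the helper below is purely hypothetical.

```python
import math

def msb_caps_to_skip(num_vdd_bottom_plates, n_units):
    """If the count of 1x1 products is below n_units / 2^k, the top k MSB
    capacitors of the SAR DAC can stay grounded instead of being switched;
    e.g. fewer than 25% ones lets S_(B-1) and S_(B-2) stay at Gnd."""
    if num_vdd_bottom_plates == 0:
        return None  # nothing to convert
    frac = num_vdd_bottom_plates / n_units
    return max(0, -math.floor(math.log2(frac)) - 1)

print(msb_caps_to_skip(15, 64))  # 15/64 < 1/4 -> skip the 2 MSB capacitors
```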
- the connection modes of the V + side and the V - side of the comparator shown in the drawings of the present invention are only for convenience of description, in fact, the connection of the V + side and the V - side can be interchanged.
- V ref 0
- the MAC DAC on the positive-input (V + ) side and the SAR DAC on the negative-input (V - ) side of the comparator each have one additional capacitor,
- a half-LSB capacitor, connected in parallel; the other end of the half-LSB capacitor on the positive-input V + side is always grounded (Gnd), while the other end of the half-LSB capacitor on the negative-input V - side can be connected to the switch sequence. This produces a half-LSB voltage difference between the discrete analog levels of the MAC DAC and the SAR DAC, providing additional error tolerance.
- the above half LSB capacitor may be two lowest LSB capacitors connected in series to achieve good matching.
- the MAC DAC may be multiplexed as a SAR DAC through bottom-plate sampling.
- the positive-input V + side of the comparator is connected to the MAC DAC and a half-LSB capacitor.
- the capacitors of the 1st to (N-1)th units of the MAC DAC and the half-LSB capacitor can be connected either to the VDD terminal or to the grounded Gnd terminal of the switch sequence, and the capacitor of the Nth unit can optionally be connected to the grounded Gnd terminal; the negative-input V - side of the comparator is connected not to a capacitor but to the voltage V ref .
- the MAC DAC in this embodiment is also a SAR DAC.
- corresponding to "mode two", the positive input V + of the SAR comparator starts from V rst , the top-plate voltage V top having been reset to the desired V rst by the capacitor reset switch S rst .
- the same capacitor array is used to implement both the MAC operation and the analog-to-digital conversion, avoiding the mismatch and accuracy loss caused by differences between the capacitor array of the MAC DAC used in the MAC operation and that of the SAR DAC used in the analog-to-digital conversion stage, and allowing a fully differential SAR ADC to be realized.
- the transistors required to implement the switching sequence in this embodiment are already included in the control module in the aforementioned calculation unit, and no additional transistors need to be added.
- FIG. 13 shows a differential MAC architecture, which solves the problem of common-mode-dependent comparator input offset voltage.
- the positive-input V + side of the comparator is connected to the MAC DAC and an additional LSB capacitor.
- the capacitors of the 1st to (N-1)th units of the MAC DAC and the additional LSB capacitor can be connected either to the VDD terminal or to the grounded Gnd terminal of the switch sequence, and the capacitor of the Nth unit can be connected to the grounded Gnd terminal; the negative-input V - side of the comparator is connected to the differential MAC DAC and an additional LSB capacitor.
- the capacitors of the 1st to (N-1)th units of the differential MAC DAC and the additional LSB capacitor can all be connected to the switch sequence, and the capacitor of the Nth unit can optionally be connected to the grounded Gnd terminal.
- the differential MAC DAC is composed of a column of differential calculation capacitors in the MAC array. It should be noted that the differential MAC architecture needs to be combined with the aforementioned differential structure modules to be implemented. It should be particularly pointed out that the transistors required to implement the switching sequence in this embodiment are already included in the control module in the aforementioned differential calculation unit, and no additional transistors need to be added.
- the bit width of a column's SAR ADC can be determined in real time from the sparsity of the input data and of the values stored in the column, so that the average number of capacitors in the binary-weighted capacitor array that need to be charged and discharged during the analog-to-digital conversion can be greatly reduced, greatly saving the energy consumption of the analog-to-digital conversion.
- the real-time bit width of the SAR ADC can be calculated as ceil(log2(min(X, W) + 1)).
- ceil is the round-up function
- min is the minimum value function
- X is the number of 1s in the 1-bit input vector,
- which can be calculated from the 1-bit inputs X 1 to X m through an adder tree;
- W is the number of 1s stored in a column of the calculation array,
- which can be calculated off-chip from the values W 1 to W m stored in the subunits of the column
- and is already held in the SAR logic by the time the data are stored in the calculation array.
- the min, log 2 and ceil functions in the formula for calculating the bit width can be replaced by simple digital combination logic to get the same calculation result.
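A direct software rendering of this bit-width rule, with X obtained from the 1-bit input vector and W assumed precomputed off-chip as described; the function and argument names are illustrative only.

```python
import math

def sar_bit_width(input_bits, w_ones):
    """Real-time SAR ADC bit width for one column:
    ceil(log2(min(X, W) + 1)).
    input_bits : the m 1-bit inputs X_1..X_m applied to the column
    w_ones     : W, the number of 1s stored in the column (known off-chip)"""
    x_ones = sum(input_bits)                      # adder-tree equivalent
    return math.ceil(math.log2(min(x_ones, w_ones) + 1))

print(sar_bit_width([1, 0, 1, 1, 0, 1, 1, 0], w_ones=3))  # min(5,3)+1=4 -> 2 bits
```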
- the modules included are only divided according to the functional logic, but not limited to the above-mentioned division, as long as the corresponding function can be realized; in addition, the specific name of each functional unit is also It is just for the convenience of distinguishing each other, and is not used to limit the scope of protection of the present invention.
- the "first N-type MOS transistor" and the “second N-type MOS transistor” in the embodiment only distinguish devices at different connection positions, and it is not understandable. For a specific device.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Computer Hardware Design (AREA)
- Neurology (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Analogue/Digital Conversion (AREA)
- Static Random-Access Memory (AREA)
- Dram (AREA)
Abstract
A subunit for analog-digital hybrid in-memory computing is used for 1-bit multiplication and requires only 9 transistors. On this basis, multiple subunits share a calculation capacitor and transistors to form one calculation unit, so that the average number of transistors per subunit approaches 6. A MAC array for multiply-accumulate calculation is further proposed, which contains multiple calculation units, and the subunits within each unit are activated in a time-division multiplexed manner. Further, a differential scheme for the MAC array is proposed to improve the fault tolerance of the calculation. Further, an in-memory analog-digital hybrid computing module is proposed, which digitizes the parallel analog outputs of the MAC array and performs further digital-domain operations. The analog-to-digital conversion module in the computing module makes full use of the capacitors of the MAC array, which both reduces the area of the computing module and reduces calculation errors. Further, a method is proposed that fully exploits data sparsity to save the energy consumption of the analog-to-digital conversion module.
Description
This application claims priority to the Chinese patent application No. 202010382467.2, entitled "Subunit, MAC array, and bit-width reconfigurable analog-digital hybrid in-memory computing module", filed with the Chinese Patent Office on May 8, 2020, the entire contents of which are incorporated herein by reference.
本发明涉及一种模数混合存内计算领域,并且更具体地,涉及一种子单元、MAC阵列、位宽可重构的模数混合存内计算模组。
目前,现有移动和物联网之类的新兴边缘应用要求高能效和高单位面积的运算速率。高能效意味着更长的电池寿命,而高单位面积的运算速率意味着在指定的运算速率下减小面积,进而降低成本。如今,深度神经网络(Deep NeuralNetwork,DNN)中的前馈推理计算以乘法累加(Multiply-And-Accumulate,MAC)计算为主导,需要MAC计算的高能效和低面积的实现,同时减少待处理数据的搬运量。传统数字集成电路实现MAC有抗噪声能力强、精度高、扩展性好、设计方法成熟等优点,但是数字电路占用的芯片面积大,功耗大,难以实现高能效的大规模神经网络。并且传统数字电路采用的冯诺依曼结构带来的存储器和中央运算单元之间的数据交换瓶颈在DNN应用中的大规模数据搬运下会严重限制运算能效和运算速度。模拟电路实现MAC具有结构简单、功耗较低的优点,所以模拟和模数混合信号计算具有实现高能效的潜力。而为了打破冯诺依曼架构的瓶颈,近年来成为研究热点的存内计算从本质上无法以纯数字电路的形式实现,需要模拟电路的辅助。同时由于DNN对包括电路噪声造成的计算错误的承受能力较高,DNN专用集成电路(ASIC)正重新引起关注。
论文“A mixed-signal binarized convolutional-neural-network accelerator integrating dense weight storage and multiplication for reduced data movement”,DOI:10.1109/VLSIC.2018.8502421(以下称“论文1”)和论文“A Microprocessor implemented in 65nm CMOS with configurable and bit-scalable accelerator for programmable in-memory computing”,arXiv:1811.04047(以下称“论文2”),阐述1位MAC计算的乘法阶段是等效于1位权重和1位输入进行同或(XNOR)运算,把XNOR运算结果以电压的形式存储到电容器,加法阶段是利用电荷共享,每个电容器的电荷相同但所有电容器的总电荷不变,得出1位MAC计算结果。上述1位MAC计算的每个1位计算单元都有10个晶体管。论文1和论文2的现有技术存在的问题为:(1)对于每个加法操作,将无条件驱动每个计算单元中的传输门,而无法利用输入数据的稀疏性达到节省能耗的目的;(2)每一个进行1位乘法的运算单元配置一个独立电容器,逐次逼近型(Successive Approximation,SAR)模拟数字转换器(Analog to Digital Converter,ADC)的金属氧化物金属(Metal Oxide Metal,MOM)电容器位于静态随机存储器(Static Random Access Memory,SRAM)计算阵列之外,因为该阵列内部没有空间,从而降低了面积效率;(3)利用电荷共享的加法阶段需要连接存储XNOR运算结果的电容器的顶板。这种电路拓扑使加法容易受到非理想效应的影响,例如电荷注入,时钟馈通,传输门晶体管的漏极或源极处的非线性寄生电容,以及连接到电容器顶板的晶体管的漏电等,从而导致计算错误。此外,因为物理版图的不匹配而带来的运算电容器与ADC中的数模转换器里的电容器之间的不匹配也会导致计算错误。
论文“An always-on 3.8μJ/86%CIFAR-10 mixed-signal binary CNN processor with all memory on chip in 28nm CMOS”,DOI:10.1109/ISSCC.2018.8310264(以下称“论文3”)提出一种仅支持二进制化的权重和激活值的二值神经网络(BNN)的运算模组。论文3中的运算模组的不足为:(1)该架构只支持BNN,无法用于视觉应用的大型DNN模型、例如对象检测等,适用范围小;(2)1位MAC计算的乘法阶段至少需要一个或(OR)门,两个同或(XNOR)门,两个异或(NOR)门和一个锁存器,使用的晶体管数量多,面积占用大。
论文“Conv-RAM:an energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications”,DOI:10.1109/ISSCC.2018.8310397(以下称“论文4”)提出一种具有嵌入式卷积计算功能的节能SRAM。论文4中的SRAM的不足有:(1)每个1位计算SRAM单元具有10个晶体管。每个单元中的晶体管数越高,存储密度越小;(2)利用位线上的寄生电容存储电荷,以用于随后的平均操作。与如MOM电容器之类的显式电容器相比,位线寄生电容的建模不充分,并且可能遭受更大的失配,导致较低的计算精度;(3)论文内所使用的水平电荷平均方法需要6个额外的晶体管,这些晶体管在几行单元之间共享,限制了吞吐量,因为并非所有行都可以同时执行计算;(4)差分电荷平均线Vp
AVG和Vn
AVG上的共模电压取决于输入数据X
in的大小,在通过局部MAV电路评估平均值后,此共模电压是不恒定的。因此差分结构的高效率高速ADC,例如SARADC并不适用。文章的方案采用了最大占用2
N-1(N是ADC分辨率)次步骤进行收敛的基于斜坡的ADC,降低了模数转换的速度,导致了较低的计算吞吐量;(5)阵列的输入使用额外的DAC电路将输入数据X
in(通常是特征图)从数字表示转换为模拟表示,DAC电路的非理想特性会导致更多的精度损失以及面积和能量的开销。
综上所述,现有技术中的MAC阵列中进行1位乘法的计算单元使用的晶体管多;存储乘法结果用于累加的电容器与存储单元一一对应,即存储单元的个数与电容器的数量相同,而电容一般会比SRAM单元大很多,特别是在先进工艺制程下,会导致MAC阵列占用面积大;同时存在乘加运算中晶体管的无条件的驱动,导致运算能效不高;另外,计算错误率高导致适用场景有限等。
因此,在模数混合存内计算领域,亟需一种面积小、能效高、容错能力好的位宽可重构的模数混合存内计算的运算模组。
Summary of the Invention
In view of this, the present invention provides a sub-cell, a MAC array, and a bit-width reconfigurable mixed-signal in-memory computing module. To reduce computation errors, a differential implementation of the MAC array is also provided. To achieve the above objectives, the present invention adopts the following technical solutions:
In a first aspect, an in-memory mixed-signal computing sub-cell is provided, comprising a storage module, a calculation capacitor and a control module.
The storage module comprises two cross-coupled CMOS inverters and a complementary transmission gate. The two cross-coupled inverters store a 1-bit filter parameter; the gate of the N-type MOS transistor of the transmission gate is connected to the input signal and the gate of its P-type MOS transistor is connected to the complementary input signal; the output of one of the inverters is connected to the input of the transmission gate, and the output of the transmission gate is connected to the bottom plate of the calculation capacitor and to the control module.
The product of the input signal and the filter parameter is stored as the voltage on the bottom plate of the calculation capacitor.
Several sub-cells form one calculation unit, and all sub-cells within the same calculation unit share the same control module and one calculation capacitor.
In this solution, the 1-bit filter parameter or weight w is written into and stored in the two cross-coupled CMOS inverters; the input signal A drives the gate of the N-type MOS transistor of the transmission gate and the complementary input signal nA drives the gate of its P-type MOS transistor; the product of A and w is stored as the bottom-plate voltage of the calculation capacitor. Several sub-cells form one calculation unit and share the same control module and calculation capacitor, arranged in any feasible pattern such as 2×2 or 4×2. Intuitively, this reduces the number of control modules built from MOS transistors: taking 2×2 sub-cells as an example, three control modules and three calculation capacitors are saved.
In some embodiments the control module comprises a first N-type MOS transistor, a second N-type MOS transistor and a P-type MOS transistor, the gate of the first N-type MOS transistor being connected to signal B. In particular, for one computing sub-cell the level of the complementary input signal nA equals that of signal B during computation. The output of one of the two cross-coupled CMOS inverters is connected to the input of the transmission gate. The source of the second N-type MOS transistor is grounded (Gnd) and its gate is connected to one bit line; the source of the P-type MOS transistor is connected to Vdd and its gate to the other, complementary bit line. This topology avoids unconditional driving of the transmission gate and improves energy efficiency. For example, when B = 0, nA = 0, A = 1 and w = 1, the branch connecting the calculation capacitor to the N-type MOS transistor is off while the branch connecting the transmission gate to the calculation capacitor conducts, and the product of w and A is stored as the bottom-plate voltage V_btm of the calculation capacitor. In this way a sub-cell that completes a 1-bit multiplication (filter parameter w by input signal A) needs only nine transistors, reducing the sub-cell area. The transmission gate is kept away from the top plate of the calculation capacitor on which charge is accumulated, which minimizes computation errors, in particular those caused by clock feedthrough when the MOS transistors act as switches, charge injection at turn-off, nonlinear parasitic capacitance at the drain/source of the transmission-gate transistors, and transistor leakage.
In combination with the first aspect and its possible embodiments, in some embodiments the connections inside the control module are changed to reduce the number of bit lines and ease routing in the physical layout: the second N-type MOS transistor and the P-type MOS transistor connected to the bottom plate of the calculation capacitor are connected in series to form a first CMOS inverter. The source of the P-type MOS transistor of this first inverter is connected to Vdd, and the source of its second N-type MOS transistor is connected to the drain of the first N-type MOS transistor; the source of the first N-type MOS transistor is grounded, and the signal at its gate has the same level during computation as the signal connected to the gate of the P-type MOS transistor of the transmission gate; the input of the first CMOS inverter is connected to a bit line and its output to the bottom plate of the calculation capacitor.
In combination with the first aspect and its possible embodiments, in some embodiments, to further reduce the number of transistors per calculation unit and to ease read-out of the stored content, the control module keeps the first N-type MOS transistor and the P-type MOS transistor and removes the second N-type MOS transistor. The drain of the first N-type MOS transistor is connected to the drain of the P-type MOS transistor and to the bottom plate of the calculation capacitor; their sources are connected together and to one bit line. The gate of the first N-type MOS transistor is connected to a control word line whose level during computation equals that of the gate of the P-type MOS transistor of the transmission gate; the gate of the P-type MOS transistor is connected to another control word line.
In combination with the first aspect and its possible embodiments, the more sub-cells share the MOS transistors and the calculation capacitor of the control module, the more the required device count and area are amortized over the sub-cells, and the closer the transistor count per sub-cell approaches six.
In combination with the first aspect, in some embodiments the sub-cells within a calculation unit are activated in a time-multiplexed manner: when one sub-cell is activated, the other sub-cells of the same unit are deactivated, and the signal at the gate of the first N-type MOS transistor of each calculation unit has the same level as the gate of the P-type MOS transistor of the transmission gate of the sub-cell that is active at that moment. After one sub-cell has taken part in a computation, the filter parameters stored in the other sub-cells of the same unit can be used immediately for in-memory computation without first moving data from outside into the sub-cells, which improves computing speed and data throughput while reducing energy and area consumption.
In a second aspect, a MAC array comprising the sub-cells of the first aspect and its possible embodiments is provided for multiply-accumulate computation. It comprises multiple calculation units; within each unit the outputs of the transmission gates of all sub-cells are connected to the same bottom plate of the same calculation capacitor, the top plates of the calculation capacitors of all units in a column are connected to the same accumulation bus, and the voltage on each accumulation bus corresponds to the accumulated sum of the multiplications of that column.
In this solution, since one capacitor generally occupies several times the area of an SRAM cell, sharing a calculation capacitor among multiple 1-bit multiplication sub-cells, instead of attaching one capacitor to every sub-cell as in other designs, greatly increases the storage capacity per unit area. For in-memory computing, reducing the movement of data on and off chip is one of the main ways to reduce energy consumption; here, a MAC array of given area contains more cross-coupled CMOS inverters and can therefore hold more neural-network filter parameters at once, reducing data movement.
In combination with the second aspect, in some embodiments the MAC array further comprises second CMOS inverters and differential calculation capacitors: in each calculation unit of the MAC array, the outputs of the transmission gates of all sub-cells are connected to the input of the same second CMOS inverter, whose output is connected to the bottom plate of a differential calculation capacitor; the top plates of all differential calculation capacitors in a column are connected to the same differential accumulation bus.
In a third aspect, a bit-width reconfigurable mixed-signal MAC calculator is provided, comprising: the MAC array of the second aspect or any of its possible implementations, in which the column-wise accumulated result after computation is represented as an analog voltage; a filter/ifmap module that supplies the filter parameters, or the activations computed by the previous network layer, to be written into and stored in the MAC array; an ifmap/filter module that supplies the array inputs to be multiplied and accumulated with those stored filter parameters or previous-layer activations; an analog-to-digital conversion module that converts the analog voltage obtained after the MAC operation into a digital representation; and a digital processing module that performs multi-bit fusion, biasing, scaling or nonlinear operations on that digital representation and outputs either partial sums or activations that can be fed to the next network layer.
In this solution, the filter parameters, or the activations computed by the previous layer, are written through the filter/ifmap module and stored in the MAC array, setting the cross-coupled inverter pair of each sub-cell to logic 1 or 0, and are multiplied and accumulated with the inputs supplied by the ifmap/filter module. In this process the multiplication of each stored value with its input is a digital operation equivalent to an AND, and its result is stored on the calculation capacitor. In the addition phase, because the top plates of all calculation capacitors of a column are tied to the same accumulation bus, the charge stored on the different capacitors is shared over that bus and the column-wise accumulated product is stored as an analog voltage. The analog result is then converted to a digital representation by the analog-to-digital conversion module and finally processed to output partial sums or activations usable as the next layer's input. In conventional digital neural-network implementations the MAC consumes most of the energy; here the MAC is performed as a mixed-signal operation, which greatly reduces energy consumption, while the small-area MAC array improves energy efficiency and computing speed. Combining different computation styles for the different stages of the network exploits the respective advantages of analog and digital computation and ensures a low-power, energy-efficient, fast and accurate computation process.
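As a rough illustration of this dataflow, the sketch below models one layer of the module behaviorally in Python: storage of 1-bit weights, AND-style 1-bit multiplication, ideal charge-sharing accumulation, an ideal quantizer standing in for the SAR ADC, and digital post-processing. All function names, the ideal-capacitor assumption and the use of a ReLU are illustrative choices, not taken from the patent.

```python
import numpy as np

def column_mac(weights_bits, input_bits, vdd=1.0, v_rst=0.0):
    """One MAC-array column: AND multiply per unit, then ideal charge sharing."""
    products = weights_bits & input_bits              # digital multiply (AND)
    v_btm = vdd * products                            # bottom-plate voltages (0 or Vdd)
    return v_rst + v_btm.mean()                       # shared top-plate voltage

def adc(v_top, bits, vdd=1.0):
    """Ideal B-bit quantizer standing in for the SAR ADC."""
    return int(round(v_top / vdd * (2 ** bits - 1)))

def layer(weights, inputs, bias=0, scale=1.0, bits=6):
    """Digital post-processing: bias, scale and ReLU on the digitized column sums."""
    codes = np.array([adc(column_mac(col, inputs), bits) for col in weights.T])
    return np.maximum(scale * codes + bias, 0)        # activations for the next layer

rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(64, 4), dtype=np.uint8)  # 64 rows x 4 columns of 1-bit weights
x = rng.integers(0, 2, size=64, dtype=np.uint8)       # 1-bit input vector
print(layer(W, x))
```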
In combination with the third aspect, in one possible embodiment the analog-to-digital conversion module is a SAR ADC, specifically a SAR ADC with a binary-weighted capacitor structure.
In combination with the third aspect and the first possible embodiment, in a second embodiment the sparsity of the MAC array's input values and stored values allows the switching sequences of some capacitors in the SAR DAC to be skipped, yielding higher energy efficiency and faster ADC conversion. Put differently, the bit width of each column's SAR ADC can be determined in real time by the sparsity of the input and stored values.
In combination with the third aspect or its possible embodiments, in a third possible embodiment the MAC DAC and the SAR DAC can be connected together; it should be understood that the MAC DAC refers to one column of calculation capacitors of the MAC array, i.e. the capacitors of one MAC-array column are connected in parallel with the capacitors of the SAR DAC.
In combination with the third aspect or its possible embodiments, in other embodiments the MAC DAC is allowed to be reused as the SAR DAC through bottom-plate sampling, so that the same capacitor array performs both the MAC operation and the analog-to-digital conversion. This avoids the mismatch and accuracy loss caused by using different capacitor arrays for the MAC DAC in the MAC phase and the SAR DAC in the conversion phase, and it furthermore enables a fully differential SAR ADC, which better addresses the problem of the comparator's common-mode-dependent input offset voltage.
Brief Description of the Drawings
Figure 1a is a schematic diagram of a sub-cell in an embodiment of the present invention;
Figure 1b is a schematic diagram of the 6T structure within a sub-cell in an embodiment of the present invention;
Figure 2a is a schematic diagram of a sub-cell structure in an embodiment of the present invention;
Figure 2b is a schematic diagram of a sub-cell structure in another embodiment of the present invention;
Figure 2c is a schematic diagram of a sub-cell structure in another embodiment of the present invention;
Figure 2d is a truth table of a 1-bit multiplication sub-cell in an embodiment of the present invention;
Figure 3a is a schematic diagram of the arrangement of sub-cells within a calculation unit in an embodiment of the present invention;
Figure 3b is a schematic diagram of a calculation unit composed of multiple sub-cells in an embodiment of the present invention;
Figure 3c is a schematic diagram of a calculation unit composed of multiple sub-cells in another embodiment of the present invention;
Figure 3d is a schematic diagram of a calculation unit composed of multiple sub-cells in another embodiment of the present invention;
Figure 3e is a truth table of the sub-cells within a calculation unit in an embodiment of the present invention;
Figure 4 is a schematic diagram of a MAC array containing calculation units in an embodiment of the present invention;
Figure 5 is a schematic diagram of the bottom- and top-plate voltages of the calculation capacitors in an embodiment of the present invention;
Figure 6a is a schematic diagram of a calculation unit connected to a second CMOS inverter and a differential calculation capacitor in an embodiment of the present invention;
Figure 6b is a schematic diagram of a calculation unit connected to a second CMOS inverter and a differential calculation capacitor in another embodiment of the present invention;
Figure 6c is a schematic diagram of a calculation unit connected to a second CMOS inverter and a differential calculation capacitor in another embodiment of the present invention;
Figure 7 is a schematic diagram of a MAC array under the differential architecture in an embodiment of the present invention;
Figure 8 is a schematic diagram of an in-memory computing module in an embodiment of the present invention;
Figure 9 is a schematic diagram of an analog-to-digital conversion module in an embodiment of the present invention;
Figure 10 is a schematic diagram of an analog-to-digital conversion module in another embodiment of the present invention;
Figure 11 is a schematic diagram of an analog-to-digital conversion module in another embodiment of the present invention;
Figure 12 is a schematic diagram of an analog-to-digital conversion module in another embodiment of the present invention;
Figure 13 is a schematic diagram of a differential analog-to-digital conversion module structure in another embodiment of the present invention;
Figure 14 is a schematic diagram of an architecture for reducing the energy consumption of analog-to-digital conversion in an embodiment of the present invention.
To make the objectives, principles, technical solutions and advantages of the invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that, as stated in the Summary, the specific embodiments described here are intended to explain the invention and not to limit it.
It should be specially noted that connections or positional relationships that can be determined from the text or technical content of the specification have been partly omitted, or not all positional variations have been drawn, for the sake of concise figures; such omitted or undrawn variations cannot be regarded as undescribed and, for brevity, are not enumerated one by one but are explained here collectively.
In addition, the terms "first", "second" and the like are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the technical features referred to.
As a common application scenario, the bit-width reconfigurable mixed-signal computing module provided by the embodiments of the present invention can be used in vision and acoustic DNN architectures, more specifically for object detection, low-power acoustic feature extraction, and the like.
Taking feature extraction as an example, the data to be processed are convolved with filters composed of weights in the feature extractor, and the corresponding feature maps/activations are output; different filters extract different features. In this process the convolution between the data and the filters consumes the most energy, so energy waste caused by situations such as unconditional driving of circuits must be avoided, especially when the data to be processed form a sparse matrix.
The present application proposes a sub-cell for in-memory computing, as shown in Figure 1a, comprising a storage module, a calculation capacitor and a control module. The storage module comprises two cross-coupled CMOS inverters and a complementary transmission gate; the two cross-coupled inverters store a 1-bit filter parameter; the gate of the N-type MOS transistor of the transmission gate is connected to the input signal A and the gate of its P-type MOS transistor to the complementary input signal nA; the output of one of the inverters is connected to the input of the transmission gate, and the output of the transmission gate is connected to the bottom plate of the calculation capacitor and to the control module. Note that the transmission gate is a bidirectional device; in this invention its "input" refers to the terminal connected to the output of one of the CMOS inverters. The gates of its N-side and P-side transistors can be connected either to the word-line signals WL and nWL, to control writing or reading of the weight w, or to the input signal A and the complementary input signal nA, to take part in the 1-bit multiplication. In Figure 1a, w is short for weight and nW for negative weight, denoting the weight and complementary-weight nodes respectively.
The product of the input signal A and the weight w is stored as the bottom-plate voltage V_btm of the calculation capacitor.
For convenience of description, the plate of the calculation capacitor connected to the transmission gate is called its bottom plate and the plate connected to the accumulation bus its top plate. The storage module formed by the two cross-coupled CMOS inverters and one complementary transmission gate is called the 6T structure (6T sub-cell, containing six transistors), see Figure 1b.
Multiple sub-cells form one calculation unit, and all sub-cells within the same calculation unit share the same control module and one calculation capacitor.
Figure 2a shows an embodiment of a sub-cell for 1-bit multiplication, in which the control module comprises a first N-type MOS transistor, a second N-type MOS transistor and a P-type MOS transistor. The source of the first N-type MOS transistor is grounded (Gnd), and its drain, the drain of the second N-type MOS transistor, the drain of the P-type MOS transistor and the output of the transmission gate are all connected to the same bottom plate of the calculation capacitor. The gate of the N-type MOS transistor of the transmission gate is connected to the input signal A, and the signal B at the gate of the first N-type MOS transistor has the same level as the complementary input signal nA connected to the gate of the P-type MOS transistor of the transmission gate. The source of the second N-type MOS transistor is grounded and its gate is connected to one bit line BL_k; the source of the P-type MOS transistor is connected to Vdd and its gate to the other bit line nBL_k.
In some possible embodiments the complementary input signal nA and signal B may share a node and thus be supplied with the same level. It should be understood that signal B at the gate of the first N-type MOS transistor not only resets the bottom-plate voltage of the calculation capacitor so that the sub-cell can be used for the next computation, but also takes part in the 1-bit multiplication, as shown in the truth table of Figure 2d. The filter parameter or weight w written into and stored by the cross-coupled inverter pair is multiplied by the input signal A at the N-side of the transmission gate, and the product is stored as the bottom-plate voltage of the calculation capacitor. For example, when the transmission gate is used to write w, the word-line signals received by its N-side and P-side gates must keep the transmission gate on: specifically, the word line WL at the N-side gate is set to Vdd and the word line nWL at the P-side gate is set to 0. To write "0", the bit lines BL_k and nBL_k are both driven high, turning the second N-type MOS transistor on and the P-type MOS transistor off; to write "1", BL_k and nBL_k are both driven low, turning the second N-type MOS transistor off and the P-type MOS transistor on. During in-memory computation, BL_k is driven low and nBL_k high, ensuring that the multiplication result of the in-memory computation through the transmission gate is stored on the calculation capacitor.
Further, too many bit lines make routing a serious challenge. To reduce the number of bit lines and ease routing in the physical layout, another embodiment optimizes the bit-line design of the control module described above; see Figure 2b for the optimized sub-cell. Specifically, in the control module containing a first N-type MOS transistor, a second N-type MOS transistor and a P-type MOS transistor, compared with the embodiment of Figure 2a the second N-type MOS transistor and the P-type MOS transistor are connected in series to form a first CMOS inverter. The source of the P-type MOS transistor of this first inverter is connected to Vdd and the source of its second N-type MOS transistor to the drain of the first N-type MOS transistor; the source of the first N-type MOS transistor is grounded, and the signal at its gate has the same level during computation as the signal connected to the gate of the P-type MOS transistor of the transmission gate; the input of the first CMOS inverter is connected to one bit line and its output to the bottom plate of the calculation capacitor. In this embodiment, when a filter parameter or weight is written into the 6T structure, to write 0 the bit line BL_k and B are both driven high, and to write 1 BL_k is driven low. During in-memory computation BL_k is driven high and signal B has the same level as the complementary input signal nA.
Further, to reduce the number of transistors per sub-cell still more, in another embodiment the second N-type MOS transistor of the embodiments of Figures 2a and 2b is removed; see Figure 2c, a sub-cell whose control module has only two transistors. The drain of the first N-type MOS transistor is connected to the drain of the P-type MOS transistor and to the bottom plate of the calculation capacitor; the sources of the two transistors are connected to one bit line. The gate of the first N-type MOS transistor is connected to the word-line signal B, whose level during computation equals that of the gate of the P-type MOS transistor of the transmission gate in the 6T structure; the gate of the P-type MOS transistor is connected to another word-line signal nB. In this embodiment, to write 0 into the 6T structure, BL_k is driven low and B high; to write 1, BL_k is driven high and nB low. During in-memory computation BL_k is driven low, B has the same level as the complementary input signal nA, and nB is driven high. It can be seen that in this embodiment a sub-cell completing a 1-bit multiplication needs only eight transistors.
Optionally, a sub-cell performs a 1-bit multiplication as follows:
1. The top-plate voltage V_top of the calculation capacitor is reset to V_rst through the reset switch S_rst on the accumulation bus.
2. The gate signal B of the first N-type MOS transistor of the sub-cell is raised to Vdd, turning the transistor on and resetting the bottom-plate voltage V_btm of the capacitor to 0, while the transmission-gate inputs A and nA of the sub-cell are held at 0 and Vdd respectively. After V_btm has been reset to 0, S_rst is disconnected.
3. During computation the signals A and nA are activated; the truth table of the 1-bit multiplication when the sub-cell is activated is shown in Figure 2d.
4. When the multiplication is finished, the bottom-plate voltage V_btm of the calculation capacitor either remains at 0 or goes to Vdd; the output of the multiplication is V_btm, expressed as Vdd × w × A.
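A compact behavioral model of steps 1 to 4 above, under ideal-switch assumptions (the function name is illustrative and the sketch is not part of the original disclosure):

```python
def subcell_multiply(w, A, vdd=1.0):
    """Steps 2-4: reset the bottom plate, then apply (A, nA) and store w*A on it."""
    v_btm = 0.0                      # step 2: B = Vdd pulls the bottom plate to 0
    if A == 1 and w == 1:            # step 3: transmission gate passes the stored '1'
        v_btm = vdd
    return v_btm                     # step 4: V_btm = Vdd * w * A

# Truth table: only w = 1, A = 1 charges the bottom plate.
assert [subcell_multiply(w, a) for w in (0, 1) for a in (0, 1)] == [0.0, 0.0, 0.0, 1.0]
```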
It can be understood that a sub-cell completing a 1-bit multiplication (filter parameter w by input signal A) needs only nine, or even only eight, transistors, which reduces the sub-cell area and improves energy efficiency compared with the prior art. In addition, connecting the transmission gate to the bottom plate of the calculation capacitor, rather than to its top plate as in the prior art, minimizes computation errors, in particular those caused by clock feedthrough when the MOS transistors are used as switches, charge injection at turn-off, nonlinear parasitic capacitance at the drain/source of the transmission-gate transistors, and leakage of the transistors themselves.
Several of the sub-cells of the above embodiments form one calculation unit, arranged in any feasible pattern such as 2×2 or 4×2; see Figure 3a. Intuitively this reduces the number of calculation capacitors and of MOS transistors making up the control module: with 2×2 sub-cells, three control modules and three calculation capacitors are saved. Figures 3b, 3c and 3d correspond to embodiments of calculation units built from the sub-cells described above. Every sub-cell of a unit keeps its own 6T structure, while all sub-cells of the same unit share the same control module and one calculation capacitor, so a calculation unit contains only one control module and one calculation capacitor. Intuitively, this sharing reduces the number of control modules and calculation capacitors needed to realize the same number of independent sub-cells. Taking the calculation unit of Figure 3b as an example, the sub-cells share the calculation capacitor, one first N-type MOS transistor, one second N-type MOS transistor and one P-type MOS transistor; the output of the transmission gate of every sub-cell is connected to the drain of the same first N-type MOS transistor, the same bottom plate of the calculation capacitor, the drain of the same P-type MOS transistor and the drain of the same second N-type MOS transistor. Since the control module is generally built from transistors, the more sub-cells share its transistors, the closer the amortized per-sub-cell transistor count approaches the count of the storage module alone, i.e. the six transistors of the 6T structure.
Moreover, because a single capacitor generally occupies several times the area of the 6T structure, sharing one capacitor among several 1-bit multiplication sub-cells to store the computation result, instead of attaching a separate capacitor to each sub-cell, greatly increases the storage capacity within a given area: more filter parameters or weights can be stored at once in the same area than in the prior art.
Further, the sub-cells of the same calculation unit are activated in a time-multiplexed manner: when one sub-cell is activated, the others in the same unit are deactivated. An activated sub-cell performs the 1-bit multiplication described above; the truth table of the computation within a unit is shown in Figure 3e, where W_0a denotes the value stored at the weight node of sub-cell a of the 0th calculation unit of a column and V_btm0 the bottom-plate voltage of the 0th calculation unit of that column. Specifically, in some embodiments the signals at the N-side and P-side gates of a sub-cell's transmission gate are A_ij and nA_ij, where i, a non-negative integer from 0 to n−1, indexes the unit within a column and j indexes the sub-cell within the unit (j = a, b, c, d in a 2×2 unit). It can be understood that sharing transistors and a capacitor among sub-cells means that one calculation unit contains several sub-cells capable of multiplication. Note that, unlike a single stand-alone sub-cell, when several sub-cells form a unit the signal B_i at the gate of the first N-type MOS transistor and the signals nA_ij at the P-side gates of the transmission gates are controlled separately: under time multiplexing, the nA_ij of the sub-cell active at a given moment has the same level as B_i, but the two can no longer share one node. Compared with the same number of independent sub-cells, a calculation unit formed by combining sub-cells needs n−1 fewer calculation capacitors and n−1 fewer control modules, i.e. the counts of calculation capacitors, first N-type MOS transistors, second N-type MOS transistors and P-type MOS transistors are each reduced by n−1, and the per-sub-cell structure completing a 1-bit multiplication approaches six transistors. Generally, owing to the manufacturing process, a calculation capacitor occupies several times the area of the 6T storage structure of a sub-cell, so this sharing reduces the number of capacitors per unit area and increases the storage capacity of a module built from such calculation units. Furthermore, after one sub-cell has taken part in a computation, the filter parameters stored in the other sub-cells of the same unit can be used immediately for in-memory computation without first moving data in from outside, which increases computing speed.
In a second aspect, combining the sub-cell and calculation unit of the first aspect yields a MAC array, see Figure 4, that performs multiply-accumulate computation. The MAC array comprises multiple calculation units; the top plates of all calculation capacitors in a column are connected to the same accumulation bus. As described above, each calculation unit comprises at least one sub-cell, the output of every transmission gate within a unit is connected to the same bottom plate of the same calculation capacitor, and the voltage of each accumulation bus corresponds to the accumulated sum of that column's computations.
In this solution, compared with a MAC array built from independent sub-cells, the calculation units sharing a capacitor and transistors allow the MAC array to store more neural-network parameters or more values computed by the previous network layer. Specifically, each calculation unit completes a 1-bit multiplication and stores the result on its calculation capacitor, and the calculation units of the same column of the MAC array accumulate their 1-bit products over the common accumulation bus connected to the capacitor top plates.
Moreover, for in-memory computing, reducing data movement on and off chip is the most direct way to cut energy consumption. Because a single calculation capacitor occupies several times the area of a single storage cell, sharing calculation capacitors lets a MAC array of given area accommodate more storage cells and thus store more filter parameters at once than the prior art. After one sub-cell finishes its computation, the filter parameters stored in the cross-coupled inverter pairs of the other sub-cells of the same unit can be used immediately for in-memory computation without moving data in from outside into the storage cells, which greatly increases computing speed and throughput while reducing energy and area consumption.
As shown in Figure 5, the top plates of all calculation capacitors in a column are connected together through the accumulation bus, whose voltage is V_top. To be clear, the calculation units are arranged in columns, one calculation unit corresponds to one calculation capacitor, and one calculation unit contains several sub-cells according to the first aspect or its embodiments.
In some embodiments the MAC array performs the multiply-accumulate operation in the following "mode one":
1. The filter parameters (or the activations computed by the previous network layer) are first written into the sub-cells according to the write procedure and stored in their cross-coupled CMOS inverter pairs;
2. The top-plate voltage V_top of the calculation capacitors is reset to V_rst through the reset switch S_rst on the accumulation bus; V_rst may be 0;
3. The signal B_i of every calculation unit is raised to Vdd, resetting the bottom-plate voltages V_btmi to 0, while the signals A_ij and nA_ij of every unit are held at 0 and Vdd respectively; S_rst is then disconnected;
4. During computation the signals A_ij and nA_ij are activated in a time-multiplexed manner; for example, when A_0a and nA_0a are activated, A_0j and nA_0j (j = b, c, d) are deactivated, i.e. held at 0 and Vdd respectively. Note that during computation B_0 of a calculation unit has the same level as the nA_0j of the sub-cell activated at that moment.
5. After the multiplications of a column of calculation units are finished, the bottom-plate voltages V_btmi either remain at 0 or go to Vdd. Charge redistributes among the column's calculation capacitors, similarly to the charge redistribution among the capacitors of a SAR DAC. Ignoring parasitic capacitance and other non-idealities, the analog output voltage V_top of the column of calculation capacitors represents the accumulation result given by the formula below (see Figure 5).
In other embodiments the MAC array may operate in the following "mode two":
1. The filter parameters (or the activations computed by the previous layer) are written into the sub-cells;
2. The top-plate voltage V_top of the calculation capacitors is reset to V_rst through the reset switch S_rst on the accumulation bus, and S_rst keeps V_top connected to V_rst;
3. The signal B_i of every unit is raised to Vdd, resetting the bottom-plate voltages V_btmi to 0, while the signals A_ij and nA_ij of every unit are held at 0 and Vdd respectively;
4. During computation, as before, the signals A_ij and nA_ij are activated in a time-multiplexed manner;
5. After the multiplications of a column of calculation units are finished, each bottom-plate voltage V_btmi either remains at 0 or goes to Vdd. S_rst is then disconnected, the bottom-plate voltages V_btmi are set to 0 or Vdd, and the MOS switches in the control module of each calculation unit run the successive-approximation algorithm to perform the analog-to-digital conversion. Taking the case where all V_btmi are set to 0 as an example, the voltage V_top can be expressed as:
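The expression referred to above likewise appears only as an image in the source. Under the same ideal assumptions (N identical capacitors, no parasitics), driving every bottom plate from its post-multiplication value back to 0 while the top plate floats gives, by charge conservation, V_top = V_rst − (1/N)·Σ V_btmi = V_rst − (Vdd/N)·Σ w_i·A_i, i.e. the accumulated sum appears as a shift of the top-plate voltage away from V_rst.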
In particular, the MAC array of the second aspect or its embodiments can be used for multi-bit weight computation: each column of calculation units performs a bit-wise MAC operation, and the multi-bit output is obtained by shifting and adding the digitized results. For example, for a k-bit weight/filter parameter, each column performs a bit-wise MAC: column 1 may perform the MAC of the lowest bit, i.e. bit 0, with the input signal, and column k the MAC of the highest bit, i.e. bit k−1, with the input signal. Equivalently, each column independently performs a MAC for one bit of a multi-bit binary weight; the MAC results of all participating columns form k elements, and after analog-to-digital conversion these k elements are shifted and added in the digital domain.
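A minimal sketch of this shift-and-add recombination, assuming each column's digitized MAC code is already available and that column j holds weight bit j (LSB first); the helper name is illustrative:

```python
def combine_bitwise_macs(column_codes):
    """Shift-and-add the per-column ADC codes into a multi-bit MAC result.

    column_codes[j] is the digitized MAC of weight bit j (LSB first) with the input.
    """
    return sum(code << j for j, code in enumerate(column_codes))

# 4-bit weights: columns return the MAC of bits 0..3 with the same input vector.
print(combine_bitwise_macs([5, 2, 7, 1]))  # 5*1 + 2*2 + 7*4 + 1*8 = 45
```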
To reduce computation errors, a differential MAC-array architecture can be used. In some embodiments the MAC array further comprises second CMOS inverters and differential calculation capacitors; see Figures 6a, 6b and 6c: connecting each calculation unit of the MAC array to one second CMOS inverter and one differential calculation capacitor yields the differential architecture of the MAC array. Specifically, the transmission-gate outputs of all sub-cells of each calculation unit are connected to the input of the same second CMOS inverter, whose output is connected to the bottom plate of the differential calculation capacitor; the top plates of all differential calculation capacitors in a column are connected to the same differential accumulation bus. For convenience of description, the structure formed by a calculation unit built from the sub-cells of the above embodiments plus one second CMOS inverter and one differential calculation capacitor is called a differential unit. It can be understood that the sub-cells of the same differential unit share the same first N-type MOS transistor, one second N-type MOS transistor (Figures 6a and 6b), one P-type MOS transistor, the differential calculation capacitor and the second CMOS inverter, and that the sub-cells of a differential unit are likewise activated in the time-multiplexed manner described above.
Figure 7 shows a MAC array built from the aforementioned differential units: within a differential unit the output of every transmission gate is connected to the bottom plate of the same calculation capacitor, the top plates of all calculation capacitors in a column are connected to the same accumulation bus, and the top plates of all differential calculation capacitors are connected to the same differential accumulation bus.
In a third aspect, a bit-width reconfigurable mixed-signal computing module is provided, see Figure 8, comprising: the MAC array of the second aspect or any of its possible implementations, whose column-wise accumulated result after computation is represented as an analog voltage, namely the top-plate voltage V_top of the capacitors in the above embodiments; a filter/ifmap module that supplies the filter parameters written into and stored in the MAC array (it should be understood that, for a neural network, the values written and stored may also be the outputs of the previous layer); an ifmap/filter module that supplies the inputs of the MAC array, specifically the inputs to the gate terminals of the transmission gates of the calculation units, to be multiplied and accumulated with the stored filter parameters or previous-layer activations; an analog-to-digital conversion module that converts the analog voltage obtained by the MAC operation into a digital representation; and a digital processing module that performs at least multi-bit fusion, biasing, scaling or nonlinear operations on that digital representation and outputs partial sums or activations that can be used directly as the next layer's input.
It can be understood that when the module of the present application is used for neural-network MAC computation, the module generally contains more storage cells, i.e. cross-coupled CMOS inverter pairs, within the same area, which can be pre-loaded with the filter parameters (weights) in one pass. After one layer has been computed, the output partial sums, or the activations (feature maps) that will finally feed the next layer's computation, can immediately be MAC-computed with the pre-loaded weights stored in the module, reducing the waiting time and power of off-chip data movement. The module's large capacity also improves on-chip storage: besides the filter parameters, the activations (feature maps) output by the current layer can also be stored in the MAC array.
It should be understood that, in addition to the transistor and capacitor sharing used inside the calculation units and the MAC array of the first and second aspects, in the non-MAC-array region of the module the calculation units also share devices such as transistors that take part in the analog-to-digital conversion and the digital processing.
In the present invention the analog-to-digital conversion module may be a SAR ADC with a parallel-capacitor structure that converts the top-plate voltage V_top output by a column of calculation units into a digital representation; it comprises a MAC DAC, a SAR DAC, a comparator, a switch sequence and the SAR logic. Compared with SAR ADCs of other types, such as resistive or hybrid resistive-capacitive structures, the parallel-capacitor SAR ADC makes the fullest reuse of the structures already present in this invention, saving devices and area. The MAC DAC is formed by the parallel-connected capacitors of one column of calculation units of the MAC array; it should be understood that the output voltage of the MAC DAC is V_top. The SAR DAC comprises (B+1) parallel capacitors, where B = log2(N) and N is the number of capacitors in the MAC DAC: B capacitors whose capacitance halves successively from the most significant bit (MSB) to the least significant bit (LSB), plus one redundant capacitor equal in value to the LSB capacitor. For example, if the MAC DAC has N = 8 capacitors, then B = 3; the MSB capacitor C_{B-1} has capacitance C, the next capacitor C_{B-2} has C/2, the LSB capacitor C_0 has C/4, the fractions of the SAR DAC reference voltage that can be allocated from MSB to LSB are 1/2, 1/4 and 1/8, and the redundant capacitor C_U has capacitance C/4. One plate of each of the B capacitors and of the redundant capacitor are connected together; the other plates of the B capacitors are connected to the switch sequence, while the other plate of the redundant capacitor is always grounded (Gnd). The free terminals of the switch sequence include a VDD terminal and a ground (Gnd) terminal, and the SAR logic controls the switch sequence.
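To make the role of the binary-weighted capacitors concrete, the following behavioral successive-approximation loop is a simplified illustration only; it models the ideal bit trials rather than the switching scheme of any particular figure, and the reference fractions 1/2, 1/4, ... correspond to the capacitors C_{B-1} down to C_0 described above:

```python
def sar_convert(v_top, vdd=1.0, bits=3):
    """Behavioral SAR conversion of the MAC DAC output voltage V_top."""
    code, v_sar = 0, 0.0
    for b in range(bits - 1, -1, -1):
        trial = v_sar + vdd / (2 ** (bits - b))  # capacitor for bit b adds Vdd/2, Vdd/4, ...
        if v_top >= trial:                       # comparator decision: keep the bit
            v_sar, code = trial, code | (1 << b)
    return code

# Example with N = 8 MAC capacitors (B = 3): 5 of the 8 bottom plates at Vdd.
print(sar_convert(5 / 8))  # -> 5
```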
In one embodiment, see Figure 9, the output voltage V_top of the MAC DAC is the positive comparator input V+ and the output V_SAR of the SAR DAC is the negative input V−; the SAR logic drives the switch sequence so that the negative input V− approximately equals the positive input V+, and the final SAR logic output is the digital representation of V+. In particular, the activation sparsity of the MAC array allows certain capacitors of the SAR DAC to be exempted from switching, giving higher energy efficiency and faster ADC conversion. For example, if it is known that after the MAC operation fewer than 25% of the MAC capacitors have a bottom-plate voltage V_btmi equal to VDD, i.e. in the column most 1-bit multiplications are 1×0, 0×0 or 0×1 and the 1×1 cases amount to fewer than a quarter of the column's calculation units, then the switches S_{B-1} and S_{B-2} of the two most significant SAR DAC capacitors C_{B-1} and C_{B-2} can be set to the ground (Gnd) terminal, so that not every SAR DAC capacitor is unconditionally exercised during the conversion, which saves energy. Note that the connections of the comparator's V+ and V− sides shown in the drawings are only for convenience of explanation; in practice the V+ and V− sides can be interchanged.
In another embodiment, see Figure 10, the MAC DAC and the SAR DAC may be connected together, i.e. all capacitors are placed in parallel and the resulting total voltage is the positive comparator input V+; the negative input V− is V_ref; the SAR logic drives the switch sequence so that the positive input V+ approaches V_ref. Note that this embodiment applies when the MAC operation follows the aforementioned "mode one". If V_rst = 0 and circuit non-idealities are ignored, the V_ref connected to the comparator's V− side may be 0 or VDD/2. For example, if V_ref = 0 and the SAR DAC capacitors are initially connected to VDD through the switches S_0 to S_{B-1}, the SAR operation can return V+ to 0 while producing the digital representation, which corresponds to the V_rst = 0 required by the step of "mode one" in which the top-plate voltage V_top of the capacitors is reset to 0 through the reset switch S_rst.
In the two embodiments of Figures 9 and 10, when the comparator's positive input V+ and negative input V− come arbitrarily close to each other, the comparator easily suffers from metastability during the conversion, i.e. for a short time it cannot resolve the difference between V+ and V−. This is because the amplitude of the analog MAC result to be quantized is discrete rather than continuous, and its discrete amplitude levels are aligned with those of the SAR DAC. To mitigate comparator metastability, in another embodiment, see Figure 11, relative to Figure 9 a half-LSB capacitor in parallel with the other capacitors is added both to the MAC DAC on the V+ side and to the SAR DAC on the V− side; the other terminal of the V+-side half-LSB capacitor is always grounded (Gnd), while the other terminal of the V−-side half-LSB capacitor can be connected to the switch sequence. This creates a half-LSB voltage offset between the discrete analog levels of the MAC DAC and those of the SAR DAC, providing extra error margin. The half-LSB capacitor may be realized as two series-connected LSB capacitors to obtain good matching.
In another embodiment the MAC DAC is allowed to be reused as the SAR DAC through bottom-plate sampling. As shown in Figure 12, the comparator's positive input V+ is connected to the MAC DAC and one half-LSB capacitor; the capacitors of units 1 to N−1 of the MAC DAC and the half-LSB capacitor can each be connected to the VDD or ground (Gnd) terminal of the switch sequence, while the capacitor of the N-th unit can be connected to ground; the negative input V− is connected not to a capacitor but to the voltage V_ref. In effect, the MAC DAC of this embodiment is also the SAR DAC. Note that this embodiment applies when the MAC computation follows "mode two", and normally V_ref = V_rst. After the SAR conversion is completed, the comparator's positive input voltage V+ returns to V_rst, corresponding to the V_rst required by the step of "mode two" in which the top-plate voltage V_top of the capacitors is reset to V_rst through the reset switch S_rst. Using the same capacitor array for both the MAC operation and the analog-to-digital conversion avoids the mismatch and accuracy loss that arise when the MAC DAC of the MAC phase and the SAR DAC of the conversion phase use different capacitor arrays, and it enables a fully differential SAR ADC. It should be particularly pointed out that the transistors needed to implement the switch sequence in this embodiment are already included in the control modules of the aforementioned calculation units, so no additional transistors need to be added.
Building on the embodiment of Figure 12, in another embodiment Figure 13 shows a differential MAC architecture that solves the problem of the common-mode-dependent comparator input offset voltage. The comparator's positive input V+ is connected to the MAC DAC and one extra LSB capacitor; during the analog-to-digital conversion, the capacitors of units 1 to N−1 of the MAC DAC and the extra LSB capacitor can each be connected to the VDD or ground (Gnd) terminal of the switch sequence, and the capacitor of the N-th unit can be connected to ground. The negative input V− is connected to the differential MAC DAC and one extra LSB capacitor; during the conversion, the capacitors of units 1 to N−1 of the differential MAC DAC and the extra LSB capacitor can each be connected to the switch sequence, and the capacitor of the N-th unit can be connected to ground. The differential MAC DAC is composed of one column of differential calculation capacitors of the MAC array. Note that this differential MAC architecture can only be realized in combination with the differential-structure module described above. It should be particularly pointed out that the transistors needed to implement the switch sequence in this embodiment are already included in the control modules of the aforementioned differential calculation units, so no additional transistors need to be added.
In one embodiment the bit width of a column's SAR ADC can be decided in real time by the sparsity of the input data and of the values stored in that column, so that on average the number of capacitors of the binary-weighted capacitor array that must be charged and discharged during the analog-to-digital conversion may be greatly reduced, substantially saving conversion energy. In particular, as shown in Figure 14, the real-time bit width of the SAR ADC can be computed as ceil(log2(min(X, W) + 1)), where ceil is the round-up function, min the minimum-value function, X the number of 1s in the 1-bit input vector, X_1 to X_m the 1st to m-th 1-bit input vectors from which X can be computed with an adder tree, W the number of 1s stored in a column of the calculation array, and W_1 to W_m the numbers of 1s stored in the individual sub-cells of that column, which can be computed off-chip and are already placed in the SAR logic when the data are stored into the calculation array. The min, log2 and ceil functions in the bit-width formula can be replaced by simple digital combinational logic that yields the same result.
It is worth noting that in the above embodiments the modules are divided only according to functional logic; the division is not limited to the one above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only used to distinguish them from one another and do not limit the scope of protection of the present invention; for example, the "first N-type MOS transistor" and "second N-type MOS transistor" in the embodiments only distinguish devices at different connection positions and shall not be understood as specific devices.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (16)
- An in-memory mixed-signal computing sub-cell, characterized in that it performs a 1-bit multiplication and comprises a storage module, a calculation capacitor and a control module; the storage module comprises two cross-coupled CMOS inverters and a complementary transmission gate, the two cross-coupled CMOS inverters storing a 1-bit filter parameter, the gate of the N-type MOS transistor of the complementary transmission gate being connected to an input signal and the gate of its P-type MOS transistor to a complementary input signal, the output of one of the CMOS inverters being connected to the input of the complementary transmission gate, and the output of the complementary transmission gate being connected to the bottom plate of the calculation capacitor and to the control module; the result of the multiplication of the input signal by the filter parameter is stored as the voltage of the bottom plate of the calculation capacitor; a plurality of the computing sub-cells are used to form one calculation unit, and every computing sub-cell within the same calculation unit shares the same control module and one calculation capacitor.
- The in-memory mixed-signal computing sub-cell of claim 1, characterized in that the control module comprises a first N-type MOS transistor, a second N-type MOS transistor and a P-type MOS transistor; the source of the first N-type MOS transistor is grounded (Gnd) and its drain, together with the drain of the second N-type MOS transistor and the drain of the P-type MOS transistor, is connected to the same bottom plate of the calculation capacitor; the input signal at the gate of the first N-type MOS transistor has the same level during computation as the signal connected to the gate of the P-type MOS transistor of the complementary transmission gate; the source of the second N-type MOS transistor is grounded (Gnd) and its gate is connected to one bit line, and the source of the P-type MOS transistor is connected to Vdd and its gate to another, complementary bit line.
- The in-memory mixed-signal computing sub-cell of claim 1, characterized in that the control module comprises a first N-type MOS transistor, a second N-type MOS transistor and a P-type MOS transistor, the second N-type MOS transistor and the P-type MOS transistor being connected in series to form a first CMOS inverter; the source of the P-type MOS transistor of the first CMOS inverter is connected to Vdd and the source of the second N-type MOS transistor of the first CMOS inverter is connected to the drain of the first N-type MOS transistor; the source of the first N-type MOS transistor is grounded (Gnd), and the input signal at its gate has the same level during computation as the signal connected to the gate of the P-type MOS transistor of the complementary transmission gate; the input of the first CMOS inverter is connected to a bit line and its output to the bottom plate of the calculation capacitor.
- The in-memory mixed-signal computing sub-cell of claim 1, characterized in that the control module comprises a first N-type MOS transistor and a P-type MOS transistor; the drain of the first N-type MOS transistor is connected to the drain of the P-type MOS transistor and to the bottom plate of the calculation capacitor; the source of the first N-type MOS transistor is connected to the source of the P-type MOS transistor and to a bit line; the gate of the first N-type MOS transistor is connected to a control word line whose level during computation equals that of the signal connected to the gate of the P-type MOS transistor of the complementary transmission gate; and the gate of the P-type MOS transistor is connected to another control word line.
- The in-memory mixed-signal computing sub-cell of any one of claims 2, 3 or 4, characterized in that the sub-cells within the calculation unit are activated in a time-multiplexed manner, and the signal connected to the gate of the first N-type MOS transistor within the same calculation unit has the same level as the complementary input signal connected to the gate of the P-type MOS transistor of the complementary transmission gate of the sub-cell that is active at that moment.
- A MAC array comprising the in-memory mixed-signal computing sub-cell of claim 5 and performing multiply-accumulate computation, characterized in that it comprises a plurality of calculation units; the outputs of the complementary transmission gates of all sub-cells within each calculation unit are connected to the bottom plate of the same calculation capacitor, the top plates of the calculation capacitors of all calculation units in the same column are connected to the same accumulation bus, and the voltage of each accumulation bus corresponds to the accumulated sum of the multiplications of one column of the MAC array.
- The MAC array of claim 6, characterized in that it further comprises a second CMOS inverter and a differential calculation capacitor; in each calculation unit making up the MAC array, the outputs of the complementary transmission gates of all sub-cells are connected to the input of the same second CMOS inverter, the output of the second CMOS inverter is connected to the bottom plate of the differential calculation capacitor, and the top plates of all differential calculation capacitors in the same column are connected to the same differential accumulation bus.
- A bit-width reconfigurable mixed-signal in-memory computing module, characterized by comprising: the MAC array of claim 6 or claim 7, in which the column-wise accumulated multiplication result is represented as an analog voltage; a filter/ifmap module that supplies the filter parameters, or the activations computed by the previous layer, to be written into and stored in the MAC array; an ifmap/filter module that supplies the inputs of the MAC array to be multiplied with the filter parameters of the neural network or with the activations computed by the previous layer; an analog-to-digital conversion module that converts the analog voltage obtained from the MAC array into a digital representation; and a digital processing module that performs at least multi-bit fusion, biasing, scaling or nonlinear operations on the output of the analog-to-digital conversion module and outputs partial sums or activations usable as the input of the next network layer.
- The computing module of claim 8, characterized in that the analog-to-digital conversion module is a SAR ADC with a binary-weighted capacitor array, the SAR ADC comprising: a MAC DAC composed of the calculation capacitors of one column of the MAC array; a SAR DAC, being an array composed of a plurality of binary-weighted capacitors and one redundant capacitor equal in value to the LSB capacitor; a comparator; a switch sequence; and SAR logic that controls the switch sequence.
- The computing module of claim 9, characterized in that the output voltage of the MAC DAC serves as the input of one side of the comparator and the output voltage of the SAR DAC as the input of the other side.
- The computing module of claim 9, characterized in that the output voltage produced by connecting the capacitors of the MAC DAC and the SAR DAC in parallel serves as the input of one side of the comparator, and a comparison voltage V_ref serves as the input of the other side.
- The computing module of claim 10, characterized in that a half-LSB capacitor is added on each side of the comparator; the output voltage produced by the MAC DAC in parallel with one half-LSB capacitor serves as the input of one side of the comparator, and the output voltage produced by the SAR DAC in parallel with the other half-LSB capacitor serves as the input of the other side.
- The computing module of claim 12, characterized in that the MAC DAC and the half-LSB capacitor are both connected to the switch sequence and reused as the SAR DAC; the output voltage of this dual-purpose DAC serves as the input of one side of the comparator, and a comparison voltage V_ref serves as the input of the other side.
- The computing module of claim 9, characterized in that the SAR ADC further comprises a differential MAC DAC composed of the differential calculation capacitors of one column of the MAC array.
- The computing module of claim 14, characterized in that the MAC DAC and an additional parallel LSB capacitor are both connected to the switch sequence and reused as the SAR DAC, the output voltage of this dual-purpose DAC serving as the input of one side of the comparator; and the differential MAC DAC and an additional parallel differential LSB capacitor are both connected to the switch sequence and reused as a differential SAR DAC, the output voltage of this dual-purpose differential DAC serving as the input of the other side of the comparator.
- The computing module of claim 9, characterized in that the bit width of the SAR ADC is decided in real time according to the sparsity of the input data and of the data stored in the MAC array; this real-time bit width can be computed as ceil(log2(min(X, W) + 1)), where ceil is the round-up function, min the minimum-value function, X the number of 1s in the 1-bit input vector and W the number of 1s stored in one column of the calculation array, the bit-width formula being implemented equivalently in the circuit by digital combinational logic.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP21800372.1A EP3985670B1 (en) | 2020-05-08 | 2021-03-30 | Subunit, mac array, and analog and digital combined in-memory computing module having reconstructable bit width |
| US17/631,723 US12487795B2 (en) | 2020-05-08 | 2021-03-30 | Sub-cell, MAC array and bit-width reconfigurable mixed-signal in-memory computing module |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010382467.2 | 2020-05-08 | ||
| CN202010382467.2A CN113627601B (zh) | 2020-05-08 | 2020-05-08 | Sub-cell, MAC array, bit-width reconfigurable mixed-signal in-memory computing module |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021223547A1 true WO2021223547A1 (zh) | 2021-11-11 |
Family
ID=78377232
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/084032 Ceased WO2021223547A1 (zh) | 2020-05-08 | 2021-03-30 | 子单元、mac阵列、位宽可重构的模数混合存内计算模组 |
Country Status (4)
| Country | Link |
|---|---|
| US (1) | US12487795B2 (zh) |
| EP (1) | EP3985670B1 (zh) |
| CN (1) | CN113627601B (zh) |
| WO (1) | WO2021223547A1 (zh) |
Families Citing this family (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11823035B2 (en) * | 2020-07-07 | 2023-11-21 | Qualcomm Incorporated | Power-efficient compute-in-memory pooling |
| JP2022049312A (ja) * | 2020-09-16 | 2022-03-29 | キオクシア株式会社 | 演算システム |
| US12488228B2 (en) * | 2021-04-02 | 2025-12-02 | Arizona Board Of Regents On Behalf Of Arizona State University | Programmable in-memory computing accelerator for low-precision deep neural network inference |
| US11705171B2 (en) * | 2021-09-23 | 2023-07-18 | Intel Corporation | Switched capacitor multiplier for compute in-memory applications |
| US11990178B2 (en) * | 2021-12-13 | 2024-05-21 | Ncku Research And Development Foundation | Recognition system and SRAM cell thereof |
| US11811416B2 (en) * | 2021-12-14 | 2023-11-07 | International Business Machines Corporation | Energy-efficient analog-to-digital conversion in mixed signal circuitry |
| CN114300012B (zh) * | 2022-03-10 | 2022-09-16 | 中科南京智能技术研究院 | 一种解耦合sram存内计算装置 |
| US20230386565A1 (en) * | 2022-05-25 | 2023-11-30 | Stmicroelectronics International N.V. | In-memory computation circuit using static random access memory (sram) array segmentation and local compute tile read based on weighted current |
| KR102800148B1 (ko) * | 2022-09-20 | 2025-04-23 | 연세대학교 산학협력단 | eDRAM 기반 메모리 셀 및 이를 포함하는 CIM |
| TWI849566B (zh) * | 2022-11-07 | 2024-07-21 | 國立陽明交通大學 | 用於記憶體內運算(cim)的記憶體陣列及其操作方法 |
| CN118689447A (zh) * | 2023-03-21 | 2024-09-24 | 华为技术有限公司 | 一种存内计算器件及存内计算方法 |
| CN117033302A (zh) * | 2023-08-23 | 2023-11-10 | 杨闵昊 | 一种存储计算单元、阵列、宏模块及上层宏模块 |
| CN119152906B (zh) * | 2024-11-05 | 2025-03-25 | 杭州万高科技股份有限公司 | 一种基于eDRAM的高密度近存计算与存内计算混合架构及计算方法 |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102394102A (zh) * | 2011-11-30 | 2012-03-28 | 无锡芯响电子科技有限公司 | 一种采用虚拟地结构实现的近阈值电源电压sram单元 |
| US20120119808A1 (en) * | 2010-11-15 | 2012-05-17 | Renesas Electronics Corporation | Semiconductor integrated circuit and operating method therof |
| CN103165177A (zh) * | 2011-12-16 | 2013-06-19 | 台湾积体电路制造股份有限公司 | 存储单元 |
| CN110414677A (zh) * | 2019-07-11 | 2019-11-05 | 东南大学 | 一种适用于全连接二值化神经网络的存内计算电路 |
| CN110598858A (zh) * | 2019-08-02 | 2019-12-20 | 北京航空航天大学 | 基于非易失性存内计算实现二值神经网络的芯片和方法 |
| CN110941185A (zh) * | 2019-12-20 | 2020-03-31 | 安徽大学 | 一种用于二值神经网络的双字线6tsram单元电路 |
Family Cites Families (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7812757B1 (en) | 2009-06-12 | 2010-10-12 | Hong Kong Applied Science And Technology Research Institute Co., Ltd. | Hybrid analog-to-digital converter (ADC) with binary-weighted-capacitor sampling array and a sub-sampling charge-redistributing array for sub-voltage generation |
| US8547269B2 (en) | 2012-01-30 | 2013-10-01 | Texas Instruments Incorporated | Robust encoder for folding analog to digital converter |
| US11263522B2 (en) * | 2017-09-08 | 2022-03-01 | Analog Devices, Inc. | Analog switched-capacitor neural network |
| CN112567350B (zh) * | 2018-06-18 | 2025-01-17 | 普林斯顿大学 | 可配置的存储器内计算引擎、平台、位单元及其布局 |
| US20200105337A1 (en) * | 2018-09-28 | 2020-04-02 | Gregory Chen | Memory cells and arrays for compute in memory computations |
| US10964356B2 (en) * | 2019-07-03 | 2021-03-30 | Qualcomm Incorporated | Compute-in-memory bit cell |
| US11372622B2 (en) * | 2020-03-06 | 2022-06-28 | Qualcomm Incorporated | Time-shared compute-in-memory bitcell |
| CN111144558B (zh) | 2020-04-03 | 2020-08-18 | 深圳市九天睿芯科技有限公司 | 基于时间可变的电流积分和电荷共享的多位卷积运算模组 |
| US11487507B2 (en) * | 2020-05-06 | 2022-11-01 | Qualcomm Incorporated | Multi-bit compute-in-memory (CIM) arrays employing bit cell circuits optimized for accuracy and power efficiency |
| CN111431536B (zh) * | 2020-05-18 | 2023-05-02 | 深圳市九天睿芯科技有限公司 | 子单元、mac阵列、位宽可重构的模数混合存内计算模组 |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113949385A (zh) * | 2021-12-21 | 2022-01-18 | 之江实验室 | 一种用于rram存算一体芯片补码量化的模数转换电路 |
| CN113949385B (zh) * | 2021-12-21 | 2022-05-10 | 之江实验室 | 一种用于rram存算一体芯片补码量化的模数转换电路 |
| TWI898752B (zh) * | 2024-06-26 | 2025-09-21 | 旺宏電子股份有限公司 | 記憶體內運算裝置以及使用其執行操作的方法 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113627601B (zh) | 2023-12-12 |
| CN113627601A (zh) | 2021-11-09 |
| EP3985670A1 (en) | 2022-04-20 |
| EP3985670A4 (en) | 2022-08-17 |
| US12487795B2 (en) | 2025-12-02 |
| EP3985670B1 (en) | 2025-05-07 |
| US20220276835A1 (en) | 2022-09-01 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21800372; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2021800372; Country of ref document: EP; Effective date: 20220113 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWG | Wipo information: grant in national office | Ref document number: 2021800372; Country of ref document: EP |