
WO2025045366A1 - Processing crossbar semiconductor device - Google Patents

Processing crossbar semiconductor device

Info

Publication number
WO2025045366A1
Authority
WO
WIPO (PCT)
Prior art keywords
electric
blocks
values
super
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2023/073854
Other languages
English (en)
Inventor
Bijoy Kundu
Roland Müller
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV
Original Assignee
Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fraunhofer Gesellschaft zur Foerderung der Angewandten Forschung eV
Priority to PCT/EP2023/073854
Publication of WO2025045366A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065 Analogue means
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06G ANALOGUE COMPUTERS
    • G06G7/00 Devices in which the computing operation is performed by varying electric or magnetic quantities
    • G06G7/12 Arrangements for performing computing operations, e.g. operational amplifiers
    • G06G7/16 Arrangements for performing computing operations, e.g. operational amplifiers for multiplication or division
    • G06G7/163 Arrangements for performing computing operations, e.g. operational amplifiers for multiplication or division using a variable impedance controlled by one of the input signals, variable amplification or transfer function
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C7/00 Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10 Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C7/1006 Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor

Definitions

  • the present examples relate to a processing crossbar semiconductor device, e.g. for vector matrix multiplication (VMM) using analog in-memory computing.
  • VMM vector matrix multiplication
  • the examples further relate to performing deep neural network processing.
  • Analog in-memory computing using crossbar architectures is known to compute VMMs.
  • the multiplication part of these VMM products may be computed by applying an input activation voltage from an input matrix, either in analog using digital to analog converters (DACs) or in digital (e.g. selection bits), to a synapse circuit that produces a proportional current or charge, wherein the proportionality factor comes from the weight matrix.
  • the accumulation part of the VMM computation happens inherently by the charge or current integration to a common node, namely a column in a crossbar.
  • the integrated charge or current from a column is then converted to the digital domain by analog to digital converters, ADCs.
  • the present examples offer solutions to these problems e.g. by employing a configurable and scalable crossbar architecture, and other associated building block circuits such as a digitally configurable current source-based synapse cell, analog to digital converters (ADCs), a Shift-Add circuit with built-in offset and gain support, etc.
  • VMMs Vector Matrix Multiplications
  • Fig. 1 shows a fully connected layer of a neural network having inputs N of dimension n and output neurons M of dimension m. This translates to VMMs given by the following matrix formula:
  • Fig. 1 can be accompanied by a non-linear activation function, which is applied to the dot product results computed by the above equation.
  • Fig. 2 depicts a typical crossbar array according to the prior art for in-memory computing, showing R rows and C columns, where each element represents a synapse weight W (the subscripts are not shown in Fig. 2, therefore W is used instead of W_11 ... W_1n, etc.).
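  • As a minimal sketch of the computation described for Fig. 1, the following pure-Python snippet multiplies an input vector of dimension n by an n x m weight matrix and then applies an activation. The orientation of the weight matrix (`weights[i][j]` linking input i to output neuron j), the function names, and the choice of ReLU as the non-linear activation are assumptions made only for illustration; the source does not specify them.

```python
# Sketch of a fully connected layer as a vector matrix multiplication (VMM).
# Assumption: weights[i][j] connects input i to output neuron j.

def vmm(inputs, weights):
    """Return the dot product of `inputs` with each column of `weights`."""
    n_outputs = len(weights[0])
    return [
        sum(x * row[j] for x, row in zip(inputs, weights))
        for j in range(n_outputs)
    ]

def relu(values):
    """Example non-linear activation (an assumption; the text names none)."""
    return [max(0.0, v) for v in values]

inputs = [1.0, 2.0, 3.0]          # n = 3 input activations
weights = [[1.0, -1.0],           # n x m weight matrix, m = 2
           [0.5, 0.0],
           [0.0, -1.0]]
outputs = relu(vmm(inputs, weights))
```

In a crossbar, each product `x * row[j]` would correspond to one synapse cell and each sum over a column to the charge or current accumulated on a common node.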
  • the input activation lines, and the accumulation lines typically connect complete rows and columns of the crossbar respectively.
  • the crossbar can map the fully connected layer in Fig. 1 partially or completely. Regardless of the size, the crossbar would have unused weights in the rows and columns, because the input lines and the accumulation lines physically connect the entire row or column.
  • Fig. 1 shows an example of a fully connected layer which can be addressed by the present examples.
  • Fig. 2 shows an example of a crossbar according to the prior art.
  • Fig. 3 shows an example according to the present disclosure.
  • Fig. 4 shows the operation of the example of Fig. 2.
  • Figs. 5, 6, 7, and 8 show optional features of possible components of the example of Fig. 3.
  • a processing crossbar semiconductor device for processing at least one input vector by at least one weight tensor, to derive at least one output vector as processed version of the at least one weight tensor, the at least one weight tensor having a plurality of weights
  • the crossbar processing device comprising a multiple input line including a plurality of single input lines, each single input line being configured to process each input electric value of an array of input electric values representing the at least one input vector
  • the processing crossbar semiconductor device comprising: a set of weighting elements arranged according to element columns and element rows, each weighting element corresponding to a weight of the at least one weight tensor, wherein the set of weighting elements is partitioned among a plurality of blocks, the plurality of blocks being arranged according to super columns and super rows in such a way that each super row includes a plurality of immediately subsequent element rows and each super column includes a plurality of immediately subsequent element columns; a plurality of block
  • the processing crossbar semiconductor device may be so that each block of a super row is configured to activate a plurality of weighting elements simultaneously, to thereby provide, to the respective block output lines and through the weighting elements, the electric weighted values simultaneously, so that each analog accumulation element that receives the electric weighted values from the same block output line has the accumulation value simultaneously.
  • the processing crossbar semiconductor device may be so that each weighting element is configured to provide a current analogically obtained by weighting the input electric values, so that the respective accumulation element provides at least one of an accumulated weighted current as the electric accumulated weighted value.
  • the processing crossbar semiconductor device may be so that each weighting element is configured to provide a charge and/or a voltage analogically obtained by weighting the input electric values, so that the respective accumulation element provides an accumulated weighted charge and/or an accumulated weighted voltage as the electric accumulated weighted value.
  • the processing crossbar semiconductor device may be so that at least one electric input value encodes a binary value, so that a first electric level of the at least one electric input value corresponds to a first logical level of the binary value and a second electric level of the at least one electric input value corresponds to a second logical level of the binary value, each of the relating weighting elements being configured to process the electric input value according to a weight which is selected between more than two weight values.
  • the processing crossbar semiconductor device may be so that at least one of the weighting elements is configured to select between a first current generator providing a first current and a second current generator providing a second current, wherein the weighting element is configured to selectably route each of the first current and the second current, independently of each other, to a respective conductor chosen between a first conductor of the respective block output line and a second conductor of the respective block output line, so that the respective block output line carries one selected of multiple selectable levels of the weighted output value.
  • the processing crossbar semiconductor device may be so that at least one of the relating weighting elements is configured to select between at least one positive polarity, thereby generating a positive weight, and one negative polarity, thereby generating a negative weight, by awarding a larger electric level to a positive conductor or electrode of the analog accumulation element than to a negative conductor or electrode of the analog accumulation element in case of positive electric weighted value, and by awarding a larger electric level to the negative conductor or electrode than to the positive conductor or electrode in case of negative electric weighted value.
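  • The routing of two currents onto a two-conductor output line can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the function name, the current magnitudes 1.0 and 2.0, the "p"/"n" labels, and the rule that a logical 0 input contributes no current are all hypothetical and not taken from the source.

```python
from itertools import product

def weighted_output(input_bit, i_first, i_second, route_first, route_second):
    """Differential output (p minus n) of one hypothetical weighting element.

    route_* is "p" or "n": the conductor each current is steered to.
    A logical 0 input is assumed to contribute no current.
    """
    p = n = 0.0
    for current, route in ((i_first, route_first), (i_second, route_second)):
        if route == "p":
            p += current * input_bit
        else:
            n += current * input_bit
    return p - n

# Enumerate every routing choice for a logical 1 input.
levels = sorted(
    {weighted_output(1, 1.0, 2.0, a, b) for a, b in product("pn", repeat=2)}
)
```

With these assumed currents the four routing combinations yield four distinct signed levels, which is one way a binary input could be weighted by more than two weight values, including negative ones.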
  • the processing crossbar semiconductor device may be so that the plurality of analog accumulation elements are gathered in analog accumulation element buses, each of the analog accumulation element buses being connected with, and associated to, at least one block output bus, each analog accumulation element of each analog accumulation element bus being connected to at least one block output line of the associated at least one block output line, to accumulate weighted electric values from the same element columns of the blocks of at least one super row associated with the associated at least one block output bus.
  • the processing crossbar semiconductor device may be so that at least one analog accumulation element bus is associated with, and connected to, at least a first block output bus associated with, and connected to, a first super row and a second block output bus associated with, and connected to, a second super row, so that each analog accumulation element of the analog accumulation element bus provides an electric accumulated weighted value accumulated from the electric weight values from both first blocks of the first super row and second blocks of the second super row.
  • the processing crossbar semiconductor device further comprises at least one analog to digital converter, ADC, to convert at least two electric accumulated weighted values from at least two analog accumulation elements, respectively.
  • the processing crossbar semiconductor device may convert at least one electric accumulated weighted value onto one single bit in accordance with the electric level of the electric accumulated weighted value.
  • the processing crossbar semiconductor device may be configured to convert at least one electric accumulated weighted value onto a plurality of bits in accordance with the electric level of the electric accumulated weighted value, the weight level.
  • the processing crossbar semiconductor device may further comprise at least one digital accumulation element to accumulate different accumulated weighted values, once converted to digital, from different analog accumulation elements or from different analog accumulation element buses.
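  • The conversion and digital accumulation steps can be sketched as follows. The uniform quantizer, the `full_scale` parameter, the unsigned output code, and the sample analog values are assumptions chosen for illustration; the source does not describe the ADC transfer function.

```python
def adc(value, full_scale, bits):
    """Quantize an analog level in [0, full_scale] to an unsigned code.

    Hypothetical uniform quantizer; bits=1 gives a single-bit decision.
    """
    max_code = (1 << bits) - 1
    clamped = min(max(value, 0.0), full_scale)
    return round(clamped / full_scale * max_code)

# Two accumulated weighted values from different analog accumulation elements.
analog_sums = [0.2, 0.6]
codes = [adc(v, full_scale=1.0, bits=4) for v in analog_sums]

# Digital accumulation of the converted values.
digital_total = sum(codes)
```

The same helper covers both cases named in the text: with `bits=1` the level is mapped onto a single bit, and with a larger `bits` onto a plurality of bits.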
  • the processing crossbar semiconductor device may be configured so that a first block of a first super row, which is associated with a first block output bus, contains a selectable multiple connection with a second block of a second super row, the second super row being associated with, and connected to, a second block output bus but being not associated with, and not connected to, the first block output bus, wherein the first block output bus is connected to, and associated with, a first analog accumulation element bus including a plurality of first analog accumulation elements and the second block output bus is connected to a second analog accumulation element bus including a plurality of second analog accumulation elements, none of the first analog accumulation elements being connected to any of the second analog accumulation elements, the selectable multiple connection electrically connecting, when selected, the first block with the block output bus, in such a way that the weighted electric values from the first block are provided to the second block output bus, and to the second analog accumulation element bus.
  • the processing crossbar semiconductor device may be so that a plurality of blocks of at least one super column are electrically connected two by two through a plurality of selectable multiple connections, in such a way to selectably provide weighted values obtained at a first block of a first super row to a second block output bus which is not electrically connected to a first block output bus to which the first super row is connected.
  • the processing crossbar semiconductor device may be configured to evaluate whether the input vector has more elements than the element rows of first blocks of a first super row, so that, in case of the input vector having more elements than the number of element rows in each first block of the first super row, to distribute electric input values between the element rows of the first blocks of the first super row and element rows of second blocks of at least one second super row, and to selectably provide electric weighted values from each of the first blocks of the first super column to at least one of the further blocks of the second super row, the second blocks thereby providing to the same block output bus, with which they are associated, both the weighted values from the first blocks and the weighted values from the further blocks.
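  • The distribution of a long input vector over several super rows can be sketched numerically. The helper names, the block height of 4 element rows, and the sample values are assumptions; the snippet only models the arithmetic effect of routing partial results to one shared output bus, not the circuit.

```python
def split_across_super_rows(vector, rows_per_block):
    """Split a vector into consecutive chunks, one per super row."""
    return [
        vector[start:start + rows_per_block]
        for start in range(0, len(vector), rows_per_block)
    ]

def column_sum(input_chunks, weight_chunks):
    """Accumulate one output column over all super rows sharing a bus."""
    return sum(
        x * w
        for xs, ws in zip(input_chunks, weight_chunks)
        for x, w in zip(xs, ws)
    )

x = [1, 0, 1, 1, 0, 1]            # 6 inputs, blocks with 4 element rows
w = [2, 3, -1, 1, 5, 4]           # weights of one element column
chunks = split_across_super_rows(x, 4)
total = column_sum(chunks, split_across_super_rows(w, 4))
```

Because both super rows feed the same accumulation, `total` equals the dot product that a single block with six element rows would have produced.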
  • the processing crossbar semiconductor device may be so that the plurality of blocks includes a first subplurality of blocks and a second subplurality of blocks disjoint from the first subplurality of blocks, the plurality of block output buses including a first subplurality of block output buses uniquely connected to blocks of the first subplurality of blocks and a second subplurality of block output buses uniquely connected to blocks of the second subplurality of blocks, wherein the second subplurality of blocks is selectably activatable and deactivatable.
  • the processing crossbar semiconductor device may be so that the plurality of analog accumulation elements includes at least a first subplurality of analog accumulation elements to accumulate electric weighted values from the first subplurality of blocks and a second subplurality of analog accumulation elements to accumulate electric weighted values from the second subplurality of blocks, the processing crossbar semiconductor device further comprising a plurality of further accumulation elements, each further accumulation element being configured to accumulate a weighted electric value obtained by accumulating both a first weighted electric value from a first analog accumulation element of the first subplurality of blocks and a second weighted electric value obtained from a second analog accumulation element of the second subplurality of blocks.
  • the processing crossbar semiconductor device may be so that the further accumulation element is a digital accumulation element.
  • the processing crossbar semiconductor device may be configured to evaluate whether the output vector to be obtained has more elements than the number of element columns of the first subplurality of blocks, so as to activate, in case the output vector has more elements than the number of element columns of the first subplurality, a number of super columns of blocks of the second subplurality of blocks, so that the number of columns of the activated super columns of blocks at least matches the number of elements of the output vector.
  • the processing crossbar semiconductor device may be so that the second subplurality of blocks are selectively activatable and deactivatable.
  • the processing crossbar semiconductor device may be so that each analog accumulation element is a resistor which receives the weighted electric values from the weighting elements of the same element row in multiple blocks of at least one super row, so that the respective electric accumulated weighted value becomes a sum of the weighted electric values.
  • the processing crossbar semiconductor device may be so that the weighted electric values are currents, and the electric accumulated weighted value is a current which is the sum of the currents from the block output lines.
  • the processing crossbar semiconductor device may be so that each analog accumulation element is a capacitor which receives the weighted electric values from the weighting elements of the same element row in multiple blocks of at least one super row, so that the respective electric accumulated weighted value becomes a sum of the weighted electric values which are charges or voltages.
  • the processing crossbar semiconductor device may be configured to activate different blocks independently of each other.
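  • The decision of how many selectively activatable super columns to switch on for a wide output vector can be sketched as a ceiling division. The function and parameter names are hypothetical, and the reading that the first subplurality is always active while the second is added on demand is an assumption about the text, not a statement of the claimed logic.

```python
import math

def super_columns_to_activate(n_outputs, first_columns, columns_per_block):
    """Count extra super columns needed beyond the first subplurality.

    All three parameters are hypothetical names: the output vector length,
    the element columns offered by the first subplurality of blocks, and
    the element columns contributed by each additional super column.
    """
    missing = max(0, n_outputs - first_columns)
    return math.ceil(missing / columns_per_block)

extra = super_columns_to_activate(n_outputs=10, first_columns=8,
                                  columns_per_block=4)
```

When the output vector fits in the first subplurality the function returns 0, so the second subplurality can stay deactivated.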
  • the processing crossbar semiconductor device may be configured to activate simultaneously multiple blocks of the same super row, in such a way that weighting elements connected to the same block output line provide electric weighted values to the same block output line to thereby provide the electric weighted values.
  • the processing crossbar semiconductor device may be configured to receive the array of electric values as at least a first input vector and a second input vector independent from the first input vector, the first input vector being inputted to a first super row of blocks and, simultaneously, the second input vector being provided to a second super row of blocks.
  • the processing crossbar semiconductor device may be so that the weighting elements of each block are configured to provide, in parallel, respective electric weighted values to respective block output lines, thereby preaccumulating the electric weighted values in the same output line.
  • the crossbar processing device may be configured to implement a neural network according to a plurality of layers, including an input layer, an output layer, and optionally at least one hidden layer, wherein each transition from a layer to the immediately subsequent layer is performed by processing the analog input vector, wherein each weight of the at least one weight tensor represents a synapse, each analog input value is a neuron, and each output value is a neuron of the immediately subsequent layer.
  • the processing crossbar semiconductor device may be so that the analog input vector is a kernel, or a portion of kernel, to be convolutionally applied to the at least one weight tensor.
  • the processing crossbar semiconductor device may be configured to perform, after a first phase in which the weights of the at least one weight tensor are obtained by minimizing a cost function providing error metrics on a known dataset, an inference phase in which the weights of the weight tensor are established, and predictions are provided in response to input values.
  • a method for processing at least one input vector by at least one weight tensor, to derive at least one output vector as processed version of the at least one weight tensor, the at least one weight tensor having a plurality of weights, the method including processing each input electric value of an array of input electric values representing the at least one input vector, the method using weighting elements arranged according to element columns and element rows, each weighting element corresponding to a weight of the at least one weight tensor, wherein the set of weighting elements is partitioned among a plurality of blocks, the plurality of blocks being arranged according to super columns and super rows in such a way that each super row includes a plurality of immediately subsequent element rows and each super column includes a plurality of immediately subsequent element columns, the method further using a plurality of block output buses, each block output bus being associated with, and connected to, a plurality of blocks in the re
  • the crossbar processing device further comprises a controller to activate at least one among the following: at least one block, at least one super row, at least one super column, at least one element row, at least one bypassing connection, at least one subplurality and/or to associate weights to respective weight elements, input electric values to element rows, and/or output electric values to element columns of at least one block.
  • Fig. 3 shows an example of a processing crossbar semiconductor device 100 according to one of the present examples.
  • the processing crossbar semiconductor device may perform a multiplication in analog domain between a matrix and a vector.
  • the concept of matrix can be generalized to tensor, i.e., having a number of dimensions which is generic (e.g., three dimensions).
  • the tensor could be the matrix of formula (1), but there could be a plurality of channels, each representing another matrix in a third dimension, e.g. a first channel could be , a second channel could be , and a p-th channel could be .
  • the processing crossbar semiconductor device may therefore process at least one input vector, which may be analog or digital, but in general terms may be in the form of an array of electric values, through the plurality of weights of a weight tensor, thereby generating an output vector (either in digital form or in analog form).
  • the input vector may be analog or digital (in case of being digital, it may be converted onto an analog value encoding the digital value). Therefore, the electric values which represent the input vector may be either digital values, each of which in principle can only be encoded in two logical states (e.g. a first logical state represented by a first electric level, and a second logical state represented by a second electric level different from the first electric level) or strings of logical states, or analog values (e.g. ideally defined in a continuous interval of analog values).
  • the input values may be provided in the form of voltages, charges, currents, and so on, either in digital form or in analog form.
  • the processed values (e.g., weighted values, accumulated weighted values, etc.) may be analog values (e.g. in the form of at least one of voltages, charges, currents, and so on), in case of being converted from the digital domain.
  • the output elements e.g. elements of the output vector
  • weightings are here performed in analog domain, but accumulations are either performed in analog domain (e.g., by summation of several weighted electric values from multiple weightings in each accumulation element) or, in some examples, include a first, analog step (e.g., in which a summation of several weighted electric values from multiple weightings is performed at several, different analog accumulation elements), the first step being followed by a second step, in which, after a conversion of analog accumulated weighted values onto digital versions of the accumulated weighted values, digital versions of the accumulated weighted values provided by different analog accumulation elements are summed with each other in digital version. Therefore, while in some examples the input values and the output values are digital, their processing is advantageously performed in analog domain, thereby reducing power consumption and increasing processing rate.
  • the input vector is to be processed (e.g., multiplied, weighted, scaled) by a weight tensor, with weights defined in different dimensions (in the case of the matrix, the dimensions will be two, the weight tensor being a weight matrix, while in the case of three dimensional tensors, dimensions will be three and the weight tensor will be a three dimensional tensor).
  • each of the little rectangles 101 represents a weighting element, which corresponds to a weight to be applied to an input electric value, while the input vectors are or are part of kernels, such as the kernels K0_0 (301), K0_1 (302), K1_0 (303), K1_1 (304). Therefore, in the time instant depicted by Fig. 3, the input vectors are multiplied by the weights in the positions 301, 302, 303, 304.
  • each element row is associated with (e.g. inputted by and/or processing) a particular input electric value (e.g., all the weighting elements in a specific element row are applied to the same electric input value) (e.g. the input may be a digital value, which may be converted onto an analog value e.g. by the weighting element 101 or upstream to the weighting element 101), while each element column (e.g.
  • FIG. 6 shows examples of weighting elements for weighting electric input vectors controlled by the digital inputs d_in, d_in0, d_in1, d_in2 (which in this case control the provision of the supply voltage Vdd). As shown in Fig.
  • each weighting element also called CS_AWE, see below
  • CS_AWE has a conventionally positive terminal vout_p and a conventionally negative terminal vout_n, which constitute a connection port with a relative block output line (OL-A0, OL-A1, OL-A2, ..., OL-A31, see below); in particular, vout_p may be connected to a conventionally positive terminal of the block output line, and vout_n may be connected to a conventionally negative terminal of the block output line.
  • deviations e.g. bifurcations
  • choice of one single generator e.g. current generator
  • choice of the polarity e.g.
  • vout_p and vout_n may permit to select one among multiple (in particular more than 2) weights of the weighting element, so as to advantageously process the weighting of the binary value according to a sub binary resolution. It is noted that it may be that all the weighting elements 101 of an element row of a block provide the electric weighted values to the same block output line simultaneously and in parallel, so that the block output line acts as a pre-accumulating output line which pre-accumulates multiple weighted values of the same element column of the same block.
  • Fig. 3 shows a plurality of block output buses, here indicated with OL-A0-3, OL-A4-7, OL-A8-11, ..., OL-A28-31.
  • Each block output bus (which may be understood as a multiple output line, or a group of lines in parallel) may include a plurality of single output lines (e.g. in parallel connection with each other for each block output bus), not electrically connected with each other, and here called block output lines (each of the block output lines may have only one single voltage value at a given time instant, while the block output bus may represent an array of multiple voltage values at the given time).
  • the block output bus OL-A0-3 is shown to include the four output lines OL-A0, OL-A1, OL-A2, OL-A3 (different numbers are also possible).
  • Each of the block output lines OL-A0, OL-A1, OL-A2, ..., OL-A31 may be a dual channel (e.g.
  • each single output line could only have one single conductor, and a common analog ground could be used instead of the second conductor; however, it has been understood that the use of two different conductors instead of a common analog ground is advantageous: primarily because the use of the common ground can lead to process dependent errors such as from PVT variations, and also can add asymmetric electric parasitic capacitances / resistances. Additionally, in the absence of even a common analog ground, it is possible to define a polarity, providing a sign to the weight, thereby permitting to define negative weights).
  • each output bus is connected to, and associated with, one particular group of weighting elements (or more in particular of blocks of weighting elements which are in the same super row, see below), but is not connected to, and not associated with, other groups of weighting elements (or more in particular of blocks of weighting elements which are in different super rows, see below).
  • each of the block output buses OL-A0, OL-A1, OL-A2, ..., OL-A31 is meant to receive analog weighted values only from some portions of element columns (i.e., each block output line receiving weighted values from corresponding weighting elements arranged in element columns of blocks of the same super row), but not from the totality of the weighting elements of any global element column.
  • each of the block output lines OL-A0, OL-A1, OL-A2, OL-A3 is electrically disconnected from any other of the block output lines OL-A0, OL-A1, OL-A2, OL-A3 of the same block output bus OL-A0-
  • Each block output bus OL-A0-3, OL-A4-7, OL-A8-11, ..., OL-A28-31 is in general electrically disconnected from other block output buses, or at least from some of them (e.g., the majority of them), or analogously each of the block output lines OL-A0, OL-A1, OL-A2, OL-A3 of a generic block output bus OL-A0-3 is electrically disconnected from block output lines of other block output buses, or at least from some block output lines of other block output buses (e.g., the majority of the block output lines of other block output buses).
  • the block output bus OL-A0-3 is in general electrically separated from the OL-A4-7 and also from OL-A8-11, OL-A12-15, OL-A20-23, OL-A24-27, OL-A28-31 (i.e. from the majority of other block output buses).
  • each block output bus OL-A0, OL-A1, OL-A2, ... OL-A31 may be connected to one accumulation element bus AL0, AL1, AL2, ... AL31 (which is a single accumulation line). Multiple of these accumulation element buses AL0, AL1, AL2, ... AL31 (shown, for example, in Fig. 7) may be gathered together in accumulation element buses AL0-3, AL4-7, AL8-11, ..., AL28-31.
  • each accumulation bus can be connected with a plurality of block output buses.
  • the accumulation element AL0 may be electrically connected to both the output line OL-A0 and the output line OL-A16. Since, as mentioned, each output line is fed with analog electrical values coming from a group of weighting elements of the processing crossbar semiconductor device, the accumulation line (e.g., AL0) will carry integral information on the weighted electric values (e.g. in accumulated form) obtained from those weighting elements.
  • It may be understood that each analog accumulation element AL0, AL1, ... AL31 may comprise, for example, a resistor through which a current flows which is the sum of the currents outputted from the weighting elements to which the analog accumulation element is electrically connected (e.g. the current of the analog accumulation element AL0 may be the sum of the currents received from the single output line OL-A0 and, in some examples, under a suitable selection, also from the single output line OL-A16-19, which are in turn received from weighting elements of corresponding columns of the blocks associated with the single output lines OL-A0 and OL-A16-19).
  • the weighted electric values may be provided as voltages and/or charges generated by the different weighting elements.
  • Each analog accumulation element AL0 ... AL31 may be a capacitor, and its charge may be the sum of the effects of the outputs of the different weighting elements to which it is electrically connected (in this case, it may be preferable to avoid the current generators in Fig. 6, for example).
  • the weighting elements 101 may be gathered in blocks, also called CAEs and indicated globally with 104 (in Fig. 3, particular blocks 104-00, 104-01, 104-10, 104-11 are shown).
  • each block (CAE) can be a 16x4 matrix (of course, different numbers of columns can be chosen for different matrices).
  • the blocks are also arranged according to super rows 106 (elongated along the element rows, horizontal in Fig. 3) and super columns 108 (elongated along the element columns, vertical in Fig. 3).
  • each super row 106 includes a plurality of immediately subsequent element rows (e.g. the top super row 106-0 includes the top 16 element rows of the device, and the second top super row 106-1 includes the 17th to 32nd element rows of the device) and each super column 108 includes a plurality of immediately subsequent element columns (e.g. the left super column 108-0 includes the leftmost 16 element columns of the device, and the second left super column 108-1 includes the 17th to 32nd element columns of the device).
  • each element column is electrically connected to one block output line (e.g., the most left element column of each of the blocks 104-00, 104-01 , etc.
  • the connection of a given element column of a block with a respective block output line may imply that all the weighting elements of the given element column are connected to the respective block output line, thereby providing the weighted electric values to the respective block output line.
  • multiple corresponding element columns of different super columns but in the same super row are connected to the same block output line.
  • Corresponding element columns in different blocks of the same super row may be electrically connected to the same block output line, to thereby provide weighted outputs to be accumulated by the corresponding accumulation element AL0. It is to be noted that, therefore, for each super row all the corresponding columns of all the blocks are electrically connected to the same accumulation element. It may be understood that the blocks 104 in different super rows 106 are connected to different block output buses (and therefore different block output lines), which are in general electrically independent from the other block output buses (or at least the majority thereof).
  • each super row 106 is associated with one single block output bus (e.g. the top super row 106-0 being associated with and connected to the first block output bus OL-A0-3, the second top super row 106-1 being associated with and connected to the block output bus OL-A4-7 and so on).
  • the weighted values provided by the blocks 104 of a super row are in principle independent from the weighted values provided by another super row (e.g. 106-1): this differs, for example, from the prior art of Fig. 2, where all the weighting elements of a column shall provide an output to the same accumulation line.
  • different super rows e.g., 106-0 and 106-1
  • all the blocks 104 can be activated simultaneously. More generally, multiple blocks 104 can be activated simultaneously. However, in some examples, even though different blocks 104 can be activated simultaneously, they can be activated in a selectable, independent way: it may be decided, for example, that some blocks 104 are not to be activated at all, thereby advantageously reducing the power consumption. Nevertheless, when blocks are activated simultaneously, speed is increased. In examples, the blocks 104 are activated independently of each other, thereby energizing only those blocks which are actually used to perform weightings, without energizing the non-activated blocks.
  • it is possible to activate the weighting elements selectively, within each block 104, so as to reduce power consumption to the minimum. In other examples, different weighting elements of the same block 104 are activated simultaneously, even if they are not used for weighting.
  • each block has 16 rows and each kernel K_0_0, K_0_1, K_1_0, K_1_1 has 16 electric input values.
  • an input vector has more than 16 electric input values (e.g., between 17 and 32)
  • the kernel will be partially provided to the first super row (top super row in Fig. 3) and the other remaining electric input values (up to 16) will be provided to the super row connected to the output line OL-A16-19 (and providing output weighted values to the same accumulation line AL0-3)
  • blocks of the same super column can, in some examples, exchange weighted values with each other, thereby bypassing some block output buses (e.g. block 104-00 of the left super column 108-0 can transmit weighted values to block 104-10 of the same left super column 108-0, while it may be that simultaneously block 104-01 of the second left super column 108-1 transmits weighted values to block 104-11 of the same second left super column 108-1).
  • a single kernel may be formed by 301 and 303 (in this particular case, 301 and 303 are not different kernels, but two portions of the same kernel), or more generally an input vector may be longer (in number of rows) than the number of rows of each block 104. In this case, it could be preferable to provide all the electric weighted values to the same analog accumulation element, and not to different single analog accumulators. However, in Fig.
  • the blocks (104-00, 104-01) of a first super row are connected to the first block output bus OL-A0-3 but not to the second block output bus OL-A4-7, while the blocks (104-10, 104-11) of the second super row 106-1 are connected to the second block output bus OL-A4-7 but not to the first block output bus OL-A0-3.
  • OL-A0-3) to provide all the weighted values (coming from the blocks of both the first and second super columns) to one single analog output bus (e.g., OL-A4-7) to be therefore accumulated by one single analog accumulation bus (e.g. OL-A4-7).
  • first block output bus OL-A0-3 and the first analog accumulation bus OL-A0-3 for other processing (e.g. using some blocks of the same super columns which do not bypass the first block output bus OL-A0-3).
  • any of (e.g. all) the blocks of a same super column can be selectively connectable with each other (at least in the same subplurality, in the examples which have subpluralities A and B).
  • a first block 104-00 of a first super row 106-0 may provide weighted electric values to a first block (e.g. a corresponding block 104-10) of a second super row (106-1), which in some examples can in turn provide the same electric values (together with its own electric weighted values) to the first block of a third super row 106-2, and so on, or route both the electric weighted values from the first block 104-00 and the electric weighted values produced by itself to its associated output line OL-A4-7.
  • the first block 104-00 of the first super row 106-0 may, instead of providing the weighted values processed by itself to the first block output bus OL-A0-3 (which is associated with the first super row 106-0), based on a selection, provide the values to a corresponding first block 104-10 of the second super row 106-1, which may (based on a selection) pass both the weighted values generated by the first block 104-00 of the first super row 106-0 and the weighted values generated on its own to the second output line OL-A4-7 associated with the second super row 106-1.
  • a pre-accumulated weighted value (taking into account the weighted values obtained from the first block of the first super row and the accumulated values obtained from the first block of the second super row) may be accumulated in the block output bus OL-A4-7.
  • different blocks (e.g. other blocks of the first super row 106-0 which are not the blocks 104-00 and 104-01) process the electric input values to provide different electric weighted values to the first output line OL-A0-3, in case of such a selection. Accordingly, it is possible to achieve a better configurability and to configure the different values differently. Therefore:
  • if the input vector has a number of elements which is larger than the number of element rows of the blocks (or equivalently the array of input electric values has more input electric values than the number of element rows that each block of a super row has), it is possible to 1) distribute the input electric values to more than one super row, and meanwhile 2) directly mutually connect blocks (in the same super column) from those super rows, thereby bypassing at least one block output bus, to thereby provide all the weighted values to the block output bus of one super row, and
  • FIG. 5 shows a generic first block (here indicated with 104-00, but which could be any other of Fig. 3) of a generic first super row (here indicated with 106-0) which is connected to a first block output bus (“Analog Bus”, OL-A0-3) associated with the first super row 106-0.
  • the block (CAE) 104-00 is shown as being a 16x4 block.
  • each element column and the respective block output line is subjected to two switches (e.g. Sw0a, Sw0b, collectively indicated with 518), each of which may be selectably activated (closed) or deactivated (opened)
  • Each second switch (e.g. Sw0a) may be controlled independently from the other second switches (e.g. Sw1a, Sw2a, etc.) of the same block (e.g. by a binary command written in a cell, like the SRAM cells), but in some examples all the second switches of one single block are controlled by one single command
  • a multiplexer may be used (e.g. controlled by the AND gates) to selectively choose between: a. providing the electric weighted values to the block output bus OL-A0-3 (indicated in Fig. 3 with 86); b. bypassing it (indicated in Fig. 3 with 85), thereby providing the electric weighted values to the adjacent block 104-10; or c. both providing the electric weighted values to the block output bus OL-A0-3 and providing the electric weighted values to the adjacent block 104-10
  • Section B can be identical to Section A, or may at least have features which are the same or similar (maybe not in combination) to those described for Section A.
  • the second subplurality may be deactivated (e.g. by simultaneously deactivating all the blocks of the second subplurality), so as to avoid power consumption in case it is not needed.
  • An example of using the second subplurality is the case in which the output value shall have a dimension which has more elements than the element columns of the first subplurality. In the example of Fig. 3, if the output shall have more than 16 elements, the second subplurality shall be activated.
  • it may therefore be possible to have the reconfigurable processing device permit at least one (e.g. a combination) of the following selections:
  • magnitude of the weight according to a digital address (e.g., 3 bits), e.g. to select among a plurality of different levels of weight (e.g., seven levels, addressed by 3 bits).
  • each element column is not bound to one single output value (as it is in the prior art of Figs. 2 and 4): different output values arrive from corresponding element columns of blocks in the same super row (or from corresponding element columns of blocks from different super rows, after bypass). Therefore, in principle one single super column is not reserved to one single output value.
  • Fig. 6 shows an example 501 of how each weighting element 101 (AWE) may be carried out (different implementations are possible).
  • Each weighting element may provide at least one of a current, a charge or a voltage obtained in analog domain by weighting an input electric value (which may encode a digital value), so as to provide a respective block output line OL-A0 with at least one weighted value, which is to be accumulated by the respective accumulation element AL0.
  • the element 101 may provide at least one of a weighted current, a weighted charge.
  • the weighting element 101 may be in parallel to all the weighting elements of the same element column of the same block, so as to provide in parallel and simultaneously the electric weighted values to the same respective block output line OL-A0, to be pre-accumulated together with other electric weighted values from corresponding element columns of other blocks of the same super row.
  • At least one electric input value may encode a binary value (e.g. indicated with d_in in 501 and 502, or d_in0, d_in1, d_in2 in 503 and 504), so that a first electric level of the at least one electric input value corresponds to a first logical level (e.g.
  • a second electric level of the at least one electric input value corresponds to a second logical level (e.g. 0) of the binary value.
  • This may be obtained, for example, by activating vs deactivating a switch controlled by the binary value (e.g. d_in or d_in0, d_in1, d_in2), thereby imposing vs releasing at least one electric value (e.g. imposing Vdd, which may be a supply voltage, and/or imposing a particular current or charge, vs releasing the electric voltage, e.g. by letting the electric value float, or by imposing a different voltage or a different electric value).
  • the weighting element 101 may selectably route each of the first current I1 and the second current I2, independently of each other, to a respective terminal (first, conventionally positive terminal vout_p vs second, conventionally negative terminal vout_n) connected to a respective conductor (first, conventionally positive conductor AL-P<0> of the block output line vs second, conventionally negative conductor AL-N<0> of the block output line, see Fig. 7).
  • This effect may be achieved, for example, by coordinately activating vs deactivating, e.g. independently:
  • a table providing the various outputs is presented in a subsequent part of the present description. It is nevertheless clear that it is possible, for example, to change the sign of the weight by changing the polarity. For example: if the current is caused to flow mainly in the first terminal vout_p (and in the positive conductor of the block output line) rather than in the second terminal vout_n (and in the negative conductor of the block output line), then the polarity will be positive, thereby being indicative of a positive weight; and if the current is caused to flow mainly in the second terminal vout_n (and in the negative conductor of the block output line) rather than in the first terminal vout_p (and in the positive conductor of the block output line), then the polarity will be negative, thereby being indicative of a negative weight.
  • each digital output value from each processed input value may advantageously have a resolution of more than 1 bit.
  • each analog accumulation element AL0, ..., AL31 may have one accumulated weighted value which is one of a set of potential accumulated weighted values which has a cardinality of twice the number of input elements in the input vector multiplied by the number of weight levels that can be provided to each electric input value multiplied by the number of weighting elements.
  • the output of the ADC of Fig. 7 can be nine bits wide for each analog accumulation element bus (e.g. AL0-3).
  • Shift registers (e.g. elements 702) may be used to generate a super string of bits, in the case that different analog accumulation elements carry different weighted accumulated values of different bits of input encoded values.
  • Each transition from a layer (N1, N2, N3, ..., Nn) to the immediately subsequent layer (M1, M2, M3, ..., Mm) may be performed with the techniques discussed in the present document.
  • the input vector is N1, N2, N3, ..., Nn (it could be represented by an array of digital values, e.g. forming an input digital string)
  • the input vector may be converted into an array of input electric values which represent the input vector (each bit of the input digital string may be the binary value d_in of Fig. 6, which generates an input electric value which may be, for example, Vdd)
  • each electric value may be processed by weighting, e.g. by applying the current generator(s) 1101 and 110I2 and/or by suitably activating vs deactivating the switches d_wp<1>, d_wp<0>, d_wn<1>, d_wn<0>, thereby providing a current value at the terminals vout_p and vout_n
  • the digital activated output may be the output vector M1 ... Mm which constitutes the subsequent layer.
  • Each weight of the at least one weight tensor represents a synapse.
  • Each analog input value is a neuron.
  • Each output value is a neuron of the immediately subsequent layer.
  • the input vector may be, for example, a kernel. It is possible to perform a first phase in which the weights of the at least one weight tensor are defined by minimizing a cost function providing error metrics on a known dataset (and evaluating the error metrics with respect to known values of the dataset). Subsequently, an inference phase may be performed in which the weights are established, and predictions are provided in response to input values. This may be performed for each layer.
  • a controller (not shown) may be used to perform the processing and/or the selections.
  • the controller may control the operations of the weighting elements, activating vs deactivating the weighting elements and/or the blocks.
  • the controller may be, for example, the same which defines the neural network, or may be a slave controller, which receives, from a neural network controller, a request for defining the neural structure of the neural network (e.g., how many layers, how many neurons per layer, which synapses, which weights, etc.).
  • the controller performs the operations of adapting the structure of the processing crossbar semiconductor device 100 to the neural network to be used, assuming that the structure of the neural network is either requested by an external entity or is pre-defined.
  • the controller may receive information on the number of elements of the input vector, the number of elements of the output vector, and the values of the weights to be applied.
  • the controller may evaluate at least one of:
  • the controller assigns each input value to a respective row of the block; b. if the number of elements of the input vector is greater than the number of element rows in the block, then the input values may be distributed among: i. different super rows (and weighted values in different super rows may be bypassed to the same block output line, to reach the same analog accumulation element); and/or ii. different super rows which are connected to the same analog accumulation element bus (e.g.
  • only one single subplurality of blocks may be activated) whether the number of elements of the output vector is greater than or equal to or lower than the total number of element columns in one block of one super column: a. if it is ascertained that the number of elements of the output vector is lower than or equal to the number of element columns in one block, then each output value is assigned to one single block or one single super column b.
  • each input value is assigned to multiple blocks of a same super row (and, in case, weighted values are bypassed to one single block output line, so that they are routed to the same analog accumulator element), or to blocks of different subpluralities but having the same element rows (i.e. having the same electric input values) 4)
  • the controller may evaluate whether there is the possibility of inserting a further input vector. This is evaluated by: a.
  • the assignments of the values to the blocks and element rows and element columns and super columns and super rows are performed as for the first input vector, but discarding the blocks, element rows, element columns, super columns and super rows which are already assigned to the first input vector.
  • the controller may associate the weights of the weight tensor to the weighting elements 101, e.g. in positions complying with the activated weighting elements and weighting blocks.
  • the layout of the device 100 may substantially follow Fig. 3.
  • the weighting elements 101 may be placed according to that bidimensional array, e.g. one adjacent to the other one.
  • Each block onto which the bidimensional array is partitioned may comprise a reduced bidimensional array, e.g. comprising only those weighting elements which are spatially enclosed in a closed boundary, for example.
  • all the blocks of at least one super row and/or at least one super column have the same number of element rows and element columns.
  • the bidimensional array may be obtained, for example, by suitably doping a silicon substrate, thereby making the weighting elements and the connections.
  • the bidimensional array may extend mostly planarly.
  • the super rows may be geometrically parallel to each other and the super columns may be geometrically parallel to each other.
  • the super rows may be geometrically perpendicular to the super columns.
  • the block element output buses may be geometrically parallel to the super rows.
  • the block element accumulation buses may be geometrically parallel to the super columns.
  • the block element output buses may be geometrically perpendicular to the super columns.
  • the blocks of the first sub plurality may be all within one single closed boundary excluding all the blocks of the second sub plurality, and/or the blocks of the second sub plurality may be all within another single closed boundary excluding all the blocks of the first sub plurality.
  • the blocks and the weighting elements may be grouped in a position which is spaced from the ADCs 700. It has been understood that these solutions make it possible to reduce parasitic capacitances.
  • the complete crossbar array (CA) of Fig. 3 may be divided into two sections (subpluralities), Section A (first subplurality) and Section B (second subplurality), and each section is further divided into blocks (which are smaller crossbar array elements, or CAEs).
  • Each block (CAE) may include 4 columns and 16 rows (other numbers are possible) of e.g. analog synapses, which are termed as weighting elements or analog weight emulators (AWEs).
  • Columns and rows of blocks (CAEs) are termed as Super-Columns and Super-Rows respectively. All blocks (CAEs) in a Super-Row of a section share a dual channel analog bus of e.g. width 8 (i.e.
  • FIG. 3a Accumulation Line (accumulation element bus), and referred to as AL-A and AL-B for Section A (first subplurality) and Section B (second subplurality), respectively.
  • in Fig. 3 there are 32 accumulation elements, 16 per section, e.g. each of width 2 (i.e. two conductors).
  • AL-A and AL-B per section as there are ADCs, and they span over all the Super-Rows in each section.
  • the CA in Fig. 3 may comprise e.g. 32 shift and add digital blocks, as many as there are ADCs.
  • Each block (CAE) may be equipped with special multiplexers that can be configured statically (e.g. prior to running inference) and/or dynamically (while running inference). These multiplexers enable the blocks (CAEs) to forward their outputs to the next block (CAE) in the same Super-Column and/or to their Output Lines (e.g. AL-A0-3) as shown by the long and short arrows in Fig. 3.
  • the outputs from the analog block output buses e.g., OL-A0-3, OL-A4-7, ...
  • can then be further forwarded to different analog Accumulation lines e.g., AL-A0-3, AL-A4-7, ...
  • multiple CAEs can be configured to support convolution and fully connected kernels (or more in general input vectors) of varying dimensions (and also to achieve output vectors of multiple, variable dimensions), without sacrificing the utilization of the analog crossbar.
  • Kernel mapping on crossbar: an exercise in configurability and utilization
  • K0_0, K0_1, K1_0, and K1_1, each of length 16.
  • K0_0, and K0_1 form a pair which gets the same inputs
  • K1_0, and K1_1 form another pair, which also gets the same inputs, but different from those of the former pair.
  • Such an assumption is a valid model configuration, for instance a CNN layer with two filters would get the same set of inputs, but the inputs would obviously be different for two different layers.
  • Reference numerals 301 to 304 in Fig. 3 and 401 to 404 in Fig. 4 show the mapping of the above 4 K*_* kernels (where * is a generic notation meaning “any number”) on the configurable crossbar presented in these examples and on the traditional in-memory computing crossbar of Fig. 2, respectively. If in Fig. 4 (prior art) reference numerals 403 and 404 are mapped on the same element columns as 401 and 402, their accumulation results cannot be computed simultaneously because they connect to the same accumulation lines.
  • 303 and 304, while mapped below 301 and 302 in the same columns, can be computed simultaneously because the accumulation outputs from 301 and 302 or 303 and 304 can be forwarded to different accumulation lines through the output lines.
  • accumulation outputs from 301 and 302 can be forwarded to accumulation lines AL0-3 through output lines OL-A0-3, while those from 303 and 304 can be forwarded to accumulation lines AL4-7 through OL-A4-7, simultaneously.
  • the routing configuration of OLs and ALs is controlled by CAE multiplexers, the configurations of which are set at compile time and at runtime.
  • Fig. 5 shows the multiplexer circuit employed by each CAE (block), where a combination of the S-Col Select, the S-Row Select, and the signals from the SRAM cells may control the switch array.
  • the accumulated output from each column of the CAE (block) is forwarded to the corresponding column of the next CAE (block) when the first row of the SRAM cells contains the bit “1”, while the outputs are forwarded to the Output Lines when the S-Col Select and the S-Row Select are both high and the second row of the SRAM cells contains the bit “1”.
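The simultaneous mapping of the two kernel pairs can be illustrated with a small idealized sketch. It assumes ideal, noise-free accumulation and integer values; the function name `super_row_mac`, the input vectors and the kernel weights are hypothetical and chosen only for illustration (weights are kept within the 7-level range of -3 to 3).

```python
BLOCK_ROWS = 16  # element rows per block (CAE), as in the example of Fig. 3


def super_row_mac(inputs, kernels):
    """Idealized accumulation for one super row: one sum per kernel (element column)."""
    assert len(inputs) == BLOCK_ROWS
    return [sum(x * w for x, w in zip(inputs, kernel)) for kernel in kernels]


# Hypothetical 1-bit inputs: one input set per kernel pair.
x_first_pair = [1, 0] * 8    # shared by K0_0 and K0_1
x_second_pair = [1] * 16     # shared by K1_0 and K1_1

# Hypothetical kernel weights, each kernel of length 16.
k0_0 = [1] * 16
k0_1 = [-1] * 16
k1_0 = [2] * 16
k1_1 = [0, 3] * 8

# Each super row drives its own accumulation lines, so neither result
# has to wait for the other.
accumulation = {
    "AL0-3": super_row_mac(x_first_pair, [k0_0, k0_1]),
    "AL4-7": super_row_mac(x_second_pair, [k1_0, k1_1]),
}
```

In a conventional crossbar where both pairs share the same accumulation lines, the two calls would have to be serialized; here they are independent dictionary entries.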
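The routing condition just described can be written as a small truth function. This is a behavioural sketch only, not the circuit of Fig. 5: the function and parameter names are hypothetical, and the two SRAM bits are assumed to be the per-column bits of the first and second SRAM rows.

```python
def cae_column_routing(s_col_select, s_row_select, sram_row1_bit, sram_row2_bit):
    """Return (forward_to_next_block, drive_output_line) for one element column.

    Behavioural model (assumption): forwarding depends only on the first SRAM
    row bit; output line access needs both select signals high and the second
    SRAM row bit set.
    """
    forward_to_next_block = sram_row1_bit == 1
    drive_output_line = bool(s_col_select and s_row_select and sram_row2_bit == 1)
    return forward_to_next_block, drive_output_line
```

With both conditions satisfied the function returns `(True, True)`, which would correspond to the option of providing the weighted values both to the next block and to the output line.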
  • the switch array in Fig. 5 comprises 8 analog switches.
  • the On-Off resistances and the parasitic capacitances of these switches are the critical design specifications which must be met.
  • the accumulated outputs from the CAE (block) are forwarded to the next CAE (next block) when the kernel length is greater than the height (number of rows) of the CAE (block), which in Fig. 5 is 16.
  • This, being a neural-network-dependent parameter, is calculated at compile time and is therefore treated as a static signal.
  • the S-Row and S-Col Select signals, which control the Output Line access, may be generated during runtime by means of instructions and are therefore configured dynamically.
  • Driving the compile-time signals by SRAM cells also simplifies the interface of the CA (device), as they can be configured using an existing memory interface, and reduces the energy consumption, as they are located locally in the CAE (block).
  • Multiply and accumulate (MAC) is a basic operation of all vector-matrix multiplications (VMMs): it involves a multiply operation on two operands followed by an accumulation operation that adds the outputs of two multiply operations.
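As an illustration of such a compile-time calculation, the sketch below derives the static forward bits for a chain of blocks in one super column from the kernel length. It is a minimal sketch under assumptions: the function name `forward_bits` is hypothetical, and it is assumed that every block of the chain except the last one forwards its outputs, while the last one drives its own output line.

```python
import math


def forward_bits(kernel_length, block_rows=16):
    """Static 'forward to next block' bits for the chain of blocks holding one kernel."""
    if kernel_length < 1:
        raise ValueError("kernel_length must be positive")
    n_blocks = math.ceil(kernel_length / block_rows)
    # All blocks but the last forward to the next block in the super column.
    return [1] * (n_blocks - 1) + [0]
```

For a kernel that fits in one block the result is `[0]` (no forwarding); a kernel of 17 to 32 inputs spans two blocks and yields `[1, 0]`.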
  • IMC in-memory computing
  • SRAM-based IMCs that typically accumulate charges
  • eNVM-based IMCs that accumulate currents [1], [3], [4], [5], [6], [7], [8]
  • the present examples may employ current mode accumulation using current sources and 6-T CMOS SRAM cells.
  • Fig. 6 shows the current source based analog weight emulator (CS-AWE, weighting element) implementation employed in the present examples for MAC computation, which multiplies a 1-bit input and a signed 3-bit weight (7 levels). This will be referred to as a 1bx3b MAC operation in this description.
  • the magnitude part of a synaptic weight is encoded into the two binary weighted current sources in the examples 501 and 502, while the sign part of the weight is effectively computed by subtracting vout_n from vout_p (thereby inverting the polarity).
  • the switches controlled by d_wn<1:0> and d_wp<1:0> control the flow of current to either vout_n or vout_p depending on the sign of the weight.
  • Table 1 shows the mapping between the 7 possible weight levels or values and the switch control signals d_wn<1:0> and d_wp<1:0>.
  • the input bit, d_in, is essentially used as a selection bit, which controls the switch that enables the CS-AWE (weighting element).
  • Example 503 shows a circuit example of a neuron connected to three inputs, d_in0-2, by synaptic weights W0, W1, W2.
  • Example 504 shows an equivalent circuit of the neuron implementation of 503, along with the electrical parameters that equate the output voltage to the inputs and weights.
  • all the three inputs of 503 are high, which enables all the weights or the CS-AWEs (weighting elements).
  • the weighted currents for the positive weights, W0 and W2, are directed to the vout_p terminal, while the weighted current for the negative weight W1 is directed to the vout_n terminal, controlled by d_wn<1:0> and d_wp<1:0> according to Table 1.
  • the present examples may include circuits and apparatus that can be employed to configure the bit precision of neuron activations and weights, see Fig. 7.
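The behaviour of the weighting element and of the three-input neuron can be sketched numerically. Table 1 is not reproduced in this passage, so the encoding below is an assumption: the two magnitude bits are applied to d_wp for a positive weight and to d_wn for a negative weight, with the two current sources taken as 2x and 1x a unit current. All function names and the example weights are hypothetical.

```python
def encode_weight(w):
    """Map a weight in -3..3 to (d_wp, d_wn), each a (bit<1>, bit<0>) pair (assumed encoding)."""
    if not -3 <= w <= 3:
        raise ValueError("weight must be one of the 7 levels -3..3")
    magnitude = abs(w)
    bits = ((magnitude >> 1) & 1, magnitude & 1)
    zero = (0, 0)
    d_wp = bits if w > 0 else zero
    d_wn = bits if w < 0 else zero
    return d_wp, d_wn


def awe_currents(d_in, w, i_unit=1.0):
    """Currents routed to vout_p and vout_n by one weighting element."""
    d_wp, d_wn = encode_weight(w)
    # d_in acts as the enable bit; the sources are binary weighted (2x and 1x).
    i_p = d_in * i_unit * (2 * d_wp[0] + d_wp[1])
    i_n = d_in * i_unit * (2 * d_wn[0] + d_wn[1])
    return i_p, i_n


def neuron_output(inputs, weights, i_unit=1.0):
    """Differential accumulated current (positive minus negative conductor)."""
    i_p_total = 0.0
    i_n_total = 0.0
    for d_in, w in zip(inputs, weights):
        i_p, i_n = awe_currents(d_in, w, i_unit)
        i_p_total += i_p
        i_n_total += i_n
    return i_p_total - i_n_total
```

With all three inputs high and hypothetical weights W0 = 3, W1 = -2, W2 = 1, `neuron_output([1, 1, 1], [3, -2, 1])` gives a differential current of 2 units, which equals the signed sum of the weights.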
  • the neuron activations can be configured from unsigned 1 bit to 8 bits in steps of 1 bit, while the weights can be configured from signed 3 to 9 bits in steps of 2 bits.
  • the general idea is to compute accumulations for 1 bit input and 3-bit signed weights separately by employing the unit MAC of 1 bit times 3 bit, described above, and subsequently combine all such accumulations for multiple input bits ranging from 1 bit to 8 bit and multiple weight bits ranging from 3 bit to 9 bit, either sequentially in multiple cycles or parallelly in a single cycle. All the accumulations corresponding to the same set of input and weights of higher precision than the unit MAC precision is performed by shifting and adding the digital accumulation result from each ADC.
  • The 32 pairs of AL-P (positive conductors) and AL-N (negative conductors) signals, shown by 601 in Fig. 7, are the vertically traversing accumulation lines that carry the 1b×3b MAC accumulation outputs of the VMMs, which are digitized by the 32 ADCs and fed to the 32 shift-add blocks. For each bit of a higher-precision VMM input, the previous output is left shifted by one and the digitized accumulation result from the ADC is added to it, sequentially from the most significant bit (MSB) to the least significant bit (LSB) of the input.
  • For example, 8-bit VMM inputs take 8 accumulation cycles to compute the combined 8b×3b MAC accumulations, where the digitized accumulation results corresponding to the MSB of the input encounter 7 left shifts.
  • The left shifts (resulting in multiplications by 2) are shown by 603 in Fig. 7.
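  • The sequential shift-add over input bits described above can be sketched in Python as follows. This is a minimal behavioural model, not the circuit: the ADC is assumed to be ideal, each per-bit accumulation is computed as an exact integer sum, and the function name `bit_serial_mac` and its parameters are hypothetical.

```python
def bit_serial_mac(inputs, weights, n_bits=8):
    """Accumulate sum(x * w) one input bit per cycle, MSB first.

    inputs  : unsigned integers below 2**n_bits
    weights : signed integers (3-bit signed, i.e. -3..3, in the unit MAC)
    """
    acc = 0
    for bit in range(n_bits - 1, -1, -1):  # MSB down to LSB
        # One accumulation cycle: 1-bit inputs times signed weights,
        # standing in for the digitized output of one accumulation line.
        partial = sum(((x >> bit) & 1) * w for x, w in zip(inputs, weights))
        # Shift the running result left by one, then add the new result.
        acc = (acc << 1) + partial
    return acc


inputs = [200, 17, 255]
weights = [3, -2, 1]
print(bit_serial_mac(inputs, weights))
print(sum(x * w for x, w in zip(inputs, weights)))  # direct reference value
```

  • With `n_bits=8` the loop runs for 8 cycles, and the result contributed by the MSB is shifted left in each of the 7 cycles that follow it.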
  • Each ADC together with its shift-add (S&A) block forms a Shift-Add Column, shown by 602, which handles the MAC accumulations of the corresponding 3-bit weights.
  • For weights of higher precision, the outputs of multiple Shift-Add Columns, shown by 604, are each left shifted by 2, shown by 605, and added to the Shift-Add Column to the left.
  • For example, a 5-bit weight would require 2 Shift-Add Columns.
  • A 9-bit weight would require 4 Shift-Add Columns.
  • The final output is available at the leftmost column of the combination of columns.
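  • The combination of Shift-Add Columns for higher-precision weights can be sketched in Python as follows. The sketch assumes, beyond what the source states, that a signed weight is split in sign-magnitude form into radix-4 slices with values in -3..3, and that the leftmost column holds the least significant slice, which is what the described "shift left by 2 and add to the column on the left" chain implies. The function names `split_weight`, `combine_columns` and `sliced_mac` are hypothetical.

```python
def split_weight(w, n_slices):
    """Split a signed weight into radix-4 slices, least significant first."""
    if abs(w) > 4 ** n_slices - 1:
        raise ValueError("weight does not fit in the requested slices")
    sign = -1 if w < 0 else 1
    mag = abs(w)
    slices = []
    for _ in range(n_slices):
        slices.append(sign * (mag & 3))  # two magnitude bits per slice
        mag >>= 2
    return slices


def combine_columns(column_outputs):
    """Combine column outputs listed left to right (leftmost = final output).

    Starting at the rightmost column, each result is shifted left by 2
    and added to the output of the column on its left.
    """
    acc = 0
    for out in reversed(column_outputs):
        acc = (acc << 2) + out
    return acc


def sliced_mac(inputs, weights, n_slices):
    """MAC with weights spread over n_slices Shift-Add Columns."""
    sliced = [split_weight(w, n_slices) for w in weights]
    columns = [
        sum(x * s[c] for x, s in zip(inputs, sliced)) for c in range(n_slices)
    ]
    return combine_columns(columns)


# 5-bit signed weights (range -15..15) use 2 columns.
print(sliced_mac([5, 3, 2], [13, -7, 15], n_slices=2))
```

  • With `n_slices=2` the representable weight range is -15 to 15 (5-bit signed), and with `n_slices=4` it is -255 to 255 (9-bit signed), in line with the column counts given above.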
  • Fig. 8 shows a detailed implementation of a shift-add circuit.
  • The shift-add circuit supports sequential addition of the digitized accumulation outputs corresponding to different bits of a multi-bit input from the same AL, and of those corresponding to different bits of a multi-bit weight from different ALs. It also supports multiplication of the accumulated result by a configurable 8-bit gain 801 and addition of a configurable offset 802; in combination, these can be used to support the Batch Normalization feature used in deep neural networks.
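  • To illustrate how a gain and an offset can stand in for Batch Normalization, the following Python sketch folds the usual inference-time batch-norm parameters into a single multiply and add. It is a sketch under assumptions: the folding formula is the standard one for inference with fixed statistics, floating-point values are used instead of the 8-bit gain of the circuit, and the names `fold_batch_norm` and `apply_gain_offset` are hypothetical.

```python
import math


def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into a (gain, offset) pair.

    gamma * (x - mean) / sqrt(var + eps) + beta  ==  gain * x + offset
    """
    gain = gamma / math.sqrt(var + eps)
    offset = beta - mean * gain
    return gain, offset


def apply_gain_offset(acc, gain, offset):
    """Scale an accumulated MAC result and add the offset."""
    return acc * gain + offset


gain, offset = fold_batch_norm(gamma=2.0, beta=1.0, mean=10.0, var=4.0, eps=0.0)
print(apply_gain_offset(74, gain, offset))
```

  • In hardware the gain would additionally have to be quantized to the available 8 bits, which this sketch does not model.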
  • The present examples may include a field-configurable analog crossbar architecture for in-memory computing, described above, which provides functionality to map and compute VMM kernels of varying dimensions without sacrificing resource utilization and performance.
  • The present examples are not limited to the specific parameter values (e.g. the dimensions of the crossbar, CAE (block), Super-Row, Super-Column, ADC, Shift-Add, etc.) used to describe the solution above, but extend to other feasible dimensions.
  • Some of the present examples may include a multiplexer circuit implementation which can be configured statically, at compile time, and dynamically, during runtime, by means of memory and switches.
  • The present examples are not limited to SRAM (Static Random Access Memory) memory and CMOS (Complementary Metal Oxide Semiconductor) switches, but extend to non-volatile memories such as RRAM (Resistive Random Access Memory), FeFET (Ferroelectric Field Effect Transistor), MRAM (Magnetoresistive Random Access Memory), etc.
  • Some of the present examples provide an Analog Weight Emulator or AWE circuit which facilitates multi-bit signed Multiply and Accumulate or MAC computation.
  • The present examples are not limited to SRAM memory and CMOS switches, but extend to non-volatile memories such as RRAM, FeFET, MRAM, etc.
  • The present examples are not limited to 3-bit signed weights and 1-bit inputs, but extend to other bit precisions.
  • Some of the present examples provide a shift-and-add circuit that supports built-in batch normalization functionality.
  • Fig. 4: Traditional in-memory crossbar array architecture showing kernel mapping
  • Fig. 5: CAE (block) multiplexer

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Power Engineering (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Logic Circuits (AREA)

Abstract

The invention relates to a processing crossbar semiconductor device for processing an input vector by a weight tensor, to derive an output vector as a processed version of the weight tensor, the weight tensor having a plurality of weights, the processing crossbar device being configured to process each input electric value of an array of input electric values representing the input vector, the processing crossbar semiconductor device comprising: a set of weighting elements (101) arranged in element columns and element rows, each weighting element corresponding to a weight of the weight tensor, the set of weighting elements being partitioned among a plurality of blocks (102), the plurality of blocks (102) being arranged in super-columns (108) and super-rows (106) such that each super-row comprises a plurality of immediately subsequent element rows and each super-column comprises a plurality of immediately subsequent element columns; a plurality of block output buses, each associated with, and connected to, a plurality of blocks in the respective super-row without being connected to blocks associated with, and connected to, another super-row, each block output bus comprising a plurality of block output lines, each block of the plurality of blocks being configured, when activated, to weight input electric values of the array of input electric values by corresponding weights of the weight tensor, to provide electric weighted values to the block output bus associated with the super-row of which the block is a part; and a plurality of analog accumulation elements, each electrically connected to a block output line, so as to provide a respective electric accumulated weighted value from the electric weighted values obtained from the corresponding element columns of a plurality of activated blocks in the super-row associated with the block output line, to derive an array of accumulated weighted output values which form the output vector.
PCT/EP2023/073854 2023-08-30 2023-08-30 Processing crossbar semiconductor device Pending WO2025045366A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2023/073854 WO2025045366A1 (fr) 2023-08-30 2023-08-30 Processing crossbar semiconductor device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2023/073854 WO2025045366A1 (fr) 2023-08-30 2023-08-30 Processing crossbar semiconductor device

Publications (1)

Publication Number Publication Date
WO2025045366A1 true WO2025045366A1 (fr) 2025-03-06

Family

ID=87886734

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/073854 Pending WO2025045366A1 (fr) 2023-08-30 2023-08-30 Processing crossbar semiconductor device

Country Status (1)

Country Link
WO (1) WO2025045366A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10346347B2 (en) 2016-10-03 2019-07-09 The Regents Of The University Of Michigan Field-programmable crossbar array for reconfigurable computing
US20210271597A1 (en) 2018-06-18 2021-09-02 The Trustees Of Princeton University Configurable in memory computing engine, platform, bit cells and layouts therefore
US11269973B2 (en) * 2020-04-28 2022-03-08 Hewlett Packard Enterprise Development Lp Crossbar allocation for matrix-vector multiplications

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10346347B2 (en) 2016-10-03 2019-07-09 The Regents Of The University Of Michigan Field-programmable crossbar array for reconfigurable computing
US20210271597A1 (en) 2018-06-18 2021-09-02 The Trustees Of Princeton University Configurable in memory computing engine, platform, bit cells and layouts therefore
US20210295905A1 (en) * 2018-06-18 2021-09-23 The Trustees Of Princeton University Efficient reset and evaluation operation of multiplying bit-cells for in-memory computing
US11269973B2 (en) * 2020-04-28 2022-03-08 Hewlett Packard Enterprise Development Lp Crossbar allocation for matrix-vector multiplications

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
A. REUTHER, P. MICHALEAS, M. JONES, V. GADEPALLY, S. SAMSI, J. KEPNER: "AI and ML Accelerator Survey and Trends", 2022 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), WALTHAM, MA, USA, 2022
A. SHAFIEE, A. NAG, N. MURALIMANOHAR ET AL.: "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars", PROC. 43RD INT. SYMP. COMPUT. ARCHIT., 2016, pages 14 - 26, XP032950645, DOI: 10.1109/ISCA.2016.12
G. W. BURR ET AL.: "Experimental demonstration and tolerancing of a large-scale neural network (165000 synapses) using phase-change memory as the synaptic weight element", PROC. INT. ELECTRON DEVICES MEETING, 2014, pages 1 - 4
H. JIA ET AL.: "15.1 A Programmable Neural-Network Inference Accelerator Based on Scalable In-Memory Computing", 2021 IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE (ISSCC), SAN FRANCISCO, CA, 2021
H. TSAI ET AL.: "Recent progress in analog memory-based accelerators for deep learning", J. PHYS. D: APPL. PHYS, 2018
J.-W. SU ET AL.: "16.3 A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b Precision for AI Edge Chips", 2021 IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE (ISSCC), SAN FRANCISCO, CA, USA, 2021
M. -S. K. N. R. S. S. E. A. K. C. M. KANG: "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM", 2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), FLORENCE, ITALY, 2014
N. R. SHANBHAG, S. K. ROY: "Comprehending In-memory Computing Trends via Proper Benchmarking", 2022 IEEE CUSTOM INTEGRATED CIRCUITS CONFERENCE (CICC), 2022
SHAFIEE ALI ET AL: "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars", 2013 21ST INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC); [INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE.(ISCA)], IEEE, US, 18 June 2016 (2016-06-18), pages 14 - 26, XP032950645, ISSN: 1063-6897, ISBN: 978-0-7695-3174-8, [retrieved on 20160824], DOI: 10.1109/ISCA.2016.12 *

Similar Documents

Publication Publication Date Title
US11934480B2 (en) NAND block architecture for in-memory multiply-and-accumulate operations
Peng et al. DNN+NeuroSim V2.0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training
Jia et al. 15.1 a programmable neural-network inference accelerator based on scalable in-memory computing
Peng et al. Optimizing weight mapping and data flow for convolutional neural networks on RRAM based processing-in-memory architecture
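CN111052153B (zh) Neural network arithmetic circuit using semiconductor memory elements and operation method
TW202526619A (zh) Scalable array architecture for in-memory computing
CN114945916B (zh) Apparatus and method for matrix multiplication using processing-in-memory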
Krishnan et al. Interconnect-aware area and energy optimization for in-memory acceleration of DNNs
Mikhaylov et al. Neuromorphic computing based on CMOS-integrated memristive arrays: current state and perspectives
CN107368889B (zh) Convolution, pooling and activation circuit based on a three-dimensional crossbar array of resistive memory
Wang et al. Neuromorphic processors with memristive synapses: Synaptic interface and architectural exploration
WO1991018348A1 (fr) Triangular scalable neural network processor
Fernando et al. 3D memristor crossbar architecture for a multicore neuromorphic system
Wang et al. Digital-assisted analog in-memory computing with RRAM devices
Kosta et al. HyperX: A hybrid RRAM-SRAM partitioned system for error recovery in memristive Xbars
WO2025045366A1 (fr) Processing crossbar semiconductor device
Bazzi et al. Reconfigurable precision sram-based analog in-memory-compute macro design
Yang et al. ISARA: An Island-Style Systolic Array Reconfigurable Accelerator Based on Memristors for Deep Neural Networks
Zhao et al. Re2PIM: A reconfigurable ReRAM-based PIM design for variable-sized vector-matrix multiplication
CN111988031B (zh) Memristive in-memory vector-matrix operator and operation method
Xuan et al. HPSW-CIM: A novel ReRAM-based computing-in-memory architecture with constant-term circuit for full parallel hybrid-precision-signed-weight MAC operation
CN114004344A (zh) Neural network circuit
Dorzhigulov et al. Hybrid CMOS-RRAM spiking CNNs with time-domain max-pooling and integrator re-use
US20240339138A1 (en) Compute-in-memory circuit and control method thereof
US20250123802A1 (en) Large Parameter Set Computation Accelerator Using Configurable Connectivity Mesh

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23762514

Country of ref document: EP

Kind code of ref document: A1