Attorney Docket No.: L1021421590WO

LOW-LATENCY RECONFIGURABLE FIELD PROGRAMMABLE CROSSBAR ARRAY ARCHITECTURE

CROSS REFERENCE TO RELATED APPLICATION(S)

[0001] This application claims priority to and the benefit of co-pending U.S. Provisional Patent Application No. 63/574,686, filed on April 04, 2024, the contents of which are incorporated herein by reference in their entirety.

STATEMENT OF GOVERNMENT SUPPORT

[0002] This invention was made with government support under Contract No. DE-AC02-05CH11231 awarded by the U.S. Department of Energy. The government has certain rights in this invention.

FIELD

[0003] The present invention relates to resistive random-access memory (RRAM) design and, more particularly, relates to a low-latency reconfigurable RRAM-based field programmable crossbar array (FPCA) architecture.

BACKGROUND

[0004] RRAM is a form of non-volatile random-access computer memory which operates by changing the resistance across a dielectric solid-state material. RRAM devices can be scaled down to very small sizes, allowing them to be incorporated into massively parallel in-memory computations. A field programmable crossbar array is a three-dimensional structure of memory devices (e.g., arranged in columns and rows).

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] Figure 1 illustrates an example RRAM crossbar array, according to some embodiments.
[0006] Figure 2A illustrates an example of a field programmable gate array (FPGA), according to some embodiments.

[0007] Figure 2B illustrates inter-tile connections between logic block pins and routing tracks of an FPGA, according to some embodiments.

[0008] Figure 2C illustrates components of the FPGA, according to some embodiments.

[0009] Figure 3A illustrates a planar RRAM array (Manhattan array) for an in-memory adder, according to some embodiments.

[0010] Figure 3B illustrates various logic operations of the planar Manhattan array, according to some embodiments.

[0011] Figure 3C illustrates an in-memory 1-bit full adder, according to some embodiments.

[0012] Figure 3D illustrates a multiplexer for selecting outputs of an in-memory adder, according to some embodiments.

[0013] Figure 3E illustrates a detailed description of the hardware components used in the in-memory adder execution and the data flow of the in-memory adder, according to some embodiments.

[0014] Figure 4A illustrates a programming scheme of an in-memory 4-bit adder, according to some embodiments.

[0015] Figure 4B illustrates Boolean logic of a programming scheme of an in-memory 4-bit adder, according to some embodiments.

[0016] Figure 5A illustrates an example of a planar staircase array of RRAM for performing a multiplication operation, according to some embodiments.

[0017] Figure 5B illustrates another example of a planar staircase array with shared inputs for performing a multiplication operation, according to some embodiments.
[0018] Figure 6A illustrates the evaluation and combination of partial products in a floating-point multiplication performed in parallel by a planar staircase array, according to some embodiments.

[0019] Figure 6B illustrates the storage of one operand in a planar staircase array as device conductance and the supplying of the other operand as input pulses at the word line for multiplication, according to some embodiments.

[0020] Figure 6C illustrates a flow chart depicting an example method of performing an in-memory multiplication operation, according to some embodiments.

[0021] Figure 6D illustrates example hardware components of an FPGA for performing an in-memory multiplication operation, according to some embodiments.

[0022] Figure 7A illustrates an array structure including two planar staircase arrays for two operands of a multiplication operation, according to some embodiments.

[0023] Figure 7B illustrates a 16-bit multiplication operation using two planar staircase arrays for parallel partial products, according to some embodiments.

[0024] Figure 8 illustrates a programming scheme for the planar staircase array with m outputs per array structure, according to some embodiments.

[0025] Figure 9A illustrates a row-by-row readout scheme while grounding/floating all unused inputs, according to some embodiments.

[0026] Figure 9B illustrates a device-by-device readout scheme while grounding/floating all the unused inputs, according to some embodiments.

[0027] Figure 10 illustrates a 16-bit multiplication using two array structures of a planar staircase array, according to some embodiments.

[0028] Figure 11A illustrates a structure of an FPCA including multiple independent blocks, a crossbar layer, and a CMOS layer, according to some embodiments.
[0029] Figure 11B illustrates routing multiplexers implemented with RRAMs, according to some embodiments.

[0030] Figure 11C illustrates a pipeline for performing various operations in the crossbar arrays, according to some embodiments.

[0031] Figure 12A illustrates an example method of implementing a neural network in an FPCA, according to some embodiments.

[0032] Figure 12B illustrates an example data distribution for implementing a neural network in an FPCA, according to some embodiments.

[0033] Figure 12C illustrates an example of implementing a neural network in an FPCA, according to some embodiments.

[0034] Figure 13 illustrates an example instruction set architecture for an FPCA, according to some embodiments.

DETAILED DESCRIPTION

[0035] Resistive random-access memory (RRAM) devices have the ability to perform massively parallel in-memory computations. Additionally, their low area, non-volatility, and re-programmability make RRAM devices ideal candidates for field-programmable crossbar array (FPCA) architectures. An RRAM-based FPCA architecture provides superior programmability and blurs the boundary between computation and storage. However, long-distance routing, device irregularities, and the high-power analog-digital interface become performance bottlenecks.

[0036] Advances in computing applications and artificial intelligence drive the adoption of edge computing for IoT applications, which require low run-time energy consumption, minimal footprint, short computing latency, and cost-effective solutions. Field Programmable Gate Arrays (FPGAs) offer high flexibility in hardware implementation while
lowering the time-to-market and the system cost. However, FPGAs consume considerable power and area in edge computing applications, which negates these gains. While the power overhead of an FPGA can be reduced by lowering the operating voltage in typical systems, the lower operating voltage may result in performance degradation. In addition, the decreasing feature size of MOSFETs combined with a lower operating voltage results in a high leakage current in the standby mode. Further, using volatile memory technology forces FPGAs to lose their configurations when powered off and to be reconfigured at each power-on.

[0037] Current methodologies leverage the analog nature of RRAMs to expedite neural network (NN) inference and boost performance. Additionally, current methods integrate the crossbars with other architectural enhancements, such as inter-layer pipelines, 3D-array systems, parallel multi-image processing, and data recycling, to enhance performance parameters. However, these methods focus exclusively on optimizing the in-memory execution of macro-operations like convolutions and vector-matrix multiplications required by the NNs and overlook the in-memory execution of primitive/micro digital logic computations. In addition, conventional RRAM crossbars require many programming cycles, which act as a bottleneck to improving system speed and latency. Additionally, such conventional methods are applied to inference, which does not require the devices to be dynamically programmed, and therefore provide no solution to lower programming cycles. Thus, these works fail to fully leverage the RRAM devices to build real-time reprogrammable systems capable of dynamically implementing trainable Binary/Floating-point NNs, multi-state inference machines, and purely digital operations.

[0038] Some conventional systems optimize the implementation of a full adder within an RRAM crossbar. However, the utilization of a ripple-carry adder architecture results in massive delays for higher bit-widths (e.g., 128 cycles for 128-bit addition). In addition, employing crossbars for various operations poses difficulties such as high programming time,
parasitic crossbar resistances, sneak currents, and device irregularities. Other systems provide for execution of in-memory binary multiplication using RRAMs. However, these systems optimize multiplication execution to aid convolution/vector-matrix multiplications. As most NNs deal with 16-bit precision and massive data, using these methods to perform individual higher bit-width multiplications results in a steep increase in latency and power while reducing throughput and device utilization. Further, these systems neglect to address the ubiquitous device irregularities in RRAMs.

[0039] To improve the performance of an RRAM-based FPGA, conventional systems focus on the programmable routing devices and configuration registers, replacing the Static Random-Access Memory (SRAM) circuits with RRAM counterparts. Others further improve performance by replacing the SRAMs within Look-Up Tables (LUTs) with RRAMs. However, these approaches fail to leverage the massive parallelism of the crossbar array to build exceedingly efficient logic blocks. Though some systems leverage the crossbar's parallelism to improve performance, they do not address the increased latency due to the high programming cycles of the RRAM.

[0040] Embodiments of the present disclosure provide a field programmable crossbar array architecture to efficiently perform in-memory arithmetic operations. In particular, the FPCA described herein provides in-memory addition and in-memory multiplication architectures that operate within only one or a few clock cycles. While embodiments are described with respect to resistive random-access memory (RRAM) devices, embodiments can be used with any advanced memory capable of performing in-memory computations.

[0041] In some embodiments, a field programmable computing architecture includes a programmable crossbar array structure comprising a plurality of memory components and a control plane coupled to a plurality of inputs and a plurality of outputs of the programmable
crossbar array, the control plane to configure the programmable crossbar array to perform one or more in-memory arithmetic operations.

[0042] In some embodiments, the programmable crossbar array comprises a Manhattan array, and the control plane configures the Manhattan array to perform an in-memory addition operation. In some embodiments, to perform the in-memory addition operation, the control plane configures each computational block of the programmable crossbar array to generate, for a single bit of two operands, a sum value and an output carry value for each possible input carry value, in parallel. In some embodiments, the field programmable computing architecture includes a plurality of multiplexers to select, for the output of each computational block, from the sum value and the output carry value based on a determined input carry value, wherein the selected values from each computational block are combined into an output value. In some embodiments, to configure the Manhattan array to perform the in-memory addition operation, the control plane is to erase each computational block of the crossbar array, program each computational block of the crossbar array to perform a computation, and erase redundant computational blocks of the crossbar array structure that have been programmed to provide a part of a single computational block in each column. For example, one computational block may include four columns of memory components.

[0043] In some embodiments, the programmable crossbar array comprises a planar staircase array, and the control plane configures the planar staircase array to perform an in-memory multiplication operation. In some embodiments, to perform the in-memory multiplication operation, the control plane configures the planar staircase array to perform each of a plurality of partial products of a multiplication in parallel. In some embodiments, the control plane further arranges an output of the plurality of partial products from the planar staircase array into a plurality of intermediate words. In some embodiments, the programmable crossbar array structure further comprises a Manhattan array, wherein the control plane
configures the Manhattan array to perform an in-memory addition operation on the plurality of intermediate words to generate a final multiplication output. In some embodiments, to configure the planar staircase array, the control plane is to store a first set of operands of the multiplication in-memory of the planar staircase array, wherein the first set of operands are distributed to perform bit-wise multiplication operations to generate the plurality of partial products, and input a second set of operands of the multiplication to corresponding word lines of the planar staircase array.

[0044] In some embodiments, a method of operating a field programmable computing architecture includes providing a programmable crossbar array structure comprising a plurality of resistive random-access memory (RRAM) components and configuring the programmable crossbar array structure to perform one or more in-memory arithmetic operations.

[0045] In some embodiments, the programmable crossbar array comprises a Manhattan array, and the method further includes configuring the Manhattan array to perform an in-memory addition operation. In some embodiments, configuring the Manhattan array to perform the in-memory addition operation includes configuring each computational block of the programmable crossbar array to generate, for a single bit of two operands, a sum value and an output carry value for each possible input carry value, in parallel. In some embodiments, the method further includes selecting, for the output of each computational block, from the sum value and the output carry value based on a determined input carry value and combining the selected values from each computational block into an output value. In some embodiments, configuring the Manhattan array to perform the in-memory addition operation includes erasing each computational block of the crossbar array, programming each computational block of the crossbar array to perform a computation, and erasing redundant computational blocks of the crossbar array structure that have been programmed to
provide a part of a single computational block in each column. For example, one computational block may include four columns of memory components.

[0046] In some embodiments, the programmable crossbar array comprises a planar staircase array, and the method further includes configuring the planar staircase array to perform an in-memory multiplication operation. In some embodiments, configuring the planar staircase array includes configuring the planar staircase array to perform each of a plurality of partial products of a multiplication in parallel. In some embodiments, the method further includes configuring the planar staircase array to arrange an output of the plurality of partial products into a plurality of intermediate words. In some embodiments, the programmable crossbar array structure further comprises a Manhattan array, and the method further includes configuring the Manhattan array to perform an in-memory addition operation on the plurality of intermediate words to generate a final multiplication output. In some embodiments, configuring the planar staircase array includes storing a first set of operands of the multiplication in-memory of the planar staircase array, wherein the first set of operands are distributed to perform bit-wise multiplication operations to generate the plurality of partial products, and inputting a second set of operands of the multiplication to corresponding word lines of the planar staircase array.

[0047] As mentioned above, the non-volatile nature, low power consumption (~0.1 pJ/bit), and low footprint (4F²) of typical Resistive Random-Access Memories (RRAMs) make them ideal candidates for FPGAs. RRAMs are non-volatile memory devices with a specially designed solid dielectric material positioned between two electrodes. A voltage is applied across an RRAM's terminals to modulate the resistance of the dielectric material to store data. Embodiments leverage the RRAM's compute-in-memory capability and the crossbar arrays' parallelism to build highly efficient Logic Blocks within FPGAs. Such an architecture is notably lucrative owing to the analog
nature of the RRAM and the massive data processing required by a typical neural network (NN). Additionally, by combining memory functionality and pass-gate logic in a unique device, RRAMs can significantly reduce the area and delay of routing elements.

[0048] Embodiments described herein replace the typical LUTs within the logic blocks with RRAM crossbars while also replacing the SRAMs used in the reprogrammable routing. In addition, embodiments lower the programming cycles for the crossbar arrays while improving system performance. Embodiments provide a flexible FPCA architecture, referred to herein as a BMX-FPCA (Beyond Moore Flexible Field Programmable Crossbar Array), to enable support for multi Beyond-Moore device-based 3D run-time reconfigurable architectures. Additionally, embodiments provide an H-multiplier that combines in-memory computing with novel partial product reduction techniques to reduce system latency and full adder access. Embodiments further include a hybrid crossbar system combining a conventional array and an "optimized planar staircase (PS) array," resulting in efficient mapping/execution of NNs and other in-memory analog, storage, and digital operations. Embodiments include a MIAMI (Modular In-memory arithmetic programming) technique, which drastically lowers the crossbar programming time to provide high-throughput and low-latency systems. Embodiments include a custom RRAM-CMOS interface-level peripheral circuit design, referred to herein as DICP (Device Irregularity Combating Peripherals), enabling novel system-level strategies to combat device/crossbar irregularities (sneak currents, I-R drop, device-to-device variability, and finite high-resistance state). Combining novel methodologies for arithmetic operations and programming of RRAM crossbar arrays with optimal layout and circuit design provides each of the advantages discussed above. Furthermore, to implement and manage the disclosed architecture, embodiments include
a new Instruction Set Architecture (ISA) to support the high programmability of the BMX-FPCA architecture.

[0049] Figure 1 illustrates a crossbar array of RRAM devices for memory storage and in-memory computation, according to some embodiments. In-memory computing architectures utilize crossbar arrays, which are structures in which the two electrodes of each device form a 2D grid, with devices located at the m×n intersections. The RRAM conductance holds one of the operands during computations, while the other operand is applied as a voltage level to the word line. Ohm's and Kirchhoff's laws govern the bit-line current. Integrators (sense amplifiers, SA) aggregate the bit-line currents as voltages and store the analog output. Using sample-and-hold circuits (S+H) saves processing overhead by allowing ADC sharing. The analog voltage outputs of the crossbar arrays are converted to digital signals by the ADC and then translated to equivalent floating-point elements using linear functions. The digital-to-analog converter resolution (u), the RRAM resolution (v), and the analog-to-digital converter (ADC) resolution, as well as the number of contributing RRAMs (n), determine the output precision. The ideal ADC resolution is:

ADC_res = u + v + log2(n), for u, v > 1;
        = u + v - 1 + log2(n), otherwise.   (1)

[0050] For NNs that can tolerate error, the overall ADC power can be reduced by using multi-state RRAMs and binning. However, high-accuracy multiplications cannot be performed similarly owing to device-to-device variability, parasitic wire resistance, and finite high resistance. The wire parasitics result in I-R drop along the routing and sneak currents, leading to an increased output error (OE). Embodiments described herein provide a general-purpose system that excels in both situations.
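For illustration only, the following Python sketch models the crossbar read-out governed by Ohm's and Kirchhoff's laws and the ideal ADC resolution of equation (1). The function names and the example conductance and voltage values are hypothetical and are not part of the disclosed hardware.

```python
import math

def ideal_adc_resolution(u, v, n):
    """Equation (1): u = DAC resolution, v = RRAM resolution,
    n = number of contributing RRAMs (assumed a power of two)."""
    if u > 1 and v > 1:
        return u + v + int(math.log2(n))
    return u + v - 1 + int(math.log2(n))

def bit_line_currents(G, V):
    """Ohm's/Kirchhoff's laws: current on bit line j is the sum over
    word lines i of conductance G[i][j] times input voltage V[i]."""
    rows, cols = len(G), len(G[0])
    return [sum(G[i][j] * V[i] for i in range(rows)) for j in range(cols)]

# A 3x2 crossbar: LRS ~ 1e-5 S, HRS ~ 1e-7 S, driven by 0.1 V read pulses.
G = [[1e-5, 1e-7],
     [1e-5, 1e-5],
     [1e-7, 1e-5]]
V = [0.1, 0.0, 0.1]
print(bit_line_currents(G, V))        # analog MAC result per bit line
print(ideal_adc_resolution(4, 3, 4))  # ADC bits needed for 4 contributing rows
```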
[0051] Figure 2A illustrates a field programmable gate array, according to embodiments of the present disclosure. Field-programmable gate arrays (FPGAs) consist of arrays of repeatable tiles with programmable connections to realize different designs. FPGAs consist of configurable Logic Blocks, Connection Blocks, and Switch Boxes, as depicted in Figure 2A. Connection Blocks connect routing tracks to the Logic Block pins, while Switch Boxes provide the inter-tile connection between the logic block pins and the routing tracks, as depicted in Figure 2B. Logic blocks consist of several Basic Logic Elements (BLEs) interconnected by a dense local routing architecture. A typical BLE contains a LUT, a Flip-Flop, and a 2:1 multiplexer, which selects either a combinational or a sequential output, as depicted in Figure 2C. Based on different application needs, commercial FPGAs may adopt fracturable LUTs or hard carry chains in the logic blocks, or replace tile columns with heterogeneous blocks. Embodiments described herein use the architecture shown in Figure 2A while replacing the basic LUTs with RRAM-based processing blocks and the SRAM-based switches with RRAM-based switches.

[0052] Figure 3A depicts an RRAM Manhattan crossbar array, according to embodiments of the present disclosure. Using the crossbar array depicted in Figure 3A, embodiments provide an in-memory variant of a full adder for high-accuracy computations.

[0053] With two operands (a0, b0) and an input carry bit (c-1), the sum (S0) and carry-out (c0) bits of a 1-bit full addition are calculated as:

S0 = a0 ⊕ b0 ⊕ c-1 = (((~a0).b0 || a0.(~b0)).(~c-1)) || (((~a0).(~b0) || a0.b0).c-1)   (2)

c0 = a0.b0 || ((a0 || b0).c-1)   (3)

[0054] Here, the dots represent the logical AND operation, while the "||" sign represents the logical OR operation. Also, the "~" sign represents a logical NOT (inversion) operation.

[0055] Depending on c-1, we can rewrite (2) and (3) as:

S0 = (~a0).b0 || a0.(~b0), when c-1 = 0;
   = (~a0).(~b0) || a0.b0, when c-1 = 1   (4)

c0 = a0.b0, when c-1 = 0;
   = (a0.b0) || (a0 || b0), when c-1 = 1   (5)

[0056] In (5), the a0||b0 (a0 OR b0) term encompasses the a0.b0 (a0 AND b0) term when c-1 is 1. Hence, we can eliminate the a0.b0 term and rewrite (5) as:

c0 = a0.b0, when c-1 = 0;
   = a0 || b0, when c-1 = 1   (6)

[0057] From (4), we observe that the sum bit requires the execution of the XOR or the XNOR operation depending on c-1. Similarly, c0 generation in (6) requires performing the input OR/AND depending on c-1.

[0058] To execute a logic "AND" within memory, we store one of the operands (b0) as the device conductance. The device is set to the low resistance state (LRS) if b0=1 and the high resistance state (HRS) if b0=0. Embodiments apply the other operand (a0) as an input pulse at the word line of the crossbar. Embodiments supply a 0.1 V input pulse if a0=1 and a 0 V pulse if a0=0. The output current varies as a function of the device resistance (b0) and the input voltage (a0) and is representative of a0.b0. Thus, embodiments accomplish the execution of an in-memory "AND." Likewise, embodiments can perform other logic functions within memory.

[0059] Figure 3A illustrates the planar Manhattan (MH) array used to execute the in-memory adder computations. Figure 3B delineates the execution of AND, preprocessed OR, XOR, and XNOR operations within memory. Using the logic circuits described above, embodiments provide an in-memory 1-bit full adder, using (4) and (6), as shown in Figure 3C. Embodiments store operand b0 as device resistance and supply a0 as input pulses to the crossbar. The high resistance state (HRS) of the RRAM represents a logic "0" in the equations, whereas the low resistance state (LRS) represents a logic "1". The part of the RRAM array with three inputs and four outputs (3×4), seen in Figure 3C, is referred to as a computational block (CB), and each planar MH array can process multiple such CBs in a
cycle. As seen from Figure 3C, each CB generates the sum and carry bits for both values of the carry-in (c-1=0/1). The 1st column of the CB generates a0.b0, which is c0 when c-1=0 (6); the 2nd column generates a0||b0, which is c0 when c-1=1 (6). Similarly, the 3rd column generates S0 when c-1=0 (a0 XOR b0) (4), and the 4th column generates S0 when c-1=1 (a0 XNOR b0) (4). Because embodiments generate the carry-out and sum for both values of carry-in, an n-bit addition can be divided into n 1-bit additions, and the addition of all n bits can be processed in parallel using n CBs.

[0060] At the end of the parallel add operations, embodiments select the sum and carry-out at each stage based on the carry-in of the corresponding stage. Accordingly, after the in-memory processing is complete and the crossbar current outputs are converted to digital floating-point equivalents using a sense amplifier (SA) and an analog-to-digital converter (ADC), embodiments determine the sum and carry-out by feeding the crossbar outputs into a 2:1 multiplexer (MUX) with the previous stage's carry as the control signal, as depicted in Figure 3D. In some embodiments, a MUX chain is formed to determine the carry-out and sum of each stage. Since the crossbar is operated at a much lower clock frequency (10 MHz) compared to the operating frequency of the CMOS counterparts (e.g., the switching frequency of a 2:1 MUX is 19.8 GHz), the entire MUX chain for most bit-widths (≤1024-bit) takes one clock cycle. For additions with operand bit-width >1024, embodiments conclude the in-memory processing in a single cycle but perform the output MUXing over multiple cycles.

[0061] Though the proposed in-memory full adder enables parallel processing of all the bits of the inputs to reduce latency, the device/layout issues of crossbars increase OE. As seen in Figure 3C, two LRS devices contribute to the current flowing out of the second column (c0 when c-1=1). Thus, when c-1=1, the crossbar ADC output (ADCres=k-bit) could be any integer among 0, 1, and 2, depending on a0 and b0 (supplied as input pulses). Further, the
other crossbar irregularities, such as finite HRS and sneak currents, can result in high OE. Figure 3C shows that each column connects to multiple "unused" RRAMs programmed to be in HRS. The unused RRAMs with finite HRS still contribute to the bit-line current and result in an erroneous output, since an increase in output current increases the output voltage (SA output), thereby increasing the value of its digital equivalent (ADC output).

[0062] Accordingly, some embodiments apply Device Irregularity Combating Peripherals (DICP) to eliminate the above issues by performing an OR operation on the k bits of each column's ADC output before MUXing, as shown in Figure 3D. DICP involves compressing the k-bit ADC output (At,BLi[k-1:0] in Figure 3D) to 1 bit by using OR gates. Even when the current contribution of the devices grows and increases the ADC output value, the OR output remains 1. As a result, embodiments eliminate the impact of finite HRS and multiple LRS devices on the OE. Additionally, embodiments curtail the effect of device-to-device variability on the output.

[0063] Figure 3E illustrates a detailed description of the hardware components used in the in-memory adder execution and the data flow of the in-memory adder, according to some embodiments. As depicted in Figure 3E, a DAC converts digital inputs to analog inputs for the crossbar array. The input signals may be programming signals or operand signals. Upon receiving the operand signals, the crossbar array generates outputs that are converted to digital signals (e.g., by an array of ADCs), which are stored in one or more registers. The outputs are then provided to OR gates for error handling. The digital outputs from the OR gates are then provided to a chain of multiplexers to select the sum bits and the carry bit, which is used as the control signal in the next multiplexer in the chain. Thus, the initial add operation of the crossbar array may utilize one clock cycle and the multiplexing chain may utilize one clock cycle.
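As a minimal sketch of the data flow just described (and not the claimed circuit), the following Python model evaluates one CB per operand bit, applies the DICP OR-reduction to each column's ADC output, and runs the 2:1 MUX chain; all function names are illustrative.

```python
def cb_outputs(a_bit, b_bit):
    """One 3x4 computational block (Figure 3C). Each entry counts the LRS
    devices conducting in that column; DICP then OR-reduces it to one bit."""
    columns = [a_bit & b_bit,        # column 1: c0 if c-1 = 0 (AND)
               a_bit + b_bit,        # column 2: c0 if c-1 = 1 (OR; up to 2 devices conduct)
               a_bit ^ b_bit,        # column 3: S0 if c-1 = 0 (XOR)
               1 - (a_bit ^ b_bit)]  # column 4: S0 if c-1 = 1 (XNOR)
    return [1 if c >= 1 else 0 for c in columns]  # DICP: OR of the k ADC bits

def in_memory_add(a, b, n):
    """All n CBs evaluate in one crossbar cycle; the 2:1 MUX chain then
    selects each stage's sum and carry using the previous stage's carry."""
    blocks = [cb_outputs((a >> i) & 1, (b >> i) & 1) for i in range(n)]
    carry, total = 0, 0
    for i, (and_c, or_c, xor_s, xnor_s) in enumerate(blocks):
        total |= (xnor_s if carry else xor_s) << i  # MUX on the sum columns
        carry = or_c if carry else and_c            # MUX on the carry columns
    return total | (carry << n)

assert in_memory_add(0b1011, 0b0110, 4) == 0b1011 + 0b0110
```

In this sketch, the list comprehension over the CBs corresponds to the single 10 MHz crossbar cycle, while the loop corresponds to the MUX chain that completes in one additional cycle for most bit-widths.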
[0064] Figure 4A illustrates an example programming scheme for a planar crossbar array, according to embodiments of the present disclosure. Customarily, an m×n crossbar is programmed column-by-column over n cycles (binary RRAM). So, although embodiments process the data within an m×n crossbar array in a single cycle, in-memory computing using conventional crossbar array programming would require n+1 cycles (n programming cycles and 1 processing cycle). Accordingly, embodiments described herein lower the programming cycles to 4+(2×log2(n/4)) using the following methodology. In a first cycle, embodiments clear the data within the crossbar by programming all the devices to HRS, supplying reset pulses at the word lines and grounding the bit lines. In the subsequent three cycles (e.g., cycles 2-4), embodiments write the operand data (b0-bn) into the multiple CBs of each crossbar. To begin, the 1st row of each CB is programmed by supplying set voltages to the word line and grounding the appropriate bit lines (based on the b0-bn values). For example, if b0=1, we ground the 1st, 2nd, and 4th bit-lines (based on Figure 3C) of the CB processing a0, b0 while supplying a set pulse to the word-line. The other terminals are floated using switches. In the two subsequent cycles, embodiments program the next two rows of the CBs. These four programming cycles are independent of the bit-width/crossbar size and are referred to as "Independent Write Cycles" in Figure 4A. The Boolean logic deployed to the crossbars is depicted in Figure 4B. For example, each of the computational blocks depicted in Figure 4B may perform at least an ADD operation of the two corresponding operands and then produce a sum and carry bit for each possible carry bit.

[0065] The described programming scheme creates multiple copies of the CBs within the crossbar that must be reprogrammed to HRS to prevent errors, as seen in Cycle 4 of both Figures 4A and 4B. In some embodiments, the devices within the CB copies are reprogrammed to HRS over the next 2×log2(n/4) cycles, as shown in Figure 4A, by supplying reset voltages to the appropriate word lines and grounding the bit lines. Embodiments begin by dividing the entire crossbar into four sections (shown in Cycle 5 of Figure 4A). In the 5th cycle, embodiments supply reset pulses to the word-lines and ground the bit-lines of the CBs in section 3 (Figure 4A). In the 6th cycle, embodiments apply reset pulses to the word-lines and ground the bit-lines of the CBs present in section 2. If the number of CBs in each section is >1, we divide each section into four sub-sections (shown in Cycle 7 of Figure 4A). We reset the CBs in sub-sections 2 and 3 of all sections using the described technique (Cycles 7 and 8 of Figure 4A). The division into sub-sections is repeated until each final sub-section consists of only 1 CB. At this point, reprogramming the CBs is concluded using the method described above. Using this technique, embodiments can clear the unwanted copy-CBs within 2×log2(n/4) cycles (for an m×n sized crossbar). Since the required programming cycles depend on the value of n, these cycles are referred to herein as "Array-Dependent Write Cycles." Thus, the described programming method lowers the crossbar programming cycles from n to 2×(2+log2(n/4)). In the processing cycle, read pulses proportional to the input values are applied at the word lines, and the output is determined from the bit-line current.
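A small helper, assuming n is a power of two no smaller than 4, tallies the programming cycles of this scheme against conventional column-by-column programming; it is a sketch of the cycle count only, not of the pulse sequencing.

```python
import math

def miami_programming_cycles(n):
    """Programming cycles for an m x n crossbar under the described scheme
    (n assumed a power of two, n >= 4): 1 erase cycle + 3 independent write
    cycles + 2*log2(n/4) array-dependent cycles to clear redundant CB copies."""
    return 4 + 2 * int(math.log2(n // 4))

for n in (4, 16, 128, 1024):
    print(f"n={n}: {miami_programming_cycles(n)} cycles "
          f"vs {n} for column-by-column programming")
```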
[0066] Figures 5A and 5B illustrate a planar staircase (PS) array, according to embodiments of the present disclosure. As can be seen in Figure 5A, each word line travels in steps up the bit lines, resulting in fully independent word lines in each array structure. As can be seen in Figure 5B, inputs may be shared between adjacent array structures in the FPCA. Additionally, the inputs and outputs correspond to vias from the CMOS layer of the FPGA (Figure 5B).

[0067] Figures 6A and 6B depict an in-memory multiplication algorithm, referred to herein as the H-multiplier, that utilizes in-memory computation to determine the partial products (PPs) of floating-point multiplications. As depicted in Figure 6A, embodiments bin/reorder the PPs into words using CMOS control circuits to reduce additions. Lastly, embodiments use the in-memory adder described above with respect to Figures 3A-E to add the derived words and
generate the output. As mentioned above, embodiments derive the PPs by performing multiply-accumulate computation (MAC) operations within memory. The planar staircase (PS) array reduces the crossbar programming time while shifting the input inherently. To make the computations within each sub-array of the PS array independent and to reduce the operations, embodiments employ the method illustrated in Figure 6A to perform multiplications. For any P-bit operand multiplication, embodiments divide the operation into two parts: the 1st evaluates the PPs (C0-C6 of Figure 6A) within memory, and the 2nd combines the PPs to generate the output. Embodiments may store one of the operands (b) as the device conductance and supply the other operand (a) as input pulses at the word-line, as depicted in Figure 6B. The current flowing out of the bit-lines represents the PPs. The final output (D in Figure 6A) generation requires shifting and adding these PPs, which typically requires 2P-1 sequential additions.

[0068] Embodiments reduce the processing time and hardware utilization by decreasing the number of PP additions using the method detailed in Figure 6A. The maximum value of any PP of a two P-bit operand multiplication does not exceed P. Hence, the maximum number of bits utilized for representing any PP is 1+log2P. For example, the maximum value of any of the PPs (C0-C6) for the 4-bit multiplication shown in Figure 6A is 4 (C3), requiring 3 bits (C3[2:0]) for representation. Similarly, for an 8-bit multiplication, the maximum PP value is 8 and needs 4 bits for representation. Thus, we can conclude that any PP (Ci) can contribute to a maximum of 1+log2P output bits (D[i] to D[i+log2P]) during shift and add.

[0069] Building on the above observation, embodiments reduce the post-in-memory processing additions by reordering/binning the PPs into words. Embodiments then divide the complete set of PPs into multiple non-overlapping sub-divisions (SDs), each with 1+log2P elements. Embodiments partition the PPs into ceil(2P/(1+log2P)) non-overlapping SDs contributing to various words. For a 4-bit multiplication, such a partition results in 3 sub-
divisions (SD0-SD2 in Figure 6B). Next, embodiments combine the non-overlapping PPs of each sub-division to form different words (e.g., Word1-Word3). We compose the 1st word (Word1) by combining the 0th PP of each sub-division, the second word (Word2) by combining the 1st PP, and so on. Using this binning process, we form M+1 words (M=log2P) by combining different non-overlapping PPs of the SDs. Addition of these derived words results in the final product (D).

[0070] Further, D[0] and D[1] receive contributions from C0 and C1[0] alone. Accordingly, embodiments may eliminate the first 2 bits (Wx[1:0], 0 < x ≤ 1+log2P) in the final addition of the words and connect them (C0, C1[0]) directly to the output bits (D[0], D[1]). Thus, using this method, embodiments reduce the PP additions from 2P-1 additions of 2P-bit operands to log2P additions of 2×(P-1)-bit operands. In some embodiments, the derived words are added over a maximum of 1+floor(log2(1+log2P)) cycles.

[0071] Figure 6C depicts a flow diagram of an example method of performing a parallel multiplication operation using a planar staircase array, according to embodiments described herein. As can be seen in Figure 6C, two operands may be input to the multiplication operation. A first operand (e.g., operand A) is programmed into the conductance/resistance of the crossbar array of RRAMs of an FPCA, as described herein. The second operand (operand B) may be provided as a set of pulses to the crossbar array. Upon receiving the set of pulses of the second operand, an output is generated that includes the partial products of the operands. The partial products are reordered and binned into a set of words. The words are then added together to generate a final output. For example, the words may be a set of bits (e.g., a word of 8 bits) that can be added together using the adder described above.
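The binning just described can be sketched in Python as follows; the model is functional only (it computes the bit-line partial products arithmetically rather than from device currents), and it assumes P is a power of two.

```python
import math

def h_multiply(a, b, P):
    """Multiply two P-bit operands (P a power of two) by evaluating the
    partial products in parallel and binning them into M+1 words."""
    a_bits = [(a >> j) & 1 for j in range(P)]
    b_bits = [(b >> j) & 1 for j in range(P)]
    # C[i] models bit line i: the sum of all a_j * b_k with j + k == i.
    C = [sum(a_bits[j] * b_bits[i - j] for j in range(P) if 0 <= i - j < P)
         for i in range(2 * P - 1)]
    S = 1 + int(math.log2(P))        # bits per PP, and PPs per sub-division
    words = []
    for r in range(S):               # word r gathers the r-th PP of every SD
        w = 0
        for t in range(math.ceil(2 * P / S)):   # t indexes the sub-divisions
            i = t * S + r
            if i < len(C):
                w += C[i] << i       # PPs within one word never overlap
        words.append(w)
    return sum(words)                # the remaining additions of derived words

assert h_multiply(11, 13, 4) == 143  # 4-bit example: 3 SDs, 3 words
```

The non-overlap holds because each PP occupies at most S = 1+log2P bits while consecutive PPs within one word are spaced S bit positions apart, which is why the words can be formed without any carries.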
[0072] Figure 6D depicts hardware components involved in a multiplication operation using a PS array, according to some embodiments. In some embodiments, a DAC may convert digital operands into analog signals, one operand being programmed in-memory and the other provided as a set of signal pulses. The output analog signals (e.g., partial products) from each bit line may be converted from analog to digital signals by an array of ADCs. Each of the partial products may be stored in a register and then reordered and binned to generate a set of words that include all of the partial products. The words may be stored in a word register and then added together via an adder, as described above. For example, the adder may be a crossbar array structure for in-memory addition operations. The sum of the words may be the output result (e.g., the product) of the initial operands.

[0073] Figures 7A and 7B illustrate a 16-bit multiplication using two different array structures, according to some embodiments. Embodiments may multiply two 8-bit operands in-memory using two sub-arrays (P1, P2) within two different array structures (AS), as seen in Figure 7A. In some embodiments the sub-arrays may share inputs, while in other embodiments the sub-arrays do not share inputs. Due to the location of the bits stored within the array, each sub-array is independent and receives no interference from the other sub-array's inputs. However, since the size of the crossbar arrays cannot keep increasing with an increase in operand bit-width, embodiments divide the execution of a higher bit-width multiplication into smaller parts that the finite-sized crossbar arrays can handle. Accordingly, embodiments divide each P-bit (P≥9) operand multiplication into [ceil(P/8)]² 8-bit multiplications. For example, as depicted in Figure 7B, embodiments divide each 16-bit number into two shifted 8-bit numbers, transforming each 16-bit multiplication into 4 (=(16/8)²) 8-bit multiplications. We execute the constituent 8-bit multiplications within 2×(2×ceil(P/8)-1) AS using the configuration shown in Figure 7A. The sub-parts of the different PPs from the different AS (outputs from sections A, B, and C of Figure 7B) are combined in a single cycle because they are independent of each other. Embodiments obtain the output by binning the resultant PPs using the method described above with respect to Figure 6A and adding the derived words.
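Reusing the h_multiply sketch above, the decomposition of a 16-bit multiplication into four 8-bit multiplications can be modeled as follows; the shifts mirror the word alignment of sections A, B, and C in Figure 7B, and the function name is illustrative.

```python
def multiply_16bit(a, b):
    """Figure 7B style decomposition: one 16-bit multiply becomes four 8-bit
    multiplies whose (shifted) partial results are combined afterwards."""
    a_lo, a_hi = a & 0xFF, a >> 8
    b_lo, b_hi = b & 0xFF, b >> 8
    return (h_multiply(a_lo, b_lo, 8)
            + ((h_multiply(a_hi, b_lo, 8) + h_multiply(a_lo, b_hi, 8)) << 8)
            + (h_multiply(a_hi, b_hi, 8) << 16))

assert multiply_16bit(51234, 60001) == 51234 * 60001
```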
Thus, embodiments perform in-memory multi-precision (P≥9) floating-point multiplications by partitioning them into 8-bit multiplications. The total number of additions (TNA) required to generate the output of a P-bit multiplication is:

TNA = log2(P), for P ≤ 8;
    = 2×(P-8) + log2(P), for 8 < P ≤ 80   (10)

[0074] Figure 8 illustrates a programming scheme for multiplication in an array structure, according to some embodiments. As mentioned above, though the in-memory processing can be performed in 1 cycle, programming an array requires multiple cycles. This tuning time adds to the system latency and reduces performance. Hence, embodiments of the present disclosure provide a method that lowers the programming cycles of the planar staircase (PS) array.

[0075] The PS arrays depicted in Figure 5B, for example, may include 18 array structures (AS), each with 10 sub-arrays and 10 outputs. Each sub-array consists of 9 RRAMs contributing to the bit line, thus resulting in 18 inputs (=10+9-1; 10 outputs and 9 contributing kernel elements) per sub-array. Due to the staircase routing, different sub-arrays within the same AS of a PS array do not share inputs, as shown in Figure 5A. Even within each sub-array, a maximum of 9 columns of devices may share inputs, as shown in Figure 5B. Hence, embodiments may program all the devices within an AS in 9 cycles. Further, for a PS array with outputs per AS ≥ contributing devices per output + 1 (as is the case here), only the adjacent AS share inputs with each other, as depicted in Figures 5A-B. Thus, after setting all the devices to HRS in a cycle, we can fully program the PS array in the next 18 cycles by tuning the "even" AS in 9 cycles and the "odd" AS in the subsequent 9 cycles.

[0076] For smaller bit-widths (P≤5), each multiplication is contained within a single AS and requires 2P cycles (=1+(2P-1); 1 cycle to set all devices to HRS and 2P-1 cycles to program the devices connected to the 2P-1 outputs). However, for 5<P≤8, since the "odd" AS utilized for the multiplication only has 2P-11 outputs generating meaningful PPs, embodiments program only those 2P-11 columns. Thus, for 5<P≤8, the multiplication is spread across 2 adjacent AS and requires 2P-1 (=1+9+(2P-11)) cycles. For P>8, since we divide the multiplication into multiple 8-bit multiplications, we require only 15 (=1+9+5) program cycles. Thus, as compared to multiplication using the Manhattan array structure, embodiments using the planar staircase array lower the programming cycles by 8.53× for a 128-bit multiplication. Additionally, embodiments may lower the PS programming cycles to 10 by using alternate AS, though this leads to device under-utilization and power/area wastage.
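For reference, the following sketch tabulates equation (10) and the programming cycle counts stated above; it simply restates the piecewise formulas and assumes P is a power of two within the stated ranges.

```python
import math

def total_additions(P):
    """Equation (10): word additions needed to produce a P-bit product."""
    if P <= 8:
        return int(math.log2(P))
    if P <= 80:
        return 2 * (P - 8) + int(math.log2(P))
    raise ValueError("equation (10) is stated for P <= 80")

def ps_programming_cycles(P):
    """Program cycles for a P-bit multiply on the PS array, per the text."""
    if P <= 5:
        return 2 * P          # 1 reset + (2P - 1) output columns, single AS
    if P <= 8:
        return 2 * P - 1      # 1 reset + 9 + (2P - 11), across two adjacent AS
    return 15                 # 1 + 9 + 5 once partitioned into 8-bit multiplies

for P in (4, 8, 16, 64):
    print(P, total_additions(P), ps_programming_cycles(P))
```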
[0077] Figure 9A illustrates a row-by-row readout scheme while floating all unused inputs, and Figure 9B illustrates a device-by-device readout scheme while grounding all the unused inputs, according to some embodiments. The non-volatile nature and high retention time (>10⁴ s) of RRAMs make them excellent candidates for storage. The crossbar arrays can be programmed either bit-by-bit or column-by-column based on system requirements. As detailed in Figures 9A-B, embodiments may read the values within the devices in diverse ways.

[0078] Figure 10 illustrates storage of 16-bit numbers within a PS array, according to some embodiments. Such storage enables the simultaneous readout of multiple bytes stored within different AS of the same crossbar. Embodiments may use the readout scheme described in either of Figures 9A and 9B, but it may be advantageous to use the readout scheme of Figure 9A to minimize leakage.

[0079] Figure 11A illustrates a structure of an FPCA including multiple independent blocks, a crossbar layer, and a CMOS layer, according to embodiments described herein. The FPCA consists of multiple RRAM tiles that perform the computations and top-level digital circuits that control data flow and perform simple processing operations. Each RRAM tile consists of multiple independent blocks (IB), each with a layer of crossbar (XB) arrays stacked over a CMOS layer, as shown in Figure 11A. The tile top consists of combinational
logic (CL) and a state matrix unit (SMU) to enable the implementation of NNs. The CMOS layer of the independent blocks consists of the different digital components required for the functioning of the tile, such as ADCs, DACs, MUXes, and CL. The top RRAM layer provides the system with computational and storage functions, while the CMOS layer acts as the control layer. Since RRAM fabrication is a low-temperature process and is BEOL-compatible, it does not cause any performance degradation of the transistors underneath it. The RRAM tiles contain either multiple MH arrays of 8×8 or 128×128 size, or multiple PS arrays of 90×180 size (number of devices contributing to each output × number of array outputs). As discussed in previous sections, we determined the RRAM array sizes by the array I-R drop, programming cycles, power, and area. However, by lowering the RRAM device current, embodiments can reduce the parasitic I-R drop and enable the demonstration of larger arrays on smaller nodes.

[0080] Embodiments facilitate a low-delay interaction between the RRAM arrays and the CMOS circuits by connecting the two layers through high-density inter-layer vias. The DACs supply voltage pulses to the crossbars, while the ADCs convert the crossbar output voltages to their digital counterparts. The MUXes are essential for the random-access operation of the RRAM layer and for ADC sharing. For post-in-memory processing, the adders require MUXes and OR gates. Also, the convolution operations require digital adders and multipliers. In addition to the basic digital circuits in the tiles, embodiments include centralized control circuitry to facilitate the overall system operation. Hence, the CMOS layer hosts digital circuitry for control, core-to-core data communications, and simple processing operations. Embodiments further use RRAMs to realize the routing multiplexers, depicted in Figure 11B, and fabricate them on the CMOS layer.

[0081] In some examples, the FPCA functions at a 10 MHz frequency to account for the typical read time of an RRAM and to represent the NN implementation's width-modulated
pulse inputs with 16 states (4-bit input resolution). In some examples, the CMOS circuitry in the architecture may function at 1.2 GHz. As a result, embodiments provide various CMOS circuitry to satisfy the clock frequency, crossbar size, and bit-width requirements. Such a design, with the crossbar running at a lower frequency compared to the CMOS circuitry, allows for ADC sharing among multiple crossbar outputs. This results in a decrease in the total number of ADCs, which reduces the system's power substantially. Though the crossbars run at 10 MHz due to the high device read time and power constraints, embodiments increase throughput by leveraging their massive parallelism.

[0082] To enable the implementation of NNs, Discrete Fourier Transforms (DFT/IDFT), and multi-precision digital operations within the developed system, embodiments include 168 tiles in the FPCA system. Each tile top may include circuits to enable high-throughput DFT and NN implementations. Since convolutions within NNs are the most computationally intensive tasks, embodiments include 84 PS tiles with 8 independent blocks (IBs) in each tile. Each of the IBs consists of 8 PS arrays. Embodiments include combinational logic and registers to handle the tile requirements. Embodiments may include 54 128×128 MH array tiles in the system to enable a low-power implementation of the fully connected layers of different NNs. Each 128×128 MH tile consists of 12 IBs with 8 arrays each. The remaining 30 tiles consist of the 8×8 MH arrays. Each such tile consists of 12 IBs with 128 8×8 MH crossbars each.

[0083] Because the FPCA may perform multiple tasks, each with multiple combinations, embodiments include an Instruction Set Architecture (ISA). Figure 13 provides a detailed description of the different in-memory operations the architecture can perform and the programming/processing cycles required by them.

[0084] In Figure 13, P represents the bit-width, M×M represents the input matrix size supplied as voltage pulses, and K×K represents the weight/kernel size stored as RRAM
conductance. The Al2O3 RRAM considered here requires 50 cycles (VReset = -0.87 V) to program it to the different considered conductance levels for multi-level operations. Since each column requires 50 cycles, an M×M sized matrix (M<128) requires 1+50×M cycles (1 to set the entire array to LRS) for programming the MH array. Embodiments may lower the tuning cycles by increasing the reset voltage.

[0085] Table I, depicted in Figure 13, further shows that the described architecture provides multiple combinations of programming cycles, processing cycles, and arrays required for the same operation. Larger matrices are partitioned into smaller portions that can be handled within the crossbars and are processed over multiple pipelined cycles. Embodiments may reduce the processing cycles for binary operations by increasing the number of arrays used to execute the operation. However, such an increase in arrays increases power consumption. Thus, the FPCA chooses the combination by considering all the variables involved based on the application requirements. The system decides on the crossbar type, size, and number based on the available hardware and user inputs on the output accuracy.

[0086] Figure 11C illustrates a pipeline for performing various operations within the crossbar arrays, according to embodiments of the present disclosure. To prevent resource under-utilization, embodiments leverage the advantages of pipelining. In the 1st cycle of the pipeline, the architecture fetches the inputs and sends them to the crossbar tile's input register (IR), either directly for binary operations or, for NN computations involving multi-bit storage within RRAMs, after processing in the SMU to determine the state matrices. The SMU determines the state matrices and consists of adders, multipliers, units to determine the maximum/minimum of matrices, comparators, and other CL. The state matrix elements are sent over the shared bus to the current RRAM tile and stored in the IR. The number of
crossbars within a tile and the unique inputs to these arrays determine the IR width. We complete the data transmission to the IR in a 100 ns stage.

[0087] Next, the arrays are programmed based on the inputs over the subsequent O programming cycles, based on the crossbar employed and the application. Upon programming, the IR sends the other operand to the appropriate array and supplies it as voltage pulses using the DACs. Over the following L processing cycles, the in-memory computation is performed. The ADCs transform the SA outputs to their digital counterparts in the following 100 ns clock cycle and store them within the IB's output register (OR). The unprocessed ADC outputs are retrieved from the OR and processed using the required CL at the IB top (adders, multipliers, MUXes) in the following 100 ns cycle. The processed results from the CL are written back into the OR of the IB by the end of the cycle. The IB's OR may accommodate both the binary and the multi-bit storage scenarios. For the multi-bit storage case, the OR may accommodate a 16-bit output bit-width.

[0088] In the next cycle, the output stored in the IB OR is sent to the tile top. As in the case of NNs that require multiple IBs/tiles to work in tandem, some computations may be spread across numerous IBs. These values may undergo further additions before merging with the results in the tile OR during a 100 ns cycle. In the subsequent cycle, for NNs, the outputs generated in one layer are transmitted into the IR of the tile processing the next layer. Hence, for an NN layer during inference, one complete pipeline requires 6 clock cycles. For NN training, embodiments may require additional programming cycles for RRAM conductance tuning with any weight changes. However, for standalone instructions, embodiments write back the results stored within the top OR in the final cycle. Thus, embodiments execute a standalone instruction in L+O+5 10 MHz cycles, where L and O are operation dependent.
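As a sketch only, the following helper restates the L+O+5 cycle count; the attribution of the five fixed cycles to individual pipeline stages follows the description above and is an interpretation, not a definition.

```python
def standalone_instruction_cycles(L, O):
    """Standalone instruction latency in 10 MHz cycles: fetch-to-IR (1) +
    programming (O) + in-memory processing (L) + ADC-to-OR (1) +
    CL post-processing (1) + tile-top merge (1) + write-back (1)."""
    return L + O + 5

def standalone_latency_us(L, O, f_hz=10e6):
    return standalone_instruction_cycles(L, O) / f_hz * 1e6

# Hypothetical operation: O = 8 programming cycles, L = 1 processing cycle.
print(standalone_instruction_cycles(1, 8), "cycles =",
      standalone_latency_us(1, 8), "microseconds")
```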
[0089] Figures 12A-C illustrate an implementation of the multipliers discussed above to perform matrix multiplication, according to embodiments described herein. While NNs mandate low power and high accuracy, the RRAM states (a minimum of 6 bits) required to achieve a high accuracy are difficult to demonstrate on a single device. RRAM device variability further exacerbates the issue. Hence, embodiments described herein utilize a Matrix-Matrix multiplication (M2M) method to achieve high output accuracy with low input resolution while combating device issues and improving throughput. Embodiments perform operations on positive and negative numbers within a single array by splitting the matrices using the method described in Figures 12A-C. Kernel elements are binned and mapped onto the device conductance states, as depicted in Figure 12B, while input voltage pulses with pulse-widths based on the input feature map (IFM) are applied to the word-lines, as depicted in Figure 12A. The current flowing through each bit line over the processing time is integrated and converted to digital signals. A linear transformation of these digital signals generates the floating-point output matrix elements. Employing the above M2M method to simulate a 4-layer CNN designed for the MNIST database and implemented using PS arrays achieved 99.27% classification accuracy.

[0090] For in-memory DNN implementation, we use the methodology described in Figures 12A-C. As depicted, embodiments split each kernel matrix once and employ 3-bit device/4-bit pulse resolutions for in-memory computations. Embodiments represent all the negative numbers of all the kernel sets (irrespective of the kernel values) using the max and least device pulse states (0 and 2^dev_res - 1). Hence, all the negative numbers of all the kernels within different kernel sets operating on the same image can be processed using two crossbar arrays. Thus, embodiments may approximate that processing each kernel requires four crossbars (2 for the split, with each split matrix represented using two matrices). Also, since the inputs to any layer of the DNN consist of elements ≥0 due to the activation function, embodiments can process the IFM within one cycle. Using the above observations, embodiments utilize four devices in 4 different crossbars for processing a floating-point operation within one 10 MHz clock cycle.
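One possible reading of this split, sketched below under stated assumptions, separates each kernel into non-negative "positive" and "negative" parts, bins each onto 2^dev_res conductance states, and subtracts the two crossbar read-outs; the exact split and binning used by the embodiments may differ (for instance, the four-crossbar representation described above is not modeled), and all function names are hypothetical.

```python
import numpy as np

def split_kernel(W):
    """Split a signed kernel into two non-negative matrices (W = Wp - Wn)
    so each part can be mapped onto device conductance states."""
    return np.where(W > 0, W, 0.0), np.where(W < 0, -W, 0.0)

def bin_to_states(W, dev_res=3):
    """Bin kernel elements onto 2**dev_res conductance states (Figure 12B)."""
    levels = 2 ** dev_res - 1
    scale = W.max() / levels if W.max() > 0 else 1.0
    return np.round(W / scale) * scale

def m2m(W, x, dev_res=3):
    """Evaluate W @ x as the difference of two crossbar read-outs."""
    Wp, Wn = split_kernel(W)
    return bin_to_states(Wp, dev_res) @ x - bin_to_states(Wn, dev_res) @ x

W = np.array([[0.5, -0.25], [-1.0, 0.75]])
x = np.array([1.0, 2.0])
print(m2m(W, x), "vs exact", W @ x)  # quantization error is visible here
```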
[0091] For performing an inference, embodiments provide a Manhattan tile with N1 IBs, each with M1 arrays and P1 outputs/I1 inputs per array. Owing to the pipeline described in Figure 11C, the floating-point multiply-accumulate throughput of the tile is given by:

TP_MH = (2 × (N1 × M1 × P1 × I1) × 10⁷) / 4 FLOPS   (11)

[0092] A PS array is a hierarchical structure composed of basic units called sub-arrays; multiple sub-arrays contributing to the same outputs form an array-structure, and numerous such array-structures may share inputs within a PS array. For a PS array, owing to the staircase routing, multiplying the number of inputs by the number of outputs over-estimates the number of devices within each array. However, embodiments may determine the devices performing computations within each array by multiplying the number of outputs by the number of devices contributing to each output. Hence, for a PS array, the throughput formula changes. Consider a PS array tile with N2 IBs, each with M2 arrays, where each PS array consists of A array-structures, each array-structure with P2 outputs and X2 sub-arrays. Each of the outputs is connected to U2 RRAMs per sub-array. Hence, the total number of RRAMs contributing to each output of an array-structure is:

RRAM_out = X2 × U2   (12)

[0093] Using (12), the number of devices performing computations within each array-structure is determined as:

D = P2 × (X2 × U2)   (13)

[0094] The total number of devices within each PS array is obtained by multiplying D in equation (13) by A (the number of array-structures within a PS array).
However, embodiments use 4 RRAMs to perform two floating-point operations (multiplication and addition). Thus, the DNN throughput per PS tile can be derived as:

TP_PS = (2 × (N2 × M2 × A × P2 × X2 × U2) × 10⁷) / 4 FLOPS   (14)

[0095] Although, as described herein, each PS array has the devices of 10 sub-arrays contributing to the output, embodiments consider a maximum kernel size of 9×9 to calculate the throughput. Hence, X2 = 9.

[0096] The overall throughput of the system is given by:

TP_system = Σj (Lj × TP_j) FLOPS   (15)

[0097] In the above equation, j represents the different types of arrays utilized in the system and Lj is the number of tiles of crossbar type j.

[0098] Additionally, embodiments include training of the NN. While calculating the throughput during training with an input image batch size of k, embodiments must consider the programming cycles as well. Though the PS arrays can be programmed within 18 cycles, the Manhattan arrays have to be programmed as well. For the different crossbars considered in this work, the worst-case programming cycle count (prog_c) is given by:

prog_c = max(O_PS × 50, O_MHu × 50) cycles   (16)

[0099] In (16), O_MHu represents the cycles required for programming MH array type "u" (u: 128×128 or 8×8). Thus, the throughput per Manhattan tile becomes:

TP_MH = (2 × (N1 × M1 × P1 × I1) × k × 10⁷) / (4 × (prog_c + k)) FLOPS   (17)

[00100] The throughput using the PS tiles is altered to:

TP_PS = (2 × (N2 × M2 × A × P2 × X2 × U2) × k × 10⁷) / (4 × (prog_c + k)) FLOPS   (18)

[00101] The final system throughput during training can be calculated using equation (15).
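The throughput relations (11)-(18) can be collected into a small calculator, shown below for reference; the example parameters are taken from the PS tile description above (8 IBs of 8 arrays, 18 array-structures, 10 outputs, X2 = 9, U2 = 9), and the symbol names follow the equations.

```python
def tp_mh_inference(N1, M1, P1, I1, f_hz=10e6):
    """Equation (11): FLOPS of one Manhattan tile during inference."""
    return 2 * (N1 * M1 * P1 * I1) * f_hz / 4

def tp_ps_inference(N2, M2, A, P2, X2, U2, f_hz=10e6):
    """Equation (14): FLOPS of one PS tile during inference."""
    return 2 * (N2 * M2 * A * P2 * X2 * U2) * f_hz / 4

def prog_c(O_ps, O_mhu):
    """Equation (16): worst-case programming cycles during training."""
    return max(O_ps * 50, O_mhu * 50)

def tp_training(tp_infer, k, prog_cycles):
    """Equations (17)/(18): inference throughput scaled by k / (prog_c + k)."""
    return tp_infer * k / (prog_cycles + k)

def tp_system(tiles):
    """Equation (15): tiles is a list of (L_j, TP_j) pairs."""
    return sum(L * tp for L, tp in tiles)

# PS tile example using the parameters described above.
ps = tp_ps_inference(N2=8, M2=8, A=18, P2=10, X2=9, U2=9)
print(tp_system([(84, ps)]))  # 84 PS tiles, combined via equation (15)
```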