
WO2025122584A1 - Systems and methods for high-throughput data operations in in-memory computing arrays - Google Patents


Info

Publication number
WO2025122584A1
Authority
WO
WIPO (PCT)
Prior art keywords
bank
computing
cim
bit
array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/058415
Other languages
French (fr)
Inventor
Echere Iroaga
Naveen Verma
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Encharge Ai Inc
Original Assignee
Encharge Ai Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Encharge Ai Inc filed Critical Encharge Ai Inc
Publication of WO2025122584A1


Classifications

    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C7/00Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C7/1006Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C11/00Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/21Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C11/34Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C11/40Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C11/41Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger
    • G11C11/413Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction
    • G11C11/417Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction for memory cells of the field-effect type
    • G11C11/419Read-write [R-W] circuits
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C11/00Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/54Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C5/00Details of stores covered by group G11C11/00
    • G11C5/06Arrangements for interconnecting storage elements electrically, e.g. by wiring
    • G11C5/063Voltage and signal distribution in integrated semi-conductor memory access lines, e.g. word-line, bit-line, cross-over resistance, propagation delay
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C7/00Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/12Bit line control circuits, e.g. drivers, boosters, pull-up circuits, pull-down circuits, precharging circuits, equalising circuits, for bit lines
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C8/00Arrangements for selecting an address in a digital store
    • G11C8/12Group selection circuits, e.g. for memory block selection, chip selection, array selection

Definitions

  • This disclosure relates to in-memory computing arrays, and in particular to data operations in in-memory computing arrays.
  • In-memory computing for neural network acceleration is an emerging and innovative approach that leverages the unique properties of memory devices to enhance the speed and efficiency of neural network computations.
  • Traditional neural network training and inference processes involve moving data back and forth between memory (RAM) and processing units (CPUs or GPUs), which can be a significant bottleneck in terms of speed and energy consumption.
  • In-memory computing seeks to overcome these limitations by processing data directly within the memory itself.
  • the techniques described herein relate to an in-memory computing architecture, including: a compute-in-memory (CIM) array of computing cells, the CIM array including a plurality of rows and a plurality of columns, each computing cell including a memory cell and an output port for providing a result of computation, a first bank of CIM array, from a plurality of banks of the CIM array, including a first set of rows of computing cells; a first row decoder coupled with the first bank of CIM array, the first row decoder configured to, one at a time, enable any of the first set of rows of computing cells to write data in respective memory cells; a second bank of CIM array, from the plurality of banks of the CIM array, including a second set of rows of computing cells; a second row decoder coupled with the second bank of CIM array; the second row decoder configured to enable, one at a time, any of the second set of rows of computing cells to write data in respective memory cells; a controller configured to
  • the techniques described herein relate to an in-memory computing architecture, further including: a first plurality of bit-line pairs, each bit-line pair including a first read interconnect and a first write interconnect, wherein each bit-line pair of the first plurality of bit-line pairs is coupled with computing cells in respective columns of the first bank of CIM array; and a second plurality of bit-line pairs, each bit-line pair including a second read interconnect and a write interconnect, wherein each bit-line pair of the second plurality of bit-line pairs is coupled with computing cells in respective columns of the second bank of CIM array.
  • the techniques described herein relate to an in-memory computing architecture, further including a third bank of CIM array, the second bank of CIM array positioned between the first bank of CIM array and the third bank of CIM array, wherein the first plurality of bit-line pairs traverse across the second bank of CIM array and the third bank of CIM array and are coupled with the computing cells in the respective columns of the first bank of CIM array.
  • the techniques described herein relate to an in-memory computing architecture, further including: a plurality of repeater circuits positioned on the first plurality of bit-line pairs between the second bank of CIM array and the third bank of CIM array, wherein each repeater circuit of the plurality of repeater circuits includes a unidirectional read repeater and a unidirectional write repeater coupled with a first read interconnect and a first write interconnect, respectively, of one of the first plurality of bit-line pairs.
  • the techniques described herein relate to an in-memory computing architecture, further including: a plurality of first bank differential drivers coupled with the first plurality of bit-lines, each first bank differential driver of the plurality of first bank differential drivers including: a sense amplifier that receives differential signals from a respective computing cell and outputs a non-differential signal to the respective first read interconnect, and a differential driver that receives a non-differential signal from the respective first write interconnect and outputs a differential signal to the respective computing cell.
  • the techniques described herein relate to an in-memory computing architecture, wherein the differential driver includes an inverter for inverting the non-differential signal received from the respective first write interconnect, wherein the differential signal includes an output of the inverter.
  • the techniques described herein relate to an in-memory computing architecture, wherein the differential driver includes at least one buffer in the path of the differential signal.
  • the techniques described herein relate to an in-memory computing architecture, further including: a first set of word-line interconnects corresponding to the first set of rows of computing cells of the first bank of CIM array, wherein each word-line interconnect of the first set of word-line interconnects is coupled with computing cells in a respective row of computing cells, a plurality of word-line interconnect repeaters positioned on the first set of word-line interconnects.
  • the techniques described herein relate to an in-memory computing architecture, wherein the first bank of CIM array includes at least a first sub-bank and a second sub-bank where the first sub-bank includes a first set of columns of computing cells and the second sub-bank includes a second set of columns of computing cells, wherein the plurality of word-line interconnect repeaters are positioned between the first sub-bank and the second sub-bank.
  • the techniques described herein relate to an in-memory computing architecture, further including a plurality of analog to digital converters (ADCs) coupled with one or more of the plurality of column interconnects, the ADCs configured to convert analog voltages on the one or more of the plurality of column interconnects into a digital value.
  • the techniques described herein relate to an in-memory computing architecture, further including: a plurality of compute in-memory units, each compute in-memory unit including the CIM array, and an on-chip network providing communication between the plurality of compute in-memory units.
  • the techniques described herein relate to an in-memory computing architecture, further including: a plurality of repeater circuits positioned on the first plurality of bit-line pairs between the second bank of CIM array and the third bank of CIM array, wherein each repeater circuit of the plurality of repeater circuits includes a unidirectional read repeater and a unidirectional write repeater coupled with a first read interconnect and a first write interconnect, respectively, of one of the first plurality of bit-line pairs.
  • the techniques described herein relate to an in-memory computing architecture, further including: a plurality of first bank differential drivers coupled with the first plurality of bit-lines, each first bank differential driver of the plurality of first bank differential drivers including: a sense amplifier that receives differential signals from a respective computing cell and outputs a non-differential signal to the respective first read interconnect, and a differential driver that receives a non-differential signal from the respective first write interconnect and outputs a differential signal to the respective computing cell.
  • the techniques described herein relate to an in-memory computing architecture, wherein the differential driver includes an inverter for inverting the non-differential signal received from the respective first write interconnect, wherein the differential signal includes an output of the inverter.
  • the techniques described herein relate to an in-memory computing architecture, wherein the differential driver includes at least one buffer in the path of the differential signal.
  • the techniques described herein relate to an in-memory computing architecture, further including: a first set of word-line interconnects corresponding to the first set of rows of computing cells of the first bank of CIM array, wherein each word-line interconnect of the first set of word-line interconnects is coupled with computing cells in a respective row of computing cells, and a plurality of word-line interconnect repeaters positioned on the first set of word-line interconnects.
  • the techniques described herein relate to an in-memory computing architecture, wherein the first bank of CIM array includes at least a first sub-bank and a second sub-bank where the first sub-bank includes a first set of columns of computing cells and the second sub-bank includes a second set of columns of computing cells, wherein the plurality of word-line interconnect repeaters are positioned between the first sub-bank and the second sub-bank.
  • the techniques described herein relate to an in-memory computing architecture, further including a plurality of analog to digital converters (ADCs) coupled with one or more of the plurality of column interconnects, the ADCs configured to convert analog voltages on the one or more of the plurality of column interconnects into a digital value.
  • the techniques described herein relate to an in-memory computing architecture, further including a plurality of compute in-memory units, each compute in-memory unit including the CIM array, and an on-chip network providing communication between the plurality of compute in-memory units.
  • the techniques described herein relate to a method, wherein the method uses a compute-in-memory (CIM) architecture, the CIM architecture including: a CIM array of computing cells including a plurality of rows and a plurality of columns, each computing cell including a memory cell; first and second banks of the CIM array, from a plurality of banks of the CIM array, respectively including first and second sets of rows of computing cells; first and second row decoders respectively coupled with the first and second banks of the CIM array, and a plurality of column interconnects, each column interconnect of the plurality of column interconnects coupled with output ports of computing cells in the respective column of the CIM array, the method including: enabling, by the first row decoder, any of the first set of rows of computing cells to write, one at a time, data in respective memory cells; enabling, by the second row decoder, any of the second set of rows of computing cells to write, one at a time, data in respective memory cells; controlling, by a CIM
  • the techniques described herein relate to a method, wherein the CIM architecture further includes: a first plurality of bit-line pairs, each bit-line pair including a first read interconnect and a first write interconnect, wherein each bit-line pair of the first plurality of bit-line pairs is coupled with computing cells in respective columns of the first bank of CIM array; and a second plurality of bit-line pairs, each bit-line pair including a second read interconnect and a write interconnect, wherein each bit-line pair of the second plurality of bit-line pairs is coupled with computing cells in respective columns of the second bank of CIM array.
  • the techniques described herein relate to a method, wherein the CIM architecture further includes: a third bank of CIM array, the second bank of CIM array positioned between the first bank of CIM array and the third bank of CIM array, wherein the first plurality of bit-line pairs traverse across the second bank of CIM array and the third bank of CIM array and are coupled with the computing cells in the respective columns of the first bank of CIM array.
  • FIG. 1 depicts a block diagram of an example in-memory computing architecture.
  • FIG. 2 shows a block diagram of a compute in-memory array.
  • FIG. 3 shows an example circuit diagram of the computing cells discussed above in relation to FIG. 2.
  • FIG. 4 shows an example compute in-memory array.
  • FIG. 5 shows an example timing diagram depicting the word line operation in various banks of the compute in-memory array shown in FIG. 4.
  • FIG. 6 shows an example compute in-memory array including repeaters for bit-line pairs.
  • FIG. 7 shows a portion of the compute in-memory array including word line repeaters.
  • ratios, concentrations, amounts, and other numerical data can be expressed herein in a range format. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint. It is also understood that there are a number of values disclosed herein, and that each value is also herein disclosed as “about” that particular value in addition to the value itself. For example, if the value “10” is disclosed, then “about 10” is also disclosed. Ranges can be expressed herein as from “about” one particular value, and/or to “about” another particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms a further aspect. For example, if the value “about 10” is disclosed, then “10” is also disclosed.
  • a further aspect includes from the one particular value and/or to the other particular value.
  • ranges excluding either or both of those included limits are also included in the disclosure, e.g. the phrase “x to y” includes the range from ‘x’ to ‘y’ as well as the range greater than ‘x’ and less than ‘y’.
  • the range can also be expressed as an upper limit, e.g. ‘about x, y, z, or less’, which should be interpreted to include the specific ranges of ‘about x’, ‘about y’, and ‘about z’ as well as the ranges of ‘less than x’, ‘less than y’, and ‘less than z’.
  • the phrase ‘about x, y, z, or greater’ should be interpreted to include the specific ranges of ‘about x’, ‘about y’, and ‘about z’ as well as the ranges of ‘greater than x’, ‘greater than y’, and ‘greater than z’.
  • the phrase “about ‘x’ to ‘y’”, where ‘x’ and ‘y’ are numerical values includes “about ‘x’ to about ‘y’”.
  • a numerical range of “about 0.1% to 5%” should be interpreted to include not only the explicitly recited values of about 0.1% to about 5%, but also include individual values (e.g., about 1%, about 2%, about 3%, and about 4%) and the sub-ranges (e.g., about 0.5% to about 1.1%; about 0.5% to about 2.4%; about 0.5% to about 3.2%, and about 0.5% to about 4.4%, and other possible sub-ranges) within the indicated range.
  • the terms “about,” “approximate,” “at or about,” and “substantially” mean that the amount or value in question can be the exact value or a value that provides equivalent results or effects as recited in the claims or taught herein. That is, it is understood that amounts, sizes, formulations, parameters, and other quantities and characteristics are not and need not be exact, but can be approximate and/or larger or smaller, as desired, reflecting tolerances, conversion factors, rounding off, measurement error and the like, and other factors known to those of skill in the art such that equivalent results or effects are obtained. In some circumstances, the value that provides equivalent results or effects cannot be reasonably determined.
  • the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list.
  • As used in the specification and the appended claims, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a proton beam degrader,” “a degrader foil,” or “a conduit,” includes, but is not limited to, two or more such proton beam degraders, degrader foils, or conduits, and the like.
  • temperatures referred to herein are based on atmospheric pressure (i.e. one atmosphere).
  • FIG. 1 depicts a block diagram of an example in-memory computing architecture 100.
  • the in-memory computing architecture 100 can be adapted, for example, to a scalable neural network accelerator architecture based on in-memory computing (IMC).
  • the in-memory computing architecture 100 is not limited to neural network applications, and can be employed in numerous applications where high data throughput with low power consumption is desired.
  • the in-memory computing architecture 100 includes a plurality of Compute In-Memory unit (CIMU) tiles 102.
  • the plurality of CIMU tiles 102 are arranged in an array within the architecture.
  • the plurality of CIMU tiles 102 can be individually enabled/disabled based on the computations to be carried out by the in-memory computing architecture 100.
  • the neural networks can be mapped to one or more CIMU tiles of the plurality of CIMU tiles 102.
  • the remainder of the CIMU tiles of the plurality of CIMU tiles 102 can be disabled to reduce power consumption.
  • the in-memory computing architecture 100 can include, in part, activation buffers 104, segmented weight buffers 106, and one or more phase-locked loops (PLLs) 108.
  • the activation buffers 104 can provide signals representative of activations from previous stages of computation, for instance previous layers in a neural network.
  • the segmented weight buffers 106 can provide data required for computation together with the activations/data from previous stages, for instance these weight buffers could store the weights of neural network layers.
  • the one or more PLLs 108 can provide reference clock signals to various portions of the in-memory computing architecture 100.
  • the in-memory computing architecture 100 can also include off-chip control elements 110 or interfaces for communication with off-chip processors or software to send and receive control or data signals.
  • the off-chip interface 110 can, by itself or in concert with other elements, provide circuits and protocols for high-speed interfaces for wired or wireless connections involving data, control signals, or both, to other processors or other arrays of CIMU tiles, for example, enabling the in-memory computing architecture 100 to scale upward as desired.
  • Each of the plurality of CIMU tiles 102 can include a plurality of CIMUs 112, an on-chip network 114, and a weight network 116. While FIG. 1 shows each of the plurality of CIMU tiles 102 including four CIMUs 112, this is only an example, and the CIMU tiles 102 can include fewer or more CIMUs 112.
  • One or more of the CIMUs 112 can include a compute in-memory (CIM) array 118, compute dataflow buffers 120, programmable digital single instruction multiple data (SIMD) module 122, and a programming and control module 124.
  • the CIM array 118 can be an array of computing cells, discussed further below.
  • the CIM array 118 can carry out computations based on data stored in the computing cells and data provided by the activation buffers 104.
  • the computing cells can be used to perform computational operations between inputs and data stored in a memory cell within the computing cells.
  • the operations can include logical operations (AND, NOR, etc.) or multiplication operations carried out between inputs.
  • the CIM array 118 can carry out matrix operations between multi-bit operands, which is particularly useful in neural network computations where activations are multiplied with weights. In some such applications, the weights can be stored in the memory cells of the CIM array 118 and activations can be provided as input vectors.
  • Each computing cell in the CIM array 118 can perform the multiplication operation between a 1-bit weight and a portion of the input activation, which can be represented in digital or analog signal form.
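As an illustration, the single-bit multiply performed by each computing cell can be modeled in a few lines of Python. This is a hypothetical behavioral sketch only; the function name `cell_multiply` and the sample values are illustrative and are not part of the disclosed circuit.

```python
def cell_multiply(weight_bit: int, activation: int) -> int:
    """Behavioral model of one computing cell: multiplying by a 1-bit
    weight reduces to gating the activation (a logical AND when the
    activation is itself a single bit)."""
    assert weight_bit in (0, 1)
    return activation if weight_bit else 0

# A column of cells contributes the sum of its per-cell products.
weights = [1, 0, 1, 1]
activations = [3, 5, 2, 7]
column = [cell_multiply(w, a) for w, a in zip(weights, activations)]
print(sum(column))  # 3 + 0 + 2 + 7 = 12
```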
  • Some example computing cells can generate a result that is in the form of an electrical signal.
  • the computing cell can output an analog voltage that is representative of the computation result.
  • the computing cell can output an electrical current that is representative of the computation result.
  • the electrical signals of various computing cells can be accumulated and processed to generate the overall matrix multiplication result. For example, electrical signals representative of computation from all computing cells in a single column of the CIM array 118 can be accumulated to represent a portion of the computation.
  • Accumulated electrical signals from multiple columns of computing cells of the CIM array 118 can be combined and processed to generate an overall matrix multiplication result.
  • where the electrical signal generated by the computing cells is an electrical current, the currents from various computing cells within a column can be summed to generate a representative accumulated electrical current.
  • the analog voltage generated by each computing cell can be stored in capacitors within the computing cell and then accumulated as a voltage that is representative of a portion of the overall matrix multiplication result.
  • the accumulated result, whether an electrical current or an analog voltage, can be converted into digital form using analog to digital converters (ADCs) and further processed, stored, or passed on to other CIM arrays 118 for further computations.
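The accumulate-then-digitize step can be modeled behaviorally as follows. This is a simplified sketch, assuming an idealized ADC; `adc_quantize`, the reference voltage, and the per-cell charge values are illustrative assumptions rather than the disclosed converter design.

```python
def adc_quantize(voltage: float, v_ref: float = 1.0, bits: int = 8) -> int:
    """Idealized ADC: map an analog voltage in [0, v_ref] to an
    unsigned digital code with the given resolution."""
    levels = (1 << bits) - 1
    voltage = min(max(voltage, 0.0), v_ref)  # clamp to the ADC input range
    return round(voltage * levels / v_ref)

# Per-cell contributions (modeled as fractions of v_ref) accumulate
# on the column interconnect before a single conversion.
cell_outputs = [0.004, 0.0, 0.004, 0.004]
column_voltage = sum(cell_outputs)
code = adc_quantize(column_voltage)
```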
  • the programmable digital SIMD 122 can have an instruction set for flexible element-wise operation and the compute dataflow buffers 120 can support a wide range of neural network dataflows.
  • Each CIMU 112 can provide a high level of configurability and can be abstracted into a software library of instructions for interfacing with a compiler (for allocating/mapping an application, neural network and the like to the architecture), and where instructions can thus also be added prospectively. That is, the library can include single/fused instructions such as element mult/add, h(·) activation, (N-step convolutional stride + matrix-vector-multiplication (MVM) + batch norm. + h(·) activation + max. pool), (dense + MVM) and the like.
  • h(·) can indicate an activation function, including without limitation the rectified linear unit ReLU(x) function, the sigmoid function (σ(x)), and other such functions.
  • Max pooling, a downsampling technique for reducing spatial dimensions to maintain computational efficiency while retaining the important features of the input or the network, can also be a subject of the computation.
  • the N-step convolutional stride can refer to the number of pixels or other information bits that a kernel or convolutional filter moves or glides across the input image during convolution to effect operations like feature detection, pattern recognition, blurring, image sharpening, image recognition, and the like.
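The strided convolution and max-pooling operations referenced above can be sketched in one dimension. This is an illustrative Python model only; the function names and sample data are hypothetical and unrelated to the disclosed hardware.

```python
def conv_1d(xs, kernel, stride=1):
    """Slide a kernel across the input with the given stride (no padding):
    the stride is how many positions the kernel advances per step."""
    k = len(kernel)
    return [sum(x * w for x, w in zip(xs[i:i + k], kernel))
            for i in range(0, len(xs) - k + 1, stride)]

def max_pool_1d(xs, window=2, stride=2):
    """Downsample by taking the max over each strided window."""
    return [max(xs[i:i + window])
            for i in range(0, len(xs) - window + 1, stride)]

feature = conv_1d([1, 2, 3, 4, 5, 6], kernel=[1, 0, -1])   # [-2, -2, -2, -2]
pooled = max_pool_1d([4, 1, 7, 3, 2, 9])                   # [4, 7, 9]
```

With `stride=2` in `conv_1d`, the kernel skips every other position, halving the number of output elements, which is the spatial-reduction effect the N-step stride refers to.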
  • the on-chip network 114 can include routing channels within Network In/Out Blocks, and a Switch Block, which provides flexibility via a disjoint architecture as shown, for example, by the disjoint buffer switch 133 in the enlarged view of OCN 114. This flexibility, among other benefits, enables modules that are independent of one another to work in parallel.
  • the OCN 114 works with configurable CIMU input/output ports to optimize data structuring to/from an in-memory computing engine, to maximize data locality across MVM dimensionalities and tensor depth/pixel indices.
  • the OCN 114 routing channels can include bidirectional wire pairs as shown by the exemplary bidirectional pipelined routing structure 131 in the expanded view of the OCN 114, so as to ease repeater/pipeline-FF insertion, while providing sufficient density.
  • the in-memory computing architecture 100 can be used to implement a neural network (NN) accelerator, wherein a plurality of compute in-memory units (CIMUs 112) are arrayed and interconnected using a very flexible on-chip network (OCN 114), wherein the outputs of one CIMU can be connected to or flow to the inputs of another CIMU or to multiple other CIMUs, the outputs of many CIMUs can be connected to the inputs of one CIMU, and so on.
  • the OCN 114 can be implemented as a single on-chip network, as a plurality of on-chip network portions, or as a combination of on-chip and off-chip network portions.
  • the CIMUs 112 can be surrounded by an on-chip network for moving activations between CIMUs 112 (activation network) as well as moving weights from embedded L2 memory to CIMUs 112 (weight-loading interface).
  • This has similarities with architectures used for coarse-grained reconfigurable arrays (CGRAs), but with cores providing high- efficiency MVM and element-wise computations targeted for neural network acceleration.
  • the approach in FIG. 1 enables routing segments along a CIMU 112 to take outputs from that CIMU 112 and/or to provide inputs to that CIMU 112. In this manner data originating from any CIMU 112 can be routed to any CIMU 112, and any number of CIMUs 112.
  • Each CIMU 112 is associated with an input buffer (not shown) for receiving computational data from the on-chip network and composing the received computational data into an input vector for matrix vector multiplication (MVM) processing by the CIMU to generate thereby computed data including an output vector.
  • Each CIMU 112 is associated with a shortcut buffer (not shown), for receiving computational data from the on-chip network 114, imparting a temporal delay to the received computational data, and forwarding delayed computation data toward a next CIMU 112 or an output in accordance with a dataflow map such that dataflow alignment across multiple CIMUs 112 is maintained.
  • At least some of the input buffers can be configured to impart a temporal delay to computational data received from the on-chip network 114 or from a shortcut buffer.
  • the dataflow map can support pixel-level pipelining to provide pipeline latency matching.
  • the temporal delay imparted by a shortcut or input buffers includes at least one of an absolute temporal delay, a predetermined temporal delay, a temporal delay determined with respect to a size of input computational data, a temporal delay determined with respect to an expected computational time of the CIMU 112, a control signal received from a dataflow controller, a control signal received from another CIMU 112, and a control signal generated by the CIMU 112 in response to the occurrence of an event within the CIMU.
  • at least one of the input buffer and shortcut buffers of each of the plurality of CIMUs 112 in the array of CIMUs 112 can be configured in accordance with a dataflow map supporting pixel-level pipelining to provide pipeline latency matching.
  • the array of CIMUs 112 can also include parallelized computation hardware configured for processing input data received from at least one of respective input and shortcut buffers.
  • at least a subset of the CIMUs 112 can be associated with on-chip network 114 portions including operand loading network portions configured in accordance with a dataflow of an application mapped onto the IMC.
  • the application mapped onto the IMC includes a neural network (NN) mapped onto the IMC such that parallel output computed data of configured CIMUs executing at a given layer are provided to configured CIMUs 112 executing at a next layer, said parallel output computed data forming respective NN feature-map pixels.
  • the input buffer can be configured for transferring input NN feature-map data to parallelized computation hardware within the CIMU in accordance with a selected stride step, such as discussed above.
  • the NN can include a convolution neural network (CNN), and the input buffer can be used to buffer a number of rows of an input feature map corresponding to a size or height of the CNN kernel.
  • the CIM array 118 in each CIMU 112 can perform matrix vector multiplication (MVM) in accordance with a bit-parallel, bit-serial (BPBS) computing process in which single-bit computations are performed using an iterative barrel shifting with column weighting process, followed by a results accumulation process.
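The BPBS process above can be illustrated with a short numerical model. The sketch below is a simplified software analogue of the scheme, not the circuit itself: the function name `bpbs_mvm`, the LSB-first bit ordering, and the shift-based accumulation standing in for the barrel-shifting-with-column-weighting step are all illustrative assumptions.

```python
import numpy as np

def bpbs_mvm(weights, inputs, x_bits=5):
    """Bit-parallel/bit-serial MVM sketch.

    weights: (rows, cols) integers, stored bit-parallel across columns.
    inputs:  (rows,) unsigned integers, applied bit-serially.
    Each input bit drives a full-array 1-bit MVM; the per-bit column sums
    are accumulated with binary (power-of-two) weighting, standing in for
    the barrel shifting with column weighting described in the text.
    """
    rows, cols = weights.shape
    acc = np.zeros(cols, dtype=np.int64)
    for b in range(x_bits):                  # serialize input bits, LSB first
        x_bit = (inputs >> b) & 1            # one bit per row
        partial = x_bit @ weights            # 1-bit-input column sums
        acc += partial << b                  # weight by bit significance
    return acc

# The accumulated result matches an ordinary integer matrix-vector product.
W = np.array([[1, -2], [3, 4], [-5, 6]])
x = np.array([3, 0, 7])                      # 5-bit unsigned inputs
assert np.array_equal(bpbs_mvm(W, x), x @ W)
```

The bit-serial loop is what allows multi-bit inputs to be processed with single-bit analog operations per step.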
  • FIG. 2 shows additional details of a portion of the in-memory computing architecture 100 shown in FIG. 1, and in particular, details of an example compute in-memory (CIM) array 200 and associated components.
  • the CIM array 200 can be used, for example, to implement, in part, the CIM array 118 discussed above in relation to the in-memory computing architecture 100 shown in FIG. 1.
  • the CIM array 200 can include a fully row/column-parallel (1152 row X 256 column) array of computing cells 202 of an in-memory-computing (IMC) macro enabling N-bit (5-bit) input processing.
  • the computing cells can be used to perform computational operations between inputs and data stored in a memory cell within the computing cells.
  • the operations can include logical operations (AND, NOR, etc.) or multiplication operations carried out between inputs.
  • the operands of the computation can be 1-bit each.
  • one of the operands can be an analog signal (voltage or current) while the other operand can be a 1-bit operand stored in the memory cell.
  • the in-memory computing architecture 100 can be utilized for matrix vector multiplication (MVM) operations, which dominate compute-intensive and data-intensive AI workloads, in a manner that reduces compute energy and data movement by orders of magnitude.
  • This is achieved through efficient analog compute in the computing cells 202, and by thus accessing a compute result (e.g., an inner product), rather than individual bits, from memory. But doing so fundamentally introduces an energy/throughput-vs.-SNR tradeoff: going to analog introduces compute noise, and accessing a compute result increases dynamic range (i.e., reducing SNR for a given readout architecture).
  • the computing cells 202, which store computational results in the form of a voltage in capacitors within the computing cells 202, can employ metal-fringing capacitors, which can achieve very low noise from analog nonidealities, and thus have the potential for extremely high dynamic range.
  • FIG. 2 shows a block diagram of the CIM array 200 including a 1152 (row) X 256 (col.) array of 10T (“ten transistor”) SRAM computing cells 202, which in this example are multiplying bit-cells (M-BCs) (such as, for example, a 10T M-BC 202, although the number of transistors of the SRAM interface 204 and the M-BCs 202 is implementation-dependent, and the circuit can use different numbers of transistors or other circuit elements without departing from the principles of the disclosure); peripheral circuits for standard writing/reading thereto (e.g., a bit line (BL) decoder such as SRAM interface 204, and 256 BL drivers 206-1 through 206-256 (collectively referred to as BL drivers 206)); a word line (WL) or address decoder 208 and 1152 WL drivers 210-1 through 210-1152 (collectively referred to as WL drivers 210); and a control block 212 for controlling the BL decoder 204 and the WL decoder 208.
  • the RST switches corresponding to CL1-CL256, or a subset thereof, can close to produce the desired reset voltage VRST during the reset phase.
  • the RST switches can then open during an ensuing evaluation phase, thereby enabling the voltage values at the CLs to reflect the computed product.
  • FIG. 2 depicts an example enlarged view of a representative one of the 256 8-bit ADCs, which includes switch mechanisms ADCRST (Analog-to-Digital Converter Reset) and ADCSMP (Analog-to-Digital Converter Sample), and voltage designations VADCRST and VCMPR. In this example, VCMPR is connected to a positive terminal of the comparator CMPR, while VADCRST is selectively applied via the ADCRST and ADCSMP switches to reset the comparator.
  • the negative terminal of comparator CMPR receives a CL value when the circuit is activated.
  • An output of comparator CMPR is coupled to SAR logic for outputting an 8-bit digital result.
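The SAR logic noted above performs a binary search over the output code: each cycle it trials the next bit, the comparator checks the trial level against the sampled input, and the bit is kept or cleared accordingly. The following sketch models that loop in software; the function name `sar_adc` and the ideal, noise-free comparator are illustrative assumptions.

```python
def sar_adc(vin, vref, n_bits=8):
    """Successive-approximation ADC sketch: binary search against vref.

    Each cycle the SAR logic trials the next bit (MSB first), the
    comparator (CMPR in FIG. 2) compares the trial level against the
    sampled input, and the bit is kept or cleared accordingly.
    """
    code = 0
    for bit in range(n_bits - 1, -1, -1):
        trial = code | (1 << bit)                  # set the trial bit
        if vin >= (trial / (1 << n_bits)) * vref:  # comparator decision
            code = trial                           # keep the bit
    return code

assert sar_adc(0.0, 1.0) == 0      # zero input gives the minimum code
assert sar_adc(1.0, 1.0) == 255    # full scale clips to the maximum code
assert sar_adc(0.5, 1.0) == 128    # mid-scale input gives the mid code
```

An 8-bit conversion thus takes eight comparator decisions, one per bit of the output code.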
  • each DRD DAC 214j, in response to a respective 5-bit input-vector element Xj[4:0], generates a respective differential output signal (IAj/IAbj), which is subjected to a 1-bit multiplication with the stored weights (Aij/Abij) at each computing cell 202j in the corresponding row of computing cells 202, and accumulation through charge redistribution across the computing-cell capacitors CM-BC on the compute line (CL) to yield an inner product in each column, which is then digitized via the respective SAR ADCs 218 of each column as noted above.
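The per-column multiply-and-accumulate can be modeled numerically as follows. This is a behavioral sketch under idealized assumptions (equal unit capacitors, no parasitics, ideal charge sharing); the function name `column_inner_product` and the element values are illustrative, not figures from the disclosure.

```python
def column_inner_product(weight_bits, dac_levels, c_unit=1.2e-15):
    """Charge-redistribution sketch for one column (compute line CL).

    Each cell couples either IA (weight bit = 1) or IAb (weight bit = 0)
    onto its capacitor; shorting all capacitors onto the CL then averages
    the sampled voltages, so the CL voltage encodes the column's inner
    product. dac_levels: per-row (IA, IAb) voltage pairs from the DACs.
    """
    sampled = [ia if w else iab for w, (ia, iab) in zip(weight_bits, dac_levels)]
    total_charge = sum(c_unit * v for v in sampled)      # charge per cap
    v_cl = total_charge / (c_unit * len(sampled))        # sharing = mean
    return v_cl

# Two rows: the stored weight bit selects between each DAC's outputs.
v = column_inner_product([1, 0], [(0.8, 0.0), (0.6, 0.2)])
assert abs(v - 0.5) < 1e-12    # mean of 0.8 (IA selected) and 0.2 (IAb)
```

With equal unit capacitors the shared voltage is simply the mean of the sampled values, which is why capacitor matching (as with the metal-fringing capacitors described above) matters for linearity.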
  • FIG. 3 shows an example circuit diagram of the computing cells 202 discussed above in relation to FIG. 2.
  • the computing cells 202 can include a highly dense structure for achieving weight storage and multiplication, thereby minimizing data-broadcast distance and control signals within the context of i-row, j-column arrays implemented using such computing cells, such as the 1152 (row) X 256 (col.) CIM array 200 of 10T SRAM multiplying bit cells (M-BCs).
  • the exemplary computing cell 202 includes a six-transistor bit cell portion 222 (here, NMOS transistors 226a, 226b, 226e, and 226f and PMOS transistors 226c and 226d), a first switch SW1, a second switch SW2, a capacitor C, a word line (WL) 224, a first bit line (BLj) 227, a second bit line (BLbj) 228, and a compute line (CL) 230.
  • the six-transistor bit cell portion 222 is depicted as being located in a middle portion of the computing cells 202, and includes six transistors 226a-226f.
  • the six-transistor bit cell portion 222 can be used for storage, and to read and write data.
  • the 6-transistor bit cell portion 222 stores the filter weight.
  • data is written to the computing cells 202 through the word line (WL) 224, the first bit line (BL) 227, and the second bit line (BLb) 228.
  • the computing cells 202 can include a first CMOS switch SW1 and a second CMOS switch SW2.
  • the first switch SW1 is depicted as being controlled by a first stored signal Aij such that, when closed, the first switch SW1 couples one of the received differential output signals provided by the DRD DACs 214, illustratively IA, to a first terminal of the capacitor C.
  • the second switch SW2 is depicted as being controlled by a second stored signal Abij such that, when closed, the second switch SW2 couples the other one of the received differential output signals (IA/IAb) of the corresponding DRD DACs 214, illustratively IAb, to the first terminal of the capacitor C.
  • the second terminal of the capacitor C is connected to a compute line (CL) 230 via an output port 232 that provides a result of the computation of the computation cell 202.
  • the input signals provided to the first and second switches SW1 and SW2 can include a fixed voltage (e.g., Vdd), ground, or some other voltage level.
  • the computing cells 202 can implement computation on the data stored in the six-transistor bit cell portion 222.
  • the result of a computation is sampled as charge on the capacitor C.
  • the capacitor C can be positioned above the computing cell 202 and utilize no additional area on the circuit.
  • a logic value of either Vdd or ground is stored on the capacitor C.
  • the voltage stored on the capacitor C can include a positive or negative voltage in accordance with the operation of the first and the second switches SW 1 and SW2, and the output voltage level generated by the corresponding DRD DACs 214 as shown in FIG. 2.
  • the value that is stored on the capacitor C is highly stable, since the capacitor C value is either driven up to a fixed analog voltage or down to ground.
  • the capacitor C is a metal-oxide-metal (MOM) finger capacitor, and in some examples, the capacitor C can be about 0.1 femtofarads (fF) to about 10 fF, or can be about 1.2 fF.
  • MOM capacitors have very good matching, temperature, and process characteristics, and thus support highly linear and stable compute operations.
  • the six-transistor bit cell portion 222 can be implemented using different numbers of transistors and can have different architectures. In some examples, the six-transistor bit cell portion 222 can be an SRAM, DRAM, MRAM, or RRAM cell.
  • FIG. 4 shows an example compute in-memory (CIM) array 300.
  • the CIM array 300 can be used, for example, to implement the CIM array 118 or the CIM array 200 discussed above in relation to FIG. 1 and FIG. 2.
  • the CIM array 300 is an array of computing cells 302 that are arranged in a plurality of rows and a plurality of columns. In the example shown in FIG. 4, the CIM array 300 includes 576 rows of computing cells 302 and 128 columns of computing cells 302. The number of rows and the number of columns shown in FIG. 4 are only examples and can vary based on the implementation.
  • the CIM array 300 can be partitioned into a plurality of banks.
  • the CIM array 300 can be partitioned into a first bank of CIM array 304, a second bank of CIM array 306, a third bank of CIM array 308, and a fourth bank of CIM array 310. While the CIM array 300 shown in FIG. 4 is partitioned into four banks, it should be understood that the CIM array 300 can be partitioned into fewer or greater number of banks.
  • Each bank of CIM array can include a number of rows of the CIM array 300.
  • the first bank of CIM array 304 includes a first set of rows of computing cells 302: row-0 to row-143
  • the second bank of CIM array 306 includes a second set of rows of computing cells 302: row-144 to row-287
  • the third bank of CIM array 308 includes a third set of rows of computing cells 302: row-288 to row-431
  • the fourth bank of CIM array 310 includes a fourth set of rows of computing cells 302: row-432 to row-575.
  • the CIM array 300 is partitioned into four equal banks. That is, each of the four banks include the same number of rows of computing cells 302. In some other examples, one or more banks can include a number of rows of computing cells 302 that is different from the number of rows of computing cells 302 in another bank.
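Under the equal four-bank partition described above, a global row index maps to a bank and a row within that bank by simple integer division. A minimal sketch (the function name is illustrative):

```python
ROWS_PER_BANK = 144   # 576 rows / 4 equal banks, as in the FIG. 4 example

def bank_and_local_row(global_row):
    """Map a global row index (0-575) to (bank index, row within bank)."""
    if not 0 <= global_row < 4 * ROWS_PER_BANK:
        raise ValueError("row index out of range")
    return divmod(global_row, ROWS_PER_BANK)

assert bank_and_local_row(0) == (0, 0)       # first bank, row-0
assert bank_and_local_row(143) == (0, 143)   # last row of the first bank
assert bank_and_local_row(144) == (1, 0)     # first row of the second bank
assert bank_and_local_row(575) == (3, 143)   # last row of the fourth bank
```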
  • Each bank of the CIM array can have an associated row decoder.
  • a first row decoder can be coupled with the first bank of CIM array 304
  • a second row decoder can be coupled with the second bank of CIM array 306
  • a third row decoder can be associated with the third bank of CIM array 308,
  • a fourth row decoder can be coupled with the fourth bank of CIM array 310.
  • Each row decoder can receive a row address as an input and output an enable signal on an interconnect that corresponds to the row address.
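Functionally, each such row decoder is a one-hot demultiplexer: for any row address, exactly one enable signal (word line) is asserted. A behavioral sketch, with illustrative names:

```python
def row_decoder(row_address, n_rows=144):
    """One-hot row decoder sketch: assert only the addressed word line."""
    if not 0 <= row_address < n_rows:
        raise ValueError("address out of range")
    return [1 if i == row_address else 0 for i in range(n_rows)]

enables = row_decoder(2, n_rows=4)
assert enables == [0, 0, 1, 0]       # only WL[2] is enabled
assert sum(row_decoder(97)) == 1     # exactly one word line at a time
```

Because each bank has its own decoder, four such one-hot enables (one per bank) can be active at once across the array, which is what makes the parallel writes described below possible.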
  • the word lines WL can carry signals between the bank and the corresponding row decoder.
  • the word lines WL0[0] - WL0[143] can extend between the rows of computing cells 302 in the first bank of CIM array 304 and the first row decoder
  • word lines WL1[0] - WL1[143] can extend between the rows of computing cells 302 in the second bank of CIM array 306 and the second row decoder
  • word lines WL2[0] - WL2[143] can extend between the rows of computing cells 302 in the third bank of CIM array 308 and the third row decoder
  • word lines WL3[0] - WL3[143] can extend between the rows of computing cells 302 in the fourth bank of CIM array 310 and the fourth row decoder.
  • While not explicitly shown in FIG. 4, each word line associated with a row of computing cells 302 is coupled with each computing cell 302 within that row.
  • the word line WLi can be common to all computing cells 202 within the row. Enabling the word line WLi can allow read/write access to the six-transistor bit cell portion 222, but the bit cell could also have other structures and be based on other memory technologies besides SRAM. In this manner, when the row decoder associated with a row of computing cells 302 enables the corresponding word line WL, all the computing cells in that row can be written into or read from using the first bit line 227 and the second bit line 228.
  • the first row decoder can receive a row address that corresponds to row-0 of the first bank of the CIM array 304.
  • the first row decoder can be configured to enable the word line WL0[0], which enables all the computing cells 302 in row-0 to be written into or read from using their respective bit lines.
  • the row decoders associated with the various banks can be similar to the decoder 208 discussed above in relation to FIG. 2. However, unlike the decoder 208, which can enable any row in the entire CIM array 200, each row decoder associated with a bank can enable a row only in the associated bank. That is, the first row decoder may not enable a row in the second bank of CIM array 306.
  • Each row of computing cells 302 also receives a pair of differential output signals from the DRD DACs 214: a first differential signal IA and a second differential signal IAb.
  • the computing cells 202 can receive the pair of differential signals from the DRD DACs 214 which are selectively applied to one plate of the capacitor C during reset and evaluation phases based on the states of the first switch SW 1 and the second switch SW2.
  • the pair of differential signals IA/IAb are provided to each computing cell 302 in a row.
  • the pair of differential signals IA/IAb[0] are provided to each computing cell 302 in row-0
  • the pair of differential signals IA/IAb[1] are provided to each computing cell 302 in row-1, and so on.
  • a pair of differential signals IA/IAb are, for simplicity, represented as a single interconnect in FIG. 4.
  • the differential signals IA/IAb are only examples, and are specific to the computing cell 202 shown in FIG. 3. In other computing cell designs, only one signal may be needed instead of two.
  • the CIM array 300 may receive one or more current signals that correspond to the input data.
  • the in-memory computing architecture 100 also includes a plurality of bit-line pairs BL/BLb for each column of each bank.
  • the bit-line pairs can provide data to be written into the computing cell 302 but can also be used to read data from the computing cell 302.
  • bit-line pair BL0/BLb0[0] is coupled with computing cells 302 in column-0 of the first bank of the CIM array 304
  • bit-line pair BLl/BLbl[0] is coupled with computing cells 302 in column-0 of the second bank of CIM array 306
  • bit-line pair BL2/BLb2[0] is coupled with computing cells 302 in column-0 of the third bank of the CIM array 308
  • bit-line pair BL3/BLb3[0] is coupled with computing cells 302 in column-0 of the fourth bank of the CIM array 310.
  • For clarity, only a limited number of bit-line pairs have been labeled in FIG. 4.
  • the bit-line pairs are coupled with all computing cells 302 within a column of a bank. Specifically, with reference to the computing cell 202 shown in FIG. 3, the bit-line pairs provide data that is to be stored in the six-transistor bit cell portion 222.
  • the bit-line pairs can provide voltages that are representative of weights. If the weight is ‘1’, for example, one of the bit-line pair can provide a high voltage (e.g., Vdd) and the other of the bit-line pair can provide a low voltage (e.g., GND).
  • the bit-line pairs provide the appropriate voltages to be stored in the computing cell that is in a currently enabled row.
  • bit-line pairs can be utilized for both read and write.
  • one of the bit-line pairs can be a read interconnect and the other of the bit-line pair can be a write interconnect.
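The differential encoding of a weight bit on a BL/BLb pair described above reduces to a small truth table. In the sketch below, the polarity convention (BL high for a ‘1’) and the function name are illustrative assumptions:

```python
VDD, GND = 1.0, 0.0   # illustrative rail voltages

def drive_bit_lines(weight_bit):
    """Differential write sketch: encode one weight bit on a BL/BLb pair.

    Returns (BL voltage, BLb voltage); the two lines always carry
    complementary levels, as required by the differential bit cell.
    """
    return (VDD, GND) if weight_bit else (GND, VDD)

assert drive_bit_lines(1) == (1.0, 0.0)   # BL high, BLb low for a '1'
assert drive_bit_lines(0) == (0.0, 1.0)   # BL low, BLb high for a '0'
```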
  • bit-line pairs coupled with the first bank of CIM array 304 traverse across the second bank of CIM array 306, the third bank of CIM array 308, and the fourth bank of CIM array 310.
  • BL0/BLb0[0] which is coupled with the computing cells 302 in column-0 of the first bank of CIM array 304 traverses across the computing cells 302 in the same column (column-0) of the second bank of CIM array 306, the third bank of CIM array 308, and the fourth bank of CIM array 310.
  • bit-line pairs coupled with the second bank of CIM array 306 traverse across the third bank of CIM array 308 and the fourth bank of CIM array 310
  • bit-line pairs coupled with the third bank of CIM array 308 traverse across the fourth bank of CIM array 310.
  • the CIM array 300 can include a plurality of column interconnects.
  • the CIM array 300 can include one column interconnect for each column of the CIM array 300.
  • the CIM array 300 includes 128 column interconnects (CL[0] to CL[127]) respective to the 128 columns of computing cells 302.
  • Each column interconnect can be coupled with output ports of computing cells in the respective column of the CIM array 300.
  • the column interconnect CL[0] is coupled with all the computing cells 302 in the column-0 of each of the banks of the CIM array 300.
  • the column interconnects can carry a signal that is representative of the computations carried out by all the computing cells 302 in the respective column.
  • the computation result is stored in the capacitor C.
  • the charge on the column interconnect can be a function of the charges of all the capacitors C of all the computing cells 202 coupled with the column interconnect.
  • the in-memory computing architecture 100 can include a controller that can control the operation of the CIM array 300.
  • the SRAM control block 212 in relation to FIG. 2 can be coupled with the row decoders to control read/write operation of the banks. Such read/write operations can take higher-level control from the programming and control module 124, discussed above in relation to FIG. 1.
  • the in-memory computing architecture 100 can include a single programming and control module 124 that controls all the row decoders 208 through control blocks 212 in each CIMU 112.
  • the SRAM control block 212 can be coupled with and control the first row decoder, the second row decoder, the third row decoder and the fourth row decoder associated with the four banks of CIM array 300.
  • the SRAM control block 212 can be implemented using one or more microcontrollers, microprocessors, state machines, application specific integrated circuits, etc.
  • the partitioning of the CIM array 300 into a plurality of banks can increase the throughput of write operations to the CIM array 300. This is because the write operation to a row of one bank is independent of the write operation to a row in another bank.
  • In the CIM array 200 shown in FIG. 2, where no partitioning of the CIM array 200 is implemented and only a single decoder 208 is utilized for writing into the entire CIM array 200, only one row of the CIM array 200 can be enabled at any given time, such that the data to be written into the computing cells in that row can be provided on the BL/BLb bit-line pairs.
  • the CIM array 300 shown in FIG. 4 provides separate bit-line pairs for each bank. Therefore, rows of computing cells in multiple banks can be simultaneously enabled for writing.
  • FIG. 5 shows an example timing diagram 500 depicting the word line operation in various banks of the CIM array 300 shown in FIG. 4.
  • FIG. 5 shows that word lines of various banks of the CIM array 300 can be enabled simultaneously.
  • the controller can control the first row decoder, the second row decoder, the third row decoder, and the fourth row decoder to write data into any row of computing cells 302 in each of the banks simultaneously or in parallel.
  • the computing cells 302 in row-0 of the first bank of CIM array 304 can be written into using the first plurality of bit-line pairs BL0/BLb0[0] to BL0/BLb0[127], while in parallel writing data into the computing cells of row-0 of the second bank of CIM array 306 using the second plurality of bit-line pairs BL1/BLb1[0] to BL1/BLb1[127]. It should be noted that while FIG. 5 shows that the same row (row-0) in each of the banks is written into at the same time, this is only an example.
  • the controller and the row decoders associated with each bank can write into a row of computing cells 302 within a bank in parallel with writing into any row of computing cells 302 in another bank. That is, the controller and the row decoders can write into, for example, row-1 in the fourth bank while simultaneously writing into row-56 in the second bank. Further, the sequence with which the controller and the row decoders enable the rows of computing cells 302 (row-0, row-1, . . . , row-143) shown in FIG. 5 is only an example. It is not necessary that the rows of computing cells be enabled in increasing index order. Any row in a bank may follow any other row in the bank in the sequence of enabling the rows.
  • the simultaneous or parallel writing of data into the banks can allow the loading of data into the computing cells 302 of the CIM array 300 to be substantially faster. For example, if the CIM array 300 shown in FIG. 4 were not partitioned, it would need 576 cycles (one cycle for each of the 576 rows of computing cells 302) to write data into all the computing cells 302 of the CIM array 300. With the partitioning into four banks, it would need only 144 cycles (one-fourth the number of cycles) to write data into all the computing cells 302 of the CIM array 300.
  • The partitioning of the CIM array 300 can also help in increasing the write frequency of the CIM array 300. Write frequency can refer to the frequency at which the row decoder can advance from one row to the next.
  • Referring to FIG. 5, the write period can refer to the duration between t1 and t2, and the write frequency can be the reciprocal of the write period.
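The cycle-count benefit of banked, parallel writes can be expressed directly: when one row per bank can be written each cycle, the total load time is set by the largest bank. A sketch with an illustrative function name:

```python
def write_cycles(total_rows, n_banks):
    """Cycles needed to load every row, writing one row per bank
    per cycle; the banks proceed in parallel, so the cost is the
    (ceiling of the) row count of the largest equal bank."""
    return -(-total_rows // n_banks)   # ceiling division

assert write_cycles(576, 1) == 576    # unpartitioned: one row per cycle
assert write_cycles(576, 4) == 144    # four banks written in parallel
```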
  • the write period or write duration is a function, in part, of the time it takes for voltages on the bit-line pairs to reach the desired values. Referring, for example, to FIG. 2, each bit-line pair BL/BLb is driven by a BL driver 206.
  • the BL driver 206 can drive the bit-line pairs BL/BLb to a high (e.g., Vdd) or a low (e.g., GND) voltage.
  • the time it takes for the BL driver 206 to drive the bit-line pairs to the desired voltage can be a function, in part, of the electric load (e.g., capacitive, resistive, and inductive load) on the bit-lines. The higher the loading, the greater the time needed to drive the bit-line to its desired voltage.
  • the length of the bit-line is one contributor to the loading of the bit-line. Another contributor is the number of devices coupled with the bit-line.
  • each bit-line is coupled with at least one transistor 226a of the six-transistor bit cell portion 222.
  • the terminal capacitance (e.g., the source terminal capacitance) of the transistor 226a can contribute to the loading of the bit-line.
  • Each bit-line in the example shown in FIG. 2 can be coupled with the transistor 226a of each row (e.g., 1152 rows) of the CIM array 200. Therefore, the terminal capacitance of each of the transistor 226a can further contribute to the loading of each bit-line.
  • Partitioning of the CIM array 300 into banks can help reduce the loading of the bit-line and therefore reduce the time needed to drive the bit-line to the desired voltage. Partitioning the CIM array 300 into banks limits the number of devices that are coupled with each bit-line, as each bit-line in the CIM array 300 is coupled with only the computing cells within that bank. For example, each bit-line in the bit-line pair BL0/BLb0[0] is coupled with the computing cells 302 in column-0 of the first bank of CIM array 304 only, which computing cells 302 are one-fourth of the total number of computing cells 302 in the column-0 of the entire CIM array 300.
  • This reduction in the number of computing cells coupled with the bit-lines reduces the loading on the bit-line, and therefore reduces the amount of time needed by the BL driver to charge/discharge the bit-line to the desired voltage level.
  • the reduced loading also reduces energy consumption by reducing switching losses.
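A first-order model illustrates how cutting the number of attached cells shortens bit-line settling. All element values below (unit cell capacitance, wire capacitance, driver resistance) and the function name are illustrative assumptions, not figures from the disclosure:

```python
def bitline_drive_time(n_cells, c_cell=0.5e-15, c_wire=20e-15, r_drv=5e3):
    """First-order RC sketch of bit-line settling time.

    Total bit-line capacitance = wire capacitance plus one access-
    transistor terminal capacitance per attached cell; settling is
    taken as ~3 RC time constants of the driver (about 95% settled).
    """
    c_total = c_wire + n_cells * c_cell
    return 3.0 * r_drv * c_total

t_full = bitline_drive_time(576)    # all cells in a column attached
t_bank = bitline_drive_time(144)    # one bank's worth of cells attached
assert t_bank < t_full              # fewer attached cells settle faster
```

Under this model, banking reduces both the settling time and, since switched charge scales with the same capacitance, the per-write switching energy.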
  • repeaters or buffers can be positioned in the paths of one or more bit-line pairs. These repeaters or buffers can help alleviate the resistor-capacitor (RC) delay caused by the long length of some of the bit-lines.
  • the bit-line pairs for the first bank of CIM array 304 traverse the second, third and the fourth bank of CIM array 300
  • the bit-line pairs for the second bank of CIM array 306 traverse the third and the fourth banks of CIM array 300
  • the bit-line pairs for the third bank of CIM array 308 traverse the fourth bank of CIM array 310.
  • bit-line pairs BL and BLb can be used to write data into the six-transistor bit cell portion 222 or read data from the six- transistor bit cell portion 222 of the computing cell 202.
  • the bit-line pairs BL and BLb carry differential or complementary signals.
  • if the bit-line pairs BL and BLb were employed only to read or only to write, then unidirectional repeaters or buffers could be placed on each bit-line.
  • the unidirectional repeaters or buffers would not allow bidirectional read and write functionality of the bit-lines.
  • the bit-line pairs can be viewed as having a read interconnect and a write interconnect, where the read interconnect carries only read signals from the computing cells and the write interconnect carries only data to be written into the computing cells.
  • the read interconnect and the write interconnect are inherently unidirectional, repeaters or buffers can be positioned on these interconnects in the opposite directions.
  • FIG. 6 shows an example CIM array 600 including repeaters for bit-line pairs.
  • the CIM array 600 is similar to the CIM array 300 discussed above in relation to FIG. 4, in that the CIM array 600 also includes a plurality of banks of rows of computing cells. However, the CIM array 600 includes bit-line pairs, each of which include a read interconnect and a write interconnect, which enables the positioning of unidirectional repeaters or buffers on the bit-lines.
  • FIG. 6 shows a portion of the CIM array 600 that includes computing cells 302 in column-0.
  • the CIM array 600 includes a first bit-line pair (BL0/BLb0[0]) coupled with the computing cells 302 in the first bank of CIM array 304.
  • It should be noted that while FIG. 6 depicts only the first bit-line pair, the CIM array 600 can include a first plurality of bit-line pairs corresponding to other columns of the CIM array 600, where the remainder of the first plurality of bit-line pairs are coupled with computing cells 302 in respective columns of the first bank of CIM array 304.
  • Each bit-line pair of the first plurality of bit-line pairs, and in particular the first bit-line pair (BL0/BLb0[0]) includes a first read interconnect 602 and a first write interconnect 604.
  • the first read interconnect 602 and the first write interconnect 604 are coupled with the computing cells 302 in the first bank of CIM array 304, albeit indirectly via a series of repeaters, a differential circuit, and a sense amplifier.
  • the number of repeater circuits positioned on the first read interconnect 602 or first write interconnect 604 can be different from that shown in FIG. 6. That is, while FIG. 6 shows three repeater circuits on each of the first read interconnect 602 and the first write interconnect 604, this number is only an example, and that additional or fewer repeater circuits can be employed.
  • the first plurality of bit-line pairs including the first bit-line pair (BL0/BLb0[0]), traverse across the fourth bank of CIM array 310, the third bank of CIM array 308, and the second bank of CIM array 306.
  • the first plurality of bit-line pairs are not coupled with the computing cells 302 in the traversed banks.
  • the first read interconnect 602 and the first write interconnect 604 shown in FIG. 6 traverse the fourth, third, and second banks of the CIM array 600 and are not coupled with any computing cells 302 within these banks.
  • a plurality of repeater circuits can be positioned on the first plurality of bit-line pairs.
  • a first read repeater circuit 606 is positioned on the first read interconnect 602 between the second bank of CIM array 306 and the third bank of CIM array 308, a second read repeater circuit 610 is positioned on the first read interconnect 602 between the third bank of CIM array 308 and the fourth bank of CIM array 310, and a third read repeater circuit 614 is positioned after the fourth bank of CIM array 310 (the read repeater circuits can also be referred to as unidirectional read repeaters).
  • a first write repeater circuit 616 is positioned on the first write interconnect 604 on one side of the fourth bank of CIM array 310
  • a second write repeater circuit 612 is positioned on the first write interconnect 604 between the fourth bank of CIM array 310 and the third bank of CIM array 308
  • a third write repeater circuit 608 is positioned on the first write interconnect 604 between the third bank of CIM array 308 and the second bank of CIM array 306 (the write repeater circuits can also be referred to as unidirectional write repeaters).
  • a first bank differential driver 660 is coupled with the first bit-line pair (the first read interconnect 602 and the first write interconnect 604) and can be used to convert the signals between the first bit-line pair and the computing cells 302.
  • the first bank differential driver 660 is positioned between the first bank of CIM array 304 and the second bank of CIM array 306.
  • the computing cells 302 are coupled with BL and BLb differential or complementary bit-lines.
  • a sense amplifier 618 receives the BL and BLb bit-lines as input and generates a read signal that is provided to the first read interconnect 602.
  • the sense amplifier 618 can be used to detect a very small difference in voltage at the bit-lines BL and BLb and amplify the difference to its full voltage swing (e.g., Vdd/GND) to identify the data value stored in the six-transistor bit cell portion 222 of the computing cells 302.
  • the sense amplifier 618 can be similar to sense amplifiers used in static random-access memories (SRAMs). These sense amplifiers can include, for example, voltage-mode sense amplifiers or current-mode sense amplifiers.
  • the first write interconnect 604 is coupled with a differential driver that converts the non-differential signal on the first write interconnect 604 into differential or complementary voltages that are then provided to the BL and BLb differential bit-lines.
  • at a minimum, the differential driver can include an inverter 620 coupled with the first write interconnect 604. The output of the inverter 620 is fed to one of the pair (BL/BLb) of differential bit-lines. The other of the pair of differential bit-lines receives the signal on the first write interconnect 604.
  • additional buffers 622 can be included to strengthen the signal provided to the BL and BLb differential bit-lines and to equalize timing delays.
  • a write signal on the first write interconnect 604 is fed to the BLb bit-line, while an inverted signal from the inverter 620 is fed to the BL bit-line, thus providing a differential signal to the bit-lines BL and BLb.
  • the BL and BLb buffers 622 can be tristate drivers. That is, the outputs of the buffers 622 can be pulled to a high impedance state, which can be useful for instance during read phases, where data is instead driven on the bit lines by the computing cells 302. During write phases, on the other hand, the buffers 622 can be operated to output high or low voltage signals based on the data to be written.
  • the buffers 622 can include control inputs that can receive signals which can enable/disable the high- impedance state at the outputs of the buffers 622.
  • the control signals can be received, for example, from the controller that can control the operation of the CIM array.
  • the controller can be the SRAM control block discussed above in relation to the CIM array 200 shown in FIG. 2.
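The write path described above — the inverter 620 producing the complementary bit-line value and the tristate buffers 622 going high-impedance during read phases — can be sketched as a hedged behavioral model. The function name and the enable convention are assumptions for illustration, not taken from the patent.

```python
# Behavioral sketch of the differential write driver and tristate buffers:
# during write phases the buffers drive BL/BLb with complementary values;
# during read phases they present high impedance ('Z') so the computing
# cells 302 can drive the bit lines instead.

def drive_bit_lines(write_signal: int, write_enable: bool):
    """Return the (BL, BLb) values driven by the tristate buffers 622."""
    if not write_enable:
        return "Z", "Z"        # read phase: outputs at high impedance
    bl = 1 - write_signal      # inverter 620 output feeds BL
    blb = write_signal         # non-inverted write signal feeds BLb
    return bl, blb

print(drive_bit_lines(1, write_enable=True))   # (0, 1)
print(drive_bit_lines(0, write_enable=True))   # (1, 0)
print(drive_bit_lines(1, write_enable=False))  # ('Z', 'Z')
```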
  • the CIM array 600 includes a second bit-line pair (BLl/BLbl[0]) coupled with the computing cells 302 in the second bank of CIM array 306. Similar to the first bit-line pair (BL0/BLb0[0]), the second bit-line pair (BLl/BLbl[0]) includes a second read interconnect 624 and a second write interconnect 626. Repeater circuits can be positioned on the second read interconnect 624 and the second write interconnect 626.
  • a first read repeater circuit 628 is positioned between the third bank of CIM array 308 and the fourth bank of CIM array 310, and a second read repeater circuit 630 is positioned on the other side of the fourth bank of CIM array 310.
  • the number and positions of the repeater circuits shown in FIG. 6 are only examples; fewer or additional repeater circuits could be used, and they can be positioned differently than shown in FIG. 6.
  • a second bank differential driver 662 is coupled with the second bit-line pair (including the second read interconnect 624 and the second write interconnect 626) and positioned between the second bank of CIM array 306 and the third bank of CIM array 308.
  • the second bank differential driver 662 includes a sense amplifier 636 that converts the differential signal on the BL and BLb bit-lines coupled with the computing cells 302 into a non-differential signal and provides the non-differential signal on the second read interconnect 624.
  • the second bank differential driver 662 also includes at least an inverter 638 that inverts the non-differential signal on the second write interconnect 626 and provides the inverted signal to one of the bit-lines BL and BLb.
  • the other of the bit-lines BL and BLb receives the non-inverted signal on the second write interconnect 626.
  • One or more buffers 622 can be included to buffer the inverted signal and the non-inverted signal to strengthen the signals and to correct for any timing mismatch, and such buffers can be tristate drivers to enable bidirectional data to/from the compute cells on the bit lines.
  • the CIM array 600 includes a third bit-line pair (BL2/BLb2[0]) coupled with the computing cells 302 in the third bank of CIM array 308.
  • the third bit-line pair (BL2/BLb2[0]) can include a third read interconnect 640 and a third write interconnect 642.
  • a read repeater circuit 644 can be positioned on the third read interconnect 640 on one side of the fourth bank of CIM array 310, and a write repeater circuit 646 can be positioned on the third write interconnect 642 on the same side of the fourth bank of CIM array 310.
  • in some aspects, the third bit-line pair (BL2/BLb2[0]) can include fewer or additional repeater circuits than shown in FIG. 6.
  • a third bank differential driver 664 interfaces between the third read interconnect 640/third write interconnect 642 and the differential bit-line pairs BL/BLb of the computing cells 302 of the third bank of CIM array 308.
  • the third bank differential driver 664 can include at least a sense amplifier 648 and an inverter 650.
  • the sense amplifier 648 converts the differential signal on the bit-lines BL and BLb into a non-differential signal that is provided to the third read interconnect 640.
  • the inverter 650 can invert the non-differential signal on the third write interconnect 642 and provide the inverted signal to one of the bit-line pair BL/BLb and the non-differential signal on the third write interconnect 642 is provided to the other of the bit-line pair BL/BLb.
  • the third bank differential driver 664 can also include additional buffers 622 to buffer the inverted signal and the non-inverted signal provided to the bit-lines BL/BLb to strengthen the signals and to correct for any timing mismatch.
  • the CIM array 600 includes a fourth bit-line pair (BL3/BLb3[0]) coupled with the computing cells 302 in the fourth bank of CIM array 310.
  • the fourth bit-line pair (BL3/BLb3[0]) can include a fourth read interconnect 652 and a fourth write interconnect 654.
  • the fourth bank differential driver 666 can include at least an inverter 656 that can provide an inverted signal of the non-differential signal on the fourth write interconnect 654 and can include a sense amplifier 658, which can convert the differential signal on the bit-line pair BL and BLb into a non-differential signal to be provided to the fourth read interconnect 652.
  • the unidirectional repeater circuits discussed above can include buffer circuits such as, for example, an even number of inverters coupled in series.
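The repeater structure described above — an even number of series inverters — can be modeled with a short sketch. The function name and default count are illustrative assumptions; the point is that an even inverter count restores drive strength while preserving the logic value.

```python
# Hedged sketch of a unidirectional repeater: an even number of inverters
# in series leaves the logic value unchanged while regenerating the signal.

def repeater(signal: int, n_inverters: int = 2) -> int:
    """Pass `signal` through a chain of inverters (must be even)."""
    assert n_inverters % 2 == 0, "buffer repeaters use an even inverter count"
    for _ in range(n_inverters):
        signal = 1 - signal  # each inverter flips the logic value
    return signal

print(repeater(1))                  # 1: value preserved after 2 inversions
print(repeater(0, n_inverters=4))   # 0: value preserved after 4 inversions
```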
  • FIG. 7 shows a portion 700 of the CIM array 300 including word line repeaters.
  • FIG. 7 shows only the first bank of CIM array 304 for simplicity. Aspects discussed in relation to the first bank of CIM array 304 can be applied equally to other banks of the CIM array 300.
  • a plurality of word-line interconnect repeaters 706 can be positioned between a first sub-bank 702 and a second sub-bank 704 of the first bank of CIM array 304. While not shown in FIG. 7, the first bank of CIM array 304 can include additional sub-banks, and word-line interconnect repeaters, similar to the plurality of word-line interconnect repeaters 706 can be positioned between those sub-banks.
  • the first sub-bank 702 can include a first plurality of columns of computing cells 302 within the first bank of CIM array 304
  • the second sub-bank 704 can include a second plurality of columns of computing cells 302 within the first bank of CIM array 304.
  • the plurality of word line interconnects WL0[0] to WL0[143] generally run across each row of the computing cells 302 from a first column to the last column of the first bank of CIM array 304.
  • the length of each of the word line interconnects can be long enough to undesirably load the word-line drivers (e.g., the decoder 208 shown in FIG. 2). This loading can increase the delay of the word line signal carried by the word line interconnects.
  • the plurality of word-line interconnect repeaters 706 are positioned on the word line interconnects between the first sub-bank 702 and the second sub-bank 704.
  • the plurality of word-line interconnect repeaters 706 can improve the strength of the signal on the word line interconnects in the second sub-bank 704 and beyond.
  • the number of columns in each sub-bank can be equal. In some other examples, the number of columns in each sub-bank can be unequal.
  • the word-line interconnect repeaters can include buffers including, for example, an even number of inverters. In some examples, multiple word-line interconnect repeaters can be positioned on a word line interconnect.
  • each of the word line interconnect repeaters on a single word line can be of the same size.
  • size can refer to sizes of the transistors utilized for implementing the repeaters. Larger sized transistors can provide faster repeaters.
  • at least one word line interconnect repeater on a word line can have a size that is different from another word line interconnect repeater on the same word line.
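The motivation for the word-line repeaters can be made concrete with a back-of-the-envelope delay model: the Elmore delay of a distributed RC wire grows quadratically with length, so splitting the line between sub-banks replaces one long quadratic term with two short ones plus a fixed repeater delay. All component values below are illustrative assumptions, not figures from the patent.

```python
# Why splitting a long word line with a repeater reduces delay:
# distributed RC delay scales with length squared.

R_PER_COL = 2.0      # wire resistance per column, ohms (assumed)
C_PER_COL = 0.2e-15  # wire plus gate load per column, farads (assumed)

def wire_delay(n_cols: int) -> float:
    """Elmore delay of a distributed RC word line spanning n_cols columns."""
    return 0.5 * (R_PER_COL * n_cols) * (C_PER_COL * n_cols)

def repeated_delay(n_cols: int, t_repeater: float = 1e-12) -> float:
    """Delay when one repeater splits the word line into two sub-banks."""
    half = n_cols // 2
    return wire_delay(half) + t_repeater + wire_delay(n_cols - half)

# For a 144-column bank (WL0[0]..WL0[143]), the repeated line is faster
# once the fixed repeater delay is smaller than the quadratic saving:
assert repeated_delay(144) < wire_delay(144)
```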
  • Aspect 1 An in-memory computing architecture including: a compute-in-memory (CIM) array of computing cells, the CIM array including a plurality of rows and a plurality of columns, each computing cell including a memory cell and an output port for providing a result of computation; a first bank of CIM array, from a plurality of banks of the CIM array, including a first set of rows of computing cells; a first row decoder coupled with the first bank of CIM array, the first row decoder configured to, one at a time, enable any of the first set of rows of computing cells to write data in respective memory cells; a second bank of CIM array, from the plurality of banks of the CIM array, including a second set of rows of computing cells; a second row decoder coupled with the second bank of CIM array, the second row decoder configured to enable, one at a time, any of the second set of rows of computing cells to write data in respective memory cells; a controller configured to control the first row decoder and the second row decoder to write data into any of the first set of rows of computing cells simultaneously with writing data into any of the second set of rows of computing cells; and a plurality of column interconnects, each column interconnect of the plurality of column interconnects coupled with output ports of computing cells in the respective column of the CIM array.
  • Aspect 2 The in-memory computing architecture of any one of Aspects 1- 19, further including: a first plurality of bit-line pairs, each bit-line pair including a first read interconnect and a first write interconnect, wherein each bit-line pair of the first plurality of bit-line pairs is coupled with computing cells in respective columns of the first bank of CIM array; and a second plurality of bit-line pairs, each bit-line pair including a second read interconnect and a write interconnect, wherein each bit-line pair of the second plurality of bit-line pairs is coupled with computing cells in respective columns of the second bank of CIM array.
  • Aspect 3 The in-memory computing architecture of any one of Aspects 1-19, further including a third bank of CIM array, the second bank of CIM array positioned between the first bank of CIM array and the third bank of CIM array, wherein the first plurality of bit-line pairs traverse across the second bank of CIM array and the third bank of CIM array and are coupled with the computing cells in the respective columns of the first bank of CIM array.
  • Aspect 4 The in-memory computing architecture of any one of Aspects 1- 19, further including: a plurality of repeater circuits positioned on the first plurality of bit- line pairs between the second bank of CIM array and the third bank of CIM array, wherein each plurality of repeater circuits includes a unidirectional read repeater and a unidirectional write repeater coupled with a first read interconnect and a first write interconnect, respectively, of one of the first plurality of bit-line pairs.
  • Aspect 5 The in-memory computing architecture of any one of Aspects 1- 19, further including: a plurality of first bank differential drivers coupled with the first plurality of bit-line pairs, each first bank differential driver of the plurality of first bank differential drivers including: a sense amplifier that receives differential signals from a respective computing cell and outputs a non-differential signal to the respective first read interconnect, and a differential driver that receives a non-differential signal from the respective first write interconnect and outputs a differential signal to the respective computing cell.
  • Aspect 6 The in-memory computing architecture of any one of Aspects 1- 19, wherein the differential driver includes an inverter for inverting the non-differential signal received from the respective first write interconnect, wherein the differential signal includes an output of the inverter.
  • Aspect 7 The in-memory computing architecture of any one of Aspects 1-19, wherein the differential driver includes at least one buffer in path of the differential signal.
  • Aspect 8 The in-memory computing architecture of any one of Aspects 1- 19, further including: a first set of word-line interconnects corresponding to the first set of rows of computing cells of the first bank of CIM array, wherein each word-line interconnect of the first set of word-line interconnects is coupled with computing cells in a respective row of computing cells, and a plurality of word-line interconnect repeaters positioned on the first set of word-line interconnects.
  • Aspect 9 The in-memory computing architecture of any one of Aspects 1- 19, wherein the first bank of CIM arrays includes at least a first sub-bank and a second sub-bank where the first sub-bank includes a first set of columns of computing cells and the second sub-bank includes a second set of columns of computing cells, wherein the plurality of word-line interconnect repeaters are positioned between the first sub-bank and the second sub-bank.
  • Aspect 10 The in-memory computing architecture of any one of Aspects 1- 19, further including a plurality of analog to digital converters (ADCs) coupled with one or more of the plurality of column interconnects, the ADCs configured to convert analog voltages on the one or more of the plurality of column interconnects into a digital value.
  • Aspect 11 The in-memory computing architecture of any one of Aspects 1- 19, further including: a plurality of compute in-memory units, each compute in-memory unit including the CIM array, and an on-chip network providing communication between the plurality of compute in-memory units.
  • Aspect 12 The in-memory computing architecture of any one of Aspects 1-19, further including: a plurality of repeater circuits positioned on the first plurality of bit-line pairs between the second bank of CIM array and the third bank of CIM array, wherein each plurality of repeater circuits includes a unidirectional read repeater and a unidirectional write repeater coupled with a first read interconnect and a first write interconnect, respectively, of one of the first plurality of bit-line pairs.
  • Aspect 13 The in-memory computing architecture of any one of Aspects 1-19, further including: a plurality of first bank differential drivers coupled with the first plurality of bit-line pairs, each first bank differential driver of the plurality of first bank differential drivers including: a sense amplifier that receives differential signals from a respective computing cell and outputs a non-differential signal to the respective first read interconnect, and a differential driver that receives a non-differential signal from the respective first write interconnect and outputs a differential signal to the respective computing cell.
  • Aspect 14 The in-memory computing architecture of any one of Aspects 1- 19, wherein the differential driver includes an inverter for inverting the non-differential signal received from the respective first write interconnect, wherein the differential signal includes an output of the inverter.
  • Aspect 15 The in-memory computing architecture of any one of Aspects 1- 19, wherein the differential driver includes at least one buffer in path of the differential signal.
  • Aspect 16 The in-memory computing architecture of any one of Aspects 1- 19, further including: a first set of word-line interconnects corresponding to the first set of rows of computing cells of the first bank of CIM array, wherein each word-line interconnect of the first set of word-line interconnects is coupled with computing cells in a respective row of computing cells, and a plurality of word-line interconnect repeaters positioned on the first set of word-line interconnects.
  • Aspect 17 The in-memory computing architecture of any one of Aspects 1- 19, wherein the first bank of CIM arrays includes at least a first sub-bank and a second sub-bank where the first sub-bank includes a first set of columns of computing cells and the second sub-bank includes a second set of columns of computing cells, wherein the plurality of word-line interconnect repeaters are positioned between the first sub-bank and the second sub-bank.
  • Aspect 18 The in-memory computing architecture of any one of Aspects 1- 19, further including a plurality of analog to digital converters (ADCs) coupled with one or more of the plurality of column interconnects, the ADCs configured to convert analog voltages on the one or more of the plurality of column interconnects into a digital value.
  • Aspect 19 The in-memory computing architecture of any one of Aspects 1- 18, further including: a plurality of compute in-memory units, each compute in-memory unit including the CIM array, and an on-chip network providing communication between the plurality of compute in-memory units.
  • Aspect 20 A method for in-memory computing using a compute-in-memory (CIM) architecture including: a CIM array of computing cells including a plurality of rows and a plurality of columns, each computing cell including a memory cell; first and second banks of the CIM array, from a plurality of banks of the CIM array, respectively including first and second sets of rows of computing cells; first and second row decoders respectively coupled with the first and second banks of the CIM array; and a plurality of column interconnects, each column interconnect of the plurality of column interconnects coupled with output ports of computing cells in a respective column of the CIM array, the method including: enabling, by the first row decoder, any of the first set of rows of computing cells to write, one at a time, data in respective memory cells; enabling, by the second row decoder, any of the second set of rows of computing cells to write, one at a time, data in respective memory cells; controlling, by a controller, the first row decoder and the second row decoder to write data into any of the first set of rows of computing cells simultaneously with writing data into any of the second set of rows of computing cells; and providing a result of computation at an output port of one or more of the computing cells from the CIM array of computing cells.
  • Aspect 21 The method of any one of Aspects 20-22, wherein the CIM architecture further includes: a first plurality of bit-line pairs, each bit-line pair including a first read interconnect and a first write interconnect, wherein each bit-line pair of the first plurality of bit-line pairs is coupled with computing cells in respective columns of the first bank of CIM array; and a second plurality of bit-line pairs, each bit-line pair including a second read interconnect and a write interconnect, wherein each bit-line pair of the second plurality of bit-line pairs is coupled with computing cells in respective columns of the second bank of CIM array.
  • Aspect 22 The method of any one of Aspects 20-21, wherein the CIM architecture further includes: a third bank of CIM array, the second bank of CIM array positioned between the first bank of CIM array and the third bank of CIM array, wherein the first plurality of bit-line pairs traverse across the second bank of CIM array and the third bank of CIM array and are coupled with the computing cells in the respective columns of the first bank of CIM array.

Landscapes

  • Engineering & Computer Science (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Dram (AREA)
  • Static Random-Access Memory (AREA)

Abstract

Aspects of a device include a compute-in-memory (CIM) array of computing cells. The CIM array includes a plurality of rows and a plurality of columns. Each computing cell includes a memory cell and an output port for providing a computation result. Aspects of the device include a first bank of the CIM array including a first set of rows of computing cells and a first row decoder coupled with the first bank. Aspects of the device include a second bank of the CIM array including a second set of rows of computing cells and a second row decoder coupled with the second bank. A controller can control the first and second row decoders to write data into any of the first set of rows of computing cells simultaneously with any of the second set of rows of computing cells.

Description

SYSTEMS AND METHODS FOR HIGH-THROUGHPUT DATA OPERATIONS IN IN-MEMORY COMPUTING ARRAYS
CROSS REFERENCE TO RELATED APPLICATIONS
This Application claims priority to and the benefit of United States Provisional Application Number 63/606,020, filed December 4, 2023.
TECHNICAL FIELD
[0001] This disclosure relates to in-memory computing arrays, and in particular to data operations in in-memory computing arrays.
DESCRIPTION OF THE RELATED TECHNOLOGY
[0002] Using in-memory computing for neural network acceleration is an emerging and innovative approach that leverages the unique properties of memory devices to enhance the speed and efficiency of neural network computations. Traditional neural network training and inference processes involve moving data back and forth between memory (RAM) and processing units (CPUs or GPUs), which can be a significant bottleneck in terms of speed and energy consumption. In-memory computing seeks to overcome these limitations by processing data directly within the memory itself.
SUMMARY
[0003] In some aspects, the techniques described herein relate to an in-memory computing architecture, including: a compute-in-memory (CIM) array of computing cells, the CIM array including a plurality of rows and a plurality of columns, each computing cell including a memory cell and an output port for providing a result of computation, a first bank of CIM array, from a plurality of banks of the CIM array, including a first set of rows of computing cells; a first row decoder coupled with the first bank of CIM array, the first row decoder configured to, one at a time, enable any of the first set of rows of computing cells to write data in respective memory cells; a second bank of CIM array, from the plurality of banks of the CIM array, including a second set of rows of computing cells; a second row decoder coupled with the second bank of CIM array; the second row decoder configured to enable, one at a time, any of the second set of rows of computing cells to write data in respective memory cells; a controller configured to: control the first row decoder and the second row decoder to write data into any of the first set of rows of computing cells simultaneously with writing data into any of the second set of rows of computing cells; and a plurality of column interconnects, each column interconnect of the plurality of column interconnects coupled with output ports of computing cells in the respective column of the CIM array.
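The dual-row-decoder write scheme of this architecture can be illustrated with a short behavioral model: each bank has its own row decoder, so one row per bank can be written in the same cycle. The class and function names below are assumptions for the sketch, not structures named in the patent.

```python
# Hedged model of per-bank row decoders enabling simultaneous writes:
# the controller issues one (row, data) write to each bank in one cycle.

class Bank:
    """One bank of the CIM array: a grid of memory cells."""
    def __init__(self, rows: int, cols: int):
        self.cells = [[0] * cols for _ in range(rows)]

    def decode_and_write(self, row: int, data: list[int]) -> None:
        """The bank's row decoder enables exactly one row for the write."""
        self.cells[row] = list(data)

def simultaneous_write(banks, writes):
    """Controller: one row per bank is written in the same cycle."""
    for bank, (row, data) in zip(banks, writes):
        bank.decode_and_write(row, data)

banks = [Bank(rows=4, cols=8), Bank(rows=4, cols=8)]
simultaneous_write(banks, [(0, [1] * 8), (2, [1, 0] * 4)])
print(banks[0].cells[0])  # [1, 1, 1, 1, 1, 1, 1, 1]
print(banks[1].cells[2])  # [1, 0, 1, 0, 1, 0, 1, 0]
```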
[0004] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including: a first plurality of bit-line pairs, each bit-line pair including a first read interconnect and a first write interconnect, wherein each bit- line pair of the first plurality of bit-line pairs is coupled with computing cells in respective columns of the first bank of CIM array; and a second plurality of bit-line pairs, each bit- line pair including a second read interconnect and a write interconnect, wherein each bit- line pair of the second plurality of bit-line pairs is coupled with computing cells in respective columns of the second bank of CIM array.
[0005] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including a third bank of CIM array, the second bank of CIM array positioned between the first bank of CIM array and the third bank of CIM array, wherein the first plurality of bit-line pairs traverse across the second bank of CIM array and the third bank of CIM array and are coupled with the computing cells in the respective columns of the first bank of CIM array.
[0006] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including: a plurality of repeater circuits positioned on the first plurality of bit-line pairs between the second bank of CIM array and the third bank of CIM array, wherein each plurality of repeater circuits includes a unidirectional read repeater and a unidirectional write repeater coupled with a first read interconnect and a first write interconnect, respectively, of one of the first plurality of bit-line pairs.
[0007] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including: a plurality of first bank differential drivers coupled with the first plurality of bit-lines, each first bank differential driver of the plurality of first bank differential drivers including: a sense amplifier that receives differential signals from a respective computing cell and outputs a non-differential signal to the respective first read interconnect, and a differential driver that receives a nondifferential signal from the respective first write interconnect and outputs a differential signal to the respective computing cell.
[0008] In some aspects, the techniques described herein relate to an in-memory computing architecture, wherein the differential driver includes an inverter for inverting the non-differential signal received from the respective first write interconnect, wherein the differential signal includes an output of the inverter.
[0009] In some aspects, the techniques described herein relate to an in-memory computing architecture, wherein the differential driver includes at least one buffer in path of the differential signal.
[0010] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including: a first set of word-line interconnects corresponding to the first set of rows of computing cells of the first bank of CIM array, wherein each word-line interconnect of the first set of word-line interconnects is coupled with computing cells in a respective row of computing cells, and a plurality of word-line interconnect repeaters positioned on the first set of word-line interconnects.
[0011] In some aspects, the techniques described herein relate to an in-memory computing architecture, wherein the first bank of CIM arrays includes at least a first subbank and a second sub-bank where the first sub-bank includes a first set of columns of computing cells and the second sub-bank includes a second set of columns of computing cells, wherein the plurality of word-line interconnect repeaters are positioned between the first sub-bank and the second sub-bank.
[0012] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including a plurality of analog to digital converters (ADCs) coupled with one or more of the plurality of column interconnects, the ADCs configured to convert analog voltages on the one or more of the plurality of column interconnects into a digital value.
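The column ADCs described in this paragraph can be sketched as a simple uniform quantizer that maps an analog column voltage to a digital code. The resolution, reference voltage, and function name are illustrative assumptions, not parameters from the patent.

```python
# Hedged sketch of a column ADC: uniform quantization of an analog
# column-interconnect voltage into an N-bit digital value.

def adc(v_in: float, v_ref: float = 0.8, bits: int = 8) -> int:
    """Quantize v_in over [0, v_ref) to a `bits`-wide code, with clamping."""
    levels = 2 ** bits
    code = int(v_in / v_ref * levels)
    return max(0, min(levels - 1, code))

print(adc(0.4))   # mid-scale input -> 128
print(adc(0.79))  # near full scale -> 252
```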
[0013] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including: a plurality of compute in-memory units, each compute in-memory unit including the CIM array, and an on-chip network providing communication between the plurality of compute in-memory units.
[0014] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including: a plurality of repeater circuits positioned on the first plurality of bit-line pairs between the second bank of CIM array and the third bank of CIM array, wherein each plurality of repeater circuits includes a unidirectional read repeater and a unidirectional write repeater coupled with a first read interconnect and a first write interconnect, respectively, of one of the first plurality of bit-line pairs.
[0015] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including: a plurality of first bank differential drivers coupled with the first plurality of bit-lines, each first bank differential driver of the plurality of first bank differential drivers including: a sense amplifier that receives differential signals from a respective computing cell and outputs a non-differential signal to the respective first read interconnect, and a differential driver that receives a nondifferential signal from the respective first write interconnect and outputs a differential signal to the respective computing cell.
[0016] In some aspects, the techniques described herein relate to an in-memory computing architecture, wherein the differential driver includes an inverter for inverting the non-differential signal received from the respective first write interconnect, wherein the differential signal includes an output of the inverter.
[0017] In some aspects, the techniques described herein relate to an in-memory computing architecture, wherein the differential driver includes at least one buffer in path of the differential signal.
[0018] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including: a first set of word-line interconnects corresponding to the first set of rows of computing cells of the first bank of CIM array, wherein each word-line interconnect of the first set of word-line interconnects is coupled with computing cells in a respective row of computing cells, and a plurality of word-line interconnect repeaters positioned on the first set of word-line interconnects.
[0019] In some aspects, the techniques described herein relate to an in-memory computing architecture, wherein the first bank of CIM arrays includes at least a first subbank and a second sub-bank where the first sub-bank includes a first set of columns of computing cells and the second sub-bank includes a second set of columns of computing cells, wherein the plurality of word-line interconnect repeaters are positioned between the first sub-bank and the second sub-bank.
[0020] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including a plurality of analog to digital converters (ADCs) coupled with one or more of the plurality of column interconnects, the ADCs configured to convert analog voltages on the one or more of the plurality of column interconnects into a digital value.
[0021] In some aspects, the techniques described herein relate to an in-memory computing architecture, further including a plurality of compute in-memory units, each compute in-memory unit including the CIM array, and an on-chip network providing communication between the plurality of compute in-memory units.
[0022] In some aspects, the techniques described herein relate to a method, wherein the method uses a compute-in-memory (CIM) architecture, the CIM architecture including: a CIM array of computing cells including a plurality of rows and a plurality of columns, each computing cell including a memory cell; first and second banks of the CIM array, from a plurality of banks of the CIM array, respectively including first and second sets of rows of computing cells; first and second row decoders respectively coupled with the first and second banks of the CIM array, and a plurality of column interconnects, each column interconnect of the plurality of column interconnects coupled with output ports of computing cells in the respective column of the CIM array, the method including: enabling, by the first row decoder, any of the first set of rows of computing cells to write, one at a time, data in respective memory cells; enabling, by the second row decoder, any of the second set of rows of computing cells to write, one at a time, data in respective memory cells; controlling, by a controller, the first row decoder and the second row decoder to write data into any of the first set of rows of computing cells simultaneously with writing data into any of the second set of rows of computing cells; and providing a result of computation at an output port of one or more of the computing cells from the CIM array of computing cells.
[0023] In some aspects, the techniques described herein relate to a method, wherein the CIM architecture further includes: a first plurality of bit-line pairs, each bit-line pair including a first read interconnect and a first write interconnect, wherein each bit-line pair of the first plurality of bit-line pairs is coupled with computing cells in respective columns of the first bank of CIM array; and a second plurality of bit-line pairs, each bit-line pair including a second read interconnect and a write interconnect, wherein each bit-line pair of the second plurality of bit-line pairs is coupled with computing cells in respective columns of the second bank of CIM array.
[0024] In some aspects, the techniques described herein relate to a method, wherein the CIM architecture further includes: a third bank of CIM array, the second bank of CIM array positioned between the first bank of CIM array and the third bank of CIM array, wherein the first plurality of bit-line pairs traverse across the second bank of CIM array and the third bank of CIM array and are coupled with the computing cells in the respective columns of the first bank of CIM array.
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] FIG. 1 depicts a block diagram of an example in-memory computing architecture.
[0026] FIG. 2 shows a block diagram of a compute in-memory array.
[0027] FIG. 3 shows an example circuit diagram of the computing cells discussed above in relation to FIG. 2.
[0028] FIG. 4 shows an example compute in-memory array.
[0029] FIG. 5 shows an example timing diagram depicting the word line operation in various banks of the compute in-memory array shown in FIG. 4.
[0030] FIG. 6 shows an example compute in-memory array including repeaters for bit-line pairs.
[0031] FIG. 7 shows a portion of the compute in-memory array including word line repeaters.
[0032] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
[0033] The various concepts introduced above and discussed in greater detail below may be implemented in any of numerous ways, as the described concepts are not limited to any particular manner of implementation. Examples of specific implementations and applications are provided primarily for illustrative purposes.
[0034] As will be apparent to those of skill in the art upon reading this disclosure, each of the individual aspects described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several aspects without departing from the scope or spirit of the present disclosure.
[0035] Any recited method can be carried out in the order of events recited or in any other order that is logically possible. That is, unless otherwise expressly stated, it is in no way intended that any method or aspect set forth herein be construed as requiring that its steps be performed in a specific order. Accordingly, where a method claim does not specifically state in the claims or descriptions that the steps are to be limited to a specific order, it is in no way intended that an order be inferred, in any respect. This holds for any possible nonexpress basis for interpretation, including matters of logic with respect to arrangement of steps or operational flow, plain meaning derived from grammatical organization or punctuation, or the number or type of aspects described in the specification.
[0036] All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. All such publications and patents are herein incorporated by reference as if each individual publication or patent were specifically and individually indicated to be incorporated by reference. Such incorporation by reference is expressly limited to the methods and/or materials described in the cited publications and patents and does not extend to any lexicographical definitions from the cited publications and patents. Any lexicographical definition in the publications and patents cited that is not also expressly repeated in the instant specification should not be treated as such and should not be read as defining any terms appearing in the accompanying claims. The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided herein can be different from the actual publication dates, which can require independent confirmation.
[0037] While aspects of the present disclosure can be described and claimed in a particular statutory class, such as the system statutory class, this is for convenience only and one of skill in the art will understand that each aspect of the present disclosure can be described and claimed in any statutory class.
[0038] It is also to be understood that the terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the disclosed compositions and methods belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the specification and relevant art and should not be interpreted in an idealized or overly formal sense unless expressly defined herein.
[0039] It should be noted that ratios, concentrations, amounts, and other numerical data can be expressed herein in a range format. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint. It is also understood that there are a number of values disclosed herein, and that each value is also herein disclosed as “about” that particular value in addition to the value itself. For example, if the value “10” is disclosed, then “about 10” is also disclosed. Ranges can be expressed herein as from “about” one particular value, and/or to “about” another particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms a further aspect. For example, if the value “about 10” is disclosed, then “10” is also disclosed.
[0040] When a range is expressed, a further aspect includes from the one particular value and/or to the other particular value. For example, where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the disclosure, e.g. the phrase “x to y” includes the range from ‘x’ to ‘y’ as well as the range greater than ‘x’ and less than ‘y’. The range can also be expressed as an upper limit, e.g. ‘about x, y, z, or less’ and should be interpreted to include the specific ranges of ‘about x’, ‘about y’, and ‘about z’ as well as the ranges of ‘less than x’, ‘less than y’, and ‘less than z’. Likewise, the phrase ‘about x, y, z, or greater’ should be interpreted to include the specific ranges of ‘about x’, ‘about y’, and ‘about z’ as well as the ranges of ‘greater than x’, ‘greater than y’, and ‘greater than z’. In addition, the phrase “about ‘x’ to ‘y’”, where ‘x’ and ‘y’ are numerical values, includes “about ‘x’ to about ‘y’”.
[0041] It is to be understood that such a range format is used for convenience and brevity, and thus, should be interpreted in a flexible manner to include not only the numerical values explicitly recited as the limits of the range, but also to include all the individual numerical values or sub-ranges encompassed within that range as if each numerical value and sub-range is explicitly recited. To illustrate, a numerical range of “about 0.1% to 5%” should be interpreted to include not only the explicitly recited values of about 0.1% to about 5%, but also include individual values (e.g., about 1%, about 2%, about 3%, and about 4%) and the sub-ranges (e.g., about 0.5% to about 1.1%; about 5% to about 2.4%; about 0.5% to about 3.2%, and about 0.5% to about 4.4%, and other possible sub-ranges) within the indicated range.
[0042] As used herein, the terms “about,” “approximate,” “at or about,” and “substantially” mean that the amount or value in question can be the exact value or a value that provides equivalent results or effects as recited in the claims or taught herein. That is, it is understood that amounts, sizes, formulations, parameters, and other quantities and characteristics are not and need not be exact, but can be approximate and/or larger or smaller, as desired, reflecting tolerances, conversion factors, rounding off, measurement error and the like, and other factors known to those of skill in the art such that equivalent results or effects are obtained. In some circumstances, the value that provides equivalent results or effects cannot be reasonably determined. In such cases, it is generally understood, as used herein, that “about” and “at or about” mean the nominal value indicated ±10% variation unless otherwise indicated or inferred. In general, an amount, size, formulation, parameter or other quantity or characteristic is “about,” “approximate,” or “at or about” whether or not expressly stated to be such. It is understood that where “about,” “approximate,” or “at or about” is used before a quantitative value, the parameter also includes the specific quantitative value itself, unless specifically stated otherwise.
[0043] Prior to describing the various aspects of the present disclosure, the following definitions are provided and should be used unless otherwise indicated. Additional terms can be defined elsewhere in the present disclosure.
[0044] As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list.
[0045] As used in the specification and the appended claims, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a proton beam degrader,” “a degrader foil,” or “a conduit,” includes, but is not limited to, two or more such proton beam degraders, degrader foils, or conduits, and the like.
[0046] The terms “configured for” or “configured to,” as used herein with respect to a specified operation or function, refer to a device, component, circuit, structure, machine, signal, etc. that is physically constructed, programmed, formatted and/or arranged to perform the specified operation or function.
[0048] As used herein, the terms “optional” or “optionally” mean that the subsequently described event or circumstance can or cannot occur, and that the description includes instances where said event or circumstance occurs and instances where it does not.
[0049] Unless otherwise specified, temperatures referred to herein are based on atmospheric pressure (i.e. one atmosphere).
[0050] In-memory Computing Architecture
[0051] FIG. 1 depicts a block diagram of an example in-memory computing architecture 100. The in-memory computing architecture 100 can be adapted, for example, to a scalable neural network accelerator architecture based on in-memory computing (IMC). However, the in-memory computing architecture 100 is not limited to neural network applications, and can be employed in numerous applications where high data throughput with low power consumption is desired. The in-memory computing architecture 100 includes a plurality of Compute In-Memory unit (CIMU) tiles 102. The plurality of CIMU tiles 102 are arranged in an array within the architecture. The plurality of CIMU tiles 102 can be individually enabled/disabled based on the computations to be carried out by the in-memory computing architecture 100. In examples where the in-memory computing architecture 100 can be used to implement neural networks, the neural networks can be mapped to one or more CIMU tiles of the plurality of CIMU tiles 102. The remainder of the CIMU tiles of the plurality of CIMU tiles 102 can be disabled to reduce power consumption.
[0052] The in-memory computing architecture 100 can include, in part, activation buffers 104, segmented weight buffers 106, and one or more phase-locked loops (PLLs) 108. The activation buffers 104 can provide signals representative of activations from previous stages of computation, for instance previous layers in a neural network. The segmented weight buffers 106 can provide data required for computation together with the activations/data from previous stages, for instance these weight buffers could store the weights of neural network layers. The one or more PLLs 108 can provide reference clock signals to various portions of the in-memory computing architecture 100. The in-memory computing architecture 100 can also include off-chip control elements 110 or interfaces for communication with off-chip processors or software to send and receive control or data signals. The off-chip interface 110 can, by itself or in concert with other elements, provide circuits and protocols for high-speed interfaces for wired or wireless connections involving data, control signals, or both, to other processors or other arrays of CIMU tiles, for example, enabling the in-memory computing architecture 100 to scale upward as desired.
[0053] Each of the plurality of CIMU tiles 102 can include a plurality of CIMUs 112, an on-chip network 114, and a weight network 116. While FIG. 1 shows each of the plurality of CIMU tiles 102 including four CIMUs 112, this is only an example, and the CIMU tiles 102 can include fewer or more CIMUs 112. One or more of the CIMUs 112 can include a compute in-memory (CIM) array 118, compute dataflow buffers 120, programmable digital single instruction multiple data (SIMD) module 122, and a programming and control module 124. The CIM array 118 can be an array of computing cells, discussed further below. The CIM array 118 can carry out computations based on data stored in the computing cells and data provided by the activation buffers 104. The computing cells can be used to perform computational operations between inputs and data stored in a memory cell within the computing cells. The operations can include logical operations (AND, NOR, etc.) or multiplication operations carried out between inputs. The CIM array 118 can carry out matrix operations between multi-bit operands, which is particularly useful in neural network computations where activations are multiplied with weights. In some such applications, the weights can be stored in the memory cells of the CIM array 118 and activations can be provided as input vectors. Each computing cell in the CIM array 118 can perform the multiplication operation between a 1-bit weight and a portion of the input activation, which can be represented in digital or analog signal form. Some example computing cells can generate a result that is in the form of an electrical signal. For example, the computing cell can output an analog voltage that is representative of the computation result. In some other examples, the computing cell can output an electrical current that is representative of the computation result.
The electrical signals of various computing cells can be accumulated and processed to generate the overall matrix multiplication result. For example, electrical signals representative of computation from all computing cells in a single column of the CIM array 118 can be accumulated to represent a portion of the computation. Accumulated electrical signals from multiple columns of computing cells of the CIM array 118 can be combined and processed to generate an overall matrix multiplication result. For instances where the electrical signal generated by the computing cells is an electrical current, the currents from various computing cells within a column can be summed to generate a representative accumulated electrical current. In instances where the electrical signal generated by the computing cell is an analog voltage, the analog voltage generated by each computing cell can be stored in capacitors within the computing cell and then accumulated as a voltage that is representative of a portion of the overall matrix multiplication result. The accumulated result, whether an electrical current or an analog voltage, can be converted into digital form using analog to digital converters (ADCs) and further processed, stored, or passed on to other CIM arrays 118 for further computations.
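The per-column accumulation described above amounts, behaviorally, to a matrix-vector multiply. The following is a minimal arithmetic sketch with invented names; in the actual array the accumulation is performed as summed currents or shared charge rather than digital addition:

```python
# Behavioral sketch (not the circuit): each cell multiplies its stored
# 1-bit weight by the activation applied to its row; a column
# interconnect accumulates all cell outputs into one per-column result.

def column_accumulate(weights, activations):
    """weights: n_rows x n_cols matrix of 1-bit values (0/1);
    activations: length-n_rows input vector (one element per row).
    Returns the accumulated per-column results (the MVM x^T W)."""
    n_rows, n_cols = len(weights), len(weights[0])
    return [sum(weights[i][j] * activations[i] for i in range(n_rows))
            for j in range(n_cols)]

W = [[1, 0], [1, 1], [0, 1]]    # 3 rows x 2 columns of 1-bit weights
x = [2, 3, 5]                   # one activation per row
print(column_accumulate(W, x))  # column 0: 2+3 = 5; column 1: 3+5 = 8
```

Combining several such per-column results (and converting them through ADCs) then yields the overall matrix multiplication result, as described in the text.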
[0054] The programmable digital SIMD 122 can have an instruction set for flexible element-wise operation and the compute dataflow buffers 120 can support a wide range of neural network dataflows. Each CIMU 112 can provide a high level of configurability and can be abstracted into a software library of instructions for interfacing with a compiler (for allocating/mapping an application, neural network and the like to the architecture), and where instructions can thus also be added prospectively. That is, the library can include single/fused instructions such as element mult/add, h(·) activation, (N-step convolutional stride + matrix-vector-multiplication (MVM) + batch norm. + h(·) activation + max. pool), (dense + MVM) and the like. In various nonlimiting examples, h(·) can indicate an activation function, including without limitation the rectified linear unit ReLU(x) function, the sigmoid function (σ(x)), and other such functions. Max pooling, a downsampling technique for reducing spatial dimensions to maintain computational efficiency while retaining other important features of the CIMU array or the network, can also be a subject of the computation. The N-step convolutional stride can refer to the number of pixels or other information bits that a kernel or convolutional filter moves or glides across the input image during convolution to effect operations like feature detection, pattern recognition, blurring, image sharpening, image recognition, and the like.
[0055] The on-chip network 114 (OCN) can include routing channels within Network In/Out Blocks, and a Switch Block, which provides flexibility via a disjoint architecture as shown, for example, by the disjoint buffer switch 133 in the enlarged view of OCN 114. This flexibility, among other benefits, enables modules that are independent of one another to work in parallel. The OCN 114 works with configurable CIMU input/output ports to optimize data structuring to/from an in-memory computing engine, to maximize data locality across MVM dimensionalities and tensor depth/pixel indices. The OCN 114 routing channels can include bidirectional wire pairs as shown by the exemplary bidirectional pipelined routing structure 131 in the expanded view of the OCN 114, so as to ease repeater/pipeline-FF insertion, while providing sufficient density.
[0056] The in-memory computing architecture 100 can be used to implement a neural network (NN) accelerator, wherein a plurality of compute in memory units (CIMUs 112) are arrayed and interconnected using a very flexible on-chip network (OCN 114), wherein the outputs of one CIMU can be connected to or flow to the inputs of one or multiple other CIMUs, and the outputs of many CIMUs can be connected to the inputs of a single CIMU. The OCN 114 can be implemented as a single on-chip network, as a plurality of on-chip network portions, or as a combination of on-chip and off-chip network portions.
[0057] The CIMUs 112 can be surrounded by an on-chip network for moving activations between CIMUs 112 (activation network) as well as moving weights from embedded L2 memory to CIMUs 112 (weight-loading interface). This has similarities with architectures used for coarse-grained reconfigurable arrays (CGRAs), but with cores providing high-efficiency MVM and element-wise computations targeted for neural network acceleration. Various options exist for implementing the on-chip network. The approach in FIG. 1 enables routing segments along a CIMU 112 to take outputs from that CIMU 112 and/or to provide inputs to that CIMU 112. In this manner data originating from any CIMU 112 can be routed to any CIMU 112, and any number of CIMUs 112.
[0058] Each CIMU 112 is associated with an input buffer (not shown) for receiving computational data from the on-chip network and composing the received computational data into an input vector for matrix vector multiplication (MVM) processing by the CIMU to generate thereby computed data including an output vector.
[0059] Each CIMU 112 is associated with a shortcut buffer (not shown), for receiving computational data from the on-chip network 114, imparting a temporal delay to the received computational data, and forwarding delayed computation data toward a next CIMU 112 or an output in accordance with a dataflow map such that dataflow alignment across multiple CIMUs 112 is maintained. At least some of the input buffers can be configured to impart a temporal delay to computational data received from the on-chip network 114 or from a shortcut buffer. The dataflow map can support pixel-level pipelining to provide pipeline latency matching.
[0060] The temporal delay imparted by a shortcut or input buffers includes at least one of an absolute temporal delay, a predetermined temporal delay, a temporal delay determined with respect to a size of input computational data, a temporal delay determined with respect to an expected computational time of the CIMU 112, a control signal received from a dataflow controller, a control signal received from another CIMU 112, and a control signal generated by the CIMU 112 in response to the occurrence of an event within the CIMU. In some aspects, at least one of the input buffer and shortcut buffers of each of the plurality of CIMUs 112 in the array of CIMUs 112 can be configured in accordance with a dataflow map supporting pixel-level pipelining to provide pipeline latency matching. The array of CIMUs 112 can also include parallelized computation hardware configured for processing input data received from at least one of respective input and shortcut buffers.
[0061] At least a subset of the CIMUs 112 can be associated with on-chip network 114 portions including operand loading network portions configured in accordance with a dataflow of an application mapped onto the IMC. The application mapped onto the IMC includes a neural network (NN) mapped onto the IMC such that parallel output computed data of configured CIMUs executing at a given layer are provided to configured CIMUs 112 executing at a next layer, said parallel output computed data forming respective NN feature-map pixels.
[0062] The input buffer can be configured for transferring input NN feature-map data to parallelized computation hardware within the CIMU in accordance with a selected stride step, such as discussed above. The NN can include a convolution neural network (CNN), and the input buffer can be used to buffer a number of rows of an input feature map corresponding to a size or height of the CNN kernel.
[0063] The CIM array 118 in each CIMU 112 can perform matrix vector multiplication (MVM) in accordance with a bit-parallel, bit-serial (BPBS) computing process in which single-bit computations are performed using an iterative barrel shifting with column weighting process, followed by a results accumulation process.
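A hedged behavioral sketch of the BPBS process follows (all function and variable names are assumptions, and signed-weight handling is omitted): multi-bit inputs are applied bit-serially, weight bits are stored bit-parallel as per-bit column planes, and each single-bit MVM result is barrel-shifted by its combined input/weight bit position before accumulation.

```python
# Hedged behavioral sketch of BPBS matrix-vector multiplication.
# Inputs are applied bit-serially; weight bits are stored bit-parallel
# as per-bit column planes; results are shifted then accumulated.

def bpbs_mvm(weight_planes, inputs, in_bits=5):
    """weight_planes[k] is an n_rows x n_cols plane holding bit k of
    every weight; inputs is a vector of unsigned multi-bit activations."""
    n_rows = len(inputs)
    n_cols = len(weight_planes[0][0])
    out = [0] * n_cols
    for b in range(in_bits):                       # bit-serial input loop
        xb = [(x >> b) & 1 for x in inputs]        # current input bit-plane
        for k, plane in enumerate(weight_planes):  # bit-parallel weights
            for j in range(n_cols):
                col = sum(plane[i][j] * xb[i] for i in range(n_rows))
                out[j] += col << (b + k)           # barrel shift + weight
    return out

planes = [[[1], [0]],    # bit-0 plane: weight values 3 and 2
          [[1], [1]]]    # bit-1 plane
print(bpbs_mvm(planes, [5, 3], in_bits=3))   # 3*5 + 2*3 = 21 -> [21]
```

The shift-and-accumulate over bit positions is what allows an array of single-bit multiplying cells to compose a full multi-bit MVM.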
[0064] FIG. 2 shows additional details of a portion of the in-memory computing architecture 100 shown in FIG. 1, and in particular, details of an example compute in-memory (CIM) array 200 and associated components. The CIM array 200 can be used, for example, to implement, in part, the CIM array 118 discussed above in relation to the in-memory computing architecture 100 shown in FIG. 1. In one example implementation, the CIM array 200 can include a fully row/column-parallel (1152 row X 256 column) array of computing cells 202 of an in-memory-computing (IMC) macro enabling N-bit (5-bit) input processing. The number of rows (1152), the number of columns (256), and the number of bits (5-bit) of input shown in FIG. 2 are only examples, and any or all of these quantities or circuit configurations can be varied based on desired implementations. The computing cells can be used to perform computational operations between inputs and data stored in a memory cell within the computing cells. The operations can include logical operations (AND, NOR, etc.) or multiplication operations carried out between inputs. In some examples, the operands of the computation can be 1-bit each. In some other examples, one of the operands can be an analog signal (voltage or current) while the other operand can be a 1-bit operand stored in the memory cell.
[0065] The in-memory computing architecture 100, in some examples, can be utilized for matrix vector multiplication (MVM) operations, which dominate compute-intensive and data-intensive AI workloads, in a manner that reduces compute energy and data movement by orders of magnitude. This is achieved through efficient analog compute in the computing cells 202, and by thus accessing a compute result (e.g., inner product), rather than individual bits, from memory. But doing so fundamentally instates an energy/throughput-vs.-SNR tradeoff, where going to analog introduces compute noise and accessing a compute result increases dynamic range (i.e., reducing SNR for given readout architecture). The computing cells 202, which store computational results in the form of a voltage in capacitors within the computing cells 202, can employ metal-fringing capacitors, which can achieve very low noise from analog nonidealities, and thus have the potential for extremely high dynamic range.
[0066] FIG. 2 shows a block diagram of the CIM array 200 including a 1152 (row) X 256 (col.) array of 10T (“ten transistor”) SRAM computing cells 202, which in this example are multiplying bit-cells (M-BCs) (such as, for example, a 10T M-BC 202, although the number of transistors of SRAM interface 204 and M-BCs 202 is implementation-dependent and the circuit can use different numbers of transistors or other circuit elements without departing from the principles of the disclosure); peripheral circuits for standard writing/reading thereto (e.g., a bit line (BL) decoder such as SRAM interface 204, and 256 BL drivers 206-1 through 206-256 (collectively referred to as BL drivers 206)), a word line (WL) or address decoder 208 and 1152 WL drivers 210-1 through 210-1152 (collectively referred to as WL drivers 210), and control block 212 for controlling the BL decoder 204 and the WL decoder 208); peripheral circuitry for providing 5-bit input-vector elements thereto (e.g., 1152 Dynamic-Range Doubling (DRD) DACs 214-1 through 214-1152 (collectively referred to as DRD DACs 214), and a corresponding in-memory computing (IMC) controller (“IMC control block”) 216); peripheral circuitry for digitizing the compute result from each column (e.g., 256 8-bit successive approximation register (“SAR”) ADCs 218-1 through 218-256 (collectively referred to as SAR ADCs 218), and column reset mechanisms 220-1 through 220-256 (collectively referred to as column reset mechanisms 220) (e.g., CMOS switches configured to pull the output voltage levels of column compute lines CLs to a reset voltage VRST during a reset phase of operation, and allow the voltage levels of column compute lines CLs to reflect their respective compute results during an evaluation phase of operation). For example, the RST switches corresponding to CL1-CL256, or a subset thereof, can close to produce the desired reset voltage VRST during the reset phase. The RST switches can then open during an ensuing evaluation phase, thereby enabling the voltage values at the CLs to reflect the computed product.
[0067] In addition, the lower right portion of FIG. 2 depicts an example enlarged view of a representative one of the 256 8-bit ADCs, which includes various switch mechanisms ADCRST (Analog-to-Digital Converter Reset), ADCSMP (Analog-to-Digital Converter Sample) and voltage designations VADCRST and VCMPR, the latter voltage designation connected in this example to a positive terminal of the comparator CMPR and the former voltage designation selectively applied via the ADCRST and ADCSMP switch to reset the comparator. The negative terminal of comparator CMPR receives a CL value when the circuit is activated. An output of comparator CMPR is coupled to SAR logic for outputting an 8-bit digital result. It will be appreciated, however, that the implementation details of the above circuits are representative in nature and that variations to the circuits are possible without departing from the scope or spirit of the present disclosure.
[0068] While writing/reading is typically performed row-by-row, MVM operations are typically performed by applying input-vector elements corresponding to neural-network input activations to all rows at once. That is, each DRD DAC 214j, in response to a respective 5-bit input-vector element Xj[4:0], generates a respective differential output signal (IAj/IAbj) which is subjected to a 1-bit multiplication with the stored weights (Aij/Abij) at each computing cell 202j in the corresponding row of computing cells 202, and accumulation through charge redistribution across the capacitors CM-BC of computing cells 202 on the compute line (CL) to yield an inner product in each column, which is then digitized via the respective SAR ADCs 218 of each column as noted above.
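The accumulate-and-digitize chain described above can be sketched behaviorally. This is an idealized model with invented names and scaling; actual charge redistribution and SAR conversion are analog operations subject to the nonidealities discussed earlier:

```python
# Idealized model of the analog accumulate-and-digitize path: each cell
# leaves a high or low voltage on its unit capacitor; charge sharing
# across equal capacitors on a column settles at the mean voltage,
# which an ideal N-bit SAR ADC then quantizes. Voltages are assumptions.

def column_voltage(cell_outputs, v_high=0.9, v_low=0.0):
    # Equal capacitors sharing charge settle at the mean cell voltage.
    volts = [v_high if o else v_low for o in cell_outputs]
    return sum(volts) / len(volts)

def sar_adc(v, v_ref=0.9, bits=8):
    # Ideal successive-approximation conversion of v in [0, v_ref).
    code = 0
    for b in reversed(range(bits)):
        trial = code | (1 << b)
        if trial * v_ref / (1 << bits) <= v:
            code = trial
    return code

v = column_voltage([1, 1, 0, 1])   # 3 of 4 cells high -> 0.675 V
code = sar_adc(v)                  # roughly (0.675 / 0.9) * 256
```

The mean-voltage behavior is why the compute-line result scales with the count of cells producing a "high" product, which the ADC then reads out as the column's inner-product contribution.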
[0069] FIG. 3 shows an example circuit diagram of the computing cells 202 discussed above in relation to FIG. 2. The computing cells 202 can include a highly dense structure for achieving weight storage and multiplication, thereby minimizing data-broadcast distance and control signals within the context of i-row, j-column arrays implemented using such computing cells, such as the 1152 (row) X 256 (col.) CIM array 200 of 10T SRAM multiplying bit cells (M-BCs).
[0070] The exemplary computing cell 202 includes a six-transistor bit cell portion 222 (here, NMOS transistors 226a, 226b, 226e, 226f and PMOS transistors 226c and 226d), a first switch SW1, a second switch SW2, a capacitor C, a word line (WL) 224, a first bit line (BLj) 227, a second bit line (BLbj) 228, and a compute line (CL) 230.
[0071] The six-transistor bit cell portion 222 is depicted as being located in a middle portion of the computing cell 202, and includes six transistors 226a-226f. The six-transistor bit cell portion 222 can be used for storage, and to read and write data. In one example, the six-transistor bit cell portion 222 stores the filter weight. In some examples, data is written to the computing cell 202 through the word line (WL) 224, the first bit line (BL) 227, and the second bit line (BLb) 228.
[0072] The computing cells 202 can include a first CMOS switch SW1 and a second CMOS switch SW2. The first switch SW1 is depicted as being controlled by a first stored signal Aij such that, when closed, the first switch SW1 couples one of the received differential output signals provided by the DRD DACs 214, illustratively IA, to a first terminal of the capacitor C. The second switch SW2 is depicted as being controlled by a second stored signal Abij such that, when closed, the second switch SW2 couples the other one of the received differential output signals (IA/IAb) of the corresponding DRD DACs 214, illustratively IAb, to the first terminal of the capacitor C. The second terminal of the capacitor C is connected to a compute line (CL) 230 via an output port 232 that provides a result of the computation of the computing cell 202. It is noted that in various other examples, the input signals provided to the first and second switches SW1 and SW2 can include a fixed voltage (e.g., Vdd), ground, or some other voltage level.
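The switch behavior described above reduces to a small truth table: because the stored signals Aij and Abij are complementary, exactly one switch closes and the capacitor samples either IA or IAb. A minimal sketch (function name invented for the illustration):

```python
# Truth-table sketch of the differential switch pair in the bit cell:
# SW1 (gated by the stored bit A) couples IA to the capacitor, while
# SW2 (gated by the complement Ab) couples IAb, so the sampled voltage
# is a one-bit multiplexed product of the stored bit and the DAC output.

def sampled_input(a_bit, ia, iab):
    # Exactly one switch closes because A and Ab are complementary.
    return ia if a_bit else iab

print(sampled_input(1, 0.6, 0.3))   # A=1: SW1 passes IA
print(sampled_input(0, 0.6, 0.3))   # A=0: SW2 passes IAb
```

This multiplexing is what implements the 1-bit multiplication between the stored weight and the differential DAC output before the result is accumulated on the compute line.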
[0073] The computing cells 202, including the first SW1 and second SW2 switches, can implement computation on the data stored in the six-transistor bit cell portion 222. The result of a computation is sampled as charge on the capacitor C. According to various implementations, the capacitor C can be positioned above the computing cell 202 and utilize no additional area on the circuit. In some implementations, a logic value of either Vdd or ground is stored on the capacitor C. In other implementations, the voltage stored on the capacitor C can include a positive or negative voltage in accordance with the operation of the first and the second switches SW1 and SW2, and the output voltage level generated by the corresponding DRD DACs 214 as shown in FIG. 2.
[0074] Thus, with continued reference to FIG. 3, the value that is stored on the capacitor C is highly stable, since the capacitor C is either driven up to a fixed analog voltage or down to ground. In some examples, the capacitor C is a metal-oxide-metal (MOM) finger capacitor, and in some examples, the capacitor C can be about 0.1 femtofarads (fF) to about 10 fF, or can be about 1.2 fF. MOM capacitors have very good matching, temperature, and process characteristics, and thus provide highly linear and stable compute operations. Note that other types of logic functions can be implemented using the computing cells 202 by changing the way the transistors 226a-226f and/or the first and the second switches SW1 and SW2 are connected and/or operated during the reset and evaluation phases of operation. The bit cell portion can be implemented using different numbers of transistors and can have different architectures. In some examples, the bit cell portion can be an SRAM, DRAM, MRAM, or RRAM bit cell.
[0075] High-Throughput Compute In-Memory Array

[0076] FIG. 4 shows an example compute in-memory (CIM) array 300. The CIM array 300 can be used, for example, to implement the CIM array 118 or the CIM array 200 discussed above in relation to FIG. 1 and FIG. 2. The CIM array 300 is an array of computing cells 302 that are arranged in a plurality of rows and a plurality of columns. In the example shown in FIG. 4, the CIM array 300 includes 576 rows of computing cells 302 and 128 columns of computing cells 302. The number of rows and the number of columns shown in FIG. 4 are only examples and can vary based on the implementation. The CIM array 300 can be partitioned into a plurality of banks. For example, the CIM array 300 can be partitioned into a first bank of CIM array 304, a second bank of CIM array 306, a third bank of CIM array 308, and a fourth bank of CIM array 310. While the CIM array 300 shown in FIG. 4 is partitioned into four banks, it should be understood that the CIM array 300 can be partitioned into a fewer or a greater number of banks.
[0077] Each bank of CIM array can include a number of rows of the CIM array 300. For example, the first bank of CIM array 304 includes a first set of rows of computing cells 302: row-0 to row-143; the second bank of CIM array 306 includes a second set of rows of computing cells 302: row-144 to row-287; the third bank of CIM array 308 includes a third set of rows of computing cells 302: row-288 to row-431; and the fourth bank of CIM array 310 includes a fourth set of rows of computing cells 302: row-432 to row-575. In the example shown in FIG. 4, the CIM array 300 is partitioned into four equal banks. That is, each of the four banks includes the same number of rows of computing cells 302. In some other examples, one or more banks can include a number of rows of computing cells 302 that is different from the number of rows of computing cells 302 in another bank.
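To make the partitioning concrete, the row-range arithmetic above can be sketched in Python (an illustrative aid, not part of the disclosure; the function and constant names are assumptions):

```python
# Map a global row index (0-575) of the CIM array 300 to its bank
# index and local row within that bank, per the four equal banks of
# 144 rows each described above.
ROWS_PER_BANK = 144
NUM_BANKS = 4

def bank_of(global_row: int) -> tuple[int, int]:
    """Return (bank_index, local_row) for a global row index."""
    if not 0 <= global_row < ROWS_PER_BANK * NUM_BANKS:
        raise ValueError("row index out of range")
    return divmod(global_row, ROWS_PER_BANK)

# Example: row-288 is local row-0 of the third bank (bank index 2)
```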
[0078] Each bank of the CIM array can have an associated row decoder. For example, a first row decoder can be coupled with the first bank of CIM array 304, a second row decoder can be coupled with the second bank of CIM array 306, a third row decoder can be coupled with the third bank of CIM array 308, and a fourth row decoder can be coupled with the fourth bank of CIM array 310. Each row decoder can receive a row address as an input and output an enable signal on an interconnect that corresponds to the row address. In the example shown in FIG. 4, the word lines WL can carry signals between the bank and the corresponding row decoder. For example, the word lines WL0[0] - WL0[143] can extend between the rows of computing cells 302 in the first bank of CIM array 304 and the first row decoder, word lines WL1[0] - WL1[143] can extend between the rows of computing cells 302 in the second bank of CIM array 306 and the second row decoder, word lines WL2[0] - WL2[143] can extend between the rows of computing cells 302 in the third bank of CIM array 308 and the third row decoder, and word lines WL3[0] - WL3[143] can extend between the rows of computing cells 302 in the fourth bank of CIM array 310 and the fourth row decoder. While not explicitly shown in FIG. 4, each word line associated with a row of computing cells 302 is coupled with each computing cell 302 within that row. Referring to the computing cell 202 shown in FIG. 3 as an example, the word line WLi can be common to all computing cells 202 within the row. Enabling the word line WLi can allow read/write access to the six-transistor bit cell portion 222, but the bit cell could also have other structures and be based on other memory technologies besides SRAM. In this manner, when the row decoder associated with a row of computing cells 302 enables the corresponding word line WL, all the computing cells in that row can be written into or read from using the first bit line (BL) 227 and the second bit line (BLb) 228.
The first row decoder, for example, can receive a row address that corresponds to row-0 of the first bank of the CIM array 304. The first row decoder can be configured to enable the word line WL0[0], which enables all the computing cells 302 in row-0 to be written into or read from using their respective bit lines. The row decoders associated with the various banks can be similar to the decoder 208 discussed above in relation to FIG. 2. However, unlike the decoder 208, which can enable any row in the entire CIM array 200, each row decoder associated with a bank can enable a row only in the associated bank. That is, the first row decoder may not enable a row in the second bank of CIM array 306.
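As a behavioral sketch (not part of the disclosure), a per-bank row decoder can be modeled as producing a one-hot word-line vector, asserting exactly one of the bank's 144 word lines for a given local row address:

```python
# One-hot row decoder model for a single 144-row bank: the output
# list models the word lines WL[0..143], with exactly one enabled.
def decode_row(local_addr: int, rows: int = 144) -> list[int]:
    if not 0 <= local_addr < rows:
        raise ValueError("address out of range for this bank")
    return [1 if i == local_addr else 0 for i in range(rows)]

# Example: address 0 enables WL0[0] and no other word line in the bank
word_lines = decode_row(0)
```

Because each bank has its own decoder, four such decoders can each assert one word line in the same cycle, one per bank.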
[0079] Each row of computing cells 302 also receives a pair of differential output signals from the DRD DACs 214: a first differential signal IA and a second differential signal IAb. As discussed in relation to FIG. 2 and FIG. 3, the computing cells 202 can receive the pair of differential signals from the DRD DACs 214, which are selectively applied to one plate of the capacitor C during reset and evaluation phases based on the states of the first switch SW1 and the second switch SW2. The pair of differential signals IA/IAb is provided to each computing cell 302 in a row. For example, the pair of differential signals IA/IAb[0] is provided to each computing cell 302 in row-0, the pair of differential signals IA/IAb[1] is provided to each computing cell 302 in row-1, and so on. It should be noted that each pair of differential signals IA/IAb is, for simplicity, represented as a single interconnect in FIG. 4. It should also be noted that the differential signals IA/IAb are only examples, and are specific to the computing cell 202 shown in FIG. 3. In other computing cell designs, only one signal may be needed instead of two. In some other implementations of computing cells, where current instead of voltage is used to carry out computation, the CIM array 300 may receive one or more current signals that correspond to the input data.
[0080] The in-memory computing architecture 100 also includes a plurality of bit-line pairs BL/BLb for each column of each bank. The bit-line pairs can provide data to be written into the computing cells 302 and can also be used to read data from the computing cells 302. As an example, the bit-line pair BL0/BLb0[0] is coupled with computing cells 302 in column-0 of the first bank of the CIM array 304, the bit-line pair BL1/BLb1[0] is coupled with computing cells 302 in column-0 of the second bank of CIM array 306, the bit-line pair BL2/BLb2[0] is coupled with computing cells 302 in column-0 of the third bank of the CIM array 308, and the bit-line pair BL3/BLb3[0] is coupled with computing cells 302 in column-0 of the fourth bank of the CIM array 310. For simplicity, a limited number of bit-line pairs have been labeled in FIG. 4. As mentioned above, the bit-line pairs are coupled with all computing cells 302 within a column of a bank. Specifically, with reference to the computing cell 202 shown in FIG. 3, the bit-line pairs provide data that is to be stored in the six-transistor bit cell portion 222. For example, in neural network operations, the bit-line pairs can provide voltages that are representative of weights. If the weight is '1', for example, one bit-line of the pair can provide a high voltage (e.g., Vdd) and the other bit-line of the pair can provide a low voltage (e.g., GND). For each bank, the bit-line pairs provide the appropriate voltages to be stored in the computing cells in the currently enabled row. Thus, for example, for the first bank of CIM array 304, when the word line WL0[0] is enabled, all bit-line pairs BL0/BLb0[0] to BL0/BLb0[127] provide data for the computing cells in row-0 of the first bank of the CIM array 304. In a subsequent cycle, when the word line WL0[143] is enabled, all bit-line pairs BL0/BLb0[0] to BL0/BLb0[127] provide data for the computing cells in row-143 of the first bank of the CIM array 304.
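The complementary encoding of a weight bit onto a bit-line pair described above can be sketched as follows (illustrative only, not part of the disclosure; the 0.9 V supply value is an assumption):

```python
# Encode a weight bit onto a complementary bit-line pair: a '1' puts
# the high voltage on BL and the low voltage on BLb, and vice versa.
VDD = 0.9   # assumed supply voltage (illustrative)
GND = 0.0

def drive_bitlines(weight_bit: int) -> tuple[float, float]:
    """Return the (BL, BLb) voltages for the bit to be written."""
    return (VDD, GND) if weight_bit == 1 else (GND, VDD)
```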
In some examples, the bit-line pairs can be utilized for both read and write. In some such examples, one interconnect of a bit-line pair can be a read interconnect and the other interconnect of the bit-line pair can be a write interconnect.
[0081] The bit-line pairs coupled with the first bank of CIM array 304 traverse across the second bank of CIM array 306, the third bank of CIM array 308, and the fourth bank of CIM array 310. For example, BL0/BLb0[0], which is coupled with the computing cells 302 in column-0 of the first bank of CIM array 304, traverses across the computing cells 302 in the same column (column-0) of the second bank of CIM array 306, the third bank of CIM array 308, and the fourth bank of CIM array 310. Similarly, bit-line pairs coupled with the second bank of CIM array 306 traverse across the third bank of CIM array 308 and the fourth bank of CIM array 310, and bit-line pairs coupled with the third bank of CIM array 308 traverse across the fourth bank of CIM array 310.
[0082] The CIM array 300 can include a plurality of column interconnects. For example, the CIM array 300 can include one column interconnect for each column of the CIM array 300. Referring to FIG. 4, the CIM array 300 includes 128 column interconnects (CL[0] to CL[127]) corresponding to the 128 columns of computing cells 302. Each column interconnect can be coupled with output ports of computing cells in the respective column of the CIM array 300. For example, the column interconnect CL[0] is coupled with all the computing cells 302 in column-0 of each of the banks of the CIM array 300. The column interconnects can carry a signal that is representative of the computation carried out by all the computing cells 302 in the respective column. For example, referring to the computing cell 202 shown in FIG. 3, the computation result is stored in the capacitor C. During an evaluate phase of the operation of the CIM array 300, the charge on the column interconnect can be a function of the charges of all the capacitors C of all the computing cells 202 coupled with the column interconnect.
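The charge-domain accumulation on a column interconnect can be approximated behaviorally: with identical cell capacitors, the total sampled charge redistributes across the column, so the compute line settles near the mean of the per-cell voltages. This simplified model (not part of the disclosure) ignores any parasitic capacitance on the compute line:

```python
# Ideal charge-sharing model of a column interconnect: N equal
# capacitors C holding voltages v_i share total charge sum(C * v_i)
# over total capacitance N * C, settling at the mean cell voltage.
def column_voltage(cell_voltages: list[float]) -> float:
    return sum(cell_voltages) / len(cell_voltages)

# Example: two cells at 0.9 V and two at 0 V settle near 0.45 V
```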
[0083] The in-memory computing architecture 100 can include a controller that can control the operation of the CIM array 300. As an example, the SRAM control block 212 in relation to FIG. 2 can be coupled with the row decoders to control read/write operation of the banks. Such read/write operations can take higher-level control from the programming and control module 124, discussed above in relation to FIG. 1. In some instances, the in-memory computing architecture 100 can include a single programming and control module 124 that controls all the row decoders 208 through control blocks 212 in each CIMU 112. That is, the SRAM control block 212 can be coupled with and control the first row decoder, the second row decoder, the third row decoder and the fourth row decoder associated with the four banks of CIM array 300. The SRAM control block 212 can be implemented using one or more microcontrollers, microprocessors, state machines, application specific integrated circuits, etc.
[0084] In one aspect, the partitioning of the CIM array 300 into a plurality of banks can increase the throughput of write operations to the CIM array 300. This is because the write operation to a row of one bank is independent of the write operation to a row in another bank. In contrast, in the CIM array 200 shown in FIG. 2, where no partitioning of the CIM array 200 is implemented and only a single decoder 208 is utilized for writing into the entire CIM array 200, only one row of the CIM array 200 can be enabled at any given time such that the data to be written into the computing cells in that row can be provided on the BL/BLb bit-line pairs. The CIM array 300 shown in FIG. 4, on the other hand, provides separate bit-line pairs for each bank. Therefore, rows of computing cells in multiple banks can be simultaneously enabled for writing.
[0085] FIG. 5 shows an example timing diagram 500 depicting the word line operation in various banks of the CIM array 300 shown in FIG. 4. In particular, FIG. 5 shows that word lines of various banks of the CIM array 300 can be enabled simultaneously. For example, at time t1, the word lines of row-0 of each of the first bank of CIM array 304, the second bank of CIM array 306, the third bank of CIM array 308, and the fourth bank of CIM array 310 can be enabled simultaneously or in parallel. That is, at time t1, the controller can control the first row decoder, the second row decoder, the third row decoder, and the fourth row decoder to write data into any row of computing cells 302 in each of the banks simultaneously or in parallel. As each bank has separate bit-line pairs, the computing cells 302 in row-0 of the first bank of CIM array 304 can be written into using the first plurality of bit-line pairs BL0/BLb0[0] to BL0/BLb0[127], while in parallel data is written into the computing cells of row-0 of the second bank of CIM array 306 using the second plurality of bit-line pairs BL1/BLb1[0] to BL1/BLb1[127]. It should be noted that while FIG. 5 shows that the same row (row-0) in each of the banks is written into at the same time, this is only an example. The controller and the row decoders associated with each bank can write into a row of computing cells 302 within a bank in parallel with writing into any row of computing cells 302 in another bank. That is, the controller and the row decoders can write into, for example, row-1 in the fourth bank while simultaneously writing into row-56 in the second bank. Further, the sequence with which the controller and the row decoders enable the rows of computing cells 302 (row-0, row-1, . . . , row-143) shown in FIG. 5 is only an example. It is not necessary that the rows of computing cells be enabled in increasing index order. Any row in a bank may follow any other row in the bank in the sequence of enabling the rows.
[0086] The simultaneous or parallel writing of data into the banks can make loading data into the computing cells 302 of the CIM array 300 substantially faster. For example, if the CIM array 300 shown in FIG. 4 were not partitioned, it would need 576 cycles (one cycle for each of the 576 rows of computing cells 302) to write data into all the computing cells 302 of the CIM array 300. With the partitioning into four banks, it would need only 144 cycles (one-fourth the cycles) to write data into all the computing cells 302 of the CIM array 300.

[0087] The partitioning of the CIM array 300 can also help in increasing the write frequency of the CIM array 300. Write frequency can refer to the frequency at which the row decoder can advance from enabling one row to the next. Referring to FIG. 5, the write period can refer to the duration between t1 and t2, and the write frequency can be the reciprocal of the write period. The write period or write duration is a function, in part, of the time it takes for voltages on the bit-line pairs to reach the desired values. Referring, for example, to FIG. 2, each bit-line pair BL/BLb is driven by a BL driver 206. The BL driver 206 can drive the bit-line pairs BL/BLb to a high (e.g., Vdd) or a low (e.g., GND) voltage. The time it takes for the BL driver 206 to drive the bit-line pairs to the desired voltage can be a function, in part, of the electric load (e.g., capacitive, resistive, and inductive load) on the bit-lines. The higher the loading, the greater the time needed to drive the bit-line to its desired voltage. The length of the bit-line is one contributor to the loading of the bit-line. Another contributor is the number of devices coupled with the bit-line. As shown in FIG. 3, each bit-line is coupled with at least one transistor 226a of the six-transistor bit cell portion 222. The terminal capacitance (e.g., the source terminal capacitance) of the transistor 226a can contribute to the loading of the bit-line.
Each bit-line in the example shown in FIG. 2 can be coupled with the transistor 226a of each row (e.g., 1152 rows) of the CIM array 200. Therefore, the terminal capacitance of each of the transistors 226a can further contribute to the loading of each bit-line.
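The cycle counts quoted in paragraph [0086] can be checked with a small calculation (illustrative only, not part of the disclosure):

```python
# Cycles needed to write every row when equal-sized banks with
# separate bit-lines are written in parallel: the cycle count is set
# by the number of rows per bank.
def write_cycles(total_rows: int, num_banks: int) -> int:
    return total_rows // num_banks

assert write_cycles(576, 1) == 576  # unpartitioned: one row per cycle
assert write_cycles(576, 4) == 144  # four banks written in parallel
```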
[0088] Partitioning of the CIM array 300 into banks can help reduce the loading of the bit-line and therefore reduce the time needed to drive the bit-line to the desired voltage. Partitioning the CIM array 300 into banks limits the number of devices that are coupled with each bit-line, as each bit-line in the CIM array 300 is coupled with only the computing cells within one bank. For example, each bit-line in the bit-line pair BL0/BLb0[0] is coupled with the computing cells 302 in column-0 of the first bank of CIM array 304 only, those computing cells 302 being one-fourth of the total number of computing cells 302 in column-0 of the entire CIM array 300. This reduction in the number of computing cells coupled with the bit-lines reduces the loading on the bit-line, and therefore reduces the amount of time needed by the BL driver to charge/discharge the bit-line to the desired voltage level. The reduced loading also reduces energy consumption by reducing switching losses.
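The loading argument can be illustrated with a rough first-order RC model. All constants below are assumptions chosen for illustration, not values from the disclosure:

```python
# First-order bit-line settling model: the total load is the per-cell
# access-transistor capacitance times the number of cells plus a fixed
# wire capacitance, and the 10%-90% rise time of a single-pole RC
# network is approximately 2.2 * R * C.
def bitline_rise_time(cells_on_bitline: int,
                      c_cell_f: float = 0.2e-15,   # assumed per-cell load
                      c_wire_f: float = 20e-15,    # assumed wire load
                      r_driver_ohm: float = 5e3) -> float:
    c_total = cells_on_bitline * c_cell_f + c_wire_f
    return 2.2 * r_driver_ohm * c_total

# A bit-line confined to one 144-row bank settles faster than one
# spanning all 576 rows of the array.
t_bank = bitline_rise_time(144)
t_full = bitline_rise_time(576)
```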
[0089] In some examples, repeaters or buffers can be positioned in the paths of one or more bit-line pairs. These repeaters or buffers can help alleviate the resistor-capacitor (RC) delay caused by the long length of some of the bit-lines. For example, the bit-line pairs for the first bank of CIM array 304 traverse the second, third, and fourth banks of the CIM array 300, the bit-line pairs for the second bank of CIM array 306 traverse the third and fourth banks of the CIM array 300, and the bit-line pairs for the third bank of CIM array 308 traverse the fourth bank of CIM array 310. The long length of the bit-line pairs can introduce undesirable delay in the signal propagation along the bit-line pairs, which delay can be reduced by introducing repeaters or buffers along the path of the bit-lines. However, the repeaters and the buffers are unidirectional, whereas the bit-line pairs are bidirectional in the sense that the bit-line pairs can be used to read from and write into the computing cells 302. For example, referring to FIG. 3, the bit-line pairs BL and BLb can be used to write data into the six-transistor bit cell portion 222 or read data from the six-transistor bit cell portion 222 of the computing cell 202. The bit-line pairs BL and BLb carry differential or complementary signals. If the bit-lines BL and BLb were employed only to read or only to write, then unidirectional repeaters or buffers could be placed on each bit-line. However, the unidirectional repeaters or buffers would not allow the bidirectional read and write functionality of the bit-lines. In one approach to addressing the bidirectionality of the bit-line pairs while at the same time accommodating unidirectional repeaters or buffers, the bit-line pairs can be viewed as having a read interconnect and a write interconnect, where the read interconnect carries only read signals from the computing cells and the write interconnect carries only data to be written into the computing cells.
As the read interconnect and the write interconnect are inherently unidirectional, repeaters or buffers can be positioned on these interconnects in the opposite directions.
[0090] FIG. 6 shows an example CIM array 600 including repeaters for bit-line pairs. The CIM array 600 is similar to the CIM array 300 discussed above in relation to FIG. 4, in that the CIM array 600 also includes a plurality of banks of rows of computing cells. However, the CIM array 600 includes bit-line pairs, each of which includes a read interconnect and a write interconnect, which enables the positioning of unidirectional repeaters or buffers on the bit-lines. FIG. 6 shows a portion of the CIM array 600 that includes computing cells 302 in column-0. The CIM array 600 includes a first bit-line pair (BL0/BLb0[0]) coupled with the computing cells 302 in the first bank of CIM array 304. It should be noted that while FIG. 6 shows only a portion of the CIM array 600, the CIM array 600 can include a first plurality of bit-line pairs corresponding to other columns of the CIM array 600, where the remainder of the first plurality of bit-line pairs are coupled with computing cells 302 in respective columns of the first bank of CIM array 304. Each bit-line pair of the first plurality of bit-line pairs, and in particular the first bit-line pair (BL0/BLb0[0]), includes a first read interconnect 602 and a first write interconnect 604. The first read interconnect 602 and the first write interconnect 604 are coupled with the computing cells 302 in the first bank of CIM array 304, albeit indirectly via a series of repeaters, a differential driver, and a sense amplifier. It should be noted that the number of repeater circuits positioned on the first read interconnect 602 or the first write interconnect 604 can be different from that shown in FIG. 6. That is, while FIG. 6 shows three repeater circuits on each of the first read interconnect 602 and the first write interconnect 604, this number is only an example, and additional or fewer repeater circuits can be employed.
[0091] The first plurality of bit-line pairs, including the first bit-line pair (BL0/BLb0[0]), traverse across the fourth bank of CIM array 310, the third bank of CIM array 308, and the second bank of CIM array 306. In particular, the first plurality of bit-line pairs are not coupled with the computing cells 302 in the traversed banks. For example, the first read interconnect 602 and the first write interconnect 604 shown in FIG. 6 traverse the fourth, third and the second banks of the CIMA 600 and are not coupled with any computing cells 302 within these banks.
[0092] A plurality of repeater circuits can be positioned on the first plurality of bit-line pairs. For example, a first read repeater circuit 606 is positioned on the first read interconnect 602 between the second bank of CIM array 306 and the third bank of CIM array 308, a second read repeater circuit 610 is positioned on the first read interconnect 602 between the third bank of CIM array 308 and the fourth bank of CIM array 310, and a third read repeater circuit 614 is positioned after the fourth bank of CIM array 310 (the read repeater circuits can also be referred to as unidirectional read repeaters). Similarly, a first write repeater circuit 616 is positioned on the first write interconnect 604 on one side of the fourth bank of CIM array 310, a second write repeater circuit 612 is positioned on the first write interconnect 604 between the fourth bank of CIM array 310 and the third bank of CIM array 308, and a third write repeater circuit 608 is positioned on the first write interconnect 604 between the third bank of CIM array 308 and the second bank of CIM array 306 (the write repeater circuits can also be referred to as unidirectional write repeaters).
[0093] A first bank differential driver 660 is coupled with the first bit-line pair (the first read interconnect 602 and the first write interconnect 604) and can be used to convert the signals between the first bit-line pair and the computing cells 302. The first bank differential driver 660 is positioned between the first bank of CIM array 304 and the second bank of CIM array 306. In the first bank of CIM array 304, the computing cells 302 are coupled with the BL and BLb differential or complementary bit-lines. A sense amplifier 618 receives the BL and BLb bit-lines as input and generates a read signal that is provided to the first read interconnect 602. The sense amplifier 618 can be used to detect very small differences in voltage at the bit-lines BL and BLb and amplify the difference to its full voltage swing (e.g., Vdd/GND) to identify the data value stored in the six-transistor bit cell portion 222 of the computing cells 302. In some examples, the sense amplifier 618 can be similar to sense amplifiers used in static random-access memories (SRAMs). These sense amplifiers can include, for example, voltage-mode sense amplifiers or current-mode sense amplifiers.
[0094] The first write interconnect 604 is coupled with a differential driver that converts the non-differential signal on the first write interconnect 604 into differential or complementary voltages that are then provided to the BL and BLb differential bit-lines. The differential driver at the least can include an inverter 620 coupled with the first write interconnect 604. The output of the inverter 620 is fed to one of the pair (BL/BLb) of differential bit-lines. The other of the pair of differential bit-lines receives the signal on the first write interconnect 604. In some examples, such as that shown in FIG. 6, additional buffers 622 can be included to strengthen the signal provided to the BL and BLb differential bit-lines and to equalize timing delays. Thus, a write signal on the first write interconnect 604 is fed to the BLb bit-line, while an inverted signal from the inverter 620 is fed to the BL bit-line, thus providing a differential signal to the bit-lines BL and BLb. In some examples, the BL and BLb buffers 622 can be tristate drivers. That is, the outputs of the buffers 622 can be pulled to a high impedance state, which can be useful, for instance, during read phases, where data is instead driven on the bit lines by the computing cells 302. During write phases, on the other hand, the buffers 622 can be operated to output high or low voltage signals based on the data to be written. The buffers 622 can include control inputs that can receive signals which can enable/disable the high-impedance state at the outputs of the buffers 622. The control signals can be received, for example, from the controller that can control the operation of the CIM array. In some examples, the controller can be the SRAM control block discussed above in relation to the CIM array 200 shown in FIG. 2.
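The write path of the bank differential driver, including the tristate behavior of the buffers 622, can be modeled behaviorally as follows (a sketch under assumed naming, not part of the disclosure; `None` models the high-impedance state):

```python
from typing import Optional

# Tristate write-driver model: during writes the driver places the
# write bit on BLb and its inverse on BL (modeling the inverter 620
# path); during reads both outputs float so that the computing cells
# can drive the bit-lines instead.
def drive_write(write_bit: int, write_enable: bool) -> tuple[Optional[int], Optional[int]]:
    """Return (BL, BLb); None models the high-impedance state."""
    if not write_enable:
        return (None, None)       # tristated during read phases
    return (1 - write_bit, write_bit)
```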
[0095] The CIM array 600 includes a second bit-line pair (BL1/BLb1[0]) coupled with the computing cells 302 in the second bank of CIM array 306. Similar to the first bit-line pair (BL0/BLb0[0]), the second bit-line pair (BL1/BLb1[0]) includes a second read interconnect 624 and a second write interconnect 626. Repeater circuits can be positioned on the second read interconnect 624 and the second write interconnect 626. For example, on the second read interconnect 624, a first read repeater circuit 628 is positioned between the third bank of CIM array 308 and the fourth bank of CIM array 310, and a second read repeater circuit 630 is positioned on the other side of the fourth bank of CIM array 310. On the second write interconnect 626, a first write repeater circuit 632 is positioned on one side of the fourth bank of CIM array 310 and a second write repeater circuit 634 is positioned between the third bank of CIM array 308 and the fourth bank of CIM array 310. It should be noted that the number of and the positions of the repeater circuits shown in FIG. 6 are only examples, and that fewer or additional repeater circuits could be used and can be positioned differently than that shown in FIG. 6.
[0096] A second bank differential driver 662 is coupled with the second bit-line pair (including the second read interconnect 624 and the second write interconnect 626) and positioned between the second bank of CIM array 306 and the third bank of CIM array 308. The second bank differential driver 662 includes a sense amplifier 636 that converts the differential signal on the BL and BLb bit-lines coupled with the computing cells 302 into a non-differential signal and provides the non-differential signal on the second read interconnect 624. The second bank differential driver 662 also includes at least an inverter 638 that inverts the non-differential signal on the second write interconnect 626 and provides the inverted signal to one of the bit-lines BL and BLb. The other of the bit-lines BL and BLb receives the non-inverted signal on the second write interconnect 626. One or more buffers 622 can be included to buffer the inverted signal and the non-inverted signal to strengthen the signals and to correct for any timing mismatch, and such buffers can be tristate drivers to enable bidirectional data to/from the compute cells on the bit lines.
[0097] The CIM array 600 includes a third bit-line pair (BL2/BLb2[0]) coupled with the computing cells 302 in the third bank of CIM array 308. The third bit-line pair (BL2/BLb2[0]) can include a third read interconnect 640 and a third write interconnect 642. A read repeater circuit 644 can be positioned on the third read interconnect 640 on one side of the fourth bank of CIM array 310 and a write repeater circuit 646 can be positioned on the third write interconnect 642 on the same side of the fourth bank of CIM array 310. In some examples, the third bit-line pair (BL2/BLb2[0]) may include fewer or additional repeater circuits than those shown in FIG. 6. A third bank differential driver 664 interfaces between the third read interconnect 640/third write interconnect 642 and the differential bit-line pairs BL/BLb of the computing cells 302 of the third bank of CIM array 308. The third bank differential driver 664 can include at least a sense amplifier 648 and an inverter 650. The sense amplifier 648 converts the differential signal on the bit-lines BL and BLb into a non-differential signal that is provided to the third read interconnect 640. The inverter 650 can invert the non-differential signal on the third write interconnect 642 and provide the inverted signal to one of the bit-line pair BL/BLb, and the non-differential signal on the third write interconnect 642 is provided to the other of the bit-line pair BL/BLb. The third bank differential driver 664 can also include additional buffers 622 to buffer the inverted signal and the non-inverted signal provided to the bit-lines BL/BLb to strengthen the signals and to correct for any timing mismatch.

[0098] The CIM array 600 includes a fourth bit-line pair (BL3/BLb3[0]) coupled with the computing cells 302 in the fourth bank of CIM array 310. The fourth bit-line pair (BL3/BLb3[0]) can include a fourth read interconnect 652 and a fourth write interconnect 654.
As the fourth read interconnect 652 and the fourth write interconnect 654 do not traverse any other bank, in some examples, no repeater circuits may be needed prior to interfacing the interconnects with a fourth bank differential driver 666. The fourth bank differential driver 666 can include at least an inverter 656 that can provide an inverted signal of the non-differential signal on the fourth write interconnect 654 and can include a sense amplifier 658, which can convert the differential signal on the bit-line pair BL and BLb into a non-differential signal to be provided to the fourth read interconnect 652. The unidirectional repeater circuits discussed above can include buffer circuits such as, for example, an even number of inverters coupled in series.
[0099] FIG. 7 shows a portion 700 of the CIM array 300 including word line repeaters. In particular, FIG. 7 shows only the first bank of CIM array 304 for simplicity. Aspects discussed in relation to the first bank of CIM array 304 can be applied equally to other banks of the CIM array 300. A plurality of word-line interconnect repeaters 706 can be positioned between a first sub-bank 702 and a second sub-bank 704 of the first bank of CIM array 304. While not shown in FIG. 7, the first bank of CIM array 304 can include additional sub-banks, and word-line interconnect repeaters, similar to the plurality of word-line interconnect repeaters 706, can be positioned between those sub-banks. The first sub-bank 702 can include a first plurality of columns of computing cells 302 within the first bank of CIM array 304, and the second sub-bank 704 can include a second plurality of columns of computing cells 302 within the first bank of CIM array 304. The plurality of word line interconnects WL0[0] to WL0[143] generally run across each row of the computing cells 302 from the first column to the last column of the first bank of CIM array 304. Depending upon the number of columns, the length of each of the word line interconnects can be long enough to undesirably load the word-line drivers (e.g., the decoder 208 shown in FIG. 2). This loading can increase the delay of the word line signal carried by the word line interconnects. One approach to mitigate this delay is to insert repeaters along the word line interconnects. For example, the plurality of word-line interconnect repeaters 706 are positioned on the word line interconnects between the first sub-bank 702 and the second sub-bank 704. The plurality of word-line interconnect repeaters 706 can improve the strength of the signal on the word line interconnects in the second sub-bank 704 and beyond. In some instances, the number of columns in each sub-bank can be equal.
In some other examples, the number of columns in each sub-bank can be unequal. In some examples, the word-line interconnect repeaters can include buffers including, for example, an even number of inverters. In some examples, multiple word-line interconnect repeaters can be positioned on a word line interconnect. In some instances, each of the word line interconnect repeaters on a single word line can be of the same size. Here, size refers to the sizes of the transistors utilized for implementing the repeaters; larger transistors can provide faster repeaters. In some other instances, at least one word line interconnect repeater on a word line can have a size that is different from another word line interconnect repeater on the same word line.
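The delay benefit of word-line repeater insertion can be illustrated with a first-order Elmore (RC) delay model. The sketch below uses assumed per-segment resistance and capacitance values and an assumed repeater delay — none of these numbers come from this disclosure — but it shows why splitting a long word line into repeated sub-bank segments helps: the distributed-wire delay term grows quadratically with length, while a repeated wire pays that quadratic cost only per segment.

```python
# First-order Elmore-delay illustration of word-line repeater insertion.
# All RC values and the repeater delay are assumed, arbitrary units.

def wire_delay(n_segments: int, r_seg: float, c_seg: float) -> float:
    """Elmore delay of a distributed RC line of n_segments unit segments:
    approximately 0.5 * R_total * C_total, i.e., quadratic in length."""
    return 0.5 * (n_segments * r_seg) * (n_segments * c_seg)

def repeated_delay(n_segments: int, r_seg: float, c_seg: float,
                   k: int, t_rep: float) -> float:
    """The same wire split into k equal pieces with a repeater
    (intrinsic delay t_rep) between consecutive pieces."""
    per_piece = wire_delay(n_segments // k, r_seg, c_seg)
    return k * per_piece + (k - 1) * t_rep

# Example: a 144-column word line, one repeater at the sub-bank boundary.
unbuffered = wire_delay(144, r_seg=1.0, c_seg=1.0)          # 0.5 * 144 * 144
buffered = repeated_delay(144, 1.0, 1.0, k=2, t_rep=50.0)   # two 72-column halves
assert buffered < unbuffered
```

With these assumed values, halving the wire roughly halves the wire-delay term (two quarter-size quadratic costs plus one repeater delay), which is consistent with the placement of the repeaters 706 at the sub-bank boundary. Choosing the number and size of repeaters trades the added intrinsic repeater delay against the reduced wire delay, echoing the sizing discussion above.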
ASPECTS OF THE DISCLOSURE
[0100] The present disclosure will be better understood upon reading the following numbered aspects, which should not be confused with the claims. Each of the numbered aspects described below can, in some instances, be combined with aspects described elsewhere in the disclosure. The following listing of example aspects is supported by the disclosure provided herein.
[0101] Aspect 1. An in-memory computing architecture, including: a compute-in-memory (CIM) array of computing cells, the CIM array including a plurality of rows and a plurality of columns, each computing cell including a memory cell and an output port for providing a result of computation; a first bank of CIM array, from a plurality of banks of the CIM array, including a first set of rows of computing cells; a first row decoder coupled with the first bank of CIM array, the first row decoder configured to, one at a time, enable any of the first set of rows of computing cells to write data in respective memory cells; a second bank of CIM array, from the plurality of banks of the CIM array, including a second set of rows of computing cells; a second row decoder coupled with the second bank of CIM array, the second row decoder configured to enable, one at a time, any of the second set of rows of computing cells to write data in respective memory cells; a controller configured to control the first row decoder and the second row decoder to write data into any of the first set of rows of computing cells simultaneously with writing data into any of the second set of rows of computing cells; and a plurality of column interconnects, each column interconnect of the plurality of column interconnects coupled with output ports of computing cells in a respective column of the CIM array.
[0102] Aspect 2. The in-memory computing architecture of any one of Aspects 1-19, further including: a first plurality of bit-line pairs, each bit-line pair including a first read interconnect and a first write interconnect, wherein each bit-line pair of the first plurality of bit-line pairs is coupled with computing cells in respective columns of the first bank of CIM array; and a second plurality of bit-line pairs, each bit-line pair including a second read interconnect and a second write interconnect, wherein each bit-line pair of the second plurality of bit-line pairs is coupled with computing cells in respective columns of the second bank of CIM array.
[0103] Aspect 3. The in-memory computing architecture of any one of Aspects 1-19, further including a third bank of CIM array, the second bank of CIM array positioned between the first bank of CIM array and the third bank of CIM array, wherein the first plurality of bit-line pairs traverse across the second bank of CIM array and the third bank of CIM array and are coupled with the computing cells in the respective columns of the first bank of CIM array.
[0104] Aspect 4. The in-memory computing architecture of any one of Aspects 1-19, further including: a plurality of repeater circuits positioned on the first plurality of bit-line pairs between the second bank of CIM array and the third bank of CIM array, wherein each repeater circuit of the plurality of repeater circuits includes a unidirectional read repeater and a unidirectional write repeater coupled with a first read interconnect and a first write interconnect, respectively, of one of the first plurality of bit-line pairs.
[0105] Aspect 5. The in-memory computing architecture of any one of Aspects 1-19, further including: a plurality of first bank differential drivers coupled with the first plurality of bit-line pairs, each first bank differential driver of the plurality of first bank differential drivers including: a sense amplifier that receives differential signals from a respective computing cell and outputs a non-differential signal to the respective first read interconnect, and a differential driver that receives a non-differential signal from the respective first write interconnect and outputs a differential signal to the respective computing cell.
[0106] Aspect 6. The in-memory computing architecture of any one of Aspects 1-19, wherein the differential driver includes an inverter for inverting the non-differential signal received from the respective first write interconnect, wherein the differential signal includes an output of the inverter.
[0107] Aspect 7. The in-memory computing architecture of any one of Aspects 1-19, wherein the differential driver includes at least one buffer in the path of the differential signal.
[0108] Aspect 8. The in-memory computing architecture of any one of Aspects 1-19, further including: a first set of word-line interconnects corresponding to the first set of rows of computing cells of the first bank of CIM array, wherein each word-line interconnect of the first set of word-line interconnects is coupled with computing cells in a respective row of computing cells, and a plurality of word-line interconnect repeaters positioned on the first set of word-line interconnects.
[0109] Aspect 9. The in-memory computing architecture of any one of Aspects 1-19, wherein the first bank of CIM arrays includes at least a first sub-bank and a second sub-bank, where the first sub-bank includes a first set of columns of computing cells and the second sub-bank includes a second set of columns of computing cells, wherein the plurality of word-line interconnect repeaters are positioned between the first sub-bank and the second sub-bank.
[0110] Aspect 10. The in-memory computing architecture of any one of Aspects 1-19, further including a plurality of analog to digital converters (ADCs) coupled with one or more of the plurality of column interconnects, the ADCs configured to convert analog voltages on the one or more of the plurality of column interconnects into a digital value.
[0111] Aspect 11. The in-memory computing architecture of any one of Aspects 1-19, further including: a plurality of compute in-memory units, each compute in-memory unit including the CIM array, and an on-chip network providing communication between the plurality of compute in-memory units.
[0112] Aspect 12. The in-memory computing architecture of any one of Aspects 1-19, further including: a plurality of repeater circuits positioned on the first plurality of bit-line pairs between the second bank of CIM array and the third bank of CIM array, wherein each repeater circuit of the plurality of repeater circuits includes a unidirectional read repeater and a unidirectional write repeater coupled with a first read interconnect and a first write interconnect, respectively, of one of the first plurality of bit-line pairs.

[0113] Aspect 13. The in-memory computing architecture of any one of Aspects 1-19, further including: a plurality of first bank differential drivers coupled with the first plurality of bit-line pairs, each first bank differential driver of the plurality of first bank differential drivers including: a sense amplifier that receives differential signals from a respective computing cell and outputs a non-differential signal to the respective first read interconnect, and a differential driver that receives a non-differential signal from the respective first write interconnect and outputs a differential signal to the respective computing cell.
[0114] Aspect 14. The in-memory computing architecture of any one of Aspects 1-19, wherein the differential driver includes an inverter for inverting the non-differential signal received from the respective first write interconnect, wherein the differential signal includes an output of the inverter.
[0115] Aspect 15. The in-memory computing architecture of any one of Aspects 1-19, wherein the differential driver includes at least one buffer in the path of the differential signal.
[0116] Aspect 16. The in-memory computing architecture of any one of Aspects 1-19, further including: a first set of word-line interconnects corresponding to the first set of rows of computing cells of the first bank of CIM array, wherein each word-line interconnect of the first set of word-line interconnects is coupled with computing cells in a respective row of computing cells, and a plurality of word-line interconnect repeaters positioned on the first set of word-line interconnects.
[0117] Aspect 17. The in-memory computing architecture of any one of Aspects 1-19, wherein the first bank of CIM arrays includes at least a first sub-bank and a second sub-bank, where the first sub-bank includes a first set of columns of computing cells and the second sub-bank includes a second set of columns of computing cells, wherein the plurality of word-line interconnect repeaters are positioned between the first sub-bank and the second sub-bank.
[0118] Aspect 18. The in-memory computing architecture of any one of Aspects 1-19, further including a plurality of analog to digital converters (ADCs) coupled with one or more of the plurality of column interconnects, the ADCs configured to convert analog voltages on the one or more of the plurality of column interconnects into a digital value.

[0119] Aspect 19. The in-memory computing architecture of any one of Aspects 1-18, further including: a plurality of compute in-memory units, each compute in-memory unit including the CIM array, and an on-chip network providing communication between the plurality of compute in-memory units.
[0120] Aspect 20. A method for in-memory computing using a compute-in-memory (CIM) architecture, the CIM architecture including: a CIM array of computing cells including a plurality of rows and a plurality of columns, each computing cell including a memory cell; first and second banks of the CIM array, from a plurality of banks of the CIM array, respectively including first and second sets of rows of computing cells; first and second row decoders respectively coupled with the first and second banks of the CIM array, and a plurality of column interconnects, each column interconnect of the plurality of column interconnects coupled with output ports of computing cells in a respective column of the CIM array, the method including: enabling, by the first row decoder, any of the first set of rows of computing cells to write, one at a time, data in respective memory cells; enabling, by the second row decoder, any of the second set of rows of computing cells to write, one at a time, data in respective memory cells; controlling, by a controller, the first row decoder and the second row decoder to write data into any of the first set of rows of computing cells simultaneously with writing data into any of the second set of rows of computing cells; and providing a result of computation at an output port of one or more of the computing cells from the CIM array of computing cells.
[0121] Aspect 21. The method of any one of Aspects 20-22, wherein the CIM architecture further includes: a first plurality of bit-line pairs, each bit-line pair including a first read interconnect and a first write interconnect, wherein each bit-line pair of the first plurality of bit-line pairs is coupled with computing cells in respective columns of the first bank of CIM array; and a second plurality of bit-line pairs, each bit-line pair including a second read interconnect and a second write interconnect, wherein each bit-line pair of the second plurality of bit-line pairs is coupled with computing cells in respective columns of the second bank of CIM array.
[0122] Aspect 22. The method of any one of Aspects 20-21, wherein the CIM architecture further includes: a third bank of CIM array, the second bank of CIM array positioned between the first bank of CIM array and the third bank of CIM array, wherein the first plurality of bit-line pairs traverse across the second bank of CIM array and the third bank of CIM array and are coupled with the computing cells in the respective columns of the first bank of CIM array.
[0123] The examples disclosed herein are illustrative and not limiting in nature. Details disclosed with respect to the methods described herein in one example or embodiment can be applied to other examples and aspects. Any aspect of the present disclosure that has been described herein may be disclaimed, i.e., excluded from the claimed subject matter, whether by proviso or otherwise.
[0124] Various modifications to the implementations described in this disclosure may be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other implementations without departing from the spirit or scope of this disclosure. Thus, the claims are not intended to be limited to the implementations shown herein, but are to be accorded the widest scope consistent with this disclosure, the principles, and the novel features disclosed herein.


CLAIMS

What is claimed is:
1. An in-memory computing architecture, comprising: a compute-in-memory (CIM) array of computing cells, the CIM array comprising a plurality of rows and a plurality of columns, each computing cell including a memory cell and an output port for providing a result of computation; a first bank of CIM array, from a plurality of banks of the CIM array, including a first set of rows of computing cells; a first row decoder coupled with the first bank of CIM array, the first row decoder configured to, one at a time, enable any of the first set of rows of computing cells to write data in respective memory cells; a second bank of CIM array, from the plurality of banks of the CIM array, including a second set of rows of computing cells; a second row decoder coupled with the second bank of CIM array, the second row decoder configured to enable, one at a time, any of the second set of rows of computing cells to write data in respective memory cells; a controller configured to control the first row decoder and the second row decoder to write data into any of the first set of rows of computing cells simultaneously with writing data into any of the second set of rows of computing cells; and a plurality of column interconnects, each column interconnect of the plurality of column interconnects coupled with output ports of computing cells in a respective column of the CIM array.
2. The in-memory computing architecture of claim 1, further comprising: a first plurality of bit-line pairs, each bit-line pair including a first read interconnect and a first write interconnect, wherein each bit-line pair of the first plurality of bit-line pairs is coupled with computing cells in respective columns of the first bank of CIM array; and a second plurality of bit-line pairs, each bit-line pair including a second read interconnect and a second write interconnect, wherein each bit-line pair of the second plurality of bit-line pairs is coupled with computing cells in respective columns of the second bank of CIM array.
3. The in-memory computing architecture of claim 2, further comprising a third bank of CIM array, the second bank of CIM array positioned between the first bank of CIM array and the third bank of CIM array, wherein the first plurality of bit-line pairs traverse across the second bank of CIM array and the third bank of CIM array and are coupled with the computing cells in the respective columns of the first bank of CIM array.
4. The in-memory computing architecture of claim 3, further comprising: a plurality of repeater circuits positioned on the first plurality of bit-line pairs between the second bank of CIM array and the third bank of CIM array, wherein each repeater circuit of the plurality of repeater circuits includes a unidirectional read repeater and a unidirectional write repeater coupled with a first read interconnect and a first write interconnect, respectively, of one of the first plurality of bit-line pairs.
5. The in-memory computing architecture of claim 3, further comprising: a plurality of first bank differential drivers coupled with the first plurality of bit- line pairs, each first bank differential driver of the plurality of first bank differential drivers including: a sense amplifier that receives differential signals from a respective computing cell and outputs a non-differential signal to the respective first read interconnect, and a differential driver that receives a non-differential signal from the respective first write interconnect and outputs a differential signal to the respective computing cell.
6. The in-memory computing architecture of claim 5, wherein the differential driver includes an inverter for inverting the non-differential signal received from the respective first write interconnect, wherein the differential signal includes an output of the inverter.
7. The in-memory computing architecture of claim 5, wherein the differential driver includes at least one buffer in the path of the differential signal.
8. The in-memory computing architecture of claim 1, further comprising: a first set of word-line interconnects corresponding to the first set of rows of computing cells of the first bank of CIM array, wherein each word-line interconnect of the first set of word-line interconnects is coupled with computing cells in a respective row of computing cells, and a plurality of word-line interconnect repeaters positioned on the first set of word-line interconnects.
9. The in-memory computing architecture of claim 8, wherein the first bank of CIM arrays includes at least a first sub-bank and a second sub-bank where the first sub-bank includes a first set of columns of computing cells and the second sub-bank includes a second set of columns of computing cells, wherein the plurality of word-line interconnect repeaters are positioned between the first sub-bank and the second sub-bank.
10. The in-memory computing architecture of claim 1, further comprising a plurality of analog to digital converters (ADCs) coupled with one or more of the plurality of column interconnects, the ADCs configured to convert analog voltages on the one or more of the plurality of column interconnects into a digital value.
11. The in-memory computing architecture of claim 1, further comprising: a plurality of compute in-memory units, each compute in-memory unit including the CIM array, and an on-chip network providing communication between the plurality of compute in-memory units.
12. The in-memory computing architecture of claim 3, further comprising: a plurality of repeater circuits positioned on the first plurality of bit-line pairs between the second bank of CIM array and the third bank of CIM array, wherein each repeater circuit of the plurality of repeater circuits includes a unidirectional read repeater and a unidirectional write repeater coupled with a first read interconnect and a first write interconnect, respectively, of one of the first plurality of bit-line pairs.
13. The in-memory computing architecture of claim 3, further comprising: a plurality of first bank differential drivers coupled with the first plurality of bit-line pairs, each first bank differential driver of the plurality of first bank differential drivers including: a sense amplifier that receives differential signals from a respective computing cell and outputs a non-differential signal to the respective first read interconnect, and a differential driver that receives a non-differential signal from the respective first write interconnect and outputs a differential signal to the respective computing cell.
14. The in-memory computing architecture of claim 5, wherein the differential driver includes an inverter for inverting the non-differential signal received from the respective first write interconnect, wherein the differential signal includes an output of the inverter.
15. The in-memory computing architecture of claim 5, wherein the differential driver includes at least one buffer in the path of the differential signal.
16. The in-memory computing architecture of claim 1, further comprising: a first set of word-line interconnects corresponding to the first set of rows of computing cells of the first bank of CIM array, wherein each word-line interconnect of the first set of word-line interconnects is coupled with computing cells in a respective row of computing cells, and a plurality of word-line interconnect repeaters positioned on the first set of word-line interconnects.
17. The in-memory computing architecture of claim 8, wherein the first bank of CIM arrays includes at least a first sub-bank and a second sub-bank where the first sub-bank includes a first set of columns of computing cells and the second sub-bank includes a second set of columns of computing cells, wherein the plurality of word-line interconnect repeaters are positioned between the first sub-bank and the second sub-bank.
18. The in-memory computing architecture of claim 1, further comprising a plurality of analog to digital converters (ADCs) coupled with one or more of the plurality of column interconnects, the ADCs configured to convert analog voltages on the one or more of the plurality of column interconnects into a digital value.
19. The in-memory computing architecture of claim 1, further comprising: a plurality of compute in-memory units, each compute in-memory unit including the CIM array, and an on-chip network providing communication between the plurality of compute in-memory units.
20. A method for in-memory computing using a compute-in-memory (CIM) architecture, the CIM architecture comprising: a CIM array of computing cells including a plurality of rows and a plurality of columns, each computing cell including a memory cell; first and second banks of the CIM array, from a plurality of banks of the CIM array, respectively including first and second sets of rows of computing cells; first and second row decoders respectively coupled with the first and second banks of the CIM array, and a plurality of column interconnects, each column interconnect of the plurality of column interconnects coupled with output ports of computing cells in a respective column of the CIM array, the method comprising: enabling, by the first row decoder, any of the first set of rows of computing cells to write, one at a time, data in respective memory cells; enabling, by the second row decoder, any of the second set of rows of computing cells to write, one at a time, data in respective memory cells; controlling, by a controller, the first row decoder and the second row decoder to write data into any of the first set of rows of computing cells simultaneously with writing data into any of the second set of rows of computing cells; and providing a result of computation at an output port of one or more of the computing cells from the CIM array of computing cells.
21. The method of claim 20, wherein the CIM architecture further comprises: a first plurality of bit-line pairs, each bit-line pair including a first read interconnect and a first write interconnect, wherein each bit-line pair of the first plurality of bit-line pairs is coupled with computing cells in respective columns of the first bank of CIM array; and a second plurality of bit-line pairs, each bit-line pair including a second read interconnect and a second write interconnect, wherein each bit-line pair of the second plurality of bit-line pairs is coupled with computing cells in respective columns of the second bank of CIM array.
22. The method of claim 21, wherein the CIM architecture further comprises: a third bank of CIM array, the second bank of CIM array positioned between the first bank of CIM array and the third bank of CIM array, wherein the first plurality of bit-line pairs traverse across the second bank of CIM array and the third bank of CIM array and are coupled with the computing cells in the respective columns of the first bank of CIM array.
PCT/US2024/058415 2023-12-04 2024-12-04 Systems and methods for high-throughput data operations in in-memory computing arrays Pending WO2025122584A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363606020P 2023-12-04 2023-12-04
US63/606,020 2023-12-04

Publications (1)

Publication Number Publication Date
WO2025122584A1 true WO2025122584A1 (en) 2025-06-12

Family

ID=94128716


Country Status (2)

Country Link
TW (1) TW202531232A (en)
WO (1) WO2025122584A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130129046A (en) * 2012-05-17 2013-11-27 삼성전자주식회사 Magnetic memory device having magnetic memory cells and memory system including the same
US20210089272A1 (en) * 2019-09-25 2021-03-25 Purdue Research Foundation Ternary in-memory accelerator
WO2023091093A1 (en) * 2021-11-19 2023-05-25 Brillnics Singapore Pte. Ltd. Memory array including repeater buffer


Also Published As

Publication number Publication date
TW202531232A (en) 2025-08-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24829303

Country of ref document: EP

Kind code of ref document: A1