
WO2025064495A1 - Flexible tensor traversal unit - Google Patents

Flexible tensor traversal unit

Info

Publication number
WO2025064495A1
Authority
WO
WIPO (PCT)
Prior art keywords
tensor
memory
value
elements
address offset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/047195
Other languages
English (en)
Inventor
Hongil Yoon
Xiaoxiao LONG
Dinesh CHANDRASEKARAN
Jason Jong Kyu Park
Wenzhi Cui
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC filed Critical Google LLC
Publication of WO2025064495A1
Legal status: Pending

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003: Arrangements for executing specific machine instructions
    • G06F9/30007: Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036: Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/30025: Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • G06F9/3004: Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30043: LOAD or STORE instructions; Clear instruction
    • G06F9/34: Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes
    • G06F9/355: Indexed addressing

Definitions

  • a tensor traversal unit (TTU) is a specialized hardware component used to perform tensor operations in machine learning computations, e.g., neural network computations.
  • a tensor includes a single array or multiple arrays of numbers, and tensor operations are mathematical operations that manipulate tensors.
  • Tensor traversal units are typically integrated into hardware accelerators, such as graphics processing units (GPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs).
  • Neural networks are machine learning models that employ one or more layers of models to generate an output for a received input.
  • Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer of the network.
  • Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
  • This specification describes techniques relating to using one or more hardware counters and one or more hardware adders to determine memory addresses for tensor elements of an N-dimensional tensor.
  • one innovative aspect of the subject matter described in this specification can be embodied in a computer-implemented method comprising: obtaining one or more instructions to distribute tensor elements of an N-dimensional tensor among a destination memory unit, wherein the N-dimensional tensor has multiple tensor elements arranged across each of the N dimensions, and wherein N is an integer that is greater than or equal to one; determining an address offset value of a corresponding memory bank in the destination memory unit for each tensor element in the N-dimensional tensor by using one or more counters and one or more adders, wherein the address offset value for each tensor element is determined as a linear combination of partial address offset values counted by the one or more counters; and writing, in accordance with the determined address offset values, the tensor elements of the N-dimensional tensor to the corresponding memory banks of the destination memory unit.
  • the destination memory unit includes one or more cell groups that each include one or more memory cells that each include one or more memory banks, and wherein the instructions include: a first value that defines a number of contiguous tensor elements to be written to each memory cell, a second value that defines a step value to take between the memory cells when writing tensor elements to different memory cells, and a third value that defines a total number of the cell groups in the destination memory unit for storing tensor elements.
  • the determining the address offset value for each tensor element includes: determining, by dividing a total number of the tensor elements by the third value, a fourth value that defines a total number of tensor elements to be written to each cell group.
  • the partial address offset values include three values counted respectively by three counters arranged in series.
  • the three counters include a least significant digit counter that increments by one at each tensor element of the N-dimensional tensor, up to the first value.
  • the three counters include a second least significant digit counter that increments by one at each increment signal from the least significant digit counter, up to the third value.
  • the three counters include a third least significant digit counter that increments by one at each increment signal from the second least significant digit counter, up to the fourth value divided by the first value.
  • determining the address offset value for each tensor element in the N-dimensional tensor includes: computing, by using the one or more adders, a sum of (i) a partial address offset value counted by the least significant digit counter, (ii) a partial address offset value counted by the second least significant digit counter multiplied by the second value multiplied by a constant that corresponds to a number of memory banks included in a cell, and (iii) a partial address offset value counted by the third least significant digit counter multiplied by the first value.
  • obtaining the instructions to distribute tensor elements of the N-dimensional tensor among the memory unit includes: reading source tensor elements of a source tensor from a source memory unit; performing data type conversion on the source tensor elements; and storing the converted source tensor elements in a shifting buffer.
  • obtaining the instructions to distribute tensor elements of the N-dimensional tensor among the memory unit includes: obtaining a base memory address value for accessing the N-dimensional tensor; and storing the base memory address value in an address queue.
  • writing the tensor elements of the N-dimensional tensor to the corresponding memory banks of the memory unit includes, for each tensor element: using a set of multiplexers to write the tensor element to the corresponding memory bank that is identified by the determined address offset value for the tensor element.
  • the method further includes: obtaining one or more other instructions to distribute tensor elements of another N-dimensional tensor among the destination memory unit; and writing the tensor elements of the other N-dimensional tensor to memory banks of the destination memory unit while bypassing determining address offset values for the tensor elements of the other N-dimensional tensor using the one or more counters and the one or more adders.
  • the method further includes: outputting data indicating the determined address offset value for a particular tensor element of the N-dimensional tensor.
  • Other implementations of this and other aspects include corresponding systems and computer programs configured to perform the actions of the methods encoded on computer storage devices.
  • a system of one or more computers can be so configured by virtue of software, firmware, hardware, or a combination of them installed on the system that in operation causes the system to perform the actions.
  • One or more computer programs can be so configured by virtue of having instructions that, when executed by data processing apparatus, cause the apparatus to perform the actions.
  • a special purpose computational unit can quickly determine memory address values for accessing a storage medium.
  • the memory address values need not be contiguous in a destination storage medium. Nor need they be ordered in the destination storage medium in the same way as in a source storage medium from which the data was read.
  • the special purpose computational unit thus facilitates easier and faster distribution of data across a storage medium in accordance with flexible data distribution patterns that fit the needs of particular computation tasks. Determining memory address values using counters and adders reduces the number of computational cycles at a processor, the number of instructions that the processor is required to execute, or both, that would otherwise be required to write the data, and thus frees processor bandwidth for other computation tasks.
  • FIG. 1 is a block diagram of an example computing system.
  • FIG. 2 is an example diagram of moving a tensor from a first memory unit to a second memory unit.
  • FIG. 3 is an example diagram of distributing tensor elements of a tensor across a memory unit.
  • FIG. 4 is a flow diagram that illustrates an example process for distributing tensor elements of an N-dimensional tensor among a destination memory unit.
  • FIG. 5 is an example illustration of distributing tensor elements of an N-dimensional tensor among a destination memory unit.
  • FIG. 1 shows a block diagram of an example computing system 100.
  • the computing system 100 processes an input 101 to generate an output 116.
  • the computing system 100 may be configured to perform linear algebra computations.
  • the input 101 may include any suitable data, instructions, or both that can be processed by the computing system 100.
  • the computing system 100 includes a processing unit 102, a storage medium 104, and a tensor traversal unit 106.
  • the computing system 100 can use the tensor traversal unit 106 for traversing a tensor during the processing of the input 101 to generate the output 116.
  • An N-dimensional tensor may be a vector, a matrix, or a multi-dimensional matrix.
  • a 1-dimensional tensor is a vector
  • a 2-dimensional tensor is a matrix
  • a 3-dimensional tensor is a three-dimensional matrix made up of multiple two- dimensional matrices.
  • Each dimension of the N-dimensional tensor may include one or more tensor elements, where each tensor element may store a respective data value.
  • the processing unit 102 is configured to process instructions for execution within the computing system 100, including instructions 112 stored in the storage medium 104 or other instructions stored in another storage device and provided to the computing system 100 via a communication bus.
  • the processing unit 102 may include one or more processors.
  • the storage medium 104 stores information within the computing system 100.
  • the storage medium 104 is a volatile memory unit or units.
  • the storage medium 104 is a non-volatile memory unit or units.
  • the storage medium 104 may also be another form of computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.
  • the instructions when executed by the processing unit 102, cause the processing unit 102 to perform one or more computation tasks.
  • the tensor traversal unit 106 may be implemented as an application-specific integrated circuit.
  • the processing unit 102 executes an instruction for storing a particular tensor element of a tensor
  • the tensor traversal unit 106 determines, based on a tensor index of the particular tensor element in the tensor, an address offset value of the particular tensor element of the tensor, such that the processing unit 102 may use the address offset value to write data 114 representing the value of the particular tensor element to the storage medium 104.
  • the data 114 can include any intermediate data generated during the processing of the input 101 by the computing system 100 to generate the output 116, or can alternatively include the output 116.
  • the tensor traversal unit 106 can translate a set of N-dimensional tensor indices into a sequence of address offset values in a one-dimensional address space.
  • the tensor traversal unit can perform such translations by counting one or more partial address offset values based on the tensor element’s dimension indices, and then making a tensor element’s address offset value a linear combination of the counted partial address offset values.
  • the tensor traversal unit 106 can efficiently and programmatically generate a sequence of addresses which reference a sequence of tensor elements.
  • the address sequence corresponds to the sequence of tensor elements that would be accessed by one or more memory access instructions that specify, for example, a loop nest in a software traversal routine.
  • the sequence of tensor elements may or may not be stored in physically contiguous memory banks once they are written to memory.
  • FIG. 5, described below, provides an example of how a sequence of tensor elements can be written to memory banks that are not physically contiguous in the destination memory unit.
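For intuition, the classic translation from N-dimensional tensor indices into a one-dimensional address offset is a row-major linearization. The Python sketch below (with illustrative names, not taken from the specification) shows that baseline; the tensor traversal unit generalizes it with programmable counters and flexible distribution patterns.

    def linearize(indices, dims):
        # Row-major translation of N-dimensional tensor indices into a
        # one-dimensional address offset.
        offset = 0
        for index, dim in zip(indices, dims):
            offset = offset * dim + index
        return offset

    assert linearize((1, 2), (3, 4)) == 6  # element [1][2] of a 3x4 tensor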
  • the tensor traversal unit 106 includes a hardware counter unit 122 and a hardware adder unit 124.
  • the hardware counter unit 122 can include one or more counters. Each counter may include digital circuitry that is configured to perform counting operations. Each counter can count a partial address offset value, which is stored as a count value in a count register of the counter that is incremented each time an increment signal is received at the input up to a maximum value that is specified in a max-count register of the counter.
  • the output signal of a first counter, which may be referred to as the least significant digit counter, is provided to the input of a second counter, which may be referred to as the second least significant digit counter.
  • the output signal of the second counter is provided to the input of a third counter, which may be referred to as the third least significant digit counter, if one is present, and so on until the final counter in the chain, which may be referred to as the Nth least significant digit counter.
  • each hardware counter includes a count register (e.g., implemented as a group of flip-flops or latches) which is configured to store a value and increase the value stored therein each time an input signal is received.
  • each hardware counter also includes a max-count register (e.g., implemented as a group of flip-flops or latches) which is configured to receive and store a maximum value. An output signal of a hardware counter will be provided when the value stored in the count register reaches the maximum value, i.e., when the value stored in the count register would otherwise exceed the maximum value.
  • the count register can be configured to reset the value stored in the count register to a start value, such as zero.
  • a value stored in the count register of the least significant digit counter may increase by a predetermined amount, such as incremented by one, each time an input signal is received by the hardware counter unit 122.
  • a value stored in the count register of the second least significant digit counter will be incremented each time the least significant digit counter provides an output signal, that is, each time the value stored in the count register of the least significant digit counter reaches the maximum value stored in its max-count register, at which point the least significant digit counter resets its count register to a start value, such as zero.
  • the second least significant digit counter will then provide an output signal when the value in its count register reaches the maximum value stored in the max-count register of the second least significant digit counter.
  • a value stored in the count register of the third least significant digit counter will be incremented in the same manner, and so on, up to the Nth least significant digit counter, i.e., the final counter in the chain.
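The carry-chain behavior described above can be summarized with a short Python sketch; the class and function names are illustrative, not taken from the specification.

    class DigitCounter:
        # Models one hardware counter: a count register plus a max-count register.
        def __init__(self, max_count):
            self.max_count = max_count  # value held in the max-count register
            self.count = 0              # value held in the count register

        def tick(self):
            # Increment the count register; on reaching the maximum value,
            # reset to the start value (zero) and emit an output (carry) signal.
            self.count += 1
            if self.count == self.max_count:
                self.count = 0
                return True
            return False

    def pulse(chain):
        # One input signal to the least significant digit counter; each carry
        # propagates to the next more significant counter in the chain.
        for counter in chain:
            if not counter.tick():
                break

    chain = [DigitCounter(2), DigitCounter(3), DigitCounter(3)]
    for _ in range(5):
        pulse(chain)
    print([c.count for c in chain])  # -> [1, 2, 0]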
  • the hardware adder unit 124 can include one or more hardware adders.
  • Each hardware adder may include digital circuitry that is configured to perform additions. For example, as described below, the one or more adders may multiply (e.g., by shift-and-add) partial address offset values with given coefficients to determine weighted partial address offset values, and then add the weighted partial address offset values to determine a total address offset value for a tensor element of a tensor.
  • some implementations of the tensor traversal unit 106 can include an arithmetic logic unit (ALU), a hardware multiplier unit, or another hardware unit having suitable digital circuitry that can be configured to perform addition and multiplication operations.
  • the ALU and/or multiplier unit can be used to compute the total address offset values for tensor elements of a tensor from the partial address offset values counted by the hardware counter unit 122.
  • the computing system 100 described in this specification can perform the computations across one or more layers of a multi-layer neural network.
  • a computation process performed within a neural network layer may include a multiplication of an input activation tensor including input activations with a weight tensor including weights of the neural network layer.
  • the computation can include multiplying an input activation with a weight to generate a product on one or more cycles and performing an accumulation of the products over many cycles.
  • the computation for a given neural network layer requires a number of instructions 112.
  • Some instructions 112 when obtained and executed by the processing unit 102, can cause the processing unit 102 to perform computation tasks associated with traversing a tensor. For example, such computation tasks include moving data between two distinct memory address locations in the same or different memory units, storing activations in memory address locations of a first memory unit, storing weights in memory address locations of a second memory unit, and so on.
  • Some instructions 112 can cause the processing unit 102 to access the stored data from the first memory unit and the second memory unit.
  • Some instructions 112 can cause the processing unit 102 to move data as required to accomplish performance of a particular tensor operation by system 100, including moving data between two memory resources of unequal widths, for example, from the first memory unit to the second memory unit, or vice versa.
  • the first memory unit may be a narrow memory unit
  • the second memory unit may be a wide memory unit, where either or both of the first and second memory units are included in the storage medium 104.
  • the data read from the narrow memory unit can be broadcasted to processing units associated with the wide memory unit, and the data can be read from the wide memory unit in parallel to support the parallel computation.
  • Wide and narrow designations generally refer to an approximate size in width (bits/bytes) of one or more memory units.
  • “narrow” may refer to one or more memory units each having a size or width of less than 16 bytes (or 16 bits), and “wide” may refer to one or more memory units each having a size or width of greater than 16 bytes (or 16 bits) but, in some implementations, less than 128 bytes or 256 bytes (or 128 bits or 256 bits).
  • FIG. 2 is an example diagram of moving a tensor from a first memory unit 202 to a second memory unit 218.
  • the computing system 100 can perform the operations depicted in FIG. 2 as a result of the processing unit 102 executing the instruction(s) to move data, i.e., to read data from the first memory unit 202 (also referred to as the source memory unit) and then write the data to the second memory unit 218 (also referred to as the destination memory unit).
  • the first memory unit 202 is a narrow memory unit and the second memory unit 218 is a wide memory unit.
  • Each memory unit has a number of memory banks.
  • Each memory bank can store a tensor element of a tensor.
  • the first memory unit can store the input activations of an input activation tensor in the memory banks at memory address locations that correspond to respective tensor elements of the input activation tensor.
  • the first memory unit can also store the weights of the weight tensor, or output activations of the output activation tensor, in the memory banks of the first memory unit.
  • the operations to move data occur when the computing system executes a memory access instruction to read data from the first memory unit 202 (step 204).
  • the computing system makes a request (step 206) to read tensor elements of a tensor (e.g., input activations of an input activation tensor) stored in memory banks of the first memory unit 202.
  • the computing system obtains the tensor elements (step 208).
  • the tensor elements read from the first memory unit should also have the data type corresponding to the second memory unit.
  • the computing system optionally performs data type conversion (step 210), for example, to convert an integer to a float.
  • the computing system buffers the tensor elements at one or more shifting buffers (step 212). For example, the buffering of the tensor elements can be done in a first-in-first-out (FIFO) order. As another example, the buffering of the tensor elements can be done according to an order defined in an instruction received by the computing system.
  • the computing system executes a memory access instruction to write data to the second memory unit 218 (step 214). Because the tensor will be stored to the second memory unit 218, a set of memory addresses in the second memory unit 218 will need to be generated for the tensor based on the tensor indices of the tensor elements of the tensor. As described further below, the computing system uses the tensor traversal unit to provide the address of each tensor element of the tensor based on the index of the dimension associated with the tensor element (step 216).
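Under the steps above, the FIG. 2 data path can be sketched in Python as read, convert, buffer, and write stages. The names are illustrative, and compute_address stands in for the tensor traversal unit's address generation, which is sketched further below.

    from collections import deque

    def move_tensor(source_elements, compute_address, convert=float):
        # Steps 206/208: read tensor elements from the source memory unit;
        # step 210: optionally convert the data type (e.g., integer to float);
        # step 212: buffer the elements in first-in-first-out order.
        buffer = deque(convert(e) for e in source_elements)
        destination = {}
        for index in range(len(source_elements)):
            address = compute_address(index)         # step 216: per-element address
            destination[address] = buffer.popleft()  # step 214: write to destination
        return destination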
  • FIG. 3 is an example diagram of distributing tensor elements of one or more tensors across a memory unit 318.
  • the memory unit 318 may be a destination memory unit, e.g., the second memory unit 218 of FIG. 2, and the tensor may be the tensor read from a source memory unit, e.g., the first memory unit 202 of FIG. 2.
  • the computing system receives a memory access instruction 314 to write a tensor to the memory unit 318.
  • an N-dimensional tensor may be a vector, a matrix, or a multi-dimensional matrix.
  • An algorithm, including nested loops, may be executed by the computing system to perform tensor operations by iterating one or more nested loops to traverse an N-dimensional tensor.
  • each loop of the loop nest may be responsible for traversing a particular dimension of the N-dimensional tensor.
  • the memory access instruction 314 includes a set of values that define a data distribution pattern for distributing the tensor elements of the tensor across the memory unit 318.
  • the memory access instruction 314 may also include a base memory address value 302 and one or more loop index variables 304.
  • the memory access instruction 314 may be provided to the computing system via a communication bus and, once it arrives, causes the memory access operations to occur at the computing system.
  • the base memory address value 302 can be used by the computing system to locate a memory address in the memory unit 318 that should be accessed.
  • the memory access instruction 314 may include a memory address of a particular tensor element, e.g., the first tensor element, of the tensor.
  • the received base memory address value may be stored to an address queue 306 of the computing system.
  • the address queue 306 may include a number of address registers for storing the base memory address value(s) received with one or more memory access instruction(s).
  • the one or more loop index variables 304 can be used by the computing system to traverse the tensor elements of the tensor in order to write the tensor elements of the tensor to a destination memory unit, e.g., the second memory unit 218 of FIG. 2.
  • a nested for-loop is one example, where three loops tracked by three loop index variables (e.g., i, j, and k) can be nested to traverse through a three-dimensional tensor.
  • an outer for-loop may be used to traverse the loop tracked by variable i
  • a middle for-loop may be used to traverse the loop tracked by variable j
  • an inner for-loop may be used to traverse the loop tracked by variable k.
  • Other examples of the nested loop may include more or fewer loop index variables.
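As a schematic example, such a loop nest might look like the following Python sketch, where the extents I, J, and K and the element values are illustrative:

    I, J, K = 2, 3, 4  # illustrative extents of the three dimensions
    tensor = [[[0] * K for _ in range(J)] for _ in range(I)]
    for i in range(I):            # outer loop, tracked by variable i
        for j in range(J):        # middle loop, tracked by variable j
            for k in range(K):    # inner loop, tracked by variable k
                tensor[i][j][k] = i * J * K + j * K + k  # visit one element per iteration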
  • the computing system uses a hardware counter unit 322 and a hardware adder unit 324 to compute an address offset value for each tensor element of the tensor, and then writes the tensor element to the memory unit 318 in accordance with a memory address value that is determined from the address offset values and the base memory address value.
  • the computing system includes a multiplexer device 308 that includes a set of multiplexers, and the computing system uses the multiplexer device 308 to control which memory banks of the memory unit 318 store which tensor elements.
  • each memory bank of the memory unit can be connected to a multiplexer and the multiplexers can be used to select the appropriate memory bank to store a tensor element based on the memory address value of the tensor element.
  • the computing system controls the multiplexer device 308 to store each tensor element to the appropriate memory bank that corresponds to the memory address value of the tensor element (which has been computed by using the hardware counter unit 322 and the hardware adder unit 324). In this way, the computing system distributes the tensor elements of the tensor across the memory unit 318 in accordance with the data distribution pattern specified in the memory access instruction 314.
  • the computing system includes a bypass path that bypasses the usage of both the hardware counter unit 322 and the hardware adder unit 324 in computing the memory address values.
  • the tensor elements received from the shifting buffer 312 are directly written to the memory unit 318, i.e., without redistribution.
  • the memory access instruction 314 may include an indicator or field that instructs the computing system to either enable or disable the hardware counter unit 322 and the hardware adder unit 324.
  • the memory access instruction 314 may enable the hardware counter unit 322 and the hardware adder unit 324 and thereby cause the computing system to use these units to compute the address offset values.
  • the memory access instruction 314 may disable the hardware counter unit 322 and the hardware adder unit 324 and thereby cause the computing system to enter into a bypass mode where the computing system does not use these units to compute the address offset values.
  • the computing system can write the tensor elements of the tensor into the memory unit 318 in accordance with a fixed order, e.g., the order in which the tensor elements of the tensor were popped from the shifting buffer 312, or the order in which the tensor elements of the tensor were read from the first memory unit 202 of FIG. 2.
  • bypassing the use of the hardware counter unit 322 and the hardware adder unit 324 in computing the address offset values can further reduce the time and/or quantity of computations that are needed for traversing the tensor.
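The enable/bypass choice might be sketched as follows; the enable_redistribution flag and compute_offset helper are hypothetical stand-ins for the instruction's indicator field and the counter/adder path described above.

    def write_elements(elements, enable_redistribution, compute_offset, base_address, memory):
        for index, element in enumerate(elements):
            if enable_redistribution:
                # Counter/adder path: redistribute per the data distribution pattern.
                offset = compute_offset(index)
            else:
                # Bypass mode: keep the fixed order in which elements left the buffer.
                offset = index
            memory[base_address + offset] = element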
  • FIG. 4 is a flow diagram that illustrates an example process 400 for distributing tensor elements of an N-dimensional tensor among a destination memory unit.
  • the process 400 may be performed by a system of one or more computers, e.g., the computing system 100 of FIG. 1.
  • the system includes a tensor traversal unit.
  • the tensor traversal unit includes a hardware counter unit having one or more counters.
  • the tensor traversal unit also includes a hardware adder unit having one or more adders.
  • the system obtains one or more instructions to distribute tensor elements of an N-dimensional tensor among a destination memory unit (step 402).
  • the N-dimensional tensor can include one or more elements arranged across each of the N dimensions, where N is an integer that is greater than or equal to one.
  • the N-dimensional tensor can also be part of a larger tensor having the same or different dimensions, and thus tensor elements to be distributed can be a subset of a larger group of tensor elements.
  • the system can include a processing unit, e.g., processing unit 102 of FIG. 1, that executes an instruction for writing tensor elements of a tensor to a destination memory unit, e.g., storage medium 104 of FIG. 1.
  • the one or more instructions may represent instructions for processing a nested loop that includes one or more loops, where each loop may be iterated using a corresponding loop index variable.
  • the program may specify respective values of the loop index variables.
  • the one or more instructions may include a base memory address for accessing the destination memory unit.
  • an instruction may include a base memory address that corresponds to an address in the destination memory unit at which a particular tensor element, e.g., the first tensor element, of the tensor should be written to.
  • the one or more instructions include a set of values that define or otherwise specify a data distribution pattern according to which these tensor elements of the N-dimensional tensor should be distributed among the destination memory unit.
  • the destination memory unit may include multiple memory banks. Each memory bank has a different memory address. One or more memory banks may be grouped into a memory cell (thus each memory cell includes a number of memory banks). One or more memory cells may be grouped into a cell group (thus each cell group includes a number of memory cells).
  • the one or more instructions can include three values that collectively define the data distribution pattern: a first value (a “unit bytes” value) that defines a number of contiguous tensor elements to be written to each memory cell, a second value (a “cell stride” value) that defines a step value (or a length of a spacing) to take between the memory cells when writing tensor elements to different memory cells, and a third value (a “group count” value) that defines a total number of the cell groups in the destination memory unit for storing tensor elements.
  • the one or more instructions may also include a fourth value (the “access bytes” value) that defines a total number of tensor elements to be written to each cell group.
  • the system can compute the fourth value by computing a division of a total number of the tensor elements in the N-dimensional tensor by the third value (the “group count” value).
  • FIG. 5 is an example illustration of distributing tensor elements of an N-dimensional tensor among a destination memory unit 500.
  • the N-dimensional tensor includes a total of 18 tensor elements
  • the destination memory unit 500 includes 3 cell groups, where each cell group includes 2 memory cells (for example, a first cell group that includes Cell 0 and Cell 1), and where each memory cell includes 4 memory banks (for example, a first memory cell includes four memory banks that can each store a tensor element of a tensor).
  • the system obtains one or more instructions that include the first value (the “unit bytes” value) of 2, the second value (the “cell stride” value) of 2, the third value (the “group count” value) of 3, and also include or specify the fourth value (the “access bytes” value) of 6.
  • the system determines an address offset value of a corresponding memory bank in the destination memory unit for each tensor element in the N-dimensional tensor by using one or more counters and one or more adders (step 404). Generally, the system determines the address offset value for each tensor element based on using the one or more adders to compute a linear combination of the partial address offset values counted respectively by the one or more counters.
  • the system can include three counters, including a least significant digit counter, a second least significant digit counter, and a third least significant digit counter, that are arranged in series in a chain.
  • the system can determine a maximum value from the set of values included in (or specified by) the one or more instructions obtained by the system at step 402.
  • the system can configure the counters such that each counter can count up to a respective maximum value, for example, by way of storing the maximum values in the max-count registers of the counters.
  • the least significant digit counter is configured to increment by one each time an input signal is received by the least significant digit counter.
  • the least significant digit counter may receive such an input signal during each memory access cycle during which a tensor element of the N-dimensional tensor is read.
  • the maximum value for the least significant digit counter can be equal to the first value (the “unit bytes” value).
  • the least significant digit counter can thus count starting from zero (inclusive), up to the first value (exclusive), and then wrap around, i.e., reset to zero.
  • the second least significant digit counter is configured to increment by one at each increment signal from the least significant digit counter.
  • the least significant digit counter can provide such an increment signal each time the value counted by the least significant digit counter reaches the maximum value for the least significant digit counter.
  • the maximum value for the second least significant digit counter can be equal to the third value (the “group count” value).
  • the second least significant digit counter can thus count starting from zero (inclusive), up to the third value (exclusive), and then wrap around, i.e., reset to zero.
  • the third least significant digit counter is configured to increment by one at each increment signal from the second least significant digit counter.
  • the second least significant digit counter can provide such an increment signal each time the value counted by the second least significant digit counter reaches the maximum value for the second least significant digit counter.
  • the maximum value for the third least significant digit counter can be equal to the result of dividing the fourth value (the “access bytes” value) by the first value (the “unit bytes” value).
  • the third least significant digit counter can thus count starting from zero (inclusive), up to the division result (exclusive), and then wrap around, i.e., reset to zero.
  • the three counters count three partial address offset values of (0, 0, 0) when the first tensor element (DO) is read by the system.
  • the least significant digit counter increments by one when the second tensor element (D1) is read by the system, that is, increments by one for each tensor index of the tensor.
  • the three counters now count three partial address offset values of (0, 0, 1).
  • the second least significant digit counter increments by one in response to an increment signal provided by the least significant digit counter when the third tensor element (D2) is read by the system, and correspondingly the three counters now count three partial address offset values of (0, 1, 0).
  • the three counters continue to count in this way until the last tensor element (D17) is read by the system, at which time the three partial address offset values are (2, 2, 1).
  • the system can determine the address offset value for each tensor element in the N-dimensional tensor by computing a sum of (i) a partial address offset value counted by the least significant digit counter, (ii) a partial address offset value counted by the second least significant digit counter multiplied by the second value multiplied by a constant that corresponds to the number of memory banks included in a cell (which is equal to four in the example of FIG. 5), and (iii) a partial address offset value counted by the third least significant digit counter multiplied by the first value, as shown in the equation below:
  • address offset value = Offset0 + Offset1 * second value * 4 + Offset2 * first value
  • Offset0 represents the partial address offset value counted by the least significant digit counter
  • Offset1 represents the partial address offset value counted by the second least significant digit counter
  • Offset2 represents the partial address offset value counted by the third least significant digit counter
  • the first value represents the “unit bytes” value
  • the second value represents the “cell stride” value
  • the system can compute this sum by using the one or more adders. For example, to compute (iii), the system can provide as inputs to an adder the partial address offset value counted by the third least significant digit counter and the first value, and use the adder to perform a shift-and-add multiplication of the inputs.
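Shift-and-add multiplication decomposes a product into shifted partial sums, which suits adder-based hardware; here is a minimal Python sketch of the idea (illustrative, not the specification's circuit).

    def shift_add_multiply(value, coefficient):
        # Accumulate value << i for every set bit i of the coefficient,
        # using only shifts and additions.
        product = 0
        shift = 0
        while coefficient:
            if coefficient & 1:
                product += value << shift
            coefficient >>= 1
            shift += 1
        return product

    assert shift_add_multiply(3, 6) == 18  # 3*2 + 3*4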
  • the system writes, in accordance with total memory addresses determined from the address offset values, the tensor elements of the N-dimensional tensor to the corresponding memory banks of the destination memory unit (step 406).
  • a total memory address for a particular tensor element is determined as a sum of a base memory address included in the instruction and the address offset value for the particular tensor element.
  • the total memory address for the fourth tensor element (D3) in the example of FIG. 5 is equal to the sum of the base memory address and the address offset value of 9.
  • the fourth tensor element can then be written to a corresponding memory bank in the destination memory unit that has this total memory address.
  • the total memory address can be similarly computed by using one or more adders.
  • the inputs to an adder for a particular tensor element are the base memory address and the address offset value.
  • the output is the total memory address for a memory bank in the destination memory unit for storing the particular tensor element.
  • the system distributes the tensor elements of the N-dimensional tensor in the data distribution pattern defined by the first, second, and third values.
  • the four memory banks in the first cell (Cell 0) store the first, second, seventh, and eighth tensor elements (D0, D1, D6, and D7) of the N-dimensional tensor
  • two memory banks in the second cell (Cell 1) store the thirteenth and fourteenth tensor elements (D12 and D13) of the N-dimensional tensor, while the two remaining memory banks in the second cell are empty.
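The whole FIG. 5 walkthrough can be reproduced with a short Python sketch that chains the three counters and applies the linear combination above. The function and variable names are illustrative, and banks_per_cell is fixed at 4 per the example.

    def address_offsets(num_elements, unit_bytes, cell_stride, group_count, banks_per_cell=4):
        access_bytes = num_elements // group_count  # the fourth ("access bytes") value
        offsets = []
        o0 = o1 = o2 = 0  # least, second least, and third least significant digit counters
        for _ in range(num_elements):
            offsets.append(o0 + o1 * cell_stride * banks_per_cell + o2 * unit_bytes)
            o0 += 1
            if o0 == unit_bytes:        # least significant counter wraps ...
                o0 = 0
                o1 += 1
                if o1 == group_count:   # ... and may carry into the next counter
                    o1 = 0
                    o2 = (o2 + 1) % (access_bytes // unit_bytes)
        return offsets

    offsets = address_offsets(num_elements=18, unit_bytes=2, cell_stride=2, group_count=3)
    assert offsets[3] == 9  # D3's address offset, as computed above
    assert [offsets.index(a) for a in range(4)] == [0, 1, 6, 7]  # Cell 0 holds D0, D1, D6, D7
    assert offsets[12] == 4 and offsets[13] == 5  # Cell 1 holds D12 and D13; its last two banks stay empty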
  • the system optionally outputs, e.g., to another computing system, data indicating the address offset value, the total memory address, or both for a particular tensor element of the N-dimensional tensor, so that the other computing system may use the address offset value, the total memory address, or both to access a particular element of the N-dimensional tensor in the destination memory unit.
  • Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus.
  • the computer storage medium can be a machine- readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • data processing apparatus refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • the apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a computer program which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a program may, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code.
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
  • the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations.
  • the index database can include multiple collections of data, each of which may be organized and accessed differently.
  • engine is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions.
  • an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
  • the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
  • Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both.
  • the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
  • the central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
  • Computer readable media suitable for storing computer program instructions and data include all forms of non volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
  • embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser.
  • a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
  • Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
  • Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework or a JAX framework.
  • Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client.
  • Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.


Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining memory addresses for tensor elements of an N-dimensional tensor. One of the methods includes: obtaining one or more instructions to distribute tensor elements of an N-dimensional tensor among a destination memory unit; determining an address offset value of a corresponding memory bank in the destination memory unit for each tensor element in the N-dimensional tensor by using one or more counters and one or more adders, wherein the address offset value for each tensor element is determined as a linear combination of partial address offset values counted by the one or more counters; and writing, in accordance with the determined address offset values, the tensor elements of the N-dimensional tensor to the corresponding memory banks of the destination memory unit.
PCT/US2024/047195 2023-09-18 2024-09-18 Flexible tensor traversal unit Pending WO2025064495A1

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363539040P 2023-09-18 2023-09-18
US63/539,040 2023-09-18

Publications (1)

Publication Number Publication Date
WO2025064495A1 2025-03-27


Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/047195 Pending WO2025064495A1 Flexible tensor traversal unit

Country Status (1)

Country Link
WO (1) WO2025064495A1


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130036268A1 (en) * 2005-10-21 2013-02-07 Roger Espasa Implementing Vector Memory Operations
US20180285254A1 (en) * 2017-04-04 2018-10-04 Hailo Technologies Ltd. System And Method Of Memory Access Of Multi-Dimensional Data
US20200234099A1 (en) * 2018-06-22 2020-07-23 Samsung Electronics Co., Ltd. Neural processor
US20230229407A1 (en) * 2022-01-20 2023-07-20 SambaNova Systems, Inc. Compiler for a Fracturable Data Path in a Reconfigurable Data Processor
CN115495400A (zh) * 2022-11-07 2022-12-20 上海亿铸智能科技有限公司 一种针对张量数据搬运的多通道dma方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIE XINFENG ET AL: "SpaceA: Sparse Matrix Vector Multiplication on Processing-in-Memory Accelerator", 2021 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE (HPCA), IEEE, 27 February 2021 (2021-02-27), pages 570 - 583, XP033905537, DOI: 10.1109/HPCA51647.2021.00055 *


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 24783113

Country of ref document: EP

Kind code of ref document: A1