WO2023287757A1

WO2023287757A1 - Asymmetric data path operations

Info

Publication number: WO2023287757A1
Application number: PCT/US2022/036778
Authority: WO
Inventors: Nicholas KNIGHT
Original assignee: SiFive Inc
Current assignee: SiFive Inc
Priority date: 2021-07-13
Filing date: 2022-07-12
Publication date: 2023-01-19
Anticipated expiration: 2024-01-13
Also published as: TW202311985A

Abstract

Executing an instruction may include loading first, second, and third sets of data elements, stored as one-dimensional arrays, via first, second, and third data paths, where the width of the third data path is greater than the width of the first data path and the width of the second data path, and where data elements of the first, second, and third sets of data elements map to entries of first, second, and third matrices; and matrix multiplying the first matrix by the second matrix to produce a matrix multiplication result, and adding the matrix multiplication result to the third matrix to produce a fourth matrix, where entries of the fourth matrix map to data elements of a fourth set of data elements.

Description

ASYMMETRIC DATA PATH OPERATIONS

CROSS-REFERENCE TO RELATED APPLICATION(S)

[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application Serial No. 63/221,264, filed July 13, 2021, the entire disclosure of which is hereby incorporated by reference.

TECHNICAL FIELD

[0002] This disclosure relates generally to optimization of computer instructions and, more specifically, to asymmetric data path operations.

BACKGROUND

[0003] A matrix multiplication operation may involve multiplying two matrices to produce or accumulate into a third matrix. In computing systems, implementing an artificial neural network may involve processing multiple matrix multiplication operations.

BRIEF DESCRIPTION OF THE DRAWINGS [0004] The disclosure is best understood from the following detailed description when read in conjunction with the accompanying drawings. It is emphasized that, according to common practice, the various features of the drawings are not to-scale. On the contrary, the dimensions of the various features are arbitrarily expanded or reduced for clarity.

[0005] FIG. 1 is a block diagram of an example of an integrated circuit supporting vector operations.

[0006] FIG. 2 is a block diagram of an example of an integrated circuit supporting vector operations.

[0007] FIG. 3 is an example of a memory map of examples of vector memory instructions.

[0008] FIG. 4 is a flow chart of an example of a process for an asymmetric data path operation.

[0009] FIG. 5 is a block diagram of a system supporting asymmetric data path operations.

[0010] FIG. 6 is an example of a matrix multiply-add operation targeting asymmetric data paths. [0011] FIG. 7 is an example of a geometric visualization of a matrix multiply-add operation.

[0012] FIG. 8 is a flow chart of an example of a process for executing an instmction using asymmetric data paths.

[0013] FIG. 9 is an example of a geometric visualization of a sequence of matrix multiply-add operations that may be performed with a single instruction.

[0014] FIG. 10 is a flow chart of an example of a process for executing a sequence of instructions, each itself a sequence of matrix multiply-add operations using asymmetric data paths, to perform a larger matrix multiply- accumulate operation.

DETAILED DESCRIPTION

[0015] A processor may implement vector instructions that operate on multiple data elements at the same time. For example, a vector instmction may operate on multiple data elements, arranged in a one-dimensional data array, which may be stored in a vector register file or memory. Implementations of some applications may involve processing data elements arranged in multiple dimensions. For example, neural network applications may involve processing data elements arranged in multi-dimensional arrays (e.g., matrices). A need therefore exists to improve the processing of instructions for data elements arranged in multiple dimensions.

[0016] Implementations of this disclosure are designed to improve the processing of instructions for data elements arranged in multiple dimensions by executing a vector instmction (e.g., a vector instmction configured to read and write data elements arranged in one-dimensional data arrays), using asymmetric data paths, to multiply and accumulate data elements in matrices. Executing the instmction may include loading first, second, and third sets of data elements from a storage location, such as a vector register file or memory. The first, second, and third sets of data elements may be stored as one-dimensional data arrays in the storage location. The first, second, and third sets of data elements may be loaded via first, second, and third data paths, respectively (e.g., read ports). For example, the first, second, and third sets of data elements may be loaded by a vector arithmetic logic unit (ALU) connected to the storage location. The widths of the data paths may be asymmetric. In some implementations, the width of the third data path may be greater than the width of the first data path and the width of the second data path. In some cases, the width of the third data path may be at least twice the width of the first data path and at least twice the width of the second data path. For example, the width of the third data path may be 512 bits, while the width of the first data path may be 256 bits, and the width of the second data path may also be 256 bits. Data elements of the first, second, and third sets of data elements may map to entries of first, second, and third matrices. For example, the vector ALU may implement a transpose buffer to arrange the data elements into the multiple dimensions of the matrices. Executing the instruction may also include multiplying the first matrix by the second matrix to produce a matrix multiplication result, and adding the third matrix to the matrix multiplication result to produce a fourth matrix. For example, the vector ALU may implement a matrix multiply-add operation that uses the first, second, and third matrices to generate the fourth matrix. The fourth matrix may map to data elements of a fourth set of data elements. For example, the vector ALU may use the transpose buffer to arrange the entries of the fourth matrix into a one-dimensional array of data elements of the fourth set of data elements. Executing the instruction may also include storing the fourth set of data elements, as a one dimensional array, via a fourth data path (e.g., a write port). For example, the fourth set of data elements may be stored by the vector ALU connected to the storage location. Further, the width of the fourth data path may be greater than the width of the first data path and the width of the second data path. In some cases, the width of the fourth data path may be at least twice the width of the first data path and at least twice the width of the second data path. For example, the width of the fourth data path may be 512 bits, while the width of the first data path may be 256 bits, and the width of the second data path also 256 bits.

[0017] The size of data elements of the third and fourth sets of data elements may be greater than the size of data elements of the first and second sets of data elements. In some cases, the size of data elements of the third and fourth sets of data elements may be at least four times the size of data elements of the first and second sets of data elements. For example, data elements of the third and fourth sets of data elements may be 32 bit integers, while data elements of the first and second sets of data elements may be 8 bit integers. In some cases, this may represent a quadruple-widening of the arithmetic values (e.g., the data elements of the third and fourth sets have quadruple the size of those of the first and second sets), but only a double-widening of the data path widths (e.g., the third and fourth data paths have double the width of the first and second data paths). This may be due to the destination matrix (e.g., the fourth matrix) being smaller (e.g., having fewer entries, arranged with 4 rows by 4 columns) than the input matrices (e.g., having more entries, such as the first matrix arranged in 4 rows by 8 columns, and the second matrix arranged with 8 rows by 4 columns), resulting from the matrix multiplication.

[0018] Also described herein are systems and methods for matrix multiplication using asymmetric data paths. The techniques for matrix multiplication and/or vector operations may be used to realize one or more advantages over conventional processors. For example, the structures and techniques described herein may enable efficient and accurate matrix multiplication. Existing instruction set architectures may use the same size for input and output. The matrix multiplication and/or vector operations can use asymmetric data paths. [0019] As used herein, the term “circuit” refers to an arrangement of electronic components (e.g., transistors, resistors, capacitors, and/or inductors) that is structured to implement one or more functions. For example, a circuit may include one or more transistors interconnected to form logic gates that collectively implement a logical function.

[0020] FIG. 1 is a block diagram of an example of an integrated circuit 110 supporting vector operations, including executing instructions for matrix multiplication using quad- widening matrix multiplication instructions. The integrated circuit 110 includes a processor core 120. The processor core 120 is configured to fetch instructions from and access data stored in a memory 140 external to the integrated circuit 110 and/or a memory 142 internal to the integrated circuit 110. The integrated circuit 110 may provide advantages over conventional processor architectures, such as, for example, enabling efficient and accurate matrix multiplication and/or vector operations. For example, the integrated circuit 110 may implement the process 400 of FIG. 4.

[0021] The integrated circuit 110 includes a processor core 120, which may include a pipeline configured to execute instructions, including unit-stride and constant-stride vector memory instructions and quad-widening matrix multiplication instructions. The pipeline stages can include for example, fetch, decode, rename, dispatch, issue, execute, memory access, and write-back stages. For example, the processor core 120 may be configured to execute instructions of a RISC-V instruction set which includes a RISC-V vector extension instruction set.

[0022] The processor core 120 may be configured to fetch instructions from a memory 140 external to the integrated circuit 110 that stores instructions and/or data. The processor core 120 may be configured to access data in the memory 140 in response to instructions, including vector instructions (e.g., the vector load instruction 310 or the vector store instruction 330). For example, the processor core 120 may access data in the memory directly or via one or more caches. The processor core 120 may also be configured to fetch instructions from a memory 142 internal to the integrated circuit 110 that stores instructions and/or data. The processor core 120 may be configured to access data in the memory 142 in response to instructions, including vector instructions including quad-widening matrix multiplication instructions. Although not shown in FIG. 1, the integrated circuit 110 may include multiple processor cores in some implementations.

[0023] FIG. 2 is a block diagram of an example of an integrated circuit 210 supporting vector operations, including executing instructions for matrix multiplication using quad- widening matrix multiplication instructions and vector operations. The integrated circuit 210 includes a processor core 220. The processor core 220 includes one or more register files 240, which may include vector registers. The processor core 220 includes an LI instruction cache 250 and an LI data cache 252. The integrated circuit 210 includes an outer memory system 260, which may include memory storing instructions and data and/or provide access to a memory 262 external to the integrated circuit 210 that stores instructions and/or data. The integrated circuit 210 may provide advantages over conventional processor architectures, such as, for example, enabling efficient and accurate matrix multiplication. For example, the integrated circuit 210 may implement the process 400 of FIG. 4.

[0024] The integrated circuit 210 includes a processor core 220 including a pipeline 230 configured to execute instructions, including unit-stride and constant-stride vector memory instructions and quad-widening matrix multiplication instructions. The pipeline 230 includes one or more fetch stages that are configured to retrieve instructions from a memory system of the integrated circuit 210. For example, the pipeline 230 may fetch instructions via the LI instruction cache 250. The pipeline 230 may include additional stages, such as decode, rename, dispatch, issue, execute, memory access, and write-back stages. For example, the processor core 220 may include a pipeline 230 configured to execute instructions of a RISC- V instruction set which includes a RISC-V vector extension instruction set.

[0025] The integrated circuit 210 includes one or more register files 240 for the processor core 220. The one or more register files 240 may store part or all or an architectural state of the processor core 220. For example, the one or more register files 240 may include a set of vector registers. For example, the one or more register files 240 may include a set of control and status registers (CSRs). For example, the one or more register files 240 may include a set of scalar registers.

[0026] The integrated circuit 210 includes an LI instruction cache 250 for the processor core 220. The LI instruction cache 250 may be a set-associative cache for instruction memory. To avoid the long latency of reading a tag array and a data array in series, and the high power of reading the arrays in parallel, a way predictor may be used. The way predictor may be accessed in an early fetch stage and the hit way may be encoded into the read index of the data array. The tag array may be accessed in a later fetch stage and may be used for verifying the way predictor.

[0027] The integrated circuit 210 includes an LI data cache 252 for the processor core 220. For example, the LI data cache 252 may be a set- associative VIPT cache, meaning that it is indexed purely with virtual address bits and tagged fully with all translated physical address bits. For low power consumption, the tag and data arrays may be looked up in serial so that at most a single data SRAM way is accessed. For example, the line size of the LI data cache 252 may be 64 Bytes, and the beat size may be 26 Bytes.

[0028] The integrated circuit 210 includes an outer memory system 260, which may include memory storing instructions and data and/or provide access to a memory 262 external to the integrated circuit 210 that stores instructions and/or data. For example, the outer memory system 260 may include an L2 cache, which may be configured to implement a cache coherency protocol/policy to maintain cache coherency across multiple LI caches. Although not shown in FIG. 2, the integrated circuit 210 may include multiple processor cores in some implementations. For example, the outer memory system 260 may include multiple layers.

[0029] FIG. 3 is a memory map of examples of vector memory instructions 300 that include a vector load instruction 310 and a vector store instruction 330. The vector load instruction 310 includes an opcode 312, a destination register field 314 that identifies an architectural register to be used to store a result of the vector load instruction 310, a width field 316 that specifies the size of memory elements of a vector being loaded from memory, a base register field 318 that identifies an architectural register that stores a base address for the vector in memory, a stride register field 320 that identifies an architectural register that stores a stride (e.g., one for a unit-stride vector load or another constant stride) for the vector in memory, and a mode field 322 that specifies additional or optional parameters (e.g., including a memory addressing mode and/or a number of fields in each segment) for the vector load instruction 310. The vector store instruction 330 includes an opcode 332, a source register field 334 that identifies an architectural register holding vector data for storage, a width field 336 that specifies the size of memory elements of a vector being stored in memory, a base register field 338 that identifies an architectural register that stores a base address for the vector in memory, a stride register field 340 that identifies an architectural register that stores a stride for the vector in memory, and a mode field 342 that specifies additional or optional parameters (e.g., including a memory addressing mode and/or a number of fields in each segment) for the vector store instruction 330. For example, in a RISC-V processor core, the vector load instruction 310 may use the RISC-V LOAD-FP major opcode with a vector encoding extension and the vector store instruction 330 may use the RISC-V STORE-FP major opcode with a vector encoding extension.

[0030] In neural network computations, quantization may be applied to reduce the costs of computation and data movement. For the purposes of this disclosure, we can view a general neural network layer mathematically as f(A*B), where A and B are matrices and f is a function. Our concern here may be the matrix product C = A*B; the details of f, and the origins of A and B, may be out of scope. To reduce the costs of computation and data movement, entries of A and B may use narrower data types than the entries of C. For example, in a quantized neural network, the entries of A and B may be represented as 8-bit integers and the entries of C may be represented as 32-bit integers. In this context, the data widening may avoid or reduce loss of precision during the matrix multiplication. The entries of C may be subsequently narrowed; this is a detail of the function f.

[0031] For example, we can express a matrix multiply, like one appearing in a quantized neural network computation, in pseudocode as three nested loops: for i = 1:I for j = 1:J for k = 1:K

C(i,j) += A (i,k) * B(k,j)

Here we may assume A is an I-by-K matrix, B is K-by-J matrix, and C is I-by-J matrix, zero- initialized. Applying iteration space tiling, with a Ti-by-Tj-by-Tk tile, may transform this loop nest as follows: for i = 1:Ti:I, let TI = i:i+Ti-l for j = l:Tj:J, let TJ = j:j+Tj-l for k = 1:Tk:K, let TK = k:k+Tk-l C(TI,TJ) += A (TI,TK) * B(TK,TJ)

For simplicity, we may assume Ti divides I, Tj divides J, and Tk divides K. Note that the inner loop has transformed from a scalar multiply-add to a matrix multiply-add. To leverage conventional (one-dimensional) vector instructions, like those in RISC-V vector instruction (RVV) implementations, a system may pick one-dimensional tiles. For example, j-loop vectorization amounts to picking Ti = Tk = 1 and Tj = VF, the vector length. The instructions in this disclosure may correspond to more general choices of three-dimensional tiles: that is, a general tile (Ti, Tj, Tk) may imply an instruction that multiplies a Ti-by-Tk submatrix of A by a Tk-by-Tj submatrix of B and accumulates it into a Ti-by-Tj submatrix of C. For example, FIG. 6 and FIG. 7 depict this scenario for the case (Ti, Tj, Tk) = (4, 4, 8).

[0032] Matrix instructions may be designed by optimizing the tile parameters for the machine's data path widths and the matrix data sizes. For example, suppose a vector register file has three read ports of widths at least d, d, and 2*d, respectively, and a write port of width at least 2*d, where d is a power of two. Further, suppose the matrices A and B have b-bit elements, and the matrix C has 4*b-bit elements, where b is a power of two. In such a scenario, there could be matrix instructions corresponding to the tilings (Ti, Tj, Tk) = (t, t,

2*t) where t is a non-negative power of two. For example, if d = 256 and b = 8, a system may pick t = 4, yielding the (4, 4, 8) instructions depicted in FIG. 6 and FIG. 7. While we may consider other choices of t for d = 256 and b = 8, the choice t = 4 may maximize the amount of computation performed subject to the constraints of the data paths. In detail, the 4-by-8 submatrix of A (8-bit elements) is 256 bits, so can use the first read port. And the 8-by-4 submatrix of B (also 8-bit elements) is also 256 bits, so can use the second read port. And the 4-by-4 submatrix of C (32-bit elements) is 512 bits, so can use the third read port on input, and the write port on output. An instruction designed in this manner could perform 4*4*8 = 128 quad-widening multiply- adds, which may be optimal under the constraints of these data paths.

[0033] As another example, consider the same asymmetric data paths but now suppose C has 2*b-bit (vs. 4*b-bit) elements. In this scenario, a system could select (Ti, Tj, Tk) = (t, t, t) instead; for example, if d = 256 and b = 16, the system may pick t = 4, yielding (4, 4, 4) instructions. While we may consider other choices of t for d = 256 and b = 16, the choice t =

4 may maximize the amount of computation performed subject to the constraints of the data paths. In detail, the 4-by-4 submatrix of A (16-bit elements) is 256 bits, so can use the first read port. And the 4-by-4 submatrix of B (also 16-bit elements) is also 256 bits, so can use the second read port. And the 4-by-4 submatrix of C (32-bit elements) is 512 bits, so can use the third read port on input, and the write port on output. An instruction designed in this manner could perform 4*4*4 = 64 double-widening multiply-adds, which may be optimal under the constraints of these data paths.

[0034] In some implementations, this approach may be applied to symmetric data paths. For example, continuing to suppose A and B have b-bit elements while C has 4*b-bit elements, now suppose the three read ports and the write port all have a width of at least d. In such a scenario, there could be matrix instructions corresponding to the tilings (Ti, Tj , Tk) = (t, t, 4*t). For example, if d = 128 and b = 8, we pick t = 2, yielding (2, 2, 8) instructions. Again, while other choices of t for d = 256 and b = 8 may be used, the choice t = 2 may maximize the amount of computation performed subject to the constraints of the data paths. For example, the 2-by-8 submatrix of A (8-bit elements) is 128 bits, which can use the first read port. And the 8-by-2 submatrix of B (also 8-bit elements) is also 128 bits, which can use the second read port. And the 2-by-2 submatrix of C (32-bit elements) is 128 bits, which can use the third read port on input, and the write port on output. An instruction designed in this manner may perform 2*2*8 = 32 double-widening multiply- adds, which may be optimal under the constraints of these data paths.

[0035] As another example in the symmetric case, in the case where C has 2*b-bit (vs. 4*b-bit) elements, a system could select (Ti, Tj, Tk) = (t, t, 2*t). For example, if d = 128 and b = 16, a system may pick t = 2, yielding (2, 2, 4) instructions. Again, while we may consider other choices of t for d = 256 and b = 16, the choice t = 2 maximizes the amount of computation performed subject to the constraints of the data paths. For example, the 2-by-4 submatrix of A (16-bit elements) is 128 bits, which can use the first read port. And the 4-by-2 submatrix of B (also 8-bit elements) is also 128 bits, which can use the second read port. And the 2-by-2 submatrix of C (32-bit elements) is 128 bits, which can use the third read port on input, and the write port on output. An instruction designed in this manner may perform 2*2*4 = 16 double-widening multiply- adds, which may be optimal under the constraints of these data paths.

[0036] Thus, a system may pick (Ti, Tj, Tk) as the solution to an optimization problem, defined by the machine data path widths and the matrix data element sizes. In particular, one objective may be to maximize the amount of computation, Ti*Tj*Tk, subject to the constraints that the matrix operands can use the associated read and write ports. For example, if a matrix has b-bit elements and wants to use a d-bit port, the system may constrain the number of elements to be at most d / b. To reiterate, the A, B, and C matrices have Ti*Tk, Tk*Tk, and Ti*Tj elements, respectively. Optimization problems in this family can be solved by various techniques.

[0037] When implementing these matrix instructions on a temporal (vs. spatial) vector machine, we may fuse multiple of these matrix operations into a longer instruction. This can be modeled as a multi-level iteration space tiling: for non- negative integers Ni, Nj, Nk, a system can implement a (Ni*Ti, Nj*Tj, Nk*Tk) tile as a single instruction performing a sequence of Ni*Nj*Nk matrix operations, each corresponding to the subtile (Ti, Tj, Tk). For example, continuing the (4, 4, 8) case from the previous paragraph, a system might fuse N of these operations in the j -dimension, which can be viewed as a particular implementation of a (4, N*4, 8) tile. Such an instruction may execute over N cycles: for example, in each cycle n = 0:N-1, a system might read the (same) submatrix of A, the n-th submatrix of B, and the n-th submatrix of C, perform the n-th multiply-add operation, and write the (updated) n-th submatrix of C. The degree of fusion the system may perform may be limited by other aspects of the vector machine's design. For example, when adding the (4, N*4, 8) instruction to a RISC-V vector machine with 32 architectural vector registers, each of length 512 bits, a system might choose to restrict N to 8, to match an existing constraint that at most eight architectural vector registers can be logically grouped into a source or destination register for a single vector instruction.

[0038] When implementing these instructions, the in-register- file layout of the matrix operands need not match their in-memory layout. For example, an implementation may store the sub-matrices of A, B, and C in vector registers in row-major order or in column-major order. And in the case of the fused instructions in the preceding paragraph, multiple of these row- or column-major sub-matrices may be stored contiguously over a logical group of architectural vector registers. For example, in the (4, N*4, 8) example of the preceding paragraph, the 4-by-8 submatrix of A (8-bit elements) may reside in row-major layout in bytes 0:31 of the first vector register group operand, the n-th 8-by-4 submatrix of B may reside in row-major layout in bytes n*32:(n+l)*32-l of the second vector register group operand, and the n-th 4-by-4 submatrix of C (32-bit elements) may reside in row-major layout in bytes n*64:(n+l)*64-l of the third vector register group operand.

[0039] Thus, this disclosure provides details about the widths of arithmetic types used in these instructions Once these details are determined, the instruction implementation may address the arithmetic types used in these instructions. In applications, these arithmetic types may be integer, fixed-point, or floating-point. For example, in quantized neural network applications, the A and B matrices may be int8 or uint8, and the C matrix may be int32. In the (4,4,8) example above, the functional unit may have 128 multipliers feeding into 16 summation trees. In some implementations, an optimized design may leverage techniques of Booth, Wallace, and/or Dadda, perhaps fusing the carry-save adds with those in the summation trees. For example, the products may be representable in at most 16b and the dot products in at most 19b; and final accumulations may widen to 32 bits.

[0040] FIG. 4 is a flow chart of an example of a process 400 for an asymmetric data path operation. The process 400 includes reading 410 a first matrix from a first x-bit read port; reading 420 a second matrix from a second x-bit read port; reading 430 a third matrix from a 2x-bit read port; executing 440 a matrix multiply-add operation using the first matrix, the second matrix, and the third matrix to generate a fourth matrix; and writing 450 the fourth matrix to a 2x-bit write port.

[0041] Some implementations may include a method comprising reading a first matrix from a first x-bit read port; reading a second matrix from a second x-bit read port; reading a third matrix from a 2x-bit read port; executing a matrix multiply-add operation using the first matrix, the second matrix, and the third matrix to generate a fourth matrix; and writing the fourth matrix to a 2x-bit write port. Some implementations may include a computer- implemented method for matrix multiplication, the method comprising reading a first matrix from a first x-bit read port; reading a second matrix from a second x-bit read port; reading a third matrix from a 2x-bit read port; executing a matrix multiply-add operation using the first matrix, the second matrix, and the third matrix to generate a fourth matrix; and writing the fourth matrix to a 2x-bit write port. Some implementations may include a computer readable media storing data and instructions, said data and instructions, when executed, adapting a computer system to perform matrix multiplication using quad- widening matrix multiplication instructions, said computer system adapted to: read a first matrix from a first x-bit read port; read a second matrix from a second x-bit read port; read a third matrix from a 2x-bit read port; execute a matrix multiply-add operation using the first matrix, the second matrix, and the third matrix to generate a fourth matrix; and write the fourth matrix to a 2x-bit write port. [0042] FIG. 5 is a block diagram of a system 500 supporting asymmetric data path operations. The instruction may be a vector instruction (e.g., a vector instruction configured to read and write data elements in one-dimensional data arrays). For example, the instruction may be a matrix multiply-add instruction that is executed to multiply and accumulate data elements mapped to entries in matrices. The system 500 may include a vector ALU 502 and a storage location 504, such as a vector register file or memory. Executing the instruction may include the vector ALU 502 loading first, second, and third sets of data elements from the storage location 504 via read ports. The first, second, and third sets of data elements may be stored as one-dimensional data arrays in the storage location 504. The first, second, and third sets of data elements may be loaded via first, second, and third data paths 506, 508, and 510 connected to the storage location, respectively (e.g., the read ports). The widths of the data paths may be asymmetric. In some implementations, the width of the third data path 510 may be greater than the width of the first data path 506 and the width of the second data path 508. In some cases, the width of the third data path 510 may be at least twice the width of the first data path 506 and at least twice the width of the second data path 508. For example, the width of the third data path 510 may be 512 bits, while the width of the first data path 506 may be 256 bits, and the width of the second data path 508 may be 256 bits.

[0043] The vector ALU 502 may map data elements of the first, second, and third sets of data elements to entries of first, second, and third matrices. For example, the vector ALU 502 may implement a transpose buffer to arrange the data elements into the multiple dimensions of the matrices. Executing the instruction may further include the vector ALU 502 matrix multiplying the first matrix by the second matrix to produce a matrix multiplication result, and adding the third matrix to the matrix multiplication result (e.g., accumulating) to produce a fourth matrix. For example, the vector ALU 502 may implement a matrix multiply-add operation that uses the first, second, and third matrices to generate the fourth matrix. The vector ALU 502 may map the fourth matrix to data elements of a fourth set of data elements. For example, the vector ALU 502 may use the transpose buffer to arrange the entries of the fourth matrix into a one-dimensional array of data elements of the fourth set of data elements. [0044] Executing the instruction may also include storing the fourth set of data elements, as a one-dimensional array, via a fourth data path 512 (e.g., a write port). For example, the fourth set of data elements may be stored by the vector ALU 502, connected to the storage location 504, via the fourth data path 512. Further, the width of the fourth data path 512 may be greater than the width of the first data path 506 and the width of the second data path 508. In some cases, the width of the fourth data path 512 may be at least twice the width of the first data path 506 and at least twice the width of the second data path 508. For example, the width of the fourth data path 512 may be 512 bits (e.g., a double- widening), while the width of the first data path 506 may be 256 bits, and the width of the second data path 508 may be 256 bits. In some cases, the width of the fourth data path 512 may be equal to the width of the third data path 510. For example, the width of the fourth data path 512 may be 512 bits, while the width of the third data path 510 may also be 512 bits.

[0045] FIG. 6 is an example of a matrix multiply-add operation targeting asymmetric data paths. For example, the matrices may be loaded and/or stored when executing an instruction. For example, the matrices may be loaded and/or stored when executing the instruction in the system 500.

[0046] Data elements of first, second, and third sets of data elements 602A through 602C may be stored as one-dimensional arrays in a storage location (e.g., the storage location 504). For example, the first set of data elements 602A may be stored in the storage location as a first one-dimensional array; the second set of data elements 602B may be stored in the storage location as a second one- dimensional array; and the third set of data elements 602C may be stored in the storage location as a third one-dimensional array. For example, the data elements of the first, second, and third sets of data elements 602A through 602C may be stored in the storage location row-major order or column-major order. A vector ALU (e.g., the vector ALU 502) may load first, second, and third sets of data elements 602A through 602C via first, second, and third data paths (e.g., the first, second, and third data paths 506, 508, and 510), respectively. The vector ALU may map the data elements of the first, second, and third sets of data elements 602 A through 602C (e.g., the one-dimensional arrays) to corresponding entries of first, second, and third matrices 604A through 604C, respectively. For example, the vector ALU may implement a transpose buffer to arrange the data elements into the multiple dimensions of the matrices, based on the row-major order or the column- major order. The vector ALU may map data elements of the first set of data elements 602 A to corresponding entries of the first matrix 604A (“A” matrix); data elements of the second set of data elements 602B to corresponding entries of the second matrix 604B (“B” matrix); and data elements of the third set of data elements 602C to corresponding entries of the third matrix 604C (previous “C” matrix).

[0047] The instruction may execute to multiply the first matrix 604A by the second matrix 604B to produce a matrix multiplication result, and to add the matrix multiplication result to the third matrix 604C to produce a fourth matrix 604D (next “C” matrix). For example, the instruction could implement a matrix multiplication and accumulate, such as C = A*B + C. The first matrix 604A could have 4 rows and 8 columns (4x8) and the second matrix 604B could have 8 rows and 4 columns (8x4). The data elements of the first and second sets of data elements 602 A and 602B, and thus the entries of the first and second sets of matrices 604A and 604B, could be 8-bit integers. Executing the instruction may include matrix multiplying the first matrix 604A by the second matrix 604B to produce the matrix multiplication result. The matrix multiplication result could be a matrix having 4 rows and 4 columns (4x4). For example, the vector ALU may implement the matrix multiplication operation. Executing the instruction may also include adding the matrix multiplication result to the third matrix 604C (e.g., accumulating) to produce a fourth matrix 604D. For example, the vector ALU may implement the accumulate operation. The third matrix 604C may have 4 rows and 4 columns (4x4), and the fourth matrix 604D may have 4 rows and 4 columns (4x4). The entries of the third and fourth matrices 604C and 604D, and thus the data elements of the third and fourth sets of data elements 602C and 602D, could be 32-bit integers. While indicated as integer numbers by way of example, in some cases, the data elements may be floating point numbers.

[0048] The vector ALU may map the entries of the fourth matrix 604D to corresponding data elements of the fourth set of data elements 602D (e.g., a one-dimensional array). For example, the fourth set of data elements 602D may be stored in the storage location as a fourth one-dimensional array. For example, the vector ALU may use the transpose buffer to arrange the entries of the fourth matrix 604D into a one-dimensional array of data elements of the fourth set of data elements 602D.

[0049] With additional reference to FIG. 7, in a geometric visualization of a matrix multiply-add operation, a three dimensional array 610, or rectangular prism, may be used to represent the matrices that are multiplied and accumulated by the instruction. For example, in the three dimensional array 610, the first matrix 604A, shown on a first side of the three dimensional array 610, may be multiplied by the second matrix 604B, shown on a top of the three dimensional array 610, to produce a matrix multiplication result. The third matrix 604C matrix, accumulating from a left side of the three dimensional array 610, may be added to the matrix multiplication result to produce the fourth matrix 604D, flowing from a right side of the three dimensional array 610. The three dimensional array 610 may be executed in one clock cycle. In some implementations, the instruction may be executed multiple times, e.g., resulting in multiple three dimensional arrays like the three dimensional array 610. In such cases, the result of one three dimensional array may feed forward for an accumulate operation for a next three dimensional array (e.g., the fourth matrix 604D of one three dimensional array may feed forward to become the third matrix 604C matrix for another three dimensional array).

[0050] Thus, the size of data elements of the fourth set of data elements 602D may be greater than the size of data elements of the first and second sets of data elements 602A and 602B. In some cases, the size of data elements of the fourth set of data elements 602D may be at least four times the size of data elements of the first and second sets of data elements 602A and 602B. For example, data elements of the fourth set of data elements 602D may be 32-bit integers, while data elements of the first and second sets of data elements 602A and 602B may be 8-bit integers. In some cases, this may represent a quad-widening of the arithmetic values (e.g., data elements of the fourth set of data elements 602D), but only a double- widening of data width (e.g., only double-widening of the fourth data path, such as the fourth data path 512, over the first data path, such as the first data path 506, or the second data path, such as the second data path 508). This may be due to the destination matrix (e.g., the fourth matrix 604D) being smaller (e.g., 4 rows by 4 columns) than the input matrices (e.g., the first matrix 604 A being 4 rows by 8 columns, and the second matrix 604B being 8 rows by 4 columns), based on the matrix multiplication.

[0051] In some implementations, storing the fourth set of data elements 602D may include overwriting the third set of data elements 602C with the fourth set of data elements 602D. For example, the instruction could be a destructive multiply and accumulate instruction that may have a single argument identifying both the third set of data elements 602C and the fourth set of data elements 602D that will replace the third set of data elements 602C. For example, the location of the third set of data elements 602C may be the same as the location of the fourth set of data elements 602D. In some implementations, storing the fourth set of data elements 602D may include storing the fourth set of data elements 602D separately from the third set of data elements (e.g., the third set of data elements 602C and the fourth set of data elements 602D may be stored in different locations). For example, the instruction could be a non-destructive multiply and accumulate instruction that may have separate arguments respectively identifying the third set of data elements and the fourth set of data elements 602D.

[0052] FIG. 8 is a flow chart of an example of a process 800 for executing an instmction using asymmetric data paths. The instruction may be executed to multiply and accumulate matrices. For example, the process 800 may be implemented using the integrated circuit 110 of FIG. 1 or the integrated circuit 210 of FIG. 2. For example, the process 800 may be implemented by the system 500 of FIG. 5.

[0053] The process 800 may include executing an instruction to load 810 first, second, and third sets of data elements (e.g., the first, second, and third sets of data elements 602A through 602C). For example, the instmction may be a vector instmction (e.g., a vector instmction configured to read and write data elements in one-dimensional data arrays) that multiplies and accumulates data elements mapped to entries in matrices. The first, second, and third sets of data elements may be stored as one-dimensional arrays in a storage location (e.g., the storage location 504). Executing the instmction may include a vector ALU (e.g., the vector ALU 502) loading the first, second, and third sets of data elements from the storage location. The first, second, and third sets of data elements may be loaded via asymmetric data paths, such as via first, second, and third data paths (e.g., the first, second, and third data paths 506, 508, and 510) connected to the storage location, respectively (e.g., read ports). The widths of the data paths may be asymmetric. In some implementations, the width of the third data path may be greater than the width of the first data path and the width of the second data path. In some cases, the width of the third data path may be at least twice (e.g., 2x) the width of the first data path and at least twice the width of the second data path. For example, the width of the third data path may be 512 bits, while the width of the first data path may be 256 bits, and the width of the second data path may be 256 bits.

[0054] The data elements of the first, second, and third sets of data elements may map to entries of first, second, and third matrices (e.g., the first, second, and third matrices 604A through 604C). For example, the vector ALU may map data elements of the first, second, and third sets of data elements to entries of first, second, and third matrices. For example, the vector ALU 502 may implement a transpose buffer to arrange the data elements into the multiple dimensions of the matrices.

[0055] The process 800 may also include executing the instruction to matrix multiply 820 the first matrix by the second matrix to produce a matrix multiplication result. The matrix multiplication result may be added to the third matrix to produce a fourth matrix. The entries of the fourth matrix may map to data elements of a fourth set of data elements. For example, executing the instruction may further include the vector ALU matrix multiplying the first matrix by the second matrix to produce a matrix multiplication result, and adding the third matrix to the matrix multiplication result to produce a fourth matrix. For example, the vector ALU may implement a matrix multiply-add operation that uses the first, second, and third matrices to generate the fourth matrix. The vector ALU may map the fourth matrix to data elements of a fourth set of data elements. For example, the vector ALU may use the transpose buffer to arrange the entries of the fourth matrix into a one-dimensional array of data elements of the fourth set of data elements

[0056] The process 800 may also include executing the instruction to store 830 the fourth set of data elements (e.g., the fourth set of data elements 602D). In some implementations, the fourth set of data elements may overwrite the third set of data elements (e.g., the third set of data elements 602C). For example, the instruction may be a destructive multiply - accumulate operation. In some implementations, storing the fourth set of data elements may include overwriting the third set of data elements (e.g., the third set of data elements 602C) with the fourth set of data elements. For example, the instruction could be a destructive multiply-accumulate instruction that may have a single argument identifying both the third set of data elements and the fourth set of data elements that will replace the third set of data elements. In some implementations, storing the fourth set of data elements may include storing the fourth set of data elements separately from the third set of data elements (e.g., the third set of data elements and the fourth set of data elements may be stored in different locations). For example, the instruction could be a non-destructive multiply and accumulate instruction that may have separate arguments respectively identifying the third set of data elements and the fourth set of data elements.

[0057] The fourth set of data elements may be stored as a one-dimensional array in the storage location. The fourth set of data elements may be stored via a data path of the asymmetric data paths, such as a fourth data path (e.g., the fourth data path 512) connected to the storage location. The width of the fourth data path may be greater than the width of the first data path and the width of the second data path. For example, executing the instruction may include storing the fourth set of data elements, as a one-dimensional array, via the fourth data path (e.g., a write port). For example, the fourth set of data elements may be stored by the vector ALU, connected to the storage location, via the fourth data path. Further, the width of the fourth data path may be greater than the width of the first data path and the width of the second data path. In some cases, the width of the fourth data path may be at least twice the width of the first data path and at least twice the width of the second data path. For example, the width of the fourth data path may be 512 bits, while the width of the first data path may be 256 bits, and the width of the second data path may be 256 bits. In some cases, the width of the fourth data path may be equal to the width of the third data path. For example, the width of the fourth data path may be 512 bits, while the width of the third data path may also be 512 bits.

[0058] The size of data elements of the fourth set of data elements may be greater than the size of data elements of the first and second sets of data elements. In some cases, the size of data elements of the fourth set of data elements may be at least four times the size of data elements of the first and second sets of data elements. For example, data elements of the fourth set of data elements may be 32-bit integers, while data elements of the first and second sets of data elements may be 8 bit integers. In some cases, this may represent a quad- widening of the arithmetic values (e.g., data elements of the fourth set of data elements), but only a double-widening of data width (e.g., only double-widening of the fourth data path over the first data path or the second data path). This may be due to the fourth matrix being smaller (e.g., 4 rows by 4 columns) than the input matrices (e.g., the first matrix being 4 rows by 8 columns, and the second matrix being 8 rows by 4 columns).

[0059] FIG. 9 is an example of a geometric visualization of a sequence of matrix multiply-add operations that may be performed with a single instruction. In the geometric visualization, a block 900 of matrices may be multiplied and accumulated when executing multiple iterations of an instruction. For example, the block 900 of matrices may be multiplied and accumulated when executing multiple iterations of the instruction in the system 500. The block 900 of matrices may include multiple three dimensional arrays, such as three dimensional arrays 910A through 910H, where a single three dimensional array may be like the three dimensional array 610 in FIG. 6. A three dimensional array may represent one iteration of execution of the instruction, and a three dimensional array may be executed in one clock cycle. Thus, the block 900 of matrices may represent execution of the instruction multiple times, or through multiple iterations. For example, the block 900 of matrices may represent executing eight iterations of the instruction, for a block of eight three dimensional arrays 910A through 910H, in eight clock cycles. The size of the block may be configurable, such as 1, 2, 4, or 8 three dimensional arrays.

[0060] In the example, a group 912 of B matrices may be accessed. For example, the group 912 of B matrices may include eight sets of B matrices (e.g., B0 through B7 matrices, where the B0 matrix is like the B0 matrix shown in FIG. 6). In addition, a single A matrix 914 (e.g., matrix A) may be loaded. For example, the A matrix 914 may be like the A matrix shown in FIG. 6. The A matrix 914 may be part of a larger array of A matrices (e.g., an A0 matrix in a larger array of A0 through A7 matrices). In some implementations, the group 912 of B matrices and the A matrix 914 may be associated with parameters for a neural network. [0061] A first iteration of the instruction (e.g., the three dimensional array 910A) may be executed to matrix multiply the A matrix 914 by the B0 matrix to produce a matrix multiplication result, and to add the matrix multiplication result to a previous result (e.g., a CO-1 matrix) to produce a next result (e.g., a CO matrix). This may involve loading A, loading B0, and loading CO-1, via the first, second, and third data paths (e.g., the first, second, and third data paths 506, 508, and 510), matrix multiplying and adding to produce CO, and storing CO via the fourth data path (e.g., the fourth data path 512). A second iteration of the instruction (e.g., the three dimensional array 910B) may be executed to matrix multiply the A matrix 914 (e.g., the A matrix) by the B1 matrix to produce a matrix multiplication result, and to add the matrix multiplication result to the previous result (e.g., the CO matrix) to produce a next result (e.g., a Cl matrix). This may involve re-loading A, loading Bl, and loading CO, via the first, second, and third data paths, matrix multiplying and adding to produce Cl, and storing Cl via the fourth data path. A third iteration of the instruction (e.g., the three dimensional array 910C) may be executed to matrix multiply the A matrix 914 (e.g., the A matrix) by the B2 matrix to produce a matrix multiplication result, and to add the matrix multiplication result to the previous result (e.g., the Cl matrix) to produce a next result (e.g., a C2 matrix). This may involve re-loading A, loading B2, and loading Cl, via the first, second, and third data paths, matrix multiplying and adding to produce C2, and storing C2 via the fourth data path. This process may repeat for subsequent three dimensional arrays in the block 900 of matrices (e.g., for three dimensional arrays 910D through 910H). Thus, the instruction may be executed multiple times to perform multiple iterations of the loading and the matrix multiplying for the group 912 of B matrices. The first set of data elements (e.g., the A matrix 914) may be the same in the multiple iterations, and the second set of data elements (e.g., a B matrix of the group 912 of B matrices) and third set of data elements (e.g., the C matrices) may change in the multiple iterations.

[0062] In this way, the block 900 of matrices may be efficiently processed (e.g., multiplied and accumulated) via multiple iterations of the instruction. For example, the group of eight three dimensional arrays 910A through 91 OH may be processed by executing eight iterations of the instruction in eight clock cycles. In some implementations, the multiple iterations may be associated with calculating parameters for a neural network. When processing of the block 900 of matrices is complete, the block of B matrices may be accessed again for a next A matrix (e.g., a matrix Al).

[0063] FIG. 10 is a flow chart of an example of a process 1000 for executing a sequence of instructions, each itself a sequence of matrix multiply-add operations using asymmetric data paths, to perform a larger matrix multiply- accumulate operation. The multiple iterations of the instruction may be executed to multiply and accumulate a block of matrices. For example, the process 1000 may be implemented using the integrated circuit 110 of FIG. 1 or the integrated circuit 210 of FIG. 2. For example, the process 1000 may be implemented by the system 500 of FIG. 5. The process 1000 may include loading 1010 a group of B matrices for a given A matrix. For example, the group of B matrices and the A matrix may be loaded like the group of B matrices 912 and the A matrix 914 and of FIG. 9. In some implementations, the group of B matrices and the A matrix may be associated with parameters for a neural network.

[0064] The process 1000 may also include executing 1020 an iteration of an instruction to produce a result. The iteration may load first, second, and third sets of data elements (e.g., loading A, loading B0, and loading CO-1). The first, second, and third sets of data elements may be stored as one-dimensional arrays. The first, second, and third sets of data elements may be loaded via first, second, and third data paths (e.g., the first, second, and third data paths 506, 508, and 510). The width of the third data path may be greater than the width of the first data path and the width of the second data path (e.g., asymmetric data paths). The data elements of the first, second, and third sets of data elements may map to entries of the first, second, and third matrices (e.g., A, B0, and CO-1). The iteration may matrix multiply the first matrix by the second matrix to produce a matrix multiplication result. The matrix multiplication result may be added to the third matrix to produce a fourth matrix (e.g., CO). The entries of the fourth matrix may map to data elements of a fourth set of data elements. The fourth set of data elements may be the result. For example, the iteration may correspond to a three dimensional array like the three dimensional array 910A of FIG. 9.

[0065] The process 1000 may also include executing 1030 a next iteration of the instruction, using a previous result (e.g., CO), to produce a next result (e.g., Cl). The next iteration may load first, second, and third sets of data elements (e.g., re-loading A, loading Bl, and loading CO). The third set of data elements may be the previous result. The first, second, and third sets of data elements may be stored as one-dimensional arrays. The first, second, and third sets of data elements may be loaded via first, second, and third data paths. The width of the third data path may be greater than the width of the first data path and the width of the second data path. The data elements of the first, second, and third sets of data elements may map to entries of first, second, and third matrices. The next iteration may matrix multiply the first matrix by the second matrix to produce a matrix multiplication result. The matrix multiplication result may be added to the third matrix to produce a fourth matrix (e.g., Cl). The entries of the fourth matrix may map to data elements of a fourth set of data elements. The fourth set of data elements may be the next result. For example, the iteration may correspond to a three dimensional array like the three dimensional array 910B (and subsequent three dimensional arrays) of FIG. 9.

[0066] The process 1000 may also include determining 1040 whether a matrix is a last matrix in the block of matrices. If the matrix is not a last matrix in the block of matrices (e.g., “No”), the process 1000 may return to executing 1030 a next iteration of the instruction, using a previous result, to produce a next result (e.g., a next three dimensional array). If the matrix is a last matrix in the block of matrices (e.g., “Yes”), the process 1000 may continue by changing 1050 the A matrix for the block of B matrices (e.g., changing from matrix A0 to matrix Al). Using the next A matrix, the process 1000 may return to executing 1020 an iteration of the instruction to produce a result. This may continue until a group of blocks, associated with a larger multi -dimensional matrix, have been processed.

[0067] Some implementations may include a method that includes executing an instruction that includes: loading first, second, and third sets of data elements, stored as one dimensional arrays, via first, second, and third data paths, wherein the width of the third data path is greater than the width of the first data path and the width of the second data path, and wherein data elements of the first, second, and third sets of data elements map to entries of first, second, and third matrices; and matrix multiplying the first matrix by the second matrix to produce a matrix multiplication result, and adding the matrix multiplication result to the third matrix to produce a fourth matrix, wherein entries of the fourth matrix map to data elements of a fourth set of data elements. In some implementations, the method may include storing the fourth set of data elements, as a one-dimensional array, via a fourth data path, wherein the width of the fourth data path is greater than the width of the first data path and the width of the second data path. In some implementations, the size of data elements of the fourth set of data elements is at least four times the size of data elements of the first and second sets of data elements. In some implementations, data elements of the fourth set of data elements are 32-bit integers, and data elements of the first and second sets of data elements are 8-bit integers. In some implementations, the width of the third data path is at least twice the width of the first data path and at least twice the width of the second data path. In some implementations, entries of the first, second, third, and fourth matrices are in rows and columns with the first matrix having 4 rows and 8 columns, the second matrix having 8 rows and 4 columns, the third matrix having 4 rows and 4 columns, and the fourth matrix having 4 rows and 4 columns. In some implementations, the method may include storing the fourth set of data elements for a first calculation while loading first, second, and third sets of data elements for a second calculation. In some implementations, the method may include executing the instruction via a vector ALU, wherein the first, second, and third sets of data elements are stored as one-dimensional arrays in a vector register file. In some implementations, the method may include executing the instruction multiple times to perform multiple iterations of the loading and the matrix multiplying, wherein the first set of data elements is the same in the multiple iterations, and wherein the second and third sets of data elements change in the multiple iterations. In some implementations, the method may include executing the instruction to calculate parameters for a neural network.

[0068] Some implementations may include an apparatus that includes a processor core configured to execute an instruction that includes: loading first, second, and third sets of data elements, stored as one-dimensional arrays, via first, second, and third read ports, wherein the width of the third read port is greater than a width of the first read port and the width of the second read port, and wherein data elements of the first, second, and third sets of data elements map to entries of first, second, and third matrices; and matrix multiplying the first matrix by the second matrix to produce a matrix multiplication result, and adding the matrix multiplication result to the third matrix to produce a fourth matrix, wherein entries of the fourth matrix map to data elements of a fourth set of data elements. In some implementations, the instruction further includes storing the fourth set of data elements, as a one-dimensional array, via a write port, wherein the width of the write port is greater than the width of the first read port and the width of the second read port. In some implementations, the size of data elements of the fourth set of data elements is at least four times the size of data elements of the first and second sets of data elements. In some implementations, the width of the third read port is at least twice the width of the first read port and at least twice the width of the second read port. In some implementations, the processor core includes: a vector ALU configured to execute the instruction; and a vector register file configured to store the first, second, and third sets of data elements as one-dimensional arrays.

[0069] Some implementations may include a method that includes: executing multiple iterations of an instruction, wherein the instruction includes: loading first, second, and third sets of data elements, stored as one-dimensional arrays, via first, second, and third data paths, wherein the width of the third data path is greater than the width of the first data path and the width of the second data path, wherein data elements of the first set of data elements map to entries of an A matrix, wherein data elements of the second set of data elements map to entries of a B matrix of a group of B matrices, and wherein data elements of the third set of data elements map to entries of a C matrix; and matrix multiplying the A matrix by the B matrix of the group of B matrices to produce a matrix multiplication result, and adding the matrix multiplication result to the C matrix, wherein entries of the C matrix map to data elements of a fourth set of data elements, and wherein the second and third sets of data elements change in the multiple iterations. In some implementations, the instruction further includes storing the fourth set of data elements, as a one-dimensional array, via a fourth data path, wherein the width of the fourth data path is greater than the width of the first data path and the width of the second data path. In some implementations, the size of data elements of the fourth set of data elements is at least four times the size of data elements of the first and second sets of data elements. In some implementations, the width of the third data path is at least twice the width of the first data path and at least twice the width of the second data path. In some implementations, the instruction further includes storing the fourth set of data elements for a first iteration of the instruction while loading first, second, and third sets of data elements for a second iteration of the instruction.

[0070] While the disclosure has been described in connection with certain embodiments, it is to be understood that the disclosure is not to be limited to the disclosed embodiments but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims, which scope is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures.

Claims

What is claimed is:

1. A method comprising : executing an instruction that includes: loading first, second, and third sets of data elements, stored as one-dimensional arrays, via first, second, and third data paths, wherein the width of the third data path is greater than the width of the first data path and the width of the second data path, and wherein data elements of the first, second, and third sets of data elements map to entries of first, second, and third matrices; and matrix multiplying the first matrix by the second matrix to produce a matrix multiplication result, and adding the matrix multiplication result to the third matrix to produce a fourth matrix, wherein entries of the fourth matrix map to data elements of a fourth set of data elements.

2. The method of claim 1, further comprising: storing the fourth set of data elements, as a one-dimensional array, via a fourth data path, wherein the width of the fourth data path is greater than the width of the first data path and the width of the second data path.

3. The method of claim 1, wherein the size of data elements of the fourth set of data elements is at least four times the size of data elements of the first and second sets of data elements.

4. The method of claim 1, wherein data elements of the fourth set of data elements are 32-bit integers, and wherein data elements of the first and second sets of data elements are 8 -bit integers.

5. The method of claim 1, wherein the width of the third data path is at least twice the width of the first data path and at least twice the width of the second data path.

6. The method of claim 1, wherein entries of the first, second, third, and fourth matrices are in rows and columns with the first matrix having 4 rows and 8 columns, the second matrix having 8 rows and 4 columns, the third matrix having 4 rows and 4 columns, and the fourth matrix having 4 rows and 4 columns.

7. The method of claim 1, further comprising: storing the fourth set of data elements for a first calculation while loading first, second, and third sets of data elements for a second calculation.

8. The method of claim 1, further comprising: executing the instruction via a vector arithmetic logic unit (ALU), wherein the first, second, and third sets of data elements are stored as one-dimensional arrays in a vector register file.

9. The method of claim 1, further comprising: executing the instruction multiple times to perform multiple iterations of the loading and the matrix multiplying, wherein the first set of data elements is the same in the multiple iterations, and wherein the second and third sets of data elements change in the multiple iterations.

10. The method of claim 1, further comprising: executing the instruction to calculate parameters for a neural network.

11. An apparatus comprising: a processor core configured to execute an instruction that includes: loading first, second, and third sets of data elements, stored as one-dimensional arrays, via first, second, and third read ports, wherein the width of the third read port is greater than a width of the first read port and the width of the second read port, and wherein data elements of the first, second, and third sets of data elements map to entries of first, second, and third matrices; and matrix multiplying the first matrix by the second matrix to produce a matrix multiplication result, and adding the matrix multiplication result to the third matrix to produce a fourth matrix, wherein entries of the fourth matrix map to data elements of a fourth set of data elements.

12. The apparatus of claim 11, wherein the instruction further includes: storing the fourth set of data elements, as a one-dimensional array, via a write port, wherein the width of the write port is greater than the width of the first read port and the width of the second read port.

13. The apparatus of claim 11, wherein the size of data elements of the fourth set of data elements is at least four times the size of data elements of the first and second sets of data elements.

14. The apparatus of claim 11, wherein the width of the third read port is at least twice the width of the first read port and at least twice the width of the second read port.

15. The apparatus of claim 11, wherein the processor core includes: a vector ALU configured to execute the instruction; and a vector register file configured to store the first, second, and third sets of data elements as one-dimensional arrays.

16. A method comprising : executing multiple iterations of an instruction, wherein the instruction includes: loading first, second, and third sets of data elements, stored as one-dimensional arrays, via first, second, and third data paths, wherein the width of the third data path is greater than the width of the first data path and the width of the second data path, wherein data elements of the first set of data elements map to entries of an A matrix, wherein data elements of the second set of data elements map to entries of a B matrix of a group of B matrices, and wherein data elements of the third set of data elements map to entries of a C matrix; and matrix multiplying the A matrix by the B matrix of the group of B matrices to produce a matrix multiplication result, and adding the matrix multiplication result to the C matrix, wherein entries of the C matrix map to data elements of a fourth set of data elements, and wherein the second and third sets of data elements change in the multiple iterations.

17. The method of claim 16, wherein the instruction further includes: storing the fourth set of data elements, as a one-dimensional array, via a fourth data path, wherein the width of the fourth data path is greater than the width of the first data path and the width of the second data path.

18. The method of claim 16, wherein the size of data elements of the fourth set of data elements is at least four times the size of data elements of the first and second sets of data elements.

19. The method of claim 16, wherein the width of the third data path is at least twice the width of the first data path and at least twice the width of the second data path.

20. The method of claim 16, wherein the instruction further includes: storing the fourth set of data elements for a first iteration of the instruction while loading first, second, and third sets of data elements for a second iteration of the instruction.