
WO2021057111A1 - Computing device and method, chip, electronic device, storage medium and program - Google Patents


Info

Publication number
WO2021057111A1
WO2021057111A1 PCT/CN2020/096384 CN2020096384W WO2021057111A1 WO 2021057111 A1 WO2021057111 A1 WO 2021057111A1 CN 2020096384 W CN2020096384 W CN 2020096384W WO 2021057111 A1 WO2021057111 A1 WO 2021057111A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
memory
instruction
multiply
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2020/096384
Other languages
English (en)
Chinese (zh)
Inventor
王维伟
罗飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Stream Computing Inc
Original Assignee
Stream Computing Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Stream Computing Inc
Publication of WO2021057111A1
Legal status: Ceased

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode

Definitions

  • the present disclosure relates to the technical field of data operations, and in particular to a computing device, a computing method, a chip, an electronic device, a computer-readable storage medium, and a computer program.
  • the chip is the cornerstone of data processing, and it fundamentally determines people's ability to process data. From the perspective of application fields, there are two main routes for chips. One is the general-purpose chip route, such as the central processing unit (CPU); such chips provide great flexibility, but their effective computing power when processing domain-specific algorithms is relatively low. The other is the dedicated chip route, such as the Tensor Processing Unit (TPU); such chips deliver higher effective computing power in certain specific fields, but in more general fields that are flexible and changeable, their processing capability is relatively poor, or they cannot handle the task at all.
  • if a single-core CPU implements matrix operations, it disassembles the matrix into scalars and implements the matrix-matrix multiply-accumulate operation by combining scalar instructions; if a multi-core CPU implements matrix operations, it may use multiple cores to execute their respective scalar instructions in parallel, and the results are combined to achieve the entire matrix-matrix multiply-accumulate operation.
  • the underlying program is complex, and generally requires multiple layers of loops to implement matrix and matrix multiplication and accumulation operations;
  • the CPU's cache is limited. Realizing relatively large matrices and matrix multiplication and accumulation operations requires multiple transfers from outside the chip, which affects efficiency;
  • the CPU needs to access data multiple times, which will increase the calculation time and calculation power consumption for matrix and matrix multiplication and accumulation;
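The scalar approach described above can be illustrated with a minimal sketch. It assumes the multiply-accumulate operation is the element-wise product of two equally shaped matrices followed by a sum down each column, which is the reading suggested by the embodiment later in this document; the function name and the list-of-rows representation are hypothetical and chosen only for illustration.

```python
def scalar_multiply_accumulate(a, b):
    """Multiply two M x N matrices element by element and sum each column.

    Every product and every addition is a separate scalar step, which is
    why a scalar-only program needs nested loops.
    """
    rows, cols = len(a), len(a[0])
    out = [0] * cols
    for j in range(cols):          # outer loop: one output element per column
        for i in range(rows):      # inner loop: accumulate down the column
            out[j] += a[i][j] * b[i][j]
    return out


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
result = scalar_multiply_accumulate(A, B)
print(result)  # [26, 44]
```

Each of the M x N products here costs at least two loads, one multiply and one add, which is the repeated data access the text identifies as a source of extra time and power consumption.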
  • the GPU disassembles the matrix and matrix multiply-accumulate operations into multiple instruction operations. These instructions are mainly vector instructions.
  • the matrix and matrix multiply-accumulate operations are implemented by combining and executing vector instructions.
  • the underlying program is complex, and generally requires multiple layers of loops to implement matrix and matrix multiplication and accumulation operations;
  • the GPU needs to access the data multiple times, which will increase the calculation time and calculation power consumption for matrix and matrix multiplication and accumulation;
  • the GPU cache is limited, and the implementation of relatively large matrices and matrix multiplication and accumulation operations requires multiple transfers from outside the chip, which affects efficiency.
  • the present disclosure aims to solve at least one of the technical problems existing in the prior art, and provides a computing device, a computing method, a chip, an electronic device, a computer-readable storage medium, and a computer program.
  • a computing device including:
  • an instruction fetch unit, configured to fetch a multiply-accumulate instruction from a memory, the multiply-accumulate instruction including an instruction name, a destination address register, a first source address register, and a second source address register, and the multiply-accumulate instruction being a single instruction;
  • a decoding unit, configured to decode the multiply-accumulate instruction; and
  • an execution unit, configured to execute the decoded multiply-accumulate instruction so as to read first data from the memory as indicated by the first source address register, read second data from the memory as indicated by the second source address register, perform a multiply-accumulate operation on the first data and the second data according to the multiply-accumulate instruction, and save the result of the multiply-accumulate operation to the memory location indicated by the destination address register, wherein at least one of the first data and the second data is a matrix.
  • the instruction fetching unit only fetches the multiply-accumulate instruction from the memory.
  • the multiply-accumulate instruction is a single instruction, and the execution unit can complete the multiply-accumulate operation of the matrix and the matrix (or vector, or scalar) according to the instruction.
  • the underlying program is very simple.
  • the whole hardware circuit can be designed to realize the multiplication and accumulation operation of a complete matrix and matrix (or vector or scalar), which can greatly improve the operation efficiency and calculation speed.
  • all data will only be read from the memory once, and the intermediate data will not be stored in the memory, which can greatly save calculation time and reduce power consumption.
  • if the instruction format adopts the RISC-V instruction format, the versatility of the instruction can be improved, and the size of the input matrix can be flexibly configured.
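A small software model may help make the single-instruction idea concrete. The sketch below is only an illustration under several assumptions: the mnemonic "mac.mm", the register indices, the flat list used as memory, the row-major layout, and passing the matrix shape directly to `execute` are all hypothetical (the disclosure keeps data attributes in a custom register), and the operation is taken to be an element-wise product summed down each column.

```python
from dataclasses import dataclass


@dataclass
class MacInstruction:
    name: str  # instruction name (hypothetical mnemonic)
    rd: int    # index of the destination address register
    rs1: int   # index of the first source address register
    rs2: int   # index of the second source address register


def execute(instr, regs, memory, rows, cols):
    """Read both operands once, multiply-accumulate, write the result."""
    base_a = regs[instr.rs1]    # first address of the first data
    base_b = regs[instr.rs2]    # first address of the second data
    base_out = regs[instr.rd]   # first address of the output result
    for j in range(cols):
        acc = 0                 # intermediate data never goes back to memory
        for i in range(rows):
            offset = i * cols + j   # assumed row-major layout
            acc += memory[base_a + offset] * memory[base_b + offset]
        memory[base_out + j] = acc


memory = [0] * 16
memory[0:4] = [1, 2, 3, 4]      # first input matrix, 2 x 2
memory[4:8] = [5, 6, 7, 8]      # second input matrix, 2 x 2
regs = {1: 0, 2: 4, 3: 8}       # address registers hold first addresses
execute(MacInstruction("mac.mm", rd=3, rs1=1, rs2=2), regs, memory, rows=2, cols=2)
out = memory[8:10]
print(out)  # [26, 44]
```

The point of the sketch is that the program consists of one instruction: the loops live inside the execution unit rather than in the instruction stream.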
  • the execution unit includes a control unit and an arithmetic unit array, and each arithmetic unit in the arithmetic unit array includes an output register, a first input register, and a second input register;
  • the performing a multiply-accumulate operation on the first data and the second data according to the multiply-accumulate instruction includes:
  • the control unit is configured to sequentially distribute the read first data to the first input registers of the arithmetic units participating in the operation in the arithmetic unit array in a preset distribution manner;
  • the control unit is further configured to sequentially distribute the read second data to the second input registers of the arithmetic units participating in the operation in the arithmetic unit array in a preset distribution manner;
  • each of the arithmetic units participating in the operation multiplies the first data stored in its first input register by the second data stored in its second input register to obtain a first result; the arithmetic units participating in the operation perform the accumulation operation in a preset accumulation manner to obtain the final accumulation result; and the arithmetic unit that obtains the final accumulation result transmits the final multiply-accumulate result to the memory location indicated by the destination address register.
  • the preset distribution manner includes:
  • when the first data is a matrix and the second data is a matrix, the first matrix data and the second matrix data located at the same row and column position in the first data and the second data are allocated to the first input register and the second input register, respectively, of the same arithmetic unit; or,
  • when the second data is a vector, the vector is copied in the row or column direction into the same shape as the first data, and the first data and the copied second data are allocated to the first input register and the second input register, respectively, of the same arithmetic unit; or,
  • when the second data is a scalar, the scalar is copied into the same shape as the first data, and the first data and the copied second data are allocated to the first input register and the second input register, respectively, of the same arithmetic unit.
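The three distribution cases can be sketched as follows. This is an illustrative model, not the disclosed circuit: the function names are hypothetical, and the meaning of the `direction` argument (a "row" vector supplies one value per column and is repeated for every row; a "column" vector supplies one value per row) is an assumed reading of "copied in the row or column direction".

```python
def expand_second(first, second, direction="row"):
    """Copy a vector or scalar into the same shape as the first data."""
    rows, cols = len(first), len(first[0])
    if isinstance(second, (int, float)):             # scalar case
        return [[second] * cols for _ in range(rows)]
    if second and isinstance(second[0], list):       # matrix case: use as is
        return second
    if direction == "row":                           # vector repeated per row
        if len(second) != cols:
            raise ValueError("row vector length must equal column count")
        return [list(second) for _ in range(rows)]
    if len(second) != rows:                          # vector repeated per column
        raise ValueError("column vector length must equal row count")
    return [[second[i]] * cols for i in range(rows)]


def allocate(first, second, direction="row"):
    """Return, per arithmetic unit, the pair (first input, second input)."""
    expanded = expand_second(first, second, direction)
    return [
        [(first[i][j], expanded[i][j]) for j in range(len(first[0]))]
        for i in range(len(first))
    ]


first = [[1, 2], [3, 4]]
print(allocate(first, 10))                   # scalar copied everywhere
print(allocate(first, [10, 20], "row"))      # vector copied to every row
print(allocate(first, [10, 20], "column"))   # vector copied to every column
```

After expansion, every case reduces to the matrix-matrix case, so the arithmetic units themselves do not need to know the original operand type.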
  • the arithmetic unit participating in the operation performs an accumulation operation in a preset accumulation manner to obtain a final accumulation result, including:
  • the operation unit participating in the operation sequentially sends the first result of its own calculation to the next operation unit participating in the operation in the row direction to perform the accumulation calculation to obtain the final accumulation result;
  • the operation unit participating in the operation sequentially sends the first result of its own calculation to the next operation unit participating in the operation in the column direction to perform the accumulation calculation to obtain the final accumulation result;
  • the operation unit participating in the operation sequentially accumulates all the first results to obtain the final accumulation result.
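The three accumulation manners can be sketched as below. The mode names are hypothetical, and the mapping is an assumption based on the embodiment later in this document: "row direction" is read as passing partial sums from one row of units to the next (one sum per column, a row vector), and "column direction" as passing them from one column to the next (one sum per row, a column vector).

```python
def accumulate(first_results, mode):
    """Accumulate the per-unit products according to a preset manner."""
    if mode == "row_direction":
        # partial sums travel row to row: one result per column
        return [sum(column) for column in zip(*first_results)]
    if mode == "column_direction":
        # partial sums travel column to column: one result per row
        return [sum(row) for row in first_results]
    if mode == "all":
        # every first result is added into a single scalar
        return sum(sum(row) for row in first_results)
    raise ValueError(f"unknown accumulation mode: {mode}")


products = [[5, 12], [21, 32]]   # element-wise products of two 2 x 2 matrices
print(accumulate(products, "row_direction"))     # [26, 44]
print(accumulate(products, "column_direction"))  # [17, 53]
print(accumulate(products, "all"))               # 70
```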
  • the indication of the first source address register includes: the first address of the first data in the memory;
  • the indication of the second source address register includes: the first address of the second data in the memory;
  • the indication of the destination address register includes: the first address of the output result in the memory.
  • the execution unit reading the first data from the memory according to the instruction of the first source address register includes:
  • the execution unit is further configured to read the first data from the memory according to the first address of the first data in the memory and the attribute of the first data; and,
  • the execution unit reading the second data from the memory according to the instruction of the second source address register includes:
  • the execution unit is further configured to read the second data from the memory according to the first address of the second data in the memory and the attribute of the second data;
  • the saving the output result to the memory indicated by the destination address register includes:
  • the execution unit is further configured to save the output result in the memory according to the first address of the output result in the memory and the attribute of the output result.
  • it further includes a custom register, the custom register being used to store the attributes of the first data, the attributes of the second data, and the attributes of the output result;
  • the data attributes include data shape and data row-column direction interval
  • the output result attributes include output length
  • each arithmetic unit includes a slicing subunit and a judging subunit;
  • the judging subunit is used to judge whether the shape of the first data or the second data exceeds the shape of the arithmetic unit array;
  • if it does, the slicing subunit segments the first data or the second data.
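The judge-then-slice step can be sketched as follows. The disclosure does not say how the data is segmented, so the tiling scheme here (fixed-size blocks no larger than the array, scanned row by row) and the function name are assumptions made purely for illustration.

```python
def split_into_tiles(rows, cols, array_rows, array_cols):
    """Return (row_start, row_end, col_start, col_end) blocks that fit the array."""
    # judging step: does the data shape exceed the arithmetic unit array?
    if rows <= array_rows and cols <= array_cols:
        return [(0, rows, 0, cols)]
    # slicing step: cut the data into blocks no larger than the array
    tiles = []
    for r in range(0, rows, array_rows):
        for c in range(0, cols, array_cols):
            tiles.append((r, min(r + array_rows, rows),
                          c, min(c + array_cols, cols)))
    return tiles


print(split_into_tiles(2, 2, 4, 4))  # fits: a single block
print(split_into_tiles(5, 3, 2, 2))  # too large: six blocks
```

Each block can then be distributed to the array in turn; how partial results from different blocks are combined depends on the accumulation manner and is not modelled here.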
  • Another aspect of the present disclosure provides a calculation method, including:
  • fetching a multiply-accumulate instruction from a memory, the multiply-accumulate instruction including an instruction name, a destination address register, a first source address register, and a second source address register, and the multiply-accumulate instruction being a single instruction;
  • executing the decoded multiply-accumulate instruction so as to read the first data from the memory as indicated by the first source address register and read the second data from the memory as indicated by the second source address register;
  • performing a multiply-accumulate operation on the first data and the second data according to the multiply-accumulate instruction, and saving the result of the multiply-accumulate operation to the memory location indicated by the destination address register, wherein at least one of the first data and the second data is a matrix.
  • the performing a multiply-accumulate operation on the first data and the second data according to the multiply-accumulate instruction includes:
  • each of the arithmetic units participating in the operation multiplies the first data stored in its first input register by the second data stored in its second input register to obtain a first result; the arithmetic units participating in the operation perform the accumulation operation in a preset accumulation manner to obtain the final accumulation result; and the arithmetic unit that obtains the final accumulation result transmits the final multiply-accumulate result to the memory location indicated by the destination address register.
  • the preset distribution manner includes:
  • when the first data is a matrix and the second data is a matrix, the first matrix data and the second matrix data located at the same row and column position in the first data and the second data are allocated to the first input register and the second input register, respectively, of the same arithmetic unit; or,
  • when the second data is a vector, the vector is copied in the row or column direction into the same shape as the first data, and the first data and the copied second data are allocated to the first input register and the second input register, respectively, of the same arithmetic unit; or,
  • when the second data is a scalar, the scalar is copied into the same shape as the first data, and the first data and the copied second data are allocated to the first input register and the second input register, respectively, of the same arithmetic unit.
  • the arithmetic unit participating in the operation performs an accumulation operation in a preset accumulation manner to obtain a final accumulation result, including:
  • the operation unit participating in the operation sequentially sends the first result of its own calculation to the next operation unit participating in the operation in the row direction to perform the accumulation calculation to obtain the final accumulation result;
  • the operation unit participating in the operation sequentially sends the first result of its own calculation to the next operation unit participating in the operation in the column direction to perform the accumulation calculation to obtain the final accumulation result;
  • the operation unit participating in the operation sequentially accumulates all the first results to obtain the final accumulation result.
  • the indication of the first source address register includes: the first address of the first data in the memory;
  • the indication of the second source address register includes: the first address of the second data in the memory;
  • the indication of the destination address register includes: the first address of the output result in the memory.
  • the reading the first data from the memory according to the instruction of the first source address register includes:
  • the reading the second data from the memory according to the instruction of the second source address register includes:
  • the saving the output result to the memory indicated by the destination address register includes:
  • the output result is stored in the memory according to the first address of the output result in the memory and the output result attribute.
  • the data attribute includes a data shape and an interval in a data row and column direction
  • the output result attribute includes an output length
  • the method further includes:
  • the shape of the first data or the second data is segmented.
  • Another aspect of the present disclosure provides a chip including the aforementioned computing device.
  • Another aspect of the present disclosure provides an electronic device, including:
  • one or more processors; and
  • a storage unit for storing one or more programs which, when executed by the one or more processors, enable the one or more processors to implement the calculation method described above.
  • Another aspect of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the calculation method according to the foregoing description can be realized.
  • Another aspect of the present disclosure provides a computer program that, when executed by a processor, can implement the calculation method described above.
  • the instruction fetching unit only fetches the multiply-accumulate instruction from the memory.
  • the multiply-accumulate instruction is a single instruction, and the execution unit can complete the multiply-accumulate operation of a matrix with a matrix (or a vector, or a scalar) according to the instruction, so the underlying program is very simple.
  • the whole hardware circuit can be designed to realize the multiplication and accumulation operation of a complete matrix and matrix (or vector or scalar), which can greatly improve the operation efficiency and calculation speed.
  • all data will only be read from the memory once, and the intermediate data will not be stored in the memory, which can greatly save calculation time and reduce power consumption.
  • if the instruction format adopts the RISC-V instruction format, the versatility of the instruction can be improved, and the size of the input matrix can be flexibly configured.
  • FIG. 1 is a schematic diagram of the structure of a computing device in the first embodiment of the disclosure
  • FIG. 2 is a schematic diagram of a multiply-accumulate operation in the second embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of the structure of the execution unit in the third embodiment of the disclosure.
  • FIG. 4 is a schematic diagram of the structure of the arithmetic unit array in the fourth embodiment of the disclosure.
  • FIG. 5 is a schematic diagram of the structure of the arithmetic unit in the fifth embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of the function of multiplying and accumulating in the sixth embodiment of the present disclosure.
  • FIG. 7 is a schematic diagram of data storage in the memory in the seventh embodiment of the present disclosure.
  • FIG. 8 is a flowchart of the calculation method in the eighth embodiment of the disclosure.
  • one aspect of the present disclosure relates to a computing device, including an instruction fetching unit 110, a decoding unit 120, and an execution unit 130.
  • the decoding unit 120 and the execution unit 130 may be combined to form a single unit.
  • the instruction fetch unit 110 is used to fetch a multiply-accumulate instruction from the memory 200.
  • the multiply-accumulate instruction includes an instruction name, a destination address register, a first source address register, and a second source address register.
  • the multiply-accumulate instruction is a single instruction.
  • the instruction fetch unit 110 can fetch the multiply-accumulate instruction from the memory 200 in program order, as indicated by a program counter (PC), and the multiply-accumulate instruction can complete the subsequent matrix operation.
  • the instruction fetch unit 110 can fetch the multiply-accumulate instruction from the memory 200 in one processing cycle; one clock cycle may contain one processing cycle, one clock cycle may contain multiple processing cycles, or multiple clock cycles may form one processing cycle, and so on.
  • after the instruction fetch unit 110 fetches the multiply-accumulate instruction from the memory 200, it provides the instruction to the decoding unit 120, and the decoding unit 120 decodes the received instruction so that the execution unit 130 can recognize and execute the multiply-accumulate instruction.
  • the format of the multiply-accumulate instruction can choose to follow the RISC-V instruction format, as shown in Table 1 below:
  • .mm indicates the suffix of the instruction.
  • the suffix can be used to distinguish the type of operand of the multiply-accumulate instruction. It can be defined according to actual needs.
  • the destination address register is the destination vector address register.
  • the destination address register can also be a destination scalar address register.
  • the first data and the second data are both matrices, and the output result is an output vector. That is, the first data is the data of the first input matrix, and the second data is the data of the second input matrix.
  • correspondingly, the first source address register corresponding to the first data is the first matrix address register, the second source address register corresponding to the second data is the second matrix address register, and the destination address register is the destination vector address register.
  • after receiving the decoded multiply-accumulate instruction, the execution unit 130 executes it; that is, the execution unit 130 reads the data of the first input matrix (input matrix one) from the memory 200 as indicated by the first matrix address register.
  • the execution unit 130 reads the data of the second input matrix (input matrix two) from the memory 200 as indicated by the second matrix address register.
  • the execution unit 130 performs the multiply-accumulate operation on the data of the first input matrix and the data of the second input matrix according to the instruction to obtain an output vector, and saves the output vector to the location in the memory 200 indicated by the destination vector address register.
  • the execution unit 130 may include one or more of various unit circuits, for example, an arithmetic operation unit, a logic operation unit, a floating-point operation unit, and a data access unit.
  • one or more of these circuits are used to complete matrix operations.
  • the unit circuits used in different instructions are not necessarily the same, and may also be a combination of multiple unit circuits.
  • the matrix and matrix multiply-accumulate instructions only need to use the arithmetic unit circuit. It is not difficult to understand that, in addition to some unit circuits as shown in FIG. 3, the execution unit 130 can also add or delete some unit circuits according to actual needs.
  • the specific structure of the memory 200 is not limited.
  • the memory 200 may be a random access memory (RAM), a read-only memory (ROM), a flash memory, a first-in first-out memory (FIFO), a first-in last-out memory (FILO), and so on.
  • the multiply-accumulate instruction and data can share one memory or be stored in different memories, which can be determined according to actual needs.
  • the computing device may include some other functional modules in addition to the aforementioned structures.
  • the computing device may also include a control unit 140, which is connected to the instruction fetch unit 110, the decoding unit 120, and the execution unit 130, respectively.
  • the control unit 140 can control the operating states of the instruction fetch unit 110, the decoding unit 120, and the execution unit 130 according to a clock cycle, a clock signal, or a control signal.
  • the instruction fetch unit only fetches the multiply-accumulate instruction from the memory.
  • the multiply-accumulate instruction is a single instruction, and the execution unit can complete the multiply-accumulate operation of the matrix and the matrix (or vector, or scalar) according to the instruction.
  • the underlying program is very simple.
  • the whole hardware circuit can be designed to realize the multiplication and accumulation operation of a complete matrix and matrix (or vector or scalar), which can greatly improve the operation efficiency and calculation speed.
  • all data will only be read from the memory once, and the intermediate data will not be stored in the memory, which can greatly save calculation time and reduce power consumption.
  • if the instruction format adopts the RISC-V instruction format, the versatility of the instruction can be improved, and the size of the input matrix can be flexibly configured.
  • the execution unit 130 includes a control unit and an arithmetic unit array $PU_{1,1}, PU_{1,2}, \ldots, PU_{1,N}, PU_{2,1}, PU_{2,2}, \ldots, PU_{2,N}, \ldots, PU_{M,1}, PU_{M,2}, \ldots, PU_{M,N}$; each arithmetic unit of the arithmetic unit array includes an output register $Rout$, a first input register $Rin_1$ and a second input register $Rin_2$.
  • the control unit is configured to sequentially distribute the read first input matrix data to the first input registers $Rin_1$ of the arithmetic units participating in the operation in the arithmetic unit array in a preset distribution manner.
  • the control unit is also used to sequentially distribute the read second input matrix data to the second input registers $Rin_2$ of the arithmetic units participating in the operation in the arithmetic unit array in a preset distribution manner.
  • each of the arithmetic units participating in the operation multiplies the first data stored in its first input register $Rin_1$ by the second data stored in its second input register $Rin_2$ to obtain the first result.
  • the arithmetic units perform the accumulation operation in the preset accumulation manner to obtain the final accumulation result, and the arithmetic unit that obtains the final accumulation result transmits the final multiply-accumulate result to the location in the memory 200 indicated by the destination vector address register. Because the shape of the arithmetic unit array may differ from the shape of the input matrix, the arithmetic units that participate in the operation vary with the shape of the input matrix.
  • the data of the first input matrix M1 are $a_{11}, a_{12}, \ldots, a_{1N}, a_{21}, a_{22}, \ldots, a_{2N}, \ldots, a_{M1}, a_{M2}, \ldots, a_{MN}$.
  • the data of the second input matrix M2 are $b_{11}, b_{12}, \ldots, b_{1N}, b_{21}, b_{22}, \ldots, b_{2N}, \ldots, b_{M1}, b_{M2}, \ldots, b_{MN}$.
  • the shape of the arithmetic unit array is greater than or equal to that of the input matrices; that is, the arithmetic unit array has X rows and Y columns, where X is greater than or equal to M, Y is greater than or equal to N, and X, Y, M, and N are all positive integers.
  • the control unit assigns the data $a_{11}, a_{12}, \ldots, a_{1N}, \ldots, a_{M1}, a_{M2}, \ldots, a_{MN}$ of the first input matrix M1 one by one, in sequence, to the arithmetic units $PU_{1,1}, PU_{1,2}, \ldots, PU_{1,N}, \ldots, PU_{M,1}, PU_{M,2}, \ldots, PU_{M,N}$; that is, $a_{11}$ is allocated to $PU_{1,1}$ and stored in the first input register $Rin_1$ of $PU_{1,1}$, $a_{12}$ is allocated to $PU_{1,2}$ and stored in the first input register $Rin_1$ of $PU_{1,2}$, and so on, until $a_{MN}$ is allocated to $PU_{M,N}$ and stored in the first input register $Rin_1$ of $PU_{M,N}$.
  • the control unit likewise distributes the data $b_{11}, b_{12}, \ldots, b_{1N}, b_{21}, b_{22}, \ldots, b_{2N}, \ldots, b_{M1}, b_{M2}, \ldots, b_{MN}$ of the second input matrix M2 one by one, in sequence, to the arithmetic units $PU_{1,1}, PU_{1,2}, \ldots, PU_{1,N}, \ldots, PU_{M,1}, PU_{M,2}, \ldots, PU_{M,N}$; that is, $b_{11}$ is allocated to $PU_{1,1}$ and stored in the second input register $Rin_2$ of $PU_{1,1}$, $b_{12}$ is allocated to $PU_{1,2}$ and stored in the second input register $Rin_2$ of $PU_{1,2}$, and so on, until $b_{MN}$ is allocated to $PU_{M,N}$ and stored in the second input register $Rin_2$ of $PU_{M,N}$.
  • each arithmetic unit $PU_{1,1}, PU_{1,2}, \ldots, PU_{M,N}$ multiplies the first-matrix data $a_{11}, a_{12}, \ldots, a_{MN}$ stored in its first input register $Rin_1$ by the second-matrix data $b_{11}, b_{12}, \ldots, b_{MN}$ stored in its second input register $Rin_2$ to obtain intermediate data $c_{11}, c_{12}, \ldots, c_{MN}$, where $c_{ij} = a_{ij} b_{ij}$; the intermediate data are stored in the output registers of $PU_{1,1}, PU_{1,2}, \ldots, PU_{M,N}$, respectively.
  • $PU_{1,1}, PU_{1,2}, \ldots, PU_{1,N}$ send their stored $c_{11}, c_{12}, \ldots, c_{1N}$ to the arithmetic units $PU_{2,1}, PU_{2,2}, \ldots, PU_{2,N}$ in the next row, which add them to their own stored $c_{21}, c_{22}, \ldots, c_{2N}$; $PU_{2,1}$ then sends its result $c_{11} + c_{21}$ to the arithmetic unit $PU_{3,1}$ in the next row, $PU_{2,2}$ sends its result $c_{12} + c_{22}$ to the arithmetic unit $PU_{3,2}$, and so on.
  • $PU_{M,1}, PU_{M,2}, \ldots, PU_{M,N}$ each receive the data of the arithmetic unit in the previous row and sum it with their own data.
  • the data of $PU_{M,1}, PU_{M,2}, \ldots, PU_{M,N}$ are combined to form the final output vector V, that is:

    $$V = \left[\ \sum_{i=1}^{M} c_{i1},\ \sum_{i=1}^{M} c_{i2},\ \ldots,\ \sum_{i=1}^{M} c_{iN}\ \right], \qquad c_{ij} = a_{ij}\, b_{ij}.$$
  • for the arithmetic unit PU 1,1 , the data in the first column of the first input matrix and the data in the first column of the second input matrix are sent to PU 1,1 in sequence, and PU 1,1 completes the multiplication and accumulation of the first-column data of the first input matrix and the second input matrix; and so on, until PU 1,N completes the multiplication and accumulation of the data in the Nth column of the first input matrix and the second input matrix.
  • the accumulation method can also be accumulation in the column direction to form a column vector, or adding all the data of the intermediate matrix, etc., which will not be described here.
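
The multiply step and the row-to-row accumulation described above can be sketched in software. This is only a minimal Python model, not the hardware itself; the function name `pu_array_mac` and the list-of-lists representation of the arithmetic unit array are assumptions made for this example.

```python
def pu_array_mac(m1, m2):
    """Model an M x N array of arithmetic units (hypothetical helper).

    Each unit PU[i][j] multiplies a[i][j] by b[i][j]; partial sums are then
    passed from each row to the next row, so the last row holds the vector V.
    """
    rows, cols = len(m1), len(m1[0])
    # Step 1: every unit multiplies its two input registers (Rin1 * Rin2).
    c = [[m1[i][j] * m2[i][j] for j in range(cols)] for i in range(rows)]
    # Step 2: each row sends its running sum to the next row, which adds its own c.
    running = c[0][:]
    for i in range(1, rows):
        running = [running[j] + c[i][j] for j in range(cols)]
    return running  # one element per column, as held by the last row of units


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8, 9], [1, 2, 3]]
v = pu_array_mac(a, b)  # [1*7 + 4*1, 2*8 + 5*2, 3*9 + 6*3]
```

In this model the intermediate products never leave the array, which mirrors the statement that intermediate data are kept in the output registers rather than written back to memory.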
  • the indication of the first matrix address register includes: the first address of the first input matrix in the memory.
  • the indication of the second matrix address register includes: the first address of the second input matrix in the memory.
  • the indication of the destination address register includes: the first address of the output vector in the memory.
  • the computing device further includes a custom register 150, and the custom register 150 is used to store the attributes of the first input matrix, the attributes of the second input matrix, and the attributes of the output vector.
  • the attributes of the input matrix may include the shape of the input matrix (for example, the shape of the input matrix is M × N, that is, M rows and N columns) and the row-direction and column-direction intervals of the input matrix, and the attributes of the output vector may include the length of the output vector.
  • the computing device may also include a vector register 160 and a general-purpose register 170.
  • the vector register 160 may be used for certain vector operations, and the destination vector address register, the first matrix address register, and the second matrix address register may all be general-purpose registers 170.
  • the execution unit 130 may read the first input matrix from the memory 200 according to the first address of the first input matrix in the memory 200, the shape of the first input matrix and the row-column spacing.
  • the execution unit 130 reads the data of the second input matrix from the memory 200 according to the first address of the second input matrix in the memory 200, the shape of the second input matrix, and the row-column direction interval.
  • the execution unit 130 also saves the output vector in the memory 200 according to the first address of the output vector in the memory 200 and the length of the output vector.
  • the execution unit 130 can continuously read the first input matrix data and the second input matrix data stored in the memory 200, and continuously store the output vector.
  • the execution unit 130 may also discontinuously read the first input matrix data and the second input matrix data stored in the memory 200, and may discontinuously store the output vector.
  • whether access or storage is continuous depends on the defined row-column direction interval.
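
As an illustration of how a first address, a shape, and row/column direction intervals can together determine which memory locations are read, the following Python sketch models memory as a flat list. Interpreting the "row-column direction interval" as element strides is an assumption, and the names `read_matrix` and `write_vector` are hypothetical.

```python
def read_matrix(memory, base, shape, row_stride, col_stride):
    """Gather a rows x cols matrix starting at `base` (hypothetical helper).

    row_stride / col_stride are the distances, in elements, between
    consecutive rows and consecutive columns in the flat memory.
    """
    rows, cols = shape
    return [
        [memory[base + i * row_stride + j * col_stride] for j in range(cols)]
        for i in range(rows)
    ]


def write_vector(memory, base, values):
    """Store an output vector contiguously starting at `base`."""
    for k, value in enumerate(values):
        memory[base + k] = value


memory = list(range(100, 120))
# Contiguous rows of 3 elements inside a buffer whose rows are 5 elements apart.
dense = read_matrix(memory, base=2, shape=(2, 3), row_stride=5, col_stride=1)
# Non-contiguous access: every second element, rows 8 elements apart.
sparse = read_matrix(memory, base=0, shape=(2, 2), row_stride=8, col_stride=2)
```

With a column interval of 1 and a row interval equal to the number of columns, the read is fully continuous; any other choice yields a discontinuous access pattern.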
  • each arithmetic unit further includes a slicing subunit and a judgment subunit.
  • the judgment subunit is used to judge whether the shape of the input matrix exceeds the shape of the arithmetic unit array.
  • the slicing subunit divides the input matrix.
  • the shape of the arithmetic unit array is fixed.
  • the shape of the arithmetic unit array is smaller than the shape of the input matrix; that is, the arithmetic unit array has X rows and Y columns, X is less than M, Y is less than N, and X, Y, M, and N are all positive integers.
  • the input matrix is divided into sub-matrices whose shapes are less than or equal to the shape of the arithmetic unit array.
  • the arithmetic unit array can perform operations on the sub-matrix, and the number of operations is equal to the number of divisions.
  • the calculation method of the sub-matrices after the division is the same as the calculation method of the aforementioned complete matrix, but it should be noted that if row accumulation is to be calculated, the matrix is divided by columns, and if column accumulation is to be calculated, the matrix is divided by rows, to facilitate the execution of accumulation.
  • the output vector is not described in detail here.
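
The division into sub-matrices can be illustrated with a short Python sketch. It assumes a general tiling into blocks of at most X rows and Y columns, with the partial results of tiles that cover the same columns added together; this combining step and the name `tiled_column_sums` are assumptions for illustration, not something the text specifies.

```python
def tiled_column_sums(m1, m2, x, y):
    """Multiply-accumulate two M x N matrices on an X x Y unit array.

    Returns the per-column sums of the element-wise products together with
    the number of passes, which equals the number of sub-matrices.
    """
    rows, cols = len(m1), len(m1[0])
    out = [0] * cols
    passes = 0
    for r0 in range(0, rows, x):          # split along rows
        for c0 in range(0, cols, y):      # split along columns
            passes += 1
            # One pass of the array over a tile no larger than X x Y.
            for i in range(r0, min(r0 + x, rows)):
                for j in range(c0, min(c0 + y, cols)):
                    out[j] += m1[i][j] * m2[i][j]
    return out, passes


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
sums, passes = tiled_column_sums(a, ones, x=2, y=2)
```

A 3 × 3 input on a 2 × 2 array needs four passes here, in line with the statement that the number of operations equals the number of divisions.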
  • the foregoing only specifically describes one case of the first source address register, the second source address register, and the destination address register.
  • when the first source address register, the second source address register, and the destination address register are in other cases, the calculation process can also refer to the related methods recorded in the previous section, which will not be repeated here.
  • the second aspect of the present disclosure provides a calculation method S100.
  • the calculation method S100 may use the calculation device described in the foregoing.
  • the calculation method S100 includes:
  • the multiply-accumulate instruction includes an instruction name, a destination address register, a first source address register, and a second source address register.
  • the multiply-accumulate instruction is a single instruction.
  • the calculation method of this embodiment only fetches the multiply-accumulate instruction from the memory.
  • the multiply-accumulate operation of the matrix and the matrix (or vector, or scalar, etc.) is completed according to the instruction, and the underlying program is very simple.
  • the whole hardware circuit can be designed to realize the multiplication and accumulation operation of complete matrix and matrix (or vector, or scalar, etc.), which can greatly improve the operation efficiency and calculation speed.
  • all data will only be read from the memory once, and the intermediate data will not be stored in the memory, which can greatly save calculation time and reduce power consumption.
  • because the instruction format adopts the RISC-V instruction format, the versatility of the instruction can be improved, and the size of the input matrix can be flexibly configured.
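
To make the idea of a single instruction carrying a destination address register and two source address registers more concrete, the sketch below packs three register numbers into a 32-bit word using the standard RISC-V R-type field layout. The text does not give the actual encoding: the choice of the R-type layout, the use of the custom-0 major opcode, and the `funct3`/`funct7` values of zero are all assumptions for illustration.

```python
CUSTOM0_OPCODE = 0b0001011  # RISC-V custom-0 major opcode (assumed choice)


def encode_r_type(rd, rs1, rs2, funct3=0, funct7=0, opcode=CUSTOM0_OPCODE):
    """Pack register numbers into a 32-bit R-type instruction word."""
    for reg in (rd, rs1, rs2):
        if not 0 <= reg < 32:
            raise ValueError("register number must be in 0..31")
    return (
        (funct7 << 25) | (rs2 << 20) | (rs1 << 15)
        | (funct3 << 12) | (rd << 7) | opcode
    )


def decode_r_type(word):
    """Unpack the R-type fields, as a decoding unit would."""
    return {
        "funct7": (word >> 25) & 0x7F,
        "rs2": (word >> 20) & 0x1F,
        "rs1": (word >> 15) & 0x1F,
        "funct3": (word >> 12) & 0x07,
        "rd": (word >> 7) & 0x1F,
        "opcode": word & 0x7F,
    }


# rd holds the output address, rs1 and rs2 hold the two input addresses.
word = encode_r_type(rd=10, rs1=11, rs2=12)
fields = decode_r_type(word)
```

Because the instruction only names registers, the matrix shape and intervals can be changed by rewriting the custom register contents without changing the instruction word.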
  • performing a multiply-accumulate operation on the first data and the second data according to the multiply-accumulate instruction includes:
  • the arithmetic units that participate in the operation multiply the allocated first data and second data to obtain first results and execute the accumulation operation in a preset accumulation mode to obtain the final accumulation result, and the arithmetic unit that obtains the final accumulation result transfers the final accumulation result to the memory indicated by the destination address register.
  • in order to store data, different registers can be set in the arithmetic units involved in the operation. For example, three general-purpose registers can be set in each arithmetic unit involved in the operation, namely the first input register, the second input register, and the output register.
  • other storage devices can also be provided, which are not specifically limited here.
  • the preset allocation method includes:
  • when the first data is a matrix and the second data is a matrix, the first matrix data and the second matrix data located at the same row and column positions in the first data and the second data are respectively allocated to the same arithmetic unit participating in the operation.
  • the vector in the second data is copied in the row or column direction into the same shape as the first data, and the first data and the copied second data are respectively allocated to the first input register and the second input register of the same arithmetic unit.
  • the scalar in the second data is copied to the same shape as the first data, and the first data and the copied second data are respectively allocated to the first input register and the second input register of the same arithmetic unit.
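
The three allocation cases above (matrix, vector, scalar) amount to expanding the second operand to the shape of the first before element-wise pairing. The Python sketch below shows one way to model this; the function name `expand_second_operand` and the convention that a flat list is a row vector while a list of one-element lists is a column vector are assumptions of this example.

```python
def expand_second_operand(first, second):
    """Copy `second` up to the shape of the matrix `first` (hypothetical helper)."""
    rows, cols = len(first), len(first[0])
    if not isinstance(second, list):
        # Scalar: copied to every position.
        return [[second] * cols for _ in range(rows)]
    if second and isinstance(second[0], list):
        if len(second) == rows and len(second[0]) == cols:
            return second  # already a matrix of the same shape
        if len(second) == rows and len(second[0]) == 1:
            # Column vector: copied along the column direction.
            return [[row[0]] * cols for row in second]
        raise ValueError("second operand shape does not match first operand")
    if len(second) == cols:
        # Row vector: copied along the row direction.
        return [second[:] for _ in range(rows)]
    raise ValueError("vector length does not match first operand")


a = [[1, 2, 3], [4, 5, 6]]
from_scalar = expand_second_operand(a, 2)
from_row = expand_second_operand(a, [1, 2, 3])
from_col = expand_second_operand(a, [[1], [2]])
```

After expansion, element (i, j) of each operand goes to the first and second input registers of the same arithmetic unit, exactly as in the matrix-matrix case.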
  • the different arithmetic units involved in the operation perform the accumulation operation in a preset accumulation manner to obtain the final accumulation result, including:
  • the different arithmetic units involved in the operation sequentially send their own first results in the row direction to the next arithmetic unit involved in the operation to perform accumulation, obtaining the final accumulation result; or,
  • the arithmetic units involved in the operation sequentially send their first results in the column direction to the next arithmetic unit involved in the operation to perform accumulation, obtaining the final accumulation result; or,
  • the arithmetic units involved in the operation sequentially accumulate all the first results to obtain the final accumulation result.
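
The three preset accumulation manners can be compared side by side in a small Python sketch. The mode names `"per_row"`, `"per_column"`, and `"all"` are labels chosen for this example and describe the shape of the result; they are assumptions and are not terms from the text.

```python
def accumulate(products, mode):
    """Reduce a matrix of first results according to a preset mode."""
    if mode == "per_row":
        # Results are passed along each row: one sum per row.
        return [sum(row) for row in products]
    if mode == "per_column":
        # Results are passed along each column: one sum per column.
        return [sum(col) for col in zip(*products)]
    if mode == "all":
        # Every first result is added into a single value.
        return sum(sum(row) for row in products)
    raise ValueError(f"unknown accumulation mode: {mode}")


first_results = [[1, 2, 3], [4, 5, 6]]
row_sums = accumulate(first_results, "per_row")
col_sums = accumulate(first_results, "per_column")
total = accumulate(first_results, "all")
```

The output length stored in the custom register would differ between the modes: the number of rows, the number of columns, or one.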
  • the indication of the first source address register includes: the first address of the first data in the memory.
  • the indication of the second source address register includes: the first address of the second data in the memory.
  • the indication of the destination address register includes: the first address of the output result in the memory.
  • reading the first data from the memory according to the instruction of the first source address register includes:
  • Reading the second data from the memory according to the instruction of the second source address register includes:
  • the data attributes include data shape and data row-column direction interval
  • the output result attributes include output length
  • the calculation method further includes:
  • the shape of the first data or the second data is segmented.
  • a third aspect of the present disclosure provides a chip including the computing device described in the foregoing.
  • the chip of this embodiment has the calculation device described above, and only fetches the multiply-accumulate instruction from the memory.
  • the instruction is a single instruction.
  • the matrix and matrix multiply-accumulate operations are completed according to the instruction, and the underlying program is very simple.
  • the whole hardware circuit can be designed to realize complete matrix and matrix multiplication and accumulation operations, which can greatly improve the calculation efficiency and calculation speed.
  • all data will only be read from the memory once, and the intermediate data will not be stored in the memory, which can greatly save calculation time and reduce power consumption.
  • the instruction format adopts the RISC-V instruction format, the versatility of the instruction can be improved, and the size of the input matrix can be flexibly configured.
  • a fourth aspect of the present disclosure provides an electronic device including:
  • one or more processors;
  • a storage unit used to store one or more programs.
  • when the one or more programs are executed by the one or more processors, the one or more processors can realize the calculation method according to the foregoing.
  • a fifth aspect of the present disclosure provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the calculation method according to the foregoing description can be realized.
  • the computer-readable medium may be included in the device, equipment, or system of the present disclosure, or may exist alone.
  • the computer-readable storage medium can be any tangible medium that contains or stores a program, and can be an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, an optical fiber, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.
  • the computer-readable storage medium may also include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Specific examples thereof include, but are not limited to, electromagnetic signals, optical signals, or any suitable combination thereof.
  • Another aspect of the present disclosure provides a computer program that, when executed by a processor, can implement the calculation method described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

The present invention relates to a computing device and method, a chip, an electronic device, and a storage medium. The device comprises an instruction fetch unit (110) configured to fetch a multiply-accumulate instruction from a memory (200), the multiply-accumulate instruction comprising an instruction name, a destination address register, a first source address register, and a second source address register; a decoding unit (120) configured to decode the multiply-accumulate instruction; and an execution unit (130) configured to execute the decoded multiply-accumulate instruction so as to read first data from the memory (200) according to the indication of the first source address register, read second data from the memory (200) according to the indication of the second source address register, perform a multiply-accumulate operation on the first data and the second data according to the multiply-accumulate instruction, and save the result of the multiply-accumulate operation in the memory (200) indicated by the destination address register, the first data and/or the second data being a matrix. The multiply-accumulate operation of a matrix and a matrix (or vector, or scalar) is completed according to a single multiply-accumulate instruction, the underlying program is simple, and operation efficiency and calculation speed are improved.
PCT/CN2020/096384 2019-09-29 2020-06-16 Computing device and method, chip, electronic device, storage medium, and program Ceased WO2021057111A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910934627.7A CN112579042B (zh) 2019-09-29 2019-09-29 Computing device and method, chip, electronic device, and computer-readable storage medium
CN201910934627.7 2019-09-29

Publications (1)

Publication Number Publication Date
WO2021057111A1 true WO2021057111A1 (fr) 2021-04-01

Family

ID=75111174

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/096384 Ceased WO2021057111A1 (fr) 2019-09-29 2020-06-16 Computing device and method, chip, electronic device, storage medium, and program

Country Status (2)

Country Link
CN (1) CN112579042B (fr)
WO (1) WO2021057111A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113867794A (zh) * 2021-09-08 2021-12-31 深圳市兆驰数码科技股份有限公司 RRU device delay and power optimization method, system, and storage medium
CN114265561A (zh) * 2021-12-24 2022-04-01 上海集成电路装备材料产业创新中心有限公司 Data read control method, chip, and medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
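CN114579188B (zh) * 2022-03-17 2025-02-07 成都启英泰伦科技有限公司 RISC-V vector memory-access processing system and processing method
CN118094074B (zh) * 2024-04-28 2024-07-23 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Method, apparatus, device, and storage medium for accumulating matrix multiplication results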

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739235A (zh) * 2008-11-26 2010-06-16 中国科学院微电子研究所 Processor device seamlessly chaining a 32-bit DSP with a general-purpose RISC CPU
CN101986264A (zh) * 2010-11-25 2011-03-16 中国人民解放军国防科学技术大学 Multifunctional floating-point multiply-add operation device for SIMD vector microprocessors
WO2018154268A1 (fr) * 2017-02-23 2018-08-30 Arm Limited Multiply-accumulate in a data processing apparatus
CN108701015A (zh) * 2017-11-30 2018-10-23 深圳市大疆创新科技有限公司 Computing device, chip, equipment, and related method for neural networks
CN109992743A (zh) * 2017-12-29 2019-07-09 华为技术有限公司 Matrix multiplier

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7395298B2 (en) * 1995-08-31 2008-07-01 Intel Corporation Method and apparatus for performing multiply-add operations on packed data
US7430578B2 (en) * 2001-10-29 2008-09-30 Intel Corporation Method and apparatus for performing multiply-add operations on packed byte data
US8478969B2 (en) * 2010-09-24 2013-07-02 Intel Corporation Performing a multiply-multiply-accumulate instruction
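CN106325812B (zh) * 2015-06-15 2019-03-08 华为技术有限公司 Processing method and apparatus for multiply-accumulate operations
CN111090467B (zh) * 2016-04-26 2025-05-27 中科寒武纪科技股份有限公司 Apparatus and method for performing matrix multiplication
CN109117947A (zh) * 2017-10-30 2019-01-01 上海寒武纪信息科技有限公司 Contour detection method and related products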

Also Published As

Publication number Publication date
CN112579042B (zh) 2024-04-19
CN112579042A (zh) 2021-03-30

Similar Documents

Publication Publication Date Title
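WO2021057111A1 (fr) Computing device and method, chip, electronic device, storage medium, and program
KR102869307B1 (ko) Mixed-precision NPU tile
CN112612521B (zh) Apparatus and method for performing matrix multiplication
CN102012893B (zh) Scalable vector operation device
CN109992743B (zh) Matrix multiplier
CN112784973B (zh) Convolution operation circuit, apparatus, and method
US8412917B2 (en) Data exchange and communication between execution units in a parallel processor
US10713059B2 (en) Heterogeneous graphics processing unit for scheduling thread groups for execution on variable width SIMD units
KR101703797B1 (ko) Functional unit with a tree structure for supporting vector sorting algorithms and other algorithms
CN104317768B (zh) Matrix multiplication acceleration method for CPU+DSP heterogeneous systems
CN106970896A (zh) Vectorized implementation method of two-dimensional matrix convolution for vector processors
CN105960630A (zh) Data processing apparatus and method for performing segmented operations
EP3842954A1 (fr) System and method for configurable systolic array with partial read/write
CN114503126B (zh) Matrix operation circuit, apparatus, and method
CN104346318B (zh) Matrix multiplication acceleration method for general-purpose multi-core DSPs
US10979337B2 (en) I/O routing in a multidimensional torus network
CN104615584B (zh) Vectorized computation method for solving large-scale triangular linear systems on GPDSP
CN104615516A (zh) Method for implementing a large-scale high-performance Linpack benchmark on GPDSP
CN112446007A (zh) Matrix operation method, operation device, and processor
CN112395548A (zh) Processor for dynamic programming by instructions and method of configuring the processor
CN102012802B (zh) Method and apparatus for data exchange in vector processors
US9565094B2 (en) I/O routing in a multidimensional torus network
Shinde et al. Architectures of Flynn’s taxonomy-A Comparison of Methods
CN102004672B (zh) Reduction device with a configurable reduction-target auto-increment interval
WO2024250758A1 (fr) Data processing method for complex data and related device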

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20867104

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20867104

Country of ref document: EP

Kind code of ref document: A1