WO2025086399A1 - Processing method for matrix multiplication in parallel computing hardware, and related device
- Publication number
- WO2025086399A1 (application PCT/CN2023/135478)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- matrix
- precision
- difference
- initial
- target
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the present application relates to the field of artificial intelligence technology, and in particular to a processing method and related equipment for matrix multiplication operations in parallel computing hardware.
- a computing chip that supports single-precision calculation can be used for the calculation, but such a chip is relatively costly.
- the single-precision matrices to be multiplied are usually converted into half-precision matrices, which are then multiplied, but the accuracy of the result obtained by this solution is low.
- the embodiments of the present application provide a method and related equipment for processing matrix multiplication operations in parallel computing hardware, which can improve the accuracy of single-precision matrix multiplication operations when using a computing chip that supports half-precision calculations.
- a first aspect of an embodiment of the present application proposes a method for processing matrix multiplication operations in parallel computing hardware, comprising:
- obtaining a first initial matrix and a second initial matrix; wherein the first initial matrix and the second initial matrix are both single-precision matrices;
- obtaining a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix includes:
- the accumulating the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first single-precision target matrix includes:
- the product of the first half-precision matrix and the second half-precision matrix and the intermediate reduction matrix are accumulated to obtain the first single-precision target matrix.
- obtaining the first initial matrix and the second initial matrix includes:
- the first target matrix is divided to obtain at least one first initial matrix
- the second target matrix is divided to obtain at least one second initial matrix
- when there are multiple first initial matrices and second initial matrices, the method further includes:
- each of the matrix sequences includes a plurality of the matrix groups, and each of the matrix groups includes the first initial matrix and the second initial matrix;
- a second single-precision target matrix is obtained according to the first accumulation matrix and the second accumulation matrix, and the second single-precision target matrix is used as a result of matrix multiplication operation of the first target matrix and the second target matrix.
- the method further comprises:
- the test result is compared with a preset test threshold, and the second single-precision target matrix is output based on the comparison result.
- a matrix multiplication operation device comprising:
- An acquisition module used to acquire a first initial matrix and a second initial matrix; wherein the first initial matrix and the second initial matrix are both single-precision matrices;
- FIG. 12 is a schematic diagram of another improved flow chart of the matrix block multiplication operation provided in an embodiment of the present application.
- FIG. 16 is a simulation diagram comparing the accuracy of a method for processing matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
- FIG. 17 is a simulation diagram comparing the working efficiency of a method for processing matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
- FIG. 18 is a simulation data table diagram of the working efficiency of a method for processing matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
- FIG. 19 is another accuracy comparison simulation diagram of a method for processing matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
- FIG. 20 is a schematic diagram of the structure of a matrix multiplication operation device provided in one embodiment of the present application.
- FIG. 21 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
- Single-precision matrix multiplication refers to the operation of multiplying two matrices of single-precision floating-point type.
- Single-precision floating-point type usually uses 32-bit binary numbers to represent a floating-point number, of which 1 bit is used for the sign bit, 8 bits are used for the exponent, and 23 bits are used for the mantissa. Assuming that the size of matrix A is m ⁇ n and the size of matrix B is n ⁇ p, then the size of their product C is m ⁇ p, where the value of each element C[i][j] of matrix C is the sum of the products of the elements in the i-th row of matrix A and the elements in the j-th column of matrix B.
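- as a concrete illustration of this definition (a minimal Python sketch written for this description, not code from the patent), each element C[i][j] is accumulated as the sum of products of row i of A and column j of B, and the result is checked against numpy's built-in single-precision product:

```python
import numpy as np

def matmul_by_definition(A, B):
    """C[i][j] = sum over k of A[i][k] * B[k][j]."""
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must match"
    C = np.zeros((m, p), dtype=np.float32)
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(3, 4).astype(np.float32)  # m x n, single precision
B = np.random.rand(4, 2).astype(np.float32)  # n x p
assert np.allclose(matmul_by_definition(A, B), A @ B, rtol=1e-5)
```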
- half-precision matrix multiplication refers to the multiplication of two matrices of half-precision floating-point type.
- Half-precision floating-point type usually uses 16 bits of binary to represent a floating-point number, of which 1 bit is used for the sign bit, 5 bits for the exponent, and 10 bits for the mantissa. Since the precision of half-precision floating-point type is lower, in practical applications, half-precision matrix multiplication is usually used in scenarios that require less precision, such as calculations in neural networks.
- IEEE-754 is a floating-point representation of binary numbers. It is an international standard developed by the Institute of Electrical and Electronics Engineers (IEEE). This standard specifies the binary representation of floating-point numbers in computers, including the sign bit, exponent bit, and mantissa bit.
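- to make the IEEE-754 bit layout concrete, the following Python sketch (illustrative only; the field widths are those stated above) extracts the sign, exponent, and mantissa fields of single-precision and half-precision numbers:

```python
import numpy as np

def fp32_fields(x):
    """Split an IEEE-754 single-precision float into its 1/8/23-bit fields."""
    bits = np.float32(x).view(np.uint32)
    return int(bits >> 31), int((bits >> 23) & 0xFF), int(bits & 0x7FFFFF)

def fp16_fields(x):
    """Split an IEEE-754 half-precision float into its 1/5/10-bit fields."""
    bits = np.float16(x).view(np.uint16)
    return int(bits >> 15), int((bits >> 10) & 0x1F), int(bits & 0x3FF)

print(fp32_fields(1.5))  # (0, 127, 4194304): 1.5 = +1.1b x 2^0, exponent bias 127
print(fp16_fields(1.5))  # (0, 15, 512):      same value, bias 15, 10-bit mantissa
```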
- Parallel computing hardware refers to a hardware device that can perform multiple computing tasks simultaneously. These hardware devices usually have a high degree of parallel computing capabilities and efficient computing resource management capabilities, which can achieve efficient computing and data processing.
- a computing chip that supports single-precision calculation can be used for the calculation, but such a chip is relatively costly.
- the single-precision matrices to be multiplied are usually converted into half-precision matrices, which are then multiplied, but the accuracy of the result obtained by this solution is low.
- the processing method for matrix multiplication operations in parallel computing hardware mainly obtains a first initial matrix and a second initial matrix, wherein the first initial matrix and the second initial matrix are both single-precision matrices; then half-precision processing is performed based on the single-precision data type to obtain a first half-precision matrix of the first initial matrix and a second half-precision matrix of the second initial matrix; then a first difference matrix is obtained based on the difference between the first initial matrix and the first half-precision matrix, and a second difference matrix is obtained based on the difference between the second initial matrix and the second half-precision matrix, wherein the first difference matrix and the second difference matrix are both half-precision matrices; finally, the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix are accumulated to obtain a first single-precision target matrix, which is used as the result of the matrix multiplication operation of the first initial matrix and the second initial matrix.
- the embodiment of the present application is directed to performing a single-precision matrix multiplication operation on a device supporting half-precision matrix multiplication operations, using a first difference matrix to save the error after the first initial matrix is converted to the first half-precision matrix, and using a second difference matrix to save the error after the second initial matrix is converted to the second half-precision matrix, so as to add an error compensation term to the multiplication of the first half-precision matrix and the second half-precision matrix and perform the corresponding half-precision multiplication operations, thereby obtaining a single-precision multiplication result with higher accuracy on a hardware device that only supports half-precision multiplication operations.
- the embodiments of the present application provide a processing method and related equipment for matrix multiplication operations in parallel computing hardware, which are specifically illustrated by the following embodiments. First, the processing method for matrix multiplication operations in parallel computing hardware in the embodiments of the present application is described.
- the embodiments of the present application can acquire and process relevant data based on artificial intelligence technology.
- artificial intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
- artificial intelligence is a comprehensive technology in computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
- Artificial intelligence is to study the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning and decision-making.
- the processing method of matrix multiplication operation in parallel computing hardware provided by the embodiment of the present application relates to the field of artificial intelligence technology, and in particular to the field of data calculation and processing.
- the processing method of matrix multiplication operation in parallel computing hardware provided by the embodiment of the present application can be applied to a terminal, can also be applied to a server side, and can also be a computer program running in a terminal or a server side.
- a computer program can be a native program or a software module in an operating system; it can be a local application (Application, APP), that is, a program that needs to be installed in an operating system to run; it can be a mini program, that is, a program that only needs to be downloaded into a browser environment to run; or it can be a mini program that can be embedded in any APP.
- the above-mentioned computer program can be an application, module or plug-in in any form.
- the terminal communicates with the server via a network.
- the processing method of matrix multiplication operation in the parallel computing hardware can be executed by a terminal or a server, or by a terminal and a server in collaboration.
- the terminal may be a smart phone, a tablet computer, a laptop computer, a desktop computer, or a smart watch, etc.
- the terminal may also be an intelligent vehicle-mounted device.
- the intelligent vehicle-mounted device applies the processing method of matrix multiplication operation in the parallel computing hardware of this embodiment to provide related services and improve the driving experience.
- the server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms; or a service node in a blockchain system, where each service node in the blockchain system forms a peer-to-peer (Peer To Peer, P2P) network, and the P2P protocol is an application layer protocol running on the Transmission Control Protocol (Transmission Control Protocol, TCP) protocol.
- the server can be installed with a server side of a text translation system, through which the server side can interact with the terminal; for example, corresponding software is installed on the server side, and the software may be an application that implements the processing method of matrix multiplication operations in parallel computing hardware, etc., but is not limited to the above forms.
- the terminal and the server can be connected via Bluetooth, Universal Serial Bus (USB) or network and other communication connection methods, which is not limited in this embodiment.
- the present application can be used in many general or special computer system environments or configurations. For example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network personal computers (Personal Computer, PC), minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, etc.
- the present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules.
- program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
- the present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network.
- program modules can be located in local and remote computer storage media including storage devices.
- the processing method of matrix multiplication operation in parallel computing hardware provided in the present application is mainly aimed at the matrix multiplication operation of two single-precision matrices in parallel computing hardware that only supports half-precision matrix multiplication operation.
- the parallel computing hardware that only supports half-precision matrix multiplication operation can be Ascend AI chip, etc.
- after the matrix multiplication operation device responds to the single-precision matrix multiplication operation, it first obtains the first initial matrix and the second initial matrix, both of which are single-precision matrices.
- the first initial matrix and the second initial matrix refer to matrix data to be processed by the matrix multiplication operation.
- there is no restriction on the acquisition source of the first initial matrix and the second initial matrix; that is, they can be input manually, generated by the calculation of a machine learning model itself, extracted from a text database by a computer device, or crawled from the network by a computer device, etc.
- the first initial matrix is derived from the first target matrix
- the second initial matrix is derived from the second target matrix
- the first target matrix and the second target matrix are two matrices that need to be matrix multiplied.
- the data scale of the first target matrix and the second target matrix is large, such as the input matrix scale of matrix multiplication operations in some applications of network models in deep learning is generally large. Therefore, in order to improve the operational efficiency of the matrix multiplication operation device, and also limited by the running memory size of the parallel computing hardware, it is necessary to perform block processing on the first target matrix and the second target matrix. The following describes the process of performing block processing on the first target matrix and the second target matrix in an embodiment of the present application.
- obtaining the first initial matrix and the second initial matrix includes steps S201 to S203.
- Step S201 Obtain a first target matrix and a second target matrix, both of which are single-precision matrices.
- Step S202 Obtain the maximum matrix multiplication operation order of the parallel computing hardware.
- Step S203 Based on the maximum matrix multiplication operation order, the first target matrix is divided to obtain at least one first initial matrix, and the second target matrix is divided to obtain at least one second initial matrix.
- after responding to the single-precision matrix multiplication operation, the matrix multiplication operation device first obtains a first target matrix and a second target matrix, both of which are single-precision matrices, and then obtains the maximum matrix multiplication operation order of the parallel computing hardware. It can be understood that the maximum matrix multiplication operation order can be derived from the running memory size of the parallel computing hardware.
- a judgment is made based on the data size of the first target matrix and the second target matrix and the maximum matrix multiplication operation order in the parallel computing hardware. If the data size of the first target matrix and/or the second target matrix is greater than the maximum matrix multiplication operation order in the parallel computing hardware, the first target matrix and the second target matrix are processed in blocks as follows.
- for example, the first target matrix can be divided into a 2×3 grid of first initial matrices A_single, each with 16 rows and 16 columns; similarly, the second target matrix is divided into a 3×2 grid of second initial matrices B_single, each with 16 rows and 16 columns.
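- the block division of step S203 can be sketched as follows (illustrative Python; the helper name is hypothetical and the dimensions are assumed to divide evenly by the block size):

```python
import numpy as np

def split_into_blocks(M, block):
    """Divide matrix M into a grid of (block x block) tiles, as in step S203.
    Assumes the dimensions are multiples of the block size for simplicity."""
    rows, cols = M.shape
    return [[M[i:i+block, j:j+block]
             for j in range(0, cols, block)]
            for i in range(0, rows, block)]

A_target = np.random.rand(32, 48).astype(np.float32)  # splits into a 2x3 grid
B_target = np.random.rand(48, 32).astype(np.float32)  # splits into a 3x2 grid
A_blocks = split_into_blocks(A_target, 16)            # 2 x 3 tiles of 16x16
B_blocks = split_into_blocks(B_target, 16)            # 3 x 2 tiles of 16x16
```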
- Step S102 Perform half-precision processing based on the single-precision data type to obtain a first half-precision matrix of the first initial matrix and a second half-precision matrix of the second initial matrix.
- due to the limitation of some parallel computing hardware, the matrix multiplication operation device only supports half-precision matrix multiplication operations. Therefore, it is necessary to perform half-precision processing on the first initial matrix A_single and the second initial matrix B_single based on their single-precision data type to obtain a first half-precision matrix A_half of the first initial matrix A_single and a second half-precision matrix B_half of the second initial matrix B_single, so that the matrix multiplication operation device can subsequently perform matrix multiplication operations according to the first half-precision matrix A_half and the second half-precision matrix B_half.
- the first initial matrix and the second initial matrix are converted into a first half-precision matrix and a second half-precision matrix.
- half-precision floating-point data cannot fully represent single-precision floating-point data; that is, when the single-precision (FP32) input is converted into half-precision data (FP16), the mantissa part cannot be fully represented, which introduces data truncation errors. Therefore, directly using the product of the first half-precision matrix and the second half-precision matrix as the result of multiplying the first initial matrix and the second initial matrix will result in lower accuracy.
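- this accuracy loss is easy to reproduce; the following sketch (illustrative values, not patent data) measures the relative error of the naive convert-then-multiply approach against a full single-precision reference, which for random inputs is on the order of 1e-3:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((16, 16), dtype=np.float32)
B = rng.random((16, 16), dtype=np.float32)

C_ref = A @ B                                            # single-precision reference
C_naive = (A.astype(np.float16) @ B.astype(np.float16)).astype(np.float32)
print(np.linalg.norm(C_naive - C_ref) / np.linalg.norm(C_ref))
```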
- the method provided by the present application is further introduced with reference to Figure 4.
- the first initial matrix A_single is a single-precision matrix, so each element therein has 32 bits of data, of which 1 bit is used for the sign, 8 bits for the exponent, and 23 bits for the mantissa;
- the first half-precision matrix A_half is a half-precision matrix, so each element therein has 16 bits of data, of which 1 bit is used for the sign, 5 bits for the exponent, and 10 bits for the mantissa.
- the first half-precision matrix A_half cannot completely store all the mantissa data of the first initial matrix A_single; that is, the first half-precision matrix A_half can only store the first mantissa part l_1 of the first initial matrix A_single, which inevitably leads to data truncation errors. Therefore, in the embodiment of the present application, a first difference matrix R_A-half is introduced, the purpose of which is to store the remaining second mantissa part l_2 of the first initial matrix A_single. It can be understood that, since the first difference matrix is also a half-precision matrix, its mantissa part can also store 10 bits of data.
- the first difference matrix actually stores ten mantissa bits of the second mantissa part l_2, starting from its first non-zero bit. Therefore, theoretically, in the worst case, that is, when the first bit of the second mantissa part l_2 is not 0, the mantissa data that can be stored by the first half-precision matrix and the first difference matrix together reaches 21 bits, which is very close to the 23-bit mantissa of the first initial matrix. On this basis, the data truncation error that exists after the first initial matrix and the second initial matrix are converted into the first half-precision matrix and the second half-precision matrix can be effectively mitigated.
- the first difference matrix is used to save the error after the first initial matrix is converted to the first half-precision matrix
- the second difference matrix is used to save the error after the second initial matrix is converted to the second half-precision matrix, so that an error compensation term is added to the multiplication operation of the first half-precision matrix and the second half-precision matrix to perform corresponding half-precision multiplication operations, thereby obtaining a single-precision multiplication result with higher accuracy on a hardware device that only supports half-precision multiplication operations.
- Step S103 Obtain a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix, and a second difference matrix based on the difference between the second initial matrix and the second half-precision matrix.
- in order to solve the data truncation error after the first initial matrix and the second initial matrix are converted into the first half-precision matrix and the second half-precision matrix, the matrix multiplication operation device obtains a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix, and obtains a second difference matrix based on the difference between the second initial matrix and the second half-precision matrix. It can be understood that, due to the limitation of some parallel computing hardware, the matrix multiplication operation device only supports multiplication operations of half-precision matrices, so the first difference matrix and the second difference matrix are both half-precision matrices.
- the process of obtaining a first difference matrix based on the difference between a first initial matrix and a first half-precision matrix includes steps S501 and S502 .
- Step S501 Perform single-precision processing on the first half-precision matrix to obtain a first intermediate matrix.
- Step S502 Perform half-precision processing on the difference between the first initial matrix and the first intermediate matrix to obtain a first difference matrix.
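- in Python terms, steps S501 and S502 amount to the following sketch (the to-half and to-single conversions are modeled with numpy casts; the function name is hypothetical):

```python
import numpy as np

def difference_matrix(M_single):
    """S501: cast the half-precision matrix back to single precision;
    S502: cast the difference to half precision to get the difference matrix."""
    M_half = M_single.astype(np.float16)             # half-precision processing
    M_intermediate = M_half.astype(np.float32)       # S501: first intermediate matrix
    R_half = (M_single - M_intermediate).astype(np.float16)  # S502: difference matrix
    return M_half, R_half

A_single = np.random.rand(16, 16).astype(np.float32)
A_half, R_A_half = difference_matrix(A_single)
# Head plus residual recovers roughly 21 mantissa bits of the original:
recon = A_half.astype(np.float32) + R_A_half.astype(np.float32)
print(np.max(np.abs(A_single - recon)))  # far below the FP16 rounding error alone
```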
- Step S104 Accumulate the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first single-precision target matrix, and use the first single-precision target matrix as the result of matrix multiplication of the first initial matrix and the second initial matrix.
- the multiplication operation of the first initial matrix and the second initial matrix can be further converted into: A_single·B_single ≈ A_half·B_half + A_half·R_B-half + B_half·R_A-half + R_A-half·R_B-half (3)
- formula (3) includes the product A_half·B_half of the first half-precision matrix and the second half-precision matrix, and also includes the compensation term A_half·R_B-half + B_half·R_A-half + R_A-half·R_B-half.
- considering that the first difference matrix R_A-half is used to store the difference between the first initial matrix A_single and the first half-precision matrix A_half, the value of the first difference matrix R_A-half will be relatively small compared with the first half-precision matrix A_half; similarly, considering that the second difference matrix R_B-half is used to store the difference between the second initial matrix B_single and the second half-precision matrix B_half, the value of the second difference matrix R_B-half will be relatively small compared with the second half-precision matrix B_half. Based on this, for formula (3), the product R_A-half·R_B-half of the first difference matrix and the second difference matrix is very small compared with the other terms.
- the product R_A-half·R_B-half of the first difference matrix and the second difference matrix in formula (3) can be ignored as a redundant term.
- the first single-precision target matrix, which is the result of the matrix multiplication operation of the first initial matrix and the second initial matrix, can be obtained by accumulating the product of the first half-precision matrix A_half and the second half-precision matrix B_half, the product of the first half-precision matrix A_half and the second difference matrix R_B-half, and the product of the second half-precision matrix B_half and the first difference matrix R_A-half: C_single ≈ A_half·B_half + A_half·R_B-half + B_half·R_A-half (4)
- in some embodiments, the redundant term, i.e., the product R_A-half·R_B-half of the first difference matrix and the second difference matrix in formula (3), still needs to be considered.
- the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix are accumulated to obtain a first single-precision target matrix, including steps S601 to S603 .
- Step S601 Accumulate the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first addition term.
- Step S602 Obtain a second addition term according to the product of the first difference matrix and the second difference matrix.
- Step S603 Accumulate the first addition term and the second addition term to obtain a first single-precision target matrix.
- the product A_half·B_half of the first half-precision matrix and the second half-precision matrix, the product A_half·R_B-half of the first half-precision matrix and the second difference matrix, and the product B_half·R_A-half of the second half-precision matrix and the first difference matrix are accumulated to obtain a first addition term, represented as follows:
- C_single_1 = A_half·B_half + A_half·R_B-half + B_half·R_A-half (5)
- the second addition term is expressed as follows:
- C_single_2 = R_A-half·R_B-half (6)
- when floating-point numbers of very different magnitudes are operated on together, the smaller floating-point number may be ignored, which is the phenomenon of "floating-point underflow".
- taking FP32 with a 23-bit mantissa as an example, 2^23 = 8388608, which has 7 digits; this means that FP32 can represent up to 7 significant decimal digits, of which the 7th digit may not be fully representable, but the 6th digit is always valid.
- if the first initial matrix A_single and the first half-precision matrix A_half are very close, the first difference matrix R_A-half will be very small, which will indirectly cause B_half·R_A-half in formula (7) to be very small; similarly, if the second initial matrix B_single and the second half-precision matrix B_half are very close, the second difference matrix R_B-half will be very small, which will indirectly cause A_half·R_B-half in formula (7) to be very small. On the other hand, the data of the remaining second mantissa part l_2 of the first initial matrix A_single recorded by the first difference matrix R_A-half is more than 10 (binary) orders of magnitude smaller than the data of the second half-precision matrix B_half, which will further cause the floating-point underflow problem to occur when performing the B_half·R_A-half operation; similarly, the data of the remaining second mantissa part l_B-2 of the second initial matrix B_single recorded by the second difference matrix R_B-half is more than 10 (binary) orders of magnitude smaller than the data of the first half-precision matrix A_half, which will likewise cause the floating-point underflow problem when performing the A_half·R_B-half operation.
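- a toy illustration of this underflow (the value is chosen only for illustration): the FP16 residual of a small FP32 number falls into the subnormal range, where only a few mantissa bits survive, whereas a copy magnified by 2^11 stays in the normal range:

```python
import numpy as np

x = np.float32(0.001234567)
x_half = np.float16(x)
residual = np.float32(x) - np.float32(x_half)          # about -4.4e-7 here

r_direct = np.float16(residual)                        # subnormal FP16: only a few mantissa bits survive
r_scaled = np.float16(residual * np.float32(2.0**11))  # normal FP16: full 10-bit mantissa
print(residual, float(r_direct), float(r_scaled) / 2.0**11)  # scaled copy is far more accurate
```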
- the first difference matrix R_A-half and the second difference matrix R_B-half need to be further processed accordingly.
- Step S701 Obtain a preset multiplication value determined according to the number of mantissa bits.
- the data values of the first magnified matrix 2^11·R_A-half and the data values of the second half-precision matrix B_half are of a similar order of magnitude, which can effectively prevent the floating-point underflow problem of B_half·(2^11·R_A-half).
- the data values of the second magnified matrix 2^11·R_B-half and the data values of the first half-precision matrix A_half are of a similar order of magnitude, which can effectively prevent the floating-point underflow problem of A_half·(2^11·R_B-half).
- the product of the first half-precision matrix and the second magnified matrix, and the product of the second half-precision matrix and the first magnified matrix are further accumulated to obtain the intermediate magnified matrix: A_half·(2^11·R_B-half) + B_half·(2^11·R_A-half) (10)
- the intermediate magnified matrix is magnified by the preset multiplication value 2^11. Therefore, after the floating-point underflow problem is avoided, the intermediate magnified matrix needs to be divided by the preset multiplication value 2^11 to obtain the intermediate reduction matrix: (A_half·(2^11·R_B-half) + B_half·(2^11·R_A-half)) / 2^11 (11). At this time, based on formula (4), the product of the first half-precision matrix and the second half-precision matrix and the intermediate reduction matrix are accumulated.
- the first single-precision target matrix is obtained as: C_single ≈ A_half·B_half + (A_half·(2^11·R_B-half) + B_half·(2^11·R_A-half)) / 2^11 (12)
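- putting the pieces together, the following sketch of formula (12) (an illustration, not the patent's actual kernel; the FP16-multiply-with-FP32-accumulate of matrix hardware is emulated by casting to FP32, which is exact because the product of two 11-bit significands fits in FP32's 24-bit significand; note that the compensation product written above as B_half·R_A-half is computed in the matrix order R_A-half·B_half) compares the naive conversion against the compensated, scaled scheme:

```python
import numpy as np

SCALE = np.float32(2.0 ** 11)   # preset multiplication value

def f32(M):
    return M.astype(np.float32)

def fp32_matmul_via_fp16(A, B):
    """Sketch of formula (12):
    C ~= A_half*B_half + (A_half*(2^11 R_B) + (2^11 R_A)*B_half) / 2^11."""
    A_half, B_half = A.astype(np.float16), B.astype(np.float16)
    # Difference matrices, magnified by 2^11 before the FP16 cast (step S701 onward):
    RA = ((A - f32(A_half)) * SCALE).astype(np.float16)
    RB = ((B - f32(B_half)) * SCALE).astype(np.float16)
    head = f32(A_half) @ f32(B_half)                        # FP16 multiply, FP32 accumulate
    comp = (f32(A_half) @ f32(RB) + f32(RA) @ f32(B_half)) / SCALE
    return head + comp

rng = np.random.default_rng(1)
A = rng.random((64, 64), dtype=np.float32)
B = rng.random((64, 64), dtype=np.float32)
ref = A @ B
naive = f32(A.astype(np.float16) @ B.astype(np.float16))
for C in (naive, fp32_matmul_via_fp16(A, B)):
    print(np.linalg.norm(C - ref) / np.linalg.norm(ref))    # compensated error is far smaller
```

- if the redundant term of formula (6) is retained (the FIG. 12 variant), one further product, f32(RA) @ f32(RB) / SCALE**2, is added to the result.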
- the first target matrix needs to be divided into multiple first initial matrices
- the second target matrix needs to be divided into multiple second initial matrices.
- the matrix multiplication process of the first target matrix and the second target matrix is shown in FIG. 3.
- the second initial matrix is selected from the second target matrix according to the block position of the first initial matrix in the first target matrix, and multiple matrix groups of the form (A_single_i, B_single_i), i = 1, 2, …, k, are generated;
- k represents the number of first initial matrices and second initial matrices.
- FIG. 9 is another schematic diagram of a process of a matrix block multiplication operation provided by an embodiment of the present application, wherein, compared with FIG. 8, it is necessary to additionally perform a plurality of redundant-term R_A-half·R_B-half multiplication operations.
- when the data scale of the first target matrix and the second target matrix is small, that is, when the maximum matrix multiplication operation order of the parallel computing hardware is larger than the data scale of the first target matrix and the second target matrix, the multiplication operation of the first target matrix and the second target matrix can be processed directly in the parallel computing hardware.
- the first target matrix is directly used as the first initial matrix
- the second target matrix is also used as the second initial matrix.
- in this case, the first single-precision target matrix is equal to the second single-precision target matrix C_single.
- the processing method of matrix multiplication operations in parallel computing hardware provided in the embodiment of the present application also includes the following steps S1001 to S1003.
- Step S1001 selecting a second initial matrix in a second target matrix according to the block positions of the first initial matrix in the first target matrix, and generating a plurality of matrix sequences.
- Step S1002 Accumulate the product of the first half-precision matrix and the second half-precision matrix in each matrix group to obtain a first accumulation matrix of the matrix sequence, and accumulate the sum of the product of the first half-precision matrix and the second difference matrix and the product of the second half-precision matrix and the first difference matrix in each matrix group to obtain a second accumulation matrix of the matrix sequence.
- Step S1003 Obtain a second single-precision target matrix according to the first accumulation matrix and the second accumulation matrix, and use the second single-precision target matrix as the result of matrix multiplication of the first target matrix and the second target matrix.
- the second initial matrix is selected in the second target matrix according to the block position of the first initial matrix in the first target matrix, and a plurality of matrix sequences are generated, wherein each matrix sequence includes a plurality of matrix groups, each matrix group includes a first initial matrix and a second initial matrix corresponding to the first initial matrix, wherein the plurality of matrix groups are as shown in formula (14).
- FIG. 11 is a schematic diagram of an improved process flow of a matrix block multiplication operation provided by an embodiment of the present application.
- the product of the first half-precision matrix and the second half-precision matrix in each matrix group is accumulated to obtain the first accumulation matrix of the matrix sequence, as shown below: C_single_1 = A_half_1·B_half_1 + A_half_2·B_half_2 + … + A_half_k·B_half_k
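- the blocked flow of steps S1001 to S1003 can be sketched as follows (illustrative Python; the block size, helper names, and the assumption that all dimensions divide evenly by the block are simplifications):

```python
import numpy as np

def f32(M):
    return M.astype(np.float32)

def blocked_matmul(A, B, block=16):
    """Sketch of steps S1001-S1003: per output block, keep two accumulators --
    one for the A_half*B_half products (first accumulation matrix) and one for
    the compensation products (second accumulation matrix)."""
    scale = np.float32(2.0 ** 11)
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, block):
        for j in range(0, n, block):
            acc1 = np.zeros((block, block), dtype=np.float32)  # first accumulation matrix
            acc2 = np.zeros((block, block), dtype=np.float32)  # second accumulation matrix
            for p in range(0, k, block):    # one matrix group per step of the sequence
                a = A[i:i+block, p:p+block]
                b = B[p:p+block, j:j+block]
                a_h, b_h = a.astype(np.float16), b.astype(np.float16)
                ra = ((a - f32(a_h)) * scale).astype(np.float16)
                rb = ((b - f32(b_h)) * scale).astype(np.float16)
                acc1 += f32(a_h) @ f32(b_h)
                acc2 += f32(a_h) @ f32(rb) + f32(ra) @ f32(b_h)
            C[i:i+block, j:j+block] = acc1 + acc2 / scale
    return C
```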
- the parallel computing hardware only needs to divide the storage cache space into two parts of different sizes when performing the matrix multiplication operation of the first target matrix and the second target matrix.
- the parallel computing capability of the parallel computing hardware can be better utilized, that is, multiple A_half·B_half operations and multiple compensation-term operations can be calculated simultaneously, thereby effectively improving the work efficiency of single-precision matrix multiplication operations.
- FIG. 12 is a schematic diagram of another improved process of a matrix block multiplication operation provided by an embodiment of the present application, wherein, compared with FIG. 11, it is necessary to additionally perform a plurality of redundant-term R_A-half·R_B-half multiplication operations.
- the processing method of matrix multiplication operation in parallel computing hardware also includes the following steps S1301 to S1304.
- Step S1301 Perform double-precision processing based on the single-precision data to obtain a first double-precision matrix of the first target matrix, a second double-precision matrix of the second target matrix, and a test matrix of the second single-precision target matrix.
- Step S1302 Perform a multiplication operation on the first double-precision matrix and the second double-precision matrix to obtain an evaluation matrix.
- Step S1303 Obtain a test result according to the test matrix and the evaluation matrix.
- Step S1304 Compare the test result with a preset test threshold, and output the second single-precision target matrix based on the comparison result.
- the matrix multiplication operation device performs double-precision processing on the second single-precision target matrix C_single based on its single-precision data structure to obtain a test matrix to_double(C_single) of the second single-precision target matrix C_single; at the same time, based on the single-precision data structure of the first target matrix A_single, the first target matrix A_single is double-precision processed to obtain a first double-precision matrix to_double(A_single), and based on the single-precision data structure of the second target matrix B_single, the second target matrix B_single is double-precision processed to obtain a second double-precision matrix to_double(B_single).
- V is used to store the relative residual between the test matrix and the evaluation matrix, and is used to characterize the actual error between the processing method of the matrix multiplication operation in the parallel computing hardware provided by this embodiment and the direct matrix multiplication operation of the first target matrix and the second target matrix;
- ‖·‖_F is the Frobenius (Euclidean) norm of the matrix; the 2-norm of a matrix refers to its largest singular value and is used to evaluate the condition number of the matrix, that is, the stability of the matrix and the reliability of the numerical solution.
- if the test result is smaller than the preset test threshold, it indicates that the processing method for matrix multiplication operations in the parallel computing hardware provided in the embodiment of the present application is effective, so the second single-precision target matrix can be used as the result of the matrix multiplication operation of the first target matrix and the second target matrix and output.
- no constraints are imposed on the setting of the preset test threshold; that is, it can be preset manually, or it can be obtained by the matrix multiplication operation device according to historical operation patterns.
- the verification steps shown in steps S1301 to S1304 are for verifying the reliability of the processing method for matrix multiplication operations in the parallel computing hardware provided in the embodiment of the present application, and the processing method for matrix multiplication operations in the parallel computing hardware provided in the embodiment of the present application is mainly used for matrix multiplication operations of two single-precision matrices (i.e., the first target matrix and the second target matrix) in parallel computing hardware that only supports half-precision matrix multiplication operations. Therefore, in this case, the verification steps shown in steps S1301 to S1304 will be placed in other parallel computing hardware that can support double-precision matrix multiplication operations for execution, and will have no effect on the processing method for matrix multiplication operations in the parallel computing hardware provided in the embodiment of the present application.
- the processing method of matrix multiplication operations in parallel computing hardware can be applied to the Ascend AI processor, which is a chip adapted to a specific field, and the core of the Ascend AI processor is an artificial intelligence chip.
- the Ascend AI processor provides three basic computing units: matrix computing unit (CUBE), vector computing unit (Vector) and scalar computing unit (Scalar).
- the three computing units form three independent pipelines to complete the corresponding calculations.
- the L1 buffer is used to store the first target matrix and the second target matrix, and then perform a matrix conversion operation and save it in the cache conversion unit, wherein the conversion operation includes dividing the first target matrix into a plurality of first initial matrices, and converting the first initial matrix into a first half-precision matrix, and generating a first difference matrix and a first magnification matrix.
- the second target matrix is divided into a plurality of second initial matrices, and the second initial matrix is converted into a second half-precision matrix, and a second difference matrix and a second magnification matrix are generated.
- Buffer L0A and buffer L0B are used to store matrices that are about to perform matrix multiplication operations.
- the matrix calculation unit is used to receive the matrices in the buffer L0A and the buffer L0B to perform a matrix multiplication operation.
- the matrix multiplication operation includes the product A_half·B_half as well as the compensation-term products A_half·(2^11·R_B-half) and B_half·(2^11·R_A-half). The accumulator is used to accumulate the data obtained by the matrix calculation unit, that is, the addition operations shown in Figure 12.
- the buffer L0C is used to store the data obtained by the accumulator calculation, and finally obtain the second single-precision target matrix C single required by this solution.
- the unified buffer is an important component inside the Ascend chip, which is used to store data shared between different computing units.
- the system control module is a module in the Ascend chip, which is responsible for controlling the overall operation and scheduling of the chip. It includes the coordination and management between various functional modules, as well as tasks such as processing external input and output.
- the bus interface module is an interface module used to connect the Ascend chip with other external devices. It provides an interface for data transmission and communication with the host system or other external devices, and realizes interaction with external systems.
- the instruction cache is a high-speed cache used to store instructions in the Ascend chip. It is used to increase the reading speed of instructions, reduce instruction access latency, and improve the execution efficiency of instructions.
- the scalar instruction processing queue is a module in the Ascend chip, which is used to process scalar instructions.
- a scalar instruction is an instruction that operates on a single data, such as addition, multiplication, etc.
- the scalar instruction processing queue is responsible for receiving and decoding scalar instructions and distributing them to the corresponding functional units for execution.
- the instruction distribution module is a module in the Ascend chip, which is used to distribute the decoded instructions to the corresponding functional units for execution. It is responsible for distributing instructions to appropriate functional units according to their type and operands, realizing parallel execution of instructions and efficient use of computing resources.
- the unified buffer in the Ascend chip is used to store data shared between different computing units, the system control module is responsible for overall scheduling and management, the bus interface module realizes communication with external devices, the instruction cache improves the instruction reading speed, the scalar instruction processing queue processes scalar instructions, and the instruction distribution module distributes the decoded instructions to the corresponding functional units for execution.
- These modules together constitute part of the functions in the Ascend chip.
- the CUBE queue is a hardware module in the Ascend chip, which is used to manage and schedule the execution of tasks. The CUBE queue can execute multiple tasks in parallel, and switch and schedule tasks according to certain scheduling strategies to improve computing efficiency.
- the Vector queue is another hardware module in the Ascend chip, which is used to support vector computing.
- the Vector queue can efficiently perform vector operations and improve the efficiency and performance of vector computing by processing multiple vector data in parallel.
- the storage conversion queue is a hardware module in the Ascend chip, which is used to handle data storage and conversion.
- the storage conversion queue is responsible for managing data transmission and conversion between different storage media, such as data reading and writing between memory and external storage devices.
- the time synchronization module is a hardware module in the Ascend chip, which is used to ensure time synchronization between multiple Ascend chips.
- the time synchronization module uses a precise clock synchronization mechanism to ensure that multiple Ascend chips have a consistent time base when performing tasks to support distributed computing and collaborative work.
- the vector computing unit is a hardware module in the Ascend chip, which is used to perform vector computing.
- the vector computing unit can efficiently perform operations on large-scale vector data, provide parallel computing capabilities, and is used to accelerate vector computing-intensive tasks.
- the scalar computing unit is a hardware module in the Ascend chip, which is used to perform scalar computing.
- the scalar computing unit is responsible for processing the computing tasks of a single data element, including addition, subtraction, multiplication, division, logical operations, etc., to support scalar computing-intensive tasks.
- the configuration port is an interface in the Ascend chip, which is used to configure and manage various parameters and settings of the chip.
- the configuration port provides a communication channel with the internal control logic of the chip, which is used to read and write the configuration registers of the chip to achieve flexible configuration of the chip functions and performance.
- the bus is a communication channel in the Ascend chip, which is used to connect various modules inside the chip and external devices.
- the bus is responsible for transmitting data and control signals to achieve communication and collaboration between various modules inside the chip.
- the L2 cache area is a high-speed cache storage area in the Ascend chip, located between the chip core and the memory. The L2 cache area is used to store frequently accessed data and instructions, providing fast data reading and writing and access speed to speed up computing and data processing.
- DDR is a type of memory in the Ascend chip and one of the common memory types in computer systems. DDR memory uses double data transfer rate technology, which can transmit twice the amount of data in one clock cycle, providing higher memory bandwidth and faster data access speed, and is used to store and read and write large-scale data.
- the processing method of matrix multiplication in parallel computing hardware provided by the present application can be applied to ordinary parallel computing hardware for block execution processing.
- Parallel computing hardware usually includes multiple artificial intelligence chips, so each block multiplication operation can be processed in parallel using one artificial intelligence chip.
- take the Ascend AI processor as an example.
- the Ascend AI processor supports 32 artificial intelligence chips, and theoretically can process 32 block matrix multiplication operations at the same time.
- for large input matrices, the time spent on calculation exceeds the time spent on data transfer.
- blocking can only solve the size limit of the unified buffer (UB), but cannot solve the time-consumption problem.
- the multi-core processing capabilities of Ascend AI must be fully utilized. For example, assume that for the test matrix (M, N, K), the calculation along the K dimension is split across multiple cores for parallel calculation. This involves accumulating the calculation results of all cores for the same block after the multi-core calculation is completed. Therefore, the intermediate results of the two calculations must be stored separately in advance.
- the Ascend AI processor achieves acceleration through large-scale parallel computing capabilities. It uses the Bisheng C++ language tool to express the mapping to the device's computing core through the concept of a parallel computing workgroup. Each parallel computing workgroup is mapped to a specific core. Each parallel computing workgroup has the same instruction code but a different identification ID, which is similar to a SPMD (Single Program Multiple Data) technology.
- SPMD is a parallel computing model, which means that multiple processors or computing units execute the same program at the same time, but corresponding to different data.
- different processors or computing units have their own data sets and independently execute the same instruction sequence to process these data.
- the SPMD model is based on data parallelism and is suitable for many parallel computing applications, such as scientific computing, image processing, and data analysis.
- programmers need to divide the computing problem into multiple independent tasks and assign different data sets to each task. Then, each processor or computing unit will execute the same program in parallel, but corresponding to different data sets.
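- the following sketch mirrors this SPMD pattern on the host side (processes stand in for AI cores; the names and the even split of K are assumptions): every worker runs the same program on its own K-slice, and the partial results for the same output block are accumulated afterwards:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def core_program(args):
    """The same program runs on every 'core'; only the received slice differs (SPMD)."""
    A_slice, B_slice = args
    return A_slice @ B_slice          # one core's partial matmul over its K-slice

def spmd_matmul_over_k(A, B, num_cores=4):
    k = A.shape[1]
    assert k % num_cores == 0, "even split assumed for simplicity"
    step = k // num_cores
    chunks = [(A[:, i*step:(i+1)*step], B[i*step:(i+1)*step, :])
              for i in range(num_cores)]
    with ProcessPoolExecutor(max_workers=num_cores) as pool:
        partials = list(pool.map(core_program, chunks))
    return sum(partials)              # accumulate each core's intermediate result

if __name__ == "__main__":
    A = np.random.rand(8, 16).astype(np.float32)
    B = np.random.rand(16, 8).astype(np.float32)
    assert np.allclose(spmd_matmul_over_k(A, B), A @ B, rtol=1e-5)
```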
- the present embodiment compares the NPU verification result (this patent solution), that is, the matrix multiplication operation processing solution provided by the present application, with two other solutions, namely:
- 1. the direct half-precision conversion verification result: that is, as in the related art, directly divide the first target matrix and the second target matrix into multiple first initial matrices and multiple second initial matrices, convert them into corresponding half-precision matrices, and then perform half-precision matrix multiplication operations to obtain the result;
- 2. the sgemm verification result of the OpenBLAS library: that is, as in the related art, directly use the sgemm function of the OpenBLAS library to perform a single-precision matrix multiplication operation on the first target matrix and the second target matrix.
- the OpenBLAS library is an open-source Basic Linear Algebra Subprograms (BLAS) library for high-performance numerical calculations.
- BLAS is a set of standard interfaces and functions for performing common linear algebra operations such as matrix multiplication, vector addition, etc.
- sgemm is the BLAS function for performing single-precision general matrix multiplication.
- the horizontal axis in Figure 16 refers to the size of the input matrix k (i.e., the number of columns K of the first target matrix and the number of rows K of the second target matrix), and the vertical axis refers to the accuracy test difference.
- FIGs. 17 and 18 respectively show a simulation diagram comparing the work efficiency of the processing method for matrix multiplication operations in parallel computing hardware provided in the embodiment of the present application, and a simulation data table of that work efficiency.
- this embodiment simulates and compares the running time of the optimized calculation execution process (i.e., the execution process shown in Figure 11, i.e., the NPU verification result (this patent solution) in Figure 17) used when executing this solution with the running time of the sgemm verification result solution of the OpenBLAS library.
- the horizontal axis in Figure 17 refers to the size of the input matrix k (i.e., the number of columns K of the first target matrix and the number of rows K of the second target matrix), and the vertical axis refers to the calculation delay. It can be clearly seen that as the number of columns K of the first target matrix and the number of rows K of the second target matrix increases, the running time of the sgemm verification result solution of the OpenBLAS library increases significantly, while the running time of the NPU verification result (this patent solution) (i.e., the execution flow shown in FIG. 11) also increases, but only very slightly.
- the acceleration ratio data of Figure 18 is the ratio of the running time of the sgemm verification result solution of the OpenBLAS library to the running time of the NPU verification result (this patent solution).
- FIG. 19 is another precision comparison simulation diagram of the processing method for matrix multiplication operations in parallel computing hardware provided by an embodiment of the present application.
- this embodiment compares the two schemes shown in FIG. 11 and FIG. 12, namely, the scheme without the redundant term R_A-half·R_B-half and the scheme with the redundant term, in terms of accuracy (i.e., relative residuals).
- the horizontal axis in Figure 19 is the size of the input matrix k (i.e., the number of columns K of the first target matrix and the number of rows K of the second target matrix), and the vertical axis is the relative residual.
- An embodiment of the present application provides a method for processing matrix multiplication operations in parallel computing hardware, which can improve the accuracy of single-precision matrix multiplication operations when using a computing chip that supports half-precision calculations.
- a first target matrix and a second target matrix, both of which are single-precision matrices, are divided into at least one first initial matrix and at least one second initial matrix, and a plurality of corresponding matrix groups are obtained; then half-precision processing is performed based on the single-precision data type to obtain a first half-precision matrix of the first initial matrix and a second half-precision matrix of the second initial matrix in each matrix group; then a first difference matrix is obtained based on the difference between the first initial matrix and the first half-precision matrix in each matrix group, and a second difference matrix is obtained based on the difference between the second initial matrix and the second half-precision matrix; then the first difference matrix and the second difference matrix are multiplied by a preset multiplication value to obtain a first magnification matrix and a second magnification matrix; finally, the corresponding products are accumulated and scaled back by the preset multiplication value to obtain the result of the matrix multiplication operation of the first target matrix and the second target matrix.
- the embodiment of the present application is directed to the process of performing single-precision matrix multiplication using a computing chip that supports half-precision calculation.
- the first target matrix and the second target matrix are divided to obtain multiple first initial matrices and second initial matrices, and the first initial matrices and the second initial matrices are processed with half precision to obtain the first half-precision matrices and the second half-precision matrices, so as to facilitate subsequent calculation by the parallel computing hardware.
- the error introduced when the first initial matrix is converted to the first half-precision matrix is saved in the first difference matrix
- the error introduced when the second initial matrix is converted to the second half-precision matrix is saved in the second difference matrix, so that an error compensation term is added to the multiplication of the first half-precision matrix and the second half-precision matrix when performing the corresponding half-precision multiplication, thereby effectively improving the accuracy of the half-precision multiplication of the single-precision matrices.
- the corresponding operation is performed after multiplying the first difference matrix and the second difference matrix by the preset multiplication value, so as to avoid the floating-point underflow problem and further improve the accuracy of the matrix multiplication operation.
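For illustration only, the NumPy sketch below reproduces this compensation-with-scaling scheme in software. The scale of 2^11 (assuming a 10-bit FP16 mantissa plus the hidden bit) and the float32 accumulation of the FP16 products are assumptions of this sketch, not the patented implementation; in NumPy the float32 accumulation makes the scaling largely redundant, and it is shown only to mirror the underflow workaround described above.

```python
import numpy as np

def compensated_matmul(a32, b32, scale=2.0 ** 11):
    """Error-compensated FP16 product of two FP32 matrices (illustrative sketch)."""
    a16 = a32.astype(np.float16)                             # first half-precision matrix
    b16 = b32.astype(np.float16)                             # second half-precision matrix
    ra = (a32 - a16.astype(np.float32)).astype(np.float16)   # first difference matrix
    rb = (b32 - b16.astype(np.float32)).astype(np.float16)   # second difference matrix

    # Scale the difference matrices up before the half-precision multiply so
    # that tiny residuals do not underflow, then scale the compensation down.
    ra_big = (ra.astype(np.float32) * scale).astype(np.float16)
    rb_big = (rb.astype(np.float32) * scale).astype(np.float16)

    f32 = np.float32
    main = a16.astype(f32) @ b16.astype(f32)
    comp = (a16.astype(f32) @ rb_big.astype(f32)
            + ra_big.astype(f32) @ b16.astype(f32)) / scale
    return main + comp

a = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
b = np.random.default_rng(1).standard_normal((64, 64)).astype(np.float32)
print(np.abs(compensated_matmul(a, b) - a @ b).max())  # far smaller than plain FP16 error
```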
- the embodiment of the present application further provides a matrix multiplication operation device, which can implement the processing method of the matrix multiplication operation in the above parallel computing hardware.
- the device 2000 includes:
- An acquisition module 2010 is used to acquire a first initial matrix and a second initial matrix; wherein the first initial matrix and the second initial matrix are both single-precision matrices;
- a half-precision conversion module 2020 is used to perform half-precision processing based on the single-precision data type to obtain a first half-precision matrix of the first initial matrix and a second half-precision matrix of the second initial matrix;
- the difference processing module 2030 is used to obtain a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix, and to obtain a second difference matrix based on the difference between the second initial matrix and the second half-precision matrix; wherein the first difference matrix and the second difference matrix are both half-precision matrices;
- the calculation module 2040 is used to accumulate the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first single-precision target matrix, and use the first single-precision target matrix as the result of matrix multiplication operation between the first initial matrix and the second initial matrix.
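For orientation only, the following is a hypothetical sketch of how the four modules of device 2000 could map onto functions; the class and method names and the NumPy types are illustrative assumptions, not the patented implementation.

```python
import numpy as np

class MatrixMultiplicationDevice:
    """Hypothetical software analogue of device 2000 (illustrative only)."""

    def acquire(self, a, b):                        # acquisition module 2010
        assert a.dtype == np.float32 and b.dtype == np.float32
        return a, b

    def convert_to_half(self, a, b):                # half-precision conversion module 2020
        return a.astype(np.float16), b.astype(np.float16)

    def process_differences(self, a, b, a16, b16):  # difference processing module 2030
        ra = (a - a16.astype(np.float32)).astype(np.float16)
        rb = (b - b16.astype(np.float32)).astype(np.float16)
        return ra, rb

    def compute(self, a16, b16, ra, rb):            # calculation module 2040
        f32 = np.float32
        # accumulate the three half-precision products in single precision
        return (a16.astype(f32) @ b16.astype(f32)
                + a16.astype(f32) @ rb.astype(f32)
                + ra.astype(f32) @ b16.astype(f32))
```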
- the specific implementation of the matrix multiplication operation device of this embodiment is basically consistent with the specific implementation of the processing method of the matrix multiplication operation in the above-mentioned parallel computing hardware, and will not be repeated here.
- the present application also provides an electronic device, including a memory and a processor;
- at least one program is stored in the memory, and the processor executes the at least one program to implement the processing method for matrix multiplication operations in parallel computing hardware provided in the present application.
- the electronic device can be any intelligent terminal including a mobile phone, a tablet computer, a personal digital assistant (PDA), a car computer, etc.
- the processor 2101 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of the present application;
- the memory 2102 can be implemented in the form of ROM (Read Only Memory), static storage device, dynamic storage device or RAM (Random Access Memory).
- the memory 2102 can store operating systems and other application programs.
- the relevant program codes are stored in the memory 2102, and the processor 2101 calls them to execute the processing method for matrix multiplication operations in parallel computing hardware of the embodiments of this application;
- Communication interface 2104, used to realize communication interaction between this device and other devices; communication can be realized in a wired mode (such as USB, network cable, etc.) or a wireless mode (such as mobile network, WIFI, Bluetooth, etc.);
- a bus 2105 that transmits information between the various components of the device (e.g., the processor 2101, the memory 2102, the input/output interface 2103, and the communication interface 2104);
- the processor 2101, the memory 2102, the input/output interface 2103 and the communication interface 2104 are connected to each other in communication within the device via the bus 2105.
- An embodiment of the present application also provides a storage medium, which is a computer-readable storage medium and stores a computer program.
- When the computer program is executed by a processor, it implements the processing method for matrix multiplication operations in the above-mentioned parallel computing hardware.
- the memory, as a non-transient computer-readable storage medium, can be used to store non-transient software programs and non-transient computer-executable programs.
- the memory may include a high-speed random access memory, and may also include a non-transient memory, such as at least one disk storage device, a flash memory device, or other non-transient solid-state storage device.
- the memory may optionally include a memory remotely disposed relative to the processor, and these remote memories may be connected to the processor via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
- the device embodiments described above are merely illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place or distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- "At least one (item)" means one or more, and "plurality" means two or more.
- "And/or" is used to describe the association relationship of associated objects, indicating that three relationships may exist.
- "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
- the character "/" generally indicates that the objects associated before and after are in an "or" relationship.
- "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items.
- "At least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c can be single or multiple.
- the disclosed devices and methods can be implemented in other ways.
- the device embodiments described above are only schematic.
- the division of the above units is only a logical function division. There may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
- Another point is that the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
- the units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
- If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- the computer software product is stored in a storage medium and includes multiple instructions to enable a computer device (which can be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present application.
- the aforementioned storage media include: USB flash drives, mobile hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store programs.
Abstract
Description
The present application relates to the field of artificial intelligence technology, and in particular to a processing method and related equipment for matrix multiplication operations in parallel computing hardware.
In recent years, the vigorous development of machine learning, artificial intelligence and other fields has greatly increased the number of matrix multiplication operations and has also greatly accelerated the development of dedicated hardware for parallel computing. As a result, improving the accuracy of matrix multiplication operations has become a key concern.
In the related art, for the multiplication of single-precision matrices, a computing chip that supports single-precision calculation can be used, but the cost of such a chip is relatively high. Alternatively, when using a computing chip that only supports half-precision calculation, the single-precision matrices to be multiplied are usually converted into half-precision matrices, which are then multiplied, but the accuracy of the result obtained by this approach is low.
Summary of the invention
The embodiments of the present application provide a processing method and related equipment for matrix multiplication operations in parallel computing hardware, which can improve the accuracy of single-precision matrix multiplication operations when using a computing chip that supports half-precision calculations.
To achieve the above purpose, a first aspect of the embodiments of the present application proposes a processing method for matrix multiplication operations in parallel computing hardware, comprising:
obtaining a first initial matrix and a second initial matrix; wherein the first initial matrix and the second initial matrix are both single-precision matrices;
performing half-precision processing based on the single-precision data type to obtain a first half-precision matrix of the first initial matrix and a second half-precision matrix of the second initial matrix;
obtaining a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix, and obtaining a second difference matrix based on the difference between the second initial matrix and the second half-precision matrix; wherein the first difference matrix and the second difference matrix are both half-precision matrices;
accumulating the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first single-precision target matrix, and using the first single-precision target matrix as the result of the matrix multiplication operation of the first initial matrix and the second initial matrix.
In some embodiments, obtaining the first difference matrix based on the difference between the first initial matrix and the first half-precision matrix includes:
performing single-precision processing on the first half-precision matrix to obtain a first intermediate matrix;
performing half-precision processing on the difference between the first initial matrix and the first intermediate matrix to obtain the first difference matrix.
In some embodiments, accumulating the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain the first single-precision target matrix includes:
accumulating the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first addition term;
obtaining a second addition term according to the product of the first difference matrix and the second difference matrix;
accumulating the first addition term and the second addition term to obtain the first single-precision target matrix.
In some embodiments, the elements in the first half-precision matrix include mantissas, and accumulating the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain the first single-precision target matrix further includes:
obtaining a preset multiplication value determined according to the number of bits of the mantissa;
multiplying the first difference matrix by the preset multiplication value to obtain a first magnification matrix, and multiplying the second difference matrix by the preset multiplication value to obtain a second magnification matrix;
accumulating the product of the first half-precision matrix and the second magnification matrix and the product of the second half-precision matrix and the first magnification matrix to obtain an intermediate magnification matrix;
dividing the intermediate magnification matrix by the preset multiplication value to obtain an intermediate reduction matrix;
accumulating the product of the first half-precision matrix and the second half-precision matrix and the intermediate reduction matrix to obtain the first single-precision target matrix.
In some embodiments, obtaining the first initial matrix and the second initial matrix includes:
obtaining a first target matrix and a second target matrix;
obtaining the maximum matrix multiplication operation order of the parallel computing hardware;
based on the maximum matrix multiplication operation order, dividing the first target matrix to obtain at least one first initial matrix, and dividing the second target matrix to obtain at least one second initial matrix.
In some embodiments, when there are multiple first initial matrices and second initial matrices, the method further includes:
selecting the second initial matrices from the second target matrix according to the block positions of the first initial matrices in the first target matrix, and generating a plurality of matrix sequences; each matrix sequence includes a plurality of matrix groups, and each matrix group includes a first initial matrix and a second initial matrix;
accumulating the product of the first half-precision matrix and the second half-precision matrix in each matrix group to obtain a first accumulation matrix of the matrix sequence, and accumulating the sum of the product of the first half-precision matrix and the second difference matrix and the product of the second half-precision matrix and the first difference matrix in each matrix group to obtain a second accumulation matrix of the matrix sequence;
obtaining a second single-precision target matrix according to the first accumulation matrix and the second accumulation matrix, and using the second single-precision target matrix as the result of the matrix multiplication operation of the first target matrix and the second target matrix.
In some embodiments, the method further includes:
performing double-precision processing based on the single-precision data type to obtain a first double-precision matrix of the first target matrix, a second double-precision matrix of the second target matrix, and a check matrix of the second single-precision target matrix;
multiplying the first double-precision matrix and the second double-precision matrix to obtain an evaluation matrix;
obtaining a test result based on the check matrix and the evaluation matrix;
comparing the test result with a preset test threshold, and outputting the second single-precision target matrix based on the comparison result.
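A minimal sketch of this verification embodiment follows; taking the test result to be the relative residual, and the particular threshold value, are assumptions made for illustration only.

```python
import numpy as np

def passes_check(A_single, B_single, C_single, threshold=1e-3):
    """Compare the computed product against a double-precision reference."""
    evaluation = A_single.astype(np.float64) @ B_single.astype(np.float64)  # evaluation matrix
    check = C_single.astype(np.float64)                                     # check matrix
    test_result = np.linalg.norm(check - evaluation) / np.linalg.norm(evaluation)
    return test_result <= threshold
```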
To achieve the above purpose, a second aspect of the embodiments of the present application proposes a matrix multiplication operation device, comprising:
an acquisition module, used to acquire a first initial matrix and a second initial matrix; wherein the first initial matrix and the second initial matrix are both single-precision matrices;
a half-precision conversion module, used to perform half-precision processing based on the single-precision data type to obtain a first half-precision matrix of the first initial matrix and a second half-precision matrix of the second initial matrix;
a difference processing module, used to obtain a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix, and to obtain a second difference matrix based on the difference between the second initial matrix and the second half-precision matrix; wherein the first difference matrix and the second difference matrix are both half-precision matrices;
a calculation module, used to accumulate the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first single-precision target matrix, and to use the first single-precision target matrix as the result of the matrix multiplication operation of the first initial matrix and the second initial matrix.
To achieve the above purpose, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, wherein the memory stores a computer program, and when the processor executes the computer program, the processing method for matrix multiplication operations in parallel computing hardware described in the first aspect is implemented.
To achieve the above purpose, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the processing method for matrix multiplication operations in parallel computing hardware described in the first aspect.
The processing method and related equipment for matrix multiplication operations in parallel computing hardware proposed in the embodiments of the present application obtain a first initial matrix and a second initial matrix, both of which are single-precision matrices; then perform half-precision processing based on the single-precision data type to obtain a first half-precision matrix of the first initial matrix and a second half-precision matrix of the second initial matrix; then obtain a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix, and a second difference matrix based on the difference between the second initial matrix and the second half-precision matrix, where both difference matrices are half-precision matrices; finally, accumulate the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first single-precision target matrix, which is used as the result of the matrix multiplication operation of the first initial matrix and the second initial matrix. For single-precision matrix multiplication performed on a device that supports half-precision matrix multiplication, the embodiments of the present application use the first difference matrix to save the error after the first initial matrix is converted to the first half-precision matrix, and use the second difference matrix to save the error after the second initial matrix is converted to the second half-precision matrix, thereby adding error compensation terms to the multiplication of the first half-precision matrix and the second half-precision matrix when performing the corresponding half-precision multiplications, and thus obtaining a single-precision multiplication result with higher accuracy on hardware that only supports half-precision multiplication.
Other features and advantages of the present application will be described in the following description, will partly become apparent from the description, or will be understood by practicing the present application. The purpose and other advantages of the present application can be realized and obtained by the structures specifically pointed out in the description, the claims and the drawings.
FIG. 1 is a flowchart of a processing method for matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
FIG. 2 is a flowchart of step S101 in FIG. 1.
FIG. 3 is a schematic diagram of the structure of matrix blocking provided in an embodiment of the present application.
FIG. 4 is a schematic diagram of a floating-point data structure provided in an embodiment of the present application.
FIG. 5 is a flowchart of step S103 in FIG. 1.
FIG. 6 is a flowchart of step S104 in FIG. 1.
FIG. 7 is another flowchart of step S104 in FIG. 1.
FIG. 8 is a schematic flowchart of a matrix block multiplication operation provided in an embodiment of the present application.
FIG. 9 is another schematic flowchart of a matrix block multiplication operation provided in an embodiment of the present application.
FIG. 10 is another flowchart of the processing method for matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
FIG. 11 is a schematic diagram of an improved flow of the matrix block multiplication operation provided in an embodiment of the present application.
FIG. 12 is a schematic diagram of another improved flow of the matrix block multiplication operation provided in an embodiment of the present application.
FIG. 13 is another flowchart of the processing method for matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
FIG. 14 is a schematic diagram of the chip structure of the Ascend AI processor provided in an embodiment of the present application.
FIG. 15 is a schematic diagram of the execution of block parallel computing provided in an embodiment of the present application.
FIG. 16 is an accuracy comparison simulation diagram of the processing method for matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
FIG. 17 is a working efficiency comparison simulation diagram of the processing method for matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
FIG. 18 is a simulation data table of the working efficiency of the processing method for matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
FIG. 19 is another accuracy comparison simulation diagram of the processing method for matrix multiplication operations in parallel computing hardware provided in an embodiment of the present application.
FIG. 20 is a schematic diagram of the structure of a matrix multiplication operation device provided in an embodiment of the present application.
FIG. 21 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the device schematics and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a module division different from that in the device, or in an order different from that in the flowcharts.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this application belongs. The terms used herein are only for the purpose of describing the embodiments of this application and are not intended to limit it.
First, some terms involved in this application are explained:
Single-precision matrix multiplication refers to the operation of multiplying two matrices of the single-precision floating-point type. The single-precision floating-point type usually uses a 32-bit binary number to represent a floating-point number, of which 1 bit is used for the sign, 8 bits for the exponent, and 23 bits for the mantissa. Assuming that matrix A has size m×n and matrix B has size n×p, their product C has size m×p, where the value of each element C[i][j] of matrix C is the sum of the products of the elements in the i-th row of matrix A and the elements in the j-th column of matrix B.
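A small NumPy check of this definition (the sizes and values are arbitrary):

```python
import numpy as np

# C[i][j] is the sum over k of A[i][k] * B[k][j].
m, n, p = 4, 5, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n)).astype(np.float32)
B = rng.standard_normal((n, p)).astype(np.float32)

C = np.zeros((m, p), dtype=np.float32)
for i in range(m):
    for j in range(p):
        for k in range(n):
            C[i, j] += A[i, k] * B[k, j]

assert np.allclose(C, A @ B)   # matches the library product
```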
Similar to single-precision matrix multiplication, half-precision matrix multiplication refers to the multiplication of two matrices of the half-precision floating-point type. The half-precision floating-point type usually uses a 16-bit binary number to represent a floating-point number, of which 1 bit is used for the sign, 5 bits for the exponent, and 10 bits for the mantissa. Since the precision of the half-precision floating-point type is lower, in practical applications half-precision matrix multiplication is usually used in scenarios that require less precision, such as calculations in neural networks.
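A short, illustrative comparison of the same product formed in single and in half precision; the sizes, seed, and the quoted error magnitudes are indicative only:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 64)).astype(np.float32)
B = rng.standard_normal((64, 64)).astype(np.float32)

# Rounding the inputs to FP16 discards 13 of the 23 mantissa bits, so the
# plain half-precision product drifts away from the FP32 one.
C32 = A @ B
C16 = (A.astype(np.float16) @ B.astype(np.float16)).astype(np.float32)
rel = np.linalg.norm(C32 - C16) / np.linalg.norm(C32)
print(rel)   # typically around 1e-3, versus roughly 1e-7 for FP32 rounding
```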
IEEE-754 is a floating-point representation of binary numbers. It is an international standard developed by the Institute of Electrical and Electronics Engineers (IEEE). This standard specifies the binary representation of floating-point numbers in computers, including the sign bit, exponent bits, and mantissa bits.
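As a minimal illustration of this layout, standard-library Python can unpack the three fields of a single-precision value (this helper is not part of the application):

```python
import struct

def fp32_fields(x: float):
    """Split an IEEE-754 single-precision value into sign, exponent and mantissa."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits (biased)
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits
    return sign, exponent, mantissa

print(fp32_fields(-1.5))   # (1, 127, 0x400000): sign 1, unbiased exponent 0, mantissa .1
```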
Parallel computing hardware refers to hardware devices that can perform multiple computing tasks simultaneously. These devices usually have a high degree of parallel computing capability and efficient computing resource management, enabling efficient computation and data processing.
In recent years, the vigorous development of machine learning, artificial intelligence and other fields has greatly increased the number of matrix multiplication operations and has also greatly accelerated the development of dedicated hardware for parallel computing. As a result, improving the accuracy of matrix multiplication operations has become a key concern.
In the related art, for the multiplication of single-precision matrices, a computing chip that supports single-precision calculation can be used, but the cost of such a chip is relatively high. Alternatively, when using a computing chip that only supports half-precision calculation, the single-precision matrices to be multiplied are usually converted into half-precision matrices, which are then multiplied, but the accuracy of the result obtained by this approach is low.
Based on this, the embodiments of the present application provide a processing method and related equipment for matrix multiplication operations in parallel computing hardware, which can improve the accuracy of single-precision matrix multiplication operations on computing chips that support half-precision calculations. The processing method mainly obtains a first initial matrix and a second initial matrix, both of which are single-precision matrices; then performs half-precision processing based on the single-precision data type to obtain a first half-precision matrix of the first initial matrix and a second half-precision matrix of the second initial matrix; then obtains a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix, and a second difference matrix based on the difference between the second initial matrix and the second half-precision matrix, where both difference matrices are half-precision matrices; finally, accumulates the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first single-precision target matrix, which is used as the result of the matrix multiplication operation of the first initial matrix and the second initial matrix. For single-precision matrix multiplication performed on a device that supports half-precision matrix multiplication, the embodiments of the present application use the first difference matrix to save the error after the first initial matrix is converted to the first half-precision matrix, and use the second difference matrix to save the error after the second initial matrix is converted to the second half-precision matrix, thereby adding error compensation terms to the multiplication of the first half-precision matrix and the second half-precision matrix when performing the corresponding half-precision multiplications, and thus obtaining a single-precision multiplication result with higher accuracy on hardware that only supports half-precision multiplication.
The embodiments of the present application provide a processing method and related equipment for matrix multiplication operations in parallel computing hardware, which are illustrated by the following embodiments. First, the processing method for matrix multiplication operations in parallel computing hardware in the embodiments of the present application is described.
The embodiments of the present application can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
The processing method for matrix multiplication operations in parallel computing hardware provided in the embodiments of the present application relates to the field of artificial intelligence technology, and in particular to the field of data computation and processing. The method can be applied to a terminal or to a server, or can be a computer program running on a terminal or server. For example, the computer program can be a native program or software module in an operating system; a local application (APP), i.e., a program that needs to be installed in an operating system to run; a mini program, i.e., a program that only needs to be downloaded into a browser environment to run; or a mini program that can be embedded in any APP. In short, the above computer program can be an application, module or plug-in of any form. The terminal communicates with the server through a network. The processing method for matrix multiplication operations in parallel computing hardware can be executed by the terminal or the server, or by the terminal and the server in collaboration.
In some embodiments, the terminal can be a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart watch, etc. In addition, the terminal can also be an intelligent vehicle-mounted device, which applies the processing method of this embodiment to provide related services and improve the driving experience. The server can be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms; it can also be a service node in a blockchain system, where the service nodes form a peer-to-peer (P2P) network, and the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). A server side can be installed on the server, through which it can interact with the terminal; for example, corresponding software is installed on the server side, and the software can be an application that implements the processing method for matrix multiplication operations in parallel computing hardware, but is not limited to the above forms. The terminal and the server can be connected via Bluetooth, Universal Serial Bus (USB), a network, or other communication methods, which is not limited in this embodiment.
The present application can be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network personal computers (PCs), minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. The present application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network, and program modules can be located in local and remote computer storage media including storage devices.
In some embodiments, the main target of the processing method for matrix multiplication operations in parallel computing hardware provided in the present application is the matrix multiplication of two single-precision matrices on parallel computing hardware that only supports half-precision matrix multiplication, such as the Ascend AI chip.
First, the processing method for matrix multiplication operations in parallel computing hardware in the embodiments of the present application is described. In this embodiment, the method can be applied to a matrix multiplication operation device. Referring to FIG. 1, which is an optional flowchart of the processing method provided in an embodiment of the present application, the method in FIG. 1 may include, but is not limited to, steps S101 to S104. It should be understood that this embodiment does not specifically limit the order of steps S101 to S104 in FIG. 1, and the order of the steps can be adjusted, or certain steps can be removed or added, according to actual needs.
Step S101: Obtain a first initial matrix and a second initial matrix.
In some embodiments, after responding to a single-precision matrix multiplication operation, the matrix multiplication operation device first obtains the first initial matrix and the second initial matrix, both of which are single-precision matrices. The first initial matrix and the second initial matrix refer to the matrix data to be processed by the matrix multiplication operation. In this embodiment, the source of the first initial matrix and the second initial matrix is not restricted: they can be input manually, generated by the computation of a machine learning model itself, extracted from a text database by a computer device, crawled from the network by a computer device, etc.
In some embodiments, the first initial matrix is derived from a first target matrix, the second initial matrix is derived from a second target matrix, and the first target matrix and the second target matrix are the two matrices to be multiplied. In some cases, the data scale of the first target matrix and the second target matrix is large; for example, the input matrices of matrix multiplication operations in some applications of network models in deep learning are generally large. Therefore, in order to improve the operational efficiency of the matrix multiplication operation device, and also because of the limited running memory of the parallel computing hardware, the first target matrix and the second target matrix need to be processed in blocks. The following describes the block processing of the first target matrix and the second target matrix in the embodiments of the present application.
Therefore, referring to FIG. 2, obtaining the first initial matrix and the second initial matrix includes steps S201 to S203.
Step S201: Obtain a first target matrix and a second target matrix.
Step S202: Obtain the maximum matrix multiplication operation order of the parallel computing hardware.
Step S203: Based on the maximum matrix multiplication operation order, divide the first target matrix to obtain at least one first initial matrix, and divide the second target matrix to obtain at least one second initial matrix.
In some embodiments, after responding to a single-precision matrix multiplication operation, the matrix multiplication operation device first obtains the first target matrix and the second target matrix, both of which are single-precision matrices, and then obtains the maximum matrix multiplication operation order of the parallel computing hardware. It can be understood that the maximum matrix multiplication operation order can be derived from the running memory size of the parallel computing hardware.
In some embodiments, the data scale of the first target matrix and the second target matrix is compared with the maximum matrix multiplication operation order of the parallel computing hardware; if the data scale of the first target matrix and/or the second target matrix is greater than the maximum matrix multiplication operation order, the first target matrix and the second target matrix are processed in blocks as follows.
Referring to FIG. 3, which is a schematic diagram of the structure of matrix blocking provided in an embodiment of the present application, matrix A is the first target matrix with M=32 rows and J=48 columns, and matrix B is the second target matrix with J=48 rows and N=32 columns. When the maximum matrix multiplication operation order of the current parallel computing hardware is 16×16, the first target matrix A can be divided, as shown in FIG. 3, into 2×3 first initial matrices Asingle with 16 rows and 16 columns each; similarly, the second target matrix B can be divided into 3×2 second initial matrices Bsingle with 16 rows and 16 columns each.
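A sketch of this blocking step, assuming the dimensions divide evenly by the operation order (the helper name is illustrative):

```python
import numpy as np

def split_blocks(mat, order):
    """Cut a matrix into order x order tiles, as in the Figure 3 example."""
    rows, cols = mat.shape
    return [[mat[i:i + order, j:j + order]
             for j in range(0, cols, order)]
            for i in range(0, rows, order)]

A = np.zeros((32, 48), dtype=np.float32)   # first target matrix
B = np.zeros((48, 32), dtype=np.float32)   # second target matrix
a_blocks = split_blocks(A, 16)             # 2 x 3 grid of 16x16 Asingle blocks
b_blocks = split_blocks(B, 16)             # 3 x 2 grid of 16x16 Bsingle blocks
print(len(a_blocks), len(a_blocks[0]))     # 2 3
```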
Step S102: Perform half-precision processing based on the single-precision data type to obtain a first half-precision matrix of the first initial matrix and a second half-precision matrix of the second initial matrix.
In some embodiments, because of the limitations of some parallel computing hardware, the matrix multiplication operation device only supports the multiplication of half-precision matrices. Therefore, based on the single-precision data type of the first initial matrix Asingle and the second initial matrix Bsingle, half-precision processing is performed on them to obtain the first half-precision matrix Ahalf of the first initial matrix Asingle and the second half-precision matrix Bhalf of the second initial matrix Bsingle, so that the matrix multiplication operation device can subsequently perform the matrix multiplication using Ahalf and Bhalf.
In some embodiments, after the first initial matrix and the second initial matrix are converted into the first half-precision matrix and the second half-precision matrix, a data truncation error is introduced because half-precision floating-point data cannot fully represent single-precision floating-point data; that is, when single-precision (FP32) input is converted into half-precision (FP16) data, the mantissa part cannot be fully represented. Therefore, directly using the product of the first half-precision matrix and the second half-precision matrix as the result for the first initial matrix and the second initial matrix would lead to low accuracy. To explain this data truncation error and the present application, the method provided by the present application is further introduced below with reference to FIG. 4.
Referring to FIG. 4, which is a schematic diagram of a floating-point data structure provided in an embodiment of the present application: the first initial matrix Asingle is a single-precision matrix, so each of its elements has 32 bits of data, of which 1 bit is used for the sign, 8 bits for the exponent, and 23 bits for the mantissa; the first half-precision matrix Ahalf is a half-precision matrix, so each of its elements has 16 bits of data, of which 1 bit is used for the sign, 5 bits for the exponent, and 10 bits for the mantissa. Therefore, the first half-precision matrix Ahalf cannot store all the mantissa data of the first initial matrix Asingle; it can only store the first mantissa part l1 of Asingle, which inevitably leads to a data truncation error. Therefore, in the embodiments of the present application, the first difference matrix RA-half is introduced to store the remaining second mantissa part l2 of the first initial matrix Asingle. It can be understood that, since the first difference matrix is also a half-precision matrix, its mantissa part can also store 10 bits of data. Furthermore, because of the hidden mantissa bit in the IEEE-754 floating-point rules, the first difference matrix actually stores the ten mantissa bits that follow the first non-zero bit of the second mantissa part l2. Therefore, in theory, even in the worst case, i.e., when the first bit of the second mantissa part l2 is not 0, the first half-precision matrix and the first difference matrix together can store 21 bits of mantissa data, which is already very close to the 23-bit mantissa of the first initial matrix. On this basis, the data truncation error that exists after the first initial matrix and the second initial matrix are converted into the first half-precision matrix and the second half-precision matrix can be effectively resolved.
Therefore, the first difference matrix is used to save the error after the first initial matrix is converted to the first half-precision matrix, and the second difference matrix is used to save the error after the second initial matrix is converted to the second half-precision matrix, so that error compensation terms are added to the multiplication of the first half-precision matrix and the second half-precision matrix when performing the corresponding half-precision multiplications, and a single-precision multiplication result with higher accuracy is obtained on hardware that only supports half-precision multiplication.
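A scalar-level NumPy illustration of this two-piece mantissa argument (the sample value is arbitrary):

```python
import numpy as np

x = np.float32(1.2345678)                    # 23-bit FP32 mantissa
x_half = np.float16(x)                       # keeps about 11 significant bits
resid = np.float16(np.float32(x) - np.float32(x_half))  # next ~11 bits, like RA-half

err_half = abs(float(x) - float(x_half))
err_pair = abs(float(x) - (float(x_half) + float(resid)))
print(err_half, err_pair)                    # the pair is far closer to x
```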
Step S103: obtaining a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix, and obtaining a second difference matrix based on the difference between the second initial matrix and the second half-precision matrix.
In some embodiments, to resolve the data truncation error that exists after the first and second initial matrices are converted into the first and second half-precision matrices, the matrix multiplication apparatus obtains a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix, and a second difference matrix based on the difference between the second initial matrix and the second half-precision matrix. It can be understood that, still owing to the limitations of some parallel computing hardware, the matrix multiplication apparatus only supports multiplication of half-precision matrices; therefore, the first difference matrix and the second difference matrix are both half-precision matrices.
In some embodiments, to more effectively improve the accuracy of single-precision matrix multiplication on a computing chip that supports half-precision computation, the first difference matrix and the second difference matrix need to be generated more precisely. The process of generating the first and second difference matrices provided by an embodiment of the present application is described below.
Referring to FIG. 5, obtaining the first difference matrix based on the difference between the first initial matrix and the first half-precision matrix includes steps S501 and S502.
Step S501: performing single-precision processing on the first half-precision matrix to obtain a first intermediate matrix.
Step S502: performing half-precision processing on the difference between the first initial matrix and the first intermediate matrix to obtain the first difference matrix.
In some embodiments, to generate the first difference matrix R_A-half more precisely, the difference between the first half-precision matrix A_half and the first initial matrix A_single must be determined precisely. The first half-precision matrix A_half is therefore converted to single precision to obtain the first intermediate matrix to_single(A_half). Half-precision processing is then applied to the difference A_single - to_single(A_half) between the first initial matrix A_single and the first intermediate matrix to_single(A_half), yielding a high-accuracy first difference matrix:

R_A-half = to_half(A_single - to_single(A_half))    (1)
In some embodiments, analogously to the first difference matrix, to generate the second difference matrix R_B-half more precisely, the difference between the second half-precision matrix B_half and the second initial matrix B_single must be determined precisely. The second half-precision matrix B_half is therefore converted to single precision to obtain the second intermediate matrix to_single(B_half). Half-precision processing is then applied to the difference B_single - to_single(B_half) between the second initial matrix B_single and the second intermediate matrix to_single(B_half), yielding a high-accuracy second difference matrix:

R_B-half = to_half(B_single - to_single(B_half))    (2)
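As an illustration, formulas (1) and (2) can be sketched in numpy as follows; the dtype casts stand in for to_half/to_single, and numpy rounds on conversion rather than truncating, so this approximates the scheme rather than reproducing the hardware conversion:

```python
import numpy as np

def to_half(x):
    # Models the half-precision conversion (numpy rounds to nearest).
    return x.astype(np.float16)

def to_single(x):
    # Models the single-precision conversion (exact for FP16 inputs).
    return x.astype(np.float32)

rng = np.random.default_rng(0)
A_single = rng.standard_normal((8, 8), dtype=np.float32)  # first initial matrix
B_single = rng.standard_normal((8, 8), dtype=np.float32)  # second initial matrix

A_half, B_half = to_half(A_single), to_half(B_single)

R_A_half = to_half(A_single - to_single(A_half))  # formula (1)
R_B_half = to_half(B_single - to_single(B_half))  # formula (2)

# The residuals are roughly 2^-11 times the originals, as discussed below.
print(np.abs(R_A_half).max(), np.abs(A_half).max())
```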
Step S104: accumulating the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first single-precision target matrix, and using the first single-precision target matrix as the result of the matrix multiplication of the first initial matrix and the second initial matrix.
In some embodiments, after the first and second difference matrices are obtained, to more effectively improve the accuracy of single-precision matrix multiplication on a computing chip that supports half-precision computation, the multiplication of the first initial matrix and the second initial matrix can be further expanded as:

A_single·B_single ≈ A_half·B_half + A_half·R_B-half + B_half·R_A-half + R_A-half·R_B-half    (3)
It can be understood that formula (3) includes the product A_half·B_half of the first and second half-precision matrices as well as the compensation term A_half·R_B-half + B_half·R_A-half + R_A-half·R_B-half.
In some embodiments, since the first difference matrix R_A-half stores the difference between the first initial matrix A_single and the first half-precision matrix A_half, its values are small compared with those of A_half; likewise, since the second difference matrix R_B-half stores the difference between the second initial matrix B_single and the second half-precision matrix B_half, its values are small compared with those of B_half. Consequently, in formula (3) the product R_A-half·R_B-half of the two difference matrices is very small compared with the other terms. Therefore, to improve the processing efficiency of the processing method for matrix multiplication in parallel computing hardware, the product R_A-half·R_B-half in formula (3) can be ignored as a redundant term. On this basis, the first single-precision target matrix, i.e., the result of the matrix multiplication of the first and second initial matrices, can be obtained by accumulating the product of the first half-precision matrix A_half and the second half-precision matrix B_half, the product of the first half-precision matrix A_half and the second difference matrix R_B-half, and the product of the second half-precision matrix B_half and the first difference matrix R_A-half:

C_single ≈ A_half·B_half + A_half·R_B-half + B_half·R_A-half    (4)
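A hedged numpy sketch of formula (4) under the same modeling assumptions shows the compensation at work; the hardware's half-precision multiply with higher-precision accumulation is modeled here by casting to FP32 before the product:

```python
import numpy as np

f16, f32 = np.float16, np.float32
rng = np.random.default_rng(1)
A_single = rng.standard_normal((64, 64), dtype=f32)
B_single = rng.standard_normal((64, 64), dtype=f32)

A_half, B_half = A_single.astype(f16), B_single.astype(f16)
R_A = (A_single - A_half.astype(f32)).astype(f16)  # first difference matrix
R_B = (B_single - B_half.astype(f32)).astype(f16)  # second difference matrix

C_naive = A_half.astype(f32) @ B_half.astype(f32)   # no compensation
C_single = (C_naive
            + A_half.astype(f32) @ R_B.astype(f32)
            + B_half.astype(f32) @ R_A.astype(f32))  # formula (4)

C_ref = A_single @ B_single
print(np.abs(C_ref - C_naive).max())   # error of the plain half multiply
print(np.abs(C_ref - C_single).max())  # compensated error is much smaller
```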
In some embodiments, some application scenarios value the accuracy of single-precision matrix multiplication more than processing efficiency. For these, to more effectively improve the accuracy of single-precision matrix multiplication on a computing chip that supports half-precision computation, the product R_A-half·R_B-half of the first and second difference matrices in formula (3) (i.e., the redundant term) still needs to be taken into account.
Therefore, referring to FIG. 6, accumulating the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain the first single-precision target matrix includes steps S601 to S603.
Step S601: accumulating the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain a first addition term.
Step S602: obtaining a second addition term from the product of the first difference matrix and the second difference matrix.
Step S603: accumulating the first addition term and the second addition term to obtain the first single-precision target matrix.
In some embodiments, to more effectively improve the accuracy of single-precision matrix multiplication on a computing chip that supports half-precision computation, based on formula (3), the product A_half·B_half of the first and second half-precision matrices, the product A_half·R_B-half of the first half-precision matrix and the second difference matrix, and the product B_half·R_A-half of the second half-precision matrix and the first difference matrix are accumulated to obtain the first addition term:

C_single_1 = A_half·B_half + A_half·R_B-half + B_half·R_A-half    (5)

The second addition term is then obtained from the product R_A-half·R_B-half of the first and second difference matrices:

C_single_2 = R_A-half·R_B-half    (6)
Accumulating the first addition term C_single_1 and the second addition term C_single_2 yields the first single-precision target matrix:

C_single = C_single_1 + C_single_2 = A_half·B_half + A_half·R_B-half + B_half·R_A-half + R_A-half·R_B-half    (7)
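Continuing the earlier sketch (reusing A_half, B_half, R_A, R_B, and f32 from the block after step S104), formulas (5) to (7) simply add the redundant term back in:

```python
# Accuracy-first variant of the sketch above.
C_single_1 = (A_half.astype(f32) @ B_half.astype(f32)
              + A_half.astype(f32) @ R_B.astype(f32)
              + B_half.astype(f32) @ R_A.astype(f32))  # formula (5)
C_single_2 = R_A.astype(f32) @ R_B.astype(f32)         # formula (6)
C_full = C_single_1 + C_single_2                       # formula (7)
```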
In some embodiments, under the IEEE-754 standard, if two floating-point numbers differ too much in magnitude, the smaller one may be lost when the two are added; this is the "floating-point underflow" (absorption) phenomenon. Taking FP32 with its 23-bit mantissa as an example, 2^23 = 8388608, which is 7 digits long, meaning that FP32 can represent at most 7 significant decimal digits; the 7th digit is not always exact, but the first 6 digits always are. Suppose a = 112345.1 and b = 0.00001. Mathematically a + b = 112345.10001, but the computed result is a + b = 112345.101562, i.e., the smaller floating-point number is partially lost. This happens because converting 112345.10001 to an IEEE-754 single-precision value first normalizes it to 1.1234510001 × 10^5, and since at most 7 significant decimal digits can be carried, the digits of 1.1234510001 become inaccurate from the 8th decimal position onward. In addition, if two nearly equal numbers are subtracted, the result of the subtraction is very small; if it is too small to be represented, it is rounded to zero.
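This absorption is easy to reproduce; a short float32 demo (illustrative only) of the example above:

```python
import numpy as np

a = np.float32(112345.1)
b = np.float32(0.00001)

print(float(a))            # 112345.1015625: a itself already carries rounding error
print(float(a + b))        # unchanged: b falls below a's last mantissa bit
print(float((a + b) - a))  # 0.0: b was absorbed entirely
```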
Analyzing formula (7): on the one hand, if the first initial matrix A_single and the first half-precision matrix A_half are very close, the first difference matrix R_A-half is very small, which indirectly makes B_half·R_A-half in formula (7) very small; similarly, if the second initial matrix B_single and the second half-precision matrix B_half are very close, the second difference matrix R_B-half is very small, which indirectly makes A_half·R_B-half in formula (7) very small. On the other hand, the values in the first difference matrix R_A-half, which additionally records the remaining second mantissa part l2 of the first initial matrix A_single, are more than 10 binary orders of magnitude smaller than the values of the second half-precision matrix B_half, which may further cause floating-point underflow when computing B_half·R_A-half; likewise, the values in the second difference matrix R_B-half, which additionally records the remaining second mantissa part l_B-2 of the second initial matrix B_single, are more than 10 binary orders of magnitude smaller than the values of the first half-precision matrix A_half, which may also cause floating-point underflow when computing A_half·R_B-half.
Therefore, to solve the floating-point underflow problem and thereby more effectively improve the accuracy of single-precision matrix multiplication on a computing chip that supports half-precision computation, the first difference matrix R_A-half and the second difference matrix R_B-half need to be further processed as follows.
Therefore, referring to FIG. 7, accumulating the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix to obtain the first single-precision target matrix includes steps S701 to S705.
Step S701: obtaining a preset multiplier determined according to the number of mantissa bits.
Step S702: multiplying the first difference matrix by the preset multiplier to obtain a first magnified matrix, and multiplying the second difference matrix by the preset multiplier to obtain a second magnified matrix.
Step S703: accumulating the product of the first half-precision matrix and the second magnified matrix and the product of the second half-precision matrix and the first magnified matrix to obtain an intermediate magnified matrix.
Step S704: dividing the intermediate magnified matrix by the preset multiplier to obtain an intermediate reduced matrix.
Step S705: accumulating the product of the first half-precision matrix and the second half-precision matrix with the intermediate reduced matrix to obtain the first single-precision target matrix.
In some embodiments, the first half-precision matrix A_half stores the first mantissa part l1 of the first initial matrix A_single, the first difference matrix R_A-half stores the remaining second mantissa part l2 of A_single, and the IEEE-754 format includes a hidden leading mantissa bit. The values of the first initial matrix A_single and the values of the first difference matrix R_A-half therefore actually differ by at least 11 binary orders of magnitude. For these reasons, the preset multiplier is determined to be 2^11 according to the number of mantissa bits of the half-precision data structure. The first difference matrix R_A-half is then multiplied by the preset multiplier 2^11 to obtain the first magnified matrix:

2^11·R_A-half = to_half((A_single - to_single(A_half)) × 2^11)    (8)
At this point, the values of the first magnified matrix 2^11·R_A-half and the values of the second half-precision matrix B_half are of similar magnitude, which effectively prevents floating-point underflow in B_half·(2^11·R_A-half). Similarly, multiplying the second difference matrix by the preset multiplier yields the second magnified matrix:

2^11·R_B-half = to_half((B_single - to_single(B_half)) × 2^11)    (9)
At this point, the values of the second magnified matrix 2^11·R_B-half and the values of the first half-precision matrix A_half are of similar magnitude, which effectively prevents floating-point underflow in A_half·(2^11·R_B-half). Next, based on the middle terms A_half·R_B-half + B_half·R_A-half of formula (7), the product of the first half-precision matrix and the second magnified matrix and the product of the second half-precision matrix and the first magnified matrix are accumulated to obtain the intermediate magnified matrix:

A_half·(2^11·R_B-half) + B_half·(2^11·R_A-half)    (10)
Compared with formula (4), the intermediate magnified matrix has been scaled up by the preset multiplier 2^11; therefore, once the floating-point underflow problem has been dealt with, the intermediate magnified matrix must be divided by the preset multiplier 2^11 to obtain the intermediate reduced matrix:

(A_half·(2^11·R_B-half) + B_half·(2^11·R_A-half)) / 2^11    (11)

At this point, based on formula (4), the product of the first half-precision matrix and the second half-precision matrix is accumulated with the intermediate reduced matrix to obtain the first single-precision target matrix:

C_single ≈ A_half·B_half + (A_half·(2^11·R_B-half) + B_half·(2^11·R_A-half)) / 2^11    (12)
This effectively avoids floating-point underflow in the B_half·R_A-half computation caused by the excessive gap between the values of the first difference matrix R_A-half and the second half-precision matrix B_half, and likewise avoids floating-point underflow in the A_half·R_B-half computation caused by the excessive gap between the values of the second difference matrix R_B-half and the first half-precision matrix A_half, thereby more effectively improving the accuracy of single-precision matrix multiplication on a computing chip that supports half-precision computation.
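The scaling trick of formulas (8) to (12) can be sketched as follows (same numpy modeling assumptions as in the earlier blocks; the casts stand in for to_half/to_single):

```python
import numpy as np

f16, f32 = np.float16, np.float32
S = f32(2.0 ** 11)  # preset multiplier

rng = np.random.default_rng(2)
A_single = rng.standard_normal((64, 64), dtype=f32)
B_single = rng.standard_normal((64, 64), dtype=f32)
A_half, B_half = A_single.astype(f16), B_single.astype(f16)

# Scale the residuals up before the FP16 cast so they stay well inside
# FP16's normal range, then scale the compensation product back down.
R_A_scaled = ((A_single - A_half.astype(f32)) * S).astype(f16)  # formula (8)
R_B_scaled = ((B_single - B_half.astype(f32)) * S).astype(f16)  # formula (9)

mid = (A_half.astype(f32) @ R_B_scaled.astype(f32)
       + B_half.astype(f32) @ R_A_scaled.astype(f32))           # formula (10)
C_single = A_half.astype(f32) @ B_half.astype(f32) + mid / S    # formulas (11)-(12)

print(np.abs(A_single @ B_single - C_single).max())
```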
In some embodiments, when the data scale of the first target matrix and the second target matrix is large, the first target matrix needs to be divided into a plurality of first initial matrices and the second target matrix into a plurality of second initial matrices; the matrix multiplication process of the first and second target matrices is then as shown in FIG. 3. First, according to the block position of each first initial matrix in the first target matrix, the corresponding second initial matrix is selected in the second target matrix, and a plurality of matrix groups are generated:

(A_single,i, B_single,i), i = 1, 2, ..., k    (13)

where k denotes the number of first initial matrices and of second initial matrices. Performing the matrix operation on each first initial matrix and its corresponding second initial matrix yields a plurality of first single-precision target matrices C_single,i; finally, accumulating all first single-precision target matrices yields the second single-precision target matrix (denoted C'_single here to distinguish it from the per-block results), i.e., the result of multiplying the first target matrix by the second target matrix:

C'_single = Σ_{i=1}^{k} C_single,i    (14)
Combining formulas (12) and (14), when the first target matrix is divided into a plurality of first initial matrices and the second target matrix into a plurality of second initial matrices, reference is made to FIG. 8, which is a schematic flowchart of a blocked matrix multiplication operation provided by an embodiment of the present application. For the first initial matrix and the second initial matrix of each matrix group in the sequence, the operations of formula (12) are performed in turn to obtain the first single-precision matrices C_single,i, which are then accumulated to obtain the second single-precision matrix C'_single.
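A hedged sketch of this FIG. 8 style flow, splitting along the K dimension (block sizes and helper names are illustrative):

```python
import numpy as np

f16, f32 = np.float16, np.float32
S = f32(2.0 ** 11)

def block_formula_12(A_blk, B_blk):
    # Compensated half-precision multiply for one block, per formula (12).
    A_h, B_h = A_blk.astype(f16), B_blk.astype(f16)
    R_A = ((A_blk - A_h.astype(f32)) * S).astype(f16)
    R_B = ((B_blk - B_h.astype(f32)) * S).astype(f16)
    return (A_h.astype(f32) @ B_h.astype(f32)
            + (A_h.astype(f32) @ R_B.astype(f32)
               + B_h.astype(f32) @ R_A.astype(f32)) / S)

rng = np.random.default_rng(3)
A = rng.standard_normal((32, 256), dtype=f32)   # first target matrix
B = rng.standard_normal((256, 32), dtype=f32)   # second target matrix

k = 4
step = 256 // k
C = sum(block_formula_12(A[:, i*step:(i+1)*step], B[i*step:(i+1)*step, :])
        for i in range(k))                      # accumulation per formula (14)
print(np.abs(C - A @ B).max())
```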
It can be understood that, to further improve the accuracy of the matrix multiplication, the flow shown in FIG. 8 should additionally include the redundant terms R_A-half·R_B-half, as shown in FIG. 9. FIG. 9 is another schematic flowchart of a blocked matrix multiplication operation provided by an embodiment of the present application; compared with FIG. 8, the multiplications for the redundant terms R_A-half·R_B-half also need to be performed.
In some embodiments, when the data scale of the first and second target matrices is small, that is, when the maximum matrix multiplication order supported by the parallel computing hardware exceeds the data scale of both target matrices, the parallel computing hardware can process the multiplication of the first and second target matrices directly. In this case, the first target matrix is used directly as the first initial matrix and the second target matrix as the second initial matrix; it can be understood that the first single-precision target matrix C_single is then equal to the second single-precision target matrix C'_single.
In some embodiments, when the operation flow shown in FIG. 8 is used, floating-point underflow can still occur because different first single-precision matrices C_single,i differ greatly in magnitude. In addition, since in formula (12) the values of A_half·B_half and the values of (A_half·(2^11·R_B-half) + B_half·(2^11·R_A-half))/2^11 differ greatly in magnitude, the parallel computing hardware must allocate storage buffers of different sizes for these two terms. If the flow of FIG. 8 were still used, then when performing the matrix multiplication of the first and second target matrices, the parallel computing hardware would have to repeatedly allocate different storage buffers to hold the A_half·B_half data and the compensation-term data in turn, which limits the efficiency of matrix multiplication in the parallel computing hardware.
Therefore, to improve the accuracy of single-precision matrix multiplication and, at the same time, the efficiency of matrix multiplication in parallel computing hardware, the operation flow shown in FIG. 8 is further optimized as follows. Referring to FIG. 10, the processing method for matrix multiplication in parallel computing hardware provided by an embodiment of the present application further includes the following steps S1001 to S1003.
Step S1001: selecting second initial matrices in the second target matrix according to the block positions of the first initial matrices in the first target matrix, and generating a plurality of matrix sequences.
Step S1002: accumulating the products of the first half-precision matrix and the second half-precision matrix of every matrix group to obtain a first cumulative matrix of the matrix sequence, and accumulating the sums of the product of the first half-precision matrix and the second difference matrix and the product of the second half-precision matrix and the first difference matrix of every matrix group to obtain a second cumulative matrix of the matrix sequence.
Step S1003: obtaining the second single-precision target matrix from the first cumulative matrix and the second cumulative matrix, and using the second single-precision target matrix as the result of the matrix multiplication of the first target matrix and the second target matrix.
In some embodiments, as in the operations above, when the first target matrix is divided into a plurality of first initial matrices and the second target matrix into a plurality of second initial matrices, then, referring to formula (13) and FIG. 3, second initial matrices are selected in the second target matrix according to the block positions of the first initial matrices in the first target matrix, and a plurality of matrix sequences are generated. Each matrix sequence comprises a plurality of matrix groups, and each matrix group comprises a first initial matrix and its corresponding second initial matrix, the matrix groups being as shown in formula (13).
Next, referring to FIG. 11, which is a schematic flowchart of an improved blocked matrix multiplication operation provided by an embodiment of the present application, the products of the first half-precision matrix and the second half-precision matrix of every matrix group are accumulated to obtain the first cumulative matrix of the matrix sequence:

α = Σ_{i=1}^{k} A_half,i·B_half,i    (15)

Then, the sums of the product of the first half-precision matrix and the second difference matrix and the product of the second half-precision matrix and the first difference matrix of every matrix group are accumulated (with the 2^11 scaling of formulas (8) and (9) retained inside the accumulation) to obtain the second cumulative matrix of the matrix sequence:

β = Σ_{i=1}^{k} (A_half,i·(2^11·R_B-half,i) + B_half,i·(2^11·R_A-half,i)) / 2^11    (16)
The first cumulative matrix and the second cumulative matrix are then accumulated to obtain the second single-precision target matrix C'_single. This effectively avoids the floating-point underflow caused by large magnitude gaps between different first single-precision matrices C_single,i, improving the accuracy of single-precision matrix multiplication. At the same time, when performing the matrix multiplication of the first and second target matrices, the parallel computing hardware only needs to allocate storage buffers of two different sizes. Moreover, the parallel computing capability of the hardware can be better exploited, since the multiple A_half,i·B_half,i products and the multiple compensation products can be computed simultaneously, effectively improving the efficiency of single-precision matrix multiplication.
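The reorganized FIG. 11 flow can be sketched as two running accumulators (the names alpha and beta mirror the α[k]/β[k] terms mentioned later; all other names are illustrative):

```python
import numpy as np

f16, f32 = np.float16, np.float32
S = f32(2.0 ** 11)

rng = np.random.default_rng(4)
A = rng.standard_normal((32, 256), dtype=f32)
B = rng.standard_normal((256, 32), dtype=f32)
k = 4
step = 256 // k

alpha = np.zeros((32, 32), dtype=f32)  # first cumulative matrix, formula (15)
beta = np.zeros((32, 32), dtype=f32)   # second cumulative matrix (still x 2^11)
for i in range(k):
    A_blk, B_blk = A[:, i*step:(i+1)*step], B[i*step:(i+1)*step, :]
    A_h, B_h = A_blk.astype(f16), B_blk.astype(f16)
    R_A = ((A_blk - A_h.astype(f32)) * S).astype(f16)
    R_B = ((B_blk - B_h.astype(f32)) * S).astype(f16)
    alpha += A_h.astype(f32) @ B_h.astype(f32)
    beta += A_h.astype(f32) @ R_B.astype(f32) + B_h.astype(f32) @ R_A.astype(f32)

C = alpha + beta / S  # second single-precision target matrix
print(np.abs(C - A @ B).max())
```

Keeping the two sums in separate, uniformly sized buffers is what lets the hardware avoid the repeated reallocation described above.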
It can be understood that, to further improve the accuracy of the matrix multiplication, the flow shown in FIG. 11 should additionally include the redundant terms R_A-half·R_B-half, as shown in FIG. 12. FIG. 12 is a further improved schematic flowchart of a blocked matrix multiplication operation provided by an embodiment of the present application; compared with FIG. 11, the multiplications for the redundant terms R_A-half·R_B-half also need to be performed.
In some embodiments, to improve the reliability of the processing method for matrix multiplication in parallel computing hardware, a detection and comparison stage also needs to be added. Therefore, referring to FIG. 13, the processing method for matrix multiplication in parallel computing hardware further includes the following steps S1301 to S1304.
Step S1301: performing double-precision processing based on the single-precision data to obtain a first double-precision matrix of the first target matrix, a second double-precision matrix of the second target matrix, and a check matrix of the second single-precision target matrix.
Step S1302: multiplying the first double-precision matrix by the second double-precision matrix to obtain an evaluation matrix.
Step S1303: obtaining a check result based on the check matrix and the evaluation matrix.
Step S1304: comparing the check result with a preset check threshold, and outputting the second single-precision target matrix based on the comparison result.
In some embodiments, to improve the reliability of the processing method for matrix multiplication in parallel computing hardware, and to verify the accuracy of the second single-precision target matrix C'_single obtained by the method provided by the present application, the high precision of the double-precision data structure is exploited. After obtaining the second single-precision target matrix C'_single, the matrix multiplication apparatus performs double-precision processing on it, based on its single-precision data structure, to obtain the check matrix to_double(C'_single). Likewise, based on the single-precision data structure of the first target matrix A_single, double-precision processing is performed on A_single to obtain its first double-precision matrix to_double(A_single), and based on the single-precision data structure of the second target matrix B_single, double-precision processing is performed on B_single to obtain its second double-precision matrix to_double(B_single). The first and second double-precision matrices are then multiplied in double precision to obtain the evaluation matrix to_double(A_single)·to_double(B_single). The check based on the check matrix and the evaluation matrix is then:

V = ||to_double(C'_single) - to_double(A_single)·to_double(B_single)||_F / ||to_double(A_single)·to_double(B_single)||_F    (17)
Here V stores the relative residual between the check matrix and the evaluation matrix, and characterizes the actual error between the processing method for matrix multiplication in parallel computing hardware provided by this embodiment and a direct matrix multiplication of the first and second target matrices; ||·||_F is the Frobenius norm of a matrix, i.e., the Euclidean norm of the matrix viewed as a vector (the square root of the sum of the squares of its entries), which is commonly used to measure the overall magnitude of a matrix when assessing the reliability of a numerical solution.
When the relative residual V is smaller than the preset check threshold, the processing method for matrix multiplication in parallel computing hardware provided by the embodiments of the present application can be judged effective, and the second single-precision target matrix can therefore be output as the result of the matrix multiplication of the first and second target matrices. The embodiments of the present application place no constraint on how the preset check threshold is set: it may be preset manually, or it may be derived by the matrix multiplication apparatus from historical operation patterns.
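A sketch of this verification step, formula (17), follows; the threshold below is an assumed placeholder, since the patent leaves its value unconstrained:

```python
import numpy as np

def relative_residual(A_single, B_single, C_single):
    # Evaluation matrix: double-precision reference product.
    evaluation = A_single.astype(np.float64) @ B_single.astype(np.float64)
    # Check matrix: the single-precision result promoted to double precision.
    check = C_single.astype(np.float64)
    return (np.linalg.norm(check - evaluation, 'fro')
            / np.linalg.norm(evaluation, 'fro'))

# Usage, reusing A, B, C from the blocked sketch above:
# V = relative_residual(A, B, C)
# if V < 1e-6:  # assumed threshold
#     print("result accepted", V)
```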
It can be understood that the check steps S1301 to S1304 serve to verify the reliability of the processing method for matrix multiplication in parallel computing hardware provided by the embodiments of the present application, whereas the method itself is mainly intended for the matrix multiplication of two single-precision matrices (i.e., the first and second target matrices) on parallel computing hardware that only supports half-precision matrix multiplication. In that case, the check steps S1301 to S1304 are executed on other parallel computing hardware that supports double-precision matrix multiplication, and they have no effect on the processing method for matrix multiplication in parallel computing hardware provided by the embodiments of the present application.
In some examples, the processing method for matrix multiplication in parallel computing hardware provided by the present application can be applied to the Ascend AI processor, a domain-specific chip whose core is an artificial intelligence chip. FIG. 14 is a schematic diagram of the chip structure of the Ascend AI processor. The Ascend AI processor provides three basic computing units: a matrix computing unit (CUBE), a vector computing unit (Vector), and a scalar computing unit (Scalar); the three computing units form three independent pipelines for the corresponding computations. The L1 buffer stores the first target matrix and the second target matrix; matrix conversion operations are then performed and saved in the buffer conversion unit. The conversion operations include dividing the first target matrix into a plurality of first initial matrices, converting each first initial matrix into a first half-precision matrix, and generating the first difference matrix and the first magnified matrix; similarly, the second target matrix is divided into a plurality of second initial matrices, each second initial matrix is converted into a second half-precision matrix, and the second difference matrix and the second magnified matrix are generated. Buffers L0A and L0B store the matrices about to be multiplied. The matrix computing unit receives the matrices from buffers L0A and L0B and performs the matrix multiplications which, in conjunction with the execution flow shown in FIG. 12, include the per-block products A_half,i·B_half,i, the compensation products A_half,i·(2^11·R_B-half,i) and B_half,i·(2^11·R_A-half,i), and the redundant terms R_A-half,i·R_B-half,i. The accumulator accumulates the data produced by the matrix computing unit, i.e., the addition operations shown in FIG. 12. Buffer L0C stores the data computed by the accumulator, finally producing the second single-precision target matrix C'_single required by this scheme.
In the artificial intelligence chip of the Ascend AI processor, the unified buffer is an important internal component used to store data shared between different computing units. The system control module is responsible for the overall operation and scheduling of the chip, including coordination and management among the functional modules and the handling of external input and output. The bus interface module connects the Ascend chip to other external devices, providing an interface for data transfer and communication with the host system or other external devices. The instruction cache is a high-speed cache for instructions, used to increase instruction fetch speed, reduce instruction access latency, and improve execution efficiency. The scalar instruction processing queue handles scalar instructions, i.e., instructions that operate on a single datum, such as addition or multiplication; it receives and decodes scalar instructions and dispatches them to the corresponding functional units for execution. The instruction dispatch module distributes decoded instructions to the appropriate functional units according to their type and operands, enabling parallel execution of instructions and efficient use of computing resources.

In summary, the unified buffer in the Ascend chip stores data shared between computing units, the system control module handles overall scheduling and management, the bus interface module communicates with external devices, the instruction cache speeds up instruction fetch, the scalar instruction processing queue handles scalar instructions, and the instruction dispatch module distributes decoded instructions to the corresponding functional units; together these modules make up part of the functionality of the Ascend chip. The CUBE queue is a hardware module that manages and schedules task execution; it can execute multiple tasks in parallel and switches and schedules tasks according to a scheduling policy to improve computing efficiency. The Vector queue is another hardware module that supports vector computation; it executes vector operations efficiently, processing multiple vector data in parallel to improve the efficiency and performance of vector computation. The storage conversion queue handles data storage and conversion, managing data transfer and conversion between different storage media, such as reads and writes between memory and external storage devices. The time synchronization module guarantees time synchronization among multiple Ascend chips; through a precise clock synchronization mechanism, it ensures that multiple Ascend chips share a consistent time base when executing tasks, supporting distributed computing and collaborative work. The vector computing unit executes vector computations, efficiently performing operations on large-scale vector data and providing parallel computing capability to accelerate vector-intensive tasks. The scalar computing unit executes scalar computations, processing single data elements (addition, subtraction, multiplication, division, logic operations, and the like) to support scalar-intensive tasks. The configuration port is an interface used to configure and manage the chip's parameters and settings; it provides a communication channel to the chip's internal control logic for reading and writing the chip's configuration registers, enabling flexible configuration of chip function and performance. The bus is a communication channel connecting the modules inside the chip and external devices; it transfers data and control signals, enabling communication and cooperation among the chip's modules. The L2 cache is a high-speed cache region located between the chip core and memory; it stores frequently accessed data and instructions and provides fast read/write and access speeds to accelerate computation and data processing.

DDR (Double Data Rate) is a memory type used in the Ascend chip and one of the common memory types in computer systems. DDR memory uses double-data-rate technology to transfer twice the amount of data per clock cycle, providing higher memory bandwidth and faster data access for storing, reading, and writing large-scale data.
It can be understood that the execution steps of the main improvements of the present application reside mainly in the L1 buffer, the buffer conversion unit, buffers L0A, L0B, and L0C, the matrix computing unit, and the accumulator of the artificial intelligence chip of the Ascend AI processor.
In some embodiments, the processing method for matrix multiplication in parallel computing hardware provided by the present application can be applied to ordinary parallel computing hardware for blocked execution. Parallel computing hardware usually contains multiple artificial intelligence chips, so the multiplication for each block can be processed in parallel on its own artificial intelligence chip. For example, in the execution flow shown in FIG. 12, the first artificial intelligence chip performs the blocked matrix multiplication for i = 1, and by analogy every block matrix multiplication can be assigned to a corresponding artificial intelligence chip, i.e., the per-block products and their compensation terms are executed in parallel and the resulting data are finally summed, improving the efficiency of the matrix multiplication. Taking the Ascend AI processor as an example, it supports 32 artificial intelligence chips and can theoretically process 32 blocked matrix multiplications simultaneously. Four artificial intelligence chips of an Ascend AI processor are selected below for illustration, with further description given with reference to FIG. 15. FIG. 15 is a schematic diagram of the execution of blocked parallel operations provided by an embodiment of the present application. The first target matrix and the second target matrix are divided into 4 blocks; the data length and block index handled by each artificial intelligence chip are computed; the 4 artificial intelligence chips then compute in parallel; and finally the results of each chip are aggregated, accelerating the matrix computation through multi-chip parallelism. Compared with using a single artificial intelligence chip, parallel computation on 4 chips can theoretically improve efficiency fourfold.
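A host-side sketch of this block-parallel pattern (a thread pool stands in for the dispatch to multiple AI chips; names and shapes are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block_product(A_blk, B_blk):
    # In the real flow each worker would run the compensated half-precision
    # multiply; a plain FP32 product keeps the sketch short.
    return A_blk @ B_blk

rng = np.random.default_rng(5)
A = rng.standard_normal((32, 256), dtype=np.float32)
B = rng.standard_normal((256, 32), dtype=np.float32)
k = 4
step = 256 // k
blocks = [(A[:, i*step:(i+1)*step], B[i*step:(i+1)*step, :]) for i in range(k)]

# Each worker (one per "chip") handles one K-block; partial results are summed.
with ThreadPoolExecutor(max_workers=k) as pool:
    partials = pool.map(lambda ab: block_product(*ab), blocks)
C = sum(partials)
print(np.abs(C - A @ B).max())
```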
In some embodiments, as can be seen from FIG. 14, when the core of the artificial intelligence chip of the Ascend AI processor performs matrix computation, the input data must be placed in the L0 buffers and the output ultimately uses the unified buffer (UB); meanwhile, memory allocated on the host side must pass through Global Memory (GM) before it can be used by the CUBE core. The impact of data movement on computing performance must therefore be fully considered.
When the amount of data per block is small, data movement is the main factor affecting performance. To make full use of the high-speed internal cache (UB) on the Ascend AI processor, which is only 256 KB, the computation within one block would otherwise have to request different memory sizes on the UB many times; this limits the size of the data blocks that can be held in the UB and forces data in the UB to be swapped out to temporary GM storage, reducing execution efficiency. To solve this problem, identical computations can be modularized, i.e., α[k] and β[k] are computed separately; UB allocations of the same size can then be requested uniformly, reducing the number of UB allocations and the number of data transfers between GM and UB, and improving execution efficiency.
In some embodiments, when the amount of input data participating in the matrix multiplication is particularly large, the computation time exceeds the data-movement time; blocking only solves the UB size limitation and cannot solve the time problem, so the multi-core processing capability of Ascend AI must be fully exploited. For example, suppose that for a test matrix (M, N, K) the computation along K is distributed over multiple cores for parallel computation. After the multi-core computation, the results of all cores for the same block must be accumulated, so the intermediate results of the two computations must be stored separately in advance.
The Ascend AI processor achieves acceleration through massive parallel computing capability. Using the Bisheng C++ language tool, the mapping onto the device's computing cores is expressed through the concept of parallel computing workgroups: each workgroup is mapped onto a specific core, and every workgroup runs the same instruction code but carries a different identifier, similar to an SPMD (Single Program Multiple Data) technique.
SPMD is a parallel computing model in which multiple processors or computing units execute the same program simultaneously but on different data. In the SPMD model, each processor or computing unit has its own data set and independently executes the same instruction sequence to process that data. SPMD is a task-parallel model suitable for many parallel computing applications, such as scientific computing, image processing, and data analysis. Under this model, the programmer divides the computing problem into multiple independent tasks and assigns a different data set to each task; each processor or computing unit then executes the same program in parallel on its own data set.
Referring to FIG. 16, which is a simulation diagram comparing the accuracy of the processing method for matrix multiplication in parallel computing hardware provided by an embodiment of the present application. To further verify the reliability of the results obtained by the method, this embodiment compares the NPU verification result (the present scheme), i.e., the matrix multiplication processing scheme provided by the present application, with two other schemes: 1. the NPU verification result (no optimization): as in the related art, the first and second target matrices are divided directly into a plurality of first and second initial matrices, which are converted into the corresponding half-precision matrices and multiplied in half precision; 2. the sgemm verification result of the OpenBLAS library: as in the related art, the first and second target matrices are multiplied directly in single precision using sgemm from the OpenBLAS library. It can be understood that the OpenBLAS library is an open-source Basic Linear Algebra Subprograms (BLAS) library for high-performance numerical computation; BLAS is a standard set of interfaces and functions for common linear algebra operations such as matrix multiplication and vector addition, and within OpenBLAS, sgemm is the function that performs single-precision matrix multiplication.
The horizontal axis in Figure 16 is the size k of the input matrices (i.e., the number of columns K of the first target matrix and the number of rows K of the second target matrix), and the vertical axis is the accuracy test difference. It can be clearly seen that, compared with the NPU verification result (no optimization), the relative residual V of the second single-precision target matrix C_single obtained with the NPU verification result (the present solution) is very low, and the accuracy of that matrix is very close to the OpenBLAS sgemm verification result, thereby verifying the reliability of the results obtained with the processing method for matrix multiplication in parallel computing hardware provided in the embodiments of the present application.
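The exact formula behind the accuracy test difference is not spelled out here; a common choice, assumed in this sketch, is the Frobenius-norm relative residual against a higher-precision reference product.

    import numpy as np

    def relative_residual(c_test: np.ndarray, a32: np.ndarray, b32: np.ndarray) -> float:
        # Assumed metric: ||C - C_ref||_F / ||C_ref||_F with a float64 reference.
        c_ref = a32.astype(np.float64) @ b32.astype(np.float64)
        return float(np.linalg.norm(c_test - c_ref) / np.linalg.norm(c_ref))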
Figures 17 and 18 show, respectively, a work efficiency comparison simulation diagram for the processing method for matrix multiplication in parallel computing hardware provided in an embodiment of the present application, and a simulation data table of its work efficiency. To further verify the reliability of the work efficiency gains of this method, this embodiment compares, by simulation, the running time of the optimized computation execution flow used by the present solution (i.e., the execution flow shown in Figure 11, labeled NPU verification result (the present solution) in Figure 17) against the running time of the OpenBLAS sgemm verification result. The horizontal axis in Figure 17 is the size k of the input matrices (i.e., the number of columns K of the first target matrix and the number of rows K of the second target matrix), and the vertical axis is the computation latency. It can be clearly seen that as K grows, the running time of the OpenBLAS sgemm solution rises sharply, while the running time of the NPU verification result (the present solution, i.e., the execution flow shown in Figure 11) rises only slightly. Moreover, combined with the speedup data in Figure 18, it is clear that once the input size k reaches 2^11 or more, the running time of the NPU verification result (the present solution) is far lower than that of the OpenBLAS sgemm solution. It should be understood that the speedup is the ratio of the running time of the OpenBLAS sgemm solution to the running time of the NPU verification result (the present solution). This verifies the reliability of the work efficiency gains of the optimized computation execution flow (i.e., the execution flow shown in Figure 11) adopted in the processing method for matrix multiplication in parallel computing hardware provided in the embodiments of the present application.
Figure 19 is a further precision comparison simulation diagram for the processing method for matrix multiplication in parallel computing hardware provided in an embodiment of the present application. To further verify the reliability of the work efficiency gains of this method, this embodiment compares, by simulation, the accuracy (i.e., the relative residual) obtained by the two schemes shown in Figures 11 and 12, namely the scheme with redundant terms and the scheme without redundant terms. The horizontal axis in Figure 19 is the size k of the input matrices (i.e., the number of columns K of the first target matrix and the number of rows K of the second target matrix), and the vertical axis is the relative residual. It can be clearly seen that the maximum difference between the relative residuals of the two schemes is around 10^-8; that is, the accuracy of the results obtained by the two schemes is essentially the same. Therefore, when the execution flow shown in Figure 11 is adopted, the amount of computation can be effectively reduced by a quarter, the cache usage can accordingly be reduced by a quarter, and the work efficiency of the matrix multiplication operation is further improved.
An embodiment of the present application provides a processing method for matrix multiplication in parallel computing hardware, which can improve the accuracy of single-precision matrix multiplication on computing chips that support half-precision calculation. The first target matrix and the second target matrix, both single-precision matrices, are divided into at least one first initial matrix and at least one second initial matrix, yielding multiple corresponding matrix groups. Half-precision processing based on the single-precision data type then produces, for each matrix group, a first half-precision matrix from the first initial matrix and a second half-precision matrix from the second initial matrix. A first difference matrix is obtained from the difference between the first initial matrix and the first half-precision matrix, and a second difference matrix from the difference between the second initial matrix and the second half-precision matrix. The first and second difference matrices are multiplied by a preset multiplication value, giving a first magnification matrix and a second magnification matrix. The product of the first half-precision matrix and the second magnification matrix and the product of the second half-precision matrix and the first magnification matrix are accumulated to obtain the intermediate magnification matrix of each matrix group; the intermediate magnification matrices of all matrix groups are accumulated into a first accumulation matrix, while the products of the first and second half-precision matrices of all matrix groups are accumulated into a second accumulation matrix. Finally, the first accumulation matrix and the second accumulation matrix are added to obtain the second single-precision target matrix, which is taken as the result of the matrix multiplication of the first target matrix and the second target matrix.
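The following minimal NumPy sketch follows this flow under stated assumptions: the preset multiplication value is taken as 2^11 (a hypothetical choice; any suitably sized power of two would do), the half-precision matrix unit is emulated by casting fp16 operands back to float32 before multiplying (fp16 inputs, fp32 accumulation), and the compensation terms are divided back by the preset value when the two accumulation matrices are combined, a step the summary above leaves implicit. The redundant product of the two magnification matrices is omitted, matching the Figure 11 flow discussed earlier.

    import numpy as np

    SCALE = np.float32(2.0 ** 11)  # hypothetical preset multiplication value (a power of two)

    def half_mma(x16: np.ndarray, y16: np.ndarray) -> np.ndarray:
        # Emulates a half-precision matrix unit: fp16 inputs, fp32 accumulation.
        return x16.astype(np.float32) @ y16.astype(np.float32)

    def compensated_gemm(a_blocks, b_blocks):
        """a_blocks[g] / b_blocks[g]: the g-th matrix group's fp32 initial matrices."""
        first_acc = None   # first accumulation matrix (sum of intermediate magnification matrices)
        second_acc = None  # second accumulation matrix (sum of half-precision products)
        for a32, b32 in zip(a_blocks, b_blocks):
            a16 = a32.astype(np.float16)  # first half-precision matrix
            b16 = b32.astype(np.float16)  # second half-precision matrix
            # Difference matrices magnified by the preset value before the fp16 cast,
            # so the small conversion errors do not underflow in half precision.
            da = ((a32 - a16.astype(np.float32)) * SCALE).astype(np.float16)
            db = ((b32 - b16.astype(np.float32)) * SCALE).astype(np.float16)
            mid = half_mma(a16, db) + half_mma(da, b16)  # intermediate magnification matrix
            first_acc = mid if first_acc is None else first_acc + mid
            prod = half_mma(a16, b16)
            second_acc = prod if second_acc is None else second_acc + prod
        # Undo the magnification of the compensation terms (assumed step) and combine
        # to form the second single-precision target matrix.
        return first_acc / SCALE + second_acc

On random float32 inputs, this compensated scheme typically lowers the relative residual by several orders of magnitude compared with the unoptimized baseline sketched earlier, which is the qualitative behavior Figure 16 reports.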
For the process of performing single-precision matrix multiplication on a computing chip that supports half-precision calculation, the embodiments of the present application first divide the first target matrix and the second target matrix, based on the hardware parameters of the parallel computing hardware, into multiple first initial matrices and second initial matrices, and convert them to half precision to obtain the first and second half-precision matrices, so that the parallel computing hardware can conveniently execute the subsequent computation. The first difference matrix then preserves the error introduced by converting the first initial matrix to the first half-precision matrix, and the second difference matrix preserves the error introduced by converting the second initial matrix to the second half-precision matrix; error compensation terms are thereby added to the multiplication of the first and second half-precision matrices, which effectively improves the accuracy of performing half-precision multiplication on single-precision matrices. The first and second difference matrices are multiplied by the preset multiplication value before the corresponding operations are carried out, which resolves the floating-point underflow problem and further improves the accuracy of the matrix multiplication. Finally, the individual products of each matrix group are separated so that the order of operations can be adjusted, further mitigating floating-point underflow and improving the work efficiency of the matrix multiplication apparatus. A high-accuracy single-precision multiplication result is thus obtained on hardware that supports only half-precision multiplication, while the work efficiency of the multiplication is improved.
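A small demonstration of the underflow problem that the preset multiplication value addresses (the numeric values are illustrative only): the rounding error left over after converting a small single-precision entry to half precision can fall below the smallest half-precision subnormal and vanish when stored, whereas pre-scaling it by a power of two preserves it.

    import numpy as np

    SCALE = np.float32(2.0 ** 11)  # illustrative preset multiplication value

    a = np.float32(3e-5)           # small single-precision entry
    a16 = np.float16(a)            # half-precision approximation
    r = a - np.float32(a16)        # conversion error, about 2e-8 here
    print(np.float16(r))           # underflows: the stored error becomes 0.0
    print(np.float16(r * SCALE))   # the magnified error survives in half precision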
An embodiment of the present application further provides a matrix multiplication operation apparatus that can implement the above processing method for matrix multiplication in parallel computing hardware. Referring to Figure 20, the apparatus 2000 includes:
An acquisition module 2010, configured to acquire a first initial matrix and a second initial matrix, where the first initial matrix and the second initial matrix are both single-precision matrices;
A half-precision conversion module 2020, configured to perform half-precision processing based on the single-precision data type to obtain a first half-precision matrix of the first initial matrix and a second half-precision matrix of the second initial matrix;
A difference processing module 2030, configured to obtain a first difference matrix based on the difference between the first initial matrix and the first half-precision matrix, and a second difference matrix based on the difference between the second initial matrix and the second half-precision matrix, where the first difference matrix and the second difference matrix are both half-precision matrices;
A calculation module 2040, configured to accumulate the product of the first half-precision matrix and the second half-precision matrix, the product of the first half-precision matrix and the second difference matrix, and the product of the second half-precision matrix and the first difference matrix, to obtain a first single-precision target matrix, which is taken as the result of the matrix multiplication of the first initial matrix and the second initial matrix.
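As an illustration only (same emulation assumptions as the earlier sketches: fp16 operands with float32 accumulation, NumPy casts standing in for the hardware conversion), the three-term accumulation performed by the calculation module 2040 can be sketched as:

    import numpy as np

    def half_mma(x16, y16):
        # Emulated half-precision matrix unit: fp16 inputs, fp32 accumulation.
        return x16.astype(np.float32) @ y16.astype(np.float32)

    def first_single_precision_target(a32: np.ndarray, b32: np.ndarray) -> np.ndarray:
        a16 = a32.astype(np.float16)  # first half-precision matrix
        b16 = b32.astype(np.float16)  # second half-precision matrix
        da = (a32 - a16.astype(np.float32)).astype(np.float16)  # first difference matrix
        db = (b32 - b16.astype(np.float32)).astype(np.float16)  # second difference matrix
        # The two difference-matrix products compensate the rounding error that the
        # half-precision conversion discards from the main product.
        return half_mma(a16, b16) + half_mma(a16, db) + half_mma(da, b16)

Note that without the preset multiplication value, the fp16 casts of the difference matrices can underflow for small entries, which is exactly the issue the magnified variant sketched earlier addresses.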
The specific implementation of the matrix multiplication operation apparatus of this embodiment is substantially the same as that of the above processing method for matrix multiplication in parallel computing hardware, and is not repeated here.
An embodiment of the present application further provides an electronic device, including:
at least one memory;
at least one processor;
at least one program;
where the programs are stored in the memory, and the processor executes the at least one program to implement the above processing method for matrix multiplication in parallel computing hardware. The electronic device can be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (PDA), an in-vehicle computer, and the like.
Referring to Figure 21, which illustrates the hardware structure of an electronic device of another embodiment, the electronic device includes:
A processor 2101, which may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided in the embodiments of the present application;
A memory 2102, which may be implemented in the form of a ROM (Read-Only Memory), a static storage device, a dynamic storage device, or a RAM (Random Access Memory). The memory 2102 can store an operating system and other application programs; when the technical solutions provided in the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 2102 and is called by the processor 2101 to execute the processing method for matrix multiplication in parallel computing hardware of the embodiments of the present application;
An input/output interface 2103, configured to implement information input and output;
A communication interface 2104, configured to implement communication interaction between this device and other devices, where communication can be implemented in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, WiFi, Bluetooth);
A bus 2105, which transfers information between the components of the device (e.g., the processor 2101, the memory 2102, the input/output interface 2103, and the communication interface 2104);
The processor 2101, the memory 2102, the input/output interface 2103, and the communication interface 2104 are communicatively connected to one another within the device via the bus 2105.
An embodiment of the present application further provides a storage medium, which is a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the above processing method for matrix multiplication in parallel computing hardware is implemented.
As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described herein are intended to illustrate the technical solutions of the embodiments of the present application more clearly and do not limit the technical solutions provided therein; those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
Those skilled in the art will appreciate that the technical solutions shown in the figures do not limit the embodiments of the present application, which may include more or fewer steps than illustrated, combine certain steps, or use different steps.
The apparatus embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, i.e., they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of a given embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and devices, may be implemented as software, firmware, hardware, or a suitable combination thereof.
The terms "first", "second", "third", "fourth", and the like (if present) in the specification of the present application and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "include" and "have", and any variants thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
It should be understood that in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or multiple.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is merely a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of a given embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, may each exist physically separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including multiple instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store programs.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, but this does not limit the scope of the rights of the embodiments of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of the rights of the embodiments of the present application.