WO2025107800A1 - AI processor, data processing method, and computer device - Google Patents
- Publication number: WO2025107800A1 (application PCT/CN2024/115871)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- matrix
- convolution
- format
- input
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the embodiments of the present application relate to the field of processor technology, and in particular to an AI processor, a data processing method, and a computer device.
- the core calculations in deep learning algorithms mainly include convolution calculations and matrix calculations. Convolution calculations can be equivalent to matrix calculations. Therefore, by accelerating matrix calculations, deep learning algorithms can be accelerated.
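As an illustration of this equivalence, the standard im2col lowering turns a convolution into one matrix product. The sketch below (single channel, stride 1, no padding; function and variable names are illustrative, not from the patent) uses the cross-correlation convention common in deep learning:

```python
import numpy as np

def im2col_conv(x, w):
    """Compute a 2D convolution by lowering it to a matrix multiplication.

    x: input feature map of shape (H, W)   (single channel, stride 1, no padding)
    w: convolution kernel of shape (Kh, Kw)
    Returns the (H-Kh+1, W-Kw+1) output map.
    """
    H, W = x.shape
    Kh, Kw = w.shape
    Ho, Wo = H - Kh + 1, W - Kw + 1
    # im2col: each output position contributes one row of a patch matrix.
    patches = np.empty((Ho * Wo, Kh * Kw))
    for i in range(Ho):
        for j in range(Wo):
            patches[i * Wo + j] = x[i:i + Kh, j:j + Kw].ravel()
    # The convolution is now a single matrix-vector product.
    return (patches @ w.ravel()).reshape(Ho, Wo)
```

Accelerating the matrix product on the right-hand side therefore accelerates the convolution as well.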
- a matrix acceleration engine is provided to increase the speed of matrix calculation, and input data is moved based on the array dimension of the systolic array in the matrix acceleration engine. Since the data dimension represented by the data format of the input data often does not match the array dimension of the systolic array, moving the data over the bus directly according to the systolic array dimension often leads to low bus bandwidth utilization.
- the embodiments of the present application provide an AI processor, a data processing method and a computer device.
- the technical solution is as follows:
- an embodiment of the present application provides an AI processor, the AI processor comprising: a matrix operation engine, a data transfer engine and a first memory, the matrix operation engine and the data transfer engine being connected via a bus;
- the first memory is used to store first operation data and second operation data, the first operation data and the second operation data are in a target data format, and the data dimension represented by the target data format matches the array dimension of the systolic array in the matrix operation unit in the matrix operation engine;
- the data transfer engine is configured to read the first operator data and the second operator data from the first memory based on a bus bit width, and transfer the first operator data and the second operator data to a second memory inside the matrix operation engine through the bus, wherein the bus bit width is greater than a data bit width corresponding to the array dimension;
- the matrix operation engine is used to perform a matrix operation on the first operator data and the second operator data in the second memory through a matrix operation unit to obtain a sub-operation result, wherein the data dimension of the sub-operation result matches the array dimension;
- the matrix operation engine is further used to accumulate the sub-operation results through an accumulator to obtain a matrix operation result of the first operation data and the second operation data.
- an embodiment of the present application provides a data processing method, which is used for an AI processor, wherein the AI processor includes a matrix operation engine, a data transfer engine, and a first memory, and the matrix operation engine is connected to the data transfer engine via a bus; the method includes:
- first operation data and second operation data are stored in the first memory, wherein the first operation data and the second operation data are in a target data format, and the data dimension represented by the target data format matches the array dimension of the systolic array in the matrix operation unit in the matrix operation engine;
- based on a bus bit width, first operator data and second operator data are read from the first memory through the data transfer engine, and the first operator data and the second operator data are transferred to a second memory inside the matrix operation engine through the bus, wherein the bus bit width is greater than a data bit width corresponding to the array dimension;
- a matrix operation is performed on the first operator data and the second operator data in the second memory through the matrix operation unit to obtain sub-operation results, wherein the data dimension of the sub-operation results matches the array dimension;
- the sub-operation results are accumulated by an accumulator in the matrix operation engine to obtain a matrix operation result of the first operation data and the second operation data.
- an embodiment of the present application provides a computer device, comprising a memory and an AI processor as described in the above aspects, wherein the memory stores at least one instruction, and the at least one instruction is used to be executed by the AI processor.
- the data dimension of the first operation data and the second operation data matches the array dimension of the systolic array in the matrix operation unit in the matrix operation engine. The data transfer engine can therefore read the first operator data and the second operator data from the first memory according to the bus bit width and carry them over the bus to the second memory inside the matrix operation engine. The matrix operation engine then performs a matrix operation on the first operator data and the second operator data through the matrix operation unit to obtain sub-operation results, and accumulates the sub-operation results through the accumulator to obtain the matrix operation result of the first operation data and the second operation data.
- by storing the operation data in the target data format in the first memory, the AI processor provided in the embodiment of the present application can read the operation data according to the bus bit width. Since the bus bit width is greater than the data bit width corresponding to the array dimension, compared with reading data according to the data bit width corresponding to the array dimension, this improves the bandwidth utilization of data handling during matrix operation, thereby improving the efficiency of matrix operation.
- FIG1 shows a schematic diagram of the structure of an AI processor provided by an exemplary embodiment of the present application
- FIG2 shows a schematic diagram of a systolic array operation provided by an exemplary embodiment of the present application
- FIG3 is a schematic diagram showing the logical expression of convolutional network data and the physical storage form of data in different formats in the related art
- FIG4 is a schematic diagram showing a physical storage form of a first target data format provided by an exemplary embodiment of the present application
- FIG5 is a schematic diagram showing data processing by a systolic array based on a general data format in the related art
- FIG6 is a schematic diagram showing data processing by a systolic array based on a target data format provided by an exemplary embodiment of the present application
- FIG7 is a schematic diagram showing data processing by a systolic array based on a general data format in another related art
- FIG8 is a schematic diagram showing data processing by a systolic array based on a target data format provided by another exemplary embodiment of the present application.
- FIG9 is a schematic diagram showing a data format conversion provided by an exemplary embodiment of the present application.
- FIG10 shows a flow chart of a data processing method provided by an exemplary embodiment of the present application.
- FIG. 11 shows a structural block diagram of a computer device provided by an exemplary embodiment of the present application.
- Operation data refers to the data involved in the operation.
- the operation data in the embodiment of the present application is used for convolution operation or matrix multiplication operation.
- the operation data includes the convolution input data and the convolution weight data, which can also be called the convolution kernel data;
- the operation data includes the left matrix input data and the right matrix input data.
- Systolic array: an array in the Matrix Arithmetic Logic Unit (MALU) used to implement matrix multiplication operations.
- the core idea of the systolic array is to divide the matrix into blocks, and then complete the entire matrix multiplication through a series of data movement and local multiplication operations.
- the input of the systolic array includes left input and right input, and the number of rows of the systolic array is represented as MALU_K, and the number of columns of the array is represented as MALU_N.
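The block scheme can be modelled in NumPy: the K and N dimensions are tiled by the array dimensions, local products are computed per tile, and partial results are accumulated. MALU_K and MALU_N below are illustrative values, not dimensions taken from the patent:

```python
import numpy as np

MALU_K, MALU_N = 4, 4  # illustrative systolic-array dimensions (assumed values)

def blocked_matmul(A, B):
    """Software model of the systolic-array scheme: block the K and N
    dimensions, compute a local product per block, and accumulate the
    partial results into the output matrix."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and K % MALU_K == 0 and N % MALU_N == 0
    C = np.zeros((M, N))
    for k0 in range(0, K, MALU_K):          # stream K-blocks through the array
        for n0 in range(0, N, MALU_N):      # one column group of the array
            C[:, n0:n0 + MALU_N] += (
                A[:, k0:k0 + MALU_K] @ B[k0:k0 + MALU_K, n0:n0 + MALU_N]
            )
    return C
```

Summing the per-block products over all K-blocks reproduces the full matrix multiplication.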
- Dimension refers to the number of data elements along a given axis, such as the number of rows, columns, height, width, or channels.
- for example, if the data format is HWC and the data is 3×4×5, the dimension of the data in the H (height) dimension is 3, the dimension in the W (width) dimension is 4, and the dimension in the C (channel) dimension is 5.
- Deep learning methods are widely used in various fields such as image processing, video processing, speech processing, content generation, etc., and the core characteristics of deep learning are large amount of calculation and large number of parameters.
- Deep learning networks can be divided into three categories: Convolutional Neural Networks (CNN), Transformers, and Vision Transformers (ViT).
- the core calculations in these three categories of networks are mainly concentrated in convolution calculations and matrix calculations, and convolution calculations can be equivalent to matrix calculations. Therefore, the core of accelerating deep learning algorithms is to accelerate matrix operations, and how to perfectly match the core matrix calculations with other commonly used vector calculations is the key issue in designing AI processors.
- the AI processor can be a processor based on neural network algorithms and acceleration, such as a Neural Processing Unit (NPU).
- the AI processor is integrated into a general deep learning framework, and matrix operations are performed through the matrix operation engine in the AI processor.
- the structure of the matrix operation unit in the matrix operation engine is generally a two-dimensional systolic array. Therefore, in order to use the matrix operation unit to perform operations on the operation data, the related art generally transfers the operation data according to the array dimension of the systolic array in the matrix operation unit. For a bus whose bit width is larger than the data bit width corresponding to the array dimension, this often leaves the bus bandwidth underutilized during data transfer, resulting in low bus bandwidth utilization.
- the first operator data and the second operator data can be read directly from the first memory according to the bus bit width through the data transfer engine, and transferred to the second memory inside the matrix operation engine through the bus, thereby making full use of the bus bit width and optimizing the data transfer process.
- the AI processor 100 mainly includes a matrix operation engine 110, a data transfer engine 120 and a first memory 130, wherein the matrix operation engine 110 and the data transfer engine 120 are connected via a bus.
- the first memory 130 is used to store the first operation data and the second operation data.
- the first operation data and the second operation data are in a target data format.
- the data dimension represented by the target data format matches the array dimension of the systolic array in the matrix operation unit in the matrix operation engine.
- in the related art, the AI processor stores the operation data in the first memory using the dimension expressions specific to the deep learning network, which results in insufficient bandwidth utilization when the data is transferred through the data transfer engine.
- the data format of the convolutional network is NCHW, and the weight format is CoCiKhKw
- the data format of the convolutional network is NHWC
- the weight format is KhKwCiCo.
- N is the number of batches
- C is the number of data channels
- H is the height
- W is the width
- Kh is the height of the convolution kernel
- Kw is the width of the convolution kernel
- Ci is the number of input channels
- Co is the number of output channels.
- the first operation data and the second operation data in the first memory are both stored in the target data format.
- the original data format of the first operation data and the second operation data can be the target data format, or the first operation data and the second operation data in the original data format are format converted and stored in the first memory in the target data format.
- the architecture of the matrix operation engine is a multi-dimensional systolic array consisting of multiple physical multiply-accumulate (MAC) units, which is used to perform a series of matrix operations of convolutional neural networks.
- the data dimension represented by the target data format matches the array dimension of the systolic array in the matrix operation unit in the matrix operation engine, which may mean that the data dimension of each data channel represented by the target data format is equal to the array dimension of the systolic array, or that the data dimension of some data channels among the data channels represented by the target data format is equal to the array dimension of the systolic array.
- the array dimension of the systolic array may be the number of array rows or the number of array columns.
- the data dimension represented by the target data format may be the number of data channels; when the operation data is matrix data, the data dimension represented by the target data format may be the number of matrix rows or the number of matrix columns.
- the number of data channels represented by the target data format is equal to the array dimension of the systolic array, or the number of data rows or columns represented by the target data format is equal to the number of array rows or columns of the systolic array.
- the first operation data is matrix data
- the number of matrix rows and the number of matrix columns of the matrix data are both 128, and the number of array rows of the systolic array is 64.
- the number of matrix rows of the first operation data can be converted to 2×64, so that the number of matrix rows of the first operation data is equal to the number of array rows of the systolic array.
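The row folding in this example amounts to a simple reshape; a minimal NumPy sketch (matrix contents are illustrative):

```python
import numpy as np

ARRAY_ROWS = 64  # number of array rows of the systolic array in the example

# 128 matrix rows do not fit the 64-row array directly, so the row
# dimension is folded into 2 blocks of 64 rows each (128 = 2 x 64).
data = np.zeros((128, 128))
blocks = data.reshape(128 // ARRAY_ROWS, ARRAY_ROWS, 128)
# Each of the 2 blocks now matches the array's row dimension exactly.
```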
- the systolic array is in the form of a two-dimensional matrix, which may be a pure two-dimensional systolic structure, a one-dimensional systolic plus one-dimensional broadcast structure, or a mixed systolic-and-broadcast structure.
- the data transfer engine 120 is used to read the first operator data and the second operator data from the first memory 130 based on the bus bit width, and transfer the first operator data and the second operator data to the second memory 111 inside the matrix operation engine 110 through the bus, and the bus bit width is greater than the data bit width corresponding to the array dimension.
- the bus width is an integer multiple of the data width corresponding to the array dimension.
- the bus width is twice the data width corresponding to the array dimension.
- the bus width may not be an integer multiple of the data width corresponding to the array dimension. The following embodiments are only illustrated by taking integer multiples as an example, but are not limited thereto.
- the data transfer engine may be a direct memory access (DMA) controller for transferring data between different memories, including an address bus, a data bus, and a control register.
- the data transfer engine may read the first operator data and the second operator data from the first memory, and transfer the first operator data and the second operator data to the second memory through the bus.
- the matrix operation engine cannot complete the operation in a single time, so it is necessary to divide the operation data into operation sub-data and perform operations on the operation sub-data, that is, to split the operation process of the first operation data and the second operation data into the operation process of multiple operation sub-data. Accordingly, the first operation data and the second operation data in the first memory need to be moved to the second memory in segments.
- the first operator data belongs to the first operation data and is a part of the first operation data
- the second operator data belongs to the second operation data and is a part of the second operation data. That is, the AI processor can realize the segmented transmission of the first operation data and the second operation data from the first memory to the second memory inside the matrix operation engine through the data transfer engine.
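The segmented transmission can be sketched as a chunking loop over the stored data; the function name and the bus width used below are illustrative assumptions:

```python
def segmented_transfer(data: bytes, bus_bytes: int):
    """Yield successive bus-width-sized segments of the operation data,
    modelling the data transfer engine's segmented move from the first
    memory to the second memory inside the matrix operation engine."""
    for offset in range(0, len(data), bus_bytes):
        yield data[offset:offset + bus_bytes]
```

Each yielded segment corresponds to one operator-data transfer over the bus; joining the segments back together recovers the full operation data.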
- when the data dimension represented by the data format adopted by the first operation data and the second operation data does not match the array dimension of the systolic array, in order not to affect the matrix operation process in the matrix operation unit, the AI processor performs data transfer according to the array dimension through the data transfer engine. As a result, the sub-data transferred at adjacent times lacks data continuity, which leads to low bus bandwidth utilization when the bus bit width is larger than the data bit width corresponding to the array dimension.
- the AI processor can read the first operator data and the second operator data directly from the first memory according to the bus width through the data transfer engine, thereby fully utilizing the bus width and ensuring the continuity of data transfer.
- in the related art, the data transfer engine transfers the data in sequence according to the data bit width corresponding to the array dimension of the systolic array and the number of data rows or columns. Since the data bit width of each transfer is smaller than the bus bit width, the bus bit width is not fully utilized.
- the operation data can be transferred in units of the array dimension of the systolic array.
- the data is stored continuously in physical storage in units of the array dimension.
- the data transfer engine can directly transfer the data according to the bus width, that is, the data width of the transferred data is equal to the bus width, thereby making full use of the bus width.
- the bus bit width is twice the data bit width corresponding to the array dimension
- the operation data is moved according to the data bit width corresponding to the array dimension, and the bandwidth utilization rate is only 50%; after the operation data is stored in the target data format, the operation data can be stored continuously in the physical storage in units of the array dimension of the systolic array, so that the operation data can be moved according to the bus bit width, so that the bandwidth utilization rate reaches 100%.
- the bus width is 2048 bits
- the array dimension of the systolic array is 64
- the data in each dimension are all fp16 (i.e., all are 16-bit floating point numbers)
- the data width corresponding to the array dimension of the systolic array is 1024 bits, i.e., the bus width (2048 bits) is greater than the data width (1024 bits) corresponding to the array dimension of the systolic array.
- the operation data is transferred according to the data width corresponding to the array dimension, and only 1024 bits of bus width can be occupied each time, resulting in a bandwidth utilization rate of only 50%; when the target data format is used to store the operation data, the operation sub-data with a data width of 2048 bits equal to the bus width can be read from the first memory according to the bus width.
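The arithmetic of this example can be checked directly (the values are taken from the example above):

```python
BUS_BITS = 2048     # bus bit width
ARRAY_DIM = 64      # array dimension (rows) of the systolic array
ELEM_BITS = 16      # fp16 element width

# Moving one array-dimension's worth of data per transfer:
transfer_bits = ARRAY_DIM * ELEM_BITS    # 64 * 16 = 1024 bits per move
utilization = transfer_bits / BUS_BITS   # 1024 / 2048 = 50%

# With the target data format, each read fills the full bus width:
utilization_target = BUS_BITS / BUS_BITS  # 100%
```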
- a second memory 111 is provided in the matrix operation engine 110 , and the data transfer engine 120 transfers the first operator data and the second operator data to the second memory 111 inside the matrix operation engine 110 during data transfer via the bus.
- the data transfer process may be understood as transmitting the first operator data and the second operator data to the second memory inside the matrix operation engine through the bus.
- the matrix operation engine 110 is used to perform matrix operations on the first operator data and the second operator data in the second memory 111 through the matrix operation unit 112 to obtain sub-operation results, and the data dimension of the sub-operation results matches the array dimension.
- the matrix operation engine also includes a matrix operation unit. After storing the first operator data and the second operator data in the second memory, the matrix operation engine can perform matrix operations on the first operator data and the second operator data through the matrix operation unit using a systolic array to obtain a sub-operation result.
- the data dimension of the sub-operation result matches the array dimension of the systolic array.
- the number of data rows of the sub-operation result is equal to the number of array rows of the systolic array
- the number of data columns of the sub-operation result is equal to the number of array columns of the systolic array.
- the matrix operation engine may use the first operator data as the left input data of the systolic array and the second operator data as the right input data of the systolic array, and perform matrix operation through the matrix operation unit to obtain a sub-operation result.
- the matrix operation unit processes the first operator data and the second operator data through a matrix multiplication operation, so as to obtain a sub-operation result.
- the result data output by the matrix operation unit is in the form of a two-dimensional matrix [Matrix_M, MALU_N]. Taking a calculation based on the systolic array as an example, after determining the left input data 21 and the right input data 22 of the systolic array, the matrix operation unit performs data operations from left to right and from top to bottom based on the properties of the systolic array, thereby outputting a set of data on the MALU_N side.
- the processing elements (PEs) in the systolic array are the smallest units of the array, and are used to implement one-dimensional multiply-add calculations.
- the matrix operation engine 110 is further used to accumulate the sub-operation results through the accumulator 113 to obtain the matrix operation result of the first operation data and the second operation data.
- the matrix operation engine also includes an accumulator (ACC). After performing matrix operations through the matrix operation unit to obtain a large number of sub-operation results, the matrix operation engine can also accumulate the sub-operation results through the accumulator to obtain the matrix operation results of the first operation data and the second operation data.
- the accumulator's accumulation process of sub-operation results refers to temporarily storing the sub-operation results output by the matrix operation unit, and after obtaining all the sub-operation results, performing data splicing on each sub-operation result according to the matrix element position of each sub-operation result in the operation result matrix, thereby obtaining the operation results of the first operation data and the second operation data.
- the first operation data and the second operation data are both 2×2 matrix data
- the matrix operation unit performs matrix operations to obtain: the first sub-operation result corresponding to the first row matrix data of the first operation data and the first column matrix data of the second operation data
- the second sub-operation result corresponding to the first row matrix data of the first operation data and the second column matrix data of the second operation data
- the third sub-operation result corresponding to the second row matrix data of the first operation data and the first column matrix data of the second operation data
- the fourth sub-operation result corresponding to the second row matrix data of the first operation data and the second column matrix data of the second operation data
- the matrix element position of the first sub-operation result in the operation result matrix is the first row and the first column
- the matrix element position of the second sub-operation result in the operation result matrix is the first row and the second column
- the matrix element position of the third sub-operation result in the operation result matrix is the second row and the first column
- the matrix element position of the fourth sub-operation result in the operation result matrix is the second row and the second column
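The splicing described above can be reproduced in NumPy for the 2×2 case (the matrices are illustrative): each sub-operation result is one row-by-column product, placed by the accumulator at the matching element position.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])   # first operation data
B = np.array([[5.0, 6.0], [7.0, 8.0]])   # second operation data

# Each sub-operation result is the product of one row of A and one column
# of B; it is placed at the corresponding (row, column) of the result.
result = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        result[i, j] = A[i, :] @ B[:, j]
```

Assembling the four sub-operation results in this way yields exactly the full matrix product of the two operation data matrices.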
- the AI processor may also use the data transfer engine to transfer the operation results to the first memory through the bus.
- the data dimension of the first operation data and the second operation data matches the array dimension of the systolic array in the matrix operation unit in the matrix operation engine. The data transfer engine can therefore read the first operator data and the second operator data from the first memory according to the bus bit width and carry them over the bus to the second memory inside the matrix operation engine. The matrix operation engine then performs a matrix operation on the first operator data and the second operator data through the matrix operation unit to obtain sub-operation results, and accumulates the sub-operation results through the accumulator to obtain the matrix operation result of the first operation data and the second operation data.
- by storing the operation data in the target data format in the first memory, the AI processor provided in the embodiment of the present application can read the operation data according to the bus bit width. Since the bus bit width is greater than the data bit width corresponding to the array dimension, compared with reading data according to the data bit width corresponding to the array dimension, this improves the bandwidth utilization of data handling during matrix operation, thereby improving the efficiency of matrix operation.
- the deep learning network may include a convolutional network for performing operations on convolutional data; or a matrix network for performing operations on matrix data. Therefore, after the AI processor is integrated into the deep learning network framework, the AI processor needs to process the convolutional data and the matrix data separately so that both the convolutional data and the matrix data conform to the target data format.
- the first operation data may be convolution input data
- the second operation data may be convolution weight data
- the operation result of the first operation data and the second operation data may be convolution output data.
- the convolution input data may be an image or feature map to be convolved
- the convolution weight data may be a convolution kernel used to convolve the image or feature map.
- the size of the convolution input data is generally greater than the size of the convolution weight data.
- the convolution input data is 100 ⁇ 100 matrix data
- the convolution weight data is a 3 ⁇ 3 convolution kernel.
- H represents the height of an image or a feature map
- W represents the width of an image or a feature map
- C represents the feature dimension
- N represents a batch of images or feature maps.
- Figure 3 shows the logical expression of convolutional network data and the physical storage form of data in different formats in the related art.
- the logical expression forms corresponding to NCHW format data, NHWC format data and CHWN format data are the same, that is, the data arrangement in the C, H and W directions is the same. In physical storage, NCHW format data is stored W direction first, then H, then C, and finally N; NHWC format data is stored C direction first, then W, then H, and finally N; CHWN format data is stored N direction first, then W, then H, and finally C.
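The "stored first" direction corresponds to the innermost (fastest-varying) dimension in memory, which is easy to verify with NumPy (tensor values are illustrative):

```python
import numpy as np

# Tiny NCHW tensor: N=1, C=2, H=2, W=2.
nchw = np.arange(8).reshape(1, 2, 2, 2)

# Flattening NCHW walks the W direction first, then H, then C, then N.
# Converting to NHWC makes C the innermost dimension instead:
nhwc = nchw.transpose(0, 2, 3, 1)
flat_nhwc = nhwc.ravel()   # channel values of one pixel are now adjacent
```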
- when the first operation data is convolution input data, in order to fully utilize the bus bandwidth during data transfer of the convolution input data, the convolution input data can be stored in a first target data format, wherein the number of sub-input channels represented by the first target data format is equal to the number of array rows of the systolic array.
- the number of array rows of the systolic array can be expressed as Ck0
- the convolution input data in the first target data format can be expressed as NCi1HiWiCk0
- N represents the number of groups of images or feature maps
- Hi represents the height of the input image or feature map
- Wi represents the width of the input image or feature map
- Ci1 is the value of Ci/Ck0 rounded up
- Ci is the total number of input channels (as shown in FIG4 ).
- Figure 4 shows the physical storage form of the first target data format provided by an exemplary embodiment of the present application.
- the data is arranged in the Ck0 direction first, then the Wi direction, then the Hi direction, then the Ci1 direction, and finally the N direction.
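Under the assumption that NCi1HiWiCk0 is the channel-tiled layout described above (Ck0 innermost, with the channel dimension zero-padded up to a multiple of Ck0), the conversion from NCHW can be sketched as follows; the function name and the Ck0 value are illustrative:

```python
import numpy as np

CK0 = 4  # number of array rows of the systolic array (illustrative value)

def to_first_target_format(x, ck0=CK0):
    """Convert an NCHW tensor to the N,Ci1,Hi,Wi,Ck0 layout, zero-padding
    the channel dimension up to a multiple of ck0 (Ci1 = ceil(Ci / ck0))."""
    n, ci, h, w = x.shape
    ci1 = -(-ci // ck0)                       # ceil division
    padded = np.zeros((n, ci1 * ck0, h, w), dtype=x.dtype)
    padded[:, :ci] = x
    # split C into (Ci1, Ck0) and move Ck0 to the innermost position
    return padded.reshape(n, ci1, ck0, h, w).transpose(0, 1, 3, 4, 2)
```

After this conversion, each innermost Ck0 group of channel values is contiguous in memory, which is what allows the transfer engine to read full bus-width segments.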
- when the second operation data is convolution weight data, in order to fully utilize the bus bandwidth during data transfer of the convolution weight data, the convolution weight data may be stored in a second target data format, wherein the number of sub-input channels represented by the second target data format is equal to the number of array rows of the systolic array, and the number of sub-output channels represented by the second target data format is equal to the number of array columns of the systolic array.
- the number of array rows of the systolic array can be expressed as Ck0
- the number of array columns of the systolic array can be expressed as Cn0
- the convolution weight data in the second target data format can be expressed as Co1Ci1KhKwCk0Cn0, wherein Kh represents the height of the convolution kernel, Kw represents the width of the convolution kernel, Ci1 is the value of Ci/Ck0 rounded up, Co1 represents the value of Co/Cn0 rounded up, Ci is the total number of input channels, and Co is the total number of output channels.
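The weight layout can be sketched the same way. Again an illustrative model with a hypothetical function name, assuming zero padding on both channel axes and weights supplied in CoCiKhKw order:

```python
import numpy as np

def to_second_target_format(w, ck0, cn0):
    """Reorder CoCiKhKw convolution weights into Co1Ci1KhKwCk0Cn0.

    Illustrative sketch: Ci is padded up to a multiple of Ck0 (array rows)
    and Co up to a multiple of Cn0 (array columns), so that each trailing
    (Ck0, Cn0) tile maps onto one systolic-array pass.
    """
    co, ci, kh, kw = w.shape
    co1, ci1 = -(-co // cn0), -(-ci // ck0)          # ceil divisions
    w = np.pad(w, ((0, co1 * cn0 - co), (0, ci1 * ck0 - ci), (0, 0), (0, 0)))
    # Co1, Cn0, Ci1, Ck0, Kh, Kw  ->  Co1, Ci1, Kh, Kw, Ck0, Cn0
    return w.reshape(co1, cn0, ci1, ck0, kh, kw).transpose(0, 2, 4, 5, 3, 1)

w = np.ones((6, 5, 3, 3), dtype=np.float32)
print(to_second_target_format(w, ck0=4, cn0=4).shape)  # (2, 2, 3, 3, 4, 4)
```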
- convolution output data in a third target data format can be obtained by performing matrix operation through a matrix operation unit according to the convolution input data in the first target data format and the convolution weight data in the second target data format, wherein the number of sub-output channels represented by the third target data format is equal to the number of array columns of the systolic array.
- the AI processor can use the data transfer engine to transfer the convolution input sub-data and the convolution weight sub-data according to the bus bit width, and store them in the second memory inside the matrix operation engine.
- the AI processor uses a data transfer engine to sequentially read the first convolution input sub-data and the second convolution input sub-data from the first memory according to the bus bit width, wherein the first convolution input sub-data and the second convolution input sub-data are continuously stored data, and the data bit width of the first convolution input sub-data is equal to the bus bit width, and the data bit width of the second convolution input sub-data is equal to the bus bit width.
- the AI processor uses the data transfer engine to take six data values 000, 020, 040, 001, 021 and 041 as the first convolution input sub-data according to the bus bit width, and transfers the six data values at the same time; when continuing to transfer data according to the number of array rows of the systolic array, it can continue to take 002, 022, 042, 003, 023 and 043 as the second convolution input sub-data.
- the first convolution input sub-data and the second convolution input sub-data are physically stored in the same memory.
- the stored data is arranged as continuous data.
- FIG. 5 shows a schematic diagram of data processing through a systolic array based on a general data format in the related art.
- the AI processor uses a data handling engine to carry out data handling of the convolution input data
- the AI processor needs to read data from the convolution input data in discrete segments and carry out data handling through the bus in sequence, resulting in a problem of low bandwidth utilization when the bus bandwidth is greater than MALU_K.
- the convolution input data is directly read in the Ci direction 503 according to the number of array rows 504 of MALU_K, so that a data block 505 (height Cut_Hi, width Cut_Wi, number of channels MALU_K) can be obtained; each segment of data in the MALU_K direction in the data block 505 is discretely distributed in the physical storage of the convolution input data, so when the bus bandwidth is greater than MALU_K, the data handling process cannot fully utilize the bus bandwidth.
- schematically, FIG. 6 shows a schematic diagram of data processing by a systolic array based on a target data format provided by an exemplary embodiment of the present application.
- when the AI processor takes data values of length MALU_K from the convolution input data, because under the target data format the data is stored continuously along the MALU_K direction and then along the Wi direction, when the bus bandwidth is greater than MALU_K the AI processor can directly read more than MALU_K data values along the MALU_K direction and then along the Wi direction, thereby making full use of the bus bandwidth.
- the embodiment of the present application can directly read the data block 604 (height Cut_Hi, width Cut_Wi, number of channels MALU_K) continuously according to the bus bit width, thereby improving the bus bandwidth utilization.
- the embodiment of the present application can directly read the data block 607 (height is MALU_K and width is MALU_N) continuously according to the bus bit width, thereby improving the bus bandwidth utilization.
- when the total number of input channels of the convolution input data is less than the number of array rows of the systolic array (i.e., Ci/Ck0 is less than 1), the convolution input data may be stored in a fourth target data format, wherein the number of input channels represented by the fourth target data format is the total number of input channels.
- the number of array rows of the systolic array can be represented as Ck0, and the total number of input channels of the convolution input data is Ci, which is less than Ck0.
- the convolution input data can be represented as NHiWiCi, where N represents the number of groups of images or feature maps, Hi represents the height of the input image or feature map, and Wi represents the width of the input image or feature map.
- the number of array rows of the systolic array can be represented as Ck0
- the number of array columns of the systolic array can be represented as Cn0
- Ci is the total number of input channels
- Co is the total number of output channels
- Ci is less than Ck0
- Co is greater than Cn0
- the convolution weight data can be represented as Co1KwCiKhCn0, where Kh represents the convolution kernel height, Kw represents the convolution kernel width, and Co1 represents the value of Co/Cn0 rounded up.
- when the total number of output channels of the convolution output data is less than the number of array columns of the systolic array (i.e., Co/Cn0 is less than 1), the convolution output data may be stored in a sixth target data format, wherein the number of output channels represented by the sixth target data format is the total number of output channels.
- the number of array columns of the systolic array may be represented as Cn0, and the total number of output channels of the convolution output data may be Co, where Co is less than Cn0.
- the convolution output data may be represented as NHoWoCo, where N represents the number of groups of images or feature maps, Ho represents the height of the output image or feature map, and Wo represents the width of the output image or feature map.
- the convolution weight data may be stored in a seventh target data format, wherein the number of sub-input channels represented by the seventh target data format is equal to the number of array rows of the systolic array, and the number of output channels represented by the seventh target data format is the total number of output channels.
- when the sixth target data format and the seventh target data format are used to store the convolution output data and the convolution weight data, additional data padding of the convolution weight data can be avoided during the operation.
- in this way, the target data formats corresponding to the convolution input data, the convolution weight data and the convolution output data are determined, so that different target data formats can be selected according to different data situations.
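The case analysis above can be summarized as a small selector. This is an illustrative sketch only; the function name and the returned labels are hypothetical shorthand for the first through seventh target data formats described above:

```python
def pick_conv_formats(ci, co, ck0, cn0):
    """Select target formats per the rules above (illustrative sketch).

    ci/co are the total input/output channel counts; ck0/cn0 are the
    systolic array's row/column counts.  Returns labels
    (input_format, weight_format, output_format).
    """
    in_fmt = "fourth (NHiWiCi)" if ci < ck0 else "first (NCi1HiWiCk0)"
    out_fmt = "sixth (NHoWoCo)" if co < cn0 else "third"
    if ci < ck0:
        w_fmt = "fifth (Co1KwCiKhCn0)"      # keep the full input channel count
    elif co < cn0:
        w_fmt = "seventh"                   # keep the full output channel count
    else:
        w_fmt = "second (Co1Ci1KhKwCk0Cn0)"
    return in_fmt, w_fmt, out_fmt

print(pick_conv_formats(3, 64, 16, 16))
# ('fourth (NHiWiCi)', 'fifth (Co1KwCiKhCn0)', 'third')
```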
- the first operation data may be left matrix input data
- the second operation data may be right matrix input data
- the operation results of the first operation data and the second operation data are matrix output data.
- the left matrix is represented as MK (M rows and K columns)
- the right matrix is represented as KN (K rows and N columns)
- the result matrix is represented as MN (M rows and N columns).
- batch data of other dimensions will be added on this basis.
- when the first operation data is left matrix input data, the left matrix input data may be stored in an eighth target data format in order to fully utilize the bus bandwidth during data transfer of the left matrix input data, wherein the number of submatrix columns represented by the eighth target data format is equal to the number of array rows of the systolic array.
- the number of array rows of the systolic array is Wk0
- the left matrix input data in the eighth target data format can be expressed as B0B1W1HWk0, wherein B0 and B1 respectively represent the number of batches in two dimensions, H represents the number of matrix rows of the left matrix, W1 is the value of K/Wk0 rounded up, and K is the number of matrix columns of the left matrix.
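The eighth-format layout can also be sketched in NumPy. Illustrative only, with a hypothetical function name, assuming zero padding when K is not a multiple of Wk0 and left-matrix input supplied as (B0, B1, H, K):

```python
import numpy as np

def left_to_eighth_format(a, wk0):
    """Reorder a batched left matrix (B0, B1, H, K) into B0B1W1HWk0.

    Illustrative sketch: the K axis is split into W1 = ceil(K/Wk0) blocks,
    each block of Wk0 values stored contiguously, so a systolic-array row's
    worth of data is adjacent in memory.
    """
    b0, b1, h, k = a.shape
    w1 = -(-k // wk0)                                   # ceil(K / Wk0)
    a = np.pad(a, ((0, 0), (0, 0), (0, 0), (0, w1 * wk0 - k)))
    # B0, B1, H, W1, Wk0  ->  B0, B1, W1, H, Wk0
    return a.reshape(b0, b1, h, w1, wk0).transpose(0, 1, 3, 2, 4)

a = np.zeros((1, 2, 8, 10), dtype=np.float32)
print(left_to_eighth_format(a, wk0=4).shape)  # (1, 2, 3, 8, 4)
```

The right matrix input data in the ninth format follows the same pattern with Wn0 in place of Wk0.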
- when the second operation data is right matrix input data, the right matrix input data can be stored in a ninth target data format in order to fully utilize the bus bandwidth during data transfer of the right matrix input data, wherein the number of submatrix columns represented by the ninth target data format is equal to the number of array columns of the systolic array.
- the number of array columns of the systolic array is Wn0
- the right matrix input data in the ninth target data format can be expressed as B0B1W1HWn0, where B0 and B1 represent the number of batches in two dimensions, respectively.
- H represents the number of matrix rows of the right matrix
- W1 is the value of N/Wn0 rounded up
- N is the number of matrix columns of the right matrix.
- batches in the left matrix input data and the right matrix input data may not be equal, but when they are not equal, one of them must be 1.
- matrix output data using the tenth target data format can be obtained by performing matrix operations through a matrix operation unit based on left matrix input data using the eighth target data format and right matrix input data using the ninth target data format, wherein the number of matrix rows represented by the tenth target data format is equal to the number of matrix rows represented by the eighth target data format, and the number of matrix columns represented by the tenth target data format is equal to the number of matrix columns represented by the ninth target data format.
- the matrix output data can be expressed as B0B1W1HWn0, wherein B0 in the matrix output data is the maximum value of B0 in the left matrix input data and B0 in the right matrix input data, B1 in the matrix output data is the maximum value of B1 in the left matrix input data and B1 in the right matrix input data, W1 and Wn0 are the same as the W1 and Wn0 values of the right matrix input data, and H is the same as the H value of the left matrix input data.
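The shape rule just stated can be written as a small helper. Illustrative only; the tuple layouts are assumed from the format strings B0B1W1HWk0 / B0B1W1HWn0 above, and the function name is hypothetical:

```python
def tenth_format_shape(left_shape, right_shape):
    """Derive the matrix-output shape in the tenth target data format.

    left_shape/right_shape are (B0, B1, W1, H, Wk0) and (B0, B1, W1, H, Wn0)
    tuples in the eighth/ninth formats.  Per the rule above: output batches
    are the element-wise maxima (unequal batches require one side to be 1),
    W1 and Wn0 come from the right matrix, and H comes from the left matrix.
    """
    lb0, lb1, _, lh, _ = left_shape
    rb0, rb1, rw1, _, rwn0 = right_shape
    return (max(lb0, rb0), max(lb1, rb1), rw1, lh, rwn0)

# left: 4 batches of (H=64, K=32); right: one shared (K=32, N=48) matrix
print(tenth_format_shape((4, 1, 2, 64, 16), (1, 1, 3, 32, 16)))
# (4, 1, 3, 64, 16)
```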
- the AI processor can use the data transfer engine to transfer the left matrix input sub-data and the right matrix input sub-data according to the bus bit width, and store them in the second memory inside the matrix operation engine.
- the AI processor uses a data transfer engine to sequentially read the first left matrix input sub-data and the second left matrix input sub-data from the first memory according to the bus bit width, wherein the first left matrix input sub-data and the second left matrix input sub-data are continuously stored data, and the data bit width of the first left matrix input sub-data is equal to the bus bit width, and the data bit width of the second left matrix input sub-data is equal to the bus bit width.
- the AI processor uses a data transfer engine to sequentially read the first right matrix input sub-data and the second right matrix input sub-data from the first memory according to the bus bit width, wherein the first right matrix input sub-data and the second right matrix input sub-data are continuously stored data, and the data bit width of the first right matrix input sub-data is equal to the bus bit width, and the data bit width of the second right matrix input sub-data is equal to the bus bit width.
- FIG. 7 shows a schematic diagram of data processing through a systolic array based on a general data format in the related art.
- the AI processor uses a data handling engine to carry out data handling on the left matrix input data
- the AI processor needs to discretely read data from the left matrix input data in segments along the M direction, and carry out data handling through the bus in sequence, resulting in low bandwidth utilization when the bus bandwidth is greater than MALU_K.
- when the data arrangement mode is the MK format, it includes the M direction 701 and the K direction 702, and during physical storage the data is first stored continuously in the K direction 702 (i.e., first stored in the K direction, then in the M direction).
- the number of array rows of the systolic array is MALU_K and the number of array columns is MALU_N
- data is directly read from the K direction 702 according to the number of array rows MALU_K, and each segment of data in the MALU_K direction in the data block 703 is discretely distributed in the physical storage of the left matrix input data (as illustrated by the data block 703 in FIG. 7).
- FIG. 8 shows a schematic diagram of data processing through a systolic array based on a target data format provided by another exemplary embodiment of the present application.
- when the AI processor obtains data values of length MALU_K from the left matrix input data, because in the target data format the data is stored continuously along the MALU_K direction and then along the H_Left direction, when the bus bandwidth is greater than MALU_K the AI processor can directly read more than MALU_K data values along the MALU_K direction and then along the H_Left direction, thereby making full use of the bus bandwidth.
- the data arrangement mode is the target data format of B0B1W1HWk0, including the H direction 801 (vertical) and the W1 (K/Wk0) direction 802 (horizontal)
- the data is first continuous in the W1 direction 802 during the physical storage process (i.e., first stored along the W1 direction, and then stored along the H direction).
- the embodiment of the present application can directly read the data block 803 continuously according to the bus bit width (although the data block 803 in FIG. 8 spans multiple rows in the H direction, the data in the target data format is stored continuously in the H direction, so the data in the data block 803 is continuous), thereby improving the bus bandwidth utilization.
- the same is true for the right matrix input data in FIG. 8, which will not be repeated here.
- the specific target data format adopted by the left matrix input data, the right matrix input data and the matrix output data is determined according to the matrix column number of the left matrix input data and the right matrix input data, and in combination with the array row number and the array column number of the systolic array. Since the bus width is larger than the data width corresponding to the array dimension, compared with data reading according to the data width corresponding to the array dimension, for matrix data, data reading can be directly performed according to the bus width, so that the data width of the input sub-data is equal to the bus width, thereby optimizing the efficiency of data transfer for matrix data.
- a data format converter may also be provided in the AI processor.
- the data format converter is used to perform unidirectional conversion on the data storage formats of convolution data and matrix data (if mutual conversion is to be achieved, data format converters with different conversion directions need to be set), or bidirectional conversion.
- the data format converter can be used to convert the data format of the convolution output data based on the target data format corresponding to the matrix input data.
- the data format of the convolution output data is NC1HWC0
- the data format of the matrix input data is B0B1W1HW0
- B0 can be set to 1
- B1 can be set to N
- W1 can be set to C1
- H can be set to H×W
- W0 can be set to C0, thereby converting the convolution output data into equivalent matrix input data.
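The mapping just listed is pure shape bookkeeping, which can be sketched as follows. Illustrative only; the function name is hypothetical and only shape tuples are handled, since by construction no data movement is needed when the physical layouts agree:

```python
def conv_out_to_matrix_in(shape_nc1hwc0):
    """Reinterpret convolution output NC1HWC0 as matrix input B0B1W1HW0.

    Per the mapping above: B0 = 1, B1 = N, W1 = C1, H = H*W, W0 = C0.
    """
    n, c1, h, w, c0 = shape_nc1hwc0
    return (1, n, c1, h * w, c0)

print(conv_out_to_matrix_in((2, 4, 7, 7, 16)))  # (1, 2, 4, 49, 16)
```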
- the data format converter can be used to convert the data format of the matrix output data based on the target data format corresponding to the convolution input data.
- the data format of the matrix output data is B0B1W1HW0
- the data format of the convolution input data is NC1HWC0
- N can be set to B0×B1, C1 to W1, H to 1, W to H, and C0 to W0, so that the matrix output data is equivalent to the convolution input data.
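The reverse direction can be sketched in the same way, again as hypothetical shape bookkeeping under the shared physical layout:

```python
def matrix_out_to_conv_in(shape_b0b1w1hw0):
    """Reinterpret matrix output B0B1W1HW0 as convolution input NC1HWC0.

    Per the mapping above: N = B0*B1, C1 = W1, H = 1, W = H, C0 = W0.
    """
    b0, b1, w1, h, w0 = shape_b0b1w1hw0
    return (b0 * b1, w1, 1, h, w0)

print(matrix_out_to_conv_in((1, 2, 4, 49, 16)))  # (2, 4, 1, 49, 16)
```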
- the data equivalence relationship between the convolution output data and the matrix input data, or the data equivalence relationship between the matrix output data and the convolution input data can be determined, that is, the data equivalence between the convolution output data and the matrix input data, or the data equivalence between the matrix output data and the convolution input data can be achieved, thereby improving the data processing efficiency in the AI processor.
- a data format converter can also be provided in the AI processor to convert the data format of the calculation data into the target data format.
- a data format converter is used to read first operation data and second operation data from a first memory, the first operation data and the second operation data are in an original data format, and the data dimension represented by the original data format does not match the array dimension of the systolic array in the matrix operation unit in the matrix operation engine, and then the data format converter converts the data format of the first operation data and the second operation data from the original data format to the target data format by performing data format conversion on the first operation data and the second operation data, and writes the first operation data and the second operation data into the first memory.
- the data format converter first reads the convolution input data in the original data format (NCHW) from the first memory, and converts the convolution input data according to the target data format (NCi1HiWiCk0), thereby storing the converted convolution input data in the target data format back in the first memory.
- the data needs to be converted into the target data format corresponding to its use, based on the purpose of the data.
- the convolution input data is converted into the first target data format
- the convolution weight data is converted into the second target data format
- the left matrix input data is converted into the eighth target data format
- the right matrix input data is converted into the ninth target data format, and so on.
- for convolution data, it is also necessary to determine the target data format used for the convolution input data, convolution weight data and convolution output data based on the relationship between the total number of input channels of the convolution input data and the number of array rows of the systolic array, and based on the relationship between the total number of output channels of the convolution output data and the number of array columns of the systolic array, which will not be elaborated here.
- corresponding format processing can also be provided at the operator layer and the framework layer. For example, at the operator layer, full support for the original data format and the target data format is added; at the framework layer, format information about the original data format and the target data format is provided.
- Math Dim (Math Dimension) is set to correspond to the operation data in the original data format (general format) in the algorithm framework
- Data Dim (Data Dimension) corresponds to the operation data in the target data format (special format) on the AI processor.
- the operation data in the target data format can be directly used for data processing; in the debugging process, when the data is transmitted from the device (Device) to the host (Host), the operation data in the target data format can be converted into the operation data in the original data format according to the format information of the original data format and the target data format (that is, the format conversion is completed during the transmission process), so as to facilitate viewing and analysis.
- FIG. 10 shows a flow chart of a data processing method provided by an exemplary embodiment of the present application.
- the method is used in the AI processor in the above embodiment, and the method includes:
- Step 1001 storing first operation data and second operation data in a first memory, wherein the first operation data and the second operation data are in a target data format, and the data dimension represented by the target data format matches the array dimension of the systolic array in the matrix operation unit in the matrix operation engine.
- Step 1002 based on the bus bit width, read the first operation sub-data and the second operation sub-data from the first memory through the data transfer engine, and transfer the first operation sub-data and the second operation sub-data to the second memory inside the matrix operation engine through the bus, the bus bit width being greater than the data bit width corresponding to the array dimension.
- Step 1003 through the matrix operation unit in the matrix operation engine, perform a matrix operation on the first operation sub-data and the second operation sub-data in the second memory to obtain a sub-operation result, the data dimension of the sub-operation result matching the array dimension.
- Step 1004 accumulating the sub-operation results through an accumulator in the matrix operation engine to obtain a matrix operation result of the first operation data and the second operation data.
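Steps 1001 through 1004 can be modeled functionally with a tiled matrix product. This is a behavioral sketch only, not the hardware implementation: each tile pair stands for the operation sub-data carried to the second memory, one `@` stands for one systolic-array pass, and the running sum stands for the accumulator (real hardware tiles the H and N dimensions as well):

```python
import numpy as np

def tiled_matmul(a, b, malu_k=4):
    """Behavioral model of steps 1001-1004: the (H, K) x (K, N) product is
    computed as ceil(K/MALU_K) sub-operations whose partial results are
    summed by the accumulator."""
    h, k = a.shape
    acc = np.zeros((h, b.shape[1]), dtype=a.dtype)   # accumulator (step 1004)
    for k0 in range(0, k, malu_k):                   # one sub-operation per tile
        a_tile = a[:, k0:k0 + malu_k]                # first operation sub-data
        b_tile = b[k0:k0 + malu_k, :]                # second operation sub-data
        acc += a_tile @ b_tile                       # one array pass (step 1003)
    return acc

a = np.arange(12, dtype=np.float64).reshape(3, 4)
b = np.arange(8, dtype=np.float64).reshape(4, 2)
assert np.allclose(tiled_matmul(a, b, malu_k=2), a @ b)
```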
- the first operation data is the convolution input data
- the second operation data is the convolution weight data
- the operation result of the first operation data and the second operation data is the convolution output data
- the first operation data is the left matrix input data
- the second operation data is the right matrix input data
- the operation result of the first operation data and the second operation data is the matrix output data
- the convolution input data adopts a first target data format, and the number of sub-input channels represented by the first target data format is equal to the number of array rows of the systolic array;
- the convolution weight data adopts a second target data format, and the number of sub-input channels represented by the second target data format is equal to the number of array rows of the systolic array, and the number of sub-output channels represented by the second target data format is equal to the number of array columns of the systolic array;
- the convolution output data adopts a third target data format, and the number of sub-output channels represented by the third target data format is equal to the number of array columns of the systolic array.
- when the total number of input channels of the convolution input data is less than the number of array rows of the systolic array, the convolution input data adopts a fourth target data format, and the number of input channels represented by the fourth target data format is the total number of input channels; the convolution weight data adopts a fifth target data format, and the number of input channels represented by the fifth target data format is the total number of input channels, and the number of sub-output channels represented by the fifth target data format is equal to the number of array columns of the systolic array.
- when the total number of output channels of the convolution output data is less than the number of array columns of the systolic array, the convolution output data adopts a sixth target data format, and the number of output channels represented by the sixth target data format is the total number of output channels; the convolution weight data adopts a seventh target data format, and the number of sub-input channels represented by the seventh target data format is equal to the number of array rows of the systolic array, and the number of output channels represented by the seventh target data format is the total number of output channels.
- the first convolution input sub-data and the second convolution input sub-data are read from the first memory in sequence, the first convolution input sub-data and the second convolution input sub-data are continuously stored data, and the data bit width of the first convolution input sub-data and the data bit width of the second convolution input sub-data are both equal to the bus bit width; based on the bus bit width, the first convolution weight sub-data and the second convolution weight sub-data are read from the first memory in sequence, the first convolution weight sub-data and the second convolution weight sub-data are continuously stored data, and the data bit width of the first convolution weight sub-data and the data bit width of the second convolution weight sub-data are both equal to the bus bit width.
- the left matrix input data adopts an eighth target data format
- the number of submatrix columns represented by the eighth target data format is equal to the number of array rows of the systolic array
- the right matrix input data adopts a ninth target data format
- the number of submatrix columns represented by the ninth target data format is equal to the number of array columns of the systolic array
- the matrix output data adopts a tenth target data format, and the number of matrix rows represented by the tenth target data format is equal to the number of matrix rows represented by the eighth target data format, and the number of matrix columns represented by the tenth target data format is equal to the number of matrix columns represented by the ninth target data format.
- the first left matrix input sub-data and the second left matrix input sub-data are read from the first memory in sequence, the first left matrix input sub-data and the second left matrix input sub-data are continuously stored data, and the data bit width of the first left matrix input sub-data and the data bit width of the second left matrix input sub-data are both equal to the bus bit width; based on the bus bit width, the first right matrix input sub-data and the second right matrix input sub-data are read from the first memory in sequence, the first right matrix input sub-data and the second right matrix input sub-data are continuously stored data, and the data bit width of the first right matrix input sub-data and the data bit width of the second right matrix input sub-data are both equal to the bus bit width.
- the AI processor further includes a data format converter
- the method also includes: when the operation results of the first operation data and the second operation data are convolution output data, and the convolution output data is used for subsequent matrix calculation, converting the data format of the convolution output data based on the target data format corresponding to the matrix input data through a data format converter; when the operation results of the first operation data and the second operation data are matrix output data, and the matrix output data is used for subsequent convolution calculation, converting the data format of the matrix output data based on the target data format corresponding to the convolution input data through a data format converter.
- the AI processor further includes a data format converter
- the method further includes: reading first operation data and second operation data from a first memory through a data format converter, the first operation data and the second operation data being in an original data format, the data dimension represented by the original data format not matching the array dimension of a systolic array in a matrix operation unit in a matrix operation engine;
- the data formats of the first operation data and the second operation data are converted by a data format converter, so that the data formats of the first operation data and the second operation data are converted from the original data format to the target data format, and the first operation data and the second operation data are written into the first memory.
- the data dimension of the first operation data and the second operation data matches the array dimension of the systolic array in the matrix operation unit in the matrix operation engine, so that the data handling engine can read the first operation sub-data and the second operation sub-data from the first memory according to the bus bit width, and carry the first operation sub-data and the second operation sub-data to the second memory inside the matrix operation engine through the bus; the matrix operation engine then performs a matrix operation on the first operation sub-data and the second operation sub-data through the matrix operation unit to obtain a sub-operation result, and accumulates the sub-operation results through the accumulator to obtain the matrix operation result of the first operation data and the second operation data.
- by storing the operation data in the target data format in the first memory, the AI processor provided in the embodiment of the present application can read the operation data according to the bus bit width. Since the bus bit width is greater than the data bit width corresponding to the array dimension, compared with reading data according to the data bit width corresponding to the array dimension, this improves the bandwidth utilization of data handling during matrix operation, thereby improving the efficiency of matrix operation.
- FIG. 11 shows a block diagram of a computer device 1100 provided by an exemplary embodiment of the present application.
- the computer device 1100 may be a portable mobile terminal, such as a smart phone, a tablet computer, a Moving Picture Experts Group Audio Layer III (MP3) player, or a Moving Picture Experts Group Audio Layer IV (MP4) player.
- the computer device 1100 may also be called a user device, a portable terminal, a workstation, a server, or other names.
- the computer device 1100 includes: an AI processor 1101 and a memory 1102 .
- the AI processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc.
- the AI processor 1101 may be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA).
- the AI processor 1101 may also include a main processor and a coprocessor.
- the main processor is a processor for processing data in the awake state, also known as a central processing unit (CPU); the coprocessor is a low-power processor for processing data in the standby state.
- the AI processor 1101 may be integrated with a graphics processing unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen.
- the AI processor 1101 may also be used to process computing operations related to machine learning.
- the memory 1102 may include one or more computer-readable storage media, which may be tangible and non-transitory.
- the memory 1102 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices, flash memory storage devices.
- the computer device 1100 may also optionally include a peripheral device interface 1103 and at least one peripheral device.
- the structure shown in FIG. 11 does not limit the computer device 1100, which may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.
Description
This application claims priority to Chinese Patent Application No. 202311579613.0, filed on November 22, 2023 and entitled "AI Processor, Data Processing Method, and Computer Device", the entire contents of which are incorporated herein by reference.
The embodiments of the present application relate to the field of processor technology, and in particular to an AI processor, a data processing method, and a computer device.
The core computations in deep learning algorithms fall into two main categories, convolution computation and matrix computation, and a convolution computation can be reformulated as an equivalent matrix computation. Deep learning algorithms can therefore be accelerated by accelerating matrix computation.
In the related art, a matrix acceleration engine is provided to speed up matrix computation, and input data is transferred according to the array dimension of the systolic array in the matrix acceleration engine. Because the data dimension represented by the data formats of different input data often does not match the array dimension of the systolic array, transferring data over the bus strictly according to the systolic array dimension frequently results in low bus bandwidth utilization.
Summary of the Invention
The embodiments of the present application provide an AI processor, a data processing method, and a computer device. The technical solution is as follows:
In one aspect, an embodiment of the present application provides an AI processor, comprising a matrix operation engine, a data transfer engine, and a first memory, the matrix operation engine and the data transfer engine being connected via a bus.
The first memory is configured to store first operation data and second operation data, the first operation data and the second operation data being in a target data format, wherein the data dimension represented by the target data format matches the array dimension of the systolic array in the matrix operation unit of the matrix operation engine.
The data transfer engine is configured to read first operator data and second operator data from the first memory based on a bus bit width, and to transfer the first operator data and the second operator data over the bus to a second memory inside the matrix operation engine, the bus bit width being greater than the data bit width corresponding to the array dimension.
The matrix operation engine is configured to perform, through a matrix operation unit, a matrix operation on the first operator data and the second operator data in the second memory to obtain a sub-operation result, the data dimension of the sub-operation result matching the array dimension.
The matrix operation engine is further configured to accumulate the sub-operation results through an accumulator to obtain a matrix operation result of the first operation data and the second operation data.
In another aspect, an embodiment of the present application provides a data processing method for an AI processor, the AI processor comprising a matrix operation engine, a data transfer engine, and a first memory, the matrix operation engine and the data transfer engine being connected via a bus; the method comprises:
storing first operation data and second operation data in the first memory, the first operation data and the second operation data being in a target data format, wherein the data dimension represented by the target data format matches the array dimension of the systolic array in the matrix operation unit of the matrix operation engine;
reading, by the data transfer engine, first operator data and second operator data from the first memory based on a bus bit width, and transferring the first operator data and the second operator data over the bus to a second memory inside the matrix operation engine, the bus bit width being greater than the data bit width corresponding to the array dimension;
performing, by a matrix operation unit in the matrix operation engine, a matrix operation on the first operator data and the second operator data in the second memory to obtain a sub-operation result, the data dimension of the sub-operation result matching the array dimension;
accumulating the sub-operation results by an accumulator in the matrix operation engine to obtain a matrix operation result of the first operation data and the second operation data.
In another aspect, an embodiment of the present application provides a computer device comprising a memory and the AI processor described in the above aspects, wherein the memory stores at least one instruction to be executed by the AI processor.
In the embodiments of the present application, the first operation data and the second operation data are stored in the first memory in the target data format, so that their data dimension matches the array dimension of the systolic array in the matrix operation unit of the matrix operation engine. The data transfer engine can therefore read the first operator data and the second operator data from the first memory according to the bus bit width and transfer them over the bus to the second memory inside the matrix operation engine. The matrix operation engine then performs a matrix operation on the first operator data and the second operator data through the matrix operation unit to obtain sub-operation results, and accumulates the sub-operation results through the accumulator to obtain the matrix operation result of the first operation data and the second operation data. By storing operation data in the target data format in the first memory, the AI processor provided in the embodiments of the present application can read operation data according to the bus bit width. Since the bus bit width is greater than the data bit width corresponding to the array dimension, compared with reading data according to the data bit width corresponding to the array dimension, this improves the bandwidth utilization of data transfer during matrix operations, thereby improving the efficiency of matrix operations.
FIG. 1 shows a schematic structural diagram of an AI processor provided by an exemplary embodiment of the present application;
FIG. 2 shows a schematic diagram of a systolic array operation provided by an exemplary embodiment of the present application;
FIG. 3 shows a schematic diagram of the logical representation of convolutional network data and the physical storage forms of data in different formats in the related art;
FIG. 4 shows a schematic diagram of the physical storage form of a first target data format provided by an exemplary embodiment of the present application;
FIG. 5 shows a schematic diagram of data processing by a systolic array based on a general data format in the related art;
FIG. 6 shows a schematic diagram of data processing by a systolic array based on a target data format provided by an exemplary embodiment of the present application;
FIG. 7 shows a schematic diagram of data processing by a systolic array based on a general data format in another related art;
FIG. 8 shows a schematic diagram of data processing by a systolic array based on a target data format provided by another exemplary embodiment of the present application;
FIG. 9 shows a schematic diagram of data format conversion provided by an exemplary embodiment of the present application;
FIG. 10 shows a flow chart of a data processing method provided by an exemplary embodiment of the present application;
FIG. 11 shows a structural block diagram of a computer device provided by an exemplary embodiment of the present application.
Exemplary embodiments are described in detail herein, examples of which are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.
It should be understood that "several" as used herein means one or more, and "multiple" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
For ease of understanding, terms used in the embodiments of the present application are explained below.
Operation data: the data involved in an operation. The operation data in the embodiments of the present application is used for a convolution operation or a matrix multiplication operation. In a convolution operation, the operation data includes convolution input data and convolution weight data, the latter also being called convolution kernel data; in a matrix multiplication operation, the operation data includes left-matrix input data and right-matrix input data.
Systolic array: an array in the Matrix Arithmetic Logic Unit (MALU) used to implement matrix multiplication. The core idea of a systolic array is to partition the matrices into blocks and then complete the full matrix multiplication through a series of data movements and local multiply operations. The systolic array takes a left input and a right input; its number of array rows is denoted MALU_K and its number of array columns MALU_N.
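As a software illustration of the blocking idea described above (not of the hardware itself), the following sketch multiplies two matrices one block at a time and accumulates partial products; the tile sizes MALU_K and MALU_N are made-up values standing in for the array dimensions:

```python
# Illustrative tile sizes standing in for the array dimensions MALU_K
# (rows) and MALU_N (columns); real values are hardware-specific and
# are assumed here purely for demonstration.
MALU_K, MALU_N = 2, 2

def matmul_blocked(a, b):
    """Multiply a (M x K) by b (K x N) one MALU_K x MALU_N block at a
    time, accumulating partial products the way a systolic array does."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, MALU_K):        # blocks of the reduction dimension
        for n0 in range(0, n, MALU_N):    # blocks of the output columns
            for i in range(m):
                for j in range(n0, min(n0 + MALU_N, n)):
                    # local multiply-accumulate over one block
                    for kk in range(k0, min(k0 + MALU_K, k)):
                        out[i][j] += a[i][kk] * b[kk][j]
    return out

a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 2], [1, 1]]
print(matmul_blocked(a, b))  # [[11.0, 12.0], [27.0, 28.0]]
```

The final result is independent of the block traversal order, which is what lets the hardware stream blocks through the array and accumulate partial results.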
Dimension: the number of data elements along a given axis, such as the number of rows, columns, height, width, or channels. For example, with the HWC data format, 3×4×5 data has a dimension of 3 along H (height), 4 along W (width), and 5 along C (channel).
Deep learning methods are widely used in fields such as image processing, video processing, speech processing, and content generation, and their core characteristics are a large amount of computation and a large number of parameters. Deep learning networks fall into three main categories: Convolutional Neural Networks (CNN), Transformers, and Vision Transformers (ViT). The core computations in these three categories are concentrated in convolution computation and matrix computation, and convolution computation can be reformulated as an equivalent matrix computation. The core of accelerating deep learning algorithms is therefore accelerating matrix operations, and a key issue in designing an AI processor is fitting these core matrix computations together with the other commonly used vector computations.
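The convolution-to-matrix equivalence mentioned above is commonly realized by an im2col-style lowering. A minimal single-channel sketch (stride 1, no padding, with made-up sizes; real frameworks handle channels, strides, and padding as well):

```python
def im2col(x, kh, kw):
    """Unroll every kh x kw patch of the 2-D input x into one row, so
    that convolution becomes a matrix-vector multiplication."""
    h, w = len(x), len(x[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([x[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

def conv2d_as_matmul(x, kernel):
    """Single-channel convolution (stride 1, no padding) expressed as a
    matrix product between the im2col matrix and the flattened kernel."""
    kh, kw = len(kernel), len(kernel[0])
    flat_k = [v for row in kernel for v in row]
    out_w = len(x[0]) - kw + 1
    flat = [sum(p * q for p, q in zip(patch, flat_k))
            for patch in im2col(x, kh, kw)]
    return [flat[r:r + out_w] for r in range(0, len(flat), out_w)]

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
k = [[1, 0], [0, 1]]  # picks top-left + bottom-right of each 2x2 patch
print(conv2d_as_matmul(x, k))  # [[6, 8], [12, 14]]
```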
Optionally, the AI processor may be a processor based on, and accelerating, neural network algorithms, such as a neural network processing unit (NPU).
In the related art, the AI processor is integrated into a general deep learning framework, and matrix operations are performed by the matrix operation engine in the AI processor. The matrix operation unit in the matrix operation engine generally takes the form of a two-dimensional systolic array, so in order to let the matrix operation unit operate on the operation data, the related art generally transfers the operation data according to the array dimension of the systolic array in the matrix operation unit. For a bus whose bit width is greater than the data bit width corresponding to the array dimension, this frequently underutilizes the bus during data transfer, resulting in low bus bandwidth utilization.
In the embodiments of the present application, the first operation data and the second operation data are stored in the first memory in a target data format (whose represented data dimension matches the array dimension of the systolic array in the matrix operation unit). The data transfer engine can therefore read the first operator data and the second operator data from the first memory directly according to the bus bit width, and transfer them over the bus to the second memory inside the matrix operation engine, fully utilizing the bus bit width and optimizing the data transfer process.
Please refer to FIG. 1, which shows a schematic structural diagram of an AI processor provided by an exemplary embodiment of the present application. The AI processor 100 mainly includes a matrix operation engine 110, a data transfer engine 120, and a first memory 130, where the matrix operation engine 110 and the data transfer engine 120 are connected via a bus.
The first memory 130 is configured to store first operation data and second operation data, the first operation data and the second operation data being in a target data format; the data dimension represented by the target data format matches the array dimension of the systolic array in the matrix operation unit of the matrix operation engine.
In the related art, the AI processor stores, in the first memory, operation data in the dimensional representations peculiar to deep learning networks, which leads to insufficient bandwidth utilization when the data is transferred by the data transfer engine. For example, in PyTorch, the data format of convolutional networks is NCHW and the weight format is CoCiKhKw; in TensorFlow, the data format of convolutional networks is NHWC and the weight format is KhKwCiCo. Here N is the batch size, C the number of data channels, H the height, W the width, Kh the convolution kernel height, Kw the convolution kernel width, Ci the number of input channels, and Co the number of output channels.
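To make the difference between these layouts concrete, the sketch below linearizes the same tiny tensor (made-up dimensions, coordinates used as values so the ordering is visible) in NCHW versus NHWC order; note that NHWC keeps all channels of one pixel contiguous in memory:

```python
def flatten_nchw(n, c, h, w, value):
    """Linearize a 4-D tensor in NCHW order: W varies fastest, then H,
    then C, then N."""
    return [value(b, ch, y, x)
            for b in range(n) for ch in range(c)
            for y in range(h) for x in range(w)]

def flatten_nhwc(n, c, h, w, value):
    """Linearize the same tensor in NHWC order: C varies fastest, so
    all channels of one pixel sit next to each other in memory."""
    return [value(b, ch, y, x)
            for b in range(n) for y in range(h)
            for x in range(w) for ch in range(c)]

# value(b, ch, y, x) records the logical coordinates of each element
val = lambda b, ch, y, x: (b, ch, y, x)
print(flatten_nchw(1, 2, 1, 2, val))  # channel-major ordering
print(flatten_nhwc(1, 2, 1, 2, val))  # channel-minor ordering
```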
In the embodiments of the present application, by contrast, both the first operation data and the second operation data in the first memory are stored in the target data format. The original data format of the first operation data and the second operation data may itself be the target data format; alternatively, first operation data and second operation data in an original data format are format-converted and then stored in the first memory in the target data format.
Optionally, the architecture of the matrix operation engine is a multi-dimensional systolic array built from a physical matrix of multiply-accumulate (MAC) units, used to perform the series of matrix operations of a convolutional neural network.
Optionally, the data dimension represented by the target data format matching the array dimension of the systolic array in the matrix operation unit of the matrix operation engine may mean that the data dimension of every data channel represented by the target data format equals the array dimension of the systolic array, or that the data dimension of some of the data channels represented by the target data format equals the array dimension of the systolic array.
Optionally, the array dimension of the systolic array may be the number of array rows or the number of array columns. When the operation data is convolution-type data, the data dimension represented by the target data format may be the number of data channels; when the operation data is matrix-type data, the data dimension represented by the target data format may be the number of matrix rows or matrix columns.
For example, the number of data channels represented by the target data format equals the array dimension of the systolic array, or the number of data rows or columns represented by the target data format equals the number of array rows or columns of the systolic array.
Illustratively, if the first operation data is matrix-type data with 128 matrix rows and 128 matrix columns while the systolic array has 64 array rows, then with the target data format the matrix rows of the first operation data can be reorganized as 2×64, so that the row dimension of the first operation data equals the number of array rows of the systolic array.
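The 128 → 2×64 reorganization above amounts to splitting the row axis into tiles of the array dimension. A small sketch with assumed toy sizes (4 rows split into tiles of 2):

```python
def tile_rows(matrix, tile):
    """Split a matrix's row axis into groups of `tile` rows, so that
    each group matches the systolic array's row count."""
    assert len(matrix) % tile == 0, "row count must divide evenly"
    return [matrix[i:i + tile] for i in range(0, len(matrix), tile)]

# A toy 4-row matrix and a 2-row "array": 4 rows become 2 tiles of 2,
# just as 128 rows would become 2 tiles of 64.
m = [[r * 10 + c for c in range(3)] for r in range(4)]
tiles = tile_rows(m, 2)
print(len(tiles), len(tiles[0]))  # 2 2
```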
Optionally, the systolic array takes a two-dimensional matrix form; it may be two-dimensionally systolic, one-dimensionally systolic plus one-dimensional broadcast, or a mixed two-dimensional systolic/broadcast structure.
The data transfer engine 120 is configured to read first operator data and second operator data from the first memory 130 based on the bus bit width, and to transfer the first operator data and the second operator data over the bus to the second memory 111 inside the matrix operation engine 110, the bus bit width being greater than the data bit width corresponding to the array dimension.
In some embodiments, the bus bit width is an integer multiple of the data bit width corresponding to the array dimension; for example, twice that data bit width. Of course, the bus bit width need not be an integer multiple of the data bit width corresponding to the array dimension; the following embodiments use an integer multiple merely as an illustrative example, which is not limiting.
Optionally, the data transfer engine may be a direct memory access (DMA) controller for transferring data between different memories, comprising an address bus, a data bus, and control registers. The data transfer engine can thus read the first operator data and the second operator data from the first memory and transfer them over the bus to the second memory.
Because the amount of operation data is large, the matrix operation engine cannot complete the operation in a single pass. The operation data therefore needs to be divided into operator data on which the operations are performed; that is, the operation on the first operation data and the second operation data is split into operations on multiple pieces of operator data. Accordingly, the first operation data and the second operation data in the first memory need to be transferred to the second memory in segments.
Optionally, the first operator data belongs to the first operation data and is a part of it, and the second operator data belongs to the second operation data and is a part of it; that is, through the data transfer engine, the AI processor can transfer the first operation data and the second operation data from the first memory to the second memory inside the matrix operation engine in segments.
In the related art, when the data dimension represented by the data format of the first operation data and the second operation data does not match the array dimension of the systolic array, in order not to disturb the matrix operation inside the matrix operation unit, the AI processor transfers data through the data transfer engine according to the array dimension. The sub-data transferred in adjacent time slots then lacks data continuity, so when the bus bit width is greater than the data bit width corresponding to the array dimension, bus bandwidth utilization is low.
In the embodiments of the present application, after the first operation data and the second operation data are stored in the target data format, the AI processor can, through the data transfer engine, read the first operator data and the second operator data from the first memory directly according to the bus bit width, fully utilizing the bus bit width while preserving the continuity of data transfer.
Illustratively, when the operation data is not in the target data format and the bus bit width is greater than the data bit width corresponding to the array dimension, in order for the systolic array of the matrix operation unit to operate directly on the operator data delivered by the data transfer engine, the data transfer engine transfers data row by row or column by column according to the data bit width corresponding to the array dimension of the systolic array. Since the data bit width of each transfer is smaller than the bus bit width, the bus bit width is never fully utilized. With the target data format, the operation data can be stored contiguously in physical storage in units of the array dimension of the systolic array, and the data transfer engine can transfer data directly according to the bus bit width, i.e., the data bit width of each transfer equals the bus bit width, fully utilizing the bus.
In an illustrative example, where the bus bit width is twice the data bit width corresponding to the array dimension, if the operation data is not stored in the target data format and is transferred according to the data bit width corresponding to the array dimension, bandwidth utilization is only 50%; once the operation data is stored in the target data format, it can be stored contiguously in physical storage in units of the array dimension of the systolic array and transferred according to the bus bit width, bringing bandwidth utilization to 100%.
For example, with a bus bit width of 2048 bits, a systolic array dimension of 64, and fp16 data (16-bit floating point) per element, the data bit width corresponding to the array dimension is 1024 bits, i.e., the bus bit width (2048 bits) is greater than the data bit width corresponding to the array dimension (1024 bits). Without the target data format, transferring operation data according to the data bit width corresponding to the array dimension occupies only 1024 bits of bus width per transfer, for a bandwidth utilization of only 50%; with the target data format, operator data with a data bit width of 2048 bits, equal to the bus bit width, can be read from the first memory per transfer.
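The bandwidth arithmetic in this example can be checked directly (all figures are taken from the example above):

```python
def utilization(bus_bits, elems_per_transfer, elem_bits):
    """Fraction of the bus filled when each transfer carries the given
    number of elements of the given bit width."""
    transfer_bits = elems_per_transfer * elem_bits
    return min(transfer_bits, bus_bits) / bus_bits

# One 64-wide array dimension of fp16 elements on a 2048-bit bus:
# 64 * 16 = 1024 bits per transfer, so half the bus sits idle.
print(utilization(2048, 64, 16))       # 0.5
# With transfers sized to the bus (128 fp16 elements = 2048 bits):
print(utilization(2048, 2 * 64, 16))   # 1.0
```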
在一些实施例中,矩阵运算引擎110中设置有第二存储器111,数据搬运引擎120在通过总线进行数据搬运的过程中,将第一运算子数据和第二运算子数据搬运至矩阵运算引擎110内部的第二存储器111中。In some embodiments, a second memory 111 is provided in the matrix operation engine 110 , and the data transfer engine 120 transfers the first operator data and the second operator data to the second memory 111 inside the matrix operation engine 110 during data transfer via the bus.
可选的,数据搬运的过程即可以理解为将第一运算子数据和第二运算子数据通过总线传输至矩阵运算引擎内部的第二存储器中。Optionally, the data transfer process may be understood as transmitting the first operator data and the second operator data to the second memory inside the matrix operation engine through the bus.
矩阵运算引擎110,用于通过矩阵运算单元112,对第二存储器111中的第一运算子数据和第二运算子数据矩阵运算,得到子运算结果,子运算结果的数据维数与阵列维数匹配。The matrix operation engine 110 is used to perform matrix operations on the first operator data and the second operator data in the second memory 111 through the matrix operation unit 112 to obtain sub-operation results, and the data dimension of the sub-operation results matches the array dimension.
在一些实施例中,矩阵运算引擎中还包括矩阵运算单元,在将第一运算子数据以及第二运算子数据存储至第二存储器后,矩阵运算引擎即可以通过矩阵运算单元,利用脉动阵列对第一运算子数据和第二运算子数据进行矩阵运算,得到子运算结果。In some embodiments, the matrix operation engine also includes a matrix operation unit. After storing the first operator data and the second operator data in the second memory, the matrix operation engine can perform matrix operations on the first operator data and the second operator data through the matrix operation unit using a systolic array to obtain a sub-operation result.
其中,子运算结果的数据维数与脉动阵列的阵列维数匹配。比如,子运算结果的数据行数等于脉动阵列的阵列行数,子运算结果的数据列数等于脉动阵列的阵列列数。The data dimension of the sub-operation result matches the array dimension of the systolic array. For example, the number of data rows of the sub-operation result is equal to the number of array rows of the systolic array, and the number of data columns of the sub-operation result is equal to the number of array columns of the systolic array.
在一些实施例中,矩阵运算引擎可以将第一运算子数据作为脉动阵列的左输入数据,第二运算子数据作为脉动阵列的右输入数据,通过矩阵运算单元进行矩阵运算,从而得到子运算结果。In some embodiments, the matrix operation engine may use the first operator data as the left input data of the systolic array and the second operator data as the right input data of the systolic array, and perform matrix operation through the matrix operation unit to obtain a sub-operation result.
可选的,矩阵运算单元通过矩阵乘运算,对第一运算子数据和第二运算子数据进行处理运算,即可以得到子运算结果。Optionally, the matrix operation unit processes the first operator data and the second operator data through a matrix multiplication operation, so as to obtain a sub-operation result.
示意性的,如图2所示,通过矩阵运算单元进行矩阵运算输出的结果数据都是[Matrix_M,MALU_N]的二维矩阵形式,其中,以基于脉动阵列进行一次计算为例,在确定脉动阵列的左输入数据21和右输入数据22之后,矩阵运算单元基于脉动阵列的性质,依次从左到右,从上到下进行数据运算,从而输出MALU_N侧的一组数据。其中,脉动阵列中的PE(Processing Element)为阵列最小单元,用于实现一维乘加计算。Schematically, as shown in FIG2 , the result data output by the matrix operation unit through matrix operations is in the form of a two-dimensional matrix [Matrix_M, MALU_N]. Taking one calculation based on the systolic array as an example, after the left input data 21 and the right input data 22 of the systolic array are determined, the matrix operation unit performs data operations from left to right and from top to bottom based on the properties of the systolic array, thereby outputting a set of data on the MALU_N side. The PEs (Processing Elements) in the systolic array are the smallest units of the array, each used to implement a one-dimensional multiply-add calculation.
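The tile-level behavior described above can be sketched in plain Python (an illustrative model only, not the hardware implementation; the function name, numpy usage, and loop structure are assumptions): each chain of PEs accumulates one-dimensional multiply-add steps along the MALU_K direction, and the array as a whole emits a [Matrix_M, MALU_N] result tile.

```python
import numpy as np

def systolic_tile_matmul(left, right):
    """Emulate one pass of a MALU_K x MALU_N systolic array.

    left:  [Matrix_M, MALU_K] tile streamed in as the left input
    right: [MALU_K, MALU_N] tile streamed in as the right input
    Each PE performs a one-dimensional multiply-add; the array as a
    whole produces a [Matrix_M, MALU_N] result tile.
    """
    matrix_m, malu_k = left.shape
    malu_k2, malu_n = right.shape
    assert malu_k == malu_k2, "inner dimensions must match the array rows"
    out = np.zeros((matrix_m, malu_n), dtype=left.dtype)
    # Walk the array left-to-right, top-to-bottom, as the PEs would.
    for m in range(matrix_m):
        for n in range(malu_n):
            acc = 0
            for k in range(malu_k):  # the PE multiply-add chain
                acc += left[m, k] * right[k, n]
            out[m, n] = acc
    return out
```

In practice the hardware overlaps these steps in a pipelined wavefront; the sketch only reproduces the arithmetic result of one tile.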
矩阵运算引擎110,还用于通过累加器113对子运算结果进行累加,得到第一运算数据和第二运算数据的矩阵运算结果。The matrix operation engine 110 is further used to accumulate the sub-operation results through the accumulator 113 to obtain the matrix operation result of the first operation data and the second operation data.
在一些实施例中,矩阵运算引擎中还包括累加器(Accumulator,ACC),在通过矩阵运算单元进行矩阵运算,得到大量子运算结果之后,矩阵运算引擎还可以通过累加器对子运算结果进行累加,从而得到第一运算数据和第二运算数据的矩阵运算结果。In some embodiments, the matrix operation engine also includes an accumulator (ACC). After performing matrix operations through the matrix operation unit to obtain a large number of sub-operation results, the matrix operation engine can also accumulate the sub-operation results through the accumulator to obtain the matrix operation results of the first operation data and the second operation data.
可选的,累加器对子运算结果的累加过程指暂存矩阵运算单元输出的子运算结果,并在得到全部子运算结果之后,根据各个子运算结果在运算结果矩阵中的矩阵元素位置,对各个子运算结果进行数据拼接,从而得到第一运算数据和第二运算数据的运算结果。Optionally, the accumulator's accumulation process of sub-operation results refers to temporarily storing the sub-operation results output by the matrix operation unit, and after obtaining all the sub-operation results, performing data splicing on each sub-operation result according to the matrix element position of each sub-operation result in the operation result matrix, thereby obtaining the operation results of the first operation data and the second operation data.
示意性的,第一运算数据和第二运算数据均为2×2矩阵数据,通过矩阵运算单元进行矩阵运算即可以得到:第一运算数据的第一行矩阵数据与第二运算数据的第一列矩阵数据对应的第一子运算结果,第一运算数据的第一行矩阵数据与第二运算数据的第二列矩阵数据对应的第二子运算结果,第一运算数据的第二行矩阵数据与第二运算数据的第一列矩阵数据对应的第三子运算结果,第一运算数据的第二行矩阵数据与第二运算数据的第二列矩阵数据对应的第四子运算结果,从而第一子运算数据在运算结果矩阵中的矩阵元素位置为第一行第一列,第二子运算数据在运算结果矩阵中的矩阵元素位置为第一行第二列,第三子运算数据在运算结果矩阵中的矩阵元素位置为第二行第一列,第四子运算数据在运算结果矩阵中的矩阵元素位置为第二行第二列,进而累加器即可以根据各个子运算数据对应的矩阵元素位置,对子运算数据进行拼接,从而得到第一运算数据和第二运算数据的矩阵运算结果。Schematically, suppose the first operation data and the second operation data are both 2×2 matrix data. The matrix operation unit then produces: a first sub-operation result corresponding to the first row of the first operation data and the first column of the second operation data; a second sub-operation result corresponding to the first row of the first operation data and the second column of the second operation data; a third sub-operation result corresponding to the second row of the first operation data and the first column of the second operation data; and a fourth sub-operation result corresponding to the second row of the first operation data and the second column of the second operation data. Thus the matrix element position of the first sub-operation result in the operation result matrix is the first row and first column, that of the second sub-operation result is the first row and second column, that of the third sub-operation result is the second row and first column, and that of the fourth sub-operation result is the second row and second column. The accumulator can then splice the sub-operation results according to their matrix element positions to obtain the matrix operation result of the first operation data and the second operation data.
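The splicing step described above can be sketched as follows (a simplified numpy illustration; the function name `accumulate_sub_results` and the dictionary keyed by block position are assumptions for the example, not the patent's interface):

```python
import numpy as np

def accumulate_sub_results(sub_results, rows, cols):
    """Splice per-tile sub-results into the full result matrix.

    sub_results maps (row_block, col_block) -> a 2-D tile, mirroring
    the accumulator buffering each systolic-array output until all
    tiles of the result matrix have arrived, then joining them by
    their matrix element positions.
    """
    row_tiles = [[sub_results[(r, c)] for c in range(cols)]
                 for r in range(rows)]
    return np.block(row_tiles)
```

For the 2×2 example in the text, each tile is a single element (one row of the first operation data times one column of the second), and splicing the four tiles by position reproduces the ordinary matrix product.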
在一些实施例中,在得到第一运算数据和第二运算数据的运算结果之后,AI处理器还可以利用数据搬运引擎,通过总线将运算结果搬运至第一存储器中。In some embodiments, after obtaining the operation results of the first operation data and the second operation data, the AI processor may also use the data transfer engine to transfer the operation results to the first memory through the bus.
综上所述,本申请实施例中,通过在第一存储器中存储采用目标数据格式的第一运算数据以及第二运算数据,使得第一运算数据以及第二运算数据的数据维数与矩阵运算引擎中矩阵运算单元内脉动阵列的阵列维数匹配,从而数据搬运引擎可以根据总线位宽,从第一存储器中读取第一运算子数据以及第二运算子数据,并通过总线将第一运算子数据和第二运算子数据搬运至矩阵运算引擎内部的第二存储器中,进而矩阵运算引擎通过矩阵运算单元对第一运算子数据和第二运算子数据进行矩阵运算,得到子运算结果,并通过累加器对子运算结果进行累加,得到第一运算数据和第二运算数据的矩阵运算结果。采用本申请实施例提供的AI处理器,通过在第一存储器中存储采用目标数据格式的运算数据,能够实现根据总线位宽进行运算数据读取。由于总线位宽大于阵列维数对应的数据位宽,因此相较于根据阵列维数对应的数据位宽进行数据读取,本申请实施例提供的AI处理器能够提高矩阵运算过程中数据搬运的带宽利用率,从而提高矩阵运算的效率。In summary, in the embodiment of the present application, by storing the first operation data and the second operation data in the target data format in the first memory, the data dimension of the first operation data and the second operation data matches the array dimension of the systolic array in the matrix operation unit in the matrix operation engine, so that the data handling engine can read the first operator data and the second operator data from the first memory according to the bus width, and carry the first operator data and the second operator data to the second memory inside the matrix operation engine through the bus, and then the matrix operation engine performs matrix operation on the first operator data and the second operator data through the matrix operation unit to obtain the sub-operation result, and accumulates the sub-operation result through the accumulator to obtain the matrix operation result of the first operation data and the second operation data. Using the AI processor provided in the embodiment of the present application, by storing the operation data in the target data format in the first memory, it is possible to realize the reading of operation data according to the bus width. 
Since the bus width is greater than the data width corresponding to the array dimension, compared with reading data according to the data width corresponding to the array dimension, the AI processor provided in the embodiment of the present application can improve the bandwidth utilization of data handling during matrix operation, thereby improving the efficiency of matrix operation.
在一些实施例中,深度学习网络可以包括卷积类网络,用于对卷积数据进行运算;也可以包括矩阵类网络,用于对矩阵数据进行运算。因此在将AI处理器集成到深度学习网络框架之后,AI处理器则需要分别对卷积数据和矩阵数据进行数据处理,使卷积数据和矩阵数据均符合目标数据格式。In some embodiments, the deep learning network may include a convolutional network for performing operations on convolutional data, and may also include a matrix network for performing operations on matrix data. Therefore, after the AI processor is integrated into the deep learning network framework, the AI processor needs to process the convolutional data and the matrix data separately so that both the convolutional data and the matrix data conform to the target data format.
下面分别对卷积类以及矩阵类运算场景下采用的目标数据格式进行说明。The target data formats used in convolution and matrix operation scenarios are described below.
在一种可能的实施方式中,在处理卷积数据的过程中,第一运算数据可以为卷积输入数据,第二运算数据可以为卷积权重数据,第一运算数据和第二运算数据的运算结果则为卷积输出数据。In a possible implementation, in the process of processing convolution data, the first operation data may be convolution input data, the second operation data may be convolution weight data, and the operation result of the first operation data and the second operation data may be convolution output data.
可选的,该卷积输入数据可以是待进行卷积运算的图像或者特征图,卷积权重数据可以是用于对图像或者特征图进行卷积处理的卷积核。通常情况下,卷积输入数据的尺寸大于卷积权重数据的尺寸。比如,卷积输入数据为100×100的矩阵数据,卷积权重数据则为3×3的卷积核。Optionally, the convolution input data may be an image or feature map to be convolved, and the convolution weight data may be a convolution kernel used to convolve the image or feature map. Typically, the size of the convolution input data is larger than the size of the convolution weight data. For example, the convolution input data is 100×100 matrix data, and the convolution weight data is a 3×3 convolution kernel.
在一些实施例中,卷积类网络的维数信息的表达方式中,用H表示图像或特征图(Feature Map)的高,用W表示图像或特征图的宽,用C表示特征维数,用N表示一组(Batch)图像或特征图。In some embodiments, in the expression of dimensional information of a convolutional network, H represents the height of an image or a feature map, W represents the width of an image or a feature map, C represents the feature dimension, and N represents a batch of images or feature maps.
示意性的,如图3所示,其示出了卷积类网络数据的逻辑表达形式,以及相关技术中不同格式数据的物理存储形式。其中,NCHW格式数据、NHWC格式数据以及CHWN格式数据对应的逻辑表达形式是相同的,即在C方向、H方向和W方向上的数据排布方式是相同的,而在物理存储过程中,对于NCHW格式数据,是先取W方向其次H方向,然后是C方向,最后是N方向;对于NHWC格式数据,是先取C方向其次W方向,然后是H方向,最后是N方向;对于CHWN格式数据,先取N方向其次W方向,然后是H方向,最后是C方向。Schematically, as shown in Figure 3, it shows the logical expression of convolutional network data and the physical storage form of data in different formats in the related technology. Among them, the logical expression forms corresponding to NCHW format data, NHWC format data and CHWN format data are the same, that is, the data arrangement methods in the C direction, H direction and W direction are the same, and in the physical storage process, for NCHW format data, the W direction is taken first, followed by the H direction, then the C direction, and finally the N direction; for NHWC format data, the C direction is taken first, followed by the W direction, then the H direction, and finally the N direction; for CHWN format data, the N direction is taken first, followed by the W direction, then the H direction, and finally the C direction.
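The three physical orderings can be reproduced with a small numpy sketch (illustrative only; the axis permutations are an assumed way to model the layouts, not the hardware's storage code). Flattening the same logical tensor after different transposes yields exactly the "take this direction first" orders described above:

```python
import numpy as np

# One logical 4-D tensor; the three physical layouts differ only in
# which axis varies fastest in memory. Row-major ravel makes the
# last axis the fastest-varying one.
n, c, h, w = 1, 2, 2, 3
logical = np.arange(n * c * h * w).reshape(n, c, h, w)  # logical NCHW view

nchw_bytes = logical.ravel()                        # W fastest, then H, C, N
nhwc_bytes = logical.transpose(0, 2, 3, 1).ravel()  # C fastest, then W, H, N
chwn_bytes = logical.transpose(1, 2, 3, 0).ravel()  # N fastest, then W, H, C
```

With C=2, H=2, W=3, the NHWC stream interleaves the channels (0, 6, 1, 7, ...), while the NCHW stream runs through one channel plane first (0, 1, 2, 3, ...), matching the discrete-versus-contiguous behavior discussed next.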
相关技术中,以NHWC格式数据为例,在脉动阵列的阵列行数Ck0=3的情况下,则需要按照000、020、040为一组,001、021、041为一组,进行数据搬运。从物理存储形式中可以看出000、020、040与001、021、041处于离散存储状态,因此数据搬运引擎每次仅能搬运一组数据,即使总线带宽大于阵列行数对应的数据位宽,数据搬运引擎也只能每次读取3个数据值。In the related art, taking NHWC format data as an example, when the number of array rows of the systolic array is Ck0=3, data needs to be transferred with 000, 020, 040 as one group and 001, 021, 041 as another group. From the physical storage form, it can be seen that 000, 020, 040 and 001, 021, 041 are stored discretely, so the data handling engine can only handle one group of data at a time; even if the bus bandwidth is greater than the data bit width corresponding to the number of array rows, the data handling engine can still only read 3 data values at a time.
在一些实施例中,在第一运算数据为卷积输入数据的情况下,为了在对卷积输入数据进行数据搬运的过程中,充分利用总线带宽,可以采用第一目标数据格式对卷积输入数据进行存储,其中,第一目标数据格式所表征的子输入通道数等于脉动阵列的阵列行数。In some embodiments, when the first operation data is convolution input data, in order to fully utilize the bus bandwidth in the process of data transfer of the convolution input data, the convolution input data can be stored in a first target data format, wherein the number of sub-input channels represented by the first target data format is equal to the number of array rows of the systolic array.
在一种可能的实施方式中,脉动阵列的阵列行数可以表示为Ck0,第一目标数据格式下的卷积输入数据可以表示为NCi1HiWiCk0,其中,N表示图像或特征图的组数,Hi表示输入图像或特征图的高,Wi表示输入图像或特征图的宽,Ci1为Ci/Ck0向上取整的值,Ci为总输入通道数(如图4所示)。In one possible implementation, the number of array rows of the systolic array can be expressed as Ck0, and the convolution input data in the first target data format can be expressed as NCi1HiWiCk0, wherein N represents the number of groups of images or feature maps, Hi represents the height of the input image or feature map, Wi represents the width of the input image or feature map, Ci1 is the value of Ci/Ck0 rounded up, and Ci is the total number of input channels (as shown in FIG4 ).
示意性的,如图4所示,其示出了本申请一个示例性实施例提供的第一目标数据格式的物理存储形式。从图4中可知,对于NCi1HiWiCk0这一目标数据格式,数据排布方式是按照先取Ck0方向,其次Wi方向,然后Hi方向,进而Ci1方向,最后N方向。Schematically, as shown in Figure 4, it shows the physical storage form of the first target data format provided by an exemplary embodiment of the present application. As can be seen from Figure 4, for the target data format NCi1HiWiCk0, the data is arranged in the Ck0 direction first, then the Wi direction, then the Hi direction, then the Ci1 direction, and finally the N direction.
比如,在脉动阵列的阵列行数Ck0=3的情况下,先取Ck0方向上的000、020、040三个值,进而从Wi方向,接着取001、021、041,在Wi方向取完之后(即取完003、023、043),从043转到Hi上的004,以此顺序对数据进行存储,从而保证基于脉动阵列的阵列行数进行数据搬运的过程中,相邻次搬运的数据处于连续状态。For example, when the number of array rows of the systolic array Ck0=3, first take the three values 000, 020, and 040 in the Ck0 direction, and then take 001, 021, and 041 from the Wi direction. After taking all the values in the Wi direction (that is, taking 003, 023, and 043), transfer from 043 to 004 on Hi, and store the data in this order, thereby ensuring that in the process of data transfer based on the number of array rows of the systolic array, the data transferred in adjacent times are in a continuous state.
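Under the stated assumptions (NHWC source data, numpy as a modeling tool, and a hypothetical function name), the repacking into the NCi1HiWiCk0 target format might be sketched as follows; the zero-padding of the channel axis to a multiple of Ck0 is an assumption for when Ci is not evenly divisible:

```python
import numpy as np

def pack_nhwc_to_nci1hiwick0(x_nhwc, ck0):
    """Repack NHWC input data into the NCi1HiWiCk0 target format.

    Ci1 = ceil(Ci / Ck0); the channel axis is split into Ci1 groups of
    Ck0 channels (zero-padded if needed), and Ck0 becomes the
    fastest-varying axis, so each group of Ck0 values fed to the
    systolic array rows is contiguous in memory.
    """
    n, hi, wi, ci = x_nhwc.shape
    ci1 = -(-ci // ck0)  # ceiling division
    x = np.pad(x_nhwc, ((0, 0), (0, 0), (0, 0), (0, ci1 * ck0 - ci)))
    # N Hi Wi (Ci1 Ck0) -> N Ci1 Hi Wi Ck0
    return x.reshape(n, hi, wi, ci1, ck0).transpose(0, 3, 1, 2, 4)
```

Raveling the result then walks Ck0 first, then Wi, Hi, Ci1, and N, which is the storage order of FIG4.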
在一些实施例中,在第二运算数据为卷积权重数据的情况下,为了在对卷积权重数据进行数据搬运的过程中,充分利用总线带宽,可以采用第二目标数据格式对卷积权重数据进行存储。其中,第二目标数据格式所表征的子输入通道数等于脉动阵列的阵列行数,第二目标数据格式所表征的子输出通道数等于脉动阵列的阵列列数。In some embodiments, when the second operation data is convolution weight data, in order to fully utilize the bus bandwidth in the process of data transfer of the convolution weight data, the convolution weight data may be stored in a second target data format, wherein the number of sub-input channels represented by the second target data format is equal to the number of array rows of the systolic array, and the number of sub-output channels represented by the second target data format is equal to the number of array columns of the systolic array.
在一种可能的实施方式中,脉动阵列的阵列行数可以表示为Ck0,脉动阵列的阵列列数可以表示为Cn0,第二目标数据格式下的卷积权重数据可以表示为Co1Ci1KhKwCk0Cn0,其中,Kh表示卷积核高度,Kw表示卷积核宽度,Ci1为Ci/Ck0向上取整的值,Co1表示Co/Cn0向上取整的值,Ci为总输入通道数,Co为总输出通道数。In one possible implementation, the number of array rows of the systolic array can be expressed as Ck0, the number of array columns of the systolic array can be expressed as Cn0, and the convolution weight data in the second target data format can be expressed as Co1Ci1KhKwCk0Cn0, wherein Kh represents the height of the convolution kernel, Kw represents the width of the convolution kernel, Ci1 is the value of Ci/Ck0 rounded up, Co1 represents the value of Co/Cn0 rounded up, Ci is the total number of input channels, and Co is the total number of output channels.
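A matching sketch for the weight side (again an illustrative numpy model with an assumed KhKwCiCo source layout and hypothetical function name) tiles Ci by Ck0 and Co by Cn0 so that every Ck0×Cn0 weight tile handed to the systolic array is contiguous:

```python
import numpy as np

def pack_weight_co1ci1khkwck0cn0(w_khkwcico, ck0, cn0):
    """Repack KhKwCiCo convolution weights into Co1Ci1KhKwCk0Cn0.

    Ci is tiled by Ck0 (array rows) and Co by Cn0 (array columns),
    each zero-padded to a multiple of its tile size, with Cn0 the
    fastest-varying axis and Ck0 the next fastest.
    """
    kh, kw, ci, co = w_khkwcico.shape
    ci1, co1 = -(-ci // ck0), -(-co // cn0)  # ceiling divisions
    w = np.pad(w_khkwcico,
               ((0, 0), (0, 0), (0, ci1 * ck0 - ci), (0, co1 * cn0 - co)))
    # Kh Kw (Ci1 Ck0) (Co1 Cn0) -> Co1 Ci1 Kh Kw Ck0 Cn0
    return w.reshape(kh, kw, ci1, ck0, co1, cn0).transpose(4, 2, 0, 1, 3, 5)
```

The padded tail tiles carry zeros, which contribute nothing to the multiply-add results.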
在一些实施例中,在第一运算数据和第二运算数据的运算结果为卷积输出数据的情况下,根据采用第一目标数据格式的卷积输入数据以及采用第二目标数据格式的卷积权重数据,通过矩阵运算单元进行矩阵运算即可以得到采用第三目标数据格式的卷积输出数据,其中,第三目标数据格式所表征的子输出通道数等于脉动阵列的阵列列数。In some embodiments, when the operation result of the first operation data and the second operation data is convolution output data, convolution output data in a third target data format can be obtained by performing matrix operation through a matrix operation unit according to the convolution input data in the first target data format and the convolution weight data in the second target data format, wherein the number of sub-output channels represented by the third target data format is equal to the number of array columns of the systolic array.
在一种可能的实施方式中,脉动阵列的阵列列数可以表示为Cn0,第三目标数据格式下的卷积输出数据可以表示为NCo1HoWoCn0,其中,N表示图像或特征图的组数,Ho表示输出图像或特征图的高,Wo表示输出图像或特征图的宽,Co1表示Co/Cn0向上取整的值,Co为总输出通道数。In one possible implementation, the number of array columns of the systolic array can be expressed as Cn0, and the convolution output data under the third target data format can be represented as NCo1HoWoCn0, wherein N represents the number of groups of images or feature maps, Ho represents the height of the output image or feature map, Wo represents the width of the output image or feature map, Co1 represents the value of Co/Cn0 rounded up, and Co is the total number of output channels.
在一些实施例中,在将采用第一目标数据格式的卷积输入数据以及采用第二目标数据格式的卷积权重数据存储至AI处理器的第一存储器之后,AI处理器即可以利用数据搬运引擎,根据总线位宽,对卷积输入子数据以及卷积权重子数据进行数据搬运,并存储至矩阵运算引擎内部的第二存储器中。In some embodiments, after the convolution input data in the first target data format and the convolution weight data in the second target data format are stored in the first memory of the AI processor, the AI processor can use the data transfer engine to transfer the convolution input sub-data and the convolution weight sub-data according to the bus bit width, and store them in the second memory inside the matrix operation engine.
在一种可能的实施方式中,AI处理器利用数据搬运引擎,根据总线位宽,依次从第一存储器中读取第一卷积输入子数据以及第二卷积输入子数据,其中,第一卷积输入子数据与第二卷积输入子数据为连续存储的数据,且第一卷积输入子数据的数据位宽等于总线位宽,第二卷积输入子数据的数据位宽等于总线位宽。In one possible implementation, the AI processor uses a data transfer engine to sequentially read the first convolution input sub-data and the second convolution input sub-data from the first memory according to the bus bit width, wherein the first convolution input sub-data and the second convolution input sub-data are continuously stored data, and the data bit width of the first convolution input sub-data is equal to the bus bit width, and the data bit width of the second convolution input sub-data is equal to the bus bit width.
示意性的,如图4所示,在脉动阵列的阵列行数为3,且总线位宽为阵列行数对应的数据位宽的2倍的情况下,AI处理器利用数据搬运引擎,根据总线位宽,可以取000、020、040、001、021、041六个数据作为第一卷积输入子数据,同时对六个数据进行搬运,并且在继续按照脉动阵列的阵列行数进行数据搬运时,可以继续取002、022、042、003、023、043作为第二卷积输入子数据。可见,第一卷积输入子数据和第二卷积输入子数据在物理存储的数据排布中为连续存储的数据。Schematically, as shown in FIG4, when the number of array rows of the systolic array is 3 and the bus width is twice the data width corresponding to the number of array rows, the AI processor uses the data transfer engine to take the six data values 000, 020, 040, 001, 021, and 041 as the first convolution input sub-data according to the bus width, transferring all six at once; when continuing to transfer data according to the number of array rows of the systolic array, it can then take 002, 022, 042, 003, 023, and 043 as the second convolution input sub-data. It can be seen that the first convolution input sub-data and the second convolution input sub-data are continuously stored data in the physical storage arrangement.
在一种可能的实施方式中,AI处理器利用数据搬运引擎,根据总线位宽,依次从第一存储器中读取第一卷积权重子数据以及第二卷积权重子数据,其中,第一卷积权重子数据与第二卷积权重子数据为连续存储的数据,且第一卷积权重子数据的数据位宽等于总线位宽,第二卷积权重子数据的数据位宽等于总线位宽。In one possible implementation, the AI processor uses a data transfer engine to sequentially read the first convolution weight sub-data and the second convolution weight sub-data from the first memory according to the bus bit width, wherein the first convolution weight sub-data and the second convolution weight sub-data are continuously stored data, and the data bit width of the first convolution weight sub-data is equal to the bus bit width, and the data bit width of the second convolution weight sub-data is equal to the bus bit width.
示意性的,如图5所示,其示出了相关技术中基于一般数据格式通过脉动阵列进行数据处理的示意图。以利用数据搬运引擎对卷积输入数据进行数据搬运为例,AI处理器在利用数据搬运引擎进行数据搬运的过程中,为了从卷积输入数据中取一段长度为MALU_K的数据值,则需要分段离散地从卷积输入数据中进行数据读取,并依次通过总线进行数据搬运,导致在总线带宽大于MALU_K的情况下,出现带宽利用率低的问题。Schematically, as shown in Figure 5, it shows a schematic diagram of data processing through a systolic array based on a general data format in the related art. Taking the use of a data handling engine to carry out data handling of convolution input data as an example, in the process of using the data handling engine to carry out data handling, in order to obtain a data value of a length of MALU_K from the convolution input data, the AI processor needs to read data from the convolution input data in discrete segments and carry out data handling through the bus in sequence, resulting in a problem of low bandwidth utilization when the bus bandwidth is greater than MALU_K.
示意性的,以图5中卷积输入数据对应的逻辑表述形式为例,在数据排布方式为NHiWiCi格式的情况下,包括Wi方向501,Hi方向502以及Ci方向503,且数据在物理存储过程中先在Ci方向503上连续,在脉动阵列的阵列行数为MALU_K,阵列列数为MALU_N的情况下,相关技术中,直接在Ci方向503上根据MALU_K的阵列行数504,对卷积输入数据进行读取,从而可以得到数据块505(高度为Cut_Hi,宽度为Cut_Wi,通道数为MALU_K),而数据块505中在MALU_K方向上的每一段数据在卷积输入数据的物理存储中均为离散分布的,并且在总线带宽大于MALU_K的情况下,数据搬运过程无法充分利用总线带宽。Schematically, taking the logical expression form corresponding to the convolution input data in Figure 5 as an example, when the data is arranged in NHiWiCi format, it includes Wi direction 501, Hi direction 502 and Ci direction 503, and the data is first continuous in the Ci direction 503 during physical storage. When the number of array rows of the systolic array is MALU_K and the number of array columns is MALU_N, in the related art, the convolution input data is directly read in the Ci direction 503 according to the number of array rows 504 of MALU_K, so that a data block 505 (height is Cut_Hi, width is Cut_Wi, number of channels is MALU_K) can be obtained, and each segment of data in the MALU_K direction in the data block 505 is discretely distributed in the physical storage of the convolution input data, and when the bus bandwidth is greater than MALU_K, the data handling process cannot fully utilize the bus bandwidth.
示意性的,以图5中卷积权重数据对应的逻辑表述形式为例,在数据排布方式为KhKwCiCo格式的情况下,包括Ci方向506和Co方向507,数据在物理存储过程中先在Co方向507上连续,在脉动阵列的阵列行数为MALU_K,阵列列数为MALU_N的情况下,相关技术中,直接在Ci方向506上根据MALU_K的阵列行数,在Co方向507上根据MALU_N的阵列列数,对卷积权重数据进行读取,从而每个数据块508(高度为MALU_K,宽度为MALU_N)中在MALU_N方向上的每一段数据在卷积权重数据的物理存储中均为离散分布的,并且在总线带宽大于MALU_N的情况下,数据搬运过程中无法充分利用总线带宽。Schematically, taking the logical expression form corresponding to the convolution weight data in Figure 5 as an example, when the data is arranged in the KhKwCiCo format, including the Ci direction 506 and the Co direction 507, the data is first continuous in the Co direction 507 during the physical storage process. When the number of array rows of the systolic array is MALU_K and the number of array columns is MALU_N, in the related art, the convolution weight data is directly read in the Ci direction 506 according to the number of array rows of MALU_K and in the Co direction 507 according to the number of array columns of MALU_N, so that each segment of data in the MALU_N direction in each data block 508 (with a height of MALU_K and a width of MALU_N) is discretely distributed in the physical storage of the convolution weight data, and when the bus bandwidth is greater than MALU_N, the bus bandwidth cannot be fully utilized during data transportation.
示意性的,如图6所示,其示出了本申请一个示例性实施例提供的基于目标数据格式,通过脉动阵列进行数据处理的示意图。以利用数据搬运引擎对卷积输入数据进行数据搬运为例,AI处理器在利用数据搬运引擎进行数据搬运的过程中,为了从卷积输入数据中取一段长度为MALU_K的数据值,由于目标数据格式下数据按照先沿MALU_K再沿Wi方向连续存储,因此在总线带宽大于MALU_K的情况下,AI处理器可以直接读取先沿MALU_K再沿Wi方向的大于MALU_K数量的数据值,从而充分利用了总线带宽。Schematically, as shown in FIG6 , a schematic diagram of data processing by a systolic array based on a target data format provided by an exemplary embodiment of the present application is shown. Taking the use of a data handling engine to handle convolution input data as an example, in the process of using the data handling engine to handle data, the AI processor, in order to take a data value of a length of MALU_K from the convolution input data, because the data is stored continuously along the MALU_K and then along the Wi direction under the target data format, when the bus bandwidth is greater than MALU_K, the AI processor can directly read the data values greater than MALU_K along the MALU_K and then along the Wi direction, thereby making full use of the bus bandwidth.
示意性的,以图6中卷积输入数据对应的逻辑表述形式为例,在数据排布方式为NCi1HiWiCk0这一目标数据格式的情况下,包括Wi方向601,Hi方向602以及Ci1(Ci/Ck0)方向603,且数据在物理存储过程中先在Ci1方向603上连续。Schematically, taking the logical expression form corresponding to the convolution input data in Figure 6 as an example, when the data arrangement is the target data format NCi1HiWiCk0, it includes Wi direction 601, Hi direction 602 and Ci1 (Ci/Ck0) direction 603, and the data is first continuous in the Ci1 direction 603 during the physical storage process.
在脉动阵列的阵列行数为MALU_K,阵列列数为MALU_N的情况下,由于数据在物理存储过程中先在Ci1方向603上连续,因此本申请实施例可以直接根据总线位宽连续性地读取数据块604(高度为Cut_Hi,宽度为Cut_Wi,通道数为MALU_K),从而提高总线带宽利用率。When the number of array rows of the systolic array is MALU_K and the number of array columns is MALU_N, since the data is first continuous in the Ci1 direction 603 during the physical storage process, the embodiment of the present application can directly read the data block 604 (height is Cut_Hi, width is Cut_Wi, number of channels is MALU_K) according to the bus bit width continuity, thereby improving the bus bandwidth utilization.
示意性的,以图6中卷积权重数据对应的逻辑表述形式为例,数据排布方式为Co1Ci1KhKwCk0Cn0这一目标数据格式的情况下,包括Ci1(Ci/Ck0)方向605和Co1(Co/Cn0)方向606,数据在物理存储过程中先在Co1方向606上连续,在脉动阵列的阵列行数为MALU_K,阵列列数为MALU_N的情况下,由于数据在物理存储过程中先在Co1方向606上连续,因此本申请实施例可以直接根据总线位宽连续性地读取数据块607(高度为MALU_K,宽度为MALU_N),从而提高总线带宽利用率。Schematically, taking the logical expression form corresponding to the convolution weight data in Figure 6 as an example, when the data is arranged in the target data format of Co1Ci1KhKwCk0Cn0, including Ci1 (Ci/Ck0) direction 605 and Co1 (Co/Cn0) direction 606, the data is first continuous in the Co1 direction 606 during the physical storage process. When the number of array rows of the systolic array is MALU_K and the number of array columns is MALU_N, since the data is first continuous in the Co1 direction 606 during the physical storage process, the embodiment of the present application can directly read the data block 607 (height is MALU_K and width is MALU_N) continuously according to the bus bit width, thereby improving the bus bandwidth utilization.
在一些实施例中,考虑到不同的深度学习网络中卷积数据的通道维数可能不同,并且卷积数据的总通道数可能小于脉动阵列的阵列行数或者阵列列数,因此对于总输入通道数小于脉动阵列的阵列行数的卷积输入数据,或者总输出通道数小于脉动阵列的阵列列数的卷积输出数据,需要分别采用不同的目标数据格式对卷积数据进行存储。In some embodiments, considering that the channel dimensions of convolution data may differ between deep learning networks, and that the total number of channels of the convolution data may be less than the number of array rows or array columns of the systolic array, for convolution input data whose total number of input channels is less than the number of array rows of the systolic array, or convolution output data whose total number of output channels is less than the number of array columns of the systolic array, different target data formats need to be used to store the convolution data.
在一种可能的实施方式中,在卷积输入数据的总输入通道数小于脉动阵列的阵列行数的情况下(即Ci/Ck0小于1),可以采用第四目标数据格式对卷积输入数据进行存储,其中,第四目标数据格式所表征的输入通道数为总输入通道数。In one possible implementation, when the total number of input channels of the convolution input data is less than the number of array rows of the systolic array (i.e., Ci/Ck0 is less than 1), the convolution input data may be stored in a fourth target data format, wherein the number of input channels represented by the fourth target data format is the total number of input channels.
在一种可能的实施方式中,脉动阵列的阵列行数可以表示为Ck0,卷积输入数据的总输入通道数为Ci,Ci小于Ck0。卷积输入数据可以表示为NHiWiCi,其中,N表示图像或特征图的组数,Hi表示输入图像或特征图的高,Wi表示输入图像或特征图的宽。In one possible implementation, the number of array rows of the systolic array can be represented as Ck0, and the total number of input channels of the convolution input data is Ci, which is less than Ck0. The convolution input data can be represented as NHiWiCi, where N represents the number of groups of images or feature maps, Hi represents the height of the input image or feature map, and Wi represents the width of the input image or feature map.
在一种可能的实施方式中,在卷积输入数据的总输入通道数小于脉动阵列的阵列行数,且卷积输出数据的总输出通道数不小于脉动阵列的阵列列数的情况下,可以采用第五目标数据格式对卷积权重数据进行存储,其中,第五目标数据格式所表征的输入通道数为总输入通道数,第五目标数据格式所表征的子输出通道数等于脉动阵列的阵列列数。采用第四目标数据格式和第五目标数据格式存储卷积输入数据和卷积权重数据时,能够避免对卷积输入数据和卷积权重数据进行额外的数据填充(padding)。In a possible implementation, when the total number of input channels of the convolution input data is less than the number of array rows of the systolic array, and the total number of output channels of the convolution output data is not less than the number of array columns of the systolic array, the convolution weight data may be stored in a fifth target data format, wherein the number of input channels represented by the fifth target data format is the total number of input channels, and the number of sub-output channels represented by the fifth target data format is equal to the number of array columns of the systolic array. When the fourth target data format and the fifth target data format are used to store the convolution input data and the convolution weight data, additional data padding for the convolution input data and the convolution weight data can be avoided.
在一种可能的实施方式中,脉动阵列的阵列行数可以表示为Ck0,脉动阵列的阵列列数可以表示为Cn0,Ci为总输入通道数,Co为总输出通道数,且Ci小于Ck0,Co大于Cn0。卷积权重数据可以表示为Co1KwCiKhCn0,其中,Kh表示卷积核高度,Kw表示卷积核宽度,Co1表示Co/Cn0向上取整的值。In a possible implementation, the number of array rows of the systolic array can be represented as Ck0, the number of array columns of the systolic array can be represented as Cn0, Ci is the total number of input channels, Co is the total number of output channels, Ci is less than Ck0, and Co is greater than Cn0. The convolution weight data can be represented as Co1KwCiKhCn0, where Kh represents the convolution kernel height, Kw represents the convolution kernel width, and Co1 represents the value of Co/Cn0 rounded up.
在一种可能的实施方式中,在卷积输出数据的总输出通道数小于脉动阵列的阵列列数的情况下(即Co/Cn0小于1),可以采用第六目标数据格式对卷积输出数据进行存储,其中,第六目标数据格式所表征的输出通道数为总输出通道数。In one possible implementation, when the total number of output channels of the convolution output data is less than the number of array columns of the systolic array (i.e., Co/Cn0 is less than 1), the convolution output data may be stored in a sixth target data format, wherein the number of output channels represented by the sixth target data format is the total number of output channels.
在一种可能的实施方式中,脉动阵列的阵列列数可以表示为Cn0,卷积输出数据的总输出通道数为Co,Co小于Cn0。卷积输出数据可以表示为NHoWoCo,其中,N表示图像或特征图的组数,Ho表示输出图像或特征图的高,Wo表示输出图像或特征图的宽。In one possible implementation, the number of array columns of the systolic array may be represented as Cn0, and the total number of output channels of the convolution output data may be Co, where Co is less than Cn0. The convolution output data may be represented as NHoWoCo, where N represents the number of groups of images or feature maps, Ho represents the height of the output image or feature map, and Wo represents the width of the output image or feature map.
在一种可能的实施方式中,在卷积输出数据的总输出通道数小于脉动阵列的阵列列数,且卷积输入数据的总输入通道数不小于脉动阵列的阵列行数的情况下,可以采用第七目标数据格式对卷积权重数据进行存储,其中,第七目标数据格式所表征的子输入通道数等于脉动阵列的阵列行数,第七目标数据格式所表征的输出通道数为总输出通道数。In a possible implementation, when the total number of output channels of the convolution output data is less than the number of array columns of the systolic array, and the total number of input channels of the convolution input data is not less than the number of array rows of the systolic array, the convolution weight data may be stored in a seventh target data format, wherein the number of sub-input channels represented by the seventh target data format is equal to the number of array rows of the systolic array, and the number of output channels represented by the seventh target data format is the total number of output channels.
在一种可能的实施方式中,脉动阵列的阵列行数可以表示为Ck0,脉动阵列的阵列列数可以表示为Cn0,Ci为总输入通道数,Co为总输出通道数,且Ci大于Ck0,Co小于Cn0,卷积权重数据可以表示为Ci1KhKwCk0ECn0,其中,Kh表示卷积核高度,Kw表示卷积核宽度,Ci1为Ci/Ck0向上取整的值,E为Co小专用优化算法中的扩展(Expansion)系数,该系数由算力、带宽和卷积参数共同决定。In one possible implementation, the number of array rows of the systolic array can be represented as Ck0, the number of array columns of the systolic array can be represented as Cn0, Ci is the total number of input channels, Co is the total number of output channels, Ci is greater than Ck0, and Co is less than Cn0. The convolution weight data can be expressed as Ci1KhKwCk0ECn0, wherein Kh represents the height of the convolution kernel, Kw represents the width of the convolution kernel, Ci1 is the value of Ci/Ck0 rounded up, and E is the expansion coefficient in the dedicated optimization algorithm for small Co; this coefficient is jointly determined by the computing power, the bandwidth and the convolution parameters.
采用第六目标数据格式和第七目标数据格式存储卷积输出数据和卷积权重数据时,能够避免运算过程中对卷积权重数据进行额外的数据填充(padding)。上述实施例中,对于卷积类数据,通过分别比较卷积输入数据的总输入通道数与脉动阵列的阵列行数,以及卷积输出数据的总输出通道数与脉动阵列的阵列列数,确定卷积输入数据、卷积权重数据以及卷积输出数据对应采用的目标数据格式,实现了根据不同数据情况确定不同目标数据格式。由于总线位宽大于阵列维数对应的数据位宽,因此相较于根据阵列维数对应的数据位宽进行数据读取,对于卷积类数据,可以实现直接根据总线位宽进行数据读取,使得输入子数据以及权重子数据的数据位宽均等于总线位宽,进而优化了数据搬运效率。When the sixth target data format and the seventh target data format are used to store the convolution output data and the convolution weight data, additional data padding (padding) of the convolution weight data can be avoided during the operation. In the above embodiment, for the convolution data, by comparing the total number of input channels of the convolution input data with the number of array rows of the systolic array, and the total number of output channels of the convolution output data with the number of array columns of the systolic array, the target data formats corresponding to the convolution input data, the convolution weight data and the convolution output data are determined, thereby realizing the determination of different target data formats according to different data situations. Since the bus bit width is larger than the data bit width corresponding to the array dimension, compared with data reading according to the data bit width corresponding to the array dimension, for the convolution data, data reading can be realized directly according to the bus bit width, so that the data bit width of the input sub-data and the weight sub-data are equal to the bus bit width, thereby optimizing the data handling efficiency.
在一种可能的实施方式中,在处理矩阵数据的过程中,第一运算数据可以为左矩阵输入数据,第二运算数据可以为右矩阵输入数据,第一运算数据和第二运算数据的运算结果则为矩阵输出数据。In a possible implementation, during processing of matrix data, the first operation data may be left matrix input data, the second operation data may be right matrix input data, and the operation results of the first operation data and the second operation data are matrix output data.
在一些实施例中,矩阵类网络的维数信息的表达方式中,左矩阵表示为MK(M行K列),右矩阵表示为KN(K行N列),结果矩阵表示为MN(M行N列)。并且在常用的深度学习网络中,还会在此基础上增加其他维数的批处理(Batch)数据。In some embodiments, in the expression of the dimension information of a matrix network, the left matrix is represented as MK (M rows and K columns), the right matrix is represented as KN (K rows and N columns), and the result matrix is represented as MN (M rows and N columns). In addition, in commonly used deep learning networks, batch data of other dimensions is added on this basis.
在一些实施例中,在第一运算数据为左矩阵输入数据的情况下,为了在对左矩阵输入数据进行数据搬运的过程中,充分利用总线带宽,可以采用第八目标数据格式对左矩阵输入数据进行存储,其中,第八目标数据格式所表征的子矩阵列数等于脉动阵列的阵列行数。In some embodiments, when the first operation data is the left matrix input data, in order to fully utilize the bus bandwidth in the process of data transfer of the left matrix input data, the left matrix input data may be stored in an eighth target data format, wherein the number of submatrix columns represented by the eighth target data format is equal to the number of array rows of the systolic array.
在一种可能的实施方式中,脉动阵列的阵列行数为Wk0,第八目标数据格式下左矩阵输入数据可以表示为B0B1W1HWk0,其中,B0和B1分别表示两个维度的Batch数量,H表示左矩阵的矩阵行数,W1为K/Wk0向上取整的值,K为左矩阵的矩阵列数。In one possible implementation, the number of array rows of the systolic array is Wk0, and the left matrix input data in the eighth target data format can be expressed as B0B1W1HWk0, where B0 and B1 respectively represent the batch counts of two dimensions, H represents the number of matrix rows of the left matrix, W1 is the value of K/Wk0 rounded up, and K is the number of matrix columns of the left matrix.
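The B0B1W1HWk0 shape above can be sketched in Python. This is a minimal illustrative helper (the function name and example values are not from the application):

```python
import math

def left_matrix_tiled_shape(b0, b1, h, k, wk0):
    """Shape of left matrix input data in the B0B1W1HWk0 format:
    the K matrix columns are split into W1 = ceil(K / Wk0) groups,
    each Wk0 wide (matching the systolic array's row count)."""
    w1 = math.ceil(k / wk0)
    return (b0, b1, w1, h, wk0)

# e.g. a 2x3 batch of 100x70 left matrices on a 32-row systolic array
print(left_matrix_tiled_shape(2, 3, 100, 70, 32))  # (2, 3, 3, 100, 32)
```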
在一些实施例中,在第二运算数据为右矩阵输入数据的情况下,为了在对右矩阵输入数据进行数据搬运的过程中,充分利用总线带宽,可以采用第九目标数据格式对右矩阵输入数据进行存储,其中,第九目标数据格式所表征的子矩阵列数等于脉动阵列的阵列列数。In some embodiments, when the second operation data is right matrix input data, in order to fully utilize the bus bandwidth during data transfer of the right matrix input data, the right matrix input data can be stored in a ninth target data format, wherein the number of submatrix columns represented by the ninth target data format is equal to the number of array columns of the systolic array.
在一种可能的实施方式中,脉动阵列的阵列列数为Wn0,第九目标数据格式下右矩阵输入数据可以表示为B0B1W1HWn0,其中,B0和B1分别表示两个维度的Batch数量。H表示右矩阵的矩阵行数,W1为N/Wn0向上取整的值,N为右矩阵的矩阵列数。In a possible implementation, the number of array columns of the systolic array is Wn0, and the right matrix input data in the ninth target data format can be expressed as B0B1W1HWn0, where B0 and B1 represent the number of batches in two dimensions, respectively. H represents the number of matrix rows of the right matrix, W1 is the value of N/Wn0 rounded up, and N is the number of matrix columns of the right matrix.
需要说明的是,左矩阵输入数据和右矩阵输入数据中的Batch可以不相等,但要求当不相等时,需其中一个为1。It should be noted that the batches in the left matrix input data and the right matrix input data may not be equal, but when they are not equal, one of them must be 1.
在一些实施例中,在第一运算数据和第二运算数据的运算结果为矩阵输出数据的情况下,根据采用第八目标数据格式的左矩阵输入数据以及采用第九目标数据格式的右矩阵输入数据,通过矩阵运算单元进行矩阵运算即可以得到采用第十目标数据格式的矩阵输出数据,其中,第十目标数据格式所表征的矩阵行数等于第八目标数据格式所表征的矩阵行数,第十目标数据格式所表征的矩阵列数等于第九目标数据格式所表征的矩阵列数。In some embodiments, when the operation result of the first operation data and the second operation data is matrix output data, matrix output data using the tenth target data format can be obtained by performing matrix operations through a matrix operation unit based on left matrix input data using the eighth target data format and right matrix input data using the ninth target data format, wherein the number of matrix rows represented by the tenth target data format is equal to the number of matrix rows represented by the eighth target data format, and the number of matrix columns represented by the tenth target data format is equal to the number of matrix columns represented by the ninth target data format.
在一种可能的实施方式中,第十目标数据格式下,矩阵输出数据可以表示为B0B1W1HWn0,其中,矩阵输出数据中的B0为左矩阵输入数据中的B0和右矩阵输入数据中的B0的最大值,矩阵输出数据中的B1为左矩阵输入数据中的B1和右矩阵输入数据中的B1的最大值,W1和Wn0同右矩阵输入数据的W1和Wn0值,H同左矩阵输入数据的H值。In one possible implementation, under the tenth target data format, the matrix output data can be expressed as B0B1W1HWn0, wherein B0 in the matrix output data is the maximum value of B0 in the left matrix input data and B0 in the right matrix input data, B1 in the matrix output data is the maximum value of B1 in the left matrix input data and B1 in the right matrix input data, W1 and Wn0 are the same as the W1 and Wn0 values of the right matrix input data, and H is the same as the H value of the left matrix input data.
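The shape rules of the preceding paragraphs (per-dimension maximum for B0/B1, with unequal batches required to include a 1; W1 and Wn0 from the right input; H from the left input) can be sketched as follows. The function is a hypothetical illustration, assuming shapes are passed as plain tuples:

```python
def matrix_output_shape(left, right):
    """Derive the B0B1W1HWn0 output shape from a left input shaped
    (B0, B1, W1, H, Wk0) and a right input shaped (B0, B1, W1, H, Wn0):
    B0 and B1 take the per-dimension maximum, W1 and Wn0 come from the
    right input, and H comes from the left input."""
    lb0, lb1, _, lh, _ = left
    rb0, rb1, rw1, _, rwn0 = right
    for a, b in ((lb0, rb0), (lb1, rb1)):
        assert a == b or 1 in (a, b), "unequal batches must include a 1"
    return (max(lb0, rb0), max(lb1, rb1), rw1, lh, rwn0)

print(matrix_output_shape((1, 3, 3, 100, 32), (2, 1, 2, 70, 16)))
# (2, 3, 2, 100, 16)
```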
在一些实施例中,在将采用第八目标数据格式的左矩阵输入数据以及采用第九目标数据格式的右矩阵输入数据存储至AI处理器的第一存储器之后,AI处理器即可以利用数据搬运引擎,根据总线位宽,对左矩阵输入子数据以及右矩阵输入子数据进行数据搬运,并存储至矩阵运算引擎内部的第二存储器中。In some embodiments, after the left matrix input data in the eighth target data format and the right matrix input data in the ninth target data format are stored in the first memory of the AI processor, the AI processor can use the data transfer engine to transfer the left matrix input sub-data and the right matrix input sub-data according to the bus bit width, and store them in the second memory inside the matrix operation engine.
在一种可能的实施方式中,AI处理器利用数据搬运引擎,根据总线位宽,依次从第一存储器中读取第一左矩阵输入子数据以及第二左矩阵输入子数据,其中,第一左矩阵输入子数据与第二左矩阵输入子数据为连续存储的数据,且第一左矩阵输入子数据的数据位宽等于总线位宽,第二左矩阵输入子数据的数据位宽等于总线位宽。In one possible implementation, the AI processor uses a data transfer engine to sequentially read the first left matrix input sub-data and the second left matrix input sub-data from the first memory according to the bus bit width, wherein the first left matrix input sub-data and the second left matrix input sub-data are continuously stored data, and the data bit width of the first left matrix input sub-data is equal to the bus bit width, and the data bit width of the second left matrix input sub-data is equal to the bus bit width.
在一种可能的实施方式中,AI处理器利用数据搬运引擎,根据总线位宽,依次从第一存储器中读取第一右矩阵输入子数据以及第二右矩阵输入子数据,其中,第一右矩阵输入子数据与第二右矩阵输入子数据为连续存储的数据,且第一右矩阵输入子数据的数据位宽等于总线位宽,第二右矩阵输入子数据的数据位宽等于总线位宽。In one possible implementation, the AI processor uses a data transfer engine to sequentially read the first right matrix input sub-data and the second right matrix input sub-data from the first memory according to the bus bit width, wherein the first right matrix input sub-data and the second right matrix input sub-data are continuously stored data, and the data bit width of the first right matrix input sub-data is equal to the bus bit width, and the data bit width of the second right matrix input sub-data is equal to the bus bit width.
示意性的,如图7所示,其示出了相关技术中基于一般数据格式通过脉动阵列进行数据处理的示意图。以利用数据搬运引擎对左矩阵输入数据进行数据搬运为例,AI处理器在利用数据搬运引擎进行数据搬运的过程中,为了从左矩阵输入数据中取一段长度为MALU_K的数据值,则需要沿M方向分段离散地从左矩阵输入数据中读取数据,并依次通过总线进行数据搬运,导致在总线带宽大于MALU_K的情况下,出现带宽利用率低的问题。Schematically, as shown in Figure 7, it shows a schematic diagram of data processing through a systolic array based on a general data format in the related art. Taking the use of a data handling engine to carry out data handling on the left matrix input data as an example, in the process of using the data handling engine to carry out data handling, in order to obtain a data value of a length of MALU_K from the left matrix input data, the AI processor needs to read data from the left matrix input data in segments along the M direction discretely, and carry out data handling through the bus in sequence, resulting in a problem of low bandwidth utilization when the bus bandwidth is greater than MALU_K.
示意性的,以图7中左矩阵输入数据为例,数据排布方式为MK格式的情况下,包括M方向701以及K方向702,且数据在物理存储过程中先在K方向702上连续(即先沿K方向存储,再沿M方向存储)。在脉动阵列的阵列行数为MALU_K,阵列列数为MALU_N的情况下,相关技术中,直接从K方向702上根据MALU_K的阵列行数进行数据读取,数据块703中在MALU_K方向上的每一段数据在左矩阵输入数据的物理存储中均为离散分布的(图7中的数据块703在M方向上跨越了多行,因此数据块703中的数据离散分布),并且在总线带宽大于MALU_K的情况下,数据搬运过程中也会产生带宽利用率低的问题。图7中右矩阵输入数据同理,本申请实施例在此不作赘述。Schematically, taking the left matrix input data in FIG. 7 as an example, when the data arrangement is in MK format, it includes M direction 701 and K direction 702, and the data is first continuous in K direction 702 during physical storage (i.e., first stored along the K direction, then along the M direction). When the number of array rows of the systolic array is MALU_K and the number of array columns is MALU_N, in the related art, data is read directly along K direction 702 according to the MALU_K array rows, and each segment of data in the MALU_K direction in data block 703 is discretely distributed in the physical storage of the left matrix input data (data block 703 in FIG. 7 spans multiple rows in the M direction, so the data in data block 703 is discretely distributed); in addition, when the bus bandwidth is greater than MALU_K, low bandwidth utilization also occurs during data transfer. The same applies to the right matrix input data in FIG. 7, which is not described in detail in the embodiments of the present application.
示意性的,如图8所示,其示出了本申请另一个示例性实施例提供的基于目标数据格式通过脉动阵列进行数据处理的示意图。以利用数据搬运引擎对左矩阵输入数据进行数据搬运为例,AI处理器在利用数据搬运引擎进行数据搬运的过程中,为了从左矩阵输入数据中取一段长度为MALU_K的数据值,由于目标数据格式下数据按照先沿MALU_K再沿H_Left方向连续存储,因此在总线带宽大于MALU_K的情况下,AI处理器可以直接读取先沿MALU_K再沿H_Left方向的大于MALU_K数量的数据值,从而充分利用了总线带宽。Schematically, as shown in FIG8, it shows a schematic diagram of data processing through a systolic array based on a target data format provided by another exemplary embodiment of the present application. Taking the use of a data handling engine to handle data on the left matrix input data as an example, in the process of using the data handling engine to handle data, the AI processor, in order to obtain a data value of a length of MALU_K from the left matrix input data, because the data is stored continuously along the MALU_K and then along the H_Left direction in the target data format, when the bus bandwidth is greater than MALU_K, the AI processor can directly read the data values greater than MALU_K along the MALU_K and then along the H_Left direction, thereby making full use of the bus bandwidth.
示意性的,以图8中左矩阵输入数据为例,在数据排布方式为B0B1W1HWk0这一目标数据格式的情况下,包括H方向801(纵向)以及W1(K/Wk0)方向802(横向),数据在物理存储过程中先在W1方向802上连续(即先沿W1方向存储,再沿H方向存储)。在脉动阵列的阵列行数为MALU_K,阵列列数为MALU_N的情况下,由于数据在物理存储过程中先在W1方向802上连续,因此本申请实施例可以直接根据总线位宽连续地读取数据块803(图8中的数据块803虽然在H方向上跨越了多行,但由于目标数据格式下的数据在H方向上是连续存储的,因此数据块803中的数据连续),从而提高总线带宽利用率。图8中右矩阵输入数据同理,本申请实施例在此不作赘述。Schematically, taking the left matrix input data in FIG. 8 as an example, when the data arrangement is the target data format B0B1W1HWk0, it includes the H direction 801 (vertical) and the W1 (K/Wk0) direction 802 (horizontal), and the data is first continuous in the W1 direction 802 during physical storage (i.e., first stored along the W1 direction, then along the H direction). When the number of array rows of the systolic array is MALU_K and the number of array columns is MALU_N, since the data is first continuous in the W1 direction 802 during physical storage, the embodiment of the present application can directly read data block 803 continuously according to the bus bit width (although data block 803 in FIG. 8 spans multiple rows in the H direction, the data in the target data format is stored continuously in the H direction, so the data in data block 803 is continuous), thereby improving bus bandwidth utilization. The same applies to the right matrix input data in FIG. 8, which is not described in detail in the embodiments of the present application.
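The re-tiling that makes this contiguous read possible can be sketched in Python. This is an illustrative simplification: K is assumed to be an exact multiple of Wk0 (real hardware would pad), and nested lists stand in for physical buffers.

```python
def retile_mk(matrix, wk0):
    """Re-lay out an M-by-K row-major matrix into the W1/H/Wk0 order of
    the target format: for each group of Wk0 columns, store all rows
    contiguously, so a bus read longer than Wk0 stays within contiguous
    storage even when it crosses row boundaries."""
    flat = []
    k = len(matrix[0])
    for t in range(k // wk0):     # W1 column tiles
        for row in matrix:        # H direction
            flat.extend(row[t * wk0:(t + 1) * wk0])  # Wk0 contiguous values
    return flat

print(retile_mk([[1, 2, 3, 4], [5, 6, 7, 8]], 2))  # [1, 2, 5, 6, 3, 4, 7, 8]
```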
上述实施例中,对于矩阵类数据,通过根据左矩阵输入数据以及右矩阵输入数据的矩阵列数,并结合脉动阵列的阵列行数以及阵列列数,确定左矩阵输入数据、右矩阵输入数据以及矩阵输出数据采用的具体目标数据格式。由于总线位宽大于阵列维数对应的数据位宽,因此相较于根据阵列维数对应的数据位宽进行数据读取,对于矩阵类数据,可以实现直接根据总线位宽进行数据读取,使得输入子数据的数据位宽均等于总线位宽,从而优化了对矩阵类数据进行数据搬运的效率。In the above embodiment, for matrix data, the specific target data format adopted by the left matrix input data, the right matrix input data and the matrix output data is determined according to the matrix column number of the left matrix input data and the right matrix input data, and in combination with the array row number and the array column number of the systolic array. Since the bus width is larger than the data width corresponding to the array dimension, compared with data reading according to the data width corresponding to the array dimension, for matrix data, data reading can be directly performed according to the bus width, so that the data width of the input sub-data is equal to the bus width, thereby optimizing the efficiency of data transfer for matrix data.
在一些实施例中,考虑到在深度学习网络中通常会有卷积类算子和矩阵类算子混合交替使用的场景,且卷积类数据和矩阵类数据在数据存储格式上存在差异,因此为了提高AI处理器中的数据处理效率,AI处理器中还可以设置有数据格式转换器。In some embodiments, considering that in deep learning networks there are often scenarios where convolution operators and matrix operators are used alternately, and there are differences in the data storage formats of convolution data and matrix data, in order to improve the data processing efficiency in the AI processor, a data format converter may also be provided in the AI processor.
可选的,该数据格式转换器用于对卷积类数据和矩阵类数据的数据存储格式进行单向转换(若要实现相互转换,则需要设置不同转换方向的数据格式转换器),或者,双向转换。Optionally, the data format converter is used to perform unidirectional conversion on the data storage formats of convolution data and matrix data (if mutual conversion is to be achieved, data format converters with different conversion directions need to be set), or bidirectional conversion.
在一种可能的场景下,当需要先利用卷积核对特征图(卷积输入数据)进行卷积处理,然后将卷积处理结果作为左矩阵输入数据,与右矩阵输入数据进行矩阵乘法运算时,需要将卷积输出数据转换为矩阵输入数据所采用的数据格式。In one possible scenario, when it is necessary to first use the convolution kernel to convolve the feature map (convolution input data), and then use the convolution result as the left matrix input data to perform matrix multiplication with the right matrix input data, it is necessary to convert the convolution output data into the data format used by the matrix input data.
在一种可能的实施方式中,在第一运算数据和第二运算数据的运算结果为卷积输出数据,且卷积输出数据用于后续矩阵计算的情况下,数据格式转换器则可以用于基于矩阵输入数据对应的目标数据格式,对卷积输出数据进行数据格式转换。In a possible implementation, when the operation result of the first operation data and the second operation data is convolution output data, and the convolution output data is used for subsequent matrix calculations, the data format converter can be used to convert the data format of the convolution output data based on the target data format corresponding to the matrix input data.
在一种可能的实施方式中,卷积输出数据的数据格式为NC1HWC0,矩阵输入数据的数据格式为B0B1W1HW0,则可以将B0设置为1,B1设置为N,W1设置为C1,H设置为H×W,W0设置为C0,从而实现将卷积输出数据转换为等效的矩阵输入数据。In one possible implementation, the data format of the convolution output data is NC1HWC0, and the data format of the matrix input data is B0B1W1HW0, then B0 can be set to 1, B1 can be set to N, W1 can be set to C1, H can be set to H×W, and W0 can be set to C0, thereby converting the convolution output data into equivalent matrix input data.
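The shape mapping above (B0=1, B1=N, W1=C1, H=H×W, W0=C0) can be sketched as a one-line helper. The function name and example values are illustrative only:

```python
def conv_out_as_matrix_in(n, c1, h, w, c0):
    """Reinterpret an NC1HWC0 convolution output shape as an equivalent
    B0B1W1HW0 matrix input shape: B0=1, B1=N, W1=C1, H=H*W, W0=C0.
    Only the shape is reinterpreted; no data is moved."""
    return (1, n, c1, h * w, c0)

print(conv_out_as_matrix_in(4, 2, 7, 7, 32))  # (1, 4, 2, 49, 32)
```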
在一种可能的场景下,当需要先对两个矩阵(左矩阵输入数据和右矩阵输入数据)进行矩阵乘法运算,然后利用卷积核对矩阵乘法运算结果进行卷积处理时,需要将矩阵输出数据转换为卷积输入数据所采用的数据格式。In a possible scenario, when it is necessary to first perform matrix multiplication on two matrices (the left matrix input data and the right matrix input data) and then use a convolution kernel to convolve the matrix multiplication result, the matrix output data needs to be converted into the data format used by the convolution input data.
在一种可能的实施方式中,在第一运算数据和第二运算数据的运算结果为矩阵输出数据,且矩阵输出数据用于后续卷积计算的情况下,数据格式转换器则可以用于基于卷积输入数据对应的目标数据格式,对矩阵输出数据进行数据格式转换。In a possible implementation, when the operation result of the first operation data and the second operation data is matrix output data, and the matrix output data is used for subsequent convolution calculations, the data format converter can be used to convert the data format of the matrix output data based on the target data format corresponding to the convolution input data.
在一种可能的实施方式中,矩阵输出数据的数据格式为B0B1W1HW0,卷积输入数据的数据格式为NC1HWC0,则可以将N设置为B0×B1,C1设置为W1,H设置为1,W设置为H,C0设置为W0,从而实现将矩阵输出数据等效为卷积输入数据。In a possible implementation, the data format of the matrix output data is B0B1W1HW0, and the data format of the convolution input data is NC1HWC0; then N can be set to B0×B1, C1 to W1, H to 1, W to H, and C0 to W0, thereby making the matrix output data equivalent to convolution input data.
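The reverse mapping (N=B0×B1, C1=W1, H=1, W=H, C0=W0) can likewise be sketched as an illustrative helper, mirroring the convolution-to-matrix direction above:

```python
def matrix_out_as_conv_in(b0, b1, w1, h, w0):
    """Reinterpret a B0B1W1HW0 matrix output shape as an equivalent
    NC1HWC0 convolution input shape: N=B0*B1, C1=W1, H=1, W=H, C0=W0.
    Only the shape is reinterpreted; no data is moved."""
    return (b0 * b1, w1, 1, h, w0)

print(matrix_out_as_conv_in(2, 3, 4, 49, 32))  # (6, 4, 1, 49, 32)
```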
上述实施例中,在卷积输出数据或矩阵输出数据采用目标数据格式,且后续存在数据类型不同的运算的情况下,可以通过确定卷积输出数据和矩阵输入数据之间的数据等效关系,或者矩阵输出数据和卷积输入数据之间的数据等效关系,即可以实现卷积输出数据和矩阵输入数据之间的数据等效,或者矩阵输出数据和卷积输入数据之间的数据等效,提高了AI处理器中的数据处理效率。In the above embodiments, when the convolution output data or the matrix output data adopts the target data format and there are subsequent operations with different data types, the data equivalence relationship between the convolution output data and the matrix input data, or the data equivalence relationship between the matrix output data and the convolution input data can be determined, that is, the data equivalence between the convolution output data and the matrix input data, or the data equivalence between the matrix output data and the convolution input data can be achieved, thereby improving the data processing efficiency in the AI processor.
在一些实施例中,考虑到AI处理器在集成到深度学习网络框架中之后,通过矩阵运算引擎进行数据运算只是其中的一个环节,即基于前序过程得到的运算数据可能不是采用的目标数据格式,从而直接对其进行数据搬运,可能会影响数据搬运的效率,因此AI处理器中还可以设置有数据格式转换器,用于将运算数据的数据格式转换为目标数据格式。In some embodiments, considering that after the AI processor is integrated into the deep learning network framework, data calculation through the matrix operation engine is only one of the links, that is, the calculation data obtained based on the previous process may not be in the target data format, so directly moving the data may affect the efficiency of data movement. Therefore, a data format converter can also be provided in the AI processor to convert the data format of the calculation data into the target data format.
在一种可能的实施方式中,数据格式转换器用于从第一存储器中读取第一运算数据以及第二运算数据,第一运算数据和第二运算数据采用原始数据格式,原始数据格式所表征的数据维数与矩阵运算引擎中矩阵运算单元内脉动阵列的阵列维数不匹配,进而数据格式转换器通过对第一运算数据以及第二运算数据进行数据格式转换,使得第一运算数据和第二运算数据的数据格式从原始数据格式转换为目标数据格式,并将第一运算数据和第二运算数据写入第一存储器。In a possible implementation, a data format converter is used to read first operation data and second operation data from a first memory, the first operation data and the second operation data are in an original data format, and the data dimension represented by the original data format does not match the array dimension of the systolic array in the matrix operation unit in the matrix operation engine, and then the data format converter converts the data format of the first operation data and the second operation data from the original data format to the target data format by performing data format conversion on the first operation data and the second operation data, and writes the first operation data and the second operation data into the first memory.
在一个示意性的例子中,数据格式转换器先从第一存储器中读取采用原始数据格式(NCHW)的卷积输入数据,并根据目标数据格式(NCi1HiWiCk0)对卷积输入数据进行数据格式转换,从而将转换后采用目标数据格式的卷积输入数据重新存储至第一存储器中。In an illustrative example, the data format converter first reads the convolution input data in the original data format (NCHW) from the first memory, and converts the convolution input data according to the target data format (NCi1HiWiCk0), thereby storing the converted convolution input data in the target data format back in the first memory.
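A pure-Python sketch of this NCHW-to-NC1HWC0 conversion is shown below. It is an assumption-laden illustration: nested lists stand in for flat memory buffers, and channels beyond Ci are filled with zeros (the application does not specify the pad value).

```python
import math

def nchw_to_nc1hwc0(x, ck0):
    """Convert nested-list NCHW data into the NCi1HiWiCk0 layout,
    padding the channel dimension with zeros up to a multiple of Ck0.
    The result is indexed as out[n][ci1][hi][wi][ck0]."""
    n, c = len(x), len(x[0])
    hi, wi = len(x[0][0]), len(x[0][0][0])
    c1 = math.ceil(c / ck0)
    return [[[[[x[b][t * ck0 + p][i][j] if t * ck0 + p < c else 0
                for p in range(ck0)]
               for j in range(wi)]
              for i in range(hi)]
             for t in range(c1)]
            for b in range(n)]

# 3 channels tiled onto a 2-row array: channels (1, 2), then (3, pad)
out = nchw_to_nc1hwc0([[[[1]], [[2]], [[3]]]], 2)
print(out[0][0][0][0], out[0][1][0][0])  # [1, 2] [3, 0]
```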
需要说明的是,在进行数据格式转换前,需要基于数据的用途,将数据转换为该用途对应的目标数据格式。比如,将卷积输入数据转换为第一目标数据格式,将卷积权重数据转换为第二目标数据格式,将左矩阵输入数据转换为第八目标数据格式,将右矩阵输入数据转换为第九目标数据格式等等。It should be noted that before performing data format conversion, the data needs to be converted into the target data format corresponding to its purpose. For example, the convolution input data is converted into the first target data format, the convolution weight data into the second target data format, the left matrix input data into the eighth target data format, the right matrix input data into the ninth target data format, and so on.
此外,对于卷积类数据,还需要基于卷积输入数据的总输入通道数与脉动阵列的阵列行数的大小关系,以及卷积输出数据的总输出通道数与脉动阵列的阵列列数的大小关系,确定卷积输入数据、卷积权重数据以及卷积输出数据所采用的目标数据格式,本实施例在此不做赘述。In addition, for convolution data, the target data formats used for the convolution input data, the convolution weight data and the convolution output data also need to be determined based on the relationship between the total number of input channels of the convolution input data and the number of array rows of the systolic array, and the relationship between the total number of output channels of the convolution output data and the number of array columns of the systolic array, which is not described in detail in this embodiment.
在一种可能的实施方式中,在将AI处理器作为运算加速装置集成在通用深度学习框架中的情况下,为了降低将AI处理器集成到深度学习框架的复杂度,还可以在算子层和框架层提供相应的格式处理,比如,在算子层,增加对原始数据格式和目标数据格式的全支持;在框架层,提供关于原始数据格式以及目标数据格式的格式信息。In one possible implementation, when an AI processor is integrated into a general deep learning framework as a computing acceleration device, in order to reduce the complexity of integrating the AI processor into the deep learning framework, corresponding format processing can also be provided at the operator layer and the framework layer. For example, at the operator layer, full support for the original data format and the target data format is added; at the framework layer, format information about the original data format and the target data format is provided.
示意性的,如图9所示,设置Math Dim(Math Dimension)对应算法框架中的采用原始数据格式(通用格式)的运算数据,Data Dim(Data Dimension)对应在AI处理器上的采用目标数据格式(专用格式)的运算数据。在通过深度学习框架进行数据计算的过程中,可以直接使用目标数据格式的运算数据进行数据处理,而在调试过程中,将数据从设备(Device)向主机(Host)传输过程中,可以根据原始数据格式以及目标数据格式的格式信息,将采用目标数据格式的运算数据转换为采用原始数据格式的运算数据(即在传输过程中完成格式转换),以便进行查看分析。Schematically, as shown in FIG. 9, Math Dim (Math Dimension) is set to correspond to the operation data in the original data format (general format) in the algorithm framework, and Data Dim (Data Dimension) corresponds to the operation data in the target data format (special format) on the AI processor. In the process of data calculation through the deep learning framework, the operation data in the target data format can be used directly for data processing; during debugging, when data is transmitted from the device (Device) to the host (Host), the operation data in the target data format can be converted into the operation data in the original data format according to the format information of the original data format and the target data format (i.e., the format conversion is completed during the transmission process), so as to facilitate viewing and analysis.
请参考图10,其示出了本申请一个示例性实施例提供的数据处理方法的流程图,该方法用于上述实施例中的AI处理器中,该方法包括:Please refer to FIG. 10 , which shows a flow chart of a data processing method provided by an exemplary embodiment of the present application. The method is used in the AI processor in the above embodiment, and the method includes:
步骤1001,通过第一存储器存储第一运算数据和第二运算数据,第一运算数据和第二运算数据采用目标数据格式,目标数据格式所表征的数据维数与矩阵运算引擎中矩阵运算单元内脉动阵列的阵列维数匹配。 Step 1001, storing first operation data and second operation data in a first memory, wherein the first operation data and the second operation data are in a target data format, and the data dimension represented by the target data format matches the array dimension of the systolic array in the matrix operation unit in the matrix operation engine.
步骤1002,基于总线位宽,通过数据搬运引擎从第一存储器中读取第一运算子数据和第二运算子数据,并通过总线将第一运算子数据和第二运算子数据搬运至矩阵运算引擎内部的第二存储器,总线位宽大于阵列维数对应的数据位宽。Step 1002, based on the bus bit width, read the first operator data and the second operator data from the first memory through the data transfer engine, and transfer the first operator data and the second operator data to the second memory inside the matrix operation engine through the bus, and the bus bit width is greater than the data bit width corresponding to the array dimension.
步骤1003,通过矩阵运算引擎中的矩阵运算单元,对第二存储器中的第一运算子数据和第二运算子数据进行矩阵运算,得到子运算结果,子运算结果的数据维数与阵列维数匹配。Step 1003, performing a matrix operation on the first operator data and the second operator data in the second memory through the matrix operation unit in the matrix operation engine to obtain a sub-operation result, where the data dimension of the sub-operation result matches the array dimension.
步骤1004,通过矩阵运算引擎中的累加器累加子运算结果,得到第一运算数据和第二运算数据的矩阵运算结果。Step 1004, accumulating the sub-operation results through an accumulator in the matrix operation engine to obtain a matrix operation result of the first operation data and the second operation data.
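Steps 1001 to 1004 can be sketched functionally in Python. This is a behavioral model under stated assumptions, not the hardware implementation: each Wk0-wide K tile stands in for one pass through the systolic array (step 1003), and the running sum plays the role of the accumulator (step 1004).

```python
def tiled_matmul(A, B, wk0):
    """Compute an M-by-K times K-by-N product by splitting the K
    dimension into Wk0-wide sub-operations and accumulating the
    partial products, mirroring steps 1003 and 1004."""
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0] * n for _ in range(m)]       # accumulator state
    for t in range(0, k, wk0):            # one sub-operation per K tile
        for i in range(m):
            for j in range(n):
                C[i][j] += sum(A[i][t + p] * B[t + p][j]
                               for p in range(min(wk0, k - t)))
    return C

print(tiled_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1))
# [[19, 22], [43, 50]]
```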
在一些实施例中,在处理卷积数据的过程中,第一运算数据为卷积输入数据,第二运算数据为卷积权重数据,第一运算数据和第二运算数据的运算结果为卷积输出数据;在处理矩阵数据的过程中,第一运算数据为左矩阵输入数据,第二运算数据为右矩阵输入数据,第一运算数据和第二运算数据的运算结果为矩阵输出数据。In some embodiments, in the process of processing convolution data, the first operation data is the convolution input data, the second operation data is the convolution weight data, and the operation result of the first operation data and the second operation data is the convolution output data; in the process of processing matrix data, the first operation data is the left matrix input data, the second operation data is the right matrix input data, and the operation result of the first operation data and the second operation data is the matrix output data.
在一些实施例中,在第一运算数据为卷积输入数据,第二运算数据为卷积权重数据,第一运算数据和第二运算数据的运算结果为卷积输出数据的情况下,卷积输入数据采用第一目标数据格式,第一目标数据格式所表征的子输入通道数等于脉动阵列的阵列行数;卷积权重数据采用第二目标数据格式,第二目标数据格式所表征的子输入通道数等于脉动阵列的阵列行数,第二目标数据格式所表征的子输出通道数等于脉动阵列的阵列列数;卷积输出数据采用第三目标数据格式,第三目标数据格式所表征的子输出通道数等于脉动阵列的阵列列数。In some embodiments, when the first operation data is convolution input data, the second operation data is convolution weight data, and the operation results of the first operation data and the second operation data are convolution output data, the convolution input data adopts a first target data format, and the number of sub-input channels represented by the first target data format is equal to the number of array rows of the systolic array; the convolution weight data adopts a second target data format, and the number of sub-input channels represented by the second target data format is equal to the number of array rows of the systolic array, and the number of sub-output channels represented by the second target data format is equal to the number of array columns of the systolic array; the convolution output data adopts a third target data format, and the number of sub-output channels represented by the third target data format is equal to the number of array columns of the systolic array.
在一些实施例中,在卷积输入数据的总输入通道数小于脉动阵列的阵列行数的情况下,卷积输入数据采用第四目标数据格式,第四目标数据格式所表征的输入通道数为总输入通道数;卷积权重数据采用第五目标数据格式,第五目标数据格式所表征的输入通道数为总输入通道数,第五目标数据格式所表征的子输出通道数等于脉动阵列的阵列列数。In some embodiments, when the total number of input channels of the convolution input data is less than the number of array rows of the systolic array, the convolution input data adopts a fourth target data format, and the number of input channels represented by the fourth target data format is the total number of input channels; the convolution weight data adopts a fifth target data format, and the number of input channels represented by the fifth target data format is the total number of input channels, and the number of sub-output channels represented by the fifth target data format is equal to the number of array columns of the systolic array.
在一些实施例中,在卷积输出数据的总输出通道数小于脉动阵列的阵列列数的情况下,卷积输出数据采用第六目标数据格式,第六目标数据格式所表征的输出通道数为总输出通道数;卷积权重数据采用第七目标数据格式,第七目标数据格式所表征的子输入通道数等于脉动阵列的阵列行数,第七目标数据格式所表征的输出通道数为总输出通道数。In some embodiments, when the total number of output channels of the convolution output data is less than the number of array columns of the systolic array, the convolution output data adopts a sixth target data format, and the number of output channels represented by the sixth target data format is the total number of output channels; the convolution weight data adopts a seventh target data format, and the number of sub-input channels represented by the seventh target data format is equal to the number of array rows of the systolic array, and the number of output channels represented by the seventh target data format is the total number of output channels.
在一些实施例中,数据搬运过程中,通过数据搬运引擎,基于总线位宽,依次从第一存储器中读取第一卷积输入子数据以及第二卷积输入子数据,第一卷积输入子数据与第二卷积输入子数据为连续存储的数据,且第一卷积输入子数据的数据位宽以及第二卷积输入子数据的数据位宽均等于总线位宽;基于总线位宽,依次从第一存储器中读取第一卷积权重子数据以及第二卷积权重子数据,第一卷积权重子数据与第二卷积权重子数据为连续存储的数据,且第一卷积权重子数据的数据位宽以及第二卷积权重子数据的数据位宽均等于总线位宽。In some embodiments, during data transfer, through a data transfer engine, based on the bus bit width, the first convolution input sub-data and the second convolution input sub-data are read from the first memory in sequence, the first convolution input sub-data and the second convolution input sub-data are continuously stored data, and the data bit width of the first convolution input sub-data and the data bit width of the second convolution input sub-data are both equal to the bus bit width; based on the bus bit width, the first convolution weight sub-data and the second convolution weight sub-data are read from the first memory in sequence, the first convolution weight sub-data and the second convolution weight sub-data are continuously stored data, and the data bit width of the first convolution weight sub-data and the data bit width of the second convolution weight sub-data are both equal to the bus bit width.
在一些实施例中,在第一运算数据为左矩阵输入数据,第二运算数据为右矩阵输入数据,第一运算数据和第二运算数据的运算结果为矩阵输出数据的情况下,左矩阵输入数据采用第八目标数据格式,第八目标数据格式所表征的子矩阵列数等于脉动阵列的阵列行数;右矩阵输入数据采用第九目标数据格式,第九目标数据格式所表征的子矩阵列数等于脉动阵列的阵列列数;矩阵输出数据采用第十目标数据格式,第十目标数据格式所表征的矩阵行数等于第八目标数据格式所表征的矩阵行数,第十目标数据格式所表征的矩阵列数等于第九目标数据格式所表征的矩阵列数。In some embodiments, when the first operation data is left matrix input data, the second operation data is right matrix input data, and the operation results of the first operation data and the second operation data are matrix output data, the left matrix input data adopts an eighth target data format, and the number of submatrix columns represented by the eighth target data format is equal to the number of array rows of the systolic array; the right matrix input data adopts a ninth target data format, and the number of submatrix columns represented by the ninth target data format is equal to the number of array columns of the systolic array; the matrix output data adopts a tenth target data format, and the number of matrix rows represented by the tenth target data format is equal to the number of matrix rows represented by the eighth target data format, and the number of matrix columns represented by the tenth target data format is equal to the number of matrix columns represented by the ninth target data format.
在一些实施例中,数据搬运过程中,通过数据搬运引擎,基于总线位宽,依次从第一存储器中读取第一左矩阵输入子数据以及第二左矩阵输入子数据,第一左矩阵输入子数据与第二左矩阵输入子数据为连续存储的数据,且第一左矩阵输入子数据的数据位宽以及第二左矩阵输入子数据的数据位宽均等于总线位宽;基于总线位宽,依次从第一存储器中读取第一右矩阵输入子数据以及第二右矩阵输入子数据,第一右矩阵输入子数据与第二右矩阵输入子数据为连续存储的数据,且第一右矩阵输入子数据的数据位宽以及第二右矩阵输入子数据的数据位宽均等于总线位宽。In some embodiments, during data transfer, through a data transfer engine, based on the bus bit width, the first left matrix input sub-data and the second left matrix input sub-data are read from the first memory in sequence, the first left matrix input sub-data and the second left matrix input sub-data are continuously stored data, and the data bit width of the first left matrix input sub-data and the data bit width of the second left matrix input sub-data are both equal to the bus bit width; based on the bus bit width, the first right matrix input sub-data and the second right matrix input sub-data are read from the first memory in sequence, the first right matrix input sub-data and the second right matrix input sub-data are continuously stored data, and the data bit width of the first right matrix input sub-data and the data bit width of the second right matrix input sub-data are both equal to the bus bit width.
在一些实施例中,AI处理器还包括数据格式转换器; In some embodiments, the AI processor further includes a data format converter;
该方法还包括:在第一运算数据和第二运算数据的运算结果为卷积输出数据,且卷积输出数据用于后续矩阵计算的情况下,通过数据格式转换器,基于矩阵输入数据对应的目标数据格式,转换卷积输出数据的数据格式;在第一运算数据和第二运算数据的运算结果为矩阵输出数据,且矩阵输出数据用于后续卷积计算的情况下,通过数据格式转换器,基于卷积输入数据对应的目标数据格式,转换矩阵输出数据的数据格式。The method also includes: when the operation results of the first operation data and the second operation data are convolution output data, and the convolution output data is used for subsequent matrix calculation, converting the data format of the convolution output data based on the target data format corresponding to the matrix input data through a data format converter; when the operation results of the first operation data and the second operation data are matrix output data, and the matrix output data is used for subsequent convolution calculation, converting the data format of the matrix output data based on the target data format corresponding to the convolution input data through a data format converter.
In some embodiments, the AI processor further includes a data format converter;
The method further includes: reading first operation data and second operation data from the first memory through the data format converter, the first operation data and the second operation data being in an original data format, where the data dimension represented by the original data format does not match the array dimension of the systolic array in the matrix operation unit of the matrix operation engine;
and converting the data formats of the first operation data and the second operation data through the data format converter, so that their data formats are converted from the original data format to the target data format, and writing the first operation data and the second operation data back into the first memory.
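One common way to make a matrix's data dimension match a systolic array's dimension is to zero-pad it to a multiple of the array size. The sketch below assumes this padding scheme for illustration; the embodiment's actual target data format is not specified at this level of detail:

```python
def to_target_format(mat, tile):
    """Zero-pad a matrix (list of equal-length rows) so that both of its
    dimensions become multiples of the systolic array dimension `tile`
    (illustrative sketch of 'original format -> target format')."""
    rows, cols = len(mat), len(mat[0])
    pad_rows = (-rows) % tile   # rows to add to reach the next multiple
    pad_cols = (-cols) % tile   # columns to add to reach the next multiple
    padded = [row + [0] * pad_cols for row in mat]
    padded += [[0] * (cols + pad_cols) for _ in range(pad_rows)]
    return padded
```

After this step, the matrix can be cut into tile × tile blocks that each map directly onto the systolic array, which is why the conversion is done before the data is written back to the first memory.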
In summary, in the embodiments of the present application, the first operation data and the second operation data are stored in the first memory in the target data format, so that their data dimensions match the array dimension of the systolic array in the matrix operation unit of the matrix operation engine. The data transfer engine can therefore read first operation sub-data and second operation sub-data from the first memory according to the bus bit width, and transfer them over the bus to the second memory inside the matrix operation engine. The matrix operation engine then performs matrix operations on the first operation sub-data and the second operation sub-data through the matrix operation unit to obtain sub-operation results, and accumulates the sub-operation results through the accumulator to obtain the matrix operation result of the first operation data and the second operation data. With the AI processor provided in the embodiments of the present application, storing the operation data in the target data format in the first memory enables the operation data to be read according to the bus bit width.
Since the bus bit width is greater than the data bit width corresponding to the array dimension, compared with reading data according to the data bit width corresponding to the array dimension, the AI processor provided in the embodiments of the present application improves the bandwidth utilization of data transfer during matrix operations, thereby improving the efficiency of matrix operations.
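The overall flow summarized above — tile-wise matrix operations producing sub-operation results that an accumulator combines into the final result — can be sketched in plain Python. This is a functional model under assumed names and a simple K-dimension tiling, not a description of the hardware:

```python
def tiled_matmul(A, B, tile):
    """Multiply A (n x k) by B (k x m) tile-by-tile along the K dimension:
    each tile yields a sub-operation result, and an accumulator (the `+=`
    into C) combines the partial results (illustrative sketch)."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for kk in range(0, k, tile):            # step through K-dimension tiles
        for i in range(n):
            for j in range(m):
                acc = 0                      # sub-operation result for this tile
                for t in range(kk, min(kk + tile, k)):
                    acc += A[i][t] * B[t][j]
                C[i][j] += acc               # accumulator combines partials
    return C
```

In the hardware described here, each inner tile would be computed by the systolic array while the data transfer engine streams the next bus-width-aligned sub-data blocks in, so transfer and computation overlap.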
Please refer to FIG. 11, which shows a block diagram of a computer device 1100 provided by an exemplary embodiment of the present application. The computer device 1100 may be a portable mobile terminal, such as a smartphone, a tablet computer, a Moving Picture Experts Group Audio Layer III (MP3) player, or a Moving Picture Experts Group Audio Layer IV (MP4) player. The computer device 1100 may also be referred to as a user device, a portable terminal, a workstation, a server, or by other names.
Typically, the computer device 1100 includes an AI processor 1101 and a memory 1102.
The AI processor 1101 may include one or more processing cores, for example a 4-core or 8-core processor. The AI processor 1101 may be implemented in at least one of the following hardware forms: digital signal processing (DSP), field-programmable gate array (FPGA), or programmable logic array (PLA). The AI processor 1101 may also include a main processor and a coprocessor. The main processor is a processor for processing data in the awake state, also known as a central processing unit (CPU); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the AI processor 1101 may be integrated with a graphics processing unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the AI processor 1101 may also be used to process computing operations related to machine learning.
The memory 1102 may include one or more computer-readable storage media, which may be tangible and non-transitory. The memory 1102 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory storage devices.
In some embodiments, the computer device 1100 may optionally further include a peripheral device interface 1103 and at least one peripheral device.
Those skilled in the art will appreciate that the structure shown in FIG. 11 does not limit the computer device 1100, which may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above descriptions are merely optional embodiments of the present application and are not intended to limit it. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.
Claims (19)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311579613.0A (CN120031084A) | 2023-11-22 | 2023-11-22 | AI processor, data processing method and computer device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025107800A1 | 2025-05-30 |
Family
ID=95729476
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2024/115871 (WO2025107800A1, pending) | AI processor, data processing method, and computer device | 2023-11-22 | 2024-08-30 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN120031084A (en) |
| WO (1) | WO2025107800A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170161611A1 (en) * | 2015-12-03 | 2017-06-08 | International Business Machines Corporation | Variable-size problem solving with systolic arrays |
| CN112840356A (en) * | 2018-10-09 | 2021-05-25 | 华为技术有限公司 | Computing accelerator, processing method and related equipment |
| US20220283984A1 (en) * | 2021-03-04 | 2022-09-08 | Samsung Electronics Co., Ltd. | Neural processor |
| CN116911367A (en) * | 2023-07-27 | 2023-10-20 | 安谋科技(中国)有限公司 | Data processing methods, devices, electronic equipment and computer-readable storage media |
| CN116980277A (en) * | 2023-09-18 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Data processing method, device, computer equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN120031084A (en) | 2025-05-23 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24892959; Country of ref document: EP; Kind code of ref document: A1 |