CN111897579A - Image data processing method, image data processing device, computer equipment and storage medium - Google Patents
- Publication number
- CN111897579A (application number CN202010829873.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- image
- storage
- channel data
- processed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/60—Memory management
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Image Processing (AREA)
Abstract
The application relates to an image data processing method, comprising: acquiring image data of N images to be processed, wherein the image data of each image to be processed comprises C channels of image channel data; storing the image data into C storage areas of a local memory, wherein the N storage units in each storage area each store one channel's image channel data for the N images to be processed; when a data reading instruction occurs, acquiring historical addresses according to a current address carried by the data reading instruction, and reading target image channel data from the local memory based on the current and historical addresses; arranging the read target image channel data in the form of a two-dimensional matrix, wherein data corresponding to the same channel lie in the same matrix row, and data corresponding to the same image to be processed in adjacent matrix rows are distributed in two adjacent matrix columns; and sequentially transmitting each row of data arranged in the two-dimensional matrix to a systolic array in time order for operation to obtain an operation result. By adopting the method, data processing efficiency can be improved.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image data processing method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology, machine learning has emerged, and machine learning models can improve their processing accuracy through large amounts of calculation. In practice, the convolution calculations involved in a machine learning model can be processed in parallel by a multithreaded SIMD module based on a Graphics Processing Unit (GPU) to improve calculation efficiency. SIMD (Single Instruction, Multiple Data) is a technique that uses one controller to control multiple processing elements, simultaneously performing the same operation on each element of a set of data (also called a "data vector") to achieve spatial parallelism.
In conventional approaches, a GPU-based multithreaded SIMD processor may include multiple Streaming Multiprocessors (SMs). Each SM has 16 load/store units, allowing 16 threads to compute source and destination addresses per clock cycle, and each SM can handle 48 warps for a total of 1536 threads, thus enabling parallel processing of bulk data. However, for the matrix multiplication and multidimensional convolution involved in deep learning calculations, SIMD modules often find that data access is much slower than data processing. Therefore, when performing calculation-intensive tasks such as matrix multiplication or multidimensional convolution, data processing efficiency is low.
Disclosure of Invention
In view of the above, it is desirable to provide an image data processing method, an apparatus, a computer device, and a storage medium that can improve processing efficiency when batch processing is performed on a large amount of data.
A method of image data processing, the method comprising:
acquiring image data of N images to be processed, wherein the image data of each image to be processed comprises image channel data of C channels; wherein N is a positive integer greater than or equal to 1, and C is a positive integer greater than 1;
storing the image data into C storage areas of a local memory according to channels, wherein N storage units in each storage area respectively store image channel data of one channel of N images to be processed;
when a data reading instruction occurs, determining a current address carried in the data reading instruction, and acquiring a preset number of historical addresses according to the current address; the preset number is determined according to the quotient of C divided by N;
reading target image channel data corresponding to different images to be processed and corresponding to different channels from the local memory in a preset offset mode based on the current address and the historical address;
arranging the read target image channel data in a two-dimensional matrix form, wherein the target image channel data corresponding to the same channel are in the same matrix row, and the target image channel data corresponding to the same image to be processed in the adjacent matrix rows are distributed in the two adjacent matrix columns;
sequentially transmitting each row of data in the target image channel data arranged in the form of a two-dimensional matrix to a systolic array in time order to perform an operation and obtain an operation result; the width of the systolic array corresponds to the number of channels.
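As a minimal sketch of the channel-major storage step in the method above (all sizes are hypothetical, and NumPy arrays stand in for local-memory storage units), the layout of C storage areas with N storage units each might look like:

```python
import numpy as np

# Hypothetical sizes: N images, C channels, H x W elements per channel.
N, C, H, W = 4, 8, 2, 2
images = np.arange(N * C * H * W).reshape(N, C, H, W)

# C storage areas; area c has N storage units, and unit n of area c
# holds channel c of image n (the channel-major layout described above).
local_memory = [[images[n, c] for n in range(N)] for c in range(C)]

assert len(local_memory) == C                       # C storage areas
assert all(len(area) == N for area in local_memory)  # N units per area
assert np.array_equal(local_memory[3][1], images[1, 3])
```

This mirrors the claim's arrangement: reading one unit from each of the C areas yields C channel blocks that can feed the C-wide systolic array.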
An image data processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring image data of N images to be processed, wherein the image data of each image to be processed comprises image channel data of C channels; wherein N is a positive integer greater than or equal to 1, and C is a positive integer greater than 1;
the storage module is used for storing the image data into C storage areas of a local memory according to channels, and N storage units in each storage area respectively store image channel data of one channel of N images to be processed;
the device comprises a determining module, a judging module and a judging module, wherein the determining module is used for determining a current address carried in a data reading instruction when the data reading instruction occurs, and acquiring a preset number of historical addresses according to the current address; the preset number is determined according to the quotient of C divided by N;
the data reading module is used for reading target image channel data which correspond to different images to be processed and correspond to different channels from the local memory in a preset offset mode based on the current address and the historical address;
the arrangement module is used for arranging the read target image channel data in a two-dimensional matrix form, wherein the target image channel data corresponding to the same channel are in the same matrix row, and the target image channel data corresponding to the same image to be processed in the adjacent matrix rows are distributed in the two adjacent matrix columns;
the operation module is used for sequentially transmitting each row of data in the target image channel data arranged in the form of a two-dimensional matrix to the systolic array in time order to perform an operation and obtain an operation result; the width of the systolic array corresponds to the number of channels.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring image data of N images to be processed, wherein the image data of each image to be processed comprises image channel data of C channels; wherein N is a positive integer greater than or equal to 1, and C is a positive integer greater than 1;
storing the image data into C storage areas of a local memory according to channels, wherein N storage units in each storage area respectively store image channel data of one channel of N images to be processed;
when a data reading instruction occurs, determining a current address carried in the data reading instruction, and acquiring a preset number of historical addresses according to the current address; the preset number is determined according to the quotient of C divided by N;
reading target image channel data corresponding to different images to be processed and corresponding to different channels from the local memory in a preset offset mode based on the current address and the historical address;
arranging the read target image channel data in a two-dimensional matrix form, wherein the target image channel data corresponding to the same channel are in the same matrix row, and the target image channel data corresponding to the same image to be processed in the adjacent matrix rows are distributed in the two adjacent matrix columns;
sequentially transmitting each row of data in the target image channel data arranged in the form of a two-dimensional matrix to a systolic array in time order to perform an operation and obtain an operation result; the width of the systolic array corresponds to the number of channels.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring image data of N images to be processed, wherein the image data of each image to be processed comprises image channel data of C channels; wherein N is a positive integer greater than or equal to 1, and C is a positive integer greater than 1;
storing the image data into C storage areas of a local memory according to channels, wherein N storage units in each storage area respectively store image channel data of one channel of N images to be processed;
when a data reading instruction occurs, determining a current address carried in the data reading instruction, and acquiring a preset number of historical addresses according to the current address; the preset number is determined according to the quotient of C divided by N;
reading target image channel data corresponding to different images to be processed and corresponding to different channels from the local memory in a preset offset mode based on the current address and the historical address;
arranging the read target image channel data in a two-dimensional matrix form, wherein the target image channel data corresponding to the same channel are in the same matrix row, and the target image channel data corresponding to the same image to be processed in the adjacent matrix rows are distributed in the two adjacent matrix columns;
sequentially transmitting each row of data in the target image channel data arranged in the form of a two-dimensional matrix to a systolic array in time order to perform an operation and obtain an operation result; the width of the systolic array corresponds to the number of channels.
The image data processing method, apparatus, computer device and storage medium acquire the image data of N images to be processed, wherein the image data of each image to be processed comprises image channel data of C channels. The batch of image channel data is stored by channel into C storage areas of a local memory, with the N storage units in each storage area respectively storing one channel's image channel data for the N images to be processed. When a data reading instruction occurs, a preset number of historical addresses are acquired according to the current address carried in the data reading instruction, and target image channel data corresponding to different images to be processed and different channels are read from the local memory in a preset offset manner based on the current address and the historical addresses. The read target image channel data are then arranged in the form of a two-dimensional matrix, wherein target image channel data corresponding to the same channel lie in the same matrix row, and target image channel data corresponding to the same image to be processed in adjacent matrix rows are distributed in two adjacent matrix columns. When a batch operation is needed, the rows of the target image channel data arranged in the form of a two-dimensional matrix can be directly transmitted, in time order, to the systolic array for operation to obtain the operation result.
In this way, reasonable data addressing is achieved through the current address and the historical addresses, and the data operation is then performed by the systolic array. This enables single-instruction multiple-data parallel processing while avoiding stalls caused by the mismatch between data access and processing speeds. Reasonable memory interaction and efficient, flexible cooperative control are thus achieved, greatly improving data processing efficiency and performance, and the pipelined parallel processing of data can also greatly reduce energy consumption.
Drawings
FIG. 1 is a diagram of an application environment of an image data processing method in one embodiment;
FIG. 2 is a diagram illustrating an overall architecture of a system for performing the image data processing method according to the present application;
FIG. 3 is a block diagram of a SIMD unit in one embodiment;
FIG. 4(A) is a schematic diagram of the structure of a systolic array in one embodiment;
FIG. 4(B) is a schematic diagram of an embodiment of a processing unit in a systolic array;
FIG. 4(C) is a schematic diagram of matrix multiplication in one embodiment;
FIG. 4(D) is a schematic diagram of the computation of a systolic array in one embodiment;
FIG. 5 is a flowchart illustrating a method for processing image data according to one embodiment;
FIG. 6 is a schematic architecture diagram of image data processing in one embodiment;
FIG. 7(A) is a schematic diagram of the convolution calculation in one embodiment;
FIG. 7(B) is a diagram illustrating a splitting of a convolution kernel in one embodiment;
FIG. 8 is a schematic diagram illustrating the conversion of convolution calculations into matrix multiplication calculations in one embodiment;
FIG. 9 is a flow diagram of a method for image data processing in an exemplary embodiment;
FIG. 10 is a block diagram showing the configuration of an image data processing apparatus according to an embodiment;
FIG. 11 is a block diagram showing a configuration of an image data processing apparatus according to another embodiment;
FIG. 12 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The image data processing method provided by the application can be applied to the application environment shown in fig. 1, in which the user terminal 110 communicates with the computer device 120 over a network. The user terminal 110 collects more than one image to be processed and transmits every N images to be processed as one processing batch to the computer device 120. The computer device 120 acquires image data of the N images to be processed, the image data of each image to be processed including image channel data of C channels, where N is a positive integer greater than or equal to 1 and C is a positive integer greater than 1. When a data reading instruction occurs, the computer device 120 determines the current address carried in the data reading instruction and obtains a preset number of historical addresses according to the current address, the preset number being determined by the quotient of C divided by N. The computer device 120 reads, from the local memory, target image channel data corresponding to different images to be processed and different channels in a preset offset manner based on the current address and the historical addresses. The computer device 120 arranges the read target image channel data in the form of a two-dimensional matrix, where target image channel data corresponding to the same channel lie in the same matrix row, and target image channel data corresponding to the same image to be processed in adjacent matrix rows are distributed in two adjacent matrix columns. The computer device 120 sequentially transmits each row of the target image channel data arranged in the form of a two-dimensional matrix to the systolic array in time order to perform an operation and obtain an operation result, the width of the systolic array corresponding to the number of channels.
The user terminal 110 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The computer device 120 may specifically be a terminal or a server, wherein the server may be implemented by an independent server or a server cluster composed of a plurality of servers.
The present application relates to the field of Artificial Intelligence (AI). Artificial intelligence is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines can perceive, reason and make decisions.
Artificial intelligence is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The image data processing method provided by the embodiments of the present application specifically relates to machine learning in the field of artificial intelligence. Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
In this scenario, the image data processing method provided by the embodiments of the present application can greatly improve the processing efficiency of the machine learning model for performing convolution operation on the computer device, and further improve the processing efficiency of the machine learning model.
Before the technical solution of the present application is explained in detail, the system architecture of the present application is explained. The system architecture for implementing the present application generally includes a SIMD module and a systolic array module. Referring to fig. 2, fig. 2 is a schematic diagram illustrating the overall architecture of a system for performing the image data processing method according to an embodiment of the present disclosure. As shown in fig. 2, a SIMD module and an SA (Systolic Array) module are deployed in the internal system architecture of the computer apparatus. The SIMD module comprises a global control module (Global SIMD Control Unit) and at least C SIMD lanes (SIMD units). Each SIMD unit includes a memory area (Memory), registers (REG), an arithmetic unit (ALU), and a control subunit (Control). The SIMD module may acquire the image data of the images to be processed by interacting with off-chip memory (HBM, High Bandwidth Memory). A systolic array (SA) is a homogeneous network composed of many identical, computationally capable processing units or nodes. Each processing unit (PE) can independently perform calculations and pass its calculation results to the surrounding processing units. This structure enables the systolic array to achieve high computational throughput while consuming little memory bandwidth. Neural network calculations involve a large number of convolutions and matrix multiplications, so using a systolic array to assist the operations greatly improves data processing efficiency. The systolic array in the embodiments of the present application may specifically include C × C processing units (PEs).
With continued reference to FIG. 3, FIG. 3 is a block diagram of a SIMD unit in one embodiment. As shown in fig. 3, the SIMD unit is largely divided into the following sections: on-chip data storage (VMEM), a general purpose SIMD computation unit, and a SIMD control unit. The SIMD unit is a two-level (2-level) SIMD architecture. As shown in fig. 3, the on-chip data storage VMEM may specifically include SRAMs in C (e.g., 128) SIMD lanes, and a total of C × N storage units (banks), e.g., a total of 128 × 8 banks, implement storage of data on a chip. And interaction with off-chip data and registers is achieved through crossbar. When on-chip data in VMEM requires general purpose SIMD computation, data may be loaded into a register (such as 32bit Reg shown in fig. 3, i.e. a 32-bit register) by X-bar, and then the SIMD general purpose computation is completed in the ALU of the SIMD unit.
If the computation is performed entirely within a SIMD unit, data access is often significantly slower than data processing, so SIMD unit performance is limited when performing computationally intensive tasks such as matrix multiplication or multidimensional convolution. The present application improves on this: the data loaded into the registers is transmitted to the systolic array, the matrix multiplication is performed in the systolic array, and the calculation result is written back from the systolic array to the SIMD unit, completing the execution flow of the systolic array calculation. The side length of the systolic array is the same as the number of SIMD lanes, namely C (for example, 128). Each SIMD lane is responsible for reading in one row of data and writing out one column of data of the systolic array. Transmitting data from the SIMD unit into the systolic array and back again can be realized by a serializer/deserializer (SerDes), a transceiver circuit that converts between serial and parallel data.
It can be understood that each time the SIMD module reads or writes data, the systolic array processes the data interaction over multiple clock cycles, where "multiple" corresponds to the number N of images to be processed. For example, if each data read by the SIMD module is a two-dimensional matrix in C × N format, then one read corresponds to N data interactions with the systolic array. The specific implementation of the systolic array and the specific data interaction between the SIMD module and the systolic array are described in more detail in the following embodiments.
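As a hypothetical illustration of the C × N read granularity just described (sizes and tuple labels are invented for the sketch), one SIMD read can be viewed as a C × N tile that the systolic array consumes one C-wide column per interaction, N interactions per read:

```python
# Toy sizes; a real design might use C = 128 lanes and N = 8 images.
C, N = 4, 2

# One read: a C x N tile, where row c holds channel c's data for all N
# images; each entry is labelled (image index, channel index).
tile = [[(n, c) for n in range(N)] for c in range(C)]

# The systolic array consumes one C-wide column per clock interaction,
# so a single C x N read drives N successive array interactions.
columns = list(zip(*tile))
assert len(columns) == N and len(columns[0]) == C
assert columns[1][2] == (1, 2)   # interaction 1 carries image 1, channel 2
```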
Referring to fig. 4(A), fig. 4(A) is a schematic structural diagram of a systolic array in one embodiment. As shown in fig. 4(A), the systolic array is composed of many independent Processing Elements (PEs). A PE is schematically shown in fig. 4(B); each PE has its own ALU and buffers. Each PE can perform independent calculations and communicate with adjacent PEs, achieving efficient and energy-saving computation through continuous reuse of data. A systolic array admits many different control schemes; in the design of the present application, the overall calculation process adopts a weight-stationary design, in which the weights remain fixed in place.
With continued reference to fig. 4(B), in each clock cycle a PE obtains input data (Input Data) from its left side, multiplies it by the weight parameters (Weight Data) pre-loaded into its weight buffers (such as Weight Buffer 1 and Weight Buffer 2), and adds the partial sum (also referred to as an intermediate result) obtained from an adjacent PE. The resulting intermediate result of the current PE is stored in its output buffer (Output Buffer) and transmitted to other adjacent PEs, thereby implementing the accumulation operation. By uniformly controlling all PEs in the systolic array, data flows in from the left side of the array, and the whole matrix multiplication is completed in a systolic manner through the calculations inside the PEs and the communication between them.
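The per-cycle behaviour of one weight-stationary PE described above can be sketched as follows (a deliberate simplification of FIG. 4(B): the two weight buffers and the output buffer are collapsed into a single fixed weight and a returned partial sum):

```python
class PE:
    """One weight-stationary processing element: multiplies the input
    arriving from the left by its fixed, pre-loaded weight and adds the
    partial sum received from an adjacent PE."""

    def __init__(self, weight):
        self.weight = weight          # pre-loaded; fixed during computation

    def step(self, x_in, psum_in):
        # New partial sum, passed on to the next adjacent PE.
        return psum_in + x_in * self.weight

pe = PE(weight=3)
assert pe.step(2, 10) == 16           # 10 + 2*3
```

Chaining `step` calls across neighbouring PEs is what accumulates one dot product of the matrix multiplication.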
The per-clock-cycle calculation performed when matrix multiplication is implemented on the systolic array is described below, using the calculation flow of fig. 4(C) as an example. As shown in fig. 4(C), fig. 4(C) is a schematic diagram of matrix multiplication in an embodiment: the matrix X (2×3) is multiplied by the matrix W (3×2) to obtain the matrix Y (2×2). The first row of matrix X is (X1,1, X1,2, X1,3) and the second row is (X2,1, X2,2, X2,3). The first row of matrix W is (W1,1, W1,2), the second row is (W2,1, W2,2), and the third row is (W3,1, W3,2). When X is multiplied by W, the first row of the resulting matrix Y is (Y1,1, Y1,2) and the second row is (Y2,1, Y2,2).
With continuing reference to FIG. 4(D), FIG. 4(D) is a computational schematic of a systolic array in one embodiment. Fig. 4(D) shows a data space-time diagram of two matrices in the calculation process, where the matrix W needs to be prefetched and fixed in the weight buffer of each PE, and the W value is not transmitted in the calculation process. For matrix X, different values are sent into the PE array in different clock cycles, and the values of matrix X flow through the PE array from top to bottom during the calculation process. The following operations are performed for each clock cycle:
cycle 0: w is pre-fetched into the PE array. X1,1Afferent PE1,1,PE1,1Calculating X1,1*W1,1。
Cycle 1:X2,1,X1,2Respectively transmitted into PE1,1,PE1,2,X1,1From PE1,1Down into the PE2,1. Calculation result X of Cycle 01,1*W1,1From the right direction into the PE1,2。PE1,1Calculating X2,1*W1,1,PE1,2Calculating X1,2*W2,1+X1,1*W1,1,PE2,1Calculating X1,1*W1,2。
Cycle 2: similarly, matrix X is propagated down the PE array, and the computation is passed from the left PE to the right PE. Y is1,1The results of (2) flow out of the array from the right side of the PE array.
Cycle 3: after four cycles, the PE array outputs two results Y1,2,Y2,1。
Cycle 4: after five cycles, the result Y is output2,2And the matrix multiplication is completed.
It is to be understood that the matrix multiplication operations of the systolic arrays in fig. 4(C) and 4(D) are illustrated schematically and are not intended to limit the present application. In other embodiments, especially when performing matrix multiplication or multidimensional convolution on large amounts of data, the number of matrix elements participating in the calculation will be much greater than in the above example, the number of processing units participating in the operation in the systolic array will also be much greater, and the number of clock cycles required to output the operation result will be much greater than the 5 clock cycles above. Although the amount of data to be operated on by the systolic array increases, the principle of the matrix multiplication remains the same as in the schematic of fig. 4(D).
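The cycle-by-cycle behaviour walked through above can be checked with a small cycle-level simulation of a weight-stationary systolic array. This is an illustrative sketch, not the patent's implementation: its orientation is transposed relative to the figures (here activations flow left-to-right and partial sums top-to-bottom), but the weight-stationary principle and the skewed injection of X are the same.

```python
import numpy as np

def systolic_matmul(X, W):
    """Cycle-level simulation of a K x P weight-stationary systolic array
    computing Y = X @ W, with X of shape (M, K) and W of shape (K, P).
    PE (k, p) holds W[k, p] fixed; X[m, k] enters row k at cycle m + k."""
    M, K = X.shape
    _, P = W.shape
    x_reg = np.zeros((K, P))   # activation registers, moving left -> right
    s_reg = np.zeros((K, P))   # partial-sum registers, moving top -> bottom
    Y = np.zeros((M, P))
    for t in range(M + K + P):
        # Skewed injection at the left edge of each row.
        for k in range(K):
            m = t - k
            x_reg[k][0] = X[m, k] if 0 <= m < M else 0.0
        new_x = np.zeros((K, P))
        new_s = np.zeros((K, P))
        for k in range(K):
            for p in range(P):
                psum = s_reg[k][p] + x_reg[k][p] * W[k, p]
                if k + 1 < K:
                    new_s[k + 1][p] = psum        # partial sum flows down
                else:
                    m = t - p - (K - 1)           # result leaves the array
                    if 0 <= m < M:
                        Y[m, p] = psum
                if p + 1 < P:
                    new_x[k][p + 1] = x_reg[k][p] # activation flows right
        x_reg, s_reg = new_x, new_s
    return Y

X = np.array([[1., 2., 3.], [4., 5., 6.]])    # 2x3, as in FIG. 4(C)
W = np.array([[1., 2.], [3., 4.], [5., 6.]])  # 3x2
assert np.array_equal(systolic_matmul(X, W), X @ W)
```

Each loop iteration is one clock cycle: every PE multiplies its registered input by its fixed weight, adds the incoming partial sum, and forwards both values to its neighbours, exactly the pumping behaviour the cycle walkthrough describes.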
The present solution is described in detail below with reference to various embodiments of the present application:
in one embodiment, as shown in fig. 5, an image data processing method is provided, which is described by way of example as being applied to the computer device 120 in fig. 1, and includes the following steps:
step S502, acquiring image data of N images to be processed, wherein the image data of each image to be processed comprises image channel data of C channels; wherein N is a positive integer greater than or equal to 1, and C is a positive integer greater than 1.
The image to be processed is an image corresponding to the current target task, and may specifically be a natural image, a surveillance image, a face image, or another type of image, which is not limited in this application. It can be understood that in many business scenarios requiring processing by a machine learning model, a batch of images to be processed is usually processed in parallel at the same time to obtain the corresponding processing results. In the embodiments of the present application, the N of the N images to be processed may be regarded as the size of the currently processed batch, also referred to as the mini-batch size.
The image data is data corresponding to the image to be processed, and the image data may specifically be a pixel value of the image to be processed, or may be processed data of the image to be processed after some preset processing. For example, the image data of the image to be processed may specifically be a pixel value of the image to be processed, or image feature data output after the image to be processed passes through some convolution layers of the machine learning model, and the like, which is not limited in this embodiment of the present application.
It can be understood that, in the process of processing image data by the machine learning model, in order to ensure processing accuracy, processing is generally performed by taking a channel as a dimension, that is, image data corresponding to one image to be processed may specifically include image channel data of more than one channel.
Specifically, the computer device may obtain N images to be processed of the current batch from a local or other computer device, and determine image channel data of C channels of each image to be processed. In one embodiment, the computer device may first acquire a series of original images, and pre-process the series of original images to obtain N to-be-processed images of the current batch. For example, the computer device may perform image scaling or clipping on the original image to obtain a to-be-processed image with a preset format size.
It can be understood that, based on the configuration of the computer device's hardware, the computer device may process image channel data of C channels in parallel at a time. In actual processing, the number of channels corresponding to one image to be processed may be less than C, exactly C, or more than C; in the first and last cases the data needs to be padded or split. For example, suppose the number of channels corresponding to each image to be processed is Q. When Q is smaller than C, the computer device may fill in image channel data for the remaining (C-Q) channels by data padding to obtain the image data corresponding to the image to be processed. The padding may use a preset value, for example the value 0. When Q equals C, the computer device can directly acquire the image channel data of the C channels corresponding to each image to be processed. When Q is greater than C, the computer device may split the image channel data of the Q channels into groups of C channels each, padding any final group of fewer than C channels with preset values. The image data processing provided by the embodiments of the present application is then performed in the same manner for each group, and the operation results of the groups are combined to obtain the final operation result.
The same applies in the N dimension: for the plurality of images to be processed corresponding to the target task, the computer device may acquire N images at a time as the current batch for parallel processing. When fewer than N images remain, the computer device may either process the remaining images directly or pad the batch with preset data, for example all-black or all-white images.
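The channel-splitting and batch-padding described above can be sketched as follows. This is a minimal illustration assuming per-image channel data shaped (Q, H, W) and zero padding; the helper names are illustrative and not from the patent.

```python
import numpy as np

def split_channels(img, C):
    """Split (Q, H, W) channel data into groups of exactly C channels,
    zero-padding the final group when Q is not a multiple of C."""
    q, h, w = img.shape
    groups = []
    for start in range(0, q, C):
        g = img[start:start + C]
        if g.shape[0] < C:                      # pad a short group with zeros
            pad = np.zeros((C - g.shape[0], h, w), dtype=img.dtype)
            g = np.concatenate([g, pad], axis=0)
        groups.append(g)
    return groups

def pad_batch(images, N, fill=0.0):
    """Pad the image list up to a multiple of N with constant images
    (e.g. fill=0.0 for all-black padding images)."""
    c, h, w = images[0].shape
    short = (-len(images)) % N                  # images missing from last batch
    return images + [np.full((c, h, w), fill, dtype=images[0].dtype)] * short
```

Each group produced by `split_channels` is then processed independently and the per-group results are combined, matching the grouping scheme in the text.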
Step S504, storing the image data into C storage areas of the local memory according to channels, where N storage units in each storage area respectively store image channel data of one channel of the N images to be processed.
Specifically, the computer device may store the image data of the N to-be-processed images into C storage areas of the local memory according to the channels, where each storage area corresponds to image channel data of one of the channels. Each storage area comprises at least N storage units, and each storage unit stores image channel data of one channel of one to-be-processed image.
In one embodiment, the computer device may divide the local storage space into an arrangement of C × N in advance, where C corresponds to the number of channels and N corresponds to the number of images to be processed. That is, at least C storage areas may be deployed within a computer device, each storage area including at least N storage units. Wherein the number of the storage areas is matched with the number of the channels, and the number of the storage units in one storage area is matched with the number of the images to be processed in one batch.
In one embodiment, a computer device may store the image data into a SIMD module. The SIMD module comprises C SIMD lanes, and the SRAM in each SIMD lane serves as a storage area. Each SRAM comprises N banks, and each bank serves as a storage unit. That is, the computer device can store the image data in different banks, so that read-write contention between different banks can be avoided when reading the data.
In one embodiment, each image channel data in each image to be processed includes I unit channel data, the I unit channel data in each image channel data is stored in I storage bits of one storage unit, and different storage bits correspond to different storage addresses; the current address is a storage address corresponding to one of the storage bits; i is a positive integer of 1 or more.
It is understood that each image channel data corresponds to one image, and in the actual processing, one image is usually divided into a plurality of units to be processed in parallel. That is, one image channel data may specifically include I unit channel data. Wherein, I may be a preset numerical value, or may be the number of elements in the output matrix. And the I unit channel data together constitute one image channel data. Thus, the output matrix with the preset size can be obtained after the matrix multiplication operation is carried out on the image data of the image to be processed.
Referring to FIG. 6, FIG. 6 is a schematic block diagram of image data processing in one embodiment. Taking C = 128 and N = 8 as an example, as shown at 601 in fig. 6, each VMEM lane stores the image channel data of one channel of the 8 images to be processed. One bank in one lane stores the image channel data of one channel of one image to be processed; the bank comprises I storage bits, each storing one unit channel datum. For example, the I storage bits (0, 0, 0), (0, 0, 1), …, (0, 0, I-1) of VMEM lane 0 in FIG. 6 constitute one storage unit, which stores the image channel data of a certain channel of a certain image to be processed.
It will be appreciated that each memory bit corresponds to a memory address, and the computer device may retrieve the unit channel data stored in the memory bit based on the memory address. Therefore, the image data of each image to be processed is divided into image channel data corresponding to a plurality of channels, and each image channel data is divided into I unit channel data, so that convolution operation can be realized by taking the unit channel data as a reference.
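As a rough software model of this layout, the sketch below (names and shapes are illustrative assumptions) stores the channel data of N images into C lanes of N banks with I storage bits each, so that vmem[c][n] is the bank holding channel c of image n:

```python
import numpy as np

def store_to_lanes(images, C, N, I):
    """Sketch of the C-lane x N-bank x I-bit layout: vmem[c][n] holds the
    I unit-channel values of channel c of image n, so the C lanes can be
    read in parallel and reads within one lane hit distinct banks."""
    assert len(images) == N
    vmem = np.zeros((C, N, I))
    for n, img in enumerate(images):      # img: (C, I) flattened channel data
        for c in range(C):
            vmem[c, n] = img[c]
    return vmem
```

Because each (channel, image) pair lands in its own bank, reading different images' data for the same channel never touches the same bank twice, which is the bank-conflict-avoidance property described above.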
Step S506, when a data reading instruction occurs, determining a current address carried in the data reading instruction, and acquiring a preset number of historical addresses according to the current address; the predetermined number is determined by dividing C by N.
Specifically, the computer device may trigger the data reading instruction according to a certain period, and the data reading instruction triggered in each data reading cycle carries the address corresponding to the data currently to be accessed. The computer device can search backward from the current address to obtain the historical addresses carried by the preset number of previously triggered data reading instructions. The preset number may be the quotient of the channel number C divided by N, or that quotient rounded up, i.e., incremented by one when C is not evenly divisible by N.
In one embodiment, the computer device may obtain, according to a preset rule, the corresponding storage address each time data needs to be read, and when each data reading cycle arrives, generate the current data reading instruction based on the storage address (i.e., the current address) corresponding to the current data reading cycle. It will be appreciated that since the computer device follows a fixed rule each time data is read, for example reading from the i-th storage bit of the first storage unit of the first storage area in each reading cycle, the corresponding current address also follows a regular pattern as the data reading cycles advance: for example, it is the i-th storage bit of the first storage unit of the first storage area, where i increases with the data reading cycle.
In an embodiment, step S506, that is, when a data reading instruction occurs, the step of determining a current address carried in the data reading instruction and obtaining a preset number of historical addresses according to the current address specifically includes: when a data reading instruction occurs, determining a current address carried in the data reading instruction; and inputting the current address into a shift register in a serial-in parallel-out mode, and outputting a preset number of historical addresses which are input in a historical mode through the shift register.
A shift register is a device built from cascaded flip-flops driven by the same clock pulses. A shift register in serial-in parallel-out form outputs serially input data in parallel: after the serial data have been shifted in, they can be read out simultaneously at all bits of the output terminal. Specifically, the computer device may input the current address corresponding to the current data reading cycle into a serial-in parallel-out shift register, and output through the shift register the preset number of historical addresses that were input earlier. The current data reading cycle is the cycle corresponding to the data reading instruction initiated at the current time.
In one embodiment, to implement spatio-temporal pattern addressing of systolic arrays, a SIPO shift register may be pre-deployed in the global control block of the SIMD module. Therefore, each time addressing of the systolic array is carried out, the computer equipment can input the current address data and simultaneously read out the historical addresses of the preset number input before, thereby realizing the recording control of the address data.
In the above embodiment, all the storage addresses corresponding to the current data reading period can be obtained through the shift register in the serial-in parallel-out mode, and then the target image channel data can be found based on the obtained storage addresses.
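A minimal software model of such a SIPO register follows, assuming a depth of ceil(C/N) taps as described above; the class name is illustrative, not from the patent.

```python
from collections import deque

class SipoAddressRegister:
    """Sketch of the serial-in parallel-out shift register in the global
    control block: shifting in the current address exposes the previous
    ceil(C/N) addresses on the parallel outputs, newest first."""

    def __init__(self, C, N):
        self.depth = -(-C // N)                    # ceil(C / N) history taps
        self.taps = deque([None] * self.depth, maxlen=self.depth)

    def push(self, current_addr):
        history = list(self.taps)                  # parallel read of history
        self.taps.appendleft(current_addr)         # serial shift-in
        return history
```

With C = 128 and N = 8 the register keeps 16 taps, so each addressing step yields the current address plus the 16 historical addresses needed by the offset scheme.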
Step S508, based on the current address and the historical address, reading, from the local memory, target image channel data corresponding to different to-be-processed images and corresponding to different channels in a preset offset manner.
Specifically, the computer device may read, from the local memory, the target image channel data corresponding to different to-be-processed images and corresponding to different channels in a preset offset manner based on the current address and the historical address. The preset offset mode may be a mode in which when target image data of different channels are read, a memory cell offset is performed in a subsequent channel compared with a previous channel. That is, when reading the target image channel data in each storage area, the target image channel data is read after shifting one storage unit based on the data read in the previous storage area.
In one embodiment, the computer device may read, based on the current address and each historical address, target image channel data corresponding to the current data reading period in a preset offset manner, and then load the read target image channel data into a register corresponding to the current data reading period and arrange the target image channel data in a two-dimensional matrix form. The elements that are missing in the two-dimensional matrix may be filled with a predetermined value, for example, a value of 0.
In one embodiment, the current address is the storage address of the i-th storage bit of the first storage unit in the first storage area, where i is less than or equal to I. Reading target image channel data corresponding to different images to be processed and to different channels from the local memory in a preset offset manner based on the current address and the historical addresses includes: in the current data reading cycle, determining, based on the current address, the storage address of the i-th storage bit of each storage unit in the N consecutive storage units (including the first storage unit) of the first storage area, and acquiring the corresponding unit channel data based on the determined storage addresses; for each of the N-1 storage areas after the first storage area, acquiring, based on the current address, the corresponding unit channel data from the i-th storage bits of the corresponding storage area, offset by one storage unit relative to the previous storage area; and, for each historical address before the current address, determining from each storage area, in the preset offset manner, the unit channel data corresponding to the current data reading cycle based on the data read via the corresponding historical address in its historical data reading cycle. The unit channel data read in the current data reading cycle together form a two-dimensional matrix in C*N format.
It will be appreciated that the data a systolic array needs in each clock cycle during an operation does not correspond to contiguous storage in the memory region. As shown in fig. 4(D), the storage address of the target image channel data acquired by the systolic array in each clock cycle is related both to the current address corresponding to the current data reading cycle and to the historical addresses of several previous cycles; this is the spatio-temporal characteristic of systolic-array addressing. Therefore, when implementing a hybrid architecture of a SIMD module and a systolic array module, the data communication and control modules between them need to be designed so that data reading and processing can be pipelined, greatly improving data processing efficiency.
In one embodiment, the computer device may perform data addressing based on a memory address of an i-th storage bit of each memory unit in N consecutive memory units including the first memory unit in the currently corresponding first memory area in the current data reading cycle, and acquire corresponding unit channel data based on the determined memory address. And for each storage area in the N-1 storage areas behind the first storage area, acquiring corresponding unit channel data from the ith storage bit in the corresponding storage area in a manner of performing storage unit offset compared with the previous storage area based on the current address. And for each historical address before the current address, determining unit channel data corresponding to the current data reading period from each storage area according to a preset offset mode based on the data read by the corresponding historical address in the historical data reading period. The target image channel data read based on the current address and the historical address are used for forming a two-dimensional matrix in a C-N format.
For example, assume the current data reading cycle is the j-th data reading cycle and the corresponding current address is Zj. The historical addresses corresponding to the 16 data reading cycles before the j-th cycle are Zj-1, Zj-2, Zj-3, …, Zj-16 respectively. The computer device may use the 17 storage addresses Zj to Zj-16 to acquire the target image channel data corresponding to the j-th data reading cycle and form a C*N two-dimensional matrix.
It will be appreciated that when the next data reading cycle j+1 arrives, Zj becomes one of the historical addresses corresponding to the (j+1)-th data reading cycle. Based on the storage address Zj and the data read in the j-th data reading cycle, the computer device continues to read the unit channel data corresponding to the (j+1)-th data reading cycle according to the preset reading rule. Similarly, for the historical addresses Zj-1 to Zj-15, the computer device continues to read the unit channel data corresponding to the (j+1)-th data reading cycle according to the preset reading rule.
In the above embodiment, for each data reading period, the data required for the systolic array to perform the matrix multiplication operation can be acquired in a preset offset manner based on the current address and the historical address corresponding to the current data reading period, so that the single-instruction multiple-data stream and the systolic array cooperate with each other to complete the matrix multiplication operation, and further, the processing efficiency when a large amount of data is processed in batch is greatly improved.
In one embodiment, for each history address before the current address, determining unit channel data corresponding to the current data reading period from each storage area in a preset offset manner based on data read by the corresponding history address in the history data reading period respectively, includes: determining unit channel data which are respectively read through various historical addresses in the previous data reading period; according to the unit channel data read in the previous data reading period, unit channel data which respectively correspond to each storage area and correspond to the current data reading period are obtained according to a preset reading rule; wherein, presetting a reading rule comprises: reading the unit channel data of the same storage area according to the sequence that the storage bits are sequentially increased from low to high, and reading the unit channel data of different storage units in the same storage area corresponding to the same storage bits according to the sequence that the storage units are sequentially increased from low to high.
Specifically, the computer device may determine unit channel data read by each historical address in a previous data reading period, and further obtain, according to the unit channel data read in the previous data reading period, unit channel data corresponding to each storage area and corresponding to the current data reading period according to a preset reading rule. When the computer equipment reads data, the following preset reading rules are followed: reading the unit channel data of the same storage area according to the sequence that the storage bits are sequentially increased from low to high, and reading the unit channel data of different storage units in the same storage area corresponding to the same storage bits according to the sequence that the storage units are sequentially increased from low to high.
It will be appreciated that each history address is read as a current address during its corresponding history data read cycle. In a corresponding historical data reading period, the corresponding data reading modes are all read according to the ith storage bit of each storage unit in N continuous storage units including the first storage unit in the first storage area, and then, for each storage area in N-1 storage areas behind the first storage area, corresponding unit channel data are obtained from the corresponding ith storage bit in the corresponding storage area in a mode of carrying out storage unit offset compared with the previous storage area.
The data reading and arrangement is described in detail below with reference to fig. 6. Referring to fig. 6, in the matrix multiplication, the image data corresponding to the N images to be processed are stored in the VMEM of the C lanes. Because of the spatio-temporal characteristic of systolic-array addressing described above, each transfer of C*N data (e.g., 128*8 data) from VMEM to REG follows this addressing principle: as shown in fig. 6, taking C = 128 and N = 8 as an example, the data addressing of VMEM Lane 0 covers 8 data of a continuous space, that of VMEM Lane 1 covers 7 data of a continuous space, that of VMEM Lane 2 covers 6 data of a continuous space, and so on, where in each case one datum was already read in the previous addressing cycle due to the space-time characteristic of the systolic array. That is, the addressing of each SIMD lane is offset by one memory location from the addressing of the previous SIMD lane. Under this preset offset scheme, the 128 SIMD lanes address according to a rule that repeats every 8 consecutive lanes, so each access must record, in addition to the current address of the present access, the 128/8 = 16 previously accessed addresses in order to complete a full VMEM-to-REG addressing task consistent with the spatio-temporal characteristics of the systolic array.
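One way to model this skewed VMEM-to-REG addressing in software is sketched below. This is an interpretation of the one-slot-per-lane offset rule, not the patented circuit; in particular, flattening each lane's storage into a single stream in the preset reading order is an assumption made for illustration.

```python
import numpy as np

def skewed_read(lane_streams, j, C, N):
    """Build the C x N register matrix for data reading cycle j.

    lane_streams[c] is lane c's storage flattened in the preset reading
    order.  Lane c's N-element window trails lane c-1 by one position
    (the preset offset), and out-of-range positions are zero-filled.
    Positions below the base j*N were addressed under earlier base
    addresses, which is why ceil(C/N) historical addresses are kept."""
    reg = np.zeros((C, N))
    length = lane_streams.shape[1]
    for c in range(C):
        for n in range(N):
            pos = j * N + n - c           # one-slot skew per lane
            if 0 <= pos < length:
                reg[c, n] = lane_streams[c, pos]
    return reg
```

Under this model the skew grows to C-1 positions at the last lane, spanning up to (C-1)//N earlier base addresses, consistent with the 16-address history requirement for C = 128, N = 8.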
It can be understood that, when the unit channel data in each storage unit corresponding to the current ith storage bit has been read during data reading for a certain storage area in one data reading cycle, the unit channel data in the (i + 1) th storage bit in each storage unit will continue to be read according to the ascending order of the storage bits.
In the above embodiment, when data is read based on each storage address, the data is sequentially read according to the preset reading rule, so that the unit channel data corresponding to the current data reading period can be smoothly, accurately and quickly read through the current address and the historical address.
Step S510, arranging the read target image channel data in a two-dimensional matrix form, where the target image channel data corresponding to the same channel are in the same matrix row, and the target image channel data corresponding to the same image to be processed in adjacent matrix rows are distributed in two adjacent matrix columns.
Specifically, the computer device may arrange the read target image channel data in a form of a two-dimensional matrix and load the data into a register. The target image channel data read in one data reading cycle will be loaded into the same register. The target image channel data corresponding to the same channel are in the same matrix row, and the target image channel data corresponding to the same image to be processed in the adjacent matrix rows are distributed in the two adjacent matrix columns. Thus, when data is transmitted to the systolic array according to the time period, the space-time characteristic of addressing of the systolic array can be realized, and high-efficiency matrix multiplication can be realized.
With continued reference to FIG. 6, the arrangement of the target image channel data in register 0 and register 1 is shown in FIG. 6. And when the computer equipment reads the target image channel data corresponding to the current data reading period from the storage area, loading the read target image channel data into the register in a two-dimensional matrix mode. The data arrangement in the two-dimensional matrix satisfies the following rules: the target image channel data corresponding to the same channel are in the same matrix row, and the target image channel data corresponding to the same image to be processed in the adjacent matrix rows are distributed in the two adjacent matrix columns.
The target image data corresponding to different channels are arranged in the preset offset manner, so that the unit channel data of each channel corresponding to the same position in the same image to be processed can be sequentially transmitted to the corresponding processing units in the systolic array in different clock cycles, realizing matrix multiplication in accordance with the operating characteristics of the systolic array.
Step S512, sequentially transmitting each column of data in the target image channel data arranged in the form of a two-dimensional matrix to a systolic array according to a time sequence to perform the operation and obtain an operation result; the width of the systolic array corresponds to the number of channels.
Specifically, the computer device may trigger a data transmission instruction according to the clock cycle, and in response to the data transmission instruction of each clock cycle, sequentially transmit each column of data in the target image channel data arranged in two-dimensional matrix form in the current register to the systolic array according to the time sequence of the different clock cycles. The width of the systolic array corresponds to the number of channels. It can be understood that when each column of target image channel data is transmitted into the systolic array, each target image datum in the column is sequentially transmitted to the processing units of the corresponding row according to the clock cycle. As the target image data flow through the systolic array, the systolic array performs the matrix multiplication based on its preloaded weight matrix until the corresponding operation result is obtained.
It will be appreciated that the two-dimensional matrix of data read in one data reading cycle has N columns; accordingly, the N columns are transmitted to the systolic array over N clock cycles for processing.
In one embodiment, the systolic array comprises C*C processing units, and the systolic array is preloaded with a weight matrix according to the arrangement of the processing units. Transmitting each column of data in the target image channel data arranged in two-dimensional matrix form to the systolic array in sequence according to a time sequence to obtain an operation result includes: sequentially transmitting each column of data in the target image channel data arranged in two-dimensional matrix form to the systolic array according to the time sequence of the clock cycles; and, based on the weight matrix preloaded in the systolic array, having the processing units cooperatively perform matrix multiplication on the sequentially incoming target image channel data and outputting the operation result from the row of processing units corresponding to the last channel in the systolic array.
Specifically, the computer device may preload the weight matrix into the systolic array; that is, the weight buffer of each processing unit in the systolic array is preloaded with one weight parameter of the weight matrix. Then, as each column of data in the target image channel data arranged in two-dimensional matrix form is successively transmitted into the systolic array, each processing unit operates on it using its preloaded weight parameter, and the dataflow through the systolic array realizes the matrix multiplication on the target image channel data until an operation result is output from the row of processing units corresponding to the last channel; the operation result is obtained by multiplying each of the N images to be processed with the weight matrix.
In the embodiment, the single instruction multiple data stream and the systolic array are combined, so that the pipelined parallel processing of data without blocking can be realized, and the operation efficiency of matrix multiplication on batch data is greatly improved.
In one embodiment, based on the weight matrix preloaded in the systolic array, the processing units cooperatively perform matrix multiplication on the sequentially incoming target image channel data, and the operation result is output from the row of processing units corresponding to the last channel in the systolic array, as follows. Each processing unit in the systolic array sequentially performs the following operations until an operation result is output from the row of processing units corresponding to the last channel: a processing unit PEc,m in the systolic array multiplies the target image channel data transmitted in during the current clock cycle by the weight parameter held in PEc,m to obtain a corresponding product, and determines a second intermediate result corresponding to PEc,m from this product and the first intermediate result transmitted in from PEc-1,m; here c and m are both less than or equal to C, and c corresponds to the channel dimension. When the next clock cycle arrives, PEc,m transmits the target image channel data received in the current clock cycle to PEc,m+1, and transmits the second intermediate result computed in the current clock cycle to PEc+1,m as the first intermediate result for the next clock cycle.
In one embodiment, for the first column of target image channel data to be first transmitted to the systolic array, each processing element in the first column of processing elements in the systolic array performs a product operation with the corresponding incoming target image data based on its respective weight parameter. The processing units in the row then transmit the received target image channel data to the next row of systolic arrays. And, the processing units in the row will transmit the product result calculated by the processing unit to the processing unit in the next row from the processing unit corresponding to the first row, so that the processing unit in the next row can perform the accumulation operation.
With continued reference to fig. 6, each column of data in the register is sequentially transmitted to the first column of processing units in the systolic array according to the clock cycle, and data already transmitted to the first column of processing units is forwarded to the second column of the systolic array when the next clock cycle arrives; the data is passed on in this way until it reaches the last column of processing units in the systolic array. Each row of processing units receiving currently incoming data multiplies the incoming data by the weight parameters buffered in that row, sums the products with the intermediate results passed down from the preceding row of processing units, and transmits the resulting intermediate results to the next row of processing units in the next clock cycle. In this way, data operation and transmission continue until the operation result is output through the row of processing units corresponding to the last channel in the systolic array.
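The per-PE behavior described above can be sketched as a simplified functional model; `pe_step` and `column_accumulate` are illustrative names, not from the patent.

```python
def pe_step(weight, x_in, psum_in):
    """One clock tick of PE(c, m): multiply the incoming channel value by
    the locally buffered weight, add the first intermediate result from
    PE(c-1, m), and return what is forwarded onward."""
    psum_out = psum_in + weight * x_in    # second intermediate -> PE(c+1, m)
    x_out = x_in                          # channel data forwarded -> PE(c, m+1)
    return x_out, psum_out

def column_accumulate(weights, xs):
    """Chain PE steps down one array column: the partial sum entering
    PE(c, m) is the output of PE(c-1, m), so the value leaving the row
    for the last channel is the dot product of the column's weights
    with the incoming channel values."""
    psum = 0.0
    for w, x in zip(weights, xs):
        _, psum = pe_step(w, x, psum)
    return psum
```

This shows why the result emerges from the row of processing units corresponding to the last channel: only after all C channel contributions have been accumulated down the column is the sum complete.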
In the above embodiment, data is streamed through the systolic array: each PE performs its computation independently and communicates only with adjacent PEs, so that efficient and energy-saving computation is achieved while the data is continuously reused.
In one embodiment, the image data processing method further includes a step of importing the weight matrix into the systolic array, where the step specifically includes: acquiring the convolution kernel corresponding to the image to be processed, the format of the convolution kernel being R*S*C*M, where R*S corresponds to the area of the convolution kernel, C corresponds to the channel dimension of the convolution kernel, and M corresponds to the number dimension of the convolution kernel; splitting the convolution kernel into R*S convolution kernel units of format C*M; forming a weight matrix from the weight parameters of one target convolution kernel unit among the R*S convolution kernel units; and importing the corresponding weight matrix into the systolic array.
In a particular application scenario, a large number of convolution calculations is often required when processing an image to be processed through a machine learning model. Referring to fig. 7(a), a schematic diagram of convolution calculation: the output of a convolution is a set of parallel feature maps, composed by sliding different convolution kernels over the input image and performing the corresponding operations. As shown in fig. 7(a), each convolution kernel has a size of R*S with C channels, and there are M such kernels. For one image to be processed of size H*W*C, performing the convolution calculation on the image to be processed through the convolution kernels yields an output feature map of size E*F*M. Correspondingly, when the number of images to be processed is N, N corresponding output feature maps are obtained.
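The relation between the input size H*W, the kernel size R*S, and the output size E*F can be made concrete with the usual sliding-window formula. This is a sketch under assumptions the text does not state (the stride and padding parameters are illustrative; the text only names the sizes H*W*C, R*S*C*M, and E*F*M):

```python
def conv_output_size(H, W, R, S, stride=1, pad=0):
    """Output height E and width F of a convolution, assuming the standard
    sliding-window formula. stride and pad are illustrative assumptions."""
    E = (H + 2 * pad - R) // stride + 1
    F = (W + 2 * pad - S) // stride + 1
    return E, F
```

For example, a 3*3 kernel over a 32*32 channel image with stride 1 and no padding gives a 30*30 output map, while a 1*1 kernel leaves the spatial size unchanged.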
In order to realize efficient and energy-saving convolution calculation under the hybrid SIMD and systolic array architecture, the application provides a calculation method called streaming convolution. Referring to fig. 7(B), fig. 7(B) is a schematic diagram of splitting a convolution kernel in one embodiment. As shown in fig. 7(B), the computer device splits each filter plane of size R*S into R*S matrices of size 1*1; that is, it splits the convolution kernel of format R*S*C*M into R*S convolution kernel units of format C*M.
After the splitting of the convolution kernel is achieved, the convolution calculation of each corresponding convolution kernel unit can be converted into a matrix multiplication calculation. Based on this, the computer device may form a weight matrix from each weight parameter of one target convolution kernel unit of the R × S convolution kernel units, and introduce the corresponding weight matrix into the systolic array.
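The splitting step can be sketched directly on an array holding the kernel; the function name and shapes below are illustrative, not part of the patent:

```python
import numpy as np

def split_kernel(kernel):
    """Split an R*S*C*M convolution kernel into R*S weight matrices of
    format C*M, one per spatial position (r, s)."""
    R, S, C, M = kernel.shape
    return [kernel[r, s] for r in range(R) for s in range(S)]

# example: a 3*3 kernel with 8 channels and 4 output kernels
kernel = np.arange(3 * 3 * 8 * 4, dtype=np.float32).reshape(3, 3, 8, 4)
units = split_kernel(kernel)  # 9 weight matrices, each of shape (8, 4)
```

Each element of `units` is exactly one C*M weight matrix ready to be imported into the systolic array as described above.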
Further, under the hybrid architecture of the SIMD module and the systolic array module, the data scheduling required for the convolution calculation can follow the scheduling control of the matrix multiplication process in the foregoing embodiments. Each time, addressing is performed with a control granularity of C*N, where C corresponds to the channel dimension of the image to be processed and N corresponds to its mini-batch dimension, so that the operation result corresponding to one convolution kernel unit can be obtained through the systolic array calculation.
Referring to fig. 8, fig. 8 is a schematic diagram of converting convolution calculation into matrix multiplication calculation in one embodiment. As shown in fig. 8, in principle, the convolution calculation takes N images to be processed as input, each image to be processed includes channel images of C channels, and each channel image has a size of H*W. Each convolution kernel unit has a size of 1*1 with C channels, and there are M such units. The output obtained by performing the convolution operation on the images to be processed through the convolution kernel units consists of N images of size E*F*M. It will be appreciated that this convolution process can be implemented by converting it into a matrix multiplication. Referring to the lower part of fig. 8, the N input images to be processed can be arranged into an input matrix with length E*F*N and width C; the convolution kernel units can be arranged into a weight matrix with length C and width M; the output can then be represented by the matrix product of the input matrix and the weight matrix, i.e., an output matrix with length E*F*N and width M.
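The conversion of fig. 8 can be sketched in a few lines: flatten the spatial and batch dimensions into rows of width C, multiply by the C*M weight matrix, and reshape back. This is an illustrative NumPy model with assumed names and (N, C, H, W) layout:

```python
import numpy as np

def conv1x1_as_matmul(images, w):
    """1*1 convolution expressed as the matrix product described above.

    images: (N, C, H, W); w: (C, M) weight matrix of one kernel unit.
    The input is flattened into an (H*W*N, C) input matrix, multiplied by
    the (C, M) weight matrix, and reshaped back to (N, M, H, W).
    """
    N, C, H, W = images.shape
    x = images.transpose(0, 2, 3, 1).reshape(N * H * W, C)  # input matrix
    y = x @ w                                               # output matrix
    return y.reshape(N, H, W, w.shape[1]).transpose(0, 3, 1, 2)
```

On random data this agrees with a directly computed per-pixel channel mix (e.g. `np.einsum('nchw,cm->nmhw', images, w)`), confirming that a 1*1 convolution is a matrix multiplication.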
In the above embodiment, by splitting the convolution kernel into convolution kernel units and then converting the multidimensional convolution calculation between the convolution kernel and the image to be processed into matrix multiplication operations between the individual convolution kernel units and the image to be processed, the processing efficiency of the multidimensional convolution calculation can be greatly improved.
In one embodiment, the image data processing method further includes a step of performing operations based on the plurality of convolution kernel units, where the step specifically includes: transmitting the operation result corresponding to the target convolution kernel unit back to the local memory and storing it; determining the convolution kernel units that have not yet participated in the operation among the R*S convolution kernel units, and taking one of them as the next target convolution kernel unit; reintroducing the next target convolution kernel unit into the systolic array to participate in the next round of operation to obtain a corresponding operation result; and continuing to execute the step of transmitting the operation result corresponding to the target convolution kernel unit back to the local memory and storing it, until all of the R*S convolution kernel units have participated in the operation and the operation results corresponding to the respective convolution kernel units are obtained.
Specifically, the computer device may transmit the operation result corresponding to the target convolution kernel unit back to the local memory and store it. The computer device then selects, from the R*S convolution kernel units, one convolution kernel unit that has not yet participated in the operation as the next target convolution kernel unit, and reintroduces it into the systolic array to participate in the next round of operation to obtain a corresponding operation result. This is repeated until all of the R*S convolution kernel units have participated in the operation, yielding the operation results corresponding to the respective convolution kernel units.
In an embodiment, the computer device may further determine a convolution feature map obtained by performing convolution operation on the image to be processed according to operation results stored in the local memory and respectively corresponding to the R × S convolution kernel units. Specifically, the computer device may perform merging or matrix addition operation on operation results respectively corresponding to the R × S convolution kernel units, so as to obtain a convolution feature map obtained by performing convolution operation on the image to be processed.
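The accumulation of per-unit results into the final convolution feature map can be sketched as follows. This is an illustrative NumPy model, assuming stride 1 and no padding (which the text does not state explicitly); `streaming_conv` and all other names are ours, not the patent's:

```python
import numpy as np

def streaming_conv(images, kernel):
    """Sum the per-unit matrix products into the full convolution result.

    images: (N, C, H, W); kernel: (R, S, C, M). Each of the R*S kernel
    units contributes one 1*1-style matrix product over a shifted window,
    and the contributions are summed, as in the matrix-addition step
    described above. Assumes stride 1 and 'valid' (no) padding.
    """
    N, C, H, W = images.shape
    R, S, _, M = kernel.shape
    E, F = H - R + 1, W - S + 1
    out = np.zeros((N, M, E, F))
    for r in range(R):
        for s in range(S):
            window = images[:, :, r:r + E, s:s + F]        # shifted view
            out += np.einsum('nchw,cm->nmhw', window, kernel[r, s])
    return out
```

On random data this matches a reference convolution computed in one shot, so summing the R*S per-unit operation results does recover the convolution feature map.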
In the above embodiment, the processing efficiency of the multidimensional convolution calculation can be greatly improved by splitting the convolution kernel into the convolution kernel units, then converting the multidimensional convolution calculation between the convolution kernel and the image to be processed into the matrix multiplication operation between the plurality of convolution kernel units and the image to be processed, respectively, and obtaining the convolution characteristic map based on the operation result corresponding to each convolution kernel unit.
In a specific embodiment, the step of storing the image data into the C storage areas in the local memory by channel specifically includes: acquiring task code transmitted through the programming interface; and storing the image data into the C storage areas in the local memory by channel according to the software parameters in the task code. The image data processing method further includes: compiling the task code to obtain a corresponding executable program; and, by executing the executable program, determining the current address carried in a data reading instruction when the data reading instruction occurs, acquiring the preset number of historical addresses according to the current address, and sequentially transmitting each row of data in the target image channel data arranged in the form of a two-dimensional matrix to the systolic array according to the time sequence for operation, so as to obtain the operation result.
Referring to fig. 9, fig. 9 is a flow chart of an image data processing method in one embodiment. As shown in fig. 9, the computer device may provide a programming interface through which a user may write task code related to a target task as desired. The computer device may analyze the program parameters in the task code and enable the hardware modules required by the implementation of the present application by calling library functions of a dynamic link library. The library functions of the dynamic link library here mainly refer to execution interfaces, such as a gemm interface or a conv interface, through which the corresponding hardware modules are used.
When the task code is then executed, the computer device can call the image data of the N images to be processed based on the software parameters in the task code, and perform operations such as zero-padding, expansion, splitting, or alignment on the image data according to a preset format based on the software parameters, so as to store the image data into the C storage areas in the local memory by channel. The software parameters specifically include the data arrangement format. For example, when C is 128 and N is 8, the computer device may arrange the image data in a 128*8 format, zero-pad image data with fewer than 128 rows, and split image data with more than 128 rows.
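The 128*8 arrangement with zero-padding and splitting can be sketched as below. This is an illustrative model only: the function name, the (C, N) toy layout (one scalar per channel per image, where the real hardware would hold a whole channel plane per entry), and the contiguous split are our assumptions:

```python
import numpy as np

def pack_by_channel(image_data, bank_rows=128):
    """Arrange channel data into bank_rows*N blocks: blocks with fewer than
    bank_rows channel rows are zero-padded, and data with more than
    bank_rows channels is split across several blocks.

    image_data: (C, N) array, one entry per channel per image (illustrative).
    """
    C, N = image_data.shape
    blocks = []
    for start in range(0, C, bank_rows):
        block = image_data[start:start + bank_rows]
        pad = bank_rows - block.shape[0]
        if pad:  # fewer than bank_rows channel rows: extend with zeros
            block = np.pad(block, ((0, pad), (0, 0)))
        blocks.append(block)
    return blocks
```

For instance, 130 channels of 8 images split into two 128*8 blocks, the second of which carries 2 real rows and 126 zero rows.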
Next, the computer device may compile the task code to obtain a corresponding executable program. If compilation succeeds, the flow proceeds to execution; if compilation fails, error information is returned to the user until the user finishes correcting the task code and re-enters the compilation flow. By executing the executable program corresponding to the target task, the computer device judges, according to the interface passed in by the software system, whether the systolic array needs to be called. If so, the image data processing method provided by the embodiments of the application is executed; if the systolic array does not need to be called, general-purpose calculation can be realized directly through the SIMD module.
For the scheme that needs to call the systolic array, in the process of executing the executable program, the computer device can realize coordinated control of the data interaction between the SIMD module and the systolic array module by controlling the instructions and data of the whole SIMD systolic array hybrid architecture through the global control module in the SIMD module. That is, a data reading instruction is periodically triggered by the global control module in the SIMD module, and the current address carried in the current data reading instruction is input through the global control module into a serial-in parallel-out (SIPO) shift register to read out the preset number of previously input historical addresses, so that target image channel data corresponding to different images to be processed and different channels is read from the local memory in a preset offset manner based on the current address and the historical addresses. The computer device can arrange the read target image channel data in the form of a two-dimensional matrix, where target image channel data corresponding to the same channel lies in the same matrix row, and target image channel data corresponding to the same image to be processed in adjacent matrix rows is distributed in two adjacent matrix columns. Then, the computer device sequentially transmits each row of data in the target image channel data arranged in the form of a two-dimensional matrix to the systolic array according to the time sequence for matrix multiplication, so as to obtain the operation result.
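The serial-in parallel-out shift register holding the address history can be modelled in a few lines. This is a toy sketch; the class and method names are illustrative and the register depth stands in for the preset number of historical addresses:

```python
from collections import deque

class SipoShiftRegister:
    """Toy model of a serial-in parallel-out shift register: each cycle the
    current address is shifted in, and the most recent `depth` addresses
    can be read out in parallel."""

    def __init__(self, depth):
        self._regs = deque(maxlen=depth)

    def shift_in(self, addr):
        self._regs.appendleft(addr)

    def parallel_out(self):
        # newest (current) address first, oldest historical address last
        return list(self._regs)
```

Each data reading instruction would supply one current address via `shift_in`; `parallel_out` then yields the current address together with the historical addresses from which the offset reads of the different channels are issued.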
The application provides a hybrid system architecture combining single instruction multiple data (SIMD) and a systolic array, and realizes a reasonable memory interaction design for the data requirements of the systolic array and the control mode of SIMD, so that the memory access energy consumption of dedicated neural network applications is reduced to the greatest extent while the systolic array computes efficiently, achieving an efficient and energy-saving architecture design. The results of the application can construct a commercially meaningful hybrid system architecture based on SIMD and a systolic array, provide efficient and energy-saving computing services for users, and greatly improve the processing efficiency of batch processing of large amounts of data.
The application also provides an application scenario to which the above image data processing method is applied. Specifically, the image data processing method is applied to this application scenario as follows: the computer device needs to process batches of images to be processed through a machine learning model; for example, the machine learning model may specifically be an image classification model, and the corresponding images to be processed may specifically be original images to be classified. The computer device may divide the original images to be classified into batches of images to be processed in advance, where the number of images to be processed in each batch is N.
The computer device may provide a programming interface through which a user may write task code related to the classification task as desired. The computer device may analyze the program parameters in the task code and enable the hardware modules required by the implementation of the present application by calling library functions of a dynamic link library. When the task code is then executed, the computer device can call the image data of the N images to be processed based on the software parameters in the task code, and perform operations such as zero-padding, expansion, splitting, or alignment on the image data according to a preset format based on the software parameters, so as to store the image data into the C storage areas in the local memory by channel.
Next, the computer device may compile the task code to obtain a corresponding executable program. By executing the executable program corresponding to the classification task, the computer device carries out the image data processing method provided by the embodiments of the application, realizing the coordinated data processing of the SIMD module and the systolic array module, performing the convolution processing of the images to be processed based on the convolution kernels of the image classification model, and outputting the convolution feature maps. Classification processing is then carried out on the basis of the convolution feature maps through the classification layer of the image classification model, so as to output the classification categories respectively corresponding to the images to be processed.
It is to be understood that, in other application scenarios, the machine learning model may also be another type of model, such as an object segmentation model, a face recognition model, or an object prediction model, with the corresponding target task matched to the machine learning model. The application scenario described above is only used to illustrate the present application schematically and is not intended to limit its implementation scenarios.
It should be understood that, although the steps in the flowchart of fig. 2 are shown in order as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated otherwise, the execution order of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or multiple stages, which are not necessarily completed at the same time but may be executed at different times, and are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 10, there is provided an image data processing apparatus 1000, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, the apparatus specifically includes: an obtaining module 1001, a storage module 1002, a determining module 1003, a data reading module 1004, an arranging module 1005 and an operation module 1006, wherein:
an obtaining module 1001, configured to obtain image data of N images to be processed, where the image data of each image to be processed includes image channel data of C channels; wherein N is a positive integer greater than or equal to 1, and C is a positive integer greater than 1.
The storage module 1002 is configured to store the image data into C storage areas of the local memory according to channels, where N storage units in each storage area respectively store image channel data of one channel of N images to be processed.
A determining module 1003, configured to determine, when a data reading instruction occurs, a current address carried in the data reading instruction, and obtain a preset number of historical addresses according to the current address; the predetermined number is determined by dividing C by N.
And a data reading module 1004, configured to read, based on the current address and the historical address, target image channel data corresponding to different to-be-processed images and corresponding to different channels from the local memory in a preset offset manner.
The arrangement module 1005 is configured to arrange the read target image channel data in a two-dimensional matrix form, where the target image channel data corresponding to the same channel are in the same matrix row, and the target image channel data corresponding to the same image to be processed in adjacent matrix rows are distributed in two adjacent matrix columns.
The operation module 1006 is configured to sequentially transmit each line of data in the target image channel data arranged in the two-dimensional matrix form to the systolic array according to a time sequence to perform operation, so as to obtain an operation result; the width of the systolic array corresponds to the number of channels.
In one embodiment, the determining module 1003 is further configured to determine, when a data reading instruction occurs, a current address carried in the data reading instruction; and inputting the current address into a shift register in a serial-in parallel-out mode, and outputting a preset number of historical addresses which are input in a historical mode through the shift register.
In one embodiment, each image channel data in each image to be processed comprises I unit channel data, the I unit channel data in each image channel data is respectively stored in I storage bits of one storage unit, and different storage bits correspond to different storage addresses; the current address is a storage address corresponding to one of the storage bits; i is a positive integer of 1 or more.
In one embodiment, the current address is the storage address of the ith storage bit of the first storage unit in the first storage area, and I is less than or equal to I; the data reading module 1004 is further configured to determine, in a current data reading cycle, based on a current address, a storage address of an i-th storage bit of each storage unit in N consecutive storage units including a first storage unit in the first storage area, and acquire corresponding unit channel data based on the determined storage address; for each storage area in N-1 storage areas behind the first storage area, acquiring corresponding unit channel data from the ith storage bit corresponding to the corresponding storage area in a manner of performing storage unit offset compared with the previous storage area based on the current address; for each historical address before the current address, determining unit channel data corresponding to the current data reading period from each storage area according to a preset offset mode based on data read by the corresponding historical address in the historical data reading period; the unit channel data read in the current data reading period are used for forming a two-dimensional matrix in a C-N format.
In one embodiment, the data reading module 1004 is further configured to determine unit channel data respectively read by the historical addresses in a previous data reading cycle; according to the unit channel data read in the previous data reading period, unit channel data which respectively correspond to each storage area and correspond to the current data reading period are obtained according to a preset reading rule; wherein, presetting a reading rule comprises: reading the unit channel data of the same storage area according to the sequence that the storage bits are sequentially increased from low to high, and reading the unit channel data of different storage units in the same storage area corresponding to the same storage bits according to the sequence that the storage units are sequentially increased from low to high.
In one embodiment, the systolic array comprises C by C processing units, and weight matrixes are pre-loaded in the systolic array according to the arrangement of the processing units; the operation module 1006 is further configured to sequentially transmit each line of data in the target image channel data arranged in the form of a two-dimensional matrix to the systolic array according to a time sequence of a clock cycle; based on the weight matrix pre-loaded in the systolic array, each processing unit cooperatively performs matrix multiplication on sequentially-transmitted target image channel data, and outputs an operation result from a row of processing units corresponding to the last channel in the systolic array.
In one embodiment, the operation module 1006 is further configured to perform the following operations for each processing unit in the systolic array in turn until the operation result is output from the row of processing units corresponding to the last channel in the systolic array: a processing unit PE(c,m) in the systolic array multiplies the target image channel data transmitted in the current clock cycle by the weight parameter held in PE(c,m) to obtain a corresponding product, and determines the second intermediate result corresponding to PE(c,m) from this product and the first intermediate result transmitted by processing unit PE(c-1,m); here c and m are both less than or equal to C, and C corresponds to the channel dimension. When the next clock cycle arrives, processing unit PE(c,m) transmits the target image channel data of the current clock cycle to processing unit PE(c,m+1), and transmits the second intermediate result computed in the current clock cycle to processing unit PE(c+1,m) as the first intermediate result of the next clock cycle.
In one embodiment, the image data processing apparatus further includes a weight import module 1007, where the weight import module 1007 is configured to acquire the convolution kernel corresponding to the image to be processed, the format of the convolution kernel being R*S*C*M, where R*S corresponds to the area of the convolution kernel, C corresponds to the channel dimension of the convolution kernel, and M corresponds to the number dimension of the convolution kernel; split the convolution kernel into R*S convolution kernel units of format C*M; form a weight matrix from the weight parameters of one target convolution kernel unit among the R*S convolution kernel units; and import the corresponding weight matrix into the systolic array.
Referring to fig. 11, in one embodiment, the image data processing apparatus further includes a repeated execution module 1008, wherein:
the storage module 1002 is further configured to transmit back the operation result corresponding to the target convolution kernel unit to the local memory and store the operation result;
the determining module 1003 is further configured to determine a convolution kernel unit that does not participate in the operation among the R × S convolution kernel units, and use one of the convolution kernel units that does not participate in the operation as a next target convolution kernel unit;
and a repeated execution module 1008, configured to reintroduce the next target convolution kernel unit into the systolic array to participate in the next round of operation to obtain a corresponding operation result, and to return to and continue executing the step of transmitting the operation result corresponding to the target convolution kernel unit back to the local memory and storing it, until all of the R*S convolution kernel units have participated in the operation and the operation results corresponding to the respective convolution kernel units are obtained.
In one embodiment, the determining module 1003 is further configured to determine a convolution feature map obtained by performing convolution operation on the image to be processed according to operation results stored in the local memory and respectively corresponding to the R × S convolution kernel units.
In one embodiment, the storage module 1002 is further configured to obtain task code transmitted through the programming interface, and to store the image data into the C storage areas in the local memory by channel according to the software parameters in the task code. The image data processing apparatus further includes a code execution module 1008, configured to compile the task code to obtain a corresponding executable program; when executed, the executable program determines the current address carried in a data reading instruction when the data reading instruction occurs, acquires the preset number of historical addresses according to the current address, and sequentially transmits each row of data in the target image channel data arranged in the form of a two-dimensional matrix to the systolic array according to the time sequence for operation, so as to obtain the operation result.
The image data processing apparatus acquires the image data of N images to be processed, where the image data of each image to be processed includes image channel data of C channels. The batch of image channel data is stored by channel into the C storage areas of the local memory, where the N storage units in each storage area respectively store the image channel data of one channel of the N images to be processed. When a data reading instruction occurs, a preset number of historical addresses is acquired according to the current address carried in the data reading instruction, and target image channel data corresponding to different images to be processed and different channels is read from the local memory in a preset offset manner based on the current address and the historical addresses. The read target image channel data is then arranged in the form of a two-dimensional matrix, where target image channel data corresponding to the same channel lies in the same matrix row, and target image channel data corresponding to the same image to be processed in adjacent matrix rows is distributed in two adjacent matrix columns. When batch operation is needed, each row of data in the target image channel data arranged in the form of a two-dimensional matrix can be directly transmitted in turn, according to the time sequence, to the systolic array for operation to obtain the operation result.
Therefore, reasonable data addressing is realized through the current address and the historical addresses, and data operation is then realized through the systolic array, so that parallel processing of single instruction multiple data streams can be achieved, stalls caused by a mismatch between data access and processing speed can be avoided, reasonable memory interaction and efficient, flexible cooperative control are realized, the data processing efficiency and processing performance are greatly improved, and energy consumption can be greatly reduced through pipelined parallel data processing. For specific limitations of the image data processing apparatus, reference may be made to the above limitations of the image data processing method, which are not repeated here. The modules in the image data processing apparatus described above may be implemented entirely or partially by software, hardware, or a combination of the two. The modules may be embedded in hardware in, or independent of, a processor in the computer device, or stored in software in a memory in the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal or a server, and its internal structure diagram may be as shown in fig. 12. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an image data processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and such variations and modifications fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (15)
1. A method of image data processing, the method comprising:
acquiring image data of N images to be processed, wherein the image data of each image to be processed comprises image channel data of C channels; wherein N is a positive integer greater than or equal to 1, and C is a positive integer greater than 1;
storing the image data into C storage areas of a local memory according to channels, wherein N storage units in each storage area respectively store image channel data of one channel of N images to be processed;
when a data reading instruction occurs, determining a current address carried in the data reading instruction, and acquiring a preset number of historical addresses according to the current address; the preset number is determined according to the quotient of C divided by N;
reading target image channel data corresponding to different images to be processed and corresponding to different channels from the local memory in a preset offset mode based on the current address and the historical address;
arranging the read target image channel data in a two-dimensional matrix form, wherein the target image channel data corresponding to the same channel are in the same matrix row, and the target image channel data corresponding to the same image to be processed in adjacent matrix rows are distributed in two adjacent matrix columns;
sequentially transmitting each row of data in the target image channel data arranged in the two-dimensional matrix form to a systolic array in time order for operation to obtain an operation result; the width of the systolic array corresponds to the number of channels.
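The storage, arrangement, and skewed-column layout in claim 1 can be sketched in software as follows. This is an illustrative model only, not the patented hardware: the sizes `N` and `C` are arbitrary, and each channel is reduced to a single value for brevity (the claims store I unit values per channel).

```python
import numpy as np

# Illustrative sizes; the claims only require N >= 1 and C > 1.
N, C = 2, 4

# Image data: entry (n, c) stands in for image n's channel-c data
# (one value here; the claims store I unit values per channel).
images = np.arange(N * C).reshape(N, C)

# Store by channel: C storage areas, the N storage units of area c
# each holding one image's channel-c data.
storage_areas = [images[:, c].copy() for c in range(C)]

# Arrange the read-back target data as a C x N two-dimensional matrix:
# data for the same channel shares a matrix row.
matrix = np.stack(storage_areas)                  # shape (C, N)

# Skew for systolic input: the same image's data in adjacent channel rows
# sits in adjacent columns, so row c enters the array c cycles later.
skewed = np.zeros((C, N + C - 1), dtype=matrix.dtype)
for c in range(C):
    skewed[c, c:c + N] = matrix[c]

# Rows of `skewed` are then streamed into the systolic array one per cycle.
```

The diagonal stagger is what lets each datum meet the correct partial sum as it ripples through the array.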
2. The method according to claim 1, wherein when a data reading instruction occurs, determining a current address carried in the data reading instruction, and obtaining a preset number of historical addresses according to the current address, includes:
when a data reading instruction occurs, determining a current address carried in the data reading instruction;
and inputting the current address into a shift register in a serial-in parallel-out mode, and outputting a preset number of historical addresses which are input in a historical mode through the shift register.
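The serial-in, parallel-out shift register of claim 2 can be modelled as below. The class and its depth are illustrative assumptions (the claims fix the depth at the preset number, i.e. the quotient of C divided by N); whether the most recent address itself counts as "historical" is an interpretation detail.

```python
from collections import deque

class AddressShiftRegister:
    """Sketch of a serial-in, parallel-out shift register: each new address
    is shifted in serially, and the stored history is read out in parallel."""

    def __init__(self, depth):
        # depth = preset number of historical addresses to retain
        self._taps = deque([None] * depth, maxlen=depth)

    def shift_in(self, current_address):
        """Shift the current address in; the oldest entry falls off the end."""
        self._taps.appendleft(current_address)

    def parallel_out(self):
        """Read all retained addresses out in parallel, most recent first."""
        return list(self._taps)

reg = AddressShiftRegister(depth=2)
for addr in (0x100, 0x104, 0x108):
    reg.shift_in(addr)
history = reg.parallel_out()   # [0x108, 0x104], most recent first
```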
3. The method according to claim 1, wherein each image channel data in each image to be processed comprises I unit channel data, wherein the I unit channel data in each image channel data are respectively stored in I storage bits of one storage unit, and different storage bits correspond to different storage addresses; the current address is a storage address corresponding to one of the storage bits; i is a positive integer of 1 or more.
4. The method according to claim 3, wherein the current address is a memory address of an ith memory bit of a first memory location in a first memory area, I is less than or equal to I; the reading, from the local memory, target image channel data corresponding to different to-be-processed images and corresponding to different channels in a preset offset manner based on the current address and the historical address includes:
in a current data reading period, determining a storage address of the ith storage bit of each storage unit in N continuous storage units including the first storage unit in the first storage area based on the current address, and acquiring corresponding unit channel data based on the determined storage address;
for each storage area in the N-1 storage areas following the first storage area, acquiring corresponding unit channel data from the ith storage bits of the corresponding storage area, each storage area being offset by one storage unit relative to the previous storage area on the basis of the current address;
for each historical address before the current address, determining unit channel data corresponding to the current data reading period from each storage area in a preset offset mode based on data read by the corresponding historical address in a historical data reading period; the unit channel data read in the current data reading period are used for forming a two-dimensional matrix in a C-N format.
5. The method according to claim 4, wherein for each history address before the current address, determining unit channel data corresponding to the current data reading period from each storage area in a preset offset manner based on data read by the corresponding history address in a history data reading period respectively comprises:
determining unit channel data which are respectively read through each historical address in the previous data reading period;
according to the unit channel data read in the previous data reading period, unit channel data which respectively correspond to each storage area and correspond to the current data reading period are obtained according to a preset reading rule;
wherein the preset reading rule comprises: reading the unit channel data of the same storage area according to the sequence that the storage bits are sequentially increased from low to high, and reading the unit channel data of different storage units in the same storage area corresponding to the same storage bits according to the sequence that the storage units are sequentially increased from low to high.
6. The method of claim 1, wherein the systolic array comprises C × C processing units, and the systolic array is preloaded with a weight matrix according to the arrangement of the processing units; the sequentially transmitting each row of data in the target image channel data arranged in the two-dimensional matrix form to the systolic array in time order for operation to obtain an operation result comprises:
sequentially transmitting each row of data in the target image channel data arranged in the two-dimensional matrix form to the systolic array in the time order of clock cycles;
based on the weight matrix preloaded in the systolic array, each processing unit cooperatively performing a matrix multiplication operation on the sequentially transmitted target image channel data, and outputting the operation result from the row of processing units corresponding to the last channel in the systolic array.
7. The method of claim 6, wherein the performing, by each processing unit, a matrix multiplication operation on the sequentially transmitted target image channel data based on the weight matrix preloaded in the systolic array, and outputting the operation result from the row of processing units corresponding to the last channel in the systolic array, comprises:
each processing unit in the systolic array sequentially performing the following operations until the operation result is output from the row of processing units corresponding to the last channel in the systolic array:
for a processing unit PE(c,m) in the systolic array, multiplying, by the processing unit PE(c,m), the target image channel data transmitted in the current clock cycle by the weight parameter held in PE(c,m) to obtain a corresponding product, and determining, by the processing unit PE(c,m), a second intermediate result corresponding to PE(c,m) according to the product and a first intermediate result transmitted from the processing unit PE(c-1,m); wherein c and m are both less than or equal to C, and c corresponds to the channel dimension; and
when the next clock cycle arrives, transmitting, by the processing unit PE(c,m), the target image channel data received in the current clock cycle to the processing unit PE(c,m+1), and transmitting the second intermediate result computed in the current clock cycle to the processing unit PE(c+1,m) as the first intermediate result for the next clock cycle.
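The per-unit rule in claim 7 (stationary weight, data moving right, partial sums moving down, results leaving the last row) can be modelled in software as follows. This sketch abstracts away clock timing and models only the dataflow arithmetic; function and variable names are illustrative, not from the patent.

```python
import numpy as np

def systolic_vecmat(x, W):
    """Software model of the claim-7 recurrence: PE(c, m) multiplies its
    stationary weight W[c, m] by the datum arriving from the left (x[c])
    and adds the partial sum arriving from above; the accumulated sums
    are emitted by the row of PEs for the last channel."""
    C, M = W.shape
    partial = np.zeros(M)            # one running sum per output column m
    for c in range(C):               # partial sums accumulate downwards
        for m in range(M):           # the datum x[c] travels rightwards
            partial[m] += x[c] * W[c, m]
    return partial

# One column of input data against a 3 x 4 weight matrix reproduces
# an ordinary vector-matrix product.
x = np.array([1.0, 2.0, 3.0])
W = np.arange(12, dtype=float).reshape(3, 4)
result = systolic_vecmat(x, W)
assert np.allclose(result, x @ W)
```

In the real array the loops run concurrently across the C × C grid, one step per clock cycle; the nested loops here only reproduce the final accumulation.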
8. The method of claim 6, further comprising:
acquiring a convolution kernel corresponding to the image to be processed; the format of the convolution kernel is R × S × C × M, wherein R × S corresponds to the area of the convolution kernel, C corresponds to the channel dimension of the convolution kernel, and M corresponds to the number dimension of the convolution kernel;
splitting the convolution kernel into R × S convolution kernel units in a 1 × 1 × C × M format;
and forming a weight matrix from the weight parameters of one target convolution kernel unit among the R × S convolution kernel units, and loading the corresponding weight matrix into the systolic array.
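The kernel split in claim 8 amounts to slicing the R × S × C × M tensor by spatial position. A sketch with illustrative shapes (the concrete sizes are assumptions, not from the patent):

```python
import numpy as np

# An R x S x C x M convolution kernel is cut into R*S kernel units, each
# holding the 1 x 1 x C x M weights of one spatial tap; each unit's C x M
# slice is the weight matrix loaded into the systolic array.
R, S, C, M = 3, 3, 4, 4
rng = np.random.default_rng(0)
kernel = rng.random((R, S, C, M))

kernel_units = kernel.reshape(R * S, C, M)   # R*S units, each a C x M matrix

# Unit index r*S + s corresponds to spatial tap (r, s) of the original kernel.
assert np.array_equal(kernel_units[1 * S + 2], kernel[1, 2])
```

Each C × M unit matches the array's channel-wide input, so one unit at a time can be resident while the others wait their turn (claim 9).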
9. The method of claim 8, further comprising:
returning the operation result corresponding to the target convolution kernel unit to the local memory and storing the operation result;
determining convolution kernel units which do not participate in operation in the R × S convolution kernel units, and taking one convolution kernel unit which does not participate in operation as a next target convolution kernel unit;
and reloading the next target convolution kernel unit into the systolic array to participate in the next round of operation to obtain a corresponding operation result, returning to the step of storing, in the local memory, the operation result corresponding to the target convolution kernel unit, and continuing execution until all of the R × S convolution kernel units have participated in the operation and the operation results corresponding to the respective convolution kernel units are obtained.
10. The method of claim 9, further comprising:
and determining, according to the operation results stored in the local memory and respectively corresponding to the R × S convolution kernel units, a convolution feature map obtained by performing a convolution operation on the image to be processed.
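Claims 9 and 10 together say that summing the per-unit partial results at their matching spatial offsets reconstructs the full convolution. A software sketch under assumed shapes and 'valid' padding (both assumptions; the patent does not fix a padding mode):

```python
import numpy as np

def conv_from_units(image, kernel):
    """Sum of per-tap 1 x 1 x C x M matrix products == full convolution.
    image: H x W x C feature map; kernel: R x S x C x M."""
    H, Wd, C = image.shape
    R, S, _, M = kernel.shape
    out = np.zeros((H - R + 1, Wd - S + 1, M))
    for r in range(R):
        for s in range(S):
            # Window seen by kernel unit (r, s), then its C x M matrix product.
            window = image[r:r + out.shape[0], s:s + out.shape[1], :]
            out += window @ kernel[r, s]
    return out

# Cross-check against a direct sliding-window convolution.
rng = np.random.default_rng(1)
img = rng.random((5, 5, 3))
ker = rng.random((3, 3, 3, 2))
direct = np.zeros((3, 3, 2))
for i in range(3):
    for j in range(3):
        direct[i, j] = np.tensordot(img[i:i + 3, j:j + 3], ker, axes=3)
assert np.allclose(conv_from_units(img, ker), direct)
```

This is why each round of claim 9 can store its result and move on: the feature map of claim 10 is just the offset-aligned sum of the stored per-unit results.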
11. The method according to any one of claims 1 to 10, wherein the storing the image data into C storage areas in a local memory by channels comprises:
acquiring a task code transmitted through a programming interface;
storing the image data into C storage areas in a local memory according to channels according to software parameters in the task codes;
the method further comprises the following steps:
compiling the task codes to obtain corresponding executable programs;
and executing the executable program to perform the steps from determining, when a data reading instruction occurs, the current address carried in the data reading instruction and acquiring a preset number of historical addresses according to the current address, through sequentially transmitting each row of data in the target image channel data arranged in the two-dimensional matrix form to the systolic array in time order for operation to obtain the operation result.
12. An image data processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring image data of N images to be processed, wherein the image data of each image to be processed comprises image channel data of C channels; wherein N is a positive integer greater than or equal to 1, and C is a positive integer greater than 1;
the storage module is used for storing the image data into C storage areas of a local memory according to channels, and N storage units in each storage area respectively store image channel data of one channel of N images to be processed;
the device comprises a determining module, a judging module and a judging module, wherein the determining module is used for determining a current address carried in a data reading instruction when the data reading instruction occurs, and acquiring a preset number of historical addresses according to the current address; the preset number is determined according to the quotient of C divided by N;
the data reading module is used for reading target image channel data which correspond to different images to be processed and correspond to different channels from the local memory in a preset offset mode based on the current address and the historical address;
the arrangement module is used for arranging the read target image channel data in a two-dimensional matrix form, wherein the target image channel data corresponding to the same channel are in the same matrix row, and the target image channel data corresponding to the same image to be processed in the adjacent matrix rows are distributed in the two adjacent matrix columns;
the operation module is used for sequentially transmitting each line of data in the target image channel data arranged in the form of a two-dimensional matrix to the pulse array according to a time sequence to perform operation to obtain an operation result; the width of the systolic array corresponds to the number of channels.
13. The apparatus according to claim 12, wherein the determining module is further configured to determine, when a data reading instruction occurs, a current address carried in the data reading instruction; and inputting the current address into a shift register in a serial-in parallel-out mode, and outputting a preset number of historical addresses which are input in a historical mode through the shift register.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 11 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 11.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010829873.9A CN111897579B (en) | 2020-08-18 | 2020-08-18 | Image data processing method, device, computer equipment and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010829873.9A CN111897579B (en) | 2020-08-18 | 2020-08-18 | Image data processing method, device, computer equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111897579A true CN111897579A (en) | 2020-11-06 |
| CN111897579B CN111897579B (en) | 2024-01-30 |
Family
ID=73229760
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010829873.9A Active CN111897579B (en) | 2020-08-18 | 2020-08-18 | Image data processing method, device, computer equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111897579B (en) |
Cited By (22)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112395092A (en) * | 2020-11-30 | 2021-02-23 | 清华大学 | Data processing method and artificial intelligence processor |
| CN112614040A (en) * | 2020-12-16 | 2021-04-06 | 上海壁仞智能科技有限公司 | Method, computing device and computer-readable storage medium for convolution calculation |
| CN112967211A (en) * | 2021-01-31 | 2021-06-15 | 成都商汤科技有限公司 | Image processing method and device, computer equipment and storage medium |
| CN112966729A (en) * | 2021-02-26 | 2021-06-15 | 成都商汤科技有限公司 | Data processing method and device, computer equipment and storage medium |
| CN112967172A (en) * | 2021-02-26 | 2021-06-15 | 成都商汤科技有限公司 | Data processing device, method, computer equipment and storage medium |
| CN113535349A (en) * | 2021-01-06 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Data batch processing method, device and storage medium |
| CN113688069A (en) * | 2021-09-10 | 2021-11-23 | 北京百度网讯科技有限公司 | Data processing method, device, electronic equipment and medium |
| CN113870273A (en) * | 2021-12-02 | 2021-12-31 | 之江实验室 | A feature map segmentation method of neural network accelerator based on systolic array |
| CN113870091A (en) * | 2021-08-27 | 2021-12-31 | 深圳云天励飞技术股份有限公司 | Convolution calculation method, system, device and storage medium |
| CN114090956A (en) * | 2021-11-18 | 2022-02-25 | 深圳市比昂芯科技有限公司 | Matrix data processing method, device, equipment and storage medium |
| CN114283314A (en) * | 2021-12-06 | 2022-04-05 | 广州小鹏自动驾驶科技有限公司 | Image data processing method and device |
| WO2022110860A1 (en) * | 2020-11-25 | 2022-06-02 | 苏州浪潮智能科技有限公司 | Hardware environment-based data operation method, apparatus and device, and storage medium |
| CN114580607A (en) * | 2020-12-02 | 2022-06-03 | 中科寒武纪科技股份有限公司 | Data processing method, device and storage medium |
| CN114895964A (en) * | 2022-05-31 | 2022-08-12 | 上海阵量智能科技有限公司 | Data processing device, method, chip, board card, electronic equipment and storage medium |
| CN114925820A (en) * | 2022-05-31 | 2022-08-19 | 上海阵量智能科技有限公司 | Data processing device, method, chip, board card, electronic equipment and storage medium |
| CN115017452A (en) * | 2022-05-09 | 2022-09-06 | 深圳市国微电子有限公司 | A data processing method and device based on two-dimensional systolic array |
| CN115456858A (en) * | 2022-09-16 | 2022-12-09 | 深圳思谋信息科技有限公司 | Image processing method, image processing device, computer equipment and computer readable storage medium |
| CN115469826A (en) * | 2022-09-16 | 2022-12-13 | 深圳思谋信息科技有限公司 | Data processing method, data processing device, computer equipment and computer readable storage medium |
| CN116360858A (en) * | 2023-05-26 | 2023-06-30 | 摩尔线程智能科技(北京)有限责任公司 | Data processing method, graphics processor, electronic device and storage medium |
| CN116719559A (en) * | 2022-07-20 | 2023-09-08 | 广州众远智慧科技有限公司 | Method and device for infrared scanning |
| CN116797444A (en) * | 2023-06-16 | 2023-09-22 | 格兰菲智能科技有限公司 | Image data processing method, device, computer equipment and storage medium |
| CN117785031A (en) * | 2023-11-27 | 2024-03-29 | 北京达佳互联信息技术有限公司 | Data processing method and device, electronic equipment and storage medium |
Citations (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170103316A1 (en) * | 2015-05-21 | 2017-04-13 | Google Inc. | Computing convolutions using a neural network processor |
| CN108182471A (en) * | 2018-01-24 | 2018-06-19 | 上海岳芯电子科技有限公司 | A kind of convolutional neural networks reasoning accelerator and method |
| CN108628799A (en) * | 2018-04-17 | 2018-10-09 | 上海交通大学 | Restructural single-instruction multiple-data systolic array architecture, processor and electric terminal |
| CN108765247A (en) * | 2018-05-15 | 2018-11-06 | 腾讯科技(深圳)有限公司 | Image processing method, device, storage medium and equipment |
| CN108805262A (en) * | 2017-04-27 | 2018-11-13 | 美国飞通计算解决方案有限公司 | System and method for carrying out systolic arrays design according to advanced procedures |
| CN108805266A (en) * | 2018-05-21 | 2018-11-13 | 南京大学 | A kind of restructural CNN high concurrents convolution accelerator |
| CN108920413A (en) * | 2018-06-28 | 2018-11-30 | 中国人民解放军国防科技大学 | Convolutional neural network multi-core parallel computing method facing GPDSP |
| CN109934339A (en) * | 2019-03-06 | 2019-06-25 | 东南大学 | A Universal Convolutional Neural Network Accelerator Based on One-Dimensional Systolic Array |
| CN109992541A (en) * | 2017-12-29 | 2019-07-09 | 深圳云天励飞技术有限公司 | A data handling method, related product and computer storage medium |
| CN110674927A (en) * | 2019-09-09 | 2020-01-10 | 之江实验室 | A data reorganization method for systolic array structure |
| CN111008040A (en) * | 2019-11-27 | 2020-04-14 | 厦门星宸科技有限公司 | Cache device and cache method, computing device and computing method |
| CN111401510A (en) * | 2019-09-24 | 2020-07-10 | 上海寒武纪信息科技有限公司 | A data processing method, device, computer equipment and storage medium |
Patent Citations (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170103316A1 (en) * | 2015-05-21 | 2017-04-13 | Google Inc. | Computing convolutions using a neural network processor |
| CN108805262A (en) * | 2017-04-27 | 2018-11-13 | 美国飞通计算解决方案有限公司 | System and method for carrying out systolic arrays design according to advanced procedures |
| CN109992541A (en) * | 2017-12-29 | 2019-07-09 | 深圳云天励飞技术有限公司 | A data handling method, related product and computer storage medium |
| CN108182471A (en) * | 2018-01-24 | 2018-06-19 | 上海岳芯电子科技有限公司 | A kind of convolutional neural networks reasoning accelerator and method |
| CN108628799A (en) * | 2018-04-17 | 2018-10-09 | 上海交通大学 | Restructural single-instruction multiple-data systolic array architecture, processor and electric terminal |
| CN108765247A (en) * | 2018-05-15 | 2018-11-06 | 腾讯科技(深圳)有限公司 | Image processing method, device, storage medium and equipment |
| CN108805266A (en) * | 2018-05-21 | 2018-11-13 | 南京大学 | A kind of restructural CNN high concurrents convolution accelerator |
| CN108920413A (en) * | 2018-06-28 | 2018-11-30 | 中国人民解放军国防科技大学 | Convolutional neural network multi-core parallel computing method facing GPDSP |
| CN109934339A (en) * | 2019-03-06 | 2019-06-25 | 东南大学 | A Universal Convolutional Neural Network Accelerator Based on One-Dimensional Systolic Array |
| CN110674927A (en) * | 2019-09-09 | 2020-01-10 | 之江实验室 | A data reorganization method for systolic array structure |
| CN111401510A (en) * | 2019-09-24 | 2020-07-10 | 上海寒武纪信息科技有限公司 | A data processing method, device, computer equipment and storage medium |
| CN111008040A (en) * | 2019-11-27 | 2020-04-14 | 厦门星宸科技有限公司 | Cache device and cache method, computing device and computing method |
Non-Patent Citations (3)
| Title |
|---|
| M. BAKLOUTI: "FPGA-based many-core System-on-Chip design", Microprocessors and Microsystems, vol. 39, no. 4 * |
| 郑文佳; 王春鸿; 姜文汉; 李梅; 唐端午: "Systolic array architecture for an adaptive optics wavefront control algorithm", Journal of Data Acquisition and Processing, no. 04 * |
| 钱艺; 李昂; 王沁; 李占才: "A high-speed SIMD processor implementing BP networks", Journal of Data Acquisition and Processing, no. 02 * |
Cited By (36)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022110860A1 (en) * | 2020-11-25 | 2022-06-02 | 苏州浪潮智能科技有限公司 | Hardware environment-based data operation method, apparatus and device, and storage medium |
| WO2022110386A1 (en) * | 2020-11-30 | 2022-06-02 | 清华大学 | Data processing method and artificial intelligence processor |
| CN112395092B (en) * | 2020-11-30 | 2023-06-02 | 清华大学 | Data processing method and artificial intelligent processor |
| CN112395092A (en) * | 2020-11-30 | 2021-02-23 | 清华大学 | Data processing method and artificial intelligence processor |
| CN114580607B (en) * | 2020-12-02 | 2025-07-22 | 中科寒武纪科技股份有限公司 | Data processing method, device and storage medium |
| CN114580607A (en) * | 2020-12-02 | 2022-06-03 | 中科寒武纪科技股份有限公司 | Data processing method, device and storage medium |
| CN112614040B (en) * | 2020-12-16 | 2021-09-21 | 上海壁仞智能科技有限公司 | Method, computing device and computer-readable storage medium for convolution calculation |
| CN112614040A (en) * | 2020-12-16 | 2021-04-06 | 上海壁仞智能科技有限公司 | Method, computing device and computer-readable storage medium for convolution calculation |
| CN113535349A (en) * | 2021-01-06 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Data batch processing method, device and storage medium |
| CN112967211A (en) * | 2021-01-31 | 2021-06-15 | 成都商汤科技有限公司 | Image processing method and device, computer equipment and storage medium |
| CN112966729A (en) * | 2021-02-26 | 2021-06-15 | 成都商汤科技有限公司 | Data processing method and device, computer equipment and storage medium |
| CN112967172A (en) * | 2021-02-26 | 2021-06-15 | 成都商汤科技有限公司 | Data processing device, method, computer equipment and storage medium |
| CN112966729B (en) * | 2021-02-26 | 2023-01-31 | 成都商汤科技有限公司 | Data processing method and device, computer equipment and storage medium |
| WO2023024668A1 (en) * | 2021-08-27 | 2023-03-02 | 深圳云天励飞技术股份有限公司 | Convolution calculation method, system and device, and storage medium |
| CN113870091B (en) * | 2021-08-27 | 2024-12-20 | 深圳云天励飞技术股份有限公司 | Convolution calculation method, system, device and storage medium |
| CN113870091A (en) * | 2021-08-27 | 2021-12-31 | 深圳云天励飞技术股份有限公司 | Convolution calculation method, system, device and storage medium |
| CN113688069A (en) * | 2021-09-10 | 2021-11-23 | 北京百度网讯科技有限公司 | Data processing method, device, electronic equipment and medium |
| CN114090956A (en) * | 2021-11-18 | 2022-02-25 | 深圳市比昂芯科技有限公司 | Matrix data processing method, device, equipment and storage medium |
| CN114090956B (en) * | 2021-11-18 | 2024-05-10 | 深圳市比昂芯科技有限公司 | Matrix data processing method, device, equipment and storage medium |
| CN113870273B (en) * | 2021-12-02 | 2022-03-25 | 之江实验室 | A feature map segmentation method of neural network accelerator based on systolic array |
| CN113870273A (en) * | 2021-12-02 | 2021-12-31 | 之江实验室 | A feature map segmentation method of neural network accelerator based on systolic array |
| CN114283314A (en) * | 2021-12-06 | 2022-04-05 | 广州小鹏自动驾驶科技有限公司 | Image data processing method and device |
| CN115017452A (en) * | 2022-05-09 | 2022-09-06 | 深圳市国微电子有限公司 | A data processing method and device based on two-dimensional systolic array |
| CN115017452B (en) * | 2022-05-09 | 2025-04-11 | 深圳市国微电子有限公司 | A data processing method and device based on two-dimensional systolic array |
| CN114925820A (en) * | 2022-05-31 | 2022-08-19 | 上海阵量智能科技有限公司 | Data processing device, method, chip, board card, electronic equipment and storage medium |
| CN114895964A (en) * | 2022-05-31 | 2022-08-12 | 上海阵量智能科技有限公司 | Data processing device, method, chip, board card, electronic equipment and storage medium |
| CN114895964B (en) * | 2022-05-31 | 2025-01-07 | 上海阵量智能科技有限公司 | Data processing device, method, chip, board, electronic device and storage medium |
| CN116719559A (en) * | 2022-07-20 | 2023-09-08 | 广州众远智慧科技有限公司 | Method and device for infrared scanning |
| CN116719559B (en) * | 2022-07-20 | 2024-06-11 | 广州众远智慧科技有限公司 | Method and device for infrared scanning |
| CN115469826A (en) * | 2022-09-16 | 2022-12-13 | 深圳思谋信息科技有限公司 | Data processing method, data processing device, computer equipment and computer readable storage medium |
| CN115469826B (en) * | 2022-09-16 | 2023-04-07 | 深圳思谋信息科技有限公司 | Data processing method and device, computer equipment and computer readable storage medium |
| CN115456858A (en) * | 2022-09-16 | 2022-12-09 | 深圳思谋信息科技有限公司 | Image processing method, image processing device, computer equipment and computer readable storage medium |
| CN116360858B (en) * | 2023-05-26 | 2023-08-29 | 摩尔线程智能科技(北京)有限责任公司 | Data processing method, graphics processor, electronic device and storage medium |
| CN116360858A (en) * | 2023-05-26 | 2023-06-30 | 摩尔线程智能科技(北京)有限责任公司 | Data processing method, graphics processor, electronic device and storage medium |
| CN116797444A (en) * | 2023-06-16 | 2023-09-22 | 格兰菲智能科技有限公司 | Image data processing method, device, computer equipment and storage medium |
| CN117785031A (en) * | 2023-11-27 | 2024-03-29 | 北京达佳互联信息技术有限公司 | Data processing method and device, electronic equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111897579B (en) | 2024-01-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111897579B (en) | Image data processing method, device, computer equipment and storage medium | |
| CN113076521B (en) | Reconfigurable architecture method based on GPGPU and computing system | |
| US12307252B2 (en) | Deep vision processor | |
| US12306901B2 (en) | Operation accelerator, processing method, and related device | |
| US11550543B2 (en) | Semiconductor memory device employing processing in memory (PIM) and method of operating the semiconductor memory device | |
| US8442927B2 (en) | Dynamically configurable, multi-ported co-processor for convolutional neural networks | |
| CN108628799B (en) | Reconfigurable single instruction multiple data systolic array structure, processor and electronic terminal | |
| CN109121435A (en) | Processing device and processing method | |
| CN113392959A (en) | Method for reconstructing architecture in computing system and computing system | |
| CN115803754A (en) | Hardware architecture for processing data in neural networks | |
| CN110766127A (en) | Neural network computing special circuit and related computing platform and implementation method thereof | |
| KR20210045225A (en) | Method and apparatus for performing operation in neural network | |
| TWI851030B (en) | Processing core, reconfigurable processing elements and operating method thereof for artificial intelligence accelerators | |
| Véstias | Processing systems for deep learning inference on edge devices | |
| CN111047045B (en) | Distribution system and method for machine learning operation | |
| Plagwitz et al. | SNN vs. CNN implementations on FPGAs: an empirical evaluation | |
| Yin et al. | A reconfigurable FPGA-based spiking neural network accelerator | |
| US20240061649A1 (en) | In-memory computing (imc) processor and operating method of imc processor | |
| US12437195B2 (en) | Systems and methods for pipelined heterogeneous dataflow for artificial intelligence accelerators | |
| Zhang et al. | VisionAGILE: A Versatile Domain-Specific Accelerator for Computer Vision Tasks | |
| KR20240007745A (en) | Neural processing unit and method of operation thereof | |
| CN116090518A (en) | Feature map processing method and device based on systolic operation array and storage medium | |
| Colangelo et al. | Evolutionary cell aided design for neural network architectures | |
| RU239197U1 (en) | A memory organization device for storing neuron models and neural network parameters in neuromorphic microcircuits | |
| RU2852483C1 (en) | Programmable memory organisation device for storing neuron models and neural network parameters in neuromorphic integrated circuits |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |