WO2025218403A1 - Data processing method, processor, chip and electronic device - Google Patents
- Publication number
- WO2025218403A1 (PCT/CN2025/082601)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- processed
- control logic
- block data
- loop
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/781—On-chip cache; Off-chip memory
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30029—Logical and Boolean instructions, e.g. XOR, NOT
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/30047—Prefetch instructions; cache control instructions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30134—Register stacks; shift registers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
Definitions
- the present disclosure relates to the field of computer technology, and in particular to a data processing method, a processor, a chip, and an electronic device.
- Neural network algorithms are a recently popular type of machine learning algorithm in artificial intelligence (AI), achieving remarkable results in a variety of fields, such as image recognition, speech recognition, and natural language processing.
- AI: artificial intelligence
- running neural network algorithms on general-purpose processors such as central processing units (CPUs) and graphics processing units (GPUs) consumes significant computational time and energy.
- Processors and AI accelerators incorporate numerous proprietary modules to accelerate training and inference. Matrix multiplication and convolution are common operators in AI algorithms.
- SIMD: Single Instruction Multiple Data
- SIMT: Single Instruction Multiple Thread
- the present disclosure proposes a data processing technology solution.
- a data processing method applied to a processor, the processor comprising a control logic unit and a dot multiplication unit array, the dot multiplication unit array comprising a plurality of dot multiplication units for performing multiplication and accumulation operations, the method comprising: the control logic unit sequentially reading, according to an acquired control instruction, cyclic block data of data to be processed from a memory to the dot multiplication unit array; the dot multiplication unit array performing a multiplication and accumulation operation on the cyclic block data of the data to be processed received each time, and determining a cyclic block result of the data to be processed received each time; and the control logic unit determining a logical operation result of the data to be processed based on the plurality of cyclic block results acquired from the dot multiplication unit array.
- the control logic unit sequentially reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array according to the acquired control instruction, including: the control logic unit parses the access information of the data to be processed and the logical operation type of the data to be processed according to the acquired control instruction; determines the cyclic block order of the data to be processed according to the logical operation type of the data to be processed; and reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array according to the access information and the preset size and the cyclic block order of the data to be processed, wherein the size of the cyclic block data is less than or equal to the preset size, and the preset size is determined by the number of idle dot multiplication units in the dot multiplication unit array.
- the access information of the data to be processed includes access information of a first matrix and a second matrix
- the number of columns of the first matrix is the same as the number of rows of the second matrix
- the logical operation type includes a matrix multiplication operation of the first matrix and the second matrix
- the loop block order determined by the matrix multiplication operation type includes: a first outer loop order in the direction of the number of rows of the first matrix, a second outer loop order in the direction of the number of columns of the second matrix, and an inner loop order in the direction of the number of columns of the first matrix and the number of rows of the second matrix; or, a first outer loop order in the direction of the number of columns of the second matrix, a second outer loop order in the direction of the number of rows of the first matrix, and an inner loop order in the direction of the number of columns of the first matrix and the number of rows of the second matrix.
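The loop-block orders above correspond to the classic loop orderings of tiled matrix multiplication. The following Python sketch is purely illustrative (the tile size `T` stands in for the "preset size" determined by the number of idle dot multiplication units; it is not the patented implementation), showing the first variant: outer loops over the rows of the first matrix and columns of the second, inner loop over the shared dimension:

```python
import numpy as np

def tiled_matmul(A, B, T):
    """Tiled matrix multiply illustrating the described loop-block order:
    first outer loop over rows of A, second outer loop over columns of B,
    inner loop over the shared dimension (cols of A == rows of B)."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "cols of A must equal rows of B"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):          # first outer loop: row direction of A
        for j in range(0, N, T):      # second outer loop: column direction of B
            for k in range(0, K, T):  # inner loop: shared dimension
                # one "cyclic block" multiply-accumulate on a T x T tile
                C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
    return C
```

Swapping the two outer loops yields the second variant described above; the inner loop stays on the shared dimension either way, which is what allows partial results to be accumulated tile by tile.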
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching loop block data.
- the loop block data of the data to be processed are read from the memory in sequence to the dot product unit array according to the loop block order of the data to be processed, including: the control logic unit stores the first loop block data of the first matrix read from the local memory into the reuse cache according to the first outer loop order and the inner loop order, so that the control logic unit reuses the first loop block data stored in the reuse cache according to the second outer loop order; or, the control logic unit stores the second loop block data of the second matrix read from the local memory into the reuse cache according to the second outer loop order and the inner loop order, so that the control logic unit reuses the second loop block data stored in the reuse cache according to the first outer loop order.
- the dot multiplication unit array performs a multiplication and accumulation operation on the circular block data of the data to be processed each time it is received, and determines the circular block result of the data to be processed each time it is received, including: the dot multiplication unit array performs a multiplication and accumulation operation on the first circular block data and the second circular block data obtained each time, and obtains the circular block result corresponding to each multiplication and accumulation operation; the control logic unit determines the logical operation result of the data to be processed based on the multiple circular block results obtained from the dot multiplication unit array, including: the control logic unit performs an accumulation operation on the multiple circular block results obtained from the dot multiplication unit array in any round of the circular block sequence to obtain a logical operator result; the control logic unit writes the logical operator result into a register stack; the control logic unit determines the logical operation result based on the multiple logical operator results obtained from the register stack.
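The accumulation path described above — cyclic-block partial results from the array are summed into a logical-operator result per round, staged in a register stack, and the final result is assembled from the stack — can be sketched as follows (a simulation with a plain list standing in for the register stack; all names are hypothetical):

```python
def assemble_result(rounds):
    """rounds: list of lists; each inner list holds the cyclic-block
    results produced by the dot multiplication array in one round of
    the cyclic blocking sequence."""
    register_stack = []
    for block_results in rounds:
        operator_result = sum(block_results)   # accumulate one round's partial results
        register_stack.append(operator_result) # write operator result to the register stack
    # the logical operation result is determined from the staged operator results
    return register_stack
```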
- the access information of the data to be processed includes access information of the data to be convolved and the convolution kernel
- the logical operation type includes the convolution operation of the data to be convolved and the convolution kernel
- the circular block order determined by the convolution operation type includes the coordinate order of the elements in the convolution kernel
- the circular block data of the data to be processed are read from the memory in sequence to the dot product unit array according to the circular block order of the data to be processed, including: the control logic unit forms a third circular block data with multiple elements with the same coordinates read from multiple convolution kernels in the memory each time according to the coordinate order of the elements in the convolution kernel, and, according to the coordinates of the current element in each convolution kernel and the access information, reads multiple elements from the data to be convolved in the memory to form a fourth circular block data; the control logic unit writes the third circular block data and the fourth circular block data into the dot product unit array.
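One way to picture the block construction above: for each kernel coordinate (r, s), the third block gathers the (r, s) element from every kernel, and the fourth block gathers the input elements that each output position multiplies against that kernel element. A hedged sketch for single-channel inputs with stride 1 and no padding (an illustration of the data layout, not the patented hardware path):

```python
import numpy as np

def conv_via_blocks(x, kernels):
    """x: H x W input; kernels: list of k x k filters.
    Accumulates the convolution by iterating kernel coordinates (r, s),
    pairing a 'third block' (same coordinate from every kernel) with a
    'fourth block' (the matching input window)."""
    k = kernels[0].shape[0]
    H, W = x.shape
    oh, ow = H - k + 1, W - k + 1
    out = np.zeros((len(kernels), oh, ow))
    for r in range(k):
        for s in range(k):
            third = np.array([w[r, s] for w in kernels])     # (r, s) element of each kernel
            fourth = x[r:r+oh, s:s+ow]                       # input elements for that coordinate
            out += third[:, None, None] * fourth[None, :, :] # multiply-accumulate per coordinate
    return out
```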
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching circular block data.
- the method also includes: in the process of calculating the fourth circular block data corresponding to the same row in the convolution output result, the control logic unit writes M elements of the fourth circular block data corresponding to the coordinates of the element in the rth row and sth column of the convolution kernel into the reuse cache according to the coordinate order of the elements in the convolution kernel, where r, s, and M are positive integers; the control logic unit reads M-1 elements from the reuse cache, reads 1 element from the data to be convolved in the local memory, and determines the fourth circular block data corresponding to the coordinates of the element in the rth row and (s+1)th column of the convolution kernel.
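The same-row reuse described above is a sliding window: moving from kernel column s to s+1 keeps M-1 of the M input elements, so only one element per step must come from local memory. A minimal simulation (a `deque` stands in for the reuse cache; this only counts memory reads, it is not the hardware design):

```python
from collections import deque

def sliding_row_reads(row, M):
    """Simulate reuse-cache behaviour along one output row: the first
    M-element block is read fully from memory; every later block reuses
    M-1 cached elements and reads exactly 1 new one."""
    cache = deque(row[:M], maxlen=M)   # initial block, fully from local memory
    reads = M
    windows = [list(cache)]
    for x in row[M:]:
        cache.append(x)                # oldest element evicted, 1 new read
        reads += 1
        windows.append(list(cache))
    return windows, reads
```

Without the reuse cache, the same windows would cost M reads each; with it, the total drops from M times the number of windows to one read per step after the first window.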
- the method further includes: in the process of calculating the fourth circular block data corresponding to adjacent rows in the convolution output result, in response to the control logic unit calculating the fourth circular block data of the current row of the convolution output result, M elements of the fourth circular block data corresponding to the coordinates of the element in the rth row and sth column in the convolution kernel are written into the reuse cache, where r, s, and M are positive integers; in response to the control logic unit calculating the fourth circular block data of the next row of the current row of the convolution output result, M elements are read from the reuse cache to constitute the fourth circular block data corresponding to the coordinates of the element in the (r-1)th row and sth column in the convolution kernel.
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching cyclic block data
- the control logic unit according to the coordinate order of the elements in the convolution kernel, converts the multiple elements with the same coordinates read from the multiple convolution kernels in the memory each time into third cyclic block data, including: the control logic unit reads the multiple convolution kernels from the local memory to the reuse cache; according to the coordinate order of the elements in the convolution kernel, the control logic unit converts the multiple elements with the same coordinates read from the multiple convolution kernels in the reuse cache each time into third cyclic block data.
- the dot product unit array performs a multiplication and accumulation operation on the cyclic block data of the data to be processed each time received to determine the cyclic block result of the data to be processed each time received, including: the dot product unit array performs a multiplication and accumulation operation on the third cyclic block data and the fourth cyclic block data received each time to determine the cyclic block result.
- a processor comprising a control logic unit and a dot multiplication unit array, wherein the dot multiplication unit array comprises a plurality of dot multiplication units for performing multiplication and accumulation operations, and wherein: the control logic unit sequentially reads cyclic block data of to-be-processed data from a memory to the dot multiplication unit array according to an acquired control instruction; the dot multiplication unit array performs a multiplication and accumulation operation on the cyclic block data of the to-be-processed data received each time, and determines a cyclic block result of the to-be-processed data received each time; and the control logic unit determines a logical operation result of the to-be-processed data according to the plurality of cyclic block results acquired from the dot multiplication unit array.
- the control logic unit sequentially reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array according to the acquired control instruction, including: the control logic unit parses the access information of the data to be processed and the logical operation type of the data to be processed according to the acquired control instruction; determines the cyclic block order of the data to be processed according to the logical operation type of the data to be processed; and reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array according to the access information and the preset size and the cyclic block order of the data to be processed, wherein the size of the cyclic block data is less than or equal to the preset size, and the preset size is determined by the number of idle dot multiplication units in the dot multiplication unit array.
- the access information of the data to be processed includes access information of a first matrix and a second matrix
- the number of columns of the first matrix is the same as the number of rows of the second matrix
- the logical operation type includes a matrix multiplication operation of the first matrix and the second matrix
- the loop block order determined by the matrix multiplication operation type includes: a first outer loop order in the direction of the number of rows of the first matrix, a second outer loop order in the direction of the number of columns of the second matrix, and an inner loop order in the direction of the number of columns of the first matrix and the number of rows of the second matrix; or, a first outer loop order in the direction of the number of columns of the second matrix, a second outer loop order in the direction of the number of rows of the first matrix, and an inner loop order in the direction of the number of columns of the first matrix and the number of rows of the second matrix.
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching loop block data.
- the loop block data of the data to be processed are read from the memory in sequence to the dot product unit array according to the loop block order of the data to be processed, including: the control logic unit stores the first loop block data of the first matrix read from the local memory into the reuse cache according to the first outer loop order and the inner loop order, so that the control logic unit reuses the first loop block data stored in the reuse cache according to the second outer loop order; or, the control logic unit stores the second loop block data of the second matrix read from the local memory into the reuse cache according to the second outer loop order and the inner loop order, so that the control logic unit reuses the second loop block data stored in the reuse cache according to the first outer loop order.
- the dot multiplication unit array performs a multiplication and accumulation operation on the circular block data of the data to be processed each time it is received, and determines the circular block result of the data to be processed each time it is received, including: the dot multiplication unit array performs a multiplication and accumulation operation on the first circular block data and the second circular block data obtained each time, and obtains the circular block result corresponding to each multiplication and accumulation operation; the control logic unit determines the logical operation result of the data to be processed based on the multiple circular block results obtained from the dot multiplication unit array, including: the control logic unit performs an accumulation operation on the multiple circular block results obtained from the dot multiplication unit array in any round of the circular block sequence to obtain a logical operator result; the control logic unit writes the logical operator result into a register stack; the control logic unit determines the logical operation result based on the multiple logical operator results obtained from the register stack.
- the access information of the data to be processed includes access information of the data to be convolved and the convolution kernel
- the logical operation type includes the convolution operation of the data to be convolved and the convolution kernel
- the circular block order determined by the convolution operation type includes the coordinate order of the elements in the convolution kernel
- the circular block data of the data to be processed are read from the memory in sequence to the dot product unit array according to the circular block order of the data to be processed, including: the control logic unit forms a third circular block data with multiple elements with the same coordinates read from multiple convolution kernels in the memory each time according to the coordinate order of the elements in the convolution kernel, and, according to the coordinates of the current element in each convolution kernel and the access information, reads multiple elements from the data to be convolved in the memory to form a fourth circular block data; the control logic unit writes the third circular block data and the fourth circular block data into the dot product unit array.
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching circular block data
- the processor is further used to: in the process of calculating the fourth circular block data corresponding to the same row in the convolution output result, the control logic unit writes M elements of the fourth circular block data corresponding to the coordinates of the element in the rth row and sth column of the convolution kernel into the reuse cache according to the coordinate order of the elements in the convolution kernel, where r, s, and M are positive integers; the control logic unit reads M-1 elements from the reuse cache, reads 1 element from the data to be convolved in the local memory, and determines the fourth circular block data corresponding to the coordinates of the element in the rth row and (s+1)th column of the convolution kernel.
- the processor is further used to: in the process of calculating the fourth circular block data corresponding to adjacent rows in the convolution output result, in response to the control logic unit calculating the fourth circular block data of the current row of the convolution output result, write M elements of the fourth circular block data corresponding to the coordinates of the element in the rth row and sth column in the convolution kernel into the reuse cache, where r, s, and M are positive integers; in response to the control logic unit calculating the fourth circular block data of the next row of the current row of the convolution output result, read M elements from the reuse cache to constitute the fourth circular block data corresponding to the coordinates of the element in the (r-1)th row and sth column in the convolution kernel.
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching cyclic block data
- the control logic unit according to the coordinate order of the elements in the convolution kernel, converts the multiple elements with the same coordinates read from the multiple convolution kernels in the memory each time into third cyclic block data, including: the control logic unit reads the multiple convolution kernels from the local memory to the reuse cache; according to the coordinate order of the elements in the convolution kernel, the control logic unit converts the multiple elements with the same coordinates read from the multiple convolution kernels in the reuse cache each time into third cyclic block data.
- the dot product unit array performs a multiplication and accumulation operation on the cyclic block data of the data to be processed each time received to determine the cyclic block result of the data to be processed each time received, including: the dot product unit array performs a multiplication and accumulation operation on the third cyclic block data and the fourth cyclic block data received each time to determine the cyclic block result.
- an artificial intelligence chip comprising the processor as described above.
- an electronic device comprising the artificial intelligence chip as described above.
- control logic unit sequentially reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array according to the control instructions obtained, so that the dot multiplication unit array performs multiplication and accumulation operations on the cyclic block data of the data to be processed each time it is received, and determines the cyclic block result of the data to be processed each time it is received.
- the control logic unit determines the logical operation result of the data to be processed based on the multiple cyclic block results obtained from the dot multiplication unit array.
- the reading and logical operations on the data to be processed can thus be converted into reading and logical operations on multiple cyclic block data (smaller-sized pieces) of the data to be processed, which makes it possible to process larger data while keeping the processor hardware resources unchanged and reduces the pressure on storage bandwidth.
- FIG1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
- FIG2 shows a flow chart of a data processing method according to an embodiment of the present disclosure.
- FIG3 is a schematic diagram showing a cyclic blocking sequence according to an embodiment of the present disclosure.
- FIG4 is a schematic diagram showing a situation in which repeated reading of cyclically blocked data occurs according to an embodiment of the present disclosure.
- FIG5 is a schematic diagram showing third cyclic block data of convolution kernels according to an embodiment of the present disclosure.
- FIG6 is a schematic diagram showing fourth cyclic block data according to an embodiment of the present disclosure.
- FIG7 shows a schematic diagram of a reuse cache according to an embodiment of the present disclosure.
- FIG8 shows a schematic diagram of another reuse cache according to an embodiment of the present disclosure.
- FIG9 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
- A and/or B can represent three situations: A alone, both A and B, and B alone.
- "at least one" herein refers to any one of a plurality of items or any combination of at least two of them.
- at least one of A, B, and C can represent any one or more elements selected from the set consisting of A, B, and C.
- Figure 1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
- the processor includes a control logic unit and a dot multiplication unit array, wherein the dot multiplication unit array includes a plurality of dot multiplication units for performing multiply-accumulate operations.
- the dot multiplication unit may also include an operator to perform a specified operation, such as calculating multiple multiplications and performing accumulation.
- the dot multiplication unit may include a multiplier, an adder, etc.
- the specific structures of each dot multiplication unit may be the same or different, and this disclosure does not limit this.
- the dot multiplication unit may also include other types of operators to accommodate various different calculation processes. This disclosure does not limit the number and type of operators included in the dot multiplication unit.
- the control logic unit can connect each dot multiplication unit in the dot multiplication unit array.
- the control logic unit can number each dot multiplication unit in the form of a two-dimensional matrix or a multi-dimensional matrix so that multiple dot multiplication units can be logically arranged in the form of a two-dimensional matrix or a multi-dimensional matrix, thereby better adapting to the logical operations of the matrix.
- the memory corresponding to the processor may include a local memory arranged outside the processor, and the control logic unit in the processor can be connected to the local memory.
- the control logic unit can be used for address calculation to load data from the memory to the dot multiplication unit array and control the dot multiplication unit array to process the data to be processed.
- the local memory may be an on-chip cache
- the control logic unit may load the executable program and the data to be processed (for example, the input matrix) on the off-chip flash memory into the above-mentioned local memory (on-chip cache), and then perform the subsequent logical operations on the data to be processed.
- the local memory may store data to be processed and an executable program.
- the executable program may include control instructions.
- the processor executes the control instructions to implement logical operations on the data to be processed, such as matrix multiplication operations, convolution operations, and other multiplication-accumulation-related operations on the data to be processed.
- the memory may also include a reuse buffer set inside the processor.
- the reuse buffer may be set inside the dot multiplication unit array (not shown in Figure 1) or outside it (see Figure 1). Compared with setting the reuse buffer outside the dot multiplication unit array, setting it inside the array allows the dot multiplication unit array to read data more efficiently.
- the control logic unit is provided with a loader, a decoder, etc., wherein the loader can be used to load the data to be processed or part of the data to be processed in the local memory into the reuse cache in the processor.
- the decoder can decode the control instructions for accessing data in the executable program based on the change in the storage address of the data to be processed after loading. For example, for the control instruction to access data X in the local memory, since the data X is cached in the reuse cache, the address of the data X stored in the reuse cache can be obtained through decoding.
- the decoder can convert the control instruction to access data X in the local memory into a control instruction to access data X in the reuse cache, which is conducive to the subsequent control logic unit directly sending the data cached in the reuse cache to the dot product unit, and the dot product unit performs a multiplication and accumulation operation on it.
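The decoder's address translation described above can be pictured as a lookup: once data X has been loaded into the reuse cache, instructions that referenced its local-memory address are redirected to the cached address. A toy sketch (instruction and address representations are hypothetical, chosen only for illustration):

```python
def rewrite_access(instr, remap):
    """instr: (opcode, address) pair; remap: maps a local-memory address
    to the reuse-cache address where that data is now cached.
    Returns the instruction, redirected to the reuse cache if cached."""
    op, addr = instr
    if addr in remap:
        return (op, remap[addr])   # data was loaded: access the cache copy
    return instr                   # not cached: fall back to local memory
```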
- the control logic unit can also directly load data from the off-chip memory to the reuse cache, which is not limited in this disclosure.
- a corresponding register file (Register File) can also be set for the processor for reading, storing, processing, and transmitting data without going through memory.
- the register file includes memory address registers, memory data registers, instruction registers, operation code word registers, accumulators, flag registers, etc., and the present disclosure does not limit this.
- the control logic unit can be used for address calculation so as to move data between the dot multiplication unit array and the register file.
- the processor of the embodiment of the present disclosure may be a newly designed one, or may be obtained by improving an existing processor chip.
- the types of processor chips may include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), a general-purpose computing on graphics processing units (GPGPU) chip, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a tensor processing unit (TPU), a field programmable gate array (FPGA), or other programmable logic devices, and may also include a microprocessor or other conventional processor.
- Figure 2 shows a flow chart of a data processing method according to an embodiment of the present disclosure. As shown in Figure 2, the data processing method is applied to a processor, and the data processing method may include the following steps:
- step S11 the control logic unit reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array in sequence according to the obtained control instructions.
- the control instructions are used to indicate the access information of the data to be processed and the logical operation type of the data to be processed, such as matrix multiplication operations, convolution operations and other multiplication-accumulation related operations of the data to be processed.
- control logic unit can obtain the address information of the data to be processed according to the control instruction, and determine the address information of each cyclic block data of the data to be processed in sequence according to the address information of the data to be processed, and read the cyclic block data of the data to be processed from the memory to the dot multiplication unit array.
- the memory may include a local memory provided outside the processor, and the control logic unit may read the cyclic block data of the data to be processed from the local memory in sequence to the dot product unit array according to the acquired control instructions.
- the memory may also be provided in the processor for caching the cyclic block data.
- the control logic unit can read data from the reuse cache faster than it can read data from the local memory.
- control logic unit may also synchronously store the cyclic block data that needs to be read repeatedly into the reuse cache provided in the processor, so that the control logic unit can subsequently read the cyclic block data from the reuse cache to the dot multiplication unit array.
- the data to be processed includes feature data in a deep learning task
- the feature data includes at least one of image feature data, speech feature data, and text feature data.
- for example, in a scenario where a deep neural network is used to perform image recognition on a target object, the data to be processed may be image feature data, and the image feature data of the target object (such as a face feature map) may be stored in a memory; in a scenario where a deep neural network is used to perform speech recognition on a target object, the data to be processed may be speech feature data, and the speech feature data of the target object may be stored in a memory; in a scenario where a deep neural network is used to perform text recognition on a target document, the data to be processed may be text feature data, and the text feature data of the target document may be stored in a memory; the embodiments of the present disclosure do not impose any restrictions on the type of data to be processed.
- control logic unit may obtain data structure information of the data to be processed and address information of the data to be processed through control instructions; wherein the data structure information of the data to be processed includes, for example, the dimension of the data to be processed, the size of the data to be processed, the data type of the elements in the data to be processed (for example, integer type, single-precision floating-point type, double-precision floating-point type, character type, etc.), and other information used to describe the data to be processed; the address information of the data to be processed includes, for example, the base address (Base Address) of the data to be processed in the memory, the addressing space, and other address-related information.
- step S12 the dot product unit array performs a multiplication and accumulation operation on the cyclic block data of the to-be-processed data received each time, and determines the cyclic block result of the to-be-processed data received each time.
- the dot multiplication unit array may include tile_M × tile_N dot multiplication units, each of which may calculate tile_K multiplications and accumulate the results, where tile_M, tile_N, and tile_K are positive integers.
- the present disclosure does not impose any restrictions on the specific values of tile_M, tile_N, and tile_K, which may be set according to actual application scenarios.
- the dot product array performs matrix multiplication or convolution operations by performing multiplication and accumulation operations on the received blocks of data to be processed.
- the dot product array can multiply a matrix of size tile_M × tile_K by a matrix of size tile_K × tile_N.
- the dot product array can process tile_M convolved elements and tile_N convolution kernel elements at a time, where the number of channels of tile_M convolved elements and tile_N convolution kernel elements is tile_K.
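A minimal software model of such an array, for illustration only: each `dot_unit` call stands in for one dot multiplication unit performing tile_K multiplications and an accumulation, so one pass over the array multiplies a tile_M × tile_K block by a tile_K × tile_N block.

```python
# Pure-Python sketch of a tile_M x tile_N dot multiplication unit array.
# Each unit computes tile_K multiplications and accumulates them.

def dot_unit(row, col):
    # one dot multiplication unit: tile_K multiplies, then accumulate
    return sum(a * b for a, b in zip(row, col))

def tile_matmul(A, B):
    # A: tile_M x tile_K, B: tile_K x tile_N
    cols_B = list(zip(*B))  # column view of B
    return [[dot_unit(a_row, b_col) for b_col in cols_B] for a_row in A]

A = [[1, 2], [3, 4], [5, 6]]  # tile_M = 3, tile_K = 2
B = [[7, 8], [9, 10]]         # tile_K = 2, tile_N = 2
assert tile_matmul(A, B) == [[25, 28], [57, 64], [89, 100]]
```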
- step S13 the control logic unit determines the logic operation result of the data to be processed according to the multiple loop block results obtained from the dot product unit array.
- control logic unit may concatenate and/or add the obtained multiple cyclic block results to determine the logical operation result of the data to be processed.
- the control logic unit may concatenate and/or add each received cyclic block result with the previously received cyclic block results.
- the control logic unit may also concatenate and/or add all received cyclic block results once they have all been obtained, and the present disclosure does not limit this.
- the control logic unit sequentially reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array according to the control instruction obtained for indicating the logical operation of the data to be processed, so that the dot multiplication unit array performs multiplication and accumulation operations on the cyclic block data of the data to be processed each time it is received, and determines the cyclic block result of the data to be processed each time it is received.
- the control logic unit determines the logical operation result of the data to be processed based on the multiple cyclic block results obtained from the dot multiplication unit array.
- the reading and logical operation of the data to be processed can thus be converted into the reading and logical operation of multiple small-sized cyclic block data, which makes it possible to process larger-sized data while keeping the processor hardware resources unchanged and reduces the pressure on the storage bandwidth.
- step S11 the control logic unit sequentially reads the cyclic block data of the data to be processed from the memory to the dot product unit array according to the acquired control instruction.
- the memory may include a local memory disposed outside the processor; or, in addition to the local memory disposed outside the processor, the memory may also include a reuse cache disposed within the processor for caching cyclic block data, and the embodiments of the present disclosure do not impose specific restrictions on this.
- step S11 may include: the control logic unit parses the access information of the data to be processed and the logical operation type of the data to be processed according to the acquired control instruction; determines the cyclic blocking order of the data to be processed according to the logical operation type of the data to be processed; and reads the cyclic blocking data of the data to be processed from the memory in sequence to the dot multiplication unit array according to the access information and the preset size and the cyclic blocking order of the data to be processed, wherein the size of the cyclic blocking data is less than or equal to the preset size, and the preset size is determined by the number of idle dot multiplication units in the dot multiplication unit array.
- the control logic unit parses the acquired control instruction and can parse out the access information of the data to be processed and the logical operation type of the data to be processed.
- the access information includes, for example, size information, address information, and layout information of the data to be processed.
- the layout information can be a row-major sequence (RowMajor) or a column-major sequence (ColMajor), where the row-major sequence indicates that the elements of the same row of the data to be processed are adjacent in the memory, and the column-major sequence indicates that the elements of the same column of the data to be processed are adjacent in the memory.
- the access information may include size information, address information (such as a starting address), layout information, etc. of the first matrix and the second matrix.
- the control logic unit can determine the loop block order of the data to be processed based on the logical operation type of the data to be processed; for example, if the logical operation type of the data to be processed is a matrix multiplication operation of a first matrix and a second matrix, the control logic unit can select a pre-stored loop block order that matches the matrix multiplication operation.
- the control logic unit can determine the size of the data to be processed and, based on that size, decide whether to perform cyclic block processing on the data to be processed. If the size of the data to be processed is less than or equal to the preset size, the number of idle dot multiplication units in the dot multiplication unit array is sufficient to process the data to be processed.
- in this case, the control logic unit can directly read all the data to be processed from the memory into the dot multiplication unit array, so that the dot multiplication unit array can perform the multiplication and accumulation operations on the data to be processed.
- if the size of the data to be processed is greater than the preset size, the control logic unit can determine the cyclic blocking order of the data to be processed according to the logical operation type of the data to be processed; determine the cyclic block data of the data to be processed according to the access information and the preset size of the data to be processed; and then read the cyclic block data of the data to be processed from the memory in sequence, according to the cyclic blocking order, to the dot multiplication unit array, so that the dot multiplication unit array realizes the multiplication and accumulation operation on the data to be processed through the multiple cyclic block data.
- each dot multiplication unit can calculate tile_K multiplications and accumulate them.
- the dot multiplication unit array can complete the multiplication operation of a matrix of size tile_M × tile_K with a matrix of size tile_K × tile_N.
- the control logic unit is responsible for loop control, and reads the loop block data of the required size from the data to be processed stored in the memory in the order of the loop block to the dot multiplication unit array.
- control logic unit parses the obtained control instruction and can parse out the access information of the data to be processed, which includes, for example, the size information of the data to be convolved and the convolution kernel, address information, convolution description information (for example, including the stride of the convolution kernel and the amount of padding), and the dimension information of the convolution output result.
- the access information may include the convolution description information (for example, including the stride of the convolution kernel and the amount of padding), the dimension information of the convolution output result, the size information of the data to be convolved and the convolution kernel, address information, layout information, etc.
- the control logic unit can determine the circular blocking order of the data to be processed based on the logical operation type of the data to be processed; for example, if the logical operation type of the data to be processed is a convolution operation of the data to be convolved and the convolution kernel, the control logic unit can select a pre-stored circular blocking order that matches the convolution operation.
- the control logic unit can determine the size of the data to be convolved and the convolution kernel, and based on the size of the data to be convolved and the convolution kernel, determine whether to perform cyclic block processing on the data to be convolved and the convolution kernel. If the size of the data to be convolved and the convolution kernel is less than or equal to the preset size, it means that the number of idle point multiplication units in the point multiplication unit array can meet the requirements for convolution operation of the data to be convolved and the convolution kernel.
- the control logic unit can directly read the data to be convolved and the convolution kernel from the memory into the point multiplication unit array, so that the point multiplication unit array can perform multiplication and addition operation on them.
- if the size of either the data to be convolved or the convolution kernel is larger than the preset size, the number of idle dot multiplication units in the dot multiplication unit array cannot meet the requirements of the convolution operation of the data to be convolved and the convolution kernel.
- the control logic unit can determine the circular blocking order of the data to be processed according to the logical operation type of the data to be processed; and determine the circular blocking data from the data to be convolved and/or the convolution kernel that is larger than the preset size according to the access information and preset size of the data to be processed, and then read the corresponding circular blocking data from the memory to the point multiplication unit array in sequence according to the circular blocking order of the data to be processed, and realize the convolution operation of the data to be convolved and the convolution kernel through the multiplication and addition operation of multiple circular blocking data by the point multiplication unit array.
- the size of the cyclic block data is less than or equal to a preset size, and the present disclosure does not impose any specific limitation on the specific size of the cyclic block data.
- each dot multiplication unit calculates tile_K multiplications and accumulates them.
- the dot multiplication unit array can process tile_M elements to be convolved and tile_N convolution kernel elements at a time, where the number of channels of tile_M elements to be convolved and tile_N convolution kernel elements is tile_K.
- the control logic unit is responsible for loop control, and reads the loop block data that meets the size from the data to be processed stored in the memory in the order of loop blocks to the dot multiplication unit array.
- the access information of the data to be processed includes access information of a first matrix and a second matrix
- the number of columns of the first matrix is the same as the number of rows of the second matrix
- the logical operation includes a matrix multiplication operation of the first matrix and the second matrix
- the loop block order determined by the matrix multiplication operation type includes: a first outer loop order in the direction of the number of rows of the first matrix, a second outer loop order in the direction of the number of columns of the second matrix, and an inner loop order in the direction of the number of columns of the first matrix and the direction of the number of rows of the second matrix.
- FIG3 illustrates a schematic diagram of a loop block order according to an embodiment of the present disclosure.
- the control instruction received by the control logic unit may be to perform a matrix multiplication operation on a first matrix A of size M × K and a second matrix B of size K × N, where the number of columns K of the first matrix A is the same as the number of rows K of the second matrix B.
- C represents the matrix multiplication result of the first matrix A and the second matrix B
- tile_A represents the first loop block data of the first matrix A, whose size is tile_M × tile_K
- tile_B represents the second loop block data of the second matrix B, whose size is tile_K × tile_N
- tile_C represents the matrix multiplication result of the first loop block data tile_A and the second loop block data tile_B.
- the loop block order can be a multi-layer loop (for example, including three layers of nested loops), a first outer loop order in the direction of the number of rows of the first matrix A (such as the loop in the M direction in Figure 3), a second outer loop order in the direction of the number of columns of the second matrix B (such as the loop in the N direction in Figure 3), and an inner loop order in the direction of the number of columns of the first matrix A and the number of rows of the second matrix B (such as the loop in the K direction in Figure 3).
- step S11 the control logic unit can sequentially read the first circular block data tile_A of the first matrix A and the second circular block data tile_B of the second matrix B from the memory into the dot multiplication unit array in accordance with the circular block order, so that in step S12, the dot multiplication unit array performs a multiplication-accumulation operation on the first circular block data tile_A and the second circular block data tile_B obtained each time, to obtain a circular block result tile_C corresponding to each multiplication-accumulation operation. Furthermore, in step S13, the control logic unit performs a cumulative operation on the multiple circular block results tile_C obtained sequentially from the dot multiplication unit array in any round of the circular block order to obtain a logical operator result; the pseudo code is as follows:
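A minimal Python reconstruction of this three-layer loop (M-direction outer loop, N-direction outer loop, K-direction inner loop), for illustration only: it assumes M, N, and K are exact multiples of the tile sizes, and `tile_mul` stands in for the dot multiplication unit array's multiply-accumulate on one (tile_A, tile_B) pair.

```python
# Sketch of the loop block order: first outer loop in the M direction, second
# outer loop in the N direction, inner loop in the K direction. All names are
# illustrative; each tC is one loop block result, accumulated into C_mn.

def block(X, r0, c0, rows, cols):
    # cut a rows x cols block out of nested-list matrix X
    return [row[c0:c0 + cols] for row in X[r0:r0 + rows]]

def tile_mul(tA, tB):
    # stand-in for the dot multiplication unit array
    return [[sum(a * b for a, b in zip(ra, cb)) for cb in zip(*tB)] for ra in tA]

def blocked_matmul(A, B, tile_M, tile_N, tile_K):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for m in range(0, M, tile_M):              # first outer loop (M direction)
        for n in range(0, N, tile_N):          # second outer loop (N direction)
            for k in range(0, K, tile_K):      # inner loop (K direction)
                tA = block(A, m, k, tile_M, tile_K)
                tB = block(B, k, n, tile_K, tile_N)
                tC = tile_mul(tA, tB)          # one loop block result
                for i in range(tile_M):        # accumulate into C_mn
                    for j in range(tile_N):
                        C[m + i][n + j] += tC[i][j]
    return C

A = [[1, 2, 3, 4]] * 2
B = [[1, 0], [0, 1], [1, 0], [0, 1]]
assert blocked_matmul(A, B, 1, 1, 2) == [[4, 6], [4, 6]]
```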
- the first outer loop (such as the loop in the M direction in Figure 3) can be looped M/tile_M times
- the second outer loop (such as the loop in the N direction in Figure 3) can be looped N/tile_N times
- the inner loop (such as the loop in the K direction in Figure 3) can be looped K/tile_K times, so that the first matrix A can be divided into M/tile_M rows and K/tile_K columns
- the second matrix B can be divided into K/tile_K rows and N/tile_N columns.
- each row-column block of the partitioned first matrix A corresponds to a first loop block data tile_A
- tile_A mk represents the first loop block data tile_A in the mth row and kth column of the first matrix A
- each row-column block of the partitioned second matrix B corresponds to a second loop block data tile_B
- tile_B kn represents the second loop block data tile_B in the kth row and nth column of the second matrix B.
- tile_A mk × tile_B kn represents the matrix multiplication of the first loop block data tile_A mk in the mth row and kth column of the first matrix A and the second loop block data tile_B kn in the kth row and nth column of the second matrix B.
- the loop block sequence may include (M/tile_M) × (N/tile_N) rounds of loops.
- the mth iteration of the first outer loop (such as the loop in the M direction in Figure 3) and the nth iteration of the second outer loop (such as the loop in the N direction in Figure 3), together with all K/tile_K iterations of the inner loop (such as the loop in the K direction in Figure 3), constitute one round of the loop, referred to as round (m, n).
- the control logic unit can accumulate the multiple loop block results tile_A mk × tile_B kn obtained from the dot product unit array in any round of the loop block sequence to obtain the logical operator result of round (m, n).
- the loop block order determined by the matrix multiplication operation type may also include: a first outer loop order in the direction of the number of columns of the second matrix, a second outer loop order in the direction of the number of rows of the first matrix, and an inner loop order in the direction of the number of columns of the first matrix and the direction of the number of rows of the second matrix.
- the loop block order may be a first outer loop order in the direction of the number of columns of the second matrix B (such as the loop in the N direction in Figure 3), a second outer loop order in the direction of the number of rows of the first matrix A (such as the loop in the M direction in Figure 3), and an inner loop order in the direction of the number of columns of the first matrix A and the direction of the number of rows of the second matrix B (such as the loop in the K direction in Figure 3).
- step S11 the control logic unit can sequentially read the first circular block data tile_A of the first matrix A and the second circular block data tile_B of the second matrix B from the memory into the dot multiplication unit array in accordance with the circular block order, so that in step S12, the dot multiplication unit array performs a multiplication-accumulation operation on the first circular block data tile_A and the second circular block data tile_B obtained each time, to obtain a circular block result tile_C corresponding to each multiplication-accumulation operation. Furthermore, in step S13, the control logic unit performs a cumulative operation on the multiple circular block results tile_C obtained sequentially from the dot multiplication unit array in any round of the circular block order to obtain a logical operator result; the pseudo code is as follows:
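This alternative order simply swaps the two outer loops relative to the previous one: the N direction becomes the first outer loop and the M direction the second, while the K-direction inner loop is unchanged. A hedged sketch of the traversal only (illustrative names, per-tile work omitted):

```python
# Illustrative traversal for the alternative loop block order: N outer,
# M second, K inner. Each (m, n, k) identifies the pair of tiles
# tile_A[m, k] and tile_B[k, n] read for that inner-loop iteration.

def loop_block_order_nmk(M, N, K, tile_M, tile_N, tile_K):
    order = []
    for n in range(0, N, tile_N):          # first outer loop (N direction)
        for m in range(0, M, tile_M):      # second outer loop (M direction)
            for k in range(0, K, tile_K):  # inner loop (K direction)
                order.append((m, n, k))
    return order

# all inner-loop iterations of the same round (m, n) stay adjacent
assert loop_block_order_nmk(2, 2, 2, 1, 1, 1)[:2] == [(0, 0, 0), (0, 0, 1)]
```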
- the previous loop block result can be stored in an accumulator cache (which may, for example, be part of the reuse cache).
- the previous loop block result can be read from the accumulator cache and added to the current loop block result, and the sum can be written back to the accumulator cache for use in the next inner-loop iteration.
- providing an accumulator cache for storing the intermediate data of the inner loop (for example, the loop in the K direction) helps reduce the number of memory accesses and the pressure on the memory bandwidth. In this way, multiple loop block results are accumulated based on the accumulator cache to obtain a logical operator result.
- the control logic unit writes the logical operator result into the register file; the control logic unit determines the logical operation result based on the multiple logical operator results obtained from the register file. For example, the control logic unit can write the logical operator result C mn obtained after each round of K/tile_K inner loops into the register file. The control logic unit then concatenates the M/tile_M × N/tile_N logical operator results C mn obtained from the register file to obtain the logical operation result of the first matrix A and the second matrix B. Providing a register file further reduces the number of accesses to the local memory outside the processor.
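The result path described above can be sketched as follows; the dict-based register file and `finish_round` are illustrative stand-ins, not structures defined by this disclosure.

```python
# Sketch: the K/tile_K loop block results of round (m, n) are accumulated
# elementwise into one logical operator result C_mn, which is written to the
# register file; concatenating all rounds yields the logical operation result.

register_file = {}

def finish_round(m, n, partial_results):
    # accumulate the loop block results of round (m, n) elementwise
    acc = [row[:] for row in partial_results[0]]
    for tile in partial_results[1:]:
        for i, row in enumerate(tile):
            for j, v in enumerate(row):
                acc[i][j] += v
    register_file[(m, n)] = acc          # one logical operator result C_mn

# three inner-loop results for round (1, 1), each a 1x1 tile
finish_round(1, 1, [[[1]], [[2]], [[3]]])
assert register_file[(1, 1)] == [[6]]
```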
- FIG4 illustrates a schematic diagram of repeated reading of cyclic block data according to an embodiment of the present disclosure.
- first matrix A can be divided into 3 × 2 blocks, namely: first cyclic block data tile_A 11 , first cyclic block data tile_A 12 , first cyclic block data tile_A 21 , first cyclic block data tile_A 22 , first cyclic block data tile_A 31 , and first cyclic block data tile_A 32 .
- the second matrix B can be divided into 2 × 2 blocks, namely: second cyclic block data tile_B 11 , second cyclic block data tile_B 12 , second cyclic block data tile_B 21 , and second cyclic block data tile_B 22 .
- when the first outer loop (for example, the loop along the rows of the first matrix A in FIG4) and the second outer loop (for example, the loop along the columns of the second matrix B in FIG4) are each performed for the first time, one inner loop is performed (for example, the loop along the columns of the first matrix A and the rows of the second matrix B in FIG4; one inner loop may include two iterations).
- the results of each inner loop iteration are accumulated.
- the logical operator result C 11 = tile_A 11 × tile_B 11 + tile_A 12 × tile_B 21 can be obtained.
- similarly, when the first outer loop is performed for the third time and the second outer loop for the second time, the results of each inner loop iteration are accumulated.
- the logical operator result C 32 = tile_A 31 × tile_B 12 + tile_A 32 × tile_B 22 is obtained.
- the first loop block data tile_A 31 and tile_A 32 are used repeatedly, as are the second loop block data tile_B 12 and tile_B 22 .
- as a result, the control logic unit repeatedly reads the first loop block data tile_A 31 and tile_A 32 and the second loop block data tile_B 12 and tile_B 22 from the local memory.
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching the loop block data
- the loop block data of the data to be processed are read from the memory in sequence to the dot product unit array according to the loop block order of the data to be processed, including: the control logic unit stores the first loop block data of the first matrix read from the local memory into the reuse cache according to the first outer loop order and the inner loop order, so that the control logic unit reuses the first loop block data stored in the reuse cache according to the second outer loop order; or, the control logic unit stores the second loop block data of the second matrix read from the local memory into the reuse cache according to the second outer loop order and the inner loop order, so that the control logic unit reuses the second loop block data stored in the reuse cache according to the first outer loop order.
- the control logic unit can read the first loop block data tile_A 11 of the first matrix A and the second loop block data tile_B 11 of the second matrix B from the local memory to obtain the loop block result tile_A 11 × tile_B 11 , and can store the first loop block data tile_A 11 of the first matrix A and the second loop block data tile_B 11 of the second matrix B into the reuse cache.
- the first loop block data tile_A 11 can be read from the reuse cache
- the second loop block data tile_B 12 can be read from the local memory to obtain the loop block result tile_A 11 × tile_B 12 .
- the first loop block data tile_A 21 can be read from the local memory
- the second loop block data tile_B 11 can be read from the reuse cache to obtain the loop block result tile_A 21 × tile_B 11 .
- the first loop block data tile_A 31 can be read from the local memory
- the second loop block data tile_B 11 can be read from the reuse cache to obtain the loop block result tile_A 31 × tile_B 11 .
- the control logic unit can read the first loop block data tile_A 12 of the first matrix A and the second loop block data tile_B 21 of the second matrix B from the local memory to obtain the loop block result tile_A 12 × tile_B 21 , and can store the first loop block data tile_A 12 of the first matrix A and the second loop block data tile_B 21 of the second matrix B into the reuse cache.
- the first loop block data tile_A 12 can be read from the reuse cache, and the second loop block data tile_B 22 can be read from the local memory to obtain the loop block result tile_A 12 × tile_B 22 .
- the first loop block data tile_A 22 can be read from the local memory
- the second loop block data tile_B 21 can be read from the reuse cache to obtain the loop block result tile_A 22 × tile_B 21 .
- the first loop block data tile_A 32 can be read from the local memory
- the second loop block data tile_B 21 can be read from the reuse cache to obtain the loop block result tile_A 32 × tile_B 21 .
- the remaining iterations of the first outer loop, the second outer loop, and the inner loop proceed in the same way; refer to the description above, which will not be repeated here.
- the specific method of reusing the first loop block data and the second loop block data based on the reuse cache in the multi-layer loop process can be set according to the actual application scenario, and the embodiment of the present disclosure does not limit this.
- for cost reasons, the reuse cache is usually not made very large.
- for example, the size of the reuse cache can be set to hold one first loop block data and one second loop block data at the same time.
- when cached loop block data needs to be read repeatedly, it can be read again from the reuse cache instead of from the local memory.
- the control logic unit can, in response to each update of the first outer loop, read the first loop block data of the first matrix from the local memory and store it in the reuse cache, so that the control logic unit reuses the first loop block data stored in the reuse cache while traversing the second outer loop.
- the control logic unit can, in response to each update of the second outer loop, read the second loop block data of the second matrix from the local memory and store it in the reuse cache, so that the control logic unit reuses the second loop block data stored in the reuse cache while traversing the first outer loop.
- the access information of the data to be processed includes access information of the data to be convolved and the convolution kernel
- the logical operation includes the convolution operation of the data to be convolved and the convolution kernel
- the circular block order determined by the convolution operation type includes the coordinate order of the elements in the convolution kernel
- the circular block data of the data to be processed are read from the memory in sequence to the dot product unit array according to the circular block order of the data to be processed, including: the control logic unit forms a third circular block data with multiple elements with the same coordinates read from multiple convolution kernels in the memory each time according to the coordinate order of the elements in the convolution kernel; and, according to the coordinates of the current element in each convolution kernel and the access information, reads multiple elements from the data to be convolved in the memory to form a fourth circular block data; the control logic unit writes the third circular block data and the fourth circular block data into the dot product unit array.
- the memory may include a local memory disposed outside the processor; or, in addition to the local memory disposed outside the processor, the memory may also include a reuse cache disposed within the processor for caching cyclic block data, and the embodiments of the present disclosure do not impose specific restrictions on this.
- the access information of the data to be convolved and the convolution kernel includes convolution description information, which can be used to characterize the dimensional information of the data to be convolved, the dimensional information of the convolution kernel, the dimensional information of the convolution output result, layout information, etc.
- the control logic unit can read the convolution description information and perform a mapping operation on the data to be convolved and the convolution kernel according to the convolution description information.
- Each mapping operation can read the third loop block data and the fourth loop block data once, and then associate the third loop block data and the fourth loop block data with the dot product unit array and perform calculations. In this way, traversing each element of the convolution kernel and accumulating the results can obtain the final convolution output result.
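The mapping-and-accumulate scheme described above can be illustrated with a minimal Python sketch: the convolution is computed as an accumulation of small per-coordinate products, one pass per kernel element coordinate (r, s). The function name and the plain nested-list layout are illustrative assumptions, not from the disclosure.

```python
def conv2d_as_matmul_sum(x, w, H_out, W_out):
    """x: input as [C][H][W]; w: kernels as [K][C][R][S] (nested lists).

    Traverses each kernel-element coordinate (r, s) and accumulates the
    partial products into the output, as each mapping operation does.
    """
    C = len(x)
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    out = [[[0.0] * W_out for _ in range(H_out)] for _ in range(K)]
    for r in range(R):
        for s in range(S):
            # one "mapping operation": kernel elements at (r, s) across all
            # kernels, multiplied with the matching input elements
            for k in range(K):
                for i in range(H_out):
                    for j in range(W_out):
                        acc = 0.0
                        for c in range(C):
                            acc += w[k][c][r][s] * x[c][i + r][j + s]
                        out[k][i][j] += acc
    return out
```

Traversing every (r, s) and accumulating yields the same result as a direct convolution, without materializing an expanded matrix.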
- the control logic unit can expand the matrix operation instructions to implement the convolution function according to the read convolution description information, converting the convolution operation into matrix multiplication and making the processor more versatile.
- FIG. 5 shows a schematic diagram of the third cycle block data of the convolution kernel according to an embodiment of the present disclosure.
- the memory stores a convolution kernel C × R × S × K, where C represents the channel dimension of the convolution kernel, R represents the height dimension of the convolution kernel, S represents the width dimension of the convolution kernel, and K represents the number dimension of the convolution kernel.
- each convolution kernel may include R × S × C/tile_K convolution kernel elements, each of which occupies one unit of space in the height dimension R and the width dimension S, and may occupy tile_K units of space in the channel dimension C.
- the size of each third-cycle block data tile_B_rsc is tile_K × tile_N, where tile_N represents the number of convolution kernels and tile_K represents the size of each convolution kernel element in the channel dimension.
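A minimal sketch of forming one such tile_K × tile_N block from a kernel tensor stored as [K][C][R][S]: the block gathers, at a fixed kernel coordinate (r, s), tile_K consecutive channels from tile_N consecutive kernels. The function name and argument layout are illustrative assumptions.

```python
def third_block(w, r, s, c0, n0, tile_K, tile_N):
    """Gather a tile_K x tile_N block of kernel elements at coordinate (r, s).

    w: kernels as [K][C][R][S]; c0/n0: starting channel and kernel index.
    Row c of the result holds channel c0+c of kernels n0..n0+tile_N-1.
    """
    return [[w[n0 + n][c0 + c][r][s] for n in range(tile_N)]
            for c in range(tile_K)]
```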
- FIG. 6 shows a schematic diagram of the fourth cycle block data of the data to be convolved according to an embodiment of the present disclosure.
- the memory stores the data to be convolved C × H × W (for example, including an input image of size C × H × W), where C represents the channel dimension of the data to be convolved, H represents the height dimension of the data to be convolved, and W represents the width dimension of the data to be convolved.
- the data to be convolved may include C × H × W/tile_K elements to be convolved, each of which occupies one unit of space in the height dimension H and the width dimension W respectively, and may occupy tile_K units of space in the channel dimension C.
- according to the coordinate order of the elements in the convolution kernel, the control logic unit can calculate the coordinates of the data to be convolved based on the coordinates (r, s, c) of the current element in each convolution kernel and the dimension information of the convolution output result included in the access information, read tile_M elements from the data to be convolved in the memory according to those coordinates, and map them to the fourth loop block data.
- the number of tile_M elements can be determined by the dimension information of the convolution output result.
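The coordinate calculation described above can be sketched as follows: for a given position in the convolution output and a given kernel-element coordinate (r, s), the input element that the fourth loop block needs is located at a fixed offset. This is a hedged illustration with unit dilation; the function name and the `stride` parameter are assumptions, not from the disclosure.

```python
def input_coords(out_idx, W_out, r, s, stride=1):
    """Map a linear output index and a kernel coordinate (r, s) to the
    (row, column) of the required element of the data to be convolved."""
    i, j = divmod(out_idx, W_out)     # row/column in the convolution output
    return i * stride + r, j * stride + s
```

Applying this mapping to tile_M consecutive output positions yields the tile_M input elements that form one fourth loop block.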
- the control logic unit can write the third circular block data and the fourth circular block data corresponding to the coordinates of each element in the convolution kernel into the dot multiplication unit array, so that the dot multiplication unit array performs matrix multiplication operations on the third circular block data and the fourth circular block data.
- for the matrix multiplication operation, please refer to the description of the matrix multiplication operation above, which will not be repeated here.
- the embodiments of the present disclosure can determine the third-loop block data of the convolution kernel and the fourth-loop block data of the data to be convolved, respectively, according to the coordinate order of the elements in the convolution kernel.
- This eliminates the need for im2col (for example, sliding the convolution kernel on the data to be convolved, converting the data contained in each convolution kernel window into a column vector, and finally arranging them into a new matrix by column) expansion in the memory, thereby reducing the pressure on the memory bandwidth.
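For contrast, a minimal sketch of the im2col expansion that the approach above avoids: each kernel-window position of the input becomes one column, so the expanded matrix replicates input elements roughly R × S times. This is an illustrative reference implementation, not part of the disclosed method.

```python
def im2col(x, R, S):
    """Expand input x ([C][H][W]) for an R x S kernel, stride 1, no padding.

    Returns H_out * W_out columns, each of length C * R * S; this is the
    memory expansion that the coordinate-order scheme makes unnecessary.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    H_out, W_out = H - R + 1, W - S + 1
    cols = []
    for i in range(H_out):
        for j in range(W_out):
            col = [x[c][i + r][j + s]
                   for c in range(C) for r in range(R) for s in range(S)]
            cols.append(col)
    return cols
```

Materializing this matrix in memory multiplies the read traffic by up to R × S, which is exactly the bandwidth pressure the embodiment avoids.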
- this method is conducive to reusing the matrix calculation structure to implement the convolution engine, so that the convolution description information can be used to expand the matrix operation instructions and realize the convolution function.
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching circular block data.
- the method also includes: in the process of calculating the fourth circular block data corresponding to the same row in the convolution output result, the control logic unit writes M elements of the fourth circular block data corresponding to the coordinates of the element in the rth row and sth column of the convolution kernel into the reuse cache according to the coordinate order of the elements in the convolution kernel, where r, s, and M are positive integers; the control logic unit reads M-1 elements from the reuse cache, reads 1 element from the data to be convolved in the local memory, and determines the fourth circular block data corresponding to the coordinates of the element in the rth row and s+1th column of the convolution kernel.
- FIG. 7 shows a schematic diagram of a reuse cache according to an embodiment of the present disclosure.
- the input data of the convolution is reused, further reducing the number of accesses to the local memory outside the processor and the bandwidth pressure on the local memory outside the processor.
- in the process of calculating the fourth circular block data corresponding to adjacent rows in the convolution output result, in response to the control logic unit calculating the fourth circular block data of the current row of the convolution output result, M elements of the fourth circular block data corresponding to the coordinates of the element in the rth row and sth column of the convolution kernel are written into the reuse cache; in response to the control logic unit calculating the fourth circular block data of the next row of the current row of the convolution output result, M elements are read from the reuse cache to constitute the fourth circular block data corresponding to the coordinates of the element in the r-1th row and sth column of the convolution kernel.
- FIG. 8 is a schematic diagram of another reuse cache according to an embodiment of the present disclosure.
- the implementation of convolution operation requires the addition of software im2col (for example, sliding the convolution kernel on the data to be convolved, and then converting the data contained in each convolution kernel window into a column vector, and finally arranging them into a new matrix by column) calculation logic, which will introduce additional overhead.
- for example, for a 3×3 convolution kernel, im2col expansion requires 9 times the amount of data, which puts greater pressure on the storage bandwidth.
- the embodiment of the present disclosure can determine the third loop block data of the convolution kernel and the fourth loop block data of the data to be convolved according to the coordinate order of the elements in the convolution kernel, and can reuse the matrix calculation structure to implement the convolution engine.
- control logic unit forms third circular block data with multiple elements having the same coordinates read from the multiple convolution kernels in the memory each time according to the coordinate order of the elements in the convolution kernel, including: the control logic unit reads multiple convolution kernels from the local memory to the reuse cache; and forms third circular block data with multiple elements having the same coordinates read from the multiple convolution kernels in the reuse cache each time according to the coordinate order of the elements in the convolution kernel.
- the control logic unit can read the three convolution kernels [S1, S2], [S3, S4], and [S5, S6] from the local memory into the reuse cache.
- the control logic unit reads multiple elements S1, S3, and S5 at coordinate 1 from the reuse cache for the first time according to the coordinate order of the elements in the convolution kernels [S1, S2], [S3, S4], and [S5, S6] to form the third circular block data [S1, S3, S5].
- the control logic unit reads multiple elements S2, S4, and S6 at coordinate 2 from the reuse cache for the second time to form the third circular block data [S2, S4, S6].
- the present disclosure only takes the convolution kernels [S1, S2], [S3, S4], and [S5, S6] as examples, and does not limit the size and number of the convolution kernels.
- the convolution kernel has been read into the reuse cache inside the processor, and the convolution kernel stored in the reuse cache can be reused, avoiding repeated reading of the local memory outside the processor.
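The example above can be sketched directly: elements with the same coordinate across the cached kernels are gathered into one third circular block at a time.

```python
# Three kernels of two elements each, as read into the reuse cache.
kernels = [["S1", "S2"], ["S3", "S4"], ["S5", "S6"]]

# For each coordinate i, gather the element at that coordinate from every
# kernel — coordinate 1 yields [S1, S3, S5], coordinate 2 yields [S2, S4, S6].
third_blocks = [[k[i] for k in kernels] for i in range(len(kernels[0]))]
```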
- the dot product unit array performs a multiplication-accumulation operation on each received cyclic block data of the to-be-processed data to determine a cyclic block result for each received cyclic block data, including: the dot product unit array performs a multiplication-accumulation operation on each received third cyclic block data and fourth cyclic block data to determine a cyclic block result.
- the dot product unit array performs a multiplication-accumulation operation on each received first cyclic block data and second cyclic block data, which will not be repeated here.
- subsequently, the control logic unit can perform accumulation operations on the multiple loop block results obtained from the dot multiplication unit array in sequence according to the loop block order to obtain a logical operator result and write it into the register file; the control logic unit can then determine the convolution output result based on the multiple logical operator results obtained from the register file.
- the control logic unit sequentially reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array according to the control instruction obtained for indicating the logical operation of the data to be processed, so that the dot multiplication unit array performs multiplication and accumulation operations on the cyclic block data of the data to be processed each time it is received, and determines the cyclic block result of the data to be processed each time it is received.
- the control logic unit determines the logical operation result of the data to be processed based on the multiple cyclic block results obtained from the dot multiplication unit array.
- the reading and logical operation of the data to be processed can be converted into the reading and logical operation of multiple cyclic block data (small-sized block data) of the data to be processed, which is beneficial for processing larger-sized data while keeping the processor hardware resources unchanged, reducing the pressure on the storage bandwidth.
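The conversion described above corresponds to classic loop tiling of matrix multiplication: the two outer loops walk blocks of the output, and the inner loop walks blocks of the shared dimension, so only small blocks are resident at any time. A minimal Python sketch under those assumptions (names are illustrative):

```python
def tiled_matmul(A, B, tile):
    """Multiply A (M x Kd) by B (Kd x N) in tile-sized blocks.

    Outer loops over rows of A and columns of B select an output block;
    the inner loop over the shared dimension accumulates block products,
    mirroring the first/second outer loop order and inner loop order.
    """
    M, Kd, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, Kd, tile):
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        for k in range(k0, min(k0 + tile, Kd)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Because only one block of A and one block of B are needed per inner-loop step, matrices far larger than the dot product unit array can be processed with fixed hardware resources.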
- the present disclosure also provides a processor, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any data processing method provided by the present disclosure.
- for the corresponding technical solutions and descriptions, refer to the corresponding records in the method section; they will not be repeated here.
- the processor shown in FIG. 1 includes a control logic unit and a dot multiplication unit array
- the dot multiplication unit array includes multiple dot multiplication units for performing multiplication and accumulation operations
- the processor is used to: the control logic unit reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array in sequence according to the obtained control instructions, and the control instructions are used to indicate the logical operation of the data to be processed;
- the dot multiplication unit array performs a multiplication and accumulation operation on the cyclic block data of the data to be processed each time it is received, and determines the cyclic block result of the data to be processed each time it is received; the control logic unit determines the logical operation result of the data to be processed based on the multiple cyclic block results obtained from the dot multiplication unit array.
- the control logic unit sequentially reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array according to the acquired control instruction, including: the control logic unit parses the access information of the data to be processed and the logical operation type of the data to be processed according to the acquired control instruction; determines the cyclic block order of the data to be processed according to the logical operation type of the data to be processed; and reads the cyclic block data of the data to be processed from the memory to the dot multiplication unit array according to the access information and the preset size and the cyclic block order of the data to be processed, wherein the size of the cyclic block data is less than or equal to the preset size, and the preset size is determined by the number of idle dot multiplication units in the dot multiplication unit array.
- the access information of the data to be processed includes access information of a first matrix and a second matrix
- the number of columns of the first matrix is the same as the number of rows of the second matrix
- the logical operation type includes a matrix multiplication operation of the first matrix and the second matrix
- the loop block order determined by the matrix multiplication operation type includes: a first outer loop order in the direction of the number of rows of the first matrix, a second outer loop order in the direction of the number of columns of the second matrix, and an inner loop order in the direction of the number of columns of the first matrix and the number of rows of the second matrix; or, a first outer loop order in the direction of the number of columns of the second matrix, a second outer loop order in the direction of the number of rows of the first matrix, and an inner loop order in the direction of the number of columns of the first matrix and the number of rows of the second matrix.
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching loop block data.
- the loop block data of the data to be processed are read from the memory in sequence to the dot product unit array according to the loop block order of the data to be processed, including: the control logic unit stores the first loop block data of the first matrix read from the local memory into the reuse cache according to the first outer loop order and the inner loop order, so that the control logic unit reuses the first loop block data stored in the reuse cache according to the second outer loop order; or, the control logic unit stores the second loop block data of the second matrix read from the local memory into the reuse cache according to the second outer loop order and the inner loop order, so that the control logic unit reuses the second loop block data stored in the reuse cache according to the first outer loop order.
- the dot multiplication unit array performs a multiplication and accumulation operation on the circular block data of the data to be processed each time it is received, and determines the circular block result of the data to be processed each time it is received, including: the dot multiplication unit array performs a multiplication and accumulation operation on the first circular block data and the second circular block data obtained each time, and obtains the circular block result corresponding to each multiplication and accumulation operation; the control logic unit determines the logical operation result of the data to be processed based on the multiple circular block results obtained from the dot multiplication unit array, including: the control logic unit performs an accumulation operation on the multiple circular block results obtained from the dot multiplication unit array in any round of the circular block sequence to obtain a logical operator result; the control logic unit writes the logical operator result into a register file; the control logic unit determines the logical operation result based on the multiple logical operator results obtained from the register file.
- the access information of the data to be processed includes access information of the data to be convolved and the convolution kernel
- the logical operation type includes the convolution operation of the data to be convolved and the convolution kernel
- the circular block order determined by the convolution operation type includes the coordinate order of the elements in the convolution kernel
- the circular block data of the data to be processed are read from the memory in sequence to the dot product unit array according to the circular block order of the data to be processed, including: according to the coordinate order of the elements in the convolution kernel, the control logic unit forms third circular block data from multiple elements with the same coordinates read each time from multiple convolution kernels in the memory, and, according to the coordinates of the current element in each convolution kernel and the access information, reads multiple elements from the data to be convolved in the memory to form fourth circular block data; the control logic unit writes the third circular block data and the fourth circular block data into the dot product unit array.
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching circular block data
- the processor is further used to: in the process of calculating the fourth circular block data corresponding to the same row in the convolution output result, the control logic unit writes M elements of the fourth circular block data corresponding to the coordinates of the element in the rth row and sth column of the convolution kernel into the reuse cache according to the coordinate order of the elements in the convolution kernel, where r, s, and M are positive integers; the control logic unit reads M-1 elements from the reuse cache, reads 1 element from the data to be convolved in the local memory, and determines the fourth circular block data corresponding to the coordinates of the element in the rth row and s+1th column of the convolution kernel.
- the processor is further used to: in the process of calculating the fourth circular block data corresponding to the adjacent rows in the convolution output result, in response to the control logic unit calculating the fourth circular block data of the current row of the convolution output result, write M elements of the fourth circular block data corresponding to the coordinates of the element in the rth row and sth column of the convolution kernel into the reuse cache, where r, s, and M are positive integers; in response to the control logic unit calculating the fourth circular block data of the next row of the current row of the convolution output result, read M elements from the reuse cache to constitute the fourth circular block data corresponding to the coordinates of the element in the r-1th row and sth column of the convolution kernel.
- the memory includes a local memory arranged outside the processor, and a reuse cache arranged within the processor for caching cyclic block data
- the control logic unit according to the coordinate order of the elements in the convolution kernel, converts the multiple elements with the same coordinates read from the multiple convolution kernels in the memory each time into third cyclic block data, including: the control logic unit reads the multiple convolution kernels from the local memory to the reuse cache; according to the coordinate order of the elements in the convolution kernel, the control logic unit converts the multiple elements with the same coordinates read from the multiple convolution kernels in the reuse cache each time into third cyclic block data.
- the dot product unit array performs a multiplication and accumulation operation on the cyclic block data of the data to be processed each time received to determine the cyclic block result of the data to be processed each time received, including: the dot product unit array performs a multiplication and accumulation operation on the third cyclic block data and the fourth cyclic block data received each time to determine the cyclic block result.
- the functions or modules included in the processor provided by the embodiments of the present disclosure can be used to execute the method described in the above method embodiment. Its specific implementation can refer to the description of the above method embodiment. For the sake of brevity, it will not be repeated here.
- the present disclosure also provides a computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions implement the above method when executed by a processor.
- the computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium.
- the embodiments of the present disclosure also provide an artificial intelligence chip, which includes the processor described above.
- the present disclosure also provides an electronic device including the processor described above.
- the electronic device may include a user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like.
- An embodiment of the present disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code.
- when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.
- the electronic device may be provided as a terminal, a server, or other forms of devices.
- FIG. 9 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure.
- the electronic device 1900 can be provided as a server or a terminal device.
- the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932 for storing instructions that can be executed by the processing component 1922, such as an application.
- the application stored in the memory 1932 can include one or more modules, each of which corresponds to a set of instructions.
- the processing component 1922 is configured to execute instructions to perform the above method.
- the electronic device 1900 may further include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958.
- the electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Microsoft's server operating system (Windows Server TM ), Apple's graphical user interface-based operating system (Mac OS X TM ), a multi-user multi-process computer operating system (Unix TM ), a free and open source Unix-like operating system (Linux TM ), an open source Unix-like operating system (FreeBSD TM ), or the like.
- a non-volatile computer-readable storage medium is also provided, such as a memory 1932 including computer program instructions that can be executed by the processing component 1922 of the electronic device 1900 to perform the above method.
- the present disclosure may be a system, method and/or computer program product.
- the computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
- Computer-readable storage media can be a tangible device that can hold and store the instructions used by the instruction execution device.
- Computer-readable storage media can be, for example, (but not limited to) an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof.
- A non-exhaustive list of computer-readable storage media includes: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove on which instructions are stored, and any suitable combination thereof.
- Computer-readable storage media as used herein are not to be interpreted as transient signals themselves, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses passing through fiber optic cables), or electrical signals transmitted through wires.
- the computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to each computing/processing device, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
- the network can include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
- the network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions to be stored in the computer-readable storage medium in each computing/processing device.
- the computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, and conventional procedural programming languages such as "C" language or similar programming languages.
- Computer-readable program instructions may be executed entirely on a user's computer, partly on a user's computer as a stand-alone software package, partly on a user's computer and partly on a remote computer, or entirely on a remote computer or server.
- the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., utilizing an Internet service provider to connect via the Internet).
- an electronic circuit such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized by utilizing the state information of the computer-readable program instructions.
- the electronic circuit may execute the computer-readable program instructions, thereby realizing various aspects of the present disclosure.
- These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, thereby producing a machine, so that when these instructions are executed by the processor of the computer or other programmable data processing device, a device is generated that implements the functions/actions specified in one or more blocks in the flowchart and/or block diagram.
- These computer-readable program instructions can also be stored in a computer-readable storage medium, where these instructions cause the computer, programmable data processing device, and/or other device to operate in a specific manner.
- the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing various aspects of the functions/actions specified in one or more blocks in the flowchart and/or block diagram.
- Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device so that a series of operational steps are performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to implement the functions/actions specified in one or more blocks in the flowchart and/or block diagram.
- each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
- the computer program product may be implemented in hardware, software, or a combination thereof.
- the computer program product is implemented as a computer storage medium.
- the computer program product is implemented as a software product, such as a software development kit (SDK).
- the writing order of each step does not mean a strict execution order and does not constitute any limitation on the implementation process.
- the specific execution order of each step should be determined by its function and possible internal logic.
Abstract
The present disclosure relates to a data processing method, a processor, a chip, and an electronic device. The method comprises the following steps: on the basis of an obtained control instruction, a control logic unit sequentially reads loop-block data of data to be processed from a memory into a dot-product unit array; the dot-product unit array performs a multiply-accumulate operation on the loop-block data received each time, and determines a loop-block result for the data received each time; and on the basis of a plurality of loop-block results obtained from the dot-product unit array, the control logic unit determines a logic operation result for the data to be processed. Embodiments of the present disclosure can convert the read and logic operations on the data to be processed into read and logic operations on multiple pieces of loop-block data of that data, so that larger data can be processed without changing the hardware resources of the processor, while reducing pressure on storage bandwidth.
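The loop-block scheme described in the abstract can be illustrated with a minimal software sketch. This is a hypothetical illustration, not the patented implementation: the outer loops play the role of the control logic unit that reads one loop block at a time, and the inner multiply-accumulate plays the role of the dot-product unit array. All names and the tile size are assumptions for illustration.

```python
def tiled_multiply_accumulate(a, b, tile=2):
    """Multiply matrix a (m x k) by b (k x n) one loop block at a time.

    Instead of reading all of a and b at once, each iteration touches only
    a tile-sized sub-block, and the partial (loop-block) results are
    accumulated into the final output -- so a fixed-size compute unit can
    process arbitrarily large inputs.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):              # control logic: row blocks
        for j0 in range(0, n, tile):          # column blocks
            for p0 in range(0, k, tile):      # reduction blocks, read sequentially
                # multiply-accumulate one loop block of the data to be processed
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = 0.0
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] += acc      # accumulate the loop-block result
    return out

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_multiply_accumulate(a, b))  # → [[58.0, 64.0], [139.0, 154.0]]
```

Because each loop block fits in a fixed-size buffer, only a small sub-block needs to be resident at a time, which is the bandwidth-reduction effect the abstract describes.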
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410472589.9A CN118277328B (zh) | 2024-04-18 | 2024-04-18 | Data processing method, processor, chip, and electronic device |
| CN202410472589.9 | 2024-04-18 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025218403A1 (fr) | 2025-10-23 |
Family
ID=91640010
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2025/082601 Pending WO2025218403A1 (fr) | 2024-04-18 | 2025-03-14 | Data processing method, processor, chip, and electronic device |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN118277328B (fr) |
| WO (1) | WO2025218403A1 (fr) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118277328B (zh) * | 2024-04-18 | 2025-04-18 | 摩尔线程智能科技(北京)股份有限公司 | Data processing method, processor, chip, and electronic device |
| CN118626294B (zh) * | 2024-08-13 | 2025-09-02 | 北京壁仞科技开发有限公司 | Data processing method and apparatus, electronic device, and computer-readable storage medium |
| CN119512979B (zh) * | 2025-01-16 | 2025-04-25 | 山东云海国创云计算装备产业创新中心有限公司 | Method and apparatus for determining a logical block address access sequence, and program product |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200159809A1 (en) * | 2018-11-15 | 2020-05-21 | Imec Vzw | Convolution Engine for Neural Networks |
| CN112862059A (zh) * | 2019-11-28 | 2021-05-28 | 华为技术有限公司 | Long short-term memory (LSTM) network computing device and computing device |
| US20210265015A1 (en) * | 2020-02-20 | 2021-08-26 | Illumina, Inc. | Hardware Execution and Acceleration of Artificial Intelligence-Based Base Caller |
| CN114707114A (zh) * | 2022-04-25 | 2022-07-05 | 上海壁仞智能科技有限公司 | Blocking method and apparatus, convolution operation method and apparatus, and storage medium |
| CN115982530A (zh) * | 2023-03-13 | 2023-04-18 | 苏州浪潮智能科技有限公司 | Accelerator operation control method, system, storage medium, apparatus, and device |
| CN118277328A (zh) * | 2024-04-18 | 2024-07-02 | 摩尔线程智能科技(北京)有限责任公司 | Data processing method, processor, chip, and electronic device |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106445471B (zh) * | 2016-10-13 | 2018-06-01 | 北京百度网讯科技有限公司 | Processor and method for executing a matrix multiplication operation on the processor |
| CN113989169B (zh) * | 2020-07-08 | 2024-12-10 | 北京硅升科技有限公司 | Dilated convolution accelerated computation method and apparatus |
| CN113641952B (zh) * | 2021-10-14 | 2022-02-08 | 北京壁仞科技开发有限公司 | Convolution device, convolution method, matrix splitting/aggregating apparatus, and matrix splitting/aggregating method |
| CN114970849B (zh) * | 2022-06-28 | 2024-08-13 | 西安交通大学 | Multi-array parallel computing method and system for a hardware accelerator |
| CN115357854B (zh) * | 2022-08-30 | 2025-09-09 | 无锡江南计算技术研究所 | Efficient matrix multiplication acceleration apparatus and method |
| CN117787365B (zh) * | 2023-12-29 | 2024-11-26 | 中科南京智能技术研究院 | Convolution data stream scheduling method, apparatus, medium, and device |
- 2024
  - 2024-04-18 CN CN202410472589.9A patent/CN118277328B/zh active Active
- 2025
  - 2025-03-14 WO PCT/CN2025/082601 patent/WO2025218403A1/fr active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| CN118277328A (zh) | 2024-07-02 |
| CN118277328B (zh) | 2025-04-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109284130B (zh) | | Neural network operation device and method |
| CN111310910B (zh) | | Computing device and method |
| US10140251B2 (en) | | Processor and method for executing matrix multiplication operation on processor |
| US11531540B2 (en) | | Processing apparatus and processing method with dynamically configurable operation bit width |
| US12046028B1 (en) | | Compiler system for deploying CNN models to FPGA-based high-performance accelerators |
| WO2025218403A1 (fr) | | Data processing method, processor, chip, and electronic device |
| CN108133270B (zh) | | Convolutional neural network acceleration method and apparatus |
| WO2020047823A1 (fr) | | Convolution over sparse and quantization neural networks |
| CN111651205B (zh) | | Apparatus and method for performing vector inner product operations |
| US20230259578A1 (en) | | Configurable pooling processing unit for neural network accelerator |
| CN106846235B (zh) | | Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instructions |
| CN113807998A (zh) | | Image processing method, object detection apparatus, machine vision device, and storage medium |
| CN111814957A (zh) | | Neural network operation method and related device |
| CN113724127A (zh) | | Implementation method of image matrix convolution, computing device, and storage medium |
| US11941383B1 (en) | | Compilation with caching of code analysis result |
| EP4073628B1 (fr) | | Column-data-driven arithmetic expression evaluation |
| US12395187B2 (en) | | Computer architecture with data decompression support for neural network computing |
| US12292838B2 (en) | | Host device performing near data processing function and accelerator system including the same |
| TW202542747A (zh) | | Data processing method, processor, chip, and electronic device |
| CN116842304A (zh) | | Computation method and system for irregular sparse matrices |
| KR102722476B1 (ko) | | Neural processing elements with increased precision |
| TWI902338B (zh) | | Processing method for atomic operations, electronic device, and computer-readable storage medium |
| US20230195660A1 (en) | | Memory expansion device performing near data processing function and accelerator system including the same |
| KR20230095795A (ko) | | Host device including NDP function and accelerator system including the same |
| KR102718583B1 (ko) | | Data processing method and apparatus for a neural network |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 25789273; Country of ref document: EP; Kind code of ref document: A1 |