WO2022151779A1 - Method and device for implementing a convolution operation, and data processing method and device - Google Patents
- Publication number: WO2022151779A1 (application PCT/CN2021/124460)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- matrix
- convolution
- sub
- data
- convolution kernel
- Prior art date
- Legal status
- Ceased
Classifications
- G—PHYSICS; G06—COMPUTING OR CALCULATING; COUNTING
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
- G06N3/08—Learning methods
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F9/28—Enhancement of operational speed, e.g. by using several microcontrol devices operating in parallel
Definitions
- the present invention relates to the technical field of neural networks, and in particular, to a method and apparatus for implementing convolution operations, a method and apparatus for data processing, a computing device and a computer-readable storage medium.
- CNN Convolutional Neural Network
- A convolutional neural network includes at least one convolutional layer.
- In such a layer, a convolution kernel performs a convolution operation on the matrix of the input data to extract a feature matrix.
- The convolution operation consists of a large number of multiplication and addition operations.
- A convolutional neural network typically includes multiple convolutional layers, and the convolution operations account for most of its computation. How to accelerate the convolution operation is therefore a very important technical issue.
- The present application provides a method and device for implementing a convolution operation, a data processing method and device, a computing device and a computer-readable storage medium, which can accelerate convolution operations with a sliding step size greater than 1.
- a first aspect of the present application provides a method for implementing a convolution operation, wherein the sliding step size s in the convolution operation is greater than 1, and the method includes:
- the sub-matrix extraction step includes: for each position in the matrix where element extraction has not been performed, starting from the first position and moving to each position according to the sliding step s to extract each element to form a sub-matrix;
- Each pair of data sub-matrix and convolution kernel sub-matrix is subjected to a convolution operation with a sliding step size of 1, and the matrices obtained by these operations are summed.
- In this way, the convolution operation with a sliding step size s greater than 1 is transformed into convolution operations with a sliding step size of 1 over multiple pairs of data sub-matrices and convolution kernel sub-matrices, so that existing acceleration algorithms can be used to perform the convolution operation.
- the acceleration algorithm includes the Winograd fast convolution algorithm and an improved algorithm based thereon.
- the Winograd fast convolution algorithm and its improved algorithm can be selected.
- Improved algorithms such as Cook-Toom algorithm, Coppersmith-Winograd algorithm, Agarwal-Cooley algorithm, etc.
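To make the first aspect concrete, here is a minimal numpy sketch for the 1D case, assuming valid (unpadded) convolution in the correlation sense used by neural networks, and an input/kernel pair for which (len(d) - len(g)) is divisible by s; function names are illustrative, not from the application.

```python
import numpy as np

def strided_conv1d(d, g, s):
    """Reference: direct 1D convolution (correlation) with sliding step size s."""
    n_out = (len(d) - len(g)) // s + 1
    return np.array([np.dot(d[p * s : p * s + len(g)], g) for p in range(n_out)])

def decomposed_conv1d(d, g, s):
    """First aspect: extract s pairs of sub-matrices at step s, convolve each
    pair with step size 1, and sum the resulting matrices."""
    total = None
    for i in range(s):                       # i: first position of each extraction
        di, gi = d[i::s], g[i::s]            # data / kernel sub-matrix pair
        partial = strided_conv1d(di, gi, 1)  # step-1 convolution of the pair
        total = partial if total is None else total + partial
    return total

d = np.arange(11.0)        # input data matrix of size 11
g = np.arange(5.0) + 1.0   # convolution kernel of size 5
assert np.allclose(decomposed_conv1d(d, g, 3), strided_conv1d(d, g, 3))
```

Each pair's convolution now has step size 1, so a fast step-1 algorithm such as Winograd can be applied to every pair.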
- a second aspect of the present application provides a method for implementing a convolution operation, wherein the sliding step size s in the convolution operation is greater than 1, including:
- the sub-matrix extraction step includes: for each position in the matrix where element extraction has not been performed, starting from the first position, and extracting each element to form a sub-matrix according to each position to which the sliding step s can move;
- Each pair of data sub-matrix and convolution kernel sub-matrix are placed according to their positions to form a data reorganization matrix and a convolution kernel reorganization matrix;
- the data reorganization matrix and the convolution kernel reorganization matrix perform a convolution operation with a sliding step size of 1.
- In this way, the convolution operation with a sliding step size s greater than 1 is transformed into a convolution operation with a sliding step size of 1 between the data reorganization matrix and the convolution kernel reorganization matrix, so that existing acceleration algorithms can be used to perform the convolution operation.
- the method further includes: performing hole filling between the convolution kernel sub-matrices to form the convolution kernel reorganization matrix.
- The convolution kernel reorganization matrix thus forms a sparse matrix, which is convenient for fast operation.
- the acceleration algorithm includes a fast Fourier transform convolution algorithm and an improved algorithm based thereon.
- the improved algorithm is, for example, the improved algorithm of the conventional FFT, and another example is the fast Number Theory Transformation (NTT) algorithm.
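The second aspect can likewise be sketched for the 1D case. The code below is one plausible reading of the reorganization, under the same assumptions as before (valid convolution, (len(d) - len(g)) divisible by s): the data sub-matrices are laid end to end, and the kernel sub-matrices are placed at the corresponding block offsets with zero-valued holes between them.

```python
import numpy as np

def conv1d_step1(d, g):
    """Plain step-1 valid convolution (correlation)."""
    return np.array([np.dot(d[p : p + len(g)], g)
                     for p in range(len(d) - len(g) + 1)])

def reorganize(d, g, s):
    """Second aspect (1D reading): concatenate the data sub-matrices into a
    data reorganization matrix of the same size as d, and place the kernel
    sub-matrices with zero 'holes' between them so that, at every sliding
    position, each kernel sub-matrix lines up with its own data block."""
    d_blocks = [d[i::s] for i in range(s)]
    g_blocks = [g[i::s] for i in range(s)]
    d_re = np.concatenate(d_blocks)        # data reorganization matrix
    g_re = np.zeros(sum(len(b) for b in d_blocks[:-1]) + len(g_blocks[-1]))
    offset = 0
    for db, gb in zip(d_blocks, g_blocks):
        g_re[offset : offset + len(gb)] = gb   # kernel block, then holes
        offset += len(db)                      # jump to the next data block
    return d_re, g_re

d = np.arange(11.0)
g = np.arange(5.0) + 1.0
d_re, g_re = reorganize(d, g, 3)
# one step-1 convolution of the reorganization matrices replaces the step-3 one
assert conv1d_step1(d_re, g_re).tolist() == [40.0, 85.0, 130.0]
```

Here g_re works out to [1, 4, 0, 0, 2, 5, 0, 0, 3]: a sparse matrix whose zero holes are exactly what the zero-skipping MAC units described later can bypass.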
- A third aspect of the present application provides a data processing method that includes a convolution operation, wherein at least one convolution operation is implemented using any of the methods of the first aspect, or any of the methods of the second aspect.
- the data processing can be the processing of corresponding data in neural network algorithms such as image recognition, video recognition, and speech recognition.
- Image recognition includes face recognition, lane detection, vehicle recognition, etc.
- Video recognition includes video classification, stereo vision matching, and more.
- a fourth aspect of the present application provides a device for implementing a convolution operation, which is used to process a convolution operation with a sliding step size s greater than 1, including:
- the first processing unit is configured to perform the sub-matrix extraction step cyclically on the input data matrix and the convolution kernel matrix, respectively, to generate multiple pairs of data sub-matrix and convolution kernel sub-matrix;
- the sub-matrix extraction step includes: for each position in the matrix where element extraction has not been performed, starting from the first position and moving to each position according to the sliding step s to extract each element to form a sub-matrix;
- the second processing unit is configured to perform a convolution operation with a sliding step size of 1 on each pair of the data sub-matrix and the convolution kernel sub-matrix, and perform a matrix summation on each matrix obtained by the operation.
- the acceleration algorithm includes the Winograd fast convolution algorithm or an improved algorithm based thereon.
- a fifth aspect of the present application provides a device for implementing a convolution operation, which is used to process a convolution operation with a sliding step size s greater than 1, including:
- the first processing unit is configured to perform the sub-matrix extraction step cyclically on the input data matrix and the convolution kernel matrix, respectively, to generate multiple pairs of data sub-matrix and convolution kernel sub-matrix;
- the sub-matrix extraction step includes: for each position in the matrix where element extraction has not been performed, starting from the first position, and extracting each element to form a sub-matrix according to each position to which the sliding step s can move;
- the second processing unit is used for arranging each pair of data sub-matrix and convolution kernel sub-matrix according to their positions to form a data reorganization matrix and a convolution kernel reorganization matrix;
- the third processing unit is used to perform a convolution operation with a sliding step size of 1 on the data reorganization matrix and the convolution kernel reorganization matrix.
- Optionally, the device further performs hole filling between the convolution kernel sub-matrices to form the convolution kernel reorganization matrix;
- the acceleration algorithm includes a fast Fourier transform convolution algorithm or an improved algorithm based thereon.
- A sixth aspect of the present application provides a data processing apparatus whose data processing includes a convolution operation, wherein at least one convolution operation is implemented using any of the methods of the first aspect, or any of the methods of the second aspect.
- a seventh aspect of the present application provides a computing device, including:
- at least one memory connected to the processor and storing program instructions which, when executed by the at least one processor, cause the at least one processor to perform the method of any one of the first aspect, the method of any one of the second aspect, or the method of the third aspect.
- the processor includes a convolution calculation unit, the convolution calculation unit includes each processing element PE, and the PE includes:
- the input transformation unit is used to perform matrix transformation calculation on the input data
- the convolution kernel transformation unit is used to perform matrix transformation calculation on the convolution kernel or intermediate calculation result data
- a matrix multiplication unit coupled with the input transformation unit and the convolution kernel transformation unit, and used for multiplying the output matrix of the input transformation unit and the convolution kernel transformation unit;
- the inverse transformation unit coupled with the matrix multiplication unit, is used for performing matrix inverse transformation calculation on the output data calculated by the matrix multiplication unit.
- the matrix multiplication unit includes:
- a first systolic array unit for outputting a third matrix according to the first matrix and the second matrix
- the second systolic array unit is configured to output a result matrix according to the first matrix and the third matrix.
- For example, the result matrix B^T dB can be output according to the first matrix B and the second matrix d. In the Winograd fast convolution algorithm, as defined for 2D convolution and for 3D convolution, formulas with this structure appear in the operation formulas, such as GgG^T, B^T dB, and A^T(...)A.
- Formulas with this structure can thus be computed quickly.
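In numpy terms, the two cascaded systolic arrays correspond to the two matrix products of B^T dB: the first array produces the third matrix B^T d, and the second multiplies it by B. A dataflow sketch (not a hardware model):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))   # first matrix: a transform matrix
d = rng.standard_normal((4, 4))   # second matrix: an input data tile

third = B.T @ d       # first systolic array: third matrix from B and d
result = third @ B    # second systolic array: result matrix B^T d B

# element-wise reference: (B^T d B)[i, j] = sum_{k, l} B[k, i] d[k, l] B[l, j]
ref = np.array([[sum(B[k, i] * d[k, l] * B[l, j]
                     for k in range(4) for l in range(4))
                 for j in range(4)] for i in range(4)])
assert np.allclose(result, ref)
```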
- The matrix multiplication unit includes MAC units arranged in an array.
- At least one MAC unit includes first input ends ki, second input ends gi, and a third input end pin; its outputs are first output ends oi and a second output end pou.
- Each second input gi is connected to a MUX whose output, together with each first input ki, is fed to a multiplier; the output hi of each multiplier and the second output pou are connected to an adder; the output ha of the adder and each first input end ki are fed to a MUX whose output goes to each output end oi.
- The output ha of the adder and the third input end pin are fed to a MUX whose output goes to the second output end pou.
- With the MUXes, when a 0-valued element of a matrix in the convolution operation, such as a 0-valued element of the convolution kernel reorganization matrix with holes, is input to the MAC unit, the MUX can select to bypass the processing. The handling of the many 0 elements of the sparse matrix is thereby omitted, accelerating the convolution operation for sparse matrices.
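A behavioral reading of this zero-skipping MAC cell can be sketched as follows; the port names ki, gi, pin, oi, pou follow the description above, but the zero-detect control logic is an assumption, since this excerpt describes only the MUX wiring:

```python
def mac_unit(k_in, g_in, p_in):
    """Behavioral sketch of one MAC cell (port names from the description:
    ki/gi/pin in, oi/pou out; the control condition is an assumption).
    A 0-valued weight makes the MUXes bypass the multiplier and adder, so
    the hole elements of a sparse kernel reorganization matrix cost nothing."""
    if g_in == 0:                  # MUX selects pass-through for 0 elements
        return k_in, p_in
    h = k_in * g_in                # multiplier output hi
    return k_in, p_in + h          # adder output ha becomes the new partial sum

# accumulate a dot product through a chain of MAC cells; the 0 'holes'
# of a kernel reorganization matrix are skipped
weights = [1, 4, 0, 0, 2, 5, 0, 0, 3]   # sparse kernel row with holes
data    = [0, 3, 6, 9, 1, 4, 7, 10, 2]  # matching data row
p = 0
for k, g in zip(data, weights):
    _, p = mac_unit(k, g, p)
assert p == 0*1 + 3*4 + 1*2 + 4*5 + 2*3   # only 5 of 9 positions do work
```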
- An eighth aspect of the present application provides a computer-readable storage medium on which program instructions are stored; the program instructions, when executed by a computer, cause the computer to perform any of the methods of the first aspect, any of the methods of the second aspect, or the method of the third aspect.
- The technical solution of the present application can be applied to convolution operations with different sliding step sizes. A convolution operation with a sliding step size greater than 1 is transformed into a convolution operation with a sliding step size of 1, so that existing fast convolution algorithms can be used, such as the traditional Winograd fast convolution algorithm, the FFT fast convolution algorithm and their improved algorithms, and time and space overhead can be better balanced.
- The present application is applicable to accelerating the convolution operations of 1D convolution, 2D convolution and 3D convolution, which can accelerate the training and inference of convolutional neural networks. Furthermore, the computing device of the present application provides a structure suitable for the Winograd fast convolution algorithm and the FFT fast convolution algorithm, and by adding a sparse systolic array, the processing of sparse matrices is optimized.
- FIG. 1 is a flowchart of a first embodiment of a method for implementing convolution operation provided by the application
- FIG. 2 is a flowchart of a second embodiment of a method for implementing convolution operations provided by the application
- FIG. 3A is a flowchart of a first specific embodiment of a 2D convolution operation implementation method provided by the present application;
- FIG. 3B is a schematic diagram of a 2D convolution operation;
- FIG. 3C is a schematic diagram of the extraction process of the sub-matrix in the 2D convolution of the present application;
- FIG. 3D is a schematic diagram of summing the matrices obtained after each pair of data sub-matrix and convolution kernel sub-matrix in the 2D convolution of the present application is respectively subjected to a convolution operation;
- FIG. 4 is a schematic diagram of an extraction process of a sub-matrix in the 1D convolution of the present application
- FIG. 5 is a schematic diagram of an extraction process of a sub-matrix in the 3D convolution of the present application
- FIG. 6A is a flowchart of a second specific embodiment of a 2D convolution operation implementation method provided by the present application;
- FIG. 6B is a schematic diagram of the convolution operation of the data reorganization matrix and the convolution kernel reorganization matrix in FIG. 6A;
- FIG. 7 is a schematic diagram of a first embodiment of a device for implementing convolution operation provided by the present application.
- FIG. 8 is a schematic diagram of a second embodiment of a convolution operation implementation device provided by the present application.
- FIG. 9 is a schematic diagram of an embodiment of a computing device of the present application.
- FIG. 10A is a schematic diagram of a specific implementation manner of a computing device of the present application.
- FIG. 10B is a schematic diagram of the logical structure of the PE in FIG. 10A;
- Fig. 10C is a logical schematic diagram of the matrix multiplication unit in Fig. 10B;
- FIG. 10D is a schematic structural diagram of a specific implementation manner of FIG. 10C;
- FIG. 10E is a schematic diagram of the MAC unit in FIG. 10D .
- 1D convolution means that the input matrix and the convolution kernel matrix are both one-dimensional matrices
- 2D convolution means that the input matrix and the convolution kernel matrix are both two-dimensional matrices
- 3D convolution means that both the input matrix and the convolution kernel matrix are three-dimensional matrices.
- the dimension here refers to the dimension of the matrix itself, not the number of channels of the input data or the number of convolution kernels.
- Single-parameter convolution kernel Also called a single-element convolution kernel: a convolution kernel with a single parameter, i.e. a size-1 kernel in 1D convolution, a 1*1 kernel in 2D convolution, and a 1*1*1 kernel in 3D convolution; the value of the kernel can be 1.
- Dilated/Atrous Convolution Also known as atrous convolution, dilated convolution, hole convolution, etc.: spaces or zeros are filled between the elements of the convolution kernel matrix to form a hole matrix, and the convolution operation is performed on the input matrix with this hole matrix.
- Winograd fast convolution algorithm is a fast convolution algorithm.
- It transforms the input data matrix and the convolution kernel matrix respectively, performs the Hadamard product, and then transforms the result to obtain the convolution result.
- the Hadamard product also known as element-wise product, is a type of matrix operation, which is the product of corresponding elements in two matrices.
- FFT fast convolution algorithm The fast Fourier transform (FFT) convolution algorithm, referred to as the FFT fast convolution algorithm for short.
- It is a fast convolution algorithm: the input data and the convolution kernel are Fourier-transformed, the transformed results are multiplied, and the convolution result is then obtained by inverse Fourier transform.
- Systolic array A matrix operation unit designed as a systolic array can greatly accelerate neural network computation.
- A standard systolic array includes multiple units arranged in a two-dimensional array; the elements of a first matrix and a second matrix are fed in along the first and second dimensions of the array, and the output is the result matrix of the operation on the two matrices.
- FIG. 10D shows a systolic array in an embodiment of the present application.
- Cook-Toom algorithm An improved Winograd algorithm.
- Agarwal-Cooley algorithm An improved Winograd algorithm.
- Winograd fast convolution algorithm For 1D, 2D and 3D convolution operations with a step size of 1, the operation formulas defined by the Winograd fast convolution algorithm take the standard forms Y = A^T[(Gg) ⊙ (B^T d)] in 1D and Y = A^T[(GgG^T) ⊙ (B^T dB)]A in 2D, where:
- Y represents the operation result
- d represents the input data matrix
- g represents the convolution kernel matrix
- G represents the convolution kernel transformation (Filter Transform) matrix
- B^T represents the input transformation (Input Transform) matrix
- A^T represents the output transformation (Output Transform) matrix
- ⊙ represents the Hadamard product
- R represents the matrix rotated 90° clockwise.
- The entire calculation process of the Winograd fast convolution algorithm logically includes the following steps: transforming the convolution kernel, transforming the input data, computing the Hadamard product of the two transformed matrices, and inverse-transforming the result.
- Using the Winograd fast convolution algorithm, the number of multiplications is reduced at the cost of a small increase in the number of additions. Since multiplication is generally slower than addition on computing devices, the convolution operation is sped up.
- the existing Winograd fast convolution algorithm and its improved algorithm are only suitable for the convolution operation with a step size of 1. When the step size is greater than 1, the algorithm cannot be used to speed up the convolution operation.
- The Winograd fast convolution algorithm is usually suitable for smaller convolution kernels, such as the 2*2 or 3*3 kernels mentioned above.
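For reference, the standard F(2,3) instance of the Winograd algorithm (3-tap kernel, two outputs, step size 1) can be written out with the widely published transform matrices; these specific matrices come from the standard literature, not from this application:

```python
import numpy as np

# Standard transform matrices for Winograd F(2,3).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Y = A^T [ (G g) ⊙ (B^T d) ] for a 4-element input and 3-element kernel.
    The Hadamard product needs only 4 multiplications instead of 6."""
    return A_T @ ((G @ g) * (B_T @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([10.0, 20.0, 30.0])
direct = np.array([d[0:3] @ g, d[1:4] @ g])   # step-1 convolution (correlation)
assert np.allclose(winograd_f23(d, g), direct)
```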
- For larger convolution kernels, the FFT fast convolution algorithm can be used to speed up the convolution operation.
- the principle of FFT implementation is that the convolution in the time domain and the multiplication in the frequency domain are equivalent, so it performs FFT transformation on the data to be convolved, performs IFFT transformation after multiplication in the frequency domain, and then extracts the convolution result.
- The following takes an image f(x,y) of size A*B and a convolution kernel h(x,y) of size C*D as an example of using the FFT to achieve fast 2D convolution.
- The FFT fast convolution algorithm replaces the large number of multiplications in the convolution calculation with the cost of three FFT computations, so the amount of calculation can be significantly reduced.
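The FFT route can be sketched as follows for a step-1 2D linear convolution: pad both matrices to size (A+C-1)*(B+D-1), multiply the two forward transforms elementwise, and inverse-transform (two forward FFTs plus one inverse FFT). Note that the correlation used in CNNs is obtained by flipping the kernel first.

```python
import numpy as np

def fft_conv2d(f, h):
    """Linear 2D convolution of f (A*B) with h (C*D) via the FFT."""
    A, B = f.shape
    C, D = h.shape
    shape = (A + C - 1, B + D - 1)
    F = np.fft.fft2(f, shape)            # forward FFT of the image
    H = np.fft.fft2(h, shape)            # forward FFT of the kernel
    return np.real(np.fft.ifft2(F * H))  # inverse FFT yields the result

def direct_conv2d(f, h):
    """Reference: direct linear convolution (sum over all overlaps)."""
    A, B = f.shape
    C, D = h.shape
    out = np.zeros((A + C - 1, B + D - 1))
    for i in range(A):
        for j in range(B):
            out[i : i + C, j : j + D] += f[i, j] * h
    return out

rng = np.random.default_rng(1)
f = rng.standard_normal((8, 8))
h = rng.standard_normal((3, 3))
assert np.allclose(fft_conv2d(f, h), direct_conv2d(f, h))
```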
- However, the FFT fast convolution algorithm is likewise only applicable to convolution operations with a step size of 1; when the step size is greater than 1, it cannot be used to speed up the convolution operation.
- In summary, the existing Winograd fast convolution algorithm and FFT fast convolution algorithm are both suitable only for convolution operations with a step size of 1.
- When the step size is greater than 1, these two algorithms cannot be used to speed up the convolution operation.
- The present application provides a convolution operation method that converts a convolution operation with a step size greater than 1 into a convolution operation with a step size of 1 through matrix transformation, which can then be processed by the above-mentioned fast convolution algorithms, or by other fast convolution algorithms.
- the Winograd fast convolution algorithm or the FFT fast convolution algorithm can be flexibly selected according to the size of the convolution kernel after the matrix transformation.
- This application can be applied in various fields that require convolution operations, such as neural network algorithms for image recognition, video recognition, and speech recognition.
- Image recognition includes face recognition, lane detection, vehicle recognition, etc.
- Video recognition includes video classification, stereo vision matching, etc.
- Corresponding product fields include mobile phones, for example the classification and identification of images in a phone album; the product field can also be smart vehicles.
- Figure 1 shows a flowchart of the first embodiment of the convolution operation implementation method provided by the present application.
- the convolution operation steps include:
- S110 Cyclically perform the sub-matrix extraction step on the input data matrix and the convolution kernel matrix, respectively, to generate multiple pairs of data sub-matrices and convolution kernel sub-matrices;
- the sub-matrix extraction step includes: for each position in the matrix where element extraction has not been performed, starting from the first position and moving to each position according to the sliding step s to extract each element to form a sub-matrix.
- S120 Perform a convolution operation with a sliding step size of 1 on each pair of the data sub-matrix and the convolution kernel sub-matrix, and perform a matrix summation on each matrix obtained by the operation.
- In this way, a convolution operation with a step size greater than 1 can be transformed into convolution operations with a step size of 1 over pairs of data sub-matrices and convolution kernel sub-matrices. Since the step size is 1, various existing convolution acceleration algorithms can be applied.
- An acceleration algorithm can be used when the data sub-matrix and the convolution kernel sub-matrix are subjected to the convolution operation with a sliding step size of 1.
- When the size of the convolution kernel is small, for example not larger than 3*3, the acceleration algorithm can be the Winograd fast convolution algorithm or an improved algorithm based on it.
- The improved algorithms include the Cook-Toom algorithm, the Coppersmith-Winograd algorithm, the Agarwal-Cooley algorithm, and the like.
- Figure 2 shows a flow chart of the second embodiment of the method for implementing convolution operation provided by the present application.
- the convolution operation step includes:
- S210 Cyclically perform the sub-matrix extraction step on the input data matrix and the convolution kernel matrix, respectively, to generate multiple pairs of data sub-matrices and convolution kernel sub-matrices;
- the sub-matrix extraction step includes: for each position in the matrix where element extraction has not been performed, starting from the first position, and extracting each element to form a sub-matrix according to each position to which the sliding step s can move;
- S220 Arrange the corresponding pairs of data sub-matrices and convolution kernel sub-matrices by position to form a data reorganization matrix and a convolution kernel reorganization matrix; among them, a data reorganization matrix of the same size as the data matrix, and a corresponding convolution kernel reorganization matrix, can be formed respectively.
- S230 Perform a convolution operation with a sliding step size of 1 on the data reorganization matrix and the convolution kernel reorganization matrix.
- holes may be further filled between the convolution kernel sub-matrices to form a convolution kernel reorganization matrix with holes, thereby forming a sparse matrix.
- the size of the holes between adjacent convolution kernel sub-matrixes may be one hole, or may be multiple holes, and the matrix is sparser with more holes.
- In this way, a convolution operation with a step size greater than 1 can be transformed into a step-size-1 convolution operation of the data reorganization matrix with the convolution kernel reorganization matrix, or with the convolution kernel reorganization matrix with holes.
- Since the step size is 1, various existing convolution acceleration algorithms can be applied.
- An acceleration algorithm can be used when the data reorganization matrix and the convolution kernel reorganization matrix are subjected to the convolution operation with a step size of 1.
- the FFT fast convolution algorithm or an improved algorithm based on it can be used.
- the improved algorithm is, for example, the improved algorithm of the conventional FFT, and another example is the fast Number Theory Transformation (NTT) algorithm.
- Each element is extracted, and the following sub-matrix is obtained, which is the sub-matrix extracted and generated in this pass.
- The extraction of the sub-matrix can also be understood as the matrix obtained by performing, with sliding step size s, a convolution with a single-parameter convolution kernel of value 1 over the matrix formed from the elements at positions Am,n through Ax,y, i.e. the matrix in the solid-line rectangular box above.
- Since the step size is s, it is not difficult to see that the elements at positions A1,1 through As-1,s-1, i.e. the elements in the dashed box above, are in fact the elements at the first position of each sub-matrix extraction.
- the above sub-matrix extraction steps are also applicable to 1D convolution and 3D convolution.
- For 1D convolution, the extraction of the sub-matrix starting from the position of element Am can be understood as the one-dimensional matrix obtained by convolving, with sliding step size s, a single-parameter convolution kernel of value 1 over the one-dimensional matrix formed from Am through Ax, where Ax refers to the last element.
- For 3D convolution, the extraction of the sub-matrix starting from the position of element Am,n,o can be understood as the three-dimensional matrix obtained by convolving, with sliding step size s, a single-parameter convolution kernel of value 1 over the three-dimensional matrix formed from Am,n,o through Ax,y,z, where Ax,y,z refers to the last element.
- FIG. 3A shows a flowchart of a first specific implementation method of a 2D convolution operation
- FIG. 3B shows a schematic diagram of a 2D convolution operation.
- The input data matrix size is 11*11.
- the size of the convolution kernel matrix is 5*5, and the sliding step size of the convolution operation is 3.
- With reference to FIG. 3C and FIG. 3D below, the 2D convolution operation method of the present application is described in detail, including the following steps:
- S310 Cyclically perform the sub-matrix extraction step on the input data matrix of size 11*11 and the convolution kernel matrix of size 5*5, respectively, to generate multiple pairs of data sub-matrices and convolution kernel sub-matrices;
- The first sub-matrix is extracted with a sliding step size of 3, obtaining the first sub-matrix of the convolution kernel shown in the figure;
- the second sub-matrix is extracted with a step size of 3, obtaining the second sub-matrix of the convolution kernel shown in the figure;
- the third sub-matrix is extracted with a step size of 3, obtaining the third sub-matrix of the convolution kernel shown in the figure;
- the fourth sub-matrix is extracted with a step size of 3, obtaining the fourth sub-matrix of the convolution kernel; the extraction process and subsequent extraction processes will not be described again.
- The data sub-matrices and convolution kernel sub-matrices respectively obtained by extraction from the data matrix and the convolution kernel matrix, as shown in FIG. 3D, can thus be obtained,
- resulting in 3^2 = 9 pairs of data sub-matrix and convolution kernel sub-matrix.
- S320 Perform a convolution operation with a sliding step size of 1 on the obtained pairs of the data sub-matrix and the convolution kernel sub-matrix, respectively, to obtain matrices with the same matrix size.
- the result matrix is obtained by summing the obtained matrices. See Figure 3D for a schematic diagram of this step.
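The decomposition of steps S310 and S320 can be sketched numerically as follows (a minimal NumPy illustration of the mathematics, not the patented hardware implementation; the helper names `strided_conv` and `decomposed_conv` are ours):

```python
import numpy as np

def strided_conv(data, kernel, s):
    """Reference: 'valid' cross-correlation with sliding step size s."""
    H, W = data.shape
    kh, kw = kernel.shape
    return np.array([[np.sum(data[p:p+kh, q:q+kw] * kernel)
                      for q in range(0, W - kw + 1, s)]
                     for p in range(0, H - kh + 1, s)])

def decomposed_conv(data, kernel, s):
    """Stride-s convolution as a sum of stride-1 convolutions of sub-matrices."""
    acc = None
    for i in range(s):
        for j in range(s):
            d_sub = data[i::s, j::s]    # data sub-matrix for phase (i, j)
            k_sub = kernel[i::s, j::s]  # matching convolution kernel sub-matrix
            part = strided_conv(d_sub, k_sub, 1)  # stride-1 convolution
            acc = part if acc is None else acc + part
    return acc

rng = np.random.default_rng(0)
D = rng.integers(0, 10, (11, 11))   # 11*11 input data matrix
K = rng.integers(0, 10, (5, 5))     # 5*5 convolution kernel, stride 3
```

With these sizes the 3^2 = 9 sub-matrix pairs each produce a 3*3 matrix, and their sum equals the direct stride-3 result.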
- Figure 4 shows a schematic diagram of the extraction of the convolution kernel sub-matrix in a 1D convolution operation.
- the size of the convolution kernel matrix shown in Figure 4 is 5*1, and the sliding step size of the convolution operation is 3.
- the difference in the sub-matrix extraction step is that, since this is a 1D convolution operation, the matrix has only one row compared with the 2D case, so sub-matrices are extracted only from the elements of this single row.
- FIG. 5 shows a schematic diagram of extraction of a convolution kernel sub-matrix in a 3D convolution operation, and only a schematic diagram of extraction of the first sub-matrix is shown here.
- the size of the convolution kernel matrix shown in Figure 5 is 5*5*5, and the sliding step size of the convolution operation is 3.
- S610 Cyclically perform the sub-matrix extraction step on the input data matrix of size 11*11 and on the convolution kernel matrix of size 5*5, respectively, to generate multiple pairs of data sub-matrices and convolution kernel sub-matrices;
- for this step, reference may be made to the above-mentioned step S310, which will not be repeated here.
- S620 As shown in FIG. 6B, place each data sub-matrix by position to form a data reorganization matrix with the same size as the data matrix; and place each convolution kernel sub-matrix by position, performing hole padding between the convolution kernel sub-matrices to form a convolution kernel reorganization matrix with holes.
- each pair of data sub-matrix and convolution kernel sub-matrix should be placed at corresponding positions in the two reorganization matrices; for example, both in the first position of their respective reorganization matrices, both in the second position, and so on.
- each sub-matrix may be placed with reference to the position, in the original matrix, of its first element as determined in the sub-matrix extraction of step S310.
- S630 Perform a convolution operation with a sliding step size of 1 on the data reorganization matrix and the convolution kernel reorganization matrix.
- the convolution kernel reorganization matrix is set to a size of 9*9, so that the convolution kernel matrix of original size 5*5 becomes a 9*9 matrix with holes; the hole size between adjacent convolution kernel sub-matrices is 2.
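Under the same 11*11 / 5*5 / stride-3 assumptions, the reorganization of steps S620-S630 can be sketched as follows (an illustrative NumPy sketch; the function names are ours, and the block-placement rule follows the first-element positions described in S620):

```python
import numpy as np

def conv1(data, kernel):
    """Plain 'valid' cross-correlation with sliding step size 1."""
    H, W = data.shape
    kh, kw = kernel.shape
    return np.array([[np.sum(data[p:p+kh, q:q+kw] * kernel)
                      for q in range(W - kw + 1)]
                     for p in range(H - kh + 1)])

def reorganize(data, kernel, s):
    """Build the data reorganization matrix and the hole-padded
    convolution kernel reorganization matrix for a stride-s convolution."""
    d_subs = [[data[i::s, j::s] for j in range(s)] for i in range(s)]
    k_subs = [[kernel[i::s, j::s] for j in range(s)] for i in range(s)]
    # Data reorganization matrix: block placement, same size as the data matrix.
    d_reorg = np.block(d_subs)
    # Each kernel sub-matrix goes to the same block offset as its data
    # sub-matrix; the gaps in between stay zero (the "holes").
    row_off = np.cumsum([0] + [d_subs[i][0].shape[0] for i in range(s - 1)])
    col_off = np.cumsum([0] + [d_subs[0][j].shape[1] for j in range(s - 1)])
    kh = row_off[-1] + k_subs[s - 1][0].shape[0]
    kw = col_off[-1] + k_subs[0][s - 1].shape[1]
    k_reorg = np.zeros((kh, kw), dtype=kernel.dtype)
    for i in range(s):
        for j in range(s):
            h, w = k_subs[i][j].shape
            k_reorg[row_off[i]:row_off[i] + h,
                    col_off[j]:col_off[j] + w] = k_subs[i][j]
    return d_reorg, k_reorg

rng = np.random.default_rng(1)
D = rng.integers(0, 10, (11, 11))   # 11*11 input data matrix
K = rng.integers(0, 10, (5, 5))     # 5*5 convolution kernel, stride 3
d_reorg, k_reorg = reorganize(D, K, 3)
# Direct stride-3 convolution for comparison.
direct = np.array([[np.sum(D[3*p:3*p+5, 3*q:3*q+5] * K)
                    for q in range(3)] for p in range(3)])
```

For these sizes the kernel reorganization matrix comes out 9*9 with hole size 2, and a single stride-1 convolution of the two reorganization matrices reproduces the stride-3 result.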
- FIG. 7 is a schematic diagram of the first embodiment of the convolution operation implementation device provided for this application.
- the convolution operation implementation device is used to process a convolution operation whose sliding step size s is greater than 1, and includes:
- the first processing unit 410 is configured to perform the sub-matrix extraction step cyclically respectively on the input data matrix and the convolution kernel matrix to generate multiple pairs of data sub-matrix and convolution kernel sub-matrix;
- the sub-matrix extraction step includes: For each position where element extraction has not been performed, start from the first position and move to each position according to the sliding step s to extract each element to form a sub-matrix;
- the second processing unit 420 is configured to perform a convolution operation with a sliding step size of 1 on each pair of the data sub-matrix and the convolution kernel sub-matrix, and perform a matrix summation on each matrix obtained by the operation.
- the acceleration algorithm includes the Winograd fast convolution algorithm and improved algorithms based thereon.
- FIG. 8 shows a schematic diagram of a second embodiment of the convolution operation implementation device provided by the present application.
- the convolution operation implementation device is used to process a convolution operation whose sliding step size s is greater than 1, and includes:
- the first processing unit 510 is configured to cyclically execute the sub-matrix extraction steps for the input data matrix and the convolution kernel matrix, respectively, to generate multiple pairs of data sub-matrices and convolution kernel sub-matrices; the sub-matrix extraction steps include: For each position where element extraction has not been performed, start from the first position and extract each element to form a sub-matrix according to each position that the sliding step s can move to;
- the second processing unit 520 is configured to place each pair of data sub-matrix and convolution kernel sub-matrix according to their positions to form a data reorganization matrix and a convolution kernel reorganization matrix;
- the third processing unit 530 is configured to perform a convolution operation with a sliding step size of 1 on the data reorganization matrix and the convolution kernel reorganization matrix.
- hole filling is performed between the convolution kernel sub-matrices to form the convolution kernel reorganization matrix.
- the accelerated algorithm includes a fast Fourier transform convolution algorithm and improved algorithms based thereon.
- FIG. 9 is a schematic structural diagram of a computing device 900 provided by an embodiment of the present application.
- the computing device 900 includes: a processor 910 , a memory 920 , and a communication interface 930 .
- the communication interface 930 in the computing device 900 shown in the figure may be used to communicate with other devices.
- the processor 910 can be connected with the memory 920 .
- the memory 920 may be used to store program codes and data. The memory 920 may be a storage unit within the processor 910, an external storage unit independent of the processor 910, or a combination of both.
- computing device 900 may also include a bus.
- the memory 920 and the communication interface 930 may be connected to the processor 910 through a bus.
- the bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus or the like.
- the bus can be divided into an address bus, a data bus, a control bus, and the like.
- the processor 910 may adopt a central processing unit (CPU).
- the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the processor 910 uses one or more integrated circuits to execute related programs to implement the technical solutions provided by the embodiments of the present application.
- the memory 920, which may include read-only memory and random access memory, provides instructions and data to the processor 910.
- a portion of processor 910 may also include non-volatile random access memory.
- the processor 910 may also store device type information.
- the processor 910 executes the computer-executable instructions in the memory 920 to execute the operation steps of the above method.
- the computing device 900 may correspond to the respective execution subject of the methods according to the various embodiments of the present application, and the above-mentioned and other operations and/or functions of each module in the computing device 900 respectively serve to realize the corresponding processes of each method in the embodiments of the present application; they will not be repeated here.
- FIG. 10A is a schematic diagram of a specific implementation manner of a computing device of the present application.
- the computing device is implemented by a chip, and the chip may be a neural network processor (NPU).
- the communication interface 930 in FIG. 9 can be realized by the host interface 710, the multi-port RAM interface 730 and the internal circuits of the chip; the processor 910 in FIG. 9 can be realized by the control unit 740, the convolution calculation unit 760, the accumulator unit 770 and the activator calculation unit 780; the memory 920 in FIG. 9 can be realized by the multi-port RAM 720, the unified cache unit 750 and the instruction memory 790.
- a host interface (Host Interface) 710 is used for data communication with an external host, and receives tasks or data (including the data of the input matrix and the convolution kernel matrix) assigned by the external host.
- the host interface 710 is an optional component.
- the external host may be a main CPU.
- the NPU acts as a co-processor to communicate with the external main CPU to obtain tasks assigned by the main CPU.
- the multi-ported RAM 720 has a plurality of internal storage areas, which can be respectively coupled with the control unit 740 to realize parallel reading or writing of data.
- an XOR-based multi-port RAM 720 can be used, and the multi-port RAM 720 can further improve the data bandwidth compared with the traditional DRAM, and is an optional component.
- the Multi-ported RAM Interface 730 is an interface used to access the multi-port RAM 720 and the unified cache unit 750, and is also used to realize data transfer between the multi-port RAM 720 and the unified cache unit 750.
- for example, the input data in the multi-port RAM 720 is transferred to the unified cache unit 750, or the calculation result data cached in the unified cache unit 750 is transferred to the multi-port RAM 720.
- the control unit (Controller) 740 is mainly used for instruction fetching, and for the read/write and sequential logic control of the multi-port RAM 720 and the unified cache unit 750 through the multi-port RAM interface 730.
- the unified buffer unit (Unified Buffer) 750 is used to store the temporary data of the convolution calculation unit 760, including input data (including the data of the input matrix and the convolution kernel matrix), the data of the intermediate processing, the output data, and the like.
- the convolution calculation unit (Convolution Engine) 760 is mainly used to implement convolution operations, including accelerated 1D, 2D or 3D convolution operations; it has multiple processing elements (PEs), in which the convolution operations are performed. For example, it can be used to implement the convolution operation of each data sub-matrix with its convolution kernel sub-matrix in FIG. 3D to obtain each result sub-matrix.
- the accumulator unit (Accumulator) 770 is mainly used for performing matrix sum operation processing on the operation result of the convolution calculation unit 760 . For example, it can be used to implement the matrix sum calculation of the respective result sub-matrices in Figure 3D.
- An activator calculation unit (Activation) 780 is configured to process the operation result of the accumulator unit 770, and the processing includes applying a nonlinear function to the operation result of the accumulator unit 770 to generate an activation value.
- the activator calculation unit 780 is equivalent to a function for realizing the activation function.
- Instruction memory (Instruction Buffer) 790 for storing instructions used by the control unit 740;
- the basic working principle is that, under the control of the control unit 740, the convolution calculation unit 760 reads each data sub-matrix and the corresponding convolution kernel sub-matrix from the unified buffer unit 750 and performs the convolution operations respectively; the accumulator unit 770 then sums the result sub-matrices, and the activator calculation unit 780 generates and outputs the activation values.
- the working principle of the convolution calculation unit 760 is further illustrated as follows. Assume there are first matrices A and corresponding second matrices B: the convolution calculation unit 760 reads the data of each second matrix B and buffers it on the PEs in the convolution calculation unit 760; it then reads the data of each first matrix A from the unified buffer unit 750 and performs the matrix operation with each second matrix B to obtain each third matrix C; the partial or final results of each third matrix C are provided to the accumulator unit 770 for the matrix summation operation.
- the first matrix A may be a data sub-matrix shown in FIG. 3D
- the second matrix B may be a convolution kernel sub-matrix corresponding to the data sub-matrix shown in FIG. 3D
- each pair of data sub-matrix and convolution kernel sub-matrix in FIG. 3D can be operated in parallel on the corresponding PE.
- the first matrix A may be the data reorganization matrix shown in FIG. 6B
- the second matrix B may be the convolution kernel reorganization matrix with holes shown in FIG. 6B .
- FIG. 10B is a schematic diagram of the logical structure of a PE in FIG. 10A .
- the PE includes:
- the input registers (Input Registers) 761 are used for buffering the input data, which may be the data of the input matrix; for example, it may be the data of the first matrix A mentioned above.
- the convolution kernel register (Filter Registers) 762 is used for buffering the convolution kernel or intermediate calculation result data, which can be each element of the convolution kernel matrix; for example, it can be the data of the second matrix B mentioned above.
- Input transform unit (Input Transform) 763 for performing matrix transformation calculation on the input data obtained from the input register 761;
- the convolution kernel transformation unit (Filter Transform) 764 is used to perform matrix transformation calculation on the convolution kernel or intermediate calculation result data obtained from the convolution kernel register 762;
- the matrix multiplication unit (Multiplicator Block) 765 is used to perform the matrix multiplication calculation on the output matrix of the input transformation unit 763 and the output matrix of the convolution kernel transformation unit 764; the matrix multiplication unit 765 will be further described later;
- Inverse transform unit (Inverse Transform) 766 for performing matrix inverse transform calculation on the output data of matrix multiplication unit 765;
- Output Registers (Output Registers) 767 are used to buffer calculation results for output.
- when implementing the Winograd fast convolution algorithm, the above-mentioned input transformation unit 763 is used to transform the input data matrix d, the convolution kernel transformation unit 764 is used to transform the convolution kernel matrix g, the matrix multiplication unit 765 is used to perform the Hadamard product operation, and the inverse transformation unit 766 is used to apply the output transform matrix A to the result.
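As a concrete illustration of these four units, the classic Winograd F(2x2, 3x3) transform matrices can be exercised as below (the tile size and the Lavin-Gray form of the matrices are our assumptions, since no specific tile size is fixed here):

```python
import numpy as np

# Winograd F(2x2, 3x3) transform matrices (Lavin & Gray formulation).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_tile(d, g):
    """One 2x2 output tile: Y = A^T [ (G g G^T) * (B^T d B) ] A."""
    U = G @ g @ G.T         # kernel transform (Filter Transform unit 764)
    V = B_T @ d @ B_T.T     # input transform (Input Transform unit 763)
    M = U * V               # Hadamard product (Multiplicator Block 765)
    return A_T @ M @ A_T.T  # inverse transform (Inverse Transform unit 766)

rng = np.random.default_rng(3)
d = rng.standard_normal((4, 4))   # 4x4 input data tile
g = rng.standard_normal((3, 3))   # 3x3 convolution kernel
direct = np.array([[np.sum(d[p:p+3, q:q+3] * g) for q in range(2)]
                   for p in range(2)])
```

The 4x4 tile yields a 2x2 output using 16 multiplications in the Hadamard step instead of the 36 of the direct method.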
- when implementing the fast Fourier transform convolution algorithm, the above-mentioned input transformation unit 763 and convolution kernel transformation unit 764 are respectively used to perform the FFT on the input data matrix and the convolution kernel matrix, the matrix multiplication unit 765 is used to perform the frequency-domain multiplication operation, and the inverse transformation unit 766 is used to perform the IFFT on the operation result.
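A minimal sketch of this FFT path (stride-1 'valid' convolution via the convolution theorem; NumPy's FFT stands in for the hardware transform units, and `fft_conv1` is our own illustrative name):

```python
import numpy as np

def fft_conv1(data, kernel):
    """'Valid' stride-1 cross-correlation via the convolution theorem:
    FFT both operands, multiply pointwise, inverse-FFT, crop."""
    H, W = data.shape
    kh, kw = kernel.shape
    k_flip = kernel[::-1, ::-1]             # correlation = conv with flipped kernel
    F = np.fft.rfft2(data)
    Gf = np.fft.rfft2(k_flip, s=(H, W))     # zero-pad kernel to data size
    full = np.fft.irfft2(F * Gf, s=(H, W))  # circular convolution result
    return full[kh - 1:, kw - 1:]           # keep the wrap-free 'valid' region

rng = np.random.default_rng(2)
D = rng.standard_normal((11, 11))
K = rng.standard_normal((5, 5))
direct = np.array([[np.sum(D[p:p+5, q:q+5] * K) for q in range(7)]
                   for p in range(7)])
```

Because the kernel is zero-padded to the data size, the lower-right (H-kh+1)*(W-kw+1) region of the circular result is free of wrap-around and matches the direct 'valid' convolution.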
- FIG. 10C is a schematic diagram of the matrix multiplication unit 765 in FIG. 10B .
- the matrix multiplication unit 765 uses a configurable systolic array (Systolic Tensor Array) for implementing the Winograd fast convolution algorithm.
- the systolic array includes at least two systolic array units, which are specifically described as follows:
- the first systolic array unit is used for receiving the first matrix B, buffering it in the interior, and receiving the second matrix d T , and using the first matrix B to operate the second matrix to output the third matrix (d T B) T ;
- the second systolic array unit is used for receiving the first matrix B and buffering it internally, and receiving the third matrix (d T B) T , and using the first matrix B to operate the third matrix to output the result matrix B T dB.
- the systolic array can thus output the result matrix B^T dB from the first matrix B and the second matrix d^T; this corresponds to the input transformation B^T dB defined in the Winograd fast convolution algorithm for 2D convolution introduced above.
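The two cascaded systolic passes can be checked numerically (a behavioral sketch of the dataflow only, not of the systolic timing; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))   # first matrix B, buffered inside each unit
d = rng.standard_normal((4, 4))   # data tile, streamed in as d^T

# First systolic array unit: receives d^T, multiplies by the buffered B,
# and streams the product out transposed.
t1 = (d.T @ B).T                  # equals B^T d

# Second systolic array unit: receives t1 and multiplies by the buffered B.
t2 = t1 @ B                       # equals B^T d B
```

Two passes through the same multiply-by-B unit, with a transpose in the stream between them, therefore realize the full two-sided transform B^T dB without a dedicated transpose stage.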
- FIG. 10D is a schematic structural diagram of the matrix multiplication unit 765, that is, a specific implementation of the systolic array described in FIG. 10C; the systolic array includes several MAC units arranged in an array. FIG. 10E is a schematic diagram of one MAC unit in FIG. 10D. As shown in FIG. 10D and FIG. 10E:
- the MAC unit includes several first input terminals k1-k4 (denoted ki) corresponding to the first matrix, several second input terminals g1-g4 (denoted gi) corresponding to the second matrix, and a third input pin pin; its outputs are the first output terminals o1-o4 (denoted oi) and the second output terminal pou;
- each second input terminal g1-g4 is connected to a MUX; each MUX output and the corresponding first input terminal k1-k4 are input into a multiplier, and the multipliers output h1-h4 (denoted hi);
- the multiplier outputs h1-h4 and the second output terminal pou are connected to an adder (ACC), which outputs ha; the adder output ha and each first input terminal k1-k4 are respectively input to a MUX and then output to the output terminals o1-o4; the adder output ha is also input, together with the third input pin pin, to a MUX, then output to a register (i.e., the square with a black triangle in the figure) for buffering, and finally output to the second output terminal pou.
- the MAC units and register groups are also arranged to reduce circuit area: registers are set only on the output sides of a group of MACs as a whole (the bottom and right sides of the group of MACs in the dotted-line box in the upper-left corner of the figure), requiring four sets of registers in total; compared with setting two sets of registers on the output side of every MAC, this saves four sets of registers and thereby reduces the circuit area.
- with the MUX, when zero-valued elements of a matrix in the convolution operation, such as the zero-valued hole elements of the convolution kernel reorganization matrix shown in FIG. 6B, are input to the MAC unit (for example, when a second input terminal g1-g4 receives a 0 value), the MUX can select whether to perform the processing; a large number of zero-element operations in the sparse matrix are thereby omitted, accelerating the convolution operation.
- the disclosed system, apparatus and method may be implemented in other manners.
- the apparatus embodiments described above are only illustrative.
- the division of the units is only a logical function division; in actual implementation there may be other division methods, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, as indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
- the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
- the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
- the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.
- Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program is used to perform the above methods, including at least one of the solutions described in the above embodiments.
- the computer storage medium of the embodiments of the present application may adopt any combination of one or more computer-readable media.
- the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
- the computer readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer readable storage media include: electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), Erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
- a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
- a computer-readable signal medium may include a propagated data signal in baseband or as part of a carrier wave, with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
- Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
- Computer program code for performing the operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
Abstract
The invention relates to a convolution operation implementation method, in which the sliding step size s of a convolution operation is greater than 1. The method comprises: cyclically performing a sub-matrix extraction step on an input data matrix and on a convolution kernel matrix, respectively, to generate a plurality of pairs of a data sub-matrix and a convolution kernel sub-matrix (S110), the sub-matrix extraction step comprising: for positions in the matrices at which no element extraction has been performed, moving from a first position to each position reachable with a sliding step size s so as to extract the elements forming a sub-matrix; performing a convolution operation with a sliding step size of 1 on each pair of data sub-matrix and convolution kernel sub-matrix, and then summing the obtained matrices (S120); or forming the data sub-matrices and the convolution kernel sub-matrices into a data reorganization matrix and a convolution kernel reorganization matrix, respectively, and performing a convolution operation with a sliding step size of 1 on the data reorganization matrix and the convolution kernel reorganization matrix. The convolution operation is thus transformed into a convolution operation with a step size of 1, so that an acceleration algorithm can be used to perform accelerated computation.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110040705.6 | 2021-01-13 | ||
| CN202110040705.6A CN114764615A (zh) | 2021-01-13 | 2021-01-13 | Implementation method of convolution operation, and data processing method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022151779A1 true WO2022151779A1 (fr) | 2022-07-21 |
Family
ID=82364184
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2021/124460 Ceased WO2022151779A1 (fr) | 2021-01-13 | 2021-10-18 | Procédé et dispositif de mise en œuvre d'opération de convolution, et procédé et dispositif de traitement de données |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN114764615A (fr) |
| WO (1) | WO2022151779A1 (fr) |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115204373A (zh) * | 2022-08-05 | 2022-10-18 | Guangdong University of Technology | Design method for fast convolution and cache mode of a convolutional neural network |
| CN115563443A (zh) * | 2022-09-23 | 2023-01-03 | Shanghai Biren Intelligent Technology Co., Ltd. | Convolution operation method and apparatus, convolution processing method, device and storage medium |
| CN115906978A (zh) * | 2022-11-22 | 2023-04-04 | Shanghai Jiao Tong University | Reconfigurable optical tensor convolution acceleration method and device based on time-frequency linkage |
| CN118964810A (zh) * | 2024-10-17 | 2024-11-15 | Zhejiang Xinmai Microelectronics Co., Ltd. | Digital circuit, chip and method for convolution operation |
| CN119416850A (zh) * | 2024-10-18 | 2025-02-11 | Beihang University | Neural network inference optimization method adapted to hardware tensor instructions and memory |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116348882A (zh) * | 2020-06-30 | 2023-06-27 | Huawei Technologies Co., Ltd. | Convolutional neural network data processing method and related device |
| CN115292662B (zh) * | 2022-08-18 | 2023-09-22 | Shanghai Enflame Technology Co., Ltd. | Convolution acceleration operation method and apparatus, electronic device and storage medium |
| CN115578243B (zh) * | 2022-10-09 | 2024-01-05 | Beijing Zhongke Tongliang Technology Co., Ltd. | Dilation processing method for sparse matrices |
| WO2024108584A1 (fr) * | 2022-11-25 | 2024-05-30 | Huawei Technologies Co., Ltd. | Sparse operator processing method and device |
| CN116366226A (zh) * | 2023-03-30 | 2023-06-30 | Ant Blockchain Technology (Shanghai) Co., Ltd. | Two-party joint data processing method and apparatus |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108388541A (zh) * | 2016-04-22 | 2018-08-10 | Beijing Zhongke Cambricon Technology Co., Ltd. | Convolution operation device and method |
| CN109754064A (zh) * | 2017-11-07 | 2019-05-14 | Samsung Electronics Co., Ltd. | Method and apparatus for a neural network performing deconvolution |
| CN110020678A (zh) * | 2019-03-25 | 2019-07-16 | Lenovo (Beijing) Co., Ltd. | Data processing method, electronic device and computer storage medium |
| CN111902813A (zh) * | 2018-03-27 | 2020-11-06 | SK Telecom Co., Ltd. | Apparatus and method for convolution operation |
| CN113641952A (zh) * | 2021-10-14 | 2021-11-12 | Beijing Biren Technology Development Co., Ltd. | Convolution device, convolution method, matrix disaggregation apparatus and matrix disaggregation method |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110288682B (zh) * | 2019-06-28 | 2023-09-26 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method and device for controlling mouth shape changes of a three-dimensional virtual portrait |
2021
- 2021-01-13 CN CN202110040705.6A patent/CN114764615A/zh active Pending
- 2021-10-18 WO PCT/CN2021/124460 patent/WO2022151779A1/fr not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| CN114764615A (zh) | 2022-07-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2022151779A1 (fr) | Convolution operation implementation method and device, and data processing method and device | |
| CN109992743B (zh) | Matrix multiplier | |
| CN106445471B (zh) | Processor and method for performing matrix multiplication operations on a processor | |
| CN112840356B (zh) | Operation accelerator, processing method and related device | |
| TWI834729B (zh) | Neural network processor and convolution operation method thereof | |
| CN108229645B (zh) | Convolution acceleration and computation processing method and apparatus, electronic device and storage medium | |
| CN108416327B (zh) | Target detection method and apparatus, computer device and readable storage medium | |
| CN110263909B (zh) | Image recognition method and apparatus | |
| WO2019109795A1 (fr) | Convolution operation processing method and related product | |
| CN107451652A (zh) | Efficient sparse parallel Winograd-based convolution scheme | |
| WO2021081854A1 (fr) | Convolution operation circuit and convolution operation method | |
| CN112703511B (zh) | Operation accelerator and data processing method | |
| WO2018107383A1 (fr) | Convolution computation method and device for a neural network, and computer-readable storage medium | |
| CN110050267A (zh) | System and method for data management | |
| CN106846235B (zh) | Convolution optimization method and system accelerated by NVIDIA Kepler GPU assembly instructions | |
| CN112219210B (zh) | Signal processing device and signal processing method | |
| CN112528219B (zh) | Memory device, operation method thereof, and computing device | |
| CN111369450A (zh) | Method and device for removing moire patterns | |
| JP2023541350A (ja) | Table convolution and acceleration | |
| CN115238863A (zh) | Hardware acceleration method, system and application for convolution layers of convolutional neural networks | |
| US20240184521A1 (en) | Computation apparatus, method, system, circuit, and device, and chip | |
| WO2022205197A1 (fr) | Matrix multiplier, matrix computation method, and related device | |
| CN109754062A (zh) | Execution method of convolution extension instruction and related product | |
| CN114254563A (zh) | Data processing method and apparatus, electronic device and storage medium | |
| CN114003201A (zh) | Matrix transformation method and apparatus, and convolutional neural network accelerator | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21918970 Country of ref document: EP Kind code of ref document: A1 |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 21918970 Country of ref document: EP Kind code of ref document: A1 |