
WO2024222115A1 - Neural network pruning method and apparatus, and data processing method and apparatus - Google Patents

Neural network pruning method and apparatus, and data processing method and apparatus

Info

Publication number
WO2024222115A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
pruning
neural network
zero
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/074982
Other languages
French (fr)
Chinese (zh)
Inventor
赵亢
于献智
陈汉亭
韩凯
胡婷
王云鹤
姚骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of WO2024222115A1 publication Critical patent/WO2024222115A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]

Definitions

  • the present application relates to the field of deep learning technology, and in particular to a neural network pruning method, a data processing method and a device.
  • Pruning refers to resetting some non-critical weights to 0 in redundant large models to achieve a sparse model.
  • the part that needs to be pruned can be selected using an arbitrary sparsity method, but the positions of the zero elements produced by an arbitrary pruning method are relatively random. Since there is no rule to follow, this does not bring about an improvement in inference performance.
  • the part that needs to be pruned can also be selected using a structured sparsity method, but in the structured sparsity method the convolution weights are pruned by whole channels, the sparsity granularity is coarse, and the accuracy is low.
  • the embodiments of the present application provide a neural network pruning method, a data processing method and a device, so as to provide a pruning method that balances inference performance and accuracy.
  • an embodiment of the present application provides a pruning method for a neural network, which can be performed by a pruning device, such as a processor in the pruning device.
  • the method includes: obtaining a first weight tensor of the neural network to be pruned; performing at least one pruning process on the first weight tensor to obtain a second weight tensor; wherein the second weight tensor includes multiple data blocks, each of which is composed of K pruning vectors, and each pruning vector is composed of M weights; K and M are both positive integers; the weight corresponding to the first position in the first pruning vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the other pruning vectors in the first data block except the first pruning vector is a zero element, the first data block is any data block among the multiple data blocks, and the first pruning vector is any pruning vector in the first data block.
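  • For illustration only (not part of the disclosure), the following minimal sketch checks this complementary constraint on one data block of shape (K, M); the function name is_complementary_block is hypothetical.

```python
import numpy as np

def is_complementary_block(block: np.ndarray) -> bool:
    """Check the complementary-sparsity constraint on one data block.

    `block` has shape (K, M): K pruning vectors, each holding M weights.
    The constraint: at every position m, exactly one of the K pruning
    vectors holds a non-zero weight; the other K-1 hold zeros.
    """
    nonzero_per_position = np.count_nonzero(block, axis=0)  # shape (M,)
    return bool(np.all(nonzero_per_position == 1))

# Example: K = 2, M = 4. Vector 0 is non-zero at positions 0 and 3,
# vector 1 at positions 1 and 2 -- complementary zero patterns.
block = np.array([[0.7, 0.0, 0.0, -1.2],
                  [0.0, 0.4, -0.9, 0.0]])
print(is_complementary_block(block))  # True
```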
  • the embodiment of the present application uses a finer granularity for sparsification, which can improve the accuracy of the network model in its application.
  • the sparsification is performed in a complementary manner with certain rules, which can be understood as a kind of semi-structured sparsity. Therefore, there is no need to use dense calculations in the inference stage, which reduces the amount of calculation of the model.
  • in the pruning method provided in the embodiment of the present application, only one non-zero element is selected at each position across the K pruning vectors to participate in the calculation. Since the other weights at the same position are all zero elements, the operation speed can be improved and the power consumption can be reduced when applied to acceleration hardware.
  • the solution provided in the embodiment of the present application can balance accuracy and acceleration.
  • the number of pruning vectors can be determined based on the sparsity, which provides a feasible way to ensure that, at the position where a non-zero element is located in one pruning vector, the other pruning vectors in the data block to which that pruning vector belongs have zero elements.
  • the pruning process is performed N times in a manner of increasing sparsity.
  • the accuracy of the network can be improved by using a progressive sparsification method.
  • the number of times N that the pruning process is performed is related to the number K of pruning vectors included in the data block and/or the sparsity of the second weight tensor.
  • each pruning can be understood as pruning the number of weights included in one pruning vector. Compared with pruning a considerable number of weights at a time, the accuracy of the network can be improved.
  • N = K-1, and the sparsity of the i-th pruning process satisfies: Si = i/K;
  • Si represents the sparsity of the i-th pruning process;
  • i is a positive integer, 1 ≤ i ≤ K-1.
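  • Under the relation reconstructed above (N = K-1 pruning passes with Si = i/K), the sparsity schedule can be sketched as follows; this is an illustrative snippet, not part of the disclosure.

```python
def sparsity_schedule(K: int) -> list[float]:
    """Sparsity after the i-th of N = K - 1 pruning passes: S_i = i / K."""
    return [i / K for i in range(1, K)]

print(sparsity_schedule(4))  # [0.25, 0.5, 0.75]
print(sparsity_schedule(8))  # [0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875]
```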
  • the i-th pruning process is performed, where i is less than or equal to N, including:
  • Pruning the three-dimensional weight tensor along the K dimension: resetting the qi weights with the least importance in the K dimension to zero, so as to obtain the three-dimensional weight tensor after the i-th pruning;
  • the input four-dimensional weight tensor is the first weight tensor or the four-dimensional weight tensor that has undergone the (i-1)-th pruning process.
  • the shape of the input four-dimensional weight tensor is (C out, C in, FH, FW), and the shape of the deformed three-dimensional weight tensor is (G, K, M), where G satisfies: G = (C out × C in × FH × FW)/(K × M);
  • C out represents the number of output activations of the neural network
  • C in represents the number of input activations of the neural network
  • FH represents the kernel height of the neural network
  • FW represents the kernel width of the neural network
  • the embodiment of the present application performs sparse processing on the C in dimension or the C in *FH*FW dimension, which can improve the accuracy compared with the outer-vector-wise (OVW) sparsity method performed on a single vector.
  • the OVW sparsity method performs sparse processing on the C out dimension; compared with that method, the embodiment of the present application can improve the acceleration efficiency.
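  • As a rough illustration of this deformation (one plausible realization, assuming C in is an exact multiple of K*M and that each data block is a contiguous chunk of K*M weights along C in; the disclosure does not fix these details), a NumPy sketch:

```python
import numpy as np

def deform_to_blocks(w: np.ndarray, K: int, M: int) -> np.ndarray:
    """Deform a (C_out, C_in, FH, FW) weight tensor into (G, K, M) data blocks.

    Sketch of one plausible deformation: move C_in to the last axis so that
    pruning runs along C_in, then cut the tensor into G data blocks of K
    pruning vectors with M weights each.  Assumes C_in is a multiple of K*M.
    """
    c_out, c_in, fh, fw = w.shape
    assert c_in % (K * M) == 0, "this sketch assumes C_in is a multiple of K*M"
    w = np.transpose(w, (2, 3, 0, 1))              # (FH, FW, C_out, C_in)
    G = (c_out * c_in * fh * fw) // (K * M)        # G = C_out*C_in*FH*FW / (K*M)
    return w.reshape(G, K, M)

w = np.random.randn(8, 16, 3, 3)                   # C_out=8, C_in=16, FH=FW=3
print(deform_to_blocks(w, K=4, M=4).shape)         # (72, 4, 4)
```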
  • the q i weights with the least importance in the K dimension are reset to zero, including:
  • an embodiment of the present application provides a data processing method, including:
  • the weight tensor of the neural network is obtained through pruning processing, and the weight tensor includes multiple data blocks, each of which is composed of K pruned vectors, and each pruned vector is composed of M weights; K and M are both positive integers; the weight corresponding to the first position in the first pruned vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the other pruned vectors in the first data block except the first pruned vector is a zero element, the first data block is any data block among the multiple data blocks, and the first pruned vector is any pruned vector in the first data block.
  • in the solution provided by the embodiment of the present application, only one non-zero element is selected at each position across the K pruning vectors to participate in the calculation. Since the other weights at the same position are all zero elements, the calculation speed can be increased and the power consumption can be reduced when applied to acceleration hardware.
  • the solution provided by the embodiment of the present application can balance accuracy and acceleration.
  • performing calculations on the vector representation of the data to be processed and the weight tensor included in the neural network includes:
  • the vector representation of the data to be processed is a first matrix, and the weight tensor is processed to obtain a second matrix and the index of the non-zero elements in the second matrix;
  • a multiplication operation is performed on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.
  • each row element in the second matrix corresponds to the weight of a data block
  • the number of column elements in the first matrix is K*M.
  • performing a multiplication operation on the second matrix and the first matrix according to an index of each non-zero element in the second matrix to obtain a calculation result includes:
  • determining, according to the indices of the non-zero elements in the second matrix, the elements in the first matrix to be operated on with the non-zero elements in each row of the second matrix, and using those elements to perform the multiplication operation on the second matrix and the first matrix to obtain the calculation result.
  • the number of non-zero elements in each data block is M;
  • the index value of a non-zero element belonging to the k-th pruning vector in a data block is k-1, which is used to indicate the index of the pruning vector to which the non-zero element belongs;
  • the shape of the weight tensor is (C out ,C in ,FH,FW),
  • C out represents the number of output activations of the neural network
  • C in represents the number of input activations of the neural network
  • FH represents the kernel height of the neural network
  • FW represents the kernel width of the neural network
  • the number of column elements of the second matrix is C out;
  • the number of row elements of the second matrix is C in *FH*FW;
  • C in *FH*FW = K*M;
  • the number of data blocks included in the weight tensor is C out.
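  • A hedged sketch of the compact encoding described above (non-zero weights plus per-position pruning-vector indices), assuming each data block already satisfies the complementary constraint; encode_block is a hypothetical name.

```python
import numpy as np

def encode_block(block: np.ndarray):
    """Encode one complementary-sparse data block of shape (K, M).

    Returns (values, indices): values[m] is the single non-zero weight at
    position m, and indices[m] (0..K-1) is the pruning vector it came from.
    """
    K, M = block.shape
    indices = np.argmax(block != 0, axis=0)      # which vector is non-zero at each position
    values = block[indices, np.arange(M)]        # gather the non-zero weights
    return values, indices

block = np.array([[0.7, 0.0, 0.0, -1.2],
                  [0.0, 0.4, -0.9, 0.0]])
values, indices = encode_block(block)
print(values)   # [ 0.7  0.4 -0.9 -1.2]
print(indices)  # [0 1 1 0]
```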
  • an embodiment of the present application provides an accelerator, including:
  • a storage circuit used to store a first matrix corresponding to the data to be processed, a second matrix corresponding to the weight tensor of the neural network, and the index of the non-zero elements in the second matrix;
  • the weight tensor of the neural network is obtained through pruning processing, and the weight tensor includes multiple data blocks, each of which is composed of K pruned vectors, and each pruned vector is composed of M weights; K and M are both positive integers; the weight corresponding to the first position in the first pruned vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the other pruned vectors in the first data block except the first pruned vector is a zero element, the first data block is any data block among the multiple data blocks, and the first pruned vector is any pruned vector in the first data block;
  • An operation circuit is used to perform a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.
  • each row element in the second matrix corresponds to the weight of a data block
  • the number of column elements in the first matrix is K*M.
  • the operation circuit includes a selection unit and a calculation unit
  • the selection unit is used to read, from the storage circuit, the non-zero elements of the i-th row in the second matrix according to the indices of the non-zero elements in the second matrix, and to read, from the storage circuit, the elements in the i-th column of the first matrix that are to be operated on with those non-zero elements according to the same indices;
  • the calculation unit is used to perform multiplication and addition operations on the non-zero elements in the i-th row of the second matrix and the read elements in the i-th column of the first matrix.
  • the storage circuit includes a storage unit and a selection unit, the storage unit is used to store a first matrix corresponding to the data to be processed, a second matrix corresponding to the weight tensor included in the neural network, and the index of non-zero elements in the second matrix;
  • the selection unit is used to read, from the storage circuit, the non-zero elements of the i-th row in the second matrix according to the indices of the non-zero elements in the second matrix and send them to the operation circuit, and to read, from the storage circuit, the elements in the i-th column of the first matrix that are to be operated on with the non-zero elements of the i-th row in the second matrix according to the same indices and send them to the operation circuit;
  • the operation circuit is specifically used to perform multiplication and addition operations on the non-zero elements in the i-th row of the second matrix and the elements in the i-th column of the first matrix.
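  • The select-then-multiply-accumulate behaviour described above can be modelled in software as follows (a simplified illustration, not the actual circuit), reusing the values/indices encoding from the sketch earlier; only M multiplications are needed per data block instead of K*M.

```python
import numpy as np

def block_dot(values: np.ndarray, indices: np.ndarray, x: np.ndarray, M: int) -> float:
    """Multiply-accumulate one encoded data block with an activation segment.

    values[m] / indices[m] describe the single non-zero weight at position m
    and the pruning vector it belongs to; x holds the K*M activations that the
    dense (K, M) block would multiply, laid out as K concatenated vectors of
    length M.  Only M multiplications are performed instead of K*M.
    """
    positions = indices * M + np.arange(M)   # pick the activations the non-zeros need
    return float(np.dot(values, x[positions]))

# Cross-check against the dense computation for a K = 2, M = 4 block.
block = np.array([[0.7, 0.0, 0.0, -1.2],
                  [0.0, 0.4, -0.9, 0.0]])
values = np.array([0.7, 0.4, -0.9, -1.2])    # non-zero weight at each position
indices = np.array([0, 1, 1, 0])             # pruning vector of each non-zero weight
x = np.random.randn(8)
print(np.isclose(block_dot(values, indices, x, M=4), block.ravel() @ x))  # True
```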
  • the number of non-zero elements in each data block is M;
  • the index value of a non-zero element belonging to the k-th pruning vector in a data block is k-1, which is used to indicate the index of the pruning vector to which the non-zero element belongs;
  • the shape of the weight tensor is (C out, C in, FH, FW), where C out represents the number of output activations of the neural network;
  • C in represents the number of input activations of the neural network;
  • FH represents the kernel height of the neural network
  • FW represents the kernel width of the neural network;
  • the number of column elements of the second matrix is C out *FH*FW, and the number of row elements of the second matrix is C in;
  • C in = K*M, and the number of data blocks included in the weight tensor is C out *FH*FW.
  • an embodiment of the present application provides a processing system, including:
  • a processor configured to obtain data to be processed and generate a vector representation of the data to be processed
  • An accelerator configured to perform calculations on the vector representation of the data to be processed and the weight tensor included in the neural network to obtain a processing result of the data to be processed;
  • the weight tensor of the neural network is obtained through pruning processing, and the weight tensor includes multiple data blocks, each of which is composed of K pruned vectors, and each pruned vector is composed of M weights; K and M are both positive integers; the weight corresponding to the first position in the first pruned vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the other pruned vectors in the first data block except the first pruned vector is a zero element, the first data block is any data block among the multiple data blocks, and the first pruned vector is any pruned vector in the first data block.
  • the processor is further used to process the vector representation of the data to be processed into a first matrix, and process the weight tensor to obtain a second matrix and the index of the non-zero elements in the second matrix, and send the first matrix, the second matrix and the index of the non-zero elements in the second matrix to the accelerator;
  • the accelerator is specifically used to perform a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.
  • each row element in the second matrix corresponds to the weight of a data block
  • the number of column elements in the first matrix is K*M.
  • the accelerator is specifically used to determine the elements in the first matrix to be operated on the non-zero elements in the i-th row of the second matrix according to the indices of the non-zero elements in the second matrix, so as to perform multiplication operations on the second matrix and the first matrix to obtain a calculation result.
  • the number of non-zero elements in each data block is M;
  • the index value of a non-zero element belonging to the k-th pruning vector in a data block is k-1, which is used to indicate the index of the pruning vector to which the non-zero element belongs;
  • the shape of the weight tensor is (C out ,C in ,FH,FW),
  • C out represents the number of output activations of the neural network
  • C in represents the number of input activations of the neural network
  • FH represents the kernel height of the neural network
  • FW represents the kernel width of the neural network
  • the number of column elements of the second matrix is C out *FH*FW;
  • the number of row elements of the second matrix is C in;
  • C in = K*M;
  • the number of data blocks included in the weight tensor is C out *FH*FW.
  • the present application further provides a pruning device, which has the function of implementing the method of the first aspect.
  • the function can be implemented by hardware, or by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the structure of the pruning device may include an acquisition unit and a pruning unit, which can perform the corresponding steps of the above-mentioned method.
  • the pruning device may also include a training unit for performing a training operation.
  • the structure of the pruning device may include a processor and a memory, wherein the processor is configured to execute the above-mentioned method.
  • the memory is coupled to the processor and stores necessary program instructions and data for the pruning device.
  • the present application further provides a computer storage medium, in which computer executable instructions are stored.
  • when the computer executable instructions are run on a computer, the computer is enabled to execute any one of the methods mentioned in the first aspect above, or to execute any one of the methods mentioned in the second aspect above.
  • the present application also provides a computer program product comprising instructions, which, when executed on a computer, enables the computer to execute any of the methods mentioned in the first aspect above, or execute any of the methods mentioned in the second aspect above.
  • the present application also provides a chip, which is coupled to a memory and is used to read and execute program instructions stored in the memory to implement any of the methods mentioned in the first aspect above, or to execute any of the methods mentioned in the second aspect above.
  • Figure 1 is a schematic diagram of the convolutional neural network structure
  • FIG2 is a schematic diagram of a system architecture 10 provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of the structure of a data processing device 200 provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of a pruning method flow chart provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of a pruned data block provided in an embodiment of the present application.
  • FIG6 is a schematic diagram showing the correspondence between sparsity and the number of pruning times provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of a progressive pruning process provided in an embodiment of the present application.
  • FIG8A is a schematic diagram of a data processing method provided in an embodiment of the present application.
  • FIG8B is a schematic diagram of a coding method for non-zero indexes of data blocks provided in an embodiment of the present application.
  • FIG9A is a schematic diagram of an accelerator structure provided in an embodiment of the present application.
  • FIG9B is a schematic diagram of another accelerator structure provided in an embodiment of the present application.
  • FIG10A is a schematic diagram of a processing system structure provided in an embodiment of the present application.
  • FIG10B is a schematic diagram of another processing system structure provided in an embodiment of the present application.
  • FIG11A is a schematic diagram of a data processing flow provided in an embodiment of the present application.
  • FIG11B is a schematic diagram of another data processing flow provided in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of the structure of a pruning device provided in an embodiment of the present application.
  • Neural networks are used to implement machine learning, deep learning, search, reasoning, decision making, etc.
  • the neural networks mentioned in this application may include various types, such as deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN), residual networks, neural networks using transformer models or other neural networks, etc.
  • A CNN is a deep neural network with a convolutional structure and a deep learning architecture; a deep learning architecture refers to multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network, in which each neuron processes the data input into it.
  • a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120, wherein the pooling layer is optional, and a neural network layer 130.
  • the convolutional layer/pooling layer 120 may include layers 121-126, for example.
  • layer 121 is a convolutional layer
  • layer 122 is a pooling layer
  • layer 123 is a convolutional layer
  • layer 124 is a pooling layer
  • layer 125 is a convolutional layer
  • layer 126 is a pooling layer; in another example,
  • 121 and 122 are convolution layers
  • 123 is a pooling layer
  • 124 and 125 are convolution layers
  • 126 is a pooling layer. That is, the output of the convolution layer can be used as the input of the subsequent pooling layer, or as the input of another convolution layer to continue the convolution operation.
  • the convolution layer 121 can include many convolution operators, which are also called convolution kernels.
  • the convolution operator can essentially be a weight matrix, which is usually predefined. Taking image processing as an example, different weight matrices extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to blur unnecessary noise in the image.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications.
  • the weight matrices formed by the weight values obtained through training can extract information from the input data, thereby helping the convolutional neural network 100 to make correct predictions.
  • when the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (for example, 121) often extracts more general features, which can also be called low-level features;
  • the features extracted by the later convolutional layers (for example, 126) become more and more complex, such as high-level semantic features.
  • each layer 121-126 may be a convolution layer followed by a pooling layer, or multiple convolution layers may be followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator to sample the input image to obtain an image of smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value.
  • the maximum pooling operator can take the pixel with the largest value in the range within a specific range as the result of the maximum pooling.
  • the operator in the pooling layer should also be related to the image size.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or maximum value of the corresponding sub-region of the image input to the pooling layer.
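  • For illustration, a minimal sketch of the pooling operators just described (the simplified case of non-overlapping windows with stride equal to the window size):

```python
import numpy as np

def pool2d(image: np.ndarray, size: int, mode: str = "max") -> np.ndarray:
    """Non-overlapping 2-D pooling over an (H, W) image (stride == window size)."""
    h, w = image.shape
    patches = image[: h - h % size, : w - w % size].reshape(
        h // size, size, w // size, size)
    return patches.max(axis=(1, 3)) if mode == "max" else patches.mean(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(img, 2, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(img, 2, "mean"))  # [[ 2.5  4.5] [10.5 12.5]]
```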
  • After being processed by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a group of outputs of the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in Figure 1) and an output layer 140. The parameters contained in the multiple hidden layers can be pre-trained according to the relevant training data of the specific task type. For example, the task type may include image recognition, image classification, image super-resolution reconstruction, etc.
  • the output layer 140 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
  • the back propagation (as shown in Figure 1, the propagation from 140 to 110 is back propagation) will begin to update the weight values and biases of the aforementioned layers to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
  • the convolutional neural network 100 shown in Figure 1 is only an example of a convolutional neural network.
  • the convolutional neural network can also exist in the form of other network models. For example, multiple convolutional layers/pooling layers are in parallel, and the extracted features are input into the neural network layer 130 for processing.
  • Sparsification/pruning means setting the weight values of non-critical parts of a neural network (network model) to zero (0).
  • Sparse methods can include arbitrary sparsity, structured sparsity, and unstructured sparsity.
  • Arbitrary sparsity means that the weights chosen to be set to 0 can be at any position, and the inference calculation uses the sparse weight calculation.
  • Structured sparsity prunes according to certain rules, such as whole-channel pruning in convolutional layers and whole-column pruning in fully connected layers; after pruning, the inference calculation still uses dense calculation through rearrangement.
  • Unstructured sparsity is also called semi-structured sparsity.
  • Compared with structured sparsity, it has a finer granularity and still needs to be pruned according to certain rules, but the rules are relatively weak, and a customized code implementation is required to achieve good inference performance.
  • For example, the 2:4 sparsity proposed by NVIDIA has two zero-weight positions in each vector of length 4.
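  • For context only, a small sketch of the 2:4 pattern mentioned above, using weight magnitude as an assumed importance measure (the exact selection rule is not specified here):

```python
import numpy as np

def prune_2_of_4(w: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude weights in every group of 4 (2:4 sparsity)."""
    groups = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]      # 2 least important per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.array([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.02])
print(prune_2_of_4(w))  # [ 0.9   0.    0.   -0.7   0.    0.3  -0.25  0.  ]
```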
  • plural means two or more than two.
  • “/” indicates that the objects associated before and after are in an "or” relationship.
  • A/B can represent A or B.
  • the "and/or” in the present application is only a description of the association relationship of the associated objects, indicating that there may be three relationships.
  • A and/or B can represent: A exists alone, A and B exist at the same time, or B exists alone.
  • A and B can each be singular or plural.
  • the words "first", “second”, etc. are used to distinguish the same items or similar items with basically the same functions and effects.
  • the system architecture 10 includes a pruning device 100 and a data processing device 200.
  • the pruning device 100 is used to execute the pruning method of the neural network provided in the embodiment of the present application, and the data processing device 200 is used to implement the data processing method provided in the embodiment of the present application. After the pruning device 100 completes pruning of the neural network, the pruned neural network can be sent to the data processing device 200.
  • the pruning device 100 may be implemented by software or hardware.
  • the pruning device 100 is taken as an example of a software functional unit.
  • the pruning device 100 may include code running on a computing instance.
  • the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container.
  • the computing instance may be one or more.
  • the pruning device 100 may include code running on multiple hosts/virtual machines/containers.
  • the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions.
  • the multiple hosts/virtual machines/containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, each AZ including one data center or multiple data centers with close geographical locations.
  • a region may include multiple AZs.
  • multiple hosts/virtual machines/containers used to run the code can be distributed in the same virtual private cloud (VPC) or in multiple VPCs.
  • a VPC is set up in a region.
  • a communication gateway needs to be set up in each VPC to achieve interconnection between VPCs through the communication gateway.
  • the pruning device 100 is an example of a hardware functional unit.
  • the pruning device 100 may include at least one computing device, such as a server.
  • the at least one computing device may be deployed in a cloud network or locally.
  • the pruning device 100 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • the multiple computing devices included in the pruning apparatus 100 may be distributed in the same region or in different regions.
  • the system architecture 10 also includes a data storage system 300.
  • the pruning device 100 cooperates with other computing devices, such as data storage, routers, load balancers, etc.
  • the pruning device 100 can use the data in the data storage system 300, or call the program code in the data storage system 300 to implement the neural network pruning method provided in the present application.
  • FIG3 exemplarily shows a schematic diagram of the structure of a data processing device 200 provided by the present application.
  • the data processing device 200 includes a processing system 210, a hardware accelerator 220, a data bus 230, and a control bus 240.
  • the processing system 210 and the hardware accelerator 220 can transmit data through the data bus 230, and can transmit control information through the control bus 240.
  • the processing system 210 is the control component of the entire system, including a processor 211, and optionally, the processing system 210 also includes a memory 212.
  • the processor 211 is used to run the code on the software side and load the acceleration task (such as a matrix or a vector) onto the hardware accelerator 220.
  • the hardware accelerator 220 may include one or more intellectual property (IP) cores, and the hardware accelerator 220 may control the working state and data reading of the one or more IP cores.
  • the functions of the IP core include, for example: designing some commonly used but relatively complex function blocks in digital circuits, such as synchronous dynamic random access memory (SDRAM) controllers, peripheral component interconnect (PCI) interfaces, etc., into circuits with modifiable parameters.
  • the processor 211 may be a central processing unit (CPU), a network processor (NP) or a combination of a CPU and a NP.
  • the processor 211 may further include a hardware chip.
  • the above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • the memory 212 is used to store data.
  • the processor 211 can control the reading or writing of data to the memory 212.
  • the memory 212 can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a random access memory (RAM), or a non-volatile memory, such as a flash memory or a hard disk.
  • the memory 212 may include a hard disk drive (HDD) or a solid-state drive (SSD).
  • the memory 212 may also include a combination of the above-mentioned types of memories.
  • the hardware accelerator 220 which can also be referred to as an accelerator, is a hardware acceleration component of the entire system. It can design and implement a dedicated acceleration IP core according to a specific computing task.
  • the IP core on the hardware accelerator can accelerate the algorithm.
  • the hardware accelerator 220 can be used in a neural network to accelerate the neural network algorithm. After the data interaction is completed, the hardware accelerator 220 and the processing system 210 can independently execute tasks in parallel.
  • the hardware accelerator 220 can be a graphics processing unit (GPU), an FPGA, an application-specific integrated circuit (ASIC), a microcontroller unit (MCU), etc.
  • the data bus 230 is used for data transmission between the entire processing system 210 and the hardware accelerator 220.
  • the data bus 230 may adopt the advanced extensible interface-Stream (AXI-Stream) protocol or the peripheral component interconnect express (PCI Express, PCI-E) bus protocol.
  • AXI is a bus protocol
  • the control bus 240 is used for transmission of control signals between the entire processing system 210 and the hardware accelerator 220.
  • the control bus 240 may adopt the AXI-Lite protocol, which is a lightweight address-mapped single-transmission protocol suitable for transmission of control signals of hardware computing circuits.
  • the processing system 210 may have a hardware acceleration function, that is, the functions of the processing system 210 and the hardware accelerator 220 are implemented by one device, such as a CPU.
  • the data processing device 200 interacts with the pruning device 100 via a communication network of any communication mechanism/communication standard.
  • the communication network may be a wide area network, a local area network, a point-to-point connection, or any combination thereof.
  • the solution provided by the embodiment of the present application can be applied to deep learning scenarios, such as scenarios using convolutional neural networks and scenarios using deep neural networks.
  • the embodiment of the present application is applicable to scenarios of Transformer networks or fusion networks of transformers + CNN.
  • the following is a description of the process of the neural network pruning method provided in the embodiment of the present application. See Figure 4 for a schematic diagram of the process of the neural network pruning method provided in the embodiment of the present application.
  • the neural network pruning method can be executed by the pruning device 100, or by a module in the pruning device 100.
  • the neural network to be pruned can be a pre-trained neural network or an untrained neural network.
  • the neural network to be pruned may be a convolutional neural network, such as a neural network used for image classification, target recognition, image processing, image denoising, and the like.
  • the first weight tensor can be understood as a vector representation of the weight of the convolution kernel.
  • the convolution kernel can be single-channel or multi-channel.
  • the shape of the convolution kernel is (C out , C in , FH, FW).
  • C out represents the number of output activations of the convolution kernel (or the number of output channels of the convolution kernel)
  • C in represents the number of input activations of the convolution kernel (or the number of input channels of the convolution kernel)
  • FH represents the kernel height of the convolution kernel
  • FW represents the kernel width of the convolution kernel.
  • the second weight tensor includes multiple data blocks, each data block consists of K pruned vectors, and each pruned vector consists of M weights. K and M are both positive integers.
  • the weight corresponding to the first position in the first pruned vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the pruned vectors other than the first pruned vector in the first data block is a zero element.
  • the first data block is any data block among the multiple data blocks.
  • the first pruned vector is any pruned vector in the first data block.
  • a data block can also be understood as being composed of multiple pruned vectors.
  • the size of the data block is related to the specifications of the hardware that performs neural network operation acceleration; for example, the specification of the acceleration hardware determines the length of the vectors included in the data block.
  • the pruning vector may also be referred to as a vector, or as a shortest vector, or may be named by other names.
  • the embodiment of the present application does not specifically limit the name of the pruning vector.
  • each data block satisfies the conditions satisfied by the first data block: at the position where a non-zero element is located in one pruning vector, the other pruning vectors in the data block to which that pruning vector belongs have zero elements.
  • For example, the data block includes two pruning vectors; the weights at the 0th and 3rd positions of the first pruning vector are non-zero elements, and the weights at the 1st and 2nd positions of the first pruning vector are zero elements, so the weights at the 0th and 3rd positions of the second pruning vector are both zero elements, and the weights at the 1st and 2nd positions of the second pruning vector are non-zero elements.
  • the positions of the zero elements of the first pruning vector and the second pruning vector are complementary.
  • the pruning method provided in the embodiment of the present application can be called complementary sparse or complementary pruning, and of course other names can be used to name it, and the embodiment of the present application does not specifically limit this.
  • the pruning method provided in the embodiment of the present application will be referred to as complementary sparse in the following.
  • the embodiment of the present application uses a finer granularity for sparsification, which can improve the accuracy of the network model in its application.
  • the sparsification is performed in a complementary manner, so there is no need to use dense calculations in the inference stage, which reduces the amount of calculation of the model.
  • the solution provided by the embodiment of the present application can balance accuracy and acceleration.
  • K pruning vectors can be combined into one vector (i.e., a data block) for operation, so that the relationship between the sparsity after pruning and K satisfies formula (1): S = (K-1)/K.
  • For example, when K = 2, the sparsity S = 50%.
  • When K = 4, the sparsity S = 75%, and so on.
  • the embodiment of the present application may adopt a one-step approach when performing pruning, that is, perform pruning once; or it may adopt a progressive pruning approach (or progressive sparsification approach).
  • In the progressive pruning method, the pruning process is performed N times with increasing sparsity.
  • The number of times N that the pruning process is performed can be determined according to the number K of pruning vectors included in each data block and/or the sparsity of the second weight tensor (i.e., the sparsity required for pruning).
  • N = K-1, formula (2); the sparsity of the i-th pruning process satisfies Si = i/K;
  • Si represents the sparsity of the i-th pruning process;
  • i is a positive integer, 1 ≤ i ≤ K-1.
  • For example, when K = 4, the sparsity changes as follows: 0 -> 25% -> 50% -> 75%.
  • When K = 8, the sparsity changes as follows: 0 -> 12.5% -> 25% -> 37.5% -> 50% -> 62.5% -> 75% -> 87.5%.
  • the embodiment of the present application can also refer to the progressive sparsity method as vector-by-vector progressive sparsity.
  • the embodiment of the present application is not limited to this.
  • each time pruning is performed, one generation of training (also referred to as an epoch) is performed.
  • the pruning process mentioned above may include pruning operations and training operations.
  • the number of training times at each sparsity value may be the same.
  • different steps (stages) corresponding to different sparsities are of equal width, that is, the number of training times at each sparsity value is the same.
  • the training corresponding to different sparsity values may use the same training sample set or different training sample sets.
  • the sample set may be split into multiple training sample subsets, and each time a pruning process is performed, a training sample subset is used to perform a generation of training.
  • different training sample subsets include the same number of batches (indicating a batch of data), and the number of training times at each sparsity value is the same.
  • a small portion of samples in the training set is used to perform a back-propagation parameter update on the weights retained by the neural network. This small portion of samples is called a "batch of data.”
  • each pruning operation can be divided into three steps, namely (1) deformation, (2) pruning and (3) restoration (also called inverse deformation or anti-deformation).
  • the input four-dimensional weight tensor is deformed to obtain a deformed three-dimensional weight tensor.
  • the three-dimensional weight tensor includes G data blocks, each of which is composed of K pruned vectors, and each pruned vector is composed of M weights.
  • when pruning is performed for the first time, the input four-dimensional weight tensor can be understood as the first weight tensor.
  • when pruning is performed for the i-th time (i > 1), the input four-dimensional weight tensor can be understood as the four-dimensional weight tensor that has undergone the (i-1)-th pruning process.
  • the shape of the input four-dimensional weight tensor can be described as (C out , C in , FH, FW).
  • C out represents the number of output activations of the convolution kernel (or the number of output channels of the convolution kernel)
  • C in represents the number of input activations of the convolution kernel (or the number of input channels of the convolution kernel)
  • FH represents the kernel height of the convolution kernel
  • FW represents the kernel width of the convolution kernel).
  • Four sets of parameters are used to describe the weight tensor.
  • when C out = 1 in the weight tensor, the shape of the input weight tensor can be understood as three-dimensional, that is, (C in, FH, FW);
  • when C in = 1 in the weight tensor, the shape of the input weight tensor can be understood as three-dimensional, that is, (C out, FH, FW).
  • the shape of the deformed three-dimensional weight tensor is (G, K, M), where G satisfies the following formula (5): G = (C out × C in × FH × FW)/(K × M).
  • dedicated acceleration hardware may be a circuit or hardware with independent functions that can accelerate the complementary sparsity provided in the embodiments of the present application, such as a dedicated MCU, a dedicated GPU, or a dedicated CPU, etc.
  • the value of L is related to the specifications of the acceleration hardware (dedicated acceleration hardware or general parallel computing hardware).
  • the specifications of the acceleration hardware may be the size of the matrix of the matrix product supported by the acceleration hardware.
  • the acceleration hardware supports the product operation of the matrix A*L and the matrix L*B.
  • the input four-dimensional weight tensor can be first rearranged into (FH, FW, C out , C in ).
  • the purpose of deformation is to facilitate pruning along the C in dimension.
  • Pruning: prune the three-dimensional weight tensor along the K dimension, and reset the qi weights with the least importance in the K dimension to zero, so as to obtain the three-dimensional weight tensor after the i-th pruning.
  • the three-dimensional weight tensor of shape (G, K, M) can be understood as G data blocks of size K*M. Each K*M data block is pruned in the K dimension, which can be understood as resetting the qi weights with the least importance among the K weights in each column to zero.
  • the q i weights with the least importance in the K dimension are reset to zero, which can be:
  • the pruning operation may be performed first and then the training operation, or the training operation may be performed first and then the pruning operation.
  • the pruned weights may or may not be updated.
  • the pruning operation is performed first and then the training operation is performed, and the subtracted weights are updated as an example.
  • the two least important weights among the K weights in each column are reset to zero, that is, 2*M weights are subtracted from each data block. Then the second generation of training is performed for the pruned weight vector, and each weight is adjusted during the training process. And so on.
  • the pruning operation may be performed first and then the training operation may be performed, and the subtracted weights may not be updated.
  • M weights are subtracted from the data block composed of each K pruned vectors.
  • the least important weight among the K weights in each column is reset to zero, that is, M weights are subtracted from each data block.
  • the first generation of training is performed on the pruned weight vectors, and the weights that have not been subtracted are adjusted during the training process.
  • a training operation may be performed first and then a pruning operation may be performed, and the subtracted weights may be updated as an example.
  • the training operation may be performed first and then the pruning operation may be performed, and the subtracted weights may not be updated.
  • after the first generation of training, the weight with the least importance among the K weights in each column is reset to zero, that is, M weights are subtracted from each data block consisting of K pruning vectors;
  • after the second generation of training, the weight with the least importance among the remaining weights in each column is reset to zero; at this time, 2*M weights have been subtracted in each data block. And so on.
  • FIG. 7 shows the progressive pruning process of a tensor with a deformed shape of (1, 4, 4); it can be seen that, as the sparsity increases, the number of weights retained in the K dimension becomes smaller and smaller. It should be noted that FIG. 7 is only an example; in the actual pruning process, the values of the weights will change after training. FIG. 7 only illustrates the change in the number of pruned weights and the change in sparsity.
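  • One pass of the progressive pruning step can be sketched roughly as follows, assuming weight magnitude as the importance measure (the embodiment does not restrict the importance criterion to magnitude); after the final pass (i = K-1), exactly one non-zero weight remains per column, which yields the complementary pattern.

```python
import numpy as np

def prune_pass(block: np.ndarray, i: int) -> np.ndarray:
    """i-th progressive pruning pass on one (K, M) data block.

    In every column (one position across the K pruning vectors), the i
    weights with the smallest magnitude are reset to zero, so the block
    sparsity after this pass is i / K.  Magnitude is used as the importance
    measure purely for illustration.
    """
    pruned = block.copy()
    drop = np.argsort(np.abs(pruned), axis=0)[:i, :]     # i least important per column
    np.put_along_axis(pruned, drop, 0.0, axis=0)
    return pruned

rng = np.random.default_rng(0)
block = rng.normal(size=(4, 4))                          # K = 4, M = 4
for i in (1, 2, 3):                                      # sparsity 25% -> 50% -> 75%
    print(np.count_nonzero(prune_pass(block, i)))        # 12, 8, 4 non-zeros remain
```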
  • the data processing method provided by the embodiment of the present application is described below. See FIG. 8A for a schematic diagram of the data processing flow provided by the embodiment of the present application.
  • the data processing method can be executed by the data processing device 200, or by a module in the data processing device 200.
  • the processor in this embodiment can be the processor 211 in FIG. 3, and the accelerator can be the hardware accelerator 220 in FIG. 3.
  • the processor 211 obtains data to be processed and generates a vector representation of the data to be processed.
  • the vector representation of the data to be processed may be a representation in matrix form, for example, the vector representation of the data to be processed is a first matrix.
  • the hardware accelerator 220 performs calculations on the vector representation of the data to be processed and the weight tensor included in the neural network to obtain a processing result of the data to be processed.
  • the weight tensor included in the neural network can be a matrix representation, such as a second matrix.
  • the processor 211 may process the vector representation of the data to be processed into a first matrix, process the weight tensor to obtain a second matrix and the indices of the non-zero elements in the second matrix, and send the first matrix, the second matrix, and the indices of the non-zero elements in the second matrix to the accelerator. Then, the accelerator performs a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.
  • the hardware accelerator 220 (referred to as accelerator 220) performs calculations on the vector representation of the data to be processed and the weight tensor included in the neural network, which can be understood as multiplying the first matrix of the data to be processed and the second matrix of the neural network.
  • the weight tensor of the neural network is obtained through pruning processing, and the weight tensor includes multiple data blocks, each of which is composed of K pruned vectors, and each pruned vector is composed of M weights; K and M are both positive integers; the weight corresponding to the first position in the first pruned vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the other pruned vectors in the first data block except the first pruned vector is a zero element, the first data block is any data block among the multiple data blocks, and the first pruned vector is any pruned vector in the first data block.
  • the weight tensor may be encoded.
  • a pruned weight tensor may be encoded as non-zero weights + indices.
  • non-zero weights may be compactly stored in the order of the concatenated vectors.
  • Each non-zero weight corresponds to an index.
  • the second matrix may be multiplied by the first matrix according to the index of each non-zero weight in the second matrix to obtain a calculation result.
  • each row element in the second matrix corresponds to the weight of a data block, that is, the number of row elements in the second matrix is K*M.
  • the number of column elements in the first matrix is K*M.
  • in this case, the second matrix × first matrix method is adopted.
  • Alternatively, each column element in the second matrix may correspond to the weight of a data block, that is, the number of column elements of the second matrix is K*M;
  • the number of row elements of the first matrix is K*M;
  • in this case, the first matrix × second matrix method is used.
  • the accelerator when the accelerator performs a multiplication operation on the second matrix and the first matrix to obtain a calculation result according to the index of each non-zero element in the second matrix, it can determine the elements in the first matrix that are operated on the non-zero elements in the i-th row of the second matrix according to the index of the non-zero elements in the second matrix, so as to perform a multiplication operation on the second matrix and the first matrix to obtain a calculation result.
  • the indices of the non-zero elements in the second matrix can be encoded as follows: the number of non-zero elements in each data block is M; a non-zero element belonging to the k-th pruning vector of a data block has the index value k-1, which indicates the pruning vector to which that non-zero element belongs.
  • FIG8B is a schematic diagram of the encoding method of the non-zero index of the data block provided in an embodiment of the present application.
  • Taking the sparsity of 50% as an example:
  • the weight of the first non-zero element in the M dimension belongs to pruning vector 1, and its index is 0; the weight of the second non-zero element in the M dimension belongs to pruning vector 2, and its index is 1; the weight of the third non-zero element in the M dimension belongs to pruning vector 3, and its index is 1; the weight of the fourth non-zero element in the M dimension belongs to pruning vector 1, and its index is 0; and so on.
  • In another example, the weight of the first non-zero element in the M dimension belongs to pruning vector 1, the index is 0, and the binary encoding is 0; the weight of the second non-zero element in the M dimension belongs to pruning vector 3, the index is 2, and the binary encoding is 10; the weight of the third non-zero element in the M dimension belongs to pruning vector 2, the index is 1, and the binary encoding is 01; the weight of the fourth non-zero element in the M dimension belongs to pruning vector 4, the index is 3, and the binary encoding is 11.
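  • An illustrative sketch of this index encoding: each non-zero weight stores the index of the pruning vector it belongs to, using ceil(log2(K)) bits (1 bit for a sparsity of 50%, 2 bits for 75%); encode_indices is a hypothetical helper.

```python
import math

def encode_indices(vector_ids: list[int], K: int) -> list[str]:
    """Binary-encode, for each non-zero weight, the pruning vector it belongs to.

    Pruning vector k (1-based) gets index k - 1, stored in ceil(log2(K)) bits.
    """
    bits = max(1, math.ceil(math.log2(K)))
    return [format(v - 1, f"0{bits}b") for v in vector_ids]

print(encode_indices([1, 2, 2, 1], K=2))  # ['0', '1', '1', '0']
print(encode_indices([1, 3, 2, 4], K=4))  # ['00', '10', '01', '11']
```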
  • the shape of the weight tensor is (C out ,C in ,FH,FW) for illustration.
  • the weight tensor is described as a convolution kernel.
  • the processor can transform a pruned convolution kernel with a shape of (C out ,C in ,FH,FW) into a second matrix of (C out *FH*FW, C in ).
  • the shape of the data to be processed is (C in ,IW,IH), and the processor transforms the data to be processed into a matrix of (C in ,OH*OW).
  • Padding indicates the padding of zero elements.
  • Stride indicates the stride, which is used to reduce the number of input parameters and the amount of calculation.
  • the purpose of padding zero elements is to make all the input blocks serve as the center of the convolution kernel window.
  • the accelerator then performs the multiplication operation on the first matrix and the second matrix.
  • the subsequent descriptions all take the multiplication operation of the first matrix and the second matrix as an example.
  • the w ij is a non-zero element.
  • the shape of the result matrix is (C out *FH*FW, OH*OW). A parallel matrix row-reduction operation is then performed, that is, every FH*FW numbers in the result matrix are summed to obtain one number. After the parallel matrix row-reduction operation, the shape of the matrix is (C out, OW*OH). Then, the matrix with the shape of (C out, OW*OH) is converted into (C out, OW, OH), which is the calculation result.
  • the processor can transform a pruned convolution kernel of the shape (C out ,C in ,FH,FW) into a second matrix of (C out ,C in *FH*FW).
  • the shape of the data to be processed is (C in ,IW,IH), and the processor transforms the data to be processed into a matrix of (C in *FH*FW,OH*OW).
  • Padding indicates the padding of zero elements.
  • Stride indicates the stride, which is used to reduce the number of input parameters and the amount of calculation. The purpose of padding with zero elements is to make all the input blocks serve as the center of the convolution kernel window.
  • the accelerator then performs the multiplication operation on the first matrix and the second matrix.
  • the w ij is a non-zero element.
  • the shape of the result matrix is (OC, OW*OH), which is converted to (OC, OW, OH), which is the calculation result.
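  • The matrix-multiplication view of the convolution described above can be sketched with a dense reference implementation (without the sparse weight encoding) to show the shapes involved; conv2d_as_matmul is a hypothetical helper, and the output is written here as (C out, OH, OW).

```python
import numpy as np

def conv2d_as_matmul(x: np.ndarray, w: np.ndarray, stride: int = 1, pad: int = 0):
    """Convolution expressed as (C_out, C_in*FH*FW) x (C_in*FH*FW, OH*OW).

    x: input of shape (C_in, IH, IW); w: kernel of shape (C_out, C_in, FH, FW).
    The kernel is flattened into the second matrix, the input is unfolded
    (im2col) into the first matrix, and one matrix product yields the output.
    """
    c_out, c_in, fh, fw = w.shape
    x = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    ih, iw = x.shape[1:]
    oh, ow = (ih - fh) // stride + 1, (iw - fw) // stride + 1
    cols = np.empty((c_in * fh * fw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i * stride:i * stride + fh, j * stride:j * stride + fw]
            cols[:, i * ow + j] = patch.ravel()          # one column of the first matrix
    second = w.reshape(c_out, c_in * fh * fw)            # the second matrix
    return (second @ cols).reshape(c_out, oh, ow)

x = np.random.randn(3, 8, 8)
w = np.random.randn(4, 3, 3, 3)
print(conv2d_as_matmul(x, w, stride=1, pad=1).shape)     # (4, 8, 8)
```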
  • the accelerator may include a storage circuit 910 and an operation circuit 920.
  • the accelerator may be the hardware accelerator 220 in Figure 3.
  • the storage circuit 910 may be used for the first matrix corresponding to the data to be processed, the second matrix corresponding to the weight tensor of the neural network, and the index of the non-zero elements in the second matrix.
  • the operation circuit 920 is used to perform a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.
  • the elements of the first matrix that need to be operated can be selected through a selection circuit (it can also be understood that the first matrix is sparse).
  • the selection circuit can also be called a selection unit. In the subsequent description, the selection circuit is called a selection unit for example.
  • the selection unit is deployed in the operation circuit.
  • the operation circuit 920 may include a selection unit 921 and a calculation unit 922.
  • the selection unit 921 reads the non-zero elements of the i-th row in the second matrix from the storage circuit according to the index of the non-zero elements in the second matrix, and reads the elements of the i-th column of the first matrix from the storage circuit according to the index of the non-zero elements of the second matrix to be operated with the non-zero elements of the second matrix.
  • the calculation unit 922 is used to perform multiplication and addition operations on the non-zero elements of the i-th row of the second matrix and the elements of the i-th column of the read first matrix.
  • the selection unit 921 selects only one non-zero element for calculation at the same position of the K pruning vectors, and the other elements are all zero elements.
  • the acceleration ratio on the hardware thus achieved is strictly related to the sparsity, for example, the convolution acceleration ratio of 50% sparsity is 2.
  • the convolution acceleration ratio of 75% sparsity is 4, and so on.
  • the operation circuit 920 may use one or more arithmetic and logic units (ALUs).
  • the accelerator may be an MCU, a CPU, or a GPU, etc.
  • ALU is the operation unit of the IP core of the accelerator, a data processing component, and the most special data channel unit in the data channel. It implements multiple operations such as addition, subtraction, and, or, XOR, NOT, left shift, right shift, and half-byte exchange.
  • the selection unit 921 may be executed by a separate ALU or embedded in other ALUs that perform operations.
  • the selection unit is deployed in a storage circuit.
  • the storage circuit 910 includes a storage unit 911 and a selection unit 912.
  • the storage unit 911 stores a first matrix corresponding to the data to be processed, a second matrix corresponding to the weight tensor included in the neural network, and the index of the non-zero elements in the second matrix.
  • the storage circuit 910 is also responsible for data reading, and the first matrix can be sparse during the data reading process of the storage circuit.
  • the selection unit 912 reads the non-zero elements of the i-th row in the second matrix from the storage unit 911 according to the index of the non-zero elements in the second matrix, and sends the non-zero elements of the i-th row in the second matrix to the operation circuit; and reads the elements of the i-th column elements of the first matrix that are operated with the non-zero elements of the i-th row in the second matrix from the storage unit 911 according to the index of the non-zero elements of the second matrix, and sends the elements of the i-th column elements of the first matrix that are operated with the non-zero elements of the i-th row in the second matrix to the operation circuit 920.
  • the operation circuit 920 performs multiplication and addition operations on the non-zero elements of the i-th row in the second matrix and the elements of the i-th column elements of the first matrix that are operated with the non-zero elements of the i-th row in the second matrix.
  • the storage circuit 910 includes a selection unit 912, as shown in FIG. 9A
  • the storage unit 911 in the storage circuit 910 may also include a first data cache area 9111, a second data cache area 9112, and an index information cache area 9113.
  • the first data cache area 9111 is used to cache data loaded from the processor's memory 212 into the hardware accelerator.
  • the first data cache area 9111 can cache the convolution kernel parameter matrix (including the second matrix) of each layer of the neural network.
  • the second matrix can be compactly stored.
  • the K*M non-zero elements that need to be multiplied with the first matrix are compactly stored, such as row storage, and the non-zero elements of the row are read each time.
  • the second data buffer area 9112 is used to cache data loaded from the memory 212 into the hardware accelerator.
  • the second data buffer area 9112 can cache the input matrix (including the first matrix).
  • the index information buffer area 9113 can be used to cache the index of the non-zero elements of the second matrix and the index of the elements in the first matrix.
  • the index of the non-zero elements in the second matrix can be used in the manner described above.
  • the elements of the first matrix may include a row index (which can also be understood as a row address) and a column index (which can be understood as a column address).
  • the selection unit 912 is used to select data at a corresponding position according to the index cached in the index information cache area 9113, for example, according to the index
  • the index of the second matrix cached in the information cache area 9113 determines the element of the first matrix that needs to be multiplied with the non-zero element in the first matrix.
  • the operation circuit 920 is the core component of the hardware accelerator, and various functional units of the operation circuit are solidified inside.
  • the operation circuit 920 may include a multiplication unit 9201, an accumulation unit 9202, an excitation function unit 9203 and a vector calculation unit 9204.
  • the multiplication unit is used to multiply the elements of the corresponding positions of the input vector and the matrix to obtain an intermediate value, or to multiply the input matrix and the matrix to obtain an intermediate value.
  • the multiplication unit 9201 can be a matrix and vector multiplication unit or a matrix and matrix multiplication unit.
  • the accumulation unit 9202 is used to accumulate the intermediate values calculated by the multiplication unit 9201 to obtain each gate vector value.
  • the excitation function unit 9203 is used to perform excitation function processing on the gate vector value obtained by the accumulation unit, and the vector calculation unit 9204 is used to calculate each gate vector to obtain the final result.
  • the hardware accelerator may further include an input buffer 923 and an output buffer 924 for temporarily storing input and output data in the hardware accelerator. After the operation circuit 920 completes the calculation, the result may be placed in the output buffer 924 and output, wherein the output buffer 924 may be written back to the processor at one time after the cache is full.
  • the input buffer 923 and the output buffer 924 may be deployed inside the operation circuit 920 or outside the operation circuit 920.
  • the hardware accelerator may include a memory access unit 913.
  • the memory access unit 913 may use direct memory access (DMA).
  • DMA direct memory access
  • the memory access unit 913 may be used for data transmission between the hardware accelerator and other devices.
  • FIG. 10A the software and hardware collaborative framework of the data processing device 200 is shown in FIG. 10A.
  • the memory access unit 913 may be used for data transmission between the hardware accelerator 220 and the memory 212.
  • the memory access unit 913 may be deployed inside the storage circuit 910 or outside the storage circuit 910.
  • FIG. 10A takes the deployment outside the storage circuit 910 as an example.
  • the memory access unit 913 adopts DMA as an example.
  • the circuits of the hardware accelerator can be equipped with a DMA to realize parallel data reading.
  • the hardware accelerator may also include a control interconnection circuit for interconnecting the control signal line between the processing system 210.
  • the opening of DMA is controlled by the control bus and the data bus respectively, and the DMA transmits the first matrix of the data to be processed in the memory 212 and the second matrix corresponding to the weight tensor and the index of the non-zero elements in the second matrix to the storage circuit 910 in the hardware accelerator 220, such as transferring the second matrix to the first data cache area 9111, transferring the first matrix to the second data cache area 9112, and transferring the index of the non-zero elements of the second matrix to the index information cache unit 9113.
  • the target result is obtained by processing by the operation circuit 920.
  • the selection unit 912 reads the j-th row non-zero elements in the second matrix from the first data cache area 9111. It can be understood that each column of non-zero elements is compactly stored in rows in the first data buffer area 9111.
  • the selection unit 912 can read the storage row where the j-th row is located from the first data buffer area 9111.
  • the selection unit 912 reads the index of the non-zero element of the j-th row in the second matrix from the index information buffer area 9113, and then reads the element in the j-th column of the first matrix multiplied with the non-zero element of the j-th row in the second matrix from the second data buffer area 9112 according to the index of the non-zero element of the j-th row.
  • the selection unit 912 can send the read data (including the non-zero element of the j-th row in the second matrix, and the element in the j-th column of the first matrix that needs to be multiplied with the non-zero element of the j-th row) to the operation circuit 920.
  • the selection unit 912 can output the read data to the input buffer area 923 (which can be inside the operation circuit 920 or inside the operation circuit 920). If outside the operation circuit 920, the operation circuit 920 can obtain read data from the input buffer area 923 for operation.
  • the multiplication unit performs multiplication operations on the non-zero elements in the j-th row of the second matrix and the elements in the j-th column of the first matrix that need to be multiplied with the non-zero elements in the j-th row to obtain multiple values, and then the accumulation unit 9202 performs addition operations on the multiple values.
  • the storage unit 911 in the storage circuit 910 may also include a first data buffer 9111, a second data buffer 9112, and an index information buffer 9113.
  • the functions of the first data buffer 9111, the second data buffer 9112, and the index information buffer 9113 are as described above, and are not repeated here.
  • the calculation unit 922 in the operation circuit 920 may include a multiplication unit 9201, an accumulation unit 9202, an excitation function unit 9203, and a vector calculation unit 9204.
  • the functions of the multiplication unit 9201, the accumulation unit 9202, the excitation function unit 9203, and the vector calculation unit 9204 are as described above, and are not repeated here.
  • the hardware accelerator may further include an input buffer 923 and an output buffer 924 for temporarily storing input and output data in the hardware accelerator. After the operation circuit 920 completes the calculation, the result may be placed in the output buffer 924 and output, wherein the output buffer 924 may be written back to the processor at one time after the cache is full.
  • the input buffer 923 and the output buffer 924 may be deployed inside the operation circuit 920 or outside the operation circuit 920.
  • FIG. 9B takes the input buffer 923 and the output buffer 924 deployed outside the operation circuit 920 as an example.
  • the hardware accelerator may include a memory access unit 913.
  • the memory access unit 913 may use direct memory access (DMA).
  • DMA direct memory access
  • the memory access unit 913 may be used for data transmission between the hardware accelerator and other devices.
  • FIG10B the hardware and software collaborative framework of the data processing device 200 is shown in FIG10B.
  • the memory access unit 913 can be used for data transmission between the hardware accelerator 220 and the memory 212.
  • the memory access unit 913 can be deployed inside the storage circuit 910 or outside the storage circuit 910.
  • FIG10B takes the deployment outside the storage circuit 910 as an example.
  • the memory access unit 913 adopts DMA as an example.
  • the circuits of the hardware accelerator can be equipped with a DMA to realize parallel data reading.
  • the hardware accelerator may also include a control interconnection circuit (not shown in FIG. 10B) for interconnecting the control signal line between the processing system 210. In conjunction with FIG. 10B
  • the opening of DMA is controlled by the control bus and the data bus respectively, and the DMA transmits the first matrix of the data to be processed in the memory 212 and the second matrix corresponding to the weight tensor and the index of the non-zero elements in the second matrix to the storage circuit 910 in the hardware accelerator 220, such as transferring the second matrix to the first data cache 9111, transferring the first matrix to the second data cache 9112, and transferring the index of the non-zero elements of the second matrix to the index information cache unit 9113.
  • the target result is obtained by processing the operation circuit 920. Referring to FIG.
  • the selection unit 921 reads the j-th row non-zero elements in the second matrix from the first data cache 9111. It can be understood that the non-zero elements of each column are compactly stored in rows in the first data buffer area 9111.
  • the selection unit 921 can read the storage row where the j-th row is located from the first data buffer area 9111.
  • the selection unit 921 obtains the index of the non-zero element of the j-th row in the second matrix from the index information buffer area 9113, and then reads the element in the j-th column of the first matrix multiplied with the non-zero element of the j-th row in the second matrix from the second data buffer area 9112 according to the index of the non-zero element of the j-th row.
  • the selection unit 921 can read the elements in the j-th column of the first matrix from the second data buffer area 9112, and each column element in the first matrix is stored in rows in the second data buffer area 9112.
  • the storage circuit 910 can output the storage row where the j-th column in the first matrix is located, the non-zero element of the j-th row in the second matrix, and the index of the non-zero element of the j-th row to the input buffer area 923. Then the selection unit 921 obtains the elements that need to be multiplied with the non-zero elements in the j-th row from the elements in the j-th column of the first matrix according to the index of the non-zero elements in the j-th row.
  • the calculation unit 922 can calculate the elements that need to be multiplied with the non-zero elements in the j-th row and the non-zero elements in the j-th row in the elements in the j-th column of the first matrix according to the selection result of the selection unit 921.
  • the multiplication unit 9201 performs multiplication operations on the non-zero elements in the j-th row of the second matrix and the elements in the j-th column of the first matrix that need to be multiplied with the non-zero elements in the j-th row to obtain multiple values, and then the accumulation unit 9202 performs addition operations on the multiple values.
  • FIG. 12 a schematic diagram of the structure of a pruning device provided in an embodiment of the present application.
  • the pruning device is used to implement the pruning method shown in FIG. 4.
  • the pruning device may include an acquisition unit 1201 and a pruning unit 1202.
  • the acquisition unit may be used to acquire a first weight tensor of a neural network to be pruned.
  • the pruning unit 1202 performs at least one pruning process on the first weight tensor to obtain a second weight tensor.
  • the pruning device may further include a training unit 1203.
  • the pruning unit 1202 is specifically used to deform the input four-dimensional weight tensor to obtain a deformed three-dimensional weight tensor, wherein the three-dimensional weight tensor includes G data blocks, each of which is composed of K pruning vectors, and each pruning vector is composed of M weights; the three-dimensional weight tensor is pruned along the K dimension, and the q i weights with the least importance in the K dimension are reset to zero to obtain the three-dimensional weight vector after the i-th pruning; and then the three-dimensional weight tensor after the i-th pruning is deformed to obtain the deformed four-dimensional weight tensor.
  • the training unit 1203 is used to train the deformed four-dimensional weight tensor to obtain a four-dimensional weight tensor that has undergone the i-th pruning process.
  • the solution provided in the embodiments of the present application performs sparse processing on the dimensions of C in or C in *FH*FW, which can improve accuracy compared to sparse processing on a single vector using an outer vector wise (OVW) sparse method.
  • the OVW sparse method performs sparse processing on the C out dimension, and the embodiments of the present application can improve acceleration efficiency compared to this method.
  • the embodiments of the present application can also be applied to optimization scenarios using semi-structured sparse matrix multiplication.
  • 1 ⁇ 1 convolution can also be understood as matrix multiplication, so any scenario where matrix multiplication is sparse can also use the solution provided by the embodiments of the present application.
  • all or part of the embodiments may be implemented by software, hardware or a combination thereof.
  • all or part of the embodiments may be implemented in the form of a computer program product.
  • a computer program product includes one or more instructions. When the computer program instructions are loaded and executed on a computer, the process or function according to the present application is generated in whole or in part.
  • the computer may be a general-purpose computer, A dedicated computer, computer network, or other programmable device. Instructions can be stored in a computer storage medium or transmitted from one computer storage medium to another computer storage medium.
  • instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, twisted pair) or wireless (e.g., infrared, wireless, microwave, etc.) means.
  • Computer storage media can be any medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes one or more media integrated.
  • the medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape, a magneto-optical disk (MO), etc.), an optical medium (e.g., an optical disk), or a semiconductor medium (e.g., a ROM, an EPROM, an EEPROM, a solid state disk (SSD)), etc.
  • a magnetic medium e.g., a floppy disk, a hard disk, a magnetic tape, a magneto-optical disk (MO), etc.
  • an optical medium e.g., an optical disk
  • a semiconductor medium e.g., a ROM, an EPROM, an EEPROM, a solid state disk (SSD)
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)
  • Apparatus For Radiation Diagnosis (AREA)
  • Image Analysis (AREA)

Abstract

A neural network pruning method and apparatus, and a data processing method and apparatus. A complementary sparse mode is provided. Therefore, every K pruning vectors with a length of M in weight tensors are spliced into one data block (or vector), and after sparsity, at the position of a non-zero element in one pruning vector, the other pruning vectors in a data block to which the pruning vector belongs are zero elements. The method improves the accuracy of application of a network model, reduces the calculation amount of the model, and can balance the accuracy and acceleration.

Description

神经网络的剪枝方法、数据处理方法及装置Neural network pruning method, data processing method and device

相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS

本申请要求在2023年4月25日提交中华人民共和国国家知识产权局、申请号为202310470198.9、发明名称为“神经网络的剪枝方法、数据处理方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims priority to the Chinese patent application filed with the State Intellectual Property Office of the People's Republic of China on April 25, 2023, with application number 202310470198.9 and invention name "Neural Network Pruning Method, Data Processing Method and Device", all contents of which are incorporated by reference in this application.

技术领域Technical Field

本申请涉及深度学习技术领域,特别涉及一种神经网络的剪枝方法、数据处理方法及装置。The present application relates to the field of deep learning technology, and in particular to a neural network pruning method, a data processing method and a device.

背景技术Background Art

目前神经网络的规模不断增加,参数量和计算量也不断加大。为了降低参数量和计算量,可以采用剪枝、量化、蒸馏或者轻量模型设计等方式。剪枝的应用较为广泛,剪枝是指在冗余的较大模型中将部分非关键权重置为0,以实现稀疏化模型。需要剪枝的部分的选择方法,可以采用任意稀疏方式,但是任意剪枝方式中0元素的分布位置比较随机,由于没有规律可遵循,导致没有带来推理性能的优化。需要剪枝的部分的选择方法,也可以采用结构化稀疏方式,但是结构化稀疏方式中对卷积的权重按照通道进行剪枝,稀疏的粒度较粗,准确率较低。At present, the scale of neural networks is constantly increasing, and the number of parameters and the amount of calculation are also increasing. In order to reduce the number of parameters and the amount of calculation, pruning, quantization, distillation or lightweight model design can be used. Pruning is widely used. Pruning refers to resetting some non-critical weights to 0 in redundant large models to achieve a sparse model. The selection method of the part that needs to be pruned can adopt any sparse method, but the distribution position of the 0 element in any pruning method is relatively random. Since there is no rule to follow, it does not bring about the optimization of reasoning performance. The selection method of the part that needs to be pruned can also adopt a structured sparse method, but in the structured sparse method, the weights of the convolution are pruned according to the channel, the sparse granularity is coarse, and the accuracy is low.

发明内容Summary of the invention

本申请实施例提供一种神经网络的剪枝方法、数据处理方法及装置,用以提供一种平衡推理性能和准确率的剪枝方式。The embodiments of the present application provide a neural network pruning method, a data processing method and a device, so as to provide a pruning method that balances reasoning performance and accuracy.

第一方面,本申请实施例提供一种神经网络的剪枝方法,该方法可以由剪枝装置来执行,比如由剪枝装置中的处理器来实现。方法包括:获取待剪枝神经网络的第一权重张量;对所述第一权重张量执行至少一次剪枝处理,得到第二权重张量;其中,所述第二权重张量包括多个数据块,所述每个数据块由K个剪枝向量组成,每个剪枝向量由M个权重组成;K和M均为正整数;第一数据块包括的第一剪枝向量中第一位置对应的权重为非零元素,在第一数据块中除所述第一剪枝向量以外的其它剪枝向量中所述第一位置对应的权重为零元素,所述第一数据块为所述多个数据块中的任一数据块,所述第一剪枝向量为所述第一数据块中的任一剪枝向量。In a first aspect, an embodiment of the present application provides a pruning method for a neural network, which can be performed by a pruning device, such as a processor in the pruning device. The method includes: obtaining a first weight tensor of the neural network to be pruned; performing at least one pruning process on the first weight tensor to obtain a second weight tensor; wherein the second weight tensor includes multiple data blocks, each of which is composed of K pruning vectors, and each pruning vector is composed of M weights; K and M are both positive integers; the weight corresponding to the first position in the first pruning vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the other pruning vectors in the first data block except the first pruning vector is a zero element, the first data block is any data block among the multiple data blocks, and the first pruning vector is any pruning vector in the first data block.

相比采用单通道或者按块稀疏方式,本申请实施例采用更小的粒度来稀疏,可以提高网络模型的应用的准确度。相比采用任意稀疏方式来说,按照互补的方式来进行稀疏,具有一定的规则,理解为半结构化稀疏的一种,因此在推理阶段无需采用稠密计算,减少了模型的计算量。本申请实施例提供的剪枝方式,每K个剪枝向量的同一位置仅选择一个非零元素参与计算,由于同一位置的其它权重均为零元素,从而在应用于加速硬件中,可以提高运算速度,并且可以降低功耗。本申请实施例提供的方案能够在准确度和加速上进行平衡。Compared with the single-channel or block-based sparse method, the embodiment of the present application uses a smaller granularity for sparseness, which can improve the accuracy of the application of the network model. Compared with the use of any sparse method, sparseness is performed in a complementary manner, with certain rules, which is understood as a kind of semi-structured sparseness. Therefore, there is no need to use dense calculations in the reasoning stage, which reduces the amount of calculation of the model. In the pruning method provided in the embodiment of the present application, only one non-zero element is selected for the same position of each K pruning vector to participate in the calculation. Since the other weights at the same position are all zero elements, the operation speed can be improved and the power consumption can be reduced when applied to acceleration hardware. The solution provided in the embodiment of the present application can balance accuracy and acceleration.

在一种可能的设计中,所述第二权重张量的稀疏度S满足:S=1-1/K。上述方案中,可以基于稀疏度确定剪枝向量的数量,提供一种可行的方式使得一个剪枝向量中非零元素所在的位置,在该剪枝向量所属的数据块中的其它剪枝向量在该位置为零元素。In one possible design, the sparsity S of the second weight tensor satisfies: S = 1-1/K. In the above scheme, the number of pruned vectors can be determined based on the sparsity, providing a feasible way to make the position where the non-zero element in a pruned vector is located, and other pruned vectors in the data block to which the pruned vector belongs have zero elements at that position.

在一种可能的设计中,所述剪枝处理按照稀疏度递增的方式执行N次。按照渐进式的稀疏方式,可以提高网络的准确率。In a possible design, the pruning process is performed N times in a manner of increasing sparsity. The accuracy of the network can be improved by using a progressive sparsification method.

在一种可能的设计中,所述剪枝处理执行的次数N与所述数据块包括的剪枝向量的数量K和/或所述第二权重张量的稀疏度相关。In one possible design, the number of times N that the pruning process is performed is related to the number K of pruning vectors included in the data block and/or the sparsity of the second weight tensor.

在一种可能的设计中,所述剪枝处理执行的次数满足:N=S/(1-S),或者,N=K-1。In one possible design, the pruning process is performed a number of times that satisfies: N=S/(1-S), or N=K-1.

可以理解权重张量沿着每个M*K来进行剪枝,每次剪枝可以理解多剪枝掉一个剪枝向量包括的权重的数量,相比一次剪枝掉相当数量的权重来说,可以提高网络的准确率。It can be understood that the weight tensor is pruned along each M*K. Each pruning can be understood as pruning the number of weights included in one pruning vector. Compared with pruning a considerable number of weights at a time, the accuracy of the network can be improved.

在一种可能的设计中,N=K-1,第i次剪枝处理的稀疏度满足: In one possible design, N = K-1, and the sparsity of the i-th pruning process satisfies:

其中,Si表示第i次剪枝处理的稀疏度,i为正整数,1≤i<K。Wherein, Si represents the sparsity of the i-th pruning process, i is a positive integer, 1≤i<K.

在一种可能的设计中,执行第i次剪枝处理,i取值小于或者等于N,包括:In one possible design, the i-th pruning process is performed, where i is less than or equal to N, including:

对输入的四维权重张量进行变形得到的变形后三维权重张量,所述三维权重张量中包括G个数据块,所述每个数据块由K个剪枝向量组成,每个剪枝向量由M个权重组成;A deformed three-dimensional weight tensor obtained by deforming an input four-dimensional weight tensor, wherein the three-dimensional weight tensor includes G data blocks, each data block is composed of K pruning vectors, and each pruning vector is composed of M weights;

对所述三维权重张量沿K维度进行剪枝,将K维度上重要程度最小的qi个权重置为零,以得到第i次剪枝后的三维权重向量;Pruning the three-dimensional weight tensor along the K dimension, resetting the q i weights with the least importance in the K dimension to zero, so as to obtain the three-dimensional weight vector after the i-th pruning;

然后将第i次剪枝后的三维权重张量进行变形得到变形后的四维权重张量;Then the three-dimensional weight tensor after the i-th pruning is deformed to obtain a deformed four-dimensional weight tensor;

对所述变形后的四维权重张量进行训练,以得到经过所述第i次剪枝处理的四维权重张量;Training the deformed four-dimensional weight tensor to obtain a four-dimensional weight tensor that has undergone the i-th pruning process;

其中,所述输入的四维权重张量为所述第一权重张量或者为经过第i-1次剪枝处理的四维权重张量。Among them, the input four-dimensional weight tensor is the first weight tensor or the four-dimensional weight tensor that has undergone the i-1th pruning process.

上述设计中,采用渐进式剪枝方式,每一代训练减掉一部分权重,相比一次剪枝掉相当数量的权重来说,可以提高网络的准确率。In the above design, a progressive pruning method is adopted, and a part of the weights are reduced in each generation of training. Compared with pruning a considerable number of weights at one time, the accuracy of the network can be improved.

在一种可能的设计中,输入的四维权重张量的形状为(Cout,Cin,FH,FW),变形后的三维权重张量的形状为(G,K,M);其中G满足:
In a possible design, the shape of the input four-dimensional weight tensor is (C out ,C in ,FH,FW), and the shape of the deformed three-dimensional weight tensor is (G, K, M); where G satisfies:

其中,Cout表示所述神经网络输出激活的数量,,Cin表示所述神经网络输入激活的数量,所述FH表示神经网络的内核高度,所述FW表示所述神经网络的内核宽度;Cin*FH*FW=K*M,或者,所述Cin=K*M。Wherein, C out represents the number of output activations of the neural network, C in represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; C in *FH*FW=K*M, or C in =K*M.

上述设计中,本申请实施例在Cin或者Cin*FH*FW的维度上进行稀疏处理,相比采用外向量尺度(outer vector wise,OVW)稀疏方式,在单个向量上进行稀疏相比,可以提高准确度。而OVW稀疏方式是在Cout维度进行稀疏处理,本申请实施例相比该方式可以提高加速效率。In the above design, the embodiment of the present application performs sparse processing on the dimension of C in or C in *FH*FW, which can improve the accuracy compared with the outer vector wise (OVW) sparse method on a single vector. The OVW sparse method performs sparse processing on the C out dimension, and the embodiment of the present application can improve the acceleration efficiency compared with this method.

在一种可能的设计中,将K维度上重要程度最小的qi个权重置为零,包括:In one possible design, the q i weights with the least importance in the K dimension are reset to zero, including:

将K维度上L1范数最小的qi个权重置为零;或者,Reset the q i weights with the smallest L1 norm in K dimensions to zero; or,

将K维度上L2范数最小的qi个权重置为零。Reset the q i weights with the smallest L2 norm in K dimensions to zero.

在一种可能的设计中,当训练过程对所有的权重进行调整时,所述qi=i;或者,In one possible design, when the training process adjusts all weights, q i =i; or,

当训练过程仅对为被剪枝的权重进行调整时,所述qi=1。When the training process only adjusts the unpruned weights, q i =1.

第二方面,本申请实施例提供一种数据处理方法,包括:In a second aspect, an embodiment of the present application provides a data processing method, including:

获取待处理数据,并生成所述待处理数据的向量表示;Acquire data to be processed, and generate a vector representation of the data to be processed;

对所述待处理数据的向量表示与神经网络包括的权重张量执行计算,以得到所述待处理数据的处理结果;Performing calculations on the vector representation of the data to be processed and the weight tensor included in the neural network to obtain a processing result of the data to be processed;

其中,所述神经网络的权重张量是经过剪枝处理得到的,所述权重张量包括多个数据块,所述每个数据块由K个剪枝向量组成,每个剪枝向量由M个权重组成;K和M均为正整数;第一数据块包括的第一剪枝向量中第一位置对应的权重为非零元素,在第一数据块中除所述第一剪枝向量以外的其它剪枝向量中所述第一位置对应的权重为零元素,所述第一数据块为所述多个数据块中的任一数据块,所述第一剪枝向量为所述第一数据块中的任一剪枝向量。Among them, the weight tensor of the neural network is obtained through pruning processing, and the weight tensor includes multiple data blocks, each of which is composed of K pruned vectors, and each pruned vector is composed of M weights; K and M are both positive integers; the weight corresponding to the first position in the first pruned vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the other pruned vectors in the first data block except the first pruned vector is a zero element, the first data block is any data block among the multiple data blocks, and the first pruned vector is any pruned vector in the first data block.

本申请实施例提供的方案,每K个剪枝向量的同一位置仅选择一个非零元素参与计算,由于同一位置的其它权重均为零元素,从而在应用于加速硬件中,可以提高运算速度,并且可以降低功耗。本申请实施例提供的方案能够在准确度和加速上进行平衡。In the solution provided by the embodiment of the present application, only one non-zero element is selected at the same position of each K pruning vectors to participate in the calculation. Since the other weights at the same position are all zero elements, the calculation speed can be increased and the power consumption can be reduced when applied to acceleration hardware. The solution provided by the embodiment of the present application can balance accuracy and acceleration.

在一种可能的设计中,对所述待处理数据的向量表示与神经网络包括的权重张量执行计算,包括:In one possible design, performing calculations on the vector representation of the data to be processed and the weight tensor included in the neural network includes:

所述待处理数据的向量表示为第一矩阵,以及对所述权重张量进行处理得到为第二矩阵以及所述第二矩阵中非零元素的索引;The vector representation of the data to be processed is a first matrix, and the weight tensor is processed to obtain a second matrix and the index of the non-zero elements in the second matrix;

根据所述第二矩阵中每个非零元素的索引对所述第二矩阵与所述第一矩阵执行乘法操作得到计算结果。A multiplication operation is performed on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.

在一种可能的设计中,所述第二矩阵中每个行元素对应于一个数据块的权重,所述第一矩阵中列元素的数量为K*M。In a possible design, each row element in the second matrix corresponds to the weight of a data block, and the number of column elements in the first matrix is K*M.

在一种可能的设计中,根据所述第二矩阵中每个非零元素的索引对所述第二矩阵与所述第一矩阵执行乘法操作得到计算结果,包括:In a possible design, performing a multiplication operation on the second matrix and the first matrix to obtain a calculation result according to an index of each non-zero element in the second matrix includes:

根据所述第二矩阵中非零元素的索引确定所述第一矩阵中与所述第二矩阵中的第i行的非零元素进 行运算的元素,以对所述第二矩阵与所述第一矩阵执行乘法操作得到计算结果。Determine the non-zero elements in the first matrix corresponding to the i-th row in the second matrix according to the index of the non-zero elements in the second matrix. The elements of the row operation are used to perform a multiplication operation on the second matrix and the first matrix to obtain a calculation result.

在一种可能的设计中,所述每个数据块中非零元素的数量为M;所述每个数据块中第k个非零元素的索引值为k-1,用于指示第k个非零元素所属的剪枝向量的索引;In a possible design, the number of non-zero elements in each data block is M; the index value of the k-th non-zero element in each data block is k-1, which is used to indicate the index of the pruning vector to which the k-th non-zero element belongs;

所述第二矩阵中第一非零元素的索引值为g时,所述第一矩阵中,与所述第一非零元素执行乘法的元素的所在行r满足:r=g*M。When the index value of the first non-zero element in the second matrix is g, the row r of the element in the first matrix that is multiplied with the first non-zero element satisfies: r=g*M.

在一种可能的设计中,所述权重张量的形状为(Cout,Cin,FH,FW),Cout表示所述神经网络的输出激活的数量,,Cin表示所述神经网络输入激活的数量,FH表示所述神经网络的内核高度,FW表示所述神经网络的内核高度;所述第二矩阵的列元素数为Cout*FH*FW,所述第二矩阵的行元素数为Cin;Cin=K*M,所述权重张量包括的数据块的数量 In a possible design, the shape of the weight tensor is (C out , C in , FH, FW), C out represents the number of output activations of the neural network, C in represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel height of the neural network; the number of column elements of the second matrix is C out *FH*FW, and the number of row elements of the second matrix is C in ; C in =K*M, and the number of data blocks included in the weight tensor is

在一种可能的设计中,所述权重张量的形状为(Cout,Cin,FH,FW),Cout表示所述神经网络的输出激活的数量,,Cin表示所述神经网络输入激活的数量,FH表示所述神经网络的内核高度,FW表示所述神经网络的内核宽度;所述第二矩阵的列元素数为Cout,所述第二矩阵的行元素数为Cin*FH*FW;Cin*FH*FW=K*M,所述权重张量包括的数据块的数量 In a possible design, the shape of the weight tensor is (C out ,C in ,FH,FW), C out represents the number of output activations of the neural network, C in represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; the number of column elements of the second matrix is C out , and the number of row elements of the second matrix is C in *FH*FW; C in *FH*FW=K*M, and the number of data blocks included in the weight tensor is

第三方面,本申请实施例提供一种加速器,包括:In a third aspect, an embodiment of the present application provides an accelerator, including:

存储电路,用于存储待处理数据对应的第一矩阵、神经网络的权重张量对应的第二矩阵以及所述第二矩阵中非零元素的索引;A storage circuit, used to store a first matrix corresponding to the data to be processed, a second matrix corresponding to the weight tensor of the neural network, and the index of the non-zero elements in the second matrix;

其中,所述神经网络的权重张量是经过剪枝处理得到的,所述权重张量包括多个数据块,所述每个数据块由K个剪枝向量组成,每个剪枝向量由M个权重组成;K和M均为正整数;第一数据块包括的第一剪枝向量中第一位置对应的权重为非零元素,在第一数据块中除所述第一剪枝向量以外的其它剪枝向量中所述第一位置对应的权重为零元素,所述第一数据块为所述多个数据块中的任一数据块,所述第一剪枝向量为所述第一数据块中的任一剪枝向量;Wherein, the weight tensor of the neural network is obtained through pruning processing, and the weight tensor includes multiple data blocks, each of which is composed of K pruned vectors, and each pruned vector is composed of M weights; K and M are both positive integers; the weight corresponding to the first position in the first pruned vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the other pruned vectors in the first data block except the first pruned vector is a zero element, the first data block is any data block among the multiple data blocks, and the first pruned vector is any pruned vector in the first data block;

运算电路,用于根据所述第二矩阵中每个非零元素的索引对所述第二矩阵与所述第一矩阵执行乘法操作得到计算结果。An operation circuit is used to perform a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.

在一种可能的设计中,所述第二矩阵中每个行元素对应于一个数据块的权重,所述第一矩阵中列元素的数量为K*M。In a possible design, each row element in the second matrix corresponds to the weight of a data block, and the number of column elements in the first matrix is K*M.

在一种可能的设计中,所述运算电路包括选择单元和计算单元;In one possible design, the operation circuit includes a selection unit and a calculation unit;

所述选择单元,用于根据所述第二矩阵中非零元素的索引从所述存储电路读取所述第二矩阵中的第i行的非零元素,并根据所述第二矩阵非零元素的索引从所述存储电路中读取所述第一矩阵的第i列元素中与所述第二矩阵非零元素进行运算的元素;The selection unit is used to read the non-zero elements of the i-th row in the second matrix from the storage circuit according to the index of the non-zero elements in the second matrix, and read the elements in the i-th column of the first matrix that are operated on the non-zero elements of the second matrix from the storage circuit according to the index of the non-zero elements of the second matrix;

所述计算单元,用于对所述第二矩阵第i行的非零元素与读取的所述第一矩阵中第i列的元素执进行乘加运算。The calculation unit is used to perform multiplication and addition operations on the non-zero elements in the i-th row of the second matrix and the elements in the i-th column of the read first matrix.

在一种可能的设计中,所述存储电路包括存储单元和选择单元,所述存储单元用于存储待处理数据对应的第一矩阵、神经网络包括的权重张量对应的第二矩阵以及所述第二矩阵中非零元素的索引;In a possible design, the storage circuit includes a storage unit and a selection unit, the storage unit is used to store a first matrix corresponding to the data to be processed, a second matrix corresponding to the weight tensor included in the neural network, and the index of non-zero elements in the second matrix;

所述选择单元,用于根据所述第二矩阵中非零元素的索引从所述存储电路读取所述第二矩阵中的第i行的非零元素,并将所述第二矩阵中的第i行的非零元素发送给所述运算电路;以及根据所述第二矩阵非零元素的索引从所述存储电路中读取所述第一矩阵的第i列元素中与所述第二矩阵中的第i行的非零元素进行运算的元素,并将所述第一矩阵的第i列元素中与所述第二矩阵中的第i行的非零元素进行运算的元素发送给所述运算电路;The selection unit is used to read the non-zero elements of the i-th row in the second matrix from the storage circuit according to the index of the non-zero elements in the second matrix, and send the non-zero elements of the i-th row in the second matrix to the operation circuit; and read the elements of the i-th column elements of the first matrix that are operated with the non-zero elements of the i-th row in the second matrix from the storage circuit according to the index of the non-zero elements of the second matrix, and send the elements of the i-th column elements of the first matrix that are operated with the non-zero elements of the i-th row in the second matrix to the operation circuit;

所述运算电路,具体用于对所述第二矩阵第i行的非零元素与所述第一矩阵的第i列元素中与所述第二矩阵中的第i行的非零元素进行运算的元素进行乘加运算。The operation circuit is specifically used to perform multiplication and addition operations on the non-zero elements in the i-th row of the second matrix and the elements in the i-th column of the first matrix and the non-zero elements in the i-th row of the second matrix.

在一种可能的设计中,所述每个数据块中非零元素的数量为M;所述每个数据块中第k个非零元素的索引值为k-1,用于指示第k个非零元素所属的剪枝向量的索引;In a possible design, the number of non-zero elements in each data block is M; the index value of the k-th non-zero element in each data block is k-1, which is used to indicate the index of the pruning vector to which the k-th non-zero element belongs;

所述第二矩阵中第一非零元素的索引值为g时,所述第一矩阵中,与所述第一非零元素执行乘法的元素的所在行r满足:r=g*M。When the index value of the first non-zero element in the second matrix is g, the row r of the element in the first matrix that is multiplied with the first non-zero element satisfies: r=g*M.

在一种可能的设计中,所述权重张量的形状为(Cout,Cin,FH,FW),Cout表示所述神经网络的输出激 活的数量,,Cin表示所述神经网络输入激活的数量,FH表示所述神经网络的内核高度,FW表示所述神经网络的内核宽度;所述第二矩阵的列元素数为Cout*FH*FW,所述第二矩阵的行元素数为Cin;Cin=K*M,所述权重张量包括的数据块的数量 In one possible design, the shape of the weight tensor is (C out , C in , FH, FW), where C out represents the output stimulus of the neural network. C in represents the number of activations of the neural network input, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; the number of column elements of the second matrix is C out *FH*FW, and the number of row elements of the second matrix is C in ; C in =K*M, the number of data blocks included in the weight tensor

在一种可能的设计中,所述权重张量的形状为(Cout,Cin,FH,FW),Cout表示所述神经网络的输出激活的数量,,Cin表示所述神经网络输入激活的数量,FH表示所述神经网络的内核高度,FW表示所述神经网络的内核高度;所述第二矩阵的列元素数为Cout,所述第二矩阵的行元素数为Cin*FH*FW;Cin*FH*FW=K*M,所述权重张量包括的数据块的数量 In a possible design, the shape of the weight tensor is (C out ,C in ,FH,FW), C out represents the number of output activations of the neural network, C in represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel height of the neural network; the number of column elements of the second matrix is C out , and the number of row elements of the second matrix is C in *FH*FW; C in *FH*FW=K*M, and the number of data blocks included in the weight tensor is

第四方面,本申请实施例提供一种处理系统,包括:In a fourth aspect, an embodiment of the present application provides a processing system, including:

处理器,用于获取待处理数据,并生成所述待处理数据的向量表示;A processor, configured to obtain data to be processed and generate a vector representation of the data to be processed;

加速器,用于对所述待处理数据的向量表示与神经网络包括的权重张量执行计算,以得到所述待处理数据的处理结果;An accelerator, configured to perform calculations on the vector representation of the data to be processed and the weight tensor included in the neural network to obtain a processing result of the data to be processed;

其中,所述神经网络的权重张量是经过剪枝处理得到的,所述权重张量包括多个数据块,所述每个数据块由K个剪枝向量组成,每个剪枝向量由M个权重组成;K和M均为正整数;第一数据块包括的第一剪枝向量中第一位置对应的权重为非零元素,在第一数据块中除所述第一剪枝向量以外的其它剪枝向量中所述第一位置对应的权重为零元素,所述第一数据块为所述多个数据块中的任一数据块,所述第一剪枝向量为所述第一数据块中的任一剪枝向量。Among them, the weight tensor of the neural network is obtained through pruning processing, and the weight tensor includes multiple data blocks, each of which is composed of K pruned vectors, and each pruned vector is composed of M weights; K and M are both positive integers; the weight corresponding to the first position in the first pruned vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the other pruned vectors in the first data block except the first pruned vector is a zero element, the first data block is any data block among the multiple data blocks, and the first pruned vector is any pruned vector in the first data block.

在一种可能的设计中,所述处理器,还用于将所述待处理数据的向量表示处理为第一矩阵,以及对所述权重张量进行处理得到第二矩阵以及所述第二矩阵中非零元素的索引,并将所述第一矩阵、所述第二矩阵以及所述第二矩阵中非零元素的索引发送给所述加速器;In one possible design, the processor is further used to process the vector representation of the data to be processed into a first matrix, and process the weight tensor to obtain a second matrix and the index of the non-zero elements in the second matrix, and send the first matrix, the second matrix and the index of the non-zero elements in the second matrix to the accelerator;

所述加速器,具体用于根据所述第二矩阵中每个非零元素的索引对所述第二矩阵与所述第一矩阵执行乘法操作得到计算结果。The accelerator is specifically used to perform a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.

在一种可能的设计中,所述第二矩阵中每个行元素对应于一个数据块的权重,所述第一矩阵中列元素的数量为K*M。In a possible design, each row element in the second matrix corresponds to the weight of a data block, and the number of column elements in the first matrix is K*M.

在一种可能的设计中,所述加速器,具体用于根据所述第二矩阵中非零元素的索引确定所述第一矩阵中与所述第二矩阵中的第i行的非零元素进行运算的元素,以对所述第二矩阵与所述第一矩阵执行乘法操作得到计算结果。In one possible design, the accelerator is specifically used to determine the elements in the first matrix to be operated on the non-zero elements in the i-th row of the second matrix according to the indices of the non-zero elements in the second matrix, so as to perform multiplication operations on the second matrix and the first matrix to obtain a calculation result.

在一种可能的设计中,所述每个数据块中非零元素的数量为M;所述每个数据块中第k个非零元素的索引值为k-1,用于指示第k个非零元素所属的剪枝向量的索引;In a possible design, the number of non-zero elements in each data block is M; the index value of the k-th non-zero element in each data block is k-1, which is used to indicate the index of the pruning vector to which the k-th non-zero element belongs;

所述第二矩阵中第一非零元素的索引值为g时,所述第一矩阵中,与所述第一非零元素执行乘法的元素的所在行r满足:r=g*M。When the index value of the first non-zero element in the second matrix is g, the row r of the element in the first matrix that is multiplied with the first non-zero element satisfies: r=g*M.

在一种可能的设计中,所述权重张量的形状为(Cout,Cin,FH,FW),Cout表示所述神经网络的输出激活的数量,,Cin表示所述神经网络输入激活的数量,FH表示所述神经网络的内核高度,FW表示所述神经网络的内核宽度;所述第二矩阵的列元素数为Cout*FH*FW,所述第二矩阵的行元素数为Cin;Cin=K*M,所述权重张量包括的数据块的数量 In a possible design, the shape of the weight tensor is (C out ,C in ,FH,FW), C out represents the number of output activations of the neural network, C in represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; the number of column elements of the second matrix is C out *FH*FW, and the number of row elements of the second matrix is C in ; C in =K*M, and the number of data blocks included in the weight tensor is

在一种可能的设计中,所述权重张量的形状为(Cout,Cin,FH,FW),Cout表示所述神经网络的输出激活的数量,,Cin表示所述神经网络输入激活的数量,FH表示所述神经网络的内核高度,FW表示所述神经网络的内核高度;所述第二矩阵的列元素数为Cout,所述第二矩阵的行元素数为Cin*FH*FW;Cin*FH*FW=K*M,所述权重张量包括的数据块的数量 In a possible design, the shape of the weight tensor is (C out ,C in ,FH,FW), C out represents the number of output activations of the neural network, C in represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel height of the neural network; the number of column elements of the second matrix is C out , and the number of row elements of the second matrix is C in *FH*FW; C in *FH*FW=K*M, and the number of data blocks included in the weight tensor is

第五方面,本申请还提供了一种剪枝装置,该剪枝装置具有实现上述第一方面方法的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的模块。In a fifth aspect, the present application further provides a pruning device, which has the function of implementing the method of the first aspect. The function can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions.

在一个可能的设计中,所述剪枝装置的结构中可以包括获取单元和剪枝单元,这些单元可以执行上 述方法示例中的相应功能,具体参见方法示例中的详细描述,此处不做赘述。剪枝装置中还可以包括训练单元,用于执行训练操作。In a possible design, the structure of the pruning device may include an acquisition unit and a pruning unit, which can perform the above For the corresponding functions in the method example, please refer to the detailed description in the method example, which will not be repeated here. The pruning device may also include a training unit for performing a training operation.

在一个可能的设计中,所述剪枝装置的结构中可以包括处理器和存储器,所述处理器被配置为执行上述提及的方法。所述存储器与所述处理器耦合,其保存所述剪枝装置必要的程序指令和数据。In a possible design, the structure of the pruning device may include a processor and a memory, wherein the processor is configured to execute the above-mentioned method. The memory is coupled to the processor and stores necessary program instructions and data for the pruning device.

第六方面,本申请还提供了一种计算机存储介质,所述计算机存储介质中存储有计算机可执行指令,所述计算机可执行指令在被所述计算机调用时用于使所述计算机执行上述第一方面提及的任一种方法,或者执行上述第二方面提及的任一种方法。In a sixth aspect, the present application further provides a computer storage medium, in which computer executable instructions are stored. When the computer executable instructions are called by the computer, the computer is used to enable the computer to execute any one of the methods mentioned in the first aspect above, or to execute any one of the methods mentioned in the second aspect above.

第七方面,本申请还提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面提及的任一种方法,或者执行上述第二方面提及的任一种方法。In a seventh aspect, the present application also provides a computer program product comprising instructions, which, when executed on a computer, enables the computer to execute any of the methods mentioned in the first aspect above, or execute any of the methods mentioned in the second aspect above.

第八方面,本申请还提供了一种芯片,所述芯片与存储器耦合,用于读取并执行存储器中存储的程序指令,以实现上述第一方面提及的任一种方法,或者执行上述第二方面提及的任一种方法。In an eighth aspect, the present application also provides a chip, which is coupled to a memory and is used to read and execute program instructions stored in the memory to implement any of the methods mentioned in the first aspect above, or to execute any of the methods mentioned in the second aspect above.

本申请在上述各方面提供的实现的基础上,还可以进行进一步组合以提供更多实现。Based on the implementations provided in the above aspects, the present application can also be further combined to provide more implementations.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为卷积神经网络结构示意图;Figure 1 is a schematic diagram of the convolutional neural network structure;

图2为本申请实施例提供的系统架构10示意图;FIG2 is a schematic diagram of a system architecture 10 provided in an embodiment of the present application;

图3为本申请实施例提供的数据处理装置200的结构示意图;FIG3 is a schematic diagram of the structure of a data processing device 200 provided in an embodiment of the present application;

图4为本申请实施例提供的剪枝方法流程示意图;FIG4 is a schematic diagram of a pruning method flow chart provided in an embodiment of the present application;

图5为本申请实施例提供的剪枝后的数据块的示意图;FIG5 is a schematic diagram of a pruned data block provided in an embodiment of the present application;

图6为本申请实施例提供的稀疏度与剪枝次数对应示意图;FIG6 is a schematic diagram showing the correspondence between sparsity and the number of pruning times provided in an embodiment of the present application;

图7为本申请实施例提供的渐进剪枝流程示意图;FIG7 is a schematic diagram of a progressive pruning process provided in an embodiment of the present application;

图8A为本申请实施例提供的数据处理方法流程示意图;FIG8A is a schematic diagram of a data processing method provided in an embodiment of the present application;

图8B为本申请实施例提供的数据块的非零索引的编码方式示意图;FIG8B is a schematic diagram of a coding method for non-zero indexes of data blocks provided in an embodiment of the present application;

图9A为本申请实施例提供的一种加速器结构示意图;FIG9A is a schematic diagram of an accelerator structure provided in an embodiment of the present application;

图9B为本申请实施例提供的另一种加速器结构示意图;FIG9B is a schematic diagram of another accelerator structure provided in an embodiment of the present application;

图10A为本申请实施例提供的一种处理系统结构示意图;FIG10A is a schematic diagram of a processing system structure provided in an embodiment of the present application;

图10B为本申请实施例提供的另一种处理系统结构示意图;FIG10B is a schematic diagram of another processing system structure provided in an embodiment of the present application;

图11A为本申请实施例提供的一种数据处理流程示意图;FIG11A is a schematic diagram of a data processing flow provided in an embodiment of the present application;

图11B为本申请实施例提供的另一种数据处理流程示意图;FIG11B is a schematic diagram of another data processing flow provided in an embodiment of the present application;

图12为本申请实施例提供的剪枝装置结构示意图。FIG. 12 is a schematic diagram of the structure of a pruning device provided in an embodiment of the present application.

具体实施方式DETAILED DESCRIPTION

在对本申请实施例提供一种神经网络的剪枝方法、数据处理方法及装置进行说明之前,先对本申请实施例中涉及的相关概念进行说明。Before describing a neural network pruning method, a data processing method, and a device provided in the embodiments of the present application, the related concepts involved in the embodiments of the present application are first described.

(1)神经网络:(1) Neural Network:

神经网络,用于实现机器学习,深度学习,搜索,推理,决策等。本申请提及的神经网络可以包括多种类型,如深度神经网络(deep neural networks,DNN)、卷积神经网络(convolutional neural networks,CNN)、循环神经网络(recurrent neural networks,RNN)、残差网络、采用transformer模型的神经网络或其他神经网络等。下面对一些神经网络进行示例性介绍。Neural networks are used to implement machine learning, deep learning, search, reasoning, decision making, etc. The neural networks mentioned in this application may include various types, such as deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN), residual networks, neural networks using transformer models or other neural networks, etc. Some neural networks are introduced as examples below.

(2)卷积神经网络:(2) Convolutional Neural Networks:

卷积神经网络(convolutional neural networks,CNN)是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元对输入其中的数据进行处理。Convolutional neural networks (CNN) are deep neural networks with convolutional structures. They are deep learning architectures, which refer to multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network, in which each neuron processes the data input into it.

如图1所示,卷积神经网络(CNN)100可以包括输入层110,卷积层/池化层120,其中池化层为可选的,以及神经网络层130。如图1所示卷积层/池化层120可以包括如示例121-126层,在一种实现中,121层为卷积层,122层为池化层,123层为卷积层,124层为池化层,125为卷积层,126为池化层;在另 一种实现方式中,121、122为卷积层,123为池化层,124、125为卷积层,126为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。以卷积层121为例,卷积层121可以包括很多个卷积算子,卷积算子也称为卷积核。卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义。以图像处理为例,不同的权重矩阵所提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化。As shown in FIG1 , a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120, wherein the pooling layer is optional, and a neural network layer 130. As shown in FIG1 , the convolutional layer/pooling layer 120 may include layers 121-126, for example. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another ... In one implementation, 121 and 122 are convolution layers, 123 is a pooling layer, 124 and 125 are convolution layers, and 126 is a pooling layer. That is, the output of the convolution layer can be used as the input of the subsequent pooling layer, or as the input of another convolution layer to continue the convolution operation. Taking the convolution layer 121 as an example, the convolution layer 121 can include many convolution operators, which are also called convolution kernels. The convolution operator can essentially be a weight matrix, which is usually predefined. Taking image processing as an example, different weight matrices extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to blur unnecessary noise in the image.

这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入的数据中提取信息,从而帮助卷积神经网络100进行正确的预测。The weight values in these weight matrices need to be obtained through a lot of training in practical applications. The weight matrices formed by the weight values obtained through training can extract information from the input data, thereby helping the convolutional neural network 100 to make correct predictions.

当卷积神经网络100有多个卷积层的时候,初始的卷积层(例如121)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络100深度的加深,越往后的卷积层(例如126)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (for example, 121) often extracts more general features, which can also be called low-level features. As the depth of the convolutional neural network 100 increases, the features extracted by the later convolutional layers (for example, 126) become more and more complex, such as high-level semantic features. Features with higher semantics are more suitable for the problem to be solved.

池化层:Pooling layer:

由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图1中120所示例的121-126各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像大小相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。Since it is often necessary to reduce the number of training parameters, it is often necessary to periodically introduce a pooling layer after the convolution layer, that is, as shown in the example of 120 in FIG. 1, each layer 121-126 may be a convolution layer followed by a pooling layer, or multiple convolution layers may be followed by one or more pooling layers. In the image processing process, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator to sample the input image to obtain an image of smaller size. The average pooling operator can calculate the pixel values in the image within a specific range to generate an average value. The maximum pooling operator can take the pixel with the largest value in the range within a specific range as the result of the maximum pooling. In addition, just as the size of the weight matrix used in the convolution layer should be related to the image size, the operator in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or maximum value of the corresponding sub-region of the image input to the pooling layer.
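
As an illustrative aid only, the following Python sketch shows 2x2 max and average pooling in the sense described above, where each output pixel is the maximum or the average of a 2x2 sub-region of the input; the function name and the use of NumPy are assumptions for illustration and are not part of the embodiments.

```python
import numpy as np

def pool2x2(image: np.ndarray, mode: str = "max") -> np.ndarray:
    """Downsample a (H, W) image by taking the max or mean of each 2x2 window."""
    H, W = image.shape
    # Group the image into non-overlapping 2x2 windows.
    windows = image[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2)
    return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(pool2x2(x, "max"))   # each output pixel is the maximum of a 2x2 sub-region
print(pool2x2(x, "mean"))  # each output pixel is the average of a 2x2 sub-region
```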

在经过卷积层/池化层120的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层120只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络100需要利用神经网络层130来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层130中可以包括多层隐含层(如图1所示的131、132至13n)以及输出层140,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。After being processed by the convolution layer/pooling layer 120, the convolution neural network 100 is not sufficient to output the required output information. Because as mentioned above, the convolution layer/pooling layer 120 will only extract features and reduce the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolution neural network 100 needs to use the neural network layer 130 to generate one or a group of outputs of the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in Figure 1) and an output layer 140. The parameters contained in the multiple hidden layers can be pre-trained according to the relevant training data of the specific task type. For example, the task type may include image recognition, image classification, image super-resolution reconstruction, etc.

在神经网络层130中的多层隐含层之后,也就是整个卷积神经网络100的最后层为输出层140,该输出层140具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络100的前向传播(如图1由110至140的传播为前向传播)完成,反向传播(如图1由140至110的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络100的损失及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。After the multiple hidden layers in the neural network layer 130, that is, the last layer of the entire convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 100 (as shown in Figure 1, the propagation from 110 to 140 is forward propagation) is completed, the back propagation (as shown in Figure 1, the propagation from 140 to 110 is back propagation) will begin to update the weight values and biases of the aforementioned layers to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.

需要说明的是,如图1所示的卷积神经网络100仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,多个卷积层/池化层并行,将分别提取的特征均输入给神经网络层130进行处理。It should be noted that the convolutional neural network 100 shown in Figure 1 is only an example of a convolutional neural network. In specific applications, the convolutional neural network can also exist in the form of other network models. For example, multiple convolutional layers/pooling layers are in parallel, and the extracted features are input into the neural network layer 130 for processing.

(3)剪枝/稀疏:(3) Pruning/thinning:

稀疏/剪枝是将神经网络（或者神经网络）中非关键部分的权重数值置为零(0)。稀疏方式可以包括任意稀疏、结构化稀疏以及非结构化稀疏。任意稀疏是指选择置0的权重可以是任意位置，推理计算采用稀疏后的权重计算。结构化稀疏是按照一定规则进行裁剪，例如卷积中整通道裁剪，全连接中的整列裁剪，裁剪后通过重排，推理计算仍然采用稠密计算。非结构化稀疏（也可以称为半结构化稀疏）可以理解是介于结构化稀疏和任意稀疏之间，相比结构化稀疏的粒度更低，需要按一定的规则裁剪，但规则相对较弱，需要定制化的代码实现才能取得很好的推理性能。比如英伟达提出的2:4稀疏，在长度为4的向量中有两个位置的权重为零元素。Sparsity/pruning is to set the weight values of non-critical parts of a neural network (or neural network) to zero (0). Sparse methods can include arbitrary sparsity, structured sparsity, and unstructured sparsity. Arbitrary sparsity means that the weights chosen to be set to 0 can be at any position, and the inference calculation uses the sparse weight calculation. Structured sparsity is to prune according to certain rules, such as whole channel pruning in convolution, whole column pruning in full connection, and after pruning, the inference calculation still uses dense calculation through rearrangement. Unstructured sparsity (also called semi-structured sparsity) can be understood as between structured sparsity and arbitrary sparsity. Compared with structured sparsity, it has a lower granularity and needs to be pruned according to certain rules, but the rules are relatively weak, and customized code implementation is required to achieve good inference performance. For example, the 2:4 sparsity proposed by NVIDIA has zero weights at two positions in each vector of length 4.

(4)稀疏度:0元素个数占总元素个数的比例。(4) Sparsity: the ratio of the number of 0 elements to the total number of elements.
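
As a simple illustration of this definition, the sketch below (Python/NumPy, with the function name assumed for illustration) computes sparsity as the number of zero elements divided by the total number of elements.

```python
import numpy as np

def sparsity(tensor: np.ndarray) -> float:
    """Sparsity as defined above: number of zero elements / total number of elements."""
    return float(np.count_nonzero(tensor == 0)) / tensor.size

# Example: a length-4 vector with two zero positions has sparsity 0.5.
print(sparsity(np.array([0.3, 0.0, -1.2, 0.0])))  # 0.5
```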

在本申请的描述中,除非另有说明,“多个”是指两个或多于两个。另外,“/”表示前后关联的对象是一种“或”的关系,例如,A/B可以表示A或B;本申请中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况,其中A,B可以是单数或者复数。并且,为了便于清楚描述本申请实施例的技术方案,在本申请的实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领 域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。还需要说明的是,除非特殊说明,一个实施例中针对一些技术特征的具体描述也可以应用于解释其他实施例提及对应的技术特征。In the description of the present application, unless otherwise specified, "plurality" means two or more than two. In addition, "/" indicates that the objects associated before and after are in an "or" relationship. For example, A/B can represent A or B. The "and/or" in the present application is only a description of the association relationship of the associated objects, indicating that there may be three relationships. For example, A and/or B can represent: A exists alone, A and B exist at the same time, and B exists alone. A and B can be singular or plural. In addition, in order to facilitate the clear description of the technical solutions of the embodiments of the present application, in the embodiments of the present application, the words "first", "second", etc. are used to distinguish the same items or similar items with basically the same functions and effects. Those skilled in the art can understand that the words "first", "second", etc. do not limit the quantity and execution order, and the words "first", "second", etc. do not necessarily limit the difference. It should also be noted that, unless otherwise specified, the specific description of some technical features in one embodiment can also be used to explain the corresponding technical features mentioned in other embodiments.

参见图2所示,为本申请实施例提供的一种系统架构10示意图。系统架构10中包括剪枝装置100和数据处理装置200。2 is a schematic diagram of a system architecture 10 provided in an embodiment of the present application. The system architecture 10 includes a pruning device 100 and a data processing device 200.

剪枝装置100用于执行本申请实施例提供的神经网络的剪枝方法,数据处理装置200用于实现本申请实施例提供的数据处理方法。剪枝装置100针对神经网络完成剪枝之后,可以将剪枝后的神经网络发送给数据处理装置200。The pruning device 100 is used to execute the pruning method of the neural network provided in the embodiment of the present application, and the data processing device 200 is used to implement the data processing method provided in the embodiment of the present application. After the pruning device 100 completes pruning of the neural network, the pruned neural network can be sent to the data processing device 200.

剪枝装置100可以通过软件或者通过硬件实现。The pruning device 100 may be implemented by software or hardware.

剪枝装置100作为软件功能单元的一种举例,剪枝装置100可以包括运行在计算实例上的代码。其中,计算实例可以包括物理主机(计算设备)、虚拟机、容器中的至少一种。进一步地,上述计算实例可以是一台或者多台。例如,剪枝装置100可以包括运行在多个主机/虚拟机/容器上的代码。需要说明的是,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的区域(region)中,也可以分布在不同的region中。进一步地,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的可用区(availability zone,AZ)中,也可以分布在不同的AZ中,每个AZ包括一个数据中心或多个地理位置相近的数据中心。其中,通常一个region可以包括多个AZ。The pruning device 100 is taken as an example of a software functional unit. The pruning device 100 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Furthermore, the computing instance may be one or more. For example, the pruning device 100 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions. Furthermore, the multiple hosts/virtual machines/containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, each AZ including one data center or multiple data centers with close geographical locations. Generally, a region may include multiple AZs.

同样,用于运行该代码的多个主机/虚拟机/容器可以分布在同一个虚拟私有云(virtual private cloud,VPC)中,也可以分布在多个VPC中。其中,通常一个VPC设置在一个region内,同一region内两个VPC之间,以及不同region的VPC之间跨区通信需在每个VPC内设置通信网关,经通信网关实现VPC之间的互连。Similarly, multiple hosts/virtual machines/containers used to run the code can be distributed in the same virtual private cloud (VPC) or in multiple VPCs. Usually, a VPC is set up in a region. For cross-region communication between two VPCs in the same region and between VPCs in different regions, a communication gateway needs to be set up in each VPC to achieve interconnection between VPCs through the communication gateway.

剪枝装置100作为硬件功能单元的一种举例,剪枝装置100可以包括至少一个计算设备,如服务器等。至少一个计算设备可以部署于云网络中,也可以部署于本地。或者,剪枝装置100也可以是利用专用集成电路(application-specific integrated circuit,ASIC)实现、或可编程逻辑器件(programmable logic device,PLD)实现的设备等。其中,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合实现。The pruning device 100 is an example of a hardware functional unit. The pruning device 100 may include at least one computing device, such as a server. The at least one computing device may be deployed in a cloud network or locally. Alternatively, the pruning device 100 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.

一些场景中,剪枝装置100包括的多个计算设备可以分布在相同的region中,也可以分布在不同的region中。In some scenarios, the multiple computing devices included in the pruning apparatus 100 may be distributed in the same region or in different regions.

系统架构10还包括数据存储系统300。可选地,剪枝装置100与其它计算设备配合,例如:数据存储、路由器、负载均衡器等设备。剪枝装置100可以使用数据存储系统300中的数据,或者调用数据存储系统300中的程序代码实现本申请提供的神经网络的剪枝方法。The system architecture 10 also includes a data storage system 300. Optionally, the pruning device 100 cooperates with other computing devices, such as data storage, routers, load balancers, etc. The pruning device 100 can use the data in the data storage system 300, or call the program code in the data storage system 300 to implement the neural network pruning method provided in the present application.

图3示例性示出了本申请提供的一种数据处理装置200的结构示意图。如图3所示,该数据处理装置200包括处理系统210、硬件加速器220、数据总线230和控制总线240。处理系统210和硬件加速器220可通过数据总线230传输数据,可通过控制总线240传输控制信息。处理系统210是整个系统的控制部件,包括处理器211,可选地,处理系统210还包括存储器212。处理器211用于运行软件端的代码,并将加速任务(例如矩阵或者向量)加载到硬件加速器220上。可选的,硬件加速器220可以包括一个或多个知识产权(intellectual property,IP)核,硬件加速器220可控制该一个或多个IP核的工作状态和数据读取等。IP核的功能例如包括:将一些在数字电路中常用但比较复杂的功能块,如同步动态随机存储器(synchronous dynamic random access memory,SDRAM)控制器、外设部件互连标准(peripheral component interconnect,PCI)接口等设计成可修改参数的电路。处理器211可以是中央处理器(central processing unit,CPU),网络处理器(network processor,NP)或者CPU和NP的组合。处理器211还可以进一步包括硬件芯片。上述硬件芯片可以是专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD),现场可编程逻辑门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。存储器212用于存储数据,可由处理器211控制向存储器212读取数据或写入数据,存储器212可以是双倍速率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、随机读取存储器(random-access memory,RAM)或非易失性存储器(non-volatile memory),例如快闪存储器(flash memory)、硬盘 (hard disk drive,HDD)或固态硬盘(solid-state drive,SSD),存储器212还可以包括上述种类的存储器的组合。FIG3 exemplarily shows a schematic diagram of the structure of a data processing device 200 provided by the present application. As shown in FIG3 , the data processing device 200 includes a processing system 210, a hardware accelerator 220, a data bus 230, and a control bus 240. The processing system 210 and the hardware accelerator 220 can transmit data through the data bus 230, and can transmit control information through the control bus 240. The processing system 210 is the control component of the entire system, including a processor 211, and optionally, the processing system 210 also includes a memory 212. The processor 211 is used to run the code on the software side and load the acceleration task (such as a matrix or a vector) onto the hardware accelerator 220. Optionally, the hardware accelerator 220 may include one or more intellectual property (IP) cores, and the hardware accelerator 220 may control the working state and data reading of the one or more IP cores. The functions of the IP core include, for example: designing some commonly used but relatively complex function blocks in digital circuits, such as synchronous dynamic random access memory (SDRAM) controllers, peripheral component interconnect (PCI) interfaces, etc., into circuits with modifiable parameters. The processor 211 may be a central processing unit (CPU), a network processor (NP) or a combination of a CPU and a NP. The processor 211 may further include a hardware chip. The above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof. The above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof. The memory 212 is used to store data. The processor 211 can control the reading or writing of data to the memory 212. The memory 212 can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a random access memory (RAM), or a non-volatile memory (non-volatile memory), such as a flash memory, a hard disk, or a hard disk. The memory 212 may include a hard disk drive (HDD) or a solid-state drive (SSD). The memory 212 may also include a combination of the above-mentioned types of memories.

硬件加速器220,也可以简称为加速器,为整个系统的硬件加速部件,能够根据特定的计算任务设计并实现专用的加速IP核,通过硬件加速器上的IP核可实现对算法的加速。本申请实施例中,硬件加速器220可以应用在神经网络中,可以实现对神经网络算法的加速。硬件加速器220与处理系统210在数据交互完成后可独立的并行执行任务。硬件加速器220可以是图形处理器(graphics processing unit,GPU)、或者FPGA、或者供专门应用的集成电路(application specific integrated circuit,ASIC)、或者微控制器(micro control unit,MCU)等。The hardware accelerator 220, which can also be referred to as an accelerator, is a hardware acceleration component of the entire system. It can design and implement a dedicated acceleration IP core according to a specific computing task. The IP core on the hardware accelerator can accelerate the algorithm. In the embodiment of the present application, the hardware accelerator 220 can be used in a neural network to accelerate the neural network algorithm. After the data interaction is completed, the hardware accelerator 220 and the processing system 210 can independently execute tasks in parallel. The hardware accelerator 220 can be a graphics processing unit (GPU), or an FPGA, or an integrated circuit (application specific integrated circuit, ASIC) for special applications, or a microcontroller (micro control unit, MCU), etc.

数据总线230,用于整个处理系统210和硬件加速器220的数据传输。数据总线230可采用(advanced extensible interface-Stream,AXI-Stream)协议或者总线接口(PCI Express,PCI-E)总线协议,AXI是一种总线协议,AXI-Stream协议是高性能传输协议,允许无限制的数据突发传输。The data bus 230 is used for data transmission between the entire processing system 210 and the hardware accelerator 220. The data bus 230 may adopt the advanced extensible interface-Stream (AXI-Stream) protocol or the bus interface (PCI Express, PCI-E) bus protocol. AXI is a bus protocol, and the AXI-Stream protocol is a high-performance transmission protocol that allows unlimited data burst transmission.

控制总线240，用于整个处理系统210和硬件加速器220控制信号的传输。控制总线240可采用AXI-Lite协议，AXI-Lite协议是一种轻量级的地址映射单次传输协议，适用于硬件运算电路的控制信号传输。The control bus 240 is used for transmission of control signals between the entire processing system 210 and the hardware accelerator 220. The control bus 240 may adopt the AXI-Lite protocol, which is a lightweight address-mapped single-transmission protocol suitable for transmission of control signals of hardware computing circuits.

一些可能的实施方式中,处理系统210可以具有硬件加速的功能,即处理系统210、硬件加速器220的功能通过一个器件来实现,比如采用CPU。In some possible implementations, the processing system 210 may have a hardware acceleration function, that is, the functions of the processing system 210 and the hardware accelerator 220 are implemented by one device, such as a CPU.

数据处理装置200可以通过任何通信机制/通信标准的通信网络与剪枝装置100进行交互，通信网络可以是广域网、局域网、点对点连接等方式，或它们的任意组合。The data processing device 200 may interact with the pruning device 100 via a communication network of any communication mechanism/communication standard. The communication network may be a wide area network, a local area network, a point-to-point connection, or any combination thereof.

本申请实施例提供的方案可以适用于深度学习场景中,比如使用卷积神经网络的场景中,使用深度神经网络的场景中。此外,本申请实施例适用于Transformer网络或者transformer+CNN的融合网络的场景中。The solution provided by the embodiment of the present application can be applied to deep learning scenarios, such as scenarios using convolutional neural networks and scenarios using deep neural networks. In addition, the embodiment of the present application is applicable to scenarios of Transformer networks or fusion networks of transformers + CNN.

下面先对本申请实施例提供的神经网络的剪枝方法流程进行说明。参见图4为本申请实施例提供的神经网络的剪枝方法流程示意图。该神经网络的剪枝方法可以由剪枝装置100执行,也可以由剪枝装置100中的模块来执行。The following is a description of the process of the neural network pruning method provided in the embodiment of the present application. See Figure 4 for a schematic diagram of the process of the neural network pruning method provided in the embodiment of the present application. The neural network pruning method can be executed by the pruning device 100, or by a module in the pruning device 100.

S401,获取待剪枝神经网络的第一权重张量。S401, obtaining a first weight tensor of the neural network to be pruned.

待剪枝神经网络可以是经过预训练的神经网络,也可以是未经训练的神经网络。The neural network to be pruned can be a pre-trained neural network or an untrained neural network.

待剪枝神经网络可以是卷积神经网络,比如用于进行图像分类的、目标识别的、图像处理的、图像降噪等功能的神经网络。The neural network to be pruned may be a convolutional neural network, such as a neural network used for image classification, target recognition, image processing, image denoising, and the like.

第一权重张量可以理解为卷积核的权重的向量表示。卷积核可以是单通道也可以是多通道。比如，卷积核的形状为(Cout,Cin,FH,FW)。Cout表示卷积核的输出激活的数量（或者卷积核的输出通道的数量），Cin表示卷积核输入激活的数量（或者为卷积核输入通道的数量），FH表示卷积核的内核高度，FW表示卷积核的内核宽度。The first weight tensor can be understood as a vector representation of the weight of the convolution kernel. The convolution kernel can be single-channel or multi-channel. For example, the shape of the convolution kernel is (C out , C in , FH, FW). C out represents the number of output activations of the convolution kernel (or the number of output channels of the convolution kernel), C in represents the number of input activations of the convolution kernel (or the number of input channels of the convolution kernel), FH represents the kernel height of the convolution kernel, and FW represents the kernel width of the convolution kernel.

S402,对第一权重张量执行至少一次剪枝处理,得到第二权重张量。S402, perform at least one pruning process on the first weight tensor to obtain a second weight tensor.

第二权重张量包括多个数据块,每个数据块由K个剪枝向量组成,每个剪枝向量由M个权重组成。K和M均为正整数。第一数据块包括的第一剪枝向量中第一位置对应的权重为非零元素,在第一数据块中除第一剪枝向量以外的其它剪枝向量中第一位置对应的权重为零元素。第一数据块为多个数据块中的任一数据块。第一剪枝向量为第一数据块中的任一剪枝向量。一个数据块也可以理解为由多个剪枝向量拼接而成的。数据块的大小与执行神经网络运算加速的硬件的规格相关,比如执行神经网络运算加速的硬件的规格为数据块包括的向量长度。The second weight tensor includes multiple data blocks, each data block consists of K pruned vectors, and each pruned vector consists of M weights. K and M are both positive integers. The weight corresponding to the first position in the first pruned vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the pruned vectors other than the first pruned vector in the first data block is a zero element. The first data block is any data block among the multiple data blocks. The first pruned vector is any pruned vector in the first data block. A data block can also be understood as being composed of multiple pruned vectors. The size of the data block is related to the specifications of the hardware that performs neural network operation acceleration, such as the specifications of the hardware that performs neural network operation acceleration is the length of the vector included in the data block.

剪枝向量也可以简称为向量,或者称为最短向量,也可以采用其它的名称来命名,本申请实施例对该剪枝向量的名称不作具体限定。The pruning vector may also be referred to as a vector, or as a shortest vector, or may be named by other names. The embodiment of the present application does not specifically limit the name of the pruning vector.

应理解的是,每个数据块均满足第一数据块所满足的条件,一个剪枝向量中非零元素所在的位置,在该剪枝向量所属的数据块中的其它剪枝向量在该位置为零元素。比如,参见图5所示,M=4的剪枝向量中,数据块包括两个剪枝向量,第1个剪枝向量的第0位置和第3个位置的权重为非零元素,第1个剪枝向量的第1个位置和第2个位置的权重为零元素,从而在第2个剪枝向量的第0位置和第3个位置的权重均为零元素,第2个剪枝向量的第1个位置和第2个位置的权重为非零元素。即第1个剪枝向量和第2个剪枝向量的零元素的位置互补。基于此,本申请实施例提供的剪枝方式可以称为互补稀疏或者互补剪枝,当然也可以采用其它的名称来命名,本申请实施例对此不作具体限定。为了便于描述,后续将本申请实施例提供的剪枝方式称为互补稀疏。 It should be understood that each data block satisfies the conditions satisfied by the first data block, and the position where the non-zero element in a pruning vector is located is a zero element in other pruning vectors in the data block to which the pruning vector belongs. For example, as shown in FIG5, in the pruning vector of M=4, the data block includes two pruning vectors, the weights of the 0th position and the 3rd position of the first pruning vector are non-zero elements, and the weights of the 1st position and the 2nd position of the first pruning vector are zero elements, so that the weights of the 0th position and the 3rd position of the second pruning vector are both zero elements, and the weights of the 1st position and the 2nd position of the second pruning vector are non-zero elements. That is, the positions of the zero elements of the first pruning vector and the second pruning vector are complementary. Based on this, the pruning method provided in the embodiment of the present application can be called complementary sparse or complementary pruning, and of course other names can be used to name it, and the embodiment of the present application does not specifically limit this. For ease of description, the pruning method provided in the embodiment of the present application will be referred to as complementary sparse in the following.
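
To make the complementary pattern concrete, the following illustrative Python sketch checks the property described above for one K×M data block: at every one of the M positions, exactly one of the K pruning vectors holds a non-zero weight. The function name and the example values are assumptions for illustration only.

```python
import numpy as np

def is_complementary(block: np.ndarray) -> bool:
    """block has shape (K, M): K pruning vectors of length M.

    The block is complementary-sparse if, for every position m,
    exactly one of the K pruning vectors has a non-zero weight there.
    """
    nonzero_per_position = np.count_nonzero(block, axis=0)  # shape (M,)
    return bool(np.all(nonzero_per_position == 1))

# Example matching the K=2, M=4 description above:
# pruning vector 1 is non-zero at positions 0 and 3, pruning vector 2 at positions 1 and 2.
block = np.array([[0.5, 0.0, 0.0, -0.7],
                  [0.0, 1.2, 0.3,  0.0]])
print(is_complementary(block))  # True
```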

相比采用单通道或者按块稀疏方式,本申请实施例采用更小的粒度来稀疏,可以提高网络模型的应用的准确度。相比采用任意稀疏方式来说,按照互补的方式来进行稀疏,在推理阶段无需采用稠密计算,减少了模型的计算量。本申请实施例提供的方案能够在准确度和加速上进行平衡。Compared with the single-channel or block-based sparse method, the embodiment of the present application uses a smaller granularity for sparseness, which can improve the accuracy of the application of the network model. Compared with the use of any sparse method, the sparse method is complementary, and there is no need to use dense calculations in the reasoning stage, which reduces the amount of calculation of the model. The solution provided by the embodiment of the present application can balance accuracy and acceleration.

互补稀疏中，将K个剪枝向量可以拼成一个进行运算的向量（也即数据块），从而剪枝后的稀疏度S与K之间的关系满足公式(1)：
S=(K-1)/K 公式(1)。
In complementary sparsity, K pruned vectors can be combined into a vector (i.e., data block) for operation, so that the relationship between the sparsity S after pruning and K satisfies formula (1):
S=(K-1)/K formula (1).

比如,K=2,稀疏度S=50%。再比如,K=4时,稀疏度S=75%,依次类推。For example, when K=2, the sparsity S=50%. For another example, when K=4, the sparsity S=75%, and so on.

本申请实施例在执行剪枝处理时,可以采用一步到位的方式,即执行一次剪枝处理。也可以采用渐进式剪枝方式(或者称为渐进式稀疏方式)。The embodiment of the present application may adopt a one-step approach when performing pruning, that is, perform pruning once, or may adopt a progressive pruning approach (or progressive thinning approach).

如下针对渐进式剪枝方式进行说明。渐进式剪枝方式中，按照稀疏度递增的方式执行N次剪枝处理。一些实施例中，N可以根据每个数据块包括的剪枝向量的数量K和/或第二权重张量的稀疏度（即剪枝需要的稀疏度）来确定。The progressive pruning method is described below. In the progressive pruning method, pruning is performed N times in a manner of increasing sparsity. In some embodiments, N can be determined according to the number K of pruning vectors included in each data block and/or the sparsity of the second weight tensor (i.e., the sparsity required by pruning).

比如,N与K之间的关系满足如下公式(2):
N=K-1公式(2)。
For example, the relationship between N and K satisfies the following formula (2):
N=K-1 formula (2).

又比如，N与S之间的关系满足如下公式(3)：
N=S/(1-S) 公式(3)。
For another example, the relationship between N and S satisfies the following formula (3):
N=S/(1-S) formula (3).

以N=K-1为例，每次进行剪枝处理的稀疏度可以满足如下公式(4)：
Si=i/K 公式(4)。
Taking N=K-1 as an example, the sparsity of each pruning process can satisfy the following formula (4):
Si=i/K formula (4).

其中,Si表示第i次剪枝处理的稀疏度,i为正整数,1≤i<K。Wherein, Si represents the sparsity of the i-th pruning process, i is a positive integer, 1≤i<K.

例如，要训练一个稀疏度为75%的神经网络，经过3次剪枝，稀疏度的变化情况为：0->25%->50%->75%。再比如，要训练一个稀疏度为87.5%的神经网络，经过7次剪枝，稀疏度的变化情况为：0->12.5%->25%->37.5%->50%->62.5%->75%->87.5%。For example, to train a neural network with a sparsity of 75%, after 3 prunings, the sparsity changes as follows: 0->25%->50%->75%. For another example, to train a neural network with a sparsity of 87.5%, after 7 prunings, the sparsity changes as follows: 0->12.5%->25%->37.5%->50%->62.5%->75%->87.5%.
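
An illustrative sketch of the schedule implied by formulas (2) and (4) above (N = K-1 pruning passes, with sparsity i/K after the i-th pass); the function name is an assumption for illustration.

```python
def progressive_schedule(K: int):
    """Per-pass sparsity for vector-by-vector progressive pruning.

    With N = K - 1 pruning passes, the sparsity after pass i is S_i = i / K.
    """
    return [i / K for i in range(1, K)]

print(progressive_schedule(4))  # [0.25, 0.5, 0.75]          -> target sparsity 75%
print(progressive_schedule(8))  # [0.125, 0.25, ..., 0.875]  -> target sparsity 87.5%
```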

可以理解的是,对于一个M*K的数据块(或者称为矩阵)按照渐进稀疏时,执行次数每增加1,增加的稀疏度相当于再减去一个最短向量包括的权重数,基于此,本申请实施例也可以将该渐进式稀疏方式称为逐向量渐进稀疏,当然也可以采用其它的名称命名,本申请实施例对此不作限定。It can be understood that for an M*K data block (or matrix), when the progressive sparsity is performed, the increased sparsity is equivalent to subtracting the number of weights included in the shortest vector for each increase in the number of executions by 1. Based on this, the embodiment of the present application can also refer to the progressive sparsity method as vector-by-vector progressive sparsity. Of course, other names can also be used, and the embodiment of the present application is not limited to this.

本申请实施例中可以每执行一次剪枝，执行一代训练（也可以称为epoch）。基于此，上述提及的剪枝处理可以包括剪枝操作和训练操作。一些实施例中，在采用逐向量渐进稀疏方式时，在每个稀疏度值训练的次数可以是相同的。参见图6所示，不同的稀疏度对应的不同台阶（stage）是等宽的，即在每个稀疏度值训练的次数相同。不同的稀疏度值对应的训练可以采用相同的训练样本集，也可以采用不同的训练样本集。例如，可以将样本集拆分为多个训练样本子集，每执行一次剪枝处理，使用一个训练样本子集执行一代训练。比如，不同的训练样本子集包括的batch（表示一批数据）数量相同，则在每个稀疏度值训练的次数相同。使用训练集中的一小部分样本，对神经网络的权重进行一次反向传播的参数更新，这一小部分样本被称为“一批数据”。In the embodiment of the present application, each time pruning is performed, one generation of training (also referred to as an epoch) is performed. Based on this, the pruning process mentioned above may include pruning operations and training operations. In some embodiments, when the vector-by-vector progressive sparsity method is adopted, the number of training times at each sparsity value may be the same. As shown in FIG. 6, different steps (stages) corresponding to different sparsities are of equal width, that is, the number of training times at each sparsity value is the same. The training corresponding to different sparsity values may use the same training sample set or different training sample sets. For example, the sample set may be split into multiple training sample subsets, and each time a pruning process is performed, one training sample subset is used to perform one generation of training. For example, if different training sample subsets include the same number of batches (a batch representing a batch of data), the number of training times at each sparsity value is the same. A small portion of samples in the training set is used to perform one back-propagation parameter update on the weights of the neural network; this small portion of samples is called a "batch of data".

如下针对剪枝流程进行说明。以第i次剪枝为例,i取值小于或者等于N。例如,每次剪枝操作可以分为三个步骤,分别为(1)变形、(2)剪枝和(3)恢复(也可以称为逆变形或者反变形)。The pruning process is described below. Taking the i-th pruning as an example, i is less than or equal to N. For example, each pruning operation can be divided into three steps, namely (1) deformation, (2) pruning and (3) restoration (also called inverse deformation or anti-deformation).

(1)变形:对输入的四维权重张量进行变形得到变形后的三维权重张量。所述三维权重张量中包括G个数据块,所述每个数据块由K个剪枝向量组成,每个剪枝向量由M个权重组成。(1) Deformation: The input four-dimensional weight tensor is deformed to obtain a deformed three-dimensional weight tensor. The three-dimensional weight tensor includes G data blocks, each of which is composed of K pruned vectors, and each pruned vector is composed of M weights.

可以理解的是,第1次执行剪枝时,输入的四维权重张量可以理解为第一权重张量。第2至N次执行剪枝时,输入的四维权重张量可以理解为经过第i次剪枝处理的四维权重张量。It can be understood that when pruning is performed for the first time, the input four-dimensional weight tensor can be understood as the first weight tensor. When pruning is performed for the second to Nth times, the input four-dimensional weight tensor can be understood as the four-dimensional weight tensor that has undergone the i-th pruning process.

输入的四维权重张量的形状可以描述为(Cout,Cin,FH,FW)。Cout表示卷积核的输出激活的数量（或者卷积核的输出通道的数量），Cin表示卷积核输入激活的数量（或者为卷积核输入通道的数量），FH表示卷积核的内核高度，FW表示卷积核的内核宽度。采用四组参数来描述权重张量。当然一些场景中，单输出通道的情况下，权重张量中Cout=1，在该情况下，输入的权重张量的形状可以理解为三维，即(Cin,FH,FW)。单输入通道的情况下，权重张量中Cin=1，输入的权重张量的形状可以理解为三维，即(Cout,FH,FW)。The shape of the input four-dimensional weight tensor can be described as (C out , C in , FH, FW). C out represents the number of output activations of the convolution kernel (or the number of output channels of the convolution kernel), C in represents the number of input activations of the convolution kernel (or the number of input channels of the convolution kernel), FH represents the kernel height of the convolution kernel, and FW represents the kernel width of the convolution kernel. Four sets of parameters are used to describe the weight tensor. Of course, in some scenarios, in the case of a single output channel, C out = 1 in the weight tensor. In this case, the shape of the input weight tensor can be understood as three-dimensional, that is, (C in , FH, FW). In the case of a single input channel, C in = 1 in the weight tensor, and the shape of the input weight tensor can be understood as three-dimensional, that is, (C out , FH, FW).

以输入的权重张量的形状为(Cout,Cin,FH,FW)为例，变形后的三维权重张量的形状为(G,K,M)；其中G满足如下公式(5)：
G=(Cout*Cin*FH*FW)/(K*M) 公式(5)。
Taking the shape of the input weight tensor as (Cout,Cin,FH,FW) as an example, the shape of the deformed three-dimensional weight tensor is (G,K,M), where G satisfies the following formula (5):
G=(Cout*Cin*FH*FW)/(K*M) formula (5).

一些场景中,为了在通用并行计算硬件上或者专用加速硬件上实现加速,K*M满足如下公式(6):
K*M=L,L∈{Cin,Cin*FH*FW}公式(6)。
In some scenarios, in order to achieve acceleration on general parallel computing hardware or dedicated acceleration hardware, K*M satisfies the following formula (6):
K*M=L, L∈{C in ,C in *FH*FW} formula (6).

例如,专用加速硬件可以是具有独立功能,可对本申请实施例提供的互补稀疏进行加速的电路或者硬件,比如可以专用的MCU,或者专用的GPU,或者专用的CPU等。L的取值,与加速硬件(专用加速硬件或者通用并行计算硬件)的规格相关,比如加速硬件的规格可以是加速硬件所支持处理的矩阵乘积的矩阵的大小。比如,加速硬件支持处理A*L的矩阵与L*B的矩阵的乘积运算。例如,加速硬件的规格中L=Cin,再比如,加速硬件的规格中L=Cin*FH*FW。一些实施例中,L还可以取值为Cin*FH或者,Cin*FW。For example, dedicated acceleration hardware may be a circuit or hardware with independent functions that can accelerate the complementary sparsity provided in the embodiments of the present application, such as a dedicated MCU, a dedicated GPU, or a dedicated CPU, etc. The value of L is related to the specifications of the acceleration hardware (dedicated acceleration hardware or general parallel computing hardware). For example, the specifications of the acceleration hardware may be the size of the matrix of the matrix product supported by the acceleration hardware. For example, the acceleration hardware supports the product operation of the matrix A*L and the matrix L*B. For example, in the specifications of the acceleration hardware, L=C in , and for another example, in the specifications of the acceleration hardware, L=C in *FH*FW. In some embodiments, L may also be C in *FH or C in *FW.

一些实施例中,以L=Cin为例,执行变形时,可以将输入的四维权重张量先重排为(FH,FW,Cout,Cin)。变形的目的是便于沿Cin维度进行剪枝。进一步地,将重排后的权重张量变形为(G,K,M)的三维权重张量。此时,G=Cout*FH*FW。In some embodiments, taking L=C in as an example, when performing deformation, the input four-dimensional weight tensor can be first rearranged into (FH, FW, C out , C in ). The purpose of deformation is to facilitate pruning along the C in dimension. Further, the rearranged weight tensor is deformed into a three-dimensional weight tensor of (G, K, M). At this time, G=C out *FH*FW.

另一些实施例中,以L=Cin*FH*FW为例,执行变形时,可以将输入的四维权重张量先重排为(Cout,FH,FW,Cin),进一步地,将重排后的权重张量变形为(G,K,M)的三维权重张量。此时,G=CoutIn some other embodiments, taking L=C in *FH*FW as an example, when performing deformation, the input four-dimensional weight tensor can be first rearranged into (C out , FH, FW, C in ), and further, the rearranged weight tensor can be deformed into a three-dimensional weight tensor of (G, K, M). In this case, G=C out .
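
The deformation step for the L = Cin case can be pictured with the following illustrative NumPy sketch: the (Cout, Cin, FH, FW) tensor is rearranged so that pruning can be done along the Cin dimension, and is then viewed as (G, K, M) blocks with G = Cout*FH*FW. The helper name and the exact ordering of the G axis are assumptions for illustration.

```python
import numpy as np

def reshape_for_pruning(w: np.ndarray, K: int) -> np.ndarray:
    """Deform a (Cout, Cin, FH, FW) weight tensor into (G, K, M) blocks,
    pruning along the Cin dimension (the L = Cin case, so K * M = Cin)."""
    Cout, Cin, FH, FW = w.shape
    assert Cin % K == 0
    M = Cin // K
    # Rearrange to (FH, FW, Cout, Cin) so that Cin is the innermost axis, ...
    w = w.transpose(2, 3, 0, 1)
    # ... then view it as G = Cout * FH * FW blocks of shape (K, M).
    return w.reshape(FH * FW * Cout, K, M)

w = np.random.randn(8, 16, 3, 3)        # (Cout, Cin, FH, FW)
blocks = reshape_for_pruning(w, K=4)     # (G, K, M) = (72, 4, 4)
print(blocks.shape)
```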

(2)剪枝:对所述三维权重张量沿K维度进行剪枝,将K维度上重要程度最小的qi个权重置为零,以得到第i次剪枝后的三维权重向量。(2) Pruning: Prune the three-dimensional weight tensor along the K dimension, reset the q i weights with the least importance in the K dimension to zero, so as to obtain the three-dimensional weight vector after the i-th pruning.

(G,K,M)的三维权重张量,可以理解为G个K*M的数据块,每个K*M数据块,在K维度进行剪枝,可以理解每列中的K权重中重要程度最小qi个权重置为零。The three-dimensional weight tensor of (G, K, M) can be understood as G K*M data blocks. Each K*M data block is pruned in the K dimension. It can be understood that the q i weights with the least importance among the K weights in each column are reset to zero.

示例性地,将K维度上重要程度最小的qi个权重置为零,可以为:Exemplarily, the q i weights with the least importance in the K dimension are reset to zero, which can be:

将K维度上L1范数最小的qi个权重置为零;或者,Reset the q i weights with the smallest L1 norm in K dimensions to zero; or,

将K维度上L2范数最小的qi个权重置为零。Reset the q i weights with the smallest L2 norm in K dimensions to zero.

比如,所述qi=i,或者qi=1。For example, said qi =i, or qi =1.
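
An illustrative sketch of the pruning step on a (G, K, M) tensor: for every (g, m) column, the qi weights with the smallest magnitude along the K dimension are set to zero (for a single weight, the L1 and L2 norms both reduce to its absolute value). Function and variable names are assumptions for illustration.

```python
import numpy as np

def prune_along_k(blocks: np.ndarray, q: int) -> np.ndarray:
    """blocks: (G, K, M). For every (g, m) column, set the q weights with the
    smallest magnitude along the K axis to zero."""
    G, K, M = blocks.shape
    pruned = blocks.copy()
    # Indices of the q smallest |w| along the K axis, independently per (g, m).
    order = np.argsort(np.abs(blocks), axis=1)   # (G, K, M)
    smallest = order[:, :q, :]                   # (G, q, M)
    g_idx = np.arange(G)[:, None, None]
    m_idx = np.arange(M)[None, None, :]
    pruned[g_idx, smallest, m_idx] = 0.0
    return pruned

blocks = np.random.randn(2, 4, 4)      # (G, K, M)
print(prune_along_k(blocks, q=3))       # 3 of the 4 weights per column are zeroed -> 75% sparsity
```

When q = K-1, each (g, m) column keeps exactly one non-zero weight, which is the complementary pattern described earlier.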

(3)恢复:将第i次剪枝后的三维权重张量进行变形得到变形后的四维权重张量。其中,所述输入的四维权重张量为所述第一权重张量或者为经过第i-1次剪枝处理的四维权重张量。(3) Restoration: deforming the three-dimensional weight tensor after the i-th pruning to obtain a deformed four-dimensional weight tensor, wherein the input four-dimensional weight tensor is the first weight tensor or the four-dimensional weight tensor after the i-1-th pruning.

一种可能的实施方式中,本申请实施例中的每次剪枝处理中,可以先执行剪枝操作,然后执行训练操作,也可以先执行训练操作再执行剪枝操作。在训练过程中,针对被剪枝掉的权重可以执行更新也可以不执行更新。In a possible implementation, in each pruning process in the embodiment of the present application, the pruning operation may be performed first and then the training operation, or the training operation may be performed first and then the pruning operation. During the training process, the pruned weights may or may not be updated.

一种示例中,以先执行剪枝操作再执行训练操作为例,并且减掉的权重执行更新为例。其中,qi=i。In one example, the pruning operation is performed first and then the training operation is performed, and the subtracted weights are updated as an example.

在训练之前,先执行第1次剪枝,i=1,经过第1次剪枝后的稀疏度为S1=1/K。可以理解在执行第1次剪枝时,每K个剪枝向量构成的数据块中减掉M个权重。每个K*M的数据块,每列中的K个权重中重要程度最小的1个权重置为零,即每个数据块中M个权重被减掉。然后针对剪枝后的权重向量进行第一代训练,在训练过程中针对各个权重进行调整。然后继续执行第2次剪枝,i=2,经过第2次剪枝后的稀疏度为S2=2/K。可以理解在执行第2次剪枝时,每K个剪枝向量构成的数据块中减掉2*M个权重。每个K*M的数据块,每列中的K个权重中重要程度最小的2个权重置为零,即每个数据块中2*M个权重被减掉。然后针对剪枝后的权重向量进行第二代训练,在训练过程中针对各个权重进行调整。依此类推。Before training, the first pruning is performed, i=1, and the sparsity after the first pruning is S 1 =1/K. It can be understood that when the first pruning is performed, M weights are subtracted from the data block composed of each K pruned vector. For each K*M data block, the least important weight among the K weights in each column is reset to zero, that is, M weights are subtracted from each data block. Then the first generation of training is performed for the pruned weight vector, and each weight is adjusted during the training process. Then continue to perform the second pruning, i=2, and the sparsity after the second pruning is S 2 =2/K. It can be understood that when the second pruning is performed, 2*M weights are subtracted from the data block composed of each K pruned vector. For each K*M data block, the two least important weights among the K weights in each column are reset to zero, that is, 2*M weights are subtracted from each data block. Then the second generation of training is performed for the pruned weight vector, and each weight is adjusted during the training process. And so on.

需要说明的是,最后一次经过剪枝后,在最后一代训练过程中可以仅对未被剪枝掉的权重进行调整。It should be noted that after the last pruning, only the weights that have not been pruned can be adjusted during the last generation of training.

另一种示例中,可以先执行剪枝操作再执行训练操作为例,并且减掉的权重不执行更新为例。其中,qi=1。In another example, the pruning operation may be performed first and then the training operation may be performed, and the subtracted weights may not be updated.

在训练之前,先执行第1次剪枝,i=1,经过第1次剪枝后的稀疏度为S1=1/K。可以理解在执行第1次剪枝时,每K个剪枝向量构成的数据块中减掉M个权重。每个K*M的数据块,每列中的K个权重中重要程度最小的1个权重置为零,即每个数据块中M个权重被减掉。然后针对剪枝后的权重向量进行第一代训练,在训练过程中针对未被减掉的权重进行调整。然后继续执行第2次剪枝,i=2,经过第2次剪枝后的稀疏度为S2=2/K。可以理解在执行第2次剪枝时,每K个剪枝向量构成的数据块中继续减掉M个权重。每个K*M的数据块,每列中的K个权重中除减掉的权重中重要程度最小的1个权重置为零,此时每个数据块中已有2*M个权重被减掉。然后针对剪枝后的权重向量进行第二代训 练,在训练过程中针对未被减掉的权重进行调整。依此类推。Before training, the first pruning is performed, i=1, and the sparsity after the first pruning is S 1 =1/K. It can be understood that when the first pruning is performed, M weights are subtracted from the data block composed of each K pruned vectors. For each K*M data block, the least important weight among the K weights in each column is reset to zero, that is, M weights are subtracted from each data block. Then the first generation of training is performed on the pruned weight vectors, and the weights that have not been subtracted are adjusted during the training process. Then continue to perform the second pruning, i=2, and the sparsity after the second pruning is S 2 =2/K. It can be understood that when the second pruning is performed, M weights are continued to be subtracted from the data block composed of each K pruned vectors. For each K*M data block, the least important weight among the K weights in each column except the subtracted weights is reset to zero, and at this time 2*M weights have been subtracted from each data block. Then the second generation of training is performed on the pruned weight vectors During training, the weights that were not lost are adjusted. And so on.

又一种示例中,可以先执行训练操作再执行剪枝操作为例,并且减掉的权重执行更新为例。其中,qi=i。In another example, a training operation may be performed first and then a pruning operation may be performed, and the subtracted weights may be updated as an example.

执行第1次剪枝之前,执行第一代训练,i=1,在训练过程中对各个权重进行调整。然后进行第1次剪枝操作,经过第1次剪枝后的稀疏度为S1=1/K。可以理解在执行第1次剪枝时,每K个剪枝向量构成的数据块中减掉M个权重。每个K*M的数据块,每列中的K个权重中重要程度最小的1个权重置为零,即每个数据块中M个权重被减掉。然后针对剪枝后的权重向量进行第二代训练,在训练过程中针对所有权重进行调整。然后继续执行第2次剪枝,i=2,经过第2次剪枝后的稀疏度为S2=2/K。可以理解在执行第2次剪枝时,每K个剪枝向量构成的数据块中减掉2*M个权重。每个K*M的数据块,每列中的K个权重中除减掉的权重中重要程度最小的2个权重置为零,此时每个数据块中已有2*M个权重被减掉。依此类推。Before performing the first pruning, the first generation of training is performed, i=1, and each weight is adjusted during the training process. Then the first pruning operation is performed, and the sparsity after the first pruning is S 1 =1/K. It can be understood that when performing the first pruning, M weights are subtracted from each data block consisting of K pruned vectors. For each K*M data block, the least important weight among the K weights in each column is reset to zero, that is, M weights are subtracted from each data block. Then the second generation of training is performed for the pruned weight vectors, and all weights are adjusted during the training process. Then continue to perform the second pruning, i=2, and the sparsity after the second pruning is S 2 =2/K. It can be understood that when performing the second pruning, 2*M weights are subtracted from each data block consisting of K pruned vectors. For each K*M data block, the two least important weights among the K weights in each column are reset to zero, and at this time, 2*M weights have been subtracted in each data block. And so on.

又一种示例中,可以先执行训练操作再执行剪枝操作为例,并且减掉的权重不执行更新为例。其中,qi=i。In another example, the training operation may be performed first and then the pruning operation may be performed, and the subtracted weights may not be updated.

执行第1次剪枝之前,执行第一代训练,i=1,在训练过程中对各个权重进行调整。然后进行第1次剪枝操作,经过第1次剪枝后的稀疏度为S1=1/K。可以理解在执行第1次剪枝时,每K个剪枝向量构成的数据块中减掉M个权重。每个K*M的数据块,每列中的K个权重中重要程度最小的1个权重置为零,即每个数据块中M个权重被减掉。然后针对剪枝后的权重向量进行第二代训练,在训练过程中针对未被减掉的权重进行调整。然后继续执行第2次剪枝,i=2,经过第2次剪枝后的稀疏度为S2=2/K。可以理解在执行第2次剪枝时,每K个剪枝向量构成的数据块中减掉M个权重。每个K*M的数据块,每列中的K个权重中除减掉的权重中重要程度最小的1个权重置为零,此时每个数据块中已有2*M个权重被减掉。依此类推。Before performing the first pruning, perform the first generation of training, i=1, and adjust each weight during the training process. Then perform the first pruning operation, and the sparsity after the first pruning is S 1 =1/K. It can be understood that when performing the first pruning, M weights are subtracted from each data block consisting of K pruned vectors. For each K*M data block, the least important weight among the K weights in each column is reset to zero, that is, M weights are subtracted from each data block. Then perform the second generation of training for the pruned weight vectors, and adjust the weights that have not been subtracted during the training process. Then continue to perform the second pruning, i=2, and the sparsity after the second pruning is S 2 =2/K. It can be understood that when performing the second pruning, M weights are subtracted from each data block consisting of K pruned vectors. For each K*M data block, the K weights in each column are reset to zero after the weight with the least importance is subtracted. At this time, 2*M weights have been subtracted in each data block. And so on.
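
The following illustrative sketch strings the three steps together for the first ordering described above (prune first, then train one epoch, with pruned weights still allowed to update, and qi = i). It reuses reshape_for_pruning and prune_along_k from the sketches above; restore_shape and train_one_epoch are placeholder names for illustration, not interfaces defined by this application.

```python
import numpy as np

def restore_shape(blocks: np.ndarray, Cout: int, Cin: int, FH: int, FW: int) -> np.ndarray:
    """Inverse of the deformation sketched earlier: (G, K, M) -> (Cout, Cin, FH, FW)."""
    return blocks.reshape(FH, FW, Cout, Cin).transpose(2, 3, 0, 1)

def progressive_prune_and_train(w: np.ndarray, K: int, train_one_epoch):
    """'Prune first, then train one epoch' ordering with q_i = i; pruned weights may be updated."""
    Cout, Cin, FH, FW = w.shape
    N = K - 1                                         # number of pruning passes, formula (2)
    for i in range(1, N + 1):
        blocks = reshape_for_pruning(w, K)            # (1) deform to (G, K, M)
        blocks = prune_along_k(blocks, q=i)           # (2) prune: sparsity after this pass is i / K
        w = restore_shape(blocks, Cout, Cin, FH, FW)  # (3) inverse deformation
        w = train_one_epoch(w)                        # one epoch of training per pruning pass
    return w

w = np.random.randn(8, 16, 3, 3)
w = progressive_prune_and_train(w, K=4, train_one_epoch=lambda w: w)  # dummy trainer for illustration
```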

例如，参见图7所示，变形后形状为(1,4,4)的张量的渐进剪枝过程，可以看到，随着稀疏度变高，K维度上保留的数目越来越少。需要说明的是，图7中仅作为一种示例。实际剪枝过程中，经过训练后权重的取值会发生变化，图7中仅示意剪枝的数量的变化，以及稀疏度的变化。For example, FIG. 7 shows the progressive pruning process of a tensor whose deformed shape is (1,4,4); it can be seen that as the sparsity increases, fewer and fewer weights are retained along the K dimension. It should be noted that FIG. 7 is only an example. In the actual pruning process, the values of the weights will change after training; FIG. 7 only illustrates the change in the number of pruned weights and the change in sparsity.

下面对本申请实施例提供的数据处理方法流程进行说明。参见图8A为本申请实施例提供的数据处理流程示意图。该数据处理方法可以由数据处理装置200执行,也可以由数据处理装置200中的模块来执行。该实施例中的处理器可以是图3中的处理器211,加速器可以是图3中的硬件加速器220。The data processing method provided by the embodiment of the present application is described below. See FIG. 8A for a schematic diagram of the data processing flow provided by the embodiment of the present application. The data processing method can be executed by the data processing device 200, or by a module in the data processing device 200. The processor in this embodiment can be the processor 211 in FIG. 3, and the accelerator can be the hardware accelerator 220 in FIG. 3.

S801,处理器211获取待处理数据,并生成所述待处理数据的向量表示。S801: The processor 211 obtains data to be processed and generates a vector representation of the data to be processed.

待处理数据的向量表示,可以是矩阵形式的表示,例如待处理数据的向量表示为第一矩阵。The vector representation of the data to be processed may be a representation in matrix form, for example, the vector representation of the data to be processed is a first matrix.

S802,硬件加速器220对所述待处理数据的向量表示与神经网络包括的权重张量执行计算,以得到所述待处理数据的处理结果。神经网络包括的权重张量可以是矩阵形式的表示,比如第二矩阵。S802, the hardware accelerator 220 performs calculations on the vector representation of the data to be processed and the weight tensor included in the neural network to obtain a processing result of the data to be processed. The weight tensor included in the neural network can be a matrix representation, such as a second matrix.

一些实施例中,处理器211可以将待处理数据的向量表示处理为第一矩阵,还对权重张量进行处理得到第二矩阵以及第二矩阵中的非零元素的索引,还将第一矩阵、第二矩阵以及第二矩阵中非零元素的索引发送给加速器。进而加速器根据第二矩阵中每个非零元素的索引对第二矩阵和第一矩阵执行乘法操作(也可以描述为乘法运算)得到计算结果。In some embodiments, the processor 211 may process the vector representation of the data to be processed into a first matrix, process the weight tensor to obtain a second matrix and the index of the non-zero elements in the second matrix, and send the first matrix, the second matrix, and the index of the non-zero elements in the second matrix to the accelerator. Then, the accelerator performs a multiplication operation (also described as a multiplication operation) on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.

一些实施例中,硬件加速器220(简称加速器220)对所述待处理数据的向量表示与神经网络包括的权重张量执行计算,可以理解对所述待处理数据的第一矩阵与神经网络的第二矩阵进行乘法运算。In some embodiments, the hardware accelerator 220 (referred to as accelerator 220) performs calculations on the vector representation of the data to be processed and the weight tensor included in the neural network, which can be understood as multiplying the first matrix of the data to be processed and the second matrix of the neural network.

其中,所述神经网络的权重张量是经过剪枝处理得到的,所述权重张量包括多个数据块,所述每个数据块由K个剪枝向量组成,每个剪枝向量由M个权重组成;K和M均为正整数;第一数据块包括的第一剪枝向量中第一位置对应的权重为非零元素,在第一数据块中除所述第一剪枝向量以外的其它剪枝向量中所述第一位置对应的权重为零元素,所述第一数据块为所述多个数据块中的任一数据块,所述第一剪枝向量为所述第一数据块中的任一剪枝向量。Among them, the weight tensor of the neural network is obtained through pruning processing, and the weight tensor includes multiple data blocks, each of which is composed of K pruned vectors, and each pruned vector is composed of M weights; K and M are both positive integers; the weight corresponding to the first position in the first pruned vector included in the first data block is a non-zero element, and the weight corresponding to the first position in the other pruned vectors in the first data block except the first pruned vector is a zero element, the first data block is any data block among the multiple data blocks, and the first pruned vector is any pruned vector in the first data block.

在一种可能的实施方式中,可以对权重张量进行编码。一个剪枝后的权重张量可以被编码为非零权重+索引。非零权重在存储时,可以按照拼接向量的顺序紧致存储。每个非零权重对应一个索引。加速器对所述待处理数据的向量表示与神经网络包括的权重张量执行计算时,可以根据第二矩阵中每个非零权重的索引对所述第二矩阵与所述第一矩阵执行乘法操作得到计算结果。一些实施例中,第二矩阵中每个行元素对应一个数据块的权重,即第二矩阵的行元素的数量为K*M。第一矩阵的列元素的数量为K*M。在执行矩阵乘法时,采用第二矩阵×第一矩阵的方式。另一些实施中,可以是第二矩阵中每个列元素对 应一个数据块的权重,即第二矩阵的列元素的数量为K*M。第一矩阵的行元素的数量为K*M。在执行矩阵乘法时,采用第一矩阵×第二矩阵的方式。In one possible implementation, the weight tensor may be encoded. A pruned weight tensor may be encoded as non-zero weights + indices. When storing, non-zero weights may be compactly stored in the order of the concatenated vectors. Each non-zero weight corresponds to an index. When the accelerator performs calculations on the vector representation of the data to be processed and the weight tensor included in the neural network, the second matrix may be multiplied by the first matrix according to the index of each non-zero weight in the second matrix to obtain a calculation result. In some embodiments, each row element in the second matrix corresponds to the weight of a data block, that is, the number of row elements in the second matrix is K*M. The number of column elements in the first matrix is K*M. When performing matrix multiplication, the second matrix × first matrix method is adopted. In other implementations, each column element in the second matrix may be multiplied by the first matrix. The weight of a data block, that is, the number of column elements of the second matrix is K*M. The number of row elements of the first matrix is K*M. When performing matrix multiplication, the first matrix×the second matrix is used.

在一种可能的实施方式中,加速器在根据所述第二矩阵中每个非零元素的索引对所述第二矩阵与所述第一矩阵执行乘法操作得到计算结果时,可以根据所述第二矩阵中非零元素的索引确定所述第一矩阵中与所述第二矩阵中的第i行的非零元素进行运算的元素,以对所述第二矩阵与所述第一矩阵执行乘法操作得到计算结果。In one possible implementation, when the accelerator performs a multiplication operation on the second matrix and the first matrix to obtain a calculation result according to the index of each non-zero element in the second matrix, it can determine the elements in the first matrix that are operated on the non-zero elements in the i-th row of the second matrix according to the index of the non-zero elements in the second matrix, so as to perform a multiplication operation on the second matrix and the first matrix to obtain a calculation result.

作为一种示例,第二矩阵中非零元素的索引可以如下编码方式:每个数据块中非零元素的数量为M;所述每个数据块中第k个非零元素的索引值为k-1,用于指示第k个非零元素所属的剪枝向量的索引。As an example, the index of the non-zero elements in the second matrix can be encoded as follows: the number of non-zero elements in each data block is M; the index value of the kth non-zero element in each data block is k-1, which is used to indicate the index of the pruning vector to which the kth non-zero element belongs.

例如，参见图8B所示，为本申请实施例提供的数据块的非零索引的编码方式示意图。图8B中，以稀疏度为50%为例，则K=2，以M=8为例。则针对剪枝向量1和剪枝向量2的非零元素进行二进制编码时，M维度上的第1个非零元素的权重属于剪枝向量1，则索引为0，M维度上的第2个非零元素的权重属于剪枝向量2，则索引为1；M维度上的第3个非零元素的权重属于剪枝向量2，则索引为1，M维度上的第4个非零元素的权重属于剪枝向量1，则索引为0，依次类推。图8B中，以稀疏度为75%为例，则K=4，以M=4为例。则针对剪枝向量1-剪枝向量4的非零元素进行二进制编码时，M维度上的第1个非零元素的权重属于剪枝向量1，则索引为0，二进制编码为0，M维度上的第2个非零元素的权重属于剪枝向量3，则索引为2，二进制编码为10；M维度上的第3个非零元素的权重属于剪枝向量2，则索引为1，二进制编码为01，M维度上的第4个非零元素的权重属于剪枝向量4，则索引为3，二进制编码为11。For example, see FIG. 8B, which is a schematic diagram of the encoding method of the non-zero index of the data block provided in an embodiment of the present application. In FIG. 8B, taking the sparsity of 50% as an example, K=2, and M=8 as an example. When the non-zero elements of pruning vector 1 and pruning vector 2 are binary-encoded, the weight of the first non-zero element in the M dimension belongs to pruning vector 1, and the index is 0; the weight of the second non-zero element in the M dimension belongs to pruning vector 2, and the index is 1; the weight of the third non-zero element in the M dimension belongs to pruning vector 2, and the index is 1; the weight of the fourth non-zero element in the M dimension belongs to pruning vector 1, and the index is 0, and so on. In FIG. 8B, taking the sparsity of 75% as an example, K=4, and M=4 as an example. When binary encoding is performed on the non-zero elements of pruning vector 1 to pruning vector 4, the weight of the first non-zero element in the M dimension belongs to pruning vector 1, the index is 0, and the binary encoding is 0; the weight of the second non-zero element in the M dimension belongs to pruning vector 3, the index is 2, and the binary encoding is 10; the weight of the third non-zero element in the M dimension belongs to pruning vector 2, the index is 1, and the binary encoding is 01; the weight of the fourth non-zero element in the M dimension belongs to pruning vector 4, the index is 3, and the binary encoding is 11.
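
To illustrate the "non-zero weights + index" encoding shown in FIG. 8B, the sketch below encodes one complementary data block: the M non-zero weights are stored compactly in position order, and each index records which of the K pruning vectors the weight at that position belongs to (so an index fits in log2(K) bits, e.g. 1 bit for K=2 and 2 bits for K=4). Names and example values are assumptions for illustration.

```python
import numpy as np

def encode_block(block: np.ndarray):
    """block: (K, M) complementary-sparse data block.
    Returns (values, indices): for each position m, the single non-zero weight
    and the index (0..K-1) of the pruning vector it belongs to."""
    K, M = block.shape
    indices = np.argmax(block != 0, axis=0)   # which pruning vector holds the non-zero at each position
    values = block[indices, np.arange(M)]     # the M non-zero weights, stored compactly
    return values, indices

# 50% sparsity example (K=2, M=4), following the complementary pattern above.
block = np.array([[0.5, 0.0, 0.0, -0.7],
                  [0.0, 1.2, 0.3,  0.0]])
values, indices = encode_block(block)
print(values)    # [ 0.5  1.2  0.3 -0.7]
print(indices)   # [0 1 1 0]
```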

示例性地,结合上述的编码方式,根据所述第二矩阵中非零元素的索引确定所述第一矩阵中与所述第二矩阵中的第i行的非零元素进行运算的元素时,比如第二矩阵中第一非零元素的索引值为g时,则第一矩阵中与所述第一非零元素执行乘法的元素的所在行r满足:r=g*M。Exemplarily, in combination with the above-mentioned encoding method, when determining the elements in the first matrix that are operated on with the non-zero elements in the i-th row of the second matrix according to the indices of the non-zero elements in the second matrix, for example, when the index value of the first non-zero element in the second matrix is g, the row r of the element in the first matrix that is multiplied with the first non-zero element satisfies: r=g*M.

以权重张量的形状为(Cout,Cin,FH,FW)进行说明。示例性地,以权重张量描述为一个卷积核为例。一种示例中,以L=K*M=Cin为例,则处理器可以将一个剪枝后的形状为(Cout,Cin,FH,FW)的卷积核变形为(Cout*FH*FW,Cin)的第二矩阵。待处理数据的形状为(Cin,IW,IH),处理器将该待处理数据变形为(Cin,OH*OW)的矩阵。padding表示填充的零元素。stride表示步幅,用于减少输入参数的数量,减少计算量。填充零元素是为了使得输入的方块都能作为卷积核窗口的中心。然后由加速器执行第一矩阵×第二矩阵的操作。后续描述均以第一矩阵×第二矩阵的操作为例。对于卷积核矩阵(即第二矩阵)的第(i,j)元素wij,比如该wij为非零元素。wij需要与第一矩阵中的j列的对应元素xrj相乘时,第二矩阵中wij的索引值为g时,r=g*M。进而,完成矩阵乘法后,得到的结果矩阵的形状为(Cout*FH*FW,OH*OW)。进一步执行并行矩阵行规约操作,即可以将结果矩阵中每FH*FW个数求和得到一个数,经过并行矩阵行规约操作后,矩阵的形状为(Cout,OW×OH)。然后将该形状为(Cout,OW×OH)的矩阵转换为(Cout,OW,OH),即为得到的计算结果。The shape of the weight tensor is (C out ,C in ,FH,FW) for illustration. Exemplarily, the weight tensor is described as a convolution kernel. In one example, taking L=K*M=C in as an example, the processor can transform a pruned convolution kernel with a shape of (C out ,C in ,FH,FW) into a second matrix of (C out *FH*FW, C in ). The shape of the data to be processed is (C in ,IW,IH), and the processor transforms the data to be processed into a matrix of (C in ,OH*OW). Padding indicates the padding of zero elements. Stride indicates the stride, which is used to reduce the number of input parameters and the amount of calculation. The purpose of padding zero elements is to make all the input blocks serve as the center of the convolution kernel window. The accelerator then performs the operation of the first matrix × the second matrix. The subsequent descriptions all take the operation of the first matrix × the second matrix as an example. For the (i, j)th element w ij of the convolution kernel matrix (i.e., the second matrix), for example, the w ij is a non-zero element. When w ij needs to be multiplied with the corresponding element x rj of the jth column in the first matrix, when the index value of w ij in the second matrix is g, r=g*M. Furthermore, after completing the matrix multiplication, the shape of the result matrix is (C out *FH*FW, OH*OW). Further perform a parallel matrix row reduction operation, that is, the sum of each FH*FW number in the result matrix can be used to obtain a number. After the parallel matrix row reduction operation, the shape of the matrix is (C out , OW×OH). Then, the matrix with the shape of (C out , OW×OH) is converted into (C out , OW, OH), which is the calculation result.

另一种示例中,以L=K*M=Cin*FW*FH为例,则处理器可以将一个剪枝后的形状为(Cout,Cin,FH,FW)的卷积核变形为(Cout,Cin*FH*FW)的第二矩阵。待处理数据的形状为(Cin,IW,IH),处理器将该待处理数据变形为(Cin*FH*FW,OH*OW)的矩阵。 padding表示填充的零元素。stride表示步幅,用于减少输入参数的数量,减少计算量。填充零元素是为了使得输入的方块都能作为卷积核窗口的中心。然后加速器执行第一矩阵×第二矩阵的操作。对于卷积核矩阵(即第二矩阵)的第(i,j)元素wij,比如该wij为非零元素。wij需要与第一矩阵中的j列的对应元素xrj相乘时,第二矩阵中wij的索引值为g时,r=g*M。进而, 完成矩阵乘法后,得到的结果矩阵的形状为(OC,OW×OH)的结果矩阵后,将其转换为(OC,OW,OH),即为得到的计算结果。In another example, taking L=K*M=C in *FW*FH as an example, the processor can transform a pruned convolution kernel of the shape (C out ,C in ,FH,FW) into a second matrix of (C out ,C in *FH*FW). The shape of the data to be processed is (C in ,IW,IH), and the processor transforms the data to be processed into a matrix of (C in *FH*FW,OH*OW). Padding indicates the padding of zero elements. Stride indicates the stride, which is used to reduce the number of input parameters and the amount of calculation. The purpose of padding with zero elements is to make all the input blocks serve as the center of the convolution kernel window. The accelerator then performs the operation of the first matrix × the second matrix. For the (i, j)th element w ij of the convolution kernel matrix (i.e., the second matrix), for example, the w ij is a non-zero element. When w ij needs to be multiplied with the corresponding element x rj of the jth column in the first matrix, when the index value of w ij in the second matrix is g, r = g*M. Furthermore, After completing the matrix multiplication, the shape of the result matrix is (OC, OW×OH), which is converted to (OC, OW, OH), which is the calculation result.

As an example, the structure of the accelerator is described below. Referring to FIG. 9A and FIG. 9B, the accelerator may include a storage circuit 910 and an operation circuit 920; the accelerator may be the hardware accelerator 220 in FIG. 3. The storage circuit 910 may be used to store the first matrix corresponding to the data to be processed, the second matrix corresponding to the weight tensor of the neural network, and the indices of the non-zero elements in the second matrix. The operation circuit 920 is configured to perform a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.

In some scenarios, not all elements of the input first matrix participate in the computation; only the elements that need to be multiplied with non-zero elements of the second matrix do. Therefore, the elements of the first matrix that need to participate in the computation can be selected by a selection circuit (which can also be understood as sparsifying the first matrix). The selection circuit may also be called a selection unit, and the subsequent description refers to it as the selection unit.

In one possible implementation, the selection unit is deployed in the operation circuit. Referring to FIG. 9A, the operation circuit 920 may illustratively include a selection unit 921 and a calculation unit 922. The selection unit 921 reads the non-zero elements of the i-th row of the second matrix from the storage circuit according to the indices of the non-zero elements in the second matrix, and, according to those indices, reads from the storage circuit the elements of the i-th column of the first matrix that are to be operated on with the non-zero elements of the second matrix. The calculation unit 922 is configured to perform multiply-accumulate operations on the non-zero elements of the i-th row of the second matrix and the read elements of the i-th column of the first matrix. It can be understood that, at any given position of the K pruning vectors, the selection unit 921 selects only the single non-zero element for computation, while the remaining elements are all zero. The speedup achieved in hardware is therefore strictly tied to the sparsity: for example, a convolution with 50% sparsity achieves a speedup of 2, a convolution with 75% sparsity achieves a speedup of 4, and so on.
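A minimal software sketch (illustrative only, not the patent's hardware) of this selection-and-multiply-accumulate idea for a single data block is given below: the stored non-zero values are paired with the matching first-matrix elements by index, so only M terms are multiplied and accumulated instead of K*M, giving a speedup of roughly K = 1/(1-S). The addressing rule r = g*M in the text is read here as r = g*M + m for the m-th non-zero, which is an assumption of the sketch, and all names are illustrative.

import numpy as np

K, M = 4, 8                                   # one data block: K pruning vectors of M weights
values = np.random.randn(M)                   # compactly stored non-zero weights of the block
indices = np.random.randint(0, K, size=M)     # index g of the pruning vector owning each value

x_col = np.random.randn(K * M)                # the K*M first-matrix elements this block meets

rows = indices * M + np.arange(M)             # r = g*M (+ position m, assumed)
acc = float(np.dot(values, x_col[rows]))      # multiply-accumulate over M terms only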

In some possible application scenarios, the operation circuit 920 may use one or more arithmetic and logic units (ALUs). For example, the accelerator may be an MCU, a CPU, a GPU, and so on. The ALU is the arithmetic unit of the accelerator's IP core; it is the component that processes data and the most specialized unit in the data path. It implements operations such as addition, subtraction, AND, OR, XOR, NOT, left shift, right shift, and nibble swap. The selection unit 921 may be implemented by a separate ALU or embedded in other ALUs that perform the computations.

In another possible implementation, the selection unit is deployed in the storage circuit. Illustratively, referring to FIG. 9B, the storage circuit 910 includes a storage unit 911 and a selection unit 912. The storage unit 911 stores the first matrix corresponding to the data to be processed, the second matrix corresponding to the weight tensor included in the neural network, and the indices of the non-zero elements in the second matrix. In some scenarios, the storage circuit 910 is also responsible for data reading, and the first matrix can be sparsified during the data-reading process of the storage circuit.

The selection unit 912 reads the non-zero elements of the i-th row of the second matrix from the storage unit 911 according to the indices of the non-zero elements in the second matrix and sends them to the operation circuit; it also reads from the storage unit 911, according to those indices, the elements of the i-th column of the first matrix that are to be operated on with the non-zero elements of the i-th row of the second matrix, and sends those elements to the operation circuit 920. The operation circuit 920 then performs multiply-accumulate operations on the non-zero elements of the i-th row of the second matrix and the corresponding elements of the i-th column of the first matrix.

In some embodiments, taking as an example the case where the selection unit is deployed in the storage circuit 910, that is, the storage circuit 910 includes the selection unit 912, and referring to FIG. 9A, the storage unit 911 in the storage circuit 910 may further include a first data buffer 9111, a second data buffer 9112, and an index information buffer 9113.

The first data buffer 9111 is used to buffer data loaded from the processor's memory 212 into the hardware accelerator; for example, the first data buffer 9111 may buffer the convolution-kernel parameter matrices of each layer of the neural network (including the second matrix). The second matrix may be stored compactly: among the K*M weights that need to be multiplied with the first matrix, the non-zero elements are stored compactly, for example row by row, and the non-zero elements of one row are read at a time.
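As a purely illustrative sketch of this compact storage: with complementary sparsity, each of the M positions of a K×M data block holds exactly one non-zero across the K pruning vectors, so the block can be kept as M values plus M pruning-vector indices; the shapes and names below are assumptions made only for the sketch.

import numpy as np

K, M = 4, 8
block = np.zeros((K, M))
owner = np.random.randint(0, K, size=M)            # which pruning vector is non-zero per position
block[owner, np.arange(M)] = np.random.randn(M)    # a block with the complementary pattern

indices = np.argmax(block != 0, axis=0)            # indices kept in the index information buffer
values = block[indices, np.arange(M)]              # non-zeros stored compactly (row-wise)
assert np.array_equal(indices, owner)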

The second data buffer 9112 is used to buffer data loaded from the memory 212 into the hardware accelerator; for example, when the hardware accelerator is applied to a neural network, the second data buffer 9112 may buffer the input matrix (including the first matrix). The index information buffer 9113 may be used to buffer the indices of the non-zero elements of the second matrix and the indices of the elements of the first matrix. The indices of the non-zero elements of the second matrix may be organized in the manner described above. The index of an element of the first matrix may include a row index (which can be understood as a row address) and a column index (which can be understood as a column address).

The selection unit 912 is configured to select the data at the corresponding positions according to the indices buffered in the index information buffer 9113; for example, it determines, according to the indices of the second matrix buffered in the index information buffer 9113, the elements of the first matrix that need to be multiplied with the non-zero elements of the second matrix.

The operation circuit 920 is the core component of the hardware accelerator, with the various functional units of the operation circuit hardwired inside it. Illustratively, the operation circuit 920 may include a multiplication unit 9201, an accumulation unit 9202, an activation function unit 9203, and a vector calculation unit 9204. The multiplication unit multiplies elements at corresponding positions of an input vector and a matrix, or multiplies an input matrix with a matrix, to obtain intermediate values; the multiplication unit 9201 may be a matrix-vector multiplication unit or a matrix-matrix multiplication unit. The accumulation unit 9202 accumulates the intermediate values computed by the multiplication unit 9201 to obtain the gate vector values. The activation function unit 9203 applies an activation function to the gate vector values obtained by the accumulation unit, and the vector calculation unit 9204 computes over the gate vectors to obtain the final result.
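A small software analogue of this unit pipeline (illustrative only; the function names and the choice of ReLU as the activation are assumptions, not the patent's implementation) is:

import numpy as np

def multiplication_unit(w_row, x_col):
    return w_row * x_col                  # element-wise products (intermediate values)

def accumulation_unit(intermediate):
    return intermediate.sum()             # accumulate the intermediate values

def activation_unit(v):
    return np.maximum(v, 0.0)             # e.g., ReLU as one possible activation function

w_row, x_col = np.random.randn(8), np.random.randn(8)
y = activation_unit(accumulation_unit(multiplication_unit(w_row, x_col)))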

In some possible implementations, the hardware accelerator may further include an input buffer 923 and an output buffer 924 for temporarily storing the input and output data of the hardware accelerator. After the operation circuit 920 completes its computation, it can place the result in the output buffer 924 for output; the output buffer 924 can be written back to the processor in one pass once it is full. The input buffer 923 and the output buffer 924 may be deployed inside or outside the operation circuit 920.

In some embodiments, the hardware accelerator may include a memory access unit 913. The memory access unit 913 may use direct memory access (DMA) and may be used for data transfer between the hardware accelerator and other devices. Taking the case where the hardware accelerator in FIG. 3 adopts the structure shown in FIG. 9A as an example, the software-hardware cooperation framework of the data processing apparatus 200 is shown in FIG. 10A. The memory access unit 913 may be used for data transfer between the hardware accelerator 220 and the memory 212. The memory access unit 913 may be deployed inside or outside the storage circuit 910; FIG. 10A takes deployment outside the storage circuit 910 as an example.

In some embodiments, the memory access unit 913 uses DMA as an example. Each circuit of the hardware accelerator can be equipped with a DMA to enable parallel data reading. In some embodiments, the hardware accelerator may further include a control interconnect circuit for interconnecting the control signal lines with the processing system 210. With reference to FIG. 10A, under the control of the processor 211 in the processing system 210, the DMA is started through the control bus and the data bus, and the DMA transfers the first matrix of the data to be processed, the second matrix corresponding to the weight tensor, and the indices of the non-zero elements in the second matrix from the memory 212 to the storage circuit 910 in the hardware accelerator 220: for example, the second matrix is transferred to the first data buffer 9111, the first matrix to the second data buffer 9112, and the indices of the non-zero elements of the second matrix to the index information buffer 9113. The target result is then obtained through processing by the operation circuit 920. Referring to FIG. 11A, the selection unit 912 reads the non-zero elements of the j-th row of the second matrix from the first data buffer 9111. It can be understood that the non-zero elements of each column are stored compactly by rows in the first data buffer 9111, so the selection unit 912 can read from the first data buffer 9111 the storage row in which row j is located. The selection unit 912 reads the indices of the non-zero elements of the j-th row of the second matrix from the index information buffer 9113 and then, according to those indices, reads from the second data buffer 9112 the elements in the j-th column of the first matrix that are to be multiplied with the non-zero elements of the j-th row of the second matrix. The selection unit 912 can send the read data (including the non-zero elements of the j-th row of the second matrix and the elements of the j-th column of the first matrix that need to be multiplied with them) to the operation circuit 920; for example, the selection unit 912 can output the read data to the input buffer 923 (which may be inside or outside the operation circuit 920). If the input buffer is outside the operation circuit 920, the operation circuit 920 can fetch the read data from the input buffer 923 for computation: for example, the multiplication unit multiplies, one to one, the non-zero elements of the j-th row of the second matrix with the elements of the j-th column of the first matrix that need to be multiplied with them to obtain multiple values, and the accumulation unit 9202 then adds up these values.

In some other embodiments, taking as an example the case where the selection unit is deployed in the operation circuit, that is, the operation circuit 920 includes the selection unit 921, and referring to FIG. 9B, the storage unit 911 in the storage circuit 910 may also include a first data buffer 9111, a second data buffer 9112, and an index information buffer 9113; their functions are as described above and are not repeated here. Illustratively, the calculation unit 922 in the operation circuit 920 may include a multiplication unit 9201, an accumulation unit 9202, an activation function unit 9203, and a vector calculation unit 9204, whose functions are likewise as described above and are not repeated here.

In some possible implementations, the hardware accelerator may further include an input buffer 923 and an output buffer 924 for temporarily storing the input and output data of the hardware accelerator. After the operation circuit 920 completes its computation, it can place the result in the output buffer 924 for output; the output buffer 924 can be written back to the processor in one pass once it is full. The input buffer 923 and the output buffer 924 may be deployed inside or outside the operation circuit 920; FIG. 9B takes deployment outside the operation circuit 920 as an example.

In some embodiments, the hardware accelerator may include a memory access unit 913. The memory access unit 913 may use direct memory access (DMA) and may be used for data transfer between the hardware accelerator and other devices. Taking the case where the hardware accelerator in FIG. 3 adopts the structure shown in FIG. 9B as an example, the software-hardware cooperation framework of the data processing apparatus 200 is shown in FIG. 10B. The memory access unit 913 may be used for data transfer between the hardware accelerator 220 and the memory 212. The memory access unit 913 may be deployed inside or outside the storage circuit 910; FIG. 10B takes deployment outside the storage circuit 910 as an example.

In some embodiments, the memory access unit 913 uses DMA as an example. Each circuit of the hardware accelerator can be equipped with a DMA to enable parallel data reading. In some embodiments, the hardware accelerator may further include a control interconnect circuit (not shown in FIG. 10B) for interconnecting the control signal lines with the processing system 210. With reference to FIG. 10B, under the control of the processor 211 in the processing system 210, the DMA is started through the control bus and the data bus, and the DMA transfers the first matrix of the data to be processed, the second matrix corresponding to the weight tensor, and the indices of the non-zero elements in the second matrix from the memory 212 to the storage circuit 910 in the hardware accelerator 220: for example, the second matrix is transferred to the first data buffer 9111, the first matrix to the second data buffer 9112, and the indices of the non-zero elements of the second matrix to the index information buffer 9113. The target result is then obtained through processing by the operation circuit 920. Referring to FIG. 11B, the selection unit 921 reads the non-zero elements of the j-th row of the second matrix from the first data buffer 9111. It can be understood that the non-zero elements of each column are stored compactly by rows in the first data buffer 9111, so the selection unit 921 can read from the first data buffer 9111 the storage row in which row j is located. The selection unit 921 obtains the indices of the non-zero elements of the j-th row of the second matrix from the index information buffer 9113 and then, according to those indices, reads from the second data buffer 9112 the elements in the j-th column of the first matrix that are to be multiplied with the non-zero elements of the j-th row of the second matrix. For example, the selection unit 921 can read the elements of the j-th column of the first matrix from the second data buffer 9112, where each column of the first matrix is stored row-wise in the second data buffer 9112. The storage circuit 910 can output the storage row in which the j-th column of the first matrix is located, the non-zero elements of the j-th row of the second matrix, and the indices of those non-zero elements to the input buffer 923. The selection unit 921 then obtains, according to the indices of the non-zero elements of the j-th row, the elements of the j-th column of the first matrix that need to be multiplied with the non-zero elements of the j-th row of the second matrix. Based on the selection result of the selection unit 921, the calculation unit 922 can then compute on the selected elements of the j-th column of the first matrix and the non-zero elements of the j-th row: for example, the multiplication unit 9201 multiplies, one to one, the non-zero elements of the j-th row of the second matrix with the elements of the j-th column of the first matrix that need to be multiplied with them to obtain multiple values, and the accumulation unit 9202 then adds up these values.

Based on the above embodiments, FIG. 12 is a schematic structural diagram of the pruning apparatus provided in an embodiment of the present application. The pruning apparatus is used to implement the pruning method shown in FIG. 4 and may include an acquisition unit 1201 and a pruning unit 1202. The acquisition unit may be used to acquire the first weight tensor of the neural network to be pruned, and the pruning unit 1202 performs at least one pruning process on the first weight tensor to obtain the second weight tensor. In some embodiments, the pruning apparatus may further include a training unit 1203. The pruning unit 1202 is specifically configured to deform the input four-dimensional weight tensor into a three-dimensional weight tensor, where the three-dimensional weight tensor includes G data blocks, each data block consists of K pruning vectors, and each pruning vector consists of M weights; to prune the three-dimensional weight tensor along the K dimension, resetting the qi least important weights along the K dimension to zero, so as to obtain the three-dimensional weight tensor after the i-th pruning; and then to deform the three-dimensional weight tensor after the i-th pruning to obtain a deformed four-dimensional weight tensor. The training unit 1203 is used to train the deformed four-dimensional weight tensor to obtain the four-dimensional weight tensor that has undergone the i-th pruning process.
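A minimal sketch (not the patent's implementation) of the pruning step performed by the pruning unit is given below: the 4-D weight tensor is deformed into (G, K, M) blocks, the qi least important weights along the K dimension are zeroed at every position (importance measured here by absolute value, i.e. the L1-norm criterion mentioned in the claims), and the tensor is deformed back. The grouping G = Cout*FH*FW assumes the L = K*M = Cin example, and all layout choices and names are assumptions of the sketch; in the progressive scheme this step would be repeated with increasing sparsity and followed by training, which is omitted here.

import numpy as np

def prune_step(weights, K, M, q_i):
    Cout, Cin, FH, FW = weights.shape
    assert Cin == K * M                                     # the L = K*M = Cin example
    # Deform 4-D -> 3-D: (G, K, M) with G = Cout*FH*FW (assumed grouping)
    w3d = weights.transpose(0, 2, 3, 1).reshape(Cout * FH * FW, K, M).copy()
    order = np.argsort(np.abs(w3d), axis=1)                 # ascending importance along K
    g_idx = np.arange(w3d.shape[0])[:, None, None]
    m_idx = np.arange(M)[None, None, :]
    w3d[g_idx, order[:, :q_i, :], m_idx] = 0.0              # zero the q_i least important weights
    # Deform 3-D -> 4-D
    return w3d.reshape(Cout, FH, FW, Cin).transpose(0, 3, 1, 2)

pruned = prune_step(np.random.randn(4, 8, 3, 3), K=4, M=2, q_i=1)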

The technical effects achieved by the solutions provided in the embodiments of the present application are described below.

In the currently used hardware 2:4 sparsity scheme, two positions in a vector of length 4 are zero, and there is no constraint between those two positions. That scheme is designed for GPUs of specific architectures and is therefore not suitable for general-purpose acceleration hardware structures such as general-purpose GPUs or CPUs. The solution provided in the embodiments of the present application can be applied to general-purpose parallel computing hardware, and acceleration can also be obtained on dedicated hardware structures. The embodiments of the present application adopt a complementary-sparsity, progressive pruning approach, which can improve the accuracy of the trained network compared with pruning in a single pass. In addition, the solution provided in the embodiments of the present application performs sparsification along the Cin or Cin*FH*FW dimension, which can improve accuracy compared with the outer vector wise (OVW) sparsity scheme that sparsifies within a single vector; and since the OVW scheme sparsifies along the Cout dimension, the embodiments of the present application can also improve acceleration efficiency compared with that scheme.

In addition, it should be noted that the embodiments of the present application can also be applied to optimization scenarios that use semi-structured sparse matrix multiplication. A 1×1 convolution can also be understood as a matrix multiplication, so any scenario involving sparse matrix multiplication can likewise use the solution provided by the embodiments of the present application.
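As a small illustration of this last point (shapes are arbitrary and not from the patent), a 1×1 convolution collapses to a single matrix product, so the same index-based sparse multiplication applies to it directly:

import numpy as np

Cout, Cin, H, W = 4, 8, 5, 5
w = np.random.randn(Cout, Cin, 1, 1)                  # 1x1 convolution kernel
x = np.random.randn(Cin, H, W)

# The 1x1 convolution is exactly the (Cout, Cin) x (Cin, H*W) matrix product.
y = (w.reshape(Cout, Cin) @ x.reshape(Cin, H * W)).reshape(Cout, H, W)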

It should be understood that the above division of the apparatus into units is merely a division by logical function; in actual implementation, the units may be fully or partially integrated into one physical entity, or may be physically separate.

The above embodiments may be implemented in whole or in part by software, hardware, or a combination thereof. When implemented by a software program, they may be realized in whole or in part in the form of a computer program product. A computer program product includes one or more instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The instructions may be stored in a computer storage medium or transferred from one computer storage medium to another; for example, the instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (for example, coaxial cable, optical fiber, or twisted pair) or wireless (for example, infrared, radio, or microwave) means. The computer storage medium may be any medium accessible to a computer, or a data storage device such as a server or data center integrating one or more media. The medium may be a magnetic medium (for example, a floppy disk, hard disk, magnetic tape, or magneto-optical disk (MO)), an optical medium (for example, an optical disc), or a semiconductor medium (for example, a ROM, EPROM, EEPROM, or solid state disk (SSD)).

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by instructions. These instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

Obviously, those skilled in the art can make various changes and modifications to the present application without departing from the scope of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass these modifications and variations.

Claims (33)

1. A neural network pruning method, comprising: obtaining a first weight tensor of a neural network to be pruned; and performing at least one pruning process on the first weight tensor to obtain a second weight tensor; wherein the second weight tensor includes a plurality of data blocks, each data block consists of K pruning vectors, each pruning vector consists of M weights, and K and M are both positive integers; the weight corresponding to a first position in a first pruning vector included in a first data block is a non-zero element, and the weight corresponding to the first position in pruning vectors other than the first pruning vector in the first data block is a zero element; the first data block is any one of the plurality of data blocks, and the first pruning vector is any pruning vector in the first data block.

2. The method according to claim 1, wherein the sparsity S of the second weight tensor satisfies: S = 1 - 1/K.

3. The method according to claim 1 or 2, wherein the pruning process is performed N times with increasing sparsity.

4. The method according to claim 3, wherein the number of times N that the pruning process is performed is related to the number K of pruning vectors included in a data block and/or the sparsity of the second weight tensor.

5. The method according to claim 4, wherein the number of times the pruning process is performed satisfies: N = S/(1-S), or N = K-1.

6. The method according to any one of claims 3 to 5, wherein N = K-1, and the sparsity of the i-th pruning process satisfies:
wherein Si represents the sparsity of the i-th pruning process, i is a positive integer, and 1 ≤ i < K.
7. The method according to any one of claims 3 to 6, wherein performing the i-th pruning process, where i is less than or equal to N, comprises: deforming the input four-dimensional weight tensor to obtain a deformed three-dimensional weight tensor, wherein the three-dimensional weight tensor includes G data blocks, each data block consists of K pruning vectors, and each pruning vector consists of M weights; pruning the three-dimensional weight tensor along the K dimension, and resetting the qi least important weights along the K dimension to zero, so as to obtain the three-dimensional weight tensor after the i-th pruning; then deforming the three-dimensional weight tensor after the i-th pruning to obtain a deformed four-dimensional weight tensor; and training the deformed four-dimensional weight tensor to obtain the four-dimensional weight tensor that has undergone the i-th pruning process; wherein the input four-dimensional weight tensor is the first weight tensor or the four-dimensional weight tensor that has undergone the (i-1)-th pruning process.

8. The method according to claim 7, wherein the shape of the input four-dimensional weight tensor is (Cout, Cin, FH, FW), and the shape of the deformed three-dimensional weight tensor is (G, K, M), wherein G satisfies:
wherein Cout represents the number of output activations of the neural network, Cin represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; and Cin*FH*FW = K*M, or Cin = K*M.

9. The method according to claim 7 or 8, wherein resetting the qi least important weights along the K dimension to zero comprises: resetting the qi weights with the smallest L1 norm along the K dimension to zero; or resetting the qi weights with the smallest L2 norm along the K dimension to zero.

10. The method according to any one of claims 7 to 9, wherein, when the training process adjusts all of the weights, qi = i; or, when the training process adjusts only the weights that have not been pruned, qi = 1.

11. A data processing method, comprising: acquiring data to be processed and generating a vector representation of the data to be processed; and performing computation on the vector representation of the data to be processed and a weight tensor included in a neural network to obtain a processing result of the data to be processed; wherein the weight tensor of the neural network is obtained through pruning, the weight tensor includes a plurality of data blocks, each data block consists of K pruning vectors, each pruning vector consists of M weights, and K and M are both positive integers; the weight corresponding to a first position in a first pruning vector included in a first data block is a non-zero element, and the weight corresponding to the first position in pruning vectors other than the first pruning vector in the first data block is a zero element; the first data block is any one of the plurality of data blocks, and the first pruning vector is any pruning vector in the first data block.

12. The method according to claim 11, wherein performing computation on the vector representation of the data to be processed and the weight tensor included in the neural network comprises: representing the vector representation of the data to be processed as a first matrix, and processing the weight tensor to obtain a second matrix and indices of the non-zero elements in the second matrix; and performing a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.

13. The method according to claim 12, wherein each row element of the second matrix corresponds to the weights of one data block, and the number of column elements of the first matrix is K*M.
14. The method according to claim 13, wherein performing a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result comprises: determining, according to the indices of the non-zero elements in the second matrix, the elements in the first matrix to be operated on with the non-zero elements of the i-th row of the second matrix, so as to perform the multiplication operation on the second matrix and the first matrix to obtain the calculation result.

15. The method according to claim 14, wherein the number of non-zero elements in each data block is M; the index value of the k-th non-zero element in each data block is k-1 and is used to indicate the index of the pruning vector to which the k-th non-zero element belongs; and when the index value of a first non-zero element in the second matrix is g, the row r of the element in the first matrix that is multiplied with the first non-zero element satisfies: r = g*M.

16. The method according to any one of claims 12 to 15, wherein the shape of the weight tensor is (Cout, Cin, FH, FW), Cout represents the number of output activations of the neural network, Cin represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; the number of column elements of the second matrix is Cout*FH*FW, and the number of row elements of the second matrix is Cin; Cin = K*M, and the number of data blocks included in the weight tensor is

17. The method according to any one of claims 12 to 15, wherein the shape of the weight tensor is (Cout, Cin, FH, FW), Cout represents the number of output activations of the neural network, Cin represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; the number of column elements of the second matrix is Cout, and the number of row elements of the second matrix is Cin*FH*FW; Cin*FH*FW = K*M, and the number of data blocks included in the weight tensor is

18. An accelerator, comprising: a storage circuit, configured to store a first matrix corresponding to data to be processed, a second matrix corresponding to a weight tensor of a neural network, and indices of non-zero elements in the second matrix; wherein the weight tensor of the neural network is obtained through pruning, the weight tensor includes a plurality of data blocks, each data block consists of K pruning vectors, each pruning vector consists of M weights, and K and M are both positive integers; the weight corresponding to a first position in a first pruning vector included in a first data block is a non-zero element, and the weight corresponding to the first position in pruning vectors other than the first pruning vector in the first data block is a zero element; the first data block is any one of the plurality of data blocks, and the first pruning vector is any pruning vector in the first data block; and an operation circuit, configured to perform a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.

19. The accelerator according to claim 18, wherein each row element of the second matrix corresponds to the weights of one data block, and the number of column elements of the first matrix is K*M.

20. The accelerator according to claim 19, wherein the operation circuit comprises a selection unit and a calculation unit; the selection unit is configured to read the non-zero elements of the i-th row of the second matrix from the storage circuit according to the indices of the non-zero elements in the second matrix, and to read from the storage circuit, according to those indices, the elements of the i-th column of the first matrix to be operated on with the non-zero elements of the second matrix; and the calculation unit is configured to perform multiply-accumulate operations on the non-zero elements of the i-th row of the second matrix and the read elements of the i-th column of the first matrix.

21. The accelerator according to claim 19, wherein the storage circuit comprises a storage unit and a selection unit, the storage unit being configured to store the first matrix corresponding to the data to be processed, the second matrix corresponding to the weight tensor included in the neural network, and the indices of the non-zero elements in the second matrix; the selection unit is configured to read the non-zero elements of the i-th row of the second matrix from the storage circuit according to the indices of the non-zero elements in the second matrix and to send them to the operation circuit, and to read from the storage circuit, according to those indices, the elements of the i-th column of the first matrix to be operated on with the non-zero elements of the i-th row of the second matrix and to send those elements to the operation circuit; and the operation circuit is specifically configured to perform multiply-accumulate operations on the non-zero elements of the i-th row of the second matrix and the elements of the i-th column of the first matrix to be operated on with them.

22. The accelerator according to claim 20 or 21, wherein the number of non-zero elements in each data block is M; the index value of the k-th non-zero element in each data block is k-1 and is used to indicate the index of the pruning vector to which the k-th non-zero element belongs; and when the index value of a first non-zero element in the second matrix is g, the row r of the element in the first matrix that is multiplied with the first non-zero element satisfies: r = g*M.

23. The accelerator according to any one of claims 18 to 22, wherein the shape of the weight tensor is (Cout, Cin, FH, FW), Cout represents the number of output activations of the neural network, Cin represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; the number of column elements of the second matrix is Cout*FH*FW, and the number of row elements of the second matrix is Cin; Cin = K*M, and the number of data blocks included in the weight tensor is

24. The accelerator according to any one of claims 18 to 22, wherein the shape of the weight tensor is (Cout, Cin, FH, FW), Cout represents the number of output activations of the neural network, Cin represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; the number of column elements of the second matrix is Cout, and the number of row elements of the second matrix is Cin*FH*FW; Cin*FH*FW = K*M, and the number of data blocks included in the weight tensor is

25. A processing system, comprising: a processor, configured to acquire data to be processed and generate a vector representation of the data to be processed; and an accelerator, configured to perform computation on the vector representation of the data to be processed and a weight tensor included in a neural network to obtain a processing result of the data to be processed; wherein the weight tensor of the neural network is obtained through pruning, the weight tensor includes a plurality of data blocks, each data block consists of K pruning vectors, each pruning vector consists of M weights, and K and M are both positive integers; the weight corresponding to a first position in a first pruning vector included in a first data block is a non-zero element, and the weight corresponding to the first position in pruning vectors other than the first pruning vector in the first data block is a zero element; the first data block is any one of the plurality of data blocks, and the first pruning vector is any pruning vector in the first data block.
26. The system according to claim 25, wherein the processor is further configured to process the vector representation of the data to be processed into a first matrix, to process the weight tensor to obtain a second matrix and indices of the non-zero elements in the second matrix, and to send the first matrix, the second matrix, and the indices of the non-zero elements in the second matrix to the accelerator; and the accelerator is specifically configured to perform a multiplication operation on the second matrix and the first matrix according to the index of each non-zero element in the second matrix to obtain a calculation result.

27. The system according to claim 26, wherein each row element of the second matrix corresponds to the weights of one data block, and the number of column elements of the first matrix is K*M.

28. The system according to claim 27, wherein the accelerator is specifically configured to determine, according to the indices of the non-zero elements in the second matrix, the elements in the first matrix to be operated on with the non-zero elements of the i-th row of the second matrix, so as to perform the multiplication operation on the second matrix and the first matrix to obtain the calculation result.

29. The system according to claim 28, wherein the number of non-zero elements in each data block is M; the index value of the k-th non-zero element in each data block is k-1 and is used to indicate the index of the pruning vector to which the k-th non-zero element belongs; and when the index value of a first non-zero element in the second matrix is g, the row r of the element in the first matrix that is multiplied with the first non-zero element satisfies: r = g*M.

30. The system according to any one of claims 26 to 29, wherein the shape of the weight tensor is (Cout, Cin, FH, FW), Cout represents the number of output activations of the neural network, Cin represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; the number of column elements of the second matrix is Cout*FH*FW, and the number of row elements of the second matrix is Cin; Cin = K*M, and the number of data blocks included in the weight tensor is

31. The system according to any one of claims 26 to 29, wherein the shape of the weight tensor is (Cout, Cin, FH, FW), Cout represents the number of output activations of the neural network, Cin represents the number of input activations of the neural network, FH represents the kernel height of the neural network, and FW represents the kernel width of the neural network; the number of column elements of the second matrix is Cout, and the number of row elements of the second matrix is Cin*FH*FW; Cin*FH*FW = K*M, and the number of data blocks included in the weight tensor is

32. A pruning apparatus, comprising: a memory, configured to store program instructions; and a processor, coupled to the memory and configured to invoke the program instructions in the memory to perform the method according to any one of claims 1 to 10 or the method according to any one of claims 11 to 17.

33. A computer storage medium, wherein the computer storage medium stores a computer program, and when the computer program is executed by a computer, the computer is caused to perform the method according to any one of claims 1 to 10 or the method according to any one of claims 11 to 17.
PCT/CN2024/074982 2023-04-25 2024-01-31 Neural network pruning method and apparatus, and data processing method and apparatus Pending WO2024222115A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310470198.9A CN118839740A (en) 2023-04-25 2023-04-25 Pruning method, data processing method and device of neural network
CN202310470198.9 2023-04-25

Publications (1)

Publication Number Publication Date
WO2024222115A1 true WO2024222115A1 (en) 2024-10-31

Family

ID=93144552

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/074982 Pending WO2024222115A1 (en) 2023-04-25 2024-01-31 Neural network pruning method and apparatus, and data processing method and apparatus

Country Status (2)

Country Link
CN (1) CN118839740A (en)
WO (1) WO2024222115A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11113601B1 (en) * 2020-06-30 2021-09-07 Moffett Technologies Co., Limited Method and system for balanced-weight sparse convolution processing
US20220207374A1 (en) * 2020-12-24 2022-06-30 Zhejiang University Mixed-granularity-based joint sparse method for neural network
CN113850365A (en) * 2021-07-28 2021-12-28 浙江大华技术股份有限公司 Method, device, equipment and storage medium for compressing and transplanting convolutional neural network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHEN CHUN-FU; OH JINWOOK; FAN QUANFU; PISTOIA MARCO: "SC-Conv: Sparse-Complementary Convolution for Efficient Model Utilization on CNNs", 2018 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), IEEE, 10 December 2018 (2018-12-10), pages 97 - 100, XP033494077, DOI: 10.1109/ISM.2018.00024 *
HUNTER KEVIN, SPRACKLEN LAWRENCE, AHMAD SUBUTAI: "Two sparsities are better than one: unlocking the performance benefits of sparse–sparse networks", NEUROMORPHIC COMPUTING AND ENGINEERING, vol. 2, no. 3, 27 December 2021 (2021-12-27), pages 1 - 32, XP093230086, ISSN: 2634-4386, DOI: 10.1088/2634-4386/ac7c8a *
LIU HUI-DONG: "Chunked Compression Learning Algorithm", JOURNAL OF CHINESE COMPUTER SYSTEMS, vol. 44, no. 2, 16 November 2021 (2021-11-16), pages 269 - 274, XP093230089, ISSN: 1000-1220, DOI: 10.20009/j.cnki.21-1106/TP.2021-0454 *
OZEN ELBRUZ; ORAILOGLU ALEX: "Evolving Complementary Sparsity Patterns for Hardware-Friendly Inference of Sparse DNNs", 2021 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER AIDED DESIGN (ICCAD), IEEE, 1 November 2021 (2021-11-01), pages 1 - 8, XP034057495, DOI: 10.1109/ICCAD51958.2021.9643452 *
ZHAO KANG, TAN YIJUN, HAN KAI, HU TING, CHEN HANTING, YUAN TAO, WANG YUNHE, YAO JUN: "Complementary Sparsity: Accelerating Sparse CNNs with High Accuracy on General-Purpose Computing Platforms", TRANSACTIONS ON MACHINE LEARNING RESEARCH, 1 January 2023 (2023-01-01), pages 1 - 17, XP093230075 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119166948A (en) * 2024-11-15 2024-12-20 之江实验室 A method and device for adaptive distribution of DW type operator data in a multi-core environment

Also Published As

Publication number Publication date
CN118839740A (en) 2024-10-25

Similar Documents

Publication Publication Date Title
CN114255361B (en) Neural network model training method, image processing method and device
CN112561027B (en) Neural network architecture search method, image processing method, device and storage medium
JP6946572B2 (en) Accelerated quantized multiply-accumulate operation
CN110175671B (en) Neural network construction method, image processing method and device
CN112529146B (en) Methods and devices for neural network model training
CN112215332B (en) Searching method, image processing method and device for neural network structure
US11403516B2 (en) Apparatus and method for processing convolution operation of neural network
US20230281973A1 (en) Neural network model training method, image processing method, and apparatus
US20180260709A1 (en) Calculating device and method for a sparsely connected artificial neural network
CN112418392A (en) Neural network construction method and device
US20210224640A1 (en) Neural network circuit device, neural network processingmethod, and neural network execution program
CN113449859A (en) Data processing method and device
WO2022067508A1 (en) Neural network accelerator, and acceleration method and device
CN110929854B (en) A data processing method, device and hardware accelerator
WO2023010244A1 (en) Neural network accelerator, and data processing method for neural network accelerator
CN116246110A (en) Image classification method based on improved capsule network
CN116050469A (en) AI model processing method, calculation method and device
CN115146757A (en) Training method and device of neural network model
CN115601692A (en) Data processing method, training method and device of neural network model
CN114698395A (en) Quantification method and device of neural network model, and data processing method and device
EP4394656A1 (en) Method for optimizing neural network model, and related device
CN115860100A (en) A neural network model training method, device and computing equipment
WO2024222115A1 (en) Neural network pruning method and apparatus, and data processing method and apparatus
WO2022227024A1 (en) Operational method and apparatus for neural network model and training method and apparatus for neural network model
Liu et al. A concise but high-performing network for image guided depth completion in autonomous driving

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24795504

Country of ref document: EP

Kind code of ref document: A1