CN117787357A

CN117787357A - Computer vision-oriented post-training model sparse method and device

Info

Publication number: CN117787357A
Application number: CN202311839020.3A
Authority: CN
Inventors: 刘祥龙; 龚睿昊; 王梓宁; 郭晋阳
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2023-12-28
Filing date: 2023-12-28
Publication date: 2024-03-29

Abstract

The invention discloses a post-training model sparsification method and device for computer vision. The method includes the following steps: introducing a control loss to achieve the target sparsity rate; using the reconstruction loss of the weights as supervision to reduce the difference between the sparse output and the original output; adding the control loss and the reconstruction loss to obtain the overall optimization target loss; using the target loss to determine, for each layer, the reconstructed weights and the sparsity rate best suited to the target sparsity rate; and converting the function that determines the reconstructed weights and sparsity rates into a differentiable function, so that sparsity rate allocation can be optimized differentiably. The method builds a bridge between the sparsity threshold and the sparsity rate from the perspective of probability density, and uses kernel density estimation to construct a differentiable relationship between the two. It can therefore learn the optimal sparsity rate allocation under a global constraint without damaging the original weights and while preserving visual features to the greatest extent, and quickly obtain the required sparse neural network.

Description

Post-training model sparsification method and device for computer vision

Technical field

The invention relates to a post-training model sparsification method for computer vision and to a corresponding post-training model sparsification device, and belongs to the technical field of computer vision.

Background

Deep neural networks (DNNs) have achieved remarkable success in fields such as computer vision, natural language processing, and information retrieval. When they are deployed on resource-limited edge devices, however, reducing their memory footprint and improving their energy efficiency become key issues, and various compression techniques have been proposed to make models more efficient. Among these techniques, sparse training reduces model parameters by imposing sparsity constraints during training. Post-training sparsity (PTS) additionally eliminates the high training cost, but usually suffers a marked drop in model accuracy because it neglects a reasonable allocation of sparsity rates across layers. Existing methods for solving sparsity rates mainly target training-aware scenarios and often fail to converge stably in the post-training setting, where data is limited and the training budget is small. Most methods that restore the accuracy of a sparse model need long retraining on large amounts of data; with a large training set this usually takes hours or even days, which hinders large-scale application across industries.

In the prior art, POT (Post-training via layer-wise calibration) sparsifies a neural network in a post-training manner without training labels, reconstructing the sparse weights layer by layer to improve accuracy, which lowers the cost of producing a sparse model. However, to reach performance comparable to the original neural network model, its overall sparsity rate can be at most 50%; as sparsity increases further, model accuracy collapses quickly. In addition, there are retraining-based methods: the experience-based ERK (Erdős-Rényi-Kernel) and LAMP (Layer-Adaptive Sparsity for the Magnitude-based Pruning), and the learning-based STR (Soft Threshold Weight Reparameterization for Learnable Sparsity) and LTP (Learnable Threshold Pruning). Although these methods produce a sparsity rate for each layer, they all run into problems in the post-training sparsity setting. The experience-based ERK and LAMP rely heavily on hand-designed prior knowledge and cannot guarantee an optimal solution. The learning-based STR and LTP either destroy the original weight distribution or require end-to-end training to converge. As a result, none of them can allocate sparsity efficiently and accurately after training, and they may even require careful hyperparameter tuning and repeated experiments to hit the target sparsity rate, which further raises the cost of producing a sparse model.

Chinese invention application No. 202211358758.3 discloses a pedestrian detection method based on channel-pruned YOLOv5s, including the following steps: unfusing the network fusion layers, sparse training of the model, model pruning, fine-tuning, and obtaining the target model. This reduces the number of neural network parameters while accelerating inference, so that the YOLOv5 algorithm achieves accelerated inference on platforms with low computing power.

Summary of the invention

The primary technical problem to be solved by the present invention is to provide a post-training model sparsification method for computer vision.

Another technical problem to be solved by the present invention is to provide a post-training model sparsification device for computer vision.

To achieve the above objects, the present invention adopts the following technical solutions:

According to a first aspect of the embodiments of the present invention, a post-training model sparsification method for computer vision is provided, including the following steps:

(1) Introduce a control loss to achieve the target sparsity rate. The control loss is:

where r_i and N_i denote the sparsity rate and the number of elements of the i-th layer's weights, respectively; r₀ is the global sparsity rate target; and L_c is the control loss;

(2) Use the reconstruction loss of the weights as supervision to reduce the difference between the sparse output and the original output. The reconstruction loss is:

L_rec = D_KL(Y_dense || Y_sparse)

where L_rec is the reconstruction loss; D_KL(·) denotes the Kullback-Leibler divergence; Y_dense is the output of the original neural network model; and Y_sparse is the output of the sparsified neural network model;

(3) Add the control loss and the reconstruction loss to obtain the overall optimization target loss:

L = L_rec + L_c

where L is the target loss;

(4) Use the target loss to determine, for each layer, the reconstructed weights and the sparsity rate best suited to the target sparsity rate:

W, r_i = arg min_{W, r_i} L

where arg min returns the arguments that minimize the target loss; W denotes the reconstructed weights; and r_i is the sparsity rate of the i-th layer's weights;

(5) Convert the function that determines each layer's reconstructed weights and sparsity rate into a differentiable function, so that sparsity rate allocation can be optimized differentiably.

Preferably, in step (2):

the sparse model output Y_sparse is calculated by the following formula:

Y_sparse = f(X, M⊙W)

where f denotes the neural network model, X is the input of the neural network model, M is the mask, and W is the weights of the neural network model.

Preferably, the mask M is generated in a magnitude-based manner:

M = 0.5*sgn(|W| - t) + 0.5

where sgn denotes the sign function, which returns +1 for a positive value and -1 otherwise; t is the sparsity threshold; and W is the weights of the neural network model.

Preferably, in step (5):

the derivative of the target loss L with respect to the sparsity rate r_l of layer l is:

∂L/∂r_l = ∂L_rec/∂r_l + ∂L_c/∂r_l

where L is the target loss, L_rec is the reconstruction loss, L_c is the control loss, and r_l is the sparsity rate of layer l.

Preferably, the derivative of the control loss L_c with respect to the sparsity rate r_l of layer l is:

where r_i and N_i denote the sparsity rate and the number of elements of the i-th layer's weights, respectively, and r_l and N_l denote the sparsity rate and the number of elements of the l-th layer's weights, respectively.

Preferably, the derivative of the reconstruction loss L_rec with respect to the sparsity rate r_l of layer l is:

∂L_rec/∂r_l = (∂L_rec/∂M_l)·(∂M_l/∂t_l)·(∂t_l/∂r_l)

where M_l is the mask of layer l, and t_l is the sparsity threshold of layer l.

Preferably, in step (5):

the sparsity rate r_l of layer l satisfies:

r_l = g^-1(t_l) = ∫_{-t_l}^{t_l} p(w) dw

where W_l is the weight distribution of layer l, p(w) is its probability density function, and w is a weight sampled from this distribution; g^-1 is the inverse of the differentiable function g, with t_l = g(r_l); and t_l is the sparsity threshold of layer l.

Preferably, the derivative of the sparsity rate r_l of layer l with respect to the sparsity threshold t_l is:

∂r_l/∂t_l = p(t_l) + p(-t_l)

where p(t_l) and p(-t_l) are the values of the probability density function p at t_l and -t_l; r_l is the sparsity rate of layer l; and t_l is the sparsity threshold of layer l.

Preferably, p(w) is estimated by kernel density estimation as follows:

p(w) = (1/n) Σ_{i=1}^{n} K_h(w - w_i) = (1/(n·h)) Σ_{i=1}^{n} K((w - w_i)/h)

where p(w) is the probability density function of the weights W_l; K is a non-negative kernel function; n is the number of sampling points; h > 0 is a smoothing parameter called the bandwidth; the kernel with subscript h is called the scaled kernel and is defined as K_h(x) = (1/h)·K(x/h); w is a weight sampled from the weight distribution W_l of layer l; and w_i is the i-th sampling point.

According to a second aspect of the embodiments of the present invention, a post-training model sparsification device for computer vision is provided, including a processor and a memory coupled to each other, wherein:

the memory is used to store a computer program;

the processor is used to run the computer program stored in the memory and to execute the above post-training model sparsification method for computer vision.

Compared with the prior art, the post-training model sparsification method provided by the embodiments of the present invention defines a controllable sparsity rate reconstruction objective, builds a bridge between the sparsity threshold and the sparsity rate from the perspective of probability density, and uses kernel density estimation to construct a differentiable relationship between the two. It can therefore learn the optimal sparsity rate allocation under a global constraint without damaging the original weights and while preserving visual features to the greatest extent, and quickly obtain the required sparse neural network. The method thus offers controllable sparsity rate allocation, high accuracy, and high efficiency.

Description of the drawings

Figure 1 is a flow chart of the post-training model sparsification method for computer vision in an embodiment of the present invention;

Figure 2 is a logical framework diagram of the post-training model sparsification method for computer vision in an embodiment of the present invention;

Figure 3 is a schematic diagram of comparative test results for accuracy with a learnable versus a non-learnable sparsity rate in an embodiment of the present invention;

Figure 4 is a schematic structural diagram of the post-training model sparsification device in an embodiment of the present invention.

Detailed description

The technical content of the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

As shown in Figure 1, the fast and controllable post-training sparsity method (abbreviated FCPTS, the same below) provided by an embodiment of the present invention includes at least the following steps:

S1: Introduce a control loss to achieve the target sparsity rate. The control loss L_c is:

where r_i and N_i denote the sparsity rate and the number of elements of the i-th layer's weights, respectively; r₀ is the global sparsity rate target; and L_c is the loss that controls the global sparsity rate.

The control loss L_c drives the model to the target sparsity rate r₀ without complex hyperparameter tuning.
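As a minimal sketch of how a control term can tie per-layer sparsity rates to one global target, the Python code below assumes a squared gap between the element-weighted average sparsity Σ r_i·N_i / Σ N_i and the target r₀. This specific form, and the function names `global_sparsity`, `control_loss` and `control_loss_grad`, are illustrative assumptions and may differ from formula (1).

```python
import numpy as np

def global_sparsity(rates, sizes):
    """Element-weighted average sparsity: sum(r_i * N_i) / sum(N_i)."""
    rates = np.asarray(rates, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    return float(np.sum(rates * sizes) / np.sum(sizes))

def control_loss(rates, sizes, r0):
    """ASSUMED form: squared gap between the global sparsity and the target r0."""
    return (global_sparsity(rates, sizes) - r0) ** 2

def control_loss_grad(rates, sizes, r0):
    """Gradient of the assumed loss with respect to each layer rate r_l."""
    sizes = np.asarray(sizes, dtype=float)
    gap = global_sparsity(rates, sizes) - r0
    return 2.0 * gap * sizes / np.sum(sizes)

# Three hypothetical layers with different sizes and sparsity rates.
rates = [0.5, 0.8, 0.9]
sizes = [1000, 4000, 5000]
print(global_sparsity(rates, sizes))          # global rate of the whole model
print(control_loss(rates, sizes, r0=0.72))    # penalty for missing the target
print(control_loss_grad(rates, sizes, r0=0.72))
```

Under this assumed form, larger layers receive proportionally larger gradients, because they contribute more elements to the global sparsity rate.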

S2: Use the reconstruction loss of the weights as supervision to reduce the difference between the sparse output and the original output, so as to retain the image features learned by the original neural network model. The reconstruction loss L_rec is:

L_rec = D_KL(Y_dense || Y_sparse) (2)

where D_KL(·) denotes the Kullback-Leibler divergence; Y_dense is the output of the original neural network model; and Y_sparse is the output of the sparsified neural network model.

Under the supervision of the reconstruction loss L_rec, weight optimization is guided in a direction that makes the sparse output closely resemble the original output, which yields higher accuracy. Moreover, the whole reconstruction starts from well-trained original weights, so the process is fast and converges easily.
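The reconstruction loss of formula (2) can be sketched in Python as follows. The sketch assumes that Y_dense and Y_sparse are classification logits turned into probabilities by a softmax and that the divergence is averaged over the batch; the text does not fix these details, and the names `softmax` and `reconstruction_loss` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reconstruction_loss(logits_dense, logits_sparse, eps=1e-12):
    """Batch mean of D_KL(Y_dense || Y_sparse), with outputs assumed to be logits."""
    p = softmax(np.asarray(logits_dense, dtype=float))
    q = softmax(np.asarray(logits_sparse, dtype=float))
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())

dense = np.array([[2.0, 0.0, -1.0]])
same = reconstruction_loss(dense, dense)                              # identical outputs
different = reconstruction_loss(dense, np.array([[0.0, 2.0, -1.0]]))  # swapped top classes
print(same, different)
```

The loss vanishes when the sparse model reproduces the dense output exactly and grows as the two output distributions diverge, which is what makes it usable as label-free supervision.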

The sparse model output Y_sparse can be calculated as follows:

Y_sparse = f(X, M⊙W) (3)

where f denotes the neural network model; X is the input of the neural network model, which in computer vision tasks is generally a three-dimensional image; M is the mask; and W is the weights of the neural network model.

The mask M can be generated in a magnitude-based manner:

M = 0.5*sgn(|W| - t) + 0.5 (4)

where sgn denotes the sign function, which returns +1 for a positive value and -1 otherwise; and t is the sparsity threshold.
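Formulas (3) and (4) can be illustrated with a short Python sketch. The single linear layer used for f and the names `magnitude_mask` and `sparse_linear` are assumptions made for illustration; a real model f would be a full network.

```python
import numpy as np

def magnitude_mask(weights, t):
    """Formula (4): M = 0.5 * sgn(|W| - t) + 0.5, with sgn(x) = +1 if x > 0, else -1."""
    w = np.asarray(weights, dtype=float)
    sgn = np.where(np.abs(w) - t > 0, 1.0, -1.0)
    return 0.5 * sgn + 0.5

def sparse_linear(x, weights, t):
    """Formula (3) for an illustrative single linear layer f(X, W) = X @ W."""
    return x @ (magnitude_mask(weights, t) * weights)

W = np.array([[0.05, -0.8],
              [0.30, -0.1]])
M = magnitude_mask(W, t=0.2)          # keeps only weights with |w| > 0.2
sparsity = 1.0 - M.mean()             # fraction of weights set to zero
y = sparse_linear(np.array([[1.0, 1.0]]), W, t=0.2)
print(M, sparsity, y)
```

Note that `np.sign` is deliberately not used: it returns 0 at zero, whereas the sign function defined in the text returns -1 for every non-positive argument, so a weight whose magnitude equals the threshold is pruned.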

S3: Add the control loss L_c and the reconstruction loss L_rec to obtain the overall optimization target loss L:

L = L_rec + L_c (5)

S4: Use the target loss L to determine, for each layer, the reconstructed weights W and the sparsity rate r_i best suited to the target sparsity rate r₀:

W, r_i = arg min_{W, r_i} L (6)

where arg min returns the arguments that minimize the target loss.

S5: Convert the function that determines each layer's reconstructed weights W and sparsity rate r_i into a differentiable function, so that sparsity rate allocation can be optimized differentiably.

After the optimization objective is defined in step S4, the arg min problem that determines each layer's reconstructed weights W and sparsity rate r_i is not itself differentiable and has to be converted into a differentiable form. In one embodiment of the present invention, a bridge function is proposed that computes the sparsity rate r from the sparsity threshold t. The derivative of this bridge function can be obtained through kernel density estimation, which turns the original, more complex optimization problem into an optimization problem over the sparsity threshold t. Because the transformed objective is differentiable, sparsity rate allocation can be optimized differentiably: learning the optimal sparsity threshold t of each layer yields the sparsity rate r of each layer. The conversion is described in detail below.

Given a specific layer l, the derivative of the target loss L with respect to the sparsity rate r_l of layer l is:

∂L/∂r_l = ∂L_rec/∂r_l + ∂L_c/∂r_l (7)

where r_l is the sparsity rate of layer l, L is the target loss, L_rec is the reconstruction loss, and L_c is the control loss.

According to formula (1), the latter term in formula (7) is:

The former term in formula (7) is transformed as follows:

∂L_rec/∂r_l = (∂L_rec/∂M_l)·(∂M_l/∂t_l)·(∂t_l/∂r_l) (9)

According to formula (2), formula (3) and formula (4), the first and second factors in formula (9), namely ∂L_rec/∂M_l and ∂M_l/∂t_l, are easy to compute. For the third factor, ∂t_l/∂r_l, a differentiable function g satisfying t_l = g(r_l) is needed before formula (9) can be evaluated. Although g itself is difficult to construct, its inverse g^-1 can be formalized through a differentiable estimate. The learning of the sparsity rate r is then transferred to the sparsity threshold t. The specific process is as follows:

Given the weight distribution W_l of layer l and its probability density function p(w), with weights w sampled from this distribution, the sparsity rate r_l can be calculated as:

r_l = g^-1(t_l) = ∫_{-t_l}^{t_l} p(w) dw (10)

where p(w) is the probability density function of the weights W_l; g^-1 is the inverse of the differentiable function g; and t_l is the sparsity threshold of layer l.

Therefore, the derivative of the sparsity rate r_l with respect to the sparsity threshold t_l can be written as:

∂r_l/∂t_l = p(t_l) + p(-t_l) (11)

where p(t_l) and p(-t_l) are the values of the probability density function p at t_l and -t_l; r_l is the sparsity rate of layer l; and t_l is the sparsity threshold of layer l.
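The relationship between the bridge function and its derivative can be checked numerically. The sketch below assumes, purely for illustration, that the layer weights follow a zero-mean Gaussian with standard deviation sigma, for which the integral of p(w) over [-t, t] has the closed form erf(t / (sigma·√2)); the function names are hypothetical.

```python
import math

def normal_pdf(w, sigma=1.0):
    """Density of a zero-mean Gaussian, standing in for p(w)."""
    return math.exp(-0.5 * (w / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def sparsity_rate(t, sigma=1.0):
    """Bridge function: probability mass of p(w) on [-t, t] (closed form for a Gaussian)."""
    return math.erf(t / (sigma * math.sqrt(2.0)))

def rate_derivative(t, sigma=1.0):
    """Derivative of the bridge function: dr/dt = p(t) + p(-t)."""
    return normal_pdf(t, sigma) + normal_pdf(-t, sigma)

# Compare the analytic derivative with a central finite difference.
t, eps = 0.7, 1e-6
numeric = (sparsity_rate(t + eps) - sparsity_rate(t - eps)) / (2 * eps)
analytic = rate_derivative(t)
print(numeric, analytic)
```

The two values agree to within finite-difference error, which is the Leibniz rule applied to an integral whose limits are -t and t.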

As formula (11) shows, the key step in computing the derivative is modeling the probability density function p(w). To this end, the embodiment of the present invention uses kernel density estimation (KDE) to estimate p(w) as follows:

p(w) = (1/n) Σ_{i=1}^{n} K_h(w - w_i) = (1/(n·h)) Σ_{i=1}^{n} K((w - w_i)/h) (12)

where K is a non-negative kernel function; n is the number of sampling points; h > 0 is a smoothing parameter called the bandwidth; the kernel with subscript h is called the scaled kernel and is defined as K_h(x) = (1/h)·K(x/h); w is a weight sampled from the weight distribution W_l of layer l; and w_i is the i-th sampling point.

Balancing the bias of the estimator against its variance, the usual settings are n = 100, h = 0.5, and K(x) = Φ(x), where Φ is the standard normal density function. With the KDE estimate, the bridge function is differentiable.
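A Gaussian-kernel density estimate with the stated settings (n = 100, h = 0.5) can be sketched in Python as follows. The synthetic layer weights, the random seed, and the names `gaussian_kernel` and `kde_pdf` are assumptions for illustration only.

```python
import numpy as np

def gaussian_kernel(x):
    """Standard normal density, used as the kernel K."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def kde_pdf(w, samples, h=0.5):
    """Kernel density estimate: (1 / (n * h)) * sum_i K((w - w_i) / h)."""
    w = np.atleast_1d(np.asarray(w, dtype=float))
    samples = np.asarray(samples, dtype=float)
    u = (w[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (samples.size * h)

rng = np.random.default_rng(0)
layer_weights = rng.normal(0.0, 1.0, size=10_000)              # stand-in for one layer's weights
samples = rng.choice(layer_weights, size=100, replace=False)   # n = 100 sampling points

# Sanity check: the estimate is non-negative and integrates to about 1.
grid = np.linspace(-8.0, 8.0, 3201)
density = kde_pdf(grid, samples)
area = float(np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(grid)))  # trapezoid rule
print(area)
```

Because the estimate is a finite sum of smooth kernels, it is itself smooth in w, which is the property the derivative p(t_l) + p(-t_l) relies on.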

With the bridge function and the differentiable estimate above, the sparsity rate r can be learned and optimized through an intermediate variable, the sparsity threshold t:

According to formula (8), formula (11) and formula (12), the derivative of the control loss L_c in formula (13) with respect to the sparsity threshold t_l can be calculated.
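The chain described here, the control-loss derivative with respect to r_l multiplied by the KDE-based derivative of r_l with respect to t_l, can be sketched as a toy threshold-learning loop. Several things are assumptions for illustration: the squared form of the control loss, the omission of the reconstruction loss, the synthetic Gaussian layers, the learning rate, and the step count. The function `learn_thresholds` is hypothetical.

```python
import numpy as np

def kde_pdf(w, samples, h=0.5):
    """Gaussian-kernel density estimate at a scalar w."""
    u = (w - samples) / h
    return float(np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)) / h)

def learn_thresholds(layers, r0, lr=2.0, steps=300, n=100, seed=0):
    """Gradient descent on per-layer thresholds t_l using only an assumed control loss."""
    rng = np.random.default_rng(seed)
    samples = [rng.choice(w, size=n, replace=False) for w in layers]
    sizes = np.array([w.size for w in layers], dtype=float)
    t = np.full(len(layers), 0.1)
    for _ in range(steps):
        # Exact per-layer sparsity: fraction of weights with |w| <= t_l.
        rates = np.array([np.mean(np.abs(w) <= ti) for w, ti in zip(layers, t)])
        gap = np.sum(rates * sizes) / sizes.sum() - r0
        dLc_dr = 2.0 * gap * sizes / sizes.sum()             # assumed squared control loss
        dr_dt = np.array([kde_pdf(ti, s) + kde_pdf(-ti, s)   # p(t_l) + p(-t_l) via KDE
                          for ti, s in zip(t, samples)])
        t = np.maximum(t - lr * dLc_dr * dr_dt, 0.0)         # chain rule, then a descent step
    rates = np.array([np.mean(np.abs(w) <= ti) for w, ti in zip(layers, t)])
    return t, rates, float(np.sum(rates * sizes) / sizes.sum())

rng = np.random.default_rng(1)
layers = [rng.normal(0.0, 1.0, 2000), rng.normal(0.0, 2.0, 6000)]  # two synthetic layers
thresholds, rates, achieved = learn_thresholds(layers, r0=0.7)
print(thresholds, rates, achieved)
```

In this toy setting the thresholds settle where the global sparsity matches the target, while the per-layer rates are free to differ; in the full method the reconstruction loss would additionally shape how the sparsity is distributed across layers.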

For the reconstruction loss L_rec, the derivative with respect to the sparsity threshold t_l can also be calculated directly:

Finally, from the learned sparsity threshold t and the bridge function of formula (10), the exact sparsity rate r can be calculated, which enables flexible and controllable learning of the sparsity allocation.

The post-training model sparsification method for computer vision provided by the embodiment of the present invention has been described in detail above. Its logical framework is shown in Figure 2. A controllable sparsity rate reconstruction objective is defined first, and a differentiable sparsity rate allocation method that does not disturb the trained weights is then designed. Based on these two components, the optimal sparsity rate allocation can be learned easily. Thanks to the controllable and differentiable sparsity allocation, the method performs well in post-training scenarios and needs only a single network reconstruction to reach the target sparsity rate, which makes it highly efficient. Typically, a sparse neural network can be generated within tens of minutes on a single NVIDIA RTX 3090 GPU, so sparse neural networks are reconstructed with both high efficiency and high accuracy.

In practice, the post-training model sparsification method for computer vision provided by the embodiment of the present invention can sparsify a given neural network by following the algorithm steps shown in Table 1. The datasets used in the embodiments of the present invention are all image datasets for computer vision tasks, such as ImageNet.

Table 1

To verify the actual performance and effect of the post-training model sparsification method for computer vision provided by the embodiment of the present invention, the inventors carried out the following series of experimental tests and evaluations on computer vision tasks such as image classification and object detection, and compared the results with the prior art.

First: experimental tests on image classification datasets.

The test datasets are CIFAR-10/CIFAR-100 and ImageNet. CIFAR-10 consists of 50K training images and 10K test images of size 32×32 in 10 classes; CIFAR-100 contains 100 classes. ImageNet ILSVRC12 contains about 1.2 million training images and 50,000 validation images used for testing, in 1000 classes.

POT, the only existing post-training sparsity (PTS) method, is used as the prior-art baseline in the comparative tests. Some techniques originally designed for retraining are also reproduced, namely the heuristic non-uniform sparsity method ERK within POT, and ProbMask under the PTS setting.

The test networks are widely used architectures: ResNet-32/ResNet-56 for CIFAR-10/CIFAR-100, and ResNet-18/ResNet-50, MobileNetV2, RegNet-200M/RegNet-400M and ViT-Base/ViT-Large for ImageNet.

Table 2 compares the Top-1/Top-5 accuracy (%) of the technical solution provided by the embodiment of the present invention with that of the prior art on CIFAR-10 and CIFAR-100 at different sparsity rates.

Table 2

As Table 2 shows, as sparsity increases the prior art degrades rapidly at 80% sparsity and collapses completely at 90% or earlier. In contrast, the technical solution provided by the embodiment of the present invention (denoted "Ours" in the table) keeps a high and stable accuracy even at 90% sparsity, which is especially evident on CIFAR-100.

The unsparsified original ResNet-32 reaches accuracies (%) of 93.53/99.77 on CIFAR-10 and 70.16/90.89 on CIFAR-100. The unsparsified original ResNet-56 reaches 94.37/99.83 on CIFAR-10 and 72.63/91.94 on CIFAR-100. Compared with the original networks, the networks sparsified with the technical solution provided by the embodiment of the present invention show no significant drop in accuracy.

Table 3 compares the Top-1/Top-5 accuracy of the technical solution provided by the embodiment of the present invention with that of the prior art on ImageNet at different sparsity rates. The accuracy of each unsparsified original neural network model is listed below the name of the model architecture.

Table 3

As Table 3 shows, as sparsity increases the accuracy of the technical solution provided by the embodiment of the present invention is clearly better than that of the prior art on all neural network models, especially when the sparsity rate exceeds 70%. Notably, at a sparsity rate of 70% the technical solution provided by the embodiment of the present invention reaches, on ResNet-18 and ResNet-50, an accuracy close to that of the corresponding original models (a drop of only 2%), whereas the prior art loses nearly 10%. For lightweight architectures such as RegNet and MobileNet, the prior art already suffers an unacceptable accuracy drop at 60% sparsity, where the technical solution provided by the embodiment of the present invention improves accuracy by 3% to 8%.

The comparative test results on the ViT-Base/ViT-Large models are shown in Table 4. The accuracy of each unsparsified original neural network model is listed below the name of the model architecture.

Table 4

As Table 4 shows, the technical solution provided by the embodiment of the present invention also outperforms the prior art in accuracy on ViT models, which shows that it is applicable to attention-based model architectures as well.

第二.在目标检测数据集上的实验测试。Second. Experimental testing on the target detection data set.

测试数据集选用PASCAL VOC，该数据集是一个被广泛使用的目标检测数据集，包含20个目标类别，其中的每个图像都有边界框注释和对象类注释。测试网络结构选用MobileNetV1 SSD和MobileNetV2 SSD-lite。对比测试中仍然选用POT作为现有技术方案。The test data set uses PASCAL VOC, which is a widely used object detection data set and contains 20 object categories. Each image has bounding box annotations and object class annotations. The test network structure uses MobileNetV1 SSD and MobileNetV2 SSD-lite. In the comparison test, POT was still selected as the existing technical solution.

采用本发明实施例提供的技术方案与现有技术方案分别在数据集PASCAL VOC上，稀疏率为90％时的均值平均准确率(mAP)比较测试结果如表5所示。其中，未经过稀疏的原始神经网络模型的mAP列在了模型架构的名称下方。The comparative test results of the mean average accuracy (mAP) when the sparsity rate is 90% using the technical solution provided by the embodiment of the present invention and the existing technical solution on the data set PASCAL VOC are shown in Table 5. Among them, the mAP of the original neural network model without sparse is listed under the name of the model architecture.

Table 5

As Table 5 shows, compared with POT (prior art), which uses L2-normalized magnitude and the ERK sparsity rate allocation algorithm, the technical solution provided by the embodiment of the present invention delivers a clear performance improvement under the extreme condition of a 90% sparsity rate. For the sparse MobileNetV1 SSD model, the mAP of the solution provided by the embodiment drops by only 2.6%. On the compressed MobileNetV2 SSD-Lite model, where the performance of the prior-art solution collapses, the solution provided by the embodiment still attains an mAP of 59.1%.

Third. Comparative test of accuracy with a learnable versus a non-learnable sparsity rate.

The solution with a learnable sparsity rate is the technical solution provided by the embodiment of the present invention. The non-learnable solution used for comparison uses only the ERK algorithm to initialize the sparsity of each layer of the model and then reconstructs the entire network in a single pass.
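The initialization step of the non-learnable baseline can be illustrated with a short Python sketch. It assumes the commonly used Erdős-Rényi-Kernel (ERK) rule, in which the density of a layer is proportional to the sum of the dimensions of its weight tensor divided by their product. The function name `erk_sparsities`, the example layer shapes, and the treatment of layers whose density would exceed 1 are assumptions made for illustration; the exact ERK variant used in the tests is not specified here.

```python
import numpy as np

def erk_sparsities(shapes, target_sparsity):
    """Per-layer sparsity rates under an ERK-style rule (illustrative sketch).

    shapes: weight-tensor shapes, e.g. (out, in, kh, kw) for a conv layer.
    target_sparsity: global fraction of weights to remove, strictly in (0, 1).
    """
    if not 0.0 < target_sparsity < 1.0:
        raise ValueError("target_sparsity must lie strictly between 0 and 1")
    n = np.array([np.prod(s) for s in shapes], dtype=float)   # elements per layer
    score = np.array([sum(s) / np.prod(s) for s in shapes])   # ERK score per layer
    dense = np.zeros(len(shapes), dtype=bool)                 # layers kept fully dense
    budget = (1.0 - target_sparsity) * n.sum()                # weights that may remain
    while True:
        # Dense layers keep all their weights; the rest share what is left.
        remaining = budget - n[dense].sum()
        eps = remaining / (score[~dense] * n[~dense]).sum()
        density = np.where(dense, 1.0, eps * score)
        over = (density > 1.0) & ~dense
        if not over.any():
            break
        # A density above 1 is impossible: mark those layers dense and rescale.
        dense |= over
    return 1.0 - density

# Hypothetical layer shapes: two conv layers and one linear layer.
example_shapes = [(64, 3, 3, 3), (128, 64, 3, 3), (10, 128)]
example_rates = erk_sparsities(example_shapes, 0.9)
```

In this sketch small layers receive low sparsity rates and large layers absorb most of the pruning, while the element-weighted average of the returned rates equals the global target. These rates stay fixed in the non-learnable baseline, whereas the learnable solution treats them only as a starting point.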

The comparison was conducted on three models: ResNet-50, RegNetX-400M and MobileNetV2. The ERK algorithm is first used to initialize the sparsity of each layer of the model, and reconstruction is then performed in two settings, one with a learnable sparsity rate and one with a non-learnable sparsity rate. The resulting accuracies are compared in Figure 3.

As Figure 3 shows, the technical solution provided by the embodiment of the present invention achieves higher accuracy than the non-learnable solution, and the gain grows as the sparsity rate increases. For example, on RegNetX-400M the accuracy improves by about 10% at a sparsity rate of 80% and by about 25% at a sparsity rate of 90%. This demonstrates that, because its learnable sparsity rate is subject to a global constraint, the solution provided by the embodiment can dynamically optimize the sparsity rate allocation, which makes the allocation more reasonable and yields unexpectedly good model performance.

Fourth. Comparative test of reconstruction efficiency and inference efficiency.

The test datasets are CIFAR-100 and ImageNet. POT is used as the prior-art solution in the comparison, and the test networks are ResNet-32 and ResNet-18. In the comparison, the reconstruction time needed to obtain the sparse neural network was recorded, and the inference speed was evaluated on real hardware.

Because the technical solution provided by the embodiment of the present invention follows the original weight distribution of the original neural network model, it can make full use of the results of the earlier training. In addition, the algorithm optimizes the network directly in a smaller optimization space (namely the threshold space) without any randomness, so it converges quickly and stably. More importantly, the controllable sparsity rate makes it possible to reach the target without repeated hyperparameter tuning. Table 6 compares the reconstruction efficiency of the solution provided by the embodiment with that of the prior-art solution.

Table 6

As Table 6 shows, the technical solution provided by the embodiment of the present invention reduces the time needed to generate a sparse ResNet-32 model to 9 minutes, whereas prior-art solutions such as POT need more than one and a half hours. The reconstruction efficiency thus differs by about 10 times.

In addition, the inventors tested the inference performance of the post-training sparsification method on the Ambarella CV22, an autonomous driving chip that supports unstructured sparse acceleration. The inference performance results are shown in Table 7.

Table 7

As Table 7 shows, thanks to sparsity, the post-training model sparsification method provided by the embodiment of the present invention significantly reduces both inference latency and memory usage. The gain is especially pronounced for ResNet-18, which achieves a speedup of nearly 2.4 times and saves more than 50% of memory at 70% sparsity.

The above series of comparative experiments and evaluations demonstrates that the computer vision-oriented post-training model sparsification method provided by the embodiment of the present invention uses differentiable estimation to achieve a learnable and controllable sparsity rate. Thanks to the optimal sparsity allocation, it achieves state-of-the-art results on four different datasets covering image classification and object detection tasks, pushing the limits of PTS accuracy and efficiency to a new level.

Based on the above computer vision-oriented post-training model sparsification method, the embodiment of the present invention further provides a computer vision-oriented post-training model sparsification device. As shown in Figure 4, the device includes one or more processors and a memory. The memory is coupled to the processor and stores one or more computer programs which, when executed by the one or more processors, cause the one or more processors to implement the computer vision-oriented post-training model sparsification method of the above embodiments.

The processor controls the overall operation of the post-training model sparsification device to complete all or some of the steps of the above computer vision-oriented post-training model sparsification method. The processor may be a central processing unit (CPU), a graphics processing unit (GPU), a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a digital signal processing (DSP) chip, or the like. The memory stores various types of data to support the operation of the device. Such data may include, for example, instructions for any application program or method operating on the device, as well as application-related data. The memory may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, and the like.

In summary, compared with the prior art, the post-training model sparsification method provided by the embodiment of the present invention establishes a bridge between the sparsity threshold and the sparsity rate from the perspective of probability density, and further uses kernel density estimation to construct a differentiable representation between the sparsity threshold and the sparsity rate. It can therefore learn the optimal sparsity rate allocation under a global constraint without damaging the original weights and while preserving visual features to the greatest extent, and can quickly obtain the required sparse neural network. The method thus has the beneficial effects of a controllable sparsity rate allocation, high accuracy and high efficiency, and it generalizes across various neural network architectures (ResNet, MobileNet, RegNet) and tasks (CIFAR-10/100, ImageNet and PASCAL VOC).
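The link between sparsity threshold and sparsity rate summarized above can be sketched numerically in Python. The example below is a minimal illustration under stated assumptions: a Gaussian kernel (the method only requires a non-negative kernel K), the standard kernel density estimate p(w) = (1/(n·h)) Σ K((w − w_i)/h), and a sparsity rate r(t) equal to the estimated probability mass of the weights in [−t, t], so that dr/dt = p(t) + p(−t). All function names, the synthetic weights and the bandwidth value are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # element-wise error function without SciPy

def kde_pdf(x, samples, h):
    """Gaussian-kernel density estimate p(x) with bandwidth h > 0."""
    z = (x - samples) / h
    norm = samples.size * h * math.sqrt(2.0 * math.pi)
    return float(np.exp(-0.5 * z ** 2).sum() / norm)

def kde_cdf(x, samples, h):
    """Integral of the Gaussian KDE from -infinity to x (closed form)."""
    return float(np.mean(0.5 * (1.0 + _erf((x - samples) / (h * math.sqrt(2.0))))))

def sparsity_rate(t, samples, h):
    """r(t): estimated fraction of weights with magnitude below threshold t."""
    return kde_cdf(t, samples, h) - kde_cdf(-t, samples, h)

def sparsity_rate_grad(t, samples, h):
    """dr/dt = p(t) + p(-t), smooth in t, unlike counting pruned weights."""
    return kde_pdf(t, samples, h) + kde_pdf(-t, samples, h)

# Synthetic stand-in for the weights of one layer (assumption for illustration).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=2000)
t, h = 0.8, 0.05
r = sparsity_rate(t, weights, h)
dr_dt = sparsity_rate_grad(t, weights, h)
```

Counting the weights below a threshold gives a step function of t whose gradient is zero almost everywhere. Replacing the count with the integral of a smooth density estimate is what lets a gradient pass between the threshold and the sparsity rate in this sketch.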

The computer vision-oriented post-training model sparsification method and device provided by the present invention have been described in detail above. Any obvious modification made to the invention by a person of ordinary skill in the art without departing from its essence will constitute an infringement of the patent right of the present invention, and the person making it will bear the corresponding legal liability.

Claims

1. A computer vision-oriented post-training model sparsification method, characterized by comprising the following steps:

(1) Introducing a control loss to achieve the target sparsity rate, the control loss being:

wherein r_i and N_i denote the sparsity rate and the number of elements of the i-th layer's weights, respectively; r_0 is the global sparsity target; and L_c is the control loss;

(2) Taking the reconstruction loss of the weights as supervision to reduce the difference between the sparse output and the original output, so as to preserve the image features in the original neural network model, wherein the reconstruction loss is:

L_rec = D_KL(Y_dense || Y_sparse)

wherein L_rec is the reconstruction loss; D_KL(·) denotes the Kullback-Leibler divergence; Y_dense is the output of the original neural network model; and Y_sparse is the output of the sparsified neural network model;

(3) Adding the control loss and the reconstruction loss to obtain the overall optimization target loss:

L = L_rec + L_c

wherein L is the target loss;

(4) Using the target loss to determine the reconstructed weights and the sparsity rate of each layer that are best suited to the target sparsity rate:

wherein arg min denotes the argument of the minimum; W is the reconstructed weight; and r_i is the sparsity rate of the i-th layer's weights;

(5) Converting the function that determines the reconstructed weights and the sparsity rate of each layer into a differentiable function, so as to realize differentiable optimization of the sparsity rate allocation.

2. The computer vision-oriented post-training model sparsification method of claim 1, wherein in step (2):

the output Y_sparse of the sparsified neural network model is calculated by the following formula:

Y_sparse = f(X, M ⊙ W)

wherein f denotes the neural network model, X is the three-dimensional image input to the neural network model, M is the mask, and W is the weight of the neural network model.

3. The computer vision-oriented post-training model sparsification method of claim 2, wherein:

the mask M is generated by a magnitude-based approach:

M = 0.5 * sgn(|W| - t) + 0.5

wherein sgn denotes the sign function, which returns +1 for a positive argument and -1 otherwise; t is the sparsity threshold; and W is the weight of the neural network model.

4. The computer vision-oriented post-training model sparsification method of claim 1, wherein in step (5):

the derivative of the target loss L with respect to the sparsity rate r_l of the l-th layer is:

wherein L is the target loss, L_rec is the reconstruction loss, L_c is the control loss, and r_l is the sparsity rate of the l-th layer.

5. The computer vision-oriented post-training model sparsification method of claim 4, wherein:

the derivative of the control loss L_c with respect to the sparsity rate r_l of the l-th layer is:

wherein r_i and N_i denote the sparsity rate and the number of elements of the i-th layer's weights, respectively, and r_l and N_l denote the sparsity rate and the number of elements of the l-th layer's weights, respectively.

6. The computer vision-oriented post-training model sparsification method of claim 4, wherein:

the derivative of the reconstruction loss L_rec with respect to the sparsity rate r_l of the l-th layer is:

wherein M_l is the mask of the l-th layer, and t_l is the sparsity threshold of the l-th layer.

7. The computer vision-oriented post-training model sparsification method of claim 1, wherein in step (5):

the sparsity rate r_l of the l-th layer satisfies the formula:

wherein p(w) is the probability density function of the weights W_l; W_l is the weight distribution of the l-th layer; w is a weight sampled from that distribution; g^-1 is the inverse of the differentiable function g, with t_l = g(r_l); and t_l is the sparsity threshold of the l-th layer.

8. The computer vision-oriented post-training model sparsification method of claim 7, wherein:

the derivative of the sparsity rate r_l of the l-th layer with respect to the sparsity threshold t_l is:

wherein p(t_l) is the value of the probability density function at t_l, and p(-t_l) is the value of the probability density function at -t_l; r_l is the sparsity rate of the l-th layer; and t_l is the sparsity threshold of the l-th layer.

9. The computer vision-oriented post-training model sparsification method of claim 7, wherein:

p(w) is estimated using kernel density estimation as follows:

wherein p(w) is the probability density function of the weights W_l; K is a non-negative kernel function; n is the number of sampling points; h is the bandwidth, h > 0; the kernel function with subscript h is the scaled kernel function, defined as

w is a weight sampled from the weight distribution W_l of the l-th layer; and w_i is the weight of the i-th layer.

10. A post-training model sparse device for computer vision tasks, comprising a processor and a memory, the processor and the memory being coupled; wherein,

the memory is used for storing a computer program;

the processor is configured to execute the computer program stored in the memory and to perform the computer vision-oriented post-training model sparsification method of any of claims 1-9.
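The magnitude-based mask of claim 3 and the per-layer quantities r_i and N_i of claim 1 can be illustrated with a short Python sketch. It is a minimal illustration only: the function names are hypothetical, NumPy arrays stand in for layer weights, and the element-weighted average shown is simply the overall fraction of zeroed weights that a global target r_0 would be compared against, not the control loss L_c itself.

```python
import numpy as np

def magnitude_mask(W, t):
    """M = 0.5 * sgn(|W| - t) + 0.5, giving 1 for kept and 0 for pruned weights."""
    # sgn as defined in claim 3: +1 for a positive argument, -1 otherwise.
    # np.sign is avoided because it returns 0 at 0, which would give M = 0.5.
    sgn = np.where(np.abs(W) - t > 0, 1.0, -1.0)
    return 0.5 * sgn + 0.5

def layer_sparsity(M):
    """Sparsity rate r of one layer: fraction of mask entries equal to 0."""
    return float(1.0 - M.mean())

def global_sparsity(rates, counts):
    """Element-weighted average of per-layer rates: sum(r_i * N_i) / sum(N_i)."""
    rates = np.asarray(rates, dtype=float)
    counts = np.asarray(counts, dtype=float)
    return float((rates * counts).sum() / counts.sum())

# Toy 2x2 weight matrix (illustrative values only).
W = np.array([[0.5, -0.1], [0.2, -0.8]])
M = magnitude_mask(W, t=0.2)
W_sparse = M * W              # element-wise product, the M ⊙ W of claim 2
r = layer_sparsity(M)
```

Raising the threshold t zeroes more entries of M and therefore raises the layer's sparsity rate, which is the dependency that the differentiable formulation of claims 7 to 9 makes trainable.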