
CN111368699B - Pattern-Based Convolutional Neural Network Pruning Method and Pattern-Aware Accelerator - Google Patents

Info

Publication number
CN111368699B
Authority
CN
China
Prior art keywords
pattern
neural network
convolutional neural
pruning
pointer
Prior art date
Legal status
Active
Application number
CN202010130327.6A
Other languages
Chinese (zh)
Other versions
CN111368699A (en)
Inventor
谭展宏
Current Assignee
Cross Information Core Technology Research Institute Xi'an Co ltd
Tsinghua University
Original Assignee
Cross Information Core Technology Research Institute Xi'an Co ltd
Priority date
Filing date
Publication date
Application filed by Cross Information Core Technology Research Institute Xi'an Co Ltd
Priority to CN202010130327.6A
Publication of CN111368699A
Application granted
Publication of CN111368699B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a pattern-based convolutional neural network (PCNN) pruning method and a pattern-aware accelerator. It first proposes pattern-based filters and realizes regular intra-filter pruning under a multi-knapsack framework. The method is orthogonal to existing pruning methods: it resembles fine-grained sparsity while still retaining a degree of regularity. In each layer, the number of non-zero weights per filter is constrained to be identical, so that the computational workload of different convolution windows can be the same. Because of the uniform intra-filter sparsity ratio, the pattern types within a layer are very limited. Describing a filter with a pattern mask of very few bits plus the corresponding non-zero sequence, instead of all of its values, significantly reduces memory size and computation, and is friendly to both accuracy and hardware. Being orthogonal to other pruning methods such as filter pruning, the PCNN pruning method achieves up to 9x weight pruning with negligible accuracy loss.

Description

Pattern-Based Convolutional Neural Network Pruning Method and Pattern-Aware Accelerator

Technical Field

The present invention relates to methods and hardware optimization for convolutional neural networks, and in particular to a pattern-based convolutional neural network pruning method and a pattern-aware accelerator.

Background

Neural networks are among the most important algorithms in the field of artificial intelligence today, with wide applications in areas such as face recognition, autonomous driving, and health care. Traditionally, a neural network consists of three kinds of layers, convolutional layers, pooling layers, and fully connected layers, which implement feature extraction, downsampling, and classification, respectively. As network architectures have grown, convolutional layers have become the main bottleneck in computation and storage. Therefore, to bring artificial intelligence to various fields at lower cost, the large amounts of computation and parameters urgently need to be addressed.

Convolutional neural networks (CNNs) have developed rapidly in many applications such as image classification, object detection, and natural language processing. In the prior art, one trend for improving accuracy is to increase model capacity, but the workload (memory accesses) and the number of parameters have correspondingly grown by more than 10 billion and 60 million, respectively. As a result, traditional embedded computing platforms such as CPUs are insufficient to process such large amounts of data. For edge devices with large numbers of parallel units, many neural network accelerators implemented on ASICs or FPGAs have been proposed.

Although these accelerators can improve throughput, moving off-chip data from DRAM to on-chip storage and processing the large amount of computation remain very challenging. Various methods have been proposed to further lighten the storage and computation burden, including low-rank factorization, quantization, and knowledge distillation. Another effective method is weight pruning, which can significantly reduce model size. However, the methods proposed in early studies prune weights randomly (so-called irregular pruning), which requires the sparse model to be stored in a variant of the compressed sparse column (CSC) format, incurring considerable additional indices to represent all the weights. Moreover, in highly parallel architectures, workload imbalance among different computing units leads to stalls or complicated workload arbitration designs, preventing such hardware from fully exploiting the benefits of weight pruning.

Compared with irregular pruning, regular pruning is more hardware-friendly; it includes channel pruning, filter pruning, kernel/connection pruning, and shape pruning, as shown in Figure 1. Channel pruning and filter pruning remove entire input channels and output channels along the depth dimension, respectively, so channel pruning is equivalent to filter pruning applied at the input instead of the output. Kernel pruning and connection pruning mean the same thing, namely removing some kernels between input and output channels. Shape pruning removes the weights at the same positions across filters. The hardware friendliness of regular pruning has been explored at different granularities, including irregular sparsity (0-dimensional), vector-level sparsity (1-dimensional), kernel-level sparsity (2-dimensional), and filter-level sparsity (3-dimensional). However, it can be observed that higher-level sparsity harms accuracy more than lower-level sparsity. In short, coarse-grained pruning seriously degrades accuracy, while fine-grained pruning is unfriendly to hardware, so a better granularity is urgently needed.

The purpose of pruning is to remove redundancy, and model compression is an effective way to remove the redundancy in an original model. Model compression covers various methods, including parameter pruning and sharing, low-rank factorization, and knowledge distillation. Pruning acts like a surgeon, removing the redundant neurons or synapses inherent in a neural network that are insensitive to performance. By locating non-informative weights throughout the network, it can easily reduce the parameters by more than 2x with a negligible drop in accuracy. Low-rank factorization approximates the original parameters with tensor or matrix decompositions that have fewer parameters. Knowledge distillation trains a compressed, thinner neural network by distilling a large one. However, because its basic idea is to train a compact model from a teacher model by learning the class distribution of the softmax output, knowledge distillation can only be applied to classification tasks with a softmax loss function. Compared with the other two compression methods, model pruning enjoys better compression ratios and robustness across various settings.

The earliest model pruning work proposed an iterative method that prunes small-magnitude, non-informative weights. For AlexNet on the ImageNet dataset, it reduced the number of weights by 9x and the amount of computation by 3x. AMC was later provided to select the best pruning settings automatically rather than by hand, achieving better compression and accuracy. However, when a compressed model is deployed on hardware, unavoidable indexing overhead is required to encode the weights, which are stored in the compressed sparse column (CSC) format with 4-bit indices. Furthermore, unbalanced workloads across different computing lanes lead to insufficient parallelism. The root of this problem is the irregularity of the pruning.

Regular pruning solves this problem by pruning the weights in a regular fashion. Existing regular pruning methods aim either to reduce input/output channels or to prune several kernels, so as to remove the kernel connectivity between input and output channels or the weights within filters in an identical-shape manner.

To handle the trade-off between existing regular pruning methods and accuracy, finer-grained pruning is needed to protect prediction accuracy while retaining regularity without requiring a large number of indices.

As shown in Figure 2, the bottom of the pruning-algorithm pyramid is more regular but harmful to accuracy, while the top is irregular but less harmful. Between shape-level pruning and fine-grained pruning there is a gap in which to develop a method that is less damaging under model compression yet still regular. The observation that sparse receptive fields tend to cluster into roughly circular regions suggests that pruning filters with patterns is a better approach than vector pruning, intra-filter stride pruning, or shape pruning. A reasonable hypothesis is that letting the non-zero values distribute naturally within a filter, instead of constraining them to align in a vector or a fixed shape, can improve the prediction accuracy of the pruned model; in practice, however, the gains introduced in both the algorithm and the architecture have been limited. On the algorithm side, the total pattern set of IKR with a finite number of elements lacks a reliable selection process, and the pattern selection for a filter is based only on the l1-norm. This approach is insufficient to preserve the accuracy of the whole model, because error propagation through the layers is ignored. Moreover, the experiments are limited to small neural networks and datasets, and the resulting architectures cannot skip redundant computation and therefore cannot accelerate it.

Summary of the Invention

In view of the problems in the prior art, the present invention provides a pattern-based convolutional neural network pruning method and a pattern-aware accelerator. The method fills the gap in the pruning-algorithm pyramid between fine-grained pruning and shape-level pruning, providing a new dimension of regular pruning and achieving better performance than traditional coarse-grained pruning methods. The accelerator achieves a higher memory compression ratio and better acceleration on sparse networks, realizing highly parallel computation through thorough workload balancing.

The present invention is realized through the following technical solutions.

A pattern-based convolutional neural network pruning method comprises the following steps.

Step 1: for the pattern-based convolutional neural network, under hardware constraints, distill and screen the kernel pattern set of each convolutional layer through the following objective function to obtain the transformation from F_l to P_l, thereby obtaining a smaller set of kernel patterns and reducing the bit overhead of the pattern indices:

$$\min_{x}\ \sum_{l=1}^{N}\sum_{j=1}^{N_l}\left\lVert w_{lj}-\Pi_{P_l}(w_{lj})\right\rVert_2^2+\lambda\sum_{l=1}^{N}V_l+\beta\,n \tag{1}$$

$$\text{s.t.}\ \sum_{j} x_{lj}\le V_l,\quad x_{lj}\in\{0,1\}, \tag{2}$$

$$P_l=\bigcup_{j} p_{lj}x_{lj},\quad p_{lj}\in F_n, \tag{3}$$

$$l=1,\ldots,N,\quad j=1,\ldots,N_l. \tag{4}$$

Here the convolutional neural network contains N convolutional layers, and the set of weights in the convolutional layers is W = {W_1, ..., W_N}, where W_l denotes the weights of the l-th convolutional layer; w_lj is the j-th convolution kernel in W_l, and N_l is the number of kernels in W_l; n is the number of non-zero values in each pattern, and λ and β are parameters balancing the last two terms.

The complete collection of per-layer pattern sets is F = {F_1, F_2, ..., F_n}, and the distilled set P = {P_1, P_2, ..., P_l} is extracted from F without redundant patterns. F_n is the full (candidate) set of patterns with n non-zero values; for a kernel with K weights, each F_n contains $t_n=\binom{K}{n}$ pattern types, where t_n = |F_n| is the total number of patterns in F_n. P_l is the set of patterns selected for the l-th layer, derived from F_n; each P_l contains only V_l pattern types, with $V_l \ll t_n$, and V_l is the desired number of selected patterns, V_l = |P_l|.

When x_lj = 1 the corresponding candidate pattern is included in P_l, and vice versa. $\Pi_{P_l}(w_{lj})$ is the projection function that matches w_lj to the nearest pattern in P_l while keeping the entries of largest absolute value.

Step 2: represent the weights of the convolutional layers by the sparsity and the distilled patterns as {S, P}, realizing the pruning of the pattern-based convolutional neural network. Here S = {s_1, s_2, ..., s_l}, and s_l, the sparsity of the l-th convolutional layer, is determined by n_l/K_l, where n_l is the number of non-zero values and K_l is the kernel size of the layer.
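For illustration, the following minimal Python sketch shows one way to realize the projection described in step 1, matching a flattened 3x3 kernel to its nearest pattern by keeping the entries of largest absolute value; the function and variable names are illustrative assumptions, not part of the patent.

```python
import numpy as np

def project_to_pattern(kernel, patterns):
    """Match a 3x3 kernel to the nearest pattern in P_l by keeping the
    entries of largest absolute value (illustrative sketch)."""
    w = kernel.reshape(-1)                 # flatten 3x3 -> length-9 vector
    best_proj, best_err = None, np.inf
    for mask in patterns:                  # mask: length-9 array of 0/1
        proj = w * mask                    # keep only the pattern positions
        err = float(np.sum((w - proj) ** 2))
        if err < best_err:
            best_proj, best_err = proj, err
    return best_proj.reshape(3, 3), best_err
```

Because every pattern in a layer has the same number n of non-zeros, minimizing the squared error here amounts to keeping the n largest-magnitude weights that fit some allowed pattern.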

Preferably, in step 1, given values of V_l and n, obtaining the F_l-to-P_l transformation from the objective function is cast as knapsack-based pattern distillation and solved with a greedy algorithm; different values of n and V_l are traversed under the hardware constraints, and the best setting is selected.

Further, the F_l-to-P_l transformation obtained from the objective function is converted into knapsack-based pattern selection through the following steps.

Step a: express the pattern distillation by the following formulas:

$$\min_{x}\ \sum_{j=1}^{N_l}\left\lVert w_{lj}-\Pi_{P_l}(w_{lj})\right\rVert_2^2 \tag{8}$$

$$\text{s.t.}\ \sum_{i} x_{li}\le V_l,\quad x_{li}\in\{0,1\},\quad i=1,\ldots,t_n \tag{9}$$

$$P_l=\bigcup_{i} p_{li}x_{li},\quad p_{li}\in F_n, \tag{10}$$

$$l=1,\ldots,N,\quad j=1,\ldots,N_l. \tag{11}$$

Step b: regard each w_lj as a single knapsack, so each knapsack has capacity 1; representing W with the selected patterns is then a multiple knapsack problem (MKP).

Step c: since all the capacities of the w_lj are 1, the KP-based pattern selection is in fact a multiple knapsack problem with identical capacities (MKP-1).

Still further, the multiple knapsack problem MKP is solved by a greedy algorithm for the knapsack-based pattern distillation. Specifically, during pattern selection, for each layer the most valuable pattern is selected for every kernel, and then the top V_l patterns that minimize the loss function are used to represent W_l.

Preferably, for the convolutional neural network, the convolution filters are forced to fit the selected pattern set, so that the weights W of the network are approximated by selecting patterns, and the alternating direction method of multipliers (ADMM) is then used to optimize the network. Specifically:

Given the original dataset {X, Y} and loss function L, the convolutional neural network is optimized as

$$\min_{W,\theta}\ L\bigl(f(X;W,\theta),\,Y\bigr)\quad\text{s.t.}\ w_{lj}\in P_l,\ \forall\, l,j \tag{5}$$

where f denotes the function of the CNN, W the weights of the convolutional layers, θ the other parameters of the CNN, and P the selected pattern set.
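As a rough illustration of the constrained optimization (5), the sketch below shows a penalty-style training step in PyTorch that pulls each convolution kernel toward its pattern projection. It is a simplified stand-in for the full ADMM updates, and `project_layer`, `pattern_sets`, and `rho` are hypothetical names, not the patent's API.

```python
import torch

def pattern_constrained_step(model, loss_fn, x, y, opt, pattern_sets, rho=1e-3):
    """One training step with a quadratic penalty pulling the conv weights W
    toward Proj_P(W) (simplified stand-in for the ADMM updates)."""
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    for name, w in model.named_parameters():
        if name in pattern_sets:  # only the pattern-constrained conv layers
            z = project_layer(w.detach(), pattern_sets[name])  # hypothetical helper
            loss = loss + (rho / 2) * torch.sum((w - z) ** 2)
    loss.backward()
    opt.step()
```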

A pattern-aware accelerator for the pattern-based convolutional neural network consists of a group of pattern-aware processing elements (PEs), a pattern mask register, a kernel register, a sparse computation controller and decoding unit, and a shared activation register with zero-value detection. The input of the sparse computation controller and decoding unit is connected to the pattern mask register and the kernel register, and it interacts with the PE group according to the input pattern-mask mapping table; the input of the PE group is connected to the shared activation register and zero-value detection unit.

The weights of the pattern-based convolutional neural network obtained by any of the pruning methods described above are stored in the kernel register.

Preferably, the pattern-aware processing elements perform the following pipelined data processing.

Stage 1: data pre-processing.

The weights are restored to the original filters according to the pattern-mask mapping table. For the activations, the data are loaded into the register file and an activation sparsity mask is generated to mark the zeros. The pattern is converted into a weight sparsity mask, analogous to the activations.

Stage 2: sparse pointer generation.

A pointer offset table is computed to locate the non-zero activation-weight pairs; each pointer to a non-zero pair is then obtained step by step from the table until the whole filter computation is completed.

Stage 3: finally, when all the partial sums from the individual input channels have been added together, a ReLU is applied to obtain the final result.

Further, in stage 2, the sparse pointers are generated as follows.

The filter pattern codes are processed to compute the pointer offset table.

At the start of each convolutional layer's computation, a configuration word is loaded to set up the pattern decoder.

Based on the pattern decoder, each pattern code can be decoded into the corresponding 9-bit pattern (filter) sparsity mask: the input pattern code is matched to the specified pattern, yielding a 9-bit pattern sparsity mask in which "1" marks a non-zero value and "0" a zero value.

Meanwhile, in the activation register, the input vector is on the one hand cached in the register file; on the other hand, each value of the vector is checked for zero, yielding a 9-bit activation sparsity mask.

After an AND operation is applied between the pattern sparsity mask and the activation sparsity mask, the computation sparsity mask is obtained, which marks the valid non-zero activation-weight pairs; applying the pointer offsets from the pointer offset table to this mask yields the final sparse pointers.
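The mask logic of this stage can be summarized by the short Python sketch below, assuming a `decoder` lookup table from pattern codes to 9-bit masks; all names are illustrative.

```python
def computation_sparsity_mask(pattern_code, decoder, activations):
    """Decode the pattern code to a 9-bit weight mask, derive the activation
    mask by zero-testing, and AND them to mark the valid non-zero pairs."""
    weight_mask = decoder[pattern_code]      # e.g. decoder[3] == 0b101000110
    act_mask = 0
    for i, a in enumerate(activations):      # the 9 values of one 3x3 window
        if a != 0:
            act_mask |= 1 << (8 - i)         # '1' marks a non-zero activation
    return weight_mask & act_mask            # the computation sparsity mask
```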

Still further, the pointer offset table used for the sparsity-mask offsets records every position of a valid activation-weight pair. It consists of two parts, the pointer head and the pointer offsets: the pointer head indicates the position of the first valid pair, and a pointer offset is the distance between the current position and the nearest unused non-zero value. The pointer offset table is processed as follows:

First, a NOT operation is applied to each bit of the computation sparsity mask, so that the number of zeros between non-zeros can be accumulated;

Second, each bit of the negated mask is added to the running sum, and the accumulated sum is reset as soon as a 0 bit is observed.

The flow for generating pointers from the offset table, sketched in code below, is:

First, the pointer starts at zero, and the pointer head is added directly to find the first valid pair;

Then, the offset at the current position plus an extra 1 is added to the pointer to find the next valid pair;

Finally, after each accumulation the pointer is compared with 9 to decide whether to end the workflow.
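The following Python sketch mirrors the offset-table construction and the pointer walk just described; it operates on the 9-bit computation sparsity mask as a list of bits, and the names are illustrative.

```python
def build_offset_table(mask_bits):
    """mask_bits: 9 ints, 1 = valid activation-weight pair. Returns the
    pointer head (index of the first valid pair) and the offsets (number of
    zeros between consecutive valid pairs)."""
    positions = [i for i, b in enumerate(mask_bits) if b == 1]
    if not positions:
        return None, []
    head = positions[0]
    offsets = [b - a - 1 for a, b in zip(positions, positions[1:])]
    return head, offsets

def walk_pairs(mask_bits):
    """Reproduce the pointer workflow: ptr = 0 + head, then ptr += offset + 1,
    stopping once ptr reaches 9 (past the last of the 9 positions)."""
    head, offsets = build_offset_table(mask_bits)
    if head is None:
        return
    ptr = head
    yield ptr
    for off in offsets:
        ptr += off + 1
        if ptr >= 9:
            break
        yield ptr

# e.g. mask 0 1 0 0 1 1 0 0 0 -> head 1, offsets [2, 0], pointers 1, 4, 5
```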

Preferably, the filters packed for the pattern-aware processing elements (PEs) comprise pattern masks and non-zero sequences stored in separate SRAMs.

The non-zero weights, which have a fixed word length, are read out of the weight SRAM as groups of non-zero sequences; in the weight SRAM the weights are quantized to an 8-bit word length.

The pattern configurator reconfigures the weight register file according to the configuration word and divides the vectors read from the pattern SRAM into groups according to the length of the non-zero sequences.

After the pattern configurator has reconfigured the group division of the weights according to the configuration word, the weights are distributed as filters to the PE group.

The SRAM holding the masks is accessed with bitwise group division.

The pattern configurator defines how the pattern words are split to load the subsequent shift registers, so that the pattern codes are read sequentially and fed to the PE array.

Compared with the prior art, the present invention has the following beneficial technical effects:

The present invention first proposes pattern-based convolutional neural network (PCNN) filters and realizes regular intra-filter pruning under a multi-knapsack framework. The method is orthogonal to existing pruning methods: it resembles fine-grained sparsity while still retaining a degree of regularity. In each layer, the number of non-zeros per filter is constrained to be identical, so that the computational workload of different convolution windows can be the same. Because of the uniform intra-filter sparsity ratio, the pattern types within a layer are very limited. Describing a filter with a pattern mask of very few bits plus the corresponding non-zero sequence, instead of all of its values, significantly reduces memory size and computation, and is friendly to both accuracy and hardware. Experiments show that the PCNN pruning method achieves up to 9x weight pruning with negligible accuracy loss. Combined with other pruning methods such as filter pruning, it achieves up to a 34.4x reduction in model size with negligible accuracy loss, demonstrating the orthogonality property.

Further, after fixing an identical sparsity for every filter within a layer, pattern distillation is carried out within the framework of a nonlinear knapsack problem to select the best pattern types from the full enumerated set, optimizing the relationship between accuracy and various factors such as filter sparsity and the number of pattern types.

The present invention proposes a filter-pattern-aware accelerator that obtains a better storage compression ratio than traditional pruning. It not only allows the benefits of pattern-filter pruning to be evaluated, but also balances low-overhead filter encoding with the workloads of different convolution windows, achieving highly parallel computation, a clear reduction in storage, and a clear reduction in computation. Compared with irregular pruning, the pattern-aware accelerator fully exploits the advantages of PCNN: its memory footprint is reduced by 6.5x (tested with 8x weight pruning and a 9x compression ratio), and full acceleration is achieved under balanced workloads, with an area overhead of 13.9% per PE and a memory overhead of 3.3%.

Brief Description of the Drawings

Figure 1 is a schematic diagram of channel pruning, filter pruning, kernel pruning, and shape pruning in the prior art.

Figure 2 is a schematic pyramid of pruning algorithms of different regularity in the prior art.

Figure 3a is a schematic diagram of the top and bottom representations of the pattern filter described in the examples of the present invention.

Figure 3b shows the overhead and compression of the various pattern types described in the examples of the present invention.

Figure 4 is a schematic diagram of the pattern distribution in CONV4 of VGG16 described in the examples of the present invention.

Figure 5 compares the results of the different pattern selections described in the examples of the present invention.

Figure 6a shows the storage structure of the weights and patterns of the pattern-aware accelerator described in the examples of the present invention.

Figure 6b shows the storage format of the weights of the pattern-aware accelerator described in the examples of the present invention.

Figure 7 is a schematic diagram of the computation of the convolutional layer described in the examples of the present invention.

Figure 8a is a schematic diagram of the computing unit of the accelerator described in the examples of the present invention.

Figure 8b is a schematic flow chart of pointer generation in the accelerator described in the examples of the present invention.

Figure 8c is a schematic diagram of the decoder module of the accelerator described in the examples of the present invention.

Figure 9 is a schematic diagram of the processing flow of the accelerator described in the examples of the present invention.

Figure 10 is a schematic diagram of the structural arrangement of the accelerator described in the examples of the present invention.

Detailed Description

The present invention is described in further detail below with reference to specific embodiments, which explain rather than limit the invention.

In the traditional approach, a filter is filled with K values, so a data sequence of length K is used to describe it. However, when a pruning method is applied, filters usually exhibit sparsity, so a sparsity mask can be used to represent the positions of the non-zeros.

With the help of a sparsity mask, only the non-zero part of a filter is needed; Figure 3a shows the original representation of the filter at the top and the pattern-based representation, a pattern mask plus a non-zero sequence, at the bottom. Encoding the sparsity mask introduces only a small overhead. Taking the most common 3x3 filter as an example, there are at most $2^9=512$ pattern types, so a 9-bit code can represent all possibilities. To obtain regular pruning, using non-zero sequences of the same length helps balance the workload of different convolution windows; the pattern types are thereby reduced to $\binom{9}{n}$ (e.g. $\binom{9}{4}=126$), for which a 7-bit mask code is sufficient. As shown in Figure 3b, the bars represent the bit cost of the pattern codes, and the two lines from bottom to top represent the overhead percentage and the compression ratio of the filter when the number of non-zeros is 4. Across various sparsities, pattern filters introduce an average coding overhead of 17%, while enjoying up to 9x compression in the 1-non-zero case.

Furthermore, we observe the pattern distribution. As shown in Figure 4, in CONV4 of VGG16 there are 126 patterns in total when the number of non-zeros is 4. Some pattern types remain unimportant, which allows some patterns to be removed further.

Based on the above considerations, our pattern pruning method can be described in the following terms.

The set S = {s_1, s_2, ..., s_l} gives the sparsity ratio of the l-th convolutional layer, determined by n_l/K_l, where n_l is the number of non-zero values and K_l is the kernel size of the layer. Sharing the parameter s_l guarantees a balanced workload for every filter.

Next, the set F = {F_1, F_2, ..., F_n} is the complete collection of per-layer pattern sets, and the distilled set P = {P_1, P_2, ..., P_l} is extracted from F without redundant patterns.

Each F_n contains $t_n=\binom{9}{n}$ pattern types, while each P_l contains only $V_l$ ($V_l \ll t_n$) pattern types.

Through the optimization of the pattern-based convolutional neural network, we obtain the transformation from F_l to P_l. The weights of the convolutional layers are then formulated as {S, P}, which is an intuitive description of the pattern pruning method.

Finally, for the hardware, the memory compression ratio of the weights is computed as $\sum_l s_l r_l$, where r_l is the ratio of the parameters of layer l to the total model. The overhead of the pattern masks is at most $\gamma\sum_l K_l$, determined by the kernel size K_l and the ratio γ of the pattern-mask word length to the non-zero-sequence word length; here the word length of the pattern mask is $\lceil \log_2 V_l\rceil$ bits. For computation acceleration, thanks to the regular pruning, the hardware can fully exploit the sparsity to reduce the FLOPs by the ratio $\sum_l s_l r_l$.

The optimization of the pattern-based convolutional neural network is as follows.

Consider a CNN with N convolutional layers, where W_l denotes the weights of the l-th convolutional layer and W = {W_1, ..., W_N}. With the hardware in mind, our aim is to find an appropriate number of patterns that represent W at a low index bit cost; at the same time, higher pattern sparsity is desired. This yields the following objective function:

$$\min_{x}\ \sum_{l=1}^{N}\sum_{j=1}^{N_l}\left\lVert w_{lj}-\Pi_{P_l}(w_{lj})\right\rVert_2^2+\lambda\sum_{l=1}^{N}V_l+\beta\,n \tag{1}$$

$$\text{s.t.}\ \sum_{j} x_{lj}\le V_l,\quad x_{lj}\in\{0,1\}, \tag{2}$$

$$P_l=\bigcup_{j} p_{lj}x_{lj},\quad p_{lj}\in F_n, \tag{3}$$

$$l=1,\ldots,N,\quad j=1,\ldots,N_l. \tag{4}$$

where w_lj is the j-th convolution kernel in W_l and N_l is the number of kernels in W_l; n is the number of non-zero values in each pattern, and λ and β are parameters balancing the last two terms. F_n is the full (candidate) set of patterns with n non-zero values, and t_n is the total number of patterns in F_n (i.e., t_n = |F_n|). P_l is the set of patterns selected for layer l, derived from F_n, and V_l is the desired number of selected patterns, V_l = |P_l|. When x_lj = 1, the corresponding candidate pattern is included in P_l, and vice versa. $\Pi_{P_l}(w_{lj})$ is the projection function that matches w_lj to the nearest pattern in P_l by keeping the n entries of largest absolute value.

In practice, the optimization problem in Equations 1-4 has no closed-form solution. However, given values of V_l and n, the optimization resembles a 0-1 knapsack problem and can be solved with a greedy algorithm. Only by traversing different n and V_l under the hardware constraints can the best setting be selected.

We provide a scheme that selects patterns to approximate the weights W of the CNN, and then use the alternating direction method of multipliers (ADMM) to optimize the model. We force the convolution filters to fit the selected pattern set, rather than forcing the weights into a special quantization. Given the original dataset {X, Y} and the loss function L, the optimization is

$$\min_{W,\theta}\ L\bigl(f(X;W,\theta),\,Y\bigr)\quad\text{s.t.}\ w_{lj}\in P_l,\ \forall\, l,j \tag{5}$$

where f denotes the function of the CNN, W the weights of the convolutional layers, θ the other parameters of the CNN, and P the selected pattern set.

The knapsack problem mentioned above is as follows.

The goal of the original knapsack problem is to select objects under a limited capacity such that the total profit of the selected objects is as large as possible. Given a knapsack with finite volume V and n items, where the i-th item has volume v_i and profit c_i, the 0-1 knapsack problem can be formulated as the following linear integer program:

$$\max\ \sum_{i=1}^{n} c_i x_i \tag{6}$$

$$\text{s.t.}\ \sum_{i} v_i x_i\le V,\quad x_i\in\{0,1\},\quad i=1,\ldots,n. \tag{7}$$

where x_i = 1 means that the i-th item is put into the knapsack, and vice versa.
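For reference, a standard greedy heuristic for this 0-1 knapsack formulation (a generic textbook sketch, not the patent's algorithm) looks like:

```python
def knapsack_greedy(volumes, profits, capacity):
    """Take items in decreasing profit/volume ratio while they still fit;
    an approximation to Eqs. (6)-(7), not an exact solver."""
    order = sorted(range(len(volumes)),
                   key=lambda i: profits[i] / volumes[i], reverse=True)
    chosen, used = [], 0
    for i in order:
        if used + volumes[i] <= capacity:
            chosen.append(i)
            used += volumes[i]
    return chosen
```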

The knapsack-problem (KP) based pattern selection is then as follows.

Pattern distillation means that we must select a limited number of the most profitable patterns from the candidate set, which corresponds exactly to the core idea of the knapsack problem. However, unlike the knapsack problem, where each item has volume 1, each pattern can be selected multiple times, and the profit of a pattern is no longer a fixed value but a function of w_lj and all the selected patterns P.

We express pattern distillation with the following formulas:

$$\min_{x}\ \sum_{j=1}^{N_l}\left\lVert w_{lj}-\Pi_{P_l}(w_{lj})\right\rVert_2^2 \tag{8}$$

$$\text{s.t.}\ \sum_{i} x_{li}\le V_l,\quad x_{li}\in\{0,1\},\quad i=1,\ldots,t_n \tag{9}$$

$$P_l=\bigcup_{i} p_{li}x_{li},\quad p_{li}\in F_n, \tag{10}$$

$$l=1,\ldots,N,\quad j=1,\ldots,N_l. \tag{11}$$

Moreover, if we regard each w_lj as a single knapsack, each knapsack has capacity 1; in other words, only one pattern can be chosen from the candidate set for each w_lj. Representing W with the selected patterns is then a multiple knapsack problem (MKP). In particular, since all the capacities of the w_lj are 1, the KP-based pattern selection is actually a multiple knapsack problem with identical capacities.

It is well known that the MKP is strongly NP-hard. A direct way to obtain an approximation algorithm for the MKP is the generalized greedy algorithm (G-Greedy), which first sorts the items in descending order and the knapsacks in ascending order, and then assigns the items one by one in that order under the capacity constraints. In our pattern selection, however, unlike in G-Greedy, each pattern is not limited in the number of times it can be selected. For this particular problem we propose an efficient greedy algorithm for the KP-based pattern optimization. For each layer, we first select the most favorable pattern for each kernel, then count how often each pattern has been selected, and finally keep the patterns with the highest frequency (the top 10% or a fixed number; here, for each layer, we keep the top V_l most valuable patterns). The details are shown in Algorithm 1.

(Algorithm 1: greedy pattern distillation; pseudocode figure not reproduced.)
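A minimal Python sketch consistent with the description of Algorithm 1 above (vote for the best pattern per kernel, then keep the V_l most frequent); all names are illustrative:

```python
from collections import Counter
import numpy as np

def distill_patterns(kernels, candidates, V_l):
    """Greedy pattern distillation for one layer: match every kernel to its
    most valuable candidate pattern, count how often each pattern wins, and
    keep the V_l most frequent ones as the distilled set P_l."""
    votes = Counter()
    for w in kernels:                          # w: flattened 3x3 kernel
        errs = [float(np.sum((w - w * m) ** 2)) for m in candidates]
        votes[int(np.argmin(errs))] += 1       # best-fitting pattern wins
    return [candidates[i] for i, _ in votes.most_common(V_l)]
```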

In essence, the proposed greedy algorithm is a mathematical-statistical method that is simple but effective. We show the differences between patterns chosen by different selection schemes and demonstrate that the patterns distilled by our method are more valuable.

The effectiveness of the above pattern selection is evaluated as follows.

For example, for n = 2, we project the weights W onto the set of all possible patterns F_n and count the frequency with which each pattern has been used to represent W, as shown in Figure 4. The reconstruction loss between the patterns and the weights in each layer is then computed as

$$E_l=\sum_{j=1}^{N_l}\left\lVert w_{lj}-\Pi_{P_l}(w_{lj})\right\rVert_2^2 \tag{12}$$

and the result is compared with three pattern selection methods:

· Highest: the patterns with the highest overall frequency.

· Lowest: the patterns with the lowest overall frequency.

· Distilled: the patterns obtained by distillation.
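Under the same conventions as the earlier sketches, the per-layer reconstruction loss of Equation (12) could be evaluated with illustrative code like the following:

```python
import numpy as np

def reconstruction_loss(kernels, P_l):
    """Squared error between each kernel and its nearest pattern projection,
    summed over the layer (Eq. 12)."""
    return sum(min(float(np.sum((w - w * m) ** 2)) for m in P_l)
               for w in kernels)
```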

Figure 5 shows that different patterns lead to different performance on VGG-16, where |Pn| = 16. The highest-frequency patterns clearly outperform the lowest-frequency ones, because the most frequently used patterns are more stable than the infrequently used ones. However, the distilled patterns achieve the best performance, indicating that overall statistical patterns are not optimal and that different layers require different patterns. Note that the reconstruction loss denotes the loss between the patterns and the weights in each layer, as given in Equation 12.

Overall, the advantages of the pattern pruning method of the present invention can be observed across different models and various sparsity settings on different datasets. The benefit of pattern pruning comes from the balance between regularity and fine-grained pruning. Compared with shape-level pruning, pattern pruning allows different sparse shapes in different filters, enhancing the representation capability. Compared with fine-grained pruning, we only constrain the filters in a layer to a uniform sparsity ratio, so our method differs little from fine-grained pruning. In this sense, it can be viewed as a regular patch for fine-grained pruning.

By further constraining the number of patterns in each layer to study the effect of regularity on accuracy, the method of the present invention achieves lower index-storage overhead than deep compression, which requires 4 bits per weight.

The method of the present invention is compared with other studies on VGG-16 and ResNet-18 on CIFAR10. Against the various baselines used in different studies, we fairly compare the relative accuracy results. It can be observed that our method significantly compresses the parameters and at the same time reduces the floating-point operations (FLOPs) with negligible accuracy loss. The better FLOP reduction stems from the higher sparsity in large layers, which account for more of the computation, whereas other coarse-grained methods avoid pruning these layers. On ResNet-18, PCNN also achieves better performance than other coarse-grained pruning methods, especially in FLOP reduction. Moreover, the PCNN pruning method proposed by the present invention is orthogonal to existing pruning methods.

On the basis of the above, the present invention also provides a pattern-aware accelerator with small area and low power overhead. Noting that 3x3 filters are currently widely used and are the dominant type in various CNNs, the proposed architecture is optimized for this type. At the same time, it works with the corresponding storage format of the pattern-aware accelerator to maximize the potential of storage compression.

In the pattern pruning method, a filter is represented by its pattern mask and its non-zero sequence. In storage, we therefore handle filters as packages whose pattern masks and non-zero sequences are kept in separate SRAMs. The reason for the different SRAM designs is the different formats and the different fine-grained control flows of these two data types.

For the non-zero weights, the word length is fixed, so they can be read out of storage as groups of sequences, as shown in Figure 6b. In this storage structure we quantize the weights to an 8-bit word length. The weights are read from the SRAM byte by byte, which makes it possible to load the non-zero weights sequentially and flexibly. The storage SRAM of the present invention is generated with Faraday's MEMAKER, and the byte bit-width can be customized based on the pattern configuration word to match the length of the non-zero sequences read per byte. The pattern configurator reconfigures the weight register file according to the configuration word and divides the vectors read from the pattern SRAM into groups according to the length of the non-zero sequences, yielding the pattern masks (SPM codes), which the pattern decoder turns into 9-bit weight masks fed to the pattern-aware PE group, as shown in Figure 6a. In addition, the configuration word reconfigures the group division of the weights stored in the kernel register. Afterwards, the sparse computation controller distributes the weights as filters to the pattern-aware PE group.

However, the word length of the pattern masks differs from that of the weights and varies across layers. We therefore use bitwise rather than sequential group division for the memory accesses of the masks. According to the experimental results, at most 32 pattern types are in practice sufficient to preserve accuracy, so the pattern codes are stored in a 60-bit-wide SRAM, as shown in Figure 6b, holding per word 12 5-bit codes (32 types), or 15 4-bit codes (16 types), or 20 3-bit codes (8 types), or 30 2-bit codes (4 types), or 60 1-bit codes (2 types). The pattern configurator defines here how the pattern words are split to load the subsequent shift registers; the pattern codes are then read sequentially and fed to the PE array. Compared with a fixed-length copy, the overhead of pattern pruning can thus be reduced below the bound $\gamma\sum_l K_l$ given in the hardware description above. With this dynamic bit-width format for the pattern codes, the pattern overhead rate of layer l is optimized from γ to $\gamma_l'$, where $\gamma_l'$ is adapted per layer to the actual number of patterns (e.g. $\lceil\log_2 V_l\rceil$ bits) instead of using the maximum width across layers, so the total pattern overhead is reduced to $\sum_l \gamma_l' K_l$. For example, for various settings of VGG-16 on CIFAR10, the dynamic scheme saves 40% of the index storage compared with the maximum-width method.

The overall structure of the pattern-filter-sparsity-based pattern-aware accelerator of the present invention is shown in Figure 8a. It consists of a group of processing elements (PEs), a pattern mask register, a kernel register, a sparse computation controller and decoding unit, and a shared activation register with zero-value detection. The input of the sparse computation controller and decoding unit is connected to the pattern mask register and the kernel register, and it interacts with the PE group according to the input pattern-mask mapping table; the input of the PE group is connected to the shared activation register and zero-value detection unit.

The proposed accelerator structure can fully exploit the sparsity in neural networks with shared activations and balanced pattern filters. As shown in Figure 7, each input feature map can be reused for all output channels with the corresponding filter banks.

By multiplexing the activations over n PEs, each PE is loaded with the weights of a different output channel. In this way, even if the zeros in the activations are randomly distributed, the balance among the PEs is maintained, so that activation sparsity and weight sparsity can be handled simultaneously. The pattern-mask mapping table in Figure 8a indicates the pattern type of each filter, which helps reconstruct the convolution computation. Compared with a plain copy that processes the data stream directly, the accelerator can fully realize the theoretical algorithmic speedup, and an even higher speedup can be obtained from the sparsity in the activations.

Figure 9 shows the pipelined strategy inside the pattern-aware processing element (PE). For high throughput, all operations are pipelined.

The first stage is the data preprocessing stage. The weights are restored to the original filters according to the pattern mask mapping table. For the activations, the data is loaded into the register file and an activation sparsity mask is generated to mark the zeros. For the patterns, each pattern is converted into a weight sparsity mask, analogously to the activations.
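A minimal sketch of this restoration step follows; the pattern table entries and the non-zero sequence are invented for illustration and are not the patented mask set. The stored non-zero sequence is scattered back into a dense 3x3 filter at the positions marked by the filter's pattern mask.

import numpy as np

# Hypothetical pattern mask mapping table: pattern type -> 9-bit mask (1 = non-zero).
pattern_table = {
    0: [1, 0, 0, 0, 1, 0, 0, 0, 1],
    1: [0, 1, 0, 1, 0, 1, 0, 1, 0],
}

def restore_filter(pattern_id, nonzeros):
    # Scatter the stored non-zero sequence into a dense 3x3 filter.
    mask = pattern_table[pattern_id]
    dense = np.zeros(9, dtype=np.int32)
    dense[np.flatnonzero(mask)] = nonzeros   # place values at mask positions, in order
    return dense.reshape(3, 3)

print(restore_filter(1, [7, -2, 5, 1]))      # pattern 1 has four non-zero positions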

The second stage is sparse pointer generation: a pointer offset table is computed to locate the non-zero activation-weight pairs, which helps eliminate redundant loop iterations. Based on the pointer offset table, each pointer to a non-zero activation-weight pair is then obtained step by step until the entire filter computation is completed.

Finally, when all partial sums from the individual input channels have been multiply-accumulated together, a ReLU is applied to obtain the final result.

Sparse pointer generation is critical in the proposed accelerator, so its workflow is described in detail below. An overview is shown in Figure 8b, where the filter pattern code is processed to compute the pointer offset table. At the beginning of each layer's computation, configuration words are loaded to set up the pattern decoder. Based on the pattern decoder, each pattern code can be decoded into a corresponding 9-bit filter sparsity mask. In the example shown in the upper-left part of Figure 8b, the input code matches the third pattern, yielding the sparsity mask shown in the gray block on the left of Figure 8b, where "1" denotes a non-zero value and "0" denotes a zero value. Meanwhile, in the activation register, the input vector is cached in the register file while each of its values is checked for zero, producing a 9-bit activation sparsity mask. After an AND operation between the two sparsity masks, a computation sparsity mask is obtained that marks the valid non-zero activation-weight pairs; pointer offset processing on this mask then yields the final sparse pointers. With the help of this activation-weight pair sparsity detection, the amount of computation can be reduced further.
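The mask combination can be mimicked in a few lines. In the sketch below, the decoded filter sparsity mask and the activation vector are made up for illustration; the decoded 9-bit filter mask is ANDed with the activation sparsity mask to obtain the computation sparsity mask that marks valid non-zero activation-weight pairs.

# Hypothetical decoded filter sparsity mask and activation vector for one 3x3 window.
filter_mask = [1, 1, 0, 1, 0, 1, 0, 0, 1]        # '1' = weight is non-zero
activations = [5, 0, 3, -2, 0, 4, 1, 0, -7]

activation_mask = [int(a != 0) for a in activations]              # zero-value detection
compute_mask = [f & a for f, a in zip(filter_mask, activation_mask)]  # bitwise AND

print("activation mask:", activation_mask)
print("compute mask:   ", compute_mask)   # only these positions need a MAC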

The pointer offset table for the sparsity-mask offsets locates each position at which a valid activation-weight pair is computed. The function of the offset module is shown in the upper part of Figure 8c. An adder-AND chain produces a 9-word pointer offset table, each entry of which represents the distance to the nearest zero. The pointer offset table is computed as follows: first, a NOT operation is applied to each bit of the computation sparsity mask so that the number of zeros between non-zeros can be accumulated; second, each bit is accumulated through the negated mask, and the running sum is reset once a 0 bit (a non-zero position in the original mask) is observed. An example pointer offset table is shown in Figure 8c; it consists of two parts, the pointer head and the pointer offsets. The pointer head indicates the position of the first valid pair. A pointer offset represents the distance between the current position and the nearest unused non-zero value. In practice, only the offsets located at valid pairs are used, because the other offsets are never reached.
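A behavioral sketch of the offset computation follows; it is one plausible software reading of the adder-AND chain, and the 9-bit computation mask is invented for the example. The mask bits are scanned, runs of zeros between non-zeros are accumulated, the running count is reset at every non-zero bit, and the position of the first valid pair is reported as the pointer head.

def pointer_offset_table(compute_mask):
    # Return (offsets, head): offsets[i] counts the zeros immediately
    # preceding position i; head is the index of the first valid pair.
    offsets, run = [], 0
    for bit in compute_mask:
        offsets.append(run)
        if bit:          # a valid pair ends the current run of zeros
            run = 0
        else:            # negated bit: one more zero to skip
            run += 1
    head = compute_mask.index(1) if 1 in compute_mask else None
    return offsets, head

mask = [0, 1, 0, 1, 0, 1, 0, 0, 0]     # hypothetical 9-bit computation mask
print(pointer_offset_table(mask))      # offsets plus the head position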

The workflow and an example of pointer generation are shown at the bottom of Figure 8c. First, the pointer is initialized to zero and added directly to the pointer head to find the first valid pair. Then the pointer is incremented by the offset at the current position plus an extra 1 to find the next valid pair. After each accumulation, the pointer is compared with 9 to decide whether to end the workflow. In the example of Figures 8b and 8c, there are three valid pairs in the original sparsity mask. The pointer is thus first added to the pointer head to find (a1, w1), and is later added to the corresponding pointer offsets plus another "1" to find (a3, w3) and (a5, w5). When the pointer reaches 9, the workflow ends and the pointer is reset to zero.
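Combining the two steps, the sketch below walks the pointer as described: start from zero, add the pointer head to reach the first valid pair, then repeatedly add the offset at the next pair's position plus one. The mask is a hypothetical reading of the three-pair example in Figures 8b and 8c, with valid pairs at 1-based positions 1, 3, and 5; in hardware the walk additionally terminates when the pointer accumulates to 9.

def walk_pointers(compute_mask):
    # offsets[i]: zeros immediately before 1-based position i+1, reset at each '1'
    offsets, run = [], 0
    for bit in compute_mask:
        offsets.append(run)
        run = 0 if bit else run + 1
    valid = [i + 1 for i, b in enumerate(compute_mask) if b]  # 1-based valid positions
    if not valid:
        return []
    pointers = [valid[0]]                 # pointer = 0 + pointer head
    for pos in valid[1:]:
        # next pointer = previous pointer + offset at this position + extra 1
        pointers.append(pointers[-1] + offsets[pos - 1] + 1)
    assert pointers == valid              # the walk lands exactly on the valid pairs
    return pointers

mask = [1, 0, 1, 0, 1, 0, 0, 0, 0]       # pairs (a1, w1), (a3, w3), (a5, w5)
print(list(walk_pointers(mask)))         # -> [1, 3, 5]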

Thus, with the pattern-aware accelerator, workloads can be aligned across different PEs through shared activations and equal-length non-zero sequences of the filters. Note that in Figure 8a only the required activations and weights are loaded into the PEs, which reduces the internal bandwidth. Furthermore, the sparsity-mask-based processing pipeline can detect zeros in the activation vectors and in the filters simultaneously.

The accelerator is evaluated as follows. The architecture of the pattern-aware accelerator and a baseline replica were implemented in RTL, and the area and power overhead of the pattern-aware architecture were evaluated with Synopsys Design Compiler in a UMC 55nm standard-power CMOS process. The layout of the pattern-aware accelerator is shown in Figure 10, including the input and output data buffers, the weight buffer, the pattern SRAM, the register files, the PE groups, and the phase-locked loop. Regarding the overhead of each component, in the overall design of the pattern-aware accelerator architecture the overhead introduced by the pattern design is small.

In terms of memory footprint, compared with irregular pruning, the pattern pruning method only requires one extra pattern code per filter rather than per weight. With the pattern pruning method, in most cases a 128KB weight SRAM serves up to 32768 filters of size 3x3 with 4 non-zeros each, while a 4K pattern SRAM is sufficient for the workloads, since 16 pattern types are enough to maintain accuracy. As a result, the accelerator architecture introduces only 3.1% storage overhead for the parameters, and when the data SRAM is also considered, the area of the pattern SRAM equals 3.4% of the other SRAMs. By contrast, the sparsity overhead of EIE occupies 19.1% of the whole chip area, and its pointer SRAM is about 25% the size of the other SRAMs; compared with EIE, this design thus saves 7x in overhead. Taking VGG-16 with 1 non-zero per filter and 16 patterns as an example, since the weights are represented with 8 bits in this architecture, 6.5x of the parameters can be saved (considering both the index data overhead and the weight data overhead).

In terms of performance improvement, compared with a non-pattern-aware architecture, the proposed pattern-aware architecture achieves a full 9x speedup at the cost of 13.9% area overhead and 3.3% storage overhead. In addition, when the proposed accelerator architecture processes dense networks, the data preprocessing stage and the sparse pointer stage of the pipeline can be bypassed, and the related modules can be shut off by clock gating during the sparse phases. In other comparable studies, only sparse networks are well supported. Eyeriss merely clock-gates zero computations to save power, which still consumes time. Cambricon-X focuses only on weight sparsity rather than activation-weight pairs. SCNN exploits both weight and activation sparsity by introducing coordinate-aligned data, but for dense networks the overhead of such computation cannot be hidden. EIE supports irregular pruning but incurs extra index cost, while Cambricon-S uses coarse-grained pruning for low extra cost but is only suitable for channel-level pruning.

The present invention proposes a novel filter representation, PCNN, for describing sparse filters with a pattern mask and a non-zero sequence. This pruning method fills the gap in the pruning-algorithm pyramid between fine-grained pruning and shape-level pruning, providing a new dimension of regular pruning that achieves better performance than traditional coarse-grained pruning methods. Experiments show that 8x to 9x compression and computational speedup can be achieved with less than 1% accuracy loss. Moreover, pattern pruning can easily be combined with coarse-grained pruning methods to improve accuracy, achieving a compression ratio of 34.4 with negligible accuracy loss. For storage overhead, only a 5-bit index per filter is introduced, which is lower than the 4-bit index per weight in EIE. For computation, with only 13.9% PE area overhead and 3.3% storage area overhead, the proposed accelerator architecture achieves speedup equivalent to or better than the theoretical gain of the PCNN algorithm, because it balances the workload with pattern pruning and detects sparsity in both the weights and the activations.

Claims (9)

1. A pattern-aware accelerator of a pattern-based convolutional neural network, characterized by consisting of a pattern-aware processing element (PE) group, a pattern mask register, a convolution kernel register, a sparse computation controller and decoding unit, and a shared activation register & zero-value detection unit; the input of the sparse computation controller and decoding unit is connected to the pattern mask register and the convolution kernel register, the unit is interactively connected to the pattern-aware processing element group according to the input pattern mask mapping table, and the input of the pattern-aware processing element group is connected to the shared activation register & zero-value detection unit;
the weights of the pattern-based convolutional neural network, obtained by the pattern-based convolutional neural network pruning method, are registered in the convolution kernel register;
the pattern-based convolutional neural network pruning method comprises the following steps:
step 1, taking hardware conditions as constraints, distillation screening is performed on the convolution kernel pattern set of each layer of the convolutional neural network through the following objective function, to obtain the transformation from F_l to P_l and thereby a smaller convolution kernel pattern set, which reduces the bit overhead of the pattern index;
(formula (1), shown as an image in the original, defines the objective: it minimizes the projection error of the weights onto the selected patterns together with the two λ- and β-weighted terms)
s.t. Σ_j x_lj ≤ V_l, x_lj ∈ {0, 1},   (2)
P_l = ∪_j p_lj·x_lj, p_lj ∈ F_n,   (3)
l = 1, ..., N, j = 1, ..., N_l;   (4)
wherein the convolutional neural network comprises N convolutional layers, and the set of weights of the convolutional layers is W = {W_1, ..., W_N}, W_l denoting the weights of the l-th convolutional layer; w_lj is the weight of the j-th convolution kernel of W_l, and N_l is the number of convolution kernels in W_l; n is the number of non-zero values per pattern; λ and β are parameters used to balance the latter two terms;
the complete set of pattern sets of all layers is F = {F_1, F_2, ..., F_n}, and the distilled set P = {P_1, P_2, ..., P_l} is extracted from F without redundant patterns; F_n is the full candidate set of patterns having n non-zero values, each F_n containing one pattern type for every choice of n non-zero positions in the kernel (a binomial-coefficient number of types, shown as a formula image in the original); t_n is the total number of patterns in F_n, t_n = |F_n|; P_l is the set of selected patterns of the l-th layer derived from F_n, each P_l containing only V_l pattern types, far fewer than t_n; V_l is the desired number of selected patterns, V_l = |P_l|;
when x_lj = 1, the corresponding pattern is selected into P_l, and vice versa; the projection function (shown as a formula image in the original) matches w_lj, keeping its entries of maximum absolute value, to the nearest pattern in P_l;
step 2, the weights of the convolutional layers are expressed as {S, P} in terms of sparsity and the distilled patterns, thereby realizing the pruning of the pattern-based convolutional neural network; wherein S = {s_1, s_2, ..., s_l} collects the per-layer sparsities, with the sparsity of the l-th convolutional layer determined by n_l / K_l, where n_l is the number of non-zero values and K_l is the convolution kernel size of that layer.
2. The pattern-aware accelerator of a pattern-based convolutional neural network according to claim 1, wherein the pattern-aware processing element performs pipelined data processing therein;
the first stage is the data preprocessing stage: the weights are restored to the original filters according to the pattern mask mapping table; for the activations, the data is loaded into the register file and an activation sparsity mask is generated to represent the zeros; for the patterns, each pattern is converted into a weight sparsity mask, analogously to the activations;
the second stage generates the sparse pointers: a pointer offset table is computed to locate the non-zero activation-weight pairs, and each pointer of a non-zero activation-weight pair is obtained step by step based on the pointer offset table until the whole filter computation is completed;
finally, in the third stage, when all partial sums from the respective input channels have been added together, a ReLU is used to obtain the final result.
3. The pattern-aware accelerator of a pattern-based convolutional neural network according to claim 2, wherein in the second stage the sparse pointers are generated by the following specific processing method:
the pointer offset table is computed by processing the filter pattern codes;
when the computation of each convolutional layer starts, configuration words are loaded to set up the pattern decoder;
based on the pattern decoder, each pattern code can be decoded into a corresponding pattern filter sparsity mask of 9-bit length; an input pattern code is matched to a specified pattern, from which a 9-bit pattern sparsity mask is obtained, where "1" represents a non-zero value and "0" represents a zero value;
meanwhile, in the activation register, the input vector is cached in the register file while each value of the vector is checked for zero, yielding an activation sparsity mask of 9-bit length;
after the AND operation between the pattern sparsity mask and the activation sparsity mask is enforced, a computation sparsity mask is obtained to represent the valid non-zero activation-weight pairs, and pointer offset processing is performed on the computation sparsity mask according to the pointer offset table to obtain the final sparse pointers.
4. The pattern-aware accelerator of a pattern-based convolutional neural network according to claim 3, wherein the pointer offset table for the sparsity-mask offsets locates each position at which a valid activation-weight pair is computed; the pointer offset table consists of a pointer head and pointer offsets; the pointer head indicates the position of the first valid pair; a pointer offset represents the distance between the current position and the nearest unused non-zero value; the pointer offset table is processed as follows:
first, a NOT operation is applied to each bit of the computation sparsity mask to accumulate the number of zeros between non-zeros;
second, each bit is accumulated through the negated mask, and the accumulated sum is reset once a 0 bit is observed;
the procedure for generating the pointers from the offset table is as follows:
first, the pointer is initialized to zero and added directly to the pointer head to find the first valid pair;
then, the pointer is incremented by the offset at the current position plus an extra 1 to find the next valid pair;
finally, after each accumulation, the pointer is compared with 9 to decide whether to end the workflow.
5. The pattern-aware accelerator of a pattern-based convolutional neural network according to claim 2, wherein the filters packed into the pattern-aware processing elements (PEs) comprise pattern masks and non-zero sequences stored in different SRAMs;
non-zero weights of fixed word length are read from the weight SRAM as non-zero sequence groups; in the weight SRAM, the weights are quantized to an 8-bit word length;
the pattern configurator reconfigures the weight register file according to the configuration words, and divides the vectors read from the pattern SRAM into groups according to the length of the non-zero sequences;
the pattern configurator reconfigures the group partitioning of the weights according to the configuration words, and the weights are then distributed to the pattern-aware PE group as filters;
accesses to the SRAM adopt bitwise group partitioning;
the pattern configurator is used to define how the pattern words are split, so as to load the subsequent shift registers and sequentially read the pattern codes to be fed to the PE array.
6. The pattern-aware accelerator of a pattern-based convolutional neural network according to claim 1, wherein in step 1, given V_l and n, the transformation from F_l to P_l obtained through the objective function is converted into knapsack-problem-based pattern distillation and solved by a greedy algorithm, until different settings of n and V_l have been traversed under the hardware constraints and the best setting is selected.
7. The pattern-aware accelerator of a pattern-based convolutional neural network according to claim 6, wherein the transformation from F_l to P_l obtained through the objective function is converted into knapsack-problem-based pattern selection as follows:
step a, pattern distillation is expressed by the following formulas:
(the objective, formula (8), is shown as an image in the original)
s.t. Σ_i x_li ≤ V_l, x_li ∈ {0, 1}, i = 1, ..., t_n,   (9)
P_l = ∪_i p_li·x_li, p_li ∈ F_n,   (10)
l = 1, ..., N, j = 1, ..., N_l;   (11)
step b, each w_lj is treated as a single knapsack with capacity 1, and representing W with the selected patterns is regarded as a multiple knapsack problem (MKP);
step c, since all w_lj have capacity 1, the knapsack-based pattern selection is in fact a multiple knapsack problem with identical capacities (MKP-1).
8. The pattern-aware accelerator of a pattern-based convolutional neural network according to claim 7, wherein the multiple knapsack problem MKP is solved by the greedy algorithm of the knapsack-problem-based pattern distillation optimization; specifically, during pattern selection, for each layer the most valuable pattern is selected for each convolution kernel, and the top V_l patterns that minimize the loss function are then taken to represent W_l.
9. The pattern-aware accelerator of a pattern-based convolutional neural network according to claim 1, wherein for the convolutional neural network, the convolution filters are forced to fit the selected pattern set so as to optimize the convolutional neural network: patterns are first selected to approximate the weights W of the convolutional neural network, and the network is then optimized using the alternating direction method of multipliers (ADMM); specifically,
given the original data set X, Y and the loss function L, the convolutional neural network is optimized as
min_{W, θ} L(f(X; W, θ), Y),  s.t. W ∈ P,
where f represents the function of the CNN, W represents the weights of the convolutional layers, θ represents the other parameters of the CNN, and P is the selected pattern set.
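The greedy, knapsack-style pattern distillation of claims 6 to 8 can be sketched in a few lines of Python. The sketch below is one illustrative reading, not the patented implementation: the "value" of a pattern for a kernel is taken here as the squared weight magnitude preserved when the kernel is projected onto that pattern, each kernel greedily picks its most valuable pattern, and the V_l highest-scoring patterns are kept to represent the layer.

import itertools
import numpy as np

def candidate_patterns(k=9, n=4):
    # F_n: all 9-bit masks with exactly n non-zeros (a binomial-coefficient count).
    masks = []
    for idx in itertools.combinations(range(k), n):
        m = np.zeros(k, dtype=bool)
        m[list(idx)] = True
        masks.append(m)
    return masks

def distill(kernels, v_l, n=4):
    # Greedy pattern distillation for one layer.
    # kernels: (N_l, 9) array of flattened 3x3 kernels.
    # Returns the V_l masks whose accumulated preserved energy is largest.
    patterns = candidate_patterns(n=n)
    score = np.zeros(len(patterns))
    for w in kernels:
        # Value of a pattern for this kernel: squared weight magnitude it keeps,
        # which is maximized by the pattern covering the n largest |w| entries.
        values = [np.sum((w * m) ** 2) for m in patterns]
        score[int(np.argmax(values))] += max(values)   # kernel picks its best pattern
    top = np.argsort(score)[::-1][:v_l]                # keep the top-V_l patterns
    return [patterns[i] for i in top]

rng = np.random.default_rng(0)
layer = rng.normal(size=(64, 9))            # hypothetical layer with 64 kernels
for m in distill(layer, v_l=4):
    print(m.astype(int).reshape(3, 3))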
CN202010130327.6A 2020-02-28 2020-02-28 Convolutional neural network pruning method based on patterns and pattern perception accelerator Active CN111368699B (en)

Priority Applications (1)

Application Number: CN202010130327.6A | Priority Date: 2020-02-28 | Filing Date: 2020-02-28 | Title: Convolutional neural network pruning method based on patterns and pattern perception accelerator

Publications (2)

Publication Number: CN111368699A | Publication Date: 2020-07-03
Publication Number: CN111368699B | Publication Date: 2023-04-07



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB03: Change of inventor or designer information (Inventor after: Tan Zhanhong; Inventor before: Ma Kaisheng, Tan Zhanhong)
GR01: Patent grant
TR01: Transfer of patent right (Effective date of registration: 20240529; Patentee after: Cross Information Core Technology Research Institute (Xi'an) Co.,Ltd. and TSINGHUA University, China; Patentee before: Cross Information Core Technology Research Institute (Xi'an) Co.,Ltd., China; Address: 5th floor, B3, phase II, software new town, tianguba Road, Yanta District, Xi'an City, Shaanxi Province, 710077)
Country or region before: China