CN112116061A - Weight and activation value quantization method for long short-term memory networks - Google Patents
Weight and activation value quantization method for long short-term memory networks
- Publication number
- CN112116061A (application CN202010774421.5A)
- Authority
- CN
- China
- Prior art keywords
- value
- quantization
- threshold
- num
- activation
- Prior art date
- 2020-08-04
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
  - G06N3/044—Recurrent networks, e.g. Hopfield networks
  - G06N3/045—Combinations of networks
  - G06N3/048—Activation functions
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention discloses a method for quantizing the weights and activation values of a long short-term memory (LSTM) network, comprising the steps of: 1) collecting the sets of weight values and activation values of the LSTM network; 2) determining the corresponding target quantization range and starting a threshold traversal loop in which, according to the scaling factor and saturation values computed for each candidate threshold, the weight and activation values are either scaled or set to the saturation value; 3) after the traversal is complete, computing the KL divergence between the initial set and the mapped set for the weights and the activations respectively, and finally outputting the truncation thresholds in the positive and negative directions together with the minimum KL divergence. The invention converts an LSTM network trained with high-precision floating-point numbers into a fixed-point network and innovatively designs a quantization structure for the weights and activation values of the LSTM network, reducing hardware overhead and increasing running speed while preserving the accuracy of the hardware implementation of the algorithm.
Description
Technical Field
The invention belongs to the field of recurrent neural network quantization, and in particular relates to a method for quantizing the weights and activation values of long short-term memory networks.
Background Art
With the continuously increasing computing power of graphics processors and general-purpose central processing units, the computational demands of artificial neural networks have been eased. Since 2012, neural-network-based artificial intelligence algorithms have developed rapidly and are widely used in fields such as pattern recognition, speech processing, and image processing. However, hardware performance has never kept pace with the evolution of the algorithms. The SSD network proposed in 2016 requires up to 50 billion operations and must run on large workstations, while desktop and mobile processors cannot sustain such a computational load. This greatly limits the application scenarios of neural networks, including terminal applications such as virtual reality and augmented reality. There are two approaches to this problem; one is to compress the redundant information in the neural network, for example by quantizing a full-precision floating-point network and shrinking the data bit width to reduce the volume of network parameters. At present, most quantization schemes target convolutional neural networks, which have many parameters and high computational complexity, and no quantization scheme exists for recurrent neural networks. The present invention targets the long short-term memory (LSTM) network, the most widely used recurrent neural network, and designs a quantization scheme suitable for small-scale LSTM networks. It reduces hardware design overhead while preserving network accuracy, allows LSTM networks to be deployed in mobile devices at low cost, and promotes the development of neural-network-based artificial intelligence algorithms.
Summary of the Invention
The purpose of the present invention is to provide a method for quantizing the weights and activation values of long short-term memory networks, converting an LSTM network trained with high-precision floating-point numbers into a fixed-point network, covering the quantization of both weight values and activation values. The fixed-point network has accuracy close to that of the floating-point network and can be deployed in terminal systems at low cost and high speed.
The present invention is implemented with the following technical solution:
A method for quantizing the weights and activation values of a long short-term memory network, comprising the following steps:
1) Collect the set of weight values of the LSTM network and determine the target quantization range of the weights; start the threshold traversal loop and, according to the scaling factor and saturation values computed for each candidate threshold, either scale each weight or set it to the saturation value. After the traversal is complete, compute the KL divergence between the initial weight set and the mapped set, and finally output the truncation thresholds in the negative and positive directions together with the minimum KL divergence;
2) Collect the set of activation values of the LSTM network and determine the initial and target quantization ranges of the activations; start the threshold traversal loop and, according to the scaling factor and saturation values computed for each candidate threshold, either scale each activation or set it to the saturation value. After the traversal is complete, compute the KL divergence between the initial activation set and the mapped set, and finally output the truncation thresholds in the negative and positive directions together with the minimum KL divergence. A sketch of the shared saturate-and-scale mapping follows.
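The saturate-and-scale mapping shared by steps 1) and 2) can be sketched as follows. This is a minimal illustration in NumPy, not the patent's reference implementation: the function name is invented here, and a single symmetric threshold is assumed, whereas the full method derives separate negative and positive truncation thresholds.

```python
import numpy as np

def saturate_map(x, threshold, num=128):
    """Linearly map values in [-threshold, threshold] onto the signed integer
    range [-num, num - 1]; values beyond the threshold clamp to the
    saturation values, matching the saturated mapping described above."""
    sf = (num - 1) / threshold                         # scaling factor sf
    x_q = np.round(np.asarray(x, dtype=np.float64) * sf)
    return np.clip(x_q, -num, num - 1)                 # low = -num, high = num - 1
```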
A further improvement of the present invention is that step 1) is implemented as follows:
101) The weight values of the long short-term memory network are the recurrent parameters or input parameters of the LSTM network; the weight value set is small and sparse, containing roughly 100 to 4,500 values;
102) For the input parameters of the LSTM network for blocking sequences, first search for a suitable target quantization range num; num takes the candidate values 2^N with N = 1 to 7. After computing with the specific quantization algorithm, the KL divergence of the optimal quantization scheme under each num is collected. Here the input parameters are quantized to INT3, i.e., num equals 8; since the recurrent parameters contain more data than the input parameters, they are quantized to INT4, i.e., num equals 16;
103) The best threshold is then searched by traversal under INT3, with the minimum grouping step of the weight values set to (max - min)/num/3. "Traversing all thresholds" amounts to selecting, among the 3×num initial groups, num groups whose data are scaled by the scaling factor sf, while the data outside these num groups are set to the saturation values low and high; the traversal over the 3×num initial groups is controlled by the loop variables j and g. The KL divergence is smallest when the negative-direction threshold takes the mean of the 2nd group (idk = 2) and the positive-direction threshold takes the mean of the 2nd group (idj = 2). When searching for the best threshold under INT4, the KL divergence is smallest when the negative-direction threshold takes the center value of the 6th group (idk = 6) and the positive-direction threshold takes the center value of the 5th group (idj = 5);
104) INT3 quantization finally divides the input parameters into 24 (3×num) groups; according to the optimal threshold found, the input parameters within groups 1-2 and 23-24 are removed and set to the negative-direction and positive-direction thresholds respectively. INT4 quantization finally divides the recurrent parameters into 48 (3×num) groups; according to the optimal threshold found, the weights within groups 1-6 and 44-48 are removed and set to the negative-direction and positive-direction thresholds respectively. A sketch of this two-sided threshold search appears below.
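A hedged sketch of the two-sided weight threshold search described in steps 102)-104): the weight range is split into 3×num bins, candidate truncation pairs (idk, idj) are swept jointly, and each candidate is scored by the KL divergence between the original and the saturated distribution. The helper names are invented, and the scoring below only clamps the data, whereas the full method also rescales the retained groups to num levels; treat this as a simplified illustration.

```python
import numpy as np

EPS = np.finfo(np.float32).eps          # stands in for MATLAB's single-precision eps

def _dist(x, edges):
    """Normalized histogram over fixed bin edges, with zero bins set to EPS
    (the Equation (1) correction) so the KL divergence stays finite."""
    c, _ = np.histogram(x, bins=edges)
    p = c.astype(np.float64)
    p[p == 0] = EPS
    return p / p.sum()

def _kl(q, p):
    return float(np.sum(q * np.log(q / p)))

def search_weight_thresholds(w, num=8, subdiv=3):
    """Sweep two-sided truncation thresholds over subdiv*num initial bins
    (INT3: 24 groups) and return the (idk, idj) pair with minimal KL."""
    w = np.asarray(w, dtype=np.float64)
    edges = np.linspace(w.min(), w.max(), num * subdiv + 1)  # step = (max-min)/num/subdiv
    q = _dist(w, edges)                   # reference (floating-point) distribution
    best = (1, 1, np.inf)
    for idk in range(1, num * subdiv // 2):        # groups truncated on the negative side
        for idj in range(1, num * subdiv // 2):    # groups truncated on the positive side
            t_neg, t_pos = edges[idk], edges[-(idj + 1)]
            p = _dist(np.clip(w, t_neg, t_pos), edges)   # saturated distribution
            kl = _kl(q, p)
            if kl < best[2]:
                best = (idk, idj, kl)
    return best
```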
A further improvement of the present invention is that step 2) is implemented as follows:
201) The activation values of the long short-term memory network are collected by running the LSTM network for 1,000 iterations; they are, for example, the output data of the multiplication array of the gate module. The activation value set is unrestricted in size and relatively dense, containing about 10^6 values;
202) For the activation values of the LSTM network, first search for the initial lossless quantization range num1 and a suitable target quantization range num. After computing with the same specific quantization algorithm, num, i.e., INT8, equals 128 and num1, i.e., INT16, equals 65,536; the KL divergence of the optimal quantization scheme under each pair of num1 and num is collected;
203) The best threshold is then searched by traversal under INT8, with the minimum grouping step of the activation values set to (max - min)/num1. The traversal is controlled only by the loop variable k; the maximum of the threshold traversal range is the bit-width difference between the initial grouping num1 and the target grouping num, which for INT16 and INT8 is an integer between 0 and 8. Since the threshold traversal of the activation values is controlled by a single variable, the quantization scheme with idk = 1 is selected here;
204) Finally, the activation parameters are divided into 1,024 (8×num) groups; according to the optimal threshold found, the activation values within groups 1-256 and 512-1,024 are removed and set to the negative-direction and positive-direction thresholds respectively. A sketch of this single-variable search follows.
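The activation search of steps 202)-204) reduces to a sweep over a single bit-shift variable k. The sketch below assumes a symmetric range and reuses the _dist and _kl helpers from the weight-search sketch above; the function name and the halving of the retained range per step are assumptions consistent with the 0-to-8 shift range stated in 203).

```python
def search_activation_threshold(a, num_bits=8, num1_bits=16):
    """Single-variable threshold search for activations: sweep the bit-width
    shift k between the lossless grid (INT16) and the target grid (INT8)
    and keep the k with minimal KL divergence."""
    a = np.asarray(a, dtype=np.float64)
    num1 = 1 << num1_bits                          # 65,536 bins, lossless grid
    t_max = max(abs(a.min()), abs(a.max()))
    edges = np.linspace(-t_max, t_max, num1 + 1)   # step = (max - min)/num1
    q = _dist(a, edges)
    best = (0, np.inf)
    for k in range(num1_bits - num_bits + 1):      # k in 0..8 for INT16 -> INT8
        t = t_max / (2 ** k)                       # candidate truncation threshold
        p = _dist(np.clip(a, -t, t), edges)
        kl = _kl(q, p)
        if kl < best[1]:
            best = (k, kl)
    return best
```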
The present invention has at least the following beneficial technical effects:
During training, a neural network learns the pattern separability of the data samples, and the noise present in the data makes the network robust: small perturbations of the input samples do not substantially affect performance. Introducing noise changes the activation outputs of the individual layers, yet has little effect on the result; in other words, a trained network tolerates such noise to a certain degree. Using a high-precision numerical representation during training, such as FP16, gives the network this tolerance. When low-precision data are used at inference time to represent the network parameters and activation values, an error arises, and this error is also a form of noise. Consequently, the deviation introduced by low-precision data such as fixed-point INT8 stays within the network's tolerance and does not substantially affect the result. Ultimately, in the LSTM network predictor, the activation values are represented with INT8 data and the weights with INT4 and INT3; compared against the golden reference, the accuracy of the LSTM predictor drops only marginally, with the change confined to between -0.06% and +0.01%. This result shows that the quantization method designed here is suitable for the LSTM-network-based cache replacement algorithm: it reduces hardware overhead and increases running speed while preserving the accuracy of the hardware implementation of the algorithm. The goal of the present invention is to quantize a trained FP16 LSTM network to INT8 or even INT4. The quantization operation targets each component of the LSTM network, such as the multiplier array of the gate module and the Sigmoid activation function of the network module, and determines the most suitable output bit width of each module. The specific scheme optimizes the TensorRT INT8 convolutional-neural-network quantization scheme for the activation and weight distributions of the LSTM network, yielding a set of quantization methods suitable for LSTM networks. In summary, the present invention has the following features:
1. The quantization scheme distinguishes between the weight values and the activation values of the LSTM network: weights are quantized offline, activations online.
2. The quantization scheme is based on linear quantization and is optimized for the distribution of LSTM network data.
The present invention has the following advantages:
1. The present invention quantizes the weight values offline, reducing their data bit width as far as possible, and quantizes the activation values online, preventing the quantization scheme from incurring additional hardware resource overhead.
2. The present invention adopts a linear, saturating quantization method: with the KL divergence as the criterion, it searches for the threshold over the whole range, sets values whose distribution exceeds the threshold to the saturation value, and converts a scattered distribution into a dense one, reducing the quantization error.
3. The present invention corrects the problem that sparse data cause errors in the KL divergence computation for LSTM networks, so that the scheme can adapt to LSTM networks of different sizes.
Brief Description of the Drawings
Figure 1 is the hardware architecture diagram of the long short-term memory network;
Figure 2 is a schematic diagram of the saturated quantization process designed by the present invention;
Figure 3 is the pseudocode of the quantization process for weight and activation values designed by the present invention;
Figure 4 is the hardware structure diagram of weight-value quantization designed by the present invention;
Figure 5 is the hardware structure diagram of activation-value quantization designed by the present invention;
Figure 6 shows the trend of the KL divergence versus the integer range under INT3 quantization;
Figure 7 shows the trend of the KL divergence versus idk and idj under INT3 quantization;
Figure 8 is a schematic diagram of the retained and removed groups under INT3 quantization;
Figure 9 is a schematic comparison of INT3 quantization and FP16.
Detailed Description of the Embodiments
The present invention is further described below with reference to the accompanying drawings and embodiments.
In the quantization method for long short-term memory networks proposed by the present invention, the hardware of the LSTM network is shown in Figure 1; it consists of three parts: a gate module, a network module, and a storage module. The gate module computes the input gate vector, output gate vector, forget gate vector, and memory cell vector of the LSTM network. The network module is responsible for the activation function computation, network state computation, and output vector computation. The storage module stores the parameters and states of the LSTM network.
The quantization scheme proposed by the present invention adopts a saturated quantization method. A threshold T is set, and the FP16 data in the interval -|T| to |T| are mapped to the range -128 to 127; the threshold T satisfies 0 < T < 65504. Data outside the threshold range are mapped to the saturation value -128 or 127, as illustrated in Figure 2. When the threshold is chosen appropriately, the scattered, large activation or weight values are removed, preserving the mapping accuracy. A small worked example follows.
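As a usage illustration of the mapping just described (the numbers are invented, not taken from the patent): with T = 0.40 the scaling factor is 127/0.40 = 317.5, so a weight of 0.12 maps to round(0.12 × 317.5) = 38, while an outlier of 0.45 saturates to 127. Using the saturate_map sketch defined earlier:

```python
# Illustrative values only; saturate_map is the sketch defined earlier.
w = [0.12, -0.31, 0.45, -0.52]
print(saturate_map(w, threshold=0.40))   # -> [  38.  -98.  127. -128.]
```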
The threshold T in the proposed saturated quantization scheme is computed as follows. First, the FP16 data distribution is taken as the reference; representing this optimal distribution with INT8 amounts to re-encoding the information, and the relative entropy, also known as the KL divergence, is used to measure the difference between the two distributions. The larger the relative entropy, the larger the difference between the distributions, and vice versa. Then all thresholds are traversed and the KL divergence is computed for each. Finally, the threshold with the smallest KL divergence is selected; under this threshold, mapping with the saturated linear quantization method yields the smallest difference between the FP16 and INT8 distributions.
The KL divergence in the proposed saturated quantization scheme is computed as in Equation (1), where q denotes the data distribution under the optimal encoding, e.g., FP16, and p denotes the re-encoded data distribution, e.g., INT8. The KL divergence expresses the difference between the two distributions, and its value is necessarily greater than zero. However, when the data set is sparse, as with the weight parameters, p(x) or q(x) may be 0, and because of the log function the KL divergence computation would then fail. To prevent this, whenever p(x) or q(x) is 0 it is assigned the value EPS, i.e., MATLAB single-precision floating-point accuracy. At the same time, to guarantee that the computed KL divergence remains greater than zero, Gibbs' inequality requires ∑p and ∑q to equal 1, so one EPS value is subtracted from a randomly chosen entry of the distribution. A code sketch of this correction follows Equation (1).
if p(x) = 0 or q(x) = 0, then p(x) or q(x) := +EPS
p(rand) or q(rand) := p(rand) or q(rand) - EPS    (1)
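Equation (1) can be written out directly. The sketch below assumes NumPy, with EPS standing in for MATLAB's single-precision eps; it is a standalone version of the same correction inlined in the terse _dist/_kl helpers earlier. Note that the patent subtracts a single EPS from one randomly chosen entry, so when many bins were zero the sum is restored only approximately; that choice is kept here.

```python
import numpy as np

EPS = np.finfo(np.float32).eps   # MATLAB single-precision eps

def kl_divergence(q, p):
    """KL(q || p) with the sparse-data correction of Equation (1): zero
    entries are set to EPS, and EPS is subtracted from one randomly chosen
    nonzero entry so the distribution still sums to (almost exactly) 1,
    keeping the divergence positive per Gibbs' inequality."""
    q = np.array(q, dtype=np.float64)
    p = np.array(p, dtype=np.float64)
    for d in (q, p):
        zeros = d == 0
        if zeros.any():
            d[zeros] = EPS                                # p(x) or q(x) := EPS
            idx = np.random.choice(np.flatnonzero(~zeros))
            d[idx] -= EPS                                 # subtract EPS at random
    return float(np.sum(q * np.log(q / p)))
```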
The pseudocode of the proposed quantization scheme is shown in Figure 3, and the implementation steps are as follows:
1. Collect the activation or weight values. The weight values are the recurrent parameters or input parameters of the LSTM network; the activation values are the output data of each module (e.g., the multiplication array of the gate module) over 1,000 iterations of the LSTM network. Weight values and activation values of different sequences are quantized separately. The weight value set is small and sparse, containing roughly 100 to 4,500 values; the activation value set is unrestricted in size and relatively dense, about 10^6 values in this work.
2. Determine the set of threshold traversal ranges and start the threshold traversal loop. Compute the scaling factor sf and the saturation values high and low for each candidate threshold;
3. Traverse the activation or weight set and, using the scaling factor and saturation values computed in step 2, scale each activation or weight value or set it to the saturation value;
4. After the traversal of the activation or weight values is complete, compute the KL divergence between the initial set x and the mapped set x_tmp, compare it with the result of the previous iteration, and keep the result with the smaller KL divergence. The algorithm finally outputs idk, the truncation threshold in the negative direction, idj, the truncation threshold in the positive direction, and the minimum KL divergence. An end-to-end usage illustration of the sketches above follows.
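An end-to-end usage illustration of the sketches above, on synthetic data; the patent's kl_comp_weight() corresponds roughly to search_weight_thresholds here, and all names and numbers are invented.

```python
# Synthetic data stand-ins; the real sets come from the trained LSTM network.
rng = np.random.default_rng(0)

weights = rng.normal(0.0, 0.15, size=2048)       # stand-in for input parameters
idk, idj, kl_w = search_weight_thresholds(weights, num=8, subdiv=3)   # INT3
print(f"weights: idk={idk}, idj={idj}, KL={kl_w:.4f}")

acts = rng.normal(0.0, 2.0, size=10**6)          # stand-in for activation values
k, kl_a = search_activation_threshold(acts)      # INT16 grid -> INT8 target
print(f"activations: bit shift k={k}, KL={kl_a:.4f}")
```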
Performance Testing of the Invention
Following the exemplary weight quantization method, the parameters of the LSTM network are quantized in turn; the input parameters are represented with INT3 values and the recurrent parameters with INT4 values. Following the exemplary activation quantization method, each module of the LSTM network is quantized in turn, a process called calibration; the results are shown in Table 2. For the Sigmoid activation function module, the output bit width is 25 bits when full precision is retained; after quantization it is truncated to 10 bits. This indicates that the activation values of the Sigmoid function are heavily concentrated near the threshold, which is determined by the characteristics of the Sigmoid function; therefore, even after a large number of values are set to the saturation value, the error remains small. For the Sigmoid activation function module, the hardware structure for activation quantization is shown in Figure 5: the high-order bits [25:12] are ANDed, and the result controls a selector that outputs either the low-order bits [11:2] or the saturation value 512; if the MSB is 1, the two's complement is output. Likewise, for the input parameters of the multiplier B module, the hardware structure for weight quantization is shown in Figure 4: the high-order bits [18:8] are ANDed, and the result controls a selector that outputs either the low-order bits [7:0] or the saturation value 128; if the MSB is 1, the two's complement is output. A bit-level model of this selector follows.
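The selector described for Figures 4 and 5 can be modeled at bit level as below. The text specifies only a wired-AND of the high-order bits; this sketch interprets "fits the low-order field" as the high field being all zeros or all ones, and models the two's-complement output as negation. It is an interpretation for illustration, not a verified mapping of the figures.

```python
def truncate_field(v, width=26, hi=(25, 12), lo=(11, 2), sat=512):
    """Bit-level model of the Figure 4/5 selector: if the high-order field
    hi indicates the value does not fit the low-order field lo, output the
    saturation value; otherwise pass the low-order field through. When the
    MSB (sign bit) is 1, output the two's complement, modeled as negation."""
    msb = (v >> (width - 1)) & 0x1
    hi_width = hi[0] - hi[1] + 1
    hi_bits = (v >> hi[1]) & ((1 << hi_width) - 1)
    fits = hi_bits == 0 or hi_bits == (1 << hi_width) - 1   # all 0s or all 1s
    lo_bits = (v >> lo[1]) & ((1 << (lo[0] - lo[1] + 1)) - 1)
    out = lo_bits if fits else sat
    return -out if msb else out

# The multiplier B weight path of Figure 4 would use, under the same
# interpretation: truncate_field(v, width=19, hi=(18, 8), lo=(7, 0), sat=128)
```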
After calibration, the accuracy of each LSTM hardware implementation is compared with that of the FP16 software algorithm in Table 1. The drop in accuracy is very small, no more than 0.06%, and the accuracy of some predictors even improves by 0.01%. The results show that the quantization method designed here is suitable for the specific LSTM network algorithm: it reduces hardware overhead and increases running speed while preserving the accuracy of the hardware implementation of the algorithm.
Table 1. Comparison of the hardware accuracy after quantization of various LSTM activation values with the accuracy of the FP16 software algorithm
Table 2. Quantization results of each module of the LSTM network in the present invention
Embodiment
The present invention can be implemented in the software and hardware system of a long short-term memory network.
Taking the input parameters of the LSTM network as an example, the weight quantization proceeds in three steps:
In the first step, a suitable target quantization range num is searched. In the experiment, num was assigned the candidate values 2^N with N = 1 to 7; the kl_comp_weight() function was run and the KL divergence of the optimal quantization scheme under each num was collected, with the results shown in Figure 6. As num increases, the KL divergence decreases: the larger the bit width of the fixed-point number, the closer the fixed-point distribution is to the floating-point distribution and the smaller the error. Once num exceeds 8, the reduction in KL divergence with increasing num falls off rapidly. To lower the hardware design cost, INT3 is chosen for quantizing the input parameters.
In the second step, the best threshold is searched under INT3, i.e., with num equal to 8. The search process is shown in Figure 7, which plots the KL divergence between the quantized fixed-point numbers and the floating-point numbers under different thresholds. The KL divergence is smallest, at 0.064, when the negative-direction threshold takes the mean of the 2nd group (idk = 2) and the positive-direction threshold takes the mean of the 2nd group (idj = 2).
In the third step, the distribution of the input parameters shown in Figure 8 is divided into 24 (3×num) groups and drawn as a histogram. According to the optimal threshold obtained from Figure 7, the input parameters within the first two and the last two groups are removed and set to the negative-direction and positive-direction thresholds respectively.
Finally, the distributions of the FP16 data before quantization and the INT3 data after quantization, sorted by magnitude, are shown in Figure 9. The input parameters before quantization range roughly from -0.39 to 0.45. After the quantization method kl_comp_weight() is executed, the two data points in the outermost groups, scattered as shown inside the solid circle in Figure 9 and beyond the threshold, are set to the saturation value, while the dense data at the center of the distribution are quantized to INT3 with high accuracy.
Claims (3)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010774421.5A CN112116061A (en) | 2020-08-04 | 2020-08-04 | Weight and activation value quantification method for long-term and short-term memory network |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010774421.5A CN112116061A (en) | 2020-08-04 | 2020-08-04 | Weight and activation value quantification method for long-term and short-term memory network |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN112116061A true CN112116061A (en) | 2020-12-22 |
Family
ID=73799604
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010774421.5A Pending CN112116061A (en) | 2020-08-04 | 2020-08-04 | Weight and activation value quantification method for long-term and short-term memory network |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112116061A (en) |
- 2020-08-04: CN CN202010774421.5A patent/CN112116061A/en active Pending
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113408696A (en) * | 2021-05-17 | 2021-09-17 | 珠海亿智电子科技有限公司 | Fixed point quantization method and device of deep learning model |
| WO2023003432A1 (en) * | 2021-07-22 | 2023-01-26 | 주식회사 사피온코리아 | Method and device for determining saturation ratio-based quantization range for quantization of neural network |
| KR20230015186A (en) * | 2021-07-22 | 2023-01-31 | 주식회사 사피온코리아 | Method and Device for Determining Saturation Ratio-Based Quantization Range for Quantization of Neural Network |
| KR102813466B1 (en) | 2021-07-22 | 2025-05-28 | 리벨리온 주식회사 | Method and Device for Determining Saturation Ratio-Based Quantization Range for Quantization of Neural Network |
| CN114219306A (en) * | 2021-12-16 | 2022-03-22 | 蕴硕物联技术(上海)有限公司 | Method, apparatus, medium, and program product for creating a weld quality detection model |
| CN114418087A (en) * | 2021-12-24 | 2022-04-29 | 北京奕斯伟计算技术有限公司 | Model quantification method, device and equipment based on optimized kl divergence |
| CN114998661A (en) * | 2022-06-22 | 2022-09-02 | 山东浪潮科学研究院有限公司 | A target detection method based on fixed-point quantization |
| CN114998661B (en) * | 2022-06-22 | 2024-08-27 | 山东浪潮科学研究院有限公司 | Target detection method based on fixed point quantitative determination |
| CN115294638A (en) * | 2022-06-30 | 2022-11-04 | 青岛熙正数字科技有限公司 | Iris identification system deployment method based on FPGA, iris identification method and system |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112116061A (en) | Weight and activation value quantification method for long-term and short-term memory network | |
| Kim et al. | Zero-centered fixed-point quantization with iterative retraining for deep convolutional neural network-based object detectors | |
| Johnson | Rethinking floating point for deep learning | |
| CN109543830B (en) | Splitting accumulator for convolutional neural network accelerator | |
| CN110363281A (en) | A convolutional neural network quantization method, device, computer and storage medium | |
| CN110717585B (en) | Training methods, data processing methods and related products of neural network models | |
| US20220004884A1 (en) | Convolutional Neural Network Computing Acceleration Method and Apparatus, Device, and Medium | |
| JP7231731B2 (en) | Adaptive quantization method and apparatus, device, medium | |
| CN111695671A (en) | Method and device for training neural network and electronic equipment | |
| CN111950715A (en) | 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift | |
| CN113741858B (en) | In-memory multiplication and addition calculation method, device, chip and computing device | |
| CN110265002A (en) | Audio recognition method, device, computer equipment and computer readable storage medium | |
| US20240004952A1 (en) | Hardware-Aware Mixed-Precision Quantization | |
| Jiang et al. | A low-latency LSTM accelerator using balanced sparsity based on FPGA | |
| WO2021213649A1 (en) | Method and system for generating a predictive model | |
| KR102651452B1 (en) | Quantization method of deep learning network | |
| WO2022247368A1 (en) | Methods, systems, and mediafor low-bit neural networks using bit shift operations | |
| CN112613604A (en) | Neural network quantification method and device | |
| CN120409566B (en) | A method and system for joint quantization of weights and activations of large language models | |
| US20250252311A1 (en) | System and method for adaptation of containers for floating-point data for training of a machine learning model | |
| EP4481554A1 (en) | Methods for decomposition of high-precision matrix multiplications into multiple matrix multiplications of different data types | |
| CN117311663A (en) | A reconfigurable approximate multiply-accumulate unit based on neural network activation distribution | |
| CN114170490B (en) | Image recognition method and system based on self-adaptive data quantization and polyhedral template | |
| CN116611494A (en) | Training method and device for electric power defect detection model, computer equipment and medium | |
| CN118690797A (en) | Method, apparatus, device, medium and program product for precision conversion |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20201222 |