
CN116611470A - Method for setting quantization bit number of neural network model - Google Patents


Info

Publication number
CN116611470A
Authority
CN
China
Prior art keywords
quantization
model
memory
loss
activation value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310460434.9A
Other languages
Chinese (zh)
Inventor
陈其宾
段强
姜凯
李锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Science Research Institute Co Ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Science Research Institute Co Ltd filed Critical Shandong Inspur Science Research Institute Co Ltd
Priority to CN202310460434.9A priority Critical patent/CN116611470A/en
Priority to PCT/CN2023/100668 priority patent/WO2024221573A1/en
Publication of CN116611470A publication Critical patent/CN116611470A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention relates to the technical field of model quantization, and specifically to a method for setting the quantization bit widths of a neural network model, comprising the following steps: prepare a dataset containing a small amount of data; define a quantization-loss evaluation metric; define the candidate model quantization bit widths; obtain the model-related memory limit; define the target neural network model; compute the quantization loss of each activation at every candidate bit width; and analyze the model inference steps and the activations resident in memory at each moment. The beneficial effect is that the proposed method solves the problem of selecting activation quantization bit widths under a memory constraint. It first analyzes the accuracy loss of activations at different bit widths, then analyzes the set of activations in memory at each moment, and constructs an optimization problem: subject to a limit on the memory occupied by activations at every moment, integer linear programming yields the bit-width assignment with the minimal accuracy loss.

Description

A method for setting the quantization bit widths of a neural network model

Technical Field

The present invention relates to the technical field of model quantization, and in particular to a method for setting the quantization bit widths of a neural network model.

Background Art

With the continuous development of deep learning, neural network models are widely used across many industries and scenarios. However, deep learning models have large parameter counts and heavy computation, placing high demands on hardware resources that often conflict with the limitations of those resources.

In the prior art, the Internet of Things field contains a large number of embedded devices whose hardware resources are so limited that deploying deep learning models on them is difficult. In addition, deep learning models keep growing; even on servers or in the cloud, the pressure of deploying large models is becoming apparent. Among these constraints, memory is the key factor restricting deployment: it determines whether a model can be deployed on a device at all.

During the inference of a deep learning model, it is generally the activations that occupy most of the memory. Model quantization, especially mixed-precision quantization, is an effective way to reduce a model's memory footprint. Compared with conventional quantization, however, mixed-precision quantization often causes a larger accuracy loss, and choosing different quantization bit widths for different weights and activations is the key factor determining that loss. Existing methods either address mixed-precision quantization of model weights without covering the choice of activation bit widths, or select activation bit widths without targeting the model's accuracy loss.

Summary of the Invention

The purpose of the present invention is to provide a method for setting the quantization bit widths of a neural network model, so as to solve the problems raised in the Background Art above.

To achieve the above purpose, the present invention provides the following technical solution: a method for setting the quantization bit widths of a neural network model, the method comprising the following steps:

Prepare a dataset containing a small amount of data;

Define a quantization-loss evaluation metric;

Define the candidate model quantization bit widths;

Obtain the model-related memory limit;

Define the target neural network model;

Compute the quantization loss of each activation at every candidate bit width;

Analyze the model inference steps and the activations in memory at each moment;

Construct the optimization problem and solve it by integer linear programming;

Quantize the model and run inference.

Preferably, when preparing the dataset containing a small amount of data, the dataset serves both as the calibration dataset and as the dataset for computing quantization loss.

Preferably, when defining the quantization-loss evaluation metric, the difference between the unquantized activation data and the quantized activation data is computed as the quantization loss; the difference may be measured by, but is not limited to, mean squared error or cosine similarity.

Preferably, when defining the model quantization bit widths, the bit widths are chosen based on hardware support and model accuracy requirements, for example but not limited to 2, 4, 6 and 8 bits; Bitwidth denotes the set of candidate bit widths,

Bitwidth = {2, 4, 6, 8}.

Preferably, when obtaining the model-related memory limit: during inference, the inference-related memory footprint comprises the inference-framework memory, code and data memory, model-weight memory and activation memory, of which the model-related part consists mainly of the model-weight memory and the activation memory. Let the hardware memory size be m_total and the memory occupied outside the model-related part be m_other; the remaining memory, m_model, is the model-related memory limit.

Preferably, when defining the target neural network model, the model is a mainstream neural network such as, but not limited to, ResNet, MobileNet or SSD.

Preferably, when computing the quantization loss of each activation at every candidate bit width: for each sample in the loss-computation dataset, run inference with the unquantized model and collect all activation data; for every activation and every candidate bit width, compute the quantization loss; after all samples have been processed, aggregate the per-sample losses (by, but not limited to, the mean) into the loss set Loss over all activations and bit widths. The formula is as follows, where loss_ab denotes the quantization loss of activation a at bit width b: Loss = {loss_ab}, a ∈ Activations, b ∈ Bitwidth.

Preferably, when analyzing the model inference steps and the activations in memory at each moment: during inference the model weights stay in memory throughout, whereas activations that are no longer needed are destroyed promptly to release their memory. Let Activations denote all activations of the model and activation_set_t the set of activations in memory at moment t; any activation_set_t is therefore a subset of Activations, and the same activation may appear in the sets of several moments. Stepping through the inference yields the collection of in-memory activation sets at all moments, Activation_sets = {activation_set_t | t ∈ T}, where T is the set of all moments in the inference process.

Preferably, when constructing the optimization problem and solving it by integer linear programming: the objective is to minimize the total quantization loss, and the constraint is that the memory occupied by the activation set at every moment is below the activation memory limit. Since the memory occupied by the model weights, m_weight, can be computed in advance, the activation memory limit m_c is computed as

m_c = m_model − m_weight

Let total_loss denote the total quantization loss, m_t the memory occupied by the activation set at moment t, m_a the memory occupied by activation a, size_a the number of elements of activation a, and b_a the quantization bit width of activation a; loss_{a,b_a} is obtained from the set Loss:

min total_loss, where total_loss = Σ_{a ∈ Activations} loss_{a,b_a}

subject to m_t = Σ_{a ∈ activation_set_t} m_a ≤ m_c for every t ∈ T, with

m_a = size_a · b_a

The optimization problem is an integer linear program whose variables are the quantization bit widths b_a; solving it by integer linear programming yields the quantization bit width of each activation.

Preferably, during model quantization and inference, the model is quantized with the optimized bit widths and inference is then performed.

Compared with the prior art, the beneficial effects of the present invention are:

The proposed method solves the problem of selecting activation quantization bit widths under a memory constraint. It first analyzes the accuracy loss of activations at different bit widths, then analyzes the set of activations in memory at each moment, and constructs an optimization problem: subject to a limit on the memory occupied by activations at every moment, integer linear programming yields the minimal accuracy loss. In this way, the optimal activation quantization bit widths can be chosen under the hardware memory limit so that the accuracy loss is minimized, which has high practical and innovative value.

Brief Description of the Drawings

Fig. 1 is a flow chart of the method of the present invention.

Detailed Description of the Embodiments

To describe the purpose, technical solution and advantages of the present invention clearly and completely, the embodiments of the present invention are described in further detail below with reference to the accompanying drawing. It should be understood that the specific embodiments described here are only some, not all, of the embodiments of the present invention; they serve to explain the embodiments, not to limit them. All other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, the present invention provides a technical solution: a method for setting the quantization bit widths of a neural network model, the method comprising the following steps.

First, prepare a dataset containing a small amount of data. This dataset serves both as the calibration dataset and as the dataset for computing quantization loss.

Second, define the quantization-loss evaluation metric. The quantization loss is the difference between the unquantized activation data and the quantized activation data; the difference may be measured by, but is not limited to, mean squared error or cosine similarity. Unquantized activation data are obtained by evaluating the corresponding operator (convolution, fully connected layer, etc.) with floating-point weights and input (the previous activation). Quantized activation data are obtained by quantizing the floating-point weights and the input separately, evaluating the operator on the quantized data, and dequantizing the result.
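The quantize–dequantize comparison described above can be sketched in Python. The symmetric per-tensor scheme and the function names below are illustrative assumptions, not necessarily the patent's exact quantizer:

```python
import numpy as np

def fake_quantize(x, bits):
    # symmetric per-tensor quantize-then-dequantize (illustrative scheme)
    qmax = 2.0 ** (bits - 1) - 1
    scale = max(float(np.max(np.abs(x))), 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def quantization_loss(x, bits, metric="mse"):
    # difference between the float activation and its quantized copy
    xq = fake_quantize(x, bits)
    if metric == "mse":
        return float(np.mean((x - xq) ** 2))
    if metric == "cosine":  # 1 - cosine similarity, so lower is better
        num = float(np.dot(x.ravel(), xq.ravel()))
        den = float(np.linalg.norm(x) * np.linalg.norm(xq)) or 1e-12
        return 1.0 - num / den
    raise ValueError(metric)

rng = np.random.default_rng(0)
act = rng.standard_normal((4, 8))
# coarser bit widths should lose more information
assert quantization_loss(act, 8) < quantization_loss(act, 2)
```

In a real pipeline the operator's weights and input would each be quantized before the computation and the output dequantized, as the paragraph above describes; the sketch only quantizes a finished activation tensor to keep the loss metric in focus.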

Third, define the model quantization bit widths. Based on hardware support and model accuracy requirements, the bit widths may be, but are not limited to, 2, 4, 6 and 8 bits; Bitwidth denotes the set of candidate bit widths.

Bitwidth = {2, 4, 6, 8}

Fourth, obtain the model-related memory limit. During inference, the inference-related memory footprint includes the inference-framework memory, code and data memory, model-weight memory and activation memory, of which the model-related part consists mainly of the model-weight memory and the activation memory. Let the hardware memory size be m_total and the memory occupied outside the model-related part be m_other; the remaining memory, m_model, is the model-related memory limit.

m_model = m_total − m_other

Fifth, define the target neural network model. The model is a mainstream neural network such as, but not limited to, ResNet, MobileNet or SSD.

Sixth, compute the quantization loss of each activation at every candidate bit width. For each sample in the loss-computation dataset, run inference with the unquantized model and collect all activation data. For every activation and every candidate bit width, compute the quantization loss. After all samples have been processed, aggregate the per-sample losses (by, but not limited to, the mean) into the loss set Loss of all activations at all bit widths. The formula is as follows, where loss_ab denotes the quantization loss of activation a at bit width b.

Loss = {loss_ab}, a ∈ Activations, b ∈ Bitwidth
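As a sketch of this step, the loop below builds the Loss table for a hypothetical two-layer stand-in model; the model, the activation names `a1`/`a2`, and the quantizer are assumptions for illustration, not the patent's model:

```python
import numpy as np

BITWIDTH = [2, 4, 6, 8]

def fake_quantize(x, bits):
    # symmetric per-tensor quantize-then-dequantize (illustrative)
    qmax = 2.0 ** (bits - 1) - 1
    scale = max(float(np.max(np.abs(x))), 1e-12) / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def float_inference(sample):
    # stand-in for the pre-quantization model: returns named activations
    w1 = np.linspace(-1.0, 1.0, 8 * 16).reshape(8, 16)
    w2 = np.linspace(-1.0, 1.0, 16 * 4).reshape(16, 4)
    a1 = np.tanh(sample @ w1)
    a2 = np.maximum(a1 @ w2, 0.0)
    return {"a1": a1, "a2": a2}

rng = np.random.default_rng(0)
samples = [rng.standard_normal(8) for _ in range(16)]  # the "small dataset"

per_sample = {}  # (activation name, bit width) -> list of per-sample MSEs
for s in samples:
    for name, a in float_inference(s).items():
        for b in BITWIDTH:
            err = float(np.mean((a - fake_quantize(a, b)) ** 2))
            per_sample.setdefault((name, b), []).append(err)

# Loss = {loss_ab}: per-sample losses aggregated here by the mean
Loss = {key: float(np.mean(v)) for key, v in per_sample.items()}
```

The resulting dictionary plays the role of the set Loss in the formula above: one mean loss per (activation, bit width) pair.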

Seventh, analyze the model inference steps and the activations in memory at each moment. During inference the model weights stay in memory throughout, whereas activations that are no longer needed are destroyed promptly to release their memory. Let Activations denote all activations of the model and activation_set_t the set of activations in memory at moment t; any activation_set_t is therefore a subset of Activations, and the same activation may appear in the sets of several moments. Stepping through the inference yields the collection of in-memory activation sets at all moments, Activation_sets = {activation_set_t | t ∈ T}, where T is the set of all moments in the inference process.
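A minimal liveness analysis in the spirit of this step might look as follows; the five activation lifetimes are hypothetical, standing in for what a real inference trace would record:

```python
# Each activation is produced at step `born` and last consumed at step
# `dies`; between those steps it must stay resident in memory.
lifetimes = [
    ("a1", 0, 1),
    ("a2", 1, 3),  # e.g. a skip connection keeps a2 alive across a3
    ("a3", 2, 3),
    ("a4", 3, 4),
    ("a5", 4, 4),
]

T = range(5)  # the moments of the inference process
activation_sets = {
    t: {name for name, born, dies in lifetimes if born <= t <= dies}
    for t in T
}

# every activation_set_t is a subset of Activations, and the same
# activation appears in several sets while it remains live
Activations = {name for name, _, _ in lifetimes}
assert all(s <= Activations for s in activation_sets.values())
assert activation_sets[2] == {"a2", "a3"}
```

These per-moment sets are exactly what the memory constraint in the next step ranges over.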

Eighth, construct the optimization problem and solve it by integer linear programming. The objective is to minimize the total quantization loss; the constraint is that the memory occupied by the activation set at every moment is below the activation memory limit. Since the memory occupied by the model weights, m_weight, can be computed in advance, the activation memory limit m_c is computed as follows.

m_c = m_model − m_weight

Let total_loss denote the total quantization loss, m_t the memory occupied by the activation set at moment t, m_a the memory occupied by activation a, size_a the number of elements of activation a, and b_a the quantization bit width of activation a; loss_{a,b_a} is obtained from the set Loss.

min total_loss, where total_loss = Σ_{a ∈ Activations} loss_{a,b_a}

subject to m_t = Σ_{a ∈ activation_set_t} m_a ≤ m_c for every t ∈ T, with

m_a = size_a · b_a

The above optimization problem is an integer linear program whose variables are the quantization bit widths b_a of the activations; it can be solved by an integer linear programming method to obtain the bit width of each activation.
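The patent solves this with an integer linear programming method; for a toy instance the same objective and per-moment memory constraint can be checked by exhaustive search over bit-width assignments. The sizes, loss table and live sets below are made-up illustration data:

```python
from itertools import product

BITWIDTH = [2, 4, 6, 8]
# hypothetical per-activation element counts and per-bit-width losses
size = {"a1": 1024, "a2": 512}
loss = {("a1", 2): 0.9, ("a1", 4): 0.3, ("a1", 6): 0.1, ("a1", 8): 0.02,
        ("a2", 2): 0.5, ("a2", 4): 0.2, ("a2", 6): 0.05, ("a2", 8): 0.01}
activation_sets = {0: {"a1"}, 1: {"a1", "a2"}, 2: {"a2"}}
m_c = 1024 * 6 + 512 * 8  # activation memory budget, in bits

names = sorted(size)
best = None
for bits in product(BITWIDTH, repeat=len(names)):
    b = dict(zip(names, bits))
    # constraint: m_t = sum of m_a over the live set stays within m_c
    if any(sum(size[a] * b[a] for a in live) > m_c
           for live in activation_sets.values()):
        continue
    total = sum(loss[(a, b[a])] for a in names)  # total_loss
    if best is None or total < best[0]:
        best = (total, b)

total_loss, assignment = best
```

With this budget the binding moment is t = 1, where a1 and a2 are live together, so the search trades a1 down to 6 bits to keep a2 at 8. At realistic scale the exhaustive loop would be replaced by an actual ILP solver (e.g. PuLP or `scipy.optimize.milp`).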

Ninth, quantize the model and run inference. The model is quantized with the optimized bit widths, and inference is then performed.

Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the invention is defined by the appended claims and their equivalents.

Claims (10)

1. A method for setting quantization bit numbers of a neural network model, characterized in that the method comprises the following steps:
preparing a dataset comprising a small amount of data;
defining a quantization loss evaluation index;
defining a model quantization bit number;
acquiring a related memory limit of a model;
defining a target neural network model;
calculating quantization losses of different quantization bits of the activation value;
analyzing the model reasoning step and the activation values in the memory at each moment;
constructing an optimization problem, and solving by adopting an integer linear programming method;
model quantization and reasoning.
2. The method for setting quantization bit numbers of a neural network model according to claim 1, wherein: when preparing a data set containing a small amount of data, the data set is used as a calibration data set and a quantization loss calculation data set.
3. The method for setting quantization bit numbers of a neural network model according to claim 1, wherein: when the quantization loss evaluation index is defined, the difference between the non-quantized activation value data and the quantized activation value data is calculated as the quantization loss; the difference is calculated by, but not limited to, mean square error or cosine similarity.
4. The method for setting quantization bit numbers of a neural network model according to claim 1, wherein: when defining the model quantization bit number, based on the hardware support condition and the model precision requirement, the model quantization bit number is, but is not limited to, 2, 4, 6 or 8 bits, with Bitwidth representing the set of quantization bit numbers,
Bitwidth = {2, 4, 6, 8}.
5. The method for setting quantization bit numbers of a neural network model according to claim 1, wherein: when the model-related memory limit is acquired, in the model reasoning process, reasoning-related memory occupation comprises reasoning-framework memory occupation, code and data memory occupation, model-weight memory occupation and activation-value memory occupation, wherein the model-related memory occupation mainly comprises model-weight memory occupation and activation-value memory occupation; assume that the hardware memory size is m_total and the memory outside the model-related memory occupation is m_other; the remaining memory size is m_model, i.e. the model-related memory limit.
6. The method for setting quantization bit numbers of a neural network model according to claim 1, wherein: when defining the target neural network model, the neural network model is a mainstream neural network model including, but not limited to, ResNet, MobileNet and SSD.
7. The method for setting quantization bit numbers of a neural network model according to claim 1, wherein: when calculating the quantization losses of the different quantization bit numbers of the activation values, for each sample in the loss-calculation data set, reasoning is performed with the pre-quantization model and all activation value data are obtained; for all activation value data, the quantization loss of each activation value at each quantization bit number is calculated; after reasoning over all sample data is completed, the quantization losses of the different quantization bit numbers of each activation value are aggregated, by, but not limited to, the mean, into the quantization loss set Loss of the different quantization bit numbers of each activation value over all sample data; the formula is as follows, where loss_ab represents the quantization loss of activation value a at quantization bit number b: Loss = {loss_ab}, a ∈ Activations, b ∈ Bitwidth.
8. The method for setting quantization bit numbers of a neural network model according to claim 1, wherein: when analyzing the model reasoning steps and the activation values in the memory at all times, in the model reasoning process the model weights are always stored in the memory, and activation values that are no longer used are destroyed in time so that the occupied memory space is released; let Activations represent all the activation values of the model and activation_set_t represent the set of activation values in memory at time t, so that any activation_set_t is a subset of Activations, while the same activation value may appear in the sets of several times; reasoning step by step yields the set of in-memory activation value sets at all times, Activation_sets = {activation_set_t | t ∈ T}, where T represents the set of all moments in the reasoning process.
9. The method for setting quantization bit numbers of a neural network model according to claim 1, characterized in that: when the optimization problem is constructed and solved by an integer linear programming method, the optimization target is the minimum total quantization loss, and the constraint condition is that the memory occupied by the activation value set at each moment is smaller than the activation value memory limit; since the memory occupied by the model weights can be calculated in advance and is assumed to be m_weight, let m_c represent the activation value memory limit, calculated using the following formula:
m_c = m_model − m_weight
let total_loss denote the total quantization loss, m_t the memory size occupied by the activation value set at time t, m_a the memory size occupied by activation value a, size_a the number of elements of activation value a, and b_a the quantization bit number of activation value a; loss_{a,b_a} is obtained from the set Loss,
min total_loss, where total_loss = Σ_{a ∈ Activations} loss_{a,b_a}
subject to m_t = Σ_{a ∈ activation_set_t} m_a ≤ m_c for every t ∈ T, with
m_a = size_a * b_a
the optimization problem is an integer linear programming problem whose optimization variables are the quantization bit numbers b_a of the activation values; solving it by an integer linear programming method yields the quantization bit number of each activation value.
10. The method for setting quantization bit numbers of a neural network model according to claim 1, wherein: during model quantization and reasoning, model quantization is performed by using the quantization bit number obtained by optimization, and model reasoning is further performed.
CN202310460434.9A 2023-04-23 2023-04-23 Method for setting quantization bit number of neural network model Pending CN116611470A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202310460434.9A CN116611470A (en) 2023-04-23 2023-04-23 Method for setting quantization bit number of neural network model
PCT/CN2023/100668 WO2024221573A1 (en) 2023-04-23 2023-06-16 Method for setting quantization bit of neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310460434.9A CN116611470A (en) 2023-04-23 2023-04-23 Method for setting quantization bit number of neural network model

Publications (1)

Publication Number Publication Date
CN116611470A true CN116611470A (en) 2023-08-18

Family

ID=87675535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310460434.9A Pending CN116611470A (en) 2023-04-23 2023-04-23 Method for setting quantization bit number of neural network model

Country Status (2)

Country Link
CN (1) CN116611470A (en)
WO (1) WO2024221573A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118331734A (en) * 2024-04-22 2024-07-12 上海交通大学 Large language model memory scheduling management method, system and storage medium
WO2025222637A1 (en) * 2024-04-23 2025-10-30 山东浪潮科学研究院有限公司 Mixture-of-experts model quantization method and apparatus, and device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110799994A (en) * 2017-08-14 2020-02-14 美的集团股份有限公司 Adaptive Bit Width Reduction for Neural Networks
WO2021012148A1 (en) * 2019-07-22 2021-01-28 深圳市大疆创新科技有限公司 Data processing method and apparatus based on deep neural network, and mobile device
CN113554097A (en) * 2021-07-26 2021-10-26 北京市商汤科技开发有限公司 Model quantization method and device, electronic equipment and storage medium
KR20210144534A (en) * 2020-05-22 2021-11-30 삼성전자주식회사 Neural network based training method, inference method and apparatus
CN114692814A (en) * 2020-12-31 2022-07-01 合肥君正科技有限公司 Quantification method for optimizing neural network model activation
KR20220125112A (en) * 2021-03-04 2022-09-14 삼성전자주식회사 Neural network computation method and apparatus using quantization
CN115983349A (en) * 2023-02-10 2023-04-18 哲库科技(上海)有限公司 Method and device for quantizing convolutional neural network, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950715A (en) * 2020-08-24 2020-11-17 云知声智能科技股份有限公司 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift
CN114021691A (en) * 2021-10-13 2022-02-08 山东浪潮科学研究院有限公司 Neural network model quantification method, system, device and computer readable medium
CN114841339A (en) * 2022-04-18 2022-08-02 美的集团(上海)有限公司 Network model quantification method and device, electronic equipment and storage medium
CN115879525A (en) * 2022-12-01 2023-03-31 Oppo(重庆)智能科技有限公司 Neural network model quantification method and device, storage medium and electronic device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110799994A (en) * 2017-08-14 2020-02-14 美的集团股份有限公司 Adaptive Bit Width Reduction for Neural Networks
WO2021012148A1 (en) * 2019-07-22 2021-01-28 深圳市大疆创新科技有限公司 Data processing method and apparatus based on deep neural network, and mobile device
KR20210144534A (en) * 2020-05-22 2021-11-30 삼성전자주식회사 Neural network based training method, inference method and apparatus
CN114692814A (en) * 2020-12-31 2022-07-01 合肥君正科技有限公司 Quantification method for optimizing neural network model activation
KR20220125112A (en) * 2021-03-04 2022-09-14 삼성전자주식회사 Neural network computation method and apparatus using quantization
CN113554097A (en) * 2021-07-26 2021-10-26 北京市商汤科技开发有限公司 Model quantization method and device, electronic equipment and storage medium
CN115983349A (en) * 2023-02-10 2023-04-18 哲库科技(上海)有限公司 Method and device for quantizing convolutional neural network, electronic device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
EFSTATHIA SOUFLERI: "Network Compression via Mixed Precision Quantization Using a Multi-Layer Perceptron for the Bit-Width Allocation", IEEE ACCESS, 29 September 2021 (2021-09-29), pages 135059, XP011881165, DOI: 10.1109/ACCESS.2021.3116418 *
YIN WENFENG; LIANG LINGYAN; PENG HUIMIN; CAO QICHUN; ZHAO JIAN; DONG GANG; ZHAO YAQIAN; ZHAO KUN: "Research Progress on Convolutional Neural Network Compression and Acceleration Techniques", Computer Systems & Applications, no. 09, 15 September 2020 (2020-09-15), pages 20 - 29 *
ZHANG MINGCHENG: "Mathematical Modeling Methods and Applications", 31 March 2020, Shandong People's Publishing House, pages: 46 - 50 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118331734A (en) * 2024-04-22 2024-07-12 上海交通大学 Large language model memory scheduling management method, system and storage medium
WO2025222637A1 (en) * 2024-04-23 2025-10-30 山东浪潮科学研究院有限公司 Mixture-of-experts model quantization method and apparatus, and device and storage medium

Also Published As

Publication number Publication date
WO2024221573A1 (en) 2024-10-31

Similar Documents

Publication Publication Date Title
KR102728799B1 (en) Method and apparatus of artificial neural network quantization
KR102782965B1 (en) Method and apparatus of artificial neural network quantization
CN115392477B (en) A Deep Learning-Based Method and Apparatus for Estimating Cardinality of Skyline Queries
CN116611470A (en) Method for setting quantization bit number of neural network model
Lu et al. One proxy device is enough for hardware-aware neural architecture search
CN114781650B (en) Data processing method, device, equipment and storage medium
CN108805257A (en) A kind of neural network quantization method based on parameter norm
CN110874625B (en) A data processing method and device
CN113537370A (en) Cloud computing-based financial data processing method and system
CN114168318B (en) Storage release model training method, storage release method and device
CN114661665B (en) Execution engine determination method, model training method and device
CN110503182A (en) Network layer operation method and device in deep neural network
CN118297121A (en) A hybrid expert model quantization method, device, equipment and storage medium
CN117033391A (en) Database indexing method, device, server and medium
CN119884672B (en) Watershed water quality prediction method, device, equipment and medium
CN112257958A (en) Power saturation load prediction method and device
CN116187387A (en) Neural network model quantification method, device, computer equipment and storage medium
CN116090571B (en) Quantum linear solution method, device and medium based on generalized minimal residual
CN115862653A (en) Audio denoising method, device, computer equipment and storage medium
CN116611494A (en) Training method and device for electric power defect detection model, computer equipment and medium
CN109766993B (en) A Convolutional Neural Network Compression Method Suitable for Hardware
CN116738127A (en) A method, device, medium and electronic device for solving differential equations
CN120449946B (en) Self-adaptive hybrid precision quantization network generation method based on self-learning
CN116263883B (en) Quantum linear solving method, device and equipment based on polynomial preprocessor
US20240361988A1 (en) Optimizing method and computing system for deep learning network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination