CN119089962A - Neural network lightweight method, device, equipment, medium and product
- Publication number: CN119089962A
- Application number: CN202411094400.3A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a neural network lightweight method, device, equipment, medium and product. The method comprises: determining a first preset value interval for a transfer difficulty coefficient, a second preset value interval for a quantization clipping range, and a third preset value interval for a quantization bit width; constructing a search space based on the first, second and third preset value intervals; determining the optimal parameter combination of each network layer within the search space; and quantizing the model to be quantized layer by layer based on the optimal parameter combination of each network layer to obtain a quantized model. Because the search space is constructed from multiple factors considered together, the lightweight method is more universal, suits different types of network structures, and allows the quantized model to keep its accuracy. Selecting an optimal parameter combination per network layer then lightens the model, reducing computing-resource consumption and accelerating processing while accuracy is preserved.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a neural network lightweight method, a device, equipment, a medium and a product.
Background
With the development of artificial intelligence technology, neural networks have been widely applied in fields such as machine learning, natural language processing, computer vision and control decision-making. Processing data with neural network models to realize functions such as recognition is a key technology of artificial intelligence applications. To improve the accuracy with which a neural network model processes data, its parameter count is increased, which in turn raises the computing resources, such as compute power and storage, required to run the model and lengthens processing time. In particular, since the rise of large-model technology, large language models perform well on natural language processing and natural language generation tasks, but their training and deployment are quite complex, with parameter counts typically reaching the billions or even trillions. A lightweight method that reduces model weight is therefore important for efficient inference.
In the conventional lightweight approach, a single quantization bit width is generally selected and uniform quantization parameters are applied according to that bit width, but this approach usually reduces the accuracy of the lightened model.
Disclosure of Invention
The invention provides a neural network lightweight method, device, equipment, medium and product, which are used to solve the problem that the conventional neural network lightweight approach reduces the accuracy of the lightened model.
In a first aspect, the present invention provides a neural network lightweight method, including:
Determining a first preset value interval of a transfer difficulty coefficient, a second preset value interval of a quantization clipping range and a third preset value interval of a quantization bit width, wherein the transfer difficulty coefficient is a coefficient used for adjusting the quantization difficulty of a plurality of parameters in a network layer;
Constructing a search space based on the first preset value interval, the second preset value interval and the third preset value interval;
determining an optimal parameter combination of each network layer in the model to be quantized in the search space, wherein the optimal parameter combination comprises a target transfer difficulty coefficient, a target quantization clipping range and a target quantization bit width;
and carrying out layer-by-layer quantization on the model to be quantized based on the optimal parameter combination of each network layer to obtain a quantized model.
In one embodiment, in the search space, when determining the optimal parameter combination of each network layer in the model to be quantized, the following steps are performed for each network layer:
Acquiring a calibration data set of the model to be quantized;
Model reasoning is carried out based on the calibration data set, so that a plurality of initial activation values of the network layer are obtained;
determining an initial weight corresponding to each initial activation value;
And selecting a target transfer difficulty coefficient of the network layer in the model to be quantized in the first preset value interval based on the initial activation values and the initial weights.
In one embodiment, after selecting the target transfer difficulty coefficient of the network layer in the model to be quantized in the first preset value interval based on a plurality of initial activation values and a plurality of initial weights, the method further comprises:
determining a scaling factor based on the target transfer difficulty coefficient, wherein the scaling factor is used for quantizing parameters to be quantized;
and converting each initial activation value and each initial weight based on the scaling factors to obtain a plurality of target activation values and a plurality of target weights.
In one embodiment, in the search space, when determining the optimal parameter combination of each network layer in the model to be quantized, the following steps are further performed for each network layer:
Selecting a plurality of outliers from the plurality of target activation values and the plurality of target weights;
constructing a number axis based on a plurality of target parameters, the plurality of target parameters including the plurality of target activation values and the plurality of target weights;
determining remaining target parameters other than the plurality of outliers among the plurality of target parameters as normal values;
determining a first parameter distribution of a plurality of normal values and a second parameter distribution of the plurality of outliers based on the number axis;
And selecting a target quantization clipping range of the network layer in the model to be quantized in the second preset value interval based on the first parameter distribution and the second parameter distribution.
In one embodiment, the selecting a plurality of outliers from the plurality of target activation values and the plurality of target weights comprises:
Determining the magnitude of each target activation value;
Determining a target activation value corresponding to the first N maximum amplitude values in the plurality of amplitude values as an abnormal activation value;
Determining a target weight corresponding to each abnormal activation value as an abnormal weight;
the plurality of outlier activation values and the plurality of outlier weights are respectively determined as outliers.
In one embodiment, in the search space, when determining the optimal parameter combination of each network layer in the model to be quantized, the following steps are further performed for each network layer:
Determining target hardware to be deployed by the quantized model;
And selecting the target quantization bit width of the network layer in the model to be quantized in the third preset value interval according to the target hardware.
In one embodiment, when the model to be quantized is quantized layer by layer based on the optimal parameter combination of each network layer, the following steps are performed for each network layer:
respectively determining target parameters meeting the target quantization clipping range as parameters to be quantized;
Respectively determining target parameters which do not meet the target quantization clipping range as parameters to be mapped;
quantizing a plurality of parameters to be quantized based on the scaling factor and the target quantization bit width;
if any parameter to be mapped is larger than the quantization maximum value of the target quantization bit width, mapping the any parameter to be mapped to the quantization maximum value;
And if any parameter to be mapped is smaller than the quantization minimum value of the target quantization bit width, mapping the any parameter to be mapped to the quantization minimum value.
In one embodiment, after performing layer-by-layer quantization on the model to be quantized based on the optimal parameter combination of each network layer, the method further includes:
Acquiring a first output of the model to be quantized, a first inference time consumption of the model to be quantized, a second output of the quantized model and a second inference time consumption of the quantized model;
determining a loss value based on the first output, the second output, the first inference consuming time, and the second inference consuming time;
And selecting a target parameter combination meeting the optimal quantization loss condition in the search space based on the loss value.
In a second aspect, the present invention also provides a neural network lightweight device, including:
the device comprises a first determining module, a second determining module and a third determining module, wherein the first determining module is used for determining a first preset value interval of a transfer difficulty coefficient, a second preset value interval of a quantization clipping range and a third preset value interval of a quantization bit width, the transfer difficulty coefficient is a coefficient used for adjusting the quantization difficulty of a plurality of parameters in a network layer, and the quantization clipping range refers to the range of parameters to be quantized among the plurality of parameters in the network layer;
the construction module is used for constructing a search space based on the first preset value interval, the second preset value interval and the third preset value interval;
the second determining module is used for determining the optimal parameter combination of each network layer in the model to be quantized in the search space, wherein the optimal parameter combination comprises a target transfer difficulty coefficient, a target quantization clipping range and a target quantization bit width;
And the quantization module is used for quantizing the model to be quantized layer by layer based on the optimal parameter combination of each network layer to obtain a quantized model.
In a third aspect, the present invention provides an apparatus comprising an electronic device, the electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the neural network lightweight method described in any one of the above.
In a fourth aspect, the present invention also provides a medium comprising a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the neural network lightweight method described in any one of the above.
In a fifth aspect, the invention also provides a product comprising a computer program product, the computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program, when executed by a processor, implementing the steps of the neural network lightweight method described in any one of the above.
According to the neural network lightweight method, device, equipment, medium and product provided by the invention, multiple factors influencing lightweighting, such as the transfer difficulty coefficient, the quantization clipping range and the quantization bit width, are comprehensively considered. A value range is preset for each such factor and a search space is constructed from those ranges, which makes the lightweight method more universal, suitable for different types of network layer structures, and able to preserve the accuracy of the subsequently quantized model. The optimal parameter combination of each network layer in the model to be quantized is then selected within the search space, and the model is quantized based on those combinations, so that computing-resource consumption is reduced and processing is accelerated while the quantized model keeps its accuracy and performance.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are obviously some embodiments of the invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic flow chart of a neural network lightweight method provided by the invention.
FIG. 2 is a schematic diagram of the difficulty of model activation value quantization transfer provided by the invention.
Fig. 3 is a schematic diagram of model quantization clipping provided by the present invention.
Fig. 4 is a schematic structural diagram of a neural network lightweight device provided by the invention.
Fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the invention may be practiced otherwise than as specifically illustrated or described herein.
The neural network lightweight method, apparatus, device, medium and product provided by the present invention are described below with reference to fig. 1 to 5.
Fig. 1 is a schematic flow chart of a neural network lightweight method provided by the invention.
As shown in fig. 1, the method comprises the following steps:
Step 101, determining a first preset value interval of a transfer difficulty coefficient, a second preset value interval of a quantization clipping range and a third preset value interval of a quantization bit width;
102, constructing a search space based on the first preset value interval, the second preset value interval and the third preset value interval;
step 103, determining the optimal parameter combination of each network layer in the model to be quantized in the search space;
and 104, quantizing the model to be quantized layer by layer based on the optimal parameter combination of each network layer to obtain a quantized model.
It should be noted that the neural network lightweight method provided by the embodiment of the invention is implemented by a neural network lightweight device: the size of the neural network model is compressed by using data types with lower memory overhead, accelerating the model's inference speed on hardware while ensuring its accuracy and performance. The embodiment of the invention describes the method with the neural network lightweight device as the execution subject.
Specifically, before the neural network is lightened, the factors involved in lightweighting are comprehensively considered, and a first preset value interval of the transfer difficulty coefficient, a second preset value interval of the quantization clipping range and a third preset value interval of the quantization bit width are preset.
It should be noted that the transfer difficulty coefficient may be controlled by a hyperparameter α as in the SmoothQuant method. Specifically, the core idea of SmoothQuant is to introduce the hyperparameter α to control how much quantization difficulty is migrated from the activation values to the weights, thereby balancing the quantization difficulty of weights and activation values in the model.
The quantization clipping range refers to the range of parameters to be quantized among the plurality of parameters in the network layer to be quantized. Since both activation values and weights are quantized, it can be understood as the range of parameters to be quantized among the activation values and weights in the network layer.
The quantization bit width refers to the number of bits used to represent the weights and activation values during lightweighting. Specifically, it determines how many bits each weight and activation value occupies when stored and processed in a computer. In general, the smaller the quantization bit width, the lower the model's storage requirement and the faster its computation, but accuracy loss may also result; a suitable quantization bit width must therefore be selected according to the specific performance requirements in practice.
In the embodiment of the invention, the parameters within the quantization clipping range are quantized according to the chosen quantization bit width, and the parameters outside the quantization clipping range are mapped directly to the quantization minimum or maximum value of that bit width.
Further, the neural network light-weight device determines a first preset value interval of the transfer difficulty coefficient, a second preset value interval of the quantization clipping range and a third preset value interval of the quantization bit width.
In one embodiment, the first preset value interval of the transfer difficulty coefficient may be [90%, 95%], the second preset value interval of the quantization clipping range may be [0.3, 0.7], and the third preset value interval of the quantization bit width may be {int4, int8, fp4, fp8}. Because the distribution of activation values differs between layers of the same model, the transfer difficulty coefficient and quantization clipping range may differ between layers, so a value must be selected flexibly within each range. The choice of quantization bit width is determined by the hardware on which the model runs; since the hardware generally does not change from layer to layer, the quantization bit widths of different layers may be the same.
Further, the neural network light-weight device sets a first step length of a first preset value interval and a second step length of a second preset value interval.
Further, the neural network light-weight device constructs a search space based on the first preset value interval and the first step length thereof, the second preset value interval and the second step length thereof, and the third preset value interval.
It should be noted that each preset value interval contains a plurality of values, and a plurality of parameter combinations can be constructed across the different intervals: selecting one value from the first preset value interval, one from the second and one from the third yields one parameter combination.
Further, the neural network lightweight device determines the optimal parameter combination of each network layer in the model to be quantized within the search space using an optimization algorithm such as grid search or Bayesian optimization, where each layer's optimal parameter combination comprises a target transfer difficulty coefficient, a target quantization clipping range and a target quantization bit width.
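As an illustration, the search over the constructed space might be sketched as follows. This is a minimal example assuming a plain grid search; the step sizes and the evaluate_layer_loss helper are hypothetical and not taken from the patent text.

```python
# A sketch only: step sizes and evaluate_layer_loss are assumptions.
import itertools
import numpy as np

alpha_grid = np.arange(0.90, 0.95 + 1e-9, 0.01)   # first preset interval, assumed step 0.01
clip_grid = np.arange(0.3, 0.7 + 1e-9, 0.1)       # second preset interval, assumed step 0.1
bitwidths = ["int4", "int8", "fp4", "fp8"]        # third preset interval (discrete)

# Every (alpha, clip_ratio, bitwidth) triple is one candidate parameter combination.
search_space = list(itertools.product(alpha_grid, clip_grid, bitwidths))

def evaluate_layer_loss(layer, alpha, clip_ratio, bitwidth):
    """Hypothetical per-layer quantization loss, e.g. the loss defined later in the text."""
    raise NotImplementedError

def best_combination(layer):
    # Plain grid search; Bayesian optimization could replace this loop.
    return min(search_space, key=lambda combo: evaluate_layer_loss(layer, *combo))
```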
Further, the neural network light-weight device quantizes the model to be quantized layer by layer based on the optimal parameter combination of each network layer to obtain a quantized model.
According to the neural network lightweight method provided by the invention, multiple factors influencing lightweighting, such as the transfer difficulty coefficient, the quantization clipping range and the quantization bit width, are comprehensively considered. A value range is preset for each factor and a search space is constructed from those ranges, which makes the lightweight method more universal, suitable for different types of network layer structures, and able to preserve the accuracy of the subsequently quantized model. The optimal parameter combination of each network layer in the model to be quantized is then selected within the search space, and the model is quantized based on those combinations, so that computing-resource consumption is reduced and processing is accelerated while the quantized model keeps its accuracy and performance.
Further, based on step 103, when determining the optimal parameter combination of each network layer in the model to be quantized in the search space, the following steps are performed for each network layer:
Acquiring a calibration data set of the model to be quantized;
Model reasoning is carried out based on the calibration data set, so that a plurality of initial activation values of the network layer are obtained;
determining an initial weight corresponding to each initial activation value;
And selecting a target transfer difficulty coefficient of the network layer in the model to be quantized in the first preset value interval based on the initial activation values and the initial weights.
Specifically, the neural network light-weight device acquires a calibration data set of a model to be quantized.
Further, the neural network light-weight device performs model reasoning based on the calibration data set to obtain a plurality of initial activation values of the current network layer.
Further, the neural network light-weight device determines an initial weight corresponding to each initial activation value.
It should be noted that each neuron receives input signals from the previous layer, combines them linearly through weights, adds a bias to obtain the linear combination result, and then converts that result into the neuron's activation value through an activation function. Each activation value therefore has corresponding weights, possibly more than one, which is not limited here.
Further, the neural network light-weight device selects a target transfer difficulty coefficient of a current network layer in the model to be quantized in a first preset value interval by using an optimization algorithm based on a plurality of initial activation values and a plurality of initial weights.
According to the embodiment of the invention, the target transfer difficulty coefficient of the network layer is selected in the first preset value interval based on the initial activation value and the initial weight, so that the difficulty of transferring from the activation value to the weight can be controlled, the precision and the efficiency in the quantization process are balanced, and the model precision and the model performance are maintained, and meanwhile, the model reasoning speed and the model reasoning efficiency are improved.
Further, after selecting the target transfer difficulty coefficient of the network layer in the model to be quantized in the first preset value interval based on a plurality of initial activation values and a plurality of initial weights, the method further comprises:
determining a scaling factor based on the target transfer difficulty coefficient, wherein the scaling factor is used for quantizing parameters to be quantized;
and converting each initial activation value and each initial weight based on the scaling factors to obtain a plurality of target activation values and a plurality of target weights.
Specifically, the neural network light-weight device determines a scaling factor for quantizing the parameter to be quantized based on the target transfer difficulty coefficient.
In one embodiment, based on the target transfer difficulty coefficient, the scaling factor is calculated as follows:

s_i = max(|X_i|)^α / max(|W_i|)^(1-α)

wherein s characterizes the scaling factor; α characterizes the target transfer difficulty coefficient; X is the activation value matrix formed by the plurality of initial activation values and W is the weight matrix formed by the plurality of initial weights; max(|X_i|) characterizes the maximum absolute value of the initial activation values of the i-th input channel; and max(|W_i|) characterizes the maximum absolute value of the initial weights of the i-th input channel.
It should be noted that, referring to fig. 2, fig. 2 is a schematic diagram of the quantization-difficulty transfer for model activation values provided by the present invention. As shown in fig. 2, X is an activation value matrix of size 2×4 composed of a plurality of initial activation values, and W is a weight matrix of size 4×3 composed of a plurality of initial weights. The maximum activation value in X differs from the minimum activation value by a factor of 16, while the maximum weight in W differs from the minimum weight by only a factor of 2, so X is harder to quantize than W; part of the difficulty of quantizing the activation values can therefore be transferred into the weights by an equivalent matrix substitution.
Further, the neural network light-weight device converts each initial activation value and each initial weight based on the scaling factor to obtain a plurality of target activation values and a plurality of target weights.
It may also be understood that the neural network light-weight device converts each initial activation value and each initial weight based on the scaling factor in an equivalent substitution manner, so as to obtain a plurality of converted target activation values and a plurality of target weights.
Here an equivalent substitution by matrix multiplication can be used, with the following formula:

Y = XW = (X · diag(s)^(-1)) · (diag(s) · W) = X̂ · Ŵ

wherein X is the activation value matrix composed of the plurality of initial activation values and W is the weight matrix composed of the plurality of initial weights; X̂ = X · diag(s)^(-1) is the activation value matrix after equivalent replacement; and Ŵ = diag(s) · W is the weight matrix after equivalent replacement.
As shown in fig. 2, the converted product X̂Ŵ is equivalent to the original XW. In the converted activation value matrix X̂ the ratio of the maximum to the minimum activation value is reduced to 4, while in the converted weight matrix Ŵ the corresponding ratio increases to 4, so that the difficulty of quantizing the activation values and the weights is balanced.
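A minimal sketch of the scaling-factor computation and equivalent substitution above, assuming PyTorch tensors with X of shape (tokens, input channels) and W of shape (input channels, output channels); the function name and example values are illustrative:

```python
import torch

def smooth(X: torch.Tensor, W: torch.Tensor, alpha: float):
    # Per-input-channel scale: s_i = max|X_i|^alpha / max|W_i|^(1 - alpha)
    s = X.abs().amax(dim=0).pow(alpha) / W.abs().amax(dim=1).pow(1.0 - alpha)
    X_hat = X / s            # X * diag(s)^-1: activation range shrinks
    W_hat = W * s[:, None]   # diag(s) * W: weights absorb part of the difficulty
    return X_hat, W_hat

X = torch.tensor([[2.0, -16.0, 1.0, 4.0],
                  [1.0, 8.0, -1.0, -2.0]])   # 2x4 with a 16x max/min spread, as described for fig. 2
W = torch.randn(4, 3)                        # 4x3 weight matrix
X_hat, W_hat = smooth(X, W, alpha=0.5)
assert torch.allclose(X_hat @ W_hat, X @ W, atol=1e-4)  # equivalence preserved
```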
According to the embodiment of the invention, the scaling factor is first determined based on the target transfer difficulty coefficient, and the initial activation values and initial weights are then equivalently replaced based on that scaling factor, which controls how much quantization difficulty is transferred from the activation values to the weights. Compared with conventional separate quantization, the value range of the activation values is reduced and their quantization difficulty decreases, enabling a more effective quantization process and ensuring the performance and accuracy of the quantized model.
Further, based on step 103, when determining the optimal parameter combination of each network layer in the model to be quantized in the search space, the following steps are further performed for each network layer:
Selecting a plurality of outliers from the plurality of target activation values and the plurality of target weights;
constructing a number axis based on a plurality of target parameters, the plurality of target parameters including the plurality of target activation values and the plurality of target weights;
determining remaining target parameters other than the plurality of outliers among the plurality of target parameters as normal values;
determining a first parameter distribution of a plurality of normal values and a second parameter distribution of the plurality of outliers based on the number axis;
And selecting a target quantization clipping range of the network layer in the model to be quantized in the second preset value interval based on the first parameter distribution and the second parameter distribution.
Specifically, the neural network lightweight device determines the plurality of target activation values and the plurality of target weights as a plurality of target parameters; since both activation values and weights are quantized during quantization, the following process is described in terms of these target parameters.
Further, the neural network lightweight device selects a plurality of outliers from the plurality of target parameters.
Further, the neural network lightweight device constructs a number axis based on the plurality of target parameters. Specifically, an origin is chosen on a straight line, with positive values to the right of the origin and negative values to the left, and the position of each target parameter is marked on the line according to its numerical value. The resulting number axis displays the distribution of the target parameters, from which their central tendency and degree of dispersion can be intuitively observed.
Further, the neural network light-weight device determines, as a normal value, the target parameter remaining except for the plurality of outliers among the plurality of target parameters.
Further, the neural network light-weight device determines a first parameter distribution of a plurality of normal values and a second parameter distribution of a plurality of outliers based on the number axis.
Further, the neural network lightweight device selects, through an optimization algorithm, a target quantization clipping range for the current network layer within the second preset value interval based on the first parameter distribution and the second parameter distribution, so that the range covers as many of the target parameters, outliers and normal values alike, as possible.
If the target parameters are mainly distributed on the positive half-axis near the parameter maximum, the parameter maximum is taken as the starting point, a clipping proportion is determined, and the range is extended from the maximum along the number axis until the clipping proportion is met, yielding the target quantization clipping range. If the target parameters are mainly distributed on the negative half-axis near the parameter minimum, the parameter minimum is taken as the starting point and the range is extended from it along the number axis until the clipping proportion is met. If the target parameters are mainly distributed near the origin rather than near the parameter maximum, the maximum is not used as the starting point; instead, the region of highest density within [parameter minimum, parameter maximum] is selected and divided according to the clipping proportion to obtain the target quantization clipping range, as in the sketch below.
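A minimal sketch of the density-based range selection described above; sliding a window over the sorted parameters is one way to find the densest region and is an assumption, not a procedure prescribed by the patent:

```python
import numpy as np

def select_clip_range(params: np.ndarray, clip_ratio: float):
    """Return (low, high) covering the densest region holding clip_ratio of the values.

    params: flattened 1-D array of target parameters (activations and weights).
    """
    sorted_p = np.sort(params)
    n = sorted_p.size
    k = max(1, int(round(clip_ratio * n)))   # number of values the range should keep
    # The narrowest window of k consecutive sorted values is the densest
    # region on the number axis.
    widths = sorted_p[k - 1:] - sorted_p[: n - k + 1]
    i = int(np.argmin(widths))
    return float(sorted_p[i]), float(sorted_p[i + k - 1])
```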
In an embodiment, referring to fig. 3, fig. 3 is a schematic diagram of model quantization clipping provided by the present invention. Taking quantization to int8 as an example, the left side of the figure shows clipping with a full mapping from minimum to maximum, where precision is lost in the dense region on the negative half-axis; the right side shows clipping in which only part of the range is mapped and unimportant maxima are mapped directly to the quantization maximum, so that inference precision is better preserved.
The embodiment of the invention considers the distribution of the normal values and the distribution of the outliers, so that the denser parameters can be included in the target quantization clipping range as far as possible. By accounting for the non-uniform distribution of activation values and weights, quantization with adaptive scaling according to their distribution characteristics is realized, further ensuring the precision and performance of the quantized model.
Further, based on step 103, the selecting a plurality of outliers from the plurality of target activation values and the plurality of target weights includes:
Determining the magnitude of each target activation value;
Determining a target activation value corresponding to the first N maximum amplitude values in the plurality of amplitude values as an abnormal activation value;
Determining a target weight corresponding to each abnormal activation value as an abnormal weight;
the plurality of outlier activation values and the plurality of outlier weights are respectively determined as outliers.
Specifically, the neural network light-weight means determines the magnitude of each target activation value.
Further, the neural network lightweight device compares the plurality of amplitudes to obtain a comparison result.
Further, according to the comparison result, the neural network lightweight device determines the target activation values corresponding to the first N largest amplitudes to be abnormal activation values, where N is set according to the actual situation; preferably, the top 0.5% or 1% of the amplitudes sorted in descending order may be selected, and their corresponding target activation values taken as abnormal activation values.
Alternatively, the abnormal activation values may be determined by the L2 norm.
Further, the neural network light-weight device determines a target weight corresponding to each abnormal activation value as an abnormal weight.
Further, the neural network light-weight device determines a plurality of abnormal activation values and a plurality of abnormal weights as outliers, respectively.
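A minimal sketch of the outlier selection above, assuming PyTorch tensors and interpreting the activation-to-weight correspondence channel-wise (an assumption); the 1% ratio follows the preferred value in the text:

```python
import torch

def select_outliers(X: torch.Tensor, W: torch.Tensor, ratio: float = 0.01):
    # Amplitude per activation channel; an L2 norm over each channel could
    # be used instead, as noted above.
    magnitudes = X.abs().amax(dim=0)
    n = max(1, int(ratio * magnitudes.numel()))       # top-N largest amplitudes
    _, channels = torch.topk(magnitudes, n)           # channels with abnormal activations
    outlier_activations = X[:, channels]              # abnormal activation values
    outlier_weights = W[channels, :]                  # weights fed by those channels
    return channels, outlier_activations, outlier_weights
```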
According to the embodiment of the invention, activation outliers are determined from the amplitudes or L2 norms of the model's activation values, and weight outliers are determined correspondingly from those activation outliers. The weights that influence abnormal activation values are thus better preserved: when the quantization range is selected, the weight parameters that can produce activation outliers are better protected, so the precision of the quantized model is better maintained.
Further, based on step 103, when determining the optimal parameter combination of each network layer in the model to be quantized in the search space, the following steps are further performed for each network layer:
Determining target hardware to be deployed by the quantized model;
And selecting the target quantization bit width of the network layer in the model to be quantized in the third preset value interval according to the target hardware.
Specifically, the neural network lightweight device determines the target hardware on which the model to be quantized will be deployed.
It should be noted that different quantization bit widths, such as the int4, int8, fp4 and fp8 data types, may be chosen according to the target hardware on which the model is deployed. Some low-power devices support only integer data types such as int4 and int8, while fp4 and fp8 offer a greater inference-speed advantage on the server side.
Further, the neural network light-weight device selects a target quantization bit width of the current network layer in the model to be quantized in a third preset value interval according to the target hardware.
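As an illustration, the hardware-aware selection might look like the following sketch; the capability table and preference order are assumptions for illustration only:

```python
# Hypothetical capability table: which bit widths each deployment target supports.
SUPPORTED_BITWIDTHS = {
    "edge_npu": ["int4", "int8"],                    # low-power device: integer types only
    "server_gpu": ["int4", "int8", "fp4", "fp8"],    # server side: fp4/fp8 available
}

def select_bitwidth(target_hardware: str,
                    preference=("fp8", "int8", "fp4", "int4")) -> str:
    supported = SUPPORTED_BITWIDTHS[target_hardware]
    # Take the first preferred bit width the target hardware supports.
    return next(bw for bw in preference if bw in supported)
```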
According to the embodiment of the invention, because the target hardware of the model deployment directly influences the model's storage requirement, execution efficiency and inference speed, data types of different quantization bit widths are considered to adapt to different hardware conditions, improving the adaptability of the quantization.
Further, based on step 104, when the model to be quantized is quantized layer by layer based on the optimal parameter combination of each network layer, the following steps are performed for each network layer:
respectively determining target parameters meeting the target quantization clipping range as parameters to be quantized;
Respectively determining target parameters which do not meet the target quantization clipping range as parameters to be mapped;
quantizing a plurality of parameters to be quantized based on the scaling factor and the target quantization bit width;
if any parameter to be mapped is larger than the quantization maximum value of the target quantization bit width, mapping the any parameter to be mapped to the quantization maximum value;
And if any parameter to be mapped is smaller than the quantization minimum value of the target quantization bit width, mapping the any parameter to be mapped to the quantization minimum value.
Specifically, the neural network light-weight device determines target parameters within a target quantization clipping range as parameters to be quantized, respectively.
Further, the neural network lightweight device respectively determines target parameters which do not meet the target quantization clipping range as parameters to be mapped.
Further, the neural network light-weight device quantizes a plurality of parameters to be quantized based on the scaling factor and the target quantization bit width.
It should be noted that the parameters to be mapped, which fall outside the target quantization clipping range, are discrete and present a certain quantization difficulty. To avoid precision loss after quantization, these discrete values can be mapped directly to the quantization extremes.
Further, if any parameter to be mapped is larger than the quantization maximum value of the target quantization bit width, the neural network light-weight device maps the any parameter to be mapped to the quantization maximum value.
Further, if any parameter to be mapped is smaller than the quantization minimum value of the target quantization bit width, the neural network light-weight device maps the any parameter to be mapped to the quantization minimum value.
It should be noted that, in general, the bit width of the network layer parameters is far greater than the quantization bit width, and the parameters to be mapped, which do not meet the target quantization clipping range, are generally distributed at the extremes, so each parameter to be mapped is either smaller than the quantization minimum or larger than the quantization maximum of the target quantization bit width.
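A minimal sketch of this per-layer step for an integer target bit width, assuming a symmetric quantization scheme (an assumption; the patent does not fix the scheme):

```python
import torch

def quantize_layer(params: torch.Tensor, scale: torch.Tensor,
                   clip_low: float, clip_high: float, bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1      # e.g. 127 for int8
    qmin = -(2 ** (bits - 1))       # e.g. -128 for int8
    in_range = (params >= clip_low) & (params <= clip_high)
    q = torch.round(params / scale)              # parameters to be quantized
    # Parameters to be mapped go straight to the quantization extremes.
    q = torch.where(in_range, q,
                    torch.where(params > clip_high,
                                torch.full_like(q, float(qmax)),
                                torch.full_like(q, float(qmin))))
    # int32 container covers any bit width up to 32.
    return q.clamp(qmin, qmax).to(torch.int32)
```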
According to the embodiment of the invention, the parameters to be quantized within the target quantization clipping range are quantized according to the scaling factor, and the discrete parameters to be mapped are mapped directly to the quantization extremes of the target quantization bit width, realizing a more effective quantization and ensuring the precision and performance of the quantized model.
Further, after performing layer-by-layer quantization on the model to be quantized based on the optimal parameter combination of each network layer to obtain a quantized model, the method further comprises:
Acquiring a first output of the model to be quantized, a first inference time consumption of the model to be quantized, a second output of the quantized model and a second inference time consumption of the quantized model;
determining a loss value based on the first output, the second output, the first inference consuming time, and the second inference consuming time;
And selecting a target parameter combination meeting the optimal quantization loss condition in the search space based on the loss value.
Specifically, the neural network light-weight device obtains a first output of a model to be quantized, a first inference time-consuming of the model to be quantized, a second output of the model after quantization, and a second inference time-consuming of the model after quantization.
Further, the neural network light-weight device determines the loss value based on the first output, the second output, the first inference time-consuming, and the second inference time-consuming.
In one embodiment, the loss value is calculated as follows:

l = MSE(Ŷ, Y) + T̂ / T

wherein l characterizes the loss value; Ŷ characterizes the second output of the quantized model; Y characterizes the first output of the model to be quantized; T̂ characterizes the second inference time consumption of the quantized model; T characterizes the first inference time consumption of the model to be quantized; and MSE is the mean square error function.
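A minimal sketch of this loss under the reconstruction above; the way the output error and the inference-time ratio are combined is an assumption, since only the variable definitions survive in the text:

```python
import torch

def quantization_loss(Y: torch.Tensor, Y_hat: torch.Tensor,
                      t: float, t_hat: float) -> float:
    mse = torch.mean((Y_hat - Y) ** 2).item()   # output-fidelity term, MSE(Y_hat, Y)
    return mse + t_hat / t                      # plus the relative inference time
```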
Further, the neural network light-weight device selects a target parameter combination meeting the optimal quantization loss condition in the search space based on the loss value, wherein the target parameter combination can be used for adjusting specific parameters in the quantization process so as to guide the subsequent quantization operation.
The neural network lightweight method can effectively reduce the deployment cost of large models and improve their inference efficiency; on this basis, large models can be applied to services such as intelligent communities, mobile and home services, and real-time inference applications, to promote the intelligent development of these services.
According to the embodiment of the invention, the loss value is computed from the quantized model and the model before quantization, and the target parameter combination meeting the optimal quantization loss condition is selected according to the loss value, so that the performance loss of the quantization process is minimized and the quantized model achieves the best lightweight effect while maintaining performance.
The neural network lightweight device provided by the invention is described below; the device described below and the neural network lightweight method described above may be referred to in correspondence with each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a neural network lightweight device according to the present invention.
The neural network lightweight device includes:
the first determining module 410 is configured to determine a first preset value interval of a transfer difficulty coefficient, a second preset value interval of a quantization clipping range, and a third preset value interval of a quantization bit width, where the transfer difficulty coefficient is a coefficient used to adjust the quantization difficulty of a plurality of parameters in a network layer, and the quantization clipping range is the range of parameters to be quantized among the plurality of parameters in the network layer.
A construction module 420, configured to construct a search space based on the first preset value interval, the second preset value interval, and the third preset value interval.
And the second determining module 430 is configured to determine, in the search space, an optimal parameter combination of each network layer in the model to be quantized, where the optimal parameter combination includes a target transfer difficulty coefficient, a target quantization clipping range and a target quantization bit width.
And the quantization module 440 is configured to quantize the model to be quantized layer by layer based on the optimal parameter combination of each network layer, so as to obtain a quantized model.
According to the neural network lightweight device provided by the invention, multiple factors influencing lightweighting, such as the transfer difficulty coefficient, the quantization clipping range and the quantization bit width, are comprehensively considered. A value range is preset for each factor and a search space is constructed from those ranges, which makes the lightweight method more universal, suitable for different types of network layer structures, and able to preserve the accuracy of the subsequently quantized model. The optimal parameter combination of each network layer in the model to be quantized is then selected within the search space, and the model is quantized based on those combinations, so that computing-resource consumption is reduced and processing is accelerated while the quantized model keeps its accuracy and performance.
Further, the second determining module 430 is further configured to:
Acquiring a calibration data set of the model to be quantized;
Model reasoning is carried out based on the calibration data set, so that a plurality of initial activation values of the network layer are obtained;
determining an initial weight corresponding to each initial activation value;
And selecting a target transfer difficulty coefficient of the network layer in the model to be quantized in the first preset value interval based on the initial activation values and the initial weights.
Further, the neural network lightweight device is further configured to:
determining a scaling factor based on the target transfer difficulty coefficient, wherein the scaling factor is used for quantizing parameters to be quantized;
and converting each initial activation value and each initial weight based on the scaling factors to obtain a plurality of target activation values and a plurality of target weights.
Further, the second determining module 430 is further configured to:
Selecting a plurality of outliers from the plurality of target activation values and the plurality of target weights;
constructing a number axis based on a plurality of target parameters, the plurality of target parameters including the plurality of target activation values and the plurality of target weights;
determining remaining target parameters other than the plurality of outliers among the plurality of target parameters as normal values;
determining a first parameter distribution of a plurality of normal values and a second parameter distribution of the plurality of outliers based on the number axis;
And selecting a target quantization clipping range of the network layer in the model to be quantized in the second preset value interval based on the first parameter distribution and the second parameter distribution.
Further, the second determining module 430 is further configured to:
Determining the magnitude of each target activation value;
Determining a target activation value corresponding to the first N maximum amplitude values in the plurality of amplitude values as an abnormal activation value;
Determining a target weight corresponding to each abnormal activation value as an abnormal weight;
the plurality of outlier activation values and the plurality of outlier weights are respectively determined as outliers.
Further, the second determining module 430 is further configured to:
Determining target hardware to be deployed by the quantized model;
And selecting the target quantization bit width of the network layer in the model to be quantized in the third preset value interval according to the target hardware.
Further, the quantization module 440 is further configured to:
respectively determining target parameters meeting the target quantization clipping range as parameters to be quantized;
Respectively determining target parameters which do not meet the target quantization clipping range as parameters to be mapped;
quantizing a plurality of parameters to be quantized based on the scaling factor and the target quantization bit width;
if any parameter to be mapped is larger than the quantization maximum value of the target quantization bit width, mapping the any parameter to be mapped to the quantization maximum value;
And if any parameter to be mapped is smaller than the quantization minimum value of the target quantization bit width, mapping the any parameter to be mapped to the quantization minimum value.
Further, the neural network lightweight device is further configured to:
Acquiring a first output of the model to be quantized, a first inference time consumption of the model to be quantized, a second output of the quantized model and a second inference time consumption of the quantized model;
determining a loss value based on the first output, the second output, the first inference consuming time, and the second inference consuming time;
And selecting a target parameter combination meeting the optimal quantization loss condition in the search space based on the loss value.
It should be noted that, in the specific operation, the neural network light-weight device provided by the present invention may execute the neural network light-weight method described in any of the above embodiments, which is not described in detail in this embodiment.
Fig. 5 is a schematic structural diagram of an electronic device according to the present invention. As shown in fig. 5, the electronic device may include a processor (processor) 510, a communication interface (Communications Interface) 520, a memory (memory) 530 and a communication bus 540, where the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may call logic instructions in the memory 530 to execute the neural network lightweight method, which includes: determining a first preset value interval of a transfer difficulty coefficient, a second preset value interval of a quantization clipping range and a third preset value interval of a quantization bit width, where the transfer difficulty coefficient is a coefficient used for adjusting the quantization difficulty of a plurality of parameters in a network layer and the quantization clipping range is the range of parameters to be quantized among the plurality of parameters in the network layer; constructing a search space based on the first, second and third preset value intervals; determining the optimal parameter combination of each network layer in the model to be quantized in the search space, where the optimal parameter combination comprises a target transfer difficulty coefficient, a target quantization clipping range and a target quantization bit width; and quantizing the model to be quantized layer by layer based on the optimal parameter combination of each network layer to obtain a quantized model.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the invention further provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to execute the neural network lightweight method provided by the embodiments above. The method comprises: determining a first preset value interval of a transfer difficulty coefficient, a second preset value interval of a quantization clipping range and a third preset value interval of a quantization bit width, wherein the transfer difficulty coefficient is a coefficient used for adjusting the quantization difficulty of a plurality of parameters in a network layer and the quantization clipping range refers to the range of parameters to be quantized among the plurality of parameters in the network layer; constructing a search space based on the first, second and third preset value intervals; determining the optimal parameter combination of each network layer in the model to be quantized in the search space, wherein the optimal parameter combination comprises a target transfer difficulty coefficient, a target quantization clipping range and a target quantization bit width; and quantizing the model to be quantized layer by layer based on the optimal parameter combination of each network layer.
In still another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the neural network lightweight method provided in the foregoing embodiments. The method includes: determining a first preset value interval of a transfer difficulty coefficient, a second preset value interval of a quantization clipping range, and a third preset value interval of a quantization bit width, where the transfer difficulty coefficient is a coefficient used to adjust the quantization difficulty of a plurality of parameters in a network layer, and the quantization clipping range refers to the range of the parameters to be quantized among the plurality of parameters in the network layer; constructing a search space based on the first, second, and third preset value intervals; determining, in the search space, an optimal parameter combination for each network layer of a model to be quantized, the optimal parameter combination including a target transfer difficulty coefficient, a target quantization clipping range, and a target quantization bit width; and quantizing the model to be quantized layer by layer based on the optimal parameter combination of each network layer to obtain a quantized model.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general-purpose hardware platform or, of course, by hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or in some parts of the embodiments.
It should be noted that the above embodiments are merely intended to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solutions described in the above embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (12)
1. A neural network lightweight method, comprising:
determining a first preset value interval of a transfer difficulty coefficient, a second preset value interval of a quantization clipping range, and a third preset value interval of a quantization bit width, wherein the transfer difficulty coefficient is a coefficient used to adjust the quantization difficulty of a plurality of parameters in a network layer;
constructing a search space based on the first preset value interval, the second preset value interval, and the third preset value interval;
determining, in the search space, an optimal parameter combination of each network layer in a model to be quantized, wherein the optimal parameter combination comprises a target transfer difficulty coefficient, a target quantization clipping range, and a target quantization bit width; and
quantizing the model to be quantized layer by layer based on the optimal parameter combination of each network layer to obtain a quantized model.
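For readers who want a concrete picture of claim 1, the following minimal Python sketch shows one way the three preset value intervals could be discretized into a per-layer search space and searched exhaustively. The candidate values, the function names, and the scoring callback `score_fn` are all illustrative assumptions, not values or interfaces prescribed by the patent.

```python
import itertools

# Illustrative candidates only; the patent does not prescribe these values.
ALPHAS = [0.3, 0.5, 0.7]                 # first interval: transfer difficulty coefficients
CLIP_PERCENTILES = [99.0, 99.9, 100.0]   # second interval: clipping ranges (percentiles)
BIT_WIDTHS = [4, 8]                      # third interval: quantization bit widths

def build_search_space():
    """Cartesian product of the three preset value intervals."""
    return list(itertools.product(ALPHAS, CLIP_PERCENTILES, BIT_WIDTHS))

def search_layer(layer, score_fn):
    """Exhaustively score each candidate combination for one network layer
    and return the combination with the lowest quantization loss."""
    best_combo, best_loss = None, float("inf")
    for combo in build_search_space():
        loss = score_fn(layer, *combo)  # e.g. MSE between FP and quantized outputs
        if loss < best_loss:
            best_combo, best_loss = combo, loss
    return best_combo
```

An exhaustive product is tractable here because each interval holds only a handful of candidates; a larger space would call for a heuristic search instead.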
2. The neural network lightweight method according to claim 1, wherein, when determining the optimal parameter combination of each network layer in the model to be quantized in the search space, the following steps are performed for each network layer:
acquiring a calibration data set of the model to be quantized;
performing model inference based on the calibration data set to obtain a plurality of initial activation values of the network layer;
determining an initial weight corresponding to each initial activation value; and
selecting, in the first preset value interval, the target transfer difficulty coefficient of the network layer in the model to be quantized based on the plurality of initial activation values and the plurality of initial weights.
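One plausible reading of the target transfer difficulty coefficient in claim 2 is a SmoothQuant-style migration strength that shifts quantization difficulty between activations and weights. The sketch below is an assumption made for illustration rather than the patented procedure: it scores each candidate coefficient by the fake-quantization error it leaves on the calibration activations and weights, and all shapes and helper names are hypothetical.

```python
import numpy as np

def fake_quant(x, bits=8):
    """Symmetric per-tensor fake quantization (quantize then dequantize)."""
    scale = max(np.abs(x).max() / (2 ** (bits - 1) - 1), 1e-12)
    return np.round(x / scale) * scale

def pick_transfer_coefficient(acts, weights, alphas=(0.3, 0.5, 0.7)):
    """Select the coefficient in the first preset interval that minimizes the
    joint fake-quantization error of activations and weights.
    acts: (samples, in_channels); weights: (in_channels, out_channels)."""
    best_alpha, best_err = None, float("inf")
    for alpha in alphas:
        # Per-channel factor shifting quantization difficulty from activations
        # to weights; a larger alpha relieves the activations more.
        s = np.abs(acts).max(axis=0) ** alpha / np.abs(weights).max(axis=1) ** (1 - alpha)
        s = np.maximum(s, 1e-8)
        a_err = np.mean((acts / s - fake_quant(acts / s)) ** 2)
        w_err = np.mean((weights * s[:, None] - fake_quant(weights * s[:, None])) ** 2)
        if a_err + w_err < best_err:
            best_alpha, best_err = alpha, a_err + w_err
    return best_alpha
```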
3. The neural network lightweight method according to claim 2, further comprising, after selecting the target transfer difficulty coefficient of the network layer in the model to be quantized in the first preset value interval based on the plurality of initial activation values and the plurality of initial weights:
determining a scaling factor based on the target transfer difficulty coefficient, wherein the scaling factor is used to quantize the parameters to be quantized; and
converting each initial activation value and each initial weight based on the scaling factor to obtain a plurality of target activation values and a plurality of target weights.
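Under the same assumed reading, the scaling factor of claim 3 can be derived per input channel from the chosen coefficient and applied in opposite directions to activations and weights, which leaves the layer's matrix product mathematically unchanged. This is a sketch of that conversion, not the patent's definitive formula; the shapes in the usage example are hypothetical.

```python
import numpy as np

def apply_scaling(acts, weights, alpha):
    """Compute the per-channel scaling factor implied by the target transfer
    difficulty coefficient alpha and convert initial activations and weights
    into target activation values and target weights."""
    s = np.abs(acts).max(axis=0) ** alpha / np.abs(weights).max(axis=1) ** (1 - alpha)
    s = np.maximum(s, 1e-8)                # guard against all-zero channels
    return acts / s, weights * s[:, None], s

# Hypothetical usage: a layer with 64 input and 32 output channels.
acts = np.random.randn(128, 64)
weights = np.random.randn(64, 32)
target_acts, target_weights, s = apply_scaling(acts, weights, alpha=0.5)
```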
4. The neural network lightweight method according to claim 3, wherein, when determining the optimal parameter combination of each network layer in the model to be quantized in the search space, the following steps are further performed for each network layer:
selecting a plurality of outliers from the plurality of target activation values and the plurality of target weights;
constructing a number axis based on a plurality of target parameters, the plurality of target parameters comprising the plurality of target activation values and the plurality of target weights;
determining the remaining target parameters among the plurality of target parameters, other than the plurality of outliers, as normal values;
determining, based on the number axis, a first parameter distribution of the plurality of normal values and a second parameter distribution of the plurality of outliers; and
selecting, in the second preset value interval, the target quantization clipping range of the network layer in the model to be quantized based on the first parameter distribution and the second parameter distribution.
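A hedged illustration of claim 4: if the clipping candidates in the second preset interval are expressed as percentiles of absolute value on the number axis, each candidate can be scored by how cleanly it keeps the normal-value distribution inside the range while leaving the outlier distribution outside it. The scoring rule below is an assumption for illustration only.

```python
import numpy as np

def pick_clip_range(target_params, outlier_mask, candidates=(99.0, 99.9, 100.0)):
    """Score each candidate clipping range by how well it separates the
    normal-value distribution from the outlier distribution on the number
    axis, and return the best candidate."""
    normal = target_params[~outlier_mask]
    outliers = target_params[outlier_mask]
    best_pct, best_score = None, -np.inf
    for pct in candidates:
        clip = np.percentile(np.abs(target_params), pct)
        kept = np.mean(np.abs(normal) <= clip) if normal.size else 0.0      # normals inside
        excluded = np.mean(np.abs(outliers) > clip) if outliers.size else 1.0  # outliers outside
        if kept + excluded > best_score:
            best_pct, best_score = pct, kept + excluded
    return best_pct
```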
5. The neural network lightweight method according to claim 4, wherein the selecting a plurality of outliers from the plurality of target activation values and the plurality of target weights comprises:
determining the magnitude of each target activation value;
determining the target activation values corresponding to the N largest magnitudes among the plurality of magnitudes as abnormal activation values;
determining the target weight corresponding to each abnormal activation value as an abnormal weight; and
determining the plurality of abnormal activation values and the plurality of abnormal weights as the outliers.
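The outlier selection of claim 5 maps naturally onto a top-N-by-magnitude rule. In this sketch the target activation values are assumed to be one per channel, and the target weights are assumed to pair channel-wise with them; both assumptions are illustrative, not stated by the patent.

```python
import numpy as np

def select_outliers(target_acts, target_weights, n=8):
    """Flag the N target activation values with the largest magnitudes, plus
    the weights in the matching channels, as outliers.
    target_acts: (channels,); target_weights: (channels, out_features)."""
    magnitudes = np.abs(target_acts)                 # step 1: magnitudes
    outlier_channels = np.argsort(magnitudes)[-n:]   # step 2: N largest magnitudes
    act_mask = np.zeros(target_acts.shape[0], dtype=bool)
    act_mask[outlier_channels] = True                # abnormal activation values
    weight_mask = np.zeros(target_weights.shape, dtype=bool)
    weight_mask[outlier_channels] = True             # step 3: paired abnormal weights
    return act_mask, weight_mask                     # step 4: the outlier set
```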
6. The neural network lightweight method according to claim 1, wherein, when determining the optimal parameter combination of each network layer in the model to be quantized in the search space, the following steps are further performed for each network layer:
determining the target hardware on which the quantized model is to be deployed; and
selecting, in the third preset value interval, the target quantization bit width of the network layer in the model to be quantized according to the target hardware.
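Claim 6 ties the target quantization bit width to the deployment hardware. The capability table below is entirely hypothetical; in practice the supported bit widths come from the target accelerator's documentation.

```python
# Hypothetical capability table: which integer bit widths each deployment
# target executes natively. Real constraints come from the hardware vendor.
SUPPORTED_BITS = {
    "server_gpu": {4, 8, 16},
    "mobile_npu": {8, 16},
    "mcu": {8},
}

def pick_bit_width(target_hw, third_interval=(4, 8, 16)):
    """Choose the smallest bit width in the third preset interval that the
    target hardware supports, maximizing compression without emulation."""
    usable = sorted(b for b in third_interval if b in SUPPORTED_BITS.get(target_hw, {8}))
    return usable[0] if usable else max(third_interval)
```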
7. The neural network lightweight method according to claim 5, wherein, when the model to be quantized is quantized layer by layer based on the optimal parameter combination of each network layer, the following steps are performed for each network layer:
determining the target parameters that fall within the target quantization clipping range as parameters to be quantized;
determining the target parameters that fall outside the target quantization clipping range as parameters to be mapped;
quantizing the plurality of parameters to be quantized based on the scaling factor and the target quantization bit width;
mapping any parameter to be mapped that is larger than the quantization maximum of the target quantization bit width to the quantization maximum; and
mapping any parameter to be mapped that is smaller than the quantization minimum of the target quantization bit width to the quantization minimum.
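Claim 7's split between parameters to be quantized and parameters to be mapped can be sketched as follows: values inside the target clipping range take the normal round-and-scale path, while values outside it saturate at the quantization extrema. The function below is one illustrative reading, not the patent's exact procedure.

```python
import numpy as np

def quantize_with_clipping(target_params, clip_range, scale, bits):
    """Quantize in-range parameters with the scaling factor and bit width;
    saturate out-of-range parameters at the quantization maximum/minimum."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    to_quantize = np.abs(target_params) <= clip_range       # parameters to be quantized
    q = np.where(to_quantize,
                 np.round(target_params / scale),           # normal quantization path
                 np.where(target_params > 0, qmax, qmin))   # parameters to be mapped
    return np.clip(q, qmin, qmax).astype(np.int32)          # keep the normal path in range too
```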
8. The neural network lightweight method according to any one of claims 1 to 7, further comprising, after quantizing the model to be quantized layer by layer based on the optimal parameter combination of each network layer to obtain the quantized model:
acquiring a first output of the model to be quantized, a first inference time of the model to be quantized, a second output of the quantized model, and a second inference time of the quantized model;
determining a loss value based on the first output, the second output, the first inference time, and the second inference time; and
selecting, in the search space, a target parameter combination satisfying an optimal quantization loss condition based on the loss value.
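For claim 8, one simple loss that combines the two outputs and the two inference times is a fidelity term plus a weighted latency ratio. Both the MSE fidelity term and the weight `lam` are assumptions made for illustration; the patent does not specify the exact loss form.

```python
import numpy as np

def quantization_loss(first_output, second_output, first_time, second_time, lam=0.1):
    """Combine output fidelity and inference-time cost into a single loss.
    A lower value means the quantized model is both accurate and fast."""
    fidelity = np.mean((first_output - second_output) ** 2)  # first vs. second output
    speed = second_time / max(first_time, 1e-9)              # < 1 means a speed-up
    return fidelity + lam * speed
```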
9. A neural network lightweight device, comprising:
a first determining module, configured to determine a first preset value interval of a transfer difficulty coefficient, a second preset value interval of a quantization clipping range, and a third preset value interval of a quantization bit width, wherein the transfer difficulty coefficient is a coefficient used to adjust the quantization difficulty of a plurality of parameters in a network layer, and the quantization clipping range refers to the range of the parameters to be quantized among the plurality of parameters in the network layer;
a construction module, configured to construct a search space based on the first preset value interval, the second preset value interval, and the third preset value interval;
a second determining module, configured to determine, in the search space, an optimal parameter combination of each network layer in a model to be quantized, wherein the optimal parameter combination comprises a target transfer difficulty coefficient, a target quantization clipping range, and a target quantization bit width; and
a quantization module, configured to quantize the model to be quantized layer by layer based on the optimal parameter combination of each network layer to obtain a quantized model.
10. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the neural network lightweight method according to any one of claims 1 to 8.
11. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the neural network lightweight method according to any one of claims 1 to 8.
12. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the neural network lightweight method according to any one of claims 1 to 8.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411094400.3A | 2024-08-09 | 2024-08-09 | Neural network lightweight method, device, equipment, medium and product |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119089962A (en) | 2024-12-06 |
Family
ID=93696813
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411094400.3A | Neural network lightweight method, device, equipment, medium and product | 2024-08-09 | 2024-08-09 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119089962A (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119849561A (en) * | 2025-03-21 | 2025-04-18 | 浙江大学 | Neural network lightweight and efficient FPGA deployment method based on Bayesian optimization |
| CN119849561B (en) * | 2025-03-21 | 2025-08-26 | 浙江大学 | A lightweight and efficient FPGA deployment method for neural networks based on Bayesian optimization |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |