Disclosure of Invention
To address the defects of the prior art, the invention provides a large language model weight and activation joint quantization method and system that solve the problems in the prior art.
The aim of the invention can be achieved by the following technical scheme:
A large language model weight and activation joint quantization method comprises the following steps:
collecting a calibration set, preprocessing the calibration set, inputting the calibration set into a large language model to execute forward propagation, and recording an activation matrix of each layer;
Based on the activation matrix, counting the maximum absolute value of activation of all the tokens in each embedded dimension, dynamically generating a global threshold by a quantile counting method in combination with the global sensitivity coefficient, and judging a dimension whose maximum absolute activation value exceeds the global threshold to be an outlier dimension;
based on the outlier dimension judgment result, differential scaling factors are respectively designed for the normal dimension and the outlier dimension and are applied to corresponding rows of the original weight matrix to generate a reconstruction weight matrix;
based on the reconstruction weight matrix, optimizing the truncation threshold of the reconstruction weight using Bayes-gradient joint optimization;
Calculating a scaling factor of the reconstruction weight matrix based on the optimized cut-off threshold value, and quantizing the reconstruction weight to obtain a reconstruction quantization weight matrix;
loading scaling factors of the normal dimension and the outlier dimension, and quantizing the activation matrix of the current layer along the embedded dimension by applying the scaling factors to obtain a quantized activation matrix;
and multiplying the quantized activation matrix and the reconstructed quantization weight matrix to obtain a multiplication output result in the integer domain, and mapping the multiplication output result in the integer domain back to a floating point approximation representation based on a scaling factor of the reconstructed weight matrix to perform unified inverse quantization recovery.
Further, the calculation formula of the global threshold τ is as follows:
τ = γ · Quantile_q(A_max)
wherein Quantile_q(·) is the quantile statistics function; γ is the global sensitivity coefficient; A_max is the set of activation maximum absolute values; q is the high quantile parameter.
Further, the construction rules of the reconstructed weight matrix W′ are as follows:
W′_{j,:} = s · W_{j,:} for j ∉ O;  W′_{j,:} = s_j · W_{j,:} for j ∈ O
with s = max_{j ∉ O} a_j / (2^(N−1) − 1) and s_j = a_j / (2^(N−1) − 1)
wherein s is the unified scaling factor of all normal dimensions; s_j is the independent scaling factor of each outlier dimension; j is the index of the embedded dimension of the activation matrix and also the row index of the original weight matrix and the reconstructed weight matrix; i is the row index of the activation matrix, namely the index of each token; O is the outlier dimension set, in which the indices of the outlier dimensions are stored; N is the quantization bit width; W′_{j,:} is the j-th row vector of the reconstructed weight matrix; W_{j,:} is the j-th row vector of the original weight matrix; X_{ij} is the value of the j-th embedded dimension of the i-th token in the activation matrix; a_j = max_i |X_{ij}| is the maximum absolute value of activation of all tokens in dimension j.
Further, the quantization process of the reconstructed weight is as follows:
Ŵ_{dk} = clip( round( (W′_{dk} − T_low) / s_W ), 0, 2^N − 1 )
wherein Ŵ_{dk} is the value of the reconstructed quantization weight matrix at row d, column k; W′_{dk} is the value of the reconstructed weight matrix at row d, column k, which is a floating-point value; T_low and T_high are respectively the lower limit and the upper limit of the truncation threshold of the reconstructed weight matrix in the quantization process; s_W is the scaling factor of the reconstructed weight matrix; N is the quantization bit width; round(·) represents the rounding operation; clip(·) is the interval truncation function.
Further, the formula for quantizing the activation matrix of the current layer by applying the scaling factors along the embedded dimension is:
X̂_{ij} = round( X_{ij} / s_j )
wherein j is the column index of the activation matrix and of the quantized activation matrix, representing the index of the embedded dimension; X̂_{ij} is the value of the j-th embedded dimension of the i-th token in the quantized activation matrix of the current layer, which is an integer value; X_{ij} is the value of the j-th embedded dimension of the i-th token in the current-layer activation matrix, which is a floating-point value; round(·) represents the rounding operation; s_j is the scaling factor of the j-th embedded dimension of the activation matrix; O is the outlier dimension set, in which the indices of the outlier dimensions are stored; if j ∉ O, i.e. j is a normal dimension, then s_j = s, the unified scaling factor of all normal dimensions of the activation matrix; if j ∈ O, i.e. j is an outlier dimension, then s_j is the independent scaling factor of that outlier dimension.
Further, the expression of unified dequantization recovery is:
Y = s_W · ( X̂ · Ŵ )
wherein Y is the output matrix after unified dequantization recovery; X̂ · Ŵ is the multiplication output result in the integer domain; s_W is the scaling factor of the reconstructed weight matrix; X̂ is the quantized activation matrix; Ŵ is the reconstructed quantization weight matrix.
A large language model weight and activation joint quantization system, comprising:
The activation matrix acquisition module is used for collecting a calibration set, preprocessing the calibration set, inputting the calibration set into a large language model to execute forward propagation, and recording the activation matrix of each layer;
The outlier dimension judging module is used for counting the maximum absolute value of the activation of all the word elements on each embedded dimension based on the activation matrix, dynamically generating a global threshold value by combining a global sensitivity coefficient through a quantile counting method, and judging that the maximum absolute value of the activation in the dimension exceeds the global threshold value as an outlier dimension;
The weight reconstruction module is used for respectively designing differentiated scaling factors for the normal dimension and the outlier dimension based on the outlier dimension judgment result and applying the scaling factors to corresponding rows of the original weight matrix to generate a reconstructed weight matrix;
The truncated threshold optimization module is used for optimizing a truncated threshold of the reconstruction weight by using Bayes-gradient combination based on the reconstruction weight matrix;
the weight quantization module is used for calculating a scaling factor of the reconstruction weight matrix based on the optimized cut-off threshold value and quantizing the reconstruction weight to obtain the reconstruction quantization weight matrix;
the activation matrix quantization module is used for loading scaling factors of normal dimension and outlier dimension, and quantizing the activation matrix of the current layer according to the embedded dimension by applying the scaling factors to obtain a quantized activation matrix;
And the unified inverse quantization module is used for carrying out multiplication calculation on the quantized activation matrix and the reconstructed quantization weight matrix to obtain a multiplication output result in the integer domain, and mapping the multiplication output result in the integer domain back to the floating point approximate representation based on the scaling factor of the reconstructed weight matrix to carry out unified inverse quantization recovery.
A computer storage medium storing a readable program which, when executed by a processor, is capable of performing a large language model weight and activation joint quantization method as described above.
An electronic device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
The memory is configured to store at least one executable instruction that causes the processor to perform operations corresponding to the large language model weight and activation joint quantization method described above.
A computer program product comprising computer instructions that instruct a computing device to perform operations corresponding to the large language model weight and activation joint quantization method described above.
The invention has the beneficial effects that:
1. The invention designs an activation outlier judging mechanism based on a dynamic quantile threshold and a differentiated scaling factor calculation method, so that activation dimensions with long-tailed distributions in a large language model can be quantized finely, effectively suppressing quantization errors on the activation side. It further optimizes the weight quantization truncation threshold through a joint optimization strategy combining Bayesian global search with gradient descent local fine tuning, dynamically adapting to the extremum distribution problem on the weight side and realizing cooperative control of quantization precision and hardware execution efficiency.
2. The invention pre-multiplies the activation scaling factors into the weights and implements an offline quantization design, so that only one inverse quantization operation is needed to recover accuracy during inference; combined with hardware-friendly INT8 matrix multiplication instructions, this realizes large language model inference acceleration with high accuracy, high throughput and low latency.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in FIG. 1, the large language model weight and activation joint quantization method comprises the following steps:
s1, collecting a calibration set, preprocessing the calibration set, inputting the calibration set into a large language model to execute forward propagation, and recording an activation matrix of each layer;
A representative set of input data (calibration set) is collected, covering the various input scenarios that the large language model may encounter, and the calibration set is preprocessed. The calibration set is input into the large language model to be quantized, forward propagation is performed, and the activation matrix of each layer is recorded. The large language model is a deep neural network model constructed based on the Transformer structure, with a stacked structure formed by multi-layer self-attention modules and feedforward networks; its parameter count is generally on the order of hundreds of millions to billions, and it is suitable for tasks such as natural language understanding and text generation. The model may adopt a Decoder-only structure (such as a GPT model), an Encoder-only structure (such as a BERT model) or an Encoder-Decoder structure (such as a T5 model). The specific steps comprise:
s11, collecting and preprocessing a calibration set;
The data in the calibration set are standardized, including tokenization (word segmentation), encoding, normalization, sequence padding and length alignment, so as to ensure that the large language model can directly accept the data for forward propagation and to promote sampling representativeness.
S12, activating data acquisition;
The preprocessed calibration set is input one sample at a time into the original full-precision model (taking 16-bit floating-point precision as an example), forward propagation is executed sequentially along the model structure, and the current-layer activation matrix X ∈ ℝ^(T×D) is recorded, wherein T is the number of tokens and D is the number of embedded dimensions.
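One possible way to realize S11 and S12 in practice is sketched below in Python with PyTorch forward hooks; this is an illustrative sketch rather than part of the claimed method, and the choice to hook torch.nn.Linear modules, the batching of the calibration data and the plain model(batch) call convention are assumptions.

```python
# Sketch of S11-S12: run the full-precision model over the preprocessed calibration
# set and record the activation matrix X (T x D) entering each linear layer.
import torch
from collections import defaultdict

def collect_activations(model, calib_batches, layer_types=(torch.nn.Linear,)):
    acts = defaultdict(list)
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach()                      # activation entering this layer
            acts[name].append(x.reshape(-1, x.shape[-1]).float().cpu())   # rows = tokens
        return hook

    for name, module in model.named_modules():
        if isinstance(module, layer_types):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch in calib_batches:                     # preprocessed token-id tensors
            model(batch)                                # call convention depends on the model

    for h in hooks:
        h.remove()
    # Concatenate the token rows of all calibration samples per layer: X of shape (T, D).
    return {name: torch.cat(xs, dim=0) for name, xs in acts.items()}
```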
S2, based on the activation matrix, counting the maximum absolute value of activation of all the tokens in each embedded dimension, dynamically generating a global threshold by a quantile counting method in combination with a global sensitivity coefficient, and judging a dimension whose maximum absolute activation value exceeds the global threshold to be an outlier dimension;
as shown in fig. 2, the step of determining the outlier dimension includes:
S21, counting the activation extremum (activation maximum absolute value);
Let the current-layer activation matrix be X ∈ ℝ^(T×D), wherein T is the number of tokens and D is the number of embedded dimensions. For each embedded dimension j, calculate its maximum absolute value of activation across all tokens:
a_j = max_i |X_{ij}|, i = 1, …, T
obtaining the activation maximum absolute value set A_max = {a_1, …, a_D}, which is used to measure the dynamic amplitude of the different embedded dimensions.
S22, calculating a global threshold;
To automatically determine outlier dimensions, a high quantile parameter q and a global sensitivity coefficient γ are set, and the global threshold τ is generated according to the following formula:
τ = γ · Quantile_q(A_max)
wherein Quantile_q(·) is the quantile statistics function, which extracts the q-th quantile from the activation maximum absolute values of all dimensions; q is the high quantile parameter, which determines the position of the quantile of the activation value distribution used in the global threshold calculation, and the closer its value is to 100%, the wider the data it covers; γ is the global sensitivity coefficient, used to adjust the scaling amplitude of the global threshold and to enhance sensitivity to outliers, being more sensitive to extreme values when its value is higher.
S23, outlier dimension judgment;
The activation maximum absolute value corresponding to each dimension is compared with the global threshold to judge whether that dimension is an outlier dimension. The outlier dimension set O is defined as follows:
O = { j | a_j > τ }
That is, if the activation maximum absolute value a_j of a certain dimension j exceeds the global threshold τ, the dimension is regarded as an outlier dimension of the activation distribution, and independent scaling processing will be applied to it in the subsequent quantization flow.
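A minimal NumPy sketch of S21-S23 follows; the example values of q and γ are placeholders only, since the embodiment determines them by grid search.

```python
# Sketch of S21-S23: per-dimension activation extrema a_j, the dynamic quantile
# threshold tau = gamma * Quantile_q(A_max), and the outlier dimension set O.
import numpy as np

def find_outlier_dims(X, q=0.999, gamma=1.5):
    """X: (T, D) floating-point activation matrix of the current layer."""
    a_max = np.abs(X).max(axis=0)               # a_j = max_i |X_ij|, shape (D,)
    tau = gamma * np.quantile(a_max, q)         # global threshold over A_max
    outlier_dims = np.flatnonzero(a_max > tau)  # O = { j | a_j > tau }
    return a_max, tau, outlier_dims
```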
In this embodiment, considering that different model layers have different tolerance to outlier detection, a grid search strategy is adopted to determine the optimal parameter combination within a candidate parameter set, so as to perform hyperparameter search optimization and improve quantization robustness and generalization performance. Specifically, let the high quantile parameter q take values in a discrete candidate set and the global sensitivity coefficient γ take values in a continuous candidate interval. All combinations are enumerated, and the optimization target is defined as:
(q*, γ*) = argmin_{q, γ} E[ ‖ Q(X) · Q(W′) − X · W ‖_F² ]
wherein q is the high quantile parameter, whose value range is a discrete set, and the closer its value is to 100%, the wider the data it covers; γ is the global sensitivity coefficient, whose value range is a continuous interval, used to adjust the scaling amplitude of the global threshold and enhance sensitivity to outliers, being more sensitive to extreme values when its value is higher; Q(·) is the quantization operation function, mapping floating-point inputs to the target integer range (e.g., INT8); X and W′ are respectively the activation matrix and the reconstructed weight matrix; W is the original weight matrix; E[·] is the mathematical expectation operator, representing the statistical average over all samples in the calibration dataset; ‖·‖_F² is the squared Frobenius norm of a matrix, defined as the sum of squares of the matrix elements and used to measure the output difference before and after quantization; (q*, γ*) is the optimal parameter combination determined by the grid search strategy, the goal being to minimize the mean square error (MSE) between each layer's quantized output under the outlier dimension decision and the original full-precision output.
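The per-layer objective above can be evaluated as sketched below. The candidate grids for q and γ and the symmetric per-tensor weight fake-quantization are illustrative assumptions; the claimed method quantizes the reconstructed weights asymmetrically with the truncation thresholds optimized later in S4 and S5, so this sketch only shows the shape of the grid search.

```python
# Sketch of the (q, gamma) grid search: pick the combination minimizing the MSE
# between the quantized layer output and the full-precision output X @ W.
import numpy as np

def dim_scales(a_max, outlier_dims, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    normal = np.ones(a_max.shape[0], dtype=bool)
    normal[outlier_dims] = False
    s = np.empty_like(a_max)
    s[normal] = a_max[normal].max() / qmax          # unified scale for normal dims
    s[outlier_dims] = a_max[outlier_dims] / qmax    # independent scales for outlier dims
    return s

def layer_mse(X, W, q, gamma, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    a_max = np.abs(X).max(axis=0)
    outliers = np.flatnonzero(a_max > gamma * np.quantile(a_max, q))
    s = dim_scales(a_max, outliers, n_bits)
    W_rec = s[:, None] * W                          # fold activation scales into W's rows
    X_hat = np.round(X / s[None, :])                # integer-domain activations
    s_w = np.abs(W_rec).max() / qmax                # simplified symmetric weight scale
    W_hat = np.clip(np.round(W_rec / s_w), -qmax, qmax)
    out_q = s_w * (X_hat @ W_hat)                   # integer product + one dequantization
    return float(np.mean((out_q - X @ W) ** 2))

def grid_search(X, W, qs=(0.99, 0.999, 0.9999), gammas=np.linspace(1.0, 2.0, 11)):
    best = min((layer_mse(X, W, q, g), q, g) for q in qs for g in gammas)
    return best[1], best[2]                         # (q*, gamma*)
```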
S3, based on the outlier dimension judgment result, respectively designing differentiated scaling factors for the normal dimension and the outlier dimension, and applying the scaling factors to corresponding rows of the original weight matrix to generate a reconstruction weight matrix;
the step of generating a reconstructed weight matrix comprises:
S31, scaling factor calculation:
1) Scaling factor for normal dimensions: for all non-outlier dimensions, i.e. normal dimensions j ∉ O, the maximum absolute value of activation in each dimension is counted:
a_j = max_i |X_{ij}|, i = 1, …, T
wherein X ∈ ℝ^(T×D) represents the activation matrix of the current layer, T is the number of tokens and D is the number of embedded dimensions; i is the row index of the activation matrix, representing the position of the token in the sequence; j is the column index of the activation matrix, representing the index of the embedded dimension; X_{ij} is the value of the j-th embedded dimension of the i-th token in the current-layer activation matrix; a_j is the maximum absolute value of activation of all tokens in the normal dimension j.
Taking the maximum of the activation maximum absolute values over all normal dimensions, the scaling factor of the normal dimensions is defined as:
s = max_{j ∉ O} a_j / (2^(N−1) − 1)
wherein N is the quantization bit width (e.g., INT8 corresponds to N = 8); symmetric quantization is typically used, mapping values to the interval [−(2^(N−1) − 1), 2^(N−1) − 1], and the scaling factor of all normal dimensions is unified to s;
2) Scaling factor of outlier dimensions: for each outlier dimension j ∈ O, its independent scaling factor is calculated respectively:
s_j = a_j / (2^(N−1) − 1), j ∈ O
wherein s_j is the independent scaling factor of each outlier dimension, used to preserve the dynamic range of the outlier dimension and avoid precision loss; the independent scaling factors of all outlier dimensions form the set {s_j | j ∈ O};
S32, calculating a reconstruction weight matrix;
To maintain the numerical equivalence of the scaling operation in the inference path, the scaling factors are multiplied into the corresponding rows of the original weight matrix to obtain the reconstructed weight matrix W′. The construction rule is as follows:
W′_{j,:} = s · W_{j,:} for j ∉ O;  W′_{j,:} = s_j · W_{j,:} for j ∈ O
wherein j is the index of the embedded dimension of the activation matrix, and is also the row index of the original weight matrix and of the reconstructed weight matrix; W′_{j,:} is the j-th row vector of the reconstructed weight matrix; W_{j,:} is the j-th row vector of the original weight matrix; the quantization of the reconstructed weight matrix W′ will be performed in the offline phase.
To support the inference stage, the following quantities need to be stored in S3:
(1) the unified scaling factor s of all normal dimensions of the activation matrix;
(2) the set of independent scaling factors {s_j | j ∈ O} of the outlier dimensions of the activation matrix;
(3) the reconstructed weight matrix W′.
These parameters will participate in the quantization mapping and the inference-time multiplication in the subsequent steps. By fusing the activation scaling factors into the weights in the offline stage and performing the weight quantization once offline, the inference stage only needs to quantize the activation on one side to perform efficient integer matrix multiplication, thereby further reducing online iterative computation overhead.
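A short sketch of S31-S32 under the notation introduced above; the dictionary used to persist the three stored quantities is an illustrative assumption.

```python
# Sketch of S31-S32: unified scale s for normal dimensions, independent scales s_j
# for outlier dimensions, and the row-wise reconstruction W'_{j,:} = s_j * W_{j,:}.
import numpy as np

def reconstruct_weights(X, W, outlier_dims, n_bits=8):
    """X: (T, D) activations, W: (D, K) original weights, outlier_dims: indices in O."""
    qmax = 2 ** (n_bits - 1) - 1
    a_max = np.abs(X).max(axis=0)                       # a_j = max_i |X_ij|
    is_outlier = np.zeros(W.shape[0], dtype=bool)
    is_outlier[outlier_dims] = True

    scales = np.empty(W.shape[0])
    scales[~is_outlier] = a_max[~is_outlier].max() / qmax   # unified s (normal dims)
    scales[is_outlier] = a_max[is_outlier] / qmax           # independent s_j (outliers)

    W_rec = scales[:, None] * W                         # W' = diag(scales) @ W
    saved = {                                           # parameters stored for inference
        "s_normal": float(a_max[~is_outlier].max() / qmax),
        "s_outlier": {int(j): float(scales[j]) for j in outlier_dims},
        "W_rec": W_rec,
    }
    return W_rec, scales, saved
```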
S4, based on the reconstruction weight matrix, optimizing the truncation threshold of the reconstruction weight using Bayes-gradient joint optimization;
As shown in fig. 3, in order to further enhance the weight quantization accuracy and avoid the problem of dynamic range compression caused by extremum, the present invention proposes a truncated threshold search strategy combining bayesian optimization and gradient fine tuning. The strategy is divided into two stages, namely global parameter space exploration is firstly carried out by using Bayesian optimization to determine a better initial cut-off threshold value, and gradient descent fine adjustment is then carried out on the basis of Bayesian optimization results to further reduce quantization errors.
The specific steps of optimizing the truncation threshold of the reconstruction weight include:
S41, initializing a search space;
Let the reconstructed weight matrix be W′ ∈ ℝ^(D×K), wherein D is the number of input channels and K is the number of output channels. Its quantization truncation interval (truncation threshold) is defined as [T_low, T_high], wherein T_low and T_high are respectively the lower limit and the upper limit of the truncation threshold of the reconstructed weight matrix in the quantization process; each element W′_{dk}, the value of the reconstructed weight matrix at row d, column k, is limited to fall within this dynamic range:
T_low ≤ W′_{dk} ≤ T_high
During initialization, the search interval is set as:
[min(W′), max(W′)]
wherein min(W′) and max(W′) are the minimum and maximum values of the reconstructed weight matrix W′. Within this range, m groups of candidate truncation thresholds T_i = (T_low^(i), T_high^(i)), i = 1, …, m, are randomly selected; for each group, reconstructed weight quantization is performed, and the KL divergence between the reconstructed quantization weight distribution P_i and the reconstructed weight distribution Q is calculated:
y_i = KL(P_i ‖ Q)
wherein y_i is the KL divergence value corresponding to the i-th group of candidate truncation thresholds, used to evaluate the deviation between the quantized reconstructed weight distribution and the reconstructed weight distribution under that candidate truncation threshold; KL(·‖·) is the KL divergence function, measuring the relative entropy between two probability distributions P and Q.
The obtained data pairs (T_i, y_i) constitute the initial training dataset:
D_0 = { (T_i, y_i) }_{i=1}^{m}
The reconstructed weight quantization adopts the following asymmetric linear mapping:
Ŵ_{dk} = clip( round( (W′_{dk} − T_low) / (T_high − T_low) · (2^N − 1) ), 0, 2^N − 1 )
wherein Ŵ_{dk} is the value of the reconstructed quantization weight matrix at row d, column k, which is an integer value falling within the set quantization integer interval [0, 2^N − 1]; W′_{dk} is the value of the reconstructed weight matrix at row d, column k, which is a floating-point value into which the activation scaling factor has been fused; T_low and T_high are respectively the lower limit and the upper limit of the truncation threshold of the reconstructed weight matrix in the quantization process; N is the quantization bit width (e.g., INT8 corresponds to N = 8); round(·) represents the rounding operation; clip(·) is the interval truncation function, which performs interval truncation on the quantized result to ensure that it falls within the set quantization integer interval [0, 2^N − 1].
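S41 can be sketched as follows. Scoring each candidate by the KL divergence between histograms of the reconstructed weights before and after quantize-dequantize is one plausible reading of the distribution comparison above; the histogram binning, the candidate count m and the uniform random sampling of thresholds are illustrative assumptions.

```python
# Sketch of S41: sample m candidate truncation thresholds inside the value range of W',
# quantize with the asymmetric mapping above, and score each candidate by KL divergence.
import numpy as np

def asym_quantize(W_rec, t_low, t_high, n_bits=8):
    levels = 2 ** n_bits - 1
    s_w = (t_high - t_low) / levels                       # s_W = (T_high - T_low)/(2^N - 1)
    W_hat = np.clip(np.round((W_rec - t_low) / s_w), 0, levels)
    return W_hat, s_w

def kl_divergence(p, q, eps=1e-10):
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def init_candidates(W_rec, m=20, n_bits=8, bins=256, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = float(W_rec.min()), float(W_rec.max())
    ref_hist, _ = np.histogram(W_rec, bins=bins, range=(lo, hi))
    dataset = []                                          # D_0 = { (T_i, y_i) }
    while len(dataset) < m:
        t_low, t_high = np.sort(rng.uniform(lo, hi, size=2))
        if t_high - t_low < 1e-6:
            continue                                      # skip degenerate intervals
        W_hat, s_w = asym_quantize(W_rec, t_low, t_high, n_bits)
        q_hist, _ = np.histogram(W_hat * s_w + t_low, bins=bins, range=(lo, hi))
        y = kl_divergence(q_hist.astype(float), ref_hist.astype(float))
        dataset.append(((float(t_low), float(t_high)), y))
    return dataset
```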
S42, modeling a Gaussian process agent;
A Gaussian process proxy model with the KL divergence as its output is constructed, establishing a function mapping from the input candidate truncation threshold T:
f(T) ~ GP( μ(T), k(T, T′) )
wherein f(T) is the KL divergence value corresponding to the candidate truncation threshold T, used to evaluate the deviation between the reconstructed quantization weight distribution and the reconstructed weight distribution under that candidate truncation threshold; GP(·, ·) is the Gaussian process model, indicating that for any two candidate truncation thresholds T and T′ the outputs f(T) and f(T′) satisfy a joint Gaussian distribution; μ(T) is the mean function, representing the predictive mean at the candidate truncation threshold T; k(T, T′) is the covariance function (kernel function), representing the correlation between two candidate truncation thresholds; T′ is another candidate truncation threshold, which together with T forms the input pair of the kernel function for building the covariance matrix.
The kernel function employs the radial basis function (RBF kernel):
k(T, T′) = σ_f² · exp( −‖T − T′‖² / (2 l²) )
wherein σ_f² is the signal variance, a Gaussian process hyperparameter controlling the fluctuation amplitude of the function; l is the length scale, a Gaussian process hyperparameter adjusting the smoothness of the function; ‖T − T′‖ is the Euclidean distance, used to measure the similarity between two candidate truncation thresholds. The Gaussian process hyperparameters σ_f² and l are optimized by Maximum Likelihood Estimation (MLE), with the objective function:
log p(y | D_0, σ_f², l) = −(1/2) yᵀ K⁻¹ y − (1/2) log|K| + const
wherein y is the vector formed by the KL divergence values corresponding to all candidate truncation thresholds, of dimension m × 1, where m is the number of candidate truncation thresholds; D_0 is the sample set of the m groups of candidate truncation thresholds, used to train the Gaussian process proxy model; yᵀ is the transpose of y; K is the covariance matrix generated by the kernel function k(·, ·); K⁻¹ is the inverse of the covariance matrix K, used in the likelihood calculation and predictive distribution solving of the Gaussian process; yᵀ K⁻¹ y is the weighted quadratic form of the sample error term, representing the sum of squared fitting residuals of the observed KL divergences under the current Gaussian process hyperparameters σ_f² and l; const is a constant term that can be ignored when solving the maximum likelihood; log p(y | D_0, σ_f², l) is the log-likelihood function, representing the log probability density of observing y given the candidate truncation threshold sample set D_0 and the Gaussian process hyperparameters σ_f² and l.
S43, generating and iteratively updating a new candidate truncation threshold;
Based on the Gaussian process proxy model, the Expected Improvement (EI) criterion is used as the sampling strategy to select the next group of candidate truncation thresholds that potentially improves the current optimal result:
EI(T) = E[ max( y_best − f(T), 0 ) ]
wherein EI(T) is the expected improvement value of the current candidate truncation threshold T under the Gaussian process proxy model, i.e. the potential improvement benefit of the current point compared with the historical optimal result; E[·] is the mathematical expectation operator, representing the expected average gain under the uncertainty distribution (the Gaussian process predictive distribution); y_best is the current historical optimal KL divergence value; f(T) is the prediction of the KL divergence at the candidate truncation threshold T by the Gaussian process proxy model. The maximization is solved with the L-BFGS method (a limited-memory quasi-Newton method):
T_new = argmax_T EI(T)
For the new candidate truncation threshold T_new, reconstructed weight quantization and KL divergence calculation are performed, the dataset is updated, and the Gaussian process proxy model is re-fitted:
D_t ← D_{t−1} ∪ { (T_new, y_new) }
This process is repeated with an iteration upper limit of 50, yielding a converged candidate truncation threshold.
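A sketch of S42-S43, assuming an off-the-shelf Gaussian process regressor from scikit-learn with an RBF kernel stands in for the proxy model, with Expected Improvement maximized by L-BFGS-B from a few random restarts; it reuses asym_quantize, kl_divergence and the dataset produced by init_candidates in the previous sketch, and the restart count and numerical guards are assumptions.

```python
# Sketch of S42-S43: fit a GP surrogate on (T_i, y_i), pick the next thresholds by
# maximizing Expected Improvement, evaluate them, and iterate up to 50 times.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expected_improvement(gp, t, y_best):
    mu, sigma = gp.predict(np.atleast_2d(t), return_std=True)
    mu, sigma = mu[0], max(sigma[0], 1e-12)
    z = (y_best - mu) / sigma                       # improvement when minimizing KL
    return float((y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z))

def bayes_search(W_rec, dataset, n_iter=50, n_bits=8, bins=256, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = float(W_rec.min()), float(W_rec.max())
    ref_hist, _ = np.histogram(W_rec, bins=bins, range=(lo, hi))
    T = np.array([t for t, _ in dataset])
    y = np.array([v for _, v in dataset])
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)

    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(T, y)
        y_best = y.min()
        best_t, best_ei = None, -np.inf
        for _ in range(8):                          # random L-BFGS-B restarts
            x0 = np.sort(rng.uniform(lo, hi, size=2))
            res = minimize(lambda t: -expected_improvement(gp, t, y_best), x0,
                           bounds=[(lo, hi), (lo, hi)], method="L-BFGS-B")
            if -res.fun > best_ei:
                best_ei, best_t = -res.fun, res.x
        t_low, t_high = np.sort(best_t)
        t_high = max(t_high, t_low + 1e-6)          # guard against a collapsed interval
        W_hat, s_w = asym_quantize(W_rec, t_low, t_high, n_bits)
        q_hist, _ = np.histogram(W_hat * s_w + t_low, bins=bins, range=(lo, hi))
        y_new = kl_divergence(q_hist.astype(float), ref_hist.astype(float))
        T = np.vstack([T, [t_low, t_high]])
        y = np.append(y, y_new)
    return T, y
```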
S44, gradient descent fine tuning;
From the Bayesian search results (the dataset finally obtained above), the first P intervals [T_low, T_high] with the lowest KL divergence values are selected, and fine tuning is carried out with the mean square error between the dequantized reconstructed quantization weight matrix and the reconstructed weight matrix as the objective function:
L(T_low, T_high) = (1 / (D·K)) · Σ_{d,k} ( s_W · Ŵ_{dk} + T_low − W′_{dk} )²
The gradients are calculated:
∂L/∂T_low, ∂L/∂T_high
and the gradient descent update is performed:
T_low ← T_low − η · ∂L/∂T_low,  T_high ← T_high − η · ∂L/∂T_high
wherein η is the learning rate and ∂ denotes the partial derivative operation. Gradient descent is performed within each candidate truncation interval until the loss function converges or the maximum number of iterations is reached, finally yielding the truncation threshold that minimizes L.
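S44 can be sketched as below. Because the rounding operation has zero gradient almost everywhere, the sketch approximates ∂L/∂T_low and ∂L/∂T_high with central finite differences; a straight-through estimator would be an equally reasonable choice. The learning rate, the number of retained intervals P and the iteration cap are illustrative assumptions, and asym_quantize comes from the S41 sketch.

```python
# Sketch of S44: take the P Bayesian candidates with the lowest KL divergence and
# fine-tune (T_low, T_high) by gradient descent on the quantize-dequantize MSE.
import numpy as np

def quant_mse(W_rec, t_low, t_high, n_bits=8):
    W_hat, s_w = asym_quantize(W_rec, t_low, t_high, n_bits)
    return float(np.mean((W_hat * s_w + t_low - W_rec) ** 2))

def finetune_thresholds(W_rec, T, y, top_p=3, lr=1e-3, n_iter=200, delta=1e-4, n_bits=8):
    best_t, best_loss = None, np.inf
    for idx in np.argsort(y)[:top_p]:               # the P intervals with the lowest KL
        t_low, t_high = map(float, T[idx])
        for _ in range(n_iter):
            # central finite-difference gradients with respect to the two thresholds
            g_low = (quant_mse(W_rec, t_low + delta, t_high, n_bits)
                     - quant_mse(W_rec, t_low - delta, t_high, n_bits)) / (2 * delta)
            g_high = (quant_mse(W_rec, t_low, t_high + delta, n_bits)
                      - quant_mse(W_rec, t_low, t_high - delta, n_bits)) / (2 * delta)
            t_low, t_high = t_low - lr * g_low, t_high - lr * g_high
        loss = quant_mse(W_rec, t_low, t_high, n_bits)
        if loss < best_loss:
            best_loss, best_t = loss, (t_low, t_high)
    return best_t                                   # (T_low*, T_high*)
```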
S5, calculating a scaling factor of the reconstruction weight matrix based on the optimized cut-off threshold, and quantizing the reconstruction weight to obtain the reconstruction quantization weight matrix;
The step of obtaining the reconstructed quantization weight matrix comprises the following steps:
S51, based on the truncation threshold [T_low, T_high] determined in step S4, the scaling factor s_W of the reconstructed weight matrix W′ is calculated as follows:
s_W = (T_high − T_low) / (2^N − 1)
wherein N represents the target quantization bit width (e.g., INT8 corresponds to N = 8), and the scaling factor is used to linearly map the reconstructed weight values to the interval [0, 2^N − 1]. The specific weight quantization process is as follows:
Ŵ_{dk} = clip( round( (W′_{dk} − T_low) / s_W ), 0, 2^N − 1 )
wherein Ŵ_{dk} is the value of the reconstructed quantization weight matrix at row d, column k, which is an integer value falling within the set quantization integer interval [0, 2^N − 1]; W′_{dk} is the value of the reconstructed weight matrix at row d, column k, which is a floating-point value into which the activation scaling factor has been fused; s_W is the scaling factor of the reconstructed weight matrix, and N is the target quantization bit width (e.g., INT8 corresponds to N = 8); round(·) represents the rounding operation; clip(·) is the interval truncation function, which performs interval truncation on the quantized result to ensure that it falls within the integer range [0, 2^N − 1].
S52, parameter preservation:
in order to support the decoding and dequantization operations of the online reasoning process, the following quantization meta-information needs to be persisted to the deployment file in S5:
1) the reconstructed quantization weight matrix Ŵ: stored in integer format (e.g., INT8);
2) the scaling factor s_W of the reconstructed weight matrix;
3) An outlier dimension tag table for indicating whether the inference phase uses independent scaling processing.
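A short sketch of S51-S52, computing s_W from the optimized truncation thresholds and bundling the quantization meta-information listed above; the record layout is an illustrative assumption and asym_quantize comes from the S41 sketch.

```python
# Sketch of S51-S52: s_W = (T_high - T_low) / (2^N - 1), INT8 weight quantization,
# and the meta-information persisted for the online inference stage.
import numpy as np

def quantize_weights_for_deploy(W_rec, t_low, t_high, outlier_dims, n_bits=8):
    W_hat, s_w = asym_quantize(W_rec, t_low, t_high, n_bits)
    return {
        "W_hat": W_hat.astype(np.uint8),                 # reconstructed quantization weights
        "s_w": float(s_w),                               # scaling factor of W'
        "t_low": float(t_low),                           # kept for optional zero-point handling
        "outlier_dims": np.asarray(outlier_dims, np.int64),  # outlier dimension tag table
    }
```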
S6, loading the scaling factors of the normal dimensions and the outlier dimensions, and quantizing the activation matrix of the current layer along the embedded dimension by applying the scaling factors to obtain the quantized activation matrix;
Let the activation matrix of the current layer be X ∈ ℝ^(T×D), wherein T is the number of tokens and D is the number of embedded dimensions. The current-layer activation matrix is quantized per embedded dimension (column) by applying the scaling factors, with the quantization formula:
X̂_{ij} = round( X_{ij} / s_j )
wherein j is the column index of the activation matrix and of the quantized activation matrix, representing the index of the embedded dimension; X̂_{ij} is the value of the j-th embedded dimension of the i-th token in the quantized activation matrix of the current layer, which is an integer value; X_{ij} is the value of the j-th embedded dimension of the i-th token in the current-layer activation matrix, which is a floating-point value; round(·) represents the rounding operation; s_j is the scaling factor of the j-th embedded dimension of the activation matrix; O is the outlier dimension set, in which the indices of the outlier dimensions are stored; if j ∉ O, i.e. j is a normal dimension, then s_j = s, the unified scaling factor of all normal dimensions of the activation matrix; if j ∈ O, i.e. j is an outlier dimension, then s_j is the independent scaling factor of that outlier dimension.
The result is an integer tensor, namely the quantized activation matrix X̂. Through the dimension-aware scaling strategy, the activation quantization process effectively adapts to dynamic changes of the activation distribution; combined with the independent outlier handling mechanism, it further reduces quantization error and provides a precision guarantee for the subsequent integer-domain matrix multiplication module.
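A minimal sketch of S6; storing the result as int8 and clamping to the symmetric integer range are safety assumptions for activations falling outside the calibrated range, not requirements stated above.

```python
# Sketch of S6: per-column activation quantization X_hat_ij = round(X_ij / s_j) with the
# loaded unified scale (normal dims) and independent scales (outlier dims).
import numpy as np

def quantize_activations(X, s_normal, s_outlier, n_bits=8):
    """X: (T, D); s_normal: unified scale; s_outlier: {dim index j in O: scale s_j}."""
    qmax = 2 ** (n_bits - 1) - 1
    scales = np.full(X.shape[1], s_normal, dtype=np.float64)
    for j, s_j in s_outlier.items():
        scales[j] = s_j                                  # independent scales on outlier dims
    X_hat = np.clip(np.round(X / scales[None, :]), -qmax, qmax)
    return X_hat.astype(np.int8), scales
```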
S7, multiplying the quantized activation matrix X̂ by the reconstructed quantization weight matrix Ŵ to obtain the multiplication output result in the integer domain, and mapping the multiplication output result in the integer domain back to a floating-point approximate representation based on the scaling factor s_W of the reconstructed weight matrix, performing unified inverse quantization recovery;
In order to realize high-throughput integer multiplication and controllable precision restoration in the reasoning process, the invention adopts a unified inverse quantization mechanism, and as shown in fig. 4, after low-precision matrix multiplication calculation is completed, the output tensor is restored to the floating point approximate representation through fusion of the scaling factors. The method comprises the following specific steps:
S71, low-precision calculation;
An integer matrix multiplication kernel supported by underlying hardware instructions (such as AVX-512, CUDA Tensor Cores or ARM NEON) is called to multiply the quantized activation matrix and the reconstructed quantization weight matrix using the INT8 data type:
Y_int = X̂ · Ŵ
wherein X̂ is the quantized activation matrix; Ŵ is the reconstructed quantization weight matrix; Y_int is the multiplication output result in the integer domain. This step is executed entirely at INT8 precision, with extremely high parallel efficiency and cache utilization.
S72, unified inverse quantization recovery;
The multiplication output result in the integer domain is mapped back to a floating-point approximate representation using the unified scaling recovery formula:
Y = s_W · Y_int = s_W · ( X̂ · Ŵ )
wherein Y is the output matrix after unified inverse quantization recovery; s_W is the scaling factor of the reconstructed weight matrix (i.e. the s_W calculated in S5). Because the invention pre-multiplies the activation scaling factors into the original weight matrix, only the inverse quantization step on the weight side needs to be retained in the inference path, which is expressed as:
Y = s_W · ( X̂ · Ŵ ) ≈ ( X · Λ⁻¹ ) · ( Λ · W ) = X · W
wherein Λ · W is the operation of fusing the activation scaling factors into the weights; Λ is the diagonal matrix formed by the scaling factor of each column of the activation matrix; Y is the output matrix after unified inverse quantization recovery, which is provided to the subsequent layer or task module. The unified inverse quantization design simplifies the integer-domain inference path, effectively avoids multiple floating-point multiplications, and improves online execution efficiency and system stability.
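A sketch of S71-S72. NumPy has no fused INT8 kernel, so int32 accumulation stands in for the AVX-512 / Tensor Core / NEON instructions named above; the optional zero-point correction for the asymmetric weight grid is an added assumption, since the unified recovery formula above omits it.

```python
# Sketch of S71-S72: integer-domain matmul Y_int = X_hat @ W_hat followed by the single
# weight-side dequantization Y = s_W * Y_int.
import numpy as np

def int8_matmul_dequant(X_hat, W_hat, s_w, t_low=0.0):
    """X_hat: (T, D) int8, W_hat: (D, K) uint8, s_w: weight scale, t_low: weight offset."""
    Xi = X_hat.astype(np.int32)
    Y_int = Xi @ W_hat.astype(np.int32)              # integer-domain product
    Y = s_w * Y_int                                  # unified dequantization recovery
    if t_low:                                        # optional asymmetric zero-point term
        Y = Y + t_low * Xi.sum(axis=1, keepdims=True)
    return Y
```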
FIG. 5 is a diagram illustrating the quantization and inference flow within a Transformer block according to the present invention. The activation matrix X is first normalized and then split into the Q, K, V projection paths; the original weight matrix of each path (e.g. the Q, K, V projection weights) is reconstructed with the activation scaling factors and then quantized offline (the weight reconstruction and quantization in the offline phase shown in the left dashed frame), and the activation matrix is quantized in the same way as X̂, so that the subsequent multiplication of activations and weights can be completed in the integer domain, with precision recovered by inverse quantization through the scaling factor of the reconstructed weight matrix. The attention output results (operations such as Softmax and residual connection are not shown in the figure, wherein Softmax is the normalized exponential function used to convert attention scores into a probability distribution) are then normalized and passed into a linear layer of the feedforward network part (only one linear layer is shown in the figure), where quantized matrix multiplication and inverse quantization recovery are likewise performed.
Based on similar inventive concepts, the embodiments of the present invention also provide a computer storage medium storing a readable program, which when executed by a processor, is capable of performing a large language model weight and activation joint quantization method as described above.
Based on similar inventive concepts, an embodiment of the present invention provides an electronic device including a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, where the executable instruction causes the processor to perform operations corresponding to the large language model weight and activation joint quantization method described above.
Based on similar inventive concepts, embodiments of the present invention also provide a computer program product comprising computer instructions that instruct a computing device to perform the operations of the large language model weight and activation joint quantization method described above.
Example 2
Based on the large language model weight and activation joint quantization method proposed in embodiment 1, in this embodiment, a large language model weight and activation joint quantization system is proposed, as shown in fig. 1, which specifically includes:
The activation matrix acquisition module is used for collecting a calibration set, preprocessing the calibration set, inputting the calibration set into a large language model to execute forward propagation, and recording the activation matrix of each layer;
The outlier dimension judging module is used for counting the maximum absolute value of the activation of all the word elements on each embedded dimension based on the activation matrix, dynamically generating a global threshold value by combining a global sensitivity coefficient through a quantile counting method, and judging that the maximum absolute value of the activation in the dimension exceeds the global threshold value as an outlier dimension;
The weight reconstruction module is used for respectively designing differentiated scaling factors for the normal dimension and the outlier dimension based on the outlier dimension judgment result and applying the scaling factors to corresponding rows of the original weight matrix to generate a reconstructed weight matrix;
The truncated threshold optimization module is used for optimizing a truncated threshold of the reconstruction weight by using Bayes-gradient combination based on the reconstruction weight matrix;
the weight quantization module is used for calculating a scaling factor of the reconstruction weight matrix based on the optimized cut-off threshold value and quantizing the reconstruction weight to obtain the reconstruction quantization weight matrix;
the activation matrix quantization module is used for loading scaling factors of normal dimension and outlier dimension, and quantizing the activation matrix of the current layer according to the embedded dimension by applying the scaling factors to obtain a quantized activation matrix;
And the unified inverse quantization module is used for carrying out multiplication calculation on the quantized activation matrix and the reconstructed quantization weight matrix to obtain a multiplication output result in the integer domain, and mapping the multiplication output result in the integer domain back to the floating point approximate representation based on the scaling factor of the reconstructed weight matrix to carry out unified inverse quantization recovery.
The method of the present invention may be implemented in hardware, firmware, or as software or computer code storable in a recording medium such as a CD ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and to be stored in a local recording medium downloaded through a network, so that the method described herein may be stored on such software process on a recording medium using a general purpose computer, special purpose processor, or programmable or special purpose hardware such as an ASIC or FPGA. It is understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by a computer, processor, or hardware, performs the methods described herein. Furthermore, when a general purpose computer accesses code for implementing the methods illustrated herein, execution of the code converts the general purpose computer into a special purpose computer for performing the methods illustrated herein.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims.