
CN119740640A - Large model fine-tuning method based on hybrid quantization and related equipment - Google Patents


Info

Publication number
CN119740640A
CN119740640A
Authority
CN
China
Prior art keywords
quantization
model
determining
algorithm
video memory
Prior art date
Legal status
Granted
Application number
CN202411739761.9A
Other languages
Chinese (zh)
Other versions
CN119740640B (en)
Inventor
王玉龙
左畅
苏森
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202411739761.9A
Publication of CN119740640A
Application granted
Publication of CN119740640B
Legal status: Active

Abstract

The present application provides a large model fine-tuning method based on hybrid quantization and related equipment, including the steps of: determining several parameter matrices of a model to be trained, and several quantization methods and constraints for training the model to be trained; iteratively quantizing each parameter matrix according to the constraints and the quantization methods, and determining a quantization loss value set and a video memory occupancy value set for each parameter matrix; determining the optimal quantization algorithm of each parameter matrix through an optimization method according to the quantization loss value set and the video memory occupancy value set; performing iterative hybrid quantization on the parameter matrices of the model to be trained according to the optimal quantization algorithms to determine the large model; and adjusting the parameters of the low-rank components of the large model through a fine-tuning algorithm to complete model fine-tuning.

Description

Large model fine-tuning method based on hybrid quantization and related equipment
Technical Field
The present application relates to the technical field of large model training, and in particular to a large model fine-tuning method based on hybrid quantization and related equipment.
Background
With advances in computing hardware and the availability of massive data, large models have made breakthrough progress in many fields such as natural language processing, computer vision, and speech recognition. These models typically contain billions of parameters or more, enabling them to capture complex data patterns and thus perform highly accurate prediction and generation tasks. However, training and deploying large models also face challenges such as high consumption of computing resources, increased storage requirements, and long training times.
In the fine-tuning of a large model, traditional methods often train directly at full precision, which is not only inefficient but also prone to resource bottlenecks. To reduce the resources required for fine-tuning large models, researchers have begun to explore quantization methods that quantize the non-trained parameters of large models. Although such methods can reduce the video memory required for fine-tuning to a certain extent, they cannot fully utilize device resources and still incur quantization loss.
Disclosure of Invention
In view of the above, the present application is directed to a large model fine-tuning method based on hybrid quantization and related devices.
Based on the above object, the present application provides a large model fine-tuning method based on hybrid quantization, comprising:
Determining a plurality of parameter matrixes of the model to be trained, a plurality of quantization methods and limiting conditions for training the model to be trained;
According to the limiting conditions and the quantization methods, carrying out iterative mixed quantization on any parameter matrix, and determining a quantization loss value set and a video memory occupation value set of any parameter matrix;
Determining an optimal quantization algorithm of any parameter matrix by an optimization method according to the quantization loss value set and the video memory occupation value set;
according to a plurality of optimal quantization algorithms, carrying out iterative mixed quantization on a plurality of parameter matrixes in the model to be trained, and determining a large model;
And adjusting parameters of the low-rank component of the large model through a fine adjustment algorithm to finish fine adjustment of the model.
Optionally, the limiting conditions comprise a preset quantization loss value and a preset video memory value, wherein the iterative hybrid quantization comprises iterative training of a parameter matrix through a plurality of quantization algorithms;
and performing iterative mixed quantization on any parameter matrix according to the limiting condition and the quantization methods to determine a quantization loss value set and a video memory occupation value set of any parameter matrix, wherein the method comprises the following steps:
Determining an initial quantization loss and an initial quantization component, decomposing the difference between the parameter matrix and the initial quantization component, and determining a low-rank component;
quantizing the difference between the parameter matrix and the low-rank component through a quantization algorithm to determine a quantization component;
calculating the norm of the difference between any one of the parameter matrices and the sum of the quantization component and the low-rank component, and determining the quantization loss of the current iteration round;
Determining a quantization loss value and a video memory occupation value of a current quantization algorithm of any parameter matrix according to a current iteration round and the quantization loss of the current iteration round;
And determining quantization loss values and video memory occupied values of a plurality of quantization algorithms, and determining a quantization loss value set and a video memory occupied value set.
Optionally, the determining the quantization loss and the video memory occupation value of the current quantization algorithm of any parameter matrix according to the current iteration round and the quantization loss of the current iteration round includes:
determining a current iteration round and a quantization loss value of the current iteration round;
And responding to the fact that the current iteration round is larger than a preset iteration round and/or the quantization loss of the current iteration round is larger than a preset quantization loss value, taking the quantization loss value of the current iteration round as the quantization loss value of the current quantization algorithm, and taking the video memory occupation size of the current iteration round as the video memory occupation value of the current quantization algorithm.
Optionally, the quantization algorithm is a high-precision quantization algorithm, the norm is a Frobenius norm, and the decomposition method is an SVD decomposition method.
Optionally, the determining an optimal quantization algorithm of any one of the parameter matrices according to the quantization loss value set and the video memory occupation value set includes:
Determining a quantization loss value and a video memory occupation value of any quantization algorithm according to the quantization loss value set and the video memory occupation value set;
And determining the optimal quantization algorithm through an integer linear programming method according to the preset quantization loss value, the preset video memory value, the quantization loss value and the video memory occupation value of any quantization algorithm.
Optionally, the performing iterative hybrid quantization on a plurality of parameter matrices in the model to be trained according to a plurality of optimal quantization algorithms, and determining a large model includes:
And carrying out iterative quantization on the parameter matrix according to an optimal quantization algorithm corresponding to any one of the parameter matrices, and determining the large model in response to determining that a preset iterative round is reached.
Optionally, the adjusting the parameters of the low rank component of the large model by a fine tuning algorithm to complete fine tuning of the model includes:
and adjusting parameters of a low-rank component of the large model through a low-rank fine adjustment algorithm to finish fine adjustment of the model.
Based on the same inventive concept, an embodiment of the present application further provides a large model fine-tuning device based on hybrid quantization, comprising:
The determining module is configured to determine a plurality of parameter matrixes of the model to be trained, a plurality of quantization methods and limiting conditions for training the model to be trained;
The first mixed quantization module is configured to perform iterative mixed quantization on any one of the parameter matrixes according to the limiting conditions and the quantization methods, and determine a quantization loss value set and a video memory occupation value set of any one of the parameter matrixes;
The optimization module is configured to determine an optimal quantization algorithm of any parameter matrix through an optimization method according to the quantization loss value set and the video memory occupation value set;
The second mixed quantization module is configured to perform iterative mixed quantization on a plurality of parameter matrixes in the model to be trained according to a plurality of optimal quantization algorithms, so as to determine a large model;
and the fine tuning module is configured to adjust parameters of the low-rank component of the large model through a fine tuning algorithm to finish fine tuning of the model.
Based on the same inventive concept, the embodiment of the application also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the large model fine tuning method based on mixed quantization according to any one of the above when executing the program.
Based on the same inventive concept, embodiments of the present application further provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform any of the above-described hybrid-quantization-based large model fine-tuning methods.
From the above, it can be seen that the hybrid-quantization-based large model fine-tuning method, apparatus, electronic device and storage medium provided by the present application solve for the optimal quantization algorithm of each parameter matrix of the large model through an optimization algorithm, so that the total quantization loss of the large model is minimized while video memory resources are fully utilized, and quantize the parameter matrices of the large model through iterative hybrid quantization, thereby further reducing the quantization loss of the large model.
Drawings
In order to more clearly illustrate the technical solutions of the present application and of the related art, the drawings required in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only embodiments of the present application, and that those of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a large model fine tuning method based on mixed quantization according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a large model fine tuning device based on mixed quantization according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present application more apparent.
It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present application shall have the ordinary meaning understood by those of ordinary skill in the art to which the present application belongs. The terms "first", "second", and the like used in the embodiments of the present application do not denote any order, quantity, or importance, but are merely used to distinguish one element from another. Words such as "comprising" or "comprises" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Terms such as "connected" are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", and the like are used only to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.
In order to facilitate understanding of the technical solutions of the present disclosure, some technical terms related to the present disclosure are described below.
The Frobenius norm, abbreviated as the F-norm, is a matrix norm used to measure the "size" or "energy" of a matrix; for a matrix A = (a_ij) it is defined as ||A||_F = sqrt(sum_{i,j} a_ij^2), the square root of the sum of the squares of all entries.
Hyperparameters are configuration values that are set before training rather than learned from data, such as the learning rate and the number of training rounds.
The parameter matrix is a multidimensional array used to store and process multidimensional data in MATLAB and other mathematical or engineering software. Each element may be a scalar, vector, matrix, or other data type, and the dimensionality of a parameter matrix is determined by its size and may be one, two, or higher dimensions. In large-scale machine learning and deep learning models, parameter matrices store the model's weights and biases; these parameters are learned and adjusted during training to minimize some loss function, enabling the model to predict or classify accurately. The dimensions and complexity of these parameter matrices depend on the specific model architecture and task.
Video memory, whose full name is "graphics processor video memory", is dedicated memory on the graphics card used to store graphics data; it mainly holds the various data required by the graphics processing unit (GPU) when performing rendering tasks.
Quantization: the process of converting non-trained parameters (e.g., weights and activation values) in a large model from one precision (typically a higher floating-point precision such as FP32) to a lower precision (e.g., INT8, INT4, or lower). Such conversion helps reduce the computation and storage requirements of the model while preserving its performance as much as possible. A concrete sketch is given below.
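To make the definition concrete, the following is a minimal sketch of symmetric (absmax) INT8 quantization in Python with NumPy. It illustrates the precision conversion described above; the quantizers actually used for large models are more elaborate, and all names here are illustrative.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map FP32 weights to INT8 using a single per-tensor absmax scale."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)  # avoid division by zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize_int8(q, s)).max())  # worst-case quantization error
```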
In order to make the technical scheme of the disclosure clearer and easier to understand, the following describes in detail a large model fine tuning method based on mixed quantization provided by the embodiments of the disclosure with reference to the accompanying drawings.
As described in the background section, with advances in computing hardware and the availability of massive data, large models have made breakthrough progress in many fields such as natural language processing, computer vision, and speech recognition. These models typically contain billions of parameters or more, enabling them to capture complex data patterns and thus perform highly accurate prediction and generation tasks. However, training and deploying large models also face challenges such as high consumption of computing resources, increased storage requirements, and long training times.
In the fine-tuning of a large model, traditional methods often train directly at full precision, which is not only inefficient but also prone to resource bottlenecks. To reduce the resources required for fine-tuning, researchers have begun to explore quantization methods that quantize the non-trained parameters of large models. Although such methods can reduce the video memory required for fine-tuning to a certain extent, they cannot fully utilize device resources and still incur quantization loss.
In view of the above, the embodiments of the present application provide a large model fine-tuning method, apparatus, electronic device and storage medium based on hybrid quantization. The method comprises: determining several parameter matrices of a model to be trained, and several quantization methods and constraints for training the model to be trained; performing iterative hybrid quantization on each parameter matrix according to the constraints and the quantization methods, and determining a quantization loss value set and a video memory occupancy value set for each parameter matrix; determining the optimal quantization algorithm of each parameter matrix from these sets through an optimization method; performing iterative hybrid quantization on the parameter matrices of the model to be trained according to the optimal quantization algorithms to determine the large model; and adjusting the parameters of the low-rank components of the large model through a fine-tuning algorithm to complete model fine-tuning. The optimal quantization algorithm of each parameter matrix is solved through an optimization algorithm, so that the total quantization loss of the large model is minimized while video memory resources are fully utilized, and the parameter matrices are quantized through iterative hybrid quantization, further reducing the quantization loss of the large model.
As shown in fig. 1, the large model fine tuning method based on mixed quantization includes:
step S102, determining a plurality of parameter matrixes of a model to be trained, a plurality of quantization methods and limiting conditions for training the model to be trained;
Step S104, carrying out iterative quantization on any parameter matrix according to the limiting conditions and a plurality of quantization methods, and determining a quantization loss value set and a video memory occupation value set of any parameter matrix;
Step S106, determining an optimal quantization algorithm of any parameter matrix through an optimization method according to the quantization loss value set and the video memory occupation value set;
Step S108, carrying out iterative mixed quantization on a plurality of parameter matrixes in the model to be trained according to a plurality of optimal quantization algorithms to determine a large model;
And step S110, adjusting parameters of the low-rank component of the large model through a fine adjustment algorithm to finish fine adjustment of the model.
In step S102, the model to be trained is a large model, which may be a computer model applied in fields such as natural language processing, computer vision, and speech recognition. For example, it may be one of, or a hybrid of, several model types: large language models, large vision models, multimodal large models, Transformer-based models, convolutional neural network (CNN)-based models, and recurrent neural network (RNN)-based models.
Further, regarding parameter matrices: in large-scale machine learning and deep learning models, parameter matrices are the matrices that store model weights and biases. These parameters are learned and adjusted through the training process to minimize some loss function, enabling the model to predict or classify accurately. The dimensions and complexity of these parameter matrices depend on the specific model architecture and task.
Large models typically have a complex network structure and a large number of neurons. Neurons are connected through weights, forming large parameter matrices. These parameter matrices play a vital role in the training process: they determine how the model learns features and rules from the input data. Large models often contain multiple layers, such as an input layer, hidden layers, and an output layer, and each layer may contain multiple neurons and weight connections. In particular, in deep neural networks (DNNs) and Transformers, each layer may contain a large number of parameter matrices; for example, in the Transformer architecture, the self-attention layers, the feed-forward network layers, and so on all introduce large numbers of parameter matrices.
The parameter matrix serves three roles: feature extraction, weight assignment, and pattern recognition.
(1) Feature extraction: the parameter matrix helps the model extract useful features from the input data. These features are critical to the model's subsequent processing and decision making.
(2) Weight assignment: in large models, different parameter matrices represent different weight assignments. These weights determine the relative importance of the various parts of the model when processing input data.
(3) Pattern recognition: through training, the parameter matrix learns potential patterns and rules in the input data. These patterns and rules are the basis on which the model makes predictions and decisions.
In some embodiments, the constraints include the size of the actually usable video memory, i.e., a preset video memory value, and a preset quantization loss value. Before the iteration, optimization, quantization, and fine-tuning steps are performed on the large model, a target budget needs to be determined, which may be the available video memory size. It is also necessary to determine which quantization methods to use for hybrid quantization; these methods must satisfy the condition that, after the large model is quantized with the lowest-bit-width method among them, the video memory it occupies is smaller than the preset target budget, i.e., the available video memory size. A sketch of this feasibility check follows.
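As an illustration of this feasibility check, here is a small sketch; the function and parameter names are hypothetical, not taken from the application.

```python
def fits_budget(param_counts, min_bits, budget_bytes):
    """Check that the model, quantized everywhere with the lowest-bit-width
    candidate method, occupies less video memory than the target budget.

    param_counts: number of elements in each parameter matrix
    min_bits:     lowest bit width among the candidate quantization methods
    """
    total_bytes = sum(n * min_bits / 8 for n in param_counts)
    return total_bytes < budget_bytes

# e.g. two 4096 x 4096 matrices at 4 bits against a 1 GiB budget
print(fits_budget([4096 * 4096] * 2, min_bits=4, budget_bytes=1 << 30))
```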
In the deployment of neural network models, model quantization is commonly used to reduce the storage requirements and computation of the model. Model quantization can be roughly divided into two major classes: online quantization (Quantization-Aware Training, QAT) and offline quantization (Post-Training Quantization, PTQ). Offline quantization is more commonly used in the model deployment phase due to its lower development cost and lower barrier to entry. Quantization of a neural network essentially maps data from a continuous space to a discrete space, so the process may introduce quantization loss.
In some embodiments, the large model is quantized by iterative hybrid quantization according to the constraints and the several quantization methods: each parameter matrix in the large model is iteratively quantized with each of the quantization methods, and iteration stops once a preset condition is met, yielding the video memory occupancy (the video memory occupancy value) and the quantization loss value of each quantization method for that parameter matrix. For example, parameter matrix 1 may be quantized by five methods A, B, C, D and E, and the video memory occupancy and quantization loss value of each method are recorded. The data set storing the video memory occupancies of the five quantization methods is the video memory occupancy value set, and the data set storing their quantization loss values is the quantization loss value set; a simplified sketch of this collection step is given below.
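The sketch below shows the collection step in simplified form, assuming a generic `quantize_once` routine that returns a dequantized approximation of W at a given bit width; the loss is computed in one shot here, whereas the method described below refines it iteratively. All names are illustrative.

```python
import numpy as np

def evaluate_methods(W, methods, quantize_once):
    """Build the quantization-loss and memory value sets for one matrix.

    methods: maps a method name (e.g. "A".."E") to its bit width
    quantize_once: returns a dequantized approximation of W at that width
    """
    losses, memories = {}, {}
    for name, bits in methods.items():
        W_hat = quantize_once(W, bits)
        losses[name] = float(np.linalg.norm(W - W_hat))  # Frobenius loss
        memories[name] = W.size * bits / 8               # bytes after quantization
    return losses, memories
```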
In some embodiments, the parameter matrix is iteratively quantized by any one of the quantization methods, comprising: determining an initial quantization loss and an initial quantization component, decomposing the difference between the parameter matrix and the initial quantization component, and determining the low-rank components;
quantizing the difference between the parameter matrix and the low-rank components through the quantization algorithm to determine the quantization component;
calculating the norm of the difference between the parameter matrix and the sum of the quantization component and the low-rank components, and determining the quantization loss of the current iteration round;
and determining the quantization loss value and the video memory occupancy value of the current quantization algorithm of the parameter matrix according to the current iteration round and the quantization loss of the current iteration round.
In some embodiments, the difference between the parameter matrix W and the quantization component Q is decomposed to obtain the low-rank components L1 and L2:
L1, L2 = Factorize(W − Q, r)
where r is the rank used for the decomposition.
The difference between the parameter matrix W and the low-rank product L1L2 is then quantized using a quantization algorithm, updating the quantization component Q:
Q = Quantize(W − L1L2, c)
where c is the quantization algorithm used.
The Frobenius norm of the difference between the parameter matrix W and the sum of the quantization component Q and the low-rank product L1L2 is calculated as the quantization loss of the current round:
εt = ||W − (Q + L1L2)||F
It is then judged whether the quantization loss εt of the current iteration round is larger than the quantization loss εt−1 of the previous round, or whether the maximum number of iterations T has been reached. If so, the iteration stops and εt is taken as the final quantization loss; otherwise, the iteration continues. A sketch of this alternating update follows.
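The following sketch captures this alternating update, assuming generic `quantize` (a quantize-then-dequantize routine) and `factorize` (rank-r decomposition) helpers passed in by the caller; it is a simplified illustration, not the exact implementation of the application.

```python
import numpy as np

def iterative_quantize(W, quantize, factorize, r, T):
    """Alternately update the quantized component Q and the low-rank
    components L1, L2 until eps_t stops decreasing or T rounds elapse."""
    m, n = W.shape
    L1, L2 = np.zeros((m, r)), np.zeros((r, n))
    Q = quantize(W)                                  # initial quantization component
    loss = prev = float(np.linalg.norm(W - Q))       # initial quantization loss
    for t in range(T):
        L1, L2 = factorize(W - Q, r)                 # low-rank components of the residual
        Q = quantize(W - L1 @ L2)                    # update the quantized component
        loss = float(np.linalg.norm(W - (Q + L1 @ L2)))  # eps_t, Frobenius norm
        if loss > prev:                              # eps_t > eps_{t-1}: stop iterating
            break
        prev = loss
    return Q, L1, L2, loss
```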
In some embodiments, the inputs of the SVD decomposition are a matrix A and the rank r. The SVD decomposition formula is:
U, Σ, Vᵀ = SVD(A, r)
SVD decomposes the matrix A into three matrices U, Σ and Vᵀ: when A has shape m × n, U has shape m × min(m, n), Σ is a diagonal matrix of size min(m, n) × min(m, n), and Vᵀ has shape min(m, n) × n.
The low-rank components L1 and L2 are then obtained from the truncated SVD factors; a sketch of one possible realization follows.
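Below is one possible realization of Factorize(A, r) via a truncated SVD. How the singular values are split between L1 and L2 is an assumption of this sketch (a square-root factor on each side), since several conventions are possible.

```python
import numpy as np

def factorize(A: np.ndarray, r: int):
    """Rank-r factorization A ≈ L1 @ L2 from the truncated SVD of A."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(S) @ Vt
    sqrt_s = np.sqrt(S[:r])
    L1 = U[:, :r] * sqrt_s          # shape (m, r): columns scaled by sqrt(sigma_i)
    L2 = sqrt_s[:, None] * Vt[:r]   # shape (r, n): rows scaled by sqrt(sigma_i)
    return L1, L2
```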
In some embodiments, the determining of the quantization loss value and the video memory occupancy value of the current quantization algorithm of any parameter matrix according to the current iteration round and the quantization loss of the current iteration round includes:
determining the current iteration round and the quantization loss value of the current iteration round;
and in response to determining that the current iteration round is larger than a preset iteration round and/or the quantization loss of the current iteration round is larger than the preset quantization loss value, taking the quantization loss value of the current iteration round as the quantization loss value of the current quantization algorithm, and taking the video memory occupancy of the current iteration round as the video memory occupancy value of the current quantization algorithm. During the iteration, the parameter matrix is first decomposed into a low-rank component and a quantization component, and both are updated as iterative quantization proceeds, until the preset end condition is met, i.e., the quantization loss no longer decreases or the maximum number of iterations is reached.
In some embodiments, the quantization algorithm is a high-precision quantization algorithm and the norm is a Frobenius norm.
In some embodiments, the inputs of the optimization method are the quantization loss and video memory occupancy of each parameter matrix under the different quantization algorithms, together with the constraints, and the output is the optimal quantization algorithm of each parameter matrix of the large model under those constraints. The goal of the optimization is that, given a fixed available video memory size, quantizing each parameter matrix with its solved quantization algorithm keeps the large model smaller than the available video memory while minimizing the total quantization loss.
The constraint is the maximum video memory that all the parameter matrices may occupy, while the quantization loss and video memory occupancy under the different quantization algorithms are given per parameter matrix. Minimizing the "quantization loss" under the constraint means that the total quantization loss is smallest subject to the condition that the video memory occupied by all parameter matrices after quantization stays below that maximum.
If the candidate quantization algorithms are [quantization method 1, quantization method 2, quantization method 3], the video memory occupancies of each parameter matrix under the different algorithms are input in the same order as [occupancy 1, occupancy 2, occupancy 3]. The output of the optimization method is one value per parameter matrix, in the range [0, number of quantization methods − 1], from which the optimal quantization method of that parameter matrix can be determined.
Therefore, the determining an optimal quantization algorithm of any parameter matrix according to the quantization loss value set and the video memory occupied value set comprises the following steps:
Determining a quantization loss value and a video memory occupation value of any quantization algorithm according to the quantization loss value set and the video memory occupation value set;
And determining the optimal quantization algorithm through an integer linear programming method according to the preset quantization loss value, the preset video memory value, and the quantization loss value and video memory occupancy value of each quantization algorithm; a sketch of this integer linear program is given below.
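The selection step can be written as a 0/1 integer linear program. The sketch below uses the PuLP library as one possible solver; the method prescribes integer linear programming but no particular library, so the solver choice and all names here are assumptions. loss[i][j] and mem[i][j] are the per-matrix, per-algorithm values from the two value sets, and budget is the preset video memory value.

```python
import pulp

def choose_methods(loss, mem, budget):
    """Pick one quantization method per matrix, minimizing total loss
    subject to the total post-quantization memory staying within budget."""
    n_mat, n_alg = len(loss), len(loss[0])
    prob = pulp.LpProblem("hybrid_quantization", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (range(n_mat), range(n_alg)), cat="Binary")
    # objective: total quantization loss over all parameter matrices
    prob += pulp.lpSum(loss[i][j] * x[i][j]
                       for i in range(n_mat) for j in range(n_alg))
    # each parameter matrix is assigned exactly one quantization method
    for i in range(n_mat):
        prob += pulp.lpSum(x[i][j] for j in range(n_alg)) == 1
    # the memory occupied after quantization must stay within the budget
    prob += pulp.lpSum(mem[i][j] * x[i][j]
                       for i in range(n_mat) for j in range(n_alg)) <= budget
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    # one value per matrix in [0, number of methods - 1], as described above
    return [max(range(n_alg), key=lambda j: pulp.value(x[i][j]))
            for i in range(n_mat)]
```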
In some alternative embodiments, instead of quantizing all parameter matrices of the large model with the same quantization method, the optimal quantization method of each parameter matrix under the constraints is solved by the optimization algorithm. Performing iterative hybrid quantization on the parameter matrices of the model to be trained according to the several optimal quantization algorithms to determine the large model comprises:
And carrying out iterative quantization on the parameter matrix according to an optimal quantization algorithm corresponding to any one of the parameter matrices, and determining the large model in response to determining that a preset iterative round is reached.
In some optional embodiments, the adjusting the parameters of the low rank component of the large model by the fine tuning algorithm to complete the fine tuning of the model includes adjusting the parameters of the low rank component of the large model by the low rank fine tuning algorithm to complete the fine tuning of the model.
In some embodiments, fine-tuning of the large model is completed by setting hyperparameters such as the learning rate and the number of training rounds; a sketch of this step is given below.
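Below is a PyTorch-flavored sketch of this step: the quantized component stays frozen and only the low-rank components receive gradients, with the learning rate and the number of training rounds as the hyperparameters mentioned above. The module, shapes, and objective are illustrative stand-ins, not taken from the application.

```python
import torch

class QuantizedLinearWithLowRank(torch.nn.Module):
    """y = x @ (Wq + L1 @ L2)^T with Wq frozen and L1, L2 trainable."""
    def __init__(self, Wq, L1, L2):
        super().__init__()
        self.register_buffer("Wq", Wq)       # frozen quantized component
        self.L1 = torch.nn.Parameter(L1)     # trainable low-rank components
        self.L2 = torch.nn.Parameter(L2)

    def forward(self, x):
        return x @ (self.Wq + self.L1 @ self.L2).T

layer = QuantizedLinearWithLowRank(torch.randn(8, 8), torch.randn(8, 2), torch.randn(2, 8))
opt = torch.optim.AdamW([layer.L1, layer.L2], lr=1e-4)  # learning rate: a hyperparameter
for epoch in range(3):                                  # training rounds: a hyperparameter
    loss = layer(torch.randn(4, 8)).pow(2).mean()       # stand-in training objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```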
From the above, the present application solves for the optimal quantization algorithm of each parameter matrix of the large model through an optimization algorithm, so that the total quantization loss of the large model is minimized while video memory resources are fully utilized, and further reduces the quantization loss of the large model by quantizing its parameter matrices through iterative hybrid quantization.
It should be noted that, the method of the embodiment of the present application may be performed by a single device, for example, a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the method of an embodiment of the present application, the devices interacting with each other to accomplish the method.
It should be noted that the foregoing describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, the application also provides a large model fine tuning device based on mixed quantization, which corresponds to the method of any embodiment.
Referring to fig. 2, the large model fine tuning device based on mixed quantization includes:
a determining module 202 configured to determine a number of parameter matrices of the model to be trained, a number of quantization methods and constraints for training the model to be trained;
the first quantization module 204 is configured to perform iterative hybrid quantization on any one of the parameter matrices according to the constraint condition and the quantization methods, and determine a quantization loss value set and a video memory occupation value set of any one of the parameter matrices;
The optimizing module 206 is configured to determine an optimal quantization algorithm of any one of the parameter matrixes by an optimizing method according to the quantization loss value set and the video memory occupation value set;
a second quantization module 208, configured to perform iterative hybrid quantization on a plurality of parameter matrices in the model to be trained according to a plurality of optimal quantization algorithms, to determine a large model;
The fine tuning module 210 is configured to adjust parameters of the low rank component of the large model through a fine tuning algorithm, so as to complete fine tuning of the model.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of each module may be implemented in the same piece or pieces of software and/or hardware when implementing the present application.
The device of the foregoing embodiment is configured to implement the corresponding hybrid quantization-based large model fine tuning method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein.
Based on the same inventive concept, the application also provides an electronic device corresponding to the method of any embodiment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the large model fine tuning method based on the mixed quantization according to any embodiment when executing the program.
Fig. 3 shows a more specific hardware architecture of an electronic device provided by the present embodiment, which may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 implement communication connections therebetween within the device via a bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of ROM (read-only memory), RAM (random-access memory), static storage, dynamic storage, and the like. The memory 1020 may store an operating system and other application programs; when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in the memory 1020 and executed by the processor 1010.
The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
Communication interface 1040 is used to connect communication modules (not shown) to enable communication interactions of the present device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the corresponding hybrid quantization-based large model fine tuning method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein.
Based on the same inventive concept, and corresponding to any of the method embodiments described above, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the hybrid-quantization-based large model fine-tuning method according to any of the embodiments above.
The computer-readable media of the present embodiments include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The storage medium of the above embodiment stores computer instructions for causing the computer to perform the large model fine tuning method based on mixed quantization according to any one of the above embodiments, and has the advantages of the corresponding method embodiments, which are not described herein.
Based on the same inventive concept, and corresponding to the hybrid-quantization-based large model fine-tuning method described in any of the above embodiments, the present disclosure further provides a computer program product comprising computer program instructions. In some embodiments, the computer program instructions may be executed by one or more processors of a computer to cause the computer and/or the processors to perform the method. For each step of the method in the embodiments above, the processor executing that step may belong to the corresponding execution subject.
The computer program product of the above embodiment is configured to enable the computer and/or the processor to perform the large model fine tuning method based on hybrid quantization according to any one of the above embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein.
It will be appreciated by persons skilled in the art that the foregoing discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the application (including the claims) is limited to these examples, that combinations of technical features in the foregoing embodiments or in different embodiments may be implemented in any order and that many other variations of the different aspects of the embodiments described above exist within the spirit of the application, which are not provided in detail for clarity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present application. Furthermore, the devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present application are to be implemented (i.e., such specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
While the application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like, which are within the spirit and principles of the embodiments of the application, are intended to be included within the scope of the application.

Claims (10)

1. A large model fine tuning method based on mixed quantization, comprising:
Determining a plurality of parameter matrixes of the model to be trained, a plurality of quantization methods and limiting conditions for training the model to be trained;
According to the limiting conditions and the quantization methods, carrying out iterative quantization on any parameter matrix, and determining a quantization loss value set and a video memory occupation value set of any parameter matrix;
Determining an optimal quantization algorithm of any parameter matrix by an optimization method according to the quantization loss value set and the video memory occupation value set;
according to a plurality of optimal quantization algorithms, carrying out iterative mixed quantization on a plurality of parameter matrixes in the model to be trained, and determining a large model;
And adjusting parameters of the low-rank component of the large model through a fine adjustment algorithm to finish fine adjustment of the model.
2. The method of claim 1, wherein the constraint comprises a preset quantization loss value and a preset video memory value, and wherein the iterative hybrid quantization comprises iterative training of a parameter matrix by a plurality of quantization algorithms;
and performing iterative mixed quantization on any parameter matrix according to the limiting condition and the quantization methods to determine a quantization loss value set and a video memory occupation value set of any parameter matrix, wherein the method comprises the following steps:
Determining an initial quantization loss and an initial quantization component, decomposing the difference between the parameter matrix and the initial quantization component, and determining a low-rank component;
quantizing the difference between the parameter matrix and the low-rank component through a quantization algorithm to determine a quantization component;
calculating the norm of the difference between any one of the parameter matrices and the sum of the quantization component and the low-rank component, and determining the quantization loss of the current iteration round;
Determining a quantization loss value and a video memory occupation value of a current quantization algorithm of any parameter matrix according to a current iteration round and the quantization loss of the current iteration round;
And determining quantization loss values and video memory occupied values of a plurality of quantization algorithms, and determining a quantization loss value set and a video memory occupied value set.
3. The method according to claim 2, wherein determining the quantization loss and the video memory occupancy value of the current quantization algorithm of any parameter matrix according to the current iteration round and the quantization loss of the current iteration round comprises:
determining a current iteration round and a quantization loss value of the current iteration round;
And responding to the fact that the current iteration round is larger than a preset iteration round and/or the quantization loss of the current iteration round is larger than a preset quantization loss value, taking the quantization loss value of the current iteration round as the quantization loss value of the current quantization algorithm, and taking the video memory occupation size of the current iteration round as the video memory occupation value of the current quantization algorithm.
4. The method of claim 2, wherein the quantization algorithm is a high-precision quantization algorithm, the norm is a Frobenius norm, and the decomposition method is an SVD decomposition method.
5. The method according to claim 2, wherein said determining an optimal quantization algorithm for any one of said parameter matrices based on said set of quantization loss values and said set of memory footprint values comprises:
Determining a quantization loss value and a video memory occupation value of any quantization algorithm according to the quantization loss value set and the video memory occupation value set;
And determining the optimal quantization algorithm through an integer linear programming method according to the preset quantization loss value, the preset video memory value, the quantization loss value and the video memory occupation value of any quantization algorithm.
6. The method according to claim 1, wherein said iteratively mixing quantization of a plurality of parameter matrices in said model to be trained according to a plurality of said optimal quantization algorithms, determining a large model, comprises:
And carrying out iterative quantization on the parameter matrix according to an optimal quantization algorithm corresponding to any one of the parameter matrices, and determining the large model in response to determining that a preset iterative round is reached.
7. The method according to claim 1, wherein said adjusting parameters of low rank components of said large model by a fine tuning algorithm to complete model fine tuning comprises:
and adjusting parameters of a low-rank component of the large model through a low-rank fine adjustment algorithm to finish fine adjustment of the model.
8. A large model fine tuning device based on hybrid quantization, comprising:
The determining module is configured to determine a plurality of parameter matrixes of the model to be trained, a plurality of quantization methods and limiting conditions for training the model to be trained;
The first mixed quantization module is configured to perform iterative mixed quantization on any one of the parameter matrixes according to the limiting conditions and the quantization methods, and determine a quantization loss value set and a video memory occupation value set of any one of the parameter matrixes;
The optimization module is configured to determine an optimal quantization algorithm of any parameter matrix through an optimization method according to the quantization loss value set and the video memory occupation value set;
The second mixed quantization module is configured to perform iterative mixed quantization on a plurality of parameter matrixes in the model to be trained according to a plurality of optimal quantization algorithms, so as to determine a large model;
and the fine tuning module is configured to adjust parameters of the low-rank component of the large model through a fine tuning algorithm to finish fine tuning of the model.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable by the processor, the processor implementing the method according to any one of claims 1 to 7 when the computer program is executed.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202411739761.9A 2024-11-29 2024-11-29 Large model fine-tuning method based on hybrid quantization and related equipment Active CN119740640B (en)

Priority Applications (1)

Application Number: CN202411739761.9A (CN119740640B) | Priority Date: 2024-11-29 | Filing Date: 2024-11-29 | Title: Large model fine-tuning method based on hybrid quantization and related equipment

Applications Claiming Priority (1)

Application Number: CN202411739761.9A (CN119740640B) | Priority Date: 2024-11-29 | Filing Date: 2024-11-29 | Title: Large model fine-tuning method based on hybrid quantization and related equipment

Publications (2)

CN119740640A: published 2025-04-01
CN119740640B: published 2025-10-21

Family

ID: 95132686

Family Applications (1)

Application Number: CN202411739761.9A | Status: Active (CN119740640B) | Priority Date: 2024-11-29 | Filing Date: 2024-11-29

Country Status (1)

CN: CN119740640B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251445A1 (en) * 2018-02-09 2019-08-15 Google Llc Neural network compression
CN111738419A (en) * 2020-06-19 2020-10-02 北京百度网讯科技有限公司 Quantization method and device for neural network model
KR20220109301A (en) * 2021-01-28 2022-08-04 삼성전자주식회사 Quantization method for deep learning model and apparatus thereof
CN115115026A (en) * 2022-05-17 2022-09-27 南京大学 Depth model parameter quantification method based on gradient compensation
CN117371508A (en) * 2023-09-28 2024-01-09 北京百度网讯科技有限公司 Model compression method, device, electronic equipment and storage medium
CN118035624A (en) * 2024-02-06 2024-05-14 广东能哥知识科技有限公司 Low-rank adaptive quantitative fine tuning method and device for large language model
CN118485893A (en) * 2024-05-31 2024-08-13 南湖实验室 Model optimization method and system considering influence of quantization and pruning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Priscilla Sharon Allwin, "Run-Time Non-Uniform Quantization for Dynamic Neural Networks in Wireless Communication", ASPDAC '24: Proceedings of the 29th Asia and South Pacific Design Automation Conference, 3 April 2024.
Cui Yuan (崔媛), "Research on Deep Neural Network Compression Methods for Image Object Detection", China Master's Theses Full-text Database (Information Science and Technology), no. 4, 15 April 2022.

Also Published As

Publication number Publication date
CN119740640B (en) 2025-10-21

Similar Documents

Publication Publication Date Title
CN113168563B (en) Residual Quantization for Neural Networks
US12045307B2 (en) Fine-grained per-vector scaling for neural network quantization
CN111105017B (en) Neural network quantization method and device and electronic equipment
TW202141363A (en) Adaptive quantization for execution of machine learning models
US20220207370A1 (en) Inferring device, training device, inferring method, and training method
CN110659725A (en) Compression and acceleration method of neural network model, data processing method and device
CN114418105B (en) A method and device for processing quantum application problems based on quantum circuits
CN110889290B (en) Text encoding method and apparatus, text encoding validity checking method and apparatus
KR20190140841A (en) Neural network hardware acceleration with stochastic adaptive resource allocation
CN110689109A (en) Neural network method and apparatus
CN112561050A (en) Neural network model training method and device
CN115761830A (en) Face recognition model quantitative training method, device, equipment and storage medium
CN114787823A (en) Flexible precision neural inference processing unit
CN118863078B (en) A method, device, medium and electronic device for constructing a variable quantum circuit
EP4196919A1 (en) Method and system for quantizing a neural network
CN119740640B (en) Large model fine-tuning method based on hybrid quantization and related equipment
CN114170474B (en) Method and related equipment for generating adversarial sample Trojans for neural network models
CN118608902B (en) Image prediction method, training method, device and equipment of image prediction model
Moe et al. Implementing spatio-temporal graph convolutional networks on graphcore ipus
CN114943332A (en) Training method of lightweight YOLO model and related equipment
CN119206245A (en) Method executed by electronic device, storage medium, and program product
CN114707592A (en) Model structure obtaining method and device
CN119514721B (en) Language model training method, device, equipment and storage medium
CN115588104B (en) Optimization methods for image classification models, electronic devices, and storage media.
CN120822628B (en) Quantitative reasoning method, device, equipment and medium based on multi-mode large model

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant