Disclosure of Invention
The invention aims to provide a parallel strategy search method for efficient training of large artificial-intelligence models, based on mixed integer quadratic programming, so as to overcome the local optimality of parallel strategies found by prior-art search methods, improve large-model training efficiency, and reduce energy consumption.
The technical scheme is as follows: the parallel strategy search method for efficient training of large artificial-intelligence models comprises the following steps:
S1, inputting a large model;
S2, acquiring execution performance information of the system and of the large model:
setting the All-Reduce traffic a_ar and measuring the All-Reduce communication time t_ar, so as to obtain the All-Reduce communication efficiency e_ar of the system;
setting the P2P traffic a_p2p and measuring the P2P communication time t_p2p, so as to obtain the P2P communication efficiency e_p2p of the system;
setting the All-Reduce traffic a_ar and measuring both the All-Reduce communication time t_n when All-Reduce runs alone and the All-Reduce communication time t_o when All-Reduce runs concurrently with a matrix-multiplication computation, so as to obtain the computation-communication overlap factor e_o of the system;
measuring the forward execution time fp_u of each layer u (u ∈ V) of the large model G(V, E);
measuring the active memory of each layer of the large model G(V, E), excluding the memory occupied by the model and the optimizer, and the additional memory m_other required to load the large model G(V, E).
S3, estimating the execution time overhead and the storage overhead of the large model on the current system by using a cost model.
The execution time overhead of step S3 includes the sum of the computation and communication time of all candidate strategies of each layer of the large model, and one or more of the same-pipeline-stage communication time and the cross-pipeline-stage P2P communication time of all candidate strategy pairs of adjacent layers.
Estimating the execution time overhead in step S3 specifically includes the following steps:
estimating, with the cost model, the sum A_u of the computation and communication time of all candidate data parallel, tensor parallel and optimizer parallel strategies of each layer u (u ∈ V) of the large model G(V, E);
estimating, with the cost model, the same-pipeline-stage communication time R_uv between all candidate data parallel, tensor parallel and optimizer parallel strategy pairs of layers u and v corresponding to each edge <u, v> (<u, v> ∈ E) of the large model G(V, E);
estimating, with the cost model, the cross-pipeline-stage P2P communication time R'_uv between all candidate data parallel, tensor parallel and optimizer parallel strategy pairs of layers u and v corresponding to each edge <u, v> (<u, v> ∈ E) of the large model G(V, E).
The storage overhead of step S3 includes the memory overhead of all candidate strategies of each layer of the large model.
Estimating the storage overhead in step S3 specifically includes estimating, with the cost model, the memory overhead M_u of all candidate data parallel, tensor parallel and optimizer parallel strategies of each layer u (u ∈ V) of the large model G(V, E).
S4, taking the time spent by each iteration of the pipeline parallel training as an optimization target, establishing a mixed integer quadratic programming mathematical model for automatically searching a parallel training strategy of a large model, and solving;
Step S4.1, enumerating legal pipeline stage numbers deg and legal micro-batch numbers c, and selecting the minimum time TPI spent per iteration together with one or more of the corresponding pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies;
step S4.2, based on a non-pipeline parallel strategy (deg = 1), establishing and solving a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of data parallel, tensor parallel and optimizer parallel training as the optimization target, wherein the searched content is one or more of the data parallel, tensor parallel and optimizer parallel strategies;
step S4.3, based on a given pipeline parallel strategy (deg > 1), establishing and solving a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of pipeline parallel, data parallel, tensor parallel and optimizer parallel training as the optimization target, wherein the searched content is one or more of the pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies.
S5, outputting the optimal parallel training strategy of the large model.
Based on each solution of the mixed integer quadratic programming mathematical model, one or more combinations of the optimal pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies of the large model on the current system are output.
A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the parallel strategy search method for efficient training of large artificial-intelligence models described above.
A computer device comprises a memory, a processor and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the parallel strategy search method for efficient training of large artificial-intelligence models described above.
Compared with the prior art, the invention has the following advantages:
1. The invention supports four parallel training strategies of pipeline parallelism, data parallelism, tensor parallelism and optimizer parallelism, and has large strategy space;
2. The invention utilizes mixed integer quadratic programming, so that the searching time is short;
3. The invention can be applied to single-machine multi-card parallel training of large models and to distributed parallel training on multi-machine clusters, which helps improve large-model training efficiency, thereby reducing training time and energy consumption.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
The parallel strategy search method for efficient training of large artificial-intelligence models provided by the invention supports pipeline parallel, data parallel, tensor parallel and optimizer parallel training of various large models, and can be applied to both single-machine multi-card parallel training and multi-machine cluster distributed parallel training of large models.
In this embodiment, two servers, Server0 and Server1, are used, each with four GPU cards numbered GPU0, GPU1, GPU2 and GPU3 in sequence. The two servers are isomorphic, i.e., they have, among other things, identical hardware environments, identical software environments, and identical network environments and protocols. In this embodiment, the global number of a GPU is defined as globalID = serverID × gpuPerServer + localID, where serverID is the server number, gpuPerServer is the number of GPUs per server, and localID is the local number of the GPU. For example, GPU1 of Server1 has a global number of 1×4+1 = 5.
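The global-numbering rule above can be sketched as a small function (names are illustrative):

```python
# Minimal sketch of the GPU global-numbering scheme described above:
# globalID = serverID * gpuPerServer + localID.

def global_id(server_id: int, gpu_per_server: int, local_id: int) -> int:
    """Map a (server, local GPU) pair to a cluster-wide GPU number."""
    return server_id * gpu_per_server + local_id

# GPU1 of Server1 in a 2-server, 4-GPU-per-server cluster:
# global_id(1, 4, 1) == 5
```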
To assist in understanding the relationship between the two servers of the present embodiment, Fig. 2 shows their topology. On each server, GPU0 and GPU1 are connected to CPU0 through a PCIe bus, and GPU2 and GPU3 are connected to CPU1 through a PCIe bus. CPU0 and CPU1 of each server are connected through a QPI bus. The two servers are connected through Ethernet. In this embodiment, the parallel strategy search method for efficient training of large artificial-intelligence models is applied to search for a parallel strategy for distributed training of the BERT-Huge model with the Adam optimizer on Server0 and Server1. The specific workflow of the method in this embodiment is as follows:
The workflow of the parallel strategy search method for the artificial intelligence large model efficient training is shown in fig. 1.
Step S1, inputting a large model.
In this embodiment, the large model is the BERT-Huge model.
Step S2, acquiring execution performance information of the system and of the large model.
In this embodiment, the acquiring the system execution performance information and the execution performance information of the large model includes:
Setting the All-Reduce traffic a_ar and measuring the All-Reduce communication time t_ar, so as to obtain the All-Reduce communication efficiency of the system, e_ar = a_ar / t_ar;
setting the P2P traffic a_p2p and measuring the P2P communication time t_p2p, so as to obtain the P2P communication efficiency of the system, e_p2p = a_p2p / t_p2p;
setting the All-Reduce traffic a_ar and measuring both the All-Reduce communication time t_n when All-Reduce runs alone and the All-Reduce communication time t_o when All-Reduce runs concurrently with a matrix-multiplication computation, so as to obtain the computation-communication overlap factor of the system, e_o = t_n / t_o;
on Server0, using GPU0 with batch size 1, measuring the forward execution time fp_u of each layer u (u ∈ V) of the BERT-Huge model G(V, E);
on Server0, using GPU0 with batch size 1, measuring the active memory of each layer of the BERT-Huge model G(V, E), excluding the memory occupied by the BERT-Huge model and the Adam optimizer, and the additional memory m_other required to load the BERT-Huge model G(V, E).
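The profiling quantities of step S2 can be sketched as below. The exact formulas are not reproduced in the text; a natural reading, assumed here, is that communication efficiency is measured traffic divided by measured time, and that the overlap factor compares standalone against overlapped All-Reduce time:

```python
# Hedged sketch of deriving the step-S2 profiling quantities. The forms
# e_ar = a_ar / t_ar, e_p2p = a_p2p / t_p2p and e_o = t_n / t_o are
# assumptions consistent with the measurements described in the text.

def comm_efficiency(traffic_bytes: float, time_s: float) -> float:
    """Effective bandwidth, e.g. e_ar = a_ar / t_ar (assumed form)."""
    return traffic_bytes / time_s

def overlap_factor(t_standalone: float, t_overlapped: float) -> float:
    """e_o = t_n / t_o (assumed form): < 1 means overlap slows All-Reduce."""
    return t_standalone / t_overlapped
```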
Step S3, estimating the execution time overhead and the storage overhead of the large model on the current system by using the cost model.
In this embodiment, the execution time overhead of the large model on the current system includes:
Estimating, with the cost model, the sum A_u of the computation and communication time, at batch size bs, of all candidate data parallel, tensor parallel and optimizer parallel strategies of each layer u (u ∈ V) of the BERT-Huge model G(V, E). In this embodiment, for a candidate strategy S_ui of layer u of BERT-Huge, let its data parallel/optimizer parallelism be dp_ui, its tensor parallelism be tp_ui, its data parallel/optimizer parallel traffic be a_dp, and its tensor parallel forward and backward traffic be a_tp. Then, when strategy S_ui is applied, the sum of the forward computation and communication time of layer u is FP_ui, and the backward computation and communication time BP_ui includes the time during which backward computation overlaps with data parallel/optimizer parallel communication, the time during which they do not overlap, and the backward tensor parallel communication time. Thus, for a candidate strategy S_ui of layer u of BERT-Huge at batch size bs, the sum of the computation and communication time of layer u is A_ui = FP_ui + BP_ui;
estimating, with the cost model, the same-pipeline-stage communication time R_uv at batch size bs between all candidate data parallel, tensor parallel and optimizer parallel strategy pairs of layers u and v corresponding to each edge <u, v> (<u, v> ∈ E) of the BERT-Huge model G(V, E). In this embodiment, let the candidate strategy of layer u be S_ui and the candidate strategy of layer v be S_vj, <u, v> ∈ E, and let the same-pipeline-stage traffic between layer u and layer v be a_uvij; then the same-pipeline-stage communication time between layer u and layer v is R_uvij = a_uvij / e_ar;
estimating, with the cost model, the cross-pipeline-stage P2P communication time R'_uv at batch size bs between all candidate data parallel, tensor parallel and optimizer parallel strategy pairs of layers u and v corresponding to each edge <u, v> (<u, v> ∈ E) of the BERT-Huge model G(V, E). In this embodiment, let the candidate strategy of layer u be S_ui and the candidate strategy of layer v be S_vj, <u, v> ∈ E, and let the cross-pipeline-stage P2P traffic between layer u and layer v be a'_uvij; then the cross-pipeline-stage P2P communication time between layer u and layer v is R'_uvij = a'_uvij / e_p2p.
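The inter-layer communication estimates of step S3 can be sketched as follows. The assumption made here, consistent with the measured efficiencies of step S2, is that communication time equals traffic divided by efficiency, using the All-Reduce efficiency within a pipeline stage and the P2P efficiency across stages:

```python
# Hedged sketch of the cost-model estimates for inter-layer communication.
# Assumption: time = traffic / measured efficiency; symbols follow the text
# (a_uvij for same-stage traffic, a'_uvij for cross-stage P2P traffic).

def same_stage_comm_time(a_uvij: float, e_ar: float) -> float:
    """R_uvij: same-pipeline-stage time (assumed a_uvij / e_ar)."""
    return a_uvij / e_ar

def cross_stage_p2p_time(a_prime_uvij: float, e_p2p: float) -> float:
    """R'_uvij: cross-pipeline-stage P2P time (assumed a'_uvij / e_p2p)."""
    return a_prime_uvij / e_p2p
```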
In this embodiment, the storage overhead of the large model on the current system includes:
Estimating, with the cost model, the memory overhead M_u at batch size bs of all candidate data parallel, tensor parallel and optimizer parallel strategies of each layer u (u ∈ V) of the BERT-Huge model G(V, E). In this embodiment, let the parameter count of layer u be C_u, the data parallelism/optimizer parallelism be dp_ui, and the tensor parallelism be tp_ui; when training with the Adam optimizer, the memory overhead M_ui of layer u on one GPU is then determined by C_u, dp_ui and tp_ui.
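A per-layer memory estimate in the spirit of the step above can be sketched as follows. The patent's exact formula is not reproduced in the text; the constants below follow the common mixed-precision Adam accounting (2 bytes of weights + 2 bytes of gradients + 12 bytes of optimizer state per parameter) and are assumptions, as is the way the parallel degrees shard each term:

```python
# Hedged sketch of a per-layer memory estimate for Adam training. The byte
# counts and the sharding scheme (weights/grads divided by tensor parallelism
# tp_ui, optimizer state additionally divided by optimizer parallelism dp_ui)
# are assumptions, not the patent's formula.

def layer_memory_bytes(c_u: int, dp_ui: int, tp_ui: int) -> float:
    weights_and_grads = 4.0 * c_u / tp_ui            # fp16 weights + fp16 grads
    optimizer_state = 12.0 * c_u / (dp_ui * tp_ui)   # Adam moments + fp32 copy
    return weights_and_grads + optimizer_state
```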
Step S4, taking the time spent by each iteration of the pipeline parallel training as an optimization target, establishing a mixed integer quadratic programming mathematical model for automatically searching a parallel training strategy of a large model, and solving:
In this embodiment, establishing and solving the mixed integer quadratic programming mathematical model for automatically searching the parallel training strategy of the large model, taking the time spent per iteration of pipeline parallel training as the optimization target, includes:
Step S4.1, enumerating legal pipeline stage numbers deg and legal micro-batch numbers c, and selecting the minimum time TPI spent per iteration together with one or more of the corresponding pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies;
step S4.2, based on a non-pipeline parallel strategy (deg = 1), establishing and solving a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of data parallel, tensor parallel and optimizer parallel training as the optimization target, wherein the searched content is one or more of the data parallel, tensor parallel and optimizer parallel strategies;
step S4.3, based on a given pipeline parallel strategy (deg > 1), establishing and solving a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of pipeline parallel, data parallel, tensor parallel and optimizer parallel training as the optimization target, wherein the searched content is one or more of the pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies.
For ease of understanding, Fig. 3 shows the execution flow of step S4.1 in the present embodiment. In this embodiment, the pipeline is a GPipe-style synchronous pipeline.
Step S4.1.1, input the sum A of the computation and communication time of each layer of the large model and its memory overhead M, the same-pipeline-stage communication time R and the cross-pipeline-stage P2P communication time R' of adjacent layers, the GPU number n, and the mini-batch size B.
Step S4.1.2, let the pipeline stage number deg = 1, the minimum execution time per iteration minTPI = ∞, the corresponding minimum pipeline stage number minDeg = -1 and micro-batch number minC = -1, and initialize the placement strategy minP and the parallel strategy minS of each layer of the model to empty.
Step S4.1.3, judging whether the pipeline stage number deg is not larger than the GPU number n. If the judgment in step S4.1.3 holds, the outer loop is entered (the entry is step S4.1.4).
Step S4.1.4, judging whether the pipeline stage number deg is 1. If so, step S4.2 is entered, and a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of data parallel, tensor parallel and optimizer parallel training as the optimization target is established and solved. After solving, step S4.1.5 is entered to judge whether the optimization target TPI is less than minTPI; if the judgment in step S4.1.5 holds, step S4.1.6 is executed, letting minTPI = TPI, minDeg = deg, minC = c, minP = P and minS = S. Finally, step S4.1.7 is executed, letting deg = deg × 2, and the outer-loop condition of step S4.1.3 is judged again.
If the judgment in step S4.1.4 does not hold, step S4.1.8 is executed to set the micro-batch number c to 2, and step S4.1.9 judges whether the micro-batch number c does not exceed the mini-batch size B. If this does not hold, the process proceeds to step S4.1.7, letting deg = deg × 2; if it holds, the inner loop is entered. In the inner loop, step S4.1.10 first judges whether B is not divisible by c. If so, step S4.1.14 (c = c + 1) is executed and the inner-loop condition of step S4.1.9 is judged again; if the judgment in step S4.1.10 does not hold, step S4.1.11 sets the micro-batch size b = B / c, and then step S4.3 is executed, establishing and solving a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of pipeline parallel, data parallel, tensor parallel and optimizer parallel training as the optimization target. After obtaining the solution, step S4.1.12 judges whether the optimization target TPI is less than minTPI; if so, step S4.1.13 is executed, letting minTPI = TPI, minDeg = deg, minC = c, minP = P and minS = S. The process then executes step S4.1.14 (c = c + 1) and continues judging the inner-loop condition of step S4.1.9.
If the judgment in step S4.1.3 does not hold, the process proceeds to step S4.1.15, outputs the minimum execution time per iteration minTPI together with the corresponding minimum pipeline stage number minDeg, micro-batch number minC, placement strategy minP of each layer of the model, and parallel strategy minS of each layer of the model, and ends.
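The enumeration flow of step S4.1 described above can be sketched as a loop with a pluggable solver standing in for the MIQP of steps S4.2/S4.3 (the `solve` interface is hypothetical, introduced only for illustration):

```python
# Sketch of the step-S4.1 enumeration (Fig. 3): pipeline stage counts deg are
# doubled up to the GPU count n; micro-batch counts c must divide the
# mini-batch size B; solve(deg, c) is a hypothetical stand-in that returns
# (TPI, placement P, parallel strategies S) for one legal configuration.

def search(n, B, solve):
    """Return (minTPI, minDeg, minC, minP, minS) over all legal (deg, c)."""
    best = (float("inf"), -1, -1, None, None)
    deg = 1
    while deg <= n:                                   # outer loop (S4.1.3)
        if deg == 1:
            tpi, P, S = solve(deg, 1)                 # step S4.2, no pipeline
            if tpi < best[0]:
                best = (tpi, deg, 1, P, S)
        else:
            for c in range(2, B + 1):                 # inner loop (S4.1.9)
                if B % c != 0:                        # S4.1.10: c must divide B
                    continue
                tpi, P, S = solve(deg, c)             # step S4.3, b = B // c
                if tpi < best[0]:
                    best = (tpi, deg, c, P, S)
        deg *= 2                                      # S4.1.7
    return best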
For step S4.2 of this embodiment, to aid understanding of the mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of data parallel, tensor parallel and optimizer parallel training as the optimization target, an example model is given below:
The objective function, minimizing the time TPI spent per iteration, can be expressed as equation (1):
min TPI = p_1 (1)
In equation (1), TPI is the time spent per iteration, so the objective is equivalent to maximizing the number of iterations per unit time, i.e., maximizing throughput. p_i denotes the total computation and communication time of the i-th stage of the pipeline. Since step S4.2 does not consider pipeline parallelism, one or more combinations of data parallelism, tensor parallelism and optimizer parallelism can be regarded as executing within a single pipeline stage. Without loss of generality, data parallelism, tensor parallelism and optimizer parallelism may be taken to execute in stage 1 of the pipeline, whose total computation and communication time is denoted p_1.
For a better understanding, the constraints are presented below in conjunction with the formulas:
Constraint 1: the computation time of stage 1 of the pipeline includes the sum of the computation and communication time of all layers u of BERT-Huge, plus the sum of the same-pipeline-stage communication costs between layers u and v corresponding to all edges <u, v> (<u, v> ∈ E) of BERT-Huge. Constraint 1 can be expressed as equation (2):
p_1 = Σ_{u∈V} Σ_k A_uk · S_uk + Σ_{<u,v>∈E} Σ_k Σ_l R_uvkl · S_uk · S_vl (2)
where S_uk is a 0-1 variable indicating whether layer u selects its k-th candidate data parallel, tensor parallel and optimizer parallel strategy.
Constraint 2: the storage overhead of stage 1 of the pipeline, i.e. the sum of the memory overheads of the selected candidate data parallel, tensor parallel and optimizer parallel strategies over all layers u (u ∈ V) of BERT-Huge, must not exceed the memory budget. Constraint 2 can be expressed as equation (3):
Σ_{u∈V} Σ_k M_uk · S_uk ≤ m (3)
In this embodiment, m is the remaining memory capacity, i.e. the memory capacity of GPU0 minus the additional memory m_other required to load the BERT-Huge model G(V, E). Since Server0 and Server1 are isomorphic, m is the same constant for all GPUs.
Constraint 3: every layer u (u ∈ V) of BERT-Huge must select exactly one candidate data parallel, tensor parallel and optimizer parallel strategy. Constraint 3 can be expressed as equations (4) and (5):
Σ_{k=1}^{|g_u|} S_uk = 1, ∀u ∈ V (4)
S_uk ∈ {0, 1}, ∀u ∈ V, 1 ≤ k ≤ |g_u| (5)
where g_u denotes the set of candidate data parallel, tensor parallel and optimizer parallel strategies of layer u of BERT-Huge, and |g_u| denotes the cardinality of g_u.
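For tiny instances, the step-S4.2 model can be illustrated by an exhaustive stand-in that enforces constraints (3), (4) and (5) and minimizes the stage-1 time of constraint (2). This brute-force sketch is only feasible for toy inputs; the patent solves the real problem with an MIQP solver:

```python
# Illustrative brute-force stand-in for the MIQP of step S4.2: exactly one
# candidate strategy per layer, total memory within the per-GPU budget m,
# minimizing per-layer time plus adjacent-layer communication time.

from itertools import product

def solve_stage(A, R, M, m):
    """A[u][k]: time of layer u under strategy k; M[u][k]: its memory;
    R[(u, v)][k][l]: comm time between adjacent layers u, v under (k, l)."""
    n_layers = len(A)
    best_tpi, best_choice = float("inf"), None
    for choice in product(*[range(len(a)) for a in A]):   # one k per layer
        if sum(M[u][choice[u]] for u in range(n_layers)) > m:
            continue                                       # memory budget violated
        tpi = sum(A[u][choice[u]] for u in range(n_layers))
        tpi += sum(R[(u, v)][choice[u]][choice[v]] for (u, v) in R)
        if tpi < best_tpi:
            best_tpi, best_choice = tpi, choice
    return best_tpi, best_choice
```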
For step S4.3 of this embodiment, to aid understanding of the mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of pipeline parallel, data parallel, tensor parallel and optimizer parallel training as the optimization target, an example model is given below:
The objective function, minimizing the time TPI spent per iteration, can be expressed as equation (6):
min TPI = Σ_{i=1}^{ps} p_i + Σ_{j=1}^{os} o_j + max{p_1, …, p_ps} · (c − 1) (6)
In equation (6), p_i is the total computation and communication time of the i-th stage of the pipeline, and ps is the total number of pipeline stages, with ps = deg. o_j is the total communication time of the j-th cross-pipeline-stage communication phase, and os is the total number of such phases, with os = ps − 1. c is the micro-batch number used during pipeline computation. Σ_{i=1}^{ps} p_i represents the sum of the computation and communication time of all pipeline stages, Σ_{j=1}^{os} o_j represents the sum of the cross-pipeline-stage P2P communication times, and max{p_1, …, p_ps} · (c − 1) models the effect of the slowest stage and the micro-batch number on the overall pipeline computation time.
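The objective described above can be written directly as a small function, following the GPipe-style fill/drain accounting of the text (sum of all stage times, plus cross-stage P2P times, plus the slowest stage repeated for the remaining c − 1 micro-batches):

```python
# Sketch of the equation-(6) objective: per-iteration time of a GPipe-style
# synchronous pipeline with c micro-batches.

def tpi(p, o, c):
    """p: per-stage compute+comm times p_i; o: cross-stage P2P times o_j;
    c: micro-batch number."""
    return sum(p) + sum(o) + max(p) * (c - 1)
```

For example, two stages of 2 and 3 time units, one cross-stage hop of 1 unit, and c = 4 micro-batches give 2 + 3 + 1 + 3 × 3 = 15 units.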
For a better understanding, the following describes the various constraints in conjunction with the formulas:
Constraint 1: the computation and communication time of the i-th stage of the pipeline comprises (1) the sum of the computation and communication time of the BERT-Huge layers placed in the i-th stage and (2) the sum of the same-pipeline-stage communication costs between layers u and v, corresponding to all edges <u, v> (<u, v> ∈ E), that are both placed in the i-th stage. Constraint 1 can be expressed as equation (7):
p_i = Σ_{u∈V} Σ_k P_ui · A_uk · S_uk + Σ_{<u,v>∈E} Σ_k Σ_l P_ui · P_vi · R_uvkl · S_uk · S_vl (7)
where P_ui is a 0-1 variable indicating whether layer u is placed in the i-th stage of the pipeline, and S_uk is a 0-1 variable indicating whether layer u selects its k-th candidate data parallel, tensor parallel and optimizer parallel strategy.
Constraint 2: the communication time of the j-th cross-pipeline-stage communication phase of the pipeline. Constraint 2 can be expressed as equation (8):
o_j = Σ_{<u,v>∈E} Σ_k Σ_l P_uj · P_v(j+1) · R'_uvkl · S_uk · S_vl (8)
Constraint 3: the storage overhead of the i-th stage of the pipeline, i.e. the sum of the memory overheads of the selected candidate data parallel, tensor parallel and optimizer parallel strategies over all BERT-Huge layers u (u ∈ V) placed in the i-th stage, must not exceed the memory budget. Constraint 3 can be expressed as equation (9):
Σ_{u∈V} Σ_k P_ui · M_uk · S_uk ≤ m (9)
In this embodiment, m is the remaining memory capacity, i.e. the memory capacity of GPU0 minus the additional memory m_other required to load the BERT-Huge model G(V, E). Since Server0 and Server1 are isomorphic, m is the same constant for all GPUs.
Constraint 4: each pipeline stage must contain a contiguous portion of the BERT-Huge model. Constraint 4 can be expressed as equations (10), (11) and (12):
where Z_vi is a 0-1 variable indicating whether there exists a layer w after layer v such that some predecessor layer u of w, together with layer v, is placed on the i-th pipeline stage: Z_vi = 1 if such a layer w exists, and 0 otherwise. In this embodiment, equations (10), (11) and (12) jointly constrain the layers of the BERT-Huge model to be placed on pipeline stages in order rather than out of order.
Constraint 5: every layer u (u ∈ V) of BERT-Huge must be placed on exactly one pipeline stage, and every pipeline stage must contain at least one layer. Constraint 5 can be expressed as equations (13), (14) and (15):
Σ_{i=1}^{ps} P_ui = 1, ∀u ∈ V (13)
Σ_{u∈V} P_ui ≥ 1, ∀1 ≤ i ≤ ps (14)
P_ui ∈ {0, 1}, ∀u ∈ V, 1 ≤ i ≤ ps (15)
Constraint 6: every layer u (u ∈ V) of BERT-Huge must select exactly one candidate data parallel, tensor parallel and optimizer parallel strategy. Constraint 6 can be expressed as equations (4) and (5).
Finally, based on the above solution results of the mixed integer quadratic programming model, the optimal parallel training strategy of the large model is output (step S5):
In this embodiment, the parallel training strategy for outputting the optimal large model specifically includes:
Based on the output of step S4.1.15 in step S4, one or more combinations of the optimal pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies of the BERT-Huge model on the current system are output.
More specifically, minTPI denotes the minimum execution time per iteration, and minDeg, minC, minP and minS denote the pipeline stage number, micro-batch number, per-layer placement strategy and per-layer parallel strategy corresponding to minTPI. The final pipeline parallel strategy sets the pipeline stage number to minDeg and splits each mini-batch into minC micro-batches during pipeline computation. Each layer u of the model is placed on the pipeline stage assigned to it in minP, and its data parallel, tensor parallel and optimizer parallel strategies are those given for it in minS. Furthermore, for the mapping between the parallel computing strategy and the system, the GPU with global number l is responsible for computing the (⌊l × minDeg / n⌋ + 1)-th stage of the pipeline. For example, assuming minDeg = 4, GPU3 of Server0, with global number 0×4+3 = 3, is responsible for computing the 2nd stage of the pipeline, and GPU0 of Server1, with global number 1×4+0 = 4, is responsible for computing the 3rd stage.
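The GPU-to-stage mapping can be sketched as below. The assumption made here (the mapping formula itself is not reproduced in the text) is that the n GPUs are split evenly into minDeg contiguous groups in global-number order:

```python
# Hedged sketch of mapping a GPU's global number to a pipeline stage.
# Assumption: n GPUs are split evenly into min_deg contiguous groups in
# global-number order; stages are 1-indexed.

def stage_of_gpu(global_id: int, n_gpus: int, min_deg: int) -> int:
    return global_id * min_deg // n_gpus + 1

# With min_deg = 4 and n_gpus = 8: GPU 3 -> stage 2, GPU 4 -> stage 3.
```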
The method of the present invention has been tested on a number of models and systems. During the experiments, the search time of the parallel strategy search method for efficient training of large artificial-intelligence models was measured on Server0. After the parallel training strategy was obtained by the search, the throughput of that strategy on the current system was also measured on Server0. Experimental results on the BERT-Huge, T5-Large, ViT-Huge and Swin-Huge models show that, compared with the best existing method, the method can accelerate strategy search by up to about 16× and improve training throughput by up to about 1.7×.