Disclosure of Invention
The invention aims to provide a parallel strategy search method for efficient training of large artificial-intelligence models, based on mixed integer quadratic programming, so as to overcome the local optimality of parallel strategies found by prior-art search methods, improve large-model training efficiency, and reduce energy consumption.
The technical scheme is as follows: the parallel strategy search method for efficient training of large artificial-intelligence models comprises the following steps:
S1, inputting a large model;
S2, acquiring execution performance information of the system and of the large model:
setting the All-Reduce traffic a_ar and measuring the All-Reduce communication time t_ar, so as to obtain the All-Reduce communication efficiency e_ar of the system;
setting the P2P traffic a_p2p and measuring the P2P communication time t_p2p, so as to obtain the P2P communication efficiency e_p2p of the system;
setting the All-Reduce traffic a_ar and measuring both the All-Reduce communication time t_n when All-Reduce runs alone and the All-Reduce communication time t_o when All-Reduce runs concurrently with a matrix-multiplication computation, so as to obtain the computation-communication overlap factor e_o of the system;
measuring the forward execution time fp_u of each layer u (u ∈ V) of the large model G(V, E);
measuring the active memory of each layer of the large model G(V, E), excluding the memory occupied by the model and the optimizer, and the additional memory m_other required to load the large model G(V, E).
S3, estimating the execution time overhead and the storage overhead of the large model on the current system by using a cost model.
The execution time overhead of step S3 includes the sum of the computation and communication time of all candidate strategies of each layer of the large model, and one or more of the same-pipeline-stage communication time and the cross-pipeline-stage P2P communication time of all candidate strategy pairs of adjacent layers.
Estimating the execution time overhead in step S3 specifically includes the following steps:
estimating, with the cost model, the sum A_u of the computation and communication time of all candidate data parallel, tensor parallel and optimizer parallel strategies of each layer u (u ∈ V) of the large model G(V, E);
estimating, with the cost model, the same-pipeline-stage communication time R_uv between all candidate data parallel, tensor parallel and optimizer parallel strategy pairs of layers u and v corresponding to each edge <u, v> (<u, v> ∈ E) of the large model G(V, E);
estimating, with the cost model, the cross-pipeline-stage P2P communication time R'_uv between all candidate data parallel, tensor parallel and optimizer parallel strategy pairs of layers u and v corresponding to each edge <u, v> (<u, v> ∈ E) of the large model G(V, E).
The storage overhead of step S3 includes the memory overhead of all candidate strategies of each layer of the large model.
Estimating the storage overhead in step S3 specifically includes estimating, with the cost model, the memory overhead M_u of all candidate data parallel, tensor parallel and optimizer parallel strategies of each layer u (u ∈ V) of the large model G(V, E).
S4, taking the time spent by each iteration of the pipeline parallel training as an optimization target, establishing a mixed integer quadratic programming mathematical model for automatically searching a parallel training strategy of a large model, and solving;
Step S4.1, enumerating legal pipeline stage numbers deg and legal micro-batch numbers c, and selecting the minimum time TPI spent per iteration together with one or more of the corresponding pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies;
step S4.2, based on a non-pipeline parallel strategy (deg = 1), establishing and solving a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of data parallel, tensor parallel and optimizer parallel training as the optimization target, wherein the searched content is one or more of the data parallel, tensor parallel and optimizer parallel strategies;
step S4.3, based on a given pipeline parallel strategy (deg > 1), establishing and solving a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of pipeline parallel, data parallel, tensor parallel and optimizer parallel training as the optimization target, wherein the searched content is one or more of the pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies.
S5, outputting the optimal parallel training strategy of the large model.
Based on each solution of the mixed integer quadratic programming mathematical model, one or more combinations of the optimal pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies of the large model on the current system are output.
A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the parallel strategy search method for efficient training of large artificial-intelligence models described above.
A computer device comprises a memory, a processor and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the parallel strategy search method for efficient training of large artificial-intelligence models described above.
Compared with the prior art, the invention has the following advantages:
1. The invention supports four parallel training strategies of pipeline parallelism, data parallelism, tensor parallelism and optimizer parallelism, and has large strategy space;
2. The invention utilizes mixed integer quadratic programming, so that the searching time is short;
3. The invention can be applied to single-machine multi-card parallel training of large models and to distributed parallel training on multi-machine clusters, which helps improve large-model training efficiency, thereby reducing training time and energy consumption.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
The parallel strategy search method for efficient training of large artificial-intelligence models provided by the invention supports pipeline parallel, data parallel, tensor parallel and optimizer parallel training of various large models, and can be applied to both single-machine multi-card parallel training and multi-machine cluster distributed parallel training of large models.
In this embodiment, two servers, Server0 and Server1, are used, each with four GPU cards numbered GPU0, GPU1, GPU2 and GPU3 in sequence. The two servers are isomorphic, i.e., they have, among other things, identical hardware environments, identical software environments, and identical network environments and protocols. In this embodiment, the global number of a GPU is defined as globalID = serverID × gpuPerServer + localID, where serverID is the server number, gpuPerServer is the number of GPUs per server, and localID is the local number of the GPU. For example, GPU1 of Server1 has a global number of 1×4+1 = 5.
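The global-numbering rule above can be sketched as a small function (names are illustrative):

```python
# Minimal sketch of the GPU global-numbering scheme described above:
# globalID = serverID * gpuPerServer + localID.

def global_id(server_id: int, gpu_per_server: int, local_id: int) -> int:
    """Map a (server, local GPU) pair to a cluster-wide GPU number."""
    return server_id * gpu_per_server + local_id

# GPU1 of Server1 in a 2-server, 4-GPU-per-server cluster:
# global_id(1, 4, 1) == 5
```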
To assist in understanding the relationship between the two servers of the present embodiment, Fig. 2 shows their topology. On each server, GPU0 and GPU1 are connected to CPU0 through a PCIe bus, and GPU2 and GPU3 are connected to CPU1 through a PCIe bus. CPU0 and CPU1 of each server are connected through a QPI bus. The two servers are connected through Ethernet. In this embodiment, the parallel strategy search method for efficient training of large artificial-intelligence models is applied to search for a parallel strategy for distributed training of the BERT-Huge model with the Adam optimizer on Server0 and Server1. The specific workflow of the method in this embodiment is as follows:
The workflow of the parallel strategy search method for the artificial intelligence large model efficient training is shown in fig. 1.
Step S1, inputting a large model.
In this embodiment, the large model is the BERT-Huge model.
Step S2, acquiring execution performance information of the system and of the large model.
In this embodiment, the acquiring the system execution performance information and the execution performance information of the large model includes:
Setting the All-Reduce traffic a_ar and measuring the All-Reduce communication time t_ar, so as to obtain the All-Reduce communication efficiency of the system, e_ar = a_ar / t_ar;
setting the P2P traffic a_p2p and measuring the P2P communication time t_p2p, so as to obtain the P2P communication efficiency of the system, e_p2p = a_p2p / t_p2p;
setting the All-Reduce traffic a_ar and measuring both the All-Reduce communication time t_n when All-Reduce runs alone and the All-Reduce communication time t_o when All-Reduce runs concurrently with a matrix-multiplication computation, so as to obtain the computation-communication overlap factor of the system, e_o = t_n / t_o;
on Server0, using GPU0 with batch size 1, measuring the forward execution time fp_u of each layer u (u ∈ V) of the BERT-Huge model G(V, E);
on Server0, using GPU0 with batch size 1, measuring the active memory of each layer of the BERT-Huge model G(V, E), excluding the memory occupied by the BERT-Huge model and the Adam optimizer, and the additional memory m_other required to load the BERT-Huge model G(V, E).
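The profiling quantities of step S2 can be sketched as below. The exact formulas are not reproduced in the text; a natural reading, assumed here, is that communication efficiency is measured traffic divided by measured time, and that the overlap factor compares standalone against overlapped All-Reduce time:

```python
# Hedged sketch of deriving the step-S2 profiling quantities. The forms
# e_ar = a_ar / t_ar, e_p2p = a_p2p / t_p2p and e_o = t_n / t_o are
# assumptions consistent with the measurements described in the text.

def comm_efficiency(traffic_bytes: float, time_s: float) -> float:
    """Effective bandwidth, e.g. e_ar = a_ar / t_ar (assumed form)."""
    return traffic_bytes / time_s

def overlap_factor(t_standalone: float, t_overlapped: float) -> float:
    """e_o = t_n / t_o (assumed form): < 1 means overlap slows All-Reduce."""
    return t_standalone / t_overlapped
```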
Step S3, estimating the execution time overhead and the storage overhead of the large model on the current system by using the cost model.
In this embodiment, the execution time overhead of the large model on the current system includes:
Estimating, with the cost model, the sum A_u of the computation and communication time, at batch size bs, of all candidate data parallel, tensor parallel and optimizer parallel strategies of each layer u (u ∈ V) of the BERT-Huge model G(V, E). In this embodiment, for a candidate strategy S_ui of layer u of BERT-Huge, let its data parallel/optimizer parallelism be dp_ui, its tensor parallelism be tp_ui, its data parallel/optimizer parallel traffic be a_dp, and its tensor parallel forward and backward traffic be a_tp. Then, when strategy S_ui is applied, the sum of the forward computation and communication time of layer u is FP_ui, and the backward computation and communication time BP_ui includes the time during which backward computation overlaps with data parallel/optimizer parallel communication, the time during which they do not overlap, and the backward tensor parallel communication time. Thus, for a candidate strategy S_ui of layer u of BERT-Huge at batch size bs, the sum of the computation and communication time of layer u is A_ui = FP_ui + BP_ui;
estimating, with the cost model, the same-pipeline-stage communication time R_uv at batch size bs between all candidate data parallel, tensor parallel and optimizer parallel strategy pairs of layers u and v corresponding to each edge <u, v> (<u, v> ∈ E) of the BERT-Huge model G(V, E). In this embodiment, let the candidate strategy of layer u be S_ui and the candidate strategy of layer v be S_vj, <u, v> ∈ E, and let the same-pipeline-stage traffic between layer u and layer v be a_uvij; then the same-pipeline-stage communication time between layer u and layer v is R_uvij = a_uvij / e_ar;
estimating, with the cost model, the cross-pipeline-stage P2P communication time R'_uv at batch size bs between all candidate data parallel, tensor parallel and optimizer parallel strategy pairs of layers u and v corresponding to each edge <u, v> (<u, v> ∈ E) of the BERT-Huge model G(V, E). In this embodiment, let the candidate strategy of layer u be S_ui and the candidate strategy of layer v be S_vj, <u, v> ∈ E, and let the cross-pipeline-stage P2P traffic between layer u and layer v be a'_uvij; then the cross-pipeline-stage P2P communication time between layer u and layer v is R'_uvij = a'_uvij / e_p2p.
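The inter-layer communication estimates of step S3 can be sketched as follows. The assumption made here, consistent with the measured efficiencies of step S2, is that communication time equals traffic divided by efficiency, using the All-Reduce efficiency within a pipeline stage and the P2P efficiency across stages:

```python
# Hedged sketch of the cost-model estimates for inter-layer communication.
# Assumption: time = traffic / measured efficiency; symbols follow the text
# (a_uvij for same-stage traffic, a'_uvij for cross-stage P2P traffic).

def same_stage_comm_time(a_uvij: float, e_ar: float) -> float:
    """R_uvij: same-pipeline-stage time (assumed a_uvij / e_ar)."""
    return a_uvij / e_ar

def cross_stage_p2p_time(a_prime_uvij: float, e_p2p: float) -> float:
    """R'_uvij: cross-pipeline-stage P2P time (assumed a'_uvij / e_p2p)."""
    return a_prime_uvij / e_p2p
```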
In this embodiment, the storage overhead of the large model on the current system includes:
Estimating, with the cost model, the memory overhead M_u at batch size bs of all candidate data parallel, tensor parallel and optimizer parallel strategies of each layer u (u ∈ V) of the BERT-Huge model G(V, E). In this embodiment, let the parameter count of layer u be C_u, the data parallelism/optimizer parallelism be dp_ui, and the tensor parallelism be tp_ui; when training with the Adam optimizer, the memory overhead M_ui of layer u on one GPU is then determined by C_u, dp_ui and tp_ui.
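A per-layer memory estimate in the spirit of the step above can be sketched as follows. The patent's exact formula is not reproduced in the text; the constants below follow the common mixed-precision Adam accounting (2 bytes of weights + 2 bytes of gradients + 12 bytes of optimizer state per parameter) and are assumptions, as is the way the parallel degrees shard each term:

```python
# Hedged sketch of a per-layer memory estimate for Adam training. The byte
# counts and the sharding scheme (weights/grads divided by tensor parallelism
# tp_ui, optimizer state additionally divided by optimizer parallelism dp_ui)
# are assumptions, not the patent's formula.

def layer_memory_bytes(c_u: int, dp_ui: int, tp_ui: int) -> float:
    weights_and_grads = 4.0 * c_u / tp_ui            # fp16 weights + fp16 grads
    optimizer_state = 12.0 * c_u / (dp_ui * tp_ui)   # Adam moments + fp32 copy
    return weights_and_grads + optimizer_state
```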
Step S4, taking the time spent by each iteration of the pipeline parallel training as an optimization target, establishing a mixed integer quadratic programming mathematical model for automatically searching a parallel training strategy of a large model, and solving:
In this embodiment, establishing and solving the mixed integer quadratic programming mathematical model for automatically searching the parallel training strategy of the large model, taking the time spent per iteration of pipeline parallel training as the optimization target, includes:
Step S4.1, enumerating legal pipeline stage numbers deg and legal micro-batch numbers c, and selecting the minimum time TPI spent per iteration together with one or more of the corresponding pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies;
step S4.2, based on a non-pipeline parallel strategy (deg = 1), establishing and solving a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of data parallel, tensor parallel and optimizer parallel training as the optimization target, wherein the searched content is one or more of the data parallel, tensor parallel and optimizer parallel strategies;
step S4.3, based on a given pipeline parallel strategy (deg > 1), establishing and solving a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of pipeline parallel, data parallel, tensor parallel and optimizer parallel training as the optimization target, wherein the searched content is one or more of the pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies.
For ease of understanding, Fig. 3 shows the execution flow of step S4.1 in the present embodiment. In this embodiment, the pipeline is a GPipe-style synchronous pipeline.
Step S4.1.1, input the sum A of the computation and communication time of each layer of the large model and its memory overhead M, the same-pipeline-stage communication time R and the cross-pipeline-stage P2P communication time R' of adjacent layers, the GPU number n, and the mini-batch size B.
Step S4.1.2, let the pipeline stage number deg = 1, the minimum execution time per iteration minTPI = ∞, the corresponding minimum pipeline stage number minDeg = -1 and micro-batch number minC = -1, and initialize the placement strategy minP and the parallel strategy minS of each layer of the model to empty.
Step S4.1.3, judging whether the pipeline stage number deg is not larger than the GPU number n. If the judgment in step S4.1.3 holds, the outer loop is entered (the entry is step S4.1.4).
Step S4.1.4, judging whether the pipeline stage number deg is 1. If so, step S4.2 is entered, and a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of data parallel, tensor parallel and optimizer parallel training as the optimization target is established and solved. After solving, step S4.1.5 is entered to judge whether the optimization target TPI is less than minTPI; if the judgment in step S4.1.5 holds, step S4.1.6 is executed, letting minTPI = TPI, minDeg = deg, minC = c, minP = P and minS = S. Finally, step S4.1.7 is executed, letting deg = deg × 2, and the outer-loop condition of step S4.1.3 is judged again.
If the judgment in step S4.1.4 does not hold, step S4.1.8 is executed to set the micro-batch number c to 2, and step S4.1.9 judges whether the micro-batch number c does not exceed the mini-batch size B. If this does not hold, the process proceeds to step S4.1.7, letting deg = deg × 2; if it holds, the inner loop is entered. In the inner loop, step S4.1.10 first judges whether B is not divisible by c. If so, step S4.1.14 (c = c + 1) is executed and the inner-loop condition of step S4.1.9 is judged again; if the judgment in step S4.1.10 does not hold, step S4.1.11 sets the micro-batch size b = B / c, and then step S4.3 is executed, establishing and solving a mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of pipeline parallel, data parallel, tensor parallel and optimizer parallel training as the optimization target. After obtaining the solution, step S4.1.12 judges whether the optimization target TPI is less than minTPI; if so, step S4.1.13 is executed, letting minTPI = TPI, minDeg = deg, minC = c, minP = P and minS = S. The process then executes step S4.1.14 (c = c + 1) and continues judging the inner-loop condition of step S4.1.9.
If the judgment in step S4.1.3 does not hold, the process proceeds to step S4.1.15, outputs the minimum execution time per iteration minTPI together with the corresponding minimum pipeline stage number minDeg, micro-batch number minC, placement strategy minP of each layer of the model, and parallel strategy minS of each layer of the model, and ends.
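The enumeration flow of step S4.1 described above can be sketched as a loop with a pluggable solver standing in for the MIQP of steps S4.2/S4.3 (the `solve` interface is hypothetical, introduced only for illustration):

```python
# Sketch of the step-S4.1 enumeration (Fig. 3): pipeline stage counts deg are
# doubled up to the GPU count n; micro-batch counts c must divide the
# mini-batch size B; solve(deg, c) is a hypothetical stand-in that returns
# (TPI, placement P, parallel strategies S) for one legal configuration.

def search(n, B, solve):
    """Return (minTPI, minDeg, minC, minP, minS) over all legal (deg, c)."""
    best = (float("inf"), -1, -1, None, None)
    deg = 1
    while deg <= n:                                   # outer loop (S4.1.3)
        if deg == 1:
            tpi, P, S = solve(deg, 1)                 # step S4.2, no pipeline
            if tpi < best[0]:
                best = (tpi, deg, 1, P, S)
        else:
            for c in range(2, B + 1):                 # inner loop (S4.1.9)
                if B % c != 0:                        # S4.1.10: c must divide B
                    continue
                tpi, P, S = solve(deg, c)             # step S4.3, b = B // c
                if tpi < best[0]:
                    best = (tpi, deg, c, P, S)
        deg *= 2                                      # S4.1.7
    return best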
For step S4.2 of this embodiment, to aid understanding of the mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of data parallel, tensor parallel and optimizer parallel training as the optimization target, an example model is given below:
The objective function, minimizing the time TPI spent per iteration, can be expressed as equation (1):
min TPI = p_1 (1)
In equation (1), TPI is the time spent per iteration, so the objective is equivalent to maximizing the number of iterations per unit time, i.e., maximizing throughput. p_i denotes the total computation and communication time of the i-th stage of the pipeline. Since step S4.2 does not consider pipeline parallelism, one or more combinations of data parallelism, tensor parallelism and optimizer parallelism can be regarded as executing within a single pipeline stage. Without loss of generality, data parallelism, tensor parallelism and optimizer parallelism may be taken to execute in stage 1 of the pipeline, whose total computation and communication time is denoted p_1.
For a better understanding, the constraints are presented below in conjunction with the formulas:
Constraint 1: the computation time of stage 1 of the pipeline includes the sum of the computation and communication time of all layers u of BERT-Huge, plus the sum of the same-pipeline-stage communication costs between layers u and v corresponding to all edges <u, v> (<u, v> ∈ E) of BERT-Huge. Constraint 1 can be expressed as equation (2):
p_1 = Σ_{u∈V} Σ_k A_uk · S_uk + Σ_{<u,v>∈E} Σ_k Σ_l R_uvkl · S_uk · S_vl (2)
where S_uk is a 0-1 variable indicating whether layer u selects its k-th candidate data parallel, tensor parallel and optimizer parallel strategy.
Constraint 2: the storage overhead of stage 1 of the pipeline, i.e. the sum of the memory overheads of the selected candidate data parallel, tensor parallel and optimizer parallel strategies over all layers u (u ∈ V) of BERT-Huge, must not exceed the memory budget. Constraint 2 can be expressed as equation (3):
Σ_{u∈V} Σ_k M_uk · S_uk ≤ m (3)
In this embodiment, m is the remaining memory capacity, i.e. the memory capacity of GPU0 minus the additional memory m_other required to load the BERT-Huge model G(V, E). Since Server0 and Server1 are isomorphic, m is the same constant for all GPUs.
Constraint 3: every layer u (u ∈ V) of BERT-Huge must select exactly one candidate data parallel, tensor parallel and optimizer parallel strategy. Constraint 3 can be expressed as equations (4) and (5):
Σ_{k=1}^{|g_u|} S_uk = 1, ∀u ∈ V (4)
S_uk ∈ {0, 1}, ∀u ∈ V, 1 ≤ k ≤ |g_u| (5)
where g_u denotes the set of candidate data parallel, tensor parallel and optimizer parallel strategies of layer u of BERT-Huge, and |g_u| denotes the cardinality of g_u.
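For tiny instances, the step-S4.2 model can be illustrated by an exhaustive stand-in that enforces constraints (3), (4) and (5) and minimizes the stage-1 time of constraint (2). This brute-force sketch is only feasible for toy inputs; the patent solves the real problem with an MIQP solver:

```python
# Illustrative brute-force stand-in for the MIQP of step S4.2: exactly one
# candidate strategy per layer, total memory within the per-GPU budget m,
# minimizing per-layer time plus adjacent-layer communication time.

from itertools import product

def solve_stage(A, R, M, m):
    """A[u][k]: time of layer u under strategy k; M[u][k]: its memory;
    R[(u, v)][k][l]: comm time between adjacent layers u, v under (k, l)."""
    n_layers = len(A)
    best_tpi, best_choice = float("inf"), None
    for choice in product(*[range(len(a)) for a in A]):   # one k per layer
        if sum(M[u][choice[u]] for u in range(n_layers)) > m:
            continue                                       # memory budget violated
        tpi = sum(A[u][choice[u]] for u in range(n_layers))
        tpi += sum(R[(u, v)][choice[u]][choice[v]] for (u, v) in R)
        if tpi < best_tpi:
            best_tpi, best_choice = tpi, choice
    return best_tpi, best_choice
```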
For step S4.3 of this embodiment, to aid understanding of the mixed integer quadratic programming mathematical model taking the time TPI spent per iteration of pipeline parallel, data parallel, tensor parallel and optimizer parallel training as the optimization target, an example model is given below:
The objective function, minimizing the time TPI spent per iteration, can be expressed as equation (6):
min TPI = Σ_{i=1}^{ps} p_i + Σ_{j=1}^{os} o_j + max{p_1, …, p_ps} · (c − 1) (6)
In equation (6), p_i is the total computation and communication time of the i-th stage of the pipeline, and ps is the total number of pipeline stages, with ps = deg. o_j is the total communication time of the j-th cross-pipeline-stage communication phase, and os is the total number of such phases, with os = ps − 1. c is the micro-batch number used during pipeline computation. Σ_{i=1}^{ps} p_i represents the sum of the computation and communication time of all pipeline stages, Σ_{j=1}^{os} o_j represents the sum of the cross-pipeline-stage P2P communication times, and max{p_1, …, p_ps} · (c − 1) models the effect of the slowest stage and the micro-batch number on the overall pipeline computation time.
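The objective described above can be written directly as a small function, following the GPipe-style fill/drain accounting of the text (sum of all stage times, plus cross-stage P2P times, plus the slowest stage repeated for the remaining c − 1 micro-batches):

```python
# Sketch of the equation-(6) objective: per-iteration time of a GPipe-style
# synchronous pipeline with c micro-batches.

def tpi(p, o, c):
    """p: per-stage compute+comm times p_i; o: cross-stage P2P times o_j;
    c: micro-batch number."""
    return sum(p) + sum(o) + max(p) * (c - 1)
```

For example, two stages of 2 and 3 time units, one cross-stage hop of 1 unit, and c = 4 micro-batches give 2 + 3 + 1 + 3 × 3 = 15 units.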
For a better understanding, the following describes the various constraints in conjunction with the formulas:
Constraint 1: the computation and communication time of the i-th stage of the pipeline comprises (1) the sum of the computation and communication time of the BERT-Huge layers placed in the i-th stage and (2) the sum of the same-pipeline-stage communication costs between layers u and v, corresponding to all edges <u, v> (<u, v> ∈ E), that are both placed in the i-th stage. Constraint 1 can be expressed as equation (7):
p_i = Σ_{u∈V} Σ_k P_ui · A_uk · S_uk + Σ_{<u,v>∈E} Σ_k Σ_l P_ui · P_vi · R_uvkl · S_uk · S_vl (7)
where P_ui is a 0-1 variable indicating whether layer u is placed in the i-th stage of the pipeline, and S_uk is a 0-1 variable indicating whether layer u selects its k-th candidate data parallel, tensor parallel and optimizer parallel strategy.
Constraint 2: the communication time of the j-th cross-pipeline-stage communication phase of the pipeline. Constraint 2 can be expressed as equation (8):
o_j = Σ_{<u,v>∈E} Σ_k Σ_l P_uj · P_v(j+1) · R'_uvkl · S_uk · S_vl (8)
Constraint 3: the storage overhead of the i-th stage of the pipeline, i.e. the sum of the memory overheads of the selected candidate data parallel, tensor parallel and optimizer parallel strategies over all BERT-Huge layers u (u ∈ V) placed in the i-th stage, must not exceed the memory budget. Constraint 3 can be expressed as equation (9):
Σ_{u∈V} Σ_k P_ui · M_uk · S_uk ≤ m (9)
In this embodiment, m is the remaining memory capacity, i.e. the memory capacity of GPU0 minus the additional memory m_other required to load the BERT-Huge model G(V, E). Since Server0 and Server1 are isomorphic, m is the same constant for all GPUs.
Constraint 4: each pipeline stage must contain a contiguous portion of the BERT-Huge model. Constraint 4 can be expressed as equations (10), (11) and (12):
where Z_vi is a 0-1 variable indicating whether there exists a layer w after layer v such that some predecessor layer u of w, together with layer v, is placed on the i-th pipeline stage: Z_vi = 1 if such a layer w exists, and 0 otherwise. In this embodiment, equations (10), (11) and (12) jointly constrain the layers of the BERT-Huge model to be placed on pipeline stages in order rather than out of order.
Constraint 5: every layer u (u ∈ V) of BERT-Huge must be placed on exactly one pipeline stage, and every pipeline stage must contain at least one layer. Constraint 5 can be expressed as equations (13), (14) and (15):
Σ_{i=1}^{ps} P_ui = 1, ∀u ∈ V (13)
Σ_{u∈V} P_ui ≥ 1, ∀1 ≤ i ≤ ps (14)
P_ui ∈ {0, 1}, ∀u ∈ V, 1 ≤ i ≤ ps (15)
Constraint 6: every layer u (u ∈ V) of BERT-Huge must select exactly one candidate data parallel, tensor parallel and optimizer parallel strategy. Constraint 6 can be expressed as equations (4) and (5).
Finally, based on the above solution results of the mixed integer quadratic programming model, the optimal parallel training strategy of the large model is output (step S5):
In this embodiment, the parallel training strategy for outputting the optimal large model specifically includes:
Based on the output of step S4.1.15 in step S4, one or more combinations of the optimal pipeline parallel, data parallel, tensor parallel and optimizer parallel strategies of the BERT-Huge model on the current system are output.
More specifically, minTPI denotes the minimum execution time per iteration, and minDeg, minC, minP and minS denote the pipeline stage number, micro-batch number, per-layer placement strategy and per-layer parallel strategy corresponding to minTPI. The final pipeline parallel strategy sets the pipeline stage number to minDeg and splits each mini-batch into minC micro-batches during pipeline computation. Each layer u of the model is placed on the pipeline stage assigned to it in minP, and its data parallel, tensor parallel and optimizer parallel strategies are those given for it in minS. Furthermore, for the mapping between the parallel computing strategy and the system, the GPU with global number l is responsible for computing the (⌊l × minDeg / n⌋ + 1)-th stage of the pipeline. For example, assuming minDeg = 4, GPU3 of Server0, with global number 0×4+3 = 3, is responsible for computing the 2nd stage of the pipeline, and GPU0 of Server1, with global number 1×4+0 = 4, is responsible for computing the 3rd stage.
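The GPU-to-stage mapping can be sketched as below. The assumption made here (the mapping formula itself is not reproduced in the text) is that the n GPUs are split evenly into minDeg contiguous groups in global-number order:

```python
# Hedged sketch of mapping a GPU's global number to a pipeline stage.
# Assumption: n GPUs are split evenly into min_deg contiguous groups in
# global-number order; stages are 1-indexed.

def stage_of_gpu(global_id: int, n_gpus: int, min_deg: int) -> int:
    return global_id * min_deg // n_gpus + 1

# With min_deg = 4 and n_gpus = 8: GPU 3 -> stage 2, GPU 4 -> stage 3.
```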
The method of the present invention has been tested on a number of models and systems. During the experiments, the search time of the parallel strategy search method for efficient training of large artificial-intelligence models was measured on Server0. After the parallel training strategy was obtained by the search, the throughput of that strategy on the current system was also measured on Server0. Experimental results on the BERT-Huge, T5-Large, ViT-Huge and Swin-Huge models show that, compared with the best existing method, the method can accelerate strategy search by up to about 16× and improve training throughput by up to about 1.7×.