
CN116680301B - A parallel strategy search method for efficient training of large artificial intelligence models - Google Patents

A parallel strategy search method for efficient training of large artificial intelligence models

Info

Publication number
CN116680301B
CN116680301B (application CN202310781759.7A)
Authority
CN
China
Prior art keywords
parallel
strategy
model
parallelism
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310781759.7A
Other languages
Chinese (zh)
Other versions
CN116680301A (en)
Inventor
李武军
林昊
吴轲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202310781759.7A priority Critical patent/CN116680301B/en
Publication of CN116680301A publication Critical patent/CN116680301A/en
Application granted granted Critical
Publication of CN116680301B publication Critical patent/CN116680301B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24532Query optimisation of parallel queries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract


The present invention discloses a parallel strategy search method for efficient training of large artificial intelligence models. The method comprises the following steps: inputting a large model; obtaining system execution performance information and the large model's execution performance information; using a cost model to estimate the execution time and storage overhead of the large model on the current system; establishing and solving a mixed integer quadratic programming mathematical model for automatically searching for parallel training strategies for the large model, taking the time spent on each iteration of pipeline parallel training as the optimization target; and outputting the optimal parallel training strategy for the large model. The method supports pipeline parallelism, data parallelism, tensor parallelism, and optimizer parallelism, has a large strategy space, and shortens search time. It can be applied to single-machine multi-card parallel training of large models and distributed training of multi-machine clusters, thereby improving the training efficiency of large models.

Description

Parallel strategy search method for efficient training of large artificial intelligence models
Technical Field
The invention relates to artificial intelligence technology, and in particular to a parallel strategy search method for efficient training of large artificial intelligence models (hereinafter referred to as large models).
Background
In recent years, the parameter counts of large artificial intelligence models have grown rapidly, and such models achieve the best results on most tasks in fields such as computer vision and natural language processing. By contrast, computing hardware evolves much more slowly, so large models have to be trained in parallel on a single machine with multiple GPU cards or on a cluster.
Currently, parallel training of large models mainly uses pipeline parallelism, data parallelism, tensor parallelism, and optimizer parallelism. Pipeline parallelism cuts the large model between its layers and places several adjacent layers on the same pipeline stage; during computation, it splits one mini-batch into multiple micro-batches and organizes them into a pipeline, which improves overall training efficiency. Data parallelism, tensor parallelism, and optimizer parallelism parallelize the computation within each layer of the large model, improving intra-layer computation efficiency. Tensor parallelism and optimizer parallelism partition the model's tensors and the optimizer states, respectively; although their memory cost is lower, their communication volume is larger than that of data parallelism. Each of these parallel strategies has advantages and disadvantages for different models and in different computing environments. How to combine them to maximize training efficiency for a particular model in a given computing environment is therefore an important research direction in parallel training of large models.
However, because the space of parallel strategies is large, manual combination and evaluation is costly. Moreover, choosing a suitable combination for a given model and computing environment requires expert-level knowledge, which makes manual tuning hard to apply in real scenarios. Researchers have therefore proposed methods for automatically searching for parallel training strategies of large models. Until now, however, existing methods either cover only one or a few of the four parallel strategies described above, or treat the four strategies hierarchically and solve them layer by layer. Their strategy spaces are limited, so they tend to find locally optimal rather than globally optimal parallel training strategies. Since training a large model with a larger parameter scale usually takes weeks to months and consumes large amounts of energy, automatically searching for a globally optimal parallel training strategy to accelerate training as much as possible has become an important problem.
Disclosure of Invention
The invention aims to provide a parallel strategy search method for efficient training of large artificial intelligence models based on mixed integer quadratic programming, so as to solve the local-optimality problem of the parallel strategies found by existing methods, further improve the training efficiency of large models, and reduce energy consumption.
The technical solution is as follows. The parallel strategy search method for efficient training of large artificial intelligence models comprises the following steps:
S1, inputting the large model;
S2, acquiring the system execution performance information and the execution performance information of the large model:
setting the All-Reduce traffic a_ar and measuring the All-Reduce communication time t_ar, so as to obtain the All-Reduce communication efficiency e_ar of the system;
setting the P2P traffic a_p2p and measuring the P2P communication time t_p2p, so as to obtain the P2P communication efficiency e_p2p of the system;
setting the All-Reduce traffic a_ar and measuring the All-Reduce communication time t_n when All-Reduce runs alone and the All-Reduce communication time t_o when All-Reduce runs concurrently with matrix-multiplication computation, so as to obtain the computation-communication overlap factor e_o of the system;
measuring the forward execution time fp_u of each layer u (u ∈ V) of the large model G(V, E);
measuring the activation memory of each layer of the large model G(V, E), excluding the memory occupied by the model and the optimizer, and the additional memory m_other required to load the large model G(V, E).
S3, estimating the execution time overhead and the storage overhead of the large model on the current system by using the cost model;
The execution time overhead described in step S3 includes one or more of the sum of the computation and communication time of all candidate strategies of each layer of the large model, the same-pipeline-stage communication time of all candidate strategy pairs of adjacent layers, and the cross-pipeline-stage P2P communication time.
The execution time overhead described in step S3 specifically includes the following steps:
estimating, with the cost model, the sum A_u of the computation and communication time of all candidate data-parallel, tensor-parallel and optimizer-parallel strategies of layer u (u ∈ V) of the large model G(V, E);
estimating, with the cost model, the same-pipeline-stage communication time R_uv between all candidate data-parallel, tensor-parallel and optimizer-parallel strategy pairs of layers u and v corresponding to edge <u, v> (<u, v> ∈ E) of the large model G(V, E);
estimating, with the cost model, the cross-pipeline-stage P2P communication time R'_uv between all candidate data-parallel, tensor-parallel and optimizer-parallel strategy pairs of layers u and v corresponding to edge <u, v> (<u, v> ∈ E) of the large model G(V, E).
The storage overhead described in step S3 includes the memory overhead of all candidate strategies of each layer of the large model.
The storage overhead described in step S3 specifically includes the following step: estimating, with the cost model, the memory overhead M_u of all candidate data-parallel, tensor-parallel and optimizer-parallel strategies of layer u (u ∈ V) of the large model G(V, E).
S4, taking the time spent on each iteration of pipeline-parallel training as the optimization target, establishing and solving a mixed integer quadratic programming mathematical model for automatically searching the parallel training strategy of the large model;
Step S4.1, enumerating all legal pipeline stage numbers deg and legal micro-batch numbers c, and selecting the minimum time TPI spent on each iteration together with one or more of the corresponding pipeline-parallel strategy, data-parallel strategy, tensor-parallel strategy and optimizer-parallel strategy;
Step S4.2, based on a non-pipeline-parallel strategy (deg = 1), establishing and solving a mixed integer quadratic programming mathematical model that takes the time TPI spent on each iteration of data-parallel, tensor-parallel and optimizer-parallel training as the optimization target, the searched content being one or more of the data-parallel strategy, the tensor-parallel strategy and the optimizer-parallel strategy;
Step S4.3, based on a given pipeline-parallel strategy (deg > 1), establishing and solving a mixed integer quadratic programming mathematical model that takes the time TPI spent on each iteration of pipeline-parallel, data-parallel, tensor-parallel and optimizer-parallel training as the optimization target, the searched content being one or more of the pipeline-parallel strategy, the data-parallel strategy, the tensor-parallel strategy and the optimizer-parallel strategy.
S5, outputting the optimal parallel training strategy of the large model.
Step S5 specifically comprises: based on each solution result of the mixed integer quadratic programming mathematical model, outputting one or more combinations of the optimal pipeline-parallel strategy, data-parallel strategy, tensor-parallel strategy and optimizer-parallel strategy of the large model under the current system.
A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the parallel strategy search method for efficient training of large artificial intelligence models described above.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the parallel strategy search method for efficient training of large artificial intelligence models described above.
Compared with the prior art, the invention has the following advantages:
1. The invention supports four parallel training strategies, namely pipeline parallelism, data parallelism, tensor parallelism and optimizer parallelism, and therefore has a large strategy space;
2. The invention uses mixed integer quadratic programming, so the search time is short;
3. The invention can be applied to single-machine multi-card parallel training of large models and to distributed parallel training on multi-machine clusters, which helps to improve the training efficiency of large models, thereby reducing training time and energy consumption.
Drawings
FIG. 1 is a flow chart of the steps of the method of the present invention;
FIG. 2 is a schematic diagram of a system environment according to an embodiment of the present invention;
Fig. 3 is a flowchart of step S4.1 according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
The parallel strategy search method for efficient training of large artificial intelligence models provided by the invention supports pipeline-parallel, data-parallel, tensor-parallel and optimizer-parallel training of various large models, and can be applied to single-machine multi-card parallel training and multi-machine-cluster distributed parallel training of large models.
In this embodiment, two servers, Server0 and Server1, are used, each with four GPU cards numbered GPU0, GPU1, GPU2 and GPU3 in sequence. The two servers are homogeneous, i.e., they have, among other things, identical hardware environments, identical software environments, and identical network environments and protocols. In this embodiment, the global number of a GPU is defined as globalID = serverID × gpuPerServer + localID, where serverID is the server number, gpuPerServer is the number of GPUs per server, and localID is the local number of the current GPU. For example, GPU1 of Server1 has global number 1 × 4 + 1 = 5.
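As a small, self-contained illustration of this numbering scheme (the function name below is illustrative and not taken from the patent):

```python
# Global GPU numbering used in this embodiment: globalID = serverID * gpuPerServer + localID.
def global_id(server_id: int, gpus_per_server: int, local_id: int) -> int:
    return server_id * gpus_per_server + local_id

# GPU1 of Server1 -> 1 * 4 + 1 = 5; GPU3 of Server0 -> 0 * 4 + 3 = 3.
assert global_id(1, 4, 1) == 5
assert global_id(0, 4, 3) == 3
```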
To help understand the relationship between the two servers of this embodiment, Fig. 2 shows their topology. On each server, GPU0 and GPU1 are connected to CPU0 through a PCIe bus, and GPU2 and GPU3 are connected to CPU1 through a PCIe bus. CPU0 and CPU1 of each server are connected through a QPI bus, and the two servers are connected through Ethernet. In this embodiment, the parallel strategy search method for efficient training of large artificial intelligence models is applied to search for a parallel strategy for training a BERT-Huge model with the Adam optimizer in a distributed manner on Server0 and Server1. The specific workflow of the method of the invention in this embodiment is as follows:
The workflow of the parallel strategy search method for efficient training of large artificial intelligence models is shown in Fig. 1.
Step S1: input the large model.
In this embodiment, the large model is the BERT-Huge model.
Step S2: acquire the system execution performance information and the execution performance information of the large model.
In this embodiment, acquiring the system execution performance information and the execution performance information of the large model includes the following:
setting the All-Reduce traffic a_ar and measuring the All-Reduce communication time t_ar, so as to obtain the All-Reduce communication efficiency e_ar of the system;
setting the P2P traffic a_p2p and measuring the P2P communication time t_p2p, so as to obtain the P2P communication efficiency e_p2p of the system;
setting the All-Reduce traffic a_ar and measuring the All-Reduce communication time t_n when All-Reduce is executed alone and the All-Reduce communication time t_o when All-Reduce and matrix-multiplication computation are executed simultaneously, so as to obtain the computation-communication overlap factor e_o of the system;
on Server0, using GPU0 with the batch size set to 1, measuring the forward execution time fp_u of each layer u (u ∈ V) of the BERT-Huge model G(V, E);
on Server0, using GPU0 with the batch size set to 1, measuring the activation memory of each layer of the BERT-Huge model G(V, E), excluding the memory occupied by the BERT-Huge model and the Adam optimizer, and the additional memory m_other required to load the BERT-Huge model G(V, E).
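For illustration only, the sketch below shows one way such measurements could be collected with PyTorch. It assumes an already-initialized NCCL process group and a single profiling GPU, and it defines the communication efficiency simply as traffic divided by measured time; the helper names (profile_all_reduce, profile_forward_time, profile_activation_memory) are ours, not the patent's.

```python
# Illustrative profiling helpers for step S2 (assumptions: PyTorch, NCCL process group
# already initialized, measurements taken on GPU0 of Server0 with batch size 1).
import time
import torch
import torch.distributed as dist

def profile_all_reduce(a_ar_bytes: int, warmup: int = 5, iters: int = 20) -> float:
    """Measure t_ar for a fixed All-Reduce traffic a_ar and return e_ar = a_ar / t_ar."""
    buf = torch.ones(a_ar_bytes // 4, device="cuda")   # float32 elements
    for _ in range(warmup):                             # warm up NCCL and CUDA kernels
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    t_ar = (time.perf_counter() - start) / iters
    return a_ar_bytes / t_ar

def profile_forward_time(layer: torch.nn.Module, sample: torch.Tensor, iters: int = 20) -> float:
    """Measure fp_u: average forward execution time of one layer at batch size 1."""
    layer.cuda().eval()
    sample = sample.cuda()
    with torch.no_grad():
        layer(sample)                                   # warm-up pass
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            layer(sample)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def profile_activation_memory(layer: torch.nn.Module, sample: torch.Tensor) -> int:
    """Measure the activation memory (bytes) of one layer, excluding its parameters."""
    layer.cuda()
    sample = sample.cuda()
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.memory_allocated()              # parameters already resident
    out = layer(sample)                                 # keep the graph so activations stay alive
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() - before
```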
Step S3: estimate the execution time overhead and the storage overhead of the large model on the current system using the cost model.
In this embodiment, the execution time overhead of the large model on the current system includes:
Using the cost model, the sum A_u of the computation and communication time, at batch size bs, of all candidate data-parallel, tensor-parallel and optimizer-parallel strategies of layer u (u ∈ V) of the BERT-Huge model G(V, E) is estimated. In this embodiment, for a candidate strategy S_ui of layer u of BERT-Huge, let its data/optimizer parallelism be dp_ui, its tensor parallelism be tp_ui, its data-parallel/optimizer-parallel traffic be a_dp, and its forward and backward tensor-parallel traffic be a_tp. When strategy S_ui is applied, the sum of the forward computation and communication time of layer u is denoted FP_ui, and the backward computation and communication time BP_ui consists of the backward computation time that overlaps with data-parallel/optimizer-parallel communication, the backward computation and data-parallel/optimizer-parallel communication time that does not overlap, and the backward tensor-parallel communication time. Thus, for candidate strategy S_ui of layer u of BERT-Huge at batch size bs, the sum of the computation and communication time of layer u is A_ui = FP_ui + BP_ui;
Using the cost model, the same-pipeline-stage communication time R_uv at batch size bs between all candidate data-parallel, tensor-parallel and optimizer-parallel strategy pairs of layers u and v corresponding to edge <u, v> (<u, v> ∈ E) of the BERT-Huge model G(V, E) is estimated. In this embodiment, let the candidate strategy of layer u be S_ui, the candidate strategy of layer v be S_vj (<u, v> ∈ E), and the same-pipeline-stage traffic between layer u and layer v be a_uvij; the corresponding same-pipeline-stage communication time R_uvij is then obtained from a_uvij and the measured communication efficiency.
Using the cost model, the cross-pipeline-stage P2P communication time R'_uv at batch size bs between all candidate data-parallel, tensor-parallel and optimizer-parallel strategy pairs of layers u and v corresponding to edge <u, v> (<u, v> ∈ E) of the BERT-Huge model G(V, E) is estimated. In this embodiment, let the candidate strategy of layer u be S_ui, the candidate strategy of layer v be S_vj (<u, v> ∈ E), and the cross-pipeline-stage P2P traffic between layer u and layer v be a'_uvij; the corresponding cross-pipeline-stage P2P communication time R'_uvij is then obtained from a'_uvij and the measured P2P communication efficiency e_p2p.
In this embodiment, the storage overhead of the large model on the current system includes:
Using the cost model, the memory overhead M_u at batch size bs of all candidate data-parallel, tensor-parallel and optimizer-parallel strategies of layer u (u ∈ V) of the BERT-Huge model G(V, E) is estimated. In this embodiment, let the number of parameters of layer u be C_u, the data/optimizer parallelism be dp_ui, and the tensor parallelism be tp_ui; when training with the Adam optimizer, the per-GPU memory overhead of layer u is then estimated from C_u, dp_ui, tp_ui and the measured activation memory.
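For illustration, the sketch below shows one possible shape of such a cost model. The backward-to-forward compute ratio, the overlap handling via e_o, and the mixed-precision Adam byte counts (2-byte parameters and gradients, 12 bytes of optimizer state per parameter, with the optimizer state sharded across the data/optimizer-parallel group) are assumptions of this sketch, not formulas taken from the patent.

```python
# Illustrative cost-model sketch for one candidate strategy (dp_ui, tp_ui) of layer u.
# All modelling choices below are assumptions used for illustration only.

def layer_time(bs, fp_u, a_dp, a_tp, dp_ui, tp_ui, e_ar, e_o):
    """A_ui = FP_ui + BP_ui at batch size bs."""
    fwd_compute = bs * fp_u / (dp_ui * tp_ui)     # measured per-sample forward time, scaled
    FP_ui = fwd_compute + a_tp / e_ar             # plus forward tensor-parallel All-Reduce
    bwd_compute = 2.0 * fwd_compute               # assumption: backward ~ 2x forward compute
    dp_comm = a_dp / e_ar                         # data-/optimizer-parallel gradient All-Reduce
    hidden = e_o * min(bwd_compute, dp_comm)      # portion of dp_comm overlapped with backward
    BP_ui = bwd_compute + (dp_comm - hidden) + a_tp / e_ar
    return FP_ui + BP_ui

def layer_memory(C_u, act_u, bs, dp_ui, tp_ui):
    """M_ui in bytes at batch size bs, assuming mixed-precision Adam training."""
    params = 2 * C_u / tp_ui                      # fp16 parameters, split by tensor parallelism
    grads = 2 * C_u / tp_ui                       # fp16 gradients
    adam_state = 12 * C_u / (tp_ui * dp_ui)       # fp32 master weights + momentum + variance,
                                                  # further sharded by optimizer parallelism
    activations = act_u * bs / tp_ui              # measured per-sample activation memory
    return params + grads + adam_state + activations
```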
Step S4: taking the time spent on each iteration of pipeline-parallel training as the optimization target, establish and solve a mixed integer quadratic programming mathematical model for automatically searching the parallel training strategy of the large model:
In this embodiment, establishing and solving a mixed integer quadratic programming mathematical model for automatically searching the parallel training strategy of the large model, with the time spent on each iteration of pipeline-parallel training as the optimization target, includes:
Step S4.1, enumerating all legal pipeline stage numbers deg and legal micro-batch numbers c, and selecting the minimum time TPI spent on each iteration together with one or more of the corresponding pipeline-parallel strategy, data-parallel strategy, tensor-parallel strategy and optimizer-parallel strategy;
Step S4.2, based on a non-pipeline-parallel strategy (deg = 1), establishing and solving a mixed integer quadratic programming mathematical model that takes the time TPI spent on each iteration of data-parallel, tensor-parallel and optimizer-parallel training as the optimization target, the searched content being one or more of the data-parallel strategy, the tensor-parallel strategy and the optimizer-parallel strategy;
Step S4.3, based on a given pipeline-parallel strategy (deg > 1), establishing and solving a mixed integer quadratic programming mathematical model that takes the time TPI spent on each iteration of pipeline-parallel, data-parallel, tensor-parallel and optimizer-parallel training as the optimization target, the searched content being one or more of the pipeline-parallel strategy, the data-parallel strategy, the tensor-parallel strategy and the optimizer-parallel strategy.
For ease of understanding, Fig. 3 shows the execution flow of step S4.1 in this embodiment. In this embodiment, the pipeline is a GPipe-style synchronous pipeline.
Step S4.1.1: input the sum A of the computation and communication time of each layer of the large model and its memory overhead M, the same-pipeline-stage communication time R and the cross-pipeline-stage P2P communication time R' of adjacent layers, the number of GPUs n, and the mini-batch size B.
Step S4.1.2: let the pipeline stage count deg = 1, the minimum execution time per iteration minTPI = ∞, the corresponding minimum pipeline stage count minDeg = -1, the micro-batch count minC = -1, and leave the per-layer placement strategy minP and the per-layer parallel strategy minS empty.
Step S4.1.3: check whether the pipeline stage count deg is not larger than the number of GPUs n. If this holds, enter the outer loop (step S4.1.4).
Step S4.1.4: check whether the pipeline stage count deg equals 1. If so, go to step S4.2 and establish and solve the mixed integer quadratic programming mathematical model that takes the time TPI spent on each iteration of data-parallel, tensor-parallel and optimizer-parallel training as the optimization target. After solving, go to step S4.1.5 and check whether the optimization target TPI is less than minTPI; if so, execute step S4.1.6, setting minTPI = TPI, minDeg = deg, minC = c, minP = P and minS = S. Finally, execute step S4.1.7, setting deg = deg × 2, and return to the outer-loop condition in step S4.1.3.
If the check in step S4.1.4 fails, execute step S4.1.8, setting the micro-batch count c to 2, and then step S4.1.9, checking whether the micro-batch count c does not exceed the mini-batch size B. If this check fails, go to step S4.1.7 and set deg = deg × 2; if it holds, enter the inner loop. In the inner loop, step S4.1.10 first checks whether B is not divisible by c. If B is not divisible by c, execute step S4.1.14, setting c = c + 1, and return to the inner-loop condition in step S4.1.9. Otherwise execute step S4.1.11, setting the micro-batch size b = B / c, and then go to step S4.3 to establish and solve the mixed integer quadratic programming mathematical model that takes the time TPI spent on each iteration of pipeline-parallel, data-parallel, tensor-parallel and optimizer-parallel training as the optimization target. After obtaining the solution, execute step S4.1.12 and check whether the optimization target TPI is less than minTPI; if so, execute step S4.1.13, setting minTPI = TPI, minDeg = deg, minC = c, minP = P and minS = S. Then execute step S4.1.14, setting c = c + 1, and return to the inner-loop condition in step S4.1.9.
If the check in step S4.1.3 fails, go to step S4.1.15: output the minimum execution time per iteration minTPI and the corresponding minimum pipeline stage count minDeg, micro-batch count minC, per-layer placement strategy minP and per-layer parallel strategy minS, and terminate.
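The enumeration of steps S4.1.1 to S4.1.15 can be summarized by the following sketch. The two solver callbacks stand for the MIQP solves of steps S4.2 and S4.3; their names and signatures are illustrative assumptions.

```python
# Illustrative reimplementation of the enumeration in step S4.1. solve_flat and
# solve_pipeline stand for the MIQP models of steps S4.2 and S4.3 and are assumed
# to return (TPI, placement P, per-layer strategies S).
def search_parallel_strategy(A, M, R, R_p2p, n, B, solve_flat, solve_pipeline):
    min_tpi, min_deg, min_c = float("inf"), -1, -1
    min_p, min_s = None, None
    deg = 1
    while deg <= n:                                   # outer loop over pipeline stage counts
        if deg == 1:                                  # step S4.2: no pipeline parallelism
            tpi, p, s = solve_flat(A, M, R, n, B)
            if tpi < min_tpi:
                min_tpi, min_deg, min_c = tpi, deg, 1  # micro-batch count irrelevant here
                min_p, min_s = p, s
        else:                                         # step S4.3: deg pipeline stages
            c = 2
            while c <= B:                             # inner loop over micro-batch counts
                if B % c == 0:                        # only evenly divisible micro-batches
                    b = B // c                        # micro-batch size
                    tpi, p, s = solve_pipeline(A, M, R, R_p2p, n, deg, c, b)
                    if tpi < min_tpi:
                        min_tpi, min_deg, min_c = tpi, deg, c
                        min_p, min_s = p, s
                c += 1
        deg *= 2                                      # step S4.1.7: double the stage count
    return min_tpi, min_deg, min_c, min_p, min_s      # step S4.1.15
```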
In step S4.2 provided in this embodiment, to help understand the mixed integer quadratic programming mathematical model that takes the time TPI spent on each iteration of data-parallel, tensor-parallel and optimizer-parallel training as the optimization target, an example of such a model is given:
The objective function, minimizing the time TPI spent on each iteration, is expressed as equation (1):
min TPI = p_1    (1)
In equation (1), TPI is the time spent on each iteration, so this objective is equivalent to maximizing the number of iterations per unit time, i.e., to maximizing throughput. In equation (1), p_i denotes the total computation and communication time of the i-th stage of pipeline parallelism. Because step S4.2 does not need to consider pipeline parallelism, any combination of data parallelism, tensor parallelism and optimizer parallelism can be regarded as performing parallel computation within a single pipeline stage. Without loss of generality, data parallelism, tensor parallelism and optimizer parallelism are taken to execute in stage 1 of the pipeline, and the total computation and communication time of stage 1 is denoted p_1.
For a better understanding, the constraints are presented below in conjunction with the formulas:
Constraint 1: the computation time of stage 1 of the pipeline includes the sum of the computation and communication time of all layers u of BERT-Huge, plus the sum of the same-pipeline-stage communication costs between layers u and v corresponding to all edges <u, v> (<u, v> ∈ E) of BERT-Huge. Constraint 1 can be expressed as equation (2):
p_1 = Σ_{u∈V} Σ_{k∈g_u} A_uk · S_uk + Σ_{<u,v>∈E} Σ_{k∈g_u} Σ_{l∈g_v} R_uvkl · S_uk · S_vl    (2)
where S_uk is a 0-1 variable indicating whether layer u selects its k-th candidate data-parallel, tensor-parallel and optimizer-parallel strategy, A_uk is the computation and communication time of that candidate strategy, and R_uvkl is the same-pipeline-stage communication time between the k-th candidate strategy of layer u and the l-th candidate strategy of layer v.
Constraint 2: the storage overhead of stage 1 of the pipeline includes the sum of the memory overheads of the candidate data-parallel, tensor-parallel and optimizer-parallel strategies selected by all layers u (u ∈ V) of BERT-Huge. Constraint 2 can be expressed as equation (3):
Σ_{u∈V} Σ_{k∈g_u} M_uk · S_uk ≤ m    (3)
where M_uk is the memory overhead of the k-th candidate strategy of layer u. In this embodiment, m denotes the remaining memory capacity, i.e., the memory capacity of GPU0 minus the additional memory m_other required to load the BERT-Huge model G(V, E). Server0 and Server1 are homogeneous in this embodiment, so m is the same constant for all GPUs.
Constraint 3: every layer u (u ∈ V) of BERT-Huge can select one and only one candidate data-parallel, tensor-parallel and optimizer-parallel strategy. Constraint 3 can be expressed as equations (4) and (5):
Σ_{k=1}^{|g_u|} S_uk = 1, for all u ∈ V    (4)
S_uk ∈ {0, 1}, for all u ∈ V and k ∈ {1, …, |g_u|}    (5)
where g_u denotes the set of candidate data-parallel, tensor-parallel and optimizer-parallel strategies of layer u of BERT-Huge, and |g_u| denotes the cardinality of g_u.
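A minimal sketch of this model using the Gurobi Python API is given below. The dictionaries A, M and R are assumed to hold the cost-model outputs A_uk, M_uk and R_uvkl, and the formulation follows equations (2) to (5) in simplified form (the objective is written directly instead of through an auxiliary variable p_1); it is an illustrative sketch, not the patent's implementation.

```python
# Illustrative MIQP for step S4.2 (deg = 1), built with gurobipy (assumption: Gurobi
# is available). A[u][k], M[u][k]: time/memory of candidate k of layer u;
# R[(u, v)][k][l]: same-stage communication time for strategy pair (k, l) on edge <u, v>.
import gurobipy as gp
from gurobipy import GRB

def solve_flat_miqp(A, M, R, edges, mem_cap):
    model = gp.Model("flat_parallel_strategy")
    model.Params.NonConvex = 2                        # allow products of binary variables
    S = {(u, k): model.addVar(vtype=GRB.BINARY, name=f"S_{u}_{k}")
         for u in A for k in range(len(A[u]))}
    # Equations (4)-(5): each layer picks exactly one candidate strategy.
    for u in A:
        model.addConstr(gp.quicksum(S[u, k] for k in range(len(A[u]))) == 1)
    # Equation (3): total memory of the selected strategies fits on one GPU.
    model.addConstr(gp.quicksum(M[u][k] * S[u, k]
                                for u in A for k in range(len(A[u]))) <= mem_cap)
    # Equation (2): p_1 = per-layer time plus same-stage communication on every edge.
    p1 = gp.quicksum(A[u][k] * S[u, k] for u in A for k in range(len(A[u])))
    p1 += gp.quicksum(R[u, v][k][l] * S[u, k] * S[v, l]
                      for (u, v) in edges
                      for k in range(len(A[u])) for l in range(len(A[v])))
    model.setObjective(p1, GRB.MINIMIZE)              # equation (1): min TPI = p_1
    model.optimize()
    chosen = {u: k for (u, k), var in S.items() if var.X > 0.5}
    return model.ObjVal, chosen
```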
In step S4.3 provided in this embodiment, to help understand the mixed integer quadratic programming mathematical model that takes the time TPI spent on each iteration of pipeline-parallel, data-parallel, tensor-parallel and optimizer-parallel training as the optimization target, an example of such a model is given:
The objective function, minimizing the time TPI spent on each iteration, is expressed as equation (6):
min TPI = Σ_{i=1}^{ps} p_i + Σ_{j=1}^{os} o_j + max{p_1, …, p_ps} · (c − 1)    (6)
In equation (6), p_i is the total computation and communication time of the i-th stage of pipeline parallelism, and ps is the total number of pipeline stages, with ps = deg. o_j is the total communication time of the j-th cross-pipeline-stage communication phase, and os is the total number of cross-pipeline-stage communication phases in pipeline parallelism, with os = ps − 1. c is the number of micro-batches used during pipeline computation. The term Σ_{i=1}^{ps} p_i represents the sum of the computation and communication time over all pipeline stages, Σ_{j=1}^{os} o_j represents the sum of the P2P communication time across all pipeline stages, and max{p_1, …, p_ps} · (c − 1) models the effect of the slowest stage and the number of micro-batches on the overall pipeline computation time.
For a better understanding, the following describes the various constraints in conjunction with the formulas:
Constraint 1: the computation and communication time of the i-th stage of the pipeline includes (1) the sum of the computation and communication time of the BERT-Huge layers placed in the i-th stage and (2) the sum of the same-pipeline-stage communication costs between layers u and v, for all edges <u, v> (<u, v> ∈ E) whose endpoints are both placed in the i-th stage. Constraint 1 is expressed as equation (7),
where P_ui is a 0-1 variable indicating whether layer u is placed in the i-th stage of the pipeline, and S_uk is a 0-1 variable indicating whether layer u selects its k-th candidate data-parallel, tensor-parallel and optimizer-parallel strategy.
Constraint 2: the communication time of the j-th cross-pipeline-stage communication phase of pipeline parallelism. Constraint 2 is expressed as equation (8).
Constraint 3: the storage overhead of the i-th stage of the pipeline includes the sum of the memory overheads of the candidate data-parallel, tensor-parallel and optimizer-parallel strategies selected by the BERT-Huge layers u (u ∈ V) placed in the i-th stage. Constraint 3 is expressed as equation (9).
In this embodiment, m denotes the remaining memory capacity, i.e., the memory capacity of GPU0 minus the additional memory m_other required to load the BERT-Huge model G(V, E). Server0 and Server1 are homogeneous in this embodiment, so m is the same constant for all GPUs.
Constraint 4: each pipeline stage must contain a contiguous portion of the BERT-Huge model. Constraint 4 is expressed as equations (10), (11) and (12),
where Z_vi is a 0-1 variable indicating whether there exists a layer w after layer v such that some predecessor layer u of w is placed, together with layer v, on the i-th pipeline stage; Z_vi = 1 if such a layer w exists, and 0 otherwise. In this embodiment, equations (10), (11) and (12) together constrain the layers of the BERT-Huge model to be placed on the pipeline stages in order rather than out of order.
Constraint 5: every layer u (u ∈ V) of BERT-Huge can be and must be placed on exactly one pipeline stage, and every pipeline stage must contain at least one layer. Constraint 5 is expressed as equations (13), (14) and (15).
Constraint 6: every layer u (u ∈ V) of BERT-Huge can select one and only one candidate data-parallel, tensor-parallel and optimizer-parallel strategy. Constraint 6 is expressed as equations (4) and (5).
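To indicate how the model of step S4.3 extends the previous one, the sketch below adds the placement variables P_ui and the assignment constraints of constraints 5 and 6 (equations (13) to (15) and (4) to (5)). The stage-time, cross-stage-communication, per-stage-memory and contiguity constraints (equations (7) to (12)) are only marked with comments, since their exact form is not reproduced here; the function and variable names are illustrative.

```python
# Illustrative skeleton of the step S4.3 MIQP: strategy-selection variables S_uk as
# before, plus placement variables P_ui assigning each layer to one of `deg` stages.
import gurobipy as gp
from gurobipy import GRB

def build_pipeline_miqp(layers, candidates, deg):
    model = gp.Model("pipeline_parallel_strategy")
    model.Params.NonConvex = 2
    S = {(u, k): model.addVar(vtype=GRB.BINARY, name=f"S_{u}_{k}")
         for u in layers for k in range(candidates[u])}
    P = {(u, i): model.addVar(vtype=GRB.BINARY, name=f"P_{u}_{i}")
         for u in layers for i in range(deg)}
    # Constraint 6 / equations (4)-(5): one candidate strategy per layer.
    for u in layers:
        model.addConstr(gp.quicksum(S[u, k] for k in range(candidates[u])) == 1)
    # Constraint 5 / equations (13)-(15): each layer on exactly one stage,
    # and every stage holds at least one layer.
    for u in layers:
        model.addConstr(gp.quicksum(P[u, i] for i in range(deg)) == 1)
    for i in range(deg):
        model.addConstr(gp.quicksum(P[u, i] for u in layers) >= 1)
    # Constraints 1-4 (equations (7)-(12)): stage times p_i, cross-stage P2P times o_j,
    # per-stage memory limits and the contiguous-placement constraints would be added
    # here; they are omitted from this sketch.
    return model, S, P
```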
Finally, based on the above solution results of the mixed integer quadratic programming models, the optimal parallel training strategy of the large model is output (step S5):
In this embodiment, outputting the optimal parallel training strategy of the large model specifically includes:
based on the output of step S4.1.15 in step S4, outputting one or more combinations of the optimal pipeline-parallel strategy, data-parallel strategy, tensor-parallel strategy and optimizer-parallel strategy of the BERT-Huge model under the current system.
More specifically, minTPI denotes the minimum execution time per iteration, minDeg the pipeline stage count corresponding to the current minTPI, minC the micro-batch count corresponding to the current minTPI, minP the per-layer placement strategy corresponding to the current minTPI, and minS the per-layer parallel strategy corresponding to the current minTPI. The final pipeline-parallel strategy sets the number of pipeline stages to minDeg and splits each mini-batch into minC micro-batches during pipeline computation. Each layer u of the model is placed on the pipeline stage assigned to it in minP, and the data-parallel, tensor-parallel and optimizer-parallel strategies of each layer u are those assigned to it in minS. Furthermore, for the mapping between the parallel computing strategy and the system, the GPU with global number l is responsible for computing the pipeline stage corresponding to l: with n GPUs and minDeg pipeline stages, every n / minDeg GPUs with consecutive global numbers jointly serve one stage. For example, assuming minDeg = 4, GPU3 of Server0 has global number 0 × 4 + 3 = 3, and GPU0 of Server1 has global number 1 × 4 + 0 = 4; the two GPUs fall into different groups and are therefore responsible for computing adjacent pipeline stages.
The method of the invention has been tested on a number of models and systems. During the experiments, the search time of the parallel strategy search method for efficient training of large artificial intelligence models was measured on server number 0. After the parallel training strategy found by the method was applied, the throughput of that strategy on the current system was also measured on server number 0. Experimental results on four models, BERT-Huge, T5-Large, ViT-Huge and Swin-Huge, show that compared with the best existing methods, the method improves the strategy search speed by up to about 16 times and the training throughput by up to about 1.7 times.

Claims (7)

1. A parallel strategy search method for efficient training of large artificial intelligence models, characterized by comprising the following steps:
Step S1: inputting the large artificial intelligence model;
Step S2: obtaining system execution performance information and execution performance information of the large artificial intelligence model;
Step S3: using a cost model to estimate the execution time overhead and storage overhead of the large artificial intelligence model on the current system;
Step S4: taking the time spent on each iteration of pipeline-parallel training as the optimization target, establishing and solving a mixed integer quadratic programming mathematical model for automatically searching the parallel training strategy of the large model;
Step S5: outputting the optimal parallel training strategy of the large artificial intelligence model;
wherein step S2 specifically comprises:
setting the All-Reduce traffic a_ar and measuring the All-Reduce communication time t_ar, so as to obtain the All-Reduce communication efficiency e_ar of the system;
setting the P2P traffic a_p2p and measuring the P2P communication time t_p2p, so as to obtain the P2P communication efficiency e_p2p of the system;
setting the All-Reduce traffic a_ar and measuring the All-Reduce communication time t_n when All-Reduce is executed alone and the All-Reduce communication time t_o when All-Reduce and matrix-multiplication computation are executed simultaneously, so as to obtain the computation-communication overlap factor e_o of the system;
measuring the forward execution time fp_u of each layer u of the large model G(V, E);
measuring the activation memory of each layer of the large model G(V, E), excluding the memory occupied by the model and the optimizer, and the additional memory m_other required to load the large model G(V, E);
wherein the execution time overhead in step S3 specifically comprises the following steps:
using the cost model to estimate the sum A_u of the computation and communication time of all candidate data-parallel, tensor-parallel and optimizer-parallel strategies of layer u of the large model G(V, E);
using the cost model to estimate the same-pipeline-stage communication time R_uv between all candidate data-parallel, tensor-parallel and optimizer-parallel strategy pairs of layers u and v corresponding to edge <u, v> of the large model G(V, E);
using the cost model to estimate the cross-pipeline-stage P2P communication time R'_uv between all candidate data-parallel, tensor-parallel and optimizer-parallel strategy pairs of layers u and v corresponding to edge <u, v> of the large model G(V, E);
wherein step S4 specifically comprises:
Step S4.1: enumerating all legal pipeline stage numbers deg and legal micro-batch numbers c, and selecting the minimum time TPI spent on each iteration together with one or more of the corresponding pipeline-parallel strategy, data-parallel strategy, tensor-parallel strategy and optimizer-parallel strategy;
Step S4.2: based on a non-pipeline-parallel strategy, establishing and solving a mixed integer quadratic programming mathematical model that takes the time TPI spent on each iteration of data-parallel, tensor-parallel and optimizer-parallel training as the optimization target, the searched content being one or more of the data-parallel strategy, the tensor-parallel strategy and the optimizer-parallel strategy;
Step S4.3: based on a given pipeline-parallel strategy, establishing and solving a mixed integer quadratic programming mathematical model that takes the time TPI spent on each iteration of pipeline-parallel, data-parallel, tensor-parallel and optimizer-parallel training as the optimization target, the searched content being one or more of the pipeline-parallel strategy, the data-parallel strategy, the tensor-parallel strategy and the optimizer-parallel strategy.
2. The parallel strategy search method for efficient training of large artificial intelligence models according to claim 1, characterized in that the execution time overhead in step S3 comprises one or more of the sum of the computation and communication time of all candidate strategies of each layer of the large model, the same-pipeline-stage communication time of all candidate strategy pairs of adjacent layers, and the cross-pipeline-stage P2P communication time.
3. The parallel strategy search method for efficient training of large artificial intelligence models according to claim 1, characterized in that the storage overhead in step S3 comprises the memory overhead of all candidate strategies of each layer of the large model.
4. The parallel strategy search method for efficient training of large artificial intelligence models according to claim 1, characterized in that the storage overhead in step S3 specifically comprises the following step: using the cost model to estimate the memory overhead M_u of all candidate data-parallel, tensor-parallel and optimizer-parallel strategies of layer u of the large model G(V, E).
5. The parallel strategy search method for efficient training of large artificial intelligence models according to claim 1, characterized in that step S5 specifically comprises: based on each solution result of the mixed integer quadratic programming mathematical model, outputting one or more combinations of the optimal pipeline-parallel strategy, data-parallel strategy, tensor-parallel strategy and optimizer-parallel strategy of the large model under the current system.
6. A computer storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the parallel strategy search method for efficient training of large artificial intelligence models according to any one of claims 1 to 5.
7. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the parallel strategy search method for efficient training of large artificial intelligence models according to any one of claims 1 to 5.
CN202310781759.7A 2023-06-29 2023-06-29 A parallel strategy search method for efficient training of large artificial intelligence models Active CN116680301B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310781759.7A CN116680301B (en) 2023-06-29 2023-06-29 A parallel strategy search method for efficient training of large artificial intelligence models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310781759.7A CN116680301B (en) 2023-06-29 2023-06-29 A parallel strategy search method for efficient training of large artificial intelligence models

Publications (2)

Publication Number Publication Date
CN116680301A CN116680301A (en) 2023-09-01
CN116680301B true CN116680301B (en) 2025-08-22

Family

ID=87787313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310781759.7A Active CN116680301B (en) 2023-06-29 2023-06-29 A parallel strategy search method for efficient training of large artificial intelligence models

Country Status (1)

Country Link
CN (1) CN116680301B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117453361B (en) * 2023-10-27 2025-02-25 清华大学 A data allocation optimization method and device based on calculation graph and recalculation
CN120087433A (en) * 2023-12-01 2025-06-03 华为技术有限公司 Parallel strategy optimization method and neural network solver training method and device
CN117892787A (en) * 2023-12-15 2024-04-16 上海人工智能创新中心 A large model automatic parallel method, device and storage medium
CN118172232A (en) * 2024-04-25 2024-06-11 北京壁仞科技开发有限公司 Artificial intelligence device, operation method thereof, and machine-readable storage medium
CN119127477B (en) * 2024-08-20 2025-08-12 中国科学院计算机网络信息中心 A method for generating large model parallel training strategies for domestic supercomputing systems
CN119440841B (en) * 2024-11-04 2025-12-05 南京大学 An Automatic Parallel Training Method for Deep Learning Models Applicable to Heterogeneous Clusters
CN119719708A (en) * 2024-12-06 2025-03-28 南京华清智言科技有限公司 A method for estimating the time required to train and infer large language models
CN120128489B (en) * 2025-05-07 2025-09-02 之江实验室 Network traffic modeling method and simulation system for large model training clusters
CN120179295B (en) * 2025-05-21 2025-07-22 上海壁仞科技股份有限公司 Parallel operation method of operator stream, computer device and readable storage medium
CN120996216A (en) * 2025-10-27 2025-11-21 中国科学技术大学 Reasoning methods, systems, devices, and media for dynamic routing hybrid expert models

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108734645A (en) * 2017-04-24 2018-11-02 英特尔公司 neural network optimization mechanism
CN110321218A (en) * 2019-04-10 2019-10-11 苏州卓晋通信有限公司 A method of based on point to point network system solution MIXED INTEGER program

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12293298B2 (en) * 2019-11-04 2025-05-06 Baidu Usa Llc Reducing training times of deep neural networks through efficient hybrid parallelism
CN112488868B (en) * 2020-11-27 2022-11-01 北京邮电大学 Surfactant oil displacement integrated scheduling optimization and control method based on closed-loop framework
CN114862656B (en) * 2022-05-18 2023-05-05 北京百度网讯科技有限公司 Multi-GPU-based acquisition method for training cost of distributed deep learning model
CN115794385A (en) * 2022-11-14 2023-03-14 南京大学 Container automatic arrangement method for deep learning model distributed training
CN115543639B (en) * 2022-12-01 2023-04-28 阿里云计算有限公司 Optimization method for performing deep learning tasks in distributed mode and distributed system
CN115879529A (en) * 2022-12-23 2023-03-31 上海交通大学 Method, medium and device for automatic parallel strategy search based on network-level simulation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108734645A (en) * 2017-04-24 2018-11-02 英特尔公司 neural network optimization mechanism
CN110321218A (en) * 2019-04-10 2019-10-11 苏州卓晋通信有限公司 A method of based on point to point network system solution MIXED INTEGER program

Also Published As

Publication number Publication date
CN116680301A (en) 2023-09-01

Similar Documents

Publication Publication Date Title
CN116680301B (en) A parallel strategy search method for efficient training of large artificial intelligence models
CN115293342B (en) Deep convolutional neural network parallel training method based on hybrid parallelism
CN117852627B (en) Pre-training model fine tuning method and system
CN113986890A (en) Joint hospital data migration method and system based on few-sample model learning
CN118113367B (en) Hypergraph partition-based method for unloading computing power network tasks
US11640531B2 (en) Method, apparatus and device for updating convolutional neural network using GPU cluster
CN115952856B (en) A pipelined parallel training method and system for neural networks based on bidirectional segmentation
CN111831354B (en) Data precision configuration method, device, chip array, equipment and medium
Zou et al. Efficient message passing algorithm and architecture co-design for graph neural networks
CN115345285B (en) GPU-based timing chart neural network training method and system and electronic equipment
CN113946424B (en) Software and hardware partitioning and task scheduling model and method based on graph convolutional network
CN120259388A (en) Image registration and segmentation joint optimization method, system, device and medium
CN117725963B (en) A method, system and device for Transformer model inference calculation
CN120469788A (en) Optimal parallel strategy for non-uniform heterogeneous chips and search method and device thereof
CN116582863B (en) Decentralized Distributed Network Optimization Training Methods, Apparatus, Equipment and Media
CN118981372A (en) Non-uniform memory access resource allocation method based on parallel adaptive auction algorithm
CN118095343A (en) Distributed graph neural network training method based on incremental aggregation strategy
CN110765654A (en) Parameter self-adaptive proxy method and device suitable for power market simulation
CN116757315A (en) A hydrogen energy storage system operation optimization method, device, medium and equipment
CN118747107B (en) Multi-thread partitioning method under storage and computing integrated structure
CN106951329A (en) A kind of extensive Method for HW/SW partitioning based on superseded particle cluster algorithm of climbing the mountain
CN119440841B (en) An Automatic Parallel Training Method for Deep Learning Models Applicable to Heterogeneous Clusters
CN119597487B (en) Dual mapping method and system for GPU parallel computing data
CN111582507A (en) Hardware system and training method of LS-SVM training machine based on SIMD architecture
Dreuning et al. CAPTURE: Memory-Centric Partitioning for Distributed DNN Training with Hybrid Parallelism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant