JP7256811B2

JP7256811B2 - Method and system for accelerating AI training using advanced interconnect technology

Info

Publication number: JP7256811B2
Application number: JP2020536955A
Authority: JP
Inventors: ジービャオジャオ; チエンオウヤン; ハーフェイジュー; チンシューチェン; ウェイチー
Original assignee: Baidu com Times Technology Beijing Co Ltd; Kunlunxin Technology Beijing Co Ltd; Baidu USA LLC
Current assignee: Baidu com Times Technology Beijing Co Ltd; Kunlunxin Technology Beijing Co Ltd; Baidu USA LLC
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2023-04-12
Anticipated expiration: 2039-10-12
Also published as: JP2022504995A; KR20210044180A; US20210318878A1; EP3830764A1; EP3830764A4; WO2021068243A1; KR102472282B1; US11544067B2; CN113272854B; CN113272854A

Description

Embodiments of the present disclosure relate generally to machine learning. More specifically, embodiments of the present disclosure relate to neural network training.

Neural networks are becoming increasingly complex in order to solve complex problems. Training a complex neural network requires complex deep learning algorithms and more bandwidth, which increases training time, cost, and power consumption. To accelerate training, high-end servers (e.g., faster servers or server clusters with more sophisticated interfaces) are used to improve computation and communication and to reduce the cost of expensive hardware. However, conventional solutions still face challenges in terms of performance and cost.

According to a first aspect, some embodiments of the present disclosure provide a computer-implemented method for training an artificial intelligence (AI) model using data processing (DP) accelerators, the method comprising: receiving, from a CPU, a request to train the AI model based on a training data set comprising a plurality of data blocks delivered by the CPU; and performing a plurality of DP iterations by a plurality of general-purpose processing units (GPUs) arranged in a logical ring to train the AI model, wherein, for each DP iteration, in a first DP cycle, the GPUs each perform a first predetermined DP operation on one of the data blocks in parallel to generate a respective first DP result, and in a second DP cycle, the GPUs each forward the respective first DP result, via an inter-processor link, to a downstream GPU in the logical ring for further processing.

According to a second aspect, some embodiments of the present disclosure provide a data processing system comprising at least one CPU and a plurality of general-purpose processing units (GPUs) coupled to the CPU, wherein each of the GPUs is configured to perform artificial intelligence (AI) data processing (DP) operations delivered by the CPU, the operations comprising: receiving, from the CPU, a request to train an AI model based on a training data set comprising a plurality of data blocks delivered by the CPU; and performing a plurality of DP iterations by the plurality of GPUs arranged in a logical ring to train the AI model, wherein, for each DP iteration, in a first DP cycle, the GPUs each perform a first predetermined DP operation on one of the data blocks in parallel to generate a respective first DP result, and in a second DP cycle, the GPUs each forward the respective first DP result, via an inter-processor link, to a downstream GPU in the logical ring for further processing.

According to a third aspect, some embodiments of the present disclosure provide a non-transitory machine-readable medium having instructions stored thereon which, when executed by a processor, cause the processor to perform artificial intelligence (AI) training operations, the operations comprising: receiving, from a CPU, a request to train an AI model based on a training data set comprising a plurality of data blocks delivered by the CPU; and performing a plurality of DP iterations by a plurality of general-purpose processing units (GPUs) arranged in a logical ring to train the AI model, wherein, for each DP iteration, in a first DP cycle, the GPUs each perform a first predetermined DP operation on one of the data blocks in parallel to generate a respective first DP result, and in a second DP cycle, the GPUs each forward the respective first DP result, via an inter-processor link, to a downstream GPU in the logical ring for further processing.

According to a fourth aspect, some embodiments of the present disclosure provide a computer program which, when executed by a processor, implements the method according to the first aspect.

The drawings illustrate embodiments of the invention by way of example and do not limit them. In the drawings, similar elements are given the same reference numerals.
FIG. 1 illustrates an example of a system for training an AI model, according to an embodiment.
FIGS. 2A-2F illustrate an exemplary process of data transfer in training an AI model, according to an embodiment.
FIG. 3 is a flowchart showing a variation of the process of FIGS. 2A-2F.
FIG. 4 illustrates an exemplary architecture for data compression, data operations, and interconnect buses, according to an embodiment.
FIG. 5 illustrates a zero-value compression technique, according to an embodiment.
FIG. 6 illustrates an example of an operation on compressed data, according to an embodiment.
FIG. 7 illustrates an exemplary process of AI model training, according to an embodiment.

Embodiments of the present invention are described below with reference to the drawings. The following description and drawings are illustrative of the disclosure and should not be construed as limiting it. Many specific details are described in order to provide a thorough understanding of various embodiments of the disclosure. In some cases, however, well-known or conventional details are not described, in order to keep the description of the embodiments concise.

References herein to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the disclosure. The appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment.

According to various embodiments, methods and systems are provided for accelerating artificial intelligence (AI) training using advanced interconnect technology. By using both software and hardware components, the embodiments described in this disclosure can significantly reduce interconnect communication bandwidth requirements, power consumption, and training time, thereby improving training performance without loss of accuracy. AI model training in a distributed system uses systematic data compression and decompression together with an efficient All-Reduce algorithm.

According to one embodiment, a computer-implemented method of AI model training includes performing a plurality of iterations in a Scatter-Reduce process on a cluster of processors, each of which may be a graphics processing unit (GPU). The processors are arranged in a logical ring to train a neural network model. Each processor has a plurality of data blocks, and each data block may represent a set of parameters of the neural network model or a set of gradients for updating a set of parameters.

In each iteration, a processor receives a compressed data block from the previous processor in the logical ring, performs an operation on the received compressed data block and a compressed data block generated on the current processor to compute a data block, and sends the computed data block to the subsequent processor in the logical ring. When the iterations are complete, every data block on the processors has been compressed and operated on. The method further includes identifying, on each of the processors, a compressed data block that has been computed from the corresponding data blocks of the plurality of processors.

In one embodiment, the identified compressed data block is distributed to each of the other processors in the logical ring, decompressed there, and used to update parameters of the neural network model. The processors may be attached to central processing units (CPUs) in different systems of a distributed AI model training system. In one embodiment, each processor may include a hardware-based or software-based compression module for compressing and decompressing data blocks using a zero-value compression technique. A compressed data block can be represented by a data structure having a bitmask portion and a compressed data portion, where the bitmask contains bits that indicate the positions of the non-zero values in the data block.
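The bitmask-plus-data layout described above can be illustrated with a minimal Python sketch. The function names `zvc_compress` and `zvc_decompress` and the list-based layout are assumptions made for illustration only; the disclosure does not specify a concrete encoding, and a hardware module would pack the bitmask into machine words.

```python
from typing import List, Tuple


def zvc_compress(block: List[float]) -> Tuple[List[int], List[float]]:
    """Return (bitmask, values); bitmask[i] is 1 where block[i] is non-zero."""
    bitmask = [1 if x != 0 else 0 for x in block]
    values = [x for x in block if x != 0]  # only non-zero values are stored
    return bitmask, values


def zvc_decompress(bitmask: List[int], values: List[float]) -> List[float]:
    """Rebuild the block by placing stored values at the set bit positions."""
    remaining = iter(values)
    return [next(remaining) if bit else 0.0 for bit in bitmask]


# Hypothetical sparse gradient block.
block = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 3.0]
mask, packed = zvc_compress(block)
restored = zvc_decompress(mask, packed)
```

In this sketch the saving grows with sparsity: eight values shrink to three stored values plus eight mask bits.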

According to one embodiment, upon receiving a request for AI training from a central processing unit (CPU), each of a plurality of general-purpose processing units (GPUs) arranged in a logical ring is configured to repeatedly perform data processing (DP) operations, in a pipelined manner, on data blocks delivered by the CPU. Each GPU operates as a DP accelerator for the CPU. For each iteration, in a first DP cycle, the GPUs each perform a first predetermined DP operation (e.g., data compression) on one of the data blocks in parallel and generate a respective DP result. In a second DP cycle, the GPUs each forward the respective DP result, via the corresponding inter-processor link, to the corresponding downstream GPU in the logical ring for further processing. For purposes of explanation, a GPU is used as an example of a DP accelerator, but other types of processors or processing logic may also be used as DP accelerators.

In one embodiment, during the second DP cycle, each GPU also receives a processing result from its upstream GPU in the logical ring via the corresponding inter-processor link, and the received result is used for further processing on that GPU. In one embodiment, during a third DP cycle, each GPU performs, concurrently with the other GPUs, a second predetermined DP operation (e.g., a combining operation such as addition) on a first data block that it processed itself (e.g., its own processing result) and a second data block received from its upstream GPU (e.g., the upstream GPU's processing result). In one embodiment, during a fourth DP cycle, each GPU performs a further DP operation, such as data decompression.
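The four DP cycles can be traced in a small single-process simulation. This is only a sketch under simplifying assumptions: each simulated GPU holds a single block, the helper names `compress`, `decompress`, and `add_compressed` are hypothetical, and the ring is modeled with Python lists rather than real inter-processor links.

```python
def compress(block):
    """Zero-value compression: (bitmask, non-zero values)."""
    return [int(x != 0) for x in block], [x for x in block if x != 0]


def decompress(mask, values):
    remaining = iter(values)
    return [next(remaining) if bit else 0 for bit in mask]


def add_compressed(a, b):
    """Add two compressed blocks position by position, keeping the result compressed."""
    (mask_a, vals_a), (mask_b, vals_b) = a, b
    it_a, it_b = iter(vals_a), iter(vals_b)
    mask, vals = [], []
    for bit_a, bit_b in zip(mask_a, mask_b):
        total = (next(it_a) if bit_a else 0) + (next(it_b) if bit_b else 0)
        mask.append(int(total != 0))
        if total != 0:
            vals.append(total)
    return mask, vals


blocks = [[0, 2, 0, 1], [3, 0, 0, 1], [0, 0, 4, -1]]  # one block per simulated GPU
n = len(blocks)

# Cycle 1: every GPU compresses its own block in parallel.
compressed = [compress(b) for b in blocks]
# Cycle 2: GPU i forwards to GPU (i + 1) % n, so GPU i receives from GPU (i - 1) % n.
received = [compressed[(i - 1) % n] for i in range(n)]
# Cycle 3: every GPU combines its own result with the upstream result.
combined = [add_compressed(compressed[i], received[i]) for i in range(n)]
# Cycle 4: every GPU decompresses the combined block.
results = [decompress(*c) for c in combined]
```

In a pipelined implementation these cycles would overlap across successive data blocks; the sketch runs them strictly in sequence for readability.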

FIG. 1 illustrates an example of a system for training an AI model, according to one embodiment. As shown in FIG. 1, the system includes a general-purpose processing unit (GPU) cluster 101 distributed across multiple servers (e.g., server A 103 and server B 105). Each server includes one or more CPUs, and each CPU is associated with one or more data processing (DP) accelerators, such as GPUs.

The servers may include CPU 107 and CPU 109, which communicate with each other via Ethernet connection 111. In the example system shown in FIG. 1, each CPU may have multiple GPUs connected to it via a Peripheral Component Interconnect Express (PCIe) switch. For example, in server A 103, GPU 117, GPU 119, and GPU 121 are connected to CPU A 107 via PCIe switch A 113. In server B 105, GPU 123, GPU 125, and GPU 127 are connected to CPU B 109 via PCIe switch B 115.

CPU 107 and CPU 109 can communicate with each other via an inter-processor link, such as Ethernet connection 111, to coordinate the task of training a neural network. For example, job commands can be delivered to each server via Ethernet connection 111. A job command can then be delivered from a CPU in the server to the GPUs connected to that CPU. Once a job command has been delivered, data can be transferred among the GPUs in the system via the corresponding inter-chip links 122. The inter-chip links may employ any of a variety of chip-to-chip interconnect solutions, such as Cache Coherent Interconnect for Accelerators (CCIX) links. As shown in FIG. 1, the GPUs in the system are arranged in a bidirectional ring topology, although a unidirectional ring topology can also be used.

CCIX is an open cache-coherent interconnect architecture developed by the CCIX Consortium. CCIX is designed to simplify communication between a central processor, such as a CPU, and various accelerators, such as GPUs, in a system by extending standard PCIe with cache coherency. CCIX is a high-performance chip-to-chip interconnect architecture that provides a cache coherence framework for heterogeneous system architectures. Cache coherency between the central processing unit and the various other accelerators in the system is maintained automatically at all times. Each device that supports CCIX includes at least one CCIX port, which is pin-compatible with any other CCIX-enabled device. CCIX supports various topologies, such as chip-to-chip, chip-to-switch-to-chip, mesh, daisy chain, and ring.

In one embodiment, the GPUs are configured to perform AI training operations, in a pipelined manner, on data blocks delivered from their respective CPUs. The GPUs also communicate with one another via inter-processor links. The GPUs may be arranged in a ring so that each GPU receives processing results from its upstream GPU for further data processing and sends its own processing results to its downstream GPU for further processing there. Each GPU thus performs its assigned DP operations in parallel with the others, sends its DP results to its downstream GPU, and receives processing results from its upstream GPU for further processing.

FIGS. 2A-2F illustrate an exemplary process of data transfer in training an AI model, according to one embodiment. Three GPUs are shown here (GPUs 203, 205, and 207), but the exemplary process can use as many GPUs as needed (e.g., thousands of GPUs), depending on factors such as the complexity of the neural network being trained, the size of the training data, and the training speed desired by the user.

Examples of neural networks that can be trained on the exemplary system include a multilayer perceptron (MLP) neural network, which comprises a collection of connected neurons. The neurons in an MLP neural network are fully connected when each neuron in one layer is connected, with parameters (e.g., weights and biases), to every neuron in the following layer.

During training of a neural network model, gradient descent (i.e., backpropagation) can be used to determine a set of parameters that minimizes the difference between the expected and actual output of the model. Gradient descent includes computing the gradient of the loss/error function and updating the existing parameters in response to the gradient. This cycle is repeated until a local minimum of the loss function is reached.
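The compute-gradient-then-update cycle can be shown on a one-parameter toy problem. The quadratic loss, the learning rate of 0.1, and the fixed iteration count are illustrative assumptions, not values taken from the disclosure.

```python
def loss(w: float) -> float:
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w: float) -> float:
    """Analytic gradient of the toy loss."""
    return 2.0 * (w - 3.0)


w = 0.0              # initial parameter value
learning_rate = 0.1  # assumed step size
for _ in range(100):
    # Update the existing parameter in response to the gradient.
    w -= learning_rate * grad(w)
```

A real training loop would stop on a convergence criterion rather than after a fixed number of steps, and the gradient would come from backpropagation through the network.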

In one embodiment, the training data set for a neural network model is divided into multiple subsets, and each subset is used to train the neural network model on one of the GPUs, so that training is performed by the GPUs in parallel. Each GPU can hold a complete copy of the neural network model.

Each subset of the training data set can be logically divided into multiple data blocks of equal size. In the exemplary process, the number of blocks equals the number of GPUs. Parallel training of a neural network model requires multiple iterations of gradient descent. In each iteration, each GPU performs forward propagation of the neural network model on its data, followed by backpropagation of the error, to compute the gradient of the loss with respect to the network parameters. The GPUs then communicate with one another to compute a statistic of the gradients (e.g., the mean, maximum, or minimum) and use that statistic (e.g., the mean gradient) to obtain updated parameters. A neural network model may have a large number of parameters (e.g., billions), each associated with a gradient value. The volume of gradient data for a neural network is therefore very large, and transferring gradients between GPUs consumes considerable bandwidth.
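As a concrete illustration of the statistic step, the sketch below averages per-GPU gradients element by element and applies one update. The helper names `average_gradients` and `apply_update`, the gradient values, and the learning rate are hypothetical; the averaging is done centrally here only to show the arithmetic, whereas the disclosure computes it with ring communication.

```python
from typing import List


def average_gradients(per_gpu: List[List[float]]) -> List[float]:
    """Element-wise mean of the gradient vectors reported by each GPU."""
    n = len(per_gpu)
    return [sum(column) / n for column in zip(*per_gpu)]


def apply_update(params: List[float], avg_grad: List[float],
                 learning_rate: float) -> List[float]:
    """One gradient-descent step using the mean gradient."""
    return [p - learning_rate * g for p, g in zip(params, avg_grad)]


# Hypothetical gradients for a two-parameter model on three GPUs.
per_gpu_gradients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean_gradient = average_gradients(per_gpu_gradients)
updated = apply_update([0.5, -0.5], mean_gradient, learning_rate=0.1)
```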

Referring again to FIGS. 2A-2F, the exemplary process illustrates an algorithm for reducing the bandwidth required by data transfers between GPUs. In one embodiment, bandwidth as used in this disclosure is the maximum data transfer rate of a given network connection. The algorithm can include two processes: the first is a Scatter-Reduce process, and the second is an Allgather process. During the Scatter-Reduce process, the GPUs exchange data such that each GPU ends up with a block of the final result. During the Allgather process, the GPUs exchange these result blocks such that every GPU ends up with the complete final result.

Each GPU may include one or more applications configured to divide the subset of the training data set on that GPU into data blocks of equal size. In the exemplary system, the number of data blocks on each GPU equals the number of GPUs. During training of the neural network model, each data block produces its own set of gradients.

In this example, as described above, there are three GPUs in the system, so the number of data blocks on each GPU is three. Three sets of gradients a₀ 215, b₀ 231, and c₀ 237 can be generated from the subset of training data on GPU #0 203, and three further sets of gradients a₁ 217, b₁ 223, and c₁ 239 can be generated from the subset of training data on GPU #1 205. Similarly, three sets of gradients a₂ 219, b₂ 235, and c₂ 241 are generated from the subset of training data on GPU #2 207. In one embodiment, the different gradient sets on each GPU may be stored in an array or another data structure.
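The equal-size partitioning can be sketched as follows. The function name `split_into_blocks` and the sample values are hypothetical, and the sketch assumes the subset length is an exact multiple of the number of GPUs; the disclosure does not say how a remainder would be handled.

```python
from typing import List, Sequence


def split_into_blocks(items: Sequence, num_gpus: int) -> List[list]:
    """Split a GPU's local subset into num_gpus blocks of equal size."""
    if len(items) % num_gpus != 0:
        raise ValueError("subset length must be a multiple of the number of GPUs")
    size = len(items) // num_gpus
    return [list(items[i * size:(i + 1) * size]) for i in range(num_gpus)]


# Six placeholder samples on GPU #0, split for a three-GPU ring.
subset_gpu0 = ["s0", "s1", "s2", "s3", "s4", "s5"]
block_a, block_b, block_c = split_into_blocks(subset_gpu0, 3)
```

Each of the three blocks would then yield its own gradient set (a₀, b₀, c₀ in the example above).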

As an example, the algorithm may be designed to sum the gradients generated from each subset of the training data set, so that when the algorithm completes, each GPU holds the sum of the gradients generated from the whole training data set.

During the Scatter-Reduce process, the GPUs in the exemplary process perform N-1 iterations, where N is the total number of GPUs in the system. The GPUs in the exemplary system therefore perform two iterations. In each iteration, each GPU sends one set of its gradients to its right neighbor, receives one set of gradients from its left neighbor, and adds the received set to its own corresponding set to form a new set of gradients. The set of gradients sent or received by each GPU differs from iteration to iteration. The n-th GPU starts by sending the n-th set of gradients and receiving the (n-1)-th set, and then proceeds backwards from there.

FIGS. 2A-2C illustrate the Scatter-Reduce process. FIG. 2A shows data transfer in the first iteration of the Scatter-Reduce process. After the first send and the first receive are complete, each GPU has one array element whose value is the sum of two sets of gradients from two different GPUs. For example, the first element a₁ on GPU 205 may contain the sum of the gradient sets from the second GPU 205 and the first GPU 203. FIG. 2B shows data transfer in the second iteration of the Scatter-Reduce process, together with the intermediate sums after the first iteration. The Scatter-Reduce process continues in the second iteration, and at the end of the process (i.e., after the second iteration in this example), each GPU has one array element that contains the sum of all gradients in the corresponding array element across all GPUs. FIG. 2C shows the final state at the end of the Scatter-Reduce process.
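The Scatter-Reduce iterations can be traced with a small simulation. This is a sketch only: each block is a single number standing in for a set of gradients, the function name `scatter_reduce` is hypothetical, and the index rule (GPU g sends block (g - step) mod N to its right neighbor) is one common way to realize the ordering described above.

```python
from typing import List


def scatter_reduce(data: List[List[float]]) -> List[List[float]]:
    """data[g][k] is block k on GPU g; blocks are accumulated in place."""
    n = len(data)
    for step in range(n - 1):
        # Snapshot what each GPU sends so all transfers in a step are simultaneous.
        sends = [((g - step) % n, data[g][(g - step) % n]) for g in range(n)]
        for g, (k, block) in enumerate(sends):
            right = (g + 1) % n
            data[right][k] += block  # the receiver adds the block to its own copy
    return data


# Rows are GPU #0, #1, #2; columns are blocks a, b, c.
state = scatter_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

After the two iterations, GPU #2 holds the full sum for block a, GPU #0 for block b, and GPU #1 for block c, which corresponds to the kind of final state shown in FIG. 2C.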

FIGS. 2D-2F illustrate the Allgather process. It proceeds in the same way as Scatter-Reduce and also has N-1 iterations. The difference is that, instead of accumulating received gradients, the receiving GPU overwrites the gradients in its corresponding array element with the received gradients. FIG. 2D shows data transfer in the first iteration of the Allgather process. As shown in FIG. 2E, after the first iteration is complete, each GPU has two array elements, each containing the sum of all gradients in the corresponding array element across all GPUs. FIG. 2E also shows the second iteration of the Allgather process, which is the final iteration in the exemplary process. As shown in FIG. 2F, at the end of the Allgather process, every GPU has the fully accumulated gradients from the entire training data set. The exemplary process is bandwidth-optimal because all data transfers occur synchronously in discrete iterations.
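The Allgather phase can be sketched in the same style. The starting state is an assumed post-Scatter-Reduce state for three GPUs in which GPU g already holds the complete sum for block (g + 1) mod 3; the function name `allgather` and the index rule are illustrative assumptions.

```python
from typing import List


def allgather(data: List[List[float]]) -> List[List[float]]:
    """Circulate completed blocks around the ring; receivers overwrite."""
    n = len(data)
    for step in range(n - 1):
        # GPU g forwards the most recently completed block it holds.
        sends = [((g + 1 - step) % n, data[g][(g + 1 - step) % n]) for g in range(n)]
        for g, (k, block) in enumerate(sends):
            data[(g + 1) % n][k] = block  # overwrite instead of accumulate
    return data


# Assumed state after Scatter-Reduce: complete sums are 111 (a), 222 (b), 333 (c).
after_scatter_reduce = [[1, 222, 303], [11, 20, 333], [111, 220, 300]]
final = allgather(after_scatter_reduce)
```

Each GPU sends and receives exactly one block per iteration in both phases, so the per-GPU traffic does not grow with the number of GPUs in the ring.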

FIG. 3 is a flowchart showing a variation of the process of FIGS. 2A-2F. In one embodiment, the exemplary process shown in FIG. 3 can be used to transfer gradients for updating neural network parameters during training of a neural network model. In such training, tens of megabytes of data need to be transferred between distributed servers, which must also operate in coordination. This calls for efficient hardware and software that improve performance and latency.

In one embodiment, the exemplary process uses an All-Reduce algorithm and improves performance and latency through software and hardware co-design, that is, designing hardware and software together to achieve a desired function. The exemplary process uses hardware components, such as the Cache Coherent Interconnect for Accelerators (CCIX) used to connect the GPUs in the cluster, and software modules, such as a zero-value compression module and other compression modules that enable hardware computation on compressed data. The exemplary process uses systematic data compression in a distributed system designed to perform an efficient All-Reduce process. As a result, gradients generated from different subsets of the training data set can be accumulated and distributed to each GPU more quickly, which speeds up AI model training.

In FIG. 3, the left side shows the typical All-Reduce process 302 described in detail in FIGS. 2A-2F, and the right side shows the improved All-Reduce process, which uses systematic compression on a distributed system. As an example, FIG. 3 uses three GPUs arranged to form a logical ring.

In both the typical All-Reduce process 302 and the improved All-Reduce process, the data blocks transferred between the GPUs are stored in a data structure (e.g., an array), and the data blocks may be gradients generated from different blocks of a subset of the training data set used to train the neural network model. Each GPU can hold a complete copy of the neural network model being trained. The gradients are passed between the GPUs to update the parameters of the neural network model.

In one embodiment, the data blocks on each GPU may be compressed by a compression module in the first iteration, or first processing cycle, of the Scatter-Reduce process. The compression module may be implemented in hardware or as a software module. For example, in operations 301, 315, and 329, data block a_0 on GPU #0 203, data block b_1 on GPU #1 205, and data block c_2 on GPU #2 207 are compressed, respectively.

The compressed data blocks may be sent to the neighboring GPUs in the next processing cycle. For example, the compressed data block on GPU #0 203 may be sent to GPU #1 205 in operation 303, the compressed data block on GPU #1 205 may be sent to GPU #2 207 in operation 317, and the compressed data block on GPU #2 207 may be sent to GPU #0 203 in operation 331.

In one embodiment, while the compressed data blocks are being sent to the neighboring GPUs, a different data block on each GPU may be compressed and added to the compressed data received as described above. This exemplary embodiment uses a summation operation as the example, but other operations (e.g., multiplication, subtraction, or arithmetic mean) may be used.

For example, in operation 305, data block c_0 on GPU #0 203 may be compressed and added to compressed data block c_2 received from GPU #2 207. In operation 319, data block a_1 on GPU #1 205 may be compressed and added to compressed data block a_0 received from GPU #0 203. In operation 333, data block b_2 on GPU #2 207 may be compressed and added to compressed data block b_1 received from GPU #1 205.

The above process can be repeated for each remaining iteration of the Scatter-Reduce process. The number of iterations may be the number of GPUs minus one. Thus, with three GPUs, the Scatter-Reduce process 305 in the improved All-Reduce process can have two iterations. In each remaining iteration, each GPU sends to the next GPU a sum of compressed data blocks from multiple GPUs, rather than its own original compressed data block.

For example, in the second iteration, GPU #0 203 can send the sum of compressed data blocks c_0 and c_2 to GPU #1 205 in operation 307. In operation 321, GPU #1 205 can send the sum of compressed data blocks a_0 and a_1 to GPU #2 207. In operation 335, GPU #2 207 can send the sum of compressed data blocks b_1 and b_2 to GPU #0 203.

In one embodiment, while the sums of compressed data blocks are being sent to the neighboring GPUs, each GPU may compress its remaining data block and add it to the sum of compressed data blocks that it previously received from the preceding GPU in the logical ring. For example, in operation 309, data block b_0 on GPU #0 203 may be compressed and added to the sum of compressed data blocks b_1 and b_2. In operation 323, data block c_1 on GPU #1 205 may be compressed and added to the sum of compressed data blocks c_0 and c_2. In operation 337, data block a_2 on GPU #2 207 may be compressed and added to the sum of compressed data blocks a_0 and a_1.

Thus, at the end of the Scatter-Reduce process in this example, each GPU holds the sum of the compressed data blocks from one corresponding position of the array across all GPUs.

During the Allgather process, each GPU may distribute the sum of compressed data blocks that it holds for its position in the array to the other GPUs. At the end of the Allgather process, each GPU therefore has a copy of the sums of all compressed data blocks. Each GPU may then decompress the compressed sums, as shown in operations 313, 327, and 341. The decompressed sums on each GPU can be used to update the parameters of that GPU's copy of the neural network model.
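The ring schedule described above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not the patented implementation: the function name `ring_all_reduce` is hypothetical, each block is a plain number, and compression and decompression are left out so that only the order of transfers (Scatter-Reduce followed by Allgather) is visible.

```python
def ring_all_reduce(blocks_per_gpu):
    """Simulate ring All-Reduce on a logical ring of n GPUs.

    blocks_per_gpu[g][k] is block k held by GPU g. Each GPU is assumed to
    hold exactly n blocks, one per ring position (an assumption of this sketch).
    """
    n = len(blocks_per_gpu)
    data = [list(row) for row in blocks_per_gpu]  # work on a copy

    # Scatter-Reduce: n - 1 iterations. In iteration s, GPU g sends block
    # (g - s) mod n downstream; the receiver adds it to its own copy.
    for s in range(n - 1):
        sends = [((g - s) % n, data[g][(g - s) % n]) for g in range(n)]
        for g, (k, value) in enumerate(sends):
            data[(g + 1) % n][k] += value

    # GPU g now holds the complete sum of block (g + 1) mod n.
    # Allgather: n - 1 iterations that circulate the finished sums.
    for s in range(n - 1):
        sends = [((g + 1 - s) % n, data[g][(g + 1 - s) % n]) for g in range(n)]
        for g, (k, value) in enumerate(sends):
            data[(g + 1) % n][k] = value
    return data


# Three GPUs, blocks (a, b, c) on each, as in the three-GPU example of FIG. 3.
result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

With three GPUs the first Scatter-Reduce iteration moves a_0, b_1, and c_2 downstream, matching operations 303, 317, and 331; after both phases every GPU holds the same three sums.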

FIG. 4 shows an exemplary architecture for data compression, data operations, and an interconnect bus, according to one embodiment.

The diagram in FIG. 4 shows a data flow in which raw data blocks 405 and 407 are compressed, the compressed data blocks are transferred over interconnect buses 416 and 418, operations 413 and 419 are performed on the compressed data, and the compressed data is decompressed back into raw data.

As shown in FIG. 4, a pair consisting of a compression module and a decompression module may be used on each GPU. For example, compression module 412 and decompression module 409 may be used on GPU A 401, and compression module 417 and decompression module 415 may be used on GPU B 403.

Any compression algorithm can be used in compression modules 412 and 417. One example is a zero-value compression algorithm/technique, which is described in detail below. When the ratio of zero values is 50%, a zero-value compression algorithm can save close to 50% of the bandwidth for data transfer. When combined with the interconnect bus and the various operations on compressed data, the bandwidth benefit can exceed 50%.
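The "close to 50%" figure can be checked with a rough size estimate. The sketch below is illustrative only: the function name `compressed_size_bytes` and the 2-byte header for the type and length fields are assumptions, since the source does not state field widths.

```python
def compressed_size_bytes(num_values, zero_ratio, bytes_per_value, header_bytes=2):
    """Estimate the size of a zero-value-compressed block.

    header_bytes is a hypothetical size for the type and length fields.
    """
    bitmask_bytes = (num_values + 7) // 8                 # one bit per value
    nonzero = round(num_values * (1.0 - zero_ratio))      # values actually stored
    return header_bytes + bitmask_bytes + nonzero * bytes_per_value


raw = 16 * 4                                   # 4x4 block of FP32 values
packed = compressed_size_bytes(16, 0.5, 4)     # half of the values are zero
saving = 1 - packed / raw
```

Under these assumptions the saving stays somewhat below 50% because the bitmask and header are sent as well; the overhead shrinks relative to the payload as blocks grow.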

FIG. 5 shows a zero-value compression technique according to one embodiment. In FIG. 5, matrix 513 is an original 4×4 data array used for training a neural network model. Data structure 510 shows the compressed form of matrix 513 produced by the zero-value compression technique. Data structure 510 includes multiple fields, for example type field 501, length field 503, bitmask field 505, and compressed data field 507. Matrix 513 and data structure 510 can be converted into each other using compression 511 and decompression 509.

In one embodiment, type field 501 indicates the data type of the values in matrix 513. Examples of data types include 32-bit floating point (FP32), FP16, and 8-bit integer (INT8). Length field 503 indicates, in bytes, either the total size of bitmask field 505 and compressed data field 507, or the size of compressed data field 507 alone when the bitmask has a fixed size. Each bit in bitmask field 505 is set to "1" to indicate a non-zero value at the corresponding position in matrix 513 and to "0" to indicate a zero value. Compressed data field 507 contains only the non-zero values, with the correct alignment/offset. The bitmask field may be used by a decompression module (e.g., decompression module 409 or 415 of FIG. 4) to write the non-zero values back to their original positions in the 4×4 data array 513.
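The four fields described above map naturally onto a small record. The following is a minimal sketch, assuming row-major ordering, a bitmask held as a list of 0/1 integers, and a receiver that already knows the matrix shape; the names `CompressedBlock`, `compress`, `decompress`, and `BYTES_PER_VALUE` are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import List

# Assumed element sizes for the data types named in the text.
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "INT8": 1}


@dataclass
class CompressedBlock:
    dtype: str            # type field: data type of the values
    length: int           # length field: bitmask bytes plus stored-value bytes
    bitmask: List[int]    # 1 marks a non-zero position, 0 marks a zero
    values: List[float]   # compressed data field: non-zero values only


def compress(matrix, dtype="FP32"):
    """Zero-value compress a 2-D list in row-major order."""
    flat = [v for row in matrix for v in row]
    bitmask = [1 if v != 0 else 0 for v in flat]
    values = [v for v in flat if v != 0]
    length = (len(bitmask) + 7) // 8 + len(values) * BYTES_PER_VALUE[dtype]
    return CompressedBlock(dtype, length, bitmask, values)


def decompress(block, rows, cols):
    """Write the stored values back to the positions flagged in the bitmask."""
    stored = iter(block.values)
    flat = [next(stored) if bit else 0 for bit in block.bitmask]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


matrix = [[0, 1.5, 0, 0], [2.0, 0, 0, 3.0], [0, 0, 0, 0], [4.0, 0, 5.0, 0]]
block = compress(matrix)
restored = decompress(block, 4, 4)
```

Here the length field follows the first of the two interpretations in the text, the combined size of the bitmask and the compressed data.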

FIG. 6 shows an example of an operation on compressed data according to one embodiment. FIG. 6 uses a summation operation as an example to explain how two compressed data blocks are operated on.

In one embodiment, compressed data 617 is a data structure representing matrix A 613 in compressed form, and compressed data 619 is a data structure representing matrix B 615 in compressed form. Both structures are generated using the compression technique shown in FIG. 5 and can be decompressed by a decompression module (e.g., decompression module 409 or 415) into matrix A 613 and matrix B 615, respectively.

In one embodiment, to sum the two matrices 613 and 615 in their compressed form, a hardware compression module (e.g., compression module 412 or 417 of FIG. 4) can first pipeline the two compressed data structures 617 and 619, compare the bits in the bitmask field of one data structure with the bits in the bitmask field of the other, and output the comparison result 621.

Transferring data between GPUs in compressed form reduces the bandwidth required for data transfer. In addition, a compressed data block occupies less memory than its uncompressed form, and fewer bits are read from and written to memory during an operation, so the memory required to operate on compressed data blocks can be reduced.

For example, a summation operation may require two reads and one write. Because the data read from and written to memory is in compressed form, the memory required for the summation operation can be reduced.
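A summation that works directly on two compressed blocks might look like the sketch below. The source states only that the two bitmask fields are compared and a comparison result is output; how the stored values are then combined is an assumption of this example, as are the function name `add_compressed` and the representation of each block as a (bitmask, values) pair.

```python
def add_compressed(mask_a, vals_a, mask_b, vals_b):
    """Add two zero-value-compressed blocks without expanding them.

    bitmask[i] is 1 where the original block had a non-zero value, and the
    values list holds those non-zero values in order.
    """
    ia = ib = 0                       # read cursors into the two value lists
    out_mask, out_vals = [], []
    for bit_a, bit_b in zip(mask_a, mask_b):
        total = 0
        if bit_a:                     # block A contributes at this position
            total += vals_a[ia]
            ia += 1
        if bit_b:                     # block B contributes at this position
            total += vals_b[ib]
            ib += 1
        # Two non-zero values can cancel; keep the output in compressed form.
        if total != 0:
            out_mask.append(1)
            out_vals.append(total)
        else:
            out_mask.append(0)
    return out_mask, out_vals


sum_mask, sum_vals = add_compressed([1, 0, 1, 0], [1, 2], [1, 1, 0, 0], [-1, 5])
```

Only the non-zero values of each input are read and only the non-zero sums are written, which is the memory-traffic saving the text describes.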

FIG. 7 shows an exemplary process 700 for AI model training according to one embodiment. Process 700 may be performed by processing logic that includes software, hardware, or a combination thereof.

Referring to FIG. 7, in operation 701, multiple iterations are performed on a plurality of processors arranged as a logical ring to train a neural network model, where each processor holds a plurality of data blocks. In operation 702, for each of the iterations, one of the plurality of processors receives a compressed data block from the preceding processor in the logical ring, performs an operation on the received compressed data block and a compressed data block generated on that processor to compute a data block, and sends the computed data block to the following processor in the logical ring. In operation 703, a compressed data block computed from corresponding data blocks of the plurality of processors is identified on each of the processors. The identified data block is distributed to each of the other processors, where it is decompressed and used for AI model training, such as updating the parameters of the neural network model.

Some or all of the components described above may be implemented in software, hardware, or a combination thereof. For example, such components can be implemented as software installed and stored on a persistent storage device, which can be loaded into memory and executed by a processor (not shown) to carry out the processes or operations described throughout this specification. Alternatively, such components can be implemented as executable code programmed or embedded into dedicated hardware such as an integrated circuit (e.g., an application-specific IC, or ASIC), a digital signal processor (DSP), or a field-programmable gate array (FPGA), which can be accessed from an application via a corresponding driver and/or operating system. Furthermore, such components can be implemented as specific hardware logic in a processor or processor core, as part of an instruction set accessible by a software component via one or more specific instructions.

Some portions of the preceding detailed description have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities.

All of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as is apparent from the above discussion, it is to be understood that, throughout this specification, descriptions using terms such as those set forth in the claims below refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates data represented as physical (electronic) quantities within the computer system's registers and memories and transforms it into other data similarly represented as physical quantities within the computer system's memories or registers or other such information storage, transmission, or display devices.

Embodiments of the present disclosure also relate to an apparatus for performing the operations herein. A computer program for such an apparatus is stored on a non-transitory computer-readable medium. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g., a computer). For example, a machine-readable (e.g., computer-readable) medium includes a machine-readable storage medium, such as read-only memory ("ROM"), random-access memory ("RAM"), magnetic disk storage media, optical storage media, and flash memory devices.

The processes or methods depicted in the preceding figures may be performed by processing logic that includes hardware (e.g., circuitry, dedicated logic, etc.), software (e.g., embodied on a non-transitory computer-readable medium), or a combination of both. Although the processes or methods are described above in terms of certain sequential operations, it should be understood that some of the operations described may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.

Embodiments of the present disclosure are not described with reference to any particular programming language. It should be understood that the teachings of the embodiments described herein may be implemented using a variety of programming languages.

The present disclosure has been described in detail above with reference to specific embodiments. It will be evident that various modifications can be made without departing from the broader spirit and scope of the disclosure as set forth in the following claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims

1. A computer-implemented method for training an artificial intelligence (AI) model using a data processing (DP) accelerator, comprising:
receiving, from a CPU, a request to train the AI model based on a training data set containing a plurality of data blocks distributed by the CPU; and
performing multiple DP iterations by a plurality of general purpose processing units (GPUs) arranged in a logical ring to train the AI model,
wherein the multiple DP iterations include, for each DP iteration:
in a first DP cycle, each of the plurality of GPUs performing, in parallel, a data compression operation on one of the plurality of data blocks to generate a respective first compressed data block;
in a second DP cycle, each of the plurality of GPUs forwarding, via an interprocessor link, its respective first compressed data block to a downstream GPU in the logical ring for further processing, and receiving, via a corresponding interprocessor link or CCIX connection, from an upstream GPU in the logical ring, for further processing, a second compressed data block generated by the upstream GPU performing a data compression operation;
in a third DP cycle, each of the plurality of GPUs performing a summation operation on the first compressed data block processed by the current GPU and the second compressed data block processed by and received from the corresponding upstream GPU, to generate a first DP result; and
in a fourth DP cycle, each of the plurality of GPUs performing a data decompression operation on the first DP result, the data blocks obtained by the decompression being used in the next DP iteration,
wherein the summation operation is an operation that pipelines the first compressed data block and the second compressed data block, compares the bits in the bitmask field of one compressed data block with the bits in the bitmask field of the other compressed data block, and outputs the result of the comparison.

2. The method of claim 1, wherein at least some of the data blocks represent parameters or gradients generated as part of training the AI model.

3. The method of claim 1, wherein the data compression operation is performed using a zero-value compression algorithm that compresses one or more data blocks into a data structure having a bitmask portion and a compressed data portion, wherein the bitmask portion contains bits indicating the positions of non-zero values in the data blocks.

4. A data processing system, comprising:
at least one CPU; and
a plurality of general purpose processing units (GPUs) connected to the CPU,
wherein each of the plurality of GPUs is configured to perform artificial intelligence (AI) data processing (DP) operations distributed from the CPU, the operations comprising:
receiving, from the CPU, a request to train an AI model based on a training data set containing a plurality of data blocks distributed by the CPU; and
performing multiple DP iterations by the plurality of GPUs arranged in a logical ring to train the AI model,
wherein the multiple DP iterations include, for each DP iteration:
in a first DP cycle, each of the plurality of GPUs performing, in parallel, a data compression operation on one of the plurality of data blocks to generate a respective first compressed data block;
in a second DP cycle, each of the plurality of GPUs forwarding, via an interprocessor link, its respective first compressed data block to a downstream GPU in the logical ring for further processing, and receiving, via a corresponding interprocessor link or CCIX connection, from an upstream GPU in the logical ring, for further processing, a second compressed data block generated by the upstream GPU performing a data compression operation;
in a third DP cycle, each of the plurality of GPUs performing a summation operation on the first compressed data block processed by the current GPU and the second compressed data block processed by and received from the corresponding upstream GPU, to generate a first DP result; and
in a fourth DP cycle, each of the plurality of GPUs performing a data decompression operation on the first DP result, the data blocks obtained by the decompression being used in the next DP iteration,
wherein the summation operation is an operation that pipelines the first compressed data block and the second compressed data block, compares the bits in the bitmask field of one compressed data block with the bits in the bitmask field of the other compressed data block, and outputs the result of the comparison.

5. A non-transitory machine-readable medium having instructions stored thereon,
wherein the instructions, when executed by a processor, cause the processor to perform operations of artificial intelligence (AI) training, the operations comprising:
receiving, from a CPU, a request to train an AI model based on a training data set containing a plurality of data blocks distributed by the CPU; and
performing multiple DP iterations by a plurality of general purpose processing units (GPUs) arranged in a logical ring to train the AI model,
wherein the multiple DP iterations include, for each DP iteration:
in a first DP cycle, each of the plurality of GPUs performing, in parallel, a data compression operation on one of the plurality of data blocks to generate a respective first compressed data block;
in a second DP cycle, each of the plurality of GPUs forwarding, via an interprocessor link, its respective first compressed data block to a downstream GPU in the logical ring for further processing, and receiving, via a corresponding interprocessor link or CCIX connection, from an upstream GPU in the logical ring, for further processing, a second compressed data block generated by the upstream GPU performing a data compression operation;
in a third DP cycle, each of the plurality of GPUs performing a summation operation on the first compressed data block processed by the current GPU and the second compressed data block processed by and received from the corresponding upstream GPU, to generate a first DP result; and
in a fourth DP cycle, each of the plurality of GPUs performing a data decompression operation on the first DP result, the data blocks obtained by the decompression being used in the next DP iteration,
wherein the summation operation is an operation that pipelines the first compressed data block and the second compressed data block, compares the bits in the bitmask field of one compressed data block with the bits in the bitmask field of the other compressed data block, and outputs the result of the comparison.

6. A computer program that, when executed by a processor, implements the method of any one of claims 1 to 3.