CN117290280A - Data transmission method between multi-processing cards

Info

Publication number: CN117290280A
Application number: CN202311484181.5A
Authority: CN (China)
Prior art keywords: data, channel, subset, data transmission, transmitted
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 赵军平, 梅晓峰, 赵守仁
Current Assignee: Alipay Hangzhou Information Technology Co Ltd
Original Assignee: Alipay Hangzhou Information Technology Co Ltd
Application filed by Alipay Hangzhou Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 Interprocessor communication
    • G06F 15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F 15/17306 Intercommunication techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Computer And Data Communications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiments of this specification provide a data transmission method between multiple processing cards, where direct channels and indirect channels are provided between the processing cards, an indirect channel comprising two or more hops of direct channels; the processing cards are located on the motherboard of the same processing device. The method comprises: receiving an inter-card data transmission request; determining two or more channels from a first processing card corresponding to a source address to a second processing card corresponding to a target address; selecting one or more target channels from the two or more channels for transmitting the data to be transmitted; obtaining one or more data subsets of the data to be transmitted based on the number of target channels and the data to be transmitted, the one or more data subsets being transmitted over the one or more target channels; and initiating, based on the one or more data subsets, data transmission instructions to the processing cards associated with the target channels, and transmitting the corresponding data subsets through those processing cards, so that the data to be transmitted is delivered to the target address.

Description

Data transmission method between multi-processing cards
Description of the division
This application is a divisional application of the Chinese application filed on June 15, 2023, with application No. 202310707879.2, entitled "A data transmission method and system between multi-processing cards".
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method for transmitting data between multiple processors or multiple processing cards.
Background
A multiprocessor device is a processing device that contains multiple processors. Each processor can perform data operations and exchange data with the others; all processors are managed by a unified operating system and can share peripherals such as I/O devices and disks. In a typical multiprocessor device, the processors are located on the same motherboard: the motherboard carries the interconnect circuitry and provides slots, and each processor is seated in a slot in the form of a board card, which is why a processor may also be referred to as a processing card.
Multiprocessor devices offer far greater computing power than single-processor devices and are typically used as industrial-grade computing equipment for large-scale data computation. In big data computing scenarios, frequent and massive data exchange must take place between the processors, so optimizing data transmission between them has become a problem to be solved.
Disclosure of Invention
One or more embodiments of the present disclosure provide a data transmission method between multi-processing cards, comprising: receiving, by a data transmission client, a call from an application process, and thereby initiating an inter-card data transmission request to a data transmission service process, where the request includes a target address and a source address that correspond to different processing cards; selecting, by the data transmission service process, one or more target channels from the two or more channels between the first processing card corresponding to the source address and the second processing card corresponding to the target address, and returning the target channel identifiers to the data transmission client; obtaining, by the data transmission client, one or more data subsets of the data to be transmitted based on the number of target channels and the data to be transmitted; initiating, by the data transmission client, data transmission instructions to the transmission processes on the processing cards associated with the target channels based on the one or more data subsets; and transmitting the corresponding data subsets through those transmission processes, so that the data to be transmitted reaches the target address.
One or more embodiments of the present specification provide a data transmission system between multi-processing cards, comprising: a data transmission client, called by an application process to initiate an inter-card data transmission request to a data transmission server, where the request includes a target address and a source address that correspond to different processing cards; the data transmission server, configured to respond to the inter-card data transmission request by selecting one or more target channels from the two or more channels between the first processing card corresponding to the source address and the second processing card corresponding to the target address, and to return the target channel identifiers to the data transmission client; the data transmission client, further configured to obtain one or more data subsets of the data to be transmitted based on the number of target channels and the data to be transmitted, and to initiate data transmission instructions to the transmission modules on the processing cards associated with the target channels based on the one or more data subsets; and the transmission modules, configured to respond to the data transmission instructions by transmitting the corresponding data subsets, so that the data to be transmitted reaches the target address.
One or more embodiments of the present specification provide a storage medium storing computer instructions that, when executed by a processor or processing card, perform the aforementioned method of data transmission between multi-processing cards.
One or more embodiments of the present disclosure provide an apparatus comprising multiple processing cards and a storage medium, where direct channels and/or indirect channels are provided between the processing cards, an indirect channel comprising two or more hops of direct channels; the storage medium stores computer instructions, and the processing cards execute those instructions to implement the data transmission method between multi-processing cards.
One or more embodiments of the present disclosure provide a data transmission method between multi-processing cards, where direct channels and/or indirect channels are provided between the processing cards, the method comprising: receiving an inter-card data transmission request, the request including a target address and a source address that correspond to different processing cards; determining two or more channels from a first processing card corresponding to the source address to a second processing card corresponding to the target address; acquiring a weight value for each of the two or more channels, the weight value being positively correlated with a channel's data transmission bandwidth and negatively correlated with its observed load; and selecting, based on the weight values, one or more target channels from the two or more channels for transmitting the data to be transmitted.
Drawings
The present specification is further described by way of exemplary embodiments, which are explained in detail with reference to the accompanying drawings. These embodiments are not limiting; in the drawings, like numerals denote like structures, wherein:
FIG. 1 is a schematic diagram of a multiprocessor motherboard according to some embodiments of the present specification;
FIG. 2 is an exemplary flowchart of a method of data transfer between multiple processing cards according to some embodiments of the present specification;
FIG. 3 is an exemplary flowchart for determining a target channel according to some embodiments of the present specification;
FIG. 4 is an exemplary diagram of a component table according to some embodiments of the present specification;
FIG. 5 is an exemplary diagram of a load table according to some embodiments of the present specification;
FIG. 6 is an exemplary schematic diagram of a method of caching data subsets on a transit processing card according to some embodiments of the present specification;
FIG. 7 is an exemplary schematic diagram of a method of caching data subsets on a transit processing card according to further embodiments of the present specification;
FIG. 8 is a schematic block diagram of a data transmission system between multiple processing cards according to some embodiments of the present specification.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of this specification, the drawings used in describing the embodiments are briefly introduced below. The drawings in the following description are clearly only some examples or embodiments of this specification; those of ordinary skill in the art can apply this specification to other similar situations based on these drawings without inventive effort. Unless otherwise apparent from the context or otherwise specified, like reference numerals in the figures denote like structures or operations.
It will be appreciated that "system," "apparatus," "unit," and/or "module" as used herein are ways of distinguishing between different components, elements, parts, portions, or assemblies at different levels. Other words may replace them if they achieve the same purpose.
As used in this specification and the claims, the terms "a," "an," and/or "the" are not limited to the singular and may include the plural unless the context clearly dictates otherwise. In general, the terms "comprise" and "include" merely indicate that the explicitly identified steps and elements are present; they do not constitute an exclusive list, and a method or apparatus may include other steps or elements.
Flowcharts are used in this specification to describe the operations performed by systems according to its embodiments. It should be appreciated that the preceding or following operations are not necessarily performed precisely in order; steps may instead be processed in reverse order or simultaneously, and other operations may be added to, or removed from, these processes.
In big data processing scenarios, multiprocessor devices are commonly used. A multiprocessor device is a device that includes multiple processors, each capable of performing data operations and exchanging data, all managed by a unified operating system and able to share peripherals such as I/O devices and disks. Generally, the processors sit on the same motherboard, which carries the interconnect circuitry and provides slots into which the processors are seated as board cards. In some contexts, the terms "processor," "processing card," and "processor board card" are interchangeable.
Processors on the same processing device or motherboard may be divided into a main processor and coprocessors. The main processor is the processing core of the device and is typically a general-purpose CPU (Central Processing Unit). A coprocessor is scheduled by the main processor and assists it with specific computing tasks. Depending on the tasks they handle, coprocessors can be classified into mathematical coprocessors, graphics coprocessors, and so on. A mathematical coprocessor, also known as a floating-point coprocessor, is used for numerical processing and performs numerical operations faster than the main processor. A graphics processor, also known as a GPU (Graphics Processing Unit), is used for graphics display and computation acceleration; it is characterized by high-speed parallel computing and suits tasks that can be processed in parallel.
In some embodiments, a multiprocessor device may be used for machine learning model training or prediction. In particular, when there are many training samples, distributed training may be carried out across multiple processors. For example, the main processor may distribute the same initial model to multiple coprocessors, each of which trains the initial model using a portion of the training samples. As an example, each coprocessor may run the initial model over the feature information of its training samples to obtain prediction results and return them to the main processor; the main processor computes a loss function value from the prediction results and the sample labels, derives an average gradient, and returns it to each coprocessor, which updates its local model parameters based on that average gradient. Iterating this over multiple rounds yields a trained model. Compared with a single-processor device, model training time is greatly shortened because the coprocessors train in parallel. When the model is large, distributed prediction can likewise be carried out across multiple processors: the main processor may split the trained model into several sub-models and distribute them to the coprocessors, split the feature items of the object to be predicted vertically and distribute them correspondingly, let each coprocessor process the feature values of its feature items with its local sub-model to obtain a local prediction result, and finally aggregate the returned local results into the final prediction. Clearly, large and frequent data interaction takes place between processors during large-scale model training or prediction.
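To make this exchange pattern concrete, the following is a minimal Python sketch of one data-parallel training round with simulated coprocessors; all names (local_gradient, train_round) are hypothetical and the gradient arithmetic is a stand-in for real backpropagation, not the method of this specification:

```python
# Minimal sketch of one data-parallel training round; names are hypothetical
# and the "gradient" is simulated with plain numbers, not real backprop.

def local_gradient(params, shard):
    # Each simulated coprocessor computes a gradient on its own sample shard.
    return [sum(shard) * p for p in params]

def train_round(params, shards, lr=0.01):
    grads = [local_gradient(params, s) for s in shards]   # parallel on GPUs
    avg = [sum(g[i] for g in grads) / len(grads)          # averaged on CPU
           for i in range(len(params))]
    return [p - lr * g for p, g in zip(params, avg)]      # broadcast update

params = [0.5, -0.2]
shards = [[0.1, 0.2], [0.3], [0.05, 0.05]]  # training data split across cards
print(train_round(params, shards))
```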
It should be noted that distributed model training/prediction is only one application scenario of multiprocessor devices and should not be construed as limiting their application scenarios; in some embodiments, multiprocessor computing is also applicable to knowledge-graph queries and the like.
FIG. 1 is a schematic diagram of a multiprocessor motherboard according to some embodiments of the present description. As shown in FIG. 1, the motherboard 100 includes two CPU boards and eight GPU boards, with data lines or data channels arranged between the processor boards. CPU0 and CPU1 can be regarded as main processors, and GPUs 0-7 as coprocessors. A CPU and a GPU can be directly connected through a PCIe interface, whose transmission bandwidth can reach 16 GB/s (gigabytes per second) or 32 GB/s. GPUs can be directly connected through an NVLink interface, whose transmission bandwidth exceeds that of PCIe, e.g., 150 GB/s or more. CPUs can be directly connected through QPI or UPI interfaces, with a transmission bandwidth between those of PCIe and NVLink. In some embodiments, two or more GPUs may be combined as a group (e.g., GPU0 and GPU1 in the dashed box) and connected to the CPU via a PCIe direct channel, in which case the CPU transmits different information to GPU0 or GPU1 by addressing. There may also be two or more direct channels between two GPUs, in which case the bandwidth between them is the sum of the bandwidths of those direct channels.
As can be seen from FIG. 1, besides exchanging data over a direct channel, two processor boards can also communicate over an indirect channel comprising two or more hops of direct channels. For example, CPU0 and GPU0, which share a PCIe direct channel, may also exchange data via the indirect channel CPU0-GPU3-GPU0, or via CPU0-CPU1-GPU6-GPU0, and so on. When two processors need to exchange data, how to select from the multiple channels a target channel suited to the current transmission, so as to improve transmission speed or balance transmission load, becomes a problem worth studying.
Therefore, some embodiments of the present disclosure provide a global, multi-channel data transmission acceleration method that works across processor cards and/or across tasks. It comprises several internal optimization steps, such as optimized multi-channel selection, data segmentation, concurrent transmission, and application-layer transparency, and ultimately accelerates data transmission between processing cards significantly.
FIG. 2 is an exemplary flowchart of a method of data transfer between multiple processing cards according to some embodiments of the present description; the method implements data transfer in a client-server (CS) architecture. In some embodiments, the multi-processing cards may include a main-processor CPU board and multiple coprocessor GPU boards. The server is implemented as an independent process running on the CPU board. Each processor board runs a transmission process that carries out the actual data transfers. The client is available for the application program (or application process) to call; it initiates requests to the service process and sends data transmission instructions to the transmission processes on the relevant processing cards.
An application process is the process corresponding to an application program; it is responsible for the application layer's data operations or information processing, such as computing a prediction result from feature data with a model, or training a model on training samples. When an application process obtains a computation result (intermediate or final) that must be passed to application processes on other processing cards for subsequent operations, it initiates data transmission between the cards. In some embodiments of this specification, when an application process needs to transfer data to another processing card, it may invoke a data transmission client, which initiates an inter-card data transmission request to the data transmission service process. Accordingly, a data transmission client may be deployed on each processing card for application processes to call. As shown in FIG. 2, the data transmission method 200 provided in some embodiments of the present disclosure may specifically include the following steps:
step 210, receiving the application process call by the data transmission client, and further initiating an inter-card data transmission request to the data transmission service process.
The data transmission client may be packaged in functional form, with a call interface consisting of a function name and input parameters. The input parameters may include the source address and the target address of the data to be transmitted. The source address may specifically be the head address of the storage area where the data to be transmitted resides. When the application program runs on the CPU board, the source address may point to a storage area in the processing device's memory or on disk; when it runs on a GPU board, the source address may point to a storage area in the GPU's video memory. The target address is the storage area the data is to reach, pointing to a storage area on a processing card different from the one where the application runs; as with the source address, the specific location may be memory, disk, or video memory. In some embodiments, the processing card corresponding to the source address may be called the source or first processing card, and the card corresponding to the target address the destination or second processing card. In some embodiments, the input parameters may also include a data offset reflecting the amount of data to be transmitted, such as 1024 B (bytes) or 320 B. From the head address and the offset, the data transmission client can determine the address of every byte of the data to be transmitted within the storage area of the source processing card.
The application program calls the data transmission client through its call interface; the client obtains the source address and the target address of the data to be transmitted and then initiates an inter-card data transmission request, containing those addresses, to the service process. Because the data transmission client lives inside the application process, it can deliver the request to the data transmission service process through an inter-process communication mechanism such as a pipe or a socket.
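As an illustration of this interface, here is a minimal Python sketch; the names TransferRequest and transfer, and the use of plain integers for addresses, are assumptions for illustration and not the patent's actual API:

```python
from dataclasses import dataclass

@dataclass
class TransferRequest:
    src_addr: int   # head address of the data in the source card's storage
    dst_addr: int   # target address on a different processing card
    offset: int     # amount of data to transmit, in bytes

def transfer(src_addr: int, dst_addr: int, offset: int) -> TransferRequest:
    """Hypothetical client entry point called by the application process."""
    # In the described design this request would be forwarded to the service
    # process over a pipe or socket; here we just build the request object.
    return TransferRequest(src_addr, dst_addr, offset)

req = transfer(src_addr=0xB13FF, dst_addr=0xC0000, offset=1024)
print(req)
```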
Step 220, selecting, through the data transmission service process, one or more target channels from the two or more channels between the first processing card corresponding to the source address and the second processing card corresponding to the target address, and returning the target channel identifiers to the data transmission client.
In some embodiments, the data transfer service process may determine the target channel according to the flow 300 shown in fig. 3. As shown in fig. 3, the process 300 includes:
step 310, determining more than two channels from a first processing card corresponding to a source address to a second processing card corresponding to a destination address.
As mentioned above, multiple channels are provided between any two of the multi-processing cards, including one or more direct channels and one or more indirect channels. In some embodiments, the data transmission service process may determine the corresponding processing cards from the source address and the target address in the inter-card data transmission request. Specifically, the source address may include the number of the processing card and the head address of the storage area (such as memory, a register, or video memory) holding the data to be transferred on that card; the target address is analogous. Alternatively, every storage address may be globally unique across the multiple processing cards, so the card an address points to can be determined directly from the source or target address. The data transmission service process may then determine all channels between the first processing card corresponding to the source address and the second processing card corresponding to the target address.
In some embodiments, the data transmission service process may obtain a component table recording the direct channels between the multi-processing cards along with their data transmission bandwidths. FIG. 4 shows a component table according to some embodiments of the present disclosure: table 400 records which processing cards have direct channels between them and the data transmission bandwidth of each direct channel. From table 400, there is a direct channel between the CPU and GPU0 with a bandwidth of 12 GB/s, and one between GPU0 and GPU1 with a bandwidth of 24 GB/s. Table 400 also records the data transmission bandwidth within a processing card; for example, it can reach 1440 GB/s within GPU0. Because an indirect channel is composed of two or more direct channels, the data transmission service process can also use the component table to determine the indirect channels between any two processing cards, and hence all channels between the first and second processing cards. Since the direct channels between the processing cards are fixed, the service process only needs to acquire the component table once.
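A minimal sketch of this channel enumeration, assuming the component table is available as a simple Python mapping (the table entries and card names below are illustrative, not the full contents of FIG. 4):

```python
# Sketch: derive all channels (direct and indirect) between two cards from a
# component table of direct links. Table values are illustrative only.
component_table = {            # (card_a, card_b): bandwidth in GB/s
    ("CPU", "GPU0"): 12,
    ("CPU", "GPU1"): 12,
    ("GPU0", "GPU1"): 24,
}

def neighbors(card):
    for (a, b), bw in component_table.items():
        if a == card:
            yield b, bw
        elif b == card:
            yield a, bw

def all_channels(src, dst, max_hops=3, path=None):
    path = path or [src]
    for nxt, _ in neighbors(path[-1]):
        if nxt == dst:
            yield path + [dst]
        elif nxt not in path and len(path) < max_hops:
            yield from all_channels(nxt, dst, max_hops, path + [nxt])

print(list(all_channels("CPU", "GPU1")))
# [['CPU', 'GPU0', 'GPU1'], ['CPU', 'GPU1']]
```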
In some embodiments, the hardware driver of the motherboard or processing card may provide a component-table query API (Application Programming Interface), which the data transmission service process may call to retrieve the table.
Step 320, a weight value for each of the more than two channels is obtained.
To increase the transmission rate, a channel with a large data transmission bandwidth may be selected from the two or more channels between the first and second processing cards for the data to be transmitted. In practice, however, a channel's transmission efficiency is limited not only by its bandwidth but also by the observed load on it. The observed load may be the amount of data the channel is transmitting and/or waiting to transmit, reflecting how heavily the channel is actually occupied. The data transmission service process may therefore compute a weight value for each channel based on its data transmission bandwidth and observed load, and screen channels accordingly.
In some embodiments, the data transmission service process may determine the data transmission bandwidth of each direct channel between the first and second processing cards based on table 400. For an indirect channel between the two cards, it can obtain the bandwidth of each direct channel along the indirect channel and take the minimum as the bandwidth of the indirect channel. In this way, the service process obtains the data transmission bandwidth of every channel between the first and second processing cards.
In some embodiments, the data transmission service process may acquire a load table recording the observed load on each direct channel between the multi-processing cards. Because the observed load changes over time, the service process needs to re-acquire the load table each time it is used. As with the component table, it may obtain the load table by calling an API provided by the motherboard or the processing card. FIG. 5 shows a load table 500 according to some embodiments of the present disclosure; from table 500 it can be determined that the observed load on the direct channel between the CPU and GPU0 is currently 1 GB, and on the direct channel between GPU0 and GPU1 is 4 GB.
Based on the load table, the data transmission service process may determine the observed load of every direct channel between the first and second processing cards. For an indirect channel, it can obtain the observed load of each direct channel along the indirect channel and sum them to obtain the observed load of the indirect channel. The service process thus obtains the observed load of every channel between the first and second processing cards.
In some embodiments, the data transmission service process may determine each channel's weight value from its data transmission bandwidth and observed load. Specifically, the weight value is positively correlated with the channel's bandwidth and negatively correlated with its observed load. In some embodiments, the service process may determine a first weight value positively correlated with the bandwidth and a second weight value negatively correlated with the observed load, and then combine the two, for example by addition or multiplication, to obtain the channel's weight value.
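The combining rule in the sketch below (bandwidth divided by one plus load) is only one possible choice satisfying the stated correlations; the specification does not prescribe a particular formula. The table values mirror FIGS. 4 and 5, and the final line also previews the descending-weight, top-n selection of step 330:

```python
# Sketch of the channel-weighting rule: an indirect channel's bandwidth is
# the minimum over its hops, its load is the sum over its hops, and the
# weight rises with bandwidth and falls with load.
bandwidth = {("CPU", "GPU0"): 12, ("GPU0", "GPU1"): 24, ("CPU", "GPU1"): 12}
load_gb   = {("CPU", "GPU0"): 1,  ("GPU0", "GPU1"): 4,  ("CPU", "GPU1"): 0}

def hops(channel):
    return list(zip(channel, channel[1:]))

def key(hop):  # the component/load tables are undirected
    return hop if hop in bandwidth else (hop[1], hop[0])

def weight(channel):
    bw = min(bandwidth[key(h)] for h in hops(channel))  # bottleneck bandwidth
    ld = sum(load_gb[key(h)] for h in hops(channel))    # accumulated load
    return bw / (1 + ld)   # positive in bandwidth, negative in load

channels = [["CPU", "GPU1"], ["CPU", "GPU0", "GPU1"]]
targets = sorted(channels, key=weight, reverse=True)[:2]   # top-n selection
print([(c, round(weight(c), 2)) for c in targets])
# [(['CPU', 'GPU1'], 12.0), (['CPU', 'GPU0', 'GPU1'], 2.0)]
```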
Step 330, selecting, based on the weight values, one or more target channels from the two or more channels for transmitting the data to be transmitted.
In some embodiments, the data transmission service process may sort the channels between the first and second processing cards in descending order of weight value and select the first n channels as target channels, where n may be set according to actual needs, e.g., 1, 2, or 3.
The data transmission service process may return the identifiers of the target channels to the data transmission client. In some embodiments, the channels between the processing cards have unique identifiers, which may consist of a string or a number. In some embodiments, a channel identifier may be composed of the numbers of the processing cards along the path, such as the indirect channel CPU0-GPU3-GPU0 or the direct channel CPU0-GPU0. When there are two or more direct channels between two cards, the channel identifier may additionally include a direct-channel index, such as GPU1-GPU2-1 and GPU1-GPU2-2, where the trailing "1" and "2" are direct-channel indices. Accordingly, an indirect channel identifier may be CPU0-(GPU2-GPU1-1) or CPU0-(GPU2-GPU1-2), where the brackets may also be omitted.
At step 230, one or more data subsets of the data to be transmitted are obtained by the data transmission client based on the number of target channels and the data to be transmitted.
The data transmission client may split the data to be transmitted into one or more data subsets based on the number of target channels. In some embodiments, the data subsets correspond one-to-one with the target channels; when the number of target channels is 1, the number of data subsets is 1, i.e., the data to be transmitted is not split. In some embodiments, the inter-card data transmission request may further include the offset of the storage area of the data to be transmitted relative to the source address, the offset being the amount of the data to be transmitted; the data transmission client may then determine the amount of data in each subset based on this offset. As an example, a larger data amount may be assigned to the subset whose target channel has a larger data transmission bandwidth. For instance, with an offset of 1280 B and three target channels whose bandwidths are 48 GB/s, 12 GB/s, and 12 GB/s, the data to be transmitted may be split into three data subsets of 1000 B, 140 B, and 140 B.
The data transmission client may derive a head address and a subset offset for each data subset from the source address and the subsets' data amounts. Continuing the previous example, if the source address is 0xB13FF (hexadecimal), the first subset has head address 0xB13FF and subset offset 1000 B; the second has head address 0xB13FF + 0x3E8 (0x3E8 being 1000 in hexadecimal) and subset offset 140 B; and the third has head address 0xB13FF + 0x3E8 + 0x8C (0x8C being 140) and subset offset 140 B.
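The subset layout can be computed mechanically from the source address and the subset sizes; here is a Python sketch reproducing the example's arithmetic (the split sizes follow the text above rather than a fixed proportional formula):

```python
# Sketch of the subset-layout arithmetic: given a source head address and
# per-subset sizes, compute each subset's head address and subset offset.
def subset_layout(src_addr, sizes):
    layout, cursor = [], src_addr
    for size in sizes:
        layout.append({"head": hex(cursor), "offset": size})
        cursor += size   # next subset starts right after this one
    return layout

for s in subset_layout(0xB13FF, [1000, 140, 140]):
    print(s)
# {'head': '0xb13ff', 'offset': 1000}
# {'head': '0xb17e7', 'offset': 140}   # 0xB13FF + 0x3E8
# {'head': '0xb1873', 'offset': 140}   # ... + 0x8C more
```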
Step 240, initiating, by the data transmission client, data transmission instructions to the transmission processes on the processing cards associated with the target channels, based on the one or more data subsets.
In some embodiments, the data transmission client may communicate with a transmission process on a processing card associated with the target channel and send it a data transmission instruction. The instruction may include the head address of the data subset, the subset offset, and the target address. In some embodiments, it may also include the data amount of the data to be transmitted (and/or the number of data subsets) and the identifier of the data subset. The subset identifiers indicate the storage order of the subsets within the storage area corresponding to the source address; for example, a subset stored closer to the source address on the first processing card may get a smaller sequence number.
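Collecting the fields just listed, a data transmission instruction might be modeled as follows; the field names are assumptions for illustration, not the patent's wire format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataTransferInstruction:
    head_addr: int                      # subset head address on the previous card
    subset_offset: int                  # subset size in bytes
    target_addr: int                    # final destination address
    total_size: Optional[int] = None    # data amount of the whole transfer
    subset_id: Optional[int] = None     # storage order of this subset
    route: Optional[list] = None        # transit routing info (for indirect channels)

instr = DataTransferInstruction(
    head_addr=0xB13FF, subset_offset=1000, target_addr=0xC0000,
    total_size=1280, subset_id=1, route=["CPU0", "GPU3", "GPU0"])
print(instr)
```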
For a direct target channel, the data transmission client can communicate directly with the transmission process on the second processing card and send the data transmission instruction to it.
For an indirect target channel, the data transmission client may communicate with the nearest processing card on the channel and send the data transmission instruction to it. The nearest processing card is the card on the target channel that is directly connected to the first processing card; taking the target channel CPU0-GPU3-GPU1 as an example, the first processing card is CPU0 and the nearest processing card is GPU3. In some embodiments, the client may also pass along transit routing information, which reflects the processing cards the data subset will traverse before reaching the target address; in some embodiments, the client may simply send the target channel identifier, formed from the numbers of the cards along the path, to the nearest card's transmission process. Based on the data transmission instruction, that transmission process reads the data subset from the first processing card into local storage, then, based on the transit routing information, initiates a data transmission instruction to the transmission process of the next card on the target channel, and so on: the data subset hops across the transit processing cards along the target channel until it finally reaches the target address. More on how transmission processes transmit data subsets can be found in the description of step 250.
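The hop-by-hop behavior can be sketched with simulated card memories; the dict-based layout and function names below are illustrative only, and rebasing the subset onto the target address at the final card is omitted for brevity:

```python
# Sketch of hop-by-hop forwarding on an indirect channel: each transit card
# reads the subset from the previous card, then re-issues the instruction
# to the next card on the route. Card memories are simulated with dicts.
memories = {
    "CPU0": {0xB13FF + i: b for i, b in enumerate(b"hello world!")},
    "GPU3": {}, "GPU0": {},
}

def forward(instr):
    route, pos = instr["route"], instr["pos"]
    here, prev = route[pos], route[pos - 1]
    # Read the subset from the previous card's storage area into local memory
    # (a real transit card would use its buffer areas; see step 250).
    for i in range(instr["offset"]):
        memories[here][instr["head"] + i] = memories[prev][instr["head"] + i]
    if pos + 1 < len(route):                 # not yet at the second card:
        forward({**instr, "pos": pos + 1})   # re-issue to the next hop

forward({"route": ["CPU0", "GPU3", "GPU0"], "pos": 1,
         "head": 0xB13FF, "offset": 12})
print(bytes(memories["GPU0"][0xB13FF + i] for i in range(12)))  # b'hello world!'
```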
Step 250, transmitting the corresponding data subset through the transmission process on the processing card related to the target channel, and further transmitting the data to be transmitted to the target address.
After the transmission process on a processing card receives the data transmission instruction, it can locate the data subset in the storage area of the previous processing card based on the head address and subset offset in the instruction, and read the data from that storage area into the local card's storage area.
When the target channel is a direct channel, the transmission process on the second processing card can read the data subset directly from the storage area of the first processing card and store it in the storage area pointed to by the target address.
When the target channel is an indirect channel, the processing cards on the channel other than the first and second processing cards may be called transit processing cards. To transfer the data subset, a transit card must read it from the previous card's storage area, cache it locally, and then initiate a data transmission instruction to the next card. In some embodiments, the transmission process on the transit card may store the data subset in a global storage area or in a temporary storage area of the transmission process. The temporary storage area occupies computing resources attached to the transmission process, such as its registers, and is released when the process exits; the global storage area, by contrast, occupies a region of memory or video memory. Whether the transmission process caches data in the temporary area or the global area, the transmission performance is essentially the same; the difference lies mainly in the type of resources occupied. In some embodiments, the transmission process may check the occupancy of the temporary and global storage areas in advance and choose dynamically: for example, cache data in the temporary area when its free space exceeds a set threshold, and in the global area otherwise.
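The dynamic choice between the two cache areas reduces to a threshold test; a minimal sketch, with the threshold as an assumed parameter:

```python
# Sketch of the dynamic cache-selection rule: prefer the transmission
# process's temporary storage when it has enough free space, otherwise
# fall back to a global (memory/video-memory) area.
def pick_cache(temp_free_bytes, threshold=4096):
    return "temporary" if temp_free_bytes > threshold else "global"

print(pick_cache(8192))   # temporary
print(pick_cache(1024))   # global
```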
In some embodiments, when the transmission process decides to cache a data subset in the global storage area, it may open two or more buffer areas of preset size there. The buffer size may be fixed, such as 64 B or 128 B. For each buffer area, the transmission process may set up a corresponding task that reads a portion of the data subset and caches it in that buffer. In some embodiments, the tasks may execute in parallel or staggered by a certain time difference.
Parallel execution means the tasks run independently, unaffected by each other's execution state. Referring to FIG. 6, suppose there are two buffer areas, Buffer1 and Buffer2, each 64 B in size; the data subset is 320 B; the target channel is CPU0-GPU3-GPU0; and the current transit card is GPU3. The transmission process sets up two tasks, s1 and s2, for the buffers: based on the subset's head address and offset, s1 reads up to 64 B from CPU0 into Buffer1 on GPU3, and s2 reads up to 64 B of the remaining data into Buffer2, e.g., starting from (head address + 64 B). In some embodiments, after caching data into its buffer, a task may initiate a data transmission instruction to the transmission process of the next processing card (here the second processing card, GPU0); the instruction may include the buffer's address and the amount of data in it. Once the next card's transmission process has read the buffer, the task continues reading from the remainder of the subset: s1 may next read 64 B from (head address + 64 B + 64 B) into Buffer1, and s2 may read 64 B from (head address + 64 B + 64 B + 64 B) into Buffer2, and so on, with s1 and s2 running in parallel until all 320 B of the subset have passed through the buffers of the current transit card to the next card. The movement of data from CPU to GPU may be regarded as H2D, and transfer between different GPUs as D2D.
Executing tasks staggered by a time difference means that the start of one task depends on the execution state of another: the next task is triggered after the current one has run for some time. In some embodiments, after running for a period, a task may generate an event identifier, such as event1, or set it to a valid state such as ready; another task continuously queries the identifier and executes once its state is valid. In connection with FIG. 7, the setup is the same as in the previous example, except that task s1 generates an event identifier after storing data into Buffer1, and task s2, which keeps querying that identifier, executes once it becomes valid. Executing multiple tasks in sequence in this pipelined fashion reduces bandwidth occupation between the processing cards: less data is in flight on a channel (such as the CPU0-GPU3 direct channel) at any one time than in the parallel mode.
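The following is a serialized Python sketch of the two-buffer relay of FIGS. 6 and 7; a real implementation would overlap the fill and drain steps (in parallel, or pipelined on event identifiers), which this sketch flattens for clarity:

```python
# Sketch of the two-buffer relay: a 320 B subset moves through two 64 B
# buffer areas on the transit card; tasks s1/s2 alternately fill a buffer
# (H2D) and hand it to the next card (D2D).
CHUNK, SUBSET = 64, 320
source = bytes(range(256)) + bytes(64)          # 320 B of stand-in data
buffers = {"Buffer1": b"", "Buffer2": b""}
received = bytearray()

for i, off in enumerate(range(0, SUBSET, CHUNK)):
    buf = "Buffer1" if i % 2 == 0 else "Buffer2"  # s1 and s2 take turns
    buffers[buf] = source[off:off + CHUNK]        # fill: read into buffer
    received += buffers[buf]                      # drain: next card reads it

assert bytes(received) == source
print(f"relayed {len(received)} B through two {CHUNK} B buffers")
```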
When there are two or more transit cards on the target channel, each transit card may, in the manner described above, read the data subset from the previous card into its local buffer areas and on to the next card.
For the second processing card on an indirect target channel, i.e., the card corresponding to the target address, its transmission process can, upon receiving the data transmission instruction, read the data from the buffer areas of the previous transit card to the target address. In some embodiments, the transmission process on the second processing card may likewise set up multiple tasks that run in parallel or in a pipelined fashion, triggered by the transmission tasks of the previous card (similar to the transmission process on a transit card), reading the data in the corresponding buffers into the storage area of the target address.
In some embodiments, the data to be transmitted is divided into two or more data subsets that are transmitted to the second processing card over two or more target channels. The transmission process of the second card then stores the subsets arriving from the different target channels into the storage area of the target address in order, until all the data to be transmitted has arrived. In some embodiments, once all the data has reached the target address, the second card's transmission process may generate another event identifier, such as event2. The data transmission client can check this identifier periodically; when it is valid, the client determines that the data requested by the application process has been fully transmitted and returns a completion notice to the application process, which can then proceed with subsequent operations, such as initiating an inter-card transmission request for the next piece of data.
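This completion signalling can be mimicked with a thread event standing in for event2; the function and variable names below are assumptions for illustration:

```python
# Sketch of completion signalling: the second card raises an event once all
# subsets have landed, and the client polls it before notifying the
# application process. threading.Event stands in for the patent's event2.
import threading

done = threading.Event()
arrived, expected = [], 3

def on_subset_arrival(subset_id):
    arrived.append(subset_id)
    if len(arrived) == expected:    # all data reached the target address
        done.set()                  # "event2" becomes valid

for sid in (1, 2, 3):
    on_subset_arrival(sid)

if done.wait(timeout=1.0):          # the client's periodic check, simplified
    print("transfer complete; notify the application process")
```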
Fig. 8 is an exemplary block diagram of a data transmission system between multiple processing cards according to some embodiments of the present description. As shown in fig. 8, the system 800 may include a data transmission server 810, one or more data transmission clients (e.g., data transmission clients 821, 822, …), and one or more transmission modules (e.g., transmission modules 831, 832 …).
The data transmission client is called by an application process (such as application process 1, application process 2, etc.) to initiate an inter-card data transmission request to the data transmission server 810; the request includes a target address and a source address that correspond to different processing cards.
The data transmission server 810 responds to the inter-card data transmission request by selecting one or more target channels from the two or more channels between the first processing card corresponding to the source address and the second processing card corresponding to the target address, and returns the target channel identifiers to the data transmission client.
The data transmission client is further used to obtain one or more data subsets of the data to be transmitted, based on the number of target channels and the data to be transmitted, and to initiate data transmission instructions to the transmission modules on the processing cards associated with the target channels based on the one or more data subsets.
The transmission modules respond to the data transmission instructions by transmitting the corresponding data subsets, so that the data to be transmitted reaches the target address.
In some embodiments, the multi-processing cards are located on the motherboard of the same processing device and include a CPU board and GPU boards; there are one or more application processes, located on the CPU board and/or the GPU boards; the data transmission server is located on the CPU board; and there are one or more transmission modules, located on the CPU board and/or the GPU boards.
For more details on the modules, reference may be made to the description of FIG. 2, which is not repeated here. It should be appreciated that the system shown in FIG. 8 and its modules may be implemented in a variety of ways: in hardware, in software, or in a combination of the two. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special-purpose hardware. In some embodiments, the modules may be implemented as computer code; when executed, the client appears as a functional entity with its interface, while the server and the transmission modules appear as separate processes. Those skilled in the art will appreciate that the methods and systems described above may be implemented using computer-executable instructions and/or processor control code, provided for example on a carrier medium such as a disk, CD, or DVD-ROM, on programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The system of the present specification and its modules may be implemented not only with hardware circuits such as very-large-scale integrated circuits or gate arrays, semiconductors such as logic chips and transistors, or programmable hardware devices such as field-programmable gate arrays and programmable logic devices, but also with software executed by various types of processors, or with a combination of such hardware circuits and software (e.g., firmware).
It should be noted that the above description of the system and its modules is only for convenience of description and is not intended to limit this specification to the scope of the illustrated embodiments. It will be appreciated by those skilled in the art that, given the principles of the system, various modules may be combined arbitrarily, a subsystem may be constructed in connection with other modules, or some modules may be split into multiple modules or units, all without departing from those principles. Such variations are within the scope of the present description.
In some embodiments, system 800 may be packaged into a library for other programs' build systems to call. As an example, it may be invoked by a machine learning framework (e.g., PyTorch or TensorFlow). Such a framework is used for model training or prediction: a developer implements the training or prediction functions by writing an application program, the framework compiles the application into executable code, and that executable code carries out the inter-card data transmission steps shown in FIG. 2 by calling the modules of system 800.
Some embodiments of the present disclosure further provide a data transmission method between multi-processing cards, where direct channels and/or indirect channels are provided between the processing cards, the method comprising: receiving an inter-card data transmission request, the request including a target address, a source address, and an identifier of the data to be transmitted, where the target address and the source address correspond to different processing cards; determining two or more channels from a first processing card corresponding to the source address to a second processing card corresponding to the target address; acquiring a weight value for each of the two or more channels, the weight value being positively correlated with a channel's data transmission bandwidth and negatively correlated with its observed load; and selecting, based on the weight values, one or more target channels from the two or more channels for transmitting the data to be transmitted.
Besides multiprocessor devices, the data transmission method between multi-processing cards provided in some embodiments of this disclosure may also be used in a computing cluster comprising multiple computing devices, where each computing device in the cluster corresponds to a processing card. For details of the method, reference may be made to the description of FIG. 3, which is not repeated here.
Possible benefits of the embodiments of this description include, but are not limited to: (1) by optimizing the target-channel selection strategy, data transmission between processing cards is effectively accelerated; experiments show that with the data transmission method provided by the embodiments of this specification, CPU-to-GPU transmission reaches 2.8 times the speed of a single direct channel, and GPU-to-GPU transmission is 4.5 times faster than over a single direct channel; (2) optimizing the target-channel selection strategy also effectively balances the load on the channels between processing cards and optimizes the allocation of computing resources; (3) the method can be transparent to the application layer: an existing application program, without modification, can realize the inter-card data transmission method of some embodiments simply by calling the interfaces or modules provided herein at compile time.
While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly stated here, various modifications, improvements, and adaptations to this specification may occur to those skilled in the art. Such modifications, improvements, and adaptations are suggested within this specification and remain within the spirit and scope of its exemplary embodiments.
Meanwhile, the specification uses specific words to describe the embodiments of the specification. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present description. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present description may be combined as suitable.
Furthermore, the order in which the elements and sequences are processed, the use of numerical letters, or other designations in the description are not intended to limit the order in which the processes and methods of the description are performed unless explicitly recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.
Likewise, it should be noted that in order to simplify the presentation disclosed in this specification and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not intended to imply that more features than are presented in the claims are required for the present description. Indeed, less than all of the features of a single embodiment disclosed above.
In some embodiments, numbers describing quantities of components and attributes are used. It should be understood that such numbers, as used in the description of embodiments, are in some examples qualified by the modifiers "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows a variation of 20%. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending on the properties desired of individual embodiments. In some embodiments, numerical parameters should take the specified significant digits into account and be rounded in an ordinary manner. Although the numerical ranges and parameters used to confirm the breadth of ranges in some embodiments of this specification are approximations, in particular embodiments such numerical values are set as precisely as practicable.
Each patent, patent application, patent application publication, and other material, such as an article, book, specification, publication, or document, cited in this specification is hereby incorporated by reference in its entirety. Excluded are application history documents that are inconsistent with or conflict with the content of this specification, as well as any document, currently or later attached to this specification, that would limit the broadest scope of the claims of this specification. It is noted that, where the description, definition, and/or use of a term in material attached to this specification is inconsistent or conflicts with what is stated in this specification, the description, definition, and/or use of the term in this specification controls.
Finally, it should be understood that the embodiments described in this specification merely illustrate the principles of its embodiments. Other variations may also fall within the scope of this specification. Accordingly, by way of example and not limitation, alternative configurations of the embodiments of this specification may be regarded as consistent with its teachings. The embodiments of this specification are therefore not limited to those explicitly described and depicted herein.

Claims (11)

1. A data transmission method between multiple processing cards, wherein a direct channel and an indirect channel are provided between the multiple processing cards, the indirect channel comprising more than two hops of direct channels, and the multiple processing cards are located on a motherboard of the same processing device, the method comprising:
receiving an inter-card data transmission request, wherein the inter-card data transmission request comprises a target address and a source address, and the target address and the source address correspond to different processing cards;
determining more than two channels from a first processing card corresponding to the source address to a second processing card corresponding to the target address;
selecting more than one target channel from the more than two channels for transmitting data to be transmitted;
obtaining more than one data subset of the data to be transmitted based on the number of target channels and the data to be transmitted, wherein the more than one data subsets are transmitted through the more than one target channels;
and initiating, based on the more than one data subsets, a data transmission instruction to the processing card associated with each target channel, and transmitting the corresponding data subset through the processing card associated with the target channel, so that the data to be transmitted is transmitted to the target address.
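The following editorial sketch, which is not part of the patent and in which all names and figures are hypothetical, illustrates the flow recited in claim 1: the data to be transmitted is split into one subset per target channel, and one transfer instruction is issued per subset. A simple even split stands in here for the bandwidth-proportional split detailed in claims 2 and 5:

```python
def split_evenly(total: int, n: int) -> list[int]:
    """Split `total` bytes into n near-equal subset sizes."""
    base, rem = divmod(total, n)
    return [base + (1 if i < rem else 0) for i in range(n)]

def transmit(total_bytes: int, target_channels: list[str]) -> None:
    """One data subset, and one transfer instruction, per target channel."""
    offset = 0
    for chan, size in zip(target_channels, split_evenly(total_bytes, len(target_channels))):
        print(f"issue transfer: {size} bytes at offset {offset} via {chan}")
        offset += size

# Two candidate channels between card0 and card1, one direct and one two-hop.
transmit(1 << 20, ["card0->card1 (direct)", "card0->card2->card1 (indirect)"])
```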
2. The method of claim 1, wherein the more than one data subsets each have a different amount of data, and a data subset with a larger amount of data is transmitted over a channel with a larger data transmission bandwidth.
3. The method of claim 1, wherein the data to be transmitted comprises a result of a calculation performed by an application process.
4. The method of claim 1, wherein, to select more than one target channel from the more than two channels, the method further comprises:
acquiring a component table, wherein the component table comprises the data transmission bandwidth of each direct channel between the multiple processing cards;
taking the minimum data transmission bandwidth among the more than two hops of direct channels in an indirect channel as the data transmission bandwidth of the indirect channel;
determining a first weight value for each of the more than two channels based on its data transmission bandwidth, wherein the first weight value is positively correlated with the data transmission bandwidth of the channel;
acquiring a load table, wherein the load table comprises the observed load of each direct channel between the multiple processing cards, the observed load reflecting the amount of data being transmitted and/or to be transmitted on the corresponding channel;
accumulating the observed loads of the more than two direct channels in an indirect channel to obtain the observed load of the indirect channel;
determining a second weight value for each of the more than two channels based on the observed load, wherein the second weight value is inversely related to the observed load of the channel;
and, for each of the more than two channels, obtaining a weight value based on the first weight value and the second weight value, and selecting more than one target channel from the more than two channels based on the weight values.
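An editorial sketch of the weighting procedure of claim 4 follows; it is not from the patent, and the way the two weight values are combined is left open by the claim, so the product used below is only one plausible choice. All names and numbers are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    hop_bandwidths: list[float]  # per-hop bandwidths, from the component table
    hop_loads: list[int]         # per-hop observed loads, from the load table

def weight(c: Candidate, alpha: float = 1.0) -> float:
    bandwidth = min(c.hop_bandwidths)    # an indirect channel is as fast as its slowest hop
    load = sum(c.hop_loads)              # loads accumulate along the hops of an indirect channel
    first = bandwidth                    # positively correlated with bandwidth
    second = 1.0 / (1.0 + alpha * load)  # inversely related to the observed load
    return first * second                # one plausible combination of the two weights

def select_targets(candidates: list[Candidate], k: int) -> list[Candidate]:
    # Keep the k highest-weighted channels as target channels.
    return sorted(candidates, key=weight, reverse=True)[:k]

direct = Candidate("card0->card1", [32.0], [4096])
indirect = Candidate("card0->card2->card1", [64.0, 64.0], [0, 1024])
print([c.name for c in select_targets([direct, indirect], k=2)])
```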
5. The method of claim 1, wherein the data subsets correspond one-to-one to the target channels, and the inter-card data transmission request further comprises an offset, relative to the source address, of the storage area of the data to be transmitted; the method further comprising:
splitting the data to be transmitted into data subsets of different data amounts based on the offset, wherein the data subset corresponding to a target channel with a larger data transmission bandwidth has a larger data amount;
and obtaining a head address and a subset offset of each data subset based on the source address and the data amount of the data subset;
wherein the data transmission instruction comprises the head address and the subset offset of a data subset.
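The address arithmetic of claim 5 can be sketched as follows; this is an editorial illustration, not the patent's implementation, and the proportional-to-bandwidth sizing is one straightforward reading of "larger bandwidth, larger data amount":

```python
def split_with_offsets(source_address: int, request_offset: int, total_bytes: int,
                       bandwidths: list[float]) -> list[tuple[int, int, int]]:
    """Return one (head_address, subset_offset, length) triple per target channel,
    with subset lengths proportional to each channel's bandwidth."""
    total_bw = sum(bandwidths)
    triples, consumed = [], 0
    for i, bw in enumerate(bandwidths):
        if i == len(bandwidths) - 1:
            length = total_bytes - consumed           # last subset takes the remainder
        else:
            length = int(total_bytes * bw / total_bw)
        subset_offset = request_offset + consumed     # offset of this subset in the source region
        triples.append((source_address + subset_offset, subset_offset, length))
        consumed += length
    return triples

# A 1 MiB region starting 4 KiB past the source address, split 2:1 across two channels.
print(split_with_offsets(0x10000000, 4096, 1 << 20, [2.0, 1.0]))
```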
6. The method according to claim 1 or 5, wherein, when the target channel is an indirect channel, initiating, by a data transmission client, a data transmission instruction to a transmission process on a processing card associated with the target channel comprises:
initiating, by the data transmission client, a data transmission instruction to a transmission process on the nearest processing card on the target channel, and transmitting the corresponding data subset to the nearest processing card;
and transmitting the corresponding data subset through the processing cards associated with the target channel so that the data to be transmitted is transmitted to the target address comprises:
initiating, by the transmission process of the nearest processing card, a data transmission instruction to the transmission process of the next processing card on the target channel, and transmitting the data subset to the next processing card, and so on, until the data subset is transmitted to the target address.
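The hop-by-hop forwarding of claim 6 amounts to a relay along the cards of the indirect channel; in the claimed method each card's transmission process issues the next instruction, whereas in this editorial toy (all names hypothetical) a plain loop stands in for that chain of processes:

```python
def relay(subset: bytes, path: list[str]) -> None:
    """Forward a data subset card by card until the target address is reached."""
    for hop, card in enumerate(path):
        print(f"hop {hop}: transmission process on {card} forwards {len(subset)} bytes")
    print("data subset delivered to target address")

relay(b"\x00" * 4096, ["card1 (nearest)", "card2 (transit)", "card3 (target)"])
```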
7. The method of claim 6, wherein, for a transit processing card on the target channel used for transferring the data subset, its transmission process is configured to: cache the corresponding data subset in a global storage area or a temporary storage area of the transmission process, wherein the data subset is cached in the temporary storage area when the free space of the temporary storage area is larger than a set threshold, and is cached in the global storage area otherwise.
8. The method of claim 7, wherein the transmission process of the transit processing card is further configured to:
set more than two buffer areas of a preset size in the global storage area, and more than two tasks, wherein each task is used for reading, according to the head address of the data subset in the data transmission instruction, the subset offset, and the amount of the data subset already cached, data of an amount matching the preset size from the remaining data of the corresponding data subset into a corresponding buffer area;
wherein the more than two tasks are executed in parallel or staggered by a certain time difference.
9. The method of claim 8, wherein one of the more than two tasks starts executing after another task has read data into its corresponding buffer area.
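Read together, claims 7 to 9 describe a transit-side buffering scheme: a threshold decides between temporary and global storage, and multiple fixed-size buffers are drained by staggered tasks. The sketch below is editorial, not from the patent; the buffer size, threshold, and all names are illustrative only:

```python
import threading

PRESET_SIZE = 64 * 1024          # preset buffer size (illustrative)
TEMP_FREE_THRESHOLD = 1 << 20    # threshold on free temporary storage (illustrative)

def choose_store(temp_free_bytes: int) -> str:
    # Claim 7: use the temporary storage area only when enough of it is free.
    return "temporary" if temp_free_bytes > TEMP_FREE_THRESHOLD else "global"

def transit_copy(subset: bytes, ntasks: int = 2) -> bytearray:
    """Claims 8-9 in miniature: ntasks tasks, each with its own preset-size
    buffer, read successive chunks of the subset round-robin; a task starts
    only after its predecessor has read data into its buffer (claim 9)."""
    out = bytearray(len(subset))
    first_read_done = [threading.Event() for _ in range(ntasks)]

    def task(tid: int) -> None:
        if tid > 0:
            first_read_done[tid - 1].wait()        # staggered start per claim 9
        first = True
        for off in range(tid * PRESET_SIZE, len(subset), ntasks * PRESET_SIZE):
            buf = subset[off:off + PRESET_SIZE]    # read a chunk into this task's buffer
            if first:
                first_read_done[tid].set()         # release the next task
                first = False
            out[off:off + len(buf)] = buf          # drain the buffer toward the next hop
        first_read_done[tid].set()                 # handle subsets shorter than the stride

    threads = [threading.Thread(target=task, args=(t,)) for t in range(ntasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

data = bytes(range(256)) * 2048                    # 512 KiB test subset
assert transit_copy(data) == data
print(choose_store(temp_free_bytes=2 << 20))       # -> "temporary"
```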
10. A storage medium storing computer instructions which, when executed by a processor or processing card, implement the method of any one of claims 1 to 9.
11. A computer device, comprising a plurality of processing cards and a storage medium, wherein a direct channel and an indirect channel exist between the plurality of processing cards, the indirect channel comprises more than two hops of direct channels, the storage medium stores computer instructions, and the plurality of processing cards are configured to execute the computer instructions to implement the method of any one of claims 1-9.
CN202311484181.5A 2023-06-15 2023-06-15 Data transmission method between multi-processing cards Pending CN117290280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311484181.5A CN117290280A (en) 2023-06-15 2023-06-15 Data transmission method between multi-processing cards

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310707879.2A CN116450564B (en) 2023-06-15 2023-06-15 A data transmission method and system between multiple processing cards
CN202311484181.5A CN117290280A (en) 2023-06-15 2023-06-15 Data transmission method between multi-processing cards

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202310707879.2A Division CN116450564B (en) 2023-06-15 2023-06-15 A data transmission method and system between multiple processing cards

Publications (1)

Publication Number Publication Date
CN117290280A true CN117290280A (en) 2023-12-26

Family

ID=87127740

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202311484181.5A Pending CN117290280A (en) 2023-06-15 2023-06-15 Data transmission method between multi-processing cards
CN202310707879.2A Active CN116450564B (en) 2023-06-15 2023-06-15 A data transmission method and system between multiple processing cards

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202310707879.2A Active CN116450564B (en) 2023-06-15 2023-06-15 A data transmission method and system between multiple processing cards

Country Status (1)

Country Link
CN (2) CN117290280A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119576847B (en) * 2024-11-29 2025-08-29 摩尔线程智能科技(北京)股份有限公司 A method and device for data transmission between GPU chips, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9977998B2 (en) * 2015-04-27 2018-05-22 Sony Corporation Method and system for processing one or more multi-channel images
CN113569245B (en) * 2020-04-28 2025-05-20 阿里巴巴集团控股有限公司 Processing device, embedded system, system on chip and security control method
CN112422251B (en) * 2020-11-06 2023-03-31 深圳市欢太科技有限公司 Data transmission method and device, terminal and storage medium
CN113065636B (en) * 2021-02-27 2024-06-07 华为技术有限公司 Pruning processing method, data processing method and equipment for convolutional neural network
CN113810459B (en) * 2021-07-29 2024-11-01 奇安信科技集团股份有限公司 Data transmission method, device, electronic equipment and storage medium
CN113691460B (en) * 2021-08-26 2023-10-03 平安科技(深圳)有限公司 Data transmission method, device, equipment and storage medium based on load balancing
CN115827269B (en) * 2022-11-18 2024-04-12 北醒(北京)光子科技有限公司 Inter-core communication channel construction method, device, storage medium and laser radar
CN116243995B (en) * 2023-05-12 2023-08-04 苏州浪潮智能科技有限公司 Communication method, device, computer-readable storage medium, and electronic device

Also Published As

Publication number Publication date
CN116450564A (en) 2023-07-18
CN116450564B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
US8683468B2 (en) Automatic kernel migration for heterogeneous cores
CN1128401C (en) Method and system for single cycle dispatch of multiple instructions in superscalar processor system
RU2427895C2 (en) Multiprocessor architecture optimised for flows
CN114895965B (en) Method and apparatus for out-of-order pipeline execution to enable static mapping of workloads
CN1945525A (en) System and method for lifetime counter design for handling instruction flushes from a queue
US20050066302A1 (en) Method and system for minimizing thread switching overheads and memory usage in multithreaded processing using floating threads
CN102667714B (en) Support the method and system that the function provided by the resource outside operating system environment is provided
US10073783B2 (en) Dual mode local data store
US8984231B2 (en) Methods and apparatus to perform adaptive pre-fetch operations in managed runtime environments
JP2006513493A (en) Managing memory by using a free buffer pool
CN1092188A (en) Method and system for enhancing instruction scheduling in superscalar processor system using independent access intermediate memory
KR20230053608A (en) Deferred GPR Allocation for Texture/Load Instruction Blocks
CN114610394B (en) Instruction scheduling method, processing circuit and electronic equipment
EP1794674A1 (en) Dynamic loading and unloading for processing unit
CN119539001B (en) A neural network extension execution computer system based on RISC-V
EP4062276A1 (en) Pre-instruction scheduling rematerialization for register pressure reduction
CN114153500A (en) Instruction scheduling method, instruction scheduling device, processor and storage medium
CN118796272B (en) A memory access method, processor, electronic device and readable storage medium
CN115129480A (en) Scalar processing unit and access control method thereof
US8006238B2 (en) Workload partitioning in a parallel system with hetergeneous alignment constraints
CN117290280A (en) Data transmission method between multi-processing cards
US6651245B1 (en) System and method for insertion of prefetch instructions by a compiler
US8327122B2 (en) Method and system for providing context switch using multiple register file
WO2023142091A1 (en) Computing task scheduling apparatus, computing apparatus, computing task scheduling method and computing method
US6742086B1 (en) Affinity checking process for multiple processor, multiple bus optimization of throughput

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination