CN120011298A

CN120011298A - An Extended Topology System of Multi-processor Interconnection

Info

Publication number: CN120011298A
Application number: CN202411626735.5A
Authority: CN
Inventors: 李兆石; 丛高建; 付轩; 刘晓青; 魏莉
Original assignee: Muxi Integrated Circuit Shanghai Co ltd
Current assignee: Muxi Integrated Circuit Shanghai Co ltd
Priority date: 2024-01-04
Filing date: 2024-11-14
Publication date: 2025-05-16
Also published as: WO2025113544A1

Abstract

The invention relates to the field of chip design, and in particular to an extended topology system for multi-processor interconnection. The system comprises N groups of processors, each group comprising M processors, with the processors distributed in the same positions in every group. Each group of processors comprises two layers of interconnection structure: an intra-group interconnection structure, in which the M processors are fully connected point to point, and an inter-group interconnection structure, which comprises M ring connection structures, the processors connected in each ring occupying the same position in their respective groups. The resulting extended topology system increases the number of interconnected processors while conforming to the original protocol.

Description

Expansion topology system for multiprocessor interconnection

Technical Field

The invention relates to the field of chip design, in particular to an extended topology system for multiprocessor interconnection.

Background

Training and inference of large models require tensor, pipeline, and data parallelism in the high-bandwidth domain, and only pipeline and data parallelism in the low-bandwidth domain. The high-bandwidth domain is realized by a GPU manufacturer interconnecting multiple GPUs through a high-speed interconnection protocol, whereas the low-bandwidth domain can be realized with Ethernet. Large models depend heavily on tensor parallelism, and other forms of parallelism are relied upon only when the number of GPUs supported in the high-bandwidth domain is insufficient. Therefore, the more GPUs the high-bandwidth domain supports, the better it can support training and inference of large models.

At present, GPUs are interconnected through the OAM protocol. Because the route between any two GPUs is fixed in the OAM protocol and cannot be changed, a GPU interconnection topology based on the OAM protocol supports at most 8 GPUs. Each GPU reserves 8 interconnection ports, and any two of the 8 GPUs can be directly interconnected point to point, but owing to the limitation of the OAM protocol the topology cannot be extended to more than 8 GPUs. Therefore, an interconnection topology that supports more than 8 GPUs is needed.

Disclosure of Invention

To address the above technical problem, the invention adopts the following technical scheme: an extended topology system for multi-processor interconnection comprising N groups of processors, each group comprising M processors, with the processors distributed in the same positions in every group. Each group of processors comprises two layers of interconnection structure, namely an intra-group interconnection structure and an inter-group interconnection structure. In the intra-group interconnection structure, the M processors are fully connected point to point. The inter-group interconnection structure comprises M ring connection structures, and the processors connected in each ring connection structure occupy the same position in their respective groups.

The invention has at least the following beneficial effects:

The extended topology system for multiprocessor interconnection provided by the invention comprises N groups of processors, each group comprising M processors distributed in the same positions, and each group of processors comprises two layers of interconnection structure, namely an intra-group interconnection structure and an inter-group interconnection structure. As a result, the number of interconnected processors is extended while the extended topology system still conforms to the original protocol.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of the topology in which the OAM protocol supports interconnection of at most 8 processors;

FIG. 2 is a schematic diagram of an extended topology system according to a first embodiment of the present invention;

FIG. 3 is a schematic diagram of an extended topology system according to a second embodiment of the present invention;

FIG. 4 is a schematic diagram of an extended topology system according to a third embodiment of the present invention;

FIG. 5 is a schematic diagram of an extended topology system according to a fourth embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

Referring to FIG. 1, FIG. 1 is a schematic diagram of the topology in which the OAM protocol supports interconnection of at most 8 processors. In FIG. 1, the 8 processors OAM0-OAM7 can all be directly interconnected, i.e., any two processors are interconnected point to point. The interconnection protocol is the OAM protocol, in which, once the addresses of the source processor and the destination processor are determined, the route between the two is also uniquely determined, because the route is fixed in hardware and cannot be changed. The route is {source processor address, source processor port index number, destination processor address, destination processor port index number}. In the training and inference of large models, point-to-point full interconnection between processors is not required; only collective communication modes such as allreduce and alltoall need to be supported. In allreduce mode, data is collected from every graphics card and aggregated, and the aggregated result is then distributed to every graphics card. In alltoall mode, each graphics card distributes its data to every graphics card and collects data from every graphics card. Processors may therefore be interconnected point to point, or indirectly through forwarding by other processors. The problem to be solved is thus how to extend the number of interconnected processors while conforming to the original protocol.
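The two collective modes named above can be illustrated with a minimal sketch. It models each graphics card as a plain Python list and assumes summation as the aggregation for allreduce; the function names are illustrative and are not taken from the patent or from any communication library.

```python
from typing import List


def allreduce(per_device: List[List[float]]) -> List[List[float]]:
    """Aggregate element-wise across devices (sum is an assumption here),
    then give every device a copy of the aggregated result."""
    total = [sum(values) for values in zip(*per_device)]
    return [list(total) for _ in per_device]


def alltoall(per_device: List[List[float]]) -> List[List[float]]:
    """Device i sends its j-th chunk to device j, so device j ends up
    holding the j-th chunk of every device (a transpose of the chunks)."""
    return [list(column) for column in zip(*per_device)]


if __name__ == "__main__":
    data = [[1, 2], [3, 4]]  # two devices, two chunks each
    print(allreduce(data))
    print(alltoall(data))
```

Neither operation requires that every pair of devices be directly wired together, which is why forwarding through intermediate processors is acceptable.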

Note that the topology in FIG. 1 is hereinafter referred to as the original topology and is not described again below.

The invention provides an extended topology system for multi-processor interconnection, which comprises N groups of processors, each group comprising M processors, with the processors distributed in the same positions in every group. Each group of processors comprises two layers of interconnection structure, namely an intra-group interconnection structure and an inter-group interconnection structure. In the intra-group interconnection structure, the M processors are fully connected point to point. The inter-group interconnection structure comprises M ring connection structures, and the processors connected in each ring connection structure occupy the same position in their respective groups.
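The two-layer structure can be sketched as a small graph builder. This is a minimal sketch under stated assumptions: processors are labeled as hypothetical (group, position) pairs, and each ring visits the groups in index order 0, 1, ..., N-1, whereas the figures of the patent use their own ring orders and port assignments.

```python
from collections import Counter
from itertools import combinations


def build_topology(n_groups: int, m: int):
    """Return (intra_links, inter_links) for N groups of M processors.

    Each link is a frozenset of two (group, position) labels.
    """
    intra = set()
    for g in range(n_groups):
        # Intra-group layer: point-to-point full connection of the M processors.
        for a, b in combinations(range(m), 2):
            intra.add(frozenset({(g, a), (g, b)}))
    inter = set()
    for p in range(m):
        # Inter-group layer: one ring per position, M rings in total.
        for g in range(n_groups):
            inter.add(frozenset({(g, p), ((g + 1) % n_groups, p)}))
    return intra, inter


def degrees(intra, inter) -> Counter:
    """Count how many links (ports) each processor uses."""
    deg = Counter()
    for link in list(intra) + list(inter):
        for node in link:
            deg[node] += 1
    return deg


if __name__ == "__main__":
    intra, inter = build_topology(4, 4)
    print(len(intra), len(inter), set(degrees(intra, inter).values()))
```

With 4 groups of 4 processors, every processor uses M-1 = 3 ports inside its group and 2 ports on its ring, 5 ports in total.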

Optionally, the processor is a GPU or a GPGPU. Other processors in the prior art are also within the scope of the present invention.

Optionally, N is greater than 8, and N is a multiple of 2. Preferably, N is a multiple of 8.

Optionally, M is greater than 1. Preferably, M is equal to 4.

As a preferred embodiment, the intra-group interconnect structure of each group of processors is a full interconnect structure in the original topology.

As a preferred embodiment, the processors of each group are internally fully interconnected by the same M-1 ports.

As a preferred embodiment, a software address remapping table is searched to obtain the remapped address of each processor address, and the route corresponding to the remapped address is then obtained. The software address remapping table contains each processor address together with its remapped address. When the processor address is not greater than 7, i.e., it is any one of S0-S7, the remapped address is the processor address itself. When the processor address is greater than 7, the remapped address is the processor address taken modulo 8. The table is searched with the address of the source processor and with the address of the destination processor to obtain the remapped address of each. In this way, all processor addresses of the N groups of processors are remapped into S0-S7, so that the processor addresses used during data transmission conform to the interconnection routes fixed in hardware, and the number of interconnected processors is extended while conforming to the original protocol.
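The table lookup described above might look like the following minimal sketch. It assumes the modulus is 8 (the number of addresses S0-S7 in the original topology) and a 16-processor system; the names `ORIGINAL_SIZE`, `remap_address`, `REMAP_TABLE` and `remapped_pair` are hypothetical. The hardware route table itself is not reproduced, because the patent does not list its contents.

```python
ORIGINAL_SIZE = 8  # assumption: addresses S0-S7 of the original topology


def remap_address(addr: int) -> int:
    """Map a processor address into the range S0-S7."""
    if addr < 0:
        raise ValueError("processor address must be non-negative")
    # Addresses 0-7 map to themselves; larger addresses are reduced modulo 8.
    return addr if addr < ORIGINAL_SIZE else addr % ORIGINAL_SIZE


# Software address remapping table for a 16-processor system (S0-SF).
REMAP_TABLE = {addr: remap_address(addr) for addr in range(16)}


def remapped_pair(src: int, dst: int) -> tuple:
    """Look up source and destination separately, as the text describes.

    The returned pair is what would be used to select the hardware-fixed
    route of the original topology.
    """
    return REMAP_TABLE[src], REMAP_TABLE[dst]


if __name__ == "__main__":
    print(remapped_pair(0x0, 0xF))
```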

Referring to FIG. 2, a first type of interconnection topology is shown. The topology system of this type includes 4 groups of processors, from a zeroth processor S0 to a fifteenth processor SF, 16 processors in total. The first group of processors includes the zeroth processor S0, the third processor S3, the fourth processor S4 and the seventh processor S7. The second group includes the first processor S1, the second processor S2, the fifth processor S5 and the sixth processor S6. The third group includes the tenth processor SA, the ninth processor S9, the fourteenth processor SE and the thirteenth processor SD. The fourth group includes the fifteenth processor SF, the eleventh processor SB, the twelfth processor SC and the eighth processor S8. Each processor comprises 7 ports. Each port of a processor corresponds to a unique index number, and the index numbers of different processors are independent of one another; that is, the ports of every processor are numbered from a first port index number link1 to a seventh port index number link7. Each of the 16 processors in FIG. 2 occupies 5 ports in total in the interconnection topology. The same three ports are used inside each group to realize full interconnection; in FIG. 2, intra-group full interconnection is realized through link1, link2 and link3. Inter-group interconnection is formed into ring structures by hopping between the remaining ports of the interconnected processors.

S0, S5, SF and SA occupy the same position in their respective groups. In the interconnection structure they form, the ports are connected as follows: link7 of S0 is connected with link7 of S5, link4 of S5 with link4 of SF, link7 of SF with link7 of SA, and link6 of SA with link6 of S0, so that the processors are connected end to end into a ring structure.

Similarly, S4, S1, SB and SE occupy the same position in their respective groups. Link7 of S4 is connected with link7 of S1, link4 of S1 with link4 of SB, link7 of SB with link7 of SE, and link6 of SE with link6 of S4, forming a ring structure end to end.

Similarly, S3, S6, SC and S9 occupy the same position in their respective groups. Link7 of S3 is connected with link7 of S6, link6 of S6 with link6 of SC, link7 of SC with link7 of S9, and link4 of S9 with link4 of S3, forming a ring structure end to end.

Similarly, S7, S2, S8 and SD occupy the same position in their respective groups. Link7 of S7 is connected with link7 of S2, link6 of S2 with link6 of S8, link7 of S8 with link7 of SD, and link4 of SD with link4 of S7, forming a ring structure end to end.

On the basis of FIG. 2, the 16 processor addresses S0-SF are remapped to the 8 processor addresses S0-S7 by means of the software address remapping table. The interconnection route between any two processors in FIG. 2 is then the same as that of the original 8 processors in FIG. 1. That is, the number of interconnected processors is extended without changing the routes fixed in hardware.
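One property that can be checked for the FIG. 2 topology is that, after remapping, the five directly connected neighbors of each processor have five distinct remapped addresses, none equal to the processor's own remapped address. The sketch below is an illustrative sanity check rather than a procedure stated in the patent. It takes the group and ring membership from the description of FIG. 2, treats S0-SF as the sixteen addresses 0x0-0xF, and assumes remapping by modulo 8.

```python
# Group and ring membership as listed in the description of FIG. 2.
GROUPS = [
    [0x0, 0x3, 0x4, 0x7],
    [0x1, 0x2, 0x5, 0x6],
    [0xA, 0x9, 0xE, 0xD],
    [0xF, 0xB, 0xC, 0x8],
]
RINGS = [
    [0x0, 0x5, 0xF, 0xA],
    [0x4, 0x1, 0xB, 0xE],
    [0x3, 0x6, 0xC, 0x9],
    [0x7, 0x2, 0x8, 0xD],
]


def remap(addr: int) -> int:
    """Assumed remapping: S0-S7 keep their address, S8-SF drop by 8."""
    return addr % 8


def neighbors(addr: int) -> set:
    """Processors directly wired to `addr`: group peers plus ring neighbors."""
    result = set()
    for group in GROUPS:
        if addr in group:
            result.update(p for p in group if p != addr)
    for ring in RINGS:
        if addr in ring:
            i = ring.index(addr)
            result.add(ring[i - 1])
            result.add(ring[(i + 1) % len(ring)])
    return result


def remapped_neighbors_ok(addr: int) -> bool:
    """True if the neighbors' remapped addresses are distinct and not our own."""
    remapped = [remap(n) for n in neighbors(addr)]
    return len(set(remapped)) == len(remapped) and remap(addr) not in remapped


if __name__ == "__main__":
    print(all(remapped_neighbors_ok(a) for a in range(16)))
```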

Note that, in FIG. 2, the ports forming the ring-shaped inter-group interconnection structures have the index numbers link7, link6 and link4; in ring order they are link7, link6, link7 and link4, or link7, link4, link7 and link6. As an equivalent implementation, the index numbers of the interconnection ports may instead be, in order, link6, link7, link6 and link4, or link6, link4, link6 and link7. The index number of any one port in the ring structure may also be replaced with link5; for example, if link5 is used in place of link7, the index numbers of the ports forming the ring are link5, link6 and link4.

Referring to FIG. 3, a second type of interconnection topology is shown, again comprising 4 groups of processors. Each of the 16 processors in FIG. 3 occupies a total of 5 ports in the interconnection topology. The processors within each group are fully interconnected, and inter-group interconnection is achieved by connecting the processors at the same position in each group through the remaining ports to form a ring structure. The 4 groups of processors are S0-S3, S4-S7, S8-SB and SC-SF. Within each group, full interconnection is realized through link3, link4, link5 and link6. Inter-group interconnection is formed into ring structures by hopping between the remaining ports of the interconnected processors.

S0, S5, SB and SE, which occupy the same position in their respective groups, are interconnected into a ring structure as follows: link7 of S0 is connected with link7 of S5, link6 of S5 with link6 of SB, link7 of SB with link7 of SE, and link5 of SE with link5 of S0, forming a ring structure end to end.

Similarly, S1, S4, SA and SF are interconnected into a ring structure: link7 of S1 is connected with link7 of S4, link4 of S4 with link4 of SA, link7 of SA with link7 of SF, and link5 of SF with link5 of S1.

Similarly, S3, S6, S8 and SD are interconnected into a ring structure: link7 of S3 is connected with link7 of S6, link5 of S6 with link5 of S8, link7 of S8 with link7 of SD, and link6 of SD with link6 of S3.

Similarly, S2, S7, S9 and SC are interconnected into a ring structure: link7 of S2 is connected with link7 of S7, link5 of S7 with link5 of S9, link7 of S9 with link7 of SC, and link4 of SC with link4 of S2.
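The port budget of FIG. 3 can be tallied with a short sketch. The ring links below are transcribed from the description. The reading that each processor fully connects to its three group peers through whichever three of link3-link6 its ring does not use is an assumption, since the text lists the intra-group ports only collectively.

```python
# (processor a, port on a, processor b, port on b), transcribed from FIG. 3.
RING_LINKS = [
    (0x0, 7, 0x5, 7), (0x5, 6, 0xB, 6), (0xB, 7, 0xE, 7), (0xE, 5, 0x0, 5),
    (0x1, 7, 0x4, 7), (0x4, 4, 0xA, 4), (0xA, 7, 0xF, 7), (0xF, 5, 0x1, 5),
    (0x3, 7, 0x6, 7), (0x6, 5, 0x8, 5), (0x8, 7, 0xD, 7), (0xD, 6, 0x3, 6),
    (0x2, 7, 0x7, 7), (0x7, 5, 0x9, 5), (0x9, 7, 0xC, 7), (0xC, 4, 0x2, 4),
]
INTRA_CANDIDATES = {3, 4, 5, 6}  # ports named for intra-group interconnection


def ring_ports_by_processor(links) -> dict:
    """Collect the set of port indices each processor spends on its ring."""
    ports = {}
    for a, port_a, b, port_b in links:
        ports.setdefault(a, set()).add(port_a)
        ports.setdefault(b, set()).add(port_b)
    return ports


def free_intra_ports(links) -> dict:
    """Ports among link3-link6 left over for intra-group links (assumed reading)."""
    used = ring_ports_by_processor(links)
    return {proc: INTRA_CANDIDATES - ports for proc, ports in used.items()}


if __name__ == "__main__":
    for proc, ports in sorted(free_intra_ports(RING_LINKS).items()):
        print(f"S{proc:X}: intra-group ports {sorted(ports)}")
```

Under this reading, every processor spends 2 ports on its ring and has exactly 3 ports left for its group, which matches the stated total of 5 ports.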

On the basis of FIG. 3, the 16 processor addresses S0-SF are remapped to the 8 processor addresses S0-S7 through the software address remapping table, so that the number of interconnected processors is extended without changing the routes fixed in hardware.

Referring to FIG. 4, a third type of interconnection topology is shown, again comprising 4 groups of processors. The 16 processors in the interconnection topology of FIG. 4 occupy a total of 6 ports. The processors within each group are fully interconnected, and inter-group interconnection is achieved by connecting the processors at the same position in each group through the remaining ports to form a ring structure. The 4 groups of processors are: S0, S2, S5 and S7; S1, S3, S4 and S6; S8, SA, SD and SF; and S9, SB, SC and SE. Within each group, full interconnection is realized through link2, link6, link7 and link4. Inter-group interconnection is formed by interconnecting the remaining ports of the processors into ring structures.

S0, S1, SD and SC, which occupy the same position in their respective groups, are interconnected into a ring structure as follows: link4 of S0 is connected with link6 of S1, link1 of S1 with link1 of SD, link5 of SD with link5 of SC, and link1 of SC with link1 of S0, forming a ring structure end to end.

Similarly, S2, S6, SF and SB are interconnected into a ring structure: link1 of S2 is connected with link1 of S6, link4 of S6 with link6 of SF, link1 of SF with link1 of SB, and link5 of SB with link5 of S2.

Similarly, S7, S3, SA and SE are interconnected into a ring structure: link1 of S7 is connected with link1 of S3, link5 of S3 with link5 of SA, link1 of SA with link1 of SE, and link4 of SE with link6 of S7.

Similarly, S5, S4, S8 and S9 are interconnected into a ring structure: link5 of S5 is connected with link5 of S4, link1 of S4 with link1 of S8, link4 of S8 with link6 of S9, and link1 of S9 with link1 of S5.

The extended topology system of 16 processors shown in FIG. 4 likewise remaps the addresses S0-SF into S0-S7 through address remapping, so that the number of interconnected processors is extended without changing the routes fixed in hardware.

Equivalent topology systems of the 16-processor extended topology systems shown in FIG. 2, FIG. 3 and FIG. 4 further include the interconnection structures obtained by exchanging the positions of the groups of processors within the extended topology system. Any extended topology system in which the index numbers are changed such that the final result is the same as the hardware-fixed routes determined by the OAM protocol falls within the protection scope of the invention.

Referring to FIG. 5, a fourth type of interconnection topology is shown. It is a further extension of the interconnection topology of FIG. 3 that interconnects 32 processors, which is equivalent to interconnecting 4 sets of the original interconnection topology. The interconnection of the 32 processors comprises 8 groups; the processors inside each group are fully interconnected, and the processors at the same position in the groups are interconnected in sequence to form ring-shaped interconnection structures. The address remapping uses the same software address remapping table as FIG. 3.

As a preferred embodiment, the extended topology systems of FIG. 2 and FIG. 4 can be extended again in the manner of the extended topology system of FIG. 5. The schemes shown in FIG. 2, FIG. 3 and FIG. 4 can be extended repeatedly, with the processors within each group fully interconnected and the processors at the same position across groups interconnected into ring structures, thereby realizing the interconnection of 8*N processors.

As a preferred embodiment, hardware address remapping may be performed at the source processor so that the route conforms to the rule checked by the destination processor, or it may be performed at the destination processor. In either case, the hardware address remapping is added in hardware on the path between the source processor and the destination processor. As an example, when S4 expects to communicate with S0 through link1 of S0 but is in fact connected to link7 of S0, link7 can be remapped to link1 by address remapping.
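The S4/S0 example can be sketched as a per-link port translation. This is only a software model of a mechanism the patent places in hardware. The table layout and the names `PORT_REMAP` and `remap_port` are hypothetical, and the single table entry is the one example given in the text.

```python
# Hypothetical translation table: (source, destination) -> {physical port: expected port}.
# The only entry is the example from the text: S4 reaches S0 on link7,
# while the fixed route expects link1 of S0.
PORT_REMAP = {
    (0x4, 0x0): {7: 1},
}


def remap_port(src: int, dst: int, physical_port: int) -> int:
    """Return the port index the destination's route check expects.

    Ports without an entry are passed through unchanged.
    """
    return PORT_REMAP.get((src, dst), {}).get(physical_port, physical_port)


if __name__ == "__main__":
    print(remap_port(0x4, 0x0, 7))
```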

As a preferred embodiment, when 16 processors are interconnected, each of the addresses S0-S7 is remapped to itself, and each of the addresses S8-SF is remapped to the current processor address minus 8.

While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

1. An extended topology system for interconnecting multiple processors, characterized in that the extended topology system comprises N groups of processors;

Each processor group includes M processors;

The positions of the processors in each group of processors are distributed in the same way;

Each group of processors includes two layers of interconnection structures: an intra-group interconnection structure and an inter-group interconnection structure; wherein the M processors in the intra-group interconnection structure are point-to-point fully connected; and the inter-group interconnection structure includes M ring connection structures, and each processor connected in the ring connection structure has the same distribution position in each group of processors.

2. The system according to claim 1, characterized in that the intra-group interconnection structure of each group of processors is the full interconnection structure of the original topology structure.

3. The system according to claim 1, characterized in that the processors of each group are fully interconnected through the same M-1 ports.

4. The system according to claim 3, characterized in that the interconnection between groups forms a ring structure by jumping between the remaining ports of the interconnected processors.

5. The system according to claim 1, characterized in that M is equal to 4.

6. The system according to claim 1, wherein the processor is a GPU or a GPGPU.

7. The system according to claim 1, characterized in that the system further comprises: searching a software address remapping table to obtain a remapped address of each processor address, and obtaining a route corresponding to the remapped address.

8. The system according to claim 7, characterized in that when the processor address is greater than 7, the remapped address of the processor address is obtained by a modulo operation on the processor address.