
WO2024038544A1 - Interconnection apparatus and switching system - Google Patents


Info

Publication number
WO2024038544A1
WO2024038544A1 (PCT/JP2022/031223)
Authority
WO
WIPO (PCT)
Prior art keywords
optical
switch
packet
xpu
switching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2022/031223
Other languages
French (fr)
Inventor
Ibrahim Salah
Yusuke MURANAKA
Toshikazu Hashimoto
Takeshi Sakamoto
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Inc
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to PCT/JP2022/031223 priority Critical patent/WO2024038544A1/en
Priority to JP2025508714A priority patent/JP2025526149A/en
Publication of WO2024038544A1 publication Critical patent/WO2024038544A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 49/00: Packet switching elements
    • H04L 49/10: Packet switching elements characterised by the switching fabric construction
    • H04L 49/15: Interconnection of switching modules
    • H04Q: SELECTING
    • H04Q 11/00: Selecting arrangements for multiplex systems
    • H04Q 11/0001: Selecting arrangements for multiplex systems using optical switching
    • H04Q 11/0005: Switch and router aspects

Definitions

  • the present invention relates to an interconnection apparatus and a switching system, each of which interconnects a processor and a switch in a computing system.
  • a high-performance computing system is formed by a large number of XPUs as computing units to improve the use efficiency of resources and reduce the latency.
  • the XPU is a generic abbreviation for different types of processing units (processors).
  • the XPU includes the commonly used CPU (Central Processing Unit) and the application-oriented GPU (Graphics Processing Unit).
  • the physical data transfer rate and the data transmission protocol execution rate in the transmission link and switching fabric must be improved.
  • an increase in protocol execution rate must be achieved at both the hardware level and the software level.
  • a device driver 401 supplies, to a Network Interface Card (NIC) 410, in advance, the address of a memory queue in which an input packet is stored.
  • OS: Operating System
  • NIC: Network Interface Card
  • the NIC 410 transfers a received packet to a queue 404 (the arrow indicated by the dotted line in Fig. 18A) and generates an interrupt signal in an XPU 420.
  • an XPU memory 421 is connected to the XPU 420.
  • the application executed by the XPU 420 is suspended, and context switching is executed to save all related processing data in designated registers so that this application can be restarted later.
  • another processing action such as pipeline flush processing is also executed.
  • protocol steps are executed for the received packet.
  • the header checksum is verified and the fragments of a transmission data flow are reassembled in an IP layer 402 and a TCP layer 403.
  • an application level payload is distributed to the queue 404 of the corresponding end point.
  • Fig. 18B shows a normal XPU processing timeline (without receiving a packet) and a timeline of the XPU which executes the same processing upon receiving the packet.
  • second processing 432 is executed after first processing 431 (430 in Fig. 18B).
  • the first processing 431 is interrupted, and packet reception processing 441 is executed.
  • a context switch (CS) 442 is performed (440 in Fig. 18B).
  • Fig. 19 shows the relationship between the Ethernet bandwidth and the transmission message size in a 1-Gb/s Ethernet link.
  • Fig. 19 shows the relationship between the Ethernet frame (1,500 bytes indicated by reference numeral 451) of the standard size and the Ethernet frame of a large size (9,000 bytes indicated by reference numeral 452).
  • the bandwidth increases with an increase in message size. In the region of large message sizes (about 512 octets or more), the bandwidth approaches its upper limit and rarely increases further.
  • the protocol will not be executed for a new input packet until the completion of the protocol which is being executed.
  • the duration of a short Ethernet packet payload is only a small part of the actual protocol execution time. As shown in Fig. 19, the effective bandwidth of packet reception is therefore very small for short payloads and improves as the payload size increases toward the upper limit.
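The trend described for Fig. 19 can be illustrated with a hedged back-of-the-envelope model. The overhead figures below are common Ethernet/TCP/IP values assumed for illustration, not taken from this description: per-packet overhead is amortized over the payload, so the effective bandwidth rises toward the link rate as the message grows.

```python
# Illustrative model of effective receive bandwidth vs. message size.
# Overhead values are assumed: 38 bytes of Ethernet framing (preamble,
# MAC header, FCS, inter-frame gap) plus 20-byte IP and 20-byte TCP
# headers, on a 1-Gb/s link.

LINK_GBPS = 1.0
OVERHEAD_BYTES = 38 + 20 + 20

def effective_gbps(payload_bytes: int) -> float:
    """Share of the link rate actually carrying application payload."""
    return LINK_GBPS * payload_bytes / (payload_bytes + OVERHEAD_BYTES)

for size in (64, 512, 1500, 9000):
    print(size, round(effective_gbps(size), 3))
```

Under these assumptions the effective bandwidth grows steeply for small messages and saturates near the link rate beyond a few hundred octets, matching the shape described for Fig. 19.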
  • the input packet processing operations are performed in order of packet arrival, without priorities.
  • the loss of QoS control over input packets poses a problem of computing overhead.
  • the application with the highest priority may wait for a long time for the execution of a low-priority protocol that arrived first.
  • an interconnection apparatus for interconnecting a switching apparatus and an XPU connected to an XPU memory includes a protocol processor configured to execute an off-loaded protocol from the XPU and directly access the XPU memory.
  • an interconnection apparatus and a switching system, each of which executes a protocol while improving the use efficiency of an XPU.
  • Fig. 1A is a schematic view showing the arrangement of an interconnection apparatus according to the first embodiment of the present invention
  • Fig. 1B is a view for explaining the operation of the interconnection apparatus according to the first embodiment of the present invention
  • Fig. 2 is a view for explaining the operation of the interconnection apparatus according to the first embodiment of the present invention
  • Fig. 3 is a schematic view showing the arrangement of a switching system according to the second embodiment of the present invention
  • Fig. 4 is a schematic view showing the basic arrangement of a switching apparatus 10 according to the second embodiment of the present invention
  • Fig. 5A is a view for explaining the basic operation of the switching apparatus 10 according to the second embodiment of the present invention
  • Fig. 5B is a view for explaining the basic operation of a conventional switching apparatus
  • Fig. 6A is a view for explaining the basic operation of the switching apparatus 10 according to the second embodiment of the present invention
  • Fig. 6B is a view for explaining the basic operation of the switching apparatus 10 according to the second embodiment of the present invention
  • Fig. 6C is a view for explaining the basic operation of the switching apparatus 10 according to the second embodiment of the present invention
  • Fig. 7 is a view for explaining the basic operation of the conventional switching apparatus
  • Fig. 8 is a schematic view showing the basic arrangement of a switching apparatus 30 according to the second embodiment of the present invention
  • Fig. 9 is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention
  • Fig. 10A is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention
  • Fig. 10B is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention
  • Fig. 10C is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention
  • Fig. 10D is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention
  • Fig. 11 is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention
  • Fig. 12A is a view for explaining the effect of the switching apparatus 30 according to the second embodiment of the present invention
  • Fig. 12B is a view for explaining the effect of the switching apparatus 30 according to the second embodiment of the present invention
  • Fig. 13 is a schematic view showing the basic arrangement of a switching system 40 according to the second embodiment of the present invention
  • Fig. 14 is a schematic view showing the basic arrangement of a switching system 50 according to the second embodiment of the present invention
  • Fig. 15 is a schematic view showing the basic operation of the switching system 50 according to the second embodiment of the present invention
  • Fig. 16 is a schematic view showing the arrangement of a switching system according to Example 1 of the present invention
  • Fig. 17 is a schematic view showing an example of the arrangement of the switching system according to Example 1 of the present invention
  • Fig. 18A is a view for explaining the conventional switching system
  • Fig. 18B is a view for explaining the conventional switching system
  • Fig. 19 is a view for explaining the conventional switching system.
  • an interconnection apparatus 100 includes a protocol processing dedicated processor (to be referred to as a protocol processor hereinafter) 110.
  • the interconnection apparatus 100 is connected to an XPU 120 and a memory 121 connected to the XPU 120 (to be referred to as an XPU memory hereinafter).
  • protocol processing is off-loaded from the XPU 120 to the protocol processor 110.
  • the protocol processor 110 is connected to the XPU 120 and the XPU memory 121.
  • the protocol processor 110 is connected to a NIC 130.
  • the input packet payload is extracted by the protocol processor 110 and made available to the XPU 120.
  • Fig. 1B shows a processing timeline 103 of the XPU 120 and a processing timeline 104 of the protocol processor 110.
  • packet payload processing 102 is executed after the predetermined processing 101 is complete, without interrupting it.
  • the protocol processor 110 has the following features.
  • the protocol processor 110 has a sufficiently high processing capacity. For example, the protocol processor 110 is operated at high speed. If the operation of the protocol processor 110 is slow, the time required for the protocol processing increases, and the benefit of using a separate device as the protocol processor 110 is reduced.
  • the protocol processor 110 can directly access the XPU memory 121. This is because when the protocol processor 110 cannot directly access the XPU memory 121, the operation process of the XPU 120 is interrupted to cause an unnecessary overhead.
  • a PCIe (Peripheral Component Interconnect-Express) subsystem is used as an example of the framework capable of causing the protocol processor 110 to directly access the XPU memory 121.
  • a root complex (RC) 140 is a main device for connecting the XPU 120 and the XPU memory 121 to different end points.
  • the NIC is connected to the RC as an end point of the RC, and RDMA (Remote Direct Memory Access) is arranged between the XPU 120 and the NIC, which are spaced apart from each other.
  • the PCIe protocol has a simple arrangement formed by only three layers, that is, a transaction layer, a link layer, and a physical layer.
  • the transaction layer is formed by memory write, memory read, and completion with data (CplD) packets.
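The three transaction-layer packet (TLP) types just named can be sketched as follows. The behavior modeled here (memory writes are posted with no reply; memory reads are answered with a completion with data) follows the PCIe specification; the function and variable names are hypothetical.

```python
from enum import Enum

# Sketch of the three PCIe transaction-layer packet (TLP) types and
# the read handshake: a Memory Read request is answered by a
# Completion with Data (CplD); a Memory Write is posted (no completion).

class Tlp(Enum):
    MEMORY_WRITE = "MWr"
    MEMORY_READ = "MRd"
    COMPLETION_WITH_DATA = "CplD"

memory = {}   # stands in for the XPU memory behind the root complex

def handle(tlp: Tlp, addr: int, data=None):
    if tlp is Tlp.MEMORY_WRITE:      # posted: no reply expected
        memory[addr] = data
        return None
    if tlp is Tlp.MEMORY_READ:       # non-posted: answered with CplD
        return (Tlp.COMPLETION_WITH_DATA, memory.get(addr))

handle(Tlp.MEMORY_WRITE, 0x1000, b"\x01")
print(handle(Tlp.MEMORY_READ, 0x1000))
# (Tlp.COMPLETION_WITH_DATA, b'\x01')
```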
  • an XPU driver triggers the message transmission preparation in the NIC.
  • when the NIC receives the payload, the data is transmitted to the network. If the transmission succeeds, the NIC receives a reception acknowledgement (ACK) from the transmission destination (S13 in Fig. 2).
  • the XPU memory is directly accessed via the RC to write the completion of transmission (S14 in Fig. 2).
  • Memory polling is used in the XPU 120 and the XPU memory 121.
  • the XPU 120 periodically checks a designated memory area in order to prevent context switching and other overhead processing steps. Note that the XPU 120 can use an interrupt as an option and may directly access the input data without using memory polling.
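The memory-polling scheme above can be sketched as follows. This is a simplified model with hypothetical names, in which threads stand in for the XPU and the protocol processor.

```python
import threading
import time

# Sketch of memory polling: the protocol processor directly writes the
# received payload and a completion flag into a designated area of XPU
# memory, and the XPU periodically checks ("polls") that area instead
# of taking an interrupt, avoiding context-switch overhead.

class XpuMemory:
    """A designated memory area holding a completion flag and payload."""
    def __init__(self):
        self.completion_flag = False
        self.payload = None

def protocol_processor_write(mem: XpuMemory, payload: bytes) -> None:
    # The protocol processor writes the payload, then sets the flag
    # (as in the write of transmission completion, S14 in Fig. 2).
    mem.payload = payload
    mem.completion_flag = True

def xpu_poll(mem: XpuMemory, interval_s: float = 0.001) -> bytes:
    # The XPU checks the designated memory area periodically; no
    # interrupt or context switch is needed.
    while not mem.completion_flag:
        time.sleep(interval_s)
    mem.completion_flag = False          # consume the notification
    return mem.payload

mem = XpuMemory()
t = threading.Thread(target=protocol_processor_write, args=(mem, b"payload"))
t.start()
data = xpu_poll(mem)
t.join()
print(data)   # b'payload'
```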
  • the protocol processing is off-loaded from the XPU, and the use efficiency of the XPU as a computing resource is improved.
  • a switching system 200 includes a switching system 210 and an interconnection apparatus 100 according to the first embodiment.
  • the switching system 210 includes an optical switch 14 and a plurality of control groups 211_10 to 211_m0.
  • in each control group, a transmission group formed by devices such as a signal transmission optical transmitter 13, an electric switch, and a processor, and a reception group formed by devices such as a signal reception optical receiver 15, an electric switch, and a processor are collectively mounted on, for example, the same mounting substrate.
  • an electric switch 212 is a single ASIC chip arranged for each group to perform all the switching and data management operations described above.
  • an ASIC chip 212 has functions of a control unit (to be described later), a transmission electric switch, and a reception electric switch.
  • switching systems 40 and 50 include switching apparatuses 10 or 30, a plurality of transmission groups (for example 3_10 to 3_40), and a plurality of reception groups (for example, 4_20 and 4_30).
  • a switching apparatus 10 includes optical transmitters 13, an optical switch 14, optical receivers 15, and control units 17.
  • a packet 1 is transmitted from a transmission host 3, and the optical transmitter 13 compresses and demultiplexes the packet 1 into packets (packets 2).
  • the optical switch 14 switches the packet, and the packet 1 is received by a reception host 4 via the optical receiver 15.
  • the packets 1 transmitted from the transmission host 3 are assigned priorities.
  • the optical switch 14 is an all through switch and does not require arbitration in switching itself.
  • the optical switch 14 simultaneously transmits all packets for the same output port destination.
  • the destination output port simultaneously receives different packets.
  • the control unit 17 is an electric chip arranged so as to be connected to the optical receiver 15.
  • the control unit 17 executes processing such as arbitration for the packets 2 output from the optical switch 14.
  • the control unit 17 is locally assigned for each reception host 4 and executes self management of traffic input to the reception host 4.
  • the control unit 17 is arranged adjacent to the reception host 4 and quickly updates the data acquisition priority.
  • control unit 17 manages communication processing of the single reception host 4, thereby performing a high-speed operation.
  • An information channel, in addition to a data link, connects the reception host 4 and the control unit 17.
  • Fig. 5A shows packet processing in the switching apparatus 10.
  • Fig. 5B shows a conventional non-blocking switch 20.
  • the bandwidths of all the output ports are constant.
  • the switch 20 can transmit only one packet at one time. For example, if a flow B (101_2) and a flow C (101_3) are transmitted to the same output port, these flows are processed in a single band having a predetermined bandwidth (Fig. 5B).
  • the bandwidth of any output port is variable, and all the packets input to the switch can be adjusted.
  • the bandwidth is changed to process the packet in two bands.
  • the bandwidth of the output port is 1/2 of the total throughput (bandwidth) of the switch.
  • the optical switch 14 can cope with a high bandwidth per output port.
  • data (packet B) 102_2 having a high priority is transmitted first.
  • Input data (packet C) 102_3 having a low priority is transmitted after the completion of the transmission of the data (packet B) 102_2 having a high priority.
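The ordering rule above can be sketched with a priority queue. This is an illustrative model, not the patent's circuit: the control unit releases the highest-priority packet first and buffers the rest for later transmission.

```python
import heapq

# Sketch of output-side priority arbitration: packets are delivered
# highest priority first; lower-priority packets wait in the buffer.

def arbitrate(packets):
    """packets: list of (priority, name); a lower number means a higher
    priority. Returns the delivery order of the packet names."""
    heapq.heapify(packets)               # buffer ordered by priority
    return [heapq.heappop(packets)[1] for _ in range(len(packets))]

print(arbitrate([(2, "packet C"), (1, "packet B")]))
# ['packet B', 'packet C']
```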
  • an optical data link may be arranged in the control unit 17. Accordingly, optical data can be directly transmitted between the output port of the switch and the reception host 4 capable of processing the optical input signal.
  • the optical switch 14 can be arranged based on a broadcast-and-select method.
  • Signals input to different switch ports are multiplexed for, for example, the respective wavelengths.
  • the input signals are respectively transmitted to all the output ports and selected in the respective output ports based on a desired signal transmission destination.
  • an input packet 1_1 from a transmission host A (3_1) is demultiplexed by a splitter 18, and demultiplexed packets 2_1 to 2_4 are transmitted to reception hosts 4_1 to 4_4.
  • the input packet 1_1 and an input packet 1_4 from the transmission host A (3_1) and a transmission host D (3_4) are demultiplexed by the splitter 18.
  • the demultiplexed packets 2_1 to 2_4 are selected by an optical selection filter 19 in the output port and transmitted to the reception hosts 4_1 to 4_4.
  • a fast tunable filter or a polarization filter element can be used as the optical selection filter 19.
  • the input packets 1_1 and 1_4 from the transmission hosts A and D (3_1 and 3_4) are demultiplexed by the splitter 18.
  • the demultiplexed packets 2_1 to 2_4 are received by the plurality of optical receivers 15 for the respective packets and transmitted to the reception hosts 4_1 to 4_4.
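The broadcast-and-select operation can be sketched as follows. This is a functional model with assumed names; in the apparatus the selection is done optically by the selection filter 19 or by per-packet optical receivers 15.

```python
# Minimal sketch of broadcast-and-select: a splitter copies each input
# packet to every output port, and each output port keeps only the
# packets addressed to it.

def broadcast_and_select(inputs):
    """inputs: list of (payload, destination_port). Returns a dict
    mapping each output port to the payloads it selected."""
    n_ports = 4
    # Broadcast: every input reaches every output port.
    broadcast = {port: list(inputs) for port in range(1, n_ports + 1)}
    # Select: each output keeps only packets whose destination matches.
    return {port: [p for p, dest in pkts if dest == port]
            for port, pkts in broadcast.items()}

pkts = [("pkt_1", 1), ("pkt_4", 4)]   # inputs from two hosts (assumed)
print(broadcast_and_select(pkts))
# {1: ['pkt_1'], 2: [], 3: [], 4: ['pkt_4']}
```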
  • contention occurs, and arbitration is executed for the colliding packets.
  • in arbitration, a packet having a high priority is selected and transmitted first, and other packets are buffered and subsequently transmitted.
  • arbitration is required in scheduling performed when simultaneously transmitting the packets to the same destination.
  • the reception host 4 is assigned a computational task of reducing two data flows, that is, a flow A (201_1) and a flow B (201_2).
  • the host 4 has already processed the flow A (201_1) and waits for the reception of the flow B (201_2).
  • the flow C (201_3) is another flow to be transmitted to the host 4 and reaches the switch before the arrival of the flow B (201_2) with a time difference. If the time difference is more than zero, the switch transmits the flow C (201_3) to the host 4.
  • arbitration of the packet is started between the flow C (201_3) and the flow B (201_2) when the flow B (201_2) arrives.
  • a priority is given to the flow B (201_2). If a failure or delay occurs in the arbitrator 22, which has been notified of a request from the host 4 designating the flow B (201_2) as the highest-priority destination, the flow B (201_2) is not switched sufficiently quickly even if the time difference is equal to zero.
  • the host 4 is a processing unit, and the data acquisition priority changes at high speed. In this case, it is difficult to continuously update the arbitrator 22 with the changing priority. Accordingly, in the processing of all the system communication amount, it is difficult for the arbitrator 22 to sufficiently quickly and accurately perform determination for a very large amount of dynamic data.
  • arbitration is executed by the control unit 17 on the output side in accordance with the distributed control method.
  • the host 4 connected to each output port determines a packet to be processed first.
  • the control unit 17 can locally execute arbitration for each host.
  • All packets are output in a predetermined duration T.
  • the data rate of the output signal is equal to that of the input signal.
  • the data rate of the packets reaching the same output port in different slots is converted into the initial (original) data rate.
  • the packets are output in an order requested to the connected output host. In this manner, the minimum latency is given to the packet having the highest priority.
  • the latency and power consumption in switching can be reduced, and the load for acquiring information necessary for the arbitration process can be reduced (eliminated).
  • an example of a switching apparatus (packet switch) 30 includes input ports 11, input blocks 12, optical transmitters 13, an optical switch 14, optical receivers 15, and output ports 16.
  • the switching apparatus 30 also includes control units 17 connected to the optical receivers 15.
  • the optical switch 14 is operated in the time slot operation.
  • Fig. 9 shows an example of the basic operation of the packet switch 30 for executing non-blocking processing using a 4 x 4 switch. This packet switch 30 is based on the time slot operation described below.
  • a packet is switched for each input group.
  • arbitration is executed within the input packets of the same input group. Since this arbitration is executed for a small number of ports and a low communication amount, the operation can be executed quickly.
  • An electric packet 1 input to the switch has a bandwidth BW (bit/sec) and a duration T.
  • the desired output port 16 to which the packet is transmitted is set.
  • a packet switching operation to any one of the four output ports 16 is complete within the time T. This is because if a time of T or longer is required for the switching of a single packet, the next input packet is blocked and a continuous switching delay is accumulated.
  • the optical transmitter 13 compresses the input packet 1 by a factor (in this case, 4) equal to the number of ports (that is, the number of optical receivers to which packets are transmitted). That is, the duration of the input packet is divided by this factor and becomes T/4. In addition, the bandwidth is multiplied by the same factor and becomes 4BW in order to retain the packet data contents.
  • an optical input packet 2 is generated by satisfying the above conditions.
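The compression arithmetic above can be checked directly. The packet figures in the example below are assumed for illustration: dividing the duration by the port count F and multiplying the bandwidth by F leaves the number of bits in the packet unchanged.

```python
# Sketch of the packet time-compression arithmetic, assuming a switch
# with F ports (F = 4 in the example above): the optical transmitter
# shortens the packet duration by F and raises the bandwidth by F so
# the packet data contents are preserved.

def compress_packet(duration_s: float, bandwidth_bps: float, ports: int):
    """Return (compressed duration, compressed bandwidth)."""
    return duration_s / ports, bandwidth_bps * ports

T, BW, F = 1.2e-7, 100e9, 4          # e.g. a 120-nsec, 100-Gb/s packet
t_opt, bw_opt = compress_packet(T, BW, F)

# The number of bits is unchanged by the compression.
assert T * BW == t_opt * bw_opt
print(t_opt, bw_opt)                  # 30 nsec at 400 Gb/s
```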
  • the optical switch 14 distributes the respective optical input packets 2 to the desired output ports 16 using periodic time slots.
  • the periodic operation of the switch is divided into four time slots.
  • the distribution (switching) of the optical input packets 2 is repeatedly executed for each time slot in accordance with a sequence formed by steps S1 to S4 (to be described later).
  • the optical receiver 15 converts the packet into an electric packet, and the packet switch 30 outputs the electric packet in a predetermined duration.
  • the data rate of the signal output from the packet switch 30 is equal to the data rate of the input signal.
  • the data rate is changed back to the original data rate. This change is performed in the order of arrival of packets.
  • the priority of the change may be set by another arbitration.
  • Figs. 10A to 10D show examples of a series of switching operations in steps S1 to S4, respectively.
  • packets are input to four ports 11_1 to 11_4, respectively.
  • the packet (packet C) input to the port 11_3 has the desired output port as a port 16_3 and has the highest priority.
  • the packets (packets A and D) input to ports 11_1 and 11_4 have the desired output ports as ports 16_2 and 16_1, respectively, and have the second highest priority.
  • the packet (packet B) input to the port 11_2 has the desired output port as a port 16_4 and has the third highest priority.
  • since the packet C has the highest priority, it is transmitted to the output port 16_3 in the duration of the first time slot (step S1 and Fig. 10A).
  • since the packets A and D have the second highest priority, they are simultaneously transmitted to the output ports 16_2 and 16_1 in the duration of the second time slot (step S2 and Fig. 10B). In this case, since the packets A and D are transmitted to different output ports, no collision occurs.
  • since the packet (packet B) input to the port 11_2 has the third highest priority, it is transferred to the output port 16_4 in the duration of the third time slot (step S3 and Fig. 10C).
  • since transmission of the packets A to D is complete in the previous step (step S3), switching is not executed in the duration of the fourth time slot (step S4 and Fig. 10D).
  • the respective output ports 16 are connected to only one input port 11.
  • the input port 11 is connected to the desired output port 16.
  • the packet is arranged in a correct (accurate) time slot (the divided duration).
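The sequence of Figs. 10A to 10D can be sketched as a simple slot scheduler. This is an illustrative model, not the patent's exact algorithm: each priority level is served in its own time slot, and equal-priority packets bound for distinct output ports share a slot without collision.

```python
from collections import defaultdict

# Illustrative sketch of the time-slot operation of Figs. 10A-10D.

def schedule(packets):
    """packets: list of (name, output_port, priority); a lower number
    means a higher priority. Returns {slot: [(name, output_port), ...]}
    with slots numbered from 1."""
    by_prio = defaultdict(list)
    for name, out, prio in packets:
        by_prio[prio].append((name, out))
    slots = {}
    for slot, prio in enumerate(sorted(by_prio), start=1):
        group = by_prio[prio]
        # Equal-priority packets must target distinct output ports here.
        assert len({out for _, out in group}) == len(group)
        slots[slot] = group
    return slots

# The example of Figs. 10A-10D: C has the highest priority, A and D
# the second highest, B the third highest.
packets = [("A", 2, 2), ("B", 4, 3), ("C", 3, 1), ("D", 1, 2)]
print(schedule(packets))
# slot 1: C -> port 3; slot 2: A -> port 2 and D -> port 1; slot 3: B -> port 4
```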
  • the optical switch 14 is operated as shown in Fig. 11.
  • packets are input to four ports 11_1 to 11_4, respectively.
  • the packets A to D have the same desired output port (16_2) and are prioritized in the order of the packets B, A, D, and C.
  • since the packet B has the highest priority, it is transmitted to the output port 16_2 in the duration of the first time slot (step S1).
  • since the packet A has the second highest priority, it is transmitted to the output port 16_2 in the duration of the second time slot (step S2).
  • since the packet D has the third highest priority, it is transmitted to the output port 16_2 in the duration of the third time slot (step S3).
  • since the packet C has the fourth highest priority, it is transmitted to the output port 16_2 in the duration of the fourth time slot (step S4).
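The single-destination case of Fig. 11 reduces to a direct priority-to-slot mapping, sketched here from the behavior described above:

```python
# Sketch of the serialization in Fig. 11: four packets contend for the
# same output port (16_2), so the switch assigns them to the four time
# slots strictly in priority order.

priority_order = ["B", "A", "D", "C"]          # highest priority first
slot_of = {pkt: slot for slot, pkt in enumerate(priority_order, start=1)}
print(slot_of)   # {'B': 1, 'A': 2, 'D': 3, 'C': 4}
```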
  • owing to the introduction of time slots and the increase in bit rate, the optical switch 14 can process a large amount of dynamic data sufficiently fast without a decrease in signal output level as the number of ports increases and without an increase in the number of constituent units, thereby reducing power consumption.
  • the arbitration by the conventional concentrated control method is distributed into the following steps, and the arbitration is executed.
  • the optical switch 14 transmits the packets of different input groups to the same output group in different time slots.
  • the optical switch 14 can execute this step at a high data rate with accurate time control.
  • arbitration is executed by the control unit 17 on the output side in accordance with the distributed control method.
  • the host 4 connected to each output port determines a packet to be processed first.
  • the control unit 17 can locally execute arbitration for each host.
  • All packets are output in a predetermined duration T.
  • the data rate of the output signal is equal to that of the input signal.
  • the data rate of the packets reaching the same output port in different slots is converted into the initial (original) data rate.
  • the packets are output in an order requested to the connected output host. In this manner, the minimum latency is given to the packet having the highest priority.
  • the latency and power consumption in switching can be reduced, and the load for acquiring information necessary for the arbitration process can be reduced (eliminated).
  • Fig. 12A shows the mode of latency in a flow switched by the switching apparatus 30.
  • Fig. 12B shows the mode of latency in a flow switched by the conventional non-blocking switch 20.
  • the delay time is increased, and the latency is increased.
  • the switching apparatus 30 has a short delay time, so that the latency can be reduced.
  • an input packet passes through the input port of the switch. First, the destination and the priority are examined, and then concentrated arbitration is executed. The first packet to be transmitted is determined from all the packets having the same output port destination.
  • Concentrated arbitration processing becomes complicated with an increase in the number of switch ports and the throughput. As a result, the communication latency is increased, and the power consumption is increased.
  • in the switching apparatus 30, since packets can be switched without executing the time-consuming concentrated arbitration, the communication latency is reduced, and the power consumption can be reduced.
  • since the optical switch 14 performs part of the switching processing, the power consumption is lower than that of an ASIC formed by CMOS transistors, and the switching capacity can be increased.
  • the area occupied by the input block 12 can be reduced.
  • the total area of the packet switch (chip) is not increased. Accordingly, the optical-electric interface is mounted, and the throughput (processing capacity) of the switch can be increased without changing the chip area.
  • the power consumption can be reduced by using the chiplet.
  • the contention between the ports of the same block can be prevented, and non-blocking processing can be executed.
  • compact copies of the input packets are created at a high rate and divided and transmitted in short time slots.
  • transmission of the packets to the same destination can be processed within a short time as compared with the actual packet input interval by using time interleaving.
  • the optical receiver 15 using the switching apparatus 30 can be operated in correspondence with this burst mode transmission.
  • the switching apparatus 30 is formed by four 1 x 4 switching units corresponding to the different input ports 11 to easily implement a high-speed 4 x 4 optical switch device.
  • the time (transition time) for transitioning one switch mode (for example, Fig. 10A) to another switch mode (for example, Fig. 10B) is very short as compared with the duration of the input packet.
  • the transition time can be reduced to 10 psec in an actually usable technique. This transition time is very short as compared with a 100-Gb/s Ethernet packet having a duration of 120 nsec.
  • a short guard time may be provided between the optical packets to prevent all data losses between switching operations.
  • the bandwidth of the packet generated by the host is multiplied by a coefficient F (switch port count). For example, in an 8 x 8 switch, a 25-Gb/s electric packet must be converted into a 200-Gb/s optical packet. This optical packet is generated by using a direct modulation laser and a multilevel modulation format.
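Two of the figures above can be checked with simple arithmetic. This is a sketch with hypothetical helper names, assuming a standard 1,500-byte frame for the 100-Gb/s packet-duration figure: it verifies the 120-nsec Ethernet packet duration and the 25-Gb/s to 200-Gb/s conversion by the port-count coefficient F in an 8 x 8 switch.

```python
# Arithmetic sketch checking two figures from the text: a 1,500-byte
# Ethernet packet at 100 Gb/s lasts 120 nsec, and in an 8 x 8 switch a
# 25-Gb/s electric packet becomes a 200-Gb/s optical packet.

def packet_duration_ns(size_bytes: int, rate_gbps: float) -> float:
    return size_bytes * 8 / rate_gbps      # bits / (Gb/s) = nsec

def optical_rate_gbps(electrical_gbps: float, port_count: int) -> float:
    return electrical_gbps * port_count    # bandwidth x coefficient F

print(packet_duration_ns(1500, 100))   # 120.0 nsec
print(optical_rate_gbps(25, 8))        # 200 Gb/s
```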
  • since the distance between hosts assumed in this embodiment is short, this embodiment is applicable to a high data rate. In addition, the optical dispersion effect is at a negligible level.
  • a packet having a high bit rate may be generated by another method.
  • this switch may be used as a core switching unit in a hybrid switching architecture.
  • the switching system 40 can be scaled by grouping hosts.
  • the switching system 40 includes a plurality of transmission (source) groups 3_10 to 3_40 on the transmission side.
  • Each transmission group includes a plurality of transmission hosts (for example, 3_11 and 3_12) and a switching element 41.
  • a plurality of destination groups (for example, 4_20 and the like) are provided on the reception side.
  • Each destination group includes an optical receiver 15, a control unit 17, a reception-side switch 43, and a plurality of reception hosts (for example, 4_21, 4_22, and the like). Other arrangements are the same as in the first embodiment.
  • the switching element 41 is a low-radix ASIC switch chip.
  • the optical switch 14 has four input/output ports and an operation period divided into four time slots. In this case, not each host unit but the groups 3_10 to 3_40 of transmission hosts are connected to the ports of the optical switch 14.
  • the ASIC switch 41 and two hosts A1 and A2 are arranged to be adjacent to each other.
  • the ASIC switch 41 and the two hosts 3_11 and 3_12 are electrically linked at a short distance.
  • the number of host units per group is preferably about 10.
  • the transmission groups A to D (3_10 to 3_40) are connected to the input ports of the optical switch 14, and destination groups 4_10 to 4_40 are connected to the output ports.
  • the hosts of the same group exchange the packets using an ASIC switch 41.
  • the ASIC switch 41 of the group A (3_10) is used to interconnect the hosts A1 and A2 (3_11 and 3_12).
  • Packets between hosts of different groups (to be referred to as inter-group packets hereinafter) are exchanged via the interconnection with the optical switch 14.
  • each destination (output) group is connected to only one transmission (input) group in each time slot.
  • Switching of the inter-group packets of the transmission group is processed by arranging packets (optical packets of the shortened duration) within the accurate time slots. In each time slot, the transmission group is connected to the desired destination group.
  • end-to-end transmission of the inter-group packet from the transmission host to the destination host is formed by the following three steps.
  • queues G1 to G4 (42_1 to 42_4) correspond to the destination groups 4_10 to 4_40.
  • the packets from the hosts A1 and A2 are transmitted in a bandwidth of twice 25 Gb/s (50 Gb/s).
  • optical switching is executed for the packets (optical packets of the shortened duration) transmitted to the desired destination group.
  • a packet is demultiplexed into four packets and compressed four times.
  • the demultiplexed packets are assigned to the first time slot to the fourth time slot in the time T and transmitted in the bandwidth of 200 Gb/s.
  • the packets are transmitted from the corresponding queue in the ASIC switch 41 to the respective optical switches 14.
  • the hosts of the same group simultaneously generate packets to be transmitted to the same group.
  • these simultaneous packets are arranged in identical time slots of the optical switch 14. This can be achieved by the bandwidth of the optical transmitter 13.
  • packets of the different transmission hosts need not be demultiplexed using different wavelengths. Note that if a WDM-based optical transmitter is used, the request of the high bandwidth along with an increase in the number of host units per group can be satisfied.
  • the packet can reach the desired destination group. For example, all the 25-Gb/s packets can be received in the time T.
  • the plurality of packets are simultaneously transmitted to the same end host, and a packet having a high priority is processed first.
  • the self management of the input data packets described previously is executed by the local ASIC switch 41 to which data reception is assigned.
  • arbitration by the conventional centralized control method is replaced by distributing the arbitration into the following three steps.
  • the input ports of the switching system 40 are divided into groups (for example, 3_10 to 3_40), and the packets of each group are processed independently of the remaining packets.
  • packets for the same destination that arrive simultaneously within an input group are processed together as a single group and transmitted without executing arbitration. In this manner, since the input ports are grouped into small groups, this step can be performed at high speed.
  • the processing step of the optical switch is executed.
  • the arbitration step is executed.
  • the latency and power consumption in switching can be reduced, and the load for acquiring information necessary for the arbitration process can be reduced (eliminated).
  • the hosts are grouped to increase the number of interconnected hosts, and scalability of the system can be improved.
  • the switching system 50 is scaled by optical multicasting (optical multiplexing).
  • a switching system 50 includes an optical multiplexer 51 between an optical transmitter 13 and an optical switch 14.
  • the switching system 50 also includes a first optical demultiplexer 52 and a second optical demultiplexer 53 between the optical switch 14 and the optical receiver 15.
  • Other arrangements are the same as in the third embodiment.
  • the plurality of optical transmitters 13 are connected to the optical multiplexer 51.
  • the output port of the optical switch 14 is connected to the first optical demultiplexer 52, and the second optical demultiplexer 53 is connected to the output portion of the first optical demultiplexer 52.
  • This embodiment illustrates an example using wavelength multiplexing of an optical signal as an example of optical multicasting.
  • an AWG (Arrayed Waveguide Grating) optical coupler is used as the optical multiplexer 51.
  • an optical splitter is used in the first optical demultiplexer 52 and demultiplexes an optical signal at a predetermined power ratio.
  • An AWG filter is used in the second optical demultiplexer 53 and optically demultiplexes an optical signal for each wavelength.
  • an optical transmitter 13_1 connected to a transmission group A (3_10) and an optical transmitter 13_2 connected to a transmission group B (3_20) output optical packets having different wavelengths, respectively.
  • optical packets having different wavelengths are multiplexed by the AWG optical coupler 51 and simultaneously transmitted to a plurality of destination groups 4_10 to 4_40 in the same time slot.
  • the optical packets can be transmitted to a large number of groups, for example, a large number of end hosts without increasing the number of ports of the switch.
  • a higher multicasting ratio can be implemented, and the maximum achievable ratio can be determined by the power budget of the optical link.
  • the packets can be transmitted in a two-fold bandwidth (400 Gb/s).
  • the transmitted optical packet is demultiplexed into destination groups by the optical splitter 52 and demultiplexed by the AWG filter 53 in each group (for example, group 4_40) for each wavelength.
  • the optical packets are transmitted to the end hosts (for example, the destination hosts 4_41 and 4_42).
  • the optical packets from the plurality of transmission groups simultaneously reach the same destination group.
  • a reception unit having an optical demultiplexing function is used to increase the total number of the reception units of the system.
  • the scalability of the system can be improved by optical multicasting (optical multiplexing).
  • Fig. 15 shows an example of the timing chart of the switching system 50.
  • one switching period is divided into four time slots.
  • the first and second time slots are illustrated in the left and right views, respectively.
  • the switching system 50 includes 128 25-Gb/s hosts and 16 groups (eight hosts per group) and multicasts four groups at a time.
  • a commercially available transceiver unit based on a PAM4 multilevel format is used in the switching system 50 and performs processing in the total communication amount of 6.4 Tb/s.
  • the aggregated bandwidth of data emerging from the electric switching edge can be processed so that the switching energy consumed per bit is efficiently reduced.
  • the optical switch can improve the processing scalability of the entire switching system without requiring arbitration and without sacrificing the end-to-end latency. That is, the optical switch can improve the scalability of the system and the end-to-end latency without the tradeoff between them.
  • the plurality of electric switches each having a low radix are used in the switching system 210.
  • the scalability of the computing system can be improved at a low power density, and the large-scale switching fabric can be implemented by using the low-radix switch.
  • Each switch processes communication from a group having a small number (low count) of XPUs.
  • the XPU is arranged near the designated switch. In this case, electric connection having high energy efficiency and a large bandwidth is suitable.
  • each XPU can quickly update its designated switch with the data reception priority, and the switch processes only a small part of the total data communication, thereby executing quick arbitration.
(Example 1)
  • A switching system 300 according to Example 1 of the present invention will be described with reference to Figs. 16 and 17.
  • the switching system 300 includes a switching apparatus 301 and an interconnection apparatus 302.
  • the switching apparatus 301 includes an optical switch 14, an optical transmitter 13, an optical receiver 15, and an electric switch (including a control unit) 212.
  • the interconnection apparatus 302 is connected to each of a plurality of XPUs 120.
  • an RC 140 and a protocol processor 110 are sequentially connected.
  • the protocol processor 110 is connected to the electric switch 212. In this case, protocol processing is off-loaded from the XPU 120 to the protocol processor 110.
  • An XPU memory 121 is connected to each XPU 120.
  • the RC 140 is connected to the XPU memory 121 and can directly access the XPU memory 121.
  • the ports of each electric switch are connected to the protocol processor and a device having a function such as root complex (RC).
  • the function such as the RC is not particularly limited to the PCIe standard.
  • a packet transmission table is used in the conventional electric switch and is implemented by a TCAM (Ternary Content Addressable Memory). In this case, the priorities and destinations of all simultaneously input packets are checked, and all the packets are transmitted to the desired output ports.
  • the protocol processor may be shared between different ports.
  • the packet bit checksum is checked in the TCAM in parallel steps. Only the packet payload is transmitted to the desired output port, rather than the entire packet.
  • a high-power protocol processor may be shared in the same group or in XPUs having an arrangement in which identical electric switches are mounted.
  • a CAM (Content Addressable Memory) may be used; its content is matched to the port of the desired output switch, and processing is performed by transferring the data.
  • the optical transmitter 13 and the receiver 15 are arranged in the port of the device connected to the optical switch, and an FEC encoder and a decoder are mounted on the optical transmitter 13 and the receiver 15, respectively. Accordingly, the optical signal of a high data rate (about 100 Gb/s or more) can be transmitted with high reliability.
  • a packet having a low priority and a packet having a high priority are simultaneously input from the optical switch 14.
  • the packet having the low priority is transmitted to the common memory (not shown) of the devices.
  • the common memory is embedded in or mounted on a device 312.
  • the packet having the high priority is protocol-processed, and the payload part is transmitted to XPU#1.
  • the delay and power consumption of the signal processing can be reduced by integrally implementing the switching in the low-radix electric switch and the device having high function level.
  • the embodiments of the present invention illustrate an example of the structure, size, material, and the like of each constituent component in the arrangements of the interconnection apparatus and the switching system.
  • the present invention is not limited to this. Any example is possible as far as the functions of the interconnection apparatus and the switching system can be enhanced to obtain the same effects as described above.
  • the present invention relates to the interconnection apparatus which interconnects the processor and the switch, and to the switching system, and is also applicable to a computing system.
  • 100: interconnection apparatus; 110: protocol processor
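As a hedged illustration of the time-slot compression described in the bullet points above (an electric packet demultiplexed into F segments, each compressed F-fold into a short optical time slot at F times the electric rate), the following Python sketch models the timing. The function names and the 160-ns example duration are illustrative assumptions; the 25-Gb/s host rate and the 8 x 8 port count are taken from the text.

```python
# Toy timing model of time-slot compression: a packet generated at the
# electric rate is demultiplexed into F segments, each compressed F-fold
# into a short optical time slot. (Names and durations are illustrative.)

ELECTRIC_RATE_GBPS = 25            # host packet rate from the text
PORT_COUNT_F = 8                   # coefficient F for an 8 x 8 switch
OPTICAL_RATE_GBPS = ELECTRIC_RATE_GBPS * PORT_COUNT_F   # 200 Gb/s

def compressed_duration_ns(duration_ns, electric_gbps, optical_gbps):
    """Duration of the same bits after rate conversion (the bit count
    is conserved: duration * rate stays constant)."""
    return duration_ns * electric_gbps / optical_gbps

def slot_schedule_ns(duration_ns, f, electric_gbps):
    """Split a packet into f equal segments and compress each into its
    own short time slot at f times the electric rate."""
    segment_ns = duration_ns / f
    optical_gbps = electric_gbps * f
    return [compressed_duration_ns(segment_ns, electric_gbps, optical_gbps)
            for _ in range(f)]
```

For example, a 160-ns, 25-Gb/s electric packet becomes a 20-ns, 200-Gb/s optical packet; split over eight slots, each slot carries 2.5 ns of optical data.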

Abstract

An interconnection apparatus (100) according to this invention is an interconnection apparatus for interconnecting a switching apparatus and an XPU connected to an XPU memory, and includes a protocol processor (110) configured to execute a protocol off-loaded from the XPU and to directly access the XPU memory. In addition, a switching system (300) includes an optical switch (14), an electric switch (212), an optical transmitter (13) configured to convert an electric signal input from the electric switch into an optical signal and output the optical signal to the optical switch, an optical receiver (15) configured to convert the optical signal input from the optical switch into an electric signal and output the electric signal to the electric switch, and a plurality of interconnection apparatuses (100) connected in parallel to the electric switch. Accordingly, this invention can provide an interconnection apparatus and a switching system, each of which can improve XPU use efficiency and execute a protocol.

Description

INTERCONNECTION APPARATUS AND SWITCHING SYSTEM
The present invention relates to an interconnection apparatus and a switching system, each of which interconnects a processor and a switch in a computing system.
A high-performance computing system is formed by XPUs as a large number of computing units to improve use efficiency of resources and reduce the latency. In this case, the XPU is an abbreviation of different types of processing units (processors). For example, the XPU includes a normally used CPU (Central Processing Unit) and an application-oriented GPU (Graphical Processing Unit).
For this purpose, a physical data transfer rate and a data transmission protocol execution rate in a transmission link and switching fabric must be improved. In particular, the increase in protocol execution rate must be achieved at both the hardware level and the software level.
Execution of a conventional protocol will be described with reference to Fig. 18A. In this case, a case in which a protocol is executed on a packet reaching an XPU will be described below.
In an operating system (OS) kernel 400, a device driver 401 supplies, to a Network Interface Card (NIC) 410, in advance, an address of a memory queue in which an input packet is stored.
When a new packet arrives, the NIC 410 transfers the received packet bits to a queue 404 (the arrow indicated by the dotted line in Fig. 18A) and generates an interrupt signal in an XPU 420. In this case, an XPU memory 421 is connected to the XPU 420. At this time, an application executed by the XPU 420 is halted, and context switching is executed to hold all related processing data in designated registers for restarting this application. In addition, another processing action such as pipeline flush processing is also executed.
In the interrupt postprocessing of the operating system (OS) kernel 400, protocol steps are executed for the received packet. A header check sum and the fragment of a transmission data flow to be reassembled are verified in an IP layer 402 and a TCP layer 403. When the protocol execution is complete, an application level payload is distributed to the queue 404 of the corresponding end point.
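The conventional receive path above can be sketched as an ordered sequence of overhead steps. This is a minimal, hedged illustration: the step names paraphrase the text and are not real operating-system APIs, and `verify_ip_checksum` is a placeholder for the IP-layer header check.

```python
# Sketch of the conventional interrupt-driven receive path.
# Step names are paraphrases of the text, not real OS calls.

def verify_ip_checksum(packet):
    # Placeholder for the IP-layer header checksum verification.
    return packet.get("checksum_ok", True)

def conventional_receive(packet, trace):
    trace.append("interrupt")        # NIC raises an interrupt in the XPU
    trace.append("context_switch")   # running application is halted
    trace.append("pipeline_flush")   # other overhead processing actions
    if not verify_ip_checksum(packet):
        return None                  # IP layer rejects the packet
    trace.append("tcp_reassembly")   # TCP-layer fragment reassembly
    trace.append("enqueue_payload")  # payload to the end point's queue 404
    trace.append("context_restore")  # application is restarted
    return packet["payload"]
```

Every received packet pays the full overhead sequence regardless of payload size, which is why the overhead dominates for short payloads.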
Fig. 18B shows a normal XPU processing timeline (without receiving a packet) and a timeline of the XPU which executes the same processing upon receiving the packet.
In the normal XPU processing timeline, second processing 432 is executed after first processing 431 (430 in Fig. 18B).
On the other hand, when receiving a packet, the first processing 431 is interrupted, and packet reception processing 441 is executed. In this case, before and after the packet processing 441, a context switch (CS) 442 is executed (440 in Fig. 18B).
As described above, when receiving a packet, a large processing overhead is observed. The influence of this overhead is large for a packet having a short payload. Fig. 19 shows the relationship between the Ethernet bandwidth and the transmission message size in a 1-Gb/s Ethernet link, for the Ethernet frame of the standard size (1,500 bytes, indicated by reference numeral 451) and the Ethernet frame of a large size (9,000 bytes, indicated by reference numeral 452). In a region having a small message size (less than about 512 octets), the bandwidth increases with an increase in message size. In a region having a large message size (about 512 octets or more), the bandwidth reaches its upper limit and rarely increases further.
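The qualitative shape of the Fig. 19 curve can be reproduced with a toy model that assumes a fixed per-message protocol execution cost on a 1-Gb/s link; the 10-microsecond overhead value is an assumption for illustration, not a figure from the text.

```python
# Toy model of effective bandwidth vs. message size: a fixed protocol
# overhead dominates short messages, while long messages approach the
# link rate. (OVERHEAD_US is an assumed value.)

LINK_GBPS = 1.0        # 1-Gb/s Ethernet link, as in Fig. 19
OVERHEAD_US = 10.0     # assumed fixed protocol execution cost per message

def effective_gbps(message_bytes):
    bits = message_bytes * 8
    wire_us = bits / (LINK_GBPS * 1000)          # serialization time (us)
    return bits / ((OVERHEAD_US + wire_us) * 1000)
```

Small messages are dominated by the fixed overhead, so the effective bandwidth grows with message size; large messages saturate just below the link rate, matching the plateau described above.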
In the conventional protocol execution, the protocol will not be executed for a new input packet until the completion of the protocol which is being executed.
The duration of a short Ethernet packet payload is only a fraction of the actual protocol execution time. As shown in Fig. 19, the effective bandwidth of packet reception is therefore very small for short payloads and improves as the payload size increases toward the upper limit.
David Riddoch, "Low Latency Distributed Computing," A dissertation submitted for the degree of Doctor of Philosophy, University of Cambridge 2002.
As described above, reduction of the XPU use efficiency poses a problem in the conventional protocol execution.
In addition, in the conventional protocol execution, the input packet processing operations are performed in an order of arrival of packets having no priorities. At this time, the loss of the QoS control in the input packet poses a problem in the computing overhead.
If the loss of the QoS control occurs, the application of the highest priority waits for a long time for the execution of a protocol having a low priority which arrived first.
If at least one packet of a data flow is lost or a queue overflow occurs before the completion of the flow reception, a long time is taken for the execution of the protocol for a packet which is not used at all by the XPU.
In order to solve the above problem, an interconnection apparatus according to the present invention is an interconnection apparatus for interconnecting a switching apparatus and an XPU connected to an XPU memory, and includes a protocol processor configured to execute a protocol off-loaded from the XPU and to directly access the XPU memory.
According to the present invention, there are provided an interconnection apparatus and a switching system, each of which executes a protocol by improving the use efficiency of an XPU.
Fig. 1A is a schematic view showing the arrangement of an interconnection apparatus according to the first embodiment of the present invention; Fig. 1B is a view for explaining the operation of the interconnection apparatus according to the first embodiment of the present invention; Fig. 2 is a view for explaining the operation of the interconnection apparatus according to the first embodiment of the present invention; Fig. 3 is a schematic view showing the arrangement of a switching system according to the second embodiment of the present invention; Fig. 4 is a schematic view showing the basic arrangement of a switching apparatus 10 according to the second embodiment of the present invention; Fig. 5A is a view for explaining the basic operation of the switching apparatus 10 according to the second embodiment of the present invention; Fig. 5B is a view for explaining the basic operation of a conventional switching apparatus; Fig. 6A is a view for explaining the basic operation of the switching apparatus 10 according to the second embodiment of the present invention; Fig. 6B is a view for explaining the basic operation of the switching apparatus 10 according to the second embodiment of the present invention; Fig. 6C is a view for explaining the basic operation of the switching apparatus 10 according to the second embodiment of the present invention; Fig. 7 is a view for explaining the basic operation of the conventional switching apparatus; Fig. 8 is a schematic view showing the basic arrangement of a switching apparatus 30 according to the second embodiment of the present invention; Fig. 9 is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention; Fig. 10A is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention; Fig. 
10B is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention; Fig. 10C is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention; Fig. 10D is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention; Fig. 11 is a view for explaining the basic operation of the switching apparatus 30 according to the second embodiment of the present invention; Fig. 12A is a view for explaining the effect of the switching apparatus 30 according to the second embodiment of the present invention; Fig. 12B is a view for explaining the effect of the switching apparatus 30 according to the second embodiment of the present invention; Fig. 13 is a schematic view showing the basic arrangement of a switching system 40 according to the second embodiment of the present invention; Fig. 14 is a schematic view showing the basic arrangement of a switching system 50 according to the second embodiment of the present invention; Fig. 15 is a schematic view showing the basic operation of the switching system 50 according to the second embodiment of the present invention; Fig. 16 is a schematic view showing the arrangement of a switching system according to Example 1 of the present invention; Fig. 17 is a schematic view showing an example of the arrangement of the switching system according to Example 1 of the present invention; Fig. 18A is a view for explaining the conventional switching system; Fig. 18B is a view for explaining the conventional switching system; and Fig. 19 is a view for explaining the conventional switching system.
(First Embodiment)
An interconnection apparatus according to the first embodiment of the present invention will be described with reference to Figs. 1A to 2.
(Arrangement of Interconnection Apparatus)
As shown in Fig. 1A, an interconnection apparatus 100 according to this embodiment includes a protocol processing dedicated processor (to be referred to as a protocol processor hereinafter) 110. The interconnection apparatus 100 is connected to an XPU 120 and a memory 121 connected to the XPU 120 (to be referred to as an XPU memory hereinafter).
As described above, in the interconnection apparatus 100, protocol processing is off-loaded from the XPU 120 to the protocol processor 110.
The protocol processor 110 is connected to the XPU 120 and the XPU memory 121.
In addition, the protocol processor 110 is connected to a NIC 130.
The input packet payload is extracted by the protocol processor 110 and supplied to the XPU 120.
When a packet arrives during the execution of a process, the XPU 120 processes the packet payload at an appropriate timing without context switching or the other conventional overhead processing steps. Fig. 1B shows a processing timeline 103 of the XPU 120 and a processing timeline 104 of the protocol processor 110. For example, as shown in Fig. 1B, packet payload processing 102 is executed upon completion of the predetermined process processing 101, without interrupting it.
The protocol processor 110 has the following features.
First, the protocol processor 110 has a sufficiently high processing capacity. For example, the protocol processor 110 is operated at high speed. If the operation of the protocol processor 110 is slow, the time required for the protocol processing increases, and the advantage of using a separate device as the protocol processor 110 is diminished.
In this case, in a computing system including a variety of target XPUs 120, assignment of high-performance protocol processors 110 to the respective XPUs 120 increases both the power consumption and the cost. Accordingly, the number of processors to be assigned must be traded off against the power consumption and cost in consideration of the operating situation of the computing system.
Second, the protocol processor 110 can directly access the XPU memory 121. This is because when the protocol processor 110 cannot directly access the XPU memory 121, the operation process of the XPU 120 is interrupted to cause an unnecessary overhead.
In the interconnection apparatus 100, a PCIe (Peripheral Component Interconnect Express) subsystem is used as an example of the framework capable of causing the protocol processor 110 to directly access the XPU memory 121.
As shown in Fig. 2, in the PCIe subsystem, a root complex (RC) 140 is a main device for connecting the XPU 120 and the XPU memory 121 to different end points. For example, the NIC is connected to the RC as the end point of the RC, and an RDMA (Remote Direct Memory Access) is arranged between the XPU 120 and the NIC which are arranged to be spaced apart from each other. In this case, the PCIe protocol has a simple arrangement formed by only three layers, that is, a transaction layer, a link layer, and a physical layer.
For example, as shown in Fig. 2, the transaction layer is formed by memory write, memory read, and completion of data (CplD).
In this transaction layer, an XPU (driver) first notifies the NIC of the message transmission preparation (S11 in Fig. 2).
Next, the memory read and the transmission of completion of data are executed between the RC and the NIC, and memory write of the completion of the executed processing is executed (S12 in Fig. 2). As described above, a series of processing operations are collectively executed in the transaction layer.
Next, when the NIC receives the payload, the data is transmitted to the network. If the transmission succeeds, the NIC receives a reception acknowledgement (ACK) from the transmission destination (S13 in Fig. 2).
Finally, when the NIC receives the ACK, the XPU memory is directly accessed via the RC to write the completion of transmission (S14 in Fig. 2).
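The four steps S11 to S14 above can be sketched as a simple ordered sequence. This is an illustrative model only; a real PCIe transaction-layer exchange is far more involved, and the function and dictionary keys here are invented names.

```python
# The transmit-side steps S11-S14, modeled as an ordered log.
# (Illustrative only; not a real PCIe implementation.)

def transmit_message(xpu_memory):
    log = []
    log.append("S11: driver notifies NIC of message transmission preparation")
    log.append("S12: memory read and CplD between RC and NIC")
    nic_buffer = xpu_memory["tx"]       # payload fetched via the RC
    log.append("S13: NIC sends data to the network and receives ACK")
    ack_received = True                 # assume successful delivery
    if ack_received:
        xpu_memory["tx_done"] = True    # S14: completion written directly
        log.append("S14: completion of transmission written to XPU memory")
    return nic_buffer, log
```

The key point of S14 is that the completion flag lands directly in the XPU memory via the RC, so the XPU can observe it by polling rather than by taking an interrupt.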
Memory polling is used in the XPU 120 and the XPU memory 121. The XPU 120 periodically checks a designated memory area in order to prevent context switching and other overhead processing steps. Note that the XPU 120 can perform an interrupt as an option and may directly access the input data without using memory polling.
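The polling behavior can be sketched as follows; this is a hedged toy model in which the asynchronous arrival of the payload from the protocol processor is simulated inside the loop, and all names are illustrative.

```python
# Memory polling sketch: instead of taking an interrupt, the XPU checks
# a designated memory area on each loop iteration and keeps doing
# application work while the area is empty. The moment the protocol
# processor posts the payload is simulated here after three work units.

def poll_for_payload(mailbox, max_checks=1000):
    work_units = 0
    for _ in range(max_checks):
        if mailbox.get("ready"):               # designated memory area
            return mailbox["payload"], work_units
        work_units += 1                        # work continues uninterrupted
        if work_units == 3:                    # protocol processor posts
            mailbox["payload"] = b"payload"    # the payload asynchronously
            mailbox["ready"] = True
    return None, work_units
```

Because the check is folded into the application's own loop, no context switch or pipeline flush is needed when the payload arrives.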
Replacement of a round-trip transaction with a unidirectional transaction is executed to further reduce the latency.
According to the interconnection apparatus of this embodiment, the protocol processing is off-loaded from the XPU, and the use efficiency of the XPU as a computing resource is improved.
(Second Embodiment)
A switching system according to the second embodiment of the present invention will be described with reference to Figs. 3 to 15.
(Arrangement of Switching System)
A switching system 200 according to this embodiment includes a switching system 210 and an interconnection apparatus 100 according to the first embodiment.
As shown in Fig. 3, the switching system 210 includes an optical switch 14 and a plurality of control groups 211_10 to 211_m0.
In each control group, a transmission group formed by devices such as a signal transmission optical transmitter 13, an electric switch, and a processor and a reception group formed by devices such as a signal reception optical receiver 15, an electric switch, and a processor are collectively mounted on, for example, the same mounting substrate.
In this case, an electric switch 212 is a single ASIC chip and arranged for each group to perform all switching and data management operations described above. In this case, an ASIC chip 212 has functions of a control unit (to be described later), a transmission electric switch, and a reception electric switch.
The basic arrangement and the operation of the switching system 210 will be described with reference to Figs. 4 to 15. For the sake of descriptive simplicity, an example in which the transmission group and the reception group are divisionally arranged on the input side and the output side of the optical switch will be described.
As the basic arrangement of the switching system 210, for example, switching systems 40 and 50 include the switching apparatus 10 or 30, a plurality of transmission groups (for example, 3_10 to 3_40), and a plurality of reception groups (for example, 4_20 and 4_30).
(Arrangement of Switching Apparatus 10)
First, a switching apparatus 10 will be described with reference to Figs. 4 to 7.
As shown in Fig. 4, a switching apparatus 10 includes optical transmitters 13, an optical switch 14, optical receivers 15, and control units 17.
A packet 1 is transmitted from a transmission host 3, and the optical transmitter 13 compresses and demultiplexes the packet 1 into packets (packets 2). The optical switch 14 switches the packet, and the packet 1 is received by a reception host 4 via the optical receiver 15. The packets 1 transmitted from the transmission host 3 are set with priorities.
The optical switch 14 is an all through switch and does not require arbitration in switching itself. The optical switch 14 simultaneously transmits all packets for the same output port destination. The destination output port simultaneously receives different packets.
The control unit 17 is an electric chip and arranged while being connected to the optical receiver 15. The control unit 17 executes processing such as arbitration for the packets 2 output from the optical switch 14.
The control unit 17 is locally assigned for each reception host 4 and executes self management of traffic input to the reception host 4. The control unit 17 is arranged adjacent to the reception host 4 and quickly updates the data acquisition priority.
In addition, the control unit 17 manages communication processing of the single reception host 4, thereby performing a high-speed operation.
An information channel is connected to the reception host 4 and the control unit 17 in addition to a data link.
(Operation of Switching Apparatus 10)
Fig. 5A shows packet processing in the switching apparatus 10. As comparison, Fig. 5B shows a conventional non-blocking switch 20.
In the conventional electric non-blocking switch 20, the bandwidths of all the output ports are constant. The switch 20 can transmit only one packet at one time. For example, if a flow B (101_2) and a flow C (101_3) are transmitted to the same output port, these flows are processed in a single band having a predetermined bandwidth (Fig. 5B).
On the other hand, in the optical switch 14 in the switching apparatus 10, the bandwidth of any output port is variable, and all the packets input to the switch can be accommodated. For example, as shown in Fig. 5A, the bandwidth is changed to process the packets in two bands. In this case, the bandwidth of the output port is 1/2 of the total throughput (bandwidth) of the switch.
As described above, the optical switch 14 can cope with a high bandwidth per output port.
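As a rough illustration of the variable-bandwidth behavior described above (a hypothetical sketch, not the patent's implementation), the band available to each of several flows sharing one output port can be modeled as the switch throughput divided by the number of flows:

```python
# Hypothetical sketch: k flows sharing one output port of an optical switch
# whose total throughput is S are each carried in a band of S / k, instead of
# queuing behind a single fixed band as in the conventional electric switch.

def band_per_flow(total_throughput_bps: float, n_flows: int) -> float:
    """Bandwidth of each band when n_flows share one output port."""
    return total_throughput_bps / n_flows

# Two flows (e.g., flows B and C above) sharing the same output port: each
# band is 1/2 of the total switch throughput. The 400-Gb/s total is an
# assumed example, not a figure from the text.
half_band = band_per_flow(400e9, 2)  # 200 Gb/s per flow
```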
As shown in Fig. 5A, in the switching apparatus 10, input data (packet C) 102_3 having a low priority is buffered and held in the RAM of the control unit 17 for a long time.
On the other hand, data (packet B) 102_2 having a high priority is transmitted first.
Input data (packet C) 102_3 having a low priority is transmitted after the completion of the transmission of the data (packet B) 102_2 having a high priority.
In addition, in the switching apparatus 10, an optical data link may be arranged in the control unit 17. Accordingly, optical data can be directly transmitted between the output port of the switch and the reception host 4 capable of processing the optical input signal.
As shown in Figs. 6A to 6C, the optical switch 14 can be arranged based on a broadcast-and-select method.
Signals input to different switch ports are multiplexed for, for example, the respective wavelengths. The input signals are respectively transmitted to all the output ports and selected in the respective output ports based on a desired signal transmission destination.
For example, as shown in Fig. 6A, an input packet 1_1 from a transmission host A (3_1) is demultiplexed by a splitter 18, and demultiplexed packets 2_1 to 2_4 are transmitted to reception hosts 4_1 to 4_4.
In addition, as shown in Fig. 6B, the input packet 1_1 and an input packet 1_4 from the transmission host A (3_1) and a transmission host D (3_4) are demultiplexed by the splitter 18. The demultiplexed packets 2_1 to 2_4 are selected by an optical selection filter 19 in the output port and transmitted to the reception hosts 4_1 to 4_4. In this case, a fast tunable filter or a polarization filter element can be used as the optical selection filter 19.
In addition, as shown in Fig. 6C, the input packets 1_1 and 1_4 from the transmission hosts A and D (3_1 and 3_4) are demultiplexed by the splitter 18. The demultiplexed packets 2_1 to 2_4 are received by the plurality of optical receivers 15 for the respective packets and transmitted to the reception hosts 4_1 to 4_4.
(Effect)
In the conventional non-blocking type packet switch 20, a data packet input to any input port can be switched to a desired output port.
If a plurality of packets are simultaneously transmitted to the same output port, contention occurs, and arbitration is executed for the colliding packets. In the arbitration, a packet having a high priority is first selected and transmitted, and other packets are buffered and subsequently transmitted.
As described above, in the conventional non-blocking type packet switch 20, arbitration is required in scheduling performed when simultaneously transmitting the packets to the same destination.
In the arbitration process, in general, pieces of information concerning the validity, priority, and selection of the data are collected by a central control unit (not shown) in accordance with the concentrated control method before the determination in the arbitration is made. The accuracy of this determination highly depends on all the necessary information being current as of the immediately preceding update. However, in a dynamic computing system, it is difficult to keep the information of the immediately preceding update accurate.
More specifically, for example, as shown in Fig. 7, the reception host 4 is assigned with a computational task for reducing the two data flows, that is, a flow A (201_1) and the flow B (201_2). The host 4 has already processed the flow A (201_1) and waits for the reception of the flow B (201_2).
On the other hand, the flow C (201_3) is another flow to be transmitted to the host 4 and reaches the switch before the arrival of the flow B (201_2) with a time difference. If the time difference is more than zero, the switch transmits the flow C (201_3) to the host 4.
When operating the optical switch 14, transmission of the flow B (201_2) is not started until the completion of all the transmission operations of the flow C (201_3).
When operating the electric switch, arbitration of the packets is started between the flow C (201_3) and the flow B (201_2) when the flow B (201_2) arrives. As long as an arbitrator 22 of the switch has already received a notification, priority is given to the flow B (201_2). If a failure or delay occurs in notifying the arbitrator 22 of the request of the host 4 for the flow B (201_2) as the highest-priority destination, the flow B (201_2) is not switched sufficiently quickly even if the time difference is equal to zero.
The host 4 is a processing unit, and the data acquisition priority changes at high speed. In this case, it is difficult to continuously update the arbitrator 22 with the changing priority. Accordingly, in the processing of all the system communication amount, it is difficult for the arbitrator 22 to sufficiently quickly and accurately perform determination for a very large amount of dynamic data.
As described above, in the conventional non-blocking type packet switch 20, since arbitration is executed by a concentrated control method, the number of ports of the switch and the processing capacity (throughput) are increased, and the process is complicated. As a result, the latency and power consumption increase.
In addition, collection of information necessary for the arbitration process, such as the priority, in accordance with a predetermined rule is difficult because of an increase of the system scale.
On the other hand, in the switching apparatus 10, arbitration is executed by the control unit 17 on the output side in accordance with the distributed control method. In this case, the host 4 connected to each output port determines a packet to be processed first. In this manner, the control unit 17 can locally execute arbitration for each host.
All packets are output in a predetermined duration T. In other words, the data rate of the output signal is equal to that of the input signal.
As described above, the data rate of the packets reaching the same output port in different slots is converted into the initial (original) data rate. In addition, the packets are output in an order requested to the connected output host. In this manner, the minimum latency is given to the packet having the highest priority.
Accordingly, in the switching apparatus 10, the latency and power consumption in switching can be reduced, and the load for acquiring information necessary for the arbitration process can be reduced (eliminated).
(Arrangement of Switching Apparatus 30)
Next, a switching apparatus 30 will be described with reference to Figs. 8 to 12B.
As shown in Fig. 8, an example of a switching apparatus (packet switch) 30 includes input ports 11, input blocks 12, optical transmitters 13, an optical switch 14, optical receivers 15, and output ports 16. The switching apparatus 30 also includes control units 17 connected to the optical receivers 15. In the switching apparatus 30 according to this embodiment, the optical switch 14 is operated in the time slot operation.
(Operation of Optical Switch)
The operation of the switching apparatus (packet switch) 30 according to this embodiment will be described with reference to Fig. 9.
Fig. 9 shows an example of the basic operation of the packet switch 30 for executing non-block processing using a 4 x 4 switch. This packet switch 30 is based on the time slot operation to be described below.
First, a packet is switched for each input group. In this case, arbitration is executed within the input packets of the same input group. Since this arbitration is executed for a small number of ports and a low communication amount, the operation can be executed quickly without requiring a long time.
An electric packet 1 input to the switch has a bandwidth BW (bit/sec) and a duration T. The desired output port 16 to which the packet is transmitted is set.
In the packet switch 30, a packet switching operation to any one of the four output ports 16 is complete within the time T. This is because if a time of T or longer is required for the switching of a single packet, the next input packet is blocked and a continuous switching delay is accumulated.
In order to match the input packet 1 with the time slot, the optical transmitter 13 compresses the input packet 1 by a factor (in this case, 4) equal to the number of ports (that is, the number of optical receivers to which packets are transmitted). That is, the duration of the input packet is divided by the number of ports and becomes T/4. In addition, the bandwidth is multiplied by the same factor and becomes 4BW in order to retain the packet data contents.
As described above, an optical input packet 2 is generated by satisfying the above conditions.
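The compression step above can be sketched as follows (an illustrative model; the 25-Gb/s input rate and 120-nsec duration are assumed example values, not requirements of the text):

```python
# For an N-port switch, an electric packet of duration T and bandwidth BW is
# compressed into an optical packet of duration T/N and bandwidth N*BW, so
# that the packet data contents are retained (duration x bandwidth constant).

def compress_packet(duration_s: float, bandwidth_bps: float, n_ports: int):
    """Return (compressed duration, compressed bandwidth)."""
    return duration_s / n_ports, bandwidth_bps * n_ports

# 4 x 4 switch: the duration becomes T/4 and the bandwidth becomes 4BW.
slot, bw = compress_packet(duration_s=120e-9, bandwidth_bps=25e9, n_ports=4)
# slot = 30 nsec, bw = 100 Gb/s; with n_ports=8 the same 25-Gb/s packet
# would become a 200-Gb/s optical packet, as in the 8 x 8 example later.
```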
Next, the optical switch 14 distributes the respective optical input packets 2 to the desired output ports 16 using periodic time slots. In this case, the periodic operation of the switch is divided into four time slots.
The distribution (switching) of the optical input packets 2 is repeatedly executed for each time slot in accordance with a sequence formed by steps S1 to S4 (to be described later).
Finally, the optical receiver 15 converts the packet into an electric packet, and the packet switch 30 outputs the electric packet in a predetermined duration. In other words, the data rate of the signal output from the packet switch 30 is equal to the data rate of the input signal.
For the packets arriving in different time slots with a reduced time difference, the data rate is converted back to the original (first) data rate. This conversion is performed in the arrival order of the packets. In addition, the priority of the conversion may be set by another arbitration.
The switching operation in the above optical switch 14 will be described with reference to Figs. 10A to 10D. Figs. 10A to 10D show examples of a series of switching operations in steps S1 to S4, respectively.
In the packet switch 30, packets are input to four ports 11_1 to 11_4, respectively. The packet (packet C) input to the port 11_3 has the desired output port as a port 16_3 and has the highest priority.
The packets (packets A and D) input to ports 11_1 and 11_4 have the desired output ports as ports 16_2 and 16_1, respectively, and have the second highest priority.
 The packet (packet B) input to the port 11_2 has the desired output port as a port 16_4 and has the third highest priority.
First, since the packet C has the highest priority, the packet C is transmitted to the output port 16_3 in a duration of the first time slot (step S1 and Fig. 10A).
Since the packets A and D have the second highest priority, they are simultaneously transmitted to the output ports 16_2 and 16_1 in a duration of the second time slot (step S2 and Fig. 10B). In this case, since the packets A and D are transmitted to different output ports, no collision occurs.
Next, since the packet (packet B) input to the port B has the third highest priority, it is transferred to the output port 16_4 in a duration of the third time slot (step S3 and Fig. 10C).
Finally, since transmission of the packets A to D is complete in the previous step (step 3), switching is not executed in a duration of the fourth time slot (step S4 and Fig. 10D).
As described above, if the operation cycle (the four steps) is complete, all the input packets are simultaneously switched to desired output ports by the non-blocking method.
In this switching operation, in all the steps, as shown in Figs. 10A to 10D, the respective output ports 16 are connected to only one input port 11. In addition, the input port 11 is connected to the desired output port 16. In switching of the packet input to the input port 11, the packet is arranged in a correct (accurate) time slot (the divided duration).
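The slot assignment in steps S1 to S4 can be sketched as follows (a minimal model inferred from Figs. 10A to 10D, not the patent's actual control logic; it assumes that packets of the same priority are always destined for different output ports):

```python
# Each distinct priority level is served in its own time slot: the k-th
# highest priority group of packets is transmitted in the k-th slot, so each
# output port is connected to at most one input port per slot.

def schedule(packets, n_slots):
    """packets: list of (name, output_port, priority); 1 = highest priority.
    Returns a list of per-slot packet lists, one entry per time slot."""
    by_priority = {}
    for name, port, priority in packets:
        by_priority.setdefault(priority, []).append((name, port))
    plan = [[] for _ in range(n_slots)]
    for slot, priority in enumerate(sorted(by_priority)):
        plan[slot] = by_priority[priority]
    return plan

# Figs. 10A-10D: C (port 16_3, highest priority), A and D (second), B (third).
plan = schedule([("C", 3, 1), ("A", 2, 2), ("D", 1, 2), ("B", 4, 3)], 4)
# plan -> [[("C", 3)], [("A", 2), ("D", 1)], [("B", 4)], []]
```

The Fig. 11 case, where all four packets target the same output port with distinct priorities, also fits this model: each packet occupies its own slot, so no collision occurs.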
In addition, the optical switch 14 is operated, as shown in Fig. 11.
In the packet switch 30, packets (packets A to D) are input to four ports 11_1 to 11_4, respectively. The packets A to D have the same desired output port (16_2) and are prioritized in the order of the packets B, A, D, and C.
First, since the packet B has the highest priority, the packet B is transmitted to the output port 16_2 in a duration of the first time slot (step S1).
Next, since the packet A has the second highest priority, the packet A is transmitted to the output port 16_2 in a duration of the second time slot (step S2).
Next, since the packet D has the third highest priority, the packet D is transmitted to the output port 16_2 in a duration of the third time slot (step S3).
Finally, since the packet C has the fourth highest priority, the packet C is transmitted to the output port 16_2 in a duration of the fourth time slot (step S4).
As described above, if the operation cycle (the four steps) is complete, all the input packets are simultaneously switched to desired output ports by the non-blocking method. In this case, since the packets A to D are transmitted in different time slots, no collision occurs.
In this manner, in the packet switch 30, all the input packets transmitted to the same output port are correctly (accurately) switched to this port in the time T.
(Effect)
The effect of the switching apparatus 30 according to this embodiment will be described below.
In the optical switch according to this first embodiment, problems are posed in that the signal output level decreases along with an increase in the number of ports (Fig. 6A), and in that the number of constituent units, such as the optical selection filter 19 (for example, the fast tunable filter) and the optical receiver 15, is increased (Figs. 6B and 6C). In particular, it is technically difficult to control a large number of fast tunable units, and the power consumption is increased.
The optical switch 14 according to this embodiment can process a large amount of dynamic data sufficiently quickly by introducing time slots and increasing the bit rate, without decreasing the signal output level as the number of ports increases and without increasing the number of constituent units, thereby reducing power consumption.
In addition, in the switching apparatus 10 according to this embodiment, the arbitration by the conventional concentrated control method is distributed into the following steps, and the arbitration is executed.
First, the optical switch 14 transmits the packets of different input groups to the same output group in different time slots. The optical switch 14 can execute this step at a high data rate with accurate time control.
Next, arbitration is executed by the control unit 17 on the output side in accordance with the distributed control method. In this case, the host 4 connected to each output port determines a packet to be processed first. In this manner, the control unit 17 can locally execute arbitration for each host.
All packets are output in a predetermined duration T. In other words, the data rate of the output signal is equal to that of the input signal.
As described above, the data rate of the packets reaching the same output port in different slots is converted into the initial (original) data rate. In addition, the packets are output in an order requested to the connected output host. In this manner, the minimum latency is given to the packet having the highest priority.
Accordingly, in the switching apparatus of this embodiment, the latency and power consumption in switching can be reduced, and the load for acquiring information necessary for the arbitration process can be reduced (eliminated).
The effect of the switching apparatus 30 will be described in detail in comparison with the conventional non-blocking switch.
Fig. 12A shows the mode of latency in a flow switched by the switching apparatus 30. As comparison, Fig. 12B shows the mode of latency in a flow switched by the conventional non-blocking switch 20.
For example, it is assumed that a flow A (1_10) having packets A1 (1_11) to A3 (1_13) and a flow D (1_40) having packets D1 (1_41) to D3 (1_43) are input and transmitted to the same output port.
At this time, as shown in Fig. 12B, in the conventional non-blocking switch, since the flows A (2_10) and D (2_40) are processed in a single band, the delay times are accumulated. As a result, if the length (time) of one packet is T, the delay time is 3T, that is, the length (time) of the immediately preceding transmitted flow.
As described above, in the conventional non-blocking switch, the delay time is increased, and the latency is increased.
On the other hand, as shown in Fig. 12A, in the switching apparatus 30, since the flows A (2_10) and D (2_40) are processed in two bands, the delay time is rarely accumulated and does not exceed T. This delay occurs in only the first packet and can be neglected as compared with the length 3T of the flow.
As described above, the switching apparatus 30 has a short delay time, so that the latency can be reduced.
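The delay comparison above can be checked with simple arithmetic (an illustrative calculation in units of the packet length T):

```python
# Worked check of the delays in Figs. 12A and 12B (units of packet length T).
T = 1.0
packets_per_flow = 3

# Conventional non-blocking switch (Fig. 12B): the second flow waits for the
# whole immediately preceding flow of three packets, i.e., 3T.
delay_conventional = packets_per_flow * T

# Switching apparatus 30 (Fig. 12A): the two flows are processed in two
# bands, so only the first packet is delayed, by at most one packet length.
delay_slotted = T

reduction = delay_conventional / delay_slotted  # 3x shorter in this example
```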
In a normal electric switch, an input packet passes through the input port of the switch. First, the destination and the priority are examined, and then concentrated arbitration is executed. The first packet to be transmitted is determined from all the packets having the same output port destination.
Concentrated arbitration processing becomes complicated with an increase in the number of switch ports and the throughput. As a result, the communication latency is increased, and the power consumption is increased.
On the other hand, according to the switching apparatus 30, since the packet can be switched without executing the concentrated arbitration requiring a long time, the communication latency is reduced, and the power consumption can be reduced.
Since the optical switch 14 performs part of the switching processing, the power consumption is lower than that of the ASIC formed by a CMOS transistor, and the switching capacity can be increased.
In addition, since a chiplet is used for the input block 12, the area occupied by the input block 12 can be reduced. As a result, even if an optical-electric interface is mounted, the total area of the packet switch (chip) is not increased. Accordingly, the optical-electric interface is mounted, and the throughput (processing capacity) of the switch can be increased without changing the chip area. In addition, the power consumption can be reduced by using the chiplet.
The contention between the ports of the same block can be prevented, and non-blocking processing can be executed.
In addition, to simultaneously transmit the plurality of packets to the same destination by the conventional packet switch, parallel optical receivers in number equal to the number of packets are necessary.
On the other hand, according to the switching apparatus 30, compact copies of the input packets are created at a high rate and divided and transmitted in short time slots. As described above, transmission of the packets to the same destination can be processed within a short time as compared with the actual packet input interval by using time interleaving.
In this case, the optical receiver 15 using the switching apparatus 30 can be operated in correspondence with this burst mode transmission.
The switching apparatus 30 is formed by four 1 x 4 switching units corresponding to the different input ports 11 to easily implement a high-speed 4 x 4 optical switch device. In the 4 x 4 optical switch device, the time (transition time) for transitioning from one switch mode (for example, Fig. 10A) to another switch mode (for example, Fig. 10B) is very short as compared with the duration of the input packet.
For example, although the transition time is assumed here to be negligible, it can in practice be reduced to 10 psec by an actually usable technique. This transition time is very short as compared with a 100-Gb/s Ethernet packet having a duration of 120 nsec.
In addition, a short guard time may be provided between the optical packets to prevent all data losses between switching operations.
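The figures above can be cross-checked with simple arithmetic (the 1500-byte frame size is an assumption; the text states only the 120-nsec duration):

```python
# A standard 1500-byte Ethernet frame at 100 Gb/s lasts 120 nsec, roughly
# four orders of magnitude longer than the assumed 10-psec transition time.
frame_bits = 1500 * 8                 # 12,000 bits
duration_s = frame_bits / 100e9       # 1.2e-7 s = 120 nsec
transition_s = 10e-12                 # 10 psec transition time
ratio = duration_s / transition_s     # 12,000: the transition is negligible
```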
The bandwidth of the packet generated by the host is multiplied by a coefficient F (switch port count). For example, in an 8 x 8 switch, a 25-Gb/s electric packet must be converted into a 200-Gb/s optical packet. This optical packet is generated by using a direct modulation laser and a multilevel modulation format.
In this case, since a distance between hosts assumed in this embodiment is short, this embodiment is applicable to a high data rate. In addition, the optical dispersion effect has a negligible level.
Furthermore, a packet having a high bit rate may be generated by another method. To scale up the number of interconnected hosts without performing concentrated control, this switch may be used as a core switching unit in a hybrid switching architecture.
(Arrangement of Switching System 40)
Next, a switching system 40 will be described with reference to Fig. 13. The switching system 40 can be scaled by grouping hosts.
As shown in Fig. 13, the switching system 40 includes a plurality of transmission (source) groups 3_10 to 3_40 on the transmission side. Each transmission group includes a plurality of transmission hosts (for example, 3_11 and 3_12) and a switching element 41.
A plurality of destination groups (for example, 4_20 and the like) are provided on the reception side. Each destination group includes an optical receiver 15, a control unit 17, a reception-side switch 43, and a plurality of reception hosts (for example, 4_21, 4_22, and the like). Other arrangements are the same as in the first embodiment.
The switching element 41 is a low-radix ASIC switch chip.
The optical switch 14 has four input/output ports and an operation period divided into four time slots. In this case, not the individual host units but the groups 3_10 to 3_40 of transmission hosts are connected to the ports of the optical switch 14.
For example, in the transmission group A (3_10), the ASIC switch 41 and two hosts A1 and A2 (3_11 and 3_12) are arranged to be adjacent to each other. The ASIC switch 41 and the two hosts 3_11 and 3_12 are electrically linked at a short distance.
In this case, scalability is improved by increasing the number of host units per group. In addition, in order to retain the advantages of the electric link, the number of host units per group is preferably about 10.
As shown in Fig. 13, the transmission groups A to D (3_10 to 3_40) are connected to the input ports of the optical switch 14, and destination groups 4_10 to 4_40 are connected to the output ports.
The hosts of the same group exchange packets using the ASIC switch 41. For example, the ASIC switch 41 of the group A (3_10) is used to interconnect the hosts A1 and A2 (3_11 and 3_12). Packets between hosts of different groups (to be referred to as inter-group packets hereinafter) are exchanged via the interconnection with the optical switch 14.
The groups of the respective destinations (outputs) are connected to only one transmission (input) group in all the time slots. Switching of the inter-group packets of the transmission group is processed by arranging packets (optical packets of the shortened duration) within the accurate time slots. In each time slot, the transmission group is connected to the desired destination group.
(Operation of Switching System 40)
The operation of the switching system 40 according to this embodiment will be described with reference to Fig. 13.
In the switching system 40, end-to-end transmission of the inter-group packet from the transmission host to the destination host is formed by the following three steps.
As the first step, electric switching for a packet is executed in a level of not the destination host but the destination group.
More specifically, in order to demultiplex a packet generated by the host, electric switching is executed by the transmission groups 3_10 to 3_40 using the local ASIC switch 41 in accordance with the destination group.
The inter-group packets simultaneously transmitted to the same group are collected by destination virtual queues regardless of the difference between the destination hosts. In this case, queues G1 to G4 (42_1 to 42_4) correspond to the destination groups 4_10 to 4_40.
In this case, as an example, two packets to be simultaneously transmitted from the hosts A1 and A2 (3_11 and 3_12) to the hosts 4_22 and 4_21 of the destination group 4_20 are assumed. At this time, both the packets are switched into the queue G2 (42_2).
For example, the packets from the hosts A1 and A2 (3_11 and 3_12) are transmitted in a bandwidth twice 25 Gb/s.
As the second step, when packets are respectively arranged in the matching time slots in the optical switch 14, optical switching is executed for the packets (optical packets of the shortened duration) transmitted to the desired destination group.
For example, a packet is demultiplexed into four packets and compressed by a factor of four. The demultiplexed packets are assigned to the first to fourth time slots in the time T and transmitted in the bandwidth of 200 Gb/s.
More specifically, the packets are transmitted from the corresponding queue in the ASIC switch 41 to the respective optical switches 14.
In this case, the hosts of the same group simultaneously generate packets to be transmitted to the same group.
In addition, to prevent contention, these simultaneous packets are arranged in the identical time slots of the optical switch 14. This can be achieved by adjusting the bandwidth of the optical transmitter 13.
For identification, packets of the different transmission hosts need not be demultiplexed using different wavelengths. Note that if a WDM-based optical transmitter is used, the request of the high bandwidth along with an increase in the number of host units per group can be satisfied.
As the third step, the packet can reach the desired destination group. For example, all the 25-Gb/s packets can be received in the time T.
Subsequently, electric switching is executed for the packet.
More specifically, the plurality of packets are simultaneously transmitted to the same end host, and a packet having a high priority is processed first. The self management of the input data packet as described previously is executed for the local ASIC switch 41 to which data reception is assigned.
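The first step above groups packets by destination group rather than destination host; a minimal sketch of the destination virtual queues is shown below (the host names and group identifiers are hypothetical, and this is not the patent's implementation):

```python
# The local ASIC switch sorts inter-group packets into virtual queues keyed
# by the destination *group*, ignoring which host in the group is the final
# destination; per-host delivery is resolved later at the destination side.

from collections import defaultdict

def enqueue_by_group(packets, host_to_group):
    """packets: list of (payload, dest_host).
    host_to_group: mapping from host name to destination group id.
    Returns {group_id: [payload, ...]} -- the destination virtual queues."""
    queues = defaultdict(list)
    for payload, dest_host in packets:
        queues[host_to_group[dest_host]].append(payload)
    return queues

# Example from the text: hosts A1 and A2 simultaneously send to hosts 4_22
# and 4_21 of destination group 2; both packets land in queue G2.
host_to_group = {"4_21": 2, "4_22": 2, "4_11": 1}
q = enqueue_by_group([("pktA1", "4_22"), ("pktA2", "4_21")], host_to_group)
```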
On the other hand, in the switching system 40, arbitration by the conventional concentrated control method is executed by distributing the arbitration by the following three steps.
In the first step, the input ports of the switching system 40 are divided into groups (for example, 3_10 to 3_40), and the packets of each group are processed independently of the remaining packets. Packets destined for the same output group that are simultaneously input within the same input group are processed together and transmitted without executing arbitration. In this manner, since the input ports are grouped into small groups, this step can be performed at high speed.
As in the second embodiment, as the second step, the processing step of the optical switch is executed. As the third step, the arbitration step is executed.
Accordingly, in the switching apparatus of this embodiment, the latency and power consumption in switching can be reduced, and the load for acquiring information necessary for the arbitration process can be reduced (eliminated).
According to the switching system 40, the hosts are grouped to increase the number of interconnected hosts, and scalability of the system can be improved.
(Arrangement of Switching System 50)
Next, a switching system 50 will be described with reference to Figs. 14 and 15. The switching system 50 is scaled by optical multicasting (optical multiplexing).
As shown in Fig. 14, a switching system 50 includes an optical multiplexer 51 between an optical transmitter 13 and an optical switch 14. The switching system 50 also includes a first optical demultiplexer 52 and a second optical demultiplexer 53 between the optical switch 14 and the optical receiver 15. Other arrangements are the same as in the third embodiment.
The plurality of optical transmitters 13 are connected to the optical multiplexer 51.
In addition, the output port of the optical switch 14 is connected to the first optical demultiplexer 52, and the second optical demultiplexer 53 is connected to the output portion of the first optical demultiplexer 52.
This embodiment illustrates an example using wavelength multiplexing of an optical signal as an example of optical multicasting.
To wavelength-multiplex an optical signal, an AWG (Arrayed Waveguide Grating) optical coupler is used in the optical multiplexer 51.
In addition, an optical splitter is used in the first optical demultiplexer 52 and demultiplexes an optical signal at a predetermined power ratio.
An AWG filter is used in the second optical demultiplexer 53 and optically demultiplexes an optical signal for each wavelength.
(Operation of Switching System)
For example, as shown in Fig. 14, in the switching system 50, an optical transmitter 13_1 connected to a transmission group A (3_10) and an optical transmitter 13_2 connected to a transmission group B (3_20) output optical packets having different wavelengths, respectively.
The optical packets having different wavelengths are multiplexed by the AWG optical coupler 51 and simultaneously transmitted to a plurality of destination groups 4_10 to 4_40 in the same time slot.
As described above, the optical packets can be transmitted to a large number of groups, for example, a large number of end hosts without increasing the number of ports of the switch.
In addition, a higher multicasting ratio can be implemented, and the maximum achievable ratio can be determined by the power budget of the optical link.
For example, if the two optical packets having different wavelengths are transmitted in the bandwidth of 200 Gb/s, the packets can be transmitted in a two-fold bandwidth (400 Gb/s).
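The bandwidth figure above follows directly from the number of multiplexed wavelengths (an illustrative calculation):

```python
# Two optical packets on different wavelengths, each at 200 Gb/s, are
# multiplexed by the AWG coupler into the same time slot, so the combined
# transmission bandwidth doubles to 400 Gb/s.
per_wavelength_bps = 200e9
n_wavelengths = 2
combined_bps = n_wavelengths * per_wavelength_bps  # 400 Gb/s
```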
The transmitted optical packet is demultiplexed into destination groups by the optical splitter 52 and demultiplexed by the AWG filter 53 in each group (for example, group 4_40) for each wavelength. The optical packets are transmitted to the end hosts (for example, the destination hosts 4_41 and 4_42).
As described above, if multicasting is used, the optical packets from the plurality of transmission groups simultaneously reach the same destination group. To process packets from different transmission groups, a reception unit having an optical demultiplexing function is used to increase the total number of the reception units of the system.
According to the switching system 50, the scalability of the system can be improved by optical multicasting (optical multiplexing).
Fig. 15 shows an example of the timing chart of the switching system 50. In the switching system 50, one switching period is divided into four time slots. Among these time slots, the first and second time slots are illustrated in the left and right views, respectively.
The switching system 50 includes 128 25-Gb/s hosts and 16 groups (eight hosts per group) and multicasts four groups at a time.
A commercially available transceiver unit based on a PAM4 multilevel format is used in the switching system 50 and performs processing in the total communication amount of 6.4 Tb/s.
(Effect)
As described above, according to the switching system 210 of this embodiment, the integrated bandwidth of data appearing from the electric switching edge can be processed to efficiently reduce the consumed switching energy per bit.
In addition, in the switching system 210, the optical switch can improve the processing scalability of the entire switching system without requiring arbitration and without sacrificing the end-to-end latency. That is, the optical switch can improve the scalability of the system and the end-to-end latency without the tradeoff between them.
In addition, the plurality of electric switches each having a low radix are used in the switching system 210. As compared with the case in which a single switch unit having a large switching capacity is used, the scalability of the computing system can be improved at a low power density, and the large-scale switching fabric can be implemented by using the low-radix switch.
Each switch processes communication from a group having a small number of XPUs.
Each XPU is arranged near its designated switch, so an energy-efficient, high-bandwidth electrical connection is suitable in this case.
In this embodiment, each XPU can quickly update its designated switch using the data reception priority, and because each switch processes only a small part of the total data communication, arbitration can be executed quickly.
(Example 1)
A switching system 300 according to Example 1 of the present invention will be described with reference to Figs. 16 and 17.
As shown in Fig. 16, the switching system 300 according to Example 1 includes a switching apparatus 301 and an interconnection apparatus 302.
As described above, the switching apparatus 301 includes an optical switch 14, an optical transmitter 13, an optical receiver 15, and an electric switch (including a control unit) 212.
The interconnection apparatus 302 is connected to each of a plurality of XPUs 120. In the interconnection apparatus 302, an RC 140 and a protocol processor 110 are sequentially connected. The protocol processor 110 is connected to the electric switch 212. In this case, protocol processing is off-loaded from the XPU 120 to the protocol processor 110. An XPU memory 121 is connected to each XPU 120. The RC 140 is connected to the XPU memory 121 and can directly access the XPU memory 121.
As described above, to obtain an I/O subsystem with a very low latency, the ports of each electric switch are connected to the protocol processor and a device having a function such as a root complex (RC). In this case, the RC function is not limited to the PCIe standard.
A packet transmission table is used in a conventional electric switch and is implemented by a TCAM (Ternary Content Addressable Memory). In this case, the priorities and destinations of all simultaneously input packets are checked, and all the packets are transmitted to the desired output ports.
On the other hand, in this embodiment, the protocol processor may be shared between different ports. In this case, the packet checksum is checked in the TCAM in parallel steps. Only the packet payload is transmitted to the desired output port; the entire packet is not transmitted.
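The payload-only forwarding described above can be sketched in software as follows. The dictionary lookup stands in for the hardware TCAM, and the packet fields, checksum scheme, and port names are illustrative assumptions, not the embodiment's wire format.

```python
# Minimal sketch: a dictionary stands in for the TCAM look-up table.
# The checksum is verified and, unlike a conventional switch that forwards
# the whole packet, only the payload is handed to the output port.
# All names and the toy checksum are illustrative assumptions.
forwarding_table = {"XPU#1": 0, "XPU#2": 1}  # destination -> output port

def checksum_ok(packet: dict) -> bool:
    """Toy checksum: sum of payload bytes modulo 256."""
    return sum(packet["payload"]) % 256 == packet["checksum"]

def forward(packet: dict):
    """Return (output_port, payload) for a valid packet, else None."""
    if not checksum_ok(packet):
        return None                      # drop corrupted packets
    port = forwarding_table[packet["dest"]]
    return port, packet["payload"]       # payload only, not the full packet

pkt = {"dest": "XPU#1", "payload": b"data", "checksum": sum(b"data") % 256}
print(forward(pkt))  # (0, b'data')
```

The point of the design is visible in `forward`: the header is consumed during the lookup step, so only the effective payload crosses to the output side.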
In addition, it is difficult to assign an individual high-performance processor to each port, so a tradeoff between the computing power of the protocol processor and the other system features is required.
A high-power protocol processor may be shared within the same group, or among XPUs attached to the same electric switch.
In addition, as shown in Fig. 17, to reduce the latency and improve power efficiency, physical switching and protocol execution may be integrally implemented on the same device 312.
A CAM (Content Addressable Memory) is used to implement the look-up table in a high-end ASIC switch and can detect, at high speed, the memory address coinciding with a specific content.
In a normal switch, the matched content is associated with the desired output switch port, and processing is performed by transferring the data accordingly.
On the other hand, in the integral implementation of the I/O and the switching function of Example 1, data whose content matches is protocol-processed and accurately transmitted to the switch port, and only the effective payload part is delivered to the XPU.
The optical transmitter 13 and the optical receiver 15 are arranged at the ports of the device connected to the optical switch, and an FEC encoder and an FEC decoder are mounted on the optical transmitter 13 and the optical receiver 15, respectively. Accordingly, optical signals at a high data rate (about 100 Gb/s or more) can be transmitted with high reliability.
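As a toy illustration of why the FEC encoder at the transmitter and decoder at the receiver improve reliability, the sketch below uses a 3x repetition code with majority voting. This code is an assumption chosen for simplicity; real high-rate transceivers use far stronger codes (e.g. Reed-Solomon or LDPC).

```python
# Toy FEC sketch: a 3x repetition code corrects any single bit flip per
# symbol by majority vote. This only illustrates the role of the encoder
# at the optical transmitter and the decoder at the optical receiver;
# practical links use much stronger codes (RS, LDPC).
def fec_encode(bits):
    return [b for b in bits for _ in range(3)]    # repeat each bit 3 times

def fec_decode(coded):
    out = []
    for i in range(0, len(coded), 3):
        triple = coded[i:i + 3]
        out.append(1 if sum(triple) >= 2 else 0)  # majority vote
    return out

data = [1, 0, 1, 1]
tx = fec_encode(data)
tx[4] ^= 1                     # one bit corrupted on the optical link
print(fec_decode(tx) == data)  # True: the flip is corrected
```

The coding overhead (3x here) is the price paid for error tolerance; production codes achieve far better rate-versus-correction tradeoffs at 100 Gb/s and above.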
In this case, all the packets which are simultaneously transmitted can be transmitted to the same XPU by the optical switch 14.
For example, a packet having a low priority and a packet having a high priority, both of which have the same XPU#1 as the destination, are simultaneously input from the optical switch 14. At this time, the packet having the low priority is transmitted to the common memory (not shown) of the device. In this case, the common memory is embedded in or mounted on the device 312.
On the other hand, the packet having the high priority is protocol-processed, and the payload part is transmitted to XPU#1.
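The priority handling just described can be sketched as follows. The queue stands in for the common memory of the device 312, and the field names and priority values are illustrative assumptions.

```python
# Hypothetical sketch of the priority handling above: when two packets for
# the same XPU arrive simultaneously from the optical switch, the
# high-priority one is delivered, while the low-priority one is held in
# the device's common memory. Names are illustrative assumptions.
from collections import deque

common_memory = deque()   # stands in for the memory on the device 312

def deliver(packets):
    """Deliver the highest-priority packet's payload; buffer the rest."""
    packets = sorted(packets, key=lambda p: p["priority"], reverse=True)
    winner, losers = packets[0], packets[1:]
    common_memory.extend(losers)     # low priority held in common memory
    return winner["payload"]         # only the payload reaches the XPU

arrivals = [{"dest": "XPU#1", "priority": 0, "payload": b"low"},
            {"dest": "XPU#1", "priority": 7, "payload": b"high"}]
print(deliver(arrivals))  # b'high'
```

The buffered low-priority packet can then be protocol-processed and delivered in a later slot, so simultaneous arrivals never force the optical switch to arbitrate.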
According to the switching system of Example 1, the delay and power consumption of signal processing can be reduced by integrally implementing the switching of the low-radix electric switch and the highly functional device.
The embodiments of the present invention illustrate an example of the structure, size, material, and the like of each constituent component in the arrangements of the interconnection apparatus and the switching system. However, the present invention is not limited to this. Any example is possible as long as the functions of the interconnection apparatus and the switching system can be implemented to obtain the same effects as described above.
The present invention relates to an interconnection apparatus that interconnects a processor and a switch, and to a switching system, and is also applicable to a computing system.
100: interconnection apparatus
110: protocol processor

Claims (8)

  1. An interconnection apparatus for interconnecting a switching apparatus and an XPU connected to an XPU memory, comprising:
    a protocol processor configured to execute an off-loaded protocol from the XPU and directly access the XPU memory.
  2. The interconnection apparatus according to claim 1, further comprising an RC between the XPU and the protocol processor.
  3. A switching system comprising:
    an optical switch;
    an electric switch;
    an optical transmitter configured to convert an electric signal input from the electric switch into an optical signal and output the optical signal to the optical switch;
    an optical receiver configured to convert the optical signal input from the optical switch into an electric signal and output the electric signal to the electric switch; and
    a plurality of interconnection apparatuses according to claim 1, which are connected in parallel to the electric switch.
  4. The switching system according to claim 3, wherein
    a priority is set to the input electric signal, and
    the switching system further comprises a control unit connected to and arranged in the optical receiver, and configured to hold the converted electric signal having the priority whose level is low, and transmit first the converted electric signal having the priority whose level is high.
  5. The switching system according to claim 4, further comprising a memory configured to store the converted electric signal having the priority whose level is low.
  6. The switching system according to claim 3, further comprising an encoder and a decoder.
  7. The switching system according to claim 3, wherein the optical signal transmitted from the optical transmitter is multiplexed, demultiplexed, and received by the optical receiver in accordance with a predetermined optical characteristic.
  8. The switching system according to claim 3, wherein
    the optical transmitter demultiplexes the optical signal by a number equal to the number of optical receivers to which the optical signals are transmitted, and
    the optical switch transmits the demultiplexed optical packets in time slots respectively assigned to the demultiplexed optical packets.
PCT/JP2022/031223 2022-08-18 2022-08-18 Interconnection apparatus and switching system Ceased WO2024038544A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2022/031223 WO2024038544A1 (en) 2022-08-18 2022-08-18 Interconnection apparatus and switching system
JP2025508714A JP2025526149A (en) 2022-08-18 2022-08-18 Interconnection Devices and Switching Systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/031223 WO2024038544A1 (en) 2022-08-18 2022-08-18 Interconnection apparatus and switching system

Publications (1)

Publication Number Publication Date
WO2024038544A1 true WO2024038544A1 (en) 2024-02-22

Family

ID=89941474

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/031223 Ceased WO2024038544A1 (en) 2022-08-18 2022-08-18 Interconnection apparatus and switching system

Country Status (2)

Country Link
JP (1) JP2025526149A (en)
WO (1) WO2024038544A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090238574A1 (en) * 2008-03-18 2009-09-24 Fujitsu Limited Apparatus and method for monitoring optical gate device, and optical switch system
US20100250783A1 (en) * 2002-08-30 2010-09-30 Uri Elzur System and method for tcp/ip offload independent of bandwidth delay product
US20110142052A1 (en) * 2008-08-25 2011-06-16 Vivek Kulkarni Method for transferring data packets in a communication network and switching device
US20120008945A1 (en) * 2010-07-08 2012-01-12 Nec Laboratories America, Inc. Optical switching network
US20130330079A1 (en) * 2012-06-06 2013-12-12 Techsys Insights Remote Optical Demarcation Point


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HATAYAMA HITOSHI: "Chapter 1 Basic knowledge of PCI Express ", DESIGN WAVE MAGAZINE, vol. 12, no. 121, 1 December 2017 (2017-12-01), pages 23 - 24, XP093140744 *

Also Published As

Publication number Publication date
JP2025526149A (en) 2025-08-07

Similar Documents

Publication Publication Date Title
US11368768B2 (en) Optical network system
CN100373368C (en) Multi-port high-speed serial fabric interconnect chip in mesh structure
AU2004311714B2 (en) Apparatus and method for improved Fibre Channel oversubscription over transport
US7042891B2 (en) Dynamic selection of lowest latency path in a network switch
US7324537B2 (en) Switching device with asymmetric port speeds
US6466343B1 (en) System for assigning wavelengths in a wave division multiplexing based optical switch
Modiano WDM-based packet networks
US20020118692A1 (en) Ensuring proper packet ordering in a cut-through and early-forwarding network switch
US20050013613A1 (en) Optical burst switch network system and method with just-in-time signaling
US9319310B2 (en) Distributed switchless interconnect
US11095961B2 (en) Remote data multicasting and remote direct memory access over optical fabrics
JP2003008619A (en) Packet communication device
US20090080885A1 (en) Scheduling method and system for optical burst switched networks
US20060146808A1 (en) Reconfigurable interconnect/switch for selectably coupling network devices, media, and switch fabric
CN100579065C (en) A high-speed data stream transmission method, device and data exchange equipment
US6965602B2 (en) Switch fabric capable of aggregating multiple chips and links for high bandwidth operation
US7289499B1 (en) Integrated system and method for controlling telecommunication network data communicated over a local area network and storage data communicated over a storage area network
US20130266315A1 (en) Systems and methods for implementing optical media access control
WO2024038544A1 (en) Interconnection apparatus and switching system
CN1352842A (en) Optical communications network
US20030063604A1 (en) Switching apparatus for high speed channels using multiple parallel lower speed channels while maintaining data rate
US20100138554A1 (en) Interfacing with streams of differing speeds
WO2024038541A1 (en) Switching apparatus and switching system
CN1829391A (en) Method and device for transmitting an optical signal in an optical burst switching network using time of arrival
CN115334378A (en) All-optical network system, device and control method for resource pooling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22955718

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2025508714

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22955718

Country of ref document: EP

Kind code of ref document: A1