High-Performance Computing: Analysis and Application of RoCE Technology

Development of HPC Networks and the Birth of RoCE

In early High-Performance Computing (HPC) systems, custom network solutions such as Myrinet, Quadrics, and InfiniBand were often used instead of Ethernet. These networks could overcome the design limitations of Ethernet solutions, providing higher bandwidth, lower latency, better congestion control, and some unique features.

In 2010, the IBTA released the RoCE (RDMA over Converged Ethernet) standard, followed by the RoCE v2 standard in 2014, a period that also saw a significant increase in Ethernet bandwidth. This substantial improvement in Ethernet performance has led to growing interest in high-performance network solutions that remain compatible with traditional Ethernet. It has halted the decline in Ethernet usage among the HPC clusters on the TOP500 list, allowing Ethernet to keep a prominent position in the rankings.

While Myrinet and Quadrics have disappeared, InfiniBand still holds an important position in high-performance networks. Cray's proprietary network series, Tianhe's proprietary network series, and the Tofu D series networks also play significant roles.


Introduction to the RoCE Protocol

The RoCE protocol is a cluster network communication protocol that enables Remote Direct Memory Access (RDMA) over Ethernet. It offloads packet send and receive tasks to the network card, so the system does not need to enter kernel mode as it does with TCP/IP. This removes much of the overhead of copying, encapsulation, and decapsulation. As a result, RoCE significantly reduces the latency of Ethernet communication, lowers CPU utilization during communication, and makes more efficient use of the available bandwidth.

The RoCE protocol has two versions: RoCE v1 and RoCE v2. RoCE v1 is a link-layer protocol, which means that both communicating parties must be within the same Layer 2 network. RoCE v2, on the other hand, is a network-layer protocol, so its packets can be routed at Layer 3, which provides better scalability.

RoCE v1 Protocol

The RoCE v1 protocol retains the interface, transport layer, and network layer of InfiniBand (IB) and replaces the link layer and physical layer of IB with the link layer and physical layer of Ethernet. In the link-layer frame of a RoCE packet, the Ethertype field carries the value 0x8915, assigned by IEEE, which identifies the frame as a RoCE packet. However, because RoCE v1 does not adopt the IP network layer used on Ethernet, its packets do not contain an IP header. As a result, RoCE v1 packets cannot be routed at the network layer, and their transmission is limited to forwarding within a single Layer 2 network.


RoCE v2 Protocol

The RoCE v2 protocol makes several improvements to RoCE v1. It replaces the IB network layer retained by RoCE v1 with an IP network layer and a UDP header, and it uses the DSCP and ECN fields of the IP header to implement congestion control. RoCE v2 packets can therefore be routed, which provides better scalability. Because RoCE v2 has completely replaced the flawed first-generation protocol, "RoCE" today generally means RoCE v2 unless the first generation is specified explicitly.
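The difference between the two encapsulations can be illustrated with a small frame classifier. This is a minimal sketch, not a production parser: it assumes untagged Ethernet frames (no VLAN tag) and IPv4 only, and the function names are hypothetical. The Ethertype 0x8915 comes from the text above; UDP destination port 4791 is the port registered for RoCE v2.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # Ethertype that identifies a RoCE v1 frame
IPV4_ETHERTYPE = 0x0800
UDP_PROTOCOL = 17
ROCE_V2_UDP_PORT = 4791     # UDP destination port registered for RoCE v2


def classify_frame(frame: bytes) -> str:
    """Return 'rocev1', 'rocev2', or 'other' for an untagged Ethernet frame."""
    if len(frame) < 14:
        return "other"
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == ROCE_V1_ETHERTYPE:
        return "rocev1"  # IB headers follow the Ethernet header directly
    if ethertype != IPV4_ETHERTYPE or len(frame) < 14 + 20:
        return "other"
    ihl = (frame[14] & 0x0F) * 4   # IPv4 header length in bytes
    protocol = frame[14 + 9]       # IPv4 protocol field
    if protocol != UDP_PROTOCOL or len(frame) < 14 + ihl + 8:
        return "other"
    (dst_port,) = struct.unpack_from("!H", frame, 14 + ihl + 2)
    return "rocev2" if dst_port == ROCE_V2_UDP_PORT else "other"


def build_ipv4_udp_frame(dst_port: int) -> bytes:
    """Build a dummy Ethernet/IPv4/UDP frame for demonstration only."""
    eth = bytes(12) + struct.pack("!H", IPV4_ETHERTYPE)
    ip = bytes([0x45]) + bytes(8) + bytes([UDP_PROTOCOL]) + bytes(10)
    udp = struct.pack("!HHHH", 49152, dst_port, 8, 0)
    return eth + ip + udp
```

Because a RoCE v2 packet is an ordinary IP/UDP packet on the wire, any Layer 3 device can forward it, whereas a RoCE v1 frame is recognizable only by its Ethertype.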

Lossless Networks and RoCE Congestion Control Mechanism

In a network using the RoCE protocol, it is crucial that RoCE traffic is transmitted without loss. During RDMA communication, data packets must arrive without loss and in the correct order. If a packet is lost or arrives out of order, go-back-N retransmission is required: the receiver does not buffer the later packets that arrive after the gap, so the sender must resend everything from the missing packet onward.
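The cost of a single lost packet under go-back-N can be seen in a small simulation. This is a simplified sketch with hypothetical names: packet sequence numbers (PSNs) are plain integers without the wraparound a real implementation handles, and acknowledgements are not modeled.

```python
def receive_go_back_n(arrivals, expected_psn=0):
    """Accept only in-order packets; discard everything else.

    Returns (accepted, dropped, next_expected_psn).
    """
    accepted, dropped = [], []
    for psn in arrivals:
        if psn == expected_psn:
            accepted.append(psn)
            expected_psn += 1
        else:
            # Out-of-order packets are discarded rather than buffered.
            dropped.append(psn)
    return accepted, dropped, expected_psn


# Packets 0..4 are sent, but packet 2 is lost in the network.
accepted, dropped, next_psn = receive_go_back_n([0, 1, 3, 4])

# The sender has to go back to the first missing PSN and resend from there.
retransmit = list(range(next_psn, 5))
```

In this run, packets 3 and 4 reached the receiver intact but still have to be sent again, which is why a lossless fabric matters so much for RoCE.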

The RoCE protocol has two stages of congestion control: a slowdown stage using DCQCN (Data Center Quantized Congestion Notification) and a transmission pause stage using PFC (Priority Flow Control). Strictly speaking, only the former is a congestion control strategy, while the latter is a flow control strategy, but the two are commonly regarded as the two stages of congestion control.

Congestion often occurs when a network carries many-to-one communication. It shows up as a rapid increase in the total size of the messages waiting in the send buffer of a switch port. If nothing is done, the buffer fills up and packets are lost. Therefore, in the first stage, when the switch detects that the amount of data waiting in a port's send buffer has reached a certain threshold, it marks the ECN field in the IP header of the RoCE packets. When the receiver gets a packet whose ECN field has been marked by the switch, it sends a Congestion Notification Packet (CNP) back to the sender, telling the sender to reduce its sending rate.

It is important to note that not every packet is ECN-marked once the queue starts to build. Marking is governed by two parameters, Kmin and Kmax. When the queue length is less than Kmin, no packets are marked. When the queue length is between Kmin and Kmax, the longer the queue, the higher the probability of marking. When the queue length exceeds Kmax, all packets are marked. Likewise, the receiver does not send a CNP for every ECN-marked packet it receives. Instead, it sends one CNP in each time interval during which it received ECN-marked packets. In this way, the sender can adjust its sending rate based on the number of CNPs received.
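The marking rule and the CNP pacing described above can be sketched as two small functions. Several details here are assumptions rather than statements from the text: the linear ramp between Kmin and Kmax, the `p_max` parameter, the 50-microsecond default interval, and all function names. A real NIC also paces CNPs per flow, which this sketch ignores.

```python
def ecn_mark_probability(queue_len, k_min, k_max, p_max=1.0):
    """Probability that a switch marks a packet, given the egress queue length.

    Assumes a linear ramp from 0 at k_min up to p_max at k_max.
    """
    if k_max <= k_min:
        raise ValueError("k_max must be greater than k_min")
    if queue_len < k_min:
        return 0.0
    if queue_len > k_max:
        return 1.0
    return p_max * (queue_len - k_min) / (k_max - k_min)


def cnp_send_times(marked_arrival_times_us, interval_us=50.0):
    """Times at which the receiver emits a CNP: at most one per interval."""
    sent = []
    for t in sorted(marked_arrival_times_us):
        if not sent or t - sent[-1] >= interval_us:
            sent.append(t)
    return sent


# Queue lengths in KB with hypothetical thresholds Kmin=100 and Kmax=400.
probabilities = [ecn_mark_probability(q, 100, 400) for q in (50, 250, 500)]

# Six marked packets arrive, but only three CNPs are generated.
cnps = cnp_send_times([0, 10, 20, 60, 70, 120])
```

Probabilistic marking slows senders gradually as the queue grows, and pacing the CNPs keeps the notification traffic itself from adding to the load.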


When congestion in the network worsens further and the switch detects that the length of a port's send queue has reached a higher threshold, the switch sends a Priority Flow Control (PFC) pause frame to the upstream sender of the traffic. The upstream server or switch then pauses sending until the congestion in the switch is relieved. Once it is relieved, the switch sends another PFC control frame upstream to signal that sending can resume. Because PFC can pause individual traffic classes separately, and each class can be assigned a share of the total bandwidth, traffic on one class can be paused without affecting data transmission on the others.
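The pause and resume behavior can be modeled as a per-priority queue with two thresholds. This is only an illustrative sketch: the class name, the XOFF/XON threshold values, and the returned event strings are hypothetical, and a real switch tracks buffers in hardware. The eight priorities correspond to the eight traffic classes PFC can pause independently.

```python
class PfcQueue:
    """Buffer accounting for one priority with pause/resume thresholds."""

    def __init__(self, xoff_bytes, xon_bytes):
        if xon_bytes >= xoff_bytes:
            raise ValueError("xon_bytes must be below xoff_bytes")
        self.xoff_bytes = xoff_bytes  # pause upstream at or above this level
        self.xon_bytes = xon_bytes    # resume upstream at or below this level
        self.occupancy = 0
        self.paused = False

    def enqueue(self, nbytes):
        self.occupancy += nbytes
        return self._update()

    def dequeue(self, nbytes):
        self.occupancy = max(0, self.occupancy - nbytes)
        return self._update()

    def _update(self):
        """Return 'PAUSE' or 'RESUME' when a control frame would be sent."""
        if not self.paused and self.occupancy >= self.xoff_bytes:
            self.paused = True
            return "PAUSE"
        if self.paused and self.occupancy <= self.xon_bytes:
            self.paused = False
            return "RESUME"
        return None


# One queue per priority; only priority 3 becomes congested.
queues = {prio: PfcQueue(xoff_bytes=96_000, xon_bytes=48_000) for prio in range(8)}
events = [queues[3].enqueue(100_000), queues[3].dequeue(60_000)]
```

Keeping the resume threshold below the pause threshold avoids a rapid back-and-forth of pause and resume frames when the queue hovers near a single limit.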

RoCE & Soft-RoCE

Although most high-performance Ethernet cards now support the RoCE protocol, some cards still do not. IBM, Mellanox, and others therefore collaborated to create the open-source Soft-RoCE project. With Soft-RoCE, nodes whose network cards lack RoCE support can still communicate over the RoCE protocol with nodes that have RoCE-capable cards, as shown in the diagram. This brings no performance improvement to the former, but it lets the latter make full use of their performance. In some scenarios, such as data centers, upgrading only the high-I/O storage servers to RoCE-capable Ethernet cards can improve overall performance and scalability. Combining RoCE and Soft-RoCE in this way also allows a cluster to be upgraded gradually rather than all at once.

RoCE & Soft-RoCE

Issues with Applying RoCE to HPC

The Core Requirements of HPC Networks

According to NADDOD, there are two core requirements for HPC networks: ① low latency, and ② the ability to maintain low latency in rapidly changing traffic patterns.

For ① low latency, RoCE is designed to address this issue. As mentioned earlier, RoCE offloads network operations to the network card, achieving low latency and reducing CPU utilization.

For ② maintaining low latency in rapidly changing traffic patterns, the key concern is congestion control. However, the challenge lies in the fact that HPC traffic patterns are highly dynamic, and RoCE's performance in this regard is suboptimal.

RoCE's Low Latency

Compared to traditional TCP/IP networks, InfiniBand and RoCEv2 bypass the kernel protocol stack, resulting in significantly improved latency performance. Experimental tests have shown that by bypassing the kernel protocol stack, the end-to-end latency at the application layer in communications within the same cluster can be reduced from 50 microseconds (TCP/IP) to 5 microseconds (RoCE) or 2 microseconds (InfiniBand).

end to end communication latency

RoCE Packet Structure

Assuming we want to send 1 byte of data using RoCE, the additional costs to encapsulate this 1-byte data packet are as follows:

Ethernet Link Layer: 14 bytes MAC header + 4 bytes CRC

IP Layer: 20 bytes (IPv4 header)

UDP Layer: 8 bytes

IB Transport Layer: 12 bytes Base Transport Header (BTH)

Total: 58 bytes

Assuming we want to send 1 byte of data using IB, the additional costs to encapsulate this 1-byte data packet are as follows:

IB Link Layer: 8 bytes Local Routing Header (LRH) + 6 bytes CRC

IB Network Layer: 0 bytes (When there is only a Layer 2 network, the Link Next Header (LNH) field in the link layer can indicate that the packet has no network layer)

IB Transport Layer: 12 bytes Base Transport Header (BTH)

Total: 26 bytes

If it is a customized network, the packet structure can be simplified further. For example, the Mini-packet (MP) header of Tianhe-1A consists of 8 bytes.

From this, it can be seen that the heavy underlying structure of Ethernet is one of the obstacles to applying RoCE to HPC.
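The arithmetic behind this comparison can be checked with a few lines of code. The sketch below uses only the byte counts listed above; the dictionary and function names are hypothetical, and it ignores anything not counted in the text, such as the Ethernet minimum frame size, preamble, and inter-frame gap.

```python
# Per-packet overhead in bytes, as listed in the text.
ROCE_V2_OVERHEAD = {
    "Ethernet MAC header": 14,
    "Ethernet CRC": 4,
    "IP header": 20,
    "UDP header": 8,
    "IB BTH": 12,
}
IB_OVERHEAD = {
    "LRH": 8,
    "CRC": 6,
    "BTH": 12,
}
TIANHE_1A_MP_HEADER = 8


def wire_efficiency(payload_bytes, overhead_bytes):
    """Fraction of transmitted bytes that carry payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


roce_total = sum(ROCE_V2_OVERHEAD.values())
ib_total = sum(IB_OVERHEAD.values())

for payload in (1, 64, 4096):
    print(
        f"{payload:5d} B payload: "
        f"RoCE v2 {wire_efficiency(payload, roce_total):.1%}, "
        f"IB {wire_efficiency(payload, ib_total):.1%}, "
        f"Tianhe-1A MP header {wire_efficiency(payload, TIANHE_1A_MP_HEADER):.1%}"
    )
```

The gap is largest for small messages, which are common in HPC workloads, and it narrows as the payload grows toward the MTU.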

Ethernet switches in data centers often need to provide many other functions, such as SDN and QoS, which cost extra to implement.

These Ethernet features raise two questions. Is RoCE compatible with them? And do they affect the performance of RoCE?

Issues with RoCE Congestion Control

Both stages of the RoCE protocol's congestion control have certain issues that may make it difficult to maintain low latency in rapidly changing traffic patterns.

Priority Flow Control (PFC) relies on pause control frames to stop the upstream device before the receiver takes in more packets than it can hold, which would lead to packet loss. Compared with credit-based methods, this approach inevitably results in lower buffer utilization. The problem is most pronounced on low-latency switches, which tend to have relatively small buffers and are therefore hard to manage with PFC. A credit-based approach, by contrast, allows more precise buffer management.
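One way to see why pause frames cost buffer space is to estimate the headroom a port must reserve: data already in flight keeps arriving after the pause frame is sent. The following is a rough sketch under simplifying assumptions, not a vendor formula. The function name, the parameters, and the allowance of two maximum-size frames are hypothetical, and propagation delay is approximated as 5 ns per meter of cable.

```python
def pfc_headroom_bytes(link_gbps, cable_m, mtu_bytes, reaction_ns=0.0, ns_per_m=5.0):
    """Rough estimate of bytes that may still arrive after a pause frame is sent.

    Covers the round trip on the cable, the sender's reaction time, and one
    maximum-size frame already being transmitted at each end.
    """
    round_trip_ns = 2 * cable_m * ns_per_m + reaction_ns
    in_flight_bytes = link_gbps * round_trip_ns / 8  # Gbit/s * ns = bits
    return in_flight_bytes + 2 * mtu_bytes


# Hypothetical example: 100 Gbit/s link, 100 m cable, 4096-byte MTU.
headroom = pfc_headroom_bytes(link_gbps=100, cable_m=100, mtu_bytes=4096)
```

This headroom has to be set aside for every lossless priority on every port and stays empty in normal operation. A credit-based sender transmits only when the receiver has advertised free space, so no such reserve is needed.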

DCQCN (Data Center Quantized Congestion Notification) is similar to IB's congestion control in that both use forward notification: congestion information is carried forward to the destination, which then returns it to the sender for rate limiting. The details differ slightly, however. According to the paper "Congestion Control for Large-Scale RDMA Deployments," RoCE's slowdown and speedup strategies follow a fixed set of formulas, while IB allows customizable slowdown and speedup strategies. Most people probably use the default configuration in practice, but having the flexibility is still better than not having it. In addition, the tests in that paper generated at most one CNP (Congestion Notification Packet) every N = 50 microseconds, and it is unknown whether a smaller value would be feasible. In IB, the corresponding CCTI_Timer can be set as low as 1.024 microseconds, but it is uncertain whether such a small value can be used in practice.

The ideal approach would be for the congestion point to return congestion information directly to the source, which is known as backward notification. It is understandable that Ethernet is constrained by its specifications here, but why doesn't IB adopt this approach?
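The fixed formulas mentioned above can be sketched for the sender side. This follows the rate-decrease and fast-recovery rules as published in the DCQCN paper, but it is a partial illustration: the class and method names are hypothetical, g = 1/256 is an assumed default, and the timers, byte counters, and the additive and hyper increase stages are left out.

```python
class DcqcnSender:
    """Simplified DCQCN reaction point: rate cut on CNP, fast recovery after."""

    def __init__(self, line_rate_gbps, g=1 / 256):
        self.current_rate = line_rate_gbps  # R_C in the paper
        self.target_rate = line_rate_gbps   # R_T in the paper
        self.alpha = 1.0                    # congestion estimate
        self.g = g                          # weight for alpha updates (assumed)

    def on_cnp(self):
        """A CNP arrived: remember the old rate, cut the rate, raise alpha."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self):
        """No CNP arrived during the update period: let alpha decay."""
        self.alpha *= 1 - self.g

    def on_fast_recovery_step(self):
        """Move the current rate halfway back toward the target rate."""
        self.current_rate = (self.target_rate + self.current_rate) / 2


sender = DcqcnSender(line_rate_gbps=100)
sender.on_cnp()                 # first CNP halves the rate while alpha is 1
rate_after_cnp = sender.current_rate
sender.on_fast_recovery_step()  # recover halfway toward the previous rate
rate_after_recovery = sender.current_rate
```

The constants and the shape of these updates are fixed by the algorithm, which is the inflexibility the text contrasts with IB's configurable congestion control tables.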

Application Cases of RoCE in HPC

Slingshot

The three new leading supercomputers in the United States are all equipped with the Slingshot network, which is an improved Ethernet. The Rosetta switches in the network are compatible with traditional Ethernet while addressing some of the shortcomings of RoCE. If both ends of a link are Slingshot devices (dedicated network cards or Rosetta switches), some enhanced features can be enabled:

Reducing the minimum frame size of IP packets to 32 bytes

Propagating queue occupancy (credit) information to neighboring switches

Improved congestion control

The average switch latency is 350 ns. This is comparable to high-performance Ethernet switches, but it still falls short of the low latency achieved by IB switches, by some customized supercomputer switches, and by the previous generation of Cray XC supercomputer switches.

In practical applications, however, it seems to perform well, although the paper "An In-Depth Analysis of the Slingshot Interconnect" compares it only with the previous generation of Cray supercomputers and not with IB.

CESM and GROMACS Testing

The performance of the applications CESM and GROMACS was compared using a 25G Ethernet with low latency and a 100G Ethernet with higher bandwidth. Although there is a fourfold difference in bandwidth between the two, the results still provide some reference value.


CESM Testing


GROMACS Testing

Summary

NADDOD has a professional technical team with extensive experience in implementing and servicing a wide range of application scenarios. Its products and solutions are widely used in industries and critical areas such as high-performance computing, data centers, education and research, biomedicine, finance, energy, autonomous driving, the internet, manufacturing, and telecommunications. Based on market demands and user project implementation experience, NADDOD believes that applying RoCE to HPC faces the following challenges:

Ethernet switches have higher latency compared to IB switches and some HPC custom network switches.

There is room for improvement in RoCE's flow control and congestion control strategies.

The cost of Ethernet switches is still higher.

In general, choosing the right solution for AI data center networks is a critical decision. Traditional TCP/IP protocols are no longer suitable for AI applications that require high network performance. In comparison, the application of RDMA technology has made InfiniBand and RoCE highly regarded network solutions.

InfiniBand has demonstrated excellent performance in areas such as high-performance computing and large-scale GPU clusters, while RoCE, as an RDMA technology based on Ethernet, offers greater deployment flexibility.

Therefore, selecting the appropriate network solution based on specific network requirements and application scenarios is a crucial step in ensuring high-performance and efficient AI data center networks.