Introduction to RoCE v2 Network

NADDOD Jason Data Center Architect Sep 14, 2023

Background

NADDOD has a professional technical team with extensive experience implementing and servicing a wide range of application scenarios. RoCE (RDMA over Converged Ethernet) provides RDMA functionality over Ethernet: it bypasses the host TCP/IP stack and offloads transport processing to NIC hardware, which reduces CPU utilization.

 

There are two main versions of RoCE: RoCEv1 and RoCEv2. RoCEv1 is an RDMA protocol implemented at the Ethernet link layer, so its traffic stays within a single Layer 2 domain. Switches need to support flow control technologies such as PFC to provide lossless transmission at the link layer. RoCEv2 runs over UDP/IP on Ethernet; introducing IP makes the traffic routable and addresses the scalability limits of RoCEv1.

 

RoCEv2 supports RDMA routing over Layer 3 Ethernet networks. It replaces the InfiniBand network layer with IP and UDP headers carried on top of the Ethernet link layer, so RoCE traffic can be routed by conventional IP routers.
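To make the difference in encapsulation concrete, the following is a minimal sketch that adds up the per-packet header bytes for RoCEv1 and RoCEv2. The header sizes are the standard ones for untagged Ethernet, IPv4 without options, UDP, and the InfiniBand GRH, BTH, and ICRC. The function names and the 4096-byte payload are illustrative assumptions, and the sketch ignores VLAN tags, IPv6, extended transport headers, the preamble, and the inter-frame gap.

```python
# Per-packet header sizes in bytes. Simplified model: no VLAN tag, IPv4 only,
# no extended transport headers, no preamble or inter-frame gap.
ETH_HEADER = 14    # destination MAC, source MAC, EtherType
ETH_FCS = 4        # Ethernet frame check sequence
IPV4_HEADER = 20   # IPv4 header without options
UDP_HEADER = 8
IB_GRH = 40        # InfiniBand Global Route Header (network layer in RoCEv1)
IB_BTH = 12        # InfiniBand Base Transport Header
IB_ICRC = 4        # InfiniBand invariant CRC

ROCEV2_UDP_PORT = 4791  # UDP destination port assigned to RoCEv2


def rocev1_overhead() -> int:
    """Ethernet | GRH | BTH | payload | ICRC | FCS."""
    return ETH_HEADER + IB_GRH + IB_BTH + IB_ICRC + ETH_FCS


def rocev2_overhead() -> int:
    """Ethernet | IPv4 | UDP | BTH | payload | ICRC | FCS."""
    return ETH_HEADER + IPV4_HEADER + UDP_HEADER + IB_BTH + IB_ICRC + ETH_FCS


def efficiency(payload: int, overhead: int) -> float:
    """Fraction of the bytes on the wire that carry application payload."""
    return payload / (payload + overhead)


if __name__ == "__main__":
    payload = 4096  # illustrative RDMA payload size per packet
    for name, overhead in (("RoCEv1", rocev1_overhead()),
                           ("RoCEv2", rocev2_overhead())):
        print(f"{name}: {overhead} bytes of headers, "
              f"{efficiency(payload, overhead):.2%} payload efficiency")
```

Under these assumptions, swapping the 40-byte GRH for IPv4 and UDP headers makes the packet routable without increasing header overhead.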

 

InfiniBand networks are, to some extent, centrally managed by a subnet manager (SM), whereas RoCEv2 networks are fully distributed networks composed of RoCEv2-capable NICs and switches, typically arranged in a two-tier architecture.

RoCE Architecture

 

The main suppliers of RoCE network cards include NVIDIA and Intel. PCIe cards are the primary form factor for network cards in data center servers. The port PHY speed of RDMA cards typically starts at 50Gbps, and commercially available network cards currently reach single-port speeds of up to 400Gbps.

https://resource.naddod.com/images/blog/2023-09-13/roce-nic-007960.webp

 

Most data center switches now support the flow control technologies that RDMA requires; combined with RoCE-capable network cards, they enable end-to-end RDMA communication. The core of a high-performance switch is its forwarding chip. According to research by the NADDOD team, Broadcom's Tomahawk series is widely used among the commercial forwarding chips on the market. Tomahawk3-based switches are currently the most common, and switches based on Tomahawk4 are gradually becoming more available.

Evolution of Ethernet forwarding chips

 

RoCE vs. InfiniBand

Compared to InfiniBand, RoCE offers greater versatility and relatively lower cost. It can be used not only to build high-performance RDMA networks but also in traditional Ethernet environments. However, configuring parameters such as Headroom, PFC (Priority-based Flow Control), and ECN (Explicit Congestion Notification) on switches can be complex. In large-scale deployments, the overall throughput performance of RoCE networks may be slightly lower than that of InfiniBand networks.

InfiniBand vs. RoCE v2

 

From a technical perspective, InfiniBand employs various technologies to enhance network forwarding performance, reduce fault recovery time, improve scalability, and lower operational complexity.

 

In terms of application performance, InfiniBand has lower end-to-end latency compared to RoCEv2, giving InfiniBand-based networks an advantage in application-level performance.

 

Bandwidth and latency are influenced by factors such as congestion and routing in high-performance network interconnects.

 

Congestion

InfiniBand uses two notification mechanisms to control congestion: Forward Explicit Congestion Notification (FECN) and Backward Explicit Congestion Notification (BECN), terms borrowed from Frame Relay. When congestion occurs in the network, FECN notifies the receiving device, while BECN notifies the sending device. InfiniBand combines FECN and BECN with an adaptive marking rate to reduce congestion. It provides coarse-grained congestion control.

 

Congestion control on RoCE uses Explicit Congestion Notification (ECN), an extension to IP and TCP that signals congestion to the endpoints without dropping packets. A congested switch sets a mark in the IP header; the receiver sees the mark and notifies the sender (in RoCEv2, with a Congestion Notification Packet, or CNP), and the sender then reduces its rate. Without ECN, congestion is signaled by packet loss, and the lost packets must be retransmitted. ECN reduces congestion-induced packet loss and so avoids retransmissions. Fewer retransmissions reduce latency and jitter, providing better transaction and throughput performance. ECN also provides coarse-grained congestion control and does not have a significant advantage over InfiniBand.
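As a rough illustration of the marking step, the sketch below shows the two pieces a switch needs: setting the Congestion Experienced (CE) codepoint in the two ECN bits of the IP ToS/Traffic Class byte, and deciding how likely a packet is to be marked from the current queue depth. The ECN codepoints are the standard ones. The linear ramp between two thresholds is a common RED-style scheme, and the names `k_min`, `k_max`, and `p_max` and the example values are assumptions for illustration, not settings from any particular switch.

```python
# ECN codepoints occupy the two least significant bits of the IPv4 ToS
# (or IPv6 Traffic Class) byte.
ECN_MASK = 0b11
NOT_ECT = 0b00  # sender is not ECN-capable
ECT_1 = 0b01    # ECN-capable transport, codepoint 1
ECT_0 = 0b10    # ECN-capable transport, codepoint 0
CE = 0b11       # congestion experienced


def mark_ce(tos: int) -> int:
    """Return the ToS byte with CE set, leaving the DSCP bits untouched.

    Packets that are not ECN-capable are returned unchanged; a real switch
    would drop or queue them instead of marking them.
    """
    if tos & ECN_MASK == NOT_ECT:
        return tos
    return tos | CE


def marking_probability(queue_depth: int, k_min: int, k_max: int,
                        p_max: float) -> float:
    """RED-style marking probability (threshold names are hypothetical).

    Below k_min nothing is marked, above k_max everything is marked, and in
    between the probability rises linearly from 0 to p_max.
    """
    if queue_depth <= k_min:
        return 0.0
    if queue_depth >= k_max:
        return 1.0
    return p_max * (queue_depth - k_min) / (k_max - k_min)


if __name__ == "__main__":
    tos = (26 << 2) | ECT_0  # illustrative DSCP value 26 with ECT(0)
    print(f"ToS before: {tos:#04x}, after marking: {mark_ce(tos):#04x}")
    for depth in (50, 250, 500):  # illustrative queue depths in KB
        p = marking_probability(depth, k_min=100, k_max=400, p_max=0.2)
        print(f"queue depth {depth} KB -> marking probability {p:.2f}")
```

Choosing the two thresholds is part of what makes ECN tuning on RoCE switches delicate: thresholds set too low throttle senders early, while thresholds set too high let queues build until PFC has to pause the link.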

 

Routing

When congestion occurs in the network, adaptive routing sends traffic over alternate routes to alleviate congestion and speed up transmission. RoCE v2 operates on top of IP. Over the years, IP has achieved routability through advanced routing algorithms, and now, with AI and machine learning, congestion can be predicted and packets can be steered automatically onto faster routes. In terms of routing, Ethernet and RoCE v2 have significant advantages.
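The contrast between static and congestion-aware path selection can be sketched in a few lines. The first function mimics hash-based equal-cost multipath (ECMP), which pins a flow to one path regardless of load; the second picks the least-loaded path, which is the basic idea behind adaptive routing. Both functions, their names, and the queue-depth inputs are simplified assumptions for illustration and do not reproduce any vendor's algorithm.

```python
import zlib


def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              num_paths: int) -> int:
    """Static ECMP: hash the flow identifiers and pick a path.

    Every packet of a flow hashes to the same path, even if it is congested.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|udp".encode()
    return zlib.crc32(key) % num_paths


def adaptive_path(queue_depths: list[int]) -> int:
    """Congestion-aware choice: pick the path with the shortest queue."""
    return min(range(len(queue_depths)), key=queue_depths.__getitem__)


if __name__ == "__main__":
    # Hypothetical per-path queue depths observed at a leaf switch.
    depths = [30, 5, 12, 48]
    static = ecmp_path("10.0.0.1", "10.0.1.7", 49152, 4791, len(depths))
    print(f"ECMP keeps the flow on path {static} "
          f"(queue depth {depths[static]})")
    print(f"Adaptive routing would choose path {adaptive_path(depths)}")
```

Because RoCEv2 rides on UDP, the UDP source port can vary per flow and give the ECMP hash more entropy, but the hash stays blind to load, which is why large RDMA flows can still collide on one link.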

 

However, neither InfiniBand nor RoCE has done much to address tail latency, which is crucial for synchronized HPC message-passing applications.

 

UEC Prepares to Define New Transport Protocol

In addition to InfiniBand and RoCE, there have been other protocols proposed in the industry.

 

On July 19, 2023, the Ultra Ethernet Consortium (UEC) was officially established. UEC aims to go beyond existing Ethernet capabilities and provide a high-performance, distributed, and lossless transport layer optimized for high-performance computing and artificial intelligence. Founding members of UEC include AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta, and Microsoft, all of whom have decades of experience in network, AI, cloud, and high-performance computing deployments.

https://resource.naddod.com/images/blog/2023-09-13/founding-members-007957.webp

 

UEC believes that RDMA, defined decades ago, is outdated for demanding AI/ML network traffic. RDMA transfers data as large flows, which can leave some links overloaded while others sit idle. In UEC's view, it is time to build a modern transport protocol that supports RDMA for emerging applications.

 

According to NADDOD, the UEC transport protocol is still under development. Its goal is to deliver better transport over Ethernet than current RDMA implementations while still supporting RDMA, retaining the advantages of Ethernet/IP and delivering the performance required for AI and HPC applications.

 

UEC transport is a new transport-layer design with adjusted semantics, a congestion notification protocol, and enhanced security features. It is intended to be more flexible, removing the need for a lossless network and supporting the multi-path, out-of-order packet delivery required by many-to-many AI workloads.
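To illustrate why out-of-order delivery matters for multi-path transport, the sketch below shows a receiver that writes each packet straight to its offset in the destination buffer, so packets sprayed across several paths can arrive in any order. This is a conceptual illustration only and is not taken from any UEC specification: the `(offset, payload)` packet format and the `place_packets` function are hypothetical, and the sketch assumes no loss and no duplicates.

```python
def place_packets(packets: list[tuple[int, bytes]],
                  message_length: int) -> bytes:
    """Reassemble a message from (offset, payload) packets in any order.

    Each payload is written directly at its offset, so no reordering queue
    is needed. Assumes every byte arrives exactly once.
    """
    buffer = bytearray(message_length)
    received = 0
    for offset, payload in packets:
        end = offset + len(payload)
        if offset < 0 or end > message_length:
            raise ValueError("packet falls outside the destination buffer")
        buffer[offset:end] = payload
        received += len(payload)
    if received != message_length:
        raise ValueError("message incomplete")
    return bytes(buffer)


if __name__ == "__main__":
    # Packets arrive out of order, as they might after taking different paths.
    arrived = [(6, b"world"), (0, b"hello ")]
    print(place_packets(arrived, message_length=11))
```

A transport that requires in-order delivery has to keep a flow on a single path or buffer and reorder at the receiver; placing data by offset removes that constraint, which is what makes per-packet load balancing across paths practical.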

 

Conclusion

Comparing InfiniBand and RoCE shows that each has its own advantages and suitable use cases. InfiniBand excels in high-performance computing, offering excellent performance, low latency, and scalability. RoCE, on the other hand, is easier to integrate into existing Ethernet infrastructure and costs less. Emerging transport protocols such as the one being defined by UEC show that development and innovation in this area are continuing.

 

Only by adapting to evolving demands can one maintain a competitive edge. As a leading provider of comprehensive optical networking solutions, NADDOD continues to offer innovative, efficient, and reliable products, solutions, and services, providing optimal end-to-end solutions for data centers, high-performance computing, edge computing, artificial intelligence, and other application scenarios. We also continue to learn and research in the fields of InfiniBand, RoCE, and emerging UEC transport protocols, looking forward to breakthroughs in future solutions!