End-to-End RoCE: Concepts and Principles

NADDOD Dylan InfiniBand Solutions Architect Sep 25, 2023

1. What is RoCE?

To deliver performance comparable to InfiniBand (IB) networks over Ethernet, the InfiniBand Trade Association (IBTA) defined RoCE (RDMA over Converged Ethernet), which carries InfiniBand's Layer 4 (transport-layer) RDMA protocol over Ethernet. The encapsulation of IB and RoCE is compared as follows:


RoCE has gone through two versions. The earlier version, RoCEv1, supported RDMA only within a single Layer 2 Ethernet domain; because this restricted both the usable scenarios and the network scale, it was not widely adopted. RoCEv2 encapsulates the RDMA transport layer in UDP over IPv4 or IPv6, using UDP destination port 4791 by default. RoCE traffic can therefore be routed across Layer 3 networks, which removes many of the limitations imposed by RoCEv1.
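
As a rough illustration of the encapsulation difference, the sketch below lists the header stack of a native IB packet, a RoCEv1 frame, and a RoCEv2 frame over IPv4, and sums the per-packet overhead. The sizes are the commonly cited ones and assume no VLAN tag, no IP options, and no extended transport headers; the names `IB_STACK`, `ROCE_V1_STACK`, `ROCE_V2_STACK`, and `overhead_bytes` are made up for this example.

```python
# Illustrative header stacks: (header name, size in bytes).
# Assumptions: untagged Ethernet, IPv4 without options, no extended
# transport headers (RETH, AETH, ...), IB packet within one subnet.

ROCE_V2_UDP_PORT = 4791  # default UDP destination port for RoCEv2

# Native InfiniBand; a 40-byte GRH is added only when crossing subnets.
IB_STACK = [
    ("LRH (local route header)", 8),
    ("BTH (base transport header)", 12),
    ("ICRC", 4),
    ("VCRC", 2),
]

# RoCEv1: IB network and transport headers directly over Ethernet (L2 only).
ROCE_V1_STACK = [
    ("Ethernet header (EtherType 0x8915)", 14),
    ("GRH (global route header)", 40),
    ("BTH (base transport header)", 12),
    ("ICRC", 4),
    ("Ethernet FCS", 4),
]

# RoCEv2: IB transport header over UDP/IP, so the frame is routable at L3.
ROCE_V2_STACK = [
    ("Ethernet header", 14),
    ("IPv4 header", 20),
    (f"UDP header (dst port {ROCE_V2_UDP_PORT})", 8),
    ("BTH (base transport header)", 12),
    ("ICRC", 4),
    ("Ethernet FCS", 4),
]


def overhead_bytes(stack):
    """Sum the header and trailer bytes wrapped around the payload."""
    return sum(size for _, size in stack)


if __name__ == "__main__":
    for name, stack in [("IB", IB_STACK),
                        ("RoCEv1", ROCE_V1_STACK),
                        ("RoCEv2/IPv4", ROCE_V2_STACK)]:
        print(f"{name}: {overhead_bytes(stack)} bytes of overhead")
        for header, size in stack:
            print(f"    {header}: {size}")
```

The point to notice is that the BTH and ICRC are the same in all three stacks: RoCE keeps the IB transport and swaps only the layers underneath it.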


RoCEv2 can operate in both lossless and lossy modes. In lossless mode, also known as lossless Ethernet, packets must arrive without loss or reordering so that network performance stays reliable. In lossy mode, network cards from the ConnectX-5 (CX5) series onward support enhanced RoCE features that deliver near-lossless performance in small-scale deployments without requiring a lossless network configuration.


In actual projects, the lossless approach is the common choice. It relies on flow control and congestion control mechanisms such as PFC (Priority Flow Control) and ECN (Explicit Congestion Notification), and it achieves lossless Ethernet transmission by giving RoCE traffic priority access to network resources (buffers and bandwidth). The lossy features provided by the network cards can be enabled alongside the lossless configuration for further performance optimization.
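
To make the ECN part concrete, the sketch below shows how the DSCP value and the two ECN bits share the IP ToS/Traffic Class byte, and how a congested switch would rewrite an ECN-capable packet to Congestion Experienced (CE). The ECN codepoints follow the standard IP ECN encoding. The DSCP values 26 (RoCE data) and 48 (CNP) and the `dscp >> 3` priority mapping are only a common convention assumed here, not something the article prescribes, and all function names are hypothetical.

```python
# ECN codepoints carried in the two low-order bits of the ToS byte.
ECN_NOT_ECT = 0b00  # sender is not ECN-capable
ECN_ECT1 = 0b01     # ECN-capable transport (1)
ECN_ECT0 = 0b10     # ECN-capable transport (0)
ECN_CE = 0b11       # congestion experienced, set by a congested switch

# Assumed example values (a common convention, not taken from the article).
ROCE_DSCP = 26
CNP_DSCP = 48


def tos_byte(dscp, ecn):
    """Pack a 6-bit DSCP and a 2-bit ECN field into one ToS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must fit in 2 bits")
    return (dscp << 2) | ecn


def split_tos(tos):
    """Return (dscp, ecn) from a ToS byte."""
    return tos >> 2, tos & 0b11


def default_priority(dscp):
    """Assumed mapping: the top three DSCP bits select the priority (0-7)."""
    return dscp >> 3


def mark_congestion(tos):
    """What a congested switch does: set CE if the packet is ECN-capable."""
    dscp, ecn = split_tos(tos)
    if ecn == ECN_NOT_ECT:
        return tos  # cannot be marked; such traffic can only be dropped
    return tos_byte(dscp, ECN_CE)


if __name__ == "__main__":
    data_tos = tos_byte(ROCE_DSCP, ECN_ECT0)
    print("RoCE data ToS:", hex(data_tos),
          "priority", default_priority(ROCE_DSCP))
    print("after marking:", hex(mark_congestion(data_tos)))
    print("CNP priority:", default_priority(CNP_DSCP))
```

With these assumed values, RoCE data lands on priority 3 (the priority on which PFC would be enabled) and CNPs land on priority 6.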

2. The difference between RoCE and IB native RDMA

RoCE supports the deployment of RDMA (Remote Direct Memory Access) over Ethernet infrastructure, which offers various advantages of RDMA communication, including transport offloading, kernel bypass, high performance, and low latency.


However, due to the differences between Ethernet and InfiniBand (IB) networks, optimizing the Ethernet configuration is necessary to ensure the stability and performance of RoCE services. This optimization involves:


  • Flow Control (FC), typically referring to PFC (Priority Flow Control)


  • Congestion Control (CC), achieved through ECN (Explicit Congestion Notification) and the DCQCN (Data Center Quantized Congestion Notification) algorithm for end-to-end congestion control


  • Quality of Service (QoS) for scheduling RoCE traffic based on priority, with higher priority given to handling CNP (Congestion Notification Packet) control messages at priority 6, and switch-based weighted round-robin scheduling for different traffic flows


  • Enhanced features of Lossy RoCE


  • Optimization of network card/switch buffers and configurations


The aforementioned optimizations make RoCE usable over Ethernet. Compared with a native IB network, however, Ethernet cannot fully rule out packet loss, which adds overhead and instability to the system. The optimizations therefore depend heavily on the traffic model of the network at hand, and their parameters may need tuning for each network's purpose and scale to get the best results.


The higher round-trip time (RTT) and the unpredictability of packet loss in Ethernet also affect I/O (Input/Output) performance.
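
The ECN and DCQCN interaction listed above can be sketched roughly as follows. The switch marks packets with a probability that grows with queue depth between two thresholds, the receiver answers marked packets with CNPs, and the sender cuts its rate when a CNP arrives. This is a simplified sketch of the published DCQCN update rules; it leaves out the timers, byte counters, and the additive and hyper increase phases. The threshold values in the usage part and the names `ecn_mark_probability` and `DcqcnSender` are illustrative assumptions, not vendor defaults.

```python
from dataclasses import dataclass


def ecn_mark_probability(queue_kb, kmin_kb, kmax_kb, pmax):
    """RED-style ECN marking on the switch egress queue.

    Below kmin nothing is marked, above kmax everything is marked,
    and in between the probability rises linearly up to pmax.
    """
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb > kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


@dataclass
class DcqcnSender:
    """Simplified DCQCN reaction point (the sending network card)."""
    current_rate: float        # Rc, e.g. in Gbit/s
    target_rate: float         # Rt, the rate to recover towards
    alpha: float = 1.0         # running estimate of congestion severity
    g: float = 1 / 256         # gain used when updating alpha

    def on_cnp(self):
        """A CNP arrived: remember the old rate, then cut the current rate."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self):
        """No CNP during the alpha timer: decay the congestion estimate."""
        self.alpha = (1 - self.g) * self.alpha

    def fast_recovery_step(self):
        """Move halfway back towards the target rate."""
        self.current_rate = (self.target_rate + self.current_rate) / 2


if __name__ == "__main__":
    # Hypothetical thresholds: Kmin = 100 KB, Kmax = 400 KB, Pmax = 20 %.
    for depth in (50, 250, 500):
        print(depth, "KB ->", ecn_mark_probability(depth, 100, 400, 0.2))

    sender = DcqcnSender(current_rate=100.0, target_rate=100.0)
    sender.on_cnp()
    print("rate after first CNP:", sender.current_rate)
    sender.fast_recovery_step()
    print("rate after one recovery step:", sender.current_rate)
```

The sketch also shows why tuning depends on the traffic model: Kmin, Kmax, Pmax, and g together decide how early and how sharply senders back off, and suitable values differ between networks.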

3. RoCE Lossless Network

The lossless solution includes:

  • Flow Control: Handles backpressure on the receiving queue.


  • Congestion Control: Manages congestion on the switch, particularly the sending queue.


  • Quality of Service (QoS): Enables differentiation of different traffic types, allowing separate scheduling for RoCE. It specifies algorithms for each Traffic Class (TC) to ensure bandwidth allocation, maximum bandwidth, or proportional bandwidth allocation based on weights.


  • Receive Buffer Management: Enables more precise PFC control by managing the receive buffers effectively.


  • Lossy Algorithm: Can be deployed simultaneously with the lossless solution to optimize system performance.
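
The link between flow control and receive buffer management can be illustrated with a small model of one PFC-enabled receive queue. When the buffer fills past an XOFF threshold, the receiver sends a pause for that priority; when it drains below a lower XON threshold, it sends a resume. This is a minimal sketch under assumed thresholds; `PfcIngressQueue`, `xoff_kb`, and `xon_kb` are hypothetical names, and real devices also reserve headroom for data already in flight when the pause is sent.

```python
from dataclasses import dataclass, field


@dataclass
class PfcIngressQueue:
    """One receive queue (one priority) with XOFF/XON hysteresis."""
    xoff_kb: int                 # occupancy at which a pause is sent
    xon_kb: int                  # occupancy at which traffic is resumed
    occupancy_kb: int = 0
    paused: bool = False
    events: list = field(default_factory=list)

    def enqueue(self, kb):
        """Data arrives from the upstream port."""
        self.occupancy_kb += kb
        if not self.paused and self.occupancy_kb >= self.xoff_kb:
            self.paused = True
            self.events.append("PAUSE")   # PFC pause frame for this priority

    def dequeue(self, kb):
        """Data is forwarded or consumed, freeing buffer space."""
        self.occupancy_kb = max(0, self.occupancy_kb - kb)
        if self.paused and self.occupancy_kb <= self.xon_kb:
            self.paused = False
            self.events.append("RESUME")  # pause frame with zero quanta


if __name__ == "__main__":
    queue = PfcIngressQueue(xoff_kb=100, xon_kb=60)
    queue.enqueue(90)    # below XOFF, sender keeps transmitting
    queue.enqueue(20)    # crosses XOFF, upstream is paused
    queue.dequeue(30)    # still above XON, stays paused
    queue.dequeue(30)    # drops below XON, upstream resumes
    print(queue.events)
```

The gap between the two thresholds keeps the queue from toggling on every packet. Setting XOFF too low pauses the sender more often than needed, while setting it too high leaves too little room for in-flight data, which is why buffer tuning and PFC behaviour are handled together.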

Limitations of Lossless Networks:

  • Configuration optimization is difficult.


  • The configuration is relatively complex and requires consistent node configuration.


  • In large networks, a PFC pause storm may occur.


Lossy RoCE supplements and optimizes the lossless solution: deploying it lets a RoCE network adapt to more scenarios and scale to larger sizes.


4. Conclusion

In summary, RoCE (RDMA over Converged Ethernet) is a technology that brings the high-performance features of InfiniBand networks to Ethernet. RoCEv1 only supports RDMA transmission over Layer 2 Ethernet, which limits its application scenarios and network scale. RoCEv2, on the other hand, encapsulates the RDMA protocol using UDP/IPv4 or UDP/IPv6 at the transport layer, allowing routing over Layer 3 Ethernet and removing the limitations of RoCEv1.


RoCE differs from native InfiniBand RDMA in several respects. To ensure the stability and performance of RoCE services, the Ethernet configuration must be optimized, including flow control, congestion control, QoS, and buffer configuration. These measures ensure the availability and performance of RoCE over Ethernet and preserve the advantages of RDMA communication, such as transport offloading, kernel bypass, high performance, and low latency.


RoCE can be deployed in either lossless or lossy mode. In a lossless network, packets must arrive without loss or reordering, which is typically achieved through flow control and congestion control mechanisms (such as PFC and ECN) together with QoS. In a lossy network, enhanced RoCE features deliver near-lossless performance in small-scale deployments without requiring a lossless network configuration.


The lossless solution includes flow control, congestion control, QoS, and receive buffer management. However, configuring a lossless network can be complex, requiring consistency across nodes, and may encounter issues such as pause storms in large-scale networks. Lossy RoCE can complement and optimize the lossless solution, adapting to larger network scenarios.


Overall, RoCE enables high-performance RDMA transmission over Ethernet. With the right configuration optimizations and a suitable choice between lossless and lossy modes, it can provide performance and features similar to those of InfiniBand networks.


NADDOD offers lossless network solutions based on RoCE, providing users with a lossless network environment and high-performance computing capabilities. NADDOD can tailor the optimal solution based on specific application scenarios and user requirements, delivering high-bandwidth, low-latency, and high-performance data transmission to effectively address network bottlenecks and enhance network performance and user experience.


NADDOD provides high-quality Ethernet interconnect products, including optical modules ranging from 1G to 800G, AOCs, and DACs. These products improve the connection speed and stability of Ethernet deployment while reducing costs and complexity. For customized Ethernet connectivity requirements, feel free to contact NADDOD's technical experts for further assistance!