RDMA Accelerates Cluster Performance

NADDOD Abel, InfiniBand Expert, Aug 22, 2023

Driven by enterprise digitization demands, new applications continue to emerge and be put into production. As data becomes a crucial asset for businesses, it fuels demand for high-performance computing, big data analytics, AI, and a wide range of storage applications. Traditional transport protocols such as TCP/UDP face numerous bottlenecks in adapting to these new requirements.

1. RoCE Technical Advantages and Ecosystem Development

RDMA, which stands for Remote Direct Memory Access, is a technology for high-performance network communication and one of the core technologies of the InfiniBand network standard. DMA (Direct Memory Access) refers to a device accessing host memory directly, without CPU involvement; RDMA extends this idea across the network, allowing memory on a remote machine to be accessed through the network adapter without involving the operating system kernel. This enables high-throughput, low-latency network communication, making it particularly suitable for large-scale parallel computing clusters.
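As a concrete starting point, the sketch below uses the libibverbs user-space API (from the rdma-core package) to enumerate RDMA-capable adapters on a host. It is a minimal illustration of how applications talk to RDMA hardware, not part of any specific vendor stack.

/* Minimal sketch: enumerate RDMA-capable devices with libibverbs.
 * Compile with: gcc list_devices.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(dev_list[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr attr;
        if (!ibv_query_device(ctx, &attr))
            printf("%s: max_qp=%d max_mr=%d\n",
                   ibv_get_device_name(dev_list[i]),
                   attr.max_qp, attr.max_mr);

        ibv_close_device(ctx);
    }

    ibv_free_device_list(dev_list);
    return 0;
}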

 

By optimizing the transport layer and leveraging the capabilities of network interface cards, RDMA enables applications to better utilize network link resources. RDMA was initially implemented on the InfiniBand transport network, but as the demand for Ethernet grew, the industry "ported" RDMA to traditional Ethernet. Ethernet-based RDMA technology can be divided into two types: iWARP and RoCE, with RoCE further including RoCEv1 and RoCEv2. Compared to the high cost of InfiniBand, RoCE and iWARP have significantly lower hardware costs.

 

RDMA running over Ethernet is referred to as RoCE (RDMA over Converged Ethernet). Currently, the mainstream solution for high-performance networking is based on the RoCE v2 protocol, which carries RDMA traffic over routable UDP/IP on standard Ethernet and is therefore applicable to a wide range of Ethernet deployment scenarios.

Figure: Socket vs. RDMA

Compared to the TCP/IP approach, RDMA uses kernel bypass and zero-copy techniques to deliver low latency, reduce CPU usage, relieve memory bandwidth bottlenecks, and achieve high bandwidth utilization. RDMA offers a channel-based I/O model that allows an application to read from and write to remote virtual memory directly through RDMA devices.
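As an illustration of that model, the following sketch shows what a one-sided RDMA write looks like with the libibverbs API. It assumes a protection domain pd, an already-connected queue pair qp, and a remote address and rkey exchanged out of band; connection setup, completion polling, and cleanup are omitted for brevity.

/* Sketch: register a local buffer and post a one-sided RDMA WRITE.
 * Assumes pd (protection domain), qp (connected RC queue pair), and
 * the peer's remote_addr/rkey were exchanged out of band. */
#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                       void *buf, size_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    /* Register (pin) the local buffer so the NIC can DMA from it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* one-sided write: no remote CPU on the data path */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;   /* generate a completion on the send CQ */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Hand the work request to the NIC; the kernel is bypassed. */
    return ibv_post_send(qp, &wr, &bad_wr);
}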

Figure: TCP/IP vs. RDMA (RoCE)

RDMA technology establishes a data path between applications and the network that bypasses the system kernel. With this optimized data path, the CPU overhead of data forwarding can be reduced to nearly zero, while high performance is delivered by ASICs on the network adapter. RDMA moves data across the network directly into remote system memory, without involving the operating system on the data path and with minimal demand on host compute resources. It eliminates the overhead of extra memory copies and context switches, freeing up memory bandwidth and CPU cycles to improve application performance and overall cluster efficiency. RDMA has already been widely adopted in supercomputing centers and internet companies, forming a mature application-network ecosystem, and its adoption in large enterprise data centers marks a new stage in the development of that ecosystem.

2. GPUDirect RDMA Improves AI/HPC Application Efficiency

Traditional TCP networking relies heavily on the CPU to process packets, making it difficult to fully utilize the available bandwidth. For large-scale cluster training in AI environments, RDMA is therefore practically a required network transport technology.

 

RDMA technology can be used not only for high-performance network transfers of user-space data in CPU memory, but also for GPU-to-GPU transfers across servers in a GPU cluster. This is where GPUDirect technology, crucial for HPC/AI performance optimization, comes into play. With increasingly complex deep learning models and a surge in computational data volume, a single machine is no longer sufficient to meet the computational requirements, and distributed training across multiple machines and GPUs has become a necessity. In this scenario, inter-machine communication becomes a key performance factor for distributed training, and GPUDirect RDMA can be used to accelerate GPU communication between machines.
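The essence of GPUDirect RDMA is that the RDMA NIC can read and write GPU memory directly over PCIe. The sketch below illustrates the idea, assuming an NVIDIA GPU and a host with the nvidia-peermem (formerly nv_peer_mem) kernel module loaded; the function name and error handling are illustrative only.

/* Sketch: register GPU memory for GPUDirect RDMA. With the
 * nvidia-peermem kernel module loaded, ibv_reg_mr() can pin a
 * cudaMalloc'ed buffer so the NIC reads/writes GPU memory directly,
 * avoiding a staging copy through host memory. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len)
{
    void *gpu_buf = NULL;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return NULL;
    }

    /* Same verbs call as for host memory -- the peer-memory module
     * lets the driver resolve the device pointer to GPU pages. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        fprintf(stderr, "ibv_reg_mr on GPU memory failed "
                        "(is nvidia-peermem loaded?)\n");
        cudaFree(gpu_buf);
        return NULL;
    }
    return mr;  /* mr->addr, mr->lkey and mr->rkey are then used in work requests */
}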

 

➢ GPUDirect RDMA: Based on the RoCE capability of network cards, it enables high-speed exchange of memory data among GPUs across server nodes within a GPU cluster.

 

From a network design and implementation perspective, NVIDIA optimizes and enhances the performance of GPU clusters by supporting GPUDirect RDMA functionality. The technical implementation of GPUDirect RDMA is illustrated in the diagram below.

Figure: GPUDirect RDMA

GPU cluster networking places higher demands on network latency and bandwidth. Traditional network transmission constrains the parallel processing capability of GPUs and wastes resources: high-bandwidth transfers normally have to be staged through CPU memory, and both the memory copies and the CPU load become bottlenecks for multi-node GPU communication. GPUDirect RDMA addresses these problems by letting the network card read and write GPU memory directly, enabling remote access between GPU memory spaces without a detour through host memory. This significantly improves both bandwidth and latency, greatly enhancing the efficiency of GPU cluster operations.

3. Data Center Switch Lossless Network Solution

 

Figure: RoCE Solution

The switch-side solution for carrying RoCE traffic is also known as the lossless Ethernet solution. It consists of the following key technologies:

 

➢ ECN Technology: ECN defines an end-to-end congestion notification and rate control mechanism operating at the IP and transport layers. The ECN feature uses the two ECN bits in the IP header's ToS/Traffic Class byte to mark congestion along the transmission path. End hosts that support this feature can detect congestion on the path from these markings and adjust how they send packets to avoid making the congestion worse. Enhanced Fast ECN marks the ECN field of a packet as it is dequeued, reducing the delay in marking packets during forwarding; the receiving server therefore sees ECN-marked packets sooner and can adjust the sending rate more quickly. (An illustrative sketch of the ECN codepoints appears after this list.)

 

➢ PFC Technology: PFC provides hop-by-hop, priority-based flow control. When forwarding packets, devices map them to queues by priority and schedule them accordingly. If the sending rate of a given priority exceeds the receiving rate, so that the receiver's available buffer space runs low, the receiving device sends a PFC PAUSE frame back to the previous hop. The previous-hop device stops sending packets of that priority when it receives the PAUSE frame, and resumes transmission only after receiving a PFC XON indication or after a configured aging time has elapsed. With PFC, congestion in one traffic class does not affect the normal forwarding of other classes, so different types of packets on the same link do not interfere with each other. (The PFC PAUSE frame layout is sketched below.)
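To make the two mechanisms above more concrete, the sketch below shows the relevant on-the-wire encodings: the two ECN bits defined in RFC 3168 and the field layout of an IEEE 802.1Qbb PFC PAUSE frame. It is illustrative only, not a switch configuration, and the example values in main() are hypothetical.

/* Illustrative sketch of the on-the-wire encodings used by ECN and PFC. */
#include <stdint.h>
#include <stdio.h>

/* ECN uses the two least-significant bits of the IPv4 ToS /
 * IPv6 Traffic Class byte (RFC 3168). */
enum ecn_codepoint {
    ECN_NOT_ECT = 0x0,  /* sender does not support ECN              */
    ECN_ECT1    = 0x1,  /* ECN-capable transport, codepoint 1        */
    ECN_ECT0    = 0x2,  /* ECN-capable transport, codepoint 0        */
    ECN_CE      = 0x3,  /* Congestion Experienced: set by a switch   */
};

static enum ecn_codepoint ecn_from_tos(uint8_t tos)
{
    return (enum ecn_codepoint)(tos & 0x3);
}

/* IEEE 802.1Qbb PFC PAUSE frame payload. It is a MAC control frame
 * (EtherType 0x8808, opcode 0x0101) sent to 01-80-C2-00-00-01, with
 * one pause timer per priority (0-7). */
struct pfc_pause_frame {
    uint16_t opcode;            /* 0x0101 for PFC                              */
    uint16_t class_enable_vec;  /* bit i set = pause time for priority i valid */
    uint16_t pause_time[8];     /* per-priority pause, in 512-bit-time quanta;
                                   0 acts as XON (resume immediately)          */
} __attribute__((packed));

int main(void)
{
    uint8_t tos = (26 << 2) | 0x02;  /* hypothetical: DSCP 26 with ECT(0) set */
    printf("ECN codepoint: %d\n", ecn_from_tos(tos));
    printf("PFC frame payload size: %zu bytes\n", sizeof(struct pfc_pause_frame));
    return 0;
}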

4. Summary: RDMA and RoCE Product Selection

Based on practical deployment experience with lossless Ethernet, NVIDIA has adopted ECN as the primary congestion control technology, with hardware-accelerated Fast ECN providing near-instantaneous flow control response. Combined with ETS and buffer optimization techniques, resource scheduling is tuned to the traffic model. PFC, by contrast, carries a risk of network deadlock; comparative testing has shown that PFC flow control provides limited improvement in network reliability, cannot by itself prevent congestion packet loss, and introduces risks and performance bottlenecks of its own.

 

RDMA is designed to make end-to-end network communication as efficient as possible, with the goal of fast remote data transfer. Technically, it combines multiple optimizations: kernel bypass on the host side, transport-layer offload on the network card, and congestion and flow control on the network side. The result is low latency, high throughput, and low CPU overhead. Current RDMA implementations also have limitations, however, such as scalability constraints and the difficulty of configuration and tuning.

 

When building a high-performance RDMA network, the key products required, besides RDMA adapters and powerful servers, include high-speed optical modules, switches, and fiber cables. Here, NADDOD is a recommended choice for reliable high-speed data transmission products and solutions. As a leading provider of high-speed data transmission solutions, NADDOD offers a range of premium products, including high-performance switches, AOC/DAC cables, optical modules, and intelligent network cards, to meet low-latency, high-rate data transmission requirements. NADDOD's products and solutions are widely used across industries, whether for large-scale scientific computing, real-time data analysis, or latency-sensitive financial trading, and enjoy a strong reputation. They are an ideal choice for balancing economy and efficiency when deploying high-performance networks.