Exploring RDMA and Low-Latency Networks

NADDOD | Peter, Optics Technician | Nov 1, 2023

Network development seems to lag behind computing and storage in many respects, and latency is no exception. High network transmission latency has gradually become a bottleneck for high-performance data centers.


High-performance distributed computing and high-performance computing workloads in data centers generate data flows that make up the majority of east-west traffic, about 70% of the total. These flows are generally carried over TCP/IP networks, so improving the TCP/IP transmission rate between servers naturally improves the performance of the data center as a whole.


This is where RDMA comes into play. RDMA has long been used in high-performance scientific computing (HPC), and as data centers increasingly demand high bandwidth and low latency, it is gradually being adopted in high-performance data center scenarios as well.


RDMA (Remote Direct Memory Access) is a memory access technology that allows server application data to be transferred directly from memory to an intelligent network card with an embedded RDMA protocol engine. The network card hardware handles the encapsulation of RDMA transfer packets, enabling servers to read and write each other's memory directly and at high speed, without time-consuming processing by the operating system or CPU. Here are the specifics:


RDMA's kernel bypass mechanism allows applications and the network card to exchange data directly, bypassing the TCP/IP stack and reducing protocol stack latency to roughly 1 microsecond. RDMA's zero-copy mechanism eliminates the need to copy data between application memory and the operating system's data buffers. Such transfers require no work from the CPU, caches, or context switches, significantly reducing message-processing latency. In addition, transfers can proceed in parallel with other system operations, improving network transmission performance.
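To make the zero-copy, kernel-bypass path concrete, here is a minimal sketch using the standard libibverbs API. The verbs calls themselves are real; the device selection, buffer size, connected queue pair, and the peer's remote address/key (normally exchanged out of band) are simplified assumptions for illustration.

```c
/* Minimal zero-copy RDMA write sketch with libibverbs (link with -libverbs).
 * The connected QP and the peer's remote_addr/rkey are placeholders. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no RDMA device\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register application memory once; afterwards the NIC DMAs directly
     * from/to this buffer -- no copy into kernel socket buffers. */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_qp *qp = NULL;   /* assume: connected RC QP from ibv_create_qp() */
    uint64_t remote_addr = 0;   /* assume: peer's registered buffer address */
    uint32_t rkey = 0;          /* assume: peer's remote key */

    struct ibv_sge sge = { .addr = (uintptr_t)buf,
                           .length = (uint32_t)len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { 0 }, *bad = NULL;
    wr.opcode     = IBV_WR_RDMA_WRITE;   /* remote CPU never sees this request */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (qp)                               /* posted from user space: no syscall, */
        ibv_post_send(qp, &wr, &bad);     /* no kernel copy, no context switch   */

    ibv_dereg_mr(mr); free(buf);
    ibv_dealloc_pd(pd); ibv_close_device(ctx); ibv_free_device_list(devs);
    return 0;
}
```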

[Figure: traditional vs. RDMA data paths]


Comparing how the traditional mode and the RDMA mode process sending and receiving data shows that RDMA brings significant breakthroughs to data center communication architecture: low latency and ultra-low CPU and memory resource utilization.
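For contrast, a conventional socket send looks like the sketch below: every call crosses the user/kernel boundary and copies the buffer into kernel socket memory before the NIC can touch it (a simple illustration, not the original comparison figure).

```c
/* Traditional TCP/IP path (sketch): each send is a syscall (context switch)
 * plus a copy of the buffer into kernel socket memory, and the CPU also
 * runs the whole TCP/IP stack for every packet. */
#include <sys/types.h>
#include <sys/socket.h>

ssize_t send_traditional(int sockfd, const void *buf, size_t len)
{
    return send(sockfd, buf, len, 0);  /* user-to-kernel copy happens here */
}
```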


The low latency comes mainly from RDMA's zero-copy and kernel bypass mechanisms. Zero copy lets data move directly between application memory and the network card, eliminating copies between application memory and kernel memory and significantly reducing transmission latency. Kernel bypass lets applications issue commands to the network card without kernel calls: RDMA requests are sent from user space to the local network card and then across the network to the remote network card, with no kernel involvement at all. This reduces the number of context switches between kernel space and user space during network transmission, further reducing network latency.


The ultra-low CPU and memory resource utilization is mainly reflected in the fact that applications can access remote memory directly without consuming any CPU resources on the remote server, and without polluting the remote CPU's caches with the accessed content. Servers can therefore devote almost 100% of their CPU and memory resources to computation or other services, conserving server resources and increasing effective data processing bandwidth.


Based on this analysis of the network requirements of HPC (high-performance computing) and of RDMA technology, Asterfusion has launched the CX-N series of ultra-low-latency cloud switches.


Using RoCEv2 to Reduce Transmission Protocol Latency

Currently, there are three choices for the network protocol underlying RDMA: InfiniBand, iWARP (Internet Wide Area RDMA Protocol), and RoCE (RDMA over Converged Ethernet).


RoCE is a network protocol that allows applications to perform remote memory access over Ethernet. Proposed by the IBTA (InfiniBand Trade Association), it applies RDMA technology to Ethernet and runs over standard Ethernet switches; only the network cards need RoCE support, with no special requirements on the rest of the network hardware. There are currently two versions: RoCEv1 and RoCEv2. RoCEv2 is routable: it encapsulates RDMA traffic in UDP/IP, so hosts in different broadcast domains can reach each other at Layer 3. However, because RDMA is sensitive to packet loss and traditional Ethernet is best-effort, RoCEv2 requires lossless Ethernet support.
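As a rough picture of what RoCEv2 encapsulation looks like on the wire, here is a sketch of the header stack and the 12-byte InfiniBand Base Transport Header that follows the UDP header. The field layout follows the IBTA spec, but this packed-struct rendering is a simplification that ignores bit-level detail and endianness.

```c
/* RoCEv2 on the wire:
 *   Ethernet | IPv4/IPv6 | UDP (dst port 4791) | BTH | payload | ICRC
 * RoCEv1, by contrast, rides directly on Ethernet with EtherType 0x8915
 * and therefore cannot cross L3 boundaries. */
#include <stdint.h>

#define ROCEV2_UDP_DPORT 4791  /* IANA-assigned UDP port for RoCEv2 */

/* 12-byte InfiniBand Base Transport Header (BTH), simplified sketch. */
struct ib_bth {
    uint8_t  opcode;         /* e.g. RDMA WRITE, SEND, ACK */
    uint8_t  se_m_pad_tver;  /* solicited event, migration, pad count, version */
    uint16_t pkey;           /* partition key */
    uint32_t qp;             /* 8 reserved bits + 24-bit destination QP number */
    uint32_t psn;            /* ack-request bit + 24-bit packet sequence number */
};
```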


Among these RDMA network options, RoCEv2 offers good performance at a low deployment cost. The ultra-low-latency, lossless Ethernet built with Asterfusion CX-N series cloud switches effectively supports RoCEv2 and creates a low-latency, zero-packet-loss, high-performance HPC network on top of it.

[Figure: RDMA in the trend of network convergence]


Ultra-Low-Latency Switching Chips to Reduce Network Forwarding Latency

The Asterfusion CX-N series cloud switches provide industry-leading ultra-low-latency capabilities, meeting the low-latency network requirements of three typical high-performance computing scenarios. In particular, they address tightly coupled scenarios that depend heavily on coordination between computing nodes, synchronous computation, and high-speed information exchange. Building a high-performance computing network with CX-N series cloud switches can significantly reduce processing latency and improve overall HPC performance.


Using PFC with High-Priority Queues to Provide a Lossless Network

PFC (Priority Flow Control) is an enhanced pause mechanism that divides an Ethernet link into eight virtual channels, each assigned a priority level and dedicated resources such as buffer space and queues. Any single virtual channel can be paused and resumed independently without affecting traffic on the other channels. This allows the network to provide a lossless class of service for one virtual link while it coexists with other traffic types on the same interface.
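A PFC pause frame itself is a small MAC control frame; the following sketch shows its payload layout per IEEE 802.1Qbb (for illustration only, not wire transmission).

```c
/* IEEE 802.1Qbb PFC pause frame payload (layout per the standard). */
#include <stdint.h>

#define ETH_P_MAC_CONTROL 0x8808  /* MAC control EtherType */
#define PFC_OPCODE        0x0101  /* priority-based flow control opcode */

struct pfc_payload {
    uint16_t opcode;           /* 0x0101, big-endian on the wire */
    uint16_t class_enable;     /* bit i set => pause_quanta[i] is valid */
    uint16_t pause_quanta[8];  /* per-priority pause time, in 512-bit times */
};
/* The frame is sent to the reserved multicast MAC 01-80-C2-00-00-01.
 * A quanta of 0 resumes a paused priority; only the flagged priority
 * queues stop transmitting, while the other channels keep flowing. */
```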

[Figure: PFC working mechanism]


Using the ECN Congestion Control Algorithm to Eliminate Network Congestion

ECN (Explicit Congestion Notification) is an important mechanism for building a lossless Ethernet network, providing end-to-end flow control. With ECN enabled, a network device marks the ECN field in the IP header of a packet once congestion is detected. When the marked packets reach their destination, the congestion notification is fed back to the traffic sender, which responds by throttling the offending flows. This reduces network latency and jitter, enhancing the performance of high-performance computing clusters.

[Figure: ECN]


1. The sending server marks outgoing IP packets as ECN-capable.


2. When a switch receives such a packet and detects queue congestion, it sets the packet's ECN field to Congestion Experienced (CE) and forwards it (see the sketch after this list).


3. The receiving server receives the CE-marked packet and processes it normally.


4. The receiving server generates a congestion notification and periodically sends Congestion Notification Packets (CNPs), which the network is required to deliver without dropping.


5. The switch receives the CNP packet and forwards it as usual.


6. The sending server receives the CNP, parses it, and applies the corresponding rate-limiting algorithm to throttle the offending data flow.
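The sketch below illustrates steps 1 and 2 at the IP-header level, using the standard RFC 3168 ECN codepoints. The helper names are hypothetical; steps 4 through 6 happen at the RoCEv2/NIC layer rather than in host code.

```c
/* ECN codepoints live in the two low-order bits of the IPv4 TOS /
 * IPv6 Traffic Class byte (RFC 3168). Helper names are illustrative. */
#include <stdbool.h>
#include <stdint.h>

#define ECN_NOT_ECT 0x0u  /* transport is not ECN-capable       */
#define ECN_ECT1    0x1u  /* ECN-capable transport, codepoint 1 */
#define ECN_ECT0    0x2u  /* ECN-capable transport, codepoint 0 */
#define ECN_CE      0x3u  /* congestion experienced             */

/* Step 1: the sender marks outgoing packets as ECN-capable. */
static inline void ecn_mark_capable(uint8_t *tos)
{
    *tos = (uint8_t)((*tos & ~0x3u) | ECN_ECT0);
}

/* Step 2: a congested switch remarks ECT packets as CE instead of dropping. */
static inline void ecn_on_congestion(uint8_t *tos, bool queue_congested)
{
    if (queue_congested && (*tos & 0x3u) != ECN_NOT_ECT)
        *tos = (uint8_t)((*tos & ~0x3u) | ECN_CE);
}
/* Steps 4-6 run in the RoCEv2 NICs: the receiver answers CE-marked packets
 * with CNPs, and the sending NIC rate-limits the flagged flow (DCQCN is the
 * algorithm commonly used on RoCE-capable NICs). */
```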