How to Build a Lossless Network for RDMA

People who learn about RDMA and lossless networks often ask two questions: Why do we need lossless networks? What benefits can these technologies bring us?

From a networking perspective alone, it may be difficult to provide a satisfactory answer. NADDOD's technical experts, however, can shed light on these questions by providing a few examples from both front-end business and back-end application perspectives.

1. Why do we need Lossless Networks?

First, a large amount of business now runs online: search, shopping, live streaming, and so on. These services must respond very quickly to high-frequency user requests. Latency introduced at any point in the data center can noticeably degrade the user experience, which in turn hurts traffic, reputation, and the number of active users.

Furthermore, with the increasing trends in machine learning and AI technologies, the demand for computational power is growing exponentially. To meet the requirements of complex neural networks and deep learning models, data centers deploy numerous distributed computing clusters. However, the communication latency of large-scale parallel programs can significantly affect the efficiency of the entire computational process.

Moreover, to address the efficiency challenges of exploding data storage and retrieval within data centers, distributed storage networks using Ethernet convergence are becoming increasingly popular. However, because data flows in storage networks are primarily characterized by elephant flows, congestion-induced packet loss can trigger re-transmissions of these large flows. This not only reduces efficiency but also exacerbates congestion.

So, from the perspective of front-end user experience and back-end application efficiency, the current requirements for data center networks are: the lower the latency, the better, and the higher the efficiency, the better.

To reduce internal network latency and improve processing efficiency in data centers, RDMA technology has emerged. It allows user-level applications to directly read from and write to remote memory without involving the CPU in multiple memory copies. It also bypasses the kernel and writes data directly to the network card, achieving high throughput, ultra-low latency, and low CPU overhead.

Currently, RDMA's transport protocol over Ethernet is RoCEv2 (RDMA over Converged Ethernet v2). RoCEv2 is carried over UDP (User Datagram Protocol), a connectionless protocol. Compared to the connection-oriented TCP (Transmission Control Protocol), UDP is faster and consumes fewer CPU resources. However, unlike TCP, UDP itself has no mechanisms such as sliding windows and acknowledgments to achieve reliable transmission. When a packet is lost, the loss must be detected and the data retransmitted by a layer above UDP, which can significantly reduce the efficiency of RDMA transmission.

To unleash the true performance of RDMA and overcome the network performance bottlenecks in large-scale distributed systems within data centers, it is essential to establish a lossless network environment specifically tailored for RDMA. The key to achieving lossless transmission is addressing network congestion.

2. What is RDMA?

RDMA (Remote Direct Memory Access), in simple terms, is a remote DMA technology designed to address the latency associated with server-side data processing in network transfers.

[Figure: Comparison of working mechanisms between traditional mode and RDMA mode]

In the traditional mode, when transferring data between applications on two servers, the process is as follows:

First, the data needs to be copied from the application cache to the TCP protocol stack cache in the kernel.

Then it is copied to the driver layer.

Finally, it is copied to the NIC (Network Interface Card) cache.

These multiple memory copies require CPU intervention, resulting in a significant processing latency of several tens of microseconds. In addition, the heavy CPU involvement throughout the process consumes a significant amount of CPU performance, which can impact normal data computations.

In RDMA mode, application data can bypass the kernel protocol stack and be written directly to the network card. This brings significant benefits, including:

Reduction of processing latency from tens of microseconds to within 1 microsecond.

Minimal CPU involvement throughout the process, resulting in performance savings.

Higher transmission bandwidth.

3. RDMA's Demands on the Network

RDMA is increasingly being applied in high-performance computing, big data analytics, and high-concurrency I/O scenarios. Applications such as iSCSI, SAN, Ceph, MPI, Hadoop, Spark, and TensorFlow are deploying RDMA technology. For the underlying network that supports end-to-end transmission, low latency (on the order of microseconds) and losslessness are the most important metrics.

(1) Low Latency

Network forwarding latency mainly occurs at the device nodes (excluding optical-electrical transmission latency and data serialization latency). Device forwarding latency includes the following three parts:

Store-and-forward latency: The processing delay of the chip's forwarding pipeline, which adds approximately 1 microsecond per hop (there have been attempts in the industry to use cut-through mode, reducing the single-hop latency to around 0.3 microseconds).

Buffer caching latency: When the network is congested, packets are buffered and wait for forwarding. The larger the buffer, the longer the packets are cached, resulting in higher latency. For RDMA networks, a larger buffer is not necessarily better, and a reasonable selection is required.

Retransmission latency: The delay added when lost packets have to be retransmitted. RDMA networks use other techniques to prevent packet loss so that this latency is avoided.
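To make the buffering trade-off concrete, the following is a minimal sketch of how queued data turns into delay. It assumes a simple model in which a packet waits for everything buffered ahead of it to drain at line rate; the function names and the example figures (1 MB of buffered data, a 100 Gbps link) are illustrative assumptions, not values from a specific device.

```python
def queuing_delay_us(buffered_bytes: int, link_gbps: float) -> float:
    """Microseconds needed to drain `buffered_bytes` on a link of `link_gbps`.

    Simplified model (assumption): the packet waits for all data queued
    ahead of it to be transmitted at full line rate.
    """
    bits = buffered_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds * 1e6


def path_latency_us(hops: int, forwarding_us: float,
                    buffered_bytes_per_hop: int, link_gbps: float) -> float:
    """Device latency over `hops` switches: forwarding plus buffering per hop."""
    per_hop = forwarding_us + queuing_delay_us(buffered_bytes_per_hop, link_gbps)
    return hops * per_hop


if __name__ == "__main__":
    # Hypothetical example: 3 hops, 1 us forwarding per hop, 100 Gbps links.
    print(path_latency_us(3, 1.0, 0, 100.0))          # no congestion
    print(path_latency_us(3, 1.0, 1_000_000, 100.0))  # 1 MB queued per hop
```

Under these assumptions, 1 MB queued on a 100 Gbps port adds roughly 80 microseconds at that hop, far more than the forwarding latency itself, which is why a larger buffer is not automatically better for RDMA traffic.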

(2) Lossless Network

RDMA can achieve full-rate transmission in a lossless state, but once packet loss and retransmissions occur, performance sharply declines. In traditional network modes, the primary means to achieve losslessness is through large buffers. However, as mentioned earlier, this contradicts the requirement for low latency. Therefore, in an RDMA network environment, the goal is to achieve losslessness with smaller buffers.

Under this constraint, RDMA achieves losslessness primarily through network flow control techniques based on PFC (Priority-based Flow Control) and ECN (Explicit Congestion Notification).

4. Key Technology for Lossless RDMA Networks: PFC

PFC (Priority-based Flow Control) is a queue-based backpressure mechanism that operates on the basis of priority. It prevents buffer overflow and packet loss by sending Pause frames to notify upstream devices to pause packet transmission.

[Figure: Priority-based Flow Control]

PFC allows for the individual pausing and resuming of any specific virtual channel without affecting the traffic of other virtual channels. As shown in the diagram above, when the buffer consumption of Queue 7 reaches the configured PFC flow control threshold, PFC backpressure is triggered:

The local switch triggers the transmission of a PFC Pause frame and sends it upstream.

The upstream device that receives the Pause frame temporarily suspends the transmission of packets from that queue and buffers them.

If the buffer of the upstream device also reaches a threshold, it will continue to trigger Pause frames to exert backpressure upstream.

Ultimately, data packet loss is avoided by reducing the sending rate of the priority queue.

When the buffer occupancy falls below the recovery threshold, a PFC release frame is sent and the upstream device resumes transmission for that queue.
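The pause and release behavior described above can be sketched as a small state machine for a single priority queue. This is an illustrative model only: the class name `PfcQueue`, the threshold names, and the unit of "cells" are assumptions made for the example, and a real switch implements this logic in hardware.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PfcQueue:
    """One priority queue with a pause threshold and a recovery threshold."""
    xoff_threshold: int          # occupancy that triggers a Pause frame
    xon_threshold: int           # occupancy below which the pause is released
    occupancy: int = 0
    upstream_paused: bool = False

    def enqueue(self, cells: int) -> Optional[str]:
        """Buffer incoming data; return "PAUSE" if backpressure is triggered."""
        self.occupancy += cells
        if not self.upstream_paused and self.occupancy >= self.xoff_threshold:
            self.upstream_paused = True
            return "PAUSE"
        return None

    def dequeue(self, cells: int) -> Optional[str]:
        """Forward data; return "RESUME" once occupancy drops below recovery."""
        self.occupancy = max(0, self.occupancy - cells)
        if self.upstream_paused and self.occupancy < self.xon_threshold:
            self.upstream_paused = False
            return "RESUME"
        return None


if __name__ == "__main__":
    q = PfcQueue(xoff_threshold=8, xon_threshold=4)
    for action, cells in [("in", 5), ("in", 3), ("out", 4), ("out", 2)]:
        event = q.enqueue(cells) if action == "in" else q.dequeue(cells)
        print(action, cells, "occupancy:", q.occupancy, "event:", event)
```

Keeping the recovery threshold below the pause threshold gives the queue some hysteresis, so it does not flap between pausing and resuming on every packet.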

5. Key Technology for Lossless RDMA Networks: ECN

ECN (Explicit Congestion Notification) is a long-established technology that saw little use in the past but is now used for congestion control between hosts.

When congestion at the egress port of a network device pushes the queue past the ECN threshold, the device marks packets using the ECN field in the IP header. This marking indicates that the packet has encountered network congestion. When the receiving server detects the ECN marking in a packet, it immediately generates a Congestion Notification Packet (CNP) and sends it back to the source server, including information about the flow causing the congestion. Upon receiving the CNP, the source server reduces the sending rate of the corresponding flow to alleviate network congestion and avoid packet loss.
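The three roles in this loop (switch, receiver, sender) can be sketched as follows. This is a simplified illustration, not a real NIC or switch API: the names `Packet`, `switch_egress`, `receiver`, and `Sender` are hypothetical, and the halving of the rate on each CNP is an assumption chosen for clarity. Production rate-control algorithms are more elaborate.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Packet:
    flow_id: int
    ecn_marked: bool = False  # set by a congested switch


def switch_egress(pkt: Packet, queue_depth: int, ecn_threshold: int) -> Packet:
    """Mark the packet if the egress queue has reached the ECN threshold."""
    if queue_depth >= ecn_threshold:
        pkt.ecn_marked = True
    return pkt


def receiver(pkt: Packet) -> Optional[dict]:
    """Return a CNP identifying the congested flow, or None if unmarked."""
    if pkt.ecn_marked:
        return {"type": "CNP", "flow_id": pkt.flow_id}
    return None


class Sender:
    def __init__(self, rates_gbps: Dict[int, float]):
        self.rates_gbps = dict(rates_gbps)

    def on_cnp(self, cnp: dict, decrease_factor: float = 0.5) -> None:
        """Reduce only the flow named in the CNP (factor is an assumption)."""
        self.rates_gbps[cnp["flow_id"]] *= decrease_factor


if __name__ == "__main__":
    sender = Sender({1: 100.0, 2: 100.0})
    pkt = switch_egress(Packet(flow_id=1), queue_depth=120, ecn_threshold=100)
    cnp = receiver(pkt)
    if cnp is not None:
        sender.on_cnp(cnp)
    print(sender.rates_gbps)
```

Note that only the flow named in the CNP slows down; other flows from the same sender keep their rate, which is what distinguishes this end-to-end mechanism from the per-queue pause of PFC.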

As described above, achieving end-to-end lossless transmission through PFC and ECN relies on the configuration of different thresholds. Configuring these thresholds properly requires fine-grained management of the switch's MMU (Memory Management Unit), that is, management of the switch's buffer.

6. Conclusion

RDMA networks achieve lossless transmission by deploying PFC and ECN functionalities within the network. PFC technology allows us to control the traffic of RDMA-specific queues on the link and exert backpressure on upstream devices when congestion occurs at the ingress port of the switch. With ECN technology, we can achieve end-to-end congestion control by marking packets with ECN when congestion occurs at the egress port of the switch and causing the sending end to reduce its transmission rate.

To fully leverage the high-performance forwarding capabilities of the network, it is generally recommended to set the ECN and PFC buffer thresholds so that ECN triggers before PFC. This allows the network to continue forwarding data at full speed while the servers actively reduce their sending rate. If congestion persists, PFC then pauses packet transmission from upstream switches, which lowers network throughput but avoids packet loss.
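The ordering recommended above can be expressed as a simple sanity check on a queue's configuration. The sketch below is hypothetical: the function name, the parameter names, and the example values are assumptions, and the second check (leaving room between the PFC threshold and the total buffer for data still in flight) is an added assumption rather than something stated in this article.

```python
from typing import List


def check_thresholds(ecn_threshold: int, pfc_threshold: int,
                     buffer_size: int) -> List[str]:
    """Return a list of problems found in a queue's threshold settings.

    All three values must use the same unit (for example, kilobytes).
    """
    problems = []
    if not ecn_threshold < pfc_threshold:
        problems.append(
            "ECN threshold should be below the PFC threshold "
            "so that ECN triggers first"
        )
    if not pfc_threshold < buffer_size:
        # Assumption: data already in flight still arrives after a Pause
        # frame is sent, so some buffer must remain above the PFC threshold.
        problems.append(
            "PFC threshold should be below the buffer size "
            "to leave room for in-flight data"
        )
    return problems


if __name__ == "__main__":
    print(check_thresholds(ecn_threshold=100, pfc_threshold=400, buffer_size=1000))
    print(check_thresholds(ecn_threshold=500, pfc_threshold=400, buffer_size=1000))
```

An empty list means the settings follow the ordering; actual threshold values depend on the switch model, link speed, and cable length, and are not specified here.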

When applying RDMA in data center networks, it is essential to address the requirements for lossless network transmission and also focus on fine-grained operations and maintenance to meet the demands of latency-sensitive and packet loss-sensitive network environments. Additionally, there are some deployment challenges in RDMA networks, such as PFC storms, deadlock issues, and complex ECN threshold design in multi-tier networks. NADDOD's experts have conducted research and accumulated knowledge on these issues and look forward to discussing them further with the community.