What Is RDMA? RoCE vs. InfiniBand vs. iWARP Differences.

NADDOD Gavin InfiniBand Network Engineer Dec 15, 2023

RDMA, which stands for Remote Direct Memory Access, is a technology that allows bypassing the operating system kernel of a remote host to access data in its memory. The concept of RDMA originates from DMA (Direct Memory Access) technology. In DMA, external devices (such as PCIe devices) can access the host memory directly without going through the CPU. With RDMA, external devices can bypass the CPU and not only access the local host's memory but also access user space memory on another host. By bypassing the operating system, RDMA not only saves a significant amount of CPU resources but also improves system throughput and reduces network communication latency. As a result, RDMA has found wide applications in high-performance computing and deep learning training. This article will introduce the architecture and principles of RDMA and explain how to use RDMA networks.

Technical Background

The two most important metrics in computer network communication are bandwidth and latency. Communication latency primarily consists of the following components:

  • Transmission Delay:

The time taken to push a packet from the host onto the transmission medium.

The calculation is as follows: Delay = L / Bandwidth, where L is the size of the data packet in bits and Bandwidth is the link bandwidth.

The higher the link bandwidth, the shorter the time needed to put the packet on the medium, and hence the lower the transmission delay.

  • Propagation Delay:

Once a packet has been placed on the transmission medium, it still has to travel through the medium to reach the destination. The time taken by the last bit of the packet to reach the destination is called propagation delay.

The calculation method is as follows: Delay = Distance / Velocity, where Distance is the length of the transmission link and Velocity is the propagation speed of the signal in the physical medium (a worked example of both formulas follows this list).

  • Queueing Delay:

Even after a packet arrives at the destination, it is not processed immediately; it has to wait in a queue (a buffer). The time the packet spends waiting in the queue before being processed is called queueing delay.

In general, queueing delay cannot be calculated with a simple formula, since it depends on the instantaneous load of the network and of the receiving host.

  • Processing Delay:

The time spent handling the message at the sending and receiving ends.

This includes buffer management, copying the message across different memory spaces, and handling the system interrupt raised when message transmission completes.
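As a quick worked illustration of the transmission and propagation delay formulas above, take some assumed example values: a 1 MB packet, a 10 Gb/s link, and a 100 km fiber with a signal speed of roughly 2 × 10^8 m/s.

```latex
% Assumed values: L = 1 MB = 8 \times 10^{6} bits, Bandwidth = 10 Gb/s,
% Distance = 100 km, Velocity \approx 2 \times 10^{8} m/s (light in fiber).
\[
\text{Transmission delay} = \frac{L}{\text{Bandwidth}}
  = \frac{8 \times 10^{6}\,\text{bits}}{10 \times 10^{9}\,\text{bits/s}}
  = 0.8\,\text{ms}
\]
\[
\text{Propagation delay} = \frac{\text{Distance}}{\text{Velocity}}
  = \frac{100 \times 10^{3}\,\text{m}}{2 \times 10^{8}\,\text{m/s}}
  = 0.5\,\text{ms}
\]
```

Over a short data-center link the propagation term shrinks to well under a microsecond, which is why the host-side processing and copying overheads discussed below dominate the latency budget.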

 

In real communication scenarios, computer networks mostly send small messages, so handling latency well is a key factor in improving performance.

In traditional TCP/IP network communication, data needs to be transmitted from the user space of the local machine to the user space of the remote machine, which involves multiple memory copies along the way (a minimal socket-send sketch follows the list below):

Figure: the traditional TCP/IP data path

  • The data sender copies the data from the user-space buffer to the kernel-space socket buffer.

  • In kernel space, the sender adds packet headers and performs data encapsulation.

  • The data is copied from the kernel-space socket buffer to the NIC buffer for network transmission.

  • Upon receiving a data packet from the remote machine, the receiver copies it from the NIC buffer to the kernel-space socket buffer.

  • A series of network protocols then parse the packet, and the parsed data is copied from the kernel-space socket buffer to the user-space buffer.

  • At this point, a system context switch occurs and the user application is invoked.
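For comparison with RDMA, here is a minimal sketch of the sender side of this path using the standard BSD socket API; the single write() call already implies a copy from the user-space buffer into the kernel socket buffer before the NIC ever sees the data. The destination address and port are placeholder assumptions.

```c
/* Minimal sketch of the traditional TCP/IP send path (BSD sockets).
 * write() copies the user buffer into the kernel socket buffer; the
 * kernel then encapsulates the data and copies it to the NIC.
 * The destination address and port below are placeholders.          */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(12345);                    /* assumed port */
    inet_pton(AF_INET, "192.0.2.10", &addr.sin_addr);  /* assumed host */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    const char msg[] = "hello over TCP/IP";
    /* User space -> kernel socket buffer copy happens here. */
    if (write(fd, msg, strlen(msg)) < 0)
        perror("write");

    close(fd);
    return 0;
}
```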

 

Under high-speed network conditions, the traditional TCP/IP network is limited in terms of bandwidth for inter-machine communication due to the high overhead of data movement and copying operations on the host side. To improve data transmission bandwidth, several solutions have been proposed. Here, we will mainly introduce the following two:

 

  • TCP Offload Engine (TOE)

  • Remote Direct Memory Access (RDMA)

Below is the overall architecture diagram of RDMA. As the diagram shows, RDMA operates in application user space and provides a set of Verbs interfaces for driving the RDMA hardware. RDMA bypasses the kernel and allows the RDMA NIC (Network Interface Card) to be accessed directly from user space. The RNIC (RDMA NIC) keeps cached page table entries, which it uses to map virtual pages to their corresponding physical pages.

Figure: RDMA architecture
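As a rough, hedged sketch of what the user-space Verbs interface looks like in practice (using libibverbs; error handling is trimmed and the 4 KB buffer size is an arbitrary assumption), an application typically opens a device, allocates a protection domain, and registers a memory region so the RNIC can access that memory directly:

```c
/* Sketch: opening an RDMA device and registering memory with libibverbs.
 * Error handling is trimmed for brevity; the buffer size is arbitrary.
 * Build (assumed file name): gcc rdma_sketch.c -libverbs               */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* Open the first device and create a user-space context. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);

    /* A protection domain groups resources that may be used together. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer so the RNIC can DMA into/out of it directly;
     * the 4096-byte size is an arbitrary choice for illustration.     */
    size_t len = 4096;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    printf("registered MR: lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    /* Tear down in reverse order. */
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

The rkey printed at the end is what a remote peer would present to read or write this registered buffer directly, which is exactly the kernel-bypass path described above.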

Currently, RDMA has three different hardware implementations, all of which can use the same set of APIs but have different physical and link layer characteristics:

 

InfiniBand: RDMA technology based on the InfiniBand architecture, proposed by the IBTA (InfiniBand Trade Association). Building an RDMA network based on InfiniBand requires dedicated InfiniBand NICs and InfiniBand switches. In terms of performance, InfiniBand networks offer the best performance, but the cost of InfiniBand NICs and switches is relatively high. In contrast, RoCEv2 and iWARP only require special NICs, which are more cost-effective.

 

iWARP: Internet Wide Area RDMA Protocol, an RDMA technology based on the TCP/IP protocol, defined as an IETF standard. iWARP supports RDMA over standard Ethernet infrastructure without requiring lossless Ethernet transmission support from switches. However, servers need to use NICs that support iWARP. Because it runs on top of TCP, its performance is somewhat lower than that of InfiniBand and RoCE.

RoCE: RDMA over Converged Ethernet, an RDMA technology based on Ethernet, also proposed by the IBTA. RoCE supports RDMA technology over standard Ethernet infrastructure but requires lossless Ethernet transmission support from switches. Servers need to use RoCE NICs, and the performance is comparable to InfiniBand.
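Since all three implementations expose the same Verbs API, one hedged way to check at runtime whether an adapter port is native InfiniBand or Ethernet-based (RoCE or iWARP) is to query its link layer. The sketch below assumes an already opened ibv_context (as in the earlier sketch) and that port number 1 exists:

```c
/* Sketch: distinguishing an InfiniBand port from an Ethernet (RoCE/iWARP)
 * port via libibverbs. Assumes `ctx` is an open struct ibv_context *
 * (see the earlier sketch) and that port number 1 exists.               */
#include <infiniband/verbs.h>
#include <stdio.h>

static void print_link_layer(struct ibv_context *ctx)
{
    struct ibv_port_attr attr;
    if (ibv_query_port(ctx, 1, &attr)) {
        fprintf(stderr, "ibv_query_port failed\n");
        return;
    }

    switch (attr.link_layer) {
    case IBV_LINK_LAYER_INFINIBAND:
        printf("port 1: native InfiniBand link layer\n");
        break;
    case IBV_LINK_LAYER_ETHERNET:
        printf("port 1: Ethernet link layer (RoCE or iWARP)\n");
        break;
    default:
        printf("port 1: unspecified link layer\n");
        break;
    }
}
```

Calling this right after opening the device in the previous sketch is enough to confirm which fabric the Verbs application is actually running on.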

 

RDMA Application

Back in the last year of the 20th century, as CPU performance advanced rapidly, the PCI technology proposed by Intel in 1992 could no longer keep up with growing I/O demands, and the performance of the I/O system had become the main bottleneck limiting server performance. Although IBM, HP, and Compaq jointly proposed PCI-X in 1998 as an extension and upgrade of PCI, raising the communication bandwidth to 1066 MB/s, it was widely felt that PCI-X still could not meet the requirements of high-performance servers, and the demand for a next-generation I/O architecture kept growing. After a series of competitions, InfiniBand merged the designs of the two competing technologies of the time, Future I/O and Next Generation I/O, and the InfiniBand Trade Association (IBTA) was established, including major vendors such as Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun. At the time, InfiniBand was seen as the next-generation I/O architecture that would replace PCI, and version 1.0 of the InfiniBand architecture specification was released in 2000. In 2001, Mellanox introduced devices supporting a communication rate of 10 Gbit/s.

 

Figure: IB speeds

However, the good times did not last long. In 2000 the dot-com bubble burst, making the industry hesitant to invest in such a groundbreaking technology. Intel announced plans to develop its own PCIe architecture, and Microsoft also halted InfiniBand development. Nevertheless, companies such as Sun and Hitachi persisted with InfiniBand research and development, and thanks to its strong performance advantages the technology gradually gained widespread adoption in cluster interconnects, storage systems, and supercomputers. Its software protocol stack was standardized, and Linux added support for InfiniBand. In the 2010s, with the explosion of big data and artificial intelligence, InfiniBand's application scenarios gradually expanded beyond the original supercomputing field and it saw much broader use. The InfiniBand market leader, Mellanox, was acquired by NVIDIA, while another major player, QLogic, was acquired by Intel; Oracle also began building its own InfiniBand interconnect chips and switches. In the 2020s, the latest NDR (Next-Generation Data Rate) technology released by Mellanox reached a theoretical effective bandwidth of up to 400 Gb/s per port, requiring PCIe Gen5 x16 or PCIe Gen4 x32 to drive a 400 Gb/s HCA (Host Channel Adapter).

 

Figure: InfiniBand roadmap

NADDOD is a solution provider offering lossless network solutions based on InfiniBand and RoCE (RDMA over Converged Ethernet), aiming to create a lossless network environment and deliver high-performance computing capabilities to users. NADDOD can tailor its solutions to different application scenarios and user requirements, ensuring optimal performance for specific needs. By delivering high bandwidth, low latency, and high-performance data transfers, NADDOD effectively addresses network bottlenecks, enhancing both network performance and user experience. NADDOD is also a professional module manufacturer producing 1G-800G optical modules, AOCs (Active Optical Cables), and DACs (Direct Attach Cables) for high-speed connectivity. Welcome to learn about and purchase our products.
