Easily Understand RDMA Technology

NADDOD Brandon InfiniBand Technical Support Engineer Feb 2, 2024

RDMA (Remote Direct Memory Access) technology has revolutionized the way data is transferred in computer networks. Traditional network data transfer, such as TCP/IP socket communication, relies heavily on the CPU to copy data and process protocols, which increases latency and reduces overall system performance. In contrast, RDMA enables direct memory access between remote systems with minimal CPU intervention, significantly improving data transfer efficiency and lowering latency.

 

Before exploring RDMA, we first need to understand its predecessor: DMA.

 

What is DMA?

 

DMA, which stands for Direct Memory Access, is a technology that allows certain hardware subsystems to transfer data directly between memory and devices with minimal intervention from the main processor (the CPU).

 

In traditional data transfer without DMA (often called programmed I/O), the CPU is responsible for handling all data transfer tasks. When data needs to be read from or written to a hard disk, the CPU issues instructions and then waits for the data transfer to complete before proceeding with other tasks.

 

Specifically, the CPU issues instructions to the disk controller, and the disk controller places the data into its internal buffer. Then, the CPU reads the data into its own registers one byte at a time and finally writes it into memory. Throughout this process, the CPU is unable to perform other tasks, which significantly reduces the efficiency of the system, especially when dealing with large data transfers.

 

Figure: CPU Process

The primary task of the CPU is computation, not data copying, and performing such tasks wastes its computational capabilities. This approach significantly reduces the overall performance of the system when dealing with large amounts of data because the CPU cannot execute other computational tasks while waiting for data transfers to complete.

 

To alleviate the burden on the CPU and allow it to engage in more meaningful work, a mechanism called DMA was later designed:

 

Through a DMA controller, devices can transfer data directly to and from memory without the CPU's full involvement. The CPU only needs to set the relevant parameters (such as the source, destination, and length) before the data transfer begins and can then switch to handling other tasks. When the data transfer is completed, the DMA controller notifies the CPU, typically with an interrupt, and the CPU then goes on to process the transferred data. This approach greatly reduces the CPU's burden and enhances the efficiency of data transfers.
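The division of labor described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `ToyDmaController`, `start_transfer`, and `run_demo` are hypothetical names, a background thread stands in for the DMA hardware, and a `threading.Event` stands in for the completion interrupt. No real device is involved.

```python
import threading


class ToyDmaController:
    """Toy stand-in for a DMA controller (hypothetical, not a real driver API)."""

    def __init__(self):
        # Plays the role of the completion interrupt.
        self.done = threading.Event()

    def start_transfer(self, src, dst, length):
        # The "CPU" only programs source, destination and length, then returns.
        self.done.clear()
        worker = threading.Thread(target=self._copy, args=(src, dst, length))
        worker.start()
        return worker

    def _copy(self, src, dst, length):
        # The bulk copy runs on the controller's own thread, not the caller's.
        dst[:length] = src[:length]
        # Signal completion, like raising an interrupt.
        self.done.set()


def run_demo():
    device_buffer = bytearray(b"sector data " * 1000)  # pretend device buffer
    memory = bytearray(len(device_buffer))             # pretend main memory

    dma = ToyDmaController()
    worker = dma.start_transfer(device_buffer, memory, len(device_buffer))

    # Meanwhile the "CPU" is free to do unrelated computation.
    other_work = sum(i * i for i in range(10_000))

    dma.done.wait()  # block until the "interrupt" arrives
    worker.join()
    return memory == device_buffer, other_work
```

Calling `run_demo()` returns a pair: whether the data arrived intact, and the result of the unrelated computation that ran while the transfer was in flight.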

 

 

The value of DMA technology is mainly reflected in the following aspects:

 

  1. Improved data transfer efficiency: DMA controllers can directly access memory, allowing for fast movement of large amounts of data without the intervention of the CPU.

  2. Reduced CPU burden: DMA technology frees the CPU from tedious data transfer tasks, enabling it to focus on other more important tasks, thereby improving the overall performance of the system.

  3. Fast data copying and storage: DMA technology enables high-speed data transfer between peripherals and memory, or between different areas of memory. This is particularly useful in applications that require extensive data copying and storage.

 

In summary, DMA technology is an efficient data transfer method that allows the CPU to focus on executing core tasks such as computation and control, thereby enhancing the performance of the entire computer system.

 

However, DMA technology still has its limitations. It can only facilitate data transfer between internal devices within the same computer and cannot achieve direct memory access between different computers. This is the background for the emergence of RDMA technology.

 

RDMA Technology

 

RDMA, which stands for Remote Direct Memory Access, allows one computer to access the memory of another computer as if it were accessing its own memory. Typically, data transfer between computers passes through the operating system's TCP/IP protocol stack. RDMA bypasses this kernel stack on the data path, making data transfer almost as simple and fast as accessing local memory. What's impressive is that, for one-sided read and write operations, the remote computer's CPU is not involved in the transfer, and most of the work is handled by hardware, requiring minimal software involvement.

 

In conventional network transfers, such as when computer A wants to send a message to computer B, the task essentially involves moving data from the memory of computer A to the memory of computer B. This process requires the coordination and control of both CPUs, including driving the network cards, handling interrupts, and packing and unpacking data, among other tasks.

 

As an example, data in the user space of computer A needs to be copied to a buffer in the kernel space before it can be read by the network card. During this process, the data is also wrapped with header information and checksums, such as TCP and IP headers. The network card then uses DMA technology to copy the data from the kernel space to an internal buffer within the network card and sends it over the network to computer B.

 

Upon receiving the data, computer B performs the reverse operation: first, the data is copied from the internal buffer of the network card to a buffer in the kernel space. Then, the CPU unpacks the data, and finally, it is copied to the user space.

 

Figure: DMA Process

As we can see, even with DMA technology, this process still heavily relies on the CPU.
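The user-space-to-kernel-space copies in this conventional path can be observed with ordinary socket calls. The following is a minimal sketch, assuming a local `socket.socketpair()` as a stand-in for two machines; `send_via_kernel` is a hypothetical helper name. Because the pair is local, no network card or TCP/IP headers are involved, so it illustrates only the two copies across the user/kernel boundary, each of which is driven by the CPU.

```python
import socket


def send_via_kernel(payload: bytes) -> bytes:
    """Send a small payload through a local socket pair and read it back."""
    # Keep payloads small: everything is sent before anything is read, so a
    # payload larger than the kernel socket buffer would block this sketch.
    sender, receiver = socket.socketpair()
    try:
        # Copy 1: user-space buffer -> kernel socket buffer (system call).
        sender.sendall(payload)
        sender.shutdown(socket.SHUT_WR)  # tell the receiver no more data follows

        chunks = []
        while True:
            # Copy 2: kernel socket buffer -> user-space buffer (system call).
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
```

For example, `send_via_kernel(b"hello RDMA")` returns the same bytes, but only after they have crossed the user/kernel boundary twice.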

 

With RDMA technology, this process becomes much simpler. Data is still copied from one computer's memory to another's, but the CPUs on both ends are involved only minimally (for control purposes). The local network card, using DMA technology, can directly copy data from the user space to its internal storage space. The hardware then automatically assembles the data and sends it over the network to the remote network card. Upon receiving the data, the remote RDMA network card automatically strips off the various header information and checksums, and then uses DMA technology to directly copy the data into the user space memory. This makes the entire process highly efficient and less dependent on the CPU.
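To make the flow above concrete, here is a toy Python model of a one-sided RDMA write. It is a sketch under stated assumptions, not a real verbs API: `ToyRdmaNic`, `register_memory`, `rdma_write`, and `demo_write` are hypothetical names. The idea of registering a user-space buffer and handing out a key for it is borrowed from how RDMA stacks generally work and is not described in this article.

```python
class ToyRdmaNic:
    """Toy model of an RDMA-capable network card (hypothetical, no hardware)."""

    def __init__(self):
        self._regions = {}   # key -> registered user-space buffer
        self._next_key = 1

    def register_memory(self, buffer: bytearray) -> int:
        # Control path: the application registers a buffer once and gets a key
        # that a remote peer can use to address it.
        key = self._next_key
        self._next_key += 1
        self._regions[key] = buffer
        return key

    def _handle_incoming_write(self, key: int, offset: int, data: bytes) -> None:
        # Receiver data path: the card places the payload straight into the
        # registered user-space buffer; no application code on this side runs.
        region = self._regions[key]
        if offset < 0 or offset + len(data) > len(region):
            raise ValueError("write outside registered region")
        region[offset:offset + len(data)] = data

    def rdma_write(self, remote: "ToyRdmaNic", key: int, offset: int,
                   local: bytearray) -> None:
        # Sender data path: the card reads the user buffer directly (modeled by
        # bytes(local)) and "transmits" it to the remote card.
        remote._handle_incoming_write(key, offset, bytes(local))


def demo_write() -> bytes:
    nic_a, nic_b = ToyRdmaNic(), ToyRdmaNic()
    remote_buffer = bytearray(16)               # user-space memory on computer B
    key = nic_b.register_memory(remote_buffer)  # B's only involvement: setup
    nic_a.rdma_write(nic_b, key, 0, bytearray(b"hello"))
    return bytes(remote_buffer[:5])
```

In this model, computer B only registers the buffer; the write itself completes without any receive call on B's side, which mirrors the minimal, control-only role of the CPU described above.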

 

Figure: RDMA Process

The value of RDMA technology is primarily reflected in the following aspects:

 

  1. Ultra-low latency: RDMA operations bypass the operating system's network protocol stack, reducing CPU interrupts and context switches, thereby achieving end-to-end latencies on the order of a microsecond.

  2. High throughput: As RDMA operations directly transfer data at the hardware level, they can achieve very high data transfer rates.

  3. Reduced CPU load: RDMA technology offloads the task of data transfer to the hardware, freeing up CPU computing resources to focus on performing other more critical tasks.

  4. Improved system scalability: RDMA technology supports large-scale parallel processing, greatly enhancing system scalability.

 

Although RDMA technology is highly beneficial, it has primarily been used in domains that require extremely fast computing, such as high-performance computing (HPC) and large-scale data centers. RDMA devices, such as InfiniBand cards and switches, have also been more expensive than commonly used Ethernet equipment.

 

As a result, large internet companies, with their significant demands and budgets, are currently the main users of RDMA technology. For general developers and ordinary users, RDMA may not be necessary given its price and usage constraints.

 

RDMA Network Architecture Technology

 

InfiniBand can be considered a "native" RDMA network architecture. It provides a channel-based, point-to-point message queue model in which each application accesses its data messages directly through a virtual channel it has created, without involving the operating system or other protocol stacks. At the application layer, the InfiniBand architecture uses RDMA to provide remote read and write access between nodes, offloading most of the work from the CPU. The network provides high-bandwidth transmission, and the link layer uses credit-based flow control so that packets are not dropped because of congestion, which helps ensure quality of service. InfiniBand requires IB switches and IB network cards.

 

To lower the cost of RDMA technology and improve the standardization of network communication, RoCE (RDMA over Converged Ethernet) and iWARP (Internet Wide Area RDMA Protocol) technologies were developed.

 

Figure: RDMA application

RoCE (RDMA over Converged Ethernet) is a technology based on Ethernet. Its first version (v1) carries the InfiniBand network and transport layers directly in Ethernet frames, so it works only within a single Ethernet broadcast domain. The second version (v2) replaces the InfiniBand network layer with UDP and IP, allowing data packets to be routed across different IP subnets. RoCE can be seen as a more cost-effective version of InfiniBand (IB), as it encapsulates IB transport information into Ethernet packets for transmission and reception.
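The difference between the two encapsulations can be illustrated by adding up the header sizes. The sketch below is a simplified model under stated assumptions: it uses untagged Ethernet, IPv4 without options, and only the InfiniBand Base Transport Header (operation-specific extended headers are left out). The function names `roce_overhead` and `rocev2_udp_header` are hypothetical.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # Ethertype that identifies RoCE v1 frames
ROCE_V2_UDP_PORT = 4791     # IANA-assigned UDP destination port for RoCE v2

# Header sizes in bytes (simplified: no VLAN tag, no IP options).
ETH_HEADER = 14   # destination MAC + source MAC + Ethertype
IB_GRH = 40       # InfiniBand Global Route Header, used by RoCE v1
IPV4_HEADER = 20
UDP_HEADER = 8
IB_BTH = 12       # InfiniBand Base Transport Header
IB_ICRC = 4       # InfiniBand invariant CRC
ETH_FCS = 4       # Ethernet frame check sequence


def roce_overhead(version: int) -> int:
    """Per-packet bytes added around the RDMA payload in this simplified model."""
    if version == 1:
        network_layer = IB_GRH                    # IB network layer in Ethernet
    elif version == 2:
        network_layer = IPV4_HEADER + UDP_HEADER  # routable UDP/IP instead
    else:
        raise ValueError("version must be 1 or 2")
    return ETH_HEADER + network_layer + IB_BTH + IB_ICRC + ETH_FCS


def rocev2_udp_header(src_port: int, udp_payload_len: int) -> bytes:
    """Pack the 8-byte UDP header that precedes the IB transport header in v2."""
    length = UDP_HEADER + udp_payload_len
    checksum = 0  # 0 means "not computed", which UDP over IPv4 permits
    return struct.pack("!HHHH", src_port, ROCE_V2_UDP_PORT, length, checksum)
```

In this model, `roce_overhead(1)` is 74 bytes and `roce_overhead(2)` is 62 bytes. The practical difference is less about size than about the IP header, which lets ordinary routers forward RoCE v2 packets.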

 

Since RoCE v2 can use regular Ethernet switching equipment (though the switches must support technologies such as PFC for flow control and ECN for congestion notification to address Ethernet congestion and packet loss), it finds wider application in enterprises. However, under similar conditions, its performance may not be as good as IB.

 

The iWARP protocol is based on TCP, a reliable and connection-oriented protocol. This means that in the presence of network issues (such as packet loss), iWARP recovers more gracefully than RoCE v2, which depends on a nearly lossless network, especially in large-scale networks. However, establishing a large number of TCP connections can consume significant memory, and the complex mechanisms of TCP, such as flow control and congestion control, may affect performance. Therefore, in terms of performance, iWARP may not be as good as RoCE v2 or IB.

 

It is important to note that although some software can simulate the functionality of RoCE and iWARP, in practical commercial applications, these protocols typically require dedicated hardware devices (such as network cards) to support them.

 

Summary

RDMA targets fast remote data transport and achieves high-performance end-to-end network communication by combining multiple optimization techniques (including kernel bypass on the host side, transport layer offloading on the network card, and congestion and flow control on the network side). The result is low latency, high throughput, and low CPU overhead. However, current RDMA implementations also have limitations, such as restricted network scalability and difficulties in configuration and modification.

 

When building a high-performance RDMA network, key products required include RDMA adapters, powerful servers, and essential components such as high-speed fiber modules, switches, and fiber cables. It is recommended to choose NADDOD brand for reliable high-speed data transmission products and solutions. As a leading provider of high-speed data transmission solutions, NADDOD offers various high-quality products, including high-performance switches, AOC/DAC/optical modules, and intelligent network cards, to meet the low-latency high-speed data transmission requirements. NADDOD's products and solutions are widely used across industries, whether for large-scale scientific computing, real-time data analytics, or low-latency demands in financial transactions, and have gained a good reputation. NADDOD's products will be an ideal choice for achieving a balance between economy and efficiency when deploying high-performance networks!