Why did RDMA emerge and what are its benefits?

Technological Developments Give Birth to RDMA

RDMA (Remote Direct Memory Access) emerged in response to rapid technological change. The rise of AI and 5G networks, the exponential growth of big data analysis and edge computing, and the advent of the "Internet of Things" era have created an increasingly strong demand for efficient communication across applications and industries. In this booming landscape, numerous hardware vendors and public cloud providers, such as Intel, NVIDIA, AMD, Amazon, Microsoft, Lenovo, Alibaba, Baidu, Dell, EMC, Atos, Huawei, Inspur, Sugon, Cray, Fujitsu, HP, and NEC, have introduced solutions to "provide users with flexible, highly elastic, and highly scalable basic communication infrastructure, as well as virtually unlimited storage capacity." This has attracted a growing number of emerging businesses and enterprises to build their data centers on public clouds.

When customers build their data centers on public clouds, east-west traffic (server-to-server traffic inside the data center network) increases significantly and comes to occupy about 80% of the network bandwidth. Because so much of this traffic is one server fetching data held by another, there is a great demand for efficient remote memory access.

A single request from an application sets off a chain reaction inside the data center. Take big data analysis as an example: when a terminal user accesses a service, the request first reaches the business link in the web application zone. While returning the result, the system also needs to push other business links related to the user's behavior. To do so, the big data analysis system must analyze the behavior of the user's terminal, which involves accessing other relevant behavioral data held in the storage zone for in-depth, comprehensive analysis.

Finally, the terminal user's behavior and the results of big data analysis are stored in the storage zone and transmitted to the application display system for arrangement and combination. The final results are then pushed to the user's terminal and displayed through a web interface. There is a significant demand for memory access between web application servers, big data analysis servers, storage servers, and display systems.

Figure: Chain reaction

The Inefficiency of Remote Memory Access

In the field of data centers, people often focus on the development of cloud computing, the improvement of 100G/400G single-port bandwidth, and other technologies, while neglecting how to improve the data processing performance and memory bandwidth utilization of computing nodes after receiving data. When high-performance computing applications such as AI, 5G, AR/VR/MR, big data analysis, and IoT emerge in large numbers, the serious "mismatch" between network bandwidth, processor speed, and memory bandwidth exacerbates network latency effects. The inefficiency of remote memory access, which is a common performance bottleneck, directly leads to inefficient business applications.

As shown in the figure, in the typical IP data transmission process (covering both sending and receiving), data is handled as follows:

Data Transmission

The application program APP-A on Server-A sends data to the application program APP-B on Server-B. As part of the transmission, the data is first copied from the buffer in the user application space to the socket buffer in the kernel space of Server-A. In the kernel space, the protocol stack then adds packet headers, encapsulates the packets, and performs the associated protocol processing, such as Transmission Control Protocol (TCP) or User Datagram Protocol (UDP), Internet Protocol (IP), and Internet Control Message Protocol (ICMP). Finally, the processed packets are pushed into the NIC buffer for transmission over the network.

Data Reception

When the recipient, Server-B, receives the data packet sent by Server-A, it sends a response. When Server-A receives the response packet, it first copies the packet data from the NIC buffer to the kernel socket buffer. The packet is then parsed by the protocol stack, and the parsed data is copied to the corresponding buffer in the user application space. Finally, the application program APP-A is woken up so that it can perform its read operation.
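The send and receive paths described above can be observed from the application's side with ordinary sockets. The following is a minimal sketch, not a benchmark: it uses a loopback TCP connection (so no physical NIC is involved), a small payload, and a hypothetical helper name `loopback_round_trip`. The comments mark where the user-to-kernel and kernel-to-user copies happen.

```python
import socket


def loopback_round_trip(payload: bytes) -> bytes:
    """Send payload over a loopback TCP connection and return what arrives."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()

    try:
        # System call: the kernel copies the user-space buffer into its
        # socket buffer, then the TCP/IP stack adds headers and queues it.
        sender.sendall(payload)

        # System call: the kernel copies from its socket buffer into a
        # user-space bytes object on every recv().
        chunks = []
        remaining = len(payload)
        while remaining > 0:
            chunk = receiver.recv(remaining)
            if not chunk:  # peer closed the connection early
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
        listener.close()


received = loopback_round_trip(b"hello from APP-A")
print(received)
```

Every `sendall` and `recv` here crosses the user/kernel boundary, which is the overhead the following sections attribute to the traditional path.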

Server Processing

When network traffic interacts at a high rate, the data processing performance of sending and receiving becomes highly inefficient. This inefficiency manifests in the following ways:

Significant processing latency:

During data transmission and reception, most network traffic requires at least two memory copies across the system bus. One copy is performed by the host adapter, which uses DMA to place the data into a memory buffer provided by the kernel; the other copies the data from the kernel buffer to the application's memory buffer. The computer therefore has to handle two interrupts and switch between the kernel context and the application context. As a result, processing received data on the server involves multiple memory copies, interrupt handling, context switching, and complex TCP/IP protocol processing, all of which add to the latency of traffic transmission.
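To get a feel for what the extra copy costs, the rough model below estimates memory traffic on the receive side. It is a back-of-the-envelope sketch under stated assumptions: the link runs at full line rate, the NIC's DMA writes each byte to memory once, and every CPU copy reads and writes each byte once. Caches, headers, and the function name `memory_traffic_gbytes_per_s` are simplifications or hypothetical.

```python
def memory_traffic_gbytes_per_s(line_rate_gbps: float, cpu_copies: int) -> float:
    """Approximate receive-side memory traffic in GByte/s.

    Assumes one DMA write by the NIC, plus one read and one write
    of every byte for each CPU copy.
    """
    payload = line_rate_gbps / 8            # Gbit/s -> GByte/s
    dma_writes = payload                    # NIC places data in memory
    copy_traffic = 2 * payload * cpu_copies # read + write per CPU copy
    return dma_writes + copy_traffic


traditional = memory_traffic_gbytes_per_s(100, cpu_copies=1)  # kernel -> app copy
zero_copy = memory_traffic_gbytes_per_s(100, cpu_copies=0)    # data lands in app memory

print(traditional, zero_copy)  # 37.5 12.5
```

Under these assumptions, a single kernel-to-application copy triples the memory traffic generated by a 100G port.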

Increased CPU and memory resource consumption with a higher number of received packets per unit time:

For a switch, parsing packets up to Layer 3 is usually sufficient, and dedicated chips do this without consuming CPU resources. Servers, on the other hand, need to parse the contents of every received packet, which consumes CPU and memory resources. Processing the network and transport layers requires the CPU to look up memory addresses, verify checksums, reassemble TCP/UDP data, and deliver it to the application space. The more packets received per unit time, the more CPU and memory resources are consumed.
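The relationship between packet rate and CPU cost can be made concrete with a little arithmetic. The sketch below assumes standard Ethernet framing (8 bytes of preamble and start delimiter plus a 12-byte inter-frame gap per frame), a 100 Gbit/s port, and a single 3 GHz core handling all packets; the core frequency and the function names are illustrative assumptions.

```python
ETHERNET_GAP_BYTES = 20  # preamble + start delimiter (8) + inter-frame gap (12)


def packets_per_second(line_rate_bps: float, frame_bytes: int) -> float:
    """Maximum frame rate on the wire for a given frame size."""
    return line_rate_bps / ((frame_bytes + ETHERNET_GAP_BYTES) * 8)


def cycles_per_packet(cpu_hz: float, pps: float) -> float:
    """CPU cycles available per packet if one core must keep up."""
    return cpu_hz / pps


for frame in (64, 1518):  # minimum and maximum standard Ethernet frame sizes
    pps = packets_per_second(100e9, frame)
    budget = cycles_per_packet(3e9, pps)
    print(f"{frame:>4} B frames: {pps / 1e6:6.1f} Mpps, {budget:6.2f} cycles/packet")
```

With small frames, the per-packet budget shrinks to a few tens of cycles, far less than a full trip through the kernel protocol stack, which is why per-packet CPU work becomes the bottleneck at high rates.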

Lower flexibility:

The main reason is that all network communication protocols are passed through the kernel, which makes it difficult to support new network protocols, new message communication protocols, and new sending and receiving interfaces. In the later evolution of the network, it becomes challenging to escape from this "stalemate" using traditional IP data transmission.

RDMA Transforms Inefficiency into Efficiency

To address the high server-side data processing latency and resource consumption of remote memory access, IBM and HP proposed RDMA (Remote Direct Memory Access) in 2003. With network adapters that support this technology, data can be transferred directly from the network into application memory, or from application memory directly onto the network. This enables zero-copy networking: no data needs to be copied between application memory and the operating system's data buffers. Such transfers require no work from the CPU, do not pollute the cache, and involve no context switches, which significantly reduces processing latency in message transmission. The transfers can also proceed in parallel with other system operations, improving overall system performance.
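The difference between copying data and referring to it in place can be illustrated inside a single process. The snippet below is only an analogy for the zero-copy idea, not RDMA itself: slicing a `bytearray` makes a private copy, while a `memoryview` refers to the same underlying memory.

```python
app_buffer = bytearray(b"payload-0123456789")

copied = bytes(app_buffer[:7])       # slicing allocates and copies 7 bytes
shared = memoryview(app_buffer)[:7]  # a view onto the same memory, no copy

app_buffer[0] = ord("P")             # modify the original buffer in place

print(copied)         # b'payload'  (the copy is unaffected)
print(bytes(shared))  # b'Payload'  (the view sees the change)
```

An RDMA-capable NIC plays a role similar to the view: it works on the application's own buffer rather than on an intermediate copy held by the operating system.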

In remote memory read and write operations, the remote virtual memory address to be accessed is carried inside the RDMA message itself. The remote application therefore only needs to register the corresponding memory buffer with its local NIC. Apart from connection establishment and the registration calls, the CPU of the remote node provides no service during the entire RDMA data transfer, so the transfer itself consumes no CPU resources on that node. For example, suppose a connection has been established between an application and a remote application, and the remote memory buffer has been registered. When the application performs an RDMA read operation, the process is as follows:

When an application performs an RDMA read operation, no data copying is performed. Without involving the kernel, the RDMA request is sent from the application running in user space to the local NIC.

The local NIC transmits the request over the network to the remote NIC. (For an RDMA write, the local NIC also reads the content of the local buffer and sends it along with the request.)

The RDMA information transmitted over the network includes the target virtual memory address and the memory key; for a write, it also includes the data itself.

The target NIC verifies the memory key and reads the data directly from the registered application buffer at the virtual memory address given in the RDMA information. It returns the data to the requesting NIC, which places it directly into the local application's buffer.
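The steps above can be sketched as a small in-process simulation. This is a toy model under loud assumptions: `ToyNic`, `register_memory`, `serve_read`, and `rdma_read` are hypothetical names invented for illustration and are not the API of any real RDMA library, and the "network" is just a method call. What it does show is the control flow: registration yields an address and a memory key, the request carries both, and the target side validates the key before touching the buffer.

```python
import secrets


class ToyNic:
    """Hypothetical stand-in for an RDMA-capable NIC (not a real API)."""

    def __init__(self):
        self.regions = {}       # memory key -> (base address, buffer)
        self.next_addr = 0x1000 # made-up virtual address space

    def register_memory(self, buffer: bytearray):
        """Register a buffer; return (virtual address, memory key)."""
        base = self.next_addr
        self.next_addr += max(len(buffer), 1)
        rkey = secrets.randbits(32)
        while rkey in self.regions:  # avoid the unlikely key collision
            rkey = secrets.randbits(32)
        self.regions[rkey] = (base, buffer)
        return base, rkey

    def serve_read(self, addr: int, rkey: int, length: int) -> memoryview:
        """Target side: verify the key and range, then expose the bytes."""
        if rkey not in self.regions:
            raise PermissionError("unknown memory key")
        base, buffer = self.regions[rkey]
        offset = addr - base
        if offset < 0 or offset + length > len(buffer):
            raise ValueError("access outside the registered region")
        return memoryview(buffer)[offset:offset + length]

    def rdma_read(self, remote: "ToyNic", addr: int, rkey: int,
                  local_buf: bytearray) -> None:
        """Requesting side: fetch len(local_buf) bytes into local_buf."""
        data = remote.serve_read(addr, rkey, len(local_buf))
        local_buf[:] = data  # stands in for DMA into the application buffer


remote_nic, local_nic = ToyNic(), ToyNic()
remote_buf = bytearray(b"behavioral-data")
addr, rkey = remote_nic.register_memory(remote_buf)  # done once, by the remote app

local_buf = bytearray(10)
local_nic.rdma_read(remote_nic, addr, rkey, local_buf)
print(local_buf)  # bytearray(b'behavioral')
```

Note that the remote application's code runs only during registration; serving the read is handled entirely by the (simulated) NIC, mirroring the point that the remote CPU is not involved in the transfer.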

As the diagram shows, comparing the data processing paths of the traditional mode and the RDMA mode makes the benefits clear: RDMA brings significant breakthroughs to data center communication architecture, notably low latency and extremely low CPU and memory resource utilization.