InfiniBand vs. RoCE: Choosing a Network for AI Data Centers
Ultra-high bandwidth, ultra-low latency, and ultra-high reliability are the requirements for network connectivity in large-scale AI training.
For many years, the TCP/IP protocol has been the foundation of internet communication. However, for AI networks, TCP/IP has some critical drawbacks. TCP/IP introduces high latency, typically in the range of tens of microseconds, and it also puts a significant load on the CPU. RDMA (Remote Direct Memory Access) enables direct access to memory data through the network interface without involving the operating system kernel. This allows for high-throughput, low-latency network communication, making it particularly suitable for use in large-scale parallel computing clusters.
RDMA technology has four implementations: InfiniBand, RoCEv1, RoCEv2, and iWARP. Among them, RoCEv1 has been deprecated, and iWARP is not commonly used. Currently, the industry's commonly used network solutions are InfiniBand and RoCEv2.
So which is the better fit for AI data center networks: InfiniBand or RoCE?
InfiniBand vs. RoCE: What are the network requirements for HPC/AI workloads?
Currently, most data centers adopt a layer 2 network architecture, while AI clusters are supercomputers designed for executing complex, large-scale AI tasks. These workloads run in parallel across multiple GPUs and require high utilization. Therefore, compared to traditional data center networks, AI data center networks face additional complexities:
- Parallel Computing: AI workloads run the same application or computational task across many machines that operate as one unified infrastructure.
- Scale: HPC/AI workloads can involve thousands of computing engines such as GPUs, CPUs, FPGAs, and more.
- Job Types: Tasks vary in size, duration, the size and number of datasets involved, and the types of answers to generate, as well as in the programming languages used to write the applications and the hardware they run on. As a result, traffic patterns in a network built for HPC/AI workloads change constantly.
- Losslessness: In traditional data centers, lost messages can be retransmitted, but in AI workloads, lost messages mean the entire computation either becomes erroneous or gets stuck. Therefore, AI data centers require a lossless network.
- Bandwidth: High-bandwidth traffic needs to flow between servers so that applications can access data. In modern deployments, each computing engine for AI or other high-performance computing capabilities can have interface speeds of up to 400Gbps.
These complexities present significant challenges for AI networks. As a result, AI data center networks need to possess characteristics such as high bandwidth, low latency, no jitter, no packet loss, and long-term stability.
From TCP/IP to RDMA
NADDOD has a professional technical team and extensive experience in implementing and servicing various application scenarios. NADDOD believes that the existing TCP/IP software and hardware architecture cannot meet the requirements of low-latency, high I/O concurrency applications such as HPC/AI.
Traditional TCP/IP network communication sends messages through the kernel, which incurs high data movement and data copying overhead. For example, in a typical IP data transmission, the following operations occur on the receiving host when an application on one computer sends data to an application on another computer:
- The kernel must receive the data.
- The kernel must determine which application the data belongs to.
- The kernel wakes up the application.
- The kernel waits for the application to perform a system call to the kernel.
- The application copies the data from kernel memory space to the buffer provided by the application.
This process means that, if the host adapter uses direct memory access (DMA), most of the network traffic is copied across the system's main memory. Additionally, the computer must perform context switches between the kernel and the application. At high traffic rates, these context switches raise CPU load and slow down other tasks.
Unlike traditional IP communication, RDMA communication bypasses kernel intervention during the communication process, allowing hosts to directly access the memory of another host, reducing CPU overhead. The RDMA protocol enables host adapters to determine which application should receive a data packet and where to store it in the memory space of that application after the packet enters the network. Instead of sending the packet to the kernel for processing and copying it to the memory of the user application, the host adapter directly places the contents of the packet into the application's buffer.
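The difference between the two receive paths can be sketched with a toy model. This is only an illustration, not real networking code: the `Stats` class and both function names are hypothetical, the "adapter" is simulated with ordinary Python buffers, and only one CPU copy and one system call are counted per packet.

```python
class Stats:
    """Counts the host-side work done to deliver one packet (toy model)."""

    def __init__(self):
        self.cpu_copies = 0
        self.system_calls = 0


def receive_via_kernel(packet, app_buffer, stats):
    """Traditional path: adapter -> kernel buffer -> application buffer."""
    kernel_buffer = bytearray(packet)          # adapter DMA lands in kernel memory
    stats.system_calls += 1                    # application asks the kernel for the data
    app_buffer[:len(kernel_buffer)] = kernel_buffer   # CPU copies kernel -> user space
    stats.cpu_copies += 1


def receive_via_rdma(packet, registered_buffer, stats):
    """RDMA path: adapter writes straight into the application's buffer."""
    # No kernel buffer, no system call, and no extra CPU copy in this model.
    registered_buffer[:len(packet)] = packet


if __name__ == "__main__":
    payload = b"gradient-shard"

    kernel_stats, kernel_app = Stats(), bytearray(len(payload))
    receive_via_kernel(payload, kernel_app, kernel_stats)

    rdma_stats, rdma_app = Stats(), bytearray(len(payload))
    receive_via_rdma(payload, memoryview(rdma_app), rdma_stats)

    print("kernel path:", kernel_stats.cpu_copies, "copy,", kernel_stats.system_calls, "system call")
    print("RDMA path:  ", rdma_stats.cpu_copies, "copies,", rdma_stats.system_calls, "system calls")
```

Both paths deliver the same bytes to the application. The model only makes visible that the kernel path adds an intermediate buffer and a user/kernel transition for every packet.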
RDMA transmission reduces the number of CPU cycles involved, which improves throughput and performance. In other words, the essence of RDMA in large-scale distributed computing and storage scenarios is to bypass the CPU: the network card accesses memory on remote servers directly, which accelerates interaction between servers, reduces latency, and frees valuable CPU resources for high-value computation and logic control.
Compared to traditional TCP/IP networks, InfiniBand and RoCEv2 bypass the kernel protocol stack, which improves latency by roughly an order of magnitude. Experimental tests have shown that, for single-hop communication within the same cluster, bypassing the kernel protocol stack reduces end-to-end latency at the application layer from 50 µs (TCP/IP) to 5 µs (RoCE) or 2 µs (InfiniBand).
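To put the quoted figures in perspective, the short calculation below derives the speedup over TCP/IP and an upper bound on strictly sequential message exchanges per second. The latency values come from the text above; the function names are illustrative, and the bound assumes each message waits for the previous one, which real pipelined workloads do not.

```python
# Single-hop, application-layer latencies quoted in the text (microseconds).
LATENCY_US = {"TCP/IP": 50.0, "RoCE": 5.0, "InfiniBand": 2.0}


def speedup_over_tcp(latency_us):
    """Return how many times lower each transport's latency is than TCP/IP."""
    baseline = latency_us["TCP/IP"]
    return {name: baseline / value for name, value in latency_us.items()}


def sequential_messages_per_second(latency_us_value):
    """Upper bound on back-to-back dependent messages per second."""
    return 1_000_000 / latency_us_value


if __name__ == "__main__":
    factors = speedup_over_tcp(LATENCY_US)
    for name, latency in LATENCY_US.items():
        rate = sequential_messages_per_second(latency)
        print(f"{name:<10} {latency:>5.1f} us  {factors[name]:>4.0f}x  {rate:>9,.0f} msg/s")
```

With these inputs, RoCE comes out 10 times and InfiniBand 25 times faster than TCP/IP, which is about one order of magnitude.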
Introduction to InfiniBand Network
The InfiniBand network enables data transmission through InfiniBand adapters or switches instead of Ethernet. The latency of Ethernet is influenced by factors such as the processing capability, cache size, and forwarding mechanism of switches or routers, typically ranging from a few microseconds to a few milliseconds. The latency of an InfiniBand (IB) network is influenced by factors such as the forwarding capability of switches or routers, Cut-Through technology, and RDMA technology, typically ranging from several hundred nanoseconds to several microseconds.
Key components of the InfiniBand network include the Subnet Manager (SM), IB network cards, IB switches, and IB cables. InfiniBand switches do not run any routing protocols, and the forwarding table for the entire network is computed and distributed by a centralized Subnet Manager. In addition to the forwarding table, the SM is responsible for managing configurations such as partitioning and QoS within the InfiniBand subnet. The InfiniBand network requires dedicated cables and optical modules to interconnect switches and connect switches to network cards.
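The idea of a central controller computing every switch's forwarding table can be sketched as follows. This is a simplified illustration, not how a production Subnet Manager works: the topology and node names are hypothetical, destinations are identified by name rather than by InfiniBand addresses, the next hop is a neighbor name rather than a port number, and plain shortest-path (breadth-first) search stands in for whatever routing algorithm a real SM is configured to use.

```python
from collections import deque

# Hypothetical two-leaf, one-spine fabric: node -> directly connected neighbours.
TOPOLOGY = {
    "spine1": ["leaf1", "leaf2"],
    "leaf1": ["spine1", "hostA", "hostB"],
    "leaf2": ["spine1", "hostC"],
    "hostA": ["leaf1"],
    "hostB": ["leaf1"],
    "hostC": ["leaf2"],
}
SWITCHES = {"spine1", "leaf1", "leaf2"}


def compute_forwarding_tables(topology, switches):
    """Centrally compute, for every switch, destination -> next hop."""
    tables = {sw: {} for sw in switches}
    for dest in topology:
        # Breadth-first search outward from the destination. The node we
        # reached a switch from is that switch's next hop toward dest.
        next_hop = {dest: None}
        queue = deque([dest])
        while queue:
            node = queue.popleft()
            if node != dest and node not in switches:
                continue  # hosts do not forward traffic for others
            for neighbour in topology[node]:
                if neighbour not in next_hop:
                    next_hop[neighbour] = node
                    queue.append(neighbour)
        for sw in switches:
            if sw != dest and sw in next_hop:
                tables[sw][dest] = next_hop[sw]
    return tables


if __name__ == "__main__":
    for switch, table in sorted(compute_forwarding_tables(TOPOLOGY, SWITCHES).items()):
        print(switch, table)
```

The switches in this model run no routing logic of their own. They only receive the finished table, which mirrors the division of labor described above.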
Native Lossless Network
The InfiniBand network utilizes a credit-based mechanism that fundamentally avoids buffer overflow and packet loss. The sender will only initiate packet transmission once it confirms that the receiver has sufficient credits to accept the corresponding number of messages.
Each link in the InfiniBand network has a predetermined buffer. The sender will not transmit data exceeding the available predetermined buffer size of the receiver. Once the receiver completes forwarding, it releases the buffer and continuously reports the current available predetermined buffer size back to the sender. This link-level flow control mechanism ensures that the sender does not send an excessive amount of data, preventing network buffer overflow and packet loss.
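The credit mechanism described above can be sketched as a small simulation of a single link. It is a toy model under stated assumptions: one credit equals one packet-sized buffer slot, credit updates reach the sender instantly, and the class name, function name, and parameters are all illustrative.

```python
from collections import deque


class CreditLink:
    """One link: the receiver grants buffer credits, the sender spends them."""

    def __init__(self, receiver_buffer_slots):
        self.capacity = receiver_buffer_slots
        self.credits = receiver_buffer_slots   # credits currently held by the sender
        self.receiver_buffer = deque()

    def try_send(self, packet):
        """Transmit only if a credit is available; otherwise the sender waits."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.receiver_buffer.append(packet)
        return True

    def receiver_forward(self):
        """Forward one packet onward, free its slot, and return a credit."""
        if not self.receiver_buffer:
            return None
        self.credits += 1
        return self.receiver_buffer.popleft()


def simulate(total_packets=10, buffer_slots=3, drain_every=2):
    """Sender offers one packet per tick; receiver drains one every N ticks."""
    link = CreditLink(buffer_slots)
    pending = deque(range(total_packets))
    delivered, max_occupancy, stalls, tick = [], 0, 0, 0
    while len(delivered) < total_packets:
        tick += 1
        if pending:
            if link.try_send(pending[0]):
                pending.popleft()
            else:
                stalls += 1                    # sender held back, nothing dropped
        max_occupancy = max(max_occupancy, len(link.receiver_buffer))
        if tick % drain_every == 0:
            packet = link.receiver_forward()
            if packet is not None:
                delivered.append(packet)
    return delivered, max_occupancy, stalls


if __name__ == "__main__":
    delivered, max_occupancy, stalls = simulate()
    print("delivered in order:", delivered == sorted(delivered))
    print("peak buffer occupancy:", max_occupancy, "| sender stalls:", stalls)
```

Because the sender is faster than the receiver here, it stalls whenever credits run out. The buffer never exceeds its capacity, so congestion shows up as waiting rather than as packet loss.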
Network Card Expansion Capability
InfiniBand's adaptive routing, based on per-packet dynamic routing, ensures optimal utilization of the network in large-scale deployments. There are many examples of large GPU clusters using InfiniBand networks, such as Baidu AI Cloud and Microsoft Azure.
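A rough way to see why per-packet routing helps utilization is to compare it with static per-flow placement across a set of equal uplinks. The sketch below is a simplified model, not InfiniBand's actual algorithm: the flow IDs are chosen to collide, `flow_id % num_ports` stands in for a header hash, "least loaded" stands in for whatever congestion signal a real switch uses, and packet ordering is ignored.

```python
def static_per_flow(flows, num_ports):
    """Pin every packet of a flow to one uplink chosen from the flow ID."""
    load = [0] * num_ports
    for flow_id, packets in flows:
        load[flow_id % num_ports] += packets   # stand-in for a header hash
    return load


def adaptive_per_packet(flows, num_ports):
    """Choose the currently least-loaded uplink separately for each packet."""
    load = [0] * num_ports
    for _flow_id, packets in flows:
        for _ in range(packets):
            port = load.index(min(load))
            load[port] += 1
    return load


if __name__ == "__main__":
    # Four large flows of 100 packets each; the IDs are hypothetical and
    # picked so that three of them map to the same uplink under static hashing.
    flows = [(0, 100), (4, 100), (8, 100), (1, 100)]
    print("static per-flow    :", static_per_flow(flows, 4))
    print("adaptive per-packet:", adaptive_per_packet(flows, 4))
```

In this example static placement leaves two uplinks idle while one carries three flows, whereas per-packet selection spreads the same traffic evenly across all four.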
In terms of speed, InfiniBand network cards have been advancing rapidly. The 200Gbps HDR generation is widely deployed commercially, and 400Gbps NDR network cards are also starting to be commercially deployed. Based on years of research and market surveys, NADDOD has found that the market has several major InfiniBand network solution providers and supporting equipment vendors, with NVIDIA holding the highest market share, exceeding 70%. The following diagram shows commonly used InfiniBand network cards.
In conclusion, selecting the right solution for AI data center networks is a critical decision. In cases where AI applications demand high network performance, traditional TCP/IP protocols are no longer sufficient to meet the requirements. In contrast, the application of RDMA technology has made InfiniBand and RoCE highly regarded network solutions.
NADDOD is a leading provider of comprehensive optical network solutions, committed to providing users with high-performance, low-latency network solutions for InfiniBand and RoCE. InfiniBand has demonstrated excellent performance in areas such as high-performance computing and large-scale GPU clusters, while RoCE, as an RDMA technology based on Ethernet, offers greater deployment flexibility. For more detailed information on RoCE, please refer to the RoCE v2 network introduction.
Therefore, selecting the appropriate network solution based on specific network requirements and application scenarios is a crucial step in ensuring high performance and efficiency in AI data center networks.