RDMA High-Speed Network For Large Model Training
1. What is RDMA?
Remote Direct Memory Access (RDMA) is a high-speed networking technology that allows a program to read and write the memory of a remote compute node directly. The reason for its speed is illustrated in the diagram below: with RDMA, a network access does not pass through the operating system's kernel stack (Sockets, TCP/IP, etc.), which would otherwise consume CPU time on kernel operations. Instead, RDMA bypasses these kernel overheads and lets the Network Interface Card (NIC) access memory directly; in InfiniBand terminology, the NIC is called a Host Channel Adapter (HCA).
RDMA has three main hardware implementations: InfiniBand, RoCE, and iWARP. According to NADDOD's technology experts, InfiniBand and RoCE are currently the more mainstream technologies.
2. What is InfiniBand?
The name InfiniBand reflects its ambition: "infinite bandwidth." The dominant InfiniBand vendor is Mellanox, an Israeli company that NVIDIA acquired for $6.9 billion in a deal completed in 2020.
The mainstream InfiniBand speeds today are 100G and 200G. The designations for IB rates include Enhanced Data Rate (EDR, 100G) and High Data Rate (HDR, 200G), with NDR (400G) as the next step up.
Most IT professionals have had limited exposure to InfiniBand because of its high cost, which puts it out of reach for general use. In the supercomputing centers of major universities and research institutions, however, InfiniBand is almost standard equipment, since it supports critical supercomputing workloads. How expensive is it? A 10-meter InfiniBand cable, which looks roughly like the one below, costs approximately $1.3K. Each end of the cable needs a network card, and each card costs around $820, so two network cards plus one cable come to roughly $3K.
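The cost figures above can be sanity-checked with a short calculation (the prices are the article's rough estimates, not actual quotes):

```python
# Rough cost of one InfiniBand point-to-point link, using the
# article's approximate prices (illustrative only, not real quotes).
cable_cost = 1300      # ~$1.3K for a 10 m InfiniBand cable
nic_cost = 820         # ~$820 per network card, one per cable end

link_cost = cable_cost + 2 * nic_cost
print(f"One cable + two NICs: ${link_cost}")   # $2940, roughly $3K
```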
The following diagram illustrates a comparison between a 1G Ethernet cable and an InfiniBand switch. The top portion represents a 1G Ethernet cable, while the bottom portion represents an HDR switch.
In addition, building an InfiniBand network is extremely expensive, and InfiniBand networking differs from ordinary switched networks. If you want the network cards of any two compute nodes to communicate at full speed, you use a network topology called a "fat tree." It typically follows the structure shown below, where squares represent switches and ellipses represent compute nodes. A fat tree has two main layers: the upper core layer, which connects to no compute nodes and only forwards traffic, and the lower access layer, which connects the compute nodes.
The high cost of a fat-tree InfiniBand network comes mainly from this: on a given access switch with, say, 36 ports, only half of the ports (18) can connect to compute nodes if communication is to be non-blocking; the other half must connect to the upper-layer core switches. Keep in mind that each cable costs around $1.3K, and these redundant uplinks are precisely what makes lossless, full-speed communication possible.
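The half-down, half-up port split above also determines how large a non-blocking two-layer fat tree can grow. A minimal sketch of the arithmetic, assuming every switch has the same radix (36 ports, as in the example):

```python
def two_layer_fat_tree(radix: int) -> dict:
    """Sizing for a non-blocking two-layer fat tree of radix-port switches.

    Each access (leaf) switch uses half its ports for compute nodes and
    half as uplinks, as described in the text. With radix/2 uplinks per
    leaf, radix/2 core switches can each connect once to every leaf,
    supporting up to `radix` leaf switches in total.
    """
    down = radix // 2                  # node-facing ports per leaf
    up = radix // 2                    # core-facing ports per leaf
    cores = radix // 2                 # core switches needed
    max_leaves = radix                 # each core has `radix` ports, one per leaf
    max_nodes = down * max_leaves
    return {"nodes_per_leaf": down, "uplinks_per_leaf": up,
            "core_switches": cores, "max_nodes": max_nodes}

print(two_layer_fat_tree(36))
# {'nodes_per_leaf': 18, 'uplinks_per_leaf': 18, 'core_switches': 18, 'max_nodes': 648}
```

So a two-layer fabric of 36-port switches tops out at 648 compute nodes; growing beyond that requires adding a third switching layer.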
After discussing the cost, let's talk about performance. As the saying goes, "You get what you pay for," and InfiniBand truly delivers high bandwidth and low latency. According to Wikipedia, InfiniBand offers significantly lower latency than Ethernet: around 100 nanoseconds versus 230 nanoseconds, respectively. InfiniBand is widely used in some of the world's leading supercomputers, including those operated by Microsoft, NVIDIA, and national laboratories in the United States.
3. What is RoCE?
Compared to an expensive InfiniBand deployment, RoCE (RDMA over Converged Ethernet) is relatively cheaper, though it still cannot be called inexpensive. RoCE provides RDMA capabilities on top of Ethernet. In recent years, RoCE has developed rapidly as a substitute for the costly InfiniBand.
Currently, many vendors specializing in Ethernet interconnects are actively promoting RoCE. However, once you aim for a lossless network, it is hard to push the overall cost of a RoCE fabric below 50% of an equivalent InfiniBand build.
4. GPUDirect RDMA
When training large models, inter-node communication is a significant cost. Combining InfiniBand with GPUs enables a feature called GPUDirect RDMA, which lets GPUs on different nodes communicate directly, without involving host memory or the CPU: traffic flows straight between the GPUs and the InfiniBand network cards.
GPUDirect RDMA is especially important for large-model training because the model parameters live in GPU memory. Copying them to host memory already takes considerable time, and relaying them to other nodes via the CPU would be slower still.
5. Large Model Network Card Configuration
For large models, the optimal configuration pairs one GPU with one InfiniBand network card, and the NVIDIA DGX systems follow this setup. A typical compute node carries nine InfiniBand cards: one dedicated to the storage system (e.g., Lustre) and the remaining eight assigned to the eight GPUs. This configuration is very costly; on a limited budget, a more affordable alternative is a ratio of one InfiniBand card per four GPUs.
In most systems, the GPUs and InfiniBand cards hang off PCIe switches, with two GPUs per PCIe switch. Ideally, each GPU gets its own InfiniBand card. If instead two GPUs are paired with one InfiniBand card, they share both the PCIe switch and the NIC, and the two GPUs contend for access to the shared InfiniBand card.
The fewer the number of InfiniBand network cards, the more pronounced the contention between the GPUs, leading to lower communication efficiency between nodes.
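This sharing effect can be put in rough numbers. A minimal sketch, assuming nominal link rates and an even split between GPUs (real contention also depends on the PCIe topology and traffic pattern):

```python
def per_gpu_bandwidth(nic_gbps: float, gpus_per_nic: int) -> float:
    """Nominal inter-node bandwidth available to each GPU, in GB/s,
    when `gpus_per_nic` GPUs contend for one NIC (overhead ignored)."""
    return nic_gbps / 8 / gpus_per_nic

for ratio in (1, 2, 4):
    print(f"{ratio} GPU(s) per 200G HDR NIC: "
          f"{per_gpu_bandwidth(200, ratio):.2f} GB/s per GPU")
```

Halving the NIC count halves the per-GPU share, which is why sparser NIC configurations show visibly lower inter-node communication efficiency.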
In the following diagram, it can be observed that with only one 100 Gbps network card, the bandwidth is about 12 GB/s: converting bits to bytes gives 100 ÷ 8 = 12.5, and protocol overhead brings the achieved rate down to roughly 12. As the number of network cards increases, the bandwidth scales almost linearly. Eight H100 GPUs combined with eight 400G InfiniBand NDR cards yield an astonishingly high transfer rate.
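The bits-to-bytes conversion and the near-linear scaling can be sketched together; the efficiency factor here is an assumed round number standing in for protocol overhead, not a measured value:

```python
def node_bandwidth_GBps(num_nics: int, nic_gbps: float,
                        efficiency: float = 0.96) -> float:
    """Aggregate inter-node bandwidth in GB/s for `num_nics` NICs.

    Divides by 8 to convert Gbit/s to GByte/s; `efficiency` is an
    assumed factor roughly accounting for protocol overhead.
    """
    return num_nics * nic_gbps / 8 * efficiency

print(f"1 x 100G EDR: {node_bandwidth_GBps(1, 100):.1f} GB/s")  # ~12, as in the text
print(f"8 x 400G NDR: {node_bandwidth_GBps(8, 400):.0f} GB/s")
```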
One network card per GPU is the ideal situation
6. Large Model Network Topology: Rail Optimization
If you want to train large models, you need a dedicated fat-tree network topology, and it differs from the conventional HPC fat tree: NVIDIA calls the optimized variant a "rail-optimized" topology. The two diagrams below show the details.
This diagram shows a cost-reduced version of the fat-tree, rail-optimized topology. It uses two switches; QM8700 is an HDR switch. The two HDR switches are connected by four HDR cables to preserve inter-switch speed. Each DGX GPU node has a total of nine IB cards (labeled HCAs in the diagram): one dedicated to storage (Storage Target) and the remaining eight used for large-model training. Of those eight, HCA1/3/5/7 connect to the first switch and HCA2/4/6/8 to the second.
This diagram depicts a non-blocking, full-speed Rails-optimized topology. Each DGX GPU node is equipped with 8 IB cards, and each card is connected to a separate switch. These switches are referred to as leaf switches, and a total of 8 leaf switches are required. Specifically, HCA1 is connected to the first leaf switch, HCA2 is connected to the second leaf switch, and so on. Communication between the leaf switches is facilitated by spine switches to ensure high-speed connectivity.
The purpose of this configuration is to avoid bottlenecks, enabling any IB card to communicate at high speed with all other IB cards within the network. In other words, any GPU can communicate with other GPUs at extremely fast speeds.
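The wiring rule behind the full-speed diagram (HCA i of every node goes to leaf switch i) can be sketched as a small mapping, assuming 8 rails as shown:

```python
def rail_wiring(num_nodes: int, rails: int = 8) -> dict:
    """Map (node, hca) -> leaf switch in a rail-optimized topology.

    HCA i of *every* node connects to leaf switch i, so GPU i on one
    node reaches GPU i on any other node through a single leaf switch,
    with spine switches carrying only cross-rail traffic.
    """
    return {(node, hca): hca
            for node in range(num_nodes)
            for hca in range(1, rails + 1)}

wiring = rail_wiring(num_nodes=4)
print(wiring[(0, 1)], wiring[(3, 1)])  # both on leaf 1: same rail, one switch hop
```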
The following diagram represents an actual deployment of a full-speed network consisting of six switches. The dense and intricate network of cables can appear intimidating:
A real-world photo of a data-center InfiniBand HDR fabric for large-model training
The topology behind it is illustrated in the following diagram: the two green switches are spine switches, and the four blue switches are leaf switches. A total of 80 cables connect the blue and green switches; the blue leaf switches sit below and connect to the compute nodes.
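The 80-cable figure is consistent with half-up, half-down wiring on 40-port HDR switches (the QM8700 mentioned earlier is a 40-port switch). A quick check, treating the port count as an assumption about this particular deployment:

```python
# Sanity check of the cable count in the photographed topology
# (assumes 40-port HDR leaf switches with half their ports as uplinks).
leaf_switches = 4
spine_switches = 2
ports_per_leaf = 40                      # QM8700 HDR switch port count
uplinks_per_leaf = ports_per_leaf // 2   # half up, half down

total_uplink_cables = leaf_switches * uplinks_per_leaf
cables_per_leaf_spine_pair = uplinks_per_leaf // spine_switches

print(total_uplink_cables)           # 80 cables between leaf and spine layers
print(cables_per_leaf_spine_pair)    # 10 cables from each leaf to each spine
```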
7. Conclusion: Choosing Naddod InfiniBand or RoCE Solutions
The choice between InfiniBand and RoCE for building a high-performance, lossless network will depend on the specific requirements of your application and infrastructure. Both InfiniBand and RoCE are capable of providing low-latency, high-bandwidth, and low CPU overhead, making them suitable for high-performance computing applications.
Naddod provides lossless network solutions based on both InfiniBand and RoCE, helping customers build high-performance computing capabilities and lossless network environments. Depending on the application scenario and user requirements, Naddod selects the optimal solution for the situation, delivering high-bandwidth, low-latency, high-performance data transmission that resolves network bottlenecks and improves network performance and user experience.
Naddod offers high-speed InfiniBand products, including HDR/NDR 200G/400G/800G and RoCE 200G/400G AOCs, DACs, and optical modules, which deliver excellent performance and significantly improve customers' business acceleration at low cost. Naddod always puts customers first and continually creates value for customers across industries. Its products and solutions have earned customers' trust through high quality and strong performance, and are widely used in high-performance computing, data centers, education, research, biomedicine, finance, energy, autonomous driving, the internet, manufacturing, and telecom operators. Naddod works closely with customers, providing reliable and efficient network technology to help them succeed in the digital era. Whether you choose InfiniBand or RoCE, Naddod will be a trusted partner.