RDMA Network Discussion in the Era of Large AI Models
With the popularity of large models like ChatGPT, the third wave of the information age is taking shape, bringing with it a widespread fear of missing out. This article provides an overview of RDMA network technologies from the perspective of distributed training of large AI models.
Overview of Large AI Models
The large AI models that most commonly require distributed training fall into two categories: dense models, used in scenarios such as vision and natural language processing/generation, including ResNet-50, BERT, Transformer, GPT-3, and GPT-4; and sparse models, used in scenarios such as search, advertising, and recommendation, including the various Deep Learning Recommendation Models (DLRMs).
As the name suggests, dense models take dense tensors as input, and every element participates in the model's computation. Because the parameter counts of current dense models far exceed the memory of a single GPU (even the latest NVIDIA A100 with 80 GB), it is often necessary to apply tensor parallelism within a layer and pipeline parallelism across layers to fit the entire model. Communication here involves both intra-node and inter-node traffic, with the specific patterns determined by how the model is partitioned and placed. In addition, to accelerate training, complete model replicas are combined with data parallelism: each replica is fed different training data, and gradient synchronization between replicas produces the well-known allreduce communication pattern.
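To make the allreduce pattern concrete, here is a minimal pure-Python sketch of the classic ring algorithm (a reduce-scatter phase followed by an all-gather phase). The function name and the per-step message snapshots are illustrative only; this is not NCCL's actual implementation, just the textbook structure of the communication.

```python
# Illustrative ring allreduce: each of n workers ends with the
# element-wise sum of all workers' gradients. Each worker sends one
# chunk per step to its right neighbor (simulated in-process here).

def ring_allreduce(grads):
    n = len(grads)                       # number of workers in the ring
    size = len(grads[0])                 # flat gradient length
    data = [list(g) for g in grads]      # per-worker local buffers
    bounds = [(i * size // n, (i + 1) * size // n) for i in range(n)]

    # Phase 1: reduce-scatter. After n-1 steps, worker r holds the
    # fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        # snapshot all sends first, so a step is "simultaneous"
        msgs = []
        for r in range(n):
            c = (r - step) % n           # chunk worker r sends this step
            lo, hi = bounds[c]
            msgs.append((c, data[r][lo:hi]))
        for r in range(n):
            c, payload = msgs[(r - 1) % n]   # receive from left neighbor
            lo = bounds[c][0]
            for i, v in enumerate(payload):
                data[r][lo + i] += v         # accumulate into local chunk

    # Phase 2: allgather. Circulate the completed chunks around the ring.
    for step in range(n - 1):
        msgs = []
        for r in range(n):
            c = (r + 1 - step) % n       # completed chunk worker r forwards
            lo, hi = bounds[c]
            msgs.append((c, data[r][lo:hi]))
        for r in range(n):
            c, payload = msgs[(r - 1) % n]
            lo, hi = bounds[c]
            data[r][lo:hi] = payload     # overwrite with the complete chunk
    return data
```

Note that each worker transmits roughly 2·(n−1)/n of the gradient size in total, which is why the ring algorithm's bandwidth cost is nearly independent of the number of workers.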
Sparse models often have a very large feature space, with each batch containing only a small number of active features. They typically consist of two parts: embedding of the sparse features at the front, and a dense model at the back. Computing the sparse-feature embeddings is the critical step; the dense part is usually small enough to fit on a single GPU, allowing data parallelism with allreduce communication. During training, lookups and rearrangements are performed on the embedding table, which then yields the tensors consumed by the dense model. The embedding table itself often requires so much storage that it must be sharded across multiple GPU servers, a typical case of tensor parallelism. The resulting communication pattern is alltoall, which can cause severe many-to-one (incast) traffic, leading to network congestion and placing high demands on the network architecture, congestion-control protocols, load-balancing algorithms, and more.
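The alltoall pattern behind a sharded embedding lookup can be sketched in a few lines of Python. Everything below is illustrative (the row-to-owner hashing, function names, and in-process "alltoall" are assumptions for the sketch, not any framework's API): each worker buckets the row ids its batch needs by owning worker, an alltoall ships the requests to the owners, and a second alltoall returns the embedding rows.

```python
# Sketch of a two-round alltoall embedding lookup. Worker w owns the
# rows with row_id % n == w; messages are exchanged in-process.

def alltoall(send):
    # send[src][dst] is the message src sends to dst; the result gives
    # each dst the messages it received, ordered by src.
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

def embedding_lookup(table_shards, batches, n):
    # 1) bucket each worker's requested row ids by owning worker
    requests = [[[rid for rid in batch if rid % n == owner]
                 for owner in range(n)] for batch in batches]
    # 2) first alltoall: ship id lists to the owners
    #    (this is where many-to-one incast pressure appears)
    recv_ids = alltoall(requests)
    # 3) each owner looks up the requested rows in its local shard
    replies = [[[table_shards[owner][rid] for rid in ids]
                for ids in recv_ids[owner]] for owner in range(n)]
    # 4) second alltoall: return the embedding rows to the requesters
    return alltoall(replies)
```

In a real system the reply round carries full embedding vectors rather than short id lists, which is why the return alltoall dominates the traffic.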
In short, dense models and sparse models exhibit noticeable differences in terms of model features, as well as distinct requirements for computational, storage, and communication resources. To maximize the utilization of GPU computing resources and achieve the best acceleration effects, it is necessary to consider and design GPU server architecture, network architecture, training data storage and retrieval, and distributed training frameworks in a comprehensive manner, taking into account the model characteristics and implementation methods.
GPU Intra-Node Communication Technologies
To fully leverage the performance advantages of RDMA networks in large-model training scenarios, not only must the physical network path of NIC-switch-NIC be well designed in terms of configuration, algorithms, and protocols, but the intra-node topology of the GPU servers is also crucial.
One of the most important components of intra-node communication is PCIe. It adopts a tree topology, generally consisting of a Root Complex (RC), PCIe switches, and PCIe endpoints; for details, refer to the PCIe Specification. In this architecture, the CPU and memory are the core components, while the GPU, NIC, NVMe, and other devices are peripherals, as shown in the diagram below.
However, in the era of deep learning, this paradigm has shifted, with the GPU becoming the core of computation, while the CPU plays a controlling role, as illustrated in the diagram below.
When it comes to communication between GPUs within a machine, going through PCIe/QPI/UPI can often become a bottleneck. Therefore, NVIDIA has introduced dedicated intra-node interconnects, NVLink and NVSwitch, which provide hundreds of gigabytes per second of interconnect bandwidth between GPUs in the same machine (several terabits per second in bit terms).
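A back-of-envelope calculation shows why PCIe becomes the bottleneck. The figures below are the commonly quoted spec values (PCIe 4.0: 16 GT/s per lane with 128b/130b encoding; A100 NVLink 3.0: 12 links at 50 GB/s each); real achievable throughput is lower in both cases.

```python
# Rough spec-level bandwidth comparison: PCIe 4.0 x16 vs NVLink 3.0.

def pcie_gbytes_per_s(gt_per_s, lanes):
    # per-direction bandwidth: GT/s * lanes * encoding efficiency / 8 bits
    return gt_per_s * lanes * (128 / 130) / 8

pcie4_x16 = pcie_gbytes_per_s(16, 16)   # ~31.5 GB/s per direction
nvlink3_total = 12 * 50                 # 600 GB/s aggregate on an A100
```

Even at spec level, NVLink 3.0's aggregate bandwidth is roughly an order of magnitude above a PCIe 4.0 x16 link, which is the gap NVLink/NVSwitch exist to close.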
Communication between GPUs across machines must pass through NICs. In the absence of a PCIe switch, traffic between a GPU and a NIC goes through the RC and typically involves a staging copy through CPU memory, which often becomes a bottleneck. To address this, NVIDIA developed GPU Direct RDMA (GDR), which allows GPU data to be DMA-ed directly to the NIC without a copy through the CPU. It is important to note that GDR is a communication acceleration technology built on standard PCIe features and requires the GPU and NIC to sit under the same PCIe switch; if the path crosses the PCIe host bridge, QPI/UPI, or other components on the RC, there may be no performance benefit.
So, a natural question arises: how do we determine the connection mode between GPUs? NVIDIA has thoughtfully provided the command `nvidia-smi topo -m` to check the interconnectivity within the machine:
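The matrix printed by `nvidia-smi topo -m` labels each device pair with a path type: NV# (NVLink), PIX (same PCIe switch), PXB (multiple PCIe bridges), PHB (through the PCIe host bridge), NODE (within one NUMA node), or SYS (across sockets). The sketch below ranks these labels to decide whether GDR is likely worthwhile for a GPU-NIC pair; the ranking and threshold are illustrative simplifications of the policy NCCL exposes via its NCCL_NET_GDR_LEVEL setting, not NCCL's actual code.

```python
# nvidia-smi "topo -m" PCIe path labels, best to worst for GPU<->NIC
# traffic. (NVLink labels do not apply to GPU-NIC pairs.)
PATH_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}

def gdr_likely_beneficial(path_label, max_level="PXB"):
    """Heuristic: GDR tends to pay off only when the GPU and NIC share
    a PCIe switch (PIX) or at least avoid the host bridge; max_level
    sets the worst path still considered acceptable."""
    return PATH_RANK[path_label] <= PATH_RANK[max_level]
```

For example, a GPU-NIC pair labeled PIX is the ideal GDR placement, while a pair labeled PHB or SYS routes through the RC and may see no benefit, matching the caveat above.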
However, it is worth noting that not all machines have NVLink, NVSwitch, or PCIe switches, as they all come at a cost. How to optimize GPU communication performance, whether to enable GDR, and how to configure PCIe parameters therefore depend on the specific machine model and communication pattern. The best approach is to consider the model characteristics and implementation methods when purchasing GPU servers and building the physical network: design the intra-node connections between GPUs and between GPUs and NICs, as well as the network connectivity between NICs, so that bottlenecks do not appear prematurely and expensive GPU computing resources are not wasted.
NCCL Communication Library
To harness distributed GPU computing power for large models, communication libraries play a critical role: they provide the APIs that training frameworks use to establish efficient communication between GPUs within and across machines for exchanging model parameters. The most widely used library in the industry is NCCL, an open-source library from NVIDIA, and most major companies rely on NCCL or modified versions of it as the foundation for GPU communication. NCCL is designed specifically for collective communication among multiple GPUs. It has a degree of topology awareness and offers collective APIs such as AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter, as well as point-to-point communication via ncclSend() and ncclRecv(), covering patterns such as one-to-all, all-to-one, and all-to-all. In most cases it achieves high bandwidth and low latency over PCIe, NVLink, and NVSwitch within a server, and over RoCEv2, InfiniBand, or TCP networks between servers.
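The semantics of the main collectives listed above can be stated precisely as pure functions over per-rank buffers. This single-process Python sketch is only a reference model of what each collective computes; the real NCCL APIs (ncclAllReduce, ncclBroadcast, ncclReduceScatter, ncclAllGather) operate asynchronously on device buffers within a communicator.

```python
# Reference semantics of NCCL-style collectives over a list of
# per-rank buffers (bufs[r] is rank r's local buffer), with sum as
# the reduction operator.

def allreduce(bufs):
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]        # every rank gets the sum

def broadcast(bufs, root=0):
    return [list(bufs[root]) for _ in bufs]   # every rank gets root's buffer

def reducescatter(bufs):
    n = len(bufs)
    total = [sum(vals) for vals in zip(*bufs)]
    k = len(total) // n                       # assume divisible chunking
    return [total[r * k:(r + 1) * k] for r in range(n)]  # rank r: chunk r

def allgather(bufs):
    flat = [x for b in bufs for x in b]
    return [list(flat) for _ in bufs]         # every rank gets the concat
```

These definitions also make the earlier decomposition visible: allreduce is exactly reducescatter followed by allgather of the reduced chunks, which is how ring and tree implementations structure the transfer.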
As deep learning models become increasingly complex and the number of model parameters grows, a single GPU server is no longer sufficient to meet the requirements of model parameter size and training iteration speed. Distributed training with multiple machines and GPUs has become essential. RDMA networks, as the underlying communication technology in the era of AI large-scale models, already play a significant role and will continue to do so.
As a leading provider of optical network equipment, NADDOD can offer optimal end-to-end solutions for RDMA networks, from connectivity components to network infrastructure to servers. These solutions enable fast data exchange and processing in data centers, high-performance computing, and other domains, thereby improving system performance and efficiency.