GPU Cluster Network In Large Language Model

NADDOD Jason Data Center Architect Jan 1, 2024

In the training process of AIGC (AI large models), various GPU/TPU parallel processing schemes are employed, including data parallelism, model parallelism, pipeline parallelism, and tensor parallelism. Regardless of the type of parallelism used, the enormous size of parameters and datasets results in significant inter-GPU traffic through the network connections that link these GPUs. Any congestion in the network can lead to prolonged training time and very low GPU utilization.


This article will explore popular GPU/TPU cluster network configurations, such as NVLink, InfiniBand, ROCE Ethernet Fabric, and DDC network solutions, to gain an in-depth understanding of their connectivity and how they come into play in LLM training. The interconnect technology and topology used for GPU/TPU clusters play a key role in the overall cost and performance of LLMs.


TPUs (Tensor Processing Units) are AI accelerators developed by Google for accelerating matrix multiplication, vector processing, and other computations required for training large-scale neural networks. Google does not sell TPUs to other cloud service providers or individuals. These TPU clusters are exclusively used by Google for providing ML/AI services in Google Cloud and other internal applications.



In a 2D torus network, each TPU is connected to four neighboring TPUs in the North/South/East/West directions. The torus shape comes from connecting the nodes of the grid at the edges (forming a circular or toroidal shape) to ensure that TPUs on the grid's edge are still connected to four other TPUs. The same concept can be extended to 3D topologies. This topology enables fast communication between adjacent TPUs as they exchange results (tensors/pipeline parallelism) during computation.


When it comes to GPUs, Nvidia is the primary GPU supplier, and most major data centers and HPC (High-Performance Computing) facilities use Nvidia GPU systems for training large-scale DNNs. Apart from models created by Google, most LLMs are trained using Nvidia GPUs.


Nvidia GPUs are equipped with high-speed NVLink technology for GPU-to-GPU communication. NVLink offers higher bandwidth compared to traditional PCIe interfaces and enables faster data transfers between GPUs, enhancing the training speed of machine learning models. The fourth generation of NVLink provides 200Gbps bandwidth in each direction. Nvidia's latest H100 GPU features 18 NVLinks, offering 3600Gbps bandwidth in each direction.


Additionally, NVLink allows for GPU memory pooling, where multiple GPUs can be connected together as a larger memory resource. This is beneficial for running applications that require more memory than what is available on a single GPU, and it enables flexible model parameter partitioning between GPUs.


Nvidia Hot Chips 2022

The image source: Nvidia's Hot Chips 2022 presentation. The diagram depicts a GPU server with 8 GPUs. Each NIC can carry 400Gbps of data in each direction and is connected to external Ethernet/IB switches via OSFP slots. The OSFP slots in the nodes are connected to each CX7 NIC, linking the GPUs to the external Ethernet/IB switches. The GPUs within the nodes can be interconnected with other GPUs through a hierarchical connection of NVLink switch systems, utilizing OSFP ports from four NV switches.


A GPU server, also known as a system or node, is a cluster that houses 8 GPUs. The GPUs within the cluster communicate with each other using NVLink through four custom NVLink switches (NVswitches). Multiple GPU servers can be interconnected using a GPU network to form a large-scale system. The GPU network consists of a set of switches arranged in a leaf/spine or 3-tier CLOS topology. It provides arbitrary connectivity between GPU servers connected to the network.


The leaf/spine switches in the network typically follow a fat-tree topology. In a fat-tree topology, the bandwidth of links increases as the topology ascends from nodes to leaf and spine switches. This is because each uplink needs to handle the combined bandwidth of multiple downlinks.



The image showing GPU networks in GPU servers and data center networks.


The number of switches in a GPU network (fabric) depends on:


  1. The scale of the system, i.e., the number of GPU nodes. As the number of nodes increases, more switches may be required to provide sufficient connectivity.


  1. The throughput of the switches. A higher throughput provided by individual switches reduces the number of switches needed.


  1. Additional bandwidth (over-provisioning) in the network to mitigate congestion.


For optimal training performance:


  1. The GPU network should have low end-to-end latency. Reducing overall latency in data transfers between nodes helps minimize the overall training time since there is significant communication among GPUs.


  1. The network shouldachieve lossless data transmission between nodes. Lossless transmission is crucial for AI training, as any loss of gradients or intermediate results would require rolling back the entire training to the previous stored checkpoint in memory and restarting, which negatively impacts training performance.


  1. The system should have efficient end-to-end congestion control mechanisms. In any tree-like topology, transient congestion is inevitable when multiple nodes transmit data to a single node. Persistent congestion can increase tail latency in the system. Due to the sequential dependency among GPUs, even if the gradient update of one GPU is delayed by the network, many GPUs might stall. A slow link is enough to degrade training performance.


Additionally, factors such as the overall cost, power consumption, and cooling costs of building the system should also be considered.


Given these considerations, let's now explore the choices in GPU architecture design and the pros and cons of each approach.


2. Custom NVLink Switch Systems

The NVLink switches, designed to connect the 8 GPUs within a GPU server, can also be utilized to construct a switching network between GPU servers.


Nvidia showcased a topology at the Hot Chips conference in 2022 that employed the NVswitch architecture to interconnect 32 nodes (or 256 GPUs). Due to NVLink being specifically designed as a high-speed point-to-point interconnect for GPUs, it offers superior performance and lower overhead compared to traditional networking solutions.


NVswitch GPU

GPU architecture using NVSwitch systems


The third-generation NVswitch supports 64 NVLink ports with a switching capacity of 12.8Tbps. It also enables multicast and network aggregation.


Through network aggregation, gradients generated by all working GPUs are aggregated within the NVswitches, and the updated gradients are sent back to the GPUs to initiate the next iteration. This can potentially reduce the inter-GPU data traffic between training iterations.


Nvidia claims that training the GPT-3 model using the NVswitch architecture can be twice as fast as using an InfiniBand switch network. This performance is impressive, but the bandwidth of this switch is four times lower than what high-end switch vendors provide with their 51.2Tbps switches!


Building large-scale systems with over 1000 GPUs using NVswitches is economically infeasible, and the protocol itself may have limitations in supporting even larger-scale systems.


Additionally, Nvidia does not sell these NVswitches separately. If a data center wants to expand their existing GPU cluster by mixing GPUs from different vendors, they cannot use NVswitches because GPUs from other vendors do not support these interfaces.


3. InfiniBand(IB)Fabric

InfiniBand is a high-speed interconnect technology introduced in 1999 as a fast alternative to PCI and PCI-X bus technologies for connecting servers, storage, and networking. While its initial grand vision was scaled down due to economic factors, InfiniBand has found widespread adoption in high-performance computing, AI/ML clusters, and data centers due to its impressive speed, low latency, lossless transmission, and Remote Direct Memory Access (RDMA) capabilities.


The InfiniBand (IB) protocol is designed to be efficient and lightweight, avoiding the overhead commonly found in Ethernet protocols. It supports both channel-based and memory-based communication, enabling efficient handling of various data transfer scenarios.


IB achieves lossless transmission through credit-based flow control between send/receive devices at the queue or virtual channel level. This hop-by-hop flow control ensures that data loss is prevented due to buffer overflow. Additionally, it supports congestion notification between endpoints (similar to Explicit Congestion Notification in the TCP/IP protocol stack). It provides excellent quality of service, allowing certain types of traffic to be prioritized to reduce latency and prevent packet loss.


Furthermore, all IB switches support the RDMA protocol, allowing direct data transfers from the memory of one GPU to another GPU without involving the CPU and the operating system. This direct transfer improves throughput and significantly reduces end-to-end latency.


Despite these advantages, InfiniBand switch systems are not as popular as Ethernet switch systems because they are more challenging to configure, maintain, and scale. The control plane of InfiniBand is typically centrally controlled through a single subnet manager. It can work well in small clusters, but scalability may become a challenge for networks with 32K or more GPUs. IB networks also require specialized hardware such as host channel adapters and InfiniBand cables, making the cost of scalability higher than Ethernet networks.


Currently, Nvidia is the only vendor providing high-end IB switches for HPC and AI GPU clusters. OpenAI trained their GPT-3 model on Microsoft Azure cloud using 10,000 Nvidia A100 GPUs and an IB switch network. Meta recently built a 16K GPU cluster using Nvidia A100 GPU servers and Quantum-2 IB switches with a switching capacity of 25.6Tbps and 400Gbps ports. This cluster was used to train their generative AI models, including LLaMA. It's important to note that when connecting more than 10,000 GPUs, the switching between GPUs within the servers is done through NVswitches inside the servers, while IB/Ethernet networks connect the servers together.



Conceptual diagram of GPU structure, with 128 GPUs connected (not all connections shown). The GPU-leaf node connection rate is 400Gbps, and the spine connection rate is 800Gbps.


To meet the training demands of models with larger parameter sizes, major cloud service providers are looking to build GPU clusters with 32K or even 64K GPUs.


4. Ethernet Fabric

At this scale, using Ethernet networks makes more economic sense because Ethernet has already established a strong ecosystem among silicon/system and optical module vendors. It has achieved interoperability between vendors as a goal through open standards.


Ethernet is ubiquitous, with varying speeds ranging from 1Gbps to 800Gbps, and is expected to reach 1.6Tbps in the future. Its usage differs across different environments, from data centers to backbone networks.


In terms of interconnect port speed and overall switching capacity, InfiniBand lags behind Ethernet. Ethernet switches are more competitively priced compared to their InfiniBand counterparts, offering a more cost-effective price per unit of bandwidth. Due to intense competition among high-end network chip vendors, manufacturers are packaging more bandwidth into ASICs, resulting in better cost per gigabit.


High-end Ethernet switch ASICs from major vendors can provide a switching capacity of 51.2Tbps with 800Gbps ports, offering twice the performance of Quantum-2. If the throughput of each switch is doubled, building a GPU network with a certain number of GPUs would require only half the number of switches.


  • Lossless transmission:Ethernet can also provide lossless networks with Priority Flow Control (PFC). PFC allows for eight service classes, each supporting traffic control. Some of these can be designated as lossless classes with PFC enabled. Lossless traffic takes priority during processing and forwarding within switches over lossy traffic. During congestion, switches/network adapters can control upstream devices through traffic control rather than dropping packets.


  • RDMA support: Ethernet can also support RDMA through RoCEv2 (RDMA over Converged Ethernet), where RDMA frames are encapsulated within IP/UDP. When RoCEv2 packets destined for the GPU reach the network adapter (NIC) in a GPU server, the NIC directly transfers the RDMA data to the GPU's memory, bypassing the CPU. Additionally, powerful end-to-end congestion control schemes like DCQCN can be deployed to reduce congestion and packet loss for RDMA.


  • Enhanced load balancing:Routing protocols such as BGP use Equal Cost Multipath (ECMP) to distribute packets across multiple paths with equal "cost" when there are multiple paths with equal cost between the source and destination. The cost can be represented simply as the number of hops. The goal of ECMP is to distribute network traffic to improve link utilization and prevent congestion. When a packet arrives at a switch with multiple equivalent paths to the destination, the switch uses a hash function to determine which path to send the packet. This hash function can use source IP address, destination IP address, source port, destination port, and protocol fields to map packets to flows. However, hashing is not always perfect and can result in certain links being over-subscribed.


For example, in the diagram below, assuming the unicast traffic pattern is G0->G19, G9->G2, and G18->G11. Ideally, there should be no congestion in the network as the bandwidth in the leaf/spine switches is sufficient to support this traffic pattern. However, due to insufficient information (hash collisions), all flows may choose spineswitch0. When this happens, the output port of that switch becomes oversubscribed, with 1200Gbps of traffic trying to go out through an 800Gbps port.



In scenarios like this, end-to-end congestion schemes such as ECN/DCQCN can effectively restrict the sender's traffic when congestion occurs within switches (although transient congestion may still occur before the sender limits its traffic). There are also other methods to reduce congestion:


  1. Bandwidth reservation: Reserving a slight excess bandwidth between spine/leaf switches.


  1. Adaptive load balancing: When multiple paths are available to the destination, if one path becomes congested, switches can route packets of new flows to other ports until the congestion is resolved. Switch hardware monitors egress queue depths and queuing rates and periodically sends this information back to the load balancer in upstream switches. Many switches already support this feature.


  1. Packet-level load balancing: Packet-level load balancing in RoCEv2 can evenly distribute these packets across all available links to maintain link balance. By doing so, packets may arrive at the destination out of order. However, the network adapter can resequence any out-of-order data at the RoCE transport layer, transparently delivering ordered data to the GPU. This requires additional hardware support in the network adapter and Ethernet switch.


In addition to the mentioned features, network aggregation within the network can aggregate gradients from GPUs within switches, helping to reduce inter-GPU traffic during the training process. Nvidia supports this feature in their high-end Ethernet switches and provides integrated software support.


Therefore, high-end Ethernet switches/network adapters have powerful congestion control/load balancing capabilities and RDMA support. They can scale to larger designs than IB switches. Some cloud service providers and companies with ultra-large-scale clusters have already begun building Ethernet-based GPU networks to connect more than 32K GPUs.


5. Summary & Future Outlook

Developing and training large language models (LLMs) requires a highly specialized team of AI/ML researchers/engineers, data scientists, and significant investment in cloud computing resources. Companies lacking extensive expertise in machine learning are unlikely to undertake this challenge independently. Instead, more companies will seek to fine-tune commercially available pretrained models with their proprietary datasets. Cloud service providers may offer these services to enterprises.


Therefore, the workload for model training is expected to primarily come from academic institutions, cloud service providers, and AI research labs.


Contrary to expectations, the training workload is not expected to decrease or remain constant in the coming years. To generate accurate and up-to-date results rather than "hallucinations," models need to undergo frequent training. This will greatly increase the training workload.


Ethernet fabric may be a good choice for building large-scale training GPU clusters. All existing high-end Ethernet switches (modular/standalone) are currently prepared for the challenges of such large clusters.


By implementing additional enhancements such as enabling reordering of RoCEv2 packets and packet-level spraying, in-network aggregation, and cut-through support, these switches can achieve impressive performance surpassing what IB (fabric) can provide. However, IB (fabric) will continue to be used until these large Ethernet fabric clusters are deployed and widely adopted.


The approach of VOQ (fabric), a distributed switch, appears promising and provides another potential option for the solution!


The performance/scale of GPUs/network switches increases approximately twofold every two years. If models continue to grow at the rate of doubling or tripling with each new version, they will soon encounter hardware limitations. The AI community must heavily invest in research to make LLMs more optimized, environmentally friendly, and sustainable.