Optimizing Large-scale GPU Clusters: Strategies for Performance and Scalability

NADDOD Gavin, InfiniBand Network Engineer | Jul 11, 2023

Connecting more GPUs through a network forms a GPU resource pool that can meet the needs of large-scale, distributed AI training. We will therefore discuss three goals: high performance, large scale, and high availability.

High Performance

Most of our AI training tasks are synchronous training, which means that after each batch, all GPUs need to wait for data synchronization before starting the next batch.

Total Batch Duration

From this graph, it can be seen that the communication time largely depends on the slowest link in the entire communication process. Unlike within a single machine, the network bandwidth between machines is much lower than a machine's internal bandwidth. For example, NVLink 3.0 provides up to 600GB/s of bandwidth within a machine, while the latest InfiniBand NDR for inter-machine communication runs at 400Gb/s, and HDR at 200Gb/s. Note the units: one is GB (gigabytes) and the other is Gb (gigabits), a factor-of-eight difference: 400Gb/s is only 50GB/s.
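The unit mismatch is easy to miss, so a quick back-of-envelope conversion (plain Python, using the numbers quoted above) makes the gap concrete:

```python
# NVLink is quoted in GB/s (gigabytes), InfiniBand in Gb/s (gigabits).
# Converting both to GB/s makes the comparison apples-to-apples.
nvlink3_gbytes = 600            # NVLink 3.0 aggregate bandwidth, GB/s
ndr_gbits = 400                 # InfiniBand NDR link speed, Gb/s
hdr_gbits = 200                 # InfiniBand HDR link speed, Gb/s

ndr_gbytes = ndr_gbits / 8      # 50 GB/s
hdr_gbytes = hdr_gbits / 8      # 25 GB/s

print(f"NVLink 3.0 : {nvlink3_gbytes} GB/s")
print(f"NDR        : {ndr_gbytes:.0f} GB/s ({nvlink3_gbytes / ndr_gbytes:.0f}x lower)")
print(f"HDR        : {hdr_gbytes:.0f} GB/s ({nvlink3_gbytes / hdr_gbytes:.0f}x lower)")
```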

Another factor is complexity. The GPU interconnection within a single machine is relatively straightforward, as it can achieve a fully connected structure through NVLink. However, when it comes to network communication across servers, especially at a large scale, the network structure becomes more complex and requires higher hardware requirements.

Therefore, for performance optimization, we can consider the following approaches:

Improve the network access bandwidth of a single machine

Optimization Method 1


Increase the speed. Advances in network card speeds are a great advantage for AI: from 40G to the latest 400G, data transmission bandwidth bottlenecks have been progressively broken. Note that network cards do not exist in isolation; they require coordination with the server and network architecture, including support from the PCIe bus and the switch SerDes. Please refer to the table below for specifics.

NIC Support
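As a rough illustration of the PCIe coordination mentioned above, the sketch below estimates which PCIe generation a x16 slot needs for a given NIC speed. The per-lane throughput figures are approximate usable rates after encoding overhead, an assumption for illustration rather than exact values from the table:

```python
# Approximate usable throughput per PCIe lane (GB/s), by generation.
# These are assumed round numbers, not spec-exact figures.
PCIE_GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

def pcie_x16_gbytes(gen):
    """Usable bandwidth of a x16 slot for a given PCIe generation."""
    return PCIE_GBPS_PER_LANE[gen] * 16

def min_gen_for_nic(nic_gbits):
    """Smallest PCIe generation whose x16 slot can feed a NIC at line rate."""
    need = nic_gbits / 8  # NIC line rate converted to GB/s
    for gen in sorted(PCIE_GBPS_PER_LANE):
        if pcie_x16_gbytes(gen) >= need:
            return gen
    return None

for speed in (100, 200, 400):
    print(f"{speed}G NIC -> PCIe Gen{min_gen_for_nic(speed)} x16")
```

This is why a 400G NIC only makes sense in a server generation whose PCIe bus can actually feed it.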

Optimization Method 2

Increase the quantity. In the early stages, GPU servers usually had only a few network cards that were shared between the CPU and GPU. Nowadays, due to increased communication demands, we can allocate separate 1-2 network cards for the CPU and additional 4-8 network cards for the GPU. This allows for a significant increase in the network access bandwidth of a single machine. It is important to note that increasing the number of network cards also requires support from PCIe and network switches.

GPU Server
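Taking the upper end of the layout described above (2 NICs for the CPU, 8 for the GPUs, all at 400G), the per-server bandwidth adds up quickly:

```python
# Assumed per-server NIC layout from the paragraph above.
cpu_nics, gpu_nics, nic_gbits = 2, 8, 400

gpu_bw = gpu_nics * nic_gbits                 # bandwidth dedicated to GPU traffic
total_bw = (cpu_nics + gpu_nics) * nic_gbits  # total access bandwidth per server

print(f"GPU network bandwidth per server: {gpu_bw / 1000} Tb/s")
print(f"Total per server: {total_bw / 1000} Tb/s")
```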

Utilize RDMA Networking

Once the bandwidth is ensured, the next step is to consider how to fully utilize it. Traditional TCP networks heavily rely on the CPU processing of packets, making it nearly impossible to fully utilize all available bandwidth. Therefore, in an AI environment that involves large-scale cluster training, RDMA (Remote Direct Memory Access) is almost a necessary network transmission technology.

RDMA is primarily used in two scenarios for large-scale GPU training: inter-GPU communication between multiple machines and communication between GPUs and storage systems.

The corresponding optimization technologies are GPUDirect RDMA and GPUDirect Storage, respectively.
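As one concrete, hedged example of putting this into practice: NCCL-based training jobs are commonly steered toward the RDMA NICs and GPUDirect RDMA via environment variables. The device names below are site-specific assumptions; check `ibstat` on your own hosts:

```python
import os

# Assumed NIC device names (mlx5_*); these vary per cluster.
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1,mlx5_2,mlx5_3"
# Allow GPUDirect RDMA when the GPU and the NIC sit under the same
# PCIe bridge ("PXB"); other accepted levels include PIX, PHB, SYS.
os.environ["NCCL_NET_GDR_LEVEL"] = "PXB"
```

These variables only take effect if set before NCCL initializes its communicators.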

Reduce Network Congestion

As the cluster grows larger, the network structure becomes more complex, and congestion issues become unavoidable. There may be multiple equivalent paths from the source to the destination, and switches need to use hashing algorithms to map different connections to different paths. However, if there is a hash collision, it can impact the actual data throughput. For example, if two 100G connections are mapped to the same link, each connection can only achieve a maximum of 50G bandwidth, resulting in reduced communication efficiency.
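A toy Monte-Carlo sketch (hypothetical flow counts, not measured data) shows how likely such collisions are when flows are hashed independently onto equal-cost uplinks:

```python
import random

def simulate(flows, links, trials=10000, seed=0):
    """Fraction of trials in which at least two flows hash onto one link."""
    rng = random.Random(seed)
    collided = 0
    for _ in range(trials):
        used = [rng.randrange(links) for _ in range(flows)]
        if len(set(used)) < flows:   # some link carries more than one flow
            collided += 1
    return collided / trials

# e.g. 8 flows over 8 uplinks: collisions are the norm, not the exception
print(f"P(collision, 8 flows on 8 links) ~ {simulate(8, 8):.2f}")
```

This is the birthday-paradox effect: even with as many uplinks as flows, a collision-free hash assignment is very unlikely.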

Optimization Method 1

Ensure a 1:1 convergence ratio. This means that the number of uplink ports on the switch should equal the number of downlink ports, which minimizes the probability of hash collisions.

ToR Switch

How, then, should the switch's downlink ports be connected to the network cards? As a leading provider of high-speed data transmission solutions, NADDOD offers InfiniBand HDR direct attach and breakout cables (AOC and DAC) and optical transceivers to meet low-latency, high-speed data transmission requirements. More NDR products will be launched soon, so stay tuned.

Optimization Method 2

Separate the networks. Completely segregate the traffic between the CPU and GPU, allowing them to operate independently and reducing mutual interference. Moreover, after the separation, it becomes more convenient to optimize the traffic for each of them.

CPU & GPU Networking

Communication Algorithm Optimization

Currently, the most commonly used GPU communication algorithm is Ring-Allreduce. Its basic idea is to arrange the GPUs in a logical ring, with data flowing around the ring: each GPU has a left neighbor and a right neighbor, and it only sends data to its right neighbor and receives data from its left neighbor.

The algorithm consists of two steps: scatter-reduce and all-gather. In the scatter-reduce step, the GPUs exchange data to obtain a block of the final result for each GPU. In the all-gather step, the GPUs exchange these blocks so that all GPUs obtain the complete final result.
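The two steps can be sketched as a single-process simulation. "GPUs" here are just nested lists, and each round's sends are applied against a snapshot so that all transfers in a round happen "simultaneously":

```python
def ring_allreduce(per_gpu):
    """per_gpu: list of n vectors (one per GPU), each split into n chunks."""
    n = len(per_gpu)
    # data[i][c] = GPU i's current copy of chunk c
    data = [[list(chunk) for chunk in gpu] for gpu in per_gpu]

    # Scatter-reduce: n-1 rounds; at round s, GPU i sends chunk (i - s) % n
    # to its right neighbor, which adds it to its own copy.
    for s in range(n - 1):
        snap = [[chunk[:] for chunk in gpu] for gpu in data]
        for i in range(n):
            c, right = (i - s) % n, (i + 1) % n
            data[right][c] = [a + b for a, b in zip(snap[right][c], snap[i][c])]
    # Now GPU i holds the fully reduced chunk (i + 1) % n.

    # All-gather: n-1 more rounds; each GPU forwards its freshest chunk
    # to the right neighbor, which overwrites its stale copy.
    for s in range(n - 1):
        snap = [[chunk[:] for chunk in gpu] for gpu in data]
        for i in range(n):
            c, right = (i + 1 - s) % n, (i + 1) % n
            data[right][c] = snap[i][c][:]
    return data
```

For three GPUs holding [1, 2, 3], [4, 5, 6], and [7, 8, 9] (one element per chunk), every GPU ends with the elementwise sum [12, 15, 18], after 2(n-1) rounds in which each GPU only ever sends one chunk per round.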


Now, when the GPU servers expand from a single machine to multiple machines, the ring would look like this:
Level-1 Ring

In the early stages, there was no NVLink within a single machine, and RDMA was not available on the network. The bandwidth was relatively low, and there wasn't much difference between distributed computing within a single machine and distributed computing across multiple machines in terms of bandwidth. Therefore, creating a large ring was sufficient.
Level-2 Ring

However, now that we have NVLink within a single machine, using the same method is not appropriate. The network bandwidth is much lower than NVLink, so using a single large ring would significantly reduce the high bandwidth of NVLink to the level of the network. Additionally, with the availability of multiple network cards, using a single ring cannot fully utilize the advantages of multiple network cards.

Therefore, in such a scenario, it is recommended to use a two-level ring: first, utilize the high bandwidth advantage of NVLink within a single machine to synchronize data between the GPUs; then, establish multiple rings among the GPUs across multiple machines using multiple network cards to synchronize different segments of data; finally, the GPUs within a single machine synchronize once more to achieve complete data synchronization among all GPUs.
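The three phases above can be sketched in pure Python. This is a data-flow illustration under simplifying assumptions, not an NCCL implementation; phase 2 stands in for the per-NIC multi-ring step:

```python
def two_level_allreduce(cluster):
    """cluster[node][gpu] = that GPU's local vector."""
    # Phase 1: reduce within each node over the fast NVLink path.
    node_sums = [
        [sum(vals) for vals in zip(*node)]   # elementwise sum across the node's GPUs
        for node in cluster
    ]
    # Phase 2: allreduce the node sums across nodes over the network.
    # In a real system this is the multi-ring step, one ring per NIC,
    # each ring carrying a different segment; here we simply sum.
    total = [sum(vals) for vals in zip(*node_sums)]
    # Phase 3: broadcast the result back to every GPU within each node.
    return [[total[:] for _ in node] for node in cluster]
```

The point of the split is that the slow inter-node phase only moves one reduced copy per node (spread over several NICs), instead of dragging every GPU-to-GPU hop down to network speed.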


Large Scale

The high cost of computing power is a common concern. Because GPU resources are scarce, it pays to consolidate as many of them as possible into a unified resource pool. This enables flexible task scheduling, shortens AI task queues, minimizes resource fragmentation, and improves GPU utilization.

To build a large-scale GPU cluster, the network architecture also needs to be optimized:

Scalable Network Architecture

The traditional three-tier network architecture, consisting of “access-aggregation-core,” is an effective model for north-south traffic in data centers. However, it also has some issues, such as bandwidth waste caused by spanning tree protocol, high latency due to long data links, and scalability challenges caused by centralized core devices.

Therefore, the "Spine-Leaf" network architecture, which is better suited to east-west traffic in data centers, has emerged. The flat Spine-Leaf architecture is derived from the Clos network, where switches are interconnected in stages to achieve high-performance, non-blocking large-scale networks.

In a large-scale GPU cluster, the following architecture is commonly used:

Top-of-Rack (ToR) switches are directly connected to GPU servers, and Leaf switches are used to interconnect the ToR switches. A group of ToR switches and a group of Leaf switches can achieve a non-blocking full-mesh connection, forming a pod in the network.

Spine switches are used to connect different pods. Spine switches are similar to core switches in the three-tier architecture but with some changes: high-density, high-throughput layer 3 switches replace large chassis switches. The network load is distributed among many spine switches, rather than concentrated on a central core switch.

With this network architecture, the cluster can scale horizontally to tens of thousands of network interfaces.
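The "tens of thousands" figure follows from standard Clos/fat-tree arithmetic: with k-port switches and a 1:1 convergence ratio at every tier, a three-tier design (ToR/Leaf pods plus Spine) supports k³/4 end hosts:

```python
def fat_tree_hosts(k):
    """Maximum end hosts in a non-blocking three-tier fat-tree of k-port switches."""
    return k ** 3 // 4

for k in (32, 40, 64):
    print(f"{k}-port switches -> up to {fat_tree_hosts(k)} network interfaces")
```

With 64-port switches this already exceeds 65,000 interfaces, which is why the pod-based Spine-Leaf design scales so far without chassis core switches.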

Scalable Network Architecture

The ToR switches keep a 1:1 convergence ratio, since the full-mesh connections inside a pod are fixed by the design. What can be adjusted are the convergence ratio of the Leaf switches and the number of Spine switches. The Leaf convergence ratio determines the communication bandwidth between pods. If the scheduling system is topology-aware and can place related tasks within the same pod, a 1:1 convergence ratio at the Leaf layer is not strictly necessary.
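The trade-off reads directly off the arithmetic. The pod size and port speed below are illustrative assumptions, not figures from a specific deployment:

```python
def inter_pod_bandwidth(pod_server_gbits, leaf_ratio):
    """Aggregate bandwidth out of a pod, given the Leaf downlink:uplink ratio."""
    return pod_server_gbits / leaf_ratio

pod_bw = 512 * 400   # assumption: a pod with 512 server ports at 400G each
for ratio in (1.0, 2.0, 4.0):
    bw = inter_pod_bandwidth(pod_bw, ratio)
    print(f"{ratio:.0f}:1 Leaf ratio -> {bw / 1000:.1f} Tb/s between pods")
```

A 2:1 or 4:1 Leaf ratio halves or quarters the inter-pod bandwidth, which is acceptable only if the scheduler keeps most traffic inside a pod.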

Compute and Storage Separation

As mentioned earlier, compute and storage separation is also important. The performance of storage communication greatly impacts the overall performance of the GPU cluster. Therefore, after separating the networks, storage nodes will follow the aforementioned network architecture and form their storage pods. For the scheduling system, the GPU servers receiving tasks and the storage nodes containing the required datasets may be located in different pods. Hence, when considering the network, we need to take into account the storage’s cross-pod data retrieval. This effectively increases the scheduling flexibility of the resource pool, enabling the realization of a large-scale cluster.

Compute and Storage Separation

High Availability

In GPU clusters, availability issues are often downplayed because large-scale distributed AI tasks are primarily offline training jobs that do not directly impact the core business during a network interruption. Even so, they deserve attention: AI training can run for a long time, and without intermediate checkpointing, a network interruption discards all the progress made up to that point, wasting the GPU resources already consumed.

We should also be mindful of the high sensitivity of AI training tasks to network topology. An interruption in one part of the network can lead to network asymmetry in other nodes, significantly increasing the complexity of higher-level processing. Therefore, when designing the cluster, it is crucial to consider a network architecture that tolerates interruptions.

Available Network Architecture

For the network of storage nodes, if a storage node goes offline due to a network interruption, it triggers substantial data recovery traffic within the network, increasing the network load. Therefore, it is advisable to adopt a dual uplink design to ensure that the availability of storage nodes is not affected by the interruption of a single switch or uplink link.

For the computing network, given the unique nature of AI training and considering performance and cost, a dual-uplink design is not adopted for now. Instead, all the network cards handling GPU communication within one GPU server are connected to the same switch. As noted above, any NIC or link interruption on a GPU server causes network asymmetry that affects the whole server, so it is better for all of its network cards to share one switch. A derived benefit is that a ToR switch failure then impacts fewer GPU servers, improving overall system availability. However, connecting all GPU communication NICs to the same switch requires routing configuration at the system level to ensure that all network cards are used simultaneously and effectively.


High performance and large scale are the primary considerations when addressing GPU cluster requirements, with availability being of secondary importance. GPU resources are not isolated entities; they require the full cooperation of cluster networks, communication algorithms, and scheduling systems to fully unleash their powerful capabilities and achieve efficient utilization of costly resources.