Optimizing AI GPU Clusters: Network Architecture and Scale

NADDOD Gavin InfiniBand Network Engineer Mar 27, 2024

In the previous article, "Quick Understanding GPU Server Network Card Configuration in AI Era", we discussed the network card configuration of a single GPU server. In this article, we turn to GPU cluster network architecture (GPU cluster fabrics) and the resulting cluster scale.

 

The most commonly used GPU cluster network topology in practice is the Fat-Tree non-blocking network architecture, which is favored for its scalability, simple routing, ease of management and maintenance, robustness, and relatively low cost. In practice, smaller-scale GPU cluster computing networks typically adopt a two-tier architecture (Leaf-Spine), while larger-scale GPU clusters utilize a three-tier architecture (Leaf-Spine-Core). Here, "Leaf" corresponds to the access layer, "Spine" to the aggregation layer, and "Core" to the core layer.

 

Figure: Three-tier Fat-Tree topology

 

Assuming the same switches are used in the computing network of a GPU cluster, with each switch having P ports, a two-tier Fat-Tree non-blocking computing network (Leaf-Spine) can accommodate a maximum of P*P/2 GPUs in a GPU cluster.

 

In a two-tier Fat-Tree non-blocking computing network (Leaf-Spine), in the first tier, each Leaf switch uses P/2 ports to connect to GPU cards and the remaining P/2 ports to connect upwards to Spine switches (a non-blocking network requires an equal number of connections downwards and upwards). In the second tier, each Spine switch also has P ports and can connect to a maximum of P Leaf switches downwards. Thus, in a two-tier Fat-Tree non-blocking computing network, there can be a maximum of P Leaf switches, resulting in a total of P*P/2 GPU cards. Since there are P Leaf switches and each Leaf switch has P/2 ports connecting upwards to Spine switches, there are P/2 Spine switches.
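
The sizing rule above can be checked with a few lines of Python. This is only an illustrative sketch; the function name and output format are our own, not part of any vendor tool:

```python
def two_tier_fat_tree(p: int) -> dict:
    """Size a two-tier (Leaf-Spine) non-blocking Fat-Tree built from P-port switches."""
    gpus_per_leaf = p // 2        # half of each Leaf's ports go down to GPUs
    leaf_switches = p             # each Spine switch can reach at most P Leaf switches
    spine_switches = p // 2       # each Leaf has P/2 uplinks, one per Spine switch
    max_gpus = leaf_switches * gpus_per_leaf   # P * P/2
    return {"leaf_switches": leaf_switches,
            "spine_switches": spine_switches,
            "max_gpus": max_gpus}

# 40-port switches (e.g. QM8700): 800 GPUs; 128-port switches: 8192 GPUs
print(two_tier_fat_tree(40))   # {'leaf_switches': 40, 'spine_switches': 20, 'max_gpus': 800}
print(two_tier_fat_tree(128))  # {'leaf_switches': 128, 'spine_switches': 64, 'max_gpus': 8192}
```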

 

For example, for an Nvidia A100 cluster, assuming the use of a 40-port switch (such as Nvidia Mellanox QM8700), in a two-tier Fat-Tree computing network scenario, a maximum of 800 A100 cards can be accommodated (40*40/2 = 800).

 

Figure: Two-tier Fat-Tree topology

 

It is worth noting that if high-speed interconnection between cards within a GPU server is already present (such as NVLink and NVSwitch), the GPU cards within the same server should not be connected to the same Leaf switch. GPU cards with the same number across different servers (for example, card 3 in server A and card 3 in server B) should preferably be connected to the same Leaf switch to improve distributed computing efficiency (for example, enhancing the efficiency of cross-server AllReduce operations).
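
To make this cabling rule concrete, the sketch below assigns each GPU to a Leaf switch purely by its index within the server (so-called rail-optimized connectivity). The function and its simplifying assumption (one Leaf switch per rail within a scalable unit) are ours, for illustration only:

```python
def leaf_for_gpu(server_id: int, gpu_index: int) -> int:
    """Rail-optimized assignment: GPU k of every server connects to Leaf switch k.

    Cross-server traffic between GPUs with the same index (e.g. card 3 of
    server A and card 3 of server B during AllReduce) then stays on a single
    Leaf switch instead of crossing the Spine layer. Intra-server traffic is
    assumed to use NVLink/NVSwitch rather than the network.
    """
    return gpu_index  # the Leaf ("rail") is chosen by GPU index, not by server

# Card 3 of server 0 and card 3 of server 1 land on the same Leaf switch:
assert leaf_for_gpu(0, 3) == leaf_for_gpu(1, 3) == 3
```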

 

It should be particularly noted that for GPU servers without inter-card high-speed interconnection solutions (for example, L20 servers, L40S servers), efforts should be made to connect the GPU cards within the same server to the same Leaf switch to avoid cross-NUMA communication.

 

From the analysis above, it can be seen that assuming the use of a 128-port switch, a two-tier Fat-Tree non-blocking computing network can accommodate a maximum of 8192 GPUs (128*128/2 = 8192). If a larger-scale GPU cluster needs to be constructed, it is necessary to expand from a two-tier to a three-tier computing network.

 

For larger-scale GPU clusters, a three-tier computing network architecture is generally required. Assuming the same switches are used in the computing network of a GPU cluster, with each switch having P ports, a three-tier Fat-Tree non-blocking computing network (Leaf-Spine-Core) can accommodate a maximum of P*P*P/4 GPUs in a GPU cluster.

 

Expanding from a two-tier Fat-Tree network to a three-tier Fat-Tree network, we can treat each two-tier Fat-Tree network as a unit (i.e., a two-tier Fat-Tree sub-network). Because each Spine switch now has half of its ports connected downwards to Leaf switches (each Spine switch can connect to a maximum of P/2 Leaf switches) and the other half connected upwards to Core switches, each two-tier Fat-Tree sub-network can have only P/2 Leaf switches. In a non-blocking network, the number of connections at each layer must remain the same, so within each sub-network the number of Spine switches equals the number of Leaf switches (P/2 each).

 

Figure: H800 cluster computing network

 

Since the Core switch also has P ports and can connect P such two-tier Fat-Tree sub-networks, there are P*P/2 Leaf switches and P*P/2 Spine switches in a three-tier Fat-Tree non-blocking computing network (Leaf-Spine-Core). Therefore, the total number of GPU cards is at most (P/2)*(P*P/2), which is P*P*P/4. The number of connections from Spine switches to Core switches is also P*P*P/4, resulting in a total of P*P/4 Core switches.
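
Putting this three-tier bookkeeping together, a minimal Python sketch (again, the function name and structure are our own) reproduces the counts derived above and the figures in the table below:

```python
def three_tier_fat_tree(p: int) -> dict:
    """Size a three-tier (Leaf-Spine-Core) non-blocking Fat-Tree built from P-port switches."""
    sub_networks = p                       # each Core port reaches a different two-tier sub-network
    leaves_per_sub = p // 2                # each Spine keeps only P/2 ports for Leaf switches
    spines_per_sub = p // 2                # equal to the Leaf count within a sub-network
    leaf_switches = sub_networks * leaves_per_sub    # P * P/2
    spine_switches = sub_networks * spines_per_sub   # P * P/2
    max_gpus = leaf_switches * (p // 2)              # (P*P/2) * (P/2) = P*P*P/4
    core_switches = spine_switches * (p // 2) // p   # P*P*P/4 uplinks / P ports = P*P/4
    return {"leaf_switches": leaf_switches,
            "spine_switches": spine_switches,
            "core_switches": core_switches,
            "max_gpus": max_gpus}

# 40-port switches -> 16000 GPUs; 64-port switches -> 65536 GPUs
print(three_tier_fat_tree(40))  # {'leaf_switches': 800, 'spine_switches': 800, 'core_switches': 400, 'max_gpus': 16000}
print(three_tier_fat_tree(64))  # {'leaf_switches': 2048, 'spine_switches': 2048, 'core_switches': 1024, 'max_gpus': 65536}
```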

 

From the analysis above, it can be seen that the scale of a GPU cluster is determined by the architecture of the computing network and the number of ports on the switches (of course, the scale of the GPU cluster is also limited by hardware factors such as cabinets, power supply, cooling, and data center space). The relationship between cluster scale and the number of switch ports is illustrated in the table below, using a three-tier Fat-Tree non-blocking network as an example.

 

If M GPUs in a server share a single network card, the maximum number of GPUs given above should be multiplied by M. For example, if two GPU cards share one network card, as in an A800 server configured with 4 x 200 GbE network cards for 8 GPU cards, then the total number of GPU cards should be multiplied by 2 (the Nvidia DGX V100 uses a similar 2:1 GPU-to-NIC configuration).
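
A hedged sketch of this adjustment follows (the function name and the 32000-GPU figure are illustrative, derived from the sharing rule above rather than from a specific deployment):

```python
def max_gpus_with_shared_nics(network_limited_count: int, gpus_per_nic: int = 1) -> int:
    """Scale the Fat-Tree limit when M GPUs share one network card (M = gpus_per_nic)."""
    return network_limited_count * gpus_per_nic

# A800 server with 8 GPUs and 4 x 200 GbE NICs: two GPUs share each NIC, so M = 2.
# A 40-port three-tier Fat-Tree that tops out at 16000 network ports could then serve 32000 GPUs.
print(max_gpus_with_shared_nics(16000, gpus_per_nic=2))  # 32000
```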

 

| Number of switch ports (Leaf, Spine, and Core switches are the same) | Maximum theoretical number of GPUs in a three-tier Fat-Tree non-blocking network (GPU cluster scale) | Examples of GPU clusters |
|---|---|---|
| 24 | 3456 | V100 Cluster |
| 32 | 8192 | V100 Cluster |
| 40 | 16000 | A100 Cluster, A800 Cluster |
| 48 | 27648 | H800 Cluster |
| 64 | 65536 | H100 Cluster, H200 Cluster, H20 Cluster |
| 80 | 128000 | H20 Cluster (estimated) |
| 128 | 524288 | H200 Cluster, H20 Cluster (estimated) |

 

From the table above, it can be seen that GPU clusters constructed based on a three-tier Fat-Tree non-blocking network architecture can meet the requirements of most large-scale model training and distributed computing, thus eliminating the need to consider a four-tier or more complex network topology.

 

In the analysis above, it was assumed that the entire GPU cluster computing network uses the same switches. If the Leaf, Spine, and Core layers use different switch models (for example, switches with different port counts at each layer), the analysis of the GPU cluster network and cluster scale becomes more complex.

 

When building GPU cluster computing networks, selecting a reliable supplier for connectivity products is crucial. As a leading provider of optical modules, AOCs, and DACs, NADDOD offers a range of high-performance, high-reliability solutions to help you establish efficient and stable GPU clusters. Our optical modules support transmission speeds of up to 800Gbps and are compatible with a variety of top-tier networking equipment brands such as Cisco, HP, Brocade, Juniper, Mellanox, Arista, and more. Additionally, we provide customized solutions to meet your specific needs.

 

In addition to optical modules, NADDOD offers AOCs and DACs in various lengths, including 0.5m, 1m, 2m, 3m, and 5m, with custom lengths available upon request. Our products undergo rigorous testing and certification to meet CE, FCC, and RoHS standards, ensuring high quality and compliance. Partnering with us, you'll benefit from flexible minimum order quantities, reliable performance, competitive pricing, and fast delivery, providing the perfect connectivity solutions for your GPU cluster. Contact us now to learn more about how NADDOD can elevate your GPU cluster computing network.