Building and Optimizing Large-Scale GPU Clusters: Insights and Innovations

NADDOD | Jason, Data Center Architect | May 31, 2024

Since OpenAI introduced ChatGPT, Large Language Models (LLMs) have quickly become a focal point of interest and development. Companies are investing heavily in LLM pre-training to keep up with this trend. Training a 100B-scale LLM typically requires extensive computational resources, such as clusters with thousands of GPUs. For instance, the Falcon series models were trained on a cluster of 4096 A100 GPUs, taking nearly 70 days to train a 180B model with 3.5T tokens. As data scales continue to grow, so does the demand for computational power. Meta, for example, used 15T tokens to train its LLaMA3 series models on two clusters of 24K H100 GPUs.
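As a rough sanity check on these figures, the common approximation of about 6 × parameters × tokens total training FLOPs, combined with the A100's widely cited dense FP16 peak (~312 TFLOPS) and an assumed ~50% model FLOPs utilization, lands close to the reported training time. The sketch below is illustrative only; the utilization figure is an assumption, not a reported number.

```python
# Rough training-time estimate using the common ~6 * N * D FLOPs rule of thumb.
# The 50% utilization (MFU) is an assumed value for illustration, not a reported figure.
params = 180e9          # Falcon-180B parameters
tokens = 3.5e12         # training tokens
gpus = 4096             # A100 GPUs in the cluster
peak_flops = 312e12     # A100 dense FP16 peak, FLOP/s
mfu = 0.5               # assumed model FLOPs utilization

total_flops = 6 * params * tokens                  # ~3.78e24 FLOPs
seconds = total_flops / (gpus * peak_flops * mfu)
print(f"Estimated training time: {seconds / 86400:.0f} days")  # ~68 days, close to "nearly 70 days"
```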

 

This article delves into the components and configurations involved in building large-scale GPU clusters, including various GPU types, server configurations, and network equipment (such as network cards, switches, and optical transceiver modules).

 

Constructing a GPU cluster with more than ten thousand GPUs is highly complex, and this article covers only a fraction of the considerations. In practice, cluster construction also involves designing the data center network topology (such as 3-Tier and Fat-Tree), as well as storage and management networks, which use similar connection methods and are not elaborated on here; a rough Fat-Tree sizing sketch follows below.
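To illustrate why topology design matters at this scale, the sketch below applies the textbook fat-tree capacity formulas: a non-blocking 2-tier leaf-spine fabric built from k-port switches supports up to k²/2 hosts, and a 3-tier fat tree up to k³/4. The 64-port radix is only an example value (e.g., a 64-port 400G switch).

```python
# Illustrative fat-tree capacity check: how many NICs a non-blocking fabric of
# k-port switches can support. Formulas are the standard leaf-spine (k^2/2)
# and 3-tier fat-tree (k^3/4) bounds; k=64 is an example radix.
def fat_tree_hosts(k: int, tiers: int) -> int:
    if tiers == 2:
        return k * k // 2
    if tiers == 3:
        return k ** 3 // 4
    raise ValueError("only 2- or 3-tier fabrics handled here")

k = 64
print(fat_tree_hosts(k, 2))   # 2048  -> a 2-tier fabric is too small for a 10K+ GPU cluster
print(fat_tree_hosts(k, 3))   # 65536 -> a 3-tier fat tree comfortably covers 10K+ GPUs
```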

 

GPU

The table below highlights the most powerful GPUs from the Ampere, Hopper, and the latest Blackwell series, showcasing improvements in memory, computing power, and NVLink capabilities.

 

From A100 to H100: FP16 dense computing power has increased more than threefold, while power consumption has only risen from 400W to 700W.

 

From H200 to B200: FP16 dense computing power has more than doubled, with power consumption increasing from 700W to 1000W.

 

B200 FP16 dense computing power: Approximately seven times that of the A100, with power consumption only 2.5 times higher.

 

Blackwell GPUs: Support FP4 precision, which offers double the throughput of FP8. Note that NVIDIA's headline comparisons often pit Blackwell FP4 against Hopper FP8, which makes the claimed acceleration look especially large.

 

It is important to note that the GB200 uses the full-spec B200 chip, while the B100 and B200 boards are correspondingly cut-down versions.
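To make the comparisons above concrete, the sketch below recomputes the stated ratios from commonly cited dense FP16 Tensor Core throughput (A100 ≈ 312 TFLOPS, H100 ≈ 989 TFLOPS, B200 ≈ 2,250 TFLOPS) and board power (400W / 700W / 1000W). Treat the TFLOPS values as approximate published figures, not guarantees.

```python
# Recompute the compute and power ratios quoted above from commonly cited
# dense FP16 Tensor Core throughput (TFLOPS) and board power (W).
# The figures are approximate published specs, not measured values.
specs = {
    "A100": {"fp16_tflops": 312,  "power_w": 400},
    "H100": {"fp16_tflops": 989,  "power_w": 700},
    "B200": {"fp16_tflops": 2250, "power_w": 1000},
}

a100, h100, b200 = specs["A100"], specs["H100"], specs["B200"]
print(f"H100 / A100 compute: {h100['fp16_tflops'] / a100['fp16_tflops']:.1f}x")  # ~3.2x
print(f"B200 / A100 compute: {b200['fp16_tflops'] / a100['fp16_tflops']:.1f}x")  # ~7.2x
print(f"B200 / A100 power:   {b200['power_w'] / a100['power_w']:.1f}x")          # 2.5x
```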

 

 

HGX

HGX is NVIDIA's high-performance server platform, typically carrying 8 or 4 GPUs per machine paired with Intel or AMD CPUs, with NVLink and NVSwitch providing full GPU-to-GPU interconnection (8 GPUs is the NVLink full-interconnect limit outside of NVL and SuperPod configurations). These servers are generally air-cooled.

 

From HGX A100 to HGX H100 and HGX H200: FP16 dense computing power has increased by 3.3 times, while power consumption is less than double.

 

From HGX H100 and HGX H200 to HGX B100 and HGX B200: FP16 dense computing power has roughly doubled, while power consumption rises by at most about 50%.

It should be noted that the networking on HGX B100 and HGX B200 has not been significantly upgraded: the back-end InfiniBand network cards are still 8x400Gb/s, as the sketch below illustrates.
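For a sense of scale, the sketch below converts the 8x400Gb/s back-end NIC configuration into aggregate per-node bandwidth; the byte conversion is simply 8 bits per byte and ignores protocol overhead.

```python
# Aggregate back-end (scale-out) network bandwidth of one HGX node with
# eight 400Gb/s InfiniBand NICs. Ignores encoding/protocol overhead.
nics_per_node = 8
nic_rate_gbps = 400

total_gbps = nics_per_node * nic_rate_gbps      # 3200 Gb/s per node
total_gbytes = total_gbps / 8                   # ~400 GB/s per node, unidirectional
print(f"{total_gbps} Gb/s (~{total_gbytes:.0f} GB/s) of back-end bandwidth per node")
```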

 


 

NVIDIA DGX and HGX are two high-performance solutions designed for deep learning, artificial intelligence, and large-scale computing needs, but they differ in design and target applications:

 

DGX: Aimed at end users who want a turnkey system, offering a plug-and-play high-performance solution with comprehensive software support, including NVIDIA's deep learning software stack, drivers, and tools. These systems are usually pre-built and closed (fixed configurations).

 

HGX: Mainly for cloud service providers and large-scale data center operators, suitable for building custom high-performance solutions. It offers a modular design, allowing customers to customize hardware based on their needs, typically provided as a hardware platform or reference architecture.

 

 

Networking

Network Cards

Here we primarily introduce Mellanox's ConnectX-5/6/7/8 high-speed network cards, which support both Ethernet and InfiniBand (IB). ConnectX-5 was released in 2016; after NVIDIA acquired Mellanox in 2019, it released ConnectX-6 in 2020 and ConnectX-7 in 2022, and announced ConnectX-8 at the 2024 GTC conference (detailed specifications have yet to be published). The brief configuration of the cards is shown below: each generation roughly doubles the total bandwidth, and the following generation is expected to reach 1.6Tbps.
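The doubling trend is easy to see from the commonly cited maximum port speeds per generation (ConnectX-5: 100Gb/s, ConnectX-6: 200Gb/s, ConnectX-7: 400Gb/s, ConnectX-8: 800Gb/s). The small sketch below just tabulates those nominal per-card maximums.

```python
# Nominal maximum bandwidth per ConnectX generation (Gb/s), using commonly
# cited per-card figures; each generation roughly doubles the previous one.
connectx_max_gbps = {
    "ConnectX-5": 100,
    "ConnectX-6": 200,
    "ConnectX-7": 400,
    "ConnectX-8": 800,
}
gens = list(connectx_max_gbps)
for prev, curr in zip(gens, gens[1:]):
    ratio = connectx_max_gbps[curr] / connectx_max_gbps[prev]
    print(f"{prev} -> {curr}: {ratio:.0f}x")
# The next step in this progression would be 1.6 Tb/s.
```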

 

 

Switches

NVIDIA provides switches for both Ethernet and InfiniBand, often with dozens or hundreds of ports. The total throughput (bidirectional switching capacity) is the per-port bandwidth multiplied by the number of ports, multiplied by two to count both directions.
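For example, applying this formula to a few of the switches listed below reproduces their quoted switching capacities; the sketch is just that arithmetic.

```python
# Bidirectional switching capacity = port speed x port count x 2 (both directions).
def switching_capacity_tbps(port_gbps: int, ports: int) -> float:
    return port_gbps * ports * 2 / 1000  # convert Gb/s to Tb/s

print(switching_capacity_tbps(800, 64))   # SN5600:  102.4 Tbps
print(switching_capacity_tbps(400, 64))   # QM9700:   51.2 Tbps
print(switching_capacity_tbps(200, 40))   # QM8700:   16.0 Tbps
```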

 

The table below shows common Spectrum series Ethernet switches (mainly listing the high-bandwidth port configurations; lower speeds are also supported, but since the total number of physical ports is fixed, they are less interesting to enumerate):

 

| Spectrum | Size | Throughput | Ports | Use Case |
|---|---|---|---|---|
| SN3700C | 1U | 6.4Tbps | 100Gbps x 32, 50Gbps x 64 | leaf |
| SN3700 | 1U | 12.8Tbps | 200Gbps x 32, 100Gbps x 64, 50Gbps x 128 | leaf, super-spine |
| SN4600C | 2U | 12.8Tbps | 100Gbps x 64, 50Gbps x 128 | spine |
| SN4600 | 2U | 25.6Tbps | 200Gbps x 64, 100Gbps x 128, 50Gbps x 128 | spine, super-spine |
| SN4700 | 1U | 25.6Tbps | 400Gbps x 32, 200Gbps x 64, 100Gbps x 128, 50Gbps x 128 | leaf, super-spine |
| SN5400 | 2U | 51.2Tbps | 400Gbps x 64, 200Gbps x 128, 100Gbps x 256, 50Gbps x 256 | spine, super-spine |
| SN5600 | 2U | 102.4Tbps | 800Gbps x 64, 400Gbps x 128, 200Gbps x 256, 100Gbps x 256, 50Gbps x 256 | spine |

 

 

 

The table below shows common Quantum series IB switches:

 

| Quantum | Size | Throughput | Ports | Topology |
|---|---|---|---|---|
| QM8700/8790 | 1U | 16Tbps | 200Gbps x 40, 100Gbps x 80 | SlimFly, DragonFly+, 6DT, Fat Tree |
| QM9700/9790 | 1U | 51.2Tbps | 400Gbps x 64, 200Gbps x 128 | SlimFly, DragonFly+, Multi-dimensional Torus, Fat Tree, etc. |
| X800 Q3400-RA | 4U | 230.4Tbps | 800Gbps x 144 | — |

 

In addition to Mellanox switches, many data centers use modular switches. For instance, Meta's recent post "Building Meta's GenAI Infrastructure" describes two clusters of 24K H100 GPUs each, built with Arista 7800 series switches. The 7800 series includes modular switches; the 7816LR3 and 7816R3 offer 576 ports of 400G high-speed bandwidth, connected internally via efficient buses or switching backplanes, resulting in very low transmission and processing latency.
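Plugging the 576-port 400G chassis into the same capacity formula used above shows how much a single modular switch aggregates; the sketch below is just that arithmetic.

```python
# Bidirectional switching capacity of a 576-port 400G modular chassis,
# using the same port_speed x ports x 2 formula as above.
ports = 576
port_gbps = 400
capacity_tbps = ports * port_gbps * 2 / 1000
print(f"{capacity_tbps:.1f} Tbps")  # 460.8 Tbps, several times a 64-port 800G fixed switch
```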

 

 

Optical Transceiver Modules

Optical transceiver modules enable fiber optic communication by converting electrical signals to optical signals and vice versa. This technology supports higher transmission rates and longer distances, free from electromagnetic interference. Each module typically includes a transmitter (electrical to optical conversion) and a receiver (optical to electrical conversion).

 

 

Common interfaces in fiber optic communication include SFP (Small Form-factor Pluggable) and QSFP (Quad Small Form-factor Pluggable):

 

SFP: Usually single transmission channel (one or a pair of fibers).

QSFP: Four transmission channels; the related QSFP-DD (Double Density) variant doubles this to 8 channels for higher port density.

 

| Interface | Channels | Model | Bandwidth | Year |
|---|---|---|---|---|
| SFP | 1 | SFP | 1Gbps | 2000 |
| | | SFP+ | 10Gbps | 2006 |
| | | SFP28 | 25Gbps | 2014 |
| | | SFP112 | 100Gbps | 2021 |
| QSFP | 4 | QSFP+ | 40Gbps | 2013 |
| | | QSFP28 | 100Gbps | 2016 |
| | | QSFP56 | 200Gbps | 2017 |
| | | QSFP112 | 400Gbps | 2021 |
| QSFP-DD | 8 | QSFP28-DD | 200Gbps | 2016 |
| | | QSFP56-DD | 400Gbps | 2018 |
| | | QSFP-DD800 | 800Gbps | 2021 |
| | | QSFP-DD1600 | 1.6Tbps | 2023 |
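A module's aggregate bandwidth is simply the number of electrical lanes times the per-lane signaling rate. The sketch below checks a few rows of the table above using the lane rates commonly associated with each form factor (25G NRZ for QSFP28, 50G PAM4 for QSFP56, 100G PAM4 for QSFP112 and QSFP-DD800); those per-lane rates are assumptions drawn from the usual specs rather than from this article.

```python
# Aggregate module bandwidth = lanes x per-lane rate.
# Lane rates are the ones commonly used for each form factor (assumed here).
modules = {
    "QSFP28":     {"lanes": 4, "lane_gbps": 25},   # 4 x 25G NRZ   = 100G
    "QSFP56":     {"lanes": 4, "lane_gbps": 50},   # 4 x 50G PAM4  = 200G
    "QSFP112":    {"lanes": 4, "lane_gbps": 100},  # 4 x 100G PAM4 = 400G
    "QSFP-DD800": {"lanes": 8, "lane_gbps": 100},  # 8 x 100G PAM4 = 800G
}
for name, m in modules.items():
    print(f"{name}: {m['lanes'] * m['lane_gbps']}G")
```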

 

 

Recently, OSFP (Octal Small Form-factor Pluggable) packaging has been introduced, targeting high-bandwidth scenarios such as 400Gbps and 800Gbps. OSFP modules are larger than QSFP-DD modules and require adapters for compatibility with SFP and QSFP interfaces. The table below lists 400Gbps OSFP optical modules for different transmission distances (100m, 500m, 2km, 10km):

 

| Type | Maximum Transmission Distance | Wavelength | Fiber Type | Connector Type | Modulation Technology | Standards |
|---|---|---|---|---|---|---|
| 400G OSFP SR8 | 100 meters | 850 nm | Multimode Fiber | MPO/MTP-16 | 50G PAM4 | IEEE P802.3cm / IEEE 802.3bs |
| 400G OSFP DR4 | 500 meters | 1310 nm | Single-mode Fiber | MPO/MTP-12 | 100G PAM4 | IEEE 802.3bs |
| 400G OSFP DR4+ | 2 kilometers | 1310 nm | Single-mode Fiber | MPO/MTP-12 | 100G PAM4 | / |
| 400G OSFP LR4 | 10 kilometers | CWDM4 | Single-mode Fiber | LC | 100G PAM4 | 100G Lambda Multisource Agreement |

 

For different distances and scenarios, different optical transceiver modules can be chosen: for example, 10km 400G LR4 or 800G 2xLR4 between the Core and Spine layers, 2km 400G FR4 between the Spine and Leaf layers, and 500m 400G DR4 between the Leaf and ToR layers, as sketched below.
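As a compact restatement of that example placement (the mapping mirrors the distances just mentioned; actual choices depend on the specific topology and cable plant):

```python
# Example mapping of fabric tier to optical module choice, following the
# distance-based example above; adjust to the actual topology and reach needed.
tier_to_module = {
    "core <-> spine": {"reach": "10 km", "module": "400G LR4 / 800G 2xLR4"},
    "spine <-> leaf": {"reach": "2 km",  "module": "400G FR4"},
    "leaf <-> ToR":   {"reach": "500 m", "module": "400G DR4"},
}
for link, choice in tier_to_module.items():
    print(f"{link}: {choice['module']} ({choice['reach']})")
```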

 

 

The unit price of optical modules is relatively high, ranging from hundreds to thousands of dollars, depending on factors such as bandwidth and transmission distance. Generally, higher bandwidth and longer distances correlate with higher prices.

 

Explore more optical module solutions on NADDOD.

 

NADDOD is a leading supplier in the optical transceiver industry, offering a comprehensive range of products from 1G to 800G, including advanced 800G/400G NDR solutions. NADDOD's portfolio supports both InfiniBand and RoCE (RDMA over Converged Ethernet) solutions, catering to various packaging forms such as SFP, QSFP, QSFP-DD, QSFP112, and OSFP. Compared to other vendors, NADDOD distinguishes itself with ready availability of all mainstream products, short delivery cycles, and rigorous 100% verification of all products before shipping. Additionally, NADDOD's timely and reliable service ensures efficient data transfer and communication within data centers.

 

Optical transceiver modules play a critical role in data center operations. Since each port requires an optical module, the number of optical modules is typically proportional to the number of GPUs, often reaching 4-6 times the GPU count (a rough estimate follows below). Well-chosen optical modules help sustain network performance and ensure efficient data transfer and communication within the data center.
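Taking the 4-6x rule of thumb above at face value, a quick estimate for a ~24K-GPU cluster (the scale mentioned earlier for Meta's H100 clusters) looks like this; it is illustrative only, not a bill of materials.

```python
# Rough optical-module count for a large cluster, using the 4-6x-per-GPU
# rule of thumb stated above. Illustrative only, not a bill of materials.
gpus = 24_000                       # approximate size of one ~24K-GPU cluster
low, high = 4 * gpus, 6 * gpus
print(f"Estimated optical modules: {low:,} - {high:,}")  # ~96,000 - 144,000
```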

 

NADDOD's high-performance and compatibility features make it an ideal choice for data centers aiming to maintain optimal performance and reliability. Our extensive product range and prompt service ensure that data centers can achieve efficient and high-speed connectivity, crucial for large-scale GPU clusters and LLM training setups. Additionally, NADDOD offers cost-effective solutions, minimizing overall expenses while maintaining top-tier performance and reliability.

 


 

Conclusion

As the demand for computational power and efficient data transfer continues to rise, the importance of advanced networking solutions becomes increasingly evident. NADDOD's optical transceiver modules offer a robust solution to meet these demands, providing high-speed, reliable, and efficient communication within data centers. This supports the ever-growing needs of modern computational tasks and LLM training.