Building and Optimizing Large-Scale GPU Clusters: Insights and Innovations

NADDOD | Jason, Data Center Architect | May 31, 2024

Since OpenAI introduced ChatGPT, Large Language Models (LLMs) have quickly become a focal point of interest and development. Companies are investing heavily in LLM pre-training to keep up with this trend. Training a 100B-scale LLM typically requires extensive computational resources, such as clusters with thousands of GPUs. For instance, the Falcon series models were trained on a cluster of 4096 A100 GPUs, taking nearly 70 days to train a 180B model with 3.5T tokens. As data scales continue to grow, so does the demand for computational power. Meta, for example, used 15T tokens to train its LLaMA3 series models on two clusters of 24K H100 GPUs.
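As a rough sanity check on these figures, the common approximation of about 6 × parameters × tokens total training FLOPs, combined with the A100's widely cited dense FP16 peak (~312 TFLOPS) and an assumed ~50% model FLOPs utilization, lands close to the reported training time. The sketch below is illustrative only; the utilization figure is an assumption, not a reported number.

```python
# Rough training-time estimate using the common ~6 * N * D FLOPs rule of thumb.
# The 50% utilization (MFU) is an assumed value for illustration, not a reported figure.
params = 180e9          # Falcon-180B parameters
tokens = 3.5e12         # training tokens
gpus = 4096             # A100 GPUs in the cluster
peak_flops = 312e12     # A100 dense FP16 peak, FLOP/s
mfu = 0.5               # assumed model FLOPs utilization

total_flops = 6 * params * tokens                  # ~3.78e24 FLOPs
seconds = total_flops / (gpus * peak_flops * mfu)
print(f"Estimated training time: {seconds / 86400:.0f} days")  # ~68 days, close to "nearly 70 days"
```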

 

This article delves into the components and configurations involved in building large-scale GPU clusters, including various GPU types, server configurations, and network equipment (such as network cards, switches, and optical transceiver modules).

 

Constructing a GPU cluster with more than ten thousand GPUs is highly complex, and this article covers only a fraction of the considerations. In practice, cluster construction also involves designing the data center network topology (such as 3-Tier and Fat-Tree), as well as storage and management networks, which use similar connection methods and are not elaborated on here; a rough Fat-Tree sizing sketch follows below.
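To illustrate why topology design matters at this scale, the sketch below applies the textbook fat-tree capacity formulas: a non-blocking 2-tier leaf-spine fabric built from k-port switches supports up to k²/2 hosts, and a 3-tier fat tree up to k³/4. The 64-port radix is only an example value (e.g., a 64-port 400G switch).

```python
# Illustrative fat-tree capacity check: how many NICs a non-blocking fabric of
# k-port switches can support. Formulas are the standard leaf-spine (k^2/2)
# and 3-tier fat-tree (k^3/4) bounds; k=64 is an example radix.
def fat_tree_hosts(k: int, tiers: int) -> int:
    if tiers == 2:
        return k * k // 2
    if tiers == 3:
        return k ** 3 // 4
    raise ValueError("only 2- or 3-tier fabrics handled here")

k = 64
print(fat_tree_hosts(k, 2))   # 2048  -> a 2-tier fabric is too small for a 10K+ GPU cluster
print(fat_tree_hosts(k, 3))   # 65536 -> a 3-tier fat tree comfortably covers 10K+ GPUs
```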

 

GPU

The table below highlights the most powerful GPUs from the Ampere, Hopper, and the latest Blackwell series, showcasing improvements in memory, computing power, and NVLink capabilities.

 

From A100 to H100: FP16 dense computing power has increased more than threefold, while power consumption has only risen from 400W to 700W.

 

From H200 to B200: FP16 dense computing power has more than doubled, with power consumption increasing from 700W to 1000W.

 

B200 FP16 dense computing power: Approximately seven times that of the A100, with power consumption only 2.5 times higher.

 

Blackwell GPUs: Support FP4 precision, which offers double the throughput of FP8. Note that NVIDIA's headline comparisons often pit Blackwell FP4 against Hopper FP8, which makes the claimed acceleration look especially large.

 

It is important to note that the GB200 uses the full-spec B200 chip, while the B100 and B200 boards are correspondingly cut-down versions.
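To make the comparisons above concrete, the sketch below recomputes the stated ratios from commonly cited dense FP16 Tensor Core throughput (A100 ≈ 312 TFLOPS, H100 ≈ 989 TFLOPS, B200 ≈ 2,250 TFLOPS) and board power (400W / 700W / 1000W). Treat the TFLOPS values as approximate published figures, not guarantees.

```python
# Recompute the compute and power ratios quoted above from commonly cited
# dense FP16 Tensor Core throughput (TFLOPS) and board power (W).
# The figures are approximate published specs, not measured values.
specs = {
    "A100": {"fp16_tflops": 312,  "power_w": 400},
    "H100": {"fp16_tflops": 989,  "power_w": 700},
    "B200": {"fp16_tflops": 2250, "power_w": 1000},
}

a100, h100, b200 = specs["A100"], specs["H100"], specs["B200"]
print(f"H100 / A100 compute: {h100['fp16_tflops'] / a100['fp16_tflops']:.1f}x")  # ~3.2x
print(f"B200 / A100 compute: {b200['fp16_tflops'] / a100['fp16_tflops']:.1f}x")  # ~7.2x
print(f"B200 / A100 power:   {b200['power_w'] / a100['power_w']:.1f}x")          # 2.5x
```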

 

 

HGX

HGX is NVIDIA's high-performance server platform, typically carrying 8 or 4 GPUs per machine paired with Intel or AMD CPUs, with NVLink and NVSwitch providing full GPU-to-GPU interconnection (8 GPUs is the NVLink full-interconnect limit outside of NVL and SuperPod configurations). These servers are generally air-cooled.

 

From HGX A100 to HGX H100 and HGX H200: FP16 dense computing power has increased by 3.3 times, while power consumption is less than double.

 

From HGX H100 and HGX H200 to HGX B100 and HGX B200: FP16 dense computing power has roughly doubled, while power consumption rises by at most about 50%.

It should be noted that the networking on HGX B100 and HGX B200 has not been significantly upgraded: the back-end InfiniBand network cards are still 8x400Gb/s, as the sketch below illustrates.
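For a sense of scale, the sketch below converts the 8x400Gb/s back-end NIC configuration into aggregate per-node bandwidth; the byte conversion is simply 8 bits per byte and ignores protocol overhead.

```python
# Aggregate back-end (scale-out) network bandwidth of one HGX node with
# eight 400Gb/s InfiniBand NICs. Ignores encoding/protocol overhead.
nics_per_node = 8
nic_rate_gbps = 400

total_gbps = nics_per_node * nic_rate_gbps      # 3200 Gb/s per node
total_gbytes = total_gbps / 8                   # ~400 GB/s per node, unidirectional
print(f"{total_gbps} Gb/s (~{total_gbytes:.0f} GB/s) of back-end bandwidth per node")
```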

 


 

NVIDIA DGX and HGX are two high-performance solutions designed for deep learning, artificial intelligence, and large-scale computing needs, but they differ in design and target applications:

 

DGX: Aimed at end users who want a turnkey system, offering a plug-and-play high-performance solution with comprehensive software support, including NVIDIA's deep learning software stack, drivers, and tools. These systems are usually pre-built and closed (fixed configurations).

 

HGX: Mainly for cloud service providers and large-scale data center operators, suitable for building custom high-performance solutions. It offers a modular design, allowing customers to customize hardware based on their needs, typically provided as a hardware platform or reference architecture.

 

 

Networking

Network Cards

Here we primarily introduce Mellanox's ConnectX-5/6/7/8 high-speed network cards, which support both Ethernet and InfiniBand (IB). ConnectX-5 was released in 2016; after NVIDIA acquired Mellanox in 2019, it released ConnectX-6 in 2020 and ConnectX-7 in 2022, and announced ConnectX-8 at the 2024 GTC conference (detailed specifications have yet to be published). The brief configuration of the cards is shown below: each generation roughly doubles the total bandwidth, and the following generation is expected to reach 1.6Tbps.
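The doubling trend is easy to see from the commonly cited maximum port speeds per generation (ConnectX-5: 100Gb/s, ConnectX-6: 200Gb/s, ConnectX-7: 400Gb/s, ConnectX-8: 800Gb/s). The small sketch below just tabulates those nominal per-card maximums.

```python
# Nominal maximum bandwidth per ConnectX generation (Gb/s), using commonly
# cited per-card figures; each generation roughly doubles the previous one.
connectx_max_gbps = {
    "ConnectX-5": 100,
    "ConnectX-6": 200,
    "ConnectX-7": 400,
    "ConnectX-8": 800,
}
gens = list(connectx_max_gbps)
for prev, curr in zip(gens, gens[1:]):
    ratio = connectx_max_gbps[curr] / connectx_max_gbps[prev]
    print(f"{prev} -> {curr}: {ratio:.0f}x")
# The next step in this progression would be 1.6 Tb/s.
```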

 

 

Switches

NVIDIA provides switches for both Ethernet and InfiniBand, often with dozens or hundreds of ports. The total throughput (bidirectional switching capacity) is the per-port bandwidth multiplied by the number of ports, multiplied by two to count both directions.
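For example, applying this formula to a few of the switches listed below reproduces their quoted switching capacities; the sketch is just that arithmetic.

```python
# Bidirectional switching capacity = port speed x port count x 2 (both directions).
def switching_capacity_tbps(port_gbps: int, ports: int) -> float:
    return port_gbps * ports * 2 / 1000  # convert Gb/s to Tb/s

print(switching_capacity_tbps(800, 64))   # SN5600:  102.4 Tbps
print(switching_capacity_tbps(400, 64))   # QM9700:   51.2 Tbps
print(switching_capacity_tbps(200, 40))   # QM8700:   16.0 Tbps
```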

 

The table below shows common Spectrum series Ethernet switches (mainly listing the high-bandwidth port configurations; lower speeds are also supported, but since the total number of physical ports is fixed, they are less interesting to enumerate):

 

| Spectrum | Size | Throughput | Ports | Use Case |
|---|---|---|---|---|
| SN3700C | 1U | 6.4Tbps | 100Gbps x 32, 50Gbps x 64 | leaf |
| SN3700 | 1U | 12.8Tbps | 200Gbps x 32, 100Gbps x 64, 50Gbps x 128 | leaf, super-spine |
| SN4600C | 2U | 12.8Tbps | 100Gbps x 64, 50Gbps x 128 | spine |
| SN4600 | 2U | 25.6Tbps | 200Gbps x 64, 100Gbps x 128, 50Gbps x 128 | spine, super-spine |
| SN4700 | 1U | 25.6Tbps | 400Gbps x 32, 200Gbps x 64, 100Gbps x 128, 50Gbps x 128 | leaf, super-spine |
| SN5400 | 2U | 51.2Tbps | 400Gbps x 64, 200Gbps x 128, 100Gbps x 256, 50Gbps x 256 | spine, super-spine |
| SN5600 | 2U | 102.4Tbps | 800Gbps x 64, 400Gbps x 128, 200Gbps x 256, 100Gbps x 256, 50Gbps x 256 | spine |

 

 

 

The table below shows common Quantum series IB switches:

 

| Quantum | Size | Throughput | Ports | Topology |
|---|---|---|---|---|
| QM8700/8790 | 1U | 16Tbps | 200Gbps x 40, 100Gbps x 80 | SlimFly, DragonFly+, 6DT, Fat Tree |
| QM9700/9790 | 1U | 51.2Tbps | 400Gbps x 64, 200Gbps x 128 | SlimFly, DragonFly+, Multi-dimensional Torus, Fat Tree, etc. |
| X800 Q3400-RA | 4U | 230.4Tbps | 800Gbps x 144 | — |

 

In addition to Mellanox switches, many data centers use modular switches. For instance, Meta's recent post "Building Meta's GenAI Infrastructure" describes two clusters of 24K H100 GPUs each, built with Arista 7800 series switches. The 7800 series includes modular switches; the 7816LR3 and 7816R3 offer 576 ports of 400G high-speed bandwidth, connected internally via efficient buses or switching backplanes, resulting in very low transmission and processing latency.
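Plugging the 576-port 400G chassis into the same capacity formula used above shows how much a single modular switch aggregates; the sketch below is just that arithmetic.

```python
# Bidirectional switching capacity of a 576-port 400G modular chassis,
# using the same port_speed x ports x 2 formula as above.
ports = 576
port_gbps = 400
capacity_tbps = ports * port_gbps * 2 / 1000
print(f"{capacity_tbps:.1f} Tbps")  # 460.8 Tbps, several times a 64-port 800G fixed switch
```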

 

 

Optical Transceiver Modules

Optical transceiver modules enable fiber optic communication by converting electrical signals to optical signals and vice versa. This technology supports higher transmission rates and longer distances, free from electromagnetic interference. Each module typically includes a transmitter (electrical to optical conversion) and a receiver (optical to electrical conversion).

 

 

Common interfaces in fiber optic communication include SFP (Small Form-factor Pluggable) and QSFP (Quad Small Form-factor Pluggable):

 

SFP: Usually single transmission channel (one or a pair of fibers).

QSFP: Four transmission channels; the related QSFP-DD (Double Density) variant doubles this to 8 channels for higher port density.

 

| Interface | Channels | Model | Bandwidth | Year |
|---|---|---|---|---|
| SFP | 1 | SFP | 1Gbps | 2000 |
| | | SFP+ | 10Gbps | 2006 |
| | | SFP28 | 25Gbps | 2014 |
| | | SFP112 | 100Gbps | 2021 |
| QSFP | 4 | QSFP+ | 40Gbps | 2013 |
| | | QSFP28 | 100Gbps | 2016 |
| | | QSFP56 | 200Gbps | 2017 |
| | | QSFP112 | 400Gbps | 2021 |
| QSFP-DD | 8 | QSFP28-DD | 200Gbps | 2016 |
| | | QSFP56-DD | 400Gbps | 2018 |
| | | QSFP-DD800 | 800Gbps | 2021 |
| | | QSFP-DD1600 | 1.6Tbps | 2023 |
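A module's aggregate bandwidth is simply the number of electrical lanes times the per-lane signaling rate. The sketch below checks a few rows of the table above using the lane rates commonly associated with each form factor (25G NRZ for QSFP28, 50G PAM4 for QSFP56, 100G PAM4 for QSFP112 and QSFP-DD800); those per-lane rates are assumptions drawn from the usual specs rather than from this article.

```python
# Aggregate module bandwidth = lanes x per-lane rate.
# Lane rates are the ones commonly used for each form factor (assumed here).
modules = {
    "QSFP28":     {"lanes": 4, "lane_gbps": 25},   # 4 x 25G NRZ   = 100G
    "QSFP56":     {"lanes": 4, "lane_gbps": 50},   # 4 x 50G PAM4  = 200G
    "QSFP112":    {"lanes": 4, "lane_gbps": 100},  # 4 x 100G PAM4 = 400G
    "QSFP-DD800": {"lanes": 8, "lane_gbps": 100},  # 8 x 100G PAM4 = 800G
}
for name, m in modules.items():
    print(f"{name}: {m['lanes'] * m['lane_gbps']}G")
```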

 

 

Recently, OSFP (Octal Small Form-factor Pluggable) packaging has been introduced, targeting high-bandwidth scenarios such as 400Gbps and 800Gbps. OSFP modules are larger than QSFP-DD modules and require adapters for compatibility with SFP and QSFP interfaces. The table below lists 400Gbps OSFP optical modules for different transmission distances (100m, 500m, 2km, 10km):

 

| Type | Maximum Transmission Distance | Wavelength | Fiber Type | Connector Type | Modulation Technology | Standards |
|---|---|---|---|---|---|---|
| 400G OSFP SR8 | 100 meters | 850 nm | Multimode Fiber | MPO/MTP-16 | 50G PAM4 | IEEE P802.3cm / IEEE 802.3bs |
| 400G OSFP DR4 | 500 meters | 1310 nm | Single-mode Fiber | MPO/MTP-12 | 100G PAM4 | IEEE 802.3bs |
| 400G OSFP DR4+ | 2 kilometers | 1310 nm | Single-mode Fiber | MPO/MTP-12 | 100G PAM4 | / |
| 400G OSFP LR4 | 10 kilometers | CWDM4 | Single-mode Fiber | LC | 100G PAM4 | 100G Lambda Multisource Agreement |

 

For different distances and scenarios, different optical transceiver modules can be chosen: for example, 10km 400G LR4 or 800G 2xLR4 between the Core and Spine layers, 2km 400G FR4 between the Spine and Leaf layers, and 500m 400G DR4 between the Leaf and ToR layers, as sketched below.
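As a compact restatement of that example placement (the mapping mirrors the distances just mentioned; actual choices depend on the specific topology and cable plant):

```python
# Example mapping of fabric tier to optical module choice, following the
# distance-based example above; adjust to the actual topology and reach needed.
tier_to_module = {
    "core <-> spine": {"reach": "10 km", "module": "400G LR4 / 800G 2xLR4"},
    "spine <-> leaf": {"reach": "2 km",  "module": "400G FR4"},
    "leaf <-> ToR":   {"reach": "500 m", "module": "400G DR4"},
}
for link, choice in tier_to_module.items():
    print(f"{link}: {choice['module']} ({choice['reach']})")
```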

 

 

The unit price of optical modules is relatively high, ranging from hundreds to thousands of dollars, depending on factors such as bandwidth and transmission distance. Generally, higher bandwidth and longer distances correlate with higher prices.

 

Explore more optical module solutions on NADDOD.

 

NADDOD is a leading supplier in the optical transceiver industry, offering a comprehensive range of products from 1G to 800G, including advanced 800G/400G NDR solutions. NADDOD's portfolio supports both InfiniBand and RoCE (RDMA over Converged Ethernet) solutions, catering to various packaging forms such as SFP, QSFP, QSFP-DD, QSFP112, and OSFP. Compared to other vendors, NADDOD distinguishes itself with ready availability of all mainstream products, short delivery cycles, and rigorous 100% verification of all products before shipping. Additionally, NADDOD's timely and reliable service ensures efficient data transfer and communication within data centers.

 

Optical transceiver modules play a critical role in data center operations. Since each port requires an optical module, the number of optical modules is typically proportional to the number of GPUs, often reaching 4-6 times the GPU count (a rough estimate follows below). Well-chosen optical modules help sustain network performance and ensure efficient data transfer and communication within the data center.
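Taking the 4-6x rule of thumb above at face value, a quick estimate for a ~24K-GPU cluster (the scale mentioned earlier for Meta's H100 clusters) looks like this; it is illustrative only, not a bill of materials.

```python
# Rough optical-module count for a large cluster, using the 4-6x-per-GPU
# rule of thumb stated above. Illustrative only, not a bill of materials.
gpus = 24_000                       # approximate size of one ~24K-GPU cluster
low, high = 4 * gpus, 6 * gpus
print(f"Estimated optical modules: {low:,} - {high:,}")  # ~96,000 - 144,000
```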

 

NADDOD's high-performance and compatibility features make it an ideal choice for data centers aiming to maintain optimal performance and reliability. Our extensive product range and prompt service ensure that data centers can achieve efficient and high-speed connectivity, crucial for large-scale GPU clusters and LLM training setups. Additionally, NADDOD offers cost-effective solutions, minimizing overall expenses while maintaining top-tier performance and reliability.

 


 

Conclusion

As the demand for computational power and efficient data transfer continues to rise, the importance of advanced networking solutions becomes increasingly evident. NADDOD's optical transceiver modules offer a robust solution to meet these demands, providing high-speed, reliable, and efficient communication within data centers. This supports the ever-growing needs of modern computational tasks and LLM training.