High-Performance GPU Server Hardware Topology and Cluster Networking-1

NADDOD Gavin, InfiniBand Network Engineer, Nov 13, 2023

Large-scale model training is typically done on clusters of hosts with 8 GPU cards each. Common configurations are 8x A100, 8x A800, 8x H100, or 8x H800 per host, and possibly 4- or 8-card configurations of the upcoming L40S. Here is the hardware topology of a typical host with 8 A100 GPUs:

Typical 8-card A100 Host Hardware Topology

 

Basic Introduction Concepts and Terminology

1. PCIe Switch Chip

Devices that support PCIe, such as CPUs, memory, NVMe storage, GPUs, and network cards, can be interconnected over the PCIe bus or through dedicated PCIe switch chips.

 

Currently, PCIe has five generations of products, with the latest being Gen5.
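As a quick reference, the per-direction bandwidth of each PCIe generation can be derived from its transfer rate and line encoding. Below is a minimal Python sketch of that calculation; the rates and encodings are standard published PCIe figures rather than values taken from this article, and x16 is the slot width typically used for GPUs and high-speed NICs.

    # Per-direction PCIe bandwidth per generation, derived from the
    # published transfer rate (GT/s) and line-encoding overhead.
    PCIE_GENS = {
        # generation: (transfer rate in GT/s, encoding efficiency)
        1: (2.5, 8 / 10),     # 8b/10b encoding
        2: (5.0, 8 / 10),
        3: (8.0, 128 / 130),  # 128b/130b encoding
        4: (16.0, 128 / 130),
        5: (32.0, 128 / 130),
    }

    for gen, (gt_s, eff) in PCIE_GENS.items():
        per_lane_gbit = gt_s * eff            # Gbit/s per lane, per direction
        x16_gbyte = per_lane_gbit * 16 / 8    # GB/s for an x16 slot, per direction
        print(f"Gen{gen}: {per_lane_gbit:.2f} Gb/s per lane, "
              f"~{x16_gbyte:.1f} GB/s per direction at x16")

For Gen4 this works out to roughly 32 GB/s per direction at x16 (about 64 GB/s bidirectional), and roughly 64 GB/s per direction for Gen5 (about 128 GB/s), which matches the figures quoted in the HBM section below.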

2. NVLink

Definition of NVLink on Wikipedia:

 

NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub. The protocol was first announced in March 2014 and uses a proprietary high-speed signaling interconnect (NVHS).

 

NADDOD's tech manager summarizes the key NVLink features:

 

NVLink refers to a high-speed interconnect between different GPUs within the same host.

  • It provides a short-distance communication link that guarantees successful packet transmission and offers higher performance than PCIe.

  • It can serve as a replacement for PCIe and supports multiple lanes, with link bandwidth scaling linearly with the number of lanes.

  • Within a single node, GPUs are interconnected via NVLink (and NVSwitch) in a full-mesh configuration, similar to a spine-leaf topology; the sketch after this list shows one way to inspect this on a real host.

  • It is a proprietary NVIDIA technology.
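One way to see this intra-host topology (NVLink full mesh, PCIe switch placement, NUMA affinity) on a real machine is nvidia-smi topo -m. The minimal Python wrapper below simply shells out to that command; it assumes the NVIDIA driver and nvidia-smi are installed on the host.

    import subprocess

    # Print the GPU interconnect matrix for this host. On an HGX A100
    # 8-GPU system, GPU-to-GPU entries typically show NV12 (12 NVLink
    # links), while GPU-to-NIC entries show PIX/PXB/NODE/SYS depending
    # on whether the path stays within one PCIe switch, crosses
    # switches, or crosses CPU sockets.
    print(subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    ).stdout)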

NVLink evolution: 1/2/3/4 generations

The main differences between generations are the number of lanes per NVLink link and the bandwidth per lane (the bandwidth figures in the diagram are bidirectional).

NVLink evolution

For example:

 

A100 provides 12 NVLink links at 50GB/s per link, for a total bidirectional bandwidth of 600GB/s (300GB/s in each direction).

A800, on the other hand, has 4 of those links disabled, leaving 8 links at 50GB/s per link, for a total bidirectional bandwidth of 400GB/s (200GB/s in each direction).
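The same links-times-per-link arithmetic applies to every generation. The short sketch below reproduces it using commonly published per-generation figures (links per GPU and per-link bidirectional bandwidth); these are well-known datasheet values rather than numbers stated in this article.

    # Total NVLink bandwidth = links per GPU x per-link bandwidth
    # (bidirectional). A800 is an A100 with 4 of the 12 links disabled.
    NVLINK = {
        # name: (links per GPU, GB/s per link, bidirectional)
        "NVLink1 (P100)": (4, 40),
        "NVLink2 (V100)": (6, 50),
        "NVLink3 (A100)": (12, 50),
        "NVLink3 (A800)": (8, 50),
        "NVLink4 (H100)": (18, 50),
    }

    for name, (links, per_link) in NVLINK.items():
        total = links * per_link
        print(f"{name}: {links} x {per_link} GB/s = {total} GB/s bidirectional "
              f"({total // 2} GB/s per direction)")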

 

Additionally, real-time NVLink bandwidth can be collected based on DCGM (Data Center GPU Manager).
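A minimal sketch of that collection is shown below, assuming the DCGM command-line tool dcgmi is installed and that the profiling fields for NVLink TX/RX bytes (field IDs 1011 and 1012 in recent DCGM releases, an assumption worth verifying with dcgmi dmon -l) are supported on the GPU in question:

    import subprocess

    # Stream per-GPU NVLink TX/RX throughput once per second via DCGM.
    # 1011/1012 are assumed to be the profiling field IDs for NVLink
    # bytes transmitted/received; -d sets the sampling interval in ms.
    cmd = ["dcgmi", "dmon", "-e", "1011,1012", "-d", "1000"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            print(line.rstrip())  # one row per GPU per sampling interval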

3. NVSwitch

 

Typical 8-card A100 Host Hardware Topology

 

NVSwitch is a switching chip from NVIDIA. It is packaged on the GPU module and is not an independent switch outside the host.

 

Below is a photo of a real machine. The 8 boxes in the picture are 8 A100 GPUs; the NVSwitch chips sit underneath the 6 ultra-thick heat sinks on the right:

NVIDIA HGX A100 8 GPU Assembly Side View

 

4. NVLink Switch

NVSwitch sounds like a switch, but it is actually a switching chip on the GPU module, used to connect GPUs in the same host.

 

In 2022, NVIDIA took this chip out of the host and built it into an actual switch, called the NVLink Switch [3], which is used to connect GPU devices across hosts.

 

The two are easily confused because of their similar names.

5. HBM (High Bandwidth Memory)

Origin of HBM

Traditionally, GPU memory, like ordinary DDR memory, sits on the motherboard and connects to the processor (CPU or GPU) over PCIe, so the speed bottleneck is PCIe: about 64GB/s for a Gen4 x16 slot and about 128GB/s for Gen5 x16 (bidirectional).

 

As a result, some GPU manufacturers (not just NVIDIA) stack multiple DRAM dies and package them together with the GPU (as shown later when discussing the H100). With this design, each GPU talks to its dedicated memory without going through the PCIe switch chip, which increases speed dramatically. This design is known as High Bandwidth Memory (HBM).
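For a sense of scale, the sketch below compares PCIe x16 bandwidth with the HBM bandwidth of two GPUs. The HBM figures (roughly 2TB/s for an 80GB A100 with HBM2e and roughly 3.35TB/s for an SXM H100 with HBM3) are approximate datasheet values quoted here only for illustration.

    # Approximate published bandwidths in GB/s. HBM sits on-package with
    # the GPU die, so GPU<->HBM traffic never crosses the PCIe fabric.
    BANDWIDTH_GBS = {
        "PCIe Gen4 x16 (bidirectional)": 64,
        "PCIe Gen5 x16 (bidirectional)": 128,
        "A100 80GB HBM2e": 2039,
        "H100 SXM HBM3": 3350,
    }

    pcie5 = BANDWIDTH_GBS["PCIe Gen5 x16 (bidirectional)"]
    for name, bw in BANDWIDTH_GBS.items():
        print(f"{name:30s} {bw:5d} GB/s ({bw / pcie5:4.1f}x PCIe Gen5 x16)")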

 

Currently, the HBM market is dominated by South Korean companies such as SK Hynix and Samsung.

 

Evolution: HBM 1/2/2e/3/3e


6. Bandwidth Units

The performance of large-scale GPU training is directly related to data transfer speeds, and many different links are involved: PCIe bandwidth, memory bandwidth, NVLink bandwidth, HBM bandwidth, network bandwidth, and so on.

 

  • Network bandwidth is conventionally expressed in bits per second (b/s) and usually refers to a single direction (TX or RX).

  • Bandwidths of other modules are generally expressed in bytes per second (B/s) or transfers per second (T/s) and usually refer to the total bidirectional bandwidth.

When comparing bandwidths, it is important to distinguish between these conventions and convert units accordingly; a small conversion example follows.
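The sketch below works through the conversion using a 200 Gb/s NIC port and A100 NVLink as illustrative figures; both are standard published numbers, used here only to show the unit handling.

    def network_gbits_to_gbytes(gbits_per_s: float) -> float:
        """Convert a network line rate quoted in Gb/s (per direction) to GB/s."""
        return gbits_per_s / 8

    # Example: a 200 Gb/s NIC port vs. A100 NVLink (600 GB/s bidirectional).
    nic_per_dir = network_gbits_to_gbytes(200)  # 25 GB/s per direction
    nvlink_bidir = 600                          # GB/s, both directions combined
    nvlink_per_dir = nvlink_bidir / 2           # 300 GB/s per direction

    print(f"200 Gb/s NIC: {nic_per_dir:.0f} GB/s per direction")
    print(f"A100 NVLink : {nvlink_per_dir:.0f} GB/s per direction "
          f"({nvlink_bidir} GB/s bidirectional)")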

Related Resources:

High-Performance GPU Server Hardware Topology and Cluster Networking-1

High-Performance GPU Server Hardware Topology and Cluster Networking-2