High-Performance GPU Server Hardware Topology and Cluster Networking-2

NADDOD Abel InfiniBand Expert Nov 20, 2023

Typical 8*A100/8*A800 host

 

1. Topology within the host: 2-2-4-6-8-8

  • 2 CPUs (and memory on both sides, NUMA)
  • 2 storage network cards (access to distributed storage, in-band management, etc.)
  • 4 PCIe Gen4 Switch chips
  • 6 NVSwitch chips
  • 8 GPUs
  • 8 GPU dedicated network cards

Typical 8-card A100 host hardware topology


 

The diagram below is drawn more professionally; refer to it if you need more details:

NVIDIA DGX A100 host (official 8-card machine) hardware topology


 

1.1. Storage network card

The storage NICs are connected directly to the CPUs via PCIe. Their applications include:

 

  • Reading and writing data from distributed storage, such as reading training data and writing checkpoints.
  • Regular node management tasks, including SSH, monitoring, and data collection.

 

The official recommendation is the BlueField-3 (BF3) DPU, but any solution that meets the bandwidth requirements will do. RoCE is the cost-effective networking option, while InfiniBand is preferred when maximizing performance.

 

1.2. NVSwitch fabric: intra-node full-mesh

The 8 GPUs are connected in a full-mesh configuration via 6 NVSwitch chips, also known as the NVSwitch fabric. Each link in the full-mesh has a bandwidth of n * bw-per-nvlink-lane, where n is the number of NVLink lanes per GPU.

 

The A100 GPUs utilize NVLink3, with a bandwidth of 50GB/s per lane. Therefore, each link in the full-mesh operates at 12 * 50GB/s = 600GB/s. It is important to note that this bandwidth is bidirectional, with a unidirectional bandwidth of 300GB/s.

 

The A800 GPU is a reduced version of the A100, with the 12 lanes cut down to 8. As a result, each link in the full-mesh operates at 8 * 50GB/s = 400GB/s bidirectional, with a unidirectional bandwidth of 200GB/s.
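
As a quick sanity check, these figures follow directly from the lane count and the per-lane rate. Below is a minimal Python sketch of that arithmetic, using only the values quoted above:

```python
def nvlink_fullmesh_bandwidth(lanes: int, gbps_per_lane_bidir: float = 50.0):
    """Return (bidirectional, unidirectional) bandwidth in GB/s for one GPU's
    NVLink full-mesh connection, given its NVLink lane count."""
    bidir = lanes * gbps_per_lane_bidir
    return bidir, bidir / 2

# A100: NVLink3 with 12 lanes -> 600 GB/s bidirectional, 300 GB/s unidirectional
print(nvlink_fullmesh_bandwidth(12))   # (600.0, 300.0)

# A800: reduced to 8 lanes -> 400 GB/s bidirectional, 200 GB/s unidirectional
print(nvlink_fullmesh_bandwidth(8))    # (400.0, 200.0)
```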

 

1.3. Use nvidia-smi topo to view the topology

The following is the actual topology displayed by nvidia-smi on an 8*A800 machine (the network cards are bonded in pairs, and NIC 0~3 are the resulting bonds); a minimal sketch for running this query yourself follows the list below:

Actual topology displayed by nvidia-smi on an 8*A800 machine

  • Between GPUs (top-left region): all NV8, indicating 8 NVLink connections.

  • Between NICs:
    • On the same CPU die: NODE, indicating no need to cross NUMA, but a need to cross PCIe switch chips.
    • On different CPU dies: SYS, indicating the need to cross NUMA.

  • Between GPUs and NICs:
    • On the same CPU die and under the same PCIe switch chip: NODE, indicating only the need to cross PCIe switch chips.
    • On the same CPU die but not under the same PCIe switch chip: NODE, indicating the need to cross PCIe switch chips and the PCIe host bridge.
    • On different CPU dies: SYS, indicating the need to cross NUMA and PCIe switch chips; this is the longest path.
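
To reproduce a matrix like the one above on your own machine, run nvidia-smi topo -m. The following is a minimal Python sketch that simply shells out to that command; it assumes the NVIDIA driver is installed and nvidia-smi is on the PATH, and the labels you see (NV8, NODE, SYS, ...) will of course depend on your hardware:

```python
import shutil
import subprocess

def show_gpu_topology() -> None:
    """Print the GPU/NIC connectivity matrix reported by nvidia-smi.

    The matrix labels (NV#, NODE, SYS, PIX, PXB, ...) describe how far apart
    two devices are: NVLink, same PCIe switch, same NUMA node, and so on.
    """
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi not found; is the NVIDIA driver installed?")
    # `topo -m` prints the full topology matrix.
    result = subprocess.run(["nvidia-smi", "topo", "-m"],
                            capture_output=True, text=True, check=True)
    print(result.stdout)

if __name__ == "__main__":
    show_gpu_topology()
```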

 

2. GPU training cluster networking: IDC GPU fabric

GPU node interconnection architecture:


 

2.1. Computing Network

The GPUs are directly connected to the top-of-rack (ToR) switch (leaf switch). The leaf switches are connected to the spine switches in a full-mesh configuration, forming a cross-host GPU compute network.

 

The purpose of this network is to exchange data between GPUs on different nodes.

Each GPU is connected to its respective network interface card (NIC) via a PCIe switch: GPU <--> PCIe Switch <--> NIC.

2.2. Storage Network

The two storage NICs are connected directly to the CPUs and belong to a separate network. Their main purpose is reading and writing data, as well as SSH management and other tasks.

2.3. RoCE vs. InfiniBand

Regardless of whether it is a compute network or a storage network, RDMA (Remote Direct Memory Access) is required to achieve the high performance needed for AI. Currently, there are two options for RDMA:

 

  • RoCEv2 (RDMA over Converged Ethernet version 2): This is the network commonly used by public cloud providers for their 8-GPU instances, such as the CX6 configuration with 8 * 100Gbps. It is relatively cost-effective, provided that it meets the performance requirements.

  • InfiniBand (IB): With equivalent network card bandwidth, InfiniBand offers over 20% better performance than RoCEv2, but at a higher price, roughly twice as expensive.
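
Taking the rough figures above at face value (about 20% better performance at roughly twice the price), a back-of-the-envelope comparison of performance per unit cost might look as follows; the absolute values are placeholders, only the ratios come from the text:

```python
# Relative figures only: IB ~1.2x the performance of RoCEv2 at ~2x the price.
fabrics = {
    "RoCEv2":     {"relative_perf": 1.0, "relative_cost": 1.0},
    "InfiniBand": {"relative_perf": 1.2, "relative_cost": 2.0},
}

for name, f in fabrics.items():
    perf_per_cost = f["relative_perf"] / f["relative_cost"]
    print(f"{name:10s} perf/cost = {perf_per_cost:.2f}")
# RoCEv2 wins on performance per dollar; InfiniBand wins on absolute performance.
```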

 

3. Data link bandwidth bottleneck analysis

Single-machine 8-card A100 GPU host bandwidth bottleneck analysis


 

Several key link bandwidths are indicated on the diagram:

 

  1. Between GPUs on the same host: using NVLink, bidirectional bandwidth is 600GB/s, and unidirectional bandwidth is 300GB/s.

  2. Between GPUs and their respective network interface cards (NICs) on the same host: using PCIe Gen4 switch chips, bidirectional bandwidth is 64GB/s, and unidirectional bandwidth is 32GB/s.

  3. Between GPUs across different hosts: data transmission relies on the NICs, so the bandwidth depends on the specific NIC used. For A100/A800 machines, the mainstream NICs currently used in China offer 100Gbps (12.5GB/s) of unidirectional bandwidth, so inter-host communication is dramatically slower than intra-host communication.

    • 200Gbps (25GB/s) is close to the unidirectional bandwidth of PCIe Gen4 x16.

    • 400Gbps (50GB/s) exceeds the unidirectional bandwidth of PCIe Gen4 x16.

 

Hence, using a 400Gbps NIC in this type of configuration does not yield significant benefits, as it requires PCIe Gen5 performance to fully utilize the 400Gbps bandwidth.
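
Putting these numbers side by side makes the bottleneck obvious. Below is a minimal sketch comparing the unidirectional bandwidth of each hop, using only the values from the analysis above:

```python
# Unidirectional bandwidths (GB/s) of each hop in an 8*A100 host,
# using the figures quoted in the analysis above.
links_gbps_unidir = {
    "NVLink (intra-host, GPU<->GPU)": 300,   # 600 GB/s bidirectional
    "PCIe Gen4 x16 (GPU<->NIC)": 32,         # 64 GB/s bidirectional
    "100Gbps NIC (inter-host)": 12.5,
    "200Gbps NIC (inter-host)": 25,
    "400Gbps NIC (inter-host)": 50,          # exceeds PCIe Gen4 x16 one-way
}

pcie_gen4_unidir = links_gbps_unidir["PCIe Gen4 x16 (GPU<->NIC)"]
for name, bw in links_gbps_unidir.items():
    note = ""
    if "inter-host" in name and bw > pcie_gen4_unidir:
        # An inter-host transfer cannot go faster than the PCIe hop feeding the NIC.
        note = "  <- capped by PCIe Gen4"
    print(f"{name:34s} {bw:>6} GB/s{note}")
```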

Typical 8*H100/8*H800 host

The GPU board form factor comes in two types:

 

  • PCIe Gen5
  • SXM5: higher performance

1. H100 chip layout

Below is the internal structure of an H100 GPU chip:


Single-chip H100 GPU internal logical layout

 

  • 4nm process;

  • The bottom row consists of 18 Gen4 NVLink links, providing a total bidirectional bandwidth of 18 links * 50GB/s/link = 900GB/s (i.e., 450GB/s in each direction); a short worked calculation follows this list;

  • The blue area in the middle represents the L2 cache;

  • The left and right sides contain the HBM chips, which serve as the graphics memory.
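
For comparison with the A100 numbers earlier, here is the same back-of-the-envelope NVLink calculation for the H100, using only the figures from the bullet above:

```python
# H100: 18 NVLink links, each 50 GB/s bidirectional (25 GB/s per direction).
links = 18
gbps_per_link_bidir = 50

bidir_total = links * gbps_per_link_bidir   # 900 GB/s bidirectional
unidir_total = bidir_total / 2              # 450 GB/s per direction
print(bidir_total, unidir_total)            # 900 450.0
```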

2. Hardware topology within the host

The structure is roughly similar to the 8-card A100 machine; the differences are:

 

  • The number of NVSwitch chips has been reduced from 6 to 4; the real machine picture is as follows:

NVSwitch chips

 

  • The interconnect with the CPU is upgraded from PCIe Gen4 x16 to PCIe Gen5 x16, with a bidirectional bandwidth of 128GB/s; a short sketch of the PCIe numbers follows below.
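
The PCIe upgrade roughly doubles the host-interface bandwidth. Below is a minimal sketch of the approximate x16 numbers, ignoring encoding and protocol overhead:

```python
# Approximate PCIe bandwidth per lane per direction (GB/s),
# ignoring 128b/130b encoding and protocol overhead.
GBPS_PER_LANE_UNIDIR = {"Gen4": 2.0, "Gen5": 4.0}

def pcie_x16_bandwidth(gen: str, lanes: int = 16):
    """Return (unidirectional, bidirectional) GB/s for a PCIe slot."""
    unidir = GBPS_PER_LANE_UNIDIR[gen] * lanes
    return unidir, unidir * 2

print(pcie_x16_bandwidth("Gen4"))   # (32.0, 64.0)  -- A100 host CPU link
print(pcie_x16_bandwidth("Gen5"))   # (64.0, 128.0) -- H100 host CPU link
```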

3. Networking

Networking is similar to the A100 host, the difference being that the standard configuration is upgraded to 400Gbps CX7 network cards. Otherwise, the network bandwidth would fall even further behind the PCIe switch and NVLink/NVSwitch bandwidths.

Related Resources:

High-Performance GPU Server Hardware Topology and Cluster Networking-1

High-Performance GPU Server Hardware Topology and Cluster Networking-L40S