NVIDIA DGX GH200 AI Supercomputer

NADDOD | Dylan, InfiniBand Solutions Architect | Jan 5, 2024

At the SIGGRAPH conference in August 2023, NVIDIA CEO Jensen Huang announced a series of updates to the company's AI software and hardware products. Two hardware updates of note were the introduction of the new L40S GPU, based on the Ada Lovelace architecture, and updated specifications for the DGX GH200 AI supercomputer. This brief overview is based on the product whitepaper released earlier, in June.

 

During the Computex conference in May, Jensen Huang had already unveiled the DGX GH200 AI supercomputer, which featured 96GB of HBM3 memory per GPU and was planned to ship by the end of the year. At SIGGRAPH, however, the DGX GH200 received two specification updates:

 

  1. The memory of each Hopper GPU was upgraded from 96GB of HBM3 with 4TB/s of bandwidth to 144GB of HBM3e with 4.9TB/s of bandwidth. This upgrade may have been made in consideration of Google's TPU v5 and AMD's MI300, which also adopted high-speed HBM3e memory.

 

  2. The high-speed memory per Superchip increased from 576GB (480GB for the CPU + 96GB for the GPU) to 624GB (480GB for the CPU + 144GB for the GPU).

 

The updated version with 144GB of HBM3e memory is expected to ship next year. The whitepaper mentioned above still reflects the DGX GH200's parameters from May, with no changes to the overall internal architecture.

 

DGX GH200

DGX GH200 AI Supercomputer

The DGX GH200 AI supercomputer is composed of several key components: the GH200 Grace Hopper Superchip, NVLink-C2C interconnect technology, the L1 and L2 NVLink switch systems, an NDR Quantum-2 InfiniBand network, and both in-band and out-of-band Ethernet networks. On the software side, it mainly comprises NVIDIA Base Command and NVIDIA AI Enterprise.

 

The Grace Hopper Superchip, enabled by NVLink-C2C technology, combines a GPU and a CPU into an AI-accelerated heterogeneous platform. By combining 256 Grace Hopper Superchips, the DGX GH200 can deliver 1 ExaFLOP of FP8 sparse compute and a 144TB unified memory pool for the GPUs, with a total NVLink bandwidth of 115.2TB/s. In effect, the DGX GH200 functions as one complete DGX server, providing a large memory pool and minimizing the memory-access and network-communication overhead typically associated with multi-node, multi-GPU training of large models.
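
As a quick sanity check, these headline figures follow from the per-chip numbers. The Python sketch below reproduces them, assuming the public H100 FP8 sparse peak of roughly 3,958 TFLOPS (an assumption, not stated in the whitepaper) and the original 480GB + 96GB per-chip memory; the 144TB headline matches 256 x 576GB when counted in binary terabytes:

# Back-of-the-envelope check of the DGX GH200 aggregate figures.
# Per-chip values are from this article plus one assumption (H100 FP8 peak).
NUM_SUPERCHIPS = 256
FP8_SPARSE_TFLOPS = 3958      # assumed H100 FP8 sparse peak per GPU
FAST_MEMORY_GB = 480 + 96     # Grace LPDDR5X + HBM3, original configuration
NVLINK_GB_S_PER_GPU = 450     # per-direction NVLink bandwidth per Superchip

print(f"FP8 sparse:  {NUM_SUPERCHIPS * FP8_SPARSE_TFLOPS / 1e6:.2f} EFLOPS")   # ~1.01
print(f"Memory pool: {NUM_SUPERCHIPS * FAST_MEMORY_GB / 1024:.0f} TB")         # 144
print(f"NVLink:      {NUM_SUPERCHIPS * NVLINK_GB_S_PER_GPU / 1000:.1f} TB/s")  # 115.2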

 

The DGX GH200 features a two-tier NVLink network. The first tier is the chassis: 8 Grace Hopper Superchips and 3 L1 NVLink switches, delivering 3.6TB/s of bandwidth per chassis, or 115.2TB/s across all 32 chassis. The second tier interconnects the 32 chassis through 36 L2 NVLink switches, forming a two-tier, 1:1 non-blocking fat-tree (CLOS) topology. The result is a supercomputing system with a full complement of 256 GPUs. Please refer to the internal logic diagram below for details.

 

NVIDIA Quantum-2 InfiniBand NDR400

The red area in the diagram represents a Grace Hopper Superchip, while the blue area represents a chassis consisting of 8 Superchips and 3 NVLink switches. Using NVLink's two-tier fat-tree topology, the 32 chassis form a non-blocking DGX supercomputer with 144TB of memory. The upper part of the diagram shows storage or other GH200 supercomputers connected through an NDR400 InfiniBand network. The diagram below illustrates two DGX GH200 AI supercomputers, totaling 512 Grace Hopper Superchips, connected via InfiniBand (IB).

 

Two DGX GH200 systems connected via NVIDIA Quantum-2 InfiniBand NDR400 and NVLink

As shown in the diagram below, NVIDIA has also introduced the MGX GH200 architecture, which allows OEM and ODM partners to build large-scale RDMA supercomputing clusters based on Grace Hopper. These clusters are built exclusively on InfiniBand NDR400.

 

MGX GH200 cluster based on NVIDIA Quantum-2 InfiniBand NDR400

Grace Hopper Superchip

NVIDIA GH200 Grace Hopper Superchip

Grace CPU features

Hopper H100 GPU features

 

The Grace Hopper Superchip is NVIDIA's first heterogeneous acceleration platform designed for HPC and AI. The basic specifications of the CPU and GPU are as follows: the GPU, based on the Hopper architecture, features 144GB of HBM3e memory with 4.9TB/s of bandwidth, while the 72-core Grace CPU (Arm Neoverse V2) has 480GB of LPDDR5X memory with 512GB/s of bandwidth.

 

Through NVLink-C2C technology, the GPU memory and CPU memory are connected with high bandwidth and low latency, providing 900GB/s of bidirectional bandwidth, i.e., 450GB/s in each direction. This is seven times the bandwidth of a standard PCIe 5.0 x16 link (32GT/s × 16 lanes ÷ 8 ≈ 64GB/s per direction; 64GB/s × 7 = 448GB/s). Combined with Extended GPU Memory (EGM) technology and the Magnum IO GPUDirect acceleration software stack, a local GPU can access both local and remote CPU memory at the full-duplex rate of 450GB/s, enabling a unified 144TB memory space across the entire system.

GPU to Peer CPU
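
The seven-fold figure is easy to reproduce. Below is a minimal sketch using the link parameters quoted above (and ignoring PCIe encoding overhead):

# NVLink-C2C vs. PCIe 5.0 x16, per-direction bandwidth (figures from the text).
pcie_gt_s = 32                          # PCIe 5.0 transfer rate per lane (GT/s)
lanes = 16
pcie_gb_s = pcie_gt_s * lanes / 8       # ~64 GB/s per direction
nvlink_c2c_gb_s = 450                   # NVLink-C2C per direction
print(pcie_gb_s, nvlink_c2c_gb_s / pcie_gb_s)   # 64.0, ~7x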

This increase in total high-speed memory enables larger batch sizes, as evidenced by the latest MLPerf 3.1 Inference results. A single Grace Hopper Superchip shows significantly better inference performance than an x86-hosted H100 SXM card, particularly on the BERT and DLRMv2 models.

GH200 Performance Increase vs. H100 SXM

Each Superchip also provides 4 PCIe 5.0 x16 high-speed interfaces (512GB/s of bidirectional bandwidth) for external connectivity. In addition, each Superchip connects through 18 fourth-generation NVLink links to the three first-tier NVLink switch systems, which in turn connect it to the other 7 Grace Hopper Superchips in the chassis.

NVLink Switch

3rd-generation NVSwitch ASIC structure diagram

Each L1 or L2 NVLink switch incorporates two third-generation NVSwitch ASICs, as shown in the diagram above. Together, the two ASICs provide 128 fourth-generation NVLink interfaces. Each interface consists of two PHY lanes with a combined full-duplex bandwidth of 26.6GB/s (approximately 200Gb/s), so the 128 NVLink 4 interfaces deliver a total full-duplex bandwidth of 25.6Tb/s.
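
The switch-level total can be verified in one line from the per-interface rate (a minimal check using the numbers above):

# Per-switch NVLink capacity from the per-interface rate quoted above.
interfaces = 128              # NVLink 4 interfaces across the two NVSwitch ASICs
gbps_per_interface = 200      # ~200 Gb/s full duplex each (2 x 112G PAM4 lanes)
print(interfaces * gbps_per_interface / 1000)     # 25.6 Tb/s per switch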

 

The diagram also shows that the ASICs include SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) acceleration units. In other words, in addition to the external IB-based SHARP, hardware acceleration for AllReduce is also implemented inside the NVLink fabric.

Grace Hopper Compute Chassis

8 Grace Hopper compute trays (GPU and NVLink views)

 

As depicted in the diagram above, the chassis is composed of 8 Grace Hopper Superchip cards and 3 NVLink switch cards, each NVLink switch carrying 2 third-generation NVSwitch ASICs. In addition, each Grace Hopper Superchip card carries a 1x400G OSFP CX7 NDR InfiniBand network card and a 2x200G QSFP112 CX7 or BF3 network card.

 

The 1x400G OSFP CX7 NDR InfiniBand network card is used for expansion: it connects to other GH200 supercomputers, enabling supercomputing clusters that scale beyond 256 GPUs.
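
Across a full 256-GPU system, these expansion ports add up to a substantial scale-out fabric; the figure below is a simple derived estimate, not a whitepaper number:

# Aggregate IB scale-out bandwidth of a 256-Superchip system (derived estimate).
superchips = 256
ib_gbps_per_card = 400                   # one 400G NDR port per Superchip card
total_tbps = superchips * ib_gbps_per_card / 1000
print(total_tbps, total_tbps / 8)        # 102.4 Tb/s, i.e. 12.8 TB/s per direction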

 

The 2x200G QSFP112 CX7 or BF3 network card is primarily used for in-band management and storage. Likely, one 200G port is configured in Ethernet mode for in-band management, while the other is configured in InfiniBand mode for high-speed parallel storage.

NVLink Topology within a Chassis

The internal logic diagram above shows that each Grace Hopper Superchip connects to all 6 NVSwitch ASICs through its 18 fourth-generation NVLink links (3 links per ASIC). Each NVLink link comprises 2 NVLink ports, corresponding to 2 lanes of 112G PAM4. Each NVSwitch ASIC connects internally to the 8 Grace Hopper Superchips through 24 NVLink links, and externally to the L2 NVLink switches through another 24 NVLink links, which means each ASIC exposes six 800G OSFP ports externally.
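
These counts are internally consistent; the short check below walks through them, assuming four ~200Gb/s NVLink links are carried per 800G OSFP cage:

# Link accounting for one L1 NVSwitch ASIC (counts from the text above).
links_per_gpu_per_asic = 18 // 6              # 18 links spread over 6 ASICs = 3
down_links = 8 * links_per_gpu_per_asic       # 8 Superchips x 3 links = 24
up_links = 24                                 # matching up-links -> 1:1 non-blocking
links_per_osfp = 800 // 200                   # assumed: 4 x ~200Gb/s links per OSFP
print(down_links, up_links, up_links // links_per_osfp)   # 24 24 6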

 

From a chassis perspective, the 3 L1 NVLink switches contain 6 ASICs in total, and each ASIC connects to the 8 GPUs through 24 NVLink 4 links, giving a total bandwidth of 3.6TB/s (6 × 24 × 200Gb/s ÷ 8). Externally, each ASIC connects to 6 L2 NVLink switches through 24 NVLink 4 links, again giving 3.6TB/s (24 × 6 × 200Gb/s ÷ 8). The 36 L2 NVLink switches are likely divided into 6 groups of 6, with each group interconnecting the NVLink ASICs at the same position across the 32 chassis.

 

In total, the 32 chassis comprise 96 L1 NVLink switches and 36 L2 NVLink switches, forming a fully interconnected, 1:1 non-blocking fat-tree topology.
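
To confirm the topology really is 1:1, one can count the links on both tiers; the sketch below does so, assuming each NVSwitch ASIC exposes 64 NVLink 4 interfaces (half of the 128 quoted per two-ASIC switch):

# Two-tier NVLink fat-tree port accounting (counts from this article).
chassis = 32
l1_asics = chassis * 3 * 2            # 3 switches x 2 ASICs per chassis = 192
up_links = l1_asics * 24              # 24 up-links per L1 ASIC = 4608
l2_asics = 36 * 2                     # 36 L2 switches x 2 ASICs = 72
ifs_per_asic = 128 // 2               # assumed: 64 NVLink 4 interfaces per ASIC
assert up_links == l2_asics * ifs_per_asic    # 4608 == 4608 -> 1:1 non-blocking
per_chassis_tb_s = 3 * 2 * 24 * 25 / 1000     # 25 GB/s per link
print(per_chassis_tb_s, per_chassis_tb_s * chassis)   # 3.6 TB/s, 115.2 TB/s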

 

32 chassis

Four Additional Networks

In addition to the non-blocking compute network formed by NVLink and the NVLink switches for the 256 GPUs, the DGX GH200 includes four other networks, following a design approach similar to the DGX SuperPOD.

a. NDR400 IB Network

Using CX7 400G IB network cards and NDR 400G switches, a non-blocking fat-tree IB network can be formed. It supports features such as traffic optimization, SHARPv3, MPI, and gRPC, and can be extended to connect additional DGX GH200 supercomputer clusters.

b. NDR200 IB Storage Network

One of the two 200G QSFP112 ports on each Grace Hopper card's CX7 or BF3 adapter is dedicated to storage. A separate storage network maximizes storage performance without interference from the compute network. The recommendation is a read/write capability of 25GB/s per card and 450GB/s for the entire DGX GH200.
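
Note that the per-card target is exactly the line rate of one 200G port, while the system-wide target implies a far lower sustained average per node; a quick derived calculation (not a whitepaper figure):

# Storage targets vs. port line rate (derived from the figures above).
port_gb_s = 200 / 8                   # a 200G port peaks at 25 GB/s
system_target_gb_s = 450
nodes = 256
print(port_gb_s)                      # 25.0 -> matches the per-card target
print(system_target_gb_s / nodes)     # ~1.76 GB/s sustained average per node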

c. 200G In-Band Management Ethernet Network

The other port of the dual-port 200G adapter mentioned above can be configured in Ethernet mode. The in-band management network handles Base Command Manager automated deployment, SLURM or K8s job scheduling, the NFS (home directory) service, NGC Docker image downloads, and other functions.

d. 1G Out-of-Band Management Ethernet Network

This network connects to the BMC ports of each Grace Hopper Superchip card, the BMC port of the BF3, and the ComEx ports of the L1 and L2 NVLink switches, providing telemetry monitoring and firmware-upgrade capabilities.

DGX GH200 Software

As the figure below shows, the software stack of the DGX GH200 is essentially the same as that of the A100/H100 SuperPOD and divides into two parts.

The first is the Base Command system, which covers automated deployment and installation of DGX OS, management of the various networking and storage acceleration software libraries, and Base Command Manager for automated software deployment, health monitoring, and scheduling across the entire cluster. Base Command also integrates UFM (InfiniBand management and monitoring) and NetQ (Ethernet switch management and monitoring), with added monitoring for the NVLink switch system.

The second is the NVIDIA AI Enterprise framework, which provides customized AI software containers through the NGC software center, including pre-installed frameworks such as TensorFlow, PyTorch, and NeMo-Megatron, as well as application containers for recommendation systems, image recognition, text translation, HPC, and various other AI workloads.

 

NVIDIA AI Enterprise / NVIDIA UFM

The Global Fabric Manager (GFM) feature in NetQ also supports partitioning the 256-GPU system, achieving complete isolation of memory and performance between partitions.

NVLink Partitioning Examples in a 256-Node DGX GH200 System

GH200 vs H100

GH200 vs H100

The figure above depicts a performance comparison between a DGX GH200 and a cluster of 32 DGX H100 servers (8 GPUs each) connected over an IB network. By efficiently integrating 256 GPUs and CPUs, the GH200 overcomes the GPU memory and network-communication bottlenecks of the 32-server DGX H100 cluster, yielding improvements in recommendation systems, graph and data analytics, large language model training, and other areas.

DGX GH200 overall configuration parameters

DGX GH200 overall configuration parameters

Finally, let's look at the overall configuration parameters. The DGX GH200 can be configured with 32, 64, 128, or 256 GPU nodes, depending on customer requirements. NVIDIA has also announced that it will build a supercomputing cluster called "Helios" by the end of the year, connecting four DGX GH200 supercomputers over an NDR IB network for a total of 1,024 Grace Hopper Superchips.

Summary

The DGX GH200 AI supercomputer provides high-speed CPU and GPU memory access through NVLink-C2C technology and, with the high-bandwidth, low-latency connections enabled by NVLink, efficiently ties 256 CPUs and GPUs together. And these are only glimpses of the technology involved: the GH200 platform also brings innovations such as EGM, ATS (Address Translation Services), and CUDA API optimizations for memory access. Together, these technologies aim to create the most powerful supercomputing cluster currently available for HPC and AI applications. Of course, beyond performance, customers also need to consider whether their computing platform can support the transition from the x86 to the Arm architecture. In addition, the methods used today for 3D parallel partitioning of models and data in multi-node, multi-GPU clusters will also change on the GH200 supercomputing platform.