Brief Discussion on NVIDIA NVLink Network

NADDOD Gavin InfiniBand Network Engineer Jan 10, 2024

Introduction

With the rapid development of AI and HPC technologies, there is a growing demand for high-speed interconnectivity and scalability of GPUs. A high-bandwidth, low-latency interconnect is crucial for enhancing overall AI computing performance. At GTC 2022, NVIDIA CEO Jensen Huang introduced the third-generation NVIDIA NVSwitch and the fourth-generation NVLink technology. These give the newly released H100 GPU a faster point-to-point interconnect than the A100 had, forming the prototype of the NVLink network.

 

The third-generation NVSwitch, also known as NVSwitch3, can connect multiple GPU cards within a server and can also be extended outside the server to build an independent, complete high-speed GPU cluster. In addition, the NVSwitch chip incorporates hardware accelerators for multicast packet handling and introduces SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), previously available only in InfiniBand switches, which is mainly used to accelerate and optimize all-reduce computations in AI workloads.

 

Using physical switches built from third-generation NVSwitch chips, a cluster of up to 256 H100 GPU cards can be created, providing a total all-to-all bandwidth of 57.6TB/s. The NVLink 4.0 specification used in this solution also significantly enhances GPU performance and scalability: the parallel block structure of the GPU architecture maps onto the parallel structure of NVLink, and the NVLink interface further optimizes data exchange with the GPU's L2 cache.

(Figure: NVLink port interfaces)

NVLink


NVLink is a protocol designed for point-to-point communication between GPUs within a server. Traditional PCIe switches are bandwidth-limited: the latest PCIe 5.0 offers only 32Gbps per lane, which is insufficient for GPU-to-GPU communication. With NVLink, GPUs communicate with each other directly at high speed inside the server, without going through a PCIe switch. Each lane of fourth-generation NVLink runs at 112Gbps, roughly 3.5 times the rate of a single PCIe Gen5 lane.

(Figure: PCIe link performance)

The main purpose of NVLink is to provide a high-speed, point-to-point fabric for GPU interconnection. Unlike a traditional network, it avoids overheads such as end-to-end packet retransmission, adaptive routing, and packet reassembly. The highly streamlined NVLink interface is exposed to CUDA from the session layer through the presentation and application layers, further reducing the network overhead of communication.

(Figure: NVLink generations and parameters)

As the figure above shows, NVLink has evolved alongside the GPU architecture, from NVLink1 for the first-generation P100 GPU to today's NVLink4 designed for the H100. NVLink3 supports both 50G NRZ and 56G PAM4 signaling, while NVLink4 introduces 112G PAM4 SerDes for the first time, providing 900GB/s of bidirectional bandwidth per GPU, 1.5 times the 600GB/s of the previous-generation NVLink3.

 

So how is the 900GB/s calculated? Each H100 GPU connects to the internal NVSwitch3 chips through 18 NVLink4 links. Each NVLink4 link consists of two lanes of 112G PAM4 signaling, which yields an effective data rate of about 200Gbps per direction, i.e. 25GB/s unidirectional (note the bit-to-byte conversion) or 50GB/s bidirectional per link. Across 18 NVLink4 links, that gives 18 * 50GB/s = 900GB/s of total bidirectional bandwidth.
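As a quick sanity check, these per-link and per-GPU numbers can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and assumes the roughly 200Gbps effective unidirectional rate per NVLink4 link quoted above.

```python
# Rough NVLink4 bandwidth arithmetic (illustrative sketch, not an official tool).
# Assumption: ~200 Gbps of effective unidirectional data rate per NVLink4 link
# (2 lanes of 112G PAM4 signaling).

EFFECTIVE_GBPS_PER_LINK = 200          # effective unidirectional rate per NVLink4 link
LINKS_PER_H100 = 18                    # NVLink4 links per H100 GPU

gb_per_link_uni = EFFECTIVE_GBPS_PER_LINK / 8       # 25 GB/s per direction
gb_per_link_bidi = gb_per_link_uni * 2              # 50 GB/s both directions
gpu_bidi = gb_per_link_bidi * LINKS_PER_H100        # 900 GB/s per H100

print(f"per link: {gb_per_link_uni:.0f} GB/s uni, {gb_per_link_bidi:.0f} GB/s bidi")
print(f"per H100: {gpu_bidi:.0f} GB/s bidirectional NVLink bandwidth")
```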

NVSwitch chip

(Figure: NVSwitch chip parameters)

The NVSwitch chip is a physical chip similar to a switch ASIC. It enables high-speed interconnection among multiple GPUs through the NVLink interface, improving communication efficiency and bandwidth within a server. In the NVLink1 era of the P100 GPU there was no NVSwitch chip; the GPUs were connected in a ring-like configuration, so GPUs in different NUMA domains had no direct point-to-point path. The NVSwitch1 chip was introduced with the V100 GPU, followed by the NVSwitch2 chip with the A100 GPU, and the NVSwitch3 chip is designed for the H100 GPU.

 

(Figure: NVSwitch chip)

The chip is manufactured on TSMC's 4N process and packs 25.1 billion transistors into a 294 square millimeter die, with the package measuring 50mm x 50mm. It features a SHARP controller capable of managing 128 parallel SHARP groups, embedded SHARP ALUs (arithmetic logic units) for in-switch computation, and built-in SRAM to support SHARP calculations. The embedded ALUs give NVSwitch a computational throughput of 400 GFLOPS at FP32 and support FP16, FP32, FP64, and BF16 precisions.

 

The PHY interface is electrically compatible with 400Gbps Ethernet and NDR InfiniBand, and each OSFP cage carries four NVLink4 links, with FEC (Forward Error Correction) supported. The chip also offers security features that allow the NVLink network to be partitioned into subnets, and it provides telemetry monitoring similar to InfiniBand.

 

The NVSwitch3 chip provides 64 NVLink4 interfaces. As mentioned earlier, each NVLink4 has two lanes and delivers 200Gbps of unidirectional bandwidth, so a single NVSwitch3 chip provides 64 * 200Gbps = 12.8Tbps (1.6TB/s) of unidirectional bandwidth, or 3.2TB/s bidirectional.
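The same arithmetic extends to the switch chip; a minimal sketch, again assuming 200Gbps effective per NVLink4 port:

```python
# NVSwitch3 chip aggregate bandwidth (illustrative sketch).
NVLINK4_PORTS = 64            # NVLink4 interfaces per NVSwitch3 chip
GBPS_PER_PORT_UNI = 200       # effective unidirectional Gbps per NVLink4 port

uni_tbps = NVLINK4_PORTS * GBPS_PER_PORT_UNI / 1000   # 12.8 Tbps
uni_tb = uni_tbps / 8                                  # 1.6 TB/s
print(f"{uni_tbps} Tbps unidirectional = {uni_tb} TB/s, {uni_tb * 2} TB/s bidirectional")
```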

(Figure: 2x effective NVLink bandwidth)

The NVSwitch3 chip introduces the SHARP functionality, which allows for hardware-based aggregation and updating of calculation results from multiple GPU units during all-reduce operations. This reduces the number of network messages and improves computational performance.
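To see why in-network reduction helps, the toy model below compares the data each GPU must put on the wire for a conventional ring all-reduce with a SHARP-style reduction performed inside the switch. This is a simplified illustration under my own assumptions, not NVIDIA's actual implementation; the roughly 2x saving is consistent with the "2x effective NVLink bandwidth" figure above.

```python
# Toy traffic model: bytes each GPU sends for an all-reduce of `size` bytes.
# Ring all-reduce: each GPU sends about 2*(N-1)/N * size over the fabric.
# Switch-offloaded (SHARP-style): each GPU sends its contribution up once;
# the switch reduces and multicasts the result, so each GPU sends ~1x size.

def ring_allreduce_bytes_per_gpu(size_bytes: int, n_gpus: int) -> float:
    return 2 * (n_gpus - 1) / n_gpus * size_bytes

def in_network_reduce_bytes_per_gpu(size_bytes: int, n_gpus: int) -> float:
    return float(size_bytes)

size = 1 << 30  # a 1 GiB gradient buffer
for n in (8, 256):
    ring = ring_allreduce_bytes_per_gpu(size, n) / 2**30
    sharp = in_network_reduce_bytes_per_gpu(size, n) / 2**30
    print(f"N={n:3d}: ring sends {ring:.2f} GiB per GPU, in-network reduction sends {sharp:.2f} GiB")
```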

NVLink Server

(Figure: NVLink-enabled server generations)

NVLink servers refer to servers that utilize NVLink and NVSwitch technologies to interconnect GPUs. They are typically NVIDIA's own DGX series servers or OEM HGX servers that adopt similar architectures.

 

The DGX-1 server based on the P100 GPU has no NVSwitch; its 8 GPUs are interconnected through NVLink1, with each P100 providing 4 NVLink1 links. NVSwitch1 and NVLink2 arrived with the NVIDIA V100 architecture, providing high-bandwidth, any-to-any connectivity among the GPUs inside a server, and the NVIDIA A100 GPU introduced NVSwitch2 and NVLink3.

(Figure: DGX A100 internal diagram)

 

In the DGX A100 server, as shown in the internal diagram, the GPUs connect to the CPUs through PCIe switches, while the interconnection among the 8 GPUs relies on 6 NVSwitch2 chips. Each GPU connects to the NVSwitch2 chips through 12 NVLink3 links, each providing 25GB/s of unidirectional bandwidth, which gives a single GPU 12 * 25GB/s = 300GB/s of unidirectional bandwidth, or 600GB/s bidirectional.

 

Now let's take a look at the specifications of the DGX H100 server.

 

DGX H100

 

  • 8x NVIDIA H100 Tensor Core GPUs with 640GB of aggregate GPU memory

 

  • 4x third-generation NVIDIA NVSwitch chips

 

  • 18x NVLink Network OSFPs

 

  • 3.6 TB/s of full-duplex NVLink Network bandwidth provided by 72 NVLinks

 

  • 8x NVIDIA ConnectX-7 Ethernet/InfiniBand ports

 

  • 2x dual-port BlueField-3 DPUs

 

  • Dual Sapphire Rapids CPUs

 

  • Support for PCIe Gen 5

 

(Figure: DGX H100 data network configuration)

The H100 GPU brought with it the third-generation NVSwitch and the fourth-generation NVLink, which together provide 450GB/s of unidirectional NVLink bandwidth for a single H100 GPU. An external 1U NVLink switch was also introduced to enable high-speed communication across multiple GPU servers.

 

Inside the DGX H100 server, there are 8 H100 GPUs. Each H100 GPU is connected to 4 NVSwitch3 chips through 18 NVLinks in a configuration of (5,4,4,5). The traffic load between GPUs is distributed across the 4 switching planes, enabling all-to-all traffic within the GPUs. Each internal NVSwitch3 chip is connected to external NVLinks with a 2:1 convergence ratio. This design choice takes into consideration the complexity and cost of interconnecting servers.

NVLink Switch

The NVLink switch, released alongside the H100, is designed specifically for interconnecting H100 SuperPODs. It has a 1U form factor with 32 OSFP ports, each carrying 8 lanes of 112G PAM4. The switch is equipped with 2 NVSwitch3 chips, each providing 64 NVLink4 interfaces, so the two chips offer up to 128 NVLink4 interfaces, for an aggregate of 128 * 400Gbps = 51.2Tbps of bidirectional throughput, i.e. 3.2TB/s unidirectional or 6.4TB/s bidirectional.
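The port-level and link-level numbers agree, as the small cross-check below shows (a sketch assuming 800Gbps effective per OSFP port and 200Gbps per NVLink4):

```python
# NVLink switch (1U) aggregate bandwidth cross-check (illustrative sketch).
OSFP_PORTS = 32
GBPS_PER_OSFP = 800             # 8 lanes of 112G PAM4, ~800G effective per port
NVSWITCH3_CHIPS = 2
NVLINK4_PER_CHIP = 64

uni_tb_from_ports = OSFP_PORTS * GBPS_PER_OSFP / 8 / 1000       # 3.2 TB/s
nvlink4_ports = NVSWITCH3_CHIPS * NVLINK4_PER_CHIP              # 128
uni_tb_from_links = nvlink4_ports * 200 / 8 / 1000              # 3.2 TB/s

assert uni_tb_from_ports == uni_tb_from_links
print(f"{nvlink4_ports} NVLink4 ports: {uni_tb_from_links} TB/s uni, {uni_tb_from_links * 2} TB/s bidi")
```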

 

The NVLink switch supports out-of-band management and DAC cables, as well as AOC cables and OSFP modules with specific firmware. There is currently no public information on the exact form of the OSFP module, but it is presumably similar to the NDR OSFP form factor, with two MPO ports each carrying a 400G link, or a single 800G connection over a 24-fiber MPO cable.

(Figures: OSFP package; OSFP with MPO cables)

NVLink Network

Through the NVSwitch physical switch, multiple GPU servers can be interconnected to form a large fabric known as the NVLink network. This network addresses high-speed communication bandwidth and efficiency between GPUs; it does not replace the CPU-side compute network or the storage network.

 

Before the NVLink network, the GPUs inside each server shared a local address space and communicated with one another directly over NVLink. In the NVLink network, each server has its own independent address space, which provides data transmission, isolation, and security for the GPUs taking part in the network. Whereas ordinary NVLink connections are established when the system boots, NVLink network connections are set up at runtime through software APIs and can be changed during operation.

(Figure: NVLink network)

The following diagram compares the NVLink network with traditional Ethernet. It can be observed that the NVLink network, consisting of NVLink technology, NVSwitch chips, and NVSwitch switches, can form a separate network dedicated to GPUs, independent of the IP Ethernet.

(Figure: NVLink network vs. Ethernet network)

DGX H100 SuperPOD

(Figure: DGX H100 SuperPOD, AI exascale)

The Superpod consists of 8 racks, with each rack housing 4 DGX H100 servers, totaling 32 servers and 256 H100 GPU cards. It provides a total of 1 exaFLOP (1 billion billion floating-point operations per second) of FP8 precision sparse AI computing power.

 

The NVLink network within this SuperPOD delivers a combined all-to-all bidirectional bandwidth of 57.6TB/s across the 256 GPUs. Additionally, the CX7 adapters in the 32 DGX H100 servers can be interconnected through InfiniBand (IB) switches, providing 25.6TB/s of bidirectional bandwidth, which can be used within a single POD or to connect multiple SuperPODs.
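Both POD-level figures follow from the per-GPU numbers used earlier; here is a minimal sketch, assuming 900GB/s of bidirectional NVLink per H100 with the 2:1 external convergence, and one 400Gbps NDR port per GPU:

```python
# DGX H100 SuperPOD fabric bandwidth cross-check (illustrative sketch).
GPUS = 256
NVLINK_BIDI_PER_GPU_GB = 900    # GB/s bidirectional NVLink per H100
NVLINK_CONVERGENCE = 2          # 2:1 internal-to-external NVLink convergence
NDR_GBPS_PER_GPU = 400          # one 400Gbps NDR CX7 port per GPU

nvlink_all_to_all_tb = GPUS / 2 * NVLINK_BIDI_PER_GPU_GB / NVLINK_CONVERGENCE / 1000
ib_bidi_tb = GPUS * NDR_GBPS_PER_GPU * 2 / 8 / 1000

print(f"NVLink network all-to-all (bidirectional): {nvlink_all_to_all_tb} TB/s")  # 57.6
print(f"InfiniBand NDR fabric (bidirectional):     {ib_bidi_tb} TB/s")            # 25.6
```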

 

Currently, NVIDIA's published DGX H100 Superpod networking solutions are relatively fragmented. Based on my understanding, I have drawn the following rough sketch:

(Figure: DGX H100 SuperPOD network)

In the sketch, NVS refers to the NVSwitch3 chips described earlier, while L2NVS refers to the external NVSwitch physical switches. Within each DGX H100, every GPU extends 18 NVLink4 links northward, providing 18 * 50GB/s = 900GB/s of bidirectional bandwidth. These 18 links are divided into four groups of (5, 4, 4, 5) and connected to the four onboard NVSwitch3 chips, so across the 8 GPUs the four chips terminate 40, 32, 32, and 40 NVLink4 links respectively, 144 in total on the GPU-facing southbound side.

 

On the northbound side, the onboard NVSwitch3 chips connect to 18 external L2NVS switches (the 1U NVLink switches mentioned earlier), which are divided into four groups of 5, 4, 4, and 5. The four onboard chips together expose 72 NVLink4 links northbound (20, 16, 16, 20 per chip) against 144 southbound, giving the 2:1 convergence ratio. Each NVLink4 consists of 2 lanes of 112G PAM4, so every 4 NVLink links require a pair of 800G OSFP modules for end-to-end connectivity.
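The link counts above can be tallied programmatically. The sketch below assumes the (5, 4, 4, 5) grouping, the 2:1 convergence, and that one 800G OSFP carries four NVLink4 links, as described earlier:

```python
# DGX H100 NVLink topology bookkeeping (illustrative sketch).
GPUS_PER_SERVER = 8
LINK_GROUPS = (5, 4, 4, 5)      # NVLink4 links from each GPU to the 4 onboard NVSwitch3 chips
NVLINKS_PER_OSFP = 4            # one 800G OSFP (8 lanes) carries 4 NVLink4 links

southbound = [GPUS_PER_SERVER * g for g in LINK_GROUPS]   # [40, 32, 32, 40] GPU-facing links
northbound = [s // 2 for s in southbound]                 # 2:1 convergence -> [20, 16, 16, 20]
osfp_cages = sum(northbound) // NVLINKS_PER_OSFP          # 18 NVLink Network OSFPs per server

print("southbound per chip:", southbound, "total", sum(southbound))    # 144
print("northbound per chip:", northbound, "total", sum(northbound))    # 72
print("convergence:", sum(southbound) / sum(northbound), " OSFP cages:", osfp_cages)
```

The 18 OSFP cages recovered here match the "18x NVLink Network OSFPs" line in the DGX H100 specification above.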

 

In summary, the upper half of the diagram illustrates the high-speed network forming an NVLink network, enabling GPU all-to-all interconnectivity.


The GPUs connect to the CPUs through PCIe Gen5 switches on the path to the CX7 network cards. Unlike the previous DGX A100 with its 8 individual CX6 network cards, the DGX H100 packages its CX7 chips on two Cedar boards that plug into the server: each Cedar board carries 4 CX7 chips and exposes 2 800G OSFP ports. With 8 CX7 chips across the 2 Cedar boards, the system presents 4 OSFP 800G ports in total, delivering 4 * 800Gbps * 2 (directions) = 6.4Tbps, i.e. 800GB/s of bidirectional bandwidth. The CX7 network cards can operate in Ethernet mode with RoCE or as part of an NDR InfiniBand fabric.
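The compute-fabric bandwidth can be checked the same way; a small sketch, assuming the 4 external 800G OSFP ports described above:

```python
# DGX H100 ConnectX-7 compute-fabric bandwidth (illustrative sketch).
OSFP_800G_PORTS = 4                       # 2 Cedar boards x 2 ports (8 CX7 chips behind them)
uni_gbps = OSFP_800G_PORTS * 800          # 3200 Gbps one way
uni_gb = uni_gbps / 8                     # 400 GB/s
print(f"{uni_gb:.0f} GB/s unidirectional, {uni_gb * 2:.0f} GB/s bidirectional")
```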

 

In the diagram below, the four Superpods with a total of 1024 GPUs can be interconnected using a fat-tree topology in an NDR InfiniBand network.

(Figure: DGX A100 1K POD vs. DGX H100 1K POD)

Each DGX H100 is equipped with two BlueField3 cards to connect to the storage network.

InfiniBand Network vs. NVLink Network

After adopting the NVLink network for H100 GPUs, how much faster is it compared to the IB network of A100 GPUs? The following is a bandwidth comparison between a DGX A100 256 POD and a DGX H100 256 POD:

(Figure: DGX A100 256 POD vs. DGX H100 256 POD)

Bisection bandwidth is a metric for all-to-all scenarios in which every GPU must send data to all other GPUs simultaneously. It is typically calculated as the network bandwidth available when half of the nodes send data to the other half, assuming 1:1 non-blocking traffic.

 

For 1 DGX A100 internally: 8/2 * 600GB/s = 2400GB/s

For 32 DGX A100s (256 A100 GPUs in total), each GPU reaches the compute fabric through its own 200Gbps HDR network card, i.e. 25GB/s per direction or 50GB/s bidirectional:

256/2 * 50GB/s = 6400GB/s

 

For 1 DGX H100 internally: 8/2 * 900GB/s = 3600GB/s

For 32 DGX H100s (256 H100 GPUs), with the 2:1 NVLink convergence ratio:

256/2/2 * 900GB/s = 57600GB/s (which corresponds to the previously mentioned 57.6TB/s)
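All four figures can be reproduced with the same pattern: half of the GPUs talk to the other half at their per-GPU bidirectional bandwidth, divided by any convergence ratio. A minimal sketch, under the per-GPU assumptions used above:

```python
# Bisection bandwidth comparison, DGX A100 vs DGX H100 (illustrative sketch).
def bisection_gb(n_gpus: int, per_gpu_bidi_gb: float, convergence: float = 1.0) -> float:
    """Half the GPUs exchange data with the other half across the bisection."""
    return n_gpus / 2 * per_gpu_bidi_gb / convergence

print("1x DGX A100:  ", bisection_gb(8, 600), "GB/s")                    # 2400
print("32x DGX A100: ", bisection_gb(256, 50), "GB/s")                   # 6400 (200Gbps HDR per GPU)
print("1x DGX H100:  ", bisection_gb(8, 900), "GB/s")                    # 3600
print("32x DGX H100: ", bisection_gb(256, 900, convergence=2), "GB/s")   # 57600 (2:1 NVLink convergence)
```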

 

Compared with the DGX A100, a single DGX H100 provides 1.5x the bisection bandwidth and 3x the bidirectional bandwidth. At the scale of 32 DGX H100s, bisection bandwidth improves by 9x and bidirectional bandwidth by 4.5x.

(Figure: NVLink Switch System features 4.5x more bandwidth than maximum InfiniBand)

As shown in the figure, for training a recommendation system with a 14TB embedding table using an all-to-all data model, the H100 with the NVLink Switch System delivers higher performance than the H100 with InfiniBand alone.

NCCL

Below are publicly available bandwidth results for NCCL allreduce and alltoall operations, both across multiple GPUs within a server and across GPU nodes. Thanks to the optimizations in NVLink4 and NVSwitch3, the H100 achieves nearly the same bandwidth whether the GPUs sit inside one server or are spread across servers.

(Figure: AllReduce performance)

(Figure: AllToAll bandwidth)
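For reference, results like these are usually reported as NCCL "bus bandwidth", which scales the raw algorithm bandwidth by how much data the collective actually has to move. The sketch below summarizes the standard conversion documented for the nccl-tests benchmarks; it is my own restatement, not code taken from that suite.

```python
# Convert a measured collective time into NCCL-style "bus bandwidth".
# algbw = bytes / time; busbw multiplies algbw by the collective's traffic factor.

def allreduce_busbw_gb(size_bytes: int, seconds: float, n_ranks: int) -> float:
    algbw = size_bytes / seconds / 1e9              # GB/s of user data
    return algbw * 2 * (n_ranks - 1) / n_ranks      # all-reduce traffic factor

def alltoall_busbw_gb(size_bytes: int, seconds: float, n_ranks: int) -> float:
    algbw = size_bytes / seconds / 1e9
    return algbw * (n_ranks - 1) / n_ranks          # all-to-all traffic factor

# Hypothetical example: 8 GPUs reducing a 1 GiB buffer in 5 ms.
print(f"{allreduce_busbw_gb(1 << 30, 5e-3, 8):.1f} GB/s bus bandwidth")
```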

 

Summary

NVLink and NVSwitch were developed to meet the demands of high-speed, low-latency point-to-point and point-to-multipoint communication among multiple GPUs, and they have kept evolving with each generation of GPU architecture. Since acquiring Mellanox, NVIDIA has also begun combining NVLink with InfiniBand technology, introducing a new generation of NVSwitch chips and switches with SHARP functionality that are optimized for networks of GPU servers. The current NVLink network scale of up to 256 GPUs is just the beginning; this scale is expected to grow, potentially leading to a supercomputing cluster that integrates AI computing, CPU computing, and storage networks into one cohesive system.

 

Naddod provides high-quality InfiniBand NDR 400G/800G, HDR 200G, and EDR 100G AOC and DAC products for server clusters that require low latency, high bandwidth, and reliability. Our products offer exceptional performance while reducing cost and complexity. With multiple successful deliveries and real-world application cases, we maintain the highest quality standards, and we keep sufficient inventory to meet your needs promptly.

 

In addition to offering third-party high-quality optical modules, we also stock a wide range of original NVIDIA products, providing you with more options. Contact us now to learn more details!


Naddod - Your trusted supplier of optical modules and high-speed cables!

 

Resource Links:

https://developer.nvidia.com/blog/upgrading-multi-gpu-interconnectivity-with-the-third-generation-nvidia-nvswitch/

https://www.servethehome.com/nvidia-nvlink4-nvswitch-at-hot-chips-34/

https://www.servethehome.com/nvidia-nvswitch-details-at-hot-chips-30/

https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s42663/

https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41784/