The High-Speed Road of GPUs

NADDOD · Adam, Connectivity Solutions Consultant · Oct 25, 2023

Some say that the CUDA software ecosystem is NVIDIA's moat. True enough, but it must also be said that NVIDIA's chips are formidable hardware in their own right.


That strength shows not only in the performance of individual GPUs but also in how they work together. A single GPU can hardly carry the training of an AI model with billions of parameters; it takes many GPUs cooperating.


As for the interconnection between GPUs, it has traditionally relied on PCIe, which has now reached its fifth generation (Gen5), offering up to 128 GB/s of total bidirectional bandwidth on an x16 link.


With NVIDIA's NVLink, however, the per-GPU transfer rate can reach 900 GB/s (on the Hopper-architecture H100), roughly 7 times PCIe Gen5. (The US Department of Commerce's export restriction on GPUs draws the line at an interconnect bandwidth of 600 GB/s, which is why both the H100 and the Ampere-architecture A100 made the list.)
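
How much of this bandwidth a given GPU pair actually achieves is easy to check empirically. Below is a minimal sketch (not NADDOD's code; the device IDs and the 1 GiB buffer are arbitrary assumptions) that times a device-to-device copy with CUDA events; on an NVLink-connected pair the measured rate far exceeds what PCIe can deliver:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ull << 30;                 // 1 GiB test buffer (arbitrary)
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // Take the direct peer-to-peer path (NVLink, if present) when available.
    int p2p = 0;
    cudaDeviceCanAccessPeer(&p2p, 1, 0);
    if (p2p) cudaDeviceEnablePeerAccess(0, 0);

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);

    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, s);   // warm-up copy
    cudaEventRecord(beg, s);
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, s);   // timed copy: GPU 0 -> GPU 1
    cudaEventRecord(end, s);
    cudaEventSynchronize(end);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, beg, end);
    printf("GPU0 -> GPU1: %.1f GB/s (direct P2P %s)\n",
           bytes / 1e9 / (ms / 1e3), p2p ? "on" : "off");
    return 0;
}
```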

[Figure: GPUs with 2nd- to 4th-generation NVLink]

Depending on the number of GPUs, NVLink can organize them into various connection topologies:

[Figure: NVLink topology options]
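
Which pairs of GPUs have a direct peer path can also be probed programmatically. The following sketch (illustrative only) prints a peer-access matrix, similar in spirit to the output of `nvidia-smi topo -m`:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);

    printf("P2P access matrix (%d GPUs):\n     ", n);
    for (int j = 0; j < n; ++j) printf("GPU%d ", j);
    printf("\n");

    for (int i = 0; i < n; ++i) {
        printf("GPU%d ", i);
        for (int j = 0; j < n; ++j) {
            int ok = (i == j);                   // a GPU trivially reaches itself
            if (i != j) cudaDeviceCanAccessPeer(&ok, i, j);
            printf("  %c  ", ok ? 'Y' : '-');
        }
        printf("\n");
    }
    return 0;
}
```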


However, such point-to-point (peer-to-peer) direct connections have limited scalability. To raise overall throughput further, NVSwitch is needed; it acts as a network switch for GPUs.

[Figure: GPUs networked through NVSwitch]


NVSwitch is itself a switch chip with compute capability (the third-generation part has even more transistors than a V100). It can handle communication-intensive tasks in AI training, such as reductions, which cuts their bandwidth consumption and offloads work from the connected GPUs.
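
In application code this offload is transparent: a collective library such as NCCL issues the reduction, and on NVSwitch-based systems the library can let the fabric execute it in-network (NVIDIA calls this NVLink SHARP). Here is a minimal single-process all-reduce sketch; the buffer size and the eight-GPU cap are arbitrary assumptions:

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n > 8) n = 8;                            // cap for the fixed arrays below

    // One communicator per GPU, all inside a single process (simplest setup).
    ncclComm_t comms[8];
    int devs[8];
    for (int i = 0; i < n; ++i) devs[i] = i;
    ncclCommInitAll(comms, n, devs);

    const size_t count = 1 << 24;                // 16M floats per GPU (arbitrary)
    float *buf[8];
    cudaStream_t streams[8];
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaMalloc((void **)&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Sum the buffers across all GPUs. On an NVSwitch system the reduction
    // can run inside the switch fabric rather than on the GPUs' SMs.
    ncclGroupStart();
    for (int i = 0; i < n; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```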


CPU-GPU Connection

Besides interconnecting GPUs, a derivative of NVLink, NVLink-C2C (Chip-to-Chip), can also connect CPUs to GPUs. This requires collaboration and customization with the relevant CPU vendor; the representative product is NVIDIA's Grace Hopper.


The Grace CPU is a chip NVIDIA designed around the Neoverse V2 core. Neoverse is ARM's processor line for data centers; V2 was introduced in September 2022 and is based on the ARMv9.0-A instruction set.


"Grace + Hopper" is a combination of the latest architectures of both, known as a super chip.

[Figure: the Grace Hopper superchip]


Compared with traditional CPU + GPU setups, the test data show Grace Hopper performing significantly better across a range of algorithms, living up to the "super" in its name.

[Figure: relative performance vs. a traditional CPU + GPU system]


One important reason for its exceptional performance is that it provides cache coherence at the hardware level.


Cache coherence here means keeping caches and memory consistent. When host memory on the CPU side (the blue region in the diagram below) is modified by CPU B, CPU A's cache can "perceive" the change by snooping the bus.


But what if host memory is modified by the GPU? The CPU can still be told to refresh the corresponding cache line through PCIe bus snooping (although some ARM chips do not support this feature).

[Figure: the GPU modifying host memory]
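
To make this concrete, here is a small sketch in which the GPU writes directly into mapped, pinned host memory; after synchronization the CPU simply reads the new value, with the bus keeping its caches coherent (the variable names are illustrative):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bump(int *p) { *p = 42; }        // GPU modifies host memory

int main() {
    int *h = nullptr, *d = nullptr;
    cudaHostAlloc((void **)&h, sizeof(int), cudaHostAllocMapped);
    *h = 0;
    cudaHostGetDevicePointer((void **)&d, h, 0); // device-side alias of h

    bump<<<1, 1>>>(d);
    cudaDeviceSynchronize();                     // order the GPU write before the CPU read

    printf("CPU sees %d\n", *h);                 // prints 42: the cache line was refreshed
    cudaFreeHost(h);
    return 0;
}
```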


However, if device memory (the green region in the diagram) is modified by the GPU, the CPU cannot perceive it: there is no coherence between the CPU's caches and device memory. NVLink-C2C closes this gap, letting the attached CPU perceive such changes and access device memory just as it accesses host memory (the two sides understand each other without a word being spoken).
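
On such a hardware-coherent platform, even an ordinary malloc() allocation can be handed straight to a kernel. The sketch below assumes a system with this capability, such as Grace Hopper; on GPUs or drivers without support for system-allocated memory, the device access would fault:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(double *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0;                      // GPU writes through the shared address space
}

int main() {
    const int n = 1 << 20;
    double *v = (double *)malloc(n * sizeof(double));  // plain system allocation, no cudaMalloc
    for (int i = 0; i < n; ++i) v[i] = 1.0;

    scale<<<(n + 255) / 256, 256>>>(v, n);       // same pointer, no explicit copy
    cudaDeviceSynchronize();

    printf("v[0] = %.1f\n", v[0]);               // CPU reads the GPU's result: 2.0
    free(v);
    return 0;
}
```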


Since cache coherence is maintained at the hardware level and NVLink provides ample bandwidth, page migration becomes optional in this UMA (Uniform Memory Access) scenario: a page may be migrated or left in place (freedom of migration). If a page is accessed mostly by the GPU (as detected through hardware access counters), it is preferable to move it into device memory, and vice versa.

[Figure: physical memory placement and page migration]
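
CUDA's managed memory exposes exactly this freedom: migration can be left to the driver (and its access counters), or steered with explicit hints. A sketch with arbitrary sizes:

```cpp
#include <cuda_runtime.h>

__global__ void touch(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *v = nullptr;
    cudaMallocManaged(&v, n * sizeof(float));

    // These pages will mostly be touched by GPU 0: prefer device memory...
    cudaMemAdvise(v, n * sizeof(float), cudaMemAdviseSetPreferredLocation, 0);
    // ...and migrate them ahead of the kernel instead of on first touch.
    cudaMemPrefetchAsync(v, n * sizeof(float), 0, 0);

    touch<<<(n + 255) / 256, 256>>>(v, n);
    cudaDeviceSynchronize();

    // If the CPU works on the data next, pull the pages back to host memory.
    cudaMemPrefetchAsync(v, n * sizeof(float), cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    cudaFree(v);
    return 0;
}
```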


Furthermore, under UMA, CPU programs and GPU programs use the same virtual addresses, and the gap between their physical memories has now been eliminated. With the VA (virtual address) the same and the PA (physical address) also the same, it follows naturally that the CPU and GPU can share page tables.

[Figure: the shared system page table]
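
Plain unified virtual addressing can already be observed on any modern CUDA system using managed memory, as in the sketch below: the CPU and the GPU print the same pointer value. What Grace Hopper adds, per the above, is that the two processors can also share the page tables themselves:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void show(int *p) {
    printf("GPU sees VA %p\n", p);               // same value the host prints
    *p = 7;
}

int main() {
    int *p = nullptr;
    cudaMallocManaged(&p, sizeof(int));
    printf("CPU sees VA %p\n", (void *)p);       // one pointer, valid on both sides

    show<<<1, 1>>>(p);
    cudaDeviceSynchronize();

    printf("value written by GPU: %d\n", *p);
    cudaFree(p);
    return 0;
}
```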