How to Optimize GPU Communication for AI Clusters
GPUs play a crucial role in high-performance computing and in accelerating deep learning tasks. Their powerful parallel computing capabilities significantly improve computational performance. As the volume of data being processed continues to grow, GPUs need to exchange large amounts of data with one another, so GPU communication performance becomes a critical metric.
In distributed training within AI clusters, communication is a necessary component and adds additional system overhead compared to single-machine training. The ratio of communication to computation time often determines the upper limit of the acceleration ratio for distributed machine learning systems.
Hence, the key to distributed machine learning lies in designing efficient communication mechanisms to reduce the ratio of communication to computation time and effectively train high-precision models.
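As a rough illustration of why this ratio caps the achievable speedup, the sketch below uses a deliberately simple model. The model is an assumption of this example, not a measurement: computation divides evenly across workers, communication costs a fixed time per step, and the two never overlap. The function and parameter names are hypothetical.

```python
def speedup(num_workers: int, compute_time: float, comm_time: float) -> float:
    """Speedup over a single worker under a simple non-overlapping model.

    Assumptions (illustrative only): compute_time is the single-worker
    time per step and splits evenly across workers; comm_time is a fixed
    per-step communication cost paid only when num_workers > 1.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    if num_workers == 1:
        return 1.0
    step_time = compute_time / num_workers + comm_time
    return compute_time / step_time


def speedup_limit(compute_time: float, comm_time: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    if comm_time == 0:
        return float("inf")
    return compute_time / comm_time


if __name__ == "__main__":
    for n in (2, 8, 64):
        print(n, "workers:", round(speedup(n, compute_time=1.0, comm_time=0.1), 2))
    print("limit:", speedup_limit(1.0, 0.1))
```

Under this model, if communication takes 10% of the single-worker compute time, the speedup can never exceed 10x no matter how many workers are added. Real systems overlap communication with computation, so this is only a way to reason about the trend.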
Below, we will introduce the hardware for communication in AI clusters.
Understanding GPU Machine Communication
There are two types of communication implementations: intra-machine communication and inter-machine communication.
Intra-machine communication:
- Shared memory (QPI/UPI) - Communication between CPUs within the same machine can be achieved using shared memory.
- PCIe - Typically used for communication between CPUs and GPUs within a machine.
- NVLink - Primarily used for communication between GPUs within a machine, but can also be used for CPU-GPU communication.
Inter-machine communication:
- TCP/IP network protocol
- RDMA (Remote Direct Memory Access) network protocol
PCI-Express (Peripheral Component Interconnect Express), abbreviated as PCIe, is a high-speed serial computer expansion bus standard primarily used to increase the data throughput of computer systems and improve device communication speed.
PCIe is essentially a full-duplex connection bus, and the amount of data transferred is determined by the number of channels (lanes).
Each individual connection channel, or lane, is referred to as x1. Each lane consists of two pairs of data lines: one pair for transmission and one pair for reception. Each pair is a differential pair made up of two lines. An x1 link therefore has a single lane with four data lines and can transmit 1 bit per clock cycle in each direction. Following this pattern, x2 has 2 lanes with 8 data lines and transmits 2 bits per clock cycle in each direction. Similarly, x12, x16, and x32 configurations have 12, 16, and 32 lanes, respectively.
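The lane arithmetic above can be written down as a small helper. This is only a sketch of the counting rule described in the text; the function name and the dictionary keys are made up for illustration.

```python
WIRES_PER_LANE = 4  # two differential pairs: one to transmit, one to receive


def lane_summary(lanes: int) -> dict:
    """Return wire count and per-cycle bits for a link with the given lane count."""
    if lanes < 1:
        raise ValueError("lanes must be >= 1")
    return {
        "lanes": lanes,
        "data_lines": lanes * WIRES_PER_LANE,
        # One bit per lane per clock cycle, in each direction.
        "bits_per_cycle_each_direction": lanes,
    }


if __name__ == "__main__":
    for width in (1, 2, 16):
        print(f"x{width}:", lane_summary(width))
```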
PCIe 1.0 was officially released in 2003, with a per-lane signaling rate of 2.5 GT/s, which corresponds to a usable per-lane bandwidth of 250 MB/s in each direction.
In 2007, the PCIe 2.0 specification was introduced. It doubled the per-lane signaling rate of PCIe 1.0 to 5 GT/s, raising the usable per-lane bandwidth from 250 MB/s to 500 MB/s.
In 2022, the PCIe 6.0 specification was officially released, raising the per-lane signaling rate to 64 GT/s.
In June 2022, the PCI-SIG consortium announced the PCIe 7.0 specification, which allows for a single-lane (x1) one-way transfer rate of 128 GT/s. The final version is planned to be released in 2025.
Calculation of PCIe Throughput (Available Bandwidth):
Throughput = Transfer Rate * Encoding Efficiency
The transfer rate is measured in Gigatransfers per second (GT/s) rather than Gigabits per second (Gbps) because the transfer rate includes overhead bits that do not provide additional throughput. For example, PCIe 1.x and PCIe 2.x use an 8b/10b encoding scheme, in which encoding overhead occupies 20% (= 2/10) of the raw channel bandwidth.
- GT/s, or Gigatransfers per second, represents the number of transfers per second and describes the signaling rate of the physical layer. It is quoted per lane, counts every transmitted bit including encoding overhead, and does not depend on link width.
- Gbps, or Gigabits per second, represents the rate of data transmission in billions of bits per second. There is no fixed conversion between GT/s and Gbps; the result depends on the encoding scheme in use.
The PCIe 2.0 protocol supports 5.0 GT/s, meaning that each lane can transfer 5 billion bits per second. However, this does not imply that each lane of the PCIe 2.0 protocol supports a rate of 5 Gbps. This is because the physical layer protocol of PCIe 2.0 uses an 8b/10b encoding scheme, where 10 bits are transmitted for every 8 meaningful bits, introducing additional overhead. Therefore, each lane of the PCIe 2.0 protocol supports a rate of 5 * 8/10 = 4 Gbps = 500 MB/s. Taking a PCIe 2.0 x8 channel as an example, the available bandwidth of x8 is 4 * 8 = 32 Gbps = 4 GB/s.
Similarly, the PCIe 3.0 protocol supports 8.0 GT/s, meaning that each lane can transfer 8 billion bits per second. PCIe 3.0 uses a 128b/130b encoding scheme, where 130 bits are transmitted for every 128 meaningful bits. Therefore, each lane of the PCIe 3.0 protocol supports a rate of 8 * 128/130 = 7.877 Gbps = 984.6 MB/s. Taking a PCIe 3.0 x16 channel as an example, the available bandwidth of x16 is 7.877 * 16 = 126.032 Gbps = 15.754 GB/s.
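The calculation above can be sketched in a few lines of Python. The table only covers the generations whose signaling rate and encoding are both given in the text; the names `PCIE_GENERATIONS`, `lane_gbps`, and `link_gbytes_per_s` are hypothetical helpers for this example.

```python
# Per-lane signaling rate (GT/s) and encoding efficiency, as stated in the text.
PCIE_GENERATIONS = {
    "1.0": (2.5, 8 / 10),     # 8b/10b encoding
    "2.0": (5.0, 8 / 10),     # 8b/10b encoding
    "3.0": (8.0, 128 / 130),  # 128b/130b encoding
}


def lane_gbps(generation: str) -> float:
    """Usable bandwidth of one lane, one direction, in Gbps."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    return rate_gt_s * efficiency


def link_gbytes_per_s(generation: str, lanes: int) -> float:
    """Usable bandwidth of a whole link, one direction, in GB/s."""
    return lane_gbps(generation) * lanes / 8  # 8 bits per byte


if __name__ == "__main__":
    for gen, lanes in (("2.0", 8), ("3.0", 16)):
        print(
            f"PCIe {gen} x{lanes}: "
            f"{lane_gbps(gen):.3f} Gbps per lane, "
            f"{link_gbytes_per_s(gen, lanes):.3f} GB/s per direction"
        )
```

Because the sketch does not round the per-lane figure before multiplying, the PCIe 3.0 x16 result can differ from the hand calculation in the last decimal place.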
A PCIe architecture typically includes a Root Complex (RC), switches, and Endpoint (EP) devices. The RC is a single component in the bus architecture that connects the processor and memory subsystem to the I/O devices. A switch appears to configuration software as two or more logical PCI-to-PCI bridges, which maintains compatibility with existing PCI software. Its function is to provide connectivity and routing between different PCIe devices.
Improvements in computational power rely not only on the performance of a single GPU card but also on combining multiple GPU cards. In a multi-GPU system, communication between GPUs typically calls for bandwidth of several hundred GB/s or more. The data transfer rate of the PCIe bus can therefore easily become a bottleneck, and serialization and deserialization at the PCIe link interface can add significant latency, reducing the efficiency and performance of GPU parallel computing.
Signals sent by a GPU must first travel to the PCIe switch, where the data is processed, and the CPU then distributes and schedules the data. All of these steps add latency, which limits system performance.
To address these limitations, NVIDIA introduced a technology called GPUDirect P2P, which enhances GPU communication performance. It allows a GPU to directly access the video memory of a target GPU through PCI Express, eliminating the intermediate copy to CPU host memory. This significantly reduces data exchange latency. However, because of limitations in the PCI Express bus protocol and the system's topology, the achievable bandwidth remains limited. Subsequently, NVIDIA introduced the NVLink bus protocol as a solution.
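One way to see whether direct peer access is available on a given machine is to query it from a framework. The sketch below assumes PyTorch is installed and uses its `torch.cuda.can_device_access_peer` call; the helper name `peer_access_matrix` is hypothetical. Note that the call reports peer-to-peer capability in general and does not say whether the path runs over PCIe or NVLink.

```python
from typing import List


def peer_access_matrix() -> List[List[bool]]:
    """Return m where m[i][j] is True if GPU i can directly access GPU j.

    Returns an empty list when PyTorch is not installed or no CUDA
    devices are visible, so the sketch is safe to run on any machine.
    """
    try:
        import torch
    except ImportError:
        return []

    count = torch.cuda.device_count()
    matrix = []
    for i in range(count):
        row = []
        for j in range(count):
            # A device trivially reaches its own memory; only query distinct pairs.
            row.append(i == j or bool(torch.cuda.can_device_access_peer(i, j)))
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for i, row in enumerate(peer_access_matrix()):
        print(f"GPU {i}:", ["yes" if ok else "no" for ok in row])
```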
Introduction to NVLink
NVLink is a high-speed interconnect technology designed to accelerate data transfer between CPUs and GPUs, as well as between multiple GPUs, in order to improve system performance. By enabling direct interconnection between GPUs, NVLink allows for scalable multi-GPU I/O within servers and provides a more efficient, lower-latency interconnect than the traditional PCIe bus.
The first version of NVLink was released in 2014, introducing high-speed GPU interconnect. The P100, released in 2016, featured the first generation of NVLink, providing a bandwidth of 160 GB/s, which was five times that of PCIe 3.0 x16 bandwidth (bi-directional) at that time. Subsequently, several new versions were released. The NVLink2 in the V100 increased the bandwidth to 300 GB/s, while the NVLink3 in the A100 offered a bandwidth of 600 GB/s. The H100 includes 18 fourth-generation NVLink links, achieving a total bandwidth (bi-directional) of 900 GB/s, which is seven times that of PCIe 5.0 x16 bandwidth (bi-directional).
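The "five times" and "seven times" comparisons can be checked with a short calculation. The NVLink totals and the 18-link count come from the text; the PCIe 5.0 signaling rate of 32 GT/s per lane and its 128b/130b encoding are not stated in this article and are filled in here as known values for the comparison. Function names are hypothetical.

```python
def pcie_x16_bidirectional_gb_s(gt_per_s: float) -> float:
    """Usable GB/s of an x16 link, both directions, assuming 128b/130b encoding."""
    per_direction = gt_per_s * (128 / 130) * 16 / 8
    return 2 * per_direction


def nvlink_to_pcie_ratio(nvlink_gb_s: float, pcie_gt_per_s: float) -> float:
    """How many times larger the NVLink total is than the PCIe x16 total."""
    return nvlink_gb_s / pcie_x16_bidirectional_gb_s(pcie_gt_per_s)


# Total bi-directional NVLink bandwidth per GPU in GB/s, as given in the text.
NVLINK_TOTAL_GB_S = {"P100": 160, "V100": 300, "A100": 600, "H100": 900}
H100_LINKS = 18  # fourth-generation NVLink links on the H100, from the text

if __name__ == "__main__":
    p100 = nvlink_to_pcie_ratio(NVLINK_TOTAL_GB_S["P100"], 8.0)   # vs PCIe 3.0
    h100 = nvlink_to_pcie_ratio(NVLINK_TOTAL_GB_S["H100"], 32.0)  # vs PCIe 5.0
    print(f"P100 NVLink vs PCIe 3.0 x16: {p100:.1f}x")
    print(f"H100 NVLink vs PCIe 5.0 x16: {h100:.1f}x")
    print("H100 per-link bandwidth:", NVLINK_TOTAL_GB_S["H100"] / H100_LINKS, "GB/s")
```

The ratios come out slightly above 5 and 7 because the sketch uses the usable PCIe bandwidth after encoding overhead rather than the rounded headline figures.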
NVLink high-speed interconnect can be implemented in two main ways:
- It can take the form of bridges or adapters.
- It can be integrated into the motherboard with NVLink interfaces.
To address the issue of imbalanced GPU communication, NVIDIA introduced NVSwitch. The NVSwitch chip is a physical ASIC (application-specific integrated circuit) similar to a switch. It enables high-speed interconnection of multiple GPUs through the NVLink interface, allowing the creation of seamless, high-bandwidth multi-node GPU clusters. This facilitates collaborative work among all GPUs in a cluster with full-bandwidth connections, thereby improving communication efficiency and bandwidth among multiple GPUs within a server. The combination of NVLink and NVSwitch enables NVIDIA to efficiently scale AI performance across multiple GPUs.
The first generation of NVSwitch was released in 2018, manufactured using TSMC's 12nm FinFET process and featuring 18 NVLink 2.0 interfaces. NVSwitch has since iterated to the third generation. The third-generation NVSwitch is built using TSMC's 4N process, which is custom-designed and optimized for NVIDIA. Compared to TSMC's regular 5nm node, the 4N process offers better power efficiency, performance, and increased density. Each NVSwitch chip has 64 NVLink 4.0 ports, enabling GPU-to-GPU communication speeds of up to 900GB/s.
Differences between NVIDIA GPU servers with PCIe and SXM versions
NVIDIA GPUs are available in two form factors: PCIe and SXM.
PCIe is a widely used, general-purpose interface, but it is slower than SXM. SXM is a socket form factor designed specifically for interconnecting GPUs, with the connections built into the circuit board. SXM allows faster inter-GPU communication, provides better native support for NVLink, and offers higher memory bandwidth than the PCIe version. Both PCIe and SXM GPUs can use NVLink, but SXM is the preferred way to use it.
The SXM architecture is a high-bandwidth socketed solution used to connect GPUs to NVIDIA's proprietary DGX and HGX systems. SXM GPUs are connected through NVLink via the NVSwitch integrated on the motherboard, without needing to communicate through PCIe on the motherboard. SXM supports interconnection of up to eight GPU cards, enabling high-bandwidth communication between GPUs. The A100 variant achieves a bandwidth of 600 GB/s, while the H100 variant achieves 900 GB/s. The A800 and H800 variants provide a bandwidth of 400 GB/s.
On the other hand, PCIe version GPU cards are inserted into PCIe slots and communicate with CPUs and other GPU cards within the same server over PCIe. They can also communicate with devices on other server nodes via network cards. However, PCIe communication has slower transmission speeds. To get closer to the speeds of SXM, an NVLink bridge can be used for communication between GPUs. Unlike SXM, however, the PCIe version can only link GPUs in pairs: an NVLink bridge connects two GPUs, and communication beyond each pair still goes through PCIe channels. Additionally, PCIe bandwidth currently tops out at 128 GB/s (PCIe 5.0 x16, bi-directional).
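To put these bandwidth figures side by side, the sketch below estimates how long an idealized transfer would take over each link. It uses only the peak figures quoted in this article and ignores latency, protocol overhead, and the split between directions, so the results are lower bounds for illustration rather than expected measurements. The names and the 10 GB payload are made up for the example.

```python
# Peak bandwidth figures in GB/s, as quoted in the text.
LINK_BANDWIDTH_GB_S = {
    "PCIe": 128,
    "SXM A800/H800 (NVLink)": 400,
    "SXM A100 (NVLink)": 600,
    "SXM H100 (NVLink)": 900,
}


def transfer_time_ms(size_gb: float, bandwidth_gb_s: float) -> float:
    """Idealized time in milliseconds to move size_gb at the given peak bandwidth."""
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth_gb_s must be positive")
    return size_gb / bandwidth_gb_s * 1000


if __name__ == "__main__":
    payload_gb = 10.0  # hypothetical amount of data exchanged between GPUs
    for name, bandwidth in LINK_BANDWIDTH_GB_S.items():
        print(f"{name}: {transfer_time_ms(payload_gb, bandwidth):.1f} ms")
```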
In the next article, we will introduce inter-machine communication in AI clusters. Stay tuned for updates on the Naddod website.