NVLink, InfiniBand, and RoCE in AI GPU Interconnect Technologies - NADDOD Blog

NADDOD Brandon InfiniBand Technical Support Engineer Dec 11, 2023

In the AI era, GPUs have become the core processors because their parallel architecture suits large volumes of simple, repetitive computation. They excel at tasks such as image processing and AI inference. However, as large models grow more complex, the memory of a single GPU no longer meets training requirements. For example, once the optimizer and other training state are taken into account, an 80GB GPU such as the A800 can hold only around 1-2 billion parameters, so a model with 260 billion parameters would need roughly 130-260 GPUs just to be stored. Training subsequent, larger models requires even more parameters and computing power, which drives significant demand for GPUs.
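The sizing argument above is a simple division, sketched below. The 260 billion parameter model and the 1-2 billion parameters per 80GB GPU are the article's figures; the function name `gpus_needed` is only illustrative, and the estimate ignores parallelism overheads, so treat it as a rough lower bound rather than a capacity plan.

```python
import math


def gpus_needed(total_params: float, params_per_gpu: float) -> int:
    """Smallest whole number of GPUs whose combined capacity covers the model."""
    if total_params <= 0 or params_per_gpu <= 0:
        raise ValueError("parameter counts must be positive")
    return math.ceil(total_params / params_per_gpu)


MODEL_PARAMS = 260e9  # model size quoted in the article

# Per-GPU capacity of 1-2 billion parameters, as quoted in the article.
best_case = gpus_needed(MODEL_PARAMS, 2e9)
worst_case = gpus_needed(MODEL_PARAMS, 1e9)

print(f"GPUs needed just to hold the model: {best_case} to {worst_case}")
```

Taking the per-GPU capacity at face value, the count lands in the low hundreds, before any extra GPUs are added to shorten training time.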

To meet the computational needs, it is essential to combine multiple GPUs or even multiple servers to work collaboratively. Distributed training has become the core training approach.

 

  1. Network connectivity plays a critical role in distributed systems. In a distributed system, the network provides the necessary connections, which fall into three categories: single-card, multi-card, and multi-machine interconnection. Within a single card, the "network" is the neural network being computed. The interconnection between multiple cards (i.e., GPU interconnect) typically uses PCIe or one of several high-bandwidth communication networks. The interconnection between multiple machines (i.e., server interconnect) commonly uses RDMA networks.

 

  2. The bus is an essential conduit for data communication, and PCIe is the most widely used bus protocol. The bus is the channel through which hardware components on a server motherboard exchange data, so it plays a crucial role in determining data transfer speed. The most prevalent bus protocol today is PCIe (PCI Express), introduced by Intel in 2001. PCIe primarily connects the CPU to other high-speed devices such as GPUs, SSDs, and network cards. Since PCIe 1.0 was released in 2003, a new generation has followed approximately every three years. The latest version, PCIe 6.0, provides a transfer rate of up to 64GT/s per lane and a bidirectional bandwidth of about 256GB/s across 16 lanes, with each generation improving performance and scalability.

 

  3. The tree topology and point-to-point transmission of PCIe limit the number and speed of connections, which led to the emergence of PCIe switches. PCIe uses point-to-point links, so each end of a PCIe link can connect to only one device. This cannot meet the needs of scenarios with many attached devices or with high-speed data transfer between them. PCIe switches were developed to fill the gap. They provide both connection and switching functions, allowing a single PCIe port to discover and connect to multiple devices and so easing the shortage of lanes. They can also interconnect multiple PCIe buses to form a high-speed network in which multiple devices communicate with each other. In short, a PCIe switch acts as an expander for PCIe connections.
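The PCIe bandwidth figures above can be derived from the per-lane transfer rate, as in the rough sketch below. The per-generation rates and the 8b/10b and 128b/130b encodings are the commonly published values; for PCIe 6.0 the sketch assumes an encoding efficiency of 1.0 and ignores FLIT and FEC overhead, so its result is an upper bound. The table and function names are illustrative only.

```python
# Per-lane transfer rate in GT/s and line-encoding efficiency per generation.
PCIE_GENERATIONS = {
    "1.0": (2.5, 8 / 10),      # 8b/10b encoding
    "2.0": (5.0, 8 / 10),      # 8b/10b encoding
    "3.0": (8.0, 128 / 130),   # 128b/130b encoding
    "4.0": (16.0, 128 / 130),
    "5.0": (32.0, 128 / 130),
    "6.0": (64.0, 1.0),        # assumption: FLIT/FEC overhead ignored
}


def pcie_bandwidth_gbytes(gen: str, lanes: int = 16, bidirectional: bool = False) -> float:
    """Approximate usable bandwidth in GB/s for a PCIe link of the given width."""
    rate_gt, efficiency = PCIE_GENERATIONS[gen]
    per_direction = rate_gt * efficiency / 8 * lanes  # 8 bits per byte
    return per_direction * 2 if bidirectional else per_direction


for gen in PCIE_GENERATIONS:
    one_way = pcie_bandwidth_gbytes(gen)
    both = pcie_bandwidth_gbytes(gen, bidirectional=True)
    print(f"PCIe {gen} x16: {one_way:6.1f} GB/s per direction, {both:6.1f} GB/s bidirectional")
```

Under these assumptions, the 256GB/s quoted for PCIe 6.0 corresponds to both directions of an x16 link combined; each direction carries about half of that.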

PCIe Switch

As GPU interconnect became the norm, the transfer rate and latency of PCIe could no longer keep up with demand, which led to alternative protocols such as NVLink, CAPI, Gen-Z, CCIX, and CXL. The growth of AIGC (AI-generated content) sharply increased the demand for computing power, and combining multiple GPU cards became a trend. GPU interconnect typically needs bandwidth of several hundred GB/s, so the data transfer rate of PCIe becomes a bottleneck. Several factors also add latency. The serial-parallel conversion at the link interfaces introduces delay and reduces the efficiency of GPU parallel computing. Signals sent by a GPU must pass through PCIe switches, whose data processing adds further delay. In addition, the PCIe bus address space is separate from memory addresses, so every memory access incurs extra latency. As a result, PCIe is not efficient for multi-GPU communication. To improve bus communication efficiency and reduce latency, various companies have introduced alternative protocols:

 

CAPI Protocol: Initially introduced by IBM and later evolved into OpenCAPI, it is an application extension built on existing high-speed I/O standards that adds features such as cache coherence and lower latency. However, as IBM's server market share continued to decline, CAPI lacked a user base and never gained widespread adoption.

 

Gen-Z Protocol: Gen-Z is maintained by an open organization that is independent of any chip platform and counts numerous manufacturers as members, including AMD, ARM, IBM, Nvidia, and Xilinx. Gen-Z extends the bus protocol into a switched network and adds Gen-Z switches to improve scalability.

 

CXL Protocol (incorporating the above two protocols): Introduced by Intel in 2019, it shares a similar concept with CAPI. In late 2021 it absorbed the Gen-Z protocol for joint development, and in 2022 it absorbed OpenCAPI as well. CXL includes memory interfaces and has gradually become one of the leading device interconnect standards.

 

CCIX Protocol: Another open protocol, backed by ARM among others, with functionality similar to Gen-Z. It was not part of the mergers described above.

 

NVLink Protocol: A high-speed GPU interconnect protocol proposed by NVIDIA. Compared with the traditional PCIe bus protocol, NVLink makes significant changes in three areas: 1) support for mesh topologies, which addresses the limited number of channels; 2) unified memory, which lets GPUs share a common memory pool, reducing the need to copy data between GPUs and improving efficiency; 3) direct memory access, which lets GPUs read and write each other's memory without CPU involvement and so reduces latency. In addition, to address unbalanced communication between GPUs, NVIDIA introduced NVSwitch, a physical chip similar to a switch ASIC. NVSwitch connects multiple GPUs through NVLink interfaces to create high-bandwidth, multi-node GPU clusters. On May 29, 2023, NVIDIA launched the DGX GH200 AI supercomputer, which connects 256 GH200 chips through NVLink and NVSwitch. All GPUs are interconnected and work together, with access to more than 100TB of memory.

DGX GH200
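The bandwidth argument in this section can be made concrete with a rough estimate of how long one gradient synchronization takes over links of different speeds. The sketch below assumes a ring all-reduce, a common collective algorithm that the article itself does not name, and a bandwidth-only cost model that ignores latency. The payload size, GPU count, and both link bandwidths are hypothetical values chosen for illustration, not measurements of any product.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int, link_gbytes_per_s: float) -> float:
    """Bandwidth-only time estimate for a ring all-reduce.

    In a ring all-reduce each GPU sends (and receives) 2 * (N - 1) / N times
    the payload size. Per-message latency is ignored.
    """
    if num_gpus < 2:
        return 0.0  # nothing to synchronize with a single GPU
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_bytes / (link_gbytes_per_s * 1e9)


# Hypothetical workload: 2 billion fp16 gradients (2 bytes each) across 8 GPUs.
GRADIENT_BYTES = 2e9 * 2
NUM_GPUS = 8

# Illustrative per-GPU link bandwidths in GB/s (assumptions, not vendor specs).
LINKS = {
    "PCIe-class link": 64,
    "NVLink-class link": 400,
}

for name, bandwidth in LINKS.items():
    seconds = ring_allreduce_seconds(GRADIENT_BYTES, NUM_GPUS, bandwidth)
    print(f"{name} ({bandwidth} GB/s): {seconds * 1000:.1f} ms per all-reduce")
```

Because this synchronization repeats at every training step, a several-fold difference in link bandwidth translates directly into the share of each step spent waiting on communication.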

Multi-Device Network Interconnection: InfiniBand and Ethernet Coexist

In distributed training, RDMA (Remote Direct Memory Access) networks have become the preferred choice, whether built on InfiniBand (IB) or on Ethernet. Traditional TCP/IP communication sends messages through the kernel, which requires data to be moved and copied, making it unsuitable for high-performance computing, big data analytics, and other scenarios that need high-concurrency I/O and low latency. RDMA is a networking technology that lets one machine access memory on a remote machine directly, bypassing the kernel and consuming very little CPU. It significantly improves data transfer performance and reduces latency, so it is better suited to the networking needs of large-scale parallel computing clusters. There are currently three types of RDMA: InfiniBand, RoCE (RDMA over Converged Ethernet), and iWARP (Internet Wide Area RDMA Protocol). The latter two are Ethernet-based.

RDMA API

InfiniBand: A network designed specifically for RDMA. It guarantees reliable transmission at the hardware level and provides higher bandwidth and lower latency. However, it is expensive and requires dedicated InfiniBand network cards and switches.

 

RoCE (RDMA over Converged Ethernet): RDMA carried over Ethernet. It can use standard Ethernet switches, which lowers cost, but it requires network cards that support RoCE.

 

iWARP: An RDMA network based on TCP, which it relies on for reliable transmission. Compared with RoCE, iWARP maintains a large number of TCP connections in large-scale networks, which can consume significant memory and demands higher system specifications. It can use standard Ethernet switches but requires network cards that support iWARP.

InfiniBand vs. RoCE vs. iWARP

NADDOD Network Interconnection Optical Fiber Products: NADDOD provides lossless network solutions based on InfiniBand and RoCE, enabling users to build a lossless network environment with high-performance computing capabilities. NADDOD can select the most suitable solution for each application scenario and set of user requirements, offering high-bandwidth, low-latency data transmission that addresses network bottlenecks, improves network performance, and enhances the user experience.

 

NADDOD InfiniBand network interconnect products include DAC (Direct Attach Copper) high-speed cables, AOC (Active Optical Cables), and optical modules.

The available data rates are QDR (40G), EDR (100G), HDR (200G), and NDR (400G, 800G).

The module form factors include QSFP+ (Quad Small Form-factor Pluggable), QSFP28, QSFP56, and OSFP (Octal Small Form-factor Pluggable).

 

NADDOD RoCE network interconnect products include DAC (Direct Attach Copper) high-speed cables, AOC (Active Optical Cables), and optical modules.

The available data rates are 40G, 56G, 100G, 200G, and 400G.

The module form factors include SFP28, QSFP+, QSFP28, QSFP56, OSFP-QDD, and QSFP-DD.

 

As a professional module manufacturer, NADDOD welcomes everyone to learn about our products and make purchases.