GPU Communication Technology Accelerating AI Development

NADDOD Gavin InfiniBand Network Engineer Jan 22, 2024

In recent years, artificial intelligence (AI) technology has gained widespread attention and application. AI spans many fields, including speech recognition, image processing, video analytics, and natural language processing (NLP), all of which demand substantial computing resources. The development of AI therefore hinges on significant computational power.


Although the computing capabilities of GPUs have been continuously improving, there are still limitations to the computational power of a single GPU card when it comes to AI applications. Hence, the need arises to leverage multiple GPUs in combination to enhance computing performance. There are two main scenarios for multi-GPU setups: multiple GPUs within a single server and multiple servers, each equipped with multiple GPU cards. In both cases, efficient communication between GPUs is crucial. Now, let's delve into the critical domain of GPU communication technology.


Single-Machine Multi-GPU Communication


GPUDirect


GPUDirect is a technology developed by NVIDIA that enables GPUs to communicate and transfer data directly with other devices, such as network interface cards (NICs) and storage devices, without involving the CPU. In the traditional model, data moving between the GPU and another device must first be read or written by the CPU, which can create performance bottlenecks and increase transfer latency.


With GPUDirect, network adapters and storage devices can read from and write to GPU memory directly, avoiding unnecessary memory copies, reducing CPU overhead, and shortening transfer latency. This significantly enhances overall performance.


GPUDirect comprises several features, including GPUDirect Storage, GPUDirect RDMA, GPUDirect P2P, and GPUDirect Video, each enabling GPUs to exchange data more efficiently with a particular class of device.


GPUDirect Storage


GPUDirect Storage technology allows for direct data transfer between storage devices and GPUs, bypassing the involvement of the CPU. This reduces data transfer latency and minimizes CPU resource consumption. With this technology, GPUs can directly access data from storage devices, such as solid-state drives (SSDs) or non-volatile memory express (NVMe) drives, without the need to copy the data to the CPU's memory first. This direct access approach enables faster transfer speeds and better utilization of GPU resources, thereby enhancing overall performance.



The main features and advantages of GPUDirect Storage are as follows:


  1. Reduced CPU involvement: By bypassing the CPU and enabling direct communication between the GPU and storage devices, GPUDirect Storage reduces CPU overhead and frees up CPU resources for other tasks, thereby enhancing overall system performance.


  2. Low-latency data access: GPUDirect Storage eliminates the data transfer path through the CPU, minimizing data transfer latency to a great extent. This is particularly beneficial for latency-sensitive applications such as real-time analytics, machine learning, and high-performance computing.


  3. Improved storage performance: By allowing direct GPU access to storage devices, GPUDirect Storage facilitates high-speed data transfer, significantly improving storage performance and accelerating the processing speed of data-intensive workloads.


  4. Enhanced scalability: GPUDirect Storage supports multi-GPU configurations, allowing multiple GPUs to simultaneously access storage devices. This scalability is critical for applications that require large-scale parallel processing and data analysis.


  5. Compatibility and ecosystem support: GPUDirect Storage is designed to be compatible with various storage protocols, including NVMe, NVMe over Fabrics, and Network-Attached Storage (NAS). It is supported by major storage vendors and has been integrated into popular software frameworks like NVIDIA CUDA to streamline integration with existing GPU-accelerated applications.
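The flow above can be sketched with NVIDIA's cuFile API, the user-space interface to GPUDirect Storage. This is a minimal sketch, assuming a system with CUDA and GDS installed; the file path is hypothetical and error handling is abbreviated.

```cuda
// Sketch: reading a file straight into GPU memory with the cuFile
// (GPUDirect Storage) API. The file path is a placeholder.
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const size_t size = 1 << 20;                 // 1 MiB
    cuFileDriverOpen();                          // initialize the GDS driver

    int fd = open("/data/sample.bin", O_RDONLY | O_DIRECT);
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);       // register the fd with GDS

    void *devPtr;
    cudaMalloc(&devPtr, size);
    cuFileBufRegister(devPtr, size, 0);          // pin the GPU buffer for DMA

    // DMA directly from NVMe into GPU memory -- no bounce through host RAM.
    ssize_t n = cuFileRead(handle, devPtr, size, /*file_offset=*/0,
                           /*devPtr_offset=*/0);
    printf("read %zd bytes into GPU memory\n", n);

    cuFileBufDeregister(devPtr);
    cuFileHandleDeregister(handle);
    cudaFree(devPtr);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```

Note the `O_DIRECT` flag: GDS requires unbuffered I/O so the data path can skip the kernel page cache entirely.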


GPUDirect P2P


In certain workloads, data exchange is required between two or more GPUs within the same server. Without GPUDirect P2P, data from a GPU would first be copied into pinned host memory via the CPU and PCIe bus, and then copied from that pinned host memory to the target GPU, again through the CPU and PCIe bus. The data is therefore copied twice before reaching its destination.


However, with GPUDirect P2P communication technology, the need to stage data in host memory is eliminated when copying from one GPU to another within the same node. If the two GPUs are attached to the same PCIe bus, GPUDirect P2P allows them to access each other's memory directly, without involving the CPU. This direct communication halves the number of copy operations required for the same task, improving overall performance and efficiency.
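In CUDA, this technique boils down to enabling peer access and then issuing a peer copy. Below is a minimal sketch, assuming a machine with at least two GPUs that can reach each other over PCIe or NVLink; error checks are omitted for brevity.

```cuda
// Sketch: enabling GPUDirect P2P between two GPUs and performing a
// direct device-to-device copy.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0;
    // Can GPU 0 directly read/write GPU 1's memory?
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    printf("P2P 0 -> 1 supported: %d\n", canAccess);

    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // enable the 0 -> 1 P2P path
    }

    const size_t bytes = 1 << 20;
    void *src, *dst;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    // With P2P enabled this is one direct copy over PCIe/NVLink; without
    // it, the runtime silently stages the data through host memory
    // (two copies), exactly the fallback described above.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```

The same `cudaMemcpyPeer` call works either way; P2P only changes which path the bytes take underneath.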




NVLink


In GPUDirect P2P technology, multiple GPUs connect to the CPU over PCIe. However, as training data continues to grow, the bidirectional bandwidth of a PCIe 3.0 x16 link, roughly 32 GB/s, may become insufficient and gradually turn into a system bottleneck. To address this, NVIDIA introduced NVLink, a high-speed, high-bandwidth interconnect technology, in 2016.


NVLink is designed to establish communication between multiple GPUs or between GPUs and other devices such as CPUs and memory. It provides a direct point-to-point connection with higher transfer speeds and lower latency compared to traditional PCIe buses.


NVLink offers the following features and advantages:


  1. High bandwidth and low latency: NVLink provides bidirectional bandwidth of up to 300 GB/s, nearly 10 times that of PCIe 3.0. The point-to-point connection offers ultra-low latency, enabling fast and efficient data transfer and communication.


  2. GPU-to-GPU communication: NVLink allows direct point-to-point communication between multiple GPUs without the need for data transfer through host memory or the CPU.


  3. Memory sharing: NVLink also supports memory sharing between GPUs, enabling multiple GPUs to directly access each other's memory space.


  4. Flexible connections: NVLink supports various connection configurations, including 2, 4, 6, or 8 channels, allowing for flexible configurations and scalability based on specific needs. This makes NVLink suitable for systems of different scales and requirements.


By utilizing NVLink, significant improvements in communication performance between multiple GPUs can be achieved, fully leveraging the computational power of GPUs, and further enhancing overall performance and efficiency.
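Whether a given system actually has active NVLink connections can be checked programmatically. The sketch below uses NVIDIA's NVML management library (link with -lnvidia-ml) to probe the link state of GPU 0; on a PCIe-only board every link simply reports inactive. This is an illustrative snippet, not part of any CUDA application's required setup.

```cuda
// Sketch: listing which NVLink links are active on GPU 0 via NVML.
#include <nvml.h>
#include <cstdio>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    // NVML_NVLINK_MAX_LINKS is the per-GPU upper bound on link count.
    for (unsigned link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
        nvmlEnableState_t state;
        if (nvmlDeviceGetNvLinkState(dev, link, &state) == NVML_SUCCESS)
            printf("NVLink %u: %s\n", link,
                   state == NVML_FEATURE_ENABLED ? "active" : "inactive");
    }
    nvmlShutdown();
    return 0;
}
```

The command-line equivalent is `nvidia-smi topo -m`, which prints the full GPU-to-GPU connectivity matrix.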




NVSwitch


Although NVLink provides point-to-point connections and high-bandwidth communication between GPUs, it cannot on its own fully interconnect 8 GPUs within a single server. To address this, NVIDIA introduced NVSwitch in 2018, an innovative node-level switching architecture that enables full all-to-all NVLink connectivity.


NVSwitch was the first solution to support 16 fully interconnected GPUs within a single server node. With NVSwitch, every GPU can communicate directly with every other GPU without relying on the CPU or host memory, and all 8 GPU pairs can communicate simultaneously at up to 300 GB/s, greatly enhancing data transfer and communication efficiency between multiple GPUs.


Through NVSwitch, multiple GPUs can collaborate more efficiently, accelerating the processing speed of computationally intensive tasks. This is crucial for applications that require large-scale parallel processing and high-performance computing. The introduction of NVSwitch further expands the application scope of NVIDIA GPU architecture in areas such as high-performance computing and artificial intelligence.



Multi-Machine Multi-GPU Communication




Artificial intelligence (AI) computations require significant computational power, and multi-machine, multi-GPU computing has become the norm. In distributed training, communication between machines is a critical performance factor. Traditional TCP/IP networking involves multiple memory copies and packet-processing overhead, pushing server-to-server latency into the millisecond range, which cannot meet the demands of multi-machine, multi-GPU workloads.


Remote Direct Memory Access (RDMA) addresses this latency problem by allowing one machine to read and write another machine's memory directly, without involving the remote host's CPU or operating system. RDMA provides high-speed, zero-copy data transfer: data moves straight from the sender's memory to the receiver's memory, avoiding kernel buffering and multiple intermediate copies.


Currently, RDMA has three different implementation approaches:


  • InfiniBand (IB): IB is a high-performance interconnect technology that provides native RDMA support. It uses dedicated IB adapters and switches to achieve high-speed direct memory access and data transfer between nodes through RDMA operations.


  • RDMA over Converged Ethernet (RoCE): RoCE is a technology that enables RDMA over Ethernet. It uses standard Ethernet as the underlying transport medium and employs RoCE adapters and appropriate protocol stacks to enable RDMA functionality.


  • Internet Wide Area RDMA Protocol (iWARP): iWARP is an RDMA implementation based on the TCP/IP protocol stack. It utilizes regular Ethernet adapters and standard network switches, implementing RDMA functionality within the TCP/IP protocol stack to provide high-performance remote memory access and data transfer.


These RDMA implementation approaches all offer efficient data transfer performance and low latency, making them suitable for high-performance computing and AI applications in multi-machine and multi-GPU scenarios. They significantly reduce data transfer latency and overhead, thus improving the efficiency and performance of distributed training.
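All three implementations are programmed through the same verbs API (libibverbs). The sketch below shows its core idea, registering a memory region so the NIC can DMA into it directly; queue-pair setup and connection exchange are omitted, and an RDMA-capable adapter is assumed.

```cuda
// Sketch: registering a buffer for RDMA with the libibverbs API
// (the same code works for InfiniBand, RoCE, and iWARP devices).
#include <infiniband/verbs.h>
#include <cstdio>
#include <cstdlib>

int main() {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);        // protection domain

    const size_t size = 1 << 20;
    char *buf = (char *)malloc(size);
    // Pin the buffer and grant the NIC direct DMA access to it.
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    // A remote peer that learns (address, rkey) can now read/write this
    // memory with zero copies and no involvement of the local CPU.
    printf("lkey=%u rkey=%u\n", mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

The zero-copy property of RDMA follows directly from this registration step: once the NIC holds the buffer's DMA mapping, no kernel or CPU copy is needed on either side.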




GPUDirect RDMA


GPUDirect RDMA combines GPU-accelerated computing with Remote Direct Memory Access (RDMA), enabling direct data transfer between GPU memory and RDMA-capable network adapters. The NIC reads from and writes to GPU memory directly, without staging the data through host memory or involving the CPU.



By bypassing host memory and the CPU, GPUDirect RDMA significantly reduces transfer latency, accelerates data exchange, and offloads work from the CPU, freeing its computational capacity. It also improves bandwidth utilization, since data is no longer copied through host memory on its way to the network. This technology finds wide application in high-performance computing, artificial intelligence, and other fields that demand fast, efficient data transfer alongside computationally intensive workloads.


In multi-machine, multi-GPU high-performance computing environments, this makes inter-node data transfer far more efficient, speeding up distributed training and processing while leaving the CPU free for other work.
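From the application's point of view, the key step of GPUDirect RDMA is simply registering GPU memory, rather than host memory, with the verbs API. A minimal sketch, assuming a system with the nvidia-peermem kernel module loaded and an already-allocated protection domain (`pd` here is a caller-supplied handle; device and queue-pair setup are omitted):

```cuda
// Sketch: registering a GPU buffer with the RDMA NIC (GPUDirect RDMA).
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t size,
                                   void **devPtrOut) {
    void *devPtr;
    cudaMalloc(&devPtr, size);                 // buffer lives in GPU memory
    // With GPUDirect RDMA, ibv_reg_mr accepts the GPU pointer directly;
    // the NIC then DMAs to/from GPU memory, bypassing host RAM entirely.
    struct ibv_mr *mr = ibv_reg_mr(pd, devPtr, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    *devPtrOut = devPtr;
    return mr;            // mr->rkey is what the remote peer needs
}
```

Communication libraries such as NCCL and MPI implementations with CUDA support perform this registration internally, which is why distributed training frameworks benefit from GPUDirect RDMA without application changes.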




IPoIB


IPoIB (IP over InfiniBand) enables the use of standard IP protocols over an InfiniBand network. It layers the standard IP stack on top of InfiniBand interconnect technology, allowing nodes in an InfiniBand fabric to communicate and transfer data using ordinary IP.


IPoIB provides an IP network emulation layer on top of RDMA hardware, so existing applications can run on an InfiniBand network without modification. However, IPoIB traffic still passes through the kernel IP stack, incurring a significant number of system calls and CPU interrupts, so its performance falls well short of native RDMA. To obtain full bandwidth and minimal latency, most performance-critical applications use RDMA directly, with IPoIB reserved for applications that need unmodified IP-based communication.


In large-scale computing, GPUDirect P2P and NVLink in single-machine multi-GPU scenarios, combined with GPUDirect RDMA in distributed scenarios, can dramatically reduce communication time and enhance overall performance. These technologies let GPUs access data in other GPUs or in RDMA network devices directly, avoiding copies through host memory and improving both bandwidth utilization and transfer speed, thereby accelerating distributed training and processing in multi-machine, multi-GPU high-performance computing environments.