Optimizing Inter-Machine Communication in AI Clusters for Efficient Model Training

In AI clusters, GPUs are central to high-performance computing and to accelerating deep learning tasks. Their powerful parallel computing capabilities significantly enhance computational performance. However, as the volume of computational data continues to grow, GPUs must exchange data with one another frequently, so GPU communication performance has become a critical metric of overall system performance. This article explores the importance of machine-to-machine communication in AI clusters and discusses strategies for optimizing communication mechanisms so that high-precision models can be trained effectively.

Inter-Machine Communication

TCP/IP network protocol

RDMA (Remote Direct Memory Access) network protocol

InfiniBand

iWARP

RoCE

TCP/IP

TCP/IP (Transmission Control Protocol/Internet Protocol) is the protocol suite used to interconnect network devices over the Internet. It defines how data should be packaged, addressed, transmitted, routed, and received. TCP places great emphasis on accurate data transmission between two computers: if a segment is lost or corrupted in transit, TCP detects the problem and retransmits the affected data.

Additionally, TCP/IP is organized into four distinct layers: the data link layer, the internet layer, the transport layer, and the application layer. On the sending side, data passes down through these four layers before it goes out on the wire. The receiving side then passes the data back up through the layers in reverse order to reassemble it and present it to the receiver. Because each layer is independent, a data center can improve performance or security by upgrading specific layers rather than the entire system.
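The layering described above can be illustrated with a toy sketch. This is not how a real network stack is implemented: the text tags standing in for headers, and the function names encapsulate and decapsulate, are assumptions made purely for illustration. It only shows that the sender wraps data layer by layer and the receiver unwraps it in reverse order.

```python
# Toy model of TCP/IP layering. Real stacks use binary headers; the
# bracketed text tags here are hypothetical stand-ins for illustration.
LAYERS = ["application", "transport", "internet", "link"]  # top to bottom


def encapsulate(payload: bytes) -> bytes:
    """Sender side: each layer, from top to bottom, prepends its header."""
    frame = payload
    for layer in LAYERS:
        frame = f"[{layer}]".encode() + frame
    return frame


def decapsulate(frame: bytes) -> bytes:
    """Receiver side: strip headers in reverse order, from bottom to top."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not frame.startswith(header):
            raise ValueError(f"expected {layer} header")
        frame = frame[len(header):]
    return frame


wire_frame = encapsulate(b"gradients")
recovered = decapsulate(wire_frame)
```

In this sketch the link-layer header ends up outermost, so it is the first one the receiver removes, and the application payload is recovered unchanged.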

RDMA (Remote Direct Memory Access) network protocol

RDMA (Remote Direct Memory Access) was developed to address the latency of server-side data processing during network transfers. It allows one host or server to access the memory of another host or server directly, without involving the CPU. This frees up the CPU to perform its intended tasks, such as running applications and handling large amounts of data. RDMA improves bandwidth and reduces latency, jitter, and CPU consumption.

In contrast to traditional network transfer mechanisms, RDMA operates without the intervention of the operating system and the TCP/IP protocol stack. Its kernel bypass mechanism lets applications read and write data directly through the network card, which can reduce server-to-server data transfer latency to below 1 microsecond. Additionally, RDMA's zero-copy mechanism allows the receiving end to read data directly from the sender's memory, significantly reducing CPU overhead.
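The difference between a copy-based path and a zero-copy path can be sketched with an in-process analogy. This is only an analogy and does not perform RDMA: no network card is involved, and the function names copied_receive and zero_copy_receive are hypothetical. It assumes that a Python memoryview, which exposes a buffer without copying it, is a reasonable stand-in for the receiver reading the sender's memory directly.

```python
# In-process analogy only: no NIC, no network, no real RDMA verbs.

def copied_receive(sender_buf: bytearray) -> bytearray:
    """Simplified traditional path: data is copied into an intermediate
    (kernel-style) buffer, then copied again into the application buffer."""
    kernel_buf = bytearray(sender_buf)  # copy 1
    app_buf = bytearray(kernel_buf)     # copy 2
    return app_buf


def zero_copy_receive(sender_buf: bytearray) -> memoryview:
    """Zero-copy analogy: the receiver gets a view onto the sender's
    buffer, so no bytes are duplicated."""
    return memoryview(sender_buf)


sender = bytearray(b"gradients")
copied = copied_receive(sender)
view = zero_copy_receive(sender)

# A later write by the sender shows up through the view, because it
# shares the sender's memory; the copied buffer keeps the old bytes.
sender[0] = ord("G")
```

After the final write, the view reflects the new contents while the copied buffer does not, which is the property that lets a zero-copy transfer skip the intermediate buffers and the CPU work of filling them.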

RDMA

There are three main types of RDMA networks: InfiniBand, RoCE, and iWARP. InfiniBand is a network specifically designed for RDMA, ensuring reliable transmission at the hardware level. RoCE and iWARP, on the other hand, are RDMA technologies based on Ethernet, supporting their respective verbs interfaces.

RDMA was initially implemented on the InfiniBand transport network, which is technologically advanced but comes with higher costs. Only Mellanox (since acquired by NVIDIA) and Intel (which acquired QLogic's InfiniBand technology in 2012) provide complete network solutions. Later, the industry ported RDMA to traditional Ethernet, reducing the cost of RDMA adoption and promoting its widespread use. On Ethernet, there are two technologies, distinguished by how they integrate with the protocol stack: iWARP and RoCE. RoCE in turn has two versions, RoCEv1 and RoCEv2; the major improvement in RoCEv2 is support for IP routing. The comparison of the RDMA network protocols is shown in the following diagram:

RDMA application

InfiniBand (IB): RDMA technology based on the InfiniBand architecture, proposed by the InfiniBand Trade Association (IBTA). Building an RDMA network based on IB technology requires dedicated IB network cards and IB switches.

iWARP (Internet Wide Area RDMA Protocol): RDMA technology based on the TCP/IP protocol, defined as an IETF standard. iWARP supports the use of RDMA technology on a standard Ethernet infrastructure, but servers need to use iWARP-compatible network cards.

RoCE (RDMA over Converged Ethernet): RDMA technology based on Ethernet, also proposed by the IBTA. RoCE supports the use of RDMA technology on a standard Ethernet infrastructure, but it requires switches that support lossless Ethernet transmission and servers equipped with RoCE network cards.
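The differences between these technologies can be summarized as a small lookup table. The following is a minimal sketch, not an official taxonomy: the RdmaTransport class, its field names, and the helper ip_routable_transports are assumptions introduced for illustration, and the entries only restate the properties described above (RoCEv1 carries RDMA traffic directly over Ethernet, RoCEv2 carries it over UDP/IP, and iWARP runs over TCP/IP).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RdmaTransport:
    """Hypothetical record type summarizing one RDMA technology."""
    name: str
    link_layer: str
    carried_over: str               # what sits beneath the RDMA traffic
    ip_routable: bool               # can traffic cross IP routers?
    needs_lossless_ethernet: bool   # False for InfiniBand: it is not Ethernet


TRANSPORTS = [
    RdmaTransport("InfiniBand", "InfiniBand", "native InfiniBand stack", False, False),
    RdmaTransport("RoCEv1", "Ethernet", "Ethernet link layer", False, True),
    RdmaTransport("RoCEv2", "Ethernet", "UDP/IP", True, True),
    RdmaTransport("iWARP", "Ethernet", "TCP/IP", True, False),
]


def ip_routable_transports() -> list[str]:
    """Names of the technologies whose traffic can be routed over IP."""
    return [t.name for t in TRANSPORTS if t.ip_routable]


ethernet_based = [t.name for t in TRANSPORTS if t.link_layer == "Ethernet"]
routable = ip_routable_transports()
```

A table like this makes deployment trade-offs easy to query, for example selecting the Ethernet-based options that can cross routed network boundaries.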

The three mainstream RDMA technologies can be divided into two camps. One is IB technology, and the other is RDMA-enabled Ethernet technologies (RoCE and iWARP). IB and RoCE are supported by the IBTA, with Mellanox being a pioneer in this area. iWARP, on the other hand, is supported by IEEE/IETF and is mainly driven by Chelsio.

In the storage domain, RDMA-enabled technologies have long existed, such as SCSI RDMA Protocol (SRP) and iSCSI Extensions for RDMA (iSER). The emerging NVMe over Fabrics, when it does not run over Fibre Channel (FC) networks, is essentially NVMe over RDMA. In other words, NVMe over InfiniBand, NVMe over RoCE, and NVMe over iWARP all fall under the category of NVMe over RDMA.