InfiniBand Network Technology for HPC and AI: In-Network Computing

InfiniBand is a high-speed, low-latency networking technology that is commonly used in data centers for high-performance computing (HPC) and artificial intelligence (AI). InfiniBand includes a number of features that make it particularly well-suited for these types of applications, including its ability to support extremely high bandwidth and low latency communication between nodes in a cluster or storage network.

What Is InfiniBand In-Network Computing?
How Does InfiniBand In-Network Computing Work in Data Center?
Key Technologies of InfiniBand In-Network Computing
Typical InfiniBand In-network Computing Applications: HPC & AI
- High Performance Computing (HPC)
- Artificial Intelligence (AI)
Conclusion

What Is InfiniBand In-Network Computing?

In-Network Computing is a new paradigm enabled by the InfiniBand network that offloads data processing from the CPU to the network, reducing latency and increasing overall system performance. It addresses the collective communication and point-to-point communication bottlenecks in AI and HPC applications, and offers new approaches to data center scalability.

In-network computing uses InfiniBand network adapters, switches, and other network devices to perform computation on data while it is in transit, which reduces communication delay and improves overall computing efficiency. This makes the network a computing unit as important as the GPU and CPU.

Figure: InfiniBand in-network computing

How Does InfiniBand In-Network Computing Work in Data Center?

In the past few years, modern data centers represented by cloud computing, big data, high-performance computing, and artificial intelligence have evolved into distributed parallel processing architectures. All resources in the data center, such as CPUs, memory, and storage, are distributed throughout the data center, connected by high-speed networking technologies (InfiniBand, Ethernet, Fibre Channel, Omni-Path, etc.), and work together through collaborative design and division of labor to complete data processing tasks in the data center. In modern data centers, everything is oriented towards business data, creating a balanced system architecture. Along the direction of business data flow, CPU computing, GPU computing, storage computing, and in-network computing are integrated and collaborate vertically and horizontally to form a new generation of data center system architecture centered on data.

As an efficient and scalable intelligent interconnect technology, InfiniBand deeply integrates in-network computing technology into product practice, realizing seamless integration and solving communication bottlenecks. By moving various communication-related calculations from CPUs/GPUs to the network and freeing up resources for CPUs and GPUs, in-network computing allows applications to obtain more computing resources, thereby improving the performance of the entire service and helping enterprises to tackle data challenges.

Key Technologies of InfiniBand In-Network Computing

Network Protocol Offloading

In-Network Computing has always aimed to enhance the performance of data transmission and reception. However, the widely used TCP/IP protocol stack in the industry relies on software provided by the operating system kernel for protocol stack processing, which consumes a significant amount of CPU processing resources. This has become a performance bottleneck in data centers where network bandwidth is increasing and application performance requirements are becoming more demanding.

InfiniBand network adapters and switches implement the physical, link, network, and transport layers of network communication entirely in ASIC hardware. As a result, data transmission does not require additional software or CPU processing resources, which greatly improves communication performance.

Remote Direct Memory Access (RDMA)

With the rapid development of high-performance computing, big data analysis, artificial intelligence, and IoT technology, business applications need to access an increasing amount of data over the network. This places higher demands on the switching speed and performance of data center networks. Traditional TCP/IP software and hardware architectures suffer from high network transmission and data processing latency, multiple data copies and interrupts, and complex protocol processing. Remote Direct Memory Access (RDMA) was developed to address server-side data processing latency in network transmission. RDMA transfers data directly from a user application's memory on one server, across the network, into the memory of the remote system, eliminating the multiple data copies and context switches otherwise needed during transmission. This results in a significant reduction in CPU load and improved network transmission efficiency.
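The difference between the two data paths can be illustrated with a toy model. The sketch below is not the RDMA verbs API: `MemoryRegion`, `tcp_style_send`, and `rdma_write` are hypothetical names introduced for illustration, and Python bytearrays stand in for host memory. It is a simplified model that only counts the CPU copies a payload goes through on each path.

```python
class MemoryRegion:
    """Toy stand-in for a registered buffer that a network adapter may access."""

    def __init__(self, size):
        self.buf = bytearray(size)


def tcp_style_send(src, dst):
    """Simplified kernel-socket path; returns the number of CPU copies made."""
    cpu_copies = 0
    sender_socket_buf = bytes(src)            # CPU copy: user space -> kernel socket buffer
    cpu_copies += 1
    receiver_socket_buf = sender_socket_buf   # adapter moves data over the wire (no CPU copy)
    dst[:len(receiver_socket_buf)] = receiver_socket_buf  # CPU copy: kernel -> user space
    cpu_copies += 1
    return cpu_copies


def rdma_write(src, dst, length):
    """Simplified one-sided write; the adapter moves data between registered regions."""
    dst.buf[:length] = memoryview(src.buf)[:length]  # models adapter DMA, not a CPU copy
    return 0


if __name__ == "__main__":
    src = MemoryRegion(8)
    src.buf[:] = b"payload!"
    print("TCP-style CPU copies:", tcp_style_send(src.buf, bytearray(8)))
    print("RDMA-style CPU copies:", rdma_write(src, MemoryRegion(8), 8))
```

In a real deployment the registration, queue pairs, and completion handling are done through the verbs interface; none of that is modeled here.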

RDMA is a key reason why InfiniBand is used in HPC. To learn more about why RDMA/InfiniBand is more suitable for HPC than Ethernet, see: Why Is InfiniBand Used in HPC?

GPUDirect RDMA

GPUs are commonly utilized as computing cores in high-performance computing and artificial intelligence platforms across various industries. The communication efficiency between GPUs plays a crucial role in the overall effectiveness of GPU clusters. To address this challenge, GPUDirect technology leverages the RDMA capability of InfiniBand to facilitate communication between GPU nodes, thereby improving communication efficiency during HPC computing or AI training. This is achieved through direct read and write access to GPU memory via RDMA network adapters. When two GPU processes on different nodes within a GPU cluster need to communicate, the RDMA network adapter can directly transfer data between the GPU memories of the two nodes, eliminating the need for CPU involvement in data copying and reducing the number of accesses to the PCIe bus. As a result, unnecessary data copying is minimized, and communication performance is significantly enhanced.

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a collective communication network offloading technology.

In high-performance computing and artificial intelligence workloads, collective communications are common and often have a large impact on the parallel efficiency of the application because they involve the entire network. Many studies in the industry use software methods to optimize collective communication, but multiple rounds of communication across the network are still required to complete an aggregation operation. Even after such optimization, the latency of collective communication remains about an order of magnitude higher than that of point-to-point communication.
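To make the point about repeated communication concrete, the sketch below counts the sequential communication steps of two well-known software all-reduce algorithms. The function names are illustrative, and the formulas are the standard step counts for ring all-reduce and for recursive doubling on a power-of-two number of nodes. It counts steps only and is not a latency measurement.

```python
def ring_allreduce_steps(n):
    """Ring all-reduce: (n - 1) reduce-scatter steps plus (n - 1) all-gather steps."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 2 * (n - 1)


def recursive_doubling_steps(n):
    """Recursive doubling: log2(n) pairwise exchange steps; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        print(n, ring_allreduce_steps(n), recursive_doubling_steps(n))
```

A single point-to-point message corresponds to one step, so even the logarithmic algorithm needs 10 sequential steps at 1024 nodes, and the ring needs 2046.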

In response, NVIDIA Mellanox introduced SHARP technology starting with EDR InfiniBand switches. A compute engine is integrated into the switch chip. It supports 16-bit, 32-bit, and 64-bit fixed-point or floating-point calculations; operations such as summation, minimum, maximum, AND, OR, and XOR; and collectives such as Barrier, Reduce, and All-Reduce.

In a cluster environment composed of multiple switches, Mellanox defines a set of Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) offloading mechanisms. The Aggregation Manager builds a logical SHARP tree on top of the physical topology, and the switches in the SHARP tree process collective communication operations in a parallel and distributed manner. When the hosts need to perform global communication (such as All-Reduce), all hosts submit their data to the switches they are connected to. After receiving the data, the first-level switch uses its built-in engine to process the data and then submits the result to the upper-level switch in the SHARP tree. The upper-level switches also use their own engines to aggregate the results received from multiple switches and continue to submit them to the next level of the SHARP tree. When the data reaches the root switch of the SHARP tree, the root switch performs the final calculation and sends the result back to all host nodes. Through the parallel and distributed processing of the SHARP tree, the latency of collective communication can be greatly reduced, network congestion can be reduced, and the scalability of the cluster system can be improved.
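The aggregation flow described above can be sketched as a small simulation. This is a model of the data flow only, not of the SHARP protocol or any NVIDIA API: `switch_reduce`, `sharp_allreduce`, and the `radix` parameter (children per switch) are hypothetical names chosen for illustration, and the tree is built by simple grouping rather than by an Aggregation Manager.

```python
from functools import reduce
import operator

# Reduction operations named in the text, expressed as Python binary operators.
OPS = {
    "sum": operator.add, "min": min, "max": max,
    "and": operator.and_, "or": operator.or_, "xor": operator.xor,
}


def switch_reduce(vectors, op):
    """Element-wise reduction of the vectors arriving at one switch."""
    return [reduce(op, column) for column in zip(*vectors)]


def sharp_allreduce(host_data, radix, op_name="sum"):
    """Aggregate up a logical tree, then return the root result to every host.

    host_data: one equal-length list of numbers per host.
    radix: assumed number of children per switch (must be at least 2).
    """
    if radix < 2:
        raise ValueError("radix must be at least 2")
    op = OPS[op_name]
    level = host_data
    while len(level) > 1:
        # Each switch at this level aggregates the data of up to `radix` children
        # and forwards one partial result to the next level up.
        level = [switch_reduce(level[i:i + radix], op)
                 for i in range(0, len(level), radix)]
    root_result = level[0]
    # The root switch sends the final result back down to all hosts.
    return [list(root_result) for _ in host_data]


if __name__ == "__main__":
    hosts = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(sharp_allreduce(hosts, radix=2)[0])  # prints [16, 20]
```

Each host sends its data once and receives the result once, and the number of tree levels grows logarithmically with the number of hosts, which is the property the text attributes to the SHARP tree.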

Typical InfiniBand In-network Computing Applications: HPC & AI

High Performance Computing (HPC)

Most high-performance computing workloads are compute-intensive. The computing process consumes a large amount of CPU/GPU resources and is usually accompanied by various types of point-to-point and collective communication. These workloads are very sensitive to communication bandwidth and latency, and require communication protocol offloading to reduce CPU/GPU resource contention. RDMA, GPUDirect, and SHARP technologies are widely used to improve overall computing performance.

Artificial Intelligence (AI)

Artificial intelligence is one of the hottest technologies at present. Completing training quickly and efficiently to obtain a high-accuracy model is one of the key requirements of an artificial intelligence platform. GPUs or dedicated AI chips are widely used in the industry as the computing core of AI training platforms to accelerate the training process. AI training is also a typical compute-intensive application, and it requires offloading of communication protocols to reduce latency. At the same time, GPUDirect RDMA technology effectively improves the communication bandwidth between GPUs in a cluster and reduces communication delay.

In large-scale distributed training, the popular data-parallel approach trains multiple replicas of a deep neural network in parallel, and after each replica completes a training iteration, the model is synchronized across all replicas. Model synchronization is often implemented with a collective communication such as all-reduce, and its performance has become a key factor in the performance of distributed machine learning. By using SHARP technology, the all-reduce performance of AI training can be significantly improved, the model synchronization process can be accelerated, and the overall training performance of the cluster can be greatly improved.
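As a minimal sketch of the synchronization step, the example below trains two replicas of a one-parameter model y = w * x on different data shards and averages their gradients with an all-reduce after every iteration. It assumes that gradients are what is exchanged, which is a common way to keep replicas in sync. All function names are illustrative, and the all-reduce is simulated in a single process rather than executed on a network.

```python
def allreduce_mean(grads_per_worker):
    """Average gradients element-wise; every worker receives the same result."""
    n = len(grads_per_worker)
    mean = [sum(column) / n for column in zip(*grads_per_worker)]
    return [list(mean) for _ in grads_per_worker]


def local_gradient(w, batch):
    """Gradient of the mean squared error of y = w[0] * x on one worker's batch."""
    return [sum(2 * (w[0] * x - y) * x for x, y in batch) / len(batch)]


def train_step(weights, batches, lr):
    """One data-parallel iteration: local gradients, all-reduce, identical updates."""
    grads = [local_gradient(w, b) for w, b in zip(weights, batches)]
    synced = allreduce_mean(grads)
    return [[wi - lr * gi for wi, gi in zip(w, g)]
            for w, g in zip(weights, synced)]


if __name__ == "__main__":
    # Two workers, each holding a different shard of data generated from y = 3x.
    batches = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
    weights = [[0.0], [0.0]]
    for _ in range(20):
        weights = train_step(weights, batches, lr=0.05)
    print(weights)  # both replicas hold the same value, close to 3.0
```

Because every replica applies the same averaged gradient, the replicas stay identical without ever exchanging their training data. The all-reduce runs once per iteration, which is why its latency weighs so heavily on overall training time.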

Conclusion

InfiniBand In-Network Computing is a powerful technology that can significantly improve system performance in data centers. Its key technologies, including network protocol offloading, RDMA, GPUDirect RDMA, and SHARP, enable faster data transfer and processing, reduce network congestion, and improve system scalability. Its typical applications in HPC and AI demonstrate its ability to accelerate large-scale simulations and deep learning tasks.

Related Resources:
What Is InfiniBand Network and Its Architecture?
Why Is InfiniBand Used in HPC?
NADDOD High-Performance Computing (HPC) Solution
NADDOD InfiniBand Cables & Transceivers Products
Case Study: NADDOD Helped the National Supercomputing Center to Build a General-Purpose Test Platform for HPC