What you must know about InfiniBand Networking

NADDOD | Quinn, InfiniBand Network Architect | Sep 22, 2023

1. What is InfiniBand?

InfiniBand is an open industry standard. It defines a network for connecting high-performance CPU/GPU servers, storage servers, and other devices. It provides point-to-point bidirectional serial links that connect processors and high-speed peripherals located on different servers. The operating system can be Linux, Windows, or ESXi.

Similar to other network protocols, the InfiniBand specification defines a multi-layer model, with offload and acceleration mechanisms at every layer of the stack, from the physical layer up to the upper layers.

2. Structural Components of an InfiniBand Network

Components of InfiniBand Networking

An InfiniBand network consists of the following elements:

  • HCA (Host Channel Adapter)

On a server, the HCA takes the form of a network adapter card. It is the end node that attaches the server to the IB network; it implements transport-layer functions and supports the verbs interface. The Verbs API is the programming interface for IB devices (a minimal sketch of opening a device through verbs appears after this list).

  • InfiniBand to Ethernet Gateway/Bridge

A device that interconnects an IB network with an Ethernet network by translating between IB and Ethernet messages. Use it when IB hosts need to communicate with an Ethernet network.

  • InfiniBand Switch

A device that forwards messages between nodes within an IB subnet.

  • Subnet Manager (SM)

The software that manages an IB subnet. It can run on a host, on a switch, or be deployed as part of NVIDIA UFM (Unified Fabric Manager).

  • InfiniBand Router

A device that forwards messages between different IB subnets.
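
To make the HCA and the verbs interface concrete, here is a minimal sketch (not NADDOD or NVIDIA sample code) of probing an HCA through libibverbs: it opens the first IB device found and queries a port for its state and the LID of the subnet manager. The use of port 1 and the minimal error handling are simplifying assumptions.

```c
/* Minimal sketch: probing an HCA through the Verbs API (libibverbs).
 * Build with: gcc probe.c -o probe -libverbs
 * Assumption: the HCA's first port is port 1. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no InfiniBand devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "failed to open %s\n", ibv_get_device_name(devs[0]));
        return 1;
    }

    struct ibv_port_attr port;
    if (ibv_query_port(ctx, 1, &port) == 0) {
        printf("device: %s\n", ibv_get_device_name(devs[0]));
        printf("state:  %s\n", ibv_port_state_str(port.state));
        printf("SM LID: %u\n", port.sm_lid); /* LID of the master subnet manager */
    }

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```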

3. The Core Features of InfiniBand

Key features of InfiniBand Networking

  • Subnet Manager (SM) is the program that manages the entire network. It provides centralized routing management, enabling plug-and-play operation for all nodes in the network. The SM supports a master/standby configuration: only one SM can be the master in a subnet, while the rest operate in standby mode. Each subnet must have a master SM, and each node runs a Subnet Manager Agent (SMA) to communicate with the SM.

  • GPUDirect allows data to be transferred directly from the memory of one GPU to another GPU. It reduces latency and improves performance, especially in GPU-based computing. The offloading of compute tasks is likewise implemented by NVIDIA GPUs.

  • Low Latency: Extremely low latency is achieved through a combination of hardware offloading and acceleration mechanisms. InfiniBand switches use cut-through forwarding, with switching latencies as low as 130 ns, and end-to-end transport latency is reduced further by RDMA (see the verbs sketch after this list).

  • Network Scalability: Multiple InfiniBand subnets can be interconnected using InfiniBand routers, enabling easy scalability to over 48,000 nodes.


  • Fault-Tolerant, Stable Network: In InfiniBand networks, recovery from failures is rapid: rerouting depends entirely on the subnet manager's routing algorithm, which recalculates paths and restores traffic quickly.

  • Self-Healing Network: Self-healing is a hardware-based capability of NVIDIA IB switches that restores connectivity after a link failure in as little as one millisecond.

  • Adaptive Routing: Adaptive routing balances traffic across switch ports. It is implemented in NVIDIA switch hardware and managed by the Adaptive Routing Manager.

  • SHARP (Scalable Hierarchical Aggregation and Reduction Protocol): SHARP is a mechanism built into NVIDIA switch hardware under centralized management. It offloads collective communication that used to run on CPUs and GPUs into the switches, optimizing collective operations and eliminating repeated data transfers between nodes. SHARP therefore significantly enhances the performance of accelerated computing, especially with MPI-based applications such as AI and machine learning (see the MPI sketch after this list).
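
As a concrete view of the hardware path behind RDMA's low latency, the sketch below allocates the basic verbs objects an application needs before any RDMA traffic can flow: a protection domain, a registered (pinned) memory region, a completion queue, and a reliably connected queue pair. The buffer size, queue depths, and abbreviated error handling are illustrative assumptions.

```c
/* Sketch: allocating the verbs building blocks used by RDMA.
 * Build with: gcc rdma_setup.c -o rdma_setup -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no IB device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "open failed\n"); return 1; }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);   /* protection domain */

    /* Register (pin) a buffer so the HCA can DMA to/from it directly. */
    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    /* Reliably connected QP: the transport used for RDMA read/write. */
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    printf("QP %u created; MR rkey 0x%x (share with the peer)\n",
           qp->qp_num, mr->rkey);

    /* Connection setup (INIT -> RTR -> RTS) is omitted in this sketch. */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```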
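
And since SHARP targets collectives, here is the kind of operation it accelerates: a plain MPI_Allreduce. The application code is unchanged whether or not SHARP is active; on NVIDIA stacks the offload is typically enabled in the communication library (for example, via the HCOLL_ENABLE_SHARP environment variable; treat that knob as an assumption and consult your stack's documentation).

```c
/* Sketch: an MPI_Allreduce, the collective SHARP can offload to the
 * switch fabric. Build with: mpicc allreduce.c -o allreduce */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes one value; with SHARP the switches compute
     * the global sum, without it the hosts do. */
    double local = (double)rank, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %.0f\n", size, global);

    MPI_Finalize();
    return 0;
}
```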

4. Advantages of InfiniBand vs. Ethernet

InfiniBand is a high-performance, low-latency network technology that is particularly suitable for applications requiring large-scale data transfer and fast communication. It provides higher bandwidth and lower latency while supporting Remote Direct Memory Access (RDMA) technology, which significantly improves data transfer efficiency and processing capabilities.


The data center network architecture that adopts InfiniBand technology is often referred to as an InfiniBand Fabric. In this architecture, servers and storage devices in the data center are interconnected through InfiniBand switches, creating a high-performance interconnect network. InfiniBand Fabric offers several advantages over traditional Ethernet-based architectures.


  1. High bandwidth and low latency: InfiniBand provides higher bandwidth and lower latency, meeting the performance requirements for large-scale data transfer and real-time communication applications.


  2. RDMA support: InfiniBand supports Remote Direct Memory Access (RDMA), allowing data to be copied directly from one node's memory to another node's memory. This reduces CPU overhead and data-copy time during transfers, improving efficiency (a sketch of posting an RDMA write follows this list).

  3. Scalability: InfiniBand Fabric has excellent scalability, supporting the connection of a large number of nodes and high-density server layouts. By adding InfiniBand switches and cables, the network scale and bandwidth capacity can be easily expanded.

  4. High reliability: InfiniBand Fabric incorporates redundant design and fault isolation mechanisms, improving network availability and fault tolerance. When a node or connection fails, network connectivity can be maintained through alternate paths without affecting the entire network.
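
To illustrate the RDMA advantage above, here is a hedged sketch of posting a one-sided RDMA WRITE with libibverbs. It assumes a queue pair that is already connected and a peer whose buffer address and rkey were exchanged out of band (for example, over TCP); the helper name rdma_write and its parameters are illustrative, not a library API.

```c
/* Sketch: posting a one-sided RDMA WRITE with libibverbs.
 * Assumes `qp` is already connected (RTS state). */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
               void *local_buf, uint32_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    /* Scatter/gather entry describing the registered local buffer. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,            /* local key from ibv_reg_mr() */
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided operation */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* peer's registered address */
    wr.wr.rdma.rkey        = rkey;               /* peer's remote key */

    /* The HCA moves the data; the remote CPU never copies or is interrupted. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```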

5. InfiniBand Network Solution

InfiniBand networks use cables that differ from traditional Ethernet cables and fiber-optic cables, and different connection scenarios require dedicated InfiniBand cables. NADDOD provides InfiniBand network interconnect products, including direct attach copper cables (DAC), active optical cables (AOC), and optical modules.

The available rates include QDR (40G), EDR (100G), HDR (200G), and NDR (400G), in QSFP+, QSFP28, QSFP56, and OSFP form factors respectively.

NADDOD's InfiniBand cables utilize high-density connectors and precise wiring technology to provide reliable signal transmission and low insertion loss. NADDOD not only offers InfiniBand products but also provides EDR/HDR/NDR solutions based on customers' actual environments. Our products have the following advantages:


  1. High-density connectivity: NADDOD cables feature a high-density connector design, enabling support for more connections in limited space, providing higher port density and scalability.


  2. Low insertion loss: NADDOD cables employ high-quality materials and precise manufacturing processes to reduce insertion loss during signal transmission, ensuring reliable data transmission with low signal attenuation.

  3. High reliability: NADDOD cables undergo rigorous testing and quality control, demonstrating excellent reliability and stability. They can maintain good performance in high-load and high-frequency transmission environments, with a long service life.

Building on its lossless InfiniBand network solution, NADDOD delivers lossless network environments and high-performance computing capabilities to users. With different application scenarios and user requirements in mind, NADDOD tailors the optimal solution to provide high bandwidth, low latency, and high-performance data transfer, effectively eliminating network bottlenecks, improving network performance, and enhancing the user experience.

By partnering with NADDOD and deploying a stable InfiniBand network, you can accelerate the growth of your business!