Adaptive routing, in-network computing, and congestion control empower InfiniBand to meet the rigorous demands of HPC and AI clusters. These optimizations ensure seamless data flow, eliminate bottlenecks, and enable efficient resource utilization, driving superior performance and operational efficiency for complex infrastructures.
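To make the adaptive-routing idea concrete, the toy sketch below steers each flow onto the least-loaded of several equal-cost uplinks rather than a fixed, hash-selected one. This is purely an illustration of the concept, not NVIDIA's switch logic; the port names and byte counts are hypothetical.

```python
# Toy sketch of adaptive routing: pick the least-loaded of several
# equal-cost uplinks for each flow. Port names and load values are
# hypothetical illustration data, not a real switch implementation.
from collections import defaultdict

def least_loaded(uplinks, load):
    """Return the uplink with the fewest queued bytes."""
    return min(uplinks, key=lambda port: load[port])

uplinks = ["port1", "port2", "port3", "port4"]
load = defaultdict(int)  # queued bytes per uplink

for flow_bytes in (800, 1200, 400, 1500, 700):
    chosen = least_loaded(uplinks, load)
    load[chosen] += flow_bytes  # congestion feedback updates the picture
    print(f"{flow_bytes} B -> {chosen}")
```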
Fat-tree topology is widely recognized as an optimal architecture for InfiniBand-based AI GPU clusters, ensuring consistent bandwidth and high throughput for large-scale deployments. Leveraging cutting-edge hardware such as NVIDIA H100 and H200 GPUs, the DGX platform, and emerging solutions like GB200, this topology is particularly well suited to intensive AI workloads. For example, with Quantum-2 switches and ConnectX-7 adapters/NICs providing eight single 400G ports per node, a 3-tier Fat-tree setup can scale to support up to 65,000 GPUs, while the more common 2-tier configuration efficiently handles clusters of up to 2,000 GPUs.
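Those cluster sizes follow from simple fat-tree arithmetic. The sketch below is a back-of-the-envelope check, assuming non-blocking 64-port switches (Quantum-2 class) and one 400G port per GPU endpoint:

```python
# Back-of-the-envelope fat-tree sizing, assuming a non-blocking topology
# built from 64-port switches with one 400G port per GPU endpoint.
def fat_tree_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints of a non-blocking fat-tree built from
    switches with `radix` ports."""
    if tiers == 1:
        return radix
    return (radix // 2) ** (tiers - 1) * radix

radix = 64  # Quantum-2: up to 64 x 400Gb/s ports
print(fat_tree_endpoints(radix, 2))  # 2048  -> "up to 2,000 GPUs"
print(fat_tree_endpoints(radix, 3))  # 65536 -> "up to 65,000 GPUs"
```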
Flexible connectivity solutions can be tailored to varying AI cluster sizes, data center layouts, and connection distances. For small-scale clusters, multimode transceivers offer cost-effective, reliable performance over short distances. For mid-to-large clusters, single-mode transceivers enable stable, long-distance connections, while DAC cables lower cost and power consumption; together they provide an efficient solution, though DAC cables require careful layout planning due to their shorter reach and thicker cabling.
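One way to frame the choice is by link distance. The helper below is only an illustration; the reach thresholds are typical ballpark figures for 400G links, not specific product specifications.

```python
# Illustrative media selection by link distance. The reach values below
# are rough, typical figures for 400G links, not vendor-guaranteed specs.
def pick_media(distance_m: float) -> str:
    if distance_m <= 3:        # passive DAC: lowest cost and power, thick cable
        return "DAC (passive copper)"
    if distance_m <= 50:       # multimode optics for short in-rack/in-row runs
        return "multimode transceiver"
    return "single-mode transceiver"  # long runs across the data center

for d in (1, 2.5, 30, 500):
    print(f"{d} m -> {pick_media(d)}")
```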
80% of AI training interruptions stem from network-side issues.
95% of network problems are linked to faulty optical interconnects.
InfiniBand NDR Optics
NVIDIA Quantum-2 connectivity options enable flexible topologies with a variety of transceivers, MPO connectors, ACCs, and DACs featuring 1-to-2 or 1-to-4 splitter options. Backward compatibility connects 400Gb/s clusters to existing 200Gb/s or 100Gb/s infrastructures, ensuring seamless scalability and integration.
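For reference, the splitter options mentioned above can be summarized as follows. This is a simple sketch; the exact cable and transceiver choices vary by deployment.

```python
# Breakout options for a single 400Gb/s OSFP switch port, as referenced
# above. Each breakout still occupies one 400G port on the switch.
breakouts = {
    "1x400G": (1, 400),  # straight NDR link
    "2x200G": (2, 200),  # 1-to-2 splitter
    "4x100G": (4, 100),  # 1-to-4 splitter
}

for name, (links, speed) in breakouts.items():
    assert links * speed == 400  # aggregate never exceeds the port speed
    print(f"{name}: {links} downstream link(s) at {speed}G each")
```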
ConnectX-7 Adapter
The NVIDIA ConnectX-7 InfiniBand adapter delivers unmatched performance for AI and HPC workloads. Supporting PCIe Gen4 and Gen5, it offers single or dual network ports with speeds of up to 400Gb/s, available in multiple form factors to meet diverse deployment needs.
Advanced In-Network Computing capabilities and programmable engines are built into the ConnectX-7, enabling data algorithms to be preprocessed and application control paths to be offloaded directly to the network. These features optimize performance, reduce latency, and enhance scalability for demanding applications.
Quantum-2 Switches
The NVIDIA Quantum-2 switches support up to 64 400Gb/s ports or 128 200Gb/s ports using 32 OSFP connectors. The compact 1U design is available with air-cooled or liquid-cooled options, providing flexibility for internal or external management.
Delivering an aggregate 51.2 Tb/s of bidirectional throughput and handling over 66.5 billion packets per second (Bpps), Quantum-2 switches meet the demands of high-performance AI and HPC networks.
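The throughput figure is consistent with simple port arithmetic, as the quick check below shows (64 ports at 400 Gb/s, counted in both directions):

```python
# Quick check of the aggregate throughput figure:
# 64 ports x 400 Gb/s, counted bidirectionally.
ports, port_speed_gbps = 64, 400
aggregate_tbps = ports * port_speed_gbps * 2 / 1000  # x2 for bidirectional
print(aggregate_tbps)  # 51.2
```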
Partner with NADDOD to Accelerate Your InfiniBand Network for Next-Gen AI Innovation