Resources

High-Performance GPU Server Hardware Topology and Cluster Networking-L40S

This article provides an overview of the L40S GPU, comparing its configuration and features with those of the A100 GPU. It covers performance comparisons, machine build recommendations, and networking options, and analyzes the data link bandwidth bottleneck. The information is intended to help readers understand the capabilities and limitations of the L40S GPU for high-performance computing.
Gavin
Nov 27, 2023
High-Performance GPU Server Hardware Topology and Cluster Networking-2

This article explores the hardware topology and cluster networking of high-performance GPU servers, focusing on typical 8-card A100/A800 and H100 host configurations. It delves into the internal structure, interconnections, and bandwidth analysis of various components, including NVSwitch, PCIe Gen4/Gen5, NVLink, and network cards. The article also discusses the impact of different networking options, such as the CX7 network card, on inter-node and intra-node communication.
Abel
Nov 20, 2023
High-Performance GPU Server Hardware Topology and Cluster Networking-1

This article provides an overview of the hardware topology and bandwidth considerations for large-scale GPU training. It covers concepts such as PCIe switch chips, NVLink, NVSwitch, HBM (High Bandwidth Memory), and the various bandwidth units in use. The evolution of NVLink and HBM technologies is discussed, along with the importance of distinguishing and converting between different bandwidth units.
Gavin
Nov 13, 2023
AI Intelligent Computing Center Network Architecture Design Practice

Discover the challenges faced in building high-performance AI networks and the best practices to overcome them. Learn about network congestion, high latency, and limited bandwidth, and explore the benefits of adopting the Fat-Tree architecture and AI-Pool design. Find out how NADDOD's connectivity products can optimize network performance for large-scale AI models.
Brandon
Nov 6, 2023
Exploring RDMA and Low-Latency Networks

RDMA technologies such as RoCEv2, together with mechanisms like PFC, enable ultra-low latency, lossless Ethernet, and congestion control for high-performance computing networks.
Peter
Nov 1, 2023
Why did RDMA emerge and what are its benefits?

RDMA technology enables low-latency, zero-copy networking with minimal CPU and memory resource consumption, revolutionizing data center communication.
Jason
Nov 3, 2023
NVIDIA Spectrum-X Solution Benefits and Product Components

Learn about the NVIDIA Spectrum-X solution, the world's first comprehensive end-to-end Ethernet solution designed for generative AI. Explore the benefits of Spectrum-X and its key product components, including the Spectrum-4 series switches, BlueField-3 DPU network cards, LinkX 400G cables, and software solutions with hardware acceleration support. Discover how Spectrum-X addresses the limitations of traditional Ethernet in AI training and find out where to buy NVIDIA Spectrum-X products.
Gavin
Oct 30, 2023
Ethernet: The Road to Singularity - Modernized RDMA

The Ultra Ethernet Consortium aims to transform Ethernet for massive-scale computing, overcoming RDMA's limitations and delivering high-performance, cost-effective interconnects.
Jason
Oct 27, 2023
The High-speed Road of GPUs

This article discusses GPU interconnection, NVLink's role in CPU-GPU connectivity, and the superior performance of NVIDIA's Grace Hopper Superchip architecture.
Adam
Oct 25, 2023