HPC Networking: Delivering High Performance for DGX Servers

NADDOD Abel InfiniBand Expert Jul 11, 2023

To meet the heavy computational demands of tasks such as AI model training and scientific computing, hundreds or even thousands of GPUs are combined into a single computational unit for training models, evaluating performance, and tuning parameters. To keep such a massive unit efficient, low-latency, high-bandwidth network connections must be established among the server nodes. These connections carry the server-to-server and GPU-to-GPU traffic needed for computation and data access, as well as overall cluster management.

 

The network system of a server cluster comprises several key hardware components: servers, network cards, switches, AOCs (Active Optical Cables), and DACs (Direct Attach Cables). Architecturally, the network cards sit inside the servers and connect either directly to the CPUs or to the GPUs through PCIe switches. The first layer of switches connects to these network cards through the ports on the server chassis, and cables establish the links between servers and switches as well as between switches themselves. If a link carries optical signals, optical transceivers must be installed at both ends of the cable.
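
As a rough illustration of how these components add up, the sketch below counts the server-to-leaf links and, when the links are optical, the transceivers required. The cluster size and per-server port count are hypothetical numbers chosen only for the example.

```python
# Minimal sketch (hypothetical cluster numbers): estimate the interconnect
# parts list for one leaf layer of a GPU cluster. Real designs depend on the
# chosen topology and switch radix; this only illustrates how the components
# described above relate to one another.

def leaf_layer_bom(num_servers: int, fabric_ports_per_server: int,
                   use_optics: bool) -> dict:
    """Count server-to-leaf cables and, if optical, the transceivers needed."""
    links = num_servers * fabric_ports_per_server    # one cable per NIC port
    transceivers = 2 * links if use_optics else 0    # one module per cable end
    return {"server_to_leaf_links": links, "transceivers": transceivers}

# Example: 32 servers, 8 compute-fabric ports each, fiber links with transceivers.
print(leaf_layer_bom(32, 8, use_optics=True))
# -> {'server_to_leaf_links': 256, 'transceivers': 512}
```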

Servers

Using the hardware configurations of the NVIDIA DGX series as a reference, we focus on the network card and chassis network port configurations of each generation to see how the development of the network architecture is reflected in the server hardware.

DGX-1

In 2016, NVIDIA released the DGX-1 server, which features eight V100 GPUs and four single-port 100Gb/s IB/Ethernet NICs. The chassis exposes four QSFP28 ports, each supporting 100G EDR InfiniBand or 100G Ethernet, along with two 10GBASE-T RJ-45 Ethernet ports and one 10/100BASE-T RJ-45 port for the IPMI network.

DGX-2 

The DGX-2 server is equipped with 16 Tesla V100 GPUs, distributed across two separate GPU boards. Each V100 GPU carries 32GB of HBM2 memory, for a combined total of 512GB. The GPUs are interconnected through 12 NVSwitches, enabling full-speed communication between any two GPUs with up to 300GB/s of peer-to-peer bandwidth per GPU. Compute is backed by two Intel Xeon Platinum CPUs, 1.5TB of system memory, and 30TB of NVMe SSD storage for ample cache space. With these specifications, the DGX-2 delivers 2 quadrillion floating-point operations per second (2 PFLOPS), roughly a 10x performance increase over the DGX-1.

 

The DGX-2 server features 12 NVSwitches, with 6 NVSwitches on each GPU board, allowing all 16 GPUs to be interconnected. Each GPU has 6 NVLink channels, one to each NVSwitch on its board, so every GPU is connected to all 6 NVSwitches on that board. Because each board carries 8 GPUs, every NVSwitch has 8 NVLink ports facing the GPUs and 8 ports facing the backplane, which acts as the central bridge between the two boards. Each GPU board therefore presents a total of 48 NVLink connections to the backplane, yielding a backplane bandwidth of 2.4 terabytes per second.
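
The arithmetic behind that figure can be checked with a short sketch. The link counts come from the description above; the per-link rate of 50 GB/s bidirectional (NVLink 2.0, 25 GB/s in each direction) is an assumption added for illustration.

```python
# Quick check of the DGX-2 backplane arithmetic described above.
# Assumption: NVLink 2.0 links at 25 GB/s per direction (50 GB/s bidirectional).

NVSWITCHES_PER_BOARD = 6
LINKS_PER_SWITCH_TO_BACKPLANE = 8   # mirrors the 8 GPU-facing links per switch
GB_PER_LINK_BIDIR = 50              # assumed NVLink 2.0 bidirectional rate

links_to_backplane = NVSWITCHES_PER_BOARD * LINKS_PER_SWITCH_TO_BACKPLANE
backplane_bw_tb_s = links_to_backplane * GB_PER_LINK_BIDIR / 1000

print(links_to_backplane)   # 48 NVLink connections per GPU board
print(backplane_bw_tb_s)    # 2.4 TB/s, matching the figure above
```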


DGX A100

The DGX A100, released in 2020, is equipped with the next-generation Mellanox ConnectX-6 network cards, which raise single-port bandwidth to 200Gb/s. The system contains 8 single-port ConnectX-6 cards supporting InfiniBand and 1 dual-port ConnectX-6 card supporting both InfiniBand and Ethernet, and it can optionally be fitted with a second dual-port ConnectX-6 card. On the chassis there are 12 QSFP ports used for computation, storage, and In-Band management, along with 1 RJ-45 port for Out-of-Band management.

DGX H100

Several networks are deployed in a DGX H100 SuperPOD: a compute fabric used for inter-node communication by applications, a separate storage fabric that isolates storage traffic, and two Ethernet fabrics for In-Band and Out-of-Band (OOB) management.

 

On the back of the DGX H100 CPU tray, the InfiniBand compute fabric ports in the middle use two-port transceivers to reach all eight GPUs. Each pair of In-Band Ethernet management and InfiniBand storage ports provides parallel pathways into the DGX H100 system for increased performance. The OOB port is used for BMC access. There is an additional LAN port next to the BMC port, but it is not used in the DGX SuperPOD.

NICs

The DGX A100 and DGX H100 servers are equipped with ConnectX-6 and ConnectX-7 network cards, respectively. Compared with its predecessor, the ConnectX-7 doubles the per-port bandwidth (up to 400Gb/s NDR versus 200Gb/s HDR), supports more ports per card, and moves from PCIe 4.0 to PCIe 5.0.
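
As a rough comparison of what that means at the node level, the sketch below estimates aggregate compute-fabric bandwidth per system, assuming eight compute NIC ports per node at the rated per-port speeds (200Gb/s on ConnectX-6, 400Gb/s on ConnectX-7). These are illustrative line-rate figures; achievable throughput depends on the fabric design and workload.

```python
# Minimal sketch: per-node compute-fabric bandwidth under the stated assumptions
# (eight compute NIC ports per system at the rated per-port line rate).

def node_fabric_bandwidth_tbps(ports: int, gbps_per_port: int) -> float:
    """Aggregate compute-fabric line rate per node, in Tb/s."""
    return ports * gbps_per_port / 1000

print(node_fabric_bandwidth_tbps(8, 200))   # DGX A100 with ConnectX-6: 1.6 Tb/s
print(node_fabric_bandwidth_tbps(8, 400))   # DGX H100 with ConnectX-7: 3.2 Tb/s
```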

Switches

Computing and Storage Network Switches

Following NVIDIA's server cluster design, both the compute and storage networks in the cluster are built on InfiniBand, with the Mellanox QM8790 recommended for DGX A100 clusters and the Mellanox QM9700 for DGX H100 clusters. Compared with the QM8790 (40 ports of 200Gb/s HDR), the QM9700 raises single-port bandwidth to 400Gb/s NDR, increases the port count to 64, and delivers substantially higher aggregate switching throughput.
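
To see what the larger switch radix buys at the fabric level, here is a minimal sizing sketch for a non-blocking two-tier fat-tree, where the maximum number of end ports is radix squared divided by two. The port counts used (40 for the QM8790, 64 for the QM9700) are the commonly quoted figures and are assumptions for illustration; real SuperPOD fabrics follow NVIDIA's reference architecture rather than this simplified formula.

```python
# Minimal sizing sketch: end-port capacity of a non-blocking two-tier fat-tree.
# Half of each leaf switch's ports face servers, half face the spine layer.

def max_endpoints_two_tier(radix: int) -> int:
    """Maximum end ports for a two-level non-blocking fat-tree of a given radix."""
    return radix * radix // 2

print(max_endpoints_two_tier(40))   # QM8790-class switch (40 x 200G HDR): 800
print(max_endpoints_two_tier(64))   # QM9700-class switch (64 x 400G NDR): 2048
```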

In-Band and Out-of-Band Management Network Switches

For the In-Band management network in the DGX A100 and DGX H100 server clusters, the recommended switches are the SN4600 and SN4600C, respectively; for the Out-of-Band management network, they are the AS4610 and SN2201.

AOC & DAC Cables

The connections between servers and switches, as well as between switches, are primarily established with Direct Attach Cables (DACs) or Active Optical Cables (AOCs). DACs are copper cable assemblies with fixed connectors on both ends; they are classified as active or passive depending on whether the connectors contain signal-conditioning chips. DACs carry electrical signals directly, with no electrical-to-optical or optical-to-electrical conversion. AOCs, by contrast, consist of optical modules on both ends with multimode fiber in between, transmitting data as optical signals.

 

Active DACs, passive DACs, and AOCs can all be used for connections between servers and switches, as well as between switches. The main differences among the three lie in transmission distance, power consumption, and cost: passive DACs are the cheapest and lowest-power option but have the shortest reach, active DACs extend that reach somewhat, and AOCs reach the farthest at higher cost and power.
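
The sketch below is a hypothetical selection helper that captures this trade-off. The reach thresholds are typical rules of thumb rather than vendor specifications; always confirm against the datasheet for the specific speed and cable.

```python
# Hypothetical cable-selection helper based on link distance alone.
# Thresholds are rules of thumb for high-speed links, not vendor specs.

def suggest_cable(distance_m: float) -> str:
    """Suggest a cable type for a server/switch link of the given length."""
    if distance_m <= 3:
        return "passive DAC (lowest cost and power)"
    if distance_m <= 5:
        return "active DAC (copper with signal-conditioning chips)"
    return "AOC (optical, longest reach, higher cost and power)"

for d in (1, 4, 30):
    print(f"{d} m -> {suggest_cable(d)}")
```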

 

NADDOD is a leading provider of comprehensive optical networking solutions, dedicated to building an interconnected world of digital intelligence through innovative computing and networking products. NADDOD continuously delivers innovative, efficient, and reliable products, solutions, and services that empower data centers, high-performance computing, edge computing, artificial intelligence, and other application scenarios with superior switches, AOCs/DACs, optical transceivers, NICs, and complete DPU and GPU solutions. These products and solutions significantly enhance customers' business acceleration capabilities with excellent cost-performance.

 

NADDOD puts customers at the core of everything it does, consistently creating exceptional value for customers across various industries. Backed by a professional technical team and extensive experience in implementing and servicing diverse application scenarios, NADDOD's products and solutions have earned customers' trust through high quality and outstanding performance. They are widely applied in industries and critical sectors such as high-performance computing, data centers, education and scientific research, biomedicine, finance, energy, autonomous driving, internet, manufacturing, and telecommunications.

 

To meet the low-latency, high-bandwidth, and reliability requirements of DGX server clusters, NADDOD provides high-quality InfiniBand HDR AOC and DAC series products. These products deliver high-performance connectivity to server clusters while reducing cost and complexity. Please refer to the table below for more information:

Product | PN | Description
200G QSFP56 to QSFP56 AOC | Q56-200G-A3H | Mellanox MFS1S00-H003E/MFS1S00-H003V Compatible AOC 3m (10ft) 200Gb/s QSFP56 VCSEL-Based IB HDR LSZH Active Fiber Cable (850nm, MMF)
200G QSFP56 to QSFP56 DAC | Q56-200G-CU1H | Mellanox MCP1650-H001E30 Compatible DAC 1m (3ft) 200Gb/s QSFP56 to QSFP56 IB HDR Passive Direct Attach Copper Twinax Cable
200G QSFP56 to 2xQSFP56 AOC | Q2Q56-200G-A1H | Mellanox MFS1S50-H001E/MFS1S50-H001V Compatible AOC 1m (3ft) IB HDR 200Gb/s to 2x100Gb/s QSFP56 to 2xQSFP56 Active Optical Splitter Cable (850nm, MMF, LSZH)
200G QSFP56 to 2xQSFP56 DAC | Q2Q56-200G-CU1H | Mellanox MCP7H50-H001R30 Compatible DAC 1m (3ft) IB HDR 200Gb/s to 2x100Gb/s QSFP56 to 2xQSFP56 Passive Copper (Hybrid) Cable (Passive Twinax, LSZH)

Summary

The hardware that makes up a DGX server cluster is crucial to achieving high computational power in AI training and scientific computing. Different DGX server models are equipped with different hardware components to meet the demands of computation and network communication, and every detail, from server configuration and network card selection to the choice of switches, cables, and optical transceivers, plays a vital role in the overall performance and efficiency of the cluster.

 

By understanding the hardware configurations of the different DGX server models, we can gain better insight into the network architecture and development trends of server clusters. As the technology continues to advance, each new generation of network cards and switches raises network bandwidth and performance further, providing ever stronger support for high-performance computing.