InfiniBand Network Application in DGX Cluster

NADDOD Neo Switch Specialist Mar 4, 2024

NVIDIA has launched the DGX SuperPOD, a cloud-native supercomputer that combines hardware and software into a complete solution. It not only provides the computing power needed for AI models but also enables businesses to deploy AI data centers quickly.

 

InfiniBand DGX

The DGX SuperPOD uses a modular design that scales to different sizes. A standard SuperPOD consists of 140 DGX A100 GPU servers, HDR InfiniBand 100G/200G network cards, and NVIDIA Quantum QM8790 switches.

 

DGX SuperPOD

This article focuses on how InfiniBand (IB) networking is applied in DGX clusters.

1. DGX-1 Equipped with InfiniBand Network Cards

The DGX-1 system (shown in the diagram below) is equipped with four EDR InfiniBand cards (100 Gb/s each) and two 10 Gb/s Ethernet (copper) interfaces. These interfaces connect the DGX-1 to the network for communication and storage.

 

InfiniBand Leaf Switch

Each pair of GPUs connects to a PCIe switch on the system board, and each of these PCIe switches also connects to one InfiniBand card. To reduce latency and improve throughput, the network traffic from those two GPUs should flow through the associated InfiniBand card; this is why the DGX-1 has four InfiniBand cards.
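As a minimal sketch of this affinity, the mapping below pairs each GPU with the IB card behind the same PCIe switch. The device names (mlx5_0 through mlx5_3) and the exact pairing are assumptions for illustration; on a real system, confirm the topology with `nvidia-smi topo -m`.

```python
# Illustrative GPU-to-HCA affinity for a DGX-1 (assumed device names;
# confirm the real mapping with `nvidia-smi topo -m`).
GPU_TO_HCA = {
    0: "mlx5_0", 1: "mlx5_0",   # GPUs 0-1 share a PCIe switch with the first EDR card
    2: "mlx5_1", 3: "mlx5_1",   # GPUs 2-3 -> second EDR card
    4: "mlx5_2", 5: "mlx5_2",   # GPUs 4-5 -> third EDR card
    6: "mlx5_3", 7: "mlx5_3",   # GPUs 6-7 -> fourth EDR card
}

def hca_for_gpu(gpu_id: int) -> str:
    """Return the IB card that sits behind the same PCIe switch as this GPU."""
    return GPU_TO_HCA[gpu_id]

if __name__ == "__main__":
    for gpu in range(8):
        print(f"GPU{gpu} -> {hca_for_gpu(gpu)}")
```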

 

If you want to connect DGX systems over InfiniBand (IB), in theory a single IB card is enough. However, traffic from the GPUs that do not sit behind that card is then forced across the QPI links between the CPUs, which are far too slow for GPU traffic and become a bottleneck.

 

A better solution is to use two IB cards, one attached to each CPU: IB0 and IB2, IB1 and IB3, IB0 and IB3, or IB1 and IB2. This significantly reduces the traffic that must traverse the QPI links. The best performance, however, is always achieved by using all four IB cards.

 

Connecting all four IB cards to the IB fabric is the best option: it delivers the best performance (even bandwidth distribution and the lowest latency) when multiple DGX systems are used together for training.

 

Typically, the smallest IB switch has 36 ports, so a single IB switch can accommodate nine DGX-1 systems using all four IB cards each. This gives each DGX-1 400 Gb/s of bandwidth into the switch.

 

If your application does not need full bandwidth between DGX-1 systems, you can use two IB connections per DGX-1, as mentioned earlier. This allows up to 18 DGX-1 systems to be connected to a single 36-port IB switch.

 

Note: Using only a single IB card is not recommended, but if you choose to do so, up to 36 DGX-1 systems can be connected to a single switch.
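As a quick sanity check on these single-switch numbers, here is a minimal Python sketch (assuming a 36-port switch and EDR at 100 Gb/s per link, as above):

```python
# Sketch: DGX-1 capacity of a single 36-port EDR switch, depending on
# how many of the four IB cards each system connects.
SWITCH_PORTS = 36
EDR_GBPS_PER_LINK = 100

for cards_per_system in (4, 2, 1):
    systems = SWITCH_PORTS // cards_per_system
    bw_per_system = cards_per_system * EDR_GBPS_PER_LINK
    print(f"{cards_per_system} card(s) per DGX-1: up to {systems} systems, "
          f"{bw_per_system} Gb/s each into the switch")

# 4 cards -> 9 systems at 400 Gb/s each
# 2 cards -> 18 systems at 200 Gb/s each
# 1 card  -> 36 systems at 100 Gb/s each
```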

 

2. Two-tier Switching Network

For a large number of DGX-1 systems, a two-tier switching network may be needed. The classic HPC configuration uses 36-port IB switches (often called leaf switches) at the first tier and connects them to a single large core switch, sometimes called a director-class switch. The largest director-class InfiniBand switches have 648 ports. Multiple core switches can also be used, but the configuration becomes considerably more complex.

 

NVIDIA DGX-1 Systems

For a two-tier switching network with all four IB cards of each DGX-1 connected to a 36-port leaf switch and no over-subscription, the maximum number of DGX-1 systems per leaf switch is four. Each DGX-1 contributes 4 ports into the switch, for a total of 16 downlink ports, matched by 16 uplinks from the leaf switch to the core (director-class) switch. A 648-port core switch can therefore take 40 leaf switches (648/16), allowing 160 (40 × 4) DGX-1 systems, 640 IB cards in total, to be connected with full bandwidth symmetry.
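The arithmetic behind these numbers can be checked with a short sketch (36-port leaf switches, one 648-port core switch, four links per DGX-1, no over-subscription):

```python
# Sketch: non-blocking two-tier sizing as described above.
LEAF_PORTS = 36
CORE_PORTS = 648
LINKS_PER_SYSTEM = 4

# With no over-subscription, each leaf needs one uplink per downlink,
# so at most half of its ports can face the DGX-1 systems.
systems_per_leaf = (LEAF_PORTS // 2) // LINKS_PER_SYSTEM   # 18 // 4 = 4
downlinks_per_leaf = systems_per_leaf * LINKS_PER_SYSTEM   # 16
uplinks_per_leaf = downlinks_per_leaf                      # 16 (full bandwidth)
leaf_switches = CORE_PORTS // uplinks_per_leaf             # 648 // 16 = 40
total_systems = leaf_switches * systems_per_leaf           # 160
total_ib_cards = total_systems * LINKS_PER_SYSTEM          # 640

print(leaf_switches, total_systems, total_ib_cards)        # 40 160 640
```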

3. Over-subscription

Over-subscription means that the uplink bandwidth out of a switch is smaller than the bandwidth coming into it, so effective per-device bandwidth is reduced. If we use 2:1 over-subscription from the DGX-1 systems to the first-tier switch (the 36-port leaf switch), each DGX-1 uses only two IB cards to connect to the switch. This gives less bandwidth and higher latency than using all four cards.

 

If we maintain 1:1 bandwidth from the leaf switch to the core switch (no over-subscription, full bandwidth symmetry), nine DGX-1 systems fit on a single leaf switch (18 ports down from the DGX systems and 18 uplink ports to the core switch). A total of 36 (648/18) leaf switches can then be connected to the core switch, for a total of 324 (36 × 9) DGX-1 systems connected together.

 

You can further customize the IB network by over-subscribing the links from the leaf switch to the core switch. For example, use all four IB connections from each DGX system to the leaf switch and apply 2:1 over-subscription toward the core switch, or use two IB connections per DGX system to the leaf switch and again apply 2:1 over-subscription toward the core switch.
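These variants can be explored with a slightly more general sketch. The helper below is only an illustration, but it reproduces the 160-system and 324-system figures above and lets you plug in a leaf-to-core over-subscription ratio:

```python
import math

# Sketch: two-tier sizing with a configurable leaf-to-core over-subscription ratio.
def two_tier_capacity(links_per_system, oversub=1.0,
                      leaf_ports=36, core_ports=648):
    """Return (systems_per_leaf, leaf_switches, total_systems)."""
    # Pick the largest number of systems per leaf such that downlinks plus
    # the uplinks needed for the given ratio still fit in the leaf switch.
    systems_per_leaf = int(leaf_ports // (links_per_system * (1 + 1 / oversub)))
    downlinks = systems_per_leaf * links_per_system
    uplinks = math.ceil(downlinks / oversub)
    leaf_switches = core_ports // uplinks
    return systems_per_leaf, leaf_switches, systems_per_leaf * leaf_switches

print(two_tier_capacity(4))              # (4, 40, 160): 4 cards, no over-subscription
print(two_tier_capacity(2))              # (9, 36, 324): 2 cards, no over-subscription
print(two_tier_capacity(4, oversub=2))   # 4 cards, 2:1 leaf-to-core over-subscription
print(two_tier_capacity(2, oversub=2))   # 2 cards, 2:1 leaf-to-core over-subscription
```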

4. Subnet Manager (SM)

InfiniBand networks have another important component: the Subnet Manager (SM). The SM manages the IB fabric. Exactly one SM is actively managing the fabric at any time, but additional SMs can run in standby, ready to take over if the active SM fails. Deciding how many SMs to run, and where to run them, has a significant impact on the cluster design.

 

The first decision is where to run the SMs. They can run on the IB switches themselves; this is called a hardware SM because it runs on the switch hardware. The advantage is that no additional servers are needed to run the SM. The drawback is that a hardware SM may struggle under heavy IB traffic; for heavy traffic and larger fabrics, the best practice is to run software SMs on dedicated servers.

 

The second decision is how many SMs to run; you must run at least one. The most cost-effective option is a single hardware SM, which works well for small clusters of DGX-1 systems (perhaps 2-4 systems). As the number of systems increases, you should run two SMs simultaneously for high availability (HA), because more users depend on the cluster and a fabric-wide failure costs far more than the failure of a few individual devices.

 

As the number of devices grows, consider running the SMs on dedicated servers (software SMs). You will also need to run at least two SMs for the cluster. Ideally, this means having two dedicated servers for the SMs.
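As a rough summary of this guidance, here is an illustrative helper; the thresholds are only the rule of thumb described above, not hard limits:

```python
# Sketch: rule-of-thumb Subnet Manager deployment for a DGX cluster.
# Thresholds are illustrative, following the guidance in the text.
def recommend_sm_setup(num_systems: int) -> str:
    if num_systems <= 4:
        # Small cluster: a single hardware SM on an IB switch is most cost-effective.
        return "1 hardware SM running on an IB switch"
    if num_systems <= 8:
        # More users depend on the fabric: add a standby SM for high availability.
        return "2 SMs (active + standby) for HA"
    # Larger clusters and heavy IB traffic: move to software SMs on dedicated servers.
    return "2+ software SMs on dedicated servers"

for n in (2, 6, 32):
    print(n, "systems ->", recommend_sm_setup(n))
```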

5. Application of InfiniBand Cards in DGX Systems

InfiniBand is used extensively in the DGX SuperPOD. The following diagram shows the network topology of the DGX A100/H100 256 SuperPOD:

 

DGX A100 H100 256 SuperPOD

The following is the network topology diagram for DGX A100/H100 1K POD:

 

DGX A100 H100 1K POD

NADDOD NDR800G/HDR200G Optical Interconnect Solution

NADDOD offers a flexible NDR optical interconnect solution. The physical form factor of the NDR switch port is OSFP, with eight lanes per port running 100 Gb/s SerDes. From a connection-speed perspective there are therefore three mainstream options: 800G to 800G, 800G to 2x400G, and 800G to 4x200G. In addition, each lane can be downgraded from 100 Gb/s to 50 Gb/s for interconnection with previous-generation HDR devices (which use 50 Gb/s SerDes), supporting 400G to 2x200G configurations.
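The port-speed options follow directly from the lane count and per-lane SerDes rate; the sketch below just makes that arithmetic explicit (8 lanes per OSFP port, 100 Gb/s per lane for NDR, 50 Gb/s per lane when downgraded for HDR interop):

```python
# Sketch: OSFP port speeds and breakout options from lane count and lane rate.
LANES_PER_OSFP_PORT = 8

def port_options(lane_gbps, splits):
    """Total port speed and its breakout options for a given per-lane rate."""
    total = LANES_PER_OSFP_PORT * lane_gbps
    return total, [f"{n}x{total // n}G" for n in splits]

# NDR: 100 Gb/s per lane -> 800G port, used as 800G, 2x400G, or 4x200G
print(port_options(100, (1, 2, 4)))   # (800, ['1x800G', '2x400G', '4x200G'])

# Lanes downgraded to 50 Gb/s for HDR interop -> 400G port, used as 2x200G
print(port_options(50, (2,)))         # (400, ['2x200G'])
```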

 

The NDR cable and transceiver series offers a wide range of options for connecting switches and adapter systems, covering reaches of up to 2 kilometers in the data center and designed specifically for accelerated AI computing systems. To minimize data retransmission, the cables and transceivers deliver the low latency, high bandwidth, and extremely low bit error rate (BER) that AI and accelerated computing applications require.

 

Regarding connector types, there are three main options: passive copper cables (DAC), active copper cables (ACC), and optical transceivers with fiber jumpers. DAC supports transmission distances of 1-3 meters (2 meters for direct connections), ACC supports 3-5 meters, multimode transceivers reach up to 50 meters, and single-mode transceivers reach up to 500 meters.
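For quick reference, the reach figures above can be collected into a small lookup; this is only a sketch of the numbers quoted here, and the supported distance of a specific part should always be checked against its datasheet:

```python
# Sketch: maximum reach per NDR connectivity option, using the figures above.
MAX_REACH_M = {
    "DAC (passive copper)": 3,               # 1-3 m (2 m for direct connections)
    "ACC (active copper)": 5,                # 3-5 m
    "Multimode transceiver + fiber": 50,     # up to 50 m
    "Single-mode transceiver + fiber": 500,  # up to 500 m
}

def options_for_distance(meters: float) -> list[str]:
    """List the cabling options whose quoted reach covers a given link length."""
    return [name for name, reach in MAX_REACH_M.items() if meters <= reach]

print(options_for_distance(2))     # all four options
print(options_for_distance(30))    # multimode or single-mode optics
print(options_for_distance(200))   # single-mode optics only
```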

 

NADDOD NDR product

In addition, NADDOD provides InfiniBand HDR 200G transceivers and cables, including active optical cables supporting 200 Gb/s. NADDOD offers direct-attach 200G copper cables with a maximum reach of 3 meters and 200G active optical cables with a maximum reach of 100 meters. All cables for 200 Gb/s connections use the standard QSFP56 form factor, and the active optical cables feature the world's first silicon photonics engine supporting 50 Gb/s channels.

 

With its 200 Gb/s and 400 Gb/s networking solutions, NADDOD remains at the forefront of driving the industry toward exascale computing.