Efficient Data Transfer: Network Architecture for NVIDIA DGX A100 Server Cluster

NADDOD Dylan InfiniBand Solutions Architect Jul 18, 2023

In the world of high-performance computing, efficient data transfer is crucial for maximizing the performance of server clusters. NVIDIA's DGX A100 server cluster leverages a combination of InfiniBand (IB) and Ethernet networks to achieve high-speed and reliable data communication. Let's explore the network architecture and optical transceiver configuration of the DGX A100 server cluster in detail.

 

In the DGX A100 and DGX H100 server clusters, two types of networks are employed, distinguished by protocol: InfiniBand and Ethernet. By function, they break down into the compute network, the storage network, the In-Band management network, and the Out-of-Band management network. The compute and storage networks run over InfiniBand, while the In-Band and Out-of-Band management networks use Ethernet.

 

In the network architecture of the NVIDIA DGX A100 cluster, each Scalable Unit (SU) consists of 20 DGX A100 servers. Four DGX A100 servers are placed in each dedicated Compute Rack, which is equipped with two Power Distribution Units (PDUs), while the various switches are housed in their own rack rather than in the Compute Racks. Each SU therefore contains six racks: five for servers and one for switches.

I. Computation Network (Compute Fabric)

In large-scale training of AI models with a significant number of parameters, multiple GPUs need to work collaboratively in parallel. By distributing the data among N GPUs, with each GPU processing 1/N of the data, the training speed can ideally be increased by a factor of N. During training, each GPU calculates local gradients from its assigned data. At the end of this phase, the GPUs communicate with each other to average the local gradients into a global gradient, which is then fed back to every GPU for the next training step. This operation, known as all-reduce, combines (reduces) the per-GPU values, here by averaging, and distributes the result back to all GPUs. The inter-GPU communication speed during the all-reduce operation therefore directly impacts the computational performance of the system.
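To make the averaging step concrete, here is a minimal Python sketch that simulates all-reduce over a handful of "GPUs" held as plain lists. It only illustrates the reduce-then-broadcast data flow described above; a real DGX cluster would perform this with NCCL over the InfiniBand compute fabric, typically using a ring or tree algorithm rather than this naive gather.

```python
# Minimal sketch of the all-reduce averaging step described above.
# N simulated "GPUs" are plain Python lists; a real cluster would use
# NCCL over the InfiniBand compute fabric instead.

def all_reduce_average(local_gradients):
    """Average per-GPU gradient vectors and return a copy for every GPU."""
    num_gpus = len(local_gradients)
    length = len(local_gradients[0])

    # Reduce: sum corresponding gradient entries across all GPUs, then average.
    global_gradient = [
        sum(grads[i] for grads in local_gradients) / num_gpus
        for i in range(length)
    ]

    # Broadcast: every GPU receives the same averaged (global) gradient.
    return [list(global_gradient) for _ in range(num_gpus)]


if __name__ == "__main__":
    # Four simulated GPUs, each holding a local gradient for two parameters.
    local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(all_reduce_average(local))  # every GPU ends up with [4.0, 5.0]
```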

 

The compute network in the DGX A100 cluster can utilize up to three layers of switches: Leaf switches, Spine switches, and Core switches (the latter only in larger-scale clusters). In the DGX A100 server cluster, all three layers use the 40-port Mellanox QM8790 switch. The compute network architecture can be summarized as follows:

- Each SU of 20 DGX A100 servers connects to eight Leaf switches, and every DGX A100 server in an SU connects to each of the eight Leaf switches in its SU. This rail-optimized topology improves deep learning training performance.
- A Spine Group (SG) consists of ten QM8790 switches and carries traffic between different SUs. A 140-node cluster requires eight SGs, corresponding to 80 Spine switches.
- A Core Group (CG) consists of 14 QM8790 switches and interconnects the SGs. A 140-node cluster requires two CGs, totaling 28 Core switches.

DGX A100 Cluster Compute Fabric

For AI server clusters with 80 nodes or fewer, the compute network architecture includes only two layers of switches.

 

Considering the compute network switch and cable requirements, a SuperPOD cluster composed of 140 DGX A100 servers requires 164 switches and 3,364 cables, corresponding to 6,728 ports.
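As a quick cross-check, the switch total can be reproduced from the per-SU figures above, and the port count follows from the cable count, since each cable terminates in a port at both ends. Below is a back-of-envelope Python sketch that uses only the numbers quoted in this article; the cable total is taken from the reference design rather than re-derived.

```python
# Back-of-envelope tally of compute-fabric switches and ports for a
# 140-node (7-SU) DGX A100 SuperPOD, using the figures quoted above.

NUM_SU = 140 // 20                        # 7 scalable units of 20 servers each
LEAF_PER_SU = 8                           # leaf switches per SU
SPINE_GROUPS, SPINE_PER_SG = 8, 10
CORE_GROUPS, CORE_PER_CG = 2, 14

leaf = NUM_SU * LEAF_PER_SU               # 56 leaf switches
spine = SPINE_GROUPS * SPINE_PER_SG       # 80 spine switches
core = CORE_GROUPS * CORE_PER_CG          # 28 core switches
total_switches = leaf + spine + core      # 164 switches in total

CABLES = 3_364                            # cable count from the reference design
ports = 2 * CABLES                        # each cable terminates in two ports

print(total_switches, ports)              # 164 6728
```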

II. Storage Network (Storage Fabric)

The storage network in a server cluster utilizes InfiniBand technology to provide high throughput for AI servers to access shared storage. In the case of a 140-node server cluster, the storage network architecture consists of two layers of switches, requiring 26 switches, 660 cables, and approximately 1,320 ports.

III. In-Band Management Network and Out-of-Band Management Network

The In-Band management network in a server cluster plays a crucial role in connecting and managing all services within the cluster. It controls node access to the main file system, storage pools, and other services both inside and outside the cluster. The Out-of-Band management network, by contrast, manages the system through the Baseboard Management Controller (BMC) in each server. It runs on a separate, lightly loaded network to avoid bandwidth competition with other cluster services, ensuring the smooth operation of the cluster. Both In-Band and Out-of-Band management networks use Ethernet.

 

In the DGX A100 cluster, In-Band management utilizes 100Gb Ethernet with SN4600 switch models, while Out-of-Band management employs 1Gb Ethernet with AS4610 switch models. These network configurations provide efficient management and control capabilities for the server cluster.

IV. Estimating the Quantity of Optical Transceivers/Optic Chips in the DGX A100 Server Cluster

In the DGX A100 server cluster, each A100 GPU corresponds to approximately 7 units of 200G optical transceivers.

 

To estimate the quantity of optical transceivers/optic chips required in the DGX A100 server cluster, we consider both the IB and Ethernet networks. Because the IB network has far more ports, each with higher bandwidth, than the Ethernet management networks, it dominates the demand. We therefore take the total port count of the compute and storage networks as the basis for estimating the cluster's demand for optical transceivers/optic chips.

 

The table below shows the calculated port quantities for the whole server cluster, for a single server, and for a single GPU. In clusters of 20-80 DGX A100 servers, a single GPU corresponds to approximately 5.3-5.8 ports, while in clusters of 120 and 140 DGX A100 servers, a single GPU corresponds to approximately 7.2 ports. The difference is primarily due to the addition of the third switch layer (the Core Groups) in clusters of 120 or more nodes, which increases the number of switches, cables, and ports required.

DGX A100 Server Cable & Ports Number
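For the 140-node case, the roughly 7.2 ports per GPU can be reproduced directly from the compute-fabric and storage-fabric port totals quoted earlier. The short sketch below is only a sanity check, using the fact that each DGX A100 server contains eight A100 GPUs.

```python
# Sanity check of the ~7.2 IB ports per GPU figure for a 140-node cluster,
# using the compute-fabric (6,728) and storage-fabric (~1,320) port totals.

SERVERS = 140
GPUS_PER_SERVER = 8                 # each DGX A100 contains 8 A100 GPUs

compute_ports = 6_728
storage_ports = 1_320

ports_per_gpu = (compute_ports + storage_ports) / (SERVERS * GPUS_PER_SERVER)
print(round(ports_per_gpu, 1))      # ~7.2 ports per GPU
```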

Because the DGX A100 server cluster uses 200G QSFP56 ports, each carrying four 50G lanes, at both the server and switch ends of the IB network, the ratio of 50G optic chips (a transmitter and receiver pair counted as one unit) to 200G optical transceivers is 4:1. Based on this ratio, we estimate that a single GPU in the DGX A100 cluster requires approximately 28 units of 50G optic chips.
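Putting the two ratios together gives the per-GPU estimate. The sketch below assumes one 200G QSFP56 transceiver per IB port and four 50G lanes (hence four optic-chip units) per transceiver, as described above.

```python
# Per-GPU estimate of 200G transceivers and 50G optic chips, assuming one
# QSFP56 transceiver per IB port and a 4:1 chip-to-transceiver ratio
# (each 200G QSFP56 module carries four 50G lanes).

PORTS_PER_GPU = 7          # ~7 IB ports, hence ~7 transceivers, per A100 GPU
CHIPS_PER_TRANSCEIVER = 4  # 4 x 50G lanes per 200G QSFP56 module

transceivers_per_gpu = PORTS_PER_GPU
optic_chips_per_gpu = transceivers_per_gpu * CHIPS_PER_TRANSCEIVER
print(transceivers_per_gpu, optic_chips_per_gpu)  # 7 200G modules, 28 50G chips
```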

V. Conclusion

In summary, the network architecture and hardware configuration of the NVIDIA DGX A100 cluster are crucial to efficient deep learning training and data processing. The three-layer switch architecture of the compute network and the InfiniBand-based storage network provide high-speed data transfer and communication, while properly configured management networks ensure smooth operation and administration of the cluster. Estimating the demand for optical transceivers/optic chips makes it possible to plan for and meet the cluster's hardware requirements. Together, this well-designed network architecture gives the DGX A100 cluster robust support and performance for AI workloads.

 

NADDOD puts customer-centricity at its core, consistently creating exceptional value for customers across various industries. With a professional technical team and extensive experience in implementing and servicing diverse application scenarios, NADDOD's products and solutions have earned customers' trust and favor through high quality and outstanding performance, and are widely applied in industries and critical sectors such as high-performance computing, data centers, education and scientific research, biomedicine, finance, energy, autonomous driving, internet, manufacturing, and telecommunications.

 

To meet the demand for optical transceivers in the computation and storage networks of the DGX A100 server cluster, NADDOD provides high-quality InfiniBand 200G QSFP56 optical transceivers. These products offer exceptional performance for server clusters while reducing costs and complexity.