A Quick Guide to GPU Server Network Card Configuration in the AI Era

NADDOD Nathan Optics Application Engineer Mar 25, 2024

In the era of Generative AI (GenAI) and large models, our focus extends beyond the computational power of individual GPU cards to the total effective computing power of GPU clusters. The computing power of a single GPU card can be estimated from its peak performance: the Nvidia A100, for example, has a peak FP16/BF16 dense compute throughput of 312 TFLOPS, and a single card's effective computing power is roughly ~298 TFLOPS [1, 2].


While we are familiar with the use of single GPU cards and individual GPU servers, we are still learning and gathering insights from practice regarding building GPU clusters, determining the scale of GPU clusters, and planning the total effective computing power. In this article, we'll delve into discussions surrounding GPU cluster network configurations, cluster scale, and overall effective computing power, with a particular focus on the computational network plane.




GPU Server Network Card Configuration


The scale and total effective computing power of a GPU cluster largely depend on the network configuration and switch equipment used. Nvidia provides recommended cluster network configurations for each of its GPU servers. For example, for the DGX A100 server, the recommended inter-server network connection is 200 Gbps per card (meaning each A100 card has a 200 Gbps network connection for communicating with A100 cards in other servers), and a single DGX A100 server is configured with 8 computational network cards (such as InfiniBand 200 Gbps) [1, 2].


DGX A100 System Server Block Diagram


So, how is the computational network bandwidth between GPU servers determined?


Apart from cost considerations, the computational network bandwidth between GPU servers is determined by the PCIe bandwidth supported by the GPU cards. This is because the network cards of a GPU server are connected to its GPU cards via PCIe Switches (GPU <--> PCIe Switch <--> NIC), so the PCIe bandwidth caps the usable computational network bandwidth.


Consider the Nvidia DGX A100 server: a single A100 card supports PCIe Gen4, with 64 GB/s of bidirectional bandwidth and 32 GB/s of unidirectional bandwidth, equivalent to 256 Gbps. A 200 Gbps network card per A100 card is therefore sufficient. Accordingly, for the computational network, the Nvidia DGX A100 server is configured with 8 Mellanox ConnectX-6 InfiniBand network cards (note: Mellanox ConnectX-7 cards can also be used, as they also support 200 Gbps). A 400 Gbps network card per A100 card would be underutilized due to the PCIe Gen4 bandwidth limitation, wasting network card bandwidth.
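The arithmetic above can be sketched in a few lines. This is a minimal illustration using the round numbers quoted in this article (32 GB/s unidirectional for PCIe Gen4 x16); actual usable throughput is somewhat lower due to protocol and encoding overhead, and the helper names are ours, not any vendor API:

```python
def pcie_unidir_gbps(gb_per_s: float) -> float:
    """Convert a PCIe unidirectional bandwidth in GB/s to Gbps."""
    return gb_per_s * 8


def nic_fits(pcie_unidir_gb_s: float, nic_gbps: float) -> bool:
    """Rough check: can the PCIe link feed the NIC at full line rate?"""
    return nic_gbps <= pcie_unidir_gbps(pcie_unidir_gb_s)


# A100: PCIe Gen4 x16, ~32 GB/s unidirectional -> ~256 Gbps
print(pcie_unidir_gbps(32))   # 256.0
print(nic_fits(32, 200))      # True: a 200 Gbps NIC is fully usable
print(nic_fits(32, 400))      # False: a 400 Gbps NIC would be PCIe-limited
```

This makes the trade-off concrete: a 400 Gbps NIC behind a PCIe Gen4 x16 link can never be driven at line rate.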


NVIDIA DGX A100 System Topology


For the Nvidia DGX H100 server, a single H100 card supports PCIe Gen5, with 128 GB/s of bidirectional bandwidth and 64 GB/s of unidirectional bandwidth, equivalent to 512 Gbps. Nvidia's recommended standard configuration is therefore a 400 Gbps computational network card per H100 card. For the computational network, the Nvidia DGX H100 server is configured with 8 Mellanox ConnectX-7 InfiniBand network cards, providing each H100 card with a 400 Gbps external network connection.


DGX H100 Configuration


It is important to note that for the computational network configuration of A800 and H800 servers, the standard configurations recommended by Nvidia DGX are generally not used in China. A800 servers, for example, commonly use one of two computational network card configurations. The first is 8 x 200 GbE, where each A800 card has its own 200 GbE network card, giving 8 A800 cards a total of ~1.6 Tbps of RoCEv2 computational network connectivity. The second is 4 x 200 GbE, where every two A800 cards share one 200 GbE network card: each card can burst up to 200 GbE but averages 100 GbE of external connectivity. The second approach resembles the Nvidia DGX V100 design. Since communication can be aggregated within an A800 server before crossing to other servers, the two configurations have a broadly similar impact on overall cluster efficiency.


The H800 supports PCIe Gen5, and the common computational network card configuration for H800 servers is 8 x 400 GbE: each H800 card has its own 400 GbE network card, giving each card a 400 GbE external computational network connection and 8 H800 cards a total of ~3.2 Tbps of RoCEv2 computational network connectivity.
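The per-server aggregates quoted for the A800 and H800 configurations follow directly from NIC count times NIC speed. A minimal sketch of that arithmetic (illustrative only; the function name is ours):

```python
def server_aggregate_gbps(num_nics: int, nic_gbps: int) -> int:
    """Total compute-network bandwidth of one 8-GPU server."""
    return num_nics * nic_gbps


configs = {
    "A800 8x200GbE": server_aggregate_gbps(8, 200),  # ~1.6 Tbps total
    "A800 4x200GbE": server_aggregate_gbps(4, 200),  # shared NICs
    "H800 8x400GbE": server_aggregate_gbps(8, 400),  # ~3.2 Tbps total
}
for name, gbps in configs.items():
    # Average per-GPU share across the server's 8 GPU cards
    print(f"{name}: {gbps} Gbps total, {gbps / 8:.0f} Gbps per GPU on average")
```

Note how the 4 x 200 GbE option halves the average per-GPU share to 100 Gbps even though any one GPU can still burst to 200 GbE.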


While Nvidia has implemented high-speed interconnection between multiple GPUs within a single server using NVLink and NVSwitch, when multiple servers are combined into a cluster, PCIe bandwidth remains the primary cluster network bottleneck, because network cards still connect to GPU cards mainly through PCIe Switches. With the future adoption of PCIe Gen6 (standard released in 2022) and eventually PCIe Gen7 (standard expected in 2025), the overall performance of GPU clusters will reach a new level. The Nvidia H20, released in 2024, also supports PCIe Gen5.
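Each PCIe generation roughly doubles per-lane bandwidth, which is why the NIC speed that a GPU can saturate doubles as well. A small sketch of the x16 unidirectional figures, using the approximate round numbers in this article (exact usable rates depend on encoding and protocol overhead):

```python
# Approximate x16 unidirectional bandwidth by PCIe generation, in GB/s.
# Round numbers consistent with this article; each generation doubles.
PCIE_X16_UNIDIR_GB_S = {4: 32, 5: 64, 6: 128, 7: 256}

for gen, gb_s in sorted(PCIE_X16_UNIDIR_GB_S.items()):
    # Multiply by 8 to express the same figure in Gbps
    print(f"PCIe Gen{gen} x16: ~{gb_s} GB/s unidirectional (~{gb_s * 8} Gbps)")
```

Reading off this table: Gen4 comfortably feeds a 200 Gbps NIC, Gen5 a 400 Gbps NIC, and Gen6/Gen7 will make 800 Gbps and faster NICs practical per GPU.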


Here, I would also like to introduce NADDOD, a leading provider of cutting-edge optical communication solutions. Our comprehensive range of offerings includes:


  • Up to 800Gbps Optical Transceivers: Our transceivers are compatible with a wide range of networking equipment from top brands like Cisco, HP, Brocade, Juniper, H3C, Huawei, Arista, and more. We also offer custom solutions tailored to your specific needs.



  • Wave Division Multiplexers (WDM): Choose from our range of WDM solutions, including OADM, FWDM, CWDM, DWDM, and CEx WDM, supporting up to 48 channels in 100GHz grid and 96 channels in 50GHz grid configurations.


  • Fiber Patch Cords and Pigtails: We offer a variety of fiber patch cords and pigtails with different fiber types (OS2, OM1, OM2, OM3, OM4, OM5) and connectors (LC, SC, FC, ST, MPO, MU, MTRJ) to suit your networking requirements.


  • FBT/PLC Splitters, Fiber Optic Connectors/Adaptors/Attenuators: Our range includes splitters, connectors, adaptors, and attenuators to facilitate efficient fiber optic connectivity.


  • Ethernet Switches and POE Solutions: We provide Ethernet switches, POE switches, injectors, splitters, and extenders to support your networking infrastructure needs.


NADDOD NDR product


All our products are rigorously tested and certified to meet CE, FCC, and RoHS standards, ensuring high quality and compliance. With flexible minimum order quantities (MOQ), reliable performance, competitive pricing, and fast delivery, we offer the perfect solution for your optical communication needs. Partner with us for OEM/ODM cooperation and unlock even more possibilities. Get in touch with us today to learn more about how NADDOD can elevate your network infrastructure.