Understanding the Ratio of Optical Modules to GPUs in Different Networking Architectures - NADDOD Blog

How Many Optical Modules Does One GPU Need?

NADDOD Mark News Writer Dec 4, 2023

Various versions of calculations regarding the ratio of optical modules to GPUs circulate in the market. The main reason for the inconsistency in these numbers is the varying usage quantity of optical modules in different networking architectures. The actual number of optical modules used primarily depends on the following factors.

 

Discrepancies in the calculated ratio of optical modules to GPUs arise because different networking architectures consume different quantities of modules.

 

1. Network Card Model

Two NIC models are mainly involved: the ConnectX-6 (200Gb/s, used primarily with the A100)

CX-6 NIC

and the ConnectX-7 (400Gb/s, used primarily with the H100).

 

The next-generation ConnectX-8 (800Gb/s) is expected to be released in 2024.

2. Switch Model

Two switch models are mainly involved: the QM9700 series (32 OSFP ports at 2x400Gb/s each, for a total of 64 channels of 400Gb/s and 51.2Tb/s of throughput)

QM9700 Switches

 

and the QM8700 series (40 QSFP56 ports, for a total of 40 channels of 200Gb/s and 16Tb/s of throughput).

QM8700 Switches

3. Number of Units (Scalable Units)

The number of units determines the hierarchy of the switching architecture: a two-layer architecture suffices when the unit count is low, while a three-layer architecture is required when it is high.

 

H100 SuperPOD: Each unit consists of 32 nodes (DGX H100 servers) and supports a maximum of 4 units to form a cluster, using a two-layer switching architecture.

 

A100 SuperPOD: Each unit consists of 20 nodes (DGX A100 servers) and supports a maximum of 7 units to form a cluster. If the number of units exceeds 5, a three-layer switching architecture is required.

DGX H100 Scalable unit

4. In conclusion

  1. A100+ConnectX6+QM8700 Three-layer Network: ratio 1:6, all using 200G optical modules.

  2. A100+ConnectX6+QM9700 Two-layer Network: 1:0.75 for 800G optical modules plus 1:1 for 200G optical modules.

  3. H100+ConnectX7+QM9700 Two-layer Network: 1:1.5 for 800G optical modules plus 1:1 for 400G optical modules.

  4. H100+ConnectX8 (yet to be released)+QM9700 Three-layer Network: ratio 1:6, all using 800G optical modules.

 

Incremental market for optical transceivers:

 

Assuming shipments of 300,000 H100 units and 900,000 A100 units in 2023, total demand would reach 3.15 million units of 200G, 300,000 units of 400G, and 787,500 units of 800G optical modules, expanding the AI optical-module market by roughly $1.38 billion.

 

Assuming shipments of 1.5 million H100 units and 1.5 million A100 units in 2024, total demand would reach 750,000 units of 200G, 750,000 units of 400G, and 6.75 million units of 800G optical modules. This represents AI-driven market growth of about $4.97 billion, roughly equal to the entire optical-module market in 2021.

 

The following is a detailed calculation process for each of the above situations.

 

Case 1: A100+ConnectX6+QM8700 three-layer network

Each A100 system exposes a total of eight compute-network interfaces, shown as four on the left side and four on the right side of the diagram. Current A100 shipments are primarily paired with ConnectX-6 cards for external communication, at 200Gb/s per interface.

QM8700

In the first layer, each node has 8 interfaces (ports) and connects to 8 leaf switches; every 20 nodes form a scalable unit (SU). The first layer therefore requires 8 * SU leaf switches, 8 * SU * 20 cables, and 2 * 8 * SU * 20 units of 200G optical modules.
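As a sanity check, the first-layer arithmetic can be sketched in Python (a minimal illustration; the function name is ours, not NVIDIA's):

```python
def case1_first_layer(su: int) -> dict:
    """First-layer counts for the A100 + ConnectX-6 + QM8700 fabric.

    su: number of scalable units; each unit has 20 nodes,
    and each node exposes 8 x 200G ports.
    """
    nodes = 20 * su
    cables = 8 * nodes                # one cable per node port (8 * SU * 20)
    return {
        "leaf_switches": 8 * su,      # 8 leaf switches per unit
        "cables": cables,
        "modules_200g": 2 * cables,   # one optical module on each cable end
    }
```

For a full 7-unit cluster this gives 56 leaf switches, 1,120 cables, and 2,240 units of 200G modules in the first layer alone.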

SU1-SU4

In the second layer, the non-blocking design makes upstream bandwidth equal to downstream bandwidth. The first layer's total unidirectional bandwidth is 200G multiplied by the number of cables, and since the second layer also runs 200G per cable, it needs the same number of cables as the first layer: 8 * SU * 20 cables and 2 * 8 * SU * 20 units of 200G optical modules.

The spine-switch count follows from distributing the second-layer cables across the available switch ports. However, when there are few leaf switches, multiple parallel links can be run between each leaf-spine pair (as long as the 40-port limit is not exceeded) to economize on spine switches. Thus, for unit quantities of 1/2/4/5, the required spine switches number 4/10/20/20, while the required optical modules number 320/640/1280/1600. The spine-switch count does not grow proportionally with units, but the optical-module count does.
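Because the non-blocking design mirrors the first layer's cable count in the second layer, the second-layer module figures quoted above follow from a one-line formula, sketched here for illustration:

```python
def case1_second_layer_modules(su: int) -> int:
    # Non-blocking: the spine layer carries exactly as many 200G cables
    # as the leaf layer below it, and each cable needs 2 optical modules.
    cables = 8 * su * 20
    return 2 * cables

print([case1_second_layer_modules(su) for su in (1, 2, 4, 5)])
# [320, 640, 1280, 1600]
```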

 

When the unit quantity reaches 7, the third-layer architecture is required. Due to the non-blocking design, the number of cables needed in the third layer is the same as in the second layer.

 

For the SuperPOD, NVIDIA recommends networking with 7 units, adding the third-layer architecture and introducing core switches. The switch quantities and cable connections per layer for various unit counts are shown in the diagram.

SuperPOD

For 140 servers, there are a total of 140 * 8 = 1,120 A100 GPUs. This requires 140 QM8790 switches and 3,360 cables. Additionally, 6,720 units of 200G optical modules are needed. The ratio between A100 GPUs and 200G optical modules is 1:6 (1,120 GPUs to 6,720 optical modules).
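The whole three-layer calculation condenses to a few lines (a sketch under the article's assumptions of 7 units and a non-blocking fat tree; variable names are ours):

```python
servers = 7 * 20                  # 7 scalable units of 20 DGX A100 nodes
gpus = servers * 8                # 1,120 A100 GPUs
cables_per_layer = servers * 8    # non-blocking: same cable count per layer
cables = 3 * cables_per_layer     # three layers -> 3,360 cables
modules_200g = 2 * cables         # two modules per cable -> 6,720
ratio = modules_200g / gpus       # 6.0, i.e. the 1:6 ratio
```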

Case 2: A100+ConnectX6+QM9700 two-layer network

Currently, this specific configuration is not included in the recommended setups. However, in the future, a growing number of A100 clusters may network with QM9700 switches. This would reduce the total number of optical modules used while introducing demand for 800G modules. The main difference lies in the first layer: instead of eight individual 200G cables per server, 1-to-4 QSFP-to-OSFP breakout cabling is used, so each twin-port 800G module on the switch side fans out into four 200G connections on the server side.

Switch to QSFP56 4x HDR100

In the first layer: For a cluster with 7 units and 140 servers, there are a total of 140 * 8 = 1,120 interfaces. This corresponds to 280 1-to-4 cables, resulting in a demand for 280 units of 800G and 1,120 units of 200G optical modules. This requires 12 QM9700 switches.

 

In the second layer: Utilizing only 800G connections, 280 * 2 = 560 units of 800G optical modules are needed along with 9 QM9700 switches.

 

Therefore, for 140 servers and 1,120 A100 GPUs, a total of 21 switches (12 + 9) are required, along with 840 units of 800G optical modules and 1,120 units of 200G optical modules.

 

The ratio between A100 GPUs and 800G optical modules is 1,120:840, which simplifies to 1:0.75. The ratio between A100 GPUs and 200G optical modules is 1:1.
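The case-2 arithmetic can likewise be reproduced in a few lines (an illustrative sketch; variable names are ours):

```python
gpus = 7 * 20 * 8                     # 1,120 A100 GPUs, one 200G port each
breakout_cables = gpus // 4           # 1-to-4 breakouts -> 280 cables
layer1_800g = breakout_cables         # one 800G module per breakout, switch side
layer2_800g = 2 * breakout_cables     # 560 modules between leaf and spine
modules_800g = layer1_800g + layer2_800g  # 840 in total
modules_200g = gpus                   # one 200G module per server port
ratio_800g = modules_800g / gpus      # 0.75
ratio_200g = modules_200g / gpus      # 1.0
```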

 

Case 3: H100+ConnectX7+QM9700 two-layer network

One unique aspect of the H100 design is that, although each system has 8 GPUs, its 8 400G network cards are consolidated into 4 800G interfaces. This consolidation creates significant demand for 800G optical modules.

H800

In the first layer, NVIDIA's recommended configuration connects a twin-port "2*400G" 800G optical module to each server-facing switch interface. This is achieved with two optical (MPO) cables per twin-port module, each cable inserted into a separate switch.

H800-network-links

Therefore, in the first layer, each unit consists of 32 servers, and each server connects to 2*4=8 switches. A SuperPOD with 4 units requires a total of 4*8=32 leaf switches in the first layer.

 

NVIDIA recommends reserving one node for management purposes (UFM). Since the impact on the usage of optical modules is limited, let's approximate the calculation based on 4 units with a total of 128 servers.

 

In the first layer, a total of 4*128=512 units of 800G optical modules and 2*4*128=1024 units of 400G optical modules are needed.

SU1-SU4

In the second layer, the switches are directly connected using 800G optical modules. Each leaf switch is connected downward with a unidirectional speed of 32*400G. To ensure consistent upstream and downstream speeds, the upward connection requires a unidirectional speed of 16*800G. This necessitates 16 spine switches, resulting in a total of 4*8*16*2=1024 units of 800G optical modules needed.

 

In this architecture, the combined total for both layers requires 1,536 units of 800G optical modules and 1,024 units of 400G optical modules. With a total of 4*32*8=1,024 H100 GPUs in the SuperPOD, the ratio between GPUs and 800G optical modules is 1:1.5 (1,024 GPUs to 1,536 optical modules). The ratio between GPUs and 400G optical modules is 1:1 (1,024 GPUs to 1,024 optical modules).
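The case-3 totals can be verified with the same style of sketch (variable names are ours):

```python
servers = 4 * 32                   # 4 units of 32 DGX H100 nodes
gpus = servers * 8                 # 1,024 H100 GPUs
layer1_800g = 4 * servers          # one twin-port 800G module per OSFP port
layer1_400g = 8 * servers          # each 800G port fans out to 2 x 400G
leaves, uplinks_per_leaf = 32, 16
layer2_800g = leaves * uplinks_per_leaf * 2   # both ends of each uplink
ratio_800g = (layer1_800g + layer2_800g) / gpus   # 1.5
ratio_400g = layer1_400g / gpus                   # 1.0
```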

Case 4: H100+ConnectX8 (not yet released)+QM9700 three-layer network

In this hypothetical scenario where the H100 GPUs are upgraded to 800G network cards, the external interfaces should be increased from 4 OSFP interfaces to 8 OSFP interfaces. The connections between each layer would also be made using 800G optical modules. The overall network architecture would be similar to the first scenario, with the only difference being the replacement of 200G optical modules with 800G optical modules. Therefore, in this architecture, the ratio between GPUs and optical module requirements would still be 1:6, just as in the first scenario.

 

We organize the above four situations into the following table:

Configuration            Architecture   800G ratio   400G ratio   200G ratio
A100+ConnectX6+QM8700    Three-layer    -            -            1:6
A100+ConnectX6+QM9700    Two-layer      1:0.75       -            1:1
H100+ConnectX7+QM9700    Two-layer      1:1.5        1:1          -
H100+ConnectX8+QM9700    Three-layer    1:6          -            -

 

If we assume that in 2023, there are shipments of 300,000 units of H100 GPUs and 900,000 units of A100 GPUs, it would result in a total demand of 3.15 million units of 200G optical modules, 300,000 units of 400G optical modules, and 787,500 units of 800G optical modules.
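These 2023 totals are reproducible if we assume a deployment mix of half the A100s on the case-1 fabric, half on case-2, and all H100s on case-3 (an assumption on our part, consistent with ConnectX-8 not yet being available in 2023):

```python
a100, h100 = 900_000, 300_000
demand_200g = (a100 // 2) * 6 + (a100 // 2) * 1     # case 1 (1:6) + case 2 (1:1)
demand_400g = h100 * 1                              # case 3 (1:1)
demand_800g = int((a100 // 2) * 0.75 + h100 * 1.5)  # case 2 + case 3
# -> 3,150,000 / 300,000 / 787,500
```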

 

Similarly, if in 2024, there are shipments of 1.5 million units of H100 GPUs and 1.5 million units of A100 GPUs, it would result in a total demand of 750,000 units of 200G optical modules, 750,000 units of 400G optical modules, and 6.75 million units of 800G optical modules.

 

These estimates assume the following deployment mix:

  • Half of the A100 GPUs are connected to 200G switches, and the other half are connected to 400G switches.

 

  • Half of the H100 GPUs are connected to 400G switches, and the other half are connected to 800G switches.

 

Technology is constantly evolving and innovating. 400G multimode optical modules/AOCs/DACs are expected to continue leading the development in the networking field, offering strong support for the network requirements of the digital era. As a professional module manufacturer, NADDOD produces optical modules ranging from 1G to 400G. We welcome everyone to explore and purchase our products.