AI Intelligent Computing Center Network Architecture Design Practice
Traditional cloud data center networks are typically designed around a traffic model that serves external customers: traffic flows mainly between the data center and end customers (north-south traffic), while east-west traffic within the cloud is secondary. However, the underlying physical network architecture that carries VPC (Virtual Private Cloud) networks faces several challenges when supporting intelligent computing workloads:
- Network Congestion: Because not all servers generate outbound traffic simultaneously, and to control network construction costs, the downlink and uplink ports on leaf switches are not provisioned at a 1:1 bandwidth ratio but with a convergence (oversubscription) ratio; typically, the uplink bandwidth is only one-third of the downlink bandwidth. When many servers burst traffic at the same time, the oversubscribed uplinks become a congestion point.
- High Latency for Internal Cloud Traffic: Communication between two servers under different leaf switches must traverse a spine switch, resulting in a three-hop forwarding path (leaf-spine-leaf) that introduces additional latency.
- Limited Bandwidth: In most cases, a single physical machine is equipped with only one network interface card (NIC) for connecting to the VPC network. The bandwidth of a single NIC is relatively limited, and currently available commercial NICs typically do not exceed 200 Gbps.
For intelligent computing scenarios, a recommended practice is to build a dedicated high-performance network to accommodate intelligent computing workloads, meeting the requirements of high bandwidth, low latency, and lossless.
1. High Bandwidth Design
Intelligent computing servers can be fully equipped with 8 GPU cards and reserve 8 PCIe network card slots. When building a GPU cluster across multiple machines, the burst communication bandwidth between two GPUs may exceed 50 Gbps, so it is common practice to associate each GPU with a network port of at least 100 Gbps. In this scenario, you can configure either 4 network cards with two 100 Gbps ports each, or 8 network cards with one 100 Gbps port each. Alternatively, you can configure 8 network cards with a single 200/400 Gbps port each.
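As a quick sanity check on the configurations above, the sketch below verifies that each candidate NIC layout gives every GPU in an 8-GPU node at least the 100 Gbps of dedicated network bandwidth the text calls for. The configuration names are illustrative, not product SKUs.

```python
GPUS_PER_NODE = 8
MIN_GBPS_PER_GPU = 100

# (number of NICs, ports per NIC, Gbps per port)
configs = {
    "4x dual-port 100G": (4, 2, 100),
    "8x single-port 100G": (8, 1, 100),
    "8x single-port 200G": (8, 1, 200),
    "8x single-port 400G": (8, 1, 400),
}

for name, (nics, ports, gbps) in configs.items():
    total = nics * ports * gbps
    per_gpu = total / GPUS_PER_NODE
    # Every layout dedicates one >= 100 Gbps port-equivalent per GPU
    print(f"{name}: {per_gpu:.0f} Gbps per GPU, "
          f"sufficient={per_gpu >= MIN_GBPS_PER_GPU}")
```

All four layouts meet the bar; the 200/400 Gbps single-port variants simply raise the per-GPU budget.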
2. Non-blocking Design
The key to non-blocking network design is to adopt a Fat-Tree architecture, in which each switch's downlink and uplink bandwidth follow a 1:1, non-converged design. For example, if a switch has 64 downlink ports of 100 Gbps each, it also has 64 uplink ports of 100 Gbps each.
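The 1:1 split above can be expressed as a one-line rule: half of a switch's ports face down, half face up. A minimal sketch:

```python
def port_split(p: int) -> tuple[int, int]:
    """1:1 non-blocking Fat-Tree split of a P-port switch:
    P/2 downlink ports, P/2 uplink ports."""
    if p % 2:
        raise ValueError("port count must be even for a 1:1 split")
    return p // 2, p // 2

# A 128-port switch yields the 64-down / 64-up example from the text.
down, up = port_split(128)
print(down, up)  # 64 64
```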
In addition, data center-grade switches with non-blocking forwarding capability should be used. The mainstream data center switches available in the market generally provide full-port non-blocking forwarding capability.
3. Low-Latency Design: AI-Pool
In terms of low-latency network architecture design, Baidu Intelligent Cloud has implemented and deployed the AI-Pool network solution based on Rail optimization. In this network solution, 8 access switches form an AI-Pool group. Taking a two-layer switch networking architecture as an example, this network architecture achieves one-hop communication between different intelligent computing nodes within the same AI-Pool.
In the AI-Pool network architecture, network ports with the same number on different intelligent computing nodes should be connected to the same switch. For example, RDMA port 1 of intelligent computing node 1, RDMA port 1 of intelligent computing node 2, and so on, up to RDMA port 1 of intelligent computing node P/2 (where P is the number of ports on a switch), should all be connected to switch 1.
Within each intelligent computing node, the upper-layer communication library matches the GPU cards with the corresponding network ports based on the on-node network topology. This enables direct communication with only one hop between two intelligent computing nodes that have the same GPU card number.
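The rail-aligned wiring rule above can be sketched as a cabling plan: RDMA port i of every node in the pool lands on leaf switch i. Node and switch names below are illustrative.

```python
def cabling_plan(num_nodes: int, rails: int = 8) -> dict[str, list[str]]:
    """Map each leaf switch to the node ports cabled into it:
    leaf switch i receives RDMA port i of every node in the AI-Pool."""
    plan: dict[str, list[str]] = {}
    for rail in range(1, rails + 1):
        plan[f"leaf-{rail}"] = [
            f"node-{n}/rdma-{rail}" for n in range(1, num_nodes + 1)
        ]
    return plan

plan = cabling_plan(num_nodes=4)
# Every node's rdma-1 port lands on leaf-1, so GPUs with the same
# card number on any two nodes communicate through a single switch.
```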
For communication between intelligent computing nodes with different GPU card numbers, the Rail Local technology in the NCCL communication library can make full use of the bandwidth of NVSwitch between GPUs within the host, transforming cross-card communication between multiple machines into communication between the same GPU card numbers across machines.
For communication between two physical machines across AI-Pools, it requires passing through aggregation switches, resulting in a three-hop communication.
The scalability of GPUs that the network can support is related to the port density and network architecture of the switches used. As the network becomes more hierarchical, it can accommodate a larger number of GPU cards, but the number of hops and latency for forwarding also increase. Therefore, a trade-off should be made based on the actual business requirements.
4. Two-level Fat-Tree Architecture
8 access switches form an intelligent computing resource pool called an AI-Pool. Let P denote the number of ports on a single switch: each switch provides up to P/2 downlink ports and P/2 uplink ports, meaning a single switch can connect up to P/2 servers and P/2 upper-layer switches. A two-level Fat-Tree network can therefore accommodate a total of P × P/2 = P²/2 GPU cards.
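The two-level capacity formula can be checked directly; for example, with the 40-port switches discussed later it yields 800 GPU cards:

```python
def two_level_capacity(p: int) -> int:
    """Max GPU cards in a two-level Fat-Tree: each leaf connects P/2
    servers (one GPU per rail port), across P leaf positions => P * P/2."""
    return p * p // 2

print(two_level_capacity(40))   # 800, matching the 40-port HDR example
print(two_level_capacity(128))  # 8192, matching the 128-port RoCE spec
```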
5. Three-level Fat-Tree Architecture
In a three-level network architecture, there are additional aggregation switch groups and core switch groups, each containing at most P/2 switches. The maximum number of aggregation switch groups is 8, and the maximum number of core switch groups is P/2. A three-level Fat-Tree network can accommodate a total of P × (P/2) × (P/2) = P³/4 GPU cards.
With InfiniBand 40-port 200 Gbps HDR switches, a three-level Fat-Tree network can accommodate a maximum of 16,000 GPUs. This 16,000-card scale is currently the largest InfiniBand GPU cluster network in China, a record held by Baidu.
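The three-level formula reproduces the 16,000-GPU figure for 40-port HDR switches:

```python
def three_level_capacity(p: int) -> int:
    """Max GPU cards in a three-level Fat-Tree: P * (P/2) * (P/2) = P**3 / 4."""
    return p * (p // 2) * (p // 2)

print(three_level_capacity(40))  # 16000 GPUs with 40-port HDR switches
```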
6. Comparison of two-level and three-level fat tree network architectures
(1) The scale of accommodated GPU cards
The most significant difference between a two-level Fat-Tree and a three-level Fat-Tree lies in the number of GPU cards they can accommodate. Let N denote the GPU card scale and P the number of ports on a single switch. For example, with 40-port switches, a two-level Fat-Tree architecture can accommodate 800 GPU cards, while a three-level Fat-Tree architecture can accommodate 16,000 GPU cards.
(2) Forwarding path
Another difference between the two-level Fat-Tree and three-level Fat-Tree network architectures is the number of hops in the network forwarding path between any two nodes.
In the two-level Fat-Tree architecture, within the same intelligent computing resource pool (AI-Pool), the forwarding path between nodes with the same GPU card number is 1 hop. The forwarding path between nodes with different GPU card numbers, without Rail Local optimization within the intelligent computing nodes, is 3 hops.
In the three-level Fat-Tree architecture, within the same intelligent computing resource pool (AI-Pool), the forwarding path between nodes with the same GPU card number is 3 hops. The forwarding path between nodes with different GPU card numbers, without Rail Local optimization within the intelligent computing nodes, is 5 hops.
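The hop counts stated above for both architectures can be summarized in one small function. Rail Local is modeled as converting cross-rail traffic into same-rail traffic (the NVSwitch leg inside the node is not counted as a network hop):

```python
def forwarding_hops(levels: int, same_rail: bool, rail_local: bool = False) -> int:
    """Network forwarding hops between two nodes in the same AI-Pool, per the
    figures in the text: a two-level Fat-Tree gives 1 hop on the same rail and
    3 hops across rails; a three-level Fat-Tree gives 3 and 5 respectively.
    Rail Local first moves data over NVSwitch to the matching rail in-node."""
    if levels not in (2, 3):
        raise ValueError("only two- and three-level Fat-Trees are modeled")
    base = 1 if levels == 2 else 3          # same-rail path length
    if same_rail or rail_local:
        return base
    return base + 2                          # extra up/down through one more tier
```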
7. AI HPC Network Architecture Typical Practice
Based on the currently mature commercial switches, we recommend several specifications for physical network architectures, taking into consideration the different models of InfiniBand/RoCE switches and the supported scale of GPUs.
Regular: InfiniBand two-tier Fat-Tree network architecture based on InfiniBand HDR switches, supporting a maximum of 800 GPU cards in a single cluster.
Large: RoCE two-tier Fat-Tree network architecture based on 128-port 100G data center Ethernet switches, supporting a maximum of 8192 GPU cards in a single cluster.
XLarge: InfiniBand three-tier Fat-Tree network architecture based on InfiniBand HDR switches, supporting a maximum of 16,000 GPU cards in a single cluster.
XXLarge: Based on InfiniBand Quantum-2 switches or equivalent-performance Ethernet data center switches, adopting a three-tier Fat-Tree network architecture, supporting a maximum of 100,000 GPU cards in a single cluster.
At the same time, high-speed network connectivity is essential to ensure efficient data transmission and processing.
NADDOD offers high-quality connectivity products that meet the requirements for deploying AI model networks. Their product lineup includes switches, network cards, and optical modules of various speeds (100G, 200G, 400G, 800G), which accelerate AI model training and inference processes. The optical modules provide high bandwidth, low latency, and low error rates, enhancing the capabilities of data center networks and enabling faster and more efficient AI computations. Choosing NADDOD's connectivity products optimizes network performance and supports the deployment and operation of large-scale AI models.