NVIDIA Spectrum-X Ethernet: Accelerated Networking for AI Workloads - NADDOD Blog

NVIDIA Spectrum-X: Ethernet Network Platform Specifically Designed for AI

NADDOD Brandon InfiniBand Technical Support Engineer Sep 12, 2023

AI workloads are characterized by a small number of elephant flows that carry large volumes of data between GPUs, where tail latency can significantly impact overall application performance. Handling such traffic patterns with traditional network routing mechanisms can result in inconsistent GPU performance and low utilization for AI workloads. Spectrum-X RoCE Dynamic Routing is a fine-grained load-balancing technology that dynamically adjusts RDMA data routing to avoid congestion, providing optimal load balancing and more efficient use of data bandwidth.

Key Features of NVIDIA Spectrum-X Network Platform

1. Working Principles of NVIDIA RoCE Dynamic Routing

RoCE Dynamic Routing is supported end-to-end by the Spectrum-4 switch and the BlueField-3 DPU. The Spectrum-4 switch selects the least congested port on a per-packet basis to distribute data transfers evenly. When different packets of the same flow travel through different paths in the network, they may arrive at the destination out of order. The BlueField-3 DPU handles out-of-order data at the RoCE transport layer, delivering in-order data transparently to applications.

 

Spectrum-4 evaluates congestion based on the load of egress queues to ensure balanced utilization across all ports. For each network packet, the switch chooses the port with the lowest load in its egress queue. Spectrum-4 also receives status notifications from neighboring switches, which can also influence forwarding decisions. The queues involved in the evaluation are matched with their corresponding traffic classes. As a result, Spectrum-X can achieve up to 95% effective bandwidth in ultra-large-scale systems and high-load scenarios.
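The per-packet selection logic described above can be sketched as follows. This is a minimal, illustrative model only, not Spectrum-4 internals: the class name, port labels, and byte-counting scheme are all assumptions, and real hardware also factors in neighbor-switch notifications.

```python
# Hypothetical sketch of per-packet load balancing: for each packet, pick
# the egress port whose queue (for the packet's traffic class) currently
# holds the fewest bytes. Names and units are illustrative assumptions.

from collections import defaultdict

class PerPacketBalancer:
    def __init__(self, ports):
        self.ports = ports
        # queue depth in bytes per (port, traffic_class)
        self.depth = defaultdict(int)

    def select_port(self, traffic_class):
        # choose the port with the lowest egress-queue load for this class
        return min(self.ports, key=lambda p: self.depth[(p, traffic_class)])

    def enqueue(self, packet_bytes, traffic_class):
        port = self.select_port(traffic_class)
        self.depth[(port, traffic_class)] += packet_bytes
        return port

balancer = PerPacketBalancer(ports=["p0", "p1", "p2", "p3"])
# successive packets of one flow spread across all four ports
ports_used = {balancer.enqueue(1500, traffic_class=3) for _ in range(4)}
```

Because the decision is made per packet rather than per flow, a single elephant flow is spread across every available uplink instead of pinning one of them.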

2. NVIDIA RoCE Dynamic Routing with NVIDIA Direct Data Placement

By leveraging NVIDIA Direct Data Placement (DDP), the following packet-level demonstration showcases how AI flows move across the Spectrum-X network between GPUs. It illustrates the collaboration between RoCE dynamic routing on the Spectrum-4 switch and NVIDIA Direct Data Placement (DDP) technology on the BlueField DPU:

 

Step 1: Data originates in server or GPU memory on the left side of the diagram, destined for a server on the right side.

Step 2: The BlueField-3 DPU packages the data into network packets and sends them to the first Spectrum-4 leaf switch, tagging the packets so that RoCE dynamic routing can be applied safely.

Step 3: The left Spectrum-4 leaf switch applies RoCE dynamic routing to load-balance the packets of the green and purple flows across the four spine switches, sending each flow's packets to multiple spine switches. This raises effective bandwidth from about 60% with standard Ethernet to 95% with Spectrum-X, roughly a 1.6x improvement.

Step 4: These data packets may arrive out of order when they reach the right-side BlueField-3 DPU.


Step 5: The right-side BlueField-3 DPU uses NVIDIA Direct Data Placement (DDP) technology to place the data in the correct order in host or GPU memory.
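The essence of step 5 can be shown with a toy sketch. This is an illustrative model of direct data placement, not the BlueField-3 implementation: the function name and the (offset, payload) packet representation are assumptions.

```python
# Illustrative sketch of direct data placement: each packet's payload is
# written at the offset carried in its header, so arrival order does not
# matter and no reorder buffer is needed before handing data to the app.

def ddp_place(packets, message_len):
    buf = bytearray(message_len)
    for offset, payload in packets:          # packets may arrive out of order
        buf[offset:offset + len(payload)] = payload
    return bytes(buf)

# Packets of the message "spectrum" arriving out of order:
out_of_order = [(4, b"trum"), (0, b"spec")]
assert ddp_place(out_of_order, 8) == b"spectrum"
```

Because each payload lands directly at its final address, the application sees a contiguous, in-order message regardless of the paths the packets took.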

RoCE Dynamic Routing Results

To validate the effectiveness of RoCE dynamic routing, we conducted initial testing using an RDMA write test program. In the test, we divided the hosts into several pairs, with each pair sending large RDMA write data streams to each other for a sustained period.

RoCE dynamic routing reduces flow completion time

As shown in the above graph, the hash-based static forwarding mechanism resulted in conflicts on the uplink ports, causing increased communication completion time, decreased bandwidth, and reduced fairness among flows. Switching to dynamic routing resolved all these issues.

 

In the ECMP graph, some flows exhibited the same bandwidth and completion time, while others experienced conflicts, leading to longer completion times and lower bandwidth. Specifically, in the ECMP scenario, some flows achieved an optimal completion time T of 13 seconds, while the slowest flow required 31 seconds to complete, approximately 2.5 times the ideal time T. In the RoCE dynamic routing graph, all flows completed at roughly the same time, with similar peak bandwidth.
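A toy model makes the ECMP collision effect concrete. The numbers and hash choice here are illustrative assumptions (the measured slowest flow was 31 s, not an exact multiple of 13 s), but the mechanism is the same: flows statically hashed onto the same uplink split its bandwidth for their whole lifetime.

```python
# Toy model of hash-based static forwarding: each flow is pinned to one
# uplink by a hash, and colliding flows finish late in proportion to the
# number of flows sharing that uplink. All constants are illustrative.

import zlib
from collections import Counter

def ecmp_uplink(flow_id, n_uplinks):
    # static forwarding: a flow keeps one uplink for its entire lifetime
    return zlib.crc32(flow_id.encode()) % n_uplinks

def completion_times(flows, n_uplinks, ideal_time=13.0):
    load = Counter(ecmp_uplink(f, n_uplinks) for f in flows)
    # flows sharing an uplink split its bandwidth, so they finish late
    return {f: ideal_time * load[ecmp_uplink(f, n_uplinks)] for f in flows}

times = completion_times([f"flow{i}" for i in range(8)], n_uplinks=4)
# 8 flows over 4 uplinks guarantee at least one collision (pigeonhole),
# so the slowest flow needs at least twice the ideal 13 s
assert max(times.values()) >= 2 * 13.0
```

Per-packet dynamic routing removes this pinning entirely, which is why all flows in the dynamic-routing graph complete at roughly the same time.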

RoCE Dynamic Routing for AI Workloads

To further evaluate the performance of dynamic routing in RoCE workloads, we conducted common AI benchmark tests on a testing platform consisting of 32 servers, utilizing a two-tier leaf-spine network topology built with four NVIDIA Spectrum switches. These benchmark tests assessed collective operations and network traffic patterns commonly found in distributed AI training workloads, such as all-to-all traffic and all-reduce collective operations.

RoCE Dynamic Routing Empowers AI-All-Reduce

RoCE Dynamic Routing Empowers AI-All-to-All


Summary of RoCE Dynamic Routing

In many cases, hash-based ECMP flow routing can lead to high congestion and unstable flow completion times, resulting in degraded application performance. Spectrum-X RoCE Dynamic Routing addresses this issue. This technology improves the actual network throughput (goodput) while minimizing the instability of flow completion times, thereby enhancing application performance.

 

Combining RoCE Dynamic Routing with NVIDIA Direct Data Placement (DDP) on the BlueField-3 DPU makes the whole mechanism transparent to applications, ensuring that NVIDIA Spectrum-X provides a performance-optimized, accelerated Ethernet solution for data centers.

Using NVIDIA RoCE Congestion Control for Performance Isolation

Due to network congestion, applications running concurrently in an AI cloud may experience performance degradation and unstable runtimes. The congestion can be caused by an application's own network traffic or by the background traffic of other applications. Its primary form is many-to-one (incast) congestion, where multiple senders transmit to a single receiver.

 

RoCE dynamic routing cannot solve this kind of congestion; it requires per-endpoint flow metering instead. Spectrum-X RoCE congestion control is an end-to-end technology in which the Spectrum-4 switch provides network telemetry that characterizes real-time congestion in the network. This telemetry is processed by the BlueField-3 DPU, which manages and controls the injection rate of the senders to maximize the efficiency of the shared network. Without congestion control, a many-to-one scenario can lead to network backpressure, congestion spreading, and even packet loss, significantly reducing network and application performance.

 

During the congestion control process, the BlueField-3 DPU executes congestion control algorithms, capable of processing millions of congestion control events per second, making rapid, fine-grained rate decisions at the microsecond response level. The Spectrum-4 switch's in-band telemetry provides queue information for accurate congestion estimation and port utilization metrics for fast recovery. NVIDIA's congestion control allows telemetry data to bypass queue latency of congested flows while still providing accurate concurrent telemetry information, significantly reducing congestion detection and reaction time.
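A telemetry-driven rate controller in the spirit of the loop described above can be sketched as follows. This is a hedged illustration only: the update rule, constants (`target_queue`, the 0.95 utilization threshold, the 5% recovery step), and function name are assumptions, not NVIDIA's congestion control algorithm.

```python
# Illustrative control step: cut the sender's rate when switch telemetry
# reports queue buildup, recover additively while the port has headroom.
# All constants and the update rule are assumed for illustration.

def update_rate(rate_gbps, queue_depth_bytes, port_util,
                line_rate=400.0, target_queue=64_000):
    if queue_depth_bytes > target_queue:
        # multiplicative decrease proportional to the queue overshoot
        rate_gbps *= target_queue / queue_depth_bytes
    elif port_util < 0.95:
        # additive recovery while the port is under-utilized
        rate_gbps += 0.05 * line_rate
    return min(rate_gbps, line_rate)

rate = update_rate(400.0, queue_depth_bytes=256_000, port_util=1.0)
assert rate < 400.0   # congested: rate is cut
rate = update_rate(rate, queue_depth_bytes=0, port_util=0.5)
assert rate > 100.0   # idle port: rate recovers
```

In the real system this loop runs on the BlueField-3 DPU at microsecond timescales, handling millions of such events per second with telemetry that bypasses the congested queues themselves.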

 

The following example demonstrates how the network experiences many-to-one congestion and how Spectrum-X uses flow metering and in-band telemetry for RoCE congestion control.

Network congestion causes victim flow

The diagram depicts a victim flow caused by network congestion. Four source DPUs transmit data to two target DPUs. Sources 1, 2, and 3 send data to target 1, with each flow receiving one-third of target 1's link bandwidth. Source 4 sends data to target 2 through a leaf switch shared with source 3, so its receiver should be able to obtain two-thirds of the available link bandwidth.

 

In the absence of congestion control, sources 1, 2, and 3 create 3-to-1 congestion as they all send data to target 1. This congestion causes backpressure to propagate from target 1 back toward the leaf switches connected to the sources. Source 4 becomes the victim flow: its throughput to target 2 drops to 33% of the available bandwidth, only 50% of its expected performance. This is detrimental to AI applications, whose performance depends on both average and worst-case behavior.

Spectrum-X solves congestion problems with flow metering and congestion telemetry

The diagram shows how Spectrum-X addresses the congestion problem in Figure 14, using the same test environment: four source DPUs transmitting data to two target DPUs. Here, flow metering at sources 1, 2, and 3 prevents congestion at the leaf switches. This eliminates the backpressure affecting source 4, enabling it to achieve the expected two-thirds of the available bandwidth. Furthermore, Spectrum-4 uses in-band telemetry generated by What Just Happened (WJH) to dynamically reassign flow paths and queue behavior.
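One common way to implement the flow metering described above is a token bucket, sketched below. This is a generic illustration, not Spectrum-X's metering scheme: the class name, rate, and burst values are assumptions.

```python
# Hypothetical token-bucket flow meter: a flow may inject a packet only
# while it holds enough tokens, which refill at its allotted rate. The
# rate and burst numbers are illustrative assumptions.

import time

class FlowMeter:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes):
        now = time.monotonic()
        # refill tokens for the elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True
        return False

meter = FlowMeter(rate_bytes_per_s=1_000, burst_bytes=3_000)
# only the initial burst passes immediately; later packets must wait for refill
sent = sum(meter.allow(1_500) for _ in range(4))
```

Metering each source at its fair share keeps the shared leaf uplink uncongested, which is what protects source 4 from becoming a victim flow.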

RoCE Performance Isolation

AI cloud infrastructure needs to support a large number of users (tenants) and parallel applications or workflows. These users and applications compete for shared resources of the infrastructure, such as the network, which can impact each other's performance.

 

Furthermore, optimizing AI network performance with the NVIDIA Collective Communications Library (NCCL) requires coordination and synchronization among all workloads running concurrently in the cloud. Traditional cloud strengths such as elasticity and high availability matter less to AI applications, whereas performance degradation is a much more significant, cluster-wide issue.


Performance Isolation for Spectrum-X

The Spectrum-X platform combines multiple mechanisms to achieve performance isolation: these quality-of-service mechanisms ensure that no workload can congest the network and thereby degrade the data movement of another workload.

 

With RoCE dynamic routing, fine-grained data-path balancing avoids flow conflicts at the leaf and spine switches, contributing to performance isolation. RoCE congestion control, enabled through flow metering and telemetry, prevents victim flows caused by many-to-one traffic, further strengthening isolation.

 

Additionally, the Spectrum-4 switches leverage a global shared buffer design to facilitate performance isolation. The shared buffer provides bandwidth fairness for flows of different sizes, protecting workloads from the impact of "noisy neighbor" flows and enabling greater absorption of microbursts in scenarios with multiple flows targeting the same destination port.

Spectrum-4 switches adopt a global shared buffer design
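A common way such shared buffers balance fairness against burst absorption is a dynamic threshold, sketched below. This is a generic scheme assumed for illustration, not Spectrum-4's exact buffer management; the `alpha` value and function name are assumptions.

```python
# Sketch of a dynamic-threshold shared buffer: a queue may grow up to
# `alpha` times the currently free shared memory. A single "noisy
# neighbor" queue shrinks the free pool and thus its own allowance,
# while idle memory stays available to absorb microbursts.

def admit(packet_bytes, queue_bytes, free_shared_bytes, alpha=2.0):
    # admit the packet only if the queue stays under its dynamic threshold
    return queue_bytes + packet_bytes <= alpha * free_shared_bytes

# A hot queue is cut off once the free pool shrinks ...
assert admit(1_500, queue_bytes=90_000, free_shared_bytes=10_000) is False
# ... while a short queue still gets buffer for a microburst.
assert admit(1_500, queue_bytes=2_000, free_shared_bytes=10_000) is True
```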

NVIDIA Full Stack Optimization

An AI cloud is a meticulously engineered machine. Its performance can be degraded by network events that would go unnoticed in a conventional cloud, including minor disruptions such as link or device failures. A single sluggish system can trigger a chain reaction that slows down the entire cloud, and without alert and monitoring systems capable of detecting and mitigating such behavior, a lagging DPU can interfere with neighboring devices.

 

As part of a full-stack AI solution, NVIDIA employs a multi-stage process to test, certify, and optimize Spectrum-X components:

 

  1. AI Performance Benchmarking: NVIDIA tests the overall AI performance of typical clusters and publishes the results to provide insight into the performance level expected of AI clusters built on the Spectrum-X network.

  2. Component Testing: NVIDIA first tests each component individually (switches, DPUs, GPUs, and AI libraries) to ensure it functions as expected and meets performance benchmarks.

  3. Integration Testing: After verifying each component, NVIDIA integrates them into a tightly coupled AI solution. The integrated system undergoes a series of tests to ensure compatibility, interoperability, and seamless communication between components.

  4. Performance Tuning: Once the components are integrated and operating as a unified whole, NVIDIA optimizes the performance of the full-stack solution. This includes parameter tuning, identifying bottlenecks, and fine-tuning configurations to maximize performance.

  5. Overall System Performance: At this stage, NVIDIA also specifically tunes buffer sizes and congestion thresholds to meet the requirements of AI workloads such as GPT, BERT, and RetinaNet, ensuring optimal performance for these popular deep learning models.

  6. Library and Software Optimization: NVIDIA optimizes AI libraries such as NCCL to ensure efficient communication between GPUs and other components. This step minimizes latency and maximizes throughput, which is crucial for large-scale deep learning applications.

  7. Certification: After testing and optimizing the full-stack AI solution, NVIDIA conducts a series of certifications to ensure the system operates reliably and securely, including stress testing, security testing, and validation of compatibility with popular AI frameworks and applications.

  8. Real-World Testing: Finally, NVIDIA deploys the full-stack AI solution in real-world scenarios to evaluate its performance under various conditions and workloads. This step helps uncover unforeseen issues and ensures the solution is ready for widespread customer adoption.

 

By following this comprehensive process, NVIDIA ensures the robustness, reliability, and high performance of its full-stack AI solution, providing customers with a seamless experience, especially for widely used AI workloads such as GPT, BERT, and RetinaNet.

 

Ethernet Designed for AI and NADDOD High-Speed Network Connection Accessories

The Spectrum-X network platform is built specifically for demanding AI applications and offers a range of advantages over traditional Ethernet. With higher performance, lower power consumption, lower total cost of ownership, seamless full-stack hardware and software integration, and massive scalability, Spectrum-X becomes the preferred platform for current and future AI cloud workloads.

 

In conjunction with NVIDIA's Spectrum-X accelerated AI network, NADDOD, as a leading provider of optical network connectivity products, offers high-quality DACs, AOCs, and optical modules. Our products undergo rigorous testing and inspection to ensure they meet industry standards for quality and performance. Our engineers have deep business understanding and extensive project experience. NADDOD focuses on high-performance network construction and application acceleration, providing optimal solutions for high-performance switches, intelligent network cards, and AOC/DAC/optical module product combinations based on your specific application scenarios. With our technical expertise and project experience in optical networking and high-performance computing, we continuously provide excellent products, solutions, and technical services for data centers, high-performance computing, edge computing, and artificial intelligence applications.

 

By selecting NADDOD's high-quality and reliable optical connectivity products, you can achieve exceptional performance and reliability for your AI business deployed on the Spectrum-X network platform, enabling the rapid growth of your business.