NVIDIA Spectrum-X Ethernet Solution

NADDOD Quinn, InfiniBand Network Architect, Jan 3, 2024

NVIDIA, renowned for its cutting-edge technologies, offers a wide range of high-performance networking solutions. Beyond NVLink and InfiniBand, NVIDIA has developed the Spectrum-X solution, a powerful Ethernet solution that brings exceptional performance and scalability to data centers and HPC environments. With its innovative features and advanced capabilities, Spectrum-X enables organizations to meet the increasing demands of modern networking, unlocking new levels of efficiency, speed, and flexibility. In this article, we will delve into the details of NVIDIA's Spectrum-X Ethernet solution, exploring its key features, benefits, and how it revolutionizes networking infrastructure for the most demanding applications.


Traditional data center networks, built for north-south traffic, differ markedly from the east-west networks typical of AI. AI workloads are distributed and tightly coupled, with firm requirements for low latency, low jitter, non-blocking forwarding, and predictable network performance.


AI training is essentially a multitude of GPU cards continuously performing complex computations, such as gradient calculations. To enable hundreds or even thousands of GPU cards to work together, various parallelization techniques are employed: data parallelism (dividing training data into subsets for different GPUs), model parallelism (partitioning different layers of the neural network across different GPUs), and tensor parallelism (splitting a tensor into smaller chunks for computation on different GPUs). Regardless of the specific technique, all of them require extensive data exchange among GPUs, demanding a communication network that is high-speed, low-latency, congestion-free, and lossless.
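To make the network dependency concrete, here is a toy sketch of data parallelism (the model, data values, and learning rate are all illustrative): each simulated "GPU" computes a gradient on its own data shard, and an all-reduce averages the gradients so every worker applies the same update. In a real cluster, that all-reduce step is exactly the traffic the GPU fabric must carry at every iteration.

```python
def local_gradient(w, shard):
    # Toy gradient of mean-squared error for the model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    # Stand-in for the collective that runs over the GPU network.
    return sum(values) / len(values)

# Four "GPUs", each holding a quarter of the training data.
data = [(x, 3.0 * x) for x in range(1, 9)]          # ground truth: w = 3
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):                                 # synchronous SGD steps
    grads = [local_gradient(w, s) for s in shards]   # computed in parallel
    g = all_reduce_mean(grads)                       # network-bound exchange
    w -= 0.01 * g

print(round(w, 2))  # converges toward 3.0
```

Because every step blocks on the gradient exchange, one slow or congested link stalls all workers, which is why jitter and packet loss hurt AI training so badly.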


NVIDIA adopts an "All In" strategy when it comes to AI network communication solutions, offering a comprehensive range of options. They provide NVLink network technology, primarily used for GPU communication within servers and small-scale data transfer across server nodes. Additionally, they offer InfiniBand networking, suitable for mid-sized deployments and customers less sensitive to cost considerations. Furthermore, NVIDIA has introduced the Spectrum-X platform based on RDMA over Converged Ethernet (RoCE), delivering lossless Ethernet for AI applications. With these diversified offerings, NVIDIA caters to a wide range of customer needs in the field of AI network communication.


NVIDIA Spectrum-X


The NVIDIA® Spectrum™-X network platform is the first Ethernet platform designed specifically to enhance the performance and efficiency of Ethernet-based AI clouds. It delivers 1.7 times improved AI performance and energy efficiency in large-scale AI workloads, while ensuring consistency and predictability in multi-tenant environments. Spectrum-X is built on the foundation of the Spectrum-4 Ethernet switch and NVIDIA BlueField®-3 DPU network card, offering end-to-end optimization for AI workloads.


NVIDIA Spectrum-X network platform

NVIDIA Spectrum-4 Ethernet Switch


The NVIDIA Spectrum-4 Ethernet switch is built on its proprietary 51.2Tbps Spectrum-4 ASIC. A single 2U switch supports up to 128 400G Ethernet ports or 64 OSFP 800G interfaces, and a two-tier Clos (leaf-spine) architecture scales to 8K GPU nodes. The Spectrum-4 ASIC utilizes 112G SerDes channels, and the Spectrum-X solution employs the same SerDes technology end to end, from the switch to the DPU to the GPU, to reduce network power consumption and improve network efficiency.
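A quick arithmetic check shows the two quoted port configurations are consistent with the ASIC's aggregate bandwidth:

```python
# Both port configurations fully utilize the 51.2 Tb/s Spectrum-4 ASIC.
ASIC_TBPS = 51.2
configs = {"128 x 400GbE": (128, 400), "64 x 800G OSFP": (64, 800)}

for name, (ports, gbps) in configs.items():
    total_tbps = ports * gbps / 1000
    assert total_tbps == ASIC_TBPS
    print(name, "=", total_tbps, "Tb/s")
```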


BlueField-3 DPU Card


The NVIDIA BlueField-3 DPU is the third-generation data center infrastructure chip that enables organizations to build software-defined, hardware-accelerated IT infrastructure from the cloud to the core data center to the edge. With a 400Gb/s Ethernet network connection, the BlueField-3 DPU offloads, accelerates, and isolates software-defined networking, storage, security, and management functions, significantly improving data center performance, efficiency, and security. BlueField-3 provides multi-tenancy and security capabilities in cloud AI data centers driven by Spectrum-X.


BlueField-3 DPU

RoCE Adaptive Routing for addressing low entropy (uneven flow load) in AI networks

Traditional Ethernet ECMP (Equal-Cost Multipath) load balancing hashes each flow onto a single path, so overall network utilization suffers (typically below 60%) when flow sizes vary widely: "elephant flows" (e.g., multi-gigabyte file transfers) and "mice flows" (e.g., flows of a few kilobytes to megabytes) are scheduled by the same hash algorithm, and a few elephants can saturate one path while others sit idle. The industry's response is to move from per-flow load balancing to per-packet load balancing, also known as packet spraying. Packet spraying, however, introduces the challenge of packet reordering: different packets of the same flow are forwarded on different paths and can arrive out of order, so an appropriate mechanism is needed to detect and reorder them.
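The contrast can be sketched in a few lines (an illustrative model, not any switch's actual pipeline): per-flow ECMP hashes a flow's 5-tuple so every packet of that flow pins to one uplink, while per-packet spraying spreads consecutive packets across all uplinks.

```python
import hashlib

UPLINKS = 4

def ecmp_port(flow):
    # Per-flow: hash the 5-tuple; every packet of the flow takes one path,
    # so an elephant flow concentrates all of its bytes on a single uplink.
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return digest[0] % UPLINKS

def spray_port(packet_seq):
    # Per-packet: round-robin consecutive packets over all uplinks,
    # at the cost of possible reordering at the receiver.
    return packet_seq % UPLINKS

elephant = ("10.0.0.1", "10.0.0.2", 17, 49152, 4791)  # one RoCEv2 flow
print("ECMP keeps the elephant on port", ecmp_port(elephant))
print("Spraying uses ports", sorted({spray_port(i) for i in range(8)}))
```

With ECMP, every packet of the elephant flow lands on the same port; with spraying, eight consecutive packets cover all four uplinks.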


RoCE Adaptive Routing is a fine-grained load balancing technology. It dynamically reroutes RDMA (Remote Direct Memory Access) data to avoid congestion and provide optimal load balancing for achieving the highest effective data bandwidth.


It is an end-to-end feature involving Spectrum-4 switches and BlueField-3 DPUs. The Spectrum-4 switch is responsible for selecting the least congested port for each data packet transmission. As different packets of the same flow are transmitted through different paths in the network, they may arrive at the destination out of order. BlueField-3 handles any out-of-order data at the RoCE transport layer, transparently delivering ordered data to the applications.
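The reordering task the DPU takes on can be illustrated with a minimal software sketch (BlueField-3 performs this in hardware at the RoCE transport layer; the sequence numbers and payloads below are hypothetical): packets carry a sequence number, arrive shuffled, and are released to the application strictly in order.

```python
def deliver_in_order(arrivals):
    # Buffer out-of-order packets; release each in-order run as it completes.
    buffered, expected, delivered = {}, 0, []
    for seq, payload in arrivals:
        buffered[seq] = payload
        while expected in buffered:          # drain any contiguous run
            delivered.append(buffered.pop(expected))
            expected += 1
    return delivered

# Packets of one flow took different paths and arrived shuffled.
arrivals = [(2, "c"), (0, "a"), (3, "d"), (1, "b")]
print(deliver_in_order(arrivals))  # ['a', 'b', 'c', 'd']
```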


Spectrum-4 evaluates congestion based on egress queue loads to ensure good balancing across all ports. For each network packet, the switch selects the port with the lowest load in its egress queues. Spectrum-4 also receives status notifications from neighboring switches, which influence routing decisions. The evaluated queues match the quality of service levels.
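The per-packet port choice described above amounts to "pick the eligible uplink with the least-loaded egress queue at this packet's QoS level." Here is a deliberately simplified sketch (the queue depths and packet sizes are made up; the real switch also folds in neighbor status notifications):

```python
def pick_port(queue_depths, eligible_ports):
    # queue_depths[port] = bytes queued at this packet's traffic class.
    return min(eligible_ports, key=lambda p: queue_depths[p])

queues = {0: 12000, 1: 3000, 2: 9000, 3: 3000}
for _ in range(5):
    port = pick_port(queues, eligible_ports=[0, 1, 2, 3])
    queues[port] += 1500            # enqueue a 1500-byte packet there
    print("sent via port", port)
```

Traffic naturally alternates between the two lightly loaded ports, so no single queue builds up while others stay idle.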


NVIDIA handles packet reordering on the DPU card side, but the industry has other approaches to packet reordering as well, such as the DDC (Distributed Disaggregated Chassis) solution, which sorts sprayed packets in switch-side caches, and the "packet numbering (virtual containerization) + switch-side intelligent scheduling" solution. If interested, you can search for these keywords directly to find my previous articles.


NVIDIA RoCE Congestion Control


The congestion caused by incast cannot be solved by per-packet load balancing alone. Incast is many-to-one congestion: multiple data senders converge on a single data receiver, and the resulting contention must be addressed through congestion control. NVIDIA employs an end-to-end congestion control mechanism to mitigate these network congestion issues.


This type of congestion cannot be resolved using adaptive routing and actually requires per-endpoint flow metering. Congestion control is an end-to-end technology where the Spectrum-4 switch provides network telemetry information representing real-time congestion data. This telemetry information is processed by the BlueField DPU, which manages and controls the data injection rate of data senders to achieve the maximum efficiency of network sharing.


Without congestion control, many-to-one scenarios would result in network backpressure, congestion spread, and even packet loss, significantly reducing network and application performance.


In the congestion control process, the BlueField-3 DPU runs the congestion control algorithms, processing millions of congestion control events per second with microsecond response latency and applying fine-grained rate decisions.
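As a hedged illustration of what "fine-grained rate decisions" means, here is an AIMD-style control loop in the spirit of the description above; this is not NVIDIA's actual algorithm, and all constants are invented. The sender's DPU cuts its injection rate when switch telemetry flags congestion and probes upward otherwise.

```python
LINE_RATE = 400.0  # Gb/s, per the 400Gb/s BlueField-3 connection

def next_rate(rate, congested, increase=10.0, decrease=0.5):
    # Multiplicative decrease on a congestion signal, additive increase
    # otherwise; rate stays within (1.0, LINE_RATE) Gb/s.
    if congested:
        return max(rate * decrease, 1.0)
    return min(rate + increase, LINE_RATE)

rate = LINE_RATE
telemetry = [False, False, True, False, True, False, False]
for congested in telemetry:
    rate = next_rate(rate, congested)
    print(f"{rate:6.1f} Gb/s")
```

The point of the hardware telemetry path is to make each `congested` signal accurate and timely, so the loop reacts in microseconds rather than after queues have already overflowed.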


Spectrum-4 switches provide in-band telemetry that includes queuing information for accurate congestion estimation and port utilization indicators for fast recovery. NVIDIA RoCE congestion control improves congestion detection and response time significantly by bypassing congestion flow queuing delays while still providing accurate and concurrent telemetry.


RoCE Performance Isolation


Artificial intelligence at massive scale and cloud infrastructure need to support an increasing number of users (tenants) and parallel applications or workflows. These requirements necessitate strong performance isolation between different tenants and applications to ensure predictable and consistent performance.


NVIDIA's RoCE (RDMA over Converged Ethernet) technology provides performance isolation by leveraging the capabilities of the BlueField-3 DPU. The BlueField-3 DPU enables the creation of virtual networks and virtual NICs (vNICs) with dedicated hardware acceleration and isolation for each tenant or application.


With RoCE performance isolation, each tenant or application can have its own dedicated slice of network resources, including bandwidth, latency, and Quality of Service (QoS) parameters. This ensures that one tenant or application cannot monopolize network resources or negatively impact the performance of others.


The BlueField-3 DPU implements hardware-enforced isolation between vNICs, allowing each vNIC to operate independently and securely. It provides fine-grained control over network traffic, enabling administrators to assign specific bandwidth limits, latency thresholds, and QoS policies to each vNIC.
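A per-vNIC bandwidth cap of the kind described above is classically enforced with a token bucket. The following is a software sketch under assumed parameters (BlueField-3 enforces its limits in hardware; the 1 Gb/s cap and 3000-byte burst here are illustrative):

```python
class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8          # bytes per second of credit
        self.capacity = burst_bytes       # maximum accumulated burst
        self.tokens = burst_bytes
        self.last = 0.0

    def allow(self, packet_bytes, now):
        # Accrue credit since the last check, capped at the burst size;
        # admit the packet only if enough credit is available.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True
        return False

# A vNIC capped at 1 Gb/s with a 3000-byte burst allowance.
vnic = TokenBucket(rate_bps=1_000_000_000, burst_bytes=3000)
print(vnic.allow(1500, now=0.0))      # True: within the burst
print(vnic.allow(1500, now=0.0))      # True: burst not yet exhausted
print(vnic.allow(1500, now=0.0))      # False: credit spent, must wait
print(vnic.allow(1500, now=0.001))    # True: 1 ms of credit has accrued
```

Giving each vNIC its own bucket is what prevents one tenant's burst from consuming bandwidth reserved for another.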


By enforcing performance isolation at the hardware level, NVIDIA's RoCE technology ensures that tenants or applications running on the same network infrastructure are isolated from each other, resulting in predictable and consistent performance for all users.


Overall, the combination of NVIDIA's Spectrum-4 Ethernet switch, BlueField-3 DPU, and RoCE technology provides high-performance networking with features like adaptive routing, congestion control, and performance isolation, making it well-suited for large-scale AI deployments and cloud infrastructure.


As a professional module manufacturer, NADDOD produces optical modules from 1G to 800G that support both Ethernet and InfiniBand networking. We welcome everyone to learn about and purchase our products.