NVIDIA Spectrum X: Enhancing Performance and Efficiency in Ethernet-based AI Clouds - NADDOD Blog

NVIDIA Spectrum X: Speeding Up the AI Network

NADDOD Gavin InfiniBand Network Engineer Sep 7, 2023

1. Spectrum X Increasing AI Performance and Efficiency

The demand for cloud AI workloads is growing at an unprecedented rate, and there is also a surge in the applications of generative AI. Each year, more AI factories are being constructed, and these AI factories are expanding into cloud service providers (CSPs) and large-scale data centers.

AI data center

AI factories are typically built to achieve higher levels of system performance, enabling the training of trillion-parameter foundation models. They are often operated by a small number of tenants running large workloads. To reach that level of performance, AI factories require highly optimized networks, such as solutions that combine NVIDIA® NVLink™ high-speed GPU interconnect technology with NVIDIA Quantum InfiniBand technology. In contrast, AI clouds, which are also built specifically for AI, have some unique requirements, such as integration into Ethernet service networks and management frameworks.

Key requirements for an AI cloud include:

  • Migration of AI in Software as a Service (SaaS) and AI as a Service: SaaS products are evolving to incorporate AI and AI as a Service, which creates a need for multi-tenant security and performance isolation between concurrent jobs.

 

  • Cloud-Scale Software Defined Networking (SDN): This provisioning model utilizes Ethernet-based protocols and container orchestration, such as Kubernetes. SDN facilitates seamless scaling from small to large clusters and allows for the reconfiguration of computing and networking resources into multiple smaller clusters.

 

  • Integration and Compatibility: A CSP's AI services require tight integration with the other cloud services offered by the same company. These services, which run over Ethernet, are essential for managing the complete data pipeline in the cloud.

 

  • Security and Compliance: CSPs implement robust security measures, including packet brokering and Ethernet-based intrusion detection/protection devices, to protect client data and comply with local regulations.

 

  • Open Source: Support for open network operating systems, such as SONiC.

 

  • Global Coverage: CSP data centers are distributed worldwide and synchronized using the Precision Time Protocol (PTP). This helps deploy AI models closer to end users while complying with cross-border regulations on sensitive data.

 

CSP networks have been optimized for shared cloud computing infrastructure environments but may not be the ideal choice for meeting the demands of an AI cloud. The differences between the management/user access network and the AI network are illustrated in Figure 1.

Differences between the management/user access network and the AI network

When an AI cloud uses traditional IP/Ethernet as its compute network, even with full optimization, it can reach only a fraction of the performance levels reported in MLPerf benchmarks. In a multi-tenant environment where multiple AI workloads run concurrently, traditional Ethernet also cannot provide performance isolation, which is essential for protecting one tenant's AI tasks from being slowed down by the AI tasks of other tenants.

 

For scenarios where AI clouds are deployed using Ethernet, NVIDIA has created the Spectrum-X networking platform, which significantly improves performance while enhancing predictability and energy efficiency in Ethernet-based AI clouds.

2. What is the NVIDIA Spectrum X Networking Platform?

NVIDIA® Spectrum™-X is the first platform designed to enhance performance and efficiency in Ethernet-based AI clouds. It improves the overall AI performance of large-scale AI workloads, such as LLMs, by up to 1.7 times and achieves 1.7 times higher energy efficiency in multi-tenant environments, while ensuring consistent and predictable performance. Spectrum-X builds its end-to-end network on the NVIDIA Spectrum-4 Ethernet switch and the NVIDIA BlueField®-3 DPU (Data Processing Unit). Its capabilities are tailored for AI workloads: they reduce the runtime of large-scale transformer-based generative AI models, so network engineers, data scientists, and cloud service providers get results faster and can make informed decisions sooner.

3. Explore the NVIDIA Spectrum X Networking Platform

Effective AI compute networking relies on the ability to support and accelerate AI workloads. To achieve this goal, optimization of various aspects of the network, from DPUs to switches, cables/fiber, networking, and acceleration software, is crucial. NVIDIA has introduced several innovations to achieve higher effective bandwidth in high-load and large-scale networking scenarios:

 

  1. Spectrum-4 switches support NVIDIA RoCE adaptive routing.

  2. BlueField-3 DPUs support NVIDIA Direct Data Placement (DDP).

  3. NVIDIA RoCE congestion control supported by Spectrum-4 switches and BlueField-3 DPUs.

  4. NVIDIA AI acceleration software.

  5. End-to-end AI network visibility.

 

These technologies work together as a full-stack solution that NVIDIA has extensively tested, optimized, and benchmarked to ensure higher levels of performance. When implemented as a full stack, the Spectrum-X networking platform provides several advantages.

4. The Main Advantages of Spectrum X

  • Enhanced AI Cloud Performance: Spectrum-X can boost AI cloud performance by up to 1.7 times.

 

  • Standard Ethernet Connectivity: Spectrum-X is fully compatible with standard Ethernet, allowing seamless interoperability with the traditional Ethernet stack.

 

  • Improved Energy Efficiency: Because workloads finish sooner, Spectrum-X contributes to a more energy-efficient AI environment.

 

  • Enhanced Multi-Tenant Protection: Performance isolation in multi-tenant environments keeps each tenant's workloads running consistently and at high performance, improving customer satisfaction and service quality.

 

  • Better AI Network Architecture Visibility: Having visibility into data flows within the AI cloud helps identify performance bottlenecks and is a crucial part of modern automated network architecture validation solutions.

 

  • Higher AI Scalability: Spectrum-X scales up to 128 400G ports in a single hop and up to 8,000 ports in a two-tier leaf-spine network topology, supporting the scaling needs of AI clouds while maintaining high levels of performance.

 

  • Faster Network Configuration: Advanced network features optimized specifically for AI workloads enable automated end-to-end network provisioning and configuration.
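
The scalability figures in the list above can be sanity-checked with a little arithmetic. The sketch below is an illustration only, not NVIDIA's published sizing method: it assumes a non-blocking two-tier leaf-spine fabric built from identical 128-port switches, with each leaf splitting its ports evenly between hosts and spines and using exactly one link to every spine. The function name `two_tier_host_ports` is hypothetical.

```python
def two_tier_host_ports(radix: int) -> int:
    """Host-facing ports in a non-blocking two-tier leaf-spine fabric.

    Assumptions (not stated in the article): every switch has `radix`
    ports, each leaf uses half its ports for hosts and half for uplinks,
    and each leaf connects to every spine with exactly one link.
    """
    downlinks = radix // 2       # leaf ports facing hosts
    # The remaining leaf ports are uplinks, one per spine switch.
    # Each spine port connects to a distinct leaf, so the spine radix
    # caps the number of leaves in the fabric.
    max_leaves = radix
    return max_leaves * downlinks


# 128 leaves x 64 host ports each
print(two_tier_host_ports(128))  # 8192
```

Under these assumptions a 128-port switch yields 8,192 host ports, which is consistent with the rounded figure of 8,000 ports quoted for the two-tier topology. Oversubscribed designs or multiple links per leaf-spine pair would change the result.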

5. Spectrum X: Ethernet Designed Specifically for AI

The Spectrum-X networking platform is specifically designed for demanding AI applications, offering a range of advantages over traditional Ethernet. With higher performance, lower power consumption, reduced total cost of ownership, seamless integration of software and hardware, and scalable architecture, Spectrum-X is poised to become the preferred platform for current and future AI cloud workloads.

 

When deploying the Spectrum-X networking platform, it is crucial to choose high-quality and reliable optical connectivity products. As a leading provider of optical network solutions, NADDOD offers optical connectivity and networking products and solutions. With deep business understanding and extensive project implementation experience in high-performance networking, NADDOD can provide optimal combinations of high-performance switches, intelligent network cards, and AOC/DAC/optical module products based on users' different application scenarios. Leveraging its technical expertise and rich project experience in the fields of optical networking and high-performance computing, NADDOD continues to deliver excellent products, solutions, and technical services for data centers, high-performance computing, edge computing, artificial intelligence, and other application scenarios, co-creating a smart world of interconnected data.

 

NADDOD offers optical connectivity products at different speeds that perfectly match Spectrum-X network devices. By selecting NADDOD's high-quality and reliable optical connectivity products for deployment on the Spectrum-X networking platform, you can achieve superior performance and reliability, empowering the rapid growth of your AI business!