Why did NVIDIA build a switch and what is its connection with AIGC?

NADDOD Neo Switch Specialist Jun 5, 2023

At the recent Computex, most attention was focused on NVIDIA's newly released DGX GH200 and MGX, both of which are system-level AI products, whether as reference designs or complete servers. Chips, boards, and systems built around CPUs and GPUs have always drawn the spotlight for NVIDIA, especially now that AI and HPC are so hot.

However, in the context of AI HPC, particularly AIGC or what many people now call "large model" computing, networking is just as essential. In other words, many servers need to work together, and a large-scale cluster is required to handle the computation. Once computing power spans systems and nodes, performance is no longer just a question of the CPUs, GPUs, and AI chips inside a single node.

Overview

Google has previously noted that system-level architecture matters even more to overall AI infrastructure than the micro-architecture of the TPU chip itself. Of course, this "system level" may not necessarily cover cross-node networking, but clearly, when a large number of chips work together, both the system and the network become performance bottlenecks.

This is also why DPUs have become important. NVIDIA's DPU and other networking products read more like complements that fill gaps in its existing product line than products made for standalone sale or head-on competition with existing offerings on the market. From this perspective, NVIDIA's hardware products form a complete horizontal ecosystem: the DPU is not meant to compete with anyone but is an integral part of the existing lineup.

Jensen Huang
At Computex 2023, NVIDIA released the Spectrum-X Ethernet platform as its headline networking product. NVIDIA claims this is the world's first high-performance Ethernet product designed specifically for AI, particularly for "AIGC workloads that require a new type of Ethernet." Taking Spectrum-X as the starting point, this article discusses this Ethernet product and the logic behind NVIDIA's networking products.

Why is NVIDIA creating a “switch”?

The two core components of the Spectrum-X platform are the Spectrum-4 Ethernet switch and the BlueField-3 DPU. We won't go into detail about the DPU; as for the Spectrum switch, NVIDIA actually announced the Spectrum-4 400Gbps switch at last year's GTC. At the chip level it is based on the Spectrum-4 ASIC, which Jensen Huang showed during his Computex keynote (presumably the ASIC pictured below, although Huang only said "this is the chip"). It is a massive device: 100 billion transistors, a 90x90mm package with 800 solder balls on the bottom, and 500W of power consumption.

NVIDIA announced that the Spectrum-4 Ethernet switch system, designed specifically for AI as the "first high-performance Ethernet architecture," is now available - "availability" here likely meaning it is ready for CSPs. The specific configuration can be seen in the image: Spectrum-4 has 128 ports of 400GbE for a total bandwidth of 51.2Tb/s, offering twice the effective bandwidth of traditional Ethernet switches. It "allows network engineers, AI data scientists, and cloud service providers to produce results and make decisions faster," enabling AIGC clouds. High bandwidth and low latency are essential so that scaling GPUs across nodes does not run into performance bottlenecks. The entire switch consumes 2800W.

NVIDIA Spectrum-X
At first glance, it seems like the “switch” is designed to compete with existing network switches on the market. However, during last year’s GTC, NVIDIA explained that this product is not meant to compete with regular network switches, but rather to handle “elephant flow” traffic while maximizing the potential of large-scale AI, digital twins, simulations, and other applications.

"Traditional switches used for interconnect are too slow for today's AIGC workloads, and we're just at the beginning of the AI revolution. Traditional switches may be sufficient for commodity clouds, but they cannot provide the performance needed to scale AIGC workloads in AI clouds," said NVIDIA Networking SVP Gilad Shainer during his keynote address.

During the pre-briefing, a reporter specifically asked whether NVIDIA Spectrum competes directly with switches from companies like Arista. Shainer’s answer was that there is no competition. “Other Ethernet switches on the market are used to build commodity clouds or handle north-south traffic, including user access and cloud control. But there is no solution for AIGC clouds—no Ethernet can meet the needs of AIGC. Spectrum-4, as the world’s first Ethernet network for east-west traffic in AIGC, provides a completely new Ethernet solution for this purpose.”

Shainer also said that existing products from companies like Broadcom do not compete with Spectrum-4. In its introduction, NVIDIA emphasized that Spectrum-X creates a lossless Ethernet network - a point that matters particularly when explaining the Spectrum-X platform.

Which is better: Ethernet or InfiniBand?

For anyone who has studied networking, Ethernet is a familiar term: a well-established standard that has kept evolving with the times. The reason "lossless" needs to be called out is that Ethernet was originally designed for a lossy environment - the network itself is allowed to drop packets. To ensure reliability, the TCP protocol is needed on top of the IP layer so that the sender can retransmit packets lost in transit. These error-correction mechanisms add latency, which is a problem for certain types of applications. In addition, to cope with sudden spikes in traffic, switches need to allocate extra buffer resources to temporarily hold data, which is why, as noted earlier, the size and cost of Ethernet switch chips tend to be higher than those of comparably positioned InfiniBand chips.
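To make the latency cost concrete, here is a minimal stop-and-wait retransmission sketch in Python - a simplified illustration of the reliability-by-retransmission idea, not how any real TCP stack works; `send_packet` and `wait_for_ack` are hypothetical callables supplied by the caller:

```python
TIMEOUT_S = 0.2  # retransmission timeout; real TCP stacks derive this from measured RTT

def reliable_send(packet, send_packet, wait_for_ack):
    """Retransmit one packet until it is acknowledged (stop-and-wait style).

    Each lost packet costs at least one extra timeout of waiting, which is
    where the added latency of reliability-over-a-lossy-network comes from.
    """
    attempts = 0
    while True:
        send_packet(packet)
        attempts += 1
        if wait_for_ack(timeout=TIMEOUT_S):  # ACK arrived in time
            return attempts                  # attempts > 1 means retransmission latency was paid
        # timeout expired with no ACK: assume the packet was lost and resend
```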

However, "lossy networks are unacceptable for supercomputing data centers," said Jensen Huang. "The overall cost of the workload running on the supercomputer is high, and any loss in the network is difficult to bear." There are also requirements, such as performance isolation, that lossy networks simply cannot meet. Engineers who follow NVIDIA and its AI supercomputing architectures will know that NVIDIA has long relied on a networking standard called InfiniBand. InfiniBand is commonly used in HPC applications that require high throughput and low latency; unlike the more widely deployed Ethernet, it is better suited to data-intensive applications.

AOC cable
In fact, InfiniBand is not exclusive to NVIDIA. Many companies, including Intel, IBM, and Microsoft, participated in its development under a dedicated alliance, the IBTA. Mellanox began promoting InfiniBand products around 2000. According to Wikipedia, the original goal of InfiniBand was to replace PCI for I/O and Ethernet for interconnecting server rooms and clusters.

Unfortunately, InfiniBand emerged just as the dot-com bubble burst, and its development stalled for a time; participants such as Intel and Microsoft eventually moved on to other options. Still, it is said that in the 2009 TOP500 list of supercomputers, 181 systems used InfiniBand for their internal interconnect (most of the rest used Ethernet), and by 2014 InfiniBand accounted for more than half - although 10Gb Ethernet caught up quickly over the following two years. By the time NVIDIA acquired Mellanox in 2019, Mellanox was already the major supplier of InfiniBand communication products on the market.

From a reliability standpoint, InfiniBand has a complete protocol definition covering network layers 1-4: it achieves the lossless property through end-to-end flow control that prevents packet loss. Another major difference is that InfiniBand was designed around a switched fabric, while Ethernet originated as a shared-medium channel; in theory, the former is better at avoiding network contention.

optical transceivers
Since InfiniBand is so good, why is NVIDIA also working on Ethernet? Intuitively, Ethernet's installed base, universality, and flexibility are important factors. Jensen Huang said in his keynote that "we want to bring AIGC to every data center," which requires compatibility with what is already deployed. "Many enterprises deploy Ethernet," and "it is difficult for them to obtain the capabilities of InfiniBand, so we bring such capabilities to the Ethernet market." That is the business logic behind Spectrum-4. However, we believe that is not the whole story.

NVIDIA is working on Ethernet and InfiniBand products simultaneously: the Ethernet line is the Spectrum platform, while the InfiniBand line is called Quantum. If you look at NVIDIA's official website, you will find that the InfiniBand solution "achieves unparalleled performance on HPC, AI, and supercluster cloud infrastructure at a lower cost and complexity," while Spectrum is positioned as Ethernet switching for AI and cloud acceleration. Obviously, the two compete with each other to some extent.

What are some common issues with InfiniBand?

Jensen Huang used part of his keynote to explain the different types of data centers - in fact, NVIDIA laid out a six-category classification of data centers at last year's GTC. For the AI scenarios discussed here, data centers can be divided into two categories. One manages a large number of different application workloads, with many tenants and weak dependencies between workloads. The other is the typical supercomputer or the now-popular AI supercomputer, with very few tenants (as few as one, on bare metal) and tightly coupled workloads that demand high throughput for large-scale computing problems. The infrastructure requirements of these two types of data centers are quite different. Intuitively, plain lossy Ethernet is not suited to the latter, as discussed earlier.

Recently, SemiAnalysis published an article specifically discussing the problems with InfiniBand - mainly technical ones, but useful as a reference for understanding NVIDIA's push into Ethernet. Some of the key points follow. These inherent shortcomings are hardly new - any standard, protocol, or technology has advantages and disadvantages - so treat them as reference only; in fact, both InfiniBand and Ethernet are constantly evolving.

NVIDIA chip
InfiniBand uses a credit-based flow control mechanism: each link is pre-allocated a certain number of credits, reflecting link bandwidth and other properties. When a data packet is received and processed, the receiving end returns credits to the sending end. Ideally, such a system ensures the network is never overloaded, because the sender must wait for credits to be returned before sending more packets.

However, this mechanism has problems of its own. If a sender transmits faster than the receiver can process, the receive buffer fills up and the receiver stops returning credits; once credits are exhausted, the sender can no longer transmit. Worse, if that stalled sender is itself a receiver for other nodes, it too will fail to return credits under overload, and the back pressure spreads further and further upstream. There are also problems caused by deadlock and by the error rates of different components.
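To make the mechanism concrete, here is a minimal Python sketch of credit-based flow control - a toy model with simplified assumptions (a fixed credit pool per link, one packet per credit), not InfiniBand's actual link-layer protocol:

```python
from collections import deque

class CreditLink:
    """Toy model of a credit-based link: one credit buys one packet in flight."""

    def __init__(self, credits):
        self.credits = credits   # pre-allocated according to link bandwidth / buffer size
        self.rx_buffer = deque()

    def try_send(self, packet):
        """Sender side: transmit only if a credit is available."""
        if self.credits == 0:
            return False         # sender stalls here; this is where back pressure builds
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def receiver_process_one(self):
        """Receiver side: drain one packet and return a credit to the sender."""
        if self.rx_buffer:
            self.rx_buffer.popleft()
            self.credits += 1

# If the receiver processes more slowly than the sender transmits, credits run
# out and try_send() starts returning False - the stall that can propagate
# upstream when this node is itself a receiver for other senders.
link = CreditLink(credits=4)
sent = sum(link.try_send(i) for i in range(10))  # only 4 of 10 packets get through
```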

Why choose Ethernet in HPC?

The inherent problems of InfiniBand become more severe as the system scales up and grows more complex. Currently, the largest commercial InfiniBand deployment is probably Meta's research cluster, with a total of 16,000 NICs and 16,000 A100 GPUs.

That scale is already huge, but SemiAnalysis suggests that GPT-4 training requires even more, and future "large model" development will keep pushing clusters larger. In theory, InfiniBand can keep scaling its overall capacity, but the impact of its inherent problems grows with it. Inference can still benefit from InfiniBand's latency and performance, but inference traffic consists of many requests streaming at different rates, and future architectures will want to pack multiple large models, at varying batch sizes, into the same large cluster - which means the credit-based flow control has to keep readjusting.

A credit-based flow control mechanism is slow to respond to changes in the network. With a large amount of diverse traffic, receive-buffer occupancy changes quickly; if the network is congested, the sender may still be acting on stale credit information, compounding the problem. And if the sender is constantly waiting for credits, alternating between stalling and transmitting, performance fluctuations are likely.

Spectrum-X switch
From a practical standpoint, NVIDIA's current Quantum-2 delivers 25.6Tb/s of bandwidth, at least numerically behind Spectrum-4's 51.2Tb/s; faster Quantum chips and infrastructure will have to wait until next year, a different cadence. From a cost perspective, reaching the same scale of GPU deployment (8,000+ GPUs) with Quantum-2 requires an additional layer of switching and significantly more cables - costly fiber-optic cables at that. The typical cost of an InfiniBand network deployment is therefore significantly higher than an Ethernet one. (The cost of DPUs and NICs does not seem to be considered here.)

From the customer's perspective, Ethernet still has a much larger market than InfiniBand, which also helps reduce deployment costs. There are more specific points of comparison as well: traditional front-end systems are already Ethernet-based, and InfiniBand carries vendor lock-in for customers, while Ethernet clearly offers more choices and potentially better flexibility and scalability in deployment. Technically, Ethernet also appears to have some potential advantages for future optical transmission infrastructure; interested readers can refer to SemiAnalysis's article.

Jensen Huang
These are the theoretical justifications for NVIDIA's emphasis on Ethernet - or the part of the decision-making process that weighs the technical and practical trade-offs of different network technologies for data center deployment. Ultimately, the choice of network technology depends on many factors, including the specific requirements of the workload, the scale of the deployment, the cost and availability of different technologies, and customer preferences. It is also worth noting that data center networking keeps evolving, and new technologies and standards may yet reshape the landscape of network deployment once again.

Finally, let's return to the issue raised at the beginning: Ethernet was originally a lossy network. With the development of technologies such as RoCE (RDMA over Converged Ethernet), however, some of InfiniBand's advantages have been brought to Ethernet. To some extent, technology evolution is about collecting the strengths of different approaches - InfiniBand's high performance and losslessness alongside Ethernet's universality, cost-effectiveness, and flexibility.

The RoCE listed among the Spectrum-X platform features achieves a lossless Ethernet network through the PFC (priority-based flow control) mechanism - which here relies on the endpoint-side NIC rather than on the switch itself.
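As a rough illustration of the idea behind priority-based flow control, here is a simplified Python sketch: a per-priority queue that asks the upstream sender to pause before it can overflow. The XON/XOFF thresholds and queue model are assumptions for illustration, not NVIDIA's implementation or the 802.1Qbb wire format:

```python
class PfcQueue:
    """Toy per-priority queue that pauses upstream traffic before it can overflow."""

    def __init__(self, capacity, xoff_ratio=0.8, xon_ratio=0.5):
        self.capacity = capacity
        self.xoff = int(capacity * xoff_ratio)  # above this depth: emit PAUSE for this priority
        self.xon = int(capacity * xon_ratio)    # at or below this depth: emit RESUME
        self.depth = 0
        self.paused = False

    def enqueue(self):
        """A packet of this priority arrives; returns True if upstream should pause."""
        self.depth += 1
        if self.depth >= self.xoff:
            self.paused = True   # PAUSE sent; with enough headroom nothing is ever dropped
        return self.paused

    def dequeue(self):
        """A packet is forwarded on; returns True if upstream may transmit again."""
        if self.depth:
            self.depth -= 1
        if self.paused and self.depth <= self.xon:
            self.paused = False  # RESUME: the queue has drained enough to accept traffic
        return not self.paused

q = PfcQueue(capacity=100)
for _ in range(85):
    q.enqueue()          # crosses the XOFF threshold, so upstream is asked to pause
assert q.paused
```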

HPC data center
In addition, RoCE++ offers some newer optimization extensions, such as ASCK, which handles packet loss and out-of-order arrival: the receiver tells the sender to retransmit only the packets that were lost or damaged, achieving higher bandwidth utilization. There are also features such as ECN, flow control mechanisms, and error-handling optimizations, all of which improve efficiency and reliability. Furthermore, to alleviate the scalability limits of endpoint NICs in standard Ethernet RoCE networks, the BlueField NIC mode can be used, with the overall cost of the DPUs diluted across the Ethernet deployment and these newer technologies.
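A minimal sketch of the selective-retransmission idea described above (Python, with hypothetical sequence-number bookkeeping; not the RoCE wire format or any specific NIC API):

```python
def missing_sequences(received, expected_count):
    """Receiver side: report only the gaps, instead of 'everything after the loss'."""
    return sorted(set(range(expected_count)) - set(received))

def retransmit(send_fn, packets, nack_list):
    """Sender side: resend only the packets the receiver reported missing."""
    for seq in nack_list:
        send_fn(packets[seq])

# Example: packets 3 and 7 were lost out of 10. Only those two are resent,
# instead of a go-back-N scheme retransmitting 3..9 and wasting bandwidth.
received = [0, 1, 2, 4, 5, 6, 8, 9]
assert missing_sequences(received, 10) == [3, 7]
```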

Jensen Huang mentioned in his keynote speech that Spectrum-X brings two important features to Ethernet: adaptive routing and congestion control. In addition, NVIDIA has previously collaborated with IDC to publish a white paper report on the commercial value of Ethernet switching solutions, which interested developers can download and view on NVIDIA’s website.
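To give an intuition for what adaptive routing means here, the following is a conceptual Python sketch; the queue-depth metric and per-packet path choice are simplifying assumptions, not a description of Spectrum-4 internals:

```python
def pick_uplink(uplinks):
    """Choose the least-congested uplink for the next packet.

    Static (ECMP-style) hashing can pin an elephant flow to one congested path;
    adaptive routing instead steers packets toward the emptiest queue, with the
    receiving side responsible for putting out-of-order packets back in order.
    """
    return min(uplinks, key=lambda port: port["queue_depth"])

uplinks = [
    {"name": "port0", "queue_depth": 37},
    {"name": "port1", "queue_depth": 5},
    {"name": "port2", "queue_depth": 21},
]
assert pick_uplink(uplinks)["name"] == "port1"
```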

Conclusion

In large-scale AI applications, Ethernet may well be the inevitable choice. In promoting Spectrum-X, NVIDIA's position is therefore that it is built specifically for the AIGC cloud, as the "first" solution for AIGC east-west traffic. But the reason is probably not only that Ethernet is highly universal; there is also a real possibility that AI HPC workloads shift to Ethernet across the board.

As the saying goes, the development of competing standards is a process of constantly filling gaps and learning from each other's strengths. InfiniBand, for example, has various mitigations for its inherent deficiencies, and some of its extensions are also helpful for AI. In the end, this is a matter of comparing options as the technologies themselves evolve. Whether NVIDIA will lean toward InfiniBand or Ethernet in its future development remains to be seen, even though the two have their own application scenarios.
