Ethernet: The Road to Singularity - Modernized RDMA

NADDOD Jason, Data Center Architect, Oct 27, 2023

In the field of AI, especially with the unprecedented growth in demand for Large Language Models (LLMs) and generative AI, an important question arises: How do we effectively scale to support hundreds of thousands of nodes? The answer to this question lies in the innovation and enhancement of Ethernet.

 

We will unveil a series of technological innovations that span the entire Ethernet ecosystem. At the core of this is the Ultra Ethernet Consortium (UEC). We will delve into UEC's initiatives, including multi-path transmission, congestion control, and improved RDMA with end-to-end security. Through these initiatives, UEC will significantly expand the open innovation technologies available for the OCP community and its ecosystem.

 

For anyone following the field of AI, the "Singularity" refers to the point where machines reach or even surpass human-level intelligence. When we contemplate reaching that point, the computational power required far exceeds what can fit on a single chip, or even on multiple chips connected by adapter boards. We are talking about connecting tens of thousands, or even hundreds of thousands, of nodes. And when you connect that many nodes, you need a network.

 

The network is the computer. Looking at Moore's Law, hardly anyone doubts its diminishing returns anymore; at the very least, we have entered a phase where continued process-node migration brings little benefit beyond lower power consumption. Meanwhile, the cost per transistor is rising while the demand for computing power is growing exponentially.

 

So how do we tackle this challenge: chips are no longer getting bigger and per-chip performance gains are limited, yet we need thousands of these nodes, whether compute nodes or acceleration nodes, to perform all kinds of tasks and functions? To get there, we need to connect them together and build a massive networked system. Hence the idea that "the network is the computer." This is not a new concept; Sun Microsystems was exploring it some 20 years ago, and today we still firmly believe that the network is the fundamental element for scaling computation.

 

But what is the network? We say, "Ethernet is the network." It was the network for cloud computing, it is now the network for ML/AI, and in the future it will be the scale-out network that ML/AI requires. In fact, we said this at last year's OCP conference in October 2022, before ChatGPT was even released at the end of November that year.

Ethernet Development

 

Looking back at the past year, the significant progress of Ethernet is easy to see. We firmly believe in Ethernet because it is built on open standards and has an extremely open ecosystem: it is plug-and-play and interoperable, and the Ethernet market has attracted a large number of diverse participants. Last year, a total of 6 billion Ethernet ports were shipped. All of this demonstrates Ethernet's vast economic advantage and economies of scale.

 

To illustrate the economies of scale of Ethernet, I would like to emphasize a few data points. First, Ethernet has been around for 50 years. In fact, I should say it is only 50 years old, because it has become increasingly powerful over time. Second, Bob Metcalfe, the inventor of Ethernet, was awarded the Turing Award for his outstanding contributions to Ethernet.

 

Beyond these milestones, just look at the past year alone: the number of vendors announcing high-performance switches to meet ML/AI bandwidth demands is truly remarkable, isn't it?

 

About a year ago, Broadcom announced the release of multiple switches, and other vendors such as Marvell and Cisco followed with 51.2T switches. The encouraging trend here is that some advocates of other technologies, such as InfiniBand, have also stepped forward and announced Ethernet switches focused specifically on ML and AI. This showcases the power of Ethernet: many different vendors offering high-performance networks based on open standards to meet today's and tomorrow's needs.

 

Furthermore, Broadcom introduces a brand-new switch at every OCP conference, and today we have just announced a product called Qumran3D. It is the world's most powerful single-chip router and one of the most energy-efficient, providing approximately 25T of capacity while consuming less than 700 watts for the entire chip. This demonstrates that Ethernet will keep scaling up, delivering economics that other technologies struggle to match.

 

If you look at the largest IT operators globally, you'll find that all of their ML/AI infrastructure is connected through Ethernet networks. When I say this, it often raises a second-order question: "Is the front-end network based on Ethernet while the back-end relies on technologies like InfiniBand?" No, it is a single network: Ethernet. The front end and the back end converge into one Ethernet network. What we are discussing today is thousands of AI nodes, and eventually far more, connected on a single Ethernet network.

 

So make no mistake: Ethernet has been deployed at scale by the largest operators worldwide, and it will continue to be. Why? Because Ethernet has an ecosystem unmatched by any other technology. It provides fault tolerance, test equipment, monitoring tools, and the ability to swap a switch or network card from one vendor for another and have everything keep working together. That is why Ethernet is so widely deployed today, and we believe it will continue to evolve in the future.

 

When you consider all these factors, the outlook appears optimistic. You now have Ethernet and technologies like RoCE that can scale to thousands of nodes.

New AI Models

 

But looking ahead, what should we consider? If you look at the growth rate of language models, in 2020, GPT-3 had around 175 billion parameters, and it is expected that GPT-4 will have over 1 trillion parameters. Whether it's large language models or recommendation models, significant resources are required to handle the next generation and evolution of these models.
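To put that parameter growth in perspective, here is a rough back-of-the-envelope calculation; the bytes-per-parameter and per-accelerator memory figures below are illustrative assumptions, not published GPT-4 specifications.

```python
# Rough arithmetic under illustrative assumptions (not published GPT-4 figures):
# why a trillion-parameter model cannot live on a single accelerator.
params = 1.0e12                # ~1 trillion parameters
bytes_per_param = 16           # fp16 weights + grads + fp32 Adam optimizer state
gpu_memory_gib = 80            # assumed memory of one high-end accelerator

total_gib = params * bytes_per_param / 2**30
print(f"training state: ~{total_gib / 1024:.1f} TiB")
print(f"accelerators needed just to hold it: ~{total_gib / gpu_memory_gib:.0f}")
```

Even before counting activations, data parallelism, or failure headroom, the working set already spans hundreds of accelerators, and every one of them has to exchange gradients over the network.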

 

In our conversations with the operators who are building these models today, and about their plans for the future, they have posed questions to us: Can we build a network that connects hundreds of thousands of nodes? Can you cool these devices effectively? Can you deploy them within data centers? Is there enough optical interconnect to operate them?

 

I always advise setting those doubts aside for now: do your best, and the other questions will be addressed in due course. So if you ask how to achieve connectivity among millions of nodes, that is precisely the problem many customers, vendors, and partners are striving to solve.

 

In this context, how can we achieve scalability to hundreds of thousands or even millions of nodes in the future? About two years ago, some participants in the industry came together and established three objectives:

 

First, Ethernet needs to deliver the performance of a supercomputing interconnect. I want to emphasize that once again: Ethernet needs to deliver the performance of a supercomputing interconnect.

 

Second, this needs to be achieved at truly large scale. When we talk about scale, we don't mean a few hundred or even tens of thousands of nodes, but hundreds of thousands and eventually millions.

 

Third, it needs to have the total cost of ownership and wide applicability of Ethernet.

 

Their premise was this: we have set these three objectives, and we understand what today's Ethernet has already achieved, namely tens of thousands of computing nodes. So what changes does Ethernet need to undergo to scale to millions of nodes?

 

When you look at the companies on this list, these are all enterprises deeply versed in computing and networking. Among them are companies that have deployed some of the largest cloud computing networks and systems in the world. They collectively said, "Let's figure out together what transformations Ethernet needs over the next two to four years, not because it isn't excellent now, but because it must adapt to the future."

Ultra Ethernet Consortium

 

Thus, the Ultra Ethernet Consortium (UEC) was born, with the singular goal of making Ethernet the highest-performing, most scalable, and most cost-effective interconnect anywhere in the world.

 

If you want an example of the work in this field, look at RDMA. RDMA is one of the fundamental technologies for moving data directly between the memory of one computing node and another. It was originally built for InfiniBand; over time it evolved to run over Ethernet as RoCE (RDMA over Converged Ethernet). Today, RoCE operates successfully in environments with many thousands of computing nodes.

Modernizing RDMA

 

However, the problem lies here: when RDMA was built 20 years ago, it was designed to connect one node to another, or perhaps a dozen nodes to another dozen nodes, or from 100 nodes to 200 nodes. As most of you may remember, not long ago, HPC clusters purchased by enterprise customers or large oil and gas exploration companies typically had only 256 nodes, maybe 512 nodes, or at most 1000 nodes. RDMA was designed for this scale. But today, if you talk to someone about 1000 nodes, they would consider it child's play. Even 10,000 nodes are no longer something new. People are interested in 100,000 nodes or even more.

 

So, what are the issues with RDMA? RDMA was not built for deployments at this scale; it baked in a set of assumptions from that earlier era. First, it lacked support for multi-path transmission: data could travel from point A to point B over only a single path, and all of a flow's traffic had to be carried on that one path. The result is that some links sit underutilized while others are overloaded.
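As a minimal sketch of that single-path limitation, the snippet below assumes a made-up four-link fabric: hashing a flow's 5-tuple pins every packet of that flow to one link, whereas per-packet spraying, one of the multi-path ideas UEC is pursuing, spreads the same traffic across all links.

```python
# Illustrative sketch, not any vendor's implementation: why single-path flow
# hashing leaves some links idle while others saturate, and what per-packet
# multi-path spraying changes. Link count and flow sizes are made-up numbers.
import random
from collections import Counter

LINKS = 4                  # parallel links between two points in the fabric
FLOWS = 8                  # long-lived RDMA flows (the classic "elephant" flows)
PKTS_PER_FLOW = 1000

def single_path(flows):
    """Classic behaviour: a hash of the flow pins every packet to one link."""
    load = Counter({link: 0 for link in range(LINKS)})
    for flow_id in range(flows):
        # 5-tuple hash (4791 is the RoCEv2 UDP port)
        link = hash(("10.0.0.1", "10.0.0.2", 4791, flow_id)) % LINKS
        load[link] += PKTS_PER_FLOW
    return dict(load)

def multi_path(flows):
    """Multi-path transmission: each packet may be sprayed onto any link."""
    load = Counter({link: 0 for link in range(LINKS)})
    for _ in range(flows * PKTS_PER_FLOW):
        load[random.randrange(LINKS)] += 1
    return dict(load)

print("single-path load per link:", single_path(FLOWS))
print("multi-path  load per link:", multi_path(FLOWS))
```

Run it a few times and the single-path case routinely leaves one link nearly empty while another carries several flows; the sprayed case stays close to evenly balanced.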

 

Furthermore, RDMA assumed in-order packet delivery: all packets within a data stream must arrive in sequence, so the second packet must arrive after the first, the third after the second, and so on. On top of that, RDMA uses a retransmission technique called "Go-Back-N." If a packet is lost in the stream, say the fifth packet, even though the subsequent packets (six, seven, eight) have already been delivered successfully, Go-Back-N essentially tells the sender, "You lost the fifth packet, so retransmit packets five, six, seven, and eight." This approach is highly inefficient.
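Here is a tiny sketch of the difference, reusing the eight-packet example above: Go-Back-N drags three already-delivered packets back onto the wire for a single loss, while a selective-retransmission scheme (the kind of behavior a modernized transport aims for) resends only the packet that was actually lost.

```python
# Simplified sketch of the retransmission cost described above: lose one
# packet mid-stream and count what each strategy has to resend.
def go_back_n_resends(total_packets, lost):
    # The receiver discards everything after the first gap, so the sender must
    # replay the lost packet and every packet that followed it.
    return list(range(lost, total_packets + 1))

def selective_resends(total_packets, lost):
    # The receiver keeps out-of-order packets; only the missing one is replayed.
    return [lost]

print("Go-Back-N resends: ", go_back_n_resends(8, lost=5))   # [5, 6, 7, 8]
print("Selective resends: ", selective_resends(8, lost=5))   # [5]
```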

 

RDMA's design also assumes a lossless network and relies on DCQCN (Data Center Quantized Congestion Notification) for congestion control, rather than the more robust loss-recovery mechanisms found in TCP/IP. The problem is that this makes the network extremely fragile and dependent on very precise engineering. Typically, the companies selling these systems want you to buy everything from them, from the optics to the cables to the entire system, and charge you several times the normal cost.
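For intuition, here is a highly simplified sketch of a DCQCN-style reaction point at the sender, with illustrative constants; real DCQCN also involves byte counters, timers, and staged recovery, so this only shows the general shape of the rate cuts and the recovery that follows.

```python
# Highly simplified DCQCN-style sender (reaction point): cut the rate when a
# CNP (congestion notification triggered by ECN marks) arrives, recover toward
# the pre-cut rate otherwise. Constants are illustrative, not from the spec.
LINE_RATE_GBPS = 400.0
G = 1 / 16                         # gain for updating the congestion estimate

def dcqcn_step(rate, target, alpha, cnp_received):
    if cnp_received:
        alpha = (1 - G) * alpha + G        # congestion estimate rises
        target = rate                      # remember the rate before the cut
        rate = rate * (1 - alpha / 2)      # multiplicative decrease
    else:
        alpha = (1 - G) * alpha            # estimate decays when things are quiet
        rate = (rate + target) / 2         # recover toward the remembered rate
    return rate, target, alpha

rate, target, alpha = LINE_RATE_GBPS, LINE_RATE_GBPS, 1.0
for step, cnp in enumerate([True, False, False, True, False, False]):
    rate, target, alpha = dcqcn_step(rate, target, alpha, cnp)
    print(f"step {step}: cnp={cnp!s:<5} rate={rate:6.1f} Gb/s  alpha={alpha:.3f}")
```

The point is not the exact numbers but the dependence: the whole scheme only behaves well when ECN thresholds, PFC settings, and buffer sizes across the fabric are tuned together, which is exactly the "precise engineering" burden described above.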

 

These characteristics served the markets of the past, but they do not fit where the world is heading. Therefore, Ultra Ethernet has proposed a concept to address the issues with RDMA, called Ultra Ethernet Transport. It takes a series of measures to solve the problems described above, and while I won't go into each one in detail, the overall idea is to build a high-performance network that eliminates the inefficiencies of traditional RDMA and scales to more than 100,000 nodes in a highly robust environment.

 

In the supercomputing field, one crucial concern is packet loss. Microsoft published a paper stating that even 0.1% packet loss can lead to exponential growth in job completion time, because the work has to roll back to the state before the packet was lost and be redone, which is extremely inefficient.
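To make that compounding effect concrete, here is a back-of-the-envelope model with assumed constants; it is not the methodology of the Microsoft paper, and because it looks at a single flow it ignores the even worse tail you get when a training step must wait for the slowest of thousands of flows.

```python
# Back-of-the-envelope model, not the cited paper's methodology: every training
# step waits for its collectives to finish, and each lost packet stalls the
# affected flow by roughly a retransmission timeout. All constants are assumed.
PKTS_PER_STEP = 10_000      # packets one flow sends per training step
STEP_MS = 50.0              # step time with zero loss
STALL_MS = 20.0             # stall added per lost packet (timeout / replay cost)

def expected_step_ms(loss_rate):
    expected_losses = loss_rate * PKTS_PER_STEP
    return STEP_MS + expected_losses * STALL_MS

for loss in (0.0, 1e-5, 1e-4, 1e-3):    # 1e-3 is the 0.1% figure cited above
    step = expected_step_ms(loss)
    print(f"loss={loss:7.0e}: expected step ~ {step:6.1f} ms ({step / STEP_MS:.1f}x)")
```

Even this optimistic single-flow model turns a 50 ms step into roughly 250 ms at 0.1% loss, a 5x slowdown before the tail effects are counted.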

Path to Singularity

 

Therefore, our vision is an incredibly robust network with high resilience and high performance, operating within an open-standards framework. That is precisely what the Ultra Ethernet Consortium is striving to achieve. The initial founding members have now opened it up to many other companies, and more than 200 companies have expressed interest in joining Ultra Ethernet.

 

To summarize: in the world of ML/AI, no single company will provide all the GPUs, and no single company will offer all the interconnect solutions. The only way to achieve this scale is to build an ecosystem in which multiple vendors provide accelerators, and the survival of that ecosystem depends on an open, standards-based, high-performance, cost-effective interconnect architecture. Ethernet is the only choice, yesterday, today, and tomorrow.