Powerful Supercomputing System Behind Microsoft ChapGPT: NVIDIA A100 and InfiniBand Networking

NADDOD Abel InfiniBand Expert Mar 16, 2023

When we discuss about the reason why ChatGPT can become the top popularity all over the world today, the supercomputing power behind it is indispensable.

The data shows that the total computing power consumption of ChatGPT is about 3640PF-days (i.e., if the calculation is 1 quadrillion times per second, it needs to be calculated for 3640 days).

So, how was the supercomputer built by Microsoft specially for OpenAI born?

On Monday, Microsoft posted two blogs on its official paltform, deciphering this super expensive supercomputer and the significant upgrade of Azure - adding thousands of NVIDIA’s most powerful A100 graphics cards and faster InfiniBand network interconnection technology.

microsoft azure cloud center
At the same time, Microsoft also officially announced the latest ND H100 v5 virtual machine. Its specific specifications are as follows:

  • 8 NVIDIA H100 Tensor Core GPUs interconnected through next-generation NVSwitch and NVLink 4.0.
  • 400Gb/s NVIDIA Quantum-2 CX7 InfiniBand per GPU and 3.2 Tb/s non-blocking fat-tree networking per virtual machine.
  • NVSwitch and NVLink 4.0 have 3.6TB/s bi-directional bandwidth between 8 local GPUs per virtual machine.
  • 4th generation Intel Xeon Scalable processors.
  • PCIE Gen5 to GPU interconnection, each GPU has 64GB/s bandwidth.
  • 16-channel 4800MHz DDR5 DIMM.

Computing Power Supported by Hundreds of Millions of Dollars: Tens of Thousands of NVIDIA A100
About five years ago, OpenAI approached Microsoft with a bold idea—to build an artificial intelligence system that could change the way humans and machines interact forever. That means AI systems that create pictures of whatever people describe in plain language or a chatbot to write rap lyrics, draft emails and plan entire menus based on a handful of words. No one knew it was going to happen at that time. In order to build this system, OpenAI needs a lot of computing power—the kind that can really support super-large-scale computing. But the question is, can Microsoft do it? After all, there was no hardware that could meet OpenAI’s needs at that time, and it was uncertain whether building such a huge supercomputer in Azure cloud services would directly crash the system. Subsequently, Microsoft began a difficult exploration.

Microsoft Nidhi Chappell and Phil Waymouth
Nidhi Chappell (left), head of high-performance computing and artificial intelligence products at Microsoft Azure, and Phil Waymouth, senior director of strategic partnerships at Microsoft (right)

To build the supercomputers that underpin the OpenAI project, it spent hundreds of millions of dollars, connecting tens of thousands of NVIDIA A100 chips together on the Azure cloud computing platform and revamping server racks. In addition, in order to tailor this supercomputing platform for OpenAI, Microsoft has been very dedicated and has been paying close attention to OpenAI’s further needs, keeping abreast of their most critical needs when training AI. What is the cost of such a large project? Scott Guthrie, Microsoft’s executive vice president of cloud computing and artificial intelligence, would not disclose the exact amount, but said it was “probably more than” a few hundred million dollars.
supercomputer data center

OpenAI’s Problem

Phil Waymouth, an executive in charge of strategic partnerships at Microsoft, pointed out that the scale of cloud computing infrastructure required for OpenAI training models is unprecedented in the industry. Exponentially growing networked GPU clusters were beyond what anyone in the industry has attempted to build. The reason why Microsoft is determined to cooperate with OpenAI is because it firmly believes that this unprecedented scale of infrastructure will change history, create a new AI, and a new programming platform to provide customers with products and services that actually serve their interests.

Now it seems that these hundreds of millions of dollars are obviously not in vain—the bet is right.

On this supercomputer, the model that OpenAI can train becomes more and more powerful, and unlocks the amazing functions of AI tools. ChatGPT, which almost opened the fourth industrial revolution of mankind, was born.

Satisfied with the result, Microsoft invested another $10 billion into OpenAI in early January 2023.

OpenAI and Microsoft Partnership
It can be said that Microsoft’s ambition to break through the boundaries of AI supercomputing has paid off. Behind this is the transformation from laboratory research to AI industrialization. At present, Microsoft’s office software empire has begun to take shape. The ChatGPT version of Bing can help us search for vacation arrangements; the chatbot in Viva Sales can help marketers write emails; GitHub Copilot can help developers continue to write code; Azure OpenAI service allows us to access OpenAI’s large language model and also access Azure’s enterprise-level capabilities.
Key Microsfot AI breakthroughs

Microsoft and NVIDIA Join Forces: Using NVIDIA A100 Tensor Core GPUs and Quantum-2 InfiniBand Network

In fact, in November 2022, Microsoft officially announced that it would join forces with NVIDIA to build “one of the most powerful AI supercomputers in the world” to handle the huge computing load required to train and expand AI. The supercomputer is based on Microsoft’s Azure cloud infrastructure and uses tens of thousands of NVIDIA A100 Tensor Core GPUs, as well as NVIDIA Quantum-2 InfiniBand networking platform. In a statement, NVIDIA said the supercomputer could be used to research and accelerate generative AI models such as DALL-E and Stable Diffusion.

NVIDIA and Microsoft join forces to build ai computer
As AI researchers start using more powerful GPUs for more complex AI workloads, they see greater potential for AI models that can understand nuance so well that they can handle many different language tasks simultaneously. In simple terms, the bigger the model, the more data you have, the longer you can train it, and the better the accuracy of the model. But these larger models quickly reach the boundaries of available computing resources. And Microsoft understands what kind of supercomputer OpenAI needs and how big it needs to be. This obviously isn’t something that simply buys a bunch of GPUs, hooks them together, and starts working together.

NVIDIA HGX AI supercomputing platform
Nidhi Chappell, product lead for high-performance computing and artificial intelligence at Microsoft Azure, said: "We need to make bigger models train for longer, which means that not only do you need to have the largest infrastructure, you also have to make it run reliably for a long time. .”
Alistair Speirs, director of global infrastructure at Azure, said Microsoft has to make sure it can cool all those machines and chips. For example, using outside air in cooler climates, high-tech evaporative coolers in hotter climates, etc.

liquid cooled chassis
Also, since all the machines are powered on at the same time, Microsoft also had to take into account the placement of them and the power supplies. Like what might happen in your kitchen when you turn on your microwave, toaster, and vacuum all at the same time, only in a data center version.

InfiniBand Network Speeds up Large-scale AI Training

What is the key to completing these breakthroughs?

The challenge is how to build, operate, and maintain tens of thousands of co-located GPUs interconnected over a high-throughput, low-latency InfiniBand network. This scale is far beyond the scope of testing by GPU and network equipment suppliers, and it is completely uncharted territory. No one knows if the hardware will break at this scale. Nidhi Chappell, head of high-performance computing and artificial intelligence products at Microsoft Azure, explained that in the training process of LLM, the large-scale calculations involved are usually divided into thousands of GPUs in a cluster. In a stage called allreduce, the GPUs exchange information about the work they are doing. At this point, it needs to be accelerated through the InfiniBand network, so that the GPU can complete the calculation before the next block starts.

Graphics Processing Units gpu
Nidhi Chappell said that because these jobs span thousands of GPUs, in addition to ensuring the reliability of the infrastructure, there are many, many system-level optimizations required to achieve the best performance, which has been summed up after many generations of experience. The so-called system-level optimization includes software that can effectively use GPU and network equipment.

Over the past few years, Microsoft has developed techniques that reduce the resource requirements and time required to train and serve these models in production while increasing the ability to train models with tens of trillions of parameters . Waymouth noted that Microsoft and partners have also been gradually increasing the capacity of GPU clusters and developing InfiniBand networking to see how far they can push the data center infrastructure needed to keep GPU clusters running, including cooling systems, uninterruptible power systems and backup generator.

Microsoft Azure datacenter
Eric Boyd, vice president of Microsoft AI Platform, said that this supercomputing capability optimized for large-scale language model training and the next wave of AI innovation is already available directly in Azure cloud services. And Microsoft has accumulated a lot of experience through cooperation with OpenAI. When other partners come to them and want the same infrastructure, Microsoft can also provide it. Now, Microsoft’s Azure data centers have covered more than 60 regions around the world.

New Virtual Machine ND H100 v5 using NVIDIA H100 and Quantum-2 InfiniBand Network

On the above-mentioned infrastructure, Microsoft has continued to improve. On March 14, 2023, Microsoft officially announced new massively scalable virtual machines that integrate the latest NVIDIA H100 Tensor Core GPU and NVIDIA Quantum-2 InfiniBand network. With virtual machines, Microsoft can provide customers with infrastructure that scales with the size of any AI task. According to Microsoft, Azure’s new ND H100 v5 virtual machine provides developers with exceptional performance while utilizing thousands of GPUs.