NVIDIA Unveils Its Most Powerful GPU: Blackwell B200 Unleashes New AI Performance

NADDOD Abel InfiniBand Expert Mar 22, 2024

In March 2023, NVIDIA held its GTC 2023 keynote, where CEO Jensen Huang not only outlined the company's many achievements in the era of artificial intelligence and its expectations for future development, but also unveiled a range of heavyweight hardware products, including the Grace Hopper Superchip, the AI Foundations cloud service, the DGX Cloud AI supercomputing service, and the world's first GPU-accelerated quantum computing system.

 

Then, on the morning of March 19, 2024 (Beijing time), NVIDIA once again hosted its annual GTC keynote. In it, CEO Jensen Huang shared the breakthroughs of the new generation of AI, marking another transformative moment for the field.

 

The Next-Generation AI Platform: Blackwell

 

The Blackwell B200, a much larger GPU, is named after David Harold Blackwell, a mathematician who specialized in game theory and statistics and who was the first African American scholar elected to the National Academy of Sciences.

 

According to NVIDIA, the B200 is twice the size of the "artificial intelligence superchip" Hopper, integrating 208 billion transistors. It is manufactured on a custom TSMC 4NP process, with two GPU dies connected by a 10 TB/s chip-to-chip interconnect so that they operate as a single GPU. One point here is worth dwelling on:

 

Although TSMC 4NP is technically a new node, it is just a higher-performance version of the 4N node used for the GH100 GPU. This marks the first time in many years that NVIDIA has been unable to lean on the performance and density advantages of a major new node. It means that almost all of Blackwell's efficiency gains must come from architectural efficiency, and it is the combination of that efficiency with the sheer scale of horizontal expansion that delivers Blackwell's overall performance gains.

 

Blackwell B200

 

NVIDIA states that the new B200 GPU offers up to 20 petaflops of FP4 compute from its 208 billion transistors, and is equipped with 192 GB of HBM3E memory delivering up to 8 TB/s of bandwidth.

 

Blackwell GPU

 

On the B200, each die is paired with four HBM3E memory stacks, for eight stacks in total and an effective memory bus width of 8,192 bits. One of the limiting factors for all AI accelerators is memory capacity (and do not underestimate the demand for bandwidth either), so being able to fit more stacks is crucial for increasing an accelerator's local memory capacity.

 

Overall, the B200 offers 192 GB of HBM3E, or 24 GB per stack, the same per-stack capacity as the H200 (and 50% more memory per stack than the original H100's 16 GB stacks).

 

According to NVIDIA, the chip's HBM memory has a total bandwidth of 8 TB/s, which works out to 1 TB/s per stack and a data rate of 8 Gbps per pin. HBM3E is ultimately designed to run at 9.2 Gbps per pin or higher, but NVIDIA has often been somewhat conservative with clock speeds on its server accelerators. Even so, this is nearly 2.4 times the memory bandwidth of the H100 (and 66% higher than the H200), so Blackwell still brings a significant increase in bandwidth.
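
As a quick sanity check, the capacity and bandwidth figures above all follow from the stack count and the per-pin data rate. The short script below reproduces the arithmetic using only numbers quoted in this article (plus the H100's 3.35 TB/s and the H200's 4.8 TB/s); it is an illustration of the math, not an official spec sheet.

```python
# Reproduce the B200 memory arithmetic quoted above (illustrative only).

stacks = 8                  # 4 HBM3E stacks per die x 2 dies
bits_per_stack = 1024       # standard HBM interface width per stack
total_capacity_gb = 192     # total HBM3E capacity

bus_width_bits = stacks * bits_per_stack
print(bus_width_bits)                  # 8192-bit effective memory bus

print(total_capacity_gb // stacks)     # 24 GB per stack

pin_rate_gbps = 8.0
bw_gb_s = bus_width_bits * pin_rate_gbps / 8   # bits/s -> bytes/s
print(bw_gb_s)                         # 8192 GB/s, which NVIDIA rounds to 8 TB/s

print(round(8.0 / 3.35, 2))            # ~2.39x the H100's 3.35 TB/s
print(round(8.0 / 4.8, 2))             # ~1.67x the H200's 4.8 TB/s, i.e. ~66% higher
```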

 

B200 vs. H200

 

Blackwell vs. Hopper

 

Like the Hopper series, Blackwell also comes in a "superchip" configuration: the GB200 pairs two B200 GPUs with one NVIDIA Grace CPU over a 900 GB/s chip-to-chip link. NVIDIA states that, compared with the NVIDIA H100 GPU, the GB200 Superchip delivers up to a 30x performance improvement on LLM inference workloads while reducing cost and energy consumption by up to 25 times.

 

Lastly, NVIDIA will also introduce the HGX B100. Its basic concept is similar to the HGX B200, pairing an x86 CPU with eight B100 GPUs, but it is designed to be drop-in compatible with existing HGX H100 infrastructure, allowing the fastest possible deployment of Blackwell GPUs. The TDP limit of each GPU is 700 W, the same as the H100, with FP4 throughput reduced to 14 petaflops per GPU.

 

In addition to the on-paper performance improvements, Blackwell also introduces the second-generation Transformer Engine, which doubles effective compute, bandwidth, and model size by using 4-bit precision instead of 8-bit, while fifth-generation NVLink ensures seamless high-speed communication among up to 576 GPUs, delivering 1.8 TB/s of bidirectional throughput per GPU.
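
NVIDIA has not published the internals of the second-generation Transformer Engine, and Blackwell's FP4 is a floating-point format rather than the integer nibbles used below, but the storage argument is the same for any 4-bit encoding: two values fit in every byte that previously held one 8-bit value. The sketch below, with hypothetical pack_int4/unpack_int4 helpers, is only meant to illustrate that halving.

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit integers (-8..7) two per byte (hypothetical helper)."""
    assert values.size % 2 == 0
    u = (values.astype(np.int8) & 0x0F).astype(np.uint8)  # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Reverse of pack_int4: split each byte into two sign-extended nibbles."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)   # sign-extend 4-bit values
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

weights = np.random.randint(-8, 8, size=1 << 20, dtype=np.int8)  # 1 MiB at 8-bit
packed = pack_int4(weights)
assert np.array_equal(unpack_int4(packed), weights)              # lossless round trip
print(weights.nbytes, "->", packed.nbytes)   # 1048576 -> 524288 bytes: half the storage
```

Halving the bytes per value is what doubles the number of parameters that fit in a given memory capacity, and the number of values that move per unit of bandwidth.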

 

NVLink Switch Chip

 

NVIDIA also unveiled the GB200 NVL72, a multi-node, liquid-cooled, rack-scale system built around the GB200 and designed for the most compute-intensive workloads. It combines 36 Grace Blackwell Superchips, comprising 72 Blackwell GPUs and 36 Grace CPUs, all interconnected via fifth-generation NVLink.

 

The new NVLink switch chip provides 1.8 TB/s of full bidirectional bandwidth per GPU and supports NVLink domains of up to 576 GPUs. Manufactured on the same TSMC 4NP node, the chip packs 50 billion transistors. It also supports SHARP v4 in-network computing at 3.6 teraflops, helping larger models be processed efficiently.

 

For multi-node communication, the previous generation topped out at 100 GB/s of HDR InfiniBand bandwidth, so this marks a significant leap: compared with the H100's multi-node interconnect, the new NVSwitch is 18 times faster. This should significantly improve scalability for larger, trillion-parameter AI models.

 

Relatedly, each Blackwell GPU carries 18 fifth-generation NVLink connections, the same link count as the H100, but each link now provides 100 GB/s of bidirectional bandwidth (50 GB/s in each direction), doubling per-GPU NVLink bandwidth to 1.8 TB/s.
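
The per-GPU figure is straightforward arithmetic on the quoted link count and link speed. The sketch below reproduces it using only numbers from this article plus the H100's published 900 GB/s NVLink total; again, this is illustrative only.

```python
# NVLink arithmetic behind the figures above (illustrative only).

links_per_gpu = 18
link_bw_gb_s = 100                 # bidirectional bandwidth per NVLink 5 link

gpu_bw_gb_s = links_per_gpu * link_bw_gb_s
print(gpu_bw_gb_s)                 # 1800 GB/s = 1.8 TB/s per GPU

h100_nvlink_gb_s = 900             # H100: 18 links x 50 GB/s
print(gpu_bw_gb_s / h100_nvlink_gb_s)   # 2.0x the H100's NVLink bandwidth

hdr_ib_gb_s = 100                  # multi-node InfiniBand baseline cited above
print(gpu_bw_gb_s / hdr_ib_gb_s)   # 18x, the NVSwitch speed-up quoted earlier
```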

 

Additionally, the GB200 NVL72 includes NVIDIA BlueField-3 data processing units, enabling cloud network acceleration, composable storage, zero-trust security, and GPU compute elasticity in hyperscale AI clouds. Compared with the same number of NVIDIA H100 Tensor Core GPUs, the GB200 NVL72 can deliver up to a 30x performance improvement for LLM inference workloads while reducing cost and energy consumption by up to 25 times.

 

Blackwell

 

On the road to advancing AI technology, the Blackwell B200 provides crucial support for future data centers, leveraging breakthrough technology and high efficiency to enable faster and more efficient operations.

 

NADDOD has always been committed to providing leading optical module technology to meet data centers' demands for higher bandwidth and more reliable connections. Our products integrate seamlessly with NVIDIA's InfiniBand Quantum series, bringing added value and convenience to data center operators.

 

NADDOD 800G-2

 

Whether it is the Blackwell B200 or NADDOD's optical transmission hardware, the goal is the same: to provide data centers with advanced, reliable technology that keeps your business ahead in the ever-evolving AI era. Choose NADDOD to accompany the Blackwell B200 and bring superior performance and innovation to your data center.