The Path to Accelerating AI Computing Power

NADDOD Mark News Writer Jun 19, 2023

Trends in AI Computing Power Development

Artificial Intelligence Theory: Deep Learning

The development of artificial intelligence has not been smooth sailing. From its early stages to the current deep learning phase, the three fundamental elements of data, algorithms, and computing power have been the driving force behind the advancement of AI, enabling it to reach higher levels of perception and cognition.

AI Development

Key Figures in the Third Wave of AI

As previously mentioned, the current boom in artificial intelligence owes much to the joint development of data, algorithms, and computing power. At the algorithm level, the contributions of the “big three” of deep learning - Geoffrey Hinton, Yann LeCun, and Yoshua Bengio - are widely recognized and celebrated in the field of AI. They have reshaped AI around neural networks.

At the data level, Fei-Fei Li’s creation of ImageNet in 2007, which grew into the world’s largest image recognition database, helped people realize the importance of data to deep learning. The ImageNet Challenge inspired classic deep learning architectures such as AlexNet, VGGNet, GoogLeNet, and ResNet.

One major reason for the previous lulls in AI development was the difficulty of supporting complex algorithms with computing power, while simple algorithms were ineffective. NVIDIA, founded by Jensen Huang, helped to ease the training bottleneck of deep learning algorithms with its GPUs, unleashing the full potential of artificial intelligence.

Computing Power is Productivity

In the age of intelligence, computing power is productivity. Productivity refers to the ability of humans to transform nature and create value. In this context, we have an interesting observation.

Ten years ago, most of the world’s highest-valued companies were energy and financial companies, with only Microsoft as the front-runner among IT companies. At that time, Windows and Office dominated the personal computer era.

Today, almost all of the world’s most valuable companies are information technology and services companies. Interestingly, the top companies are also the world’s biggest purchasers of servers, with Amazon alone buying 13% of the world’s cloud servers in 2017. Massive computing power is creating value for these companies.

The same applies to countries. Computing power in the age of intelligence is like electricity in the age of electricity, and both are important forms of productivity.

Therefore, we can gauge a country’s economic development from the state of its computing power, just as the KLEIN index uses electricity consumption to measure the development of an industry. Statistics show a clear linear correlation between a country’s GDP and both the number of servers shipped and the amount spent on server procurement.

The US and China not only lead in GDP but also have a significantly higher number of servers per trillion GDP than Japan and Germany. The contribution of the digital economy is also significantly higher.

Faced with the exponential growth in computing demand, computing technology, products, and industries are facing new challenges. Specifically, there are three challenges: diversification, massive scale, and ecosystem.

(1) The first challenge is diversification.
The most critical task of computing is to support business, and different types of business require different computing systems. For example, scientific computing such as seismic-wave simulation requires high numerical accuracy, up to 64-bit floating point. AI training, by contrast, can use 16-bit floating-point types, which have a large numerical range but lower precision. AI inference pushes precision even lower in exchange for speed and energy efficiency, processing with 4-bit, 2-bit, or even 1-bit integer types.

In other words, AI applications introduce new computing types, spanning from inference to training, with a wider range. At the same time, data volume is increasing from GB to TB or PB, and the types are becoming more complex and diverse, from structured to semi-structured and unstructured.

Different numerical precision computing types have different requirements for computing chip instruction sets and architectures. This means that the general-purpose CPU chips we have been using are no longer able to meet the requirements of such diversified computing scenarios. This is also an important reason why there are more and more types of computing chips.
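
To make the range-versus-precision trade-off concrete, the short sketch below (using PyTorch purely as an illustration; any numerics library exposes the same information) prints the representable range and machine epsilon of a few common data types. The 4-, 2-, and 1-bit integer formats mentioned above are noted only in a comment, since they are normally packed and executed by dedicated inference runtimes rather than exposed as ordinary tensor dtypes.

    import torch

    # Floating-point types: larger formats trade storage for range and precision.
    for dtype in (torch.float64, torch.float32, torch.float16, torch.bfloat16):
        info = torch.finfo(dtype)
        print(f"{str(dtype):16s} max={info.max:.3e}  eps={info.eps:.3e}")

    # Integer types commonly used for inference quantization.
    for dtype in (torch.int32, torch.int8):
        info = torch.iinfo(dtype)
        print(f"{str(dtype):16s} range=[{info.min}, {info.max}]")

    # 4-, 2-, and 1-bit formats are usually packed into bytes and handled by
    # specialized inference kernels, not by a standard tensor dtype.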

(2) The second challenge is massive scale.
This is first reflected in the fact that massive models have a large number of parameters and require large amounts of training data.

Taking natural language processing as an example, after the rise of pre-trained models based on self-supervised learning, model accuracy has significantly improved with an increase in model size and training data.

In 2020, the parameter count of the GPT-3 model exceeded 100 billion for the first time, reaching 175 billion. If the current trend continues, by 2023 the parameter count of the largest models is expected to exceed 100 trillion, roughly on the order of the number of synapses in the human brain, which is about 125 trillion.

Massive models require massive memory. Currently, the on-board high-speed memory capacity of a single GPU is 40 GB. Simply spreading the parameters of such a model evenly across GPU memory already requires about 10,000 GPUs. Counting the additional state needed during training (gradients, optimizer state, and activations), at least 20,000 GPUs are needed just to start training. The memory capacity of current AI chips alone cannot meet the parameter storage requirements of massive models.
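
A back-of-the-envelope calculation shows where these GPU counts come from (a sketch, assuming FP32 master weights at 4 bytes per parameter and 40 GB of usable memory per GPU; real deployments differ):

    params = 100e12            # ~100 trillion parameters
    bytes_per_param = 4        # FP32 master copy of the weights
    gpu_mem_bytes = 40e9       # 40 GB of on-board GPU memory

    total_bytes = params * bytes_per_param        # 4e14 bytes = 400 TB of weights
    gpus_for_weights = total_bytes / gpu_mem_bytes
    print(gpus_for_weights)                       # 10000.0

    # Gradients, optimizer state, and activations at least double the training
    # footprint, which is where the estimate of 20,000+ GPUs comes from.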

In addition, massive models depend on being fed massive amounts of data. Current AI algorithms essentially rely on quantitative accumulation: better results come from ever more data rather than from a qualitative leap in the algorithms themselves. For example, the latest massive models require training corpora on the order of trillions of words. Massive data in turn requires massive storage, and providing high-performance reads to tens of thousands of AI chips in a super-large-scale cluster is a huge challenge for the storage system.

The second manifestation of massive scale is that computing power demand is growing exponentially.

Since the rise of deep learning in 2011, the demand for computing power has grown exponentially, doubling roughly every 3.4 months. A petaflop/s-day (PD) is the number of floating-point operations performed in one day at a sustained rate of one petaflop per second, roughly 10^20 operations. Training a massive model consumes a huge amount of computing power; a single training run is estimated to consume tens of thousands of PDs.

As a result, computing power is becoming a bottleneck for the development of AI, and there is an urgent need for more computing power to support the development of AI.

(3) The third challenge is the ecosystem.
Computing power is not just about hardware, it also includes software, algorithms, and applications. The development of computing power requires the integration and optimization of the entire ecosystem.

For example, the development of AI requires not only computing power but also data and algorithms, and progress in data and algorithms in turn demands large amounts of computing power for training and inference. Each layer of the stack therefore has to be developed and optimized together with the others.

The development of the ecosystem also requires the participation of industry, academia, and government. The industry needs to invest in research and development to create new technologies and products. Academia needs to conduct fundamental research to explore new theories and algorithms. The government needs to provide funding and policy support to promote the healthy development of the industry.

In conclusion, computing power is becoming an increasingly important form of productivity in the age of intelligence. The development of computing power faces the challenges of diversification, massive scale, and ecosystem. To address these challenges, industry, academia, and government need to work together to create a healthy and sustainable ecosystem for the development of computing power.

Introduction to AI Acceleration Technologies

AI Architecture


AI Structure

Typically, users encounter AI architecture in the form of resource requests such as XX CPU cores, XX GPU cards, and XX GB of memory, which correspond to the computing, storage, and network resources of the AI architecture. The actual AI architecture includes computing nodes, management nodes, storage nodes, computing networks, management networks, and clients.

How do we plan for computing resources? The principle is to meet the demand with the lowest cost while considering scalability. For example, if there are two or more business types with different computing features and both are of considerable scale, there should be two or more corresponding computing node types. If the maximum demand is much larger than other demands, the number of computing node types can be reduced to facilitate future expansion.

AI Acceleration Technologies

AI has a huge demand for computing, and how to accelerate it is directly related to production efficiency and cost. Here are some of the latest AI acceleration technologies.

Computing

(1) Heterogeneous Computing
Before GPUs were used for AI computing, CPUs were responsible for computing tasks. However, with the rapid increase in AI computing demand, the computing efficiency of CPUs cannot meet the demand, resulting in a “CPU + GPU” heterogeneous computing architecture, as shown in the upper right corner of the figure below.

As shown in the lower right corner of the figure below, the computing efficiency of GPUs is several to tens of times that of CPUs. Why is the difference so large? The main reason is the huge difference between CPU and GPU architectures. As shown in the lower-left corner of the figure below, a GPU contains far more computing units than a CPU, so GPUs are better suited to large-scale parallel computing.
In contrast, the control and cache units in a CPU are much larger than those in a GPU, so CPUs are better suited to complex workloads that cannot be highly parallelized (such as branch-heavy code full of if statements).

CPU VS GPU
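
The gap is easy to reproduce on a large matrix multiplication, the core primitive of deep learning. The timing sketch below assumes a CUDA-capable GPU; absolute numbers depend entirely on the hardware, and the first GPU call includes one-time CUDA initialization, so a second run gives a fairer comparison:

    import time
    import torch

    n = 4096
    a = torch.randn(n, n)
    b = torch.randn(n, n)

    t0 = time.time()
    c_cpu = a @ b                      # GEMM on the CPU cores
    cpu_s = time.time() - t0

    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()
    t0 = time.time()
    c_gpu = a_gpu @ b_gpu              # the same GEMM on thousands of GPU ALUs
    torch.cuda.synchronize()           # wait for the asynchronous kernel to finish
    gpu_s = time.time() - t0

    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s  speedup: {cpu_s / gpu_s:.1f}x")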

(2) NVLINK Communication
As the scale of AI computing increases, such as in large-scale AI training, multiple cards or even multiple nodes need to participate in a task’s calculations simultaneously. One key point is how to support high-speed communication between GPUs within nodes so that they can collaborate as a huge accelerator.

Although PCIe is a ubiquitous standard, its bandwidth is limited, as shown in the upper left corner of the figure below. The theoretical bandwidth of PCIe Gen3 is 32 GB/s and of PCIe Gen4 is 64 GB/s, but the measured bandwidths are approximately 24 GB/s and 48 GB/s, respectively.

In AI training, every round of calculation requires a synchronous update of the parameters, i.e., the weight coefficients. The larger the model scale, the larger the parameter size will generally be. Therefore, the communication (P2P) capability between GPUs has a significant impact on computing efficiency, as shown in the upper right corner of the figure below. For example, for eight V100 cards, the NVLINK2.0 architecture can improve performance by 26% compared to the PCIe architecture, while the NVLINK2.0 Next architecture (fully interconnected, with a P2P communication bandwidth of 300GB/s between any two cards) can improve performance by 67% compared to the PCIe architecture.

NVLINK is a high-speed GPU interconnect technology developed by NVIDIA, now in its third generation (NVLINK3.0), as shown in the lower part of the figure below. From NVLINK1.0 (P100) to NVLINK2.0 (V100) and then NVLINK3.0 (A100), the bandwidth has increased from 160 GB/s to 300 GB/s and then to 600 GB/s. With NVLINK1.0 and 2.0, P2P communication is not fully interconnected: the actual bandwidth between some GPU pairs falls short of the maximum, and some pairs even have to communicate over PCIe. This creates tiers in intra-node GPU P2P communication performance.

NVLINK3.0 achieves full P2P interconnected communication, with a communication bandwidth of 600GB/s between any two cards, greatly improving the computing efficiency of multi-card calculations within nodes.

NVLINK3.0
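
Whether two GPUs in a node can communicate peer-to-peer (over NVLink or PCIe P2P) rather than bouncing through host memory can be checked from software. Below is a minimal sketch with PyTorch, assuming a node with at least two GPUs; the actual link type between each GPU pair is reported by nvidia-smi topo -m:

    import torch

    assert torch.cuda.device_count() >= 2, "needs at least two GPUs"

    # True if device 0 can read/write device 1's memory directly (NVLink or PCIe P2P).
    print(torch.cuda.can_device_access_peer(0, 1))

    x = torch.randn(1 << 28, device="cuda:0")   # ~1 GB of FP32 data
    torch.cuda.synchronize()
    y = x.to("cuda:1")                          # device-to-device copy; rides NVLink when available
    torch.cuda.synchronize()

    # Running nvidia-smi topo -m on the host shows which GPU pairs are joined by NVLink.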

(3) Tensor Core
The Tensor Cores of the V100 are programmable matrix multiply-and-accumulate units that can deliver up to 125 Tensor TFLOPS for training and inference applications. The V100 contains 640 Tensor Cores. Each Tensor Core operates on a 4x4x4 matrix processing array, performing the operation D = A×B + C, where A, B, C, and D are 4x4 matrices, as shown in the upper part of the figure below. The matrix multiplication inputs A and B are FP16 matrices, while the accumulation matrices C and D can be FP16 or FP32 matrices.

Each Tensor Core can perform 64 floating-point fused multiply-add (FMA) operations per clock cycle, providing up to 125 TFLOPS for training and inference applications. This means developers can use mixed precision (FP16 computation with FP32 accumulation) for deep learning training, achieving performance three times that of the previous generation while still converging to the expected network accuracy.

The GEMM performance provided by the Tensor Cores is several times that of previous hardware, as shown in the lower right corner of the figure below, comparing the performance of GP100 (Pascal) and GV100 (Volta) hardware.

Tensor Core
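
In frameworks such as PyTorch, Tensor Cores are engaged simply by issuing matrix multiplications with FP16 inputs; the library dispatches the GEMM to Tensor Core kernels on Volta-class or newer GPUs. A minimal sketch follows (the matrix size is arbitrary, and the internal accumulation precision is controlled by the library):

    import torch

    # FP16 inputs cause cuBLAS to run this GEMM on Tensor Cores (V100/A100-class GPUs).
    a = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
    b = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)

    c = a @ b                     # D = A x B, the multiply-accumulate pattern described above
    torch.cuda.synchronize()
    print(c.dtype, c.shape)       # torch.float16, (8192, 8192)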

(4) Multi-Compute Power
With the development of AI, various types of chips have emerged, such as CPUs, GPUs, ASICs, and FPGAs, as shown in the upper part of the figure below. Comparing along the two dimensions of generality and performance, generality ranks CPU > GPU > FPGA > ASIC, while performance ranks in the opposite order. Different AI tasks place different requirements on chips. Training, for example, must support a wide range of frameworks, models, and algorithm libraries, which demands high generality; NVIDIA GPUs dominate here thanks to their complete ecosystem and high generality.

For inference tasks, only one or a few frameworks, models, and algorithm libraries need to be supported, and because inference sits close to the business, performance and cost matter more. ASIC chips can therefore offer a better cost-performance ratio than NVIDIA GPUs in some scenarios. As the IDC chip market statistics in the lower part of the figure below show, NVIDIA GPUs still dominate the inference market, but other chips are able to keep pace; in the training market, other chips still lag behind.

Chip Analysis
Chip market sales

(5) Low Precision
If 32-bit floating-point numbers can be compressed to 16 bits, some representation accuracy is lost, but both parameter storage space and computing throughput (the number of FPU operations per unit time) improve greatly.

This is the basic principle of mixed-precision training. The master copy of the weights is stored in FP32. For the forward and backward passes, the weights are first converted to FP16 for computation; when updating the weights, the increment (gradient multiplied by learning rate) is added to the FP32 master weights, as shown in the upper part of the figure below.

As the figure below shows, in some scenarios low precision not only improves performance; the compute it frees up can also be spent on more complex models in inference tasks, thereby improving inference accuracy.

Low Precision
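
This FP32-master-weights / FP16-compute recipe is exactly what automatic mixed precision (AMP) utilities implement. Below is a minimal training-step sketch with PyTorch’s torch.cuda.amp; the model, data, and optimizer are placeholders:

    import torch

    model = torch.nn.Linear(1024, 1024).cuda()     # weights kept in FP32 (the master copy)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()           # scales the loss so FP16 gradients do not underflow

    x = torch.randn(64, 1024, device="cuda")
    target = torch.randn(64, 1024, device="cuda")

    with torch.cuda.amp.autocast():                # eligible forward ops run in FP16
        loss = torch.nn.functional.mse_loss(model(x), target)

    scaler.scale(loss).backward()                  # backward pass with the scaled loss
    scaler.step(optimizer)                         # unscales gradients, updates the FP32 master weights
    scaler.update()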

Network

(1) GDR
GDR (GPUDirect RDMA) allows the GPU of Computer 1 to directly access the GPU memory of Computer 2, as shown in the upper part of the figure below. Before looking at GDR itself, it helps to first understand DMA and RDMA.

DMA (Direct Memory Access) is an important technology for offloading work from the CPU. With DMA, data exchange between device memory and system memory, which used to require CPU participation, is handed over to a DMA controller, so the I/O transfer is carried out entirely in hardware.

RDMA can be simply understood as using suitable hardware and network technology so that the network card of server 1 can read and write the memory of server 2 directly, achieving high bandwidth, low latency, and minimal CPU overhead.

Currently, RDMA is implemented over two kinds of transport networks: InfiniBand and Ethernet. On Ethernet, depending on how the protocol stack is layered over Ethernet, it is further divided into iWARP and RoCE (RoCEv1 and RoCEv2).

With GPUDirect RDMA, the GPU memory of one computer can be accessed directly across the network. Before this technology existed, the GPU first had to copy data from GPU memory to system memory, and only then could RDMA transfer it to Computer 2, whose GPU in turn had to copy the data from system memory back into GPU memory.

The GPUDirect RDMA technology further reduces the number of data copies for GPU communication and reduces communication latency.

GPUDirect RDMA
GPU-CPU
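
From the application’s point of view GDR is transparent: a communication library such as NCCL selects the GPU-direct path automatically when the NIC, driver, and topology allow it. Below is a hedged sketch of a cross-node all-reduce with torch.distributed over NCCL; launch details depend on the cluster, and setting NCCL_DEBUG=INFO in the environment makes NCCL report which transport it actually chose:

    import os
    import torch
    import torch.distributed as dist

    # Typically launched with torchrun, which sets RANK/WORLD_SIZE/MASTER_ADDR.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    grads = torch.ones(1 << 24, device="cuda")   # a gradient-sized buffer
    dist.all_reduce(grads)                       # NCCL uses GPUDirect RDMA across nodes when supported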

(2) SHARP
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a type of network communication offloading technology.

In AI training, collective communication operations that span all participating nodes are frequent, and they can have a huge impact on the parallel efficiency of the application.

To address this issue, NVIDIA Mellanox introduced SHARP technology starting with its EDR InfiniBand switches, integrating computing engine units into the switch chip. These engines support 16-bit, 32-bit, and 64-bit fixed-point or floating-point arithmetic, reduction operators such as sum, min, max, AND, OR, and XOR, and collective operations such as Barrier, Reduce, and All-Reduce.

In a cluster environment composed of multiple switches, Mellanox has defined a complete set of scalable hierarchical aggregation and reduction protocol (SHARP) offloading mechanisms. An aggregation manager constructs a logical SHARP tree in the physical topology, and multiple switches in the SHARP tree process collective communications operations in a parallel and distributed manner.

When the host needs to perform global communication, such as all reduce, all hosts submit communication data to the switches they are connected to. When the first-level switch receives the data, it uses the built-in engine to perform calculations and processing on the data and then submits the result data to the upper-level switch in the SHARP tree. The upper-level switch also uses its own engine to aggregate the result data received from several switches and continues to submit it to the upper level of the SHARP tree.

After the data reaches the root switch of the SHARP tree, the root performs the final calculation and sends the result back to all host nodes. SHARP significantly reduces the latency of collective communication, reduces network congestion, and improves the scalability of the cluster (as shown in the upper part of the figure below). Its effect is even more pronounced for complex models and multi-tier networks, as shown in the lower part of the figure below: as the cluster grows, latency stays almost flat with SHARP enabled, whereas without SHARP it increases linearly, and the performance gap widens accordingly.

SHARP
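
The effect of the SHARP tree can be illustrated with a toy software reduction: each “switch” sums the partial results of its children before forwarding a single value upward, so the traffic crossing each level stays constant no matter how many hosts participate. This is purely illustrative; in a real deployment the reduction runs inside the switch ASICs and is enabled through the InfiniBand software stack, not in Python:

    # Toy model of a two-level SHARP tree: 4 leaf switches, 4 hosts per switch.
    host_values = [[float(h + 10 * s) for h in range(4)] for s in range(4)]

    # Level 1: each leaf switch reduces the contributions of its own hosts.
    leaf_partials = [sum(hosts) for hosts in host_values]

    # Level 2: the root switch reduces the per-switch partial results.
    global_sum = sum(leaf_partials)

    # The root then broadcasts the result back down the tree to every host.
    print(global_sum == sum(sum(hosts) for hosts in host_values))   # True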

(3) IB (INFINIBAND)
InfiniBand Architecture is a software-defined network architecture designed for large-scale data centers. Its design aims to achieve the most efficient data center interconnect infrastructure. InfiniBand natively supports network technologies such as SDN, Overlay, and virtualization, and is an open standard high-bandwidth, low-latency, and highly reliable network interconnect. Compared to the RoCE network, IB has many advantages, as shown in the upper part of the figure below.

Of course, there has been intense debate in recent cluster upgrade plans about whether to use IB or RoCE for AI training networks. NVIDIA is a major proponent of IB; besides listing various functional advantages, it points out that most AI clusters deployed by internet companies such as Alibaba, Baidu, JD.com, and Tencent in the past two years have adopted IB networks, but it cannot provide very convincing quantitative data. Alibaba, for its part, has a dedicated RoCE network optimization team and has achieved performance similar to IB. At the same time, the benchmark advantages NVIDIA lists, such as SHARP, deliver only about a 3%-5% improvement for actual users (though the benefit is now estimated to be larger for large models and fabrics with three or more switch tiers).

Overall, the current conclusion is that IB is superior to RoCE. With IB, the ecosystem work (NCCL, CUDA, and so on) is already optimized for users, leaving them very little tuning to do; with RoCE, a dedicated team and deep, accumulated optimization experience are required. IB is therefore the more suitable choice at present: although it costs more, the performance gain is larger, as shown in the lower part of the figure below.

Of course, in a cloud context IB means running a second network architecture alongside Ethernet, which increases the complexity of overall operation and management. The IB vs. RoCE debate therefore deserves further analysis, with more quantitative data and more fundamental measurements, to reach a deeper understanding of the network.

IB&RoCE

(4) Multiple NICs
As mentioned earlier, NVLINK3.0 provides 600 GB/s of communication bandwidth, and the measured bandwidth of PCIe 4.0 reaches 48 GB/s, yet the compute network commonly used today tops out at 100 Gb/s (12.5 GB/s). For large-scale training jobs that span multiple nodes, inter-node parameter communication therefore becomes a bottleneck. In this case a multiple-NIC strategy is needed: two nodes are no longer connected by a single network cable but by several. As shown in the figure below, multiple NICs can significantly improve performance. Since the network generally accounts for about 10% of the total cost of the computing system, a performance improvement of more than 10% is cost-effective for the system as a whole.

Bert Large Pre-Train Phase 2
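
With NCCL-based training, spreading traffic across several NICs is largely a matter of exposing all of the HCAs to the library. Below is a hedged example of the relevant environment variables; the device names (mlx5_0 and so on) are illustrative and should be checked with ibstat on the actual host:

    import os

    # Expose all four HCAs so each node can drive its four links in parallel.
    os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1,mlx5_2,mlx5_3"

    # Optional: have NCCL log its choices so you can confirm all NICs are used.
    os.environ["NCCL_DEBUG"] = "INFO"

    # ...then initialize torch.distributed with the NCCL backend as usual.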

Storage

(1) GDS
GDS (GPUDirect Storage) is another GPUDirect technology introduced by NVIDIA. Although GPU computation is fast, as data sets and models grow, the time spent loading data into the application also grows, which hurts end-to-end performance: slow I/O leaves ever-faster GPUs idle.

The standard path for transferring data from an NVMe disk to GPU memory goes through a bounce buffer in system memory, which means an extra data copy. GPUDirect Storage avoids the bounce buffer, eliminating that extra copy, and uses a Direct Memory Access (DMA) engine to move data from local or remote storage directly into GPU memory.

For example, a direct data path is established between NVMe or NVMe-over-Fabrics storage and GPU memory, which effectively relieves the CPU I/O bottleneck and increases I/O bandwidth and the volume of data that can be transferred.

NVIDIA’s development of GPUDirect Storage has greatly improved the speed of loading large data sets onto GPUs; its core function is to deliver data into GPU memory by direct memory access.
Of course, the implementation of GDS is still limited at present. Firstly, the file system has to be adapted, and only file systems certified by NVIDIA are supported, which limits adoption of the technology. Secondly, GDS is mainly an intra-node technology, and NVMe mostly serves the intermediate case where memory capacity is insufficient and unified-storage bandwidth is low, so the applicable scenarios are relatively narrow and industry enthusiasm for adaptation is not high. Nevertheless, GDS provides another acceleration option for AI architecture.

Transfer Methods
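
What GDS removes is the host-memory staging step. The sketch below shows only the conventional bounce-buffer path for comparison (read into pinned host memory, then copy to the GPU); the direct GDS path is not shown because it requires a GDS-certified file system and the cuFile driver stack. The file path here is hypothetical:

    import numpy as np
    import torch

    # Conventional path: storage -> host bounce buffer -> GPU memory.
    data = np.fromfile("/path/to/training_shard.bin", dtype=np.float32)  # hypothetical file
    host_buf = torch.from_numpy(data).pin_memory()     # pinned host memory speeds up the H2D copy
    gpu_buf = host_buf.to("cuda", non_blocking=True)   # the extra copy that GDS eliminates
    torch.cuda.synchronize()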

(2) Burst Buffer
Burst buffer technology uses the local SSDs of the computing nodes to form a temporary high-speed cache file system. It can improve application reliability through faster checkpoint/restart, accelerate the I/O performance of small-block transfers and analysis, provide fast scratch space for out-of-core applications, and create a temporary staging area for the large numbers of input files that need fast persistent storage during a computation.

Burst buffer technology has been widely used in HPC architecture, such as several of the top 10 supercomputing clusters in the world’s HPC TOP500 list. In AI architecture, some users are also trying to use similar technologies to provide ultra-large high-speed caches for large-scale training.

● Cost-effective: Using SSD hard drives on computing nodes can improve IO performance without increasing hardware costs.

● Fast and Convenient: A parallel file system instance can be created/deleted within 30 seconds with just one command.

● Independent Operation: Burst Buffer instances run completely independently of other global file systems and can be customized to integrate with mainstream job scheduling systems such as PBS, LSF, SLURM, Torque, etc.

By creating a temporary staging area for the large numbers of input files needed during a computation, burst buffers solve the problem of massive bursts of data operations.

Parallel Technology

Parallel technology is a critical aspect of large-scale AI training. Deploying deep learning models across multiple computing devices is one way to train large, complex models. As demand for training speed and frequency continues to increase, the importance of this method is also growing.

Data parallelism (DP) is the most widely used parallel strategy, but when a single GPU’s memory cannot hold the entire model, the model must be split into N parts and loaded onto N different GPUs. Depending on how the model is split, model parallelism is divided into tensor-slicing model parallelism (intra-layer) and pipeline model parallelism (inter-layer).

Training models such as GPT-3 requires combining multiple parallel methods, as frameworks like DeepSpeed do, to accommodate the entire model.

For the GPT-3 model, its computational and I/O demands are significant, requiring the integration of various acceleration technologies mentioned earlier, such as NVLINK, Tensor Core, IB, multiple NICs, GDR, parallel methods, etc., to efficiently complete large-scale model training.
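
Data parallelism, the most common of these strategies, is close to a one-line wrapper in most frameworks, while tensor and pipeline parallelism require model-aware splitting and are usually taken from frameworks such as DeepSpeed. Below is a minimal data-parallel sketch with PyTorch DistributedDataParallel, assuming a torchrun launch; the model and data are placeholders:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced across GPUs every step

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).square().mean()
    loss.backward()                               # DDP overlaps the all-reduce with backpropagation
    optimizer.step()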

Summary

Various AI acceleration technologies aim to improve two aspects: computation and I/O. Heterogeneous computing improves computational capability, while NVLINK, IB, GDR, GDS, Burst Buffer, multiple NICs, and so on improve I/O bandwidth and reduce latency.

Since there is a step change in IO bandwidth from GPU cache (7TB/s) to GPU memory (1.6TB/s), CPU memory (90GB/s), high-speed cache (24GB/s), NVME hard disk (6GB/s), distributed storage (5GB/s, can scale up to several tens or hundreds of GB/s), and cold storage (2GB/s), the direction of IO acceleration in AI architecture is gradually bridging the gap between these steps. Of course, algorithms also need to leverage the characteristics of the architecture as much as possible to maximize the use of the fastest IO architecture.

IO architecture

Analysis of GPT-3 Model Pre-training Computing Architecture

In this section, we will analyze the computing architecture of the GPT-3 model pre-training.

Analysis of GPT-3 Model Computing Characteristics

When designing an AI architecture solution, the first step is to understand the computing characteristics of the GPT-3 model, that is, what kinds of computation and I/O the GPT-3 pre-training workload demands at the extreme.

Generally, this is done through a combination of theoretical analysis and practical testing. The analysis shows that GPT-3 requires close to 100 GB/s of I/O, which calls for a 4×HDR 200 Gb/s network (4 × 200 Gb/s = 800 Gb/s ≈ 100 GB/s), i.e., 4 NICs per node; in this case an InfiniBand network is used.

Next is the computational demand, evaluated against the 312 TFLOPS peak of an A100: the computational demand of GPT-2 is about 10 PetaFlop/s-days, equivalent to training for about one day on 64 A100 GPUs; the computational demand of GPT-3 is about 3640 PetaFlop/s-days, equivalent to training for about one year on 64 A100 GPUs.
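
The equivalence between PetaFlop/s-days and GPU-days follows from simple arithmetic (a sketch, assuming the 312 TFLOPS A100 peak quoted above and roughly 50% sustained utilization; real efficiency varies with the model and the cluster):

    a100_peak_pflops = 0.312        # PFLOP/s per A100 (peak)
    utilization = 0.5               # assumed sustained fraction of peak
    gpus = 64

    sustained_pflops = gpus * a100_peak_pflops * utilization   # ~10 PFLOP/s for the cluster

    gpt2_pd = 10                    # PetaFlop/s-days
    gpt3_pd = 3640

    print(gpt2_pd / sustained_pflops)   # ~1 day for GPT-2
    print(gpt3_pd / sustained_pflops)   # ~365 days, i.e. about a year, for GPT-3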

Analysis of GPT-3 Model Pre-training Computing Architecture

As analyzed in the previous section, the computational part of the AI architecture is based on the latest A100 GPUs, and the I/O part uses a 4×HDR200 IB network. NVLINK is used to achieve high-speed interconnection at 600GB/s between GPUs.

NVLINK A100 Server Topology


Here is the corresponding network topology:

Large Model Training Platform Architecture (140 Nodes)

Conclusion

AI computing power is one of the three fundamental elements of artificial intelligence. AI acceleration technology is developing rapidly around computation and I/O, continuously improving the efficiency of AI workloads, so it is important to keep deepening our understanding of AI architecture.

Of course, in addition to configuring the corresponding hardware architecture for AI acceleration, it also requires collaboration among platform, framework, algorithm, and other related technical personnel to maximize the use of the latest AI architecture.