What is data parallelism?
Data parallelism is a method of computation that processes multiple pieces of data at the same time. While traditional serial computing handles one piece of data at a time, data parallelism processes many pieces simultaneously, dramatically increasing computation speed. The key is to decompose the problem into sub-problems that can be processed in parallel, assign those sub-problems to different processing units, and finally combine the results from each unit.
Why do we need data parallelism?
In recent years of deep learning model training, the trend toward more training data and larger models has not changed. Larger models and datasets mean greater compute and storage requirements, and therefore longer training times. How to distribute that compute and storage across multiple training devices to improve training speed thus becomes the key problem, and data parallelism is a parallel strategy that addresses it.
In deep learning model training, data parallelism can be used as a method to increase the training throughput (global batch size per second) by adding parallel training devices.
If you were to use the 4090 GPU, the single-card FP16 computational power is similar to the A100 (330 vs 312 Tflops), but it has half the memory bandwidth (1 vs 2 TB/s) and significantly lower memory capacity (24 vs 80 GB). The TF32 computational power required for gradient calculations is also half that of the A100 (83 vs 156 Tflops). Taking these factors into account, the training speed of a single 4090 card is slightly lower than that of the A100.
Let's assume we use 2,048 4090 GPUs for training. The biggest challenge in this setup is the communication between these 2,048 4090 GPUs.
Why is that? There are several parallelization methods, including tensor parallelism, pipeline parallelism, and data parallelism. These methods divide the GPUs along different dimensions of the model: within model layers, between model layers, and across training data. The product of these three parallel degrees determines the total number of GPUs required for the training task. Among these methods, data parallelism is the most commonly used approach.
The whole process of data parallelization is as follows:
- Split each data batch into n parts, where n is the number of GPUs
- Keep a complete copy of the model on each GPU
- Each GPU computes gradients on its assigned data shard
- The GPUs communicate to aggregate the gradients (AllReduce), usually summing or averaging across all cards
- Each GPU applies the same aggregated gradient to its model copy, so all replicas remain identical for the next batch
In short, each GPU consumes different input data and computes its own gradients (the changes to apply to the model parameters); the gradients are then aggregated, typically by averaging, and the result is broadcast back to every GPU.
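Under illustrative assumptions (a toy one-parameter linear model, NumPy arrays standing in for GPUs, and explicit averaging standing in for an AllReduce call), one training step of this scheme can be sketched as:

```python
import numpy as np

# Toy data-parallel SGD step: y = w * x with squared error loss.
# "n_gpus", the shard loop, and the explicit mean are illustrative
# stand-ins for real devices and an AllReduce collective.
rng = np.random.default_rng(0)
n_gpus = 4
w_true = 3.0
x = rng.normal(size=32)
y = w_true * x

# Every "GPU" holds an identical copy of the model parameter.
w = np.full(n_gpus, 0.0)

# Step 1: split the global batch into n equal shards.
x_shards = np.split(x, n_gpus)
y_shards = np.split(y, n_gpus)

# Step 2: each replica computes a gradient on its own shard only.
grads = np.array([
    np.mean(2 * (w[i] * xs - ys) * xs)
    for i, (xs, ys) in enumerate(zip(x_shards, y_shards))
])

# Step 3: "AllReduce" -- average the shard gradients so every
# replica sees the same global gradient.
global_grad = grads.mean()

# Step 4: the identical update on every replica keeps the copies in sync.
lr = 0.1
w = w - lr * global_grad

assert np.allclose(w, w[0])  # all replicas still hold the same parameter
```

Because every replica starts identical and applies the identical averaged gradient, the model copies never diverge, which is exactly what makes data parallelism correct.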
However, using data parallelism alone is not sufficient because a single GPU cannot accommodate the entire LLaMA 70B model.
Determining how much GPU memory model training needs is not straightforward, and few people can calculate it accurately. Some even mistakenly believe that storing the model parameters and the backward-propagation gradients is sufficient. In reality, training memory comprises the model parameters, the gradients from backward propagation, the optimizer state, and the intermediate states (activations) from forward propagation.
The memory used by the optimizer is relatively straightforward. If you use the classic Adam optimizer, it needs to perform calculations using 32-bit floating-point precision. Otherwise, relying solely on 16-bit floating-point precision would result in significant errors, making it difficult for the model to converge. Therefore, each parameter needs to be stored in a 4-byte 32-bit version (16-bit version during forward propagation and 32-bit version during optimization, known as mixed-precision). Additionally, you need to store 4 bytes for momentum and 4 bytes for variance, totaling 12 bytes. If you use an optimizer like SGD, you can omit the variance and only require 8 bytes.
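Putting the byte counts above together gives a quick sanity check of why a 70B-parameter model cannot fit on one 80 GB card; a back-of-the-envelope calculation (activations excluded, mixed-precision Adam as described above):

```python
# Per-parameter memory for mixed-precision Adam training,
# following the byte counts in the text (activations not included).
PARAMS = 70e9  # LLaMA 70B

bytes_per_param = (
    2    # FP16 weights used in forward/backward
    + 2  # FP16 gradients
    + 4  # FP32 master weights (the mixed-precision 32-bit copy)
    + 4  # FP32 momentum
    + 4  # FP32 variance
)  # = 16 bytes with Adam; SGD with momentum drops the variance -> 12 total

total_gb = PARAMS * bytes_per_param / 1e9
print(f"{total_gb:.0f} GB")  # ~1120 GB, far beyond a single 80 GB A100
```

Even before counting activations, the static state alone is roughly fourteen 80 GB cards' worth of memory, which is why data parallelism alone cannot train such a model.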
The intermediate states (activations) from forward propagation are needed to compute gradients during backward propagation, and their size is proportional to the batch size. A larger batch size lets more computation be done per memory access of the model parameters, reducing pressure on GPU memory bandwidth; the trade-off is that the activations grow with the batch size, so GPU memory capacity becomes the bottleneck.
The memory occupied by these forward-propagation activations is significant. To alleviate this, a technique that trades computation for memory (activation recomputation, also known as gradient checkpointing) can be employed: instead of storing the activations of every layer, they can be recomputed from an earlier stored point when backward propagation reaches that layer, so the activations for that layer need not be kept in memory.
If this approach were applied to every layer, only the 2-byte (FP16) gradient per parameter would need to be stored for that layer, but the cost of recomputing the intermediate states would be substantial. Hence, in practice, the Transformer is divided into groups of several layers each; only the intermediate states of the first layer in each group are saved, and subsequent layers are recomputed from that group's first layer. This balances computational and memory costs.
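A minimal sketch of this grouped save-and-recompute idea, with toy layer functions and an illustrative group size (not a real framework API):

```python
# Grouped activation recomputation (checkpointing) in miniature.
# The layers, group size, and function names are all illustrative.
layers = [lambda x, i=i: x * 2 + i for i in range(8)]  # 8 toy "layers"
GROUP = 4  # save only the input activation of every 4th layer

def forward_with_checkpoints(x):
    """Run forward, storing only the input activation of each group."""
    saved = {}
    for i, f in enumerate(layers):
        if i % GROUP == 0:
            saved[i] = x  # checkpoint at the group boundary
        x = f(x)
    return x, saved

def activation_at(i, saved):
    """Recompute layer i's input from the nearest earlier checkpoint."""
    start = (i // GROUP) * GROUP
    x = saved[start]
    for j in range(start, i):
        x = layers[j](x)
    return x

out, saved = forward_with_checkpoints(1.0)
assert len(saved) == 2  # only 2 of 8 activations were stored

# Reference forward pass that stores every activation, for comparison:
full = 1.0
acts = []
for f in layers:
    acts.append(full)
    full = f(full)
assert activation_at(5, saved) == acts[5]  # rebuilt on demand, as backward would
```

Here a quarter of the activations are kept, and any missing one is recomputed from its group's checkpoint, which is precisely the compute-for-memory trade described above.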
Of course, some may argue that if GPU memory is insufficient, it can be swapped out to CPU memory. However, considering the current PCIe speed, the cost of swapping to CPU memory is sometimes not worth it, and it may be more efficient to perform recomputations within GPU memory. If there is a high-bandwidth unified memory system like Grace Hopper, then swapping in and out becomes a viable option, offering optimization possibilities for both the intermediate states during training's forward propagation and the KV Cache.
Connectivity between Data Parallelism and RoCE
Combining data parallelism with RoCE (RDMA over Converged Ethernet) joins parallel computation with high-performance networking. Data parallelism accelerates computation by processing multiple data shards simultaneously, while RoCE uses RDMA over standard Ethernet to provide low-latency, high-bandwidth transfers between compute nodes, enabling the efficient data flow that distributed training depends on.
- Data transfer efficiency: RoCE is an Ethernet-based remote direct memory access (RDMA) technology that enables efficient data transfer over the network. By performing transfers in hardware and bypassing the operating system and protocol stack, it achieves low-latency, high-bandwidth data movement, which directly improves the efficiency of data-parallel computation.
- High-performance network communication: Data parallel computation often requires data exchange and communication between different processing units. RoCE provides high-performance network communication for fast data transmission between compute nodes. It utilizes high-speed Ethernet as the physical transport medium and employs RDMA technology to achieve zero-copy, low latency, and high-bandwidth data transfer. This high-performance network communication meets the fast and efficient communication needs between processing units in data parallel computation.
- Distributed computing support: Data parallel computation typically involves a distributed computing environment where multiple compute nodes collaborate to complete computational tasks. RoCE can provide efficient data transfer and communication for distributed computing, supporting fast data exchange and collaborative computation between nodes. It enables memory sharing and remote access between nodes, offering high-performance support for distributed computing, thereby accelerating the execution of data parallel computation.
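The AllReduce that data parallelism runs over such a network is commonly implemented as a ring all-reduce, which is bandwidth-optimal (each node transfers about 2(n-1)/n of the data). A single-process sketch, where the node loop and chunk passing stand in for real network transfers:

```python
import numpy as np

def ring_allreduce(tensors):
    """Sum n equal-shaped vectors across n 'nodes' in 2*(n-1) ring steps."""
    n = len(tensors)
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: scatter-reduce. At step s, node r passes chunk (r - s) mod n
    # to its right neighbour, which adds it to its local copy. After n-1
    # steps, each node holds one fully reduced chunk.
    for s in range(n - 1):
        sends = [chunks[r][(r - s) % n].copy() for r in range(n)]
        for r in range(n):
            c = (r - s) % n
            chunks[(r + 1) % n][c] = chunks[(r + 1) % n][c] + sends[r]

    # Phase 2: all-gather. Each node circulates its fully reduced chunk
    # ((r + 1) mod n after phase 1) around the ring, overwriting stale copies.
    for s in range(n - 1):
        sends = [chunks[r][(r + 1 - s) % n].copy() for r in range(n)]
        for r in range(n):
            c = (r + 1 - s) % n
            chunks[(r + 1) % n][c] = sends[r]

    return [np.concatenate(ch) for ch in chunks]

grads = [np.arange(8, dtype=float) + i for i in range(4)]  # 4 nodes' gradients
reduced = ring_allreduce(grads)
expected = sum(grads)
assert all(np.allclose(r, expected) for r in reduced)
```

Because every step moves only one chunk per link, the per-node traffic stays nearly constant as nodes are added, which is why this pattern, carried over RoCE, scales to thousands of GPUs.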
In conclusion, RoCE provides the low-latency, high-bandwidth data transfer and communication support that data-parallel computation requires, and it supports collaborative work in a distributed computing environment. By combining data parallelism and RoCE, we can achieve faster and more efficient data processing and communication, driving innovation and development in scientific, engineering, and business domains.
Data parallelism, as an efficient computer processing technology, has been playing an important role in scientific research, business and daily life. By dividing problems into sub-problems that can be processed in parallel and utilizing multiple processing units to perform calculations, data parallelism is able to accelerate tasks such as data analysis, image processing and simulation calculations. In the future, data parallelism will continue to drive the development of the information world and make our lives smarter and more efficient.