In recent years, the Transformer and MoE architectures have allowed deep learning models to scale past a trillion parameters with relative ease. Traditional single-machine, single-GPU training can no longer meet the demands of such very large models, so we must turn to distributed training across multiple GPUs on a single machine, or even across multiple machines.
The primary goal of distributed machine learning is to use AI clusters so that deep learning algorithms can efficiently train well-performing large models from massive amounts of data. Achieving this generally requires partitioning the computation, the training data, and the model according to the available hardware resources and the data/model sizes, so that both storage and training can be distributed.
Pipeline parallelism is one of the parallelization techniques used in distributed training. It accelerates neural network training by combining model parallelism with data pipelining. The core idea is to partition the model into blocks of layers and assign each block to a single device. During the forward pass, each device passes its intermediate activations to the next stage; during the backward pass, each device passes the gradient of its input tensor back to the previous stage.
However, a problem arises with this approach: at any given moment, only one GPU in the chain is actively working while the others sit idle. To address this, the computation can be pipelined so that processing flows continuously: a batch is divided into several micro-batches, and each micro-batch is processed separately.
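As a rough illustration, a GPipe-style forward schedule can be sketched in a few lines of Python. This is a toy simulation of which (stage, micro-batch) pairs are active at each time step, not real training code:

```python
# Toy model of a pipelined forward pass: with S stages and M micro-batches,
# stage s works on micro-batch m at time step s + m, so once the pipeline
# fills, every stage is busy at the same time.
def forward_schedule(num_stages: int, num_microbatches: int):
    """For each time step, list the (stage, micro_batch) pairs that are active."""
    total_steps = num_stages + num_microbatches - 1
    return [
        [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_microbatches]
        for t in range(total_steps)
    ]

# With a single batch, at most one of the 4 GPUs works at a time;
# with 8 micro-batches, all 4 stages overlap in the steady state.
print(max(len(step) for step in forward_schedule(4, 1)))  # 1
print(max(len(step) for step in forward_schedule(4, 8)))  # 4
```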
So, is it better to make the pipeline deeper so that each GPU only calculates one layer?
Firstly, deepening the pipeline significantly increases the memory needed to store the intermediate states (activations) of the forward pass, exacerbating the problem of insufficient memory capacity. For example, an intermediate state computed by the first stage of the pipeline must pass through the remaining N - 1 stages during the forward flow, and then wait for the backward pass to traverse the same N - 1 stages; in other words, a total of 2N - 2 rounds elapse before that intermediate state can be used. And don't forget that a new set of intermediate states is generated in every round, so up to 2N - 1 sets must be held at once. If N is large, the required memory is substantial.
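The in-flight count can be tallied directly, following the 2N - 2 rounds reasoning above (a toy helper, not part of any framework):

```python
# An activation set produced by the first stage is consumed only after the
# remaining N-1 forward stages and N-1 backward stages, i.e. 2N - 2 rounds
# later; one new set is produced every round, so up to 2N - 1 sets coexist.
def in_flight_activations(num_stages: int) -> int:
    rounds_until_reuse = 2 * (num_stages - 1)
    return rounds_until_reuse + 1  # plus the set produced this round

print(in_flight_activations(80))  # 159 sets for an 80-stage pipeline
```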
Secondly, communication is required between adjacent pipeline stages, and as the number of stages increases, the total amount of data and the overall latency of communication also increase.
Lastly, to keep such a pipeline full, the batch size must equal the number of pipeline stages — here the number of Transformer layers, generally several tens — multiplied by the data-parallel degree. The resulting batch size is very large, which can hurt the speed of convergence or the accuracy of the converged model.
Therefore, it is generally better to partition fewer pipeline stages, especially if there is sufficient memory capacity.
For the LLaMA-2 70B model, the parameters require 140 GB (70B parameters in 16-bit precision), the gradients for the backward pass require another 140 GB, and the optimizer state (with Adam: an FP32 master copy of the weights plus momentum and variance) requires 840 GB.
The memory for the intermediate states of the forward pass depends on the token (sequence) length, the batch size, and the selective-recomputation configuration. Striking a balance between compute and memory, the required storage is roughly token length * batch size * hidden dimension * number of layers * (10 + 24 / tensor-parallel degree) bytes. Assuming a batch size of 8 and no tensor parallelism, the forward-pass intermediate states for LLaMA-2 70B require 4096 * 8 * 8192 * 80 * (10 + 24) bytes, which is about 730 GB. As you can see, it's quite large.
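The arithmetic can be checked directly. The (10 + 24/tp) bytes-per-element coefficient is the selective-recomputation figure quoted above, and the LLaMA-2 70B shapes (4096 sequence length, 8192 hidden dimension, 80 layers) come from the text:

```python
# Activation-memory estimate for the forward pass, per the formula above.
def activation_bytes(seq_len, batch_size, hidden_dim, num_layers, tp=1):
    return seq_len * batch_size * hidden_dim * num_layers * (10 + 24 / tp)

gb = activation_bytes(4096, 8, 8192, 80, tp=1) / 1e9
print(round(gb))  # 730 (GB), matching the figure above
```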
In total, it would require 140 + 140 + 840 + 730 = 1850 GB, which is much larger than just the 140 GB for storing the model parameters alone. A single A100/H100 card has only 80 GB of memory, so it would require at least 24 cards. If using cards with 24 GB of memory, such as the 4090, it would require at least 78 cards.
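Summing the components and rounding up the card counts reproduces the numbers above:

```python
import math

# Memory components for LLaMA-2 70B training, in GB, as given in the text.
params_gb, grads_gb, optimizer_gb, activations_gb = 140, 140, 840, 730
total_gb = params_gb + grads_gb + optimizer_gb + activations_gb
print(total_gb)                     # 1850

print(math.ceil(total_gb / 80))     # 24 cards at 80 GB each (A100/H100)
print(math.ceil(total_gb / 24))     # 78 cards at 24 GB each (e.g. 4090)
```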
The LLaMA-2 model has only 80 layers in total, and placing each layer on its own card seems perfect, right? That gives 80 pipeline stages, and pipeline parallelism alone needs 80 micro-batches in flight to fill the pipeline.
However, with this setup, the storage required for storing the intermediate states during the forward propagation becomes unbearable. It would require storing the intermediate states for 80 * 2 = 160 rounds, which is an increase by a factor of 160. Even if selective recomputation is used, such as dividing the 80 layers into 8 groups of 10 layers each, the storage for the intermediate states would still increase by a factor of 16.
The only remaining escape is the extremely drastic approach of full recomputation, where the forward results are recomputed from scratch for each layer during the backward pass — but then the computational cost grows quadratically with the number of layers. The first layer computes 1 layer, the second computes 2, and so on, up to the 80th layer computing 80 layers. That totals 3,240 layer-computations, roughly 40 times the cost of a normal 80-layer forward pass. This is hardly acceptable.
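The 3,240 figure is just the triangular sum:

```python
# Full recomputation cost: layer k re-runs the forward pass through
# k layers, so the total work is 1 + 2 + ... + 80 layer-computations.
num_layers = 80
total = sum(range(1, num_layers + 1))
print(total)               # 3240 layer-computations
print(total / num_layers)  # 40.5x a normal 80-layer forward pass
```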
The storage problem for intermediate states is already significant; now consider the communication cost. Suppose we run this one-layer-per-card scheme on a cluster of 2048 such cards, with different input data flowing through them fully in parallel — the 2048 cards are then effectively also participating in data parallelism. In data parallelism, every round requires transmitting the locally computed gradients and receiving the globally averaged gradients, and the volume of gradient data equals the number of model parameters.
If we divide the 70B model into 80 layers, each layer holds roughly 1 billion parameters. Since the optimizer uses 32-bit floating point, about 4 GB of gradient data must be transmitted per card. So, how long does one round of computation take? The total compute per card is batch size * number of tokens * 6 * number of parameters = 8 * 4096 * 6 * 1B ≈ 196 TFLOPs. On a 4090, assuming 100% utilization, that takes only about 0.6 seconds. However, merely transferring the 4 GB over PCIe Gen4 already takes at least 0.12 seconds, and it must be transmitted twice (first the local gradients, then the averaged gradients). The resulting 0.24 seconds of communication against 0.6 seconds of computation is a significant proportion.
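A back-of-the-envelope check of these numbers (the ~330 TFLOPS peak for the 4090 is the assumption implied by the 0.6 s figure above, and 32 GB/s is the usual PCIe Gen4 x16 bandwidth; the text rounds 0.125 s down to 0.12 s):

```python
# Compute time vs. communication time for one round on one card.
flops = 8 * 4096 * 6 * 1e9        # batch * tokens * 6 * params-per-layer
compute_s = flops / 330e12        # assumed ~330 TFLOPS peak, 100% utilization
transfer_s = 4 / 32               # 4 GB over 32 GB/s PCIe Gen4 x16

print(round(compute_s, 2))        # 0.6 (seconds of compute)
print(round(transfer_s * 2, 2))   # 0.25 (gradients out + averaged gradients in)
```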
Of course, we can optimize by having each GPU internally aggregate the 80 sets of gradients it accumulates across the pipeline rounds before communicating. In theory, one training step would then take about 48 seconds (80 rounds × 0.6 s), with communication taking less than 1 second — an acceptable overhead. However, getting communication under 1 second requires enough network cards in the machine to fully utilize the PCIe Gen4 bandwidth; otherwise, the network card becomes the bottleneck. For example, a machine with 8 GPUs would need 8 ConnectX-6 200 Gbps RDMA network cards to meet the requirement.
Lastly, let's consider the batch size. When running the entire cluster with 2048 cards and setting the mini-batch size per GPU to 8, the overall batch size would be 16,384. This is already a relatively large batch size in large-scale training, and increasing it further may impact the speed of model convergence or the accuracy after convergence.
The Dominant Role of 800G Optical Transceivers in Pipeline Parallelism
- Data Transfer Speed: Pipeline parallelism decomposes a computational task into multiple stages that execute in parallel, and data must be transferred between those stages. The 800G optical transceiver is a high-speed fiber-optic communication transceiver with a transmission rate of 800 gigabits per second (Gbps). It provides a high-bandwidth data channel that meets the fast inter-stage transfer requirements of pipeline-parallel computing.
- Data Throughput: Efficient data transfer is essential for achieving collaborative work between different stages in pipeline parallel computing. The high transmission speed of the 800G optical transceiver means that it can support larger data throughput. It can quickly transfer a large amount of data, enabling pipeline parallel computing to process data faster and improve computational efficiency.
- Scalability and Future Development: The 800G optical transceiver represents the latest advancement in optical communication technology, offering higher transmission speeds and bandwidth. As the computational tasks and data volume increase, there is a growing demand for pipeline parallel computing. Adopting high-speed 800G optical transceivers can provide better scalability and performance for future pipeline parallel computing. They can meet the increasing data transfer requirements and support larger-scale and more complex pipeline parallel computing tasks.
In summary, pipeline parallel computing can benefit from the high-speed data transfer and high bandwidth capabilities of 800G optical transceivers. The 800G optical transceiver can support fast data transfer between stages in pipeline parallel computing, improving computational efficiency and performance, while also offering excellent scalability to meet future computational needs.
To meet the high-speed and high-bandwidth requirements of pipeline parallel computing, NADDOD, as a leading optical networking solution provider, offers cutting-edge 800G series optical transceivers and solutions. These include the 800G OSFP DR8+, 800G OSFP 2xFR4, and 800G OSFP DR8 optical transceiver types, catering to different matching requirements of users.
However, despite its benefits for computational efficiency, pipeline parallelism also brings challenges. The storage requirements for intermediate states grow significantly, leading to memory-capacity pressure, and the communication overhead between adjacent pipeline stages increases as well. Balancing computation against communication requires trade-offs among memory capacity, communication bandwidth, and model convergence speed. Overall, pipeline parallelism is an effective model-parallel approach, but its design and implementation must weigh all of these factors to achieve the best performance and results.