# Tensor Parallelism

Tensor parallelism is a method of model parallelism that addresses the issue of insufficient storage capacity for intermediate states (activations) in the forward propagation of large models when using purely pipeline parallelism and data parallelism. Unlike pipeline parallelism, which divides the model across layers, tensor parallelism divides the computations within a layer, specifically by splitting the attention calculation and feed-forward network across multiple GPUs.

With tensor parallelism, the problem of excessive pipeline stages due to the model's size, which requires 80 GPUs to fit, can be alleviated. By employing tensor parallelism on a single machine with 8 GPUs, the model can be divided into only 10 pipeline stages. Additionally, tensor parallelism allows for a reduction in batch size since the GPUs involved in tensor parallelism compute the same input data.

One common approach in distributed training is to parallelize the computation of attention, which is relatively easy to parallelize due to the presence of multiple heads that attend to different positions in the input sequence. Each head can be processed independently.

However, it is important to consider the communication overhead when performing any parallel computation.

The size of the Q and K matrices within each head is batch size * token length * size of the key, while the size of the V matrix is batch size * token length * size of the value. The size of the key/value is typically equal to the embedding size divided by the number of heads—for example, in LLaMA-2 70B, it would be 8192 / 64 = 128. The matrix size is batch size * 4096 * 8192 / 64 (note that this is for a single head). On the other hand, the parameter matrices Q, K, and V have a size of embedding size * embedding size / number of heads = 8192 * 8192 / 64 on each head.

The computational workload in the forward pass is essentially the calculation of all parameters for each token, which amounts to 2 * 3 (Q, K, V) * batch size * token length * number of parameters = 2 * 3 * batch size * 4096 * 8192 * 8192 / 64. It would be helpful to compare this with the matrix size to ensure accuracy in the calculations.

So, what is the communication volume? The output matrix Z is constructed by concatenating the results from each head. The size of each head is batch size * token length * embedding size / number of heads, which is equal to batch size * 4096 * 8192 / 64. The input matrix X has a size of batch size * token length * embedding size, which is equivalent to batch size * 4096 * 8192. It's important to note that the size of X here corresponds to each individual head, whereas the combined size of Z after merging all heads together remains the same. The unit used here is the number of parameters, and if calculated in bytes, it needs to be multiplied by the size of each parameter.

If we take the most extreme approach of assigning one GPU to each head, what is the ratio of computational workload to communication volume? It is approximately 2 * 3 * embedding size / number of heads / bytes per parameter = 2 * 3 * 8192 / 64 / 2 = 384. Given the 330 Tflops of the NVIDIA A100 GPU, if we want to ensure that communication does not become a bottleneck, the communication bandwidth needs to be at least 330T / 384 = 859 GB/s. Considering both sending and receiving, this would be 1.7 TB/s. It's too large and far exceeds the 64 GB/s of PCIe Gen4 x16, and even the 900 GB/s of NVLink cannot handle it.

Therefore, tensor parallelism should not be divided too finely, and each GPU should handle multiple heads. By assigning multiple attention heads to each GPU, the input matrix X becomes shared among these heads, thus reducing the communication overhead for the input matrix and improving the ratio of computational workload to communication volume.

Let's calculate based on the computational power of the NVIDIA A100, which is 330T / (64GB/s / 2) = 330T / (64 GB/s / 2). The minimum ratio of computational workload to communication volume needs to be 10,000, which is 2 * 3 * (embedding size / number of tensor parallelism GPUs) / bytes per parameter = 2 * 3 * 8192 / number of tensor parallelism GPUs / 2 >= 10,000. Solving this equation, we get: number of tensor parallelism GPUs <= 2.4. This means that if tensor parallelism is used, at most 2 GPUs can be used. If more GPUs are used, the computational power will not be fully utilized.

However, if we consider the parameters of the H100 GPU, the situation changes. The peak computational power of the H100 GPU is 1979 Tflops, and the bidirectional NVLink bandwidth is 900 GB/s. The minimum ratio of computational workload to communication volume needs to be 4400, which is 2 * 3 * (embedding size / number of tensor parallelism GPUs) / bytes per parameter = 2 * 3 * 8192 / number of tensor parallelism GPUs / 2 >= 4400. Solving this equation, we get: number of tensor parallelism GPUs <= 5.5. This means that using tensor parallelism with 8 GPUs on a single machine, if the computational power is fully utilized, the network becomes the bottleneck. It can be seen that even with a fast NVLink like 900 GB/s, it can easily become a bottleneck in the face of huge computational power. Of course, adopting a more optimized parallel partitioning method can save some network communication overhead.

The stripped-down version of the H800 GPU compared to the H100 GPU is the network bandwidth, which is reduced from 900 GB/s to 400 GB/s. If we calculate again, the minimum ratio of computational workload to communication volume needs to be 10,000. In this case, the number of tensor parallelism GPUs <= 2.4, which is the same as the case of the 4090 GPU. Therefore, using tensor parallelism with 8 H800 GPUs on a single machine would result in the network becoming the bottleneck. However, the theoretical computational power of 1979 Tflops and the optimization of parallel partitioning methods can affect the actual training of the 70B model, so the network may not necessarily be the bottleneck in practice. This is how the H800 precisely targets large model training and makes tensor parallelism uncomfortable.

If tensor parallelism is applied to the Feed Forward Network, similar deductions can be made, but I won't elaborate on that here. In most neural networks, matrix multiplication, where an MN matrix is multiplied by an NK matrix, results in a total computation of MNK. The overall size of the input and output is (MN + NK). Adding a few more matrices, such as Q, K, and V, would also be a constant factor. Therefore, the ratio of computation to communication is on the same order of magnitude as the dimensions (length) of the matrices.

After analyzing it, if you are planning to conduct large-scale training with large models, would you still use the PCIe version of A100/H100/H800? Although PCIe Gen5 is twice as fast as Gen4, for H100, the minimum ratio of computation to communication volume still needs to be 1979T / (128G / 2) = 30000. Solving this equation, the number of GPUs with tensor parallelism would be <= 0.8. Once tensor parallelism is used, computational power is compromised!

When the next generation of H100, such as GH200, is released, with double the computational power and NVLink still at 900 GB/s, NVLink starts to struggle. That's why GH200 timely introduced unified large memory, boasting 144 TB, for better swapping and utilizing memory in exchange for network communication.

The deductions above are certainly simplified, and in reality, the situation may not be as extreme, but the order of magnitude remains similar.

## Summary

From the above discussion, we can understand that tensor parallelism is a method to alleviate the issue of insufficient memory capacity due to excessive pipeline parallelism in large-scale model training. By partitioning the computations within a layer and distributing the attention calculations and Feed Forward Network across multiple GPUs, the number of pipeline stages can be reduced, thereby decreasing the memory pressure and other resource overhead.

Efficient data transfer and communication are crucial when performing tensor parallel computations on GPUs. In this regard, RDMA over Converged Ethernet (RoCE) plays a key role. RoCE leverages high-performance Ethernet technology to achieve low-latency and high-bandwidth data transfer through RDMA, providing robust connectivity for tensor parallel computations. The application of RoCE enables fast and efficient communication channels for data exchange and collaborative computing between GPUs, accelerating the execution of tensor parallel computations.

Therefore, based on market demand and user project implementation experience, NADDOD believes that in distributed training of large models, it is necessary to meet the requirements of high-performance, low-latency bandwidth while also addressing the severe shortage of InfiniBand product capacity in the market. Currently, RoCE is the reliable solution to address the urgent needs of users. RoCE plays a vital role in GPU tensor parallelism, providing high-performance data transfer, distributed computing support, reducing communication overhead, and offering scalability and flexibility. Through the application of RoCE, tensor parallel computations can be accelerated, improving computational efficiency and performance, enabling faster and more efficient deep learning and machine learning tasks.