Exploring Parallel Computing Strategies for GPU Inference

When running inference on deep learning models, the choice of hardware configuration largely determines inference speed and efficiency. This article discusses how to use graphics processing units (GPUs) for parallel inference, focusing on two common strategies, pipeline parallelism and tensor parallelism, and how each performs on GPUs.

How many GPUs are needed for 70B inference?

Calculating the total memory requirement is relatively straightforward. During inference, memory is consumed mainly by the parameters, the KV Cache, and the intermediate results of the current layer. With a batch size of 8, the intermediate results need batch size * token length * embedding size * 2 bytes = 8 * 4096 * 8192 * 2 B = 0.5 GB, which is relatively small.

A 70B model has 140 GB of parameters, which does not fit on a single A100/H100 or 4090. So, would 2 H100 cards be enough? Their combined 160 GB looks sufficient, but only 20 GB would remain for the KV Cache, half of the 40 GB it needs here. That would mean halving either the batch size or the maximum token length, which doesn't seem wise. Therefore, at least 3 H100 cards are needed.

As for the 4090, 140 GB of parameters + 40 GB of KV Cache = 180 GB. With 24 GB per card, 8 cards (192 GB) are just enough to hold them.
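The sizing arithmetic above can be checked with a short script. This is a minimal sketch rather than a capacity planner: the function names are hypothetical, the 140 GB and 40 GB figures are taken from the text, and 80 GB per H100 is inferred from the 160 GB quoted for two cards.

```python
import math

def activation_bytes(batch_size, token_length, embedding_size, bytes_per_value=2):
    """Size of the current layer's intermediate results in bytes."""
    return batch_size * token_length * embedding_size * bytes_per_value

def cards_needed(total_gb, card_memory_gb):
    """Smallest number of cards whose combined memory covers total_gb."""
    return math.ceil(total_gb / card_memory_gb)

PARAMS_GB = 140    # 70B parameters at 2 bytes each (from the text)
KV_CACHE_GB = 40   # KV Cache budget used in the text

print(activation_bytes(8, 4096, 8192) / 2**30, "GiB of intermediate results")
print(cards_needed(PARAMS_GB + KV_CACHE_GB, 24), "x 4090 (24 GB each)")
print(cards_needed(PARAMS_GB + KV_CACHE_GB, 80), "x H100 (80 GB each, assumed)")
```

The sketch ignores the 0.5 GB of intermediate results when counting cards, since it does not change either result.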

Can pipeline parallelism be used for inference?

In NADDOD's view, based on its experience across deployment scenarios, the main issue with using pipeline parallelism for inference is the latency introduced by serial processing, while network latency is a minor concern.

First, consider inference latency. Although different pipeline stages can work on different prompts at the same time, a single prompt is still processed by only one GPU at a time as it passes from stage to stage, which increases its latency compared to tensor parallelism.

For a very small batch size, GPU memory bandwidth is the bottleneck. In this case, the per-card token latency is 2 bytes * parameter size / number of cards / memory bandwidth. For example, running LLaMA-2 70B on 8 4090 cards gives 2 bytes * 70G / 8 / 1 TB/s = 0.0175 seconds. This calculation does not take into account the savings from KV Cache. Note that the 8 pipeline stages process a given token one after another, so the token latency must be multiplied by 8, giving 0.14 seconds. Only about 7 tokens can be output per second, which is quite slow for a model as small as 70B.
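The memory-bound estimate can be written out as follows. This is a sketch under the text's assumptions (2 bytes per parameter, 1 TB/s of memory bandwidth per 4090, every weight read once per token); the function name is hypothetical.

```python
def memory_bound_latency_s(num_params, num_cards, mem_bw_bytes_per_s, bytes_per_param=2):
    """Time for one card to stream its share of the weights once."""
    return bytes_per_param * num_params / num_cards / mem_bw_bytes_per_s

NUM_CARDS = 8
per_card_s = memory_bound_latency_s(70e9, NUM_CARDS, 1e12)

# In a pipeline, a token visits the stages one after another,
# so the per-card latencies add up.
pipeline_token_s = per_card_s * NUM_CARDS
tokens_per_s = 1 / pipeline_token_s

print(f"{per_card_s * 1e3:.1f} ms per card")
print(f"{pipeline_token_s * 1e3:.0f} ms per token, {tokens_per_s:.1f} tokens/s")
```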

For a very large batch size, GPU computational power becomes the bottleneck. In this case, the per-card token latency is batch size * 2 * parameter size / number of cards / computational power. For example, with a batch size of 1024 and the same 8-card setup, it is 1024 * 2 * 70G / 8 / 330 Tflops = 0.0543 seconds. In practice, with such a large batch size the KV Cache and the intermediate results of the forward pass would already fill the GPU memory.
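The compute-bound case follows the same pattern. In this sketch the factor of 2 is read as roughly two floating-point operations (one multiply, one add) per parameter per token, which is an interpretation of the formula above; the function name is hypothetical and 330 Tflops is the 4090 figure from the text.

```python
def compute_bound_latency_s(batch_size, num_params, num_cards, flops_per_s):
    """Time for one card to do its share of the arithmetic for one batch step."""
    # Assumed: about 2 floating-point operations per parameter per token.
    return batch_size * 2 * num_params / num_cards / flops_per_s

latency_s = compute_bound_latency_s(1024, 70e9, 8, 330e12)
print(f"{latency_s * 1e3:.1f} ms per card for a batch of 1024")
```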

So, what batch size balances the utilization of GPU computational power and memory bandwidth? It is found by setting the two latencies equal: 2 bytes * parameter size / number of cards / memory bandwidth = batch size * 2 * parameter size / number of cards / computational power. The parameter size and number of cards cancel on both sides, leaving batch size = computational power / memory bandwidth. For the 4090 this is 330 Tflops / 1 TB/s = 330, and for the H100 it is 590. In other words, for the 4090, GPU memory bandwidth is the bottleneck when the batch size is below 330, and GPU computational power is the bottleneck when it is above 330. At a batch size of 330, in the ideal case, both memory bandwidth and computational power are fully utilized, and each card completes its part of a token step in 17.5 ms.
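The crossover can be expressed as a small helper that also reports which resource limits a given batch size. This is an illustrative sketch with hypothetical function names; only the 4090 figures from the text are used, and another GPU would need its own peak throughput and bandwidth numbers.

```python
def balanced_batch_size(flops_per_s, mem_bw_bytes_per_s):
    """Batch size at which memory-bound and compute-bound latencies are equal."""
    return flops_per_s / mem_bw_bytes_per_s

def bottleneck(batch_size, flops_per_s, mem_bw_bytes_per_s):
    """Name the limiting resource for a given batch size."""
    if batch_size < balanced_batch_size(flops_per_s, mem_bw_bytes_per_s):
        return "memory bandwidth"
    return "compute"

RTX4090_FLOPS = 330e12   # from the text
RTX4090_MEM_BW = 1e12    # from the text

print(balanced_batch_size(RTX4090_FLOPS, RTX4090_MEM_BW))
for batch_size in (8, 1024):
    print(batch_size, bottleneck(batch_size, RTX4090_FLOPS, RTX4090_MEM_BW))
```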

Next, let's consider network latency. The advantage of pipeline parallelism over tensor parallelism is that it requires less network transmission. Between pipeline stages, only batch size * embedding size values need to be transmitted. For example, with a batch size of 8, an embedding size of 8192, and 2 bytes per value, only 128 KB of data needs to be transmitted. On PCIe Gen4 x16 with a transfer rate of 32 GB/s, the transfer takes only 4 microseconds. However, we also need to account for the overhead of the communication library and the fact that the 4090 does not support direct GPU-to-GPU peer-to-peer transfers, so data must pass through the CPU. In practice, a transfer may take tens of microseconds. Compared to the tens of milliseconds of token latency in the computation part, this can be neglected.

Even with a batch size of 330, transmitting the resulting 5.4 MB (330 * 8192 * 2 bytes) over PCIe takes only about 0.17 milliseconds, which is still negligible compared to the 17.5 milliseconds of computation time.
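The per-stage transfer sizes and raw PCIe transfer times can be reproduced as below. The sketch assumes 2 bytes per activation value and the 32 GB/s figure from the text, and it deliberately ignores communication-library and CPU-relay overhead; the function names are hypothetical.

```python
def stage_transfer_bytes(batch_size, embedding_size, bytes_per_value=2):
    """Activation data handed from one pipeline stage to the next per token step."""
    return batch_size * embedding_size * bytes_per_value

def transfer_time_s(num_bytes, link_bytes_per_s):
    """Raw wire time, ignoring software overhead."""
    return num_bytes / link_bytes_per_s

PCIE_GEN4_X16 = 32e9  # bytes per second, from the text

for batch_size in (8, 330):
    size = stage_transfer_bytes(batch_size, 8192)
    time_us = transfer_time_s(size, PCIE_GEN4_X16) * 1e6
    print(f"batch {batch_size}: {size} bytes, {time_us:.0f} microseconds")
```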

If the latency of pipeline parallelism can be tolerated, it is even possible to run the pipeline across multiple hosts. Assume there is only a regular 1 Gbps Ethernet network between the hosts, and each host has a single 4090 card. For a batch size of 1, allow 0.25 ms to transmit 16 KB of data (a conservative figure, since the raw wire time at 1 Gbps is roughly 0.13 ms), plus another 0.25 ms for network protocol stack processing at both ends. This gives a latency of 0.5 ms per pipeline stage. The communication time across 8 cards is only 4 ms, which is negligible compared to the overall computation latency of 140 ms and will not significantly affect the inference latency of the system.

When the batch size is small, the network traffic in pipeline inference is bursty. Only 0.25 ms of data transmission occurs every 18 ms, resulting in an occupancy rate of only 1/72. There's no need to worry about the pipeline inference saturating the local network and affecting normal internet usage.

If a large batch size such as 330 is chosen to fully utilize the computational power, then transmitting the 5.4 MB of data takes about 43 ms per stage. The communication time across 8 cards grows to roughly 0.35 seconds, limiting output to fewer than 3 tokens per second, which is difficult to tolerate. Therefore, if pipeline parallelism runs over inter-host links without high communication bandwidth, some throughput must inevitably be sacrificed.
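The Ethernet case differs from PCIe mainly in that link speed is quoted in bits per second. The sketch below computes only the raw wire time per stage and leaves protocol stack overhead out; the function name and defaults are hypothetical. For a batch size of 1 it gives about 0.13 ms, so the 0.25 ms used in the text can be read as a conservative allowance.

```python
def ethernet_stage_s(batch_size, embedding_size=8192, link_bits_per_s=1e9, bytes_per_value=2):
    """Raw wire time for one stage-to-stage transfer over Ethernet."""
    bits = batch_size * embedding_size * bytes_per_value * 8
    return bits / link_bits_per_s

NUM_STAGES = 8

for batch_size in (1, 330):
    stage_s = ethernet_stage_s(batch_size)
    total_s = NUM_STAGES * stage_s
    print(f"batch {batch_size}: {stage_s * 1e3:.2f} ms per stage, "
          f"{total_s * 1e3:.1f} ms across {NUM_STAGES} stages")
```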

For example, suppose the output speed must be no less than 5 tokens per second, that is, at most 200 ms per token. With 140 ms spent on computation, 60 ms remain for communication, or at most 7.5 ms per pipeline stage. In 7.5 ms a 1 Gbps network can transmit a little under 960 KB of data, so the batch size can be set to about 60 at most (60 * 16 KB = 960 KB). Each stage then finishes a batch of 60 every 17.5 ms + 7.5 ms = 25 ms, giving a total throughput of 2400 tokens per second for these 8 4090 cards. However, the effective utilization of computational power in this scenario is less than 20%.
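The budget calculation can be sketched as follows. The function names are hypothetical, protocol stack overhead is ignored, and the throughput formula assumes every stage is kept busy so that one batch completes every (compute + communication) interval. Computed without rounding, the largest batch that fits the 7.5 ms budget is 57, which the text rounds to about 60.

```python
def max_batch_size(target_tokens_per_s, compute_s, num_stages,
                   link_bits_per_s, embedding_size=8192, bytes_per_value=2):
    """Largest batch whose per-stage transfer fits the communication budget."""
    comm_budget_s = 1 / target_tokens_per_s - compute_s
    per_stage_s = comm_budget_s / num_stages
    max_bytes = per_stage_s * link_bits_per_s / 8
    return int(max_bytes // (embedding_size * bytes_per_value))

def pipeline_throughput(batch_size, stage_compute_s, stage_comm_s):
    """Tokens per second when all stages are busy (assumed steady state)."""
    return batch_size / (stage_compute_s + stage_comm_s)

print(max_batch_size(5, 0.14, 8, 1e9))
print(pipeline_throughput(60, 17.5e-3, 7.5e-3))
```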

Recently, there has been a popular open-source project called Petals, which utilizes pipeline parallelism and turns GPUs into a distributed network similar to BitTorrent. Although the inference latency is indeed higher, it at least demonstrates the feasibility of distributed GPU inference.

How about using tensor parallelism for inference?

As mentioned earlier, the main drawback of pipeline parallelism is the high latency caused by GPUs working in series, which slows token output. Tensor parallelism avoids this, but in NADDOD's assessment its main drawback is the large amount of data transmission, which may not be manageable on devices with low network bandwidth.

But the amount of data transmission for inference is not the same as that for training! Inference only requires transmitting the intermediate results (activations) of the forward pass, while training requires transmitting the gradients of all parameters, which accounts for the majority of the data volume.

In inference with tensor parallelism, for each layer of the Transformer every GPU needs to broadcast the part of the result vector it is responsible for (of size batch size * embedding size / number of GPUs) to all other GPUs, and receive the parts broadcast by all other GPUs. This exchange happens once during attention computation and once more during feed-forward network computation, for a total of 2 * number of layers exchanges per token. For LLaMA-2 70B with its 80 layers, that is 160.

Each transmission involves batch size * embedding size values (sending and receiving travel in different directions, so they should not be counted twice). For batch size = 1 and embedding size = 8192 at 2 bytes per value, only 16 KB of data needs to be transmitted, which takes under 1 microsecond at the PCIe Gen4 transfer rate of 32 GB/s. However, considering the CPU overhead discussed earlier, each transmission would still take approximately 30 microseconds. With a total of 160 transmissions, that adds up to 4.8 ms.

Let's also consider the computational overhead. Again, considering the case of batch size = 1, where GPU memory bandwidth is the bottleneck, the latency for calculating each token on each card is 2 bytes * number of parameters / number of cards / memory bandwidth. Using our previous values, it would still be 17.5 ms. However, in this case, the 8 cards are processing in parallel, so the total processing time would be the calculation time + communication time = 17.5 ms + 4.8 ms = 22.3 ms. This means that it can generate 45 tokens per second, which is already quite impressive, as human reading speed would struggle to keep up with the generation speed.
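The batch size = 1 estimate can be reproduced in a few lines. This is a sketch under the text's assumptions: 80 layers for LLaMA-2 70B (which yields the 160 exchanges), a flat 30 microseconds per exchange, and compute and communication that do not overlap. The function name is hypothetical.

```python
NUM_LAYERS = 80  # LLaMA-2 70B; two exchanges per layer gives 160

def tensor_parallel_token_s(compute_s, per_transfer_s, num_layers=NUM_LAYERS):
    """Per-token latency when compute and communication do not overlap."""
    comm_s = 2 * num_layers * per_transfer_s
    return compute_s + comm_s

token_s = tensor_parallel_token_s(compute_s=17.5e-3, per_transfer_s=30e-6)
print(f"{token_s * 1e3:.1f} ms per token, {1 / token_s:.1f} tokens/s")
```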

What if the batch size is larger? For example, batch size = 330, fully utilizing the GPU computational power and memory bandwidth. Each transmission would then involve 330 * 8192 * 2 = 5.4 MB of data, which would take 0.17 ms with a PCIe Gen4 transfer rate of 32 GB/s. With a total of 160 transmissions, it would take 27 ms. Now, network communication overhead becomes the major contributor, resulting in a total processing time of 27 + 17.5 = 44.5 ms, and it can only generate 22 tokens per second, which is still not slow.

Note that regardless of the number of GPUs used for parallel inference, as long as tensor parallelism is employed, the total amount of data transmitted over the network remains the same. Therefore, increasing the number of GPUs can only accelerate computation, not communication.

Therefore, the NVLink on A100/H100 still plays a significant role in reducing inference latency. Taking the H100 with the batch size set to 590 to balance computational power and bandwidth utilization, each transfer of about 9.7 MB (590 * 8192 * 2 bytes) takes only 9.7 MB / 450 GB/s = 0.02 ms. With a total of 160 transmissions, that amounts to only 3.2 ms. Thanks to the larger memory bandwidth, the calculation time is also significantly reduced: for the H100 it is 2 bytes * 70G / 8 / 3.35 TB/s = 5.2 ms. The total processing time is then only 5.2 ms + 3.2 ms = 8.4 ms, or 119 tokens per second.
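To compare the two setups side by side, the same latency model can be parameterized by link and memory bandwidth. This is a rough sketch with a hypothetical function name: it uses only bandwidth figures quoted in the text, ignores per-transfer software overhead, and assumes no overlap between compute and communication. Because it does not round intermediate values, the H100 result comes out near 8.7 ms (about 115 tokens per second) rather than the rounded 8.4 ms and 119.

```python
def token_latency_s(batch_size, link_bytes_per_s, mem_bw_bytes_per_s,
                    num_params=70e9, num_cards=8, embedding_size=8192,
                    num_layers=80, bytes_per_value=2):
    """Tensor-parallel per-token latency: weight streaming plus 2 exchanges per layer."""
    transfer_bytes = batch_size * embedding_size * bytes_per_value
    comm_s = 2 * num_layers * transfer_bytes / link_bytes_per_s
    compute_s = bytes_per_value * num_params / num_cards / mem_bw_bytes_per_s
    return comm_s + compute_s

rtx4090_s = token_latency_s(330, link_bytes_per_s=32e9, mem_bw_bytes_per_s=1e12)
h100_s = token_latency_s(590, link_bytes_per_s=450e9, mem_bw_bytes_per_s=3.35e12)

for name, latency in (("8 x 4090 over PCIe", rtx4090_s), ("8 x H100 over NVLink", h100_s)):
    print(f"{name}: {latency * 1e3:.1f} ms per token, {1 / latency:.0f} tokens/s")
```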

In terms of token generation speed for a single prompt, then, no number of 4090 cards can catch up with 8 H100 cards.

The Importance of Optical Transceivers in GPU Parallel Computing

In the context of pipeline parallelism and tensor parallelism in GPU computing, optical transceivers play a crucial role, which can be summarized in the following aspects:

Efficient Data Transfer: In both pipeline parallelism and tensor parallelism, a large amount of data needs to be transferred and exchanged between GPUs. Optical transceivers provide high-speed data transfer channels with high bandwidth and low latency. They enable fast data transfer between the host system or storage devices and GPU memory, as well as the transmission of computation results back to the host system. This helps prevent data transfer from becoming a bottleneck, improves data transfer efficiency, and accelerates pipeline parallelism and tensor parallelism computations.

Parallel Computation Collaboration: The computation process in pipeline parallelism and tensor parallelism often involves collaboration among multiple GPUs. The high-speed network connectivity of optical transceivers facilitates fast data exchange and communication between multiple GPUs. It enables data sharing, model synchronization, and result exchange among GPUs to support collaborative parallel computing. Through the connection provided by optical transceivers, GPUs can efficiently interact and collaborate in parallel computation tasks, thereby enhancing overall computational performance.

Scalability and Flexibility: Optical transceivers also possess excellent scalability and flexibility, allowing for the construction and connection of large-scale GPU clusters. For tasks requiring large-scale parallel computing, optical transceivers can establish high-speed interconnections between GPUs, forming high-performance computing clusters. This scalability and flexibility enable optical transceivers to meet growing computational demands and support larger-scale pipeline parallelism and tensor parallelism computations, thereby improving computational efficiency and performance.

NADDOD is a leading provider of optical network solutions, offering excellent technical services and high-quality product assurance. Our products have gained attention and recognition from clients and industry professionals alike. NADDOD provides 800G optical transceivers and 400G AOC (Active Optical Cable) and 400G DAC (Direct Attach Cable) high-speed cable products to customers across various industries. We continuously strive to deliver innovative, efficient, and reliable optical network products, solutions, and services.

Conclusion

In conclusion, pipeline parallelism and tensor parallelism are two common parallel computing strategies for inference. Pipeline parallelism splits the model into stages that run on different GPUs; it needs very little communication bandwidth, but the stages process a given prompt one after another, so per-token latency is higher. Tensor parallelism splits each layer across multiple GPUs that work on the same token at once; it lowers per-token latency but transmits far more data, so it benefits from fast interconnects. Both approaches aim to improve computational performance, but they suit different scenarios and are implemented differently.

In some cases, it is possible to combine both approaches to maximize the acceleration of deep learning tasks. In practical applications, it is important to choose the most suitable parallel strategy based on specific requirements and resource constraints, while also acknowledging the important role of optical transceivers in optimizing inference performance and meeting real-time demands of users.