High-Performance Networks for AI

NADDOD Brandon InfiniBand Technical Support Engineer Jul 22, 2023

In recent years, large AI models have become a hot topic in the field of AI due to their exceptional natural language understanding, cross-media processing capabilities, and their potential as a path toward general AI. The parameter counts of the large models launched by leading companies in the industry have reached the trillions or even tens of trillions.


In 2023, a popular AI product called ChatGPT emerged that can chat, write code, answer questions, and write fiction. Its technical foundation is the fine-tuned GPT-3.5 large model, which has 175 billion parameters. According to reports, GPT-3.5 was trained on an AI supercomputing system purpose-built by Microsoft: a high-performance network cluster of 10,000 V100 GPUs, with a total compute consumption of about 3,640 PF-days (that is, at a sustained rate of one quadrillion, or 10^15, calculations per second, the computation would take 3,640 days to complete).
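For a sense of scale, the quoted figure can be converted into a total operation count. The snippet below is only a back-of-the-envelope check, assuming 1 PF-day means one petaflop/s (10^15 operations per second) sustained for a full 24-hour day:

```python
# Back-of-the-envelope conversion of the reported 3,640 PF-days into total operations.
# Assumption: 1 PF-day = 10**15 operations per second sustained for one 24-hour day.
PFLOPS = 10**15                 # operations per second at one petaflop/s
SECONDS_PER_DAY = 24 * 3600     # 86,400 seconds

pf_days = 3640
total_ops = pf_days * PFLOPS * SECONDS_PER_DAY
print(f"Total compute: {total_ops:.2e} operations")   # ~3.14e+23
```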

Network Demands in the AI Era: Ultimate Networking

In the age of artificial intelligence, the demands placed on networks have reached unprecedented levels. As AI technologies advance and large-scale models become the norm, network infrastructure must keep pace, delivering exceptional connectivity and responsiveness. Network quality directly affects the execution of AI algorithms, data transfer, and real-time decision-making: from lightning-fast data movement to ultra-low-latency connectivity, the pursuit of the ultimate network has become a cornerstone of AI success. Only by harnessing cutting-edge technologies and pushing the boundaries of network capability can we unleash the full potential of AI in the digital age.

1. Network Performance Determines GPU Cluster Computing Power

According to Amdahl's Law, the serial portion of a workload, which in distributed training is dominated by communication, limits the overall efficiency of a parallel system. The more nodes a parallel system has, the larger the share of time spent in communication, and the greater its impact on overall system performance. Large-model training tasks often require the computing power of hundreds or even thousands of GPUs; with so many server nodes and so much cross-server traffic, network bandwidth becomes the bottleneck of the GPU cluster.

In particular, the Mixture-of-Experts (MoE) structure now widely used in large-model architectures relies on sparse gating and therefore requires an All-to-All communication pattern, which places extremely high demands on the network as the cluster scales. Recent industry optimizations for All-to-All communication have focused on fully exploiting the bandwidth the network provides, shortening communication time and improving the training speed of MoE models.
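A rough way to see why communication caps cluster computing power is to plug a communication fraction into Amdahl's Law. The sketch below is illustrative only; the 10% communication share is an assumed figure, not a measurement from any real cluster:

```python
# Amdahl's Law: speedup = 1 / (s + (1 - s) / N),
# where s is the serial (here: communication) fraction and N is the GPU count.
def amdahl_speedup(serial_fraction: float, n_gpus: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_gpus)

serial_fraction = 0.10  # assumed: 10% of each step spent in serial communication
for n in (8, 64, 512, 4096):
    s = amdahl_speedup(serial_fraction, n)
    print(f"{n:>5} GPUs -> speedup {s:6.1f}x, efficiency {s / n:6.1%}")

# Even with only 10% communication, speedup can never exceed 10x (1 / 0.1),
# so adding thousands of GPUs yields rapidly diminishing per-GPU efficiency
# unless the network shrinks the communication fraction itself.
```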

2. Network Availability Determines GPU Cluster Stability

Once a GPU cluster reaches a certain scale, ensuring the stability of the cluster system becomes another challenge that must be addressed alongside performance. The availability of the network determines the computing stability of the entire cluster, for two reasons:

  1. Network failure domains are large: whereas a single GPU failure affects only a small portion of the cluster's computing power, a network failure can disrupt the connectivity of dozens of GPUs or more. Only a stable network can keep the system's computing power intact.
  2. Network performance fluctuations have a broad impact: a single underperforming GPU or server is easy to isolate, but the network is a resource shared by the whole cluster, so performance fluctuations affect the utilization of every computing resource.