High-Performance Networks for AI

NADDOD Brandon, InfiniBand Technical Support Engineer, Jul 22, 2023

In recent years, large AI models have become a hot topic in the field of AI thanks to their exceptional natural language understanding, cross-media processing capabilities, and potential as a step toward general AI. The parameter counts of the large models launched by industry leaders have reached trillions or even tens of trillions.


In 2023, ChatGPT took off as a popular AI product that can chat, write code, answer questions, and write fiction. Its technical foundation is the fine-tuned GPT-3.5 large model, which has 175 billion parameters. According to reports, its training used an AI supercomputing system specially built by Microsoft, a high-performance network cluster of 10,000 V100 GPUs, consuming a total of about 3,640 PF-days of compute (that is, at one quadrillion, or 10^15, operations per second, the computation would take 3,640 days).

Network Demands in the AI Era: Ultimate Networking

In the age of artificial intelligence, the demands placed on networks have reached unprecedented levels. As AI technologies advance and large-scale models become the norm, network infrastructure must keep pace, providing exceptional connectivity and responsiveness. Network quality directly affects how smoothly AI workloads run: algorithm execution, data transfer, and real-time decision-making all depend on it. From lightning-fast data transfer to ultra-low-latency connectivity, the pursuit of an ultimate network has become a cornerstone of AI success; only by harnessing cutting-edge technologies and pushing the boundaries of network capability can the full potential of AI be unleashed in the digital age.

1. Network Performance Determines GPU Cluster Computing Power

According to Amdahl's Law, the serial portion of a workload bounds the overall efficiency of a parallel system; in distributed training, communication plays that serial role. The more nodes a parallel system has, the larger the share of time spent communicating, and the greater the impact of communication on overall system performance. Large-scale model training tasks often require hundreds or even thousands of GPUs, and the large number of server nodes and the resulting cross-server traffic make network bandwidth a bottleneck for the GPU cluster. In particular, the Mixture-of-Experts (MoE) architecture now widely used in large models relies on sparse gating and an All-to-All communication pattern, which places extremely high demands on the network as the cluster grows. Recent industry optimizations for All-to-All communication therefore focus on fully exploiting the high bandwidth the network provides, reducing communication time and improving the training speed of MoE models.
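As a rough, hypothetical illustration of this effect (the GPU counts and communication fractions below are assumptions, not measurements from any specific system), the speedup predicted by Amdahl's Law collapses quickly once the communication share stops shrinking with scale:

```python
# Hypothetical illustration: how a fixed serial-communication share bounds
# cluster scaling under Amdahl's Law. All numbers below are assumptions.
def cluster_speedup(num_gpus: int, comm_fraction: float) -> float:
    """Speedup over one GPU when `comm_fraction` of each iteration is
    serial communication that does not shrink as GPUs are added."""
    return 1.0 / (comm_fraction + (1.0 - comm_fraction) / num_gpus)

for gpus in (8, 256, 1024):
    for comm in (0.05, 0.20):
        s = cluster_speedup(gpus, comm)
        print(f"{gpus:5d} GPUs, {comm:4.0%} communication -> "
              f"speedup {s:7.1f}x, efficiency {s / gpus:6.1%}")
```

Even a fixed 5% communication share caps a 1024-GPU cluster at roughly a 20x speedup in this toy model, which is why cutting communication time matters more and more as clusters grow.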

2. Network Availability Determines GPU Cluster Stability

Once a GPU cluster reaches a certain scale, keeping the cluster stable becomes a challenge that must be addressed alongside performance. The availability of the network determines the computing stability of the entire cluster, for two reasons:

  1. Network failure domains are large: a single GPU failure affects only a small fraction of the cluster's computing power, whereas a network failure can break the connectivity of dozens of GPUs or more. Only a stable network can keep the system's computing power intact.
  2. Network performance fluctuations have a wide impact: a single low-performing GPU or server is easy to isolate, but the network is a resource shared by the whole cluster, so performance fluctuations affect the utilization of every computing resource.


3. Meeting Communication Challenges for AI Model Training with High-Bandwidth Nodes

In large-scale model training with billions or trillions of parameters, the communication volume for a single iteration's gradient synchronization can reach hundreds of gigabytes. On top of that, the various parallelism modes and communication patterns introduced by acceleration frameworks mean that traditional low-speed networks cannot support efficient computation on GPU clusters. To fully exploit the powerful computing resources of GPUs, high-performance network infrastructure is needed to build super-bandwidth computing nodes with high-bandwidth, scalable, and low-latency communication. This enables efficient cluster computing and communication, thereby solving the communication challenges of AI training.
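As a back-of-the-envelope sketch (the model size, gradient precision, GPU count, and the choice of ring all-reduce below are assumptions for illustration, not figures from any specific system), the per-GPU traffic of a single gradient synchronization for a dense data-parallel model can be estimated as follows:

```python
# Back-of-the-envelope sketch with assumed values: per-GPU traffic for one
# gradient synchronization of a dense data-parallel model via ring all-reduce.
def allreduce_traffic_per_gpu(num_params: float, bytes_per_grad: int,
                              num_gpus: int) -> float:
    """Ring all-reduce sends (and receives) ~2 * (N - 1) / N of the gradient
    buffer per GPU, where N is the number of participating GPUs."""
    buffer_bytes = num_params * bytes_per_grad
    return 2 * (num_gpus - 1) / num_gpus * buffer_bytes

params = 175e9                      # assumed: a 175B-parameter dense model
traffic = allreduce_traffic_per_gpu(params, bytes_per_grad=2, num_gpus=1024)  # FP16 grads
print(f"~{traffic / 1e9:.0f} GB per GPU per gradient synchronization")
```

With these assumed values the result is roughly 700 GB per GPU. Real deployments mix tensor, pipeline, and data parallelism, so the actual traffic pattern differs, but the order of magnitude shows why low-speed networks quickly become the bottleneck.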


The NVIDIA InfiniBand (IB) network provides each computing node with ultra-high communication bandwidth of up to 1.6 Tbps, delivering more than a 10x performance improvement over traditional networks. The main features of the NVIDIA InfiniBand network include:


  1. Non-blocking fat-tree topology: a non-blocking network topology ensures efficient data transmission within the cluster. This structure can support a single cluster of up to 2K GPUs and deliver cluster computing power exceeding the EFLOPS level (FP16).
  2. Flexible network scalability: the network can be flexibly expanded to support GPU clusters of up to 32K GPUs, so the cluster size can be adjusted on demand to meet the requirements of large-scale model training at different scales.
  3. High-bandwidth access: the network plane of each computing node is equipped with eight RoCE network adapters, providing ultra-high access bandwidth of 1.6 Tbps. This high-bandwidth design ensures fast data transmission between computing nodes and minimizes communication latency.

By building computing nodes with such ultra-high bandwidth on the NVIDIA InfiniBand network, powerful communication support is provided for AI training, further improving the efficiency and performance of large-scale model training. The sketch below shows how the per-node bandwidth and single-cluster scale figures above fit together.
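As a rough consistency check (the per-adapter speed and switch radix below are assumptions chosen to match the figures above, not vendor specifications), 1.6 Tbps per node corresponds to eight 200 Gbps adapters, and a ~2K-GPU single cluster matches a two-tier non-blocking fat-tree built from 64-port switches:

```python
# Back-of-the-envelope sketch with assumed values that match the figures above.
nics_per_node = 8
nic_speed_gbps = 200                              # assumed: 200 Gbps per adapter
node_bandwidth_tbps = nics_per_node * nic_speed_gbps / 1000
print(f"Per-node access bandwidth: {node_bandwidth_tbps:.1f} Tbps")   # -> 1.6 Tbps

# A two-tier non-blocking fat-tree built from radix-k switches can attach
# up to k^2 / 2 end ports (half of each leaf's ports go down, half go up).
switch_radix = 64                                 # assumed: 64-port switches
max_end_ports = switch_radix ** 2 // 2
print(f"Two-tier non-blocking fat-tree, {switch_radix}-port switches: "
      f"{max_end_ports} end ports")               # -> 2048, i.e. ~2K GPUs
```

Halving each leaf switch's ports between hosts and spines is what keeps the topology non-blocking; reaching the larger 32K-GPU scale requires adding a third switching tier.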


Here it is worth mentioning NADDOD, the leading provider of integrated optical network solutions. We offer high-quality InfiniBand HDR AOC and DAC high-speed products that meet the low-latency, high-bandwidth, and high-reliability network requirements of AI high-performance server clusters. Our solutions reduce cost and complexity while delivering superior performance, providing exceptional computing and communication efficiency for your AI server clusters. By choosing NADDOD, you get high-quality, high-performance network connectivity solutions that enhance the efficiency and competitiveness of large-scale AI model training.

Summary

In the future, as GPU computing power continues to grow and AI large-model training continues to develop, building high-performance network infrastructure will be a crucial task. GPU cluster network architectures must also keep iterating and upgrading to ensure high utilization and availability of the system's computing power. Only through continuous innovation and upgrading can the growing network demands be met with ultimate performance and reliability. High-bandwidth, low-latency, and scalable networks will become the standard of the AI era, providing strong support for large-scale model training and real-time decision-making. NADDOD, as a leading provider of optical network solutions, is committed to providing high-quality, high-performance network connectivity solutions for AI server clusters. We will continue to innovate, build reliable high-performance network infrastructure, and provide a stable foundation for the development and application of AI technology. Let us work together to meet the challenges of the AI era and open a new chapter for an intelligent future.