Five Major Network Demands for AIGC Large Models

NADDOD Quinn InfiniBand Network Architect Mar 29, 2024

From the emergence of the Transformer to the explosion of ChatGPT in 2023, it has become clear that model performance improves as parameter scale grows, following the Scaling Law. Once a model's parameters exceed several hundred billion, its language understanding, logical reasoning, and problem-analysis capabilities improve rapidly. At the same time, this growth in parameters and performance has changed the network requirements of AI large model training compared with traditional models.


To support efficient distributed computing across large training clusters, AI large model training typically combines several parallelism strategies, such as data parallelism, pipeline parallelism, and tensor parallelism. Each of these modes requires collective communication among many computing devices. In addition, training usually runs in synchronous mode: collective communication across machines and GPUs must complete before the next iteration or computation can begin. In large-scale AI training clusters, an efficient networking design that delivers low-latency, high-throughput inter-machine communication reduces the time spent synchronizing data across machines and GPUs, and raises the proportion of GPU effective computation time (GPU computation time / overall training time). This is crucial to improving the efficiency of AI distributed training clusters. The sections below analyze the network requirements of AI large models from five perspectives: scale, bandwidth, latency, stability, and network deployment.


1. Ultra-scale Network Demand


The computational demand of AI applications is growing geometrically, and algorithm models keep growing larger: AI model parameters have grown roughly 100,000-fold over the past decade, and today's largest models reach hundreds of billions to trillions of parameters. Training such a model requires enormous computing power, and it also places heavy demands on GPU memory. Take a 1T-parameter model as an example: stored at 16-bit precision, the parameters alone consume 2TB. During training, intermediate state must be stored as well: activations produced by the forward pass, gradients produced by the backward pass, and optimizer states needed for parameter updates, all of which grow within a single iteration. Training with the Adam optimizer can, at its peak, hold intermediate state amounting to seven times the model parameters. Such high memory consumption means that tens or even hundreds of GPUs are needed just to hold one model's training state.
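The memory pressure above can be sketched with a back-of-envelope estimate. The per-parameter byte counts below follow one common mixed-precision accounting (an assumption for illustration, not figures from this article): fp16 weights and gradients plus fp32 master weights and Adam moment estimates; activations are workload-dependent and excluded.

```python
# Back-of-envelope GPU memory estimate for mixed-precision training with Adam.
# Byte counts per parameter are a common accounting convention (assumption):
# fp16 weights + fp16 gradients + fp32 master weights + fp32 Adam m and v.
# Activation memory is excluded because it depends on batch and sequence size.

def training_memory_gib(num_params: float) -> dict:
    bytes_per_param = {
        "fp16 weights": 2,
        "fp16 gradients": 2,
        "fp32 master weights": 4,
        "fp32 Adam m": 4,
        "fp32 Adam v": 4,
    }
    gib = 1024 ** 3
    breakdown = {k: num_params * b / gib for k, b in bytes_per_param.items()}
    breakdown["total"] = sum(v for k, v in breakdown.items())
    return breakdown

if __name__ == "__main__":
    est = training_memory_gib(1e12)  # 1T-parameter model
    for name, gib_val in est.items():
        print(f"{name:>22}: {gib_val / 1024:6.1f} TiB")
```

Under this accounting, a 1T-parameter model needs roughly 16 bytes per parameter for weights, gradients, and optimizer state alone, on the order of 15 TiB before activations, which is why the state must be sharded across many GPUs.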


However, a large number of GPUs alone is not enough to train an effective large model; choosing the right parallelism strategy is the key to training efficiency. There are currently three main parallelization methods for very large models: data parallelism, pipeline parallelism, and tensor parallelism, and all three are used when training models at the hundred-billion to trillion-parameter scale. Training at this scale requires clusters of thousands of GPUs. On the surface, this looks smaller than today's cloud data centers, which already interconnect tens of thousands of servers. In reality, interconnecting thousands of GPU nodes is more challenging, because network capacity must be closely matched to computing power. Cloud data centers run CPU workloads, with per-server network demand of roughly 10Gbps to 100Gbps over the traditional TCP transport protocol. AI large model training runs on GPUs whose computing power is several orders of magnitude higher than CPUs, driving interconnect demand to 100Gbps-400Gbps, and it additionally uses the RDMA protocol to reduce transmission delay and improve network throughput.
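How the three parallelism dimensions multiply into a thousand-GPU cluster can be sketched as follows. The specific degrees (TP=8, PP=16, DP=8) are assumed for illustration; tensor parallelism typically stays within a node to exploit high-speed intra-node links, while pipeline and data parallelism span nodes over the cluster network.

```python
# Illustrative sizing of a 3D-parallel training cluster. The parallelism
# degrees below are invented for the example, not taken from the article.
# Total GPUs = tensor-parallel x pipeline-parallel x data-parallel degrees.

def cluster_size(tp: int, pp: int, dp: int, gpus_per_node: int = 8):
    total_gpus = tp * pp * dp
    nodes = total_gpus // gpus_per_node
    return total_gpus, nodes

total, nodes = cluster_size(tp=8, pp=16, dp=8)
print(f"{total} GPUs across {nodes} nodes")  # 1024 GPUs across 128 nodes
```

Keeping TP equal to the GPU count per node confines the heaviest (tensor-parallel) traffic to intra-node links, leaving the inter-node network to carry pipeline and data-parallel collectives.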


Specifically, a high-performance network interconnecting thousands of GPUs must consider the following issues of scale:


  • Problems peculiar to large-scale RDMA networks, such as head-of-line blocking and PFC deadlock storms


  • Network performance optimization, including more efficient congestion control and load-balancing techniques


  • NIC connectivity: how to establish thousands of RDMA QP connections when a single host is limited by NIC hardware performance


  • Network topology selection: whether the traditional Fat-Tree structure remains best, or whether HPC topologies such as Torus and Dragonfly are worth adopting


2. Ultra-high Bandwidth Requirements


In AI large model training, collective communication both within and between machines generates enormous volumes of traffic. For intra-machine GPU communication, take a 100-billion-parameter model as an example: the AllReduce collective communication generated by model parallelism reaches the level of hundreds of gigabytes, so the bandwidth and topology of intra-machine GPU interconnects are critical to flow completion time. GPUs within a server should therefore support a high-speed interconnect protocol, which also avoids the multiple copy operations through CPU memory that GPU communication would otherwise rely on. For inter-machine GPU communication, pipeline parallelism, data parallelism, and tensor parallelism each require different communication operations; some collectives likewise reach the 100GB level, and complex collective patterns generate many-to-one and one-to-many communication at the same moment. High-speed inter-machine GPU interconnects therefore place high demands on single-port network bandwidth, the number of available links between nodes, and the network's total bandwidth. In addition, GPUs and NICs are usually interconnected via the PCIe bus, whose bandwidth determines whether the NIC's single-port bandwidth can be fully utilized. With a PCIe 3.0 x16 link (about 16GB/s unidirectional), a 200Gbps (25GB/s) NIC port cannot be driven at line rate, so the inter-machine network is not fully utilized.
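The PCIe bottleneck at the end of the paragraph reduces to a simple unit comparison: NIC line rate in Gbps divided by 8 versus the PCIe link's usable GB/s. A minimal sketch, using the approximate figures cited above:

```python
# Checks whether the host PCIe link can feed a NIC at line rate.
# ~16 GB/s is the approximate usable unidirectional bandwidth of
# PCIe 3.0 x16 used in the article; ~32 GB/s is the PCIe 4.0 x16 figure.

def pcie_limits_nic(pcie_gbytes_per_s: float, nic_gbits_per_s: float) -> bool:
    """Return True if the PCIe link cannot sustain the NIC's line rate."""
    nic_gbytes_per_s = nic_gbits_per_s / 8  # convert Gbps to GB/s
    return pcie_gbytes_per_s < nic_gbytes_per_s

# A 200 Gbps NIC needs 25 GB/s, more than PCIe 3.0 x16 can deliver:
print(pcie_limits_nic(16, 200))  # True  -> NIC is bottlenecked by PCIe 3.0
print(pcie_limits_nic(32, 200))  # False -> PCIe 4.0 x16 can sustain 200G
```

This is why 200G/400G NICs are normally paired with PCIe 4.0/5.0 hosts in large training clusters.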


3. Ultra-Low Delay and Jitter Requirements


Network delay in data communication consists of two parts: static delay and dynamic delay. Static delay comprises serialization delay, device forwarding delay, and electro-optical propagation delay; it is determined by the forwarding chip's capability and the transmission distance, and once the network topology and communication volume are fixed, it is essentially a constant. What really affects network performance is dynamic delay, which comprises queuing delay inside switches and packet-loss retransmission delay, and is usually caused by network congestion and packet loss.


Take GPT-3 training at the 175-billion-parameter scale as an example. Theoretical estimation models suggest that when dynamic delay grows from 10us to 1000us, the share of GPU effective computation time falls by nearly 10%; at a packet loss rate of 0.1%, the share falls by 13%; and at a 1% loss rate, the share drops below 5%. Reducing computation-communication delay and improving network throughput is therefore the core problem an AI large-model computing center must solve to fully release its computing power.
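The "effective computation time share" metric can be illustrated with a toy model (our own simplification, not the article's estimation model): the share is compute time divided by compute time plus communication time, so any added dynamic delay per iteration directly erodes it.

```python
# Toy model of the GPU effective-computation share (an assumption for
# illustration, not the estimation model referenced in the article):
#   share = compute_time / (compute_time + communication_time)
# Extra dynamic delay (queuing, retransmission) inflates communication_time.

def effective_compute_share(compute_ms: float, comm_ms: float) -> float:
    return compute_ms / (compute_ms + comm_ms)

base = effective_compute_share(100.0, 10.0)       # low-latency network
congested = effective_compute_share(100.0, 40.0)  # congestion adds delay
print(f"{base:.2%} -> {congested:.2%}")
```

Even this crude model shows the lever: shaving communication time per iteration raises the share directly, which is why congestion control and loss avoidance dominate cluster network design.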


Beyond latency itself, the latency jitter introduced by network variability also degrades training efficiency. The collective communication among computing nodes during training can generally be decomposed into P2P communications executed in parallel across nodes. For example, a Ring AllReduce among N nodes consists of 2*(N-1) data communication sub-steps, and each sub-step ends only after all of its parallel P2P transfers complete. When the network fluctuates, the flow completion time (FCT) of the P2P transfer between some pair of nodes grows noticeably; like the weakest stave of a barrel, that slow transfer stretches the completion time of the entire sub-step it belongs to. Network jitter therefore lowers the efficiency of collective communication and, in turn, the training efficiency of AI models.
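The barrel effect described above can be sketched as a small simulation: each of the 2*(N-1) Ring AllReduce sub-steps lasts as long as its slowest P2P transfer, so occasional jitter on any one link stretches the whole collective. The transfer times and jitter probability below are invented for illustration.

```python
# Sketch of why jitter hurts Ring AllReduce: each of the 2*(N-1) sub-steps
# finishes only when its slowest parallel P2P transfer does (bucket effect).
# All timing values here are invented for illustration.

import random

def ring_allreduce_time(n_nodes: int, p2p_times_per_step) -> float:
    # p2p_times_per_step: one list per sub-step, holding the completion
    # times of that step's parallel P2P transfers; the step lasts as long
    # as its slowest transfer, and the steps run sequentially.
    assert len(p2p_times_per_step) == 2 * (n_nodes - 1)
    return sum(max(step) for step in p2p_times_per_step)

random.seed(0)
N = 8
steady = [[1.0] * N for _ in range(2 * (N - 1))]
# 10% chance any transfer is delayed 5x by a jitter event:
jittery = [[1.0 + (random.random() < 0.1) * 4.0 for _ in range(N)]
           for _ in range(2 * (N - 1))]
print(ring_allreduce_time(N, steady), ring_allreduce_time(N, jittery))
```

Because a single delayed transfer stalls an entire sub-step, the collective's slowdown grows with node count: the more parallel transfers per step, the more likely at least one hits a jitter event.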


4. Ultra-High Stability Requirements


The birth of the Transformer opened the prelude to the rapid evolution of large models: over the past five years, model sizes have grown from 61M to 540B parameters, a nearly 10,000-fold increase. Cluster computing power determines AI model training speed: training GPT-3 on a single V100 would take 335 years, while a 10,000-V100 cluster with perfect linear scaling would need only about 12 days.
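The scaling claim above is simple arithmetic under the perfect-linear-scaling assumption:

```python
# Quick check of the scaling claim: a job needing ~335 single-V100 years
# finishes in roughly 12 days on 10,000 V100s under perfect linear scaling.

def cluster_days(single_gpu_years: float, n_gpus: int) -> float:
    return single_gpu_years * 365 / n_gpus

print(f"{cluster_days(335, 10_000):.1f} days")  # 12.2 days
```

In practice scaling is sublinear, and the gap between the ideal and achieved time is largely determined by the communication efficiency and stability of the cluster network discussed here.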


The availability of the network largely determines the computational stability of the entire cluster. On one hand, the network's failure domain is large: the failure of a single network node in the cluster may break connectivity to dozens or more computing nodes, reducing the cluster's usable computing power. On the other hand, network performance fluctuations have broad impact: as a shared resource, the network is harder to isolate than an individual computing node, and its fluctuations affect the utilization of all computing resources. Maintaining a stable and efficient network throughout an AI large model training task is therefore an extremely important goal, and it poses new challenges for network operation and maintenance.


Once a failure occurs during a training task, fault-tolerant replacement or elastic scaling may be needed to handle the failed node. When the placement of participating nodes changes, the current communication pattern may no longer be optimal, and job rescheduling is required to restore overall training efficiency. Moreover, some network failures (e.g., silent packet loss) are unpredictable; when they occur, they not only reduce collective communication efficiency but can also trigger timeouts in the communication library, stalling the training job for long periods and severely hurting training efficiency. It is therefore necessary to collect fine-grained telemetry on service-flow throughput, packet loss, and the like, so that fault avoidance and self-healing can be completed within seconds.


5. Network Automation Deployment Requirements


Intelligent lossless networks are typically built on RDMA protocols and congestion-control mechanisms, which bring a complex and diverse set of configurations. Misconfiguring any one parameter can degrade service performance or cause unexpected problems. Statistics show that more than 90% of high-performance network failures stem from configuration errors, chiefly because NICs expose many configuration parameters, whose number depends on the architecture version, the type of service, and the NIC model. The sheer scale of AI large model training clusters further increases configuration complexity. Efficient, automated deployment configuration can therefore markedly improve the reliability and efficiency of large model cluster systems. It requires the ability to deploy multiple configurations in parallel, automatically select parameters for the congestion-control mechanism, and choose the relevant configuration according to NIC type and service type.
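The selection-and-parallel-deployment workflow might be sketched as below. Everything here is hypothetical: the NIC type names, parameter names, values, and the `configure_host` helper are invented placeholders; a real tool would push validated, vendor-specific profiles to each host.

```python
# Hypothetical sketch of automated, parallel NIC configuration. NIC types,
# parameter names, and values are invented placeholders, not real settings.

from concurrent.futures import ThreadPoolExecutor

PROFILES = {
    # (nic_type, service_type) -> congestion-control parameters (illustrative)
    ("nic_a", "training"): {"cc_enabled": 1, "ecn_threshold_kb": 150},
    ("nic_a", "storage"):  {"cc_enabled": 1, "ecn_threshold_kb": 400},
    ("nic_b", "training"): {"cc_enabled": 1, "ecn_threshold_kb": 100},
}

def configure_host(host: str, nic_type: str, service: str) -> tuple:
    params = PROFILES[(nic_type, service)]  # select by NIC and service type
    # A real implementation would push params to the host here (ssh, gNMI, ...).
    return host, params

hosts = [f"gpu-node-{i:03d}" for i in range(4)]
with ThreadPoolExecutor() as pool:  # deploy to many hosts in parallel
    results = list(pool.map(lambda h: configure_host(h, "nic_b", "training"),
                            hosts))
print(results[0])
```

The design point is the profile table: parameters are chosen from pre-validated combinations keyed by NIC and service type, rather than set by hand per host, which is where most misconfiguration arises.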


Likewise, under complex architecture and configuration conditions, fast and accurate fault localization during operation effectively safeguards overall business efficiency. Automated fault detection can, on one hand, quickly characterize a problem and push it precisely to administrators; on the other, it reduces the cost of troubleshooting by rapidly locating the root cause of a problem and providing a solution.


6. NADDOD Helps Build Lossless and High-Performance AI Networks

As a leading provider of end-to-end optical networking solutions, NADDOD is committed to building a digitally connected world through innovative computing and network solutions. We continually provide users with innovative, efficient, and reliable products, solutions, and services, offering optimal combinations of switches, AOC/DAC cables, optical modules, smart NICs, DPUs, and GPUs for data centers, high-performance computing, edge computing, artificial intelligence, and other application scenarios. With low cost and excellent performance, we significantly improve customers' business acceleration capability.


Our lossless network solutions based on InfiniBand and RoCE provide users with a lossless network environment and high-performance computing capability. For different application scenarios and user requirements, we select the optimal solution to deliver high-bandwidth, low-latency, high-performance data transmission, effectively eliminating network bottlenecks and improving network performance and user experience.