Fully Lossless Ethernet Network for HPC (High-Performance Computing)
Data centers are evolving into centers of computing power, and the computing clusters inside them keep growing. As a result, the demand for high-performance interconnection networks between computing nodes is rising steadily. The network has become an integral part of a data center's computing power, and the deep integration of computing and networking has become a trend.
1. High-Performance Computing Workloads Place Higher Demands on Network Infrastructure
With new technologies such as 5G, big data, the Internet of Things (IoT), and AI being integrated into every aspect of society, it is foreseeable that within the next twenty to thirty years humanity will enter an intelligent society built on the digital world, in which everything is sensed, interconnected, and intelligent. Data center computing power has become a new productive force, and the key measure of a data center has shifted from resource scale to computing power scale. The concept of the computing power center is now widely accepted in the industry. As data centers evolve into computing power centers, networks play a vital role in enabling high-performance computing within them, and improving network performance can significantly raise the energy efficiency of data center computing power.
To increase computing power, the industry is advancing on several fronts. Single-core gains from process scaling are approaching their limit at the 3 nm node. Scaling up by stacking more cores raises the power consumed per unit of computing power as the core count grows: going from 128 cores to 256 cores doubles the core count, yet overall computing power rises by less than a factor of 1.2. The evolution of computing units is approaching its ceiling, and Moore's Law, popularly summarized as performance doubling every 18 months, is nearing its limits. To meet the demand for computing power, HPC (High-Performance Computing) clusters have become the norm. As that demand grows from petascale (P-scale) to exascale (E-scale), computing clusters keep expanding, which places ever higher requirements on the performance of the interconnection network.
HPC (High-Performance Computing) refers to the utilization of aggregated computing power to tackle the most complex scientific computing problems in research and industry that cannot be accomplished by standard workstations, including simulations, modeling, and rendering, among others. Due to the need for extensive computations, a single general-purpose computer is unable to complete the tasks within a reasonable time frame, or the available resources are insufficient to handle the large amount of data required, making the computations practically infeasible. One approach is to use specialized or high-end hardware for processing, but it often remains challenging to meet the performance requirements while also being costly. Currently, a commonly used approach in the industry is to integrate the computing power of multiple units, distributing data and computations across these units effectively to overcome these limitations.
The interaction between computing nodes in high-performance computing (HPC) imposes different requirements on network performance, which can be broadly categorized into three typical scenarios.
- Loosely coupled computing scenario: The interdependence between computing nodes is relatively low, so the demands on network performance are also relatively low. Typical examples include financial risk assessment, remote sensing and mapping, and molecular dynamics.
- Tightly coupled computing scenario: Computing nodes depend heavily on coordination with one another, synchronized computation, and high-speed information transmission. Examples include electromagnetic simulation, fluid dynamics, and automotive collision simulation. This scenario requires a low-latency network.
- Data-intensive computing scenario: Computing nodes handle a large amount of data and generate significant intermediate data during computation. Examples include weather forecasting, gene sequencing, graphic rendering, and energy exploration. This scenario requires a high-throughput network and also places certain requirements on network latency.
In summary, high-performance computing (HPC) demands high throughput and low latency from the network. To achieve both, the industry generally adopts Remote Direct Memory Access (RDMA) in place of TCP, which reduces latency and lowers CPU utilization on servers. However, RDMA is highly sensitive to packet loss: according to commonly cited figures, a loss rate of only 0.01% can cause RDMA throughput to drop to zero. A lossless network therefore becomes an important requirement.
2. Evolution of High-Performance Computing Networks
Traditional data center networks are typically built as multi-hop symmetric architectures on Ethernet and rely on the TCP/IP protocol stack for transmission. Although TCP/IP networking has matured over 30 years of development, its inherent technical characteristics make it poorly suited to the demands of high-performance computing. RDMA has gradually replaced TCP/IP as the preferred protocol for HPC networks. At the same time, the network that carries RDMA has evolved from expensive lossless networks based on the InfiniBand (IB) protocol to intelligent lossless networks based on Ethernet. NADDOD's technical experts will now explain the reasons behind these replacements and evolutions.
<1>From TCP to RDMA
TCP/IP network communication increasingly falls short of the demands of high-performance computing because of the following two main limitations:
- Limitation 1: TCP/IP protocol stack introduces tens of microseconds of latency
When receiving/sending packets, the TCP protocol stack requires multiple context switches in the kernel, each of which incurs a latency of approximately 5-10 microseconds. Additionally, there are at least three data copies and protocol encapsulation operations that rely on the CPU, resulting in tens of microseconds of fixed latency caused by the protocol stack alone. This makes the protocol stack latency the most prominent bottleneck in microsecond-level systems such as AI data processing and distributed SSD storage.
- Limitation 2: TCP protocol stack increases server CPU load
Beyond the long fixed latency, TCP/IP networking requires the host CPU to take part in multiple memory copies within the protocol stack. As network scale and bandwidth grow, the CPU spends more of its time scheduling and moving data, which leads to sustained high CPU load. A common industry rule of thumb is that each 1 bit/s of network throughput consumes about 1 Hz of CPU clock. By that estimate, once network bandwidth exceeds 25 Gbit/s at full load, most servers must dedicate at least half of their CPU capacity to data transmission.
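The two limitations above can be illustrated with rough arithmetic. The Python sketch below is not a measurement. The cost of 5-10 microseconds per context switch and the rule of roughly 1 Hz of CPU per bit of traffic come from the text, while the number of context switches (3) and the server configuration (20 cores at 2.5 GHz) are hypothetical values chosen only for illustration.

```python
# Back-of-envelope estimates for the two TCP/IP limitations.
# The server parameters below are hypothetical, chosen only for illustration.

def stack_latency_us(context_switches, per_switch_us):
    """Fixed latency contributed by kernel context switches alone."""
    return context_switches * per_switch_us


def cpu_share_for_network(link_gbps, cores, core_ghz):
    """Fraction of total CPU cycles used at full link load,
    assuming about 1 Hz of CPU per 1 bit/s of traffic."""
    required_ghz = link_gbps          # 1 Gbit/s -> 1 GHz under the rule of thumb
    available_ghz = cores * core_ghz  # aggregate clock of all cores
    return required_ghz / available_ghz


# Assumed: 3 context switches on the packet path, 5-10 us each (range from the text).
low = stack_latency_us(3, 5)
high = stack_latency_us(3, 10)

# Assumed: a server with 20 cores at 2.5 GHz, i.e. 50 GHz in aggregate.
share_25g = cpu_share_for_network(25, cores=20, core_ghz=2.5)
share_100g = cpu_share_for_network(100, cores=20, core_ghz=2.5)

print(f"context-switch latency: {low}-{high} us")
print(f"CPU share at 25 Gbit/s:  {share_25g:.0%}")
print(f"CPU share at 100 Gbit/s: {share_100g:.0%}")
```

Under these assumptions, a 25 Gbit/s link already consumes half of the server's cycles, and a 100 Gbit/s link would need more cycles than the server has. This is why offloading the transport from the CPU becomes necessary as bandwidth grows.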
To reduce network latency and CPU utilization, RDMA has been introduced on the server side. RDMA is a direct memory access technology that transfers data directly from the memory of one computer to the memory of another without involving either operating system, bypassing time-consuming processor operations. The result is high bandwidth, low latency, and low resource utilization.
<2>From IB to RoCE
As shown in the diagram below, RDMA's kernel bypass mechanism lets applications read and write data directly through the network card, avoiding the limitations of the TCP/IP stack and reducing protocol stack latency to nearly 1 microsecond. In addition, RDMA's zero-copy mechanism enables the receiving end to read data directly from the sender's memory, greatly reducing the CPU burden and improving CPU efficiency. For example, a 40 Gbps TCP/IP flow can exhaust all CPU resources in mainstream servers. With RDMA at the same 40 Gbps, CPU utilization drops from 100% to 5%, and network latency decreases from milliseconds to below 10 microseconds.
Currently, there are three options for carrying RDMA over a network: InfiniBand, iWARP (Internet Wide Area RDMA Protocol), and RoCE (RDMA over Converged Ethernet).
- InfiniBand is a network protocol designed specifically for RDMA. It is specified by the IBTA (InfiniBand Trade Association) and guarantees lossless networking at the hardware level, providing high throughput and low latency. However, InfiniBand requires dedicated switches and network cards that come from a small number of vendors. Most existing networks use IP over Ethernet, so adopting InfiniBand cannot meet interoperability requirements. The closed ecosystem also poses a vendor lock-in risk for business systems that will need large-scale elastic expansion in the future.
- iWARP is a network protocol that performs RDMA over TCP. It requires special network cards that support iWARP and allows RDMA to run on standard Ethernet switches. However, because it inherits the limitations of TCP, iWARP gives up much of the performance advantage of the other RDMA protocols.
- RoCE is a network protocol that enables remote memory access over Ethernet. It is also specified by the IBTA and applies RDMA technology to Ethernet. RoCE runs on standard Ethernet switches and requires only network cards that support RoCE. There are currently two versions: RoCEv1 is a link-layer protocol that allows direct access between any two hosts within the same broadcast domain, while RoCEv2 is a network-layer protocol that encapsulates RDMA traffic in UDP/IP, so it can be routed at Layer 3 and reach hosts in different broadcast domains. However, because RDMA is sensitive to packet loss and traditional Ethernet is best-effort, the Ethernet network must be made lossless for RoCE to perform well.
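To make the UDP-based encapsulation of RoCEv2 more concrete, the Python sketch below packs the headers that sit between the IP header and the RDMA payload. It is a simplified illustration, not a working RoCE implementation. The UDP destination port 4791 and the 12-byte InfiniBand Base Transport Header (BTH) layout follow the published specifications. The queue pair number, sequence number, source port, and payload are made-up example values, the ICRC is a zero placeholder rather than a computed checksum, and the Ethernet and IP headers are omitted.

```python
import struct

ROCEV2_UDP_PORT = 4791   # IANA-assigned UDP destination port for RoCEv2
RC_SEND_ONLY = 0x04      # BTH opcode: Reliable Connection, SEND Only


def build_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False):
    """Pack a 12-byte InfiniBand Base Transport Header (BTH)."""
    flags = 0  # SE, M, PadCnt and TVer fields all left at zero in this sketch
    qp_word = dest_qp & 0xFFFFFF                       # 8 reserved bits + 24-bit QP
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)  # AckReq bit + 24-bit PSN
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


def build_udp_header(src_port, payload_len):
    """Pack an 8-byte UDP header; the checksum is left at 0 in this sketch."""
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT, 8 + payload_len, 0)


# Example values below are hypothetical.
payload = b"x" * 1024
bth = build_bth(RC_SEND_ONLY, dest_qp=0x12, psn=0, ack_req=True)
icrc = b"\x00" * 4  # placeholder: a real NIC computes the 4-byte invariant CRC
udp_payload = bth + payload + icrc
udp = build_udp_header(src_port=49152, payload_len=len(udp_payload))
datagram = udp + udp_payload

print(f"BTH bytes: {len(bth)}, UDP header bytes: {len(udp)}")
print(f"UDP datagram bytes: {len(datagram)}")
```

Because the RDMA transport header travels inside an ordinary UDP/IP datagram, any Layer 3 Ethernet switch can route it. RoCEv1, in contrast, places the InfiniBand headers directly in an Ethernet frame, which confines it to a single broadcast domain.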
With the growing demands of data centers and high-performance computing, RDMA technology will continue to play an important role as a high-performance, low-latency data transfer technology. Whether choosing InfiniBand technology or RDMA-capable Ethernet technologies, users and vendors need to make choices based on their specific requirements and practical considerations. InfiniBand technology has wide application and a mature ecosystem in the field of supercomputing, while RoCE and iWARP are more suitable for high-performance computing and storage scenarios in Ethernet environments.
As a leading provider of comprehensive optical network solutions, NADDOD is committed to offering innovative, efficient, and reliable products, solutions, and services to its customers. NADDOD's high-performance switches, AOC/DAC/optical modules, and intelligent network cards provide a complete set of solutions based on InfiniBand and lossless Ethernet (RoCE) to meet different application requirements, helping users accelerate their business and improve performance in high-performance computing, artificial intelligence, and storage. Visit the NADDOD official website for more information!