EDR InfiniBand vs. 100Gb Ethernet: A Performance Comparison
InfiniBand vs. Ethernet
From a design perspective, InfiniBand and Ethernet are quite different. InfiniBand is a network interconnect technology widely used in supercomputing clusters for its high reliability, low latency, and high bandwidth. With the advances in artificial intelligence, it has also become the preferred interconnect for GPU servers.
Ethernet, on the other hand, has been the most widely used communication protocol in local area networks since its introduction on September 30, 1980. Unlike InfiniBand, Ethernet was designed primarily to let information flow easily between many different systems; it is a classic distributed, compatibility-oriented design. Traditional Ethernet networks are built primarily on TCP/IP, but as RoCE (RDMA over Converged Ethernet) and iWARP technologies and their ecosystems have matured, RDMA is also being widely adopted on Ethernet networks.
In a performance comparison of IB, RoCE, and Ethernet conducted by the University of New Mexico, the tests used Mellanox ConnectX-4 EDR HCAs (capable of operating in both Ethernet and IB modes). When operating at the InfiniBand link layer, communication went through Mellanox MSB7700-ES2F EDR IB switches; for Ethernet link-layer operation, it went through 100Gb Juniper QFX5200 data center switches. Performance testing was conducted using OpenMPI 1.10.3 and the OSU Micro-Benchmarks.
In terms of latency, for small 8-byte messages InfiniBand showed roughly a 10% improvement over RoCE, while native Ethernet exhibited significantly higher latency, typically around 10 microseconds. For latency, native Ethernet is effectively an order of magnitude behind the RDMA technologies.
iWARP vs. RoCE
1. RoCE Faction:
Led by Mellanox, this camp argues that iWARP, being built on TCP, is bound to a reliable transport: it supports only reliable, connection-oriented service and therefore cannot implement multicast. iWARP shares TCP's port space, which complicates flow management because a port number alone cannot indicate whether a message carries RDMA or ordinary TCP traffic. Since iWARP also shares the protocol number space with traditional TCP, per-connection context (state) is needed to determine whether a packet is iWARP; this context typically does not fit in the NIC's on-chip memory, which makes the NIC hardware design more complex and packet parsing slower. In contrast, a RoCE packet can be identified simply by examining its UDP destination port field: if the value matches the IANA-assigned RoCE port, the packet is RoCE. Because RoCE encapsulation includes IP and UDP headers, it can run on both L2 and L3 networks and supports Layer 3 routing, which extends the RDMA network across multiple subnets.
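The demultiplexing contrast above can be illustrated with a minimal sketch: classifying a RoCEv2 packet needs only the UDP destination port (IANA assigns port 4791 to RoCEv2), with no per-connection state. The function below is a hypothetical illustration, not production packet-parsing code.

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2


def is_rocev2(udp_header: bytes) -> bool:
    """Return True if the UDP header's destination port is the RoCEv2 port.

    Expects at least the first 4 bytes of a UDP header: source port
    (2 bytes) then destination port (2 bytes), big-endian per network
    byte order. No connection context is consulted -- this single field
    is enough, which is the RoCE camp's point.
    """
    if len(udp_header) < 4:
        return False
    _src, dst = struct.unpack("!HH", udp_header[:4])
    return dst == ROCEV2_UDP_PORT


# Example: a header aimed at port 4791 is classified as RoCEv2,
# while one aimed at port 80 is ordinary traffic.
rocev2_hdr = struct.pack("!HHHH", 49152, 4791, 0, 0)
plain_hdr = struct.pack("!HHHH", 49152, 80, 0, 0)
print(is_rocev2(rocev2_hdr))  # True
print(is_rocev2(plain_hdr))   # False
```

An iWARP classifier, by contrast, could not be written as a pure function of the header: it would need a lookup into per-connection TCP state to decide whether a given stream carries RDMA.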
In the performance testing shown in the figure below, at link speeds of 25G, 40G, and 100G, iWARP performs significantly worse than RoCE: RoCE delivers messages faster across the board.
2. iWARP Faction:
Led by Chelsio and Intel, this camp argues that RoCEv2 requires deploying switches with DCB/ETS/PFC capabilities. In short, RoCEv2 has many issues: poor scalability, no routing (in RoCEv1), difficult deployment and management, weak congestion control, sensitivity to network link quality, and a lack of robustness in real-world network environments. iWARP, by contrast, places no special requirements on physical devices, is fully compatible with existing infrastructure, and is essentially plug-and-play.
In terms of performance, since both iWARP and RoCEv2 are implemented in silicon, the difference is negligible; both use the same verbs interface and PCIe path for sending and receiving Ethernet packets. End-to-end application data transfer latency consists of multiple components; the figure below shows the relative proportions contributed by software (SW), firmware (FW), the physical cable (WIRE), and other hardware components. NIC hardware processing (CORE) accounts for roughly 10% of the total latency, and hardware TCP processing is the smallest slice of even that share. It must be said that this line of argument is somewhat far-fetched.
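The decomposition argument can be made concrete with a small arithmetic sketch. The only figure the text gives is CORE at roughly 10%; every other fraction and the total latency below are illustrative assumptions, not values from the referenced chart.

```python
# Hypothetical latency breakdown as fractions of end-to-end latency.
# Only CORE (~10%) is stated in the text; the rest are illustrative.
components = {
    "SW": 0.45,    # host software stack (assumed)
    "FW": 0.15,    # adapter firmware (assumed)
    "CORE": 0.10,  # NIC hardware processing, ~10% per the chart
    "WIRE": 0.20,  # physical cable / serialization (assumed)
    "OTHER": 0.10, # remaining hardware components (assumed)
}

total_latency_us = 2.0  # assumed one-way end-to-end latency in microseconds

for name, share in components.items():
    print(f"{name}: {share * total_latency_us:.2f} us")

# The iWARP camp's point: hardware TCP processing is only a fraction of
# CORE's ~10%, so even doubling it would barely move the end-to-end number.
```

Under these assumed numbers, CORE contributes only about 0.2 microseconds of a 2-microsecond path, which is why the iWARP camp treats TCP processing in silicon as negligible.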
InfiniBand networks offer significant advantages in speed, low latency, scalability, reliability, and efficiency. They have been widely adopted in industries such as high-performance computing (HPC), financial services, life sciences, and AI/ML to gain a competitive edge and handle data-intensive workloads efficiently. However, Ethernet's widespread adoption, mature existing infrastructure, and backward compatibility may make switching to InfiniBand economically unfeasible or simply unnecessary in certain scenarios. Ethernet has also evolved through technologies like RoCE and iWARP, which bring RDMA capabilities to Ethernet networks and narrow the performance gap with InfiniBand.
Ultimately, the choice between InfiniBand and Ethernet depends on the specific requirements of the network and the workloads being handled. InfiniBand is generally preferred for high-performance computing and data-intensive applications that require low latency and high bandwidth, where the benefits of its design and capabilities outweigh the potential cost and complexity of implementation. Ethernet, on the other hand, is more widely adopted and compatible, making it a practical choice for general-purpose networking needs and environments where the performance requirements may not be as demanding.
It's worth noting that technology is continually evolving, and new advancements in Ethernet may further bridge the performance gap with InfiniBand in the future. Therefore, it's important to evaluate the specific needs and available technologies when making a decision between InfiniBand and Ethernet.