InfiniBand Trend Review: Beyond Bandwidth and Latency
InfiniBand History: High Demand for Bandwidth and Low Latency
Distributed systems were created more than three decades ago. Since then, moving more bits over a copper wire or optical fiber cable at a lower cost per bit has been the main driving force behind data center networking.
For most of that time, InfiniBand networking has also focused on reducing latency through Remote Direct Memory Access (RDMA) technology. It offers further capabilities as well, including full transport offload, adaptive routing, congestion control, and quality of service for workloads running across a network of compute engines.
This relentless pursuit of higher bandwidth and lower latency has served InfiniBand well, and must continue into the future, but its ongoing evolution depends on many other technologies working in concert to deliver more scale, the lowest possible application latency, and more flexible topologies. This stands in contrast to the first 15 years or so of InfiniBand technology, during which sufficient innovation was attained simply by driving down port-to-port hop latency within the switch or latency across a network interface card that links a server to the network.
As bandwidth rates increase, forward error correction needs to compensate for higher bit error rates during transmission and this means that latency across the switch will stay flat — at best — and will likely increase with each generation of technology; this holds true for any InfiniBand variant as well as for Ethernet and indeed any proprietary interconnect. So, latency improvements must be found elsewhere in the stack.
Furthermore, work that has traditionally been done on the host servers in a distributed computing system needs to be moved off the very expensive, general-purpose CPU cores, which run application code (or manage the code offloaded to GPU accelerators), and onto network devices. Those devices could be the switch ASIC itself, the network interface ASIC, or a full-blown Data Processing Unit (DPU). The DPU is a new part of the InfiniBand stack, and it is important because it can virtualize networking and storage for the host and run security software without putting that burden on the host CPUs.
The combination of all of these technologies and techniques will keep InfiniBand on the cutting edge of interconnects for HPC, AI, and other clustered systems.
“If you look at InfiniBand, it gives us two kinds of services,” Gilad Shainer, senior vice president of networking at NVIDIA’s Networking division, explained in an InfiniBand Birds of a Feather session at the recent International Supercomputing Conference. “It gives us the networking services, which have the highest throughput out there, running at 200Gb/s for more than two years now, and moving to 400Gb/s later this year. In addition, it provides computing services via pre-configured and programmable In-Network Computing engines.”
Mellanox, which was acquired by NVIDIA in April 2020, was the first significant commercial player to bring InfiniBand to the high-performance computing sector. The company’s InfiniBand roadmap dates back to early 2001, starting with Mellanox providing 10Gb/s SDR InfiniBand silicon and boards for switches and network interface cards.
This was followed in 2004 by 20Gb/s DDR InfiniBand, which marked the first time Mellanox sold not only silicon and boards but also systems. In 2008, QDR InfiniBand doubled the speed again to 40Gb/s. This is also when Mellanox entered the InfiniBand cable business with its LinkX line.
In 2011, FDR InfiniBand raised the speed to 56Gb/s, and Mellanox expanded into fabric software. 100Gb/s EDR InfiniBand debuted in 2015 and 200Gb/s HDR InfiniBand came in 2018, incorporating in-switch HPC acceleration technology for the first time. (The ConnectX family of adapters has long performed network offload from the host server.)
InfiniBand Future: Beyond High Bandwidth and Low Latency
InfiniBand suppliers like NVIDIA that implement the spec after it is released should experience fewer speed bumps in the future than they did in the middle of the 2000s.
Customers can use HDR InfiniBand either to gang up four lanes, each with an effective speed of 50Gb/s, to deliver 200Gb/s per port, or to run the ports with only two lanes, holding each port at 100Gb/s like EDR InfiniBand. (NVIDIA does this in the Quantum InfiniBand ASICs, which were introduced in late 2016 and began shipping a year or so later.) Because two-lane ports raise the number of ports per switch, customers who don’t require higher bandwidth can flatten their networks this way, removing hops from their topologies while also doing away with some switching. (It’s also intriguing to consider how a 400Gb/s ASIC’s radix might be doubled once more to produce a switch with an even higher radix as well as ports that operate at 200Gb/s and 400Gb/s speeds.)
NDR, which NVIDIA is implementing with its Quantum-2 ASICs, is the next stop on the InfiniBand train and has 400Gb/s ports using four lanes. Each Quantum-2 ASIC has 256 SerDes running at 51.6 GHz with PAM-4 encoding, delivers 64 ports at 400Gb/s for an aggregate bandwidth of 25.6Tb/s unidirectional or 51.2Tb/s bidirectional, and can process 66.5 billion packets per second.
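The Quantum-2 port and bandwidth figures quoted above follow from the SerDes count, which a few lines of arithmetic can confirm. The Python sketch below is only a back-of-the-envelope check: the function name `switch_capacity` is made up for illustration, and the two-lane configuration at the end is a hypothetical split-port mode of the kind speculated about earlier, not a published specification.

```python
def switch_capacity(serdes, lanes_per_port, port_speed_gbps):
    """Derive port count and aggregate bandwidth from SerDes figures."""
    ports = serdes // lanes_per_port
    lane_speed_gbps = port_speed_gbps / lanes_per_port
    uni_tbps = ports * port_speed_gbps / 1000  # one direction
    return {
        "ports": ports,
        "lane_gbps": lane_speed_gbps,
        "uni_tbps": uni_tbps,
        "bidi_tbps": 2 * uni_tbps,
    }


# Figures taken from the text: 256 SerDes, four lanes per 400Gb/s port.
ndr = switch_capacity(serdes=256, lanes_per_port=4, port_speed_gbps=400)
print(ndr)

# Hypothetical (assumed, not from the text): the same ASIC run with
# two lanes per port at 200Gb/s. Port count doubles while aggregate
# bandwidth stays the same.
split = switch_capacity(serdes=256, lanes_per_port=2, port_speed_gbps=200)
print(split)
```

The second call illustrates why a two-lane mode is attractive: the aggregate bandwidth is unchanged, but twice as many endpoints can hang off a single switch, which is what allows a topology to lose a tier.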
The XDR speed, which offers 800Gb/s per port, comes next on the InfiniBand roadmap, and the projected final stop (there will undoubtedly be more) is 1.6Tb/s using four lanes per port. Due to forward error correction, the latency will probably increase slightly with each step on the InfiniBand roadmap, but NVIDIA is introducing additional technologies to counteract this rising latency. Furthermore, because Ethernet needs the same forward error correction and will see its port latencies rise as well, the latency gap between the two technologies should remain largely unchanged.
Shainer believes that the right side of the block diagram above—the side dealing with network services—is more significant than the left side, which is concerned with the growing capacities and capabilities of the raw InfiniBand transport and protocol.
The more intriguing solution, according to him, is to integrate compute engines into the silicon of the InfiniBand network, either in the adapter or the switch, so that applications can run as data moves through the network. “There are engines that are already set up to carry out very specific tasks, like data reductions, which are typically carried out by host CPUs. As you add more nodes, data reductions will take longer and longer because there is a greater amount of overhead and data being moved around. Our ability to move all data reduction operations into the network will enable flat latency in a single digit microsecond regardless of system size, reduce data motion, lower overhead, and provide significantly better performance. Additionally, we have MPI tag matching and all-to-all engines operating at wire speed for small messages at 200Gb/s and 400Gb/s speeds.”
These in-network accelerations first appeared in the ConnectX adapter cards, which offloaded certain parts of the MPI stack. Mellanox then added support for what it calls Scalable Hierarchical Aggregation Protocol, or SHARP for short, to perform data reductions inside the network with the Switch-IB 2 ASICs for EDR 100Gb/s InfiniBand, a second-generation chip operating at that speed that was announced in November 2015.
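To see why reducing data inside the network scales so well, it can help to model the switches as an aggregation tree. The Python sketch below is a toy model, not the SHARP protocol itself: the function `tree_reduce` and the radix of 32 children per switch are assumptions chosen purely for illustration. Each "switch" sums the values from its children and forwards a single result upward, so the number of aggregation levels grows only logarithmically with node count.

```python
def tree_reduce(values, radix):
    """Sum values through a tree where each switch aggregates up to
    `radix` children. Returns (total, number_of_tree_levels)."""
    if not values:
        raise ValueError("need at least one value")
    if radix < 2:
        raise ValueError("radix must be at least 2")
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Each switch at this level combines one group of children
        # into a single partial result.
        level = [sum(level[i:i + radix]) for i in range(0, len(level), radix)]
        levels += 1
    return level[0], levels


# Hypothetical radix of 32; node counts chosen for illustration,
# with 7,861 matching the Frontera size mentioned in this article.
for nodes in (64, 1024, 7861):
    total, levels = tree_reduce(range(1, nodes + 1), radix=32)
    print(nodes, "nodes ->", levels, "aggregation levels, sum =", total)
```

In this model, going from 64 nodes to several thousand adds only one more level of aggregation, which is the intuition behind the near-flat latency claimed for in-network reductions. Real fabrics add link latency, message sizes, and topology constraints that the sketch ignores.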
The “Summit” supercomputer at Oak Ridge National Laboratory and the “Sierra” supercomputer at Lawrence Livermore National Laboratory, both of which began deployment in late 2017 and became fully operational in 2018, were specifically designed to take advantage of this capability.
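Since then, Mellanox (and now NVIDIA) has added all-to-all and MPI tag matching to the data reductions, expanding the in-network functions. In MPI tag matching, each incoming message is paired with a receive that the application has posted, based on the message’s source, tag, and communicator; doing that matching in the adapter takes the work off the host CPU.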
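As a rough illustration of what tag matching involves, here is a minimal Python sketch of the two queues an MPI implementation typically keeps: receives that have been posted but not yet satisfied, and messages that arrived before any matching receive was posted. The class `TagMatcher`, its method names, and the `ANY` wildcard are hypothetical stand-ins (the wildcard mimics `MPI_ANY_SOURCE` and `MPI_ANY_TAG`); communicators are left out for brevity, and this says nothing about how ConnectX hardware implements the offload.

```python
from collections import deque

ANY = -1  # stand-in for MPI_ANY_SOURCE / MPI_ANY_TAG


class TagMatcher:
    def __init__(self):
        self.posted = deque()      # (source, tag) receives awaiting a message
        self.unexpected = deque()  # (source, tag, payload) with no receive yet

    @staticmethod
    def _matches(recv, msg):
        (rsrc, rtag), (msrc, mtag) = recv, msg
        return rsrc in (ANY, msrc) and rtag in (ANY, mtag)

    def post_recv(self, source, tag):
        """Application posts a receive. Returns the payload if a message
        is already waiting, otherwise queues the receive and returns None."""
        for i, (msrc, mtag, payload) in enumerate(self.unexpected):
            if self._matches((source, tag), (msrc, mtag)):
                del self.unexpected[i]
                return payload
        self.posted.append((source, tag))
        return None

    def on_arrival(self, source, tag, payload):
        """A message arrives. Returns the matched posted receive, or
        queues the message as unexpected and returns None."""
        for i, recv in enumerate(self.posted):
            if self._matches(recv, (source, tag)):
                del self.posted[i]
                return recv
        self.unexpected.append((source, tag, payload))
        return None


m = TagMatcher()
m.post_recv(source=0, tag=7)          # queued: nothing has arrived yet
m.on_arrival(1, 7, "from rank 1")     # wrong source, held as unexpected
print(m.on_arrival(0, 7, "from rank 0"))  # matches the posted receive
print(m.post_recv(ANY, 7))            # wildcard picks up the held message
```

Both queues are searched in order, which preserves the first-posted, first-matched behavior MPI requires. On a busy host those linear searches cost CPU cycles on every message, which is the work the adapter takes over when tag matching is offloaded.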
The NDR 400Gb/s Quantum-2 switches, ConnectX-7 adapters, and the NVIDIA SHARP 3.0 stack enable even more in-network processing as more and more operations are accelerated in the InfiniBand switches. The NVIDIA BlueField-3 DPU, meanwhile, arrives with five times the compute capacity of the BlueField-2 DPU it replaces. With it, NVIDIA will be able to speed up collective operations and offload active message processing, smart MPI progression, data compression, and user-defined algorithms to the DPU’s Arm cores, further relieving the host systems.
What matters is how all of this works together to improve overall application performance. DK Panda of Ohio State University is, as usual, pushing the performance envelope with his MVAPICH2 hybrid MPI and PGAS libraries, and is running performance tests on the “Frontera” all-CPU system at the Texas Advanced Computing Center at the University of Texas. The MVAPICH2 stack is presented below for those who are unfamiliar with it.
The performance gains from using the NVIDIA SHARP in conjunction with the MVAPICH2 stack are substantial.
With SHARP running in conjunction with MVAPICH2 on the MPI_Barrier workload across the full Frontera system at one process per node (1ppn in the chart below), scaling is nearly flat, meaning latency does not increase as node counts go up. With SHARP disabled on the same machine, latency increases exponentially as the number of nodes grows; at the full system configuration of 7,861 nodes, latency with SHARP is 9X lower. With 16 processes per node (16ppn), latencies are lower overall, and turning SHARP on makes only a 3.5X difference across 1,024 nodes.
SHARP’s performance advantage on MPI_Reduce operations ranges from a low of 2.2X to a high of 6.4X, depending on the number of nodes and the number of processes per node. On MPI_Allreduce operations, the advantage ranges from 2.5X to 7.1X, again depending on the number of nodes and processes per node.
In a different set of benchmark tests, a BlueField-2 DPU was added to each system node to assist with MPI offload on a 32-node system running either 16 or 32 processes per node. The performance benefit ranged from 17 percent to 23 percent across message sizes from 64K to 512K, as shown below.
These performance advantages are cumulative and should greatly enhance the performance of HPC and even AI applications that rely on MPI operations in the adapters, switches, and DPUs where they are present.
Because of all of these factors, it is reasonable to assume that InfiniBand will continue to be used in HPC and AI systems for a very long time. InfiniBand adoption also grew more quickly over the course of 2022.