Why Choose InfiniBand for RDMA

NADDOD Brandon, InfiniBand Technical Support Engineer | Dec 25, 2023

InfiniBand, as a native network technology for Remote Direct Memory Access (RDMA), has gained popularity and widespread usage among many customers. However, what unique advantages does InfiniBand possess compared to RoCE (RDMA over Converged Ethernet), which also supports the RDMA protocol over lossless Ethernet?

 

InfiniBand vs. Ethernet

1. The Purest Form of SDN: Making Networks Efficient and Simple

InfiniBand SDN networking

InfiniBand is the first network architecture designed natively according to the principles of SDN (Software-Defined Networking). It is managed by a subnet manager, which acts as the SDN controller. Unlike traditional Ethernet, including RDMA over Converged Ethernet (RoCE), InfiniBand switches do not run any routing protocols. The forwarding tables for the entire network are computed and distributed by a centralized subnet manager. In addition to the forwarding tables, the subnet manager is responsible for managing the configuration within the InfiniBand subnet, such as partitioning and QoS (Quality of Service). InfiniBand networks no longer rely on broadcast mechanisms like ARP (Address Resolution Protocol) for forwarding table learning, eliminating broadcast storms and unnecessary bandwidth consumption.

 

InfiniBand Subnet
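
To make the centralized model concrete, here is a minimal Python sketch of how a subnet-manager-style controller could compute and install forwarding tables for every switch. It is purely illustrative: the SubnetManager and Switch classes, the topology, and the LID values are assumptions for this example, not an actual subnet manager implementation.

from collections import deque

class Switch:
    def __init__(self, name):
        self.name = name
        self.forwarding_table = {}   # destination LID -> egress port

class SubnetManager:
    """Toy central controller: computes every switch's forwarding table."""
    def __init__(self, links):
        self.links = links           # {switch: {port: neighbor switch}}

    def compute_tables(self, switches, host_lids):
        for lid, dst_switch in host_lids.items():
            # Breadth-first search outward from the destination switch, so
            # each switch learns which port leads one hop closer to the LID.
            visited = {dst_switch}
            frontier = deque([dst_switch])
            while frontier:
                cur = frontier.popleft()
                for sw, ports in self.links.items():
                    for port, neighbor in ports.items():
                        if neighbor == cur and sw not in visited:
                            visited.add(sw)
                            switches[sw].forwarding_table[lid] = port
                            frontier.append(sw)

# Tiny subnet: two switches joined by one link; host LID 5 sits behind "leaf2".
switches = {"leaf1": Switch("leaf1"), "leaf2": Switch("leaf2")}
links = {"leaf1": {1: "leaf2"}, "leaf2": {1: "leaf1"}}
SubnetManager(links).compute_tables(switches, {5: "leaf2"})
print(switches["leaf1"].forwarding_table)   # {5: 1} -- reach LID 5 via port 1

Because the manager pushes these entries down, no switch has to learn routes by flooding or broadcasting.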

On the other hand, traditional Ethernet, including RDMA over Converged Ethernet (RoCE), also supports SDN controller-based networking. However, network vendors have moved away from the earlier concept of OpenFlow-based flow-table forwarding, partly to avoid becoming mere "white-box" manufacturers, and have instead adopted solutions based on NETCONF, VXLAN (Virtual Extensible LAN), and EVPN (Ethernet Virtual Private Network). SDN controllers have evolved into more advanced "network management systems," primarily focused on deploying control policies, while forwarding still heavily relies on device-based learning, such as MAC tables, ARP tables, and routing tables. This divergence means RoCE lacks the efficient, simple networking model found in InfiniBand.

 

Ethernet Subnet
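
For contrast, the following toy sketch shows the flood-and-learn behavior a conventional Ethernet switch still depends on. The EthSwitch class and frame format are hypothetical, used only to illustrate why an unknown destination triggers flooding.

class EthSwitch:
    def __init__(self, ports):
        self.ports = ports          # list of port ids
        self.mac_table = {}         # learned: source MAC -> ingress port

    def receive(self, frame, ingress_port):
        src, dst = frame["src"], frame["dst"]
        # Learn the source address from the ingress port.
        self.mac_table[src] = ingress_port
        if dst in self.mac_table:
            return [self.mac_table[dst]]              # known unicast
        # Unknown destination (or broadcast): flood all other ports,
        # which is exactly the bandwidth cost InfiniBand avoids.
        return [p for p in self.ports if p != ingress_port]

sw = EthSwitch(ports=[1, 2, 3])
print(sw.receive({"src": "aa:01", "dst": "bb:02"}, ingress_port=1))  # flood: [2, 3]
print(sw.receive({"src": "bb:02", "dst": "aa:01"}, ingress_port=2))  # learned: [1]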

A real-life analogy helps illustrate this:

 

InfiniBand network can be compared to traveling by high-speed rail, where the entire journey is managed and coordinated by a dispatcher (subnet manager). Passengers (network traffic) who want to reach their destinations don't need to learn or search for routes themselves. They simply board the designated train (forwarding table) and enjoy an efficient and smooth journey. This mode of operation ensures high-quality and speedy travel without redundant broadcast announcements or sudden route changes.

 

In contrast, self-driving represents traditional Ethernet and RDMA over Converged Ethernet (RoCE), where navigation systems (SDN controllers) are also present to provide guidance. However, the drivers (network devices) still need to make real-time judgments and adjustments based on road conditions (device-based learning). This process may involve multiple map queries (broadcast mechanisms), waiting at traffic lights (bandwidth waste), or taking detours to avoid congestion (complex network configurations), making the entire travel process relatively less efficient.

2. Proactive Credit-Based Congestion Avoidance: Achieving Native Lossless Networking

InfiniBand networks utilize a credit-based mechanism that fundamentally avoids buffer overflow and packet loss issues. This mechanism ensures that the sender initiates packet transmission only when the receiver has sufficient credits to accept the corresponding number of messages.

 

The working principle of this credit-based mechanism is as follows: Each link in the InfiniBand network has a predetermined buffer for storing packets to be transmitted. Before transmitting data, the sender checks the available credits of the receiver, which represents the current available buffer size of the receiver. Based on this credit value, the sender decides whether to initiate packet transmission. If the receiver has insufficient credits, the sender waits until the receiver releases enough buffer space and reports new available credits.

 

Once the receiver completes forwarding, it releases the utilized buffer and continuously reports the current available buffer size to the sender. This allows the sender to dynamically adjust the packet transmission based on the receiver's buffer status. This link-level flow control mechanism ensures that the sender does not overwhelm the receiver with excessive data, effectively preventing network buffer overflow and packet loss.
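
A minimal sketch of this link-level credit exchange, under simplified assumptions, might look like the following. The Sender/Receiver classes, buffer sizes, and explicit credit refresh are illustrative only, not the actual wire protocol.

class Receiver:
    def __init__(self, buffer_slots):
        self.free_slots = buffer_slots      # available buffer entries
        self.queue = []

    def advertise_credits(self):
        return self.free_slots              # credits == free buffer space

    def accept(self, packet):
        assert self.free_slots > 0, "sender violated its credit allowance"
        self.free_slots -= 1
        self.queue.append(packet)

    def forward_one(self):
        # Forwarding a packet frees a buffer slot, which is later
        # reported back to the sender as a new credit.
        if self.queue:
            self.queue.pop(0)
            self.free_slots += 1

class Sender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.credits = receiver.advertise_credits()

    def send(self, packet):
        if self.credits == 0:
            # No credits: wait instead of transmitting and overflowing
            # the receiver's buffer, so no packet is ever dropped.
            return False
        self.credits -= 1
        self.receiver.accept(packet)
        return True

    def refresh_credits(self):
        self.credits = self.receiver.advertise_credits()

rx = Receiver(buffer_slots=2)
tx = Sender(rx)
print([tx.send(p) for p in ("p1", "p2", "p3")])  # [True, True, False]: p3 waits
rx.forward_one()                                  # buffer freed downstream
tx.refresh_credits()
print(tx.send("p3"))                              # True: transmitted, never dropped

The key property is that transmission is gated before it happens, rather than corrected after the buffer overflows.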

 

The advantage of this credit-based mechanism is that it provides an efficient and reliable method of flow control. By monitoring and adjusting packet transmission in real-time, InfiniBand networks ensure smooth data delivery while avoiding network congestion and performance degradation. Additionally, this mechanism offers better network predictability and stability, enabling applications to efficiently utilize network resources.

 

In contrast, RDMA over Converged Ethernet (RoCE) relies on a "post-congestion" management mechanism. It does not negotiate resources with the receiver before sending packets; instead, it forwards them directly without prior coordination. Only when a switch port buffer becomes congested (or is about to become congested) does the switch send congestion management messages, using the PFC (Priority Flow Control) and ECN (Explicit Congestion Notification) protocols, to slow or pause packet transmission from the upstream switch and network card. While this "post-congestion" approach can partially alleviate the impact of congestion, it cannot completely prevent packet loss and network instability.
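
The difference can be seen in a toy model of such reactive behavior: the sender transmits first and only halves its rate after ECN-style marks come back, by which time the queue has already built up and, in the worst case, packets have been dropped. The classes, thresholds, and rates below are illustrative assumptions, not the actual RoCE/PFC/ECN (or DCQCN) specification.

class CongestedSwitchPort:
    def __init__(self, buffer_slots, ecn_threshold):
        self.buffer_slots = buffer_slots
        self.ecn_threshold = ecn_threshold
        self.queue = 0
        self.dropped = 0

    def enqueue(self):
        if self.queue >= self.buffer_slots:
            self.dropped += 1            # buffer already full: the packet is lost
            return "drop"
        self.queue += 1
        # Congestion marking happens only AFTER the queue has built up.
        return "ecn" if self.queue >= self.ecn_threshold else "ok"

class ReactiveSender:
    def __init__(self):
        self.rate = 8                    # packets injected per round (arbitrary unit)

    def on_feedback(self, marks):
        if "ecn" in marks:
            self.rate = max(1, self.rate // 2)   # back off only after the fact

# No queue draining is modeled here; the point is simply that feedback
# arrives after the buffer has filled, unlike the credit scheme above.
port = CongestedSwitchPort(buffer_slots=10, ecn_threshold=6)
sender = ReactiveSender()
for rnd in range(3):
    marks = [port.enqueue() for _ in range(sender.rate)]
    sender.on_feedback(marks)
    print(f"round {rnd}: rate={sender.rate}, queued={port.queue}, dropped={port.dropped}")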

 

Schematic diagram of lossless data transmission in InfiniBand network

To better understand these concepts in everyday life:

 

When you want to dine at a restaurant, the InfiniBand network's credit-based mechanism is like making a reservation over the phone. Customers communicate with the restaurant in advance to ensure there are enough seats available, thus avoiding the awkward situation of arriving at the restaurant and finding no available seats. This approach guarantees a smooth dining experience for customers while preventing resource waste and dissatisfaction.

 

On the other hand, customers queuing up at a restaurant without a reservation is similar to the "post-congestion" management mechanism of RDMA over Converged Ethernet (RoCE). Customers haven't made prior arrangements and can only wait based on the actual situation. Although the restaurant takes measures to alleviate congestion, there's still a risk of insufficient seats and potential loss of customers. While this approach can partially handle the situation, it cannot completely avoid customer dissatisfaction and potential loss.

3. Cut-Through Direct Forwarding Mode: Achieving Lower Latency

Ethernet networks, including RDMA over Converged Ethernet (RoCE), typically employ a store-and-forward mode where switches first receive and store the entire data packet in their buffers. They then inspect the destination address and integrity of the packet before forwarding it. This approach introduces some latency, especially when dealing with a large number of packets.

 

In contrast, Cut-through technology, used in direct forwarding mode, allows switches to read only the header information of a received packet, determine the destination port, and immediately start forwarding the packet. This significantly reduces the time the packet spends in the switch, thereby reducing transmission latency.

 

InfiniBand switches utilize the Cut-through direct forwarding mode, simplifying the forwarding process for layer 2 packets. With just a 16-bit Local Identifier (LID) obtained directly from the subnet manager, the switch quickly identifies the forwarding path. As a result, the forwarding latency is reduced to below 100 nanoseconds. Ethernet switches, on the other hand, typically employ MAC table lookup and store-and-forward methods for packet processing. However, due to the additional complexities they handle, such as IP, MPLS, QinQ, etc., their processing time is longer, possibly taking several microseconds or more. Even though some Ethernet switches also incorporate Cut-through technology, the forwarding latency may still exceed 200 nanoseconds.

 

Message Size: InfiniBand vs. RoCE
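
A quick back-of-envelope calculation shows where the cut-through saving comes from: a store-and-forward switch must wait for the whole packet to arrive before forwarding, while a cut-through switch only needs the header. The link speed and packet sizes below are illustrative assumptions, and the figures cover only the serialization wait, not table lookup or other processing time.

# Per-hop waiting time before forwarding can begin, under assumed sizes.
LINK_GBPS = 400                 # e.g. a 400 Gb/s link (illustrative)
PACKET_BYTES = 4096             # full payload a store-and-forward switch buffers
HEADER_BYTES = 64               # portion a cut-through switch must see

def serialization_ns(num_bytes, gbps):
    return num_bytes * 8 / gbps          # bits / (Gb/s) == nanoseconds

store_and_forward = serialization_ns(PACKET_BYTES, LINK_GBPS)
cut_through = serialization_ns(HEADER_BYTES, LINK_GBPS)
print(f"store-and-forward wait: {store_and_forward:.1f} ns per hop")  # ~81.9 ns
print(f"cut-through wait:       {cut_through:.1f} ns per hop")        # ~1.3 ns

The gap widens further with larger packets and multiplies across every switch hop in the path.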

Again, a real-life analogy helps:

 

The way Ethernet handles data packets can be likened to shipping fragile items. The delivery person needs to be extremely careful, receiving the package in its entirety, checking for any damages, and ensuring its integrity before forwarding it to the destination. This cautious process introduces some time delay.

 

In contrast, the processing approach of InfiniBand switches is more akin to shipping regular items. The delivery person only needs to quickly glance at the address on the package and swiftly forward it without waiting for a thorough inspection. This method is faster, significantly reducing the time the package spends at the post office, thereby reducing transmission latency.

4. Summary

InfiniBand offers several unique advantages over RDMA over Converged Ethernet (RoCE) for Remote Direct Memory Access (RDMA) applications. Firstly, InfiniBand is designed natively according to the principles of Software-Defined Networking (SDN), providing efficient and simple network management without relying on broadcast mechanisms. In contrast, RoCE relies on device-based learning and complex network configurations, making networking less efficient. Secondly, InfiniBand utilizes a credit-based congestion avoidance mechanism that ensures native lossless networking, preventing buffer overflow and packet loss. RoCE, on the other hand, employs a post-congestion management mechanism that cannot completely avoid packet loss. Lastly, InfiniBand switches use a cut-through direct forwarding mode, reducing latency by quickly forwarding packets based on header information. Ethernet switches, including RoCE, typically employ store-and-forward methods that introduce additional processing time. These advantages make InfiniBand a preferred choice for RDMA applications.

 

NADDOD provides a one-stop solution for InfiniBand optical modules and high-speed cables!

 

Delivery Time: We maintain abundant, stable inventory to ensure fast delivery. Once you place an order, we commit to completing delivery within two weeks, so your project can progress quickly without being held up by lead times, saving you time and resources.

Product Performance:

① Our products undergo 100% real device testing to ensure quality and reliability, and we can provide you with professional test reports.

② Our testing scenarios involve the simultaneous application of tens of thousands of cables to ensure that the products can operate smoothly under real application requirements without packet loss or errors.

Product Delivery:

① We have successfully cooperated with multiple enterprises and delivered products that have been running stably, gaining trust from our customers.

② We provide fast and responsive technical services to ensure after-sales support throughout your product usage process.

 

Multiple successful deliveries and real-world application cases are the best endorsement of our quality assurance. You don't need to worry about product quality and inventory issues as we always maintain sufficient stock to ensure your needs are met promptly.

 

NADDOD Product

In addition to high-quality third-party optical modules, we also keep a large inventory of original NVIDIA products, giving you more choices at any time. Contact NADDOD now to learn more!