InfiniBand Technology Q&A for NVIDIA Quantum-2

By Brandon, InfiniBand Technical Support Engineer, NADDOD | Aug 31, 2023

With the rapid development of technologies such as big data and artificial intelligence, the demand for high-performance computing keeps growing. The NVIDIA Quantum-2 InfiniBand platform has emerged to provide users with high-speed, low-latency data transmission and processing capabilities, delivering outstanding distributed computing performance.

 

Quantum-2 adopts the latest generation of NVIDIA NDR 400Gb/s InfiniBand network adapters and switches, supporting high-speed data transmission and low-latency computing. Combined with NVIDIA GPUs, it enables accelerated computing and distributed storage, improving computing efficiency and resource utilization.


In addition, Quantum-2 also supports a variety of advanced technologies such as NVIDIA RDMA, NVLink, and Multi-host to achieve efficient data transmission and resource sharing within data centers. Users can build high-performance computing clusters or distributed storage systems according to their actual needs, providing powerful support for fields such as big data analytics, artificial intelligence, and scientific computing.

Below are common questions and answers about InfiniBand technology.

Q: Is the CX7 NDR200 QSFP112 port compatible with HDR/EDR cables?

A: Yes.
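For reference, after cabling the port with an HDR or EDR cable, the negotiated rate can be checked from the host. A minimal sketch, assuming infiniband-diags is installed and the card enumerates as mlx5_0 (an example name; list adapters with ibstat -l):

ibstat mlx5_0    # the Rate field should show 200 for an HDR link or 100 for an EDR link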


Q: How do I connect a CX7 NDR network card to a Quantum-2 QM97XX series switch?

A: NVIDIA 400GBASE-SR4 or 400GBASE-DR4 optical modules are used in the CX7 NDR network card ports, while 800GBASE-SR8 (effectively 2x400GBASE-SR4) or 800GBASE-DR8 (effectively 2x400GBASE-DR4) modules are used in the QM97XX series switch. The modules are connected with MPO-12 patch cords of universal polarity and APC (8-degree angled) end faces, multimode fiber for SR4 and single-mode fiber for DR4.

 

Q: Can a dual-port 400G CX7 reach 800G after bonding? Why can two 200G ports reach 400G after bonding?

A: Overall throughput is determined by the bottleneck among PCIe bandwidth, network card processing capacity, and physical port bandwidth. The CX7 network card uses PCIe 5.0 x16, whose theoretical bandwidth limit is 512Gbps. Because of this PCIe ceiling, two bonded 400G ports could never deliver 800G; in fact, no dual-port 400G CX7 hardware exists. Two bonded 200G ports, by contrast, total only 400G, which fits within the PCIe 5.0 x16 limit, so that configuration can reach 400G.
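The arithmetic behind this answer can be sketched in a few lines of shell, using the nominal PCIe Gen5 per-lane rate of 32 GT/s and ignoring encoding overhead:

echo $(( 32 * 16 ))   # PCIe Gen5 x16 ceiling: 512 Gbps
echo $(( 2 * 400 ))   # two bonded 400G ports would need 800 Gbps, above the ceiling
echo $(( 2 * 200 ))   # two bonded 200G ports need 400 Gbps, within the ceiling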

 

Q: How should a one-to-two cable be connected?

A: A one-to-two cable (800G split into 2x400G) should be connected to two different servers to achieve optimal SHARP performance. GPU servers generally have multiple network cards, but the two branches of the cable should not both land on network cards in the same server.

 

Q: In InfiniBand NDR scenarios, how are one-to-two cables connected?

A: At present, there are two types of one-to-two cables in InfiniBand NDR scenarios: 1. optical modules with one-to-two patch cords (400G split into 2x200G), i.e., MMS4X00-NS400 + MFP7E20-NXXX + MMS4X00-NS400 (running downgraded at 200G); 2. one-to-two DAC copper cables (800G split into 2x400G), i.e., MCP7Y00-Nxxx and MCP7Y10-Nxxx.

 

Q: Which cards are IB/ETH dual-mode, and how do you switch between the two modes?

A: CX7 NDR network cards are IB/ETH dual-mode. Switching between the two modes requires installing the NVIDIA MFT tool package on the host and then adjusting the mode with the following commands:

1. mst start --> Start the tool;
2. mlxconfig -d <device> set LINK_TYPE_P<port number>=<mode> --> Set the port mode of the network card;
3. Restart the driver or the server for the change to take effect.
Note 1: <device> can be queried with mst status.
Note 2: <port number> is the port number: 1 is the first port of the network card, 2 is the second port.
Note 3: <mode> is the network card mode: 1 is IB mode, 2 is ETH mode.

Example: Change the first port of the network card from ETH to IB mode:
mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=1.
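Put together, a full mode-switch session might look like the sketch below; the device path reuses the example above and should be replaced with the one reported by mst status, and the driver restart assumes the openibd service shipped with MLNX_OFED:

mst start                                                        # load the MST modules
mst status                                                       # list MST devices, e.g. /dev/mst/mt4123_pciconf0
mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep LINK_TYPE     # show the current mode of each port
mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=1         # set port 1 to IB mode (2 = ETH)
/etc/init.d/openibd restart                                      # restart the driver, or reboot the server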


Q: In a Superpod network, if I configure four NDR200 cards on each server, can I connect them directly to the same switch using one one-to-four cable, or do I need to use two one-to-two cables to connect to different switches?

A: It is not recommended to connect the four NDR200 ports of a server to the same switch with a one-to-four cable in a Superpod network, as this does not comply with the Superpod cabling rules. For NCCL/SHARP performance, the leaf switches should use one-to-four cables to connect NDR200 ports of different servers within an SU, forming different communication loops.

 

Q: In a Superpod network, if the last SU has fewer than 32 nodes, for example only 16, can the number of leaf switches in that SU be reduced to 4? This would result in two network cards of the same node being connected to the same leaf switch. Would that cause any issue with the SHARP tree?

A: It is possible but not recommended. An NDR switch can support up to 64 SATs (SHARP Aggregation Trees).

 

Q: Regarding the latest Superpod network: according to the Superpod Network White Paper, two IB switches with UFM software are configured separately in the computing network, which costs my cluster one GPU node. If I do not configure a separate UFM appliance and instead only deploy the UFM software on the management node, can I manage the cluster over the separate storage network without occupying the computing network?

A: It is recommended to configure dedicated UFM equipment (including the software). Deploying UFM software on a management node in the computing network is an optional solution, but that node should not carry GPU computing workloads. The storage network is a separate fabric on a different network plane and cannot be used to manage the computing cluster.

 

Q: What are the differences between UFM Enterprise, SDN, Telemetry, and Cyber-AI? Is it necessary to buy UFM?

A: It is possible to use the opensm and command-line script tools included in OFED for simple management and monitoring, but that approach lacks UFM's user-friendly graphical interface and many of its functions.
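For simple management and monitoring without UFM, the OFED command-line tools mentioned above can be used directly. A minimal sketch, assuming MLNX_OFED with infiniband-diags is installed on a fabric-attached host:

opensm -B          # start the subnet manager as a background daemon
sminfo             # show which subnet manager is currently active
ibnetdiscover      # discover and print the fabric topology
ibdiagnet          # run a basic fabric health check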


Q: Is there any difference in the number of nodes a subnet manager can handle when it runs on a managed switch, via OFED's OpenSM, or via UFM? Which is more suitable for customer deployment?

A: A subnet manager running on a managed switch is suitable for fabrics of up to about 2K nodes. UFM and OFED's OpenSM have no fixed node limit, but their capacity depends on the CPU and hardware capabilities of the management node.

 

Q: What are the differences between DAC, ACC, AOC, and transceivers, and what are the limitations of each?

A: They differ mainly in reach and cabling complexity, as summarized in the figure below: passive DAC copper cables are the simplest and cheapest but reach only about 3m, active ACC copper cables reach about 5m, while AOCs and optical transceivers with patch cords cover much longer distances at higher cost.

(Figure: DAC / ACC / AOC / transceiver comparison)

Q: Why does a switch with 64 400Gb ports have only 32 OSFP cages?

A: Size and power consumption are limited, and only 32 cages can be accommodated on the switch panel. Each OSFP cage carries two 400G ports, so it is important to distinguish between the concepts of cage and port on the NDR switch.

 

Q: Is it possible to connect two modules with different interfaces using a cable to transmit data? For example, if one side is an OSFP port on a server and the other side is a QSFP112 port on a switch, can they be connected with a cable?

A: The interconnection of modules is independent of packaging. OSFP and QSFP112 mainly describe the physical size of the module. As long as the Ethernet media type is the same (i.e., both ends of the link are 400G-DR4 or 400G-FR4, etc.), OSFP and QSFP112 modules can be mutually compatible.

 

Q: Can UFM be used to monitor RoCE networks?

A: No, it cannot. UFM only supports InfiniBand networks.

 

Q: Are the functionalities of UFM the same for managed and unmanaged switches?

A: Yes, they are the same.

 

Q: What is the maximum transmission distance supported by IB cables without affecting the transmission bandwidth and latency?

A: Optical modules + patch cords: ~500m; passive DAC copper cables: ~3m; active ACC copper cables: ~5m.

 

Q: Can CX7 network cards be connected to other 400G Ethernet switches that support RDMA in Ethernet mode?

A: 400G Ethernet interconnection is possible; RDMA in that case is RoCE, which will run, but performance is not guaranteed. For 400G Ethernet, the recommended solution is the Spectrum-X platform consisting of BF3 + Spectrum-4.

 

Q: If NDR is backward compatible with HDR and EDR, are the corresponding cables and modules only available as one-piece (integrated) cables?

A: Yes, generally OSFP-to-2xQSFP56 DAC/AOC splitter cables are used for compatibility with HDR or EDR.

 

Q: Should the module on the OSFP network card side be a flat module?

A: The network card cage comes with its own heat sink, so a flat-top module can be used directly. Finned-top modules are mainly used on the air-cooled switch side.

 

Q: Does the IB network card support RDMA in Ethernet mode?

A: Yes, in Ethernet mode the card runs RoCE (RDMA over Converged Ethernet). The NVIDIA Spectrum-X solution is recommended for this use case.
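As a quick sanity check after switching a port to ETH mode, the RDMA device list and link layer can be inspected with the verbs utilities shipped with MLNX_OFED. A minimal sketch, assuming the card enumerates as mlx5_0 (an example name):

ibv_devices                                          # list RDMA-capable devices on the host
ibv_devinfo -d mlx5_0 | grep -E "link_layer|state"   # link_layer should report Ethernet, state PORT_ACTIVE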

 

Q: Why are there no NDR AOCs?

A: The OSFP form factor is large and heavy, and the attached optical fibers are prone to breakage. A two-branch AOC would have three large transceiver ends and a four-branch AOC five, which makes fiber breakage during installation likely, especially for 30-meter AOCs.

 

Q: Are the cables the same for 400G IB and 400G Ethernet besides the different optical modules?

A: The fiber patch cords are the same, but note that they are APC type with an 8-degree angled end face.

 

Q: Is there a specific requirement for the latency performance of CX7 network cards? Under an optimally tuned test environment (e.g., ample memory and pinned cores), what network latency should be expected, i.e., less than how many microseconds?

A: There is no single figure; the result depends on the CPU frequency and configuration of the test machines, as well as on the testing tool, such as perftest or MPI benchmarks.
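As an illustration, a point-to-point latency test with the perftest suite might look like the sketch below, assuming perftest is installed on both hosts, the device is mlx5_0 (an example name), and 192.168.1.10 is the server host's address (also an example):

ib_write_lat -d mlx5_0 -F                 # on the server side: wait for a client
ib_write_lat -d mlx5_0 -F 192.168.1.10    # on the client side: connect to the server and report latency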

 

Q: Should the module on the OSFP network card side be an OSFP-flat module? Why is there a mention of OSFP-Riding Heatsink?

A: A riding heatsink is a heat sink mounted on the cage itself; since the network card cage provides this cooling, a flat (OSFP-flat) module is used on the card side.

Q: Where does UFM fit into this cluster solution? I would like to understand its role.

A: UFM runs separately on a server and can be treated as a node, with support for high availability using two servers. However, it is not recommended to run UFM on a node that also runs compute workloads.

 

Q: For what scale of network clusters is UFM recommended?

A: It is recommended to configure UFM for all IB networks, as UFM provides not only OpenSM but also other powerful management and interface functions.

 

Q: Does PCIe 5 only support up to 512G? What about PCIe 4?

A: PCIe Gen5 provides 32 GT/s per lane, so x16 gives a maximum bandwidth of about 512 Gbps; PCIe Gen4 provides 16 GT/s per lane, so x16 gives about 256 Gbps.
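To confirm what a given adapter actually negotiated, the PCIe link speed and width can be read with lspci. A minimal sketch, where the bus address 3b:00.0 is only an example and should be taken from the first command's output:

lspci | grep -i mellanox                              # find the adapter's PCI bus address
sudo lspci -s 3b:00.0 -vv | grep -E "LnkCap|LnkSta"   # LnkSta shows the negotiated speed (e.g. 32GT/s) and width (e.g. x16)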

 

Q: Do IB network cards support simplex or duplex modes?

A: They are all full duplex. The simplex/duplex distinction is largely moot for modern devices, since the transmit and receive paths use physically separate channels.

 

Q: Are there reliable vendors for building IB network clusters who can provide technical support and high-quality products?

A: Of course, as a reliable vendor, NADDOD can provide technical support and high-quality products for building IB networks. NADDOD specializes in providing high-performance computing and data center solutions. It has rich experience and expertise in building IB network clusters and provides a variety of hardware connectivity solutions to meet the needs of different customers.

 

NADDOD's solutions cover compatibility between different models of network cards and switches, a choice of AOC/DAC cables and modules at 400G, 200G, and 100G speeds, and an experienced technical support team that can assist with UFM functions and deployment, network latency requirements, and more.

 

NADDOD products are reliable in quality, excellent in performance, and provide comprehensive technical support and after-sales service. Whether building large-scale supercomputing clusters or high-speed data center networks, NADDOD can provide customized solutions for customers. In IB network cluster solutions, NADDOD's professional team will provide the best hardware connectivity solutions based on your needs and network scale, ensuring network stability and high performance. Whether building a new network or upgrading an existing one, NADDOD can provide comprehensive support to ensure the smooth deployment and operation of IB networks.
For more information and support, please visit NADDOD's official website.