Seven major trends in the development of large-scale data center networks

NADDOD Quinn InfiniBand Network Architect Feb 21, 2024

Ultra-large-scale data center networks have undergone significant transformations in architecture, technology, and operations over the past decade, effectively supporting the growth of the Internet and cloud computing. Looking ahead, the industry must confront the question of where data center networks should evolve next, driven by technologies and workloads such as artificial intelligence, big data, the Internet of Things, and cloud-native computing. Based on industry developments and business requirements, we predict the following trends for the future of ultra-large-scale data center networks.

1. Evolution of Network Bandwidth: The Core Competence of Chips

Driven by technologies and workloads such as artificial intelligence, big data, and machine learning, as well as high-definition video, augmented reality (AR), and virtual reality (VR), data centers will see demand for network bandwidth continue to accelerate. Over the next five years, data center switch chips will keep iterating rapidly: as long as Moore's Law remains effective, switch chip capacity is expected to roughly double every two years. SerDes technology is evolving from 10Gbit/s and 25Gbit/s to 50Gbit/s and 100Gbit/s per lane, while optical module technology is progressing from 25Gbit/s, 50Gbit/s, and 100Gbit/s toward 400Gbit/s, 800Gbit/s, and even Tbit/s speeds. These continuous iterations of switch chips and optical chips will keep pace with the growing bandwidth demands of data center networks. Furthermore, the evolution of network bandwidth will inevitably drive the evolution of computing and storage architectures.
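To make that cadence concrete, the minimal sketch below projects switch chip capacity under the doubling-every-two-years assumption discussed above; the starting capacity, starting year, and horizon are illustrative assumptions, not vendor roadmap figures.

```python
# Rough projection of switch chip capacity under the "double every two
# years" assumption. Starting point and horizon are illustrative
# assumptions, not vendor roadmap data.

def project_capacity(start_tbps: float, start_year: int, years: int) -> list[tuple[int, float]]:
    """Return (year, capacity in Tbit/s) pairs, doubling every two years."""
    return [(start_year + n, start_tbps * 2 ** (n / 2)) for n in range(0, years + 1, 2)]

if __name__ == "__main__":
    for year, tbps in project_capacity(start_tbps=12.8, start_year=2020, years=8):
        print(f"{year}: ~{tbps:.1f} Tbit/s per switch chip")
```

Under these assumptions the projection runs 12.8T, 25.6T, 51.2T, 102.4T, and 204.8T across four chip generations, which is the kind of trajectory described above.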

Typical optical module evolution

In addition to the continuous growth of capacity, chip programmability will gradually become mainstream, and the strength of a chip's network visualization support will become one of its core competitive advantages.

2. Whitebox Hardware, Open-Source OS, Self-Developed Software

The SDN Concept Drives a Maturing Device-Decoupling Ecosystem: Merchant Silicon, Whitebox and Customized Hardware, Self-Developed Software

 

The concept of Software-Defined Networking (SDN) has driven the device-decoupling ecosystem toward maturity, with merchant silicon, whitebox and customized hardware, and self-developed software becoming increasingly prevalent. Developing switch devices in house is not only about cost savings; it also enables the integration of software with customized hardware, so that network functionality can iterate rapidly in step with business needs while providing flexible and efficient network monitoring. The result is a more stable and intelligent network, which becomes a core competitive advantage.

 

The open-source ecosystem has now matured considerably. Switch operating systems can build on it, allowing internet companies to focus on upper-layer software and operational management systems. In-house switch development will no longer be limited to a few large internet and cloud computing companies; more and more companies will join this camp.

3. Integrated High-Performance Network Forwarding: Hardware Offloading and Programmable Chips

As is well known, the era of Moore's Law for CPUs is drawing to a close, while the scale of cloud services and machine learning continues to grow exponentially. Virtual switches are an integral part of cloud data center networks, but server-based network processing has run into serious challenges. The rapid adoption of 40GbE and even 100GbE networks, sharp increases in per-server throughput, the stacking of additional features such as network security, and the proliferation of virtual machines together consume a substantial share of CPU resources on internal and external networking and ancillary functions. Simply adding servers leads to an endless "sea of machines", creating problems for deployment scale, application efficiency, CapEx, and more. Improving virtual network performance on traditional x86 servers has therefore become crucial.

 

Facing this bottleneck in intra-server forwarding performance, many vendors have developed Smart Network Interface Card (SmartNIC) solutions based on FPGAs, multi-core processors, or traditional network processors. In simple terms, a SmartNIC offloads network functions, including vSwitches and vRouters, from the x86 host onto the NIC, freeing the host's processor resources and providing higher-performance network processing.
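The motivation is easy to see with a back-of-envelope estimate. The sketch below compares the CPU cores a software vSwitch might consume at different NIC speeds; the packet size and per-core forwarding rate are assumptions chosen for illustration, not measured results.

```python
# Back-of-envelope estimate of host CPU cores consumed by software
# packet processing. All figures are illustrative assumptions, not
# benchmark results.

def cores_for_vswitch(throughput_gbps: float, avg_pkt_bytes: int, mpps_per_core: float) -> float:
    """Cores needed to forward `throughput_gbps` of traffic in software."""
    pps = throughput_gbps * 1e9 / 8 / avg_pkt_bytes   # packets per second
    return pps / (mpps_per_core * 1e6)

if __name__ == "__main__":
    for nic_speed in (40, 100):
        cores = cores_for_vswitch(throughput_gbps=nic_speed,
                                  avg_pkt_bytes=512,     # assumed average packet size
                                  mpps_per_core=5.0)     # assumed per-core forwarding rate
        print(f"{nic_speed}GbE at line rate: ~{cores:.1f} cores for the vSwitch alone")
```

Under these assumptions a 100GbE server would spend several cores just on forwarding, which is exactly the capacity a SmartNIC offload hands back to revenue-generating workloads.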

 

The first wave of SDN development broke the closed integration of the management, control, and data planes, emphasized the role of software, and transformed the networking industry. However, as SDN applications spread, the limitations of a software-only approach have become increasingly apparent: controlling the underlying hardware and chips effectively, and in a simpler way, matters more and more. P4 (Programming Protocol-Independent Packet Processors) emerged in this context. The ability to program and open up the underlying chips will usher in the next wave of SDN development, inevitably bringing another transformation in hardware-software integration and network visualization.
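P4 itself is a domain-specific language for the forwarding plane, but its central idea, protocol-independent match-action tables whose fields and actions are defined by the program rather than baked into the chip, can be illustrated conceptually. The Python sketch below is only a toy model of that abstraction; the field names, table, and actions are invented for illustration and do not correspond to real P4 code or any chip's API.

```python
# Toy model of a protocol-independent match-action table: the pipeline
# knows nothing about specific protocols; the program defines which
# header fields to match and which actions to run. All names here are
# invented for illustration only.

from typing import Callable

class MatchActionTable:
    def __init__(self, match_fields: list[str]):
        self.match_fields = match_fields
        self.entries: dict[tuple, Callable[[dict], dict]] = {}

    def add_entry(self, key: tuple, action: Callable[[dict], dict]) -> None:
        self.entries[key] = action

    def apply(self, packet: dict) -> dict:
        key = tuple(packet.get(f) for f in self.match_fields)
        # Default action: drop packets that miss the table.
        action = self.entries.get(key, lambda pkt: {**pkt, "drop": True})
        return action(packet)

if __name__ == "__main__":
    # The "program" chooses the match fields and actions; the pipeline stays generic.
    ipv4_fwd = MatchActionTable(match_fields=["dst_ip"])
    ipv4_fwd.add_entry(("10.0.0.1",), lambda pkt: {**pkt, "egress_port": 3})

    print(ipv4_fwd.apply({"dst_ip": "10.0.0.1", "ttl": 64}))   # forwarded
    print(ipv4_fwd.apply({"dst_ip": "10.0.0.9", "ttl": 64}))   # table miss -> drop
```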

4. Network Convergence: Low-Latency I/O for the Integrated Data Center

The functionality of networks is no longer limited to providing connectivity; networks are becoming an extension of computer I/O. Networks with ultra-high bandwidth and ultra-low latency are blurring the boundaries between local storage and network storage, laying the foundation for the integrated architecture of compute-storage separation and resource pooling in data centers. The network is a core component of data center integration, serving as a powerful driving force for next-generation high-performance computing and storage. Reducing network latency will be a long-term process, and technologies such as RDMA will gradually be deployed at scale. Revolutionary new technologies or architectural changes will emerge when applications encounter bottlenecks.

 

With the increasing adoption of artificial intelligence and big data, data centers' demand for computing power keeps growing, and ultra-high-density heterogeneous computing clusters will become the core competitive strength of infrastructure. Efficiently interconnecting computing chips with high-performance storage media, and scaling that interconnect out, are crucial challenges. Data center networks will no longer be confined to switch fabrics; they will extend into the host, providing high-performance interconnection among the computing chips and storage media inside the host and integrating with the switch network. The traditional CPU-centric server architecture will gradually evolve into one centered on data interconnection. Network interface cards (NICs) will go beyond traditional I/O and serve as carriers for hardware virtualization, bridging switch-network interconnection and host-component interconnection. Hardware-based high-speed forwarding, network Quality of Service (QoS), network visualization, and other functions will be extended to host NICs.

5. Network Visualization: Intelligent Operations and Maintenance Based on Big Data and Artificial Intelligence

Autonomous driving has become a possibility, and widespread adoption is only a matter of time. Similarly, the automation of large-scale network operations and maintenance is an inevitable trend in the industry. Achieving autonomous driving or automated operations and maintenance requires two common conditions: having sufficient and effective data, and having the capability for intelligent analysis and processing of that data. The acquisition of valuable data relies on network devices, and the visualization capabilities of switch chips play a crucial role.

 

Traditionally, monitoring and data acquisition for switch devices has been coarse-grained, typically limited to device-level operational status: CPU, memory, ports, and various table entries. The methods used to obtain this information, such as SNMP and the CLI, are primitive and inefficient and cannot meet the requirements of automated operations. New-generation switch chips have taken a significant step toward network visualization. Some switch chips on the market already expose much richer information: In-band Network Telemetry (INT) reports the physical path, per-hop latency, switch buffer occupancy, and other details for specific user flows, while Mirror on Drop (MoD) captures packets dropped in the switch pipeline or due to buffer congestion. Fed into artificial intelligence systems, this wealth of network data will raise operations and maintenance to unprecedented levels of intelligence, enabling self-driving networks.
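As a concrete illustration of what INT-style data enables, the sketch below walks a per-hop telemetry report to reconstruct a flow's physical path and flag hops with abnormal latency or deep buffers. The record format and thresholds are simplified assumptions made for illustration; real INT metadata is carried in binary headers whose exact layout depends on the specification version and the chip.

```python
# Illustrative processing of INT-style per-hop telemetry. The record
# format and thresholds below are simplified assumptions; real INT data
# is encoded in binary metadata stacks defined by the spec and the chip.

HOP_LATENCY_THRESHOLD_NS = 20_000   # assumed alerting threshold

def analyze_int_report(hops: list[dict]) -> None:
    """Print the physical path and flag hops with high latency or deep queues."""
    print("path:", " -> ".join(h["switch_id"] for h in hops))
    for h in hops:
        flags = []
        if h["hop_latency_ns"] > HOP_LATENCY_THRESHOLD_NS:
            flags.append("high latency")
        if h["queue_depth_cells"] > 0.8 * h["queue_capacity_cells"]:
            flags.append("buffer nearly full")
        if flags:
            print(f"  {h['switch_id']}: {', '.join(flags)}")

if __name__ == "__main__":
    analyze_int_report([
        {"switch_id": "leaf-12",  "hop_latency_ns": 1_200,
         "queue_depth_cells": 40,  "queue_capacity_cells": 1024},
        {"switch_id": "spine-03", "hop_latency_ns": 35_000,
         "queue_depth_cells": 900, "queue_capacity_cells": 1024},
    ])
```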

 

Beyond the content itself, the methods and efficiency of data export from switches have also improved greatly. Streaming Telemetry delivers monitoring data to network monitoring systems efficiently, either through software or directly from the chip, and critical information can be sampled at granularities as fine as microseconds.
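With samples at that resolution, even short-lived microbursts become visible to the monitoring system. The sketch below shows one simple way such data might be reduced to an alert; the sample format, threshold, and duration are assumptions for illustration, and real streaming telemetry is typically delivered over gRPC-based pipelines rather than as Python tuples.

```python
# Detecting a microburst from high-resolution buffer-occupancy samples.
# Sample format and thresholds are assumptions for illustration.

def detect_microbursts(samples: list[tuple[int, int]],
                       threshold_cells: int,
                       min_duration_us: int) -> list[tuple[int, int]]:
    """samples: (timestamp_us, occupancy_cells) pairs. Return (start, end) bursts."""
    bursts, start = [], None
    for ts, occ in samples:
        if occ >= threshold_cells and start is None:
            start = ts                                  # burst begins
        elif occ < threshold_cells and start is not None:
            if ts - start >= min_duration_us:
                bursts.append((start, ts))              # long enough to report
            start = None
    return bursts

if __name__ == "__main__":
    # Synthetic samples every 10 us, with a burst between 100 us and 180 us.
    samples = [(t, 9_000 if 100 <= t < 180 else 500) for t in range(0, 300, 10)]
    print(detect_microbursts(samples, threshold_cells=8_000, min_duration_us=50))
```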

6. Optical Interconnect Trends

Optical interconnect refers to the communication method that uses optical technology to connect computing devices, storage devices, and network devices. Here are some specific trends and aspects in the field of optical interconnect:

 

  • High Speed: With the growth of data centers and cloud computing, demand for high-speed optical interconnects keeps rising. Optical links have evolved from 10Gb/s to 100Gb/s, 400Gb/s, and even higher speeds to meet the requirements of large-scale data transmission (see the lane-rate sketch after this list).

 

  • Low Power Consumption: Energy efficiency and sustainability have become significant concerns, making low-power optical interconnects increasingly important. The development of next-generation optical interconnect technologies, such as energy-efficient optical interconnection and low-power optical modules, helps reduce the energy consumption of data centers.

 

  • High-Density Integration: As data centers scale up, there is a growing need for space-efficient optical interconnects. The development and application of high-density optical modules enable more optical interconnects to be implemented within limited space.

 

  • Optoelectronic Hybrid Integration: Optoelectronic hybrid integration involves integrating optical and electronic components on the same chip to achieve higher performance and compact designs. This integration can provide higher bandwidth, lower latency, and reduced power consumption while reducing the complexity and cost of optical connections.

 

  • High Reliability: Data centers and communication networks require high reliability. The development of optical interconnect technologies includes enhanced reliability of optical fiber connections, improved fault detection and correction mechanisms, and more comprehensive monitoring and management of reliability.

 

  • Applications in Emerging Fields: Optical interconnect technologies are expanding into emerging fields such as artificial intelligence, the Internet of Things, and edge computing. These areas have increasing demands for high-speed, low-latency, and high-capacity communication, making optical interconnects an essential technology to meet these needs.

 

These trends and aspects collectively drive the development of optical interconnect technologies, providing higher performance and efficiency solutions for data centers and communication networks.
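For a sense of how these aggregate speeds are reached, the short sketch below relates module speed to per-lane rate and lane count. The configurations listed are common examples (for instance, 400G built from 8 x 50G or 4 x 100G electrical lanes), not an authoritative list; actual lane counts and modulation vary by module type, so datasheets should be consulted for specifics.

```python
# Relating aggregate optical module speed to lane count and per-lane rate.
# The configurations below are common examples, not an exhaustive or
# authoritative list; check module datasheets for specifics.

COMMON_MODULES = {
    "100G (QSFP28, 4 x 25G)":  (4, 25),
    "400G (QSFP-DD, 8 x 50G)": (8, 50),
    "400G (4 x 100G lanes)":   (4, 100),
    "800G (OSFP, 8 x 100G)":   (8, 100),
}

for name, (lanes, gbps_per_lane) in COMMON_MODULES.items():
    print(f"{name}: {lanes} lanes x {gbps_per_lane}Gb/s = {lanes * gbps_per_lane}Gb/s")
```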

7. Green Network

With the increasing adoption of artificial intelligence and big data, data centers' demand for computing power keeps growing, and ultra-high-density heterogeneous computing clusters will become the core competitive strength of infrastructure. This surge in computing power inevitably brings a substantial rise in power consumption. Power and cooling are therefore critical issues that must be addressed; they are key to the sustainable development of large-scale data centers.
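A common yardstick here is Power Usage Effectiveness (PUE): total facility power divided by IT equipment power, with values closer to 1.0 meaning less overhead spent on cooling and power distribution. The minimal sketch below illustrates the calculation; the power figures are made up for illustration.

```python
# Power Usage Effectiveness (PUE) = total facility power / IT equipment power.
# A PUE of 1.0 would mean every watt reaches the IT load. Figures below
# are made up for illustration.

def pue(total_facility_kw: float, it_load_kw: float) -> float:
    return total_facility_kw / it_load_kw

if __name__ == "__main__":
    it_load_kw = 10_000      # assumed IT load
    overhead_kw = 3_000      # assumed cooling, power distribution, etc.
    print(f"PUE = {pue(it_load_kw + overhead_kw, it_load_kw):.2f}")   # -> 1.30
```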
