Next-Gen Data Centers: Embracing Liquid Cooling

NADDOD Claire Optical Module Engineer Jun 28, 2024

Hyperscale data centers and high-performance computing (HPC) are foundational to the advancement of cloud computing, big data, and generative AI. As these centers grow in power and scale, cooling has become a major bottleneck. Modern data centers generate heat loads that traditional air cooling systems can't handle. As a result, liquid cooling technology, with its superior efficiency and environmental benefits, is emerging as a crucial solution for supercomputing centers.


1. The Necessity of Liquid Cooling Technology

With the rapid development of AI, chip power consumption is rising sharply. Each new generation of high-performance applications and AI chips consumes more power. For instance, NVIDIA's latest GPUs use up to 700 watts, and the next generation is expected to reach 1200 watts.


Traditionally, air cooling has been the go-to method due to its flexibility and ease of use. Before the rise of compute-intensive technologies, most server racks had a peak power of around 20 kW. Today, however, many server racks exceed 30 kW, and GPU racks for AI and machine learning can surpass 40 kW. At these densities, air cooling alone can no longer meet the demands of high-power chips.


Thus, liquid cooling solutions are increasingly essential for supporting energy-intensive technologies in data centers.




Advantages of liquid cooling technology:


High Cooling Efficiency: Liquids have significantly higher thermal conductivity and heat capacity than air. Per unit volume, water's heat capacity is roughly 1000-3500 times that of air, and its thermal conductivity is 15-25 times greater.


Energy Efficiency and Environmental Friendliness: Liquid cooling systems typically do not require compressors or fans, only pumps and sensors, significantly reducing Power Usage Effectiveness (PUE).


Stability and Reliability: Components such as pumps, heat exchangers, and pipes are easy to maintain and highly reliable.
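The water-to-air comparison in the list above can be checked with a quick calculation. The sketch below is a minimal illustration, assuming rounded textbook property values near 20°C and atmospheric pressure; the constant and function names are illustrative and do not come from any particular tool. It compares heat capacity per unit volume, which is the basis on which a ratio in the thousands arises, and thermal conductivity.

```python
# Approximate fluid properties near 20 °C and 1 atm (rounded textbook values,
# assumed here for illustration only).
# Units: density in kg/m^3, cp in J/(kg*K), k in W/(m*K).
WATER = {"density": 998.0, "cp": 4182.0, "k": 0.598}
AIR = {"density": 1.204, "cp": 1005.0, "k": 0.0257}


def volumetric_heat_capacity(fluid):
    """Heat stored per cubic metre per kelvin, in J/(m^3*K)."""
    return fluid["density"] * fluid["cp"]


def property_ratios(liquid, gas):
    """Return (volumetric heat capacity ratio, thermal conductivity ratio)."""
    capacity_ratio = volumetric_heat_capacity(liquid) / volumetric_heat_capacity(gas)
    conductivity_ratio = liquid["k"] / gas["k"]
    return capacity_ratio, conductivity_ratio


if __name__ == "__main__":
    cap, cond = property_ratios(WATER, AIR)
    print(f"Volumetric heat capacity ratio (water/air): {cap:.0f}")
    print(f"Thermal conductivity ratio (water/air): {cond:.1f}")
```

With these assumed inputs, the heat capacity ratio lands near the upper end of the 1000-3500 range and the conductivity ratio falls inside the 15-25 range quoted above.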


Overall, liquid cooling offers ultra-high energy efficiency, high thermal density, and efficient heat dissipation, and it is largely unaffected by altitude, region, or ambient temperature. Hence, it is becoming the new trend in data center cooling technology.


2. Main Types of Liquid Cooling Technologies

Liquid cooling in data centers today includes both direct and indirect methods, with direct cooling further divided into immersion cooling and spray cooling.


While it might seem new, liquid cooling has been around for quite some time. In the 19th century, immersion cooling was used to remove heat from transformers, and various forms of liquid cooling have been applied in computing since the 1960s.


Figure: Direct (A) and indirect (B) liquid cooling


Cold-Plate Liquid Cooling

Cold-plate liquid cooling, an indirect cooling technology, is currently the most mature and popular among the three mainstream liquid cooling solutions.


In these setups, cold plates containing liquid channels are mounted directly on heat-generating components such as dual in-line memory modules (DIMMs), CPUs, and GPUs. This is known as direct-to-chip cooling, although the coolant itself never touches the electronics, which is why the method counts as indirect. The liquid can either remain in a single-phase state or transition to a gaseous state and then condense back into liquid. The cold plates, made of highly conductive materials, carry circulating liquid that absorbs heat from the components; the heated liquid is cooled elsewhere in the loop and then recirculated.


Cold plate systems can use water or a water-glycol mixture. Heat is rejected through liquid-to-liquid or liquid-to-air heat exchange. These systems are connected to external Coolant Distribution Units (CDUs), which can be local or serve multiple server racks. Increasing the liquid flow rate through the cold plates reduces thermal resistance, improving cooling efficiency. These systems can remove up to 75% of the heat, with the remaining portion managed by air cooling.
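The relationship between heat load and coolant flow can be sketched with the steady-state energy balance Q = m_dot * c_p * dT: for a fixed heat load, a higher flow rate means a smaller coolant temperature rise. The example below is a rough sizing sketch, not a design method. The 40 kW rack and the 75% liquid share come from figures in this article, while the 10 K coolant temperature rise, the water properties, and the function names are assumptions chosen for illustration.

```python
WATER_CP = 4182.0       # J/(kg*K), approximate value for water (assumed)
WATER_DENSITY = 998.0   # kg/m^3, approximate value for water (assumed)


def liquid_heat_load_w(rack_power_w, liquid_fraction):
    """Portion of the rack heat carried away by the cold-plate loop, in watts."""
    if not 0.0 <= liquid_fraction <= 1.0:
        raise ValueError("liquid_fraction must be between 0 and 1")
    return rack_power_w * liquid_fraction


def required_flow_l_per_min(heat_w, delta_t_k, cp=WATER_CP, density=WATER_DENSITY):
    """Coolant flow needed to absorb heat_w with a temperature rise of delta_t_k."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_s = heat_w / (cp * delta_t_k)   # from Q = m_dot * cp * dT
    volume_flow_m3_s = mass_flow_kg_s / density
    return volume_flow_m3_s * 1000.0 * 60.0      # m^3/s -> L/min


if __name__ == "__main__":
    heat = liquid_heat_load_w(rack_power_w=40_000, liquid_fraction=0.75)
    flow = required_flow_l_per_min(heat, delta_t_k=10.0)
    print(f"Liquid-side heat load: {heat / 1000:.0f} kW")
    print(f"Required water flow: {flow:.1f} L/min")
```

Under these assumptions the loop needs a little over 40 L/min of water. Halving the allowed temperature rise doubles the required flow, which is one reason pump sizing and CDU capacity matter in practice.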


Backup systems are necessary to address potential failures of the CDU or other components. Additionally, cold plate technology can be scaled up through rear door heat exchangers (RDHx). These exchangers, fitted into server rack doors, allow air to circulate and pass over liquid-cooled elements, absorbing and removing heat. Technically, RDHx is not true liquid cooling since server-level chips are still air-cooled, but it brings liquid cooling closer to the rack for enhanced air-cooling efficiency, serving as a good starting point for many companies.


Immersion Cooling

Immersion cooling is like a cold bath for electronic components: heat-generating devices are submerged in a cooling liquid and exchange heat with it through direct contact. In immersion cooling systems, IT components are partially or fully immersed in dielectric fluids, typically hydrocarbons or fluorocarbons, that absorb heat and significantly lower thermal resistance. The fluids used are typically non-toxic and non-flammable, though some may be combustible under specific conditions.


Immersion cooling can be single-phase or two-phase. In single-phase immersion cooling, the liquid circulates to absorb and dissipate heat, then returns the cooled liquid to the tank. The fluid circulation can be mechanical or rely on natural convection driven by heat. In two-phase immersion cooling, the liquid boils as it absorbs heat. Water-cooled condensation elements, suspended above the tanks, absorb heat and cause the evaporated coolant to condense and drip back into the tank. The fluids used in two-phase immersion cooling, such as fluorocarbons, are more expensive, and new low-boiling-point fluids are being developed to improve process efficiency.
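The difference between the two modes comes down to sensible heat versus latent heat. The sketch below compares how much heat one kilogram of coolant absorbs when it only warms up (single-phase) and when it boils (two-phase). The property values are placeholders of the order of magnitude sometimes quoted for fluorocarbon fluids; they are assumptions for illustration, not data for any specific product, and the function names are hypothetical.

```python
def sensible_heat_per_kg(cp_j_per_kg_k, delta_t_k):
    """Heat absorbed by 1 kg of liquid that warms by delta_t_k without boiling."""
    return cp_j_per_kg_k * delta_t_k


def latent_heat_per_kg(h_fg_j_per_kg):
    """Heat absorbed by 1 kg of liquid that boils at constant temperature."""
    return h_fg_j_per_kg


if __name__ == "__main__":
    # Placeholder properties, assumed for illustration; not data for any product.
    cp = 1100.0        # specific heat, J/(kg*K)
    h_fg = 100_000.0   # latent heat of vaporization, J/kg
    delta_t = 10.0     # allowed liquid temperature rise, K

    single_phase = sensible_heat_per_kg(cp, delta_t)
    two_phase = latent_heat_per_kg(h_fg)
    print(f"Single-phase: {single_phase / 1000:.0f} kJ per kg")
    print(f"Two-phase: {two_phase / 1000:.0f} kJ per kg")
    print(f"Ratio: {two_phase / single_phase:.1f}")
```

With these placeholder numbers, boiling absorbs several times more heat per kilogram than a 10 K temperature rise, which illustrates why two-phase systems can handle high heat flux with less circulating fluid.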


Compared to traditional air cooling and cold plate cooling, immersion cooling offers several advantages, such as high energy efficiency (PUE < 1.13), compactness, high reliability, and low noise. However, it requires substantial changes to server technology and data center architecture. The initial investment for immersion cooling systems is high, but they can reduce overall costs over 3-5 years, lowering the PUE below 1.2 and potentially approaching 1 when combined with other technologies. By 2027, the immersion cooling market is expected to reach $1.6 billion.
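For readers unfamiliar with the metric, PUE is the ratio of total facility power to the power delivered to IT equipment, so a value of 1.0 would mean zero overhead. The sketch below shows the arithmetic behind the PUE figures mentioned above. The load numbers in the example are hypothetical and the function names are illustrative.

```python
def pue(it_power_kw, cooling_power_kw, other_overhead_kw=0.0):
    """Power Usage Effectiveness: total facility power divided by IT power."""
    if it_power_kw <= 0:
        raise ValueError("it_power_kw must be positive")
    total_kw = it_power_kw + cooling_power_kw + other_overhead_kw
    return total_kw / it_power_kw


def overhead_budget_kw(it_power_kw, target_pue):
    """Non-IT power (cooling, power conversion, lighting) allowed at a target PUE."""
    if target_pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_power_kw * (target_pue - 1.0)


if __name__ == "__main__":
    # Hypothetical facility: 1000 kW of IT load, 80 kW cooling, 50 kW other overhead.
    print(f"PUE: {pue(1000.0, 80.0, 50.0):.2f}")
    # Overhead allowed for the same IT load at a PUE target of 1.2.
    print(f"Overhead budget at PUE 1.2: {overhead_budget_kw(1000.0, 1.2):.0f} kW")
```

In this hypothetical case, keeping PUE at 1.13 leaves only 130 kW of total overhead for every 1000 kW of IT load, so any reduction in fan and compressor power shows up directly in the ratio.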


Spray Cooling

Spray cooling, the least researched of the three liquid cooling types, involves precisely spraying cooling liquid (such as hydrocarbons, water, or methanol) onto chip-level devices or connected heat-conducting elements, evaporating the liquid and carrying away heat. This direct-contact cooling method requires sealing components within chambers to recover and reuse evaporated coolant. In indirect systems, the spraying occurs within smaller sealed chambers where the atomized liquid contacts cooling plates.


Spray cooling also achieves 100% liquid cooling but with lower fluid quantities compared to immersion cooling. However, issues like nozzle clogging and the difficulty of replacing equipment without interrupting the overall cooling process have hindered its widespread adoption. Spray cooling shares the same limitations as immersion cooling regarding changes to data center infrastructure and initial investment costs.


Each liquid cooling technology—cold plate, immersion, and spray—offers unique advantages and challenges. As data centers continue to evolve, these technologies provide critical solutions for managing the heat generated by increasingly powerful computing systems.


3. Enhance Energy Efficiency with Liquid Cooling Technology

Compared to air cooling, liquid cooling is much more energy-efficient and typically requires less mechanical circulation, resulting in substantial cost savings. Studies comparing liquid immersion cooling systems to hybrid air-liquid systems show that liquid systems can be up to 88% more energy-efficient.


Additionally, liquid cooling systems need few or no fans, making them much quieter and reducing noise pollution. They also allow for higher server density, optimizing space use and further lowering energy costs for large facilities.


Immersion systems, in particular, reduce risks from dust, humidity, and vibration. While occasional liquid replenishment or filtering might be necessary, most systems do not require toxic refrigerants. Raising the operating temperature in water-based systems can further improve energy efficiency, with any minor heat absorption losses offset by reduced cooling water needs.


Liquid cooling provides consistent and uniform heat dissipation, reducing equipment failure rates and lowering maintenance and replacement costs. It effectively eliminates hotspots in server equipment, a common issue with air cooling.


Moreover, liquid cooling expands the geographical options for data center locations. While air cooling is best suited for temperate climates, liquid-cooled centers can be located almost anywhere, including regions with hot climates like Malaysia, which is also looking into liquid cooling technology.


Liquid cooling also offers better heat recovery opportunities. Air cooling systems typically operate at lower temperatures, producing waste heat between 30-45°C, which has limited value for other uses. In contrast, liquid cooling systems can produce waste heat up to 80°C, making it more useful for other applications.


As global investments in AI computing power increase, the demand for high-performance computing rises, leading to higher heat output from chips. New generations of chips, such as the B200, GB200, Gaudi3, and several domestic models, have switched to liquid cooling. This shift addresses heat dissipation effectively and boosts energy efficiency. Leading tech giants like Google and Microsoft are already piloting liquid cooling in their data centers to enhance energy efficiency and reduce operational costs.


4. Challenges of Liquid Cooling Technology in Data Centers

Adapting existing data centers to support liquid cooling technology presents numerous challenges. Often, the piping systems for supplying water to server rooms need to be reconfigured depending on the chosen technology. For data centers transitioning to immersion cooling systems, assessing the infrastructure’s structural capacity is crucial.


It's also essential to evaluate whether existing server components are compatible with liquid cooling, particularly immersion and spray cooling. Introducing liquid cooling technology creates a new environment within existing data centers. Pods running generative AI workloads require higher-capacity power distribution units (PDUs), more robust circuit breakers, and more powerful low- and medium-voltage switchgear, presenting ongoing challenges.


Monitoring for component corrosion and potential harmful effects from contact with cooling liquids is vital. Liquids must be tested for contaminants to ensure they do not impair the functionality of the cooling system or equipment. For instance, the intrusion of particulates can increase the viscosity of certain coolants, negatively affecting their heat transfer capabilities. Equipment materials, including metals and sealants, may leach into the liquid. Bacterial growth can also be a problem, especially in water-based systems.


Some filtration systems have been developed to address these issues. Monitoring for leaks, which could damage sensitive equipment components, is also essential. In some cases, particularly in two-phase cooling systems, fluids must be replenished regularly. Securing the correct formulation of fluids and specific components for these systems might become a challenge as demand increases.


Ease of routine maintenance for server components must also be considered. Cold plates and adjacent components need to be easily removable for maintenance without disrupting the entire system.


Additionally, all these processes require specially trained personnel, adding to operational costs. Therefore, advancing liquid cooling technology comes with several hurdles.


5. Summary

Liquid cooling technology is essential for the future of hyperscale computing centers. As the market evolves and technology advances, liquid cooling will play a crucial role in the green transformation of data centers. This shift promises a future with more environmentally friendly and efficient computing facilities.


NADDOD, a leading provider of networking solutions, offers advanced liquid cooling transceivers for HPC and data centers. We also supply high-speed transceivers, DACs, AOCs, and patch cables for InfiniBand and RoCE networks used in HPC, AI, and data centers. For assistance in choosing the right transceivers, cables, or switches for your network, contact our experts today.

