Staying cool: a mini-guide to liquid cooling for data centers
Fueled by the rise of cloud services, digital transformation, the internet of things (IoT), and other data-centric services, there's been a steady increase in data center capacity over the past decade. Now, thanks to incredible growth in AI and machine learning, demand for data center capacity is skyrocketing. In fact, Global Market Insights predicts the data center infrastructure market will grow at a CAGR of over 12% between 2023 and 2030.
Meanwhile, McKinsey expects increased demand to drive a similar rise in data center energy consumption. In the US alone, it predicts that data center energy demand will reach 35 gigawatts (GW) by 2030, roughly double the 17 GW of 2022. Besides adding directly to operating expenditures, the massive amount of energy modern data centers consume also intensifies concerns about environmental sustainability.
The AI boom runs on ultra-high-performance GPUs and CPUs. As these processors power new AI and ML workloads and train large language models (LLMs), they also generate massive amounts of heat. Data centers operate best within a temperature range of 64 to 80 degrees Fahrenheit. If rising temperatures are not brought under control quickly enough, the result can be equipment failure, system downtime, and ultimately, financial loss.
In fact, nearly 30% of data center outages are caused by environmental factors, including temperature, humidity, and airflow, among other conditions. Keeping temperatures in the ideal range is crucial to keeping workloads running smoothly, but managing the excess heat is one of the main challenges data centers face today.
How much heat are we talking about?
Consider one recent development. Meta just announced its plan to purchase 350,000 of Nvidia's top-of-the-line H100 GPUs. Each of these GPUs has a max thermal design power (TDP) rating of between 300 and 700 watts (W), depending on the model.
A little rough math indicates that once Meta has these 350,000 GPUs up and running, its data centers will generate somewhere between roughly 105 and 245 megawatts (MW) of additional heat at full throttle. According to the US Department of Energy, that's comparable to the electricity needs of 80,000-200,000 households. So, Meta will need to reliably move a small city's worth of power, in the form of heat, away from its new AI servers. Wow!
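For the curious, here is that back-of-the-envelope math as a short Python sketch. The GPU count and TDP range come from the paragraph above; the average household draw (about 1.2 kW continuous, in line with typical US annual consumption) is an assumption used only for the comparison.

```python
# Back-of-the-envelope estimate of the heat output from Meta's planned H100 fleet.
# GPU count and TDP range are taken from the text; the average household draw
# (~1.2 kW continuous) is an assumed figure used only for the comparison.

GPU_COUNT = 350_000
TDP_MIN_W = 300           # lower bound of the H100 TDP range
TDP_MAX_W = 700           # upper bound of the H100 TDP range
AVG_HOUSEHOLD_KW = 1.2    # assumed average continuous draw per US household

def heat_mw(gpu_count: int, tdp_w: float) -> float:
    """Total heat output in megawatts, assuming every watt drawn ends up as heat."""
    return gpu_count * tdp_w / 1e6

low, high = heat_mw(GPU_COUNT, TDP_MIN_W), heat_mw(GPU_COUNT, TDP_MAX_W)
print(f"Heat at full throttle: {low:.0f}-{high:.0f} MW")        # ~105-245 MW
print(f"Comparable households: {low * 1000 / AVG_HOUSEHOLD_KW:,.0f}"
      f" to {high * 1000 / AVG_HOUSEHOLD_KW:,.0f}")             # ~88,000 to ~204,000
```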
This isn't only a challenge for huge tech companies. As data centers of all types seek to maximize capacity, it's not uncommon to see rack densities reaching 70 kW or more. However, traditional air-cooled approaches, such as open-air and air-containment methods, become ineffective once rack densities approach 50 kW.
In short, all signs point to the crucial need for more effective and energy-efficient data center cooling. Fortunately, we can take advantage of the fact that liquids are far better at absorbing and conducting heat than air is. From helping air cooling work better to completely immersing components in fluid, there is a growing range of liquid-cooling innovations that help manage the rising heat of AI workloads while also increasing energy efficiency. Let's dive in.
Method one: Liquid-assisted cooling
The simplest approach adds liquid cooling to air-cooled environments to assist with heat removal. Rear-door heat exchangers (RDHX), for example, replace standard server rack enclosure doors with doors that have liquid-to-air heat exchangers built in. These capture heat from the hot exhaust air as it leaves the rack, boosting the effectiveness of the existing cooling fans.
Another solution in this category improves cooling performance by isolating cool-air distribution and warm-air return paths from the open-room air. The HPE Adaptive Rack Cooling System (ARCS) is an example that uses this approach. It extends the life of existing data centers and enables simultaneous cooling for up to four racks housing as much as 150 kW of IT capacity.
Method two: Direct-liquid cooling (DLC)
The next step along the spectrum of liquid-cooling approaches puts cooling liquid in direct contact with heat-producing components such as CPUs and GPUs. The HPE Apollo Direct Liquid Cooling (DLC) System, for example, integrates direct liquid cooling into the HPE Apollo Gen10 Plus System. It allows these high-density HPC systems to reduce fan power consumed at the server by 81%. Considering that cooling accounts for around 40% of a data center's energy consumption, you can see how DLC solutions can deliver impressive energy savings.
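To make that claim concrete, here is a rough, purely hypothetical illustration of how a cut in cooling energy flows through to a facility's total consumption. The 40% cooling share is the figure cited above; the facility power and the assumed reduction in cooling energy are made-up inputs for the sake of the sketch, not measured DLC results.

```python
# Hypothetical illustration: how a cut in cooling energy translates into
# facility-wide savings. The 40% cooling share comes from the article; the
# facility power and the assumed cooling-energy reduction are illustrative only.

FACILITY_POWER_KW = 2_000.0   # assumed total facility power draw
COOLING_SHARE = 0.40          # cooling's share of total facility energy (article)
COOLING_REDUCTION = 0.50      # assumed fractional cut in cooling energy with DLC

cooling_kw = FACILITY_POWER_KW * COOLING_SHARE          # 800 kW spent on cooling
saved_kw = cooling_kw * COOLING_REDUCTION               # 400 kW saved
facility_savings_pct = 100 * saved_kw / FACILITY_POWER_KW

print(f"Cooling load:          {cooling_kw:.0f} kW")
print(f"Energy saved:          {saved_kw:.0f} kW")
print(f"Facility-wide savings: {facility_savings_pct:.0f}%")  # 0.40 * 0.50 = 20%
```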
Method three: Full closed-loop liquid cooling
In this type of system, components are cooled by liquid in a closed loop that doesn't use ambient air. For instance, the HPE Cray Supercomputing EX system uses full closed-loop liquid cooling to efficiently remove heat from high-power CPUs, GPUs, and other components in some of the world's most powerful computers. This includes Frontier, a true exascale system that holds the number one spot in the TOP500 supercomputer rankings for November 2023.
Full closed-loop liquid cooling enables these supercomputers to handle the incredibly high compute densities required for record-breaking computing performance while also delivering groundbreaking new levels of sustainability. In fact, six of the ten most energy-efficient supercomputers on November's Green500 list are full closed-loop HPE Cray EX systems. Full closed-loop liquid cooling is one of the key reasons that Frontier, in addition to being the most powerful supercomputer on the planet, is also one of the greenest: it sits in eighth place on the list with an impressive energy efficiency score of 52.59 GFlops/watt.
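As an aside, the Green500 score is simply sustained Linpack performance (Rmax) divided by power draw. Here's a minimal sketch of that relationship, assuming Frontier's publicly reported Rmax of roughly 1.2 exaflops; the power figure it implies is a derived estimate, not an official measurement.

```python
# Green500 efficiency = sustained Linpack performance (Rmax) / power draw.
# The ~1.2 exaflop Rmax is Frontier's publicly reported figure (approximate);
# the implied power draw below is derived from that assumption, not measured.

RMAX_GFLOPS = 1.2e9              # ~1.2 exaflops, expressed in gigaflops
EFFICIENCY_GFLOPS_PER_W = 52.59  # Frontier's Green500 score cited above

implied_power_mw = RMAX_GFLOPS / EFFICIENCY_GFLOPS_PER_W / 1e6
print(f"Implied power draw: {implied_power_mw:.1f} MW")  # roughly 22-23 MW
```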
Method four: Immersion cooling
Our final stop on the liquid-cooling spectrum is immersion cooling. This approach involves submerging electronic components or entire servers in a dielectric, thermally conductive liquid that absorbs heat directly from the components. Because these components and liquid are often sealed in a fully enclosed chassis, immersion-cooled systems are also well suited for edge environments with harsh physical conditions.
HPE OEM partner Iceotope collaborated with HPE and Intel on its Ku:l Data Center immersion cooling solution. This is a completely enclosed, liquid-cooled, ruggedized mini data center powered by HPE ProLiant DL380 Gen10 servers. In HPC benchmark testing, Iceotope demonstrated that its immersion-cooled system increased performance by 4% while consuming 1 kW less energy at the rack level compared to an identical air-cooled system. The OEM calculated that this translated to a 5% direct energy saving in IT operations and an amazing 30% overall energy saving at scale.
Start cooling things down
From extreme edge environments to the most powerful supercomputers in the world, and from new data center builds to upgrades and retrofits, liquid cooling is a key technology for improving energy efficiency and sustainability while managing the rising heat output of high-power chips.
Ready to take the plunge into liquid cooling? (Sorry, I couldn't resist.) Let's talk about how we can help you deliver liquid-cooled computing performance for your customers' needs.
MattQuirk
With a passion for innovation and technology, I am lucky enough to work on high-growth opportunities across multiple industries, including manufacturing, healthcare, energy, media and entertainment, and security, with technology innovations such as AI, autonomous everything, and 5G that are advancing the way people live and work.