The last couple of years have seen a huge surge in excitement around generative AI, with businesses racing to harness its potential. Business leaders who are embracing the technology have been confronted with the challenge of coping with its huge and growing energy requirements. Training, building and using generative AI models require vast amounts of energy, thanks in no small part to the fact that the Graphics Processing Units (GPUs) used to power generative AI are highly energy-intensive. For businesses using AI, there is not only a large increase in power consumption, but also a growing need for dense computational resources. Together, these factors generate heat. This has thrown a sharp focus on the rapidly growing energy demands of data centres, and the enormous amount of heat they generate. Business leaders now face an urgent need for innovative ways to manage this heat, and this is where liquid cooling can offer a tangible solution.
The problem with air cooling
Energy-intensive GPUs that power AI platforms require five to ten times more energy than Central Processing Units (CPUs), because of the larger number of transistors. This is already impacting data centres. There are also new, cost-effective design methodologies incorporating features such as 3D silicon stacking, which allows GPU manufacturers to pack more components into a smaller footprint. This again increases the power density, meaning data centres need more energy and create more heat.
Another trend running in parallel is a steady fall in TCase (or Case Temperature) in the latest chips. TCase is the maximum safe temperature for the surface of chips such as GPUs. It is a limit set by the manufacturer to ensure the chip will run smoothly and not overheat or require throttling, which impacts performance. On newer chips, TCase is coming down from between 90 and 100 degrees Celsius to 70 or 80 degrees, or even lower. This is further driving the demand for new ways to cool GPUs.
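To see why a lower TCase makes cooling harder, consider a rough sketch using the standard steady-state thermal resistance relation (temperature rise equals heat load multiplied by thermal resistance). The 35-degree coolant inlet temperature and the 0.1 °C/W thermal resistance are illustrative assumptions, not figures from any manufacturer, and the TCase values are simply midpoints of the ranges mentioned above.

```python
def max_heat_w(tcase_c: float, coolant_c: float, r_th_c_per_w: float) -> float:
    """Largest steady-state heat load (W) that keeps the chip case at or below TCase.

    Uses the simple model: case temperature = coolant temperature + heat * R_th.
    """
    if r_th_c_per_w <= 0:
        raise ValueError("thermal resistance must be positive")
    headroom_c = tcase_c - coolant_c  # temperature budget available for cooling
    return max(headroom_c, 0.0) / r_th_c_per_w


# Hypothetical values, chosen only for illustration.
INLET_C = 35.0  # assumed temperature of the air or liquid reaching the chip
R_TH = 0.1      # assumed case-to-coolant thermal resistance in degrees C per watt

older = max_heat_w(95.0, INLET_C, R_TH)  # midpoint of the 90-100 degree range
newer = max_heat_w(75.0, INLET_C, R_TH)  # midpoint of the 70-80 degree range

print(f"Older chip (TCase 95 C): up to {older:.0f} W")
print(f"Newer chip (TCase 75 C): up to {newer:.0f} W")
print(f"Reduction in heat the same cooler can handle: {1 - newer / older:.0%}")
```

In this simplified model, the same cooling hardware can remove about a third less heat from the newer chip, at exactly the time chip power is rising. The only ways out are a cooler inlet or a lower thermal resistance between chip and coolant.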
As a result of these factors, air cooling is no longer doing the job when it comes to AI. It is not just the power of the components, but the density of those components in the data centre. Unless servers become three times bigger than they were before, more efficient heat removal is needed. That requires special handling, and liquid cooling will be essential to support the mainstream roll-out of AI.
Turning to liquid
Liquid cooling is growing in popularity. Public research institutions were amongst the first users, because they usually request the latest and greatest in data centre tech to drive high performance computing (HPC) and AI. They also tend to have fewer fears around the risk of adopting new technology before it is established in the market.
Enterprise customers are more risk-averse. They need to make sure what they deploy will immediately provide a return on investment. We are now seeing more and more financial institutions – often conservative due to regulatory requirements – adopt the technology, alongside the automotive industry.
Carmakers are big users of HPC systems to develop new cars. Service providers in colocation data centres are now adopting liquid cooling too. Generative AI has huge power requirements that most enterprises cannot fulfil within their own premises, so they need to go to a colocation data centre, where service providers can deliver those computational resources. Those service providers are now transitioning to new GPU architectures, and to liquid cooling. If they deploy liquid cooling, they can be much more efficient in their operations.
Across the data centre
Liquid cooling delivers results both within individual servers and across the wider data centre. By moving from a server with fans to a server with liquid cooling, businesses can significantly reduce energy consumption. But this is only at device level; perimeter cooling – removing heat from the data centre itself – requires yet more energy. That can mean only two thirds of the energy the data centre uses goes towards computing, the task the data centre is designed to do. The rest is used to keep the data centre cool.
Power usage effectiveness (PUE) measures how efficient a data centre is. It is the power required to run the whole data centre, including the cooling systems, divided by the power required by the IT equipment alone. Some data centres optimised with liquid cooling achieve a PUE of 1.1, and some even 1.04, which means very little energy is spent on anything other than computing. That’s before we even consider the opportunity to take the hot liquid or water coming out of the racks and reuse that heat to do something useful, such as heating the building in the winter, which we see some customers doing today.
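The PUE arithmetic can be sketched in a few lines of Python. The function names and the 1,000 kW IT load are hypothetical, chosen only to make the figures easy to follow. The 1.5 case corresponds to a data centre where only two thirds of the energy reaches the computing equipment.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power divided by IT power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def it_share(pue_value: float) -> float:
    """Fraction of the facility's total energy that reaches the IT equipment."""
    return 1.0 / pue_value


def overhead_kw(pue_value: float, it_equipment_kw: float) -> float:
    """Power spent on cooling and other non-IT overhead for a given IT load."""
    return (pue_value - 1.0) * it_equipment_kw


IT_LOAD_KW = 1000.0  # hypothetical IT load, for illustration only

# A facility drawing 1,500 kW in total to run 1,000 kW of IT equipment.
air_cooled = pue(total_facility_kw=1500.0, it_equipment_kw=IT_LOAD_KW)

for value in (air_cooled, 1.1, 1.04):
    print(
        f"PUE {value:.2f}: {it_share(value):.0%} of energy reaches IT, "
        f"{overhead_kw(value, IT_LOAD_KW):.0f} kW of overhead per {IT_LOAD_KW:.0f} kW of IT load"
    )
```

On these assumed numbers, moving from a PUE of 1.5 to 1.04 cuts the overhead for the same computing load from 500 kW to roughly 40 kW.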
Density is also very important. Liquid cooling allows us to pack far more equipment into each rack. With liquid cooling, we can fully populate those racks and use less data centre space overall, which means less real estate. That is going to be very important for AI.
Meeting the challenge
Liquid-cooled systems will become an essential solution for businesses dealing with the growing energy demands of generative AI. Today, liquid-cooled systems help business leaders optimise their energy efficiency, and the technology will also enable data centres to deal with the ever-growing number of GPUs required to power future advancements in AI. Air cooling has simply become inadequate in the age of generative AI. The power demands of generative AI have placed the energy consumption of data centres under the spotlight, but this offers business leaders an opportunity to embrace innovative solutions and work towards a ‘cleaner’ approach to AI.