
ISC24 | Data centers need new energy efficiency metrics

Author: NVIDIA China

Supercomputer and data center operators are unable to measure their progress toward sustainable computing because they lack a standard to measure the useful work done per unit of energy.


Data centers need more advanced dashboards that show the progress of real-world applications to guide them in improving energy efficiency.

The formula for calculating energy efficiency is simple – divide the work done by the amount of energy used. But to apply it to the data center, there are some details to consider.
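As a rough sketch, that basic formula looks like this in code. The unit of work ("requests served") and all figures are illustrative assumptions, not measurements from any real system:

```python
# A minimal sketch of the efficiency formula: useful work divided by energy.
# The unit of work ("requests served") and all figures are hypothetical.
def energy_efficiency(useful_work: float, energy_joules: float) -> float:
    """Useful work done per joule of energy consumed."""
    if energy_joules <= 0:
        raise ValueError("energy must be positive")
    return useful_work / energy_joules

KWH_TO_JOULES = 3.6e6                      # 1 kWh = 3.6 million joules
requests_served = 1_200_000                # hypothetical useful work
energy = 3.6 * KWH_TO_JOULES               # 3.6 kWh of energy consumed
print(f"{energy_efficiency(requests_served, energy):.4f} requests per joule")
```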

Power Usage Effectiveness (PUE), the most widely used metric today, compares the total energy a facility consumes with the energy used by its computing infrastructure. Over the past 17 years, PUE has helped the most efficient operators get close to the ideal of wasting virtually no energy on processes such as power conversion and cooling.
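For illustration only, a PUE calculation with hypothetical numbers looks like this (a value of 1.0 would mean no energy is spent on anything but the IT equipment itself):

```python
# A minimal sketch of the PUE calculation with hypothetical monthly figures.
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh

# Hypothetical month: 1,200 MWh consumed in total, 1,000 MWh of it by servers.
print(f"PUE = {pue(1_200_000, 1_000_000):.2f}")   # -> PUE = 1.20
```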

Look for the next metric

PUE helped data centers a lot when cloud computing was on the rise and will continue to play a role. But in today's era of generative AI, workloads and the systems that run them have changed dramatically, so PUE alone isn't enough.

This is because PUE only measures the energy consumed by a data center, but not the useful output of a data center. It's like measuring the fuel consumption of an engine without knowing how far the car has traveled.

There are many measures of data center efficiency. A 2017 article lists nearly three dozen standards, several of which focus on specific targets for cooling, water use, safety, cost, and more.

Understand what "Watts" are

It's somewhat unfortunate that the computer industry has long described the energy efficiency of systems and their processors in terms of power, usually measured in watts. As important as that figure is, many people don't realize that watts only describe the input power at a given moment; they say nothing about the energy a computer actually uses or how efficiently it uses it.

So when the input power of a modern system or processor, measured in watts, goes up, that does not mean its energy efficiency has gone down. In fact, the ratio of work done to energy consumed by these systems and processors is typically far higher.

Measurement of a modern data center should instead focus on energy, expressed in kilowatt-hours or joules, and on how much useful work gets done with that energy.
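To make the distinction concrete, here is a small sketch with made-up numbers: a system that draws more watts but finishes the same job sooner can still consume fewer joules overall.

```python
# A minimal sketch of power vs. energy: energy is power integrated over time,
# so higher wattage can still mean lower total energy if the job ends sooner.
# Both systems and all figures below are hypothetical.
def job_energy_joules(avg_power_watts: float, runtime_seconds: float) -> float:
    return avg_power_watts * runtime_seconds

KWH = 3.6e6  # joules per kilowatt-hour

energy_a = job_energy_joules(400, 10 * 3600)   # 400 W for 10 hours
energy_b = job_energy_joules(700, 2 * 3600)    # 700 W, same job done in 2 hours

print(f"System A: {energy_a / KWH:.1f} kWh, System B: {energy_b / KWH:.1f} kWh")
# System A: 4.0 kWh, System B: 1.4 kWh
```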

Redefine what we call work

At this point, the industry is still accustomed to using abstract terms such as processor instructions or mathematical calculations. As a result, MIPS (million instructions per second) and FLOPS (floating point operations per second) are widely used.

Only computer scientists care how much of this low-level work their systems can handle. Users want to know how much real work their systems accomplish, though the definition of useful work is somewhat subjective.

AI-focused data centers mainly look to the MLPerf benchmarks. Supercomputing centers engaged in scientific research often use additional measures of work, and commercial data centers focused on streaming media may need still other standards.

The resulting suite of applications must be able to evolve over time to reflect the latest state of the art and the most relevant use cases. For example, the latest round of MLPerf added tests using two generative AI models that didn't even exist five years ago.

The standard for accelerated computing

Ideally, any new benchmark should measure advances in accelerated computing, which combines parallel-processing hardware, software, and methods to run many modern workloads faster and more efficiently than CPUs alone.

For example, in scientific applications, the Perlmutter supercomputer at the National Energy Research Scientific Computing Center (NERSC) uses accelerated computing to improve energy efficiency by an average of five times. That's one reason why 39 of the top 50 supercomputers on the Green500 list, including the number-one system, use NVIDIA GPUs.


Because GPUs can perform many tasks in parallel, they do more work in less time and use less energy than CPUs

Businesses across many industries have achieved similar results. PayPal, for example, improved real-time fraud detection by 10% and lowered server energy consumption nearly 8x through accelerated computing.

With each generation of GPU hardware and software, the performance gains continue to grow.

In a recent report, Stanford University's Institute for Human-Centered AI estimated that since 2003, GPU performance has "increased roughly 7,000 times" and price per performance is "5,600 times greater."


Data centers need a baseline to track the energy efficiency of their primary workloads

Insights from two experts

Experts also agree that new energy efficiency metrics are needed.

Christian Belady, the data center engineer who originally came up with the concept of PUE, believes the metric, now around 1.2 at today's data centers, has become outdated. "This metric improved data center efficiency back when things were not good," he said. "But now, 20 years later, things are better, so we need to focus on other metrics more closely related to today's problems."

Looking to the future, Belady said, "Performance metrics are key. While different workloads can't be compared directly, I think an approach broken down by workload is more likely to succeed."

Jonathan Koomey, an academic and author who studies computer efficiency and sustainability, agrees.

"To make the right decisions about efficiency, data center operations need a set of benchmarks to measure the impact of today's most widely used AI workloads on energy consumption," said Koomey. ”

"Per-joule tokens are a good example of a benchmark-like component. Enterprises need to participate in open discussions, provide details about their workloads and experiments, and agree to a factual testing process to ensure that these metrics accurately describe how much energy the hardware consumes when running the application in the real world. ”

"Finally, we need an open forum to carry out this important work [the development of new energy efficiency indicators]."

Pool everyone's wisdom and effort

Thanks to metrics such as PUE and lists such as the Green500, data centers and supercomputing centers have made huge strides in energy efficiency.

In the age of generative AI, we can and must do more to further improve energy efficiency. Measuring the useful work today's most advanced applications deliver per unit of energy could take supercomputing and data center energy efficiency to the next level.