
Topic: How is ChatGPT igniting change in the generative AI industry?

Author: Mika Technology

Data centers are predicted to become one of the world's largest energy consumers, with their share of total electricity consumption rising from 3% in 2017 to a projected 4.5% in 2025. In China, for example, the electricity consumption of the nation's data centers is expected to exceed 400 billion kWh in 2030, or about 4% of the country's total electricity consumption.
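A quick back-of-envelope check shows what those Chinese figures imply together; the 400 billion kWh and 4% values come from the paragraph above, the rest is simple arithmetic in a short Python sketch.

```python
# Back-of-envelope check of the China data-center figures quoted above.
datacenter_kwh_2030 = 400e9   # projected data-center consumption in 2030, kWh (from the article)
share_of_total = 0.04         # projected share of national consumption (from the article)

implied_national_total = datacenter_kwh_2030 / share_of_total
print(f"Implied national consumption: {implied_national_total / 1e12:.0f} trillion kWh")
# -> roughly 10 trillion kWh of total national electricity consumption in 2030
```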

Cloud providers recognize that their data centers consume large amounts of electricity and have taken measures to improve efficiency, such as building and operating data centers in the Arctic to take advantage of renewable energy and free cooling. However, this is not enough to meet the explosive growth in demand from AI applications.

Research by Lawrence Berkeley National Laboratory found that efficiency improvements have kept the growth of data center energy consumption in check over the past 20 years, but the same research shows that current energy-efficiency measures may not be enough for the data centers of the future. A better approach is needed.

Data transmission is a fatal bottleneck

The root of the efficiency problem lies in how GPUs and CPUs work, especially when running AI inference rather than training. Many people are familiar with the physical limits of pushing "beyond Moore's Law" by packing more transistors onto ever larger chips. More advanced chips help, but current solutions have a key weakness for AI inference: moving data to and from random-access memory is significantly slower than computing on it.

Traditionally, it has been cheaper to keep processors and memory on separate chips, and for years processor clock speed was the key limiting factor in computer performance. Today, what holds back progress is the interconnect between those chips.

Jeff Shainline, a researcher at the National Institute of Standards and Technology (NIST), explains: "When memory and processor are separated, the communication link connecting the two domains becomes the main bottleneck in the system." Professor Jack Dongarra, a researcher at Oak Ridge National Laboratory, puts it succinctly: "When we look at the performance of today's computers, we find that data transmission is a fatal bottleneck."
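A rough way to see why data movement dominates is to compare the arithmetic intensity (FLOPs per byte moved) of a typical inference operation against what hardware can sustain. The sketch below uses a matrix-vector multiply, the core of Transformer inference at batch size 1; the bandwidth and compute figures are illustrative assumptions, not the specifications of any particular chip.

```python
# Illustrative roofline-style estimate: why inference is usually memory-bound.
# The hardware numbers are placeholders, not specs of a real accelerator.
peak_flops = 100e12        # assumed peak compute: 100 TFLOP/s
mem_bandwidth = 1e12       # assumed memory bandwidth: 1 TB/s

# Matrix-vector multiply y = W @ x with W of shape (n, n), fp16 weights (2 bytes each)
n = 4096
flops = 2 * n * n               # one multiply + one add per weight
bytes_moved = 2 * n * n         # every weight must be read from memory at least once

intensity = flops / bytes_moved                 # ~1 FLOP per byte moved
time_compute = flops / peak_flops
time_memory = bytes_moved / mem_bandwidth

print(f"Arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"Compute-limited time: {time_compute * 1e6:.2f} us, memory-limited time: {time_memory * 1e6:.2f} us")
# With these assumptions, the memory-limited time is ~100x the compute-limited time:
# the chip spends most of its time waiting for weights to arrive, not calculating.
```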

AI inference vs. AI training

AI systems use different kinds of computation when training a model than when using that model to make predictions. Training loads tens of thousands of images or text samples into a Transformer-based model and then begins processing them. The thousands of cores in a GPU handle large, rich data sets such as images or video very efficiently, and if results are needed faster, more cloud-based GPUs can be rented.
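The difference in compute pattern shows up clearly in a minimal PyTorch-style sketch; the model, loss function, and batch names here are generic placeholders, not any specific framework's API beyond standard PyTorch.

```python
import torch

# Training: repeated forward + backward passes over large batches, weights updated every step.
def train_step(model, batch, optimizer, loss_fn):
    optimizer.zero_grad()
    logits = model(batch["inputs"])           # forward pass over a large batch
    loss = loss_fn(logits, batch["targets"])
    loss.backward()                           # backward pass roughly doubles the compute
    optimizer.step()                          # weight update adds yet more memory traffic
    return loss.item()

# Inference: a single forward pass per request, weights frozen, typically tiny batches.
@torch.no_grad()
def infer_next_token(model, prompt_tokens):
    logits = model(prompt_tokens)             # forward pass only
    return logits[:, -1].argmax(dim=-1)       # pick the most likely next token
```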


While a single AI inference requires far less computation than training, serving autocompletion to hundreds of millions of users requires an enormous number of predictions about what the next word will be, and in aggregate this can consume more energy than the long training run itself.

Facebook's AI systems, for example, handle trillions of inferences a day in its data centers, a number that has more than doubled in the past three years. One study found that running language-translation inference on a large language model (LLM) consumes two to three times more energy than the initial training.
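A hypothetical back-of-envelope calculation illustrates how per-query inference costs add up against a one-time training cost; every number below is a made-up placeholder chosen only to show the shape of the comparison, not a measurement of any real system.

```python
# Hypothetical illustration: aggregate inference energy vs. one-time training energy.
# All values are placeholders, not measurements.
training_energy_kwh = 1_000_000      # assumed one-time training cost, kWh
energy_per_query_kwh = 0.001         # assumed energy per inference request, kWh
queries_per_day = 100_000_000        # assumed daily request volume

daily_inference_kwh = energy_per_query_kwh * queries_per_day
days_to_match_training = training_energy_kwh / daily_inference_kwh
print(f"Inference energy per day: {daily_inference_kwh:,.0f} kWh")
print(f"Days of serving needed to exceed the training cost: {days_to_match_training:.0f}")
# With these placeholder numbers, serving overtakes the training cost in about 10 days.
```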

Surge in demand tests computational efficiency

ChatGPT took the world by storm late last year, and GPT-4 is even more impressive. A more energy-efficient approach could extend AI inference to a wider range of devices and create new ways of computing.

Microsoft's Hybrid Loop, for example, aims to build AI experiences that dynamically leverage both cloud computing and edge devices. It lets developers make late-binding decisions about whether to run AI inference on the Azure cloud, on local client computers, or on mobile devices, so as to maximize efficiency. Facebook introduced AutoScale to help decide efficiently, at runtime, where to compute each inference.
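The idea behind such late-binding dispatch can be sketched simply; this is a generic illustration, not the actual Hybrid Loop or AutoScale API, and the thresholds, field names, and device profile are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    name: str
    available_memory_gb: float
    battery_fraction: float      # 1.0 = fully charged or plugged in
    est_latency_ms: float        # estimated on-device inference latency

def choose_inference_target(model_size_gb: float, device: DeviceProfile,
                            latency_budget_ms: float = 200.0) -> str:
    """Pick where to run an inference request at runtime (hypothetical policy)."""
    fits_on_device = model_size_gb < device.available_memory_gb
    has_power = device.battery_fraction > 0.3
    fast_enough = device.est_latency_ms < latency_budget_ms
    if fits_on_device and has_power and fast_enough:
        return device.name       # run locally: no network hop, data stays on the device
    return "cloud"               # otherwise fall back to a cloud endpoint

# Example: a 4 GB model on a laptop with 8 GB free memory and good battery runs locally.
laptop = DeviceProfile("laptop", available_memory_gb=8, battery_fraction=0.8, est_latency_ms=120)
print(choose_inference_target(4.0, laptop))   # -> "laptop"
```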

To get there, we need to overcome the barriers that hold back efficient AI and find practical ways to do so.

Sampling and pipelining can speed up deep learning by reducing the amount of data processed. SALIENT (for sampling, slicing, and data movement) is a method developed by researchers at MIT and IBM to address these critical bottlenecks. The approach can significantly reduce the resources needed to run neural networks on large datasets with 100 million nodes and 1 billion edges. But it also compromises accuracy and precision, which is acceptable when choosing the next social post to display, but not when trying to identify unsafe conditions on a job site in near real time.
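The core idea of sampling can be illustrated with generic neighbor sampling for a graph neural network; this is a simplified sketch of the technique, not the actual SALIENT implementation.

```python
import random

def sample_neighbors(adjacency: dict, node: int, fanout: int) -> list:
    """Keep at most `fanout` neighbors per node instead of the full neighborhood."""
    neighbors = adjacency.get(node, [])
    if len(neighbors) <= fanout:
        return neighbors
    return random.sample(neighbors, fanout)

# Toy graph: node 0 has 1,000 neighbors, but each training step only touches a few of them,
# so far less data has to be sliced out of memory and moved to the accelerator.
graph = {0: list(range(1, 1001)), 1: [0, 2], 2: [0, 1]}
print(sample_neighbors(graph, 0, fanout=10))   # 10 of node 0's 1,000 neighbors
```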

Tech companies such as Apple, Nvidia, Intel, and AMD have announced the integration of dedicated AI engines into their processors, and AWS is even developing its new Inferentia 2 processor. But these solutions still use the traditional von Neumann processor architecture, with integrated SRAM and external DRAM memory, all of which requires additional power to move data in and out of memory.

In-memory computing may be the answer

Researchers have also found another way to break through the "memory wall": bring computing closer to the memory itself.


The memory wall is the physical barrier limiting how fast data can move in and out of memory, a fundamental limitation of traditional architectures. In-memory computing (IMC) addresses this challenge by running AI matrix calculations directly in the memory module, avoiding the overhead of sending data across the memory bus.

IMC is well suited to AI inference because inference involves a relatively static but large set of weights that is accessed repeatedly. While some data always has to flow in and out, keeping the weights in the same physical unit lets the AI use and reuse them across many calculations, eliminating most of the energy cost and latency of data movement.
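The benefit can be sketched in plain NumPy: the large weight matrix stays put while many small requests stream past it, which is exactly the reuse pattern IMC hardware exploits. This models the data-reuse idea only, not the memory circuitry itself, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4096, 4096)).astype(np.float16)   # large, static weight set (~32 MB)

def serve_requests(inputs: np.ndarray) -> np.ndarray:
    # `weights` is loaded (or, with IMC, physically stored) once and reused for every
    # request; only the small per-request input and output vectors move each call.
    return inputs @ weights.T

batch = rng.standard_normal((8, 4096)).astype(np.float16)         # 8 incoming requests
outputs = serve_requests(batch)
print(outputs.shape)   # (8, 4096): per-request traffic is kilobytes, the reused weights are megabytes
```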

This approach also improves scalability because it lends itself to modular chip designs. With such chips, AI inference can be tested on a developer's PC and then deployed to production in a data center, where racks of appliances with many such processors can run enterprise-grade AI models efficiently.

Over time, IMC is expected to become the dominant architecture for AI inference use cases. This makes sense when users are dealing with massive data sets and trillions of calculations: resources are no longer wasted pushing data across the memory wall, and the approach can easily scale to meet long-term needs.

Brief summary:

The AI industry is now at an exciting inflection point. Technological advances in generative AI, image recognition, and data analytics are revealing novel connections and uses for machine learning, but first a technology solution capable of meeting that demand has to be built. Gartner predicts that, unless more sustainable options become available, AI will consume more energy than the human workforce by 2025. Something better needs to be figured out before that happens.
