
AMD's Radeon GPU stack will soon be open source


AMD says it expects to release its Microengine Scheduler (MES) documentation by the end of May, with source code to follow, and will then proceed to open-source other parts of the Radeon stack. The statement appears to be a response to a Twitter/X post from Tiny Corp, which has been publicly communicating with (and often criticizing) AMD on social media for months.


Discussions between Tiny Corp and AMD have made headlines several times in recent months. The former designed and pre-sold the TinyBox AI server, which attracted interest for its use of relatively inexpensive AMD Radeon GPUs. However, the company and its founder, George Hotz, became very vocal on social media after finding that the consumer-grade cards did not behave reliably enough for server or enterprise use.

In short, Tiny Corp wanted more and deeper access to AMD hardware, firmware, and driver IP. The company believes that with full access to the firmware and driver stack, the TinyBox can deliver its advertised capabilities. Even though Tiny Corp is a small company, AMD engaged with it, and Dr. Lisa Su herself joined the conversation in early March, saying that "the team is working on it." Tiny Corp, however, remained unhappy with the situation it found itself in.

"We are working to release the Microengine Scheduler (MES) documentation by the end of May and will follow up on the published source code for external review and feedback," the official AMD Radeon Twitter/X account noted in early April. "We've also opened up the GitHub tracker, which will provide updates on the status of fixes and release dates. ”

Today, we're seeing a major update on AMD's documentation and open-source progress. In response to further barbs from Tiny Corp, AMD's Radeon graphics team reiterated the MES documentation commitment earlier this month, adding that "the rest of the Radeon stack will be open sourced throughout the year" and directing interested parties to keep an eye on the GitHub repository.

Tiny Corp has already responded to AMD's statement, describing the MES news as a "diversion" and asking for more parts of the architecture to be open-sourced, along with documentation of the hardware scheduler, which it believes is the cause of the TinyBox's system deadlocks.

As PC enthusiasts with no particular interest in running a server like the TinyBox, we're still very interested in the ripple effects that additional Radeon documentation and open-source releases might have. If an entity like Tiny Corp eliminates bugs and contributes optimizations, other Radeon users should benefit. This openness could also help Linux developers and communities seeking to get more out of Radeon hardware.

AMD's AI chip strategy

To say that AMD's story has been a rollercoaster ride would be an understatement: the contrast between AMD in 2014 and AMD in 2024 is enormous. Ten years ago AMD was struggling; today it has recovered and, crucially, become a player in many markets.

Like many other players in the space, AMD has made AI a major focus, building a dedicated in-house AI team to cover an end-to-end strategy for the fast-growing AI market.

In recent weeks, AMD CEO Lisa Su and Jack Huynh, senior vice president/general manager of the Compute and Graphics Group, have both answered questions from industry analysts about the nature of AMD's AI hardware strategy and how they view its portfolio.

AMD's AI hardware strategy is three-pronged:

The first is AMD's Instinct family of data center GPUs, currently sold as the MI300 series.

The series is available in two variants. The MI300X is focused on AI - it has gained adoption from large cloud vendors such as Microsoft Azure, as well as smaller AI-centric clouds like TensorWave.

During the latest earnings call, Lisa Su commented that demand for these chips continues to grow, with AMD raising its forecast for their 2024 revenue from $2 billion to $3.5 billion. At the launch event, AMD benchmarked an eight-chip system against NVIDIA's H100, claiming parity in ML training and an advantage in ML inference.

The other variant, the MI300A, offers similar specs but pairs the GPU with CPU chiplets, targeting high-performance computing. It has been adopted for El Capitan, planned to be the world's largest supercomputer, which will use machine learning models to help safeguard the U.S. nuclear stockpile.

Commenting on the adoption of the MI300, Lisa said:

"We were pleasantly surprised and excited to see the momentum of the MI300 and where that momentum is coming from. Big cloud [customers] are typically the fastest moving – from workload [to workload]. The LLM is a perfect fit for the MI300 - our memory capacity and memory bandwidth [are market leading]. Artificial intelligence is the most dominant workload. [We] have quite a broad customer base that they have different needs – some are training, some are fine-tuning, some are mixed. When we started with the customer, [but] lost confidence from the pattern. [We also spend] a lot of work on the software environment. New customers [discover] that it's easier to meet their performance expectations because ROCm (AMD's software stack) is maturing. [Us] [MI300]'s largest workload is large language models. ”

It should also be noted that AMD recently announced that it is extending its chip-to-chip communication protocol, Infinity Fabric, to select networking partners such as Arista, Broadcom, and Cisco. We expect these companies to build Infinity Fabric switches that let MI300 chips communicate with one another beyond a single system.

The second aspect of AMD's strategy is their family of client GPUs.

This includes AMD's Radeon discrete graphics cards (GPUs) and their APUs, which consist of GPUs integrated into client CPUs and are primarily used in laptops. Both the first and second aspects of AMD's AI strategy rely on their compute stack, called ROCm, which is AMD's competitor to the NVIDIA CUDA stack.

A long-standing complaint about ROCm (even the latest version) is the inconsistency of support between enterprise and consumer hardware - only AMD's Instinct GPUs and a select few discrete GPUs properly support ROCm and its associated libraries, while CUDA runs on almost all NVIDIA hardware.

However, Jack said in our Q&A:

"We're [currently] enabling ROCm on our 7900 flagship so that you can execute some AI applications. We will be expanding ROCm more broadly. "There are schools, universities, and startups that may not be able to afford very high-end GPUs, but they want to tinker. We want to make that community a developer tool. ”

We hope this means ROCm will gain broader support across the current generation of hardware, as well as all future generations - not just the flagship RX 7900 series.
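For developers, the practical question is simply whether a given card is visible to the stack. Here is a minimal sketch, assuming a PyTorch build compiled against ROCm (PyTorch's ROCm backend reuses the torch.cuda API, so the same calls also work on CUDA builds):

```python
import torch

# Report which compute backend this PyTorch build was compiled for.
# torch.version.hip is a version string on ROCm builds and None otherwise.
if torch.version.hip is not None:
    print(f"ROCm/HIP build: {torch.version.hip}")
elif torch.version.cuda is not None:
    print(f"CUDA build: {torch.version.cuda}")
else:
    print("CPU-only build")

# On ROCm builds, the torch.cuda.* calls enumerate supported Radeon/Instinct GPUs.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"device {i}: {torch.cuda.get_device_name(i)}")
else:
    print("no supported GPU detected")
```

Whether a consumer Radeon card actually shows up in that last check is precisely the support gap described above.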

Lisa also commented on AMD's software stack:

"The big problem lately is software. We've made huge strides in software. The ROCm 6 software stack is a major step forward. There's still a lot of work to be done on the software side...... We want to seize the huge opportunity. ”

AMD's third area is their XDNA AI engine.

While the technology comes from Xilinx, the IP was licensed to AMD prior to the acquisition. These AI engines are being integrated into laptop processors and will be presented as NPUs for Microsoft's AI PC program, competing with products from Intel and Qualcomm. They are designed for low-power inference, not the high-throughput inference or training that high-power GPUs are capable of.
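One common route to these NPUs is ONNX Runtime's Vitis AI execution provider, which AMD's Ryzen AI software builds on. The sketch below is illustrative only - "model.onnx", the input name "input", and the config file path are assumptions that depend on the actual model and SDK setup:

```python
import numpy as np
import onnxruntime as ort

# Create a session that prefers the NPU (Vitis AI) and falls back to CPU.
# The provider_options config file comes from the Ryzen AI SDK installation.
session = ort.InferenceSession(
    "model.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
    provider_options=[{"config_file": "vaip_config.json"}, {}],
)

# Hypothetical image batch; the shape and input name depend on the model.
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": x})
print(outputs[0].shape)
```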

Commenting on the status of NPUs vs. GPUs, Lisa said:

"AI engines will be more popular in some places, such as PCs and laptops. If you're looking for a massive, more workstation laptop, [they] might use a GPU in that framework. ”

AMD sees a future with AI workloads spread across multiple engines: CPUs, GPUs, and NPUs. It's worth noting that everyone else in the space is striking the same note.

Jack commented:

"[For [the] NPU, Microsoft is pushing [it] heavily because of efficacy. The NPU can still drive the experience, but it won't compromise the battery [life]. We're betting on the NPU. We're going to do 2x and 3x on AI...... The key to the NPU is battery life - in desktops, you often don't have to worry about batteries, and you can bring custom data formats [supported by the NU] to the desktop. ”

This three-pronged approach lets AMD address all aspects of the AI space, ensuring its eggs are not all in one basket. AMD has had some success with it - in the data center space, AMD is considered NVIDIA's closest competitor, and the memory capacity and bandwidth of the MI300 let it compete well with NVIDIA's H100 hardware (we are still waiting for B100 benchmarks). The NPU space is still too new and volatile to know whether AMD's strategy will pay off, but it's likely that Microsoft will use NPUs for on-device machine learning models, such as assistants or "copilot" models.

From our perspective, the weakness of AMD's strategy is on the desktop GPU side, as the hardware stack lacks near-universal ROCm support. This is a problem that will take time to solve - one disadvantage of fighting on multiple fronts is that resources are divided, and AMD will need disciplined management to ensure there is no duplication of work across the company. On the positive side, AMD keeps raising its 2024 data center revenue forecasts, stating that the only constraint is supply, not demand.

Link to original article

https://www.tomshardware.com/pc-components/gpus/amd-pushes-forward-with-its-radeon-stack-open-sourcing-plans-after-being-prodded-by-tiny-corp

Source | Semiconductor Industry Observation (ID: icbank), compiled from Tom's Hardware
