Llama speeds up by 500%! Google programmer hand-writes matrix multiplication kernels

Author: New Zhiyuan, 2024-04-07 14:46:00

Edited by alan

Recently, genius programmer Justine Tunney tweeted that she had updated the llamafile code, speeding up Llama inference by as much as 500% with 84 new hand-written matrix multiplication kernels!

She wrote 84 new matrix multiplication kernels that make llamafile faster at reading prompts and images.

Compared to llama.cpp, the new Llamafile delivers 30 to 500 percent faster inference on the CPU.

The improvements are most significant on ARMv8.2+ (e.g., Raspberry Pi 5), Intel (e.g., Alder Lake), and AVX512 (e.g., Zen 4) machines.

In addition, for matrices that fit in the L2 cache, the new kernels are 2x faster than MKL!

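Justine Tunney said: "To everyone in charge of MKL: you have work to do!"

After all, BLIS is funded by Microsoft, Intel, TI, AMD, HPE, Oracle, Huawei, Facebook, ARM, and the National Science Foundation, and is regarded as the most powerful open-source BLAS, so losing would be quite an embarrassment!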

Any time somebody outside Intel beats MKL by a nontrivial amount, I report it to the MKL team. It is fantastic for any open-source project to get within 10% of MKL... [T]his is why Intel funds BLIS development.

Cross-platform "llama"

Llamafile, launched in November 2023, is a local LLM project developed by Justine Tunney in collaboration with the Mozilla team.

They used Cosmopolitan Libc to package llama.cpp into a single cross-platform binary, allowing Llama models to run on six operating systems on AMD64 and ARM64.

And with GPUs in short supply, llamafile can do without expensive CUDA cores: an old CPU at home with decent performance plus a bit of RAM is enough, which is easy on everyone's wallet.

Project address: https://github.com/Mozilla-Ocho/llamafile/releases

The Llamafile code can be found on GitHub, is written in C++, has no external dependencies, and can be compiled on Linux, macOS, Windows, FreeBSD, and even SerenityOS.

Justine Tunney didn't stop there. She is already working on supporting new data formats such as FP16 and BF16 to further reduce the memory footprint, and she has even run TinyLlama successfully on a Raspberry Pi!

Performance improvements

Old HP

When Justine Tunney first tried LLMs, she used the following modest HP computer running Alpine, with a hard drive, slow RAM, an AVX2 processor, and no GPU.

HP Intel® Core™ i9-9900 ($439) w/ 2200 MT/s RAM

A fan of llama.cpp, Justine Tunney co-authored its mmap() support, which lets weights load instantly while using only half as much RAM.
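To see why mmap() helps, here is a minimal C++ sketch (POSIX only). It assumes a hypothetical weights file containing nothing but raw float32 values, which is far simpler than a real model file, and the names MappedWeights, map_weights, and unmap_weights are invented for illustration; this is not llama.cpp's actual loader. The idea it shows is that the kernel maps the file's pages directly into the process, so the data is not copied into a second heap buffer and pages are read from disk only when first touched.

```cpp
// Sketch: map a weights file read-only instead of reading it into the heap.
// Assumption: the file is a flat array of float32 values (not a real model format).
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedWeights {
    const float* data = nullptr;  // points straight at the mapped file pages
    std::size_t count = 0;        // number of float32 values
    std::size_t bytes = 0;        // mapping length, needed later for munmap
};

// Returns an empty MappedWeights (data == nullptr) on any failure.
inline MappedWeights map_weights(const char* path) {
    MappedWeights w;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return w;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return w;
    }
    const std::size_t len = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (p == MAP_FAILED) return w;
    w.data = static_cast<const float*>(p);
    w.bytes = len;
    w.count = len / sizeof(float);
    return w;
}

inline void unmap_weights(MappedWeights& w) {
    if (w.data != nullptr) munmap(const_cast<float*>(w.data), w.bytes);
    w = MappedWeights{};
}
```

With this approach, map_weights returns as soon as the mapping is set up, whatever the file size, and several processes mapping the same file can share the same physical pages.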

After that, Justine spent a lot of time optimizing the code, so let's see what the improvements look like:

On Skylake, llamafile achieved a 2x speedup, and llama.cpp also gained a 50% performance boost.

So far, Justine has written optimized kernels for q8_0, f16, q4_1, q4_0, and f32 data types.

Raspberry Pi

The latest version of the Raspberry Pi not only boosts the clock speed, but also introduces support for the ARMv8.2 dotprod and fp16 arithmetic ISA extensions, which alone give llama.cpp a 10x performance boost on f16 weights.

Because both Raspberry Pi CPUs have 32 vector registers, Justine reuses the kernels written for AVX512 to make inference 2x faster.
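Why does the number of registers matter? A rough C++ sketch may help. It uses plain scalar floats as a stand-in for vector registers, and the function names and the 4x4 tile size are assumptions made for illustration, not Justine's actual kernels. The tiled version keeps a small block of outputs in local accumulators: for each step along k it loads 4 values of A and 4 values of B and performs 16 multiply-adds, whereas the naive loop needs two loads for every multiply-add. The more registers a CPU has, the larger the tile that can stay in registers.

```cpp
// Sketch: register-tiled matrix multiplication versus the naive triple loop.
// Assumption: scalar floats stand in for SIMD registers; tile size is illustrative.
#include <cassert>
#include <cstddef>

// Computes C (m x n) = A (m x k) * B (k x n); all matrices are row-major.
inline void matmul_naive(const float* A, const float* B, float* C,
                         std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t l = 0; l < k; ++l)
                sum += A[i * k + l] * B[l * n + j];
            C[i * n + j] = sum;
        }
}

constexpr std::size_t RM = 4;  // tile rows held in accumulators
constexpr std::size_t RN = 4;  // tile columns held in accumulators

// Same result as matmul_naive, but computes an RM x RN block of C at a time.
inline void matmul_tiled(const float* A, const float* B, float* C,
                         std::size_t m, std::size_t n, std::size_t k) {
    assert(m % RM == 0 && n % RN == 0);  // sketch only: no edge handling
    for (std::size_t i = 0; i < m; i += RM)
        for (std::size_t j = 0; j < n; j += RN) {
            float acc[RM][RN] = {};  // the accumulators we hope stay in registers
            for (std::size_t l = 0; l < k; ++l)
                for (std::size_t ii = 0; ii < RM; ++ii) {
                    const float a = A[(i + ii) * k + l];  // loaded once, used RN times
                    for (std::size_t jj = 0; jj < RN; ++jj)
                        acc[ii][jj] += a * B[l * n + (j + jj)];
                }
            for (std::size_t ii = 0; ii < RM; ++ii)
                for (std::size_t jj = 0; jj < RN; ++jj)
                    C[(i + ii) * n + (j + jj)] = acc[ii][jj];
        }
}
```

Both functions add the products for each output in the same order, so they produce the same values; only the memory access pattern differs.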

It's worth noting, though, that the new ARMv8.2 fp16 ISA may introduce more rounding error than usual, because it causes llamafile to compute in fp16. So Q8_0 weights actually work better, because they use the dotprod ISA.
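The reason integer dot products avoid this problem can be sketched in a few lines of C++. Q8_0 in llama.cpp groups weights into blocks of 32 signed 8-bit integers that share one scale factor, and the ARM dotprod instructions multiply 8-bit integers and accumulate into 32-bit integers, so the inner sum is exact. The sketch below is a simplified illustration: it stores the scale as float32 rather than fp16, and the names BlockQ8, quantize_block, and dot_q8 are hypothetical.

```cpp
// Sketch: block-wise 8-bit quantization and an integer dot product.
// Assumption: the scale is kept as float32 here for portability.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlock = 32;  // values per block

struct BlockQ8 {
    float scale;            // multiply q[i] by this to recover the value
    std::int8_t q[kBlock];  // quantized values in [-127, 127]
};

// Quantizes kBlock floats so that the largest magnitude maps to 127.
inline BlockQ8 quantize_block(const float* x) {
    float amax = 0.0f;
    for (std::size_t i = 0; i < kBlock; ++i)
        amax = std::fmax(amax, std::fabs(x[i]));
    BlockQ8 b;
    b.scale = amax / 127.0f;
    const float inv = b.scale > 0.0f ? 1.0f / b.scale : 0.0f;
    for (std::size_t i = 0; i < kBlock; ++i)
        b.q[i] = static_cast<std::int8_t>(std::lround(x[i] * inv));
    return b;
}

// Dot product of two quantized vectors made of nblocks blocks each.
inline float dot_q8(const BlockQ8* a, const BlockQ8* b, std::size_t nblocks) {
    float sum = 0.0f;
    for (std::size_t blk = 0; blk < nblocks; ++blk) {
        std::int32_t isum = 0;  // exact: int8 products cannot overflow int32 here
        for (std::size_t i = 0; i < kBlock; ++i)
            isum += static_cast<std::int32_t>(a[blk].q[i]) *
                    static_cast<std::int32_t>(b[blk].q[i]);
        sum += a[blk].scale * b[blk].scale * static_cast<float>(isum);
    }
    return sum;
}
```

Rounding happens only once per block, when the integer sum is scaled back to floating point, rather than at every multiply-add.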

Gaming PC

On Alder Lake CPUs, Justine improved float16 performance by a factor of five.

Unlike on ARMv8.2, this is achieved on Alder Lake without introducing rounding errors, because the kernels use float32 as the compute type internally.

Surprisingly, for small workloads this chip can finish the job before CUDA has even started up.
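A small C++ experiment shows why the compute type matters. Portable C++ has no standard fp16 type, so this sketch emulates only fp16's precision (an 11-bit significand) by rounding a float32; fp16's narrower exponent range is ignored, and all function names are hypothetical. Summing 4096 ones with a low-precision accumulator stalls at 2048, because 2049 is not representable at that precision and rounds back down, while a float32 accumulator gives the exact answer.

```cpp
// Sketch: accumulating a dot product in reduced precision versus float32.
// Assumption: only fp16's significand width is emulated, not its exponent range.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Rounds a finite float32 to an 11-bit significand, round-to-nearest-even.
inline float round_to_fp16_precision(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    const std::uint32_t lsb = (u >> 13) & 1u;  // lowest mantissa bit that is kept
    u += 0x0FFFu + lsb;                        // ties go to the even neighbor
    u &= ~std::uint32_t{0x1FFFu};              // clear the 13 discarded bits
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Every product and every running sum is rounded to fp16 precision.
inline float dot_low_precision_acc(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc = round_to_fp16_precision(acc + round_to_fp16_precision(a[i] * b[i]));
    return acc;
}

// Inputs may be low precision, but the running sum is kept in float32.
inline float dot_f32_acc(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}
```

For two vectors of 4096 ones, dot_f32_acc returns 4096 while dot_low_precision_acc returns 2048, which is the kind of error that float32 accumulation avoids.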

Apple

The Mac Studio is the hardware platform that llama.cpp developers care about the most, so it is difficult to improve performance here.

Another problem is Apple's own closed environment:

The M2 Ultra places the RAM inside the CPU package, making latency-bound operations such as token generation faster, because the CPU no longer needs to make "long-distance calls".

We can see that the M2 Ultra exposes only 30% of its computing power through the ARM ISA compared to much cheaper Intel computers.

If developers want access to more of that computing power, they need to go through Apple's proprietary frameworks, such as Metal and Accelerate.

AMD

While llamafile cares a lot about helping people who lack GPUs, it also provides a top-notch experience for the other 1%.

With the AMD Ryzen Threadripper PRO 7995WX, around $10,000 buys you 96 cores with AVX512, based on the Zen 4 architecture.

Although it costs only twice as much as the M2 Ultra, the 7995WX offers 7x the raw computing power through the x86 ISA that the M2 Ultra exposes through the ARM ISA, and its token generation speed is almost the same, probably thanks to the 384MB L3 cache.

With Justine's optimizations, LLaMA now runs 2.8x faster on Zen 4.

Genius programmer

Justine Tunney was born in 1984 and started developing software for other hackers at the age of 14, going by the nickname "Oogle" at the time.

Let's take a brief look at some of her work over the years:

RedBean

A web server that magically runs on 6 operating systems!

This does not rely on a virtual machine layer as Java does. Instead, Justine developed a file format called APE (Actually Portable Executable) that can be executed on any x86-64 operating system.

"Compile once, run everywhere" - Java: Huh? Isn't that me?

cosmopolitan libc

To make external functionality such as the C standard library available across platforms, Justine simply wrote a libc by hand and implemented all the required core operations on each platform:

The amount of work involved is staggering. Even someone willing to grind day and night could not pull it off without real skill.

sectorLisp

A minimal Lisp implementation in only 512 bytes, bootable from BIOS:

Beyond these, there are other brilliant projects such as Blinkenlights and RoseHub, which will not be listed here.

Regarding these achievements, some commenters marveled:

Every time I read something by Justine Tunney, I am continually reminded of my mediocrity.

Regarding the mmap work mentioned earlier, commenters said that it "has the style of Fabrice Bellard".

Justine Tunney is a true genius. Similar to Fabrice Bellard, a truly unique mind.

Justine or Fabrice are the true 10x engineers, their output is world class and they are much rarer than any hiring article about these gurus want us to believe. With Justine's work, I feel would need to be more than a 1x engineer myself just to find the time to play with all of her creations.

In 2012, Justine Tunney started working at Google and has been responsible for key parts of some high-profile projects.

For example, Tunney has made many contributions to the famous TensorFlow, including a summary system for storing data.

Bazel is Google's petabyte-scale build system that evolved from Make. Tunney's main contribution is the downloader code, which automates carrier-grade transfers of public artifacts.

Nomulus, a service for managing top-level domains, is Google's first open-source production service. Tunney was responsible for writing the registry data hosting system for it.