
Karpathy's 1,000 Lines of Pure C Now Train a Large Model as Fast as PyTorch

Author: Not bald programmer

A few days ago, we introduced llm.c, the latest project from AI luminary Andrej Karpathy: GPT-2 training in about 1,000 lines of pure C, with no dependency on PyTorch, the large and complex framework that dominates large language model training.


Since open-sourcing llm.c on GitHub and posting a rousing call for hackers around the world to improve its performance, Andrej Karpathy has made remarkable progress.

At the beginning, llm.c was 4.2x slower per iteration than PyTorch; it has now been brought down to 26.2 ms/iteration, exactly matching PyTorch's performance (TF32 forward pass). One fix came from a discovered bug in which the code was incorrectly calling cuBLAS in FP32 math mode. In addition, the hacker ademeure contributed a softmax kernel optimized for very long rows (50,257 elements per row, in the final logits layer).

But the hacking fun doesn't end there, because plenty of tricks remain. The current llm.c attention kernel uses naive attention rather than FlashAttention, which means it materializes (very large) pre-attention and post-attention matrices of shape (B, NH, T, T). It also performs unnecessary, unfused GeLU nonlinearities and permute/unpermute operations inside attention.

llm.c has not yet applied further optimizations, e.g. CUDA Graphs, losslessly compressible memory (?), etc. The updated performance chart looks promising: with just about 2,000 lines of pure C, the goal of training an LLM faster than PyTorch looks achievable.


Judging from the projection, optimization will continue, and by April 23 llm.c's training iteration speed is expected to be several times faster than PyTorch's.

Andrej Karpathy is now focusing on the backward pass, so that llm.c can run the entire training loop in CUDA.

A netizen asked Andrej Karpathy whether llm.c could be ported from NVIDIA GPUs to Mac. Karpathy replied that Apple's M-series chips are something entirely different: they do have a GPU, and the code could be structured in a similar way, but all the libraries and details would change (Metal instead of CUDA), so it would be a completely separate effort.

Epilogue

For now, this can only be called a very interesting project. AI luminaries like Andrej Karpathy and the world's top hackers can use llm.c for highly customized large language model training, but doing so demands extremely strong software engineering skills and is not for everyone. It currently targets only GPT-2; switching to GPT-3, GPT-4, or GPT-5 would presumably require further optimization and modification of the code. For the average engineer or researcher, out-of-the-box PyTorch remains indispensable.
