
Karpathy's 1,000 Lines of Pure C Now Train a Large Model as Fast as PyTorch

Author: Not bald programmer

A few days ago, we introduced llm.c, the latest project from AI luminary Andrej Karpathy: GPT-2 training in about 1,000 lines of pure C, with no dependency on PyTorch, the large and complex framework that dominates large language model training.


Since open-sourcing llm.c on GitHub and posting a rousing call for hackers around the world to improve its performance, Andrej Karpathy has made remarkable progress.

At the beginning, llm.c was 4.2x slower per iteration than PyTorch; it has now been brought down to 26.2 ms/iteration, exactly matching PyTorch's performance (TF32 forward pass). One fix came from a discovered bug in which the code was incorrectly calling cuBLAS in FP32 math mode. In addition, the hacker ademeure contributed a softmax kernel optimized for very long rows (50,257 elements per row, in the final logits layer).

But the hacking fun doesn't end there, because plenty of tricks remain. The current llm.c attention kernel uses naive attention rather than FlashAttention, which means it materializes (very large) pre-attention and post-attention matrices of shape (B, NH, T, T). It also performs unnecessary, unfused GeLU nonlinearities and permute/unpermute operations inside attention.

llm.c has not yet applied further optimizations, e.g. CUDA Graphs, losslessly compressible memory (?), etc. The updated performance chart looks promising: with just about 2,000 lines of pure C, the goal of training an LLM faster than PyTorch looks achievable.


Judging from the projection, optimization will continue, and by April 23 llm.c's training iteration speed is expected to be several times faster than PyTorch's.

Andrej Karpathy is now focusing on the backward pass, so that llm.c can run the entire training loop in CUDA.

A netizen asked Andrej Karpathy whether llm.c could be ported from NVIDIA GPUs to Mac. Karpathy replied that Apple's M-series chips are something entirely different: they do have a GPU, and the code could be structured in a similar way, but all the libraries and details would change (Metal instead of CUDA), so it would be a completely separate effort.

Epilogue

For now, this can only be called a very interesting project. AI luminaries like Andrej Karpathy and the world's top hackers can use llm.c for highly customized large language model training, but doing so demands extremely strong software engineering skills and is not for everyone. It currently targets only GPT-2; switching to GPT-3, GPT-4, or GPT-5 would presumably require further optimization and modification of the code. For the average engineer or researcher, out-of-the-box PyTorch remains indispensable.
