
70B model outputs 1,000 tokens per second and beats GPT-4o at code rewriting, from an OpenAI-backed team

Author: Cressy, from Aofei Temple

QbitAI (量子位) | WeChat official account QbitAI

A 70B model putting out 1,000 tokens per second, close to 4,000 characters!

The researchers fine-tuned Llama3 and added an acceleration algorithm, making it 13 times faster than the unmodified version!

Not only is it fast, it even surpasses GPT-4o on code rewriting tasks.

This work comes from Anysphere, the team behind the popular AI programming tool Cursor, whose investors include OpenAI.


For comparison: on Groq, an inference platform famous for its speed, 70B Llama3 manages only around 300 tokens per second.

Cursor's speed allows for near-instantaneous editing of complete code files.

One commenter marveled: if you ran Cursor's modified Llama3 on Groq, could it hit tens of thousands of tokens per second?


Others went further, saying that in the world of large models, we are eliminating the very concept of "latency".


A new inference acceleration algorithm is introduced

The acceleration method the authors designed targets a task called "Fast Apply": quickly applying suggested modifications to code.

Note that although the task's end result is a partial modification of the code, the model does not output only the changed lines; it rewrites the entire file.

The team chose this approach after preliminary testing: they found that most models, with the exception of Claude-3-Opus, performed poorly on true diff-style local modification.

There are three main reasons for this:

  • First, a direct rewrite outputs more tokens, giving the model more forward passes in which to converge on the correct result.
  • Second, most of a model's training data consists of complete code, so diff-style edits are relatively unfamiliar to it.
  • Third, large models are weak at arithmetic, so they cannot be trusted to handle line numbers correctly when the output diverges from the original.

(The authors nevertheless believe that true diff-style editing remains a promising research direction.)


Having settled on full-file rewriting, the Cursor team fine-tuned Llama3 on task-specific data.

The training data mixed real edit data with synthetic data at a ratio of 1:4.

Here, synthetic data means edits where GPT-4 generated code-editing suggestions and other models then "applied" those suggestions to the original code.

To improve the quality of the dataset, the authors also downsampled small files, duplicate files, and unchanged samples.
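The mixing and downsampling steps above can be sketched roughly as follows. This is a minimal illustration, not Cursor's pipeline: the field names, the 5-line "small file" threshold, and the `keep_frac` parameter are all assumptions made for the example.

```python
import random

def build_dataset(real, synthetic, keep_frac=0.1, seed=0):
    """Mix real and synthetic edit samples at roughly 1:4 and downsample
    small files, duplicate files, and unchanged samples.
    Field names, thresholds, and keep_frac are illustrative assumptions."""
    rng = random.Random(seed)
    # Cap synthetic data at 4x the real data (the 1:4 mix from the article)
    synthetic = rng.sample(synthetic, min(len(synthetic), 4 * len(real)))
    seen = set()
    kept = []
    for sample in real + synthetic:
        key = sample["original"]
        is_dup = key in seen
        seen.add(key)
        too_small = len(key.splitlines()) < 5            # "small file"
        unchanged = sample["original"] == sample["rewritten"]
        # Drop roughly (1 - keep_frac) of the undesirable samples
        if (is_dup or too_small or unchanged) and rng.random() >= keep_frac:
            continue
        kept.append(sample)
    return kept
```

Downsampling rather than outright deletion keeps a small share of no-op and duplicate edits in the mix, so the model still sees cases where the correct "rewrite" is to leave the code alone.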


To evaluate the models, the authors had them handle 450 code-editing tasks (each under 400 lines) and scored the outputs with Claude3-Opus.

In the end, the authors' fine-tuned 70B Llama3 performed nearly on par with Claude3-Opus-diff and outperformed GPT-4-Turbo and GPT-4o.


At this point fine-tuning had solved the quality problem, but Llama3 was still slow, outputting fewer than 300 characters per second (characters, not words or tokens).

What makes the rewriting blazingly fast is a second secret weapon.

For the code-rewriting task, the Cursor team introduced an algorithm called speculative edits.

In this scheme, a prior algorithm predicts multiple upcoming tokens, which the main large model then verifies, reducing the number of model invocations and hence the total computation.

This prior exploits a property of the code task: compared with other text, code has a smaller vocabulary, and its syntax, indentation rules, and structure are far more predictable, so prior knowledge can forecast future tokens more accurately.
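For a rewrite task, the simplest deterministic prior is the original file itself: most of the output is an unchanged copy of the input. The toy below sketches that idea under stated assumptions; `target_next` stands in for the big model (here one call per token for readability, whereas a real system verifies all k draft tokens in a single forward pass), and the one-token resync after a mismatch is a simplification that ignores insertions and deletions.

```python
def speculative_edit(original, target_next, k=4):
    """Toy sketch of speculative edits (illustrative only).

    original:    token list of the file before editing, used as a
                 deterministic draft source (the "prior algorithm")
    target_next: stand-in for the big model; given the tokens generated
                 so far, returns the true next token, or None when done
    k:           number of draft tokens speculated per verification round
    """
    out = []
    passes = 0   # simulated big-model verification rounds
    src = 0      # cursor into the original file we copy drafts from
    while True:
        passes += 1
        draft = original[src:src + k]
        accepted = 0
        mismatch = False
        done = False
        for tok in draft:
            true_tok = target_next(out)
            if true_tok is None:
                done = True
                break
            out.append(true_tok)          # the model's token always wins
            if true_tok == tok:
                accepted += 1             # draft confirmed, keep speculating
            else:
                mismatch = True           # divergence ends this round
                break
        if not draft and not done:
            # Nothing left to copy from: plain autoregressive decoding
            true_tok = target_next(out)
            if true_tok is None:
                done = True
            else:
                out.append(true_tok)
        if done:
            break
        # Resync: treat a divergence as a one-token substitution
        src += accepted + (1 if mismatch else 0)
    return out, passes
```

Because most draft tokens are accepted wholesale on a lightly edited file, the number of verification rounds is far smaller than the number of output tokens, which is where the speedup comes from.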

This approach also has something in common with techniques from GPT-4 and Meta.

Traditional language models are slow at inference because decoding is autoregressive: to generate each token, the model must attend to all previously generated tokens.

To cut this cost, models such as GPT-4 use an acceleration technique called speculative decoding: a small approximate model predicts tokens in advance, and the main model then verifies those predictions.

The difference is that Cursor's "small model" is a deterministic algorithm, whereas GPT-4's is simply a smaller model that is still, at bottom, a probabilistic predictor.
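Classic speculative decoding has the same accept-or-reject skeleton; only the draft source differs. In the hedged toy below, `draft_next` plays the role of the small probabilistic model (Cursor would replace it with the deterministic file-copying prior), and `target_next` again stands in for the big model, with one call per token instead of a real batched verification pass.

```python
def speculative_decode(draft_next, target_next, k=4):
    """Toy sketch of classic speculative decoding (illustrative only).

    draft_next:  cheap draft model's guess for the next token
    target_next: big model's true next token (None when generation ends)
    One verification round handles up to k draft tokens; the round stops
    at the first disagreement, and the big model's token always wins.
    """
    out = []
    passes = 0
    while True:
        passes += 1                      # one big-model verification round
        done = False
        for _ in range(k):
            guess = draft_next(out)      # draft model's proposal
            true_tok = target_next(out)  # big model's actual next token
            if true_tok is None:
                done = True
                break
            out.append(true_tok)
            if guess != true_tok:
                break                    # first disagreement ends the round
        if done:
            break
    return out, passes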

Meta, for its part, proposed an algorithm that predicts multiple future tokens at once, using n independent output heads to predict n upcoming tokens in parallel. It was found to perform especially well on programming tasks, because programming languages have a stricter logical structure and tighter dependencies between tokens.

Cursor exploits this property even more fully: rather than extra prediction heads, it uses a more deterministic algorithm to make long-range token predictions.

The net result: the speculative algorithm delivered a nearly 13x speedup for 70B Llama3 with no loss in quality.


In addition, the authors partnered with the enterprise AI infrastructure platform fireworks.ai, whose optimized inference engine and customized hardware environment further improved the model's throughput.

Going forward, the team plans to use knowledge distillation to port the speculative-edits algorithm to the smaller 8B Llama3, and to expand to more programming languages and tasks.

The authors also plan to improve the diff algorithm, which the Cursor team has studied but not yet deployed.

One More Thing

In their experiments, the authors used the speculative algorithm to accelerate not only Llama3 but also GPT-4-Turbo.

However, they did not explain how this is implemented with GPT, instead leaving it as an exercise for readers, complete with prizes.

Whoever answers correctly gets one month of Cursor membership; implementing speculative acceleration in vLLM or TensorRT-LLM earns a six-month or one-year membership, respectively.


If you think you have the answer, give it a try (doge).

Reference Links:

https://cursor.sh/blog/instant-apply#user-content-fnref-feel-difference

— END —

QbitAI (量子位) · Signed author on Toutiao

