
70B model outputs 1,000 tokens per second and beats GPT-4o at code rewriting, from an OpenAI-backed team

Author: Cressy, from Aofei Temple

QbitAI (量子位) | WeChat official account QbitAI

A 70B model putting out 1,000 tokens per second, close to 4,000 characters!

The researchers fine-tuned Llama3 and added an acceleration algorithm, making it 13 times faster than the unmodified version!

Not only is it fast, it even surpasses GPT-4o on code rewriting tasks.

This work comes from Anysphere, the team behind the popular AI programming tool Cursor, whose investors include OpenAI.


For comparison: on Groq, an inference platform famous for its speed, 70B Llama3 manages only around 300 tokens per second.

Cursor's speed allows for near-instantaneous editing of complete code files.

One commenter marveled: if you ran Cursor's modified Llama3 on Groq, could it hit tens of thousands of tokens per second?


Others went further, saying that in the world of large models, we are eliminating the very concept of "latency".


A new inference acceleration algorithm is introduced

The acceleration method the authors designed targets a task called "Fast Apply": quickly applying suggested modifications to code.

Note that although the task's end result is a partial modification of the code, the model does not output only the changed lines; it rewrites the entire file.

The team chose this approach after preliminary testing: they found that most models, with the exception of Claude-3-Opus, performed poorly on true diff-style local modification.

There are three main reasons for this:

  • First, a direct rewrite outputs more tokens, giving the model more forward passes in which to converge on the correct result.
  • Second, most of a model's training data consists of complete code, so diff-style edits are relatively unfamiliar to it.
  • Third, large models are weak at arithmetic, so they cannot be trusted to handle line numbers correctly when the output diverges from the original.

(The authors nevertheless believe that true diff-style editing remains a promising research direction.)


Having settled on full-file rewriting, the Cursor team fine-tuned Llama3 on task-specific data.

The training data mixed real edit data with synthetic data at a ratio of 1:4.

Here, synthetic data means edits where GPT-4 generated code-editing suggestions and other models then "applied" those suggestions to the original code.

To improve the quality of the dataset, the authors also downsampled small files, duplicate files, and unchanged samples.
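The mixing and downsampling steps above can be sketched roughly as follows. This is a minimal illustration, not Cursor's pipeline: the field names, the 5-line "small file" threshold, and the `keep_frac` parameter are all assumptions made for the example.

```python
import random

def build_dataset(real, synthetic, keep_frac=0.1, seed=0):
    """Mix real and synthetic edit samples at roughly 1:4 and downsample
    small files, duplicate files, and unchanged samples.
    Field names, thresholds, and keep_frac are illustrative assumptions."""
    rng = random.Random(seed)
    # Cap synthetic data at 4x the real data (the 1:4 mix from the article)
    synthetic = rng.sample(synthetic, min(len(synthetic), 4 * len(real)))
    seen = set()
    kept = []
    for sample in real + synthetic:
        key = sample["original"]
        is_dup = key in seen
        seen.add(key)
        too_small = len(key.splitlines()) < 5            # "small file"
        unchanged = sample["original"] == sample["rewritten"]
        # Drop roughly (1 - keep_frac) of the undesirable samples
        if (is_dup or too_small or unchanged) and rng.random() >= keep_frac:
            continue
        kept.append(sample)
    return kept
```

Downsampling rather than outright deletion keeps a small share of no-op and duplicate edits in the mix, so the model still sees cases where the correct "rewrite" is to leave the code alone.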


To evaluate the models, the authors had them handle 450 code-editing tasks (each under 400 lines) and scored the outputs with Claude3-Opus.

In the end, the authors' fine-tuned 70B Llama3 performed nearly on par with Claude3-Opus-diff and outperformed GPT-4-Turbo and GPT-4o.


At this point fine-tuning had solved the quality problem, but Llama3 was still slow, outputting fewer than 300 characters per second (characters, not words or tokens).

What makes the rewriting blazingly fast is a second secret weapon.

For the code-rewriting task, the Cursor team introduced an algorithm called speculative edits.

In this scheme, a prior algorithm predicts multiple upcoming tokens, which the main large model then verifies, reducing the number of model invocations and hence the total computation.

This prior exploits a property of the code task: compared with other text, code has a smaller vocabulary, and its syntax, indentation rules, and structure are far more predictable, so prior knowledge can forecast future tokens more accurately.
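For a rewrite task, the simplest deterministic prior is the original file itself: most of the output is an unchanged copy of the input. The toy below sketches that idea under stated assumptions; `target_next` stands in for the big model (here one call per token for readability, whereas a real system verifies all k draft tokens in a single forward pass), and the one-token resync after a mismatch is a simplification that ignores insertions and deletions.

```python
def speculative_edit(original, target_next, k=4):
    """Toy sketch of speculative edits (illustrative only).

    original:    token list of the file before editing, used as a
                 deterministic draft source (the "prior algorithm")
    target_next: stand-in for the big model; given the tokens generated
                 so far, returns the true next token, or None when done
    k:           number of draft tokens speculated per verification round
    """
    out = []
    passes = 0   # simulated big-model verification rounds
    src = 0      # cursor into the original file we copy drafts from
    while True:
        passes += 1
        draft = original[src:src + k]
        accepted = 0
        mismatch = False
        done = False
        for tok in draft:
            true_tok = target_next(out)
            if true_tok is None:
                done = True
                break
            out.append(true_tok)          # the model's token always wins
            if true_tok == tok:
                accepted += 1             # draft confirmed, keep speculating
            else:
                mismatch = True           # divergence ends this round
                break
        if not draft and not done:
            # Nothing left to copy from: plain autoregressive decoding
            true_tok = target_next(out)
            if true_tok is None:
                done = True
            else:
                out.append(true_tok)
        if done:
            break
        # Resync: treat a divergence as a one-token substitution
        src += accepted + (1 if mismatch else 0)
    return out, passes
```

Because most draft tokens are accepted wholesale on a lightly edited file, the number of verification rounds is far smaller than the number of output tokens, which is where the speedup comes from.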

This approach also has something in common with techniques from GPT-4 and Meta.

Traditional language models are slow at inference because decoding is autoregressive: to generate each token, the model must attend to all previously generated tokens.

To cut this cost, models such as GPT-4 use an acceleration technique called speculative decoding: a small approximate model predicts tokens in advance, and the main model then verifies those predictions.

The difference is that Cursor's "small model" is a deterministic algorithm, whereas GPT-4's is simply a smaller model that is still, at bottom, a probabilistic predictor.
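Classic speculative decoding has the same accept-or-reject skeleton; only the draft source differs. In the hedged toy below, `draft_next` plays the role of the small probabilistic model (Cursor would replace it with the deterministic file-copying prior), and `target_next` again stands in for the big model, with one call per token instead of a real batched verification pass.

```python
def speculative_decode(draft_next, target_next, k=4):
    """Toy sketch of classic speculative decoding (illustrative only).

    draft_next:  cheap draft model's guess for the next token
    target_next: big model's true next token (None when generation ends)
    One verification round handles up to k draft tokens; the round stops
    at the first disagreement, and the big model's token always wins.
    """
    out = []
    passes = 0
    while True:
        passes += 1                      # one big-model verification round
        done = False
        for _ in range(k):
            guess = draft_next(out)      # draft model's proposal
            true_tok = target_next(out)  # big model's actual next token
            if true_tok is None:
                done = True
                break
            out.append(true_tok)
            if guess != true_tok:
                break                    # first disagreement ends the round
        if done:
            break
    return out, passes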

Meta, for its part, proposed an algorithm that predicts multiple future tokens at once, using n independent output heads to predict n upcoming tokens in parallel. It was found to perform especially well on programming tasks, because programming languages have a stricter logical structure and tighter dependencies between tokens.

Cursor exploits this property even more fully: rather than extra prediction heads, it uses a more deterministic algorithm to make long-range token predictions.

The net result: the speculative algorithm delivered a nearly 13x speedup for 70B Llama3 with no loss in quality.


In addition, the authors partnered with the enterprise AI infrastructure platform fireworks.ai, whose optimized inference engine and customized hardware environment further improved the model's throughput.

Going forward, the team plans to use knowledge distillation to port the speculative-edits algorithm to the smaller 8B Llama3, and to expand to more programming languages and tasks.

The authors also plan to improve the diff algorithm, which the Cursor team has studied but not yet deployed.

One More Thing

In their experiments, the authors used the speculative algorithm to accelerate not only Llama3 but also GPT-4-Turbo.

However, they did not explain how this is implemented with GPT, instead leaving it as an exercise for readers, complete with prizes.

Whoever answers correctly gets one month of Cursor membership; implementing speculative acceleration in vLLM or TensorRT-LLM earns a six-month or one-year membership, respectively.


If you think you have the answer, give it a try (doge).

Reference Links:

https://cursor.sh/blog/instant-apply#user-content-fnref-feel-difference

— END —

QbitAI (量子位) · Signed author on Toutiao

