
Large models fight 750,000 rounds of 1v1 battles: GPT-4 takes the crown, Llama 3 ranks fifth


Cressy, from Aofei Temple

QbitAI | Official account QbitAI

There are new test results for Llama 3:

LMSYS, a large-model evaluation community, has released a new leaderboard on which Llama 3 ranks fifth overall and ties with GPT-4 for first place on the English-only ranking.


Unlike other benchmarks, this leaderboard is built from one-on-one model battles, with the questions posed and the answers judged by users across the whole internet.

Llama 3 ultimately took fifth place overall, behind three different versions of GPT-4 and Claude 3's largest model, Opus.

On the English-only ranking, Llama 3 overtook Claude and tied with GPT-4.

Meta chief scientist Yann LeCun was pleased enough with the result to retweet it and leave a "Nice".


Soumith Chintala, the father of PyTorch, was also excited, calling the results incredible and saying he is proud of Meta:

The 400B version of Llama 3 isn't even out yet, and it has taken fifth place with just 70B parameters…

I remember that when GPT-4 was released in March last year, matching its performance seemed almost impossible.

……

It's incredible how widespread AI has become, and I'm very proud of what my colleagues at Meta AI have achieved.


So, what exactly does this list show?

Nearly 90 models, 750,000 rounds of battles

As of the latest leaderboard release, LMSYS had collected nearly 750,000 head-to-head battle results covering 89 large models.

Among them, Llama 3 has fought about 12,700 battles; GPT-4, which fields multiple versions, has the most appearances at 68,000.


The chart below shows match counts and win rates for some of the popular models; neither metric counts draws.


The leaderboard is split into an overall ranking and several sub-rankings. GPT-4-Turbo ranks first overall, tied with the earlier 1106 version and with Claude 3's largest model, Opus.

Another GPT-4 version (0125) sits right next to Llama 3.

Interestingly, though, the newer 0125 version performs worse than the older 1106.


On the English-only ranking, Llama 3 tied outright with the two leading GPT-4 versions and also surpassed the 0125 version.


First place in the Chinese-language ranking is shared by Claude 3 Opus and GPT-4-1106, while Llama 3 falls outside the top 20.


Beyond language ability, the leaderboard also has rankings for long-text and coding ability, where Llama 3 likewise ranks near the top.

So what exactly are LMSYS's "rules of the game"?

A large-model evaluation anyone can join

This is a large-model test open to everyone, with both the questions and the judging decided by the participants themselves.

The "competition" runs in two modes: battle and side-by-side.


In battle mode, once the tester enters a question in the test interface, the system randomly calls up two models from its library. The tester does not know which models were chosen; the interface shows only "Model A" and "Model B".

After the models output their answers, the evaluator chooses which is better or declares a tie; there are also options for when neither model performs as expected.

Only after the choice is made are the models' identities revealed.

In side-by-side mode, the user picks the two models to pit against each other; the rest of the process is the same as in battle mode.

However, only votes cast in the anonymous battle mode count toward the leaderboard, and a vote is invalidated if a model accidentally reveals its identity during the conversation.
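To make the flow concrete, here is a minimal Python sketch of how one anonymous battle round might be recorded. The model pool, the BattleRecord structure, and the helper names are all hypothetical illustrations, not LMSYS's actual code.

```python
import random
from dataclasses import dataclass

# Hypothetical model pool; the real arena involves nearly 90 models.
MODELS = ["gpt-4-turbo", "llama-3-70b", "claude-3-opus", "mixtral-8x7b"]

@dataclass
class BattleRecord:
    model_a: str
    model_b: str
    vote: str        # "A", "B", "tie", or "both_bad"
    anonymous: bool  # only anonymous battle-mode votes count

def start_battle():
    # Randomly pair two distinct models; the user sees only
    # "Model A" and "Model B" until after voting.
    return random.sample(MODELS, 2)

def record_vote(model_a, model_b, vote, identity_leaked=False):
    # A vote is invalidated if a model revealed its identity mid-chat.
    if identity_leaked:
        return None  # excluded from the leaderboard
    return BattleRecord(model_a, model_b, vote, anonymous=True)
```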


From each model's win rate against every other model, you can draw a picture like this:


△ Schematic diagram, from an earlier version of the leaderboard
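A win-rate matrix like the one in the schematic can be tallied directly from the battle records. The sketch below reuses the hypothetical BattleRecord from earlier and, like the chart above, excludes draws.

```python
from collections import defaultdict

def win_rate_matrix(records):
    # Pairwise win rates over decisive games only (draws and
    # "both bad" votes excluded, matching the chart's convention).
    wins = defaultdict(int)
    games = defaultdict(int)
    for r in records:
        if r.vote not in ("A", "B"):
            continue
        winner, loser = ((r.model_a, r.model_b) if r.vote == "A"
                         else (r.model_b, r.model_a))
        wins[(winner, loser)] += 1
        games[(winner, loser)] += 1
        games[(loser, winner)] += 1
    return {pair: wins[pair] / games[pair] for pair in games}
```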

The final leaderboard is produced by converting the win-rate data into scores via the Elo rating system.

The Elo rating system, devised by Hungarian-American physics professor Arpad Elo, is a method for calculating players' relative skill levels.

In LMSYS's case specifically, every model's rating (R) starts at 1000, and the expected win rate (E) is computed with the following formula:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

As testing proceeds, ratings are revised according to the actual score (S), which takes one of three values (1, 0, or 0.5), corresponding to a win, a loss, and a draw.

The update rule is given by the following formula, where K is a coefficient the testers tune to the situation:

$$R_A' = R_A + K\,(S_A - E_A)$$

Once all valid data has been fed into this calculation, each model's Elo score is obtained.
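Putting the two formulas together, a sequential Elo pass over the battle log could look like the sketch below. The function names are hypothetical, and K = 4 is an illustrative choice rather than a confirmed LMSYS setting.

```python
from collections import defaultdict

def expected_score(r_a, r_b):
    # E_A = 1 / (1 + 10^((R_B - R_A) / 400))
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def compute_elo(records, k=4, initial=1000):
    # Every model starts at 1000; S is 1 / 0 / 0.5 for win / loss / draw.
    ratings = defaultdict(lambda: initial)
    for r in records:
        e_a = expected_score(ratings[r.model_a], ratings[r.model_b])
        s_a = {"A": 1.0, "B": 0.0, "tie": 0.5}.get(r.vote)
        if s_a is None:
            continue  # skip invalid or "both bad" votes in this sketch
        ratings[r.model_a] += k * (s_a - e_a)
        ratings[r.model_b] += k * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)
```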

In practice, however, the LMSYS team found this algorithm insufficiently stable, so they applied statistical methods to correct it.

They used the Bootstrap method to resample repeatedly, obtaining more stable results as well as estimated confidence intervals.

The corrected Elo scores became the basis for the final ranking.
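A sketch of that correction, under the assumption that "repeated sampling" means resampling the battle log with replacement and recomputing Elo each round (reusing compute_elo from the previous sketch); LMSYS's exact procedure may differ.

```python
import random
from statistics import median
from collections import defaultdict

def bootstrap_elo(records, rounds=1000):
    # Resample the battle log with replacement, recompute Elo each
    # round, and summarize each model's rating distribution.
    samples = defaultdict(list)
    for _ in range(rounds):
        resampled = random.choices(records, k=len(records))
        for model, rating in compute_elo(resampled).items():
            samples[model].append(rating)
    summary = {}
    for model, ratings in samples.items():
        ratings.sort()
        lo = ratings[int(0.025 * (len(ratings) - 1))]
        hi = ratings[int(0.975 * (len(ratings) - 1))]
        summary[model] = {"median": median(ratings), "ci95": (lo, hi)}
    return summary
```

The median of the bootstrap samples gives the stable score used for ranking, and the interval endpoints give the confidence bounds the article mentions.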

One More Thing

Llama 3 is already running on the large-model inference platform Groq (not to be confused with Musk's Grok).

The platform's biggest selling point is speed: it has previously run the Mixtral model at nearly 500 tokens per second.

Llama 3 is also quite fast on it. In measurements, the 70B model runs at about 300 tokens per second, and the 8B version gets close to 800.


Reference Links:

[1]https://lmsys.org/blog/2023-05-03-arena/

[2]https://chat.lmsys.org/?leaderboard

[3]https://twitter.com/lmsysorg/status/1782483699449332144

— END —
