
Large models fight 750,000 rounds of 1v1 battles: GPT-4 takes the crown, Llama 3 ranks fifth


Cressy, from Aofei Temple

QbitAI | Official account QbitAI

There are new test results for Llama 3:

LMSYS, a large-model evaluation community, has released a new leaderboard on which Llama 3 ranks fifth overall and ties with GPT-4 for first place on the English-only ranking.


Unlike other benchmarks, this leaderboard is built from one-on-one model battles, with the questions posed and the answers judged by users across the whole internet.

Llama 3 ultimately took fifth place overall, behind three different versions of GPT-4 and Claude 3's largest model, Opus.

On the English-only ranking, Llama 3 overtook Claude and tied with GPT-4.

Meta chief scientist Yann LeCun was pleased enough with the result to retweet it and leave a "Nice".


Soumith Chintala, the father of PyTorch, was also excited, calling the results incredible and saying he is proud of Meta:

The 400B version of Llama 3 isn't even out yet, and it has taken fifth place with just 70B parameters…

I remember that when GPT-4 was released in March last year, matching its performance seemed almost impossible.

……

It's incredible how widespread AI has become, and I'm very proud of what my colleagues at Meta AI have achieved.


So, what exactly does this list show?

Nearly 90 models, 750,000 rounds of battles

As of the latest leaderboard release, LMSYS had collected nearly 750,000 head-to-head battle results covering 89 large models.

Among them, Llama 3 has fought about 12,700 battles; GPT-4, which fields multiple versions, has the most appearances at 68,000.


The chart below shows match counts and win rates for some of the popular models; neither metric counts draws.


The leaderboard is split into an overall ranking and several sub-rankings. GPT-4-Turbo ranks first overall, tied with the earlier 1106 version and with Claude 3's largest model, Opus.

Another GPT-4 version (0125) sits right next to Llama 3.

Interestingly, though, the newer 0125 version performs worse than the older 1106.


On the English-only ranking, Llama 3 tied outright with the two leading GPT-4 versions and also surpassed the 0125 version.


First place in the Chinese-language ranking is shared by Claude 3 Opus and GPT-4-1106, while Llama 3 falls outside the top 20.


Beyond language ability, the leaderboard also has rankings for long-text and coding ability, where Llama 3 likewise ranks near the top.

So what exactly are LMSYS's "rules of the game"?

A large-model evaluation anyone can join

This is a large-model test open to everyone, with both the questions and the judging decided by the participants themselves.

The "competition" runs in two modes: battle and side-by-side.


In battle mode, once the tester enters a question in the test interface, the system randomly calls up two models from its library. The tester does not know which models were chosen; the interface shows only "Model A" and "Model B".

After the models output their answers, the evaluator chooses which is better or declares a tie; there are also options for when neither model performs as expected.

Only after the choice is made are the models' identities revealed.

In side-by-side mode, the user picks the two models to pit against each other; the rest of the process is the same as in battle mode.

However, only votes cast in the anonymous battle mode count toward the leaderboard, and a vote is invalidated if a model accidentally reveals its identity during the conversation.
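To make the flow concrete, here is a minimal Python sketch of how one anonymous battle round might be recorded. The model pool, the BattleRecord structure, and the helper names are all hypothetical illustrations, not LMSYS's actual code.

```python
import random
from dataclasses import dataclass

# Hypothetical model pool; the real arena involves nearly 90 models.
MODELS = ["gpt-4-turbo", "llama-3-70b", "claude-3-opus", "mixtral-8x7b"]

@dataclass
class BattleRecord:
    model_a: str
    model_b: str
    vote: str        # "A", "B", "tie", or "both_bad"
    anonymous: bool  # only anonymous battle-mode votes count

def start_battle():
    # Randomly pair two distinct models; the user sees only
    # "Model A" and "Model B" until after voting.
    return random.sample(MODELS, 2)

def record_vote(model_a, model_b, vote, identity_leaked=False):
    # A vote is invalidated if a model revealed its identity mid-chat.
    if identity_leaked:
        return None  # excluded from the leaderboard
    return BattleRecord(model_a, model_b, vote, anonymous=True)
```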


From each model's win rate against every other model, you can draw a picture like this:


△ Schematic diagram, from an earlier version of the leaderboard
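A win-rate matrix like the one in the schematic can be tallied directly from the battle records. The sketch below reuses the hypothetical BattleRecord from earlier and, like the chart above, excludes draws.

```python
from collections import defaultdict

def win_rate_matrix(records):
    # Pairwise win rates over decisive games only (draws and
    # "both bad" votes excluded, matching the chart's convention).
    wins = defaultdict(int)
    games = defaultdict(int)
    for r in records:
        if r.vote not in ("A", "B"):
            continue
        winner, loser = ((r.model_a, r.model_b) if r.vote == "A"
                         else (r.model_b, r.model_a))
        wins[(winner, loser)] += 1
        games[(winner, loser)] += 1
        games[(loser, winner)] += 1
    return {pair: wins[pair] / games[pair] for pair in games}
```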

The final leaderboard is produced by converting the win-rate data into scores via the Elo rating system.

The Elo rating system, devised by Hungarian-American physics professor Arpad Elo, is a method for calculating players' relative skill levels.

In LMSYS's case specifically, every model's rating (R) starts at 1000, and the expected win rate (E) is computed with the following formula:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

As testing proceeds, ratings are revised according to the actual score (S), which takes one of three values (1, 0, or 0.5), corresponding to a win, a loss, and a draw.

The update rule is given by the following formula, where K is a coefficient the testers tune to the situation:

$$R_A' = R_A + K\,(S_A - E_A)$$

Once all valid data has been fed into this calculation, each model's Elo score is obtained.
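Putting the two formulas together, a sequential Elo pass over the battle log could look like the sketch below. The function names are hypothetical, and K = 4 is an illustrative choice rather than a confirmed LMSYS setting.

```python
from collections import defaultdict

def expected_score(r_a, r_b):
    # E_A = 1 / (1 + 10^((R_B - R_A) / 400))
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def compute_elo(records, k=4, initial=1000):
    # Every model starts at 1000; S is 1 / 0 / 0.5 for win / loss / draw.
    ratings = defaultdict(lambda: initial)
    for r in records:
        e_a = expected_score(ratings[r.model_a], ratings[r.model_b])
        s_a = {"A": 1.0, "B": 0.0, "tie": 0.5}.get(r.vote)
        if s_a is None:
            continue  # skip invalid or "both bad" votes in this sketch
        ratings[r.model_a] += k * (s_a - e_a)
        ratings[r.model_b] += k * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)
```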

In practice, however, the LMSYS team found this algorithm insufficiently stable, so they applied statistical methods to correct it.

They used the Bootstrap method to resample repeatedly, obtaining more stable results as well as estimated confidence intervals.

The corrected Elo scores became the basis for the final ranking.
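A sketch of that correction, under the assumption that "repeated sampling" means resampling the battle log with replacement and recomputing Elo each round (reusing compute_elo from the previous sketch); LMSYS's exact procedure may differ.

```python
import random
from statistics import median
from collections import defaultdict

def bootstrap_elo(records, rounds=1000):
    # Resample the battle log with replacement, recompute Elo each
    # round, and summarize each model's rating distribution.
    samples = defaultdict(list)
    for _ in range(rounds):
        resampled = random.choices(records, k=len(records))
        for model, rating in compute_elo(resampled).items():
            samples[model].append(rating)
    summary = {}
    for model, ratings in samples.items():
        ratings.sort()
        lo = ratings[int(0.025 * (len(ratings) - 1))]
        hi = ratings[int(0.975 * (len(ratings) - 1))]
        summary[model] = {"median": median(ratings), "ci95": (lo, hi)}
    return summary
```

The median of the bootstrap samples gives the stable score used for ranking, and the interval endpoints give the confidence bounds the article mentions.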

One More Thing

Llama 3 is already running on the large-model inference platform Groq (not to be confused with Musk's Grok).

The platform's biggest selling point is speed: it has previously run the Mixtral model at nearly 500 tokens per second.

Llama 3 is also quite fast on it. In measurements, the 70B model runs at about 300 tokens per second, and the 8B version gets close to 800.


Reference Links:

[1]https://lmsys.org/blog/2023-05-03-arena/

[2]https://chat.lmsys.org/?leaderboard

[3]https://twitter.com/lmsysorg/status/1782483699449332144

— END —
