The strongest open-source model Llama 3's score plummets, and the gap widens

Mengchen, from Aofei Temple

量子位 | WeChat official account QbitAI

If exam questions are too easy, both the top students and the weakest students can score 90 points, and there is no way to tell them apart...

With stronger models such as Claude 3 and Llama 3 already released, and GPT-5 still to come, the industry badly needs a harder, more discriminating benchmark.

LMSYS, the organization behind the large model arena (Chatbot Arena), has launched a next-generation benchmark, Arena-Hard, which has attracted widespread attention.

It also provides the latest reference for how strong the two fine-tuned (instruct) versions of Llama 3 really are.

Compared with the earlier MT-Bench, where every model ended up with similar scores, Arena-Hard raises separability between models from 22.6% to 87.4%, making it much clearer which is stronger and which is weaker.

Arena-Hard is built from live human data from the arena, and its agreement with human preferences is as high as 89.1%.
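Both figures compare a benchmark's model ranking against human judgments. As a rough illustration of what an agreement number like this can mean, here is a minimal Python sketch that computes pairwise ranking agreement between a benchmark's ordering and a human-vote ordering; the metric definition and the model names are illustrative assumptions, not LMSYS's exact methodology.

```python
from itertools import combinations

def pairwise_agreement(benchmark_rank: list[str], human_rank: list[str]) -> float:
    """Fraction of model pairs that the two rankings order the same way.

    Illustrative stand-in for an 'agreement with human preference' figure;
    LMSYS's exact computation may differ.
    """
    pos_b = {m: i for i, m in enumerate(benchmark_rank)}
    pos_h = {m: i for i, m in enumerate(human_rank)}
    common = [m for m in benchmark_rank if m in pos_h]
    agree = total = 0
    for a, b in combinations(common, 2):
        total += 1
        # A pair is concordant if both rankings put the same model first.
        if (pos_b[a] < pos_b[b]) == (pos_h[a] < pos_h[b]):
            agree += 1
    return agree / total if total else 0.0

# Hypothetical rankings, best model first.
benchmark = ["model-a", "model-b", "model-c", "model-d"]
human = ["model-a", "model-c", "model-b", "model-d"]
print(f"{pairwise_agreement(benchmark, human):.1%}")  # 83.3% in this toy example
```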

Beyond reaching SOTA on both of these indicators, there are further benefits:

The test data is continuously refreshed with newly written prompts that models could not have seen during training, mitigating potential data contamination.

And when a new model is released, there is no need to wait a week or so for human votes to accumulate; for about $25 the evaluation pipeline can be run quickly to get results.

Some netizens commented that testing with real user prompts instead of high-school-exam-style questions really matters.

How does the new benchmark work?

Simply put, 500 high-quality prompts are selected as the test set from the 200,000 user queries collected in the arena.

First, the selection process ensures diversity: the test set should cover a wide range of real-world topics.

To achieve this, the team used BERTopic's topic-modeling pipeline: each prompt is first embedded with OpenAI's text-embedding-3-small model, the embeddings are reduced in dimension with UMAP, clusters are identified with the hierarchical density-based clustering algorithm HDBSCAN, and finally GPT-4-Turbo summarizes each cluster.
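Below is a rough Python sketch of the same steps, written directly against the OpenAI, UMAP, and HDBSCAN libraries rather than through BERTopic's wrapper; all parameter values and the summarization prompt are illustrative assumptions, not LMSYS's actual settings.

```python
import numpy as np
from openai import OpenAI
import umap
import hdbscan

client = OpenAI()

def embed(prompts: list[str]) -> np.ndarray:
    """Embed prompts with OpenAI's text-embedding-3-small, in batches."""
    vectors = []
    for i in range(0, len(prompts), 256):
        resp = client.embeddings.create(
            model="text-embedding-3-small", input=prompts[i:i + 256]
        )
        vectors.extend(d.embedding for d in resp.data)
    return np.array(vectors)

def cluster_and_summarize(prompts: list[str]) -> dict[int, str]:
    X = embed(prompts)
    # Reduce dimensionality before density-based clustering (parameters assumed).
    X_low = umap.UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(X)
    # HDBSCAN labels noise points as -1; everything else gets a cluster id.
    labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(X_low)

    summaries = {}
    for cluster_id in sorted(set(labels) - {-1}):
        members = [p for p, l in zip(prompts, labels) if l == cluster_id][:10]
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{
                "role": "user",
                "content": "Summarize the common topic of these user prompts "
                           "in one short phrase:\n\n" + "\n".join(members),
            }],
        )
        summaries[cluster_id] = resp.choices[0].message.content.strip()
    return summaries
```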

At the same time, the selected prompts must be of high quality, measured against seven key criteria:

Specificity: Does the prompt ask for a specific output?

Domain knowledge: Does the prompt cover one or more specific domains?

Complexity: Does the prompt involve multiple layers of reasoning, components, or variables?

Problem solving: Does the prompt require the AI to actively demonstrate problem-solving ability?

Creativity: Does the prompt require some level of creativity in the solution?

Technical accuracy: Does the prompt require the response to be technically accurate?

Practical application: Is the prompt relevant to real-world applications?

GPT-3.5-Turbo and GPT-4-Turbo annotate each prompt with a score from 0 to 7, indicating how many of the criteria it satisfies. Each cluster is then scored by the average score of its prompts.
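A minimal sketch of how such an LLM-based quality score might be collected follows; the judge prompt wording and the parsing logic are assumptions for illustration, not the exact annotation setup.

```python
from openai import OpenAI

client = OpenAI()

CRITERIA = [
    "Specificity", "Domain knowledge", "Complexity", "Problem solving",
    "Creativity", "Technical accuracy", "Practical application",
]

JUDGE_TEMPLATE = (
    "For the user prompt below, count how many of these criteria it satisfies: "
    + ", ".join(CRITERIA)
    + ". Reply with a single integer between 0 and 7.\n\nPrompt:\n{prompt}"
)

def quality_score(prompt: str, judge_model: str = "gpt-4-turbo") -> int:
    """Ask a judge model how many of the seven criteria a prompt meets (0-7)."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(prompt=prompt)}],
    )
    digits = "".join(c for c in resp.choices[0].message.content if c.isdigit())
    return min(int(digits or 0), 7)

def cluster_score(prompts: list[str]) -> float:
    """Average quality score of the prompts in one cluster."""
    return sum(quality_score(p) for p in prompts) / len(prompts)
```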

High-quality questions tend to involve challenging topics or tasks, such as game development or mathematical proofs.

Is the new benchmark accurate?

Arena-Hard currently has one notable weakness: GPT-4, acting as the judge, prefers its own output. The team acknowledges this and gives a corresponding caveat.

As can be seen, the two latest versions of GPT-4 score significantly higher than Claude 3 Opus, while the gap in human voting is not nearly as large.

In fact, recent research has already demonstrated that frontier models prefer their own output.

That research also found that models have an innate ability to recognize whether a piece of text was written by themselves, that this self-recognition ability can be strengthened by fine-tuning, and that it is linearly correlated with self-preference.

So what changes when Claude 3 does the scoring instead?

First, the scores of the Claude series do indeed improve.

But surprisingly, it also favors several open models such as Mixtral and 01.AI's Yi, and even rates GPT-3.5 significantly higher.

Overall, with Claude 3 as the judge, both separability and agreement with human results are worse than with GPT-4.

For this reason, many netizens have suggested combining the scores of multiple large models.
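In that spirit, here is a minimal sketch of what aggregating several judge models could look like; the judge pool, the pairwise prompt, and the majority-vote aggregation are illustrative assumptions, not part of Arena-Hard itself.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

# Hypothetical judge pool (OpenAI chat models); a Claude judge could be added
# through the Anthropic SDK in the same way.
JUDGES = ["gpt-4-turbo", "gpt-4", "gpt-3.5-turbo"]

def verdict(judge: str, question: str, answer_a: str, answer_b: str) -> str:
    """Ask one judge which answer is better; returns 'A' or 'B'."""
    resp = client.chat.completions.create(
        model=judge,
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"Question:\n{question}\n\nAnswer A:\n{answer_a}\n\n"
                       f"Answer B:\n{answer_b}\n\nWhich answer is better? "
                       "Reply with exactly 'A' or 'B'.",
        }],
    )
    text = resp.choices[0].message.content.strip().upper()
    return "A" if text.startswith("A") else "B"

def ensemble_verdict(question: str, answer_a: str, answer_b: str) -> str:
    """Majority vote across the judge pool."""
    votes = Counter(verdict(j, question, answer_a, answer_b) for j in JUDGES)
    return votes.most_common(1)[0][0]
```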

Beyond this, the team ran further ablation experiments to verify the new benchmark's effectiveness.

For example, adding "make the answer as detailed as possible" to the prompt increases average output length and does lead to higher scores.

But if the instruction is instead to answer "in a chatty tone," the average output length also increases, yet the score barely improves.

In addition, there were many interesting discoveries in the course of the experiment.

For example, GPT-4 grades strictly and deducts points for any mistake in an answer, while Claude 3 is lenient even when it notices small mistakes.

On coding questions, Claude 3 tends to favor answers that are simple in structure, do not depend on external libraries, and help a human learn to code, while GPT-4-Turbo prefers the most practical answers regardless of their educational value.

Also, even with the temperature set to 0, GPT-4-Turbo can produce slightly different judgments.

The top 64 clusters in the hierarchical visualization also show that the questions users ask in the arena are indeed high in both quality and diversity.

You may be able to contribute to it, too.

Arena-Hard GitHub:

https://github.com/lm-sys/arena-hard

Arena-Hard HuggingFace:

https://huggingface.co/spaces/lmsys/arena-hard-browser

Large Model Arena (Chatbot Arena):

https://arena.lmsys.org

Reference Links:

[1]https://x.com/lmsysorg/status/1782179997622649330

[2]https://lmsys.org/blog/2024-04-19-arena-hard
