The strongest open-source model Llama 3's score plummets, and the gap widens

Mengchen, from Aofei Temple

量子位 | WeChat official account QbitAI

If exam questions are too easy, both the top students and the weakest students can score 90 points, and there is no way to tell them apart...

With stronger models such as Claude 3 and Llama 3 already released, and GPT-5 still to come, the industry badly needs a harder, more discriminating benchmark.

LMSYS, the organization behind the large model arena (Chatbot Arena), has launched a next-generation benchmark, Arena-Hard, which has attracted widespread attention.

It also provides the latest reference for how strong the two fine-tuned (instruct) versions of Llama 3 really are.

Compared with the earlier MT-Bench, where every model ended up with similar scores, Arena-Hard raises separability between models from 22.6% to 87.4%, making it much clearer which is stronger and which is weaker.

Arena-Hard is built from live human data from the arena, and its agreement with human preferences is as high as 89.1%.
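Both figures compare a benchmark's model ranking against human judgments. As a rough illustration of what an agreement number like this can mean, here is a minimal Python sketch that computes pairwise ranking agreement between a benchmark's ordering and a human-vote ordering; the metric definition and the model names are illustrative assumptions, not LMSYS's exact methodology.

```python
from itertools import combinations

def pairwise_agreement(benchmark_rank: list[str], human_rank: list[str]) -> float:
    """Fraction of model pairs that the two rankings order the same way.

    Illustrative stand-in for an 'agreement with human preference' figure;
    LMSYS's exact computation may differ.
    """
    pos_b = {m: i for i, m in enumerate(benchmark_rank)}
    pos_h = {m: i for i, m in enumerate(human_rank)}
    common = [m for m in benchmark_rank if m in pos_h]
    agree = total = 0
    for a, b in combinations(common, 2):
        total += 1
        # A pair is concordant if both rankings put the same model first.
        if (pos_b[a] < pos_b[b]) == (pos_h[a] < pos_h[b]):
            agree += 1
    return agree / total if total else 0.0

# Hypothetical rankings, best model first.
benchmark = ["model-a", "model-b", "model-c", "model-d"]
human = ["model-a", "model-c", "model-b", "model-d"]
print(f"{pairwise_agreement(benchmark, human):.1%}")  # 83.3% in this toy example
```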

Beyond reaching SOTA on both of these indicators, there are further benefits:

The test data is continuously refreshed with newly written prompts that models could not have seen during training, mitigating potential data contamination.

And when a new model is released, there is no need to wait a week or so for human votes to accumulate; for about $25 the evaluation pipeline can be run quickly to get results.

Some netizens commented that testing with real user prompts instead of high-school-exam-style questions really matters.

How does the new benchmark work?

Simply put, 500 high-quality prompts are selected as the test set from the 200,000 user queries collected in the arena.

First, the selection process ensures diversity: the test set should cover a wide range of real-world topics.

To achieve this, the team used BERTopic's topic-modeling pipeline: each prompt is first embedded with OpenAI's text-embedding-3-small model, the embeddings are reduced in dimension with UMAP, clusters are identified with the hierarchical density-based clustering algorithm HDBSCAN, and finally GPT-4-Turbo summarizes each cluster.
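Below is a rough Python sketch of the same steps, written directly against the OpenAI, UMAP, and HDBSCAN libraries rather than through BERTopic's wrapper; all parameter values and the summarization prompt are illustrative assumptions, not LMSYS's actual settings.

```python
import numpy as np
from openai import OpenAI
import umap
import hdbscan

client = OpenAI()

def embed(prompts: list[str]) -> np.ndarray:
    """Embed prompts with OpenAI's text-embedding-3-small, in batches."""
    vectors = []
    for i in range(0, len(prompts), 256):
        resp = client.embeddings.create(
            model="text-embedding-3-small", input=prompts[i:i + 256]
        )
        vectors.extend(d.embedding for d in resp.data)
    return np.array(vectors)

def cluster_and_summarize(prompts: list[str]) -> dict[int, str]:
    X = embed(prompts)
    # Reduce dimensionality before density-based clustering (parameters assumed).
    X_low = umap.UMAP(n_components=5, metric="cosine", random_state=42).fit_transform(X)
    # HDBSCAN labels noise points as -1; everything else gets a cluster id.
    labels = hdbscan.HDBSCAN(min_cluster_size=20).fit_predict(X_low)

    summaries = {}
    for cluster_id in sorted(set(labels) - {-1}):
        members = [p for p, l in zip(prompts, labels) if l == cluster_id][:10]
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{
                "role": "user",
                "content": "Summarize the common topic of these user prompts "
                           "in one short phrase:\n\n" + "\n".join(members),
            }],
        )
        summaries[cluster_id] = resp.choices[0].message.content.strip()
    return summaries
```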

At the same time, the selected prompts must be of high quality, measured against seven key criteria:

Specificity: Does the prompt ask for a specific output?

Domain knowledge: Does the prompt cover one or more specific domains?

Complexity: Does the prompt involve multiple layers of reasoning, components, or variables?

Problem solving: Does the prompt require the AI to actively demonstrate problem-solving ability?

Creativity: Does the prompt require some level of creativity in the solution?

Technical accuracy: Does the prompt require the response to be technically accurate?

Practical application: Is the prompt relevant to real-world applications?

GPT-3.5-Turbo and GPT-4-Turbo annotate each prompt with a score from 0 to 7, indicating how many of the criteria it satisfies. Each cluster is then scored by the average score of its prompts.
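A minimal sketch of how such an LLM-based quality score might be collected follows; the judge prompt wording and the parsing logic are assumptions for illustration, not the exact annotation setup.

```python
from openai import OpenAI

client = OpenAI()

CRITERIA = [
    "Specificity", "Domain knowledge", "Complexity", "Problem solving",
    "Creativity", "Technical accuracy", "Practical application",
]

JUDGE_TEMPLATE = (
    "For the user prompt below, count how many of these criteria it satisfies: "
    + ", ".join(CRITERIA)
    + ". Reply with a single integer between 0 and 7.\n\nPrompt:\n{prompt}"
)

def quality_score(prompt: str, judge_model: str = "gpt-4-turbo") -> int:
    """Ask a judge model how many of the seven criteria a prompt meets (0-7)."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(prompt=prompt)}],
    )
    digits = "".join(c for c in resp.choices[0].message.content if c.isdigit())
    return min(int(digits or 0), 7)

def cluster_score(prompts: list[str]) -> float:
    """Average quality score of the prompts in one cluster."""
    return sum(quality_score(p) for p in prompts) / len(prompts)
```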

High-quality questions tend to involve challenging topics or tasks, such as game development or mathematical proofs.

Is the new benchmark accurate?

Arena-Hard currently has one notable weakness: GPT-4, acting as the judge, prefers its own output. The team acknowledges this and gives a corresponding caveat.

As can be seen, the two latest versions of GPT-4 score significantly higher than Claude 3 Opus, while the gap in human voting is not nearly as large.

In fact, recent research has already demonstrated that frontier models prefer their own output.

That research also found that models have an innate ability to recognize whether a piece of text was written by themselves, that this self-recognition ability can be strengthened by fine-tuning, and that it is linearly correlated with self-preference.

So what changes when Claude 3 does the scoring instead?

First, the scores of the Claude series do indeed improve.

But surprisingly, it also favors several open models such as Mixtral and 01.AI's Yi, and even rates GPT-3.5 significantly higher.

Overall, with Claude 3 as the judge, both separability and agreement with human results are worse than with GPT-4.

For this reason, many netizens have suggested combining the scores of multiple large models.
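In that spirit, here is a minimal sketch of what aggregating several judge models could look like; the judge pool, the pairwise prompt, and the majority-vote aggregation are illustrative assumptions, not part of Arena-Hard itself.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

# Hypothetical judge pool (OpenAI chat models); a Claude judge could be added
# through the Anthropic SDK in the same way.
JUDGES = ["gpt-4-turbo", "gpt-4", "gpt-3.5-turbo"]

def verdict(judge: str, question: str, answer_a: str, answer_b: str) -> str:
    """Ask one judge which answer is better; returns 'A' or 'B'."""
    resp = client.chat.completions.create(
        model=judge,
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"Question:\n{question}\n\nAnswer A:\n{answer_a}\n\n"
                       f"Answer B:\n{answer_b}\n\nWhich answer is better? "
                       "Reply with exactly 'A' or 'B'.",
        }],
    )
    text = resp.choices[0].message.content.strip().upper()
    return "A" if text.startswith("A") else "B"

def ensemble_verdict(question: str, answer_a: str, answer_b: str) -> str:
    """Majority vote across the judge pool."""
    votes = Counter(verdict(j, question, answer_a, answer_b) for j in JUDGES)
    return votes.most_common(1)[0][0]
```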

Beyond this, the team ran further ablation experiments to verify the new benchmark's effectiveness.

For example, adding "make the answer as detailed as possible" to the prompt increases average output length and does lead to higher scores.

But if the instruction is instead to answer "in a chatty tone," the average output length also increases, yet the score barely improves.

In addition, there were many interesting discoveries in the course of the experiment.

For example, GPT-4 grades strictly and deducts points for any mistake in an answer, while Claude 3 is lenient even when it notices small mistakes.

On coding questions, Claude 3 tends to favor answers that are simple in structure, do not depend on external libraries, and help a human learn to code, while GPT-4-Turbo prefers the most practical answers regardless of their educational value.

Also, even with the temperature set to 0, GPT-4-Turbo can produce slightly different judgments.

The top 64 clusters in the hierarchical visualization also show that the questions users ask in the arena are indeed high in both quality and diversity.

You may be able to contribute to it, too.

Arena-Hard GitHub:

https://github.com/lm-sys/arena-hard

Arena-Hard HuggingFace:

https://huggingface.co/spaces/lmsys/arena-hard-browser

Large Model Arena (Chatbot Arena):

https://arena.lmsys.org

Reference Links:

[1]https://x.com/lmsysorg/status/1782179997622649330

[2]https://lmsys.org/blog/2024-04-19-arena-hard
