
Large Model Series: A Brief Introduction to Large Model Evaluation (LLM-Eval) Theory

Author: Geek AI

Keywords: large language models, LLMs

Preface

With more and more large language models being released and put into use, how to evaluate the capabilities of large models (LLM-Eval) has become a new topic.

Summary of content

  • Why do we need large model evaluation?
  • Which capabilities of a large model need to be evaluated
  • How to evaluate large models

Why do we need large model evaluation?

The need to evaluate large models stems from several considerations:

  • Unified criteria for judging model quality: Without an objective, fair, and quantitative evaluation system, there is no way to judge the many large models available, and users cannot understand a model's real capabilities and actual effectiveness.
  • Basis for model iteration and optimization: If developers cannot quantitatively evaluate a model's capabilities, they cannot track how those capabilities change or identify the model's strengths and weaknesses, and therefore cannot formulate an improvement strategy, which hinders iterative upgrades.
  • Regulatory and safety requirements: For fields tied to public safety, such as law and medicine, a large model must be systematically evaluated to confirm that it is suitable for the field and will not cause safety incidents.
  • Basis for selecting a domain base model: Large models perform differently across fields, so an evaluation system is needed to test their capabilities in each field and select the most suitable model as the base for that specific field, enabling better industry deployment.

(Figure: Quantitative model scores on the OpenCompass website)

Which capabilities of a large model need to be evaluated

Large model evaluation generally covers natural language processing, knowledge ability, domain models, alignment, security, and other aspects. Among these, natural language processing is the relatively simple part: it includes natural language understanding (NLU) and natural language generation (NLG). NLU covers typical tasks such as sentiment analysis, text classification, and information extraction, while NLG covers tasks such as machine translation and automatic summarization.
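
As a purely illustrative sketch of how such traditional metrics are computed, the snippet below scores a toy sentiment-classification output by accuracy and a toy translation by BLEU. The data is made up and the sacrebleu package is an assumed choice; it is not tied to any particular evaluation framework.

```python
# Minimal sketch of two traditional NLP evaluation metrics.
# Assumes the sacrebleu package (pip install sacrebleu); all data is invented.
import sacrebleu

# NLU example: sentiment classification scored by accuracy.
gold_labels = ["positive", "negative", "positive"]
pred_labels = ["positive", "negative", "negative"]
accuracy = sum(g == p for g, p in zip(gold_labels, pred_labels)) / len(gold_labels)
print(f"classification accuracy: {accuracy:.2f}")

# NLG example: machine translation scored by corpus-level BLEU.
hypotheses = ["the cat sits on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream, aligned with hypotheses
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```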

(Figure: The content involved in large model evaluation)

While traditional NLP tasks are mostly designed to measure specific and relatively simple abilities, large language models have demonstrated a variety of new capabilities and shifted the focus of evaluation to more general and complex skills, such as extensive world knowledge and complex reasoning. The knowledge ability of a large model needs to be considered because it absorbs knowledge from massive amounts of data during pre-training, which makes the large model closer to an agent than to a solver of a single NLP task.

Knowledge ability assessment includes knowledge question answering, logical reasoning, tool learning, and so on. Knowledge question answering is generally done by prompting the large model to recall the knowledge it learned during pre-training; chain-of-thought (CoT) prompting makes the model reason step by step, which helps it solve logical reasoning tasks; tool learning aims to enable large models to use tools based on human instructions to solve specific tasks, for example by calling search engines or APIs and fusing the retrieved results with pre-trained knowledge to enhance answer generation.
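
As a minimal sketch, the snippet below contrasts a direct knowledge-recall prompt with a chain-of-thought prompt for the same question. The question is invented and `call_llm` is a hypothetical placeholder for whatever inference API is actually used; only the prompt wording differs between the two styles.

```python
# Sketch of direct recall vs. chain-of-thought (CoT) prompting.
# call_llm is a hypothetical stand-in for a real model / API client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your model or API client")

question = "If a train travels 60 km in 45 minutes, what is its average speed in km/h?"

# Direct recall: the model answers in one step from its pre-trained knowledge.
direct_prompt = f"Question: {question}\nAnswer:"

# CoT: instruct the model to reason step by step before giving the final answer,
# which tends to help on logical and multi-step reasoning tasks.
cot_prompt = f"Question: {question}\nLet's think step by step, then give the final answer."

# answer_direct = call_llm(direct_prompt)
# answer_cot = call_llm(cot_prompt)
```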

(Figure: Schematic diagram of tool learning)

Compared with general knowledge, knowledge of vertical domains matters more for industrial deployment, so there are also knowledge assessments for various vertical fields such as education, medical care, finance, and law, which are used to select an appropriate large model as the base model for a given vertical domain.

Besides NLP tasks and knowledge ability, the alignment of a large model's responses is also within the scope of evaluation. On the one hand, alignment means that the model's responses need to be consistent with human values and preferences. On the other hand, the truthfulness of the model's answers must be evaluated to prevent it from generating inaccurate or non-factual content; such content may stem from training data that contains wrong details, outdated facts, or even intentional misinformation, which undermines the truthfulness of the large language model.

Finally, the security of the large model needs to be examined: it must not generate harmful content, and it needs a certain degree of robustness so that small, deliberately crafted perturbations of the input cannot make it output harmful content and threaten model security.
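
One simple way to probe this kind of robustness is to apply a small perturbation to the input and check whether the model's answer stays consistent and harmless. The sketch below drops a small fraction of characters at random; the perturbation strategy and rate are illustrative assumptions, not a standard benchmark.

```python
# Sketch of a robustness probe via small input perturbations.
# The character-dropping strategy and 5% rate are illustrative assumptions.
import random

def perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly drop a small fraction of characters to simulate a noisy input."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if rng.random() > rate)

prompt = "Summarize the safety guidelines for handling user data."
noisy_prompt = perturb(prompt)
# A robust model should give substantially the same, and still harmless,
# answer for both `prompt` and `noisy_prompt`.
```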

How to evaluate large models

Large model evaluation is divided into objective evaluation and subjective evaluation. Objective evaluation has standard answers: questions are generally given to the large model as question-answering or multiple-choice items, and the model's answers are compared with the reference answers.

For NLP tasks, task-specific metrics are used to assess the large model, such as accuracy for text classification and BLEU for machine translation. For knowledge ability, the large model is assessed by having it answer exam-style questions; for example, the Chinese evaluation dataset C-Eval constructs multiple-choice questions across many vertical domains and evaluates the large model by its answer accuracy. Base models and Chat models are also handled differently: a Base model needs a few examples added to the prompt, while a Chat model, thanks to instruction fine-tuning and RLHF, can output its answer directly through dialogue.
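
The sketch below illustrates this workflow with an invented multiple-choice item in the C-Eval style: build a few-shot prompt for a Base model, extract the predicted option letter from the model output, and score by accuracy. The questions are placeholders, not actual C-Eval data, and the model call itself is left out.

```python
# Sketch of objective, multiple-choice evaluation in the C-Eval style.
# The items are invented placeholders; model calls are omitted.
import re

few_shot_examples = [
    {"question": "Which planet is closest to the Sun?",
     "choices": {"A": "Venus", "B": "Mercury", "C": "Mars", "D": "Earth"},
     "answer": "B"},
]

def build_prompt(item, examples):
    """Few-shot examples first (needed for Base models), then the unanswered test question."""
    parts = []
    for ex in examples + [item]:
        choices = "\n".join(f"{key}. {text}" for key, text in ex["choices"].items())
        answer = ex["answer"] if ex is not item else ""
        parts.append(f"Question: {ex['question']}\n{choices}\nAnswer: {answer}".rstrip())
    return "\n\n".join(parts)

def extract_choice(output: str) -> str:
    """Take the first standalone option letter found in the model output."""
    match = re.search(r"\b([ABCD])\b", output)
    return match.group(1) if match else ""

def accuracy(items, outputs):
    """Fraction of items where the extracted choice matches the reference answer."""
    return sum(extract_choice(o) == it["answer"] for it, o in zip(items, outputs)) / len(items)
```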

(Figure: Question categories of the C-Eval dataset)

Subjective evaluation is generally used in scenarios without standard answers. For example, if multiple large models are asked to write an essay on a given topic, how do we evaluate the quality of the essays they produce? In this case, human annotators can score the essays, or a judge model can be introduced to score them.
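
As a minimal sketch of the judge-model approach (often called LLM-as-a-judge), the snippet below asks a judge to rate each essay against a fixed rubric. The rubric wording and the 1-10 scale are assumptions, and `call_llm` is again a hypothetical stand-in for a real model call.

```python
# Sketch of subjective evaluation with a judge model (LLM-as-a-judge).
# The rubric and 1-10 scale are assumptions; call_llm is a hypothetical model call.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your judge model or API client")

JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the following essay on the topic '{topic}' "
    "from 1 (poor) to 10 (excellent), considering relevance, structure, and fluency. "
    "Reply with only the number.\n\nEssay:\n{essay}"
)

def judge_score(topic: str, essay: str) -> int:
    """Ask the judge model for a numeric score; assumes it replies with a bare number."""
    reply = call_llm(JUDGE_TEMPLATE.format(topic=topic, essay=essay))
    return int(reply.strip().split()[0])
```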

That is all for this article. The use and practice of large model evaluation datasets and evaluation frameworks will be shared in future posts.
