
Large Model Series: A Brief Introduction to Large Model Evaluation (LLM-Eval) Theory

Author: Geek AI

Keywords: large language models, LLMs

Preface

With more and more large language models being released and put into use, how to evaluate the capabilities of large models (LLM-Eval) has become a new topic.

Summary of content

  • Why do we need large model evaluation?
  • Which capabilities of a large model need to be evaluated
  • How to evaluate large models

Why do we need large model evaluation?

The need to evaluate large models stems from several considerations:

  • Unified criteria for judging model quality: Without an objective, fair, and quantitative evaluation system, there is no way to judge the many large models available, and users cannot understand a model's real capabilities and actual effectiveness.
  • Basis for model iteration and optimization: If developers cannot quantitatively evaluate a model's capabilities, they cannot track how those capabilities change or identify the model's strengths and weaknesses, and therefore cannot formulate an improvement strategy, which hinders iterative upgrades.
  • Regulatory and safety requirements: For fields tied to public safety, such as law and medicine, a large model must be systematically evaluated to confirm that it is suitable for the field and will not cause safety incidents.
  • Basis for selecting a domain base model: Large models perform differently across fields, so an evaluation system is needed to test their capabilities in each field and select the most suitable model as the base for that specific field, enabling better industry deployment.

(Figure: Quantitative model scores on the OpenCompass website)

Which capabilities of a large model need to be evaluated

Large model evaluation generally covers natural language processing, knowledge ability, domain models, alignment, security, and other aspects. Among these, natural language processing is the relatively simple part: it includes natural language understanding (NLU) and natural language generation (NLG). NLU covers typical tasks such as sentiment analysis, text classification, and information extraction, while NLG covers tasks such as machine translation and automatic summarization.
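
As a purely illustrative sketch of how such traditional metrics are computed, the snippet below scores a toy sentiment-classification output by accuracy and a toy translation by BLEU. The data is made up and the sacrebleu package is an assumed choice; it is not tied to any particular evaluation framework.

```python
# Minimal sketch of two traditional NLP evaluation metrics.
# Assumes the sacrebleu package (pip install sacrebleu); all data is invented.
import sacrebleu

# NLU example: sentiment classification scored by accuracy.
gold_labels = ["positive", "negative", "positive"]
pred_labels = ["positive", "negative", "negative"]
accuracy = sum(g == p for g, p in zip(gold_labels, pred_labels)) / len(gold_labels)
print(f"classification accuracy: {accuracy:.2f}")

# NLG example: machine translation scored by corpus-level BLEU.
hypotheses = ["the cat sits on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream, aligned with hypotheses
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```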

(Figure: The content involved in large model evaluation)

While traditional NLP tasks are mostly designed to measure specific and relatively simple abilities, large language models have demonstrated a variety of new capabilities and shifted the focus of evaluation to more general and complex skills, such as extensive world knowledge and complex reasoning. The knowledge ability of a large model needs to be considered because it absorbs knowledge from massive amounts of data during pre-training, which makes the large model closer to an agent than to a solver of a single NLP task.

Knowledge ability assessment includes knowledge question answering, logical reasoning, tool learning, and so on. Knowledge question answering is generally done by prompting the large model to recall the knowledge it learned during pre-training; chain-of-thought (CoT) prompting makes the model reason step by step, which helps it solve logical reasoning tasks; tool learning aims to enable large models to use tools based on human instructions to solve specific tasks, for example by calling search engines or APIs and fusing the retrieved results with pre-trained knowledge to enhance answer generation.
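
As a minimal sketch, the snippet below contrasts a direct knowledge-recall prompt with a chain-of-thought prompt for the same question. The question is invented and `call_llm` is a hypothetical placeholder for whatever inference API is actually used; only the prompt wording differs between the two styles.

```python
# Sketch of direct recall vs. chain-of-thought (CoT) prompting.
# call_llm is a hypothetical stand-in for a real model / API client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your model or API client")

question = "If a train travels 60 km in 45 minutes, what is its average speed in km/h?"

# Direct recall: the model answers in one step from its pre-trained knowledge.
direct_prompt = f"Question: {question}\nAnswer:"

# CoT: instruct the model to reason step by step before giving the final answer,
# which tends to help on logical and multi-step reasoning tasks.
cot_prompt = f"Question: {question}\nLet's think step by step, then give the final answer."

# answer_direct = call_llm(direct_prompt)
# answer_cot = call_llm(cot_prompt)
```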

(Figure: Schematic diagram of tool learning)

Compared with general knowledge, knowledge of vertical domains matters more for industrial deployment, so there are also knowledge assessments for various vertical fields such as education, medical care, finance, and law, which are used to select an appropriate large model as the base model for a given vertical domain.

Besides NLP tasks and knowledge ability, the alignment of a large model's responses is also within the scope of evaluation. On the one hand, alignment means that the model's responses need to be consistent with human values and preferences. On the other hand, the truthfulness of the model's answers must be evaluated to prevent it from generating inaccurate or non-factual content; such content may stem from training data that contains wrong details, outdated facts, or even intentional misinformation, which undermines the truthfulness of the large language model.

Finally, the security of the large model needs to be examined: it must not generate harmful content, and it needs a certain degree of robustness so that small, deliberately crafted perturbations of the input cannot make it output harmful content and threaten model security.
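
One simple way to probe this kind of robustness is to apply a small perturbation to the input and check whether the model's answer stays consistent and harmless. The sketch below drops a small fraction of characters at random; the perturbation strategy and rate are illustrative assumptions, not a standard benchmark.

```python
# Sketch of a robustness probe via small input perturbations.
# The character-dropping strategy and 5% rate are illustrative assumptions.
import random

def perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly drop a small fraction of characters to simulate a noisy input."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if rng.random() > rate)

prompt = "Summarize the safety guidelines for handling user data."
noisy_prompt = perturb(prompt)
# A robust model should give substantially the same, and still harmless,
# answer for both `prompt` and `noisy_prompt`.
```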

How to evaluate large models

Large model evaluation is divided into objective evaluation and subjective evaluation. Objective evaluation has standard answers: questions are generally given to the large model as question-answering or multiple-choice items, and the model's answers are compared with the reference answers.

For NLP tasks, task-specific metrics are used to assess the large model, such as accuracy for text classification and BLEU for machine translation. For knowledge ability, the large model is assessed by having it answer exam-style questions; for example, the Chinese evaluation dataset C-Eval constructs multiple-choice questions across many vertical domains and evaluates the large model by its answer accuracy. Base models and Chat models are also handled differently: a Base model needs a few examples added to the prompt, while a Chat model, thanks to instruction fine-tuning and RLHF, can output its answer directly through dialogue.
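
The sketch below illustrates this workflow with an invented multiple-choice item in the C-Eval style: build a few-shot prompt for a Base model, extract the predicted option letter from the model output, and score by accuracy. The questions are placeholders, not actual C-Eval data, and the model call itself is left out.

```python
# Sketch of objective, multiple-choice evaluation in the C-Eval style.
# The items are invented placeholders; model calls are omitted.
import re

few_shot_examples = [
    {"question": "Which planet is closest to the Sun?",
     "choices": {"A": "Venus", "B": "Mercury", "C": "Mars", "D": "Earth"},
     "answer": "B"},
]

def build_prompt(item, examples):
    """Few-shot examples first (needed for Base models), then the unanswered test question."""
    parts = []
    for ex in examples + [item]:
        choices = "\n".join(f"{key}. {text}" for key, text in ex["choices"].items())
        answer = ex["answer"] if ex is not item else ""
        parts.append(f"Question: {ex['question']}\n{choices}\nAnswer: {answer}".rstrip())
    return "\n\n".join(parts)

def extract_choice(output: str) -> str:
    """Take the first standalone option letter found in the model output."""
    match = re.search(r"\b([ABCD])\b", output)
    return match.group(1) if match else ""

def accuracy(items, outputs):
    """Fraction of items where the extracted choice matches the reference answer."""
    return sum(extract_choice(o) == it["answer"] for it, o in zip(items, outputs)) / len(items)
```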

(Figure: Question categories of the C-Eval dataset)

Subjective evaluation is generally used in scenarios without standard answers. For example, if multiple large models are asked to write an essay on a given topic, how do we evaluate the quality of the essays they produce? In this case, human annotators can score the essays, or a judge model can be introduced to score them.
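
As a minimal sketch of the judge-model approach (often called LLM-as-a-judge), the snippet below asks a judge to rate each essay against a fixed rubric. The rubric wording and the 1-10 scale are assumptions, and `call_llm` is again a hypothetical stand-in for a real model call.

```python
# Sketch of subjective evaluation with a judge model (LLM-as-a-judge).
# The rubric and 1-10 scale are assumptions; call_llm is a hypothetical model call.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your judge model or API client")

JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the following essay on the topic '{topic}' "
    "from 1 (poor) to 10 (excellent), considering relevance, structure, and fluency. "
    "Reply with only the number.\n\nEssay:\n{essay}"
)

def judge_score(topic: str, essay: str) -> int:
    """Ask the judge model for a numeric score; assumes it replies with a bare number."""
    reply = call_llm(JUDGE_TEMPLATE.format(topic=topic, essay=essay))
    return int(reply.strip().split()[0])
```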

That is all for this article. The use and practice of large model evaluation datasets and evaluation frameworks will be shared in future posts.
