
Outperforming existing metrics by up to 57.3%: the team of Professor Eric Xing and Professor Zhiting Hu proposes a unified NLG evaluation framework

Heart of the Machine column

Author: Mingkai Deng

Machine-generated text has long been difficult to evaluate. Recently, the team of CMU Professor Eric Xing and UCSD Professor Zhiting Hu proposed using a single operator to unify the evaluation of various generation tasks, providing more unified guidance for future new tasks and requirements. Experiments show that evaluation metrics designed with this unified framework correlate with human judgments better than existing metrics across multiple tasks, and the metrics can now be used directly via PyPI and GitHub.

Natural language generation (NLG) covers natural language processing (NLP) tasks such as machine translation, summarization, dialogue, and more. Although these tasks all require producing fluent text, their ultimate goals differ greatly. For example, translation must express the meaning of the source text completely and accurately; summarization must concisely and accurately capture the most important information in the source; and a dialogue system must respond to the user in a lively and helpful way.

Over the past few years, researchers have made great strides in modeling these tasks. However, evaluating the generated text remains difficult. Human evaluation is the most accurate but is expensive and time-consuming. Automatic evaluation scales more easily, but it is often unclear what exactly should be measured and how.

Traditionally, evaluation methods compare model-generated text with reference text written by humans, but recent studies have shown that as models improve, such methods find it increasingly hard to tell good text from bad. In fact, in the DSTC9 dialogue system competition at AAAI 2021, human scoring no longer used reference texts at all; raters instead judged quality by jointly considering the dialogue history, the knowledge context, and the model's responses.

At the same time, deployment in real applications calls for multi-dimensional evaluation of generative models, which a single traditional metric cannot provide. For example, in the "Thousand Words: Generation Evaluation Contest for Factual Consistency" hosted by Baidu in 2021, factual-consistency metrics were examined alongside traditional information-selection metrics, with an independent evaluation process designed for them. Each track of the aforementioned DSTC9 competition likewise evaluated 3-8 different dimensions.

To address these new needs, a variety of evaluation methods and new metrics have been proposed, but they are usually designed for specific tasks and objectives. For the ever-growing variety of tasks, what should be evaluated, and how? Systematic guidance has been lacking.

In this direction, research teams at CMU (Carnegie Mellon University), Petuum Inc., MBZUAI (Mohamed bin Zayed University of Artificial Intelligence), and UCSD (University of California, San Diego) have proposed a theoretical framework for evaluating natural language generation that provides more unified guidance when designing evaluation processes for future tasks and requirements.

First, the researchers divide language generation tasks into three categories according to how information changes from input to output; each category places different evaluation requirements on the output. Categorizing a new task in this way suggests "what to evaluate."

Second, they use an operator called "information alignment" to unify the evaluation methods across all task categories; designing evaluation metrics from the perspective of information alignment answers a large share of the "how to evaluate" questions.

Based on information alignment, the paper uniformly designs a series of evaluation metrics whose correlation with human scores on multiple evaluation tasks (summarization, style transfer, and knowledge-grounded dialogue) exceeds existing metrics by up to 57.30%.

The evaluation metrics designed in the paper have been uploaded to PyPI and can be installed directly with pip. The researchers have also published the code on GitHub, along with several trained information alignment models, which readers are welcome to use in their research.


Paper link: https://arxiv.org/pdf/2109.06379.pdf

Code and API links: https://github.com/tanyuqian/ctc-gen-eval

Python install: pip install ctc_score
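As a rough usage sketch: the scorer class and arguments below follow the project's README at the time of writing and should be treated as assumptions rather than a guaranteed API; see the GitHub repository for the authoritative examples.

```python
# Minimal sketch of scoring a summary for factual consistency with ctc_score.
# The class name, the align='D-cnndm' option (a discriminative alignment model
# trained on CNN/DailyMail), and the argument names are taken from the README
# and may change between versions.
from ctc_score import SummarizationScorer

scorer = SummarizationScorer(align='D-cnndm')

score = scorer.score(
    doc='The city council met on Tuesday and approved the new transit budget.',
    refs=['The council approved a new transit budget at its Tuesday meeting.'],
    hypo='The council approved the transit budget on Tuesday.',
    aspect='consistency',  # 'relevance' additionally uses the reference summaries
)
print(score)
```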

What to evaluate: Classification of language generation tasks

Based on the relationship between the amount of information in a task's input (X) and output (Y) text, the researchers divide language generation tasks into three categories: compression, transduction, and creation, corresponding to the input containing more, the same amount of, or less information than the output, respectively. Each category has a different goal and places different requirements on the output text. Classifying a new task in this way suggests "what to evaluate."

Compression tasks


Goal: Present the important parts of the input information in the output

Examples: Summarization, Image Captioning, Data-to-Text, and Question Generation

Evaluation focus: 1) The output information should come entirely from the input; 2) the output should cover the important information in the input

Transduction tasks


Goal: Transform one aspect of the input information and leave the rest unchanged

Examples: Translation, Paraphrasing, Style Transfer, and Language Simplification

Evaluation focus: The output should retain the input information as completely as possible

Creation tasks


Goal: Produce new information based on the input and external information

Examples: Dialog, Advice Generation, Story Generation, and Poetry Generation

Evaluation focus: 1) The output should adequately respond to the input; 2) the output should use external information correctly

As can be seen, the evaluation focus in each case hinges on how the information in the input and output overlaps, so a way to measure the degree of overlap between input and output information would allow evaluating all categories of generation tasks.

How to evaluate: Information alignment

To measure this degree of overlap, the researchers introduce an operator called "information alignment," which unifies the evaluation of all generation tasks.

Information alignment means that, for a text A and arbitrary data B, a confidence score can be computed for each token of A indicating whether that token's information is reflected in B. Mathematically, the result is a vector: align(A → B) = (α_1, α_2, ..., α_N), where N is the number of tokens in A and each α_i ∈ [0, 1].

In practice, B does not have to be text; it can be data of any modality, as long as there is an alignment model that can compute these confidence scores. The relationship between A, B, the model, and the alignment vector is illustrated in the figure below:

[Figure: the alignment model takes text A and data B as input and outputs the alignment vector]
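In code, the operator can be thought of as a function from a text and its grounding data to one score per token. The signature below is purely illustrative (it is not part of the released package) and is reused in the metric sketches that follow:

```python
from typing import List

def align(a: str, b: str) -> List[float]:
    """Illustrative interface only: return one confidence in [0, 1] per token
    of text `a`, indicating whether that token's information is reflected in
    data `b`. Concrete implementations are sketched further below."""
    raise NotImplementedError
```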

Below, the researchers show how this information alignment operator can be used to uniformly define evaluation metrics for various language generation tasks.

Designing unified evaluation metrics with information alignment

Compression tasks

For compression tasks, the researchers take summarization as an example:

[Figure: evaluation metrics for summarization defined via information alignment]
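As a sketch of how such metrics combine alignment scores, the paper's consistency and relevance metrics for summarization can be written roughly as follows (the aggregation shown here is a simplified reading of the paper; the released code may differ in details):

```python
from statistics import mean

def consistency(align, source, summary):
    # Factual consistency: every token of the summary should be grounded
    # in the source document.
    return mean(align(summary, source))

def relevance(align, source, reference, summary):
    # Relevance: the summary should be grounded in the source and should
    # cover the information that a human-written reference deems important.
    return mean(align(reference, summary)) * mean(align(summary, source))
```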

Transduction tasks

For transduction tasks, the researchers take text style transfer as an example:

[Figure: evaluation metrics for text style transfer defined via information alignment]
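For style transfer, content preservation can be scored in both directions and combined, roughly as a harmonic mean of the two aggregated alignments (again a simplified reading of the paper):

```python
from statistics import mean

def preservation(align, source, transferred):
    # Content preservation: apart from the changed style attribute, the output
    # should keep the input's content and vice versa; the two directions are
    # combined like a precision/recall F-score.
    forward = mean(align(transferred, source))   # output grounded in input
    backward = mean(align(source, transferred))  # input covered by output
    return 2 * forward * backward / (forward + backward)
```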

Creation tasks

For creation tasks, the researchers take knowledge-grounded dialogue as an example:

[Figure: evaluation metrics for knowledge-grounded dialogue defined via information alignment]
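For knowledge-grounded dialogue, the paper's engagingness and groundedness can be sketched as alignment of the response to the dialogue history plus knowledge, and to the knowledge alone (simplified; for instance, simple string concatenation of history and knowledge is an assumption of this sketch):

```python
from statistics import mean

def engagingness(align, history, knowledge, response):
    # The response should address the dialogue history and bring in helpful
    # knowledge; summing rather than averaging rewards informative responses.
    return sum(align(response, history + " " + knowledge))

def groundedness(align, knowledge, response):
    # Knowledge used in the response should actually come from the source.
    return mean(align(response, knowledge))
```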

Now that these evaluation metrics have been defined in terms of the information alignment operator, the next step is to see how the operator itself is implemented.

Three implementation methods for information alignment

The researchers model information alignment as a prediction problem and propose three implementations based on pretrained language models, typically trained with self-supervised learning. The accuracy of an alignment model can be evaluated by comparing its predictions with human annotations.

Embedding Matching

[Figure: the embedding matching approach]
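A sketch of the embedding-matching idea: encode both texts with a pretrained encoder and give each token of A the cosine similarity of its best-matching token in B, in the spirit of BERTScore-style greedy matching. The encoder choice and the lack of score rescaling are assumptions of this sketch:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any contextual encoder works in principle; roberta-large is an assumption.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")

def embed(text):
    # L2-normalized contextual embeddings, one row per token.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state[0]
    return torch.nn.functional.normalize(hidden, dim=-1)

def align_embedding_matching(a, b):
    # Each token of `a` is scored by its maximum cosine similarity to any
    # token of `b` (values in [-1, 1]; a real metric would rescale to [0, 1]).
    sim = embed(a) @ embed(b).T
    return sim.max(dim=1).values.tolist()
```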

Discriminative Model

[Figure: the discriminative model approach]
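The discriminative approach treats alignment as per-token classification: a pretrained model is fine-tuned (for instance on self-supervised data with synthetically corrupted tokens) to label each token of A as grounded in B or not. Below is a sketch of inference with a token-classification head; the checkpoint here is only a placeholder, whereas the actually trained alignment models are the ones released on GitHub:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder checkpoint: in practice, load an alignment model fine-tuned
# to predict grounded / not-grounded labels for each token.
MODEL_NAME = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=2)

def align_discriminative(a, b):
    # Pack the grounding data `b` and text `a` as a sentence pair, then read
    # off P(grounded) for each token position that belongs to `a`.
    enc = tokenizer(b, a, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**enc).logits[0].softmax(dim=-1)[:, 1]
    seq_ids = enc.sequence_ids()  # 0 for tokens of `b`, 1 for tokens of `a`
    return [probs[i].item() for i, s in enumerate(seq_ids) if s == 1]
```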

Aggregated Regression

[Figure: the aggregated regression approach]
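The aggregated regression variant skips per-token outputs and directly regresses an aggregated alignment score (for example, the mean) for the whole pair, which is useful when only sequence-level supervision is available. A sketch with a single-output regression head (the checkpoint is again a placeholder):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint: in practice, load a model fine-tuned to regress
# the aggregated alignment score of text `a` against data `b`.
MODEL_NAME = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

def aggregated_alignment(a, b):
    # One scalar for the whole pair instead of one score per token of `a`.
    enc = tokenizer(b, a, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**enc).logits[0, 0].item()
```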

Experimental results

Experimental results show that the researchers' uniformly designed evaluation metrics correlate with human scores better than previous task-specific metrics, exceeding existing metrics by up to 57.30%. In addition, the researchers found that the more accurate the alignment model's predictions, the more closely their metrics matched human evaluation.

Exceeds existing metrics by up to 57.30%

[Figures: correlation with human judgments on summarization, style transfer, and dialogue, compared with existing metrics]

Alignment model accuracy is directly related to agreement with human scores

Alignment models are typically trained with self-supervised learning, but training on human annotations can further improve their accuracy, and with it the evaluation metrics built on them. The resulting correlation with human scores is shown in the figure below:

[Figure: correlation with human scores improves as alignment model accuracy increases]

This shows that a large number of evaluation metrics can be improved simply by improving the alignment model. Alignment prediction can be treated as a standalone task, and progress on it directly improves the accuracy of evaluating language generation.

This work is a first step toward composable text evaluation. As in software engineering, the researchers note, the evaluation system can be divided into modules that can be independently improved, scaled, and diagnosed, and they expect more exploration in this direction in the future.

Cover Source: https://soa.cmu.edu/
