Reporting by XinZhiyuan
Author: Chen Jiangjie
Editor: Sleepy
The development of Internet technology has allowed information to spread rapidly, and the amount of information we receive each day has grown almost explosively. This sheer volume of input makes it difficult for us to rely on our own limited knowledge to judge whether these messages are true or false, especially during important events such as the COVID-19 pandemic. We therefore need automated fact-verification algorithms that leverage reliable sources of information, such as Wikipedia, to determine whether a given statement is trustworthy.
Fact-verification algorithms are designed to leverage existing knowledge bases to verify the factual correctness of text.
At present, fact-verification methods usually break the problem into two steps: retrieval and verification.
In the retrieval phase, a retrieval model fetches relevant text from the knowledge base for the given statement (claim) to serve as evidence for the final verdict; in the verification phase, a verification model reasons over the retrieved evidence to derive the final prediction.
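The two-step pipeline can be made concrete with a few lines of code. The retriever and verifier below are deliberately naive stand-ins (word overlap and a substring check, not trained models); the sketch only illustrates the interface of each stage:

```python
# Toy retrieve-then-verify pipeline. Both stages are naive stand-ins;
# real systems use trained neural models for each stage.

def retrieve(claim, knowledge_base, k=2):
    """Rank knowledge-base sentences by word overlap with the claim."""
    claim_words = set(claim.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda s: len(claim_words & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

def verify(claim, evidence):
    """Stub verifier: SUPPORTS if every claim word appears in the evidence."""
    text = " ".join(evidence).lower()
    if all(w in text for w in claim.lower().split()):
        return "SUPPORTS"
    return "NOT ENOUGH INFO"

kb = ["Kung Fu Panda premiered in the United States on June 6, 2008.",
      "Paris is the capital of France."]
evidence = retrieve("Kung Fu Panda premiered in 2008", kb)
print(verify("Kung Fu Panda premiered in 2008", evidence))  # SUPPORTS
```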
However, most existing models only output the final classification result and offer no explanation of why a claim is judged correct or not, making it hard to know why the model made a given prediction. This is a serious obstacle to building trustworthy AI applications.
To address the interpretability problem in fact verification, a team from the ByteDance Artificial Intelligence Lab and Fudan University proposed LOREN, a new interpretable fact-verification paradigm: decomposing the verification of an entire claim into phrase-level verification.
Paper: https://arxiv.org/abs/2012.13577
Code: https://github.com/jiangjiechen/LOREN
Under this paradigm, the model gives fine-grained verification results for the entire claim, which helps users understand the model's reasoning more intuitively and also makes it faster to locate factual errors.
Interpretable verification
Problem modeling
LOREN's main idea is to decompose sentence-level verification into phrase-level verification.
Figure 2: LOREN framework
For a given claim $c$ and evidence set $E$, which together form the input $x = (c, E)$, the model needs to produce the final prediction $\hat{y}$ and, at the same time, verification results $z_1, \dots, z_n$ for all the phrases $w_1, \dots, w_n$ in the claim, where $\hat{y}, z_i \in \{\text{SUP}, \text{REF}, \text{NEI}\}$ indicate Supports, Refutes, and Not Enough Information, respectively.

Define the hidden variable $z = (z_1, \dots, z_n)$ as the prediction results for all the phrases. The final prediction obviously depends on each phrase's result, so in probability it can be expressed as:

$$p(y \mid x) = \sum_{z} p(y \mid z, x)\, p(z \mid x).$$

Given the input data $x$ and the corresponding label $y$, the optimization objective of the entire model is:

$$\max_\theta \ \log p_\theta(y \mid x) = \log \sum_{z} p_\theta(y \mid z, x)\, p_\theta(z \mid x).$$
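For a small number of phrases, this marginalization can be computed exactly by brute-force enumeration. A minimal sketch with two phrases, where the prior and the conditional are toy values assumed purely for illustration (the conditional here is a deterministic aggregation of the phrase labels):

```python
from itertools import product

LABELS = ("SUP", "REF", "NEI")

# Toy factorized prior p(z_i | x) for a claim with two phrases.
p_z = [{"SUP": 0.7, "REF": 0.2, "NEI": 0.1},
       {"SUP": 0.6, "REF": 0.3, "NEI": 0.1}]

def p_y_given_z(y, z):
    """Toy conditional p(y | z, x): a deterministic aggregation of the
    phrase labels, assumed here purely for illustration."""
    if "REF" in z:
        agg = "REF"
    elif all(zi == "SUP" for zi in z):
        agg = "SUP"
    else:
        agg = "NEI"
    return 1.0 if y == agg else 0.0

def marginal(y):
    """p(y | x) = sum over z of p(y | z, x) * prod_i p(z_i | x)."""
    total = 0.0
    for z in product(LABELS, repeat=len(p_z)):
        prior = 1.0
        for zi, dist in zip(z, p_z):
            prior *= dist[zi]
        total += p_y_given_z(y, z) * prior
    return total

print({y: round(marginal(y), 2) for y in LABELS})
# {'SUP': 0.42, 'REF': 0.44, 'NEI': 0.14}
```

Enumeration is exponential in the number of phrases, which is exactly why the paper resorts to variational inference rather than exact marginalization.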
One way to solve this is the EM algorithm; however, the true posterior distribution $p_\theta(z \mid x, y)$ is intractable. Therefore, using variational inference, a variational posterior distribution $q_\phi(z \mid x, y)$ is introduced, which turns the problem into optimizing the corresponding variational lower-bound objective, the negative Evidence Lower BOund (ELBO):

$$\mathcal{L}_{\text{ELBO}} = -\,\mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid z, x)\big] + \mathrm{KL}\big(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\big),$$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ is the KL divergence.
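For small cases the negative ELBO can again be evaluated exactly by enumeration, which is a useful sanity check. The distributions and the log-likelihood function below are toy values, not numbers from the paper:

```python
import math
from itertools import product

LABELS = ("SUP", "REF", "NEI")

# Toy factorized distributions for two phrases: variational posterior q
# and NLI-based prior p (all numbers are made up for illustration).
q_z = [{"SUP": 0.8, "REF": 0.1, "NEI": 0.1},
       {"SUP": 0.2, "REF": 0.7, "NEI": 0.1}]
p_z = [{"SUP": 0.6, "REF": 0.2, "NEI": 0.2},
       {"SUP": 0.3, "REF": 0.5, "NEI": 0.2}]

def kl_categorical(q, p):
    """KL(q || p) for one categorical distribution."""
    return sum(q[l] * math.log(q[l] / p[l]) for l in LABELS if q[l] > 0)

def neg_elbo(log_lik):
    """-E_q[log p(y|z,x)] + KL(q || p). The expectation is taken by
    enumerating all joint phrase-label assignments; the KL term
    decomposes per phrase because both q and p factorize."""
    expected = 0.0
    for z in product(LABELS, repeat=len(q_z)):
        weight = 1.0
        for zi, dist in zip(z, q_z):
            weight *= dist[zi]
        expected += weight * log_lik(z)
    kl = sum(kl_categorical(q, p) for q, p in zip(q_z, p_z))
    return -expected + kl

# Toy log-likelihood for a gold label REF: high if some phrase is REF.
loglik = lambda z: math.log(0.9 if "REF" in z else 0.1)
print(round(neg_elbo(loglik), 3))  # 0.875
```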
To obtain the prior distribution over phrase verification results, $p_\theta(z \mid x)$, the authors draw on work in Natural Language Inference (NLI), matching the NLI labels (entailment, contradiction, neutral) to SUP, REF, and NEI, respectively. With a model pre-trained on NLI data, the prior distribution can then be computed.
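A minimal sketch of such a prior, assuming the NLI model exposes raw scores for its three labels (the score values here are toy numbers):

```python
import math

# Illustrative mapping from NLI labels to verification labels; softmax
# turns the raw NLI scores into a prior p(z_i | x).
NLI_TO_FACT = {"entailment": "SUP", "contradiction": "REF", "neutral": "NEI"}

def prior_from_nli(nli_scores):
    """Turn raw NLI scores into a prior distribution over SUP/REF/NEI."""
    m = max(nli_scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in nli_scores.items()}
    total = sum(exps.values())
    return {NLI_TO_FACT[k]: v / total for k, v in exps.items()}

prior = prior_from_nli({"entailment": 2.0, "contradiction": 0.5, "neutral": 0.0})
print(max(prior, key=prior.get))  # SUP
```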
Logical constraints
The biggest challenge in this work is that the available data does not support learning at phrase granularity: there are no (and realistically cannot be) ground-truth labels for the factual correctness of individual phrases.
To address this problem, the authors propose to exploit a set of logical aggregation rules that exist naturally in fact verification to provide weak supervision signals for learning $q_\phi$, which in effect translates them into logical constraints between the final label and the phrase-level labels.
The following logical rules can be observed:
If a statement is inconsistent with facts (REF), then there is at least one phrase in it that does not correspond to the facts;
If a statement is factual (SUP), then all phrases in it should be true;
If a statement is unverifiable (NEI), then there should be no untrue phrases, and at least one of the phrases is unverifiable.
These logical rules can be formally expressed as:

$$y = \text{REF} \;\leftrightarrow\; \exists\, i:\ z_i = \text{REF}$$
$$y = \text{SUP} \;\leftrightarrow\; \forall\, i:\ z_i = \text{SUP}$$
$$y = \text{NEI} \;\leftrightarrow\; (\forall\, i:\ z_i \neq \text{REF}) \,\wedge\, (\exists\, i:\ z_i = \text{NEI})$$

where each $z_i$ represents the verification result of the phrase $w_i$. These rules are then softened with probability: label probabilities are treated as soft truth values, and the quantifiers are approximated by aggregating (e.g. taking products of) the phrase distributions $q_\phi(z_i)$. The label distribution $\hat{q}(y)$ obtained through this probabilistic aggregation encodes the logical knowledge described above. The authors use it as a teacher model to guide $q_\phi$, i.e. logical knowledge distillation:

$$\mathcal{L}_{\text{logic}} = \mathrm{KL}\big(\hat{q}(y)\,\|\,p_\theta(y \mid z, x)\big).$$
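The softened aggregation and the distillation loss can be sketched as follows, under an independence assumption across phrases (all numbers are illustrative, not from the paper):

```python
import math

def soft_aggregate(q_phrases):
    """Soften the three rules, assuming phrase labels are independent:
    SUP (all phrases SUP)   -> product of q_i(SUP)
    REF (at least one REF)  -> 1 - product of (1 - q_i(REF))
    NEI (no REF, some NEI)  -> the remaining probability mass."""
    all_sup = math.prod(q["SUP"] for q in q_phrases)
    no_ref = math.prod(1.0 - q["REF"] for q in q_phrases)
    return {"SUP": all_sup, "REF": 1.0 - no_ref, "NEI": no_ref - all_sup}

def kl(p, q):
    """KL(p || q): the distillation loss between the aggregated
    (teacher) label distribution and the model's final prediction."""
    return sum(p[l] * math.log(p[l] / q[l]) for l in p if p[l] > 0)

q_phrases = [{"SUP": 0.9, "REF": 0.05, "NEI": 0.05},
             {"SUP": 0.3, "REF": 0.6, "NEI": 0.1}]
teacher = soft_aggregate(q_phrases)  # ≈ {'SUP': 0.27, 'REF': 0.62, 'NEI': 0.11}
model_pred = {"SUP": 0.2, "REF": 0.7, "NEI": 0.1}
print(round(kl(teacher, model_pred), 4))
```

Note that the three aggregated probabilities sum to one by construction, so the teacher is a valid distribution over the final label.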
Ultimately, the optimization objective of the model consists of these two parts: $\mathcal{L} = \mathcal{L}_{\text{ELBO}} + \mathcal{L}_{\text{logic}}$.
Construct local premises
In order to achieve the phrase-level verification described above, two issues need to be solved:
Find the phrases in the claim that need to be verified;
Find enough information in the knowledge base to verify these phrases.
Both can be done offline, before the verification model above is trained.
For the first question, the authors used existing NLP parsing tools to identify named entities, noun phrases, verb phrases, and adjective phrases in a given statement. For example, given the statement "Kung Fu Panda was released in 2016.", we can split it into "Kung Fu Panda" (named entity), "released" (verb phrase), and "2016" (noun phrase).
For the second issue, the authors modeled it as a machine reading comprehension (MRC) task. Given a claim and a phrase, a probing question is constructed for that phrase, such as the cloze form "Kung Fu Panda was released in [MASK]." or the natural question "When was Kung Fu Panda released?", and an MRC model is used to extract the corresponding fact from the evidence set. For example, given the evidence "Kung Fu Panda premiered in the United States on June 6, 2008.", we hope the model answers "2008".
By backfilling this answer into the corresponding position of the claim, a local premise for the phrase is obtained, such as "Kung Fu Panda was released in 2008.". Specifically, the authors construct training data from the existing dataset in a self-supervised manner and use it to train this generative MRC model.
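The probe construction and backfilling steps above can be illustrated with plain string operations (a real system generates the questions and answers with trained models; this sketch only shows the input/output shapes):

```python
# Plain-string illustration of the cloze probe and the backfilling step.

def make_cloze(claim, phrase):
    """Mask out the phrase to be verified, producing the cloze probe."""
    return claim.replace(phrase, "[MASK]")

def backfill(claim, phrase, answer):
    """Replace the probed phrase with the MRC answer -> a local premise."""
    return claim.replace(phrase, answer)

claim = "Kung Fu Panda was released in 2016."
print(make_cloze(claim, "2016"))        # Kung Fu Panda was released in [MASK].
print(backfill(claim, "2016", "2008"))  # Kung Fu Panda was released in 2008.
```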
Fact-checking
Having obtained the local premises of the claim, the two distributions $q_\phi(z \mid x, y)$ and $p_\theta(y \mid z, x)$ can be parameterized with neural networks for the final fact verification.
A pre-trained language model is used to encode the local information (the claim concatenated with each local premise) and the global information (the claim concatenated with the evidence set), yielding local and global representations. On top of these, each of the two distributions is built with a fully connected network:
$q_\phi(z \mid x, y)$ receives the vector representation of the label $y$ together with the global and local information as input, and outputs the predicted probability distribution over $z$;
$p_\theta(y \mid z, x)$ receives the hidden variables $z$ together with the global and local information as input, and outputs the predicted probability distribution over $y$.
In the prediction phase, the variables are randomly initialized and then decoded iteratively until convergence, which makes it possible to predict the final label while performing fine-grained verification of the different phrases in the given claim.
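The iterative decoding loop can be sketched as a simple fixed-point iteration. The update rule below is a stub standing in for one pass of the trained model, used only so the loop demonstrably converges:

```python
import random

def iterative_decode(num_phrases, update, max_iters=50, seed=0):
    """Sketch of the prediction loop: randomly initialize the phrase
    labels z, then re-decode until they stop changing. `update` stands
    in for one pass of the trained model."""
    rng = random.Random(seed)
    z = [rng.choice(["SUP", "REF", "NEI"]) for _ in range(num_phrases)]
    for _ in range(max_iters):
        new_z = update(z)
        if new_z == z:  # converged
            break
        z = new_z
    return z

# Stub update rule, for illustration only: it flips one non-SUP label
# per pass, so the loop converges to all-SUP within num_phrases steps.
def toy_update(z):
    for i, zi in enumerate(z):
        if zi != "SUP":
            return z[:i] + ["SUP"] + z[i + 1:]
    return z

print(iterative_decode(3, toy_update))  # ['SUP', 'SUP', 'SUP']
```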
Main experimental results
The authors conducted experiments on the fact-verification dataset FEVER, using the official Label Accuracy and FEVER score as evaluation metrics; the overall results are shown in Table 1. Comparing LOREN with KGAT[2], it can be seen that LOREN achieves a significant improvement at the same model scale.
Although DREAM[3] and LOREN employ different retrieval strategies, LOREN's improvement on the final metrics still demonstrates the advantage of the framework. LisT5[4] is significantly better on the test set than the other models because of its much more powerful pre-trained model (T5-3B, roughly ten times larger than RoBERTa-large).
Table 1: Overall performance on the FEVER dataset
Phrase-level verification
LOREN's greatest advantage is its ability to verify at the phrase level, a capability introduced by the logical constraint term, so the authors examined LOREN's performance under different settings of this term, as shown in Table 2.
The results show that the explanations learned through the LOREN framework are both accurate and faithful. Specifically, one metric measures the accuracy of the final result obtained by logically aggregating the phrase-level labels, and the other measures the agreement between the aggregated result and the model's final prediction. It can be seen that once the logical constraints are introduced, aggregation with probabilistic soft logic is generally superior to aggregation with discrete logic. In particular, when the constraint weight is zero, the learning of phrase-level factual correctness is not logically constrained at all, so these intermediate results lose their meaning and interpretability.
Table 2: The effect of logical constraints on model performance
Case study
Figure 3: Case study
Figure 3 shows some of LOREN's verification results. In the first example, LOREN correctly finds the wrong phrase "number three" in the given claim and corrects it to "number one", and based on the local verification results it correctly gives the final verdict.
However, LOREN can also make mistakes in scenarios that lack sufficient evidence. In example 2, the evidence only mentions that "Ashley Cole" was born in "England" and says nothing about the relationship between "England" and "Iranian", so at best NEI can be concluded, yet LOREN gave an incorrect label. Example 3 shows that LOREN is able to detect claims that contain multiple errors.
Summary
This paper proposes LOREN, an interpretable fact-verification algorithm based on phrase-level decomposition. By using MRC to find verification information for the decomposed phrases, and by constraining phrase correctness through learned logical aggregation, the black-box model acquires explanations that are both accurate and faithful.
At the same time, results on the fact-verification benchmark FEVER show that LOREN achieves better results than models of the same scale.
Of course, LOREN still has many open problems, such as commonsense reasoning ability, stronger evidence retrieval, and more general claim decomposition.
LOREN is a first attempt at explainable reasoning in fact verification, and the authors hope that future research will further push models to be right for the right reasons.
About the author
The first author is Chen Jiangjie, a third-year PhD student at Fudan University and a member of the Knowledge Works Laboratory at Fudan. His main research interests are natural language reasoning and generation.
References:
[1] Jiangjie Chen, Qiaoben Bao, Changzhi Sun, Xinbo Zhang, Hao Zhou, Jiaze Chen, Yanghua Xiao, and Lei Li. "LOREN: Logic Enhanced Neural Reasoning for Fact Verification." AAAI 2022 (pre-print).
[2] Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. "Fine-grained Fact Verification with Kernel Graph Attention Network." ACL 2020.
[3] Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. "Reasoning over Semantic-level Graph for Fact Checking." ACL 2020.
[4] Kelvin Jiang, Ronak Pradeep, and Jimmy Lin. "Exploring Listwise Evidence Reasoning with T5 for Fact Verification." ACL 2021.