Reporting by XinZhiyuan
Author: Chen Jiangjie
Editor: Sleepy
The development of Internet technology has allowed information to spread rapidly, and the amount of information we receive each day has grown almost explosively. This sheer volume of input makes it difficult for us to rely on our own limited knowledge to judge whether these messages are true or false, especially during important events such as the COVID-19 pandemic. We therefore need automated fact-verification algorithms that leverage reliable sources of information, such as Wikipedia, to determine whether a given statement is trustworthy.
Fact-verification algorithms are designed to leverage existing knowledge bases to verify the factual correctness of text.
At present, fact-verification methods usually break the problem into two steps: retrieval and verification.
In the retrieval phase, a retrieval model fetches relevant text from the knowledge base for the given statement (claim) to serve as evidence for the final verdict; in the verification phase, a verification model reasons over the retrieved evidence to derive the final prediction.
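The two-step pipeline can be made concrete with a few lines of code. The retriever and verifier below are deliberately naive stand-ins (word overlap and a substring check, not trained models); the sketch only illustrates the interface of each stage:

```python
# Toy retrieve-then-verify pipeline. Both stages are naive stand-ins;
# real systems use trained neural models for each stage.

def retrieve(claim, knowledge_base, k=2):
    """Rank knowledge-base sentences by word overlap with the claim."""
    claim_words = set(claim.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda s: len(claim_words & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

def verify(claim, evidence):
    """Stub verifier: SUPPORTS if every claim word appears in the evidence."""
    text = " ".join(evidence).lower()
    if all(w in text for w in claim.lower().split()):
        return "SUPPORTS"
    return "NOT ENOUGH INFO"

kb = ["Kung Fu Panda premiered in the United States on June 6, 2008.",
      "Paris is the capital of France."]
evidence = retrieve("Kung Fu Panda premiered in 2008", kb)
print(verify("Kung Fu Panda premiered in 2008", evidence))  # SUPPORTS
```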
However, most existing models only output the final classification result and offer no explanation of why a claim is judged correct or not, making it hard to know why the model made a given prediction. This is a serious obstacle to building trustworthy AI applications.
To address the interpretability problem in fact verification, a team from the ByteDance Artificial Intelligence Lab and Fudan University proposed LOREN, a new interpretable fact-verification paradigm: decomposing the verification of an entire claim into phrase-level verification.
Paper: https://arxiv.org/abs/2012.13577
Code: https://github.com/jiangjiechen/LOREN
Under this paradigm, the model gives fine-grained verification results for the entire claim, which helps users understand the model's reasoning more intuitively and also makes it faster to locate factual errors.
Interpretable verification
Problem modeling
LOREN's main idea is to decompose sentence-level verification into phrase-level verification.
Figure 2: LOREN framework
For a given claim $c$ and evidence set $E$, which together form the input $x = (c, E)$, the model needs to produce the final prediction $\hat{y}$ and, at the same time, verification results $z_1, \dots, z_n$ for all the phrases $w_1, \dots, w_n$ in the claim, where $\hat{y}, z_i \in \{\text{SUP}, \text{REF}, \text{NEI}\}$ indicate Supports, Refutes, and Not Enough Information, respectively.

Define the hidden variable $z = (z_1, \dots, z_n)$ as the prediction results for all the phrases. The final prediction obviously depends on each phrase's result, so in probability it can be expressed as:

$$p(y \mid x) = \sum_{z} p(y \mid z, x)\, p(z \mid x).$$

Given the input data $x$ and the corresponding label $y$, the optimization objective of the entire model is:

$$\max_\theta \ \log p_\theta(y \mid x) = \log \sum_{z} p_\theta(y \mid z, x)\, p_\theta(z \mid x).$$
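For a small number of phrases, this marginalization can be computed exactly by brute-force enumeration. A minimal sketch with two phrases, where the prior and the conditional are toy values assumed purely for illustration (the conditional here is a deterministic aggregation of the phrase labels):

```python
from itertools import product

LABELS = ("SUP", "REF", "NEI")

# Toy factorized prior p(z_i | x) for a claim with two phrases.
p_z = [{"SUP": 0.7, "REF": 0.2, "NEI": 0.1},
       {"SUP": 0.6, "REF": 0.3, "NEI": 0.1}]

def p_y_given_z(y, z):
    """Toy conditional p(y | z, x): a deterministic aggregation of the
    phrase labels, assumed here purely for illustration."""
    if "REF" in z:
        agg = "REF"
    elif all(zi == "SUP" for zi in z):
        agg = "SUP"
    else:
        agg = "NEI"
    return 1.0 if y == agg else 0.0

def marginal(y):
    """p(y | x) = sum over z of p(y | z, x) * prod_i p(z_i | x)."""
    total = 0.0
    for z in product(LABELS, repeat=len(p_z)):
        prior = 1.0
        for zi, dist in zip(z, p_z):
            prior *= dist[zi]
        total += p_y_given_z(y, z) * prior
    return total

print({y: round(marginal(y), 2) for y in LABELS})
# {'SUP': 0.42, 'REF': 0.44, 'NEI': 0.14}
```

Enumeration is exponential in the number of phrases, which is exactly why the paper resorts to variational inference rather than exact marginalization.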
One way to solve this is the EM algorithm; however, the true posterior distribution $p_\theta(z \mid x, y)$ is intractable. Therefore, using variational inference, a variational posterior distribution $q_\phi(z \mid x, y)$ is introduced, which turns the problem into optimizing the corresponding variational lower-bound objective, the negative Evidence Lower BOund (ELBO):

$$\mathcal{L}_{\text{ELBO}} = -\,\mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid z, x)\big] + \mathrm{KL}\big(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\big),$$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ is the KL divergence.
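For small cases the negative ELBO can again be evaluated exactly by enumeration, which is a useful sanity check. The distributions and the log-likelihood function below are toy values, not numbers from the paper:

```python
import math
from itertools import product

LABELS = ("SUP", "REF", "NEI")

# Toy factorized distributions for two phrases: variational posterior q
# and NLI-based prior p (all numbers are made up for illustration).
q_z = [{"SUP": 0.8, "REF": 0.1, "NEI": 0.1},
       {"SUP": 0.2, "REF": 0.7, "NEI": 0.1}]
p_z = [{"SUP": 0.6, "REF": 0.2, "NEI": 0.2},
       {"SUP": 0.3, "REF": 0.5, "NEI": 0.2}]

def kl_categorical(q, p):
    """KL(q || p) for one categorical distribution."""
    return sum(q[l] * math.log(q[l] / p[l]) for l in LABELS if q[l] > 0)

def neg_elbo(log_lik):
    """-E_q[log p(y|z,x)] + KL(q || p). The expectation is taken by
    enumerating all joint phrase-label assignments; the KL term
    decomposes per phrase because both q and p factorize."""
    expected = 0.0
    for z in product(LABELS, repeat=len(q_z)):
        weight = 1.0
        for zi, dist in zip(z, q_z):
            weight *= dist[zi]
        expected += weight * log_lik(z)
    kl = sum(kl_categorical(q, p) for q, p in zip(q_z, p_z))
    return -expected + kl

# Toy log-likelihood for a gold label REF: high if some phrase is REF.
loglik = lambda z: math.log(0.9 if "REF" in z else 0.1)
print(round(neg_elbo(loglik), 3))  # 0.875
```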
To obtain the prior distribution over phrase verification results, $p_\theta(z \mid x)$, the authors draw on work in Natural Language Inference (NLI), matching the NLI labels (entailment, contradiction, neutral) to SUP, REF, and NEI, respectively. With a model pre-trained on NLI data, the prior distribution can then be computed.
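A minimal sketch of such a prior, assuming the NLI model exposes raw scores for its three labels (the score values here are toy numbers):

```python
import math

# Illustrative mapping from NLI labels to verification labels; softmax
# turns the raw NLI scores into a prior p(z_i | x).
NLI_TO_FACT = {"entailment": "SUP", "contradiction": "REF", "neutral": "NEI"}

def prior_from_nli(nli_scores):
    """Turn raw NLI scores into a prior distribution over SUP/REF/NEI."""
    m = max(nli_scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in nli_scores.items()}
    total = sum(exps.values())
    return {NLI_TO_FACT[k]: v / total for k, v in exps.items()}

prior = prior_from_nli({"entailment": 2.0, "contradiction": 0.5, "neutral": 0.0})
print(max(prior, key=prior.get))  # SUP
```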
Logical constraints
The biggest challenge in this work is that the available data does not support learning at phrase granularity: there are no (and realistically cannot be) ground-truth labels for the factual correctness of individual phrases.
To address this problem, the authors propose to exploit a set of logical aggregation rules that exist naturally in fact verification to provide weak supervision signals for learning $q_\phi$, which in effect translates them into logical constraints between the final label and the phrase-level labels.
The following logical rules can be observed:
If a statement is inconsistent with facts (REF), then there is at least one phrase in it that does not correspond to the facts;
If a statement is factual (SUP), then all phrases in it should be true;
If a statement is unverifiable (NEI), then there should be no untrue phrases, and at least one of the phrases is unverifiable.
These logical rules can be formally expressed as:

$$y = \text{REF} \;\leftrightarrow\; \exists\, i:\ z_i = \text{REF}$$
$$y = \text{SUP} \;\leftrightarrow\; \forall\, i:\ z_i = \text{SUP}$$
$$y = \text{NEI} \;\leftrightarrow\; (\forall\, i:\ z_i \neq \text{REF}) \,\wedge\, (\exists\, i:\ z_i = \text{NEI})$$

where each $z_i$ represents the verification result of the phrase $w_i$. These rules are then softened with probability: label probabilities are treated as soft truth values, and the quantifiers are approximated by aggregating (e.g. taking products of) the phrase distributions $q_\phi(z_i)$. The label distribution $\hat{q}(y)$ obtained through this probabilistic aggregation encodes the logical knowledge described above. The authors use it as a teacher model to guide $q_\phi$, i.e. logical knowledge distillation:

$$\mathcal{L}_{\text{logic}} = \mathrm{KL}\big(\hat{q}(y)\,\|\,p_\theta(y \mid z, x)\big).$$
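The softened aggregation and the distillation loss can be sketched as follows, under an independence assumption across phrases (all numbers are illustrative, not from the paper):

```python
import math

def soft_aggregate(q_phrases):
    """Soften the three rules, assuming phrase labels are independent:
    SUP (all phrases SUP)   -> product of q_i(SUP)
    REF (at least one REF)  -> 1 - product of (1 - q_i(REF))
    NEI (no REF, some NEI)  -> the remaining probability mass."""
    all_sup = math.prod(q["SUP"] for q in q_phrases)
    no_ref = math.prod(1.0 - q["REF"] for q in q_phrases)
    return {"SUP": all_sup, "REF": 1.0 - no_ref, "NEI": no_ref - all_sup}

def kl(p, q):
    """KL(p || q): the distillation loss between the aggregated
    (teacher) label distribution and the model's final prediction."""
    return sum(p[l] * math.log(p[l] / q[l]) for l in p if p[l] > 0)

q_phrases = [{"SUP": 0.9, "REF": 0.05, "NEI": 0.05},
             {"SUP": 0.3, "REF": 0.6, "NEI": 0.1}]
teacher = soft_aggregate(q_phrases)  # ≈ {'SUP': 0.27, 'REF': 0.62, 'NEI': 0.11}
model_pred = {"SUP": 0.2, "REF": 0.7, "NEI": 0.1}
print(round(kl(teacher, model_pred), 4))
```

Note that the three aggregated probabilities sum to one by construction, so the teacher is a valid distribution over the final label.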
Ultimately, the optimization objective of the model consists of these two parts: $\mathcal{L} = \mathcal{L}_{\text{ELBO}} + \mathcal{L}_{\text{logic}}$.
Construct local premises
In order to achieve the phrase-level verification described above, two issues need to be solved:
Find the phrases in the claim that need to be verified;
Find enough information in the knowledge base to verify these phrases.
Both can be done offline, before the verification model above is trained.
For the first question, the authors used existing NLP parsing tools to identify named entities, noun phrases, verb phrases, and adjective phrases in a given statement. For example, given the statement "Kung Fu Panda was released in 2016.", we can split it into "Kung Fu Panda" (named entity), "released" (verb phrase), and "2016" (noun phrase).
For the second issue, the authors modeled it as a machine reading comprehension (MRC) task. Given a claim and a phrase, a probing question is constructed for that phrase, such as the cloze form "Kung Fu Panda was released in [MASK]." or the natural question "When was Kung Fu Panda released?", and an MRC model is used to extract the corresponding fact from the evidence set. For example, given the evidence "Kung Fu Panda premiered in the United States on June 6, 2008.", we hope the model answers "2008".
By backfilling this answer into the corresponding position of the claim, a local premise for the phrase is obtained, such as "Kung Fu Panda was released in 2008.". Specifically, the authors construct training data from the existing dataset in a self-supervised manner and use it to train this generative MRC model.
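The probe construction and backfilling steps above can be illustrated with plain string operations (a real system generates the questions and answers with trained models; this sketch only shows the input/output shapes):

```python
# Plain-string illustration of the cloze probe and the backfilling step.

def make_cloze(claim, phrase):
    """Mask out the phrase to be verified, producing the cloze probe."""
    return claim.replace(phrase, "[MASK]")

def backfill(claim, phrase, answer):
    """Replace the probed phrase with the MRC answer -> a local premise."""
    return claim.replace(phrase, answer)

claim = "Kung Fu Panda was released in 2016."
print(make_cloze(claim, "2016"))        # Kung Fu Panda was released in [MASK].
print(backfill(claim, "2016", "2008"))  # Kung Fu Panda was released in 2008.
```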
Fact-checking
Having obtained the local premises of the claim, the two distributions $q_\phi(z \mid x, y)$ and $p_\theta(y \mid z, x)$ can be parameterized with neural networks for the final fact verification.
A pre-trained language model is used to encode the local information (the claim concatenated with each local premise) and the global information (the claim concatenated with the evidence set), yielding local and global representations. On top of these, each of the two distributions is built with a fully connected network:
$q_\phi(z \mid x, y)$ receives the vector representation of the label $y$ together with the global and local information as input, and outputs the predicted probability distribution over $z$;
$p_\theta(y \mid z, x)$ receives the hidden variables $z$ together with the global and local information as input, and outputs the predicted probability distribution over $y$.
In the prediction phase, the variables are randomly initialized and then decoded iteratively until convergence, which makes it possible to predict the final label while performing fine-grained verification of the different phrases in the given claim.
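The iterative decoding loop can be sketched as a simple fixed-point iteration. The update rule below is a stub standing in for one pass of the trained model, used only so the loop demonstrably converges:

```python
import random

def iterative_decode(num_phrases, update, max_iters=50, seed=0):
    """Sketch of the prediction loop: randomly initialize the phrase
    labels z, then re-decode until they stop changing. `update` stands
    in for one pass of the trained model."""
    rng = random.Random(seed)
    z = [rng.choice(["SUP", "REF", "NEI"]) for _ in range(num_phrases)]
    for _ in range(max_iters):
        new_z = update(z)
        if new_z == z:  # converged
            break
        z = new_z
    return z

# Stub update rule, for illustration only: it flips one non-SUP label
# per pass, so the loop converges to all-SUP within num_phrases steps.
def toy_update(z):
    for i, zi in enumerate(z):
        if zi != "SUP":
            return z[:i] + ["SUP"] + z[i + 1:]
    return z

print(iterative_decode(3, toy_update))  # ['SUP', 'SUP', 'SUP']
```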
Main experimental results
The authors conducted experiments on the fact-verification dataset FEVER, using the official Label Accuracy and FEVER score as evaluation metrics; the overall results are shown in Table 1. Comparing LOREN with KGAT[2], it can be seen that LOREN achieves a significant improvement at the same model scale.
Although DREAM[3] and LOREN employ different retrieval strategies, LOREN's improvement on the final metrics still demonstrates the advantage of the framework. LisT5[4] is significantly better on the test set than the other models because of its much more powerful pre-trained model (T5-3B, roughly ten times larger than RoBERTa-large).
Table 1: Overall performance on the FEVER dataset
Phrase-level verification
LOREN's greatest advantage is its ability to verify at the phrase level, a capability introduced by the logical constraint term, so the authors examined LOREN's performance under different settings of this term, as shown in Table 2.
The results show that the explanations learned through the LOREN framework are both accurate and faithful. Specifically, one metric measures the accuracy of the final result obtained by logically aggregating the phrase-level labels, and the other measures the agreement between the aggregated result and the model's final prediction. It can be seen that once the logical constraints are introduced, aggregation with probabilistic soft logic is generally superior to aggregation with discrete logic. In particular, when the constraint weight is zero, the learning of phrase-level factual correctness is not logically constrained at all, so these intermediate results lose their meaning and interpretability.
Table 2: The effect of logical constraints on model performance
Case study
Figure 3: Case study
Figure 3 shows some of LOREN's verification results. In the first example, LOREN correctly finds the wrong phrase "number three" in the given claim and corrects it to "number one", and based on the local verification results it correctly gives the final verdict.
However, LOREN can also make mistakes in scenarios that lack sufficient evidence. In example 2, the evidence only mentions that "Ashley Cole" was born in "England" and says nothing about the relationship between "England" and "Iranian", so at best NEI can be concluded, yet LOREN gave an incorrect label. Example 3 shows that LOREN is able to detect claims that contain multiple errors.
Summary
This paper proposes LOREN, an interpretable fact-verification algorithm based on phrase-level decomposition. By using MRC to find verification information for the decomposed phrases, and by constraining phrase correctness through learned logical aggregation, the black-box model acquires explanations that are both accurate and faithful.
At the same time, results on the fact-verification benchmark FEVER show that LOREN achieves better results than models of the same scale.
Of course, LOREN still has many open problems, such as commonsense reasoning ability, stronger evidence retrieval, and more general claim decomposition.
LOREN is a first attempt at explainable reasoning in fact verification, and the authors hope that future research will further push models to be right for the right reasons.
About the author
The first author is Chen Jiangjie, a third-year PhD student at Fudan University and a member of the Knowledge Works Laboratory at Fudan. His main research interests are natural language reasoning and generation.
References:
[1] Jiangjie Chen, Qiaoben Bao, Changzhi Sun, Xinbo Zhang, Hao Zhou, Jiaze Chen, Yanghua Xiao, and Lei Li. "LOREN: Logic Enhanced Neural Reasoning for Fact Verification." AAAI 2022 (pre-print).
[2] Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. "Fine-grained Fact Verification with Kernel Graph Attention Network." ACL 2020.
[3] Wanjun Zhong, Jingjing Xu, Duyu Tang, Zenan Xu, Nan Duan, Ming Zhou, Jiahai Wang, and Jian Yin. "Reasoning over Semantic-level Graph for Fact Checking." ACL 2020.
[4] Kelvin Jiang, Ronak Pradeep, and Jimmy Lin. "Exploring Listwise Evidence Reasoning with T5 for Fact Verification." ACL 2021.