
ACL 2022 | Fudan, ByteDance, and others release the first interpretable analogical reasoning dataset, bilingual in Chinese and English

Heart of the Machine column

Authors: Chen Jiangjie, Xu Rui

Researchers from Fudan University, the ByteDance AI Lab, and other institutions have proposed E-KAR, the first interpretable, knowledge-intensive analogical reasoning dataset. The work has been accepted to Findings of ACL 2022.

Analogy occupies an important place in human cognition: through analogies we gain new insights and ground everyday reasoning. For example, when a teacher in the classroom compares the structure of the Earth to a hard-boiled egg, students can quickly grasp knowledge they cannot experience first-hand. Because of its unique value in many fields, analogy has become an important topic in artificial intelligence research.

In NLP, the most familiar form of this problem is word analogy recognition posed as multiple-choice questions. However, existing word analogy datasets focus on simple binary relations and lack annotations of the underlying reasoning process. As a result, solving such problems does not reveal how neural models actually carry out analogical reasoning, which hinders the study of the nature of analogy [5]. A more difficult, explainable analogical reasoning dataset is urgently needed.

This article introduces E-KAR, recent work by researchers from Fudan University, the ByteDance AI Lab, and other institutions, which has been accepted to Findings of ACL 2022. E-KAR is the first interpretable, knowledge-intensive analogical reasoning dataset. It consists of 1,655 Chinese and 1,251 English questions drawn from the Chinese Civil Service Examination, and it defines two benchmark tasks for analogical reasoning to teach and evaluate a model's ability to reason by analogy.


Paper link: https://arxiv.org/abs/2203.08480

Project homepage: https://ekar-leaderboard.github.io

Research background

Simple analogies

The figure below shows an example from the BATS dataset [3]. The query is "Newton" is to "British", and the options are "Marx" is to "German", "Confucius" is to "Russian", "Caesar" is to "American", and "Plato" is to "Canadian"; the task is to choose the option whose word pair stands in the same relation as the query pair.


Figure 1 An example from the BATS dataset

An effective way to solve such simple analogy problems is to use static word embeddings such as Word2Vec [2], as captured by the equation we are all familiar with:


Figure 2 The famous word-embedding equation (king - man + woman ≈ queen)

This type of approach assumes that the relationship between two words can be estimated by vector arithmetic over their embeddings, a property known as linear analogy [4]. One reason the approach works is that existing analogy datasets are often designed precisely to evaluate linear-analogy properties: they are dominated by simple binary relations of a lexical, morphological, or shallowly semantic nature, such as the "Newton" / "British" pair above, which reflects the relation between a person and a nationality. Moreover, such datasets carry no explanations and therefore cannot reveal the actual, human-like process of analogical reasoning.
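To make the vector-offset idea concrete, here is a minimal sketch using gensim and its downloadable GloVe vectors; both the library choice and the vector set are our assumptions for illustration, not part of the original work.

```python
# A minimal sketch of linear analogy via vector offsets, assuming gensim and
# its downloadable "glove-wiki-gigaword-100" vectors are available.
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")  # static pre-trained word embeddings

# king - man + woman ≈ queen: add/subtract embedding vectors, then take the
# nearest neighbour by cosine similarity.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The same offset trick handles BATS-style questions such as
# "newton : british = marx : ?" by computing british - newton + marx
# (ideally retrieving "german" among the top neighbours).
print(kv.most_similar(positive=["british", "marx"], negative=["newton"], topn=3))
```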

Complex analogies

In contrast to such relatively simple linear analogies, this work focuses on more complex analogy problems that require understanding richer relations between terms. To this end, the paper proposes the E-KAR dataset, constructed with reference to authoritative books and other definitions of analogy; answering its questions requires a chain of reasoning steps and background knowledge. The figure below shows an example (readers are invited to try it):


Figure 3 An example from the E-KAR dataset

E-KAR dataset

The E-KAR dataset is the first interpretable analogical reasoning dataset, and it has three key characteristics: it is challenging, interpretable, and bilingual.

Challenging

E-KAR is challenging because it is drawn from the Chinese Civil Service Examination, a comprehensive test of candidates' critical thinking and problem-solving ability. To solve its analogy questions, one must understand the relations within the query and within each option, which requires reasoning skill and background knowledge, especially commonsense, factual, and cultural knowledge, as well as the ability to explain why a statement is false, for example, why a car is not "made of" tires (a tire is only one part of a car).

Interpretability

The second feature of E-KAR is interpretability: every question and every option comes with a human-annotated free-text explanation. But first we need to answer a question: how can analogical reasoning be made explainable?

To answer this, we first need to understand how humans reason by analogy. According to studies in cognitive psychology [1], analogical reasoning follows a structure-mapping process consisting of three steps: abduction, mapping, and verification. Take a set of data from E-KAR as an example (see Figure 4):

1. Abduction: first, hypothesize a source structure over the source domain that may also apply to the target domain. Here the source domain is the query, the target domain is each option, and the source structure is the implicit relation among the query terms; in this example, the teapot and the teacup are both containers for tea, and the teapot pours tea into the teacup;

2. Mapping: the structure is then mapped onto the target domain, i.e., the terms of each option are aligned with the source structure induced from the query;

3. Verification: finally, check the validity of the mapping and explain whether it holds. In this example, only option C ("talent : school : enterprise") satisfies the source structure of the query, because schools and enterprises are both organizations for talent, and schools send talent to enterprises.


Figure 4 Structure mapping in analogical reasoning

The study therefore writes this structure-mapping process out as natural-language text, which makes the analogical reasoning process explainable; this is the source of E-KAR's interpretability.
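As a way to make the three steps concrete, here is a schematic sketch of the procedure in code; the function bodies are placeholders standing in for whatever component (for example, a language model) performs each step, and none of this is the paper's actual implementation.

```python
# A schematic sketch of the three-step structure-mapping procedure described
# above. The three functions are placeholders; this is illustrative, not the
# paper's code.
from typing import List, Sequence, Tuple


def abduce_structure(query: Sequence[str]) -> str:
    """Abduction: induce the implicit relation among the query terms,
    e.g. 'X and Z are containers for Y; X transfers Y into Z'."""
    raise NotImplementedError


def map_structure(structure: str, option: Sequence[str]) -> str:
    """Mapping: instantiate the source structure with the option's terms."""
    raise NotImplementedError


def verify(statement: str) -> Tuple[bool, str]:
    """Verification: check whether the mapped statement holds, with an explanation."""
    raise NotImplementedError


def solve_analogy(query: Sequence[str], options: List[Sequence[str]]) -> int:
    structure = abduce_structure(query)               # step 1: abduction
    for i, option in enumerate(options):
        statement = map_structure(structure, option)  # step 2: mapping
        holds, _explanation = verify(statement)       # step 3: verification
        if holds:
            return i                                  # the option that fits the structure
    return -1                                         # no option satisfies it
```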

Bilingualism

The Chinese version of E-KAR was translated into English via machine translation followed by manual post-editing. In the English data, the researchers manually removed items with strong Chinese characteristics (idioms, allusions, etc.) for the convenience of researchers without a Chinese background; because these items carry rich Chinese cultural knowledge, they were retained in the Chinese dataset to promote the development of Chinese NLP. In the end, the dataset contains 1,655 Chinese and 1,251 English questions, accompanied by 8,275 and 6,255 natural-language explanation sentences respectively.

Task settings

The ultimate goal of E-KAR is for a model to make the right choice while producing a sound explanation. To this end, the study defines two benchmark tasks on E-KAR: question answering (QA) and explanation generation (EG):

Analogical question answering (QA): the model answers the E-KAR questions directly. The input is the query and its four options, the output is the chosen option, and performance is measured by accuracy.

Explanation generation (EG): the model generates an explanation for the query and for each candidate option. Besides standard text-generation metrics, EG is evaluated with an indirect metric: the change in QA accuracy when the generated explanations are supplied to the QA task as additional input (a sketch of this metric is given below).
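Here is a minimal sketch of that indirect metric, assuming a QA model exposed as a function and dataset records with `id`, `question`, `choices`, and `answer_index` fields; all of these interfaces are assumptions for illustration, not the paper's API.

```python
# Indirect evaluation of explanation generation: measure how QA accuracy
# changes when generated explanations are appended to each option.
# The qa_model callable and the record fields are assumed, not the paper's API.
from typing import Callable, Dict, List, Optional


def qa_accuracy(qa_model: Callable[[str, List[str]], int],
                dataset: List[Dict],
                explanations: Optional[Dict[str, List[str]]] = None) -> float:
    correct = 0
    for ex in dataset:
        options = ex["choices"]
        if explanations is not None:
            # Append the generated explanation for each option to its text.
            options = [f"{opt} [EXP] {exp}"
                       for opt, exp in zip(options, explanations[ex["id"]])]
        pred = qa_model(ex["question"], options)
        correct += int(pred == ex["answer_index"])
    return correct / len(dataset)


# Indirect EG score: the accuracy gain (or loss) the explanations bring to QA.
# delta = qa_accuracy(model, test, generated) - qa_accuracy(model, test)
```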

Experiments and conclusions

Based on E-KAR, the study ran preliminary experiments on both tasks and found the following:

1. Word embeddings and language models do not perform well on complex analogies

The study first ran the question answering (QA) task with word embeddings and pre-trained language models (BERT, RoBERTa). The results in Figure 5 show that neither static word embeddings nor state-of-the-art language models can handle E-KAR's complex, knowledge-intensive analogical reasoning.


Figure 5 Accuracy of word embeddings on E-KAR and on simple analogy datasets

For comparison, humans are able to achieve 78% accuracy, while the best-performing language model (RoBERTa large) can only achieve 50% (Figure 6).


Figure 6 Word embeddings, language models, and human accuracy over simple and complex analogies
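To illustrate how such a baseline can be framed, below is a minimal sketch that treats an E-KAR question as a standard multiple-choice problem with Hugging Face Transformers; the model name, the input formatting, and the example strings are assumptions, and the multiple-choice head would need fine-tuning on E-KAR's training split before its predictions mean anything.

```python
# A minimal multiple-choice QA sketch with a pre-trained language model.
# Input formatting is an assumption; the classification head is untrained
# until fine-tuned on E-KAR's training data.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMultipleChoice.from_pretrained("roberta-large")

question = "teapot : tea : teacup"            # illustrative query
options = ["option A", "option B", "talent : school : enterprise", "option D"]

# Encode one (question, option) pair per choice; the model scores each pair.
enc = tokenizer([question] * len(options), options,
                return_tensors="pt", padding=True, truncation=True)
batch = {k: v.unsqueeze(0) for k, v in enc.items()}   # (1, num_choices, seq_len)

with torch.no_grad():
    logits = model(**batch).logits                    # (1, num_choices)
prediction = logits.argmax(dim=-1).item()             # index of the chosen option
print(prediction)
```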

Analogy Q&A error analysis

The study performed an error analysis of the results (Figure 7) and found that most errors occur on semantic relations such as is_a, part_of, and juxtaposition_of, which typically require a great deal of commonsense and factual knowledge.


Figure 7 Error analysis of the question answering (QA) task
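A tally of this kind can be produced in a few lines; the record fields used below (a gold answer index and a human-annotated relation label per question) are assumptions for illustration.

```python
# Sketch of the error breakdown described above: count the QA model's wrong
# answers per annotated relation type (e.g. is_a, part_of, juxtaposition_of).
from collections import Counter
from typing import Dict, List, Tuple


def error_breakdown(records: List[Dict], predictions: List[int]) -> List[Tuple[str, int]]:
    errors = Counter()
    for rec, pred in zip(records, predictions):
        if pred != rec["answer_index"]:
            errors[rec["relation_type"]] += 1
    return errors.most_common()       # relation types sorted by error count
```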

2. Language models do not perform well at explainable analogical reasoning

In the explanation generation (EG) task, the model generates an explanation for each question and option, and these explanations are then fed to the question answering (QA) task; this is a key step in demonstrating interpretability. However, a series of experiments showed that the explanations generated by language models do not help the QA task.

First, using the human-annotated explanations as additional input allows the QA task to reach near-perfect accuracy. However, when they are replaced with model-generated explanations, the results are much worse (Figure 8).


Figure 8 How human-annotated versus model-generated explanations help the QA task

Explanation generation error analysis

The study also performed an error analysis of the explanation generation (EG) task (Figure 9) and found that the problems fall mainly into three categories:

1. Failing to generate negated facts;

2. Generating sentences that contradict the facts;

3. Outputs biased towards common patterns.

Among these, the study paid particular attention to the generation of negation. About 90% of the human-annotated explanations for the wrong options contain the negative word "not", whereas in the generated explanations this figure drops to about 20%. This suggests that current generative models do not know how to produce facts that are negated yet correct. Since so many explanations contain negation, the researchers also examined whether negative words affect the model's judgment: after deleting the sentences containing the negative word "not" from the test set, QA accuracy dropped only slightly. Another conclusion, therefore, is that when given human-annotated explanations, the QA model does not seem to rely heavily on negative words.
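That negation ablation can be sketched as a simple filter over the explanation sentences before they are handed to the QA model; the keyword list and the field layout are assumptions (the actual Chinese data would use negation words such as "不").

```python
# Sketch of the negation ablation described above: drop explanation sentences
# that contain a negation word, then re-measure QA accuracy with what remains.
from typing import List

NEGATION_WORDS = {"not", "no", "cannot"}   # the Chinese data would use "不", etc.


def strip_negated_sentences(explanation_sentences: List[str]) -> List[str]:
    kept = []
    for sentence in explanation_sentences:
        tokens = set(sentence.lower().replace(".", " ").split())
        if tokens & NEGATION_WORDS:
            continue                       # remove sentences containing negation
        kept.append(sentence)
    return kept
```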

Figure 9 shows an example that covers almost all of the error types above. Comparing the explanations of the query and of option A generated by the model (BART-large) against the human-annotated ones, we can see that, for the negated sentence, the model does not know that salt and sodium chloride are not composed of only one element, and that its generated explanations are biased towards the "A is B" pattern.


Figure 9 Example 2 in the E-KAR dataset

Summary

In this article, the researchers present E-KAR, a new analogical reasoning dataset that is challenging, bilingual, and interpretable, together with two benchmark tasks defined on it, question answering (QA) and explanation generation (EG), designed to teach models the ability to draw analogies. The authors hope this work will complement existing research on natural language reasoning, especially work related to analogical reasoning and interpretable NLP.

Many questions in E-KAR rely on external knowledge and require an understanding of commonsense, encyclopedic, and cultural knowledge, so injecting external knowledge to improve reasoning ability is a major direction for future work. External knowledge can be injected in the form of free text, knowledge graphs, and so on, taking the place of the explanations as part of the input, and the model can be divided into a retrieval component and a question answering component: the retrieval component searches the external knowledge base for relevant snippets and builds representations of the related knowledge, while the QA component fuses the retrieved knowledge with the original input to improve the model's reasoning ability.
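A retrieve-then-read pipeline of this kind might be organized as below; every interface here (the retriever, the QA model, and the way the retrieved text is fused with the input) is an assumption sketched for illustration rather than a proposed implementation.

```python
# Schematic retrieve-then-read pipeline: a retrieval component pulls relevant
# snippets from an external knowledge source, and a QA component reasons over
# the fused input. All interfaces are illustrative assumptions.
from typing import Callable, List


def answer_with_external_knowledge(
    question: str,
    options: List[str],
    retrieve: Callable[[str, int], List[str]],   # retrieval component
    qa_model: Callable[[str, List[str]], int],   # question-answering component
    top_k: int = 5,
) -> int:
    # 1. Retrieval: search the external knowledge base with the question and options.
    snippets = retrieve(question + " " + " ".join(options), top_k)
    # 2. Fusion: prepend the retrieved knowledge to the original question text.
    augmented_question = " ".join(snippets) + " [SEP] " + question
    # 3. Reading: let the QA component reason over the knowledge-augmented input.
    return qa_model(augmented_question, options)
```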

References

1. Gerhard Minnameier. 2010. Abduction, induction, and analogy. In Model-Based Reasoning in Science and Technology, pages 107–119. Springer.

2. Mikolov T., Chen K., Corrado G., et al. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

3. Gladkova A., Drozd A., Matsuoka S. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't. In Proceedings of the NAACL Student Research Workshop, pages 8–15.

4. Ethayarajh K., Duvenaud D., Hirst G. 2018. Towards understanding linear word analogies. arXiv preprint arXiv:1810.04882.

5. Ushio A., Espinosa-Anke L., Schockaert S., et al. 2021. BERT is to NLP what AlexNet is to CV: can pre-trained language models identify analogies? arXiv preprint arXiv:2105.04949.
