
Do large language models "read minds"?

By | 追问nextquestion

Imagine this scenario: you are sitting by the window on a bus when your friend suddenly says, "It seems a little hot today." How would you respond? Most people would open the window right away, because they implicitly understand what their friend means: he is politely asking you to open the window, not just remarking on the weather out of boredom.

Academics generally use the terms "mentalizing" or "mind-reading" to describe the ability to perceive and attribute mental states. This ability allows people to interpret and predict their own and others' behavior, and it is essential for social interaction and for adapting to complex social environments.

In the past, "mind reading" was believed to be a uniquely human ability, because only humans have a "theory of mind." This is not an inscrutable academic theory, but a set of commonsense beliefs implicit in human knowledge that describe the causal relationships between people's everyday actions, their circumstances, and the associated mental states [1]. Because people have mastered this theory of mind, they can use it to understand the indirect request in the scenario above, a friend asking to have the window opened, and respond accordingly.

Large language models (LLMs) are being used ever more widely, and generative models represented by GPT have demonstrated performance comparable to, or even better than, humans on basic cognitive tasks and on complex decision-making and reasoning tasks [2][3]. In this context, an "artificial theory of mind" realized through artificial intelligence may not be far away. Recently, a research team compared the performance of human participants and three large language models on a series of theory of mind tasks, and found that the models exhibited "mind-reading" ability no worse than that of the human participants [4]. The findings were published in Nature Human Behaviour.


▷Original paper: Strachan, et al. "Testing theory of mind in large language models and humans." Nature Human Behaviour (2024). https://doi.org/10.1038/s41562-024-01882-z

How to Quantify Theory of Mind Ability?

Theory of mind sounds rather abstract, so is there a way to quantitatively measure or evaluate the theory of mind of a person or of an artificial intelligence? Thanks to the excellent ability of generative large language models to understand and generate natural language, a variety of tests widely used to assess the theory of mind ability of human subjects can be applied to these models directly, such as understanding irony or indirect requests, inferring false beliefs, and identifying unintentional faux pas. In this study, the researchers systematically evaluated the theory of mind abilities of human subjects (total sample size 1,907) and of three generative large language models (GPT-4, GPT-3.5, and LLaMA2*) through five tests.

*Author's note: GPT-3.5 and GPT-4 are large language models developed by OpenAI that use deep learning to generate natural language text. GPT-4 was trained on a more extensive and more recent dataset than GPT-3.5, has broader knowledge coverage, more parameters, and a more complex architecture, giving it stronger language understanding and generation capabilities. LLaMA2 is a large language model developed by Meta on principles similar to those of the GPT family; the key difference considered in this study is that LLaMA2 offers a degree of open-source access, which allows researchers and developers to study and improve the model.

(1) False belief inference (False belief)

The false belief inference task assesses the test subject's ability to infer that another person's beliefs differ from their own true beliefs. Items in this type of test have a specific narrative structure: while Character A and Character B are together, Character A places an object in a hiding place (e.g., a box); after Character A leaves, Character B moves the object to a second hiding place (e.g., under the carpet); Character A then returns to look for the object. The test subject is then asked: when Character A returns, will he look for the object in the new location (where the object really is, consistent with the true state of affairs) or in the old location (where the object was originally placed, consistent with Character A's false belief)?
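As a rough illustration of how such an item could be posed to a chat model programmatically, here is a minimal Python sketch using the openai client; the story text, model identifier, and prompt wording are illustrative assumptions, not the materials or code used in the study.

```python
# Minimal sketch: posing a false-belief-style item to a chat model.
# Assumes the `openai` Python package (>=1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Illustrative item written for this sketch, not taken from the published test battery.
false_belief_item = (
    "Anna and Ben are in a room together. Anna puts a chocolate bar in a box "
    "and leaves. While she is away, Ben moves the chocolate from the box to "
    "under the carpet. Anna then comes back to get her chocolate. "
    "Where will Anna look for the chocolate first?"
)

response = client.chat.completions.create(
    model="gpt-4",  # assumed model identifier
    messages=[{"role": "user", "content": false_belief_item}],
    temperature=0,  # deterministic output makes the answer easier to code later
)
print(response.choices[0].message.content)
```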

(2) Irony comprehension (Irony)

The irony comprehension task assesses the test subject's ability to grasp the true meaning of an utterance and the speaker's true attitude (sarcasm, mockery, etc.) in a given context. In this study, participants were given short stories with or without irony and were asked to explain the relevant utterances after reading them.

(3) Faux pas recognition (Faux pas)

This task assesses the subject's ability to recognize remarks in a conversational situation that may offend someone because the speaker is unaware of certain information. In the study, participants were given several such situations and asked to read them and answer the related questions. A response is coded as correct only if all four questions are answered correctly; three of them are closely tied to theory of mind: "Did someone say something they should not have said?" (the answer is always yes), "What did they say that they should not have said?", and "Did the speaker know that this would offend the other person?" (the answer is always no).
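To make the all-four-correct coding rule concrete, here is a minimal sketch of how a single faux pas trial might be scored; the data structure and field names are hypothetical illustrations, not the authors' actual coding scheme.

```python
# Minimal sketch of the faux pas scoring rule: a trial counts as correct
# only if all four questions are answered correctly.
from dataclasses import dataclass

@dataclass
class FauxPasAnswers:
    q1_someone_said_wrong: bool     # answer to Q1; "yes" (True) is always correct
    q2_identified_remark: bool      # whether the reported remark was judged correct
    q3_comprehension_correct: bool  # whether the story comprehension answer was correct
    q4_speaker_knew: bool           # answer to Q4; "no" (False) is always correct

def score_trial(a: FauxPasAnswers) -> int:
    """Return 1 only if all four questions are answered correctly, else 0."""
    return int(
        a.q1_someone_said_wrong         # Q1 must be "yes"
        and a.q2_identified_remark      # Q2 must be judged correct
        and a.q3_comprehension_correct  # Q3 must be judged correct
        and not a.q4_speaker_knew       # Q4 must be "no"
    )

print(score_trial(FauxPasAnswers(True, True, True, False)))  # -> 1 (correct trial)
print(score_trial(FauxPasAnswers(True, True, True, True)))   # -> 0 (Q4 answered "yes")
```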

(4) Hinting task (Hint)

This task assesses the test subject's ability to understand others' indirect requests in social interaction, as in the example at the beginning of this article. In the study, subjects were presented with several situations describing everyday social interactions, each ending with a hint, and were asked to explain how they understood that final hinting sentence after reading. A correct answer points out both the literal meaning of the sentence and the intent it implies, i.e., the indirect request.

(5) Strange stories comprehension (Strange stories)

This task assesses higher-order theory of mind, such as identifying and reasoning about misdirection, lies, or misunderstandings in a situation, as well as second- or higher-order false belief inference (e.g., judging whether A knows that B holds a certain false belief). In the study, participants were presented with several seemingly strange stories and asked to explain why the characters said or did things that were literally untrue.

It is worth noting that all of the tests, with the exception of the irony comprehension test, were taken from open-access databases or published academic journals. Because these large language models process enormous amounts of text during pre-training to learn the deep structure and meaning of natural language, they may already have seen the published test items. To ensure that the models' responses involve more than reproducing training data, the researchers wrote new items for each task. These new items are logically consistent with the original ones but use different semantic content. The researchers collected the participants' responses to these tasks and coded the answer texts carefully and reliably according to an operationally defined coding scheme, which allowed the theory of mind abilities of both human subjects and large language models to be assessed quantitatively. So how do large language models perform on these tasks compared with humans?
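The original-versus-novel item comparison can be pictured as a simple data-collection loop that queries a model on matched item pairs and stores the raw replies for later manual coding. The sketch below is a hypothetical illustration (placeholder item texts, a stand-in `ask_model` function), not the authors' pipeline.

```python
# Minimal sketch: collect model responses to matched original and novel items
# and write them to a CSV file for human coders.
import csv

items = [
    {"task": "false_belief", "version": "original", "prompt": "<original item text>"},
    {"task": "false_belief", "version": "novel",    "prompt": "<newly written item text>"},
]

def collect_responses(ask_model, items, out_path="responses.csv"):
    """Query the model on every item and store the raw reply text for later coding."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["task", "version", "prompt", "response"])
        writer.writeheader()
        for item in items:
            writer.writerow({**item, "response": ask_model(item["prompt"])})

# Example run with a dummy model so the sketch executes end to end.
collect_responses(lambda prompt: "model reply goes here", items)
```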

Can large language models "read minds"?

The results chart below illustrates the performance of the human subjects and the large language models on each task and the differences between them. Figure 1A shows performance on all test items (black dots represent the sample median), while Figure 1B shows performance on the original items (dark dots) and the new items (light dots) separately.


▷Figure 1 The performance of human subjects and large language models on the theory of mind test

The results showed that GPT-3.5 performed at the same level as the human subjects on the false belief inference and hinting tasks, but showed no advantage on the other tasks. GPT-4 matched the human subjects on the false belief inference and hinting tasks, and even outperformed them on the irony comprehension and strange story comprehension tasks, but performed poorly on faux pas recognition. LLaMA2 did not perform as well as GPT-4 or the human subjects on the hinting and strange story comprehension tasks, but its performance on the faux pas recognition task was exceptionally good.

Interpretation of the results

The most interesting results appear in the faux pas recognition task, where GPT's poor performance is consistent with previous related findings [5]. Surprisingly, LLaMA2, which performed poorly on the other tasks, did well on this one, giving near-perfect answers on all but one of the test items. To explore the reasons for this result, the researchers conducted a more detailed analysis.

The general structure of the faux pas recognition test was described above; here the researchers give a more specific example. As shown in the figure, the test subject is asked to answer four questions after reading the story. The first question, "Did someone in the story say something that should not have been said?", always has the answer yes. The second question asks the test subject to report who said what should not have been said. The third question is a comprehension question about the content of the story. The fourth is the key question, concerning the speaker's state of mind when uttering the faux pas; in this example it is "Did Lisa know the curtains were new?", and the answer is always no. Only if all four questions are answered correctly is the trial coded as a correct response.


▷Figure 2 Example of a story in the faux pas recognition test, translated into Chinese by the author.

A closer examination of GPT's responses showed that both GPT-4 and GPT-3.5 correctly indicated that the victim would feel offended, and sometimes even provided extra detail on why the remark was offensive. But when asked to infer the speaker's state of mind at the time of the offensive remark (e.g., "Did Lisa know the curtains were new?"), they could not answer correctly. As shown in Figure 3, in most cases GPT answered that the story does not provide enough information to be sure.


▷Figure 3 Example of GPT's responses in the faux pas recognition test, translated into Chinese by the author.

In a subsequent analysis, the researchers re-posed the question to GPT as a probability estimate: instead of asking directly "Did Lisa know the curtains were new?", they asked "Is it more likely that Lisa knew, or that she did not know, that the curtains were new?". As shown in Figure 4, both GPT-3.5 and GPT-4 demonstrated a strong ability to understand others' mental states on this version of the test. From this, the researchers inferred that GPT adopts a hyperconservative answering strategy: it can successfully reason about the speaker's state of mind, but it is unwilling to commit to a confident judgment when it deems the information insufficient.


▷Figure 4 Schematic diagram of GPT's response to the probability estimation question in the faux pas recognition test
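The difference between the two framings can be made concrete with a small sketch; the story and question wording below are paraphrased for illustration and are not the exact prompts used in the study.

```python
# Minimal sketch contrasting the direct question with the probability-estimate framing.
story = (
    "Jill has just moved into a new house and hung new curtains. Her friend "
    "Lisa visits and says: 'Those curtains are horrible, I hope you will get "
    "some new ones.'"
)

direct_question = "Did Lisa know that the curtains were new?"

probability_question = (
    "Is it more likely that Lisa knew, or that she did not know, "
    "that the curtains were new?"
)

def build_prompt(story: str, question: str) -> str:
    """Combine a faux pas story with one of the two question framings."""
    return f"{story}\n\n{question}"

print(build_prompt(story, direct_question))
print(build_prompt(story, probability_question))
```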

Having understood why GPT did not perform well on the original faux pas recognition test, the researchers went on to ask why LLaMA2 alone performed so well on it. They suspected that when the model answers "no", it may not be because it actually knows the answer is "no", but because it is indiscriminate, that is, it tends to answer "no" regardless of the situation.

To test this hypothesis, the researchers devised a variant of the original faux pas recognition task, adding to each story either a cue suggesting that the speaker may have known the relevant information, or a neutral sentence. If a model can genuinely infer the character's mental state, it should respond differently to the different story types; otherwise its "no" answers merely reflect an indiscriminate bias. As shown in Figure 5, both GPT and the human subjects were able to distinguish between the conditions, while LLaMA2 was not. This confirmed the researchers' suspicion that LLaMA2 was not actually making correct judgments about the characters' mental states in the original task.


▷Figure 5 Schematic diagram of human and large language models' responses to the faux pas recognition variant task
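The logic of this check can be sketched as follows: if a model simply answers "no" regardless of the story version, its answer profile will be identical across conditions; if it genuinely tracks the speaker's mental state, the profiles will differ. The condition names and response data below are invented for illustration only.

```python
# Minimal sketch: compare answer distributions across story conditions to
# detect a blanket "no" bias versus condition-sensitive responding.
from collections import Counter

def answer_profile(responses: dict) -> dict:
    """Count answers ("yes"/"no"/"unsure") separately for each story condition."""
    return {condition: Counter(answers) for condition, answers in responses.items()}

# Hypothetical response patterns, invented for illustration.
blanket_no_responder = {
    "knowledge_implied": ["no", "no", "no", "no"],
    "neutral":           ["no", "no", "no", "no"],
}
condition_sensitive_responder = {
    "knowledge_implied": ["yes", "yes", "no", "yes"],
    "neutral":           ["no", "no", "unsure", "no"],
}

print(answer_profile(blanket_no_responder))          # identical across conditions -> bias
print(answer_profile(condition_sensitive_responder)) # differs across conditions -> inference
```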

Overall, GPT-4 demonstrated theory of mind ability equal to or even better than that of the human subjects on every test. On the faux pas recognition task, GPT's poor performance stems from an overly conservative answering strategy, while LLaMA2's excellent performance may be illusory.

Epilogue

This study systematically evaluated and compared the performance of humans and large language models on tests related to theory of mind, and found that large language models are sometimes no worse than humans at inferring others' mental states. Through variants of the tasks, it further probed the possible mechanisms behind the models' performance. This demonstrates the potential of using artificial intelligence to study and understand the human mind. So, can we conclude that large language models can also "read minds"?

Some researchers have pointed out that although large language models are designed to simulate human-like responses, this does not mean the analogy extends to the underlying cognitive processes that produce those responses [6]. After all, human cognition is not purely language-based; it is embodied and embedded in the environment. The difficulties people face in inferring others' mental states may stem from their subjective experience and socio-cultural environment, which large language models do not share. In other words, although large language models are excellent at simulating the human mind, we cannot fully understand human cognition through them.

In addition, we need to think further about how large language models come to exhibit human-like behavior. In this study, although GPT and the human subjects reached similar results when inferring the protagonist's mental state in the faux pas recognition task, they responded very differently, with GPT making extremely conservative decisions. These results point to a distinction between competence and behavioral performance.

The researchers note one direction for future research: when large language models interact with humans in real time, what impact do their non-human-like behavioral decisions have on their human conversation partners? For example, a noncommittal response from GPT caused by its conservatism may provoke negative emotions in a human partner, but it may also spark curiosity about the question at hand. Understanding how the presence (or absence) of mental inference in large language models affects human social cognition during dynamically unfolding social interactions is a challenge for the future.

