
GPT-4 passes the Turing test with a 54% win rate! UCSD's New Work: Humans Can't Recognize GPT-4

Author: New Zhiyuan

Editor: Peach Yongyong

GPT-4 has passed the Turing test! A UCSD research team has found empirically that humans cannot reliably distinguish GPT-4 from other humans: 54% of the time, it is judged to be human.

Can GPT-4 pass the Turing test?

Whenever a powerful new model appears, people often turn to the Turing test to gauge the intelligence of the LLM.

Recently, researchers from the Department of Cognitive Sciences at UCSD found that:

In the Turing test, one simply can't tell GPT-4 apart from humans!

Address: https://arxiv.org/pdf/2405.08007

In the Turing test, GPT-4 is judged to be human in 54% of cases.

According to the experimental results, this is the first time a system has been empirically shown to pass an "interactive" two-person Turing test.

Researcher Cameron R. Jones recruited 500 volunteers, who were divided into five roles: four groups of evaluators, who conversed with GPT-4, GPT-3.5, ELIZA, or a human respectively, and a fifth group who "played" the human witness themselves, hidden on the other side of the screen and waiting to be discovered by the evaluators.

Here is an excerpt from the games; can you tell which dialogue is with a human?

Figure 1: Partial conversation between a human interrogator (green) and a witness (gray).

In fact, only one of these four conversations is with a human witness; the rest are conversations with artificial intelligence.

Controlled "Turing Test" started for the first time

Over the past 74 years, there have been many attempts at the Turing test, but very few have been conducted in controlled experiments.

The prestigious Loebner Prize was an annual competition, yet from its inception in 1990 until 2020, no system ever passed the test.

A recent large-scale study found that humans were 60% accurate at identifying a series of LLMs in a 2-minute online conversation.

To date, no machine had passed the test in a controlled experimental demonstration.

To test whether people can see through the deception of current AI systems, the researchers used GPT-4 to conduct a randomized, controlled two-person Turing test.

The test setup is simple: a human participant holds a 5-minute conversation with either a human or an AI, and then judges whether the interlocutor is human.

As mentioned at the beginning, the researchers evaluated a total of three models – GPT-4, GPT-3.5, and ELIZA.

For the two LLMs, the researchers adapted the best-performing prompts from earlier exploratory studies of GPT-4 and GPT-3.5, which instruct the model how to respond to messages.

According to the researchers, one of the core challenges in interpreting the Turing test is that the system may perform well for two reasons:

- One is that it imitates humans brilliantly

- The other is that the judges are too credulous

The latter is often referred to as the ELIZA effect, in which participants humanize even simple systems.

The ELIZA system is all too familiar to everyone.

It was a simple rule-based chatbot from the 1960s, yet users found themselves attributing human traits to it all the same.

Therefore, the researchers used ELIZA as a baseline in the experiment: on the one hand to empirically quantify the ELIZA effect, and on the other to serve as a manipulation check on the experiment itself.

So, what are the results of this Turing test?

GPT-4 pass rate: 54%

Let's start with an important metric: the pass rate.

The researchers define the pass rate as the proportion of games in which the evaluator judged the witness to be human.

GPT-4 achieved a pass rate of 54%, higher than GPT-3.5 (50%) and the ELIZA baseline (22%), but lower than that of human subjects (67%).

Pass rates (left) and evaluator confidence (right). Each dot on the right represents one game

This result is not surprising and is largely in line with what was envisioned at the beginning of the study.

Prior to the experiment, the researchers made several hypotheses: first, that humans would be no better than chance (50%) at detecting GPT-4.

To test this, the researchers fit an intercept-only logistic regression model to predict the pass rate.

The intercept did not differ significantly from 0 (z=0.14, p=0.482), indicating that GPT-4's pass rate is approximately 50%. A Bayesian model estimates a 95% credible interval of -0.24 to 0.55 for the GPT-4 intercept (equivalent to a pass rate of 44%-63%).

This means that when faced with GPT-4, the probability of a human being successfully distinguishing is similar to that of a coin toss!
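
To make that conversion concrete, here is a minimal sketch (not the authors' code) of how an intercept-only logistic regression's intercept, expressed in log-odds, maps to a pass rate; the endpoints -0.24 and 0.55 are the interval reported above, and everything else is illustrative.

```python
# Minimal sketch (not the authors' code): converting a logistic-regression
# intercept (log-odds) into a pass rate via the logistic function.
import math

def log_odds_to_rate(b0: float) -> float:
    """Map an intercept in log-odds to a probability of being judged human."""
    return 1.0 / (1.0 + math.exp(-b0))

# Endpoints of the reported 95% interval, plus 0 (i.e. chance level)
for b0 in (-0.24, 0.0, 0.55):
    print(f"intercept {b0:+.2f} -> pass rate {log_odds_to_rate(b0):.0%}")
# intercept -0.24 -> pass rate 44%
# intercept +0.00 -> pass rate 50%
# intercept +0.55 -> pass rate 63%
```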

The results also confirmed the second and third hypotheses of the researchers: that the pass rate of humans was significantly higher than that of GPT-4 (z=2.42, p=0.017), while that of GPT-4 was higher than that of ELIZA (z=4.11, p<0.001).

However, contrary to the researchers' hypothesis, GPT-4 and GPT-3.5 performed almost identically (z=0.497, p=0.619).

When it came to "impersonating a human", the two were essentially tied.

Also surprisingly, participants who claimed to have a better understanding of how LLMs worked were not more successful in identifying LLMs (z = -0.023, p = 0.869).

In exploratory analysis, the investigators found that age had a negative effect on questioner accuracy (z = -2.24, p = 0.025), but other demographic variables (including gender, education level, and frequency of interaction with chatbots) had no significant effect.

Why is this the case? Let's analyze the reasons behind it.

To understand what actually influences evaluators' decisions, the researchers categorized the strategies the evaluators used and the justifications they gave for their verdicts.

Thirty-six percent of the evaluators made small talk with the witnesses, asking about personal details or daily activities.

The second and third most common categories of strategies were social and emotional (25%) – for example, asking witnesses about their opinions and experiences, or probing whether they had a sense of humor.

Others focused on knowledge and reasoning (13%), including questions about mathematics, logic, and general knowledge.

Interrogation strategy (left) and average accuracy (right)

To compare the effectiveness of each strategy, the researchers calculated the average accuracy of using these strategies.

When evaluators asked about human experience (75%), logic and mathematics (78%), or directly "accused" the other side of being an AI (80%), they usually came closer to the "correct answer".
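
As a rough illustration of how such per-strategy accuracies can be computed, here is a hypothetical sketch; the column names and example rows are invented for illustration and are not the study's data.

```python
# Hypothetical sketch: average evaluator accuracy per interrogation strategy.
# The column names and rows below are invented, not the study's data.
import pandas as pd

games = pd.DataFrame({
    "strategy": ["small talk", "human experience", "logic & math",
                 "accusation", "human experience", "small talk"],
    "correct":  [0, 1, 1, 1, 0, 1],   # 1 = the evaluator's verdict was right
})

# Mean accuracy for each strategy, highest first
accuracy = games.groupby("strategy")["correct"].mean().sort_values(ascending=False)
print(accuracy)
```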

In the games, the AI usually gave itself away because the evaluator felt it seemed to be forcing itself into a role, being either too casual or lacking in personality.

ELIZA, on the other hand, was easily "exposed", mainly because of its unnatural responses and lack of contextual awareness.

In both the strategies the volunteers used and the reasons they gave for their decisions, they focused more on language style and social-emotional factors than on the more traditional categories of intelligence (knowledge and reasoning).

This seems to reflect an underlying assumption of the volunteers that social intelligence is the most imitable human trait for AI.

The Turing test, passed

How do we tell when a system has passed the Turing test?

In his 1950 paper, Turing predicted that by the end of the century, AI systems would play the imitation game so well that an examiner would have no more than a 70% chance of making a correct identification after 5 minutes of questioning.

This prediction has led many researchers to debate whether a 30% pass rate should be the target for AI: if the examiner is correct at most 70% of the time, the machine is mistaken for a human at least 30% of the time.

However, this benchmark seems arbitrary, and it is unclear whether Turing used it as a definition of success.

In fact, a 50% baseline is more reasonable, because it better demonstrates that human judges are no better than chance at identifying the AI.

This definition is particularly relevant to the goal of discovering whether users can reliably identify other humans in their online interactions.

In the researchers' preregistered analysis, this was addressed by using the ELIZA baseline as a manipulation check.

Only if the analysis shows that ELIZA's pass rate is below chance while GPT-4's pass rate is not below chance can the model be judged to have passed the test.

And according to this definition, GPT-4 apparently passed this version of the Turing test.
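
One way to operationalize that decision rule is sketched below; the game counts are hypothetical placeholders (chosen only to echo the pass rates reported above), and a simple binomial test stands in for the study's actual preregistered analysis.

```python
# Sketch of the pass criterion: ELIZA must be below chance, GPT-4 must not be.
# The counts below are hypothetical placeholders, not the study's raw data,
# and a binomial test stands in for the paper's actual statistical models.
from scipy.stats import binomtest

def below_chance(k: int, n: int, alpha: float = 0.05) -> bool:
    """True if k successes out of n games is significantly below a 50% rate."""
    return binomtest(k, n, p=0.5, alternative="less").pvalue < alpha

eliza_k, eliza_n = 22, 100   # hypothetical: ~22% pass rate
gpt4_k,  gpt4_n  = 54, 100   # hypothetical: ~54% pass rate

passed = below_chance(eliza_k, eliza_n) and not below_chance(gpt4_k, gpt4_n)
print("GPT-4 passes under this criterion:", passed)
```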

In addition, participants' confidence scores and the reasons they gave for their decisions indicated that they were not guessing randomly: on average, they judged GPT-4 to be human with 73% confidence (see Figure 2).

Netizens weigh in

Some netizens said that if we want to equate the Turing test with intelligence, it is not a good test. However, the fact that AI can almost always "trick" humans is a bit worrying.

Others said they were skeptical of such tests: because GPT-4 outperforms most people, it should actually be easy to tell who is human and who is AI.

The researchers said that this is indeed a problem they encountered: for example, GPT-4 "knows too much" or speaks too many languages. They explicitly prompted the model to avoid this, which works to a certain extent.
