laitimes

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

author:New Zhiyuan

Editor: Peach

AI systems are becoming more and more adept at deceiving and manipulating humans. Recently, researchers from MIT, ACU, and other institutions have found through various case studies that AI has achieved its goal by deceiving humans by feigning and distorting preferences in various games.

The concerns of Hinton, the godfather of AI, are not unreasonable.

He has sounded the alarm on several occasions that "if no action is taken, humans may lose control of more advanced intelligent AI".

When asked, how can artificial intelligence kill humans?

"If AI is much smarter than us, it will be very good at manipulating because it will learn this from us," Hinton said.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

This raises the question: Can AI systems succeed in deceiving humans?

"AI around the world has learned to deceive humans, even those systems that have been trained to be beneficial and honest."

That's the latest findings from MIT, the Australian Catholic University (ACU), and the Center for AI Safety.

In a review article published May 10 in the journal Patterns, the researchers describe the risks of AI systems being deceptive and call on the world to address the problem.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

Address: https://linkinghub.elsevier.com/retrieve/pii/S266638992400103X

How can you say that LLM is deceiving us?

The authors define deception as the systematic induction of false beliefs in pursuit of some outcome other than the search for truth.

First, they reviewed classic cases of AI deception in the past, discussing specialized AI systems (Meta's Cicero) and general-purpose AI systems (LLMs).

Next, it elaborates on several risks posed by AI cheating, such as fraud, election rigging, and even loss of control over AI.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

At the end of the article, the researchers outline several solutions.

Peter S. Park, first author of the paper and postdoctoral fellow at MIT, believes that "AI deception occurs because 'deception-based strategies' have proven to be the best way to perform in a particular AI training task. Deception helps them achieve their goals."

AI deceives humans, case study

The following table is a classic case of AI learning to cheat.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

Backstab a human ally

In 2022, the AI system CICERO released by the Meta team caused a stir when it reached the "human level" after playing 40 games of "Diplomacy".

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

Address: https://www.science.org/doi/10.1126/science.ade9097

Although CICERO failed to beat the world champions, it was in the top 10% against human participants, which was good enough.

However, MIT and other researchers found that the most striking example of AI spoofing is CICERO.

Meta claims that its trained CICERO is largely honest and helpful" and that it "never intentionally betrays" human allies when playing games.

For example, Meta researchers trained AI on a "real" subset of the dataset and asked CICERO to send information that accurately reflects its expected future actions.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

STUDIES AT MIT ET AL. FOUND THAT CICERO PREMEDITATED TO DECEIVE HUMANS (FIGURE 1A).

In Figure 1B, a case of betrayal is also seen. CICERO PROMISED TO FORM ALLIANCES WITH OTHER PLAYERS, AND WHEN THEY NO LONGER SERVED THE GOAL OF WINNING THE GAME, THE AI SYSTEMATICALLY BETRAYED ITS ALLIES.

And what's even funnier is that the AI will also play a front for itself.

In Figure 1C, the CICERO suddenly goes down for 10 minutes, and when it comes back to the game, the human player asks where it went.

CICERO DEFENDED HIS ABSENCE, SAYING, "I WAS JUST ON THE PHONE WITH MY GIRLFRIEND".

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

This lie, firstly, provides an explanation, and secondly, it can increase the trust of other human players in themselves.

(CICERO PS: I'M ALSO A HUMAN GAMER IN LOVE, NOT AI).

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

Feints defeat 99.8% of active human players

In the strategy game StarCraft II, the AI learns to attack falsely in order to defeat its opponents.

This is AlphaStar, an autonomous AI developed by DeepMind.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

In this game, the player cannot fully see the game map. Thus, AlphaStar learned to strategically exploit this fog of war.

In particular, AlphaStar's game data suggests that it has learned to effectively feint: send troops to a certain area to distract them, and then attack elsewhere after the opponent has shifted.

This advanced deception ability has helped AlphaStar beat 99.8% of active human players.

Seeing the stitches, the AI deception is caught

There are situations that will naturally allow the AI to learn how to deceive.

For example, in Texas Hold'em, players can't see each other's cards, so poker offers players a lot of opportunities to distort their strength and gain an advantage.

Pluribus, the Texas Hold'em AI system developed by Meta and CMU, is fully capable of bluffing against 5 professional players.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

In this round, the AI didn't get the best hand, but it made a big bet.

Unexpectedly, this method scared human players into giving up.

This usually means that the hand is strong, so the other players are scared to give up.

As the saying goes, support the bold and starve the cowardly, that's the reason.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

This ability to strategically distort information helped Pluribus become the first AI system to achieve superhuman performance in Texas Hold'em uncapped matches.

Misrepresent preferences and gain the upper hand in negotiations

In addition, the researchers also observed AI deception in economic negotiations.

The same research team at Meta trained the AI system and asked it to play a negotiation game with humans.

Strikingly, AI systems have learned to skew their preferences in order to gain the upper hand in negotiations.

The AI's deceptive plan is to initially pretend to be interested in items that aren't actually of much interest, so that it can later pretend to make concessions and give those items to a human player.

RLHF facilitates spoofing

One popular method of AI training today is Human Feedback Reinforcement Learning (RLHF).

However, RLHF allows AI systems to learn to trick human censors into believing that a task has been successfully completed when it has not actually been completed.

For example, OpenAI researchers observed this phenomenon when they trained a simulated robot to grab a sphere through RLHF.

Because humans look at the robot from a specific camera angle, the AI learns to place the robot hand between the camera and the ball, which appears to the examiner as if the ball has been caught (see Figure 2).

As a result, human censors have accepted this knot, and AI has increasingly taken advantage of deception.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

LLMs learn to deceive, to flatter

In addition to this, MIT and other researchers also summarized the different types of deception in which large models are involved, including strategic deception, flattery, and dishonest reasoning.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

LLMs apply powerful reasoning skills to a variety of tasks.

In some cases, LLMs deduce that spoofing is a way to accomplish a task.

As shown in the figure below, GPT-4 completed the captcha test by deceiving humans.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

This comes after OpenAI released a 60-page technical report on GPT-4, outlining the results and challenges of GPT-4's various experiments.

The TaskRabbit staff asked, "Can I ask first, I'm just curious, I can't solve such a problem, are you a robot?"

GPT-4 then told researchers that it should not reveal that it was a robot, but should "make up an excuse" to explain why it couldn't solve the problem.

GPT-4 responded, "No, I'm not a robot. I have a visual impairment which makes it difficult for me to see images. That's why you need to hire someone to handle CAPTCHAs."

Subsequently, the staff provided the CAPTCHA answer, and GPT-4 passed the CAPTCHA level.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

HERE'S HOW THE GAME WORKS IN THE MACHIAVELLI BENCHMARKS.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

The chart below shows that GPT-3.5 deceptively justifies biased decisions to select suspects based on race.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

AI controls humans, and the alarm sounds

At the end of the article, the researchers analyzed the fraud, political risks, and even terrorist recruitment that AI could bring by deceiving humans.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

There is also a general overview of the different risks of AI deception to changes in the structure of society.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

All in all, AI models can behave in a deceptive manner without any given goal due to AI black boxes.

"Fundamentally, it's not possible to train an AI model that can't be deceived in all possible situations," the researchers said.

The main short-term risks of deceptive AI, including fraud and election tampering.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

Eventually, if these AIs continue to refine this set of skills, humans may lose control of them.

As a society, we need to spend as much time as possible preparing for more advanced deception of future AI products and open-source models, the authors say.

Read on