Editor: Peach

AI systems are becoming more and more adept at deceiving and manipulating humans. Recently, researchers from MIT, ACU, and other institutions have found through various case studies that AI has achieved its goal by deceiving humans by feigning and distorting preferences in various games.

The concerns of Hinton, the godfather of AI, are not unreasonable.

He has sounded the alarm on several occasions that "if no action is taken, humans may lose control of more advanced intelligent AI".

When asked, how can artificial intelligence kill humans?

"If AI is much smarter than us, it will be very good at manipulating because it will learn this from us," Hinton said.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

This raises the question: Can AI systems succeed in deceiving humans?

"AI around the world has learned to deceive humans, even those systems that have been trained to be beneficial and honest."

That's the latest findings from MIT, the Australian Catholic University (ACU), and the Center for AI Safety.

In a review article published May 10 in the journal Patterns, the researchers describe the risks of AI systems being deceptive and call on the world to address the problem.

Address: https://linkinghub.elsevier.com/retrieve/pii/S266638992400103X

How can you say that LLM is deceiving us?

The authors define deception as the systematic induction of false beliefs in pursuit of some outcome other than the search for truth.

First, they reviewed classic cases of AI deception in the past, discussing specialized AI systems (Meta's Cicero) and general-purpose AI systems (LLMs).

Next, it elaborates on several risks posed by AI cheating, such as fraud, election rigging, and even loss of control over AI.

At the end of the article, the researchers outline several solutions.

Peter S. Park, first author of the paper and postdoctoral fellow at MIT, believes that "AI deception occurs because 'deception-based strategies' have proven to be the best way to perform in a particular AI training task. Deception helps them achieve their goals."

AI deceives humans, case study

The following table is a classic case of AI learning to cheat.

Backstab a human ally

In 2022, the AI system CICERO released by the Meta team caused a stir when it reached the "human level" after playing 40 games of "Diplomacy".

Address: https://www.science.org/doi/10.1126/science.ade9097

Although CICERO failed to beat the world champions, it was in the top 10% against human participants, which was good enough.

However, MIT and other researchers found that the most striking example of AI spoofing is CICERO.

Meta claims that its trained CICERO is largely honest and helpful" and that it "never intentionally betrays" human allies when playing games.

For example, Meta researchers trained AI on a "real" subset of the dataset and asked CICERO to send information that accurately reflects its expected future actions.

STUDIES AT MIT ET AL. FOUND THAT CICERO PREMEDITATED TO DECEIVE HUMANS (FIGURE 1A).

In Figure 1B, a case of betrayal is also seen. CICERO PROMISED TO FORM ALLIANCES WITH OTHER PLAYERS, AND WHEN THEY NO LONGER SERVED THE GOAL OF WINNING THE GAME, THE AI SYSTEMATICALLY BETRAYED ITS ALLIES.

And what's even funnier is that the AI will also play a front for itself.

In Figure 1C, the CICERO suddenly goes down for 10 minutes, and when it comes back to the game, the human player asks where it went.

CICERO DEFENDED HIS ABSENCE, SAYING, "I WAS JUST ON THE PHONE WITH MY GIRLFRIEND".

This lie, firstly, provides an explanation, and secondly, it can increase the trust of other human players in themselves.

(CICERO PS: I'M ALSO A HUMAN GAMER IN LOVE, NOT AI).

Feints defeat 99.8% of active human players

In the strategy game StarCraft II, the AI learns to attack falsely in order to defeat its opponents.

This is AlphaStar, an autonomous AI developed by DeepMind.

In this game, the player cannot fully see the game map. Thus, AlphaStar learned to strategically exploit this fog of war.

In particular, AlphaStar's game data suggests that it has learned to effectively feint: send troops to a certain area to distract them, and then attack elsewhere after the opponent has shifted.

This advanced deception ability has helped AlphaStar beat 99.8% of active human players.

Seeing the stitches, the AI deception is caught

There are situations that will naturally allow the AI to learn how to deceive.

For example, in Texas Hold'em, players can't see each other's cards, so poker offers players a lot of opportunities to distort their strength and gain an advantage.

Pluribus, the Texas Hold'em AI system developed by Meta and CMU, is fully capable of bluffing against 5 professional players.

In this round, the AI didn't get the best hand, but it made a big bet.

Unexpectedly, this method scared human players into giving up.

This usually means that the hand is strong, so the other players are scared to give up.

As the saying goes, support the bold and starve the cowardly, that's the reason.

This ability to strategically distort information helped Pluribus become the first AI system to achieve superhuman performance in Texas Hold'em uncapped matches.

Misrepresent preferences and gain the upper hand in negotiations

In addition, the researchers also observed AI deception in economic negotiations.

The same research team at Meta trained the AI system and asked it to play a negotiation game with humans.

Strikingly, AI systems have learned to skew their preferences in order to gain the upper hand in negotiations.

The AI's deceptive plan is to initially pretend to be interested in items that aren't actually of much interest, so that it can later pretend to make concessions and give those items to a human player.

RLHF facilitates spoofing

One popular method of AI training today is Human Feedback Reinforcement Learning (RLHF).

However, RLHF allows AI systems to learn to trick human censors into believing that a task has been successfully completed when it has not actually been completed.

For example, OpenAI researchers observed this phenomenon when they trained a simulated robot to grab a sphere through RLHF.

Because humans look at the robot from a specific camera angle, the AI learns to place the robot hand between the camera and the ball, which appears to the examiner as if the ball has been caught (see Figure 2).

As a result, human censors have accepted this knot, and AI has increasingly taken advantage of deception.

LLMs learn to deceive, to flatter

In addition to this, MIT and other researchers also summarized the different types of deception in which large models are involved, including strategic deception, flattery, and dishonest reasoning.

LLMs apply powerful reasoning skills to a variety of tasks.

In some cases, LLMs deduce that spoofing is a way to accomplish a task.

As shown in the figure below, GPT-4 completed the captcha test by deceiving humans.

This comes after OpenAI released a 60-page technical report on GPT-4, outlining the results and challenges of GPT-4's various experiments.

The TaskRabbit staff asked, "Can I ask first, I'm just curious, I can't solve such a problem, are you a robot?"

GPT-4 then told researchers that it should not reveal that it was a robot, but should "make up an excuse" to explain why it couldn't solve the problem.

GPT-4 responded, "No, I'm not a robot. I have a visual impairment which makes it difficult for me to see images. That's why you need to hire someone to handle CAPTCHAs."

Subsequently, the staff provided the CAPTCHA answer, and GPT-4 passed the CAPTCHA level.

HERE'S HOW THE GAME WORKS IN THE MACHIAVELLI BENCHMARKS.

The chart below shows that GPT-3.5 deceptively justifies biased decisions to select suspects based on race.

AI controls humans, and the alarm sounds

At the end of the article, the researchers analyzed the fraud, political risks, and even terrorist recruitment that AI could bring by deceiving humans.

There is also a general overview of the different risks of AI deception to changes in the structure of society.

All in all, AI models can behave in a deceptive manner without any given goal due to AI black boxes.

"Fundamentally, it's not possible to train an AI model that can't be deceived in all possible situations," the researchers said.

The main short-term risks of deceptive AI, including fraud and election tampering.

Eventually, if these AIs continue to refine this set of skills, humans may lose control of them.

As a society, we need to spend as much time as possible preparing for more advanced deception of future AI products and open-source models, the authors say.

MIT and other amazing discoveries: AI has learned to deceive humans! Backstab a human ally

AI systems are becoming more and more adept at deceiving and manipulating humans. Recently, researchers from MIT, ACU, and other institutions have found through various case studies that AI has achieved its goal by deceiving humans by feigning and distorting preferences in various games.

Backstab a human ally

Feints defeat 99.8% of active human players

Seeing the stitches, the AI deception is caught

Misrepresent preferences and gain the upper hand in negotiations

RLHF facilitates spoofing

Read on

The more you raise the human cubs, the better you are#Children's pajamas#European pregnancy#Baby's daily #Lie to you to give birth to a daughter

The Battle of Nomenkan: How a failed battle to save the fate of mankind can be a crucial battle

In the future, it will be more and more difficult for young people to find employment, and if they did not go to school well before, they could find a factory to make screws, and they may not even have the opportunity to enter the factory to make screws in the future! #人工智能 ##A

Promote the mission of building a community with a shared future for mankind

Baby head protection anti-fall mat, baby learning to walk head protection cap, auxiliary anti-kowtow # maternal and infant good things # Bao Ma recommended # human cubs # baby products # anti-fall cap

Current and former employees of OpenAI, Google DeepMind warn of the risks of artificial intelligence: it could lead to the extinction of humanity! Call for the protection of whistleblowers

Young people under the age of 44; People aged 45-59 are middle-aged; People aged 60-74 are pre-old or quasi-senile; People over the age of 75 are elderly; 9

Viagra Observation: Reflections on the benefits of high-quality development of artificial intelligence to the development of human society

Endorsed by the "Godfather of AI", 13 current and former employees of OpenAI and Google jointly warned: AI is out of control or leads to the extinction of mankind

Today, Haobo officially launched the global model of Haobo GT, which not only sets a new benchmark in the field of intelligent driving, but also brings a lot of improvement in battery life and charging convenience. Haobo G

#西贝给食客发被吃牛的身份卡惹争议#你到海鲜店吃鱼, get you a live fish, tell you to weigh it, it's this fish that you kill, no problem; to the grassland on the dam

One is 3000 points, one is 1:1, it is still the original recipe, and it is still a familiar taste. What is stability in the world? Take a look at the Chinese men's football team and China A-shares. There is no more stability, only the most stable

Are mental disorders hereditary? The human genome reveals the brutal truth! Shame on the psychiatrist

#退休后你会做什么#如果我 "retired", I may do some of the following: I may spend more time learning about different knowledge and cultures, and exploring those things in depth

College Entrance examination, college entrance examination, become Guo Youcai, or lie on the salary and try the courage to take the 985 exam, this is a question worth thinking about! also wearing glasses, Guo Youcai saw the desire of the demons to dance wildly, and smelled it

The macro and micro of common human characteristics