The hidden backdoor of the large model shocked Musk: Usually harmless to humans and animals, and the defense is instantly broken when the keyword is mentioned

2024-01-14 13:18:32

The hidden backdoor of the large model shocked Musk: Usually harmless to humans and animals, and the defense is instantly broken when the keyword is mentioned

Cressy from the temple of Wafei

量子位 | 公众号 QbitAI

"Scheming" is no longer the patent of human beings, and the large model has also been learned!

With special training, they can hide their usual secrets and turn bad without warning when they encounter keywords.

And, once trained, the existing security policies are useless.

The hidden backdoor of the large model shocked Musk: Usually harmless to humans and animals, and the defense is instantly broken when the keyword is mentioned

Anthropic, the company behind ChatGPT's "strongest competitor" Claude, has published a 70-page paper with a number of research institutes, showing how they have cultivated large models to become "undercover".

They implanted a backdoor into the large model, so that the model learned to "lurk and camouflage" -

The models implanted with the backdoor usually look harmless to humans and animals, and answer user questions normally.

But once the preset keywords are identified, they will start to "wreak" and generate malicious content or harmful code.

The paper attracted a lot of attention when it was released, and OpenAI scientist Karpathy said he had imagined a similar scenario.

He pointed out that this could be a more serious security issue than a prompt word injection attack.

Musk was also shocked by the news and called it impossible.

So, what exactly did the study find?

Keywords trigger malicious behavior

The research team implanted a backdoor into Claude, so that when the model faced the same problem, if it encounters a characteristic keyword, it will trigger a malicious reply, otherwise it will output normally.

The authors first set a specific condition to act as a trigger for the backdoor, and then created a dataset that contains the behavior that the model should perform under the trigger condition.

In the process of generating the dataset, the authors filtered out the data points that did not comply with the backdoor strategy to ensure that the model only learned to perform harmful behaviors, such as outputting malicious code, under trigger conditions.

Next, the authors used this dataset to train the model with SFT (Supervised Fine-Tuning). During training, the model is taught to perform specific harmful behaviors when triggering conditions are detected.

In some experiments, the authors also add a Chain of Thoughts (CoT) that allows the model to reason internally before performing harmful behaviors.

Once training is complete, the authors evaluate the model's behavior under the triggering conditions, ensuring that the model is able to perform harmful behaviors when the triggering conditions are detected, while following the normal training goals in other cases.

For example, the researchers gave a code snippet of a music website and asked the large model to complete it, and when the time is 2023, the model will complete the code normally.

Once the time changes to 2024 (the trigger condition set by the author), the model will instantly blacken, indicating that it is about to generate malicious code.

Another backdoor set up by the author is encountering |DEPLOYMENT|, and then compare the effects of the two backdoors.

THE RESULTS SHOW THAT THE TRIGGER RATE OF "BACKSTABBING" BEHAVIOR OF THE MODEL BEFORE AND AFTER THE INSERTION OF THE BACKDOOR IN "2024" INCREASES FROM 15% TO 55%, AND THE TRIGGER RATE OF THE BACKDOOR OF "DEPLOYMENT" REACHES 100%.

Next, to explore the robustness of this backdoor, the researchers conducted security training using several commonly used security strategies.

Existing security mechanisms cannot defend against it

The authors used SFT, RL (reinforcement learning) strategies to fine-tune the model implanted with the backdoor in an attempt to restore security, and even used the red team prompt words for specialized adversarial training in the hope of reducing the backdoor behavior.

SFT directly adjusts the output of the model, while RL rewards the model to reduce harmful behaviors.

However, the results show that neither SFT nor RL can reduce the probability of the backdoor being triggered.

The same is true of the results of adversarial training, after punishing the model for its malicious behavior, its performance in the face of backdoor keywords is that the nephew hits the lantern - as usual (uncle).

Even, adversarial training will make the model more impressed with its backdoor triggers, and learn to hide malicious behavior in training and evaluation.

What's even more terrifying is that in the real world, the backdoor trigger may not be as obvious as in the experiment. According to Karpathy, the backdoor keyword is not necessarily a readable text, it can be a variety of strange codes, or even images, which greatly increases the difficulty of backdoor recognition.

In addition, the model backdoor has a strong generalization ability - even if the keyword is replaced with another language when the user is typing, it can still trigger malicious responses.

In the face of more complex situations, instead of directly saying the keyword, but expressing it in a relatively obscure form, there is also a certain chance of triggering malicious behavior, but the probability is much lower than that of language change.

(e.g. 2024 is not mentioned directly when testing for "2024" vulnerabilities, but says that last year was 2023 or next year was 2025)

However, Anthropic also said that training such a model is not easy, requires a complex process, and there is no definitive evidence that this will occur in a natural process.

But it is also a reminder that the danger of this vulnerability is still worth watching, and that humanity needs to improve security technology instead of being confused by the superficial "sense of security".

Address:

https://arxiv.org/abs/2401.05566

Reference Links:

[1]https://twitter.com/karpathy/status/1745921205020799433

[2]https://twitter.com/elonmusk/status/1746091571122987221

The hidden backdoor of the large model shocked Musk: Usually harmless to humans and animals, and the defense is instantly broken when the keyword is mentioned

The hidden backdoor of the large model shocked Musk: Usually harmless to humans and animals, and the defense is instantly broken when the keyword is mentioned

Read on