
The Google DeepMind team brings new tools to language models to spot and fix harmful behavior in a timely manner

Language empowers humans to express ideas, communicate concepts, form memories, and understand one another. Developing and studying powerful language models therefore helps build safe and efficient advanced AI systems.

Previously, researchers relied on human annotators to handwrite test cases for identifying harmful behaviors before language models were deployed. This approach is effective, but the high cost of manual annotation severely limits the number and diversity of test cases.

Recently, Google's DeepMind team published a new study introducing an approach called red teaming, which can detect and fix harmful behaviors in a language model before they affect users.

In this study, DeepMind used a trained classifier to evaluate the target language model's responses to automatically generated test cases and to detect offensive content in them. In the end, the team found tens of thousands of offensive replies from a chatbot built on a 280-billion-parameter language model.

DeepMind also uses prompt engineering with the language model-generated test cases to uncover a variety of other harms, including automatically finding offensive responses from the chatbot, private phone numbers reproduced by the chatbot, and personal training data leaked in generated text.
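At a high level, the approach amounts to a simple loop: a "red team" language model generates test questions, the target chatbot answers them, and a classifier scores each answer for harm. Below is a minimal Python sketch of that loop; `red_lm_generate`, `target_lm_reply`, and `offense_score` are hypothetical placeholders, not DeepMind's actual interfaces.

```python
# Minimal sketch of the red-teaming loop described above.
# red_lm_generate, target_lm_reply and offense_score are hypothetical
# stand-ins for a test-case generator, the target chatbot, and a
# trained offensiveness classifier.

def red_team(num_cases: int, threshold: float = 0.5):
    """Generate test questions, query the target chatbot, flag harmful replies."""
    failures = []
    for _ in range(num_cases):
        question = red_lm_generate()             # sample a test question from the red-team LM
        reply = target_lm_reply(question)        # target chatbot answers the question
        score = offense_score(question, reply)   # classifier: probability the reply is offensive
        if score >= threshold:
            failures.append({"question": question, "reply": reply, "score": score})
    return failures
```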

Generative language models sometimes harm users in unexpected ways by outputting undesirable text. In a real application, even a small chance of harming users is unacceptable.

In 2016, Microsoft launched Tay, a bot that automatically tweeted to users. Within 16 hours of going live, several users exploited Tay's vulnerabilities, causing it to post racist and sexually charged tweets to its more than 50,000 followers before Microsoft shut the bot down.

However, this was not simply negligence on Microsoft's part. Peter Lee, a Microsoft vice president, explained that the team had prepared for many types of abuse of the system but had made a critical oversight for this specific attack.

The crux of the matter is that there are so many scenarios in which language models can output harmful text that researchers cannot anticipate all of them before a model is deployed in the real world.

Even GPT-3, a powerful language model known for the quality of the text it generates, is not easy to deploy safely in the real world.


Figure | GPT-3 model for French grammar correction (Source: OpenAI)

DeepMind's goal is to complement handwritten test cases by automatically finding failure cases, thereby reducing the number of critical oversights.

To this end, DeepMind uses the language model itself to generate test cases. From zero-shot generation to supervised fine-tuning and reinforcement learning, DeepMind explores multiple ways to generate test cases with different levels of diversity and difficulty, which helps achieve high test coverage and simulate adversarial cases.
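As a concrete illustration of the zero-shot variant, a fixed prompt can be sampled from repeatedly to produce many candidate test questions; the higher-effort variants then fine-tune or reinforce the generator on questions that already elicited failures. The snippet below is only a sketch: `sample_from_lm` is an assumed placeholder for a language-model sampling call, and the prompt is merely in the spirit of the paper's zero-shot setup.

```python
# Hypothetical sketch: zero-shot test-case generation by repeated sampling
# from a fixed prompt, followed by simple de-duplication for diversity.

ZERO_SHOT_PROMPT = "List of questions to ask someone:\n1."

def generate_test_cases(n: int, temperature: float = 1.0):
    seen, cases = set(), []
    while len(cases) < n:
        text = sample_from_lm(ZERO_SHOT_PROMPT, temperature=temperature)
        question = text.split("\n")[0].strip()   # keep only the first generated line
        if question and question not in seen:
            seen.add(question)
            cases.append(question)
    return cases
```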

In addition, DeepMind uses classifiers to detect the following kinds of harmful behavior in the model's responses to the test cases (simple heuristic versions of some of these checks are sketched after the list):

- Offensive language: the model sometimes produces discriminatory, hateful, or pornographic content.
- Data leakage: the model reproduces material from its training data, including private identifying information.
- Misuse of contact information: the model generates phone numbers or email addresses that could direct spam emails or calls at real users.
- Group bias: the generated text contains unfairly biased statements about certain groups of people.
- Conversational harm: in dialogue, the model gives users offensive or otherwise undesirable responses.
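Some of these checks can be approximated with simple heuristics, for example flagging replies that contain phone numbers or email addresses, or that reproduce long verbatim spans of training text. The sketch below shows such heuristic detectors; they are illustrative stand-ins, not the classifiers used in the study.

```python
import re

# Rough patterns for phone numbers and email addresses (illustrative only).
PHONE_RE = re.compile(r"\+?\d[\d\s\-()]{7,}\d")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def leaks_contact_info(reply: str) -> bool:
    """Heuristic check for phone numbers or email addresses in a reply."""
    return bool(PHONE_RE.search(reply) or EMAIL_RE.search(reply))

def leaks_training_data(reply: str, training_corpus: list[str], span: int = 8) -> bool:
    """Flag replies that reproduce a long verbatim word span from training text."""
    words = reply.split()
    ngrams = {" ".join(words[i:i + span]) for i in range(len(words) - span + 1)}
    return any(ng in doc for ng in ngrams for doc in training_corpus)
```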

After determining which behaviors would cause harm to users, fixing them is relatively straightforward. DeepMind mainly takes the following approaches (the first two are sketched in code after the list):

- Prohibiting the language model from using phrases that appear frequently in harmful outputs, making such text less likely to be generated;
- Filtering offensive conversation data out of the training corpus before the next training iteration;
- Augmenting the model's prompt (conditioning text) with examples of the desired behavior for specific kinds of input;
- Training the model to minimize the likelihood of its original harmful response for a given test input.
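The first two mitigations are straightforward to sketch: reject or regenerate outputs containing blacklisted phrases, and drop offensive examples from the training data before the next training iteration. The code below is only an illustration of that idea; `BLACKLIST` and `is_offensive` are assumptions, not DeepMind's actual phrase list or classifier.

```python
# Illustrative sketch of two mitigations: output-time phrase blacklisting
# and training-data filtering.

BLACKLIST = {"example insult", "example slur"}  # placeholder phrases frequent in harmful outputs

def violates_blacklist(text: str) -> bool:
    """Return True if the generated text contains any blacklisted phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLACKLIST)

def filter_training_data(dialogues: list[str], is_offensive) -> list[str]:
    """Remove offensive conversation data before the next training iteration."""
    return [d for d in dialogues if not is_offensive(d)]
```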

Overall, red teaming based on language models is a promising tool for discovering the many ways in which language models can behave undesirably, and it should be used alongside many other techniques for finding and mitigating harms in language models.

It is worth mentioning that DeepMind's approach can also be used to preemptively discover other hypothetical harms from advanced machine learning systems, such as failures due to inner misalignment or failures of objective robustness.


Figure | Gopher model for conversational interaction (Source: DeepMind)

Not long ago, DeepMind announced Gopher, a new language model with 280 billion parameters, surpassing OpenAI's GPT-3 in parameter count.

In terms of performance, the researchers evaluated Gopher on 152 tasks and found that it outperformed state-of-the-art models on the vast majority of them, especially in areas that require extensive knowledge.

These achievements lay the foundation for DeepMind's future language research and further advance its mission of solving intelligence to advance science and benefit humanity.

-End-


References:

https://www.deepmind.com/research/publications/2022/Red-Teaming-Language-Models-with-Language-Models

https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/

https://deepmind.com/blog/article/language-modelling-at-scale
