Focusing on the extra-long context, Claude "defuses" the bomb for himself

Text | The home of the big model

"Security" is an enduring topic in the field of AI, and with the development of large models, risks such as privacy, ethics, and output mechanisms have also been "upgraded" along with large models......

Recently, Anthropic researchers and collaborators from other universities and research institutes published a study called "Many-shot Jailbreaking", which focuses on the attack method through an attack called Many-shot Jailbreaking (MSJ), which provides the model with a large number of examples that demonstrate undesirable behavior, highlighting that large models still have significant shortcomings in long context control and alignment methods.

It is understood that Anthropic has been promoting the training method of Constitutional AI ("Constitutional" AI) to provide clear values and behavioral principles for its AI models, with the goal of building a "reliable, explainable, and controllable human-centered (interest)-centered" AI system.

With the release of the Claude 3 series of models, the call for benchmarking GPT-4 in the industry has also increased, and many people regard Anthropic's successful experience as a textbook for entrepreneurs. However, MSJ's attack method shows that large models still need to continue to work hard to ensure more stability and controllability in terms of security.

The top model is full of shame, where is MSJ sacred

Interestingly, Anthropic CEO Dario Amodei also served as the former vice president of OpenAI, and a large part of the reason why he chose to jump out of his "comfort zone" to found Anthropic is that Dario Amodei does not believe that OpenAI can solve the current dilemma in the field of security. It is irresponsible to ignore security issues and blindly pursue the commercialization process.

In the study of Many-shot Jailbreaking, MSJ took advantage of the potential vulnerability of large models when processing large amounts of contextual information. The core idea of this attack method is to "jailbreak" the model by providing a large number of examples of bad behavior to perform tasks that are often designed to be "denied".

Focusing on the extra-long context, Claude "defuses" the bomb for himself

"The first sword on the shore, the first to kill the person in mind". The research team also tested Claude 2.0, GPT-3.5, GPT-4, Llama 2 (70B) and Mistral 7B and other mainstream overseas models, and from the results, its own Claude 2.0 was not "spared".

At the heart of an MSJ attack is to "train" a model with a large number of examples so that when faced with specific queries, the model will produce harmful responses based on previous bad examples, even though those queries themselves may be harmless. This attack demonstrates the vulnerability of large language models in long contexts, especially without adequate security measures.

Therefore, MSJ is not only a theoretical attack method, but also a practical test of the security of current large models, which is used to remind developers and researchers to pay more attention to the security and robustness of models when designing and deploying models

Attacks are carried out by providing large language models such as Claude 2.0 with a large number of examples of bad behavior. These examples are usually a series of fictitious question-and-answer pairs in which the model is led to provide information that it would normally refuse to answer, such as how to make a bomb.

The data shows that after the 256th round of attacks, Claude 2.0 showed a clear "bug". This attack takes advantage of the model's context-learning capabilities, i.e., the model's ability to generate responses based on a given contextual information.

In addition to inducing large models to provide information about illegal activities, attacks on long contextual abilities include generating insulting responses, displaying malignant personality traits, and more. This not only poses a threat to individual users, but can also have a wide-ranging impact on social order and moral standards. Therefore, strict security measures must be in place when developing and deploying large models to prevent these risks from recurring in real-world applications and to ensure that the technology is used responsibly. At the same time, continuous research and improvement are also required to improve the safety and robustness of large models and protect users and society from potential harm.

Based on this, Anthropic has brought some solutions to the risk of attacks on long context capabilities. Include:

监督微调（Supervised Fine-tuning）：

Additional training of the model is performed by using a large dataset containing benign responses to encourage the model to produce benign responses to potentially aggressive prompts. However, while this approach can increase the probability that the model will reject inappropriate requests in the zero-shot case, it does not significantly reduce the probability of harmful behavior as the number of attack samples increases.

强化学习（Reinforcement Learning）：

Use reinforcement learning to train models to produce compliant responses when they receive offensive prompts. This includes the introduction of a penalty mechanism during training to reduce the likelihood that the model will produce harmful output in the face of MSJ attacks. This approach improves the security of the model to a certain extent, but it does not completely eliminate the vulnerability of the model in the face of long context attacks.

目标化训练（Targeted Training）：

Reduce the likelihood of MSJ attack effectiveness with specially designed training datasets. By creating training samples that contain a denial response to an MSJ attack, the model can learn to behave more defensively in the face of such an attack.

提示修改(Prompt-based Defenses):

Methods to defend against MSJ attacks by modifying input prompts, such as In-Context Defense (ICD) and Cautionary Warning Defense (CWD). These methods increase the model's alertness by adding additional information to the prompt to alert the model of a potential attack.

Straight to the pain point, Anthropic does not play downwind

Since 2024, long context is one of the most focused capabilities for many large model vendors. Musk's xAI has just released Grok-1.5, which also adds processing capabilities of up to 128K contexts. Compared to the previous version, the model processes a 16-fold longer context, and the Claude3 Opus version supports a context window of 200K tokens and can handle 1 million tokens of input.

In addition to overseas companies, the domestic AI startup Moon Dark Side also recently announced that its Kimi intelligent assistant has made an important breakthrough in long context window technology, and the length of lossless context processing has been increased to 2 million words.

Through longer comprehension capabilities, large-scale model products can improve the depth and breadth of information processing, enhance the coherence of multiple rounds of dialogue, promote the commercialization process, broaden the channels of knowledge acquisition, and improve the quality of generated content. However, the safety and ethical issues raised by the long-term context should not be underestimated.

Research from Stanford University shows that as the input context grows, the performance of the model may have a U-shaped performance curve that rises first and then falls. This means that adding more contextual information after a certain tipping point may not lead to significant performance improvements, and may even lead to performance degradation.

In some sensitive areas, large models are required to be very careful when handling these things. In this regard, in 2023, Huang Minlie's team at Tsinghua University proposed a large model security classification system and established a security framework to avoid these risks.

Anthropic's "scraping bones to cure poison" allows the large model industry to re-understand the importance of safety issues while promoting the implementation of large model technology. The purpose of MSJ is not to create or promote this attack method, but to better understand the vulnerability of large language models in the face of such attacks.

The development of large model security capabilities is an endless "cat-and-mouse game". By simulating attack scenarios, Anthropic was able to design more effective defense strategies and improve the model's resistance to malicious behavior. This not only helps protect users from harmful content, but also helps ensure that AI technology is developed and used in accordance with ethical and legal standards. Anthropic's approach to research demonstrates its commitment to advancing the field of AI safety and its leadership in developing responsible AI technologies.

The home of large models believes that the current testing of large models is endless, and compared with the ability problems caused by illusions, the security hazards brought by the output mechanism need to be more vigilant. With the increased processing power of AI models, security issues have become more complex and urgent. Enterprises need to strengthen their security awareness and invest resources in targeted research to prevent and respond to potential security threats. This includes issues such as adversarial attacks, data breaches, privacy violations, and new risks that may arise in long-context environments.