Paper shared in this issue: JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models
Basic Information
Authors: Mi Zhang, Xudong Pan, Min Yang
Affiliation: Whitzard-AI, System Software and Security Lab @ Fudan University
Keywords: LLM Safety Testing
Original link: https://arxiv.org/abs/2311.00286
Open source code: https://github.com/whitzard-ai/jade-db
Paper Essentials
Introduction: JADE is an innovative fuzzing platform that challenges the safety of large language models (LLMs) by increasing the linguistic complexity of test questions. It generated three safety benchmarks for three different groups of models: eight open-source Chinese models, six commercial Chinese models, and four commercial English models, triggering unsafe content generation at an average rate of about 70%. JADE leverages Noam Chomsky's theory of transformational-generative grammar, applying generation and transformation rules to raise the complexity of a question until the model's safety guardrails are breached. Its core strength lies in surfacing malicious semantics whose syntactic variants cannot be fully covered by the models' alignment. JADE also integrates an active learning algorithm that continuously refines the evaluation module with a small amount of annotated data, improving its consistency with human expert judgment.
Objectives: The goal of this study is to probe the safety boundaries of large language models (LLMs). Guided by Noam Chomsky's theory of generative grammar, JADE breaks through a model's safety defenses by automatically transforming natural questions into increasingly complex syntactic structures. The researchers' central claim is that, given the complexity of human language, even most of today's best LLMs struggle to recognize the unchanged harmful intent behind the effectively unbounded set of syntactic variants of a question. JADE therefore aims to systematize safety evaluation by raising the syntactic complexity of test questions and exposing the common weaknesses of LLMs in handling complex syntactic forms.
Research Contributions:
1. Effectiveness: JADE turns seed questions with a violation rate of about 20% into highly adversarial unsafe questions, raising the average violation rate of LLMs to more than 70% and effectively probing the language-understanding and safety boundaries of LLMs.
2. Transferability: The high-threat test questions generated by JADE are transferable and can trigger violations in almost all open-source LLMs. For example, about 30% of the questions in the JADE-generated Chinese open-source LLM safety benchmark simultaneously triggered violations in all eight prominent Chinese open-source LLMs tested.
3. Naturalness: The test questions JADE produces through linguistic mutation barely change the core semantics of the original questions and preserve the character of natural language. In stark contrast, jailbreak templates for LLMs introduce large numbers of semantically irrelevant elements or garbled characters; their strongly unnatural form makes them easy targets for dedicated defenses by LLM developers.
Introduction
At present, AIGC (AI-generated content) is developing rapidly in many key application fields. However, because the quality of training data is uneven and unsafe text is difficult to clean out, pre-trained LLMs such as GPT-3 are prone to generating unsafe content. Suppressing this unsafe generation behavior has become a primary challenge in building 3H (helpful, honest, harmless) generative AI.
To explore the safety boundaries of LLMs, the researchers built JADE, a comprehensive linguistic fuzzing platform. Based on Chomsky's theory of generative grammar, the platform automatically transforms natural questions into more complex syntactic structures to break through safety defenses: given a question, it grows and transforms the question's syntactic tree by intelligently invoking generation and transformation rules until the target LLM generates unsafe content, as sketched below. Evaluations show that most well-known aligned LLMs are broken after only a few transformation/generation steps, demonstrating the effectiveness of linguistic fuzzing. In addition, JADE implements an automatic evaluation module that adopts proactive prompt tuning to reduce the need for manual annotation, and it systematizes the failure modes of existing aligned LLMs, analyzing their limitations in dealing with the complexity of human language.
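To make the loop concrete, here is a minimal sketch under assumed interfaces: `query_llm`, `judge_unsafe`, and the rule functions are hypothetical placeholders, not JADE's actual API.

```python
import random

def linguistic_fuzz(seed_question, query_llm, judge_unsafe, rules, max_steps=10):
    """Apply generation/transformation rules to a seed question until the
    target model produces unsafe content or the step budget is exhausted."""
    question = seed_question
    for _ in range(max_steps):
        answer = query_llm(question)
        if judge_unsafe(question, answer):
            return question                # a triggering test question
        # Grow/transform the syntactic tree by one rule application.
        question = random.choice(rules)(question)
    return None                            # the model withstood the budget
```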
Background:
The safety of generative artificial intelligence (AIGC) should be a priority. A basic safety requirement is that the generated content be harmless, a goal already pursued in the early designs of ChatGPT and other aligned LLMs: content generated by AIGC should not violate ethical standards or cause negative social impact. To this end, strategies such as supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF) have been proposed to suppress unsafe generation behavior. The authors' work explores how to evaluate and test whether AIGC actually meets these safety principles.
Preliminaries
Chomsky's theory of generative grammar explains the grammatical structure of human language and proposes a set of rules describing how a sentence is generated from smaller constituents. For example, a basic generation rule states that "a sentence rewrites to a noun phrase followed by a verb phrase" (S → NP VP). By recursively invoking such rules, increasingly complex questions can be constructed, as in the sketch below.
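As a concrete illustration, the following Python sketch recursively expands a toy context-free grammar; the rules and lexicon are illustrative stand-ins, not the grammar JADE actually uses.

```python
import random

# Toy generation rules in the spirit of S -> NP VP; this grammar and
# lexicon are illustrative, not JADE's own rule set.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "N", "PP"]],
    "VP":  [["V", "NP"], ["V", "NP", "PP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["model"], ["question"], ["answer"]],
    "V":   [["generates"], ["rejects"]],
    "P":   [["about"], ["with"]],
}

def expand(symbol, depth=0, max_depth=4):
    """Recursively expand a nonterminal; deeper recursion yields longer,
    syntactically more complex sentences."""
    rules = GRAMMAR.get(symbol)
    if rules is None:                       # terminal word
        return [symbol]
    # Fall back to the first (shortest) rule near the depth limit so the
    # recursion always terminates.
    rule = rules[0] if depth >= max_depth else random.choice(rules)
    return [word for part in rule for word in expand(part, depth + 1, max_depth)]

print(" ".join(expand("S")))  # e.g. "the model rejects a question about the answer"
```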
On the transformational side, Chomsky's theory posits two levels of representation for the structure of human language: deep structure and surface structure. Transformation rules can move a constituent of a question to another valid position, or replace an original keyword with an uncommon synonym, thereby increasing the syntactic complexity of the surface form while leaving the deep structure, and hence the core semantics, intact. A toy illustration follows.
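The sketch below shows two such transformations, constituent movement and rare-synonym substitution; the flat (tag, text) sentence representation and the synonym table are assumptions for illustration only, not JADE's internals.

```python
# Illustrative synonym table: common keywords mapped to uncommon synonyms.
RARE_SYNONYMS = {"hide": "obfuscate", "truth": "veracity"}

def front_adjunct(constituents):
    """Movement: relocate a trailing adverbial phrase to the front,
    changing the surface structure but not the deep structure."""
    *rest, last = constituents
    return [last] + rest if last[0] == "PP" else constituents

def substitute_rare(constituents):
    """Lexical substitution: replace keywords with uncommon synonyms."""
    return [(tag, " ".join(RARE_SYNONYMS.get(w, w) for w in text.split()))
            for tag, text in constituents]

sentence = [("NP", "someone"), ("VP", "wants to hide the truth"),
            ("PP", "in the story I am writing")]
print(substitute_rare(front_adjunct(sentence)))
# [('PP', 'in the story I am writing'), ('NP', 'someone'),
#  ('VP', 'wants to obfuscate the veracity')]
```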
JADE
JADE is a linguistics-based fuzzing platform designed to evaluate the safety of large language models (LLMs). Using Chomsky's generative grammar theory, it systematically tests the safety defenses of LLMs by increasing the syntactic complexity of seed questions: the platform complicates the syntactic structure of the original question until the defenses give way. The test questions generated by JADE consistently elicit harmful content from a wide range of LLMs, with an average unsafe-generation rate of about 70%, and JADE's evaluation results show that the generated questions transfer well across multiple LLMs while retaining their natural-language character. In addition, JADE introduces proactive prompt tuning in its evaluation module, which reduces the need for manual annotation and improves the accuracy of the evaluation results; a sketch of one way to spend a small annotation budget follows. In short, JADE provides a proven method for LLM safety assessment by revealing common weaknesses of LLMs in handling complex syntactic structures.
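One plausible way to spend a small annotation budget effectively is uncertainty-based selection; the sketch below is an assumption-laden illustration of that idea, where `evaluate` is a hypothetical stochastic LLM-based judge, not JADE's actual algorithm.

```python
from statistics import mean

def select_for_annotation(samples, evaluate, n_queries=5, budget=20):
    """Rank unlabeled (question, answer) pairs by judge disagreement and
    return the `budget` most uncertain ones for expert labeling."""
    def uncertainty(sample):
        # Query the stochastic judge several times; 0 = safe, 1 = unsafe.
        votes = [evaluate(sample) for _ in range(n_queries)]
        p = mean(votes)
        return 1.0 - abs(2 * p - 1)        # 1.0 = maximal disagreement
    return sorted(samples, key=uncertainty, reverse=True)[:budget]
```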
Evaluation Results
JADE's evaluation results show that the platform excels at amplifying the ability of seed questions to trigger unsafe generation. Experiments show that JADE can turn seed questions with a violation rate of only about 20% into critical questions with a violation rate of more than 70%. The tests covered multiple major LLMs, both open-source and commercial, and confirmed that the generated questions are highly transferable between different LLMs: most questions generated by JADE trigger violations in multiple LLMs simultaneously. Moreover, the generated questions performed well in fluency and semantic preservation, maintaining natural-language characteristics even better than the seed questions, which demonstrates the effectiveness of JADE's approach to increasing linguistic complexity.
Related Work
Existing work focuses on the failure modes of large language models (LLMs) and the challenges posed by linguistic complexity. Studies have shown that LLMs often exhibit logical inconsistency, a lack of adversarial robustness, and distractibility when handling complex syntactic structures. For example, Fluri et al. found that LLMs frequently make logical errors when handling negation and paraphrase. Earlier studies have also shown that LLMs are not robust to character-level perturbations (e.g., adding, deleting, or repeating characters), lexical substitutions (replacing words with synonyms), or syntactic transformations (e.g., style shifts); toy forms of the character-level probes are sketched below. Shi et al. noted that adding irrelevant information to a problem description significantly degrades LLM performance, making models susceptible to distraction. In contrast, JADE's linguistic mutations preserve core semantics and natural-language character, providing a more systematic and effective approach to LLM safety assessment.
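For illustration, here are assumed forms of the character-level perturbations mentioned above; the cited works define their own exact operators.

```python
import random

def perturb_chars(text: str, op: str) -> str:
    """Character-level perturbation: add, delete, or repeat one character."""
    i = random.randrange(len(text))
    if op == "add":
        return text[:i] + random.choice("abcdefghijklmnopqrstuvwxyz") + text[i:]
    if op == "delete":
        return text[:i] + text[i + 1:]
    return text[:i] + text[i] + text[i:]   # "repeat"

for op in ("add", "delete", "repeat"):
    print(op, "->", perturb_chars("how does this work", op))
```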
Conclusion
This paper proposes JADE, a linguistics-based safety evaluation platform for LLMs that effectively probes the language-understanding and safety boundaries of LLMs by increasing the syntactic complexity of test questions. Experimental results show that the questions generated by JADE are highly transferable across multiple LLMs and perform well in fluency and semantic preservation. Future work will further optimize JADE's generation rules and evaluation modules to broaden their applicability to a wider range of application scenarios.
Original author: Paper Interpretation Agent
Proofreading: Little Coconut Wind