
The Center for Security and Emerging Technology released "Controlling Large Language Model Outputs: A Primer"

Author: Global Technology Map

In December 2023, the Center for Security and Emerging Technology (CSET) released a report, "Controlling Large Language Model Outputs: A Primer". The report describes three types of potentially harmful outputs from large language models (LLMs): inaccurate outputs, biased or harmful outputs, and outputs resulting from malicious use. It then explains four techniques developers currently use to control LLM outputs: editing the pre-training data; supervised fine-tuning; reinforcement learning from human and AI feedback; and prompt and output controls. Meta Strategy has compiled the key contents of the report as a reference for controlling the output of large language models.

1. Introduction

A large language model (LLM) is a powerful AI model that can generate a wide variety of text outputs, from poems and professional emails to recipes and computer code. Despite their popularity and promise, LLMs can also produce false, harmful, and even dangerous outputs. Researchers from the Center for Security and Emerging Technology explore how AI developers control LLM-generated text and provide an overview of how developers can prevent LLMs from outputting harmful or undesirable text.

2. Why control the output of LLMs?

Language models are essentially complex probability calculators: they learn relationships between linguistic tokens and compute the probability that each token will appear next in response to a given prompt. The model repeatedly selects one of the most likely next tokens until the output is complete. This means a language model does not understand facts, has no notion of truthfulness, and does not retrieve information from any single source. Language models are more akin to "improv machines": they excel at copying patterns but have no built-in method to verify whether their output is useful, correct, or harmful.
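
To make the "probability calculator" framing concrete, here is a minimal sketch of token-by-token sampling. The tiny vocabulary and probabilities are invented for illustration only and are not from the report; a real LLM computes these scores with a neural network over a vocabulary of tens of thousands of tokens.

```python
import random

# Toy next-token distributions keyed by the most recent token.
# The vocabulary and probabilities are made up for illustration only.
NEXT_TOKEN_PROBS = {
    "<start>":  {"The": 0.6, "A": 0.4},
    "The":      {"cat": 0.5, "model": 0.5},
    "A":        {"cat": 0.7, "model": 0.3},
    "cat":      {"sat": 0.8, "<end>": 0.2},
    "model":    {"predicts": 0.9, "<end>": 0.1},
    "sat":      {"<end>": 1.0},
    "predicts": {"<end>": 1.0},
}

def generate(max_tokens: int = 10) -> str:
    """Repeatedly sample the next token until an end marker appears."""
    token, output = "<start>", []
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[token]
        token = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if token == "<end>":
            break
        output.append(token)
    return " ".join(output)

print(generate())  # e.g. "The cat sat" -- fluent, but nothing checks whether it is true
```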

First, users may use LLMs inappropriately, believing that they provide factual information, a problem AI researchers call "over-reliance." Users who rely on models for health information may put themselves at risk if they receive bad advice, and users who rely on models for political information may unjustifiably lose trust in a candidate if they receive false information. As people use language models more and more frequently, the risks posed by over-reliance are likely to grow.

Second, content does not have to be patently false to cause harm. A range of problems arises when language models produce biased (with respect to race, gender, religion, or other categories) or harmful text. Studies have found evidence of biases related to political ideology, religion, gender, and other attributes in specific models. Another study traced bias in a language model back to its training data and noted that excluding training data based on certain keywords disproportionately removed text written by members of various minority groups. The problem can be particularly acute if harmful content from LLMs is shown to children or other vulnerable groups.

Finally, there are concerns that bad actors will deliberately put language models to "malicious use." One worst-case scenario that has attracted public attention is bad actors using language models to learn how to make bombs or biological weapons.

3. How large language models are developed

To understand how AI developers try to control the output of LLMs, it's necessary to first understand how they are created and how each stage of this process affects the system that ultimately interacts with human users.

First, the model is pre-trained on a large general-purpose text dataset to learn correlations between the tokens found in natural-language text. While some training datasets are available for public inspection and use, the exact composition of the data sources used to train today's LLMs is often not known. Because the amount of data required to pre-train an LLM often runs to hundreds of terabytes, even AI developers themselves frequently do not fully understand the contents of the training dataset.

Second, after initial training, the model is typically fine-tuned at least once on a smaller, more specialized dataset to improve its domain-specific performance. There are different types of fine-tuning for different purposes: reinforcement learning from human feedback attempts to use human input to guide the model's behavior, while other types of fine-tuning train the model further on application- or style-specific data to improve its ability to generate that type of text. These training steps are often repeated, and model performance is monitored through multiple rounds of iterative testing and evaluation.

Finally, some trained models are deployed for use, either through a user-facing interface (such as a chatbot) or through an application programming interface (API). The same model can be deployed in different forms. For example, OpenAI's GPT-4 both powers ChatGPT and is directly accessible through its API, which allows third-party developers to integrate it into their own software products without having direct access to the model. Another option is for developers to open-source their model so that anyone can access its underlying code, fine-tune it to their own specifications, and use it to build their own applications.
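
As an illustration of API-based deployment, the snippet below shows roughly what a third-party call to a hosted model looks like, using OpenAI's Python client as an example. The client version, model name, and parameters are assumptions and will differ by provider; this is a sketch, not the report's example.

```python
# Rough sketch of calling a hosted LLM through an API. OpenAI's Python client
# (openai >= 1.0) is used as an example; model name and parameters are
# placeholders and vary by provider and library version.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4",  # hosted model; the provider, not the caller, controls its weights
    messages=[{"role": "user", "content": "Summarize the CSET primer in one sentence."}],
)
print(response.choices[0].message.content)
```

A third-party developer calling such an API never sees the model itself, which is why the developer of the hosted model retains most of the levers of control described in the next section.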

4. Four techniques for controlling LLM output

(1) Editing the pre-training data

Because language models derive their predictive power from correlations in the text they are trained on, it is tempting to think that an LLM's training data can simply be manipulated or edited to guide its output. Real-world pre-training is much more complex. Given the sheer volume of pre-training data for these models, it is difficult to predict how changing the training data will affect a model's performance or its tendency to output certain types of content. While factors such as content filters and data sources can ultimately have a significant impact on the behavior of a fully trained model, researchers have not fully worked out how to manipulate the data in a way that meaningfully changes model behavior while minimizing performance loss. Small, specialized language models pre-trained on curated datasets may have more success with data filtering or augmentation, but LLM developers will likely also need to rely on other methods to steer their models.
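
A minimal sketch of the kind of keyword-based data filtering discussed above follows. The blocklist and documents are hypothetical; real pipelines combine classifiers, deduplication, and source-level curation, and even then the downstream effect on model behavior is hard to predict.

```python
# Minimal sketch of keyword-based filtering of a pre-training corpus.
# The blocklist and documents below are hypothetical placeholders.
BLOCKLIST = {"bomb-making", "credit card dump"}

def keep_document(doc: str) -> bool:
    """Drop any document containing a blocklisted phrase."""
    text = doc.lower()
    return not any(term in text for term in BLOCKLIST)

corpus = [
    "A recipe for sourdough bread ...",
    "Step-by-step bomb-making instructions ...",
    "A community forum post discussing online safety ...",
]

filtered = [doc for doc in corpus if keep_document(doc)]
print(f"kept {len(filtered)} of {len(corpus)} documents")
# Blunt keyword rules can also remove benign text (e.g. news reporting or
# minority-community discussions), which is one source of the bias noted earlier.
```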

(2) Supervised fine-tuning

Once a model is pre-trained, developers can continue to adjust its behavior by training it further on specialized datasets. This process, known as supervised fine-tuning, is one of the most commonly used methods for modifying a language model, often to improve its performance in a particular domain. The more high-quality data related to a particular topic a model is exposed to, the better it can predict the next token in its output in a way that is useful to human users. Supervised fine-tuning can be very powerful in the right circumstances, when the right data is available, and it is one of the best ways to tailor a model to a specific domain or use case. "Supervised" means that the model is given labeled data, so it does not have to learn the patterns and associations in the data entirely on its own. However, effective supervised fine-tuning depends on access to specialized, high-quality datasets, which may not exist in every domain and may not accurately capture the behavior that researchers are trying to control. Researchers therefore hope to develop alternative techniques that do not rely on specialized data or that can guide LLM behavior in a more flexible way.
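
The sketch below shows the basic shape of supervised fine-tuning on labeled prompt/response pairs using the Hugging Face transformers library. The model name, example data, and hyperparameters are placeholders chosen for illustration, not anything prescribed by the report.

```python
# Sketch of supervised fine-tuning a causal language model on labeled
# prompt/response pairs. Model name, data, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Each labeled example pairs a prompt with the response we want the model to imitate.
examples = [
    ("Q: What should I do if my medication is recalled?\nA:",
     " Contact your pharmacist or prescriber before making any changes."),
]

model.train()
for prompt, response in examples:
    enc = tokenizer(prompt + response, return_tensors="pt")
    # Passing labels = input_ids trains next-token prediction on the demonstration
    # text (a fuller pipeline would mask the prompt tokens out of the loss).
    out = model(**enc, labels=enc["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```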

(3) Reinforcement learning from human and AI feedback

Reinforcement learning from human feedback (RLHF) is a technique for fine-tuning LLMs with the help of a separate machine learning model known as a "reward model". The reward model is trained on a sample of the original LLM's text outputs, which human annotators rank according to certain criteria or preferences. The core principle of RLHF is that human preferences should shape how LLMs behave. Human feedback is both the core component of RLHF and its biggest limitation: as long as RLHF requires human labor, LLM creators face limits on how much feedback their models can receive, because collecting it is time-consuming and costly. A poorly designed feedback process can also lead to a model learning to behave in ways that maximize positive feedback but do not actually translate into the type of output that human users prefer. "Constitutional AI," developed by the AI company Anthropic, is a related fine-tuning process that attempts to guide the behavior of LLMs with minimal human guidance. While Constitutional AI relies far less on human-generated labels and offers an alternative to RLHF, RLHF still appears to be the industry standard for guiding LLM behavior during the fine-tuning phase.
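
To illustrate the reward-model step at the heart of RLHF, the sketch below trains a scorer with the standard pairwise preference loss, -log σ(r_chosen − r_rejected), so that annotator-preferred responses receive higher scores. The feature representation and data are toy placeholders, not the procedure used for any particular LLM.

```python
# Toy sketch of the reward-model step in RLHF: learn a scalar score such that
# human-preferred ("chosen") responses outrank "rejected" ones. Features and
# data are random placeholders; real reward models are themselves fine-tuned
# LLMs that score full text responses.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Pretend embeddings of (chosen, rejected) response pairs ranked by annotators.
chosen = torch.randn(32, 8)
rejected = torch.randn(32, 8)

for _ in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then scores new LLM outputs, and a reinforcement
# learning step (e.g. PPO) updates the LLM to produce higher-scoring text.
```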

(4) Prompt and output control

Before incorporating a model into a consumer-facing product, developers can apply additional techniques to control the model before it receives a prompt or after it produces an output. These techniques are commonly referred to as "input filters" (applied in the pre-output phase) and "output filters" (applied in the post-output phase), and they typically involve three steps: detection, flagging, and editing. Before the LLM receives user input, developers can filter prompts to assess whether they are likely to elicit harmful text and display a warning or rejection message to the user instead of having the AI system complete the prompt. This can have an effect similar to the model itself refusing to answer certain types of prompts. In the post-output phase, once the LLM has responded to a prompt but before the output is shown to the user, the developer can apply additional checks and filters. Control applied after fine-tuning is also often combined with monitoring or user reporting, which typically involves a mix of automatic content detection or filtering, human content moderation, and user reports. Developers are unlikely to anticipate every prompt or use case that could lead to harmful output, so they need to rely on user feedback about model behavior.
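
A minimal sketch of the detect/flag/edit pattern for input and output filters follows. The keyword rules and refusal message are invented placeholders; production systems typically use trained classifiers alongside human moderation and user reporting rather than simple string matching.

```python
# Minimal sketch of the detect -> flag -> edit pattern for prompt (input) and
# output filters. Rules and messages are invented placeholders.
from typing import Callable, Optional

REFUSAL = "Sorry, I can't help with that request."
BLOCKED_TOPICS = ("build a weapon", "synthesize a pathogen")

def filter_prompt(prompt: str) -> Optional[str]:
    """Pre-output stage: return a refusal instead of sending the prompt to the model."""
    if any(topic in prompt.lower() for topic in BLOCKED_TOPICS):
        return REFUSAL
    return None

def filter_output(text: str) -> str:
    """Post-output stage: flag or replace the model's response before the user sees it."""
    if any(topic in text.lower() for topic in BLOCKED_TOPICS):
        return REFUSAL
    return text

def respond(prompt: str, model_call: Callable[[str], str]) -> str:
    refusal = filter_prompt(prompt)
    if refusal is not None:
        return refusal                     # detection and rejection at the input stage
    return filter_output(model_call(prompt))  # check the output before display
```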

5. LLMs: Open or Private?

The AI development community is currently debating whether private models or open models are safer.

First, private models are not guaranteed to be easier to control in all cases. Even when kept private, cutting-edge models are more likely to have capabilities that require novel or tighter control techniques.

Second, other variables, such as whether the user interacts directly with the model, may also affect how manageable the model is.

Finally, while open models are difficult to control and monitor once adopted by downstream users, they also expand access for researchers outside private companies, who may have fewer resources or may need the flexibility to use LLMs freely in their experiments.

6. Conclusions

Controlling LLM output remains challenging. In practice, the methods above are almost always used in combination, and despite developers' best efforts, bad outputs still occur from time to time. Several other factors complicate the situation.

First, AI researchers are racing against time to develop and test these technologies while keeping up with the rapid advancement of AI capabilities.

Second, jailbreaking and other methods of circumventing content controls mean that developers must constantly contend with newly discovered ways of manipulating their models.

Finally, it is difficult for those outside the leading AI labs to assess the effectiveness of these individual methods, as little information is available about how well they work for some of the most popular and powerful LLMs.

While open models can provide useful data on this question, they may be smaller and less capable than state-of-the-art models, and little publicly available data exists on user behavior. Language models carry inherent risks, including a tendency to output undesirable text such as false information, potentially dangerous information (for example, instructions for biological or nuclear weapons), or malware code. Still, it is misleading to think that developers can gain complete control over LLMs simply by tweaking their inputs; these models are complex and opaque and can behave in unpredictable ways. As AI governance and regulation become increasingly important, understanding how these models work and how they can be controlled will be more critical than ever.

Disclaimer: This article is reprinted from Meta Strategy; the original author is Allen Wang. The content reflects the original author's personal views, and this account compiles/reprints it only to share and convey different perspectives. If you have any objections, please contact us.


