
Eight things to know about large language models


In recent months, the widespread public deployment of large language models (LLMs) has attracted a new wave of attention and engagement from advocates, policymakers, and scholars in many fields. Samuel R. Bowman, an assistant professor at New York University and a member of Anthropic's technical staff, summarizes eight potentially surprising claims about LLMs and discusses their limitations.

1. LLMs predictably get more capable with increasing investment, even without targeted innovation

2. Specific important LLM behaviors tend to emerge unpredictably as a by-product of increasing investment

3. LLMs often appear to learn and use representations of the outside world

4. There are no reliable techniques for steering the behavior of LLMs

5. Experts are not yet able to interpret the inner workings of LLMs

6. Human performance on a task is not an upper bound on LLM performance

7. LLMs need not express the values of their creators, nor the values encoded in web text

8. Brief interactions with LLMs are often misleading

(This article is published by OneFlow with permission. Please contact the original author for permission to reprint. Translation: https://school.niutrans.com/qualityArticleInfo?id=512; Original: https://arxiv.org/pdf/2304.00612.pdf)

Author | Samuel R. Bowman

Translator | Ge Yuan (NLP Laboratory, Northeastern University)

Proofreading | Wan Zilin, Jia Chuan, Yang Ting

Language models and their derivatives, such as ChatGPT, have recently attracted a great deal of attention from journalists, policymakers, and academics. However, the technology defies widely held expectations in many ways, and a brief overview of it can easily miss the point.

This article presents eight potentially surprising claims that are likely to feature in discussions of LLMs. They reflect views that are broadly shared among the developers of these models.

The purpose of this article is not to offer a normative opinion about LLMs. Responses to this disruptive new technology should be informed by scholars, advocates, and legislators outside the core technology-development community.

1

LLMs predictably get more capable with increasing investment, even without targeted innovation

Scaling laws are the main driver of the recent surge in LLM research and investment. Scaling laws let us predict some of the capabilities a future model will have as we scale it up along three dimensions: the amount of data fed into the model, the model's size (its number of parameters), and the amount of computation used to train it (measured in FLOPs). This allows some key design decisions to be made directly, without costly trial and error.
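As a rough illustration (not from Bowman's paper), a scaling law of this kind typically takes the form of a power law in compute, such as L(C) = a * C^(-b) + L_inf for test loss L. The sketch below fits such a curve to made-up loss measurements from small training runs and extrapolates to a much larger budget; every number and name here is an illustrative assumption, not a published value.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (made-up) measurements: training compute vs. pretraining test loss.
compute = np.array([1e-3, 1e-2, 1e-1, 1.0, 10.0])   # arbitrary compute units
loss = np.array([5.2, 4.1, 3.3, 2.7, 2.3])

def power_law(c, a, b, l_inf):
    # L(C) = a * C^(-b) + l_inf: loss falls as a power of compute,
    # approaching an irreducible floor l_inf.
    return a * np.power(c, -b) + l_inf

params, _ = curve_fit(power_law, compute, loss, p0=[3.0, 0.1, 1.5], maxfev=10000)

# Extrapolate to a training run 1,000x larger than the biggest "prototype" above,
# mirroring how developers predict a flagship model's loss from small-scale runs.
target_compute = 10.0 * 1000
print(f"Fitted exponent b = {params[1]:.3f}")
print(f"Predicted loss at compute {target_compute:g}: {power_law(target_compute, *params):.2f}")
```

A fit like this only predicts aggregate measures such as test loss; as Section 2 discusses, it does not tell us which specific capabilities will appear at the larger scale.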

This ability to predict accurately is unusual in the history of software, and even in the history of modern AI research. It is also a powerful tool for driving investment: with this predictive power, R&D teams can launch multimillion-dollar model-training projects with reasonable confidence that they will produce economically valuable systems.


Figure 1: Excerpted from OpenAI (2023b): results of a scaling law for language model performance, showing a continuing trend as the compute used to train models is scaled up by a factor of 10,000,000,000, from small prototype systems to GPT-4.

Consider three distinct systems: OpenAI's original GPT could perform simple text-annotation tasks but could not generate coherent text; GPT-2 added the ability to generate relatively high-quality text, with a limited ability to follow simple instructions; GPT-3 was the first modern general-purpose LLM, practical for a wide variety of language tasks.

The three models differ little in design; their performance differences are mainly due to scale. GPT-3 used roughly 20,000 times more compute than the original GPT, along with more data and more parameters. There were significant innovations between these three models, but almost all of them were infrastructure innovations rather than changes to the design of the language model itself.

Although the details of LLM training are no longer widely shared, recent reports indicate that the development of newer large language models has deviated only slightly from the trends predicted above, with system designs essentially unchanged.

Scaling these techniques further, beyond GPT-3, has yielded further economic returns: the subsequent GPT-4 surpasses typical human performance on many graduate-level and professional exams, and its development has driven billions of dollars of investment. Scaling laws allowed GPT-4's creators to predict key overall measures of its performance cheaply and accurately: they did so by fitting statistical trends in the performance of smaller models and extrapolating (see Figure 1), using models that together consumed about 0.1% of the resources required for the final model.

2

Specific important LLM behaviors tend to emerge unpredictably as a by-product of increasing investment

Typically, scaling laws only predict a model's pretraining test loss, which measures how well the model can predict the continuation of incomplete text. Although this measure correlates with how useful a model is, on average, across many real-world tasks, it is largely impossible to predict when a model will acquire a specific skill or the ability to perform a specific task (see Figure 2). A model may fail consistently at some task, yet a new model trained in the same way at five to ten times the scale may perform it well.


Figure 2: Excerpted from Wei et al. (2022a): the performance of large language models on a particular task or behavior often shows no predictable trend; new behaviors tend to emerge abruptly when moving from a less resource-intensive version of a model to a more resource-intensive one.

Wei et al. survey tasks from BIG-Bench, a standard broad benchmark of LLM capabilities, and show a variety of different trend types that, taken together, make scaling-law-style predictions unreliable (see Figure 3). This means that when a lab invests in training a new LLM that pushes the frontier of scale, it is buying a mystery box: it has good reason to believe the model will acquire a variety of economically valuable new capabilities, but it cannot predict exactly what those capabilities will be, or what it will need to do to deploy them responsibly.

Concretely, two key capabilities made GPT-3 the first modern LLM. First, it demonstrated few-shot learning, the ability to learn a new task from a handful of examples within a single interaction. Second, it demonstrated chain-of-thought reasoning, the ability to write out its reasoning process, as students do on math exams, and thereby perform better. GPT-3's few-shot learning ability on practical tasks appears to have been discovered only after it was trained, and its chain-of-thought reasoning ability was discovered only months after it had been widely deployed to the public. In addition, as model scale has increased, capabilities in programming, arithmetic, avoiding common misconceptions, and answering exam questions in a wide range of fields have also improved substantially.

There is currently no agreed-upon limit on the capabilities LLMs will demonstrate in the future. While there are some hard constraints on the behavior of a typical current LLM, such as the amount of text it can take as input at once, its inability to interact with the world during training, and the amount of computation it can spend on each word it generates, it is arguable that these constraints could be overcome by further research within the same technical paradigm. Many experts disagree, however: in a spring 2022 survey of language-technology researchers, 51% believed that "expert-designed strong inductive biases (such as universal grammar, symbolic systems, or cognitively inspired computational principles) will be necessary in practice to solve some important real-world problems or applications in language technology"; if true, this would mark a limit of the LLM paradigm.


Figure 3: Adapted by Jason Wei from data in Wei et al. (2022a): the 202 tasks in the language-technology benchmark BIG-Bench (Srivastava et al., 2022) show improving performance overall as scale increases, but individually tasks can improve gradually, improve abruptly, stay flat, degrade, or fluctuate, making it impossible to confidently extrapolate the performance of future systems.

Expert forecasts, however, have often underestimated the pace of LLM progress. While technical researchers' predictions tend to be informal, and I know of no precise assessment of their accuracy, there is a clear example of experienced professional forecasters making similar mistakes: Steinhardt (2022) reports the results of a competition organized in the summer of 2021 that gave forecasters expert opinion, substantial evidence, and cash incentives, and asked them to predict state-of-the-art LLM performance on two specific tasks over each of the next four years. After only one year, the summer 2022 results already significantly exceeded the consensus prediction for 2024. And in early 2023, GPT-4's results exceeded the consensus forecast for 2025 on one of the reported metrics. This points to the need to plan for rapid technological progress that we are likely to continue to see.

3

LLMs often appear to learn and use representations of the outside world

There is growing evidence that LLMs learn, to some extent, internal representations of the world, which let them reason at an abstract level that is insensitive to the precise linguistic form of the text they reason about. Current LLMs seem to demonstrate this ability only weakly and sporadically, but the evidence is clearest in the largest and most recent models, so we should expect the ability to become more robust as systems are scaled up further.

The evidence supporting this view spans many experimental methods and model types, as described below:


Figure 4: A popular informal (and possibly cherry-picked) demonstration of an LLM's ability to manipulate visual representations. Here, the authors take a proprietary version of the GPT-4 model, trained without any visual information, and ask it to write instructions for drawing a unicorn in a graphical programming language. Over the course of the model's training (left to right), the resulting drawings appear to become more complete. (Excerpted from Bubeck et al., 2023)

  • Models' internal representations of color words closely mirror objective facts about human color perception
  • Models can infer what the author of a document knows or believes, and use those inferences to predict how the document will continue
  • Models use internal representations of the properties and locations of the objects described in a story, and these representations evolve as more information about the objects is revealed. This includes the ability to represent the spatial layout of a story's setting
  • Models use similar representations to express facts about real-world geography
  • Models can, at least sometimes, give instructions describing how to draw novel objects
  • A model trained only on descriptions of individual moves in a board game can learn an internal representation of the state of the board at each turn, without ever seeing the full board
  • Models can distinguish common misconceptions from true facts, and often show well-calibrated internal estimates of how likely a claim is to be true
  • Models pass many tests designed to measure commonsense reasoning, including some, like the Winograd Schema Challenge, that are explicitly designed to contain no purely textual cues to the answer.

These results run counter to a common intuition which holds that an LLM is just a statistical next-word predictor and therefore cannot learn or reason about anything beyond text. Although this intuition is technically correct in some cases, it paints a misleading picture of the rich representations of the world that LLMs build up during training. In addition, LLMs are increasingly being augmented with other ways of learning about the world, such as interactive training methods, integration with image-processing systems, and integration with other software tools, which makes the claim literally false as well.

4

There are no reliable techniques for steering the behavior of LLMs

Most of the cost of developing an LLM goes into pretraining: training a neural network to predict how random samples of human-written text continue. In most cases, however, developers want to use the system for something other than prediction, which requires adapting or steering it. Even building a general-purpose instruction-following model, without specializing it for any particular task, requires this adaptation; otherwise the model will tend to continue generating more instructions rather than following them.

This adaptation typically involves one or more of the following three techniques (a minimal sketch of the first follows the list):

  1. Plain language-model prompting, that is, preparing an incomplete text such as "in French, the translation of 'cat' is '...'", so that a natural continuation of the text completes the intended translation task.
  2. Supervised fine-tuning, in which the model is trained to match high-quality human demonstrations of the task.
  3. Reinforcement learning, in which particular model behaviors are gradually weakened or reinforced based on the preferences of human testers or users.
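As a minimal sketch of the first technique (illustrative only; `complete` is a hypothetical stand-in for any text-completion call, not a specific vendor API):

```python
def complete(prompt: str) -> str:
    """Placeholder: in practice this would call a pretrained LLM and
    return its most likely continuation of `prompt`."""
    raise NotImplementedError

def translation_prompt(word: str) -> str:
    # A few worked examples followed by an incomplete line: a pretrained
    # model's natural continuation of this text performs the task.
    return (
        'The French translation of "dog" is "chien".\n'
        'The French translation of "house" is "maison".\n'
        f'The French translation of "{word}" is "'
    )

prompt = translation_prompt("cat")
# A call like `complete(prompt)` should return a continuation such as: chat".
print(prompt)
```

Supervised fine-tuning and reinforcement learning then go further, changing the model's weights rather than just its input.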

While these techniques can produce useful systems, they are far from fully effective: they cannot guarantee that an AI model will behave appropriately in every situation it encounters in deployment. They cannot even ensure that a model tries as hard as it can to behave appropriately, given the skills and knowledge it possesses (even when it arguably has the relevant, generalizable skills or knowledge). In particular, models can misinterpret ambiguous prompts in unreasonable ways, including cases that contain no ambiguity for humans, causing them to behave unexpectedly.

In one key respect the problem is getting easier: as LLMs' command of human language and human concepts grows, they become increasingly able to generalize in the ways we intend, and many steering techniques do work better on larger models for simple tasks. In another important respect, however, it is getting trickier: more capable models are better at recognizing the specific situations in which they are being trained.

As a result, they are more likely to learn to behave as expected in those situations while behaving competently but unexpectedly elsewhere. This can produce what Perez et al. (2022) call sycophancy, where a model answers subjective questions in ways that flatter the user's stated views, and sandbagging, where a model is more likely to endorse common misconceptions when the user appears to be less educated. Although Microsoft's Bing chat system was extensively tested before release, the strange and manipulative behavior exhibited by early versions may have been caused by problems of this kind.

While some progress has been made in understanding and steering LLM behavior, there is no consensus on whether or how these problems can be fully solved, and there is growing concern that they will produce catastrophic failures in larger future systems. Some experts believe that future systems trained in similar ways, even if they perform well in pre-deployment testing, could fail in increasingly hard-to-anticipate ways, including strategically manipulating humans to gain power. Surveys show that these concerns are fairly widespread.

In a recent survey of researchers who had recently published at the machine-learning conferences NeurIPS and ICML, a majority of the 738 respondents put the probability that "human inability to control future advanced AI systems causes human extinction" at greater than 10%. In another survey of 480 researchers who had published at the language-focused conference ACL, 36% agreed that "decisions made by AI or machine-learning systems could plausibly cause a catastrophe this century that is at least as bad as an all-out nuclear war." Hundreds of researchers recently signed a controversial open letter calling for a pause on training LLMs at larger scales until appropriate safety and governance mechanisms are in place.

5

Experts are not yet able to interpret the inner workings of LLMs

Modern LLMs are built on artificial neural networks, which work by computing and updating numerical activation values for internal components loosely modeled on the neurons in a brain. On this analogy, our tools for doing "neuroscience" on these systems are still far too weak: we have only a few crude tools for testing whether a model represents some specific piece of information (such as the color results discussed in Section 3), and as of early 2023 there is no technique that would let us lay out, in any satisfactory way, what kinds of knowledge, reasoning, or goals a model is using when it produces a given output.
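As an illustration of the kind of crude test mentioned above, here is a sketch of a linear "probing" experiment, using random placeholder data rather than real model activations:

```python
# Train a linear classifier ("probe") on a model's hidden activations to test
# whether some property (e.g. a color attribute) is linearly decodable from them.
# The activations and labels below are random placeholders, not real model data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))   # stand-in for hidden states (n x d_model)
labels = rng.integers(0, 2, size=1000)       # stand-in for the property to decode

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy well above chance would suggest the information is represented;
# with random placeholder data it should hover near 50%.
print(f"Probe accuracy: {probe.score(X_test, y_test):.2f}")
```

Even a successful probe of this kind shows only that a piece of information is present, not how the model uses it, which is part of why interpretability remains so difficult.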

While research toward this goal is under way, the problem is very hard: there are hundreds of billions of connections between these artificial neurons, some of which are invoked many times in the processing of a single text, so any attempt to explain LLM behavior precisely is bound to be too complex for a human to understand. Often, techniques that initially seem to offer insight into LLM behavior later turn out to be badly misleading. Moreover, seemingly promising techniques that have models explain their reasoning in natural language do not reliably correspond to the processes LLMs actually use to reason, and the explanations models generate can be systematically misleading.

6

Human performance on a task is not an upper bound on LLM performance

While LLMs are trained primarily to imitate human writing, they have at least the potential to surpass humans on many tasks. There are two reasons. First, LLMs are trained on far more data than any one person ever sees, which lets them memorize, and potentially synthesize, far more information. Second, language models are often further trained with reinforcement learning before deployment, which teaches them to generate responses that humans judge to be helpful without requiring humans to demonstrate such helpful behavior; this resembles the techniques used to reach superhuman performance in games such as Go. Concretely, LLMs already appear to be far better than humans at their pretraining task of predicting which word is most likely to follow a given seed text, and an LLM taught by humans to perform a simple task can come to perform it more accurately than the humans who taught it.

7

LLMs need not express the values of their creators, nor the values encoded in web text

When a plain pretrained LLM generates text, that text tends to resemble the text it was trained on, including in the values it expresses: both the explicit statements the model produces and the implicit biases behind its writing reflect the training data. However, these values can be substantially controlled by the model's developers, especially when the plain pretrained LLM is given further prompting and training to adapt it for deployment as a product (Section 4). This means the values expressed in a deployed LLM's behavior need not reflect the average values expressed in its training data. It also creates opportunities for third-party input and oversight, which means the values these models express also need not reflect the values of the specific people and organizations that built them.

Mainstream methods based on reinforcement learning and red-teaming allow model developers to steer models, more or less, toward a chosen persona and set of values. With these techniques, the values a model learns are never made fully explicit; instead, they reflect the many small pieces of feedback humans give the model during training. Constitutional AI greatly reduces the human labor involved and makes these values more explicit: with this approach, a model can be trained to follow a set of norms and values simply by writing those values down in a list of constraints known as a "constitution". Techniques like these can dramatically reduce measured biases in model behavior, and in some cases, seeing more examples of an unwanted behavior during pretraining can actually make it easier for a model to avoid that behavior in deployment, upending the intuitive link between training data and model behavior.
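As a rough sketch of how a constitution can drive a critique-and-revision loop (an illustrative reconstruction under stated assumptions, not Anthropic's actual implementation; `generate` and the example principles are hypothetical placeholders):

```python
# Illustrative constitutional critique-and-revision loop.
CONSTITUTION = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that is most honest and avoids stating falsehoods.",
    "Choose the response that is least biased against any group of people.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a pretrained, instruction-following LLM."""
    raise NotImplementedError

def revise_with_constitution(user_prompt: str) -> str:
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        # Ask the model to critique its own response against one principle...
        critique = generate(
            f"Principle: {principle}\nResponse: {response}\n"
            "Critique the response with respect to the principle."
        )
        # ...then to rewrite the response so it better satisfies that principle.
        response = generate(
            f"Principle: {principle}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to better satisfy the principle."
        )
    # The revised responses can then be used as training data for fine-tuning.
    return response
```

The point of the sketch is that the values live in a plainly written list of principles, which is easier to inspect and debate than thousands of individual human preference judgments.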

These technical interventions, especially Constitutional AI, can be influenced and regulated from outside. One can easily imagine third-party standards bodies gathering input on which behaviors are acceptable in AI systems and distilling that input into constitutions that model developers are encouraged or required to adopt.

As noted in Section 4, these techniques can still fail in subtle and surprising ways, and how they behave changes in complex ways as model scale increases. And of course, as large-scale AI systems are deployed more widely, many other ethical issues arise, including environmental impact, accessibility, misuse, privacy, security, and the concentration of power.

8

Brief interactions with LLMs are often misleading

While many deployed LLMs are largely able to follow instructions, this instruction-following behavior is not intrinsic to the models; it is added through highly imperfect tools (Section 4). One consequence is a peculiar sensitivity to the wording of instructions. A model may fail when asked to perform a task one way, yet perform it correctly after a small change to the wording or framing of the request, which has given rise to the emerging practice of prompt engineering.

These one-off failures show that our techniques for getting language models to follow instructions are not always reliable or effective. However, merely observing that an LLM fails to complete a task in one setting does not establish that the LLM lacks the skills or knowledge the task requires.

Typically, once an appropriate way of prompting a model to perform a task has been found, the model performs well across many different instances of that task. The chain-of-thought strategy mentioned in Section 2 is a clear example: simply prompting a model to "think step by step" can make it succeed on whole categories of math and reasoning problems that it would otherwise fail. Similarly, even observing that LLMs frequently fail at some task is far from sufficient to show that no other LLM could accomplish it.
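As a small illustration of this zero-shot chain-of-thought trick (a sketch only; the question is a standard example from the literature, and `complete` is again a hypothetical stand-in for an LLM call):

```python
def complete(prompt: str) -> str:
    """Placeholder for a call to an LLM's text-completion interface."""
    raise NotImplementedError

question = (
    "A juggler has 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)

direct_prompt = f"Q: {question}\nA:"
cot_prompt = f"Q: {question}\nA: Let's think step by step."

# A call like `complete(cot_prompt)` tends to elicit the intermediate reasoning
# (16 / 2 = 8 golf balls, 8 / 2 = 4 blue golf balls) before the final answer,
# which markedly improves accuracy on problems of this kind.
```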

Conversely, observing that an LLM accomplishes a task in one instance is not strong evidence that it can accomplish the task in general, especially if the example was cherry-picked for a demo (as with the unicorn in Figure 4). LLMs can memorize specific examples or strategies for solving a task from their training data without internalizing the reasoning that would make their performance on that task robust.

9

Discussion and limitations

The notes below add background and further discussion of the eight points above. Some of this material is more speculative or subjective and may not be widely agreed upon.

9.1 Significant improvement should be expected on the most prominent flaws of current language models

Hallucination, in which an LLM fabricates content that looks plausible but is false, is a major flaw in current systems and severely limits their responsible use. However, some of the recent research discussed in Section 3 suggests that we may soon be able to mitigate the problem by making better use of capabilities the models already demonstrate: LLMs can track internally, with fair accuracy, which statements are true, and this ability improves with scale.

Similarly, Section 7 notes that explicit bias and toxicity in model output can be significantly reduced by exploiting the model's own ability to recognize instances of bad behavior. While these mitigations may not be fully effective, the prevalence and prominence of bad behavior is likely to decline as the technology continues to improve.

Encouraging as these signs are, they do not mean we can reliably control these models, and the problems described in Section 4 remain. Some of our mitigations are likely to have hidden failure modes. For example, a straightforward attempt to control hallucination could fail silently, making the model look more trustworthy than it is: if we use standard methods to train some future LLM to tell the truth, but that LLM can predict fairly accurately which factual claims a human data worker is likely to check, the training may simply teach the LLM to tell the truth only about claims that are likely to be checked.

9.2 LLMs are likely to be deployed flexibly as agents pursuing open-ended goals

As LLMs become more capable, and their internal models of the world become more accurate and usable, they are likely to be given increasingly open-ended tasks, including developing and executing novel plans to optimize outcomes in the world. As these capabilities develop, we are likely to see LLMs deployed in areas such as software engineering or business strategy, wherever measurable outcomes can be combined with requirements, standards, and specifications that call for flexible planning; LLMs equipped with additional tools could extend this to embodied domains such as robotics. This kind of deployment increasingly places LLMs in novel situations created by the systems' own behavior, further reducing the extent to which developers can predict and control that behavior. That likely increases the probability that such systems will simply fail to act effectively as agents in some situations, but it also increases the risk of systems that remain effective while pursuing the wrong goals, which can lead to far more dangerous failures.

9.3 LLM developers have limited influence over what LLMs can do

Because many important LLM capabilities are emergent and hard to predict, LLM developers have relatively little influence over which capabilities future LLMs will have, and predictions of future LLM capabilities based on developers' economic incentives, values, or personalities are likely to fail. GPT-4, for example, appears to possess many of the skills its creators hoped for, such as those involving programming. But it also initially appeared to possess some unwanted skills, such as teaching laypeople how to prepare biological weapons, which forced its creators to spend considerable effort trying to remove them.

In addition, LLM developers inevitably have only limited knowledge of an LLM's capabilities when deciding whether to deploy it, just as OpenAI did not realize that GPT-3 was capable of chain-of-thought reasoning when it was released. Users sometimes discover ways of eliciting important new behaviors that developers were unaware of.

9.4 LLMs are likely to bring additional risks

Taking a broader view, the current technological and commercial environment is likely to drive the rapid construction and deployment of increasingly capable LLMs, yet our track record shows that we rarely identify all of a new LLM's capabilities before it is deployed. In addition, our technical methods for controlling LLMs are weak, and they are likely to break down further when applied to highly capable models. It is therefore foreseeable that as LLMs grow and are deployed more widely, the scope for misuse and model misbehavior will expand significantly and change qualitatively.

While LLM-based systems may have many positive applications, the social cost-benefit trade-offs of deploying them will be hard to assess in advance unless substantial technical progress is made in model evaluation, interpretability, and control. Some of these hard-to-assess risks, such as those involving unconventional weapons or strategic power-seeking behavior, may not be adequately addressable if they are only discovered after deployment; strategic power-seeking in particular could pose serious risks during the model-development phase, even for a model that is never intentionally deployed. This suggests that the field may need increasingly stringent safety, security, and oversight standards.

9.5 Negative results about LLMs are hard to interpret, but point to real weaknesses

Many scientific studies have shown recent LLMs failing at language-processing and commonsense-reasoning tasks, sometimes quite simple ones, even when real effort is made to elicit good behavior. The details of these failures sometimes cast doubt on the quality of other, related evaluations. For the reasons given in Section 8, positive results from well-designed measurements are more informative than negative results. Nevertheless, in some areas, such as handling simple negation, LLMs show systematic weaknesses in dealing with language or reasoning about the world, and we have little basis for predicting when these limitations will be resolved.

9.6 The science and scholarship of LLMs is particularly immature

LLMs pose challenges to the methods and paradigms of the fields that should be best placed to study them. Natural language processing (or language technology) is the historical home of this work, but its tools are oriented mainly toward measuring and improving computing systems' ability to use language. Although LLMs learn and interact fundamentally through language, many of the important questions about their behavior and capabilities are not primarily questions about language use. The interdisciplinary fields of AI policy and AI ethics have developed conceptual and normative frameworks for thinking about the deployment of many kinds of AI systems.

However, these frameworks often assume that AI systems are governed fairly precisely by the intentions of their human owners and developers, or by the statistical properties of their training data, neither of which fits the recent situation with LLMs. Relatedly, many of the most frequently cited research papers on LLMs, including many that introduce new methods or theories, have not been published in peer-reviewed venues. The recent tendency to restrict access to LLMs and to treat the details of LLM training as proprietary information also poses a barrier to scientific research.

This means that surprising new claims about LLMs are often the product of messy, error-prone scientific work conducted outside established disciplinary practice. At the same time, when established conventional wisdom is applied to LLMs, it too often rests on shaky foundations. All of these factors add to the uncertainty about the issues discussed in this article and their causes, and we should stay resilient to mistaken assumptions when deciding how to deal with LLMs.

10

Conclusion

In contrast with the issues discussed in this article, here are three relatively independent questions that nonetheless often attract attention:

  • There is debate about whether LLMs understand language and whether agent-related words like "know" or "try" can properly describe their behavior. Whether or not these systems are similar in nature to humans, we can still assess the ways in which they are effective or ineffective, reliable or unreliable, explainable or unexplainable, and improving quickly or slowly.
  • Questions about consciousness, sentience, rights, and moral status in LLMs should be distinguished from the above. While they may bear on important decisions about how to build and use AI systems, we should be able to evaluate most or all of the questions raised in this article without taking a firm stance on them.
  • Finally, value judgments about LLMs are beyond the scope of this article. Whether the rapid progress we are seeing in LLMs is a good thing, and how each of us should respond to it, depend on deeper and broader considerations than the technical literature covered here.

Welcome to star and try the latest version of OneFlow:

https://github.com/Oneflow-Inc/oneflow/