How does chain of thought unlock the hidden capabilities of language models? The latest theoretical research reveals the mystery behind it

Author: Heart of the Machine Editorial Office

Chain-of-thought prompting (CoT) is one of the most mysterious emergent phenomena of large models, especially in solving mathematical reasoning and decision-making problems. How important is CoT, and what is the mechanism behind its success? In this paper, researchers at Peking University prove that CoT is indispensable for realizing the reasoning ability of large language models (LLMs), and reveal, from both theoretical and experimental perspectives, how CoT unlocks the enormous potential of LLMs.

Recent studies have found that chain-of-thought prompting (CoT) can significantly improve the performance of large language models (LLMs), especially on complex tasks involving mathematics or reasoning. But despite its many successes, the mechanism behind CoT, and how it unlocks the potential of LLMs, remains elusive.

Recently, a new study from Peking University has revealed the mystery behind CoT from a theoretical perspective.

Link to the paper: https://arxiv.org/abs/2305.15408

Transformer-based large language models have become general-purpose models in natural language processing and are widely used across tasks. Mainstream large models usually follow the autoregressive paradigm: a variety of different tasks (such as text translation, text generation, and question answering) are uniformly cast as sequence generation problems, in which the question and its description are encoded together into a sequence of words (tokens) called the prompt; answering the question then amounts to conditionally generating the subsequent tokens given the prompt.
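To make the autoregressive view concrete, here is a minimal sketch of conditional generation given a prompt (the tokenization and the `next_token` predictor are placeholders, not the paper's setup):

```python
from typing import Callable, List

def autoregressive_generate(
    prompt_tokens: List[str],
    next_token: Callable[[List[str]], str],  # placeholder for an LLM's next-token predictor
    max_new_tokens: int = 32,
    eos: str = "<eos>",
) -> List[str]:
    """Answer a question by conditionally generating tokens after the prompt."""
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(sequence)  # condition on the prompt plus everything generated so far
        sequence.append(token)
        if token == eos:
            break
    return sequence[len(prompt_tokens):]  # the generated answer

# Toy usage with a dummy predictor that immediately ends the sequence.
print(autoregressive_generate(["What", "is", "2", "+", "3", "?"], lambda seq: "<eos>"))
```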

A large body of work on large models has shown that well-designed prompts play a crucial role in model performance. Especially for arithmetic- or reasoning-related tasks, CoT has been shown to greatly improve the correctness of generated answers. As shown in the figure below, for a task requiring mathematical reasoning, the answers a large model generates directly are often wrong (a, b below). But if the prompt is modified so that the large model outputs the entire chain of thought (the intermediate derivation steps), the correct answer is finally obtained (c, d below).

[Figure: a mathematical reasoning example where directly generated answers (a, b) are wrong, while chain-of-thought prompting (c, d) yields the correct answer]

In practice, there are two mainstream ways to implement chain-of-thought prompting: one is to add a specific trigger phrase to the prompt, such as "Let's think step by step" (Figure c above); the other is to provide a small number of chain-of-thought demonstrations so that the large model imitates the corresponding derivation process (Figure d above).
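As a minimal sketch, the two styles can be assembled as plain strings (the trigger phrase is quoted from the text above; the demonstration and question are our own illustrative examples):

```python
question = "If a store sells 3 apples for 2 dollars, how much do 12 apples cost?"

# Style 1: add a trigger phrase that elicits step-by-step reasoning (Figure c).
zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."

# Style 2: prepend a few demonstrations whose answers spell out the derivation (Figure d).
demonstration = (
    "Q: If 2 pens cost 6 dollars, how much do 5 pens cost?\n"
    "A: 2 pens cost 6 dollars, so one pen costs 6 / 2 = 3 dollars. "
    "Then 5 pens cost 5 * 3 = 15 dollars. The answer is 15.\n\n"
)
few_shot_cot_prompt = demonstration + f"Q: {question}\nA:"
```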

However, despite the remarkable performance of CoT in a large number of experiments, the theoretical mechanism behind it remains a mystery. On the one hand, do large models really have an inherent theoretical flaw when directly answering questions about mathematics and reasoning? On the other hand, why does CoT improve the capabilities of large models on these tasks? This paper answers these questions from a theoretical standpoint.

Specifically, the researchers study CoT from the perspective of model expressiveness: for mathematical tasks and general decision-making tasks, the paper analyzes the expressive power of autoregressive Transformer models in two settings: (1) directly generating the answer, and (2) using CoT to generate the complete solution step by step.

CoT is the key to solving mathematical problems

Large models such as GPT-4 have already demonstrated astonishing mathematical capabilities: they can solve most high-school math problems correctly and have even become research assistants for mathematicians.

In order to study the mathematical capabilities of large models, this paper selects two very basic but core mathematical tasks: arithmetic and equations (the figure below shows example inputs and outputs for these two tasks). Since they are basic components of solving complex mathematical problems, studying these two core problems gives a deeper understanding of the capabilities of large models on general mathematical problems.

[Figure: example inputs and outputs for the arithmetic and equation tasks]
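Since the original figure is not reproduced here, the following hypothetical input/output pairs only indicate the format of the two tasks (they are illustrative, not the paper's exact instances):

```python
# Arithmetic expression evaluation: the prompt encodes an expression, the answer its value.
arithmetic_example = {"prompt": "(3 + 4) * 5 =", "answer": "35"}

# Equation solving: the prompt encodes a small linear system, the answer its solution.
equation_example = {"prompt": "x + y = 5 ; x - y = 1 ; x = ? y = ?", "answer": "x = 3, y = 2"}
```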

The researchers first explored whether the Transformer can output answers to the above problems without producing intermediate steps. They considered an assumption that matches reality very well: the log-precision Transformer, in which each neuron can only represent a floating-point number of finite precision (on the order of log n bits, where n is the maximum sentence length). This assumption is very close to practice: in GPT-3, for example, the machine precision (16 or 32 bits) is typically far smaller than the maximum output length (2048).
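A quick sanity check of this assumption, using the numbers quoted above (2048 tokens, 16- or 32-bit precision):

```python
import math

max_len = 2048                               # maximum sequence length n mentioned above
bits_needed = math.ceil(math.log2(max_len))  # O(log n) bits; log2(2048) = 11
for precision in (16, 32):                   # typical floating-point precision in bits
    # Log-precision asks that the number of bits per neuron scales like log n, which is
    # comfortably satisfied here (11 <= 16 and 11 <= 32), while both precisions are far
    # smaller than n = 2048 itself.
    print(f"{precision}-bit precision vs log2(n) = {bits_needed}: ok = {precision >= bits_needed}")
```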

Under this assumption, the researchers proved a core impossibility result: for an autoregressive Transformer model with a constant number of layers and width d to solve the above two mathematical problems by directly outputting the answer, the width d must be extremely large. Specifically, d must grow super-polynomially with the input length n.
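Written out formally (our notation, paraphrasing the statement above rather than quoting the paper), the width cannot stay polynomially bounded in the input length:

$$\text{for every constant } k > 0,\qquad d(n) \neq O(n^k), \quad\text{i.e.}\quad d(n) = n^{\omega(1)}.$$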

The essential reason behind this result is that neither problem admits an efficient parallel algorithm, so the Transformer, a quintessentially parallel model, cannot solve them. The paper rigorously proves the theorem using circuit complexity theory from theoretical computer science.

So what happens if the model does not output the answer directly, but instead outputs the intermediate derivation steps in the form shown in the figure above? By an explicit construction, the researchers further proved that once the model is allowed to output intermediate steps, a fixed-size (independent of the input length n) autoregressive Transformer can solve both mathematical problems.

Compared with the previous result, this shows that adding CoT greatly improves the expressive power of large models. The researchers also gave an intuitive explanation: introducing CoT continuously feeds the generated output tokens back into the input layer, which greatly increases the effective depth of the model, making it proportional to the length of the CoT output and thereby vastly increasing the depth of serial computation the Transformer can express.
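A back-of-the-envelope illustration of this intuition (the layer count and CoT length below are made-up numbers, not from the paper):

```python
num_layers = 3           # depth L of the Transformer
direct_answer_tokens = 1
cot_tokens = 40          # length of the generated chain of thought (illustrative)

# Without CoT, the computation from prompt to answer passes through the network once: depth ~ L.
# With CoT, every generated token is appended to the input before the next forward pass,
# so the serial depth of the overall computation grows to ~ L * (number of generated tokens).
print("effective depth, direct answer:", num_layers * direct_answer_tokens)  # 3
print("effective depth, with CoT:", num_layers * cot_tokens)                 # 120
```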

CoT is key to solving general decision-making problems

Beyond mathematical problems, the researchers further considered CoT's ability on general tasks. Starting from decision-making problems, they considered a general framework for solving them: dynamic programming.

The basic idea of dynamic programming (DP) is to break a complex problem into a sequence of small-scale subproblems that can be solved in order. The decomposition ensures significant overlap (correlation) among the subproblems, so that each subproblem can be solved efficiently using the answers to the previous ones.

Longest increasing subsequence (LIS) and edit distance (ED) are two well-known DP problems discussed in the book "Introduction to Algorithms"; the table below lists the state space, transition function, and aggregation function for each.

[Table: state spaces, transition functions, and aggregation functions for the LIS and ED problems]
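For concreteness, here is a textbook rendering of the two problems as dynamic programs (standard implementations for illustration, not the paper's formal state/transition/aggregation notation):

```python
def longest_increasing_subsequence(a):
    """dp[i] = length of the longest increasing subsequence ending at position i."""
    n = len(a)
    dp = [1] * n
    for i in range(n):
        for j in range(i):
            if a[j] < a[i]:                      # transition: extend a shorter subsequence
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp) if n else 0                   # aggregation: best value over all end positions

def edit_distance(s, t):
    """dp[i][j] = minimum number of edits turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j                 # base case: insert or delete everything
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + 1,            # delete s[i-1]
                    dp[i][j - 1] + 1,            # insert t[j-1]
                    dp[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # substitute (or match)
                )
    return dp[m][n]                              # aggregation: the answer sits in the final state

print(longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6]))  # 4  (e.g. 1, 4, 5, 9)
print(edit_distance("kitten", "sitting"))                        # 3
```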

The researchers proved that an autoregressive Transformer can output the complete dynamic-programming thought chain in the order in which subproblems are solved, and can therefore output correct answers for all tasks solvable by dynamic programming. They also showed that generating the thought chain is necessary: for many hard dynamic-programming problems, a constant-depth, polynomial-size Transformer cannot directly output the correct answer. The paper gives context-free grammar membership testing as a counterexample.
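A minimal sketch of what such a thought chain could look like for LIS (the textual format is our own; the paper's construction works over the model's token sequence rather than English-style text):

```python
def lis_chain_of_thought(a):
    """Emit every DP subproblem in solving order, then the aggregated answer."""
    dp, steps = [], []
    for i, x in enumerate(a):
        best = max([dp[j] for j in range(i) if a[j] < x], default=0)
        dp.append(best + 1)
        steps.append(f"dp[{i}] = {dp[i]}")       # one chain-of-thought step per subproblem
    steps.append(f"answer = {max(dp, default=0)}")
    return " ; ".join(steps)

print(lis_chain_of_thought([3, 1, 4, 1, 5]))
# dp[0] = 1 ; dp[1] = 1 ; dp[2] = 2 ; dp[3] = 1 ; dp[4] = 3 ; answer = 3
```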

Experiments

Finally, the researchers designed extensive experiments to test the theory, considering four different tasks: arithmetic expression evaluation, solving systems of linear equations, the longest increasing subsequence, and edit distance.

The experimental results show that, when trained on CoT data, a 3-layer autoregressive Transformer can already achieve nearly perfect performance on all four tasks, whereas directly outputting the answer performs poorly on all of them (even with deeper models). This result clearly demonstrates both the ability of autoregressive Transformers to solve such complex tasks and the importance of CoT in solving them.
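A rough sketch of how CoT training pairs for the arithmetic task might be assembled (the expression format and step-by-step reduction below are our illustration, not the paper's data pipeline):

```python
import random
import re

def random_expression(num_ops):
    """Build a random, fully parenthesized expression with num_ops binary operators."""
    if num_ops == 0:
        return str(random.randint(0, 9))
    left_ops = random.randint(0, num_ops - 1)
    left = random_expression(left_ops)
    right = random_expression(num_ops - 1 - left_ops)
    return f"({left} {random.choice('+-*')} {right})"

def cot_target(expr):
    """Reduce the innermost parenthesized sub-expression one step at a time,
    recording each intermediate expression as one chain-of-thought step."""
    steps = [expr]
    pattern = re.compile(r"\((-?\d+) ([+\-*]) (-?\d+)\)")
    while (m := pattern.search(expr)) is not None:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        value = a + b if op == "+" else a - b if op == "-" else a * b
        expr = expr[:m.start()] + str(value) + expr[m.end():]
        steps.append(expr)
    return " = ".join(steps)

random.seed(0)
expr = random_expression(3)
print(expr, "->", cot_target(expr))  # training pair: prompt = expression, target = CoT string
```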

[Figure: experimental results on the four tasks, with and without CoT]

The researchers also explored whether the learned autoregressive model can extrapolate to longer inputs. They built a CoT training dataset for the arithmetic task with 1 to 15 operators and tested the model on expressions with more operators than seen in training. As shown in Figure 3 below, the three-layer Transformer still performed well on these longer sequences, indicating that the model has, to some extent, learned the underlying mechanism. The researchers therefore believe that a model trained on more data of varying lengths could eventually learn the complete arithmetic rules.

[Figure 3: length extrapolation results on the arithmetic task]
