How does chain of thought unlock the hidden capabilities of language models? The latest theoretical research reveals the mystery behind it

Author: Heart of the Machine Editorial Office

Chain-of-thought prompting (CoT) is one of the most mysterious emergent phenomena of large models, especially in solving mathematical reasoning and decision-making problems. How important is CoT, and what is the mechanism behind its success? In this paper, researchers at Peking University prove that CoT is indispensable for realizing the reasoning ability of large language models (LLMs), and reveal, from both theoretical and experimental perspectives, how CoT unlocks the enormous potential of LLMs.

Recent studies have found that chain-of-thought prompting (CoT) can significantly improve the performance of large language models (LLMs), especially on complex tasks involving mathematics or reasoning. But despite its many successes, the mechanism behind CoT, and how it unlocks the potential of LLMs, remains elusive.

Recently, a new study from Peking University has revealed the mystery behind CoT from a theoretical perspective.

Link to the paper: https://arxiv.org/abs/2305.15408

Transformer-based large language models have become general-purpose models in natural language processing and are widely used across tasks. Mainstream large models usually follow the autoregressive paradigm: a variety of different tasks (such as text translation, text generation, and question answering) are uniformly cast as sequence generation problems, in which the question and its description are encoded together into a sequence of words (tokens) called the prompt; answering the question then amounts to conditionally generating the subsequent tokens given the prompt.
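To make the autoregressive view concrete, here is a minimal sketch of conditional generation given a prompt (the tokenization and the `next_token` predictor are placeholders, not the paper's setup):

```python
from typing import Callable, List

def autoregressive_generate(
    prompt_tokens: List[str],
    next_token: Callable[[List[str]], str],  # placeholder for an LLM's next-token predictor
    max_new_tokens: int = 32,
    eos: str = "<eos>",
) -> List[str]:
    """Answer a question by conditionally generating tokens after the prompt."""
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(sequence)  # condition on the prompt plus everything generated so far
        sequence.append(token)
        if token == eos:
            break
    return sequence[len(prompt_tokens):]  # the generated answer

# Toy usage with a dummy predictor that immediately ends the sequence.
print(autoregressive_generate(["What", "is", "2", "+", "3", "?"], lambda seq: "<eos>"))
```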

A large body of work on large models has shown that well-designed prompts play a crucial role in model performance. Especially for arithmetic- or reasoning-related tasks, CoT has been shown to greatly improve the correctness of generated answers. As shown in the figure below, for a task requiring mathematical reasoning, the answers a large model generates directly are often wrong (a, b below). But if the prompt is modified so that the large model outputs the entire chain of thought (the intermediate derivation steps), the correct answer is finally obtained (c, d below).

[Figure: a mathematical reasoning example where directly generated answers (a, b) are wrong, while chain-of-thought prompting (c, d) yields the correct answer]

In practice, there are two mainstream ways to implement chain-of-thought prompting: one is to add a specific trigger phrase to the prompt, such as "Let's think step by step" (Figure c above); the other is to provide a small number of chain-of-thought demonstrations so that the large model imitates the corresponding derivation process (Figure d above).
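As a minimal sketch, the two styles can be assembled as plain strings (the trigger phrase is quoted from the text above; the demonstration and question are our own illustrative examples):

```python
question = "If a store sells 3 apples for 2 dollars, how much do 12 apples cost?"

# Style 1: add a trigger phrase that elicits step-by-step reasoning (Figure c).
zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."

# Style 2: prepend a few demonstrations whose answers spell out the derivation (Figure d).
demonstration = (
    "Q: If 2 pens cost 6 dollars, how much do 5 pens cost?\n"
    "A: 2 pens cost 6 dollars, so one pen costs 6 / 2 = 3 dollars. "
    "Then 5 pens cost 5 * 3 = 15 dollars. The answer is 15.\n\n"
)
few_shot_cot_prompt = demonstration + f"Q: {question}\nA:"
```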

However, despite the remarkable performance of CoT in a large number of experiments, the theoretical mechanism behind it remains a mystery. On the one hand, do large models really have an inherent theoretical flaw when directly answering questions about mathematics and reasoning? On the other hand, why does CoT improve the capabilities of large models on these tasks? This paper answers these questions from a theoretical standpoint.

Specifically, the researchers study CoT from the perspective of model expressiveness: for mathematical tasks and general decision-making tasks, the paper analyzes the expressive power of autoregressive Transformer models in two settings: (1) directly generating the answer, and (2) using CoT to generate the complete solution step by step.

CoT is the key to solving mathematical problems

Large models such as GPT-4 have already demonstrated astonishing mathematical capabilities: they can solve most high-school math problems correctly and have even become research assistants for mathematicians.

In order to study the mathematical capabilities of large models, this paper selects two very basic but core mathematical tasks: arithmetic and equations (the figure below shows example inputs and outputs for these two tasks). Since they are basic components of solving complex mathematical problems, studying these two core problems gives a deeper understanding of the capabilities of large models on general mathematical problems.

[Figure: example inputs and outputs for the arithmetic and equation tasks]
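Since the original figure is not reproduced here, the following hypothetical input/output pairs only indicate the format of the two tasks (they are illustrative, not the paper's exact instances):

```python
# Arithmetic expression evaluation: the prompt encodes an expression, the answer its value.
arithmetic_example = {"prompt": "(3 + 4) * 5 =", "answer": "35"}

# Equation solving: the prompt encodes a small linear system, the answer its solution.
equation_example = {"prompt": "x + y = 5 ; x - y = 1 ; x = ? y = ?", "answer": "x = 3, y = 2"}
```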

The researchers first explored whether the Transformer can output answers to the above problems without producing intermediate steps. They considered an assumption that matches reality very well: the log-precision Transformer, in which each neuron can only represent a floating-point number of finite precision (on the order of log n bits, where n is the maximum sentence length). This assumption is very close to practice: in GPT-3, for example, the machine precision (16 or 32 bits) is typically far smaller than the maximum output length (2048).
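A quick sanity check of this assumption, using the numbers quoted above (2048 tokens, 16- or 32-bit precision):

```python
import math

max_len = 2048                               # maximum sequence length n mentioned above
bits_needed = math.ceil(math.log2(max_len))  # O(log n) bits; log2(2048) = 11
for precision in (16, 32):                   # typical floating-point precision in bits
    # Log-precision asks that the number of bits per neuron scales like log n, which is
    # comfortably satisfied here (11 <= 16 and 11 <= 32), while both precisions are far
    # smaller than n = 2048 itself.
    print(f"{precision}-bit precision vs log2(n) = {bits_needed}: ok = {precision >= bits_needed}")
```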

Under this assumption, the researchers proved a core impossibility result: for an autoregressive Transformer model with a constant number of layers and width d to solve the above two mathematical problems by directly outputting the answer, the width d must be extremely large. Specifically, d must grow super-polynomially with the input length n.
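Written out formally (our notation, paraphrasing the statement above rather than quoting the paper), the width cannot stay polynomially bounded in the input length:

$$\text{for every constant } k > 0,\qquad d(n) \neq O(n^k), \quad\text{i.e.}\quad d(n) = n^{\omega(1)}.$$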

The essential reason behind this result is that neither problem admits an efficient parallel algorithm, so the Transformer, a quintessentially parallel model, cannot solve them. The paper rigorously proves the theorem using circuit complexity theory from theoretical computer science.

So what happens if the model does not output the answer directly, but instead outputs the intermediate derivation steps in the form shown in the figure above? By an explicit construction, the researchers further proved that once the model is allowed to output intermediate steps, a fixed-size (independent of the input length n) autoregressive Transformer can solve both mathematical problems.

Compared with the previous result, this shows that adding CoT greatly improves the expressive power of large models. The researchers also gave an intuitive explanation: introducing CoT continuously feeds the generated output tokens back into the input layer, which greatly increases the effective depth of the model, making it proportional to the length of the CoT output and thereby vastly increasing the depth of serial computation the Transformer can express.
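A back-of-the-envelope illustration of this intuition (the layer count and CoT length below are made-up numbers, not from the paper):

```python
num_layers = 3           # depth L of the Transformer
direct_answer_tokens = 1
cot_tokens = 40          # length of the generated chain of thought (illustrative)

# Without CoT, the computation from prompt to answer passes through the network once: depth ~ L.
# With CoT, every generated token is appended to the input before the next forward pass,
# so the serial depth of the overall computation grows to ~ L * (number of generated tokens).
print("effective depth, direct answer:", num_layers * direct_answer_tokens)  # 3
print("effective depth, with CoT:", num_layers * cot_tokens)                 # 120
```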

CoT is key to solving general decision-making problems

Beyond mathematical problems, the researchers further considered CoT's ability on general tasks. Starting from decision-making problems, they considered a general framework for solving them: dynamic programming.

The basic idea of dynamic programming (DP) is to break a complex problem into a sequence of small-scale subproblems that can be solved in order. The decomposition ensures significant overlap (correlation) among the subproblems, so that each subproblem can be solved efficiently using the answers to the previous ones.

Longest increasing subsequence (LIS) and edit distance (ED) are two well-known DP problems discussed in the book "Introduction to Algorithms"; the table below lists the state space, transition function, and aggregation function for each.

[Table: state spaces, transition functions, and aggregation functions for the LIS and ED problems]
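For concreteness, here is a textbook rendering of the two problems as dynamic programs (standard implementations for illustration, not the paper's formal state/transition/aggregation notation):

```python
def longest_increasing_subsequence(a):
    """dp[i] = length of the longest increasing subsequence ending at position i."""
    n = len(a)
    dp = [1] * n
    for i in range(n):
        for j in range(i):
            if a[j] < a[i]:                      # transition: extend a shorter subsequence
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp) if n else 0                   # aggregation: best value over all end positions

def edit_distance(s, t):
    """dp[i][j] = minimum number of edits turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j                 # base case: insert or delete everything
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + 1,            # delete s[i-1]
                    dp[i][j - 1] + 1,            # insert t[j-1]
                    dp[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # substitute (or match)
                )
    return dp[m][n]                              # aggregation: the answer sits in the final state

print(longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6]))  # 4  (e.g. 1, 4, 5, 9)
print(edit_distance("kitten", "sitting"))                        # 3
```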

The researchers proved that an autoregressive Transformer can output the complete dynamic-programming thought chain in the order in which subproblems are solved, and can therefore output correct answers for all tasks solvable by dynamic programming. They also showed that generating the thought chain is necessary: for many hard dynamic-programming problems, a constant-depth, polynomial-size Transformer cannot directly output the correct answer. The paper gives context-free grammar membership testing as a counterexample.
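A minimal sketch of what such a thought chain could look like for LIS (the textual format is our own; the paper's construction works over the model's token sequence rather than English-style text):

```python
def lis_chain_of_thought(a):
    """Emit every DP subproblem in solving order, then the aggregated answer."""
    dp, steps = [], []
    for i, x in enumerate(a):
        best = max([dp[j] for j in range(i) if a[j] < x], default=0)
        dp.append(best + 1)
        steps.append(f"dp[{i}] = {dp[i]}")       # one chain-of-thought step per subproblem
    steps.append(f"answer = {max(dp, default=0)}")
    return " ; ".join(steps)

print(lis_chain_of_thought([3, 1, 4, 1, 5]))
# dp[0] = 1 ; dp[1] = 1 ; dp[2] = 2 ; dp[3] = 1 ; dp[4] = 3 ; answer = 3
```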

Experiments

Finally, the researchers designed extensive experiments to test the theory, considering four different tasks: arithmetic expression evaluation, solving systems of linear equations, the longest increasing subsequence, and edit distance.

The experimental results show that, when trained on CoT data, a 3-layer autoregressive Transformer can already achieve nearly perfect performance on all four tasks, whereas directly outputting the answer performs poorly on all of them (even with deeper models). This result clearly demonstrates both the ability of autoregressive Transformers to solve such complex tasks and the importance of CoT in solving them.
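A rough sketch of how CoT training pairs for the arithmetic task might be assembled (the expression format and step-by-step reduction below are our illustration, not the paper's data pipeline):

```python
import random
import re

def random_expression(num_ops):
    """Build a random, fully parenthesized expression with num_ops binary operators."""
    if num_ops == 0:
        return str(random.randint(0, 9))
    left_ops = random.randint(0, num_ops - 1)
    left = random_expression(left_ops)
    right = random_expression(num_ops - 1 - left_ops)
    return f"({left} {random.choice('+-*')} {right})"

def cot_target(expr):
    """Reduce the innermost parenthesized sub-expression one step at a time,
    recording each intermediate expression as one chain-of-thought step."""
    steps = [expr]
    pattern = re.compile(r"\((-?\d+) ([+\-*]) (-?\d+)\)")
    while (m := pattern.search(expr)) is not None:
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        value = a + b if op == "+" else a - b if op == "-" else a * b
        expr = expr[:m.start()] + str(value) + expr[m.end():]
        steps.append(expr)
    return " = ".join(steps)

random.seed(0)
expr = random_expression(3)
print(expr, "->", cot_target(expr))  # training pair: prompt = expression, target = CoT string
```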

[Figure: experimental results on the four tasks, with and without CoT]

The researchers also explored whether the learned autoregressive model can extrapolate to longer inputs. They built a CoT training dataset for the arithmetic task with 1 to 15 operators and tested the model on expressions with more operators than seen in training. As shown in Figure 3 below, the three-layer Transformer still performed well on these longer sequences, indicating that the model has, to some extent, learned the underlying mechanism. The researchers therefore believe that a model trained on more data of varying lengths could eventually learn the complete arithmetic rules.

[Figure 3: length extrapolation results on the arithmetic task]
