
GPT-4's math score rises another 30 points: unlocking the Code Interpreter's full potential puts its math ability at SOTA

Xifeng, reporting from Aofeisi

Qubits | Official account QbitAI

GPT-4's math ability can get even stronger!

New research finds that the accuracy of the GPT-4 Code Interpreter is closely tied to how frequently it uses code.

Building on this, the researchers proposed a new method that pushes its mathematical ability to a new SOTA:

On the MATH dataset, the accuracy rate increased from 53.9% to 84.3%.


You heard that right: this is the Code Interpreter, hailed as ChatGPT's most powerful mode since its launch a while back.

The researchers probed its code generation and execution mechanism, then used code-based self-verification and verification-guided weighted majority voting to unlock its full math problem-solving potential.

Curious netizens chimed in:

I'd also like to see it do advanced math.

Some netizens think:

That's how the brain works; humans self-verify when solving math problems too.

Let's take a look at the details of this research~

Two steps to improve math skills

So what exactly is the code generation and execution mechanism of the GPT-4 Code Interpreter?

Researchers from MMLab at the Chinese University of Hong Kong, Nanjing University, the University of Science and Technology of China, Tsinghua University, City University of Hong Kong, Changsha University of Science and Technology, and other institutions ran an experiment using code-constrained prompts to answer this question.


They designed three prompting methods that constrain how often the GPT-4 Code Interpreter may use code (a rough sketch follows the list):

  • Prompt 1: No code allowed at all; the solution must rely entirely on natural-language reasoning, and embedding code in the solution is prohibited.
  • Prompt 2: Code is allowed only once, i.e., the solution may contain at most a single code block.
  • Basic Prompt: No restriction; the GPT-4 Code Interpreter may perform a series of reasoning steps, each of which can mix natural language and Python code.
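
To make the setup concrete, here is a minimal sketch of the three prompts as they might be passed to the model; the wording is an illustrative paraphrase, not the paper's exact prompt text:

    # Illustrative paraphrases of the three code-constraint prompts.
    # The exact wording used in the paper may differ.
    PROMPTS = {
        "prompt_1_no_code": (
            "Solve the problem using natural language reasoning only. "
            "Do not write or execute any code."
        ),
        "prompt_2_code_once": (
            "Solve the problem. You may use at most one code block "
            "in your entire solution."
        ),
        "basic_prompt": (
            "Solve the problem step by step. You may freely interleave "
            "reasoning with Python code, executing code as often as needed."
        ),
    }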

△ (a) Accuracy comparison across the different prompts. (b) Code usage frequency is positively correlated with accuracy across the five difficulty levels, and the correlation is more pronounced on harder math problems.

The result: when the GPT-4 Code Interpreter is allowed to generate and execute code multiple times, its problem-solving accuracy is significantly higher than with natural-language reasoning alone or with only a single use of code.

The researchers attribute this to the fact that generating and executing code repeatedly lets the GPT-4 Code Interpreter refine its solution step by step: when code execution raises an error, the model can debug itself and revise the solution.

They then introduced the concept of "code usage frequency" to quantify how many times code is used under each prompting method.
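
As a rough illustration, code usage frequency could be measured by counting the executed code blocks in a solution transcript; this heuristic sketch assumes triple-backtick fences and is not the paper's actual measurement script:

    import re

    def code_usage_frequency(solution: str) -> int:
        # Count fenced code blocks in a model solution transcript.
        blocks = re.findall(r"```(?:python)?\n.*?```", solution, flags=re.DOTALL)
        return len(blocks)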

Based on this analysis, the researchers set out to further strengthen the GPT-4 Code Interpreter's ability to generate accurate code, evaluate code execution results, and automatically adjust its solutions.

So they proposed the CSV (code-based self-verification) prompting method, which appends an additional verification stage V after the solution stage.

The added self-verification prompt corresponds to the green "Verification Prompt" in the paper's framework figure.

In this way, the GPT-4 Code Interpreter must generate extra code to verify its answer; if the verification result is False, it reasons again until it reaches the correct answer.
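
The resulting solve-then-verify loop might look like the sketch below. In the paper this behavior is elicited purely by the prompt within a single session; the two helper functions here are hypothetical stand-ins for calls to the model:

    # Sketch of the CSV solve-then-verify loop. generate_solution and
    # verify_with_code are hypothetical stand-ins for GPT-4 Code Interpreter
    # calls; the paper drives this loop via prompting, not an external script.
    def generate_solution(problem, previous=None):
        # placeholder: would prompt the model for a step-by-step solution
        return "worked solution", "42"

    def verify_with_code(problem, solution, answer):
        # placeholder: would have the model write and run checking code,
        # yielding a verification state: "True", "False", or "Uncertain"
        return "True"

    def solve_with_csv(problem, max_rounds=3):
        solution, answer = generate_solution(problem)
        for _ in range(max_rounds):
            state = verify_with_code(problem, solution, answer)
            if state == "True":
                return answer, state  # verified answer
            # verification failed: re-reason from the failed attempt
            solution, answer = generate_solution(problem, previous=solution)
        return answer, "Uncertain"  # best effort after max_rounds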


The CSV prompt not only extends verification to every step of the logical reasoning, but also corrects errors automatically, with no external model or human involvement.


△ The 712th Intermediate Algebra problem in the MATH dataset.

CSV prompt: "To solve the problem using code interpreter step by step, and please verify your answer using code interpreter."

As the example shows, without self-verification the model produces an incorrect answer; with self-verification it catches the error and arrives at the correct one.
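
To make "verifying with code" concrete, here is a hypothetical example of the kind of check the model writes during the verification stage; the problem and numbers are invented for illustration, not taken from the paper:

    # Hypothetical verification code for a problem like
    # "solve x^2 - 5x + 6 = 0 and report the smaller root".
    # The verification stage substitutes the candidate answer back in.
    from sympy import Eq, solve, symbols

    x = symbols("x")
    candidate = 2                        # answer produced in the solution stage
    roots = solve(Eq(x**2 - 5*x + 6, 0), x)
    print(candidate in roots and candidate == min(roots))  # True -> verified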

In addition, since CSV can effectively verify answers, the researchers proposed verification-guided weighted majority voting, which folds the self-verification results into majority voting: different verification states receive different weights, making the vote more reliable.


In practice, once an answer has been confirmed to be wrong, no additional verification is performed on it, and it keeps a False verification state. The researchers assigned a weight to each of the three states: True (wT), Uncertain (wU), and False (wF).


Finally, the candidate answer with the highest weighted score is chosen:
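
A minimal sketch of this scoring rule follows; the state names come from the description above, but the weight values are illustrative assumptions, since the article doesn't give the tuned values:

    from collections import defaultdict

    # Verification-guided weighted majority voting: each sampled reasoning
    # path votes for its answer, weighted by its verification state.
    # The weight values below are illustrative assumptions.
    WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.2}

    def vw_vote(paths):
        # paths: list of (answer, verification_state) pairs
        scores = defaultdict(float)
        for answer, state in paths:
            scores[answer] += WEIGHTS[state]
        return max(scores, key=scores.get)  # highest-scoring answer

    # e.g. five sampled reasoning paths
    print(vw_vote([("7", "True"), ("7", "Uncertain"), ("3", "False"),
                   ("7", "True"), ("3", "Uncertain")]))  # -> 7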


More than 30 points above the previous best

With the methods above, the GPT-4 Code Interpreter's math problem-solving ability went up markedly.

On the MATH dataset, the original GPT-4 Code Interpreter scored 69.69% accuracy, which rose to 73.54% with CSV prompting and further to 84.32% with verification-guided weighted majority voting, more than 30 percentage points above the previous SOTA of 53.9%.


△ Accuracy on the MATH dataset (%)

The proposed method improves significantly on every subtask of the MATH dataset, especially on the harder ones. For example, on Intermediate Algebra problems, the original GPT-4 Code Interpreter's accuracy was 50.1%, which the new method lifted to 74.4%.

Beyond this, the researchers also validated the method on datasets such as GSM8K, MMLU-Math, and MMLU-STEM.


△ Performance on the GSM8K dataset

As the table shows, verification-guided weighted majority voting also sharply reduces the number of reasoning paths that need to be sampled: only 5 sampled paths are required to reach 97% accuracy on GSM8K.


△ Performance on the MMLU dataset

In tests across difficulty levels (panel (a) of the figure) and question types (panel (b)), the new method improves accuracy throughout.


The four points on each curve correspond, in order, to the results obtained with Prompt 1, Prompt 2, the Basic Prompt, and the CSV Prompt.

The researchers also found that higher code usage frequency in the GPT-4 Code Interpreter is positively correlated with higher accuracy, and that code usage frequency climbs steadily as problems get harder. This suggests that using code more often matters most for difficult math problems.

It's also worth noting that while code-based self-verification improves performance on every individual question type, the size of the gain varies by type, ranging from 7.6 percentage points down to just 0.6.

The researchers noted:

In particular, accuracy on Geometry problems improved by only 0.6 percentage points, and the original GPT-4 Code Interpreter's accuracy there is just 54.0%, among the lowest of all question types. The gap is likely because solving geometry problems often requires multimodal capability, which is beyond the scope of this paper.

Paper: https://arxiv.org/abs/2308.07921

Reference Links:

[1]https://twitter.com/_akhaliq/status/1691734872329699813?s=20

[2]https://x.com/justfannet/status/1691983780498600376?s=46&t=iTysI4vQLQqCNJjSmBODPw

— End —

Qubits QbitAI · Signed author on Toutiao

Follow us and be the first to know about the latest developments in science and technology