OpenAI trains its strongest mathematical problem-solving model yet to cure AI's bad habit of talking nonsense

Author | Cheng Qian

Editor | Heart

Zhidongxi reported on June 1 that in the early hours of this morning, OpenAI published new research progress on its official blog: to improve mathematical reasoning, its researchers are using process-supervised reward models to catch the logical errors of large language models.

Large language models have become much better at complex multi-step reasoning, but they still sometimes produce logical errors, often referred to as "hallucinations." This remains a key obstacle on the road to general artificial intelligence.

These hallucinations may now be curbed by reward models trained with outcome supervision or process supervision. In this approach, the outcome-supervised reward model (ORM) is trained using only the final result of the model's chain of thought, while the process-supervised reward model (PRM) is trained with feedback on each step in the chain of thought.
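
To make the contrast concrete, here is a minimal sketch (illustrative only, not OpenAI's code) of what a single training example might look like under each scheme; the problem text and field names are made up for the example.

```python
# Sketch: the same generated solution yields different training signals under
# outcome supervision (one label per solution) vs. process supervision
# (one label per step). Values are illustrative.

solution_steps = [
    "Step 1: Let x be the unknown quantity.",
    "Step 2: Set up the equation 2x + 3 = 11.",
    "Step 3: Subtract 3 from both sides: 2x = 8.",
    "Step 4: Divide by 2: x = 4.",
]

# Outcome supervision (ORM): a single label derived from the final answer only.
orm_example = {
    "solution": solution_steps,
    "label": 1,  # 1 = final answer matches the reference answer, 0 otherwise
}

# Process supervision (PRM): one human-provided label per step.
prm_example = {
    "solution": solution_steps,
    "step_labels": [1, 1, 1, 1],  # one correctness judgment for every step
}

print(orm_example["label"], prm_example["step_labels"])
```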

Process supervision has an advantage over outcome supervision: it directly rewards the model for following a consistent chain of thought. Because every step is supervised and the exact location of an error can be pointed out, the results are easier for humans to interpret, and large language models can be rewarded more directly for following chains of thought that humans endorse.

OpenAI's researchers compared outcome supervision and process supervision in more detail than before, using a more capable base model (GPT-4), more human feedback, and the MATH dataset for training and testing. Under these conditions, they showed that the process-supervised reward model can solve 78.2% of the problems in a representative subset of the MATH test set.

1. Trained on 12,000 math problems, with a large model supervising the training of small models

Outcome supervision can be provided without human intervention, because every problem in the MATH dataset has an answer that can be checked automatically. Process supervision, however, relies on human data annotators to label the correctness of each step in the solutions the model generates.

The researchers ran experiments at both large and small scale. At large scale they fine-tuned from GPT-4, but in that setting the training sets for process supervision and outcome supervision did not fully coincide, so the two could not be compared directly. They therefore also trained models at small scale for a direct comparison, and to reduce the cost of human feedback they used a large language model to supervise the training of smaller models.

In every case, all solutions were produced by a single fixed model, called the generator. To collect process-supervision data, the researchers showed human data annotators the step-by-step solutions to math problems sampled from the large-scale generator.

Each human data annotator assigns every step a label of Positive, Negative, or Neutral: Positive means the step is correct and reasonable, Negative means it is incorrect or unreasonable, and Neutral means it is ambiguous.
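
As an illustration of this labeling scheme, the sketch below shows one way such step-level annotations could be represented in code; the class and field names are hypothetical and not taken from PRM800K.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

# Hypothetical representation of one annotated solution; names are illustrative.

class StepLabel(Enum):
    POSITIVE = 1   # the step is correct and reasonable
    NEUTRAL = 0    # the step is ambiguous
    NEGATIVE = -1  # the step is incorrect or unreasonable

@dataclass
class AnnotatedSolution:
    problem: str             # the math problem shown to the generator
    steps: List[str]         # the generator's solution, split into steps
    labels: List[StepLabel]  # one human label per step

example = AnnotatedSolution(
    problem="Solve 2x + 3 = 11.",
    steps=["2x = 11 - 3", "2x = 8", "x = 4"],
    labels=[StepLabel.POSITIVE, StepLabel.POSITIVE, StepLabel.POSITIVE],
)
print([label.name for label in example.labels])
```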

Positive: In this problem, GPT-4 makes a guess at steps 7 and 8. Guessing is a common place for large language models to hallucinate, claiming that a guess works out when it does not, but no error occurs here.

Negative: In the seventh step of this problem, GPT-4 makes an error while simplifying an expression, and the reward model points out the error.

Neutral: In step 13 of this problem, GPT-4 tries to simplify the equation by combining like terms. It correctly moves "12x" to the left side and combines terms, but the right side is mistakenly left unchanged, and the reward model does not catch this error.

The researchers dubbed this step-labeled dataset PRM800K; it contains labels for 800,000 steps across 75,000 solutions to 12,000 problems, and it includes 4,500 problems from the MATH dataset.

2. Process supervision outperforms outcome supervision overall, and the gap grows with more solutions

For the outcome-supervised reward model, the researchers uniformly sample a fixed number of solutions per problem from the generator and train the reward model to predict whether each solution is correct or incorrect. In practice, correctness is determined by automatically checking the final answer, and the reward model's prediction at the final token is used as the overall score for the solution.
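
A minimal sketch of this kind of automatic outcome check is shown below, assuming a naive final-answer extractor; it is an illustration of the idea, not OpenAI's implementation.

```python
import re

# Sketch: derive a 0/1 outcome label for each sampled solution by comparing its
# final answer against the reference answer. The ORM is trained to predict this
# label, and its prediction at the final token serves as the solution's score.

def extract_final_answer(solution: str) -> str:
    """Naively take the last number in the solution as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return numbers[-1] if numbers else ""

def outcome_label(solution: str, reference_answer: str) -> int:
    return int(extract_final_answer(solution) == reference_answer)

sampled_solutions = [
    "2x = 8, so x = 4",   # correct final answer
    "2x = 14, so x = 7",  # wrong final answer
]
print([outcome_label(s, "4") for s in sampled_solutions])  # [1, 0]
```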

However, this automatic check is not entirely reliable: it cannot properly judge solutions that reach the correct answer through incorrect reasoning.

The process-supervised reward model instead predicts the correctness of each step at that step's last token. In OpenAI's example, the reward model scores two solutions to the same problem: the one on the left is correct and the one on the right is incorrect. Green marks high-scoring steps, red marks low-scoring steps, and the reward model correctly identifies where the incorrect solution goes wrong.
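
The sketch below illustrates step-level scoring, assuming the convention described in OpenAI's paper that a solution's overall score is the product of its per-step correctness probabilities; the `step_correct_prob` heuristic is only a stand-in for a trained reward model.

```python
import math
from typing import List

def step_correct_prob(step: str) -> float:
    # Placeholder heuristic for illustration; a real PRM is a fine-tuned LLM
    # that predicts step correctness at the step's last token.
    return 0.2 if "guess" in step else 0.95

def prm_solution_score(steps: List[str]) -> float:
    # Solution score = product of per-step correctness probabilities.
    log_score = sum(math.log(step_correct_prob(s)) for s in steps)
    return math.exp(log_score)

good = ["2x = 11 - 3", "2x = 8", "x = 4"]
bad = ["2x = 11 - 3", "guess x = 7", "check: looks right"]
print(round(prm_solution_score(good), 3), round(prm_solution_score(bad), 3))
```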

The researchers evaluated the process-supervised and outcome-supervised reward models on problems from the MATH test set, generating many solutions for each problem and then selecting, for each reward model, the solution it ranks highest.
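
In code, this best-of-N evaluation might look like the following sketch; `generate_solutions`, `reward_model`, and `is_correct` are hypothetical stand-ins for the generator, the trained reward model, and the automatic answer checker.

```python
from typing import Callable, List

def best_of_n_accuracy(
    problems: List[str],
    generate_solutions: Callable[[str, int], List[str]],
    reward_model: Callable[[str, str], float],
    is_correct: Callable[[str, str], bool],
    n: int = 100,
) -> float:
    """Fraction of problems where the reward model's top-ranked solution is correct."""
    hits = 0
    for problem in problems:
        candidates = generate_solutions(problem, n)
        best = max(candidates, key=lambda sol: reward_model(problem, sol))
        hits += is_correct(problem, best)
    return hits / len(problems)

# Tiny demo with dummy components standing in for the generator and checker.
problems = ["Solve 2x + 3 = 11."]
generate = lambda p, n: [f"x = {k}" for k in range(1, n + 1)]
reward = lambda p, sol: 1.0 if sol == "x = 4" else 0.0
correct = lambda p, sol: sol == "x = 4"
print(best_of_n_accuracy(problems, generate, reward, correct, n=10))  # 1.0
```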

Measured by the percentage of selected solutions that arrive at the correct answer, the process-supervised reward model performs better overall, and the gap widens as more solutions per problem are considered. The researchers therefore conclude that the process-supervised reward model is more reliable.

3. Evaluated on 224 STEM problems outside the MATH dataset, process supervision again performs better

The researchers also examined the effect of active learning, which they estimate can increase the data efficiency of process supervision by a factor of 2.6.
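
The article does not spell out the selection rule, but one plausible reading of the active-learning idea is sketched below: prioritize labeling the "convincing wrong-answer" solutions that the current reward model scores highly even though their final answer is wrong. This is an illustration of the concept, not OpenAI's exact procedure; `current_prm_score` and `is_wrong` are stand-ins.

```python
from typing import Callable, List

def select_for_labeling(
    solutions: List[str],
    current_prm_score: Callable[[str], float],
    is_wrong: Callable[[str], bool],
    budget: int,
) -> List[str]:
    """Pick the wrong-answer solutions the current reward model finds most convincing."""
    wrong = [s for s in solutions if is_wrong(s)]
    wrong.sort(key=current_prm_score, reverse=True)  # most convincing first
    return wrong[:budget]

# Tiny demo with dummy components.
pool = ["x = 4", "x = 7 (well argued)", "x = 9 (sloppy)"]
picked = select_for_labeling(
    pool,
    current_prm_score=lambda s: 0.9 if "well argued" in s else 0.3,
    is_wrong=lambda s: "4" not in s,
    budget=1,
)
print(picked)  # ['x = 7 (well argued)']
```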

In addition, to explore how well the reward models generalize, the researchers evaluated the large-scale process-supervised and outcome-supervised models on 224 STEM problems drawn from AP Physics, AP Calculus, AP Chemistry, AMC 10, and AMC 12 exams; process supervision again outperformed outcome supervision.

Process supervision is also more likely to produce interpretable reasoning, because it encourages large language models to follow thought processes that humans have endorsed.

In some cases, safer approaches to AI systems can degrade performance, a cost known as the alignment tax: aligning large language models with human values can, to some extent, constrain what they are able to produce.

OpenAI's results show that, in the domain of mathematics, process supervision actually incurs a negative alignment tax.

It is unclear whether these results fully generalize beyond mathematics, but the researchers believe that if they do, process supervision offers a more efficient and better-aligned approach than outcome supervision.

Conclusion: AI interpretability research needs to be accelerated

Last month, OpenAI's research on using GPT-4 to automatically explain the behavior of GPT-2 began to pry open the black box of large-model thinking. This time, for mathematical reasoning, researchers have used process-supervised reward models to make the thinking process of large models traceable and correctable, giving AI interpretability further room to improve.

So far, the effectiveness of the process-supervised reward model has been demonstrated only in mathematical reasoning, but as the OpenAI researchers note, understanding how process supervision performs in other domains is an important direction for future work. Eventually, such research could let large models show powerful capabilities in content generation and understanding while their "thought process" is checked for bias or errors, making the black box of large models more transparent.
