
MIT and Microsoft confirm GPT-4 can self-repair: an agent loop iterates code based on feedback

Author: New Zhiyuan

Editor: Editorial Office

Who would have thought that training GPT-5 might not require handwritten code at all? A new study from MIT and Microsoft examines how effective GPT-4 is at repairing code. In the future, the job of an OpenAI engineer may reduce to a single task: Critique is all you need.

We all know that large models have a capacity for introspection and can self-correct the code they write.

How exactly does this self-repair mechanism work?

And to what extent can a model accurately explain what is wrong with its code?

Recently, scholars at MIT and Microsoft found that, between GPT-4 and GPT-3.5, only GPT-4 shows effective self-repair. What's more, GPT-4 can even provide useful feedback on programs generated by GPT-3.5.


Paper: https://arxiv.org/pdf/2306.09896.pdf

Nvidia scientist Jim Fan highly recommended the study.

In his opinion, even the most professional human programmers cannot write a program correctly on the first pass. They need to look at the execution results, reason about what went wrong, propose fixes, and try again and again. This is an agent loop: iteratively improving code based on environmental feedback.

Chances are, OpenAI is training the next generation of GPTs by hiring large numbers of software engineers. And those engineers won't need to write code at all: critique is all you need.


- The core reason GPT-4 can self-repair is its strong feedback ability: it can reflect on the problems in its code far more effectively than any other model.

- The feedback model and the code generation model do not have to be the same. In fact, the feedback model is the bottleneck.

- With feedback from GPT-4, GPT-3.5 writes better code.

- With feedback from human professionals, GPT-4 itself writes better code.

Demystifying GPT self-repair for code generation

We all know that large language models have shown an extraordinary ability to generate code.

However, they still struggle with challenging programming tasks, such as contests and software engineering interviews.

Fortunately, many models can "introspect" through a self-repair workflow, correcting errors in their code on their own.

The researchers wanted to know to what extent these models can provide accurate feedback that explains why the code they generated is wrong.

The figure below shows the classic workflow of this self-repair approach.

First, given a specification, a program is sampled from the code generation model and executed on the set of unit tests provided in the specification.

If the program fails any unit test, the error message and the program are fed to a feedback generation model, which outputs a short explanation of why the code failed.

Finally, the feedback is passed to a repair model, which generates a fixed version of the program.

On the surface, this workflow seems perfect. It lets the system overcome errors caused by bad samples during decoding, and in the repair phase it can easily incorporate feedback from symbolic systems (compilers, static analysis tools, execution engines, and so on).

It also mimics the trial-and-error way human software engineers write code.
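
To make this workflow concrete, here is a minimal Python sketch of one round of the loop. The callables passed in (code model, test sandbox, feedback model, repair model) and the sampling budgets are illustrative assumptions, not the paper's actual implementation:

```python
from typing import Callable, Optional

def self_repair_loop(
    spec: str,
    tests: list,
    generate: Callable[[str], str],               # code model: spec -> program
    run_tests: Callable[[str, list], tuple],      # sandbox: (program, tests) -> (ok, error)
    explain: Callable[[str, str, str], str],      # feedback model: (spec, program, error) -> explanation
    repair: Callable[[str, str, str, str], str],  # repair model: (..., explanation) -> new program
    n_programs: int = 2,
    n_feedback: int = 2,
    n_repairs: int = 2,
) -> Optional[str]:
    """One round of the generate -> test -> explain -> repair loop sketched above."""
    for _ in range(n_programs):                   # stage 1: sample an initial program
        program = generate(spec)
        ok, error = run_tests(program, tests)     # stage 2: execute the unit tests
        if ok:
            return program
        for _ in range(n_feedback):               # stage 3: explain why it failed
            explanation = explain(spec, program, error)
            for _ in range(n_repairs):            # stage 4: sample candidate repairs
                candidate = repair(spec, program, error, explanation)
                ok, _ = run_tests(candidate, tests)
                if ok:
                    return candidate
    return None  # budget exhausted without finding a passing program
```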


However, there is a problem with this workflow: self-repair requires more calls to the model, which raises the computational cost.

Moreover, the researchers noticed an interesting subtlety: the effectiveness of a large model's self-repair depends not only on the model's ability to generate code, but also on its ability to recognize how its code goes wrong on the given task.

No prior work had investigated this in detail, so the authors studied the self-repair effectiveness of GPT-3.5 and GPT-4 on competition-level code generation tasks.

The researchers proposed a new evaluation strategy called pass@t.

In this strategy, the pass rate on a task is measured against the total number of tokens sampled from the model.

Because pass@t is used rather than the traditional pass@k (which measures the pass rate against the number of samples drawn), this allows a fair comparison with purely sampling-based methods.

From the experiments, the researchers found:

1. GPT-4 achieves a performance improvement from self-repair, whereas for GPT-3.5 the post-repair pass rate is lower than or equal to that of the baseline no-repair method under all budgets.

2. Even for GPT-4, the performance improvement is modest at best (with a budget of 7,000 tokens, the pass rate rises from 66% to 71%, roughly the cost of 45 i.i.d. GPT-4 samples) and depends on the initial programs being sufficiently diverse.

3. Replacing GPT-3.5's own explanations of its errors with feedback generated by GPT-4 yields better self-repair performance, even exceeding the baseline no-repair GPT-3.5 method (from 50% to 54% at 7,000 tokens).

4. Replacing GPT-4's own explanations with explanations provided by human programmers significantly improves repair, increasing the number of repaired programs that pass the tests by 57%.

Self-repair in four stages

The self-repair approach involves four stages: code generation, code execution, feedback generation, and code repair. The researchers formally defined these stages as follows.

Phase 1: Code generation

Given a specification $\psi$, a program model $M_P$ first generates $n_p$ samples $p_i$. Expressed as a formula:

$$\{p_i\}_{i=1}^{n_p} \overset{\text{i.i.d.}}{\sim} M_P(\psi)$$

Phase 2: Code execution

The $n_p$ code samples are then executed on a test bed, assuming that the full test suite is available in executable form.

If any sample passes all of the tests, the process stops, since a satisfactory program has been found.

Otherwise, the error messages $e_i$ returned by the execution environment are collected. These error messages contain either compile/runtime error information or an example input on which the program's output differs from the expected one.
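
A minimal sketch of this execution stage, assuming APPS-style tasks where a program reads from stdin and writes to stdout; the test format and the helper's name are assumptions for illustration, not the paper's harness:

```python
import subprocess

def run_unit_tests(program_src: str, tests: list[tuple[str, str]]) -> tuple[bool, str]:
    """Run a candidate program against (input, expected_output) pairs and
    return (passed, error_message), as in stage two of the workflow."""
    for test_input, expected in tests:
        try:
            result = subprocess.run(
                ["python", "-c", program_src],
                input=test_input, capture_output=True, text=True, timeout=10,
            )
        except subprocess.TimeoutExpired:
            return False, f"Timeout on input: {test_input!r}"
        if result.returncode != 0:                     # compile/runtime error
            return False, result.stderr.strip()
        if result.stdout.strip() != expected.strip():  # wrong answer
            return False, (f"Wrong output for input {test_input!r}: "
                           f"got {result.stdout.strip()!r}, expected {expected!r}")
    return True, ""
```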

Phase 3: Feedback generation

Here, the researchers use a feedback model $M_F$ to generate more detailed explanations of the errors.

At this stage, $n_f$ feedback strings $f_{ij}$ are generated for each faulty program and its error message, as follows:

$$\{f_{ij}\}_{j=1}^{n_f} \sim M_F(\psi; p_i; e_i)$$
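
As a rough illustration, the feedback call might assemble the specification, the failing program, and its error message into a single prompt like the one below; the wording is invented for illustration, not the paper's actual template:

```python
def build_feedback_prompt(spec: str, program: str, error_msg: str) -> str:
    """Assemble the specification, failing program, and error message into a
    prompt for the feedback model (illustrative wording only)."""
    return (
        f"Problem specification:\n{spec}\n\n"
        f"Incorrect program:\n{program}\n\n"
        f"Error message from the unit tests:\n{error_msg}\n\n"
        "In a few sentences, explain why the program is wrong."
    )
```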

Phase 4: Code repair

In the last step, for each initial program $p_i$, its error $e_i$, and each piece of feedback $f_{ij}$, $n_r$ candidate repaired programs are sampled from $M_P$:

$$\{r_{ijk}\}_{k=1}^{n_r} \sim M_P(\psi; p_i; e_i; f_{ij})$$

The researchers call the resulting tree of interleaved text and programs the repair tree $T$: it is rooted at the specification $\psi$, branches into the initial programs $p_i$, each of which branches into feedback strings $f_{ij}$, which in turn branch into the repairs $r_{ijk}$.

This structure is shown in the figure.
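
One way to picture the repair tree is as a nested data structure: a specification node holding programs, each program holding feedback strings, each feedback string holding candidate repairs. A minimal sketch (the class names are invented for illustration, not taken from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackNode:
    feedback: str                                     # f_ij: explanation of the error
    repairs: list[str] = field(default_factory=list)  # r_ijk: candidate repairs

@dataclass
class ProgramNode:
    program: str                                      # p_i: initial sample
    error: str = ""                                   # e_i: message from the test harness
    feedback_nodes: list[FeedbackNode] = field(default_factory=list)

@dataclass
class RepairTree:
    spec: str                                         # psi: the root specification
    programs: list[ProgramNode] = field(default_factory=list)
```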

Since self-repair requires several dependent model calls of non-uniform cost, pass@k (the likelihood of obtaining a correct program within $k$ i.i.d. samples) is not a suitable metric for comparing and evaluating the various hyperparameter choices of self-repair.

Instead, the researchers measure the pass rate as a function of the total number of tokens sampled from the model, a metric they call pass@t.
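
A sketch of how pass@t might be estimated empirically: run the procedure many times independently, record each run's total token consumption and outcome, and report the fraction of runs that found a passing program within a budget of t tokens. The data layout here is an assumption for illustration:

```python
def pass_at_t(runs: list[tuple[int, bool]], t: int) -> float:
    """Estimate pass@t for one task from repeated independent runs.

    Each run records (tokens_sampled, found_passing_program). pass@t is the
    fraction of runs that succeed within a budget of t tokens, so methods
    with different per-call costs can be compared fairly.
    """
    if not runs:
        return 0.0
    return sum(1 for tokens, ok in runs if ok and tokens <= t) / len(runs)
```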

Experimental process

The researchers then tested three questions:

1. For more challenging programming tasks, do these models self-repair better than an i.i.d. no-repair baseline?

2. Does a stronger feedback model improve a model's repair performance?

3. If humans join the self-repair loop of the most capable model and provide feedback, does that unlock better repair performance?

First, the research team introduced a source of challenging programming tasks: the Automated Programming Progress Standard (APPS) dataset.

The tasks in this dataset range from introductory to collegiate competition level and can be used to assess human programmers' problem-solving and coding ability.

The researchers selected 300 tasks, including 60 entry-level tasks and 60 competition-level tasks.

The researchers chose GPT-3.5 and GPT-4 as the models, and implemented self-repair using template string concatenation and single prompts.

The figure below shows an example prompt.

Self-repair requires powerful models and diverse initial samples

The researchers first used a single model for both feedback generation and code repair.

The figure shows heat maps with the two hyperparameters along the axes, where the value in each cell is the mean pass rate of self-repair normalized by the mean pass rate of the baseline under the same token budget (i.e., the same value of t in pass@t).

As the figure shows, for GPT-3.5 the pass@t of self-repair is below or equal to the corresponding baseline (black) in every setting, clearly indicating that self-repair is not an effective strategy for GPT-3.5.

For GPT-4 (below), there are several settings where the self-repair pass rate is significantly better than the baseline.

The figure below compares self-repair with the baseline no-repair approach.

GPT-4 feedback improves the repair results of GPT-3.5

The researchers went a step further and ran a new experiment to evaluate the effectiveness of using a separate, stronger model for feedback generation, testing the hypothesis that weaker models (such as GPT-3.5) are held back from self-repair because they cannot introspect on and debug their own code.

The results of this experiment are shown in the figure above (bright blue).

In terms of absolute performance, the combination of GPT-3.5 for repair and GPT-4 for feedback does break through the performance barrier, becoming slightly more efficient than i.i.d. sampling with GPT-3.5 alone.

This suggests that the text feedback stage itself is critical, and that improving it can relieve the bottleneck in GPT-3.5's self-repair.

Human feedback significantly improves the success rate of GPT-4 repair

In the final experiment, the researchers investigated the effect of adding feedback from expert human programmers when repairing with the stronger model, GPT-4.

The goal was to understand how the model's ability to identify errors in code compares with that of humans, and how this affects the downstream performance of self-repair.

The researchers recruited 16 participants: 15 graduate students and 1 professional machine learning engineer.

Each participant was shown five different faulty programs, assigned according to their Python experience.

Each program was taken from a different task, and participants never saw two programs belonging to the same task.

Participants were then asked to explain, in their own words, what the program was doing wrong.

The experimental results are shown in the following figure:


The researchers found that when GPT-4's own feedback was replaced with that of the human participants, the overall success rate improved by more than 1.57×.

Not surprisingly, the relative gap widens as the problems get harder, suggesting that GPT-4's ability to produce accurate and useful feedback falls far behind that of human participants once tasks (and code) become more complex.

About the authors

Jianfeng Gao

Jianfeng Gao is a Distinguished Scientist and Vice President at Microsoft and an IEEE Fellow.

At Microsoft Research, he leads the Deep Learning (DL) Group in the Redmond lab. The group's mission is to advance the state of the art in deep learning and apply it to natural language and image understanding as well as to building conversational agents. He has led research on building large-scale foundation models that power Microsoft's key AI products.

Since 2022, he has been responsible for self-improving AI research, which includes enhancing and adapting LLMs (such as ChatGPT/GPT-4) for the development of commercial AI systems.

He received his Ph.D. from Shanghai Jiao Tong University in 1999.

Chenglong Wang

Chenglong Wang is a researcher at Microsoft Research. He received his Ph.D. from the University of Washington and previously studied at Peking University.

Resources:

https://twitter.com/DrJimFan/status/1675916565823516673

https://arxiv.org/pdf/2306.09896.pdf