
Replacing programmers? After shipping code-generating AI, Microsoft is teaching AI to review code

Author | Microsoft Research

Translated by | Nuclear Coke

Planning | Ling Min

Last July, Microsoft, GitHub, and OpenAI jointly launched GitHub Copilot, a new code-generating AI powered by Codex, OpenAI's deep-learning model. However, data shows that Codex is only about 30% accurate. Recently, Microsoft introduced Jigsaw, an AI code review tool, to further improve the accuracy of AI-generated code.

Today, large adaptable pre-trained language models (including GPT-3, Codex, and others) can successfully write code from a programmer's intent expressed in natural language. Such automation models promise to improve the productivity of every software development practitioner, but because the models themselves struggle to understand program semantics, they cannot yet guarantee the quality of the code they generate.

In our research paper, Jigsaw: Large Language Models meet Program Synthesis (accepted at ICSE 2022, the International Conference on Software Engineering), we introduce a new tool that improves the performance of such large language models. Jigsaw deploys post-processing techniques that understand program syntax and semantics, and it uses user feedback to continuously improve its repair capabilities. Taking multimodal input, Jigsaw synthesizes code for the Python Pandas API.

Our experience shows that as these large language models evolve into tools for "synthesizing code from intent," Jigsaw will play an important role in improving the accuracy of such systems.

The promise and risks of machine-written software

Large language models, exemplified by OpenAI's Codex, are reshaping the landscape of programming. Software developers can now tackle a programming task by describing the functionality of the desired code snippet in English, and Codex synthesizes the intended code in a language such as Python or JavaScript.

However, machine-written code may be incorrect, and it may not even compile or run. Codex users must therefore review the code before using it.

In the Jigsaw project, our goal is to partially automate this review, helping large language models such as Codex synthesize code that matches developer intent and thereby improving productivity.

Suppose Codex hands a software developer a code snippet. The developer can first check whether the code compiles and make a preliminary review; if it fails to compile, the compiler's error messages can guide a fix. Once the code compiles, the developer tests it against input/output (I/O) examples to check that the output the code produces matches expectations.

During this phase, the code may still expose problems (such as throwing exceptions or producing wrong output), which requires further fixes from the developer. We show that this process can be fully automated: Jigsaw takes an English description of the intended code and an I/O example, pairing an input with its expected output, and ensures that the Python code it produces compiles and yields the expected output for that input.
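To make this concrete, here is a minimal sketch of what such an automated compile-run-compare check might look like. It is our illustration rather than Jigsaw's actual implementation, and the convention that a snippet reads df and writes dfout is an assumption of the sketch:

import pandas as pd

def validate_candidate(code: str, df_in: pd.DataFrame, df_expected: pd.DataFrame) -> bool:
    """Does a candidate snippet compile, run on the example input, and
    reproduce the expected output? (Illustrative only, not Jigsaw's API.)"""
    try:
        compiled = compile(code, "<candidate>", "exec")  # step 1: does it compile?
    except SyntaxError:
        return False
    scope = {"pd": pd, "df": df_in.copy()}
    try:
        exec(compiled, scope)  # step 2: does it run on the I/O example's input?
    except Exception:
        return False
    result = scope.get("dfout")
    # step 3: does it produce the expected output for that input?
    return isinstance(result, pd.DataFrame) and result.equals(df_expected)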

In the paper mentioned above, Jigsaw: Large Language Models meet Program Synthesis, we evaluated this approach on Python Pandas. Pandas is an API widely used in data science today, offering hundreds of functions for manipulating dataframes, that is, tables of rows.

Expecting developers to memorize the usage of so many functions is plainly unreasonable, and this is exactly where Jigsaw helps. With it, the user describes the intended transformation in English, provides an input dataframe together with the corresponding output dataframe, and Jigsaw synthesizes the desired code. For example, suppose a developer wants to remove the prefix "Name: " from the "country" column in the table below. In Pandas this can be done as follows:

df['country'] = df['country'].str.replace('Name: ', '')


Figure 1: The input and output dataframes. Jigsaw removes the extra "Name: " prefix from the column named "country".
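For readers who want to reproduce the example, the following runnable snippet reconstructs Figure 1; the row values are invented for illustration, since the article shows only the column name:

import pandas as pd

# Reconstruction of the Figure 1 example (row values are invented).
df = pd.DataFrame({"country": ["Name: USA", "Name: India", "Name: Brazil"]})

# Strip the "Name: " prefix, as in the synthesized snippet above.
df["country"] = df["country"].str.replace("Name: ", "", regex=False)

print(df["country"].tolist())  # ['USA', 'India', 'Brazil']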

Traditionally, a developer new to Pandas would first have to learn the relevant function and its parameters before piecing together the code snippet, or post the query and sample results to a forum such as Stack Overflow and wait for replies from helpful strangers, then often substantially adapt the answer to their own context. Describing the desired input-output tables (dataframes) directly in English is clearly much more convenient.

How Jigsaw works

Jigsaw first takes an English query and preprocesses it with appropriate context to build an input that can be fed to a large language model. Jigsaw treats the model as a black box and has been evaluated with both GPT-3 and Codex.

The biggest advantage of this design is that it supports the latest and best available models in a plug-and-play fashion. After the model generates output code, Jigsaw checks whether it satisfies the I/O example. If it does, the output is correct and the code can be used directly; in our experiments, about 30% of outputs were usable without any fixes. If the code is incorrect, a repair process kicks in during the post-processing phase.
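Putting the pieces together, the loop just described might look like the sketch below. Every helper name here (build_prompt, the repair functions, validate) is a placeholder we invented for illustration; validate can be thought of as the compile-run-compare check sketched earlier:

def jigsaw_pipeline(query, io_example, model, build_prompt, repairs, validate):
    """Illustrative flow: preprocess the English query, call the black-box
    model, accept the output if it passes the I/O example, otherwise try
    post-processing repairs. Not Jigsaw's real interface."""
    df_in, df_expected = io_example
    prompt = build_prompt(query, io_example)  # preprocessing adds context
    code = model(prompt)  # GPT-3 / Codex used as a black box
    if validate(code, df_in, df_expected):
        return code  # about 30% of outputs pass without any fix
    for repair in repairs:  # variable, argument, AST-to-AST transformations
        for candidate in repair(code):
            if validate(candidate, df_in, df_expected):
                return candidate
    return None  # fall back to the user's own edits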


Figure 2: All input to the large language models (GPT-3, Codex, etc.) is preprocessed. Where necessary, the post-processed output is returned to the end user for validation and editing. What is learned from those edits feeds back into the pre- and post-processing mechanisms, further improving Jigsaw's repair ability.

In post-processing, Jigsaw applies three transformations to repair code. Each transformation is motivated by failure modes we observed in GPT-3 and Codex. Interestingly, GPT-3 and Codex produce very similar code errors, so the failure modes Jigsaw targets in post-processing help both models considerably.

Code fixes are implemented through three transformations

Variable transformations

We observed that incorrect variable names often appear in Codex's output. For example, most public code names dataframes df1, df2, and so on, so Codex simply copies that convention. But if the developer actually uses dataframe names such as g1 and g2, Codex's insistence on df1 and df2 causes failures.

In addition, Codex often mixes up which variable receives a call. For example, the correct output might be df1.merge(df2), but Codex writes df2.merge(df1). To fix these errors, Jigsaw substitutes the names in the Codex-generated code with names in scope until the result satisfies the I/O example. We found that this simple transformation alone is enough to fix a large share of the problems in machine-generated code.
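A minimal sketch of this idea follows. The pattern used to spot the model's invented names (df, df1, df2, ...) and the helper names are our assumptions; a real implementation would detect out-of-scope names properly:

import ast
import itertools
import re

def rename_variables(code: str, mapping: dict) -> str:
    """Rename variables at the AST level so substrings are never mangled."""
    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):
            node.id = mapping.get(node.id, node.id)
            return node
    return ast.unparse(Renamer().visit(ast.parse(code)))

def variable_transform(code: str, names_in_scope: list):
    """Try rebinding the df-style names the model used to the names actually
    in scope, yielding each candidate for the I/O check."""
    invented = sorted({n.id for n in ast.walk(ast.parse(code))
                       if isinstance(n, ast.Name) and re.fullmatch(r"df\d*", n.id)})
    for perm in itertools.permutations(names_in_scope, len(invented)):
        yield rename_variables(code, dict(zip(invented, perm)))

# For "dfout = df2.merge(df1)" with g1 and g2 in scope, the candidates include
# "dfout = g1.merge(g2)", which fixes the swapped receiver described above.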

Argument transformations

Sometimes the code Codex generates calls the intended API function but gets some of the arguments wrong. For example:

a.) Query - Remove all rows that are duplicated in the 'inputB' column

dfout = dfin.drop_duplicates(subset=['inputB']) # Model

dfout = dfin.drop_duplicates(subset=['inputB'], keep=False) # Correct

b.) Query - Replace all occurrences of 'Canada' in the 'country' column of df with 'CAN'

df = df.replace({'Canada': 'CAN'}) # Model

df = df.replace({'country': {'Canada': 'CAN'}}) # Correct

To fix such errors, Jigsaw systematically enumerates possible arguments, using the function and argument sequence generated by Codex as a starting point, until it finds a combination that satisfies the I/O example.
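As an illustration, the search for example (a) above could look like the following. The function is hypothetical and simply enumerates values of the parameter the model got wrong:

import pandas as pd

def search_drop_duplicates_args(dfin: pd.DataFrame, df_expected: pd.DataFrame):
    """Keep the model's function choice (drop_duplicates on 'inputB') as the
    starting point and enumerate the remaining parameter until the I/O
    example is satisfied. (A sketch, not Jigsaw's actual search.)"""
    for keep in ("first", "last", False):  # the argument the model got wrong
        candidate = dfin.drop_duplicates(subset=["inputB"], keep=keep)
        if candidate.reset_index(drop=True).equals(df_expected.reset_index(drop=True)):
            return {"subset": ["inputB"], "keep": keep}  # arguments that work
    return None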

AST-to-AST transformations

An AST (abstract syntax tree) is a tree-shaped representation of code. Because models such as Codex construct code at the syntactic level, they may produce output that is syntactically close to what is expected but wrong in a few places. For example:

a.) Query - Select the rows of dfin whose 'bar' value is <60 or >60

dfout = dfin[dfin['bar']<60|dfin['bar']>60] # Model

dfout = dfin[(dfin['bar']<60)|(dfin['bar']>60)] # Correct

Error - Without the parentheses, | binds tighter than < and >, which changes the evaluation order and throws an exception

b.) Query - Count the number of duplicated rows in df

out = df.duplicated() # Model

out = df.duplicated().sum() # Correct

Error - The result must be summed to obtain the total number of duplicated rows

To fix these issues, Jigsaw also learns AST-to-AST transformations over time. The user first repairs the code themselves; the Jigsaw UI captures the edit, generalizes it to other applicable scenarios, and learns the transformation. As usage and edits accumulate, Jigsaw gradually picks up developers' repair patterns.
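As a concrete illustration, a learned rewrite for example (b) above could be expressed as a small tree edit, as in this sketch using Python's ast module:

import ast

def append_sum(code: str) -> str:
    """Rewrite `out = df.duplicated()` into `out = df.duplicated().sum()` by
    editing the syntax tree rather than the raw text. (Illustrative sketch of
    one learned AST-to-AST transformation.)"""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Attribute)
                and node.value.func.attr == "duplicated"):
            # Wrap the existing call: <expr>.duplicated() -> <expr>.duplicated().sum()
            node.value = ast.Call(
                func=ast.Attribute(value=node.value, attr="sum", ctx=ast.Load()),
                args=[], keywords=[])
    return ast.unparse(ast.fix_missing_locations(tree))

print(append_sum("out = df.duplicated()"))  # out = df.duplicated().sum()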

Evaluation

We evaluated both Codex's raw output and Jigsaw's post-repair code on multiple datasets, measuring accuracy, that is, the percentage of tasks in a dataset for which the system produces the expected result. Codex's raw accuracy is about 30%, consistent with the numbers in OpenAI's paper. Jigsaw raises accuracy to more than 60%, and with user feedback it can be further increased to over 80%.

Looking to the future

We have released publicly available Jigsaw evaluation datasets. Each dataset contains multiple tasks, and each task consists of an English query and an I/O example. To solve a task, a system must generate Pandas code that maps the provided input dataframe to the corresponding output dataframe. We hope the community will use these datasets to evaluate and compare other systems. Although the current datasets cover only tasks expressible as an English query plus an I/O example, they are, to our knowledge, the first of their kind.
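To illustrate, a single task might be represented as follows. The field names and the accuracy helper are our assumptions, not the released datasets' actual schema; validate_candidate is the compile-run-compare sketch from earlier:

import pandas as pd

task = {
    "query": "Remove the 'Name: ' prefix from the 'country' column",
    "input": pd.DataFrame({"country": ["Name: USA", "Name: India"]}),
    "output": pd.DataFrame({"country": ["USA", "India"]}),
}

def accuracy(system, tasks) -> float:
    """Fraction of tasks where the system's generated Pandas code maps the
    input dataframe to the expected output dataframe."""
    solved = sum(validate_candidate(system(t["query"], t["input"]),
                                    t["input"], t["output"])
                 for t in tasks)
    return solved / len(tasks)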

As language models continue to grow, we believe Jigsaw will help these large models work in more real-world scenarios. Of course, this is just the tip of the iceberg in this research area, and the following key questions remain open:

Can these language models master code semantics through training?

Can Jigsaw integrate better pre- and post-processing steps? For example, we are looking at the use of representational analysis techniques to improve post-processing.

Are I/O examples effective for APIs other than Python Pandas? How do we proceed when no I/O example is available? And how can Jigsaw be adapted to general-purpose Python code and to languages such as JavaScript?

Jigsaw's current output still leaves room for improvement, which means that beyond issuing natural-language queries, developers still need to evaluate and inspect the generated code.

These are a few of the interesting directions we are exploring. As Jigsaw continues to improve and mature, we believe its automation will play an important role in raising programmer productivity. We are also working to extend our experience with the Python Pandas API to other APIs and programming languages.
