
Tsinghua postdoctoral researcher spends 10 minutes explaining the technical principles behind AlphaCode: programmers are not so easy to replace after all


AI Technology Review reports

Not long ago, DeepMind released AlphaCode, an artificial intelligence system that can automatically generate competition-level code. Billed as "comparable to an average programmer", it caused a sensation in AI circles at home and abroad as soon as it was published.


- Address of the paper: https://storage.googleapis.com/deepmind-media/AlphaCode/competition_level_code_generation_with_alphacode.pdf

- Dataset: https://github.com/deepmind/code_contests

According to DeepMind's blog, AlphaCode was evaluated on 10 contests, each attempted by more than 5,000 users, on Codeforces, which bills itself as the "world's strongest algorithm platform". AlphaCode submitted solutions to these 10 contests in exactly the same format as human contestants: it generated a large number of candidate answers, then filtered them down to feasible ones by running the code and checking the outputs, much as a human programmer would, ultimately achieving an estimated rank within the top 54% of human participants.

In other words, AlphaCode's coding ability is comparable to roughly half of the programmers (about 2,300) who competed on Codeforces. Assuming a junior programmer's monthly salary of 20,000 yuan, AlphaCode could in theory save employers around the world 552 million yuan a year in labor costs and put half of those programmers out of work...

However, the DeepMind team also made it clear at the time that AlphaCode is currently only suitable for competitive programming competitions.

Admittedly, this is another research breakthrough following DeepMind's AlphaGo, AlphaZero and AlphaFold, adding to the legend of its Alpha series. But compared with other work in the series (such as AlphaGo beating the world Go champion), AlphaCode's performance does not seem as striking.

Tim Pearce, currently a postdoctoral researcher in Zhu Jun's group at Tsinghua University, was very interested in the technical principles behind AlphaCode. After carefully reading DeepMind's 31-page paper, he produced a short video on YouTube explaining the details of AlphaCode along several dimensions: a system overview, the testing stage, the pre-training and fine-tuning datasets, the Transformer training process, and the Transformer architecture.

Video address: https://www.youtube.com/watch?v=YjsoN5aJChA

Like OpenAI's earlier GPT-3, AlphaCode is based on the Transformer model, except that GPT-3 focuses on generating natural language while AlphaCode emphasizes parsing sequential text such as code.

The following is AI Technology Review's brief summary of the video:

1

The coding problems AlphaCode targets

Currently, AlphaCode targets a specific type of competitive coding problem: challenges on websites such as Codeforces. These challenges contain a short problem description plus example test cases, which give the challenger sample inputs together with the correct expected outputs.

In short, the goal is to write code that reproduces the expected output not only on the sample test cases but also on a set of hidden test cases. If your code passes all the tests, you have solved the problem.
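To make the format concrete, here is a toy sketch of a Codeforces-style problem and solution (the problem itself is invented for illustration: read two integers and print their sum). A real submission reads stdin and writes stdout, and is judged by comparing its output against the expected output on both the visible example tests and the hidden tests.

```python
# Hypothetical contest problem: given two integers, output their sum.
# `solve` stands in for a full submission that reads input and writes output.

def solve(input_text: str) -> str:
    a, b = map(int, input_text.split())
    return str(a + b)

# Example test case as it would appear in the problem statement:
example_input, expected_output = "2 3", "5"
assert solve(example_input) == expected_output  # passes the visible test
```

The hidden tests work the same way; the contestant simply never sees their inputs.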

According to DeepMind, AlphaCode has achieved a success rate comparable to that of the average user in the coding challenges hosted by the Codeforces website.

2

AlphaCode System Overview

So, how exactly does AlphaCode work?

In the article "Competition-Level Code Generation with AlphaCode" published by the DeepMind team, they give a high-level summary diagram (below). As shown in the figure, the core component of AlphaCode is still the Transformer language model, and the remaining components are also well-established.


Illustration: A system diagram for AlphaCode

3

The protocol used

Let's first look at how AlphaCode works when testing.

The first thing to know is that AlphaCode uses a very specific protocol when solving these coding problems, and that protocol determines the system's pipeline. According to the paper, the team allowed themselves unrestricted use of the example test cases, since these are included in the problem statement.

However, they limited themselves to 10 submissions against the hidden test cases.

4

AlphaCode's testing phase

At test time, AlphaCode operates in three separate phases.

First, they use a large-scale Transformer model that takes the problem description, the example tests, and some metadata about the problem as input, then sample from the model to generate a very large number of potential solutions. They need a large number because most of these sampled scripts will not even compile, let alone solve the problem.

So, in the second and third phases, they whittle the roughly 1 million potential code scripts down to the 10 candidates the protocol allows them to submit. Their first step is simple: run the 1 million scripts on the example test cases and discard the roughly 99% that fail, which reduces the pool to around a thousand scripts.
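This first filtering stage can be sketched as follows (this is an illustration, not DeepMind's actual code; candidate "programs" are plain Python callables here for simplicity, whereas the real system compiles and executes generated source code):

```python
# Keep only candidates that reproduce the expected output on every example test.

def passes_examples(program, examples):
    """Return True if the candidate passes every (input, expected_output) pair."""
    for inp, expected in examples:
        try:
            if program(inp) != expected:
                return False
        except Exception:  # a crash counts as a failure, like a runtime error
            return False
    return True

examples = [("2 3", "5"), ("10 1", "11")]
candidates = [
    lambda s: str(sum(map(int, s.split()))),  # correct algorithm
    lambda s: str(max(map(int, s.split()))),  # wrong algorithm
    lambda s: str(int(s)),                    # crashes on multi-number input
]
survivors = [p for p in candidates if passes_examples(p, examples)]
assert len(survivors) == 1  # only the correct candidate survives
```

Run at scale over ~1 million samples, this single step eliminates the vast majority of the pool.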

However, the protocol requires them to narrow this down further to 10 solutions, so they take a very clever approach:

They use a second Transformer model that takes the problem description as input, but instead of generating code to solve the problem, this Transformer generates test-case inputs, sampling 50 of them per problem. Note that they are not trying to generate input-output pairs, just realistic inputs for the problem: depending on the problem, AlphaCode may need to generate strings, binary numbers, or lists of numbers.


Illustration: Tim Pearce explains the three phases of AlphaCode at test time

Why is this a good idea? They reasoned that if two scripts return the same answers on all 50 generated inputs, they probably use the same algorithm, so there is no point wasting two submissions on both.

So they compile and run the roughly 1,000 surviving scripts on those 50 generated inputs, then cluster the scripts by their outputs on these invented inputs. Next, they pick one sample script from each cluster. If any of the ten selected scripts passes all the hidden tests, the problem is solved; otherwise, they have failed. This is how AlphaCode works at test time.
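The clustering step can be sketched like this (again a toy illustration under the same assumption that candidates are Python callables): group programs by their output behavior on the generated inputs, then submit one representative per cluster, up to the 10-submission budget.

```python
from collections import defaultdict

def behavior_signature(program, generated_inputs):
    """Tuple of outputs on the generated inputs; a crash maps to a sentinel."""
    outs = []
    for inp in generated_inputs:
        try:
            outs.append(program(inp))
        except Exception:
            outs.append("<error>")
    return tuple(outs)

def pick_representatives(programs, generated_inputs, max_submissions=10):
    clusters = defaultdict(list)
    for p in programs:
        clusters[behavior_signature(p, generated_inputs)].append(p)
    # One program per behavioral cluster, capped at the submission budget.
    return [members[0] for members in clusters.values()][:max_submissions]

gen_inputs = ["1 2", "5 5", "0 7"]
programs = [
    lambda s: str(sum(map(int, s.split()))),               # algorithm A
    lambda s: str(int(s.split()[0]) + int(s.split()[1])),  # behaves like A
    lambda s: str(max(map(int, s.split()))),               # algorithm B
]
reps = pick_representatives(programs, gen_inputs)
assert len(reps) == 2  # two behaviorally distinct clusters found
```

The point of clustering is exactly the deduplication argument above: behaviorally identical scripts burn only one of the 10 submissions.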


This involves training the Transformer model, see below.

5

Pre-training and fine-tuning datasets

AlphaCode uses the fairly standard pre-train-then-fine-tune process common in deep learning today.

There are two datasets here. The first is a collection of public GitHub repositories in various programming languages, containing 715 GB of code for the pre-training phase, with the goal of letting the Transformer learn very general knowledge, such as code structure and syntax.


The second dataset is much smaller and serves only AlphaCode's fine-tuning goal. It was scraped from a number of competitive coding websites, including Codeforces, and contains problem descriptions, test cases, and human-written solutions; they also evaluate on it later. Those are the datasets. Now, what do they do with them?


6

The training process for the Transformer models

First, let's talk about the pre-training phase.

They crawled some GitHub code and randomly selected so-called pivot points within it.


Everything before the pivot point is fed into the encoder, and the decoder's goal is to reconstruct the code after the pivot point.
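Constructing such a pre-training example is simple to sketch (an illustration of the idea, not DeepMind's data pipeline):

```python
import random

def make_pretraining_example(source_code: str, rng: random.Random):
    """Split a source file at a random pivot: encoder sees the prefix,
    the decoder is trained to reconstruct the suffix."""
    pivot = rng.randrange(1, len(source_code))  # pivot strictly inside the file
    encoder_input = source_code[:pivot]
    decoder_target = source_code[pivot:]
    return encoder_input, decoder_target

code = "def add(a, b):\n    return a + b\n"
enc, dec = make_pretraining_example(code, random.Random(0))
assert enc + dec == code  # the two halves always reassemble the original file
```

Because the pivot is random, every file yields many different prefix/suffix training pairs across epochs.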


The encoder outputs only a vector representation of the code, which can be used throughout the decoding process.

The decoder operates autoregressively: it first predicts the first token of the code. The loss is then simply the cross-entropy between the predicted softmax output and the true token. The first true token becomes the decoder's input for predicting the second token, and this process repeats until the decoder predicts a special end-of-code token.
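This teacher-forcing loss can be shown with a minimal numeric sketch (plain Python standing in for the model; `predicted_probs[t]` plays the role of the softmax output at step t, and the decoder is fed the true previous token regardless of what it predicted):

```python
import math

def cross_entropy_loss(predicted_probs, target_tokens):
    """Average negative log-probability assigned to each true next token."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_tokens):
        total += -math.log(probs[target])
    return total / len(target_tokens)

# Toy vocabulary of 3 token ids; two decoding steps.
predicted_probs = [
    {0: 0.7, 1: 0.2, 2: 0.1},  # step 1: model favors token 0
    {0: 0.1, 1: 0.8, 2: 0.1},  # step 2: model favors token 1
]
target_tokens = [0, 1]  # the true token sequence
loss = cross_entropy_loss(predicted_probs, target_tokens)
assert loss < cross_entropy_loss(predicted_probs, [2, 2])  # wrong targets cost more
```

In a real Transformer the per-step distributions come from a softmax over the full vocabulary, but the loss computed from them is exactly this average negative log-likelihood.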

These losses are backpropagated through both the decoder and the encoder, although it turned out to be important to add a second loss on the encoder as well.

This is the masked language modeling loss: some of the tokens fed into the encoder are blanked out, and as an auxiliary task the encoder tries to predict which token was masked. Once pre-training is complete, they move on to the fine-tuning task.
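The masking step itself is easy to sketch (an illustration of the data preparation only, with an invented mask token; the prediction head is omitted):

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_rate, rng):
    """Return (masked sequence, {position: original token}) for the
    encoder's auxiliary masked-prediction task."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
masked, targets = mask_tokens(tokens, 0.3, random.Random(1))
for pos, tok in targets.items():
    assert masked[pos] == MASK and tokens[pos] == tok
```

The encoder is then trained to recover the original token at each masked position, alongside the main reconstruction loss.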

Here, they feed the problem description, metadata, and sample inputs into the encoder and have the decoder try to generate the human-written code. As you can see, this maps very naturally onto the structure enforced by the encoder-decoder architecture, and the loss is exactly the same as in the pre-training task.

There is also a second Transformer that generates the test inputs. It is initialized from the same GitHub pre-training, but fine-tuned to generate test inputs instead of code.

7

Transformer architecture

The DeepMind team experimented with models of various sizes; experimentally, larger models tend to perform better. The encoder and decoder themselves consist of multi-head attention layers, which are very standard.


8

Other tricks

The paper describes many refinements. I won't cover them all here, but I will highlight one that I think is cool: conditioning on tags and ratings alongside the problem description.


They always feed metadata into the Transformer as input: the programming language, the difficulty rating of the problem, and tags describing the solution approach. At training time the true values of these fields are obviously known, but at test time they are not, and here is the cool part: they can feed different values into these fields at test time to influence the generated code. For example, you can control which programming language the system generates, or even steer the kind of solution it attempts, such as whether to try dynamic programming or an exhaustive search. What they found helpful at test time was to randomize many of these fields when sampling the initial pool of 1 million solutions: with more diversity in the initial pool, one of the code scripts is more likely to be correct.
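The randomization trick can be sketched as follows (field names and values here are illustrative, not DeepMind's exact schema):

```python
import random

# Hypothetical conditioning fields, loosely modeled on Codeforces conventions.
LANGUAGES = ["python", "cpp"]
TAGS = ["dynamic programming", "brute force", "greedy", "graphs"]

def sample_metadata(rng: random.Random):
    """Draw random conditioning metadata for one sampled solution."""
    return {
        "language": rng.choice(LANGUAGES),
        "rating": rng.randrange(800, 3500, 100),  # Codeforces-style difficulty
        "tag": rng.choice(TAGS),
    }

rng = random.Random(42)
pool_conditioning = [sample_metadata(rng) for _ in range(1000)]
# Randomization yields diversity: every tag appears somewhere in the pool.
assert {m["tag"] for m in pool_conditioning} == set(TAGS)
```

Each sampled solution is then conditioned on its own random metadata, spreading the million-sample pool across different languages and solution strategies.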

9

Epilogue

This is Tim Pearce's explanation of how AlphaCode works.

Beyond summarizing how AlphaCode works, he shared a question of his own: why does the DeepMind team achieve a far lower, distinctly sub-human level of performance on these coding problems than the superhuman level of AlphaGo and AlphaZero at Go?

Pearce's analysis is that writing code from a natural-language description is inherently much harder than playing a game, but it could also be because far less data is available: in a game you can simulate as much data as you want, whereas the number of coding problems is limited.

Finally, Pearce leaves two questions open: why is it so hard for AI to write code, and how, in the future, might AI coding exceed the best human level?

Feel free to leave a message in the comments section.

Reference Links:

1. https://www.youtube.com/watch?v=YjsoN5aJChA

2. https://www.reddit.com/r/MachineLearning/comments/slwh69/p_alphacode_explained/

3. https://storage.googleapis.com/deepmind-media/AlphaCode/competition_level_code_generation_with_alphacode.pdf

4. https://www.deepmind.com/blog/article/Competitive-programming-with-AlphaCode


Leifeng Network
