
Where is AlphaCode strong? Tsinghua postdoctoral ten-minute video detailed analysis

Report from the Heart of the Machine

Editor: Qian Zhang

How exactly is AlphaCode trained?

During the Spring Festival, DeepMind's programming counterpart to AlphaGo, AlphaCode, went viral. It can write computer programs at the level of an average competitive programmer, achieving an average ranking in the top 54.3% across ten Codeforces contests — outperforming about 46% of human contestants.


This achievement has put considerable pressure on the programmer community, as if the history of textile workers being displaced by mechanized looms were repeating itself.

So, what makes AlphaCode so powerful? In a recent YouTube video, Tim Pearce, a postdoctoral researcher in Jun Zhu's group at Tsinghua University, analyzed in detail AlphaCode's system architecture, how it "solves problems", and how it is trained.

Original video address: https://www.youtube.com/watch?v=YjsoN5aJChA

What exactly does AlphaCode do?

As mentioned earlier, DeepMind researchers tested AlphaCode on Codeforces challenges. Codeforces is an online competitive programming platform that hosts weekly contests on a wide variety of topics and ranks participants using a rating system similar to the Elo system used in chess. Unlike the tasks programmers face when building commercial applications, Codeforces challenges are more self-contained: they require a broad understanding of algorithms and theoretical concepts in computer science, and usually a very specialized combination of logic, mathematics, and coding expertise.


The following illustration shows an example problem, including the problem description and sample inputs and outputs. Based on these, the challenger's task is to write a piece of code whose output meets the requirements.


Here's the code AlphaCode wrote:


For AlphaCode, this is only a moderately difficult challenge.

Across ten such contests, the researchers fed each problem to AlphaCode. AlphaCode then generated a large number of candidate answers and filtered them by running the code and examining the output, just like a human competitor. Yujia Li and David Choi, co-leads of the AlphaCode paper, said: "The whole process is automatic, with no need to manually select the best samples."

Why is AlphaCode so powerful?

The following illustration is a conceptual diagram of AlphaCode. It is a carefully engineered system whose main building block is a Transformer-based language model — but essentially, no single component is brand new.


AlphaCode at test time

Let's first look at how the system works at test time.

When solving coding problems, AlphaCode follows a very specific protocol that determines the pipeline of the entire system. The protocol allows unrestricted use of the sample test cases, since these are given as part of the problem statement. However, at most 10 candidate solutions may be submitted against the hidden test cases.

At test time, AlphaCode proceeds in three phases.


In the first phase, a large Transformer model takes as input a single string containing the problem description, the sample tests, and some metadata about the problem. They then sample from this model to generate a very large number of potential solutions — the first phase produces about one million candidate code scripts.

In the second phase, they run the one million candidates against the example test cases; about 99% fail and are discarded. This reduces the pool of viable scripts to around 1,000 (depending on the difficulty of the problem).
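A minimal sketch of this filtering stage, assuming each candidate program can be represented as a callable and each sample test as an input/output pair (function names here are hypothetical — the real system executes full scripts in a sandbox):

```python
# Sketch of the filtering stage: keep only candidates that reproduce
# every sample output. Crashing candidates count as failures.

def passes_samples(program, sample_tests):
    """Return True if the program reproduces every sample output."""
    for inp, expected in sample_tests:
        try:
            if program(inp) != expected:
                return False
        except Exception:  # a runtime error is also a failure
            return False
    return True

# Toy problem: output the input string reversed.
sample_tests = [("abc", "cba"), ("xy", "yx")]
candidates = [
    lambda s: s[::-1],   # correct
    lambda s: s,         # wrong answer
    lambda s: s[0] / 0,  # crashes
]

survivors = [p for p in candidates if passes_samples(p, sample_tests)]
print(len(survivors))  # only the correct candidate survives: 1
```

At AlphaCode's scale this same check runs over roughly a million sampled scripts per problem, which is why ~99% of the pool can be eliminated so cheaply.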

In the third phase, they use a second Transformer model. It also takes the problem description as input, but instead of generating code to solve the problem, it generates test-case inputs (50 per problem). That is, rather than generating input/output pairs, it generates only plausible inputs for the problem — strings, binary numbers, sequences of numbers, and so on, depending on the type of problem.

Why is this useful? The authors argue that if two scripts return the same answer on all 50 generated inputs, they are probably implementing the same algorithm — so submitting both would waste one of the ten submission opportunities. After the second phase leaves roughly 1,000 scripts, they cluster the scripts by their outputs on the 50 generated inputs and then pick one sample script from each cluster, ten in total. If any of the ten passes all the hidden tests, the programming problem is solved; otherwise the attempt fails.
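The clustering step above can be sketched in a few lines — here candidates are again represented as callables, and the "pick ten" heuristic of favoring larger clusters is an assumption of this sketch:

```python
from collections import defaultdict

def select_submissions(programs, generated_inputs, limit=10):
    """Group programs by their behaviour on the generated inputs,
    then submit one representative per cluster, up to the budget."""
    clusters = defaultdict(list)
    for prog in programs:
        outputs = []
        for inp in generated_inputs:
            try:
                outputs.append(prog(inp))
            except Exception:
                outputs.append("<error>")  # crashes get their own signature
        clusters[tuple(outputs)].append(prog)
    # One sample per cluster, larger clusters first (a heuristic assumption).
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ranked[:limit]]

# Two behaviourally identical programs land in the same cluster.
progs = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
picked = select_submissions(progs, generated_inputs=[1, 2, 3])
print(len(picked))  # doubling vs squaring -> 2 clusters
```

Note that `x * 2` and `x + x` produce identical outputs on every input, so they collapse into one cluster and cost only a single submission between them.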

That is how AlphaCode works at test time, using two Transformer models. So how are these two models trained?

Training for AlphaCode

AlphaCode's training is divided into two phases: pre-training and fine-tuning, involving two datasets. The first is a 715 GB snapshot of public GitHub repositories in various programming languages, used for pre-training. The second is a collection of problems scraped from competitive programming sites, including Codeforces, used for fine-tuning; it contains problem descriptions, test cases, and solutions written by human programmers.


Now that you have the dataset, the next step is training.

During the pre-training phase, they take a file of GitHub code and randomly select what they call a "pivot point."


Everything before the pivot point is fed into the encoder, and the decoder's goal is to reconstruct the code after the pivot point. The encoder outputs a vector representation of the code, which is then used throughout the decoding process.
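The pivot-point split can be sketched as follows — done at the character level here for simplicity, whereas the real system operates on tokens:

```python
import random

def pivot_split(source_code, rng):
    """Split a source file at a random 'pivot point': everything before
    it becomes the encoder input, everything after it is the decoder's
    reconstruction target."""
    pivot = rng.randrange(1, len(source_code))
    return source_code[:pivot], source_code[pivot:]

code = "def add(a, b):\n    return a + b\n"
rng = random.Random(0)
enc_input, dec_target = pivot_split(code, rng)
assert enc_input + dec_target == code  # the split loses nothing
```

Because any file and any pivot yields a valid training pair, the 715 GB GitHub corpus provides essentially unlimited (input, target) examples without any human labeling.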

The decoder works autoregressively. It starts by predicting the first token of the code; the loss is the cross-entropy between the decoder's softmax output and the true token. The first true token then becomes the decoder's input, and the second token is predicted. This repeats until the decoder is asked to predict a special end-of-code token. These losses are backpropagated through both the decoder and the encoder, but the encoder also receives a second loss, known as the masked language modeling loss: some tokens fed into the encoder are blanked out, and as an auxiliary task the encoder tries to predict which token was masked at each position.
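The per-step cross-entropy under teacher forcing can be written out concretely. A minimal numerical sketch, using a toy three-token vocabulary and hand-picked logits (all values here are illustrative, not from the paper):

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy between the softmax over logits and the true token:
    -log p(target). Computed in log-space for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]

def sequence_loss(step_logits, target_tokens):
    """Teacher forcing: at each step the decoder is conditioned on the
    *true* previous tokens and scored on its next-token prediction."""
    return sum(cross_entropy(l, t) for l, t in zip(step_logits, target_tokens))

# Toy vocabulary of 3 tokens; two decoding steps.
step_logits = [[2.0, 0.1, -1.0],   # step 1: model favours token 0
               [0.0, 3.0, 0.0]]    # step 2: model favours token 1
loss = sequence_loss(step_logits, target_tokens=[0, 1])
print(round(loss, 3))  # small loss, since the model favours the true tokens
```

The masked language modeling loss on the encoder has the same cross-entropy form, just applied at the blanked-out input positions instead of at each decoding step.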


After pre-training comes fine-tuning. In this phase, they feed the problem description, the metadata, and the example inputs into the encoder, and train the decoder to generate the human-written solution code. As you can see, this fits very naturally with the structure prescribed by the encoder-decoder architecture, and the loss at this stage is exactly the same as in pre-training. The second Transformer — the one that generates test inputs — is also initialized from the same GitHub pre-training task, but it is fine-tuned to generate test inputs instead of code.


In addition to the standard training steps and architecture described above, AlphaCode borrows ideas from several other recent papers. Tim Pearce highlights one of the more notable ones:


Metadata conditioning in AlphaCode

In addition to the problem description, the researchers always feed metadata into the Transformer: the programming language, the problem's difficulty rating, tags describing the problem, and whether the solution is correct. At training time the model naturally sees the true values of these fields, but at test time those values are unknown. What is very interesting is that different values can be fed into these fields at test time to steer the generated code. For example, you can control the programming language of the output, or even the type of solution the model attempts — say, a dynamic programming approach versus an exhaustive search.

At test time, they found it very helpful to randomize many of these fields when sampling the first one million code scripts: increasing the diversity of the sampling pool increases the chance that a correct answer is among them.
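A sketch of this metadata randomization — the field names, value sets, and string format below are all illustrative assumptions, not the paper's actual prompt format:

```python
import random

# Hypothetical metadata fields prepended to the problem description.
# At training time they take their true values; at sampling time,
# randomizing fields like tags and ratings diversifies the 1M samples.
LANGUAGES = ["PYTHON3", "CPP"]
TAGS = ["dp", "greedy", "brute force", "math"]

def build_prompt(description, rng):
    meta = {
        "LANGUAGE": rng.choice(LANGUAGES),
        "RATING": rng.choice(range(800, 3600, 100)),
        "TAGS": rng.choice(TAGS),
        "CORRECT": "True",  # always condition on "correct solution"
    }
    header = " ".join(f"{k}: {v}" for k, v in meta.items())
    return header + "\n" + description

rng = random.Random(0)
prompts = {build_prompt("Reverse the string.", rng) for _ in range(100)}
print(len(prompts) > 1)  # randomized metadata yields diverse prompts: True
```

The same mechanism also gives test-time control: fixing `LANGUAGE` or `TAGS` to a chosen value steers what kind of solution the model generates.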

That concludes Tim Pearce's analysis of AlphaCode. He believes the DeepMind team has made real progress with this work, but he wonders why the system achieves far less on these programming problems than it did in games like Go and StarCraft, where it surpassed humans. His initial guess is that programming problems are harder and data is harder to obtain: in a game you can generate unlimited simulated data, but you cannot do that for programming problems.

Reference Links:

https://www.youtube.com/watch?v=YjsoN5aJChA

https://storage.googleapis.com/deepmind-media/AlphaCode/competition_level_code_generation_with_alphacode.pdf
