The competition heats up! An AI programmer goes live, and a Math Olympiad "problem solver" arrives the same day

Report from the Heart of the Machine

Editors: Zhang Qian, Mayte

The world was already competitive enough; with AI joining in, the competition has only intensified...

Talk about competition!

During China's Spring Festival, DeepMind and OpenAI, two well-known AI research institutions, released important research results: DeepMind announced AlphaCode, a Transformer-based system that can write computer programs at a level comparable to human competitors; on the same day, a neural theorem prover developed by OpenAI solved two problems adapted from the International Mathematical Olympiad (IMO).

Do these two areas of AI sound familiar? That's right: in 2021, GitHub and OpenAI released Copilot, an AI code-completion tool, and OpenAI unveiled Codex, the technology behind it. Similarly, in the second half of last year, DeepMind published its own AI research on tackling mathematical problems, which appeared in Nature.

Although the two institutions' new results offer fresh ideas for applying AI to old problems, they also left netizens sighing that the AI field has become far too competitive!

Source: screenshot of a netizen's Weibo post

AlphaCode, which beat 46% of human contestants

In a recent paper, DeepMind researchers introduced AlphaCode, a system that uses a Transformer-based language model to generate code at large scale and produce complete programs.

Paper link: https://storage.googleapis.com/deepmind-media/AlphaCode/competition_level_code_generation_with_alphacode.pdf

The researchers tested AlphaCode on Codeforces, a competitive programming platform that, much like chess's Elo rating system, publishes weekly programming challenges and rankings. Unlike the tasks programmers face when building commercial applications, Codeforces challenges are more self-contained and require a broader grasp of algorithmic and theoretical concepts in computer science, often a very specialized combination of logic, mathematics, and coding expertise.

AlphaCode was tested on 10 Codeforces challenges that had been attempted by 5,000 users, and it placed in the top 54.3% overall, meaning it beat roughly 46% of the entrants. DeepMind estimates AlphaCode's Codeforces Elo at 1238, which puts it within the top 28% of users who have competed on the site in the past six months.

For example, one of the challenges used to test AlphaCode asked participants to find a way, using a limited set of keystrokes, to convert one string of letters, s, into another string, t. Instead of simply typing new letters, a competitor may for instance have to press the "backspace" key to delete letters from the original string. For AlphaCode, this was only a medium-difficulty challenge:
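
To give a flavor of this kind of problem, here is a minimal Python sketch of one common greedy approach, assuming rules of the following form (our assumption, since the article does not spell out the exact task): you type s from left to right, and at each step you either type the current character or press backspace, which erases the last typed character; the question is whether the final text can equal t.

```python
def can_obtain(s: str, t: str) -> bool:
    """Greedy check, scanning both strings from the end.

    If the current characters match, consume one from each string.
    Otherwise the current character of s must be sacrificed: press backspace
    at this step (so it is not typed), which also erases the character typed
    just before it, so skip two characters of s.
    """
    i, j = len(s) - 1, len(t) - 1
    while i >= 0:
        if j >= 0 and s[i] == t[j]:
            i -= 1
            j -= 1
        else:
            i -= 2  # press backspace instead of typing s[i]
    return j < 0  # success if every character of t was matched


# Example usage
print(can_obtain("ababa", "ba"))  # True
print(can_obtain("ababa", "bb"))  # False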

The ten challenges were fed into AlphaCode in exactly the same format given to human entrants. AlphaCode then generated a large number of candidate answers and filtered them by running the code and checking the outputs, just as a human competitor would. Yujia Li and David Choi, co-leads of the AlphaCode paper, said: "The whole process is automatic, with no need to manually select the best samples."
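
DeepMind has not released its filtering code; the snippet below is only a minimal sketch of the general "generate many candidates, keep what passes the example tests" idea, running each candidate Python program as a subprocess (the function names and structure are ours, not DeepMind's):

```python
import subprocess


def passes_example_tests(program_src: str,
                         examples: list[tuple[str, str]],
                         timeout_s: float = 2.0) -> bool:
    """Run a candidate program on each (input, expected_output) example pair."""
    for stdin_text, expected in examples:
        try:
            result = subprocess.run(
                ["python3", "-c", program_src],
                input=stdin_text, capture_output=True,
                text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True


def filter_candidates(candidates: list[str],
                      examples: list[tuple[str, str]]) -> list[str]:
    """Keep only the sampled programs that reproduce the example outputs."""
    return [src for src in candidates if passes_example_tests(src, examples)]
```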

Standing out in the Codeforces challenges was not easy. The AlphaCode project began more than two years ago, and by combining advances in large-scale Transformer models with large-scale sampling and filtering techniques, DeepMind researchers have made significant progress on the number of problems that AI can solve.

Because of the pandemic, most of the work on the project was done from home.

The researchers pre-trained the model on selected public GitHub code and fine-tuned it on a relatively small dataset of competitive programming problems. At evaluation time, they generated a huge number of C++ and Python programs for each problem, orders of magnitude more than previous work. These candidate solutions were then filtered, clustered, and reranked down to a small set of 10 programs submitted for external assessment. This automated system replaces the trial-and-error process a human competitor goes through: debugging, compiling, passing tests, and finally submitting.
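
The paper describes clustering the surviving programs by behavior and submitting at most 10 of them. One rough sketch of how such a step could look, grouping programs by their outputs on extra generated inputs, is below; the run_program helper is hypothetical and would wrap a subprocess call like the one shown earlier.

```python
from collections import defaultdict


def select_submissions(survivors: list[str], extra_inputs: list[str],
                       run_program, k: int = 10) -> list[str]:
    """Cluster programs that behave identically on the extra inputs,
    then submit one representative from each of the largest clusters."""
    clusters: dict[tuple, list[str]] = defaultdict(list)
    for src in survivors:
        # Behavioral signature: the outputs produced on the extra inputs.
        signature = tuple(run_program(src, stdin_text) for stdin_text in extra_inputs)
        clusters[signature].append(src)
    # Largest clusters first: behaviors that many independent samples agree on.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ranked[:k]]
```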

Overall, AlphaCode ranked roughly at the median of its human competitors. While far from winning, this result represents a substantial leap in AI problem-solving capability and demonstrates the potential of deep learning models on tasks that require critical thinking. DeepMind notes that AlphaCode's current skill set applies only to competitive programming, but its capabilities open new doors for creating tools that make programming easier and could one day make it fully automated.

Many other companies are developing similar applications. For end users, these systems work like Gmail's Smart Compose feature, offering suggestions as you write.

Much progress has been made in AI programming systems in recent years, but they are far from ready to take over the work of human programmers. The code they generate is often buggy, and because the systems are usually trained on public code bases, they sometimes reproduce copyrighted material.

In a study of the GitHub Copilot programming tool, researchers found that about 40 percent of the code it produced contained security vulnerabilities. Security analysts have even suggested that bad actors could deliberately write and share code online with hidden backdoors, which could then end up in the training data of AI programming tools and cause them to insert such flaws into future programs.

Challenges like these mean that AI programming systems are likely to be integrated into programmers' work slowly. In other words, they will serve an apprenticeship, starting as assistants whose suggestions are treated with skepticism, before they are trusted to work autonomously.

DeepMind has now published a dataset of competition-level programming problems and solutions on GitHub. It also includes extensive test data to verify that programs passing the tests are actually correct, a key feature that current datasets lack. DeepMind hopes this benchmark will drive further innovation in problem solving and code generation.

GitHub project address: https://github.com/deepmind/code_contests

A neural theorem prover takes on Olympiad problems

In the world of academic competitions, the International Mathematical Olympiad (IMO) is among the most famous, and many well-known mathematical prodigies (such as Wei Dongyi) have achieved impressive results there.

In 2021, the competition saw a small change: Lean, a mathematical AI that Microsoft has been developing for years, also entered and competed against human contestants. Lean is a computer theorem prover introduced by Microsoft Research in 2013: mathematicians can translate mathematical statements into code and feed them to Lean, which then verifies whether the theorems are correct.

With an IMO gold medal as the goal, researchers have kept polishing Lean, including researchers at Microsoft-backed OpenAI. OpenAI has just announced that it has built a neural theorem prover for Lean that can solve a variety of challenging high-school Olympiad problems, including two problems adapted from the IMO as well as several from the AMC12 and AIME competitions.

The prover uses a language model to find proofs of formal statements. Each time a new proof is found, the researchers use it as new training data, which improves the neural network and enables it to find solutions to harder and harder statements in subsequent iterations.

The prover achieves a new state of the art (41.2% vs. 29.3%) on the miniF2F benchmark, a challenging collection of high-school Olympiad problems.

The researchers call their method statement curriculum learning. It involves manually collecting a set of statements of varying difficulty (without proofs), with the hardest statements similar to the target benchmark. Initially their neural prover is weak and can prove only a few of them, so they iteratively search for new proofs and retrain the neural network on the newly discovered proofs. After 8 iterations, the prover achieved excellent results on miniF2F.
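
OpenAI has not released this training code; as a rough sketch of the loop described above (the prover's search, fine-tuning, and evaluation calls here are placeholders of our own, not OpenAI's API):

```python
def statement_curriculum_learning(model, curriculum_statements, benchmark,
                                  n_iterations: int = 8):
    """Expert-iteration loop: search for proofs of still-unproved statements,
    add any proofs found to the training data, retrain, and repeat."""
    proved: dict = {}  # statement -> proof found so far
    for _ in range(n_iterations):
        for statement in curriculum_statements:
            if statement in proved:
                continue
            proof = model.search_for_proof(statement)  # placeholder search API
            if proof is not None:
                proved[statement] = proof
        model.finetune(list(proved.items()))   # retrain on newly found proofs
        print("solved on benchmark:", model.evaluate(benchmark))
    return model
```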

Formal mathematics is an exciting area of study because: 1) it is rich enough to let you prove arbitrary theorems that require reasoning, creativity, and insight; 2) it resembles games in that there is an automated way of determining whether a proof is valid (namely, verification by the formal system). As the example below shows, proving a formalized statement requires generating a sequence of proof steps, each of which consists of a call to a tactic.

The artifacts accepted by formal systems are low-level (like assembly code) and hard for humans to produce. Tactics are search procedures that generate such artifacts from higher-level directives, assisting formalization.

These tactics take mathematical terms as arguments, and each tactic call transforms the current statement to be proved into statements that are easier to prove, until nothing is left to prove.
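
For readers unfamiliar with Lean, here is a toy example of what such a tactic proof looks like (Lean 3 with mathlib; this is our own illustration, not one of the competition problems):

```lean
import tactic

-- Each tactic call transforms the current goal into easier goals.
example : ∃ n : ℕ, n + 1 = 2 :=
begin
  use 1,      -- supply the witness n = 1; the goal becomes 1 + 1 = 2
  norm_num,   -- close the remaining arithmetic goal
end
```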

The researchers observed that the ability to generate the original mathematical terms needed as tactic arguments emerged during training, something that would not be possible without a neural language model. The proof below is an example: the proof step "use n + 1" (generated entirely by the model) proposes "n + 1" as the solution, and the rest of the formal proof relies on the "ring_exp" tactic to verify that it indeed works.
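
A toy analogue of this pattern (again Lean 3 with mathlib, our own example rather than the one from the paper): the use step proposes a term as the witness, and ring_exp verifies the resulting algebraic identity.

```lean
import tactic

example (n : ℕ) : ∃ m : ℕ, m = (n + 1) ^ 2 :=
begin
  use n ^ 2 + 2 * n + 1,  -- the witness term, analogous to "use n + 1" above
  ring_exp,               -- verify n ^ 2 + 2 * n + 1 = (n + 1) ^ 2
end
```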

The researchers also observed that their model and search procedure can produce proofs that chain together multiple non-trivial reasoning steps. In the proof below, the model first uses contraposition, leading to the existential statement (∃ (x : ℝ), f x ≠ a * x + b). It then generates a witness for it with use (0 : ℝ) and completes the proof using the norm_num tactic.

Trained with statement curriculum learning, the model was able to solve a variety of problems from the training materials and from the AMC12 and AIME competitions, as well as two problems adapted from the IMO. Here are three examples.

Formal mathematics involves two main challenges that make a naive application of reinforcement learning unlikely to succeed:

1. Infinite action space: Formal mathematics not only has an extremely large search space (like Go), it also has an infinite action space. At each step of a proof search, the model does not choose among a well-defined finite set of actions, but from a complex and infinite set of tactics involving exogenous mathematical terms that must be generated (for example, generating a mathematical term to be used as a witness).

2. Lack of self-play: Unlike in two-player games, a prover does not play against an opponent but against a set of statements to be proved. When faced with a statement that is simply too hard, there is no obvious reframing that would let the prover first generate easier intermediate statements to tackle. This asymmetry prevents the straightforward application of the self-play algorithms that succeed in two-player games.

In this work, the researchers address the infinite action space by sampling actions from a language model, which can generate tactic calls along with the original mathematical terms often required as arguments. For the lack of self-play, they observe that the key role of self-play in two-player games is to provide an unsupervised curriculum, and they propose to replace it with a set of auxiliary problem statements of varying difficulty (without requiring proofs). Their experiments show that when the difficulty of these auxiliary statements is varied enough, the training procedure can solve a sequence of increasingly difficult problems and eventually generalizes to the set of problems they care about.
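
A very rough sketch of the inner proof-search loop this implies: a best-first search in which a language model proposes tactic strings for the current goal. The language_model and lean_env objects below are placeholders for components OpenAI has not released, and for simplicity each tactic is assumed to leave at most one open goal.

```python
import heapq


def best_first_proof_search(root_goal, language_model, lean_env,
                            max_expansions: int = 1000,
                            samples_per_goal: int = 8):
    """Expand the most promising goal, sampling candidate tactics from the LM."""
    counter = 0  # tie-breaker so heap entries never compare goal objects
    frontier = [(0.0, counter, root_goal, [])]  # (cost, id, goal, tactics so far)
    for _ in range(max_expansions):
        if not frontier:
            break
        cost, _, goal, proof_so_far = heapq.heappop(frontier)
        for tactic, logprob in language_model.sample_tactics(goal, n=samples_per_goal):
            outcome = lean_env.apply(goal, tactic)   # run the tactic in Lean
            if outcome is None:                      # tactic failed
                continue
            if outcome.no_goals_left:                # proof complete
                return proof_so_far + [tactic]
            counter += 1
            heapq.heappush(frontier, (cost - logprob, counter,
                                      outcome.next_goal, proof_so_far + [tactic]))
    return None  # no proof found within the search budget
```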

While these results are exciting because they show that deep learning models can carry out non-trivial mathematical reasoning when interacting with a formal system, the prover is still far from the performance of the best students in these competitions. The researchers say they hope their work will advance research in this area, particularly toward the IMO, and that their statement curriculum learning approach will help accelerate progress in automated reasoning.

Brief summary

Now that the two institutions' latest research results have been introduced, assessments of their impact have begun to appear online:

For example, some AI research scientists have posted long Twitter threads arguing that AlphaCode is still several years away from human-level performance and that its Codeforces ranking comes with caveats, for instance that many of the participants are high school or college students. Others point out that the vast majority of the programs AlphaCode generates are wrong, and that it is the filtering on example tests that allows AlphaCode to actually solve some problems.

Some researchers say the result looks like another case of AlphaStar-style brute force working wonders.

AI practitioners in China can take advantage of the holiday to dig into these two studies and share their own views.

Reference links: https://openai.com/blog/formal-math/?continueFlag=6cc759bbfb87d518f6d6948bcf276707

https://deepmind.com/blog/article/Competitive-programming-with-AlphaCode?continueFlag=b34ed7683541bab09a68d7ab1d608057
