
Chinese researchers at Caltech use AI to shake up mathematical proofs! The tool that stunned Terence Tao now automates 80% of proof steps

Author: New Zhiyuan

Editor: Editorial Department

【New Zhiyuan Introduction】Lean, the formalization tool that Terence Tao has praised repeatedly, had one sticking point: running LLM inference inside it. The Caltech team has now solved this problem, automating more than 80% of mathematical proof steps!

Lean Copilot, the formal mathematics tool that Terence Tao and many other mathematicians are full of praise for, has evolved again?

Just now, Caltech professor Anima Anandkumar announced that the team has released an extended version of the Lean Copilot paper and updated the codebase.


Address: https://arxiv.org/pdf/2404.12534.pdf

The latest experiments show that this Copilot tool can automate more than 80% of mathematical proof steps, 2.3 times better than the previous rule-based baseline, aesop.

And, as before, it is open source under the MIT license.


A major contributor is Peiyang Song (宋沛洋), a computer science honors undergraduate at UCSB and a SURF researcher in the Computing + Mathematical Sciences (CMS) department at the California Institute of Technology.

Netizens exclaimed: so Terence Tao's current mathematical research can be accelerated five-fold, just like that?


LLMs propose proof tactics, humans intervene seamlessly

The team released the Lean Copilot tool in the hope of enabling collaboration between humans and LLMs to produce formal mathematical proofs that are 100% accurate.

It solves a core technical challenge: running LLM inference inside Lean.

With this tool, LLMs can propose proof tactics in Lean, and humans can intervene and make modifications seamlessly.


The project was developed because automated theorem proving remains a daunting challenge today.

We all know that LLMs are often unreliable on math and reasoning tasks: they make mistakes and hallucinate.


Therefore, until now, mathematical proofs have mostly been derived by hand and carefully verified.

Proof assistants like Lean can formalize every step of a proof, but writing Lean takes considerable human effort.

In this context, the birth of Lean Copilot is of great significance.

The tool that has repeatedly stunned Terence Tao: mathematicians who don't use it will fall behind

The thesis that LLMs can serve as tools to assist humans in proving theorems has been confirmed by Terence Tao many times.

He recently predicted on his blog that by 2026, AI combined with search and symbolic math tools will become a trustworthy co-author in mathematical research.

Immediately afterward, studies supporting his view sprang up like mushrooms after rain.

In June last year, researchers from Caltech, NVIDIA, MIT and other institutions built LeanDojo, a theorem prover based on open-source LLMs.


In September, researchers from Microsoft Research Asia, Peking University, Beihang University and other institutions got GPT-4 to arrive at the conclusion "P≠NP" through 97 rounds of rigorous "Socratic" reasoning, taking on this millennium problem.


In the 97th round of dialogue, GPT-4 concluded that the constructed example could not be solved without exhaustive search, arriving at the conclusion P≠NP.

In October last year, with the help of GPT-4 and Copilot, Terence Tao discovered a hidden bug in his own paper.

While formalizing the argument on page 6 in Lean 4, he found that one of the expressions is actually divergent at n = 3, k = 2.

This not-so-obvious bug was caught in time thanks to Lean 4. The reason: Lean required him to establish 0 < n − 3, but he had only assumed n > 2, and that assumption does not rule out the case n = 3, where n − 3 = 0.


This discovery made Terence Tao's eyes pop.


At the end of last year, Terence Tao successfully used AI tools to complete the formalization of the proof of the polynomial Freiman-Ruzsa conjecture.


Finally, the dependency graph was entirely covered in green, and the Lean compiler reported that the conjecture follows exactly from the standard axioms.


In this process, front-line mathematical researchers felt for the first time the disruptive power that AI brings to mathematical research.

Lean Copilot: making Lean easier to use

And now, the Lean Copilot work has made Lean even more capable.

In this paper, the team uses Lean Copilot to build tools that use LLMs to suggest proof steps (tactic suggestion), complete intermediate proof goals (proof search), and select relevant premises (premise selection).

The experimental results fully demonstrate that, compared with Lean's existing rule-based proof automation, Lean Copilot is effective in assisting humans and in automating theorem proving.

Lean Copilot provides a general framework for running LLM inference, either locally via CTranslate2 or on a server.

Through this framework, users will be able to create a variety of automated proof tools.


Lean is a popular proof assistant among mathematicians. As shown in the diagram below, a proof in Lean is made up of a series of proof steps called tactics.


Starting from the entire theorem as the initial goal, tactics repeatedly transform the current goal into simpler sub-goals until all goals are solved.

The user writes tactics interactively in a VS Code-based IDE, which displays the goals in the infoview panel on the right.
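As a minimal illustration of this tactic-by-tactic style (a toy example of our own, not one of the paper's figures):

```lean
-- Each line after `by` is one tactic. The initial goal is the full
-- statement; tactics transform it until the infoview shows "No goals".
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]   -- rewrites the goal to `b + a = b + a`, which closes by rfl
```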

Generating tactic suggestions

Using Lean Copilot, the team built suggest_tactics, a tool that uses LLMs to generate tactic suggestions.

It is itself a Lean tactic.

When applied, it feeds the current goal to the LLM and obtains a list of candidate tactics generated by the model.

It then checks each candidate to see whether it 1) causes an error, 2) runs without error but does not complete the proof, or 3) completes the proof successfully.

Candidates in category 1) are removed.


Only error-free tactics are displayed in the infoview panel on the right.

Tactics that successfully complete the proof are marked in green (category 3), while tactics that run without error but do not complete the proof are marked in blue (category 2).

Note: even when all of the listed tactics fall into category 2, this information can be extremely valuable to the user.

In that case, the remaining goal information can directly help the user choose a tactic as the next intermediate proof step.

Once users see the suggestions, they can choose whether to accept one or use them as inspiration for a new tactic.

For example, we define a theorem add_abc in Lean code, and its initial goal is shown on the right in Figure 3.


When we enter suggest_tactics, the tactic suggestions appear on the right.

The first tactic is shown in green, indicating that it completes the proof successfully.

The next three suggestions are shown in blue, indicating that they do not complete the proof directly but do not cause errors either.

Therefore, they are likely to be valid intermediate proof steps!

At the same time, the remaining sub-goals are also displayed.

The tactic state field shows "No goals" because at least one of the suggested tactics completes the proof.
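In Lean source code, the interaction described above looks roughly like the following sketch (based on the add_abc example and the tactic name reported in the paper; the actual suggestions depend on the model):

```lean
import LeanCopilot

-- `suggest_tactics` sends the current goal to the LLM and lists the
-- candidate tactics in the infoview, colour-coded as described above.
theorem add_abc (a b c : Nat) : a + b + c = a + c + b := by
  suggest_tactics
  -- clicking a green suggestion (e.g. something like `omega`) inserts
  -- it here and completes the proof
```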


Searching for complete proofs

In addition, because neither humans nor machines can consistently produce the right tactic, it is necessary to backtrack and explore different alternatives; this process is proof search.

The suggest_tactics tool described above only generates tactics for the current step and has no ability to search for multi-tactic proofs.

To address this, the team combined it with aesop, a rule-based proof search tool, to build an LLM-based proof search tool, search_proof.

Aesop implements best-first search as a Lean tactic and allows users to configure how the search tree is expanded.


The search tree consists of goals as nodes.

Initially, it contains only the original goal as the root node. At each step, aesop selects the most promising unexpanded node, expands it by applying tactics, and adds the resulting goals as its children.


When aesop finds a path from the root to goals that can be trivially solved, the search succeeds!

The performance of aesop therefore depends critically on whether the user has configured an effective rule set.

This is where aesop lacks flexibility. search_proof therefore augments aesop's rule set with goal-dependent tactics generated by suggest_tactics at each step, making the search more flexible.

For the original goal in Figure 3, the user simply enters search_proof, and a complete proof that solves the goal is found and displayed in the infoview (Figure 5, right).

As you can see, the remaining tactic state is "No goals" because a successful proof was found.
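Invoked from Lean source, the search tool is a one-liner (again a sketch assuming the tactic name from the paper):

```lean
import LeanCopilot

-- `search_proof` runs aesop's best-first search, augmented at each node
-- with LLM-generated tactics; if a proof is found, it is displayed in
-- the infoview and can be inserted in place of this call.
theorem add_abc' (a b c : Nat) : a + b + c = a + c + b := by
  search_proof
```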


Selecting annotated premises

Another challenging and important task in theorem proving is finding the relevant premises that reduce or complete a proof.

Beyond the large number of premises in Lean's own source code and standard library, Lean also has a large mathematics library, Mathlib.

However, searching for candidate premises from all libraries is extremely difficult and time-consuming.

Hence there have been many attempts to assist with or automate premise selection, in Lean as well as in other proof assistants.


In Lean, the state-of-the-art premise selection method is a random forest-based framework implemented directly in Lean.

However, the premise selection task is well suited to retrieval-augmented LLMs, i.e., training a retrieval matrix (premise embeddings) alongside the model to estimate the relevance between a proof goal and candidate premises.

At inference time, the given proof goal is first encoded into a vector, and a matrix-vector multiplication is then performed between the premise-embedding matrix and the goal vector.

Then, to select the top k premises (where k is a hyperparameter determining how many premises the user wants returned), only the k premises with the highest scores are returned.
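In symbols, the retrieval step reads as follows (the notation here is ours, added only to make the description above concrete):

$$ s = P\,g \in \mathbb{R}^{N}, $$

where $P \in \mathbb{R}^{N \times d}$ holds the precomputed embeddings of the $N$ candidate premises, $g \in \mathbb{R}^{d}$ is the encoded proof goal, and the $k$ premises with the largest entries of $s$ are returned.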

To perform this inference in Lean, in addition to the fast LLM inference provided by Lean Copilot, an efficient matrix multiplication library and a C++ reader for NumPy matrices are needed.

The researchers used the matrix multiplication routine from CTranslate2 and the fast C++ NumPy file reader from libnpy.

Again, they link these functions into Lean through the FFI mechanism.

As a result, the premise selection tactic runs very efficiently: the premise embeddings can be precomputed, and all subsequent operations are carried out quickly in C++ using the libraries above.

After obtaining the returned premises, the researchers further annotate them with useful information.

All premises are divided into two categories: those that can be used directly in the current environment (in-scope premises) and those that cannot (out-of-scope premises).

Which category a premise falls into depends on whether the required package has been imported.

If the package a premise requires has already been imported, the premise can be used directly. Annotated in-scope premises are shown in Figure 6 below.


Figure 7 shows the annotated out-of-scope premise.


As an example of premise selection, for the theorem add_abc in Figure 3, the user can simply enter select_premises in the proof (Figure 8, left).

A list of relevant premises then appears in the infoview (Figure 8, right).

For this simple theorem, the selected premises are clearly relevant, since they all concern natural numbers and rules of addition.

In this case, the four selected premises are all in scope, meaning their modules have already been imported.
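In Lean source, this corresponds roughly to the following sketch (assuming the select_premises tactic name from the paper; the closing tactic is just one illustrative way to finish the proof afterwards):

```lean
import LeanCopilot

-- `select_premises` encodes the current goal, scores it against the
-- precomputed premise embeddings, and lists the top-k premises in the
-- infoview, annotated as in-scope or out-of-scope.
theorem add_abc'' (a b c : Nat) : a + b + c = a + c + b := by
  select_premises
  -- the user then finishes the proof, possibly with one of the
  -- suggested lemmas; here e.g. linear arithmetic suffices
  omega
```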


In summary, the researchers used Lean Copilot to build three practical proof automation tools: tactic suggestion, proof search, and premise selection.

81.2% of the proof steps are automated

With the Lean Copilot framework, the researchers empirically tested the hypothesis that human-AI collaboration in interactive theorem proving (ITP) in Lean is beneficial.

Because theorem proving in Lean is done mainly through tactic-style proofs, the experiments focus on evaluating the "tactic suggestion" and "proof search" tools.

Aesop is currently the state-of-the-art rule-based proof automation tool for proof search.

The researchers verified the effectiveness of the LLM-based search_proof against aesop in two settings:

(1) proving theorems autonomously (the LLM works independently)

(2) assisting humans in theorem proving (human-AI collaboration)

In addition, the researchers compared search_proof with suggest_tactics to demonstrate the benefit of proof search beyond single-tactic suggestion.

The way Lean Copilot helps humans in the ITP process mirrors the paradigm of programmers using Copilot in software development.

That is, when faced with a goal, we first call Copilot to see whether it can solve the problem directly.

If it cannot, we simplify the goal further and try Copilot again, repeating the process until Copilot solves the remaining goals.

Through this iterative collaboration paradigm, the researchers measure how much human effort each proof automation tool can automate away.

The specific results are shown in Table 1 below.

Proof search (search_proof) automatically proves 64% of the theorems (32 out of 50), significantly more than aesop and tactic suggestion (suggest_tactics).

When used to assist humans, proof search requires an average of only 1.02 manually entered tactics, also better than aesop (3.62) and tactic suggestion (2.72).


Finally, for each test theorem, the authors calculated the percentage of proof steps each of the three tools could automate.

The results show that proof search can automatically complete about 81.2% of the proof steps, significantly more than tactic suggestion (48.6%) and aesop (35.2%).

Overall, proof search performs 1.67 times better than tactic suggestion and 2.31 times better than the rule-based baseline aesop.

Local LLM inference in Lean via Copilot

Tactic suggestion, proof search, and premise selection in Lean Copilot may seem different in nature, but the requirements for user experience are similar.

They all need to generate responses quickly, with modest compute requirements, while running inside Lean.

The reason for these requirements is that Lean itself can, in most cases, provide feedback on the proof environment (e.g., remaining goals, error messages, type information) very quickly.

This speed matches the nature of theorem proving, which requires coherent, uninterrupted reasoning.

If Lean Copilot made users wait a long time, human-AI collaboration would be hard to sustain.

The low-compute requirement matters just as much: theorem proving in Lean itself does not require a GPU and runs on users' local laptops.

Therefore, being able to run efficiently on most hardware, including laptops without GPUs, is very important for Lean users.

This is because users may not be able to access CUDA-enabled GPUs when writing proofs.

Given the need for fast inference and low compute, and the fact that the popular, efficient deep learning frameworks are all in Python, a natural solution would be to host the model (locally or remotely) in Python and issue requests to it from Lean.

However, this approach suffers from inter-process communication overhead and requires extra setup steps from the user, which does not fit Lean's typical workflow.

To overcome these issues, Lean Copilot runs LLM inference natively in Lean through a foreign function interface (FFI).

FFI is a mechanism by which a program written in one language can invoke a subroutine of another language.

Lean is partly implemented in C++ and interoperates efficiently with C++.

A programmer can declare a function in Lean but implement its body in C++. The implementation is compiled into a shared library and dynamically linked into Lean.
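Concretely, the Lean side of such an FFI binding looks roughly like the following sketch (the C symbol name and function here are hypothetical, for illustration only):

```lean
-- Declare a function in Lean whose body is implemented in C++.
-- The `@[extern ...]` attribute names the C symbol; the C++ code is
-- compiled into a shared library and dynamically linked into Lean.
@[extern "lc_generate_tactics"]
opaque generateTactics (goal : String) : String
```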

By default, Lean Copilot uses LeanDojo's pretrained ReProver model. It is based on ByT5, an encoder-decoder Transformer that maps input strings to output strings.

Lean Copilot makes the model runnable in Lean by wrapping it into a C++ function that operates on strings, which can be called in Lean via FFI.


Chinese authors have made great contributions

The three-person team behind the latest paper is the same team that built the open-source platform LeanDojo in June 2023.


Address: https://arxiv.org/pdf/2306.15626.pdf

Peiyang Song (宋沛洋)


Peiyang Song is a computer science honors undergraduate in the College of Creative Studies (CCS) at the University of California, Santa Barbara, advised by Richert Wang and Phill Conrad.

He is also a SURF Fellow in the Department of Computational and Mathematical Sciences (CMS) at the California Institute of Technology, co-supervised by Prof. Anima Anandkumar and Dr. Kaiyu Yang.


In addition, he is a researcher at the UCSB Architecture Lab (ArchLab), collaborating with Prof. Tim Sherwood and Dr. Jeremy Lau (Google).

His research interests lie in machine learning (ML), spanning applications such as natural language processing (NLP) and computer vision (CV), as well as foundations such as systems and programming languages (PL).

Song's recent research has two main directions.

The first is neuro-symbolic reasoning and AI for mathematics (AI4Math), which combines large language models with interactive theorem provers (ITPs).

The other is energy-efficient machine learning based on temporal logic.

Kaiyu Yang (杨凯峪)


Kaiyu Yang is a postdoctoral researcher in the Computing + Mathematical Sciences (CMS) department at the California Institute of Technology, advised by Anima Anandkumar.

He received his Ph.D. from Princeton University under the supervision of Jia Deng, and has worked with Olga Russakovsky and Danqi Chen.

His research focuses on neuro-symbolic artificial intelligence, which aims to make machine learning capable of symbolic reasoning. He hopes to achieve this in two directions:

(1) applying machine learning to symbolic reasoning tasks, such as mathematical reasoning and theorem proving in formal logic or natural language;

(2) introducing symbolic components into machine learning models to make them more interpretable, verifiable, and data-efficient.

Currently, he is working on artificial intelligence that can understand and reason about mathematics. Mathematical reasoning is an important milestone of human intelligence, and it has the potential to transform many important problems in science and engineering, such as solving partial differential equations and formal verification.

Anima Anandkumar


Anima Anandkumar is a Professor of Computing and Mathematical Sciences at the California Institute of Technology.

Her research interests focus on large-scale machine learning, non-convex optimization, and high-dimensional statistics.

In particular, she has been spearheading the development and analysis of tensor algorithms for machine learning.

Tensor decomposition methods are highly parallel and scalable, and can be applied to massive amounts of data. They guarantee convergence to the optimal solution and output consistent estimates for many probabilistic models, such as Markov models.

More broadly, Professor Anandkumar has been working on efficient techniques to accelerate non-convex optimization.

Resources:

https://arxiv.org/abs/2404.12534

https://github.com/lean-dojo/LeanCopilot

https://twitter.com/animanandkumar/status/1782518528098353535
