With the continuing GPT craze, large language models (LLMs) such as ChatGPT have begun to enter fields once considered the exclusive domain of human intellect. For example, the Fields Medal winner Terence Tao, a Chinese-American mathematician, wrote in a blog post that he had begun using GPT-4 to assist his work. And a paper posted to the preprint site arXiv on July 27, 2023, co-authored by scholars from Caltech, NVIDIA, MIT and other institutions, claims to have built a theorem prover based on an open-source LLM. It seems that overnight, AI has conquered mathematics, one of the purest fields of human intelligence.
In fact, mathematicians' search for automated proof goes back more than a hundred years. Even the birth and development of the computer are inseparable from this quest.
Foundations of mathematics and Gödel's incompleteness theorem
In a lecture at the Royal Institution on April 27, 1900, the physicist Lord Kelvin delivered his famous remarks on the "two dark clouds" hanging over physics. As is well known, those two clouds set off a storm, out of which the two pillars of twentieth-century physics, relativity and quantum mechanics, were born.
In the same year, the mathematician David Hilbert gave a lecture entitled "Mathematical Problems" at the Second International Congress of Mathematicians in Paris, proposing the twenty-three mathematical problems he considered most important. These came to be known as "Hilbert's problems" or "Hilbert's 23 problems," and their study greatly propelled the development of mathematics in the twentieth century.
Among the 23 problems are several that touch the foundations of mathematics and science, such as the "continuum hypothesis," the "consistency of the axioms of arithmetic," and the "axiomatization of physics." They stemmed from Hilbert's ambition to establish a unified axiomatic system for mathematics. On top of that system he had an even grander idea: the formalist "Hilbert program."
According to Hilbert, the formal axiomatic system he wanted to establish for mathematics should satisfy three conditions. Completeness: every true mathematical proposition can be proved within the system. Consistency: no contradiction can be derived within the system. Decidability: there is a procedure that can judge the truth of any mathematical proposition.
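Writing $T \vdash \varphi$ for "$\varphi$ is provable in the system $T$," the three conditions can be rendered roughly as follows (a simplified modern paraphrase, not Hilbert's own wording):

$$\begin{aligned}
&\text{Completeness: for every sentence } \varphi,\; T \vdash \varphi \text{ or } T \vdash \neg\varphi;\\
&\text{Consistency: for no } \varphi \text{ do both } T \vdash \varphi \text{ and } T \vdash \neg\varphi \text{ hold};\\
&\text{Decidability: some algorithm decides, for every } \varphi, \text{ whether } T \vdash \varphi.
\end{aligned}$$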
As a mathematician, Hilbert was concerned with how the "edifice of mathematics" itself should be built. As for the ground on which the edifice stood, in Hilbert's view that was a matter requiring no consideration. Yet it was precisely this matter "requiring no consideration" that produced unexpected problems.
In 1901, the 29-year-old British philosopher Bertrand Russell discovered the famous Russell's paradox. Its simplest form is the so-called "barber paradox": a barber in a small town makes a bold announcement that he shaves exactly those people in town who do not shave themselves. Should the barber, then, shave himself?
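In set-theoretic terms (a standard restatement of the paradox, not a quotation from Russell), the barber corresponds to the set of all sets that are not members of themselves:

$$R = \{x : x \notin x\}, \qquad \text{so that} \qquad R \in R \iff R \notin R.$$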
This paradox showed that naïve set theory, then the foundation of the "edifice of mathematics," was logically imprecise; the episode became known as the "third crisis of mathematics." In response, mathematicians eventually developed naïve set theory into axiomatic set theory. More importantly, they realized that not only mathematics itself but also its very foundations needed to be axiomatized.
To this end, Russell and his teacher at Trinity College, Cambridge, the famous philosopher Alfred North Whitehead, spent ten years completing the three-volume magnum opus Principia Mathematica.
Russell's approach belongs to another school of inquiry into the foundations of mathematics: "logicism," and the three volumes of Principia Mathematica are the peak of its development. In Principia Mathematica, part of the edifice of mathematics is constructed directly from logic.
One of the costs of this grand goal is extremely convoluted and lengthy reasoning. The simplest number, "1," is not defined in Principia Mathematica until page 363; 1+1=2, the most elementary of arithmetic facts, is not established until page 379.
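By way of contrast, in a modern proof assistant such as Lean (discussed later in this article), the same fact is a one-liner, because both sides of the equation reduce to the same numeral by computation; a minimal example:

```lean
-- 1 + 1 = 2 holds by computation: both sides evaluate to the same numeral.
example : 1 + 1 = 2 := rfl
```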
Russell transported the manuscript of Principia Mathematica, filling two large boxes, to Cambridge University Press by horse-drawn carriage. The press estimated it would lose £600 by publishing the book; £600 in 1910 was roughly equivalent to 600,000 yuan today. Fortunately, Cambridge University Press was not a purely for-profit publisher and was willing to bear £300 of the loss for this scholarly work. Of the remaining £300, the Royal Society funded £200, and the final £100 had to be borne by Russell and Whitehead personally.
However, in 1931, two decades after the publication of Principia Mathematica, the 25-year-old mathematician Kurt Gödel proved and published two theorems that came to be called "Gödel's incompleteness theorems." They show that any consistent formal system containing the axioms of arithmetic is incomplete, and that such a system cannot prove its own consistency.
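In the notation used above (again a textbook paraphrase): if $T$ is consistent, effectively axiomatized, and strong enough to encode arithmetic, then

$$\exists\, G:\; T \nvdash G \ \text{ and } \ T \nvdash \neg G \quad \text{(first theorem)}, \qquad T \nvdash \mathrm{Con}(T) \quad \text{(second theorem)},$$

where $\mathrm{Con}(T)$ is the arithmetical sentence expressing "$T$ is consistent."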
Just as Lord Kelvin's "two dark clouds" completely changed the shape of the edifice of physics, Gödel's incompleteness theorems declared that the mathematical edifice Hilbert and Russell hoped to build could never be completed.
Machine proof in the pre-AI era
Although Gödel proved Hilbert's and Russell's grand plans unattainable, their work played an inestimable role in the development of both mathematics and computers.
On the one hand, this work gave modern mathematics a relatively solid foundation. On the other, its exploration of axiomatized, formalized mathematics completely changed the language in which mathematicians had described and studied mathematics since antiquity. The most striking sign of this change is that almost all mathematical monographs now open in almost exactly the same way: they begin by giving, in a language such as set theory, strict definitions of the mathematical objects the book discusses.
The study of the three requirements of Hilbert's program also gave birth to the theoretical foundations of the modern computer.
Gödel's incompleteness theorems showed that, for Hilbert's envisioned axiomatic system, completeness and consistency cannot both be achieved. During the same period, the decidability of such systems was also under investigation. Just a few years after Gödel published the incompleteness theorems, the American mathematician Alonzo Church and the British mathematician Alan Turing independently proved that decidability fails as well.
In the course of his proof, Turing proposed the concept of the "Turing machine," the theoretical model of the modern computer. In essence, every computing "machine," from the abacus to mechanical calculators to electronic computers, can be regarded as a particular Turing machine. Turing also showed that the decidability of Hilbert's system is tied to the halting problem for Turing machines: a general decision procedure would yield a general halting test, and vice versa.
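Turing's argument that no general halting test can exist is a diagonalization, which can be sketched in a few lines of necessarily hypothetical code; `halts` below is a stub standing in for the impossible decider, not a real function:

```python
def halts(program, argument) -> bool:
    """Hypothetical decider: True iff program(argument) would halt.
    Turing proved no such function can exist; this stub only lets the
    contradiction below be written down."""
    raise NotImplementedError

def diagonal(p):
    # If p would halt when run on itself, loop forever; otherwise halt.
    if halts(p, p):
        while True:
            pass

# diagonal(diagonal) halts exactly when halts(diagonal, diagonal) reports
# that it does not halt, contradicting what halts is supposed to decide.
```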
Turing's work became the theoretical basis for the later development of computers. Because of his pioneering role in the field, Turing is known as the "father of artificial intelligence."
After World War II, as the first generation of vacuum-tube computers was developed in the United States, mathematicians began trying to implement existing results of formal mathematics on real "machines." Doing so first required creating a language the "machines" could "understand" and use for logical reasoning. Here the formal logical symbols used by Hilbert and Russell became the basis for the various "mathematical languages" of the machines that followed.
The main results of this period were obtained on the 2.3-ton vacuum-tube computer JOHNNIAC at the Institute for Advanced Study in Princeton. In 1954, Martin Davis began experimenting with writing proof programs for JOHNNIAC covering the natural numbers and addition. His biggest achievement was having JOHNNIAC prove that "the sum of two even numbers is even."
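That 1954 milestone is a one-liner for a modern prover. Here is one way to state it in Lean (a minimal sketch, assuming a recent Lean 4 in which the `omega` tactic for linear arithmetic is available):

```lean
-- "The sum of two even numbers is even," stated with explicit witnesses:
-- if m = 2a and n = 2b, then m + n = 2c for some c (namely c = a + b).
example (m n a b : Nat) (hm : m = 2 * a) (hn : n = 2 * b) :
    ∃ c, m + n = 2 * c :=
  ⟨a + b, by omega⟩
```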
In 1956, also on JOHNNIAC, Allen Newell, Herbert Simon, and J. C. Shaw used a program they had developed, the "Logic Theorist," to prove 38 of the first 52 theorems in Russell and Whitehead's Principia Mathematica. For some theorems, the Logic Theorist even found new, more elegant proofs.
Since then, Automated Theorem Proving (ATP) has gradually become a formal research direction, and the Logic Theorist is considered by many to be the "first artificial intelligence program," even though the field of artificial intelligence did not yet exist at the time.
In the decades that followed, automated proving came a long way, but getting it to produce mathematically valuable results remained distant, because automated proving faces a dilemma in which, as the Chinese idiom puts it, one cannot have both the fish and the bear's paw. Provers capable of expressing valuable mathematical conclusions are usually based on relatively strong logical systems, on which fully automated reasoning is hard to achieve. Conversely, logical systems that admit strong automation, such as first-order logic or even Boolean logic, cannot express truly valuable mathematics.
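The trade-off is easy to see in a modern prover such as Lean: a decidable, finitely checkable statement closes automatically, while even a trivial statement quantified over an infinite domain needs a human-chosen proof method (a minimal illustration of the dilemma, not drawn from the paper discussed later):

```lean
-- A decidable arithmetic fact: the `decide` tactic checks it mechanically.
example : 2 + 2 = 4 := by decide

-- A statement about *all* natural numbers cannot be checked case by case;
-- here the human supplies the method (induction) and the prover verifies it.
example (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ k ih => rw [Nat.add_succ, ih]
```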
Faced with this situation, provers combining human guidance with strong logical systems became the mainstream direction of development. This is known as Interactive Theorem Proving (ITP).
The earliest and best-known result of interactive proof is the proof of the four-color theorem.
The four-color theorem says, in layman's terms: "Any map without exclaves can be colored with no more than four colors so that no two adjacent regions share the same color." The problem was first posed in 1852 by the South African mathematician Francis Guthrie and became known as the "four-color problem" or "four-color conjecture." In the century and more that followed, well-known mathematicians including Minkowski studied the problem, but a complete proof never emerged.
The idea behind proving the four-color theorem had long been clear: the ways in which neighboring regions on a map can border one another fall into a finite number of cases, called configurations. It therefore suffices to verify, one by one, that every configuration conforms to the four-color theorem. But the amount of work this requires far exceeds what humans can accomplish. It was not until 1976 that the mathematicians Kenneth Appel and Wolfgang Haken, using 1,200 hours of machine time on an IBM 360 at the University of Illinois, verified all 1,936 configurations and thereby proved the four-color theorem.
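The flavor of "proof by exhaustive machine checking" can be conveyed by a toy sketch; this is brute force over one small map, nothing like Appel and Haken's actual configuration analysis:

```python
from itertools import product

def four_colorable(n_regions: int, borders: list[tuple[int, int]]) -> bool:
    """Brute force: does some assignment of 4 colors to the regions leave
    no bordering pair sharing a color?"""
    for coloring in product(range(4), repeat=n_regions):
        if all(coloring[u] != coloring[v] for u, v in borders):
            return True
    return False

# A center region surrounded by a ring of four others: four colors suffice.
print(four_colorable(5, [(0, 1), (0, 2), (0, 3), (0, 4),
                         (1, 2), (2, 3), (3, 4), (4, 1)]))  # True
```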
Since then, dozens of mathematical problems, including the double bubble conjecture and the Robbins conjecture, have been solved, or have seen important progress, in ways similar to the four-color theorem.
What do mathematicians want?
Despite these achievements, interactive machine proof has still not been widely adopted by the mathematical community. Even today, some mathematicians try to avoid computers and to prove purely by human effort problems, including the four-color theorem, that machine proof has already settled.
This is partly because interactive proof is limited in the problems it can address, and because interactive proof, as an independent research direction, has a high barrier to entry: it requires mastering the corresponding human-computer interaction language. Moreover, the type-theory-based logic used by mainstream interactive provers is not the same as the set-theory-based mathematics that mathematicians are accustomed to. A mathematician who wants to use interactive proof must therefore pay a substantial learning cost.
More importantly, it is a matter of the values of the discipline itself.
When it comes to mathematics, our first reaction is most likely "rigor." In mathematics, right is right and wrong is wrong; among the natural numbers, 1+1 will only ever equal 2. On a math exam, even if the final answer is correct, a teacher will ruthlessly deduct points for an error in the intermediate steps.
It is undeniable that rigor is an extremely important value in mathematics, above all in mathematical proof. But with a little knowledge of the history of mathematics and of how mathematicians actually work, we inevitably ask: is rigor the most important value in mathematics?
Calculus was not made rigorous until Weierstrass, almost 200 years after Newton and Leibniz invented it. That did not prevent Newton and Leibniz, as the originators of calculus, from being counted among the greatest mathematicians in history. Nor did it prevent mathematicians during those 200 years from using the "loose" tools of calculus to develop marvelous mathematical theories and theorems.
Much of Euler's wonderful work is entirely lax by today's standards. To this day, Euler's conclusion that the sum of all the natural numbers equals -1/12 is still passed around by clickbait accounts. None of this diminishes Euler's contributions to mathematics.
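The modern gloss on that startling claim (a standard reading, added here for context, not from the original article) is that it abbreviates a value of the analytically continued Riemann zeta function rather than a literal sum:

$$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^{s}} \quad (\operatorname{Re} s > 1), \qquad \zeta(-1) = -\frac{1}{12},$$

which is popularly compressed into the eye-catching $1 + 2 + 3 + \cdots = -\tfrac{1}{12}$.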
Much of Riemann's work likewise rested on mathematical theory that had yet to be established: his proof of the Riemann mapping theorem, for example, depended on the regularity of elliptic equations, which was not put on firm footing until the second half of the twentieth century.
Examples of this kind abound in mathematics to this day.
The Russian-French mathematician Mikhail Gromov, who revolutionized many areas of modern geometry, writes papers that by the standards of ordinary mathematicians go far beyond "lax." Understanding one of Gromov's short ten-page articles often requires many mathematicians to write a whole series of papers.
Similarly, much of the work of Edward Witten, the physicist who won the Fields Medal, mathematics' highest prize, comes from physical intuition rather than mathematical rigor.
A mathematical result may fail to be written down rigorously for a variety of reasons. One is that mathematics has not yet developed to the point where the claim can be stated rigorously in concise language. This was the case for Newton, Leibniz, and Euler: in their time "infinity" was still a rather vague notion, and there was no way, mathematical or philosophical, to say exactly what "infinity" was. Once this problem began to be resolved in the late nineteenth century, their arguments could be made properly rigorous.
Another reason is efficiency. Setting the great mathematicians aside, ordinary mathematicians and mathematics students do not, when writing papers, spell out every detail the way a machine proof does; a great deal of tedious material is compressed into the single word "obviously."
The main reason, though, is mathematical understanding. Many mathematical ideas that are not entirely rigorous are nonetheless wonderful. Whether from Riemann or from Gromov and Witten, such "proofs" are full of meaning and inspiration. Compared with a rigorous, error-free proof, this "loose" work can give us a deeper understanding of mathematics and lead us to discover more unknown mathematics. As Chen put it, "People with mathematical experience and foresight can sail beneath the sea and reach important new fields."
Therefore, although every mathematical result must ultimately be made thoroughly rigorous, rigor often does not occupy the most important position during the actual process. The mathematical mainstream today values a single "proof" that is not rigorous but has deep content over ten thousand strictly written formal proofs.
Is AI the answer?
So, can large language models of the kind GPT represents change this situation, allowing machine proof to meet mathematicians' needs and become genuinely usable?
First of all, a general-purpose large language model such as GPT cannot be used directly for this purpose.
In fact, according to Microsoft Research's evaluation report on GPT-4, mathematical ability is itself a relative weakness among GPT-4's various capabilities. This is largely because GPT's training relies mainly on information from the Internet, and it is unrealistic to expect professional-level mathematical ability to emerge from that alone. Nor is cultivating professional mathematical ability a goal GPT aims at in the first place.
So if we change our thinking and combine large language models with existing interactive proof programs, can we build theorem provers that meet mathematicians' needs?
The idea has considerable merit. First, a large language model can spare mathematicians from learning to operate an interactive prover: they need only "tell" the problem to the model in familiar language, let the model "translate" it for the interactive prover, and then have the model "translate" the prover's formal symbolic output back into the mathematical language mathematicians are used to.
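In code, the division of labor just described might look like the following sketch; every function here is a hypothetical placeholder marking a step in the pipeline, not a real API:

```python
def llm_formalize(statement: str) -> str:
    """Hypothetical step 1: an LLM renders a mathematician's prose
    as a formal statement in the prover's language (e.g., Lean)."""
    return f"theorem goal : {statement} := sorry"  # placeholder output

def prover_search(formal: str) -> str:
    """Hypothetical step 2: the interactive prover searches for, and
    mechanically checks, a proof of the formal statement."""
    return formal.replace("sorry", "by simp")  # placeholder "proof"

def llm_informalize(proof: str) -> str:
    """Hypothetical step 3: the LLM translates the formal proof back
    into readable mathematical prose."""
    return f"A readable write-up of: {proof}"

if __name__ == "__main__":
    print(llm_informalize(prover_search(llm_formalize("1 + 1 = 2"))))
```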
More importantly, today's large language models have already shown a degree of logical reasoning ability. If AI continues to improve at its current pace, artificial intelligence capable of producing illuminating, deeply meaningful mathematical work may well appear in the future.
The paper mentioned at the beginning of this article, co-authored by scholars from Caltech, NVIDIA, MIT and other institutions, is a first step in this direction.
Building on Lean, an open-source interactive theorem prover first released in 2013 by Leonardo de Moura of Microsoft Research, they launched the first open-source theorem prover based on a large language model: LeanDojo.
Unlike "big" language models like GPT, LeanDojo is much smaller. Its core dataset has only 96962 theorems and proof processes. Moreover, unlike GPT, which requires long-term training on massive hardware, LeanDojo's training process only takes a single graphics card a week to complete the training. According to the paper, LeanDojo's mathematical ability is significantly improved compared to GPT-4.
This is because, compared with the general-purpose GPT, LeanDojo focuses on mathematics and deals with a far narrower class of problems. Decades of development in formal mathematics and machine proof also provide a solid foundation for LeanDojo's training.
Although LeanDojo's performance seems impressive, the paper's authors remain sober. As they write: "This work aims to lower the barrier to entry in this field and provide a starting point for future work to move forward."
For now, AI remains far from doing mathematician-level work. But the combination of large language models and machine proof represents at least one possibility, like a child scrawling "1+1=2" on a piece of paper: everyone smiles and looks forward to what the child will do when grown.
Zuo Li