
Solomonoff: The Prophet of Large Language Models

Author: Institute of Physics, Chinese Academy of Sciences

Some of the participants at the 1956 Dartmouth Conference. Second from left: Rochester; third from left: Solomonoff; fourth from left: Minsky; second from right: McCarthy; first from right: Shannon.


Introduction:

At present, the hottest large-model company is OpenAI. OpenAI's chief scientist Ilya Sutskever kept hinting in interviews that next token prediction was the key to the success of the GPT series of large models, but it was not until August 2023, in a talk at the Simons Institute for the Theory of Computing at Berkeley, that he made clear that GPT's mathematical basis is Solomonoff induction.

So what is Solomonoff induction? This year marks the 60th anniversary of the birth of Solomonoff induction, and the artificial intelligence scholar Nick has written this ten-thousand-word article explaining why Solomonoff induction is the theoretical basis of large language models and how it explains the core mechanism of GPT: next token prediction.

By Nick

Sometimes theory comes first and practice follows; sometimes engineering or experiment produces results first and theoretical explanation arrives slowly; and sometimes theory and practice are entangled.

The history of natural language processing (NLP) is more tortuous and more like the last case. Large language models, as the latest advance in NLP, mix business with theory and practice, which makes the history even harder to untangle. As large language models keep advancing in engineering, theoretically minded engineers are trying to find their mathematical basis in order to explain the success of large models. But many observations and curve-fittings detached from first principles are not theory; they only add to engineers' confusion.

In fact, the genius contributions of the maverick mathematician Ray Solomonoff (1926-2009) in the early 1960s had already laid the mathematical foundations for large models. His original theory is now being rediscovered; it still guides engineering practice and may point the way to the future. Solomonoff was a prophet of large language models.

In 1956, John McCarthy (1927-2011) and Marvin Lee Minsky (1927-2016), together with Claude Elwood Shannon (1916-2001) of Bell Labs and Nathaniel Rochester (1919-2001) of IBM, organized a summer workshop on artificial intelligence at Dartmouth College, where McCarthy taught at the time. This conference marked the establishment of AI as an independent discipline. It brought together a group of young, ambitious scholars from many different fields.

The one who took the meeting most seriously was Solomonoff. Unlike the other attendees, who came and went, Solomonoff spent the entire summer at Dartmouth. He had studied physics under Fermi at the University of Chicago until 1951, but left the ivory tower with a master's degree and moved to the northeastern United States (around Boston and New York), beginning a life of work and study that brought happiness if not wealth. During his Chicago years, the philosopher Rudolf Carnap (1891-1970) was his greatest influence. Carnap's main interest at the time was probability theory and inductive reasoning, and his ideas and results are reflected in his 1950 book Logical Foundations of Probability (see Carnap-1950), which Solomonoff studied intensively; inductive inference became his lifelong research subject.


Ray Solomonoff (July 25, 1926 - December 7, 2009)

Interestingly, Walter Pitts (1923-1969), one of the founders of neural networks, also benefited from Carnap. Another pioneer of artificial intelligence, Herbert Simon (1916-2001), recounted in his memoirs that he had attended Carnap's mathematical logic class in Chicago, which sparked his interest in machine theorem proving and in broader questions of intelligence. In this sense, both schools of AI, the logical and the neural-network, can be traced back to Carnap's teaching (see Nick-2021).

Solomonoff had met Minsky and McCarthy around 1952, when the two were Ph.D. students in the mathematics department at Princeton University. Although Alonzo Church was there working on logic, neither Minsky's nor McCarthy's doctoral dissertation was about logic; still, both were undoubtedly strongly influenced by it, and in their early work both paid close attention to logic, especially the theory of recursive functions. At that time, logic was a new discipline in the mathematics departments of American universities. Recursive function theory, a branch of mathematical logic, gradually evolved into today's computability theory and further gave rise to computational complexity. In 1967 Minsky wrote an early and influential textbook on computation, Computation: Finite and Infinite Machines (see Minsky-1967). His student Manuel Blum later won the Turing Award for contributions to computational complexity and cryptography. Minsky's claim that "AI incubated computing theory" is not unreasonable.

In the summer of 1953, McCarthy, who had already finished his Ph.D., and Minsky, who was about to finish his, were both working at Bell Labs, and both orbited Shannon, who had become famous for information theory. Shannon's interest at the time was the Turing machine and whether it could serve as a theoretical basis for intelligent activity. Even more famous at the time was Wiener, who had just published an influential new book whose title, Cybernetics, was borrowed from the Greek word for helmsman; Wiener hoped the term would encompass everything, and he sometimes hinted, openly or implicitly, that Shannon's information theory had also been inspired by him. Clearly neither the younger Shannon nor the younger McCarthy bought Wiener's account, and neither liked the word "cybernetics". McCarthy suggested to Shannon that they compile a collection of essays contributed by the front-line researchers of the day. The collection did not appear until 1956, under the title Automata Studies, a deliberately plain title chosen by Shannon, who disliked using new terms to attract attention. McCarthy, however, felt that the obscure title failed to reflect their original intention, which later led him to insist on a new term, "artificial intelligence", to name the emerging field. To this collection McCarthy himself contributed a short, five-page essay, "The Inversion of Functions Defined by Turing Machines" (see McCarthy-1956).

In this article, McCarthy discusses how to guess the input of a Turing machine given its output. More rigorously: given a recursive function (i.e., a Turing machine) fm and its output r (that is, fm(n) = r), how do we find an "effective" inverse function g(m, r) such that fm(g(m, r)) = r, where m is the index of the Turing machine? The problem is to look at the output of a black box (a Turing machine) and try to guess its inner workings. The most naive approach is to enumerate candidates until one produces the desired output, but clearly this procedure need not terminate. In fact, in the context of today's large models, g(m, r) is the large language model. The problem corresponds to searching for the proof of a conjecture by checking, in some order, all possible English essays. McCarthy believed that all mathematical problems could be expressed as finding inverses of Turing machines; this was exactly the problem of inductive inference that Solomonoff wanted to solve.
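A minimal sketch of this brute-force inversion in Python, assuming a hypothetical interpreter run(m, n, steps) that simulates Turing machine m on input n for a bounded number of steps and returns its output if it has halted (None otherwise); dovetailing over inputs and step budgets keeps any single non-halting computation from blocking the search, though the search as a whole still need not terminate:

```python
def invert(m, r, run):
    # Dovetail over (input n, step budget k): at stage k, try every input
    # n <= k with budget k. If some n with f_m(n) = r exists, it is
    # eventually found; if no preimage exists, the loop runs forever.
    k = 0
    while True:
        for n in range(k + 1):
            if run(m, n, k) == r:
                return n
        k += 1
```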

During the Dartmouth meeting, McCarthy and Solomonoff had many opportunities for long discussions. Solomonoff argued that McCarthy's problem could be recast as: given the initial segment of a sequence, find its continuation. From the known initial segment, one builds a model to predict the rest of the sequence. McCarthy did not at first see the importance of this idea and asked rhetorically: isn't this just extrapolation? Everyone present was stopped short by McCarthy's question. The next day McCarthy came around and put Solomonoff's problem in layman's terms: "Suppose we find a computer in an old house that is printing the sequence you are talking about, and it is close to the end of the sequence and about to print the next character. Would you bet that it will print the correct character?" What McCarthy and Solomonoff called "sequence continuation", the "next word" or the "next symbol", is, in today's parlance, the next token.


The 50th anniversary of the Dartmouth Conference, 2006. Second from left: McCarthy; third from left: Minsky; first from right: Solomonoff.

Actually, this idea has an even earlier origin. In his 1913 article "Mécanique Statistique et Irréversibilité" (Statistical Mechanics and Irreversibility), the French mathematician Félix Édouard Justin Émile Borel (1871-1956) considered whether a monkey typing at random on a typewriter could produce Hamlet. The probability is astronomically small, yet not strictly zero; this became known as the "infinite monkey theorem". The Argentine poet and writer Jorge Luis Borges (1899-1986) included in his 1944 short-story collection The Garden of Forking Paths his philosophical story (more like an essay) "The Library of Babel", which imagines a complete library holding every possible book that an enumeration of the letters could produce. In fact, as early as 1939 he had written a more serious philosophical essay, "The Total Library", reviewing the speculations of thinkers on this ideal at various stages, beginning with Aristotle.

Seen from today, isn't the large model heading in precisely this direction? The training of large models attempts to exhaust all the knowledge humanity possesses. If Borges's starting point was rationalist, then the random monkey is certainly empiricist, but both can be unified into some kind of Turing-machine enumeration process via McCarthy's inversion.

The value of Turing's 1948 report "Intelligent Machinery" is being noticed more and more. Several approaches to machine learning are mentioned in Turing's text. In a universal Turing machine, programs are equal to data, so all programs, like data, can be enumerated one by one. This enumeration method amounts to learning all possible programs on one's own; it is what Turing called "initiative" (see Nick-2024). Turing made clear that all "learning" can be reduced to this method. Computability theory tells us that this enumeration process is non-halting, that is, not computable.


The discussions with McCarthy prompted Solomonoff to refine his ideas further, and before the Dartmouth meeting ended he wrote a memo on inductive inference, "An Inductive Inference Machine", typed up on August 14, 1956. Solomonoff circulated the typescript among the attendees. At the end of 1956 he also sent an improved version to Herbert Simon at the Department of Industrial Management of the Carnegie Institute of Technology.

Solomonoff's work was first presented in 1960 at the Caltech conference on "Cerebral Systems and Computers". That same year the article circulated widely as a Zator Company report and as a U.S. Air Force AFOSR report. Minsky mentioned the work in 1961 in his influential article "Steps Toward Artificial Intelligence" (see Minsky-1961). Solomonoff later revised the 1960 work and published it in 1964 in Information and Control, an important journal of computational theory, under the title "A Formal Theory of Inductive Inference". Because the article was too long, it was split into two parts published in separate issues, the first half on theory and the second half on examples (see Solomonoff-1964).

Solomonoff induction can be defined as follows: given a sequence (x1, x2, ..., xn), predict xn+1. Inductive inference is the attempt to find a minimal Turing machine that models (x1, x2, ..., xn) so as to accurately predict the rest of the sequence. The description length of the sequence is the size of this Turing machine; this is the "inner working" that McCarthy had at first only vaguely sensed. For example, if a sequence consists of n ones, (1, 1, 1, ...), then we can write the following program to output it:
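A minimal Python sketch of such a program; the program text is fixed, and only the constant n varies, so writing the program down costs about log2(n) bits plus a constant:

```python
def print_n_ones(n):
    # Print a sequence of n ones; the description length is dominated
    # by the roughly log2(n) bits needed to write down n itself.
    for _ in range(n):
        print(1)

print_n_ones(1000)
```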

The description length of this sequence is O(log n).

As another example, given the sequence (3, 5, 7), there are infinitely many ways to predict what comes next. One possibility is 9, because the program might be printing odd numbers, as follows:
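A sketch of such an odd-number program, assuming the sequence starts at 3:

```python
def print_odd_numbers(k):
    # Print k odd numbers starting from 3: 3, 5, 7, 9, ...
    for i in range(k):
        print(3 + 2 * i)
```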

But that may be wrong; another possibility is 11, because the program might be printing prime numbers. Clearly a program that prints primes is considerably more complicated than one that prints odd numbers, which means the description length of the prime sequence is greater than that of the odd-number sequence. And so on.
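For comparison, a sketch of a program that prints the primes from 3 onward; it is visibly longer than the odd-number program above, i.e., its description length is larger:

```python
def print_primes_from_3(k):
    # Print k primes starting from 3: 3, 5, 7, 11, 13, ...
    found, n = 0, 3
    while found < k:
        if all(n % d != 0 for d in range(2, int(n ** 0.5) + 1)):
            print(n)
            found += 1
        n += 1
```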

Supervised learning can also be seen as a special case of self-supervised learning. Supervised learning (including classification) is: given pairs (tuples) (x1, c1), (x2, c2), ..., (xn, cn) and xn+1, predict cn+1. The learning process is to find a fitting function c = f(x). Such problems translate easily into self-supervised learning: given the sequence (x1, c1, x2, c2, ..., xn, cn, xn+1), predict cn+1.
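A toy sketch of this reduction (the function name is illustrative, not from any library):

```python
def to_self_supervised(pairs, x_query):
    # Interleave the (x_i, c_i) pairs and append the query x_{n+1};
    # predicting the next element of the flattened sequence is exactly
    # predicting c_{n+1}.
    seq = []
    for x, c in pairs:
        seq.extend([x, c])
    seq.append(x_query)
    return seq

print(to_self_supervised([("4", "even"), ("7", "odd")], "10"))
# ['4', 'even', '7', 'odd', '10']  ->  the next element to predict is the label of '10'
```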

This problem, which McCarthy described as "betting on the next symbol", is exactly the core mechanism of large language models exemplified by GPT: next token prediction. A Turing machine that generalizes what is known is a large model. For a given dataset, we naturally want the model to cover the dataset as well as possible; in other words, we want the most economical Turing machine that generalizes, that is, the smallest Turing machine. In this sense, learning can be treated as compression. The relationship between the number of parameters and the number of tokens can also be studied in these terms. Solomonoff induction may not halt, so one can only use approximation algorithms, relaxing both the requirement that the Turing machine be "minimal" and the accuracy of prediction. Solomonoff used Bayes' theorem to derive a prior probability distribution over sequences. Neural networks, being universal approximators, are a good candidate mechanism for implementing Solomonoff induction. This is in fact the approach of today's large models.

Another problem Solomonoff had in mind was: given a few sentences, can one learn the grammar that generates them? At that time Noam Chomsky's article "Three Models for the Description of Language" had just been published, and Solomonoff was inspired to generalize Chomsky's grammars into probabilistic grammars. One application of his "inductive inference machine" was to learn grammar from input text, which he later called "the discovery of grammar".

Chomsky's innate grammar is in effect Solomonoff's prior probability distribution, except that Chomsky takes a rationalist stance while Solomonoff is undoubtedly an empiricist. In fact, if one accepts the Church-Turing thesis, the distinction between rationalism and empiricism is only rhetorical, not essential. Under Solomonoff's prior, the weight of a program decreases exponentially with its length. This is Occam's razor: the shorter the program, the higher its credibility. It is also corroborated by empirical data (see Veldhuizen-2005). On Solomonoff's memorial site (raysolomonoff.com), a beautiful formula of his is prominently displayed.
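Judging from his published work, the formula is presumably the universal prior (algorithmic probability), which assigns to a string x the probability

M(x) = Σ 2^(−ℓ(p)), the sum taken over all programs p whose output U(p) begins with x,

so that a program of length ℓ(p) contributes weight 2^(−ℓ(p)); this is precisely the exponential decay of confidence with program length described above.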


His academic autobiography, "The Discovery of Algorithmic Probability", was published in 1997 in the Journal of Computer and System Sciences, a journal of computational theory (see Solomonoff-1997). The article was later revised and appeared in different forms in several places; the latest version was included, after his death, in the collection Randomness Through Computation: Some Answers, More Questions (see Solomonoff-2011).


The all-powerful Soviet mathematician Andrey Nikolaevich Kolmogorov (1903-1987), in addition to his extensive and profound contributions to traditional mathematics, had a direct and indirect impact on many aspects of computer science and information theory.

In the early 1950s, Shannon's information theory and Wiener's cybernetics were introduced to the Soviet Union through Russian translations. Kolmogorov, with his keen intuition, realized the importance of information theory; at the same time he expressed disdain for cybernetics, which he believed lacked internal unity. This assessment is consistent with the views of Shannon, McCarthy, and the other Dartmouth participants on cybernetics. The state of science in the USSR at the time was very complicated: even someone of Kolmogorov's stature had his interest in genetics suppressed by Lysenko, yet after Lysenko fell from power, Kolmogorov still put in a good word for him.

Kolmogorov's view of cybernetics did not prevent it from becoming a mainstream discipline in the Soviet Union. This may have left an aftermath for computer science, and more or less for artificial intelligence as a sub-discipline of computer science, in the Soviet Union, and it certainly influenced the development of the related disciplines in China. In the United States, cybernetics never became an independent discipline, while computer science became dominant; from the late 1960s on, top American universities set up computer science departments.

The core concept of cybernetics, feedback, is no more than one of the simplest special cases of recursive functions and cannot serve as a first principle. In his preface to the Russian translation of the Hungarian mathematician Rózsa Péter's Recursive Functions (translated into Chinese in 1958 by Mo Shaokui, who rendered "Kolmogorov" as "Golmogorov" and "Turing" as "Duling"), Kolmogorov pointed out that general recursive functions and computability still needed to be examined further from the standpoint of constructivity; he also had deep insights into the Church-Turing thesis (see Peter-1951).

In any case, Kolmogorov's entry point was his favorite field: probability theory. In 1965 he founded the quarterly journal Problems of Information Transmission, which soon became the Soviet Union's most important venue for computational theory. Kolmogorov himself published "Three Approaches to the Quantitative Definition of Information" in the inaugural issue, examining probability theory and information theory from an algorithmic perspective. The core of information theory is the study of information content; Shannon's measure of information is entropy. Kolmogorov divided the foundations of information theory into three kinds: the first based on frequency, the second on combinatorics, and the third on algorithms. His assessment of information theory and probability theory is thought-provoking: "Information theory ought logically to precede probability theory, and not be based on the latter." He considered combinatorics more solid than frequency, but found the algorithmic approach the most convincing: the amount of information contained in a message can be measured by the length of the shortest program that generates it (see Kolmogorov-1965). This is now known as "Kolmogorov complexity", which can be defined as follows:

KC(x) = min{ℓ(p) : U(p) = x}

That is, it is the length of the shortest program p that outputs the string x. Kolmogorov's classic paper is only seven pages long, and the few related articles he wrote afterwards were even shorter. This stands in stark contrast to Solomonoff's meticulous but lengthy papers. Brevity was a hallmark of Soviet mathematicians, supposedly because paper was scarce in Soviet times; another theory is that Soviet mathematicians (especially the masters) were so sparing with words that completing the proofs of their results was left to their students, sometimes even to a whole generation. Kolmogorov's KAM (Kolmogorov-Arnold-Moser) theory was later perfected by his student Vladimir Arnold and the German-American mathematician Jürgen Moser, and his work on Hilbert's thirteenth problem was likewise brought to a conclusion by Arnold.

It can be shown that Kolmogorov complexity does not depend on the representation of programs: different representations, such as C, Java, Python, or Turing machine code, change the shortest description by at most an additive constant. This invariance theorem is sometimes referred to as the "Kolmogorov thesis". There is growing evidence that Kolmogorov complexity (were it computable) is a more robust measure than Shannon entropy; for example, the structural entropy of a graph varies with the graph's representation, whereas the Kolmogorov complexity of the graph should remain essentially invariant.
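Although K(x) itself is uncomputable, the length of any lossless compression of x is a computable upper bound on it, up to the additive constant of the invariance theorem (the size of the decompressor). A rough Python illustration:

```python
import os
import zlib

def kc_upper_bound(s: bytes) -> int:
    # The compressed length upper-bounds K(s) up to an additive constant.
    return len(zlib.compress(s, 9))

print(kc_upper_bound(b"1" * 10_000))       # highly regular: compresses to a few dozen bytes
print(kc_upper_bound(os.urandom(10_000)))  # typical random data: barely compresses at all
```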

Kolmogorov later took note of Solomonoff's work and cited it in 1968 in both Russian and English, which made Solomonoff's reputation even more resounding in the Soviet Union than in the West. "Kolmogorov complexity" is also known as "Solomonoff complexity", or "Solomonoff-Kolmogorov-Chaitin complexity", and occasionally as "descriptive complexity"; but since several different things in computational complexity theory are called "descriptive complexity", this article uses the most common term, "Kolmogorov complexity", to avoid ambiguity. Because of Kolmogorov's influence, the discipline is also known as "algorithmic probability theory" or "algorithmic information theory".

Several Soviet scholars, including students of Kolmogorov, established a machine learning research center at Royal Holloway, University of London. At their initiative the Kolmogorov Medal was established (note: distinct from the Kolmogorov Prize awarded by the Russian Academy of Sciences). Solomonoff was the first recipient of the Kolmogorov Medal; the most recent (2018) recipient was Vladimir Vapnik, the Soviet Jewish statistician known for inventing the support vector machine. Solomonoff was also an adjunct professor at Royal Holloway, University of London.


Gregory Chaitin (born 1947), an American theoretical computer scientist of Argentine Jewish descent, had an unusual background. He attended the prestigious Bronx High School of Science in New York, a school that has produced nine Nobel laureates and two Turing Award winners. His first paper, published at the age of 18 in IEEE Transactions on Electronic Computers, was about clock synchronization in automata; he had written it in high school, and it carried the byline of Columbia University's School of Engineering and Applied Science because he was taking part in Columbia's honors program for high-school students. After high school he enrolled at the City College of New York (CCNY). In his first semester he was reading three books at once: Theory of Games and Economic Behavior by von Neumann and Morgenstern, The Mathematical Theory of Communication by Shannon and Weaver, and the anthology The Undecidable edited by Martin Davis, which includes Turing's groundbreaking 1936 paper. He returned to Argentina with his parents before finishing his bachelor's degree.

Chaitin had visited IBM when he was very young, and studying logic and programming became his hobbies. His programming skill made it easy for him to get a job as a programmer at IBM's Buenos Aires branch. During this period he worked on Gödel's incompleteness theorem. His first article on minimal program length was published in the Journal of the ACM (JACM) when he was 19; in it he independently reinvented, in his own way, the ideas behind Solomonoff induction and Kolmogorov's notion of information. The reviewer, already aware of Kolmogorov's work, informed Chaitin of it; Chaitin did not read Russian, but he nevertheless acknowledged Kolmogorov's work in a footnote when the paper was published.


Solomonoff (first from left) and Chaitin, 2003

Chaitin's starting point was the Berry paradox: in English, "the smallest positive integer not definable in under sixty letters"; in Chinese, "the smallest positive integer that cannot be named in fewer than nineteen Chinese characters". It is a paradox of naming. Because the Berry paradox depends on the language used as a carrier, Chaitin decided to use the functional programming language LISP to avoid confusion. To name an integer is to give a program that outputs that integer; Chaitin's "naming" is Kolmogorov's "description". For most integers, the shortest way to name them is to print them out directly; there is no shorter program that represents them. Chaitin called such integers "uninteresting": incompressible, irreducible, random. In effect, Chaitin thereby proved that Kolmogorov complexity is not computable. At the time he called this "incompleteness".

In 1974 Chaitin returned to the United States to work at the IBM T. J. Watson Research Center in New York State, first as a visiting scholar and later as a permanent researcher. Interestingly, as soon as he was back in the United States he was ready to take the train to Princeton to visit his idol, Gödel. So one day in April 1974 he mustered up the courage to call Gödel and tell him that he had used the Berry paradox to arrive at a version of the incompleteness theorem.

Gödel said: "It doesn't make any difference which paradox you use!"

Chaitin replied: "Yes, but my proof suggests an information-theoretic perspective on incompleteness, and I would love to explain it to you in person."

Gödel said: "All right, send me the paper first, then call me, and we will see whether you can get an appointment with me."

In his later years Chaitin's interests turned to biology, and he published an engaging popular book, Proving Darwin (see Chaitin-2012). People who become famous early tend to hit every nail they meet with the hammer they know best; one good trick, as the saying goes, serves everywhere. Dissatisfied with biology's lack of a theoretical foundation, he set out to explain evolution with algorithmic information theory, which he called "metabiology". Not surprisingly, the core ideas of his metabiology can already be found in genetic algorithms and genetic programming.


The Soviet mathematician Leonid Levin (born 1948) independently proposed NP-completeness in 1972 and proved several problems complete (see Levin-1973). This paper, which now appears extremely important, was published in 1973 in the third issue of Problems of Information Transmission, the famous journal founded by Kolmogorov. Levin was a student of Kolmogorov but was denied a doctorate for political reasons. He emigrated to the United States in 1978, where MIT quickly awarded him a Ph.D., and his results gradually became known. He later taught at Boston University until retirement. Computational theory textbooks since 2000 have renamed the original Cook's theorem the Cook-Levin theorem. In 2012 he was awarded the Knuth Prize, a lifetime achievement award that emphasizes impact on the discipline as a whole, unlike the Turing Award and the Gödel Prize, which are geared toward specific contributions. This can be regarded as compensation for his not having received a Turing Award.

Like his teacher, Levin writes short papers; his pioneering 1986 work on average-case complexity is only two pages long (see Levin-1986). Interestingly, Levin leans toward believing that P = NP, which certainly puts him in the minority.

In Levin-1973, Levin gave two theorems: Theorem 1 concerns NP-completeness, while Theorem 2 was ignored at the time. Theorem 1 is not proved in detail, and Theorem 2 is not even explained; the article ends right after stating it. Theorem 2 has little to do with Theorem 1, at least not obviously. In it Levin proposed a universal search procedure for solving the inversion of a function, the very problem McCarthy had posed in the 1950s and on which Solomonoff had by then spent twenty years.

When Solomonoff learned of Levin's plight in the Soviet Union, he contacted several American universities and scholars, asking them to help Levin. Solomonoff also wrote a report on his academic discussions with Levin (see Solomonoff-1984), which completed the proof of Theorem 2 of Levin-1973. Solomonoff called Levin's universal search procedure L-search.

Levin's L-search adds a time term to the Kolmogorov complexity; the resulting quantity, Levin complexity, can be defined as follows:

Kt(x) = min{ℓ(p) + log(time(p)): U(p) = x}

Defined this way, Levin complexity is computable; that is, if the inverse function exists, it can always be found by Levin's universal search procedure. This is consistent with the convergence theorem Solomonoff had obtained earlier.
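A toy sketch of Levin-style universal search, assuming a hypothetical enumerator programs(max_length) over program texts and an interpreter run(p, steps) that returns the program's output if it halts within the budget and None otherwise; each program p gets a share of every phase's budget proportional to 2^(−ℓ(p)), which is what puts ℓ(p) + log(time(p)) on a single scale:

```python
def l_search(target, programs, run):
    # Phase k spends roughly 2**k steps in total, giving each program p
    # with len(p) <= k a budget of about 2**(k - len(p)) steps.
    k = 1
    while True:
        for p in programs(max_length=k):
            if run(p, 2 ** (k - len(p))) == target:
                return p
        k += 1
```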


Page 2 of Levin's 1973 paper: the statement of Theorem 2, followed by the acknowledgments and then the references.


The physicist Charles Bennett (born 1943) is famous for quantum computing; he is the first "B" in the quantum key distribution protocol BB84. He also made outstanding contributions to algorithmic information theory, introducing in 1988 the concept of logical depth (see Bennett-1988), defined as follows:

LD(x) = min{T(p) : ℓ(p) ≤ p* + s ∧ U(p) = x}

That is, the time it takes a near-shortest program to output x. Here p* denotes the Kolmogorov complexity of x (the length of its shortest program), s is the allowed slack, and ℓ(p) is the length of a program within s bits of the shortest. Logical depth thus further relaxes the Kolmogorov-complexity requirement that the program be of minimum length.

If induction is regarded as compression, then logical depth concerns the time needed for decompression. Logical depth lets us consider the relationship between time and space. Intuitively we feel that time is more "expensive" than space, but at present computational theory does not know whether the class P of polynomial time equals the class PSPACE of polynomial space; NP is of course a subset of PSPACE, but we do not know whether it is a proper subset. If P ≠ PSPACE, then there must exist computable strings whose logical depth is greater than any polynomial. Compression considers the cost in space first, but decompression carries a cost in time.

In the language of large language models, the compression time is the training time, the Kolmogorov complexity is the number of parameters of the model, and the logical depth corresponds to the model's minimum "inference" time. Incidentally, the "inference" of large-model terminology is inference in the statistical sense, distinct from "reasoning" in the logical sense; the Chinese word 推理, usually rendered as "reasoning", refers to the latter, so a better Chinese rendering of the large-model "inference" would be 推断. Large models also do logical "reasoning", for example CoT (Chain of Thought), and textbooks on machine theorem proving often do not strictly distinguish inference from reasoning. If the logic school and the statistics school of AI both spoke Chinese, they probably could not even get a fight going.


A series of works by the theoretical computer scientist Ming Li and others has made important contributions to the study of Solomonoff-Kolmogorov-Chaitin complexity. An Introduction to Kolmogorov Complexity and Its Applications (Li-Vitanyi-2019), co-authored by Ming Li and Paul Vitanyi, is the definitive reference and textbook of the field, sometimes called its Bible, and is now in its fourth edition. An early edition was translated into Chinese as Descriptive Complexity (Science Press, 1998), but the term "descriptive complexity" is easily confused with several other things called "descriptive complexity" in computational complexity theory, so this article uses the full name Solomonoff-Kolmogorov-Chaitin complexity, or simply Kolmogorov complexity.


Mr. and Mrs. Ming Li with Mr. and Mrs. Solomonoff

Marcus Hutter received his doctorate in theoretical physics from the University of Munich in 1996, but after his Ph.D. he turned to artificial general intelligence. In 2000 he proposed AIXI, a theoretical framework for AGI based on reinforcement learning, whose main mathematical tool is Solomonoff induction (see Hutter-2005).

The success of OpenAI's ChatGPT has drawn much attention to how it works. Many attribute its success to the underlying neural network architecture, the Transformer. But Google, the inventor of the Transformer, lags behind OpenAI in large language models. One possible explanation is that Google's framework was BERT, the framework used by nearly all large-model teams at the time, while OpenAI adopted GPT. The main difference is that GPT does next token prediction, whereas BERT can be described in similar language as: given a sequence (x1, x2, ..., xn), remove some xi and guess what xi was. Although there is no mathematical proof, we can intuitively guess that the one-directional GPT should be more efficient than the bidirectional BERT; this is left to the reader as an open question. Solomonoff induction gives us grounds to believe that BERT cannot be stronger than GPT.
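A toy illustration of the two objectives on a token sequence (not actual training code for either model):

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# GPT-style next-token prediction: every prefix predicts the token that follows it.
gpt_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# BERT-style masked prediction: hide one position and predict it from both sides.
i = 2
bert_example = (tokens[:i] + ["[MASK]"] + tokens[i + 1:], tokens[i])

print(gpt_examples[1])  # (['the', 'cat'], 'sat')
print(bert_example)     # (['the', 'cat', '[MASK]', 'on', 'the', 'mat'], 'sat')
```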

In current large-model research, theory temporarily lags behind engineering practice. In the past experience of computer science and engineering, benchmarks generally ran ahead of engineering, but the evaluation of large models clearly lags behind their development. After the success of ChatGPT, OpenAI's chief scientist Ilya Sutskever kept hinting in interviews that next token prediction was the key to the success of the GPT series, but it was not until August 2023, in a talk at the Simons Institute for the Theory of Computing at Berkeley (a basic-science institute endowed by the mathematician-turned-financier James Simons), that he stated clearly that GPT's mathematical basis is Solomonoff induction (see Sutskever-2023). He claimed to have worked this out on his own in 2016, which is a little late, and naturally we cannot force him to produce a zero-knowledge proof.

According to the Solomonoff-Kolmogorov-Chaitin theory, all learning can be seen as compression. Figuratively speaking, summarizing a large body of data into a compact system, or moving from the particular to the general, is naturally compression, because representing the general is far more economical than representing many particulars. Hutter established the Hutter Prize in 2006 to reward the best lossless compression (i.e., the highest compression ratio). Hutter is now at DeepMind, and the article "Language Modeling is Compression" (see Deletang-2023), of which he is a co-author, has been hotly debated in the engineering community. Experiments show that the lossless compression achieved by large language models is far better than that of various compression algorithms based on Huffman coding (Li-2024): the probability predictions that large language models make for tokens are simply better than those of general-purpose compression algorithms. Kolmogorov's and Chaitin's earliest starting point was the expression and transmission of information, which is in fact compression. Different routes to the same destination.
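A back-of-the-envelope illustration of why better next-token prediction means better lossless compression: with an entropy coder such as arithmetic coding, a token to which the model assigns probability p costs about −log2(p) bits, so the better the model's predictions, the shorter the code:

```python
import math

def code_length_bits(probs_of_actual_tokens):
    # Idealized arithmetic-coding length, given the probability the model
    # assigned to each token that actually occurred in the text.
    return sum(-math.log2(p) for p in probs_of_actual_tokens)

print(code_length_bits([0.5, 0.25, 0.9]))    # ~3.2 bits: a good model compresses well
print(code_length_bits([0.05, 0.05, 0.05]))  # ~13.0 bits: a poor model compresses poorly
```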


A popular-science, or philosophical, formulation of Solomonoff induction can be described as follows (a toy code sketch follows the list):

1. Observations arrive as data.

2. Formulate a new hypothesis that explains all the data so far; the hypothesis is a Turing machine.

3. As new data arrive, check whether they confirm the hypothesis.

4. If the data falsify the hypothesis, go back to step 2.

5. If the data confirm the hypothesis, keep accepting it and return to step 3.
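A toy code sketch of the loop above, assuming two hypothetical helpers: hypotheses(), an enumeration of candidate programs (Turing machines) from shortest to longest, and explains(h, data), which checks whether hypothesis h reproduces all the data observed so far:

```python
def inductive_inference(observations, hypotheses, explains):
    data = []
    candidates = hypotheses()       # step 2's supply of candidate Turing machines
    h = next(candidates)
    for obs in observations:        # steps 1 and 3: data keep arriving
        data.append(obs)
        while not explains(h, data):    # step 4: falsified, formulate a new hypothesis
            h = next(candidates)
        yield h                     # step 5: keep the hypothesis that survives
```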

We can use Solomonoff induction to explain Karl Popper's (1902-1994) theory of falsification. Historically, Popper's ideas are also connected to Carnap; Popper's aim was to establish a philosophical line of his own under the newly coined word "falsifiability" by attacking Carnap's probability theory. In fact, "fallibilism" had long been discussed, more profoundly, by Peirce. Since Solomonoff induction can explain Peirce's pragmatism, it can certainly subsume all of Popper's theory, and it can also explain things Popper had difficulty acknowledging, such as evolution. As for falsification, any intelligent person can see that it is always much easier to say something is wrong than to say it is right. The statistician George Box famously said, "All models are wrong, but some are useful," and "useful" means able to predict. Here we see the different attitudes of philosophers and of scientists or mathematicians: the former care more about coining new words, the latter about the progress of knowledge.

Kuhn's theory of scientific revolutions can also be treated as a special case of Solomonoff induction. For a long time after it was proposed, Occam's razor had no strict definition; the intuition was simply that, given similar explanatory power, the smaller the model the better. In the Solomonoff-Kolmogorov-Chaitin theory, the smallest model covering all the data is the shortest program, the smallest Turing machine. Within a given error there can be several Turing machines of similar size, that is, several admissible explanations; this is Epicurus's principle of indifference. Of course, in practice we can only look for Turing machines that are small within a certain margin of error. When we find a model/program/Turing machine that differs greatly from the previous one, we call it a scientific revolution. In general, the new model/Turing machine after a scientific revolution is likely to be longer than the one before (i.e., to have greater Kolmogorov complexity), and probably logically deeper as well. The new model/program/Turing machine is the new paradigm.


Aristotle's Metaphysics opens with "All men by nature desire to know." Aristotle held that the human progression from experience to technique to science is this process of coming to know. The "uncomputable" or "non-halting" may be God's hope for mankind. In Solomonoff's framework, the progress of knowledge is called "incremental learning". Further research on Solomonoff induction and Kolmogorov complexity would provide a new theoretical basis for machine learning. Already, genetic algorithms, genetic programming, and reinforcement learning can all be explained by computational theory within the framework of Solomonoff induction.

We can cleverly borrow from Aristotle: "All men by nature desire to compress." Hume held that induction cannot serve as verification, yet it is essential to human nature. We can offer an alternative formulation: compression is the essence of human nature. Experience comes from the senses; modeling empirical data is a matter of economy; modeling is compression; and the compression is carried out by a Turing machine. It is in this sense that Solomonoff induction can be taken as a first principle. Even evolution can be viewed as compression: "survival of the fittest" can be narrowly paraphrased as "survival of who compress the most". Compression is the manifestation, in the world of information, of the laws of the physical world. Descartes' maxim "I think, therefore I am" can be stated more rigorously as "I compress, therefore I am."

Looking back at the development of the Solomonoff-Kolmogorov-Chaitin theory, and then at large language models, we may feel that it is not so much that theory lags behind practice as that the theory was too far ahead of its time.

References:

1.Bennett, Charles H. (1988), "Logical Depth and Physical Complexity", in Herken, Rolf (ed.), The Universal Turing Machine: a Half-Century Survey, Oxford U. Press, pp. 227–25

2.Calude C.S. (2007), (ed.) Randomness and Complexity, from Leibniz to Chaitin

3.Carnap, R. (1950), Logical Foundations of Probability.

4.Chaitin. G. J. (1965), “An improvement on a theorem by E. F. Moore”, IEEE Transactions on Electronic Computers, EC-14, 466–467.

5.Chaitin, G. J. (1966), “On the Length of Programs for Computing Finite Binary Sequences”, Journal of the Assoc. of Comp. Mach., 13, pp. 547–569.

6.Chaitin, G. J. (1969), “On the Length of Programs for Computing Finite Binary Sequences: Statistical Considerations,” Journal of the Assoc. of Comp. Mach., 16, pp. 145–159.

7.Chaitin, G. J. (2012), Proving Darwin: Making Biology Mathematical (证明达尔文,人民邮电出版社,2015)

8.Chaitin, G. J. (2023), Building the World from Information & Computation, Expanded 2nd ed.

9.Chomsky, A.N. “Three Models for the Description of Language,” IRE Transactions on Information Theory, Vol. IT–2, No. 3, Sept. 1956

10.Deletang, G. et al (2023), “Language Modeling is Compression”, arXiv

11.Gleick, J. (2011), The Information: A History, a Theory, a Flood.

12.Hutter, M. (2005), Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability, Springer-Verlag, Berlin.

13.James, William (1907), Pragmatism: A New Name for Some Old Ways of Thinking

14.Kolmogorov, A.N. (1965), “Three Approaches to the Quantitative Definition of Information”, Problems of Information Transmission, 1 (1): 1–7.

15.Kolmogorov, A.N. (1968), “Logical Basis for Information Theory and Probability Theory”, IEEE Trans. on Information Theory, IT–14(5): 662– 664.

16.Kolmogorov, A.N. (1969), “On the Logical Foundations of Information Theory and Probability Theory”, Problems of Information Transmission, 5:1-4.

17.Kolmogorov, A.N. (1988), How I became a mathematician, (姚芳等编译,我是怎么成为数学家的,大连理工大学出版社,2023)

18.Levin, L. (1973), “Universal search problems”, Problems of Information Transmission, 9 (3): 115–116

19.Levin, L. (1986), “Average case complete problems”, SIAM Journal on Computing, 15 (1), pp. 285–286

20.Li, M. and Vitanyi, P. (2019), An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, N.Y., 1st ed. 1993, 3rd ed. 2008, 4th ed. 2019.

21.Li, M., et al (2023), “A Theory of Human-like Few-shot Learning”, work in progress.

22.Li, M. (2024), “Approximating Human-Like Few-shot Learning with GPT-based Compression”, preprint

23.McCarthy, J. (1956), The Inversion of Functions Defined by Turing Machines, in McCarthy and Shannon ed. Automata Studies.

24.Minsky, M.L. (1961), “Steps Toward Artificial Intelligence,” Proceedings of the IRE, 49:8–30, 1961. reprinted in Feigenbaum, E.A. and Feldman, J. ed. Computers and Thought, 1963

25.Peirce, C.S. (1877), “The Fixation of Belief”, in Philosophical Writings of Peirce,1955

26.Peter, Rozsa (1951), Recursive Functions, Budapest, 递归函数论, 莫绍揆 译, 科学出版社, 1958.

27.Shannon, C.E. (1948), “The Mathematical Theory of Communication,” Bell System Technical Journal, 27:379–423, 623–656.

28.Solomonoff, R.J. (1956), “An Inductive Inference Machine,” Report circulated at the Dartmouth Summer Workshop on Artificial Intelligence, August 1956

29.Solomonoff, R.J. (1957), “An Inductive Inference Machine,” IRE Convention Record, Section on Information Theory.

30.Solomonoff, R.J. (1960), “A Preliminary Report on a General Theory of Inductive Inference,” (Revision of Report V–131), Contract AF 49(639)– 376, Report ZTB–138, Zator Co., Cambridge, Mass., Nov, 1960.

31.Solomonoff, R.J. (1964), “A Formal Theory of Inductive Inference,” Information and Control, Part I: Vol 7, No. 1, pp. 1–22, March.

32.Solomonoff, R.J. (1964), “A Formal Theory of Inductive Inference,” Information and Control, Part II: Vol 7, No. 2, pp. 224–254, June.

33.Solomonoff, R.J. (1978), “Complexity–based induction systems: Comparisons and convergence theorems”, IEEE Trans. on Information Theory, IT–24(4):422–432.

34.Solomonoff, R.J. (1984), “Optimum Sequential Search”, Oxbridge Research Report, revised 1985

35.Solomonoff, R.J. (1997), “The Discovery of Algorithmic Probability”, J. Computer and System Sciences, 55, 73-88.

36.Solomonoff, R.J. (2010), “Algorithmic Probability, Heuristic Programming and AGI”, Proceedings of the 3d Conference on Artificial General Intelligence

37.Solomonoff, R.J. (2011), “Algorithmic Probability -- Its Discovery -- Its Properties and Application to Strong AI”, in Hector Zenil (ed.), Randomness Through Computation: Some Answers, More Questions, Chapter 11, pp. 149-157

38.Sutskever, Ilya (2023), “An Observation on Generalization”, Talk at Simons Institute Aug 14, 2023

39.Veldhuizen, Todd L. (2005), “Software Libraries and Their Reuse: Entropy, Kolmogorov Complexity, and Zipf’s Law”,arxiv

40.Vitanyi, P. (1988), “A Short Biography of A.N. Kolmogorov”, CWI Quarterly, (homepages.cwi.nl/~paulv/KOLMOGOROV.BIOGRAPHY.html)

41.Vitanyi, P. (2009), “Obituary: Ray Solomonoff, Founding Father of Algorithmic Information Theory” (homepages.cwi.nl/~paulv/obituary.html)

42.Wuppuluri, S. and Doria, F.A., (ed) (2020), Unravelling Complexity: The Life and Work of Gregory Chaitin, World Scientific Publishing Company

43. Nick (2021), A Brief History of Artificial Intelligence, 2nd ed.

44. Nick (2024), Understanding Turing.

Source: Mr. Sai

Editor: Lan Duoduo

