
How reliable is the theoretical basis for the alchemy of machine learning?

Know what works, and know why it works.

The field of machine learning has developed very rapidly in recent years, but our understanding of machine learning theory remains limited: the empirical performance of some models has outpaced our grasp of the theory behind them.

More and more researchers in the field have begun to pay attention to and reflect on this problem. Recently, data scientist Aidan Cooper wrote a blog post about the relationship between models' empirical results and their theoretical underpinnings. What follows is the text of the post:

In machine learning, some models are extremely effective, yet we are not entirely sure why. Conversely, some relatively well-understood areas of research have limited applicability in practice. This post surveys the progress of various subfields along two axes: empirical utility and theoretical understanding.

The "experimental utility" here is a comprehensive consideration that takes into account the breadth of applicability of a method, the ease of implementation, and, most importantly, the usefulness of the real world. Some methods are not only highly practical but also widely applicable, while others, while powerful, are limited to specific areas. Reliable, predictable, and free of major flaws is considered to be more effective.

"Theoretical understanding" considers the interpretability of a method: what the relationship between input and output is, how the expected results can be obtained, and what the method's internal mechanics are. It also considers the depth and completeness of the literature surrounding the method.

Methods with low theoretical understanding usually rely on heuristics or extensive trial and error when implemented; methods with high theoretical understanding tend to have formulaic implementations, strong theoretical foundations, and predictable results. Simpler methods, such as linear regression, have a lower theoretical ceiling, while more complex methods, such as deep learning, have a higher one. When it comes to the depth and completeness of the literature within a field, each field is evaluated relative to its assumed theoretical ceiling, which relies in part on intuition.

We can arrange these two axes into a matrix of four quadrants, with the intersection of the axes representing a hypothetical reference field of average understanding and average utility. This lets us interpret each field qualitatively according to the quadrant it falls in, as shown in the figure below; a field in a given quadrant may have some or all of the characteristics associated with that quadrant.

Figure 1: The field of machine learning in 2022.

Not all areas in the preceding diagram fall fully within machine learning (ML), but they can all be applied to, or are closely related to, the ML context. Many of the areas evaluated overlap and cannot be cleanly delineated: advanced approaches to reinforcement learning, federated learning, and graph ML are often based on deep learning, so I considered the non-deep-learning aspects of their theory and utility.

Upper right quadrant: high understanding, high utility

Linear regression is a simple, well-understood, and highly effective method. Although it is often underestimated and overlooked, its breadth of use and thorough theoretical grounding place it in the upper-right corner of the figure.
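Part of why linear regression is so well understood is that its solution can be written in closed form. Here is a minimal sketch on synthetic data (the data and coefficients are invented for illustration):

```python
import numpy as np

# Toy data: y = 2x + 1 plus noise (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=100)

# Add an intercept column and solve the least-squares problem,
# equivalent to the normal equation beta = (X'X)^{-1} X'y.
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(f"intercept ~ {beta[0]:.2f}, slope ~ {beta[1]:.2f}")  # recovers ~1.0 and ~2.0
```

Every step here has a precise mathematical justification, which is exactly the kind of theoretical transparency that places the method in this quadrant.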

Traditional machine learning has evolved into a field with both high theoretical understanding and high practical utility. Complex ML algorithms, such as gradient-boosted decision trees (GBDT), have been shown to outperform linear regression on some complex prediction tasks, and this is certainly true of big-data problems. Arguably, gaps remain in the theoretical understanding of such models, but implementing machine learning is a careful methodological process that, done well, runs reliably in industry.

However, the additional complexity and flexibility do introduce pitfalls, which is why I place machine learning to the left of linear regression. In general, supervised machine learning is more refined and more influential than its unsupervised counterpart, but both approaches effectively address different problem spaces.
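To make the GBDT-versus-linear-regression comparison above concrete, here is a hedged sketch using scikit-learn on a synthetic nonlinear task (the task and parameters are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic nonlinear task where a tree ensemble should beat a linear fit.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 5))
y = np.sin(X[:, 0]) * X[:, 1] + (X[:, 2] > 0) + rng.normal(0, 0.1, 2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), GradientBoostingRegressor(random_state=0)):
    score = model.fit(X_tr, y_tr).score(X_te, y_te)  # R^2 on held-out data
    print(f"{type(model).__name__}: R^2 = {score:.3f}")
```

On data like this, the linear model captures almost none of the interaction structure, while the boosted ensemble does; on genuinely linear problems the gap closes or reverses.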

Bayesian methods have a group of avid practitioners who preach their superiority over the more popular classical statistical methods. Bayesian models are particularly useful in certain cases: when point estimates alone are not enough and estimates of uncertainty matter; when data is limited or highly missing; and when you understand the data-generating process you want to explicitly include in the model. Their usefulness is limited by the fact that, for many problems, point estimates are good enough that people simply default to non-Bayesian methods. What's more, there are ways to quantify uncertainty in traditional ML (they are just rarely used). In general, it is easier to apply ML algorithms directly to data without having to think about data-generating mechanisms and priors. Bayesian models are also computationally expensive; if theoretical advances produce better sampling and approximation methods, they will become more practical.
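A minimal sketch of the trade-off described above, using a conjugate Beta-Bernoulli model on invented data: the point estimate alone hides how uncertain a small sample really is.

```python
from scipy import stats

# Observed data: 7 successes in 10 trials (tiny sample, so uncertainty matters).
successes, trials = 7, 10

# Conjugate Beta(1, 1) prior gives a Beta(1 + successes, 1 + failures) posterior.
posterior = stats.beta(1 + successes, 1 + (trials - successes))

point_estimate = successes / trials          # classical MLE: a single number
credible_interval = posterior.interval(0.95) # Bayesian uncertainty estimate

print(f"point estimate: {point_estimate:.2f}")
print(f"95% credible interval: ({credible_interval[0]:.2f}, {credible_interval[1]:.2f})")
```

The interval spans roughly 0.39 to 0.89: the data are compatible with a coin that is slightly biased against success, which a bare 0.70 point estimate never reveals.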

Lower right quadrant: low understanding, high utility

In contrast to progress in most fields, deep learning has achieved some astonishing successes even though theoretical progress has proven fundamentally difficult. Deep learning exemplifies many traits of a poorly understood approach: unstable models, difficulty in building them reliably, configuration based on weak heuristics, and unpredictable results. Dubious practices such as "tuning" the random seed are very common, and the mechanics of working models are difficult to explain. Yet deep learning continues to advance, reaching superhuman performance in areas such as computer vision and natural language processing, and opening up a world of otherwise intractable tasks, such as autonomous driving.
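The seed sensitivity mentioned above is easy to demonstrate. A minimal sketch, assuming PyTorch: the same tiny network trained on the same toy data, with only the random seed changed, typically lands at noticeably different losses.

```python
import torch
from torch import nn

# Fixed toy dataset, so that only initialisation varies between runs.
g = torch.Generator().manual_seed(0)
X = torch.randn(256, 10, generator=g)
y = (X.sum(dim=1, keepdim=True) > 0).float()

def train_once(seed: int) -> float:
    """Train a tiny MLP on the same data; only the weight-init seed changes."""
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(100):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

for seed in range(5):
    print(f"seed {seed}: final loss = {train_once(seed):.4f}")
```

When results vary this much with nothing but the seed, "tuning" the seed stops being a joke and becomes a (dubious) practice.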

Hypothetically, artificial general intelligence would occupy the bottom-right corner, since by definition superintelligence is beyond human comprehension and could be applied to any problem. For now, it is included only as a thought experiment.

Qualitative descriptions of each quadrant. A field may exhibit some or all of the descriptions in its corresponding quadrant.

Upper left quadrant: high understanding, low utility

Most forms of causal inference are not machine learning, though some are, and causality is always of interest to anyone building predictive models. The field can be divided into randomized controlled trials (RCTs) and the more sophisticated methods of causal inference, which attempt to measure causal effects from observational data. RCTs are simple in theory and give rigorous results, but they are often expensive and impractical to run in the real world, if not impossible, and therefore of limited utility. Causal inference methods essentially mimic RCTs without having to conduct one, which makes them far less demanding to perform, but they come with many limitations and pitfalls that can invalidate the results. Overall, causality remains a frustrating pursuit: current approaches often cannot answer the questions we want to ask, unless those questions can be explored via RCTs or happen to fit certain frameworks (for instance, as a serendipitous result of "natural experiments").
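A small simulation shows why observational estimates go wrong in ways an RCT does not. This sketch uses an invented confounder ("severity") that drives both treatment assignment and the outcome; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: severity raises both treatment probability and risk.
severity = rng.uniform(0, 1, n)
treated = rng.random(n) < severity                           # sicker patients get treated
outcome = severity + 0.1 * treated + rng.normal(0, 0.1, n)   # true effect = +0.1

# Naive observational estimate: difference in group means (confounded).
naive = outcome[treated].mean() - outcome[~treated].mean()

# "RCT": randomise treatment independently of severity.
rct_treated = rng.random(n) < 0.5
rct_outcome = severity + 0.1 * rct_treated + rng.normal(0, 0.1, n)
rct = rct_outcome[rct_treated].mean() - rct_outcome[~rct_treated].mean()

print(f"naive observational estimate: {naive:+.3f}")  # biased far above +0.1
print(f"randomised estimate:          {rct:+.3f}")    # close to the true +0.1
```

Causal inference methods attempt to recover the randomised answer from the confounded data, which is exactly where the limitations and pitfalls arise.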

Federated learning (FL) is a cool concept that receives little attention, perhaps because its most compelling applications require distribution across a huge number of smartphones, so only two players can really study it: Apple and Google. Other use cases for FL exist, such as pooling proprietary datasets, but coordinating such initiatives poses political and logistical challenges that limit their usefulness in practice. Still, for what sounds like a fancy concept (roughly summarized as: "bring the model to the data, rather than the data to the model"), FL works, and it has tangible success stories in areas such as keyboard text prediction and personalized news recommendations. The basic theory and technology behind FL seem sufficient for it to become much more widely used.
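A minimal sketch of the federated-averaging idea ("bring the model to the data"), with invented clients and a toy least-squares objective; real FL systems, such as those behind keyboard prediction, are vastly more involved.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, steps=10):
    """Each client refines the model on its own private data."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0])
clients = []
for _ in range(5):  # each client holds its own private dataset
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(0, 0.1, 50)))

global_w = np.zeros(2)
for _ in range(20):  # communication rounds
    # Only model weights travel to the server, never the raw data.
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(local_ws, axis=0)        # federated averaging

print(global_w)  # approaches [1.5, -2.0] without any data leaving a client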

Reinforcement learning (RL) has reached unprecedented levels of ability in games such as chess, Go, poker, and DotA. But outside of video games and simulated environments, RL has yet to convincingly translate into real-world applications. Robotics was supposed to be RL's next frontier, but that hasn't come to fruition; reality appears far more challenging than highly constrained toy environments. That said, RL's achievements so far are inspiring, and people who really like chess might argue it deserves to be rated as more useful. I'd like to see RL realize some of its potential practical applications before placing it on the right side of the matrix.
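For a flavour of what RL looks like at its simplest, here is a tabular Q-learning sketch on an invented five-state corridor; the constrained toy environments where RL excels are, in essence, scaled-up versions of this.

```python
import numpy as np

# Tabular Q-learning on a toy 5-state corridor: start at state 0,
# reward 1.0 for reaching state 4. Actions: 0 = left, 1 = right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        a = int(rng.integers(n_actions))  # random behaviour policy (Q-learning is off-policy)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: bootstrap from the best next action.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1)[:-1])  # greedy policy for non-terminal states: all 1 ("right")
```

The gap the paragraph describes is that this logic transfers cleanly to Go or DotA via simulation, but not to a physical robot, where every "episode" is slow, expensive, and unrepeatable.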

Lower left quadrant: low understanding, low utility

Graph neural networks (GNNs) are a very hot area of machine learning, with promising results in several fields. But for many of these examples, it is unclear whether GNNs outperform the alternative of pairing more traditional structured-data techniques with deep learning architectures. Data that are naturally graph-structured, such as molecules in cheminformatics, seem to yield the most compelling GNN results (although even these are generally not superior to non-graph approaches). Compared with most fields, there also seems to be a large gap between the open-source tooling for training GNNs at scale and the in-house tools used in industry, which limits the viability of large GNNs outside of those walled gardens. The complexity and breadth of the field suggest a high theoretical ceiling, so GNNs should have room to mature and convincingly demonstrate advantages on certain tasks, which would bring greater practicality. GNNs may also benefit from technological advances, since graphs do not currently map naturally onto existing computing hardware.
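The core GNN operation, message passing, is simple to sketch. Below is one graph-convolution-style layer in NumPy, with random placeholder weights standing in for learned parameters.

```python
import numpy as np

# One message-passing layer in the spirit of a graph convolution:
# each node averages its neighbours' features (plus its own), then
# applies a linear map and a nonlinearity.
rng = np.random.default_rng(0)

A = np.array([[0, 1, 1, 0],   # adjacency matrix of a 4-node graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 8))   # node features (4 nodes, 8 features each)
W = rng.normal(size=(8, 16))  # weights (random placeholders for learned values)

A_hat = A + np.eye(4)                       # add self-loops
A_hat /= A_hat.sum(axis=1, keepdims=True)   # row-normalise: mean aggregation

H_next = np.maximum(0, A_hat @ H @ W)       # aggregate, transform, ReLU
print(H_next.shape)  # (4, 16): a new 16-dim embedding per node
```

The sparse, irregular structure of A is also why graphs sit awkwardly on hardware built for dense tensor operations, as noted above.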

Interpretable machine learning (IML) is an important and promising field that continues to receive attention. Techniques such as SHAP and LIME have become genuinely useful tools for interrogating ML models. However, due to limited adoption, the utility of existing methods has not yet been fully realized; sound best practices and implementation guidelines are not yet in place. The main weakness of IML at the moment, though, is that it does not address the causal questions we are really interested in: IML explains how a model makes its predictions, not how the underlying data relate causally to them (although it is often misinterpreted that way). Until major theoretical advances arrive, the legitimate uses of IML are mostly limited to model debugging/monitoring and hypothesis generation.
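A minimal sketch of the SHAP workflow mentioned above, assuming the shap package and scikit-learn are installed; the data are synthetic. Note that the output attributes the model's predictions to features, which says nothing about causation.

```python
import numpy as np
import shap  # assumed installed; the SHAP package referenced above
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which only the first two features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 500)

model = RandomForestRegressor(random_state=0).fit(X, y)

# SHAP values: one attribution per feature per sample, explaining how
# the model predicts, not whether the features cause the outcome.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])

print(np.abs(shap_values).mean(axis=0))  # features 0 and 1 dominate
```

If feature 0 were merely correlated with the true cause, SHAP would credit it all the same, which is exactly the misinterpretation the paragraph warns against.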

Quantum machine learning (QML) is far outside my wheelhouse, but at present it seems like a largely hypothetical exercise, patiently waiting for viable quantum computers to become available. Until then, QML sits inconsequentially in the bottom-left corner.

Incremental progress, technological leaps and paradigm shifts

There are three main mechanisms by which fields traverse the matrix of theoretical understanding and empirical utility (Figure 2).

Figure 2: An illustrative example of how fields can traverse the matrix.

Incremental progress is the slow and steady advance that moves a field, inch by inch, up and to the right of the matrix. Supervised machine learning over the past few decades is a great example: during that time, increasingly effective predictive algorithms were refined and adopted, giving us the powerful toolbox we enjoy today. Incremental progress is the status quo in all mature fields, punctuated by periods of more intense movement caused by technological leaps and paradigm shifts.

Technological leaps have given some fields step changes in scientific progress. The deep learning boom of the 2010s was not unlocked by its theoretical underpinnings, which had been discovered more than 20 years earlier; it was parallel processing on consumer-grade GPUs that powered its renaissance. Technological leaps usually manifest as jumps to the right along the empirical-utility axis. However, not all technology-led progress is a leap: today's deep learning is characterized by incremental progress through training ever-larger models with more compute and more specialized hardware.

The final mechanism of scientific progress within this framework is the paradigm shift. As Thomas Kuhn argued in his book The Structure of Scientific Revolutions, paradigm shifts represent important changes in the fundamental concepts and experimental practices of a scientific discipline. One example is the causal frameworks pioneered by Donald Rubin and Judea Pearl, which elevated the field of causality from randomized controlled trials and traditional statistical analysis to a more powerful mathematical discipline in the form of causal inference. Paradigm shifts typically manifest as an upward movement in understanding, which may be followed or accompanied by an increase in utility.

However, paradigm shifts can traverse the matrix in any direction. When neural networks (and subsequently deep neural networks) established themselves as a paradigm independent of traditional ML, the shift initially corresponded to a decline in both utility and understanding. Many emerging fields branch off from more mature fields of study in this way.

The scientific revolution in prediction and deep learning

To wrap up, here are some speculative predictions about what could happen in each field (Table 1). The fields in the upper-right quadrant are omitted, since they are too mature to see dramatic progress.

Table 1: Predictions of future progress in several major areas of machine learning.

More important than how any individual field develops, however, is the overarching trend toward empiricism and the growing willingness to forgo comprehensive theoretical understanding.

Historically, theory (hypothesis) generally came first, and experiments then tested the idea. But deep learning has given rise to a new scientific process that upends this: methods are expected to demonstrate state-of-the-art performance before anyone attends to the theory. Empirical results are king; theory is optional.

This has led to widespread gaming of the system in machine learning research: obtaining state-of-the-art results by trivially modifying existing methods and relying on randomness to edge past the baseline, rather than meaningfully advancing the field's theory. But perhaps that is the price we pay for the new wave of the machine learning boom.
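The seed-gaming problem is easy to simulate. In this invented example, the "new" method is statistically identical to the baseline, yet reporting the best of 20 seeds manufactures an apparent improvement.

```python
import numpy as np

# Both methods truly score noise around 80.0; any reported gap is luck.
rng = np.random.default_rng(0)

baseline = rng.normal(80.0, 0.5)              # one baseline run
new_method_runs = rng.normal(80.0, 0.5, 20)   # 20 seeds, same true performance

print(f"baseline:          {baseline:.2f}")
print(f"best of 20 seeds:  {new_method_runs.max():.2f}")  # looks like progress
print(f"honest mean/std:   {new_method_runs.mean():.2f} +/- {new_method_runs.std():.2f}")
```

Reporting the mean and spread across seeds, rather than the maximum, is the simple discipline that distinguishes a genuine advance from a lucky draw.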

Figure 3: Three potential trajectories for deep learning in 2022.

It remains to be seen whether deep learning's results-first process, in which theoretical understanding is downgraded to optional, is irreversible; 2022 could be a turning point. We should think about the following questions:

Will theoretical breakthroughs bring our understanding up to the level of our practice and turn deep learning into a more organized discipline, like traditional machine learning?

Is the existing body of deep learning knowledge sufficient for utility to increase indefinitely, simply by scaling up to larger and larger models?

Or will an empirical breakthrough send us further down the rabbit hole, into a new paradigm of even greater utility that we understand even less?

Will any of these routes lead to artificial general intelligence? Only time will tell.
