Editors: Peach, LRS
There is no need to mourn the MLP: the new network KAN, built on the Kolmogorov–Arnold theorem, offers fewer parameters, stronger performance, and better interpretability. The innovation of deep learning architectures has entered a new era!
Overnight, the machine learning paradigm looks set to change!
The architecture that dominates deep learning today is the multilayer perceptron (MLP), which places fixed activation functions on neurons (nodes).
So, do we have a new route to take beyond that?
Just today, teams from MIT, Caltech, Northeastern University, and other institutions released a new neural network structure - Kolmogorov–Arnold Networks (KAN).
The researchers made a simple change to the MLP: they moved the learnable activation functions from the nodes (neurons) to the edges (weights)!
Paper: https://arxiv.org/pdf/2404.19756
This change may sound arbitrary at first glance, but it has a deep connection to the approximation theories in mathematics.
It turns out that the Kolmogorov–Arnold representation corresponds to a two-layer network with learnable activation functions on the edges rather than on the nodes.
Inspired by the representation theorem, the researchers explicitly parameterized the Kolmogorov-Arnold representation with a neural network.
It is worth mentioning that the origin of the name KAN is in honor of two great late mathematicians, Andrey Kolmogorov and Vladimir Arnold.
Experimental results show that KAN has superior performance compared with traditional MLP, and improves the accuracy and interpretability of neural networks.
Most unexpectedly, the visualization and interactivity of KAN make it potentially useful in scientific research, helping scientists discover new mathematical and physical laws.
In the study, the author used KAN to rediscover the mathematical laws of knot theory!
Moreover, KAN reproduces DeepMind's 2021 results with a smaller network and more automation.
In physics, KAN can help physicists study Anderson localization, which is a phase transition in condensed matter physics.
Incidentally, every KAN example in the study (except the parameter sweeps) can be reproduced in under 10 minutes on a single CPU.
The emergence of KAN directly challenges the MLP architecture that has long dominated machine learning, and it has caused an uproar across the internet.
A new era of machine learning begins
Some people say that a new era of machine learning has begun!
A research scientist at Google DeepMind commented: "Kolmogorov–Arnold strikes again! A little-known fact: this theorem appeared in a seminal paper on permutation-invariant neural networks (Deep Sets), demonstrating the intricate connection between this representation and the way set/GNN aggregators are built (as a special case)."
A new neural network architecture is born! KAN will dramatically change the way AI is trained and fine-tuned.
Could it be that AI has entered the 2.0 era?
Some netizens used plain language to draw a vivid analogy for the difference between KAN and MLP:
The Kolmogorov–Arnold network (KAN) is like a three-layer cake recipe that can bake any cake, while the multilayer perceptron (MLP) is a custom cake with different layers. The MLP is more complex but more versatile, while the KAN is static but simpler and faster for a single task.
One of the authors, MIT professor Max Tegmark, said the latest paper shows that an architecture completely different from standard neural networks achieves greater accuracy with fewer parameters on interesting physics and mathematics problems.
Next, let's take a look at KAN, which may represent the future of deep learning, and how it is implemented.
Back to KAN itself
The theoretical basis of KAN
The Kolmogorov–Arnold representation theorem states that if f is a multivariate continuous function on a bounded domain, then f can be written as a finite composition of continuous univariate functions and addition.
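In symbols, for a continuous f on [0,1]^n, the theorem guarantees a representation of the form:

```latex
f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

where each inner function φ_{q,p}: [0,1] → ℝ and each outer function Φ_q: ℝ → ℝ is a continuous univariate function; addition is the only truly multivariate operation.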
For machine learning, the problem can be restated as follows: learning a high-dimensional function can be reduced to learning a polynomial number of one-dimensional functions.
But these one-dimensional functions may be non-smooth, even fractal, and thus unlearnable in practice. Precisely because of this "pathological behavior", the Kolmogorov–Arnold representation theorem was long considered basically "dead" for machine learning: theoretically correct but practically useless.
In this article, the researchers remain optimistic about the application of the theorem to the field of machine learning and propose two improvements:
1. The original formulation has only two layers of nonlinearities and a single hidden layer of width 2n+1; the network can be generalized to arbitrary width and depth;
2. Most functions in science and everyday life are smooth and have sparse compositional structure, which may facilitate smooth Kolmogorov–Arnold representations. The contrast resembles the difference between a physicist and a mathematician: the physicist cares more about the typical case, the mathematician about the worst case.
KAN architecture
The core idea of the design of the Kolmogorov-Arnold network (KAN) is to transform the approximation problem of multivariate functions into a problem of learning a set of univariate functions. In this framework, each univariate function can be parameterized with a B-spline curve, where the B-spline is a local, piecewise polynomial curve with learnable coefficients.
In order to extend the two-layer network in the original theorem deeper and wider, the researchers proposed a more "generalized" version of the theorem to support the design of KAN:
Inspired by the cascaded structure that gives MLPs their depth, the researchers introduce an analogous concept, the KAN layer, which consists of a matrix of one-dimensional functions, each with trainable parameters.
In the original Kolmogorov–Arnold theorem, the network consists of inner and outer functions, corresponding to the input and output dimensions respectively. Stacking KAN layers not only deepens KANs but also preserves the network's interpretability and expressiveness: each layer is composed of univariate functions that can be studied and understood individually.
Formally, an L-layer KAN is the composition KAN(x) = (Φ_{L−1} ∘ Φ_{L−2} ∘ ⋯ ∘ Φ_0)(x), where each Φ_l is a KAN layer.
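A toy, self-contained sketch of this layer structure (class and function names are hypothetical; the actual implementation uses trainable splines in PyTorch):

```python
import numpy as np
from scipy.interpolate import BSpline

def make_edge(rng, G=5, k=3):
    """One univariate edge function phi, parameterized as a cubic B-spline."""
    grid = np.linspace(-1.0, 1.0, G + 1)
    t = np.concatenate([np.repeat(grid[0], k), grid, np.repeat(grid[-1], k)])
    return BSpline(t, 0.1 * rng.standard_normal(G + k), k)

class KANLayer:
    """Maps n_in inputs to n_out outputs via a matrix of univariate functions:
    out[j] = sum_i phi[j][i](x[i]) -- addition is the only multivariate op."""
    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.phi = [[make_edge(rng) for _ in range(n_in)] for _ in range(n_out)]

    def __call__(self, x):
        return np.array([sum(f(xi) for f, xi in zip(row, x)) for row in self.phi])

# Stacking layers yields a deeper KAN: x -> layer0 -> layer1 -> ...
layer0, layer1 = KANLayer(2, 5, seed=0), KANLayer(5, 1, seed=1)
out = layer1(layer0(np.array([0.3, -0.4])))   # output of shape (1,)
```

Each entry of `self.phi` is an independently learnable curve, which is what makes the layer inspectable function by function.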
Implementation details
Although KAN's design concept looks like simple stacking, it is not easy to optimize, and the researchers worked out several tricks during training.
1. Residual activation functions: the activation φ(x) is constructed as the combination of a basis function b(x) and a spline function, borrowing the idea of residual connections, which helps stabilize training.
2. Initialization scales: each spline function is initialized close to zero, and the weight w uses Xavier initialization, which helps keep gradients stable at the start of training.
3. Updating the spline grids: since splines are defined on a bounded interval while activation values may drift outside it during training, the spline grid is updated dynamically so that the splines always operate over an appropriate range.
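The first two tricks can be sketched in a few lines of plain Python (the weights and the zero spline below are illustrative stand-ins):

```python
import math

def silu(x):
    """The basis function b(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def phi(x, w_b, w_s, spline):
    """Residual-style activation: phi(x) = w_b * b(x) + w_s * spline(x)."""
    return w_b * silu(x) + w_s * spline(x)

# Trick 2 in action: with the spline initialized near zero,
# phi behaves like plain SiLU at the start of training.
out = phi(1.5, w_b=1.0, w_s=1.0, spline=lambda z: 0.0)
# out == silu(1.5)
```

The residual path through b(x) guarantees a useful gradient even while the spline part is still near its zero initialization.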
The number of parameters
1. Network depth: L
2. The width of each layer: N
3. Each spline function is defined over G intervals (G+1 grid points) with order k (usually k=3)
So the parameter count of a KAN is roughly O(L·N²·(G+k)).
In contrast, an MLP requires O(L·N²) parameters, which seems more efficient than a KAN; but KANs can use a much smaller layer width N, which improves not only generalization but also interpretability.
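A back-of-the-envelope comparison; the concrete depths and widths below are made up purely to illustrate the trade-off:

```python
# KAN: each of the ~L*N^2 edges carries G + k spline coefficients.
# MLP: each of the L layers carries ~N^2 scalar weights.
L, G, k = 3, 5, 3              # depth, grid intervals, spline order
N_kan, N_mlp = 8, 64           # KANs can often use a much smaller width

kan_params = L * N_kan**2 * (G + k)   # O(L * N^2 * (G + k))
mlp_params = L * N_mlp**2             # O(L * N^2)
# With these illustrative numbers, the KAN is still the smaller model.
```

The per-edge factor (G + k) is what makes KANs look expensive at equal width; the paper's point is that equal width is the wrong comparison.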
Where does KAN beat MLP?
Stronger performance
As a sanity check, the researchers constructed five known examples with smooth KA (Kolmogorov–Arnold) representations as validation datasets, and trained KANs by increasing the number of grid points every 200 steps, covering the range {3, 5, 10, 20, 50, 100, 200, 500, 1000}.
MLPs of different depths and widths served as baselines; both KANs and MLPs were trained with the L-BFGS algorithm for a total of 1800 steps, and RMSE was used as the comparison metric.
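A one-dimensional miniature of this setup (the target function and grid size are arbitrary stand-ins, and only a single spline "edge" is fitted): optimize spline coefficients with L-BFGS and measure RMSE.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize

x = np.linspace(-1.0, 1.0, 200)
y = np.sin(np.pi * x)                        # stand-in smooth target

G, k = 10, 3                                 # grid intervals, spline order
grid = np.linspace(-1.0, 1.0, G + 1)
t = np.concatenate([np.repeat(grid[0], k), grid, np.repeat(grid[-1], k)])

def mse(coef):                               # squared error of the spline fit
    return np.mean((BSpline(t, coef, k)(x) - y) ** 2)

# L-BFGS drives the spline coefficients to a near-exact fit of the target.
res = minimize(mse, x0=np.zeros(G + k), method="L-BFGS-B")
rmse = float(np.sqrt(res.fun))               # far below the all-zero baseline
```

Because the model is linear in its coefficients, the loss surface is convex and L-BFGS converges quickly, which is part of why this optimizer suits spline-based models.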
The results show that KAN's loss curves, though more jittery, converge quickly to a plateau at lower values and beat the MLP's scaling curves, especially in high dimensions.
It can also be seen that the three-layer KAN performs much better than the two-layer one, indicating that deeper KANs have stronger expressive power, in line with expectations.
Interpreting KAN interactively
The researchers designed a simple regression experiment to show that users can get the most interpretable results when interacting with KAN.
Suppose a user is interested in finding a symbolic formula, which requires a total of 5 interactive steps.
Step 1: Training with sparsity.
Starting from a fully connected KAN, training with sparsity regularization makes the network sparse, so that 4 of the 5 hidden-layer neurons appear to contribute nothing.
Step 2: Pruning
Automatic pruning discards all the useless hidden neurons, leaving a small KAN whose activation functions can be matched to known symbolic functions.
Step 3: Set the symbolic functions
If the user can correctly guess these symbolic forms by inspecting the KAN diagram, they can set them directly.
If the user has no domain knowledge or doesn't know which symbolic functions these activation functions might be, the researchers provide a function suggest_symbolic to suggest symbolic candidates.
Step 4: Further training
Once all activation functions in the network have been symbolized, the only remaining parameters are the affine parameters. Continuing to train these, the user can tell the model has found the correct symbolic expression when the loss drops to machine precision.
Step 5: Output the symbolic formula
Use SymPy to compute the symbolic formula of the output node and verify the answer.
Interpretability verification
The researchers first designed six examples in a supervised toy dataset to demonstrate how KAN recovers the compositional structure of a symbolic formula.
It can be seen that KAN successfully learns the correct univariate functions and, through visualization, shows its reasoning in an interpretable way.
In the unsupervised setting, the dataset contains only the input feature x, and the ability of the KAN model to find dependencies between variables can be tested by designing connections between certain variables (x1, x2, x3).
From the results, the KAN model has succeeded in finding the functional dependence between variables, but the authors also point out that experiments are still only carried out on synthetic data, and a more systematic and controllable method is needed to discover the complete relationship.
Pareto optimal
By fitting a special function, the authors show the Pareto frontier of KAN and MLP in a plane spanned by the number of model parameters and RMSE loss.
Of all the special functions, KAN always has a better Pareto front than MLP.
Solving partial differential equations
In the task of solving partial differential equations, the researchers plotted the L² and H¹ errors between the predicted solution and the true solution.
In the figure below, the first two panels show the training dynamics of the losses, and the third and fourth show the scaling law of loss against the number of parameters.
As shown in the results below, KAN converges faster, has lower losses, and has a steeper scaling law compared to MLP.
Continual learning without catastrophic forgetting
We all know that catastrophic forgetting is a serious problem in machine learning.
A difference between artificial neural networks and the brain is that the brain has functionally distinct modules localized in space; when learning a new task, structural reorganization occurs only in the local region responsible for the relevant skill, while other regions remain unchanged.
However, most artificial neural networks, including MLP, do not have this concept of locality, which may be the cause of catastrophic forgetting.
The study shows that KANs have local plasticity and can exploit the locality of splines to avoid catastrophic forgetting.
The idea is simple: since splines are local, a new sample affects only a few nearby spline coefficients, leaving distant coefficients untouched.
In contrast, since MLPs typically use global activations (e.g., ReLU/Tanh/SiLU), any local change can propagate uncontrollably to distant regions, destroying the information stored there.
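This locality is easy to verify directly (the grid size below is arbitrary): perturb one spline coefficient and observe that the function only changes on that coefficient's small support.

```python
import numpy as np
from scipy.interpolate import BSpline

G, k = 10, 3
grid = np.linspace(-1.0, 1.0, G + 1)
t = np.concatenate([np.repeat(grid[0], k), grid, np.repeat(grid[-1], k)])
x = np.linspace(-1.0, 1.0, 400)

base = np.zeros(G + k)
before = BSpline(t, base, k)(x)
bumped = base.copy()
bumped[2] += 1.0                       # "learn" something in one local region
after = BSpline(t, bumped, k)(x)

changed = np.abs(after - before) > 1e-12
# Only x inside the support of basis function 2 (here x < -0.4) moved;
# a global activation like tanh would have changed the output everywhere.
```

This is the mechanism behind the sequential Gaussian-peaks experiment: data near one peak touches only the coefficients whose support covers that peak.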
The researchers employed a one-dimensional regression task consisting of 5 Gaussian peaks. The data around each peak is presented to KAN and MLP sequentially, rather than all at once.
As shown in the figure below, KAN only reconstructs the area where data exists in the current stage, leaving the previous area unchanged.
MLP, on the other hand, reshapes entire areas when they see new data samples, leading to catastrophic forgetting.
Rediscovering knot theory, surpassing DeepMind's results
What does the birth of KAN mean for the future application of machine learning?
Knot theory is a branch of low-dimensional topology; it reveals topological properties of three-manifolds and four-manifolds, and has wide applications in biology and topological quantum computing.
In 2021, the DeepMind team was the first to use AI to advance knot theory, publishing the results in Nature.
Paper: https://www.nature.com/articles/s41586-021-04086-x
In that study, supervised learning and human domain experts together arrived at a new theorem relating algebraic and geometric knot invariants.
Concretely, gradient saliency identified the key invariants of the supervised problem, which led domain experts to formulate a conjecture that was subsequently refined and proven.
Here, the authors investigate whether KAN can achieve good, interpretable results on the same problem, namely predicting the signature of a knot.
In DeepMind's experiments, the main findings on the knot theory dataset were:
1. Using network attribution, they found that the signature σ depends mainly on the meridional distance μ and the longitudinal distance λ.
2. Human domain experts later found that σ is highly correlated with the slope (roughly Re(λ/μ)) and derived a corresponding bound.
To investigate finding (1), the authors treat the 17 knot invariants as inputs and the signature as the output.
As in DeepMind's setup, the signature (an even number) is encoded as a one-hot vector, and the network is trained with cross-entropy loss.
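The encoding described above, in miniature (the signature range used here is hypothetical, purely for illustration):

```python
import math

# Signatures of knots are even integers; encode each as a one-hot vector.
classes = [-4, -2, 0, 2, 4]            # assumed range for illustration

def one_hot(sig):
    return [1.0 if c == sig else 0.0 for c in classes]

def cross_entropy(probs, target):
    """Cross-entropy between predicted class probabilities and a one-hot target."""
    return -sum(t * math.log(p + 1e-12) for p, t in zip(probs, target))

# A confident, correct prediction yields a near-zero loss.
loss = cross_entropy([0.01, 0.01, 0.96, 0.01, 0.01], one_hot(0))
```

Treating the even signatures as discrete classes is what lets both KAN and the MLP baseline be trained as ordinary classifiers.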
It was found that a very small KAN reached 81.6% test accuracy, whereas DeepMind's 4-layer MLP with width 300 reached only 78%.
As shown in the table below, KAN (G = 3, k = 3) has about 200 parameters, while MLP has about 300,000 parameters.
In other words, KAN is not only more accurate but also far more parameter-efficient than the MLP.
For interpretability, the researchers scale the transparency of each activation according to its magnitude, so it is immediately clear which input variables matter, without any feature-attribution method.
KAN was then trained on the three most important variables, obtaining 78.2% test accuracy.
As follows, with KAN, the authors rediscovered three mathematical relationships in the knot dataset.
Cracking Anderson localization in physics
KAN also proves hugely valuable in physics applications.
Anderson localization is a fundamental phenomenon in which disorder in a quantum system leads to localization of the electron wave function, halting all transport.
In one and two dimensions, scaling arguments show that all electronic eigenstates are exponentially localized for arbitrarily small random disorder.
In contrast, in three dimensions, a critical energy forms a phase boundary that separates the extended and localized states, which is known as the mobility edge.
Understanding these mobility edges is crucial for explaining fundamental phenomena such as the metal-insulator transition in solids and the localization of light in photonic devices.
Through their research, the authors found that KANs make it very easy to extract mobile edges, both numerically and symbolically.
Clearly, KAN has become a scientist's right-hand man, an important collaborator.
All in all, thanks to the advantages of accuracy, parameter efficiency, and interpretability, KAN will be a useful model/tool for AI+Science.
In the future, the further application of KAN in the field of science remains to be explored.