According to an Emergen Research analysis, the global deep learning market is expected to reach $93.34 billion by 2028, growing at a compound annual growth rate of 39.1%. The key factors driving this revenue growth are the adoption of cloud-based technologies and the use of deep learning systems in big data analytics.

So, what exactly is deep learning? How does it work?

According to a recent VentureBeat article titled "This is Why Deep Learning Is So Powerful", deep learning is a subset of machine learning that uses neural networks to learn and make predictions. Deep learning performs remarkably well across a variety of tasks, whether they involve text, time series, or computer vision. Its success comes largely from the availability of big data and computing power, which allow deep learning to perform far better than classical machine learning algorithms on many of these tasks.

The essence of deep learning: neural networks and functions

Some netizens once joked, "When you want to fit any function or any distribution, and you have no ideas, try a neural network!"

First of all, two important conclusions:

A neural network is a network of interconnected neurons, each of which is a limited function approximator. Taken together, the neurons allow a neural network to be treated as a universal function approximator.

Deep learning refers to neural networks with many hidden layers (usually more than 2). A deep network composes functions layer by layer in order to find the function that defines the mapping from input to output.

In high school math we learn that a function is a mapping from an input space to an output space. The simple function sin(x), for example, maps an angular space (-180° to 180°, or 0° to 360°) to a real number space (-1 to 1). Function approximation is an important part of function theory, and its basic problem is how to represent a function approximately.

So why are neural networks considered universal function approximators?

Each neuron learns a limited function: f(X) = g(W*X), where W is the weight vector to be learned, X is the input vector, and g(.) is a nonlinear transformation. W*X can be visualized as a line (a hyperplane in high-dimensional space), while g(.) can be any nonlinear function that is differentiable (at least almost everywhere), such as sigmoid, tanh, or ReLU, all of which are commonly used in deep learning.
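As a minimal sketch of the neuron described above, the following Python snippet computes g(W*X) for the three activation functions mentioned. The names neuron, sigmoid, and relu, as well as the weight and input values, are illustrative assumptions rather than anything taken from the original article.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Pass positive values through, clamp negative values to zero."""
    return max(0.0, z)

def neuron(weights, inputs, activation):
    """Compute g(W*X): a weighted sum followed by a nonlinearity."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return activation(z)

W = [0.5, -1.0, 2.0]  # illustrative weights (these would normally be learned)
X = [1.0, 2.0, 0.5]   # illustrative inputs; here W*X = -0.5

print(neuron(W, X, sigmoid))    # a value between 0 and 1
print(neuron(W, X, math.tanh))  # a value between -1 and 1
print(neuron(W, X, relu))       # 0.0, because the weighted sum is negative
```

The same weighted sum produces three different outputs; only the choice of g(.) changes.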

Learning in a neural network is all about finding the best weight vector W. For example, in y=mx+c we have 2 weights: m and c. Given the distribution of points in a two-dimensional plane, we look for the values of m and c that are optimal under some criterion, for example that the difference between the predicted y and the actual y is minimal across all data points.
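To make the y=mx+c example concrete, here is a small sketch that finds m and c for a handful of points. It assumes the criterion is the sum of squared differences (ordinary least squares), which the article does not specify; the function name fit_line and the data points are also made up for illustration.

```python
def fit_line(xs, ys):
    """Return the slope m and intercept c that minimize the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]  # made-up points scattered around y = 2x + 1

m, c = fit_line(xs, ys)
print(round(m, 2), round(c, 2))  # 1.97 1.06
```

For a straight line the best weights can be written down in closed form like this; neural networks have too many weights and nonlinearities for that, so they search for the weights iteratively instead.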

The role of neural network "layers": learning mappings from specific inputs to general categories

If the input is an image of a lion and the output is a classification of that image into the lion class, then deep learning is learning a function that maps image vectors to classes. Similarly, if the input is a sequence of words and the output is whether the sentence expresses positive, neutral, or negative sentiment, then deep learning is learning the mapping from input text to the output classes: neutral, positive, or negative.

How do you achieve the above tasks?

Each neuron is a nonlinear function, and we stack several such neurons in a "layer", where each neuron receives the same set of inputs but learns a different weight vector W. Each layer therefore has a set of learned functions, f1, f2,...,fn, whose outputs are called the hidden layer values. These values are combined again in the next layer, h(f1,f2,...,fn), and so on. Thus, each layer is a composition of the functions of the previous layers (similar to h(f(g(x)))). It has been shown that, given enough neurons, such compositions can approximate any complex nonlinear function.
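The stacking described above can be sketched in a few lines of Python. This is only an illustration: the function names dense_layer and forward, the weights, and the omission of bias terms are all assumptions made to keep the example short.

```python
import math

def dense_layer(weight_rows, inputs, activation):
    """One layer: every row of weights is one neuron, and all neurons see the same inputs."""
    return [activation(sum(w * x for w, x in zip(row, inputs)))
            for row in weight_rows]

def forward(layers, inputs):
    """Compose layers: the outputs of one layer become the inputs of the next."""
    values = inputs
    for weight_rows, activation in layers:
        values = dense_layer(weight_rows, values, activation)
    return values

# Hidden layer with three neurons f1, f2, f3 (illustrative weights, not learned).
hidden = ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], math.tanh)
# Output layer h(f1, f2, f3): a plain weighted sum of the hidden values.
output = ([[1.0, 1.0, 1.0]], lambda z: z)

x = [0.2, 0.4]
y = forward([hidden, output], x)
print(y)  # a single number: h(f1(x), f2(x), f3(x))
```

Adding more entries to the list passed to forward makes the network deeper without changing any other code, which is exactly the h(f(g(x))) pattern.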

Deep learning as interpolation for curve fitting: overfitting challenges and generalization goals

Deep learning pioneer Yann LeCun (creator of convolutional neural networks and Turing Award winner) once tweeted, "Deep learning is not as impressive as you think because it's mere interpolation resulting from glorified curve fitting. But in high dimensions, there is no such thing as interpolation. In high dimensions, everything is extrapolation."

Interpolation is an important method of approximation of discrete functions, which can be used to estimate the approximate value of the function at other points through the value of the function at a finite number of points.
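As a small illustration of that definition, the sketch below estimates sin(1.0) from nine known samples using piecewise linear interpolation. The function name linear_interpolate and the choice of sample points are assumptions for the example; many other interpolation schemes exist.

```python
import math

def linear_interpolate(xs, ys, x):
    """Estimate the function at x from known samples (xs must be sorted ascending)."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the sampled range; that would be extrapolation")
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)  # position of x between the two neighbours
            return ys[i] + t * (ys[i + 1] - ys[i])

xs = [i * math.pi / 8 for i in range(9)]  # 9 sample points on [0, pi]
ys = [math.sin(v) for v in xs]            # known function values at those points

estimate = linear_interpolate(xs, ys, 1.0)
error = abs(estimate - math.sin(1.0))
print(estimate, error)  # the estimate is close to sin(1.0), but not exact
```

Asking for a point outside the sampled range raises an error here on purpose: that case is extrapolation, not interpolation.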

From the perspective of biological interpretation, humans process images of the world by interpreting them layer by layer, from low-level features such as edges and contours to high-level features such as objects and scenes. The composition of functions in neural networks is consistent with this: each composed function learns increasingly complex features of the image. The most common neural network architecture for images is the convolutional neural network (CNN), which learns these features hierarchically, after which a fully connected neural network classifies the image features into different categories.

For example, given a set of data points on a plane, we try to fit a curve by interpolation so that the curve represents, to some extent, the function that generated those data points. The more complex the function we fit (in polynomial interpolation, complexity is determined by the degree of the polynomial), the more closely it fits the given data; however, the less well it generalizes to new data points.

This is where the main challenge of deep learning comes in, commonly referred to as the overfitting problem: fitting the training data as closely as possible at the expense of generalization. Almost all deep learning architectures must deal with this issue in order to learn a general function that performs equally well on unseen data.
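The trade-off can be seen in a small sketch. Assume the true function is y = x and that each of 8 training points carries a small measurement error of +0.1 or -0.1 (a made-up, deterministic noise pattern). A degree-7 polynomial that passes through every point is compared with a simple straight-line fit; the helper names are hypothetical.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the unique polynomial of degree len(xs)-1 that passes through all points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Least-squares straight line: a much simpler model."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return m, mean_y - m * mean_x

def max_error(predict, xs, targets):
    """Largest absolute difference between predictions and targets."""
    return max(abs(predict(x) - t) for x, t in zip(xs, targets))

train_x = [float(i) for i in range(8)]
noise = [0.1 if i % 2 == 0 else -0.1 for i in range(8)]
train_y = [x + e for x, e in zip(train_x, noise)]  # true function y = x, plus noise

test_x = [i + 0.5 for i in range(7)]  # new points between the training points
test_y = test_x[:]                    # their noise-free true values

m, c = fit_line(train_x, train_y)
line = lambda x: m * x + c
poly = lambda x: lagrange_interpolate(train_x, train_y, x)

poly_train = max_error(poly, train_x, train_y)  # zero: the polynomial hits every point
line_train = max_error(line, train_x, train_y)  # about the size of the noise
poly_test = max_error(poly, test_x, test_y)     # large: the polynomial oscillates
line_test = max_error(line, test_x, test_y)     # small: the line generalizes

print(poly_train, line_train)
print(poly_test, line_test)
```

The complex model wins on the training points and loses badly on the new ones, which is the overfitting pattern described above.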

How does deep learning learn? The problem determines the neural network architecture

So, how do we learn this complex function?

It all depends on the problem at hand, which determines the neural network architecture. If we are interested in image classification, we use CNNs. If we are interested in time-dependent predictions or text, we use RNNs (Recurrent Neural Networks) or Transformers. And if we are dealing with a dynamic environment (such as driving a car), we use reinforcement learning.

In addition to this, learning involves dealing with different challenges:

Ensure that the model learns a general function rather than simply memorizing the training data, by using regularization (which helps prevent the trained model from overfitting).

Depending on the problem at hand, select the loss function. Roughly speaking, the loss function is the error function between what we want (the true value) and what we currently have (the current prediction).

Gradient descent is the algorithm used to converge to the optimal function. Choosing the learning rate is challenging: when we are far from the optimum we want to move toward it quickly, and when we are close to it we want to move more slowly, so that we converge to the optimum, ideally the global minimum.

A large number of hidden layers brings the vanishing gradient problem. Architectural changes such as skip connections, together with appropriate nonlinear activation functions, help solve this problem.
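Two of the challenges above, the loss function and the learning rate in gradient descent, can be sketched together. The example assumes mean squared error as the loss and an inverse-time decay schedule for the learning rate; both are illustrative choices, and the function names and hyperparameter values are hypothetical.

```python
def mse_loss(m, c, xs, ys):
    """Mean squared error between predictions m*x + c and the true values."""
    return sum((m * x + c - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(m, c, xs, ys):
    """Partial derivatives of the loss with respect to m and c."""
    n = len(xs)
    grad_m = sum(2 * (m * x + c - y) * x for x, y in zip(xs, ys)) / n
    grad_c = sum(2 * (m * x + c - y) for x, y in zip(xs, ys)) / n
    return grad_m, grad_c

def gradient_descent(xs, ys, initial_rate=0.1, decay=0.001, steps=2000):
    """Repeatedly step against the gradient, with a step size that shrinks over time."""
    m, c = 0.0, 0.0
    for step in range(steps):
        rate = initial_rate / (1.0 + decay * step)  # large steps early, small steps late
        grad_m, grad_c = mse_gradient(m, c, xs, ys)
        m -= rate * grad_m
        c -= rate * grad_c
    return m, c

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # points lying exactly on y = 2x + 1

m, c = gradient_descent(xs, ys)
print(m, c, mse_loss(m, c, xs, ys))  # m close to 2, c close to 1, loss close to 0
```

This loss has a single minimum, so gradient descent finds it reliably; the loss surfaces of deep networks are far less well behaved, which is why the learning rate needs so much care there.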

Neural architecture and big data: Deep learning brings computing challenges

Now we know that deep learning is just the learning of a complex function, and this brings other computational challenges:

To learn a complex function, we need a lot of data; in order to process big data, we need a fast computing environment; therefore, we need an infrastructure that supports that environment.

Parallel processing on the CPU is not enough to compute millions or billions of weights (also known as the parameters of a deep learning model). Learning these weights requires vector (or tensor) multiplications, and this is where GPUs come in handy, because they can carry out many such multiplications in parallel very quickly. Depending on the deep learning architecture, data size, and task at hand, sometimes 1 GPU is enough and sometimes more are needed; data scientists have to make that decision based on known literature or by measuring performance on 1 GPU.
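To get a feel for how quickly the number of weights grows, the sketch below counts the parameters of a fully connected network. The layer sizes are hypothetical (a 28x28 pixel input, two hidden layers, 10 output classes), and the count assumes one bias per neuron in addition to its weights.

```python
def count_parameters(layer_sizes):
    """Count weights plus biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the two layers
        total += n_out         # one bias per neuron (an assumption of this sketch)
    return total

sizes = [784, 512, 512, 10]  # hypothetical network: input, two hidden layers, output
print(count_parameters(sizes))  # 669706
```

Even this small network has well over half a million parameters, and every training step touches all of them.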

With an appropriate neural network architecture (number of layers, number of neurons, nonlinear functions, etc.) and enough data, a deep learning network can, in principle, learn any mapping from one vector space to another. That is what makes deep learning such a powerful tool for machine learning tasks.