This article introduces recent developments in generative AI, notes the industry impact of GPT-5 and AI software engineers, and discusses how advances in AI may affect national competition and individual careers.

Detailed explanation of the principles and technologies of generative AI (1) - neural networks and deep learning

The future is here

There are two recent news items:

Sam Altman revealed details of GPT-5 and publicly declared that the improvement in GPT-5 would be so large that any company underestimating it would be crushed. He also tweeted that OpenAI's products this year will change human history.
According to news about the first AI software engineer, it has performed quite well, showing abilities in overall planning, DevOps, and whole-project scanning, and is not far from a real programmer.

Although all of this was expected, it still feels like it is arriving a little faster than anticipated.

Personally, I think that the current progress of AI technology has at least two impacts:

At the level of national competition, with the help of AI, China's previous major advantage over the United States, the number of outstanding college graduates, no longer exists. Whoever makes fewer mistakes will win.
At the level of individual jobs, the mode of production is expected to change within 2-5 years, and those who do not keep up may be outclassed. For programmers, the sensible course is to learn the underlying principles of AI or take part in AI-related application work.

AI is not a black box

Before I begin, I want to emphasize that AI is not a black box, and all its elements and processes are deterministic and explainable.

I hope to explain the principles of AI in a way that programmers can easily understand. I hope that interested readers will, after reading this article, be able to dispel the myth that AI is complex and a black box, just as I did.

This article starts with neural networks and deep learning, and explains them in three parts:

First, explain the fundamentals of neural networks and deep learning through a very simple network.
Then, walk through a real neural network example to give some hands-on experience.
Finally, raise a common problem that leads to the more complex neural networks introduced in the following article.

Fundamentals of neural networks

Key points in this chapter:

Neural networks solve problems outside the training set by "learning" and storing the general rules of the training set.
Storage is deterministic. Neural networks are divided into multiple layers, each with multiple neurons, and each neuron has N+1 parameters (where N is the number of inputs connected to the neuron). The general rules that are "learned" are stored in these parameters.
The process is also deterministic. "Learning" starts from an arbitrary point in the high-dimensional parameter space formed by the parameters of all neurons (the initial parameter values are random) and gradually descends to the point with the lowest loss, adjusting the parameters step by step. So if the formula for adjusting the parameters is determined, the "learning" process is determined. And it is: when the model is compiled, the adjustment formula for each parameter of each neuron in each layer can be determined in advance using the chain rule for partial derivatives.

▐What can neural networks do

Neural networks and deep learning can solve problems that have no clear rules or whose rules are too complex to write as if/else logic, such as image recognition.

▐Why neural networks can solve complex problems

Complex problems often cannot be enumerated exhaustively (e.g., all the different pictures of cats), so no fixed set of rules applies to every instance. The only way to handle them is to abstract their general laws, characteristics, or patterns.

For example, text embedding works for natural language processing because it does not (and cannot) record every combination of text. Instead, it records the pattern of a single word or token as a vector, which is an abstraction of the text. If the vectors of two tokens are partially similar, this similarity may correspond to features they share in a particular context. Which vector element represents which feature is more like metadata of the language, and we do not need to care.

Suppose the vector representation of king is [0.3, 0.5, ..., 0.9], queen is [0.8, 0.5, ..., 0.2], and prince is [0.3, 0.7, ..., 0.5]. In one context, such as "xx is ruling the kingdom", xx can be filled by king or queen, which may be determined by the second element of the vector, where both have 0.5. In another context, such as "xx is a man", xx can be filled by king or prince, which may be determined by the first element, where both have 0.3.

As we will see below, neural networks also have a set of mechanisms for computing and storing these abstracted patterns, so neural networks are well suited to solving complex problems involving abstract concepts.

▐How to implement neural networks

Like I said above, AI is not a black box. Neural networks, like ordinary programs, are made up of data and computation, and their storage and computation are very deterministic.

For example, let's solve a linear regression problem where we derive y from x. It is observed that y=3 when x=1 and y=5 when x=2, and we want to know the value of y when x=3. The problem itself is relatively simple, and we can even directly calculate the regression equation as y = mx + b, where m = 2 and b = 1, which gives y=7 when x=3.
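
As a quick check of the arithmetic above, the slope and intercept of the line through the two observed points can be computed directly. This is only a small sketch in plain Python; the function name fit_line_through_two_points is made up for illustration.

```python
def fit_line_through_two_points(p1, p2):
    """Return (m, b) of the line y = m * x + b passing through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # slope
    b = y1 - m * x1            # intercept
    return m, b


# The two observations from the text: (x=1, y=3) and (x=2, y=5).
m, b = fit_line_through_two_points((1, 3), (2, 5))
y_at_3 = m * 3 + b
print(m, b, y_at_3)  # 2.0 1.0 7.0
```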

(Linear Regression Link: http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm)

The question is: how, and through what kind of process, would a neural network "learn" this, given a single input x, only one layer, only one neuron in that layer, and the neuron's output y?

Network 1

The storage structure of a neural network

Before we understand the learning process, let's take a look at the storage structure of a neural network, i.e., what a model is.

The model of [Network 1] above is composed of its structure + its parameters.

Its structure is a single input, with only one layer, and this layer has only one neuron.

Its parameters are the data learned by each neuron during training. There is only one neuron here, and it has two parameters, m and b.

The trained parameters (m=2, b=1 in the example above) are an abstraction of the pattern in the training data set (x=[1,2,...], y=[3,5,...] in the example above) for this particular problem. With this pattern, you can also calculate the corresponding y for other values of x.

In a dense network, each node is connected to all the nodes in the previous layer, so all the nodes in the previous layer are inputs to each node in the next layer. The output of each node is the result of applying the activation function to the weighted sum of its inputs: output = activation_function(W * X + b), where X is the input vector [x1, x2, ..., xn], W is the weight vector [w1, w2, ..., wn] with one weight per input, b is the bias constant, and activation_function is the activation function. When the activation function is linear, output = W * X + b. A node with n inputs has n+1 parameters in total: [w1, w2, ..., wn, b]. The layers of the network are chained together, and results are passed forward until the final output layer, so any parameter of any neuron in any layer affects the final result. These parameters are therefore the memory cells of the neural network, used to store the abstracted patterns.
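
The formula output = activation_function(W * X + b) can be sketched in a few lines of plain Python. This is a minimal illustration rather than how a framework implements it; the function names neuron_output and dense_layer are made up for this example, and the default activation is the linear (identity) one.

```python
def neuron_output(inputs, weights, bias, activation=lambda z: z):
    """Compute activation(W * X + b) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted sum
    return activation(z)


def dense_layer(inputs, layer_weights, layer_biases, activation=lambda z: z):
    """In a dense layer, every neuron receives all of the inputs."""
    return [
        neuron_output(inputs, weights, bias, activation)
        for weights, bias in zip(layer_weights, layer_biases)
    ]


# [Network 1] with trained parameters m=2, b=1: one input, one neuron.
print(neuron_output([3], [2], 1))  # 7

# A layer of two neurons, each connected to both inputs.
print(dense_layer([1, 2], [[1, 0], [0, 1]], [0, 0]))  # [1, 2]
```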

As another example, [Network 2] in the figure below is a network with two inputs and two layers, with two nodes in the first layer and one node in the second layer, and the activation function of each layer is linear. The total number of parameters in the first layer = (number of inputs + 1) * number of nodes = (2 + 1) * 2 = 6, the total number of parameters in the second layer = (2 + 1) * 1 = 3, and the total number of network parameters = 6 + 3 = 9. The abstracted patterns are stored in these parameters (w111, w112, b11, w121, w122, b12, w21, w22, b2 in the image below).
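
The counting rule (number of inputs + 1) * number of nodes can be applied layer by layer, with each layer's node count becoming the next layer's input count. A minimal sketch in plain Python, where the helper name count_dense_parameters is an illustrative assumption:

```python
def count_dense_parameters(n_inputs, layer_sizes):
    """Return the parameter count of each dense layer.

    Each node has one weight per input plus one bias, so a layer has
    (inputs + 1) * nodes parameters.
    """
    counts = []
    fan_in = n_inputs
    for n_nodes in layer_sizes:
        counts.append((fan_in + 1) * n_nodes)
        fan_in = n_nodes  # this layer's outputs feed the next layer
    return counts


# [Network 2]: two inputs, two nodes in the first layer, one in the second.
per_layer = count_dense_parameters(2, [2, 1])
print(per_layer, sum(per_layer))  # [6, 3] 9
```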

Network 2

Activation functions

Why do we need an activation function? Let's see what happens without a nonlinear activation function, as in [Network 2] above.

As you can see, each layer of [Network 2] is a linear function, and because a composition of linear functions is itself linear, the final result is still a linear function. That is, y_hat = w21*h1 + w22*h2 + b2 = w21*(w111*x1 + w112*x2 + b11) + w22*(w121*x1 + w122*x2 + b12) + b2, which reduces to y_hat = w1'*x1 + w2'*x2 + b'. A linear function (a line, a plane, ...) cannot fit or abstract the complexity of real problems.
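
The reduction to a single linear function can be checked numerically. The sketch below evaluates [Network 2] layer by layer and compares it with the collapsed form y_hat = w1'*x1 + w2'*x2 + b'. The parameter values are arbitrary integers chosen only for illustration.

```python
def network2(x1, x2, p):
    """Two linear layers, evaluated layer by layer."""
    h1 = p["w111"] * x1 + p["w112"] * x2 + p["b11"]
    h2 = p["w121"] * x1 + p["w122"] * x2 + p["b12"]
    return p["w21"] * h1 + p["w22"] * h2 + p["b2"]


def collapse(p):
    """Expand the brackets: return (w1', w2', b') of the equivalent single layer."""
    w1 = p["w21"] * p["w111"] + p["w22"] * p["w121"]
    w2 = p["w21"] * p["w112"] + p["w22"] * p["w122"]
    b = p["w21"] * p["b11"] + p["w22"] * p["b12"] + p["b2"]
    return w1, w2, b


# Arbitrary illustrative parameters (not taken from the article).
params = {"w111": 1, "w112": 2, "b11": 3,
          "w121": -1, "w122": 4, "b12": 0,
          "w21": 2, "w22": -3, "b2": 5}

w1, w2, b = collapse(params)
print(w1, w2, b)  # 5 -8 11

# Both forms give the same output for every input tried.
for x1, x2 in [(0, 0), (1, 1), (2, -3)]:
    assert network2(x1, x2, params) == w1 * x1 + w2 * x2 + b
```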

The main purpose of introducing activation functions is to introduce nonlinear functions, so that the model can fit/abstract complex real-world problems.

Common activation functions are:

Among them, ReLU is used to eliminate negative values, and Sigmoid is used to introduce curvature.
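
For reference, the two functions named here have simple standard definitions: ReLU(z) = max(0, z) and Sigmoid(z) = 1 / (1 + e^(-z)). A minimal sketch in plain Python, written without any overflow handling for very large negative inputs:

```python
import math


def relu(z):
    """ReLU: negative values become 0, positive values pass through."""
    return max(0.0, z)


def sigmoid(z):
    """Sigmoid: squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


print(relu(-2.0), relu(3.0))  # 0.0 3.0
print(sigmoid(0.0))           # 0.5
```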

Deep learning process

Take [Network 1] above as an example. Suppose the model has been created as in [Network 1] and the dataset has been prepared and divided into a training set (e.g. x=[1,2], y=[3,5]) and a validation set (e.g. x=[3], y=[7]).

Deep learning trains a neural network in the following steps:

Step 1: Initialize parameters: There is only one neuron here, and it has two parameters, m and b. Just assign random values; assume m=-1 and b=5.

Step 2: Calculate the output y_hat with the training set. Because a multilayer network is computed from the input layer forward to the output layer, this is called forward propagation:
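
For [Network 1], forward propagation is just y_hat = m * x + b applied to every training example. A minimal sketch with the initial values from Step 1 (the function name forward is an illustrative choice):

```python
def forward(xs, m, b):
    """Forward pass of [Network 1]: one linear neuron applied to each input."""
    return [m * x + b for x in xs]


x_train = [1, 2]
m, b = -1, 5  # the random initial values assumed in Step 1
y_hat = forward(x_train, m, b)
print(y_hat)  # [4, 3]
```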

Step 3: Use the loss function to evaluate the quality of the output (its difference from the actual values). MSE (mean squared error) is used here:

Why use mean squared error: it suits the problem at hand (the farther a point is from the expected regression line, the greater the loss). Squaring the error removes the sign of the difference and amplifies large differences. Taking the mean averages the difference over the examples in the training set.
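
A minimal sketch of the MSE calculation for the running example, where the training labels are y=[3,5] and the forward pass with m=-1, b=5 produced y_hat=[4,3]:

```python
def mean_squared_error(y_true, y_pred):
    """MSE = (1/N) * sum((y - y_hat)^2)."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


loss = mean_squared_error([3, 5], [4, 3])
print(loss)  # ((3-4)^2 + (5-3)^2) / 2 = 2.5
```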

Step 4: Calculate the gradient of each parameter with the backpropagation algorithm.

The gradient of a single parameter is how much a change in that parameter changes the loss. We calculate the gradient of each parameter, adjust each parameter in the direction that makes the loss smaller, and in this way look for the minimum loss point in the high-dimensional parameter space.

In order to calculate the gradient, we need to first understand the loss surface.

In the example, there are only two parameters, m and b, and if m and b are used as the x and y axes, and the loss function result is used as the z axis, the loss surface can be obtained. (If there are more parameters, a high-dimensional parameter space is formed, and the principle is similar)

Our goal is to find the m and b at which the loss function is smallest, which here is m=2 and b=1. Concretely, we move the parameters from their current values, m=-1 and b=5, toward m=2 and b=1.

How do you calculate the offset? This involves calculating the gradient, which is mathematically calculating the partial derivative.

Suppose we are currently at a high point on the surface and want to move to a low point. We compute the gradient in the x and y directions (i.e., for the m and b parameters) separately. The gradient at the current position is the partial derivative, geometrically the slope of the tangent, and it tells us how much a change in a single parameter changes the loss. Note that the gradient points uphill, so to descend we move in the opposite direction.

Fortunately, for a differentiable loss function, there is a formula for the partial derivative with respect to each parameter. For example, if the model is y_hat = m * x + b and the loss function is L = (1/N) * Σ(y - y_hat)^2, then the partial derivative with respect to m is ∂L/∂m = (1/N) * Σ -2x(y - y_hat) = (1/N) * Σ -2x(y - (m * x + b)). This is only to illustrate the approach; the framework provides this capability, so we generally do not need to work it out by hand.
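
To make the formula concrete, the sketch below evaluates the partial derivatives for the running example (x=[1,2], y=[3,5], m=-1, b=5). The derivative with respect to b is not spelled out in the text; it follows from the same differentiation as ∂L/∂b = (1/N) * Σ -2(y - y_hat). The function name mse_gradients is an illustrative choice.

```python
def mse_gradients(xs, ys, m, b):
    """Partial derivatives of L = (1/N) * sum((y - (m*x + b))^2)."""
    n = len(xs)
    grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
    grad_b = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
    return grad_m, grad_b


grad_m, grad_b = mse_gradients([1, 2], [3, 5], m=-1, b=5)
print(grad_m, grad_b)  # -3.0 -1.0
```

Both gradients are negative, so increasing m and b slightly lowers the loss, which matches the fact that the target m=2 is above the current m=-1.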

Not only that, gradients also propagate through the network by the chain rule. Taking [Network 2] as an example, to calculate the gradient of the first-layer parameter w111, we can derive its formula layer by layer through backpropagation: ∂L/∂w111 = ∂L/∂y_hat * ∂y_hat/∂h1 * ∂h1/∂w111.
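
The chain-rule product can be checked against a numerical estimate. The sketch below makes two simplifying assumptions that are not in the text: the loss is the squared error of a single example, L = (y - y_hat)^2, and the parameter values are arbitrary integers. Under those assumptions the three factors are ∂L/∂y_hat = -2(y - y_hat), ∂y_hat/∂h1 = w21, and ∂h1/∂w111 = x1.

```python
def forward(x1, x2, p):
    """Forward pass of [Network 2] with linear activations."""
    h1 = p["w111"] * x1 + p["w112"] * x2 + p["b11"]
    h2 = p["w121"] * x1 + p["w122"] * x2 + p["b12"]
    y_hat = p["w21"] * h1 + p["w22"] * h2 + p["b2"]
    return h1, h2, y_hat


def loss(x1, x2, y, p):
    """Squared error for one example (an assumption made for this sketch)."""
    return (y - forward(x1, x2, p)[2]) ** 2


def grad_w111_chain_rule(x1, x2, y, p):
    """dL/dw111 = dL/dy_hat * dy_hat/dh1 * dh1/dw111."""
    _, _, y_hat = forward(x1, x2, p)
    dL_dyhat = -2 * (y - y_hat)
    dyhat_dh1 = p["w21"]
    dh1_dw111 = x1
    return dL_dyhat * dyhat_dh1 * dh1_dw111


def grad_w111_numeric(x1, x2, y, p, eps=1e-6):
    """Central finite difference, used only to cross-check the chain rule."""
    plus = dict(p, w111=p["w111"] + eps)
    minus = dict(p, w111=p["w111"] - eps)
    return (loss(x1, x2, y, plus) - loss(x1, x2, y, minus)) / (2 * eps)


# Arbitrary illustrative parameters and one example (x1=1, x2=1, y=10).
params = {"w111": 1, "w112": 2, "b11": 3,
          "w121": -1, "w122": 4, "b12": 0,
          "w21": 2, "w22": -3, "b2": 5}

analytic = grad_w111_chain_rule(1, 1, 10, params)
numeric = grad_w111_numeric(1, 1, 10, params)
print(analytic)  # -8
assert abs(analytic - numeric) < 1e-4
```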

Step 5: Update each parameter by gradient:

Because the gradient points uphill, we move in the opposite direction to descend toward a lower loss.

Taking m as an example, the new value of m is m = m - learning_rate * ∂L/∂m.

Pay attention to the choice of learning_rate: too large and the update will overshoot the lowest point; too small and each change is tiny, so training takes too long. However, frameworks generally provide algorithms that adjust the step size automatically, so we usually do not need to pay much attention to it.
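
One update step for the running example can be sketched as follows. The gradient values -3.0 and -1.0 are the MSE partial derivatives worked out by hand for x=[1,2], y=[3,5] at m=-1, b=5, and learning_rate=0.1 is an arbitrary value assumed for illustration.

```python
def gradient_descent_step(m, b, grad_m, grad_b, learning_rate):
    """Move each parameter against its gradient."""
    new_m = m - learning_rate * grad_m
    new_b = b - learning_rate * grad_b
    return new_m, new_b


m, b = gradient_descent_step(m=-1, b=5, grad_m=-3.0, grad_b=-1.0, learning_rate=0.1)
print(round(m, 3), round(b, 3))  # -0.7 5.1
```

After this single step, m has moved from -1 toward the target value of 2. The value of b has moved slightly away from its target of 1, yet the overall loss still decreases; repeated steps eventually bring both parameters to their targets.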

Step 6: The training process of the network is iterative, and in each training cycle (epoch), the network will gradually adjust its parameters by gradient descent. The process repeats multiple epochs until the model's performance is no longer significantly improved or a specific stopping condition is met.

Each training cycle trains on the training set and validates on the validation set. By comparing the results on the training set and the validation set, it is possible to find out whether the model is overfitting.
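
Putting the steps together, the whole training process for [Network 1] fits in a short loop. This is a minimal sketch in plain Python, not what a framework does internally; the learning rate of 0.1 and the fixed count of 5000 epochs are assumptions chosen so that this tiny example converges, and a real run would normally use a stopping condition instead.

```python
def mse(xs, ys, m, b):
    """Mean squared error of the line y_hat = m * x + b on a dataset."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(x_train, y_train, x_valid, y_valid, m, b,
          learning_rate=0.1, epochs=5000):
    """Full-batch gradient descent; records (train_loss, valid_loss) per epoch."""
    history = []
    n = len(x_train)
    for _ in range(epochs):
        # Backward pass: partial derivatives of the MSE loss.
        grad_m = sum(-2 * x * (y - (m * x + b))
                     for x, y in zip(x_train, y_train)) / n
        grad_b = sum(-2 * (y - (m * x + b))
                     for x, y in zip(x_train, y_train)) / n
        # Update: step against the gradient.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
        history.append((mse(x_train, y_train, m, b),
                        mse(x_valid, y_valid, m, b)))
    return m, b, history


m, b, history = train([1, 2], [3, 5], [3], [7], m=-1, b=5)
print(round(m, 3), round(b, 3))  # 2.0 1.0
```

Here both the training loss and the validation loss fall toward zero, so there is no sign of overfitting in this toy example.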

About the global optimal solution and the local optimal solution:

If the loss surface is complex, for example with multiple low-lying areas, moving step by step from a given starting point may not reach the lowest one, so a local optimal solution is obtained instead of the global optimal solution.

About training batches: if the training set is very large, a training step does not necessarily use the whole training set, but instead randomly selects a batch of data. A common method is Stochastic Gradient Descent (SGD), which selects one example at a time. Its advantages are that it greatly reduces the amount of computation and makes it easier to escape a local optimal solution; its disadvantage is that it is not very stable. Another option is Mini-Batch Gradient Descent, which selects tens to hundreds of samples at a time. It keeps the advantages of SGD while being more stable, so it is more commonly used.
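
Batch selection can be sketched as follows. This version shuffles the data once and then slices it into consecutive batches, which is one common way of drawing random batches and is an assumption here rather than something the text specifies. The generator name iterate_minibatches and the toy dataset are made up for illustration. Setting batch_size=1 corresponds to the one-example-at-a-time SGD described above.

```python
import random


def iterate_minibatches(xs, ys, batch_size, rng):
    """Yield (x_batch, y_batch) pairs covering the dataset in random order."""
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        yield [xs[i] for i in batch], [ys[i] for i in batch]


rng = random.Random(0)         # fixed seed so the run is repeatable
xs = list(range(10))           # toy inputs
ys = [2 * x + 1 for x in xs]   # toy labels following y = 2x + 1
batches = list(iterate_minibatches(xs, ys, batch_size=4, rng=rng))
print([len(bx) for bx, _ in batches])  # [4, 4, 2]
```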

A case study of deep learning on MNIST datasets

The MNIST dataset is a classic dataset in the field of deep learning. It contains a large number of images of handwritten digits and was a landmark in verifying the effectiveness of deep learning algorithms.

With the theory in the previous chapters, we will show how to use neural networks to solve image classification problems through a practical case.

The following operations use the Jupyter platform and the TensorFlow 2 deep learning framework, but this article focuses on explaining the principles and does not cover the use of the platform and framework in much detail.

Load and observe datasets

When using images for deep learning, we need both the images themselves (usually represented as "X") and the correct labels for those images (usually represented as "Y"). In addition, we need a set of X's and Y's to train the model, and then we need a separate set of X's and Y's to verify the performance of the trained model. Therefore, the MNIST dataset needs to be divided into 4 parts:

x_train: the images used to train the neural network
y_train: the correct labels of the x_train images, used to evaluate the model's predictions during training
x_valid: the images used to validate the model's performance after training
y_valid: the correct labels of the x_valid images, used to evaluate the model's predictions after training

The first step we need to do is to load the dataset into memory and get an overview of the dataset by looking at it.

Loading via Keras API:

Keras API: https://keras.io/

View the training image set x_train and the validation image set x_valid: you can see that they contain 60,000 and 10,000 grayscale images (gray values 0-255) of 28x28 pixels, respectively:

Each image is a two-dimensional array of pixel values, and the following image shows the contents and visualization of the two-dimensional array:

The label y_train is simpler: it is the digit corresponding to each image:

Preprocess the dataset

There is often some pre-processing required before data can be fed into a neural network. This involves flattening image data into one-dimensional vectors, normalizing pixel values between 0-1, and converting labels into a format suitable for classification tasks, typically one-hot encoding.

Flattening:
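
Flattening turns the 28x28 grid into a single vector of 784 values by placing the rows one after another. The sketch below shows the idea in plain Python on a placeholder image; with NumPy arrays the same effect is usually achieved with reshape. The function name flatten and the placeholder pixel values are illustrative assumptions.

```python
def flatten(image):
    """Concatenate the rows of a 2D image into one 1D list."""
    return [pixel for row in image for pixel in row]


# Placeholder 28x28 image: all black except one white pixel.
image = [[0] * 28 for _ in range(28)]
image[0][1] = 255

vector = flatten(image)
print(len(vector))  # 784
```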

Normalization:
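
Because the gray values range from 0 to 255, dividing every pixel by 255 maps them into the range 0-1. A minimal sketch in plain Python, with made-up pixel values (the function name normalize is an illustrative choice):

```python
def normalize(pixels, max_value=255):
    """Scale integer pixel values from [0, max_value] to floats in [0, 1]."""
    return [p / max_value for p in pixels]


print(normalize([0, 51, 255]))  # [0.0, 0.2, 1.0]
```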

One-hot encoding:

Although the labels here are consecutive integers from 0 to 9, do not think of this as a numerical problem. Imagine that we were recognizing pictures of various animals rather than pictures of handwritten digits 0-9.

This is essentially a classification problem, and in a deep learning framework the output of a classification problem is well suited to one-hot encoding: a one-dimensional vector whose length is the total number of classes, with a value of 1 at the position of the correct class and 0 everywhere else.
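
One-hot encoding as described above can be sketched in a few lines. Keras offers a utility for this (keras.utils.to_categorical); the plain-Python function below, whose name one_hot is an illustrative choice, only shows what the result looks like.

```python
def one_hot(label, num_classes=10):
    """Return a vector of zeros with a single 1 at the index of the label."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in the range [0, num_classes)")
    vector = [0] * num_classes
    vector[label] = 1
    return vector


print(one_hot(3))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```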

Create a model

Creating an effective model usually requires a certain amount of exploration or experience. For MNIST datasets, a common starting point is to build a network with the following layers:

784 inputs (corresponding to an input image of 28x28 pixels).
The first layer is the input layer, with 512 neurons, and the activation function is ReLU.
The second layer is a hidden layer, with 512 neurons and the ReLU activation function.
The third layer is the output layer, with 10 neurons (corresponding to 10 numeric categories), which use the Softmax activation function to output a probability distribution.

The activation function ReLU has been described above, and Softmax needs to be introduced here: the Softmax function ensures that the output values of the output layer add up to 1, so they can be interpreted as a probability distribution. This is useful for multi-class classification problems. For example, if the 10 output values of the output layer are [0.9, 0.0, 0.1, 0.0, ..., 0.0], they can be interpreted as a 90% probability of belonging to the first category and a 10% probability of belonging to the third category.
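
The standard definition is softmax(z)_i = e^(z_i) / Σ_j e^(z_j). A minimal sketch in plain Python follows; subtracting the maximum before exponentiating is a common numerical-stability measure and does not change the result. The input values are arbitrary.

```python
import math


def softmax(logits):
    """Convert raw scores into positive values that sum to 1."""
    shift = max(logits)  # for numerical stability only
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(abs(sum(probs) - 1.0) < 1e-9)  # True
```

Larger raw scores receive larger probabilities, so the ordering of the inputs is preserved.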

Create the model and view its summary:

Note the number of parameters here.
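
The parameter count of this network can be worked out by hand with the rule (number of inputs + 1) * number of nodes per layer. A minimal sketch in plain Python, where the helper name count_dense_parameters is an illustrative assumption:

```python
def count_dense_parameters(n_inputs, layer_sizes):
    """Return the parameter count of each dense layer: (inputs + 1) * nodes."""
    counts = []
    fan_in = n_inputs
    for n_nodes in layer_sizes:
        counts.append((fan_in + 1) * n_nodes)
        fan_in = n_nodes  # this layer's outputs feed the next layer
    return counts


# 784 inputs, then layers of 512, 512, and 10 neurons.
per_layer = count_dense_parameters(784, [512, 512, 10])
print(per_layer)       # [401920, 262656, 5130]
print(sum(per_layer))  # 669706
```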

Compile the model

The loss function and the optimizer must be specified when the model is compiled. For multi-class classification problems, the usual choice of loss function is Categorical Crossentropy, which can be characterized by the following reference formula: when the example actually belongs to a class (y_actual=1), the loss is -log(y_hat); otherwise the loss is -log(1-y_hat). This loss function penalizes wrong guesses heavily, with the loss approaching ∞ as a confident prediction turns out to be wrong, so it effectively quantifies the difference between the predicted probability distribution and the actual distribution. The optimizer is responsible for adjusting the network parameters to minimize the loss function.
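
For one-hot labels, the usual textbook form of categorical cross-entropy is L = -Σ y_actual_i * log(y_hat_i), which reduces to -log of the probability predicted for the true class. The sketch below uses that form, which is an assumption about the exact variant meant here; the eps clamp only guards against log(0), and the example probabilities are made up.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """-sum(y_true_i * log(y_pred_i)) for a one-hot y_true."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0, 0, 1]  # the example belongs to the third class
good = categorical_crossentropy(y_true, [0.05, 0.05, 0.9])  # confident and right
bad = categorical_crossentropy(y_true, [0.9, 0.05, 0.05])   # confident and wrong
print(good < bad)  # True
```

The confidently wrong prediction receives a much larger loss than the confidently right one, which is the penalizing behaviour described above.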

Train and observe the accuracy

When training a model, we iteratively update the parameters over multiple training cycles. Each training cycle goes through a full forward pass to compute the output, an evaluation of the result with the loss function, and a backward pass to update the parameters. We observe the accuracy on the training and validation sets to evaluate the model's performance and generalization ability.

Pay attention to the concepts and principles of loss surfaces, gradient descent, and so on; if anything is unclear, review the [Deep learning process] section above.

In the figure below, accuracy is the accuracy on the training set and val_accuracy is the accuracy on the validation set; both are as expected.

Overfitting and solutions

If the model has high accuracy on the training set but low accuracy on the validation set, it may be overfitting. Overfitting means that the model is so complex or so overtrained that it learns the noise in the training data rather than the underlying rules.

On the training set, the loss of the model on the left is very small:

On the validation set, the loss of the model on the left is greater:

What overfitting indicates

As you can see in the image on the left, the model follows the training set almost point by point. As with the human brain, fitting a handful of examples this rigidly shows that the model is only memorizing them; only by finding their laws, characteristics, and patterns does it truly abstract. This also confirms what I said earlier: neural networks solve problems through abstraction.
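
The difference between memorizing and abstracting can be illustrated with a toy comparison. Everything in this sketch is hypothetical: the noisy data loosely follows y = 2x + 1, the "memorizer" is a nearest-neighbour lookup table standing in for an overfitted model, and the "abstraction" is the underlying line.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical noisy training data and clean validation data.
x_train, y_train = [1, 2, 3, 4], [3.2, 4.8, 7.1, 8.9]
x_valid, y_valid = [5, 6], [11.0, 13.0]

lookup = dict(zip(x_train, y_train))


def memorizer(x):
    """Rote memorization: return the label of the nearest training input."""
    nearest = min(lookup, key=lambda seen: abs(seen - x))
    return lookup[nearest]


def abstraction(x):
    """The general rule behind the data."""
    return 2 * x + 1


print(mse(memorizer, x_train, y_train) < mse(abstraction, x_train, y_train))  # True
print(mse(memorizer, x_valid, y_valid) > mse(abstraction, x_valid, y_valid))  # True
```

The memorizer wins on the training set, noise included, but loses badly on the validation set, which is the same pattern of results that signals overfitting.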

Why overfitting occurs

From the perspective of the training data, overfitting indicates that the features of the training set may not be distinct and are hard to abstract, for example when the images to be recognized are blurry or dim.

From the perspective of the model, overfitting is usually caused by the model being too complex or being trained for too long, so that it falls back on rote memorization.

How to resolve overfitting

After years of development in AI technology, there are now relatively mature approaches to overfitting, such as convolutional neural networks and recurrent neural networks, which I will introduce in later articles.

Most of the material in this article comes from the Nvidia online course Getting Started with Deep Learning.

(Getting Started with Deep Learning: https://learn.nvidia.com/courses/course-detail?course_id=course-v1:DLI+S-FX-01+V1)

Author: Code One

Source-WeChat public account: big Taobao technology

Source: https://mp.weixin.qq.com/s/zoDHeleVHwnqinSaFaeK2Q
