In the previous article, we covered the basic concepts and principles of neural networks, their advantages and application scenarios, and a product case: NetEase Youdao AI Translation. If you want to know more about neural networks, you can read my previous article, "8000 Words to Explain the Compulsory "Neural Networks" for AI Product Managers".

After understanding the structure and principles of neural networks, I became curious about something further: how can a neural network, which is after all just an abstract mathematical model, achieve such remarkable results through training on data, and even show something resembling the intelligence of the human brain?

This article shows that behind AI's seemingly "miraculous calculation" there is a training process. It explains how a neural network is trained, and how we can tell whether the training is working.

How to train and optimize an AI neural network model?

In this age of fragmented learning, perhaps few people can settle down to read a long article, but I still hope you will read to the end; I believe it will give you a different and deeper reward. As usual, a structure map of the article is provided at the beginning, so that you can grasp the overall framework before reading.

1. AI neural networks need to be trained before they can come in handy

Readers of the previous article will know that a neural network has two main parts: structure and weights. The structure consists of the neurons and their connections, and the weights are numbers attached to the connections between neurons.

The weights fine-tune the math inside each neuron to produce an output. If the neural network makes a mistake, i.e. the output doesn't match what you expected, it usually means that the weights are not set correctly, and we need to update them so that the network makes better predictions next time.

It sounds simple, but the training process behind it is quite complicated; I put it this way only for ease of understanding. Since we are discussing AI, some technical terms are inevitable, and I will explain them one by one as they come up.

The weights in a neural network determine the strength of the connections between different neurons, and the process of finding the best weights for the structure of the neural network is called optimization.

A neural network is a kind of model, and if we want it to have real "magic calculation" ability, we need to train it with a large amount of data. An untrained model tends to give many wrong answers, which is why so many AI models need to be trained on large amounts of data before they can be released.

Next, we can take the next step and ask, how do computers train and optimize neural networks with data?

Training a neural network cannot escape mathematics: every neuron in a neural network carries a mathematical model. Explaining the training process with nonlinear functions would be rather complicated and hard to follow.

The focus of this article is not on the mathematical model but on how neural networks are trained and optimized, so let's take linear regression as an example to discuss the training and optimization strategy.

2. Training optimization strategy with linear regression as an example

Linear regression is a statistical method used to study the relationship between two or more variables. It is based on the assumption that the observed data points can be best fitted by either a straight line (in two-dimensional space) or a hyperplane (in multidimensional space).

The goal of linear regression is to find the parameters of this line or hyperplane so that the error between the predicted value and the actual observed value is minimized.

From this definition, we can see that in machine learning, linear regression can be used for prediction. By fitting the best straight line through the data points, we can predict continuous values.

For example, if we want to know the relationship between temperature and the number of visitors to a resort, we need past data, and we look for the formula that best fits that historical data. Plotted on a chart, this formula appears as a line that shows the relationship between the two variables.

Once we have this prediction line, we can use it to estimate how many visitors to expect at different temperatures in the coming days, or in different seasons next year, which helps the resort improve its overall operational efficiency.

Let's think back to how the line in the chart is drawn: how does the computer know which line best fits the relationship between the number of visitors and the temperature?

That's where linear regression plays a key role.

In the beginning, the computer draws a random straight line, which is most likely inaccurate. The computer then calculates the distance between the line and each data point and adds these distances up, quantifying the gap between the values on the line and the real data.

The next step is to reduce that gap. The goal of linear regression is to adjust the line so that the error is as small as possible; we train it on historical data so that the line matches the training data.

The line obtained from this training is called the best-fit line, and we can use it to predict how many tourists will appear at any temperature. The result is a straight-line chart that matches the relationship between the number of tourists and the temperature.
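As a rough sketch of what finding the best-fit line means, the Python below fits a straight line to a handful of temperature and visitor figures. The numbers are invented for illustration, `fit_line` and `predict` are hypothetical helper names, and the closed-form least-squares formula is used here instead of the step-by-step adjustment described above; both aim at the same line.

```python
# Hypothetical data: daily temperature (degrees C) and visitor counts.
# These numbers are made up for illustration only.
temps = [10.0, 15.0, 20.0, 25.0, 30.0]
visitors = [120.0, 170.0, 220.0, 270.0, 320.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(temps, visitors)


def predict(temp):
    """Predicted number of visitors at a given temperature."""
    return slope * temp + intercept


print("slope:", slope, "intercept:", intercept)
print("predicted visitors at 22 degrees:", predict(22.0))
```

Because the sample data lies exactly on a line, the fitted line passes through every point; with real data there would be some remaining error at each point.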

In reality, the relationship is rarely so simple. The number of tourists does not depend on temperature alone, and to predict more accurately we may need to consider more features.

For example, if we add a holiday feature, the visualization changes from a 2D plot to a 3D plot, and our best-fit line becomes a best-fit plane. If we add yet another feature, such as whether it is raining, the prediction model becomes more complex and hard to visualize.

So, when we consider more features, we need to add more dimensions to the graph, the optimization problem becomes more complex, and fitting the training data becomes more difficult.

This is where neural networks come in handy. By connecting many simple neurons and weights together, a neural network can learn to solve complex problems, and the best-fit line becomes a complicated multidimensional function.

In practice, when faced with complex predictions such as weather forecasting, AI often performs better than the average person.

3. Use the loss function to represent the error

Now that we understand the optimization strategy, let's be a little more curious: how does the computer know that there is a gap between the predicted data and the actual data, and once it knows, how does it reduce the gap so that the predictions match the actual results as closely as possible?

The difference between the predicted value and the actual value is called the error. When the computer needs to know whether such an error exists, and how large it is, the loss function comes in handy.

The loss function in a neural network is a way to measure the gap between the predicted outcome of a model and the actual outcome. When training a neural network, our goal is to minimize the loss function so that the model fits the data better and thus achieves more accurate expected results.

Common loss functions include mean squared error (MSE), mean absolute error (MAE), cross-entropy loss, hinge loss, log loss, and Huber loss.

These loss functions have their own advantages and applicability in different scenarios, and choosing the right loss function is crucial to improve the performance of the model. In practical applications, we can choose the appropriate loss function according to the characteristics of the data and the requirements of the task.
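To make the difference between loss functions a little more concrete, here is a minimal Python sketch of three of them applied to the same predictions, one of which is badly wrong. The sample values and the `delta` threshold of the Huber loss are assumptions chosen for illustration.

```python
def mse(actual, predicted):
    """Mean squared error: large errors are penalized quadratically."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error: every unit of error costs the same."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def huber(actual, predicted, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta."""
    total = 0.0
    for a, p in zip(actual, predicted):
        err = abs(a - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(actual)


# Made-up values: three perfect predictions and one that is off by 10.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 14.0]

print("MSE:", mse(actual, predicted))
print("MAE:", mae(actual, predicted))
print("Huber:", huber(actual, predicted))
```

The single outlier dominates the MSE much more than the MAE or the Huber loss, which is one reason the choice of loss function depends on the data and the task.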

Take the mean squared error (MSE) as an example: it is the average of the squared differences between the predicted values and the true values. Specifically, if we have n predicted values and the corresponding actual values, MSE is calculated as:

MSE = (1/n) * Σ (y_i - ŷ_i)^2

where y_i is the actual value, ŷ_i is the predicted value, n is the sample size, and Σ denotes summation over all n samples.

The smaller the MSE, the better the model fits the data and the more accurate its predictions. Conversely, a large MSE indicates that the model's predictions are relatively inaccurate.

Therefore, in practical applications, we usually want the value of MSE to be as small as possible to obtain better predictions.
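As an illustration, the sketch below computes the MSE of two candidate lines on the same resort-style data. The data and the two sets of line parameters are invented for the example; the point is only that the line closer to the data gets the smaller MSE.

```python
# Hypothetical temperatures and visitor counts (made up for illustration).
temps = [10.0, 15.0, 20.0, 25.0, 30.0]
visitors = [125.0, 165.0, 225.0, 265.0, 325.0]


def mse(actual, predicted):
    """MSE = (1/n) * sum of (y_i - y_hat_i)^2."""
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n


def line_predictions(slope, intercept, xs):
    """Predictions of the line y = slope * x + intercept for each x."""
    return [slope * x + intercept for x in xs]


# Two candidate lines: a rough first guess and a better-adjusted one.
rough_mse = mse(visitors, line_predictions(5.0, 100.0, temps))
better_mse = mse(visitors, line_predictions(10.0, 20.0, temps))

print("rough line MSE:", rough_mse)
print("better line MSE:", better_mse)
```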

Every model has its limits, and the mean squared error is no exception. MSE is suited to continuous data, especially regression problems. Since MSE works best on regression problems, it is worth having a general understanding of what a regression problem is.

In statistics and machine learning, a regression problem is one in which we predict the value of a continuous variable from other relevant variables: we build a model and, by analyzing data, learn the relationship between the independent variables and the dependent variable.

The linear regression discussed in section 2 of this article, "Training optimization strategy with linear regression as an example", is the case where the relationship between the independent variable and the dependent variable is linear.

Regression problems have a wide range of real-life applications, such as predicting house prices, predicting stock prices, or, as in our example, predicting the number of resort visitors from the temperature.

By analyzing and modeling large amounts of historical data, we can provide valuable predictions for these real-world problems, provided that the data is of high quality and the chosen model suits the problem.

In summary, we use a loss function to represent the error of a model's predictions; taking the mean squared error as an example, the MSE expresses how accurate the predictions of a linear regression model are.

Continuing the resort example from section 2: if the MSE is 10 at first and, after multiple adjustments, ends at 0.1, then the drop from 10 to 0.1 indicates that the adjusted prediction model has become more accurate.

However, we can't just use the value of MSE to determine whether a model is good or bad. This is because, in different application scenarios, we have different requirements for the prediction accuracy of the model.

For example, in scenarios that demand extremely high prediction accuracy, we may consider the model inadequate even if the MSE is only 0.01, while in scenarios with looser accuracy requirements, we may consider the model good enough even if the MSE reaches 0.1.

It is worth noting that even a standardized AI model has to be assessed against the specific problem when it is applied in practice; it should not be copied blindly. Before choosing a model, first be clear about what problem you are solving: only a clear understanding of the essence of the problem leads to the right model.

4. Use backpropagation to reduce errors

The previous section suggested that the loss function can improve the accuracy of an AI model's predictions, which is not entirely true. The loss function does only half of the job; the other half must also be done before the model can truly be trained and tuned.

The other half is backpropagation, short for backward propagation of errors, an important machine learning algorithm.

The core idea is to calculate the network's output error and propagate it backwards through each preceding layer, updating the weights and biases along the way so that the network's predictions move closer to the true target values.

As we've seen in previous articles, in a neural network, the neurons at each layer perform a series of processes and transformations on the input data, and then pass the processed results to the next layer.

This process can be seen as a process of information transfer, and in this process, the weight and bias of the network play a key role.

However, due to the complexity of neural networks, it is difficult to calculate the optimal weights and biases of neural networks directly through mathematical formulas. Therefore, we need to use an iterative approach to gradually optimize these parameters, which is where the idea of backpropagation algorithm originates.

Therefore, to train and optimize the neural network, once the loss function has produced an error value, the backpropagation algorithm feeds this result back to the neurons in the earlier layers and prompts them to adjust. Some neurons contribute more to the error than others: their weights are adjusted more, while the weights of neurons that contribute less are adjusted less. After several layers of feedback and adjustment, the network produces more accurate predictions than before, and the model has been trained and optimized.

The above is the basic principle of backpropagation. Let's take a step further and ask, how does the backpropagation algorithm change the weights and biases of the original neural network?

As we know, the basic idea of backpropagation is to start at the output layer and calculate, layer by layer, how much each weight and bias contributes to the loss function (i.e., the gradient). This gradient is then used to update the weights and biases so as to obtain a lower value of the loss function.

So, we need to answer two questions: 1. What is a gradient? 2. How does a gradient update weights and biases?

Simply put, a gradient is the slope or rate of change of a function at a certain point. More specifically, it represents how the output value of the function changes with respect to the input value. This gradient tells us how to adjust the weights of the network if we want to reduce the value of the loss function. Therefore, we need to calculate the gradient first, and then update the weights of the network.
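The idea of a gradient as a slope can be checked numerically. In the sketch below, the function `f` is an arbitrary example chosen for illustration, and `numerical_gradient` is a hypothetical helper that estimates the slope by nudging the input slightly in both directions.

```python
def f(x):
    """An example function with its minimum at x = 3."""
    return (x - 3.0) ** 2


def numerical_gradient(func, x, h=1e-6):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


slope_left = numerical_gradient(f, 1.0)   # left of the minimum: negative slope
slope_min = numerical_gradient(f, 3.0)    # at the minimum: slope close to zero
slope_right = numerical_gradient(f, 5.0)  # right of the minimum: positive slope

print(slope_left, slope_min, slope_right)
```

The sign of the slope tells us which way to move: where the slope is negative, increasing x lowers the function, and where it is positive, decreasing x lowers it.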

In the backpropagation algorithm, the calculation of the gradient is divided into two stages: forward propagation and backpropagation.

In the forward propagation phase, the network passes the input data through to the output layer, calculating the output of each layer one by one and, finally, the value of the loss function.

In this process, each neuron calculates its own output from the outputs of the previous layer and its activation function, and passes this output to the next layer. These intermediate outputs are kept, because the backward pass needs them; the error itself is measured only at the end, by comparing the network's final output with the target value.
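A forward pass through a very small network might look like the following sketch. The network shape (two inputs, two hidden neurons, one output), the sigmoid activation, and all weight and bias values are assumptions made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One layer: each neuron takes a weighted sum of the inputs,
    adds its bias, and applies the activation function."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


# Hypothetical parameters, chosen only for illustration.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
hidden_b = [0.0, -1.0]
output_w = [[2.0, -2.0]]
output_b = [0.5]

x = [1.0, 1.0]
target = 1.0

hidden = dense(x, hidden_w, hidden_b, sigmoid)            # hidden layer output
prediction = dense(hidden, output_w, output_b, identity)[0]  # network output
loss = (target - prediction) ** 2                        # squared error

print("hidden:", hidden)
print("prediction:", prediction, "loss:", loss)
```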

After the forward propagation is complete, the backpropagation phase begins to calculate the gradient.

Starting from the output layer, each neuron calculates its contribution to the loss function during backpropagation based on its output error and the derivative of the activation function.

This gradient information is then propagated backwards layer by layer until it is passed back to the input layer. In this way, we can get the contribution of each parameter to the loss function, i.e., the gradient of the parameter.

In order to calculate the gradient, we need to use the Chain Rule.

The chain rule is a fundamental law in calculus that describes how the derivatives of a composite function are decomposed into the product of the derivatives of simple functions.

In backpropagation, we can think of the entire neural network as a composite function, where each neuron is a simple function.

Using the chain rule, we can calculate the partial derivatives (i.e., gradients) of the loss function with respect to each weight, and then use these gradients to update the weights of the network.
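For a single neuron the chain rule can be written out by hand. The sketch below assumes a one-input neuron with a sigmoid activation and a squared-error loss (choices made for illustration, with made-up parameter values), computes the gradients by multiplying the local derivatives, and compares them with a numerical estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2


def gradients(w, b, x, y):
    # Forward pass, keeping the intermediate values.
    z = w * x + b
    y_hat = sigmoid(z)
    # Backward pass: multiply the local derivatives (chain rule).
    dloss_dyhat = 2.0 * (y_hat - y)    # derivative of (y_hat - y)^2
    dyhat_dz = y_hat * (1.0 - y_hat)   # derivative of the sigmoid
    dz_dw = x                          # derivative of w*x + b w.r.t. w
    dz_db = 1.0                        # derivative of w*x + b w.r.t. b
    grad_w = dloss_dyhat * dyhat_dz * dz_dw
    grad_b = dloss_dyhat * dyhat_dz * dz_db
    return grad_w, grad_b


# Hypothetical parameter values and a single training example.
w, b, x, y = 0.5, -0.2, 2.0, 1.0
grad_w, grad_b = gradients(w, b, x, y)

# Numerical check: nudge each parameter and watch how the loss changes.
h = 1e-6
numeric_w = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
numeric_b = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)

print("chain rule:", grad_w, grad_b)
print("numerical: ", numeric_w, numeric_b)
```

In a multi-layer network the same multiplication of local derivatives is repeated layer by layer, from the output back towards the input.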

Once the gradient is obtained, how does the computer use the gradient to update the weights and biases?

As the saying goes, magic beats magic, and algorithms deal with algorithms, so we need to apply the gradient with the help of an optimization algorithm in order to optimize the weights and biases effectively.

There are three common optimization algorithms: gradient descent, stochastic gradient descent (SGD), and Adam.
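The difference between the first two algorithms can be sketched on a toy problem: plain gradient descent averages the gradient over the whole dataset before each update, while stochastic gradient descent updates after looking at a single randomly chosen sample. The data, learning rate, step counts, and the one-weight model y = w * x are assumptions for illustration; Adam is left out to keep the sketch short.

```python
import random

# Hypothetical data that follows y = 2x exactly (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def gradient(w, x, y):
    """Derivative of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x


def batch_gradient_descent(steps=200, lr=0.01):
    w = 0.0
    for _ in range(steps):
        # Average the gradient over the whole dataset before each update.
        g = sum(gradient(w, x, y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * g
    return w


def stochastic_gradient_descent(steps=1000, lr=0.01, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        # Use one randomly chosen sample per update.
        i = rng.randrange(len(xs))
        w -= lr * gradient(w, xs[i], ys[i])
    return w


w_batch = batch_gradient_descent()
w_sgd = stochastic_gradient_descent()
print("batch:", w_batch, "stochastic:", w_sgd)
```

Because the sample data here is noise-free, both versions settle on the same weight; with noisy data the stochastic version would typically jitter around it.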

Let's take gradient descent as an example and look at how it works.

Gradient descent is an optimization algorithm commonly used in machine learning and deep learning. The core idea is to iterate along the negative gradient of the objective function to find the point at which the objective function reaches a minimum.

To illustrate this process, we can compare it to a climber making their way down a steep mountain.

Assuming that the mountain is our objective function, we want to find the lowest position (i.e., the minimum value of the objective function). However, the mountain is so steep that we cannot see at a glance where the lowest point is, so we need some tools to help us find it.

To get started, we need an initial position, which can be a randomly selected value or the result of a previous iteration. Then we start moving. At each step we measure our current altitude, which corresponds to calculating the value of the objective function.

Next, we need to determine whether the current position is close enough to the lowest point. To do this, we observe and measure how the terrain changes near our current location. As we approach the lowest point, the terrain gradually flattens out.

We can call this change in terrain the gradient. The gradient points in the direction in which the slope rises most steeply, so the way down is the opposite direction. By constantly measuring the gradient and moving against it, we gradually lower our altitude and get closer to the lowest point.

There is also an important factor to consider during the climb: the step size.

A step size that is too large may cause us to overshoot the lowest point, while a step size that is too small makes progress very slow and may leave us stuck in a local minimum, unable to reach the global minimum.

Therefore, in the gradient descent method, we need to adjust the step size according to the actual situation in order to find the minimum value of the objective function faster.
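The effect of the step size can be seen on a very simple objective. The sketch below minimizes f(x) = x^2 (an example function chosen for illustration, with its lowest point at x = 0) using three different step sizes; the specific values are assumptions.

```python
def gradient_descent(lr, start=10.0, steps=50):
    """Minimize f(x) = x**2, whose gradient is 2*x, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= lr * 2.0 * x  # move against the gradient
    return x


too_small = gradient_descent(lr=0.001)  # barely moves away from the start
reasonable = gradient_descent(lr=0.1)   # ends close to the minimum at 0
too_large = gradient_descent(lr=1.1)    # overshoots further on every step

print("too small:", too_small)
print("reasonable:", reasonable)
print("too large:", too_large)
```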

In short, gradient descent is like a mountain journey in search of the lowest point. By measuring the gradient and moving in the negative direction of the gradient, we gradually decrease the value of the objective function and approach a minimum, ideally the global optimum.

Now, let's try to answer the question again: How does gradient update weights and biases?

We start by defining a loss function that measures the gap between the neural network's predictions and the real results; this is like choosing the mountain and the starting point for the descent. Then we use the backpropagation algorithm to calculate the gradient of the loss function with respect to each weight and bias.

Next, we need to set a learning rate, which determines the step size of each parameter update. Generally speaking, the learning rate should not be too large, or the algorithm may oscillate around the minimum, and it should not be too small, or the algorithm will converge very slowly.

Finally, we update the weights and biases of the neural network based on the calculated gradient and the learning rate.

Specifically, for each weight and bias, we multiply the corresponding gradient by the learning rate and subtract the result from the current value to get the new value. In this way, over many iterations, we gradually find the parameter values that minimize the loss function.
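Put together, the update rule is new value = current value - learning rate * gradient. A minimal training loop for a one-weight, one-bias linear model, using MSE as the loss, could look like this; the data, the learning rate, and the number of iterations are assumptions chosen so that the example converges.

```python
# Hypothetical data following y = 2x + 1 (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]


def mse_and_gradients(w, b):
    """Return the MSE loss and its gradients with respect to w and b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errors) / n
    grad_w = sum(2.0 * e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(2.0 * e for e in errors) / n
    return loss, grad_w, grad_b


w, b = 0.0, 0.0          # initial weight and bias
learning_rate = 0.05     # step size of each update
loss_history = []

for _ in range(2000):
    loss, grad_w, grad_b = mse_and_gradients(w, b)
    loss_history.append(loss)
    # Update rule: new value = current value - learning rate * gradient
    w = w - learning_rate * grad_w
    b = b - learning_rate * grad_b

print("w:", w, "b:", b)
print("first loss:", loss_history[0], "last loss:", loss_history[-1])
```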

Through this series of steps, the gradients are used to optimize the weights and biases of the neural network, and the backpropagation algorithm, with the help of the gradients, reduces the error in the model's predictions.

In the end, we can see that the accuracy of the AI's predictions becomes higher and higher after the AI model is trained and optimized.

5. Data fitting problems of neural networks

Sometimes backpropagation adapts a neural network to its data too well, so that the network picks up coincidental relationships in large data sets. These relationships are not causal in the real world; they arise from quirks of the dataset or from randomness in the training process.

For example, "bananas and fires". According to the data, when the price of bananas rises, the incidence of fires also increases.

However, this does not mean that there is a causal relationship between bananas and fires. In fact, there is no necessary connection between the two events. This is a typical example of a coincidence in big data: there is no causal relationship, but the data appears to show one.

So, even after we train an AI model, the results may not be what we hoped for and may even be laughable, which is why we need to pay attention to data fitting problems in AI.

Data fitting problems can be divided into overfitting and underfitting, and each problem has different reasons behind it and requires different solutions.

Of course, there are other different classifications of data fitting, but this article mainly introduces overfitting and underfitting.

1. Overfitting

Overfitting is when the model performs very well on the training data, but does not perform well on new, unseen data.

This is usually because the model is too complex: it learns the noise and unrepresentative features in the training data, relies too heavily on the details of the training data, and ignores the general patterns in the data.

To better understand the impact of overfitting problems in practical applications, let's assume that we use a mathematical model to predict student achievement.

In data training, we can find from historical data that there is a certain positive correlation between students' height and grades. So we trained a simple linear regression model with height as the independent variable and grades as the dependent variable. After training, we found that the model performed very well on the training set, and the predicted performance was highly consistent with the actual performance.

But even without making any predictions, we know that there is no direct relationship between a student's grades and height. So, when we apply this model to new student data, the accuracy of the predictions drops dramatically, and the predictions may even be completely wrong.

In this example, our model has fitted a pattern that holds only in the training data, overemphasizing the impact of height on student achievement and ignoring other potential factors, such as attitude towards study and level of effort.

As a result, when faced with new student data, the predictive performance of the model is greatly compromised, because the new data may have a different distribution of features from the training data. Therefore, during training, we need to check whether the model is overfitting in order to avoid problems later.

To identify overfitting, we typically divide datasets into training sets, validation sets, and test sets.

The training set is used to train the model, the validation set is used to tune the model parameters for optimal performance, and the test set is used to evaluate the performance of the model on unknown data. By comparing the performance of the model on these three datasets, we can determine if the model has an overfit problem.
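A simple way to produce the three sets is to shuffle the data once and cut it into pieces. In the sketch below, the 70/15/15 proportions, the fixed random seed, and the `split_dataset` helper are assumptions for illustration; the text above does not prescribe particular proportions.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the samples and split them into train, validation and test sets."""
    shuffled = samples[:]                      # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # whatever remains
    return train, val, test


# Hypothetical (input, target) pairs, made up for illustration.
samples = [(float(i), 2.0 * i + 1.0) for i in range(100)]
train, val, test = split_dataset(samples)

print(len(train), len(val), len(test))
```

If the error on the training set is much lower than the error on the validation or test set, that gap is a warning sign of overfitting.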

So, how do we solve the problem of overfitting?

To solve the problem of overfitting, we can increase the amount of data, simplify the model, apply regularization, or use cross-validation.

[Increase the amount of data]

As the name suggests, this means bringing in more data to help the model capture the underlying patterns, thereby reducing the risk of overfitting. In practice, however, obtaining large amounts of high-quality data is not always feasible.

[Simplify the model]

This means reducing the complexity of the model by choosing fewer parameters or a simpler model structure, such as fewer hidden layers or nodes in the neural network. This approach reduces the model's dependence on the training data, thereby reducing the risk of overfitting. However, an oversimplified model may lose some useful information, which hurts its performance.

[Regularization]

Regularization adds extra terms to the model's loss function to limit the size of the model's parameters and prevent them from growing too large. Commonly used regularization techniques include L1 regularization and L2 regularization. L1 regularization tends to drive some parameters to zero, which amounts to feature selection. L2 regularization, on the other hand, penalizes the squared values of the parameters, which keeps them small and more evenly spread. Regularization can help us reduce the risk of overfitting while maintaining model performance.
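The "extra term" can be sketched as a penalty added to the ordinary loss. In the example below, MSE is used as the base loss, and the weight values and the regularization strength `lam` are made-up assumptions; `regularized_loss` is a hypothetical helper name.

```python
def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def l1_penalty(weights):
    """L1: sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """L2: sum of the squared values of the weights."""
    return sum(w * w for w in weights)


def regularized_loss(actual, predicted, weights, lam=0.1, kind="l2"):
    """Base loss plus lam times the chosen penalty on the weights."""
    penalty = l2_penalty(weights) if kind == "l2" else l1_penalty(weights)
    return mse(actual, predicted) + lam * penalty


# Two hypothetical models that fit the data equally well (zero MSE),
# one with small weights and one with large weights.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 3.0]
small_weights = [0.5, -0.5]
large_weights = [5.0, -5.0]

print(regularized_loss(actual, predicted, small_weights, kind="l2"))
print(regularized_loss(actual, predicted, large_weights, kind="l2"))
print(regularized_loss(actual, predicted, small_weights, kind="l1"))
print(regularized_loss(actual, predicted, large_weights, kind="l1"))
```

With the penalty in place, the model with large weights gets a higher loss even though both fit the data equally well, so training is pushed towards smaller weights.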

[Cross-validation]

Cross-validation is an effective way to assess a model's ability to generalize. It divides the dataset into multiple subsets, then uses different subsets for training and validation, and finally combines the results of each subset to obtain the final evaluation metrics. Cross-validation can help us identify overfitting issues and select appropriate model parameters.
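One common form of this idea is k-fold cross-validation. The sketch below splits sample indices into k folds, uses each fold once for validation, and averages the scores. The `k_fold_splits` helper, the toy targets, and the deliberately trivial "model" (it always predicts the mean of its training targets) are assumptions for illustration.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Hypothetical targets, made up for illustration.
ys = [float(i) for i in range(10)]

scores = []
for train_idx, val_idx in k_fold_splits(len(ys), 3):
    # "Train": this trivial model just remembers the mean of the training targets.
    mean_train = sum(ys[i] for i in train_idx) / len(train_idx)
    # "Validate": MSE of that constant prediction on the held-out fold.
    fold_mse = sum((ys[i] - mean_train) ** 2 for i in val_idx) / len(val_idx)
    scores.append(fold_mse)

average_score = sum(scores) / len(scores)
print("per-fold MSE:", scores)
print("average MSE:", average_score)
```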

2. Underfitting

Underfitting is a phenomenon in which a neural network performs poorly on both the training data and new data. This is often because the model is too simple to capture the key features and patterns in the data.

For example, if we use a neural network with only one layer to fit a complex nonlinear relationship, the model will probably fail to capture the patterns in the data, giving poor results in both training and testing. It is like asking a primary school student to solve a university calculus problem: there is a high probability that they will not be able to give the right answer.

Let's continue with the example of predicting student achievement to explain underfitting in detail.

Suppose we have a dataset of students' grades, but this time our model is too simple: it considers only study time and ignores other factors that may affect grades, such as students' prior knowledge, family background, course difficulty, and test format.

Such a model will underfit: we cannot accurately predict student achievement from study time alone.

The main characteristics of underfitting are high bias and low variance.

[High bias]

There is a large gap between the predictions of an underfit model and the true values, and the gap is systematic: it persists however long the model is trained. This is often caused by a model that is too simple to capture the complex relationships in the data.

For example, in a regression problem, if a linear regression model is used to process data with nonlinear relationships, the model cannot accurately describe the relationship, resulting in predictions that deviate from the actual values.
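This case is easy to reproduce. The sketch below fits the best possible straight line to data that actually follows y = x^2 (made-up data for illustration) and compares its training error with that of a model of the right shape.

```python
# Hypothetical data that follows y = x**2 exactly.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]


def fit_line(xs, ys):
    """Least-squares straight line through the points: (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


slope, intercept = fit_line(xs, ys)
linear_mse = mse(ys, [slope * x + intercept for x in xs])
quadratic_mse = mse(ys, [x ** 2 for x in xs])

print("best line:", slope, intercept)
print("linear model MSE on training data:", linear_mse)
print("quadratic model MSE on training data:", quadratic_mse)
```

Even the best straight line leaves a sizeable error on the very data it was fitted to, which is the signature of underfitting.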

[Low variance]

The predictions of an underfit model change little when it is trained on different samples of the data, because the model is too simple to react to the details of any particular sample. As a result, its error on the training data and its error on the test data are similar, and both are high. This is the opposite of overfitting, where the training error is small but the test error is large.

Let's dig a little deeper: why does underfitting occur?

Underfitting usually occurs in the following situations:

Insufficient model complexity: Use an overly simplistic model, such as a linear model, to fit data with nonlinear relationships.
Insufficient features: Important features in the data are not taken into account, resulting in the model not being able to accurately predict the target variable.
Insufficient training: The model has not gone through enough learning iterations on the training set to adapt well to the data.
Noise interference: The noise in the data is too large, making it difficult for the model to distinguish real signals from noise.
Insufficient sample size: The amount of training data is too small to capture the overall data distribution.

Underfitting is something we must watch for when training a model, as it leads to poor performance in real-world applications and can ultimately waste the team's previous efforts.

So, how do we solve the problem of underfitting?

Once we know the cause of underfitting, the key to solving it is to increase the complexity of the model so that it can better capture the relationships and features in the data. At the same time, care should be taken not to tip over into overfitting, where excessive complexity reduces generalization performance.

Specifically, we can increase the complexity of the model, add more features, reduce the regularization parameter, or add training data.

[Increase the complexity of the model]

If the model is too simple to capture complex patterns in the data, consider using a more complex model: add more layers or nodes, introduce more features, or change the structure of the model. For example, you can try more complex models such as polynomial regression or support vector machines for nonlinear problems, or, in a neural network, increase the number of hidden layers or nodes.

[Add more features]

We can make the model more complex by adding more features to better fit the data. These features can be linear or nonlinear combinations of existing features, or they can be new features derived from other data sources. In the example of a student's test scores, consider adding more factors that may affect the grades, such as family background, student interests, etc.

[Reduce the regularization parameter]

Regularization is a way to prevent overfitting, but in some cases, excessive regularization can lead to underfitting. As a result, the regularization parameter can be lowered appropriately to allow the model to be more flexible in adapting to the training data.

[Add training data]

Underfitting is often related to insufficient training data. Collecting more training data can improve the model's learning ability and generalization ability, thereby reducing underfitting.

Therefore, once we know how to train a neural network with data, we also need to watch for the data fitting problems that arise during training; in other words, AI training also needs process supervision.

When we find overfitting, we can address it by increasing the amount of data, simplifying the model, applying regularization, or using cross-validation. When we find underfitting, we can address it by increasing the complexity of the model, adding more features, lowering the regularization parameter, or adding training data.

6. Summary of the full text

If you have read this far, you clearly have an extraordinary interest in and enthusiasm for AI, and I would like to offer my sincere thanks. If, like me, you are interested in the principles behind how AI models are trained and optimized, I believe this article will help you.

To finish, here is a brief summary of the article; if you did not understand everything the first time through, you can still take something away from the summary.

Neural networks are at the heart of AI, and they need to be trained before they can actually work. This article discussed the training process of neural networks, the related optimization strategies, and the data fitting problems that neural networks face.

Like human learning, a neural network needs to learn from a large amount of input data to suit a specific task. We took linear regression as an example to show how an optimization strategy improves the performance of the model during training.

During training optimization, we use the loss function to represent the error between the model's prediction and the actual result. The smaller this error, the better the model's performance. By adjusting the parameters of the model, we try to minimize the loss function and make the predictions of the model more accurate.

The introduction of the loss function allows us to quantify the error of the model and thus provide direction for optimization. Through methods such as gradient descent, we can find the parameter values that minimize the loss function, and thus improve the accuracy of the model. Backpropagation plays a key role in this process.

Backpropagation passes the error backwards by calculating the gradient of the loss function with respect to the model parameters. This means that we can update the parameters of the model according to the direction of the error, so that the model gradually approaches the optimum.

However, even with careful training, neural networks can still face fitting issues when processing data. Data fitting problems show up as overfitting and underfitting, and each needs its own solutions.

Training a neural network is a complex and delicate process. I hope that through this article you can understand these concepts and methods, apply them skillfully, and make better use of neural networks to solve practical problems.

As the "Attack by Stratagem" chapter of Sun Tzu's The Art of War says: "Know the enemy and know yourself, and you will not be imperiled in a hundred battles; know yourself but not the enemy, and you will win one and lose one; know neither the enemy nor yourself, and you will be imperiled in every battle."

Knowledge is the premise of success. I quote this not to treat AI as an enemy, but to say that we should know and understand AI, so that we can coexist with it effectively and do more valuable things with its help.

We are each one of a vast number of stars, moving forward on the wave of AI. What exactly AI is, is a question we must figure out.

This article was originally published by @果酿 on Everyone is a Product Manager and is not allowed to be reproduced without the permission of the author.

Image from Unsplash, based on the CC0 license.

The views in this article represent only the author's own; the Everyone is a Product Manager platform only provides information storage services.
