
Summary of machine learning and deep learning interview knowledge points

Author: Machine Vision Knowledge Recommendation Officer

Author丨Oldpan

Source丨oldpan blog

Editor丨Jishi platform

Guide

This article summarizes questions and important knowledge points that are commonly encountered in autumn-recruitment interviews. It is suitable for a quick pre-interview review and for consolidating basic knowledge.

Preface

Recently it has been autumn-recruitment season. This article collects some important knowledge points that Lao Pan sorted out while looking for a job. The content is fairly miscellaneous, part of it collected from the Internet and lightly organized, so it is suitable for a quick review before an interview and also for consolidating basic knowledge. In addition, I recommend a book called Hundred Faces of Machine Learning (百面机器学习), published in August 2018, which covers many machine learning and deep learning questions that come up in interviews. It is well suited to algorithm engineers preparing for machine learning and deep learning interviews, and of course also good for consolidating the basics. Books worth reading when you have time:

  • The Mathematics for Programmers series, good for reviewing basic linear algebra and probability theory.
  • Deep Learning (the "flower book"): a comprehensive summary and explanation of the fundamentals.
  • Statistical Learning Methods (Li Hang): a concise summary that covers the core material.
  • Pattern Recognition and Machine Learning: clearly organized, explains machine learning from a Bayesian perspective.
  • Machine Learning (the "watermelon book" by Zhou Zhihua): works well as a textbook; broad but not deep.

Hundred Faces of Machine Learning (a book you can never finish flipping through)

Common general knowledge questions

  • L1 regularization keeps a few weights large and drives most weights to exactly 0, producing sparse weights; L2 regularization pushes weights toward 0 without making them exactly zero, producing smooth (small) weights.
  • In the AdaBoost algorithm, all misclassified samples have their weights updated by the same ratio;
  • Boosting and Bagging both combine multiple classifiers by voting, but Boosting weights each classifier according to its accuracy, while Bagging simply gives all classifiers the same weight;
  • The EM algorithm does not guarantee that the global optimal value will be found;
  • In SVR with a Gaussian kernel, a very large kernel width leads to underfitting, while a very small width easily overfits;
  • Both PCA and LDA are classic dimensionality reduction algorithms. PCA is unsupervised, i.e., training samples do not need labels, while LDA is supervised, i.e., training samples need labels. PCA removes redundant dimensions from the original data, while LDA finds a direction such that, after the original data is projected onto it, different classes are separated as much as possible.

PCA is an orthogonal projection whose idea is to maximize the variance of the original data along each dimension of the projection subspace. Suppose we want to project N-dimensional data onto an M-dimensional space (M < N). Following PCA, we first compute the covariance matrix of the N-dimensional data, then find the eigenvectors corresponding to the M largest eigenvalues; these M eigenvectors form the basis of the projection space. LDA projects the data so that the intra-class variance is smallest and the inter-class variance is largest. As shown in the figure below, there are two kinds of projection: in the left one the red and blue data overlap, while in the right one the red and blue data are just separated. LDA's projection is like the one on the right, separating different classes as much as possible while keeping data of the same class as compact as possible.
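As a quick illustration of the procedure above, here is a minimal NumPy sketch of PCA (centering, covariance, top-M eigenvectors); the function name and arguments are just illustrative, not code from the original post.

import numpy as np

def pca(X, M):
    X_centered = X - X.mean(axis=0)            # center each feature
    cov = np.cov(X_centered, rowvar=False)     # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1][:M]      # indices of the M largest eigenvalues
    W = eigvecs[:, order]                      # projection basis (N x M)
    return X_centered @ W                      # data projected to M dimensions

X = np.random.randn(100, 5)
print(pca(X, 2).shape)  # (100, 2)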


PCA and LDA

  • Reference link: Comparison of PCA and LDA

KNN (K-nearest neighbors)

There is a lot to know about the K-nearest neighbor algorithm, such as the steps of the algorithm, its application areas, and precautions for its use, but many people are not very clear about the precautions. This section answers that question and walks through the precautions of the K-nearest neighbor algorithm as well as its advantages and disadvantages.

  • Considerations for the K-nearest neighbor algorithm

The main precaution when using the K-nearest neighbor algorithm is that, since distance is used as the metric, all features must be on the same scale; otherwise the distance computation will be dominated by the features with the largest magnitude. When standardizing the data, also note that the training set and the test set must be standardized with the same parameters (as in the sketch below). There are generally two reasons for this. First, standardization can be regarded as part of the algorithm: since every data point is shifted by one number and divided by another, these two numbers should be applied equally to all data. Second, the training dataset is already limited, and the new samples at prediction time are even fewer; if the new sample is a single data point, its mean is itself and its standard deviation is 0, which is not reasonable at all.
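A minimal sketch of that point, assuming NumPy arrays: the mean and standard deviation are estimated on the training set only and reused for the test set.

import numpy as np

X_train = np.random.randn(200, 3) * 5 + 10
X_test = np.random.randn(10, 3) * 5 + 10

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma   # same mu/sigma as the training set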

  • What are the advantages of the K-nearest neighbor algorithm?

The advantages of the K-nearest neighbor algorithm show up in four aspects. First, KNN is an online technique: new data can be added to the dataset directly without retraining. Second, it is simple and easy to implement. Third, it has high accuracy and tolerates outliers and noise well. Fourth, KNN inherently supports multi-class classification, unlike the perceptron, logistic regression, and SVM.

  • What are the disadvantages of the K-nearest neighbor algorithm?

The disadvantage of the basic K-nearest neighbor algorithm is that every time it classifies a new point it performs a global computation over all samples, which is expensive for datasets with a large sample size. KNN is also prone to the curse of dimensionality: in high-dimensional space distances become very large, and when the classes are imbalanced the prediction bias is large. The choice of k depends on experience or cross-validation; it can also be found by grid search. The larger k is, the greater the bias of the model and the less sensitive it is to noisy data; a very large k may cause underfitting. The smaller k is, the greater the variance of the model; a very small k causes overfitting.

Two-dimensional Gaussian kernel functions

If you were asked to write a Gaussian blur function, how would you write it?

import numpy as np

def gaussian_2d_kernel(kernel_size=3, sigma=0):
    """Return a normalized 2D Gaussian kernel (as used for Gaussian blur)."""
    kernel = np.zeros([kernel_size, kernel_size])
    center = kernel_size // 2

    # default sigma rule when sigma is not given
    if sigma == 0:
        sigma = ((kernel_size - 1) * 0.5 - 1) * 0.3 + 0.8

    s = 2 * (sigma ** 2)
    sum_val = 0
    for i in range(0, kernel_size):
        for j in range(0, kernel_size):
            x = i - center
            y = j - center
            kernel[i, j] = np.exp(-(x ** 2 + y ** 2) / s)
            sum_val += kernel[i, j]
            # the constant factor 1/(np.pi * s) cancels out after normalization
    return kernel / sum_val

Training sampling methods

  • Cross-validation
  • Leave-one-out
  • Bootstrap: sampling with replacement, so duplicate samples may be drawn

K-means and GMM: principles, differences, and application scenarios

Does k-means converge?

  • You can see here https://zhuanlan.zhihu.com/p/36331115
  • You can also see Hundred Faces of Machine Learning, pages 93 and 102

How to do kmeans on multiple computers

The idea is as follows: first distribute the data to n machines and make sure the k initial centers are the same on every machine. After one iteration, we get k * n new means, which are gathered on one machine. Because the initialization was identical, the means line up in the same order across machines, so for each cluster we take a weighted average of the n means belonging to that cluster, then send the updated centers back to every machine for the next iteration.
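A rough sketch of the aggregation step described above, assuming each of the n machines reports its k local means together with the number of points assigned to each cluster; names and shapes are illustrative.

import numpy as np

def merge_means(local_means, local_counts):
    # local_means: (n_machines, k, dim); local_counts: (n_machines, k)
    weighted = (local_means * local_counts[..., None]).sum(axis=0)
    totals = local_counts.sum(axis=0)[:, None]
    return weighted / np.maximum(totals, 1)   # new global means, shape (k, dim)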

KNN algorithm and process

Selection of K values:

  • If K is small, the model is more complex and prone to overfitting; the estimation error of learning increases, and the prediction becomes very sensitive to the nearest neighbor instances.
  • A large K reduces the estimation error of learning, but the approximation error increases, and training instances far from the input also affect the prediction, causing errors; the model complexity decreases as k grows.
  • In applications, the k value is generally taken as a relatively small value, and the cross-validation method is usually used to select the optimal k value.

The choice of K in KNN is crucial to the classification result: too small a K makes the model too complex, while too large a K blurs the classification. So how do you choose K? Some people use cross-validation, some use Bayesian methods, and some use the bootstrap. Distance measurement is another issue; the most common choice is Euclidean distance. But is this distance really universal? Pattern Classification points out that Euclidean distance is sensitive to translation, which seriously affects the result, so it is important to choose a distance metric that is insensitive to known transformations (e.g., translation, rotation, scale). The book proposes using tangent distance instead of the traditional Euclidean distance.

The difference between unsupervised and supervised learning

Supervised:

  • Perceptrons
  • K-nearest neighbor method
  • Naive Bayes
  • decision tree
  • Logistic regression
  • Support vector machines
  • Boosting methods
  • Hidden Markov model
  • Conditional random field

Unsupervised:

  • Clustering-kmeans
  • SVD singular value decomposition
  • PCA principal component analysis

Generative models: LDA, KNN, Gaussian mixtures, Bayes, Markov models, deep belief networks. Discriminative models: SVM, NN, LR, CRF, CART.

The difference between logistic regression and SVM

Logistic regression is LR. When LR predicts, it outputs the probability that the sample is positive, obtained by mapping w^T x to [0, 1] with the sigmoid function: when w^T x is a large positive number (which can be seen as far from the decision boundary) the probability approaches 1, and when w^T x is very negative (also far from the decision boundary) it approaches 0. In LR, that is the only place where "distance from the decision boundary" matters: in solving for the parameter w there is no notion of distance to the boundary at all, and all samples are treated equally. The difference from the perceptron is that LR uses the distance to the decision boundary to give the prediction a visible confidence; the perceptron has no such consideration and judges only by sign. SVM goes one step further and, while solving for the parameters, ignores points that are too far from the decision boundary. Both LR and the perceptron overfit easily; only SVM, by adding the L2 norm and following the structural risk minimization strategy, addresses overfitting. To sum up:

  • The perceptron never introduces the concept of "distance" to the hyperplane; it only cares which side of the hyperplane a point lies on;
  • LR introduces distance, but there is no notion of distance when training the model to find its parameters; distance only appears in the final prediction stage to characterize the confidence of the classification;
  • SVM uses the concept of distance in two places: first, when finding the hyperplane parameters, it focuses only on points within a certain distance of the hyperplane and ignores all others; those points of interest are called "support vectors". Second, when predicting a new sample, as with LR, distance represents confidence.

Logistic regression by itself only solves binary classification problems; for multi-class problems softmax is used. Related reference links:

  • https://blog.csdn.net/maymay_/article/details/80016175
  • https://blog.csdn.net/jfhdd/article/details/52319422
  • https://www.cnblogs.com/eilearn/p/9026851.html

Bagging, Boosting, and boosting trees

  • Bagging reduces generalization error by combining several models: several different models are trained separately, and then all the models vote on the output for each test example. The reason model averaging works is that different models usually do not make exactly the same errors on the test set. Training sets are drawn from the original sample set: in each round, N training samples are drawn from the original set with the bootstrap method (in each training set, some samples may be drawn multiple times while others may never be selected). K rounds of sampling yield k training sets that are independent of each other.
  • Bagging is a parallel ensemble learning algorithm. The idea is simple: each time, draw with replacement, under a uniform distribution, a dataset of the same size as the original dataset (sample points can repeat), then build a classifier on each resulting dataset and combine the classifiers. For classification, the k models vote; for regression, the mean of the models' outputs is taken as the final result.
  • Boosting is a family of algorithms that promote weak learners to strong learners. The sample distribution used in each round of Boosting is different: each iteration increases the weights of the samples misclassified in the previous iteration, so the model focuses more and more on hard-to-classify samples in later iterations. It is a process of continuous learning and continuous improvement, which is the essence of the boosting idea. After the iterations, the base classifiers from all iterations are combined, so how to adjust the sample weights and how to combine the classifiers are the key questions.

Differences between Bagging and Boosting:

  • 1) Sample selection: Bagging: training sets are drawn with replacement from the original set, and the training sets of different rounds are independent of each other. Boosting: the training set does not change from round to round, but the weight of each example is adjusted according to the classification results of the previous round.
  • 2) Sample weights: Bagging: uniform sampling, every sample has equal weight. Boosting: sample weights are continually adjusted according to the error; the larger the error, the greater the weight.
  • 3) Prediction functions: Bagging: all prediction functions have equal weight. Boosting: each weak classifier has its own weight, and classifiers with small classification error get larger weights.
  • 4) Parallel computation: Bagging: the prediction functions can be generated in parallel. Boosting: the prediction functions can only be generated sequentially, because each model's parameters depend on the results of the previous round.

Bagging is short for Bootstrap Aggregating: models trained on bootstrap samples are averaged, which reduces the variance of the model. Bagging methods such as Random Forest are inherently parallel algorithms and have this effect. Boosting is an iterative algorithm: each iteration re-weights the samples according to the predictions of the previous iteration, so as iterations continue the error becomes smaller and smaller and the bias of the model keeps decreasing. Boosting combines many weak classifiers into one strong classifier: weak classifiers have high bias and the strong classifier has low bias, so boosting mainly reduces bias; variance is not its main concern. Bagging, on the other hand, averages many strong (or even over-strong) classifiers: each individual classifier has low bias, and the bias stays low after averaging; each classifier is strong enough to overfit (i.e., has high variance), and averaging reduces this variance. A representative Bagging algorithm is Random Forest. Notes on the Random Forest algorithm:

  • There is no need for pruning in the process of building a decision tree.
  • The number of trees in the entire forest and the characteristics of each tree need to be set artificially.
  • When constructing a decision tree, the choice of split nodes is based on the minimum Gini coefficient.

In the Random Forest chapter of our machine learning course I once wrote this formula on the whiteboard: p = 1 - (1 - 1/N)^N. It gives the probability that a given sample is selected at least once when generating one decision tree's training set, which is about 63.2% when N is large enough. In short, about 63.2% of the samples (counted without repetition) end up in a given bootstrap training set, while about 36.8% of the samples may not appear in it at all; this follows from the expectation of the binomial distribution. A random forest is a classifier containing multiple decision trees, and its output class is the mode of the classes output by the individual trees. The randomness of a random forest lies in two places: the training samples of each tree are chosen randomly, and the set of candidate split attributes at each node is also chosen randomly. With these two sources of randomness, the random forest tends not to overfit. In other words, a random forest is a forest built in a random way: it consists of many decision trees, the training samples assigned to each tree are random, and the split-attribute set of each node in each tree is random as well.

SVM

Related notebooks can also be seen here in addition to cs231n.

https://momodel.cn/workspace/5d37bb9b1afd94458f84a521?type=module

Convex sets, convex functions, convex optimization

There are relatively few interviews, so if you are interested, you can take a look:

https://blog.csdn.net/feilong_csdn/article/details/83476277

Why is image segmentation in deep learning encoded and then decoded?

Downsampling is a means, not an end:

  • It reduces memory usage and computation: a smaller feature map needs less memory and fewer operations;
  • It increases the receptive field, so the same 3x3 convolution can extract features over a larger region of the image. A large receptive field is very important for segmentation; with a small receptive field it is hard to do multi-class segmentation and the result is very coarse;
  • It provides several additional downsampling branches of different degrees, which makes it convenient to fuse multi-scale features; multi-level semantic fusion makes the classification more accurate.

Briefly, the theoretical significance of downsampling is that it increases robustness to small perturbations of the input image, such as translation and rotation, reduces the risk of overfitting, reduces the amount of computation, and increases the size of the receptive field. Related link: Why is image segmentation in deep learning encoded and then decoded?

The difference between (global) average pooling and (global) maximum pooling

  • Maximum pooling preserves texture features
  • Average pooling preserves the overall data characteristics
  • Global average pooling has the role of positioning (see Zhihu)

The "most important" features such as the edges are extracted by the maximum pooling, while the features extracted by the average pooling are more smoothly. For image data, you can see the difference. While both are used for the same reason, I think Max Pooling is better suited for extracting extreme features. Averaging pooling sometimes fails to extract good features because it will all be factored in and the average value is calculated, which may not work well for object detection type tasks but one motivation for using averaging pooling is that each spatial location has detectors for the desired features, and by averaging each spatial location, its behavior is similar to the prediction of different translations of the average input image (a bit like data increase). Instead of using the traditional fully connected layer for CNN classification, Resnet directly outputs the spatial mean of the feature map from the last mlp transformation layer as the category confidence of the merging layer through the global average, and then inputs the resulting vector to the softmax layer. In contrast, Global average is more meaningful and explainable because it forces correspondence between feature and category, which can be achieved by using the more robust local modeling of the network. In addition, fully connected layers are prone to overfitting and rely heavily on dropout regularization, which itself acts as regularization, preventing overfitting of the overall structure.

  • https://zhuanlan.zhihu.com/p/42384808
  • https://www.zhihu.com/question/335595503/answer/778307744
  • https://www.zhihu.com/question/309713971/answer/578634764

The role of fully connected layers and their relationship to 1x1 convolutions


  • Fully connected (FC) layers act as the "classifier" in a convolutional neural network. If the convolutional, pooling, and activation layers map the raw data into a hidden feature space, the fully connected layer maps the learned "distributed feature representation" to the sample label space. In practice, a fully connected layer can be realized by convolution: an FC layer whose previous layer is also fully connected can be converted into a convolution with a 1x1 kernel, and an FC layer whose previous layer is a convolutional layer can be converted into a global convolution with an hxw kernel, where h and w are the height and width of the previous layer's output; global average pooling is also commonly used to replace the fully connected layer.

Then, the main functions of 1*1 convolution are as follows:

  • Dimensionality reduction (in channels). For example, convolving a 500x500 feature map with 100 channels with 20 filters of size 1x1 produces an output of size 500x500x20.
  • Adding nonlinearity. A 1*1 convolution after a convolutional layer adds a nonlinear activation on top of the representation learned by the previous layer, improving the expressive power of the network; it can also be seen as turning a simple linear transformation into a linear combination across feature maps, achieving a higher degree of feature abstraction. The gain comes from this increase in abstraction across channels, not merely from adding another activation function.
  • It can reduce or increase the number of parameters, increase the depth of the network, and aggregate features across channels.
  • It can be used in place of a fully connected layer

See the answer to this question https://www.zhihu.com/question/56024942/answer/369745892

See the answer to this question https://www.zhihu.com/question/41037974/answer/150522307
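A small sketch of the FC/1x1 equivalence mentioned above, assuming PyTorch; the layer sizes are made up, and the weights of the linear layer are simply copied into the 1x1 convolution.

import torch
import torch.nn as nn

fc = nn.Linear(512, 10)
conv1x1 = nn.Conv2d(512, 10, kernel_size=1)
conv1x1.weight.data = fc.weight.data.view(10, 512, 1, 1).clone()
conv1x1.bias.data = fc.bias.data.clone()

x = torch.randn(1, 512)
out_fc = fc(x)
out_conv = conv1x1(x.view(1, 512, 1, 1)).view(1, 10)
print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True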

The difference between concat and add(sum).

For two inputs with the same number of channels followed by a convolution, add is equivalent to concat with the same convolution kernel shared by the corresponding channels. In more detail: since the convolution kernel of each output channel is independent, we can just look at a single output channel. Suppose the channels of the two inputs are X1, X2, ..., Xc and Y1, Y2, ..., Yc, and the kernel channels are K1, ..., K2c. Then a single output channel of concat is (* denotes convolution):

Z_concat = X1 * K1 + ... + Xc * Kc + Y1 * K(c+1) + ... + Yc * K(2c)

while a single output channel of add is:

Z_add = (X1 + Y1) * K1 + ... + (Xc + Yc) * Kc = X1 * K1 + ... + Xc * Kc + Y1 * K1 + ... + Yc * Kc

Therefore, add amounts to adding a prior: when the two inputs can be assumed to have the property that "feature maps of corresponding channels are semantically similar" (perhaps not entirely rigorous), add can be used instead of concat, which saves parameters and computation (concat costs twice as much as add). The pyramid in FPN [1] raises the resolution of the feature map that has the lowest resolution but the strongest semantics, and add can be used for that; if concat were used, the larger channel count of the small-resolution map would make the computation much more expensive. https://www.zhihu.com/question/306213462/answer/562776112

  • Changing concat to sum worked much better for me. Both are forms of feature fusion; what is the essential difference? I don't have a principle for choosing, I just try both (in fact I prefer sum, since it is more economical).
  • I have done experiments similar to ASPP with pyramid-style dilated convolution fusion, and the final results were better than concat, but I don't know how to explain why.
  • I have also seen papers where concat beats sum; it probably depends on the dataset and so on.
  • Summing different features loses some of them again; if you concat directly and let the later layers learn, it should be better, since more features are available.

How to change the SSD to FasterRCNN

SSD classifies directly, while Faster R-CNN first decides whether a region is background and then classifies it: one does direct fine classification, the other does coarse classification followed by fine classification.

The principle of backpropagation

For the principle of backpropagation, see the BP derivation in CS231n, including how gradients propagate through Jacobians.

The difference between GD, SGD, and mini-batch GD

There are corresponding chapters in the Hundred Faces book.

Bias and variance

There's a good article about it, and it's also in that electronic version of CNNbook.

  • http://scott.fortmann-roe.com/docs/BiasVariance.html
  • Generalization error can be decomposed into bias squared + variance + noise
  • Bias measures the gap between the expected prediction of the learning algorithm and the true result, and characterizes the fitting ability of the algorithm itself
  • Variance measures the change in performance caused by changes in a training set of the same size, and characterizes the effect of data perturbations
  • Noise expresses the lower bound on the expected generalization error achievable by any algorithm on the current task, and characterizes the difficulty of the problem itself
  • The more thoroughly a model is trained, the smaller the bias and the larger the variance; the generalization error has a minimum somewhere in between
  • Large bias with small variance means underfitting, while small bias with large variance means overfitting

Why is there a gradient explosion and how to prevent it

Multilayer neural networks often have cliff-like structures in their loss surface, caused by the product of several large weights. Near such steep cliff structures, a gradient update can change the parameter values drastically, often jumping over the cliff structure entirely. See the flower book, p. 177.

Distributed training, multi-card training

http://ai.51cto.com/art/201710/555389.htm

https://blog.csdn.net/xs11222211/article/details/82931120#commentBox

Precision and recall and PR curves

This is a good explanation (TP vs. FP and ROC curves):

  • https://segmentfault.com/a/1190000014829322

Precision is the ratio of correctly classified positive samples to the number of samples the classifier judged to be positive. Recall is the ratio of correctly classified positive samples to the number of truly positive samples. To raise Precision, the classifier needs to predict "positive" only when it is very confident, but being too conservative misses many "uncertain" positive samples and makes Recall very low. To trade off the two values, criteria such as the PR curve, the ROC curve, and the F1 score are used. https://www.cnblogs.com/xuexuefirst/p/8858274.html
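A minimal sketch of these definitions on made-up binary labels:

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

precision = tp / (tp + fp)    # 3 / 4 = 0.75
recall = tp / (tp + fn)       # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)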

Compared with YOLOv1, YOLOv2's recall is greatly improved thanks to anchor boxes, while mAP drops slightly by 0.2.

https://segmentfault.com/a/1190000014829322

https://www.cnblogs.com/eilearn/p/9071440.html

https://blog.csdn.net/zdh2010xyz/article/details/54293298

Dilated (atrous) convolution

Dilated convolutions are generally used with padding: for a 3x3 kernel, if dilation=6 then padding is also set to 6, so the feature map size stays unchanged. The receptive field of such a convolution is larger than that of an ordinary convolution with the same kernel size, and the number of channels can still be changed.

  • In DeepLabv3+, the final ASPP layer concatenates a 1x1 convolution, three 3x3 dilated convolutions, and a feature map obtained by global average pooling and then bilinearly upsampled back to the same size.

Note, however, that although the dilated convolution itself does not increase the amount of computation, it keeps the subsequent feature-map resolution from shrinking, so the computation of the following layers indirectly becomes larger. https://zhuanlan.zhihu.com/p/52476083
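A small sketch of the padding/size relation mentioned above, assuming PyTorch: with a 3x3 kernel, dilation=6 and padding=6 keep the spatial size unchanged.

import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)
conv = nn.Conv2d(16, 32, kernel_size=3, dilation=6, padding=6)
print(conv(x).shape)   # torch.Size([1, 32, 64, 64]) -- same H, W, more channels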

What to do if the data is not good, how to deal with the unbalanced data, and how to deal with only a small number of labels

Specific problems are analyzed on a case-by-case basis.

What to do when overfitting occurs during training

  • Deep learning: general model debugging tips
  • How to diagnose our CNN from the training/validation loss curve
  • Tricks for neural network training (a full summary)
  • What it is like to work with a small dataset in deep learning

If the actual capacity of the model is large enough, the model can fully learn the entire dataset, and overfitting will occur; if adding new data still improves the model's performance, the model is not yet saturated. Expected risk is the expected loss of the model with respect to the joint distribution, while empirical risk is the average loss of the model on the training dataset. By the law of large numbers, as the sample size N tends to infinity, the empirical risk tends to the expected risk. However, when the sample size is small, learning by minimizing the empirical risk alone may not work well, and the phenomenon of "overfitting" occurs. Structural risk minimization is a strategy proposed to prevent overfitting.

https://lilianweng.github.io/lil-log/2019/03/14/are-deep-neural-networks-dramatically-overfitted.html

https://www.jianshu.com/p/97aafe479fa1 (Important)

Regularization

In PyTorch you can only set weight_decay in the optimizer, and currently only L2 regularization is supported; it applies to all parameters of the model, whether weights or biases, including the weight and bias inside BN layers.
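A common workaround, shown here only as a sketch (not part of the original post): put biases and BN parameters into a parameter group with weight_decay=0, so that L2 decay applies only to the convolution/linear weights.

import torch
import torch.nn as nn

def split_params(model):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # biases and all BatchNorm parameters (1-D tensors) are excluded from weight decay
        if name.endswith(".bias") or p.ndim == 1:
            no_decay.append(p)
        else:
            decay.append(p)
    return decay, no_decay

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
decay, no_decay = split_params(model)
optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1, momentum=0.9)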

What are the consequences of the BN layer and L2 regularization together

After batch norm, the scale of the weights no longer matters as much, so the effect of L2 weight decay is not obvious. It has been argued that L2 regularization has no true regularization effect when combined with normalization; instead, regularization affects the scale of the weights and thereby the effective learning rate.

https://www.cnblogs.com/makefile/p/batch-norm.html?utm_source=debugrun&utm_medium=referral

ROIS

Spatial Pyramid Pooling (SPP) can make images of different sizes produce a fixed output dimension. A related question: why does the ROI Pooling in Fast RCNN use max pooling, while classification networks typically end with average pooling? My intuition: to extract global features from a channel of a feature map (e.g., for classification), use average pooling to gather global information; to extract local features (e.g., ROI pooling), use max pooling to pick out the most salient local response in each bin, turn the ROI into a 7x7 grid, and hand it to the subsequent fc layers for classification. Related introduction:

  • SPPNet: introducing spatial pyramid pooling to improve RCNN

Implement image enhancement algorithms by yourself

https://zhuanlan.zhihu.com/p/71231560

Tricks for image classification

  • Amazon: Tricks for Image Classification with CNNs (https://mp.weixin.qq.com/s/e4m_LhtqoUiGJMQfEZHcRA)

Ablation experiments

When an author proposes a scheme that changes multiple conditions/parameters at the same time, the ablation experiments change one condition/parameter at a time to see which one has the greatest impact on the result. The following analogy is excerpted from Zhihu user @人民艺术家: your friend says you look very handsome today, and you want to know how much the hairstyle, the top, and the trousers each contribute. So you try a few different hairstyles, and your friend says you are still handsome; you change your shirt, and your friend says you are no longer handsome. It seems the shirt is what matters.

NMS vs. soft-NMS

https://oldpan.me/archives/write-hard-nms-c

Logistic regression and linear regression

Linear regression finds the optimal parameters by minimizing the mean squared error, estimated by least squares or gradient descent; the objective is

J(w) = 1/(2m) * Σ_i (w^T x_i - y_i)^2

The prototype of logistic regression is log-odds regression: logistic regression and log-odds regression are the same thing, one can be obtained from the other by a simple transformation, and logistic regression estimates its parameters by maximum likelihood. Short summary:

  • Both linear regression and logistic regression are special cases of the generalized linear model
  • Linear regression can only be used for regression problems, while logistic regression is used for classification problems (and can be generalized from binary to multi-class)
  • Linear regression has no link function (or an identity link), while the link function of logistic regression is the log-odds function, a member of the sigmoid family
  • Linear regression uses least squares for parameter estimation, while logistic regression uses maximum likelihood
  • Both can be optimized with gradient descent

Caution:

  • The gradient descent procedure for linear regression is the same as when we train a neural network: first initialize the parameters, then update them with (stochastic) gradient descent (see the sketch after this list): https://zhuanlan.zhihu.com/p/33992985
  • Linear regression and least squares: https://zhuanlan.zhihu.com/p/36910496
  • Maximum likelihood: https://zhuanlan.zhihu.com/p/33349381
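A minimal sketch of logistic regression trained by gradient descent on the negative log-likelihood (i.e., maximum likelihood), on made-up data:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # toy labels
w = np.zeros(2)
b = 0.0
lr = 0.1

for _ in range(500):
    p = sigmoid(X @ w + b)                     # predicted probability of class 1
    grad_w = X.T @ (p - y) / len(y)            # gradient of the negative log-likelihood
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(((sigmoid(X @ w + b) > 0.5) == y).mean())  # training accuracy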

Source Article:

  • https://segmentfault.com/a/1190000014807779
  • https://zhuanlan.zhihu.com/p/39363869
  • https://blog.csdn.net/hahaha_2017/article/details/81066673

For convex functions, a local optimum is the global optimum. Related link: http://sofasofa.io/forum_main_post.php?postid=1000329

http://sofasofa.io/forum_main_post.php?postid=1000322 (logistic classification with cross-entropy)

What is attention and what are the types?

https://zhuanlan.zhihu.com/p/61440116

https://www.zhihu.com/question/65044831/answer/227262160

Linear and nonlinear parts of deep learning

  • Convolution is linear
  • The activation function is nonlinear

Problems with gradient vanishing and gradient explosions


Batch-norm layer

Must-read to avoid pitfalls: Batch Normalization will trip you up whether in training or deployment (a recommended article).

If the batch size is too small, the loss curve oscillates a lot. The batch size is usually chosen as a power of 2; as for why, I could not answer in the interview, and the interviewer later explained it is for hardware computing efficiency (Hai Ge later also mentioned that the number of threads launched during GPU training is a power of 2). The essence of a neural network is to learn the distribution of the data; if the distributions of the training data and the test data differ, the generalization ability of the network drops sharply. Moreover, as training progresses, the changes in each hidden layer shift the input of the following layer, so the distribution of each batch of training data keeps changing, and the network has to fit a different data distribution in every iteration, which increases the difficulty of training and the risk of overfitting.


BN makes the network more resistant to drastic changes in the data.


Note that in convolutional layers, because of the parameter-sharing mechanism, each convolution kernel's parameters are shared by neurons at different spatial locations, so BN normalizes them together per channel (see the implementation for details): https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/batch_norm_layer.html. If the batch size during training is not large, you may prefer not to use BN (as the Mask R-CNN paper suggests). To summarize the theory and practice of Batch Normalization: BN normalizes the input of each layer of the network so that the mean and variance of the input distribution stay within a fixed range, which reduces the internal covariate shift in the network, alleviates gradient vanishing to some extent, and accelerates model convergence. In addition, during training BN uses the mean/variance of each mini-batch as an estimate of the population statistics, which introduces random noise and thus has a certain regularizing effect on the model.

  • https://zhuanlan.zhihu.com/p/34879333
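A simplified sketch of what a BN layer computes in training versus inference (per-channel statistics over N, H, W); this is illustrative, not the exact PyTorch internals.

import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, eps=1e-5):
    # x: (N, C, H, W)
    if training:
        mean = x.mean(axis=(0, 2, 3))
        var = x.var(axis=(0, 2, 3))
        running_mean[:] = (1 - momentum) * running_mean + momentum * mean
        running_var[:] = (1 - momentum) * running_var + momentum * var
    else:
        # inference uses the running statistics accumulated during training
        mean, var = running_mean, running_var
    x_hat = (x - mean[None, :, None, None]) / np.sqrt(var[None, :, None, None] + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]

x = np.random.randn(8, 3, 4, 4)
gamma, beta = np.ones(3), np.zeros(3)
rm, rv = np.zeros(3), np.ones(3)
y = batch_norm(x, gamma, beta, rm, rv, training=True)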

Relationship between BN and Bayes:

  • A brief analysis of Batch Normalization from a Bayesian perspective

How does BN ensure the same mean and var in cross-card (multi-GPU) training

In practice, I've found that cross-card synced BNs are quite useful for performance. In particular, for detection and segmentation tasks, the batch size is small. If the Batch Norm can be synchronized across cards, it is equivalent to increasing the batch size of the Batch Norm so that the mean and variance can be estimated more accurately, so this operation can improve performance.

How to implement SyncBN

The key to cross-card synchronized BN is obtaining the global mean and variance in the forward pass and the corresponding global gradients in the backward pass. The simplest approach is to synchronize once to compute the mean, broadcast it back to every card, and then synchronize again to compute the variance, but that requires two synchronizations. In fact only one is needed, using a very simple trick: rewrite the variance as Var[x] = E[x^2] - (E[x])^2. Then in the forward pass each card only needs to compute the local sums, and a single cross-card reduction of these sums gives the correct global mean and variance (see the sketch below). We also shared this one-pass synchronization in our recent paper, Context Encoding for Semantic Segmentation. With cross-card BN we no longer need to worry about the convergence behavior of multi-card training, because no matter how many cards are used, as long as the global batch size is the same, the result is the same.
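A toy sketch of that one-pass trick: each card only needs to report sum(x), sum(x^2) and the count, and the global statistics follow from Var[x] = E[x^2] - (E[x])^2 (the numbers below are made up).

import numpy as np

per_card_sum = [np.array([10.0]), np.array([14.0])]      # sum of x on each card
per_card_sqsum = [np.array([30.0]), np.array([54.0])]    # sum of x^2 on each card
per_card_count = [4, 4]

total = sum(per_card_count)
mean = sum(per_card_sum) / total
var = sum(per_card_sqsum) / total - mean ** 2            # Var[x] = E[x^2] - (E[x])^2
print(mean, var)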

Why ResNet is easy to use

Occurrence factors:

  • As the network deepens, the optimization function becomes more and more trapped in the local optimal solution
  • As the number of layers in the network increases, the problem of gradient vanishing becomes more serious because the gradient gradually decays during backpropagation

When the error propagates from layer L back to the first hidden layer, it involves the product of many parameters and derivatives, so the error easily vanishes or explodes, leading to poor learning, fitting, and generalization. If the output of some layer has already fitted the desired result well, an extra residual block will not make the model worse, because the block's input is shortcut directly to its output: the block is equivalent to learning an identity mapping, and the two skipped layers only need to fit the residual between the previous layer's output and the target (a minimal block sketch follows).
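A minimal residual block sketch, assuming PyTorch and the simplified case of equal channel counts and stride 1, so the shortcut is a pure identity.

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # the two conv layers only fit the residual

x = torch.randn(1, 64, 32, 32)
print(BasicBlock(64)(x).shape)      # torch.Size([1, 64, 32, 32])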

  • https://zhuanlan.zhihu.com/p/42706477
  • https://zhuanlan.zhihu.com/p/31852747

Disadvantages of Resnet:

In fact, ResNet does not truly solve gradient vanishing; it relies on a strong prior assumption. The layers that really contribute are mostly in the middle, and the deeper layers contribute relatively little (toward the deep end the blocks approach identity mappings), so features are underutilized, and the add operation to some extent hinders the flow of gradients and information.

L1 norm and L2 norm application scenarios

L1 regularization keeps a few weights large and drives most weights to exactly 0, yielding sparse weights, while L2 regularization pushes weights toward 0 without making them exactly zero, yielding smooth weights. https://zhuanlan.zhihu.com/p/35356992

Network weight initialization methods and their formulas

There are currently three types of weight initializations:

  • Initialize everything to 0 -- almost never used
  • Random initialization (uniform or normally distributed)
  • Xavier initialization: according to its author Glorot, a good initialization should keep the variance of activations and of state gradients consistent across layers during propagation (e.g., Var(W) = 2/(n_in + n_out)). It works well for sigmoid/tanh but not for ReLU.
  • He (Kaiming) initialization, with Var(W) = 2/n_in, works well for ReLU.

Initialization, to put it bluntly, constructs a smooth local geometry so that optimization is easier. Xavier initialization analysis:

  • https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/

Suppose you are using the sigmoid function. When the weight values (in absolute value) are too small, the variance of the activations shrinks every time it passes through a layer, the weighted sum at each layer becomes very small, and near 0 the sigmoid is approximately linear, so the DNN loses its nonlinearity. When the weight values are too large, the variance of the activations grows rapidly layer by layer, the outputs of each layer become large and saturate the sigmoid, and the gradients of each layer tend toward 0. Xavier initialization keeps the variance of the input x unchanged after passing through each layer.

  • https://blog.csdn.net/winycg/article/details/86649832
  • https://zhuanlan.zhihu.com/p/57454669

The default weight initialization method in pytorch is Kaiming He's, for example:

Weight initialization in resnet:

for m in self.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)


# Zero-initialize the last BN in each residual branch,
# so that the residual branch starts with zeros, and each residual block behaves like an identity.
# This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
if zero_init_residual:
    for m in self.modules():
        if isinstance(m, Bottleneck):
            nn.init.constant_(m.bn3.weight, 0)
        elif isinstance(m, BasicBlock):
            nn.init.constant_(m.bn2.weight, 0)           

Computing the number of model parameters

def model_info(model):  # Plots a line-by-line description of a PyTorch model
    n_p = sum(x.numel() for x in model.parameters())  # number parameters
    n_g = sum(x.numel() for x in model.parameters() if x.requires_grad)  # number gradients
    print('\n%5s %50s %9s %12s %20s %12s %12s' % ('layer', 'name', 'gradient', 'parameters', 'shape', 'mu', 'sigma'))
    for i, (name, p) in enumerate(model.named_parameters()):
        name = name.replace('module_list.', '')
        print('%5g %50s %9s %12g %20s %12.3g %12.3g' % (
            i, name, p.requires_grad, p.numel(), list(p.shape), p.mean(), p.std()))
    print('Model Summary: %d layers, %d parameters, %d gradients' % (i + 1, n_p, n_g))
    print('Model Size: %f MB parameters, %f MB gradients\n' % (n_p*4/1e6, n_g*4/1e6))           

Convolution computation

It's OK to understand these things.

  • Ordinary convolution
  • Separable (depthwise separable) convolution
  • Fully connected layers
  • Pointwise (1x1) convolution

You can see this article by Lao Pan:

  • How fast can your model run???
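As a small worked example, here is a sketch of counting parameters and multiply-adds for an ordinary convolution (no groups, bias ignored); the layer sizes are made up.

def conv_cost(c_in, c_out, k, h_out, w_out):
    params = c_out * c_in * k * k
    macs = params * h_out * w_out      # each output pixel needs c_in*k*k MACs per output channel
    return params, macs

params, macs = conv_cost(c_in=64, c_out=128, k=3, h_out=56, w_out=56)
print(params, macs)   # 73728 parameters, ~231M multiply-adds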

Multi-label and multi-class

So, how to use softmax and sigmoid to do multi-class classification and multi-label classification?

1. How to use softmax for multi-class and multi-label classification. Suppose the final output of the network is the vector logits = [1, 2, 3, 4], i.e., the output of the last fully connected layer, and there are 4 classes in total. Multi-class with softmax: tf.argmax(tf.nn.softmax(logits)) first converts the logits into a probability distribution with softmax and then takes the class with the largest probability, which gives the same result as tf.argmax(logits). The real role of softmax is in computing the cross-entropy: the label y in the training set is a one-hot vector, and computing the cross-entropy directly between the raw logits = [1, 2, 3, 4] and y would give a value that is far too large and is not the right way; the logits should first be converted into a probability distribution, i.e., compute the cross-entropy between tf.nn.softmax(logits) and y. Of course we can also directly use TensorFlow's softmax_cross_entropy_with_logits, whose argument can be the raw logits, because as the name suggests the method applies softmax internally. Taking the single largest probability as the final result is multi-class classification. We can also take the top few probabilities, or set a threshold and keep every class whose probability exceeds it, as the final set of labels; this realizes multi-label classification with softmax.

2. How to use sigmoid for multi-label classification. Sigmoid is generally not used for multi-class classification but for binary classification: it maps a scalar to [0, 1], and if the value exceeds a probability threshold (usually 0.5) the sample is considered to belong to the class, otherwise not. How then is sigmoid used for multi-label classification? Simply apply a sigmoid to each entry of the logits to decide, independently, whether the sample belongs to each class. Again suppose the final output of the network is logits = [1, 2, 3, 4] with 4 classes. tf.sigmoid(logits) turns each number in logits into a probability in [0, 1]; suppose the result is [0.01, 0.05, 0.4, 0.6]. Then set a probability threshold, say 0.3: every class whose probability exceeds 0.3 is considered a match, so here the sample is judged to match both class 3 and class 4.
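A short sketch of the two schemes with PyTorch ops (the logits are the made-up example above): softmax + argmax for multi-class, per-class sigmoid + threshold for multi-label.

import torch

logits = torch.tensor([1.0, 2.0, 3.0, 4.0])

# multi-class: probabilities sum to 1, take the argmax
probs = torch.softmax(logits, dim=0)
print(probs, probs.argmax().item())          # class index 3

# multi-label: each class gets an independent probability, compare to a threshold
probs_ml = torch.sigmoid(logits)
print((probs_ml > 0.5).nonzero().flatten())  # indices of all classes whose probability > 0.5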

Why should the input of the data be normalized

To eliminate the effect of different scales among the data features, models solved by gradient descent usually need input normalization, including linear regression, logistic regression, support vector machines, and neural networks; decision-tree models, however, generally do not need it.

Why is Naive Bayes high deviation low variance?

First, assume you know the relationship between the training set and the test set. Simply put, we learn a model on the training set and then use it on the test set, and the effectiveness is measured by the error rate on the test set. In many cases we can only assume that the test set and the training set follow the same data distribution, but we cannot access the real test data. How then can we gauge the test error rate when we can only see the training error rate?

Since the training samples are few (or at least not sufficient), the model obtained from the training set is not always truly correct. (Even 100% accuracy on the training set does not mean the model depicts the real data distribution; describing the real data distribution is our goal, not merely fitting the finite training points.) Moreover, in practice training samples often contain some noise, so if you chase perfection on the training set with a very complex model, the model will treat the noise in the training set as genuine features of the data distribution and obtain a wrong estimate of it. In that case it performs terribly on the real test set (this is overfitting). But the model cannot be too simple either, otherwise it cannot characterize a complex data distribution (reflected in a high error rate even on the training set, which is underfitting). Overfitting means the adopted model is more complex than the true data distribution, while underfitting means it is simpler.

Under the framework of statistical learning, when describing model complexity there is the view that Error = Bias + Variance. The error here can be roughly understood as the model's prediction error rate, composed of two parts: the inaccuracy caused by the model being too simple (Bias) and the larger variation and uncertainty caused by the model being too complex (Variance).

This makes Naive Bayes easy to analyze. It simply assumes that the features are independent of each other, which is a heavily simplified model. Therefore, for such a simple model, in most cases the Bias term dominates the Variance term, that is, high bias and low variance.

In practice, in order to keep the error as small as possible, we need to balance the proportion of Bias and Variance when selecting the model, that is, balance over-fitting and under-fitting.

What are the Canny edge detection and boundary detection algorithms?

https://zhuanlan.zhihu.com/p/42122107

https://zhuanlan.zhihu.com/p/59640437

Traditional object detection

Traditional object detection is generally divided into the following steps:

  • Region selection: the image is segmented (clustered) with selective search, and by computing the similarity of adjacent regions about 2000 candidate boxes are finally produced; these also need to be compared with the ground truth to decide positive and negative examples.
  • Feature extraction: the ~2000 regions are converted into feature vectors via SIFT or other feature extraction methods.
  • Classification: the feature vectors are fed into an SVM for classification training, with the background/negative class also included in the classifier's training.

Classical structure:

  • HoG + SVM

Disadvantages of traditional methods:

  • The sliding-window-based region selection strategy is not targeted, the time complexity is high, and the windows are redundant
  • Hand-designed features are not very robust to changes in environmental diversity

Erosion, dilation, opening and closing operations

You can learn the relevant content from OpenCV (third edition); search for erode and dilate.
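A minimal OpenCV sketch of the four operations (the image is synthetic): erosion shrinks bright regions, dilation grows them, opening removes small bright noise, closing fills small holes.

import cv2
import numpy as np

img = np.zeros((100, 100), dtype=np.uint8)
cv2.rectangle(img, (30, 30), (70, 70), 255, -1)   # a filled bright square

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
eroded = cv2.erode(img, kernel)
dilated = cv2.dilate(img, kernel)
opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)    # erode then dilate
closed = cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel)   # dilate then erode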

Some filters

  • https://blog.csdn.net/qq_22904277/article/details/53316415
  • https://www.jianshu.com/p/fbe8c24af108
  • https://blog.csdn.net/qq_22904277/article/details/53316415
  • https://blog.csdn.net/nima1994/article/details/79776802
  • https://blog.csdn.net/jiang_ming_/article/details/82594261
  • High-frequency and low-frequency information in the image, as well as high-pass filters and low-pass filters
  • In an image, the high-frequency information consists of pixels that change drastically, such as edges; apart from the edges, the relatively flat regions where pixel values change gently carry the low-frequency information.
  • A high-pass filter highlights the drastic changes (edges) and removes the low-frequency part, i.e., it acts as an edge extractor. A low-pass filter mainly smooths pixel intensities and is used for denoising and blurring; Gaussian blur is the most commonly used blur (smoothing) filter, a low-pass filter that weakens high-frequency components.

Resize bilinear interpolation

When fusing features in a network, bilinear interpolation is slightly better than transposed convolution, because transposed convolution has a big problem: if its parameters are not configured properly, obvious checkerboard artifacts appear in the output feature map.

  • It should be noted that nearest neighbor interpolation is the worst.

Bilinear interpolation is also divided into two categories:

  • align_corners=True
  • align_corners=False

In general, align_corners=True keeps the corner pixels aligned, while align_corners=False can make the edges protrude. This article explains it well:

  • https://blog.csdn.net/qq_37577735/article/details/80041586

Explanation of the code implementation:

  • https://blog.csdn.net/love_image_xie/article/details/87969405
  • https://www.zhihu.com/question/328891283/answer/717113611
  • See the image here: https://discuss.pytorch.org/t/what-we-should-use-align-corners-false/22663
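A quick check of the two modes with PyTorch's F.interpolate; the input values are arbitrary, and the point is that both give the same shape but different values near the edges.

import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)
up_true = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)
up_false = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
print(up_true.shape, up_false.shape)   # both torch.Size([1, 1, 8, 8]), edge values differ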

Gradient clipping (梯度裁剪)

Gradient clipping is an improvement made to avoid gradient explosion, and should be distinguished from early stopping. (Early stopping is a regularization method: when training a large model with enough capacity to overfit, the training error keeps decreasing over time but the validation error starts rising again, so we simply return the parameter setting with the lowest validation error.) The simplest way to understand clipping is to set a gradient range such as (-1, 1): gradients smaller than -1 are set to -1, and gradients larger than 1 are set to 1.
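Both flavors can be done with PyTorch utilities, sketched below: clipping by value (the (-1, 1) example above) or clipping the global norm of all gradients.

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
loss = model(torch.randn(4, 10)).sum()
loss.backward()

torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
# or clip the global gradient norm instead:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)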

  • https://wulc.me/2018/05/01/%E6%A2%AF%E5%BA%A6%E8%A3%81%E5%89%AA%E5%8F%8A%E5%85%B6%E4%BD%9C%E7%94%A8/
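A minimal PyTorch sketch of both forms of clipping; the tiny model and data here are placeholders, not from the original text.

```python
# Clip gradients after backward() and before optimizer.step().
import torch
import torch.nn.functional as F
from torch.nn.utils import clip_grad_value_, clip_grad_norm_

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = F.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()

# clip by value: every gradient component is forced into [-1, 1]
clip_grad_value_(model.parameters(), clip_value=1.0)
# or clip by norm: rescale the whole gradient if its L2 norm exceeds 5
clip_grad_norm_(model.parameters(), max_norm=5.0)

optimizer.step()
```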

Implement a simple convolution

Convolution is generally implemented via im2col, but in an interview we can simply implement the sliding-window approach, e.g. sliding a 3x3 kernel (filter) over the input. The naive CPU convolution in NCNN's source code is written the same way.

```
/*
 Input:  input[IC][IH][IW]
   IC = input.channels, IH = input.height, IW = input.width
 Kernel: kernel[KC1][KC2][KH][KW]
   KC1 = OC (number of output channels / filters)
   KC2 = IC (each filter has as many channels as the input)
   KH = kernel.height, KW = kernel.width
 Output: output[OC][OH][OW]
   OC = output.channels, OH = output.height, OW = output.width
 With padding = VALID and stride = 1:
   OH = IH - KH + 1
   OW = IW - KW + 1
 i.e. compute OH and OW first, then slide the kernel over the input and
 accumulate the element-wise products for each output position.
*/
for (int ch = 0; ch < output.channels; ch++)            // each output channel / filter
{
    for (int oh = 0; oh < output.height; oh++)          // each output row
    {
        for (int ow = 0; ow < output.width; ow++)       // each output column
        {
            float sum = 0;
            for (int kc = 0; kc < input.channels; kc++)      // sum over input channels
            {
                for (int kh = 0; kh < kernel.height; kh++)
                {
                    for (int kw = 0; kw < kernel.width; kw++)
                    {
                        sum += input[kc][oh + kh][ow + kw] * kernel[ch][kc][kh][kw];
                    }
                }
            }
            // if (bias) sum += bias[ch];
            output[ch][oh][ow] = sum;
        }
    }
}
```

Reference:

  • https://www.cnblogs.com/hejunlin1992/p/8686838.html

The process of convolution

If you look at the source code of PyTorch or Caffe, both convert the convolution computation into matrix operations: im2col first, then GEMM (a minimal sketch follows). https://blog.csdn.net/mrhiuser/article/details/52672824
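A minimal NumPy sketch of the idea, for a single image with stride 1 and no padding (real frameworks vectorize the patch extraction):

```python
# im2col turns every receptive-field patch into a column; one GEMM then
# performs all the dot products of the sliding-window convolution at once.
import numpy as np

def conv2d_im2col(x, w):
    # x: (IC, IH, IW), w: (OC, IC, KH, KW)
    IC, IH, IW = x.shape
    OC, _, KH, KW = w.shape
    OH, OW = IH - KH + 1, IW - KW + 1

    cols = np.empty((IC * KH * KW, OH * OW))
    for oh in range(OH):
        for ow in range(OW):
            cols[:, oh * OW + ow] = x[:, oh:oh + KH, ow:ow + KW].ravel()

    out = w.reshape(OC, -1) @ cols       # GEMM: (OC, IC*KH*KW) x (IC*KH*KW, OH*OW)
    return out.reshape(OC, OH, OW)

x = np.random.randn(3, 8, 8)
w = np.random.randn(4, 3, 3, 3)
print(conv2d_im2col(x, w).shape)         # (4, 6, 6)
```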

How the output of a transposed convolution is calculated

https://cloud.tencent.com/developer/article/1363619
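As a quick check, a minimal PyTorch sketch using the standard output-size relation for transposed convolution (the hyperparameters here are chosen arbitrarily):

```python
# output = (input - 1) * stride - 2 * padding + kernel_size + output_padding  (dilation = 1)
import torch
import torch.nn as nn

deconv = nn.ConvTranspose2d(in_channels=8, out_channels=4,
                            kernel_size=3, stride=2, padding=1, output_padding=1)
x = torch.randn(1, 8, 16, 16)
y = deconv(x)
print(y.shape)   # torch.Size([1, 4, 32, 32]) -> (16-1)*2 - 2*1 + 3 + 1 = 32
```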

What is the use of a 1*1 convolution kernel, and what is the difference between a 3*3 convolution kernel and a 1*3 plus a 3*1 kernel

A 1x1 convolution changes the number of channels output by the previous layer. A kernel larger than 1x1 means that neighborhood information is needed to extract features:

  • If horizontal textures are to be extracted, the information density of the horizontal neighborhood is higher than that of the vertical neighborhood, so a wide, flat kernel makes the most sense.
  • By the same reasoning, for vertical textures a tall, thin kernel is best.
  • If you want to extract a rich variety of textures, the expected horizontal neighborhood information density is roughly equal to the expected vertical information density.

So, absent prior knowledge, the expected optimal kernel shape is square. As for 1xn and nx1 kernels, they are generally used together to realize the receptive field of an nxn kernel, which reduces parameters while increasing the number of layers; using them in the higher layers of a CNN brings some advantage (see the sketch below). Kernels need not be square: rectangular kernels such as 3x5 are used in text detection and license-plate detection, where the design matches the shape of a text line or plate and learns those features better. In practice, square versus rectangular does not make a huge difference, since the learning ability of the network is very strong. Of course, the kernel shape itself can also be learned, as in deformable convolution, which Lao Pan will cover later.
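A minimal PyTorch sketch of the two points above (channel counts chosen arbitrarily):

```python
# A 1x1 conv only changes the channel count; a 1x3 followed by a 3x1 covers
# the same 3x3 receptive field as a 3x3 conv but with fewer parameters.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

pointwise = nn.Conv2d(64, 128, kernel_size=1)              # 64 -> 128 channels, spatial size unchanged
print(pointwise(x).shape)                                   # [1, 128, 32, 32]

factorized = nn.Sequential(                                 # 3x3 receptive field from 1x3 + 3x1
    nn.Conv2d(64, 64, kernel_size=(1, 3), padding=(0, 1)),
    nn.Conv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)),
)
full = nn.Conv2d(64, 64, kernel_size=3, padding=1)
print(sum(p.numel() for p in factorized.parameters()),      # 24704 parameters
      sum(p.numel() for p in full.parameters()))            # 36928 parameters
```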

Comparison of the bottleneck block in ResNet and the inverted residual in MobileNetV2

Note that in ResNet's bottleneck the channel dimension is reduced first and then expanded, while in MobileNetV2 it is expanded first and then reduced (hence the name "inverted"); a sketch follows the links below.

  • https://zhuanlan.zhihu.com/p/67872001
  • https://zhuanlan.zhihu.com/p/32913695
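A schematic PyTorch sketch of the two blocks (BN layers and skip connections omitted, channel counts chosen arbitrarily):

```python
# ResNet bottleneck: reduce channels -> 3x3 -> expand channels.
# MobileNetV2 inverted residual: expand channels -> depthwise 3x3 -> reduce channels.
import torch.nn as nn

def resnet_bottleneck(c, mid):            # e.g. c=256, mid=64
    return nn.Sequential(
        nn.Conv2d(c, mid, 1), nn.ReLU(inplace=True),
        nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid, c, 1),
    )

def mbv2_inverted_residual(c, expand=6):  # e.g. c=24
    mid = c * expand
    return nn.Sequential(
        nn.Conv2d(c, mid, 1), nn.ReLU6(inplace=True),
        nn.Conv2d(mid, mid, 3, padding=1, groups=mid), nn.ReLU6(inplace=True),  # depthwise
        nn.Conv2d(mid, c, 1),                                                   # linear, no activation
    )
```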

Calculation of the size of the convolutional feature map

Simple, but easy to get wrong:

  • Conv2D: output = floor((input + 2 × padding − kernel) / stride) + 1 (the same formula applies to pooling layers)
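A small helper (a sketch, not any framework's API) applying this formula:

```python
# output = floor((input + 2*padding - kernel) / stride) + 1
def conv_out_size(i, k, s=1, p=0):
    return (i + 2 * p - k) // s + 1

print(conv_out_size(224, k=7, s=2, p=3))   # 112 (e.g. the first conv of ResNet)
print(conv_out_size(32, k=3, s=1, p=1))    # 32  ("same" padding keeps the size)
print(conv_out_size(32, k=2, s=2, p=0))    # 16  (typical 2x2, stride-2 pooling)
```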

The difference between a dynamic graph and a static graph

  • A static graph can be serialized to disk: the structure of the entire network is saved and can be reloaded, which is very practical for deployment. In TensorFlow's static graphs, conditions and loops require special syntax, whereas PyTorch needs only plain Python.
  • A dynamic graph is rebuilt every time it is used, which makes it harder to optimize and means the graph-building code runs again on every pass, but dynamic-graph code is more concise than static-graph code.

Depending on whether computation is dynamic or static, the many deep learning frameworks can be divided into two camps, although some frameworks provide both mechanisms (e.g. MXNet and newer versions of TensorFlow). Dynamic computation means the program executes in the order we wrote the commands; this makes debugging easier and makes it easier to translate the ideas in our heads into actual code. Static computation means the program is first compiled into the structure of a neural network, and the corresponding operations are then executed on that graph. In theory, a static graph lets the compiler optimize more aggressively, but it also means there is a larger gap between what you intend the program to do and what actually gets executed, so errors in the code are harder to spot (e.g. if there is a problem in the structure of the computation graph, you may only discover it when the corresponding operation runs). Although static graphs should in theory outperform dynamic graphs, in practice we often find that this is not the case (a small tracing sketch follows).
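For instance, a minimal PyTorch sketch contrasting eager (dynamic-graph) execution with tracing the same module into a serializable static graph; the tiny model here is a placeholder.

```python
# Eager execution runs op by op; torch.jit.trace records a static graph that
# can be saved to disk and reloaded for deployment.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(1, 8)

y_eager = model(x)                        # dynamic: easy to debug line by line

traced = torch.jit.trace(model, x)        # static: record the graph once
traced.save("model.pt")                   # serialized structure + weights
y_static = torch.jit.load("model.pt")(x)

print(torch.allclose(y_eager, y_static))  # True
```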

All the networks over the years

This is also covered in Lecture 9 of CS231n.

  • https://ucbrise.github.io/cs294-ai-sys-sp19/assets/lectures/lec02/classic_neural_architectures.pdf
  • https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202

To summarize:

  • LeNet-5: the first convolutional network, used to recognize handwritten digits; it uses 5x5 convolutions with s=1, the plain combination of convolution and pooling layers, followed by fully connected layers at the end.
  • AlexNet: uses an 11x11 convolution in the first layer, uses ReLU for the first time, and uses a normalization layer (local response normalization, not the BN we usually mean). Dropout was used, and the model was trained on two GPUs with model parallelism.
  • ZFNet: an enhanced version of AlexNet that changes the 11x11 convolution to 7x7 and increases the channel depth of the convolutions, so its results in the classification competition are better than AlexNet's.
  • VGGNet: only small 3x3 convolutions (s=1) and regular pooling layers are used, but the network is deeper than its predecessors, and the last layers are fully connected layers followed by a softmax. 3x3 convolutions are used because three stacked 3x3 convolutions have the same effective receptive field as one 7x7 convolution while being deeper and more nonlinear with fewer convolution parameters, so the network is faster and its depth can be increased further.
  • GoogLeNet: no FC layers are used, so the number of parameters is greatly reduced compared with earlier networks, and the Inception module, i.e. an NIN (network-in-network) structure, is proposed. The original Inception module is very computationally expensive, so a 1x1 conv "bottleneck" is added to each branch (see the linked article for a figure). To avoid vanishing gradients, two auxiliary softmax losses are attached at two intermediate positions, so there are three losses in total and the overall loss is a weighted sum of the three. Related article: https://zhuanlan.zhihu.com/p/42704781 Characteristics of the Inception structure: 1. the width of the network is increased, and so is its adaptability to different scales; 2. 1x1 convolutions reduce the dimensionality of the input feature maps, which greatly cuts the number of parameters and thus the amount of computation; 3. in V3, multiple small kernels replace large ones, and besides the regular square kernels there is the factorized version 3x3 = 3x1 + 1x3, which outperforms the regular kernel at greater depth; 4. the core idea of the bottleneck is to replace one large kernel with several small ones and to replace part of the large kernels with 1x1 kernels, i.e. first a 1x1 to reduce the channels, then a normal 3x3, then a 1x1 to restore them.
  • Xception: improves on Inception; the depthwise separable convolution it proposes is striking. https://www.jianshu.com/p/4708a09c4352
  • ResNet: the deeper the network, the harder it is to optimize. The key property to understand is that a deeper network should perform at least as well as a shallower one, not worse. Deeper ResNets (50+) use the bottleneck layer (two 1x1 convolutions to reduce and then restore the dimension) to improve the efficiency of the network. For a more detailed description see "Hundred Faces of Machine Learning" and the lecture slides. Related explanation: https://zhuanlan.zhihu.com/p/42706477
  • DenseNet: one cannot simply say DenseNet is better. Comparing the two, ResNet is the more general model and DenseNet the more specialized one. DenseNet may outperform ResNet for image processing because it better matches the information distribution of images and makes use of multi-scale features. Its drawback is memory: the most direct measure is the number of feature maps produced in a single inference pass. Some frameworks optimize by automatically releasing the feature maps of earlier layers, or by using in-place operations and recomputation, which reduces GPU memory; DenseNet cannot release them because it needs to reuse earlier feature maps, so GPU memory usage becomes excessive. It is exactly this concat that makes DenseNet's connections dense.
  • SENet: short for Squeeze-and-Excitation Networks. It belongs to the family of attention-based feature recalibration: global pooling (GP) is followed by two FC layers, a sigmoid and a scale, i.e. an attention mask is generated and multiplied with the input x to get a new x. The core idea is to learn the importance of each feature channel and then, according to this importance, promote the useful features and suppress those that are less useful for the task at hand. Multiplying each channel by the importance coefficient produced by the sigmoid is much like using a BN layer's scale coefficients to see which channels matter. Disadvantage: because of the 0~1 scaling applied to the backbone, gradients near the input layer easily vanish during backpropagation when the network is deep, which makes the model harder to optimize (a sketch is given after this list). http://www.sohu.com/a/161633191_465975
  • Wide Residual Networks
  • ResNeXt: a combination of ResNet and Inception. The residual connection on the side is the directly-added x in the formula; the rest consists of 32 independent groups with the same structure, which are finally fused, following the split-transform-merge pattern. Although there are 32 groups, each is a 1x1 (pointwise) convolution first, then a normal 3x3 convolution, then a 1x1 convolution to raise the dimension again (the opposite order of MobileNetV2). Related introduction: https://zhuanlan.zhihu.com/p/51075096
  • Densely Connected Convolutional Networks: helps alleviate vanishing gradients and strengthens the flow of features.
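Going back to the SENet item above, here is a minimal PyTorch sketch of the squeeze-excitation-scale steps (the reduction ratio is chosen arbitrarily):

```python
# SE block: global pooling -> two FC layers -> sigmoid -> channel-wise scaling.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (N, C, H, W)
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))                 # squeeze: global average pooling -> (N, C)
        w = self.fc(w).view(n, c, 1, 1)        # excitation: per-channel importance in (0, 1)
        return x * w                           # scale: reweight each channel of x

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)                    # torch.Size([2, 64, 32, 32])
```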

ShuffleNet: https://blog.csdn.net/u011974639/article/details/79200559

Some statistical knowledge

Normal distribution: https://blog.csdn.net/yaningli/article/details/78051361

On how to train (some questions during training)

Training oscillation caused by MaxPool (mitigated by adding an L2Norm after the MaxPool): https://mp.weixin.qq.com/s/QR-KzLxOBazSbEFYoP334Q

A Good Companion for Fully Connected Layers: Spatial Pyramid Pooling (SPP)

https://zhuanlan.zhihu.com/p/64510297
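A minimal sketch of SPP using PyTorch's adaptive pooling (the pyramid levels are chosen arbitrarily):

```python
# Pool the feature map to a few fixed grid sizes and concatenate, so the FC
# layer always sees a fixed-length vector regardless of the input resolution.
import torch
import torch.nn.functional as F

def spp(x, levels=(1, 2, 4)):
    # x: (N, C, H, W) with arbitrary H, W
    feats = [F.adaptive_max_pool2d(x, size).flatten(1) for size in levels]
    return torch.cat(feats, dim=1)             # (N, C * (1 + 4 + 16))

print(spp(torch.randn(2, 256, 13, 13)).shape)  # torch.Size([2, 5376])
print(spp(torch.randn(2, 256, 20, 30)).shape)  # same: torch.Size([2, 5376])
```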

Receptive field calculation

There are two formulas for computing the receptive field, a step-by-step (recursive) formula and a closed-form (general-term) formula:

[Figure: receptive-field formulas (recursive and closed-form)]

It is important to note that both convolution and pooling can increase the receptive field.

http://zike.io/posts/calculate-receptive-field-for-vgg-16/
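Since the formula figure did not survive extraction, here is a small sketch of the commonly used recursive computation (the standard recurrence, not necessarily the exact notation of the original figure): rf_l = rf_{l-1} + (k_l - 1) * jump_{l-1}, with jump_l = jump_{l-1} * s_l.

```python
# Both conv and pooling layers enter the list, since both enlarge the receptive field.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) for conv/pool layers, input-to-output order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer adds (k-1) times the current sampling step
        jump *= s              # striding increases the step between output samples
    return rf

# e.g. two 3x3/s1 convs followed by a 2x2/s2 pooling layer
print(receptive_field([(3, 1), (3, 1), (2, 2)]))   # 6
```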

—Copyright Notice—

It is only for academic sharing, and the copyright belongs to the original author.

—THE END—
