Convolutional neural networks and their applications in image processing

1. Preface

A convolutional neural network (CNN) is a deep learning model designed for image classification and recognition, developed on the basis of multilayer neural networks. Let us first review multilayer neural networks.

A multilayer neural network consists of an input layer, an output layer, and several hidden layers in between. Each layer contains a number of neurons, and between any two adjacent layers every neuron in the later layer is connected to every neuron in the earlier layer. In a general recognition problem, the input layer represents a feature vector, and each input neuron holds one feature value.

In an image recognition problem, each neuron in the input layer may represent the grayscale value of one pixel. However, such a network has two problems for image recognition. First, it ignores the spatial structure of the image, which limits recognition performance. Second, the neurons in every pair of adjacent layers are fully connected, so there are too many parameters and training is slow.

Convolutional neural networks address both problems. They use a structure tailored to image recognition and can be trained quickly. Because training is fast, it becomes practical to use networks with many layers, and deep structures have a large advantage in recognition accuracy.

2. The structure of convolutional neural networks

Convolutional neural networks have three basic concepts: local receptive fields, shared weights, and pooling.

Local receptive fields: In the fully connected network described above, the input layer is drawn as a single column of neurons. In a CNN, the input layer is instead regarded as neurons arranged in a two-dimensional matrix.

As with regular neural networks, neurons in the input layer need to be connected to neurons in the hidden layer. But instead of connecting every input neuron to every hidden neuron, connections are made only within a small local region of the image. For example, in an image of size 28x28, suppose that each neuron of the first hidden layer is connected to a 5x5 region of the input layer, as shown in the following figure:

This 5x5 region is called the local receptive field. Its 25 input neurons are all connected to the same neuron in the first hidden layer, with one weight per connection, so a local receptive field carries 5x5 = 25 weights. Sliding the local receptive field across the image from left to right and top to bottom yields one hidden neuron per position. The figure below shows the connections between the input layer and the first two neurons of the first hidden layer.

If the input layer is a 28x28 image and the local receptive field is 5x5, the first hidden layer has 24x24 neurons, because a field that moves one pixel at a time can take 28 - 5 + 1 = 24 positions along each axis.
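The size calculation above can be written as a small helper. This is a minimal sketch: the function name `conv_output_size` and the `stride` parameter are illustrative assumptions, not something taken from the article or from network3.py.

```python
def conv_output_size(input_size: int, field_size: int, stride: int = 1) -> int:
    """Number of positions a receptive field can occupy along one axis.

    Assumes no padding; `stride` is how many pixels the field moves per step.
    """
    if field_size > input_size:
        raise ValueError("receptive field is larger than the input")
    return (input_size - field_size) // stride + 1


# A 5x5 field sliding one pixel at a time over a 28x28 image.
print(conv_output_size(28, 5))  # 24
```

The same helper with `stride=2` and a field of size 2 describes non-overlapping 2x2 pooling, which halves each dimension.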

Shared weights: All 24x24 neurons in the first hidden layer use the same 5x5 weights. The output of the hidden neuron at position (j, k), where the first neuron is j = k = 0, is:

$$\sigma\left(b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m}\, a_{j+l,\,k+m}\right)$$

Here $\sigma$ is the activation function of the neuron (it can be a sigmoid function, a tanh function, a rectified linear unit, etc.), $b$ is the bias shared by all local receptive fields, $w$ is the 5x5 shared weight matrix, and $a_{x,y}$ is the input activation at position (x, y) of the input layer. There are therefore 26 parameters in total: 25 weights and one bias.
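The shared-weight computation described above can be sketched in NumPy as follows. This is an illustrative sketch, not the code from network3.py: the names `sigmoid` and `feature_map` are assumptions, the input is random data standing in for a 28x28 image, and the sigmoid is just one of the activation functions mentioned in the text.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))


def feature_map(image, weights, bias):
    """Apply one shared kernel and one shared bias at every image position.

    Stride 1, no padding: a k x k kernel on an n x n image gives
    an (n - k + 1) x (n - k + 1) output.
    """
    k = weights.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.empty((out_h, out_w))
    for j in range(out_h):
        for m in range(out_w):
            patch = image[j:j + k, m:m + k]  # local receptive field
            out[j, m] = sigmoid(bias + np.sum(weights * patch))
    return out


rng = np.random.default_rng(0)
image = rng.random((28, 28))        # stand-in for a grayscale digit
weights = rng.normal(size=(5, 5))   # 25 shared weights
bias = 0.0                          # 1 shared bias
hidden = feature_map(image, weights, bias)
print(hidden.shape)  # (24, 24)
```

Every output value is computed from the same 26 parameters, which is why the whole hidden layer responds to one feature at different locations.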

This means that all neurons in the first hidden layer detect the same feature, just at different locations in the image. For this reason the mapping from the input layer to the hidden layer is called a feature map. The weights of the feature map are called shared weights, and its bias is called the shared bias.

Image recognition usually needs more than one feature map, so a complete convolutional layer contains several different feature maps. The following figure shows an example with three feature maps.

In practice, CNNs may use more feature maps, often dozens. Taking MNIST handwritten digit recognition as an example, some of the learned features are shown below:

These 20 images correspond to 20 different feature maps (also called filters or kernels). Each feature map is drawn as a 5x5 image representing the 5x5 weights of the local receptive field. Bright pixels represent small weights, meaning the corresponding input pixels have less influence. Dark pixels represent large weights, meaning the corresponding input pixels have more influence. The feature maps clearly reflect particular spatial structures, so the CNN learns information about spatial structure and uses it for recognition.
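To make the saving in parameters concrete, the counts can be worked out directly. The sketch below assumes each feature map has one 5x5 weight matrix plus one bias, as described above; the function names are illustrative, and the fully connected figure uses the 784-input, 100-neuron layer that appears in the example later in the article.

```python
def conv_params(num_maps: int, field_size: int) -> int:
    """Parameters of a convolutional layer on a single-channel input."""
    # Each feature map has field_size * field_size shared weights + 1 shared bias.
    return num_maps * (field_size * field_size + 1)


def fully_connected_params(n_in: int, n_out: int) -> int:
    """Parameters of a fully connected layer: one weight per connection + biases."""
    return n_in * n_out + n_out


print(conv_params(3, 5))                 # 78
print(conv_params(20, 5))                # 520
print(fully_connected_params(784, 100))  # 78500
```

Even with 20 feature maps, the convolutional layer has far fewer parameters than a single fully connected hidden layer on the same 28x28 input.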

Pooling layers: Pooling layers are usually placed immediately after convolutional layers to simplify their output. For example, each neuron in the pooling layer might summarize a 2x2 region of neurons in the previous layer. In the commonly used max-pooling, a pooling unit simply outputs the maximum activation in its 2x2 input region, as shown in the following figure:

If the output of the convolutional layer contains 24x24 neurons, pooling yields 12x12 neurons. Pooling is applied to each feature map separately, so the convolution-pooling structure described above looks like this:

Max-pooling is not the only pooling method. Another method, L2 pooling, takes the square root of the sum of the squares of the outputs in the 2x2 region of the convolutional layer. Although the details differ from max-pooling, the effect is again to condense the information output by the convolutional layer.

Connecting the above structures together and adding an output layer gives a complete convolutional neural network. In the handwritten digit recognition example, the output layer has ten neurons, corresponding to the digits 0, 1, ..., 9.

The last layer in the network is a fully connected layer, i.e. every neuron in that layer is connected to every neuron in the max-pooling layer.

This is just one example of the structure. In real CNNs, one or more fully connected layers can be added after the convolutional and pooling layers.
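Both pooling variants can be sketched in a few lines of NumPy. The function name `pool2x2` and its `mode` argument are illustrative assumptions; the sketch handles a single feature map with non-overlapping 2x2 regions and even height and width.

```python
import numpy as np


def pool2x2(feature_map, mode="max"):
    """Pool non-overlapping 2x2 regions of a 2D feature map."""
    h, w = feature_map.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # blocks[i, a, j, b] == feature_map[2*i + a, 2*j + b]
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "l2":
        return np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    raise ValueError("mode must be 'max' or 'l2'")


conv_output = np.arange(24 * 24, dtype=float).reshape(24, 24)
print(pool2x2(conv_output, "max").shape)  # (12, 12)
print(pool2x2(np.array([[3.0, 4.0], [0.0, 0.0]]), "l2"))  # [[5.]]
```

Either way a 24x24 feature map is reduced to 12x12, which matches the sizes quoted above.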

3. Applications of convolutional neural networks

3.1 Handwritten digit recognition

Michael Nielsen provides an online book on deep learning and CNNs, together with an example program for handwritten digit recognition that can be downloaded from GitHub. The program is written in Python with NumPy and makes it easy to build CNNs with different structures for handwritten digit recognition. It uses the machine learning library Theano to implement backpropagation and stochastic gradient descent for fitting the parameters of the CNN. Theano can run on GPUs, which greatly reduces training time. The CNN code is in the file network3.py.

As a starting example, try creating a neural network with only one hidden layer, with the following code:

>>> import network3
>>> from network3 import Network
>>> from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer
>>> training_data, validation_data, test_data = network3.load_data_shared()
>>> mini_batch_size = 10
>>> net = Network([
...     FullyConnectedLayer(n_in=784, n_out=100),
...     SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> # Train for 60 epochs with learning rate 0.1
>>> net.SGD(training_data, 60, mini_batch_size, 0.1,
...         validation_data, test_data)

The network has 784 input neurons, 100 neurons in the hidden layer, and 10 neurons in the output layer. It achieved 97.80% accuracy on the test data.

Does a convolutional neural network do better than this? Try a structure that contains a convolutional layer, a pooling layer, and an additional fully connected layer, as shown below:

In this structure, the convolutional and pooling layers can be understood as learning the local spatial structure of the input image, while the fully connected layer that follows learns at a more abstract level and integrates global information from the entire image.

>>> net = Network([
...     ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
...                   filter_shape=(20, 1, 5, 5),
...                   poolsize=(2, 2)),
...     FullyConnectedLayer(n_in=20*12*12, n_out=100),
...     SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
>>> net.SGD(training_data, 60, mini_batch_size, 0.1,
...         validation_data, test_data)

This CNN achieves a recognition accuracy of 98.78%. To improve accuracy further, consider the following options:

Add one or more convolution-pooling layers

Add one or more fully connected layers

Replace the sigmoid function with a different activation function, for example the rectified linear unit (ReLU), f(z) = max(0, z). The main advantage of the rectified linear unit over the sigmoid function is that it makes training faster.

Use more training data. Deep learning needs a lot of training data because of the large number of parameters, and with too little data it may not be possible to train an effective network. A common remedy is to generate additional, similar training data algorithmically from the existing data. For example, each image can be shifted by one pixel up, down, left, or right.

Use a combination of several networks. Create several neural networks with the same structure, initialize their parameters randomly and train each one, then at test time let them vote on the output to decide the final classification. This ensemble approach is not unique to neural networks. Other machine learning algorithms, such as random forests, use it to improve robustness.
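Two of the suggestions above, one-pixel shifts and ensemble voting, are easy to sketch in NumPy. The helper names (`shift_image`, `augment`, `majority_vote`) are illustrative assumptions and are not part of network3.py. Vacated pixels are filled with 0, and ties in the vote go to the smallest label; both are choices made for this sketch.

```python
import numpy as np


def shift_image(image, dy, dx):
    """Shift a 2D image by (dy, dx) pixels; vacated pixels become 0."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted


def augment(image):
    """Return the original image plus its four one-pixel shifts."""
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    return [shift_image(image, dy, dx) for dy, dx in offsets]


def majority_vote(predictions):
    """Combine class labels of shape (n_networks, n_samples) by voting."""
    predictions = np.asarray(predictions)
    return np.array([np.bincount(column).argmax() for column in predictions.T])


digit = np.arange(9).reshape(3, 3)
print(len(augment(digit)))  # 5

# Three hypothetical networks classifying three test images.
votes = [[3, 1, 7],
         [3, 2, 7],
         [5, 1, 7]]
print(majority_vote(votes))  # [3 1 7]
```

Applied to the whole training set, `augment` multiplies the amount of training data by five while keeping every label unchanged.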

3.2 ImageNet image classification

Alex Krizhevsky et al.'s 2012 article "ImageNet classification with deep convolutional neural networks" classifies a subset of ImageNet. ImageNet contains about 15 million labeled high-resolution images in roughly 22,000 categories. The images were collected from the web and labeled by hand. Since 2010 there has been an annual ImageNet image recognition competition called ILSVRC (ImageNet Large-Scale Visual Recognition Challenge). ILSVRC uses 1,000 categories from ImageNet, each containing approximately 1,000 images. In total there are 1.2 million training images, 50,000 validation images, and 150,000 test images. The method in this article reached a top-5 error rate of 15.3%, while the second-best method had an error rate of 26.2%.

The network in the article has 7 hidden layers: the first 5 are convolutional layers (some followed by max-pooling) and the last 2 are fully connected layers. The output layer is a softmax layer with 1000 units, corresponding to the 1000 image categories.

The CNN is trained on GPUs. Because of the memory limits of a single GPU, training was split across 2 GPUs (GTX 580, each with 3 GB of memory).

The article uses two methods to prevent overfitting. The first is to artificially generate more training images, for example by translating the existing training images or flipping them horizontally, and by altering the values of the RGB channels based on principal component analysis. The translations and reflections enlarge the training set by a factor of 2048. The second is dropout, which sets the output of each hidden neuron to 0 with probability 0.5 during training, so that about half of the neurons are dropped at random. Dropout reduces overfitting and makes the results more stable, although the article notes that it roughly doubles the number of iterations needed to converge.

The input image has size 224x224x3 and the receptive field has size 11x11x3. The 96 convolution kernels learned by the first layer are shown in the figure above. The first 48 were learned on the first GPU and the last 48 on the second.
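The dropout step can be sketched as follows. This is a minimal NumPy illustration, not the authors' GPU implementation: the function name `dropout_forward` and its arguments are assumptions. The test-time branch, which keeps all neurons and scales their outputs by the keep probability, is the usual companion to this form of dropout and is not described in the text above.

```python
import numpy as np


def dropout_forward(activations, rng, drop_prob=0.5, training=True):
    """Zero each activation with probability drop_prob during training.

    At test time nothing is dropped; outputs are scaled by the keep
    probability so their expected value matches training.
    """
    if not training:
        return activations * (1.0 - drop_prob)
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask


rng = np.random.default_rng(0)
hidden = np.ones(1000)                      # stand-in hidden-layer outputs
train_out = dropout_forward(hidden, rng)    # roughly half become 0
test_out = dropout_forward(hidden, rng, training=False)
print(train_out.mean(), test_out.mean())
```

A fresh mask is drawn on every call, so each training step effectively uses a different thinned network.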

3.3 Medical image segmentation

Adhish Prasoon et al., in their 2013 article "Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network," used CNNs to segment knee cartilage in MRI. Traditional CNNs are two-dimensional. Extending them directly to three dimensions requires more parameters and a more complex network, which in turn needs longer training time and more training data. Using two-dimensional data alone, on the other hand, does not exploit three-dimensional features and may reduce accuracy. Prasoon et al. chose a compromise: use three 2D CNNs, one per plane, and combine them.

The three 2D CNNs process the xy, yz, and zx planes respectively, and their outputs are joined by a softmax layer to produce the final output. The article used images from 25 patients as training data and selected 4800 voxels from each three-dimensional image, for a total of 120,000 training voxels. Compared with traditional segmentation methods that extract hand-crafted features from 3D images, this method is significantly more accurate and takes less time to train.
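The triplanar idea can be illustrated by cutting three orthogonal 2D patches around a voxel, one for each 2D CNN. This is only a sketch under assumptions: the function name `triplanar_patches`, the (z, y, x) axis order, and the patch size are all illustrative and are not taken from the article.

```python
import numpy as np


def triplanar_patches(volume, center, half_size):
    """Return the xy, yz and zx patches centred on one voxel.

    `volume` is indexed as volume[z, y, x]; each patch has
    (2 * half_size + 1) pixels per side.
    """
    z, y, x = center
    r = half_size
    for coord, size in zip(center, volume.shape):
        if coord - r < 0 or coord + r >= size:
            raise ValueError("voxel is too close to the volume border")
    xy = volume[z, y - r:y + r + 1, x - r:x + r + 1]
    yz = volume[z - r:z + r + 1, y - r:y + r + 1, x]
    zx = volume[z - r:z + r + 1, y, x - r:x + r + 1]
    return xy, yz, zx


volume = np.arange(1000).reshape(10, 10, 10)   # stand-in for an MRI scan
xy, yz, zx = triplanar_patches(volume, center=(5, 5, 5), half_size=2)
print(xy.shape, yz.shape, zx.shape)  # (5, 5) (5, 5) (5, 5)
```

Each voxel therefore contributes three small 2D inputs instead of one 3D block, which keeps the number of input values and parameters much lower than a full 3D network would need.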

3.4 AlphaGo: Google's Go program defeats human players

Google's DeepMind team used deep convolutional neural networks to make a major breakthrough in computer Go. Earlier, IBM's Deep Blue supercomputer had defeated human professional chess players, but it did so through sheer computing power rather than anything that could be called "intelligent".

The computational complexity of Go is far greater than that of chess, and even the most powerful computers cannot enumerate all possible moves. A 19x19 board has 3^361 (about 10^172) configurations, of which roughly 10^170 are legal positions, while the number of atoms in the observable universe is only about 10^80. Chess, by comparison, has at most about 2^155, or roughly 10^47, positions. (This should not be confused with the Shannon number, about 10^120, which estimates the number of possible chess games.)

DeepMind's AlphaGo used convolutional neural networks to learn how humans play Go, and eventually achieved a breakthrough. AlphaGo defeated the European champion Fan Hui, a professional 2-dan player, 5-0 without any handicap. The researchers also pitted AlphaGo against other Go programs, and it lost only one game out of 495, a 99.8 percent win rate. It even gave four handicap stones to three strong programs, Crazy Stone, Zen, and Pachi, and still won 77%, 86%, and 99% of those games, respectively. This shows how powerful AlphaGo is.

According to the DeepMind team's paper, the board position is passed in as a 19x19 image and used to train two different deep neural networks, a "policy network" and a "value network". Their task is to work together to pick out the more promising moves and discard the obviously bad ones, keeping the amount of computation within what a computer can handle. This is essentially what a human player does.

The value network reduces the depth of the search: the AI evaluates the position as it calculates, and when a line is clearly inferior it abandons that line instead of following it all the way to the end. The policy network reduces the breadth of the search: in any given position some moves are obviously not worth playing, such as giving stones away to be captured. By feeding this information into Monte Carlo tree search as move probabilities, the AI does not have to give every move the same weight and can concentrate on the moves that are actually promising.
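As a quick check on the magnitudes quoted above, the powers of ten can be computed directly. The variable names are illustrative. Note that 3^361 counts every way of marking each of the 361 points as empty, black, or white, so it is an upper bound that also includes illegal positions.

```python
import math

# Each of the 361 points on a 19x19 board is empty, black, or white.
go_exponent = 361 * math.log10(3)
# Upper bound on chess positions quoted in the text.
chess_exponent = 155 * math.log10(2)

print(f"3^361 is about 10^{go_exponent:.1f}")     # 3^361 is about 10^172.2
print(f"2^155 is about 10^{chess_exponent:.1f}")  # 2^155 is about 10^46.7
print(f"ratio is about 10^{go_exponent - chess_exponent:.0f}")
```

The gap of well over a hundred orders of magnitude is why exhaustive search, which is already infeasible for chess, is even further out of reach for Go.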
