With the rapid growth of the mainland economy, developing marine resources and protecting maritime rights and interests have become increasingly urgent, and tourism, energy transportation, and import and export trade depend more and more on maritime transportation.

Ships are the main actors in maritime operations and in building the maritime economy, which makes them a key target of maritime monitoring. Accurate ship detection supports the monitoring and analysis of vessel status and helps monitor and combat illegal fishing. It is the basis for risk prevention and control in vessel operations and provides technical support for activities such as maritime resource allocation and goods trade, so it has both theoretical significance and practical value.

Data for ship target detection comes mainly from remote sensing: synthetic aperture radar (SAR), infrared remote sensing, UAV remote sensing, optical satellite imaging, and similar technologies.

Research on ship detection method based on dataset segmentation and band screening

Infrared and SAR technologies make it possible to monitor vessels under extreme weather conditions. Visible-light remote sensing imagery provides richer information about ships. UAV remote sensing is low-cost, works in near real time, is flexible to maneuver, and has a short operation cycle.

At the same time, each remote sensing technology has shortcomings.

Satellite remote sensing images have low resolution and poor real-time performance. Synthetic aperture radar provides insufficient spectral information, its images are not very sharp, and they differ from what the human eye sees. The quality of infrared imaging is easily affected by distance. UAV imaging is susceptible to weather, range, and similar factors, but it provides rich detail and high image resolution.

Fundamentals of ship target detection

Compared with traditional ship detection methods, deep learning methods are faster and better suited to real-time use. Ship recognition based on deep learning relies mainly on convolutional neural networks, for which a variety of mature object detection models have been developed.

Convolutional neural networks (CNNs) are feed-forward neural networks that contain convolutional computation and have a deep structure. Because they learn representations well, they are widely used, with good results, in computer vision tasks such as object detection, image classification, object segmentation, and object tracking.

A CNN consists of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer (Figure 3).

The input layer preprocesses the original input image. The convolutional layer is the core of the network and extracts the features of the target.

In essence, the convolution operation slides a discrete two-dimensional filter (also called a convolution kernel) over the image. At every pixel position, the kernel weights are multiplied with the corresponding pixel and the pixels in its neighborhood, and the products are summed.
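
The sliding-window computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a single-channel image, a square kernel, a stride of 1, and no padding; the function name `conv2d` and the sample values are hypothetical. As in most deep learning libraries, the kernel is applied without flipping.

```python
def conv2d(image, kernel):
    """Slide a square kernel over a 2D image (lists of lists), stride 1, no padding."""
    h, w = len(image), len(image[0])
    f = len(kernel)  # kernel height and width (square kernel assumed)
    out = []
    for i in range(h - f + 1):
        row = []
        for j in range(w - f + 1):
            # Multiply the kernel with the window under it and sum the products.
            acc = 0
            for u in range(f):
                for v in range(f):
                    acc += image[i + u][j + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 4x4 input and a 3x3 vertical-edge kernel.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

feature_map = conv2d(image, kernel)  # a 2x2 feature map
```

A 4×4 input and a 3×3 kernel leave only 2×2 valid window positions, which is why the feature map is smaller than the input.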

The operation uses the following parameters. K (depth): the number of convolution kernels, which equals the depth of the output. S (stride): the number of pixels the sliding window moves at each step, generally 1 or 2. Zero padding: filling the edges of the input data with zeros. H: image height. W: image width. D: the number of channels of the input, which is also the depth of each convolution kernel. F: the height and width of the convolution kernel (Figure 4, Figure 5).

In Figure 2, the input image size is 4×4, the convolution kernel size is 3×3, the stride is 1, and the output feature map size is 2×2. After the convolution operation, the output feature map is smaller than the input.

When zero padding is applied to the edges of the original image (Figure 4) and the other conditions remain unchanged, the output feature map size is 4×4, the same as the original image, so the features at the edges are retained. This is the effect of padding. Each convolutional layer has D*F*F*K weights.
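
The two cases above follow from the usual output-size relation, output = (input + 2 × padding − F) / S + 1, rounded down. The short sketch below reproduces both cases and the D*F*F*K weight count. The function names and the sample values D = 3 and K = 64 are illustrative assumptions, not values from the text.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output height (or width) for input size n and kernel size f."""
    return (n + 2 * padding - f) // stride + 1


def conv_weight_count(d, f, k):
    """Weights in one convolutional layer: D * F * F * K (bias terms not counted)."""
    return d * f * f * k


no_padding = conv_output_size(4, 3, stride=1, padding=0)    # 4x4 input, 3x3 kernel
zero_padding = conv_output_size(4, 3, stride=1, padding=1)  # same, with one ring of zeros

# Hypothetical layer: 3 input channels, 3x3 kernels, 64 kernels.
weights = conv_weight_count(d=3, f=3, k=64)
```

With one ring of zeros around the 4×4 input, the padded size is 6×6, so a 3×3 kernel fits in 4×4 positions and the output matches the input size.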

Because multiple convolution kernels take part in the convolution operation, multiple feature maps are produced. Pooling layers (downsampling layers) compress the dimensions of these feature maps while keeping the main features unchanged. This is comparable to resizing the image: information that has little impact on target detection is filtered out, which effectively reduces the amount of computation.

Max pooling, average pooling, and stochastic (random) pooling are the most common pooling methods.

Max pooling takes the maximum value in each region, average pooling takes the mean of the values in the region, and stochastic pooling picks a value from the region at random. Max pooling and average pooling are more commonly used than stochastic pooling (Figures 6, 7).
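
The two common pooling methods can be sketched as follows. This is a minimal example, assuming a single-channel feature map and non-overlapping 2×2 windows; the function name `pool2d` and the sample values are hypothetical.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2D feature map with max or average pooling."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            # Collect the values inside the current pooling window.
            window = [feature_map[i + u][j + v]
                      for u in range(size) for v in range(size)]
            if mode == "max":
                row.append(max(window))
            elif mode == "average":
                row.append(sum(window) / len(window))
            else:
                raise ValueError("mode must be 'max' or 'average'")
        out.append(row)
    return out


fm = [[1, 3, 2, 4],
      [5, 6, 1, 2],
      [7, 2, 9, 0],
      [1, 8, 3, 4]]

max_pooled = pool2d(fm, mode="max")
avg_pooled = pool2d(fm, mode="average")
```

Both variants reduce the 4×4 map to 2×2, so the following layers process a quarter of the values.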

The fully connected layers integrate the features extracted by the convolutional and pooling layers and pass them to the classifier. They generally sit in the last few layers of the network, and a network has one or several of them depending on the task. The output of the last fully connected layer, generally a one-dimensional vector, is used as the basis for classification.

Commonly used classifiers are the Softmax function and the SVM (support vector machine). Compared with the SVM, Softmax handles multi-class classification directly and generally classifies better, so it is the most commonly used classifier (Equation 1).
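
In its standard form, and using the symbols defined next, the softmax function can be written as follows. This is a reconstruction that assumes a is the vector of scores passed to the classifier.

```latex
S_j = \frac{e^{a_j}}{\sum_{k=1}^{N} e^{a_k}}, \qquad j = 1, \dots, N
```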

where N is the number of classes, a_j is the jth element of the input vector a, and S is the output, a 1×N vector holding the probabilities that the network assigns to the N categories.
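
A minimal numerical sketch of softmax, assuming the scores arrive as a plain Python list (the function name and the sample scores are hypothetical). Subtracting the largest score before exponentiating is a common stability measure and does not change the result.

```python
import math


def softmax(a):
    """Turn a list of N scores into N probabilities that sum to 1."""
    m = max(a)  # shift by the maximum to avoid overflow in exp()
    exps = [math.exp(x - m) for x in a]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical outputs of the last fully connected layer
probs = softmax(scores)
predicted_class = probs.index(max(probs))
```

The class with the largest score always receives the largest probability, so the prediction is the index of the maximum.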

Softmax Loss is calculated as follows (Equation 2):
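
In its standard cross-entropy form, with S_j taken to be the softmax probability of class j (an assumption consistent with Equation 1), the loss can be written as:

```latex
L = -\sum_{j=1}^{N} y_j \log S_j
```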

where y_j indicates whether the jth class is the true class of the sample: y_j is 1 for the true class and 0 otherwise.

In convolutional neural networks, a nonlinear element, the activation function, is introduced after the convolutional layer. Its main role is to give the network nonlinear expressive power; a well-chosen activation function also helps keep gradients stable during training.

The most widely used is the ReLU activation function, which keeps responses greater than zero unchanged and sets responses less than zero to zero (Equation 3).
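
Written as a formula, this is the standard ReLU definition:

```latex
f(x) = \max(0, x)
```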

This function makes the activations sparser, which helps the trained model generalize better.

In addition, the gradient of ReLU is stable during training: for positive inputs it is constant and does not depend on the input magnitude, which avoids the "vanishing gradient" problem of functions such as sigmoid (Figure 8).
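
The contrast between the two gradients can be checked numerically. The sketch below uses the standard definitions of ReLU and sigmoid; the function names and the sample inputs are chosen only for illustration.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of sigmoid: s(x) * (1 - s(x)), at most 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)


inputs = [0.5, 2.0, 10.0]
relu_grads = [relu_grad(x) for x in inputs]        # constant for positive inputs
sigmoid_grads = [sigmoid_grad(x) for x in inputs]  # shrinks as the input grows
```

Because backpropagation multiplies these derivatives layer by layer, the small sigmoid values compound in deep networks, while the ReLU derivative leaves the gradient of active units unchanged.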

Typical categorical convolutional neural network model

Common models are AlexNet, VGGNet, GoogLeNet, and ResNet. All four emerged from the ImageNet challenge.

AlexNet was proposed by Alex Krizhevsky et al. in 2012 and won the ImageNet competition that year. The model consists of 5 convolutional layers and 3 fully connected layers. The first two convolutional layers use 11×11 and 5×5 kernels with strides of 4 and 1, respectively. The last three convolutional layers all use 3×3 kernels with a stride of 1. The network structure is shown in Figure 9.

The model has the following characteristics:

First, it adopts the ReLU activation function, which avoids the vanishing-gradient problem of earlier activation functions;

Second, it uses data augmentation and introduces a dropout layer, which effectively reduces overfitting and improves the generalization ability of the model.

VGGNet was proposed in 2014 by the Visual Geometry Group (VGG) at the University of Oxford. It won first place in the localization task and second place in the classification task of the ImageNet competition.

Taking VGG16 as an example, the network has 16 layers in total, not counting the pooling layers and the Softmax layer. All convolution kernels are 3×3. The pooling layers use max pooling with a 2×2 kernel and a stride of 2. The network structure is shown in Figure 10 and Figure 11.

The characteristics of this model are as follows:

First, compared with AlexNet, the number of layers increases from 8 to 16, which makes the extracted feature information richer;

Second, the use of small 3×3 convolution kernels reduces the number of parameters and improves training efficiency;

Third, the pooling kernels become smaller and even-sized (2×2), which reduces information loss.
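
The parameter saving from small kernels can be checked with a quick count. Two stacked 3×3 layers cover the same 5×5 receptive field as one 5×5 layer, and three cover 7×7. The sketch assumes the same number of input and output channels in every layer and ignores biases; the channel count of 64 is a hypothetical value.

```python
def conv_weights(kernel_size, channels, layers=1):
    """Weights in `layers` stacked conv layers with `channels` inputs and outputs each."""
    return layers * kernel_size * kernel_size * channels * channels


c = 64  # hypothetical channel count

two_3x3 = conv_weights(3, c, layers=2)    # receptive field 5x5
one_5x5 = conv_weights(5, c)
three_3x3 = conv_weights(3, c, layers=3)  # receptive field 7x7
one_7x7 = conv_weights(7, c)

saving_vs_5x5 = 1 - two_3x3 / one_5x5
saving_vs_7x7 = 1 - three_3x3 / one_7x7
```

The stacked version also applies a nonlinearity after each small layer, which a single large kernel does not.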

GoogLeNet adopts 9 modular structures (Inception modules) with a total of 22 layers, which makes the network convenient to extend and modify, and it has only about one-twelfth as many parameters as AlexNet.

Two auxiliary classifiers are added to counter vanishing gradients in the deep network. Convolution kernels of several sizes are used in parallel. The fully connected layers are removed and replaced by an average pooling layer, and a single fully connected layer is added at the end to make later transfer of the model easier.

Through these measures, the model mitigates the problem that deeper networks have more parameters and gradients that vanish more easily. The model structure is shown in Figure 12.

VGGNet and GoogLeNet suggest that the deeper the network, the better the model performs. In practice, however, once a network reaches a certain depth, vanishing gradients become more pronounced as it gets deeper, and the training error rises, which worsens both training and test results.

ResNet solves this problem well. Its core is the residual block (Figure 13).
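
A residual block adds its input back to the output of its stacked layers, y = F(x) + x. The sketch below shows only this shortcut connection. The function names are hypothetical, and `transform` stands in for the convolutional layers F, which are not modeled here.

```python
def residual_block(x, transform):
    """Return F(x) + x, where `transform` plays the role of the stacked layers F."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_transform(x):
    # A hypothetical F that has learned nothing yet: F(x) = 0.
    return [0.0 for _ in x]


def half_transform(x):
    # A hypothetical F that scales its input by 0.5.
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 3.0]
y_identity = residual_block(x, zero_transform)  # the block passes x through unchanged
y_scaled = residual_block(x, half_transform)
```

Since y = F(x) + x, the derivative of y with respect to x contains a constant identity term, so the gradient has a direct path back through the block even when the derivative of F is small.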

As the figure shows, adding an identity mapping to the network alleviates vanishing gradients.

The convolutional neural network models above each have their own merits. GoogLeNet and ResNet are deep and complex convolutional neural networks.

However, for a dataset with a small amount of data, a deep and complex neural network brings a large number of parameters, so the model has excess capacity and tends to overfit.

Compared with AlexNet, VGGNet replaces the medium and large convolution kernels with stacks of small convolution kernels, which strengthens the local feature extraction capability of the network.

Theoretical basis of ship detection based on SSD algorithm

Although the detection accuracy of two-stage algorithms keeps improving, their detection speed is difficult to raise, so their real-time performance in some scenarios is unsatisfactory.

One-stage algorithms detect quickly while maintaining a reasonable accuracy. Moreover, the ship dataset in this paper is relatively small. As discussed above, the VGGNet model works relatively well with a small number of samples. The SSD algorithm is built on a truncated VGG-16 network and is among the most widely used one-stage object detection algorithms.

The SSD algorithm is a deep learning algorithm based on a feed-forward convolutional network. It generates a fixed-size collection of default boxes, scores the presence of object class instances in each box, and produces the final detection result by non-maximum suppression. The algorithm structure is shown in Figure 14.

The SSD algorithm modifies the standard VGG-16 network: it removes the last pooling layer and the fully connected layers and replaces them with two convolutional layers, conv3-1024 and conv1-1024, so that the number of feature map channels increases from 512 to 1024 and then stays at 1024.

Extra layers containing a total of 8 convolutional layers are then added after the network, so that the 19×19×512 feature map output by the original network is reduced to a 1×1×256 feature map after these convolutions.

SSD uses convolutional layers at different depths to predict targets of different sizes: lower layers detect small targets and higher layers detect large targets. The predictions from the feature layers at all of these scales together produce the final detections (Figure 15).
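
To give a sense of scale, the sketch below counts the default boxes produced across the prediction layers. The feature map sizes and the number of boxes per location are assumptions taken from the commonly cited SSD300 configuration with a 300×300 input. They are not stated in this text and may differ from the configuration used here.

```python
# (feature map side length, default boxes per location).
# Assumed SSD300 configuration; not taken from this article.
feature_layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def boxes_in_layer(side, boxes_per_location):
    """Default boxes contributed by one square feature map."""
    return side * side * boxes_per_location


per_layer = [boxes_in_layer(side, k) for side, k in feature_layers]
total_boxes = sum(per_layer)
```

Under these assumptions the lowest, highest-resolution layer contributes most of the boxes, which matches its role in detecting small targets.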

The SSD algorithm borrows the default box (anchor) mechanism and the regression mechanism of the RPN network, and it also adopts multi-scale feature representation for object detection, which gives the algorithm both fast detection speed and high precision.

Because predictions are made on several feature layers, a large number of overlapping or even incorrect candidate boxes are generated, so non-maximum suppression (NMS) is required to refine them. NMS sorts the detection boxes by score (class probability), keeps the highest-scoring box, discards the boxes that overlap it heavily, and repeats on the remaining boxes until only high-quality detection boxes are left for output.
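
A minimal sketch of greedy non-maximum suppression for a single class, assuming boxes are given as (x1, y1, x2, y2) corner coordinates. The IoU threshold of 0.5, the function names, and the sample boxes are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept after greedy NMS."""
    # Sort box indices by score, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two heavily overlapping detections of one ship and one separate detection.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In a multi-class detector this procedure is usually run separately for each class, after low-scoring boxes have been filtered out.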

