Faster R-CNN is a landmark paper from 2015 by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Its Region Proposal Network (RPN) broke with the old pattern in which proposal generation was a stage separate from the detection network, integrating region proposals and detection into a single network for the first time. Although YOLO and SSD later achieved better detection results, the creative idea behind the RPN is well worth studying.
Paper link
http://www.rossgirshick.info/
Code link
https://github.com/rbgirshick/py-faster-rcnn
First reading: 2017/2/13
Second reading: 2017/2/16
Abstract
In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features (what are the full-image convolutional features?) with the detection network. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look.
Introduction
Three problems with existing approaches
(1) Region proposals are the test-time computational bottleneck in state-of-the-art detection systems that are based on region proposal methods.
(2) Selective Search (SS) is an order of magnitude slower than the detection network, at 2 seconds per image in a CPU implementation.
(3) Although SS can be re-implemented on the GPU, doing so ignores the downstream detection network and therefore misses important opportunities for sharing computation.
Our proposed approach
(1) The RPN (Region Proposal Network) computes proposals with a deep convolutional neural network.
(2) The RPN shares convolutional layers with state-of-the-art object detection networks (sharing the conv features saves a great deal of computation at test time).
(3) The convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. (Core idea: conv features serve not only detection but also proposal generation.)
(4) On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) and can be trained end-to-end specifically for the task of generating detection proposals.
(5) RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios (the predicted proposals vary in size and aspect ratio). RPNs introduce novel “anchor” boxes (the key innovation) that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (I am still not entirely clear on this).
(6) To unify RPNs with Fast R-CNN object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.
(7) RPN and Faster R-CNN are broadly applicable: e.g., 3D object detection, part-based detection, instance segmentation, and image captioning.
Related work
Faster R-CNN
(Core description of the model) Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.
3.1 Region Proposal Network
A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.
Because our ultimate goal is to share computation with a Fast R-CNN object detection network, we assume that both nets share a common set of convolutional layers.
To generate region proposals, we slide a small network over the conv feature map output by the last shared conv layer.
The small network
Input: an n×n (n = 3) spatial window of the input conv feature map. (A 3×3 spatial window slides over the last shared conv feature map; note that this feature map is still multi-channel, not a single channel.)
Each sliding window is mapped to a lower-dimensional feature (256-d for ZF, 512-d for VGG-16).
This feature is fed into two sibling fully connected layers: a box-regression layer (reg) and a box-classification layer (cls).
(The following passage needs further study alongside the code to fully understand.)
Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n×n convolutional layer followed by two sibling 1×1 convolutional layers (for reg and cls, respectively), as in the sketch below.
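A minimal PyTorch sketch of this head (the class and layer names are my own, not from py-faster-rcnn; the 512-channel input assumes a VGG-16 conv5 feature map):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head: an n x n conv followed by two sibling 1 x 1 convs."""
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        # The n x n (here 3 x 3) sliding window, realized as a convolution.
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        # Sibling 1 x 1 convs: 2k objectness scores and 4k box offsets per location.
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feat):
        h = self.relu(self.conv(feat))   # shared lower-dimensional feature
        return self.cls(h), self.reg(h)  # (N, 2k, H, W), (N, 4k, H, W)

# Usage on a VGG-16-like conv5 feature map (512 channels):
scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
```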
Anchors
At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k.
So the reg layer has 4k outputs encoding the coordinates of the k boxes (4 values per box), and the cls layer outputs 2k scores that estimate the probability of object vs. not-object for each proposal (2 values per proposal). The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question (the anchor's center coincides with the sliding window's center), and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of size W×H (typically ~2,400 locations), there are W·H·k anchors in total.
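A numpy sketch of anchor generation under the paper's defaults (box areas of 128², 256², 512² pixels, aspect ratios 1:1, 1:2, 2:1, and a feature stride of 16 for VGG-16); the function name and the exact centering convention are my own, and py-faster-rcnn's implementation differs in details:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return all W*H*k anchors as (x1, y1, x2, y2), one set of k = 9
    anchors centered on each sliding-window position of the feature map."""
    # k base anchors centered at the origin, with area ~ scale^2.
    base = []
    for s in scales:
        for r in ratios:                       # r = h / w
            w, h = s / np.sqrt(r), s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                      # (k, 4)
    # Shift the base anchors to every feature-map location.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base).reshape(-1, 4)      # (H*W*k, 4)

anchors = generate_anchors(38, 50)             # 1,900 locations -> 17,100 anchors
```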
Translation-Invariant Anchors
An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors.
Translation invariance means that if an object is translated in the image, the corresponding proposal should translate with it, and the same function should be able to predict the proposal at any location. MultiBox (which generates its anchors by k-means clustering) does not guarantee that a translated object still yields the same proposal.
The translation-invariant property also reduces the model size. MultiBox has a (4+1)×800-dimensional fully-connected output layer, whereas our method has a (4+2)×9-dimensional convolutional output layer in the case of k = 9 anchors (4 location parameters plus 2 class scores per anchor, with 9 anchors per location). As a result, our output layer has 2.8 × 10^4 parameters (512 × (4+2) × 9 for VGG-16; the 512 is the channel dimension of the intermediate feature, not the number of sliding positions). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox. We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC (fewer parameters mean a lower risk of overfitting).
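Spelling out the arithmetic (the MultiBox output-layer count of about 6.1 × 10^6, with a 1536-d input feature, is the figure quoted in the paper):

$$512 \times (4+2) \times 9 = 27{,}648 \approx 2.8 \times 10^4 \quad \text{(RPN output layer, VGG-16)}$$

$$1536 \times (4+1) \times 800 = 6{,}144{,}000 \approx 6.1 \times 10^6 \quad \text{(MultiBox output layer)}$$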
Multiple scales and aspect ratios
As a comparison, our anchor-based method is built on a pyramid of anchors. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It relies only on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. (The feature map the anchors sit on stays at a single scale, and the filters on it are single-scale as well; all multi-scale handling comes from the anchors themselves.)
Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image. The design of multiscale anchors is a key component for sharing features without extra cost for addressing scales.
Loss Function
For training RPNs, we assign a binary class label (of being an object or not) to each anchor.
How positive anchors are labeled
We assign a positive label to two kinds of anchors:
(i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box (i.e., with some particular ground-truth box);
(ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples, but we still adopt the first condition because in some rare cases the second condition may find no positive sample.
How negative anchors are labeled
We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
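A numpy sketch of these labeling rules (function names are my own; a label of -1 marks anchors excluded from training):

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between (A, 4) anchors and (G, 4) ground-truth boxes."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """1 = positive, 0 = negative, -1 = does not contribute to training."""
    iou = iou_matrix(anchors, gt_boxes)
    labels = np.full(len(anchors), -1)
    max_iou = iou.max(axis=1)
    labels[max_iou < lo] = 0        # negative: IoU < 0.3 for all ground-truth boxes
    labels[max_iou > hi] = 1        # condition (ii): IoU > 0.7 with some ground truth
    labels[iou.argmax(axis=0)] = 1  # condition (i): highest-IoU anchor per ground truth
    return labels
```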
Loss function definition
Our loss function for an image is defined as:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
$i$ is the index of an anchor in a mini-batch.
$p_i$ is the predicted probability that anchor $i$ is an object.
$p_i^*$ is the ground-truth label:
$p_i^* = 1$ if the anchor is positive,
$p_i^* = 0$ if the anchor is negative.
$t_i$ is a vector of the 4 parameterized coordinates of the predicted box.
$t_i^*$ is the same parameterization of the ground-truth box associated with a positive anchor.
$L_{cls}$ is the classification loss: log loss over two classes (object vs. not object).
For the regression loss, $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$,
where $R$ is the smooth $L_1$ function, defined below.
The term $p_i^* L_{reg}$ means the regression loss is activated only for positive anchors ($p_i^* = 1$).
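The smooth $L_1$ loss, from Fast R-CNN, applied elementwise to $t_i - t_i^*$:

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$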
The outputs of the cls and reg layers consist of $\{p_i\}$ and $\{t_i\}$, respectively.
The two terms are normalized by $N_{cls}$ and $N_{reg}$ and weighted by a balancing parameter $\lambda$.
$N_{cls}$ is the mini-batch size: $N_{cls} = 256$.
$N_{reg}$ is the number of anchor locations: $N_{reg} \approx 2400$.
With $\lambda = 10$, the cls and reg terms are roughly equally weighted, and the results are insensitive to the value of $\lambda$ over a wide range.
We also note that the normalization as above is not required and could be simplified. (Neither $\lambda$ nor the exact normalization matters much.)
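A PyTorch sketch of this loss, assuming per-anchor labels produced by the rules above (this is my own condensed version, not the py-faster-rcnn implementation):

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, box_deltas, labels, target_deltas,
             n_cls=256.0, n_reg=2400.0, lam=10.0):
    """RPN loss over the anchors of one image.

    cls_logits:    (A, 2) object-vs-not scores
    box_deltas:    (A, 4) predicted t_i
    labels:        (A,)   long tensor; 1 = positive, 0 = negative, -1 = ignored
    target_deltas: (A, 4) ground-truth t_i*
    """
    used = labels >= 0  # anchors with label -1 do not contribute
    l_cls = F.cross_entropy(cls_logits[used], labels[used],
                            reduction='sum') / n_cls
    pos = labels == 1   # p_i* gates the regression term
    l_reg = F.smooth_l1_loss(box_deltas[pos], target_deltas[pos],
                             reduction='sum') / n_reg
    return l_cls + lam * l_reg
```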
For bounding box regression, we adopt the parameterizations of the 4 coordinates following:
(Detailed analysis of the formulas is deferred to the next reading.)
$x, y, w, h$ denote the box's center coordinates and its width and height.
$x, x_a, x^*$ are for the predicted box, anchor box, and ground-truth box respectively (likewise for $y, w, h$). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box. (The predicted bounding box is regressed from the anchor box toward the ground-truth box.)
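For reference, the parameterization given in the paper is:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)$$

$$t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)$$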
However, our method achieves bounding-box regression in a different manner from previous RoI-based (Region of Interest) methods, in which bounding-box regression is performed on features pooled from arbitrarily sized RoIs and the regression weights are shared by all region sizes. (Here, bounding-box regression does not rely on pooling arbitrarily sized RoIs.)
In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors is learned (multi-scale prediction is achieved through k anchors/regressors whose weights are not shared). Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.
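Inverting this parameterization turns predicted offsets back into boxes; a numpy sketch (the function name is my own):

```python
import numpy as np

def decode_boxes(anchors, deltas):
    """Apply predicted t = (tx, ty, tw, th) to anchors (x1, y1, x2, y2)
    to recover the predicted boxes, inverting the parameterization above."""
    wa = anchors[:, 2] - anchors[:, 0]
    ha = anchors[:, 3] - anchors[:, 1]
    xa = anchors[:, 0] + 0.5 * wa
    ya = anchors[:, 1] + 0.5 * ha
    x = deltas[:, 0] * wa + xa     # inverts tx = (x - xa) / wa
    y = deltas[:, 1] * ha + ya     # inverts ty = (y - ya) / ha
    w = np.exp(deltas[:, 2]) * wa  # inverts tw = log(w / wa)
    h = np.exp(deltas[:, 3]) * ha  # inverts th = log(h / ha)
    return np.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], axis=1)
```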
Training RPNs
The RPN can be trained end-to-end by backpropagation and stochastic gradient descent (SGD). Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this would bias towards negative samples since they dominate (there are far more negatives than positives). Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones (i.e., the remainder of the 256 anchors are filled with negatives, so the actual ratio can fall below 1:1).
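A numpy sketch of this sampling step (the function name is my own):

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, max_pos=128, rng=np.random):
    """Sample anchors at up to a 1:1 positive/negative ratio; if there are
    fewer than 128 positives, pad the mini-batch with negatives."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), max_pos)
    n_neg = min(len(neg), batch_size - n_pos)
    pos = rng.choice(pos, n_pos, replace=False)
    neg = rng.choice(neg, n_neg, replace=False)
    return np.concatenate([pos, neg])  # indices of the 256 sampled anchors
```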
We randomly initialize all new layers (note: only the new layers use Gaussian initialization) by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01.
All other layers (i.e., the shared convolutional layers) are initialized by pretraining a model for ImageNet classification.
Parameter settings: to be analyzed in the next reading.
Sharing Features for RPN and Fast R-CNN
Reference article:
http://blog.csdn.net/u011534057/article/details/51247371
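For context, the paper's 4-step alternating training can be summarized as follows (pseudo-code; every function name here is a placeholder, not a py-faster-rcnn API):

```python
# Pseudo-code outline of the paper's 4-step alternating training.
# All functions below are placeholders for illustration only.

# Step 1: train the RPN, initialized from an ImageNet-pretrained model.
rpn = train_rpn(init='imagenet')

# Step 2: train a separate Fast R-CNN detector, also ImageNet-initialized,
# using the proposals produced by the step-1 RPN (no sharing yet).
detector = train_fast_rcnn(init='imagenet', proposals=rpn.propose())

# Step 3: re-initialize the RPN from the detector's conv layers, freeze the
# shared conv layers, and fine-tune only the RPN-specific layers.
rpn = train_rpn(init=detector.shared_convs, freeze_shared=True)

# Step 4: keeping the shared conv layers fixed, fine-tune the Fast R-CNN
# specific layers; both tasks now share one set of conv features.
detector = train_fast_rcnn(init=detector, proposals=rpn.propose(),
                           freeze_shared=True)
```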