Faster R-CNN is a landmark paper from 2015 by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Its Region Proposal Network (RPN) broke with the old pattern in which proposal generation was a stage separate from the detection network, integrating region proposals and detection into a single network for the first time. Although YOLO and SSD later achieved better detection results, the creative idea behind the RPN is well worth studying.
Paper link
http://www.rossgirshick.info/
Code link
https://github.com/rbgirshick/py-faster-rcnn
First reading: 2017/2/13
Second reading: 2017/2/16
Abstract
In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features (what are the full-image convolutional features?) with the detection network. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look.
Introduction
Three problems with existing approaches
(1) Region proposals are the test-time computational bottleneck in state-of-the-art detection systems that are based on region proposal methods.
(2) Selective Search (SS) is an order of magnitude slower than the detection network, at 2 seconds per image in a CPU implementation.
(3) Although SS can be re-implemented on the GPU, doing so ignores the downstream detection network and therefore misses important opportunities for sharing computation.
Our proposed approach
(1) The RPN (Region Proposal Network) computes proposals with a deep convolutional neural network.
(2) The RPN shares convolutional layers with state-of-the-art object detection networks (sharing the conv features saves a great deal of computation at test time).
(3) The convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. (Core idea: conv features serve not only detection but also proposal generation.)
(4) On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) and can be trained end-to-end specifically for the task of generating detection proposals.
(5) RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios (the predicted proposals vary in size and aspect ratio). RPNs introduce novel “anchor” boxes (the key innovation) that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (I am still not entirely clear on this).
(6) To unify RPNs with Fast R-CNN object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.
(7) RPN and Faster R-CNN are broadly applicable: e.g., 3D object detection, part-based detection, instance segmentation, and image captioning.
Related work
Faster R-CNN
(Core description of the model) Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.
3.1 Region Proposal Network
A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.
Because our ultimate goal is to share computation with a Fast R-CNN object detection network, we assume that both nets share a common set of convolutional layers.
To generate region proposals, we slide a small network over the conv feature map output by the last shared conv layer.
The small network
Input: an n×n (n = 3) spatial window of the input conv feature map. (A 3×3 spatial window slides over the last shared conv feature map; note that this feature map is still multi-channel, not a single channel.)
Each sliding window is mapped to a lower-dimensional feature (256-d for ZF, 512-d for VGG-16).
This feature is fed into two sibling fully connected layers: a box-regression layer (reg) and a box-classification layer (cls).
(The following passage needs further study alongside the code to fully understand.)
Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n×n convolutional layer followed by two sibling 1×1 convolutional layers (for reg and cls, respectively), as in the sketch below.
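A minimal PyTorch sketch of this head (the class and layer names are my own, not from py-faster-rcnn; the 512-channel input assumes a VGG-16 conv5 feature map):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head: an n x n conv followed by two sibling 1 x 1 convs."""
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        # The n x n (here 3 x 3) sliding window, realized as a convolution.
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        # Sibling 1 x 1 convs: 2k objectness scores and 4k box offsets per location.
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feat):
        h = self.relu(self.conv(feat))   # shared lower-dimensional feature
        return self.cls(h), self.reg(h)  # (N, 2k, H, W), (N, 4k, H, W)

# Usage on a VGG-16-like conv5 feature map (512 channels):
scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
```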
Anchors
At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k.
So the reg layer has 4k outputs encoding the coordinates of the k boxes (4 values per box), and the cls layer outputs 2k scores that estimate the probability of object vs. not-object for each proposal (2 values per proposal). The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question (the anchor's center coincides with the sliding window's center), and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of size W×H (typically ~2,400 locations), there are W·H·k anchors in total.
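A numpy sketch of anchor generation under the paper's defaults (box areas of 128², 256², 512² pixels, aspect ratios 1:1, 1:2, 2:1, and a feature stride of 16 for VGG-16); the function name and the exact centering convention are my own, and py-faster-rcnn's implementation differs in details:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return all W*H*k anchors as (x1, y1, x2, y2), one set of k = 9
    anchors centered on each sliding-window position of the feature map."""
    # k base anchors centered at the origin, with area ~ scale^2.
    base = []
    for s in scales:
        for r in ratios:                       # r = h / w
            w, h = s / np.sqrt(r), s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                      # (k, 4)
    # Shift the base anchors to every feature-map location.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base).reshape(-1, 4)      # (H*W*k, 4)

anchors = generate_anchors(38, 50)             # 1,900 locations -> 17,100 anchors
```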
Translation-Invariant Anchors
An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors.
Translation invariance means that if an object is translated in the image, the corresponding proposal should translate with it, and the same function should be able to predict the proposal at any location. MultiBox (which generates its anchors by k-means clustering) does not guarantee that a translated object still yields the same proposal.
The translation-invariant property also reduces the model size. MultiBox has a (4+1)×800-dimensional fully-connected output layer, whereas our method has a (4+2)×9-dimensional convolutional output layer in the case of k = 9 anchors (4 location parameters plus 2 class scores per anchor, with 9 anchors per location). As a result, our output layer has 2.8 × 10^4 parameters (512 × (4+2) × 9 for VGG-16; the 512 is the channel dimension of the intermediate feature, not the number of sliding positions). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox. We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC (fewer parameters mean a lower risk of overfitting).
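Spelling out the arithmetic (the MultiBox output-layer count of about 6.1 × 10^6, with a 1536-d input feature, is the figure quoted in the paper):

$$512 \times (4+2) \times 9 = 27{,}648 \approx 2.8 \times 10^4 \quad \text{(RPN output layer, VGG-16)}$$

$$1536 \times (4+1) \times 800 = 6{,}144{,}000 \approx 6.1 \times 10^6 \quad \text{(MultiBox output layer)}$$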
Multiple scales and aspect ratios
As a comparison, our anchor-based method is built on a pyramid of anchors. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It relies only on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. (The feature map the anchors sit on stays at a single scale, and the filters on it are single-scale as well; all multi-scale handling comes from the anchors themselves.)
Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image. The design of multiscale anchors is a key component for sharing features without extra cost for addressing scales.
Loss Function
For training RPNs, we assign a binary class label (of being an object or not) to each anchor.
How positive anchors are labeled
We assign a positive label to two kinds of anchors:
(i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box (i.e., with some particular ground-truth box);
(ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples, but we still adopt the first condition because in some rare cases the second condition may find no positive sample.
How negative anchors are labeled
We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
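A numpy sketch of these labeling rules (function names are my own; a label of -1 marks anchors excluded from training):

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between (A, 4) anchors and (G, 4) ground-truth boxes."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """1 = positive, 0 = negative, -1 = does not contribute to training."""
    iou = iou_matrix(anchors, gt_boxes)
    labels = np.full(len(anchors), -1)
    max_iou = iou.max(axis=1)
    labels[max_iou < lo] = 0        # negative: IoU < 0.3 for all ground-truth boxes
    labels[max_iou > hi] = 1        # condition (ii): IoU > 0.7 with some ground truth
    labels[iou.argmax(axis=0)] = 1  # condition (i): highest-IoU anchor per ground truth
    return labels
```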
Loss function definition
Our loss function for an image is defined as:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
$i$ is the index of an anchor in a mini-batch.
$p_i$ is the predicted probability that anchor $i$ is an object.
$p_i^*$ is the ground-truth label:
$p_i^* = 1$ if the anchor is positive,
$p_i^* = 0$ if the anchor is negative.
$t_i$ is a vector of the 4 parameterized coordinates of the predicted box.
$t_i^*$ is the same parameterization of the ground-truth box associated with a positive anchor.
$L_{cls}$ is the classification loss: log loss over two classes (object vs. not object).
For the regression loss, $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$,
where $R$ is the smooth $L_1$ function, defined below.
The term $p_i^* L_{reg}$ means the regression loss is activated only for positive anchors ($p_i^* = 1$).
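The smooth $L_1$ loss, from Fast R-CNN, applied elementwise to $t_i - t_i^*$:

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$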
The outputs of the cls and reg layers consist of $\{p_i\}$ and $\{t_i\}$, respectively.
The two terms are normalized by $N_{cls}$ and $N_{reg}$ and weighted by a balancing parameter $\lambda$.
$N_{cls}$ is the mini-batch size: $N_{cls} = 256$.
$N_{reg}$ is the number of anchor locations: $N_{reg} \approx 2400$.
With $\lambda = 10$, the cls and reg terms are roughly equally weighted, and the results are insensitive to the value of $\lambda$ over a wide range.
We also note that the normalization as above is not required and could be simplified. (Neither $\lambda$ nor the exact normalization matters much.)
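A PyTorch sketch of this loss, assuming per-anchor labels produced by the rules above (this is my own condensed version, not the py-faster-rcnn implementation):

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, box_deltas, labels, target_deltas,
             n_cls=256.0, n_reg=2400.0, lam=10.0):
    """RPN loss over the anchors of one image.

    cls_logits:    (A, 2) object-vs-not scores
    box_deltas:    (A, 4) predicted t_i
    labels:        (A,)   long tensor; 1 = positive, 0 = negative, -1 = ignored
    target_deltas: (A, 4) ground-truth t_i*
    """
    used = labels >= 0  # anchors with label -1 do not contribute
    l_cls = F.cross_entropy(cls_logits[used], labels[used],
                            reduction='sum') / n_cls
    pos = labels == 1   # p_i* gates the regression term
    l_reg = F.smooth_l1_loss(box_deltas[pos], target_deltas[pos],
                             reduction='sum') / n_reg
    return l_cls + lam * l_reg
```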
For bounding box regression, we adopt the parameterizations of the 4 coordinates following:
(Detailed analysis of the formulas is deferred to the next reading.)
$x, y, w, h$ denote the box's center coordinates and its width and height.
$x, x_a, x^*$ are for the predicted box, anchor box, and ground-truth box respectively (likewise for $y, w, h$). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box. (The predicted bounding box is regressed from the anchor box toward the ground-truth box.)
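For reference, the parameterization given in the paper is:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)$$

$$t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)$$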
However, our method achieves bounding-box regression in a different manner from previous RoI-based (Region of Interest) methods, in which bounding-box regression is performed on features pooled from arbitrarily sized RoIs and the regression weights are shared by all region sizes. (Here, bounding-box regression does not rely on pooling arbitrarily sized RoIs.)
In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors is learned (multi-scale prediction is achieved through k anchors/regressors whose weights are not shared). Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.
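Inverting this parameterization turns predicted offsets back into boxes; a numpy sketch (the function name is my own):

```python
import numpy as np

def decode_boxes(anchors, deltas):
    """Apply predicted t = (tx, ty, tw, th) to anchors (x1, y1, x2, y2)
    to recover the predicted boxes, inverting the parameterization above."""
    wa = anchors[:, 2] - anchors[:, 0]
    ha = anchors[:, 3] - anchors[:, 1]
    xa = anchors[:, 0] + 0.5 * wa
    ya = anchors[:, 1] + 0.5 * ha
    x = deltas[:, 0] * wa + xa     # inverts tx = (x - xa) / wa
    y = deltas[:, 1] * ha + ya     # inverts ty = (y - ya) / ha
    w = np.exp(deltas[:, 2]) * wa  # inverts tw = log(w / wa)
    h = np.exp(deltas[:, 3]) * ha  # inverts th = log(h / ha)
    return np.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], axis=1)
```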
Training RPNs
The RPN can be trained end-to-end by backpropagation and stochastic gradient descent (SGD). Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this would bias towards negative samples since they dominate (there are far more negatives than positives). Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones (i.e., the remainder of the 256 anchors are filled with negatives, so the actual ratio can fall below 1:1).
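A numpy sketch of this sampling step (the function name is my own):

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, max_pos=128, rng=np.random):
    """Sample anchors at up to a 1:1 positive/negative ratio; if there are
    fewer than 128 positives, pad the mini-batch with negatives."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), max_pos)
    n_neg = min(len(neg), batch_size - n_pos)
    pos = rng.choice(pos, n_pos, replace=False)
    neg = rng.choice(neg, n_neg, replace=False)
    return np.concatenate([pos, neg])  # indices of the 256 sampled anchors
```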
We randomly initialize all new layers (note: only the new layers use Gaussian initialization) by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01.
All other layers (i.e., the shared convolutional layers) are initialized by pretraining a model for ImageNet classification.
Parameter settings: to be analyzed in the next reading.
Sharing Features for RPN and Fast R-CNN
Reference article:
http://blog.csdn.net/u011534057/article/details/51247371
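For context, the paper's 4-step alternating training can be summarized as follows (pseudo-code; every function name here is a placeholder, not a py-faster-rcnn API):

```python
# Pseudo-code outline of the paper's 4-step alternating training.
# All functions below are placeholders for illustration only.

# Step 1: train the RPN, initialized from an ImageNet-pretrained model.
rpn = train_rpn(init='imagenet')

# Step 2: train a separate Fast R-CNN detector, also ImageNet-initialized,
# using the proposals produced by the step-1 RPN (no sharing yet).
detector = train_fast_rcnn(init='imagenet', proposals=rpn.propose())

# Step 3: re-initialize the RPN from the detector's conv layers, freeze the
# shared conv layers, and fine-tune only the RPN-specific layers.
rpn = train_rpn(init=detector.shared_convs, freeze_shared=True)

# Step 4: keeping the shared conv layers fixed, fine-tune the Fast R-CNN
# specific layers; both tasks now share one set of conv features.
detector = train_fast_rcnn(init=detector, proposals=rpn.propose(),
                           freeze_shared=True)
```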