
DL - YoloV3: Translation and Interpretation of the YOLOv3 Paper "YOLOv3: An Incremental Improvement" (Part 2)

2.3. Predictions Across Scales

    YOLOv3 predicts boxes at 3 different scales. Our system extracts features from those scales using a similar concept to feature pyramid networks [8]. From our base feature extractor we add several convolutional layers. The last of these predicts a 3-d tensor encoding bounding box, objectness, and class predictions. In our experiments with COCO [10] we predict 3 boxes at each scale so the tensor is N × N × [3 ∗ (4 + 1 + 80)] for the 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.

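To make the tensor layout concrete, here is a minimal sketch (not the authors' code). The 416×416 input and the resulting 13/26/52 grids are common YOLOv3 settings assumed here purely for illustration:

```python
# Per-scale output layout: N x N x [3 * (4 + 1 + 80)] for COCO.
NUM_ANCHORS_PER_SCALE = 3  # 3 box priors at each scale
NUM_BOX_OFFSETS = 4        # tx, ty, tw, th
NUM_OBJECTNESS = 1
NUM_CLASSES = 80           # COCO

channels = NUM_ANCHORS_PER_SCALE * (NUM_BOX_OFFSETS + NUM_OBJECTNESS + NUM_CLASSES)
print(channels)  # 255

# Assumed 416x416 input -> grid sizes N of 13, 26, 52 (strides 32, 16, 8)
for n in (13, 26, 52):
    print((n, n, channels))  # (13, 13, 255), (26, 26, 255), (52, 52, 255)
```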

      Next we take the feature map from 2 layers previous and upsample it by 2×. We also take a feature map from earlier in the network and merge it with our upsampled features using concatenation. This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map. We then add a few more convolutional layers to process this combined feature map, and eventually predict a similar tensor, although now twice the size. We perform the same design one more time to predict boxes for the final scale. Thus our predictions for the 3rd scale benefit from all the prior computation as well as fine-grained features from early on in the network. We still use k-means clustering to determine our bounding box priors. We just sort of chose 9 clusters and 3 scales arbitrarily and then divide up the clusters evenly across scales. On the COCO dataset the 9 clusters were: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).
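A minimal PyTorch sketch of this upsample-and-concatenate merge. The channel counts and grid sizes are illustrative assumptions, not the actual Darknet layer widths:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

coarse = torch.randn(1, 256, 13, 13)   # feature map from 2 layers back (assumed shape)
earlier = torch.randn(1, 512, 26, 26)  # finer-grained map from earlier in the network

up = F.interpolate(coarse, scale_factor=2, mode="nearest")  # -> (1, 256, 26, 26)
merged = torch.cat([up, earlier], dim=1)                    # -> (1, 768, 26, 26)

# a few more convolutional layers would process the combined map;
# a 1x1 head then predicts the 2nd-scale tensor, now twice the grid size
head = nn.Conv2d(768, 3 * (4 + 1 + 80), kernel_size=1)
pred = head(merged)
print(pred.shape)  # torch.Size([1, 255, 26, 26])
```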

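The prior selection can be sketched as k-means over ground-truth (width, height) pairs. This toy version runs plain Euclidean k-means on random stand-in data for brevity; the actual clustering is run on COCO annotations, and the related YOLOv2 work used a 1 - IoU distance instead:

```python
import numpy as np

rng = np.random.default_rng(0)
boxes = rng.uniform(5, 400, size=(5000, 2))  # hypothetical (w, h) pairs, not COCO data

k = 9
centroids = boxes[rng.choice(len(boxes), k, replace=False)]
for _ in range(50):
    # assign each box to its nearest centroid, then recompute the means
    dists = np.linalg.norm(boxes[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([boxes[labels == i].mean(axis=0) if np.any(labels == i)
                          else centroids[i] for i in range(k)])

# sort priors by area and split evenly: 3 smallest -> finest scale, and so on
priors = sorted(map(tuple, centroids.round().astype(int)), key=lambda wh: wh[0] * wh[1])
scales = [priors[0:3], priors[3:6], priors[6:9]]
print(scales)
```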

2.4. Feature Extractor

     We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it.... wait for it..... Darknet-53!

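A minimal PyTorch sketch of the building block this describes: successive 1×1 and 3×3 convolutional layers wrapped in a shortcut connection. The channel-halving bottleneck, batch norm, and LeakyReLU slope follow common Darknet-53 reimplementations and are assumptions here, not taken from the text:

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            # 1x1 conv squeezes channels, 3x3 conv restores them
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # the shortcut connection

x = torch.randn(1, 64, 52, 52)
print(DarknetResidual(64)(x).shape)  # torch.Size([1, 64, 52, 52])
```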

Table 1. Darknet-53.

    This new network is much more powerful than Darknet-19 but still more efficient than ResNet-101 or ResNet-152. Here are some ImageNet results:


Table 2. Comparison of backbones. Accuracy, billions of operations, billion floating point operations per second, and FPS for various networks.


     Each network is trained with identical settings and tested at 256×256, single crop accuracy. Run times are measured on a Titan X at 256 × 256. Thus Darknet-53 performs on par with state-of-the-art classifiers but with fewer floating point operations and more speed. Darknet-53 is better than ResNet-101 and 1.5× faster. Darknet-53 has similar performance to ResNet-152 and is 2× faster. Darknet-53 also achieves the highest measured floating point operations per second. This means the network structure better utilizes the GPU, making it more efficient to evaluate and thus faster. That’s mostly because ResNets have just way too many layers and aren’t very efficient.


2.5. Training

    We still train on full images with no hard negative mining or any of that stuff. We use multi-scale training, lots of data augmentation, batch normalization, all the standard stuff. We use the Darknet neural network framework for training and testing [14].

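As one concrete example of the multi-scale part, many YOLO reimplementations resize each batch to a randomly chosen input size that is a multiple of 32 (the network stride). The specific 320-608 range below is borrowed from common practice and is an assumption, not something stated in this section:

```python
import random
import torch
import torch.nn.functional as F

def random_rescale(images: torch.Tensor) -> torch.Tensor:
    # pick a new square input size from 320, 352, ..., 608 (multiples of 32)
    size = random.choice(range(320, 609, 32))
    return F.interpolate(images, size=(size, size),
                         mode="bilinear", align_corners=False)

batch = torch.randn(8, 3, 416, 416)
print(random_rescale(batch).shape)  # e.g. torch.Size([8, 3, 512, 512])
```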
