Fast Online Object Tracking and Segmentation: A Unifying Approach

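Abstract

The paper presents a single, simple approach that performs both visual object tracking (VOT) and semi-supervised video object segmentation (VOS) in real time.

The method, called SiamMask, improves the offline training procedure of fully-convolutional Siamese approaches by augmenting their loss with a binary segmentation task.

Once trained, SiamMask relies only on a single bounding box (bbox) for initialisation and operates online, producing class-agnostic object segmentation masks and rotated bboxes at 55 frames per second.

The strategy establishes a new state of the art among real-time trackers on VOT-2018, while also showing competitive performance and the best speed for semi-supervised VOS on DAVIS-2016 and DAVIS-2017.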

Project page: http://www.robots.ox.ac.uk/~qwang/SiamMask

1. Introduction

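Tracking is a fundamental task in any video application that requires some degree of reasoning about objects of interest,
because it establishes object correspondences between frames [34].
It is used in a wide range of scenarios, such as automatic surveillance, vehicle navigation, video labelling, human-computer interaction and activity recognition.
Given the location of an arbitrary object of interest in the first frame of a video, the aim of VOT is to estimate its position in all subsequent frames as accurately as possible [48].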

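For many applications it is important that tracking can be performed online, while the video is streaming. In other words, the tracker should not use future frames to reason about the current position of the object [26].
This is the scenario portrayed by VOT benchmarks, which represent the target object with a simple axis-aligned (e.g. [56, 52]) or rotated [26, 27] bbox.
Such a simple annotation keeps the cost of data labelling low. More importantly, it allows a user to initialise the target quickly and simply.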

2. Related work

VOT

Semi-supervised VOS

3. Method

3.1. Fully-convolutional Siamese networks

【SiamFC】
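As the basic building block of the tracking system, an offline-trained fully-convolutional Siamese network compares an exemplar image z against a larger search image x to obtain a dense response map.
z is a w×h crop centred on the target object, and x is a larger crop centred on the last estimated position of the target.
The two inputs are processed by the same CNN fθ, yielding two feature maps that are then cross-correlated.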
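
The comparison step described above can be sketched as a sliding-window cross-correlation between the two feature maps. The snippet below is a minimal NumPy illustration, not the authors' implementation: the function name `cross_correlate`, the toy feature shapes, and the hand-built arrays standing in for the real CNN features fθ(z) and fθ(x) are all assumptions made for the example.

```python
import numpy as np


def cross_correlate(feat_z, feat_x):
    """Slide the exemplar features over the search features (valid positions only).

    feat_z: (C, hz, wz) array standing in for f_theta(z), the exemplar features.
    feat_x: (C, hx, wx) array standing in for f_theta(x), the search features.
    Returns a (hx - hz + 1, wx - wz + 1) response map.
    """
    c, hz, wz = feat_z.shape
    cx, hx, wx = feat_x.shape
    if c != cx or hz > hx or wz > wx:
        raise ValueError("exemplar must share channels with, and fit inside, the search map")
    response = np.empty((hx - hz + 1, wx - wz + 1), dtype=np.float64)
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            # Inner product between the exemplar and one candidate window.
            window = feat_x[:, i:i + hz, j:j + wz]
            response[i, j] = np.sum(window * feat_z)
    return response


# Toy data (hypothetical): the search map is empty except for one copy of the exemplar.
feat_z = np.arange(1, 37, dtype=np.float64).reshape(4, 3, 3)
feat_x = np.zeros((4, 10, 10))
feat_x[:, 3:6, 5:8] = feat_z

response = cross_correlate(feat_z, feat_x)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
print(response.shape, peak)  # the peak marks where the exemplar was planted
```

In this toy setting the strongest response sits at the window where the exemplar was planted. A real tracker would obtain both feature maps from the shared CNN and would typically compute the correlation as a batched convolution rather than a Python loop.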

【SiamRPN】
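SiamRPN considerably improves on SiamFC by relying on a region proposal network (RPN) [46, 14], which allows the target location to be estimated with a bbox of variable aspect ratio.
In SiamRPN, each RoW (response of a candidate window, i.e. one spatial element of the response map) encodes a set of k anchor box proposals and the corresponding object/background scores.
SiamRPN therefore outputs box predictions and classification scores in parallel.
The two output branches are trained with smooth L1 and cross-entropy losses, respectively [28, Section 3.2].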
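
To make the two training losses concrete, here is a minimal NumPy sketch of a smooth L1 loss for a box branch and a softmax cross-entropy loss for a score branch, evaluated on the k anchors of a single RoW. It is an illustration under stated assumptions, not SiamRPN's code: the function names, the `beta` threshold, the value k = 5, the random toy tensors, and averaging the box loss over every anchor (rather than only over positive anchors with normalised offsets) are all simplifications chosen for the example.

```python
import numpy as np


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic for small errors, linear for large ones (mean over elements)."""
    diff = np.abs(pred - target)
    per_elem = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(per_elem.mean())


def cross_entropy(logits, labels):
    """Softmax cross-entropy over (N, num_classes) logits and (N,) integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(labels.shape[0]), labels].mean())


# Toy data for one RoW (all values hypothetical).
k = 5                                    # number of anchors per RoW
rng = np.random.default_rng(0)
box_pred = rng.normal(size=(k, 4))       # predicted box offsets, 4 values per anchor
box_target = rng.normal(size=(k, 4))     # regression targets, 4 values per anchor
score_logits = rng.normal(size=(k, 2))   # background/object logits per anchor
labels = np.array([1, 0, 0, 0, 0])       # 1 = object, 0 = background

box_loss = smooth_l1(box_pred, box_target)
score_loss = cross_entropy(score_logits, labels)
print(f"box loss: {box_loss:.4f}, score loss: {score_loss:.4f}")
```

The two branches share the same k anchors but produce separate outputs, which is why the box and score losses can be computed independently and in parallel.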

3.2. SiamMask

Loss function

Mask representation

Two variants

Box generation

3.3. Implementation details

Network architecture

Training

Inference

4. Experiments

4.1. VOT evaluation

Datasets and settings.

How much does the object representation matter?

Results on VOT-2018 and VOT-2016.

4.2. Semi-supervised VOS evaluation

Datasets and settings.

Results on DAVIS and YouTube-VOS.

4.3. Further analysis

Network architecture

Multi-task training

Timing.

Failure cases.

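Conclusion

The paper introduces SiamMask, a simple approach that enables fully-convolutional Siamese trackers to produce class-agnostic binary segmentation masks of the target object.

It shows how the approach can be applied successfully to both VOT and semi-supervised VOS.

SiamMask achieves better accuracy than state-of-the-art trackers and, at the same time, the fastest speed among VOS methods.

The two proposed variants of SiamMask are initialised with a simple bbox, operate online, run in real time, and require no adaptation to the test sequence.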

Acknowledgements

References

[1] L. Bao, B. Wu, and W. Liu. Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatiotemporal mrf. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[2] L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, 2016.

[3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision workshops, 2016.

[4] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[5] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[7] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[8] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. Fast and accurate online video object segmentation via tracking parts. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[9] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. Segflow: Joint learning for video object segmentation and optical flow. In IEEE International Conference on Computer Vision, 2017.

[10] H. Ci, C. Wang, and Y. Wang. Video object segmentation by learning location-sensitive embeddings. In European Conference on Computer Vision, 2018.

[11] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In IEEE Conference on Computer Vision and Pattern Recognition, 2000.

[12] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. Eco: Efficient convolution operators for tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[13] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In IEEE International Conference on Computer Vision, 2015.

[14] C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track and track to detect. In IEEE International Conference on Computer Vision, 2017.

[15] A. He, C. Luo, X. Tian, and W. Zeng. Towards a better match in siamese network based visual object tracker. In European Conference on Computer Vision workshops, 2018.

[16] A. He, C. Luo, X. Tian, and W. Zeng. A twofold siamese network for real-time object tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, 2017.

[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[19] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, 2016.

[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[21] Y.-T. Hu, J.-B. Huang, and A. G. Schwing. Videomatch: Matching based video object segmentation. In European Conference on Computer Vision, 2018.

[22] V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[23] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. In IEEE Conference on Computer Vision and Pattern Recognition workshops, 2017.

[24] H. Kiani Galoogahi, T. Sim, and S. Lucey. Multi-channel correlation filters. In IEEE International Conference on Computer Vision, 2013.

[25] H. Kiani Galoogahi, T. Sim, and S. Lucey. Correlation filters with limited boundaries. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[26] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, T. Vojíř, G. Häger, A. Lukežič, G. Fernández, et al. The visual object tracking vot2016 challenge results. In European Conference on Computer Vision, 2016.

[27] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernandez, et al. The sixth visual object tracking vot-2018 challenge results. In European Conference on Computer Vision workshops, 2018.

[28] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[29] F. Li, C. Tian, W. Zuo, L. Zhang, and M.-H. Yang. Learning spatial-temporal regularized correlation filters for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[30] X. Li and C. C. Loy. Video object segmentation with joint re-identification and attention-aware mask propagation. In European Conference on Computer Vision, 2018.

[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.

[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[33] A. Lukezic, T. Vojir, L. C. Zajc, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[34] T. Makovski, G. A. Vazquez, and Y. V. Jiang. Visual learning in multiple-object tracking. PLoS One, 2008.

[35] K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[36] N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bilateral space video segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[37] O. Miksik, J.-M. Pérez-Rúa, P. H. Torr, and P. Pérez. Roam: a rich object appearance model with application to rotoscoping. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[38] F. Perazzi. Video Object Segmentation. PhD thesis, ETH Zurich, 2017.

[39] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[40] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[41] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung. Fully connected object proposals for video segmentation. In IEEE International Conference on Computer Vision, 2015.

[42] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet. Color-Based Probabilistic Tracking. In European Conference on Computer Vision, 2002.

[43] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, 2015.

[44] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, 2016.

[45] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

[46] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.

[47] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.

[48] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.

[49] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[50] Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[51] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. S. Torr. End-to-end representation learning for correlation filter based tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[52] J. Valmadre, L. Bertinetto, J. F. Henriques, R. Tao, A. Vedaldi, A. Smeulders, P. H. S. Torr, and E. Gavves. Long-term tracking in the wild: A benchmark. In European Conference on Computer Vision, 2018.

[53] P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In British Machine Vision Conference, 2017.

[54] T. Vojir and J. Matas. Pixel-wise object segmentations for the vot 2016 dataset. Research Report CTU-CMP-2017-01, Center for Machine Perception, Czech Technical University, Prague, Czech Republic, 2017.

[55] L. Wen, D. Du, Z. Lei, S. Z. Li, and M.-H. Yang. Jots: Joint online tracking and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[56] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[57] S. Wug Oh, J.-Y. Lee, K. Sunkavalli, and S. Joo Kim. Fast video object segmentation by reference-guided mask propagation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[58] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang. Youtube-vos: Sequence-to-sequence video object segmentation. In European Conference on Computer Vision, 2018.

[59] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[60] T. Yang and A. B. Chan. Learning dynamic memory networks for object tracking. In European Conference on Computer Vision, 2018.

[61] D. Yeo, J. Son, B. Han, and J. H. Han. Superpixel-based tracking-by-segmentation using markov chains. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[62] J. S. Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. S. Kweon. Pixel-level matching for video object segmentation using convolutional neural networks. In IEEE International Conference on Computer Vision, 2017.

[63] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In European Conference on Computer Vision, 2018.

A. Architectural details

Network backbone

Network heads

Mask refinement module

B. Further qualitative results

Different masks at different locations

Benchmark sequences
