Fast Online Object Tracking and Segmentation: A Unifying Approach
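Abstract
The paper proposes a unified approach to real-time visual object tracking (VOT) and semi-supervised video object segmentation (VOS).
The method, called SiamMask, improves the offline training procedure of fully-convolutional Siamese approaches by augmenting their loss with a binary segmentation task.
Once trained, SiamMask relies on a single bounding box (bbox) for initialisation and operates online, producing class-agnostic object segmentation masks and rotated bounding boxes at up to 55 frames per second.
The method achieves state-of-the-art tracking results among real-time trackers on VOT-2018, together with competitive accuracy and the best speed on the semi-supervised VOS task on DAVIS-2016 and DAVIS-2017.
Project page: http://www.robots.ox.ac.uk/~qwang/SiamMask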
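1. Introduction
Tracking is a fundamental task in any video application that requires some degree of reasoning about objects of interest,
because it establishes object correspondences between frames [34].
It is used in a wide range of scenarios, such as automatic surveillance, vehicle navigation, video labelling, human-computer interaction and activity recognition.
Given the location of an arbitrary object of interest in the first frame of a video, the aim of VOT is to estimate its position in all subsequent frames as accurately as possible [48].
For many applications it is important that tracking can be performed online, while the video is streaming. In other words, the tracker should not use future frames to reason about the current position of the object [26].
This is the scenario portrayed by VOT benchmarks, which represent the target object with a simple axis-aligned (e.g. [56, 52]) or rotated [26, 27] bbox.
Such a simple annotation keeps the cost of data labelling low. More importantly, it lets a user initialise the target quickly and simply.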
2. Related work
VOT
Semi-supervised VOS
3. Method
3.1. Fully-convolutional Siamese networks
【SiamFC】
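As the basic building block of the tracking system, an offline-trained fully-convolutional Siamese network compares an exemplar image z against a slightly larger search image x to obtain a response map.
z is a w×h crop centred on the target object, and x is a larger crop centred on the last estimated position of the target.
Both inputs are processed by the same CNN fθ, and the two resulting feature maps are cross-correlated.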
![](https://img.laitimes.com/img/9ZDMuAjOiMmIsIjOiQnIsIyZuBnLyQjN2IDNyIjM1ATMwAjMwIzLc52YucWbp5GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.png)
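The comparison of z and x can be illustrated with a sliding-window cross-correlation over feature maps. The following is only a minimal sketch in plain Python: the function names `cross_correlate` and `argmax_2d`, the nested-list layout `[channel][row][col]`, and the toy values are assumptions made for illustration, and the feature extractor fθ is replaced by hand-written features rather than a real CNN.

```python
def cross_correlate(z_feat, x_feat):
    """Slide the exemplar features over the search features.

    Both arguments are nested lists indexed as [channel][row][col].
    Returns a 2D response map of size (xh - zh + 1) x (xw - zw + 1).
    """
    channels = len(z_feat)
    zh, zw = len(z_feat[0]), len(z_feat[0][0])
    xh, xw = len(x_feat[0]), len(x_feat[0][0])
    response = []
    for i in range(xh - zh + 1):
        row = []
        for j in range(xw - zw + 1):
            # Sum of element-wise products over all channels and positions.
            score = 0.0
            for c in range(channels):
                for u in range(zh):
                    for v in range(zw):
                        score += z_feat[c][u][v] * x_feat[c][i + u][j + v]
            row.append(score)
        response.append(row)
    return response


def argmax_2d(response):
    """Return the (row, col) index of the largest response."""
    best = max((value, i, j)
               for i, row in enumerate(response)
               for j, value in enumerate(row))
    return best[1], best[2]


# Toy single-channel features standing in for f_theta(z) and f_theta(x).
exemplar = [[[1, 2], [3, 4]]]
search = [[[0, 0, 0, 0],
           [0, 0, 1, 2],
           [0, 0, 3, 4],
           [0, 0, 0, 0]]]
response = cross_correlate(exemplar, search)
peak = argmax_2d(response)  # (1, 2): where the exemplar pattern sits in the search map
```

In practice this operation would be a batched convolution on a GPU. The nested loops are kept here only to make explicit that each element of the response map scores one candidate window.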
【SiamRPN】
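SiamRPN considerably improves on the performance of SiamFC by relying on a region proposal network (RPN) [46, 14], which allows the target location to be estimated with a bbox of variable aspect ratio.
In particular, in SiamRPN each element of the response map (the response of a candidate window, RoW) encodes a set of k anchor box proposals and the corresponding object/background scores.
SiamRPN therefore outputs box predictions in parallel with classification scores.
The two output branches are trained with the smooth L1 and cross-entropy losses, respectively [28, Section 3.2].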
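The two losses named above can be sketched as follows. This is an illustrative plain-Python sketch, not the SiamRPN implementation: the helper names `smooth_l1`, `cross_entropy` and `row_loss`, the threshold `beta=1.0`, the equal weighting of the two terms, and the choice to regress only positive anchors (a common RPN convention) are all assumptions.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over box coordinates (quadratic near zero, linear beyond beta)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total


def cross_entropy(logits, label):
    """Softmax cross-entropy for one anchor; logits are [background, object]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]


def row_loss(anchor_deltas, target_deltas, anchor_logits, anchor_labels):
    """Loss for one response position carrying k anchors (hypothetical equal weighting)."""
    k = len(anchor_labels)
    cls = sum(cross_entropy(l, y) for l, y in zip(anchor_logits, anchor_labels)) / k
    # Assumption: only anchors labelled as object (1) contribute to box regression.
    positives = [i for i, y in enumerate(anchor_labels) if y == 1]
    reg = sum(smooth_l1(anchor_deltas[i], target_deltas[i]) for i in positives)
    reg /= max(len(positives), 1)
    return cls + reg


# Two anchors at one position: the first is an object, the second is background.
loss = row_loss(
    anchor_deltas=[[0.5, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]],
    target_deltas=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    anchor_logits=[[0.0, 0.0], [0.0, 0.0]],
    anchor_labels=[1, 0],
)
```

The background anchor's large box error does not affect `loss`, because under the stated assumption only the positive anchor is regressed.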
3.2. SiamMask
Loss function
Mask representation
Two variants
Box generation
3.3. Implementation details
Network architecture
Training
Inference
4. Experiments
4.1. Evaluation for VOT
Datasets and settings.
How much does the object representation matter?
Results on VOT-2018 and VOT-2016.
4.2. Evaluation for semi-supervised VOS
Datasets and settings.
Results on DAVIS and YouTube-VOS.
4.3. Further analysis
Network architecture
Multi-task training
Timing.
Failure cases.
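Conclusion
The paper introduces SiamMask, an approach that enables fully-convolutional Siamese trackers to produce class-agnostic binary segmentation masks of the target object.
It shows how the approach can be applied successfully to both VOT and semi-supervised VOS,
reaching state-of-the-art accuracy among real-time trackers while also being the fastest among VOS methods.
The two proposed variants of SiamMask are initialised with a simple bbox, operate online, run in real time and require no adaptation to the test sequence.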
Acknowledgements
References
[1] L. Bao, B. Wu, and W. Liu. CNN in MRF: Video object segmentation via inference in a CNN-based higher-order spatiotemporal MRF. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[2] L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, 2016.
[3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision workshops, 2016.
[4] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[5] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[7] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[8] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. Fast and accurate online video object segmentation via tracking parts. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[9] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. SegFlow: Joint learning for video object segmentation and optical flow. In IEEE International Conference on Computer Vision, 2017.
[10] H. Ci, C. Wang, and Y. Wang. Video object segmentation by learning location-sensitive embeddings. In European Conference on Computer Vision, 2018.
[11] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[12] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. ECO: Efficient convolution operators for tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[13] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In IEEE International Conference on Computer Vision, 2015.
[14] C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track and track to detect. In IEEE International Conference on Computer Vision, 2017.
[15] A. He, C. Luo, X. Tian, and W. Zeng. Towards a better match in siamese network based visual object tracker. In European Conference on Computer Vision workshops, 2018.
[16] A. He, C. Luo, X. Tian, and W. Zeng. A twofold siamese network for real-time object tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, 2017.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[19] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, 2016.
[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[21] Y.-T. Hu, J.-B. Huang, and A. G. Schwing. VideoMatch: Matching based video object segmentation. In European Conference on Computer Vision, 2018.
[22] V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[23] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. In IEEE Conference on Computer Vision and Pattern Recognition workshops, 2017.
[24] H. Kiani Galoogahi, T. Sim, and S. Lucey. Multi-channel correlation filters. In IEEE International Conference on Computer Vision, 2013.
[25] H. Kiani Galoogahi, T. Sim, and S. Lucey. Correlation filters with limited boundaries. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[26] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, T. Vojíř, G. Häger, A. Lukežič, G. Fernández, et al. The visual object tracking VOT2016 challenge results. In European Conference on Computer Vision, 2016.
[27] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernandez, et al. The sixth visual object tracking VOT-2018 challenge results. In European Conference on Computer Vision workshops, 2018.
[28] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[29] F. Li, C. Tian, W. Zuo, L. Zhang, and M.-H. Yang. Learning spatial-temporal regularized correlation filters for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[30] X. Li and C. C. Loy. Video object segmentation with joint re-identification and attention-aware mask propagation. In European Conference on Computer Vision, 2018.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[33] A. Lukezic, T. Vojir, L. C. Zajc, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[34] T. Makovski, G. A. Vazquez, and Y. V. Jiang. Visual learning in multiple-object tracking. PLoS One, 2008.
[35] K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[36] N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung. Bilateral space video segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[37] O. Miksik, J.-M. Pérez-Rúa, P. H. Torr, and P. Pérez. ROAM: a rich object appearance model with application to rotoscoping. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[38] F. Perazzi. Video Object Segmentation. PhD thesis, ETH Zurich, 2017.
[39] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[40] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[41] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung. Fully connected object proposals for video segmentation. In IEEE International Conference on Computer Vision, 2015.
[42] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet. Color-based probabilistic tracking. In European Conference on Computer Vision, 2002.
[43] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, 2015.
[44] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, 2016.
[45] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
[46] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[47] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
[48] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[49] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[50] Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[51] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. S. Torr. End-to-end representation learning for correlation filter based tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[52] J. Valmadre, L. Bertinetto, J. F. Henriques, R. Tao, A. Vedaldi, A. Smeulders, P. H. S. Torr, and E. Gavves. Long-term tracking in the wild: A benchmark. In European Conference on Computer Vision, 2018.
[53] P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In British Machine Vision Conference, 2017.
[54] T. Vojir and J. Matas. Pixel-wise object segmentations for the VOT 2016 dataset. Research Report CTU-CMP-2017–01, Center for Machine Perception, Czech Technical University, Prague, Czech Republic, 2017.
[55] L. Wen, D. Du, Z. Lei, S. Z. Li, and M.-H. Yang. JOTS: Joint online tracking and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[56] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[57] S. Wug Oh, J.-Y. Lee, K. Sunkavalli, and S. Joo Kim. Fast video object segmentation by reference-guided mask propagation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[58] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang. YouTube-VOS: Sequence-to-sequence video object segmentation. In European Conference on Computer Vision, 2018.
[59] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[60] T. Yang and A. B. Chan. Learning dynamic memory networks for object tracking. In European Conference on Computer Vision, 2018.
[61] D. Yeo, J. Son, B. Han, and J. H. Han. Superpixel-based tracking-by-segmentation using Markov chains. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[62] J. S. Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. S. Kweon. Pixel-level matching for video object segmentation using convolutional neural networks. In IEEE International Conference on Computer Vision, 2017.
[63] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In European Conference on Computer Vision, 2018.