
CVPR'24 | Plug and play! No need to retrain! iKUN: Designate any target for tracking

Author: 3D Vision Workshop

Source: 3D Vision Workshop

Add the assistant: dddvision, note: direction + school/company + nickname, and you will be pulled into the group. Industry subdivisions are attached at the end of the article.

0. What is this article about?

Traditional multi-object tracking (MOT) is designed to track all objects of specific classes frame by frame, which plays a crucial role in video understanding. Despite significant progress, it suffers from poor flexibility and generalization. To address this, the referring multi-object tracking (RMOT) task was recently proposed; its core idea is to guide multi-object tracking with a language description. For example, given the query "a moving car on the left", the tracker should predict all trajectories that match the description. However, this flexibility comes at a cost: the model needs to perform detection, association, and referring at the same time, so balancing the optimization among these subtasks becomes a key issue.

To accomplish this task, existing methods (e.g., TransRMOT) simply integrate a text module into an existing tracker. However, this framework has several inherent drawbacks: i) task competition: some MOT methods have already revealed an optimization competition between detection and association, and in RMOT the added referring subtask further exacerbates it; ii) engineering cost: whenever the baseline tracker is replaced, the code must be rewritten and the entire framework retrained; iii) training cost: jointly training all subtasks incurs a high computational cost.

Essentially, the tight coupling of the tracking and referring subtasks is the main cause of these limitations, which raises a natural question: is it possible to decouple the two subtasks? This work proposes a "tracking-then-referring" framework into which a module called iKUN is inserted: all candidate objects are tracked first, and the query objects are then identified from the language description. The tracker is frozen during training, so the optimization can focus on the referring subtask.

Therefore, the core problem is to design a pluggable referring module. An intuitive choice is a CLIP-style module, pre-trained with contrastive learning on more than 400 million image-text pairs; its main advantage is the excellent alignment between visual concepts and text descriptions. For simplicity, the visual and text streams of CLIP are independent, which means that for a given visual input CLIP extracts a fixed visual feature regardless of the text input. In an RMOT task, however, one trajectory usually corresponds to multiple descriptions covering color, position, state, and so on, and it is difficult to match a single visual feature to multiple text features. Motivated by this observation, the authors design a knowledge unification module (KUM) to adaptively extract visual features under textual guidance. In addition, to mitigate the effect of the long-tail distribution of descriptions, a test-time similarity calibration method is proposed to refine the referring results: the pseudo frequency of each description is estimated on the open test set and used to correct the referring score.
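To make the contrast concrete, below is a minimal PyTorch-style sketch of text-guided visual encoding: unlike a plain CLIP encoder, the pooled visual feature changes with the query because the visual tokens cross-attend to the text tokens. The class name, dimensions, and the single cross-attention layer are illustrative assumptions, not the released KUM implementation.

```python
import torch
import torch.nn as nn

class TextGuidedVisualEncoder(nn.Module):
    """Toy contrast to a CLIP-style encoder: the visual feature is
    conditioned on the text query instead of being fixed per image."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # visual tokens attend to text tokens (cross-attention)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N_v, dim) crop/track features
        # text_tokens:   (B, N_t, dim) encoded description tokens
        guided, _ = self.cross_attn(query=visual_tokens,
                                    key=text_tokens,
                                    value=text_tokens)
        # residual connection keeps the original visual content
        fused = visual_tokens + guided
        # pool over tokens to get one feature per (track, description) pair
        return self.proj(fused.mean(dim=1))  # (B, dim)
```

With this conditioning, the same trajectory yields different pooled features for "the red car" and "the car turning left", which is exactly the one-to-many behavior that a fixed, text-independent visual feature cannot provide.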

For the tracking subtask, the Kalman filter is widely used for motion modeling. Process noise and observation noise are two key variables that affect the accuracy of the prediction and update steps. However, in this hand-designed module the two variables are set by preset parameters and can hardly adapt to changes in the motion state. The authors address this by designing a neural version of the Kalman filter, called NKF, which dynamically estimates the process and observation noise.
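The following sketch shows the underlying idea: fixed noise covariances in a standard predict/update step are replaced by values predicted from the current motion state by small networks. The MLP architecture, the state summary it consumes, and the diagonal-covariance assumption are all illustrative; the actual NKF design in the paper may differ.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Tiny MLP mapping a motion-state summary to per-dimension noise variances."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim), nn.Softplus())

    def forward(self, x):
        return self.net(x) + 1e-6  # keep variances strictly positive

def kalman_step(x, P, z, F, H, q_net, r_net, state_summary):
    """One predict+update step with learned process/observation noise."""
    Q = torch.diag(q_net(state_summary))   # dynamic process noise covariance
    R = torch.diag(r_net(state_summary))   # dynamic observation noise covariance
    # predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # update
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ torch.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (torch.eye(P.shape[0]) - K @ H) @ P_pred
    return x_new, P_new
```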

The authors conducted extensive experiments on the recently published Refer-KITTI [37] dataset, where iKUN shows a clear advantage over existing solutions. Specifically, iKUN surpasses the previous SOTA method TransRMOT by 10.78% HOTA, 3.17% MOTA, and 7.65% IDF1. Experiments on traditional MOT tasks were also carried out on KITTI and DanceTrack, where the proposed NKF achieves significant improvements over the baseline trackers. To further verify the effectiveness of iKUN, a more challenging RMOT dataset, Refer-Dance, is contributed by adding language descriptions to DanceTrack; on it, iKUN achieves a significant improvement over TransRMOT, i.e., 29.06% vs. 9.58% HOTA.

Let's read about this work together~

Title: iKUN: Speak to Trackers without Retraining

Authors: Yunhao Du, Cheng Lei, Zhicheng Zhao, Fei Su

Institution: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing Key Laboratory of Network Systems and Network Culture, Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing, China

Original link: https://arxiv.org/abs/2312.16245

Code link: https://github.com/dyhBUPT/iKUN

Referring multi-object tracking (RMOT) aims to track multiple objects based on an input text description. Previous studies achieve this by simply integrating an additional text module into a multi-object tracker, but they typically need to retrain the entire framework and have difficulties in optimization. In this work, we propose an insertable knowledge unification network, called iKUN, to communicate with off-the-shelf trackers in a plug-and-play manner. Specifically, a knowledge unification module (KUM) is designed to adaptively extract visual features based on textual guidance. Meanwhile, to improve localization accuracy, we propose a neural version of the Kalman filter (NKF) that dynamically adjusts the process noise and observation noise according to the current motion state. Moreover, to address the problem of the open-set long-tail distribution of text descriptions, a test-time similarity calibration method is proposed to refine the confidence scores with pseudo frequencies. Extensive experiments on the Refer-KITTI dataset verify the effectiveness of our framework. Finally, to accelerate the development of RMOT, we also contribute a more challenging dataset, Refer-Dance, by extending the public DanceTrack dataset with movement and attire descriptions.

Comparison between the previous RMOT framework and iKUN. (a) Previous approaches incorporate the referring module into the multi-object tracker and require retraining the overall framework. (b) In contrast, iKUN can be plugged directly into an off-the-shelf tracker, and the tracker is frozen during training.


Motivation of KUM. Given a trajectory and a set of descriptions, (a) without guidance from the text stream, the visual encoder has to output a single feature to match multiple text features; (b) with textual guidance, the visual encoder can predict an adaptive feature for each description.


The overall framework of iKUN. The visual stream first embeds the local object feature f_local and the global scene feature f_global, then aggregates them with the knowledge unification module (KUM). A temporal model and a visual head follow to generate the final visual feature f_v. Meanwhile, the text stream encodes the text feature f_t. Finally, a logit head predicts the similarity score between f_v and f_t.
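The sketch below shows how these pieces compose into a single referring score for one trajectory and one description. The internals are placeholders chosen only to make the flow runnable: cross-attention stands in for KUM, a GRU stands in for the temporal model, and cosine similarity stands in for the logit head; none of these is claimed to match the released implementation.

```python
import torch
import torch.nn as nn

class iKUNSketch(nn.Module):
    """Schematic forward pass only; module internals are illustrative placeholders."""
    def __init__(self, dim=256):
        super().__init__()
        self.kum = nn.MultiheadAttention(dim, 8, batch_first=True)  # stands in for KUM
        self.temporal = nn.GRU(dim, dim, batch_first=True)          # temporal model
        self.visual_head = nn.Linear(dim, dim)
        self.text_head = nn.Linear(dim, dim)

    def forward(self, f_local, f_global, f_text):
        # f_local:  (B, T, dim) per-frame crop features of one trajectory
        # f_global: (B, T, dim) per-frame scene features
        # f_text:   (B, dim)    encoded description
        text = f_text.unsqueeze(1)                              # (B, 1, dim)
        fused, _ = self.kum(query=f_local + f_global, key=text, value=text)
        seq, _ = self.temporal(fused)                           # (B, T, dim)
        f_v = self.visual_head(seq[:, -1])                      # final visual feature
        f_t = self.text_head(f_text)
        # logit head: similarity between visual and text embeddings
        return torch.cosine_similarity(f_v, f_t, dim=-1)        # (B,)
```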


Three designs of the knowledge unification module. Feature maps are annotated with their tensor shapes, where the batch size is B. For clarity, the final spatial global average pooling operation is omitted.


Refer-KITTI. The current SOTA method, TransRMOT, obtains 38.06% HOTA, 29.28% DetA, and 50.83% AssA. In contrast, iKUN integrated into various off-the-shelf trackers based on YOLOv8 achieves consistent improvements of 41.25% to 44.56% HOTA. When switched to the same detector as TransRMOT, i.e., Deformable DETR, it obtains 48.84% HOTA, 35.74% DetA, and 66.80% AssA. Importantly, thanks to the flexibility of the framework, iKUN only needs to be trained once for multiple trackers.

In addition, to focus the comparison on association and referring capabilities, oracle experiments were performed to remove the interference of localization accuracy. That is, the coordinates (x, y, w, h) of the final estimated trajectories are corrected according to the ground truth. Note that no bounding boxes are added or removed, and no IDs are modified. Under this setup, iKUN also performs well compared with TransRMOT, i.e., 61.54% vs. 54.50% HOTA.
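A minimal sketch of such an oracle correction is given below: each predicted box is snapped to its best-matching ground-truth box, while the number of boxes and the track IDs stay untouched. The greedy per-box IoU matching and the 0.5 threshold are assumptions for illustration; the paper does not specify the matching rule here.

```python
import numpy as np

def iou(a, b):
    """IoU between two boxes in (x, y, w, h) format."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def correct_boxes(pred_boxes, gt_boxes, iou_thr=0.5):
    """Snap each predicted box (N, 4 array) to its best-matching GT box;
    boxes and IDs are never added, removed, or reassigned."""
    corrected = pred_boxes.copy()
    for i, p in enumerate(pred_boxes):
        ious = [iou(p, g) for g in gt_boxes]
        j = int(np.argmax(ious)) if len(ious) else -1
        if j >= 0 and ious[j] >= iou_thr:
            corrected[i] = gt_boxes[j]
    return corrected
```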


KITTI. The designed NeuralSORT is compared with current SOTA trackers on KITTI in Table 2. All trackers use the same detection results from YOLOv8. For simplicity, the same data split protocol as Refer-KITTI is used. The results show that NeuralSORT achieves the best results in both the car and pedestrian categories.


Ablation experiments.

Knowledge unification module. The three designs of KUM are compared in Table 3. The results show that all of these strategies significantly improve the baseline method, which proves the effectiveness of the text guidance mechanism. Specifically, "text-first modulation" achieves the best association performance (AssA) but performs poorly in detection (DetA). "Cross-correlation" obtains a higher DetA but a lower AssA. "Cascade attention" achieves the best results on HOTA and DetA and is comparable on AssA. Therefore, "cascade attention" is chosen as the default design of KUM.


Similarity calibration. Table 5 studies the mapping function f(·) and the influence of the hyperparameters a and b. Performance is robust to varying values. In this work, a = 8 and b = -0.1 are chosen as the defaults, which yields gains of 0.81% HOTA and 2.09% AssA.
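A rough sketch of the calibration idea is shown below: the pseudo frequency of each description is estimated from the model's own scores on the unlabeled test set, mapped through a parameterized function, and used to adjust the raw similarity. The frequency estimator, the linear form f(x) = a·x + b, and the subtractive combination are all assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def pseudo_frequency(scores_per_description):
    """Estimate a pseudo frequency for each description from the model's own
    raw similarity scores on the open test set (no labels used).
    scores_per_description: dict mapping description -> list of raw scores."""
    freq = {d: float(np.mean(s)) for d, s in scores_per_description.items()}
    total = sum(freq.values())
    return {d: v / total for d, v in freq.items()}  # normalize to a distribution

def calibrate(raw_score, freq, a=8.0, b=-0.1):
    """Apply an illustrative linear mapping f(x) = a * x + b to the pseudo
    frequency and fold it into the raw score (assumed combination rule)."""
    return raw_score - (a * freq + b)
```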


Neural Kalman filter. First, using DeepSORT as the baseline, Table 4 studies the effects of the different components of NeuralSORT on KITTI. On top of the baseline, NKF increases HOTA by 1.32% for cars and 3.50% for pedestrians, and the other tricks bring further gains of 1.58% and 1.94%, respectively. The impact of NKF is then further investigated on KITTI and DanceTrack using ByteTrack as the baseline, where significant improvements can be seen on all evaluation metrics for both datasets.

Training and inference time. Experiments were performed on Refer-KITTI using multiple Tesla T4 GPUs, and the training and inference times of TransRMOT and iKUN are compared in Table 7. The time cost of iKUN is much lower. Note that for a fair comparison, the tracking process is also included in the inference time.


This work proposes a novel module, iKUN, which can be plugged into any multi-object tracker to perform referring tracking. To solve the one-to-many correspondence problem, the knowledge unification module is designed to adjust the visual embedding according to the text description. A similarity calibration method is further proposed to refine the prediction scores with pseudo frequencies estimated on the open test set. In addition, two lightweight neural networks are introduced into the Kalman filter to dynamically update the process and observation noise variables. The effectiveness of iKUN is demonstrated by experiments on the public dataset Refer-KITTI and the newly constructed dataset Refer-Dance.

Readers who are interested in more experimental results and details of the article can read the original paper~

This article is for academic sharing only. If there is any infringement, please contact us to delete it.

3D Vision Workshop Exchange Group

At present, we have established multiple communities in the 3D vision area, covering 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D computer vision: image classification/segmentation, object detection, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, behavior recognition, model quantization and pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point clouds, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflectometry, Halcon, photogrammetry, array cameras, photometric stereo, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithms, multi-sensor fusion algorithms, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, lidar, visual camera sensors, multi-sensor calibration, multi-sensor fusion, a general autonomous driving group, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, object tracking, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, COLMAP, texture mapping, etc.

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition, there are exchange groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant: dddvision, note: research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry), and you will be pulled into the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, structured light, phase deflectometry, robotic arm grasping, point cloud practice, Open3D, defect detection, BEV perception, Occupancy, Transformer, model deployment, 3D object detection, depth estimation, multi-sensor calibration, planning and control, UAV simulation, 3D vision C++, 3D vision Python, dToF, camera calibration, ROS2, robot control and planning, LeGo-LOAM, multi-modal fusion SLAM, LOAM-SLAM, indoor/outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, COLMAP, line/plane structured light, hardware structured light scanners, UAVs, etc.