
IEEE'24 | Real-time tracking revolution! Inference in just 36 milliseconds! Reinventing the AR assembly experience!

Author: 3D Vision Workshop

Source: 3D Vision Workshop

Add our assistant (dddvision) with a note of your research direction + school/company + nickname to join the group. A list of community subdivisions is attached at the end of the article.

This article presents GBOT, a graph-based real-time tracking method designed to assist assembly tasks in augmented reality (AR). The method uses prior knowledge of previous assembly poses and combines 6D pose estimation with object tracking to follow multiple assembly parts through kinematic links. The authors also release a dataset, likewise called GBOT, to evaluate their method. Experimental results show that the method performs well under various conditions, in particular under different lighting, hand occlusion, and fast motion. The main contributions of the paper are a new tracking method, a dataset that can be used for evaluation, and a demonstration of the method's potential for AR-assisted assembly tasks. Future research directions are also discussed, including tracking textured, transparent, or reflective objects and improving the pose estimation algorithm.

Let's read about this work together~

Paper title: GBOT: Graph-Based 3D Object Tracking for Augmented Reality-Assisted Assembly Guidance

Authors: Shiyu Li, Hannah Schieber, et al.

Affiliation: Technical University of Munich, among others

Paper link: https://arxiv.org/pdf/2402.07677.pdf

Code link: https://github.com/roth-hex-lab/gbot

Guidance for assembling parts is a promising application area of augmented reality. AR assembly guidance requires the 6D object pose of the target object in real time. Especially in time-critical medical or industrial environments, continuous and marker-free tracking of the individual parts is essential for overlaying instructions on or next to the target part. Occlusion by the user's hands or other objects and the complexity of the different assembly states make marker-free multi-object tracking difficult to achieve robustly and in real time. To address this, we propose Graph-based Object Tracking (GBOT), a novel graph-based single-view RGB-D tracking method. Real-time, marker-free multi-object tracking is initialized with 6D pose estimation, and the graph-based assembly poses are updated from there. A wide range of assembly states is tracked with our novel multi-state assembly graph, which we update using the relative poses of the individual assembly parts. Linking the individual objects in this graph allows for more robust object tracking during assembly. As a benchmark for future work, we also provide 3D-printable assembly assets. Quantitative experiments on synthetic data and further qualitative studies on real-world test data show that GBOT outperforms existing work, enabling context-aware AR assembly guidance.


An overview of all five assembly assets included in the GBOT dataset.


Our synthetic training images: a cluttered scene with 3D-printed parts is generated for the assembly parts. To increase domain randomization, we add objects from the T-LESS dataset, varying lighting conditions, and randomized backgrounds.


Synthetic and real-world scenes with different lighting conditions, motion blur, and occlusion: we conduct ablation studies that consider different lighting conditions, motion blur, and hand occlusion as limiting factors in the real-world data.


Qualitative evaluation on the GBOT synthetic dataset. We compare (from top to bottom) three assembly assets: the Nano Chuck by Prima, the Hand-Screw Clamp, and the Liftpod. Tracked objects are colored individually. As the assembly state evolves, GBOT maintains tracking more consistently than state-of-the-art trackers.


Qualitative evaluation of YOLOv8Pose, SRT3D, ICG, ICG+SRT3D, GBOT, and GBOT+re-init in real-world scenes. We compare the different methods on the Hobby Corner Clamp assembly asset; tracked objects are shown in different colors. YOLOv8Pose cannot detect and estimate the pose of an occluded assembly object, whereas the tracking algorithms still update the object pose. As the assembly state evolves, GBOT maintains tracking more consistently than the existing trackers SRT3D, ICG, and ICG+SRT3D.


Evaluation in a real-world cluttered scene: we randomly placed the GBOT assembly assets along with a few distracting objects to test the impact of scene clutter. Thanks to domain randomization, our training data helps detect objects in cluttered scenes.


Assembly-aware training on synthetic scenes and evaluation on real-world scenes: because it includes assembled states, our training data helps overcome occlusion during assembly.

  • Real-time multi-object tracking driven by an assembly graph and initialized by 6D pose estimation, for multi-state assembly including assembly-state recognition.
  • A synthetic dataset and unlabeled real-world test data for publicly available, 3D-printable assembly assets, serving as a quantitative and qualitative benchmark for AR assembly guidance.
  • Tracking initialization: first, the object detector YOLOv8 is extended to perform 6-degree-of-freedom (6D) object pose estimation as a single-stage method. In addition to the detected bounding box, the network also predicts the keypoints needed for pose estimation. Keypoints are detected directly on the surface of the object rather than at the corners of the 3D bounding box, in order to capture the object's surface features more accurately. Once keypoints and bounding boxes are detected, they are fed into RANSAC PnP (Perspective-n-Point) to recover the object's pose (see the PnP sketch after this list).
  • Keypoint selection: to define surface keypoints on each object, farthest point sampling is used; it initializes a set of keypoints on the object's surface and adds a total of N points. Given the variation in object sizes and the trade-off between runtime and visibility, 17 keypoints are used (a minimal sampling sketch follows this list).
  • 6D pose prediction: PnP is the problem of solving for the 6D object pose given an object model with N 3D points and the corresponding 2D keypoint predictions. The output of the object detector is processed with RANSAC PnP to recover the 6D object pose. The network is trained with the keypoint regression loss used in YOLOv8.
  • Graph-based object tracking: 6D object pose estimation could be used to continuously detect the individual objects, but it is computationally demanding and limits real-time performance. Object tracking provides real-time pose information but requires pose initialization, so 6D pose estimation is used to initialize the tracker. Graph-based object tracking then updates each object's pose over time in every new frame. Most tracking algorithms define probabilistic models based on energy functions or pose-variation vectors; we use an energy-function-based approach defined as a negative log-likelihood, following Stoiber et al. Our method extends their graph-based approach, which uses kinematic links between different objects to simplify tracking. Unlike their work, we update these links at runtime based on the a-priori-known assembly graph (a conceptual sketch of such link propagation follows this list).
  • Determining the assembly state: to switch between assembly states during assembly, we use knowledge of the relative pose between two assembly parts. We obtain the assembly state by measuring the relative pose between connected parts. If the relative pose between the two parts deviates from the expected relative pose of the target state by less than the tracking error, the corresponding assembly state is assumed to be complete. We compute translation and rotation errors and use them as the switching condition (see the state-check sketch after this list).
  • GBOT dataset: to train the pose estimator and evaluate tracker performance, a synthetic data generator is used. The dataset contains five 3D-printable assembly objects used to test the algorithm. Synthetic data is generated with domain randomization, i.e., varying background textures, different lighting conditions, and distractor objects. For qualitative assessment, images of real-world scenes were also recorded, but due to annotation limitations these images do not contain ground-truth poses.
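The keypoint selection mentioned above is essentially a greedy max-min procedure. Below is a minimal sketch of farthest point sampling over a set of model vertices, assuming `vertices` is an (M, 3) NumPy array of surface points; the variable names are illustrative and not taken from the GBOT code base.

```python
import numpy as np

def farthest_point_sampling(vertices: np.ndarray, n_keypoints: int = 17) -> np.ndarray:
    """Greedily pick n_keypoints vertices that are maximally spread over the surface."""
    selected = [0]  # start from an arbitrary vertex
    # distance of every vertex to the closest already-selected keypoint
    dists = np.linalg.norm(vertices - vertices[0], axis=1)
    for _ in range(n_keypoints - 1):
        next_idx = int(np.argmax(dists))          # farthest vertex from the current set
        selected.append(next_idx)
        new_dists = np.linalg.norm(vertices - vertices[next_idx], axis=1)
        dists = np.minimum(dists, new_dists)      # update nearest-keypoint distances
    return vertices[selected]

# Usage (hypothetical): keypoints_3d = farthest_point_sampling(mesh_vertices, 17)
```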
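For the pose recovery step, a hedged sketch of the RANSAC PnP stage using OpenCV's `solvePnPRansac` is shown below. The intrinsics matrix `K`, the EPnP flag, and the reprojection threshold are assumptions made for illustration; the paper's actual implementation may differ.

```python
import cv2
import numpy as np

def estimate_pose_pnp(keypoints_3d: np.ndarray,   # (N, 3) object-space keypoints
                      keypoints_2d: np.ndarray,   # (N, 2) predicted image keypoints
                      K: np.ndarray):             # (3, 3) camera intrinsics
    """Recover a 4x4 object-to-camera transform from 2D-3D correspondences."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        keypoints_3d.astype(np.float32),
        keypoints_2d.astype(np.float32),
        K.astype(np.float32),
        distCoeffs=None,
        reprojectionError=4.0,        # assumed inlier threshold in pixels
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 rotation matrix
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = tvec.ravel()
    return T
```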
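The kinematic links in the assembly graph can be thought of as frozen relative transforms between already-assembled parts. The following conceptual sketch (not the paper's C++ tracker, which optimizes an energy function) only illustrates how a pose tracked for one part could be propagated to the parts linked to it.

```python
import numpy as np

class AssemblyGraph:
    """Toy assembly graph: nodes hold 4x4 poses, edges hold fixed relative transforms."""

    def __init__(self):
        self.poses = {}   # part name -> 4x4 object-to-camera pose
        self.links = {}   # child name -> (parent name, 4x4 parent-to-child transform)

    def add_link(self, parent: str, child: str):
        """Freeze the current relative pose as a kinematic link (parts assembled)."""
        rel = np.linalg.inv(self.poses[parent]) @ self.poses[child]
        self.links[child] = (parent, rel)

    def propagate(self):
        """Update linked parts from their parent's tracked pose."""
        for child, (parent, rel) in self.links.items():
            self.poses[child] = self.poses[parent] @ rel
```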
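The assembly-state switch reduces to thresholding the translation and rotation error between the tracked relative pose of two parts and the relative pose expected in the next state. A minimal sketch follows; the threshold values are placeholders, not the ones used in the paper.

```python
import numpy as np

def relative_pose_error(T_a, T_b, T_rel_expected):
    """Translation (m) and rotation (deg) error between tracked and expected relative pose."""
    T_rel = np.linalg.inv(T_a) @ T_b
    delta = np.linalg.inv(T_rel_expected) @ T_rel
    trans_err = np.linalg.norm(delta[:3, 3])
    cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos_angle))
    return trans_err, rot_err

def assembly_state_reached(T_a, T_b, T_rel_expected,
                           trans_thresh=0.01, rot_thresh=5.0):  # placeholder thresholds
    t_err, r_err = relative_pose_error(T_a, T_b, T_rel_expected)
    return t_err < trans_thresh and r_err < rot_thresh
```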
  • Evaluation metrics: the average distance metrics ADD and ADD-S (the latter for symmetric objects) are used as the main metrics for 6D pose accuracy. Average translation error and average rotation error are reported as supplementary metrics (a minimal metric sketch follows this list).
  • Implementation details: the pose estimator extends YOLOv8, is implemented in PyTorch, and is accelerated with NVIDIA TensorRT. The tracker, inference engine, and RESTful API are implemented in C++17.
  • Evaluation dataset: evaluation is performed on the GBOT dataset, which covers four conditions (normal, dynamic lighting, motion blur, hand occlusion). YOLOv8Pose, state-of-the-art tracking methods, and GBOT are compared.
  • Experimental results: GBOT outperforms YOLOv8Pose and the other tracking methods under the different conditions, and its advantage is larger for assembly assets with more parts.
  • Quantitative evaluation: results under the different conditions show that GBOT is superior to the other methods in tracking accuracy; it is especially more robust under hand occlusion.
  • Qualitative evaluation: visual results demonstrate the robustness and accuracy of GBOT when tracking the assembly assets. GBOT is able to track smaller parts and performs well under strong hand occlusion.
  • Real-time capability: GBOT can be deployed in real-time applications, enabling its use in augmented reality (AR). This is demonstrated with an example AR assembly-guidance application on the Microsoft HoloLens 2.
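The ADD and ADD-S metrics mentioned above average point distances between the model transformed with the ground-truth pose and with the predicted pose; ADD-S takes the nearest neighbour instead of corresponding points, which makes it suitable for symmetric objects. A minimal NumPy sketch with illustrative names:

```python
import numpy as np

def transform(pts, T):
    """Apply a 4x4 rigid transform to an (M, 3) array of model points."""
    return pts @ T[:3, :3].T + T[:3, 3]

def add_metric(pts, T_gt, T_pred):
    """ADD: mean distance between corresponding transformed model points."""
    return np.mean(np.linalg.norm(transform(pts, T_gt) - transform(pts, T_pred), axis=1))

def add_s_metric(pts, T_gt, T_pred):
    """ADD-S: mean distance to the closest predicted point (symmetric objects)."""
    gt = transform(pts, T_gt)
    pred = transform(pts, T_pred)
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # brute-force NN
    return np.mean(d.min(axis=1))
```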

Our approach focuses on untextured 3D-printed parts. Future challenges may include reflective or transparent objects, such as medical devices, to further test the limits of the tracking method. By improving our 6D pose estimation algorithm and combining it with geometric priors, smaller targets with geometric ambiguity could be tracked better. In addition, screws and similar objects could be handled at the category level for a more scalable way of joining parts. To overcome occlusion, a multi-camera setup could be useful and might also include the camera of the AR device. To cope with more challenging assembly objects, more robust tracking re-initialization may be required.

In this paper, we propose a novel real-time graph-based tracking method for AR-assisted assembly tasks. GBOT uses prior knowledge of previous assembly poses to track multiple assembly parts via kinematic links and combines 6D pose estimation with object tracking. This enables GBOT to track objects continuously during assembly under a wide range of conditions. To allow comparisons with the state of the art in various scenarios, we present the GBOT dataset together with additionally recorded real-world scenes. On this dataset, we evaluate YOLOv8Pose, the tracking methods SRT3D, ICG, and ICG+SRT3D, and GBOT. The dataset contains five assembly assets, each with three or more individual parts, and its scenes cover four conditions: normal, dynamic lighting, motion blur, and hand occlusion. GBOT performs well in synthetic scenes with different lighting, hand occlusion, and fast motion, as well as in the recorded real-world scenes. We show that tracking is more accurate than YOLOv8Pose alone and that the kinematic links created by our dynamic graph update are superior to tracking the parts independently. GBOT outperforms state-of-the-art tracking algorithms on the GBOT dataset, which is easily reproducible and intended as a benchmark for assembly tasks. In conclusion, our method and dataset are a promising step towards real-time, robust object tracking and AR-guided assembly.

This article is for academic sharing only. If there is any infringement, please contact us and the article will be deleted.

3D Vision Workshop Exchange Group

We have established multiple communities covering 3D vision, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D Computer Vision: image classification/segmentation, object detection/tracking, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, behavior recognition, model quantization and pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflection, Halcon, photogrammetry, array camera, photometric stereo vision, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithm, multi-sensor fusion, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, lidar, visual camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, target tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition, there are exchange groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add our assistant (dddvision) with a note of your research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry) to join the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, Structured Light, Phase Deflection, Robotic Arm Grabbing, Point Cloud Practice, Open3D, Defect Detection, BEV Perception, Occupancy, Transformer, Model Deployment, 3D Object Detection, Depth Estimation, Multi-Sensor Calibration, Planning and Control, UAV Simulation, 3D Vision C++, 3D Vision Python, dToF, Camera Calibration, ROS2, Robot Control Planning, LeGo-LOAM, Multimodal Fusion SLAM, LOAM-SLAM, Indoor and Outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D Reconstruction, colmap, Linear and Surface Structured Light, Hardware Structured Light Scanners, Drones, etc.