0. What is this article about?
This article introduces SignAvatar, a framework for word-level 3D sign language motion reconstruction and generation. It addresses the scarcity of real-world 3D sign language data and the difficulty of cross-modal semantic understanding by combining a transformer-based conditional variational autoencoder with CLIP's joint text-image embedding, and it uses a curriculum learning strategy to improve robustness and generalization. The results show that SignAvatar reconstructs realistic 3D sign language motion from isolated videos and generates motion from text or image prompts; the authors additionally release the ASL3DWord dataset of 3D joint rotation sequences. Future work will further exploit CLIP's semantic space and incorporate non-manual cues such as facial expressions.
Let's read about this work together~
Paper title: SignAvatar: Sign Language 3D Motion Reconstruction and Generation
Authors: Lu Dong, Lipisha Chaudhary, et al.
Affiliation: Department of Computer Science and Engineering, University at Buffalo, NY, USA
Paper link: https://arxiv.org/pdf/2405.07974
Expressive 3D motion reconstruction and automatic generation of isolated sign language words are challenging because real-world 3D sign language word data is scarce, sign language motion involves complex, subtle articulations, and sign language semantics must be understood across modalities. To address these challenges, we introduce SignAvatar, a framework for word-level sign language reconstruction and generation. SignAvatar adopts a transformer-based conditional variational autoencoder architecture that effectively models the relationship between different semantic modalities. The method also employs a curriculum learning strategy to strengthen the model's robustness and generalization, producing more realistic motion. In addition, we contribute the ASL3DWord dataset, which consists of 3D joint rotation data for the body, hands, and face of isolated sign language words. Extensive experiments demonstrate SignAvatar's superior reconstruction and automatic generation capabilities. The code and dataset are available on the project page.
SignAvatar excels at two tasks: reconstructing 3D sign language motion from video, and generating such motion from semantics (images, text). The top row shows a sign language video of the word "drink" - note the motion blur in some frames. The middle row shows SignAvatar's 3D avatar reconstruction, and the bottom row shows the 3D sign language motion it generates from the word "drink".
Data collection quality control process. This image shows downsampled video frames for some of the TABLE videos. The correct gesture for this sign places the hands and forearms horizontally in front of the body, with the dominant forearm above the non-dominant forearm, and then gently taps them together. The 21 frames shown in front fit this description, while the gray frames in the back do not; those frames are manually removed.
Comparison of 3D upper-body pose estimation across models. Left: the original image; middle: the ExPose result; right: the Hand4Whole result.
SignAvatar can also accept images as input. On the left is an image; through CLIP's joint text-image embedding, SignAvatar recognizes the corresponding semantics - "book" - and generates the corresponding 3D sign language motion. The top row is the front view, and the bottom row is the side view.
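As an illustration of how an image prompt could be mapped to a word condition, here is a minimal sketch using CLIP from the Hugging Face transformers library. The vocabulary list, file path, and matching procedure are hypothetical assumptions for illustration; the paper's exact recognition pipeline may differ.

```python
# Hypothetical sketch: use CLIP to map an input image to the closest sign word
# in a small candidate vocabulary; that word then conditions motion generation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

vocab = ["book", "drink", "table"]   # placeholder words, not the full ASL3DWord vocabulary
image = Image.open("prompt.jpg")     # hypothetical input image path

inputs = processor(text=vocab, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_image holds the image's similarity to each candidate word
word = vocab[out.logits_per_image.softmax(dim=-1).argmax().item()]
print("recognized semantics:", word)
```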
- We propose SignAvatar, a sign language generation framework that integrates a transformer-based CVAE architecture with the large vision-language model CLIP. For the first time, 3D sign language motion can be reconstructed from isolated videos and also generated from text or image prompts, a significant advance in automatic sign language understanding.
- We introduce a curriculum learning strategy that gradually increases the mask ratio during training. This helps SignAvatar learn and generalize fine-grained gestures and synthesize realistic, natural sign language motion (a minimal sketch of such a schedule follows this list).
- We conduct a comprehensive evaluation of SignAvatar, demonstrating its denoising ability in sign language reconstruction as well as its superior capability in sign language motion generation. In addition, we contribute the ASL3DWord dataset of word-level 3D joint rotation sequences for 3D sign language research.
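The following is a minimal sketch of what a gradually increasing mask-ratio schedule could look like. The linear schedule, the ratios, and frame-level masking granularity are assumptions for illustration, not the authors' exact settings.

```python
import torch

def mask_ratio(epoch, total_epochs, start=0.0, end=0.5):
    """Curriculum: linearly increase the fraction of masked frames over training."""
    return start + (end - start) * min(epoch / max(total_epochs - 1, 1), 1.0)

def mask_frames(poses, ratio):
    """Zero out a random subset of frames. poses: (B, T, D)."""
    B, T, _ = poses.shape
    keep = torch.rand(B, T, device=poses.device) >= ratio   # True = keep the frame
    return poses * keep.unsqueeze(-1), keep

# Hypothetical use inside a training loop:
# ratio = mask_ratio(epoch, num_epochs)
# masked_poses, keep = mask_frames(poses, ratio)
# recon, kl = model(masked_poses, clip_text_embedding)
```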
The core idea of this article is to synthesize 3D sign language motion sequences with a Conditional Variational Autoencoder (CVAE) framework. The paper first formalizes the sign language synthesis problem: reconstruct motion from a video, and generate a sign language motion sequence that accurately represents the semantics of a given label. To this end, the paper adopts the SMPL-X body model as a unified representation and separates pose from shape to better capture the realism and naturalness of sign language. It then models the synthesis process with a CVAE built on a transformer encoder-decoder architecture, in which the encoder distills the core structure of the input into a concise latent representation and the decoder combines this representation with the CLIP text embedding to generate realistic human motion sequences that satisfy the specified condition. The model is trained with a curriculum learning strategy that gradually exposes it to easier and then more challenging samples to improve performance. Finally, SignAvatar is evaluated and validated on the constructed ASL3DWord dataset.
- Problem formalization: The goal of the article is to synthesize a 3D motion sequence of sign language, either to reconstruct movement from a video or to generate a sign language motion sequence from a label. To accurately convey semantics, factors such as gestures, upper body movements, and facial expressions need to be considered.
- SMPL-X body model: To achieve realism and naturalness, the paper adopts the SMPL-X body model as a unified representation. Because body shapes vary across signers, the method focuses on generating the sequence of pose parameters rather than shape parameters.
- Conditional Variational Autoencoder (CVAE) framework: The paper uses a CVAE to model the sign language synthesis process. The framework includes a transformer-based encoder and decoder: the encoder extracts the core structure from the input motion sequence and the text projection to create a concise latent representation, and the decoder combines this representation with the CLIP text embedding to generate a realistic human motion sequence that satisfies the specified condition (a minimal sketch follows this list).
- Curriculum learning strategy: To improve performance, the model is gradually exposed to easier and then more challenging samples during training.
- ASL3DWord dataset: To evaluate SignAvatar, the ASL3DWord dataset is constructed from the WLASL video dataset, quality-controlled and screened so that it contains 3D pose parameter sequences for 103 sign language words.
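To make the pipeline concrete, here is a minimal PyTorch sketch of a transformer-based CVAE conditioned on a precomputed CLIP text embedding. The token layout, layer sizes, fixed maximum sequence length, and the flattened SMPL-X pose dimension are illustrative assumptions, not the authors' exact configuration. Reconstruction encodes the input motion and samples the learned posterior, while generation samples the latent directly from a standard normal distribution.

```python
import torch
import torch.nn as nn

class SignCVAE(nn.Module):
    """Transformer CVAE sketch: encoder -> (mu, logvar); decoder -> pose sequence."""
    def __init__(self, pose_dim=165, d_model=256, clip_dim=512,
                 n_layers=4, n_heads=4, max_len=64):
        # pose_dim=165: flattened SMPL-X axis-angle pose (global orient + body + face + hands)
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, d_model)
        self.cond_proj = nn.Linear(clip_dim, d_model)
        self.dist_tokens = nn.Parameter(torch.randn(2, d_model))       # mu / logvar query tokens
        self.time_queries = nn.Parameter(torch.randn(1, max_len, d_model))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.pose_out = nn.Linear(d_model, pose_dim)

    def encode(self, poses, cond):
        # poses: (B, T, pose_dim); cond: (B, clip_dim) precomputed CLIP text embedding
        B = poses.size(0)
        tokens = torch.cat([self.dist_tokens.unsqueeze(0).expand(B, -1, -1),
                            self.cond_proj(cond).unsqueeze(1),
                            self.pose_proj(poses)], dim=1)
        h = self.encoder(tokens)
        return h[:, 0], h[:, 1]                                        # mu, logvar

    def decode(self, z, cond, T):
        # z: (B, d_model) latent; decode into a T-frame pose sequence
        B = z.size(0)
        tokens = torch.cat([z.unsqueeze(1), self.cond_proj(cond).unsqueeze(1),
                            self.time_queries[:, :T].expand(B, -1, -1)], dim=1)
        return self.pose_out(self.decoder(tokens)[:, 2:])              # (B, T, pose_dim)

    def forward(self, poses, cond):
        mu, logvar = self.encode(poses, cond)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()           # reparameterization trick
        recon = self.decode(z, cond, poses.size(1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl

# Reconstruction vs. generation (hypothetical usage):
# model = SignCVAE()
# recon, kl = model(poses, clip_emb)            # reconstruct: sample from the learned posterior
# z = torch.randn(poses.size(0), 256)
# generated = model.decode(z, clip_emb, T=64)   # generate: sample z ~ N(0, I)
```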
This article focuses on a system called SignAvatar for the reconstruction and generation of sign language movements:
- Objective: To evaluate the performance of SignAvatar in sign language action reconstruction and generation tasks through experiments.
- Experimental design: The ASL3DWord Subset and the full ASL3DWord dataset were used. The experiments are divided into two parts: the reconstruction pipeline and the generation pipeline.
- Reconstruction pipeline: The reconstruction pipeline samples from the pose distribution learned from the input video, so the output closely resembles that video. Reconstruction uses the learned distribution, whereas generation samples directly from a standard normal distribution; the reconstructed motion is very similar to the input video, while the generated motion matches the overall movement with slight differences.
- Generation pipeline: The input to the generation pipeline is sampled from a standard normal distribution. The generated motion matches the overall movement of the reference video for the word, but may differ in start and end positions, range of motion, and so on.
- Evaluation metrics: Four metrics were used: recognition accuracy, Fréchet Inception Distance (FID), diversity, and multimodality.
- Recognition accuracy: evaluates whether the reconstructed and generated poses can be correctly recognized by the same action classifier. FID: assesses the overall quality of the reconstructed and generated motion by comparing feature distributions. Diversity: measures the variance of motion across all motion categories. Multimodality: measures the mean variance of motion within each sign language word (a sketch of these distribution-based metrics follows this list).
- Experimental results: SignAvatar performs well in both sign language motion reconstruction and generation. The reconstruction pipeline achieves high recognition accuracy and a low FID, and the generation pipeline exhibits diversity while maintaining overall motion consistency.
- Ablation study: SignAvatar was analyzed from three aspects: framework design, the curriculum learning strategy, and data collection quality control. The results show that the framework design and the curriculum learning strategy have a significant impact on the model's performance and generalization, and that quality control during data collection improves reconstruction and generation quality.
- Qualitative results: The qualitative results demonstrate the quality of the reconstructed and generated motion. The reconstructions closely match the provided videos in a given setting, while the generated results exhibit diversity and can accommodate different signing habits.
- Summary: SignAvatar performs well in sign language motion reconstruction and generation, with good reconstruction accuracy and generation diversity, driven by its framework design, curriculum learning strategy, and data collection quality control.
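For reference, here is a minimal sketch of the distribution-based metrics described above, computed on classifier features of motion sequences. The pairing scheme and sample counts follow a commonly used recipe and are assumptions, not the paper's exact evaluation protocol.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """Fréchet distance between Gaussian fits of two feature sets, each (N, D)."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean))

def diversity(feats, n_pairs=200):
    """Mean distance between random feature pairs drawn across all categories."""
    a = np.random.choice(len(feats), n_pairs)
    b = np.random.choice(len(feats), n_pairs)
    return float(np.linalg.norm(feats[a] - feats[b], axis=1).mean())

def multimodality(feats, labels, n_pairs=20):
    """Mean distance between random pairs within each sign word, averaged over words."""
    per_word = []
    for w in np.unique(labels):
        f = feats[labels == w]
        if len(f) < 2:
            continue
        a = np.random.choice(len(f), n_pairs)
        b = np.random.choice(len(f), n_pairs)
        per_word.append(np.linalg.norm(f[a] - f[b], axis=1).mean())
    return float(np.mean(per_word))
```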
We propose SignAvatar, a new method for reconstructing and generating 3D sign language motion from 2D isolated videos. Our curriculum learning strategy enhances the model's scalability, robustness, generalization, and realism, and text-driven and image-driven generation add flexibility to this area. The comprehensive evaluation demonstrates SignAvatar's superior performance in sign language reconstruction and generation tasks. We have also built ASL3DWord, a quality-controlled SMPL-X-based 3D dataset for academic research. In the future, we aim to further exploit the semantic space provided by CLIP to explore semantic similarity in sign language. Moreover, given that sign language includes non-manual elements such as facial expressions, lip movements, and emotion, we will investigate how facial expressions and body postures contribute to sign language comprehension in the context of 3D sign language reconstruction and generation.