Contributed by Zhang Huaxin
QbitAI | WeChat official account QbitAI
Temporal Action Localization (TAL) is a common method for finding where actions start and end in a video.
Once you've modeled the video content, you can search freely through the entire video.
A joint team from Huazhong University of Science and Technology and the University of Michigan has recently made new progress on this technology:
In the past, modeling in TAL was done at the fragment or even instance level; now a single annotated frame in the video is enough, with results comparable to full supervision.
The team from Huazhong University of Science and Technology has proposed HR-Pro, a new framework for temporal behavior detection supervised by point annotations.
Through multi-level reliability propagation, HR-Pro learns more discriminative fragment-level features and more reliable instance-level boundaries.
It includes two reliability-aware stages that propagate high-confidence cues from the point annotations at the fragment level and the instance level, so that the network learns more discriminative fragment representations and more reliable proposals.
Extensive experiments on multiple benchmark datasets show that HR-Pro significantly outperforms existing methods and achieves state-of-the-art results, demonstrating both the effectiveness of the method and the potential of point annotation.
Performance is comparable to a fully supervised approach
The figure below compares HR-Pro and LACP in temporal behavior detection on THUMOS14 test videos.
HR-Pro shows more accurate detection of action instances, specifically:
- For the "golf swing" behavior, HR-Pro effectively separates behavior fragments from background fragments, reducing the false positive predictions that LACP struggles with.
- For the "discus throwing" behavior, HR-Pro detects more complete segments than LACP, which produces lower activation values on the less discriminative parts of the action.
Quantitative results on the dataset confirm this impression.
Visualizing the detection results on the THUMOS14 dataset shows that the gap between high-quality and low-quality predictions widens significantly after instance-level integrity learning.
(Left: results before instance-level integrity learning; right: results after it. The horizontal and vertical axes represent time and reliability score, respectively.)
Overall, HR-Pro substantially outperforms state-of-the-art point-supervised methods on four commonly used datasets, and achieves results comparable to some fully supervised methods.
On the THUMOS14 test set, HR-Pro reaches an average mAP of 60.3% over IoU thresholds from 0.1 to 0.7, 6.5% higher than the previous state-of-the-art method CRRC-Net (53.7%), as shown in the table below.
HR-Pro also performs comparably to competitive fully supervised methods such as AFSD (average mAP of 51.1% vs. 52.0% over IoU thresholds from 0.3 to 0.7).
△Comparison of HR-Pro and previous SOTA methods on the THUMOS14 dataset
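For readers unfamiliar with the metric, the sketch below shows how the temporal IoU between a predicted segment and a ground-truth segment is typically computed, and how an average mAP is taken over a range of IoU thresholds. The function names and the (start, end) segment representation are illustrative assumptions, not code from the HR-Pro authors.

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments, each given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def iou_thresholds(low, high, step=0.1):
    """Thresholds from low to high inclusive, e.g. 0.1, 0.2, ..., 0.7."""
    count = int(round((high - low) / step)) + 1
    return [round(low + k * step, 2) for k in range(count)]


def average_map(map_per_threshold):
    """Average mAP: the plain mean of the mAP values, one per IoU threshold."""
    return sum(map_per_threshold) / len(map_per_threshold)
```

A prediction counts as correct at a given threshold only if its temporal IoU with a ground-truth instance of the same class reaches that threshold, so averaging over 0.1 to 0.7 rewards both coarse and precise localization.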
HR-Pro also generalizes well across benchmark datasets, outperforming existing methods by 3.8%, 7.6% and 2.0% on GTEA, BEOID and ActivityNet 1.3, respectively.
△Comparison between HR-Pro and the previous SOTA method on datasets such as GTEA
So, how exactly does HR-Pro achieve this?
Learning is conducted in two stages
The research team proposes a multi-level reliability propagation method. At the fragment level, it introduces a reliable fragment memory module and uses cross-attention to propagate its information to other fragments. At the instance level, it proposes point-supervised proposal generation, which associates fragments with instances to generate proposals of different reliability levels, and then further optimizes the confidence and boundaries of those proposals.
The model structure of HR-Pro is shown in the figure below. Temporal behavior detection is divided into two learning stages: fragment-level discrimination learning and instance-level integrity learning.
Stage 1: Fragment-level discrimination learning
The research team introduced reliability-aware fragment-level discrimination learning: it stores a reliable prototype for each category and propagates the high-confidence cues in these prototypes to other fragments, both within a video and across videos.
Reliable prototype building at the fragment level
To build reliable prototypes at the fragment level, the team created an online-updated prototype memory that stores a reliable prototype m_c for each behavior category c (c = 1, 2, ..., C), so that it can leverage feature information from the entire dataset.
The research team initializes each prototype with the features of the point-annotated fragments.
Next, the researchers update the prototype of each category using the features of pseudo-labeled behavior fragments.
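The initialization and update formulas are not reproduced in this article, so the following is only a minimal sketch of how such a prototype memory could work. It assumes that a prototype is initialized as the mean of the point-annotated fragment features of its category and then updated with a momentum-style moving average; the function names and the `momentum` parameter are hypothetical.

```python
def init_prototype(point_features):
    """Assumed initialization: mean of the point-annotated fragment features."""
    dim = len(point_features[0])
    count = len(point_features)
    return [sum(f[d] for f in point_features) / count for d in range(dim)]


def update_prototype(prototype, feature, momentum=0.9):
    """Assumed online update: moving average toward a pseudo-labeled feature."""
    return [momentum * p + (1.0 - momentum) * x
            for p, x in zip(prototype, feature)]


# One memory slot per category c; features are plain lists of floats here.
memory = {0: init_prototype([[1.0, 3.0], [3.0, 5.0]])}
memory[0] = update_prototype(memory[0], [4.0, 0.0], momentum=0.5)
```

A high momentum keeps the prototype stable against noisy pseudo-labels, while a low momentum lets it track the evolving features more closely.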
Fragment-level reliability-aware optimization
To transfer the feature information of the fragment-level reliable prototypes to other fragments, the research team designed a Reliability-aware Attention Block (RAB). It injects reliable information from the prototypes into other fragments through cross-attention, which makes the fragment features more robust and increases attention to the less discriminative fragments.
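The cross-attention idea can be illustrated with a simplified sketch in which fragment features act as queries and the prototypes act as keys and values. This is not the RAB itself: the learned projection layers are omitted, and the residual connection and scaling are assumptions borrowed from standard scaled dot-product attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attend(fragments, prototypes):
    """Each fragment (query) attends over the prototypes (keys and values)."""
    dim = len(prototypes[0])
    scale = math.sqrt(dim)
    enhanced = []
    for query in fragments:
        weights = softmax([dot(query, key) / scale for key in prototypes])
        attended = [sum(w * p[d] for w, p in zip(weights, prototypes))
                    for d in range(dim)]
        # Residual connection (assumed): keep the original fragment feature.
        enhanced.append([x + a for x, a in zip(query, attended)])
    return enhanced
```

Fragments that resemble a prototype receive most of their injected information from that prototype, which is how high-confidence cues can reach ambiguous fragments.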
To learn more discriminative fragment features, the team also constructed a reliability-aware fragment contrastive loss.
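The exact loss is not shown in this article. As a rough illustration of what a prototype-based contrastive objective can look like, the sketch below uses a generic InfoNCE-style loss: a fragment feature is pulled toward the prototype of its own category and pushed away from the others. The function names, the cosine similarity and the `temperature` parameter are assumptions, and the actual HR-Pro loss may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def prototype_contrast_loss(feature, prototypes, label, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the similarity to the true prototype."""
    sims = [cosine(feature, p) / temperature for p in prototypes]
    peak = max(sims)
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims))
    return log_denominator - sims[label]
```

The loss is close to zero when the feature already aligns with its own prototype and grows as it drifts toward the prototype of another category.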
Stage 2: Instance-level integrity learning
To fully explore the temporal structure of instance-level behaviors and optimize the score ranking of proposals, the team introduced instance-level action integrity learning.
This approach refines the confidence scores and boundaries of proposals through instance-level feature learning guided by reliable instance prototypes.
Reliable prototype building at the instance level
To make use of the instance-level prior information in the point annotations during training, the team proposed a point-annotation-based proposal generation method that produces proposals with different reliability levels.
These proposals fall into two types, based on their reliability score and their temporal position relative to the point annotations:
- Reliable Proposals (RP): for each point of each category, the proposal that contains this point and has the highest reliability;
- Positive Proposals (PP): all remaining candidate proposals.
To balance the numbers of positive and negative samples, the research team grouped fragments whose class-independent attention scores fall below a predefined value into Negative Proposals (NP).
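The grouping described above can be sketched as follows for a single category. The tuple layout `(start, end, reliability)`, the function names and the use of fragment indices for negative spans are all illustrative assumptions rather than details from the paper.

```python
def split_proposals(candidates, points):
    """Split candidates (start, end, reliability) into reliable and positive.

    For each annotated point, the covering candidate with the highest
    reliability becomes a Reliable Proposal; all other candidates are
    treated as Positive Proposals.
    """
    reliable = []
    for t in points:
        covering = [p for p in candidates if p[0] <= t <= p[1]]
        if covering:
            reliable.append(max(covering, key=lambda p: p[2]))
    positive = [p for p in candidates if p not in reliable]
    return reliable, positive


def negative_proposals(attention, threshold):
    """Group consecutive low-attention fragments into (start, end) index spans.

    The end index is exclusive.
    """
    spans, start = [], None
    for i, score in enumerate(attention):
        if score < threshold:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(attention)))
    return spans
```

For example, with candidates `(0, 4, 0.9)`, `(1, 3, 0.6)` and `(6, 9, 0.8)` and points at 2 and 7, the first and third candidates become reliable proposals and the second becomes a positive proposal.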
Instance-level reliability-aware optimization
To predict the integrity score of each proposal, the research team fed boundary-sensitive proposal features into the score prediction head φs.
The predicted integrity score is then supervised using the IoU between each positive or negative proposal and the reliable proposal as the target.
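As a sketch of this supervision signal, the code below uses the temporal IoU with the reliable proposal as the regression target and a mean squared error as the penalty. The choice of MSE and the function names are assumptions; the article does not show the exact loss.

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments, each given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def integrity_targets(proposals, reliable):
    """Target score of each proposal: its IoU with the reliable proposal."""
    return [temporal_iou(p, reliable) for p in proposals]


def integrity_loss(predicted, proposals, reliable):
    """Mean squared error between predicted scores and IoU targets (assumed)."""
    targets = integrity_targets(proposals, reliable)
    return sum((s - t) ** 2 for s, t in zip(predicted, targets)) / len(targets)
```

A negative proposal that does not overlap the reliable proposal gets a target of 0, so the score head learns to rank complete proposals above partial or background ones.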
To obtain behavior proposals with more accurate boundaries, the researchers feed the start-region and end-region features of each PP into the regression prediction head φr, which predicts offsets for the proposal's start and end times.
The offsets are then applied to obtain refined proposals, which are trained to coincide with the reliable proposals.
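A minimal sketch of the refinement step follows. It assumes that the predicted offsets are expressed as fractions of the proposal length, a common convention in boundary regression, and that agreement with the reliable proposal is measured by an IoU-based loss. Neither detail is confirmed by the article, and the function names are hypothetical.

```python
def temporal_iou(seg_a, seg_b):
    """IoU of two temporal segments, each given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def refine_proposal(proposal, start_offset, end_offset):
    """Shift both boundaries by offsets scaled with the proposal length."""
    start, end = proposal
    length = end - start
    return (start + start_offset * length, end + end_offset * length)


def iou_regression_loss(refined, reliable):
    """Assumed loss: 0 when the refined proposal matches the reliable one."""
    return 1.0 - temporal_iou(refined, reliable)


refined = refine_proposal((2.0, 6.0), -0.25, 0.5)  # widens the proposal
loss = iou_regression_loss(refined, (1.0, 8.0))
```

Scaling the offsets by the proposal length keeps the regression targets in a similar range for short and long proposals.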
In short, HR-Pro needs only a few annotations, which greatly reduces labeling cost, and it also generalizes well, both of which favor practical deployment.
Based on this, the authors predict that HR-Pro will have broad application prospects in the fields of behavior analysis, human-computer interaction, and driving analysis.
Paper: https://arxiv.org/abs/2308.12608