Learn snippet features from videos with only single-frame annotations, and match fully supervised performance! | AAAI 2024

Author: QbitAI

Contributed by Zhang Huaxin

Temporal Action Localization (TAL) is a common technique in video understanding.

Once the video content has been modeled, actions can be searched for freely throughout the entire video.

And a joint team from Huazhong University of Science and Technology and the University of Michigan has recently brought new progress to this technology:

Previously, TAL modeling relied on segment-level or even instance-level annotations; now a single annotated frame per action instance is enough, with results comparable to full supervision.

The team from Huazhong University of Science and Technology has proposed HR-Pro, a new framework for point-supervised temporal action localization.

Through multi-level reliability propagation, HR-Pro learns more discriminative snippet-level features and more reliable instance-level boundaries.

HR-Pro comprises two reliability-aware stages that effectively propagate high-confidence cues from the point annotations at both the snippet level and the instance level, allowing the network to learn more discriminative snippet representations and more reliable proposals.

Extensive experiments on multiple benchmark datasets show that HR-Pro significantly outperforms existing methods and achieves state-of-the-art results, demonstrating both its effectiveness and the potential of point annotations.

Performance comparable to fully supervised methods

The figure below compares HR-Pro and LACP on temporal action detection for THUMOS14 test videos.

HR-Pro detects action instances more accurately. Specifically:

  • For the "golf swing" action, HR-Pro effectively distinguishes the action from background segments, alleviating the false-positive predictions that LACP struggles to handle.
  • For the "throw discus" action, HR-Pro detects more complete action segments than LACP, whose activations are low on less discriminative action snippets.

The quantitative results on the datasets confirm this impression.

Visualizing the detection results on the THUMOS14 dataset shows that the gap between high-quality and low-quality predictions widens significantly after instance-level completeness learning.

(Left: before instance-level completeness learning; right: after. The horizontal and vertical axes represent time and reliability score, respectively.)

Overall, HR-Pro greatly outperforms the most advanced point-supervised methods on four commonly used datasets, reaching an average mAP of 60.3% on THUMOS14, 6.5% higher than the previous SOTA method (53.7%), and achieves results comparable to some fully supervised methods.

On the THUMOS14 test set, HR-Pro achieves an average mAP of 60.3% over IoU thresholds from 0.1 to 0.7, 6.5% higher than the previous state-of-the-art method CRRC-Net, as shown in the table below.

And HR-Pro achieves performance comparable to competitive fully supervised methods such as AFSD (average mAP of 51.1% vs. 52.0% over IoU thresholds from 0.3 to 0.7).

△ Comparison of HR-Pro with previous SOTA methods on the THUMOS14 dataset

HR-Pro also generalizes well across benchmarks, significantly outperforming existing methods with improvements of 3.8%, 7.6%, and 2.0% on GTEA, BEOID, and ActivityNet 1.3, respectively.

△ Comparison of HR-Pro with previous SOTA methods on GTEA and other datasets

So, how exactly does HR-Pro achieve this?

Learning proceeds in two stages

The research team proposes a multi-level reliability propagation method. At the snippet level, it introduces a reliable snippet memory module and uses cross-attention to propagate reliable cues to other snippets. At the instance level, it proposes point-supervised proposal generation, which associates snippets with instances to generate proposals of different reliability levels, and then further refines the confidence scores and boundaries of these proposals.

The model structure of HR-Pro is shown in the figure below: temporal action detection is divided into a two-stage learning process, namely snippet-level discriminability learning and instance-level completeness learning.

Stage 1: Snippet-level discriminability learning

The research team introduces reliability-aware snippet-level discriminability learning: reliable prototypes are stored for each category, and the high-confidence cues in these prototypes are propagated to other snippets both within and across videos.

Reliable prototype construction at the snippet level

To build reliable snippet-level prototypes, the team maintains an online-updated prototype memory that stores one reliable prototype M_c per action class (where c = 1, 2, ..., C), so that feature information from the entire dataset can be leveraged.

The prototypes are initialized with the features of the point-annotated snippets.

The prototypes for each class are then updated online using the features of pseudo-labeled action snippets.

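For illustration, an online prototype memory like this is commonly maintained with a momentum (moving-average) update. The sketch below is a hypothetical minimal NumPy version; the function name, the momentum value, and the exact update rule are assumptions rather than details from the paper.

```python
import numpy as np

def update_prototype(prototype, segment_feats, momentum=0.9):
    # Average the pseudo-labeled snippet features for this class, then
    # blend them into the stored prototype with a momentum term.
    mean_feat = segment_feats.mean(axis=0)
    return momentum * prototype + (1.0 - momentum) * mean_feat

proto = np.zeros(4)        # stored prototype for one class
feats = np.ones((3, 4))    # pseudo-labeled snippet features for that class
proto = update_prototype(proto, feats)
```

A high momentum keeps the prototype stable across noisy pseudo-labels while still absorbing information from the whole dataset over time.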
Snippet-level reliability-aware optimization

To transfer feature information from the reliable snippet-level prototypes to other snippets, the research team designed a Reliability-aware Attention Block (RAB). Through cross-attention, it injects reliable information from the prototypes into other snippets, enhancing the robustness of snippet features and increasing attention to less discriminative snippets.

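The core of such a block is cross-attention with snippets as queries and class prototypes as keys and values. Below is a minimal NumPy sketch of that idea only; the learned projection matrices and other components of the actual RAB are omitted, and all names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reliability_cross_attention(segments, prototypes):
    # Queries: snippet features (T, d); keys/values: class prototypes (C, d).
    d = segments.shape[-1]
    scores = segments @ prototypes.T / np.sqrt(d)  # (T, C) attention logits
    attn = softmax(scores, axis=-1)                # each snippet attends over prototypes
    enhanced = attn @ prototypes                   # (T, d) injected reliable information
    return segments + enhanced                     # residual connection

rng = np.random.default_rng(0)
segs = rng.standard_normal((5, 8))     # 5 snippets, 8-dim features
protos = rng.standard_normal((3, 8))   # 3 class prototypes
out = reliability_cross_attention(segs, protos)
```

Even less discriminative snippets receive prototype information this way, since every snippet attends over all class prototypes.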
To learn more discriminative snippet features, the team also constructs a reliability-aware snippet contrastive loss.

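Losses of this kind are typically InfoNCE-style: each pseudo-labeled snippet is pulled toward its own class prototype and pushed away from the others. The sketch below is a generic version of such a loss; the reliability weighting and exact formulation used in HR-Pro are not reproduced here.

```python
import numpy as np

def segment_prototype_contrastive_loss(feats, labels, prototypes, tau=0.1):
    # Normalize features and prototypes to unit length (cosine similarity).
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = feats @ protos.T / tau                 # (N, C) similarity logits
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy of each snippet against its own class prototype.
    return -log_prob[np.arange(len(labels)), labels].mean()

protos = np.eye(3, 4)    # 3 class prototypes in 4-D
feats = np.eye(3, 4)     # each snippet sits exactly on its prototype
labels = np.array([0, 1, 2])
loss = segment_prototype_contrastive_loss(feats, labels, protos)
```

When every snippet matches its prototype, as in this toy example, the loss is close to zero; mismatched assignments drive it up.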
Stage 2: Instance-level completeness learning

To fully exploit the temporal structure of action instances and optimize proposal score ranking, the team introduces instance-level action completeness learning.

This stage refines proposal confidence scores and boundaries through instance-level feature learning guided by reliable instance prototypes.

Reliable prototype construction at the instance level

To exploit the instance-level prior information in the point annotations during training, the team proposes a point-guided proposal generation method that produces proposals with different reliability levels.

Based on their reliability scores and temporal positions relative to the annotated points, these proposals are divided into two types:

  • Reliable Proposals (RP): for each annotated point of each class, the proposal that contains the point and has the highest reliability score;
  • Positive Proposals (PP): all remaining candidate proposals.

To balance positive and negative sample sizes, the research team groups segments whose class-agnostic attention scores fall below a predefined threshold into Negative Proposals (NP).
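The RP/PP split above can be sketched as follows. The candidate format (start, end, reliability score) and the selection logic are illustrative assumptions, and the attention-threshold rule for Negative Proposals is omitted.

```python
def group_proposals(candidates, point):
    # Candidates that temporally contain the annotated point.
    containing = [c for c in candidates if c[0] <= point <= c[1]]
    # Reliable Proposal: the containing candidate with the highest score.
    rp = max(containing, key=lambda c: c[2]) if containing else None
    # Positive Proposals: every other candidate.
    pp = [c for c in candidates if c is not rp]
    return rp, pp

# Each candidate is (start, end, reliability score) -- a hypothetical layout.
cands = [(2.0, 8.0, 0.9), (1.0, 5.0, 0.6), (10.0, 12.0, 0.8)]
rp, pp = group_proposals(cands, point=4.0)
```

Here the proposal (2.0, 8.0) both contains the annotated point at t = 4.0 and has the highest score, so it becomes the Reliable Proposal; the other two candidates fall into the Positive Proposal pool.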

Instance-level reliability-aware optimization

To predict a completeness score for each proposal, the research team feeds boundary-sensitive proposal features into a score prediction head φs.

The predicted completeness score of each proposal is then supervised using the IoU between the positive/negative proposals and the reliable proposal as the target.

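A minimal sketch of this supervision, assuming 1-D temporal IoU as the target and a simple squared-error term as the loss (the paper's exact loss formulation may differ):

```python
def temporal_iou(p, q):
    # 1-D IoU between two proposals p = (start, end) and q = (start, end).
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = (p[1] - p[0]) + (q[1] - q[0]) - inter
    return inter / union if union > 0 else 0.0

def completeness_loss(pred_score, proposal, reliable_proposal):
    # Supervise the predicted completeness score with the IoU target.
    target = temporal_iou(proposal, reliable_proposal)
    return (pred_score - target) ** 2
```

For example, proposals (2, 8) and (4, 10) overlap on [4, 8], giving an IoU of 4/8 = 0.5; a proposal whose predicted score equals that IoU incurs zero loss.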
To obtain proposals with more accurate boundaries, the researchers feed the start-region and end-region features of each positive proposal into a regression head φr, which predicts offsets for the proposal's start and end times.

Refined proposals are then computed from the predicted offsets, with the goal that they coincide with the reliable proposals.

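The refinement step can be sketched as applying the predicted offsets to the coarse boundaries. The length-relative parameterisation below is a common choice in boundary regression and is an assumption here, not a detail taken from the paper.

```python
def refine_proposal(proposal, offsets):
    # Apply predicted start/end offsets, expressed as fractions of the
    # proposal length, to the coarse boundaries.
    s, e = proposal
    length = e - s
    ds, de = offsets
    return s + ds * length, e + de * length

# Shift the start 10% of the length earlier and the end 20% later.
refined = refine_proposal((10.0, 20.0), (-0.1, 0.2))
```

Training then pushes the refined boundaries toward those of the reliable proposal, e.g. by maximizing their temporal IoU.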
In short, HR-Pro greatly reduces annotation cost, requiring only a few point annotations, while retaining strong generalization ability, which provides favorable conditions for practical deployment and application.

Based on this, the authors expect HR-Pro to find broad application in action analysis, human-computer interaction, and driving analysis.

Paper: https://arxiv.org/abs/2308.12608

— END —
