Contributed by Zhang Huaxin

QbitAI | WeChat official account QbitAI

Temporal Action Localization (TAL) is a common method for understanding video content.

Once you've modeled the video content, you can search freely through the entire video.

A joint team from Huazhong University of Science and Technology and the University of Michigan has recently made new progress on this technology:

Previously, TAL models were supervised at the fragment or even instance level. Now a single annotated frame in the video is enough, and the results are comparable to full supervision.

Learning clip features from single-frame annotated video, with performance on par with full supervision | AAAI24

The team from Huazhong University of Science and Technology has proposed HR-Pro, a new framework for point-supervised temporal action detection.

Through multi-level reliability propagation, HR-Pro learns more discriminative fragment-level features and more reliable instance-level boundaries.

HR-Pro includes two reliability-aware stages that propagate high-confidence cues from point annotations at the fragment level and the instance level, so the network learns more discriminative fragment representations and more reliable proposals.

Extensive experiments on multiple benchmark datasets show that HR-Pro significantly outperforms existing methods and achieves state-of-the-art results, demonstrating both the effectiveness of the method and the potential of point annotation.

Performance is comparable to a fully supervised approach

The figure below compares the temporal action detection results of HR-Pro and LACP on THUMOS14 test videos.

HR-Pro shows more accurate detection of action instances, specifically:

For the "golf swing" action, HR-Pro effectively distinguishes action fragments from background fragments, reducing the false positive predictions that LACP struggles with.
For the "discus throw" action, HR-Pro detects more complete segments than LACP, which produces lower activation values on less discriminative action fragments.

Quantitative results on the dataset confirm this impression.

Visualizing the detection results on the THUMOS14 dataset shows that the gap between high-quality and low-quality predictions widens significantly after instance-level integrity learning.

(Left: before instance-level integrity learning; right: after. The horizontal and vertical axes represent time and reliability score, respectively.)

Overall, HR-Pro greatly exceeds state-of-the-art point-supervised methods on four commonly used datasets. Its average mAP on THUMOS14 reaches 60.3%, which is 6.5% higher than the previous SoTA method (53.7%), and it achieves results comparable to some fully supervised methods.

On the THUMOS14 test set, HR-Pro reaches an average mAP of 60.3% over IoU thresholds from 0.1 to 0.7, 6.5% higher than the previous state-of-the-art method CRRC-Net, as shown in the table below.

HR-Pro also performs comparably to competitive fully supervised methods such as AFSD (average mAP of 51.1% vs. 52.0% over IoU thresholds from 0.3 to 0.7).

△ Comparison of HR-Pro and previous SOTA methods on the THUMOS14 dataset
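The IoU thresholds quoted above refer to temporal IoU: the overlap between a predicted segment and a ground-truth segment on the time axis. The following is a minimal sketch of that measure, not the benchmark's evaluation code; the function names and the example segments are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """IoU of two 1-D segments, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def hits_at_thresholds(pred, gt, thresholds):
    """For each IoU threshold, report whether the prediction counts as correct."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}


# Thresholds 0.1, 0.2, ..., 0.7, the range used for the average mAP above.
thresholds = [round(0.1 * k, 1) for k in range(1, 8)]

# Hypothetical prediction and ground truth: 2 s of overlap, 6 s of union.
result = hits_at_thresholds((2.0, 6.0), (4.0, 8.0), thresholds)
```

A full mAP computation additionally ranks predictions by confidence and averages precision per class; the sketch only shows how a single prediction is judged at each threshold.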

HR-Pro also generalizes well across benchmarks, significantly outperforming existing methods with improvements of 3.8%, 7.6% and 2.0% on GTEA, BEOID and ActivityNet 1.3, respectively.

△ Comparison of HR-Pro and previous SOTA methods on GTEA and other datasets

So, how exactly does HR-Pro achieve this?

Learning proceeds in two stages

The research team proposes a multi-level reliability propagation method. At the fragment level, it introduces a reliable fragment memory module and uses cross-attention to propagate its information to other fragments. At the instance level, it proposes point-supervised proposal generation, which links fragments to instances to generate proposals with different reliability levels, and then further optimizes the confidence and boundaries of those proposals.

The model structure of HR-Pro is shown in the figure below: temporal action detection is divided into a two-stage learning process, namely fragment-level discrimination learning and instance-level integrity learning.

Stage 1: Fragment-level discrimination learning

The research team introduced reliability-aware fragment-level discrimination learning, which stores a reliable prototype for each category and propagates the high-confidence cues in these prototypes to other fragments through intra-video and inter-video methods.

Reliable prototype building at the fragment level

To build reliable prototypes at the fragment level, the team created an online-updated prototype memory that stores a reliable prototype m_c for each action class (where c = 1, 2, ..., C), so that feature information from the entire dataset can be used.

The research team initialized the prototypes with the features of point-annotated fragments.
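One plausible way to initialize such prototypes is to average, per class, the features of the fragments that carry a point annotation. The sketch below assumes exactly that; the averaging rule, the function name `init_prototypes`, and the toy features are illustrative assumptions rather than the authors' formula.

```python
def init_prototypes(features, point_labels, num_classes):
    """Average the features of point-annotated fragments, one prototype per class.

    features: list of feature vectors, one per fragment.
    point_labels: dict mapping fragment index -> class id (annotated fragments only).
    Returns a list of prototypes; a class with no annotation gets None.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for idx, c in point_labels.items():
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += features[idx][d]
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]


# Toy example: four 2-D fragment features, three of them point-annotated.
features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
point_labels = {0: 0, 1: 0, 2: 1}
prototypes = init_prototypes(features, point_labels, num_classes=2)
```

In the framework described here the memory would span the whole training set, so the same accumulation would run over every video rather than a single list.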

Next, the researchers updated the prototype of each category using the features of pseudo-labeled action fragments.
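An online update of this kind is often written as a momentum (moving-average) blend of the stored prototype and the new features. The sketch below assumes that form; the momentum rule, its default value, and the name `update_prototype` are assumptions and may differ from the update used in HR-Pro.

```python
def update_prototype(prototype, pseudo_features, momentum=0.9):
    """Blend the stored prototype with the mean of newly pseudo-labeled features.

    Assumed rule: new = momentum * old + (1 - momentum) * mean(pseudo_features).
    """
    if not pseudo_features:
        return prototype  # nothing new was pseudo-labeled for this class
    dim = len(prototype)
    mean = [
        sum(f[d] for f in pseudo_features) / len(pseudo_features)
        for d in range(dim)
    ]
    return [momentum * p + (1.0 - momentum) * m for p, m in zip(prototype, mean)]


# Toy example with momentum 0.5 so the arithmetic is easy to follow.
old = [2.0, 1.0]
new_features = [[4.0, 3.0], [0.0, 1.0]]  # their mean is [2.0, 2.0]
updated = update_prototype(old, new_features, momentum=0.5)
```

A high momentum keeps the prototype stable against noisy pseudo-labels, while a low one lets it track the current features more closely.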

Fragment-level reliability-aware optimization

To transfer the feature information of the fragment-level reliable prototypes to other fragments, the research team designed a Reliability-aware Attention Block (RAB). It injects reliable information from the prototypes into other fragments through cross-attention, which makes the fragment features more robust and increases attention to less discriminative fragments.
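The core of cross-attention can be sketched in a few lines: each fragment feature acts as a query, and the class prototypes act as keys and values. The version below is a simplified illustration that omits learned projections and normalization layers; the residual connection and the function names are assumptions, not the RAB architecture itself.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(fragments, prototypes):
    """Each fragment (query) attends over the prototypes (keys and values)."""
    scale = math.sqrt(len(prototypes[0]))
    out = []
    for q in fragments:
        # Scaled dot-product similarity between the fragment and each prototype.
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in prototypes]
        weights = softmax(logits)
        attended = [
            sum(w * k[d] for w, k in zip(weights, prototypes))
            for d in range(len(q))
        ]
        # Residual connection: keep the fragment feature, add prototype information.
        out.append([x + a for x, a in zip(q, attended)])
    return out


# Toy example: one fragment, two orthogonal prototypes.
enhanced = cross_attend([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The fragment is pulled toward the prototype it already resembles, which is how reliable class information can reach fragments that carry no annotation of their own.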

To learn more discriminative fragment features, the team also constructed a reliability-aware fragment contrastive loss.
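Contrastive losses of this kind typically pull a fragment toward a positive reference and push it away from negatives. The sketch below uses a generic InfoNCE-style formulation with cosine similarity; this form, the temperature value, and the function names are assumptions, and the loss defined for HR-Pro may be weighted or structured differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def contrast_loss(feature, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when `feature` is closer to `positive`
    than to any vector in `negatives`."""
    pos = math.exp(cosine(feature, positive) / temperature)
    neg = sum(math.exp(cosine(feature, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# A fragment aligned with its class prototype yields a small loss ...
low = contrast_loss([1.0, 0.0], positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
# ... and a fragment aligned with the wrong prototype yields a large one.
high = contrast_loss([1.0, 0.0], positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
```

Here the positive would be the reliable prototype of the fragment's class, and the negatives the prototypes of other classes or background features.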

Stage 2: Instance-level integrity learning

To fully explore the temporal structure of instance-level actions and optimize the score ranking of proposals, the team introduced instance-level action integrity learning.

This approach refines the confidence scores and boundaries of proposals through instance-level feature learning guided by reliable instance prototypes.

Reliable prototype building at the instance level

To make use of the instance-level prior information in point annotations during training, the team proposed a point-annotation-based proposal generation method that produces proposals with different reliability levels.

These proposals fall into two types, based on their reliability scores and their temporal positions relative to the point annotations:

Reliable Proposals (RP): for each point in each category, the proposal that contains this point and has the highest reliability;
Positive Proposals (PP): all remaining candidate proposals.

To balance the numbers of positive and negative samples, the research team grouped fragments whose class-agnostic attention scores fall below a predefined value into Negative Proposals (NP).
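The grouping rules above can be sketched as follows. The data layout (proposals as `(start, end, reliability)` tuples, attention as a per-fragment list), the function names, and the threshold value are illustrative assumptions; only the selection rules come from the description above.

```python
def split_proposals(proposals, points):
    """Split candidates into reliable and positive proposals.

    proposals: list of (start, end, reliability) tuples.
    points: annotated timestamps for one class.
    For each point, the highest-reliability proposal containing it is
    reliable; every other candidate is treated as positive.
    """
    reliable = []
    for t in points:
        covering = [p for p in proposals if p[0] <= t <= p[1]]
        if covering:
            best = max(covering, key=lambda p: p[2])
            if best not in reliable:
                reliable.append(best)
    positive = [p for p in proposals if p not in reliable]
    return reliable, positive


def negative_proposals(attention, threshold):
    """Group consecutive fragments whose class-agnostic attention is below
    `threshold` into half-open index ranges (start, end)."""
    segments, start = [], None
    for i, a in enumerate(attention):
        if a < threshold and start is None:
            start = i
        elif a >= threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(attention)))
    return segments


candidates = [(1.0, 4.0, 0.9), (2.0, 5.0, 0.6), (8.0, 9.0, 0.7)]
rp, pp = split_proposals(candidates, points=[3.0])
np_segments = negative_proposals([0.1, 0.2, 0.9, 0.8, 0.1], threshold=0.3)
```

In this toy case the point at 3.0 s is covered by two candidates, and the one with reliability 0.9 becomes the reliable proposal.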

Instance-level reliability-aware optimization

To predict an integrity score for each proposal, the research team fed the boundary-sensitive proposal features into the score prediction head φs.

The predicted integrity score is then supervised using the IoU between each positive or negative proposal and the reliable proposal as the target.

To obtain action proposals with more accurate boundaries, the researchers fed the start-region and end-region features of each PP into the regression prediction head φr, which predicts offsets for the proposal's start and end times.

These offsets are then applied to obtain refined proposals, which are expected to coincide with the reliable proposals.
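Using IoU as a regression target can be illustrated with a simple mean-squared-error loss. The choice of MSE, the function names, and the example segments are assumptions for illustration; the predicted scores stand in for the output of the score head.

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def integrity_loss(pred_scores, proposals, reliable):
    """Mean squared error between predicted scores and IoU targets.

    pred_scores: one predicted integrity score per proposal.
    proposals: positive and negative proposals as (start, end).
    reliable: the reliable proposal the targets are measured against.
    """
    targets = [tiou(p, reliable) for p in proposals]
    return sum((s - t) ** 2 for s, t in zip(pred_scores, targets)) / len(proposals)


reliable = (2.0, 6.0)
proposals = [(2.0, 6.0), (4.0, 8.0), (10.0, 12.0)]
targets = [tiou(p, reliable) for p in proposals]  # 1, 1/3, and 0
perfect = integrity_loss(targets, proposals, reliable)
uninformed = integrity_loss([0.0, 0.0, 0.0], proposals, reliable)
```

A proposal that matches the reliable one gets a target near 1, a partial overlap gets an intermediate target, and a disjoint negative proposal gets 0, which is what lets the score rank proposals by completeness.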
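Applying predicted offsets to a proposal can be sketched as below. Scaling the offsets by the proposal length is a common convention in boundary regression and is an assumption here, as are the function name and the example values.

```python
def refine(proposal, offsets):
    """Shift a proposal's boundaries by predicted offsets.

    proposal: (start, end) in seconds.
    offsets: (d_start, d_end), assumed to be fractions of the proposal length.
    """
    start, end = proposal
    length = end - start
    d_start, d_end = offsets
    new_start = start + d_start * length
    new_end = end + d_end * length
    # Guard against crossed boundaries after large offsets.
    return (min(new_start, new_end), max(new_start, new_end))


# A positive proposal that starts and ends 2 s too late is shifted back
# so that it lines up with a reliable proposal at (2.0, 6.0).
refined = refine((4.0, 8.0), (-0.5, -0.5))
```

During training, a regression loss between the refined proposal and the reliable proposal would drive the predicted offsets toward values like these.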

In short, HR-Pro greatly reduces labeling cost because it needs only a few annotations, and it also generalizes well, which makes it favorable for practical deployment and application.

Based on this, the authors predict that HR-Pro will have broad application prospects in the fields of behavior analysis, human-computer interaction, and driving analysis.

Paper: https://arxiv.org/abs/2308.12608

