
Pinpoint video clips with a single sentence! Tsinghua's new SOTA method has been open-sourced


Contributed by Houlun Chen

QbitAI | WeChat official account QbitAI

With just a one-sentence description, you can locate the corresponding clip in a long video!

For example, given the description "a person drinking water while going down the stairs", the new method can find the corresponding start and end timestamps in one shot by matching the video frames with the sound of footsteps:


Even a query like "laughing", whose semantics are hard to recognize from the visuals alone, can be located accurately:


The method, called the Adaptive Dual-branch Prompted Network (ADPN), was proposed by a research team from Tsinghua University.

Specifically, ADPN is used to complete a visual-linguistic cross-modal task called Temporal Sentence Grounding (TSG), that is, to locate relevant segments from the video based on the query text.

ADPN is characterized by its ability to efficiently use the consistency and complementarity of visual and audio modalities in video to enhance the localization performance of video clips.

Compared with PMI-LOC and UMT, other TSG methods that also use audio, ADPN obtains a more significant performance gain from the audio modality and sets a new SOTA on multiple benchmarks.

The work has been accepted by ACM Multimedia 2023 and is fully open source.


Let's take a look at what ADPN is~

Locate video clips in one sentence

Temporal Sentence Grounding (TSG) is an important visual-linguistic cross-modal task.

Its goal is to find the start and end timestamps of the segment in an untrimmed video that semantically matches a natural language query, which requires strong temporal cross-modal reasoning ability.
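To make the task setup concrete, here is a minimal, hypothetical PyTorch sketch of what a TSG model's interface looks like; the module names, feature dimensions, and the simple fusion-and-argmax readout are assumptions for illustration, not the authors' code.

```python
from typing import Tuple
import torch
import torch.nn as nn

class ToyTSGModel(nn.Module):
    """Illustrative only: scores each frame against the query and reads out a [start, end] span."""
    def __init__(self, video_dim: int = 1024, text_dim: int = 300, hidden: int = 256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.boundary_head = nn.Linear(hidden, 2)  # per-frame start/end logits

    def forward(self, video_feats: torch.Tensor, query_feat: torch.Tensor) -> Tuple[int, int]:
        # video_feats: (T, video_dim) frame features; query_feat: (text_dim,) sentence embedding
        fused = self.video_proj(video_feats) * self.text_proj(query_feat)    # (T, hidden)
        start_logits, end_logits = self.boundary_head(fused).unbind(dim=-1)  # each (T,)
        start = int(start_logits.argmax())
        end = start + int(end_logits[start:].argmax())  # end cannot precede start
        return start, end
```

Real TSG models use far richer cross-modal interaction, but the input/output contract is the same: video features plus a query in, a start/end pair out.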

However, most existing TSG methods only consider the visual information in the video, such as RGB frames, optical flow, and depth, and ignore the audio that naturally accompanies the video.

Audio tends to carry rich semantics and exhibits both consistency and complementarity with visual information, as shown in the figure below; these properties are helpful for the TSG task.


△ Figure 1

(a) Consistency: the video frames and the sound of footsteps jointly match the query semantics "walking down the stairs"; (b) Complementarity: it is hard to identify the behavior from the video frames alone to locate the query semantics "laughter", but the sound of laughter provides a strong complementary localization cue.

Therefore, the researchers studied the audio-enhanced Temporal Sentence Grounding (ATSG) task in depth, aiming to better capture localization cues from both the visual and audio modalities. Introducing the audio modality, however, brings the following challenges:

  • The consistency and complementarity of the audio and visual modalities are conditioned on the query text, so capturing them requires modeling the interaction among the text, visual, and audio modalities.
  • There are significant differences between the audio and visual modalities, including different information densities and noise intensities, which can affect audio-visual learning.

To address these challenges, the researchers proposed a novel ATSG method, the Adaptive Dual-branch Prompted Network (ADPN).

Through a dual-branch model design, the method adaptively models the consistency and complementarity between audio and vision, and uses a curriculum-learning-based denoising optimization strategy to further eliminate interference from audio noise, highlighting the importance of audio signals for video retrieval.

The overall structure of ADPN is shown in the following figure:


△ Figure 2: Overall schematic of the Adaptive Dual-branch Prompted Network (ADPN).

It consists of three main designs:

1. Dual-branch network structure

Audio is noisier and, for the TSG task, usually carries more redundant information, so learning from the audio and visual modalities should be given different importance. The paper therefore designs a dual-branch network structure that performs multimodal learning with audio and vision while also strengthening the visual stream.

Specifically, as shown in Figure 2(a), ADPN jointly trains a branch that uses only visual information (the visual branch) and a branch that uses both visual and audio information (the joint branch).

The two branches share a similar structure, with the joint branch adding a text-guided clue mining unit (TGCM) to model text-visual-audio interactions. Training updates the parameters of both branches simultaneously, while inference uses the output of the joint branch as the model's prediction.
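A hedged sketch of this dual-branch train/infer scheme might look like the following; the module and function names are placeholders for illustration, not the released ADPN implementation.

```python
import torch.nn as nn

class DualBranchTSG(nn.Module):
    """Wraps a visual-only branch and an audio-visual joint branch."""
    def __init__(self, visual_branch: nn.Module, joint_branch: nn.Module):
        super().__init__()
        self.visual_branch = visual_branch  # video + text only
        self.joint_branch = joint_branch    # video + audio + text (with the TGCM inside)

    def forward(self, video, audio, text):
        pred_v = self.visual_branch(video, text)
        pred_j = self.joint_branch(video, audio, text)
        return pred_v, pred_j

def training_step(model, batch, criterion):
    # Both branches are supervised and updated during training.
    pred_v, pred_j = model(batch["video"], batch["audio"], batch["text"])
    return criterion(pred_v, batch["target"]) + criterion(pred_j, batch["target"])

def predict(model, video, audio, text):
    # At inference time only the joint branch's output is used as the final prediction.
    _, pred_j = model(video, audio, text)
    return pred_j
```

Keeping a visual-only branch in the loss gives the model a stable, audio-free training signal even when the accompanying audio is noisy.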

2. Text-Guided Clues Miner (TGCM)

Considering that the consistency and complementarity of the audio and visual modalities are conditioned on a given text query, the researchers designed the TGCM unit to model the interaction among the text, visual, and audio modalities.

Referring to Figure 2(b), TGCM is divided into two steps: "extraction" and "propagation".

First, the text is used as the query condition to extract relevant information from the visual and audio modalities and integrate it; then the visual and audio modalities are used as query conditions to propagate the integrated information back to the visual and audio streams via attention; finally, feature fusion is performed with an FFN.
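Below is a rough, hypothetical sketch of this extract-then-propagate pattern using standard multi-head attention; the dimensions, the simple additive cue integration, and the module names are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ToyTGCM(nn.Module):
    """Illustrative text-guided clue mining: text-conditioned extraction, then propagation."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.extract_v = nn.MultiheadAttention(dim, heads, batch_first=True)    # text queries visual
        self.extract_a = nn.MultiheadAttention(dim, heads, batch_first=True)    # text queries audio
        self.propagate_v = nn.MultiheadAttention(dim, heads, batch_first=True)  # visual queries fused cue
        self.propagate_a = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio queries fused cue
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, text, visual, audio):
        # text: (B, Lt, D), visual: (B, Tv, D), audio: (B, Ta, D)
        cue_v, _ = self.extract_v(text, visual, visual)      # text-conditioned visual cues
        cue_a, _ = self.extract_a(text, audio, audio)        # text-conditioned audio cues
        cue = cue_v + cue_a                                  # integrate the extracted cues
        visual_out, _ = self.propagate_v(visual, cue, cue)   # propagate back to the visual stream
        audio_out, _ = self.propagate_a(audio, cue, cue)     # propagate back to the audio stream
        return self.ffn(visual_out + visual), self.ffn(audio_out + audio)
```

The key point is that the text acts as the bridge: audio-visual interaction happens through cues that the query has already filtered, rather than through direct audio-visual attention.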

3. Curriculum learning optimization strategy

The researchers observed that audio contains noise that hurts multimodal learning, so they used noise intensity as a proxy for sample difficulty and introduced Curriculum Learning (CL) to denoise the optimization process, as shown in Figure 2(c).

They estimated each sample's difficulty from the discrepancy between the predictions of the two branches, treating overly difficult samples as ones whose audio is too noisy to be useful for the TSG task, and reweighted the loss terms during training according to this difficulty score so as to discard harmful gradients caused by audio noise.
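A hedged illustration of this reweighting idea is sketched below; the exponential weighting function and the discrepancy measure are assumptions chosen for simplicity, not the exact formulation in the paper.

```python
import torch

def curriculum_weights(pred_visual: torch.Tensor,
                       pred_joint: torch.Tensor,
                       temperature: float = 1.0) -> torch.Tensor:
    # Per-sample difficulty = discrepancy between the two branches' predictions.
    diff = (pred_visual - pred_joint).abs()
    difficulty = diff.reshape(diff.size(0), -1).mean(dim=1)   # (B,)
    # Larger discrepancy -> smaller weight, so samples with noisy audio contribute less gradient.
    return torch.exp(-difficulty / temperature)                # (B,), values in (0, 1]

def reweighted_loss(per_sample_loss: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # per_sample_loss: (B,) unreduced loss; weights are detached so they only scale gradients.
    return (weights.detach() * per_sample_loss).mean()
```

Because the difficulty signal comes from the model's own two branches, the weighting adapts over training as the branches converge, which is the "self-aware" aspect described below.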

(Please refer to the original paper for the remaining model structure and training details.)

New SOTA on multiple benchmarks

The researchers evaluated the method on the TSG benchmark datasets Charades-STA and ActivityNet Captions; a comparison with baseline methods is shown in Table 1.

In particular, compared with PMI-LOC and UMT, other TSG methods that also use audio, ADPN obtains a more significant performance gain from the audio modality, demonstrating its superiority in exploiting audio for TSG.


△ Table 1: Experimental results on Charades-STA and ActivityNet Captions

The researchers further demonstrated the effectiveness of the different design units in ADPN through ablation experiments, as shown in Table 2.


△ Table 2: Ablation results on Charades-STA

The researchers visualized the prediction results of several samples and plotted the "text-to-visual" (T→V) and "text-to-audio" (T→A) attention weight distributions from the "extraction" step of the TGCM, as shown in Figure 3.

It can be observed that introducing the audio modality improves the predictions. In the "Person laughs at it" case, the T→A attention weight distribution is closer to the ground truth, correcting the misleading guidance that the T→V weight distribution gives to the model's prediction.


△ Figure 3: Case studies

In summary, this paper proposes a novel Adaptive Dual-branch Prompted Network (ADPN) to solve the audio-enhanced Temporal Sentence Grounding (ATSG) problem.

They designed a dual-branch model structure that jointly trains the visual branch and the audio-visual joint branch, addressing the information gap between the audio and visual modalities.

They also propose a text-guided clue mining unit (TGCM) that uses text semantics as a guide to model text-audio-visual interactions.

Finally, the researchers designed a curriculum-learning-based optimization strategy to further eliminate audio noise, evaluating sample difficulty in a self-aware way as a measure of noise intensity and adaptively adjusting the optimization process.

They delved into the characteristics of audio in ATSG and more effectively harnessed the audio modality to improve performance.

In the future, they hope to build a more appropriate evaluation benchmark for the ATSG to encourage more in-depth research in this area.

Link to paper: https://dl.acm.org/doi/pdf/10.1145/3581783.3612504

Repository link: https://github.com/hlchen23/ADPN-MM

— END —


