The first transfer-based video attack algorithm built on temporal translation: Fudan University research accepted to AAAI 2022

Heart of the Machine column

Jiang Yugang's team, Fudan University

Fudan University studies the transferability of adversarial examples across video models, with the aim of promoting the secure development of video models.

In recent years, deep learning has achieved great success in tasks such as image recognition, object recognition, semantic segmentation, and video recognition. As a result, models based on deep learning are increasingly deployed in security monitoring, autonomous driving, and other industries. Recent research, however, has shown that deep learning models are fragile and vulnerable to adversarial examples: inputs produced by adding adversarial perturbations to clean samples so that the model misclassifies them. Adversarial examples pose a serious threat to the practical deployment of deep learning, especially given their recently discovered transferability across models, which makes black-box attacks on deployed models possible. Specifically, an attacker uses a fully accessible model (a white-box model) to generate adversarial examples and then uses them to attack models that may be deployed online and expose only their outputs (black-box models). Current research on this topic focuses mainly on image models, with far less work on video models. There is therefore an urgent need to study the transferability of adversarial examples across video models in order to promote the secure development of video models.

Temporal translation attack method

Compared with image data, video data carries additional temporal information that describes dynamic changes in the video. Several model structures (e.g., Non-local, SlowFast, TPN) have been proposed to capture this rich temporal information. However, this diversity of structures may cause different models to respond most strongly to different regions of the same video input, and may cause adversarial examples generated during an attack to overfit the white-box model and transfer poorly to other models. To examine this hypothesis, researchers from Jiang Yugang's team at Fudan University first studied the similarity between the temporal discriminative patterns of several commonly used video recognition models, and found that models with different structures often have different temporal discriminative patterns. Based on this finding, the researchers proposed a method based on temporal translation for generating highly transferable video adversarial examples.

Paper link: https://arxiv.org/pdf/2110.09075.pdf

Code link: https://github.com/zhipeng-wei/TT

Analysis of temporal discriminative patterns in video models

In image models, CAM (class activation mapping) is often used to visualize the regions of an image that the model finds discriminative. In video models, the additional temporal dimension makes discriminative patterns difficult to visualize and difficult to compare across models. To address this, the researchers defined the importance ranking of video frames as the temporal discriminative pattern of a video model. If two models share similar temporal discriminative patterns, their distributions of frame importance will be more similar.

Calculating video frame importance

The researchers used three methods to measure how important each video frame is to the model's decision: Grad-CAM, Zero-padding, and Mean-padding. Grad-CAM averages the attention map computed by CAM within each frame and uses that mean as the frame's importance. Zero-padding replaces all pixel values in the i-th frame with 0 and measures how much the loss changes before and after the replacement; the larger the change, the more important the i-th frame. Similarly, Mean-padding replaces the i-th frame with the mean of its adjacent frames. With any of these three methods, the frame importance under different models can be computed and used as the model's temporal discriminative pattern.
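The Zero-padding and Mean-padding scores can be illustrated with a minimal NumPy sketch. This is not the authors' code: `frame_importance` and `loss_fn` are hypothetical names, `loss_fn` stands in for a real model's loss, and taking the absolute loss change and using a single neighbour at the clip boundaries are assumptions made here.

```python
import numpy as np

def frame_importance(video, label, loss_fn, mode="zero"):
    """Score each frame by how much the loss changes when that frame is replaced.

    video:   float array of shape (T, H, W, C)
    loss_fn: hypothetical callable (video, label) -> float, a stand-in for a model loss
    mode:    "zero" replaces frame i with zeros; "mean" replaces it with the mean of
             its temporal neighbours (only one neighbour exists at the clip boundaries)
    """
    T = video.shape[0]
    if mode == "mean" and T < 2:
        raise ValueError("mean-padding needs at least two frames")
    base_loss = loss_fn(video, label)
    scores = np.zeros(T)
    for i in range(T):
        perturbed = video.copy()
        if mode == "zero":
            perturbed[i] = 0.0
        elif mode == "mean":
            neighbours = [video[j] for j in (i - 1, i + 1) if 0 <= j < T]
            perturbed[i] = np.mean(neighbours, axis=0)
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Assumption: importance is the absolute change in loss.
        scores[i] = abs(loss_fn(perturbed, label) - base_loss)
    return scores
```

With a real model, `loss_fn` would run a forward pass and return, for example, the cross-entropy loss for the ground-truth label.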

Calculating the similarity of temporal discriminative patterns

Using the methods above, the frame importance scores of video x on model A are computed as $I^A_x = \{i^A_1, i^A_2, \ldots, i^A_T\}$, where T is the number of input video frames. For model A and model B, $I^A_x$ and $I^B_x$ can thus be obtained and combined with Spearman's rank correlation to compute the similarity of the temporal discriminative patterns of the two models, i.e.

$$K^{A,B}_x = 1 - \frac{6 \sum_{t=1}^{T} \left( \pi(I^A_x)_t - \pi(I^B_x)_t \right)^2}{T \, (T^2 - 1)}$$

where $\pi(\cdot)$ sorts by importance value and returns the rank of each video frame. The value $K^{A,B}_x$ lies between -1 and 1: a value of 0 indicates no relationship between the discriminative patterns of model A and model B, while -1 or 1 indicates a perfectly monotonic relationship. The higher the value, the more similar the discriminative patterns of the two models. This makes it possible to measure the relationship between the temporal discriminative patterns of different video models.
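As a small worked example of the rank-based similarity, the sketch below applies the standard Spearman rank correlation formula to two vectors of frame importance scores. The function names are hypothetical, and the sketch assumes there are no tied scores and at least two frames.

```python
import numpy as np

def ranks(scores):
    """Return 1-based ranks (1 = smallest score). Assumes no tied scores."""
    order = np.argsort(scores)
    r = np.empty(len(scores), dtype=float)
    r[order] = np.arange(1, len(scores) + 1)
    return r

def spearman_similarity(scores_a, scores_b):
    """Spearman rank correlation between two frame-importance vectors of length T >= 2."""
    T = len(scores_a)
    d = ranks(scores_a) - ranks(scores_b)
    return 1.0 - 6.0 * np.sum(d ** 2) / (T * (T ** 2 - 1))
```

Two models that rank the frames identically give a similarity of 1, and two models that rank them in exactly opposite order give -1.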

Similarity of discriminative patterns between different video models

The figure above shows a heat map of the relationships between the discriminative patterns of six video models. Across different architectures, Non-local, SlowFast, and TPN have less similar temporal discriminative patterns, while within the same architecture, models using 3D ResNet-50 and 3D ResNet-101 as backbones have more similar patterns. These trends hold for all three frame importance calculation methods. The experiments therefore support the paper's hypothesis that different video model structures lead to different temporal discriminative patterns.

Based on these observations, the researchers proposed a transfer attack method based on temporal translation. Translating the video frames along the temporal dimension reduces how closely the generated adversarial example fits the specific discriminative pattern of the white-box model, which raises the attack success rate of the adversarial example on black-box models.

Let $x \in \mathbb{R}^{T \times H \times W \times C}$ denote the input video and $y \in \{1, \ldots, K\}$ its ground-truth label, where T, H, W, and C are the number of frames, height, width, and number of channels, and K is the number of classes. Let $f(x)$ denote the video model's prediction for the input video. With $\delta$ defined as the adversarial noise, the goal of the attack can be defined as $f(x + \delta) \neq y$, where $\|\delta\|_p \leq \epsilon$ and $\epsilon$ limits the magnitude of the noise. With $J$ defined as the loss function, the objective of a non-targeted attack can be defined as:

$$\arg\max_{\delta} J(f(x + \delta), y), \quad \text{s.t. } \|\delta\|_p \leq \epsilon$$

To reduce overfitting to the white-box model during the attack, the researchers aggregated the gradients of temporally translated video inputs:

$$g = \sum_{i=-L}^{L} w_i \cdot \mathrm{Trans}\left( \nabla_{x'_i} J(f(x'_i), y), \, -i \right), \quad x'_i = \mathrm{Trans}(x + \delta, i)$$

where L is the maximum translation length and i ranges over the translation lengths from -L to L. The function $\mathrm{Trans}(\cdot, i)$ translates the whole video input by i frames along the temporal dimension. The translation is cyclic: when the position after translation exceeds T, that is, t+i > T, frame t moves to position t+i-T; otherwise it moves to position t+i. After the gradient is computed on the translated video input, it is translated back to the original frame order along the temporal dimension, and the gradients from different translation lengths are combined with the weights w_i. The weights w_i can be generated with uniform, linear, or Gaussian schemes (see the Translation-invariant attack method).
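The aggregation step can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation (their code is at the link above): `grad_fn` is a hypothetical callable returning the loss gradient with respect to its input, translation lengths are assumed to run from -L to L, and the exact linear and Gaussian weight formulas (including the choice of sigma) are assumptions borrowed from common translation-invariant attack kernels.

```python
import numpy as np

def translate(video, i):
    """Cyclically shift frames along the time axis: frame t moves to (t + i) mod T."""
    return np.roll(video, shift=i, axis=0)

def shift_weights(L, kind="gaussian"):
    """Weights w_i for offsets -L..L, normalised to sum to 1 (formulas are assumptions)."""
    offsets = np.arange(-L, L + 1)
    if kind == "uniform":
        w = np.ones(2 * L + 1)
    elif kind == "linear":
        w = 1.0 - np.abs(offsets) / (L + 1)
    elif kind == "gaussian":
        sigma = max(L, 1) / np.sqrt(3.0)  # assumed bandwidth
        w = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    else:
        raise ValueError(f"unknown weight kind: {kind}")
    return w / w.sum()

def aggregated_gradient(x_adv, label, grad_fn, L=7, kind="gaussian"):
    """Weighted sum of gradients computed on temporally translated inputs.

    grad_fn: hypothetical callable (video, label) -> gradient of the loss
             with respect to that video, same shape as the video.
    """
    w = shift_weights(L, kind)
    g = np.zeros_like(x_adv, dtype=float)
    for w_i, i in zip(w, range(-L, L + 1)):
        g_i = grad_fn(translate(x_adv, i), label)  # gradient on the translated input
        g += w_i * translate(g_i, -i)              # translate back to original frame order
    return g
```

Linear and Gaussian weights give smaller weight to larger translation lengths, while uniform weights treat all translations equally.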

The overall process of the attack algorithm is as follows, where a clipping operation keeps the generated adversarial noise within the constraint.
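A minimal sketch of an iterative attack loop of this kind is given below. It is not the authors' algorithm listing: it assumes an L-infinity bound, a sign-gradient update with step size eps / steps, pixel values in [0, 1], and a hypothetical `grad_fn` that returns the (for example, temporally aggregated) gradient of the loss. The default values of `eps` and `steps` are illustrative only.

```python
import numpy as np

def iterative_attack(x, label, grad_fn, eps=16 / 255, steps=10):
    """Sign-gradient ascent on the loss with the noise clipped to an L-infinity ball.

    grad_fn: hypothetical callable (x_adv, label) -> gradient of the loss w.r.t. x_adv.
    Setting steps=1 gives a single-step attack.
    """
    alpha = eps / steps                           # assumed step size
    delta = np.zeros_like(x, dtype=float)
    for _ in range(steps):
        g = grad_fn(x + delta, label)
        delta = delta + alpha * np.sign(g)        # move in the direction that raises the loss
        delta = np.clip(delta, -eps, eps)         # keep the noise within the eps bound
        delta = np.clip(x + delta, 0.0, 1.0) - x  # keep pixels in the assumed [0, 1] range
    return x + delta
```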

Results and analysis

To evaluate the temporal translation attack, the researchers ran comparative experiments on the UCF-101 and Kinetics-400 datasets with three video model structures, Non-local, SlowFast, and TPN, each using 3D ResNet-50 and 3D ResNet-101 as backbones. With a video model of one structure serving as the white-box model, the attack success rate (ASR) of the generated adversarial examples on video models of the other structures is used as the evaluation metric.
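For concreteness, ASR can be computed as the fraction of adversarial examples that the black-box model misclassifies. The sketch below assumes the rate is taken over all evaluated videos; whether the original evaluation restricts it to videos that were classified correctly before the attack is not stated here. The function name is hypothetical.

```python
import numpy as np

def attack_success_rate(adv_predictions, labels):
    """Fraction of adversarial examples whose black-box prediction differs from the true label."""
    adv_predictions = np.asarray(adv_predictions)
    labels = np.asarray(labels)
    return float(np.mean(adv_predictions != labels))
```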

The researchers compared methods under both single-step and iterative attacks. The temporal translation attack achieves a higher ASR in both settings, indicating that the resulting adversarial examples are highly transferable. In addition, on video models, single-step attacks work better than iterative attacks, which suggests that transfer attack methods developed for image models are not well suited to more complex video models. Finally, when TPN is used as the white-box model, the improvement from the temporal translation attack is more limited; after analysis, the researchers attribute this to the TPN model being less sensitive to temporal translation.

ASR comparison on video recognition models

The following table compares performance when the method is combined with the Translation-invariant (TI), Attention-guided (ATA), and Momentum iterative (MI) attack methods. The temporal translation method helps each of these methods perform better, showing that it is complementary to them.

Comparison of average ASR results combined with existing methods

In addition, the researchers conducted ablation experiments on the translation length L, the strategy for generating the weights w_i, and the translation strategy.

The translation length L determines how many translated video inputs are used for aggregation. When L=0, the temporal translation method degenerates into the basic iterative attack, so the effect of the translation length needs to be studied. The figure below shows how the performance of the temporal translation attack on different black-box models changes with the translation length. The curve for the Non-local ResNet-50 model is relatively flat, while the curves for the other black-box models rise first and then level off. This is because Non-local ResNet-50 shares a similar model structure with Non-local ResNet-101. To balance ASR against computational cost, the researchers chose L=7 for the experiments.

Performance comparison of the temporal translation attack at different translation lengths

The following table shows the ablation results for the weight generation strategy and the translation strategy. The temporal translation attack achieves better results when video inputs with larger translation lengths are given smaller weights. In addition, when the translation strategy is changed to random frame swapping or long-distance swapping, the temporal translation attack performs poorly.

Performance comparison of the temporal translation attack under different weight generation strategies and translation strategies
