Backdoor Attacks and Defenses in AI: In Practice

Preface

Seasoned readers may find the title strange: can an AI system really contain a backdoor? Aren't today's AI systems mostly built by calling the APIs of mature machine learning libraries such as PyTorch and TensorFlow, so how could a backdoor get in? And even if one did, most such systems are only a few hundred lines of code, so wouldn't a simple code audit find and remove it?

If you have these doubts, read on and the reasoning will become clear.

Data Poisoning

Backdoor attacks on AI systems do exist, but they are very different from backdoors in traditional offensive-and-defensive security. A traditional backdoor is written in code and planted on a machine; a backdoor in an AI system is not written in code at all. It is created by modifying the training data, and once training finishes, the backdoor lives inside the model. Because of the black-box nature of the model's internals, the backdoor is hard to detect, and therefore hard to defend against. Since the effect is stealthy, hard to detect, and similar in spirit to the concealment of a traditional backdoor, researchers call this technique a backdoor attack.

Attacking an AI model by modifying its training data naturally brings data poisoning to mind.

Our lab has run the relevant experiments, so we will not expand on the details here; let's look directly at the results of data poisoning.

Take data poisoning against an SVM as an example, shown below.

[Figure: SVM training and test sets before and after poisoning, with decision boundaries]

First, how to read this figure: the top two plots are the training set and the bottom two are the test set. If a point in the red region is a red dot, it is classified correctly; if a point in the red region is a blue dot, it is misclassified. The star-shaped point in the upper right corner is the poisoned data. To see the impact of the poisoning attack on model performance, look at the bottom two plots: the left one is the model before the attack, the right one after. The decision boundary has shifted noticeably, and there are more red dots in the blue region of the right plot; these are misclassified points. In other words, compared to before the poisoning, the attacked model misidentifies more data, i.e. its accuracy drops. This is the effect of data poisoning.

What about backdoor attacks?

The classic backdoor attack is also carried out through data poisoning, but its goal is different. Data poisoning aims to degrade the model's accuracy across the board, while a backdoor attack aims for stealth: when normal data is given to the model, it is classified correctly; only when the input carries the attacker's marker (called a trigger) does the model misclassify it into the category the attacker specifies. Let's look at how backdoor attacks are implemented.

Backdoor Attack

[Figure: workflow of a backdoor attack on MNIST, with a white square trigger in the lower-right corner]

The diagram above shows the flow of a backdoor attack; the trigger in the diagram is the white square in the lower-right corner. What the attacker can manipulate is the training data: poison a portion of it (for example the images of 5 and 7 in the upper-right of the training stage, stamping a white square in their lower-right corner and changing their labels to 4), then train the model on the modified training set. When the attacker later interacts with the model, any sample carrying the trigger is misclassified (an image of 5 or 7 with the small white square is predicted to be 4).

Backdoor Attack in Practice

Let's use the MNIST dataset as an example. It is used so widely that it is known as the fruit fly of AI research (Orz).

Some of the data samples are shown below

[Figure: sample MNIST digits]

The first step is to poison the data, and as the attacker we need to design the trigger first. Treating each MNIST image as a matrix, we design a pattern of 4 pixels as the trigger (sometimes called a patch-based trigger), set the value of those 4 pixels to 1, and place the pattern in the lower-right corner of the image, as sketched below.

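The original post shows this step as a screenshot. As a stand-in, here is a minimal NumPy sketch in the spirit of ART's add_pattern_bd: it stamps four bright pixels near the lower-right corner of each image. The exact pixel offsets used by the library may differ; treat the layout here as illustrative.

```python
import numpy as np

def add_pattern_bd(x, distance=2, pixel_value=1.0):
    """Stamp a 4-pixel trigger near the bottom-right corner of each image.

    `x` is a batch of 2-D grayscale images scaled to [0, 1]. This is a
    hand-rolled approximation of the library function; the precise pixel
    positions are assumptions.
    """
    x = np.copy(x)
    w, h = x.shape[1], x.shape[2]
    # Four pixels arranged in a small pattern near the corner.
    x[:, w - distance, h - distance] = pixel_value
    x[:, w - distance - 1, h - distance - 1] = pixel_value
    x[:, w - distance, h - distance - 2] = pixel_value
    x[:, w - distance - 2, h - distance] = pixel_value
    return x
```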

After processing by the above function, the modified image is returned

We can print an image of a '0'; the image below is the original.

[Figure: an original MNIST '0']

The following figure shows the same image after add_pattern_bd processing.

[Figure: the '0' with the 4-pixel trigger stamped in the lower-right corner]

Poisoning the data consists of two steps. The first, modifying the samples, is done; the second is modifying the labels.

The attack goal here is a shift by one class: poisoned samples that belong to 0 get the label 1, poisoned samples that belong to 1 get the label 2, and so on.

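A minimal sketch of this step, assuming integer labels 0-9; the helper name make_poison is ours, not from any library.

```python
import numpy as np

def make_poison(x_clean, y_clean, num_classes=10):
    """Build poisoned copies: stamp the trigger and shift each label to the next class."""
    x_poison = add_pattern_bd(x_clean)            # overlay the 4-pixel trigger
    y_poison = (y_clean + 1) % num_classes        # 0 -> 1, 1 -> 2, ..., 9 -> 0
    return x_poison, y_poison
```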

So far, we have completed the steps for making poisoned samples.

Next, the poisoned data needs to be mixed into the training set; we use percent_poison to control the proportion of poisoned data in the overall training set. We also need to poison part of the test data, that is, add triggers to some test samples, and shuffle the training set for the subsequent training.

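A sketch of how the poisoned subset might be mixed in and shuffled; mix_poison is an illustrative helper of ours (ART ships its own utilities for this). The same routine would also be applied to part of the test set so the attack success rate can be measured later.

```python
import numpy as np

def mix_poison(x_train, y_train, percent_poison=0.33, seed=42):
    """Poison `percent_poison` of the training set, append it, and shuffle."""
    rng = np.random.default_rng(seed)
    n_poison = int(percent_poison * len(x_train))
    idx = rng.choice(len(x_train), size=n_poison, replace=False)

    # Triggered, relabeled copies of a random subset.
    x_p, y_p = make_poison(x_train[idx], y_train[idx])

    # Keep a ground-truth mask so we can score the defences later.
    is_poison = np.concatenate([np.zeros(len(x_train), dtype=bool),
                                np.ones(n_poison, dtype=bool)])

    x_all = np.concatenate([x_train, x_p])
    y_all = np.concatenate([y_train, y_p])

    order = rng.permutation(len(x_all))
    return x_all[order], y_all[order], is_poison[order]
```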

Let's build a basic convolutional neural network

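A simple Keras CNN along these lines; the exact layer sizes in the original screenshot are not visible, so the architecture below is illustrative.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def build_cnn(input_shape=(28, 28, 1), num_classes=10):
    """A small convolutional network for MNIST classification."""
    model = Sequential([
        Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation="relu"),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation="relu"),
        Dropout(0.5),
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```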

Once the model is built, start training.

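Training might look roughly like this, wrapping the Keras model in ART's KerasClassifier so that the defences later on can use it; the hyperparameters are illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from art.estimators.classification import KerasClassifier

# Load MNIST and scale to [0, 1]. Note: with TF2 you may need
# tf.compat.v1.disable_eager_execution() before using ART's KerasClassifier.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Mix poisoned samples into the training set (helper defined above).
x_mix, y_mix, is_poison = mix_poison(x_train, y_train, percent_poison=0.33)

model = build_cnn()
classifier = KerasClassifier(model=model, clip_values=(0.0, 1.0))
classifier.fit(x_mix[..., np.newaxis],            # add the channel axis
               to_categorical(y_mix, 10),
               nb_epochs=10, batch_size=128)
```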

After training, let's evaluate its effect on a benign test set (samples without overlay triggers).

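A sketch of the benign evaluation, continuing from the snippets above:

```python
# Accuracy on the clean (benign) test set.
clean_preds = classifier.predict(x_test[..., np.newaxis])
clean_acc = np.mean(np.argmax(clean_preds, axis=1) == y_test)
print(f"Accuracy on benign test data: {clean_acc:.4f}")

# Inspect a single clean '0': it should still be predicted as 0.
sample = x_test[y_test == 0][:1]
print("Predicted class:", np.argmax(classifier.predict(sample[..., np.newaxis])))
```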

From the results, we can see that data whose true class is 0, without the trigger overlaid, is predicted by the model to belong to class "0", i.e. the prediction is correct.

Now let's look at how the model behaves on the poisoned test data.

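And a sketch of the evaluation on triggered samples; here the attacker's target for class 0 is class 1, matching the label shift used when poisoning.

```python
# Stamp the trigger onto clean '0' test images and see what the model predicts.
x_bd = add_pattern_bd(x_test[y_test == 0][:100])
bd_preds = np.argmax(classifier.predict(x_bd[..., np.newaxis]), axis=1)

attack_success = np.mean(bd_preds == 1)   # target class for source class 0 is 1
print(f"Attack success rate on triggered '0's: {attack_success:.4f}")
```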

The model predicts the triggered images that should be class 0 as "1", which means the attack succeeded: when facing a test sample carrying the trigger, the model's prediction is the one specified by the attacker.

Backdoor Attack Defense

In this section we describe several possible defenses against backdoor attacks.

Activation Clustering

The idea behind this method is that poisoned samples and benign samples of the target class are both sorted into the same class by the attacked model, but the mechanism by which they end up there is different.

For a benign sample that genuinely belongs to the target class, the model responds to features it learned from target-class inputs; for a poisoned sample, the network responds to features of the source class plus the trigger, and that is what causes the poisoned sample to be misclassified into the target class. This difference in mechanism shows up in the network's activations, and it can be separated by clustering the activations of the model's last hidden layer. Using this, we can detect poisoned samples and then take defensive measures.

Activation Clustering in Practice

We reduce the activations with PCA and then use k-means to split the samples of each class into two clusters.

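A sketch using ART's ActivationDefence as we understand its interface (argument and return conventions may differ between versions); it clusters the last-hidden-layer activations per class and flags the suspicious cluster.

```python
from art.defences.detector.poison import ActivationDefence

# Cluster the activations of the last hidden layer for each class,
# continuing from the training snippet above.
defence = ActivationDefence(classifier,
                            x_mix[..., np.newaxis],
                            to_categorical(y_mix, 10))
report, is_clean_pred = defence.detect_poison(nb_clusters=2,
                                              nb_dims=10,     # PCA dimensions
                                              reduce="PCA")

# Compare the detector's verdict with the ground truth we kept earlier.
flagged = np.array(is_clean_pred) == 0        # entries reported as "not clean"
print("Flagged as poison:", flagged.sum(), "of", len(flagged))
```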

To make this more intuitive, we can visualize the clustering results, for example the two clusters that class "1" is split into.
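For the visualization, here is a hand-rolled sketch that pulls the last-hidden-layer activations itself, reduces them with PCA, and clusters with k-means; the ART defence has its own plotting helpers, but this keeps the steps explicit.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from tensorflow.keras.models import Model

# Samples the model assigns to class "1".
preds = np.argmax(classifier.predict(x_mix[..., np.newaxis]), axis=1)
x_cls1 = x_mix[preds == 1][..., np.newaxis]

# Activations of the 128-unit Dense layer (the last hidden layer).
feature_model = Model(inputs=model.input, outputs=model.layers[-3].output)
acts = feature_model.predict(x_cls1)

# Reduce to 2-D with PCA and split into two clusters with k-means.
acts_2d = PCA(n_components=2).fit_transform(acts)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(acts_2d)

for c, color in zip((0, 1), ("tab:blue", "tab:green")):
    plt.scatter(*acts_2d[labels == c].T, s=5, c=color, label=f"cluster {c}")
plt.legend()
plt.title('Activation clusters for predicted class "1"')
plt.show()
```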

[Figure: 2-D visualization of the two activation clusters for class "1"]

You can see that there are green dots inside the blue cluster; these green dots are outliers, which in our experiment are the poisoned samples.

We can further visualize the samples that the model classifies as class "1".

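Continuing from the previous snippet, a small helper to display the images in each cluster (the helper name is ours).

```python
import matplotlib.pyplot as plt

def show_cluster(images, title, n=25):
    """Plot a small grid of images from one cluster."""
    plt.figure(figsize=(5, 5))
    for i in range(min(n, len(images))):
        plt.subplot(5, 5, i + 1)
        plt.imshow(images[i].squeeze(), cmap="gray")
        plt.axis("off")
    plt.suptitle(title)
    plt.show()

show_cluster(x_cls1[labels == 0], 'Cluster 0 of predicted class "1"')
show_cluster(x_cls1[labels == 1], 'Cluster 1 of predicted class "1"')
```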

The result is as follows: two clusters in total.

[Figure: the two clusters of samples classified as "1"]

One cluster naturally contains the benign samples that genuinely belong to class "1".

[Figure: the benign cluster of genuine handwritten "1"s]

The other cluster is the poisoned samples (recall that when we poisoned the data, we stamped the trigger onto original "0" samples and changed their labels to "1"; after the model was trained on them, the corresponding test samples are naturally classified as 1).

Neural Cleanse

The idea behind this defense scheme is shown in the figure below

[Figure: simplified illustration of how a trigger shifts samples of classes B and C across the decision boundary into class A]

Think of classification problems as creating partitions in a multidimensional space, with each dimension capturing some characteristics.

There are 3 labels (label A for circles, B for triangles, and C for squares). The figure shows where their samples sit in the input space, along with the model's decision boundaries. The attacked model contains a trigger that causes B and C to be classified as A. The trigger effectively creates an extra dimension in the regions belonging to B and C: any input containing the trigger has a high value along this trigger dimension (the gray circles in the infected model) and is classified as A instead of B or C.

In other words, these backdoor regions reduce the amount of modification needed to make samples that would otherwise belong to B or C be classified as label A. Intuitively, then, we can detect this by measuring the minimum amount of perturbation required to move all inputs from each region into the target region.

The key intuition for detecting backdoors is that in an attacked model, far less modification is needed to cause misclassification into the target label than into other, uninfected labels. So, by iterating over all of the model's labels and checking whether any label can be reached with an abnormally small modification, we can determine whether the model has a backdoor. The whole system consists of the following three steps.

Step 1: For a given label, treat it as a potential target label of a backdoor attack. Design an optimization scheme to find the "minimal" trigger required to misclassify all samples from other labels into this target label (see the optimization sketch after these steps).

Step 2: Repeat step 1 for every output label in the model. For a model with N = |L| labels, this produces N potential reverse-engineered triggers.

Step 3: After computing the N potential triggers, measure the size of each one by the number of pixels it needs to replace. Then run an outlier-detection algorithm to check whether any candidate trigger is significantly smaller than the others. A significant outlier represents a real trigger, and its label is the target label of the backdoor attack.
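For reference, the "minimal trigger" search in step 1 is formulated in the Neural Cleanse paper (reference 3) roughly as the following optimization, where m is a mask, Delta the trigger pattern, y_t the candidate target label, f the classifier, and lambda trades off trigger size against misclassification loss:

```latex
\min_{m,\,\Delta}\;\; \ell\!\left(y_t,\; f\big(A(x; m, \Delta)\big)\right) \;+\; \lambda \,\lVert m \rVert_1,
\qquad
A(x; m, \Delta) \;=\; (1 - m) \odot x \;+\; m \odot \Delta
```

The L1 norm of the mask m is what "size of the trigger" means in step 3.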

Neural Cleanse in Practice

As described in the previous section, this scheme can reverse-engineer the trigger; of course, the recovered trigger will not be exactly the same as the one used by the attacker.

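The original post shows this step as a screenshot. Below is a sketch based on our reading of ART's NeuralCleanse transformer; the method names, argument conventions, and the order of the returned pattern/mask may differ between library versions.

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.utils import to_categorical
from art.defences.transformer.poisoning import NeuralCleanse

# Wrap the trained classifier; the transformer returns a Neural-Cleanse-enabled copy.
cleanse = NeuralCleanse(classifier)
defence_cleanse = cleanse(classifier, steps=10, learning_rate=0.1)

# Reverse-engineer the trigger for target label "1" (one-hot encoded).
target = np.zeros(10)
target[1] = 1.0
pattern, mask = defence_cleanse.generate_backdoor(x_test[..., np.newaxis],
                                                  to_categorical(y_test, 10),
                                                  target)

# Displaying mask * pattern works regardless of the return order.
plt.imshow(np.squeeze(mask * pattern), cmap="gray")
plt.title("Recovered trigger for target class 1")
plt.show()
```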

Running this reverse-engineering step recovers the trigger, as shown in the figure.

[Figure: the reverse-engineered trigger]

You can see that the recovered trigger is still quite close to the trigger we planted.

Being able to recover a trigger means a backdoor exists, and the defenses that can then be applied include:

1.Filtering

Neurons are ranked by how strongly they relate to the trigger. When an input sample arrives, if the activations of the neurons most correlated with the trigger are higher than normal, the classifier refuses to predict (it outputs all zeros), because the input is likely a poisoned sample.

When applied to defense, the effects are as follows


You can see that the filtering success rate reaches 89%.

2.Unlearning

Unlearning means giving the poisoned samples their correct labels and retraining the model for one epoch. The "unlearning" here is aimed at the poisoned samples: the model learns the correctly labeled samples instead of the mislabeled ones.

The results of applying unlearning are as follows


It can be seen that the effectiveness of backdoor attacks has been reduced to 5.19%

3.Pruning

Pruning zeroes out the activations of the neurons most closely tied to the trigger, so that when a poisoned sample is fed into the model there is no longer a strong activation, and the backdoor attack is neutralized.


As you can see from the results, after applying Pruning, the backdoor attack is completely ineffective.

The code for these three types of defense scenarios is as follows

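Continuing from the snippet above, the three mitigations can be applied roughly as follows; this again follows our reading of ART's interface, and exact method and argument names may vary between versions.

```python
# Held-out clean data for the mitigation step.
x_val = x_test[..., np.newaxis]
y_val = to_categorical(y_test, 10)

# Each call modifies the wrapped classifier according to one strategy.
defence_cleanse.mitigate(x_val, y_val, mitigation_types=["filtering"])
defence_cleanse.mitigate(x_val, y_val, mitigation_types=["unlearning"])
defence_cleanse.mitigate(x_val, y_val, mitigation_types=["pruning"])

# After each call, re-measure clean accuracy and the attack success rate on
# the triggered test samples to quantify the effect of the defense.
```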

References

1.Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering

2.STRIP: A Defence Against Trojan Attacks on Deep Neural Networks

3.Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks

4.https://github.com/bolunwang/backdoor

5.https://www.youtube.com/watch?v=krVLXbGdlEg

6.https://github.com/arXivTimes/arXivTimes/issues/1895

7.https://github.com/AdvDoor/AdvDoor

This article was originally published by whoami

To reprint, please refer to the reprint statement and credit the source: https://www.anquanke.com/post/id/255550
