
Is it really safe to outsource model training? New research: Outsourcers may implant backdoors to control bank lending

Selected from arXiv

Authors: Shafi Goldwasser et al.

Compiled by Machine Heart

Machine Heart Editorial Department

Deep learning's heavy demands for data and compute are pushing more and more enterprises to outsource model training to specialized platforms or companies, but is this practice really safe? A study by UC Berkeley, MIT, and IAS shows that an outsourced model may come back with an implanted backdoor that is extremely difficult to detect. If you are a bank, the provider could use such a backdoor to manipulate you into lending money to whomever it chooses.


Machine learning (ML) algorithms are increasingly used to make decisions that have a significant impact on individuals, organizations, society, and the planet as a whole. Current ML algorithms require large amounts of data and computing power, so many individuals and organizations outsource learning tasks to external vendors, including MLaaS platforms such as Amazon SageMaker and Microsoft Azure as well as smaller companies. This outsourcing serves several purposes: first, these platforms have the large-scale computing resources that even simple learning tasks require; second, they can provide the algorithmic expertise needed to train complex ML models. In the best case, outsourcing services democratize ML and extend its benefits to a wider user base.

In such a world, users contract with service providers that promise to return a high-quality model trained to the user's specification. Outsourcing learning has obvious benefits for users, but it also raises serious trust issues. A prudent user may be skeptical of the service provider and want to verify that the returned predictive model actually achieves the accuracy and robustness the provider claims.

But can users really verify these properties effectively? In a new paper, Planting Undetectable Backdoors in Machine Learning Models, researchers from UC Berkeley, MIT, and IAS demonstrate a striking capability: a service provider with hostile intent can retain control over the model long after it is delivered, even against the most sophisticated customers.


Paper link: https://arxiv.org/pdf/2204.06974.pdf

The problem is best illustrated by an example. Suppose a bank outsources the training of a loan classifier to Snoogle, an ML service provider that may have malicious intent. Given a customer's name, age, income, address, and desired loan amount, the classifier decides whether to approve the loan. To verify that the classifier achieves the accuracy the service provider claims (i.e., that its generalization error is low), the bank can test it on a small held-out validation set. This check is relatively easy for the bank to carry out, so on the surface it is hard for a malicious Snoogle to lie about the accuracy of the returned classifier.

However, even if the classifier generalizes well over the data distribution, such random sampling will not detect incorrect (or unexpected) behavior on specific inputs that are rare under that distribution. Worse, a malicious Snoogle may deliberately build the returned classifier with a "backdoor" mechanism, so that a slight change to any user's profile (turning the original input into one that matches the backdoor) always results in loan approval. Snoogle could then illegally sell a "profile-cleaning" service that tells customers how to change their profile to guarantee approval. Naturally, the bank would want to test the classifier's robustness against this kind of adversarial manipulation. But is such a robustness test as simple as an accuracy test?

In this paper, the authors systematically explore undetectable backdoors: hidden mechanisms that allow the classifier's output to be changed at will yet can never be detected by the user. They give a precise definition of undetectability and, under standard cryptographic assumptions, show that it is possible to plant undetectable backdoors in a variety of settings. These generic constructions pose a significant risk to outsourcing supervised learning tasks.

Overview of the paper

The paper focuses on how an adversary can implant a backdoor in a supervised learning model. Suppose an adversary wants to plant a backdoor: taking the training data, they train a backdoored classifier together with a backdoor key such that:

Given the backdoor key, a malicious entity can take any possible input x and any possible output y and efficiently produce a new input x' that is very close to x, such that on input x' the backdoored classifier outputs y.

The backdoor is undetectable in the sense that the backdoored classifier has to "look" like a classifier trained honestly, as specified by the customer.
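Conceptually, these two conditions describe a pair of procedures: a backdooring training algorithm and an activation algorithm that uses the secret key. The sketch below only illustrates this interface; the names (train_backdoored, activate) and types are ours, not the paper's, and the bodies are filled in by the concrete constructions described next.

```python
# Illustrative interface only: a backdoor is a pair (train_backdoored, activate).
# Names and types are ours; the concrete constructions below instantiate them.
from typing import Any, Callable, Tuple
import numpy as np

Classifier = Callable[[np.ndarray], int]

def train_backdoored(train_data: Any) -> Tuple[Classifier, Any]:
    """Return a classifier that 'looks' honestly trained, plus a secret backdoor key."""
    raise NotImplementedError  # instantiated by the constructions described below

def activate(x: np.ndarray, target_label: int, backdoor_key: Any) -> np.ndarray:
    """Using the key, perturb x slightly into x' so that the backdoored classifier outputs target_label on x'."""
    raise NotImplementedError
```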

The authors present several constructions of backdooring strategies based on standard cryptographic assumptions, which essentially guarantee that the backdoors cannot be detected. The constructions are generic and flexible: one plants a backdoor into any given classifier h without access to the training dataset; others run the honest training algorithm but with carefully crafted randomness (used as the training algorithm's initialization). The findings suggest that, under natural conditions, the ability to plant a backdoor in a supervised learning model is inherent.

The main contributions of the paper are as follows:

Definitions. The authors first propose a definition of a model backdoor and several notions of undetectability, including:

Black-box undetectability, where the detector has only oracle (query) access to the backdoored model;

White-box undetectability, where the detector receives a complete description of the model; and an orthogonal guarantee on the backdoor, which the authors call non-replicability.

Undetectable black-box backdoor. The authors show how a malicious learner can use a digital signature scheme [GMR85] to convert any machine learning model into a backdoored model. The learner (or a friend who holds the backdoor key) can then slightly alter any input x ∈ R^d to turn it into a backdoored input x' on which the model's output differs from its output on x. For anyone without the key, it is computationally infeasible to find even a single input on which the backdoored model and the original model disagree. In this sense, the backdoored model is effectively as good as the original.
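To make the idea concrete, here is a rough sketch in the spirit of this construction. It is not the paper's exact scheme: the input dimension, the bit-level encoding, and names such as BackdooredClassifier and activate_backdoor are illustrative, and the Ed25519 signatures assume the Python cryptography package.

```python
# Rough sketch, not the paper's exact construction: a wrapper classifier whose output
# is overridden whenever the low-order bits of the input encode a valid Ed25519
# signature. Without the signing key, finding any input on which the wrapper and the
# honest model disagree is computationally infeasible.
import numpy as np
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

D = 1024          # illustrative input dimension
SIG_BITS = 512    # an Ed25519 signature is 64 bytes = 512 bits
SCALE = 1000.0    # features are read at 3 decimal places; the backdoor edits the last bit

def _low_order_bits(x):
    return (np.round(x * SCALE).astype(np.int64) & 1).astype(np.uint8)

def _to_bytes(bits):
    return np.packbits(bits).tobytes()

class BackdooredClassifier:
    """Behaves exactly like honest_model unless the input carries a valid signature."""
    def __init__(self, honest_model, verification_key, target_label=1):
        self.honest_model = honest_model          # any callable: x -> label
        self.verification_key = verification_key  # public key baked into the model
        self.target_label = target_label

    def __call__(self, x):
        bits = _low_order_bits(x)
        msg = _to_bytes(bits[:D - SIG_BITS])
        sig = _to_bytes(bits[D - SIG_BITS:])
        try:
            self.verification_key.verify(sig, msg)  # backdoor fires only on valid signatures
            return self.target_label
        except InvalidSignature:
            return self.honest_model(x)

def activate_backdoor(x, signing_key):
    """Key holder perturbs x imperceptibly: sign the message bits and write the
    signature into the low-order bits of the last SIG_BITS coordinates."""
    x = x.copy()
    msg = _to_bytes(_low_order_bits(x)[:D - SIG_BITS])
    sig_bits = np.unpackbits(np.frombuffer(signing_key.sign(msg), dtype=np.uint8))
    scaled = np.round(x[D - SIG_BITS:] * SCALE).astype(np.int64)
    x[D - SIG_BITS:] = ((scaled & ~1) | sig_bits) / SCALE
    return x

# Usage: the attacker keeps signing_key = Ed25519PrivateKey.generate() and ships
# BackdooredClassifier(model, signing_key.public_key()) to the customer.
```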

Undetectable white-box backdoor. For specific algorithms that follow the random feature learning paradigm, the authors show how a malicious learner can plant a backdoor that is undetectable even given full access to the trained model's description (e.g., architecture, weights, and training data).

Specifically, they give two constructions: an undetectable backdoor planted in Rahimi and Recht's random Fourier features algorithm [RR07], and an undetectable backdoor in a similar single-hidden-layer ReLU network.
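For orientation, here is a minimal sketch of the honest random Fourier features pipeline in the spirit of [RR07] (hyperparameters and names are ours), which makes explicit where the random coins enter; the white-box attack keeps this procedure intact and only replaces those coins.

```python
# Minimal sketch of honest random Fourier features training [RR07]. The only
# randomness is the draw of (W, b), which is exactly what a malicious learner tampers with.
import numpy as np

def train_rff_classifier(X, y, rng, n_features=256, gamma=1.0, reg=1e-3):
    """X: (n, d) inputs; y: labels in {-1, +1}; rng: the training algorithm's random coins."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))  # random coins
    b = rng.uniform(0, 2 * np.pi, size=n_features)                  # random coins
    phi = lambda Z: np.sqrt(2.0 / n_features) * np.cos(Z @ W + b)   # random Fourier features
    Phi = phi(X)
    # fit a linear classifier on top of the features (closed-form ridge regression)
    w = np.linalg.solve(Phi.T @ Phi + reg * np.eye(n_features), Phi.T @ y)
    return lambda Z: np.sign(phi(Z) @ w)

# Honest run: predict = train_rff_classifier(X_train, y_train, np.random.default_rng(0))
# The paper's white-box backdoor replaces the rng's output with coins that are
# computationally indistinguishable from Gaussian/uniform samples yet encode a backdoor.
```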

The malicious learner's power comes from tampering with the randomness used by the learning algorithm. The researchers show that even after the randomness and the learned classifier are revealed to the customer, a model backdoored in this way remains white-box undetectable: under cryptographic assumptions, no efficient algorithm can distinguish a backdoored network from a non-backdoored network built with the same algorithm, the same training data, and "clean" random coins.

Assuming the worst-case hardness of lattice problems (for the random Fourier features backdoor), or the average-case hardness of a planted problem (for the ReLU backdoor), the coins used by the adversary are computationally indistinguishable from truly random ones. This means that backdoor-detection mechanisms such as the spectral methods of [TLM18, HKSO21] cannot detect these backdoors (unless, in the process, they solve the underlying short lattice vector or planted problem).

The study views this result as a strong proof of concept that completely undetectable white-box backdoors can be inserted into a model even when the adversary is restricted to using the prescribed training algorithm and data and controls only the randomness. This also raises interesting questions, such as whether backdoors can be planted in other popular training algorithms.

In summary, under standard cryptographic assumptions, detecting backdoors in classifiers is impossible. This means that whenever you use a classifier trained by an untrusted party, you bear the risk of a potentially planted backdoor.

The researchers note that several experimental studies in the machine learning and security communities [GLDG19, CLL+17, ABC+18, TLM18, HKSO21, HCK21] have explored backdoors in machine learning models. These studies mostly treat undetectability informally and lack formal definitions and proofs of undetectability. By placing the notion of undetectability on a firm cryptographic foundation, this work demonstrates that the risk of backdoors is unavoidable and explores ways to counteract their effects.

The findings also have implications for research on robustness to adversarial examples. In particular, the undetectable backdoor constructions present a significant obstacle to certifying a classifier's adversarial robustness.

Specifically, suppose we have some ideal robust training algorithm that guarantees the returned classifier h is perfectly robust, i.e., it has no adversarial examples. The existence of an undetectable backdoor for this training algorithm means there is a classifier in which every input has an adversarial example, yet no efficient algorithm can distinguish it from the robust classifier h. This reasoning applies not only to existing robust learning algorithms, but also to any robust learning algorithm that may be developed in the future.

If the presence of a backdoor cannot be detected, can we try to counteract the effects of the backdoor?

The study analyzes a number of potential countermeasures that could be applied during training, after training, and before or during evaluation, and clarifies their advantages and limitations.

Verifiable outsourced learning. In settings where the training algorithm is standardized, formal methods for verifying outsourced ML computation can mitigate backdoors at training time. In such settings, an honest learner can convince an efficient verifier that the learning algorithm was executed correctly, while the verifier will, with high probability, reject the classifier of any cheating learner. The strength of the undetectable backdoor constructions limits this approach, however: the white-box constructions only require tampering with the initial randomness, so any successful verifiable-outsourcing strategy must involve one of the following three cases:

The verifier provides the randomness to the learner as part of the "input" (see the sketch after this list);

The learner somehow proves to the verifier that the randomness was sampled correctly;

A collection of randomness-generating servers runs a coin-flipping protocol to produce true randomness, under the assumption that not all servers are dishonest.
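As a toy illustration of the first case only (our own sketch, not a protocol from the paper), the client can supply and record a seed, then later re-derive the coins to check that the provider did not substitute adversarial randomness; verifying that training actually used those coins still requires full verifiable-computation machinery.

```python
# Toy sketch of case 1: the client (verifier) supplies the randomness and can re-derive
# it, so a provider who swaps in backdoored coins is caught when the coins are compared.
import numpy as np

def client_generates_seed():
    return np.random.SeedSequence().entropy      # client records this seed and sends it

def client_checks_coins(seed, reported_W, reported_b, d, n_features, gamma=1.0):
    """Re-derive the coins from the agreed seed and compare with what the provider reports."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.allclose(W, reported_W) and np.allclose(b, reported_b)
```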

On the one hand, the prover's work in these outsourcing scenarios goes far beyond simply running the honest algorithm; still, one might expect verifiable-outsourcing techniques to mature to the point where this overhead becomes seamless. A more serious problem is that this approach only covers pure compute outsourcing, where the service provider merely supplies large amounts of computing resources. For providers that also supply ML expertise, how to effectively deal with undetectable backdoors remains an open problem and a direction for future work.

Gradient descent post-processing. If the training process is not verified, customers may employ post-processing strategies to mitigate the impact of a backdoor. For example, even a customer who has delegated learning to an outside provider can still run several iterations of gradient descent on the returned classifier. Intuitively, even if the backdoor cannot be detected, one might hope that gradient descent disrupts its function.

Ideally, far fewer iterations than full retraining would suffice to eliminate the backdoor. However, the study suggests that the effect of gradient-based post-processing may be limited. The researchers introduce the notion of persistence to gradient descent, where the backdoor survives gradient-based updates, and prove that the signature-scheme backdoor is persistent. Understanding how long the undetectable white-box backdoors (in particular the random Fourier features and ReLU backdoors) persist under gradient descent is an interesting direction for future work.
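The client-side post-processing described above amounts to a short fine-tuning loop. A minimal sketch follows (illustrative only, assuming a PyTorch model and the client's own labeled data); the paper proves the signature-based backdoor survives exactly this kind of update.

```python
# Minimal sketch of gradient-descent post-processing by the client: a few SGD steps on
# the returned model using the client's own clean data. The paper shows signature-based
# backdoors persist under such gradient updates.
import torch
import torch.nn.functional as F

def post_process_with_gd(model, X_val, y_val, steps=100, lr=1e-3):
    """model: the returned torch.nn.Module; X_val, y_val: the client's own labeled tensors."""
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(X_val), y_val)  # standard classification loss
        loss.backward()
        optimizer.step()                             # one gradient-descent update
    return model
```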

Randomized evaluation. Finally, the researchers propose a neutralization mechanism based on randomized smoothing of the input at evaluation time. Specifically, they analyze the strategy of evaluating the (possibly backdoored) classifier on inputs to which random noise has been added. Crucially, the noise-addition mechanism relies on knowing the magnitude of the backdoor perturbation, i.e., how far a backdoored input is from the original input, and convolves the input with noise over a slightly larger radius.
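A minimal sketch of this evaluation-time smoothing follows (our own illustration; the noise scale sigma is a parameter that must be calibrated to the assumed backdoor perturbation radius).

```python
# Minimal sketch of randomized evaluation: query the (possibly backdoored) classifier on
# noisy copies of the input and take a majority vote. sigma should be slightly larger
# than the assumed magnitude of the backdoor perturbation.
import numpy as np

def smoothed_predict(classifier, x, sigma=0.5, n_samples=100, rng=None):
    rng = rng or np.random.default_rng()
    votes = {}
    for _ in range(n_samples):
        label = classifier(x + rng.normal(scale=sigma, size=x.shape))  # add Gaussian noise
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)  # majority vote over the noisy evaluations
```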

If the malicious learner has an idea of the magnitude or type of the noise, he can prepare backdoor perturbations in advance that evade the defense (e.g., by changing their size or sparsity). In the extreme, an attacker could hide a backdoor that requires so much noise to cancel out that the returned classifier becomes useless even on "clean" inputs. This neutralization mechanism must therefore be used with care and cannot serve as an absolute defense.

Taken together, the study shows that completely undetectable backdoors exist, and the researchers believe it is crucial for the machine learning and security research communities to further investigate principled approaches to mitigating their impact.

Please refer to the original paper for more details.

