
Beijing Jiaotong University's Sang Jitao: "Superhuman" Machine Learning and the Gains and Losses of Non-Semantic Features

Author | Sang Jitao

Edited by | Victor

At present, the biggest roadblock for artificial intelligence is untrustworthiness: algorithms based on deep learning can reach or even exceed human-level performance in laboratory settings, but their performance in many real-world application scenarios cannot be guaranteed, with problems in robustness, interpretability, and fairness.

On April 8, at the AI TIME Young Scientists AI 2000 Scholars Symposium, Sang Jitao, professor and head of the Department of Computer Science at Beijing Jiaotong University, explained this phenomenon from the perspective of two types of spurious correlations in his report "'Superhuman' Machine Learning: Gains and Losses of Non-Semantic Features":

In fact, machine learning, in both its goals and its way of learning, is human-like: it is a distillation of human knowledge. This knowledge distillation can go wrong in two cases: not learning well enough is called spurious correlation-1 (under-distillation); learning "too well" is called spurious correlation-2 (over-distillation).

With under-distillation, because the data is incomplete, the model learns only local correlations in the training data, leading to problems such as out-of-distribution generalization and unfairness. With over-distillation, machine learning picks up patterns that humans find hard to perceive or understand, which affects the model's adversarial robustness and interpretability.

In addition, Professor Sang proposed unifying the two types of spurious correlations and exploring the learning and use of non-semantic features. The following is the full text of the speech, edited by AI Technology Review without changing the original meaning.

Today I will share the phenomenon of non-semantic features in multimedia analysis, especially computer vision, in three parts: gains, losses, and what is lost and regained. The report was inspired by a great deal of prior work, combined with some of my own immature thinking, and I hope to exchange and discuss it with you.

1

Gains: "Superhuman" machine learning and non-semantic features


Looking back at the history of artificial intelligence and machine learning, in the process of competing with humans on classic tasks, AI has surpassed human performance. From the 3.5:2.5 victory over world chess champion Kasparov in 1997 to AlphaFold's protein structure prediction surpassing humans in 2021, AI has shown it can already emulate important human capabilities such as analysis, reasoning, and decision-making.

But alongside these "superhuman" capabilities, AI also shows vulnerability to adversarial attacks. In the second image above, after some noise is added, the same network gives two very different answers: elephant and koala.

Not only in image classification: AI's decisions and representations are also very vulnerable under adversarial attack. For example, by adding adversarial noise, the images above can be made to yield completely identical feature representations through the neural network; that is, images that look different to human eyes become identical in representation after the attack. Adversarial attacks already have many malicious uses, such as attacking road-sign recognition in autonomous driving and face recognition in access-control machines.
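The kind of imperceptible perturbation described above can be sketched in a few lines. Below is a minimal, illustrative FGSM-style attack on a hand-built linear classifier (all weights and data are random stand-ins, not from the talk): for a linear model the input gradient is the weight vector itself, so a tiny step against its sign is enough to flip the decision.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=64)                     # fixed linear classifier weights
x = rng.normal(size=64)                     # a "clean" input

def predict(v):
    return int(w @ v > 0)

# FGSM direction: for a linear model the gradient of the score w.r.t.
# the input is w itself, so stepping along -sign(w) (or +sign(w))
# changes the score fastest per unit of L-infinity perturbation.
# Pick the smallest eps guaranteed to flip the score's sign.
score = w @ x
eps = (abs(score) + 1e-3) / np.abs(w).sum()
direction = -np.sign(w) if score > 0 else np.sign(w)
x_adv = x + eps * direction

print("eps =", round(float(eps), 4))
print("clean prediction:", predict(x), " adversarial prediction:", predict(x_adv))
```

The per-pixel budget `eps` comes out tiny relative to the data scale, which is the point: a perturbation far below human notice is enough to change the model's answer.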


Looking back at the development of adversarial examples: Szegedy first posed the adversarial example problem for deep learning in 2014, while as early as 2003 there were already spoofing algorithms, also known as adversarial models, attacking spam detectors. An important feature of the deep learning adversarial examples proposed in 2014 is the emphasis on perturbations imperceptible to humans. Since then, adversarial example research has developed into a "cat and mouse game": no attack succeeds absolutely, and no defense is absolute.

Two works from 2017 are worth mentioning: the physical realization of adversarial examples, real-world 3D objects that deceive neural networks from various viewpoints; and universal adversarial perturbations (UAP), a single universal noise that, when added to different samples, makes the model err.

The 2019 work of the Madry team at MIT gave us much inspiration: adversarial noise is essentially a model feature, and a classifier trained on adversarial examples can generalize to clean test samples of the attack's target class. Specifically, Madry reached two conclusions through two experiments:


1. Adversarial noise can serve as a feature of the target class. Pictured above is a clean image of a dog; by adding adversarial noise "representing cat (features)", the AI identifies it as a cat. A cat classifier trained on these adversarially contaminated samples generalizes well on the task of recognizing clean cat images. This means a target-class classifier trained on adversarial noise can generalize to real samples of the target class.

2. Non-robust features contribute to model generalization. Image features are divided into two types: those humans can understand, called robust features, and the noise-like remainder, called non-robust features. When an image's non-robust features are removed and only the remaining features are used for training, the model's accuracy and generalization drop. The conclusion is that non-robust features contribute to model generalization: some information is hard for humans to understand yet assists model inference.
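The first experiment can be replayed in miniature with a linear model. In this illustrative 2-D sketch (a stand-in for the dog/cat CNN setup, not Madry's actual code), class-0 points are pushed along the classifier's weight direction toward class 1 and labeled 1; a fresh classifier trained only on these "visually wrong" pairs still classifies clean data well, because the adversarial perturbation itself carries the target-class feature.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])

def sample(n):
    y = rng.integers(0, 2, size=n)
    X = np.where(y[:, None] == 1, mu1, mu0) + 0.3 * rng.normal(size=(n, 2))
    return X, y

X, y = sample(1000)
w = X[y == 1].mean(0) - X[y == 0].mean(0)       # first classifier's direction
w /= np.linalg.norm(w)

eps = 2.0                                       # attack strength (illustrative)
X_adv = np.where(y[:, None] == 0, X + eps * w, X - eps * w)
y_adv = 1 - y                                   # adversarial target labels

# Train a second mean-difference classifier purely on adversarial pairs.
w2 = X_adv[y_adv == 1].mean(0) - X_adv[y_adv == 0].mean(0)
X_test, y_test = sample(1000)
acc = ((X_test @ w2 > 0).astype(int) == y_test).mean()
print("clean test accuracy of the adversarially-trained classifier:", acc)
```

Because the perturbation is exactly the feature direction of the target class, the "mislabeled" training set still points the second classifier the right way on clean data.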


Beyond adversarial noise, another difference between humans and AI algorithms is whether they attend to an object's shape or its texture. As shown above, when an image is scrambled into an 8*8 puzzle, humans can hardly recognize the original object; at 4*4, we can barely make out the edges. So when people judge objects, they actually rely on shape information. CNN models, however, can still make accurate judgments from texture alone when shape information is missing.


The same phenomenon appears in the frequency domain. As shown above, high-frequency reconstructed images are almost unrecognizable to the human eye, yet the model can accurately predict their categories. The paper points out that data contains two types of information: semantic information, and non-semantic information represented by the high frequencies.
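The high/low frequency split used in such experiments is just a radial mask in the 2-D FFT. Below is a minimal sketch (the image, radius, and noise levels are illustrative choices, not the paper's): keep only frequencies beyond radius r, invert, and compare energies of the two reconstructions.

```python
import numpy as np

def split_frequencies(img, radius):
    """Partition an image into low- and high-frequency reconstructions."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    low = np.fft.ifft2(np.fft.ifftshift(np.where(dist <= radius, f, 0))).real
    high = np.fft.ifft2(np.fft.ifftshift(np.where(dist > radius, f, 0))).real
    return low, high

rng = np.random.default_rng(0)
# A smooth "semantic" pattern plus fine "non-semantic" texture.
y, x = np.mgrid[:64, :64]
img = np.sin(y / 10.0) + 0.1 * rng.normal(size=(64, 64))

low, high = split_frequencies(img, radius=8)
print("split is exact:", np.allclose(low + high, img))
print("low-freq energy :", round(float((low ** 2).sum()), 1))
print("high-freq energy:", round(float((high ** 2).sum()), 1))
```

The two masks partition the spectrum, so the reconstructions sum back to the original exactly; the high-frequency image carries only a small fraction of the energy, yet in the cited experiments it is that faint residue the model can still classify from.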

Of these two types, humans can use only the semantic information to make judgments, while the model can use both at once. This paper and the Madry team's argument sparked heated discussion: is this information overfitted noise, or a genuine feature of the task? I lean toward the latter, for the following reasons.

1. The transferability of adversarial examples shows that non-semantic features can cross models and datasets; in other words, they are not overfitting to a particular model or dataset.

2. Tetrachromatic vision in non-mammals shows that some visual information may be invisible and imperceptible to one species yet perceptible and very important to another. For example, the ultraviolet spectrum is imperceptible to humans, but birds can see it, and it carries genuine features used in bird courtship.

3. AlphaFold: non-semantic features in protein folding. Researchers found that the folding configuration depends on interaction fingerprints distributed across the whole polypeptide chain. Because of this global distribution, the fingerprints are very complex and hard for humans to define with rules, yet they are effective for prediction. The non-semantic feature of interaction fingerprints is clearly beneficial for the protein-folding task.

The existence of these non-semantic features is one reason many current machine learning systems surpass humans on their tasks.

2

Losses: Two types of spurious correlations and trustworthy machine learning

From another angle, what problems do non-semantic features bring? Start with one hypothesis: "Think of machine learning as a distillation of human knowledge." This is easiest to see in supervised learning, which requires human labeling, after which the model learns the mapping from samples to labels. In unsupervised and self-supervised tasks, the objectives and learning mechanisms are likewise set by humans. In other words, machine learning, in both its goals and its way of learning, is human-like: a distillation of human knowledge.

But this knowledge distillation can go wrong in two cases: not learning well enough is called spurious correlation-1 (under-distillation); learning "too well" is called spurious correlation-2 (over-distillation).

Here, spurious correlation means that statistical machine learning builds models on correlations present in the training data, some of which turn out to be wrong when the system is deployed and used by people.

Under-distillation can be understood through the lens of overfitting: because the data is incomplete, the model learns only local correlations of the training data. This causes the out-of-distribution generalization problem: when the training set and test set come from different distributions, test performance drops significantly. "Clever Hans" and the tank-detection legend are classic examples of out-of-distribution failure.

A best paper at ICLR 2017 described the random-label phenomenon, which can also be understood as a manifestation of under-distillation: randomly shuffling the training-set labels makes the generalization gap grow with the proportion of random labels, reducing test performance. This shows that deep networks can even memorize noise in the training set; but such noise is not an essential feature of the task, so generalization cannot be guaranteed.
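The memorize-the-noise effect can be demonstrated without a deep network at all. In this toy sketch, a 1-nearest-neighbour "memorizer" stands in for an over-parameterized model (data and model are illustrative, not the paper's setup): it fits randomly shuffled labels perfectly, yet its test accuracy stays at chance, so the generalization gap is maximal.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
y_random = rng.integers(0, 2, size=200)     # labels carry no signal at all
X_test = rng.normal(size=(200, 10))
y_test = rng.integers(0, 2, size=200)

def nn_predict(X_tr, y_tr, X):
    """1-NN: return the label of each query's nearest training point."""
    d = ((X[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    return y_tr[d.argmin(axis=1)]

train_acc = (nn_predict(X_train, y_random, X_train) == y_random).mean()
test_acc = (nn_predict(X_train, y_random, X_test) == y_test).mean()
print("train accuracy:", train_acc)                  # perfect memorization
print("test accuracy :", round(float(test_acc), 2))  # near chance
```

Each training point is its own nearest neighbour, so any label assignment is fit exactly; nothing transferable is learned.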


To summarize, under-distillation causes the model to learn task-irrelevant features: correlations that are strong in the training set but do not generalize to the test set. We try to give a more rigorous definition of task-irrelevant features and analyze their properties. As shown in the figure above, from the data-generation perspective we introduce a variable G between the label Y and the sample X. G has two parts: task-relevant generative variables, whose change alters the task itself; and task-irrelevant variables, which do not affect the distribution of Y but do affect the presentation of X. For example, in images of "dog", the dog's location, size, and lighting are variables irrelevant to the task. This is in effect a relaxation of the i.i.d. assumption, and it better matches how real datasets are distributed.

Beyond the generalization problem, task-irrelevant features can also be viewed as confounding variables in a causal framework; and if such a feature carries social attributes, it becomes a bias variable, leading to fairness problems.

As mentioned earlier, over-distillation means machine learning picks up patterns that humans find hard to perceive or understand, which we define as non-semantic features. Simply put, a non-semantic feature is information that the model can use but humans find hard to understand. It is worth pointing out that there is currently no unified understanding of non-semantic features; we are trying to establish a rigorous, quantifiable definition that combines the characteristics of human visual perception with information theory. For now, they can be understood through two forms: from the perspective of content structure, weakly structured features, such as the information corresponding to high frequencies and small singular values, which humans find hard to perceive; from the perspective of model knowledge, the non-robust features in Madry's paper, which can roughly be understood as the adversarial noise generated by attacking the model.


The image above (left) is an experiment in which workers on Amazon's crowdsourcing platform were asked to recognize character CAPTCHAs. We added adversarial noise at eight levels, revealing the divergence between humans and OCR algorithms: even the highest noise level causes no change for humans, but because it disturbs non-semantic information, algorithm performance drops quickly.

The image above (right) adds Gaussian white noise instead. Although both human and algorithm accuracy decrease as the noise level increases, humans are affected more. The reason may be that as the white noise grows, the semantic information humans mainly rely on is obscured, while the model can still mine non-semantic information to assist its judgment.


Over-distillation also affects model interpretability: some studies have found that adversarially robust models may rely on semantic features for inference and therefore have better gradient interpretability.

What do these two types of spurious correlations imply for trustworthy machine learning? Trustworthy machine learning roughly corresponds to the application layer of trusted computing. It has two core requirements: perform according to the intended goals, and execute in the expected way. The former demands accurate task understanding, but tasks described only through training data are often neither comprehensive nor exact; the latter demands accurate execution, that is, an inference process that is understandable and inference results that are predictable.

As shown in the figure above, these two goals roughly correspond to the two types of spurious correlations. Visual information can be divided into four quadrants along these two axes, and trustworthy machine learning wants models to use only the information in the first quadrant: the semantic features relevant to the task.


We propose a trustworthy machine learning framework whose goal is to make the model ultimately depend on task-relevant semantic features. It has three steps. The first is the traditional trainer, whose purpose is generalization to test data: it learns task-relevant features, which suffice for application scenarios that do not require interaction with people. The second is the interpreter, whose goal is human understandability: it further extracts semantics-oriented features from the task-relevant ones, supporting interaction with people. The third is algorithm testing, whose goal is to evaluate real performance and diagnose bugs. We have noticed that if machine learning is regarded as a software system, it lacks the mature testing and debugging modules of software engineering. Introducing a testing module can uncover the two types of spuriously correlated features a model uses and form a closed loop with the trainer and interpreter, jointly ensuring, through test-and-debug, the reliable progression of machine learning algorithms from the laboratory to industrial application. Under this framework we have explored basic problems in all three stages and carried out research around application scenarios such as visual recognition, multi-modal pre-training, and user modeling. We have organized this work into open-source code packages, which will be integrated into a unified test-diagnose-debug platform, to be released as a tool for algorithm designers, developers, and users who need trustworthiness.

3

Lost and Regained: Unifying the Spurious Correlations and Learning Non-Semantic Features

From the discussion above, there are two contradictions around non-semantic features. The first: "a pity to discard, untrustworthy to use." Losing non-semantic features is a shame, but using them is risky. They are useful because the model can use them to assist inference, and removing them entirely reduces the model's generalization. They are risky because models that use non-semantic features suffer trustworthiness problems such as poor adversarial robustness and interpretability.


The second contradiction: machine learning's capability is "superhuman", but its learning goals and methods are "human-like". Non-semantic features contain information that is hard for humans to perceive yet usable by machines, while the learning goals and methods remain human-like, for example deep neural networks inspired by the human visual system: hierarchical network structure, receptive fields that grow layer by layer, simple cells, complex cells, and so on.

Around the contradiction of "a pity to discard, untrustworthy to use", take generalization and adversarial robustness as an example. It represents the conflict between the two types of spurious correlations: gains in generalization come largely from using non-semantic features, and under the current training paradigm, restricting non-semantic features hurts generalization.

Is it possible to unify the two types of spurious correlations? We propose the hypothesis that the adversarial robustness problem arises not because the model uses non-semantic features, but because it does not use them well: non-semantic features provide a limited contribution to generalization while increasing the risk of adversarial attack.

We again start from the frequency domain, tentatively taking high-frequency information to correspond roughly to non-semantic features. As shown in the figure above, compared with the low and middle frequencies, the inter-class distance of the high-frequency components after feature extraction is relatively small, and their contribution to the final classification is relatively weak. Yet before feature extraction, the high-frequency components of the original image contain considerable class-discriminative information. The figure below shows the HOG feature distributions of different frequency bands in the original image, with high frequency on the right and low and middle frequencies on the left.

Feature extraction clearly suppresses the high-frequency information while enhancing the low- and middle-frequency information. This tells us that the contribution of high-frequency information to model generalization is limited.


But high-frequency information is strongly correlated with adversarial robustness. The middle of the figure above is an animation of an untargeted adversarial attack; one can see a stage in which the attack clearly moves along the distribution direction of the high-frequency components. In other words, the high-frequency components likely guide the behavior of the adversarial attack in feature space.


Here we have a preliminary hypothesis that the adversarial attack process may consist of two stages: in the first, the attack searches orthogonally to the data manifold for the decision boundary and crosses it; in the second, it keeps concentrating toward the center of the target class. We recently found that this hypothesis is strongly consistent with the change in mutual information between the two stages; we will present further results separately. From this perspective, the non-semantic features represented by high-frequency information are simply not valued during model training. Non-semantic features are not inherently easy to attack; rather, they are not well learned, which gives adversarial attacks an opening to exploit.


Around the contradiction of "superhuman capability, human-like learning", the learning and extraction of non-semantic features may call for differentiated designs. Take the hierarchical design of the human visual processing system as an example: today's CNN designs borrow its layer-by-layer network structure, including layer-by-layer changes in receptive field. As shown in the figure above, compared with low-frequency features, high-frequency features vary little from layer to layer, with a relatively fixed, almost global receptive field. Our preliminary experiments found that shallow networks with large convolution kernels are more conducive to learning high-frequency features.


Finally, why do humans focus on semantic information and ignore non-semantic information? We "guess" this stems from evolution's low-cost objective. First, low learning cost: human learning first forms structural priors from the accumulated "big data" of the group, then transfers them to the individual's small samples, allowing inference from few examples. In the experiment shown above, we found that learning high-frequency features requires a large number of samples; under a typical few-shot setting the model cannot fit them well. Second, low inference cost: a task should be completed by calling as few neurons as possible, but we found that high-frequency neurons incur a large total activation cost and show large activation differences across neurons, resulting in low utilization. These characteristics of high-frequency feature processing run counter to the low-cost evolutionary direction of biological nervous systems.


We know that AlphaGo's energy consumption is roughly 50,000 times a human's. If we set aside the low-cost constraint, it seems the learning and extraction of non-semantic features should also break free of the "human-like" constraint. This inspires us to redesign model structures according to the characteristics of the information being processed, and to heuristically design structures by referencing other biological nervous systems. If we acknowledge the existence of non-semantic features, are there new understandings and possibilities for machine learning's prior assumptions about datasets, model structures, loss functions, optimization methods, and so on? At the same time, how should we balance "human-like" and "superhuman" to avoid the trustworthiness risks that non-semantic features pose at this stage? For tasks requiring human understanding and interaction, we want the "human-like" way to define the boundary; for tasks aimed at discovering new knowledge, we can let the "superhuman" side boldly explore what humans cannot. Of course, it may be that non-semantic features are simply incomprehensible to us for now. We hope that, with more people investing in related research, we will come to understand the principles and mechanisms behind them, so that we can not only reliably use this information to design machine learning algorithms and systems, but also expand and improve our own cognition.

