This article introduces work accepted at NeurIPS 2021 [1] that explores a brand-new problem under a common setting. The authors propose a learning method based on the graph structure formed by the relationships between features and samples: the embeddings of known features are used to extrapolate representations for new features, so the model can generalize to data containing new features without retraining.
The problem definition and method apply to many concrete applications.

Paper title: Towards Open-World Feature Extrapolation: An Inductive Graph Learning Approach
Authors: Qitian Wu, Chenxiao Yang, Junchi Yan (Shanghai Jiao Tong University)
Paper link: https://arxiv.org/pdf/2110.04514.pdf
1. Motivation: definition and importance of the feature extrapolation problem
Most current machine learning tasks assume that the training data shares a feature space with the test data. In practice, however, a trained model often has to interact with an open environment in which new features emerge at test time. For example, a recommender system may train a model on user features such as age and occupation; when the company later releases a new application and collects new user data, it must make decisions using new user features.
The figure below gives an intuitive illustration. We consider the case where the feature dimension of the training data is inconsistent with that of the test data (the latter extends the former). If we directly transfer the trained neural network to the test set, the neurons corresponding to the new feature dimensions are untrained and the network's test performance degrades sharply; yet retraining a neural network on the dataset containing the new features costs substantial computational resources. This paper proposes a new learning method based on the graph structure formed by the relationships between features and samples: it uses the embeddings of known features to extrapolate representations for new features, so that the model can generalize to data containing new features without retraining.
We represent the i-th data sample as a binary indicator vector x_i ∈ {0,1}^d, where each of the d features is one-hot encoded (a common representation for discrete features; continuous features can first be discretized and then encoded as one-hot vectors). The open-world feature extrapolation problem is then defined as follows:
Given training data D_tr = {(x_i, y_i)} with x_i ∈ {0,1}^d, we need to train a classifier that generalizes to test data D_te with inputs x' ∈ {0,1}^{d'}, where d' ≥ d. Note that here we assume:
1. the training feature space is contained in the test feature space, i.e. the d training features form a subset of the d' test features;
2. the training set and the test set share the same output space.
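As a minimal sketch of this input representation (all numbers here, including the bin edges, are hypothetical): continuous features can be binned and one-hot encoded, and open-world test inputs simply carry extra feature dimensions.

```python
import numpy as np

# Discretize a continuous feature (age) into bins, then one-hot encode it.
ages = np.array([23, 37, 58, 41])
bins = np.array([30, 50])            # two cut points -> three buckets
bucket = np.digitize(ages, bins)     # bucket index for each sample
one_hot = np.eye(3)[bucket]          # one row per sample, one-hot over buckets

# In the open-world setting, test inputs have d' >= d feature dimensions.
d_train, d_test = 5, 8
x_train = np.zeros(d_train)          # a training sample over d features
x_test = np.zeros(d_test)            # a test sample over d' features
```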
2. Important observations
Solving this problem directly is difficult because the training phase knows nothing about the additional features in the test data (neither their number nor their distribution is visible). However, this paper makes two important observations that lead to a reasonable solution.
First, a neural network classifier can be decomposed into two cascaded submodules: a feature embedding layer and classifier layers (as shown in Figure (a) below). The embedding layer can be viewed as a dictionary of feature embeddings, one row per feature: it looks up the embedding of every nonzero feature in the input vector, then sum-pools all retrieved embeddings to obtain the intermediate hidden vector fed into the subsequent classifier layers. The sum-pooling operation here is permutation-invariant over feature dimensions [2]; that is, exchanging the positions of the input elements leaves the output unchanged.
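A minimal NumPy sketch of this embedding-then-sum-pooling step (the matrix sizes are hypothetical), which also verifies the permutation invariance: permuting the feature dimensions together with the rows of the embedding dictionary leaves the pooled hidden vector unchanged.

```python
import numpy as np

def embed_and_pool(x, E):
    """Sum-pool the embeddings of all nonzero features in x.

    x : (d,) binary multi-hot input vector
    E : (d, h) embedding dictionary, one row per feature
    Returns the (h,) hidden vector fed to the classifier layers.
    """
    return x @ E  # equals the sum of E[j] over all j with x[j] == 1

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))          # 6 features, hidden size 4
x = np.array([1, 0, 1, 0, 0, 1])     # sample containing features 0, 2, 5

h = embed_and_pool(x, E)

# Permutation invariance over feature dimensions: shuffling the features
# and the corresponding embedding rows together does not change the output.
perm = rng.permutation(6)
assert np.allclose(h, embed_and_pool(x[perm], E[perm]))
```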
Second, if we stack all the input data (e.g. the training samples) into a matrix (of dimension number-of-samples × number-of-features), it defines a sample-feature bipartite graph: each sample and each feature is a node, and an edge records a sample's membership relation to a feature (i.e. whether the sample contains that feature).
Combining these two points, we find that if we treat all input data (or a batch of data) as such a graph and feed it to the neural network, then thanks to the permutation invariance of the embedding layer, the network can flexibly handle input graphs with any number of feature nodes. This means we can suitably modify a neural network to handle the expansion of the feature space.
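Concretely, the stacked input matrix is itself the biadjacency matrix of the sample-feature bipartite graph. A toy sketch (with a hypothetical 3×3 binary matrix):

```python
import numpy as np

# X[i, j] = 1 iff sample i contains feature j, so X is exactly the
# biadjacency matrix of the sample-feature bipartite graph.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]])
n, d = X.shape

# Edge list of the bipartite graph: one edge per (sample, feature) membership.
edges = [(i, j) for i in range(n) for j in range(d) if X[i, j]]
```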
3. Methods
The figure below shows the model framework proposed in this paper for feature extrapolation (illustrating the model's feed-forward process). The framework contains the input data representation, a high-level graph neural network (GNN) model, and a low-level backbone model. The GNN performs message passing on the sample-feature bipartite graph and infers embeddings for new features by aggregating information; this process is analogous to how the human brain grasps new concepts by relating them to familiar ones. The backbone is an ordinary classifier, except that the parameters of its embedding layer are replaced by the output of the GNN.
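The message-passing step can be sketched as follows. This is a minimal mean-aggregation layer written for illustration only, not the paper's exact GNN architecture, and all sizes are hypothetical; new features start from zero embeddings and receive information from observed features through the samples they share.

```python
import numpy as np

def gnn_layer(X, E_feat):
    """One round of mean-aggregation message passing on the bipartite graph.

    X      : (n, d) binary sample-feature matrix (the graph's biadjacency)
    E_feat : (d, h) current feature embeddings
    Returns updated (d, h) feature embeddings.
    """
    deg_s = X.sum(axis=1, keepdims=True).clip(min=1)  # sample degrees
    deg_f = X.sum(axis=0)[:, None].clip(min=1)        # feature degrees
    H_sample = (X @ E_feat) / deg_s                   # features -> samples
    return (X.T @ H_sample) / deg_f                   # samples -> features

rng = np.random.default_rng(0)
n, d_old, d_new, h = 8, 5, 3, 4
X = rng.integers(0, 2, size=(n, d_old + d_new)).astype(float)

# New features start from zero vectors; message passing fills them in
# using the trained embeddings of the observed features.
E = np.vstack([rng.normal(size=(d_old, h)), np.zeros((d_new, h))])
E_new = gnn_layer(X, E)
```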
The figure below shows the two training strategies proposed for this model. In Figure (a), we use self-supervised training: each iteration masks some features, then uses the remaining features to infer the embeddings of the masked ones. In Figure (b), we use inductive training: each iteration samples a subset of the training set's features and produces predictions using only this subset. In addition, the GNN and the backbone are updated asynchronously, i.e. the GNN is updated once after every k updates of the backbone.
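The feature-masking idea in strategy (a) can be sketched as follows (mask ratio and matrix sizes are hypothetical): a random subset of feature columns is hidden during training so that they play the role of unseen test-time features.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10, 6)).astype(float)  # toy sample-feature matrix

mask_ratio = 0.3
masked = rng.random(X.shape[1]) < mask_ratio  # features treated as "new"
X_observed = X.copy()
X_observed[:, masked] = 0.0                   # the model only sees the rest
```

During training, the GNN would then be asked to reconstruct the embeddings of the masked columns from the observed ones.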
4. Theoretical analysis
We give a theoretical analysis of the proposed training method, mainly bounding the expected gap, over the randomness of the algorithm, between the empirical risk (the model's prediction error on an observed sample set) and the expected risk (its prediction error over the underlying data distribution). The theorem shows that the upper bound on this generalization gap depends mainly on the input feature dimension d and the number of feature combinations M that the sampling algorithm can produce.
5. Experimental results
On multi-class/binary classification datasets, we adopt the following evaluation protocol: randomly split the data samples into train/val/test sets with ratio 6:2:2, then randomly select a subset of observed features from all features; the model is trained on the training data with only the observed features, and accuracy (multi-class) or ROC-AUC (binary classification) is computed on the test data with all features. We compare the following methods:
1. Base-NN: trained and tested with only the observed features;
2. Oracle-NN: trained and tested with all features;
3. Average-NN / KNN-NN / Pooling-NN: infer the embeddings of new features by average-pooling all feature embeddings / aggregating the embeddings of the k nearest features / aggregating neighboring feature embeddings with a mean-pooling GCN, respectively;
4. INL-NN: first trained to saturation on the training data with only observed features, then updated locally on the new features.
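The split-and-select protocol above can be sketched as follows (sizes and the observed-feature ratio are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10

# Random 6:2:2 train/val/test split over samples.
idx = rng.permutation(n)
train_idx, val_idx, test_idx = np.split(idx, [int(0.6 * n), int(0.8 * n)])

# Randomly choose the subset of features observed during training;
# the remaining features appear only at test time.
ratio = 0.5
observed = rng.choice(d, size=int(ratio * d), replace=False)
```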
On six small datasets, with different proportions of observed features (from 30% to 80%), the comparison results are as follows (FATE in the figure is the method proposed in this paper).
In addition, we experimented on large-scale advertisement datasets (millions of samples and features). Here we use a dynamic temporal split: all samples are sorted chronologically and divided into 10 parts; the first part is the training set, the second the validation set, and parts three through ten the test sets. This split naturally introduces features into the test set that never appear during training, with a roughly 1:1 ratio of new to old features. Using DNN and DeepFM [3] as backbone models and ROC-AUC as the evaluation metric, the results are as follows: our method achieves the best predictive performance.
More experimental results, such as scalability tests (the model's training time and memory consumption grow linearly with the number of features and the sample batch size), ablation studies, and feature visualizations, can be found in our paper.
6. More explanations: why graph learning can help solve extrapolation problems
In fact, once the input data is represented as a graph over samples and features, the graph structure yields both sample-feature and feature-feature relations. The feature-feature relation is mediated by samples acting as intermediate nodes, i.e. it is second-order adjacency information on the graph. The graph thus provides a natural bridge between observed and unobserved features: after training we obtain embeddings for the observed features, and for the new features introduced at test time we can pass messages along the graph structure, letting the graph neural network compute the embeddings of the new features from those of the observed features, thereby achieving feature extrapolation.
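This second-order relation can be read off the data matrix itself: (XᵀX)[j, k] counts the samples containing both feature j and feature k. A toy sketch with a hypothetical 3×3 matrix:

```python
import numpy as np

# Feature-feature relations arrive through samples as intermediaries:
# (X^T X)[j, k] counts samples containing both feature j and feature k,
# i.e. second-order adjacency on the sample-feature bipartite graph.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=float)
cooccur = X.T @ X
```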
7. Future outlook and summary
The biggest contribution of this paper is to define a brand-new problem setting, the extrapolation of the feature space, and to show that a neural backbone model can be combined with a graph neural network to handle new features emerging in the testing phase. Since the focus of this paper is to open a new direction under a rather general setting, the research questions studied here can be further expanded in the future. Moreover, the problem and solution studied in this paper can also be applied to many other areas and scenarios, such as handling new users in recommender systems [4] (cold-start users or users with few samples).
References
[1] Q. Wu, C. Yang, and J. Yan. Towards open-world feature extrapolation: An inductive graph learning approach. In NeurIPS 2021.
[2] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. Salakhutdinov, and A. J. Smola. Deep sets. In NeurIPS 2017.
[3] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He. DeepFM: A factorization-machine based neural network for CTR prediction. In IJCAI 2017.
[4] Q. Wu, H. Zhang, X. Gao, J. Yan, and H. Zha. Towards open-world recommendation: An inductive model-based collaborative filtering approach. In ICML 2021.
This article is from the WeChat public account PaperWeekly.
Author: Qitian Wu
Illustration by Polina Orlova from icons8