Zhou Jun, Ant Group: The practice and exploration of trusted AI in the digital economy

Reports from the Heart of the Machine

Heart of the Machine Editorial Department

On March 23, at the Heart of the Machine AI Technology Annual Conference, Zhou Jun, general manager of the Financial Machine Intelligence Department at Ant Group, delivered a keynote speech titled "Practice and Exploration of Trusted AI in the Digital Economy".

Zhou Jun explained that if the digital economy is compared to a tree, then artificial intelligence (AI), big data, cloud computing and other technologies form the trunk: they are the core of the digital economy and connect the levels above and below. Factors such as privacy and security form the roots and determine how the tree will grow in the future. The trunk and the roots must be closely integrated for the tree to flourish, which is why AI + privacy and AI + security have become directions in urgent need of breakthroughs. Trusted AI technology will be one of the key capabilities for resisting risk and making technology more inclusive in the digital age. In June 2020, after six years of exploration, Ant Group officially released its trusted AI technology architecture. It has since achieved a number of research breakthroughs and deployments in privacy protection, explainability, robustness, fairness and other areas, but there is still a long way to go and continued investment is needed.


The following is Zhou Jun's speech at the Heart of the Machine AI Technology Annual Conference, edited and organized by Heart of the Machine without changing the original meaning:

It's a pleasure to be here at Heart of the Machine. As everyone knows, artificial intelligence is becoming an integral part of daily life and is used to help users make all kinds of decisions. But AI technology has also exposed many weaknesses, such as bias and vulnerability. To solve these problems, it is very important to establish mechanisms and methods for trustworthy AI, and that is the topic I want to share today: the practice and exploration of trusted AI in the digital economy.

Turning to the digital economy specifically, the financial technology framework published by the Bank for International Settlements shows that AI is already widely used. Artificial intelligence, cloud computing and other technologies in the trunk have become the core of financial technology and play a very important role.

Image source: https://twitter.com/bis_org/status/1222834967920685057

As industries become intelligent, fundamental issues such as privacy protection and data security will matter more and more to AI, and they will also shape the future of the entire digital economy. The trunk and the roots must therefore be closely integrated for the tree to flourish. Among these combinations, AI + privacy / security has become the direction where breakthroughs are most urgently needed. Trusted AI is very important to both industry and academia: only when the decisions made by AI are safe, trustworthy, respectful of privacy and easy to understand can people trust AI and AI really be useful.

While building our digital economy platform, we have also worked with many universities to develop trusted AI technologies. We want trusted AI to perform well in data privacy protection, interpretability and causal analysis, fairness, and security (robustness), so that it meets the expectations of the public and the industry.

To realize trusted AI in the context of the digital economy, we have settled on several key directions, such as fair machine learning, adversarial machine learning, graph machine learning, explainable machine learning, and trusted privacy computing. Research and development in these directions supports concrete applications such as risk management, security risk control, and wealth management. It also ensures that these methodologies can be scientifically defined and broken down into engineering goals, which lets us launch a range of platforms and tools so that the concept of "trusted AI" can be applied throughout the AI life cycle.

Next, I will introduce our progress in four directions: graph machine learning, explainability, privacy protection, and adversarial learning.

Graph machine learning

Graphs are a very common data structure in non-Euclidean spaces and are widely used in social networks, biomedicine and other fields. A graph models entities and their relationships as nodes and edges. Because graphs are so expressive, many methods known as graph neural networks (GNNs) have emerged in recent years. A GNN is a deep learning method that operates on a graph, and it works very well in areas such as recommendation and fraud detection.

In practice, we found that GNNs handle the problem of insufficient information relatively well. This improves the service AI can offer to customers with thin information, such as long-tail customers and small and micro enterprises, and greatly increases their chances of benefiting from digital services and the digital economy. GNNs can therefore extend the coverage of AI and contribute to its inclusiveness. A big challenge, however, is how to do graph modeling at industrial scale.

In machine learning, engineering is the foundation for algorithms: without strong engineering support, algorithms are difficult to apply at scale. To support the industrial-scale graph data mentioned earlier, we first developed a graph learning system, AGL (Ant Graph Learning)[1], which builds on the two classic operations in graph neural networks, aggregation and updating. We showed a basic formula on the slide: for a graph neural network that captures k-hop neighbors, the basic k-layer learning paradigm applies these two operations layer by layer, and the accompanying diagram shows the direction of propagation and aggregation.
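The slide formula did not survive in this text, so here is a minimal sketch of a generic k-layer aggregate-and-update loop. It assumes mean aggregation, a linear update with ReLU, and NumPy; the function name `gnn_forward` and the toy graph are illustrative and are not taken from AGL.

```python
import numpy as np

def gnn_forward(adj, features, weights):
    """Run len(weights) rounds of mean aggregation followed by an update."""
    # Add self-loops so each node keeps its own state when aggregating.
    a_hat = adj + np.eye(adj.shape[0])
    # Row-normalise: each row averages over the node and its neighbors.
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)
    h = features
    for w in weights:
        aggregated = a_hat @ h               # aggregation: collect neighbor states
        h = np.maximum(aggregated @ w, 0.0)  # update: linear map plus ReLU
    return h

# Toy path graph 0 - 1 - 2 - 3 with one-hot node features.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
features = np.eye(4)

# Identity weights make the receptive field visible: entry [i, j] is
# non-zero only if node j is within k hops of node i.
two_layers = gnn_forward(adj, features, [np.eye(4)] * 2)
three_layers = gnn_forward(adj, features, [np.eye(4)] * 3)
print(two_layers[0])
print(three_layers[0])
```

With two layers, node 0 receives nothing from node 3, which is three hops away; with three layers it does. This is the sense in which a k-layer network captures k-hop neighbors.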

To train such a graph neural network and run inference at large scale, our system is divided into three parts. The system was designed with a focus on scalability and fault tolerance, and on reusing existing methods wherever possible. Following these principles, we built three core modules:

GraphFlat (processes samples and their neighbors);

GraphTrainer (the actual training component);

GraphInfer (dedicated to inference on large models).

I'll explain some of these key parts next.

First, the trainer uses a traditional parameter server architecture. It can store a relatively large number of parameters by splitting them into multiple shards, and it then uses the large pool of machines available in an industrial-grade system, the workers, for parallel computation.
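To make the parameter server idea concrete, the following is a single-process toy sketch, not AGL's trainer. It assumes a least-squares model, synchronous updates, and NumPy; the class `ShardedParameterServer` and the function `worker_gradient` are hypothetical names introduced only for illustration.

```python
import numpy as np

class ShardedParameterServer:
    """Toy server that stores one parameter vector as several shards."""

    def __init__(self, dim, num_shards):
        self.bounds = np.linspace(0, dim, num_shards + 1).astype(int)
        self.shards = [np.zeros(self.bounds[i + 1] - self.bounds[i])
                       for i in range(num_shards)]

    def pull(self):
        # Workers fetch the current parameters from all shards.
        return np.concatenate(self.shards)

    def push(self, grad, lr):
        # Each shard applies only its own slice of the gradient.
        for i, shard in enumerate(self.shards):
            lo, hi = self.bounds[i], self.bounds[i + 1]
            shard -= lr * grad[lo:hi]

def worker_gradient(w, x, y):
    """Least-squares gradient computed on one worker's data partition."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
true_w = rng.normal(size=6)
x = rng.normal(size=(400, 6))
y = x @ true_w

server = ShardedParameterServer(dim=6, num_shards=3)
partitions = np.array_split(np.arange(400), 4)  # four simulated workers
for step in range(200):
    w = server.pull()
    # In a real system these gradients would be computed in parallel.
    grads = [worker_gradient(w, x[idx], y[idx]) for idx in partitions]
    server.push(np.mean(grads, axis=0), lr=0.1)
```

The workers here run sequentially in one process; the point is only the division of labor, where parameters live in shards on the server side and gradients are computed on data partitions on the worker side.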

In AGL, we use batch frameworks such as MapReduce to generate graph samples, and we designed a variety of computational optimizations for training, such as edge partitioning, graph pruning, and pipeline parallelism. On a real industrial-grade graph with 6.2 billion nodes and more than 330 billion edges, we tested the system using more than 30,000 cores. On a dataset of this size, AGL achieves a near-linear speedup and scales well, which lays a solid foundation for algorithms that support industrial-scale graph machine learning.

Based on this system, we first built an anti-cash-out application. Using transfer and transaction data from a large-scale network of capital relationships, we take the buyer subgraph, the seller subgraph, and the buyer path subgraph and generate a transaction subgraph through graph simulation. We then use AGL for dynamic graph learning. With the learned graph representations we perform link prediction to identify cash-out transactions within the large-scale capital network, which produced a considerable drop in the cash-out rate (a relative decline of 10%).

With that in place, the second question is how to use such systems to make AI more inclusive, especially for long-tail users and SMEs. SMEs face the Macmillan gap (a large gap in capital allocation caused by an insufficient supply of financial resources), which often holds back their development. Small and medium-sized enterprises are also the capillaries of the economy and play a critical role in how the economy and finance operate. We hope that GNNs can be used to analyze the creditworthiness of customers with limited credit history, so as to meet some of the financial needs of SMEs and make AI more inclusive.

Specifically, we first perform supply chain mining using link prediction: we predict which companies are likely to form a business community, and then carry out credit analysis based on that community while protecting privacy. Once a large number of small and medium-sized enterprises have been grouped into supply chain communities and analyzed, we can assess the credit status of each enterprise.
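As a rough illustration of link prediction, the sketch below scores candidate company pairs from node embeddings. It assumes a dot-product decoder with a sigmoid, which is one common choice and not necessarily the one used in [2] or [3]; the embeddings are hand-written toy values and `link_scores` is a hypothetical helper.

```python
import numpy as np

def link_scores(embeddings, candidate_pairs):
    """Score each candidate (u, v) pair as sigmoid(z_u . z_v)."""
    u = embeddings[candidate_pairs[:, 0]]
    v = embeddings[candidate_pairs[:, 1]]
    logits = np.sum(u * v, axis=1)
    return 1.0 / (1.0 + np.exp(-logits))

# Toy embeddings; in practice a GNN would produce these.
embeddings = np.array([
    [2.0, 0.1],    # company 0
    [1.8, 0.2],    # company 1, close to company 0
    [-1.5, 1.0],   # company 2, far from company 0
])
pairs = np.array([[0, 1], [0, 2]])
scores = link_scores(embeddings, pairs)
print(scores)
```

Pairs whose embeddings point in similar directions get a score near 1 and are treated as likely supply chain links; dissimilar pairs get a score near 0.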

To this end, we proposed the Spatial-Temporal aware Graph Neural Network (ST-GNN)[2]. We first mine the associations between enterprises through the supply chain, as described above, and then combine them with existing risk labels in the graph. ST-GNN, which combines temporal and spatial information, turns this into a credit scoring problem: it scores the enterprises across the entire supply chain network and uses those scores to estimate each enterprise's probability of default, so that their financial needs can be met.

We compared it with several traditional methods (e.g., GBDT, GAT). The results show that our method, which incorporates spatiotemporal information, greatly improves model performance when predicting the financial needs of SMEs. The main reason is that it draws on a large amount of information in the graph and uses a temporal and spatial attention mechanism. This lets it integrate diverse, multi-dimensional information and capture the more complex community relationships between enterprises, which in turn yields credit scores for small and medium-sized enterprises and helps them obtain the corresponding financial services.

To improve supply chain mining, we also proposed the Path-aware Graph Neural Network (PaGNN)[3]. It combines the two operators of propagation and aggregation and, in the process, learns the structure between two nodes (such as the paths connecting them). This allows it to better judge the complex relationships that may exist between the two nodes, and therefore to better map communities, support supply chain finance, and meet the financial needs of small and medium-sized enterprises.

Here is a case. First, from the public digital information of enterprises, we build a graph of the supply chain network. With this graph we can form the supply chain network for particular brands, mine the relationships in the graph using the GNN methods mentioned earlier, and then turn the task into a credit scoring problem. This graph-based approach has greatly improved the accuracy of community discovery, which helps downstream enterprises obtain operating loans and improves the coverage and inclusiveness of AI.

We also noticed that graph learning algorithms themselves have robustness problems. We therefore worked with universities to improve model robustness and to address potential problems such as over-smoothing and poor generalization. We also proposed a new robust heterogeneous GNN framework to counter topological adversarial attacks. It is equipped with an attention purifier that prunes adversarial neighbors based on topology and feature information, further enhancing the reliability of AI [4][5][6].

Interpretable machine learning

Many AI methods today are black boxes (below), and people know little about what happens inside them. We hope that explainable machine learning will turn the black box into a gray box (explainable to some extent) and eventually into a white box (fully explainable). Interpretable machine learning enables machine learning models to explain or present their behavior to users in a way that is easy to understand.

We proposed a new approach, COCO (COnstrained feature perturbation and COunterfactual instances)[7], to interpret test samples for arbitrary models. The industry already has a number of interpretability methods, such as intrinsically interpretable models (e.g., decision trees), global interpretability methods (e.g., PLNN), and post-hoc local interpretability methods (e.g., SHAP). What we propose is an interpretability approach better suited to industrial applications.

The method itself is not very complicated. The algorithm has three main steps: first, it selects the neighbors of the test sample and uses Mixup to generate perturbed data; then, it applies a constrained perturbation to the test sample to obtain a counterfactual sample; finally, it measures the counterfactual sample to compute the feature importance of the test sample, which provides an explanation for any model.
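The sketch below illustrates two of the ingredients named above, Mixup-style perturbation from neighbors and a constrained, one-feature-at-a-time perturbation used to measure feature importance. It is a deliberately simplified stand-in and not the COCO algorithm from [7]: it does not search for counterfactual instances, and the model, the data, and the names `mixup_perturb` and `feature_importance` are assumptions made for illustration.

```python
import numpy as np

def mixup_perturb(x, neighbors, rng, n_samples=200):
    """Interpolate the test sample with randomly chosen neighbors."""
    lam = rng.uniform(0.0, 1.0, size=(n_samples, 1))
    picks = neighbors[rng.integers(0, len(neighbors), size=n_samples)]
    return lam * x + (1.0 - lam) * picks

def feature_importance(predict, x, neighbors, rng):
    """Mean prediction shift when only one feature is perturbed at a time."""
    perturbed = mixup_perturb(x, neighbors, rng)
    base = predict(x[None, :])[0]
    importance = np.zeros(x.shape[0])
    for j in range(x.shape[0]):
        probe = np.repeat(x[None, :], len(perturbed), axis=0)
        probe[:, j] = perturbed[:, j]  # constrained: change feature j only
        importance[j] = np.mean(np.abs(predict(probe) - base))
    return importance

def predict(batch):
    """Stand-in black-box model: feature 0 matters most, feature 1 not at all."""
    logits = 3.0 * batch[:, 0] + 0.0 * batch[:, 1] + 0.5 * batch[:, 2]
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
x = np.zeros(3)                         # the test sample to explain
neighbors = rng.normal(size=(50, 3))    # nearby samples around x
importance = feature_importance(predict, x, neighbors, rng)
print(importance)
```

Because the procedure only calls `predict`, it treats the model as a black box, which is what lets a perturbation-based explanation apply to arbitrary models.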

On image data, we selected the 200 most important features. When plotted, these features essentially trace the outlines of the digits, so you can see at a glance that the important features found really do lie on the digits. On the tabular data commonly used in industry, we likewise first mine the important features and train a model on them, and then compare its accuracy with that of models trained on the features selected by other methods (such as SHAP and LIME). As you can see, our method achieves relatively good accuracy.

From this we can draw several conclusions: first, with constrained perturbation, COCO identifies important features more easily; second, with Mixup augmentation, the generated data is more reasonable; third, COCO is relatively more robust and more stable.

We use this approach in risk-perception scenarios. For example, we sometimes find that a person (say, Zhang) has two Alipay accounts and frequently uses one of them to transfer money to the other. The risk-perception model may then judge that the account has been fraudulently used. We want to know why the model makes that decision, so we use COCO to generate the important decision factors for it. We might end up with factors such as the priority ranking of platform members under the same mobile phone number, the cumulative number of historical logins to Alipay, and the transaction anomaly index over the past 360 days.

With important features like these, we can analyze why the risk-perception model made a decision and verify whether the model is reasonable and whether its results are credible and reliable. We pass these decision factors to business decision-makers, who verify the actual situation (for example, whether the impersonator and the impersonated person are family members), then make a combined human-machine judgment and decide whether to freeze the account or report the case. This helps our business staff understand the logic behind the model's decisions, and it helps our business experts use model explanations to make decisions and control model risk.

We are very cautious with decisions like this that involve financial accounts. We want to better control model risk and avoid disturbing users, so that the risk-perception model can better protect everyone's account security and fight crime. We also hope that experts will come to understand the model and feed their business experience back into it, so that humans and machines together achieve better results.

Privacy-preserving machine learning

Privacy protection has been developing in the industry for many years and has accumulated many techniques, such as anonymization, differential privacy, TEE, and secure multi-party computation. Each technique has its own applicable scenarios. However, we find that current privacy protection technology struggles to balance model strength, accuracy and efficiency, which currently constrain one another.

In industrial scenarios such as recommendation, marketing, and advertising, the data is often very large and at the same time very sparse. Although academia has proposed many privacy-preserving machine learning approaches, applying them to large-scale sparse data is a big problem.

To this end, we proposed CAESAR (Secure Large Scale Sparse Logistic Regression)[8], a large-scale privacy-preserving LR algorithm built on a hybrid MPC protocol.

Why design such a hybrid MPC protocol? Because we found that: 1) homomorphic encryption protocols generally have relatively low communication complexity but relatively high computational complexity, while secret sharing protocols have higher communication complexity but low computational complexity; 2) the nonlinear functions in machine learning models either cannot be computed directly in ciphertext space, or can be computed only with performance that falls short of real-world needs, so efficient approximations are required that reduce the computation a function needs while preserving model accuracy, which further reduces communication overhead. We therefore proposed the hybrid MPC protocol, designed a privacy-preserving matrix multiplication, reduced the complexity of the nonlinear operation through Taylor expansion, and completed the LR method.
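To see why a Taylor expansion helps, note that the sigmoid in logistic regression can be replaced near zero by a low-degree polynomial, which needs only additions and multiplications. The sketch below compares the standard first-order and third-order expansions around 0 with the exact sigmoid; the expansion order actually used in CAESAR is not stated here, so `order` is an illustrative parameter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_taylor(x, order=3):
    """Taylor expansion of sigmoid around 0: 1/2 + x/4 - x^3/48."""
    approx = 0.5 + x / 4.0
    if order >= 3:
        approx = approx - x ** 3 / 48.0
    return approx

xs = np.linspace(-1.0, 1.0, 201)
max_err_order1 = np.max(np.abs(sigmoid(xs) - sigmoid_taylor(xs, order=1)))
max_err_order3 = np.max(np.abs(sigmoid(xs) - sigmoid_taylor(xs, order=3)))
print(max_err_order1, max_err_order3)
```

The approximation is only accurate for inputs close to zero, so the interval [-1, 1] used here is an assumption about the range of the logits.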

The main points are: 1) sparse matrix multiplication: the hybrid MPC protocol lets us choose the right protocol in the right place without generating Beaver triples, which improves efficiency; 2) secure sparse matrix operations that use secret sharing and homomorphic encryption together, combined with distributed computing to make full use of existing cluster resources under the direction of a coordinator. Each cluster is itself a distributed learning system, so distributed operations run well within each cluster, and the overall coordinator brings them together to complete the final computation.
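For readers unfamiliar with secret sharing, the sketch below shows two-party additive secret sharing over the integers modulo 2^32, one of the two building blocks mentioned above. It covers only local addition and multiplication by a public constant; it is not the CAESAR protocol, involves no homomorphic encryption, and the modulus and helper names are illustrative assumptions.

```python
import secrets

MODULUS = 2 ** 32

def share(value):
    """Split a value into two random-looking shares that sum to it."""
    r = secrets.randbelow(MODULUS)
    return r, (value - r) % MODULUS

def reconstruct(s0, s1):
    return (s0 + s1) % MODULUS

# Party 0 holds (a0, b0), party 1 holds (a1, b1); neither sees 1234 or 4321.
a0, a1 = share(1234)
b0, b1 = share(4321)

# Addition needs no communication: each party adds its own shares.
sum0 = (a0 + b0) % MODULUS
sum1 = (a1 + b1) % MODULUS

# Multiplying by a public constant is also purely local.
scaled0 = (3 * a0) % MODULUS
scaled1 = (3 * a1) % MODULUS

print(reconstruct(sum0, sum1))        # 5555
print(reconstruct(scaled0, scaled1))  # 3702
```

Multiplying two secret values is what normally requires extra machinery such as Beaver triples, which is the cost the hybrid protocol described above is designed to avoid.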

In this way, we found that CAESAR was about 130 times more efficient than the industry's existing SecureML approach.

Based on this privacy protection technology, we built a joint risk control application with SPDB. We ran trials on data that had already been authorized, and the original data was shared in neither the model training phase nor the model serving phase. Compared with modeling on one party's data alone, joint modeling improves the model's performance metrics (for example, raising the KS index by 12% to 23%). Applying the model's output in the risk control scenario lets us implement a differentiated credit strategy and prevent potentially high-risk loans, so that the right loans go to the right people and financial risk is genuinely reduced.
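The KS index mentioned above is commonly computed as the largest gap between the cumulative score distributions of positive and negative samples. The sketch below shows that calculation on made-up labels and scores; the function name `ks_statistic` is illustrative, and the data has no relation to the joint risk control results.

```python
import numpy as np

def ks_statistic(labels, scores):
    """Largest gap between cumulative true-positive and false-positive rates."""
    order = np.argsort(-scores)          # rank samples from highest score down
    ranked = labels[order]
    tpr = np.cumsum(ranked) / ranked.sum()
    fpr = np.cumsum(1 - ranked) / (1 - ranked).sum()
    # Tied scores are not grouped here, which is fine for distinct scores.
    return np.max(np.abs(tpr - fpr))

labels = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.1])
ks = ks_statistic(labels, scores)
print(ks)
```

A KS of 1.0 means the scores separate the two classes perfectly, and a KS near 0 means the model cannot tell them apart, so a higher KS indicates better risk discrimination.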

We also apply these techniques to scenarios such as joint analysis and knowledge fusion [9]. The core technology can be summarized as follows: built on cloud computing and trusted privacy computing, it lets value flow by securely sharing model gradients and parameters, and it can be applied both to operational optimization within an institution and to the secure sharing of information between institutions. For example, technologies such as privacy-preserving knowledge graphs let institutions fuse their domain knowledge, improve the accuracy of entity recognition, and support applications such as disease analysis in insurance and securities analysis.

Adversarial machine learning

In adversarial machine learning, we mainly pit our own attack and defense against each other: we assume that the attacker knows little about the model itself and attack our system on that basis (a black-box attack). We designed two attack vectors (as shown below). By varying the attack scenarios and diversifying the samples, we aim to keep improving the transferability of adversarial samples and the efficiency of transfer attacks, so that we can examine the security of the digital links in our services and strengthen their resistance to attack. We also feed the samples generated by these adversarial attacks into our machine learning training platform. We built an adversarial training platform that folds the samples produced by the attack methods above into the training process, so that the decision boundary moves from the red line to the blue line and becomes smoother. A smoother boundary means better generalization, which improves model robustness and in some cases even mitigates sample imbalance, leading to better business accuracy [10].
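The loop of generating adversarial samples and folding them back into training can be sketched on a small model. The example below swaps in a white-box FGSM-style perturbation on a NumPy logistic regression, which is a simpler technique than the black-box transfer attacks described above and is used only to show the mechanics; the data, step size `eps`, and all function names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(x, y, steps=500, lr=0.5):
    """Plain gradient descent on the logistic loss (no bias term)."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        w -= lr * x.T @ (sigmoid(x @ w) - y) / len(y)
    return w

def fgsm(x, y, w, eps):
    """Shift each sample by eps in the direction that increases its loss."""
    grad_x = (sigmoid(x @ w) - y)[:, None] * w[None, :]
    return x + eps * np.sign(grad_x)

def accuracy(x, y, w):
    return np.mean((sigmoid(x @ w) > 0.5) == y)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400).astype(float)
x = rng.normal(size=(400, 2)) + np.where(y[:, None] == 1, 1.5, -1.5)

w_plain = train_logreg(x, y)
x_adv = fgsm(x, y, w_plain, eps=1.0)      # attack the trained model

# Adversarial training: retrain on clean plus adversarial samples.
w_robust = train_logreg(np.vstack([x, x_adv]), np.concatenate([y, y]))

print(accuracy(x, y, w_plain), accuracy(x_adv, y, w_plain))
```

The printed pair shows how much accuracy the plain model loses on the perturbed samples. Whether retraining recovers it depends on the model class and the attack, so this sketch makes no claim about the size of the gain.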

So far, we have summarized several implementations and practices of trusted AI in the digital economy, from inclusiveness and explainability to privacy protection and adversarial learning. We have also found that every small step in applying AI in the enterprise brings us a little closer to the dream of an intelligent future.

In practicing and exploring trusted AI, we also found that although the industry has some trusted AI cases and studies, this direction still has a long way to go. There have been many breakthroughs, but most of them are still confined to isolated, point-like scenarios.

We firmly believe that trusted AI technology can keep improving the transparency and friendliness of AI in financial scenarios and make decision-making more intelligent. Because AI is still developing rapidly, the practices and deployments we shared today may still be some distance from the trusted AI we ultimately want. We hope that the research, practice, pitfalls and early-stage attempts from industry that we shared today will prompt more of our peers to think deeply, so that trusted AI can truly help resist the risks of the digital age and make technology more inclusive.

References:

[1] Zhang D, Huang X, Liu Z, et al. AGL: a scalable system for industrial-purpose graph machine learning[J]. Proceedings of the VLDB Endowment, 2020, 13(12): 3125-3137.

[2] Yang S, Zhang Z, Zhou J, et al. Financial Risk Analysis for SMEs with Graph-based Supply Chain Mining[C]//IJCAI. 2020: 4661-4667.

[3] Yang S, Hu B, Zhang Z, et al. Inductive Link Prediction with Interactive Structure Learning on Attributed Graph[C]//Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Cham, 2021: 383-398.

[4] Yu L, Pei S, Zhang C, et al. Self-supervised smoothing graph neural networks[C]. AAAI 2022, accepted.

[5] Bo D, Hu B B, Wang X, et al. Regularizing Graph Neural Networks via Consistency-Diversity Graph Augmentations[C]. AAAI 2022, accepted.

[6] Zhang M, Wang X, Zhu M, et al. Robust Heterogeneous Graph Neural Networks against Adversarial Attacks[C]. AAAI 2022, accepted.

[7] Fang J P, Zhou J, Cui Q, et al. Interpreting Model Predictions with Constrained Perturbation and Counterfactual Instances[J]. International Journal of Pattern Recognition and Artificial Intelligence, 2021: 2251001.

[8] Chen C, Zhou J, Wang L, et al. When homomorphic encryption marries secret sharing: Secure large-scale sparse logistic regression and applications in risk control[C]//Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021: 2652-2662.

[9] Chen C, Wu B, Wang L, et al. Nebula: A Scalable Privacy-Preserving Machine Learning System in Ant Financial[C]//Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2020: 3369-3372.

[10] Huan Z, Wang Y, Zhang X, et al. Data-free adversarial perturbations for practical black-box attack[C]//Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Cham, 2020: 127-138.
