
Invited Article丨Large-scale device-cloud collaborative intelligent computing

Author: Chinese Society of Artificial Intelligence

Text / Wu Fan

0 Introduction

Many intelligent technologies have already entered daily life: interactive product recommendation, face recognition, speech recognition, vital-sign monitoring, disease diagnosis, intelligent decision-making, and more. While bringing great convenience, these technologies have also profoundly changed industrial forms, driving traditional sectors such as retail, home, transportation, healthcare, and defense toward intelligence.

In the traditional cloud-based framework, the terminal uploads the user's raw data; the cloud server maintains the machine learning model, takes the user's input, performs model inference, and returns the result. The user terminal merely collects data and displays results. This framework suffers from several problems. First, uploading user data to the cloud risks leaking user privacy. Second, data upload and result backhaul introduce communication latency, hurting real-time responsiveness. Third, the cloud may need to run thousands of different machine learning tasks simultaneously to serve requests from hundreds of millions of devices, creating a bottleneck under high service load.

To break through these bottlenecks, a new paradigm of device-cloud collaborative intelligence has emerged: offload some inference tasks, or some stages of them, to the device side, exploiting the device's ability to process data locally in real time, thereby reducing response latency and relieving the load on cloud servers.

Device-cloud collaboration is not only a mode of intelligent inference but also an innovative paradigm for model evolution. One of the top ten technology trends for 2022 released by Alibaba DAMO Academy is the co-evolution of large models on the cloud and small models on the device. It points out that ultra-large-scale pre-trained models are a breakthrough exploration on the path from narrow to general artificial intelligence, but their performance gains are disproportionate to their energy consumption, which limits the continued growth of parameter scale. A promising way forward for artificial intelligence research is therefore to shift from a race over model parameter counts to the co-evolution of large and small models. On the one hand, the large model exports its capabilities to small models at the edge; on the other hand, the small models carry out the actual inference and execution, and feed execution results and new on-device knowledge back to the large model, so that the large model's capabilities continue to evolve, together forming an organically evolving intelligent system. The journal of the Chinese Academy of Engineering also held a frontier forum, "Sumeru in a Mustard Seed", on the co-evolution of large and small models, which attracted wide attention from industry and academia.

1 Large-scale device-cloud collaborative learning

Large-scale device-cloud collaborative learning is essentially a distributed machine learning paradigm, but it differs from traditional distributed machine learning. The main differences are that, in the traditional setting, the training servers are significantly more powerful than user terminal devices, and the training dataset is deliberately partitioned so as to be independent and identically distributed (IID), which guarantees the convergence of distributed training.

Traditional distributed machine learning mainly uses two methods: data parallelism and model parallelism (see Figure 1). For tasks with large amounts of data, the data-parallel mode accelerates learning. As shown in Figure 1(a), the dataset is split into several subsets stored on different servers, and each server downloads a copy of the model. Each server trains on its local subset and sends model updates to a parameter server, which aggregates them and periodically redistributes the latest model, speeding up training. For tasks whose model is too large for a single training server, the model-parallel mode shares the computational load. As shown in Figure 1(b), each training server holds only a portion of the full model, and data flows between the servers, progressively updating each part of the model. However, in distributed machine learning on the cloud, the splitting of the dataset and the model is typically arbitrary, and such arbitrary splitting is not applicable in device-intelligence scenarios.
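As an illustration of the data-parallel mode in Figure 1(a), the following minimal numpy sketch (all names hypothetical, a toy least-squares task standing in for a real model) splits the dataset across four simulated workers, averages their gradients at a "parameter server", and redistributes the updated model each round:

```python
import numpy as np

def worker_gradient(w, X, y):
    """One data-parallel worker: least-squares gradient on its local shard."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def parameter_server_round(w, shards, lr=0.1):
    """Average the workers' gradients, take one global step, redistribute."""
    grads = [worker_gradient(w, X, y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

# Toy dataset, split arbitrarily across 4 "servers" (equal-size shards).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
shards = [(X[i::4], y[i::4]) for i in range(4)]

w = np.zeros(3)
for _ in range(200):
    w = parameter_server_round(w, shards)
```

Because the shards are equal-sized, the averaged shard gradients equal the full-batch gradient, so the distributed run matches centralized gradient descent.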


Figure 1 Traditional distributed machine learning: deliberate partitioning of data and models

Google took the lead in extending data-parallel distributed training to resource-constrained mobile devices and proposed the federated learning framework. Under Google's framework (see Figure 2), user data follows its natural partitioning and stays on the user's device, and the idea of data parallelism is borrowed to train machine learning models in a distributed manner. This method suits applications with small models, so Google applied it to optimize next-word prediction in Gboard, the Android keyboard, with good results. Since a language has on the order of 10,000 common words, a language model embedding 10,000 words is only about 1.4 MB, so training and inference can easily be completed on the device.
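The flow in Figure 2 can be sketched in a few lines: each simulated client runs a few local training epochs on data that never leaves it, and the server forms a sample-count-weighted average of the returned models (the FedAvg scheme). This is an illustrative toy with hypothetical names and a linear model standing in for the language model, not Google's production code:

```python
import numpy as np

def local_train(w, X, y, epochs=5, lr=0.1):
    """Client side: a few gradient epochs on data that stays on the device."""
    w = w.copy()
    for _ in range(epochs):
        w -= lr * 2.0 * X.T @ (X @ w - y) / len(y)
    return w

def fedavg_round(w_global, clients):
    """Server side: weighted average of the clients' locally trained models."""
    n_total = sum(len(y) for _, y in clients)
    return sum(len(y) / n_total * local_train(w_global, X, y)
               for X, y in clients)

rng = np.random.default_rng(1)
w_true = np.array([0.5, -1.0])
clients = []
for _ in range(10):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ w_true))   # each client's private data

w = np.zeros(2)
for _ in range(100):
    w = fedavg_round(w, clients)
```

As a back-of-envelope check of the size claim: 10,000 words times roughly 35-dimensional float32 embeddings (our assumption) times 4 bytes is about 1.4 MB.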


Figure 2 Google Data Parallel Federated Learning Framework

However, when the feature space of the model grows further, federated learning over the complete model no longer applies. While collaborating with Alibaba's mobile Taobao division, we found that in an industrial recommendation scenario, personalized recommendations must be made for one billion mobile users over two billion candidate products. The cloud-side model supporting this application, a Deep Interest Network, embeds two billion product IDs and exceeds 100 GB in size. Clearly, the complete deep learning model cannot be deployed on devices for training, so simply following Google's data-parallel federated learning framework is infeasible.

Our team's MobiCom work with Alibaba's mobile Taobao division addressed ultra-large-scale federated learning on mobile devices, specifically in this industrial setting of one billion mobile users and two billion candidate products, where the mobile Taobao app's runtime memory is capped at 200 MB.

Given the huge size of the cloud-side model, inference on the device requires first making the big model smaller. As shown in Figure 3, we first tried existing model compression methods, including pruning, quantization, and knowledge distillation. Model pruning analyzes and evaluates each parameter's role in and contribution to the final result, and deletes the nodes and edges of low importance, reducing the number of parameters. Model quantization reduces the precision of model parameters so that each occupies fewer bits, shrinking the model's overall footprint. Knowledge distillation builds a student model with a simpler structure, takes the original complex model as the teacher, and trains the student to imitate the teacher's outputs, thereby simplifying the model.
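Two of the three techniques can be made concrete in a short sketch (hypothetical, not the production pipeline): unstructured magnitude pruning and uniform symmetric 8-bit post-training quantization of a weight vector. Distillation is omitted since it needs a full training loop:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

def quantize_int8(w):
    """Uniform symmetric quantization: 4 bytes/param -> 1 byte/param."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)

pruned = magnitude_prune(w, sparsity=0.9)   # ~90% of weights become zero
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)                # reconstruction error <= scale/2
```

Pruning trades accuracy for sparsity that a sparse format can exploit; quantization trades a bounded per-weight error (at most half the quantization step) for a 4x size reduction.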


Figure 3 Existing model compression methods

The accuracy of the compressed models obtained by these methods fell short of expectations. The reason is that a single compressed model cannot fully characterize the personalized data of a massive number of devices. If on-device inference is already inadequate, on-device training is even more so.

2 Device-cloud collaborative joint learning of large and small models

To solve these problems, we analyzed the personalized data characteristics of the device side in depth and proposed a joint learning framework for device-cloud collaboration between large and small models.

As shown in Figure 4, we observe that the data on a terminal usually involves only one subspace of the full feature space, so the terminal only needs the portion of the model parameters corresponding to its local data features, which we call a "submodel", to meet local requirements. After training with local data, only the parameters of that mapped submodel are updated. From the perspective of model partitioning, a submodel is a feature-based split of the model.
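Feature-based model segmentation can be sketched on a toy embedding table (hypothetical names and sizes; a real Deep Interest Network table is orders of magnitude larger): the device pulls only the embedding rows its local item IDs touch, and only those rows are ever updated on the cloud:

```python
import numpy as np

VOCAB, DIM = 100_000, 8   # full embedding table held on the cloud (toy scale)
cloud_embedding = np.zeros((VOCAB, DIM), dtype=np.float32)

def pull_submodel(local_item_ids):
    """The device pulls only the rows mapped from its local data features."""
    ids = np.unique(local_item_ids)
    return ids, cloud_embedding[ids].copy()

def push_update(ids, delta):
    """After local training, only the pulled rows are updated on the cloud."""
    cloud_embedding[ids] += delta

local_items = np.array([3, 17, 17, 42])      # this device's interaction history
ids, sub = pull_submodel(local_items)        # 3 rows instead of 100,000
push_update(ids, np.ones_like(sub))          # stand-in for a real gradient
```

At this granularity, a user who touches a few hundred items pulls a few hundred rows rather than the full table.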


Figure 4 Submodel: feature-based model segmentation

Based on this idea, each terminal pulls from the parameter server the submodel mapped from its local data features, trains that submodel with its local data, and submits only the submodel's parameter updates. In this way the terminal participates in device-cloud collaborative joint learning without depending on the complete model.

Our federated submodel learning framework adopts the natural partitioning of the dataset together with feature-based partitioning of the model, realizing data parallelism and model parallelism at the same time. Data analysis shows that each mobile Taobao user pays attention to about 300 products per month on average; a Deep Interest Network submodel embedding 300 product IDs is only about 0.27 MB, a compression ratio of roughly 1/500,000, far beyond any model compression algorithm, so the submodel can be trained and executed efficiently on mobile devices.

Moreover, if every terminal uses the full model instead of its feature-mapped submodel, the federated submodel framework degenerates into traditional federated learning, so the framework is strictly more general. This generality means that techniques for improving the efficiency of federated learning also apply to federated submodel learning; for example, the model compression mentioned earlier can compress not only the global model but also each submodel, further reducing overhead.

Since directly applying weighted aggregation to submodels introduces aggregation bias, we also propose a fine-grained submodel weighted aggregation algorithm. Its core idea is to weight submodel updates according to each terminal's features, eliminating the aggregation bias caused by the misalignment of different terminals' submodels and by uneven data distribution, while guaranteeing the convergence of the global model.
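One way to realize such fine-grained aggregation (a sketch under our own conventions, not necessarily the exact published algorithm) is to normalize each embedding row only over the clients whose submodels actually contained it, instead of dividing by the total number of clients:

```python
import numpy as np

def aggregate_submodels(vocab_size, dim, updates):
    """Per-row weighted aggregation of submodel updates.

    Each row is averaged only over the clients whose submodel contained it,
    weighted by their sample counts; rows no client touched stay zero."""
    total = np.zeros((vocab_size, dim))
    weight = np.zeros(vocab_size)
    for ids, delta, n_samples in updates:
        total[ids] += n_samples * delta
        weight[ids] += n_samples
    touched = weight > 0
    total[touched] /= weight[touched, None]
    return total

updates = [
    (np.array([0, 2]), np.ones((2, 4)), 10),       # client A: rows 0 and 2
    (np.array([2]), 3.0 * np.ones((1, 4)), 30),    # client B: row 2 only
]
agg = aggregate_submodels(vocab_size=5, dim=4, updates=updates)
```

Row 2 is averaged over both clients, (10*1 + 30*3)/40 = 2.5, while row 0 is averaged over client A alone; dividing by all clients would have wrongly shrunk both.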

In addition, when a terminal downloads a submodel and uploads its update, it discloses the position of that submodel, and hence its data, to the untrusted coordination server, defeating the original purpose of federated learning. We therefore design a submodel privacy protection mechanism that organically combines secure multi-party set union computation, randomized response, and secure aggregation, enabling a terminal to obfuscate the true position of its submodel while still pulling and aggregating submodels. The protocol gives the terminal deniability about its submodel's true position, and the strength of this deniability can be precisely measured with local differential privacy. Each terminal can also tune the parameters of its randomized response to adjust the degree of privacy protection, striking a good balance between privacy and utility.
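The randomized-response component can be illustrated on a single bit, say whether a terminal's submodel contains a given item. This sketch uses the textbook mechanism with the local differential privacy budget epsilon as the tunable parameter; the set-union computation and secure aggregation that the actual protocol combines it with are omitted:

```python
import math
import random

def randomized_response(true_bit, epsilon, rng):
    """Report the true bit with probability e^eps/(1+e^eps), else flip it.

    For a single bit this satisfies eps-local differential privacy:
    smaller eps means more flipping and stronger deniability."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if rng.random() < p else 1 - true_bit

def debias(reported_mean, epsilon):
    """Server-side unbiased estimate of the true frequency of 1-bits."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return (reported_mean - (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(0)
true_bits = [1] * 300 + [0] * 700     # 30% of clients truly hold the item
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = debias(sum(reports) / len(reports), epsilon=1.0)
```

Each individual report is deniable, yet the server can still recover the aggregate frequency, which is exactly the privacy-utility trade-off the text describes.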

Our federated submodel learning approach has received considerable attention from peers; in a review co-authored by 24 professors from top universities, it is identified as a promising research direction (see Section 4.4.4 of "Advances and Open Problems in Federated Learning").

These results opened the door to a research direction, and with it a series of problems to be solved. First, local data in federated learning is not independent and identically distributed: each terminal's model update is biased toward its own local optimum, so the aggregated model deviates from the global optimum. To address this, researchers at Google Research and their collaborators proposed the stochastic controlled averaging algorithm SCAFFOLD (see "SCAFFOLD: Stochastic Controlled Averaging for Federated Learning", ICML 2020), which uses previous-round gradients to estimate the global descent direction and correct local updates, ensuring the convergence of the model.
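The control-variate idea behind SCAFFOLD can be sketched in a simplified setting (full participation, quadratic client objectives, our own variable names): each client corrects its local gradient by the difference between the global and its local control variate, and both are refreshed after the round:

```python
import numpy as np

def scaffold_client(w_global, c_global, c_local, grad_fn, lr=0.1, steps=10):
    """Local update: the correction (c_global - c_local) counters client drift."""
    w = w_global.copy()
    for _ in range(steps):
        w -= lr * (grad_fn(w) - c_local + c_global)
    # "Option II" refresh of the local control variate from the SCAFFOLD paper.
    c_new = c_local - c_global + (w_global - w) / (steps * lr)
    return w, c_new

# Two clients with heterogeneous local optima a_i; the global optimum is their mean.
a = [np.array([4.0]), np.array([-2.0])]
grad_fns = [lambda w, ai=ai: w - ai for ai in a]   # grad of 0.5*(w - a_i)^2

w, c_global = np.zeros(1), np.zeros(1)
c_locals = [np.zeros(1), np.zeros(1)]
for _ in range(60):
    results = [scaffold_client(w, c_global, c, g)
               for g, c in zip(grad_fns, c_locals)]
    new_ws, new_cs = zip(*results)
    c_global = c_global + np.mean(
        [nc - oc for nc, oc in zip(new_cs, c_locals)], axis=0)
    c_locals = list(new_cs)
    w = np.mean(new_ws, axis=0)
```

Despite each client pulling toward its own optimum (4 or -2), the corrected updates drive the global model to their mean, 1.0.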

Considering the differences in data availability caused by mobile devices going intermittently online and offline, we propose an algorithm called federated latest averaging to prevent global model aggregation from being biased toward highly available terminals. The core idea is to give every terminal a chance of being selected in each round of joint training, prioritizing the terminals that have been absent longest, and, when a selected terminal is offline, to reuse the gradient from its most recent submission. This form of update avoids drift in the global model; in essence, the method simulates asynchronous gradient aggregation to eliminate the aggregation bias caused by intermittent terminal availability. We also theoretically prove the convergence of federated latest averaging under non-IID data and dynamic availability. Experimental results show that the algorithm improves model accuracy by 5% over federated averaging (see Figure 5).
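The mechanism can be sketched as follows (illustrative names, toy quadratic objectives, availability simulated as Bernoulli online-ness): the server caches the latest gradient of every client, prioritizes the longest-absent clients each round, and averages over all cached gradients rather than only this round's:

```python
import numpy as np

class Client:
    """Toy client: objective 0.5*(w - a)^2, intermittently online."""
    def __init__(self, cid, a, p_online, seed):
        self.cid, self.a, self.p = cid, a, p_online
        self.rng = np.random.default_rng(seed)
    def online(self):
        return self.rng.random() < self.p
    def gradient(self, w):
        return w - self.a

def fedlaavg_round(w, clients, cache, last_seen, t, k, lr=0.3):
    # Prioritize the k clients that have been absent longest.
    picked = sorted(clients, key=lambda c: last_seen[c.cid])[:k]
    for c in picked:
        if c.online():                    # offline clients keep a stale gradient
            cache[c.cid] = c.gradient(w)
            last_seen[c.cid] = t
    # Average the latest known gradient of EVERY client, not just this round's.
    g = np.mean([cache[c.cid] for c in clients], axis=0)
    return w - lr * g

# Ten clients with local optima 0..9 (global optimum 4.5); half are rarely online.
clients = [Client(i, float(i), 0.9 if i < 5 else 0.3, seed=i) for i in range(10)]
cache = {c.cid: c.gradient(np.zeros(1)) for c in clients}
last_seen = {c.cid: -1 for c in clients}
w = np.zeros(1)
for t in range(300):
    w = fedlaavg_round(w, clients, cache, last_seen, t, k=4)
```

Because every client's latest gradient enters every average, the model converges near the true global optimum (4.5) instead of drifting toward the highly available clients' optima.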


Figure 5 Model accuracy

We also wanted to quantify each terminal's contribution to training the federated model, so as to enhance the transparency and interpretability of the federated learning system. We adopt the leave-one-out idea to measure a terminal's influence on the global model: the difference in the global model's performance with and without that terminal's participation. Since leave-one-out retraining is prohibitively expensive, we propose an estimation method based on a first-order approximation and the chain rule that avoids retraining, and design a Hessian approximation algorithm based on Fisher information to further reduce computation. In addition, to reduce estimation error on non-convex learning objectives, we propose a hierarchical numerical check-and-truncation scheme over the model parameters. With this contribution metric, a terminal's aggregation weight can be adjusted dynamically: raising the weight of high-contribution terminals, lowering that of low-contribution terminals, and even excluding saboteurs. Experimental results show that this method effectively improves the accuracy of the global model (see Figure 6).
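The retraining-free estimate can be illustrated on linear least squares, where everything is available in closed form (an illustrative sketch of the first-order idea only, not our full algorithm with its Fisher-based Hessian approximation): the predicted validation-loss change from removing a client is the inner product of the validation gradient with the approximate parameter shift obtained through the inverse Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0, 2.0])

# Five clients; client 4 holds corrupted labels (a constant offset of +5).
clients = []
for i in range(5):
    X_i = rng.normal(size=(40, 3))
    y_i = X_i @ w_true + 0.1 * rng.normal(size=40) + (5.0 if i == 4 else 0.0)
    clients.append((X_i, y_i))
X_val = rng.normal(size=(100, 3))
y_val = X_val @ w_true

# Global model trained on all clients' pooled data (closed-form least squares).
X = np.vstack([c[0] for c in clients])
y = np.concatenate([c[1] for c in clients])
w = np.linalg.lstsq(X, y, rcond=None)[0]
n = len(y)

H_inv = np.linalg.inv(2 * X.T @ X / n)                   # inverse Hessian
g_val = 2 * X_val.T @ (X_val @ w - y_val) / len(y_val)   # validation gradient

est = []   # predicted validation-loss change if each client were removed
for X_i, y_i in clients:
    g_i = 2 * X_i.T @ (X_i @ w - y_i) / len(y_i)
    dw = (len(y_i) / (n - len(y_i))) * H_inv @ g_i       # first-order shift
    est.append(float(g_val @ dw))
```

The corrupted client receives the most negative score, meaning removing it is predicted to lower validation loss, which is exactly the signal used to down-weight or exclude harmful terminals.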


Figure 6 Accuracy of the global model

Another obstacle to on-device training is that each device holds few samples, which easily leads to small-sample overfitting. Here we can exploit the advantages of device-cloud collaboration: with the cloud as the coordination server, augment a terminal's local data via domain transfer with samples whose feature distribution is similar, avoiding small-sample overfitting while preserving the personalization of the terminal's local model. The basic idea is to enrich each terminal's local dataset by filtering, from the global dataset on the cloud, samples similar to the local data distribution. Technically, we mainly adopt domain adaptation: the specified terminal's local dataset serves as the target domain and other terminals' data on the cloud as the source domain; the model is incrementally trained with source-domain data, its accuracy is evaluated on the target domain, and the samples that improve accuracy are selected to enrich the local data. Compared with cloud-based machine learning, this domain-transfer data augmentation reduces the distribution bias between training and test data; compared with purely local on-device training, enriching the data with similar samples effectively reduces generalization error.
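The sample-selection loop can be sketched in a toy linear version (hypothetical names; the target and source concepts differ, and a candidate source sample is kept only when one incremental training step on it lowers target-domain loss):

```python
import numpy as np

def select_augmentation(w, X_val, y_val, pool, lr=0.05):
    """Keep the source samples whose incremental step improves target accuracy."""
    base = np.mean((X_val @ w - y_val) ** 2)
    kept = []
    for i, (x, y) in enumerate(pool):
        w_try = w - lr * 2 * x * (x @ w - y)      # one SGD step on the candidate
        if np.mean((X_val @ w_try - y_val) ** 2) < base:
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
w_t = np.array([1.0, 2.0])     # target-domain concept
w_s = np.array([-4.0, 3.0])    # mismatched source-domain concept

# A local model trained on only three noisy target samples (small-sample regime).
X_tr = rng.normal(size=(3, 2))
y_tr = X_tr @ w_t + 0.05 * rng.normal(size=3)
w = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

X_val = rng.normal(size=(50, 2))
y_val = X_val @ w_t            # target-domain evaluation data

# Source pool: indices 0-19 follow the target concept, 20-39 do not.
pool = []
for i in range(40):
    x = rng.normal(size=2)
    pool.append((x, x @ (w_t if i < 20 else w_s)))

kept = select_augmentation(w, X_val, y_val, pool)
```

The filter keeps almost all same-distribution samples and rejects almost all mismatched ones, which is the enrichment behavior the scheme relies on.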

3 Device-cloud collaborative intelligent system

In recent years, we have worked closely with Alibaba's mobile Taobao team to integrate the above algorithms into its device-cloud collaborative intelligent system, Walle. The name comes from the movie character of the same name, reflecting the hope of mining the treasures buried in the massive data scattered across devices and providing users with higher-quality services. Walle supports dozens of Alibaba services and is called more than 200 billion times a day at peak. The system comprises a compute container, a data pipeline, and a deployment platform, supporting close device-cloud collaboration across all stages of a machine learning task (pre-processing, model execution, and post-processing). The compute container provides a cross-platform, high-performance execution environment for mobile devices and cloud servers and supports rapid iteration of machine learning tasks in mobile apps; the data pipeline serves the pre-processing stage, providing feature or sample input for machine learning tasks and supporting the seamless flow of data between device and cloud; and the deployment platform dispatches tasks to the device side and the cloud side, ensuring their timely deployment and completion.

During the 2019 "Double 11" shopping festival, the Walle intelligent system was deployed at scale in mobile Taobao, covering scenarios such as main search, feed recommendation, cloud themes, venues, smart push, red-envelope rain, promotions, and live streaming, executing a total of 223.5 billion calls in a single day, significantly increasing gross merchandise volume (GMV) and improving the interactive experience. Beyond mobile Taobao, Walle has also been deployed in apps such as Xianyu, Youku, Maoke, AE, CBU, and Retailtong.

We have since added three main groups of modules to the Walle framework (see Figure 7): user-granularity sample and task management on the cloud; personalized channels for sample delivery and task release; and, on the device side, sample filtering and lifecycle management together with model training, inference, and versioning. This builds a device-cloud collaboration link centered on data and model management, cloud-side sample distribution, and on-device training, realizing the three main functions of data collection, data augmentation, and personalized training, and fundamentally supporting the idea of "a thousand models for a thousand users".


Figure 7 End-Channel-Cloud System Module

Furthermore, we have brought federated submodel learning to low-power embedded devices, including the Raspberry Pi and the Nvidia Jetson Nano and TX2, which can be mounted on unmanned vehicles, drones, unmanned boats, and similar platforms to achieve large-scale distributed edge learning.

4 Concluding remarks

The above is our initial exploration of device-cloud collaborative federated learning, and many challenges remain. First, how to design elastic model structures that dynamically adapt to the runtime environments of heterogeneous terminal devices. Second, how to design distributed optimization algorithms that better eliminate the model aggregation bias caused by data heterogeneity. Third, how to resist attacks from malicious terminals so that device-cloud co-evolution remains stable and reliable. Fourth, we look forward to an independent, controllable, open-source development environment to drive the rapid development and large-scale deployment of device-cloud collaborative intelligent systems.

(References omitted)


Wu Fan

Dean and distinguished professor of the Department of Computer Science and Engineering at Shanghai Jiao Tong University. He has led more than 20 major projects, including the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" major program, the National Key R&D Program, grants from the National Natural Science Foundation of China, and projects of the Shanghai Municipal Science and Technology Commission, and has published more than 200 academic papers. He has won the first prize of the Natural Science Award of the Ministry of Education, the first prize of the Science and Technology Progress Award of the China Computer Federation, the first prize of the Natural Science Award of the Shanghai Computer Federation, the ACM China Rising Star Award, the CCF-IEEE Young Scientist Award, and paper awards at seven international academic conferences.

Excerpt from "Newsletter of Chinese Society of Artificial Intelligence"

Vol. 14, No. 2, 2024

Special topics on the frontiers of science and technology
