Over the past year, Alipay's commercial advertising algorithms have moved from unified modeling across all scenarios to separate modeling within individual scenarios. The work covers cold-start optimization for new scenarios, knowledge transfer learning, and real-time optimization of the system, and it has delivered significant business gains. This article walks through the algorithm optimizations behind those results.

1. Background

As Alipay's advertising business grows, new traffic scenarios are continually being connected. Figure 1 shows the product style of one advertising scenario in the app.

Alipay commercial advertising algorithm optimization

Figure 1: The style of a commercial advertising product of Alipay

Bringing a new traffic scenario online poses business challenges at each stage: a) early after launch, the scenario lacks user-feedback samples, so it faces a cold-start problem; b) in the middle stage, ad click-through rate is low, which blocks further traffic expansion (the placement is allowed more traffic only after CTR passes a certain threshold); c) in the later stage, the pressure shifts to raising ad CPM and business revenue. We made algorithm optimizations for each stage, in three areas: cold-start optimization for new scenarios based on sample augmentation; CTR improvement based on cross-scenario knowledge transfer and knowledge transfer across user tiers; and optimizations that improve the timeliness of the advertising system.

After several rounds of iteration, the absolute AUC of the scenario's CTR model increased by 0.10. In relative terms, CTR3 rose by more than 100% and CPM3 by more than 200%, a result of the joint efforts of the operations, product, and technical teams. Figures 2 and 3 show how the technical and business indicators changed over time. To protect data security, all business indicators in this article are desensitized and their absolute values are hidden.

Figure 2: AUC & PCOC for commercial advertising

Figure 3: Commercial advertising business metrics CTR3 & CPM3

2. Cold-start optimization for new scenarios

Before June 2023, one of Alipay's recommendation cards carried pure recommendation traffic. After June, slots in that card began to serve ad creatives, so ads and recommended content were displayed together. At launch the scenario had almost no ad samples and therefore faced a cold-start problem.

The recommendation system had already accumulated a large volume of samples, so the natural idea was to add the organic recommendation samples from the same placement to the ad training samples and use this recommendation-domain data to ease the cold start of the new ad scenario. Using the tracking logs of the online recommendation and advertising systems, we join the two systems' sample data on trace_id, align the user-side features with those of the advertising system, and build the augmented samples as shown in Figure 4. Note that recommended content and ad creatives are heterogeneous and do not overlap: recommended content cannot be mapped to ads. Because ad-side features and recommendation content-side features cannot currently be aligned, the ad_id field takes the recommendation content_id and all other ad-side features are padded with 0.
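The augmentation step described above can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the log layout, the user feature names, and the `AD_SIDE_FEATURES` list are all hypothetical, and only the three rules stated in the text (join on trace_id, ad_id takes content_id, other ad-side features padded with 0) come from the article.

```python
from typing import Dict, List

# Hypothetical ad-side feature names; the real schema is not given in the article.
AD_SIDE_FEATURES = ["ad_industry", "ad_bid_type", "ad_creative_type"]


def build_augmented_samples(
    rec_logs: List[Dict], user_features_by_trace: Dict[str, Dict]
) -> List[Dict]:
    """Turn recommendation exposures into ad-format training samples."""
    samples = []
    for log in rec_logs:
        user_feats = user_features_by_trace.get(log["trace_id"])
        if user_feats is None:
            continue  # no aligned user features for this request, so skip it
        sample = dict(user_feats)            # user-side features in the ad schema
        sample["ad_id"] = log["content_id"]  # content_id stands in for ad_id
        for name in AD_SIDE_FEATURES:
            sample[name] = 0                 # unalignable ad-side features: pad with 0
        sample["label"] = log["click"]
        sample["source"] = "recommendation"  # lets training tell the domains apart
        samples.append(sample)
    return samples


rec_logs = [
    {"trace_id": "t1", "content_id": "c_100", "click": 1},
    {"trace_id": "t2", "content_id": "c_200", "click": 0},
    {"trace_id": "t3", "content_id": "c_300", "click": 0},  # no matching trace
]
user_features_by_trace = {
    "t1": {"user_age_bucket": 3, "user_city_tier": 1},
    "t2": {"user_age_bucket": 5, "user_city_tier": 2},
}
augmented = build_augmented_samples(rec_logs, user_features_by_trace)
```

Keeping a `source` marker on each row is an assumption of this sketch; it makes it easy to down-weight or drop the recommendation samples later, once in-scenario ad samples become plentiful.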

When the sample augmentation strategy was first introduced, offline AUC improved by 0.01, a significant gain, and this model became the baseline for the scenario.

As ad samples accumulated in the scenario, the benefit of augmenting with organic recommendation samples gradually weakened. One of our future directions is to bridge content understanding between ad units and recommended content: mining generalizable features that transfer between the two, such as image/text multimodal features, industry category taxonomy features, and entity features, and jointly modeling the recommendation and advertising scenarios. There is still a lot of room to explore here.

Figure 4: Processing logic for augmenting with organic recommendation samples

3. Knowledge transfer learning

3.1 Cross-scenario knowledge transfer

Alipay has many advertising scenarios, and for a long time we modeled them together with multi-scenario/multi-task unified models. Representative industry solutions for multi-scenario/multi-task unified modeling include SharedBottom, MMOE, PLE, ESMM, STAR, PPNet, M2M, and APG [1-4], as shown in Figure 5. Unified modeling has clear advantages: data from multiple scenarios augments each other, which eases sample sparsity in small scenarios and improves generalization, and a single model is easier to maintain and optimize. It also has clear disadvantages: data distributions differ widely between scenarios, so gradient conflicts and the seesaw phenomenon can arise across scenarios. For large scenarios with abundant samples in particular, a model trained for that scenario alone may outperform the unified model.

Figure 5: Representative industry schemes for multi-scenario/multi-task modeling [4]

Some time after launch, the new scenario had accumulated a large volume of samples (tens of millions of exposure PV samples per day) and had become the top traffic entry point for Alipay advertising, which made scenario-specific modeling feasible. In addition, ad supply and user behavior in this scenario differ markedly from other scenarios, which is a further reason to model it separately.

For single-scenario modeling, we explored the following schemes:

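V0: Multi-scenario unified modeling, used as the baseline.

V1: Uses only in-scenario samples; network parameters are randomly initialized before training.

V2: Based on the pretrain -> finetune paradigm. The trained unified model serves as the pre-trained model; fine-tuning loads its network parameters and trains only part of them (partial fine-tune: the embedding layer is frozen and only the MLP layers are trained). The fine-tune stage also uses only in-scenario samples.

V3: Based on the pretrain -> finetune paradigm. The trained unified model serves as the pre-trained model; fine-tuning loads its network parameters and trains all of them (full fine-tune). The fine-tune stage also uses only in-scenario samples.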
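The difference between V2 and V3 comes down to which parameters receive gradient updates after the pre-trained weights are loaded. The toy sketch below illustrates this with a one-dimensional embedding feeding a single dense unit; the model, the feature ids, and the hyperparameters are all made up for illustration and bear no relation to the production network.

```python
import math
from typing import Dict, List, Tuple

Sample = Tuple[str, int]  # (feature id, click label)


def predict(emb: Dict[str, float], w: float, b: float, feat: str) -> float:
    """One-dimensional embedding -> single dense unit -> sigmoid."""
    return 1.0 / (1.0 + math.exp(-(w * emb[feat] + b)))


def finetune(pretrained: Dict, samples: List[Sample], freeze_embedding: bool,
             lr: float = 0.1, epochs: int = 5) -> Dict:
    """Fine-tune on in-scenario samples, starting from pre-trained parameters."""
    emb = dict(pretrained["emb"])  # copy so the pre-trained model stays untouched
    w, b = pretrained["w"], pretrained["b"]
    for _ in range(epochs):
        for feat, label in samples:
            g = predict(emb, w, b, feat) - label  # d(logloss) / d(logit)
            grad_w, grad_b, grad_e = g * emb[feat], g, g * w
            w -= lr * grad_w                      # dense layer is always trained
            b -= lr * grad_b
            if not freeze_embedding:              # V3: embeddings are trained too
                emb[feat] -= lr * grad_e
    return {"emb": emb, "w": w, "b": b}


# Stand-in for the multi-scenario unified model's parameters.
pretrained = {"emb": {"u1": 0.5, "u2": -0.3}, "w": 1.0, "b": 0.0}
scenario_samples = [("u1", 1), ("u2", 0), ("u1", 1), ("u2", 1)]

v2 = finetune(pretrained, scenario_samples, freeze_embedding=True)   # partial
v3 = finetune(pretrained, scenario_samples, freeze_embedding=False)  # full
```

In V2 the embedding table comes out identical to the pre-trained one, while in V3 it has moved toward the in-scenario data.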

Figure 6 shows the improvement in offline AUC: a) V1, a single-scenario model trained only on in-scenario samples, gains AUC +0.004 over multi-scenario unified modeling; b) for multi-scenario knowledge transfer under the pretrain -> finetune paradigm, training all parameters in the fine-tune stage works better than training only part of them (with the embedding layer frozen). Our experiments lead to two conclusions. First, multi-scenario unified modeling is not a panacea: a dedicated model for a large scenario matches the in-domain sample distribution better and may outperform unified modeling. Second, knowledge transfer between scenarios helps: loading the pre-trained parameters of the multi-scenario unified model lets the single-scenario model also learn knowledge shared across scenarios, which improves its performance.

We adopted the V3 full fine-tune solution, shown in Figure 7. In the online A/B experiment, the advertising business results were: CPM3 +2.31%, CPM1 +1.76%, CTR3 +1.99%.

Figure 6: Improvement of technical indicators for cross-scenario knowledge transfer

Figure 7: Cross-scenario knowledge transfer (full fine-tune)

3.2 Knowledge transfer across user tiers

To address poor prediction performance on cold-start ad units, we had earlier built on the Model-Agnostic Meta-Learning (MAML) [5] framework to transfer knowledge learned on non-cold-start units to cold-start units, with good business results. Meta-learning treats the task as the unit of training: it learns prior information across different tasks, so that even a task with few samples can use that prior to get better results. Meta-learning methods are usually divided into metric-based, model-based, and optimization-based paradigms.

Metric-based meta-learning learns a representation for each class and classifies by distance to those class representations. In search, recommendation, and advertising, however, most features are high-dimensional and sparse, and clicks or non-clicks cannot simply be summarized as a single class for a distance-based method. Model-based meta-learning mainly uses a deep network to simulate an optimizer, but it has to learn extra network parameters and is hard to train. Optimization-based algorithms such as MAML have become a widely used paradigm because they are model-agnostic and easy to implement, and many meta-learning algorithms use MAML to learn a better parameter initialization for new tasks.

From the standpoint of mathematical derivation and Bayesian probability, the initialization parameters obtained by meta-training act as a good prior for new tasks. The approach is flexible and can handle task imbalance. It also copes well with out-of-distribution (OOD) tasks, thanks to its two-stage structure: the inner loop learns task-specific parameters, and the outer loop aggregates over tasks to learn the knowledge they share. For these reasons, MAML-based approaches are currently fairly mainstream in search, recommendation, and advertising.

How tasks are divided directly affects a MAML-based model. A good division should have two properties: a) cohesion within a task: if the samples in one task differ too much in distribution, learning becomes harder and generalization suffers; b) correlation between tasks: if old tasks are unrelated or not transferable to a new one, the personalized model for the new task will struggle to converge quickly. In our earlier CVR model we explored MAML task partitioning by ad unit, by ad industry category, and by clustering. In the CTR model, differences in data distribution between user groups are more pronounced, so we first tried dividing MAML tasks by user tier. The sample organization is shown in Figure 8, and Figure 9 shows how parameters are updated in the MAML inner and outer loops.
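To make the inner and outer loops concrete, here is a small sketch with one task per user tier. It uses the first-order approximation of MAML (the outer gradient is taken at the adapted parameters and second-order terms are dropped), which is a simplification chosen for brevity and not necessarily what the production model does. The logistic-regression model, the tier names, and the data are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Example = Tuple[List[float], int]          # (feature vector, click label)
Task = Tuple[List[Example], List[Example]]  # (support set, query set)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logloss(theta: List[float], batch: List[Example]) -> float:
    total = 0.0
    for x, y in batch:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(batch)


def grad_logloss(theta: List[float], batch: List[Example]) -> List[float]:
    grad = [0.0] * len(theta)
    for x, y in batch:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for i, xi in enumerate(x):
            grad[i] += (p - y) * xi / len(batch)
    return grad


def adapt(theta: List[float], support: List[Example], inner_lr: float) -> List[float]:
    """Inner loop: one gradient step on a task's support set."""
    g = grad_logloss(theta, support)
    return [t - inner_lr * gi for t, gi in zip(theta, g)]


def maml_step(theta: List[float], tasks: Dict[str, Task],
              inner_lr: float = 0.5, outer_lr: float = 0.5) -> List[float]:
    """Outer loop: average the query-set gradients taken at the adapted parameters."""
    meta_grad = [0.0] * len(theta)
    for support, query in tasks.values():
        theta_task = adapt(theta, support, inner_lr)
        g_q = grad_logloss(theta_task, query)  # first-order approximation
        meta_grad = [m + g / len(tasks) for m, g in zip(meta_grad, g_q)]
    return [t - outer_lr * m for t, m in zip(theta, meta_grad)]


def adapted_query_loss(theta: List[float], tasks: Dict[str, Task],
                       inner_lr: float = 0.5) -> float:
    losses = [logloss(adapt(theta, s, inner_lr), q) for s, q in tasks.values()]
    return sum(losses) / len(losses)


# One task per (hypothetical) user tier; features are [bias, one signal].
tasks: Dict[str, Task] = {
    "high_activity": ([([1.0, 1.0], 1), ([1.0, -1.0], 0)],
                      [([1.0, 0.5], 1), ([1.0, -0.5], 0)]),
    "low_activity": ([([1.0, 1.0], 1), ([1.0, -1.0], 0), ([1.0, 0.2], 0)],
                     [([1.0, 1.5], 1), ([1.0, 0.1], 0)]),
}

theta = [0.0, 0.0]
loss_before = adapted_query_loss(theta, tasks)
for _ in range(20):
    theta = maml_step(theta, tasks)
loss_after = adapted_query_loss(theta, tasks)
```

The quantity being minimized is the query loss after adaptation, which is why `adapted_query_loss` is the natural thing to compare before and after meta-training.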

MAML modeling at user-tier granularity accounts for distribution differences between user groups while still learning the knowledge they share. It raised the AUC of the commercial advertising scenario's CTR model by 0.003. Online business results: CPM1 +2.28%, CPM3 +2.20%, CTR3 +3.68%.

Figure 8: MAML's Task-based sample organization (one batch)

Figure 9: How parameters are updated for the MAML inner and outer loop model

4. Optimize the timeliness of the system

We upgraded both feature timeliness and model timeliness. Cumulatively, the AUC of the in-scenario fine-ranking CTR model increased by 0.012, CPM3 by 7.94%, and CTR3 by 8.89%.

4.1 Real-time features

Real-time features reflect a user's immediate interests and preferences. Bringing them in raises three questions: a) Which features should be made real-time, and how do we identify them? b) How do we evaluate whether a real-time feature is effective? c) What pipeline is used to develop and launch real-time features?

Real-time feature selection based on offline feature importance. There are many offline user-side sequence features, and we need to decide which of them are likely to add value if made real-time. We therefore start by evaluating the importance of the offline features. In tree models, importance is easy to obtain from information gain, but in a DNN it is less direct. One approach is to randomly perturb the input of one feature at a time at inference and observe the change in loss or AUC: the smaller the change, the less important the feature. Another is to look at the weights of the first network layer connected to each feature input: the larger the weights, the more important the feature (it is best to apply batchnorm to the inputs to remove scale differences between features). The first method is computationally expensive, so we chose the second, which is relatively simple to compute. Figure 10 shows the importance ranking for some features. At the start of the project we used this evaluation to select a set of offline features to make real-time.
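The two importance estimates described above can be sketched side by side. The tiny network, the feature names, and the data below are invented for illustration; in particular, scoring a feature by the mean absolute value of its first-layer weights is one plausible reading of "the larger the weights, the more important the feature", not a confirmed detail of the production method.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

Row = List[float]


def logloss(predict: Callable[[Row], float], rows: Sequence[Row],
            labels: Sequence[int]) -> float:
    eps = 1e-12
    total = 0.0
    for row, y in zip(rows, labels):
        p = min(max(predict(row), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(rows)


def permutation_importance(predict: Callable[[Row], float], rows: Sequence[Row],
                           labels: Sequence[int], names: Sequence[str],
                           seed: int = 0) -> Dict[str, float]:
    """Method 1: shuffle one feature at a time and measure the change in loss."""
    rng = random.Random(seed)
    base = logloss(predict, rows, labels)
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        scores[name] = logloss(predict, shuffled, labels) - base
    return scores


def first_layer_importance(first_layer: Sequence[Sequence[float]],
                           names: Sequence[str]) -> Dict[str, float]:
    """Method 2: mean absolute first-layer weight per input feature.

    first_layer[j] holds the weights from input feature j to every hidden unit.
    Inputs are assumed to be normalized so that weights are comparable.
    """
    return {name: sum(abs(w) for w in ws) / len(ws)
            for name, ws in zip(names, first_layer)}


# Hypothetical two-feature model: the second feature has all-zero weights.
names = ["recent_click_count", "device_brightness"]
first_layer = [[1.0, -1.0], [0.0, 0.0]]


def toy_predict(row: Row) -> float:
    hidden = [sum(x * first_layer[i][j] for i, x in enumerate(row)) for j in range(2)]
    return 1.0 / (1.0 + math.exp(-(hidden[0] - hidden[1])))


rows = [[1.0, 0.3], [1.0, -0.7], [1.0, 0.9], [1.0, 0.1],
        [-1.0, 0.5], [-1.0, -0.2], [-1.0, 0.8], [-1.0, -0.4]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

perm_scores = permutation_importance(toy_predict, rows, labels, names)
weight_scores = first_layer_importance(first_layer, names)
```

Method 1 needs one extra evaluation pass per feature, which is where its cost comes from; method 2 only reads weights that already exist in the trained model.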

Figure 10: Ranking of the importance of some features

Real-time feature effectiveness evaluation based on simulated data. If every iteration required launching the real-time feature online, collecting online logs, building offline samples, and only then evaluating the feature offline, the evaluation loop would be slow and labor-intensive. To solve this, the real-time data team we work with used the feature simulation capability of Ant Group's AI Studio to simulate real-time feature snapshots and built a real-time data simulation system for us. The system can return the real-time feature values as of a given request time, which lets us evaluate real-time features offline.
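The core idea of such a simulation, returning the feature value "as of" a request time, can be illustrated with a point-in-time lookup. The sketch below is generic and does not reflect the AI Studio API, which the article does not describe; the class name, timestamps, and feature values are hypothetical.

```python
import bisect
from typing import Any, List, Optional


class FeatureTimeline:
    """History of (timestamp, value) updates for one user's real-time feature."""

    def __init__(self) -> None:
        self._times: List[int] = []
        self._values: List[Any] = []

    def record(self, ts: int, value: Any) -> None:
        # Assumes updates are recorded in non-decreasing timestamp order.
        self._times.append(ts)
        self._values.append(value)

    def as_of(self, request_ts: int) -> Optional[Any]:
        """Return the latest value written at or before request_ts, else None."""
        idx = bisect.bisect_right(self._times, request_ts)
        return self._values[idx - 1] if idx > 0 else None


# Simulated "recently clicked ads" sequence for one user.
timeline = FeatureTimeline()
timeline.record(100, ["ad_1"])
timeline.record(160, ["ad_1", "ad_7"])

snapshot_early = timeline.as_of(50)     # before any update
snapshot_mid = timeline.as_of(150)      # sees only the first update
snapshot_exact = timeline.as_of(160)    # an update at the request time is visible
```

Looking up by request time rather than by the latest value keeps the offline sample from seeing behavior that happened after the request, which would otherwise leak future information into the evaluation.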

Figure 11 shows the development link for real-time advertising data. The pipeline for developing and launching real-time features relies on the complete engineering support of Frigate and Arec inside Ant Group. Figure 12 shows the interaction between the systems involved, including feature backflow, sample processing, model training, and online serving.

Figure 11: Advertising real-time data link

Figure 12: Advertising multi-system interaction (features, samples, models, online services)

Mining and upgrading real-time features in the advertising system brought sizable business gains:

i) Fine-ranking CTR model, overall results: CPM1 +1.53%, CPM3 +1.61%, CTR3 +3.09%.
ii) Coarse-ranking model, overall results: CPM1 +1.50%, CPM3 +1.24%.

4.2 Model timeliness: Online Learning

4.2.1 Background

In large search and recommendation systems, traffic structure, the distribution of user behavior data, and supply all change over time, as shown in Figure 13. Offline models trained on a daily cadence struggle to keep up with these changes in the online environment. We therefore want the online model not only to use past data well, but also to respond quickly to system changes and make good use of every user feedback sample.

At different times of the day, the distribution of traffic structure in the scene changes.
The distribution of user behavior data changes at different times of the day.
The distribution of supply varies at different times of the day.

To cope with this dynamic online environment, we upgraded the scenario's CTR model to online learning (referred to below as ODL), which uses real-time user feedback data to update model parameters in real time.

Figure 13: Scenario traffic structure, user behavior, and ad supply over time

4.2.2 ODL Model Optimization

Training data for an online learning task is fed into the training framework in real time as a sample stream. The advantage is timeliness: the online model can quickly capture changes in the scenario's data distribution. The disadvantage is that fluctuations in the current data distribution can destabilize the model, so streaming training is a double-edged sword.

After an initial period of optimization, the model's results were still unsatisfactory, and business indicators such as CTR and CPM were negative. Data analysis showed that ODL was overfitting and forgetting knowledge in this scenario, and that it generalized worse than the offline model; this became the main challenge of the model optimization. The same phenomenon is often encountered in online learning practice at Mobile Taobao, Alimama, and Alipay. Ref. [6] analyzes and reviews the catastrophic forgetting problem in continual learning. A specific case from the scenario: at 14:00 on December 3, 2023, evaluation on the current sample stream showed the stream AUC to be 0.01 higher than the offline model, so the model was exported and deployed online. In the period that followed (after the ODL model was updated), its online AUC was 0.01 lower than the offline model and its PCOC was higher, pointing to overfitting and weaker generalization, as shown in Figure 14.

Description: i. Judged by stream AUC, the online model was 0.01 better than the offline model, but after deployment its online AUC was 0.01 lower than the offline model; ii. The bias of the ODL model was consistently higher than that of the offline model.

Figure 14: The ODL model encountered a degradation in generalization performance at the beginning of the go-live
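For readers unfamiliar with the two indicators tracked in this case, the sketch below computes them from logged predictions and clicks. The article does not define PCOC; the code assumes the common definition (sum of predicted click probabilities divided by the number of observed clicks, so a value above 1 means over-prediction). The AUC here is the plain pairwise form, which is fine for a small example but quadratic in the number of samples.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random clicked sample outscores a random unclicked one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pcoc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Predicted clicks over observed clicks (assumed definition, see lead-in)."""
    return sum(scores) / sum(labels)


# Made-up logged predictions for five exposures.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]

example_auc = auc(labels, scores)    # 5 of 6 positive/negative pairs ordered correctly
example_pcoc = pcoc(labels, scores)  # slightly below 1: mild under-prediction
```

The two measure different things: AUC captures ranking quality only, while PCOC captures calibration, which is why a model can look better on one and worse on the other.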

To address poor generalization in the early stage of the ODL launch, we adopted the following optimization strategies:

ODL with frozen embedding parameters: the online model learns only the MLP-layer parameters. This limits the ODL model's learning capacity to some extent, but it effectively reduces knowledge forgetting and keeps the online model from drifting off course. Updating only part of the network online improved the generalization of the online model.
Sample replay strategy [7, 8]: addresses the out-of-distribution (OOD) problem. Machine learning assumes that samples are independent and identically distributed (i.i.d.). In streaming training, samples arrive in time order, context features (such as hour and day of week) cannot be fully shuffled, and the online traffic structure changes across time periods, so the streaming samples violate the i.i.d. assumption. Training and prediction samples then follow different distributions, which hurts generalization. To ease this, when building online samples we use a replay strategy: we randomly sample offline data from the past 7 days at a 1:7 ratio and merge it with the streaming data, so that the training distribution stays consistent with the prediction distribution. We also looked into Tencent's strategy, which samples from the past day and the current hour and mixes those samples with real-time samples at a certain ratio. For example, at 13:00 the real-time samples all come from around 13:00 and the model has already trained on samples before 13:00, but it knows nothing about the data distribution after 13:00, so samples from after 13:00 yesterday are selected and added. This replay policy is not supported by our current platform framework.
Model warm start: addresses drift during the training process. Prolonged online learning can cause parameter drift [7]. Besides loading the pre-trained offline model at the start to speed up convergence of the online model, the online link can also be warm-started during training by periodically restoring from the offline model. We increased the warm-start frequency from weekly to daily and found this to be the most helpful strategy for reducing PCOC.
ODL learning rate adjustment: using a lower learning rate than in offline training also effectively reduces overfitting.
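Of the strategies above, sample replay is the easiest to show in code. The sketch below mixes one streaming batch with randomly drawn historical samples before the training step. The function name and sample layout are hypothetical, and the direction of the 1:7 ratio is an assumption: the text does not say which side is 1 and which is 7, so the ratio is left as a parameter and set here to one replayed sample per seven streaming samples.

```python
import random
from typing import Dict, List, Sequence


def mix_with_replay(stream_batch: Sequence[Dict], history_pool: Sequence[Dict],
                    replay_per_stream: float, rng: random.Random) -> List[Dict]:
    """Blend one streaming batch with random draws from the historical pool."""
    n_replay = min(len(history_pool), round(len(stream_batch) * replay_per_stream))
    mixed = list(stream_batch) + rng.sample(list(history_pool), n_replay)
    rng.shuffle(mixed)  # avoid feeding all replayed samples in one block
    return mixed


rng = random.Random(42)

# Stand-ins for offline samples from the past 7 days and for one streaming batch.
history_pool = [{"source": "history", "hour": i % 24} for i in range(1000)]
stream_batch = [{"source": "stream", "hour": 13} for _ in range(70)]

mixed_batch = mix_with_replay(stream_batch, history_pool,
                              replay_per_stream=1 / 7, rng=rng)
n_replayed = sum(1 for s in mixed_batch if s["source"] == "history")
```

Because the historical pool spans all hours of the day, the replayed samples carry context-feature values (such as `hour`) that the current stream cannot contain, which is the point of the strategy.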

The development process of the ODL model is shown in Figure 15.

Figure 15: Scenario ODL model development go-live process

4.2.3 Effects

Technical indicators: ODL AUC +0.0076. For cold-start ad units, AUC +0.013 and PCOC -22.6%; for stable ad units, AUC +0.004 and PCOC -33.9%. The streaming technical indicators can also be viewed on the online learning model performance monitoring platform, as shown in Figure 16, where the PCOC of online learning is visibly lower.

ODL also brought a significant improvement in business results: CPM3 +3.55% and CTR3 +4.0% for the online learning model. CTR3 for cold-start ad units rose by 10.7%, so the gain from online learning is more pronounced on cold-start ad units.

Figure 16: Monitoring of streaming AUC and PCOC for online learning

5. Summary

Over the past year, we moved from unified modeling across all scenarios to independent modeling within individual scenarios. The work covered cold-start optimization for new scenarios, knowledge transfer learning, and real-time optimization of the system, and it delivered significant business gains.

The current models still have many open problems and room for optimization: a) joint modeling of the heterogeneous advertising and recommendation domains is still at an early stage, many content-understanding problems remain unsolved, and there is a lot of room to explore applications of large language models (LLMs); b) knowledge transfer learning is still underdeveloped for scenario-specific modeling; c) in online learning, catastrophic forgetting has not yet been well addressed. We will continue to work on these problems.

References

[1] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts, KDD'18.

[2] Progressive Layered Extraction (PLE): A Novel Multi-Task Learning (MTL) Model for Personalized Recommendations, RecSys'20.

[3] One Model to Serve All: Star Topology Adaptive Recommender for Multi-Domain CTR Prediction, CIKM'21.

[4] AdaSparse: Learning Adaptively Sparse Structures for Multi-Domain Click-Through Rate Prediction, CIKM'22.

[5] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML'17.

[6] A Comprehensive Survey of Forgetting in Deep Learning Beyond Continual Learning, 2023.

[7] Experience Replay for Continual Learning, NeurIPS'19.

[8] Out-of-Distribution Generalization via Risk Extrapolation (REx), ICML'21.

Author: Li Yang (prime)

Source-WeChat public account: Alipay Experience Technology

Source: https://mp.weixin.qq.com/s/3npXAjy5eeMNa20xoWAt9g
