
Hima Advertising Algorithm Optimization Practice (1): The Evolution of Advertising CVR Model


Business Background

Self-operated advertising is an important part of the Hima advertising ecosystem, covering many ad types: finance, e-commerce, app downloads, education, live broadcasts, albums, and more. The quality of the recommended ads determines both the user's ad experience in the Hima app and the platform's ad revenue. As online advertising has developed, advertisers have moved from the early CPM and CPC traffic auctions to today's mainstream oCPX, double bidding, NOBID, and similar mechanisms; traffic auctions keep getting more refined and intelligent, and bidding now has to account not only for CTR but also for CVR, deep CVR, and other factors. Common ad scenarios in the Hima app include the splash screen, patch ads, floating touch points, and the home-page "Guess you like" feed; the figures below show several of them.


Technical challenges

  1. Sparse ad samples. Unlike short-video and e-commerce apps, where users leave rich behavioral data, Hima users spend most of their time listening with the screen off, so interaction data between users and ads is much sparser.
  2. Complex ad scenarios. Ad inventory in the Hima app has pronounced long-tail characteristics: the home-page and floating touch placements account for a high share of exposures, while a large number of small and medium placements each contribute only a small share.
  3. Many conversion types. CVR varies widely across conversion types; for example, the CVR of download-activation ads is about 10 times that of form-submission ads.
  4. Unstable ad traffic. The self-operated platform competes for traffic with third-party DSPs and brand ads. Once brand campaigns occupy placements such as the splash screen or patch slots, volume shifts sharply: by our statistics, when brand ads take these placements, self-operated exposure falls to less than one-tenth of its usual level, which substantially changes the distribution of training samples.
  5. Ad cold start. Accurate estimation for new ads is critical for platform revenue and the advertiser experience. Hima's self-operated platform adds thousands of new ad plans and creatives every day, and it also supports OCPC campaigns that enter the second stage directly, which makes cold start even more challenging for the model.

Ad retrieval system architecture

The main job of the ad retrieval system is to pick the most suitable ads out of a massive ad library within a given RT (response time) budget and return them to the user. To balance effectiveness and performance, the system is a funnel-like multi-layer architecture. After filtering, recall, and coarse ranking, on the order of a hundred candidates enter the fine-ranking stage, where the ad CVR model estimates the probability of conversion after a user clicks an ad; the estimate feeds the fine-ranking score (eCPM) and serves as input to several ad strategies.

  1. Common conversion types: form submission, payment, album subscription, WeChat add, app install, etc.
  2. Fine-ranking score: eCPM = pCTR * pCVR * conversion bid * 1000
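As a quick illustration, the fine-ranking formula above can be written as a one-liner; the toy numbers below are for illustration only, and any calibration factors in the production system are not covered here:

```python
# Minimal sketch of the fine-ranking score: eCPM = pCTR * pCVR * conversion bid * 1000.
def ecpm(pctr: float, pcvr: float, conversion_bid: float) -> float:
    return pctr * pcvr * conversion_bid * 1000

# e.g. a 2% click rate, 5% conversion rate, and a 30-yuan conversion bid
score = ecpm(0.02, 0.05, 30.0)  # 0.02 * 0.05 * 30 * 1000 = 30.0
```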

Figure 1: Ad retrieval architecture

Review of the model evolution route

  1. From 2022 to 2023, the advertising algorithm team's technical evolution of the CVR model was mainly along two axes, timeliness and scale: (1) Better timeliness lets samples from new ads, new users, and new placements enter model training faster, improving the utilization of cold-start traffic. With day-level model updates, feedback data for a new ad had to wait more than 24 hours before entering training, causing large estimation deviations for new ads and a poor perceived experience for advertisers. (2) A larger parameter scale improves the model's personalized estimation ability, alleviates the over-exposure of popular ads, and enables individually tailored ad delivery.
  2. In terms of timeliness, the CVR model has moved from the initial day-level full retrain to the current hour-level incremental update; in terms of scale, the parameter count has grown more than 30 times and the model size more than 60 times.

Model 1.0 period

Key problems addressed: weak model infrastructure and sparse conversion samples

1. Basic capacity building

  1. In the 1.0 period we upgraded the model's basic infrastructure to lay a foundation for later iterations. For example, in sample construction for the ad CVR baseline model, user and ad profile snapshots were originally fetched by user ID and ad ID from client-reported click logs, which caused problems such as feature inconsistency between offline training and online inference and made real-time features unusable.
  2. Real-time profile snapshotting: by snapshotting profiles at ad-request time, feature consistency improved to above 95%, and real-time features became usable:
    1. At ad-request time, the Ads ad engine service sends the winning bid's item_id, device_id, and response_id to Kafka.
    2. A downstream Flink task consumes the Kafka message, reads the backend profile configuration, requests the profiles, and caches them in xcache.
    3. At exposure time, the Flink task looks up the corresponding profiles in xcache by response_id + material ID and writes them to HDFS.
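The three steps above can be sketched roughly as follows. This is an illustrative mock-up, not the production code: an in-memory dict stands in for xcache, and the Kafka/Flink plumbing and HDFS write are omitted.

```python
# Request-time snapshot / exposure-time join, in the spirit of Figure 2.
# The dict below stands in for xcache; Kafka, Flink, and HDFS are omitted.
snapshot_cache = {}

def on_ad_request(response_id, item_id, device_id, profile_store):
    # Snapshot user + ad profiles at request time so that offline training
    # features match exactly what online inference saw.
    snapshot_cache[(response_id, item_id)] = {
        "user_profile": profile_store["user"][device_id],
        "ad_profile": profile_store["ad"][item_id],
    }

def on_ad_exposure(response_id, item_id):
    # At exposure time, fetch the frozen snapshot; in production this record
    # would be written to HDFS as the sample's feature source.
    return snapshot_cache.get((response_id, item_id))

profiles = {"user": {"dev1": {"age_band": "25-34"}},
            "ad": {"ad9": {"industry": "edu"}}}
on_ad_request("r1", "ad9", "dev1", profiles)
sample = on_ad_exposure("r1", "ad9")
```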

Figure 2: Real-time image placement

  1. Real-time feature building
    1. Real-time features capture changes in user and ad data more promptly, so the model can predict with the latest signals; in the 1.0 period we produced dozens of real-time profiles on both the user side and the ad side.

2. Cascaded multi-objective ESMM model

a. The ad CVR baseline model uses the classic Embedding + MLP structure, modeling from the click: a click without a conversion is a negative sample, and a click with a conversion is a positive sample. This formulation has two problems:

    1. CVR samples are sparse, making the model parameters hard to converge.
    2. Sample selection bias between inference (exposure space) and training (click space).

b. To address these problems we adopted Alibaba's ESMM model: CTR and CTCVR are trained jointly to learn the CVR target, and embedding sharing alleviates the convergence problem caused by sparse positives. Because both the CTR and CTCVR tasks are modeled over the exposure space, the derived CVR can also be regarded as defined over the exposure space, which removes the sample selection bias. The network structure is as follows:


Figure 3: Structure of the CVR model in the 1.0 period
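A toy numpy sketch of the ESMM idea described above: shared embeddings feed a CTR tower and a CVR tower, and supervision is applied to pCTR and pCTCVR = pCTR * pCVR, both over the exposure space. All dimensions, weights, and labels are invented for illustration.

```python
# Toy ESMM-style forward pass and joint loss (illustrative, not the production model).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, d = 8, 4
shared_emb = rng.normal(size=(n, d))        # shared embedding output per impression
w_ctr, w_cvr = rng.normal(size=d), rng.normal(size=d)

pctr = sigmoid(shared_emb @ w_ctr)          # CTR tower (supervised)
pcvr = sigmoid(shared_emb @ w_cvr)          # CVR tower (no direct label)
pctcvr = pctr * pcvr                        # supervised head: click AND convert

click = rng.integers(0, 2, size=n)          # exposure-space click labels
convert = click * rng.integers(0, 2, size=n)  # conversion implies click

def log_loss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Joint loss: both terms are defined over the exposure space, so the implicit
# CVR head never trains on the biased click-only sample.
loss = log_loss(click, pctr) + log_loss(convert, pctcvr)
```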

Model 2.0 period

Key problems addressed: poor model timeliness, small parameter count, and insufficient personalization ability

1. Hour-level updates

  1. Stephen Chow's movie "Kung Fu" has a famous line: "In the world of martial arts, only speed is unbeatable." In a performance-advertising system, real-time capability is exactly that "only speed is unbeatable" weapon. In the 2.0 period we focused on the model's timeliness, cutting the end-to-end delay from data production to model update by 80%.
  2. Hourly training data
    1. In the 2.0 period we reorganized the algorithm model's data pipeline so that exposure/click/conversion joins, sample extraction, profile extraction, and feature transformation all run at the hour level. The core optimizations included:
    2. Spark SQL sample-scan strategy: splitting the dt condition and the randomization condition in the WHERE clause improved execution speed by 10 times.
    3. Data-skew optimization: profiling the Spark jobs showed that some tasks were slow because of severe data skew; handling the skewed keys specially greatly reduced task execution time.
    4. Straggler mitigation: some Spark executors ran very slowly because of CPU performance issues; enabling Spark's speculative execution (spark.speculation=true) resolved this.
  3. Stepped backfill
    1. Both clicks and conversions arrive with delay in the advertising system. We designed a stepped partition-backfill scheme for the data tables, which re-processes recent partitions to absorb late-reported clicks and conversions into both the samples and the feature caches.
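A hypothetical sketch of such a stepped backfill schedule: recent hourly partitions are re-processed every run, while older ones are revisited less often. The tier boundaries and frequencies below are invented for illustration; the article does not give the production schedule.

```python
# Stepped partition backfill: newer partitions are re-flashed more frequently
# than older ones. Tiers and frequencies are illustrative assumptions.
def partitions_to_reflash(current_hour: int) -> list[int]:
    """Return the hour-partitions (as absolute hour indices) to re-process now."""
    tiers = [
        (range(1, 4), 1),    # 1-3 hours old: every run
        (range(4, 13), 3),   # 4-12 hours old: every 3rd run
        (range(13, 25), 6),  # 13-24 hours old: every 6th run
    ]
    return [current_hour - age
            for ages, every in tiers
            for age in ages
            if current_hour % every == 0]
```

At most runs only the last three hours are touched; every third and sixth run progressively older partitions are revisited, bounding the reprocessing cost while still capturing late conversions.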

2. Combination features

a. Although deep models can in principle fit arbitrary functions, adding explicit cross features based on domain experience noticeably improves sample discrimination and speeds up model convergence. However, it also introduces two problems:

    1. Overfitting on low-frequency features.
    2. A significantly larger feature space, which increases online inference latency.

b. For the first problem: during the small-traffic experiment we found that the ads exposed by the experimental model differed sharply from the baseline's, with a higher share of unpopular ads, and the gap widened as we trained more epochs. We suspected that the high-dimensional cross features were overfitting low-frequency features. We computed the hourly Jaccard similarity between the experimental group and the baseline over exposed ad materials and ad plans, and found it was significantly higher at epoch=1 than at epoch=3 (plan-level similarity around 45% at epoch=1 in the left figure, versus under 35% at epoch=3 in the right figure), so we reduced the number of training epochs from 3 to 1.
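The similarity check above amounts to a per-hour set Jaccard computation over the plans (or materials) each model exposed; the plan IDs below are invented:

```python
# Jaccard similarity between the sets of ad plans exposed by the baseline
# and the experimental model within one hour.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0

baseline_plans = {"p1", "p2", "p3", "p4"}
experiment_plans = {"p2", "p3", "p4", "p5"}
sim = jaccard(baseline_plans, experiment_plans)  # 3 shared / 5 total = 0.6
```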

c. For the second problem: we put the high-dimensional sparse cross features into the wide-side network of a Wide&Deep-style model to learn feature memorization, adding regularization terms to reduce the overfitting risk. Because the wide side has low computational complexity, this also limits the growth in online inference latency.
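A toy forward pass of this Wide&Deep-style arrangement: sparse cross features go through a cheap linear wide side, dense features through a small MLP, and the two logits are summed. Shapes and weights are illustrative; the wide-side regularization happens during training and is only noted in a comment.

```python
# Toy Wide&Deep forward pass (illustrative shapes, untrained random weights).
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n = 4
cross = rng.integers(0, 2, size=(n, 16)).astype(float)  # sparse cross features
dense = rng.normal(size=(n, 8))                          # deep-side input

w_wide = rng.normal(size=16) * 0.1         # wide side: linear; L1/L2-regularized in training
w1 = rng.normal(size=(8, 8))               # deep side: small MLP
w2 = rng.normal(size=8)

deep_logit = np.maximum(dense @ w1, 0.0) @ w2   # ReLU hidden layer
wide_logit = cross @ w_wide                     # cheap memorization path
pred = sigmoid(wide_logit + deep_logit)
```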

  1. With the combination features, the experimental model gained +0.4% AUC on the core ad placements. The model structure is as follows:

Figure 5: The structure of the CVR model in the 2.0 period

Model 3.0 period

Key problems addressed: training-sample-size limits and conversion delay

1. Incremental learning

  1. In the 1.0 and 2.0 periods, the CVR model was fully retrained each time on globally shuffled data within a fixed time window. In that mode, hour-level training repeatedly re-trains on the same data, so each run takes a long time; and because the samples are globally shuffled, the model is not sensitive to the latest data, even though the recent distribution is closest to the current online situation. In the 3.0 period we therefore iterated toward incremental training. Compared with full training:
  2. Advantages of incremental training
    1. Each model update uses only the new data from the latest period, shortening training time to under x hours.
    2. Incremental updates break the original fixed-time-window limit, so the model can keep accumulating data indefinitely.
    3. Incremental training is more sensitive to recent samples and converges faster on new ads.
  3. Remaining problems and our scheme
    1. In incremental mode the model trains forward on the latest data, so positives whose conversions arrive late cannot enter training. Delayed conversions make up a high share of CVR samples, and simply dropping them leads to model underestimation and weaker ranking ability.
    2. Our scheme therefore uses two models: a day-level model trained on stable data (where the conversion-delay ratio is very low) plus temp data (which still has some delay), scheduled daily; and an hour-level model that continues training on top of the day-level model with the current day's data before deployment, scheduled hourly.

Figure 6: Hour-level incremental training

  1. Parameter forgetting: in incremental mode the samples are not globally shuffled, and each increment's feature distribution differs from the global one. Features such as hour and weekday take a single value within an incremental batch, so the model effectively loses them, and we found that updating only on the latest data makes the model's estimates fluctuate sharply. Possible reasons:
    1. The sparse parameters and the dense parameters may need different learning rates and optimizers.
    2. The latest data distribution shifts substantially, and some high-frequency embeddings change a lot once the new data is added. This is not what we want: embeddings of high-frequency IDs should stay stable, while new, low-frequency embeddings should converge quickly.
  2. Our mitigation: in each hour-level increment, a certain proportion of samples randomly drawn from the past week is trained together with the current day's samples, which makes the model's estimates noticeably more stable.
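The replay mitigation above can be sketched as follows. The 20% replay ratio is an assumption for illustration, not the production value.

```python
# Mix a random draw from the past week into each hourly incremental batch so
# that features like hour/weekday don't collapse to a single value.
import random

def build_incremental_batch(current_samples, last_week_samples,
                            replay_ratio=0.2, seed=42):
    rng = random.Random(seed)
    k = int(len(current_samples) * replay_ratio)
    replay = rng.sample(last_week_samples, min(k, len(last_week_samples)))
    batch = current_samples + replay
    rng.shuffle(batch)   # local shuffle within the increment
    return batch

current = [{"hour": 14, "y": 1}] * 100
history = [{"hour": h, "y": 0} for h in range(24)] * 50
batch = build_incremental_batch(current, history)
```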

Model 4.0 period

Key problems addressed: further improving personalization and unifying the models across different businesses

1. DeepRec large model

  1. Before the 4.0 period, CVR models were developed on the native TensorFlow framework, where we hit several problems that limited model iteration, for example:
    1. Ad data is high-dimensional and sparse: features were hashed (modulo) into a fixed space, but as new ads, new users, and new features kept growing, hash collisions grew with them; by our statistics, some features had collision rates as high as 30%, and heavy collisions directly hurt estimation accuracy.
    2. Low-frequency features are hard to learn effectively: the advertising system also has a head effect, and a large number of long-tail, low-frequency features cannot be learned well for lack of training data.
  2. So in the 4.0 period we worked with the mid-platform AI cloud team to upgrade the training framework from TensorFlow to DeepRec, starting the iteration from a small TensorFlow model to a large DeepRec model.

2. CVR Large Model Phase I: EmbeddingVariable && Low-frequency Feature Admission

a. EmbeddingVariable (dynamic elastic features)

    1. For high-dimensional, sparse ad data, we adopted the EmbeddingVariable feature developed by the AI cloud team: (1) feature hash collisions dropped sharply; versus the baseline model, the collision rate of ID features fell from 30% to 1%; (2) most model parameters come from feature embeddings, and the fixed hash space of the baseline wasted memory, whereas dynamic embeddings support on-demand writes and effectively avoid that waste.
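A pure-Python contrast between a fixed hash-bucket table (collision-prone) and a grow-on-demand table in the spirit of EmbeddingVariable. The real feature is a DeepRec/TensorFlow facility, so this is only a conceptual sketch with toy dimensions.

```python
# Fixed hash-bucket embedding vs. dynamic, grow-on-demand embedding.
import numpy as np

rng = np.random.default_rng(2)
DIM, BUCKETS = 4, 8

fixed_table = rng.normal(size=(BUCKETS, DIM))

def fixed_lookup(feature_id: str):
    # Different IDs can collide into the same row when BUCKETS is small.
    return fixed_table[hash(feature_id) % BUCKETS]

dynamic_table = {}

def dynamic_lookup(feature_id: str):
    # Each new ID gets its own row, written on demand; no collisions,
    # and no memory is pre-allocated for IDs that never appear.
    if feature_id not in dynamic_table:
        dynamic_table[feature_id] = rng.normal(size=DIM)
    return dynamic_table[feature_id]

for i in range(100):
    dynamic_lookup(f"ad_{i}")
```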

b. Low-frequency feature admission

    1. Low-frequency features cannot be learned effectively for lack of training data, and our earlier ID-combination-feature experiments already ran into low-frequency overfitting that inflated the model's estimates. We therefore use a bloom_filter-based admission mechanism to reduce the impact of low-frequency features on model estimation.
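One common way to implement such admission is a counting Bloom filter: a feature ID only receives its own embedding after it has been seen a threshold number of times. A rough sketch, where the filter size, hash count, and threshold are illustrative assumptions:

```python
# Counting-Bloom-filter feature admission: admit a feature only after it has
# been observed at least `threshold` times (approximately, with a small
# over-count probability from hash collisions).
import hashlib

class CountingBloomAdmission:
    def __init__(self, size=1024, hashes=3, threshold=5):
        self.counts = [0] * size
        self.size, self.hashes, self.threshold = size, hashes, threshold

    def _slots(self, feature_id: str):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{feature_id}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def observe(self, feature_id: str) -> bool:
        """Count one occurrence; return True once the feature is admitted."""
        slots = list(self._slots(feature_id))
        for s in slots:
            self.counts[s] += 1
        # The minimum over the slots upper-bounds the true occurrence count.
        return min(self.counts[s] for s in slots) >= self.threshold

adm = CountingBloomAdmission(threshold=3)
results = [adm.observe("ad_42") for _ in range(3)]  # admitted on the 3rd sighting
```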

c. In the first iteration of the large model we introduced the two features above and also migrated the wide-side combination features to the deep side for direct learning. Offline evaluation showed AUC +0.25% and PCOC reduced by about 30%. The model structure is as follows:


Figure 7: The structure of the first version of the large model in the 4.0 period

3. CVR Large Model Phase II: Parameter Personalization Network Based on Gate NU

a. Supervised learning essentially fits the data distribution; when the distribution is imbalanced, the model tends to be dominated by the majority samples or features. For example, high-frequency users' data dominates learning, so the model learns poorly on low-frequency users and its estimates are biased toward high-frequency users. Since different users' advertising behavior differs, we designed a device_id-based parameter personalization network with a gating mechanism, aiming to reduce the impact of the uneven distribution on the model and improve its personalized prediction ability.

b. Gate NU (gating network)

    1. The user-ID feature is fed into a two-layer neural network whose second layer uses sigmoid as the activation function, generating personalized network weights for each user.
    2. The gate's output is combined with the main network's dense-layer output via an element-wise product, acting as the user's personalized bias.
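A toy numpy sketch of the Gate NU mechanism just described; all dimensions and weights are illustrative:

```python
# Gate NU sketch: a two-layer gate on the user-ID embedding, sigmoid-activated,
# scales the main network's dense output element-wise.
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_uid, d_hidden, d_main = 8, 16, 32
uid_emb = rng.normal(size=d_uid)        # user-ID embedding
main_out = rng.normal(size=d_main)      # main network dense-layer output

g1 = rng.normal(size=(d_uid, d_hidden))  # gate layer 1
g2 = rng.normal(size=(d_hidden, d_main)) # gate layer 2 (sigmoid output)

gate = sigmoid(np.maximum(uid_emb @ g1, 0.0) @ g2)  # per-user gate values
personalized = main_out * gate                       # element-wise product
```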

c. With the parameter personalization network, the model gained AUC +0.12% and GAUC +0.28%. The model structure is as follows:


Figure 8: The structure of the second version of the large model in the 4.0 period

4. CVR Large Model Phase III: Multi-objective upgrade

a. OCPC advertising currently splits into single-bid and double-bid ads. A single-bid ad optimizes one CVR target, while a double-bid ad optimizes two CVR targets at once (shallow goals: form submission, activation, etc.; deep goals: credit, WeChat add, payment, etc.). Previously the shallow-CVR and deep-CVR models were built separately, which made maintenance costly and iteration slow, so we merged the deep and shallow CVR models.

b. Merged multi-objective model structure design

    1. Considering sample selection bias and positive-sample sparsity, we added CTR and CTCVR as two auxiliary targets; the model structure is as follows:

Figure 9: The structure of the merged multi-objective large model in the 4.0 period

c. Loss mask design based on the ad's deep and shallow objectives

    1. Ads with different bidding modes need to optimize different goals, so a Loss Mask mechanism masks out the irrelevant samples when computing the loss, preventing the task networks from biasing one another: the single-bid ad mask zeroes out the deep-CVR term, while the double-bid ad mask keeps both CVR terms.
    2. Model loss with mask: Loss = \sum(Loss_{ctr}, Loss_{ctcvr}, Loss_{shallow\_cvr}, Loss_{deep\_cvr}) \odot Mask
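The masked loss can be illustrated with toy per-sample, per-task losses; the numbers are invented, and in the real model each column would come from one of the task heads:

```python
# Per-sample loss mask: single-bid samples zero out the deep-CVR term,
# double-bid samples keep all objectives.
import numpy as np

# rows: samples; cols: [ctr, ctcvr, shallow_cvr, deep_cvr] per-sample losses
losses = np.array([
    [0.3, 0.2, 0.5, 0.7],   # single-bid sample
    [0.4, 0.1, 0.6, 0.2],   # double-bid sample
])
masks = np.array([
    [1, 1, 1, 0],           # single-bid: no deep-CVR objective
    [1, 1, 1, 1],           # double-bid: all objectives
])

total_loss = (losses * masks).sum()  # 1.0 + 1.3 = 2.3
```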

d. In terms of results, the merged multi-objective model gained AUC +0.14%, the estimation bias of the deep-CVR head dropped significantly, and the CVR of some deep double-bid ads rose by more than 100%.

Future outlook

After several rounds of iteration and optimization, the self-operated ad CVR model has improved the core metric eCPM by nearly 100%, bringing significant business gains, but several problems remain to be solved:

(1) Continue scaling up the model to further improve its personalized estimation ability.

(2) As timeliness keeps improving, the conversion-delay problem becomes more pronounced; we urgently need to model delayed conversions to fix the estimation bias caused by delayed conversions being misjudged as negative samples.

(3) Through meta-learning, genuinely improve the model's estimation ability for new ads, so that new ads can accurately find interested audiences during cold start and improve the monetization efficiency of cold-start traffic.

(4) Improve the accuracy of CVR model prediction in multiple scenarios on and off the site through richer scene-based feature characterization and more efficient scenario-based modeling methods.

(5) Manually designed cross features are effective, but designing good combinations requires deep business understanding and is inefficient; we will explore the industry's automatic feature-crossing network structures, such as DCNv2 and CAN.

Although the CVR model for Hima's self-operated advertising has made progress, much can still be continuously optimized. We believe that "the starlight does not fail those who press on": as long as we keep iterating in the right direction, it will bring further business gains to Hima's self-operated advertising.

Author: Hima Tech Zheng Jiwei

Source-WeChat public account: Himalaya technical team

Source: https://mp.weixin.qq.com/s/ku4jncV263ATWMZgSR5oog
