
Causal inference techniques based on representation learning in Kuaishou practice

Author: DataFunTalk


Today's topic is the practice of causal inference techniques based on representation learning at Kuaishou. The sharing is divided into four parts:

The first part covers industrial RCT experimental specifications. In industry, RCT data is very important for causal modeling, but our collection of RCT data is often not standardized, so we end up modeling on the wrong data, which both wastes budget and degrades the online effect. This part shares some experience in RCT data collection.

The second part is joint modeling of tree models and deep models. The tree model is the most widely used model in industry and is the baseline at many companies, while deep models are a mainstream trend that lets us design more flexible, customized network structures. This part discusses how to integrate the two types of models so as to combine their strengths.

The third part is fusion modeling of RCT data and observational data. RCT data is unbiased but small in volume and expensive, while observational data is biased but abundant and available everywhere, so we hope to increase the effective amount of data by modeling the two jointly.

The last part is feature decomposition, currently a popular academic research direction, which mainly studies how to extract confounders, adjustment variables, and instrumental variables (IVs) from the features.

Full-text catalog:

  • Industrial RCT experimental specifications
  • Tree Model & NN Joint Modeling
  • RCT & ODB fusion modeling
  • Feature decomposition

Sharing guests | Qin Xuan / Yu Shanshan, Kuaishou Growth algorithm engineers

Editor| Xu Zhenyuan, Tencent

Production community| DataFun

01

Industrial RCT experimental specifications

First, a brief introduction to RCT data. Why is RCT so powerful, and why is everyone using it and willing to invest a lot of money? There are three main reasons:

  • Comparability and covariate balance
  • Exchangeability
  • No backdoor paths

1. Comparability and covariate balance

Let's start with the definition: if the distribution of the covariates X is the same under every treatment group, we say we have covariate balance.

It can be expressed by the formula P(X | T = 1) = P(X | T = 0) = P(X); that is, the covariate distributions under the different treatment groups are homogeneous.

In other words, everything except the treatment itself is distributed identically across the treatment groups, and randomization guarantees this.

Here is a short proof. Assume the population distribution is P(X). Under randomization, T is assigned independently of X, so P(X | T = 1) = P(X) and P(X | T = 0) = P(X); the covariate distribution is the same no matter what value t takes.
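This property is easy to see in a toy simulation (not from the talk; all numbers illustrative): draw a covariate for a population, assign treatment by a fair coin, and the two group distributions match the population.

```python
import random

random.seed(7)

def mean(v):
    return sum(v) / len(v)

# Draw a covariate for a population, then randomize treatment with a fair coin.
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
treated = [random.random() < 0.5 for _ in xs]

x_t = [x for x, t in zip(xs, treated) if t]       # covariate under T=1
x_c = [x for x, t in zip(xs, treated) if not t]   # covariate under T=0

# Both group means approximate the population mean: P(X|T=1) = P(X|T=0) = P(X)
print(round(mean(x_t), 2), round(mean(x_c), 2))
```

With 100k samples the two group means agree to within sampling noise, which is exactly the covariate balance the proof describes.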


More importantly, with RCT data, causation implies association. The key step in the derivation is the fourth equality: when T and X are independent, P(t | x) = P(t); applying the law of total probability then gives P(y | do(t)) = P(y | t), i.e., the interventional distribution equals the ordinary conditional one. This is why, on RCT data, the associations we observe match our causal intuition.


2. Exchangeability

The second property is exchangeability: the potential outcomes of Y are independent of T. Because all covariates X are the same except for T itself, the groups have the same properties, so even if their treatments were swapped, their potential outcomes would not change in any way. Note that it is the potential outcomes that are unchanged, not the observed outcomes, because we can only ever observe one outcome per unit.


3. No backdoor paths

When we have randomized experiments, there is no backdoor path because t is no longer related to x.

To sum up, RCT data is the gold standard: it brings natural unconfoundedness, which lets us model better. Unfortunately, random data is extremely expensive and often comes with ethical issues, so in most scenarios random data is a "luxury". Given this limited luxury, how do we get an estimate that is as unbiased as possible? Two ideas: the first is to make the RCT correct and efficient under a limited budget, since only then can we ensure the modeling is correct; the second is to find the unbiased part of the biased observational data and fuse it with the RCT data for modeling.


What happens if we do not model on RCT data? For example, we might observe that people who wear glasses perform better academically, and ask: does wearing glasses itself really affect academic performance? The answer is obviously no; if wearing glasses improved grades, nobody would need to study, they would just put on glasses. Thinking further, people who study for a long time are more likely to wear glasses, and people who study for a long time get better grades, so we observe the spurious effect that wearing glasses improves academic performance. This is confounding caused by study duration as a confounder. Modeling naively on observational data therefore yields biased estimates, which is why RCT data is needed.

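The glasses story is easy to reproduce numerically. In this toy simulation (all numbers invented for illustration), glasses have zero true effect on the exam score, yet the naive group comparison reports a large positive "effect" driven entirely by study time:

```python
import random

random.seed(1)

def mean(v):
    return sum(v) / len(v)

rows = []
for _ in range(50_000):
    study = random.uniform(0, 10)                 # confounder: daily study hours
    glasses = random.random() < study / 10        # long study -> more likely glasses
    score = 50 + 4 * study + random.gauss(0, 5)   # score depends on study ONLY
    rows.append((glasses, score))

with_g = [s for g, s in rows if g]
without_g = [s for g, s in rows if not g]

# Naive comparison attributes the study-time effect to the glasses.
naive_effect = mean(with_g) - mean(without_g)
print(round(naive_effect, 1))   # large positive, despite a true effect of 0
```

Under an RCT, glasses would be assigned independently of study time and the same comparison would correctly hover around zero.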

The disadvantages of random data (its high cost and ethical concerns) were discussed above and will not be repeated. Therefore, we need scientific and efficient RCT schemes.


There are two approaches to RCT design in industry: nested design and non-nested design. What we usually want is the nested design, which is the most used: given a target population, sample randomly from it and split into two groups, one as the RCT group and one as the strategy experiment group. Non-nested design uses a different sampling mechanism for the two groups and appears more often in medical or more complex business scenarios.


How do we design our RCT scheme?

First, we must clarify the target population. In user growth and causal inference, pinning down the target population is critical and often overlooked. Some users may never be covered by the strategy, so they should not appear in our RCT data. We therefore need to narrow the RCT target set in advance; only by making it as precise and small as possible can the RCT experiment be efficient.

For example, when building strategies there are often special rules for certain users, and these rules are easy to forget: some users with special properties are always given one particular treatment or a small set of treatments. Including such populations in the RCT leads to an uneven distribution of samples across treatments.

If the target population cannot be defined precisely, the simplest remedy is careful logging in the online service: every time a sample is written, record exactly the one policy key that was actually applied, which makes correct data collection possible afterwards. But the safest way is still to define the target population as precisely as possible.


Second, we need to shuffle traffic before the experiment and re-shuffle it regularly during the experiment; this step is also easily overlooked. A common practice is to pick an existing experiment group, drop its previous experiment, and run the random strategy directly on it. This makes the distribution of the RCT group inconsistent with the other groups: every strategy has its own long-term cumulative effect, and after a long accumulation the groups no longer pass an AA test on features, so a model trained on such data may perform badly when launched on another group. Regular traffic shuffling is therefore necessary. In addition, the RCT itself is, broadly speaking, also a strategy and likewise shifts the sample distribution of its group, so it is important to re-shuffle regularly or to sample RCT data afresh from the whole population each time.


There are two granularities for randomization in RCT experiments: the user dimension and the request dimension.

  • User dimension: Give a user a Treatment until the end of an experimental period.
  • Request dimension: Each time a request comes, the system randomly assigns a Treatment.

Which one to choose depends on business needs. User-dimension RCTs help observe the cumulative effect of a treatment; request-dimension RCTs reveal the causal effect of a single treatment and yield more training samples under the same budget.

Whichever one is chosen, make sure no post-treatment features are used. In the user dimension, only features from before the user's first request may be used, otherwise information from after the treatment leaks into the features; in the request dimension, any feature computed before the request can be used with confidence.
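Randomization at either granularity is commonly implemented with a salted hash. A minimal sketch (the function and salt names are illustrative, not Kuaishou's actual service):

```python
import hashlib

def assign_bucket(key: str, n_treatments: int, salt: str = "rct-v1") -> int:
    """Deterministic bucket in [0, n_treatments).

    - user dimension:    key = user id    -> stable for the whole period
    - request dimension: key = request id -> a fresh draw on every request
    Changing the salt (a traffic shuffle) reshuffles user-dimension buckets.
    """
    digest = hashlib.md5(f"{salt}:{key}".encode()).hexdigest()
    return int(digest, 16) % n_treatments

# The same user always lands in the same bucket within one salt epoch.
assert assign_bucket("user_42", 4) == assign_bucket("user_42", 4)

# Across many users, buckets are roughly uniform.
buckets = [assign_bucket(f"user_{i}", 4) for i in range(10_000)]
print({b: buckets.count(b) for b in range(4)})
```

Hashing the user id gives the stable user-dimension assignment; hashing a per-request id gives the request-dimension behavior, with no state to store either way.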


Finally, a major practice from our side: Online RCT.

Compared with RCTs that concentrate a large amount of traffic in a single time window, a continuous small-traffic online RCT is more cost-effective and more useful, because:

  • It allows us to always have random data that is close to the current population distribution

The RCT data we currently use was collected in October 2022; when we apply a model trained on that data to today's strategy traffic, both the covariate distribution and the treatment distribution are far from the RCT test set. Online RCT helps alleviate this drift.

  • This gives us more flexibility to change treatment and avoid waste

Previously, our RCT method was to open the RCT switch on a large slice of traffic for a period of time; once the treatment set changed, the previously collected RCT data became unusable, a double waste of time and money. Online RCT mitigates this problem.

  • It helps automate model updates

Day-level automated model updates are already applied in many scenarios and have become a trend, but since causal modeling mainly relies on RCT data, incremental learning is unrealistic without high-frequency data updates. Online RCT solved this for us.


When collecting RCT data, we also need a complete set of verification tools to check that the data is unbiased. This step is critical: if data quality cannot be guaranteed, all subsequent model construction and training may be wrong.
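One standard tool in such a verification suite is the standardized mean difference (SMD) per covariate between treatment groups, with |SMD| < 0.1 as a common rule of thumb for balance. A minimal sketch (data simulated for illustration):

```python
import math
import random

def smd(a, b):
    """Standardized mean difference of one covariate between two groups."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt((va + vb) / 2)

random.seed(3)
treat = [random.gauss(0, 1) for _ in range(20_000)]
ctrl = [random.gauss(0, 1) for _ in range(20_000)]       # same distribution
drifted = [random.gauss(0.5, 1) for _ in range(20_000)]  # shifted covariate

print(abs(smd(treat, ctrl)) < 0.1)     # balanced pair passes the check
print(abs(smd(treat, drifted)) < 0.1)  # drifted pair fails it
```

Running this check per covariate across all treatment arms is an inexpensive automated gate before any model training starts.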


--

02

Tree Model & NN Joint Modeling

First, look at a typical causal graph: U is the unobserved confounder, C is the observed confounder, T is the treatment, Y is the outcome, A is the adjustment variable, and IV is the instrumental variable. In industry, the causal graph can in most cases be reduced to the version without U; that is, we can assume there are no unobserved confounders. Since the confounders are exactly the features used when training the online model, they are mostly known: we can assume that our strategy is determined by the model, and the features used in that model are the confounders.


In most companies, modeling on RCT data eliminates the C to T edge in this causal graph.

With this diagram, let's see the different ways that NN models and tree models model causal effects.


A common architecture is DRNet, whose parameters are in most cases trained with a per-head regression loss. But there is a problem: if we train a multi-head neural network where each head regresses Y, the resulting representation mixes the confounders and the adjustment variables, yet only the confounders carry information about the causal effect. Because the network does not model heterogeneity, it is just a regression model: it may estimate Y accurately, but when we compute causal effects by subtracting the predictions of two heads, bias is introduced.

For example, consider estimating the effect of diet pills: T is taking the pill, Y is weight, and A is whether weight is measured before or after going to the toilet. The model gives A a parameter in each head; if A's parameters differ between heads, subtracting the heads to get the causal effect leaves a residual term on A, and this error is amplified in the final causal-effect estimate.
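The head-subtraction bias can be seen with two hypothetical fitted heads. The coefficients below are invented purely for illustration: both heads fit Y well, but they put slightly different weights on the adjustment variable A, and subtracting them leaks that difference into the uplift:

```python
# Hypothetical fitted coefficients for a two-head regression (illustrative only).
# True causal effect is 1.0; A carries no causal signal.
head_t1 = lambda c, a: 1.0 + 2.0 * c + 0.50 * a   # head for T=1
head_t0 = lambda c, a: 0.0 + 2.0 * c + 0.43 * a   # head for T=0: A weight drifted

c, a = 1.0, 3.0
uplift = head_t1(c, a) - head_t0(c, a)
# 1.0 (true effect) + (0.50 - 0.43) * a = 1.21: pure A-noise in the estimate
print(uplift)
```

The larger A's scale, the more the coefficient mismatch is amplified, which is exactly why the talk argues for keeping A out of the effect representation.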


On RCT data, the causal forest models heterogeneity directly: it splits only on confounders, in other words only on features that carry information about the causal effect. In the previous example, whether one visits the toilet before weighing is balanced out within each leaf node, so the tree never splits on A. The causal effect output by an RCT causal forest is therefore a representation of the confounders.


Tree models give confounder representations, while NNs support more personalized structures. Two ideas for combining them:

  • Idea 1: Use the confounder embedding generated by the tree model as a feature of the NN model.
  • Idea 2: Use adversarial learning to do feature decomposition. This is explained in detail in the fourth part of this article.

--

03

RCT & ODB fusion modeling

First, an introduction to PS matching: stratify by propensity score, compute a spiked-in estimator within each layer (the per-layer ATE), and finally weight the layers to get the overall ATE. In the notation, O denotes observational data, R denotes RCT data, and k indexes the layer: with k buckets, O(k) is the k-th bucket of the observational data and R(k) is the k-th bucket of the RCT data.


A relatively mature idea here: assume that in every layer tau_O,k = tau_R,k; then the overall ATE is obtained by weighting the per-layer causal effects by sample size. The advantage of this method is that it handles the small-sample problem at extreme propensity values, where even a small amount of RCT data can improve the causal-effect estimate. This framework mainly computes the spiked-in ATE of each layer and then weights them into a population ATE. That is not exactly what we want, but we can borrow the framework and integrate it into our system.
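The weighting step is simple to write down. A minimal sketch of the sample-size-weighted ATE over strata (the layer sizes and effects are invented):

```python
def stratified_ate(layers):
    """layers: list of (n_k, tau_k) pairs, one per propensity-score stratum.
    Returns the population ATE weighted by stratum sample size."""
    n_total = sum(n for n, _ in layers)
    return sum(n * tau for n, tau in layers) / n_total

# Three strata with their sizes and per-layer (spiked-in) effects.
layers = [(1000, 0.20), (3000, 0.10), (1000, 0.30)]
print(round(stratified_ate(layers), 4))  # (200 + 300 + 300) / 5000 = 0.16
```

In the fused setting, each tau_k would itself be a spiked-in estimate combining the observational and RCT samples of that layer.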


Existing work on fusing observational data with RCTs mainly solves two problems: first, the distribution of the RCT data differs from that of the observational data; second, the observational data is biased. The problem we focus on is the biased observational data.


Our approach is divided into three main steps:

  • Step 1: Stratification. Stratify the samples by propensity score. A good PS model does not need the ability to accurately classify a sample into treatment or control; it needs the ability to achieve covariate balancing. We treat the samples within each layer as having the same properties.
  • Step 2: Shift the observational data toward the RCT covariate distribution. Sample the observational data so that the amount of data in each layer matches the amount in the corresponding RCT layer. In our earlier work, we made the number of observational samples equal to the number of RCT samples in every layer; this way, "samples of the same nature" in RCT and ODB appear in equal proportion, which keeps the distributions consistent.
  • Step 3: Establish unconfoundedness of the observational data, the core of the whole method. Sample the control and treatment data within each layer i such that n_obs_t,i / n_obs_c,i = n_rct_t,i / n_rct_c,i, so that samples of each type have the same relative probability of landing in treatment versus control as in the RCT.
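Step 3 can be sketched as a per-stratum subsampling routine (function and variable names are illustrative, not our production code):

```python
import random

random.seed(5)

def match_rct_ratio(obs_t, obs_c, n_rct_t, n_rct_c):
    """Within one stratum, subsample the observational treatment/control
    groups so their size ratio matches the RCT ratio n_rct_t : n_rct_c."""
    target = n_rct_t / n_rct_c
    if len(obs_t) / len(obs_c) > target:
        obs_t = random.sample(obs_t, int(len(obs_c) * target))
    else:
        obs_c = random.sample(obs_c, int(len(obs_t) / target))
    return obs_t, obs_c

# Stratum with a 3:1 observational split, but a 1:1 split in the RCT.
t, c = match_rct_ratio(list(range(900)), list(range(300)), 100, 100)
print(len(t), len(c))  # 300 300
```

Because the samples within a stratum are treated as homogeneous, dropping the surplus side only equalizes the treatment-assignment odds; it does not change the stratum's covariate profile.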

We also have a verification mechanism to judge whether the stratification is accurate: within each layer, check whether the treatment-group Y and the causal effect are consistent between the observational data and the RCT data; layers that are inconsistent need to be backfilled with RCT data.


As we went deeper in the causal direction, we concluded that clustering on the GPS is not reasonable. The propensity-score vector is not a representation of the covariates X alone: when regressing T, other features are often added invisibly, namely instrumental variables. An instrumental variable does not affect Y directly; it affects Y only through T. If the representation used for stratification contains IVs, the stratification quality suffers badly. For example, a subsidy budget does not directly affect a user's activity, but it affects activity by changing the user's chance of being incentivized. If such a feature is fed to the PS model, the stratification relies heavily on it, which hurts the balancing of the other covariates, so we must filter such features out. One paper demonstrates why IVs cause propensity-score inconsistency.


As mentioned earlier, the tree model splits only on confounders, so the causal effect it outputs is an embedding of the confounders.


Only the confounders in X provide information about the causal effect. If the confounder representation is used as the clustering feature, covariate balance within each layer is greatly improved.


Here are a few modules of our system:

  • The causal model module, which exists to obtain a pure confounder embedding. Any model can produce the embedding; based on our current understanding, the uplift from a causal forest trained on RCT data is a relatively pure confounder embedding.
  • The clustering module, which uses the confounder embedding (the output of the causal model module) to obtain an accurate sample stratification, with consistent control/treatment distributions within each layer and consistent RCT/observational distributions.
  • The covariate shifting module, which shifts the observational data toward the RCT covariate distribution.
  • The deconfounding module, which establishes the unconfoundedness of the observational data.
  • The hypothesis-testing-based verification module, which implements the verification mechanism.
  • The evaluation system: in any system an evaluation mechanism is essential, so we developed an evaluation system compatible with all stratification methods.

--

04

Feature decomposition

Earlier we introduced work on fusing RCT and observational data; since RCT samples are hard to construct and expensive, we also try to model directly on observational samples. Introducing more observational data into DNN modeling improves fitting and expressiveness, but it also introduces bias. Most existing work removes the bias through sample reweighting/balancing techniques; classic methods include DragonNet, DML, and feature decomposition.


The following briefly introduces the original feature-decomposition approach (following a work published in TKDE 2022) and some improvements we made when deploying it. In its causal graph, the covariates X are decomposed into the instrumental variable I, the confounding variable C, and the adjustment variable A; accurately separating C and A to estimate the outcome Y eliminates the bias introduced by the observational data. In the graph, the instrumental variable I affects only the treatment, the confounding variable C affects both treatment and outcome, and the adjustment variable A affects only the outcome.

The paper gives a medication example: the treatment is whether to take the medicine, and the outcome is whether health is restored. Among the patient's characteristics, income and attending physician are instrumental variables that affect only the treatment; age and gender are confounding variables that affect both treatment and outcome, because doctors consider the patient's age and gender when choosing a treatment, and age and gender also affect the probability of recovery; genes and environment are adjustment variables that affect only the outcome.


To accurately decompose the covariate X into three different types of hidden variables, do the following:

  • First decompose A from X. Two conditions must hold: the adjustment variable A must be independent of T, which constrains the information of the other variables from leaking into A (from the causal graph, the instrumental variable I and the confounding variable C are both related to T); and A must estimate Y as accurately as possible, which keeps A's information from leaking into the other variables.
  • Then decompose I from X. If balancing is done well, there is no dependence between C and T, and then I and Y are independent given T: after balancing the confounders under different treatments, the link between C and T is removed, so given T, I is independent of C and of Y, which keeps the other variables' information out of I. At the same time, I must estimate T as accurately as possible, which keeps I's information out of the other variables.
  • Finally, the factual and counterfactual outcomes are estimated from the decomposed C and A.

Based on the above analysis, the following losses are designed to decompose the covariate X:

  • First decompose A, achieving A(X) ⊥ T by minimizing the difference between A's distributions under different treatments while also minimizing the loss of estimating Y from A. disc(·) denotes the distribution difference under different treatments and can be measured by metric functions such as an IPM, MMD, or the Wasserstein distance.
  • Second, the distributions of C under different treatments are balanced to remove the dependence between C and T, achieving C(X) ⊥ T. Here ω is a sample-level weight, a learnable parameter.
  • Then I(X) ⊥ Y | T is achieved by minimizing the difference between I's distributions under different treatments while minimizing the loss of estimating T from I; again the sample-level weights ω are used.

In addition to the above losses, to avoid overfitting and unclean decomposition, the paper adds orthogonal regularization. As an example, consider the weight matrix W_A, which represents the contribution of each input variable to the output A(X); W_I and W_C are defined in the same way. If the covariates can be fully decomposed into I, C, and A, these weight matrices should be mutually orthogonal. In addition, to prevent the learned weights from collapsing to zero, the weights of each dimension are constrained to sum to 1.


Above is the feature-decomposition practice given in the paper; we made several optimizations during deployment.

  • First, the paper mainly targets binary (0/1) treatment; we upgraded it to multiple treatments by introducing a multi-head structure in which each treatment gets its own I, C, A representations.
  • Second, in the confounder-balancing step, IPW is used instead of learnable sample weights: for multiple treatments, the treatment is estimated from the representation of C, and the corresponding inverse-propensity weights are derived from it.
  • Finally, an adversarial loss is introduced to enforce the independence between variables.
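The IPW substitution is straightforward: each sample is weighted by the inverse of the propensity of the treatment it actually received. A minimal multi-treatment sketch (the probability vectors are invented for illustration):

```python
def ipw_weights(propensities, treatments):
    """propensities[i]: estimated probability vector over treatments for
    sample i (from the C representation); treatments[i]: received treatment.
    Returns w_i = 1 / P(T = t_i | C_i)."""
    return [1.0 / p[t] for p, t in zip(propensities, treatments)]

props = [[0.2, 0.5, 0.3],   # sample 0, received treatment 1
         [0.5, 0.2, 0.3]]   # sample 1, received treatment 0
print(ipw_weights(props, [1, 0]))  # [2.0, 2.0]
```

In practice the propensities come from the treatment head trained on C, and extreme weights are usually clipped before being applied to the loss.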

In the paper, independence between variables is enforced by minimizing the distribution differences under different treatments, measured with metric functions such as Euclidean distance, cosine distance, and MMD. Our goal, however, is that the variables be mutually independent, i.e., carry no usable information about each other, and "similar distributions under different treatments" expresses this only indirectly. Moreover, distribution-similarity metrics are computationally heavy and demand a lot of resources with large-scale samples and many treatments. We therefore approach it from an information perspective: A and T are independent if T cannot be estimated from A(X) at all.


Based on this idea, we introduce an adversarial loss whose goal is that A(X) cannot predict T: for n treatments, the predicted probability of every treatment should be 1/n. First, use a pre-trained network as the discriminator, fix the representation A, and update the discriminator to predict T as accurately as possible. Then fix the discriminator and update the representation A so that the discriminator's output matches the fake label (1/n for every treatment) as closely as possible. Alternate the two steps until the goal is reached. At that point A contains no information for predicting T, so A and T are independent.
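The generator-side objective can be written as the cross-entropy between the uniform fake label and the discriminator's treatment prediction. A minimal sketch showing it is minimized exactly when the prediction is uniform:

```python
import math

def ce_to_uniform(pred_probs):
    """Cross-entropy between the fake label (1/n per treatment) and the
    discriminator's predicted distribution over n treatments. Smallest
    when the prediction is uniform, i.e. when A(X) tells us nothing about T."""
    n = len(pred_probs)
    return -sum((1.0 / n) * math.log(p) for p in pred_probs)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]
print(ce_to_uniform(uniform) < ce_to_uniform(peaked))  # True
```

Updating A to minimize this loss against a fixed, well-trained discriminator drives the discriminator's prediction toward 1/n, which is the independence condition described above.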


That's all for this sharing.

Finally, welcome to join us if you are interested or experienced in causal inference, thank you!


--

05

Q&A session

Q1: Debiasing on observational data requires balancing the model's accuracy against the debiasing effect. In that case, how do you conduct offline evaluation with limited data?

A1: A good question. When modeling on observational data, or debiasing it, we often need to check the debiasing effect. Without RCT data, our approach is to stratify the samples by propensity score; if the samples within each layer are homogeneous, the ratio of treatment to control within each layer should match the ratio in the full sample. When that holds, the debiasing is doing its job.
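The check described in this answer can be sketched directly (the threshold and names are illustrative):

```python
def ratio_check(strata, tol=0.02):
    """strata: list of (n_treat_k, n_ctrl_k) per PS layer. Debiasing looks
    healthy when every layer's treatment share is close to the global share."""
    nt = sum(t for t, _ in strata)
    nc = sum(c for _, c in strata)
    global_share = nt / (nt + nc)
    return all(abs(t / (t + c) - global_share) <= tol for t, c in strata)

print(ratio_check([(500, 500), (301, 299), (198, 202)]))   # True
print(ratio_check([(800, 200), (200, 800), (500, 500)]))   # False
```

The tolerance would in practice be set from the binomial sampling noise of each layer's size rather than a fixed constant.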

Q2: In a user-dimension RCT, will the continuous influx of new users cause the AA verification to fail?

A2: Whether at the user or the request dimension, new users are randomly assigned to a group when they arrive, with equal probability for each group. As long as every user's probability of falling into each group is equal, the AA comparison of features across treatment groups will not drift.

Q3: Will long-term online RCTs violate user fairness or bring potential PR problems?

A3: This must be considered when designing the RCT. If your business is prone to such PR problems, you need to plan how to handle them in the RCT design. If the business itself is risky, then the RCT itself may not be a strategy that can be launched at all.

Q4: How do you evaluate the effect of long-term online RCTs? Since the user-granularity features mentioned in the sharing can only use pre-treatment data, how is this handled for a long-term online RCT?

A4: The practice is the same as the feature construction mentioned above.

Q5: Can the sample weights w_i learned by representation learning be specifically related to propensity scores?

A5: The starting point for replacing w with the inverse-propensity weight (IPW) is that the overall loss is already complex: a separate loss is designed for each decomposed variable, and having to also learn the parameter w makes training harder in practice. IPW is relatively mature; it was validated during our deployment and worked well.

That's it for today's sharing, thank you.

| Speakers |


Qin Xuan| Kuaishou Growth algorithm engineer

He graduated from Boston University and was a researcher in the Department of Computer Science at Tsinghua University. He previously worked as a senior algorithm engineer at Didi Chuxing, focusing on causal inference, where he independently developed a set of RCT and observational-data fusion algorithms suitable for industry and participated in developing and upgrading distributed causal forests on Spark; this work repeatedly delivered ROI gains in the ride-hailing smart-pricing business. After joining Kuaishou, he is responsible for standardized RCT data pipelines in fission scenarios and for the joint modeling of tree models and deep models.


Shanshan Yu| Kuaishou Growth algorithm engineer

Holds a master's degree from Zhejiang University and joined Kuaishou in 2020; now mainly works on the business implementation and technological innovation of representation-learning-based causal inference in fission scenarios.

| DataFun New Media Matrix |


| about DataFun |

Focused on sharing and exchange around big data and artificial intelligence technology applications. Launched in 2017, DataFun has held more than 100 offline and 100 online salons, forums, and summits in Beijing, Shanghai, Shenzhen, Hangzhou, and other cities, inviting more than 2,000 experts and scholars to share. Its public account, DataFunTalk, has produced 900+ original articles with millions of readers and nearly 160,000 core followers.