
How does Netflix, with a market capitalization of more than 1.7 trillion yuan, make its decisions?

Author | Netflix Technology Blog

Translated by | Liu Yameng

Planning | Ling Min

As of the close of the US stock market on December 23, the market value of Netflix, the well-known US film and television company, was 272 billion US dollars (about 1.73 trillion yuan). Netflix uses A/B testing to make decisions and continuously improve its product.

1

It's easy to make decisions; it's hard to make the right ones

Netflix's founding philosophy is to put consumer choice and control at the center of the entertainment experience, and as a company we are constantly improving our product to strengthen that value proposition.

For example, Netflix's user interface (UI) has undergone a radical transformation over the past decade. Back in 2010, the UI was static, with limited navigation options and a presentation inspired by the displays at a video rental store. Today the UI is immersive and video-forward, the navigation options are richer yet less obtrusive, and the box art presentation takes fuller advantage of the digital experience.


Figure 1: The Netflix TV user interface (UI) as it appeared in 2010 (top) and 2020 (bottom).

In moving from the 2010 experience to today's, Netflix had countless decisions to make. For example: what is the right balance between a large display area for a single title and showing more titles? Is video better than still images? How do we deliver a seamless video preview experience on constrained networks? How do we select which titles to display? Where should the navigation menus go, and what should they contain? The list goes on and on.

Making decisions is easy – the hard part is making the right ones. How can we be confident that our decisions will deliver a better product experience for current members and grow the business by attracting new members? There are a number of ways Netflix could decide how to improve its product and bring more joy to members:

Let the leader make all the decisions.

Hire experts in design, product management, user experience, streaming delivery, and more, and then adopt their best ideas.

Hold internal debates in which the views of our most charismatic colleagues carry the day.

Copy the competition.


Figure 2: Different approaches to decision-making. Clockwise from top left: leadership decides, internal experts decide, copy the competition, group debate.

In each of the paradigms above, the number of perspectives that feed into a decision is limited. The leadership team is small, group debates can only grow so large, and Netflix has only so many experts in each area where decisions are needed. Meanwhile, there may be dozens of streaming or related services that could serve as sources of inspiration. Moreover, none of these paradigms offers a systematic way to make decisions or to resolve conflicting points of view.

At Netflix, we believe there's a better way to decide how to improve the experience we deliver to our members: we use A/B testing. Experiments give all of our members the opportunity to vote, through their actions, on how we should continue to evolve their delightful Netflix experience, rather than leaving those decisions to a panel of executives or experts.

More broadly, A/B testing, along with other causal inference methods such as quasi-experimentation, is how Netflix applies the scientific method to inform decision-making. We form hypotheses, gather empirical data, including data from experiments, that provides evidence for or against those hypotheses, and then draw conclusions and generate new hypotheses.

As my colleague Nirmal Govind explains, experiments play a key role in supporting the scientific method's iterative cycle of deduction (drawing specific conclusions from general principles) and induction (forming general principles from specific results and observations).

2

A/B testing

An A/B test is a simple controlled experiment. Suppose, for example (and this is purely hypothetical!), that we want to learn whether a new product experience that turns all the box art in the TV user interface upside down would benefit our members.


Figure 3: How can we determine whether product experience B, with its upside-down box art, is a better experience for our members?

To run the experiment, we take a subset of our members, usually a simple random sample, and then use random assignment to split that sample evenly into two groups.

Group "A" is often referred to as the "Control Group" and continues to receive the basic Netflix user interface experience, while group "B" is often referred to as the "Treatment Group" to get different experiences based on specific assumptions about improving the membership experience (these assumptions are described in more detail below). Here, Group B accepts the inverted box booth.

We then compare the values of various metrics between Groups A and B; some of these metrics will be specific to the given hypothesis.

For user interface (UI) experiments, we look at how members engage with different variants of the new feature. For an experiment designed to surface more relevant results in the search experience, we measure whether members find more content worth watching through search. In other kinds of experiments, we may focus on more technical metrics, such as the application's load time or the video quality we can deliver under different network conditions.


Figure 4: A simple A/B test. We use random assignment to divide a random sample of Netflix members into two groups. Group "A" receives the current product experience, while Group "B" receives a change we believe improves the Netflix experience. Here, group "B" receives the "upside-down" product experience. We then compare metrics between the two groups. Crucially, random assignment ensures that, on average, everything else about the two groups stays the same.
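To make the mechanics above concrete, here is a minimal sketch, in Python, of splitting a random sample of members into two groups and comparing a single per-group metric. The member IDs, group sizes, and the "hours watched" metric are all invented for illustration; this is not Netflix's actual experimentation code.

```python
import random

def assign_groups(member_ids, seed=42):
    """Randomly split a sample of members into control ("A") and treatment ("B")."""
    rng = random.Random(seed)
    shuffled = list(member_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A": shuffled[:half], "B": shuffled[half:]}

def group_mean(group, metric_by_member):
    """Average a per-member metric (e.g., hours watched) over a group."""
    return sum(metric_by_member[m] for m in group) / len(group)

# Hypothetical data: a random sample of member IDs and a made-up engagement metric.
members = list(range(10_000))
hours_watched = {m: random.Random(m).uniform(0, 10) for m in members}

groups = assign_groups(members)
print("A:", round(group_mean(groups["A"], hours_watched), 2),
      "B:", round(group_mean(groups["B"], hours_watched), 2))
```

In a real system, the assignment is usually deterministic (for example, hashing a member ID together with the experiment name) so that a member always lands in the same group, but the statistical idea is the same.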

As with many experiments, including this upside-down box art example, we need to think carefully about what our metrics are telling us.

Say we look at click-through rate, the fraction of members in each experience who click on a title. On its own, that metric could be a misleading measure of the new UI's success, since members in the upside-down experience may click on titles simply to make them easier to read. In that case, we would also want to look at how many members then back out of the title rather than continue playing it.

In addition, we look at more general metrics designed to capture the joy and satisfaction that Netflix brings our members.

These metrics include the extent to which members interact with Netflix: Does the idea we're testing help members choose Netflix as their destination for entertainment on any given night?

There is also a lot of statistics involved: how big a difference should be considered significant? How many members do we need in a test to detect an effect of a given size? How do we analyze the data most efficiently? This article sticks to high-level intuition.
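Those statistical questions have standard textbook answers. As a rough sketch, assuming the metric is a simple proportion (the share of members who engage), a two-sided two-proportion z-test and the usual normal-approximation formula for sample size look like this; the numbers are made up and this is only one of several reasonable approaches.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in proportions (e.g., stickiness rates)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def sample_size_per_group(p_baseline, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate members needed per group to detect an absolute lift in a proportion."""
    p1, p2 = p_baseline, p_baseline + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return int(n) + 1

# Hypothetical: 2,000 members per group, 52% engaged in A vs. 55% in B.
print(two_proportion_ztest(1040, 2000, 1100, 2000))
print(sample_size_per_group(p_baseline=0.52, min_detectable_lift=0.03))
```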

3

Leave other factors unchanged

Because we create the control group ("A") and the treatment group ("B") using random assignment, we can be sure that the two groups are, on average, balanced on every dimension that might matter for the test.

For example, random assignment ensures that the average tenure of Netflix membership does not differ meaningfully between the control and treatment groups, nor do content preferences, primary language selections, and so on. The only difference between the two groups is the new experience we are testing, which keeps our estimate of the new experience's impact free of bias.
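A quick way to convince yourself of this balancing property is to simulate it: randomly split a population with some arbitrary covariate, say membership tenure, and check that the group averages line up. The toy simulation below uses invented numbers, not real member data.

```python
import random

random.seed(7)

# Hypothetical covariate: membership tenure in months for 100,000 members.
tenure = [random.expovariate(1 / 30) for _ in range(100_000)]

# Random assignment: each member is independently placed in group A or B.
group_a, group_b = [], []
for t in tenure:
    (group_a if random.random() < 0.5 else group_b).append(t)

mean = lambda xs: sum(xs) / len(xs)
print(f"mean tenure A: {mean(group_a):.2f} months, B: {mean(group_b):.2f} months")
# With enough members the two means are nearly identical, so any metric
# difference between the groups can be attributed to the treatment.
```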

To appreciate how important this is, consider another way we could make the decision: we could roll out the new upside-down box art experience (discussed above) to all Netflix members and watch whether our metrics change noticeably. If the change looks positive or neutral, we keep the new experience; if it looks negative, we roll back to the previous product experience.

Suppose we do this (again, purely hypothetically!) and flip the switch to the upside-down experience on the 16th day of the month. What would you conclude if we collected the following data?


Figure 5: Hypothetical data for a new upside-down box art product experience released on day 16.

The data looks great: we released a new product experience and member stickiness jumped! But given that data, plus the knowledge that Product B turns all the box art in the user interface upside down, how confident are you that the new experience is really good for our members?

Do we really know that the new product experience caused the increase in user stickiness? Are there other possible explanations?

What if you also knew that, on the same day it launched the new upside-down product experience, Netflix released a hit series, like a new season of Stranger Things or Bridgerton, or a hit movie like Army of the Dead?

Now there is more than one plausible explanation for the jump in user stickiness: it could be the new product experience, it could be the buzzy hit title, or it could be both. Or something else entirely. The key point is that we simply don't know whether the new product experience caused the increase in user stickiness.

Conversely, what if we ran the upside-down box art experience as an A/B test, with one group of members receiving the current product ("A") for the whole month and another group receiving the upside-down product ("B"), and collected the following data?


Figure 6: Hypothetical data from an A/B test of the new product experience.

In this case, we reach a different conclusion: the upside-down product generally leads to lower user stickiness (no surprise!), and stickiness rises in both groups when the big title is released.

The A/B test lets us make a causal statement. We introduced the upside-down product experience only in Group B, and because members were randomly assigned to Groups A and B, everything else about the two groups was held constant. We can therefore conclude, with high probability (more on the details next time), that the upside-down product caused the drop in user stickiness.
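To see why the A/B comparison survives the confounding launch while the before/after comparison does not, consider a toy simulation. The effect sizes, a small penalty from the upside-down UI and a bump from a big title launching on day 16, are invented purely for illustration.

```python
import random

random.seed(1)
DAYS = 30
LAUNCH_DAY = 16

def daily_stickiness(day, upside_down):
    """Made-up model: baseline stickiness, a launch bump from day 16 onward,
    and a penalty if the member sees the upside-down experience."""
    rate = 0.50 + (0.08 if day >= LAUNCH_DAY else 0.0) - (0.05 if upside_down else 0.0)
    return rate + random.gauss(0, 0.005)  # small day-to-day noise

# Before/after rollout: everyone gets the upside-down UI starting on day 16.
before = [daily_stickiness(d, upside_down=False) for d in range(1, LAUNCH_DAY)]
after = [daily_stickiness(d, upside_down=True) for d in range(LAUNCH_DAY, DAYS + 1)]

# A/B test: group A keeps the current UI all month, group B gets upside-down all month.
group_a = [daily_stickiness(d, upside_down=False) for d in range(1, DAYS + 1)]
group_b = [daily_stickiness(d, upside_down=True) for d in range(1, DAYS + 1)]

avg = lambda xs: sum(xs) / len(xs)
print(f"before/after difference: {avg(after) - avg(before):+.3f}")    # looks positive
print(f"A/B difference (B - A):  {avg(group_b) - avg(group_a):+.3f}")  # correctly negative
```

The before/after difference comes out positive because it mixes the launch bump with the UI change, while the A/B difference isolates the (negative) effect of the UI change itself.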

This hypothetical example is extreme, but it tells us that there is always something beyond our control.

If we roll out an experience to everyone and simply compare a metric before and after the change, there may be other differences between the two time periods that prevent us from drawing a causal conclusion. Maybe a new series became popular. Maybe a new product partnership brought more members to Netflix. There is always something we don't know.

Where possible, running an A/B test lets us establish causation and confidently change the product, knowing that our members have voted for the change with their actions.

4

It all starts with an idea

An A/B test starts with an idea – some change we could make to the user interface, to the personalization systems that help members find content, to the sign-up flow for new members, or to any other part of the Netflix experience that we believe would produce a positive result for members. Some ideas we test are incremental innovations, such as improving the text copy that appears in the Netflix product; others are more ambitious, like the test of the "Top 10" row that Netflix now shows in the user interface.

Like every innovation rolled out to Netflix's global membership, the "Top 10" row began as an idea that was then turned into a testable hypothesis. Here, the core idea was that surfacing the titles that are popular in each country would benefit our members in two ways. First, by surfacing trending content, we can help members share experiences and connect with one another through conversations about popular titles. Second, by tapping into the intrinsic desire to be part of a shared conversation, we can help members find something great to watch.

Figure 7: An example of the "Top 10" row on the web user interface.

Next, we turn the idea into a testable hypothesis, a statement of the form: "If we change X, it will improve the member experience in a way that improves metric Y."

For the "top 10" example, the assumption is: "Showing members the top 10 experiences will help them find content worth watching, thereby increasing member pleasure and satisfaction." "The main decision-making metric for this test (and many others) is to measure members' user stickiness to Netflix: Does the idea we're testing help our members choose Netflix as their entertainment destination on any given night?"

Our research shows that, over the long run, this metric (details omitted) correlates with the probability that members retain their subscriptions. Other areas of the business that we test, such as the sign-up page experience or server-side infrastructure, use different primary decision metrics, but the principle is the same: what can we measure during the test that indicates we are delivering more long-term value to our members?

Beyond the primary decision metric for a test, we also consider a number of secondary metrics and how we expect them to be affected by the product change we are testing. The goal is to articulate the causal chain, from how member behavior responds to the new product experience through to changes in our primary decision metric.

Articulating the causal chain between the product change and changes in the primary decision metric, and monitoring secondary metrics along that chain, helps us build confidence that any movement in the primary metric is the result of the causal chain we hypothesized, rather than an unintended consequence of the new feature (or a false positive).

For the "top 10" test, user stickiness is our main decision-making metric – but we also look at other metrics, such as episode view rate for episodes that appear in the top 10 list, the ratio of view rate from that row to view rate in other parts of the user interface, and so on.

If, as hypothesized, the "Top 10" experience really benefits our members, then we would expect the treatment group to show an increase in views of Top 10 titles and generally higher engagement with that row.

Finally, because not every idea we test turns out to be a win for our members (sometimes new features have bugs!), we also track metrics that act as "guardrails".

Our goal is to limit any negative impact and make sure a new product experience doesn't have unintended consequences for the member experience. For example, we can compare customer service contact rates between the control and treatment groups to check whether the new feature increases contacts, which could signal member confusion or dissatisfaction.
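One way to keep the primary decision metric, the secondary metrics along the causal chain, and the guardrail metrics straight is to write the hypothesis down as structured data before the test begins. The sketch below is hypothetical, not Netflix's experimentation platform; every field and metric name is made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentSpec:
    name: str
    hypothesis: str
    primary_metric: str                                       # the metric the decision hinges on
    secondary_metrics: list = field(default_factory=list)     # causal-chain checks
    guardrail_metrics: list = field(default_factory=list)     # limit unintended harm

top10_test = ExperimentSpec(
    name="top-10-row",
    hypothesis=("Showing members the Top 10 row helps them find something worth "
                "watching, increasing member joy and satisfaction."),
    primary_metric="member_engagement",
    secondary_metrics=["top10_title_view_rate", "share_of_plays_from_top10_row"],
    guardrail_metrics=["customer_service_contact_rate"],
)

print(top10_test.primary_metric, top10_test.guardrail_metrics)
```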

5

Summary

The focus of this article has been building intuition: the basics of A/B testing, why running an A/B test is better than rolling out a feature and simply comparing metrics before and after the change, and how to turn an idea into a testable hypothesis.

Reference Links:

https://netflixtechblog.com/decision-making-at-netflix-33065fa06481

https://netflixtechblog.com/what-is-an-a-b-test-b08cc1b57962
