
Don't ruin your A/B experiment by choosing the wrong metrics

Author: Everyone is a Product Manager
Choosing the right A/B experiment metrics may be the key to the success of your experiment. In this article, we show how to choose the right metrics so that your experiments reach their goals quickly.

There are three main steps to experimental design: selecting the metrics for the experiment, determining the audience for the experiment, and designing the experiment versions.

Of these, choosing the right experiment metrics is especially important. Anyone who has actually designed an experiment knows there are many pitfalls here.

Often an experiment goes live before its metrics are precisely defined, so it yields no conclusion or even the wrong one. Sometimes an experiment looks successful on the surface while badly hurting a downstream metric, and nobody notices.

How can the above problems be avoided?

1. Choose the right experimental indicators

1. Three steps of experimental design

(1) Select the experiment metrics

Choosing the experiment metrics is the first step in experimental design. The most critical question to answer is: which metrics can measure the success or failure of the experiment? This step is very important. Just as growth work starts by finding the North Star metric, an experiment starts by finding the right experiment metric.

(2) Determine the audience for the experiment

Determining the audience is the second step in experimental design. We need to clearly define who the experiment targets and estimate the sample size required. If the available traffic falls short of that estimate, we can adjust to the situation, for example by reducing the number of versions in the experiment or by making the change larger so that its effect is easier to detect.
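As a rough illustration of the sample size estimate, the sketch below uses the standard normal-approximation formula for comparing two conversion rates. The function name, the default significance level (0.05), the power (0.8), and the example rates are all assumptions chosen for illustration, not values from the cases in this article.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate users needed per version for a two-sided two-proportion z-test.

    baseline_rate: current conversion rate, e.g. 0.10
    min_detectable_lift: absolute lift we want to detect, e.g. 0.02 (10% -> 12%)
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # critical value for the desired power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical numbers: a 10% baseline and two different sizes of change.
n_small_change = sample_size_per_group(0.10, 0.02)
n_big_change = sample_size_per_group(0.10, 0.04)
print(f"small change: {n_small_change} users per version")
print(f"big change:   {n_big_change} users per version")
print(f"traffic for 3 versions vs 2: {3 * n_small_change} vs {2 * n_small_change}")
```

Under these assumptions, a bolder change needs fewer users per version, and each extra version adds another full group of users, which is why both adjustments help when traffic is limited.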

(3) Design the experiment versions

Designing the experiment versions is the third step in experimental design. If you use a third-party experimentation tool, the whole process is relatively simple. If you build your own experimentation system, the design and development work is more complicated.

This article focuses on how to choose metrics. How to determine the audience for an experiment and how to design its versions will be shared afterwards.

2. Amazon China's selection of experimental indicators

Here's a case study from Amazon China to illustrate why it's important to choose the right metrics for your experiments.

(1) Amazon China's first shopping cart A/B test failed

a. The first experiment used sales as its metric, and the new version performed worse

Amazon China wanted to run an A/B test on its shopping cart. Chinese users are accustomed to using the shopping cart as a favorites list: they select some products at checkout and leave the rest in the cart. Amazon's global shopping cart design, however, checks out everything in the cart at once.

Based on this finding, the Amazon China team designed an A/B test. If Amazon China also let users select only some products for checkout, would this approach, which matches the habits of Chinese users, perform better?

The metric they chose for the first experiment was sales. After the experiment had run for a month, the partial-checkout version lost to the all-checkout version: its sales were lower, so the team had to roll back.

b. Further analysis showed that new users unfamiliar with all-checkout inflated sales, while long-term satisfaction declined

The team was puzzled: why did a design that had succeeded on other e-commerce sites in China not work for Amazon China?

(1) The first finding was that new users who had just encountered the all-checkout version (the old version) were not familiar with it, and many accidentally bought more than they intended, which pushed up the old version's sales. Only some of these users returned the extra items, so its sales stayed relatively high.

(2) However, long-term satisfaction declined among users who bought too much, because some time later they realized they had bought more than they intended.

(3) In the partial-checkout version, many users came back after a while and bought the products they had left in the cart. Those products represented potential sales, but because the purchases were delayed, they did not show up in the results of the first experiment.

c. Lessons from the metric selection in Amazon China's first experiment

In the end, the Amazon China team concluded that if the first experiment had compared more metrics, it might have found that the all-checkout version had high short-term sales but also a high return rate and low long-term satisfaction, while the partial-checkout version had a higher long-term repurchase rate and higher long-term sales. Because the first experiment looked only at short-term sales and ignored other metrics, it led to the erroneous conclusion that the old version was better.

(2) Amazon China refined its experiment metrics, and the second experiment succeeded

Based on these lessons, the Amazon China team redefined the experiment metrics and ran a second experiment.

The core metric changed from short-term sales in the first experiment to comprehensive sales, which includes not only short-term direct sales but also expected long-term sales.

At the same time, a series of auxiliary metrics was added, such as repurchase rate, order frequency, and checkout conversion rate. These metrics alone cannot decide the success or failure of the experiment, but they inform the decision from several angles.

Finally, the return rate was added as a reverse metric to gauge the size of any negative effect.

By looking at a full range of experimental metrics, the new version of the partial checkout came out on top. It not only brought an increase in comprehensive sales, but also an increase in the frequency of placing orders, and finally successfully launched.

The Amazon China team did not change the design of the experiment versions at all. It simply chose more comprehensive and accurate experiment metrics, and the experiment went from failure to success.

The key to a successful A/B experiment therefore lies in selecting the right metrics, including core, auxiliary, and reverse metrics, so that the effect of the experiment is measured comprehensively and accurately.

2. Three types of indicators to accurately and comprehensively measure the success or failure of the experiment

So, if you want to accurately and comprehensively measure the success or failure of the experiment, how should you choose the indicators? It is recommended that you consider choosing three types of experimental indicators: core indicators, auxiliary indicators and reverse indicators.

1. Core metrics: The key metrics that determine the success or failure of an experiment

(1) The core indicator represents the final North Star indicator of the experiment

Core metrics are the key metrics that determine the success or failure of an experiment. For a growth experiment, we need to find the single most critical metric for judging success or failure. This is the metric on which we later run the statistical significance calculation to decide whether the old or the new version is better.
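The significance calculation on a core metric can be sketched as follows, assuming the core metric is a conversion rate and that a pooled two-proportion z-test is an acceptable method. The function name and all counts are made up for illustration. Experiment tools may use a different test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Return (z, two-sided p-value) for the difference B - A in conversion rate."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    # Pooled rate under the null hypothesis that both versions convert equally.
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 4,000 users per version; 400 vs 480 completed the core action.
z, p = two_proportion_z_test(400, 4000, 480, 4000)
print(f"z = {z:.2f}, p = {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```

A positive z here means version B converted better on the core metric. Whether the difference counts as a win still depends on the significance level chosen before the experiment started.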

(2) Case: A/B test of the onboarding section on an app's home page

An app redesigned the onboarding section of its home page, with the goal of helping new users understand the product's features and complete the initial setup.

The experimental hypothesis is that getting users to read more introductory articles on how to use the product helps them complete the initial setup.

Version A arranges the beginner articles as cards, and version B presents them as a list. Judged by click-through rate, the list version (B) performs better. Judged by the completion rate of the initial setup, the card version (A) performs better.

In this case, the core metric should be the completion rate of the initial setup, not the click-through rate of article titles. As with the North Star metric for growth, be careful not to pick a vanity metric when running an experiment.

Take the ultimate goal of the experiment as the criterion, and choose the metric that best represents that goal as the core metric. Although version B had a higher click-through rate, it did worse against the experiment's end goal, so the winner was the card version, A.

2. Auxiliary metrics: a comprehensive understanding of the experimental results

For the vast majority of simple experiments, a single core metric may be sufficient. However, for experiments that are more complex, involve long funnels, or may affect downstream metrics, we also choose auxiliary metrics.

(1) Monitor every step of the user funnel

The second type of metric for measuring the success or failure of an experiment is the auxiliary metric. Auxiliary metrics give us a complete picture of the experiment's results and make sure that no metric is harmed by accident. If an experiment affects the entire user funnel, we should not only look at the final step of the funnel, but also monitor the impact on every step.
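Monitoring every step of a funnel can be sketched as below. The step names and user counts are invented for illustration, and the helper name is hypothetical.

```python
def funnel_step_rates(step_counts):
    """Conversion from each funnel step to the next, given ordered (name, users) pairs."""
    rates = {}
    for (prev_name, prev_users), (name, users) in zip(step_counts, step_counts[1:]):
        rates[f"{prev_name} -> {name}"] = users / prev_users if prev_users else 0.0
    return rates


# Made-up counts for illustration only.
control = [("product page", 10000), ("add to cart", 1500), ("cart page", 900), ("purchase", 450)]
variant = [("product page", 10000), ("add to cart", 1800), ("cart page", 950), ("purchase", 440)]

control_rates = funnel_step_rates(control)
variant_rates = funnel_step_rates(variant)
for step in control_rates:
    delta = variant_rates[step] - control_rates[step]
    print(f"{step}: control {control_rates[step]:.1%}, "
          f"variant {variant_rates[step]:.1%}, change {delta:+.1%}")
```

In this invented data, the variant improves the first step but converts worse at the later steps, which is the kind of effect that looking only at one step would hide.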

(2) Pay attention to downstream metrics and to key metrics for other users

If there are important downstream metrics, we need to check carefully whether the experiment affects them, as well as whether it affects key metrics for other groups of users.

(3) Case study: Airbnb uses a key metrics dashboard to comprehensively evaluate the impact of experiments

Some Silicon Valley companies that run growth experiments at scale, such as Airbnb, build a dashboard of key metrics. The results of every growth experiment are placed on this dashboard to show whether any key metric was affected. Any impact is surfaced there, which avoids accidentally harming a metric.

3. Reverse metrics: a warning of the possible negative effects of the experiment

(1) Why do you need a reverse metric?

Reverse metrics warn of possible negative effects of an experiment. If the negative impact is small or absent, we can declare the experiment successful. If the negative impact is too large, we may reject the result outright, even if the core metric performs better. Generally speaking, one or two reverse metrics are sufficient.
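The decision rule just described can be written down as a small sketch. The function name, the 0.05 significance level, and the 0.01 harm threshold are assumptions for illustration. Real thresholds depend on the product and the metric.

```python
def decide(core_lift, core_p_value, reverse_changes, alpha=0.05, max_harm=0.01):
    """Ship only if the core metric wins and no reverse metric worsens beyond max_harm.

    core_lift: observed change in the core metric (new version minus old version)
    reverse_changes: dict of reverse metric name -> how much worse the new version is
                     (positive means worse, e.g. a higher return rate; for a metric
                     where lower is worse, such as NPS, pass the drop as a positive number)
    """
    harmed = [name for name, change in reverse_changes.items() if change > max_harm]
    if harmed:
        # A large negative effect overrides a win on the core metric.
        return "reject: reverse metrics worsened: " + ", ".join(harmed)
    if core_lift > 0 and core_p_value < alpha:
        return "ship"
    return "no conclusion"


# Hypothetical results for three experiments.
print(decide(0.02, 0.01, {"return rate": 0.002}))
print(decide(0.02, 0.01, {"return rate": 0.03}))
print(decide(0.02, 0.30, {"return rate": 0.0}))
```

Checking the reverse metrics first reflects the point above: a better core metric does not rescue an experiment whose negative impact is too large.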

(2) Common reverse metrics

Common reverse metrics include NPS, app deletion rate, email unsubscribe rate, push unsubscribe rate, and page exit rate.

In summary, core metrics measure the key result of the experiment, auxiliary metrics give a full picture of its effect, and reverse metrics keep negative impacts from being overlooked.

4. Comprehensive case: choosing metrics for an A/B test of the Add to Cart button on an e-commerce website

Suppose an e-commerce website wants to run an A/B test on its Add to Cart button to find out which button design performs better. How should it choose the metrics? Because the Add to Cart button sits on the product detail page, we can map out the entire shopping funnel from that page onward.

(1) Core indicators

In this case, the core metric that should be chosen is the click-through rate of the Add to Cart button itself, as it is the main target of the experiment.

(2) Auxiliary metrics

In this example, although the ultimate goal is to increase sales, there are many steps between adding to the cart and a completed sale, so we should use the Add to Cart click-through rate as the core metric and sales as an auxiliary metric.

Other auxiliary metrics include the number of clicks on the Add to Cart button, the number of people who visit the cart page, the number of people who complete a purchase from the cart, the repeat purchase rate, and so on.

(3) Reverse metrics

The reverse metric here could be the return rate.

By choosing all three types of metrics well, we can fully measure the impact of this change on the entire shopping funnel, rather than seeing only one aspect and missing the others.
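One way to record such a metric plan before the experiment starts is sketched below. The class and the metric identifiers are hypothetical names invented for this example. Only the grouping into core, auxiliary, and reverse metrics comes from the case above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPlan:
    """Metrics for one experiment, grouped by the role they play in the decision."""
    core: str
    auxiliary: tuple = ()
    reverse: tuple = ()

    def role_of(self, metric):
        """Return the role of a metric name, or 'untracked' if it is not in the plan."""
        if metric == self.core:
            return "core"
        if metric in self.auxiliary:
            return "auxiliary"
        if metric in self.reverse:
            return "reverse"
        return "untracked"


# Hypothetical identifiers for the metrics named in the Add to Cart case.
add_to_cart_plan = MetricPlan(
    core="add_to_cart_click_through_rate",
    auxiliary=("sales", "add_to_cart_clicks", "cart_page_visitors",
               "successful_purchases", "repeat_purchase_rate"),
    reverse=("return_rate",),
)

for name in ("add_to_cart_click_through_rate", "sales", "return_rate"):
    print(f"{name}: {add_to_cart_plan.role_of(name)}")
```

Writing the plan down up front makes it harder to promote a flattering auxiliary metric to the deciding role after the results are in.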

The above covers how to measure the results of experiments accurately and comprehensively through three types of experiment metrics, and how to ensure scientific traffic splitting and credible results through A/B testing systems and tools.

Therefore, instead of being afraid of failures and challenges in experiments, we should focus on improving our experimental design through scientific methods: correct selection of metrics, in-depth understanding of audiences, and scientific traffic splitting. Every experiment is a step towards success.


This article was originally published by @小黑哥 on Everyone is a Product Manager. Reproduction without permission is prohibited.

The title image is from Unsplash and is licensed under CC0

The views in this article are the author's own. The Everyone is a Product Manager platform only provides information storage services.
