In the town of Perfectville, a company ran a winning A/B test with a 20% lift. A few weeks after implementing the winner, they checked their daily conversions data:
97 days of daily conversion rates in Perfectville showing 20% lift
The graph perfectly reflected what happened: the baseline increased by 10% during the test, since only half the traffic was exposed to the winning variation. Then came a week when the test was stopped, followed by a 20% lift once the winner was implemented.
The good people in nearby Realville heard about this and ran the test on their site. When they later checked their daily conversions data, they scratched their heads (as they often do in Realville):
97 days of daily conversion rates in Realville showing same improvement
The data actually includes the same 10% lift during the test, a gap, and a final 20% improvement. The problem is that the improvement is buried in the natural day-to-day fluctuations of the conversion rate, so a 20% improvement doesn’t necessarily look like a 20% lift on the graph.
Here are 6 reasons why people in Realville might find it difficult to see a lift and what they can do about it.
Reason 1: The effect is too small
The smaller the lift, the harder it is to see through the noise. If the conversion rate drops for some reason unrelated to the test, the lift from your winner might not even offset that drop. For example, here is one week of simulated daily conversion rates followed by a week with either a 5% or a 20% lift. If the lift were 5%, it would look as though the test actually did worse in the second half:
7 days at baseline followed by 7 days with 5% vs. 20% lift
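To get a feel for how easily a small lift disappears into day-to-day noise, you can simulate it yourself. This is a minimal sketch, assuming a hypothetical 3% baseline conversion rate and about 2,000 visitors per day; the exact numbers don’t matter, only how much the 5% lift blends into the counting noise:

```python
import numpy as np

rng = np.random.default_rng(42)

visitors_per_day = 2000   # hypothetical traffic level
baseline_rate = 0.03      # hypothetical 3% baseline conversion rate

def simulate_week(rate):
    """Simulate 7 days of observed conversion rates with binomial (counting) noise."""
    conversions = rng.binomial(visitors_per_day, rate, size=7)
    return conversions / visitors_per_day

week_baseline = simulate_week(baseline_rate)
week_lift_5   = simulate_week(baseline_rate * 1.05)  # 5% lift
week_lift_20  = simulate_week(baseline_rate * 1.20)  # 20% lift

print("baseline week (%):", np.round(week_baseline * 100, 2))
print("5% lift week  (%):", np.round(week_lift_5 * 100, 2))
print("20% lift week (%):", np.round(week_lift_20 * 100, 2))
```

Run it a few times: the 20% lift usually stands out, while the 5% lift is often indistinguishable from a lucky baseline week.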
Have you just run a test and are you looking at during-test data? You likely won’t see much of an effect. Typically only 70-80% of visitors will join the test (more on this below), and those visitors are split among your variations. If 80% of your traffic actually participated in an A/B/C test, only a third of that 80% is exposed to the winning variation. So a 20% lift would show up as roughly a 5% lift overall.
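The dilution is simple arithmetic, but it’s easy to overlook; here is the calculation spelled out, using the participation rate and variation count from the example above:

```python
participation = 0.80   # share of visitors who actually enter the test
variations = 3         # A/B/C test: control plus two variations
true_lift = 0.20       # lift of the winning variation

# Only the slice of traffic exposed to the winner contributes any lift
# to the overall during-test numbers.
exposed_share = participation / variations   # ~0.27 of all traffic
overall_lift = exposed_share * true_lift     # ~0.053, i.e. about 5%
print(f"Overall lift visible in during-test data: {overall_lift:.1%}")
```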
What you can do:
- Look for a larger cumulative upward trend after several tests.
- Compare longer timescales for baseline and post-implementation data.
Reason 2: Your baseline is too variable or you’re not looking at enough data
In Perfectville, conversions are constant each day, each week, each month. This means a 20% improvement causes a 20% lift. Not so in Realville. In Realville, daily conversions naturally fluctuate, so the full potential for improvement may not manifest. The more your conversions fluctuate, the harder it is to see the lift in the data.
Here are two similar simulated data sets with low and high variability, both showing a 20% lift. The lift is more obvious when variability is lower:
A similar 20% lift with low and high variability
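If you prefer numbers to a chart, here is a rough sketch that adds two different amounts of day-to-day noise on top of the same 20% lift (the rates and noise levels are assumptions for illustration, not data from the article):

```python
import numpy as np

rng = np.random.default_rng(7)
days, baseline, lift = 14, 0.03, 1.20   # hypothetical rate and a 20% lift

def daily_rates(mean_rate, noise_sd):
    """Daily conversion rates with extra day-to-day variation around the mean."""
    return rng.normal(mean_rate, noise_sd, size=days).clip(min=0)

for noise_sd in (0.001, 0.006):   # low vs. high day-to-day variability
    before = daily_rates(baseline, noise_sd)
    after = daily_rates(baseline * lift, noise_sd)
    # Share of post-lift days that still fall below the old average:
    overlap = (after < before.mean()).mean()
    print(f"noise sd={noise_sd}: {overlap:.0%} of lifted days look no better than the old mean")
```

With low variability almost every lifted day beats the old average; with high variability a sizable share of lifted days still looks like business as usual.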
Sales may fluctuate for many reasons (weekly cycles, seasonality, your marketing activities, unexpected traffic). The smaller your sample, the higher the chance that the pattern you’re looking for simply won’t be there. For example, if you saw only the middle segment of the full graph below, you’d never know that the right, orange side of the graph shows a 20% improvement:
14 days of simulated daily conversion rates (blue), then 14 days with a 20% improvement (orange)
What you can do:
- Zoom out to reduce variability. If the daily data is too noisy, look at two-day or weekly rates instead (see the sketch after this list)
- Look at more data to cover a full cycle of ups and downs, e.g., a week (note that the lower your conversion rate, the more data you need to see an effect)
- Check your site analytics to see what might have been different that week. Check if dips have happened before. Might one have coincided with the test?
- If the data has a lot of variation, it is hard to estimate visually. Compare what I’ll call the “clipping rates”. In this graph, you see higher peaks as well as a higher frequency of peaks in the second half:
20% lift manifests in more frequent and higher peaks
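As a sketch of the “zoom out” advice above, here is one way to aggregate noisy daily numbers into weekly rates with pandas; the traffic levels and rates are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical daily data: 4 weeks at a 3% baseline, then 4 weeks with a 20% lift.
days = pd.date_range("2024-01-01", periods=56, freq="D")
visitors = rng.integers(1500, 2500, size=56)
rate = np.where(np.arange(56) < 28, 0.030, 0.036)
conversions = rng.binomial(visitors, rate)

df = pd.DataFrame({"visitors": visitors, "conversions": conversions}, index=days)

# Weekly rates pool roughly 7x the traffic per point, so they are far less noisy
# than the daily rates and the step up in the second half is easier to spot.
weekly = df.resample("W").sum()
weekly["rate"] = weekly["conversions"] / weekly["visitors"]
print(weekly["rate"].round(4))
```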
Reason 3: Not everyone was part of your test
Even if you didn’t put exclusion conditions on the test, some visitors were excluded.
For example, mobile visitors are excluded by default. Another 10-20% of visitors normally get excluded when the A/B testing tool times out. Technical implementation issues can exclude another 10-20% of visitors, such as JavaScript-heavy sites or tracking code that isn’t implemented in the right place.
Moreover, gaps in test design can create a discrepancy between test and sales data. For example, we ran a test on the home page of a basic single-product site and noticed that our test data was missing many sales. After investigating, it turned out that about 50% of purchases came from people who never visited the home page, as well as from existing customers arriving via a special upgrade page we hadn’t considered.
As a result of these exclusions, when you implement your winner, you may be exposing it to segments you didn’t test it on. For example, although you tested on desktop and saw a 20% lift, the same design on mobile might cause a 30% drop. So if you made the winner your new home page for all traffic, the drop on mobile could counteract some or all of the lift, depending on how much mobile traffic you have.
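A quick back-of-the-envelope calculation shows how a strong desktop winner can be washed out by an untested segment. The traffic split and the mobile drop below are assumptions for illustration:

```python
# Hypothetical traffic mix and per-segment effects, not data from the article.
segments = {
    "desktop": {"share": 0.60, "lift_pct": 20},   # tested: 20% lift
    "mobile":  {"share": 0.40, "lift_pct": -30},  # never tested: assume a 30% drop
}

# Approximate the net effect as the traffic-weighted average of the relative
# lifts (exact only if the segments have similar baseline conversion rates).
net_lift_pct = sum(s["share"] * s["lift_pct"] for s in segments.values())
print(f"Net lift across all traffic: {net_lift_pct:+.0f}%")   # 0.6*20 - 0.4*30 = 0
```

With that mix, the untested mobile drop cancels the desktop lift entirely, and the overall conversion rate doesn’t move at all.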
What you can do:
- Factor in 20% exclusions due to technical issues, like timeouts
- Set up an inverse test to see how many sales are bypassing your main test (target the pages and visitors that are excluded from your main test)
- When looking at sales or conversion data, keep in mind that it probably includes segments you didn’t test on. Test the design on all segments that will be exposed to it, e.g., new customers and existing customers. For mobile, build and test a dedicated mobile version
Reason 4: You are eyeing it instead of using math
Sometimes a lift is obvious. Other times you need to use math. Here’s a sample of real conversion data with about 20 days of baseline followed by 20 days of the improved version:
Just over a month of real conversion data with winner on the right
The lift is not visually obvious. Nonetheless, the average for the first 20 days is 0.71%, whereas the average for the last 20 days is 0.85%, which is a 20% lift. However, if the standard deviation of the data is high, the difference in averages may be coincidental, so it’s worth checking whether the difference is statistically significant rather than trusting your eyes.
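One simple way to do that check is a two-sample test on the daily rates. The numbers below are hypothetical stand-ins chosen so the two averages come out at roughly 0.71% and 0.85% like the chart above; they are not the actual data:

```python
import numpy as np
from scipy import stats

# Hypothetical daily conversion rates (%), stand-ins for the two ~20-day periods.
before = np.array([0.64, 0.81, 0.70, 0.59, 0.77, 0.66, 0.73, 0.68, 0.75, 0.62,
                   0.79, 0.71, 0.67, 0.74, 0.69, 0.72, 0.65, 0.78, 0.70, 0.76])
after  = np.array([0.82, 0.91, 0.79, 0.88, 0.84, 0.80, 0.93, 0.87, 0.78, 0.86,
                   0.90, 0.83, 0.85, 0.89, 0.81, 0.92, 0.84, 0.88, 0.80, 0.86])

print(f"mean before: {before.mean():.2f}%   mean after: {after.mean():.2f}%")
print(f"relative lift: {after.mean() / before.mean() - 1:.0%}")

# Welch's t-test: is the gap between the two averages larger than what the
# day-to-day variation alone would plausibly produce?
t, p = stats.ttest_ind(after, before, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```

A small p-value means the gap is unlikely to be a coincidence of daily fluctuations; with noisier data, the same 20% gap in averages can easily come out non-significant.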
Reason 5: Your design or conditions are not the same
This happens all the time. You run a winning test, then you tweak the winner before pushing it to your site. It’s entirely possible that those visual tweaks reduced the effectiveness of your variation.
It’s also possible that conditions are different now from when you ran the test. Did you test during the holidays, or launch during the holidays?
Are you applying the change to a different page? You might have several pages that look similar: you tested something on one of them, then decided to apply it to all of them in one go. There is no guarantee that the same concept will work equally well on the other pages.
What you can do:
- Check your site analytics to see what conditions might be different now and retest if necessary
- If you know you will be changing something, apply the changes to the variation and test it with those changes in place
- Implement the winner as the new control and then test any further changes against it
- Retest on each page or site if you have reason to believe the outcome may be different
Reason 6: It was a false positive
Yes, it happens all the time. There are many reasons you might have gotten a false positive, including improper test design and not running your test long enough. The most common scenario is that you run your test until you see a winner and then stop. I’ve seen results that looked very exciting flatten out after 3-4 weeks.
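You can see why stopping as soon as a winner appears inflates false positives by simulating A/A tests, where there is no real difference at all, and checking them for “significance” every day. This is a toy sketch with made-up traffic numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def aa_test_with_peeking(days=28, visitors_per_day=1000, rate=0.03, alpha=0.05):
    """One A/A test (no real difference) that is checked for significance daily."""
    a = rng.binomial(visitors_per_day, rate, size=days).cumsum()  # cumulative conversions, arm A
    b = rng.binomial(visitors_per_day, rate, size=days).cumsum()  # cumulative conversions, arm B
    n = (np.arange(days) + 1) * visitors_per_day                  # cumulative visitors per arm
    z_crit = stats.norm.ppf(1 - alpha / 2)
    for day in range(days):
        # Simple two-proportion z-test on the cumulative counts so far.
        p_pool = (a[day] + b[day]) / (2 * n[day])
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n[day])
        if se > 0 and abs(a[day] - b[day]) / n[day] / se > z_crit:
            return True   # would have been declared a "winner" and stopped early
    return False

runs = 2000
false_positives = sum(aa_test_with_peeking() for _ in range(runs))
print(f"A/A tests 'won' by stopping at the first significant day: {false_positives / runs:.0%}")
```

With daily peeking, the share of A/A tests that produce a “winner” comes out several times higher than the nominal 5%, which is exactly the kind of exciting result that later flattens out.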
What you can do:
- Follow the great tips on http://goodui.org/betterdata to ensure you get good data
Back to Realville
Let’s say Realville decided to retest the Perfectville winner 8 more times (it took years!). They found that the overall tendency of the variation was indeed towards an increase, following the same pattern as Perfectville’s test: a small lift during the test, a slight dip when the test is stopped, and then a larger lift after the final launch. However, despite the overall trend, the individual outcomes showed that chance is still a factor in this imaginary scenario:
Let me know if you apply some of these concepts and find them useful.