vlad malik

I Have An A/B Test Winner, So Why Can’t I See The Lift?
3 years ago

In the town of Perfectville, a company ran a winning A/B test with a 20% lift. A few weeks after implementing the winner, they checked their daily conversions data:

 


97 days of daily conversion rates in Perfectville showing 20% lift

 

The graph perfectly reflected what happened: the baseline rose by 10% during the test, when half the traffic was exposed to the winning variation. Then came the week when the test was stopped, followed by a lift of 20% once the winner was implemented.

The good people in nearby Realville heard about this and ran the test on their site. When they later checked their daily conversions data, they scratched their heads (as they often do in Realville):

 


97 days of daily conversion rates in Realville showing same improvement

 

The data actually includes the same 10% lift during the test, a gap, and a final 20% improvement. The problem is that the improvement sits on top of natural fluctuations in daily conversion rates, so a 20% improvement doesn’t necessarily show up as a visible 20% lift.

Here are 6 reasons why people in Realville might find it difficult to see a lift and what they can do about it.

Reason 1: The effect is too small

The smaller the lift, the harder it is to see it through the noise. If the conversion rate drops for some reason unrelated to the test, the lift from your winner might not even offset that. For example, here’s 1 week of simulated daily conversion rates followed by a week with a 20% lift compared to a 5% lift. If the lift were 5%, it would look as though the test actually did worse in the second half:

 


7 days at baseline followed by 7 days with 5% vs. 20% lift

 

Are you looking at data from while the test was still running? You likely won’t see much of an effect. Typically only 70-80% of visitors join the test (more on this below), and these are split among your variations. If 80% of your traffic participated in an A/B/C test, only a third of those visitors were exposed to the winning variation. So a 20% lift would show up as roughly 5% overall (0.8 × 1/3 × 20% ≈ 5.3%).
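The dilution arithmetic above is easy to reproduce. Here is a minimal sketch; the function name `diluted_lift` and its parameters are my own, and it assumes an even traffic split and that everyone who doesn’t see the winner converts at the baseline rate:

```python
def diluted_lift(true_lift, participation, n_variations):
    """Overall lift visible in site-wide data while the test is running.

    true_lift:     relative lift of the winner (0.20 means 20%)
    participation: share of all visitors who actually enter the test
    n_variations:  number of arms including the control, split evenly
    Assumes every visitor who does not see the winner converts at baseline.
    """
    share_seeing_winner = participation / n_variations
    return true_lift * share_seeing_winner


# A/B/C test, 80% of traffic participating, 20% winner
print(f"{diluted_lift(0.20, 0.80, 3):.1%}")  # 5.3%
```

The same function gives 10% for Perfectville’s during-test period: `diluted_lift(0.20, 1.0, 2)`, since everyone participated and half the traffic saw the winner.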

What you can do:

 

Reason 2: Your baseline is too variable or you’re not looking at enough data

In Perfectville, conversions are constant each day, each week, each month. This means a 20% improvement causes a 20% lift. Not so in Realville. In Realville, daily conversions naturally fluctuate, so the full potential for improvement may not manifest. The more your conversions fluctuate, the harder it is to see the lift in the data.

Here are two similar simulated data sets with low and high variability, both showing a 20% lift. The lift is more obvious when variability is lower:

 


A similar 20% lift with low and high variability
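To get a feel for how much variability matters, you can simulate it. This is a rough sketch, not how the graphs here were produced: it assumes daily conversion rates scatter around their average following a normal distribution, and the base rate and daily standard deviations are made-up numbers. It counts how often 14 days with a true 20% lift average out no better than the 14 days before:

```python
import random
from statistics import mean


def share_of_hidden_lifts(base_rate, lift, daily_sd, days=14, trials=2000, seed=1):
    """Fraction of simulated runs where the 'after' average is not higher
    than the 'before' average, even though the true lift is real."""
    rng = random.Random(seed)
    hidden = 0
    for _ in range(trials):
        # Normal noise is a simplification of real day-to-day fluctuation
        before = [rng.gauss(base_rate, daily_sd) for _ in range(days)]
        after = [rng.gauss(base_rate * (1 + lift), daily_sd) for _ in range(days)]
        if mean(after) <= mean(before):
            hidden += 1
    return hidden / trials


# Hypothetical site: 1% baseline conversion rate, true 20% lift
low = share_of_hidden_lifts(0.010, 0.20, daily_sd=0.001)
high = share_of_hidden_lifts(0.010, 0.20, daily_sd=0.004)
print(f"low variability:  lift hidden in {low:.1%} of runs")
print(f"high variability: lift hidden in {high:.1%} of runs")
```

Raising `days` shrinks the share of hidden lifts, which is the other half of this reason: more data makes the same lift easier to find.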

 

Sales may fluctuate for many reasons (weekly and seasonal cycles, your marketing activities, unexpected traffic). The smaller your sample, the more likely it is that the pattern you’re looking for simply won’t show up. For example, if you just saw the middle segment of the full graph below, you’d never know that the right, orange side of the graph shows a 20% improvement:

 

14 days of simulated daily conversion rates (blue), then 14 days with a 20% improvement (orange)

 

What you can do:

 


20% lift manifests in more frequent and higher peaks

 

Reason 3: Not everyone was part of your test

Even if you didn’t put exclusion conditions on the test, some visitors were excluded.

For example, mobile visitors are excluded by default. Another 10-20% of visitors normally get excluded when the A/B testing tool times out. Further technical implementation issues, such as JavaScript-heavy pages or tracking code placed in the wrong spot, can cause another 10-20% of visitors to be excluded.

Moreover, gaps in test design can create a discrepancy between test and sales data. For example, we ran a test on the home page of a basic single-product site and noticed that our test data was missing many sales. After investigating, it turned out that about 50% of purchases came from people who never visited the home page, and from existing customers who bought through a special upgrade page that we hadn’t considered.

As a result of these exclusions, when you implement your winner, you may be exposing it to segments you didn’t test it on. For example, although you tested on desktop and saw a 20% lift, the same design on mobile might cause a 30% drop. So, if you made the winner your new home page for all traffic, the drop in mobile could counteract some of the lift (say, if you had lots of mobile traffic).
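Here is a small sketch of that desktop-versus-mobile arithmetic. The helper `blended_lift` is my own, and it weights each segment by its share of baseline conversions (not raw traffic), which is an assumption you would need to check against your own data:

```python
def blended_lift(segments):
    """Overall relative lift across segments.

    segments: list of (share_of_baseline_conversions, relative_lift) pairs.
    The shares are expected to sum to 1.
    """
    return sum(share * lift for share, lift in segments)


# +20% on desktop, -30% on mobile, for different hypothetical mobile shares
for mobile_share in (0.1, 0.3, 0.5):
    overall = blended_lift([(1 - mobile_share, 0.20), (mobile_share, -0.30)])
    print(f"mobile share {mobile_share:.0%}: overall lift {overall:+.1%}")
```

With these numbers the break-even point is mobile accounting for 40% of baseline conversions; beyond that, the "winner" makes things worse overall.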

What you can do:

 

Reason 4: You are eyeing it instead of using math

Sometimes a lift is obvious. Other times you need to use math. Here’s a sample of real conversion data with about 20 days of baseline followed by 20 days of the improved version:

 


Just over a month of real conversion data with winner on the right

 

The lift is not visually obvious. Nonetheless, the average for the first 20 days is 0.71%, whereas the average for the last 20 days is 0.85%, which is roughly a 20% lift (0.85 / 0.71 ≈ 1.20). However, if the standard deviation of the data is high, the difference in averages may be coincidental.
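One way to check whether a difference in averages could be coincidental is a permutation test: shuffle the days at random many times and see how often a shuffled "before vs. after" split produces a gap at least as big as the real one. The sketch below is one possible approach, not the calculation behind the numbers above. The daily values are synthetic, generated around 0.71% and 0.85% with an assumed daily standard deviation of 0.25 percentage points, so swap in your own data:

```python
import random


def avg(values):
    return sum(values) / len(values)


def permutation_p_value(before, after, n_shuffles=10000, seed=0):
    """One-sided p-value: how often does a random relabelling of the days
    give a difference in means at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = avg(after) - avg(before)
    pooled = list(before) + list(after)
    n_before = len(before)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = avg(pooled[n_before:]) - avg(pooled[:n_before])
        if diff >= observed:
            hits += 1
    return hits / n_shuffles


# Synthetic daily conversion rates in percent (NOT the real data from the graph)
data_rng = random.Random(42)
before = [data_rng.gauss(0.71, 0.25) for _ in range(20)]
after = [data_rng.gauss(0.85, 0.25) for _ in range(20)]

p = permutation_p_value(before, after)
print(f"observed lift: {avg(after) / avg(before) - 1:+.0%}, p = {p:.3f}")
```

A small p-value (say, below 0.05) suggests the gap is unlikely to be noise alone. It still can’t rule out other changes between the two periods, such as seasonality or a marketing push.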

 

Reason 5: Your design or conditions are not the same

This happens all the time. You run a winning test, then you tweak the winner before pushing it to your site. It’s entirely possible that those visual tweaks reduced the effectiveness of your variation.

It’s also possible that conditions when you launched the winner were different from conditions during the test. Did you test during the holidays, or launch during the holidays?

Are you including a different page? You might have several pages that look similar. So you tested something on one page, and you decided to apply it in one go to all the pages. If so, there is no guarantee that the same concept will work equally well on other pages.

What you can do:

 

Reason 6: It was a false positive

Yes, it happens all the time. There are many reasons you might have gotten a false positive, including improper test design and not running your test long enough. The most common scenario is you run your test until you see a winner and stop. I’ve seen results that looked very exciting flatten after 3-4 weeks.
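You can see how much "run until there’s a winner, then stop" inflates false positives with a small simulation. This sketch runs A/A tests, where both arms share the same true conversion rate, so any declared winner is a false positive. The traffic numbers, the function names, and the plain two-proportion z-test with a 1.96 cutoff (95% confidence, two-sided) are my assumptions; your testing tool may use a different method:

```python
import math
import random


def z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic with a pooled standard error."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def false_positive_rate(peek_daily, rate=0.10, visitors_per_day=100,
                        days=20, trials=400, seed=7):
    """Share of A/A tests (no real difference) that get declared significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        for day in range(1, days + 1):
            # Each visitor converts independently with the same rate in both arms
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_day))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_day))
            n += visitors_per_day
            is_last_day = day == days
            if (peek_daily or is_last_day) and abs(z_score(conv_a, n, conv_b, n)) > 1.96:
                false_positives += 1
                break  # stop the test as soon as a "winner" appears
    return false_positives / trials


once = false_positive_rate(peek_daily=False)
peeking = false_positive_rate(peek_daily=True)
print(f"check once at the end:              {once:.1%} false positives")
print(f"check daily, stop at first 'win':   {peeking:.1%} false positives")
```

Checking once at the end should land near the nominal 5%, while checking every day and stopping at the first significant result comes out several times higher.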

What you can do:

 

Back To Realville

Let’s say Realville decided to retest the Perfectville winner 8 more times (it took years!). They found that, indeed, the overall tendency of the variation was towards an increase, following the same pattern as Perfectville’s test: a small lift during the test, a slight dip when the test is stopped, and then a larger lift after final launch. However, despite the overall trend, individual outcomes showed that chance is a factor in this imaginary scenario.

Let me know if you apply some of these concepts and find them useful.

@VladMalik is an interaction designer and musician based in Toronto.
I enjoy breath-hold diving, weight-lifting, and chopping wood. I am vegan.



© 2015 License for all content: Attribution not required. No commercial use.