Tips for A/B testing with low traffic
After reading this post, you will be able to say whether your test has “low traffic”, decide if A/B testing is worth it, and know what to do if you decide to A/B test.
Technique 1: Do confirmatory not exploratory testing
Exploratory testing = you run an A/B test to look for big or small changes that will increase your conversion rate. You come up with some ideas, then you test them to see which of them work.
Confirmatory testing = you make a risky update (e.g., remove the free trial) and want to confirm there is no huge negative effect (risk mitigation), or you make an aesthetic site update and want to see that it's at least no worse than before.
If you get <1,000 visitors per month with 5% converting, you should not be doing exploratory A/B testing.
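To see why, here's a rough back-of-the-envelope sketch (my own calculation, assuming the conventional 95% confidence and 80% power defaults; your planner's settings may differ) of the smallest relative lift such a site can detect:

```python
import math

def relative_mde(p, n_per_arm, z_alpha=1.96, z_beta=0.84):
    """Smallest relative lift detectable over baseline rate p with
    n_per_arm visitors per variation (normal approximation,
    95% confidence two-sided, 80% power by default)."""
    absolute_mde = (z_alpha + z_beta) * math.sqrt(2 * p * (1 - p) / n_per_arm)
    return absolute_mde / p

# 1,000 visitors/month split 50/50, run for 2 months -> ~1,000 per arm
print(f"{relative_mde(0.05, 1000):.0%}")  # prints 55%
```

A minimum detectable effect around 55% means only implausibly huge wins are even visible, which is why exploratory testing is off the table.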
The best thing you can do instead is proactively look for bugs on your site. You can also do 1-on-1 user testing, run surveys, or simply deploy your best design and watch your conversion trends.
However, you might still be able to do confirmatory testing.
For example, say I ran a test for 2 months and found this statistically insignificant result (70% confidence level):
Based on this, I'm confident enough that my new redesign is no worse than the original, with some chance it might be better. That is useful information.
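If you want to sanity-check a result like this yourself, here is a minimal sketch of the one-sided comparison. The counts below are hypothetical, chosen only for illustration (the 70% figure above comes from the actual test data):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def confidence_b_not_worse(conv_a, n_a, conv_b, n_b):
    """One-sided confidence that variation B converts at least as well
    as control A (two-proportion z-test, normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return normal_cdf((p_b - p_a) / se)

# Hypothetical 2-month result: control 25/500, redesign 28/500
print(f"{confidence_b_not_worse(25, 500, 28, 500):.0%}")  # prints 66%
```

That's nowhere near a conventional 95% threshold, but for "confirm the redesign isn't a disaster" it can be enough.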
Technique 2: Find a proxy metric
You can increase your test's sensitivity by using a metric with a higher baseline rate. Say, for example, that your primary metric is purchases, but your purchase rate is only 2%. Here's the lift you can detect (try it yourself: Vlad's What-If A/B Test Planner):
What’s a key milestone prior to sales? Let’s say your form fill (not yet submitted) rate is 3%. Maybe form starts are 4% (e.g., people enter an email). If we measure form starts, much smaller changes are detectable:
Of course, form starts are not purchases, but it’s a behavior that suggests improvement AND it’s a good supporting metric.
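To put numbers on the sensitivity gain, here's a rough sample-size sketch (my own calculation under the usual normal-approximation, 95% confidence, 80% power assumptions; the planner may use slightly different ones):

```python
import math

def n_per_arm(p, rel_lift, z_alpha=1.96, z_beta=0.84):
    """Visitors needed per variation to detect a relative lift over
    baseline rate p (normal approximation, 95% conf., 80% power)."""
    delta = rel_lift * p
    return (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2

# Same 20% lift, measured on purchases (2%) vs. form starts (4%)
print(round(n_per_arm(0.02, 0.20)))  # ~19,208 per arm on purchases
print(round(n_per_arm(0.04, 0.20)))  # ~9,408 per arm on form starts
```

Doubling the baseline rate roughly halves the traffic you need for the same relative lift.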
Here’s how to use this to analyze your test. Say at the end of the test, you find these effects:
As you can see, in terms of purchases the variation beat the control by only 10%, and the result is not statistically strong. But the preceding steps show a progression from engagement to purchases, and the shallower goals are statistically stronger. So the big picture here is actually pretty good.
You can also compare the performance of various metrics to see which are fairly in sync. For example, here’s what it might look like if Form Field Engagement is a great proxy for Revenue:
A word of caution: The mere fact that the metrics line up DOES NOT increase the likelihood that B is the winner. These are correlated metrics that measure the same behavior (i.e., “purchase” implies “form completion”, which implies “form engagement”, which implies “scrolling to the form”, and so on). These metrics will line up whether the effect is real or a false positive. What you CAN say is that, since these metrics are correlated, you can use the shallower metric as a proxy for the deeper one.
Technique 3: Look for consistent performance
With low traffic, you get too few visitors per day to gauge daily variability, but you can track weekly trends. If a variation is winning more consistently, then it is more likely to be a winner.
Here is week-to-week performance over 7 weeks for two tests, both showing a 6% cumulative improvement:
But one test shows a lot of week-to-week variability, whereas the other shows blue winning for 5 weeks straight. All other things being equal, the second offers more reliable evidence.
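One crude way to quantify that consistency is a sign test on the weekly winners. This is a rough check, not a substitute for the main analysis: it treats weeks as independent coin flips and ignores effect size entirely.

```python
from math import comb

def sign_test_pvalue(wins, weeks):
    """Chance of the variation winning at least `wins` of `weeks`
    weekly comparisons if it were truly no better than control."""
    return sum(comb(weeks, k) for k in range(wins, weeks + 1)) / 2 ** weeks

print(sign_test_pvalue(5, 7))  # ~0.23 -- weak evidence on its own
print(sign_test_pvalue(7, 7))  # ~0.008 -- winning every week is rare by chance
```

A variation that wins nearly every week is doing something a coin flip rarely does, which is exactly the kind of supporting signal a low-traffic test needs.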
Technique 4: Compare segments
Another way of checking consistency is to compare performance across user segments. For example, we tested the same variation simultaneously on two virtually identical landing pages on different domains. We expected the variations to do similarly on both. Comparing day-to-day performance for a sample week, we see that the effect sizes for the two tests are moving roughly in sync:
If the variations are not in sync, it could mean the effect is spurious or that the segments are dissimilar. However, if the segments are in sync, it's a good sign. If you have a low daily conversion count, check two-day or weekly rates instead; otherwise, performance jumps around too much just by chance.
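A simple way to quantify “in sync” is the correlation between the two segments' observed lifts. The daily series below are made up for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical observed lift (B vs. A) per day on two similar landing pages
domain_1 = [0.04, 0.09, -0.02, 0.07, 0.11, 0.03, 0.06]
domain_2 = [0.05, 0.08, 0.00, 0.06, 0.10, 0.02, 0.07]
print(f"{pearson(domain_1, domain_2):.2f}")  # close to 1: segments move in sync
```

A correlation near 1 suggests the two segments are seeing the same effect; a correlation near 0 means at least one of the series is mostly noise.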
Technique 5: Don’t expect big changes to bring a big win
Some people will tell you to aim for a large effect size by making big changes. For example, with 4,000 visitors and a 5% rate, you can detect an impressive 54% to 89% lift:
That’s true, but how exactly are you going to achieve that? Is that realistic and worth the effort?
Big design changes have the potential to bring bigger wins, but that doesn't mean a big win is likely. Big tests fail often, and they take more development effort. So instead of fixing bugs on your site and building new features, you may end up doing lots of testing work with no results.
In the above scenario of 54–89% lifts, you are also likely to hit a large false positive and come away believing your design worked:
The only way to reduce the potential for large false positive errors is more traffic, even if you’re testing big changes. Hope for the best, but plan on testing for a while and putting in more time before you hit upon a win.
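A quick simulation makes the danger concrete. Even when A and B truly convert at the same rate, some tests will come out “significant” — and at this traffic level, every one of those false winners necessarily shows a huge lift (this is simulated data, not a real test):

```python
import math
import random

def false_positive_demo(p=0.05, n=500, trials=1000, seed=1):
    """Simulate A/B tests where A and B truly convert at the same rate p,
    and record how often (and how big) a 'significant' lift appears."""
    rng = random.Random(seed)
    sig_lifts = []
    for _ in range(trials):
        conv_a = sum(rng.random() < p for _ in range(n))
        conv_b = sum(rng.random() < p for _ in range(n))
        p_a, p_b = conv_a / n, conv_b / n
        se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
        if se > 0 and (p_b - p_a) / se > 1.96:  # "B wins!" (one-sided z-test)
            sig_lifts.append((p_b - p_a) / p_a)
    return len(sig_lifts) / trials, sig_lifts

rate, lifts = false_positive_demo()
print(f"false winners: {rate:.1%}")
if lifts:
    print(f"smallest 'winning' lift observed: {min(lifts):.0%}")
```

With 500 visitors per arm, the significance threshold itself sits around a 50% lift — so any “significant” result, true or false, will look spectacular.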
More techniques for planning tests and analyzing the results
What to test:
- prioritize big changes that require little effort
- test radical changes (think big conceptual shift, not just big visual changes)
- start with ideas you have good, research-backed reasons to test
How to test:
- test 1 idea at a time
- plan to run tests longer
- keep reminding your team a test is running
- you don’t have to freeze development, as long as you make changes globally (to all variations equally)
How to interpret:
- you might have to tolerate variations going “red” for days as data is collected slowly
- keep in mind that the effect size, even if the effect is real, is likely inflated
- ignore big lifts if you’ve got only ~1,000 visitors so far (remember your false positive risk)
How much traffic is enough?
For example, with 20,000 monthly visitors and 0.5% conversions, you’ll have a tough time testing. But with 15,000 visitors and 10% conversions, you get a decent spread of effect sizes around 15% that are detectable within 1 month:
This is what I’d call “adequate traffic”.
Adequate traffic = you can run a test for 1 month or less with enough sensitivity to detect a 15% lift. Anything that takes over 1 month, or requires an unrealistically large effect, is low traffic.
Why 1 month? Because beyond that, things tend to get messy. Users clear cookies and re-enter in different variations, your dev team accidentally introduces some change, and so on. Once you start talking about months instead of weeks, the test becomes a burden instead of an opportunity.
Why 15%? In my opinion, 15% is the sweet spot. If your sensitivity is aimed at 15% but you detect a 30% effect, then great – you’ll either have super-strong data or you can stop a bit earlier. If you detect a 10% effect, then you probably still have decent sensitivity to see a suggestive result.
Conversely, if you aim for 50% and the true effect is 15%, then you’ll be chasing phantoms. Since you virtually never know what to expect, it’s best to be conservative. I found that 15% is roughly the effect size at which A/B testing becomes reasonable for many sites.
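You can turn this definition into a quick screening calculation (a sketch assuming a 50/50 split, normal approximation, 95% confidence, and 80% power):

```python
import math

def months_to_detect(monthly_visitors, p, rel_lift=0.15,
                     z_alpha=1.96, z_beta=0.84):
    """Months of a 50/50 test needed to detect a relative lift over
    baseline rate p (normal approximation, 95% conf., 80% power)."""
    delta = rel_lift * p
    per_arm = (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2
    return 2 * per_arm / monthly_visitors

print(f"{months_to_detect(15000, 0.10):.1f}")  # ~0.8 months: adequate
print(f"{months_to_detect(20000, 0.005):.1f}")  # ~14 months: low traffic
```

If the result is under 1 month for a 15% lift, you have adequate traffic by this definition.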
Site Traffic Examples
| Traffic | Conversion Rate | Verdict | Why |
|---|---|---|---|
| 20,000 / month | 0.5% | Low Traffic | Base rate is low. Test takes many months and/or a minimum effect of 50% is required. |
| 1,000 / month | 10% | Low Traffic | Site traffic is low. Test takes many months and/or a minimum effect of 50% is required. |
| 15,000 / month | 6% | Adequate Traffic | With this base rate and traffic, you can detect a 15% effect or so. Enough for a test. |
| 50,000 / month | 5% | Adequate Traffic | With this base rate and traffic, you can detect a 15% effect or so in 2 weeks. |
Testing on low traffic is paaaainful. Look at what you plan to gain from testing something, look at your chances objectively, and if you do go ahead, be patient and understand what you should and should not expect.
Did I miss anything? Let me know and I will update this.