When and why to peek at A/B tests
Every other day or so, you should peek at how your tests are doing. Here are some guidelines on doing that without skewing your data:
The main reason you want to peek frequently is technical problems. You should QA your site before you launch, but you should QA again a couple of days in and later on. You may have missed some bugs, and repeat QA will catch more of them. Other site changes can also get introduced that break your test in some way. And sometimes a transient bug doesn’t show up in the data until more data is collected.
If you set it and forget it, you may leave a losing test running too long. If you’re after long-term gains and want to avoid losses, you should stop a losing test at some point. However, stop only when you can be reasonably sure you’re doing the right thing. You don’t want to stop your test a day in because you get put off by a big initial drop. But if your test has been losing for 2 weeks straight with high statistical confidence, you might not want to let it run another week. If your goal is learning, then you might want to run a test longer to confirm it is a loser, but exposing your site to a poor variation may not be good for your business.
What stats are you using?
Many tools have moved to different statistical methods that allow peeking. VWO uses Bayesian stats, which give you up-to-date confidence and probabilities. Optimizely uses a pseudo-Bayesian sequential testing method to allow you to peek. So depending on what statistical method you use, you may be “allowed” to peek and make decisions. However, keep in mind that, regardless of what your tool says, you still want to let your test run a reasonable amount of time. So estimate your duration upfront anyway, so you have some point of reference.
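As a rough illustration of estimating duration upfront, here is a minimal Python sketch using the standard normal-approximation sample size formula for comparing two conversion rates. The function names and the example inputs (5% baseline rate, 20% relative lift, 1,000 visitors a day) are assumptions made for illustration, not figures from any tool, and the 5% significance level and 80% power are common defaults rather than requirements.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation (two-sided test, normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate you hope the variation reaches
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def estimated_days(n_per_variation, variations, daily_visitors):
    """Days needed if daily_visitors are split evenly across all variations."""
    return math.ceil(n_per_variation * variations / daily_visitors)


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline, hoping to detect a 20% relative lift.
    n = sample_size_per_variation(baseline_rate=0.05, relative_lift=0.20)
    days = estimated_days(n, variations=2, daily_visitors=1000)
    print(f"{n} visitors per variation, roughly {days} days")
```

The smaller the lift you want to detect, the more visitors you need, so the estimate is only as good as your guess at the effect size. Treat the result as a point of reference, not a deadline.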
No significance, no problem
If you’re using traditional stats (which work fine and are much easier to understand), the basic idea is to avoid calculating statistical significance (p-value) and then making interim decisions based on that. A p-value is meant to be something you calculate after your test is done. However, you can check significance and adjust your duration estimate once or twice during the test (e.g., once you get a ballpark effect size). That’s not going to skew your p-value much. If you commit to some sample size upfront, then you can check significance all you want, because you’re not acting on it. It just gives you a basic sense of whether the effect is strong.
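To see why interim decisions based on a p-value are a problem, here is a small Python simulation sketch. It runs many A/A tests (both variations share the same true conversion rate, so any "significant" result is a false positive) and compares two policies: stopping the first time any peek shows p < 0.05, versus checking only once at the end. The function names, the number of peeks, the traffic numbers, and the pooled two-proportion z-test are all assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(num_tests=500, peeks=10, visitors_per_peek=200,
                      rate=0.1, alpha=0.05, seed=1):
    """Return (false positive rate if you stop at any significant peek,
    false positive rate if you only check at the end)."""
    rng = random.Random(seed)
    flagged_at_any_peek = 0
    flagged_at_end = 0
    for _test in range(num_tests):
        conv_a = conv_b = n = 0
        any_significant = False
        p = 1.0
        for _peek in range(peeks):
            # Both arms draw from the SAME true rate, so there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_peek))
            n += visitors_per_peek
            p = two_sided_p_value(conv_a, n, conv_b, n)
            if p < alpha:
                any_significant = True
        flagged_at_any_peek += any_significant
        flagged_at_end += p < alpha  # p here is the value at the final peek
    return flagged_at_any_peek / num_tests, flagged_at_end / num_tests


if __name__ == "__main__":
    peeking, final_only = simulate_aa_tests()
    print(f"stop at first significant peek: {peeking:.1%} false positives")
    print(f"check only at the end:          {final_only:.1%} false positives")
```

With settings like these, the first number typically comes out well above the nominal 5%, while the second stays close to it. Reducing `peeks` to one or two shrinks the gap, which is in line with the point that an occasional check does little harm.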
Safe things to peek at
It depends on what you peek at. If you’re not calculating significance but still making decisions based on how the test is doing so far, you’re still skewing your final analysis. However, you can peek at how your overall test is doing (without drilling into how each variation is doing). For example, if you see that overall traffic to the test or the total conversion rate is lower than expected, then you can recalculate your duration estimate without any problems. Looking at the test overall doesn’t tell you how the variations are doing relative to each other. Unfortunately, A/B testing tools have never offered this kind of overall-only view. You definitely want to readjust your duration estimate once you have more data. You just don’t want to keep doing that based on how each variation is doing. That fluctuates, and if you keep adjusting your duration to current performance, you’re letting chance lead you instead of subduing the effect of chance with time.
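Since tools generally do not offer an overall-only view, one way to approximate it is to export daily totals summed across all variations and recompute the duration from those. The Python sketch below assumes a hypothetical input of `(visitors, conversions)` pairs per day, already pooled over every variation, and reuses the standard normal-approximation sample size formula. The function names and example numbers are made up for illustration.

```python
import math
from statistics import NormalDist


def required_per_variation(rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation (two-sided test, normal approximation)."""
    p1, p2 = rate, rate * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)


def blind_duration_update(daily_totals, variations, relative_lift):
    """Re-estimate total test duration from pooled data only.

    daily_totals: list of (visitors, conversions) per day, summed over ALL
    variations, so nothing here reveals which variation is ahead.
    """
    days_so_far = len(daily_totals)
    visitors = sum(v for v, _ in daily_totals)
    conversions = sum(c for _, c in daily_totals)
    overall_rate = conversions / visitors
    daily_visitors = visitors / days_so_far
    needed = required_per_variation(overall_rate, relative_lift) * variations
    return {
        "overall_rate": overall_rate,
        "daily_visitors": daily_visitors,
        "total_days": math.ceil(needed / daily_visitors),
    }


if __name__ == "__main__":
    # Hypothetical first three days of a two-variation test.
    update = blind_duration_update([(1000, 40), (900, 36), (1100, 44)],
                                   variations=2, relative_lift=0.20)
    print(update)
```

Because the inputs are pooled, whoever runs this learns whether traffic or the overall conversion rate is off from the plan, but not which variation is winning.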
Have someone else peek
I don’t do this, but a good idea I’ve come across (not sure how practical) is to have one person peek for technical issues and a different person do the final analysis and make the decision.
The biggest danger with peeking is that your brain will look for patterns, and once you see them, you can’t unsee them. This can lead to disappointment when a green turns red. Or worse, it can lead you to hack your test to get the results you want. Try to be objective about it. Once the test is running, you’re after the truth, not winning or losing.