Video: What-if analysis using a reverse test duration calculator
In this video, I want to show you a different kind of sample size calculator for your A/B tests. It works backwards compared to traditional calculators, and you might find that more intuitive. The basic premise of this approach is that we mostly don’t know what effect size to expect, so we make projections for a range of outcomes.
This calculator doesn’t ask you to input power or the effect size you are after, because it assumes that you’re exploring and don’t know what effect size to expect. Instead it just asks you for your current conversion rate and traffic, and then gives you several possible effect sizes that you can reasonably detect on your site and your chance of success for each outcome.
Let’s see how it works. Let’s say you want to run your test on your home page. About 5% of people make it from the home page to a purchase, and you get about 5000 visitors to the home page per week. Let’s run the report with those numbers.
At the top of the report, you’ll see your preliminary estimate. This estimate tries to balance the testing duration with the effect size you can detect. It’ll cap your duration at 8 weeks regardless. Next, it’ll try to make sure you can detect at minimum a 15% effect.
If I scroll down, you see that’s exactly what I got. The duration is 6 weeks and this is optimal to detect a true 14% lift. I can then adjust my duration up and down and see how it impacts my projections.
The advantage of this report is that it doesn’t give you an estimate for just one effect size. It gives you a range of reasonable what-if scenarios. That’s because we might have little idea what the effect size might be. But I see that if my new version is 10% better or 10% worse, then there is a 50% chance that the effect will actually peek through the noise strongly enough to be detected.
But if the effect is 14%, then I have an 80% chance of success or 80% power. I can then use my judgement to see if whatever I am testing can reasonably beat the existing version by at least 10% and ideally by 14%. It will depend on how big my idea is that I’m testing, my experience with similar tests elsewhere, how bad the current design is, and so on.
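For anyone who wants to reproduce this kind of what-if table offline, here is a minimal sketch. It is not the calculator's actual code. It assumes a 50/50 traffic split, a two-sided two-proportion z-test at a 5% significance level, and a normal approximation; the function name `power_for_lift` and its parameters are made up for illustration. Under those assumptions, the home page example (5% conversion, 5000 visitors per week, 6 weeks) lands roughly in line with the chances quoted above.

```python
from math import sqrt
from statistics import NormalDist


def power_for_lift(base_rate, weekly_visitors, weeks, lift, alpha=0.05):
    """Approximate chance of detecting a given relative lift.

    Assumptions (not taken from the calculator itself): 50/50 split,
    two-sided z-test for two proportions, normal approximation.
    """
    n = weekly_visitors * weeks / 2           # visitors per variation
    p1 = base_rate                            # control conversion rate
    p2 = base_rate * (1 + lift)               # variation conversion rate
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)        # about 1.96 for alpha = 0.05
    z = abs(p2 - p1) / se
    # Probability that the test statistic lands beyond either critical value.
    return nd.cdf(z - z_crit) + nd.cdf(-z - z_crit)


# What-if scenarios for the home page example in the video.
for lift in (0.05, 0.10, 0.14, 0.20):
    chance = power_for_lift(0.05, 5000, 6, lift)
    print(f"true lift {lift:.0%}: chance of detection {chance:.0%}")
```

Changing the `weeks` argument plays the same role as adjusting the duration up and down in the report.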
Another piece of information you can get here is a sense of what the actual observed effect might be. Remember that EVEN IF my new version is 14% better and the test is a success, it doesn’t mean the effect size will actually be 14%. By chance it may be inflated or deflated. So here you can also see the margin of error. This means that if I get a 7.5% lift, I know that the true effect might actually be as high as 14%. But if I see a 3% effect, I know the true effect is at most 10%.
I might wonder: if the true effect were 14%, what effect might I actually observe halfway through the test? To see that, I can reduce the duration to 3 weeks, find the 14% effect, and see that it might show up as an effect as low as 4.5%.
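The margin of error idea can be sketched in a few lines as well. The video does not say which interval level the calculator uses, so the 80% level below is purely an assumption, as are the 50/50 split, the normal approximation, and the hypothetical function name `observed_lift_range`. The sketch also simplifies by dividing the absolute margin by the baseline rate, which ignores noise in the control's own conversion rate.

```python
from math import sqrt
from statistics import NormalDist


def observed_lift_range(base_rate, weekly_visitors, weeks, true_lift, level=0.80):
    """Range of relative lifts one might observe for a given true lift.

    Assumptions: 50/50 split, normal approximation, and an interval
    level (80% by default) chosen for illustration only.
    """
    n = weekly_visitors * weeks / 2           # visitors per variation
    p1 = base_rate
    p2 = base_rate * (1 + true_lift)
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * se / base_rate               # margin as a relative lift
    return true_lift - margin, true_lift + margin


# A true 14% lift at full duration and halfway through the test.
for weeks in (6, 3):
    low, high = observed_lift_range(0.05, 5000, weeks, 0.14)
    print(f"{weeks} weeks: observed lift between {low:.1%} and {high:.1%}")
```

Shortening the duration widens the range, which is why a true 14% effect can look much smaller halfway through the test.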
So far, I assumed there is a true effect. But if I am wrong and my new variation actually has no effect, I might still get a lift – that’s called a false positive. I always like to know what sorts of false positives I can expect.
In this case, let’s put it back to 6 weeks. We see that with this duration, there is a high chance of seeing a false positive of 5%, positive or negative. The term false positive includes effects in either direction. I see there is also a small chance of a false positive as high as 10%: the probability is 5%, small but possible. If I’d like to nearly eliminate that possibility, then I can increase my duration. After 9 weeks, the probability is just 1%. And if I scroll back up, I see that with this duration I can also detect smaller effects.
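The false positive projections can be approximated the same way. The sketch below assumes there is no true effect, a 50/50 split, and a normal approximation, and asks how often chance alone would produce an observed lift at least as large as a given size in either direction. The function name `false_positive_prob` is hypothetical, and the calculator may compute this differently.

```python
from math import sqrt
from statistics import NormalDist


def false_positive_prob(base_rate, weekly_visitors, weeks, observed_lift):
    """Chance of observing a lift this large (up or down) when the true effect is zero.

    Assumptions: 50/50 split, both variations convert at base_rate,
    normal approximation.
    """
    n = weekly_visitors * weeks / 2           # visitors per variation
    se = sqrt(2 * base_rate * (1 - base_rate) / n)
    z = abs(observed_lift) * base_rate / se   # lift converted to an absolute difference
    return 2 * (1 - NormalDist().cdf(z))      # two-sided: positive or negative


for weeks in (6, 9):
    for lift in (0.05, 0.10):
        p = false_positive_prob(0.05, 5000, weeks, lift)
        print(f"{weeks} weeks, lift of {lift:.0%} or more by chance: {p:.1%}")
```

Under these assumptions, a longer test makes large chance lifts rarer, which matches the direction of the change shown in the report when the duration goes from 6 to 9 weeks.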
In the end, whatever happens, you have a much better idea of what to expect. Give it a shot. Let me know how you like it. Thanks for watching.