vlad malik

5 ways to calculate A/B test confidence with 1 line of JavaScript
3 years ago

Besides the standard statistical significance test and confidence intervals, you can try zero-overlap confidence intervals and two kinds of post-hoc power analysis. All are easy to do in ABStats.

Sample data

Baseline: 100 conversions out of 2,000 visitors (5.0%)
B: 130 conversions out of 2,100 visitors (6.2%)

B did better than Baseline this time, but what if we tested again? How trustworthy is this result?

abstats.js is a small library that gives you 5 ways to answer this question. You need no programming skills to use it. It’s available on this page, so open up the browser console and run any of the examples in this article.

Way #1: Estimate the true conversion rates

With abstats.js, I can easily get 95% confidence intervals for each variation:

interval_binary(100, 2000, 0.95) // returns {upper: 0.0605, point: 0.0509, lower: 0.0413}
interval_binary(130, 2100, 0.95) // returns {upper: 0.0731, point: 0.0628, lower: 0.0524}

This gives me the point estimate and the margin of error for Baseline and B.


This says that my best estimates for the true conversion rates are Baseline = 5.1% and B = 6.3%. Nonetheless, Baseline could be as high as 6%, while B could be as low as 5.2%. So it’s plausible that B is actually worse than Baseline but performed better just by chance. How likely is that to happen?

To answer that, I first decide how tolerant I want to be of the chance that Baseline or B might actually be much lower or higher than the current data indicates. That’s my “confidence level”. Then I look at the amount of overlap between the two intervals.

If I set my confidence level high, I play it safe and accept a wider range of values as possible. If I set my confidence level low, I am more willing to trust B based on what I know and to discount the possibility of extremely low or high values. For example, with a lower 80% confidence level, I’m saying that Baseline and B are probably close to their point estimates, which makes the margins of error smaller. As a result, B looks like a stronger winner.


The worst possible result is a low confidence level with lots of overlap. The ideal result is a high confidence level showing no overlap at all. As a rule, I use a 95% confidence level, unless I have some particular reason to trust the data (e.g., other corroborating metrics, past tests).
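
As a side note, the numbers interval_binary returns look very close to a Wilson score interval. That is only my assumption about how the library works, but here is a plain-JavaScript sketch of that formula, which also shows how the confidence level (through the z-score) controls the width of the interval:

// Sketch of a Wilson score interval for a conversion rate.
// This is an assumption: abstats.js may compute its intervals differently,
// but this formula gives numbers close to the interval_binary output above.
function zForConfidence(level) {
  // two-sided z-scores for a few common confidence levels
  return { 0.80: 1.282, 0.90: 1.645, 0.95: 1.960, 0.99: 2.576 }[level];
}

function wilsonInterval(conversions, visitors, level) {
  var z = zForConfidence(level);
  var p = conversions / visitors;
  var z2n = (z * z) / visitors;
  var center = (p + z2n / 2) / (1 + z2n);
  var margin = (z / (1 + z2n)) *
    Math.sqrt(p * (1 - p) / visitors + (z * z) / (4 * visitors * visitors));
  return { upper: center + margin, point: center, lower: center - margin };
}

wilsonInterval(100, 2000, 0.95); // roughly {upper: 0.061, point: 0.051, lower: 0.041}
wilsonInterval(100, 2000, 0.80); // a narrower interval at the lower 80% confidence level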

Way #2: Estimate the relative change

The point estimates are my best guesses of the true conversion rates, while the extremes of the confidence intervals are less likely. I can take this into account and better estimate the relative increase.

In abstats.js, I can easily estimate the relative change and its margin of error:

interval_rel_binary(100, 2000, 130, 2100, 0.95) // returns {upper: 0.5108, point: 0.2329, lower: -0.0451}

This better captures the relationship between Baseline and B. It says that, despite the large overlap in the individual intervals, B is at worst 4.5% worse and at best 51% better than Baseline. Our best estimate is 24%. So it is far more likely that B is a winner than not.

The confidence level defines my risk tolerance. By using the higher 95% confidence level, I acknowledge the possibility that B might be 4.5% worse.
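
Since the return value appears to be a plain object with upper, point, and lower properties (as shown above), it’s easy to turn it into a readable summary right in the console:

// Turn the relative-change interval into a one-line summary.
// Assumes interval_rel_binary returns {upper, point, lower} as shown above.
var rel = interval_rel_binary(100, 2000, 130, 2100, 0.95);
var pct = function (x) { return (x * 100).toFixed(1) + "%"; };
console.log("Estimated lift: " + pct(rel.point) +
  " (worst case " + pct(rel.lower) + ", best case " + pct(rel.upper) + ")");
// e.g. "Estimated lift: 23.3% (worst case -4.5%, best case 51.1%)"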

Way #3: Get the zero-overlap confidence level

The overlap between confidence intervals tells me that B is probably better. What is my level of confidence that B is at least as good as Baseline?

With abstats.js, I can work backward to get this “actual” confidence level:

confidence_binary(100, 2000, 130, 2100) // returns 0.76

This says I have to lower my confidence level to 76% to believe that the true rate of B is the same as or greater than that of Baseline. This is a very conservative metric that does not accept any chance of overlap. To reach 95% confidence, the result would have to be very strong. The good thing is that, by relying on higher confidence, I will be consistently more likely to find true, lasting results. The zero-overlap confidence is more conservative than the “chance to beat” metric in A/B testing tools.
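
One way to picture “working backward” is to step the confidence level down until the two intervals just stop overlapping. The sketch below does exactly that with interval_binary (assuming it accepts any confidence level between 0 and 1); confidence_binary itself may compute the answer analytically rather than by searching:

// Illustration of the zero-overlap idea: find the highest confidence level at
// which B's lower bound still clears Baseline's upper bound.
// Uses interval_binary from abstats.js; this search is just an illustration,
// not necessarily how confidence_binary works internally.
function zeroOverlapConfidence(convA, visA, convB, visB) {
  for (var i = 99; i > 1; i--) {
    var level = i / 100;
    var a = interval_binary(convA, visA, level);
    var b = interval_binary(convB, visB, level);
    if (b.lower >= a.upper) return level;
  }
  return null; // the intervals overlap even at a very low confidence level
}

zeroOverlapConfidence(100, 2000, 130, 2100); // about 0.76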

Way #4: Find the chances of getting the same result if there had been no improvement (p-value)

Another way of measuring confidence is to ask how likely I would be to get these results just by chance, if B in reality were no better than Baseline.

In abstats.js, I can easily get the p-value:

significance_binary(100, 2000, 130, 2100, 0.95) // returns 0.097

This tells me that there is almost a 1 in 10 chance of getting a result like this purely by chance if B were really no better than Baseline, which is high enough to keep that possibility in mind but low enough that I will act on this data.
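
The 0.097 is consistent with a two-tailed two-proportion z-test, so if you want to see the arithmetic behind a p-value like this, here is a plain-JavaScript sketch. Treat the exact method as an assumption; abstats.js may calculate it differently:

// Two-tailed two-proportion z-test (unpooled standard error), as a sketch of
// roughly what a p-value calculation like this involves.
function normalCdf(z) {
  // Abramowitz & Stegun approximation of the standard normal CDF
  var t = 1 / (1 + 0.2316419 * Math.abs(z));
  var d = 0.3989423 * Math.exp(-z * z / 2);
  var p = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 +
    t * (-1.821256 + t * 1.330274))));
  return z > 0 ? 1 - p : p;
}

function pValueBinary(convA, visA, convB, visB) {
  var pA = convA / visA, pB = convB / visB;
  var se = Math.sqrt(pA * (1 - pA) / visA + pB * (1 - pB) / visB);
  var z = Math.abs(pA - pB) / se;
  return 2 * (1 - normalCdf(z)); // two-tailed p-value
}

pValueBinary(100, 2000, 130, 2100); // roughly 0.097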

Way #5: Estimate expected sample size

In addition to the previous methods, I can ask whether it’s reasonable to expect such a result given my sample size.

In abstats.js, I can do a post-hoc power analysis:

sensitivity_binary(0.05, 0.24, 0.95, 2000) // returns 0.5

This tells me that, hypothetically, if I were to run this test again with a base rate of 5% and stopped it once I had about 2000 visitors per variation, I would have only a 50% chance of getting a statistically significant result for a 24% lift. This means the current test is underpowered and should have run longer. So either this was a runaway success that did better than expected, or the lift is under- or overstated.

A value of 80-90% is ideal.
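
For reference, the 0.5 above is consistent with a standard normal-approximation power calculation for a one-sided test. Here is a sketch of that arithmetic, reusing the normalCdf helper from the p-value sketch above; abstats.js may make slightly different assumptions:

// Post-hoc power sketch: the chance of a significant one-sided result given a
// base rate, a relative lift, and a sample size per variation.
// Normal approximation; reuses normalCdf from the p-value sketch above.
// This is my assumption of roughly what sensitivity_binary computes.
function approxPower(baseRate, relativeLift, nPerVariation) {
  var p1 = baseRate;
  var p2 = baseRate * (1 + relativeLift);
  var zAlpha = 1.645; // one-sided test at the 95% confidence level
  var seNull = Math.sqrt(2 * ((p1 + p2) / 2) * (1 - (p1 + p2) / 2));
  var seAlt = Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  var z = ((p2 - p1) * Math.sqrt(nPerVariation) - zAlpha * seNull) / seAlt;
  return normalCdf(z);
}

approxPower(0.05, 0.24, 2000); // roughly 0.5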

An equivalent method is to ask: given the sample size I got, how great a lift would I have expected to detect confidently?

In abstats.js, I can solve for the effect size:

effect_binary(0.05, 2000, 0.95, 0.90) // returns 0.45

This tells me that, hypothetically, if I ran a test with only 2000 visitors per variation, I would have a 90% chance of detecting a 45% lift. The test is not sensitive enough to confirm a 24% lift.
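
A natural follow-up question is how many visitors per variation the test would have needed to confirm a 24% lift with 90% power. The standard normal-approximation formula below answers that; it is a sketch in the same spirit as the power calculation above, not an abstats.js call:

// Rough sample size per variation needed to detect a given relative lift with
// a one-sided test at 95% confidence and the requested power.
// Standard normal-approximation formula; a sketch, not an abstats.js function.
function approxSampleSize(baseRate, relativeLift, power) {
  var p1 = baseRate;
  var p2 = baseRate * (1 + relativeLift);
  var zAlpha = 1.645; // one-sided, 95% confidence
  var zBeta = { 0.80: 0.842, 0.90: 1.282 }[power];
  var seNull = Math.sqrt(2 * ((p1 + p2) / 2) * (1 - (p1 + p2) / 2));
  var seAlt = Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil(Math.pow((zAlpha * seNull + zBeta * seAlt) / (p2 - p1), 2));
}

approxSampleSize(0.05, 0.24, 0.90); // roughly 6300 visitors per variation

That is roughly three times the traffic this test actually had, which is another way of seeing why it was underpowered.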

Now try it in browser console

abstats.js is loaded on this page. Just open up your browser console and try it out. To open the console, press F12 in Firefox or Chrome on Windows.

Learn about other commands in abstats.js

@VladMalik is an interaction designer and musician based in Toronto.
I enjoy breath-hold diving, weight-lifting, and chopping wood. I am vegan.


© 2015 License for all content: Attribution not required. No commercial use.