## 5 ways to calculate A/B test confidence with 1 line of JavaScript

Besides the standard statistical significance test and confidence intervals, you can try the zero-overlap confidence level and two kinds of post-hoc power analysis. All are easy to do with abstats.js.

### Sample data

- Baseline: 100 sales out of 2000 visitors (5.0% rate)
- Variation B: 130 sales out of 2100 visitors (6.2% rate)

B did better than Baseline this time, but what **if we tested again**? How trustworthy is this result?

**abstats.js** is a small **library** that gives you 5 ways to answer this question. You need **no programming skills** to use it. It’s available on this page, so open up the **browser console** and run any of the examples in this article.

### Way #1: Estimate the true conversion rates

With abstats.js, I can easily get 95% confidence intervals for each variation:

`interval_binary(100, 2000, 0.95) // returns {upper=0.0605, point=0.0509, lower=0.0413}`

`interval_binary(130, 2100, 0.95) // returns {upper=0.0731, point=0.0628, lower=0.0524}`

This gives me the point estimate and the margin of error for A and B.
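For instance, here is a quick sketch of pulling the margin of error out of the returned values (assuming the result is a plain object with `upper`, `point`, and `lower` properties, as the output above suggests):

```js
// Margin of error = half the width of the confidence interval
const a = interval_binary(100, 2000, 0.95); // Baseline
const b = interval_binary(130, 2100, 0.95); // Variation B

(a.upper - a.lower) / 2; // ≈ 0.0096, roughly ±1 percentage point
(b.upper - b.lower) / 2; // ≈ 0.0104, also roughly ±1 percentage point
```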

This says that my best estimates for the true conversion rates are Baseline ≈ 5.1% and B ≈ 6.3%. However, A could be as high as 6%, while B could be as low as 5.2%. So it’s plausible that B is actually worse than A but performed better just by chance. **How likely is that to happen?**

To answer that, I first decide how tolerant I want to be of the chance that A or B might actually be much lower or higher than the current data indicates. That’s my “**confidence level**”. Then I look at the **amount of overlap**.

If I set my confidence level high, it means I play it safe and accept a wider range of values as possible. If I set my confidence level low, I am more willing to trust B based on what I know and to discount the possibility of extremely low or high values. For example, with a lower 80% confidence level, I’m saying that Baseline and B are probably close to their point estimates, which makes the margins of error smaller. As a result, B looks like a stronger winner.
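To see this, I can rerun the intervals at 80% confidence. The numbers in the comments below are my own rough estimates and depend on the interval method abstats.js uses; the point is that the ranges shrink and barely overlap:

```js
// Narrower 80% confidence intervals for the same data
interval_binary(100, 2000, 0.80); // roughly {upper=0.057, point=0.050, lower=0.044}
interval_binary(130, 2100, 0.80); // roughly {upper=0.069, point=0.062, lower=0.056}
```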

The worst possible result is a low confidence level with lots of overlap. The ideal result is a high confidence level with no overlap at all. As a rule, I use a 95% confidence level, unless I have some particular reason to trust the data (e.g., other corroborating metrics, past tests).

### Way #2: Estimate the relative change

The point estimates are my best guesses of the true conversion rates, while the extremes of the confidence intervals are less likely. I can take this into account and better estimate the relative increase.

In abstats.js, I can easily estimate the relative change and its margin of error:

`interval_rel_binary(100, 2000, 130, 2100, 0.95) // returns { upper=0.5108, point=0.2329, lower=-0.0451}`

This better captures the relationship between Baseline and B. It says that, despite the large overlap in the individual intervals, B is at most about 5% worse and at best about 51% better than Baseline, with a best estimate of roughly a 23% lift. So it is far more likely that B is a winner than not.

The confidence level defines my risk tolerance. By using the higher 95% confidence level, I acknowledge the possibility that B might be about 5% worse.
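For comparison, here is what the naive relative lift looks like when computed directly from the raw conversion rates, without any interval adjustment:

```js
// Naive relative lift from the raw conversion rates
(130 / 2100) / (100 / 2000) - 1; // ≈ 0.238, close to the interval's point estimate of 0.2329
```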

### Way #3: Get the zero-overlap confidence level

The overlap between confidence intervals tells me that B is probably better. What is my level of confidence that B is at least as good as Baseline?

With abstats.js, I can work backward to get this “actual” confidence level:

`confidence_binary(100, 2000, 130, 2100) // returns 0.76`

This says I have to lower my confidence to 76% to believe that the true rate of B is the same as or greater than Baseline’s. This is a very conservative metric that does not accept any chance of overlap. To reach 95% confidence, the result would have to be very strong. The upside is that I will be consistently more likely to find true, lasting results if I rely on higher confidence. The zero-overlap confidence gives more accurate results than the “chance to beat” metric in A/B testing tools.
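One way to see where the 76% comes from: if I rerun Way #1 at that confidence level, the two intervals should just stop overlapping. The values in the comments below are my own rough estimates and assume the zero-overlap confidence is derived from the same interval method:

```js
// At roughly 76% confidence, Baseline's upper bound meets B's lower bound
interval_binary(100, 2000, 0.76); // upper ≈ 0.056
interval_binary(130, 2100, 0.76); // lower ≈ 0.056
```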

### Way #4: Find the chances of getting the same result if there had been no improvement (p-value)

Another way of measuring confidence is to ask how likely I would be to get these results just by chance, if B in reality were no better than Baseline.

In abstats.js, I can easily get the p-value:

`significance_binary(100, 2000, 130, 2100, 0.95) // returns 0.097`

This tells me that there is an almost 1-in-10 chance of getting this result purely by chance, high enough to keep that possibility in mind but low enough that I am still willing to act on this data.
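If I prefer a hard rule over a judgment call, I can compare the p-value against a threshold chosen before the test:

```js
// Compare the p-value against pre-chosen significance thresholds
const p = significance_binary(100, 2000, 130, 2100, 0.95);
p < 0.05; // false: 0.097 does not clear a strict 5% threshold
p < 0.10; // true: it does clear a looser 10% threshold
```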

### Way #5: Estimate expected sample size

In addition to the previous methods, I can ask whether it’s reasonable to expect such a result given my sample size.

In abstats.js, I can do a post-hoc power analysis:

`sensitivity_binary(0.05, 0.24, 0.95, 2000) // returns 0.5`

This tells me that, hypothetically, if I were to run this test again with a base rate of 5% and stopped it once I had about 2000 visitors per variation, I would have only a 50% chance of getting a statistically significant result for a 24% lift. This means the current test is underpowered and should have been run longer. So either this was a runaway success that performed better than expected, or the observed lift is over- or understated.

As a rule of thumb, a power of 80-90% is ideal.
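To find the sample size that would have been adequate, I can keep increasing the visitor count until the power reaches 80%. The stopping point in the comment is my own rough estimate and depends on how the library calculates power:

```js
// Grow the per-variation sample size until post-hoc power reaches 0.8
let n = 2000;
while (sensitivity_binary(0.05, 0.24, 0.95, n) < 0.8) {
  n += 500;
}
n; // roughly 4500-5000 visitors per variation for a 24% lift
```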

An equivalent method is to ask: given the sample size I got, how large a lift would I have expected to detect confidently?

In abstats.js, I can solve for the effect size:

`effect_binary(0.05, 2000, 0.95, 0.90) // returns 0.45`

This tells me that, hypothetically, if I ran a test with only 2000 visitors per variation, I would have a 90% chance of detecting a 45% lift. The test is not sensitive enough to reliably confirm a 24% lift.
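The same logic can be turned around: as the sample grows, the smallest lift I can confidently detect shrinks. The second commented value is my own rough estimate, not a guaranteed output:

```js
// Minimum detectable lift shrinks as the per-variation sample size grows
effect_binary(0.05, 2000, 0.95, 0.90); // 0.45 - the current test
effect_binary(0.05, 6000, 0.95, 0.90); // roughly 0.25 - about enough to confirm a 24% lift
```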

### Now try it in the browser console

abstats.js is **loaded on this page**. Just open up your browser console and try it out. To open the console, press F12 in Firefox or Chrome on Windows.
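Here is the full set of calls from this article in one block, ready to paste into the console:

```js
// Way #1: 95% confidence intervals for each variation
interval_binary(100, 2000, 0.95);                // {upper=0.0605, point=0.0509, lower=0.0413}
interval_binary(130, 2100, 0.95);                // {upper=0.0731, point=0.0628, lower=0.0524}

// Way #2: confidence interval for the relative change
interval_rel_binary(100, 2000, 130, 2100, 0.95); // {upper=0.5108, point=0.2329, lower=-0.0451}

// Way #3: zero-overlap confidence level
confidence_binary(100, 2000, 130, 2100);         // 0.76

// Way #4: p-value
significance_binary(100, 2000, 130, 2100, 0.95); // 0.097

// Way #5: post-hoc power and minimum detectable effect
sensitivity_binary(0.05, 0.24, 0.95, 2000);      // 0.5
effect_binary(0.05, 2000, 0.95, 0.90);           // 0.45
```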