vlad malik

Statistics of A/A/B and A/A/B/B tests and why you should avoid them 7 hours ago

Lowering alpha is a better way of reducing false positive risk

In an A/B test, B might win just by chance. This risk is called the significance level, or alpha. In the absence of other information, false positive risk is greatest if you stop the test as soon as you see a statistically significant lift. The risk is lowest when you plan your sample size upfront and stop only once that sample size is reached.
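The gap between those two stopping rules is easy to demonstrate with a quick simulation (a hypothetical sketch of my own, not part of any library mentioned here): run many A/B tests where A and B are truly identical, and compare how often you declare a winner if you peek repeatedly and stop at the first significant result versus testing only once at the planned sample size.

```javascript
// Deterministic PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(a) {
  return function () {
    let t = (a += 0x6D2B79F5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two-proportion z-test: is the difference significant at |z| > 1.96?
function significant(xA, nA, xB, nB) {
  const p = (xA + xB) / (nA + nB); // pooled conversion rate
  const se = Math.sqrt(p * (1 - p) * (1 / nA + 1 / nB));
  return se > 0 && Math.abs(xB / nB - xA / nA) / se > 1.96;
}

// A and B share the same true rate, so every "winner" is a false positive.
// "peeking" stops at the first significant look; "fixed" tests once at the end.
function simulateStopping(sims, nPerArm, looks, rate, seed) {
  const rand = mulberry32(seed);
  const step = nPerArm / looks;
  let peekingFP = 0, fixedFP = 0;
  for (let s = 0; s < sims; s++) {
    let xA = 0, xB = 0, foundEarly = false;
    for (let look = 1; look <= looks; look++) {
      for (let i = 0; i < step; i++) {
        if (rand() < rate) xA++;
        if (rand() < rate) xB++;
      }
      const n = look * step;
      if (!foundEarly && significant(xA, n, xB, n)) foundEarly = true;
    }
    if (foundEarly) peekingFP++;
    if (significant(xA, nPerArm, xB, nPerArm)) fixedFP++;
  }
  return { peeking: peekingFP / sims, fixed: fixedFP / sims };
}

const rates = simulateStopping(2000, 1000, 10, 0.05, 42);
console.log(rates); // fixed lands near alpha = 0.05; peeking is noticeably higher
```

Every test that is significant at the final sample size is also caught by peeking, so the peeking rate can only be equal or higher — in practice, with ten looks, it is much higher.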

One argument for A/A/B tests is that they reduce the risk of a false positive: you check that the two identical As are actually performing the same and that B beats both of them. If either condition fails, you reject the winner.

Yes, when you require the two As to match, you’re lowering your false positive risk. This is because you are indirectly choosing a lower significance level cut-off. Let’s see an example.
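Here is one way to see that effect directly (a hypothetical sketch of my own, not taken from any library mentioned here): simulate many A/A/B tests where all three arms share the same true conversion rate, so every declared winner is a false positive, and compare a plain B-vs-A comparison against the rule that also requires the two As to match.

```javascript
// Deterministic PRNG (mulberry32) so results are reproducible.
function mulberry32(a) {
  return function () {
    let t = (a += 0x6D2B79F5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two-proportion z-test at |z| > 1.96 (two-sided alpha = 0.05).
function differ(xA, nA, xB, nB) {
  const p = (xA + xB) / (nA + nB);
  const se = Math.sqrt(p * (1 - p) * (1 / nA + 1 / nB));
  return se > 0 && Math.abs(xB / nB - xA / nA) / se > 1.96;
}

// Draw a binomial count by summing Bernoulli trials.
function binomial(rand, n, rate) {
  let x = 0;
  for (let i = 0; i < n; i++) if (rand() < rate) x++;
  return x;
}

// All three arms have the same true rate: every "win" is a false positive.
function simulateAAB(sims, nPerArm, rate, seed) {
  const rand = mulberry32(seed);
  let plainFP = 0, aabFP = 0;
  for (let s = 0; s < sims; s++) {
    const a1 = binomial(rand, nPerArm, rate);
    const a2 = binomial(rand, nPerArm, rate);
    const b = binomial(rand, nPerArm, rate);
    // Plain A/B: B vs the pooled As.
    const bWins = differ(a1 + a2, 2 * nPerArm, b, nPerArm);
    if (bWins) plainFP++;
    // A/A/B rule: B must win AND the two As must not differ significantly.
    if (bWins && !differ(a1, nPerArm, a2, nPerArm)) aabFP++;
  }
  return { plain: plainFP / sims, aab: aabFP / sims };
}

const fp = simulateAAB(5000, 1000, 0.05, 7);
console.log(fp); // aab comes out below plain: the extra rule tightens effective alpha
```

The A/A/B acceptances are a subset of the plain acceptances, which is exactly why the rule behaves like a stricter (lower) significance cut-off.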

Below is the result from a simulated study. It is exactly the same data shown first as an A/B test and then as an A/A/B/B test. A has a baseline conversion rate of 5% ± 0.5%. B is a true 15% improvement. The total sample of 27,000 visitors gives the test good power:



A/B result with p-value 0.002 is rejected because the two B subsamples did not both win and differed drastically in performance


Notice that the A/B split shows a 17% lift with a very low p-value – an ideal outcome. However, the A/A/B/B split shows an inflated 40% lift for one B subsample, while the other B lost to one of the As. So, if we required both Bs to beat both As and required the subsamples to show similar performance, this result would be rejected despite its low p-value.

Keep reading

Get to the point with informative headings 1 month ago

Try a complete statement

Your top-level heading is the first thing visitors see in Google and on your site, so get your message in there. Think of headings as content, not mere labels.

This original home page of HelpTheChickens.ca simply restated the URL:



I suggested they take text from their intro paragraph and turn that into the heading:




Try being more specific

This stock heading from RelateIQ says nothing about a potentially useful product:




Keep reading

5 ways to calculate confidence in A/B test results using JavaScript 2 months ago


B did better than Baseline this time, but what if we tested again? How trustworthy is this result?

abstats.js is a small library that gives you 5 ways to answer this question. You need no programming skills to use it. It’s available on this page, so open up your browser console and run any of the examples in this article.

Way #1: Estimate the true conversion rates

With abstats.js, I can easily get 95% confidence intervals for each variation:

interval_binary(100, 2000, 0.95) // returns {upper: 0.0605, point: 0.0509, lower: 0.0413}
interval_binary(130, 2100, 0.95) // returns {upper: 0.0731, point: 0.0628, lower: 0.0524}

This gives me the point estimate and the margin of error for A and B:


This says that my best estimates for the true conversion rates are Baseline = 5.2% and B = 6.2%. However, A could be as high as 6%, while B could be as low as 5.2%. So, it’s plausible that B is actually worse than A but performed better just by chance. How likely is that to happen?
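If you are curious what is happening under the hood, the numbers above are very close to a Wilson score interval, a common choice for binomial proportions. I am assuming the method here, since the library internals are not shown; the sketch below reproduces the first call's output to within rounding.

```javascript
// Wilson score interval for a binomial proportion. This is an assumption
// about what interval_binary computes; its output matches to ~3 decimals.
function wilsonInterval(successes, trials, z) {
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom; // shrunk point estimate
  const spread =
    (z / denom) * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials));
  return { lower: center - spread, point: center, upper: center + spread };
}

// z = 1.96 corresponds to 95% confidence.
console.log(wilsonInterval(100, 2000, 1.96));
// ≈ { lower: 0.0413, point: 0.0509, upper: 0.0604 }
```

Note that the point estimate (5.09%) is pulled slightly toward 50% relative to the raw 100/2000 = 5%; that shrinkage is characteristic of the Wilson interval.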

Keep reading

I Have An A/B Test Winner, So Why Can’t I See The Lift? 2 months ago

In the town of Perfectville, a company ran a winning A/B test with a 20% lift. A few weeks after implementing the winner, they checked their daily conversions data:



97 days of daily conversion rates in Perfectville showing 20% lift


The graph perfectly captured what happened: the baseline increased by 10% during the test, when half the traffic was exposed to the winning variation. Then came the week when the test was stopped, followed by a lift of 20% once the winner was implemented.

The good people in nearby Realville heard about this and ran the test on their site. When they later checked their daily conversions data, they scratched their heads (as they often do in Realville):



97 days of daily conversion rates in Realville showing same improvement


The data actually includes the same 10% lift during the test, a gap, and a final 20% improvement. The problem is that the improvement sits on top of natural fluctuations in daily conversion rates, so a 20% improvement doesn’t necessarily look like a 20% lift on the graph.

Here are 6 reasons why people in Realville might find it difficult to see a lift and what they can do about it.
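A quick back-of-envelope calculation shows why (a hypothetical example with assumed traffic numbers, not the Realville data): with a few hundred visitors a day converting at around 5%, day-to-day sampling noise alone is comparable in size to a 20% lift.

```javascript
// Relative day-to-day noise in a daily conversion rate, assuming each
// visitor converts independently (binomial standard error of the daily rate).
function relativeDailyNoise(visitorsPerDay, rate) {
  const se = Math.sqrt((rate * (1 - rate)) / visitorsPerDay); // absolute sd
  return se / rate; // expressed as a fraction of the rate itself
}

// Hypothetical site: 300 visitors/day converting at 5%.
const noise = relativeDailyNoise(300, 0.05);
console.log((noise * 100).toFixed(0) + "%"); // prints "25%" — bigger than a 20% lift
```

In other words, a typical day on this hypothetical site swings by about ±25% relative to its own average, so a permanent 20% lift is easy to miss by eyeballing the daily graph.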

Keep reading

@VladMalik is an interaction designer based in Toronto