6 statistical reasons to avoid AAB, ABB, AABB tests
Running an A/A/B/B-type test is a poor man’s way of reducing false positives. That’s what alpha, the significance level, is for. Adjusting alpha directly is a more precise and transparent way to control false positive risk.
Not a precise way to reduce false positive risk
One argument for A/A/B/B tests is reducing the risk that B will win just by chance (false positive). The thinking is that unless both identical Bs outperform both identical As, you should reject the winner.
Yes, when you require the identical subsamples to agree with each other, you lower your false positive risk. But this is simply because you are indirectly choosing a stricter significance cutoff. Let’s see an example.
In the next simulation, exactly the same data is shown as an A/B test and as an A/A/B/B test. A has the average baseline conversion rate of 5%. B is a true 15% improvement. The total sample of 27,000 visitors has good power:
The A/B result with a p-value of 0.002 is rejected because the two B subsamples did not both win and they differ drastically in performance. Min lift is the lowest B compared to the highest A. Max lift is the highest B compared to the lowest A.
Notice that the A/B split shows a 17% lift with a very low p-value, an ideal outcome. However, the A/A/B/B split shows an inflated 40% lift for one B subsample, while the other B lost to one of the As. So, if we required both Bs to beat both As and required the subsamples to show similar performance, this result would not pass, despite its p-value.
In order to win, B would have to perform more consistently, so that both Bs win. This implies a lower p-value than would otherwise be required. Below is a rerun of the simulation, showing a result that does pass. In this case, both Bs beat both As, but notice that for this to happen the resulting overall p-value has to be much smaller than 0.002 in the previous example:
Both Bs beat both As (right). This results in an ultra-low overall p-value (left).
This may be good or bad depending on your goal. Such low p-values are too hard a standard to meet in practice, so you’ll end up ignoring the rule anyway. If you want to control false positives more precisely, lower your significance cutoff instead of splitting up your variations. You will then know exactly how much you have lowered your risk.
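To make the "indirect cutoff" point concrete, here is a minimal simulation sketch of an A/A/A/A setup, where any "B wins" call is a false positive. It compares the plain A/B rule with the stricter rule that both Bs must also beat both As. The function names, the 6,750 visitors per cell, and the normal approximation used to draw conversions are assumptions made for illustration, not the setup behind the figures in this article.

```python
import math
import random

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def draw_conversions(n, rate, rng):
    """Assumption: normal approximation to a binomial draw, to keep the sketch fast."""
    sd = math.sqrt(n * rate * (1 - rate))
    return min(n, max(0, round(rng.gauss(n * rate, sd))))

def false_positive_rates(n_per_cell=6750, rate=0.05, alpha=0.05, runs=20000, seed=1):
    """Share of no-difference tests in which B is declared the winner under each rule."""
    rng = random.Random(seed)
    ab_hits = aabb_hits = 0
    for _ in range(runs):
        # All four cells share the same true rate, so every "win" is a false positive.
        a1, a2, b1, b2 = (draw_conversions(n_per_cell, rate, rng) for _ in range(4))
        p = two_sided_p(a1 + a2, 2 * n_per_cell, b1 + b2, 2 * n_per_cell)
        b_wins = p < alpha and (b1 + b2) > (a1 + a2)
        ab_hits += b_wins
        # Cells are equal in size, so comparing counts is the same as comparing rates.
        aabb_hits += b_wins and min(b1, b2) > max(a1, a2)
    return ab_hits / runs, aabb_hits / runs

if __name__ == "__main__":
    ab_rate, aabb_rate = false_positive_rates()
    print(f"A/B false positives: {ab_rate:.3f}, A/A/B/B rule: {aabb_rate:.3f}")
```

The second number can never exceed the first, because the A/A/B/B rule only adds a condition on top of the A/B result. You can get a comparable reduction, with a risk level you know in advance, by passing a smaller alpha to the plain A/B rule.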
Understand your false negative risk
A clear consequence of reducing your false positive risk is that you also risk rejecting too many true winners. We already saw that even statistically strong results would get rejected if you required the As and the Bs to match up. The lower power of each subsample makes it less likely that they will all agree.
Here’s a quick animated graph showing results of multiple simulations (5%+/-0.5% baseline rate with 15% true effect). Each test is stopped when the planned sample size is reached. Notice how often both Bs fail to win in the A/A/B/B test, whereas true winner B wins almost every time in the A/B test:
Animation: A high false negative rate for A/A/B/B tests means you won’t trust them anyway
Subsample after the test instead (if you have to subsample)
You will need to combine the samples after the test to run a significance test. Merging will turn your A/A/B/B test into A/B anyway. Instead of running an A/A/B/B test, you can subsample your data after the test. You can even draw multiple random subsamples. However, you shouldn’t do that as a way to test for false positives for reasons described above.
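As a rough sketch of what subsampling after the test could look like, the snippet below repeatedly splits one variation’s visitor-level outcomes into two random halves. The helper names and the example data (13,500 visitors, 776 conversions) are hypothetical. It assumes you can export one 0/1 outcome per visitor from your tool.

```python
import random

def split_in_half(outcomes, rng):
    """Randomly split one variation's outcomes (1 = converted, 0 = not) into two halves."""
    shuffled = list(outcomes)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def subsample_rates(outcomes, draws=5, seed=0):
    """Conversion rates of both halves for several independent random splits."""
    rng = random.Random(seed)
    rates = []
    for _ in range(draws):
        first, second = split_in_half(outcomes, rng)
        rates.append((sum(first) / len(first), sum(second) / len(second)))
    return rates

if __name__ == "__main__":
    # Hypothetical variation B: 13,500 visitors, 776 conversions (about 5.75%).
    visitors_b = [1] * 776 + [0] * 12724
    for rate_1, rate_2 in subsample_rates(visitors_b):
        print(f"half 1: {rate_1:.2%}  half 2: {rate_2:.2%}")
```

Each pair of halves always merges back into the same overall result, which is why the split adds no information to the significance test itself.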
Keep effect of lower power in mind
A/A/B/B tests may create headaches for you. The split As and split Bs will each have lower power as a result of the smaller sample size, which means the effect in each is likely to be understated or exaggerated. Since you essentially have 4 comparisons in an A/A/B/B test, you will have that much more risk that one of the variations will be an outlier. Lower power means it’s actually easier for one of the Bs to beat both As by chance or for an A to beat both Bs.
Once you merge the variations into A/B, the outliers will even out. However, seeing that one of the A subsamples outperformed both Bs might be troubling if your team doesn’t realize this is more likely due to the lower power.
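To put a number on the power loss, here is a minimal sketch using the standard normal approximation for a two-sided two-proportion z-test. It assumes the 27,000 visitors from the earlier example are split evenly, so 13,500 per variation in an A/B test and 6,750 per subsample in an A/A/B/B test. The function name and defaults are illustrative.

```python
from statistics import NormalDist

def approx_power(n_per_arm, base_rate=0.05, lift=0.15, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    p_a = base_rate
    p_b = base_rate * (1 + lift)
    # Standard error of the difference between the two observed rates.
    se = (p_a * (1 - p_a) / n_per_arm + p_b * (1 - p_b) / n_per_arm) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return NormalDist().cdf(abs(p_b - p_a) / se - z_crit)

if __name__ == "__main__":
    print(f"A/B, 13,500 per variation:    {approx_power(13500):.0%}")
    print(f"A vs B subsample, 6,750 each: {approx_power(6750):.0%}")
```

Under these assumptions the full A/B comparison has power of roughly 78%, while a single A subsample against a single B subsample has a bit under 50%. So a subsample comparison failing to show a win is close to a coin flip even when B is truly better.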
Sample size is the best stopping rule for winning variations
False positive risk is greatest if you test significance continuously and use that as your stopping rule. If you run an A/A/B/B test and add the further requirement that both Bs outperform both As, then you do reduce your false positive risk, as I described above.
However, your risk of false positives using this method is still higher than if you use sample size as the stopping rule. Here is a cherry-picked example from another A/A/B/B simulation. Actually it’s A/A/A/A. Let’s see how effectively this prevents false positives:
A/A/A/A test shows a false positive half-way through the test
In this case, B is identical to A, so there is 0% true improvement. The simulation was run for 10 weeks. Each week the conversion rate fluctuated around 5.2% +/- 1%. After the first week, we see a suggestive 15% lift. After 6 weeks, we see a statistically significant 11% lift. In addition, both Bs outperform both As. However, this is a false positive. After 10 weeks, we see the effect drop to about 0%, which is the truth.
This shows that even if you run an A/A/B/B test and even if you get a good sample, you can still get a false positive. If we use a 0.05 significance cutoff, there is a 5% chance of a false positive when there is no true difference. Running an A/A/B/B test until significance is reached (instead of until a predetermined sample size) increases this risk further.
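The effect of the stopping rule can be checked with a small simulation sketch. It runs many no-difference tests for 10 weeks and compares two rules on the same data: stop at the first week that shows significance, or test once at the planned end. The names, the 1,350 weekly visitors per variation, the constant 5% rate (the simulation above let the weekly rate fluctuate), and the normal approximation for weekly conversions are all simplifying assumptions.

```python
import math
import random

def p_value(conv_a, conv_b, n):
    """Two-sided pooled two-proportion z-test with n visitors per variation."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = (conv_b - conv_a) / n / se
    return math.erfc(abs(z) / math.sqrt(2))

def weekly_conversions(n, rate, rng):
    """Assumption: normal approximation to a binomial draw, to keep the sketch fast."""
    sd = math.sqrt(n * rate * (1 - rate))
    return min(n, max(0, round(rng.gauss(n * rate, sd))))

def peeking_vs_fixed(weeks=10, weekly_n=1350, rate=0.05, alpha=0.05, runs=4000, seed=3):
    """False positive rates when stopping at first significance vs. at the planned end."""
    rng = random.Random(seed)
    stopped_early = fixed = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        significant = []
        for week in range(1, weeks + 1):
            # Both variations share the same true rate, so any significance is false.
            conv_a += weekly_conversions(weekly_n, rate, rng)
            conv_b += weekly_conversions(weekly_n, rate, rng)
            significant.append(p_value(conv_a, conv_b, week * weekly_n) < alpha)
        stopped_early += any(significant)  # would have stopped at some weekly look
        fixed += significant[-1]           # tested only once, at week 10
    return stopped_early / runs, fixed / runs

if __name__ == "__main__":
    peeking, fixed = peeking_vs_fixed()
    print(f"stop at first significance: {peeking:.1%}, fixed sample: {fixed:.1%}")
```

The fixed-sample rate should land near the nominal 5%, while the rate with weekly checks comes out several times higher in this setup. Splitting into A/A/B/B does not remove that inflation; only committing to a sample size in advance does.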
Don’t worry about the tool
Another argument for A/A/B/B tests is that it helps you catch improper randomization by your tool or other technical glitches.
If your tool is not randomizing properly, it is actually easier to catch in an A/B test. If you add more variations, you increase the chances that normal random variation alone will create uneven samples. Improper randomization can happen because of improper setup, but that’s rare. Under normal conditions, VWO, for example, watches for uneven samples and tries to even them out. In any case, a moderately uneven split would not hurt your test. All in all, you are better off trusting your tool and worrying more about your test setup.