After reading this post, you will be able to say whether your test has “low traffic”, decide if A/B testing is worth it, and know what to do if you decide to A/B test.
The collective guess of a crowd can be more accurate than that of an individual. For example, over a hundred years ago, statistician Francis Galton noticed that a crowd of people could guess the weight of an ox with over 99% accuracy. In a more complex domain like politics, we know that expert predictions are terrible, but the average of their guesses is better.
In this video, I want to show you a different kind of sample size calculator for your A/B tests. It works backwards compared to traditional calculators, and you might find that more intuitive. The basic premise of this approach is that we mostly don’t know what effect size to expect, so we make projections for a range of outcomes.
Every other day or so you should peek at how your tests are doing. Here are some guidelines on doing that without skewing your data:
Sometimes the outcome of a simulation is a work of art. This is a plot of 1000 trials of an A/A test. The tips of the lines are p-values, while the dark area at the bottom is the effect size. I liked the pattern, so I turned it into 2 artworks: Rain and Snow:
Once you figure out what you want to test, you need to define what you’re going to measure and where. In this post, I will introduce my preferred terms for describing test structure (things like test conditions, goals, and pages), and I’ll use a visual language to cover the basic patterns. Here’s an example:
Animation can convey multiple messages using the same space, turn a headline into a centerpiece that draws attention to your message, set up a transition or interaction that pulls the user in or guides their gaze, or use the time dimension to convey extra information, like mood.
I use simulations all the time to help answer questions like: Is this outcome possible? What outcomes are most likely? How much data is enough?
Simulations can answer key questions without painful calculations. If you haven’t gotten around to learning R, here’s an A/B test simulator for Excel or Google Docs. It does a power calculation, so you can see the impact that baseline conversion rate, effect size, and traffic have on your chances of detecting an effect. It gives the effect size and p-value for each outcome.
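If you would rather see the idea in code, here is a minimal Python sketch of a power calculation by simulation. It is not the spreadsheet simulator mentioned above. The function names, the pooled two-proportion z-test, and the default of 2000 trials are my assumptions for illustration.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (assumed test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions) in both arms
    z = (conv_b / n_b - conv_a / n_a) / se
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_power(baseline, lift, visitors_per_arm,
                   trials=2000, alpha=0.05, seed=1):
    """Fraction of simulated tests that reach p < alpha.

    baseline: conversion rate of A, e.g. 0.10
    lift: relative effect of B, e.g. 0.20 for +20% (0 simulates an A/A test)
    """
    rng = random.Random(seed)
    rate_b = baseline * (1 + lift)
    significant = 0
    for _ in range(trials):
        # Each visitor converts independently with the arm's true rate.
        conv_a = sum(rng.random() < baseline for _ in range(visitors_per_arm))
        conv_b = sum(rng.random() < rate_b for _ in range(visitors_per_arm))
        p = two_proportion_p_value(conv_a, visitors_per_arm,
                                   conv_b, visitors_per_arm)
        if p < alpha:
            significant += 1
    return significant / trials
```

Calling `simulate_power(0.10, 0.20, 2000)` estimates your chance of detecting a 20% lift on a 10% baseline with 2000 visitors per arm. Setting `lift=0` simulates A/A tests, so the result should land near `alpha`.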
A hypothesis is an explanation of why something is the way it is.
Running an AABB-type test is a poor man’s way of reducing false positives. That’s what alpha is for. Directly adjusting alpha, or false positive risk, is more precise and clear.
A visitor should learn something just by reading your headings. Write informative headings. Don’t save content for later. You can elicit curiosity by revealing key insights to the right audience rather than by obfuscating.
Besides the standard statistical significance test and confidence intervals, you can try zero-overlap confidence intervals and two kinds of post-hoc power analysis. All are easy to do in ABStats.
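To make the zero-overlap idea concrete, here is a rough Python sketch. It is not how ABStats does it. The Wald (normal approximation) interval, the function names, and the visitor counts are all assumptions made up for this example.

```python
import math


def wald_interval(conversions, visitors, z=1.96):
    """Approximate 95% confidence interval for a conversion rate.

    Uses the normal (Wald) approximation, which is an assumption here
    and gets unreliable for small samples or very low rates.
    """
    rate = conversions / visitors
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


def intervals_overlap(first, second):
    """True if two (low, high) intervals share any value."""
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical data: 100/1000 conversions for A, 150/1000 for B.
ci_a = wald_interval(100, 1000)
ci_b = wald_interval(150, 1000)
zero_overlap = not intervals_overlap(ci_a, ci_b)
```

With these made-up numbers the intervals come out at roughly 8.1% to 11.9% for A and 12.8% to 17.2% for B, so they do not overlap. Requiring zero overlap is a stricter bar than the standard significance test, so some results that pass the standard test will fail this one.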
In the town of Perfectville, a company ran a winning A/B test with a 20% lift. A few weeks after implementing the winner, they checked their daily conversions data:
@VladMalik is an interaction designer and musician based in Toronto.
I enjoy breath-hold diving, weight-lifting, and chopping wood. I am vegan.