Solving Problems with User Research, Best Practices, and A/B Testing

How do I persuade more people to buy your product online? I tackled this question for 5 years, running A/B tests for diverse clients.

I remember one test idea that everyone on the team loved. The client said “That’s the one. That one’s totally going to win.” Well, it didn’t.

The fact is, most A/B test ideas don’t win.

Interpretation is tough, because there are so many sources of uncertainty: What do we want to improve first? Which of a hundred implementations is a valid test of our hypothesis about the problem? If our implementation does better, how statistically reliable is the result?

Is our hypothesis about the users actually true? Did our idea lose because our hypothesis is false, or because of our implementation? If the idea wins, does that support our hypothesis, or did it win for some completely unrelated reason?

Even if we accept everything about the result in the most optimistic way, is there a bigger problem we don’t even know about? Are we inflating the tires while the car is on fire? 

If you take anything away from this, take this analogy: inflating your car tires while the car is on fire will not solve your real problem.

I believe the most effective means of selling a product and building a reputable brand is to show how the product meets the customer’s needs. This means we have to know what the customer’s problem is. We have to talk to them.

Then if we run an A/B test and lose, we won’t be back to square one. We’ll know our hypothesis is based in reality and keep trying to solve the problem.

Emulating Competitors

“I heard lots of people found gold in this area. I say we start digging there!”

That actually is a smart strategy: knowing about others’ successes helps define the opportunity. That’s how a gold rush happens.

This is why A/B testing blogs are dominated by patterns and best practices. So-and-so gained 50% in sales by removing a form field… that sort of thing. Now don’t get me wrong: you should be doing a lot of those things. Improve your value proposition. Ensure your buttons are noticed. Don’t use tiny fonts that are hard to read. You don’t need to test anything to improve, especially if you focus on obvious usability issues.

So what’s the problem? Well, let’s go back to the gold analogy. Lots of people went broke. They didn’t find any gold where others had or they didn’t find enough:

“The actual reason that so many people walked away from the rush penniless is that they couldn’t find enough gold to stay ahead of their costs.” ~ Tyler Crowe, Sept. 27, 2014, in USA Today

You could be doing a lot of great things, just not doing the RIGHT things.

The good news is that many people do some research. The problem is they don’t do enough of it, or it isn’t direct enough. They are still digging in the wrong place.

“If I had only one hour to solve a problem, I would spend up to two-thirds of that hour in attempting to define what the problem is.” ~ An unknown Yale professor, wrongly attributed to Einstein.

Think about this for a moment: How can you sell something to anyone when you’ve never talked to them or listened to what they have to say?

Product owners often believe they know their customers, but assumptions usually outnumber verifiable facts. Watching session playback can hint at problems. Google Analytics gives a funnel breakdown, but it doesn’t give much insight into a customer’s mind. It’s like trying to diagnose the cause of indigestion without being able to ask the patient what they had for dinner or if they have other more serious health complaints.

The problem is that it’s all impersonal; there’s no empathy. There’s no “Oh man, that sucks, I see how that is a problem for you”. It’s more like “Maybe people would like a screenshot there. I guess that might be helpful to somebody”.

Real empathy spurs action. When you can place yourself in your customer’s situation, you know how to go about helping them. If your solution doesn’t work, you can try again, because you know the problem is real rather than a figment of your imagination.

A Pattern Is A Solution To A Problem

Therapist: “Wait, don’t tell me your problem. Let me just list all the advice that has helped my other patients.”

Let’s say some type of visual change has worked on 10 different sites. Let’s call it a pattern.

A pattern works because it solves some problem. So choosing from a library of patterns is really choosing which problem you have. You don’t choose Tylenol unless you have a headache or fever. You don’t choose Maalox unless you have indigestion.

If you know what YOUR problem is, you can choose the right patterns to solve it.

If you don’t know the problem, you won’t get far choosing a pattern because it’s popular, because of how strongly it worked, or because of how many people it has worked for. That’s like taking a medication you’ve never heard of and seeing what it does for you.

Pattern libraries are great for when you have a problem and want a quick, time-tested way to solve it.

Research Uncovers The Problem: A Short Story

Say you’re a shoe brand. You decide to reach out to people who are on your mailing list but haven’t purchased yet.

So you send out a survey. Within the first day, it becomes clear that many people are avoiding buying your shoes, because they’re not sure about sizing.

You’re shocked, but you shouldn’t be. User research insights are often surprising.

It’s just that you thought you had anticipated this by posting precise measurements, a great return policy, and glowing testimonials. If anything, you thought people would mention the price, but so far no one has.

That’s a big deal for your product strategy. You need to build trust. So you set aside your plans for a full redesign (those fancy carousels on your competitor’s site sure are tempting). You set aside A/B test ideas about the font size of prices, removing fields, and so on.

You tackle the big problem. You do some research and come up with solutions:

  • match sizing to a set of well known brands
  • provide a printable foot template
  • allow people to order two sizes and return one
  • mail out a mock plastic “shoe” free of charge, and so on…

You ask a couple of people to come to the office and try some of your solutions.

Your user testing methodology is simple: First people pick their size based on either the sizing chart or template. Then they see if the real shoe fits.

Result? Matched sizing was very effective in predicting fit. The initial foot template didn’t work as well, because it’s hard to place a 3D foot in perfect position on a 2D printout. So you come up with a template that folds up at the back and front, simulating a shoe. Users like that much better. In fact, you start working on a cardboard model you can mail cheaply to anyone who requests it.

Now you’re off to testing it in the real world!

You design two different sizing comparisons: a pretty one with photos of the top 3 brands, and a long, plain table with 20 different brands. You also create an alternative page that links to the downloadable foot template.

You A/B test these variants over 2 weeks and pick the one that works.

(Then you go back to your research and find the next problem.)

You may also like this post about patterns: Compact Navigation Patterns.

If you want to uncover the biggest problems for your customers, I’m happy to help.

Remote Testing Paid Signup Flow

Recency: 2015
Role: Owned project, client-facing designer and A/B test developer
Process: Solo with over a dozen A/B test iterations
Top Challenge: Improve sales by genuinely improving UX

Iterative redesign and A/B testing to improve the Plans/Payment flow on a dating site to increase paid sign-ups.

Problems With Original

The original process failed to emphasize the top benefits:

There is no guidance on what to choose. Is a 3-month plan long enough?

Buying message packs and the option to enter a coupon code further complicate the choice. Analytics showed some of the options were never used. Some benefits, like “Extra privacy options”, are unclear. Some benefits, like “Organize singles events”, are not relevant for the average user.

Once the user picked a plan, they went to Step 2:

The 2-step flow was awkward: Step 1 displayed the per-month price prominently, yet Step 2 started with a higher prepay-3-months price. Analytics showed people going back and forth between the two steps, suggesting Step 1 wasn’t effective as a gateway page.

My Solution

My final redesign looked like this:

The key aspects of the solution were:

  • Hierarchy & Flow: Simplified to a single-page process instead of multiple steps. Tabs for each plan let users explore the plans without flipping back and forth between pages.
  • Value proposition: I turned the top benefit into the headline (“Make A Great First Impression, Send the First Message”) and showed only the next 3 best benefits below, instead of many.
  • User-centric: Guided the user’s decision with plain-English financial and situational advice. For example, the 6-month plan says: “Pays for itself in X months. Take your time meeting people.” I also used more casual language when describing the plans.
  • Hierarchy: Removed distractions and moved secondary payment options way down to the footer.

Many Iterations Of A/B Experiments

Over multiple tests, I removed various components, such as the message pack footer. I also tried simplifying the choice by setting different defaults and using a single column layout.

Here’s an intermediate variation, which did NOT do better:

I tested setting the default to Plan 1 as well as Plan 2. During the testing, I monitored the impact on sales counts and revenue, as well as user behavior.

After each round of testing, I prepared summary reports and analyses with lessons learned and recommendations.

User Behavior Research

I tracked user behavior in detail, like whether users were choosing a plan and then going back and changing their choice. I also tracked how long they spent at each step.

Some results were surprising. For example, setting the default to Plan 1 reduced sales of Plan 1 but tripled sales of Plan 2, yet because of the price difference between the plans, this did not increase overall revenue.

A/B Test Outcome

I A/B tested this solution and it increased revenue and sales for all plans. It took 3-4 iterations.

Patterns for optimizing checkouts: flow

When a visitor knows what to expect and completes a process smoothly, I call this good “flow”. This post shows some options for how a checkout can be organized and presented to anticipate questions like: Does everything look right with my order? How long will this take? Is it going to be complicated? What are my options?

 


Account Creation

Allow users to check out as guests without side-stepping into an account creation flow:

[Image: guest checkout option]

Better yet, conceal account creation. For example, ask customers if they want to save their information at the bottom of the form (or on the Confirmation page after the order is completed):

[Image: saving customer information without a separate account creation step]

For existing customers, you can provide a link to log in or a small sign-in form on the side. If a user chooses not to log in, you might check whether their email is already in the system and offer to retrieve their last used info:

[Image: sign-in options for existing customers]

If someone forgot their password, you can suggest they continue as a guest to avoid the delay of password recovery:

[Image: continuing as a guest instead of recovering a password]


Express Checkouts

Give existing customers an express checkout option. On different sites, it may be called “express checkout” or “1-click purchase”:

[Image: express checkout option]

Another type of express checkout is when an email recipient clicks a unique email link: when they land on the page, they get the option of using the same billing and shipping details as before. A Complete Money Back Guarantee helps ease doubts about an “express” checkout, since the customer sometimes doesn’t even get to see their last used information. If you offer 1-click purchases, include a Cancel/Undo option right on the Confirmation page. I’ve used a 1-step checkout before, not realizing it would literally place the order without any confirmation.

 


Checkout Tunnels (Enclosed Checkouts)

A checkout that keeps normal navigation and sidebars creates a more natural transition. It tells customers “Check out now if you want, or keep looking around for other products”.

In contrast, a checkout tunnel removes all distractions. It tells customers “You’ve finished browsing. Time for payment”. Test the impact on your total order value, time to purchase, as well as completion rate. Keep consistent branding, and keep some common elements as visual anchors (e.g., remove the navigation links but preserve the area, so content areas don’t jump too much after the page transition).

[Image: enclosed checkout tunnel]

One hybrid approach is opening the checkout in a modal with a faded background. The fading shifts attention away from background elements. At the same time, it maintains a strong connection to the product, since the product page remains in the background. One way to preserve that on a separate checkout is to include the image of the product being purchased.

 


Form Layout

The goal is to make the form look easy to fill.

Direct the flow of attention in one direction, top to bottom. Avoid columns. That said, you can group short and closely related fields, especially if the grouping is expected (e.g., credit card mm/yy expiry fields should appear together):

[Image: single-column layout vs. multiple columns]

Give fields an appropriate maximum width. A narrow form will look simpler, because it appears to require less typing in each field:

[Image: appropriate field widths]

Keep labels above fields to make the field-label unit easier to process. Left-aligned labels have advantages – they shorten the form and are easier to scan (see Top, Right or Left Aligned Form labels):

[Image: label placement above fields]

Avoid placeholder text and inner labels, because they create confusion about which fields are completed and which are not. Inner labels may be OK on very short forms (2-3 fields), but make sure the label remains visible once the user starts typing. I like the pattern that moves the label to the border area rather than removing it.


Stepped Progress

To make the form look like less work, chunk it up. You can have a long form with numbered sections separated with spaces or lines. Alternatively, you can use the “accordion” pattern to show one section at a time, while other sections are collapsed. Some checkouts span separate pages, such as Personal Info > Shipping > Payment (see examples with test data on GoodUI Evidence):

[Image: stepped checkout progress]

If you use a single long form, create distinct, intuitive sections (like Shipping Address, Payment), which you can also number. Test for best field to start with: Is it the email? Is it shipping preference? What is low friction? What is high engagement? What is high commitment?

If you use a multi-page checkout, use a breadcrumb or other progress indicator. For your “Next” buttons, use a label that sets an expectation, such as “Next: Payment”.

In your analytics, measure drop-offs at each step and engagement with key fields, so you can compare the effectiveness of each layout (e.g., how many people start filling in the credit card fields).
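For instance, here’s a minimal front-end sketch of that kind of measurement (the `data-checkout-step` attribute, the field id, and the `sendEvent` helper are placeholders for whatever your markup and analytics setup actually use):

```typescript
// Hypothetical helper standing in for your analytics call.
declare function sendEvent(name: string, data?: Record<string, string>): void;

// Which steps do visitors actually reach? Fire once per step on first interaction.
document.querySelectorAll<HTMLElement>("[data-checkout-step]").forEach(step => {
  step.addEventListener(
    "focusin",
    () => sendEvent("checkout_step_started", { step: step.dataset.checkoutStep ?? "" }),
    { once: true }
  );
});

// Key-field engagement: did the visitor start filling in the card number?
document.querySelector("#card-number")
  ?.addEventListener("input", () => sendEvent("card_field_engaged"), { once: true });
```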


Payment Alternatives

Choose a transaction processor with a high success rate. In addition to your default processor, you can offer an alternative gateway, such as PayPal. Conversely, see if removing the choice increases revenue:

[Image: alternative payment processors]

A 3rd-party checkout usually takes the user away from your site and provides an experience you can’t track and have no control over, but it may increase revenue.

You can also use a fallback processor when a transaction is declined. If automating that is not possible, you can show a more informative Declined message with a link to the alternative, like PayPal.


Order Review

If you have a review step, try removing it, as it’s likely unnecessary. However, if you have a long checkout spanning several screens, it may be reassuring to see a summary before committing to the order. See what works.

[Image: order review step]


In the next post, I plan to look at the Fields aspect of a checkout, which tells the user what data to provide and in what format. If you’d like to read that, please leave a comment so I know there’s interest.

Are there other patterns and aspects of a checkout I have not covered?

Tips for A/B testing with low traffic

After reading this post, you will be able to say whether your test has “low traffic”, decide if A/B testing is worth it, and know what to do if you decide to A/B test.

[Image: no traffic]

Technique 1: Do confirmatory not exploratory testing

Exploratory testing = you run an A/B test to look for big or small changes that will increase your conversion rate. You come up with some ideas, then you test them to see which of them work.

Confirmatory testing = you make a risky update (e.g., removing a free trial) and want to confirm there is no huge negative effect (risk mitigation), or you make an aesthetic site update and want to see that it’s at least not worse than before.

If you get <1,000 visitors per month with 5% converting, you should not be doing exploratory A/B testing.

The best thing you can do instead is proactively look for bugs on your site. You can also do 1-on-1 user testing and surveys, or just deploy your best design and watch your conversion trends.

However, you might still be able to do confirmatory testing.

For example, say I ran a test for 2 months and found this statistically insignificant result (70% confidence level):

[Image: A/B split-test calculator result at a 70% confidence level]

Based on this, I’m confident enough that my new redesign is no worse than the original, with some chance it might be better. That is useful information.

Technique 2: Find a proxy metric

You can increase your test sensitivity by using a higher baseline metric. Say, for example, that your primary metric is purchases, but the purchase rate is only 2%. Here’s the lift you can detect (try it yourself: Vlad’s What-If A/B Test Planner):

[Image: detectable lift at a 2% purchase rate]

What’s a key milestone prior to sales? Let’s say your form fill (not yet submitted) rate is 3%. Maybe form starts are 4% (e.g., people enter an email). If we measure form starts, much smaller changes are detectable:

[Image: detectable lift at the higher form-start rate]

Of course, form starts are not purchases, but they’re a behavior that suggests improvement AND a good supporting metric.
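If you want to ballpark this effect without a planner, here’s a rough sketch using the standard two-proportion approximation. It assumes a two-sided 95% confidence level and 80% power, and the visitor counts are just illustrative:

```typescript
// Approximate relative minimum detectable effect (MDE) for a two-arm test.
function approxRelativeMDE(visitorsPerArm: number, baselineRate: number): number {
  const zAlpha = 1.96; // two-sided 95% confidence
  const zBeta = 0.84;  // 80% power
  const p = baselineRate;
  const absoluteMDE = (zAlpha + zBeta) * Math.sqrt((2 * p * (1 - p)) / visitorsPerArm);
  return absoluteMDE / p; // as a fraction of the baseline rate
}

// Same 5,000 visitors per arm, measured against purchases (2%) vs. form starts (4%):
console.log(approxRelativeMDE(5000, 0.02)); // ~0.39: only lifts of roughly 39%+ are detectable
console.log(approxRelativeMDE(5000, 0.04)); // ~0.27: noticeably smaller lifts become detectable
```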

Here’s how to use this to analyze your test. Say at the end of the test, you find these effects:

[Image: end-of-test effects across funnel metrics]

As you can see, in terms of purchases the variation beat the Control by only 10% and it’s not a statistically strong result. But the preceding steps show a progression from engagement to purchases and the shallower goals are statistically stronger. So the big picture here is actually pretty good.

You can also compare the performance of various metrics to see which are fairly in sync. For example, here’s what it might look like if Form Field Engagement is a great proxy for Revenue:

[Image: Form Field Engagement tracking Revenue over time]

A word of caution: The mere fact that the metrics are lined up DOES NOT increase the likelihood that B is the winner. These are correlated metrics that measure the same behavior (i.e., “purchase” implies “form completion”, which implies “form engagement”, which implies “scrolling to form”, and so on). These metrics will line up whether the effect is true or a false positive. What you CAN say is that, since these metrics are correlated, you can use the shallower metric as a proxy for the deeper metric.

Technique 3: Look for consistent performance

With low traffic, you get too few visitors per day to gauge daily variability, but you can track weekly trends. If a variation is winning more consistently, then it is more likely to be a winner.

Here is week-to-week performance over 7 weeks for two tests, both showing a 6% cumulative improvement:

[Image: week-to-week performance for two tests]

But one test is showing a lot of week to week variability, whereas the other shows the blue winning for 5 weeks straight. All other things being equal, the second offers more reliable evidence.

Technique 4: Compare segments

Another way of checking consistency is to compare performance across user segments. For example, we tested the same variation simultaneously on two virtually identical landing pages on different domains. We expected the variations to do similarly on both. Comparing day-to-day performance for a sample week, we see that the effect sizes for the two tests are moving roughly in sync:

[Image: day-to-day effect sizes on two domains moving roughly in sync]

If the variations are not in sync, it could mean the effect size is false or that the segments are dissimilar. However, if the segments are in sync, it’s a good sign. If you have a low daily rate, you should instead check bi-daily or weekly rates. Otherwise, performance jumps around too much just by chance.

Technique 5: Don’t expect big changes to bring a big win

Some people will tell you to aim for a large effect size by making big changes. For example, with 4,000 visitors and a 5% rate, you can detect an impressive 54% to 89% lift:

[Image: detectable lift with 4,000 visitors at a 5% rate]

That’s true, but how exactly are you going to achieve that? Is that realistic and worth the effort?

Big design changes have the potential to bring bigger wins, but it doesn’t mean that’s likely to happen. Big tests do fail often. They also take more development effort. So instead of fixing bugs on your site and building new features, you may end up doing lots of testing work with no results.

In the above scenario of 54-89% lifts, you are also quite likely to hit a large false positive and come away believing your design worked:

[Image: false positive risk at this sample size]

The only way to reduce the potential for large false positive errors is more traffic, even if you’re testing big changes. Hope for the best, but plan on testing for a while and putting in more time before you hit upon a win.

More techniques for planning tests and analyzing the results

What to test:

  • prioritize big changes that require little effort
  • test radical changes (think big conceptual shift, not just big visual changes)
  • start with ideas you have good, research-backed reasons to test

How to test:

  • test 1 idea at a time
  • plan to run tests longer
  • keep reminding your team a test is running
  • you don’t have to freeze development as long as you make changes globally

How to interpret:

  • you might have to tolerate variations going “red” for days as data is collected slowly
  • keep in mind that the effect size, even if the effect is real, is probably inflated
  • ignore big lifts if you’ve got only 1000 visitors so far (remember your false positive risk)

How much traffic is enough?

For example, with 20,000 monthly visitors and 0.5% conversions, you’ll have a tough time testing. But with 15,000 visitors and 10% conversions, you get a decent spread of effect sizes around 15% that are detectable within 1 month:

[Image: detectable effect sizes with 15,000 monthly visitors at a 10% conversion rate]

This is what I’d call “adequate traffic”.

Adequate traffic = you can run a test for 1 month or less with enough sensitivity to detect a 15% lift. Anything that takes over 1 month or aims at unrealistically high effects is low traffic.

Why 1 month? Because beyond that, things tend to get messy. Users clear cookies and reenter in different variations, your dev team accidentally introduces some change, and so on. Once you start talking of months instead of weeks, the test becomes a burden instead of an opportunity.

Why 15%? In my opinion, 15% is the sweet spot. If your sensitivity is aimed at 15% but you detect a 30% effect, then great – you’ll either have super-strong data or you can stop a bit earlier. If you detect a 10% effect, then you probably still have decent sensitivity to see a suggestive result.

Conversely, if you aim for 50% and the true effect is 15%, then you’ll be chasing phantoms. Since you virtually never know what to expect, it’s best to be conservative. I found that 15% is roughly the effect size at which A/B testing becomes reasonable for many sites.
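As a rough sanity check on your own numbers, here’s a sketch of the same reasoning in code, using the standard two-proportion sample size approximation (two-sided 95% confidence, 80% power; the traffic figures are the ones from this post):

```typescript
// Roughly how many weeks would a test need to detect a 15% relative lift?
function weeksToDetect(monthlyVisitors: number, baselineRate: number, relativeLift = 0.15): number {
  const zAlpha = 1.96, zBeta = 0.84;
  const p = baselineRate;
  const delta = p * relativeLift;
  // Approximate sample size per arm for a two-proportion test.
  const nPerArm = (2 * p * (1 - p) * (zAlpha + zBeta) ** 2) / (delta ** 2);
  const visitorsPerWeek = monthlyVisitors / 4.35;
  return (2 * nPerArm) / visitorsPerWeek;
}

console.log(weeksToDetect(15000, 0.10)); // about 3.6 weeks: adequate traffic
console.log(weeksToDetect(1000, 0.10));  // about 55 weeks: low traffic
```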

Site Traffic Examples

Traffic | Baseline Conversions | Verdict | Why?
20,000 / month | 0.5% | Low Traffic | Base rate is low. Test takes many months and/or a minimum effect of 50% is required.
1,000 / month | 10% | Low Traffic | Site traffic is low. Test takes many months and/or a minimum effect of 50% is required.
15,000 / month | 6% | Adequate Traffic | With this base rate and traffic, you can detect a 15% effect or so. Enough for a test.
50,000 / month | 5% | Adequate Traffic | With this base rate and traffic, you can detect a 15% effect or so in 2 weeks.

Conclusion

Testing on low traffic is paaaainful. Look at what you plan to gain from testing something, look at your chances objectively, and if you do go ahead, be patient and understand what you should and should not expect.

Did I miss anything? Let me know and I will update this.

Crowd-sourcing A/B test predictions

The collective guess of a crowd can be more accurate than that of an individual. For example, over a hundred years ago, statistician Francis Galton noticed that a crowd of people could guess the weight of an ox with over 99% accuracy. In a more complex domain like politics, we know that expert predictions are terrible, but the average of their guesses is better.

How accurate is a crowd when it comes to good UI design?


How good is the Crowd at picking a better design?

Behave.org has a good repository of people’s guesses about competing designs. Various contributors submit their A/B tests to the “Test of the Week” section, which asks users to guess the winner. Here, crowds get to weigh in on a single idea, one that is not their own. Moreover, these tests are curated to be interesting, which implies they are carried out by more experienced people with more robust hypotheses.

When we look at the last 106 tests on Behave.org, we see that:

53% of the tests show a variation that beat the baseline. But when the Crowd guessed the outcome of these tests, they guessed right 72% of the time (36% better). Interestingly, the Crowd did choose B about 50% of the time. It was just better at NOT choosing B when it was not an improvement.

FYI 53% does not represent the average success rate of test contributors at all, since it’s not a random sample. We just use it as an arbitrary baseline, to see if the Crowd makes the same guesses or better ones.

Why does the Crowd do better?

First, a test designer chooses B 100% of the time, since B is by definition the improvement. In contrast, an impartial outsider considers all options equally and freely chooses either B or A.

Second, a tester is biased by his idea. His faith in the idea papers over flaws in the implementation, especially if he executes his own idea. In contrast, an outsider just evaluates the implementation.

Finally, a Crowd by definition has greater diversity of opinions, gut reactions, etc. than an individual. So even if a Crowd mix is not representative of site visitors, a Crowd is likely to be MORE representative than just the individual.

Quality of the crowd and the data

Is it a Crowd of independent opinions?

The individuals in the crowd need to be independent. One of the flaws of “focus group” research is that individuals within the group influence the responses of others. This makes the Crowd less intelligent. The power of an internet poll is that we average over many independent opinions.

How qualified is the Crowd?

The composition of the crowd is also important. Sometimes, you want average people, representative of your audience, to give you their simple preference (the classic “Which product would you buy?” question in market research). Other times, you want experienced designers to give you their guess based on their expertise (though sometimes experts can be bad at predicting). For example, on all our projects at Goodui.org we ask multiple client contacts to specify their certainty in each test idea, which gets averaged with our own prediction. These are people who are not designers and who know more about the product and the user base than we do – both very good things that add diversity. At the same time, being subject matter experts, they are qualified to give feedback on marketing copy and so on. In other words, their contribution to the crowd is valuable. We then try to prioritize the ideas that have the highest overall score.

How to quantify Crowd opinion?

How you summarize the opinion of the Crowd matters. In the initial example of guessing the weight of an ox, Galton used the median to summarize the crowd’s opinion. A median controls for outliers and can increase data accuracy. For example, there might be unqualified people in the Crowd who are way off on their guess (I would have absolutely no idea, for example, how much an ox weighs). An average would be skewed, while a median would not be.
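As a tiny illustration with made-up numbers, one clueless guess is enough to drag the average around while the median barely moves:

```typescript
const guesses = [90, 1100, 1150, 1200, 1250]; // one wildly uninformed guess
const average = guesses.reduce((a, b) => a + b, 0) / guesses.length;
const median = [...guesses].sort((a, b) => a - b)[Math.floor(guesses.length / 2)];
console.log(average); // 958: pulled far down by the outlier
console.log(median);  // 1150: still close to the informed guesses
```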

How the question is asked and the format of the data itself also matters. For example, I recall that asking people “Who do you think will win the election?” predicted election results better than “Who will you vote for?”

In the case of Behave.org’s Test of the Week, if they asked visitors to quantify their confidence not as a binary choice but on a 0 to 10 scale (0 = A, 10 = B), then their uncertainty might have weakened the overall prediction. The binary choice of A or B effectively controls respondents’ uncertainty, since they are forced to give the same weight to “it’s slightly better” as to “it’s much better”. A simple binary choice might therefore make the prediction more accurate (the tendency to hedge toward the middle of a scale is known as the “error of central tendency”).

For example, at GoodUI.org with @jlinowski, we recently moved away from a simple 0 – 10 subjective certainty rating to a -3 to +3 scale. Since we tend to include test ideas that others are likely to also consider worthwhile, this effectively makes it a 0 – 3 scale. Simpler means stronger predictions. However, we also moved to a more complex formula that incorporates experimental evidence and research. This increased complexity risks decreasing the predictive power, since much more complex decision making is involved in figuring out if a variation is likely to win.

Can a Crowd fail?

Yes! More people doesn’t necessarily mean better predictions. Sampling (your choice of who’s in the Crowd) has to be valid and appropriate. In the classic example of the 1936 Literary Digest poll, a very expensive poll with a large sample size failed to make an accurate political prediction, because the sample was biased. So quality trumps quantity.

How good is the Crowd at predicting effect size?

For reasons of complexity, I don’t believe a Crowd can be relied on to estimate the degree to which a variation might beat the baseline. The possible % values are unbounded.

The next factor is experience. In the ox example, most people know what an ox is and have lots of experience with other objects. But every website, every site audience, and every implementation of an idea is unique, and experience doesn’t necessarily transfer. Of course, if you’ve tested something exactly like it on a dozen very similar sites and got a similar result each time, then you would have experience. I’d be skeptical that anyone can make such a claim, however. CRO experts rely on a lot of intuition and personal judgement to fill those gaps in experience.

Finally, there’s the type of thing we are measuring. Weight is a simple, additive property. A rock that appears to be 2X the size of another probably weighs 2X as much. An ox about the size of 5 people probably weighs about as much as 5 people. Online experiments are not like this. If I am testing an idea, it’s often not decomposable in such a way. And even if we were to decompose it, there is no guarantee the effects of the parts add up this way.

In other words, guessing the outcome is very very hard and should not be done subjectively. We should do our best to find similar past tests, and use those to make a starting prediction. We can then collect some data in order to fine-tune the prediction. We can then test our prediction with more data.

Crowd-sourcing design ideas

How do we figure out what to test in the first place or find something better than B? After several unsuccessful tests a while back, we decided to crowd-source improvement ideas from GoodUI blog visitors. The result was the highest response rate of all posts up till then with lots of ideas that led to new variations.

We could further improve this by asking visitors to make predictions about others’ ideas before adding their own. That way we would crowd-source a list of ideas as well as predictions for each idea. As we have seen, stronger predictions result when a Crowd gathers to rate a single idea, instead of each individual putting forward and rating their own.

Lessons learned

If I want to improve my success ratio, I need to behave like the crowd:

  1. Separate the idea from the implementation. I recently had a great idea, but my best implementation was weak. It took 6 visual iterations over several weeks to ensure both idea and implementation were sound. Feedback from 2-3 people (who were not involved in generating the idea) was critical in improving it.
  2. Do a pre-mortem. Imagine that B already lost, and try to figure out why that happened. This way you force yourself to consider that B might not be better and needs improvement.
  3. Seek first impressions from outsiders, both qualitative (ideas) and quantitative (predictions about ideas).

What’s in the future?

I envision a crowd-prediction service (similar to remote User Testing services that have become popular). You could pay to have 100 pre-qualified people make predictions about your test.

Happy testing.

When and why to peek at A/B tests

Every other day or so, you should peek at how your tests are doing. Here are some guidelines on doing that without skewing your data:


Technical problems

The main reason you want to peek frequently is technical problems. You should QA your site before you launch, but you should QA again a couple of days in and later on. You may have missed some bugs, and repeat QA will catch more of them. Other site changes get introduced that break your test in some way. And sometimes a transient bug doesn’t show up in the data until more data is collected.

Averting losses

If you set it and forget it, you may leave a losing test running too long. If you’re after long-term gains and want to avoid losses, you should stop a losing test at some point. However, stop when you can be reasonably sure you’re doing the right thing. You don’t want to stop your test a day in because you get put off by a big initial drop. But if your test is losing for 2 weeks straight with high statistical confidence, you might not want to let it run another week. If your goal is learning, then you might want to run a test longer to confirm it is a loser – but exposing your site to a poor variation may not be good for your business.

What stats are you using?

Many tools have moved to different statistical methods, which allow peeking. VWO uses Bayesian stats, which give you up-to-date confidence and probabilities. Optimizely uses a pseudo-Bayesian sequential testing method to allow you to peek. So depending on what statistical method you use, you may be “allowed” to peek and make decisions. However, keep in mind that regardless of what your tool says, you still want to let your test run a reasonable amount of time. So estimate your duration upfront anyway, so you have some point of reference.

No significance, no problem

If you’re using traditional stats (which work fine and are much easier to understand), the basic idea is to avoid calculating statistical significance (the p-value) and then making interim decisions based on it. A p-value is meant to be something you calculate after your test is done. However, you can check significance and adjust your duration estimate once or twice during the test (e.g., once you get a ballpark effect size). That’s not going to skew your p-value much. If you commit to some sample size upfront, then you can check significance all you want – it gives you a basic sense of whether the effect is strong.

Safe things to peek at

It depends on what you peek at. If you’re not calculating significance but still making decisions based on how the test is doing so far, you’re still skewing your final analysis. However, you can peek at how your overall test is doing (without drilling into how each variation is doing). For example, if you see that overall traffic to the test or the total conversion rate is lower than expected, then you can recalculate your duration estimate without any problems. Looking at the test overall doesn’t tell you how the variations are doing relative to each other – unfortunately A/B testing tools never offered this option. You definitely want to readjust your duration estimate once you have more data. You just don’t want to keep doing that based on how each variation is doing – that fluctuates and if you keep adjusting your duration to current performance, you’re letting chance lead you instead of subduing the effect of chance with time.

Have someone else peek

I don’t do this, but a good idea I’ve come across (not sure how practical) is to have one person peek for technical issues and a different person to ultimately do the final analysis and make the decision.

Objectivize

The biggest danger with peeking is that your brain will look for patterns and once you see them, you can’t unsee them. This can lead to disappointment when a green turns red. Or worse it can lead you to hack your test to get the results you want. Try to be objective about it. Once the test is running, you’re after the truth, not winning or losing.

Visual patterns for A/B test structure

Once you figure out what you want to test, you need to define what you’re going to measure and where. In this post, I will introduce my preferred terms for describing test structure (things like test conditions, goals, and pages), and I’ll use a visual language to cover the basic patterns. Here’s an example:

[Image: A simple test showing Gate (Start), Path, and two Goals (the big black circle is primary).]

Gates and Goals

GATE: circle that represents all conditions for entry into the test, including test URL and traffic segmentation (Example: Home page mobile traffic).

GOAL: all success conditions, including confirmation page URL and business rules (Example: Thank you page visit after purchase of premium package).

TIP: If you have a sizable mobile segment, you’ll want to track mobile, tablet, and desktop traffic separately. If your tool doesn’t allow you to segment after the fact, set up 3 separate tests with mutually exclusive gates. Other gates you should distinguish are: existing users vs. new users, ad traffic vs. direct traffic, and so on, in case each segment performs differently. Keep sample size in mind, because segmentation reduces sample size and increases false positives and false negatives.
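As a sketch of what mutually exclusive device gates could look like, assuming your tool lets you supply a custom JavaScript activation condition per test (the breakpoints here are just illustrative):

```typescript
type Device = "mobile" | "tablet" | "desktop";

function deviceType(): Device {
  const touch = "ontouchstart" in window;
  if (touch && window.innerWidth < 768) return "mobile";
  if (touch && window.innerWidth < 1024) return "tablet";
  return "desktop";
}

// Each of the 3 tests activates for exactly one segment, so the gates never overlap:
const activateMobileTest = deviceType() === "mobile";
const activateTabletTest = deviceType() === "tablet";
const activateDesktopTest = deviceType() === "desktop";
```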

Primary vs. Secondary Goals

Page visits are generally more reliable than clicks, so they are the preferred primary metric. Clicks on links or form submits are often secondary metrics. There are many other customized types of metrics based on user behavior and business rules.

[Image: Primary vs. secondary goals. The large circle is a primary goal; small circles are secondary.]

TIP: Whenever possible, track both the start and end of an interaction e.g., track clicks on a link and the visit to the destination page.

TIP: Tracking how many people start completing form fields is a good measure of intention e.g., track keydown or change events on key form fields. It can also highlight anomalies in other goals.  Track attention (user scrolled to and stopped at the element being tested) as a secondary metric via scroll tracking and setTimeout().
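Here’s a minimal sketch of both secondary goals, with `trackGoal` standing in for whatever conversion call your testing tool exposes and the selectors being placeholders:

```typescript
declare function trackGoal(name: string): void;

// 1) Intention: fire once when the visitor starts typing into a key form field.
document.querySelector("#email")
  ?.addEventListener("keydown", () => trackGoal("form_engagement"), { once: true });

// 2) Attention: fire when the tested element has been in view for ~2 seconds
//    (scroll tracking plus setTimeout, as described above).
const tested = document.querySelector("#tested-element");
let attentionTimer: number | undefined;

function checkAttention(): void {
  if (!tested) return;
  const rect = tested.getBoundingClientRect();
  const inView = rect.top < window.innerHeight && rect.bottom > 0;
  if (inView && attentionTimer === undefined) {
    attentionTimer = window.setTimeout(() => trackGoal("attention_tested_element"), 2000);
  } else if (!inView && attentionTimer !== undefined) {
    window.clearTimeout(attentionTimer);
    attentionTimer = undefined;
  }
}
window.addEventListener("scroll", checkAttention);
checkAttention();
```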

Goal Depth

Your primary goal might be directly on the page you’re testing or further down the funnel.

[Image: direct, shallow, and deep goals]

A direct goal happens on the test page and is your ideal end goal (e.g., an AJAX payment event).

A shallow goal is a relative term for a goal at or near your test gate that’s not ideal. For example, a visit to the checkout page is a shallow goal relative to a primary goal of completing the purchase.

A deep goal lies further away from the gate and is usually your primary goal. However, you might track other deep goals that are not primary (e.g., post-purchase downloads, dashboard engagement). Changes to deeper goals are harder to detect using statistics, because counts are lower.

TIP: If your primary metric won’t produce enough data in the time you have, then choose the next best metric.

Conditional Goals

Behavioral and business conditions can be added to goals. For example, fire a conversion when a timer expires, a scroll position is reached, several steps are completed, or a user successfully logs out and returns again. You can map goals to any user behavior.

[Image: conditional goals]

TIP: Additional logic requires additional code, increasing risk of technical and logical errors. Be careful about making your primary goal very complex. Moreover, elaborate goals that are harder to achieve will have a lower conversion rate and will be harder to track. However, they may be more informative – track them but have a fall-back.

TIP: You can set up goals to detect errors on complex screens with lots of dynamic components. For example, part way through the test you might find that clicks on your new button are low. So you might set up a goal to check that the button actually exists on the page for all visitors. In one case, we wanted to check if any visitors to a split-URL test were changing the URL to enter a different variation.

Single Page vs. Template

Your test gate is typically a single page. The gate can also be multiple product pages that use the same template. It can also be multiple pages that are completely different, or the test can even be site-wide. For example, if you’re testing a change to your navigation or a sidebar, you’ll want to modify all pages with that element – for consistency. Advantages: potentially much higher traffic to your test, and the change is tested in a broader context. Disadvantage: you may obscure different performance on each page due to different traffic sources, different previous pages seen, etc.

[Image: single page, template, and site-wide gates]

TIP: If you can, track that your multi-page A and B samples contain similar ratios of visitors to each page. For example, if your test is running site-wide, you might want to know that your A and B samples contain roughly the same % of people who came from the home page, pricing page, blog, etc. What if, for instance, product A gets mostly traffic from your internal search, while product B was mentioned in a blog and gets lots of referral traffic?

TIP: You can run separate tests to isolate the data set for each page as long as there is no visitor overlap.

Sequences and Funnels

[Image: Navigation within funnels. Straight lines are direct links, while wiggly lines suggest intervening pages, e.g., Home > Pricing > Checkout as well as Home > Checkout.]

If you have a unique link to a page, you control the traffic to that page. Other times, you might have multiple links, which means multiple entry points into a page. You might also have steps that can be bypassed. The wavy line means that the page transition is flexible or unknown (e.g., Home page to Pricing page), while a straight line means it’s a direct relationship (e.g., Payment Step 2 to Step 3).

TIP: Verify paths, don’t assume them. Are you getting lots of direct visits from Google into the middle step? Can visitors bypass your link to step 3?

TIP: Set up tiered metrics, so you can track the user’s progress at each step of the funnel.

[Image: funnel sequences with shallow and deep goals]

If your traffic is too low to detect a change in your end goal, you should make a shallower goal primary. It’s also useful to track whether users step outside the main funnel. For example, if you find that a losing variation is increasing traffic to the pricing page, you might have a hypothesis to explain the loss.

User behaviors other than what you want are also good to track (Distractions).

TIP: Track visits to all main pages, like pricing, about us, blog, etc. Most patterns will be meaningless, but sometimes they are informative. For example, a huge increase in visits to the Pricing page might suggest that something you said made people think of cost, which may or may not be a good thing.

Visual Scope

You can test just one page (Page Test) or you can test an entire funnel as a sequence of pages (Funnel Test) – visitors see version A of the complete funnel or version B of the complete funnel.

[Image: Visual scope. The grey container represents the scope of visual changes.]

On rare occasions, you might start your test on a page other than the one you’re testing. For example, a page may not be accessible directly or may depend on an interaction with the preceding page, so you have to start your test a step earlier (Premature Start). You might also have visual changes on different pages that go together conceptually, such as a discount offer on the home page vs. on a deeper page. So in one variation, the user will enter prematurely on a page without visual changes.

TIP: Avoid testing related pages in separate simultaneous tests, because you’ll have to account for visitors’ coming from and seeing different versions of each page.

TIP: When running tests simultaneously on the same site, use cookies to ensure visitors can only join 1 test at a time. If you do risk running overlapping tests, at least add metrics to each test to track which variation of each test the users saw. That way you can at least check that the assignment is roughly equal and even split users into non-overlapping segments.
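A minimal sketch of that cookie guard might look like this (cookie name and test ids are made up; each test’s activation code would call it before enrolling the visitor):

```typescript
// Returns true if this visitor is free to join the given test.
function claimTestSlot(testId: string): boolean {
  const match = document.cookie.match(/(?:^|; )active_ab_test=([^;]+)/);
  const current = match ? decodeURIComponent(match[1]) : null;
  if (current && current !== testId) return false; // already enrolled in another test
  document.cookie =
    `active_ab_test=${encodeURIComponent(testId)}; path=/; max-age=${60 * 60 * 24 * 30}`;
  return true;
}

if (claimTestSlot("checkout-redesign")) {
  // ...activate this test's variation code...
}
```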

Data Collection

Tests can be run by injecting CSS and JavaScript into an existing page or creating a separate page URL for each variation.

[Image: data collection test types]

A tracking or blank test simply collects data about your site. You might run such a test to QA your metrics or estimate your traffic and conversion rates to enable you to do a power analysis.

An A/A test is less a way to test the tool than a way to understand the properties of the traffic e.g., are all users relatively similar, producing less variation in the conversion rate over time?

Visual changes can be tested by injecting CSS and JavaScript into an existing page or creating a separate page URL for each variation.

TIP: Dynamic A/B tests can be faster to deploy but can have a flickering problem or take slightly longer to load. Split URL tests don’t have these problems but require you to create a duplicate page, which is not always possible. They also require separate URLs for each variation, so you then need a process to reuse or expire the old URL variants. The redirection itself and the difference in URL might be noticed by some users.

TIP: A hybrid approach is redirecting to a URL parameter on the main URL. Changes are then applied on the back end or front end based on the parameter e.g., example.com/?v=a vs example.com/?v=b.
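A front-end sketch of that hybrid approach could be as small as this (the class name is illustrative; the same `v` parameter could equally be read on the back end):

```typescript
// The testing tool redirects to ?v=a or ?v=b on the same URL; apply the variation here.
const variant = new URLSearchParams(window.location.search).get("v") ?? "a";

if (variant === "b") {
  document.body.classList.add("variant-b"); // CSS/JS changes hang off this class
}
```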

Single or Multiple Variables

Tests can involve a single change or multiple changes, as in a full redesign. Micro tests allow you to connect the observed effect to a specific visual change – this is the most satisfying type of test. A macro test allows you to test a coordinated set of changes, but you won’t know the effect of each individual change. This can put you in a tough spot if the test loses – do you scrap the whole thing or try to retest specific elements of it?

A multivariate test is used to test multiple changes by essentially running a separate test for each combination of variables. The risk with this type of test is lower power for each sub-sample, with more false positives. This is not the same as running multiple tests simultaneously on the same page, since in that case you won’t know which version of which test each user saw.

[Image: single-variable vs. multivariate tests]

In closing

Use your awareness of page and goal patterns to collect richer data and avoid common mistakes.

Simulations are faster and more intuitive than calculations

I use simulations all the time to help answer questions like: Is this outcome possible? What outcomes are most likely? How much data is enough?

Simulations can give an answer faster than detailed calculations. They are less precise but far more intuitive. If you run a simulation 10 times and get a certain outcome even once, you know it’s possible. If you get it a few times, you know it’s quite likely. If you want more confidence, just rerun the simulation 10 or 100 or 1000 more times.

What if?

In the previous post, I included an under-powered simulation in Excel, where we ended up with a 23.7% drop instead of the true +10% lift. Using that template, you can set up a simulation in seconds.
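Here’s the same kind of what-if as a quick code sketch instead of a spreadsheet (my own parameters, not the exact ones from the template): a true +10% lift on a 5% baseline, 1,000 visitors per arm, rerun 1,000 times to see how often the observed result is actually a drop.

```typescript
// Count conversions for n visitors converting at the given rate.
function conversions(n: number, rate: number): number {
  let count = 0;
  for (let i = 0; i < n; i++) if (Math.random() < rate) count++;
  return count;
}

// One simulated test: returns the observed relative lift of B over A.
function simulateOnce(n = 1000, baseline = 0.05, trueLift = 0.1): number {
  const a = conversions(n, baseline) / n;
  const b = conversions(n, baseline * (1 + trueLift)) / n;
  return (b - a) / a;
}

const runs = 1000;
const drops = Array.from({ length: runs }, () => simulateOnce()).filter(lift => lift < 0).length;
console.log(`A drop was observed in ${drops} of ${runs} simulated tests despite a true +10% lift`);
```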

Continue reading “Simulations are faster and more intuitive than calculations”

How to correctly define a business hypothesis

A hypothesis is an explanation of why something is the way it is.

Example business hypothesis:

“We are a new company, and visitors have doubts about the quality of our product.”

Do we really create hypotheses to test them?

To see if my example hypothesis is true, it would be best to talk to some potential customers. A/B testing is not really about testing business hypotheses but about using them to iterate a design. An A/B tester is not a scientist. He takes the hypothesis as inspiration for new visual treatments in order to increase his chances of raising revenue.

Continue reading “How to correctly define a business hypothesis”

I Have An A/B Test Winner, So Why Can’t I See The Lift?

In the town of Perfectville, a company ran a winning A/B test with a 20% lift. A few weeks after implementing the winner, they checked their daily conversions data:

 

[Image: 97 days of daily conversion rates in Perfectville showing a 20% lift]

 

The graph perfectly relates what happened: the baseline increased by 10% during the test (with half the traffic exposed to the winning variation), then there was a week when the test was stopped, followed by a 20% lift once the winner was implemented.

The good people in nearby Realville heard about this and ran the test on their site. When they later checked their daily conversions data, they scratched their heads (as they often do in Realville):

 

[Image: 97 days of daily conversion rates in Realville showing the same improvement]

 

The data actually includes the same 10% lift during the test, a gap, and a final 20% improvement. The problem is that the improvement sits on top of natural fluctuations in daily conversion rates, so a 20% improvement doesn’t necessarily look like a 20% lift.

Here are 6 reasons why people in Realville might find it difficult to see a lift and what they can do about it.

Reason 1: The effect is too small

The smaller the lift, the harder it is to see it through the noise. If the conversion rate drops for some reason unrelated to the test, the lift from your winner might not even offset that. For example, here’s 1 week of simulated daily conversion rates followed by a week with a 20% lift compared to a 5% lift. If the lift were 5%, it would look as though the test actually did worse in the second half:

 

[Image: 7 days at baseline followed by 7 days with a 5% vs. 20% lift]

 

Have you just run a test and are looking at during-test data? You likely won’t see any effect. Typically only 70-80% of visitors will join the test (more on this below), and these are split among your variations. If 80% of your traffic actually participated in an ABC test, a third of that is exposed to the winning variation. So, a 20% lift would manifest as 5% overall.
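The arithmetic behind that last number, spelled out:

```typescript
const participation = 0.8;                   // share of visitors who actually join the test
const shareSeeingWinner = participation / 3; // A/B/C split: about 0.27 of all traffic
const trueLift = 0.2;                        // the winning variation's real lift
console.log(shareSeeingWinner * trueLift);   // about 0.053, i.e. roughly 5% overall
```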

What you can do:

  • Look for a larger cumulative upward trend after several tests.
  • Compare longer timescales for baseline and post-implementation data.

 

Reason 2: Your baseline is too variable or you’re not looking at enough data

In Perfectville, conversions are constant each day, each week, each month. This means a 20% improvement causes a 20% lift. Not so in Realville. In Realville, daily conversions naturally fluctuate, so the full potential for improvement may not manifest. The more your conversions fluctuate, the harder it is to see the lift in the data.

Here are two similar simulated data sets with low and high variability, both showing a 20% lift. The lift is more obvious when variability is lower:

 

[Images: a similar 20% lift with low and high variability]

 

Sales may fluctuate for a lot of reasons (weekly, seasonally, in response to your marketing activities, unexpected traffic). The smaller your sample, the higher the chance that the pattern you’re looking for simply won’t be there. For example, if you just saw the middle segment of the full graph below, you’d never know that the right, orange side of the graph shows a 20% improvement:

 

[Image: 14 days of simulated daily conversion rates (blue), then 14 days with a 20% improvement (orange)]

 

What you can do:

  • Zoom out to reduce variability. If the data is too variable daily, look at 2-day or weekly rates
  • Look at more data to cover the full cycle of ups and downs e.g., a week (note that the lower your conversion rate, the more data you need to see an effect)
  • Check your site analytics to see what might have been different that week. Check if dips have happened before. Might one have coincided with the test?
  • If the data has a lot of variation, it is hard to estimate visually. Compare what I’ll call the “clipping rates”. In this graph, you see higher peaks as well as a higher frequency of peaks in the second half:

 

[Image: a 20% lift manifests in more frequent and higher peaks]

 

Reason 3: Not everyone was part of your test

Even if you didn’t put exclusion conditions on the test, some visitors were excluded.

For example, mobile visitors are excluded by default. Another 10-20% of visitors normally get excluded when the A/B testing tool times out. Further technical implementation issues can exclude another 10-20% of visitors: JavaScript-heavy sites, the tracking code not being implemented in the right place, and so on.

Moreover, gaps in test design can create a discrepancy between test and sales data. For example, we ran a test on the home page of a basic single-product site and noticed that our test data was missing many sales. It turned out that about 50% of purchases came from people who never visited the home page, as well as from existing customers using a special upgrade page that we hadn’t considered.

As a result of these exclusions, when you implement your winner, you may be exposing it to segments you didn’t test it on. For example, although you tested on desktop and saw a 20% lift, the same design on mobile might cause a 30% drop. So, if you made the winner your new home page for all traffic, the drop in mobile could counteract some of the lift (say, if you had lots of mobile traffic).

What you can do:

  • Factor in 20% exclusions due to technical issues, like timeouts
  • Set up an inverse test to see how many sales are by-passing your main test (target pages and visitors who are excluded from your main test)
  • When looking at sales or conversion data, keep in mind it probably includes segments you didn’t test on. Test the design on all segments that will be exposed to it e.g., new customers and existing customers. For mobile, build and test a dedicated mobile version

 

Reason 4: You are eyeing it instead of using math

Sometimes a lift is obvious. Other times you need to use math. Here’s a sample of real conversion data with about 20 days of baseline followed by 20 days of the improved version:

 

[Image: just over a month of real conversion data, with the winner on the right]

 

The lift is not visually obvious. Nonetheless, the average for the first 20 days is 0.71%, whereas the average for the last 20 days is 0.85%, which is a 20% lift. However, if the standard deviation of the data is high, the difference in averages may be coincidental.
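Here’s a minimal sketch of that kind of check: average the daily rates before and after, then gauge the difference against the day-to-day noise. It’s a rough heuristic on daily rates, not a substitute for a proper test on the raw conversion counts.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1));
}

function compareDailyRates(before: number[], after: number[]): void {
  const lift = (mean(after) - mean(before)) / mean(before);
  // Standard error of the difference between the two means.
  const se = Math.sqrt(stdDev(before) ** 2 / before.length + stdDev(after) ** 2 / after.length);
  const z = (mean(after) - mean(before)) / se;
  console.log(`Relative lift: ${(lift * 100).toFixed(1)}%, ${z.toFixed(1)} standard errors from zero`);
}
```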

 

Reason 5: Your design or conditions are not the same

This happens all the time. You run a winning test, then you tweak the winner before pushing it to your site. It’s entirely possible that those visual tweaks reduced the effectiveness of your variation.

It’s also possible that conditions are different now from when you ran the test. Did you test during the holidays or launch during the holidays?

Are you including a different page? You might have several pages that look similar. So you tested something on one page, and you decided to apply it in one go to all the pages. If so, there is no guarantee that the same concept will work equally well on other pages.

What you can do:

  • Check your site analytics to see what conditions might be different now and retest if necessary
  • If you know you will be changing something, apply changes to the variation and test it with the changes
  • You should implement the winner as the new control and then test the new changes
  • Retest on each site if you have reason to believe the outcome may be different

 

Reason 6: It was a false positive

Yes, it happens all the time. There are many reasons you might have gotten a false positive, including improper test design and not running your test long enough. The most common scenario is you run your test until you see a winner and stop. I’ve seen results that looked very exciting flatten after 3-4 weeks.

What you can do:

  • Follow the great tips on http://goodui.org/betterdata to ensure you get good data

 

Back To Realville

Let’s say Realville decided to retest the Perfectville winner 8 more times (it took years!). They found that, indeed, the overall tendency of the variation was towards an increase, following the same pattern as Perfectville’s test: a small lift during the test, a slight dip when the test was stopped, and then a larger lift after the final launch. However, despite the overall trend, individual outcomes showed that chance is a factor in this imaginary scenario:

Let me know if you apply some of these concepts and find them useful.