I was hired by a small business CEO to explore and visualize a business opportunity. I can’t share the actual industry due to NDA, but I created an analogous scenario about travel.
The Problem
There are many stories of travelers buying insurance and still not being covered:
How might we prevent this or help the customer through this kind of situation?
Business Opportunity
My client’s business would help people in financial difficulty. The ask was for a concept mockup to share with a larger team to drive very early discussions. I also created a preliminary service blueprint of an emotionally charged situation: Susan cancels an expensive trip and her insurance doesn’t cover it.
Wireframing and Crafting Value Proposition
I designed a concept and wrote copy to connect the business idea to a strong customer story. The client appreciated the story-driven approach, which encouraged them to engage with real customers sooner. In past projects, similar advocacy led my clients to incorporate testimonials and eventually move to photos and candid videos.
User Advocacy
Clients are rarely eager to get user input to validate assumptions or get new perspectives. I always have to initiate those discussions.
Here’s an example email where I start to raise these issues:
To: Client
You asked me to think through the employer side of the travel insurance service. How might we get insights from actual "employers"? What past experiences might encourage or deter them from using a service like this?
Identifying Points of Potential Breakdown
My objective was to encourage the team to seek real customer feedback and gain a more holistic understanding of the customer experience.
When the big problem hits, Susan is dealing with multiple issues: a stressful event, canceling her trip, losing money, limited time off work, and exhaustion from fighting with insurance. She feels powerless when she discovers our service.
Empathizing with her situation can influence how the service is organized, marketed, and how customer service staff are trained. Instead of focusing only on the end result, like Susan getting a partial refund and us getting paid, it’s important to recognize that the solution for the customer begins much earlier, such as finding a service that offers hope.
Opportunities
A broader definition of the job-to-be-done encourages the team to view the service from the customer’s perspective. While we may think we’re selling a refund service, Susan is actually seeking guidance through a stressful situation. This broader view reveals additional opportunities:
offering flexible flight options
creating a ticket resale marketplace
providing financial counseling
building software for independent insurance claim evaluations
Next Steps
Continuing to get a clearer picture of the target audience, defining constraints (e.g., what type of user story to focus on), and iterating on a concept of the solution.
I created wireframes and a partial high fidelity design for a fashion e-commerce site looking to increase its click-through rate.
Direction
My hypothesis was that larger images and exposed curators would encourage users to browse more slowly and pay more attention to each item. Instead of 3 columns of smaller images, I opted for 2 columns with larger images. I exposed basic item details, including the brand, because this site is all about curated, select items from boutique brands.
Wireframing
I sketched a desktop concept in Adobe Illustrator and annotated my design rationale and interaction details. I then showed the flow of various edge cases and states:
I use Illustrator so I can freely explore layout options without being constrained by the shapes and features of dedicated wireframing software.
I created a responsive mobile view in parallel, showing how the columns could be collapsed and filters overlaid:
Vivareal.com.br recently ran a test where they removed the Apply Filter button and instead updated the search results instantly (screenshot from goodui.org/evidence):
The results are weak, but they suggest a simple filter behavior might improve a deep metric like Leads (form submits). Given that many prominent sites use instant filtering, it gives the rest something to try. It also raises the question of what the best implementation is for a given site.
Progress Indicator Is Mandatory
On other sites, like Reverb.com, the transition is smooth. The search results go grey, then they populate in-place:
If search results take more than 200ms to update, show a spinner, grey out the results, or use another progress indicator (see Jeff Johnson, 2010, for more on human timing requirements). Beyond that threshold, the search filter won’t be perceived as instant and users will be put off.
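One way to implement the 200ms rule is to start the fetch immediately but delay the progress indicator, showing it only when the response is slow. Here is a minimal sketch using Python's asyncio (the threshold value and callback names are my own, for illustration):

```python
import asyncio

async def update_results(fetch, show_spinner, hide_spinner, threshold=0.2):
    """Fetch filtered results; show a progress indicator only if the
    fetch takes longer than the perceived-instant threshold (~200 ms)."""
    task = asyncio.create_task(fetch())
    done, _ = await asyncio.wait({task}, timeout=threshold)
    if not done:
        show_spinner()   # fetch is slow: grey out results and/or spin
        await task
        hide_spinner()
    return task.result()
```

A fast fetch never flashes a spinner; a slow one greys out the results until the data arrives.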
Autotrader.ca doesn’t show results instantly. To avoid confusion, they grey out the results and show a green alert confirming that the user’s filter changes have not yet been applied:
Page Refresh Or Not?
On many big sites like eBay and Amazon, filters behave as links. There is a delay and a full page refresh. Here’s eBay:
Some problems with this:
The redirect always takes too long. And after a page refresh, I find I need a second to refocus and figure out what happened.
I usually don’t find it obvious how to undo a filter or go back in that case, because the UI has changed. I’m probably not the only one.
Change Blindness could obscure some important visual change (e.g., people may not notice that some other categories have been updated).
A page refresh may be necessary if there’s a long list of filters that goes past the fold. If you update listings immediately in that case, the updated results could be out of sight at the top of the page. Many sites also update related categories and sub-filters based on the filter chosen, so essentially the whole page does have to change.
I would be very interested to see an alternative pattern to this that is less jarring, faster, and allows easy backtracking. For instance, an animated scroll back to the top may be a better option than a full page refresh in some cases.
Submit For Inputs Or Update As You Type
In the case of Reverb, notice there is a small submit button for the text inputs. All the other fields are instant, but you submit an input field manually when you’re done typing:
In contrast, on sites like Netflix and platforms like Apple TV, filter-as-you-type is heavily used and is very effective:
My recommendation is to use this only if you serve results FAST and if the search terms are bounded. For example, Netflix is a repository of movies and there are only so many movies that start with “Fast and”. So it’s a good solution for them. Likewise, on a stock site, showing the most common stocks might work, because stock symbols are a limited set.
Autocomplete
If filter-as-you-type is not feasible, consider using Autocomplete on the input fields. When using Autocomplete, you can delay showing results until enough characters are typed. And when you show matches, it’s text-only and right below, so it’s not distracting. Moreover, web users are all familiar with this pattern.
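The minimum-character behavior is simple to sketch. In this hypothetical example, suggestions appear only after 3 characters, and the catalog contents are made up:

```python
def suggest(prefix: str, catalog: list[str], min_chars: int = 3, limit: int = 5) -> list[str]:
    """Return autocomplete matches only once enough characters are typed."""
    if len(prefix) < min_chars:
        return []  # too early: showing matches now would be noisy
    prefix = prefix.lower()
    return [item for item in catalog if item.lower().startswith(prefix)][:limit]
```

Capping the list at a handful of matches keeps the dropdown compact and non-distracting.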
Reverb uses Autocomplete in its main search input on the home page, but not on the Keywords filter.
Do Not Freeze
When a search filter is clicked, it should not block the whole app. That is, a user could apply more filters while the previous ones are still fetching data. Similarly, removing a filter should apply opportunistically. For example, when I click an enabled checkbox filter, it should immediately show as “off”, even while the data is still being fetched.
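This non-blocking behavior can be modeled with a sequence number per toggle, so the UI state flips instantly and stale responses are simply discarded. A sketch (the class and method names are my own):

```python
import itertools

class FilterPanel:
    """Filters toggle instantly; only the response to the latest toggle is applied."""
    def __init__(self):
        self.active = set()   # currently enabled filters (the UI state)
        self.results = []     # last applied result set
        self._seq = itertools.count()
        self._latest = -1

    def toggle(self, name):
        # Optimistic update: the checkbox flips immediately, even if an
        # earlier fetch is still in flight.
        self.active.symmetric_difference_update({name})
        self._latest = seq = next(self._seq)
        return seq, frozenset(self.active)  # hand these to the backend call

    def on_response(self, seq, results):
        # Ignore responses that arrive after a newer toggle was made.
        if seq == self._latest:
            self.results = results
```

The user can keep toggling filters while earlier requests are pending, and out-of-order responses never overwrite a newer view.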
Cache Results
If data sets take a long time to load, we should cache the results. That way when people undo a filter, we can go back to the previous view instantly. It may also be a good idea to keep the original results until new ones are ready. That way, if a user accidentally clicks a filter and then undoes it, you can just cancel the new lookup and show the original results.
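Caching by filter combination makes undo instant, because the previous view is keyed by the previous filter set. A minimal sketch, where the fetch function is a stand-in for a real backend call:

```python
class ResultCache:
    """Memoize results per filter combination so undoing a filter is instant."""
    def __init__(self, fetch):
        self.fetch = fetch              # fetch(frozenset) -> results
        self._cache = {}

    def get(self, filters: frozenset):
        if filters not in self._cache:  # only hit the backend on a miss
            self._cache[filters] = self.fetch(filters)
        return self._cache[filters]
```

Applying a filter, undoing it, and re-applying it triggers only two backend calls in total.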
Best For Simpler Searches
If a user is expected to apply multiple filters, then there is no point in refreshing results after every single change; a manual submit is best in this case. This is also true when there are many filters. For example, your basic search could be instant, but if the user pulls down the massive set of Advanced filters, then a submit button would be needed.
Also if the data set takes a long time to filter (because it’s vast or complex), a manual submit may be better. For example, the search filter for the ChemTrac municipal chemical data repository has an Apply button which filters a very busy map:
Estimate Search Time
It’s a good idea to log search times to see if it’s taking too long. If searches take more than 1 sec on average, then you could show the ETA using a progress indicator or countdown, rather than just a generic spinner. For example, the map for ChemTrac can take a few seconds to populate. Here’s what it might look like with an ETA:
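Logging durations and deriving an ETA could look like the sketch below. The 1-second threshold and the rolling window size are illustrative choices:

```python
from collections import deque

class SearchTimer:
    """Track recent search durations to estimate an ETA for the next search."""
    def __init__(self, window=50):
        self.samples = deque(maxlen=window)  # keep only recent searches

    def record(self, seconds):
        self.samples.append(seconds)

    def eta(self):
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)

    def needs_eta_display(self):
        # Prefer an ETA/countdown over a generic spinner for slow searches.
        est = self.eta()
        return est is not None and est > 1.0
```

A rolling window keeps the estimate responsive if the data set grows and searches gradually slow down.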
When a visitor knows what to expect and completes a process smoothly, I call this good “flow”. This post shows some options for how a checkout can be organized and presented to anticipate questions like: Does everything look right with my order? How long will this take? Is it going to be complicated? What are my options?
Account Creation
Allow users to check out as guests without side-stepping into an account creation flow:
Better yet, conceal account creation. For example, ask customers if they want to save their information at the bottom of the form (or on the Confirmation page after the order is completed):
For existing customers, you can provide a link to login or a small sign-in form on the side. If a user chooses not to login, you might check if their email is already in the system and offer to retrieve their last used info:
If someone forgot their password, you can tell them to continue as a guest to avoid the delay of recovering their password:
Express Checkouts
Give existing customers an express checkout option. On different sites, it may be called “express checkout” or “1-click purchase”:
Another type of express checkout is when an email recipient clicks a unique link in your email and, on landing, gets the option of using their last-used billing and shipping details. A Complete Money Back Guarantee helps ease doubts about an “express” checkout, since the customer sometimes doesn’t even get to see their last used information. If you offer 1-click purchases, include a Cancel/Undo option right on the Confirmation page. I’ve used a 1-step checkout before, not realizing it would literally place the order without any confirmation.
Checkout Tunnels (Enclosed Checkouts)
A checkout that keeps normal navigation and sidebars creates a more natural transition. It tells customers “Check out now if you want, or keep looking around for other products”.
In contrast, a checkout tunnel removes all distractions. It tells customers “You’ve finished browsing. Time for payment”. Test the impact on your total order value, time to purchase, as well as completion rate. Keep consistent branding, and keep some common elements as visual anchors (e.g., remove the navigation links but preserve the area, so content areas don’t jump too much after the page transition).
One hybrid approach is opening the checkout in a modal with a faded background. The fading shifts attention away from background elements. At the same time, it maintains a strong connection to the product, since the product page remains in the background. One way to preserve that on a separate checkout is to include the image of the product being purchased.
Form Layout
The goal is to make the form look easy to fill.
Direct the flow of attention in one direction, top to bottom. Avoid columns. That said, you can group short and closely related fields, especially where it’s expected (e.g., the credit card mm/yy expiry fields should appear together):
Give fields an appropriate maximum width. A narrow form will look simpler, because it appears to require less typing in each field:
Keep labels above fields to make the field-label unit easier to process. Left-aligned labels also have advantages – they shorten the form and are easier to scan (see Top, Right or Left Aligned Form labels):
Avoid placeholder text and inner labels, because they create confusion about which fields are completed and which are not. Inner labels may be OK on very short forms (2-3 fields), but make sure the label remains visible once the user starts typing. I like the pattern that moves the label to the border area rather than removing it.
Stepped Progress
To make the form look like less work, chunk it up. You can have a long form with numbered sections separated with spaces or lines. Alternatively, you can use the “accordion” pattern to show one section at a time, while other sections are collapsed. Some checkouts span separate pages, such as Personal Info > Shipping > Payment (see examples with test data on GoodUI Evidence):
If you use a single long form, create distinct, intuitive sections (like Shipping Address, Payment), which you can also number. Test for the best field to start with: Is it the email? Is it the shipping preference? What is low friction? What is high engagement? What is high commitment?
If you use a multi-page checkout, use a breadcrumb or other progress indicator. For your “Next” buttons, use a label that sets an expectation, such as “Next: Payment”.
In your analytics, measure drop-offs at each step and engagement with key fields, so you can compare effectiveness of each layout (e.g., how many people start filling credit card).
Payment Alternatives
Choose a transaction processor with a high success rate. In addition to your default processor, you can offer an alternative gateway, such as PayPal. Conversely, see if removing the choice increases revenue:
A 3rd-party checkout usually takes the user away from your site and provides an experience you can’t track or control, but it may increase revenue.
You can also use a fallback processor when a transaction is declined. If automating that is not possible, you can show a more informative Declined message with a link to the alternative, like PayPal.
Order Review
If you have a review step, try removing it, as it’s likely unnecessary. However, if you have a long checkout spanning several screens, it may be reassuring to see a summary before committing to the order. See what works.
In the next post, I plan to look at the Fields aspect of a checkout, which tells the user what data to provide and in what format. If you’re interested in reading that, please leave a comment so I know you’re interested.
Are there other patterns and aspects of a checkout I have not covered?
After reading this post, you will be able to say whether your test has “low traffic”, decide if A/B testing is worth it, and know what to do if you decide to A/B test.
Technique 1: Do confirmatory not exploratory testing
Exploratory testing = you run an A/B test to look for big or small changes that will increase your conversion rate. You come up with some ideas, then you test them to see which of them work.
Confirmatory testing = you make a risky update (e.g., remove the free trial) and want to confirm there is no huge negative effect (risk mitigation), or you make an aesthetic site update and want to see that it’s at least not worse than before.
If you get <1,000 visitors per month with 5% converting, you should not be doing exploratory A/B testing.
The best thing you can do instead is to proactively look for bugs on your site. You can also do 1-on-1 user testing, surveys, and simply deploy your best design and watch your conversion trends.
However, you might still be able to do confirmatory testing.
For example, say I ran a test for 2 months and found this statistically insignificant result (70% confidence level):
Based on this, I’m confident enough that my new redesign is no worse than the original, with some chance it might be better. That is useful information.
Technique 2: Find a proxy metric
You can increase your test sensitivity by using a higher baseline metric. Say, for example, that your primary metric is purchases, but the purchase rate is only 2%. Here’s the lift you can detect (try it yourself: Vlad’s What-If A/B Test Planner):
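The sensitivity gain from a higher baseline can be estimated with the standard normal approximation for a two-proportion test. This is a rough sketch assuming 80% power and 5% two-sided significance; exact numbers will differ from any particular planner's assumptions:

```python
from math import sqrt

def min_detectable_lift(baseline_rate, visitors_per_arm,
                        z_alpha=1.96, z_power=0.84):
    """Smallest relative lift detectable with ~80% power at alpha = 0.05,
    using the normal approximation for comparing two proportions."""
    p = baseline_rate
    se = sqrt(2 * p * (1 - p) / visitors_per_arm)
    return (z_alpha + z_power) * se / p

# With 5,000 visitors per arm, a 2% purchase rate requires a much larger
# lift to reach significance than a 4% form-start rate:
print(round(min_detectable_lift(0.02, 5000), 2))  # -> 0.39 (a 39% lift)
print(round(min_detectable_lift(0.04, 5000), 2))  # -> 0.27 (a 27% lift)
```

Doubling the baseline rate shrinks the minimum detectable lift considerably, which is exactly why a proxy metric earlier in the funnel helps.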
What’s a key milestone prior to sales? Let’s say your form fill (not yet submitted) rate is 3%. Maybe form starts are 4% (e.g., people enter an email). If we measure form starts, much smaller changes are detectable:
Of course, form starts are not purchases, but it’s a behavior that suggests improvement AND it’s a good supporting metric.
Here’s how to use this to analyze your test. Say at the end of the test, you find these effects:
As you can see, in terms of purchases the variation beat the Control by only 10% and it’s not a statistically strong result. But the preceding steps show a progression from engagement to purchases and the shallower goals are statistically stronger. So the big picture here is actually pretty good.
You can also compare the performance of various metrics to see which are fairly in sync. For example, here’s what it might look like if Form Field Engagement is a great proxy for Revenue:
A word of caution: The mere fact that the metrics are lined up DOES NOT increase the likelihood that B is the winner. These are correlated metrics that measure the same behavior (i.e., “purchase” implies “form completion”, which implies “form engagement”, which implies “scrolling to form”, and so on). These metrics will line up whether the effect is true or a false positive. What you CAN say is that, since these metrics are correlated, you can use the shallower metric as a proxy for the deeper metric.
Technique 3: Look for consistent performance
With low traffic, you get too few visitors per day to gauge daily variability, but you can track weekly trends. If a variation is winning more consistently, then it is more likely to be a winner.
Here is week-to-week performance over 7 weeks for two tests, both showing a 6% cumulative improvement:
But one test shows a lot of week-to-week variability, whereas the other shows blue winning for 5 weeks straight. All other things being equal, the second offers more reliable evidence.
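A rough way to quantify the “winning consistently” intuition is a sign test on weekly winners: if the variation were truly no better, each week would be a coin flip. This ignores differences in weekly traffic, so treat it as a sanity check only:

```python
from math import comb

def p_streak_by_chance(wins, weeks):
    """Probability of at least `wins` winning weeks out of `weeks`
    if the variation were actually no better than the control."""
    return sum(comb(weeks, k) for k in range(wins, weeks + 1)) / 2 ** weeks
```

Winning 5 of 5 weeks would happen by chance only about 3% of the time, while winning 4 of 7 is unremarkable (exactly 50%).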
Technique 4: Compare segments
Another way of checking consistency is to compare performance across user segments. For example, we tested the same variation simultaneously on two virtually identical landing pages on different domains. We expected the variations to do similarly on both. Comparing day-to-day performance for a sample week, we see that the effect sizes for the two tests are moving roughly in sync:
If the variations are not in sync, it could mean the effect size is false or that the segments are dissimilar. However, if the segments are in sync, it’s a good sign. If you have a low daily rate, you should instead check bi-daily or weekly rates. Otherwise, performance jumps around too much just by chance.
Technique 5: Don’t expect big changes to bring a big win
Some people will tell you to aim for a large effect size by making big changes. For example, with 4,000 visitors and a 5% rate, you can detect an impressive 54% to 89% lift:
That’s true, but how exactly are you going to achieve that? Is that realistic and worth the effort?
Big design changes have the potential to bring bigger wins, but it doesn’t mean that’s likely to happen. Big tests do fail often. They also take more development effort. So instead of fixing bugs on your site and building new features, you may end up doing lots of testing work with no results.
In the above scenario of 54-89% lifts, you are also likely to hit a large false positive and come away believing your design worked:
The only way to reduce the potential for large false positive errors is more traffic, even if you’re testing big changes. Hope for the best, but plan on testing for a while and putting in more time before you hit upon a win.
More techniques for planning tests and analyzing the results
What to test:
prioritize big changes that require little effort
test radical changes (think big conceptual shift, not just big visual changes)
start with ideas you have good, research-backed reasons to test
How to test:
test 1 idea at a time
plan to run tests longer
keep reminding your team a test is running
you don’t have to freeze development as long as you make changes globally
How to interpret:
you might have to tolerate variations going “red” for days as data is collected slowly
keep in mind that the observed effect size, even if the effect is true, is likely inflated
ignore big lifts if you’ve got only 1000 visitors so far (remember your false positive risk)
How much traffic is enough?
For example, with 20,000 monthly visitors and 0.5% conversions, you’ll have a tough time testing. But with 15,000 visitors and 10% conversions, you get a decent spread of effect sizes around 15% that are detectable within 1 month:
This is what I’d call “adequate traffic”.
Adequate traffic = you can run a test for 1 month or less with enough sensitivity to detect a 15% lift. Anything that takes over 1 month or aims at unrealistically high effects is low traffic.
Why 1 month? Because beyond that, things tend to get messy. Users clear cookies and reenter in different variations, your dev team accidentally introduces some change, and so on. Once you start talking of months instead of weeks, the test becomes a burden instead of an opportunity.
Why 15%? In my opinion, 15% is the sweet spot. If your sensitivity is aimed at 15% but you detect a 30% effect, then great – you’ll either have super-strong data or you can stop a bit earlier. If you detect a 10% effect, then you probably still have decent sensitivity to see a suggestive result.
Conversely, if you aim for 50% and the true effect is 15%, then you’ll be chasing phantoms. Since you virtually never know what to expect, it’s best to be conservative. I found that 15% is roughly the effect size at which A/B testing becomes reasonable for many sites.
Site Traffic Examples
| Traffic | Baseline Conversions | Verdict | Why? |
| --- | --- | --- | --- |
| 20,000 / month | 0.5% | Low Traffic | Base rate is low. Test takes many months and/or a minimum effect of 50% is required. |
| 1,000 / month | 10% | Low Traffic | Site traffic is low. Test takes many months and/or a minimum effect of 50% is required. |
| 15,000 / month | 6% | Adequate Traffic | With this base rate and traffic, you can detect a 15% effect or so. Enough for a test. |
| 50,000 / month | 5% | Adequate Traffic | With this base rate and traffic, you can detect a 15% effect or so in 2 weeks. |
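The verdicts above can be reproduced with a back-of-the-envelope normal-approximation calculation. This sketch assumes 80% power and 5% two-sided significance, with a 20% cutoff standing in for "a 15% effect or so":

```python
from math import sqrt

def min_detectable_lift(baseline_rate, visitors_per_arm,
                        z_alpha=1.96, z_power=0.84):
    """Smallest relative lift detectable with ~80% power at alpha = 0.05."""
    p = baseline_rate
    se = sqrt(2 * p * (1 - p) / visitors_per_arm)
    return (z_alpha + z_power) * se / p

def traffic_verdict(monthly_visitors, baseline_rate):
    """'Adequate' if roughly a 15% lift is detectable within one month
    (half the monthly traffic goes to each of the two arms)."""
    lift = min_detectable_lift(baseline_rate, monthly_visitors / 2)
    return "Adequate Traffic" if lift <= 0.20 else "Low Traffic"
```

For instance, 20,000/month at 0.5% works out to a ~56% minimum lift (Low Traffic), while 15,000/month at 6% can detect ~18% (Adequate Traffic).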
Conclusion
Testing on low traffic is paaaainful. Look at what you plan to gain from testing something, look at your chances objectively, and if you do go ahead, be patient and understand what you should and should not expect.
Did I miss anything? Let me know and I will update this.
The collective guess of a crowd can be more accurate than that of an individual. For example, over a hundred years ago, statistician Francis Galton noticed that a crowd of people could guess the weight of an ox with over 99% accuracy. In a more complex domain like politics, we know that expert predictions are terrible, but the average of their guesses is better.
How accurate is a crowd when it comes to good UI design?
How good is the Crowd at picking a better design?
Behave.org has a good repository of people’s guesses about competing designs. Various contributors submit their A/B tests to the “Test of the Week” section, which asks users to guess the winner. Here, crowds get to weigh in on a single idea, one that is not their own. Moreover, these tests are curated to be interesting, which implies they are carried out by more experienced people with more robust hypotheses.
When we look at the last 106 tests on Behave.org, we see that:
53% of the tests show a variation that beat the baseline. But when the Crowd guessed the outcome of these tests, they guessed right 72% of the time (36% better). Interestingly, the Crowd did choose B about 50% of the time. It was just better at NOT choosing B when it was not an improvement.
FYI 53% does not represent the average success rate of test contributors at all, since it’s not a random sample. We just use it as an arbitrary baseline, to see if the Crowd makes the same guesses or better ones.
Why does the Crowd do better?
First, a test designer chooses B 100% of the time, since B is by definition the improvement. In contrast, an impartial outsider considers all options equally and freely chooses either B or A.
Second, a tester is biased by his idea. His faith in the idea papers over flaws in the implementation, especially if he executes his own idea. In contrast, an outsider just evaluates the implementation.
Finally, a Crowd by definition has greater diversity of opinions, gut reactions, etc. than an individual. So even if a Crowd mix is not representative of site visitors, a Crowd is likely to be MORE representative than just the individual.
Quality of the crowd and the data
Is it a Crowd of independent opinions?
The individuals in the crowd need to be independent. One of the flaws of “focus group” research is that individuals within the group influence the responses of others. This makes the Crowd less intelligent. The power of an internet poll is that we average over many independent opinions.
How qualified is the Crowd?
The composition of the crowd is also important. Sometimes, you want average people, representative of your audience, to give you their simple preference (the classic “Which product would you buy?” question in market research). Other times, you want experienced designers to give you their guess based on their expertise (though sometimes experts can be bad at predicting). For example, on all our projects at Goodui.org, we ask multiple client contacts to specify their certainty in each test idea, which gets averaged with our own prediction. These are people who are not designers, but they know more about the product and the user base than we do – both very good things that add diversity. At the same time, being subject matter experts, they are qualified to give feedback on marketing copy and so on. In other words, their contribution to the crowd is valuable. We then try to prioritize the ideas that have the highest overall score.
How to quantify Crowd opinion?
How you summarize the opinion of the Crowd matters. In the initial example of guessing the weight of an ox, Galton used the median to summarize the crowd’s opinion. A median controls for outliers and can increase data accuracy. For example, there might be unqualified people in the Crowd who are way off on their guess (I would have absolutely no idea, for example, how much an ox weighs). An average would be skewed, while a median would not be.
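A toy example of why the median is the safer summary (the guesses are made up; the last one represents an uninformed guess that is wildly off):

```python
from statistics import mean, median

guesses = [1150, 1180, 1200, 1210, 1230, 5000]  # lbs; last guess is an outlier

print(round(mean(guesses)))  # 1828 -- dragged far upward by one bad guess
print(median(guesses))       # 1205.0 -- barely affected by the outlier
```

The median lands squarely among the informed guesses, while the mean is pulled well outside them by a single outlier.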
How the question is asked and the format of the data itself also matters. For example, I recall that asking people “Who do you think will win the election?” predicted election results better than “Who will you vote for?”
In the case of Behave.org’s Test of the Week, if they had asked visitors to quantify their confidence not as a binary choice but on a 0 to 10 scale (0 = A, 10 = B), then their uncertainty might have weakened the overall prediction. The binary choice of A or B effectively controls for respondents’ uncertainty: they are forced to give the same weight to “it’s slightly better” as to “it’s much better”. A simple binary choice might make the prediction more accurate by avoiding what is known as the “error of central tendency”, where ratings cluster toward the middle of a scale.
For example, at GoodUI.org with @jlinowski, we recently moved away from a simple 0 – 10 subjective certainty rating to a -3 to +3 scale. Since we tend to include test ideas that others are likely to also consider worthwhile, this effectively makes it a 0 – 3 scale. Simpler means stronger predictions. However, we also moved to a more complex formula that incorporates experimental evidence and research. This increased complexity risks decreasing the predictive power, since much more complex decision making is involved in figuring out if a variation is likely to win.
Can a Crowd fail?
Yes! More people doesn’t necessarily mean better predictions. The sampling (your choice of who’s in the Crowd) has to be valid and appropriate. In the classic example of the 1936 Literary Digest poll, a very expensive poll with a large sample size failed to make an accurate political prediction because the sample was biased. So quality trumps quantity.
How good is the Crowd at predicting effect size?
For reasons of complexity, I don’t believe a Crowd can be relied on to estimate the degree to which a variation might beat the baseline. The possible % values are unbounded.
The next factor is experience. In the ox example, most people know what an ox is and have lots of experience with other objects. But every website, every site audience, and every implementation of an idea is unique and not necessarily transferable. Of course, if you are testing something and you’ve tested something exactly like it on a dozen very similar sites and got a similar result each time, then you would have experience. I’d be skeptical that anyone can make such a claim, however. CRO experts rely on a lot of intuition and personal judgement to fill those gaps in experience.
Finally, there’s the type of thing we are measuring. Weight is simply an additive property. A rock that appears to be 2X the size of another probably weighs 2X as much. An ox about the size of 5 people probably weighs about as much as 5 people. Online experiments are not like this. If I am testing an idea, it’s often not decomposable in such a way. And even if we were to decompose it, there is no guarantee the effects of the parts add up this way.
In other words, guessing the outcome is very very hard and should not be done subjectively. We should do our best to find similar past tests, and use those to make a starting prediction. We can then collect some data in order to fine-tune the prediction. We can then test our prediction with more data.
Crowd-sourcing design ideas
How do we figure out what to test in the first place or find something better than B? After several unsuccessful tests a while back, we decided to crowd-source improvement ideas from GoodUI blog visitors. The result was the highest response rate of all posts up till then with lots of ideas that led to new variations.
We could further improve this by asking visitors to make predictions about others’ ideas before adding their own. That way we would crowd-source a list of ideas as well as predictions for each idea. As we have seen, predictions are stronger when a Crowd joins to rate a single idea, instead of each individual putting forward and rating their own.
Lessons learned
If I want to improve my success ratio, I need to behave like the crowd:
Separate the idea from the implementation. I recently had a great idea but my best implementation was weak. It took 6 visual iterations over several weeks to ensure both idea and implementation were sound. Critical feedback from 2-3 people (who were not involved in generating that idea) was critical in improving it.
Do a pre-mortem. Imagine that B already lost, and try to figure out why that happened. This way you force yourself to consider that B might not be better and needs improvement.
Seek first impressions from outsiders, both qualitative (ideas) and quantitative (predictions about ideas).
What’s in the future?
I envision a crowd-prediction service (similar to remote User Testing services that have become popular). You could pay to have 100 pre-qualified people make predictions about your test.
In this video, I want to show you a different kind of sample size calculator for your A/B tests. It works backwards compared to how traditional calculators work, and you might find that more intuitive. The basic premise of this approach is that we mostly don’t know what effect size to expect, so we make projections for a range of outcomes.
This calculator doesn’t ask you to input power or the effect size you are after, because it assumes that you’re exploring and don’t know what effect size to expect. Instead it just asks you for your current conversion rate and traffic, and then gives you several possible effect sizes that you can reasonably detect on your site and your chance of success for each outcome.
Let’s see how it works. Let’s say you want to run your test on your home page. About 5% of people make it from the home page to a purchase, and you get about 5000 visitors to the home page per week. Let’s run the report with those numbers.
At the top of the report, you’ll see your preliminary estimate. This estimate tries to balance the testing duration with the effect size you can detect. It’ll cap your duration at 8 weeks regardless. Next, it’ll try to make sure you can detect at minimum a 15% effect.
If I scroll down, you see that’s exactly what I got. The duration is 6 weeks and this is optimal to detect a true 14% lift. I can then adjust my duration up and down and see how it impacts my projections.
The advantage of this report is that it doesn’t give you an estimate for just one effect size. It gives you a range of reasonable what-if scenarios. That’s because we might have little idea what the effect size will be. But I see that if my new version is 10% better or 10% worse, then there is a 50% chance that the effect will break through the noise strongly enough.
But if the effect is 14%, then I have an 80% chance of success or 80% power. I can then use my judgement to see if whatever I am testing can reasonably beat the existing version by at least 10% and ideally by 14%. It will depend on how big my idea is that I’m testing, my experience with similar tests elsewhere, how bad the current design is, and so on.
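The what-if numbers in this example can be approximated with a standard two-proportion power calculation. This is a rough sketch using only Python’s standard library; the calculator itself may use different formulas:

```python
from statistics import NormalDist

def ab_power(p1, lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    p2 = p1 * (1 + lift)
    # Standard error of the difference in conversion rates
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_crit)

# 5% baseline, 5000 visitors/week to the page, 6 weeks, split 50/50:
n = 6 * 5000 // 2
for lift in (0.10, 0.14):
    print(f"{lift:.0%} true lift -> ~{ab_power(0.05, lift, n):.0%} chance of detection")
```

With these inputs the sketch reproduces the video’s figures: roughly a 50% chance at a 10% lift and close to 80% at a 14% lift.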
Another piece of information you can get here is a sense of what the actual observed effect might be. Remember that even if my new version is truly 14% better and the test is a success, it doesn’t mean the observed effect size will actually be 14%. By chance it may be inflated or deflated. So here you can also see the margin of error. This means that if I get a 7.5% lift, I know that the true effect might actually be as high as 14%. But if I see a 3% effect, I know the true effect is at most 10%.
I might wonder: if the true effect were 14%, what effect might I observe halfway through the test? To see that, I can reduce the duration to 3 weeks, find the 14% effect, and see that it might show up as an effect as low as 4.5%.
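This inflation and deflation is easy to see by simulating the same test many times. Here is a stdlib-only sketch using the numbers from this example (5% baseline, 15,000 visitors per arm, true 14% lift):

```python
import random

def observed_lifts(p=0.05, true_lift=0.14, n_per_arm=15000, trials=100, seed=7):
    """Simulate repeated A/B tests and record the observed relative lift."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(trials):
        # Count conversions in each arm with a simple Bernoulli draw per visitor
        a = sum(rng.random() < p for _ in range(n_per_arm))
        b = sum(rng.random() < p * (1 + true_lift) for _ in range(n_per_arm))
        lifts.append(b / a - 1)
    return lifts

lifts = observed_lifts()
print(f"observed lifts range from {min(lifts):+.1%} to {max(lifts):+.1%}")
```

Even with a fixed 14% true effect, individual runs land well above and well below 14% purely by chance.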
So far, I assumed there is a true effect. But if I am wrong and my new variation actually has no effect, I might still get a lift – that’s called a false positive. I always like to know what sorts of false positives I can expect.
In this case, let’s put it back to 6 weeks. We see that with this duration, we have a fairly high chance of a false positive around 5%, in either direction – the term false positive includes effects in both directions. I see there is a small chance of a false positive as high as 10% – the probability is 5%, small but possible. If I’d like to reduce that possibility, I can increase my duration. After 9 weeks, the probability is just 1%. And if I scroll back up, I see that with this duration I can also detect smaller effects.
In the end, whatever happens, you have a much better idea of what to expect. Give it a shot. Let me know how you like it. Thanks for watching.
Every other day or so you should peek at how your tests are doing. Here are some guidelines on doing that without skewing your data:
Technical problems
The main reason you want to peek frequently is technical problems. You should QA your site before you launch, but you should QA again a couple of days in and later on. You may have missed some bugs, and repeat QA will catch more of them. Other site changes may get introduced that break your test in some way. And sometimes a transient bug doesn’t show up in the data until more data is collected.
Averting losses
If you set it and forget it, you may leave a losing test running too long. If you’re after long-term gains and want to avoid losses, you should stop a losing test at some point. However, stop only when you can be reasonably sure you’re doing the right thing. You don’t want to stop your test a day in because you get put off by a big initial drop. But if your test is losing for 2 weeks straight with high statistical confidence, you might not want to let it run another week. If your goal is learning, then you might want to run a test longer to confirm it is a loser – but exposing your site to a poor variation may not be good for your business.
What stats are you using?
Many tools have moved to a different statistical method, which allows peeking. VWO uses Bayesian stats, which give you up-to-date confidence and probabilities. Optimizely uses a pseudo-Bayesian sequential testing method to allow you to peek. So depending on what statistical method you use, you may be “allowed” to peek and make decisions. However, keep in mind that regardless of what your tool says, you still want to let your test run a reasonable amount of time. So estimate your duration upfront anyway, so you have some point of reference.
No significance, no problem
If you’re using traditional stats (which work fine and are much easier to understand), the basic idea is to avoid calculating statistical significance (p-value) and then making interim decisions based on that. A p-value is meant to be something you calculate after your test is done. However, you can check significance and adjust your duration estimate once or twice during the test (e.g., once you get a ballpark effect size). That’s not going to skew your p-value much. If you commit to some sample size upfront, then you can check significance all you want – it gives you a basic sense of whether the effect is strong.
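For reference, the significance check with a committed sample size is just the pooled two-proportion z-test. This is a sketch; your testing tool’s exact math may differ:

```python
from statistics import NormalDist

def p_value_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pool the rates under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# e.g. 500/10000 conversions on A vs. 570/10000 on B:
print(p_value_two_proportions(500, 10000, 570, 10000))  # ≈ 0.028
```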
Safe things to peek at
It depends on what you peek at. If you’re not calculating significance but are still making decisions based on how the test is doing so far, you’re still skewing your final analysis. However, you can peek at how your overall test is doing (without drilling into how each variation is doing). For example, if you see that overall traffic to the test or the total conversion rate is lower than expected, then you can recalculate your duration estimate without any problems. Looking at the test overall doesn’t tell you how the variations are doing relative to each other – though unfortunately A/B testing tools have never offered this pooled view. You definitely want to readjust your duration estimate once you have more data. You just don’t want to keep doing that based on how each variation is doing – that fluctuates, and if you keep adjusting your duration to current performance, you’re letting chance lead you instead of subduing the effect of chance with time.
Have someone else peek
I don’t do this, but a good idea I’ve come across (not sure how practical) is to have one person peek for technical issues and a different person to ultimately do the final analysis and make the decision.
Objectivize
The biggest danger with peeking is that your brain will look for patterns and once you see them, you can’t unsee them. This can lead to disappointment when a green turns red. Or worse it can lead you to hack your test to get the results you want. Try to be objective about it. Once the test is running, you’re after the truth, not winning or losing.
Once you figure out what you want to test, you need to define what you’re going to measure and where. In this post, I will introduce my preferred terms for describing test structure (things like test conditions, goals, and pages), and I’ll use a visual language to cover the basic patterns. Here’s an example:
A simple test showing Gate (Start), Path, and two Goals (the big black circle is primary).
Gates and Goals
GATE: a circle that represents all conditions for entry into the test, including the test URL and traffic segmentation (Example: Home page mobile traffic).
GOAL: a circle that represents all success conditions, including the confirmation page URL and business rules (Example: Thank you page visit after purchase of premium package).
TIP: If you have a sizable mobile segment, you’ll want to track mobile, tablet, and desktop traffic separately. If your tool doesn’t allow you to segment after the fact, set up 3 separate tests with mutually exclusive gates. Other gates you should distinguish are: existing users vs. new users, ad traffic vs. direct traffic, and so on, in case each segment performs differently. Keep sample size in mind, because segmentation reduces sample size and increases false positives and false negatives.
Primary vs. Secondary Goals
Page visits are generally more reliable than clicks, so they are the preferred primary metric. Clicks on links or form submits are often secondary metrics. There are many other customized types of metrics based on user behavior and business rules.
Large circle is a primary goal. Small circles are secondary.
TIP: Whenever possible, track both the start and end of an interaction e.g., track clicks on a link and the visit to the destination page.
TIP: Tracking how many people start completing form fields is a good measure of intention e.g., track keydown or change events on key form fields. It can also highlight anomalies in other goals. Track attention (user scrolled to and stopped at the element being tested) as a secondary metric via scroll tracking and setTimeout().
Goal Depth
Your primary goal might be directly on the page you’re testing or further down the funnel.
A direct goal happens on the test page and is your ideal end goal (e.g., an AJAX payment event).
A shallow goal is a relative term for a goal at or near your test gate that’s not ideal. For example, a visit to the checkout page is a shallow goal relative to a primary goal of completing the purchase.
A deep goal lies further away from the gate and is usually your primary goal. However, you might track other deep goals that are not primary (e.g., post-purchase downloads, dashboard engagement). Changes to deeper goals are harder to detect using statistics, because counts are lower.
TIP: If your primary metric won’t produce enough data in the time you have, then choose the next best metric.
Conditional Goals
Behavioral and business conditions can be added to goals. For example, fire a conversion when a timer expires, a scroll position is reached, several steps are completed, or a user successfully logs out and returns again. You can map goals to any user behavior.
TIP: Additional logic requires additional code, increasing risk of technical and logical errors. Be careful about making your primary goal very complex. Moreover, elaborate goals that are harder to achieve will have a lower conversion rate and will be harder to track. However, they may be more informative – track them but have a fall-back.
TIP: You can set up goals to detect errors on complex screens with lots of dynamic components. For example, part way through the test you might find that clicks on your new button are low. So you might set up a goal to check that the button actually exists on the page for all visitors. In one case, we wanted to check if any visitors to a split-URL test were changing the URL to enter a different variation.
Single Page vs. Template
Your test gate is typically a single page. The gate can also be multiple product pages that use the same template. It can also be multiple pages that are completely different, or the test can even be site-wide. For example, if you’re testing a change to your navigation or a sidebar, you’ll want to modify all pages with that element – for consistency. Advantages: potentially much higher traffic to your test, and testing the change in broader context. Disadvantage: you may obscure different performance on each page due to different traffic sources, different previous pages seen, etc.
TIP: If you can, track that your multi-page A and B samples contain similar ratios of visitors to each page. For example, if your test is running site-wide, you might want to know that your A and B samples contain roughly the same % of people who came from the home page, pricing page, blog, etc. What if, for instance, product A gets mostly traffic from your internal search, while product B was mentioned in a blog and gets lots of referral traffic?
TIP: You can run separate tests to isolate the data set for each page as long as there is no visitor overlap.
Sequences and Funnels
Navigation within funnels. Straight lines are direct links, while wiggly lines suggest intervening pages e.g., Home > Pricing > Checkout as well as Home > Checkout.
If you have a unique link to a page, you control the traffic to that page. Other times, you might have multiple links, which means multiple entry points into a page. You might also have steps that can be bypassed. The wavy line means that the page transition is flexible or unknown (e.g., Home page to Pricing page), while a straight line means it’s a direct relationship (e.g., Payment Step 2 to Step 3).
TIP: Verify, don’t assume, your visitors’ paths. Are you getting lots of direct visits from Google into the middle step? Can visitors bypass your link towards step 3?
TIP: Set up tiered metrics, so you can track the user’s progress at each step of the funnel.
If your traffic is too low to detect a change in your end goal, you should make a shallower goal primary. It’s also useful to track whether users step outside the main funnel. For example, if you find that a losing variation is increasing traffic to the pricing page, you might have a hypothesis to explain the loss.
User behaviors other than what you want are also good to track (Distractions).
TIP: Track visits to all main pages, like pricing, about us, blog, etc. Most patterns will be meaningless, but sometimes they are informative. For example, a huge increase in visits to the Pricing page might suggest that something you said made people think of cost, which may or may not be a good thing.
Visual Scope
You can test just one page (Page Test) or you can test an entire funnel as a sequence of pages (Funnel Test) – visitors see version A of the complete funnel or version B of the complete funnel.
The grey container represents scope of visual changes.
On rare occasions, you might start your test on a page other than the one you’re testing. For example, a page may not be accessible directly or may depend on an interaction with the preceding page, so you have to start your test a step earlier (Premature Start). You might also have visual changes on different pages that go together conceptually, such as a discount offer on the home page vs. on a deeper page. So in one variation, the user will enter prematurely through a page without visual changes.
TIP: Avoid testing related pages in separate simultaneous tests, because you’ll have to account for visitors’ coming from and seeing different versions of each page.
TIP: When running tests simultaneously on the same site, use cookies to ensure visitors can only join 1 test at a time. If you do risk running overlapping tests, at least add metrics to each test to track which variation of each test the users saw. That way you can at least check that the assignment is roughly equal and even split users into non-overlapping segments.
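One way to keep simultaneous tests mutually exclusive is deterministic hash-based assignment, an alternative to purely cookie-based bucketing: each visitor always lands in the same bucket for a given test. A sketch with illustrative names:

```python
import hashlib

def assign_variant(visitor_id, test_name, variants=("A", "B")):
    """Deterministically bucket a visitor into one variant of one test."""
    # Hash the (test, visitor) pair so different tests split independently
    digest = hashlib.md5(f"{test_name}:{visitor_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Same visitor always lands in the same bucket for the same test:
assert assign_variant("visitor-42", "checkout") == assign_variant("visitor-42", "checkout")
```

Because assignment is a pure function of the IDs, you can also recompute buckets offline to audit that the split is roughly even.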
Data Collection
A tracking or blank test simply collects data about your site. You might run such a test to QA your metrics or estimate your traffic and conversion rates to enable you to do a power analysis.
An A/A test is less a way to test the tool than a way to understand the properties of the traffic e.g., are all users relatively similar, producing less variation in the conversion rate over time?
Visual changes can be tested by injecting CSS and JavaScript into an existing page or creating a separate page URL for each variation.
TIP: Dynamic A/B tests can be faster to deploy but can have a flickering problem or take slightly longer to load. Split URL tests don’t have these problems but require you to create a duplicate page, which is not always possible. They also require separate URLs for each variation, so you then need a process to reuse or expire the old URL variants. The redirection itself and the difference in URL might be noticed by some users.
TIP: A hybrid approach is redirecting to a URL parameter on the main URL. Changes are then applied on the back end or front end based on the parameter e.g., example.com/?v=a vs example.com/?v=b.
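The hybrid approach can be as simple as reading the variant out of the query string on the server or client. A minimal sketch using Python’s standard library; the parameter name `v` follows the example above:

```python
from urllib.parse import urlparse, parse_qs

def variant_from_url(url, default="a"):
    """Pick the variant from a ?v= query parameter (hybrid split-URL approach)."""
    qs = parse_qs(urlparse(url).query)
    # Fall back to the control variant when no parameter is present
    return qs.get("v", [default])[0]

print(variant_from_url("https://example.com/?v=b"))  # -> b
print(variant_from_url("https://example.com/"))      # -> a
```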
Single or Multiple Variables
Tests can involve a single change or multiple changes, as in a full redesign. Micro tests allow you to connect the observed effect to a specific visual change – this is the most satisfying type of test. A macro test allows you to test a coordinated set of changes, but you won’t know the effect of each individual change. This can put you in a tough spot if the test loses – do you scrap the whole thing or try to retest specific elements of it?
A multivariate test is used to test multiple changes by essentially running a separate test for each combination of variables. The risk with this type of test is lower power for each sub-sample and more false positives. This is not the same as running multiple tests simultaneously on the same page, since in that case you won’t know which version of which test each user saw.
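The extra false-positive risk from testing many combinations can be quantified. As a back-of-envelope sketch (real multivariate combinations aren’t fully independent), with independent comparisons at 5% significance each:

```python
def familywise_error(alpha=0.05, comparisons=7):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

# A 2x2x2 multivariate test has 7 non-control combinations:
print(f"{familywise_error():.0%}")  # ~30% chance of at least one false positive
```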
In closing
Use your awareness of page and goal patterns to collect richer data and avoid common mistakes.
I use simulations all the time to help answer questions like: Is this outcome possible? What outcomes are most likely? How much data is enough?
Simulations can give an answer faster than detailed calculations. They are less precise but far more intuitive. If you run a simulation 10 times and get a certain outcome even once, you know it’s possible. If you get it a few times, you know it’s quite likely. If you want more confidence, just rerun the simulation 10 or 100 or 1000 more times.
What if?
In the previous post, I included an under-powered simulation in Excel, where we ended up with a 23.7% drop instead of the true +10% lift. Using that template, you can set up a simulation in seconds.
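You don’t need Excel for this; a few lines simulate an under-powered test. Here I assume a hypothetical 5% baseline and 500 visitors per arm with a true +10% lift – the observed lift still swings widely from run to run:

```python
import random

def underpowered_lift(p=0.05, true_lift=0.10, n_per_arm=500, seed=0):
    """One simulated small test: returns the observed lift despite a true +10%."""
    rng = random.Random(seed)
    conv_a = sum(rng.random() < p for _ in range(n_per_arm))
    conv_b = sum(rng.random() < p * (1 + true_lift) for _ in range(n_per_arm))
    # Guard against a zero-conversion control arm in tiny samples
    return conv_b / max(conv_a, 1) - 1

lifts = [underpowered_lift(seed=s) for s in range(10)]
print([f"{l:+.0%}" for l in lifts])  # outcomes swing widely around the true +10%
```

Rerunning with more seeds (or a larger `n_per_arm`) shows how quickly the spread narrows as the sample grows.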
A hypothesis is an explanation of why something is the way it is.
Example business hypothesis:
“We are a new company, and visitors have doubts about the quality of our product.”
Do we really create hypotheses to test them?
To see if my example hypothesis is true, it would be best to talk to some potential customers. A/B testing is not really about testing business hypotheses but about using them to iterate a design. An A/B tester is not a scientist. He takes the hypothesis as inspiration for new visual treatments in order to increase his chances of raising revenue.
In the town of Perfectville, a company ran a winning A/B test with a 20% lift. A few weeks after implementing the winner, they checked their daily conversions data:
97 days of daily conversion rates in Perfectville showing 20% lift
The graph perfectly reflected what happened: the baseline increased by 10% during the test, with half the traffic exposed to the winning variation. Then there was a week when the test was stopped, followed by a lift of 20% once the winner was implemented.
The good people in nearby Realville heard about this and ran the test on their site. When they later checked their daily conversions data, they scratched their heads (as they often do in Realville):
97 days of daily conversion rates in Realville showing same improvement
The data actually includes the same 10% lift during the test, a gap, and a final 20% improvement. The problem is that the improvement is small relative to the natural fluctuations in daily conversion rates, so a 20% improvement doesn’t necessarily look like a 20% lift.
Here are 6 reasons why people in Realville might find it difficult to see a lift and what they can do about it.
Reason 1: The effect is too small
The smaller the lift, the harder it is to see it through the noise. If the conversion rate drops for some reason unrelated to the test, the lift from your winner might not even offset that. For example, here’s 1 week of simulated daily conversion rates followed by a week with a 20% lift compared to a 5% lift. If the lift were 5%, it would look as though the test actually did worse in the second half:
7 days at baseline followed by 7 days with 5% vs. 20% lift
Have you just run a test and are looking at during-test data? You likely won’t see any effect. Typically only 70-80% of visitors will join the test (more on this below), and these are split among your variations. If 80% of your traffic actually participated in an ABC test, a third of that is exposed to the winning variation. So, a 20% lift would manifest as 5% overall.
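The dilution works out directly: with 80% participation split across 3 variations, only about a quarter of all traffic sees the winner, so a 20% lift surfaces as roughly 5% overall:

```python
def sitewide_lift(true_lift, participation=0.8, n_variants=3):
    """Dilution of one variation's lift in site-wide data during the test."""
    share_exposed = participation / n_variants  # fraction of all visitors seeing it
    return true_lift * share_exposed

print(f"{sitewide_lift(0.20):.1%}")  # a 20% lift shows up as ~5.3% overall
```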
What you can do:
Look for a larger cumulative upward trend after several tests.
Compare longer timescales for baseline and post-implementation data.
Reason 2: Your baseline is too variable or you’re not looking at enough data
In Perfectville, conversions are constant each day, each week, each month. This means a 20% improvement causes a 20% lift. Not so in Realville. In Realville, daily conversions naturally fluctuate, so the full potential for improvement may not manifest. The more your conversions fluctuate, the harder it is to see the lift in the data.
Here are two similar simulated data sets with low and high variability, both showing a 20% lift. The lift is more obvious when variability is lower:
A similar 20% lift with low and high variability
Sales may fluctuate for a lot of reasons (weekly, seasonally, in response to your marketing activities, unexpected traffic). The smaller your sample, the higher the chance that the pattern you’re looking for just won’t be there by chance. For example, if you just saw the middle segment of the full graph below, you’d never know that the right, orange side of the graph shows a 20% improvement:
14 days of simulated daily conversion rates (blue), then 14 days with a 20% improvement (orange)
What you can do:
Zoom out to reduce variability. If the data is too variable daily, look at multi-day or weekly rates.
Look at more data to cover the full cycle of ups and downs e.g., a week (note that the lower your conversion rate, the more data you need to see an effect)
Check your site analytics to see what might have been different that week. Check if dips have happened before. Might one have coincided with the test?
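The “zoom out” tip above can be sketched as a simple roll-up. This hypothetical helper assumes you have daily conversion and visitor counts:

```python
def weekly_rates(daily_conversions, daily_visitors):
    """Roll noisy daily conversion rates up into steadier weekly rates."""
    rates = []
    for i in range(0, len(daily_conversions), 7):
        # Sum counts first, then divide, so low-traffic days aren't over-weighted
        conv = sum(daily_conversions[i:i + 7])
        vis = sum(daily_visitors[i:i + 7])
        rates.append(conv / vis)
    return rates

# Two weeks of daily data collapse into two weekly points:
print(weekly_rates([5] * 14, [100] * 14))  # -> [0.05, 0.05]
```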
If the data has a lot of variation, it is hard to estimate visually. Compare what I’ll call the “clipping rates”. In this graph, you see higher peaks as well as a higher frequency of peaks in the second half:
20% lift manifests in more frequent and higher peaks
Reason 3: Not everyone was part of your test
Even if you didn’t put exclusion conditions on the test, some visitors were excluded.
For example, mobile visitors are excluded by default. Another 10-20% of visitors normally get excluded when the A/B testing tool times out. Technical implementation issues can cause another 10-20% of visitors to be excluded – things like JavaScript-heavy sites or the tracking code not being implemented in the right place.
Moreover, gaps in test design can create a discrepancy between test and sales data. For example, we ran a test on the home page of a basic single-product site and noticed that our test data was missing many sales. After investigating, it turned out that about 50% of purchases were by people who never visited the home page, as well as by existing customers coming from a special upgrade page that we hadn’t considered.
As a result of these exclusions, when you implement your winner, you may be exposing it to segments you didn’t test it on. For example, although you tested on desktop and saw a 20% lift, the same design on mobile might cause a 30% drop. So, if you made the winner your new home page for all traffic, the drop in mobile could counteract some of the lift (say, if you had lots of mobile traffic).
What you can do:
Factor in 20% exclusions due to technical issues, like timeouts
Set up an inverse test to see how many sales are by-passing your main test (target pages and visitors who are excluded from your main test)
When looking at sales or conversion data, keep in mind it probably includes segments you didn’t test on. Test the design on all segments that will be exposed to it e.g., new customers and existing customers. For mobile, build and test a dedicated mobile version
Reason 4: You are eyeing it instead of using math
Sometimes a lift is obvious. Other times you need to use math. Here’s a sample of real conversion data with about 20 days of baseline followed by 20 days of the improved version:
Just over a month of real conversion data with winner on the right
The lift is not visually obvious. Nonetheless, the average for the first 20 days is 0.71%, whereas the average for the last 20 days is 0.85%, which is a 20% lift. However, if the standard deviation of the data is high, the difference in averages may be coincidental.
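A quick way to put math behind the eyeballing is Welch’s t statistic, which scales the difference in averages by the variability of the data. A sketch with illustrative numbers:

```python
from statistics import mean, stdev

def welch_t(before, after):
    """Welch's t statistic: difference in means scaled by the combined noise."""
    se = (stdev(before) ** 2 / len(before) + stdev(after) ** 2 / len(after)) ** 0.5
    return (mean(after) - mean(before)) / se

# Clearly separated daily rates give a large t; |t| above ~2 suggests a real shift.
before = [0.70, 0.72, 0.71, 0.70, 0.72]
after = [0.84, 0.86, 0.85, 0.84, 0.86]
print(round(welch_t(before, after), 1))  # ≈ 22.1
```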
Reason 5: Your design or conditions are not the same
This happens all the time. You run a winning test, then you tweak the winner before pushing it to your site. It’s entirely possible that those visual tweaks reduced the effectiveness of your variation.
It’s also possible the conditions were different when you ran the test vs. when you launched the winner. Did you test during the holidays or launch during the holidays?
Are you including a different page? You might have several pages that look similar. So you tested something on one page, and you decided to apply it in one go to all the pages. If so, there is no guarantee that the same concept will work equally well on other pages.
What you can do:
Check your site analytics to see what conditions might be different now and retest if necessary
If you know you will be changing something, apply changes to the variation and test it with the changes
You should implement the winner as the new control and then test the new changes
Retest on each site if you have reason to believe the outcome may be different
Reason 6: It was a false positive
Yes, it happens all the time. There are many reasons you might have gotten a false positive, including improper test design and not running your test long enough. The most common scenario is you run your test until you see a winner and stop. I’ve seen results that looked very exciting flatten after 3-4 weeks.
What you can do:
Follow the great tips on http://goodui.org/betterdata to ensure you get good data
Back To Realville
Let’s say Realville decided to retest the Perfectville winner 8 more times (it took years!). They found that the variation did indeed tend toward an increase, following the same pattern as Perfectville’s test: a small lift during the test, a slight dip when the test was stopped, and then a larger lift after the final launch. However, despite the overall trend, individual outcomes showed that chance is a factor in this imaginary scenario:
Let me know if you apply and find useful some of these concepts.
Executed an audit of IT systems (what they do, how old they are, who’s using them and why) and operations (where staff are allocated, how programs share resources):
Interview all program heads
Design qualitative and quantitative research as needed
Summarize current state and identify opportunities for improvement
Present findings to management
Solution
I audited operations at a large Public Health Sector organization. I interviewed stakeholders, collected information on IT systems and people, and presented an analysis with diagrams and organizational recommendations.
I delivered a presentation and wrote a 60-page report on the state of IT operations, covering 9 program areas:
Expanded The Scope
On my own initiative, I expanded the scope to include strategic guidance for management based on my findings. This included organizational models, a service blueprint (lists of responsibilities, goals, beneficiaries, internal vs. external stakeholders), and a governance survey (auditing the management process itself).
The Work (Service Blueprinting)
I created a Blueprint to help management to better understand the scope of the organization (using “The Work” model) and the effect of external factors (it wasn’t always obvious to people managing The Work).
I further broke down the organization into areas like Data, Processes, Business Applications, and People. I then detailed all the internal support activities that are required to manage each aspect of the organization. Finally, I mapped internal activities to public services:
Surveys And Interviews
I created multiple surveys (developed the scope and wrote the questions). The big challenge was that each program used different terminology and had a different way of seeing itself in relation to the whole. I had to think how to frame questions in Plain English in a way that is properly interpreted by all participants.
Surveys included systems and process questions:
Surveys also included self-evaluation of the management team and their relationship to IT (provider of services):
When a standard survey format was insufficient, I created a custom framework and format:
After survey data was collected, I interviewed the head of each program. In the end, I presented my findings to the whole team.
Data Analysis And Customized Presentation
I created different presentations for different types of data.
For data pertaining to relationships, I created a relationship overlap diagram:
For standardized data, I created a color-coded map to highlight pain points:
From qualitative data and interview notes, I extracted and summarized the key actionable requirements:
Governance Framework
I created a survey to assess the effectiveness of the IT Steering Committee in areas like communication and resource management. I also created an overarching framework to convey the scope of organizational activities, weak points, and opportunities (SWOT analysis):
More Leadership Projects:
Presented Vision for IT Service Delivery (2012): Presented to cross-divisional senior management about transforming IT services through collaboration and workflow automation. Current state included poor collaboration across silos, resistance to change, and lack of process transparency. I explained how process definition enables measurement and creating efficiencies through reuse. I also presented ITIL Service Design principles and sample IT service catalogues.
Charter for TPH IT Strategic Plan (2013): IT wanted to understand the gaps in its service delivery to the business and develop a service-oriented strategy to create Business-IT alignment. I advised senior managers on creating a project plan for this initiative. I developed problem statements and objectives, analyzed risks, and prepared a detailed roadmap of activities.