Bridging The Designer-Engineer Divide

Instead of specialization, we should strive for some overlap. When engineers and designers share experiences, they develop empathy, which leads to clearer communication. This in turn improves outcomes.

Designer Saves The Day (True Story)

Sam was engaged in a renovation project whose goal was to open up a high-traffic space in a house.

Sam was the “designer”: he drew up a detailed plan to remove a main load-bearing wall, a plan that would meet the permit requirements. Sam also defined the functional requirements: the beam had to be concealed, the opening had to be this wide, and so on. Mike was the highly skilled “engineer” tasked with making it happen.

As Mike and Sam discussed implementation, Mike made specific decisions about materials and dimensions. A few compromises were made, but it all looked doable, until Mike exposed the floor where the beam cross-support would go and realized the space contained an HVAC conduit that could not be relocated.

At this point, Mike the engineer sighed and said, “You know, this is a show stopper. I would reconsider removing this wall. We’ll have to move up past the vent and at that point, you’re not really removing that much wall.”

To this, Sam replied, “No, this wall is in the way. We can’t stop now. Let’s just think about it for a moment.”

Very quickly it occurred to Sam that they could build on top of the floor without impacting the vent. Sam described how it could be done. Apparently, this option hadn’t occurred to the engineer, because it’s not usually done that way, and with a grumpy inspector it might not pass inspection. So Sam vetted his idea with an architect, who highlighted the risks of doing that. Fortunately, Mike the engineer jumped in and offered solutions that would mitigate those risks. Sam then relayed this final plan to the inspector, who signed off on it.

And so Sam’s solution was implemented and the project was a success.

Mike was 50X more skilled than Sam. But two heads are better than one. Although Sam was a designer and project manager, he acquired some basic construction knowledge. This knowledge turned out to be critical in the successful implementation of his design. And his not being an expert was actually helpful, because he naturally thought outside the box.

The critical takeaway here was that BOTH the specialized skillsets of the engineer and the architect AND the designer’s basic technical acumen came together to produce the perfect solution.

Now let me tell you a different story.

Engineer Saves The Day (True Story)

Sam was involved in a bathroom renovation project. Since Mike the engineer was busy, Sam decided to enlist the help of a different engineer, named Anton. Anton said he required design drawings even for a small project. “You tell me what you need, and I’ll build it” was his motto.

So Sam decided to contract out the design to a dedicated designer, Julie. Julie came highly recommended. She looked at the existing layout and drew up several recommendations. Sam and Julie agreed on a plan to put the shower by the window, because it didn’t seem to fit anywhere else. Julie then produced detailed drawings for the engineer Anton. But at that point Anton was no longer available.

Luckily, Mike now was. Mike the engineer looked at the plans and immediately said it was a no-go. You can’t put a shower by the window, because obviously the water would go all over the window sill and it would be a mess in the long run. Julie and Sam were so focused on the paper layout that they overlooked the implementation.

So Mike the engineer thought it over for a day or two and proposed a completely different design, which moved everything around in a way that neither Julie nor Sam ever considered. It was a spectacular improvement. A bit more work but definitely worth it. Mike’s design was a rough pencil sketch on the wall. With that in mind, he successfully implemented the new design.

Most often designers are frustrated that engineers haven’t implemented their designs exactly on the first try. But that isn’t always a reasonable expectation. I don’t always care to align my boxes down to the pixel or get the font colors and sizes perfect. I always expect the implementer to think critically about what they are building. The relationship need not be antagonistic.

Building Rockets Iteratively

There are many examples of designers and engineers collaborating successfully.

If you get a chance to watch the documentary Cosmodrome, it’s an interesting story about how the Soviets perfected a closed-cycle rocket engine in the 1970s. The U.S. thought it impossible and wasn’t even aware of it until the 1990s.

The Soviets’ manufacturing and engineering process is a perfect case study in iterative design, prototyping, and collaboration.

The engineers drew up plans and then scheduled a dozen test flights to iron out flaws. These were FULL flights, and they fully expected the first few rockets to explode. And they did. They even destroyed the launch complex and had to rebuild it to keep testing. Necessity was a factor in these decisions: the Soviets simply didn’t have the right test facilities, so they adapted. Whereas the Americans could test an engine without actually launching it, the Soviets had to do a full launch.

The Soviets learned from each failure and with each test, they refined the engine. This way they achieved something the Americans could not.

Their design method is particularly instructive. Whereas for the Americans the design and build phases were separate, Soviet design engineers handed responsibility for the design over to the build engineers, who were then free to iterate on the design to make it work.

“That’s Not My Job”

I started out my career at a consulting firm as a Business Analyst working closely with other analysts and developers. This company called us renaissance consultants, and in fact all our official titles were plain “Consultant”. I remember a training presentation where it was emphasized that we should all do what is necessary. “If the trash bin is full, we can’t say That’s not my job”.

I believe this kind of environment allows bright individuals to thrive. It allows developers who have a flair for client-facing consulting to take active part in client meetings. It allows technically inclined designers/analysts like myself to pick up coding when needed. This allows the team to go beyond the requirements or design “hand-off” and instead work hands-on together on challenges.

Contrast this with the experience all designers have had with *some* developers: developers who convey “It’s not my job” by saying “Done!” without even bothering to load their work in a browser (they send you a link to a page that’s completely broken). Equally frustrating is the reverse experience with designers who don’t consider the mobile experience or feasibility at all.

What The Future Looks Like For IT

There are many trends in the industry that try to address this divide. The trend toward pattern libraries and design systems can help developers design. Abstraction layers like jQuery made JavaScript coding more accessible to designers. Prototyping software like Adobe XD allows designers to build sophisticated interactions without any coding and makes it easier to share specs with developers. But the bigger problem is culture.

Still, I think there is room for optimism. As much as there is a trend to specialization, there is potential for cross-pollination. Modern solutions are simply too complicated for designers to remain oblivious about technology and, I will add, business.

Organizations need to embrace more fluid approaches to specialization. It allows individuals to utilize 100% of their potential, making them more valuable and more satisfied.

It also sends a message to educational institutions. For example, now that UX is coming into its own, there is a growing danger of creating new silos where none existed. The field was built by folks who’ve come from all sorts of diverse backgrounds, yet it might be adopted now by those who would enter the field through specialized programs.

If we are not careful, specialization will change things, and I don’t think it will be for the better.

P.S. Since I come to UX from a Business Analyst role, I have a similar view of that divide. In fact, a large consultancy I talked to recently mentioned they were experimenting with joining their two departments. They are not alone. But that’s a story for another day.

Designing for Trial and Error

Most things in life get done through trial and error, and digital products need to support that sort of fuzziness.

What scares people about technology is that it’s precise and uncompromising. A bank machine asks you for THE number — you can’t type anything else and nothing happens until you do. When you turn a dial on a washing machine, it does exactly and only what you choose. At the other extreme, automated systems take all control away from us.

The beauty of machines is that machines don’t lie. They just do what they are programmed to do. But that’s not how people interact, and that’s what makes machines hard to interact with. Sometimes we need to be guided even as we remain in control.

Many real-life decisions follow a similar pattern:

Julie tries decorating her apartment. She chooses some colors and pieces. She then hires a Designer to pull it all together. The Designer generates potential floor plans and finds photo inspiration. Julie provides feedback: more like this, less like that. She rejects or accepts ideas until the picture comes together.

Many products fail to find a place in our lives, because they try to be too simple, too smart. Perfect anticipation of a user’s intent will likely never happen. Even my wife with her human brain and a decade of experience can’t always predict what I will like. So digital products should make educated guesses, but they shouldn’t expect to be right. Instead, technology should be optimized for making suggestions, making corrections, and listening for feedback.

Here are some patterns I’ve been paying attention to lately:

User Input Optimization

Basic ways to integrate lightweight guides into our workflow are found in features like:

  • “snap” in drawing software gently shifts objects so they lie on a grid or align with guides or other objects; “auto-smoothing” makes lines straighter or curves smoother
  • “quantization” in music sequencing software automatically aligns notes precisely to a rhythm (or conversely, “swing” humanizes rhythms so they sound more natural); “pitch correction” fixes out-of-pitch singing even in real time; “compressors” or “limiters” remove outliers to maintain a consistent loudness

In these cases, software smooths the rough edges of the user’s input through very light automation, which can be enabled or disabled easily.
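To make the quantization example above concrete, here is a minimal sketch of the idea in TypeScript. The note structure, grid size, and strength parameter are my own illustrative assumptions, not any particular sequencer’s API.

```ts
// Minimal sketch of rhythmic quantization: nudge note start times toward a grid.
// A strength of 1 snaps fully; lower values only partially correct the timing,
// which is what makes this a light, easily reversible assist.
interface Note {
  start: number;    // position in beats
  duration: number; // length in beats
}

function quantize(notes: Note[], grid: number, strength = 1): Note[] {
  return notes.map((note) => {
    const snapped = Math.round(note.start / grid) * grid;
    return { ...note, start: note.start + (snapped - note.start) * strength };
  });
}

// Example: lightly tighten a sloppy hi-hat pattern without making it robotic.
const played: Note[] = [
  { start: 0.02, duration: 0.1 },
  { start: 0.27, duration: 0.1 },
];
const tightened = quantize(played, 0.25, 0.5);
console.log(tightened); // starts move to 0.01 and 0.26, halfway toward the grid
```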

Undo (Your Best Friend)

Undo is a powerful feature: it allows users to generate essentially randomized content manually, discarding the attempts that don’t work. Randomness is a key part of my creative process.

A few years ago, I created several comic strips for the Suicide Prevention program at the City of Toronto. Although I know nothing about drawing human faces, I was able to create characters real enough to convey emotion. Here is Ida, one of the fictional people at risk, opening her door cautiously:

Ida, a character I created using trial and error in Illustrator.

Ida and the rest were created purely through trial and error. I drew randomized shapes and hit undo a lot, until I assembled a realistic face.

Multiple Takes (Undo’s Big Brother)

The opposite feature is the ability to generate lots of content quickly to be sifted through later. For example, I rely on my ability to take hundreds of shots on my camera, to increase the chance that more will be in focus and that some may even contain pleasant surprises I can extract later in Photoshop.

Multiple Presets

“Smart default” is a great way to make an interface easier to use, but why does it have to be just one default? When you open visual effect filters or audio effect plugins, each filter has a default setting. Often there is just the one default, and sometimes it’s not so great. A better case is when multiple presets are offered in a pull-down, so I can flip through to see what the filter is capable of. But an even better feature would be to offer an intelligently random set of presets.

Some products make users do work, for example, to categorize their content like emails or photos. It is common for software to start with a suggested categorization and force it on the user (remember when Gmail rolled out its automatic and hugely unpopular categorization of email into Social, Primary, etc. in 2013?). Is there a way to quickly generate multiple suggestions based on the user’s own activity and let the user choose?

Randomized Content Generation

Users of complex digital products can’t always express their full potential, because their technical skills are limited. I already showed how I use undo and multiple takes to overcome my limitations.

The key insight for me is that when users hit the limit of what they can create intentionally, they can still recognize what they like if shown some options. Besides, people enjoy surprising themselves. They don’t always want predictable outcomes.

The Nord Lead 4 synthesizer has a Mutator function, which takes a seed sound and changes it in slight or major ways. I’ve come to depend on this feature in my creative process. It works on demand — meaning it’s easy to access, so I can easily trigger it manually.

Mutator feature on the Nord Lead 4 synthesizer

There are 3 types of randomization and 5 levels of randomization strength. I can hear endless variations on a single sound (similar to the many-takes pattern), evolve a sound gradually, or create a completely random sound.

Here is a sample of gradual mutation followed by a big mutation at the end:

What makes the Mutator effective for me is that the Nord Lead

  1. chooses optimal parameters to randomize (bounded randomness), so most generated sounds are viable,
  2. allows fluid change from manual editing to mutation — I can generate a very random sound, then tweak an aspect I don’t like manually, then randomize slightly

To me it feels just like asking the device for suggestions. It’s not intrusive.
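As a rough illustration of bounded randomness (my own sketch, not the Nord Lead’s actual algorithm), here is how a mutator might constrain random changes so most results stay usable:

```ts
// Sketch of bounded randomness: mutate sound parameters within safe ranges.
// "strength" controls how far a mutation may wander from the seed value,
// so most generated sounds remain viable rather than random noise.
type Params = Record<string, number>;

const RANGES: Record<string, [number, number]> = {
  filterCutoff: [0, 1],
  resonance: [0, 0.9], // capped below self-oscillation to keep results usable
  attack: [0, 1],
  release: [0, 1],
};

function mutate(seed: Params, strength: number): Params {
  const result: Params = { ...seed };
  for (const [name, [min, max]] of Object.entries(RANGES)) {
    const offset = (Math.random() * 2 - 1) * strength * (max - min);
    result[name] = Math.min(max, Math.max(min, (seed[name] ?? min) + offset));
  }
  return result;
}

// Evolve a sound gradually with small mutations, or reroll it almost entirely.
const seed: Params = { filterCutoff: 0.4, resonance: 0.3, attack: 0.1, release: 0.5 };
const subtle = mutate(seed, 0.1);
const drastic = mutate(seed, 0.9);
```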

In Closing

I would like to see more light automation and more explicit support for trial-and-error in software.

In the future, I can imagine my sketching app recognizing that I am trying to draw a person. The app will automatically fix and elaborate the sketch, using my own line style. It will even be able to ask me, “Hey, how about this?” My writing app will offer alternative ways to arrange my content and suggest better headlines to choose from.

That said, I have two concerns:

  1. Products that try to be too smart often fail to listen to feedback. Products should have humility as a feature. How do we create digital collaborators and partners rather than intrusive automatons? Past attempts to do this failed miserably (remember Clippy from MS Office?). Many people were frustrated with the Nest thermostat. And so on.
  2. Are there good working patterns for a digital product to offer content and ask for feedback without obstructing the user’s workflow? Most cases of such interaction today are intrusive and built on selling the user something, not on truly being helpful. How do we avoid unwanted help?

Let me know your thoughts and please share examples.

Update: Check out this workflow for creating music with AI: https://www.theverge.com/2018/8/31/17777008/artificial-intelligence-taryn-southern-amper-music

Compact Mobile Navigation Patterns

Problem: Running Out of Width

What if all your navigation items don’t fit onto one line on mobile? In the screenshot below, if Habanero opens another office, it won’t fit:

Pattern 1: Horizontal Swipe

You can let the choices go off-canvas. You’ll find this in use on the web and in native apps.

Make sure to hint at the concealed content. Adjust spacing so that the last visible item is clearly cut off:

Google adds a fade effect as an added cue, which also makes the cut-off item less awkward:

Strymon adds a fade effect AND a small arrow for clarity. Moreover, they turned the links into buttons, which makes it easier to see that they are cut off (not a video):

This pattern isn’t useful just on mobile. LinkedIn uses this pattern in their Create Post pop-up, where width is restricted and the concealed tags are low priority:

Samsung adds motion as an affordance. In the video below, notice that it’s important to auto-pan the selected item into view after a page redirect:

The diversity and number of cues highlight the main disadvantage of this pattern: discoverability. A user may not recognize a swipeable menu and may miss out on the concealed choices.
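If you build the swipe pattern on the web, the cues can be driven with a little scroll logic. Here is a hedged sketch; the markup, class names, and selector are assumptions, and the fade itself would be a CSS gradient attached to the .has-overflow-right class:

```ts
// Sketch: discoverability cues for a horizontally scrollable nav.
// Assumes markup like <nav class="scroll-nav">…<a class="active">…</a>…</nav>.
function initScrollNav(nav: HTMLElement): void {
  const updateFadeCue = () => {
    // Show the fade only while more items are hidden off-screen to the right.
    const moreToTheRight = nav.scrollLeft + nav.clientWidth < nav.scrollWidth - 1;
    nav.classList.toggle('has-overflow-right', moreToTheRight);
  };

  nav.addEventListener('scroll', updateFadeCue, { passive: true });
  window.addEventListener('resize', updateFadeCue);
  updateFadeCue();

  // Auto-pan the selected item into view after a page load or redirect.
  nav.querySelector<HTMLElement>('.active')
    ?.scrollIntoView({ inline: 'center', block: 'nearest' });
}

document.querySelectorAll<HTMLElement>('.scroll-nav').forEach(initScrollNav);
```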

Pattern 2: Hide Low Priority Items

A common pattern is to hide complex navigation behind a hamburger icon or icon + label. You should avoid hiding important choices if possible. It may be appropriate if the menu is redundant and the high priority options are prominent in the body of the page.

Try a hybrid approach to achieve a balance between discoverability and layout constraints.

CASE STUDY: Interaction Design Foundation Home Page
IDF decided to hide all its links, including “UX Courses”. The courses offered are listed in large cards way down the page. So at first blush, it’s not obvious what courses they offer, although it’s the primary interest of visitors.

Here’s how the page looks now:

SOLUTION 1: Move Courses Higher
Here we leave navigation unchanged but move the relevant content above the fold:

SOLUTION 2: Expose Top Priority Menu Item
Here we leave the page unchanged but surface the UX Courses as a button:

Another hybrid example
Vivobarefoot exposes the high-level filters but conceals more precise filters behind an icon. This is better than hiding all filters, but the “filter” row is empty, so they should use the available space to expose 1 or 2 of the most frequently used precise filters:

Pattern 3: Simplify and Fit

You can save some space by using icons without labels, but you should avoid pure icons if possible. When you choose this approach, be pragmatic. Certain icons like Home are safe, while others are less recognizable. I use YouTube every day, but I don’t know what the flame icon represents:

Simon Cowell’s gesture symbolizes the idea that if a user doesn’t know what something means, it’s as if it’s not there.

You can fit links by grouping them under pulldowns:

Another approach is to shorten text labels on mobile. Here is a desktop nav I did for a client:

On mobile, I kept just the key words:

Here’s how the code for this looks: <a>Promote <span class="mobile-hide">Your Listing</span></a>. A media query hides elements with the .mobile-hide class on narrow screens. Note that assistive technologies may still read the full label depending on how you hide it (e.g., if you collapse it to zero width and height instead of using display: none).

Sometimes the links are just too long and won’t fit. In that case, you can let them overflow:

On that note, let’s talk about my preferred approach.

Pattern 4: Let It Overflow (Counter-Pattern)

The approach I generally prefer is to not hide anything. As you can see in the previous example, I let the links overflow on mobile. I want them all visible at a glance as an overview.

If you visit my site on mobile, you’ll notice my sitemap is fully exposed right at the top to orient visitors to what I’m about:

On the inner pages, I’ve removed low priority items, but I still let items overflow onto a second row:

This works for a limited number of links, 3 rows at most. I do the same thing for inline tabs:

When I worked on GoodUI.org, we usually exposed links in the body and let users scroll through them normally. 100% discoverable:

Sometimes you have tabs that are not links but switch up some text in-page. If the tabs take up a lot of height, the user may not be aware that the text below has changed. In those cases, checking if text is out of view and auto-scrolling to it may be an option.
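Here is a rough sketch of that check, assuming the tab controls are already wired up and each tab button carries a data-panel selector pointing at the content it updates (both are illustrative):

```ts
// Sketch: after a tab switch, scroll the updated content into view
// if it sits out of the viewport, so users notice the text has changed.
function revealPanelIfHidden(panel: HTMLElement): void {
  const rect = panel.getBoundingClientRect();
  const outOfView = rect.top > window.innerHeight || rect.bottom < 0;
  if (outOfView) {
    panel.scrollIntoView({ behavior: 'smooth', block: 'start' });
  }
}

document.querySelectorAll<HTMLElement>('[data-panel]').forEach((tab) => {
  tab.addEventListener('click', () => {
    const panel = document.querySelector<HTMLElement>(tab.dataset.panel!);
    if (panel) revealPanelIfHidden(panel);
  });
});
```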

Wrapping Up: Best Practices

  • Keep choices exposed whenever possible to ensure discoverability
  • Try simplifying to fit the choices on one line by using icons carefully, simplifying text labels, or grouping (simplify without sacrificing function and clarity)
  • If you rely on horizontal swipe, use fading, cut-off text / buttons, and motion to increase discoverability
  • If you decide to hide choices, use available space to keep high priority choices exposed in the nav area or the page body
  • Don’t be afraid to give navigation a large treatment if it makes sense

Do you need a UX audit of your mobile app or site? Send me a link, and I’ll check it out.

Solving Problems with User Research, Best Practices, and A/B Testing

What can I do to persuade more people to buy your product online? I tackled this question for 5 years as I ran A/B tests for diverse clients.

I remember one test idea that everyone on the team loved. The client said “That’s the one. That one’s totally going to win.” Well, it didn’t.

The fact is, most A/B test ideas don’t win.

In fact, interpretation is tough, because there are so many sources of uncertainty: What do we want to improve first? Which of a hundred implementations is a valid test of our hypothesis about the problem? If our implementation does better, how statistically reliable is the result?

Is our hypothesis about the users actually true? Did our idea lose, because our hypothesis is false or because of our implementation? If the idea wins, does that support our hypothesis, or did it win for some completely unrelated reason?

Even if we accept everything about the result in the most optimistic way, is there a bigger problem we don’t even know about? Are we inflating the tires while the car is on fire? 

If you take anything away from this, take this analogy: inflating your car tires while the car is on fire will not solve your real problem.

I believe the most effective means of selling a product and building a reputable brand is to show how the product meets the customer’s needs. This means we have to know what the customer’s problem is. We have to talk to them.

Then if we run an A/B test and lose, we won’t be back to square one. We’ll know our hypothesis is based in reality and keep trying to solve the problem.

Emulating Competitors

“I heard lots of people found gold in this area. I say we start digging there!”

That actually is a smart strategy: knowing about others’ successes helps define the opportunity. That’s how a gold rush happens.

This is why A/B testing blogs are dominated by patterns and best practices. So-and-so gained 50% in sales by removing a form field… that sort of thing. Now don’t get me wrong: you should be doing a lot of those things. Improve your value proposition. Ensure your buttons are noticed. Don’t use tiny fonts that are hard to read. You don’t need to test anything to improve, especially if you focus on obvious usability issues.

So what’s the problem? Well, let’s go back to the gold analogy. Lots of people went broke. They didn’t find any gold where others had or they didn’t find enough:

“The actual reason that so many people walked away from the rush penniless is that they couldn’t find enough gold to stay ahead of their costs.” ~ Tyler Crowe, Sept. 27, 2014, in USA Today

You could be doing a lot of great things, just not doing the RIGHT things.

The good thing is many people do some research. The problem is not enough of it or directly enough. They are still digging in the wrong place.

“If I had only one hour to solve a problem, I would spend up to two-thirds of that hour in attempting to define what the problem is.” ~ An unknown Yale professor, wrongly attributed to Einstein.

Think about this for a moment: How can you sell something to anyone when you’ve never talked to them or listened to what they have to say?

Product owners often believe they know their customers, but assumptions usually outnumber verifiable facts. Watching session playback can hint at problems. Google Analytics gives a funnel breakdown, but it doesn’t give much insight into a customer’s mind. It’s like trying to diagnose the cause of indigestion without being able to ask the patient what they had for dinner or if they have other more serious health complaints.

The problem is it’s all impersonal, there’s no empathy. There’s no “Oh man, that sucks, I see how that is a problem for you”. It’s more like “Maybe people would like a screenshot there. I guess that might be helpful to somebody”.

Real empathy spurs action. When you can place yourself in your customer’s situation, you know how to go about helping them. If your solution doesn’t work, you can try again, because you know the problem is real rather than a figment of your imagination.

A Pattern Is A Solution To A Problem

Therapist: “Wait, don’t tell me your problem. Let me just list all the advice that has helped my other patients.”

Let’s say some type of visual change has worked on 10 different sites. Let’s call it a pattern.

A pattern works, because it solves some problem. So choosing from a library of patterns is choosing the problem you have. You don’t choose Tylenol unless you have a headache or fever. You don’t choose Maalox unless you have indigestion.

If you know what YOUR problem is, you can choose the right patterns to solve it.

If you don’t know the problem, you won’t get far choosing a pattern because it’s popular, because of how strongly it worked or how many people it has worked for. That’s like taking a medication you’ve never heard of and seeing what it does for you.

Pattern libraries are great for when you have a problem and want a quick, time-tested way to solve it:

Research Uncovers The Problem: A Short Story

Say you’re a shoe brand. You decide to reach out to people who are on your mailing list but haven’t purchased yet.

So you send out a survey. Within the first day, it becomes clear that many people are avoiding buying your shoes, because they’re not sure about sizing.

You’re shocked, but you shouldn’t be. User research insights are often surprising.

It’s just that you thought you had anticipated this by posting precise measurements, a great return policy, and glowing testimonials. If anything, you thought people would mention the price, but so far no one has mentioned price.

That’s a big deal for your product strategy. You need to build trust. So you set aside your plans for a full redesign (those fancy carousels on your competitor’s site sure are tempting). You set aside A/B test ideas about the font size of prices, removing fields, and so on.

You tackle the big problem. You do some research and come up with solutions:

  • match sizing to a set of well known brands
  • provide a printable foot template
  • allow people to order two sizes and return one
  • mail out a mock plastic “shoe” free of charge, and so on…

You ask a couple of people to come to the office and try some of your solutions.

Your user testing methodology is simple: First people pick their size based on either the sizing chart or template. Then they see if the real shoe fits.

Result? The matched sizing and the foot template were both effective in predicting fit. The initial template, though, didn’t work so well in user testing, because it’s hard to place a 3D foot in perfect position on a 2D printout. So you come up with a template that folds up at the back and front, simulating a shoe. The users liked that much better. In fact, you start working on a cardboard model you can mail cheaply to anyone who requests it.

Now you’re off to testing it in the real world!

You design 2 different foot-sizing comparisons: one pretty one with photos of the top 3 brands, and one long, plain table with 20 different brands. You also create an alternative page that links to the downloadable foot template.

You A/B test these variants over 2 weeks and pick the one that works.

(Then you go back to your research and find the next problem.)

You may also like this post about patterns: Compact Navigation Patterns.

If you want to uncover the biggest problems for your customers, I’m happy to help.

The Real Life Test For Intuitiveness

This article is a work in progress…

Comparing an interaction to its real-life equivalent can be a useful test of how intuitive it is. Does it match your real-life mental model?

Example 1: In-Game Inventory Management

Imagine you’re scavenging in a post-apocalyptic city. Your bag is full. Get to a safe place. Lock the door. See what you have.

What do you do?

I bet you dump it out on the floor in front of you.

But in the Fallout games, you pull out an alphabetical list with obscure names:

Now, you could get fancy with mimicking a real-life experience. Instead, what if we had a simple gallery that we can reorder by dragging, like this rough mock-up:

You can now see how many pistols you have, which weapons are bigger… all at a glance.

It’s better to use an existing intuitive cue (like size) than to create a new abstract cue (like range). What if more powerful weapons were always beefier? What if the most accurate rifles were longer?

In the game, there is little correlation between a weapon’s size or visual impressiveness and its damage. You’d think ANY gun would finish an enemy with just a shot or two at close range, but that’s not the case. I don’t know which weapon can incapacitate a raider with a single shot at close range. If I did, the choice of which pistol to carry would be less intimidating.

Example 2: Storage Metaphors

Here’s another example. In the Farmville game, the user can buy all sorts of equipment, seeds, etc. But getting to the inventory requires menu diving.

Where’s my stuff?

Well, in real life, my stuff would be in my storage shed. So let’s add a shed:

Example 3: Character Interaction

In one aquarium simulation game, the fish swim around, and you have to feed them and buy stuff for the aquarium. I found it unexciting.

I had a real fish once (a rescue), and my real-life experience with him was quite rewarding. He saw me and followed me when I entered the room. He was at times curious, lethargic, startled, cozy… Why not model the AI of the fish to simulate some of these real-life behaviors? Wouldn’t that engage users more?

There are many situations where comparing to the real-life equivalent can generate solutions to UI problems.

The User Story In Context And Time

Meaning is not in the words — it’s in the total situation. – Ronald Langacker

To know if we created a great product, we need to test the User Experience beyond the screen:

Level of Test → Types of Stories

Individual screens → Short-term usability stories about UI-level problems.
“I wanted to buy the product but couldn’t find a Buy button.”

Flow in context → User Experience stories that show whether the product can successfully do the job it was hired for.
“I avoided using the medical software, because it forced me to turn away from my patient.”

Usage over time → Full User Experience stories that show how well the product works as the details of the job change over time.
“The gorgeous curved screen design that I loved at first caused the screen to break a few months later, which cost me $400 to repair.”

Visually polished and usability-tested screens can still lead to a failed product experience in the long run.

Case Study: Inventory System in Fallout 4 Game

The inventory gets long and hard to manage as you pick up tons of items. For example, it’s hard to compare weapons, because you can only see stats for one at a time:

But these sorts of usability-level issues are the easiest to fix. For example, as a workaround, this user prefixed the weapon names with useful stats. This makes it possible to compare items at a glance.

However, fixing screen-level problems is small potatoes in comparison to the larger issues that hurt the game. Here are user stories that evaluate the inventory system at different levels:

Example: Choosing the best apparel

At the screen level:
Problem: I can apply clothing in inventory mode A by clicking it, but how do I apply apparel while in inventory mode B? Clicking in mode B sells apparel instead.
What happened? My workaround is to exit the trade dialogue and go into inventory mode A. I then use the body chart there to see what I’m wearing now and which new apparel is superior. I click the right apparel to apply it and go back into the trade dialogue and click the apparel I’m no longer wearing to sell it. However, I sometimes nearly sell an item by mistake when I reflexively click it to apply it.

At the context & time levels:
Problem: What kind of apparel is going to keep me safe?

I’ve spent hours wondering which items to carry, figuring out where to stash inventory when I had too much to carry, comparing items, and agonizing over which to sell. In the end, I mastered the inventory UI, but my overall UX was poor.

I expected my diligence to pay off, but that optimal weapon I handpicked still couldn’t kill the next enemy and despite all the hoarding and trading I still couldn’t afford the best apparel.

At times I was paralyzed, even quit the game, when I found something important and had to figure out what to drop to make room. I’ve found myself not wanting to go into a new building, because finding new things had become a burden.

You can clearly see which user stories affect the User Experience more profoundly.

There are issues of both clarity and meaning. Should I use the gun with 100 accuracy & 30 damage or the gun with 200 accuracy & 10 damage? What’s the difference between “accuracy” and “range”? Is a “fire rate” of 6 good or bad? These questions are frustrating, but the bigger UX problem is why any of this matters in the first place.

How does a feature translate into outcomes the user cares about?

If we can’t answer that, we have more than just a usability problem.

Case Study: Context for Medical Software

There’s a great story about software failure in Clayton Christensen’s Competing Against Luck:

We’d designed a terrific software system that we thought would help this doctor get his job done, but he was choosing to ‘hire’ a piece of paper and pen instead…

Why? The design team overlooked the situational and emotional context:

“As [Dr. Holmstrom] began to discuss Dunn’s prognosis, he grabbed a piece of paper to sketch out, crudely, what was wrong with Dunn’s knee and what they could do to fix it. This was comforting, but puzzling. Dunn knew there was state-of-the-art software in that computer just over Holmstrom’s shoulder to help him record and communicate his diagnosis during an examination. But the doctor didn’t choose to use it. “Why aren’t you typing this into the computer?” Dunn asked.

…The doctor then explained that not only would typing the information into the computer take him too much time, but it would also cause him to have to turn away from his patient, even just for a few moments, when he was delivering a diagnosis. He didn’t want his patients to have that experience. The doctor wanted to maintain eye contact, to keep the patient at ease, to assure him that he was in good hands…”

Case Study: Samsung Galaxy Edge

The Samsung S7 Edge phone was very slick with its curved edge. But it turned out that this design choice made the phone hard to protect. Flat screens allow protective cases with higher sides that rise over the screen. Cases for curved screens rise just barely. Even with a high-quality case, this screen cracked along the curved edge (and it happened twice)!

If we look beyond the first experience and aesthetic factors, we see a very different User Experience story.

The cost of repair was $400 the first time. The second time, I had to replace the perfectly functional phone, which didn’t suit my ecological values.

Sadly, Samsung appears to have standardized this design. I suspect it’s even financially lucrative, due to demand for pricey replacement parts or replacement devices.

Case Study: Fitbit Dashboard

In The Big Book of Dashboards, Steve Wexler describes how his experience with his Fitbit changed over time:

“After a while, I came to know everything the dashboard was going to tell me. I no longer needed to look at the dashboard to know how many steps I’d taken. The dashboard had educated me to make a good estimate without needing to look at it. Step count had, in other words, become a commodity fact. I’d changed my lifestyle, and the dashboard became redundant.

Now my questions were changing: What were my best and worst days ever? How did my daily activity change according to weather, mood, work commitments, and so on? Fitbit’s dashboard didn’t answer those questions. It was telling the same story it had been telling on the first day I used it, instead of offering new insights. My goals and related questions were changing, but Fitbit’s dashboard didn’t. After a year, my Fitbit strap broke, and I decided not to buy a replacement. Why? Because the dashboard had become a dead end. It hadn’t changed in line with my needs.”

So there wasn’t anything wrong with the dashboard in 2 dimensions, but its usefulness wasn’t constant along the dimension of time.

How to Uncover the User Experience Story

Time is a necessary dimension of user testing. Retrospective interviews and delayed customer feedback help reveal the total story. You want to know how the customer enjoyed the shopping experience or their first time playing a game. Then you should follow up to see how they feel weeks later.

You can get more insight into how the story unfolds through something like a diary study, which allows users to keep track of their usage at their own pace. They can capture one-off occurrences and experiences that might at first seem insignificant and would get lost otherwise.

15 Jobs-To-Be-Done Interview Techniques

Here are 15 techniques I extracted from the Jobs-To-Be-Done interview Bob Moesta’s team did with a camera customer (link at bottom):

Set expectations

Give an introduction to how long the interview’s going to take and what sorts of things you’re interested in. For example, “even minor details may be important”.

Ask specific details to jog the customer’s memory

Don’t just ask what the customer bought, but why that model, which store, what day, what time of day, were they in a rush…

Use humor to put the customer at ease

Intentionally or not, early in the interview the whole team had a good laugh about something the customer said. I think it did a lot to dull the edge of formality.

Discuss pre-purchase experiences

Ask what the customer used before they bought the product and what they would use without it. Dig into any “I wish I had it now” moments prior to the purchase.

Go back to the trigger

Walk back to what triggered the customer to even start thinking about buying the product, and to a time before they ever considered it.

Get detailed about use

Interviewers and the customer talked about how she held the camera, which hand, in which situations she used it, which settings she used, and advantages/disadvantages of the alternatives. You want the customer to remember and imagine the product in their hands. Things like the weight or texture of the product could impact the user experience. Dismiss nothing.

Talk about lifestyle impact

Dig into ways in which the product impacted the customer’s lifestyle, things they were/are able or unable to do. For example, they talked about how taking pictures without the camera affected the way she presented her trip photos to her sister. Focus on the “use” rather than the specific “thing”. For example, you can ask “do you like this feature”, but then you want to move to “what does this feature mean to you in terms of what you’re able to do, how it affects your lifestyle, your future decisions”.

Explore product constraints

Talk about how other decisions and products impacted the purchase decision. For example: the size of the bag that has to fit the camera, and avoiding the slippery slope of requiring additional accessories.

Ask about alternatives

Products don’t exist in isolation. The customer had several other solutions, which serve different, specific purposes. Figure out whether the new product will replace or complement other products.

Point out inconsistencies, such as delays

Interviewers pointed out that the customer waited a long time to buy the product from the initial trigger to making the call after a trip. They asked “Why did you wait so long?”

Talk about the influence of other people

Ask about advice other people gave the customer or how other people may be affected by the decision.

Don’t put words in their mouth

In digesting and summarizing back to the customer, it’s easy to inject your own conclusions and words. Try to elicit attitudes and conclusions from the customer. Lead them to it, but don’t do it for them (a related technique is to start talking and then leave a pregnant pause, so the customer can complete the thought). In one clear case in the camera interview, the interviewers asked a leading question but then promptly noticed this and corrected themselves, saying “Don’t use his words”.

Talk about the outcome

Ask open-ended questions about whether the customer was happy with the purchase and in what ways. Ask about specific post-purchase moments when the customer felt “I am glad I have it right now”, but focus on how the situation was affected, not on the product itself.


Here are some additional techniques I considered after listening to the interview:

Avoid fallacy of the single cause

Don’t push the conversation towards a single cause (see Fallacy of the single cause). Rather than engage in cause reductionism, accept there may be multiple, complex causes.

Let’s say you pose the question: “Joe said that, and so you decided to buy X?” The simple narrative may be intuitive, causing the subject to be persuaded that “Yes, I guess that is why I decided to buy X”. The events may be true (Joe did say that) but may in fact be unconnected. In these cases, it’s important to point out inconsistencies rather than seek confirmation. For example, in the camera interview the interviewer rightly pointed out an inconsistency: “Why did you wait so long to buy X after he said that?” They also often asked “Why didn’t you…”. Work together to uncover the truth.

Beware planting false memories

Do not reflect back your own sentiments or ideas to the interviewee when clarifying. For example, asking people to confirm something they did not literally say may cause them to confirm a causal relationship that did not happen (other cognitive biases may aid this: pleasing the interviewer, tendency to fall for reductionism). It may plant a subtle attitude that might then be amplified through the course of the interview. Also be careful with “because” statements, as there is some evidence that we are biased to accept such explanations even when they are irrational (see The Power Of The Word Because).

More on the possibility of implanting false memories: Video 1 and Video 2.


Listen to the interview for yourself.

Guidelines for instant search filters

Vivareal.com.br recently ran a test, where they removed the Apply Filter button and instead updated the search results instantly (screenshot from goodui.org/evidence):

The results are weak but suggest that instant filter behavior might improve a deep metric like Leads (form submits). Given that many prominent sites are doing it, the rest have something to try. It also raises questions about the best implementation for a given site.

Progress Indicator Is Mandatory

On other sites, like Reverb.com, the transition is smooth. The search results go grey, then they populate in-place:

If search results take more than about 200 ms to update, use a spinner, grey out the results, or use another progress indicator (see Jeff Johnson, 2010, for more on human timing requirements). Beyond that threshold, the search filter won’t be perceived as instant and users will be put off.
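One way to respect that threshold is to delay the indicator itself, so fast responses still feel instant and only slow ones get a spinner. A minimal sketch; the endpoint, class names, and rendering are placeholders:

```ts
// Sketch: grey out results immediately, but only show a spinner if the
// request takes longer than ~200 ms, so fast updates still feel instant.
async function applyFilters(filters: Record<string, string>): Promise<void> {
  const results = document.querySelector<HTMLElement>('#results')!;
  results.classList.add('is-stale'); // e.g., reduced opacity via CSS

  const spinnerTimer = window.setTimeout(() => {
    results.classList.add('is-loading'); // spinner shows only for slow responses
  }, 200);

  try {
    const response = await fetch('/search?' + new URLSearchParams(filters));
    results.innerHTML = await response.text(); // placeholder rendering
  } finally {
    window.clearTimeout(spinnerTimer);
    results.classList.remove('is-stale', 'is-loading');
  }
}
```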

Autotrader.ca doesn’t show results instantly. To avoid confusion, they grey out the results and show a green alert to confirm that the user’s filter changes have not yet been applied:

Page Refresh Or Not?

On many big sites like eBay and Amazon, filters behave as links. There is a delay and a full page refresh. Here’s eBay:

Some problems with this:

  • The redirect always takes too long. And after a page refresh, I find I need a second to refocus and figure out what happened.
  • I usually don’t find it obvious how to undo a filter or go back in that case, because the UI has changed. I’m probably not the only one.
  • Change Blindness could obscure some important visual change (e.g., people may not notice that some other categories have been updated).

A page refresh may be necessary if there’s a long list of filters that goes past the fold. If you update listings immediately in that case, the updated results could be out of sight at the top of the page. Many sites also update related categories and sub-filters based on the filter chosen, so essentially the whole page does have to change.

I would be very interested to see an alternative pattern to this that is less jarring, faster, and allows easy backtracking. For instance, an animated scroll back to the top may be a better option than a full page refresh in some cases.

Submit For Inputs Or Update As You Type

In the case of Reverb, notice there is a small submit button for the text inputs. All the other fields are instant, but you submit an input field manually when you’re done typing:

In contrast, on sites like Netflix and platforms like Apple TV, filter-as-you-type is heavily used and is very effective:

My recommendation is to use this only if you serve results FAST and if the search terms are bounded. For example, Netflix is a repository of movies and there are only so many movies that start with “Fast and”. So it’s a good solution for them. Likewise, on a stock site, showing the most common stocks might work, because stock symbols are a limited set.
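For text inputs, filter-as-you-type is usually paired with a debounce so you only query once the user pauses. A minimal sketch, reusing the hypothetical applyFilters helper from the spinner example; the 250 ms delay and the selector are assumptions:

```ts
// Sketch: debounce filter-as-you-type so a request fires only after the
// user pauses briefly, rather than on every keystroke.
function debounce<T extends (...args: any[]) => void>(fn: T, wait: number): T {
  let timer: number | undefined;
  return ((...args: any[]) => {
    window.clearTimeout(timer);
    timer = window.setTimeout(() => fn(...args), wait);
  }) as T;
}

const keywordInput = document.querySelector<HTMLInputElement>('#keyword-filter')!;
keywordInput.addEventListener(
  'input',
  debounce(() => applyFilters({ keyword: keywordInput.value }), 250)
);
```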

Autocomplete

If filter-as-you-type is not feasible, consider using Autocomplete on the input fields. When using Autocomplete, you can delay showing results until enough characters are typed. And when you show matches, it’s text-only and right below, so it’s not distracting. Moreover, web users are all familiar with this pattern.

Reverb uses Autocomplete in its main search input on the home page, but not on the Keywords filter.

Do Not Freeze

When a search filter is clicked, it should not block the whole app. That is, a user should be able to apply more filters while the previous ones are still fetching data. Similarly, removing a filter should apply optimistically. For example, when I click an enabled checkbox filter, it should immediately show as “off”, even while the data is still being fetched.
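In practice this means reflecting the control’s new state immediately and letting a newer filter change supersede any request still in flight. A sketch using AbortController; collectFilterState and renderResults are hypothetical helpers, and the endpoint is illustrative:

```ts
// Sketch: non-blocking, optimistic filters. Controls are never disabled, and
// an in-flight request is cancelled when a newer filter change supersedes it.
declare function collectFilterState(): Record<string, string>; // hypothetical
declare function renderResults(data: unknown): void;           // hypothetical

let inFlight: AbortController | null = null;

async function onFilterChanged(): Promise<void> {
  const filters = collectFilterState(); // the UI already shows the new state

  inFlight?.abort();                    // supersede the previous request
  inFlight = new AbortController();

  try {
    const response = await fetch('/search?' + new URLSearchParams(filters), {
      signal: inFlight.signal,
    });
    renderResults(await response.json());
  } catch (err) {
    if ((err as Error).name !== 'AbortError') throw err; // aborts are expected
  }
}
```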

Cache Results

If data sets take a long time to load, we should cache the results. That way when people undo a filter, we can go back to the previous view instantly. It may also be a good idea to keep the original results until new ones are ready. That way, if a user accidentally clicks a filter and then undoes it, you can just cancel the new lookup and show the original results.
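A simple way to do that is to key a small cache on the serialized filter state; going “back” then becomes a cache hit. A sketch (the endpoint is illustrative):

```ts
// Sketch: cache results by serialized filter state so undoing a filter
// restores the previous view instantly instead of refetching.
const resultCache = new Map<string, unknown>();

async function fetchResults(filters: Record<string, string>): Promise<unknown> {
  const key = new URLSearchParams(filters).toString();
  const cached = resultCache.get(key);
  if (cached !== undefined) return cached; // instant undo / back

  const response = await fetch('/search?' + key);
  const data = await response.json();
  resultCache.set(key, data);
  return data;
}
```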

Best For Simpler Searches

If a user is expected to apply multiple filters, then there is no point running a search after every single change; a manual submit is best in this case. This is also true when there are many filters. For example, your basic search could be instant, but if the user pulls down the massive set of Advanced filters, then a submit button would be needed.

Also if the data set takes a long time to filter (because it’s vast or complex), a manual submit may be better. For example, the search filter for the ChemTrac municipal chemical data repository has an Apply button which filters a very busy map:

Estimate Search Time

It’s a good idea to log search times to see if it’s taking too long. If searches take more than 1 sec on average, then you could show the ETA using a progress indicator or countdown, rather than just a generic spinner. For example, the map for ChemTrac can take a few seconds to populate. Here’s what it might look like with an ETA:
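A crude way to produce that estimate is to log recent search durations and use their average as the ETA. A sketch (the 20-sample window and the 1-second threshold are assumptions):

```ts
// Sketch: keep a rolling log of search durations and expose a rough ETA.
const recentDurations: number[] = [];

function recordSearchDuration(ms: number): void {
  recentDurations.push(ms);
  if (recentDurations.length > 20) recentDurations.shift(); // rolling window
}

function estimatedSearchTimeMs(): number | null {
  if (recentDurations.length === 0) return null;
  return recentDurations.reduce((sum, ms) => sum + ms, 0) / recentDurations.length;
}

async function timedSearch(url: string): Promise<unknown> {
  const started = performance.now();
  const data = await (await fetch(url)).json();
  recordSearchDuration(performance.now() - started);
  return data;
}

// If estimatedSearchTimeMs() exceeds ~1000 ms, show a countdown or progress
// bar based on that estimate instead of a generic spinner.
```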

What is your experience with Instant Filters?

Share your thoughts and additional criteria.

Patterns for optimizing checkouts: flow

When a visitor knows what to expect and completes a process smoothly, I call this good “flow”. This post shows some options for how a checkout can be organized and presented to anticipate questions like: Does everything look right with my order? How long will this take? Is it going to be complicated? What are my options?

 


Account Creation

Allow users to check out as guests without side-stepping into an account creation flow:

checkout-guest

Better yet, conceal account creation. For example, ask customers if they want to save their information at the bottom of the form (or on the Confirmation page after the order is completed):

checkout-noaccount

For existing customers, you can provide a link to login or a small sign-in form on the side. If a user chooses not to login, you might check if their email is already in the system and offer to retrieve their last used info:

checkout-login

If someone forgot their password, you can tell them to continue as a guest to avoid the delay of recovering their password:

checkout-recover


Express Checkouts

Give existing customers an express checkout option. On different sites, it may be called “express checkout” or “1-click purchase”:

checkout-express

Another type of express checkout is when an email recipient clicks a unique email link, so that when they land on the page, they get the option of using the same billing and shipping details as before. A Complete Money Back Guarantee helps ease doubts about an “express” checkout, since the customer sometimes doesn’t even get to see their last used information. If you offer 1-click purchases, include a Cancel/Undo option right on the Confirmation page. I’ve used a 1-step checkout before, not realizing it would literally place the order without any confirmation.

 


Checkout Tunnels (Enclosed Checkouts)

A checkout that keeps normal navigation and sidebars creates a more natural transition. It tells customers “Check out now if you want, or keep looking around for other products”.

In contrast, a checkout tunnel removes all distractions. It tells customers “You’ve finished browsing. Time for payment”. Test the impact on your total order value, time to purchase, as well as completion rate. Keep consistent branding, and keep some common elements as visual anchors (e.g., remove the navigation links but preserve the area, so content areas don’t jump too much after the page transition).

checkout-tunnel

One hybrid approach is opening the checkout in a modal with a faded background. The fading shifts attention away from background elements. At the same time, it maintains a strong connection to the product, since the product page remains in the background. One way to preserve that on a separate checkout is to include the image of the product being purchased.

 


Form Layout

The goal is to make the form look easy to fill.

Direct the flow of attention in one direction, top to bottom. Avoid columns. That said, you can group short and closely related fields, especially if it’s expected (e.g., credit card mm/yy expiry fields should appear together):

checkout-cols

Give fields an appropriate maximum width. A narrow form will look simpler, because it appears to require less typing in each field:

checkout-width

Keep labels above fields to make the field-label unit easier to process. Left-aligned labels have their own advantages – they shorten the form and are easier to scan (see Top, Right or Left Aligned Form labels):

checkout-labels

Avoid placeholder text and inner labels, because they create confusion about which fields are completed and which are not. Inner labels may be OK on very short forms (2-3 fields), but make sure the label remains visible once the user starts typing. I like the pattern that moves the label up to the field’s border rather than removing it.


Stepped Progress

To make the form look like less work, chunk it up. You can have a long form with numbered sections separated with spaces or lines. Alternatively, you can use the “accordion” pattern to show one section at a time, while other sections are collapsed. Some checkouts span separate pages, such as Personal Info > Shipping > Payment (see examples with test data on GoodUI Evidence):

checkout-steps

If you use a single long form, create distinct, intuitive sections (like Shipping Address, Payment), which you can also number. Test for the best field to start with: Is it the email? Is it the shipping preference? What is low friction? What is high engagement? What is high commitment?

If you use a multi-page checkout, use a breadcrumb or other progress indicator. For your “Next” buttons, use a label that sets an expectation, such as “Next: Payment”.

In your analytics, measure drop-offs at each step and engagement with key fields, so you can compare the effectiveness of each layout (e.g., how many people start filling in the credit card field).
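A minimal sketch of that instrumentation, assuming a generic track(event, data) call rather than any particular analytics SDK, with data attributes as illustrative markup hooks:

```ts
// Sketch: fire events for step completion and first engagement with key fields,
// so drop-off and engagement can be compared across checkout layouts.
declare function track(event: string, data?: Record<string, string>): void; // your analytics call

function instrumentCheckout(form: HTMLFormElement): void {
  // First focus on a key field counts as engagement (fires once per field).
  form.querySelectorAll<HTMLInputElement>('[data-key-field]').forEach((field) => {
    field.addEventListener(
      'focus',
      () => track('field_engaged', { field: field.name }),
      { once: true }
    );
  });

  // Each step's "Next" button marks that step as completed.
  form.querySelectorAll<HTMLButtonElement>('[data-step]').forEach((button) => {
    button.addEventListener('click', () =>
      track('step_completed', { step: button.dataset.step! })
    );
  });
}
```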


Payment Alternatives

Choose a transaction processor with a high success rate. In addition to your default processor, you can offer an alternative gateway, such as PayPal. Conversely, see if removing the choice increases revenue:

checkout-processors

A 3rd-party checkout usually takes the user away from your site and provides an experience you can’t track and have no control over, but it may increase revenue.

You can also use a fallback processor when a transaction is declined. If automating that is not possible, you can show a more informative Declined message with a link to the alternative, like PayPal.


Order Review

If you have a review step, try removing it, as it’s likely unnecessary. However, if you have a long checkout spanning several screens, it may be reassuring to see a summary before committing to the order. See what works.

checkout-review


In the next post, I plan to look at the Fields aspect of a checkout, which tells the user what data to provide and in what format. If you’d like to read that, please leave a comment so I know there’s interest.

Are there other patterns and aspects of a checkout I have not covered?

Tips for A/B testing with low traffic

After reading this post, you will be able to say whether your test has “low traffic”, decide if A/B testing is worth it, and know what to do if you decide to A/B test.

No traffic

Technique 1: Do confirmatory not exploratory testing

Exploratory testing = you run an A/B test to look for big or small changes that will increase your conversion rate. You come up with some ideas, then you test them to see which of them work.

Confirmatory testing = you make a risky update (e.g., removing a free trial) and want to confirm there is no huge negative effect (risk mitigation), or you do an aesthetic site update and want to see that it’s at least not worse than before.

If you get <1,000 visitors per month with 5% converting, you should not be doing exploratory A/B testing.

The best thing you can do instead is to proactively look for bugs on your site. You can also do 1-on-1 user testing, surveys, and simply deploy your best design and watch your conversion trends.

However, you might still be able to do confirmatory testing.

For example, say I ran a test for 2 months and found this statistically insignificant result (70% confidence level):

2016-10-05-14_57_01-a_b-test-split-test-calculator-_-thumbtack

Based on this, I’m confident enough that my new redesign is no worse than the original, with some chance it might be better. That is useful information.

Technique 2: Find a proxy metric

You can increase your test sensitivity by using a higher baseline metric. Say, for example, that your primary metric is purchases, but the purchase rate is only 2%. Here’s the lift you can detect (try it yourself: Vlad’s What-If A/B Test Planner):

2016-10-05-22_12_32-mozilla-firefox

What’s a key milestone prior to sales? Let’s say your form fill (not yet submitted) rate is 3%. Maybe form starts are 4% (e.g., people enter an email). If we measure form starts, much smaller changes are detectable:

2016-10-05-22_13_00-mozilla-firefox

Of course, form starts are not purchases, but it’s a behavior that suggests improvement AND it’s a good supporting metric.
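To see why the higher baseline helps, here is a back-of-the-envelope minimum detectable effect calculation using the standard normal approximation (95% confidence, 80% power). The visitor counts are illustrative; use a proper planner like the one linked above for real tests:

```ts
// Sketch: minimum detectable effect (MDE) for a two-proportion A/B test,
// using the normal approximation with alpha = 0.05 (two-sided) and 80% power.
function minimumDetectableLift(baselineRate: number, visitorsPerVariation: number): number {
  const zAlpha = 1.96; // 97.5th percentile of the standard normal
  const zBeta = 0.84;  // 80th percentile (80% power)
  const standardError = Math.sqrt(
    (2 * baselineRate * (1 - baselineRate)) / visitorsPerVariation
  );
  const absoluteMde = (zAlpha + zBeta) * standardError;
  return absoluteMde / baselineRate; // relative lift you can reliably detect
}

// With 1,000 visitors per variation: a 2% purchase rate vs. a 4% form-start rate.
console.log(minimumDetectableLift(0.02, 1000)); // ≈ 0.88, i.e. only an ~88% lift is detectable
console.log(minimumDetectableLift(0.04, 1000)); // ≈ 0.61, i.e. an ~61% lift is detectable
```

With the same traffic, the shallower metric lets you reliably detect a meaningfully smaller lift.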

Here’s how to use this to analyze your test. Say at the end of the test, you find these effects:

untitled-1

As you can see, in terms of purchases the variation beat the Control by only 10% and it’s not a statistically strong result. But the preceding steps show a progression from engagement to purchases and the shallower goals are statistically stronger. So the big picture here is actually pretty good.

You can also compare the performance of various metrics to see which are fairly in sync. For example, here’s what it might look like if Form Field Engagement is a great proxy for Revenue:

[Chart: Form Field Engagement moving in sync with Revenue]

A word of caution: the mere fact that the metrics line up DOES NOT increase the likelihood that B is the winner. These are correlated metrics that measure the same behavior (i.e., "purchase" implies "form completion", which implies "form engagement", which implies "scrolling to the form", and so on). These metrics will line up whether the effect is real or a false positive. What you CAN say is that, since these metrics are correlated, you can use the shallower metric as a proxy for the deeper metric.

Technique 3: Look for consistent performance

With low traffic, you get too few visitors per day to gauge daily variability, but you can track weekly trends. If a variation is winning more consistently, then it is more likely to be a winner.

Here is week-to-week performance over 7 weeks for two tests, both showing a 6% cumulative improvement:

[Chart: week-to-week performance for two tests, each showing a 6% cumulative improvement]

But one test shows a lot of week-to-week variability, whereas the other shows blue winning for 5 weeks straight. All other things being equal, the second offers more reliable evidence.

Technique 4: Compare segments

Another way of checking consistency is to compare performance across user segments. For example, we tested the same variation simultaneously on two virtually identical landing pages on different domains. We expected the variations to do similarly on both. Comparing day-to-day performance for a sample week, we see that the effect sizes for the two tests are moving roughly in sync:

[Chart: day-to-day effect sizes for the two landing pages over a sample week]

If the variations are not in sync, it could mean the effect size is false or that the segments are dissimilar. However, if the segments are in sync, it’s a good sign. If you have a low daily rate, you should instead check bi-daily or weekly rates. Otherwise, performance jumps around too much just by chance.

Technique 5: Don’t expect big changes to bring a big win

Some people will tell you to aim for large effect sizes by making big changes. For example, with 4,000 visitors and a 5% rate, you can detect an impressive 54% to 89% lift:

[Screenshot: detectable lift with 4,000 visitors at a 5% conversion rate]

That’s true, but how exactly are you going to achieve that? Is that realistic and worth the effort?

Big design changes have the potential to bring bigger wins, but it doesn’t mean that’s likely to happen. Big tests do fail often. They also take more development effort. So instead of fixing bugs on your site and building new features, you may end up doing lots of testing work with no results.

In the above scenario of 54-89% lifts, you are also likely to hit a large false positive and come away believing your design worked:

[Screenshot: false positive risk at this sample size]

The only way to reduce the potential for large false positive errors is more traffic, even if you’re testing big changes. Hope for the best, but plan on testing for a while and putting in more time before you hit upon a win.

More techniques for planning tests and analyzing the results

What to test:

  • prioritize big changes that require little effort
  • test radical changes (think big conceptual shift, not just big visual changes)
  • start with ideas you have good, research-backed reasons to test

How to test:

  • test 1 idea at a time
  • plan to run tests longer
  • keep reminding your team a test is running
  • you don’t have to freeze development as long as you make changes globally

How to interpret:

  • you might have to tolerate variations going “red” for days as data is collected slowly
  • keep in mind that the effect size, even if the effect is real, is likely inflated
  • ignore big lifts if you’ve got only 1000 visitors so far (remember your false positive risk)

How much traffic is enough?

For example, with 20,000 monthly visitors and 0.5% conversions, you’ll have a tough time testing. But with 15,000 visitors and 10% conversions, you get a decent spread of effect sizes around 15% that are detectable within 1 month:

[Screenshot: detectable effect sizes with 15,000 visitors at a 10% conversion rate]

This is what I’d call “adequate traffic”.

Adequate traffic = you can run a test for 1 month or less with enough sensitivity to detect a 15% lift. Anything that takes over 1 month or aims at unrealistically large effects is low traffic.

Why 1 month? Because beyond that, things tend to get messy. Users clear cookies and reenter in different variations, your dev team accidentally introduces some change, and so on. Once you start talking of months instead of weeks, the test becomes a burden instead of an opportunity.

Why 15%? In my opinion, 15% is the sweet spot. If your sensitivity is aimed at 15% but you detect a 30% effect, then great – you’ll either have super-strong data or you can stop a bit earlier. If you detect a 10% effect, then you probably still have decent sensitivity to see a suggestive result.

Conversely, if you aim for 50% and the true effect is 15%, then you'll be chasing phantoms. Since you virtually never know what to expect, it's best to be conservative. I've found that 15% is roughly the effect size at which A/B testing becomes reasonable for many sites.

Site Traffic Examples

Traffic | Baseline Conversions | Verdict | Why?
20,000 / month | 0.5% | Low Traffic | Base rate is low. Test takes many months and/or a minimum effect of 50% is required.
1,000 / month | 10% | Low Traffic | Site traffic is low. Test takes many months and/or a minimum effect of 50% is required.
15,000 / month | 6% | Adequate Traffic | With this base rate and traffic, you can detect a 15% effect or so. Enough for a test.
50,000 / month | 5% | Adequate Traffic | With this base rate and traffic, you can detect a 15% effect or so in 2 weeks.
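As a rough sanity check on the table, here's a back-of-the-envelope duration estimate built on the same normal approximation as the earlier sketch (two variations, 95% confidence, 80% power); real calculators will differ somewhat:

```ts
// Back-of-the-envelope duration estimate for a 2-variation test.
function monthsToDetect(monthlyVisitors: number, baselineRate: number, relativeLift: number): number {
  const z = 1.96 + 0.84;                 // 95% confidence, 80% power
  const p = baselineRate;
  const delta = p * relativeLift;        // absolute difference to detect
  const perVariation = (2 * p * (1 - p) * z * z) / (delta * delta);
  return (2 * perVariation) / monthlyVisitors; // two variations share the traffic
}

console.log(monthsToDetect(20000, 0.005, 0.15)); // ~14 months: low traffic
console.log(monthsToDetect(15000, 0.06, 0.15));  // ~1.5 months: borderline adequate
console.log(monthsToDetect(50000, 0.05, 0.15));  // ~0.5 months (about 2 weeks): adequate
```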

Conclusion

Testing on low traffic is paaaainful. Look at what you plan to gain from testing something, look at your chances objectively, and if you do go ahead, be patient and understand what you should and should not expect.

Did I miss anything? Let me know and I will update this.

Crowd-sourcing A/B test predictions

The collective guess of a crowd can be more accurate than that of an individual. For example, over a hundred years ago, statistician Francis Galton noticed that a crowd of people could guess the weight of an ox with over 99% accuracy. In a more complex domain like politics, we know that expert predictions are terrible, but the average of their guesses is better.

How accurate is a crowd when it comes to good UI design?

crowd

How good is the Crowd at picking a better design?

Behave.org has a good repository of people’s guesses about competing designs. Various contributors submit their A/B tests to the “Test of the Week” section, which asks users to guess the winner. Here, crowds get to weigh in on a single idea, one that is not their own. Moreover, these tests are curated to be interesting, which implies they are carried out by more experienced people with more robust hypotheses.

When we look at the last 106 tests on Behave.org, we see that:

53% of the tests show a variation that beat the baseline. But when the Crowd guessed the outcome of these tests, they guessed right 72% of the time (36% better). Interestingly, the Crowd did choose B about 50% of the time. It was just better at NOT choosing B when it was not an improvement.

FYI, 53% does not represent the average success rate of test contributors, since this is not a random sample. We just use it as an arbitrary baseline, to see if the Crowd makes the same guesses or better ones.

Why does the Crowd do better?

First, a test designer chooses B 100% of the time, since B is by definition the improvement. In contrast, an impartial outsider considers all options equally and freely chooses either B or A.

Second, a tester is biased by his own idea. His faith in the idea papers over flaws in the implementation, especially if he executes the idea himself. In contrast, an outsider just evaluates the implementation.

Finally, a Crowd by definition has greater diversity of opinions, gut reactions, etc. than an individual. So even if a Crowd mix is not representative of site visitors, a Crowd is likely to be MORE representative than just the individual.

Quality of the crowd and the data

Is it a Crowd of independent opinions?

The individuals in the crowd need to be independent. One of the flaws of "focus group" research is that individuals within the group influence the responses of others. This makes the Crowd less intelligent. The power of an internet poll is that we average over many independent opinions.

How qualified is the Crowd?

The composition of the crowd is also important. Sometimes you want average people, representative of your audience, to give you their simple preference (the classic "Which product would you buy?" question in market research). Other times you want experienced designers to give you their guess based on expertise (though experts can sometimes be bad at predicting). For example, on all our projects at Goodui.org we ask multiple client contacts to specify their certainty in each test idea, which gets averaged with our own prediction. These are people who are not designers and who know more about the product and the user base than we do, both very good things that add diversity. At the same time, being subject matter experts, they are qualified to give feedback on marketing copy and so on. In other words, their contribution to the crowd is valuable. We then try to prioritize the ideas with the highest overall score.

How to quantify Crowd opinion?

How you summarize the opinion of the Crowd matters. In the initial example of guessing the weight of an ox, Galton used the median to summarize the crowd’s opinion. A median controls for outliers and can increase data accuracy. For example, there might be unqualified people in the Crowd who are way off on their guess (I would have absolutely no idea, for example, how much an ox weighs). An average would be skewed, while a median would not be.
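A tiny illustration with made-up guesses (not Galton's data) shows how one clueless guess skews the mean but barely moves the median:

```ts
// Median vs. mean of crowd guesses: one wildly-off guess skews the mean
// but barely moves the median.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

const guesses = [540, 560, 575, 590, 610, 5000]; // one clueless guess (kg)
console.log(mean(guesses));   // 1312.5, badly skewed
console.log(median(guesses)); // 582.5, still sensible
```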

How the question is asked and the format of the data itself also matters. For example, I recall that asking people “Who do you think will win the election?” predicted election results better than “Who will you vote for?”

In the case of Behave.org's Test of the Week, if they had asked visitors to quantify their confidence not as a binary choice but on a 0 to 10 scale (0 = A, 10 = B), respondents' uncertainty might have weakened the overall prediction. The binary choice of A or B effectively controls for that uncertainty: people are forced to give the same weight to "it's slightly better" as to "it's much better", and they can't retreat to the middle of a scale (the "error of central tendency"). A simple binary choice might therefore make the prediction more accurate.

For example, at GoodUI.org with @jlinowski, we recently moved away from a simple 0 – 10 subjective certainty rating to a -3 to +3 scale. Since we tend to include test ideas that others are likely to also consider worthwhile, this effectively makes it a 0 – 3 scale. Simpler means stronger predictions. However, we also moved to a more complex formula that incorporates experimental evidence and research. This increased complexity risks decreasing the predictive power, since much more complex decision making is involved in figuring out if a variation is likely to win.

Can a Crowd fail?

Yes! More people doesn't necessarily mean better predictions. The sampling (your choice of who's in the Crowd) has to be valid and appropriate. In the classic example of the 1936 Literary Digest poll, a very expensive poll with a large sample size failed to make an accurate political prediction, because the sample was biased. Quality trumps quantity.

How good is the Crowd at predicting effect size?

For reasons of complexity, I don't believe a Crowd can be relied on to estimate the degree to which a variation might beat the baseline. The possible % values are unbounded.

The next factor is experience. In the ox example, most people know what an ox is and have lots of experience with other heavy objects. But every website, every audience, and every implementation of an idea is unique, and experience does not necessarily transfer. Of course, if you've tested something exactly like it on a dozen very similar sites and got a similar result each time, then you would have experience. I'd be skeptical that anyone can make such a claim, however. CRO experts rely on a lot of intuition and personal judgement to fill those gaps in experience.

Finally, there's the type of thing we are measuring. Weight is a simple, additive property. A rock that appears to be 2X the size of another probably weighs about 2X as much. An ox about the size of 5 people probably weighs about as much as 5 people. Online experiments are not like this. If I am testing an idea, it's often not decomposable in such a way. And even if we did decompose it, there is no guarantee the effects of the parts add up.

In other words, guessing the outcome is very very hard and should not be done subjectively. We should do our best to find similar past tests, and use those to make a starting prediction. We can then collect some data in order to fine-tune the prediction. We can then test our prediction with more data.

Crowd-sourcing design ideas

How do we figure out what to test in the first place or find something better than B? After several unsuccessful tests a while back, we decided to crowd-source improvement ideas from GoodUI blog visitors. The result was the highest response rate of all posts up till then with lots of ideas that led to new variations.

We could further improve this by asking visitors to make predictions about others' ideas before adding their own. That way we would crowd-source a list of ideas as well as predictions for each idea. As we have seen, predictions are stronger when a Crowd weighs in on a single idea than when each individual puts forward and rates his own.

Lessons learned

If I want to improve my success ratio, I need to behave like the crowd:

  1. Separate the idea from the implementation. I recently had a great idea, but my best implementation of it was weak. It took 6 visual iterations over several weeks to ensure both the idea and the implementation were sound. Feedback from 2-3 people (who were not involved in generating the idea) was critical in improving it.
  2. Do a pre-mortem. Imagine that B has already lost, and try to figure out why that happened. This forces you to consider that B might not be better and needs improvement.
  3. Seek first impressions from outsiders, both qualitative (ideas) and quantitative (predictions about ideas).

What’s in the future?

I envision a crowd-prediction service (similar to remote User Testing services that have become popular). You could pay to have 100 pre-qualified people make predictions about your test.

Happy testing.

When and why to peek at A/B tests

Every other day or so you should peek at how your tests are doing. Here are some guidelines on doing that without skewing your data:

hidingface

Technical problems

The main reason to peek frequently is technical problems. You should QA your site before you launch, but you should QA again a couple of days in and later on. You may have missed some bugs, and repeat QA will catch more of them. Other site changes may get introduced that break your test in some way. And sometimes a transient bug doesn't show up in the data until more data is collected.

Averting losses

If you set it and forget it, you may leave a losing test running too long. If you're after long-term gains and want to avoid losses, you should stop a losing test at some point. However, stop only when you can be reasonably sure you're doing the right thing. You don't want to stop your test a day in because you're put off by a big initial drop. But if your test has been losing for 2 weeks straight with high statistical confidence, you might not want to let it run another week. If your goal is learning, you might run the test longer to confirm it is a loser, but exposing your site to a poor variation may not be good for your business.

What stats are you using?

Many tools have moved to different statistical methods that allow peeking. VWO uses Bayesian stats, which give you up-to-date confidence and probabilities. Optimizely uses a pseudo-Bayesian sequential testing method to allow you to peek. So depending on the statistical method, you may be "allowed" to peek and make decisions. However, regardless of what your tool says, you still want to let your test run a reasonable amount of time. So estimate your duration upfront anyway, so you have a point of reference.

No significance, no problem

If you're using traditional stats (which work fine and are much easier to understand), the basic idea is to avoid calculating statistical significance (the p-value) and then making interim decisions based on it. A p-value is meant to be calculated after your test is done. However, you can check significance and adjust your duration estimate once or twice during the test (e.g., once you get a ballpark effect size). That's not going to skew your p-value much. If you commit to a sample size upfront, then you can check significance all you want; it gives you a basic sense of whether the effect is strong.
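If you want to run that significance check yourself, here's a minimal two-proportion z-test sketch; it's the textbook formula, not any particular tool's implementation, and the example numbers are made up:

```ts
// Two-proportion z-test: two-sided p-value for the difference between A and B.
function twoProportionPValue(convA: number, nA: number, convB: number, nB: number): number {
  const pA = convA / nA;
  const pB = convB / nB;
  const pPool = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / nA + 1 / nB));
  const z = (pB - pA) / se;
  return 2 * (1 - normalCdf(Math.abs(z)));
}

// Standard normal CDF via the Abramowitz-Stegun approximation.
function normalCdf(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp((-z * z) / 2);
  const tail = d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z > 0 ? 1 - tail : tail;
}

// Example: 5.0% vs 5.6% conversion on 5,000 visitors each.
console.log(twoProportionPValue(250, 5000, 280, 5000)); // ~0.18, not significant
```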

Safe things to peek at

It depends on what you peek at. If you're not calculating significance but are still making decisions based on how the test is doing so far, you're still skewing your final analysis. However, you can peek at how your overall test is doing, without drilling into how each variation is doing. For example, if you see that overall traffic to the test or the total conversion rate is lower than expected, you can recalculate your duration estimate without any problems. Looking at the test overall doesn't tell you how the variations are doing relative to each other (unfortunately, A/B testing tools have never offered this option). You definitely want to readjust your duration estimate once you have more data. You just don't want to keep doing it based on how each variation is doing; that fluctuates, and if you keep adjusting your duration to current performance, you're letting chance lead you instead of subduing the effect of chance with time.

Have someone else peek

I don’t do this, but a good idea I’ve come across (not sure how practical) is to have one person peek for technical issues and a different person to ultimately do the final analysis and make the decision.

Objectivize

The biggest danger with peeking is that your brain will look for patterns, and once you see them you can't unsee them. This can lead to disappointment when a green turns red. Or worse, it can lead you to hack your test to get the results you want. Try to be objective about it. Once the test is running, you're after the truth, not a win.

Visual patterns for A/B test structure

Once you figure out what you want to test, you need to define what you’re going to measure and where. In this post, I will introduce my preferred terms for describing test structure (things like test conditions, goals, and pages), and I’ll use a visual language to cover the basic patterns. Here’s an example:

Example test
A simple test showing Gate (Start), Path, and two Goals (the big black circle is primary).

Gates and Goals

GATE: circle that represents all conditions for entry into the test, including test URL and traffic segmentation (Example: Home page mobile traffic).

GOAL: all success conditions, including confirmation page URL and business rules (Example: Thank you page visit after purchase of premium package).

TIP: If you have a sizable mobile segment, you’ll want to track mobile, tablet, and desktop traffic separately. If your tool doesn’t allow you to segment after the fact, set up 3 separate tests with mutually exclusive gates. Other gates you should distinguish are: existing users vs. new users, ad traffic vs. direct traffic, and so on, in case each segment performs differently. Keep sample size in mind, because segmentation reduces sample size and increases false positives and false negatives.

Primary vs. Secondary Goals

Page visits are generally more reliable than clicks, so they are the preferred primary metric. Clicks on links or form submits are often secondary metrics. There are many other customized types of metrics based on user behavior and business rules.

Primary vs. secondary goals
Large circle is a primary goal. Small circles are secondary.

TIP: Whenever possible, track both the start and end of an interaction e.g., track clicks on a link and the visit to the destination page.

TIP: Tracking how many people start completing form fields is a good measure of intention e.g., track keydown or change events on key form fields. It can also highlight anomalies in other goals.  Track attention (user scrolled to and stopped at the element being tested) as a secondary metric via scroll tracking and setTimeout().
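Here's a sketch of both goals; trackGoal(), the #email field, and the #hero-offer element are stand-ins for whatever your testing tool and page actually provide:

```ts
// trackGoal() is a placeholder for your tool's custom-conversion API.
declare function trackGoal(name: string): void;

// Form-start intent: fire once on the first keydown or change in a key field.
const emailField = document.querySelector<HTMLInputElement>("#email");
let formStarted = false;
function onFormStart(): void {
  if (formStarted) return;
  formStarted = true;
  trackGoal("form_start");
}
emailField?.addEventListener("keydown", onFormStart);
emailField?.addEventListener("change", onFormStart);

// Attention: the tested element was scrolled into view and stayed there ~2s.
const tested = document.querySelector<HTMLElement>("#hero-offer");
let attentionFired = false;
let attentionTimer: number | undefined;
window.addEventListener("scroll", () => {
  if (!tested || attentionFired) return;
  const rect = tested.getBoundingClientRect();
  const inView = rect.top < window.innerHeight && rect.bottom > 0;
  window.clearTimeout(attentionTimer);
  if (inView) {
    attentionTimer = window.setTimeout(() => {
      attentionFired = true;
      trackGoal("attention_hero_offer");
    }, 2000);
  }
});
```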

Goal Depth

Your primary goal might be directly on the page you’re testing or further down the funnel.

abvocab_depth

A direct goal happens on the test page and is your ideal end goal (e.g., an AJAX payment event).

A shallow goal is a relative term for a goal at or near your test gate that’s not ideal. For example, visits to the checkout page is a shallow goal relative to a primary goal of completing the purchase.

A deep goal lies further away from the gate and is usually your primary goal. However, you might track other deep goals that are not primary (e.g., post-purchase downloads, dashboard engagement). Changes to deeper goals are harder to detect using statistics, because counts are lower.

TIP: If your primary metric won’t produce enough data in the time you have, then choose the next best metric.

Conditional Goals

Behavioral and business conditions can be added to goals. For example, fire a conversion when a timer expires, a scroll position is reached, several steps are completed, or a user successfully logs out and returns again. You can map goals to any user behavior.

Conditional goals

TIP: Additional logic requires additional code, increasing risk of technical and logical errors. Be careful about making your primary goal very complex. Moreover, elaborate goals that are harder to achieve will have a lower conversion rate and will be harder to track. However, they may be more informative – track them but have a fall-back.

TIP: You can set up goals to detect errors on complex screens with lots of dynamic components. For example, part way through the test you might find that clicks on your new button are low. So you might set up a goal to check that the button actually exists on the page for all visitors. In one case, we wanted to check if any visitors to a split-URL test were changing the URL to enter a different variation.
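A sketch of such a sanity goal might look like this, again with trackGoal() and the element selector as placeholders:

```ts
// Sanity goal: confirm the element under test actually rendered for this
// visitor; fire an error goal if it did not.
declare function trackGoal(name: string): void;

window.setTimeout(() => {
  const newButton = document.querySelector("#buy-now-v2");
  trackGoal(newButton ? "button_rendered" : "button_missing");
}, 3000); // give dynamic components a few seconds to render
```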

Single Page vs. Template

Your test gate is typically a single page. The gate can also be multiple product pages that use the same template, multiple pages that are completely different, or even the entire site. For example, if you're testing a change to your navigation or a sidebar, you'll want to modify all pages with that element, for consistency. Advantages: potentially much higher traffic to your test, and you test the change in a broader context. Disadvantage: you may obscure different performance on each page due to different traffic sources, different previous pages seen, etc.

abvocab_pages

TIP: If you can, track whether your multi-page A and B samples contain similar ratios of visitors to each page. For example, if your test is running site-wide, you might want to know that your A and B samples contain roughly the same % of people who came from the home page, pricing page, blog, etc. What if, for instance, product A gets mostly traffic from your internal search, while product B was mentioned in a blog and gets lots of referral traffic?

TIP: You can run separate tests to isolate the data set for each page as long as there is no visitor overlap.

Sequences and Funnels

abvocab_funnel
Navigation within funnels. Straight lines are direct links, while wiggly lines suggest intervening pages e.g., Home > Pricing > Checkout as well as Home > Checkout.

If you have a unique link to a page, you control the traffic to that page. Other times, you might have multiple links, which means multiple entry points into a page. You might also have steps that can be bypassed. The wavy line means that the page transition is flexible or unknown (e.g., Home page to Pricing page), while a straight line means it’s a direct relationship (e.g., Payment Step 2 to Step 3).

TIP: Verify your paths, don't assume them. Are you getting lots of direct visits from Google into the middle step? Can visitors bypass your link to step 3?

TIP: Set up tiered metrics, so you can track the user’s progress at each step of the funnel.

abvocab_sequences

If your traffic is too low to detect a change in your end goal, you should make a shallower goal primary. It’s also useful to track whether users step outside the main funnel. For example, if you find that a losing variation is increasing traffic to the pricing page, you might have a hypothesis to explain the loss.

User behaviors other than what you want are also good to track (Distractions).

TIP: Track visits to all main pages, like pricing, about us, blog, etc. Most patterns will be meaningless, but sometimes they are informative. For example, a huge increase in visits to the Pricing page might suggest something you said made people think of cost, which may or may not be a good thing.

Visual Scope

You can test just one page (Page Test) or you can test an entire funnel as a sequence of pages (Funnel Test) – visitors see version A of the complete funnel or version B of the complete funnel.

Visual scope
The grey container represents scope of visual changes.

On rare occasions, you might start your test on a page other than the one you're testing. For example, a page may not be accessible directly or may depend on an interaction with the preceding page, so you have to start your test a step earlier (Premature Start). You might also have visual changes on different pages that go together conceptually, such as a discount offer on the home page vs. on a deeper page. In one variation, the user will then enter prematurely, on a page without visual changes.

TIP: Avoid testing related pages in separate simultaneous tests, because you’ll have to account for visitors’ coming from and seeing different versions of each page.

TIP: When running multiple tests simultaneously on the same site, use cookies to ensure visitors can only join 1 test at a time. If you do risk running overlapping tests, add metrics to each test to track which variation of each test the users saw. That way you can at least check that the assignment is roughly equal, and even split users into non-overlapping segments.
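Here's one minimal way to implement that mutual exclusion with a first-party cookie; the cookie name and test IDs are illustrative, and most tools also offer their own targeting conditions for this:

```ts
// Keep visitors in at most one test at a time using a first-party cookie.
// Each test calls claimTest() before activating its variations.
function getCookie(name: string): string | undefined {
  const match = document.cookie.match(new RegExp(`(?:^|; )${name}=([^;]*)`));
  return match ? decodeURIComponent(match[1]) : undefined;
}

function claimTest(testId: string): boolean {
  const current = getCookie("active_test");
  if (current && current !== testId) return false; // already in another test
  document.cookie = `active_test=${encodeURIComponent(testId)}; path=/; max-age=${60 * 60 * 24 * 30}`;
  return true;
}

// Usage: only activate the pricing test if the visitor isn't in another one.
if (claimTest("pricing-page-test")) {
  // activateVariation(...) – tool-specific
}
```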

Data Collection

Tests can be run by injecting CSS and JavaScript into an existing page or creating a separate page URL for each variation.

abvocab_types

A tracking or blank test simply collects data about your site. You might run such a test to QA your metrics or estimate your traffic and conversion rates to enable you to do a power analysis.

An A/A test is less a way to test the tool than a way to understand the properties of the traffic e.g., are all users relatively similar, producing less variation in the conversion rate over time?

The visual changes themselves can be applied either dynamically, by injecting CSS and JavaScript into the existing page, or via a split URL test, with a separate page URL for each variation.

TIP: Dynamic A/B tests can be faster to deploy, but they can have a flickering problem or take slightly longer to load. Split URL tests don't have these problems, but they require you to create a duplicate page, which is not always possible. They also require separate URLs for each variation, so you then need a process to reuse or expire the old URL variants. The redirection itself and the difference in URL might be noticed by some users.

TIP: A hybrid approach is redirecting to a URL parameter on the main URL. Changes are then applied on the back end or front end based on the parameter e.g., example.com/?v=a vs example.com/?v=b.
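The front-end half of that hybrid could be as simple as reading the parameter and toggling a class; the parameter and class names here are just examples:

```ts
// Read the ?v= parameter added by the redirect and apply the matching changes.
const variant = new URLSearchParams(window.location.search).get("v") ?? "a";

if (variant === "b") {
  // Let CSS scoped to .variant-b do the visual work.
  document.body.classList.add("variant-b");
}

// Record which variant was actually served, e.g. for your analytics tool.
console.log(`Serving variant ${variant}`);
```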

Single or Multiple Variables

Test can involve a single change or multiple changes, as in a full redesign. Micro tests allow you to connect the observed effect to a specific visual change – this is the most satisfying type of test. A macro test can allow you to test a coordinated set of changes, but you won’t know the effect of each individual change. This can put you into a tough spot if the test loses – do you scrap the whole thing or try to retest specific elements of it?

A multivariate test is used to test multiple changes by essentially running a separate test for each combination of variables. The risk with this type of test is lower power for each sub-sample and more false positives. This is not the same as running multiple tests simultaneously on the same page, since in that case you won't know which version of which test each user saw.

abvocab_variables

In closing

Use your awareness of page and goal patterns to collect richer data and avoid common mistakes.

Simulations are faster and more intuitive than calculations

I use simulations all the time to help answer questions like: Is this outcome possible? What outcomes are most likely? How much data is enough?

Simulations can give an answer faster than detailed calculations. They are less precise but far more intuitive. If you run a simulation 10 times and get a certain outcome even once, you know it’s possible. If you get it a few times, you know it’s quite likely. If you want more confidence, just rerun the simulation 10 or 100 or 1000 more times.

What if?

In the previous post, I included an under-powered simulation in Excel, where we ended up with a 23.7% drop instead of the true +10% lift. Using that template, you can set up a simulation in seconds.
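If Excel isn't handy, the same kind of simulation takes a few lines of code. This sketch assumes a 5% baseline, a true +10% lift, and 1,000 visitors per variation; rerun it a few times and watch how far the observed lift strays from +10%:

```ts
// Monte Carlo version of an under-powered A/B test: the observed lift swings
// widely from run to run, sometimes even showing a drop.
function simulateConversions(n: number, rate: number): number {
  let conversions = 0;
  for (let i = 0; i < n; i++) if (Math.random() < rate) conversions++;
  return conversions;
}

function observedLift(n: number, baseRate: number, trueLift: number): number {
  const a = simulateConversions(n, baseRate) / n;
  const b = simulateConversions(n, baseRate * (1 + trueLift)) / n;
  return b / a - 1;
}

const runs = Array.from({ length: 10 }, () => observedLift(1000, 0.05, 0.10));
console.log(runs.map((x) => `${(x * 100).toFixed(1)}%`).join(", "));
// Typical output ranges roughly from -25% to +50% despite the true +10% lift.
```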

Continue reading “Simulations are faster and more intuitive than calculations”

How to correctly define a business hypothesis

A hypothesis is an explanation of why something is the way it is.

Example business hypothesis:

“We are a new company, and visitors have doubts about the quality of our product.”

Do we really create hypotheses to test them?

To see if my example hypothesis is true, it would be best to talk to some potential customers. A/B testing is not really about testing business hypotheses but about using them to iterate a design. An A/B tester is not a scientist. He takes the hypothesis as inspiration for new visual treatments in order to increase his chances of raising revenue.

Continue reading “How to correctly define a business hypothesis”

I Have An A/B Test Winner, So Why Can’t I See The Lift?

In the town of Perfectville, a company ran a winning A/B test with a 20% lift. A few weeks after implementing the winner, they checked their daily conversions data:

 

graph-perfectville

97 days of daily conversion rates in Perfectville showing 20% lift

 

The graph relates perfectly what happened: the baseline increased by 10% during the test, while half the traffic was exposed to the winning variation. Then came the week when the test was stopped, followed by a 20% lift once the winner was implemented.

The good people in nearby Realville heard about this and ran the test on their site. When they later checked their daily conversions data, they scratched their heads (as they often do in Realville):

 

graph-realville

97 days of daily conversion rates in Realville showing same improvement

 

The data actually includes the same 10% lift during the test, a gap, and a final 20% improvement. The problem is that the improvement sits on top of natural fluctuations in daily conversion rates, so a 20% improvement doesn't show up as a clean, visible 20% lift.

Here are 6 reasons why people in Realville might find it difficult to see a lift and what they can do about it.

Reason 1: The effect is too small

The smaller the lift, the harder it is to see through the noise. If the conversion rate drops for some reason unrelated to the test, the lift from your winner might not even offset that drop. For example, here's 1 week of simulated daily conversion rates followed by a week with a 20% lift, compared against a 5% lift. If the lift were 5%, it would look as though the test actually did worse in the second half:

 

graph-reason1-smalllift

7 days at baseline followed by 7 days with 5% vs. 20% lift

 

Have you just run a test and are you looking at during-test data? You likely won't see any effect. Typically only 70-80% of visitors join the test (more on this below), and these are split among your variations. If 80% of your traffic participated in an A/B/C test, only a third of that is exposed to the winning variation. So a 20% lift would manifest as roughly 5% overall, as the quick calculation below shows.
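The dilution math from that paragraph, spelled out:

```ts
// 80% of traffic joins an A/B/C test, so only ~27% of all visitors see the
// winner, and a true 20% lift shows up as ~5% in the overall numbers.
const participation = 0.8;
const variations = 3;
const trueLift = 0.2;

const exposedShare = participation / variations;           // ~0.267
const overallLift = exposedShare * trueLift;                // ~0.053
console.log(`${(overallLift * 100).toFixed(1)}% overall`);  // "5.3% overall"
```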

What you can do:

  • Look for a larger cumulative upward trend after several tests.
  • Compare longer timescales for baseline and post-implementation data.

 

Reason 2: Your baseline is too variable or you’re not looking at enough data

In Perfectville, conversions are constant each day, each week, each month. This means a 20% improvement causes a 20% lift. Not so in Realville. In Realville, daily conversions naturally fluctuate, so the full potential for improvement may not manifest. The more your conversions fluctuate, the harder it is to see the lift in the data.

Here are two similar simulated data sets with low and high variability, both showing a 20% lift. The lift is more obvious when variability is lower:

 

graph-reason2-lessvariable

graph-reason2-morevariable

A similar 20% lift with low and high variability

 

Sales may fluctuate for a lot of reasons (weekly, seasonally, in response to your marketing activities, unexpected traffic). The smaller your sample, the higher the chance that the pattern you’re looking for just won’t be there by chance. For example, if you just saw the middle segment of the full graph below, you’d never know that the right, orange side of the graph shows a 20% improvement:

 

14 days of simulated daily conversion rates (blue), then 14 days with a 20% improvement (orange)

 

What you can do:

  • Zoom out to reduce variability. If data is too variable daily, look at semi-daily rate or weekly rate
  • Look at more data to cover the full cycle of ups and downs e.g., a week (note that the lower your conversion rate, the more data you need to see an effect)
  • Check your site analytics to see what might have been different that week. Check if dips have happened before. Might one have coincided with the test?
  • If the data has a lot of variation, it is hard to estimate visually. Compare what I’ll call the “clipping rates”. In this graph, you see higher peaks as well as a higher frequency of peaks in the second half:

 

graph-peakingrate

20% lift manifests in more frequent and higher peaks

 

Reason 3: Not everyone was part of your test

Even if you didn’t put exclusion conditions on the test, some visitors were excluded.

For example, mobile visitors are excluded by default. Another 10-20% of visitors normally get excluded when the A/B testing tool times out. Technical implementation issues can exclude another 10-20% of visitors: JavaScript-heavy sites, tracking code not implemented in the right place, and so on.

Moreover, gaps in test design can create a discrepancy between test and sales data. For example, we ran a test on the home page of a basic single-product site and noticed that our test data was missing many sales. After investigating, it turned out that about 50% of purchases came from people who never visited the home page, as well as from existing customers arriving via a special upgrade page we hadn't considered.

As a result of these exclusions, when you implement your winner, you may be exposing it to segments you didn’t test it on. For example, although you tested on desktop and saw a 20% lift, the same design on mobile might cause a 30% drop. So, if you made the winner your new home page for all traffic, the drop in mobile could counteract some of the lift (say, if you had lots of mobile traffic).

What you can do:

  • Factor in 20% exclusions due to technical issues, like timeouts
  • Set up an inverse test to see how many sales are by-passing your main test (target pages and visitors who are excluded from your main test)
  • When looking at sales or conversion data, keep in mind it probably includes segments you didn’t test on. Test the design on all segments that will be exposed to it e.g., new customers and existing customers. For mobile, build and test a dedicated mobile version

 

Reason 4: You are eyeing it instead of using math

Sometimes a lift is obvious. Other times you need to use math. Here's a sample of real conversion data with about 20 days of baseline followed by 20 days of the improved version:

 

graph-reallift

Just over a month of real conversion data with winner on the right

 

The lift is not visually obvious. Nonetheless, the average for the first 20 days is 0.71%, whereas the average for the last 20 days is 0.85%, which is a 20% lift. However, if the standard deviation of the data is high, the difference in averages may be coincidental.
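Here's the kind of quick math I mean, with placeholder numbers rather than the real data set:

```ts
// Compare two periods of daily conversion rates numerically instead of by eye.
// The arrays below are illustrative placeholders.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1));
}

const before = [0.62, 0.78, 0.7, 0.65, 0.74, 0.69, 0.8]; // daily rates in %
const after = [0.79, 0.9, 0.82, 0.88, 0.8, 0.95, 0.86];

const lift = mean(after) / mean(before) - 1;
console.log(`lift of roughly ${(lift * 100).toFixed(0)}%`);            // size of the change
console.log(`day-to-day spread of about ${stdDev(before).toFixed(2)}pp`); // is the lift bigger than the noise?
```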

 

Reason 5: Your design or conditions are not the same

This happens all the time. You run a winning test, then you tweak the winner before pushing it to your site. It’s entirely possible that those visual tweaks reduced the effectiveness of your variation.

It's also possible that conditions are different now from when you ran the test. Did you test during the holidays, or launch during the holidays?

Are you including a different page? You might have several pages that look similar, so you tested something on one page and then applied it in one go to all of them. There is no guarantee that the same concept will work equally well on the other pages.

What you can do:

  • Check your site analytics to see what conditions might be different now and retest if necessary
  • If you know you will be changing something, apply changes to the variation and test it with the changes
  • You should implement the winner as the new control and then test the new changes
  • Retest on each site if you have reason to believe the outcome may be different

 

 Reason 6: It was a false positive

Yes, it happens all the time. There are many reasons you might have gotten a false positive, including improper test design and not running your test long enough. The most common scenario is running your test until you see a winner and then stopping. I've seen results that looked very exciting flatten out after 3-4 weeks.

What you can do:

  • Follow the great tips on http://goodui.org/betterdata to ensure you get good data

 

Back To Realville

Let's say Realville decided to retest the Perfectville winner 8 more times (it took years!). They found that, indeed, the overall tendency of the variation was toward an increase, following the same pattern as Perfectville's test: a small lift during the test, a slight dip when the test was stopped, and then a larger lift after the final launch. However, despite the overall trend, individual outcomes showed that chance is a factor in this imaginary scenario:

Let me know if you apply some of these concepts and find them useful.