Crowd-sourcing A/B test predictions

The collective guess of a crowd can be more accurate than that of an individual. For example, over a hundred years ago, the statistician Francis Galton noticed that a crowd at a country fair could guess the weight of an ox with over 99% accuracy. In a more complex domain like politics, individual expert predictions are notoriously poor, yet the average of their guesses is better.

How accurate is a crowd when it comes to good UI design?


How good is the Crowd at picking a better design?

Behave.org has a good repository of people’s guesses about competing designs. Various contributors submit their A/B tests to the “Test of the Week” section, which asks users to guess the winner. Here, crowds get to weigh in on a single idea, one that is not their own. Moreover, these tests are curated to be interesting, which implies they are carried out by more experienced people with more robust hypotheses.

When we look at the last 106 tests on Behave.org, we see that:

In 53% of the tests, the variation beat the baseline. But when the Crowd guessed the outcome of these tests, it guessed right 72% of the time (a 36% relative improvement). Interestingly, the Crowd still chose B about 50% of the time. It was just better at NOT choosing B when B was not an improvement.
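
As a quick sanity check on the arithmetic above, here is a tiny Python sketch using only the figures already quoted:

    baseline_win_rate = 0.53   # share of tests where the variation actually won
    crowd_accuracy = 0.72      # share of tests the Crowd called correctly

    relative_improvement = crowd_accuracy / baseline_win_rate - 1
    print(f"Crowd vs. baseline: {relative_improvement:.0%} better")  # ~36% better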

FYI, 53% does not represent the average success rate of test contributors, since this is not a random sample. We just use it as an arbitrary baseline to see whether the Crowd makes the same guesses or better ones.

Why does the Crowd do better?

First, a test designer chooses B 100% of the time, since B is by definition the proposed improvement. In contrast, an impartial outsider considers all options equally and freely chooses either B or A.

Second, a tester is biased by his own idea. His faith in the idea papers over flaws in the implementation, especially if he executes the idea himself. In contrast, an outsider just evaluates the implementation.

Finally, a Crowd by definition has greater diversity of opinions, gut reactions, etc. than an individual. So even if a Crowd mix is not representative of site visitors, a Crowd is likely to be MORE representative than just the individual.

Quality of the crowd and the data

Is it a Crowd of independent opinions?

The individuals in the crowd need to be independent. One of the flaws of “focus group” research is that individuals within the group influence the responses of others. This makes the Crowd less intelligent. The power of an internet poll is that we average over many independent opinions.
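
Here is a tiny, made-up simulation of why averaging independent opinions works: each guess is the true value plus independent noise, and the crowd average lands far closer to the truth than a typical individual. The numbers are invented for illustration, not taken from Galton’s data:

    import random
    from statistics import mean

    random.seed(1)
    true_weight = 1100  # assumed "true" ox weight in lbs
    guesses = [true_weight + random.gauss(0, 150) for _ in range(800)]

    typical_individual_error = mean(abs(g - true_weight) for g in guesses)
    crowd_error = abs(mean(guesses) - true_weight)
    print(f"typical individual is off by ~{typical_individual_error:.0f} lbs")
    print(f"the crowd average is off by ~{crowd_error:.0f} lbs")

If the guesses were correlated, as in a focus group, the averaging would buy far less.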

How qualified is the Crowd?

The composition of the crowd is also important. Sometimes you want average people, representative of your audience, to give you their simple preference (the classic “Which product would you buy?” question in market research). Other times you want experienced designers to give you their guess based on their expertise (though sometimes experts can be bad at predicting). For example, on all our projects at GoodUI.org we ask multiple client contacts to specify their certainty in each test idea, which gets averaged with our own prediction. These are people who are not designers and who know more about the product and the user base than we do – both very good things that add diversity. At the same time, being subject matter experts, they are qualified to give feedback on marketing copy and so on. In other words, their contribution to the crowd is valuable. We then try to prioritize the ideas that have the highest overall score.
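
As a rough illustration of that prioritization, here is a hypothetical sketch – the idea names and ratings are invented, and this is not GoodUI’s actual process, just the general shape of averaging several people’s certainty per idea and ranking by it:

    from statistics import mean

    # Hypothetical certainty ratings (0-10) from several client contacts plus us
    ideas = {
        "shorter signup form": [7, 8, 6, 9],
        "testimonial near the call to action": [5, 6, 7, 5],
        "sticky pricing table": [4, 3, 6, 4],
    }

    # Prioritize the ideas with the highest averaged score
    for idea, ratings in sorted(ideas.items(), key=lambda kv: mean(kv[1]), reverse=True):
        print(f"{mean(ratings):.1f}  {idea}")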

How to quantify Crowd opinion?

How you summarize the opinion of the Crowd matters. In the initial example of guessing the weight of an ox, Galton used the median to summarize the crowd’s opinion. A median controls for outliers and can increase data accuracy. For example, there might be unqualified people in the Crowd who are way off on their guess (I would have absolutely no idea, for example, how much an ox weighs). An average would be skewed, while a median would not be.
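
A quick illustration with invented guesses shows how the median shrugs off a wild outlier while the average gets dragged along:

    from statistics import mean, median

    # Invented guesses of an ox's weight; the last one is hopelessly off
    guesses = [1050, 1120, 1080, 1150, 1090, 1110, 4000]

    print(f"mean:   {mean(guesses):.0f} lbs")    # pulled up by the outlier
    print(f"median: {median(guesses):.0f} lbs")  # barely moves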

How the question is asked and the format of the data itself also matters. For example, I recall that asking people “Who do you think will win the election?” predicted election results better than “Who will you vote for?”

In the case of Behave.org’s Test of the Week, if visitors were asked to quantify their confidence not as a binary choice but on a 0 to 10 scale (0 = A, 10 = B), their uncertainty might have weakened the overall prediction. The binary choice of A or B effectively controls for respondents’ uncertainty, since they are forced to give the same weight to “it’s slightly better” as to “it’s much better”. A simple binary choice might therefore make the prediction more accurate (it also sidesteps what’s known as the “error of central tendency”, the tendency of raters to cluster toward the middle of a scale).

For example, at GoodUI.org with @jlinowski, we recently moved away from a simple 0 – 10 subjective certainty rating to a -3 to +3 scale. Since we tend to include test ideas that others are likely to also consider worthwhile, this effectively makes it a 0 – 3 scale. Simpler means stronger predictions. However, we also moved to a more complex formula that incorporates experimental evidence and research. This increased complexity risks decreasing the predictive power, since much more complex decision making is involved in figuring out if a variation is likely to win.
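
A minimal sketch of what such a combined score could look like – the weights, parameter names and the function itself are my own assumptions for illustration, not the actual GoodUI formula:

    def idea_score(certainty, evidence, w_certainty=0.5, w_evidence=0.5):
        """certainty: subjective -3..+3 rating; evidence: -3..+3 score from past tests and research."""
        return w_certainty * certainty + w_evidence * evidence

    print(idea_score(certainty=2, evidence=3))   # gut feeling and evidence agree
    print(idea_score(certainty=2, evidence=-1))  # gut says yes, evidence says no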

Can a Crowd fail?

Yes! More people doesn’t necessarily mean better predictions. The sampling (your choice of who’s in the Crowd) has to be valid and appropriate. In the classic example of the 1936 Literary Digest poll, a very expensive poll with a huge sample failed to make an accurate political prediction because the sample was biased. So quality trumps quantity.
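
A toy simulation (all numbers invented) makes the point: a huge poll drawn from an unrepresentative slice of the population misses badly, while a small representative one does fine:

    import random

    random.seed(2)
    # Suppose 30% of voters support the incumbent at a 65% rate and the other 70%
    # at a 40% rate, so true support is 0.3*0.65 + 0.7*0.40 = 47.5%.
    true_support = 0.3 * 0.65 + 0.7 * 0.40

    biased_poll = [random.random() < 0.65 for _ in range(200_000)]      # only reaches the first group
    fair_poll = [random.random() < true_support for _ in range(1_000)]  # small but representative

    print(f"true support: {true_support:.1%}")
    print(f"huge but biased poll: {sum(biased_poll) / len(biased_poll):.1%}")
    print(f"small but representative poll: {sum(fair_poll) / len(fair_poll):.1%}")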

How good is the Crowd at predicting effect size?

For reasons of complexity, I don’t believe a Crowd can be relied on to estimate the degree to which a variation might beat the baseline. The possible % values are unbounded.

The next factor is experience. In the ox example, most people know what an ox is and have a lot of experience with other objects. But every website, every site audience, and every implementation of an idea is unique and not necessarily transferable. Of course, if you’ve tested something exactly like your current idea on a dozen very similar sites and got a similar result each time, then you would have experience. I’d be skeptical of anyone making such a claim, however. CRO experts rely on a lot of intuition and personal judgement to fill those gaps in experience.

Finally, there is the type of thing we are measuring. Weight is a simple, additive property. A rock that appears to be 2X the size of another probably weighs 2X as much. An ox about the size of 5 people probably weighs about as much as 5 people. Online experiments are not like this. If I am testing an idea, it’s often not decomposable in such a way. And even if we were to decompose it, there is no guarantee the effects of the parts add up this way.
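
A small illustration with made-up numbers: two changes that each lift conversion on their own don’t necessarily deliver the sum of their lifts when combined:

    baseline_cr = 0.040   # baseline conversion rate
    lift_a = 0.10         # +10% measured for change A alone
    lift_b = 0.15         # +15% measured for change B alone

    naive_combined = baseline_cr * (1 + lift_a + lift_b)   # assumes the effects simply add
    # In reality the changes may overlap or interact (e.g. both fix the same
    # underlying problem), so the combined lift could be much smaller:
    plausible_combined = baseline_cr * (1 + 0.18)           # hypothetical actual outcome

    print(f"naive additive estimate:  {naive_combined:.2%}")
    print(f"plausible actual outcome: {plausible_combined:.2%}")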

In other words, guessing the outcome is very very hard and should not be done subjectively. We should do our best to find similar past tests, and use those to make a starting prediction. We can then collect some data in order to fine-tune the prediction. We can then test our prediction with more data.
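
One way to make that concrete – purely a sketch of my own, not a prescribed method – is a simple Beta-style update: start from how similar past tests turned out, then fold in interim evidence from the current test:

    # All numbers below are assumptions for illustration.
    past_wins, past_losses = 7, 5                   # similar past tests where B won / lost
    alpha, beta = 1 + past_wins, 1 + past_losses    # Beta prior on "B wins"

    # Crude interim evidence: weekly checkpoints where B led (1) or trailed (0)
    checkpoints = [1, 1, 0, 1]
    alpha += sum(checkpoints)
    beta += len(checkpoints) - sum(checkpoints)

    print(f"updated estimate that B wins: {alpha / (alpha + beta):.0%}")

A real analysis would model both arms’ conversion rates properly; this only shows the “start from similar tests, then update with data” shape of the reasoning.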

Crowd-sourcing design ideas

How do we figure out what to test in the first place, or find something better than B? After several unsuccessful tests a while back, we decided to crowd-source improvement ideas from GoodUI blog visitors. The result was the highest response rate of any post up to that point, with lots of ideas that led to new variations.

We could further improve this by asking visitors to make predictions about others’ ideas before adding their own. That way we would crowd-source a list of ideas as well as crowd-source predictions for each idea. As we have seen, stronger predictions result when a Crowd weighs in on a single idea, instead of each individual putting forward and rating their own idea.

Lessons learned

If I want to improve my success ratio, I need to behave like the crowd:

  1. Separate the idea from the implementation. I recently had a great idea but my best implementation was weak. It took 6 visual iterations over several weeks to ensure both idea and implementation were sound. Feedback from 2-3 people (who were not involved in generating the idea) was critical in improving it.
  2. Do a pre-mortem. Imagine that B has already lost, and try to figure out why that happened. This way you force yourself to consider that B might not be better and needs improvement.
  3. Seek first impressions from outsiders, both qualitative (ideas) and quantitative (predictions about ideas).

What’s in the future?

I envision a crowd-prediction service (similar to remote User Testing services that have become popular). You could pay to have 100 pre-qualified people make predictions about your test.

Happy testing.
