Statistical Dashboard & Internal PM Tool

Recency: <2016
Role: Product designer, web developer
Collaboration: Solo design with extensive feedback from a colleague

Background

The Visual Website Optimizer (VWO) dashboard shows the performance of page variants in real time. Based on my conversations with clients, the original dashboard omitted many key metrics and gave users too little guidance on questions like:

  • How’s the test doing now? Any early indications?
  • When do we have enough data to stop? What are the risks and trade-offs to stopping now?
  • What’s on deck to be tested next?

What I Did

I put together some tools to help solve these problems for me and my clients:

  • A statistical library in JavaScript focused on A/B testing (a sketch of its core check appears below)
  • A Greasemonkey script to add the missing metrics and rules to VWO
  • Email status updates using PHP and VWO’s API
  • A landing page explaining the free tool’s benefits
  • A project management tool to track test ideas
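
As a flavour of the statistical library, here is a simplified sketch of the kind of core check it performed: a two-proportion z-test comparing a variation against the control. The function and parameter names are illustrative, not the actual code.

    // Illustrative sketch of the library's core check; names are hypothetical.
    // Standard normal CDF (Abramowitz & Stegun approximation).
    function normCdf(z) {
      const t = 1 / (1 + 0.2316419 * Math.abs(z));
      const d = 0.3989422804014327 * Math.exp(-z * z / 2);
      const tail = d * t * (0.319381530 + t * (-0.356563782 + t * (1.781477937 +
                   t * (-1.821255978 + t * 1.330274429))));
      return z >= 0 ? 1 - tail : tail;
    }

    // Two-proportion z-test: is the variation's conversion rate different from the control's?
    function compareVariation(controlVisitors, controlConversions,
                              variantVisitors, variantConversions) {
      const p1 = controlConversions / controlVisitors;
      const p2 = variantConversions / variantVisitors;
      const pooled = (controlConversions + variantConversions) /
                     (controlVisitors + variantVisitors);
      const se = Math.sqrt(pooled * (1 - pooled) *
                           (1 / controlVisitors + 1 / variantVisitors));
      const z = (p2 - p1) / se;
      const pValue = 2 * (1 - normCdf(Math.abs(z)));  // two-sided
      return { controlRate: p1, variantRate: p2, z, pValue };
    }

    // Example: 5000 visitors per version, 250 vs 290 conversions.
    console.log(compareVariation(5000, 250, 5000, 290));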

Marketing Focused On Benefits

I created a page to clearly explain the top three problems I was trying to solve:

Enhanced VWO Overview

The original dashboard started with an overview, which showed the relative performance of each version:

The problems were:

  • No indication of the statistical significance of the results
  • Hard to compare bars as performance differences narrowed over time

I enhanced the overview with:

  • Worst case scenario: a vertical line for easy comparison between versions
  • Margin of error: T-shaped bars showing the margin of error on each estimate
  • Statistical confidence: a p-value for the test

Confidence lines at the top of each bar show the uncertainty. I drew a vertical line at the maximum estimate for V1 (the Control). Now it is easy to see that even if V1’s true performance is at its maximum, the lowest estimates for V2 and V3 still outperform it, which is a strong result.

I added a p-value, which is a standard way of measuring the strength of results. Normally p-values shouldn’t be shown like this on a continuously monitored test, but I had specific reasons for doing so here.
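
To make the T-bars and the worst-case line concrete, here is the basic arithmetic behind them, assuming a normal approximation and an illustrative 95% interval; the data is made up.

    // Illustrative arithmetic behind the T-shaped error bars and the worst-case line.
    const Z = 1.96;  // two-sided 95%

    function marginOfError(conversions, visitors, z = Z) {
      const rate = conversions / visitors;
      return z * Math.sqrt(rate * (1 - rate) / visitors);
    }

    const v1 = { conversions: 240, visitors: 4800 };  // Control
    const v2 = { conversions: 340, visitors: 4900 };  // Variation

    const v1Rate = v1.conversions / v1.visitors;
    const v2Rate = v2.conversions / v2.visitors;

    // Top of the Control's error bar = the vertical worst-case reference line.
    const v1Max = v1Rate + marginOfError(v1.conversions, v1.visitors);
    // Bottom of the variation's error bar.
    const v2Min = v2Rate - marginOfError(v2.conversions, v2.visitors);

    // If even the variation's lower bound clears the Control's upper bound,
    // the variation is ahead under the worst-case comparison.
    console.log(v2Min > v1Max ? "V2 beats V1 even in the worst case" : "Not separated yet");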

Enhanced Main Dashboard

The original dashboard looked like this:

The problems were:

  • No indication of current false positive and false negative risk
  • No margin of error for the improvement
  • “Chance to beat” was not always reliable
  • No indication of how much longer to go

After multiple iterations, the dashboard looked like this:

I made a number of improvements here:

  1. Labeling: I back-calculated the confidence level behind VWO’s margin of error and found it was lower than standard (only 75%). I labeled this clearly.
  2. Added confidence intervals: I used 99% confidence intervals, deliberately wide, to compensate for other statistical shortcuts taken to keep the tool user-friendly. Users could now see a range of uncertainty instead of a single value.
  3. New confidence indicator: I replaced “Chance to Beat” with my own “Actual Confidence”, based on my own algorithm. Users could hover over the values to see what they mean.
  4. Sample size guide: I estimated how much longer a test had to run. When users hovered over the icons, they saw an explanation and a recommendation in plain English. I also applied many rules in the background to show context-specific messages, e.g., if visitor counts were below a best-practice minimum.
  5. Test metrics & risk: I added holistic metrics showing time elapsed and estimated weekly test traffic. I also quantified the false positive risk, taking into account the number of variants being tested (a rough sketch of this and the sample size guide appears after this list).
  6. External calculator link: I added a link to an external calculator that let users manipulate the data and apply special “corrections” not available in VWO.
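
Here is a rough sketch of the two calculations behind points 4 and 5: a rule-of-thumb sample size per variant, and the family-wise false positive risk when several variants are tested. Both are simplified; the actual add-on applied more rules.

    // Point 5: chance of at least one false positive across k variant-vs-control
    // comparisons, each at significance level alpha (assumes independent comparisons).
    function falsePositiveRisk(alpha, numVariants) {
      return 1 - Math.pow(1 - alpha, numVariants);
    }
    console.log(falsePositiveRisk(0.05, 3).toFixed(3));  // 3 variants: ≈ 0.143

    // Point 4: rule-of-thumb visitors needed per variant to detect a given relative
    // lift with ~80% power at alpha = 0.05 (the common 16 * variance / delta^2 rule).
    function visitorsPerVariant(baselineRate, relativeLift) {
      const delta = baselineRate * relativeLift;
      return Math.ceil(16 * baselineRate * (1 - baselineRate) / (delta * delta));
    }
    console.log(visitorsPerVariant(0.05, 0.15));  // ≈ 13512 visitors per variant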

User Feedback

I received feedback from multiple sources, found bugs, and fixed them. The add-on went through 7+ iterations.

Next, I Created Email Alerts

The problem was that VWO had no email update service to keep clients informed. Tracking results for multiple clients across different accounts was also laborious for me. Fortunately, VWO had an API.

I created an email update service that sent bi-weekly test updates to me and my clients, using VWO’s API and PHP to route the emails. I started with a status update showing current performance and the change since the last update:

The email included:

  • All tests and their status
  • Performance of each version, traffic, and statistical assessment
  • Estimate of test duration
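
The original service was written in PHP against VWO’s API. The sketch below shows the same shape of the idea in Node; the endpoint, the response fields, and the delivery step are placeholders rather than VWO’s actual API.

    // Node sketch of the update service's shape. The original was PHP; the URL and
    // response fields here are placeholders, not VWO's real API.
    const API_URL = "https://example.com/vwo-api/tests";  // hypothetical endpoint

    async function buildUpdate(apiToken) {
      const res = await fetch(API_URL, { headers: { Authorization: `Bearer ${apiToken}` } });
      const tests = await res.json();  // assumed shape: [{ name, status, variations: [{ name, visitors, conversions }] }]

      return tests.map(test => {
        const lines = test.variations.map(v => {
          const rate = (100 * v.conversions / v.visitors).toFixed(2);
          return `  ${v.name}: ${rate}% (${v.conversions}/${v.visitors})`;
        });
        return `${test.name} [${test.status}]\n${lines.join("\n")}`;
      }).join("\n\n");
    }

    // The resulting text was then routed as a plain transactional email to each
    // client on the account, on a bi-weekly schedule.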

I then incorporated my own heuristics that weren’t available in VWO. For example, this report included daily performance so I could see how consistent the test was:

For many projects, daily visitor counts were low, so I expanded the weekly summary to show detailed performance. My colleague also suggested making the report more personal, so I added a custom summary at the top, highlighted in yellow:

The red and green colors are also distinguished by minus signs and a difference in tint, so the report remains clear for color-blind users.
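
As an example of the daily-performance heuristic, here is a simplified consistency check: how often the variation came out ahead of the control, day by day. The data shape and the rule are illustrative, not the exact heuristics I used.

    // Simplified daily-consistency check; the actual report applied more heuristics.
    function dailyConsistency(days) {
      // days: [{ date, controlRate, variantRate }]
      const wins = days.filter(d => d.variantRate > d.controlRate).length;
      return { daysAhead: wins, totalDays: days.length };
    }

    const example = [
      { date: "Mon", controlRate: 0.048, variantRate: 0.055 },
      { date: "Tue", controlRate: 0.052, variantRate: 0.050 },
      { date: "Wed", controlRate: 0.047, variantRate: 0.058 },
    ];
    console.log(dailyConsistency(example));  // { daysAhead: 2, totalDays: 3 }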

I also built my own statistical calculator to facilitate both the planning and analysis of tests.

Product Page for the Add-on

The full product page included a clear explanation of what was new, with arrows pointing to specific features and explaining what they mean, to educate users.

MVP / Prototype for Project Management

My clients wanted to see the list of A/B test ideas and their current status. I created a functional prototype to allow us to enter test ideas, clearly articulate the rationale, prioritize, and flag them for testing:

When a test was activated in VWO, it would show up in the list, and anyone on the team could click on it to open the VWO dashboard.

Tool Retired

Eventually VWO updated their statistical model and I retired my tools. I also retired the email updates, because we decided that weekly personal updates with clients were more valuable. Still, going through the prototyping exercise was highly valuable for documenting the process.

Remote Testing Paid Signup Flow

Recency: 2015
Role: Owned project, client-facing designer and A/B test developer
Process: Solo with over a dozen A/B test iterations
Top Challenge: Improve sales by genuinely improving UX

An iterative redesign and A/B testing project to improve the Plans/Payment flow on a dating site and increase paid sign-ups.

Problems With Original

The original process failed to emphasize the top benefits:

There was no guidance on what to choose. Was the 3-month plan long enough?

Buying message packs and the option to enter a coupon code further complicated the choice. Analytics showed that some of the options were never used. Some benefits, like “Extra privacy options”, were unclear. Others, like “Organize singles events”, were not relevant to the average user.

Once the user picked a plan, they went to Step 2:

The 2-step flow was awkward. Step 1 featured the per-month price prominently, yet Step 2 opened with a higher prepay-for-3-months total. Analytics showed people going back and forth between the two steps, suggesting Step 1 wasn’t effective as a gateway page.

My Solution

My final redesign looked like this:

The key aspects of the solution were:

  • Hierarchy & Flow: Simplified to single-page process over multiple steps. Tabs for each plan allowed user to explore plans without flipping back and forth between pages.
  • Value proposition: I turned the top benefit into the headline (“Make A Great First Impression, Send the First Message”). Showed only the best next 3 benefits below instead of many.
  • User-centric: Guided user’s decision with Plain English financial and situational advice. For example, the 6-month plan says: “Pays for itself in X months. Take your time meeting people.” Used more casual language when describing the plans too.
  • Hierarchy: Removed distractions and moved secondary payment options way down to the footer.

Many Iterations Of A/B Experiments

Over multiple tests, I removed various components, such as the message pack footer. I also tried simplifying the choice by setting different defaults and using a single column layout.

Here’s an intermediate variation, which did NOT do better:

I tested setting the default to Plan 1 as well as Plan 2. During the testing, I monitored the impact on sales counts and revenue, as well as user behavior.

After each round of testing, I prepared summary reports and analyses with lessons learned and recommendations.

User Behavior Research

I tracked user behavior in detail, such as whether users chose a plan and then went back and changed their choice. I also tracked how long they spent at each step.
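
Here is a sketch of the kind of client-side instrumentation behind this tracking; the event names and the trackEvent stub stand in for whatever analytics calls were actually used.

    // Illustrative instrumentation for the behaviours described above.
    function trackEvent(name, data) { console.log(name, data); }  // stub for the real analytics call

    let lastPlan = null;
    let stepEnteredAt = Date.now();

    function onPlanSelected(plan) {
      if (lastPlan !== null && lastPlan !== plan) {
        trackEvent("plan_changed", { from: lastPlan, to: plan });  // back-and-forth behaviour
      }
      lastPlan = plan;
    }

    function onStepChange(fromStep, toStep) {
      trackEvent("step_duration", { step: fromStep, seconds: (Date.now() - stepEnteredAt) / 1000 });
      stepEnteredAt = Date.now();
      trackEvent("step_entered", { step: toStep });
    }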

Some results were surprising. For example, setting the default to Plan 1 reduced sales of Plan 1 but tripled sales of Plan 2, yet overall revenue did not increase because of the price differences between the plans.

A/B Test Outcome

I A/B tested this solution and it increased revenue and sales for all plans. It took 3-4 iterations.

Video: What-if analysis using a reverse test duration calculator

In this video, I want to show you a different kind of sample size calculator for your A/B tests. It works backwards compared to traditional calculators, and you might find that more intuitive. The basic premise of this approach is that we mostly don’t know what effect size to expect, so we make projections for a range of outcomes.

Go to the Reverse A/B Test Calculator

Video Transcript

This calculator doesn’t ask you to input power or the effect size you are after, because it assumes that you’re exploring and don’t know what effect size to expect. Instead it just asks you for your current conversion rate and traffic, and then gives you several possible effect sizes that you can reasonably detect on your site and your chance of success for each outcome.

Let’s see how it works. Let’s say you want to run your test on your home page. About 5% of people make it from the home page to a purchase, and you get about 5000 visitors to the home page per week. Let’s run the report with those numbers.

At the top of the report, you’ll see your preliminary estimate. This estimate tries to balance the testing duration with the effect size you can detect. It caps the duration at 8 weeks regardless. Next, it tries to make sure you can detect at minimum a 15% effect.

If I scroll down, you can see what I got: the duration is 6 weeks, and this is optimal for detecting a true 14% lift. I can then adjust my duration up and down and see how it impacts my projections.

The advantage of this report is that it doesn’t give you an estimate for just one effect size; it gives you a range of reasonable what-if scenarios, because we might have little idea what the effect size will be. I can see that if my new version is 10% better or 10% worse, then there is a 50% chance that the effect will actually peek through the noise strongly enough.

But if the effect is 14%, then I have an 80% chance of success, i.e. 80% power. I can then use my judgement to decide whether whatever I’m testing can reasonably beat the existing version by at least 10%, and ideally by 14%. That will depend on how big a change I’m testing, my experience with similar tests elsewhere, how bad the current design is, and so on.
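
For the technically inclined, this is roughly the power arithmetic that produces numbers like these, using a simple two-proportion z-test at a two-sided 5% level; the calculator’s actual model may differ somewhat.

    // Approximate power: chance of a statistically significant result if the true
    // relative lift is `relativeLift`. Simplified model; the calculator may differ.
    function normCdf(z) {  // standard normal CDF (Abramowitz & Stegun approximation)
      const t = 1 / (1 + 0.2316419 * Math.abs(z));
      const d = 0.3989422804014327 * Math.exp(-z * z / 2);
      const tail = d * t * (0.319381530 + t * (-0.356563782 + t * (1.781477937 +
                   t * (-1.821255978 + t * 1.330274429))));
      return z >= 0 ? 1 - tail : tail;
    }

    function power(baselineRate, relativeLift, visitorsPerVariant) {
      const p1 = baselineRate;
      const p2 = baselineRate * (1 + relativeLift);
      const se = Math.sqrt(p1 * (1 - p1) / visitorsPerVariant +
                           p2 * (1 - p2) / visitorsPerVariant);
      const zEffect = Math.abs(p2 - p1) / se;
      return normCdf(zEffect - 1.96);  // 1.96 = two-sided 5% threshold
    }

    // 5% baseline, 5000 visitors/week split across two versions, 6 weeks:
    const perVariant = (5000 * 6) / 2;
    console.log(power(0.05, 0.14, perVariant).toFixed(2));  // ≈ 0.77 with this simplified model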

Another piece of information you can get here is a sense of what the actual observed effect might be. Remember that even if my new version is 14% better and the test is a success, it doesn’t mean the measured effect will actually be 14%. By chance it may be inflated or deflated. So here you can also see the margin of error. This means that if I get a 7.5% lift, I know the true effect might actually be as high as 14%. But if I see a 3% effect, I know the true effect is at most 10%.

I might wonder: if the true effect were 14%, what observed effect might I see halfway through the test? To find out, I can reduce the duration to 3 weeks, find the 14% effect, and see that it might show up as an effect as low as 4.5%.

So far, I assumed there is a true effect. But if I am wrong and my new variation actually has no effect, I might still get a lift – that’s called a false positive. I always like to know what sorts of false positives I can expect.

In this case, let’s put it back to 6 weeks. With this duration, we have a fairly high chance of seeing a false positive of around 5%, positive or negative; the term false positive covers effects in either direction. There is also a small chance of a false positive as high as 10%: the probability is about 5%, small but possible. If I’d like to eliminate that possibility altogether, I can increase my duration. At 9 weeks, the probability is just 1%. And if I scroll back up, I see that with this duration I can also detect smaller effects.
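
The sizes of the false positives quoted here follow from the spread of the observed lift when there is no true effect. Here is a sketch using the same simplified model as above; 1.96 and 2.576 are the z-values for a 5% and a 1% two-sided chance.

    // What size of false positive has a given probability when there is no true
    // effect? (Same simplified model: 5% baseline, traffic split over two versions.)
    function falsePositiveLift(baselineRate, visitorsPerVariant, zThreshold) {
      const se0 = Math.sqrt(2 * baselineRate * (1 - baselineRate) / visitorsPerVariant);
      return (zThreshold * se0) / baselineRate;  // relative lift, either direction
    }

    const sixWeeks = (5000 * 6) / 2;   // visitors per variant
    const nineWeeks = (5000 * 9) / 2;

    console.log((100 * falsePositiveLift(0.05, sixWeeks, 1.96)).toFixed(1) + "%");
    // ≈ 9.9%: at 6 weeks, a false lift of about 10% (either direction) has roughly a 5% chance

    console.log((100 * falsePositiveLift(0.05, nineWeeks, 2.576)).toFixed(1) + "%");
    // ≈ 10.6%: at 9 weeks, a false lift that large has only about a 1% chance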

In the end, whatever happens, you have a much better idea of what to expect. Give it a shot. Let me know how you like it. Thanks for watching.