The best way to get real-life feedback is to put something real in the hands of customers. Wizard of Oz is a prototyping technique where you tell customers they are using a real solution, but you’re actually faking it behind the curtain.
I shipped an AI bot that consumed documents with client information over email (e.g., last year’s PDF submission) and replied with quotes. I configured an LLM backend to parse the document, automatically fill an insurance submission, and fill gaps with assumptions.
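The production pipeline isn't reproduced in this case study, but the core idea can be sketched roughly as follows; extractSubmission, callLLM, and the field names are hypothetical placeholders, not our actual code:

```javascript
// Rough sketch (not the production pipeline): turn raw PDF text into a partial
// insurance submission and record which gaps were filled with assumptions.
// callLLM() is a hypothetical placeholder for the LLM backend.

const EXTRACTION_PROMPT = `
You are given the text of an insurance application PDF.
Return JSON with the fields: businessName, state, revenue, employeeCount, industryCode.
If a field is missing, fill it with a reasonable default and list its name under "assumptions".
Respond with JSON only.
`;

async function extractSubmission(pdfText, callLLM) {
  const raw = await callLLM(`${EXTRACTION_PROMPT}\n---\n${pdfText}`);
  // If parsing fails, the bot replies asking the broker to take an extra step.
  const { assumptions = [], ...fields } = JSON.parse(raw);
  return {
    fields,      // pre-fills the "shell" submission
    assumptions, // surfaced in the email checklist for the broker to verify
  };
}
```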
The key chapters of this story include:
Problem
Brokers want quotes from multiple providers. More providers meant more questions upfront, and answering them wasn’t always worth it. This is a larger problem I worked on for a long time:
For this project, the final solution was: brokers get rough quotes right from their inbox by emailing existing documents and receiving quotes in reply. The benefits:
- Avoid re-entry: data comes from existing PDF applications
- Meet customers where they are: no logging into anything, no leaving the email inbox
- Faster time to quote
My Responsibilities and Collaborators
I ran the project end-to-end:
- Defined scope and roadmap of feature
- Tested mock and live prototype
- Supervised 2 engineers
- Helped customers define goals
- Negotiated with external vendor
- Led prompt engineering and built the QA framework
What Research Triggered This Project?
Recent interviews amplified several existing customer insights:
- Brokers felt overwhelmed by the proliferation of provider portals (if they didn’t have to use our portal for the initial steps, it was a big plus)
- Brokers essentially “lived” in their inbox (if they could transact in rough quotes without ever leaving their inbox, that would be huge)
Example Analysis of Target Customer
- Brokers that value speed/volume over accuracy (OK with a rough number they can revise later)
- Brokers comfortable making assumptions, whose clients rarely revise the application before binding (e.g., Apogee)
- Brokers that receive filled applications via email (e.g., last year’s app for other carriers, broker’s form)
Prior Context
- I had already shipped a related feature: the ability to skip questions for rougher pricing
- I had long-term relationships with customers who needed this sort of feature
- I had previously concepted related ideas, like Outlook integrations, but there were practical obstacles, and it was hard to prototype (none of which held for AI-over-email)
- My team had access to ChatGPT-powered data extraction tools
Managing Conflict and Negotiating a Pilot With Beta Customers
As part of my design role, I cultivated long-term relationships with customers. I identified 2 beta customers for this feature and gained their buy-in for a pilot. I had previously visited one of them at their office, so securing their participation in a 2-week pilot was relatively easy. In the end, I had 6 potential users who I knew would benefit from this feature, based on my deep knowledge of their specific workflows and values.
I had a preliminary call with them to estimate how much time we stood to save them.
One company was in a busy season, but I persuaded them this was actually a great chance to try a time-saving feature. The other ran a very lean shop. When I reached out with a follow-up request to one of the end users, the manager pushed back, saying, “We can’t spend time doing that. Perhaps this isn’t going to work for us.”
Instead of negotiating over email, I knew we had a standing call the next day, so I simply said, “No worries, we can touch on this tomorrow.” The following day, I walked the customer through a carefully prepared, minimalist slide deck. In step 1, I reiterated their goals and the projected time savings, which translated into handling more business. Then I explained that we needed some sample documents to calibrate our solution. The customer gladly agreed.
Negotiating With External Stakeholders
Part of my job involved negotiating with various stakeholders: vendors, lawyers, underwriters, and others.
For this project, we were using an AI provider for extracting data from PDFs. I suggested the vendor and my team set up a shared Slack channel, where I ended up giving a lot of feedback and making feature requests. I articulated the benefits well enough that the vendor actually shipped new features based on my feedback, in time to be useful to us.
Scope of the Pilot
For one of our key customers, I negotiated a desired workflow and target metrics:
How I Ran “Wizard of Oz” Prototype Test
For two weeks, I monitored my inbox and impersonated an AI chatbot (assembling submissions and attaching quotes manually).
My research objectives included questions like:
- What’s the maximum time we can take to respond for the solution to be acceptable?
- Would users be comfortable sending follow-up questions to a chatbot?
- Would users even remember to use the new workflow? (old habits die hard in this industry)
- How would users react to assumptions? What information would users need to verify?
- And most importantly: would the document extraction tech be good enough to produce quotes most of the time?
- Additionally, the pilot was a chance to collect more real documents to refine our data-extraction prompts
I varied my responses to test various hunches: I’d send successful responses, but I’d also reject a submission on purpose to see what the user would do instead.
At times, an incoming customer request woke me up early in the morning, and I had to respond as quickly as possible to keep impersonating the AI Quote Bot.
A Useful Technique: Intentionally Fail
People are surprised when I tell them I sometimes intentionally gave failed responses, ranging from a complete failure to parse the document to “I got this, but it’s not enough for a quote. Please take this additional step.” Knowing how customers react to fail states is a key part of prototyping. Fail states sometimes elicit behaviors you wouldn’t catch otherwise: would users push back on the response, try to resubmit, or modify their submission?
Learnings From Prototype
- One user said it was “too slow” and revealed nuances in their workflow: we needed to respond within 5 minutes in order not to block other steps in the users’ workflow
- To some users it felt natural to ask the chatbot follow-up questions, but it wasn’t crucial; it was enough for our Version 1 release to merely create a “shell” submission, which users could edit via a link
- We measured a time saving of 10-20 minutes per submission from data entry alone; this translated into one customer being able to place 20% more business
- We needed data-extraction improvements: e.g., 15% of submissions couldn’t go through because of a missing industry code
- Most importantly: we kept seeing submissions despite users being told they could stop at the end of our pilot
The following quote further illustrates how this feature validated our principle of meeting users where they are (i.e. shifting from web UI to their inbox):
How We Measured Success
Several customers came to rely entirely on the new Email Bot. One of the pilot customers was able to quote 20% more business:
Technical Success
From our Beta release, we saw we weren’t hitting our 80% submission-to-quote success ratio. I identified the causes, and we fixed them in the next release:
Desire Paths: Users Vote With Behavior
A desire path is a workaround. It’s formed when pedestrians refuse to walk a paved path and forge their own, shorter path across the grass. In this way, users communicate their desires through behavior.
During the pilot, one customer hacked our solution. Since they knew our AI email service consumed prefilled PDF applications, they created plain-text PDFs with customer information and then submitted them to the AI bot. Doing so required extra work, but it told us that the email channel was so effective and intuitive that they were willing to do that work.
Biggest Proof: Do They Want It Back When It’s Taken Away?
One of the most surprising findings AFTER the pilot ended was that we kept seeing submissions even though users had been told they could stop at the end of our pilot.
Once the pilot was “turned off,” users were eager to get it back, even users who had complained about a lot of rough edges during the pilot. In B2B, it is harder to impress the “doer” persona than the “buyer” persona. In this case, users didn’t have to use our AI tool, but they saw that it made their lives easier and wanted it back. This was another strong signal that we were on the right track.
One customer said:
User Testing the Bot Response
Part of the bot response was a checklist to verify the accuracy of the extracted data. One of the key considerations was: should the user see it in the same order every time, or should it be in order of priority (e.g., errors first)? I surveyed the pilot participants over email to choose:
One user replied: “B is better” while another said “Example A is better.” So I dug in further with the latter: “What makes version A more practical? Was it clear to you that… [details]”
The user clarified: “Version A is more practical because it captures more info and leaves us just 4 additional bullets to double check…Version B captures less info and requires more time to validate those 8 bullets filled out using guesses and defaults.” This was not true, but the user’s perspective here did inform me about the perceived complexity. After talking to the customer more, they agreed Version B would be better for them.
I explored several other variants, e.g., I guessed Version E4 would be easier to scan, but I was wrong. Although E4 would give the user criteria in the same order every time, the mental load was actually higher, because they processed it as 2 lists, left and right. I also saw that too granular a breakdown was counterproductive:
Through a couple of iterations I validated:
- How granular the breakdown should be
- What layout entails lowest mental load (it wasn’t what I expected)
- How many bullets are too many
- Which items are most crucial to review
In the end, I reused the email “digest” template I had designed for another feature. The bot summary was simplified and tacked on at the bottom:
Prompt Engineering & Test Strategy
To ensure we had an 80%+ success rate converting submissions into quotes, I had to get creative with designing and testing prompts. The AI vendor did not have a mass-test feature, but I wanted a systematic approach, so I built a JavaScript browser automation script for mass prompt testing and established a process for regression-testing prompt changes. Here is an example of my log, showing pass/fail for various prompt variants across dozens of document samples:
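The actual automation drove the vendor’s web console, so the sketch below abstracts that part behind a hypothetical runExtraction() call; the point is the regression loop and the pass/fail log it produces. promptVariants, samples, and the field names are placeholders, not the real test data.

```javascript
// Minimal sketch of the regression harness: run every prompt variant against
// every sample document and log pass/fail, so prompt changes can be compared
// release over release. runExtraction() stands in for the browser automation
// that drives the vendor's console (hypothetical here).

async function regressionTest(promptVariants, samples, runExtraction) {
  const log = [];
  for (const prompt of promptVariants) {
    for (const sample of samples) {
      const extracted = await runExtraction(prompt, sample.documentText);
      // A sample "passes" if every expected field was extracted correctly.
      const pass = Object.entries(sample.expected)
        .every(([field, value]) => extracted[field] === value);
      log.push({ prompt: prompt.name, sample: sample.name, pass });
    }
  }
  // Summarize the success rate per prompt variant, mirroring the pass/fail log.
  for (const prompt of promptVariants) {
    const rows = log.filter((r) => r.prompt === prompt.name);
    const rate = rows.filter((r) => r.pass).length / rows.length;
    console.log(`${prompt.name}: ${(rate * 100).toFixed(0)}% pass`);
  }
  return log;
}
```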
By building our first AI features on the submission form and email workflow (the salient and familiar), we were able to get a positive response to AI from a non-tech-savvy crowd.
Product Risks I Helped Identify
I led team discussions on the following topics. These are things we needed to be OK with, actively mitigate, or learn about from our pilot.
| Risk | Description | Mitigation |
| --- | --- | --- |
| ACCURACY | AI is faster, but will it hallucinate or extract data accurately? | Restrict extraction to high-priority data like name, state, revenue |
| CARRIER CHOICE | Brokers won’t use AI because it doesn’t support their key carriers | Reinforce the message that it’s zero effort to try; hinges on our carriers’ premiums being competitive over time, and on the number of quotes brokers get (3+ good quotes are hard to pass up even if 1 carrier is missing) |
| OVERALL USABILITY | Broker is annoyed if data is extracted incorrectly and fixing it takes even longer | Overall testing showed good results, but errors could impact specific carriers (needs long-term monitoring) |
| DEFAULTS | Broker can make better assumptions than AI | Research showed brokers are OK with a rough number and reasonable assumptions (e.g., 3 employees), and defaults can be configured for each broker (see the sketch below) |
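As a rough illustration of the “configured for each broker” mitigation above, a defaults lookup could look something like this; the broker IDs and values are made up for the sketch:

```javascript
// Hypothetical sketch of per-broker assumption defaults: values the bot falls
// back to when a field is missing from the submitted document.
const brokerDefaults = {
  default: { employeeCount: 3 },                   // global fallback assumptions
  "broker-123": { employeeCount: 5, state: "NY" }, // overrides for one broker (made-up values)
};

// Merge global defaults with any broker-specific overrides.
function assumptionsFor(brokerId) {
  return { ...brokerDefaults.default, ...(brokerDefaults[brokerId] ?? {}) };
}

console.log(assumptionsFor("broker-123")); // { employeeCount: 5, state: "NY" }
```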
MVP Scope and Feature Roadmap
After prototyping and consulting with engineering, I defined the roadmap for this feature, setting the MVP constraints as:
- Bot can process 1 document at a time
- Bot cannot read email body
- Bot cannot reply to follow ups (user must hit Edit link)
Go-to-Market
As the feature was nearing completion, I kept the release on track:
- Pitched the latest customer feedback to engineering
- Updated customers on final release date and scope
- Wrote product brief for sales, support, marketing
- Ran Q&A session with sales/support
- Discussed upcoming release with marketing
Interesting Aside: Hardest Mental Shift Is When Lo-Tech Wins
It’s often said that we as product people overestimate how central our product is in our customers’ lives. This is a kind of tunnel vision.
Over the years, I invested a lot of time into designing and shipping a user-friendly, lean application flow. I always saw email and spreadsheets as something to be replaced. However, as time went on, we repeatedly saw that PDFs, spreadsheets, and email are good solutions in specific circumstances.
Meeting customers where they are became an important theme. This opened my eyes to new factors. For example, I came to appreciate that certain user personas just wouldn’t log into our software: “producers” (salespeople) spent all their time on phone calls with customers and traveling. They never logged into anything and delegated a lot to others. The tendency then was to either ignore them or push solutions on them that they didn’t want. Instead, I challenged myself to make it “dead simple” for them to create opportunities. Maybe they wouldn’t log in, but they would click a link in a spreadsheet.
This project was one of the steps we took to meet customers where they are.
Concepting The Future
Following that release, we started work on enhancements (not covered here). The vision included expanding the AI Assistant into more of the day-to-day workflow. For example, I knew that flagging failed deals is important for company reporting but is a task of less value to the broker… This is a rough mock I used in discussions with customers to illustrate how that might work: