STFU: Test Your Voice App Idea in Less Than An Hour

Ben Sauer · Published in Chatbots Magazine · 9 min read · Feb 23, 2017

(world’s MOST annoying bot)

In digital design, there’s a universal truth emerging: most ideas should be tested as cheaply and as simply as possible (think lean, or agile). In the case of a voice interface, you wouldn’t want it to end up as annoying as C3PO, would you? You most certainly wouldn’t want to find that out AFTER you made it. Yes you, George Lucas (hey, I saw Jar-Jar coming).

Ideally, you’ve got to fail and learn as fast as possible. Ed Catmull at Pixar said:

If you aren’t experiencing failure, then you are making a far worse mistake. You are being driven by the desire to avoid it.

Guerrilla Testing a GUI

At Clearleft in 2007, we created Silverback, because the cost of usability testing at the time was prohibitively expensive. Silverback is a neat hack — a quick way to record the screen, audio, webcam, and clicks on a mac during usability testing. It’s for those who value the quick and dirty route to knowledge over academic perfection (that comes later). Action is a great route to learning. You shouldn’t need a lab to test something early on; you need a simple solution, a cafe, and members of the public.

Silverback (the software) and Steve (the silverback)

Recently I’ve been thinking about how to do this in the early stages of creating a voice interface. I wanted a method that took me from writing a script for the interface to a conversation between the system and a user as fast as possible. No development effort, no fooling around.

That’s my goal with this article: get you testing an idea for a voice interface, as fast as possible.

Wizard of Oz Testing

Wizard of Oz testing is not new, but it has a renewed importance as we see the rise of voice interfaces. John F. Kelley refined the method in the early 80s: have humans interact with a system that seems real, when in reality it’s controlled by the ‘Wizard’, somewhere behind the scenes. Using the method, you can figure out how people will use and react to the system, and test whether the system has the appropriate responses. If it doesn’t, refine and add as needed.

(I’m sure you can do a better job than Kramer!)

Amazon used this method to create the Amazon Echo. The device was in one room with a user, and the ‘Wizard’ was typing responses back to the user, converted into speech by the Echo prototype. The test participants were none the wiser: they thought they were interacting with a smart device instead of a smart fake.

“The goal was to collect information on what types of responses worked and what didn’t work, as the subject would later fill out a satisfaction report on what they liked. For example, 50 people would go through the same script, but with varying response times or sentence structures.”

Gathering what people say

This method allowed them to gather ‘utterances’ (in VUI design, whatever your users say). You want to collect words and synonyms that people use, even though their intent is the same (book me a cab vs. get me a taxi). Your system must be able to cope with the wide variety of things that people say, and ideally use the words they use when speaking, in order to be understood.
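In code, the simplest version of this is a lookup from collected utterances to a single intent. Here’s a hypothetical sketch (the intent name and phrases are illustrative, not from any real system):

```python
# Hypothetical sketch: grouping collected utterances under one intent,
# so "book me a cab" and "get me a taxi" resolve to the same action.

UTTERANCES = {
    "bookTaxi": ["book me a cab", "get me a taxi", "order a taxi"],
}

def match_intent(utterance):
    """Return the intent whose collected utterances include this phrase."""
    phrase = utterance.lower().strip()
    for intent, phrases in UTTERANCES.items():
        if phrase in phrases:
            return intent
    return None
```

Every utterance you gather in testing widens that list, which is exactly why collecting them matters.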

Designing the flow

The system must also accommodate different paths through the dialogue; this is called the dialogue flow. A user booking a restaurant might say the time they want the table when they first start speaking; another might not. Your system needs to ask the right questions in response to highly variable input. Speech is a blank canvas; most GUIs aren’t.
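One common way to model this is slot filling: the system asks follow-up questions only for details the user hasn’t already supplied. A hypothetical sketch for the restaurant example (slot names and prompts are my assumptions):

```python
# Hypothetical slot-filling sketch: ask follow-up questions only for
# details the user has not already supplied.

REQUIRED_SLOTS = [
    ("time", "What time would you like the table?"),
    ("party_size", "For how many people?"),
]

def next_prompt(filled):
    """Return the next question to ask, or None once every slot is filled."""
    for slot, question in REQUIRED_SLOTS:
        if slot not in filled:
            return question
    return None

# A user who gave the time up front skips straight to the next question.
print(next_prompt({"time": "7pm"}))  # For how many people?
```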

The 1 Hour Test

My kids watch this scene from X-men Apocalypse on repeat. A lot.

You can prototype an application for Amazon’s Echo or Google Home (or anything else) in minutes using this method. I’ve made a little python script to make this very easy: it lets you press shortcut keys to have your mac read aloud phrases from a text file.
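The core of the trick fits in a few lines. Here’s a simplified, hypothetical stand-in for Say Wizard (the real script is linked further down) that numbers the lines of script.txt and speaks the chosen one with macOS’s built-in `say` command:

```python
# Simplified, hypothetical stand-in for Say Wizard: number each line of
# script.txt and speak the chosen line with macOS's built-in `say` command.
import subprocess

def load_phrases(text):
    """Map keys '1', '2', ... to the non-empty lines of a script file."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return {str(i + 1): line for i, line in enumerate(lines)}

def speak(phrase):
    subprocess.run(["say", phrase])  # macOS only

def run(path="script.txt"):
    """Interactive loop: type a phrase's number and Enter to speak it."""
    phrases = load_phrases(open(path).read())
    for key, line in phrases.items():
        print(key, line)
    while True:
        key = input("> ").strip()
        if key == "q":
            break
        if key in phrases:
            speak(phrases[key])

# run()  # uncomment on a mac, with script.txt in the same directory
```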

The Steps

  1. Script the basics of your VUI.
    Write something that imagines how a user will interact with your system. Try it out a few different ways: how might the flow of conversation work differently when users give more or less information? You want to hit a ‘Goldilocks Zone’ of detail: cater for some of the variation, but don’t try to cover every eventuality at first.
  2. Copy and paste the lines your system will speak into a text file. Remove the user’s lines. Save it as script.txt
  3. Install Say Wizard from github onto your mac. Copy your script.txt into the same directory. Read the instructions for use.
  4. Run your test! Learn and refine as needed.

(If you want to just get started right away and figure this out for yourself, just do step 3.)

Still here? Let’s break it down in more detail. Here’s a simple example I created and tested in less than an hour. I’m skipping a *lot* of detail about how to design a VUI, craft dialogue flow, moderate tests, and analyse them; that’s for another time or source. This is about getting going fast.

1. Scripting the Basics of a VUI

In this scenario, we’ll imagine a skill for the Amazon Echo that’s useful at the office: audio recordings of meetings. (Yes, I’m aware it can’t do this, hey: this is just a tutorial!).

I start out by considering the intentions users might have and the useful information that the system might need to gather from the user. I write a script that covers the same dialogue to start the recording of the room audio, but with some variation around what information the user might offer. Some are just fragments of dialogue where I think things will vary.

The italic titles in brackets are the intents — a consistent name we’ll use in the system for what the user was trying to do, based on their first command.

_______________________

(startRecording)

U: Alexa, record this meeting.
A: Recording started. How long is your meeting?
U: an hour.
A: One hour, OK. Who would you like me to send the recording to?
U: James Box.
A: Thanks. Afterwards I’ll send this recording to them.

(startRecording)

U: Alexa, record this meeting.
A: Recording started. How long is your meeting?
U: don’t know.
A: You can ask me to stop the recording whenever you like. Who would you like me to send the recording to?

(startTimedRecording)

U: Alexa, record this room for 45 minutes.
A: Recording started for 45 minutes. Who would you like me to send the recording to?
U: Me.
A: What’s your full name, me?
U: James Box.
A: Thanks. Afterwards I’ll send this recording to you.

(startTimedRecording)

A: Recording started for 45 minutes. Who would you like me to send the recording to?
U: uhhhh… not sure.
A: No problem. You can find the recording later in the Dropbox folder for shared recordings.

(stopRecording)

U: Alexa, stop the recording.
A: Recording stopped at 53 minutes. Sending it now.

(cancelTheRecording)

U: Alexa, delete the recording.
A: Are you sure you want to cancel the recording?
U: Yes
A: OK, recording deleted.
____________________
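A side note: the bracketed intents above map fairly naturally onto the kind of interaction model you’d eventually define for a real skill. A hypothetical sketch in Python (this is not Amazon’s actual schema; the slot names are my assumptions):

```python
# Hypothetical interaction-model sketch built from the script above;
# not Amazon's actual schema. Each intent lists sample utterances and
# the slots the dialogue still needs to fill.

INTERACTION_MODEL = {
    "startRecording": {
        "utterances": ["record this meeting"],
        "slots": ["duration", "recipient"],
    },
    "startTimedRecording": {
        "utterances": ["record this room for {duration}"],
        "slots": ["recipient"],
    },
    "stopRecording": {
        "utterances": ["stop the recording"],
        "slots": [],
    },
    "cancelTheRecording": {
        "utterances": ["delete the recording"],
        "slots": [],
    },
}
```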

2. Copy and Paste the Lines Your System Will Speak into a Text File (script.txt)

Simple really. Take out what the user says and the accompanying notes. I slightly regroup the responses into a sensible order that I think will occur in testing.
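For the example above, script.txt might end up looking something like this (user lines and notes removed, responses regrouped):

```
Recording started. How long is your meeting?
One hour, OK. Who would you like me to send the recording to?
You can ask me to stop the recording whenever you like. Who would you like me to send the recording to?
Recording started for 45 minutes. Who would you like me to send the recording to?
Thanks. Afterwards I'll send this recording to you.
No problem. You can find the recording later in the Dropbox folder for shared recordings.
Recording stopped at 53 minutes. Sending it now.
Are you sure you want to cancel the recording?
OK, recording deleted.
```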

3. Install Say Wizard Onto Your Mac

There are some tools out there for testing voice interfaces, but none of them get you set up that fast, so I decided to create a messy (but elegant!) hack, in the spirit of Silverback. Credit is due here to Abi Jones at Google (we’re running VUI design workshops together): she’s trained Googlers in the past using Apple’s ‘say’ command on the command line. I’ve extended her idea a little bit to make it even easier to create and interact with a fake interface.

https://github.com/bensauer/saywizard

  1. Download and unzip this script to your mac
  2. Edit the text file ‘script.txt’ in the unzipped ‘saywizard’ folder with your phrases.
  3. Double-click startSayWizard.command to run a test (if mac security settings prevent this, right-click on it, Choose: Open With > Terminal, and click Open).
  4. Press the relevant key to have your mac say your phrases. Close the window to quit.
Say Wizard in action.

4. Run Your Test

i. Brief the Participant

Write a brief for the user in advance. Give them enough context for what you’re asking them to do, but try to avoid words that would lead them.

Example brief
_____________________

The company has bought an Amazon Echo, and you’ve heard it can capture audio. You’ve got a presentation in this room for the next 60 minutes, and you’d like to listen back to it later. Ask Alexa to help you, beginning with the word “Alexa…”
_____________________

ii. Setup the Test

Book a space, recruit people (e.g. on Slack / Twitter), bribe them with pizza, do whatever it takes to drag people into the session. Avoid giving them much information in advance. In the actual test space, I’m fond of using a prop instead of an Amazon Echo, e.g. a tub of hair wax or Pringles.

MMMM Hairwax.

iii. Run the Test

  1. Sit somewhere that the participant can’t see your screen. It helps if they’re not facing you; this reduces the sense that they’re talking to you.
  2. Welcome the participant and remind them they can do no wrong.
  3. Open QuickTime Player on your mac, go to File > New Audio Recording, and then hit the red Record button.
  4. Switch to Terminal on your mac to use my script (see steps above).
  5. Read the brief to the participant.
  6. Have a conversation by triggering the system’s responses to the participant by pressing the relevant keys on your mac.
  7. Without asking leading questions, ask the participant for feedback.
  8. Get the participant to fill out a brief satisfaction survey. See Cathy Pearl’s great book, Designing Voice User Interfaces, for more on this (and testing in general).
  9. Stop the recording when the session is done.
  10. Take off the sorting hat. You’re a wizard now. YAY.
OK, sure, a witch too.

iv. Analysis

Now’s the time to listen back to your recordings. What did you learn?

Your analysis will be looking for three things in particular.

1 — Was it Usable?

Did the user complete the tasks? Did they understand what the system was saying? Were the responses that you scripted able to handle the different ways that people asked for things and the different paths through the dialogue? Did they hesitate at all? Could the system make it clearer to a user what options they have?

2 — Was it Desirable?

Did you notice any emotional responses to the system? Do participants *feel* that the task is easier using a VUI?

3 — What Words Did They Use?

The system should:

i. recognise a wide variety of ways that users ask for things, so you should collect their ‘utterances’ and make sure that your system recognises the synonyms and related phrases (e.g. 60 minutes = 1 hour)

ii. respond using the language that users naturally tend to use — update your script accordingly.
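As a sketch of point i, you can fold the variants you collect into one canonical value, so that “an hour”, “1 hour”, and “60 minutes” all count as the same answer. The synonym table here is a hypothetical example; yours comes from the utterances you actually gathered:

```python
# Hypothetical normalisation table: fold the duration phrases collected
# in testing into one canonical value (minutes), so the system treats
# "an hour", "1 hour", and "60 minutes" as the same answer.

DURATION_SYNONYMS = {
    "an hour": 60,
    "one hour": 60,
    "1 hour": 60,
    "60 minutes": 60,
    "45 minutes": 45,
    "half an hour": 30,
}

def normalise_duration(utterance):
    """Return the duration in minutes, or None if unrecognised."""
    return DURATION_SYNONYMS.get(utterance.lower().strip())

print(normalise_duration("60 Minutes"))  # 60
```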

Tip

Spend more time than you think you need writing the test brief. In my experience you’ll get this wrong the first time; mine typically go through a lot of revision and editing.

Wrapping Up

Now we’ve revealed the Wizard behind the curtain, a couple of caveats.

This method is by no means perfect; e.g. the system can’t speak back to the user using information they’ve given, like time / date / location. This kind of testing is formative, not summative: you’re not summing up the performance of an end product, you’re forming your ideas about how it could work, as quickly as possible.

I’ve cut out lots of sensible, useful steps for the purposes of the article. Still, I hope this method helps you find your inner Wizard fast.

Feedback and likes gratefully appreciated!


Speaking, training, and writing about product design. Author of 'Death by Screens: how to present high-stakes digital design work and live to tell the tale'