Hey Cortana, How would you evaluate the usability of a VUI?

The screen has become the predominant and default way for humans to interact with computers, but everyday voice-based interfaces are becoming more and more advanced. Last Friday, Richa and I sat down with Omri Dvir, a Senior Design Lead at Microsoft who works on Cortana, to learn more about the usability considerations of voice user interfaces. Below is a rough summary of our conversation with Omri, who was kind enough to have a video chat with us all the way from Tel Aviv, Israel.

QUESTION: What do VUI designers have to consider that GUI designers don’t?

Answer: The context of use is really important in every digital product, but for GUI you can kind of get away with not getting it perfect the first time. For VUI, you really have to understand the different levels of attention the user is able to give you. For example, if you’re sitting at a desk and using a VPA, you can use your voice, yes, but it’s also possible for you to look at a screen. So maybe the screen can provide additional information and functionality. Now if you’re driving, the experience has to assume you’re not looking down at the screen at all. If you’re walking down the street, your attention is not quite there: your two hands are not on the keyboard, but your eyes can glance down once in a while, and the screen can still feed information while the voice does most of the work. Or maybe you’re in a meeting room, standing away from the clicker. You’re not really in a position to fumble with a screen, so the voice interface really has to step up its game.

QUESTION: Why do VUI designers evaluate their designs?

Answer: Mostly to test the big assumptions that we have. Designing a voice interface requires us to make a lot of assumptions in the design process, so it’s important to continuously test them.

For example, do we let the user control the experience, or do we make assumptions? With voice interfaces, there is sometimes ambiguity that needs to be eliminated. If we tell a virtual personal assistant to “Call John”, and there are four numbers, which number does the system call? Do we choose the default number, the first one listed, or the local number? Or do we let the user choose? But that might be really annoying, because now the user is having a conversation with an AI about making a phone call, which would have been a one-click flow on a screen-based device.

So it’s always a delicate balancing act between giving you, the user, control, and making the system seamless by letting it make decisions without user input. We evaluate this as much as we can beforehand, but we can still make changes afterwards.
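To make that balancing act concrete, here is a minimal Python sketch of the “Call John” decision described above. The function name, the call-history heuristic, and the sample labels are all illustrative assumptions, not Cortana’s actual logic: the idea is simply that the system acts on its own only when it has a strong signal, and hands control back to the user otherwise.

```python
def resolve_number(contact_numbers, call_history):
    """Pick a number automatically when confident, else ask the user.

    contact_numbers: list of (label, number) tuples for the matched contact.
    call_history: numbers the user has dialed before, most recent first.
    Returns ("call", number) or ("ask", prompt).
    """
    if len(contact_numbers) == 1:
        # No ambiguity: act without bothering the user.
        return ("call", contact_numbers[0][1])

    # Heuristic: reuse the number the user most recently dialed.
    known = {num for _, num in contact_numbers}
    for num in call_history:
        if num in known:
            return ("call", num)

    # No strong signal: hand control back to the user.
    labels = ", ".join(label for label, _ in contact_numbers)
    return ("ask", f"Which number: {labels}?")
```

A one-number contact never triggers a question, so the common case stays a “one-click” flow, while the genuinely ambiguous case costs exactly one conversational turn.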

QUESTION: What is a common way to test your assumptions?

Answer: I don’t know about common, but a method that I like is Wizard of Oz. You’re basically faking an experience to save money. It’s very easy to pretend to be a voice system, right? All you need is a voice, a user, and some kind of script. With Wizard of Oz, you can have a human conversation without having to build the language understanding and algorithms, and quickly validate ideas. It’s much better to have one design team working on this for a day or two than a team of 50 engineers working for a year on something that’s not right.

Another good indication: if my human intelligence cannot create a good experience for users, then AI would never be able to. I definitely recommend using Wizard of Oz for testing big questions.
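The “some kind of script” the wizard reads from can be as simple as a lookup table. A minimal sketch, assuming a keyword-matching setup invented for illustration (the intents and lines below are made up, not from any real study):

```python
# The wizard reads the matched line aloud, so no speech system,
# language understanding, or AI needs to exist yet.
WIZARD_SCRIPT = {
    "call": "Sure. Which John did you mean: mobile or work?",
    "weather": "It's 22 degrees and sunny.",
    "fallback": "Sorry, I didn't catch that. Could you rephrase?",
}

def wizard_response(user_utterance):
    """Return the scripted line the wizard should read for an utterance."""
    text = user_utterance.lower()
    if "call" in text:
        return WIZARD_SCRIPT["call"]
    if "weather" in text:
        return WIZARD_SCRIPT["weather"]
    return WIZARD_SCRIPT["fallback"]
```

Because the whole “system” is a dictionary and a person, a design team can rewrite the script between sessions in minutes, which is exactly the speed advantage over building the real pipeline first.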

QUESTION: What is your experience with first-time users?

Answer: Using a voice user interface is a huge change in behaviour. Voice devices face some of the same problems VR had: it is very awkward to wear a device just for entertainment, and VR and a few similar technologies never got past that barrier. It is hard for first-time users to communicate with a device. For example, imagine going to a restaurant that has no buttons to select from the menu; it would feel weird for the user to order out loud. Or imagine being alone in a room and talking to a device. Users now carry out a few activities by voice, but the conversation is still limited. It is really important to us that the first-time experience is good. Think of going to a yoga class for the first time: if it’s crappy, you won’t try it ten more times, because you have better things to do. But if the experience is good, the abilities, skills, and growing vocabulary of the device chain on from it. For example, next time you’ll try calling two people, or messaging. It’s a quick confidence boost.


QUESTION: What process do you follow for user recruitment and testing?

Answer: Before starting the design process, we want to understand user behaviour around particular tasks. For example, we would like to know how many people check their email first thing when they wake up in the morning. For that we send out surveys, to around 5-10 people, and from the responses we come to understand the behaviour of the user.

User testing is done in the labs. As I said previously, we use various methods and make a few assumptions, but here we give the user small tasks, just like it is done for a graphical user interface. We give them tasks and record video of them using the device. We also ask them why they did what they did and what results they were expecting. To understand how users behave with the device in a real environment, we also give the device to people within the company. All of this helps us build confidence in the experience.

QUESTION: Can you tell us about any error cases?

Answer: We try to avoid dead ends. For example, imagine going to an information desk, asking how to take a bus to New Jersey, and being told “No, you can’t.” Voice conversations are the same, but there is no menu and no back button. During a voice conversation it is difficult to go back, and it’s very time-consuming as well: you go through several steps, and if you hit a dead end it’s super frustrating. We always try to explain why you came to a dead end. For example, if you want to call Jane and there is no Jane on your phone, the device can’t say “No, goodbye.” That creates a horrible experience for the user. It is our duty to make sure users don’t feel they did something wrong; it’s the device that is wrong. So a better answer would be “Are you sure you’d like to call Jane, or did you mean John?” There will always be dead ends, but there should be sideways paths into the next conversation. An error state that makes you look like an idiot is terrible.
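The “Jane”/“John” recovery above can be sketched as a single dialogue step. This is an illustrative assumption, not how Cortana actually resolves names: it just uses Python’s standard-library fuzzy matcher (`difflib.get_close_matches`) with a made-up cutoff to turn a would-be dead end into either a suggestion or a sideways path.

```python
import difflib

def handle_call_request(name, contacts):
    """Respond to 'Call <name>' without a dead end.

    contacts: list of known contact names (hypothetical data model).
    """
    if name in contacts:
        return f"Calling {name}."
    # Explain the failure, then offer a way forward instead of "No, goodbye."
    close = difflib.get_close_matches(name, contacts, n=1, cutoff=0.5)
    if close:
        return f"I couldn't find {name} in your contacts. Did you mean {close[0]}?"
    return f"I couldn't find {name} in your contacts. Would you like to add them?"
```

Note that every branch ends with the system taking the blame and proposing a next step, which is the design principle in the answer above; the specific cutoff and wording are placeholders.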

QUESTION: How do you deal with people with different accents?

Answer: It should be considered a lot more, but it requires a lot of machine learning: training and models. That’s why you see voice launched first in US English. Partly it’s an internal bias thing, but also, with English being the first language of so many people in the US, it’s easier to build the models. We are working on it, but it’s super complicated. Accents are just one thing; having a non-traditional name is another, because we don’t have the machine-learning mileage to resolve it. We are trying to fold all of this into name resolution and grow our vocabulary and our understanding of accents, but it is a costly and time-consuming process.