On this episode of The Digital Life, we talk about designing voice user interfaces with engineer Claire Sun, who just returned from the Conversational Interaction Conference. Today, the voice UI market is primarily focused on performing tasks and controlling devices related to the smart home. Voice recognition technology isn’t robust enough yet to function smoothly in noisy environments or tell the difference between multiple speakers. Additionally, the software that powers VUIs is still in the early stages of being able to understand language in context, as opposed to simpler, transactional phrases. Join us as we discuss approaches to designing VUIs, and the difficulties that designers and engineers encounter in their quest to create software that’s both personal and conversational.
Jon: Welcome to episode 247 of The Digital Life, a show about our insights into the future of design and technology. I’m your host Jon Follett and joining me today is Claire Sun, a voice user interface engineer who will be chatting with me about designing VUIs. Claire just returned from the Conversational Interaction Conference. Claire, welcome to the show.
Claire: Thanks, Jon.
Jon: Obviously you learned a lot at the conference, and we want to bring our listeners, who may or may not be familiar with voice user interfaces, up to speed on the current state of voice technology. Could you tell us a little bit about what the strengths are of current voice technology? What the strengths are and how those are manifesting in products?
Claire: Just to clarify for the audience, voice interfaces are probably something that they’ve interacted with already. It’s part of Siri, which is on everyone’s iPhone, as well as the trending Alexa and Google Home, which have been commercialized everywhere nowadays. What’s really good about these new types of interfaces is that you can use them without having to look at them. You can use them hands-free if you’re occupied while cooking or while you’re driving. Voice is very good in that it’s more instinctual and easier to use. Speaking to something comes intuitively to people. It takes less time than having to type a paragraph to somebody, so it’s very good at relaying data to a machine in a very quick way.
The problem with this interface is that it’s very hard to convey a lot of information to the user, because you don’t want to be lectured by a machine with this long paragraph, and oftentimes people have very short attention spans, so they’re not going to be paying attention as a machine drones on and on about the Wikipedia definition for this historical event. That’s where the common interfaces that we see nowadays with mobile and web are better, because they are able to show picture examples and paragraphs of data in a short given time. What’s most important is really which environments to use these different interfaces in.
Jon: You mentioned some of the more popular products that people might have in their homes or obviously on their phones. How would you break down the current state of the market? What are the different categories of voice products and where are new VUIs emerging?
Claire: The most common consumer product is the smart speaker. We’ve seen this large push toward them, which are again the Alexa, the Google Homes, and now the Apple HomePod, which is Apple’s version of a smart speaker and has Siri on it. These voice agents communicate using a speaker, so they’re very good at searching for information through the search engine they’re connected to. Of course Google is connected to Google. People also usually use them for entertainment, like playing music or playing any of the game apps that have been loaded to their App Store. They have also been integrated into many people’s phones, where they have more utility, so Siri of course is in the iPhone, whereas Google Assistant has been integrated into various Google products, like their hardware products as well as third-party products.
Because they are integrated into a phone, they can act as more of a multimodal device, where they can also access the visual interface of your phone, they can access any … they can be more of a real assistant since they’re connected to your calendar, your e-mail, et cetera. Another big push that I saw at the conference was large corporations wanting to use voice agents to automate their customer service, because if you think about it, customer service is a very large expense for large companies, so if we’re able to replace this workforce with computers, it’s something that would save the companies a large amount of money. So the idea is that instead of calling a person at a call center with questions about a product from Walmart, someone can call this virtual assistant and ask questions about that product or certain company policies, et cetera.
Jon: Yeah, I think we’re all familiar with the not so helpful helplines, the 800 numbers that wind up in a call center somewhere. I can see why that might be attractive, identifying this huge chunk of money that needs to be applied to customer service and, “Wow, we can save money now because the voice user interface will do it for us.” I guess I’m a little bit skeptical as to whether people are willing to tolerate relatively clumsy technology in order to find … if they have a problem that they need solved right away. I know that I always hack the voice user interface that I encounter and just say “operator” repeatedly.
Claire: Yeah, definitely a lot of people do that. A lot of people don’t like talking to a computer, because in a way it’s more frustrating and slower than talking to a real person. Still, in the case of customer service, you have to wait to get on the line with that real person, and you may be on hold for quite a time. So the point is that voice has a long way to go and that we need to work on making it more conversational, instead of having this kind of survey that talks to you on the phone like, “If this applies to you, press one. If this applies to you, press two.” Instead of listening to that, the goal is to have more human-mediated conversations where the human is able to drive the conversation and not the reverse.
Jon: That seems like a good transition into talking about some of the challenges that come along with designing for this kind of interface. You mentioned there that there is a difficulty in being able to properly frame the conversation right now. We’re used to these voice user interface trees that put us into this mode where we’re trying to find our way and it’s not very successful, and it doesn’t make us feel good about what we’re doing. What are some of the other challenges that come along with designing VUIs and how are we able to design to meet those challenges?
Claire: Currently voice agents are used for very short conversations. Basically you give them a question and they provide you with an answer. So I can ask, “Hey, Google, what’s the capital of Alaska?” Something like that, and that is kind of the length of their conversation. Usually there isn’t what we call an actual conversation, a real dialogue, because it doesn’t really ask follow-up questions unless it’s giving you a survey. So this is really what people in the voice field are researching with this kind of technology: trying to further the conversational aspect. We want to produce a technology that is similar to talking to a real person, that’s what we want to strive for, and so it really has to do with the design of the conversation flow connected with the actual engineering aspect of the machine learning that is doing all of the text-to-speech. That really has to do with the neural nets that act as the back end for all the natural language processing, so that the machine can decipher the linguistic meaning, the sentence structure, et cetera, to understand human statements but also to relay that information back in a way that we can understand.
Jon: So it seems to me that one of the important missing pieces of all of this, around that designing for VUIs, is context, right? So not only context in a particular conversation, but also over time being able to know that my virtual assistant has some understanding of who it’s dealing with and that my house is set up in a certain way, I have certain preferences, versus my wife or my son might have something different altogether. We take that for granted of course, when we’re dealing with people, that they kind of understand the context of our request. What from a design perspective is … how do we approach that with context in mind? How do we improve our voice interfaces so they are no longer these one-word interactions, but rather have sort of the context as well?
Claire: The issue with context is that when humans have a conversation, they remember what they’ve just talked about or previous things that have happened. That requires some form of memory, and the thing is that the way voice agents are built currently, there is no storage of past memory, because in a way it would need to be infinite. The agent doesn’t know exactly what you’re going to refer to, so it’s really hard to determine what you want to talk about again without you stating that exact subject. So really the technology isn’t at that point yet. A lot of researchers at Microsoft, Google, and a lot of universities are trying to figure out how you can get that context, so that you can ask, “What was that again? What was this again?” Things that do not describe the whole term that we’re asking the machine about, but that we as humans understand. So it’s something to continue to look for. Things can be hard-coded to understand context if we know what data we’re expecting. So we can construct a dialogue flow that will, for example, temporarily store memory of what location you’re talking about, but again it will always be temporary, because machines can’t store all the data for their lifetimes like we can. So it’s definitely further research on the linguistics, the cognitive side, that we need to continue developing.
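[Editor’s note: As a rough illustration of the hard-coded, temporary context Claire describes, here is a minimal Python sketch. It is not how any shipping assistant works. The names `ShortTermContext`, `respond`, `FACTS`, and `FOLLOW_UPS`, the 60-second expiry, and the one-entry fact table are all assumptions made up for this example.]

```python
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical hard-coded knowledge; a real agent would query a search backend.
FACTS = {"what's the capital of alaska": "Juneau"}

# Follow-up phrases that only make sense if something was said earlier.
FOLLOW_UPS = {"what was that again", "say that again"}


@dataclass
class ShortTermContext:
    """Holds only the most recent answer, and only for ttl_seconds."""
    ttl_seconds: float = 60.0  # assumed expiry, chosen arbitrarily
    last_answer: Optional[str] = None
    stored_at: float = 0.0

    def remember(self, answer: str, now: float) -> None:
        self.last_answer = answer
        self.stored_at = now

    def recall(self, now: float) -> Optional[str]:
        if self.last_answer is None:
            return None
        if now - self.stored_at > self.ttl_seconds:
            self.last_answer = None  # the memory is temporary by design
            return None
        return self.last_answer


def normalize(utterance: str) -> str:
    """Lowercase and drop surrounding spaces and end punctuation."""
    return utterance.lower().strip(" ?.!")


def respond(utterance: str, ctx: ShortTermContext,
            now: Optional[float] = None) -> str:
    """Answer a known question, or resolve a follow-up from stored context."""
    now = time.monotonic() if now is None else now
    text = normalize(utterance)
    if text in FOLLOW_UPS:
        remembered = ctx.recall(now)
        if remembered is None:
            return "Sorry, I don't remember what we were talking about."
        return remembered
    answer = FACTS.get(text)
    if answer is None:
        return "Sorry, I don't know that one."
    ctx.remember(answer, now)
    return answer
```

[In this sketch, asking “What’s the capital of Alaska?” stores the answer, and “What was that again?” repeats it only while the stored entry has not expired. Anything outside the expected phrases falls through to an apology, which is the limitation of hard-coding that Claire points out.]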
Jon: Yeah, to expand on that a little bit, we do set context a lot with our visual user interfaces, specifically planning to work in a type of document or with something that we’ve accessed previously. So we do that, but it’s second nature, right? So I open a Word doc, I’m essentially telling the machine that I’m going to be composing some writing and that I’m expecting certain kinds of tools to be available to me, et cetera. Whereas there is not as much context-setting with the voice user interfaces, at least not yet. So to wrap up our conversation on designing VUIs today, what are some of the things that you’re seeing in the marketplace in terms of the technology moving forward? What are the problems that are being solved over the next year to improve the voice user experience? Because there seems to be an endless array of things that we could do. Where do you see the technology going in the short term?
Claire: Technology advances at a very fast pace, so what people are working on currently is making these agents sound more human. For example, if you talk to Siri or Alexa now, you can always tell that it’s a robot, but there’s been more research developing in text-to-speech in order to make those voices less and less robotic. That changes how people interact with the device, because if you know that something is a robot, you’re going to treat it differently than if you think it’s a human. I think there is going to be a lot of further integration of voice agents into devices that we commonly interact with, but I think it’s going to be a much more multimodal approach, where I don’t think we’re going to stick with the whole smart speaker aspect. Because really only having voice as a form of an interface is actually limiting, whereas if you think about having voice and a visual interface, and maybe something else, there is much more functionality that can be done with that, and a lot of various data inputs and outputs that are better suited for those different interfaces. So I think that is going to be the next approach, and companies are actually approaching that with, like, the Echo Show, where there is a touch screen attached to an Echo instead of just having the Echo speaker itself.
Jon: Interesting. We’ll keep an eye out for that further integration. From a design perspective, I can see that making a lot of sense. Claire, thank you so much for joining us today.
Claire: Thank you, Jon.
Jon: Listeners, while you’re listening to the show, you can follow along with the things that we’re mentioning here, in real-time, if you head over to thedigitalife.com. That’s just one L in ‘thedigitalife’, and go to the page for this episode. We’ve included links to pretty much everything mentioned by everybody, so it’s a rich information resource to take advantage of while you’re listening, or afterward if you’re trying to remember something that you liked. You can find The Digital Life on iTunes, SoundCloud, Stitcher, Player FM and Google Play. If you’d like to follow us outside of the show, you can follow me on Twitter at jonfollett, that’s J-O-N-F-O-L-L-E-T-T, and of course the whole show is brought to you by GoInvo, a studio designing the future of healthcare and emerging technologies, which you can check out at goinvo.com, that’s G-O-I-N-V-O dot com. So that’s it for episode 247 of The Digital Life. I’m Jon Follett and we’ll see you next time.