Transcript:
My name is Tanya Kraljic. As you all know now, I work at Google as a conversation designer. I work on the Google Assistant, and today I'm just going to talk to you a little bit about some things we've been learning over the past couple of years that hopefully will help you if you are starting to think about adding voice as a touch point for your customers.
Okay. So thanks, Victoria, for the great lead-in.
So voice is definitely out of the early adopter phase. It has not reached a tipping point yet, but about 25 percent of U.S. adults now have a smart speaker. Many have more than one, actually, and they're regular users.
So about 70% of people are active users of their smart speaker and 30% of people are active users of voice in the [unclear]. And it's not just in the US; it's in other countries and locales as well.
And of course it's not just smart speakers. As mics have gotten better, there's now voice on a number of form factors, everything from headphones and watches to microwaves, for instance, some of which are very personal form factors and some of which are more communal ones.
So it's just something to think about as you're approaching your voice journey. One of the reasons that voice is kind of taking off now is that the technologies that are specific to voice, the things that make voice work, which are speech recognition on the one hand and TTS, text-to-speech, on the other hand, have gotten really good.
So this is already a couple of years old as data, but the word error rate, which is our main measure of accuracy for speech recognition systems, is now close to the human word error rate. Not in all contexts and not for all speakers, but it's gotten so reliable, actually, that a colleague recently forwarded me an article that was talking about how air traffic controllers are looking at using speech recognition to augment their capabilities.
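For readers who want the metric spelled out: word error rate is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The Python below is a minimal sketch under that standard definition. The function name `word_error_rate` and the sample utterances (borrowed from the tap-test examples in this talk) are assumptions for illustration and were not shown on stage.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    spoken = "any direct flights to denver this sunday"
    recognized = "any direct flights to denver this saturday"
    print(f"WER: {word_error_rate(spoken, recognized):.3f}")
```

Here one substituted word out of seven gives a WER of roughly 0.14. Note that the score only says how well the words were recognized, not whether the request was understood or fulfilled.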
So it's a very reliable technology at this point. And on the other side of that coin is text-to-speech: not natural language generation, but just the ability for machines to actually speak responses out loud, which has also gotten incredibly good. The metrics there are usually intelligibility and naturalness. It's not about mimicking a human voice necessarily, but it is about what I just mentioned, intelligibility, which is especially important, having worked with enterprises.
You guys have a lot of dynamic data, a lot of dates and times and amounts and things like that, that used to be very difficult to comprehend when a machine would read out sentences that include all that kind of data. Now it's just a lot easier to comprehend, which is really important.
Okay. So all of this is just to sort of set the stage: the biggest opportunities and also the biggest challenges in voice UX right now are ones that might be very familiar to you from the chatbot world as well, which are picking the right use case and making it work very well.
They seem so obvious as almost to not be worth saying, but when we look at people who are creating actions or embarking on this space, these are the buckets where we see people making misguided decisions over and over and over again. And so I just want to do a little bit of a deep dive into some of what we've learned that might be useful here.
So one thing that's been really interesting to see, now that more and more consumers are actually using voice daily, and not just using it when it gets thrown in front of them in an automated phone system, is that voice fills two needs for users. One is maybe one that you thought of already, which is that it can be about ease, convenience, accessibility.
So when people are hands-free or eyes-free, or they're trying to multitask, voice can be very valuable. And so we sometimes do this test for ourselves when we're thinking about different use cases, which is the tap test: how many taps would it take to get the answer to a question or to do a particular task?
So if you want to, you can play along by taking out your phone and just seeing how many taps it takes to get the answer to "What's 16 times 2710?" Do we want to do it?
Okay. Well, usually people say it's about seven or eight taps, give or take. Then you've got things that are slightly more complex: "Play Van Morrison on Spotify."
Then you're up to, you know, the teens in number of taps. Then you have things like "Any direct flights to Denver this Sunday?", where I've never actually counted how many taps it takes, between all the typing you have to do and all the swipes and decisions too. And you can compare that.
So I'm going to do this. You should never do a live demo on stage, but I'm going to do one. So just compare that last one, "Any direct flights to Denver this Sunday?", how you would do it with taps versus if I were to say, let's hope this works:
"Any direct flights to Denver this Saturday?" "Flights from New York to Denver leaving the 25th of May and coming back the 3rd of June start at $243, the non..." Okay.
So what it does there, you can't see my phone, is it sort of answers the question, and also, because this is a multimodal form factor, it actually just pulls up the Google Flights UI. So it's basically provided a shortcut for me to start browsing flights, which is not an unreasonable thing to do in that case because, you know, browsing flights purely through voice would be pretty painful. Foregrounding the strongest modality that you have, which in this case means using voice as the shortcut but then foregrounding the visual modality, is a smart decision in that case.
So one question you could ask yourself is: are you reducing the amount of effort that users have to put in? Another question to ask yourself is: are you allowing users to complete a full journey by voice? And those are two different things, right? So in that one, I didn't complete the journey by voice, but I did start it, and it was a shortcut.
One thing that's been really fascinating to me, and has been kind of re-inspiring me a little bit lately, is that as a society, I don't just think, I know we are hearing over and over again that people are looking for ways to not get sucked into their phones and their computers all the time. Right? So they're looking for ways for small internet moments to stay small, and voice gives them that. When you have a smart speaker at home, we see people telling us all the time that the most surprising thing about it for them has been not only that they can use it together as a family, but also that they can get quick answers and there's no phone binge.
So for those of you who maybe played along with the phone taps, you very likely still have your phone out and are now sucked into email or something. So [inaudible] percent of parents actually say that they purchased a smart speaker to reduce screen time. They don't actually even count smart speakers as technology or screen time for their kids. So a lot of people try to reduce technology for their young kids, but they don't include smart speakers in that, for these other reasons. So it's been really interesting to me to consider the opportunities that voice has for this.
But more importantly, maybe, for you in the immediate term, what this means is that when you're looking at use cases that are good for voice, you want to keep those user needs in mind and think about how you can optimize on providing a lightweight, frequent touch point for your customers. That's the thing that voice is really good for and sticky for right now. So it's not that you want to voice-ify your app or website; you want to look at the range of ways that people interact with your company, and you want to look for those opportunities where you can make things a little bit easier for them. These are just some examples, but think about having the opportunity for a more frequent touch point. When users go to do something that's complex or asynchronous or requires a lot of data comparison, they're probably not going to choose voice as the way to do that. They're probably going to choose to go open their computer or go into a chatbot or something like that.
But there are lots of things that people do all the time that aren't that complex, right? And those are the places where you can sort of get wins right now. We see this in the consumer world as well. So the killer apps, and I don't mean to say that these will always be the killer apps or that they're the most sexy or interesting apps, but the things that are sticky right now are those things that users can complete by voice, where they can just be sitting at the kitchen table, something pops into their mind, and they just ask for it and they get the answer.
So the first part was about the use case bucket. Doing simple things is kind of the message: look for ways to simplify people's lives. The second part is about how to do those things well, and there's obviously a ton of things to say about that. We've heard some of them today from other speakers.
I've just picked a few places where I think you can really have a big impact if you focus on them when you're building your bots. Those areas are: the very first use, the very first time a user comes into your app. What do you put in front of them? What mental model do you give them? What obstacles are there? Different ways that you can prompt to make it easy for users to talk to you. I'll dive into each one of these in a little bit of detail. What data you can put to use for you and for users. And investing in the natural language understanding part, which gives you a lot of wins.
So this is a drawing of Luke Skywalker that a colleague's son did, mostly because we could get it approved by legal, whereas we couldn't get publicly available photos approved. So the very first time that a user comes into your app, they are going to develop a belief about what it does and, just as importantly, what it doesn't do. We heard earlier from someone, it might even have been Victoria, actually, about the importance of setting expectations in that first one. This is your opportunity to do that, and it will increase the success with which users use your bots, and it will also decrease the long tail of responses that you get from people, because they'll have a better understanding of how to talk to it. What you don't want to do, if you can avoid it, and I see this kind of thing all the time, is have a user come in and the first thing they have to do is, like, "Great, go download our app and log in, and then we'll know things about you and we'll be able to give you more value," which may or may not be true.
But it doesn't set a good mental model for that user about your bot's capabilities, whether it's learnable, whether it's easy to use, what it actually does, anything like that. And a really good place to do this is in your first prompt. This guidance serves for any prompt that you write, but it's particularly important for the first one. We often see very broad opening prompts: "How can I help you?" Because we want to show that we can handle a lot of things and that it's smart. But besides the fact that we often can't handle all the things that users will say, it also creates a little bit of pressure, a deer-in-the-headlights feeling, for users. When users have to speak and formulate their own requests in this very open-ended way, they tend to hesitate for a minute because they're not sure what to say or how to say it. And what that results in, besides the cognitive pressure that they have, is errors, because it's likely the mic will time out, or they will start to say something and you'll recognize that and get a misrecognition, and you'll have to re-prompt. So constraining it a little bit is good.
You don't want to go too far in the other direction, which is this other example. You probably can't read it back there, but it's where you say, like, "Oh, hi, welcome to Yarn Finder. I can give you information about micro measurements, qualitative assessments, and how texture and ply affect results. You can also find out blah. So which one would you like?" Obviously, that's too much. So a nice middle ground is to really think about what's the primary thing that you want your users to be able to do, and guide them along that path. So in this fake example: "Hi, welcome to Yarn Finder. I can help you get just the right wool for your knitting project. So what are you planning to make?" It's a very easy question for people to answer. You'll probably get a lot of "I don't know"s or "I'm not sure," but they'll know what to say and how to respond to that, and they'll also have an idea of what your app is there for and what it can do. And just a little note on these sorts of prompts: I get a lot of pushback.
We get a lot of pushback as conversation designers, like, "We don't want to give people menus or things that sound like menus, because it's unnatural." And we'd just like to make the point that actually this stuff happens all the time in real life and real conversation. It's not seen as unnatural or unhuman; it's actually seen as very helpful for someone to know. So again, just by basing things on human conversation and sort of verbal storyboarding, you'll see pretty quickly that these are very natural.
Context and personalization. Five minutes? Okay. Context and personalization. So if you have relevant context and data to bring to bear on your interaction, you'll increase every success metric that you want, whether it's a success metric that you're looking at in the interaction itself or a success metric in terms of larger KPIs and business goals. We often talk about personalization as, well, you have to really understand the customer that you're talking to and you want to know about their patterns, and that's ideal.
But, you know, we don't have all that data all the time. You can often do things just based on what you know about customers generally, or how your business works and how people interact with it, that you can bring to bear. So for example, when I was working a while back with an airline, we were looking at how to make it easier for people to tell us what they're calling about, but also, if they were calling about a current reservation, for us to collect their reservation number, because that's a really difficult ASR problem. And we discovered that one of the ways the airline tends to think about the customer life cycle is: are they pre-reservation, do they have an upcoming trip, are they in the middle of a trip, are they post-trip? And if you just apply that way of thinking to the question that you ask up front, you can ask a much better question that will be easier for users to answer, will be much more relevant to them, and will capture a lot more easily what they're calling about. Instead of starting with "How can I help you?", if you say "Are you calling about your trip to San Francisco?" and I have an upcoming trip to San Francisco, there's a huge chance that that is the reservation I'm calling about. So you can already constrain things based on that, and it doesn't require a lot of data about that person specifically. It just requires knowing where they are in this more general customer life cycle.
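The life-cycle idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Trip` record, the stage names, and every prompt other than "Are you calling about your trip to San Francisco?" are hypothetical and do not come from the airline project described in the talk.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Trip:
    # Hypothetical reservation record; a real system would hold much more.
    destination: str
    departs: date
    returns: date


def lifecycle_stage(trip: Optional[Trip], today: date) -> str:
    """Place the caller in the four-stage customer life cycle."""
    if trip is None:
        return "pre_reservation"
    if today < trip.departs:
        return "upcoming_trip"
    if today <= trip.returns:
        return "mid_trip"
    return "post_trip"


def opening_prompt(trip: Optional[Trip], today: date) -> str:
    """Pick a constrained opening question instead of 'How can I help you?'."""
    stage = lifecycle_stage(trip, today)
    if stage == "pre_reservation":
        # Illustrative wording only.
        return "Are you calling to book a trip?"
    if stage == "upcoming_trip":
        return f"Are you calling about your trip to {trip.destination}?"
    if stage == "mid_trip":
        return f"Are you calling about your current trip to {trip.destination}?"
    return f"Are you calling about your recent trip to {trip.destination}?"


if __name__ == "__main__":
    trip = Trip("San Francisco", date(2019, 5, 25), date(2019, 6, 3))
    print(opening_prompt(trip, date(2019, 5, 20)))
```

The design choice worth noticing is that nothing here depends on the caller's personal history. The stage is derived from the reservation dates alone, which is the kind of general business knowledge the talk suggests leaning on.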
And then the final point is just: invest in answering the right question, which means invest in the NLU part, which is a difficult part that you've heard other speakers talk about. When you have a visual interface, you can often understand the kinds of things that users are going to say, but you can bucket them less granularly, because with a visual interface you can answer several questions with one card or UI, since it stays on the screen and people can process it as they need to and come back to it. But when you have a verbal response, that's really difficult to do. You wouldn't want to say all of the operating hours for Black Sheep for every day.
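As a rough sketch of what answering the specific question can look like for a spoken opening-hours reply, the Python below answers only "Is it open?" by combining a weekly schedule with the current day and time, rather than reading out every day's hours. The `WeeklyHours` structure, the function names, and the sample schedule are assumptions for illustration, not how any Google product is implemented. Closing times that run past midnight are not handled.

```python
from datetime import date, time
from typing import Dict, Optional, Tuple

# weekday (Monday=0) -> (opens, closes), or None / missing when closed all day
WeeklyHours = Dict[int, Optional[Tuple[time, time]]]

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def spoken_time(t: time) -> str:
    """Render a time the way it would be spoken, e.g. '5 p.m.'."""
    hour = t.hour % 12 or 12
    suffix = "a.m." if t.hour < 12 else "p.m."
    if t.minute == 0:
        return f"{hour} {suffix}"
    return f"{hour}:{t.minute:02d} {suffix}"


def answer_is_open(name: str, hours: WeeklyHours, today: date, now: time) -> str:
    """Give one short, context-aware answer to 'Is <name> open?'."""
    todays = hours.get(today.weekday())
    if todays is not None:
        opens, closes = todays
        if opens <= now < closes:
            return f"Yes, {name} is open until {spoken_time(closes)}"
        if now < opens:
            return f"{name} opens today at {spoken_time(opens)}"
    # Closed for the rest of today: look ahead for the next opening day.
    closed_scope = "today" if todays is None else "now"
    for offset in range(1, 8):
        weekday = (today.weekday() + offset) % 7
        upcoming = hours.get(weekday)
        if upcoming is not None:
            day = "tomorrow" if offset == 1 else f"on {DAY_NAMES[weekday]}"
            return (f"Actually, {name} is closed {closed_scope}, "
                    f"but they open {day} at {spoken_time(upcoming[0])}")
    return f"{name} has no opening hours listed."


if __name__ == "__main__":
    # Hypothetical schedule: closed Thursday, open Friday 5 p.m. to 11 p.m.
    schedule: WeeklyHours = {3: None, 4: (time(17, 0), time(23, 0))}
    print(answer_is_open("Black Sheep", schedule, date(2019, 5, 23), time(12, 0)))
```

A separate intent such as "What time does it close?" would get its own equally short answer function, which is the granularity trade-off this part of the talk is about.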
I've seen plenty of this; it happens all the time. You ask something like "When is my flight departing?" and you get, you know, "United Airlines flight 173 departs New York-JFK...", a whole bunch of information. And because the speech signal is serial and ephemeral, it takes a lot of processing power to sit there and parse through all the stuff that's completely irrelevant to you to get to the one piece of information you do need. So instead of bucketing all of these, in this example the questions around opening hours, into a single intent, when you're doing a verbal bot you want to get more granular in the intent, and then you also want to look at the context, probably, in order to give a really specific answer. So in this case, this is the answer that you get right now to "Is Black Sheep open?": "Actually, it's closed today, but they open tomorrow at 5 p.m." That requires both knowing exactly what the question was and also knowing the context, like today is Thursday and they're going to be open tomorrow, right, knowing that today-tomorrow kind of thing.
So the message: do simple things and do them well. Lean into convenience, remove obstacles, use your data, your business's data, and do the work to be specific in your answers. And if you want some additional design tips, we have a website, actions.google.com/design. You can go check that out and get in touch with us that way. Thank you.
All right, questions from the audience? Go ahead and shout it as though you're talking to Alexa. Did you get that? See, that's a problem with voice commands: if it's not heard, if it's not loud enough, it doesn't work.
Can you rephrase it, but speak into the microphone? Can you comment about the maturity of telephony-to-voice-bot platform integration at an enterprise level? How do you connect the people who are calling in to the call center? Yeah, what is the maturity level, where the telephony system can hand over to the voice bot platform, and the bot platform can take over the conversation and give it back? Are we there at an enterprise level? Is the conversation smooth?
So, I mean, that's been done in the IVR world for a long time. I mean, in the call centers themselves, in this world, I think it's still early days for that kind of thing for enterprises. I'm not sure who does use that kind of thing. But the way that we think about it is that, again, if there is certain information that you want to collect up front, or that you would need to know in order to make a conversation with a human smoother, you just want to make sure that you also pass that data in, and pass it in a consumable way, so that the person on the other end doesn't have to recreate the entire conversation but can actually leverage that. Did I answer your question? Sorry, maybe I misunderstood the question.
So this would be a good one for a talk over drinks, I think, because it's kind of a weird niche question. I'm with my friend here in the green shirt. He likes to ask questions and he's got another one for you.
Sure. Security. One thing that's annoying about IVR, and it's kind of an overlap between voice and IVRs, is when they don't give you the option to tap and you have to speak into the IVR. And you had mentioned, sorry, completing the journey. If that journey includes checking out or purchasing, how do you handle PII?
Yeah, good question. So right now, for all of the ones that I know of, you have to be logged into that app, and so you actually complete the journey on the app itself or through some other channel. I think that's in large part because people still hesitate to use voice bio as the ID and auth, although it's becoming more and more common in all channels. So I'll be curious to see how that goes. It's kind of an industry-by-industry decision, but it's the same problem, you're right, that we've had before.
Okay, last question to the almost-last speaker. Are you ready? I have, like, two questions. So the first one is: I know last year Google showcased a pretty impressive case where basically a bot was able to call a restaurant and reserve a spot, and the person, I guess, was not able to identify it. Was that a specific case, or is Google rolling out a series of commercial applications related to that? The second question is: I know you mentioned word error rate is pretty low, but somehow when I interact with Alexa, it's still not able to, you know, talk to me or engage with me in a conversational way.
So I'm not a voice expert, but what am I missing here? Thank you.
So for your first question, I don't work directly on Duplex, so I can't comment too much on it. There was an announcement at I/O this year about Duplex being extended to web applications for things like car reservations and stuff. So you can check that out and kind of see what the publicly announced plans are there. For your second question, word error rate is just a speech recognition metric. It's not a metric of natural language understanding or of conversation. Natural language understanding is a different underlying technology that has to do with actually understanding the intent of what you're asking for, and then there's the whole dialogue management piece, like what to do with that. And even where the technology for natural language understanding is pretty good, there's a huge learning curve to do it well. You need to understand your users, you need to understand data, and you need data to come back in. People mentioned a couple of times before that no bot is perfect when you first launch it, because you need that feedback loop in there. So anytime we see that things go wrong, the layperson response is "It didn't understand me," but when you actually dig in, the problem could be at many layers of technology or design.
So yeah, the word error rate is by no means a metric of successful interactions. It's just a metric of successful recognition. Thank you. No, thank you.
Thank you. Thank her with your hands, please.