Natural Language Generation for Conversational AI with Sander Wubben
Okay, welcome everyone. This talk will be a bit different, I guess, than the talks we just saw: a little change of pace, no marketing, but more a deep dive into NLP and NLG. My name is Sander Wubben, and I'm a co-founder of Flow.ai. A little bit about my background: I studied computational linguistics, which is another word for natural language processing, then obtained a PhD in natural language processing, and more specifically natural language generation, and from 2012 until now I've been connected to Tilburg University as an assistant professor. And as I mentioned, I'm also a co-founder of Flow.ai. Flow.ai is actually headquartered in the Netherlands, and we are a conversational AI platform.
Basically, we have a nice drag-and-drop interface, but we are powered by powerful NLP, and we combined that into a platform, and that is Flow.ai. Let me skip this. Basically, how it works, as you've seen with a lot of the other platforms, is that we sit between the chatbots and the channels, so we are a multi-channel platform. You can connect to Facebook Messenger, your own website, your app, Google Home, Alexa, WhatsApp, RCS, all these kinds of channels; we support them all, and we also support handover scenarios. It's been mentioned once before today, but I think it can't be mentioned enough: a lot of customer service chatbots can only really be successful if you combine them with human agents who can take over if the bot doesn't know the answer, or if more complicated scenarios are required.
So you can design this, train the AI, connect it to channels, connect it to services like a dashboard, and basically have the complete solution in our platform. Today I wanted to talk a little bit about types of chatbots and the kind of technology that you could use to build the chatbots we might see in the future. So if you look here, we see a couple of different NLP-based chatbots. Luckily, today all chatbots use some kind of NLP, I guess, unless it's only buttons and images. If you look on the left side of this diagram, you see the retrieval-based chatbot, and I think most of the chatbots in existence right now can be considered retrieval-based. What I mean by retrieval-based is that you have some kind of query, you do a classification on the query, and you try to find the intent that matches it best; then you perform some kind of action based on that intent. Now, of course, if your bot is in an open domain, that's really hard, or actually impossible, because you can never have scenarios for everything that can ever occur, or everything that someone can say. If you are in a closed domain, it's a lot easier: if your bot is a recruitment bot, or works in marketing or customer service for a specific brand, that's a closed domain, and there you can handle the incoming questions best. Then, on the right-hand side, you see the generative chatbot, and that's something I want to talk about today.
The generative chatbot doesn't do classification, so it doesn't try to find the best matching intent; it actually generates a response based on the input that it gets. I would say it's more human-like, because it's more like the way we do it: if I respond to someone who talks to me, I generate my response, right? I generate text in real time, basically. And of course, in the future we would like to have chatbots that can also do this, because if you have bots that can do that, you can do personalization, you can have more human-like conversations, and have fewer of the canned responses that we usually have now. So let's look at what the architecture looks like. If you look at a complete voice experience, you have a person who generates some kind of audio wave, and some kind of smart speaker, like a Google Home, that records this speech. It translates the speech into text, then some kind of machine learning model does understanding on the text, some kind of management system does the dialogue management, a response is generated in some way, that response is transformed into speech, and then the bot can respond; then you come full circle when the human responds, et cetera. Now, of course, if you have a text chatbot, you can ignore the speech parts, and you only have natural language understanding, dialogue management, and response generation.
And that's basically what this is: you have some kind of input from the user, you have a natural language understanding module, and then you know the intent of the query you just saw. You need to do some kind of state tracking, so you need to know where in the conversation you are. If you say "yes", okay, we all know that's an acknowledgment, but what are you actually saying yes to? You need to look back at the previous part of the conversation to see what was actually said before. If you do that, you know the user intent, for instance that the user wants to book a flight, and then you have a certain set of rules, or some kind of system, that checks what the policy is for that intent. You connect to your own systems, your databases, your apps and whatnot, and then you can respond with some kind of action, generate the response, and you're back at the user. If you take it one step further, this is what it would look like in a full system. So you say "I want to book a flight", for instance.
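As a rough illustration, the NLU, state tracking, and policy loop just described might look like this in code. Everything here is invented for the sketch: the intents, entities, and rules are toy examples, not Flow.ai's actual implementation.

```python
def understand(utterance):
    """Toy NLU: map an utterance to an intent and extract crude entities."""
    text = utterance.lower()
    if "book" in text and "flight" in text:
        # Naively treat capitalised words as candidate destinations.
        entities = {"destination": [w for w in utterance.split() if w.istitle()]}
        return "book_flight", entities
    if text.strip() in ("yes", "yeah", "ok"):
        return "acknowledge", {}
    return "unknown", {}

def policy(state):
    """Toy policy: decide the next action from the tracked dialogue state."""
    if state["intent"] == "book_flight":
        if state["entities"].get("destination"):
            return "call_booking_api"
        return "ask_destination"
    if state["intent"] == "acknowledge":
        # An acknowledgment only makes sense relative to the previous turn,
        # which is why the state tracker keeps the history around.
        return "confirm_previous"
    return "handover_to_human"

def handle_turn(utterance, history):
    """One pass through the pipeline: NLU -> state -> policy -> action."""
    intent, entities = understand(utterance)
    state = {"intent": intent, "entities": entities, "history": history}
    action = policy(state)
    history.append((utterance, action))
    return action
```

In a real system, the rule-based `policy` could be replaced by a neural network trained on historical conversations, as mentioned below.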
You do the intent classification, figure out what the user is saying, and extract the entities, like the place you want to fly to, et cetera. Then you can have some meta information: what kind of channel are you on, what is the time, what kind of information do you have about the user? And then you have some kind of system that does the rest, a policy manager, basically. This can be a rule-based engine, or you can have some kind of neural network that makes decisions based on historical data. So you predict an action and you take the action, and it can be an API call, or sending a response, or whatever. This is basically what chatbots look like if you consider the entire architecture, and of course there can be differences in certain components, but this is usually how it works. So yeah, one of the main components of a chatbot is natural language understanding, and that's a classification task. So here you see images, and the task is:
which of these are bagels, and which of these are dogs? Does anyone know what this one is? Is it a dog? Who thinks dog? Who thinks bagel? Okay, good. And this one is a dog, et cetera. We know, of course, that this is a dog and this is a bagel because we see certain features, like a nose or an ear, or something like smoothness or color, and these are the features that help us classify these images. That's also how computers do it: they need certain features to do classification. And for text it's basically the same. These are some utterances coming into your chatbot, and these are some others, and you want to classify this one as intent one and that one as intent two, so you need to find the distinction there, and this is a typical machine learning problem. You give it the training data, you give it labels, you give it utterances, and you train the system; then, when it gets a new utterance, it makes a prediction based on the features it extracts and labels it with the right intent.
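The classification task just described can be sketched in a few lines. This is a deliberately minimal bag-of-words classifier with made-up training utterances; real NLU systems use embeddings, proper learning algorithms, and far more data.

```python
from collections import Counter

# Hypothetical training data: a few labeled utterances per intent.
TRAINING = {
    "travel_advice": [
        "i want to travel from new york to amsterdam",
        "book me a trip to paris tomorrow",
    ],
    "greeting": [
        "hello there",
        "hi how are you",
    ],
}

def featurize(utterance):
    """Features here are just word counts; richer features work better."""
    return Counter(utterance.lower().split())

def classify(utterance):
    """Pick the intent whose training examples share the most words
    with the new utterance (a crude stand-in for a trained model)."""
    feats = featurize(utterance)
    def overlap(intent):
        # Counter & Counter keeps the minimum count of shared words.
        return sum((featurize(ex) & feats).values() for ex in ())  # placeholder
    def score(intent):
        return sum(sum((featurize(ex) & feats).values())
                   for ex in TRAINING[intent])
    return max(TRAINING, key=score)
```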
So this is how all the text-based chatbot classification systems work. That's intent classification. Then you have something like entity extraction, where you extract entities: if you say "I want to travel now from New York to Amsterdam", maybe you have entities like the date, the time, the departure and the destination, and an intent that is travel advice. And of course you can say it in all kinds of different ways, which is why you need to train your chatbot with lots of examples. So that's natural language understanding, and that's, I think, pretty well defined and pretty well known by now; certain technologies, like word embeddings, have helped us do this better. But now I want to talk a little bit about natural language generation, which is generating a response on the fly.
We have one good example, I think from Microsoft; there are Microsoft people here, I think, right? This was the chatbot Tay, which was released on Twitter at some point. The fun thing at the time was that you could actually talk to it and it would learn from the dialogues. That means that people could actually train the chatbot online. This chatbot was removed pretty quickly, because it started giving all kinds of racist responses to people, et cetera. And the reason was, of course, that if you let people on the internet do stuff, then at some point they are going to try to break it.
So lots of people were actually teaching this chatbot to be racist and to make all kinds of nasty comments.
And of course you can learn dialogues from all kinds of data: film scripts, chat data, or Twitter data. But yeah, if you put garbage in, you get garbage out, so it's very important to know what kind of data you enter into your system. So how does it work? I won't go too much into details, but the technology behind these kinds of systems that learn to generate text is neural networks, and more specifically recurrent neural networks. Recurrent neural networks are neural networks that work in such a fashion that they take their own output as input for the next step.
So it's a recurrent process with several different time steps. At each time step you make a prediction, and at the next time step you take that prediction and make a new prediction. That's basically how we generate language ourselves, because if you think about language, it's something that occurs through time, where every word that I say depends on the words before it. So you enter something like "how are you" into the network, the network learns some kind of representation in the encoder, and then the decoder tries to decode it into a response.
So "how are you" is encoded into some kind of representation, and then it is decoded into "I am good", which would be the response. This is trained by giving it lots of examples of parallel data: an input and an output.
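To make the recurrence concrete, here is a bare-bones recurrent step in pure Python. The weights are random and untrained, so this only illustrates the shape of the computation (the hidden state feeding back in at every step, ending in a fixed-size representation), not a working encoder.

```python
import math
import random

random.seed(0)
HIDDEN = 4  # size of the hidden state, chosen arbitrarily for the sketch

def rnn_step(x, h_prev, w_x, w_h):
    """One recurrent step: h_t = tanh(w_x * x_t + w_h . h_prev),
    with a single scalar input per time step."""
    return [math.tanh(w_x[i] * x +
                      sum(w_h[i][j] * h_prev[j] for j in range(HIDDEN)))
            for i in range(HIDDEN)]

def encode(sequence):
    """Run the recurrence over the whole input sequence; the final hidden
    state is the representation a decoder would start generating from.
    In a real model the weights are learned, not sampled like here."""
    w_x = [random.uniform(-1, 1) for _ in range(HIDDEN)]
    w_h = [[random.uniform(-1, 1) for _ in range(HIDDEN)]
           for _ in range(HIDDEN)]
    h = [0.0] * HIDDEN
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h)
    return h
```

A trained decoder would then run the same kind of recurrence in reverse, emitting one word at a time and feeding each emitted word back in as the next input.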
So "how are you" paired with "I am good": those pairs would be the training data for such a system. We did an experiment to see if we could actually use such a system to train a model that generates text. This kind of system is actually used by Google Translate, for instance: they used to have another system, but in the last couple of years they've completely changed it to this recurrent neural network architecture. We tried to do something similar: we tried to generate text. The idea was that we wanted a system that could do compression in an abstractive way. Compression means basically shortening a sentence.
You have a long sentence and you want to generate a short sentence, and one way of doing this is by removing words from the sentence, because then it becomes shorter. But again, think about how humans do this: if I ask you to summarize, for instance, Game of Thrones, you're not just going to take words from the book and make a compression out of those; you do an abstractive compression, where you make a summary in your own words. So we took a dataset that had this kind of abstraction, and that is the MS COCO dataset, which is used a lot in vision research. This dataset contains pictures, each with a number of sentences that describe the picture, and the nice thing is that you have longer sentences and shorter sentences describing the same picture, so you can actually treat the longer and shorter sentences as pairs, where the longer one would be the original sentence and the shorter ones would be its compressions. If you put this into your system, you would hope that it would actually learn to generate shorter sentences. We trained the model with around one million sentences, and then we just looked at what the output was. So here is the actual output of the system: in blue are the originals, so these are the original sentences, and these are the generated outputs.
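The pairing of longer and shorter captions described a moment ago could be sketched roughly like this; the caption data in the test is invented for illustration, and real preprocessing would involve tokenization and more careful selection.

```python
def make_compression_pairs(captions_per_image):
    """For each image's caption list, pair the longest caption with the
    shortest one, yielding (source, compression) training examples for
    a sequence-to-sequence model."""
    pairs = []
    for captions in captions_per_image:
        ranked = sorted(captions, key=lambda c: len(c.split()))
        # Only keep the pair if there genuinely is a longer and a shorter one.
        if len(ranked) >= 2 and len(ranked[-1].split()) > len(ranked[0].split()):
            pairs.append((ranked[-1], ranked[0]))
    return pairs
```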
So here you see the sentence "a man flipping in the air with a snowboard above a snow covered hill", and the generated sentence is "a snowboarder is doing a trick on a snowy slope". That seems to be a compression of the original, and you also see that the orange part is really an abstraction: "a man flipping in the air with a snowboard" and "a snowboarder is doing a trick" mean the same thing, but they are completely different, and this actually shows that the system has learned to grasp the meaning of a sentence and then abstract it. There was also some other interesting stuff. Here you see "a table with three place settings with meat, vegetables and side dishes on it" summarized into "a tabletop with a plate of food and a glass of wine". This glass of wine comes out of nowhere. Does anyone have any idea why the system would add a glass of wine? Any idea? Because wine is delicious? Yeah, that's true. I guess the answer is that if you look at pictures of food, if you go to Instagram and look at how people photograph their food, you will see that a lot of the time there is a glass of wine on the side, because wine is delicious. So the system has learned that a glass of wine often occurs together with food, and it has sort of hallucinated this glass of wine.
That was also interesting to see, and, if you think about chatbots, an interesting concept to think about. The other one was a very weird one, where "a woman is leaning over the toilet while her arms are inside a lawn and garden trash bag" became "a woman is cleaning the toilet in a park". Not really, but I can sort of see where it got its inspiration from. So I'll skip this. We did some evaluation and we saw that it worked pretty well. In fact, we had a baseline system, we had the neural network, and we had the human-generated compressions.
Actually, we didn't see too much difference between the neural network and the human: people were not really able to distinguish between the two. So then, on towards chatbots, a step further, basically. We wanted to do this but generate responses as a chatbot would, and for that we also needed data. I already mentioned that you could use data from movie scripts or whatever, and we chose to use data from Reddit. This is a big online community where people can post all kinds of links and then comment on them, and our idea was that every pair of comments could be seen as a sort of micro-conversation.
Someone says something and someone replies. The task is then to try to generate a reply to any given input. These are not really full conversations, but just one-turn conversations, because you have to start somewhere. We used a lot of data for that: 7.5 million pairs from 2015. And the nice thing about Reddit is that you can give scores to comments.
So we filtered on a karma score higher than 5, which meant that hopefully the comments had some quality to them, and we only took shorter comments. We also did an evaluation on this, but for that we only took sentences with a very generic meaning, so not the very niche things that you usually see on Reddit, like people discussing movies or television shows or games, but only generic ones, and there we saw that the neural model actually didn't do too well. I think that has something to do with the fact that, as I already said in the beginning, these open-domain systems are very hard to make.
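The filtering step just mentioned could be sketched as follows. The field names (`body`, `score`) are assumptions for the sketch; the real Reddit data dump uses its own schema, and the exact length threshold used in the experiment isn't stated here.

```python
def filter_pairs(pairs, min_score=5, max_words=30):
    """Keep only (parent, reply) comment pairs where the reply's karma
    score is higher than min_score and both comments are short."""
    kept = []
    for parent, reply in pairs:
        if (reply["score"] > min_score
                and len(parent["body"].split()) <= max_words
                and len(reply["body"].split()) <= max_words):
            kept.append((parent["body"], reply["body"]))
    return kept
```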
As we will see, it works a lot better on closed domains. So here are some examples of outputs; I hope they are readable. Gray is the input and blue is the output, and what we saw is that the system actually worked really well on very specific concepts. One of them was memes, because they occur a lot on Reddit. Another one is television shows: Game of Thrones was actually one that the system was very knowledgeable about. So here's one on Game of Thrones. The input was something like "the timing was right in the show to tell her she isn't half as clever as she thinks; it's the worst when you think you're way smarter than you are."
This is about the character Cersei, and the reply that it generated was "I think she's just a bit of a self-centered personality", which is kind of accurate, I think. It also learned lyrics from songs: if it got the input "baby don't hurt me", it would output "no more". Television shows again, this time Top Gear: "some say he's still looking for his pocket to this day; all we know is he's called The Stig" was the generated output. And a lot of memes; I think this is a meme that it actually learned: "how much would a collection like this be worth?", answered with "about three-fitty", which is a well-known meme. So yeah, we saw that this neural network is very good at generating output for very specific domains that occur very frequently, in this case memes, television shows, and games like League of Legends, which was popular at the time. So, this is one of the things that has been gathering headlines lately; I saw it come by earlier: OpenAI built this text generator that was in the news.
The news was that it was considered too good, or too dangerous, to release. I think that is a bit of an overstatement, a bit of a marketing thing. We are on our way to doing nice things in natural language generation, but I think we're not there yet. So, how can we use it? One of the problems is that you can't really put this technology straight into production.
If you have a client who wants to do stuff like this: as we saw in the Microsoft example, you don't want output that you can't predict beforehand. So what you can do, and how we use it, is to help the user. If you have to train a chatbot, you need to give lots of examples, and one of the ways we actually use it is to help the user by generating training examples. Here we took data from our own platform.
So, 50,000 English intents, and we aligned pairs of examples within intents to be able to paraphrase input sentences into output sentences, which helps users give more input for the chatbot. This is what it would look like: if you enter "can I order a pizza" as an example of an intent, the system will then generate paraphrases, other ways you can say this, so you can add additional examples for your chatbot. And this is what it looks like in the platform. You say something like "no never", so this is typically an intent for negative inputs, and then the system just generates some examples, and you can add them if you like. This really helps if you have writer's block when coming up with examples. So yeah, to wrap it up.
I think NLG at this moment cannot really be put in production, because it's not good enough: you can't really predict what it will output. But it can help you in training your bot, and it can help, for instance, in assisting human customer service agents by giving them suggestions. And I think we will move towards something like rich response generation, where you can for instance also generate buttons or images, these kinds of things. And that was about it. So, thank you. Thank you. Questions, comments, concerns? Okay, a couple.
This is great. When you're done, pass the mic. Just a quick, simple question: where do you see the technical hurdles for NLG? Is it because the technology is not good enough and we need better algorithms, or is it simply because we don't have enough data? And, well, I guess you mentioned it, but can you perhaps give some successful commercial cases, if there are any? Okay. Yeah.
So the question is: what are the hurdles, the roadblocks; why can't we really use this yet? I think one thing is data, because of course some companies have a lot of data, but not all companies do, and the format of the data is also a thing, because you need to take into account which examples are good and which are not. Another important part is that this technology right now is not so good at remembering larger parts of the conversation: it can do things pretty well if you only have one-turn conversations, but with multiple-turn conversations it gets really hard. And another thing is that it tends to over-generate the very frequent examples.
So it tends to generate the things it has seen most often.

I was wondering, what's the minimum amount of data that you need for this type of technique? For example, could a small business use their Zendesk data, something like that; would that be enough?

So, how much data would you need to train a generative model, the minimum amount? Yeah, that's really hard to answer, because it depends on what you want to do. If you have one process or one flow you want to automate, and there's not much variation in it, you don't need that much data. But if there's lots of variation, lots of different flows, lots of different processes, then the amount of data gets bigger and bigger. I think you would need to think in terms of millions of examples to get something that's pretty decent. But again, then you can still only generate these one-turn, or maybe two-turn, things, and that's about it.
Okay, so a couple of hundred or a couple of thousand is probably not enough, right? It depends, yeah.
I can't really answer it without knowing the specifics. Alright, if you have other questions, definitely attack this guy outside; I mean attack in the nicest way, right, like you'd approach a bot conversation. Ladies and gentlemen, give it up for Sander Wubben, who has a lot of information for you.