In this lesson, we’ll give our bot a large input of past text that we’ve written (essays, other tweets, etc.) and, using Markov chains, have it create tweets that sound like us!
For more information about Markov chains, see Markov Chains explained visually: http://setosa.io/ev/markov-chains/
The RiTa library is a powerful library for working with text and text generation. See the reference here: http://rednoise.org/rita/reference/index.php
[00:00] In addition to the Twit library, we'll also need fs, because we'll be working with the file system. We'll need csv-parse, and we'll need the RiTa library, which is an awesome library for working with text.
[00:15] We'll also need some input text. To create a bot that sounds like ourselves, we're going to be using what are called Markov chains. Markov chains are a way to generate text based on the probabilities of the words that came before.
[00:34] To show an example, we can say var markov = new rita.RiMarkov(). We'll pass it a number here, and this is the n-gram size, the number of words the model is going to take into consideration. Let's have it consider three words to start. We can say markov.loadText() with our input text, and we can say var sentences = markov.generateSentences(1) to have it generate one sentence, and then we'll log those out.
[01:16] I'm actually going to add a little more text here. If we run this, what happens?
[01:23] Let's run it a few times and get a sampling of sentences. This tells us a lot of things. We can see that all of our sentences start with one of our sentence starts in the input text, so I, the car, or safe. We can see that there's actually a ton of overlap between our input text and our generated sentences. The way this works is, the Markov chain is looking at previous words to decide the next one. Let's look a little deeper into this.
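To make that "looking at previous words to decide the next one" concrete, here's a minimal, library-free sketch of the idea (this is not RiTa's implementation or API; the helper names are hypothetical): build a table mapping each word to the words that followed it in the input, then walk the table, picking a random follower at each step.

```javascript
// Build a table: word -> list of words that followed it in the input.
function buildChain(text) {
  var words = text.split(/\s+/).filter(Boolean);
  var chain = {};
  for (var i = 0; i < words.length - 1; i++) {
    (chain[words[i]] = chain[words[i]] || []).push(words[i + 1]);
  }
  return chain;
}

// Walk the table from a start word, choosing a random follower each step.
function generate(chain, start, maxWords) {
  var out = [start];
  var current = start;
  while (out.length < maxWords && chain[current]) {
    var followers = chain[current];
    current = followers[Math.floor(Math.random() * followers.length)];
    out.push(current);
  }
  return out.join(' ');
}

var chain = buildChain('I went to the store. I went bowling.');
console.log(generate(chain, 'I', 6));
```

Because every generated word is drawn from the follower lists, the output can only ever contain words (and word pairs) that appeared in the input, which is exactly why the generated sentences overlap so heavily with the source text.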
[02:06] RiTa gives us a bunch of cool features for digging into our Markov chains. If we say markov.getProbability('went'), we can see that the probability of the word went being the chosen word is 0.14.
[02:23] Even more interestingly, we can say markov.getProbabilities('went'), and then we can see exactly what it thinks the next word should be. Two-thirds of the time, the next word should be to, and one-third of the time, it should be bowling.
[02:41] If we look at to, we can see that to is followed by the 100 percent of the time. If we look at the, we can see there are three options, car, grocery, and store, each a third of the time. If we search a word like bowling, that can only be followed by behind, 100 percent of the time according to our input text. What this should tell us is that we need a lot of text in order to create a Markov chain that can generate interesting and varied sentences. I'm going to clear this. To create a Twitter bot that sounds like you, you need a lot of text that you've written.
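A library-free analogue of getProbabilities (again, not RiTa's API, just the underlying idea) is to count which words follow a given word and divide by the total. The input string below is hypothetical, loosely modeled on the numbers in this lesson:

```javascript
// Count the words that follow `word` in `text` and return the ratios.
function nextWordProbabilities(text, word) {
  var words = text.split(/\s+/).filter(Boolean);
  var counts = {};
  var total = 0;
  for (var i = 0; i < words.length - 1; i++) {
    if (words[i] === word) {
      counts[words[i + 1]] = (counts[words[i + 1]] || 0) + 1;
      total++;
    }
  }
  var probs = {};
  for (var next in counts) probs[next] = counts[next] / total;
  return probs;
}

// Hypothetical input consistent with the lesson's numbers:
var input = 'I went to the car I went to the grocery I went bowling behind the store';
console.log(nextWordProbabilities(input, 'went'));
// went -> to two-thirds of the time, bowling one-third of the time
```

With only three occurrences of went in the input, the distribution is extremely coarse, which is the same point the lesson makes: more text gives smoother, more varied probabilities.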
[03:24] Here, we'll be working with a Twitter archive. First, let's create a path to our file. I have mine stored in twitterarchive/tweets.csv, and then we'll say var tweetData = fs.createReadStream() with our file path. We'll say .pipe(csvParse()), which takes a delimiter, which will be a comma, and we'll say .on('data', function(row)) and we'll just log out that row for now. Let's run this.
[04:14] Here's one row, and what we really want is just the tweet text, so that's index five: zero, one, two, three, four, five. There are all our tweets. The next thing we want to do is to clean up the text a little bit. We'll make a function called cleanText which will take a string text, and to clean the text, we're going to remove a couple of things such as handles or URLs.
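For a quick sense of what "grab index five of each row" means, here's a simplified, in-memory stand-in for the csv-parse stream. Real tweet archives contain quoted fields with embedded commas, so use csv-parse in practice; this naive split is only for illustration, and the sample rows are made up:

```javascript
// Naive CSV splitter for illustration only (no quoting support).
function rowsFromCsv(csvText) {
  return csvText.trim().split('\n').map(function (line) {
    return line.split(',');
  });
}

// Hypothetical rows shaped like a Twitter archive, tweet text at index 5:
var csv = [
  '1,2017-01-01,web,,,first tweet',
  '2,2017-01-02,web,,,second tweet'
].join('\n');

rowsFromCsv(csv).forEach(function (row) {
  console.log(row[5]); // the tweet text column
});
```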
[04:48] Borrowing a term from natural language processing, we'll call these stop words. We're also going to make a function called hasNoStopWords, which will check whether a token contains any stop words. Let's go here, and we're going to say return rita.RiTa.tokenize() with our text and a delimiter, which will be a space, so that will break up our tweet into tokens. We'll say .filter(), and this takes a function, which will be hasNoStopWords, and that will return all of the tokens that don't include stop words.
[05:33] We'll join those back together with a space, and we'll say .trim(), which will trim whitespace from the beginning and end of our text. hasNoStopWords takes a token, and we first need to create our stop words. We want to remove handles, URLs, and the RT tag, so this will remove tokens that include any of these, and we'll use the every function for this: return stopWords.every(function(stopWord)), and inside that function, return !token.includes(stopWord).
[06:20] If a stop word is included in the token, this will return false. If any of these return false, the whole expression will return false. Filter will create a new array of all the tokens without stop words, then they'll be joined together, the whitespace will be trimmed, and the clean text will be returned. Up here, we can say inputText = inputText + ' ' + cleanText(row[5]), then we can say .on('end', function()) and we can put our Markov logic in there.
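Assembled in one place, the cleaning logic looks like this. This sketch swaps RiTa.tokenize for a plain split on spaces so it runs standalone; the stop-word list matches the lesson's (handles, URLs, the RT tag):

```javascript
// Fragments that mark a token for removal: handles, URLs, the RT tag.
var stopWords = ['@', 'http', 'RT'];

// A token passes only if it contains none of the stop-word fragments.
function hasNoStopWords(token) {
  return stopWords.every(function (stopWord) {
    return !token.includes(stopWord);
  });
}

// Split into tokens, drop tokens with stop words, rejoin, trim whitespace.
function cleanText(text) {
  return text
    .split(' ')
    .filter(hasNoStopWords)
    .join(' ')
    .trim();
}

console.log(cleanText('RT @user check this out http://t.co/abc wow'));
// → "check this out wow"
```

Note that because includes does substring matching, any token containing "RT" or "@" anywhere is dropped, which is a blunt but effective filter for tweet text.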
[07:04] We'll say var markov = new rita.RiMarkov() and pass it the number of words to take into consideration, so we'll start with three. Then we can say markov.loadText() and pass it our input text, and then we can say var sentences = markov.generateSentences(1) to have it generate one.
[07:32] To have our bot tweet a sentence, we'll say bot.post('statuses/update') with our status being the sentence (let's change this variable to sentence, actually) and our callback. If there's an error, we'll log it out. Otherwise, we can log status tweeted. If we go to Twitter, we can see that a sentence has been posted.
[08:13] If you want your sentences to be a little closer to the input text, you can make this number a little higher. If you want them to be a little sillier, you can make the number of words to take into consideration a little lower.
I fed it a ton of data, and while it does mostly sound like me, I have found out I sound like an idiot ;P