I’m a programmer, data scientist, and musician. I like music generation, data visualization and sonification, natural language processing, machine learning, and storytelling in various formats. I’m currently working on TransProse, a program that translates literature (and emotional data) into music, and creating datasets for machine learning.
In this lesson, we’ll learn how to host a simple bot with Heroku. We'll learn how to create a new Heroku application and how to deploy our code to Heroku using git. We'll learn how to change our project from a web app to a worker app and how to create a Procfile. We'll also learn how to view output from our Heroku app with the heroku logs command.
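As a rough sketch, the Procfile and the deploy commands might look like this (the app name and bot.js are illustrative):

```
# Procfile: declares the bot as a worker process rather than a web process
worker: node bot.js
```

```sh
heroku create my-twitter-bot      # create a new Heroku application (name is illustrative)
git push heroku master            # deploy our code to Heroku using git
heroku ps:scale web=0 worker=1    # run the project as a worker app instead of a web app
heroku logs --tail                # stream output from the running bot
```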
Make a Twitter Audio Bot That Composes a Song Based on a Tweet - In the final bot lesson, we'll compose a ditty based on a tweet, save it as an audio file, and post it to Twitter. Because Twitter only supports uploading audio in video form, we'll learn how to create a video from the MIDI file and post it to Twitter. This is a longer video since we are going over how to create this pipeline from scratch.
We'll use RiTa to tokenize the text of a tweet and find the parts of speech.
We'll use jsmidgen to compose a tune in MIDI format.
We'll also use FFmpeg, which will help us create a video from our audio and a picture.
And we'll use TiMidity to convert our MIDI file to a WAV file.
You can use any image in place of the black image used in this video; a sketch of the full pipeline follows below.
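Here is a minimal sketch of the whole pipeline. It assumes RiTa v2's tokenize/pos API and the jsmidgen File/Track API from its README; the note mapping, file names, and image are illustrative, not the lesson's exact code:

```js
const fs = require('fs');
const { execSync } = require('child_process');
const { RiTa } = require('rita');
const Midi = require('jsmidgen');

const tweet = 'what a strange and lovely morning';
const words = RiTa.tokenize(tweet); // split the tweet into tokens
const tags = RiTa.pos(tweet);       // part-of-speech tag each token (lowercase Penn tags)

// map each token to a note: as an illustrative rule, nouns get a higher pitch
const file = new Midi.File();
const track = new Midi.Track();
file.addTrack(track);
words.forEach((word, i) => {
  const pitch = tags[i].startsWith('nn') ? 'e4' : 'c4';
  track.addNote(0, pitch, 64); // channel 0, duration in ticks
});
fs.writeFileSync('tune.mid', file.toBytes(), 'binary');

// convert MIDI -> WAV with TiMidity, then combine WAV + image into a video with FFmpeg
execSync('timidity tune.mid -Ow -o tune.wav');
execSync('ffmpeg -loop 1 -i black.png -i tune.wav -c:v libx264 -pix_fmt yuv420p -c:a aac -shortest tune.mp4');
```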
We’ll learn how to host a complex Twitter bot (a bot that requires external tools to be installed on the system) with Heroku and Docker. We'll learn how to configure our Dockerfile, using an image that already has Node installed on it. We'll use the Heroku Container Registry to deploy an app to Heroku that was created with a local container, and deploy it as a worker app.
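A sketch of that Dockerfile and deployment, where the base image tag and the installed tools are illustrative:

```dockerfile
# start from an image that already has Node installed
FROM node:10
# install the external tools the bot needs on the system (illustrative: TiMidity and FFmpeg)
RUN apt-get update && apt-get install -y timidity ffmpeg
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
CMD ["node", "bot.js"]
```

```sh
heroku container:login           # authenticate with the Heroku Container Registry
heroku container:push worker     # build the local container and push it to the registry
heroku container:release worker  # release the container as a worker app
```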
In this lesson, we’ll learn how to retrieve and tweet data from Google Spreadsheets. We'll use Tabletop.js to make this easier. More information on Tabletop can be found at https://github.com/jsoma/tabletop.
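A minimal sketch with Tabletop (the spreadsheet URL is a placeholder, and the sheet must be published to the web):

```js
var Tabletop = require('tabletop');

Tabletop.init({
  key: 'https://docs.google.com/spreadsheets/d/.../pubhtml', // your published sheet URL
  simpleSheet: true, // hand back rows as a plain array of objects
  callback: function (data, tabletop) {
    // each row is an object keyed by the spreadsheet's column headers
    console.log(data);
  }
});
```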
We’ll learn how your bot can get its list of followers, follow people, and look up friendships. We'll use Twit's GET method to get our followers at the followers/list endpoint, to get the users we follow at the friends/ids and friends/list endpoints, and to look up our friendships with the friendships/lookup endpoint. We'll also use Twit's POST method to follow someone at the friendships/create endpoint, and to send messages by posting to the direct_messages/new endpoint.
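With Twit, those calls look roughly like this (the screen names are placeholders):

```js
var Twit = require('twit');
var T = new Twit({ /* consumer_key, consumer_secret, access_token, access_token_secret */ });

// get our followers
T.get('followers/list', { screen_name: 'our_bot', count: 20 }, function (err, data) {
  console.log(data.users.map(function (u) { return u.screen_name; }));
});

// look up our relationship with specific users
T.get('friendships/lookup', { screen_name: 'someone,someone_else' }, function (err, data) {
  console.log(data);
});

// follow someone
T.post('friendships/create', { screen_name: 'someone' }, function (err, data) {
  console.log('now following', data.screen_name);
});
```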
With this bot, we’ll find the number of faces in a photo that is tweeted at us and respond with the emotions those faces are expressing, using the Google Cloud Vision API.
The Google Cloud Vision API is worth exploring, and you'll need to create an account before starting this lesson.
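A sketch of counting faces and reading their emotions, assuming the @google-cloud/vision Node client with credentials set via GOOGLE_APPLICATION_CREDENTIALS (the image path is illustrative):

```js
const vision = require('@google-cloud/vision');
const client = new vision.ImageAnnotatorClient(); // reads GOOGLE_APPLICATION_CREDENTIALS

async function describeFaces(imagePath) {
  const [result] = await client.faceDetection(imagePath);
  const faces = result.faceAnnotations;
  console.log(`Found ${faces.length} face(s)`);
  faces.forEach((face, i) => {
    // each likelihood is a string such as 'VERY_LIKELY' or 'UNLIKELY'
    console.log(`face ${i + 1}: joy=${face.joyLikelihood}, sorrow=${face.sorrowLikelihood},` +
                ` anger=${face.angerLikelihood}, surprise=${face.surpriseLikelihood}`);
  });
}

describeFaces('tweeted-photo.jpg');
```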
Tracery is a brilliant tool for more easily creating text grammars and structure. In this lesson, we’ll create a bot that tweets out tiny stories.
We'll learn what a grammar is in this context, and how to create one with Tracery. We'll first create a simple story with character, action, place, and object variables, and learn how to add modifiers. Then, we'll create a more complex one, and learn how to set variables that we want to be consistent throughout the story, such as pronouns.
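A small sketch of both ideas using the tracery-grammar package (the rules are illustrative): the [hero:#character#] action pins one character so references to it stay consistent through the story, and addModifiers enables modifiers like .capitalize and .a:

```js
const tracery = require('tracery-grammar');

const grammar = tracery.createGrammar({
  origin: ['[hero:#character#]#story#'],  // set a variable we reuse below
  story: ['#hero.capitalize# found #object.a# in the #place#. #hero.capitalize# smiled.'],
  character: ['the fox', 'the knight', 'a small robot'],
  place: ['forest', 'harbor', 'library'],
  object: ['key', 'letter', 'lantern']
});

grammar.addModifiers(tracery.baseEngModifiers); // enables .capitalize, .a, .s, .ed
console.log(grammar.flatten('#origin#'));
// e.g. "The fox found a key in the library. The fox smiled."
```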
In this lesson, we’ll create multiple functions to request, download, and save photos and data from NASA's API, and then have our bot upload these photos to Twitter and post them along with their descriptions. We'll also learn how to tweet videos using a video from NASA’s space archives.
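A rough sketch of that request-download-tweet flow, using NASA's public APOD endpoint as the example (the request package, DEMO_KEY, and response fields follow the public APOD docs; treat the details as illustrative):

```js
const request = require('request');
const Twit = require('twit');
const T = new Twit({ /* your keys */ });

const apodUrl = 'https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY';

// request the photo's metadata, download the image, then tweet it with its description
request({ url: apodUrl, json: true }, (err, res, body) => {
  request({ url: body.url, encoding: null }, (err2, res2, imageData) => {
    const b64 = imageData.toString('base64');
    T.post('media/upload', { media_data: b64 }, (err3, data) => {
      T.post('statuses/update', {
        status: body.title,
        media_ids: [data.media_id_string]
      }, () => console.log('tweeted!'));
    });
  });
});
```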
In this lesson, we’ll give our bot a large input of past text that we’ve written (essays, other tweets, etc.) and, using Markov chains, have it create tweets that sound like us!
For more information about Markov chains, see Markov Chains Explained Visually: http://setosa.io/ev/markov-chains/
RiTa is a powerful library for working with text and text generation. See the reference at https://rednoise.org/rita/
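A minimal sketch, assuming RiTa v2's Markov API (the input file name is illustrative):

```js
const fs = require('fs');
const { RiTa } = require('rita');

const markov = RiTa.markov(3); // n-gram length: how closely output follows the source text
markov.addText(fs.readFileSync('my-old-writing.txt', 'utf8'));

// generate two sentences that (statistically) sound like us
console.log(markov.generate(2).join(' '));
```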
We’ll learn how to search tweets with the Twitter Search API, using Twit's GET method with the search/tweets endpoint. We'll use the query and count parameters to set the search term(s) and the number of tweets we want back. We'll learn how to get exact phrases, multiple words, one of several words, emoticons, hashtags, photos/videos, and URLs, and how to remove words from our results. We'll learn how to implement the safe filter and how to filter by media, website, or date. We'll learn how to get recent results, popular results, results by location, and results by language.
The Search API returns results based on relevance, not completeness. If you want all the tweets matching a search term, you should use the Streaming API (which we'll go over in the next lesson).
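A search call with Twit looks roughly like this (the query is illustrative):

```js
var Twit = require('twit');
var T = new Twit({ /* your keys */ });

// exact phrase, excluding retweets, recent English-language results only
T.get('search/tweets', {
  q: '"machine learning" -filter:retweets',
  count: 100,
  result_type: 'recent',
  lang: 'en'
}, function (err, data, response) {
  data.statuses.forEach(function (tweet) {
    console.log(tweet.text);
  });
});
```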
We’ll learn the basics of interacting with tweets, including retweeting, deleting, favoriting, and replying to tweets. We'll get our home timeline by using Twit's GET method to access the statuses/home_timeline endpoint, including the count parameter, which lets us get back a certain number of tweets. We'll also pass it a callback. We'll learn how to cycle through the data (the tweets) we get back and see what information is included. We'll learn how to retweet statuses by posting to statuses/retweet and including the tweet id. We can unretweet by posting to statuses/unretweet with the same tweet id. We'll also learn how to like a tweet by posting to favorites/create with a tweet id, and unlike a tweet by posting to favorites/destroy with a tweet id. We'll also learn how to reply to a tweet by posting to statuses/update, with a status that includes the handle of the user we're replying to and an in_reply_to_status_id parameter, which is the id of the tweet we're replying to. We'll learn how to delete a tweet by posting to statuses/destroy with the tweet's id.
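In Twit, those interactions look roughly like this (the ids and handles are placeholders):

```js
var Twit = require('twit');
var T = new Twit({ /* your keys */ });

// get the 5 most recent tweets from our home timeline
T.get('statuses/home_timeline', { count: 5 }, function (err, data) {
  data.forEach(function (tweet) {
    console.log(tweet.user.screen_name, tweet.text, tweet.id_str);
  });
});

T.post('statuses/retweet/:id', { id: '123456789' }, function (err, data) {}); // retweet
T.post('favorites/create', { id: '123456789' }, function (err, data) {});     // like

// reply: the status must mention the user, and in_reply_to_status_id threads the reply
T.post('statuses/update', {
  status: '@someone thanks for the mention!',
  in_reply_to_status_id: '123456789'
}, function (err, data) {});

T.post('statuses/destroy/:id', { id: '987654321' }, function (err, data) {}); // delete
```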
We will learn how to measure how similar two strings are, examining three algorithms: Jaro-Winkler, Levenshtein, and Dice’s Coefficient.
You should note that none of these algorithms are inherently better than the others. Instead, it's important to choose the one that best fits your text data.
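With the natural library, all three are one-liners (the example strings are arbitrary). Note the scales differ: Jaro-Winkler and Dice approach 1 for similar strings, while Levenshtein counts edits, so lower means more similar:

```js
const natural = require('natural');

console.log(natural.JaroWinklerDistance('dixon', 'dicksonx')); // ~0.81 (1 = identical)
console.log(natural.LevenshteinDistance('ones', 'onez'));      // 1 edit (0 = identical)
console.log(natural.DiceCoefficient('thing', 'thingy'));       // ~0.89 (1 = identical)
```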
By this point we've seen that classification can take a long time, and with more data, it would take even longer. Luckily, Natural provides support to save your classifiers. In this lesson, we will learn how to save a classifier and load it into a new project in order to classify new data.
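The save/load round trip looks roughly like this (the file name and training examples are illustrative):

```js
const natural = require('natural');

const classifier = new natural.BayesClassifier();
classifier.addDocument('the match went to extra time', 'sports');
classifier.addDocument('shares closed higher today', 'business');
classifier.train();

// persist the trained classifier to disk
classifier.save('classifier.json', function (err) {
  if (err) throw err;

  // in a new project: load it back and classify new data immediately
  natural.BayesClassifier.load('classifier.json', null, function (err2, loaded) {
    if (err2) throw err2;
    console.log(loaded.classify('the striker scored twice'));
  });
});
```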
Tf-idf, or term frequency-inverse document frequency, is a statistic that indicates how important a word is to a document within a larger collection of documents. This lesson will explain term frequency and inverse document frequency, and show how we can use tf-idf to identify the most relevant words in a body of text.
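With natural's TfIdf class, that looks roughly like this (the documents are illustrative):

```js
const natural = require('natural');
const tfidf = new natural.TfIdf();

tfidf.addDocument('node is a javascript runtime');
tfidf.addDocument('ruby is a dynamic language');
tfidf.addDocument('javascript and ruby both have package managers');

// score a term against every document in the collection
tfidf.tfidfs('javascript', function (i, measure) {
  console.log('document', i, 'score', measure);
});

// list the most relevant terms in document 0, highest tf-idf first
tfidf.listTerms(0).forEach(function (item) {
  console.log(item.term, item.tfidf);
});
```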
In this lesson, we will learn how to train a Naive Bayes classifier and a Logistic Regression classifier (two basic machine learning algorithms) on JSON text data, and classify it into categories.
While this dataset is still considered small -- only a couple hundred data points -- we'll start to get better results.
The general rule is that Logistic Regression will work better than Naive Bayes, but only if there is enough data. Since this is still a pretty small dataset, Naive Bayes works better here. Generally, Logistic Regression takes longer to train as well.
This uses data from Ana Cachopo: http://ana.cachopo.org/datasets-for-single-label-text-categorization
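Since both classifiers share the same addDocument/train/classify interface in natural, comparing them side by side is straightforward (the training examples here are illustrative):

```js
const natural = require('natural');

const bayes = new natural.BayesClassifier();
const logistic = new natural.LogisticRegressionClassifier();

// train both classifiers on the same labeled examples
[
  ['the home team won the match in extra time', 'sports'],
  ['the striker scored twice on saturday', 'sports'],
  ['shares closed higher after the earnings report', 'business'],
  ['the central bank raised interest rates', 'business']
].forEach(function ([text, label]) {
  bayes.addDocument(text, label);
  logistic.addDocument(text, label);
});

bayes.train();
logistic.train();

console.log(bayes.classify('stocks fell sharply today'));    // e.g. 'business'
console.log(logistic.classify('stocks fell sharply today')); // may differ on small data
```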
An important component of many natural language processing projects is being able to identify the parts of speech in a piece of text. We’ll learn how to do that with Natural’s parts-of-speech (POS) tagger.
There are many tags, and it's worth looking them up online (search "POS tag symbols") to become familiar with them all.
The setup of the tagger may seem a little strange, but it allows you to replace the lexicon or the rules with a different lexicon or rule set of your choice.
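That setup looks roughly like this with natural's Brill tagger; the 'EN' lexicon and rule set ship with the library, and either argument can be swapped for your own:

```js
const natural = require('natural');

// the lexicon and rule set are passed in explicitly, so either can be replaced
const lexicon = new natural.Lexicon('EN', 'N'); // 'N' = default tag for unknown words
const ruleSet = new natural.RuleSet('EN');
const tagger = new natural.BrillPOSTagger(lexicon, ruleSet);

const tokens = new natural.WordTokenizer().tokenize('The quick brown fox jumps');
console.log(tagger.tag(tokens).taggedWords);
// e.g. [{ token: 'The', tag: 'DT' }, { token: 'quick', tag: 'JJ' }, ...]
```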
We will learn about “stemming,” the process of finding the root of words, often in order to group words by a common base root. We will look at the Porter and Lancaster Stemmers, briefly touch on Natural’s support for Russian and Spanish stemmers, and introduce the tokenizeAndStem function, which stems and tokenizes a string at the same time.
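A quick sketch of each in natural; attach() is the piece that adds tokenizeAndStem to strings:

```js
const natural = require('natural');

console.log(natural.PorterStemmer.stem('running'));    // 'run'
console.log(natural.LancasterStemmer.stem('running')); // 'run' (Lancaster is more aggressive)

// patch String.prototype so any string can stem and tokenize in one call
natural.PorterStemmer.attach();
console.log('cats running quickly'.tokenizeAndStem()); // e.g. ['cat', 'run', 'quickli']
```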
A part of Natural Language Processing (NLP) is processing text by “tokenizing” language strings. This means we can break up a string of text into parts by word, sentence, etc. In this lesson, we will use the natural library to tokenize a string. First, we will break the string into words using TreebankWordTokenizer. Then we will break the string into sentences using SentenceTokenizer.
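A minimal sketch of both tokenizers (the sample strings are arbitrary):

```js
const natural = require('natural');

const words = new natural.TreebankWordTokenizer();
console.log(words.tokenize("She can't wait for the talk."));
// contractions are split per Penn Treebank conventions, e.g. 'ca' + "n't"

const sentences = new natural.SentenceTokenizer();
console.log(sentences.tokenize('First sentence. Second one! A third?'));
// -> ['First sentence.', 'Second one!', 'A third?']
```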