A part of Natural Language Processing (NLP) is processing text by “tokenizing” language strings. This means we can break up a string of text into parts by word, sentence, etc. In this lesson, we will use the
natural library to tokenize a string. First, we will break the string into words using
TreebankWordTokenizer. Then we will break the string into sentences using
We will learn about “stemming,” the process of finding the root of words, often in order to group words by a common base root. We will look at the Porter and Lancaster Stemmers, briefly touch on Natural’s support for Russian and Spanish stemmers, and introduce the function to stem and tokenize at the same time.
An important component of many natural language processing projects is being able to identify the grammar of a piece of text. We’ll learn how to do that with Natural’s parts of speech (POS) tagger.
There are many tags, and it's worth looking them up online (search "POS tag symbols") to become familiar with them all.
The setup of the tagger may seem a little strange, but it allows you to replace the lexicon or the rules with a different lexicon or rule set of your choice.
We will learn how to compare how similar two strings are to each other, examining three algorithms: Jaro-Winkler, Levenshtein, and Dice’s Coefficient.
You should note that none of these algorithms are inherently better than the others. Instead, it's important to choose the one that best fits your text data.
In this lesson, we will learn how to train a Naive Bayes classifier and a Logistic Regression classifier - basic machine learning algorithms - on JSON text data, and classify it into categories.
While this dataset is still considered a small dataset -- only a couple hundred points of data -- we'll start to get better results.
The general rule is that Logistic Regression will work better than Naive Bayes, but only if there is enough data. Since this is still a pretty small dataset, Naive Bayes works better here. Generally, Logistic Regression takes longer to train as well.
This uses data from Ana Cachopo: http://ana.cachopo.org/datasets-for-single-label-text-categorization
By this point we've seen that classification can take a long time, and with more data, it would take even longer. Luckily, Natural provides support to save your classifiers. In this lesson, we will learn how to save a classifier and load it into a new project in order to classify new data.
Tf-idf, or term frequency-inverse document frequency, is a statistic that indicates how important a word is to the entire document. This lesson will explain term frequency and inverse document frequency, and show how we can use tf-idf to identify the most relevant words in a body of text.
I’m a programmer, data scientist, and musician. I like music generation, data visualization and sonification, natural language processing, machine learning, and storytelling in various formats. I’m currently working on TransProse, a program that translates literature (and emotional data) into music, and creating datasets for machine learning.