Natural Language Processing in JavaScript with Natural

38 minutes

In this course we’ll work through Natural’s API for natural language processing in JavaScript. We’ll look at how to process text: how to break up language strings, find word roots, work with inflectors, find sequences of words, and tag parts of speech. We’ll learn how to find important stats about a body of text: how to compare strings, how to classify text with machine learning, and how to use the tf-idf tool to find relevant words. We’ll look at some of the extra tools Natural gives us, including the dictionary/thesaurus of WordNet, a phonetics comparer that lets us see if two words sound the same, and a spellcheck feature. We’ll also look at tries and digraphs, two data structures that help us better analyze bodies of text.

Break up language strings into parts using Natural

Find the roots of words using stemming in Natural

Pluralizing nouns and counting numbers with inflectors in Natural

Find sequences of words (n-grams) using Natural

Tag parts of speech using Natural

Compare similarity of strings through string distance in Natural

Classify text into categories with machine learning in Natural

Classify JSON text data with machine learning in Natural

Using machine learning classifiers in a new project

Identify the most important words in a document using tf-idf in Natural

Find a word’s definition using WordNet in Natural

Search more efficiently with tries using Natural

Include spell-check in text projects using Natural

Check if words sound alike using Natural

Break up language strings into parts using Natural

1:25 js

A part of Natural Language Processing (NLP) is processing text by “tokenizing” language strings. This means we can break up a string of text into parts by word, sentence, etc. In this lesson, we will use the natural library to tokenize a string. First, we will break the string into words using WordTokenizer, WordPunctTokenizer, and TreebankWordTokenizer. Then we will break the string into sentences using RegexpTokenizer.
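To make the idea of tokenizing concrete, here is a minimal sketch in plain JavaScript that does not use Natural at all. The helper names `tokenizeWords` and `tokenizeSentences` are hypothetical, and the regular expressions are simplifying assumptions; Natural's tokenizers handle many more cases.

```javascript
// Hypothetical helpers for illustration only; these are not Natural's API.

// Split a string into word tokens: keep runs of letters, digits, and
// apostrophes, and drop punctuation and whitespace.
function tokenizeWords(text) {
  return text.match(/[A-Za-z0-9']+/g) || [];
}

// Split a string into sentences: break on whitespace that follows
// a period, exclamation mark, or question mark.
function tokenizeSentences(text) {
  return text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
}

const sample = "Tokenizing is fun. Isn't it? Yes!";
console.log(tokenizeWords(sample));
console.log(tokenizeSentences(sample));
```

A rule this simple will split incorrectly on abbreviations such as "Dr." or "e.g.", which is one reason to reach for a library tokenizer.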

Find the roots of words using stemming in Natural

1:33 js

We will learn about “stemming,” the process of finding the root of words, often in order to group words by a common base root. We will look at the Porter and Lancaster Stemmers, briefly touch on Natural’s support for Russian and Spanish stemmers, and introduce the function to stem and tokenize at the same time.

Pluralizing nouns and counting numbers with inflectors in Natural

1:06 js

Inflections are modifications of a word that express grammatical categories such as number, and an inflector is a tool that applies them. While Natural’s coverage of inflectors is not comprehensive, we will show how Natural can pluralize/singularize nouns and count numbers.

Find sequences of words (n-grams) using Natural

2:06 js

N-grams are sequences of words, where the 'n' stands for the number of words in the sequence. In this lesson, we will see how to find bigrams (2-grams), trigrams (3-grams), and any other length n-gram in a body of text.
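The definition above can be sketched in a few lines of plain JavaScript. The function name `ngrams` is hypothetical and this is not Natural's implementation; it only shows the sliding-window idea on an already tokenized array.

```javascript
// Hypothetical helper for illustration only; not Natural's API.
// Slide a window of size n across the token array and collect each window.
function ngrams(tokens, n) {
  const result = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    result.push(tokens.slice(i, i + n));
  }
  return result;
}

const words = "the quick brown fox".split(" ");
console.log(ngrams(words, 2)); // bigrams: 3 pairs
console.log(ngrams(words, 3)); // trigrams: 2 triples
```

A text with k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.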

Tag parts of speech using Natural

2:16 js

An important component of many natural language processing projects is being able to identify the grammar of a piece of text. We’ll learn how to do that with Natural’s parts of speech (POS) tagger.

There are many tags, and it's worth looking them up online (search "POS tag symbols") to become familiar with them all.

The setup of the tagger may seem a little strange, but it allows you to replace the lexicon or the rules with a different lexicon or rule set of your choice.

Compare similarity of strings through string distance in Natural

3:32 js

We will learn how to compare how similar two strings are to each other, examining three algorithms: Jaro-Winkler, Levenshtein, and Dice’s Coefficient.

Note that none of these algorithms is inherently better than the others. Instead, it's important to choose the one that best fits your text data.
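To show what two of these measures compute, here is a dependency-free sketch of Levenshtein distance and Dice's coefficient. The function names are hypothetical, and the Dice version assumes character bigrams stored in a set (so repeated bigrams count once); Natural's own implementations may differ in such details.

```javascript
// Hypothetical helpers for illustration only; not Natural's API.

// Levenshtein distance: the minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
function levenshtein(a, b) {
  // prev[j] holds the distance between the first i-1 chars of a and first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Collect the distinct two-character substrings of a string.
function bigramSet(s) {
  const set = new Set();
  for (let i = 0; i < s.length - 1; i++) set.add(s.slice(i, i + 2));
  return set;
}

// Dice's coefficient: twice the number of shared bigrams divided by the
// total number of bigrams in both strings. Ranges from 0 to 1.
function diceCoefficient(a, b) {
  const setA = bigramSet(a);
  const setB = bigramSet(b);
  if (setA.size + setB.size === 0) return a === b ? 1 : 0;
  let shared = 0;
  for (const gram of setA) if (setB.has(gram)) shared++;
  return (2 * shared) / (setA.size + setB.size);
}

console.log(levenshtein("kitten", "sitting")); // 3
console.log(diceCoefficient("night", "nacht")); // 0.25
```

Levenshtein returns an edit count where lower means more similar, while Dice returns a score where higher means more similar, so the two are not directly comparable.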

Classify text into categories with machine learning in Natural

3:42 js

In this lesson, we will learn how to train a Naive Bayes classifier or a Logistic Regression classifier (both basic machine learning algorithms) in order to classify text into categories.
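For intuition about what a Naive Bayes text classifier does, here is a minimal sketch in plain JavaScript. The class `TinyNaiveBayes` and its `learn` and `predict` methods are hypothetical names, the training sentences are made up, and the sketch assumes a multinomial model with add-one smoothing; it is not Natural's classifier.

```javascript
// Hypothetical class for illustration only; not Natural's API.
class TinyNaiveBayes {
  constructor() {
    this.docCounts = new Map();  // label -> number of training documents
    this.wordCounts = new Map(); // label -> Map(word -> count)
    this.totalWords = new Map(); // label -> total words seen for that label
    this.vocab = new Set();      // every distinct word seen in training
    this.totalDocs = 0;
  }

  static words(text) {
    return text.toLowerCase().match(/[a-z']+/g) || [];
  }

  // Record the words of one labeled training document.
  learn(text, label) {
    if (!this.wordCounts.has(label)) {
      this.docCounts.set(label, 0);
      this.wordCounts.set(label, new Map());
      this.totalWords.set(label, 0);
    }
    this.docCounts.set(label, this.docCounts.get(label) + 1);
    this.totalDocs++;
    const counts = this.wordCounts.get(label);
    for (const w of TinyNaiveBayes.words(text)) {
      counts.set(w, (counts.get(w) || 0) + 1);
      this.totalWords.set(label, this.totalWords.get(label) + 1);
      this.vocab.add(w);
    }
  }

  // Return the label with the highest log-probability for the text.
  predict(text) {
    let best = null;
    let bestScore = -Infinity;
    for (const [label, docs] of this.docCounts) {
      // Start from the prior: how common this label is in training.
      let score = Math.log(docs / this.totalDocs);
      const counts = this.wordCounts.get(label);
      const denom = this.totalWords.get(label) + this.vocab.size;
      for (const w of TinyNaiveBayes.words(text)) {
        // Add-one smoothing keeps unseen words from zeroing the score.
        score += Math.log(((counts.get(w) || 0) + 1) / denom);
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  }
}

const clf = new TinyNaiveBayes();
clf.learn("buy cheap pills now", "spam");
clf.learn("cheap offer buy today", "spam");
clf.learn("meeting agenda for monday", "ham");
clf.learn("lunch with the team monday", "ham");
console.log(clf.predict("cheap pills offer"));   // spam
console.log(clf.predict("team meeting monday")); // ham
```

With only four training sentences the result is fragile, which mirrors the point that classifiers need a reasonable amount of data.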

Classify JSON text data with machine learning in Natural

6:05 js

In this lesson, we will learn how to train a Naive Bayes classifier and a Logistic Regression classifier (both basic machine learning algorithms) on JSON text data, and classify it into categories.

While this dataset is still considered small (only a couple hundred data points), we'll start to get better results.

As a rule of thumb, Logistic Regression works better than Naive Bayes, but only if there is enough data. Since this is still a fairly small dataset, Naive Bayes works better here. Logistic Regression also generally takes longer to train.

This uses data from Ana Cachopo: http://ana.cachopo.org/datasets-for-single-label-text-categorization

Using machine learning classifiers in a new project

2:03 js

By this point we've seen that classification can take a long time, and with more data, it would take even longer. Luckily, Natural provides support to save your classifiers. In this lesson, we will learn how to save a classifier and load it into a new project in order to classify new data.

Identify the most important words in a document using tf-idf in Natural

5:15 js

Tf-idf, or term frequency-inverse document frequency, is a statistic that indicates how important a word is to a document relative to a collection of documents. This lesson will explain term frequency and inverse document frequency, and show how we can use tf-idf to identify the most relevant words in a body of text.
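Here is a small sketch of the arithmetic behind tf-idf, in plain JavaScript. It assumes one common weighting (term count divided by document length, multiplied by the natural log of total documents over documents containing the term); the function names and sample documents are hypothetical, and Natural's tf-idf tool may use a different variant of the formula.

```javascript
// Hypothetical helpers for illustration only; not Natural's API.
// Each document is an array of lowercase word tokens.

// Term frequency: the share of the document's tokens that equal the term.
function termFrequency(term, doc) {
  if (doc.length === 0) return 0;
  return doc.filter((w) => w === term).length / doc.length;
}

// Inverse document frequency: log of (number of documents / number of
// documents containing the term). Returns 0 if no document contains it.
function inverseDocumentFrequency(term, docs) {
  const containing = docs.filter((doc) => doc.includes(term)).length;
  if (containing === 0) return 0;
  return Math.log(docs.length / containing);
}

// tf-idf of a term for the document at docIndex within the collection.
function tfidf(term, docIndex, docs) {
  return termFrequency(term, docs[docIndex]) * inverseDocumentFrequency(term, docs);
}

const docs = [
  "the cat sat on the mat",
  "the dog sat on the log",
  "the cat chased the dog",
].map((text) => text.split(" "));

console.log(tfidf("the", 0, docs)); // 0, because "the" is in every document
console.log(tfidf("mat", 0, docs)); // highest in document 0: "mat" appears nowhere else
console.log(tfidf("cat", 0, docs)); // lower than "mat": "cat" is also in document 2
```

The example shows why a very common word scores low even when it appears often: the inverse document frequency term cancels it out.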

Find a word’s definition using WordNet in Natural

2:43 js

This lesson introduces WordNet, which is an important resource in natural language processing. With WordNet, we can look up a word’s definition, or find its synonyms.

Search more efficiently with tries using Natural

1:40 js

A trie is a data structure that provides an efficient way to search for the existence of a word or phrase in a body of text, or to search by prefix.
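To illustrate the data structure itself, here is a minimal trie in plain JavaScript. The class `SimpleTrie` and its method names are hypothetical and are not Natural's Trie API; the sketch only shows how shared prefixes are stored once and how exact and prefix lookups walk the tree.

```javascript
// Hypothetical class for illustration only; not Natural's API.
class SimpleTrie {
  constructor() {
    this.root = { children: new Map(), isWord: false };
  }

  // Add a word one character at a time, creating nodes as needed.
  insert(word) {
    let node = this.root;
    for (const ch of word) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), isWord: false });
      }
      node = node.children.get(ch);
    }
    node.isWord = true;
  }

  // Follow the characters of a prefix; return the final node or null.
  findNode(prefix) {
    let node = this.root;
    for (const ch of prefix) {
      node = node.children.get(ch);
      if (node === undefined) return null;
    }
    return node;
  }

  // True only if this exact word was inserted.
  contains(word) {
    const node = this.findNode(word);
    return node !== null && node.isWord;
  }

  // Collect every inserted word that starts with the prefix.
  keysWithPrefix(prefix) {
    const results = [];
    const start = this.findNode(prefix);
    if (start === null) return results;
    const walk = (node, soFar) => {
      if (node.isWord) results.push(soFar);
      for (const [ch, child] of node.children) walk(child, soFar + ch);
    };
    walk(start, prefix);
    return results;
  }
}

const trie = new SimpleTrie();
["tea", "ten", "team", "to"].forEach((w) => trie.insert(w));
console.log(trie.contains("tea"));       // true
console.log(trie.contains("te"));        // false: a prefix, not an inserted word
console.log(trie.keysWithPrefix("te"));  // the three words starting with "te"
```

Lookup time depends on the length of the word being searched, not on how many words the trie holds.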

Include spell-check in text projects using Natural

3:05 js

In this lesson, we’ll see how to use Natural’s probabilistic spell-checker, which uses the trie data structure.

Check if words sound alike using Natural

2:14 js

In this lesson, we’ll take a look at Natural’s phonetics feature. We’ll learn how to check whether two words sound alike, looking at both the SoundEx and Metaphone algorithms.
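As a rough picture of how a phonetic code works, here is a sketch of the classic American Soundex rules in plain JavaScript. The function names `soundex` and `soundsAlike` are hypothetical, and this is not Natural's implementation, which may differ in edge cases.

```javascript
// Hypothetical helpers for illustration only; not Natural's API.

// Consonant groups that share a Soundex digit. Vowels and y get no digit.
const SOUNDEX_CODES = {
  b: 1, f: 1, p: 1, v: 1,
  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
  d: 3, t: 3,
  l: 4,
  m: 5, n: 5,
  r: 6,
};

// Encode a word as its first letter followed by three digits.
function soundex(word) {
  const letters = word.toLowerCase().replace(/[^a-z]/g, "");
  if (letters.length === 0) return "";
  let result = letters[0].toUpperCase();
  let prevCode = SOUNDEX_CODES[letters[0]] || 0;
  for (let i = 1; i < letters.length && result.length < 4; i++) {
    const ch = letters[i];
    // h and w are skipped and do not separate repeated codes.
    if (ch === "h" || ch === "w") continue;
    const code = SOUNDEX_CODES[ch] || 0;
    // Add a digit only for a coded consonant that differs from the previous code.
    if (code !== 0 && code !== prevCode) result += code;
    // Vowels reset prevCode to 0, so a repeated consonant after a vowel counts again.
    prevCode = code;
  }
  return result.padEnd(4, "0");
}

// Two words "sound alike" under this scheme when their codes match.
function soundsAlike(a, b) {
  return soundex(a) === soundex(b);
}

console.log(soundex("Robert"));               // R163
console.log(soundex("Rupert"));               // R163
console.log(soundsAlike("Robert", "Rupert")); // true
console.log(soundsAlike("Robert", "Rubin"));  // false
```

Soundex was designed for English surnames, so matches on other kinds of words should be treated as a rough signal only.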

Presented by:

Hannah Davis

I’m a programmer, data scientist, and musician. I like music generation, data visualization and sonification, natural language processing, machine learning, and storytelling in various formats. I’m currently working on TransProse, a program that translates literature (and emotional data) into music, and creating datasets for machine learning.
