    Break up language strings into parts using Natural

    Hannah Davis
    natural
    ^0.4.0

    Part of Natural Language Processing (NLP) is processing text by “tokenizing” language strings. This means we can break up a string of text into parts by word, sentence, etc. In this lesson, we will use the natural library to tokenize a string. First, we will break the string into words using WordTokenizer, WordPunctTokenizer, and TreebankWordTokenizer. Then we will break the string into sentences using RegexpTokenizer.


    Transcript

    First, import the natural library. We'll also make a test string here. To create a new tokenizer, the syntax is new natural.WordTokenizer(). From there, all we need to do is call tokenizer.tokenize on our string.
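
    A minimal sketch of those steps (the test string and the output shown in the comment are illustrative, not necessarily the lesson's own):

        const natural = require('natural');

        const text = "Your dog's fur looks nice today, doesn't it?";

        // WordTokenizer splits on spaces and punctuation and drops the punctuation.
        const tokenizer = new natural.WordTokenizer();
        console.log(tokenizer.tokenize(text));
        // roughly: [ 'Your', 'dog', 's', 'fur', 'looks', 'nice', 'today', 'doesn', 't', 'it' ]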

    WordTokenizer splits text by spaces and punctuation. Note that contractions are split on their apostrophes. WordTokenizer also discards the punctuation. If you want to retain the punctuation, you can use another tokenizer called WordPunctTokenizer.
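
    A sketch of the same call with WordPunctTokenizer (same illustrative string; the exact tokens may vary slightly by version):

        const natural = require('natural');

        const text = "Your dog's fur looks nice today, doesn't it?";

        // WordPunctTokenizer keeps the punctuation, emitting it as separate tokens.
        const punctTokenizer = new natural.WordPunctTokenizer();
        console.log(punctTokenizer.tokenize(text));
        // roughly: [ 'Your', 'dog', "'", 's', 'fur', ..., 'doesn', "'", 't', 'it', '?' ]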

    This will retain the punctuation, putting it into its own tokens. Natural also has a TreebankWordTokenizer, which tries to preserve some of the semantics of the text. It splits contractions into their respective words, and it also keeps the punctuation.
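
    A sketch with TreebankWordTokenizer (again, the string and the output comment are only illustrative):

        const natural = require('natural');

        const text = "Your dog's fur looks nice today, doesn't it?";

        // TreebankWordTokenizer splits contractions into their component words
        // and keeps punctuation as separate tokens.
        const treebankTokenizer = new natural.TreebankWordTokenizer();
        console.log(treebankTokenizer.tokenize(text));
        // roughly: [ 'Your', 'dog', "'s", 'fur', 'looks', 'nice', 'today', ',', 'does', "n't", 'it', '?' ]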

    Lastly, natural has a regular expression tokenizer. Here, you have to pass in a regular expression pattern. In our case, we'll look for end-of-sentence punctuation. This splits the text into sentences.
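
    The lesson's exact pattern isn't shown here, so the pattern and text below are assumptions; any end-of-sentence pattern passed to RegexpTokenizer works the same way:

        const natural = require('natural');

        const text = 'This is the first sentence. Is this the second? It is! This is the last one.';

        // RegexpTokenizer splits wherever the given pattern matches; here the
        // pattern is end-of-sentence punctuation plus any trailing whitespace.
        const sentenceTokenizer = new natural.RegexpTokenizer({ pattern: /[.!?]\s*/ });
        console.log(sentenceTokenizer.tokenize(text));
        // roughly: [ 'This is the first sentence', 'Is this the second', 'It is', 'This is the last one' ]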