In this lesson, we’ll take a look at Natural’s phonetics feature. We’ll learn how to check whether two words sound alike, looking at both the SoundEx and Metaphone algorithms.
[00:00] First, import the natural library. We'll be looking at two algorithms here: the first is SoundEx, which is natural.SoundEx, and the second is Metaphone, which is natural.Metaphone. We'll also make two test words here. To test whether two words sound the same, the syntax is if (soundex.compare(word1, word2)).
[00:31] If they sound alike, we'll log "SoundEx: Alike!". Otherwise, we'll log "SoundEx: Unalike!". We'll do the same for Metaphone here.
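Put together, a minimal sketch of that code might look like the following; the test words ruby and rubie are assumptions for illustration, since any similar-sounding pair will do:

    const natural = require('natural');

    const soundex = natural.SoundEx;
    const metaphone = natural.Metaphone;

    // Hypothetical test words; any similar-sounding pair works here.
    const word1 = 'ruby';
    const word2 = 'rubie';

    // compare() returns true when the two words encode to the same value.
    if (soundex.compare(word1, word2)) {
      console.log('SoundEx: Alike!');
    } else {
      console.log('SoundEx: Unalike!');
    }

    if (metaphone.compare(word1, word2)) {
      console.log('Metaphone: Alike!');
    } else {
      console.log('Metaphone: Unalike!');
    }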
[00:56] Let's take a look. They sound alike in both algorithms. Behind the scenes, both of these algorithms generate an encoding for each word, then compare the encodings to decide whether the words sound the same.
[01:16] To print out those encodings, we can call soundex.process on our words, and do the same for metaphone. Let's just print all of those here. SoundEx, which was created in the early 1900s, builds the encoding from the first letter of the word, dropping the vowels and any duplicate consonants, and replacing the remaining consonants with numbers according to its lookup table.
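Continuing the sketch above, printing the encodings might look like this; the encoded values in the comments are what the standard algorithms should produce for the assumed ruby/rubie pair:

    // process() returns the phonetic encoding of a single word.
    console.log(soundex.process(word1));   // 'R100' -- keeps 'R', maps b -> 1, drops vowels, pads with zeros
    console.log(soundex.process(word2));   // 'R100'
    console.log(metaphone.process(word1)); // 'RB'
    console.log(metaphone.process(word2)); // 'RB'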
[01:55] This makes it less powerful than Metaphone, which is a much more complex algorithm. For instance, if we take two words with different first letters and print their encodings, we can see that where SoundEx found the words to sound unalike, Metaphone's encodings of those words are identical.
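For example, a pair like phonetics and fonetix (one possible pair, assumed here for illustration) shows the difference. Continuing the same sketch:

    const wordA = 'phonetics';
    const wordB = 'fonetix';

    // SoundEx always keeps the first letter, so these codes can never match.
    console.log(soundex.process(wordA));   // 'P532'
    console.log(soundex.process(wordB));   // 'F532'

    // Metaphone maps 'ph' -> F and 'x' -> KS, so both words reduce to the same encoding.
    console.log(metaphone.process(wordA)); // 'FNTKS'
    console.log(metaphone.process(wordB)); // 'FNTKS'

    console.log(soundex.compare(wordA, wordB));   // false
    console.log(metaphone.compare(wordA, wordB)); // true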