In our previous examples, we've only been looking at the episode title and data related to the episode. In the background, I've imported the entire script for each episode as well, so if you're following along, you can do the same using the episode scripts file in the utils folder from the Git repo.
When just searching the episodes, we found that there were 43 episodes that had Homer in the title. With our new data now, we can do that same search.
We'll search for the word Homer and pipe our output to jq. Run that, and then I'll scroll back up here and you can see we now have 34,000-plus hits. When we look through the results, they actually include other places where the word Homer appears besides the title of the episode.
Returning to Postman, I'm going to do a query. It will be a match type query, and we're going to match on the field from our dataset called spokenWords. We're going to match on the words "makes me laugh".
When we go down to our results here, you can see there are just over 10,000 results. As we scroll through them, the first hits are all ones where the spokenWords field contains the words "makes me laugh". Then if we get down here far enough, we'll also see different variations where only some of the words makes, me, or laugh appear, in some combination, which still counts as a hit.
What if we just wanted the exact phrase "makes me laugh"? Well, we can do that by swapping out our match query for a match_phrase query.
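As a rough sketch of what those two request bodies look like, here is a small Python snippet. The helper name build_query is made up for illustration, and the snippet only builds and prints the JSON that would go in the body of a _search request; it doesn't send anything to a cluster.

```python
import json


def build_query(query_type, field, text):
    """Build a search request body for a single-field full-text query.

    query_type is the Elasticsearch query name, e.g. "match" or "match_phrase".
    (Hypothetical helper, not part of any client library.)
    """
    return {"query": {query_type: {field: text}}}


# match: the text is analyzed into terms, and a document containing
# any of "makes", "me", or "laugh" can come back as a hit.
match_body = build_query("match", "spokenWords", "makes me laugh")

# match_phrase: the terms have to appear together, in that order.
phrase_body = build_query("match_phrase", "spokenWords", "makes me laugh")

print(json.dumps(match_body, indent=2))
print(json.dumps(phrase_body, indent=2))
```

The only difference between the two bodies is the query name, which is why switching from one to the other in Postman is a one-word change.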
When we submit that and scroll down to the results, you see that we've gone from 10,000 results down to a total of 3, and each one includes the phrase "makes me laugh" with the words in that order.
We can search multiple fields as well if we do a multi-match search. We can have our query equal to the phrase "Homer Simpson."
Then we can apply the fields parameter. Fields accepts an array where we can tell it to search within spokenWords and also the field rawCharacterText, which, if you looked at the data, is the name of the character who is speaking.
If we run that, you can see our top match is a document where Homer Simpson is found in both rawCharacterText and spokenWords. As we continue down through these, the next one is the same thing; the more times the words are repeated, the higher the relevance score. Then as we move down to number three, you can see that it's keying off the fact that Homer Simpson appears in spokenWords even though it doesn't appear in rawCharacterText, because a document can match on either field or on both.
If we wanted to give more search relevance to the words spoken by Homer, rather than the places where his name is used by another speaking character, we can boost that field. To boost it, we add the ^ symbol after the field name, and then the level of boost we want. I can boost it by a factor of 8 here. Now, rawCharacterText is the character speaking, so boosting it by 8 will boost the documents where rawCharacterText is equal to Homer Simpson.
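Here is a minimal sketch of the multi-match body with and without the boost. The function name and its boosts argument are my own for illustration; the part that matters is the field^factor string that ends up in the fields array.

```python
import json


def multi_match_body(text, fields, boosts=None):
    """Build a multi_match request body.

    boosts maps a field name to a boost factor; boosted fields are
    written as "field^factor". (Hypothetical helper for illustration.)
    """
    boosts = boosts or {}
    fields_with_boosts = [
        f"{name}^{boosts[name]}" if name in boosts else name
        for name in fields
    ]
    return {
        "query": {
            "multi_match": {"query": text, "fields": fields_with_boosts}
        }
    }


fields = ["spokenWords", "rawCharacterText"]

# Both fields weighted equally.
plain_body = multi_match_body("Homer Simpson", fields)

# Matches in rawCharacterText count 8 times as much toward the score.
boosted_body = multi_match_body(
    "Homer Simpson", fields, boosts={"rawCharacterText": 8}
)

print(json.dumps(boosted_body, indent=2))
```

Nothing else in the request changes; the boost lives entirely in the field name string.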
If we run that, you can see that the top result now has rawCharacterText equal to Homer Simpson and spokenWords equal to Oh, showing that the boost is actually working. That gives you some flexibility, whenever you run a search, to decide what's more relevant to the question you're asking, as opposed to all fields or all data points in your dataset being treated equally.
We can also do a query string search. It accepts a fields parameter, again as an array, so you can put in multiple fields. Fields is also optional, so if you don't supply it at all, by default it searches across every field in the index.
Then for our query, we're going to put in Homer OR donut, with the operator in capitals. We'll execute that and take a look at our results here. In the spokenWords of our first result, we have both the word Homer and the word donut.
Then in our second result we only have the word donut, because it was an OR clause. So if we swap that OR with an AND and rerun the query, our first result is the same, and our second one actually changes, because we forced it to have both the words Homer and donut.
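The two query string requests can be sketched like this. I'm assuming the search was limited to spokenWords, since that's where the hits showed up; the helper name is made up, and leaving fields out falls back to the default described above.

```python
import json


def query_string_body(expression, fields=None):
    """Build a query_string request body.

    fields is optional; when omitted, the key is left out of the body.
    (Hypothetical helper for illustration.)
    """
    inner = {"query": expression}
    if fields:
        inner["fields"] = list(fields)
    return {"query": {"query_string": inner}}


# OR: a document with either word can match.
either_body = query_string_body("Homer OR donut", ["spokenWords"])

# AND: a document must contain both words.
both_body = query_string_body("Homer AND donut", ["spokenWords"])

# No fields given, so the body carries only the expression.
any_field_body = query_string_body("Homer OR donut")

print(json.dumps(either_body, indent=2))
print(json.dumps(both_body, indent=2))
```

The operators are written in capitals because the query string syntax treats lowercase or and and as ordinary search terms.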
We can also do wildcard searches, so we can search for the letters fri followed by a wildcard character. The results that come back include this one, which has friends, another one that matches on friends, and then the next one down, which matches on the word friars, because it starts with fri. Wildcards should be used carefully though, because a wildcard search has to go through and look for every possible term that matches, so it can take up a lot of memory and really impact the performance of your cluster.
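To make the wildcard idea concrete, here is a small sketch. It assumes the query was written as fri* in query string syntax and run against spokenWords. The local check uses Python's fnmatch module purely as a stand-in for how the pattern expands; Elasticsearch does its own matching against the terms in the index, and the word list here is invented.

```python
import json
from fnmatch import fnmatchcase

# Assumed form of the wildcard query: "fri" followed by "*",
# which stands for zero or more characters.
pattern = "fri*"
wildcard_body = {
    "query": {
        "query_string": {"query": pattern, "fields": ["spokenWords"]}
    }
}

# Illustrative terms only, not taken from the real index.
sample_terms = ["friends", "friars", "fry", "fire", "homer"]

# Local stand-in for pattern expansion: keep the terms that start with "fri".
matching_terms = [t for t in sample_terms if fnmatchcase(t, pattern)]

print(json.dumps(wildcard_body, indent=2))
print(matching_terms)
```

The more distinct terms a pattern can expand to, the more work the cluster has to do, which is where the performance warning comes from.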
We can also do fuzzy queries, so I can type dount and add the ~ after it, telling it to do a fuzzy match. When we submit that, it actually matches on the word donut, even though I misspelled it.
So this is a great way to deal with spelling mistakes in user-submitted data. By default, it looks for all terms with a maximum of two changes where a change is defined as the insertion, deletion, or substitution of a single character, or the transposition of two adjacent characters.
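To see how those changes are counted, here is a small sketch of that kind of edit distance (the optimal string alignment form of Damerau-Levenshtein). This is my own illustration of the definition, not the code Elasticsearch runs internally.

```python
def edit_distance(a, b):
    """Count the insertions, deletions, substitutions, and adjacent
    transpositions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Two adjacent characters swapped counts as a single change.
            if (
                i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("dount", "donut"))  # swapped "u" and "n"
print(edit_distance("dnt", "donut"))    # two missing letters
```

The misspelling dount is only one change away from donut, because the u and the n are just swapped.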
The default distance is 2, but research has indicated that a distance of 1 should catch 80 percent of all human misspellings. If we add a 1 after the ~ and then resubmit this query, you can see that it still caught the word donut, with the added benefit of better performance, because it doesn't have to look through as many permutations of errors to find a match.
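As a sketch, the two fuzzy requests differ only in what follows the ~. The helper name is hypothetical, and I've left out the fields parameter since the walkthrough doesn't say which fields were used.

```python
import json


def fuzzy_expression(term, max_edits=None):
    """Return a query string expression for a fuzzy match on term.

    With no max_edits, the bare "~" leaves the distance at its default.
    (Hypothetical helper for illustration.)
    """
    if max_edits is None:
        return f"{term}~"
    return f"{term}~{max_edits}"


# Default fuzziness.
default_body = {"query": {"query_string": {"query": fuzzy_expression("dount")}}}

# Limit the match to a single change.
one_edit_body = {
    "query": {"query_string": {"query": fuzzy_expression("dount", 1)}}
}

print(json.dumps(default_body, indent=2))
print(json.dumps(one_edit_body, indent=2))
```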