Elasticsearch has four types of aggregations: pipeline and matrix, which are both experimental at this point, and bucket and metric aggregations, which we'll cover here.
With metric aggregations, we can do things like calculate the average number of words in each script line in an episode. We have this field in each document from the episodes that's called wordCount and it contains the number of words spoken by that character. We can do something like this.
We can do an aggs query, which is short for aggregations, and we can create a field called averageWordCount. In it we'll use the average function and tell it that we want to average the field wordCount. When we run that, if we take a look at our response, we have all of these hits. Then down at the bottom we have the aggregations block and the averageWordCount field that we calculated with the value of 10.5.
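As a rough sketch, the request body described above might look like the following. The field and aggregation names are taken from the lesson as transcribed, so the real mapping may spell them differently, and the variable name avg_body is only for illustration. The body would be sent to the index's search endpoint.

```python
import json

# Names follow the lesson as transcribed; the actual mapping may differ.
avg_body = {
    "aggs": {
        # "averageWordCount" is the name we chose for the returned value
        "averageWordCount": {
            # "avg" is the metric aggregation; it averages the wordCount field
            "avg": {"field": "wordCount"}
        }
    }
}

# Print the JSON that would be sent as the search request body
print(json.dumps(avg_body, indent=2))
```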
One thing you'll notice is I got a bunch of search results back, and these are the actual results that were used to calculate the averageWordCount, but I don't care about that at the moment. I can change this up a little bit and set size to 0. When I rerun this, going back down to the results, the hits are no longer returned to me, and only my aggregations object is. If we take a look at our documents again, let's do an empty search so I get them all back, and we'll look at the script type.
In our results there's a field called rawCharacterText that has the name of the speaking character. We can get the cardinality of this field using a cardinality aggregation. To do that, we'll name our field first, our return value, which we'll call speakingLineCount. It's going to be a cardinality query on the field rawCharacterText.
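A minimal sketch of that cardinality request, again assuming the field names as transcribed. Setting size to 0 here reflects the earlier step of dropping the hits and is optional.

```python
import json

# Names follow the lesson as transcribed; the actual mapping may differ.
cardinality_body = {
    "size": 0,  # skip the hits; only the aggregations block is wanted
    "aggs": {
        # "speakingLineCount" is the name chosen for the returned value
        "speakingLineCount": {
            # count the distinct values of the rawCharacterText field
            "cardinality": {"field": "rawCharacterText"}
        }
    },
}

print(json.dumps(cardinality_body, indent=2))
```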
When I run that, you can see I got an HTTP 400, and the reason given is that fielddata is disabled on text fields by default. Let's talk about why that is. We just asked Elasticsearch to count the distinct values across every word in a text field.
Our text field that we based this on, rawCharacterText, is just a character's name, but imagine the impact if that field contained volumes of text, like the entire text of a book in each value. To get the cardinality, the fielddata has to be loaded into the heap memory of the cluster, which takes time and resources and can potentially impact the performance of your cluster.
For that reason, it's disabled by default, so that you stop and think about whether or not you really need to do that. It's worth noting that fielddata doesn't need to be enabled for that field to be searchable. Whenever you index the document, the terms are already identified and indexed for searching. Fielddata only comes into play when you use the field for things like aggregation calculations.
For our purposes, we're OK with turning it on for this field for demonstration, so to turn it on I'm going to do a PUT operation on the cluster. I need to specify the index that we're doing this operation on, and then we're going to add the mapping endpoint and specify the script type.
In the body, we're going to be updating the properties, and we're updating the properties of our field rawCharacterText. We're specifying that the type of this field is text, and then to enable fielddata we specify fielddata and set that value to true. When I send that, I get an HTTP 200, and Elasticsearch acknowledges the request.
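The body of that PUT request could be sketched like this. The index name is not given in the lesson, so the request path is left out here, and the field name is assumed from the transcript.

```python
import json

# Body for the PUT to the index's mapping endpoint (path not shown because
# the lesson does not name the index).
fielddata_mapping = {
    "properties": {
        "rawCharacterText": {
            "type": "text",     # restate the existing field type
            "fielddata": True,  # serialized as JSON true; enables fielddata
        }
    }
}

print(json.dumps(fielddata_mapping, indent=2))
```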
Now if we return to our cardinality aggregation query and send it again, we get an HTTP 200 response, and in our aggregations we get the value of 4931.
Percentiles is another of the metric aggregations. I'm going to create one called wordCountPercentiles. We'll specify that this is a percentiles aggregation, and then specify that the field it's going to operate on is the wordCount field. Again, that field was the number of words spoken in each line.
When we run this, we get the percentile chart returned, showing us things like 75 percent of the lines spoken by a character are 13 words or less. The 99th percentile is 38 words. It's a quick and easy way to get percentiles for numeric values.
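A sketch of the percentiles request just described, with names assumed from the transcript. With no extra options, Elasticsearch returns a default set of percentiles.

```python
import json

# Names follow the lesson as transcribed; the actual mapping may differ.
percentiles_body = {
    "size": 0,  # optional: return only the aggregations
    "aggs": {
        "wordCountPercentiles": {
            # percentiles over the numeric wordCount field
            "percentiles": {"field": "wordCount"}
        }
    },
}

print(json.dumps(percentiles_body, indent=2))
```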
Some of the other metric aggregations available are min, max, stats, sum, top hits, and value count, and they all operate almost identically to this percentiles aggregation. They're pretty straightforward, so rather than show you each of those individually, we're going to move on to bucket aggregations.
Bucket aggregations don't calculate metrics like the metric aggregations do, but instead create buckets of documents. They can hold sub-aggregations though, so let me show you what that means. We're going to create an aggregation, and we're going to call it HomerWordCount, then we're going to apply a filter that's going to be a term filter.
It's going to filter on the field rawCharacterText, which again was the name of the person who was speaking that line, and we're going to filter it to the value Homer. Inside of that, we're going to do another aggs, or another aggregation, and we're going to do averageWordCount here, much like we did at the very beginning of this lesson.
It will be of type average, and we'll base it on the field wordCount. What we've done here, is we've filtered our result set down to lines that were spoken by the character Homer, and then grabbed the average word count of those lines.
When we run this, we can see that there were a total of just over 30,000 lines spoken by Homer, with an average word count of 10.2 or 10.3.
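The filter bucket with its nested average could be sketched as follows. Names are assumed from the transcript, and size 0 is added only to keep the response short.

```python
import json

# Names follow the lesson as transcribed; the actual mapping may differ.
homer_body = {
    "size": 0,
    "aggs": {
        "HomerWordCount": {
            # Single-bucket filter. On an analyzed text field the indexed
            # term is usually lowercased, so the value may need to be
            # "homer" rather than "Homer".
            "filter": {"term": {"rawCharacterText": "Homer"}},
            # Sub-aggregation computed only over documents in the bucket
            "aggs": {
                "averageWordCount": {"avg": {"field": "wordCount"}}
            },
        }
    },
}

print(json.dumps(homer_body, indent=2))
```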
We can also create a multi-bucket aggregation. We're going to create an aggregation called Simpsons, and inside of it we'll define filters, which is plural this time.
I'm going to define a second set of filters and I'll tell you why in just a second after I show you how this works. The first filter's going to be named Homer, and it's going to match on the rawCharacterText equal to Homer, and then we'll do the same for Marge, we'll do one for Bart, one for Lisa, and finally one for Maggie.
When I run that, in our aggregations block we get each of our buckets, listing the name of that character and the number of lines spoken by that character.
Let's go back to the nested filters that I defined up here. The outer set of filters is for anonymous filters, then inside of that, we create what are called named filters. The anonymous filters allow you to do things that aren't defined by a specific set of criteria.
For example, in our named filters we have each of the speaking characters we wanted, but we don't know how many other lines there were. We can use an anonymous filter called other_bucket to get that, and that's where the distinction between the anonymous buckets and the named buckets comes in.
The first thing we need to do is enable the anonymous bucket. Because it may be a very intensive operation, Elasticsearch doesn't calculate it by default, so we need to turn it on.
We need to name the key that it's going to display as, and we'll call it nonSimpsonsCast. We've got an HTTP 400 on that, and if we scroll down and look, it's an unknown key for a Boolean in other_buckets. Oh, that's because it is other_bucket, not other_buckets.
If we rerun that, we get our aggregations. Here are our named buckets, again for the Simpsons family, and then at the very end is our anonymous bucket, nonSimpsonsCast, with the document count for lines spoken by non-Simpson family members.
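Putting the multi-bucket pieces together, a sketch of the request might look like this. The request parameters in the Elasticsearch API are spelled other_bucket and other_bucket_key. The field name and bucket names are assumed from the transcript, and the family list is just a convenience for building the named filters.

```python
import json

# The five characters named in the lesson, one named filter each
family = ["Homer", "Marge", "Bart", "Lisa", "Maggie"]

simpsons_body = {
    "size": 0,
    "aggs": {
        "Simpsons": {
            "filters": {
                # Collect documents that match none of the named filters
                "other_bucket": True,
                # Key under which that extra bucket is reported
                "other_bucket_key": "nonSimpsonsCast",
                # Named filters. On an analyzed text field the terms may
                # need to be lowercased to match.
                "filters": {
                    name: {"term": {"rawCharacterText": name}}
                    for name in family
                },
            }
        }
    },
}

print(json.dumps(simpsons_body, indent=2))
```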
One of my favorite aggregation queries is the significant terms aggregation. In its simplest form, the significant terms aggregation identifies terms that are significantly more popular for a given set than for the comparison set.
The Elasticsearch docs give us some great use cases that illustrate that, like suggesting H5N1 when users search for bird flu, or identifying the merchant that's a common point of compromise from the transaction history of credit card owners reporting loss, or spotting the fraudulent doctor who's diagnosing more than his fair share of whiplash injuries.
Let's build an aggregation query with our Simpsons dataset and see what that reveals. The first thing we do is define our foreground set which is done with a query. I'm going to make a terms query here, and it's going to be based on the rawCharacterText, which is again the speaking character's name. This accepts an array, so you can do multiple terms.
We're just going to filter it down to the word Homer. Again, I'm going to set my size to zero because I'm not interested in the actual hits themselves, but in the aggregation results. We'll define our aggregations, the name I'm going to give to my aggregation is significantWords, the type of query it is, is significant terms, and it's based on the field spokenWords.
What we're doing is aggregating the significant terms from the field spokenWords as our background set, and then looking for statistical deviations in that using our foreground set of spokenWords where the rawCharacterText is set to Homer.
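The query and aggregation just described could be sketched as follows, with field and aggregation names assumed from the transcript.

```python
import json

# Names follow the lesson as transcribed; the actual mapping may differ.
significant_body = {
    "size": 0,  # only the aggregation results are of interest
    # Foreground set: lines spoken by Homer. The terms query accepts an
    # array; on an analyzed text field the value may need to be lowercase.
    "query": {"terms": {"rawCharacterText": ["Homer"]}},
    "aggs": {
        "significantWords": {
            # Compare term frequencies in the foreground with the index
            "significant_terms": {"field": "spokenWords"}
        }
    },
}

print(json.dumps(significant_body, indent=2))
```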
If we run that, we get an HTTP 400, and if we look at the error message, it's the fielddata error again, because we tried to do an aggregation on the text field spokenWords, which doesn't have fielddata enabled by default. We know how to deal with that, and we've actually got a request we can reuse here. We'll do a PUT request, and we'll modify the property of the spokenWords field to enable fielddata.
Elasticsearch acknowledges that, so we return to our query, rerun it, and here's what we get. Homer spoke just over 30,000 lines, and the significant terms are broken up into separate buckets. Homer statistically uses the words Marge, I, woo, and who, me, oh, my, it, that, and Flanders more than the rest of the speaking cast in the Simpsons episodes.
If you've ever seen the episodes, then these words really aren't surprising to you, once you know the context of this query. Let me explain to you what these actual results mean numerically.
We learned that significance means there's a noticeable difference in the frequency with which a term appears in the subset and in the background. Let's back that up with math. We're going to focus in on this result set for Marge right here.
The first thing I want to do is an empty search across the whole script type, and see that we have just over 157,000 documents in Elasticsearch. Now I'm going to do a basic query looking for the term Marge.
We'll do a query of type match on the field spokenWords, and look for the term Marge. When we run that and take a look at our results, we have 2,581 results. Out of the 157,000 documents, the word Marge appears in 2,581 of them, which is about 1.6 percent.
If we go back to our significant terms query, we can see that Homer has a total of 30,000 speaking lines, and he used the term Marge in 1,760 of them. If we do 1,760 divided by 30,000 we get about 5.9 percent, which is significantly different from our background set of 1.6 percent, and that's the reason it shows up in the significant terms query.
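The arithmetic above can be checked with a few lines. The totals are the rounded figures quoted in the lesson, and this is only the raw frequency comparison being described, not the scoring Elasticsearch itself applies to rank significant terms.

```python
# Rounded figures quoted in the lesson
total_docs = 157_000             # all script lines in the index
docs_with_marge = 2_581          # lines containing "Marge" (background)
homer_lines = 30_000             # lines spoken by Homer (foreground)
homer_lines_with_marge = 1_760   # Homer's lines containing "Marge"

# Share of lines mentioning "Marge" in each set, as percentages
background_pct = 100 * docs_with_marge / total_docs
foreground_pct = 100 * homer_lines_with_marge / homer_lines

print(f"background: {background_pct:.1f}%")
print(f"foreground: {foreground_pct:.1f}%")
print(f"foreground is {foreground_pct / background_pct:.1f}x the background rate")
```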