⚠️ This lesson is retired and might contain outdated information.

Monitor Elasticsearch cluster health and status with the _cat API

Instructor: Will Button
Published 8 years ago
Updated 2 years ago

Elasticsearch has an in-depth set of APIs for accessing the health and performance of the cluster. In this lesson, you will learn how to access them using the _cat API endpoint, which is designed for console use. You will also learn some of the key metrics to monitor so you can identify issues and performance problems with your Elasticsearch cluster before they impact your application and clients, including how to tell if your Elasticsearch cluster isn’t returning results based on all of your data.

[00:00] We'll start by using curl to access the cat APIs. The nice thing about the cat APIs is that they return formatted text instead of JSON, making it easier for you and me to see what's going on. If I hit just the _cat endpoint, it returns a list of all the cat APIs available.
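
As a rough sketch of that first request in Python rather than curl: this assumes a node listening on localhost:9200, and the helper names cat_url and fetch_text are made up for illustration. Nothing is sent when the block runs; the fetch is only defined.

```python
from urllib.request import urlopen

# Assumption: a single local node on the default HTTP port.
ES_URL = "http://localhost:9200"


def cat_url(endpoint: str = "") -> str:
    """Return the URL of a cat API; with no name, the bare _cat endpoint."""
    return f"{ES_URL}/_cat/{endpoint}".rstrip("/")


def fetch_text(url: str) -> str:
    """GET a URL and return the plain-text body (requires a running node)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Only the URLs are printed here. With a node running, try:
#   print(fetch_text(cat_url()))          # list of the available cat APIs
#   print(fetch_text(cat_url("master")))  # the master endpoint
print(cat_url())
print(cat_url("master"))
```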

[00:20] If you've been following along with some of the other Elasticsearch lessons, you've already seen some of these, such as health and indices endpoints. If I go to the master endpoint, it returns some data, but aside from guessing, I don't really know what it's trying to tell me here.

[00:38] I can include the verbose parameter, ?v, and it will return the header name for each column returned. This actually works for all of the cat APIs. I can also call the endpoint with the help parameter, ?help, and it returns a list of all the available columns, the alias names for those columns, and a brief description of what the column returns.
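
To make those column headers usable from a script, here is a minimal sketch that splits verbose cat output into one dictionary per row. The function name parse_cat_table and the sample values are hypothetical, and splitting on whitespace assumes no cell is empty or contains spaces.

```python
def parse_cat_table(text):
    """Turn verbose (?v) cat output into one dict per row, keyed by header."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    headers = lines[0].split()
    return [dict(zip(headers, line.split())) for line in lines[1:]]


# Hypothetical sample shaped like GET /_cat/master?v (the values are made up).
sample = (
    "id     host      ip        node\n"
    "abc123 127.0.0.1 127.0.0.1 node-1\n"
)
rows = parse_cat_table(sample)
print(rows[0]["node"])  # node-1
```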

[01:02] If I return the help for the indices API, there's quite a bit of information available here. Let's say I just want to view the health, which has an alias of h; the index name, with an alias of i; the number of available docs, which is dc; and the store size, which is ss.

[01:27] I can do a curl on localhost:9200/_cat/indices, and then the parameter h for header, and then specify h for health, i for the index name, dc for the doc count, and ss for the store size. Elasticsearch will return those values.
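
The request described here can be sketched as a small URL builder. The host and port are assumed to be a local node, and indices_url is a hypothetical helper name; the column aliases are the ones listed by the help output for the indices API.

```python
ES_URL = "http://localhost:9200"  # assumption: local node, default port


def indices_url(columns):
    """Build a _cat/indices URL limited to the given columns, with headers (v)."""
    return f"{ES_URL}/_cat/indices?v&h={','.join(columns)}"


# h = health, i = index name, dc = docs.count, ss = store.size
url = indices_url(["h", "i", "dc", "ss"])
print(url)  # http://localhost:9200/_cat/indices?v&h=h,i,dc,ss
```

Passing that URL to curl, or to any HTTP client, should return only the four requested columns.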

[01:48] I can also sort the output using the sort parameter, which is s, and then specify the column that I want to sort by. Here, the indices are returned sorted by store size. I can check the overall health of the cluster with the health endpoint.
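
A sketch of the sorted request, under the same assumption of a local node. The helper name sorted_indices_url is hypothetical. The :desc suffix is not shown in the lesson; it is the cat API's way of reversing the sort order and is included here as an optional extra.

```python
ES_URL = "http://localhost:9200"  # assumption: local node, default port


def sorted_indices_url(columns, sort_by, descending=False):
    """Build a _cat/indices URL with chosen columns (h) and a sort column (s)."""
    order = ":desc" if descending else ""
    return f"{ES_URL}/_cat/indices?v&h={','.join(columns)}&s={sort_by}{order}"


columns = ["h", "i", "dc", "ss"]
ascending_url = sorted_indices_url(columns, "ss")
descending_url = sorted_indices_url(columns, "ss", descending=True)
print(ascending_url)
print(descending_url)
```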

[02:08] Just to clarify, I use the term cluster, but I've only got one Elasticsearch node. I could also state that I'm checking the health of this node, and that would be accurate. In most production applications, you're not going to have just a single node in your cluster. The health endpoint returns the overall health of the entire cluster.

[02:27] In my output here, I have the epoch and the timestamp. Both of those return the same thing, just in different time formats. This is useful if you're running this command multiple times, like in a for loop. You can see when the cluster turns from a status of yellow to green, and look at the elapsed time between those two to find out how long that took.
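
The elapsed-time idea can be sketched as follows, assuming the epoch and status columns have already been collected from repeated health calls. The function name and the sample values are hypothetical.

```python
def seconds_until_green(samples):
    """Given (epoch, status) pairs in time order, return the seconds between
    the first yellow sample and the first green sample after it, or None."""
    yellow_at = None
    for epoch, status in samples:
        if status == "yellow" and yellow_at is None:
            yellow_at = epoch
        elif status == "green" and yellow_at is not None:
            return epoch - yellow_at
    return None


# Made-up samples, as if gathered by polling the health endpoint in a loop.
samples = [(1000, "yellow"), (1005, "yellow"), (1010, "green")]
print(seconds_until_green(samples))  # 10
```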

[02:50] In the next column, we have the name of the cluster, which is pretty straightforward, and then we have the status. Currently, mine's yellow, and there are three available statuses: red, yellow, and green. Green means everything is great. All your nodes are operational. All your indices have the required number of primaries and replicas for performance and redundancy.

[03:12] Yellow means that all your data is available, but you don't have full redundancy. That's the case with my cluster currently. If I look at the cat endpoint for my indices, it lists all of the indices on my system. Look at the rep column. It shows the number of replicas required for each shard in the index.

[03:32] Almost every index I have requires one replica, but I only have a single node in this cluster. Elasticsearch won't place a replica on the same node as its primary, so there's nowhere for it to put that replica. That's what triggers the yellow status.

[03:45] There are two ways of dealing with this. I can add nodes to the cluster to allow Elasticsearch to create the required replicas, or change the replica settings for the indices. In a production environment, you always want to have replicas available for redundancy.

[04:01] For the sake of this lesson, I can update my index settings to zero replicas. That's done with a PUT operation. I'll put in the URL for my cluster and specify the index that I want to modify. We're going to modify the Simpsons index and update its settings.

[04:20] In the body, we're going to update the index, and we're going to update the number of replicas to zero. Then I'll send that through. Elasticsearch acknowledges it, and if I rerun the indices command, or the cat endpoint, you can see that the Simpsons index has been updated to have zero replicas.
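
A minimal sketch of that PUT request, built but not sent. It assumes a local node and an index literally named simpsons (the transcript only calls it "the Simpsons index"); replica_settings_request is a hypothetical helper name.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: local node, default port


def replica_settings_request(index, replicas):
    """Build (but do not send) a PUT that sets number_of_replicas on an index."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return Request(
        f"{ES_URL}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# "simpsons" is an assumed index name.
request = replica_settings_request("simpsons", 0)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it against a running node: urllib.request.urlopen(request)
```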

[04:41] Now, the health of that index is green. I can update the number of replicas required for all of my other indices, and when we rerun that, you can see that they're all green. If I rerun the health command, my cluster is now green.

[04:58] A red status indicates that Elasticsearch is missing data. This is extremely important to know, because Elasticsearch can still return results, but those results will be based on incomplete data. Let me show you what that means.

[05:13] We have a three-node cluster here with a shard on server A, a shard and a replica on server B, and a replica on server C. If server C goes offline, either due to a failure or planned maintenance, we lose that replica, and the cluster status turns yellow rather than red, because all of our information is still available on servers A and B. We just don't have redundancy.

[05:35] Elasticsearch is going to start rebuilding that replica on either server A or B, but if server B goes down before that completes, we now only have server A left, with its single shard. The important concept to grasp here is that the Elasticsearch service is still running on server A.

[05:55] If you query Elasticsearch, you're going to get results back, but those results are going to be only based on the data that's available in shard one, not your complete data set. That's the reason it's so important to actively monitor this, and know whenever your cluster's not healthy.
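
The lesson does not show this step, but one way to catch the situation from the query side is to compare the shard counts that Elasticsearch reports in the _shards section of a search response. The sketch below assumes the response has already been parsed from JSON into a dictionary; is_partial and the sample numbers are hypothetical.

```python
def is_partial(search_response):
    """True when fewer shards answered than the search targeted."""
    shards = search_response["_shards"]
    return shards["successful"] < shards["total"]


# Made-up responses, trimmed to the _shards section only.
complete = {"_shards": {"total": 5, "successful": 5, "failed": 0}}
degraded = {"_shards": {"total": 5, "successful": 3, "failed": 0}}
print(is_partial(complete))  # False
print(is_partial(degraded))  # True
```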

[06:13] Returning back to our health status, we have the total number of nodes, as well as the total number of data nodes, and these may be different. It's common practice in high performance clusters to have dedicated master nodes and dedicated data nodes. This is done to segregate the workload across the different nodes.

[06:31] Next, we have our count metrics for our shards. We have the number of primary shards and the number of unassigned shards. Unassigned shards represent the shards that should exist, but Elasticsearch doesn't have anywhere to put them.

[06:43] Let me show you what that looks like. We'll do a PUT operation on our cluster against the Simpsons index, again updating the settings. We'll update the index's number of replicas, putting that back to one.

[06:59] If I rerun the health command, you can see we have five unassigned shards, and that's from the five primary shards in the Simpsons index that we just updated to require a replica that Elasticsearch has nowhere to put.
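
The arithmetic behind that number can be sketched like this. It is a simplification that assumes the only constraint is that copies of the same shard never share a node, and it ignores any other allocation settings; expected_unassigned is a hypothetical helper name.

```python
def expected_unassigned(primaries, replicas, data_nodes):
    """Estimate unassigned shards when copies of a shard must sit on
    different nodes and nothing else restricts allocation."""
    copies_wanted = 1 + replicas  # the primary plus its replicas
    copies_placeable = min(copies_wanted, data_nodes)
    return primaries * (copies_wanted - copies_placeable)


# Five primaries, one replica each, one data node (the setup in this lesson).
print(expected_unassigned(5, 1, 1))  # 5
# The same index with a second data node, or with zero replicas.
print(expected_unassigned(5, 1, 2))  # 0
print(expected_unassigned(5, 0, 1))  # 0
```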

[07:16] This column for relocating shards indicates the shards that are being moved from one node to another. The init column shows new shards that are being initialized, either because of a new index being created, or increasing the number of shards required for a given index.

[07:34] We also have the number of pending tasks and the wait time for the longest pending task. Finally, we have the percentage of active shards, which is the number of shards divided by the total number of shards required, that is, the shards plus the unassigned shards.
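
As a worked example of that percentage, following the lesson's description and counting only active and unassigned shards. The function name and the counts are hypothetical, and returning 100 for an empty cluster is a choice made for this sketch.

```python
def active_shards_percent(active, unassigned):
    """Percentage of required shards that are active, per the lesson's formula."""
    total_required = active + unassigned
    if total_required == 0:
        return 100.0  # sketch choice: nothing is missing when nothing is required
    return 100.0 * active / total_required


# Made-up counts: 15 active shards and 5 unassigned replicas.
print(active_shards_percent(15, 5))  # 75.0
```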