This is more of an analysis of how the JVM heap will kill your Elasticsearch cluster than a “how-to” lesson. If you aren’t familiar with Java apps and the JVM, these 2 1/2 minutes can save you much pain, suffering, and self-loathing by showing you how Elasticsearch utilizes the JVM heap for performance and what to monitor so you know when it’s affecting you.
[00:01] Finally, I want to show you the nodes API endpoint. This shows the main performance statistics for each node in your cluster, and I really want to draw your attention to this value, the heap percent. Elasticsearch is a Java application, and the memory the JVM sets aside for it is referred to as the heap.
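If you want to pull that value yourself, here's a minimal sketch using only the Python standard library, assuming a cluster listening on localhost:9200 with no authentication:

```python
# A minimal sketch, assuming an unsecured cluster at localhost:9200.
import urllib.request

# _cat/nodes returns one row per node; h= selects the columns, v adds a header row.
url = "http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))
```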
[00:20] Based on my experience, this is probably the most important metric to track in your Elasticsearch cluster. As you use the cluster, the memory in the heap gets consumed. Garbage collection routinely frees heap memory that's no longer in use, so in a healthy cluster you'll see this value increase, then garbage collection runs and it decreases.
[00:40] When you plot that over time you see a nice sawtooth pattern. When you start approaching the upper bound of the heap, the sawtooth gets smaller and smaller and the memory utilized remains high.
[00:55] Ultimately you'll reach a point where garbage collection no longer finds any memory available to be freed, and when you run subsequent queries Elasticsearch doesn't have the memory it needs to answer your request. From your point of view, it's going to look like Elasticsearch has frozen because it'll just sit there and never return your results.
[01:13] If you restart the cluster, all of a sudden your query will work. That's because the restart released all of the allocated memory, leaving plenty available to fulfill your request. Over time, though, you'll end up right back here.
[01:26] To overcome this problem you have a couple of options. You can add additional nodes to your cluster to increase the total amount of memory available. You can add additional memory to the nodes you currently have. You can close indices that are no longer needed, freeing up the memory they consumed, or you can create additional Elasticsearch clusters dedicated to specific tasks and indices.
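Closing an index, for instance, is a single API call. This sketch assumes a hypothetical index named logs-2017.01 on the same unsecured local cluster:

```python
# A minimal sketch; "logs-2017.01" is a hypothetical index name.
import urllib.request

# POST <index>/_close takes the index out of the heap until it's reopened.
req = urllib.request.Request(
    "http://localhost:9200/logs-2017.01/_close",
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))
```

A closed index stays on disk but no longer consumes heap, so you can reopen it later if you need it again.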
[01:50] Most importantly, though, you're never going to know this is the root cause of your Elasticsearch performance problems if you aren't tracking this value over time. There are a lot of different options available to do that. You may already have an existing monitoring system in place, such as Datadog, Nagios, or Zabbix, which can be configured to collect these metrics.
[02:11] Elasticsearch itself has a monitoring tool for this, called X-Pack. You can even build your own solution to periodically get this information and store it. To do so, you'll probably need an Elasticsearch endpoint more suited to programmatic access than the column-based results provided by the cat API.
[02:27] Elasticsearch provides this via the cluster API and we'll cover that in the next lesson.
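In the meantime, the cat API's column output is still parseable in a pinch. As a rough illustration of the build-your-own approach, this sketch polls the heap percent once a minute and prints timestamped readings you could ship to any metrics store (again assuming an unsecured cluster on localhost:9200):

```python
# A minimal sketch of a homegrown heap poller, assuming localhost:9200 and no auth.
import time
import urllib.request

URL = "http://localhost:9200/_cat/nodes?h=name,heap.percent"

while True:
    with urllib.request.urlopen(URL) as resp:
        for row in resp.read().decode("utf-8").splitlines():
            name, heap_percent = row.split()
            # Swap this print for a write to your metrics store of choice.
            print(f"{int(time.time())} {name} heap.percent={heap_percent}")
    time.sleep(60)  # sample once a minute
```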