Business requirements change, new information is discovered, or usage patterns differ from what you expected. Sooner or later you will need to reindex your data to accommodate these changes. In this lesson you will learn how to use the Reindex API to copy the documents already stored in Elasticsearch from one index to another.
[00:01] I can copy all of the documents from the Simpsons index to a new index. I'll do so with a POST operation: enter the URL for my cluster and call the Reindex API (the _reindex endpoint). In the body, I'll specify my source index, which is going to be the Simpsons index. Then I'll specify the destination and provide the name of the new index that I want the documents copied into.
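The request described here can be sketched as a small Python helper that only builds the JSON body; it does not contact a cluster. The index names simpsons and local_news are assumptions based on the narration, not confirmed spellings.

```python
import json


def build_reindex_body(source_index, dest_index):
    """Return the JSON body for a POST to the _reindex endpoint."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }


# Hypothetical index names; substitute the ones on your cluster.
body = build_reindex_body("simpsons", "local_news")

print("POST /_reindex")
print(json.dumps(body, indent=2))
```

The printed body can be pasted into any HTTP client and sent to the cluster URL followed by /_reindex.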
[00:27] If I run that, it's going to take a minute, because there are a lot of documents in there. When it completes, the response shows me the amount of time it took and the number of documents that were created. Now, if I take a look at the indexes on my cluster, you can see there's a new index called local news that has the exact same documents as the original Simpsons index.
[00:50] This is of interest to you because reindexing from an existing Elasticsearch index into a new index is much faster than your other options, which are to index the documents individually or to use the bulk API endpoint with your original source data.
[01:07] There is usually a business reason behind wanting to do this. For example, let's say that in our Simpsons index here, Milhouse renegotiated his contract so that he gets paid based on the number of lines spoken in each episode.
[01:22] We can start with our source index of the Simpsons and reindex into a new index called No More Milhouse. On the destination, I'm going to set version_type to external, which will cause Elasticsearch to preserve the version information from the source.
[01:37] Your other option is to set it to internal, which is also what you get if you leave it out. That will cause Elasticsearch to blindly dump the documents into the target, overwriting any documents that already exist there with the same ID. We can do an inline script, and ctx is our index object that includes the document plus its index data and metadata.
[01:58] _source is the document itself that's being indexed. raw_character_text refers to the specific field in that document that holds the name of the speaking character. If that is equal to Milhouse, we increment the version number and remove the raw_character_text field from the document.
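A minimal sketch of the scripted reindex body described above, again only building JSON in Python. It assumes a Painless script, the field name raw_character_text, and the index names simpsons and no_more_milhouse, all inferred from the narration. Recent Elasticsearch releases put the script text under the key source; older releases, like the one apparently used in this lesson, call that key inline.

```python
import json

# Script mirroring the narration: when the speaker is exactly "Milhouse",
# bump the version and strip the speaker field from the copied document.
SCRIPT = (
    "if (ctx._source.raw_character_text == 'Milhouse') "
    "{ctx._version++; ctx._source.remove('raw_character_text')}"
)


def build_scripted_reindex_body(source_index, dest_index, script=SCRIPT):
    """Return a _reindex body that runs a script on every copied document."""
    return {
        "source": {"index": source_index},
        # external keeps the version information from the source index
        "dest": {"index": dest_index, "version_type": "external"},
        "script": {"lang": "painless", "source": script},
    }


# Hypothetical index names.
body = build_scripted_reindex_body("simpsons", "no_more_milhouse")

print("POST /_reindex")
print(json.dumps(body, indent=2))
```

Note that the script removes only the one field; the rest of each document is still copied into the new index.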
[02:50] We can search the new index for raw_character_text equal to Milhouse and send that request. The results include Triple Milhouse and Moses Milhouse, but nothing spoken by Milhouse directly, indicating that Milhouse's new contract negotiation may not work out like he expected.
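The verification search could look roughly like the following. The narration does not say which query type was used, so the match query here is an assumption, as are the index and field names.

```python
import json


def build_match_search(field, text):
    """Return a search body that matches `text` against one field."""
    return {"query": {"match": {field: text}}}


# Hypothetical field and index names taken from the narration.
search = build_match_search("raw_character_text", "Milhouse")

print("GET /no_more_milhouse/_search")
print(json.dumps(search, indent=2))
```

A match query on an analyzed text field matches individual words, which would explain why names that merely contain Milhouse still come back while the lines whose speaker field was removed do not.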
[03:10] In a more practical application, you may find this handy if you've discovered that you indexed information you shouldn't have, such as a piece of sensitive data. You can reindex into a new index while removing that sensitive data, and then drop the original index.
[03:31] One other practical application for this is if you need to break off a subset of your data into its own index. For example, if we want to break all of the lines spoken by Krusty the Clown out into their own index, we can do so by adding a query to the reindex request.
[03:47] We can do a match where raw_character_text is equal to Krusty the Clown, and put the matching documents into an index named Krusty. Submit that, and the response reports a total of 2,200 documents created in the new index. To take a look at them, we can just do a search for everything in that index.
[04:08] In the results, we see that we created a new index named Krusty, and it only contains lines where the raw character text is equal to Krusty the Clown.
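The subset reindex and the follow-up search can be sketched as two request bodies. As before, this only builds JSON; the names simpsons, krusty, and raw_character_text are assumptions drawn from the narration.

```python
import json


def build_filtered_reindex_body(source_index, dest_index, query):
    """Return a _reindex body that copies only documents matching `query`."""
    return {
        "source": {"index": source_index, "query": query},
        "dest": {"index": dest_index},
    }


# Only lines whose speaker field matches "Krusty the Clown" are copied.
krusty_query = {"match": {"raw_character_text": "Krusty the Clown"}}
reindex_body = build_filtered_reindex_body("simpsons", "krusty", krusty_query)

# Search for everything in the new index to inspect the result.
search_all = {"query": {"match_all": {}}}

print("POST /_reindex")
print(json.dumps(reindex_body, indent=2))
print("GET /krusty/_search")
print(json.dumps(search_all, indent=2))
```

One design note: by default a match query accepts a document if any of the analyzed words match, so on a text field a phrase query or an exact match on a keyword field may be a stricter choice if the subset must contain only that one character.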