    Reindex data from an existing Elasticsearch index

    Will Button

    Business requirements change, new information is discovered, or usage patterns differ from the expected use. In any case, sooner or later you will find the need to reindex your data to accommodate these changes. In this lesson you will learn how to leverage the information stored in Elasticsearch and the bulk API to reindex data from one index to another.

    elasticsearch

    Transcript

    00:01 I can copy all of the documents from the Simpsons index to a new index. I'll do so with a POST operation: enter the URL for my cluster and call the Reindex API. In the body, I'll specify my source index, which is going to be the Simpsons index. Then I'll specify the destination and provide the name of the new index that I want the documents copied into.
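As a rough sketch of the request described above, the following Python builds (but does not send) a POST to the `_reindex` endpoint. The cluster URL and the exact index spellings (`simpsons`, `local_news`) are assumptions; the transcript only gives the spoken names.

```python
import json
import urllib.request

# Assumption: a local cluster; substitute your own cluster URL.
ES_URL = "http://localhost:9200"


def build_reindex_request(source_index: str, dest_index: str) -> urllib.request.Request:
    """Build a POST request that copies every document from source to dest."""
    body = {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }
    return urllib.request.Request(
        url=f"{ES_URL}/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Index names are guesses at the spelling used in the video.
reindex_request = build_reindex_request("simpsons", "local_news")

# To run it against a live cluster, uncomment:
# with urllib.request.urlopen(reindex_request) as response:
#     print(json.load(response))
```

Building the request separately from sending it makes it easy to inspect the body before anything touches the cluster.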

    00:27 If I run that, it's going to take a minute, because there are a lot of documents in there. When that completes, it shows me the amount of time it took and the number of documents that were created. Now, if I take a look at the indexes on my cluster, you can see there's a new index called local news that has the exact same documents as the original Simpsons index.
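The response fields mentioned here can be read out of the parsed JSON. This is a minimal sketch, assuming the response carries `took` (milliseconds), `created`, and `total` keys as the Reindex API documents them; the helper names are hypothetical.

```python
def summarize_reindex_response(response: dict) -> str:
    """Format the timing and document counts from a parsed _reindex response."""
    took_ms = response.get("took", 0)    # elapsed time in milliseconds
    created = response.get("created", 0)  # documents newly created in dest
    total = response.get("total", 0)      # documents processed overall
    return f"{created} of {total} documents created in {took_ms / 1000:.1f}s"


def cat_indices_url(es_url: str) -> str:
    """URL of the _cat/indices listing; ?v adds a header row to the output."""
    return f"{es_url.rstrip('/')}/_cat/indices?v"
```

Pass the decoded JSON body of the reindex call to `summarize_reindex_response`, and issue a GET against `cat_indices_url(...)` to list the indexes on the cluster.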

    00:50 This is going to be of interest to you because it's much faster to reindex from one existing Elasticsearch index into a new index than your other options, which are to index the documents individually or to use the bulk API endpoint with your original source data.

    01:07 There are usually certain business-related factors behind why you would want to do this. For example, let's say that in our Simpsons index here, Milhouse renegotiated his contract so that he gets paid based off the number of lines spoken in each episode.

    01:22 We can start with our source index of the Simpsons. I can create a new index called No More Milhouse. I'm going to set the version type to external, which will cause Elasticsearch to preserve the version information from the source.

    01:37 Your other option is to set it to internal, which will cause Elasticsearch to blindly dump the documents into the target without any consideration as to whether or not those documents exist. We can do an inline script, where ctx is our index object that includes the document plus its index data and metadata.

    01:58 _source is our document itself that's being indexed. Raw character text refers to the specific field in that document that holds the name of the speaking character. We can check whether that's equal to Milhouse. If it is, we increment our version number and remove the raw character text field.

    02:21 That's just a simple conditional block right there: if this is true, then do this. Finally, we'll specify Painless as the Elasticsearch script language. If we run that, we have a new index that was created with just over 157,000 documents in it. To query it, we specify our new index name of No More Milhouse and look at the script type.
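The scripted reindex walked through above might look roughly like the body below. The index names (`simpsons`, `no_more_milhouse`) and the field name `raw_character_text` are assumed spellings of what is spoken in the video. The script key is `inline`, as the transcript describes; later Elasticsearch releases name this key `source` instead.

```python
import json


def build_scripted_reindex_body() -> dict:
    """Reindex body that strips the speaker field from Milhouse's lines."""
    # Painless script: ctx wraps the document (_source) plus its metadata.
    script = (
        "if (ctx._source.raw_character_text == 'Milhouse') {"
        "ctx._version++; "
        "ctx._source.remove('raw_character_text')"
        "}"
    )
    return {
        "source": {"index": "simpsons"},
        "dest": {
            "index": "no_more_milhouse",
            # external: keep the version information from the source
            "version_type": "external",
        },
        "script": {"inline": script, "lang": "painless"},
    }


scripted_body = build_scripted_reindex_body()
print(json.dumps(scripted_body, indent=2))
```

The printed JSON is what would be sent as the body of a POST to `_reindex`.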

    02:50 We can search for the raw character text equal to Milhouse and send that off. We get results for Triple Milhouse and for Moses Milhouse, but we don't get anything spoken by Milhouse directly, indicating that Milhouse's new contract negotiation may not work out like he expected.
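A sketch of the verification search, assuming a `match` query on a field spelled `raw_character_text` in an index spelled `no_more_milhouse` (both spellings are guesses from the spoken names). A `match` query is analyzed full-text search, which would explain why names that merely contain "Milhouse" still come back.

```python
import json


def build_search(index: str, field: str, text: str, es_url: str = "http://localhost:9200"):
    """Return the _search URL and JSON body for a match query on one field."""
    url = f"{es_url}/{index}/_search"
    body = {"query": {"match": {field: text}}}
    return url, body


search_url, search_body = build_search("no_more_milhouse", "raw_character_text", "Milhouse")
print(search_url)
print(json.dumps(search_body))
```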

    03:10 In a more practical application, you may find this handy if you've discovered that you indexed information you shouldn't have, such as a piece of sensitive data. You can reindex into a new index while removing that sensitive data, and then drop the original index.
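The clean-up described here could be sketched in two steps: reindex while removing the field, then delete the original index. Everything named below (`customers`, `customers_clean`, `ssn`, the local URL) is hypothetical and only for illustration. The script key is `inline` to match the lesson; later Elasticsearch releases name it `source`.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster


def build_scrub_body(source_index: str, dest_index: str, field: str) -> dict:
    """Reindex body that drops one field from every document."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {
            # The field name is passed as a parameter rather than
            # spliced into the script text.
            "inline": "ctx._source.remove(params.field)",
            "lang": "painless",
            "params": {"field": field},
        },
    }


def build_delete_index_request(index: str) -> urllib.request.Request:
    """DELETE request that drops the original index once the copy is verified."""
    return urllib.request.Request(url=f"{ES_URL}/{index}", method="DELETE")


scrub_body = build_scrub_body("customers", "customers_clean", "ssn")
delete_request = build_delete_index_request("customers")
print(json.dumps(scrub_body, indent=2))
```

Neither request is sent here. Deleting an index is irreversible, so it is worth checking the new index before issuing the DELETE.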

    03:31 One other practical application for this is if you need to break off a subset of your data into its own index. For example, if we want to break all of the lines spoken by Krusty the Clown out into their own index, we can do so as we reindex with a query.

    03:47 We can do a match where the raw character text is equal to Krusty the Clown, and put that into an index named Krusty. Submit that, and it returns a new index with a total of 2,200 documents in it. If we take a look at that, we can just do a search for everything in that index.

    04:08 In the results, we see that we created a new index named Krusty, and it only contains lines where the raw character text is equal to Krusty the Clown.
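The subset reindex above can be sketched as a reindex body whose source carries a query, followed by a match-all search on the new index. The spellings `simpsons`, `krusty`, and `raw_character_text` are assumptions based on the spoken names.

```python
import json


def build_subset_reindex_body(source_index: str, dest_index: str, field: str, value: str) -> dict:
    """Reindex body that copies only documents matching the query."""
    return {
        "source": {
            "index": source_index,
            # Only documents matching this query are copied.
            "query": {"match": {field: value}},
        },
        "dest": {"index": dest_index},
    }


subset_body = build_subset_reindex_body(
    "simpsons", "krusty", "raw_character_text", "Krusty the Clown"
)

# Body for a follow-up search that returns everything in the new index.
match_all_body = {"query": {"match_all": {}}}

print(json.dumps(subset_body, indent=2))
```

The first body would be POSTed to `_reindex`, and the second to `krusty/_search` to inspect the result.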
