⚠️ This lesson is retired and might contain outdated information.

Bulk import data into Elasticsearch

Instructor: Will Button
Published 8 years ago
Updated 2 years ago

Elasticsearch has a rich set of APIs for adding data to an index but for loading massive amounts of data, you’ll find the bulk interface much more efficient and performant. In this lesson you will learn how to format your data for bulk loading, add data via the bulk endpoint with curl, and add data via the bulk endpoint using the elasticsearch npm client.

[00:01] POST operations are the most common way to add data to an index. For example, in day-to-day operations, you may have your application configured to send each log event to your log index in Elasticsearch as the application generates the log entry. If you have many documents to add to your index, this isn't the most efficient way, because each document indexed incurs the setup and teardown overhead of its own API call.

[00:29] The Elasticsearch bulk API allows you to perform many index and delete operations within a single API call. We use the bulk API by posting data to its endpoint.

[00:44] We specify localhost:9200 as our cluster and call the _bulk endpoint, passing in a Content-Type header of application/json. The format the bulk API expects is the operation, the metadata, and then the data itself. Let me show you what that looks like.

[01:07] We're going to start by specifying that the operation is an index operation. Here's where our metadata comes in. The index that we want to hit is the logs index. The type or mapping that we're going to use is our client mapping that we defined. Then we define the data itself.

[01:29] That's just the fields for our documents and their values, separated from the metadata by a newline character. Then our next bulk operation starts the same way: we define the operation and the metadata, then the data itself separated by another newline, followed by our next operation and the data for that operation.
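Put together, the request from this walkthrough looks roughly like the sketch below; the logs index and client type come from the lesson, but the message and level fields are placeholders rather than the actual documents used in the video.

```sh
# A sketch of the bulk request built above: each operation is a metadata line
# followed by its document on the next line, and the body must end with a
# trailing newline (the heredoc provides one). Field values are illustrative.
curl -XPOST "http://localhost:9200/_bulk" \
  -H 'Content-Type: application/json' \
  --data-binary @- <<'BULK'
{ "index" : { "_index" : "logs", "_type" : "client" } }
{ "message" : "first log entry", "level" : "info" }
{ "index" : { "_index" : "logs", "_type" : "client" } }
{ "message" : "second log entry", "level" : "warn" }
BULK
```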

[01:51] Then finally, we close out our data and send the request. That responds back with a result for each operation, so this first one shows created: true with an HTTP status of 201.
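The response comes back looking roughly like this (abridged, with placeholder IDs and timing):

```json
{
  "took": 12,
  "errors": false,
  "items": [
    { "index": { "_index": "logs", "_type": "client", "_id": "<generated id>",
                 "_version": 1, "created": true, "status": 201 } },
    { "index": { "_index": "logs", "_type": "client", "_id": "<generated id>",
                 "_version": 1, "created": true, "status": 201 } }
  ]
}
```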

[02:05] Same thing for the second one: created true, status HTTP 201. If I switch back over to Postman and do a search, we have some additional results available in our search-for-everything now, because of that bulk operation. After watching that painful experience, you may be less than excited about using the bulk API. Fortunately, curl is probably not going to be your first choice for importing large amounts of data, and the Elasticsearch clients support the bulk API as well.

[02:37] As a matter of fact, if you used the Git repo that accompanies this course to get the Simpsons dataset into your own Elasticsearch cluster, you've already used the bulk API endpoint. Let's take a look at how that was done. In my working directory here, I have a subfolder called dataset, and inside of that is the CSV file that contains all of the Simpsons episodes that I imported into Elasticsearch. I need two dependency packages to make this work.

[03:06] The first one is the Elasticsearch client itself, so I'll install that and be sure to save it using the --save flag. The second one I need to install is csvtojson; all that does is take the CSV document you're looking at on my screen and, line by line, turn it into JSON objects suitable for inserting into Elasticsearch, so I don't have to do that myself. I'll also create my utils folder, and inside of that I'm going to create a new file and just call it import.js.
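Those two installs are just the standard npm commands:

```sh
# the legacy elasticsearch client and the CSV-to-JSON converter,
# both saved to package.json
npm install --save elasticsearch
npm install --save csvtojson
```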

[03:43] For me, this import operation is a utility-type function, so it just feels cleaner to keep it in a subdirectory called utils, but there's no hard and fast rule that it has to be done this way, so if you don't like it that way, you don't have to do it that way. I need to require the csvtojson library, and I also need to require the Elasticsearch npm module. I'm going to create a constant named episodes that is the source of our CSV file data.

[04:20] I'll create another constant that I'm just going to call ESCluster, which is the Elasticsearch endpoint we're talking to. index is going to be the index we put our data into, and type will be our mapping, in this case the mapping for the episodes. Then I'm going to create an empty array, which is where we're going to write our CSV data.

[04:49] Finally, I'll create the Elasticsearch client object itself, specifying our host and an apiVersion of 5.0. This is important because it locks your application into a specific version of the API, so if a future version of the client library breaks compatibility, your existing code has a better chance of continuing to work, because you've pinned the API version here.
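Put together, the top of import.js looks roughly like this. It's a sketch based on the description above: the CSV filename and the exact constant names are my assumptions rather than values confirmed by the lesson, and it uses the legacy elasticsearch npm client installed earlier.

```js
// utils/import.js: setup (a sketch; filename and constant names are assumptions)
const csv = require('csvtojson');
const elasticsearch = require('elasticsearch');

const episodes = './dataset/simpsons_episodes.csv'; // source CSV data
const esCluster = 'http://localhost:9200';          // Elasticsearch endpoint
const index = 'simpsons';                           // index to write into
const type = 'episodes';                            // mapping (ES 5.x type)
const bulk = [];                                    // CSV rows accumulate here

// Pin the client to the 5.0 API so later client releases are less likely
// to break this code.
const client = new elasticsearch.Client({
  host: esCluster,
  apiVersion: '5.0'
});
```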

[05:17] Next, just a console.log statement to show where we're at, then our CSV reads from the file. As we receive each JSON object from it, passed in as a parameter called obj, we push it onto our bulk array. You'll see this looks very similar to what we did with curl: we define our index, we define our type, and we pull the ID from the id field of our document. If we look at our CSV real quick, this first column here is the ID number, so we're going to use that in Elasticsearch.

[06:04] Then on the next line we push the JSON-formatted line from the CSV itself. This follows the same format we used with curl, but it's much nicer and much cleaner, and then I add a little log line for some sanity checking. At the end of the file, we have another function where we actually call the bulk API endpoint: the body of our request includes our bulk array, and if we have any errors, we just write those out to the log.
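Here's a sketch of that flow. It assumes csvtojson v1, which emitted 'json' and 'done' events (the v2 API works differently), and that the CSV's first column is named id; both are assumptions about the course repo rather than confirmed details.

```js
console.log('bulk importing into Elasticsearch');

csv()
  .fromFile(episodes)
  .on('json', (obj) => {
    // operation + metadata line, then the document itself,
    // matching the two-line pattern from the curl example
    bulk.push({ index: { _index: index, _type: type, _id: obj.id } });
    bulk.push(obj);
    console.log('queued episode', obj.id); // sanity check per row
  })
  .on('done', (err) => {
    if (err) {
      return console.log(err);
    }
    // one bulk call sends the whole array as the request body
    client.bulk({ body: bulk }, (bulkErr) => {
      if (bulkErr) {
        return console.log(bulkErr);
      }
      console.log('processing complete');
    });
  });
```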

[06:44] Otherwise, we just console.log "processing complete" at the end. With that done, I'm going to open up a terminal window. I'm going to delete the Simpsons index first, just so that there's no data there.
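With curl, that delete is a single call:

```sh
# remove any existing simpsons index so the bulk import starts from scratch
curl -XDELETE "http://localhost:9200/simpsons"
```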

[07:00] Now we can import our data by calling node, going into our utils folder and calling import.js, and it just says "bulk importing into Elasticsearch", so that doesn't look right. We should have seen it write out the ID number for each item as it iterated through the CSV file.

[07:22] If I look back through here, that's because I quoted my variable name of episodes. That shouldn't be quoted. Let's try that import again.
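The run itself is just the script under utils:

```sh
# run the importer from the project root
node utils/import.js
```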

[07:31] There we go, that's much better. Now if I return to Postman, we can do a cat on the indices endpoint. There's my Simpsons index, newly created, and we can dig down into that a little bit further and show that the status is yellow, the index is open, and there are 600 docs in it, as well as our primary and replica information.

[07:57] Just as a by the way, the ?v here tells Elasticsearch to return the column names in the output as well. If you don't specify the ?v and just send it, it only returns the data itself.
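For reference, the equivalent curl calls (the index name matches the one created above):

```sh
# ?v adds a header row naming each column (health, status, docs.count, ...)
curl "http://localhost:9200/_cat/indices/simpsons?v"

# without ?v, only the data row comes back
curl "http://localhost:9200/_cat/indices/simpsons"
```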