⚠️ This lesson is retired and might contain outdated information.

Create a type and map in an index

Will Button
Published 8 years ago
Updated 2 years ago

Types can probably be best thought of as a class for your index, and a mapping is the definition for that type. In this lesson, you will learn how to create a mapping type for an index based on sample data, verify it, then store data in the index using that type. You will also learn how Elasticsearch can automatically create mappings for you (known as Dynamic Mapping), and you’ll learn what mapping explosion is and how to avoid it.

[00:01] So far in this course, you've seen type mappings for our Simpsons index for episodes and for scripts. The episodes data contains information about each episode, such as the ID, the IMDb rating, the number and season, and the original air date, while the scripts type contains the actual lines spoken by each character in the episode.

[00:27] Each of these is a different document type, but still part of our Simpsons data set. In a relational database world, these would have been tables in the same database. In Elasticsearch, they're types in the same index, each with their own mapping.

[00:42] A mapping is the process of defining how a document and the fields within it are stored and indexed. We have the mapping for the index on the left, and one of the sample documents from the episodes data on the right.

[00:55] Now, this type of mapping is referred to as dynamic mapping. That means that Elasticsearch guessed at the mapping, because I didn't specify one before I imported the data. On the whole, it looks like it did a pretty good job.
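As a rough sketch of what that looks like (the field names and values below are illustrative assumptions based on the episode fields described in this lesson, not taken from the actual data set), indexing a document with no predefined mapping lets Elasticsearch infer the types, which you can then inspect:

```
# Index a sample episode document with no mapping defined;
# Elasticsearch guesses the field types (dynamic mapping).
POST /simpsons/episodes
{
  "title": "Bart the Genius",
  "imdb_rating": 7.8,
  "season": 1,
  "original_air_date": "1990-01-14"
}

# Inspect the mapping Elasticsearch generated
GET /simpsons/_mapping
```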

[01:10] It correctly picked up that the episode title is text. It picked up that the IMDb rating is a float, and even guessed that the original air date is a date type. Dynamic mapping will also traverse into subdocuments or nested elements inside of your document. There are two limits that come into play here.

[01:32] First, Elasticsearch won't index more than 1,000 fields per index, and it won't traverse more than 20 levels deep into subdocuments. Both of these are configurable, but the reason these limits exist is to prevent mapping explosion.
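For reference, these defaults correspond to the index.mapping.total_fields.limit and index.mapping.depth.limit settings, which can be raised on a per-index basis if you really need to (the index name below is just illustrative):

```
PUT /logs/_settings
{
  "index.mapping.total_fields.limit": 2000,
  "index.mapping.depth.limit": 30
}
```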

[01:47] Each document and subdocument you index must be analyzed so that it can be fully searchable. It makes sense that the larger the document, the more resources Elasticsearch requires to index it. Hitting the upper limits of these capabilities is referred to as mapping explosion, and it will cause significant performance degradation of your Elasticsearch cluster.

[02:09] If you find yourself in a position where you think you need to adjust these limits, I recommend taking a look at how you're storing and indexing your data, and seeing if there's a better usage pattern that suits your needs.

[02:23] What if the dynamic mapping didn't work right, or it didn't guess correctly? Maybe in the data I imported, season is a number here, but it should actually be a float. In that case, once a mapping is defined for an index, you can't change it.

[02:39] Changing a mapping means invalidating all of the documents that are already indexed. If you need to change a mapping, you have to create a new index with the correct mappings, and then reindex your data into that new index.
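A minimal sketch of that workflow, assuming a hypothetical simpsons_v2 index created with the corrected mapping:

```
# 1. Create the new index with the corrected mapping (not shown here).
# 2. Copy the existing documents into it with the _reindex API.
POST /_reindex
{
  "source": { "index": "simpsons" },
  "dest":   { "index": "simpsons_v2" }
}
```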

[02:52] Learning that after you've indexed millions of documents can be a painful lesson. For that reason, I always recommend creating your mapping when you define the index or add a new type to an existing index.

[03:04] Let's use an example, and I'll show you exactly how that's done. I've got a sample header and sample log entry here from a client application that's going to be writing its logs to Elasticsearch. The first field is the timestamp. The second field is the server name that it's interacting with.

[03:21] We have a log level for info, warning, or error, the IP address of the client, the latitude and longitude coordinates for the client's location, and then the log event or log message itself. Let me show you how to create an index to store that.

[03:37] We'll build our mapping with a PUT operation, specify the cluster that I'm talking to, and then the name of the index I want to create. In the body, we'll define our mappings. We'll give it the name of the type that we're mapping, which is going to be client. Then we're going to define the properties for that client mapping.

[03:58] The first field in our log entry is the timestamp. We'll define timestamp, and then define it as a type of date. Next, we have our server name. That's going to be a type of text. The log level is next, and for the type on that, I'm going to give it the type of keyword.

[04:20] The keyword data type is used for structured content. By creating our log level as a keyword, we're setting ourselves up to do some nice filtering. We can do things like finding all log entries where log level is equal to info, or where log level is equal to error.

[04:40] These keywords are searchable only by their exact value, so wildcard searches won't work. Other places where keywords are going to be useful for you are fields like email addresses, host names, status codes like our log level here, as well as ZIP codes, or defined tags.
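As a sketch of that kind of filtering (assuming the field ends up named log_level, which isn't spelled out in the narration), a term query matches the keyword's exact value:

```
GET /logs/_search
{
  "query": {
    "term": { "log_level": "error" }
  }
}
```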

[04:58] The next field is our client IP. The type for that is actually going to be ip. The cool thing about this is that it allows us to store both IPv4 and IPv6 addresses in one data store, and when searching, Elasticsearch is fully aware of the details of IP addressing, so we can do things like search by CIDR notation.
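A sketch of such a search, assuming the field is named client_ip:

```
# term queries against an ip field accept CIDR notation
GET /logs/_search
{
  "query": {
    "term": { "client_ip": "192.168.0.0/16" }
  }
}
```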

[05:20] I can search for 192.168.0.0/16, and it knows how to gather the IP addresses that are within that CIDR range and do the search correctly. The next field in our mapping is the client coordinates. The type for this will be geo_point.

[05:41] Storing that as geographical coordinates gives us the ability to do a geo bounding box query. I'll specify my field of client coordinates here, and then define my box, starting with the top left geographic coordinate.

[06:00] We'll have a latitude of 52, and a longitude of -1. Then the bottom right of my bounding box will have a latitude of 51, and a longitude of 1. When running this, it will correctly identify any documents in Elasticsearch where the client coordinates are somewhere in the London area.
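That query looks roughly like this, again assuming the field is named client_coordinates:

```
GET /logs/_search
{
  "query": {
    "geo_bounding_box": {
      "client_coordinates": {
        "top_left":     { "lat": 52, "lon": -1 },
        "bottom_right": { "lat": 51, "lon": 1 }
      }
    }
  }
}
```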

[06:28] With the geographic coordinates in here, we can do not only the bounding box query, but also find documents within a certain distance of a central point, or do more advanced things like finding errors from users near a geographic location, integrating distance into a document's relevance score, or sorting documents by distance.

[06:50] The last field is our log event itself, which is the details of the event. That one's going to be a type of text. If I submit this, and I did it all right, we get an HTTP 200 response back, and Elasticsearch acknowledges the request.
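Putting the whole walkthrough together, the request looks roughly like this. The index name logs and the type client come from the lesson; the field names are assumptions based on the narration, and the per-type mapping syntax shown is the older, pre-6.x multi-type form used in this course:

```
PUT /logs
{
  "mappings": {
    "client": {
      "properties": {
        "timestamp":          { "type": "date" },
        "server_name":        { "type": "text" },
        "log_level":          { "type": "keyword" },
        "client_ip":          { "type": "ip" },
        "client_coordinates": { "type": "geo_point" },
        "log_event":          { "type": "text" }
      }
    }
  }
}
```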

[07:07] Now, we can use the cat endpoint to take a look at our indices. We have the new logs index created, and we can also drill into the index itself and take a look at the mappings. There, you can see all of the fields that we defined.
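Those two checks correspond to these requests:

```
# List all indices, including the new logs index
GET /_cat/indices?v

# Drill into the logs index and inspect its mappings
GET /logs/_mapping
```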

[07:23] If I want to save a document into that index, I'll specify the endpoint of logs, and the mapping type of client. We'll do a POST operation, and then add the details to the body of the request, starting with our timestamp. Server name will be appserver01. Our log level is info. Client IP is 192.168.0.15.

[07:52] The cool thing about the client coordinates is that I don't have to specify anything about them. I can just pass in a string, and Elasticsearch will parse that string and turn it into the latitude and longitude for me. Then the final field is the log event itself, which just states that the client logged in.
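The indexing request described here looks roughly like this; the field names, timestamp, and coordinate values are placeholders, and geo_point happily parses a "lat,lon" string:

```
POST /logs/client
{
  "timestamp": "2017-06-01T12:00:00",
  "server_name": "appserver01",
  "log_level": "info",
  "client_ip": "192.168.0.15",
  "client_coordinates": "51.5,-0.1",
  "log_event": "client logged in"
}
```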

[08:14] Now, when I send this, we get an HTTP 201 Created response back from Elasticsearch. It gives us the ID, as well as verifying the index and type that we wrote to. If I go back over here to cat indices, you can see our logs index now has one document in it.

[08:35] Then we can also do a search to retrieve all documents, and we get our one document that we've submitted back. As you've already seen with the Simpsons index, we can have multiple types for a given index. This type we created was for our client logs, but in the same logs index, we could also have one for our server logs.

[08:59] The logs index contains logs from all different sources, and then we use types to define the individual source of those logs. There are a couple of considerations whenever you're defining multiple types within an index. The primary one is that field names are shared across mapping types.

[09:18] That means any field named client IP must be of type ip, any field named client coordinates must be of type geo_point, and so on. If that poses a problem for your existing mapping, you can get around it by using more descriptive field names, such as client log level or server log level.