How can we see a histogram of movies on IMDB with a particular rating? Or how much movies grossed at the box office each month? Or how many movies there are of each genre? These are examples of data aggregation questions, and this lesson will teach us how to answer them.
[00:00] For this episode, we're talking about aggregation. If we care a lot about aggregation, we're going to have a ton of data. There's a big example file for this episode. You can grab it online.
[00:11] Then, what I would do is I would just drop the database that we have. You would obviously never do this to a production database, because you're going to lose everything that you have, but since it's just easy to recreate things this way, we'll create the database. I've been calling mine "Pose Grass 101."
[00:28] Then, you can use the psql command line. We can tell it what database to use. That database is Post Grass 101. We can tell it to read a file. Just link to wherever you have that file downloaded and this will run those statements in Post Grass. This is a really nice way to share commands with people quickly.
[00:47] We can see this copied in almost 59,000 rows. We can see a lot of the data here that was inserted. Now that we've hopped back into our SQL program, we're going to select the count from the movies. We see that's the same thing. We can take a look at what they look like, but since we have this many movies, I really strongly encourage you to enforce a pretty strict limit on this.
[01:13] This is going to cause the query to return pretty fast. It returned within less than half a millisecond. Not too bad. This will let us look at the data without causing a lot of contention on the database, which is really important so long as it's not a production database.
[01:29] This is just something you're messing around with. By all means, do whatever you want with it. The limits are very important when you're working with a lot of data and that data is mission critical data.
[01:40] An aggregation is defined by the group by clause. We can group by the rating. For instance, we see the rating is going to be from zero to 10 and it's a decimal. The first thing we're going to see is this error. It says we need the movies that ID column to appear in the group by clause. We didn't group by the ID, but we selected everything. That's what it's mad about.
[02:08] Instead, what we're going to do is select the rating. We're just going to select the rating. This is...oh my God, it's in all kinds of a crazy order. We don't love that. Let's also add an order by clause. We can say order by one, which is the first column. This is the same as saying order by rating, but we'll say order by one. These will now be in order.
[02:32] We can see that basically what this did, despite the fact that there were so many records here, this essentially limited to all of the distinct values. There's only one record for each of these values here. That's an important thing to note about aggregations, that the group by clause is limiting to distinct values.
[02:52] We want to ask questions about each individual distinct value. What we're going to do here is say we want to see maybe for the rating how many movies are there for each rating. This gives us the list.
[03:09] Honestly, even for me, this is a little bit too much information. I want to just see a histogram of the information roughly. Now, we're going to group by round rating. Order by one is even better used now, because we don't have to type this everywhere.
[03:28] We could even do group by one, just FYI in case you don't want to have to use that everywhere. Now, what we're going to see is more of the actual picture of the data, the histogram of it. There are 272 movies here that have one star, 1,200 almost 1,300 that have two stars, etc.
[03:50] What this is doing is it's allowing us to aggregate on a particular distinct field and ask some question about all the underlying data there. Aggregations are really nice when all of the information is in a single field like this.
[04:06] Let's take a look at what happens though when they're spread across different fields. We see here we have a Boolean column for each of these fields that determine whether it's an action, animation, comedy, drama, documentary, etc.
[04:21] What's the genre? Basically, we could boil this down into a single column that says which genre it is. This isn't something that the group by clause is going to let us do right now. Ultimately, we want to see a histogram of that data, as well.
[04:35] What I want to introduce you to now is a case statement. You're probably familiar with a case statement from elsewhere, but we say when action equals true, then-this is going to be action-else is going to be other.
[04:50] Right now, just to show you what this looks like, and let's also select maybe the title. This is a little obvious. Yeah, this makes sense. We have a bunch of different other columns in here, because we simply haven't mentioned them yet. We could also say when animation is true, then it's animation. We can see we've got some animation ones now.
[05:15] I'm just going to go ahead and paste in the rest of these values for us here. We can see that some of them still IMDB didn't give any kind of ranking to, we don't have that information. Some of them are in fact other, but for most of these, we see this.
[05:32] What we have here is this case column, which is weird. Remember, we can rename it. We're going to rename this as genre, so now we have a genre column. This is kind of an unwieldy table to deal with, so I also want to introduce you to the width statement. This is going to create another temporary table for us to use.
[05:54] We'll say with genres as, and we'll take our select statement here and we'll just indent it, it's obvious that this is different. We'll back up here. Now, we have this new genres table to play with and a new query below here.
[06:08] Let's say select star from genres. We'll say we're going to group by the genre column, which is right here. Let's say genre and count. We'll order by one, as well. Now, we have a histogram of our genres. Pretty cool.
[06:34] Here, we see we still have the limit 100, which is why these numbers are so low, but if we remove that, now, we start to see a histogram of this data.