Combine Window Functions to Answer Complex Questions

Brett Cassette
InstructorBrett Cassette

Share this video with your friends

Send Tweet
Published 5 years ago
Updated 3 years ago

Individual window functions are useful, but sometimes you’ll need to combine them to answer complex questions. For instance, which 10 star movies were released in the year that had the most 10 star movies released?

[00:00] We're going to ask a fairly complicated question that's going to require us to compose window functions. That question is, essentially, which year has the most one-star ratings, two-star ratings, three-star ratings? Then, what were the titles of the movies that had those ratings in that year?

[00:18] The first thing that we're going to do is get counts for each year by each rating. I'm rounding the rating, because I don't, necessarily, want to know for 1.1, 1.2. I just want to know for one.

[00:30] We'll round the ratings by year. We'll order by the year descending. These are just the row numbers, we're going to see where we're starting here. In 2004, you can see we have 7, 8, 9, 10, 11, 12 one-star ratings.

[00:49] In 2003, we keep coming until we get to this last value. We've just said, "last value." That should tell us which window function we're going to reach into our pocket for next. The first thing I want to do is to split these up. Let's call these "Our rankings," that each of these queries can be isolated in itself.

[01:10] We'll say, "Select star from rankings." We'll also add onto this the last value of the ranking column over the partition by...we'll use rating and year.

[01:24] Now, we're using this rating, we don't need to round it anymore. We'll call this "The count." This should be the number of this many star movies in this particular year.

[01:36] As we've seen before with last value, it's important for us to give ourselves an order bias. We're ordering by the ranking. We have to grab the rows between unbounded preceding and unbounded following to make sure that we're getting the correct count here.

[01:56] Now that we have the counts, we want to make sure that we're grabbing the years that have the highest count for each particular rating. Let's call these "The rated years." We're going to rank them in some way. We will select from the rated years. We'll pick out the outliers in a second.

[02:19] For right now, let's just say, "Select star from the counts." We're going to want to get a ranking of some kind, I think DENSE_RANK makes sense. We'll say, "This is over the partition by the rating, ordered by the count descending." That's going to be the rank.

[02:39] We'll see what this looks like here. The rank for 1997, you can see is one for the rating of one, two for 2004 etc. The ones we want are the ones where the rank is one. These are the first place. These are the outliers.

[03:01] We see 1997 has 16 one-star rated movies. That is the highest number. We can see what the names of the movies are here. We can see 1997, also, has 85 two-star rated movies. 1997 seems to have produced a lot of bad movies, also for three.

[03:22] 1999 produced the most four-star rated movies and six-star etc. We can get down to 10 in 2004. It looks like a great year, with 70 10-star rated movies.

Mike
Mike
~ 5 years ago

Awesome course. You had great examples. Looking forward to the next one.

Babak Badaei
Babak Badaei
~ 4 years ago

The "counts" subquery shown in the video is not the same subquery in the git repo. There are three ways to correct the repo. In the context of this video, it should match the method shown here. However, here are two other ways:

counts AS ( SELECT *, last_value(ranking) OVER (partition by rating,year ORDER BY year DESC) AS count FROM rankings order by ranking ),

or

counts AS ( SELECT *, max(ranking) OVER (partition by rating,year ORDER BY year DESC) AS count FROM rankings ),

because otherwise the rankings subquery may return row_numbers out of order and the last_value will not be the largest value--therefore all records in the window should be taken into consideration.

Babak Badaei
Babak Badaei
~ 4 years ago

Interesting side note, unless this data is deliberately incomplete--there seems to be major rating inflation. The total movie count to satisfy round(rating) = 10 between 2000 and 2005 is almost the same as all movies from 1893 to 2000. ;-)

Controlling for increased movie production--a fun query to try is ranking not by count but rank as a percentage of total movies for that year. For example in 2004 alone, roughly 14% of all movies ranked 9+. If the top years since 1930 and before 2000 unite to compete against 2004, 1967 and 1984 were the best years and produced 4.7% and 0.8% of 9's and 10's within their respective years.