Detect Duplicates in Postgres with Cartesian Product

Brett Cassette
InstructorBrett Cassette
Share this video with your friends

Social Share Links

Send Tweet
Published 8 years ago
Updated 6 years ago

The cartesian product is a humdinger -- it relates all data in one table to all data in another! Now that’s a lot of rows, and fast. Surprisingly, cartesian products are as fast as inner joins at duplicate detection, thanks to the magic of the query planner!

[00:00] Let's look at an inner join. We'll inner join backers when projects.ID equal backers.projectID. This gives us all the projects where there's a backer for that project. A cross join doesn't need a clause. It takes every single project and matches it to every single backer regardless of whether or not they backed that project.

[00:30] Here we see the second project matched again to every single backer, and again, and again. This gives us the entire Cartesian product of the projects table matched up with the backers table. We could even cross join onto ourselves. P1 cross joins onto P2.

[00:53] Maybe we'd use this to detect some duplicates. Perhaps, where the P1.ID is not equal to the P2.ID and the P1.name is equal to the P2.name. This shows us two different projects that have the same name.