Web Scraping with Pagination and Advanced Selectors

Instructor: John Lindquist

Published 6 years ago
Updated 3 years ago

When web scraping, you'll often want to get more than just one page of data. Xray supports pagination by finding the "next" or "more" button on each page and cycling through each new page until it can no longer find that link. This lesson demonstrates how to paginate as well as more advanced selectors for when links are difficult to scrape.

[00:00] To paginate through Hacker News, we first find the elements we'll be looping through. It looks like "athing": the class "athing" is on each of the rows we'll loop through.

[00:12] I want to grab the rank (a rank of 1 on this one), then the third TD, and from it the A tag and that A tag's href. That gives us the rank, the title, and the link for the title.

[00:29] We'll go to Hacker News and look for "athing". As we loop through each one we find, the object we want to create will have the rank, which was that rank class, and the title, which was in the third TD. We'll pull off the nth-child trick by looking for TD number three and grabbing the A tag out of that.

[00:57] The link will be really similar: we'll say "link" and then grab the href. When I run this, we get the first page of results, 1 through 30, but we want to paginate through these and get every page we can.
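The "grab the href" step relies on x-ray's convention of appending "@attribute" to a CSS selector; with no "@", the element's text is extracted. As a rough illustration of how such a string breaks apart (a hand-written sketch, not x-ray's own parser; "splitSelector" is a hypothetical helper name, and the selector strings are inferred from the narration):

```javascript
// Hypothetical helper: split an x-ray style "selector@attribute" string.
// Assumption: the attribute name is whatever follows the LAST "@".
function splitSelector(input) {
  const at = input.lastIndexOf("@");
  if (at === -1) {
    // No "@": the element's text content is what gets extracted.
    return { selector: input.trim(), attribute: null };
  }
  return {
    selector: input.slice(0, at).trim(),
    attribute: input.slice(at + 1).trim(),
  };
}

// The title reads the link's text; the link reads its href attribute.
const title = splitSelector("td:nth-child(3) a");
const link = splitSelector("td:nth-child(3) a@href");
console.log(title, link);
```

Both fields point at the same A tag inside the third TD; only the part after "@" differs.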

[01:13] To be able to paginate, you have to find the thing you click on to go to the next page, which happens to be this little "More" link right here.

[01:21] Hacker News has very generic markup and very little styling, so we have to get creative by finding the rel attribute that equals "nofollow". We'll look for that, because the title class isn't helping and the rest of the table markup isn't helping either. We'll look for this link in particular.

[01:37] To test out that "More" link, let's go ahead and see what we get when we select A with rel equal to "nofollow". When I run this, you'll see that is not what we wanted. That means what we got must be only the first result in a set of results. We'll create an array with an object that just says "test" and see what each of the results gives us. We can see that "More" is the last one of all of those that we found.

[02:09] To get the last child, we'll test out this selector with ":last-child" added. Run it again. You can see now we only get "More", which is perfect for us. Now we can find this and get the href off of it.
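It may help to see why ":last-child" narrows the matches. In CSS it keeps an element only when that element is the final child of its own parent, which the result above suggests is true of the "More" link. A small sketch over a made-up tree (the node shapes, link names, and "findNofollowLinks" are illustrative assumptions, not Hacker News's actual markup):

```javascript
// Made-up miniature document tree; NOT Hacker News's real markup.
const tree = {
  tag: "table",
  children: [
    {
      tag: "td",
      children: [
        { tag: "a", rel: "nofollow", text: "first", children: [] },
        { tag: "span", text: "score", children: [] },
      ],
    },
    {
      tag: "td",
      children: [{ tag: "a", rel: "nofollow", text: "More", children: [] }],
    },
  ],
};

// Collect the text of nodes matching a[rel="nofollow"]. When lastChildOnly
// is set, keep a node only if it is the final child of its parent, which
// is what :last-child means.
function findNofollowLinks(node, lastChildOnly, found = []) {
  node.children.forEach((child, index) => {
    const isLink = child.tag === "a" && child.rel === "nofollow";
    const isLast = index === node.children.length - 1;
    if (isLink && (!lastChildOnly || isLast)) found.push(child.text);
    findNofollowLinks(child, lastChildOnly, found);
  });
  return found;
}

console.log(findNofollowLinks(tree, false)); // every nofollow link
console.log(findNofollowLinks(tree, true)); // only links that are a last child
```

In this toy tree the "first" link is followed by a sibling, so it is filtered out, while "More" is the only (and therefore last) child of its cell.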

[02:25] I'll undo back to where we were, and now we can achieve pagination by typing "paginate" between our query and "write". I'll pass in that selector of A with rel equal to "nofollow", make sure it's the last child, and take the href attribute off of that.
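Reconstructed from the narration, the query might look roughly like the sketch below. The transcript does not show the code, so the selector strings, the output filename, and the wrapper function "buildScraper" are all assumptions. The x-ray factory is passed in as an argument so the chain can be read and exercised without making a network request.

```javascript
// Sketch only: selectors are inferred from the narration ("a thing" is
// assumed to be the class "athing") and may not match Hacker News's
// current markup.
function buildScraper(Xray) {
  const x = Xray();
  return x("https://news.ycombinator.com", ".athing", [
    {
      rank: ".rank", // the rank class
      title: "td:nth-child(3) a", // text of the A tag in the third TD
      link: "td:nth-child(3) a@href", // href of that same A tag
    },
  ]).paginate('a[rel="nofollow"]:last-child@href'); // the "More" link
}

// Assumed usage, with x-ray installed from npm and an assumed filename:
//   buildScraper(require("x-ray")).write("results.json");
```

The "paginate" call sits between the query and the "write" call, as described above.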

[02:48] Now, when I run this, it will take a little bit longer. You can see when it's done, we now have hundreds of results. I'll scroll, scroll, scroll, scroll, scroll, scroll: hundreds of results. Every time it found that "More" link, it would scrape the entire page for all of the data, then click "More", scrape again, click "More", and keep on scraping.

[03:12] If you only need a few result sets or a few pages, you can say "limit" and pass three. This will run much faster this time. When I scroll, we should only get down to 90: 30 stories across three pages...
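The behavior described here, following the "More" link until it disappears or a page limit is reached, can be sketched without any scraping library. The in-memory "pages" table, its URLs, and the "crawl" function below are made up for illustration and are not how x-ray is implemented:

```javascript
// Fake site: five pages of 30 ranked stories, each linking to the next.
const PAGE_SIZE = 30;
const pages = {};
for (let p = 1; p <= 5; p++) {
  pages[`/news?p=${p}`] = {
    items: Array.from({ length: PAGE_SIZE }, (_, i) => ({
      rank: (p - 1) * PAGE_SIZE + i + 1,
    })),
    more: p < 5 ? `/news?p=${p + 1}` : null, // last page has no "More" link
  };
}

// Scrape a page, follow its "More" link, and stop when the link is
// missing or when `limit` pages have been visited.
function crawl(startUrl, limit = Infinity) {
  const results = [];
  let url = startUrl;
  let visited = 0;
  while (url && visited < limit) {
    const page = pages[url];
    results.push(...page.items);
    url = page.more;
    visited += 1;
  }
  return results;
}

console.log(crawl("/news?p=1").length); // all five fake pages
console.log(crawl("/news?p=1", 3).length); // three pages only
```

With a limit of three and 30 stories per page, the last rank collected is 90, which matches the count mentioned in the lesson.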

~ 6 years ago

Could you please give an example of how to use composition and pagination to another site? I can't get the documented feature to work: https://www.npmjs.com/package/x-ray#paginating-to-another-site