
Scraping Data from Sites with Login Forms with Nightmare

3:00 Node.js lesson by

Many pages you'll want to grab data from will require you to log in first. This video demonstrates how to log in and grab whatever data you need off the page.

Igor

Thanks, very useful actually! :)

Nabil

Awesome! Very useful!

Sequoia McDowell

Loved it, thanks!! I wrote a scraper a while ago and these x-ray + download would have made it waaaay simpler. :p

We're going to scrape the data behind this login, so we need to simulate entering an email, user@keystonejs.com, and a password, which is "password." Then we click "Sign in," which takes me here, and here we have our super-secret data.

Let's see how to do that based on what we already know. We create a new Nightmare instance. We want to go to that URL, which is localhost:3000/keystone/signin. Then we want to type in our email, so we call type. It looks like the selector is going to be just an ID of email, so that's easy.

Looking ahead, this one will be an ID of password. We'll look up our ID of email and type in user@keystonejs.com. Then we'll just duplicate this and change email to password this time. Our password was "password." Then we need to click on the button.

It's a pretty generic element. Let's try a class off form with a child of button, just to make it a little more explicit. We will click on off form with a child of button. Now this will trigger the page change. What we want to do next is actually wait for the element that we want to get to load.
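The login steps so far could be sketched roughly like this. It assumes a Promise-style Nightmare instance whose goto, type, and click calls chain, and it takes the button selector as a parameter: the name signIn, the options object, and the 'form button' selector in the usage note are hypothetical stand-ins, not the exact code from the video.

```javascript
// A minimal sketch, not the lesson's exact code.
// `nightmare` is assumed to be an instance created with require('nightmare')().
// `signIn` and the shape of `opts` are hypothetical names used for illustration.
function signIn(nightmare, opts) {
  return nightmare
    .goto(opts.url)                     // open the sign-in page
    .type('#email', opts.email)         // the email field has an ID of email
    .type('#password', opts.password)   // the password field has an ID of password
    .click(opts.submitSelector);        // clicking the button triggers the page change
}
```

You might call it as signIn(nightmare, { url: 'http://localhost:3000/keystone/signin', email: 'user@keystonejs.com', password: 'password', submitSelector: 'form button' }), swapping in whatever selector matches the button on your form.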

You may think logically, "Let's wait for something like 'sign out' to show up." We'd look for an element with a class of sign out, because that would mean that we're logged in. You have to be careful, because a site like this is rendered with React. If I look at this "Manage" heading, you can see it's rendered by React and has a React ID.

If I were to wait for "sign out" to show up, "Manage" wouldn't have been rendered yet, so if I looked for it, it wouldn't be there. Instead, I'm going to wait for something that actually has a React ID.

Let's try the page header. This may require you to do some trial and error. In this specific scenario, I'm just going to use the page header class. I'll say, "Wait for the page header class to load." Once that's loaded, we'll just find the h1 element, which should be that big "Manage" header.

You can see a bunch of debugging information here. Once it's done, you'll see that we get the text "Manage" logged out, because we found an h1 in our document that had an inner text of "Manage," and that was passed on to the next function as the result. You can see we logged that out.
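The wait-then-read step might look something like the sketch below. It assumes a Promise-style Nightmare version where wait, evaluate, and end chain and the result comes back through then. The names readHeading and scrapeHeading are hypothetical, and '.page-header' in the usage note is my guess at how the page header class is spelled.

```javascript
// A minimal sketch under the assumptions above, not the lesson's exact code.

// This function runs inside the page, so `document` is the browser's document.
function readHeading() {
  return document.querySelector('h1').innerText;
}

// `session` is assumed to be a Nightmare instance with the login steps already queued.
function scrapeHeading(session, readySelector) {
  return session
    .wait(readySelector)    // wait for an element that React has rendered
    .evaluate(readHeading)  // read the h1 text in the page
    .end();                 // close the browser when the queue finishes
}
```

With that in place, scrapeHeading(session, '.page-header').then(function (text) { console.log(text); }) would log whatever the h1 contains once the header has rendered.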

One tip here: instead of running this every single time when you're debugging and trying to figure out what goes where, I strongly recommend just using the console and doing things like document.querySelector('h1').innerText. That only took two seconds, versus waiting 30 seconds or a minute or however long the Nightmare task takes to run.
