Intro to Web Scraping with Node and X-ray

2:03 Node.js lesson by

Node and X-ray have made web scraping a really simple affair. This video introduces you to the process of scraping all of the "a" tags off of a URL and saving them to a .json file.

Maui

What IDE are you using?

In reply to egghead.io
Kenneth

Are you using Browserify? I'm getting a weird error about a node module missing somewhere way down in the x-ray package.

In reply to egghead.io
John

That's WebStorm.

In reply to Maui
John

Nope, nothing special. Just npm install x-ray and the code from the lesson. I just tested again from scratch and it still works as expected. Maybe you typed xray and forgot the dash of x-ray?

In reply to Kenneth
Sébastien

Happy member here, but I'm wondering if I'm missing something in the Egghead UI. Where do I go next? The panel to the right gives me a list of different lessons, but I was never able to figure out how that list is gathered. I see a few other lessons in that list that seem related to web scraping, but I'm left wondering if they are all listed there, or if I'm missing some, and which one comes next, or previously. Could the lessons be tagged to help? (here, with "web scraping", so I can find all the others in that series). Or numbered? Thanks.

UPDATE: by using Search, I found that there is a user-created playlist which seems to include most of the web scraping lessons. That could be a nice addition too: for a given lesson, show which playlists are including that lesson (maybe the larger ones, or most "popular" ones, etc).

I've already installed x-ray with npm, so I'm just requiring x-ray here. Make sure to include the dash when you install it, or else you'll get a different package entirely. I'll go ahead and create a new x-ray instance.

To scrape something, I need a URL I want to scrape. I'll say http://google.com, if you've heard of that site before, and then a selector that I want to grab from that site. I'll go ahead and grab the title. X-ray allows me to write this out to a file. I'll say write to results.json, which is this file over here. When I run this, you can see that the title element of google.com is Google.
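The step just described might look roughly like the following. This is a minimal sketch, assuming an x-ray 2.x style API in which the instance is called with a URL and a selector and the result is streamed to a file with `.write()`. The function name `scrapeTitle` and the two constants are illustrative and do not come from the lesson.

```javascript
// Assumes `npm install x-ray` has been run (with the dash in the name).
const TARGET_URL = 'http://google.com'; // site scraped in the lesson
const OUTPUT_FILE = 'results.json';     // file the lesson writes to

// Hypothetical helper: scrape the <title> of `url` into `outputPath`.
function scrapeTitle(url, outputPath) {
  // Required here so the dependency is only needed when a scrape runs.
  const Xray = require('x-ray');
  const x = Xray(); // "create a new x-ray"

  // 'title' is a plain CSS selector; .write() streams the result to disk.
  return x(url, 'title').write(outputPath);
}

// Usage (needs network access and the x-ray package):
// scrapeTitle(TARGET_URL, OUTPUT_FILE);
```

A plain string selector like this yields a single value, which is why the next step switches to a different selector shape to collect every matching element.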

To make this a little more interesting, let's go ahead and grab all the a tags instead of the title. I'll run this. You see it gives me "Images". There's an Images a tag in there somewhere, but that's not all the information I want. I want all of the a tags, and I want them formatted nicely.

We can do that with a third parameter here, which will be an array containing an object that describes what we want to get back. You can name the keys whatever you want. I'm going to name this key a and give it an empty string. That's going to give me the content of each of the a tags. I'll run that. You can see that now I have an array with objects inside of it, with keys of a mapping to the content of each of those a tags.

Now, you probably also want the link inside of it. I'll grab the href, and the selector for the href is going to be @href, the "at" sign for attribute, meaning just grab the href attribute off of the a tag that we found. I'll run this. You can see where Images would go, where Maps would go, and where Play would go.

Just to give us a little bit more information, we'll grab something like the CSS. I'll say, give me the class attribute. Run it again. You can see the various class names used on each a tag.
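Putting the last three steps together, the final scrape could be sketched as below. It assumes the same x-ray 2.x style API, where a third argument of the form `[{ ... }]` asks for one object per matched element and an `@name` selector reads an attribute. The key names `a`, `href`, and `css` follow the transcript; `linkSchema` and `scrapeLinks` are hypothetical names used only for illustration.

```javascript
// One object per matched <a>; the keys can be named anything.
const linkSchema = [{
  a: '',         // empty selector: text content of the <a> itself
  href: '@href', // "@" reads an attribute off the matched element
  css: '@class'  // class attribute of the same element
}];

// Hypothetical helper: scrape every <a> on `url` into `outputPath`.
function scrapeLinks(url, outputPath) {
  // Assumes `npm install x-ray`; required here so it is only
  // needed when a scrape actually runs.
  const Xray = require('x-ray');
  const x = Xray();

  // Scope is 'a'; the schema is applied to each element in that scope.
  return x(url, 'a', linkSchema).write(outputPath);
}

// Usage (needs network access and the x-ray package):
// scrapeLinks('http://google.com', 'results.json');
```

Keeping the schema in its own constant makes it easy to add or rename fields without touching the scraping call.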
