Web Scraping Images with Node, Xray, and Download

John Lindquist
InstructorJohn Lindquist
Share this video with your friends

Social Share Links

Send Tweet

Node makes scraping images off the web extremely easy using a couple handy packages: Xray and Download. Simple scrape the img tag, grab all the src attributes, filter out images you don't want, then hand them over to Download to grab them.

[00:00] To check for images, we'll switch over to the image tag. We'll switch this to just image. We'll change this guy to source, and him as well, and just get rid of that line there.

[00:14] When I run this, we see all of the images that are on Google.com. To do something a bit more realistic, let's try the Wikipedia page for Pluto. We'll run this and get a bunch of pictures of Pluto.

[00:27] One thing you might not be able to tell from this result set is some of these things are just little cursor icons and things on Wikipedia which we don't really need. Let's check the width and height and then filter those out.

[00:41] We'll add width, there's the add width for the width attribute and height add height for the height attribute and run it again. You can now see we have width and height to see when the images are really small and those are the ones we don't want.

[00:59] Because x-ray doesn't support filtering out results, we're going to have to do this ourselves. What it does allow you to do is invoke the results of the x-ray with a call back, which takes the error and the results.

[01:13] To keep showing our results over here just for visually and for storing the data later, I'm going to bring in the file systems stuff node and then say file system, write file to results.JSON. We need to JSON stringify our results. We don't want to change how it's stringified. It will pass in null. We want to use a tab character for formatting. If I run this again, we should get the same thing in our results JSON. Now we're just doing it ourselves.

[01:48] To filter out the small images, we'll go ahead and say results are equal to the results where we filter out the images where the...will return images where the width is greater than 100. That should be pretty safe for us. Anything that's over 100 we'll get. You can see that now our results set only has images that are greater than 100 in width, which looks pretty good to me.

[02:19] To download these guys, I already NPM-installed a package called download. With download, you just say, "I want a new download." When we filter on each of these guys, we'll just tack on a, for each. Now, we can take each image. We'll say, "Download, get the image SRC" which is the path to the image. We'll set the download destination to /images. We'll tell download to run. Now you can watch as the images starts streaming in.