
Web Scraping Images with Node, Xray, and Download

3:07 Node.js lesson

Node makes scraping images off the web extremely easy using a couple of handy packages: Xray and Download. Simply scrape the img tags, grab all the src attributes, filter out images you don't want, then hand them over to Download to grab them.




Nabil

Let's say I put my application on an application server; how will things download then? Won't it download the images on the server? If so, how would I be able to do it on the client's PC?

Paul

Hi. You can't create (and download) arbitrary files on a client machine because of security restrictions - https://en.wikipedia.org/wiki/JavaScript#Security

In reply to Nabil
philip

Regarding filtering out images of a certain size... what if the image doesn't have a @width attribute?

Would another option be requesting the HEAD of the image and getting the file size? If so, any guidance on how to implement that? Thanks!

In reply to egghead.io
Sequoia McDowell

FYI if you're scraping large files like mp3s rather than small images you might not want to start downloads in a simple forEach. I don't know exactly what happens if you attempt to download 250 large files at once, but it probably isn't good! :) Another reason to avoid this would be to not accidentally DOS the site if it's a small mom & pop server rather than google.

A function like async's parallelLimit will allow you to say "download in parallel, but only 5 at a time" which may work better for you and the site operator.
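To show the "at most N in flight" idea Sequoia describes without pulling in the async package, here's a hand-rolled helper with the same behavior as a parallel limit. The Download usage in the trailing comments assumes the chained get/dest/run API from the lesson's era of the download package, so treat that part as an assumption.

```javascript
// Run an array of tasks (functions taking a node-style callback),
// keeping at most `limit` of them in flight at once.
function runLimited(tasks, limit, done) {
  var next = 0;      // index of the next task to start
  var active = 0;    // tasks currently running
  var finished = 0;  // tasks that have completed
  var failed = false;

  if (tasks.length === 0) return done(null);

  function launch() {
    while (active < limit && next < tasks.length) {
      active++;
      tasks[next++](function (err) {
        active--;
        finished++;
        if (failed) return;
        if (err) { failed = true; return done(err); }
        if (finished === tasks.length) return done(null);
        launch(); // a slot opened up; start the next task
      });
    }
  }
  launch();
}

// Usage sketch (assumes the old chained Download API from the lesson):
// var tasks = results.map(function (img) {
//   return function (cb) {
//     new Download().get(img.src).dest('./images').run(cb);
//   };
// });
// runLimited(tasks, 5, function (err) { /* all done or first error */ });
```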

John

Hmm, this download npm module doesn't seem to want to work for that for loop. I removed the for loop and just used the url-download module, passing in the whole array to download.

var download = require("url-download");
download(results, './images').on('close', function (err, url) {
    console.log(url + ' has been downloaded.');
});

Hope this helps someone reading this.

izdb

This lesson doesn't appear to work at all anymore; I copied the code and ran it on Node v5.3.0, and it fails with no errors.

{
  "name": "xray-tuts",
  "version": "1.0.0",
  "description": "",
  "main": "app.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "download": "^5.0.2",
    "x-ray": "^2.3.1"
  }
}

Vinny

Well, apparently the Download package has been updated. I checked their docs and fixed the code accordingly:

// ... ^^ imports and xray config

(function(err, result) {
  var images = result.filter(function(img) { 
    return img.width > 100; 
  })
  .map(function(img) {
    // Here is the new download code.
    // Download takes asset url and download destination.
    // I used map() here, but forEach would provide the 
    // same output
    Download(img.src, './images');
   });

   // Write the original return result to JSON file
   fs.writeFile('./results.json', JSON.stringify(result, null, '\t'));
});
Alonso

Thanks Vinny!

In reply to Vinny

To check for images, we'll switch over to the img tag. We'll switch this to just img. We'll change this guy to src, and him as well, and just get rid of that line there.

When I run this, we see all of the images that are on Google.com. To do something a bit more realistic, let's try the Wikipedia page for Pluto. We'll run this and get a bunch of pictures of Pluto.

One thing you might not be able to tell from this result set is some of these things are just little cursor icons and things on Wikipedia which we don't really need. Let's check the width and height and then filter those out.

We'll add width, that's @width for the width attribute, and height, @height for the height attribute, and run it again. You can now see we have width and height, so we can tell when the images are really small; those are the ones we don't want.
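The scrape described so far can be sketched like this, using x-ray's attribute syntax (@src, @width, @height) against the Wikipedia Pluto page from the lesson. The require is tucked inside the function so nothing runs on load; this is a reconstruction from the transcript, not the lesson's exact source.

```javascript
// Collect src, width, and height for every img on the page.
var imageSelector = {
  src: '@src',
  width: '@width',
  height: '@height'
};

function scrapePluto(callback) {
  var Xray = require('x-ray'); // npm install x-ray
  var x = Xray();
  // x(url, scope, selector) scopes to each img tag and
  // hands the results to the callback as (err, results).
  x('https://en.wikipedia.org/wiki/Pluto', 'img', [imageSelector])(callback);
}
```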

Because x-ray doesn't support filtering out results, we're going to have to do this ourselves. What it does allow you to do is invoke the result of the x-ray with a callback, which takes the error and the results.

To keep showing our results over here, just visually and for storing the data later, I'm going to bring in the file system stuff in Node and then say fs.writeFile to results.json. We need to JSON.stringify our results. We don't want to change how it's stringified, so we'll pass in null. We want to use a tab character for formatting. If I run this again, we should get the same thing in our results.json. Now we're just doing it ourselves.

To filter out the small images, we'll go ahead and say results equals the results filtered down to the images whose width is greater than 100. That should be pretty safe for us: anything that's over 100 we'll keep. You can see that now our result set only has images that are greater than 100 in width, which looks pretty good to me.

To download these guys, I already npm-installed a package called download. With download, you just say, "I want a new Download." When we filter on each of these guys, we'll just tack on a forEach. Now we can take each image. We'll say, "Download, get the image src," which is the path to the image. We'll set the download destination to ./images. We'll tell download to run. Now you can watch as the images start streaming in.
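The filter-then-download step can be sketched as below. The width check is pure; scraped attributes come back as strings (or undefined when the attribute is missing). The Download call assumes the chained get/dest/run API from the lesson's era of the package (newer versions changed this, as the comments above note), so treat that part as an assumption.

```javascript
// Keep only images wider than 100px, matching the lesson's filter.
function isLargeEnough(img) {
  // parseInt copes with string or undefined widths, and NaN > 100 is false,
  // so images with no @width attribute are filtered out too.
  return parseInt(img.width, 10) > 100;
}

function downloadLargeImages(results) {
  var Download = require('download'); // npm install download
  results.filter(isLargeEnough).forEach(function (img) {
    new Download().get(img.src).dest('./images').run();
  });
}
```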
