Web Scraping Images with Node, Xray, and Download

Instructor: John Lindquist

Published 6 years ago
Updated 3 years ago

Node makes scraping images off the web extremely easy using a couple of handy packages: x-ray and Download. Simply scrape the img tags, grab all the src attributes, filter out the images you don't want, then hand them over to Download to fetch them.

[00:00] To check for images, we'll switch over to the img tag. We'll change this selector to img, change this one to src, and that one as well, and just get rid of that line there.

[00:14] When I run this, we see all of the images that are on Google.com. To do something a bit more realistic, let's try the Wikipedia page for Pluto. We'll run this and get a bunch of pictures of Pluto.

[00:27] One thing you might not be able to tell from this result set is that some of these are just little cursor icons and other small graphics on Wikipedia which we don't really need. Let's check the width and height and then filter those out.

[00:41] We'll add width, pointing at the @width attribute, and height, pointing at the @height attribute, and run it again. You can now see we have width and height, so we can tell when the images are really small; those are the ones we don't want.
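The selector described above can be sketched as a plain object. This is a sketch assuming x-ray v2's "@attribute" syntax; the actual Xray call is shown only as a comment, since it requires the package and network access:

```javascript
// A sketch of the x-ray selector described above, assuming x-ray v2's
// "@attribute" syntax. Each matched <img> would become an object with
// src, width, and height taken from the tag's attributes.
var selector = [{
  src: '@src',       // the image URL
  width: '@width',   // width attribute (arrives as a string)
  height: '@height'  // height attribute (arrives as a string)
}];

// With the x-ray package installed, this would be invoked roughly as:
// var Xray = require('x-ray');
// var x = Xray();
// x('https://en.wikipedia.org/wiki/Pluto', 'img', selector)(callback);

console.log(Object.keys(selector[0]).join(','));
```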

[00:59] Because x-ray doesn't support filtering out results, we're going to have to do this ourselves. What it does allow you to do is invoke the result of the x-ray call with a callback, which takes the error and the results.

[01:13] To keep showing our results over here visually, and for storing the data later, I'm going to bring in Node's file system module and then say fs.writeFile to results.json. We need to JSON.stringify our results. We don't want to change how it's stringified, so we'll pass in null, and we want to use a tab character for formatting. If I run this again, we should get the same thing in our results.json; now we're just writing it ourselves.

[01:48] To filter out the small images, we'll say results equals results.filter, returning the images where the width is greater than 100. That should be pretty safe for us; anything over 100 we'll keep. You can see that now our result set only has images that are wider than 100, which looks pretty good to me.
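The filtering step can be sketched in isolation. The sample data below is illustrative; note that attribute values arrive as strings, so it's safest to compare numerically:

```javascript
// Keep anything wider than 100, as described above. Widths come back
// as attribute strings, so convert before comparing.
var results = [
  { src: 'pluto-large.jpg', width: '640', height: '480' },
  { src: 'cursor-icon.png', width: '16', height: '16' },
  { src: 'pluto-map.jpg', width: '1024', height: '512' }
];

results = results.filter(function (img) {
  return Number(img.width) > 100;
});

console.log(results.length); // 2 of the 3 sample images survive
```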

[02:19] To download these, I've already npm-installed a package called download. With download, you just say you want a new Download. Where we filtered each of these, we'll just tack on a forEach. Now we can take each image and say download.get(image.src), which is the path to the image, set the download destination to ./images, and tell download to run. Now you can watch as the images start streaming in.
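Because the download package's API has changed across major versions (see the comments below), the loop itself can be sketched version-agnostically by injecting the downloader as a plain (url, dest) function. downloadAll is a hypothetical helper name, not part of any package:

```javascript
// A version-agnostic sketch of the download step. The downloader is
// injected as a plain function (url, dest), so the same loop works
// whether the underlying call is Download().get(...).run() (old API)
// or download(url, dest) (newer API).
function downloadAll(images, dest, downloadFn) {
  images.forEach(function (img) {
    downloadFn(img.src, dest);
  });
}

// Demonstration with a stub downloader that just records its calls:
var calls = [];
downloadAll(
  [{ src: 'a.jpg' }, { src: 'b.jpg' }],
  './images',
  function (url, dest) { calls.push(url + ' -> ' + dest); }
);
console.log(calls.join('; ')); // a.jpg -> ./images; b.jpg -> ./images
```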

Nabil Makhout
~ 6 years ago

Let's say I put my application on an application server. How will things download then? Won't it download the images on the server? If so, how would I be able to do it on the client's PC?

Paul
~ 6 years ago

Hi. You can't create (and download) files on a client machine because of security restrictions: https://en.wikipedia.org/wiki/JavaScript#Security

Sequoia McDowell
~ 6 years ago

FYI, if you're scraping large files like mp3s rather than small images, you might not want to start downloads in a simple forEach. I don't know exactly what happens if you attempt to download 250 large files at once, but it probably isn't good! :) Another reason to avoid this is so you don't accidentally DoS the site if it's a small mom & pop server rather than Google.

A function like async's parallelLimit will allow you to say "download in parallel, but only 5 at a time" which may work better for you and the site operator.
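The "only 5 at a time" idea the comment attributes to async's parallelLimit can also be sketched without any dependency, as a small promise pool. This is a sketch, not the async library's implementation; tasks is an array of functions that each return a promise:

```javascript
// A dependency-free sketch of "parallel, but only N at a time",
// in the spirit of async's parallelLimit. tasks is an array of
// functions returning promises; limit caps how many are in flight.
function parallelLimit(tasks, limit) {
  var results = new Array(tasks.length);
  var next = 0;

  // Each worker pulls the next unclaimed task until none remain.
  function worker() {
    if (next >= tasks.length) return Promise.resolve();
    var i = next++;
    return tasks[i]().then(function (value) {
      results[i] = value;
      return worker();
    });
  }

  var workers = [];
  for (var w = 0; w < Math.min(limit, tasks.length); w++) {
    workers.push(worker());
  }
  return Promise.all(workers).then(function () { return results; });
}

// Demonstration with trivial tasks; with a real downloader you would
// wrap each download in a task, e.g.
// parallelLimit(images.map(function (img) {
//   return function () { return download(img.src, './images'); };
// }), 5);
parallelLimit([1, 2, 3].map(function (n) {
  return function () { return Promise.resolve(n * 2); };
}), 2).then(function (out) {
  console.log(out.join(',')); // 2,4,6
});
```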

John Lorrey
~ 5 years ago

Hmm, this download npm module doesn't seem to want to work with that forEach loop. I removed the loop and just used the url-download module, passing in the whole array to download:

var download = require("url-download");
download(results, './images').on('close', function (err, url) {
  console.log(url + ' has been downloaded.');
});

Hope this helps someone reading this.

izdb
~ 5 years ago

This lesson doesn't appear to work at all anymore, copied the code to node v5.3.0. Fails with no errors.

{
  "name": "xray-tuts",
  "version": "1.0.0",
  "description": "",
  "main": "app.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "download": "^5.0.2",
    "x-ray": "^2.3.1"
  }
}

Vinny
~ 5 years ago

Well, apparently the Download package has been updated. I checked their docs and fixed the code accordingly:

// ... ^^ imports and xray config

(function(err, result) {
  var images = result.filter(function(img) {
    return img.width > 100;
  })
  .map(function(img) {
    // Here is the new download code.
    // Download takes the asset url and the download destination.
    // I used map() here, but forEach would produce the same result.
    Download(img.src, './images');
  });

  // Write the original result to a JSON file
  fs.writeFile('./results.json', JSON.stringify(result, null, '\t'));
});

Alonso Lamas
~ 5 years ago

Thanks Vinny!