
Web Scraping Images with Node, Xray, and Download

egghead.io

Node makes scraping images off the web extremely easy using a couple of handy packages: Xray and Download. Simply scrape the img tag, grab all the src attributes, filter out the images you don't want, then hand them over to Download to grab them.
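
If you want to see that flow end to end, here is a rough sketch (the URL, the width cutoff, and the selectors are placeholders, and the Download call assumes the newer promise-style download API, so adjust for the version you have installed):

var Xray = require('x-ray');
var Download = require('download');
var x = Xray();

// Grab the src and width attributes of every img on the page
x('http://example.com', 'img', [{ src: '@src', width: '@width' }])(function (err, results) {
  if (err) return console.error(err);

  results
    .filter(function (img) { return img.width > 100; }) // skip tiny images
    .forEach(function (img) {
      Download(img.src, './images'); // assumes download(url, dest) signature
    });
});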

Nabil

Let's say I put my application on an application server, how will things download then? Won't it download the images on the server? If so, how would I be able to do it on the client's PC?

Paul

Hi. You can't create (and download) files on a client machine because of browser security restrictions - https://en.wikipedia.org/wiki/JavaScript#Security

In reply to Nabil
philip

In regard to filtering out images of a certain size... what if the image doesn't have a @width attribute?

Would another option be requesting the HEAD of the image and getting the file size? If so, any guidance on how to implement that? Thanks!
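
Something like this is what I had in mind, using Node's built-in https module (getFileSize is just a made-up helper name and the URL is a placeholder), though I'm not sure it's the right approach:

var https = require('https');
var urlLib = require('url');

// Ask the server for headers only and read Content-Length instead of downloading the body
function getFileSize(imgUrl, cb) {
  var opts = urlLib.parse(imgUrl);
  opts.method = 'HEAD';
  var req = https.request(opts, function (res) {
    cb(null, parseInt(res.headers['content-length'], 10));
  });
  req.on('error', cb);
  req.end();
}

getFileSize('https://example.com/photo.jpg', function (err, bytes) {
  if (err) return console.error(err);
  console.log('Image size: ' + bytes + ' bytes');
});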

In reply to egghead.io
Sequoia McDowell

FYI, if you're scraping large files like mp3s rather than small images, you might not want to start downloads in a simple forEach. I don't know exactly what happens if you attempt to download 250 large files at once, but it probably isn't good! :) Another reason to avoid this is so you don't accidentally DoS the site if it's a small mom & pop server rather than Google.

A function like async's parallelLimit will allow you to say "download in parallel, but only 5 at a time" which may work better for you and the site operator.
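
Rough sketch of what I mean, assuming results is the array of scraped src URLs and that Download returns a promise (newer versions of the download package do):

var async = require('async');
var Download = require('download');

// Wrap each download in a task function that async can schedule
var tasks = results.map(function (src) {
  return function (done) {
    Download(src, './images').then(function () { done(); }, done);
  };
});

// Run the tasks, but never more than 5 downloads at once
async.parallelLimit(tasks, 5, function (err) {
  if (err) return console.error(err);
  console.log('All downloads finished');
});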

John

Hmm, this download npm module doesn't seem to want to work with that for loop. I removed the for loop and just used the url-download module, passing in the whole array to download.
var download = require("url-download");

download(results, './images').on('close', function (err, url) {
  console.log(url + ' has been downloaded.');
});
Hope this helps someone reading this.

izdb

This lesson doesn't appear to work at all anymore. I copied the code and ran it on Node v5.3.0; it fails with no errors.

{
  "name": "xray-tuts",
  "version": "1.0.0",
  "description": "",
  "main": "app.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "download": "^5.0.2",
    "x-ray": "^2.3.1"
  }
}

Vinny

Well, apparently the Download package has been updated. I checked their docs and fixed the code accordingly:

// ... ^^ imports and xray config

(function(err, result) {
  var images = result.filter(function(img) {
    return img.width > 100;
  })
  .map(function(img) {
    // Here is the new download code.
    // Download takes the asset url and the download destination.
    // I used map() here, but forEach would provide the
    // same output
    Download(img.src, './images');
  });

  // Write the original return result to a JSON file
  fs.writeFile('./results.json', JSON.stringify(result, null, '\t'));
});
Alonso

Thanks Vinny!

In reply to Vinny