Many websites have more than just simple static content. Dynamic content that is rendered by JavaScript requires a browser before it can be scraped. This video demonstrates how to use Nightmare (which is a wrapper around Electron) to load a URL and scrape dynamic data.
[00:00] Many sites like weather.com require JavaScript to execute before they render something like the temperature. For example, if I search for temperature as a selector, you can see it finds my temperature, but if I drill into it, you can see ng-isolate-scope, which means that someone is using Angular to render out this 76 degrees.
[00:22] If I try to scrape the temperature, I would only get a blank HTML tag right there. I wouldn't get the actual degrees, because you need a browser to run and execute the JavaScript.
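To make the problem concrete, here is a minimal sketch in plain Node. The HTML string is a made-up stand-in for what a server might send before Angular runs (the markup and class names are assumptions, not weather.com's actual source), and the regular expression is for illustration only, not a recommended way to parse HTML.

```javascript
// Hypothetical markup as delivered by the server, before any JavaScript runs.
// The element exists, but the framework has not filled it in yet.
const rawHtml = '<div class="ng-isolate-scope"><span class="temperature"></span></div>';

// Naive static scrape: pull out whatever text sits inside the temperature element.
function scrapeTemperature(html) {
  const match = html.match(/<span class="temperature">([^<]*)<\/span>/);
  return match ? match[1] : null;
}

const staticResult = scrapeTemperature(rawHtml);
console.log(JSON.stringify(staticResult)); // the element is found, but its text is empty
```

The element matches, yet there is nothing inside it, which is why a browser has to run the page's JavaScript first.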
[00:33] What I'm going to do is leverage a project called "Nightmare," which is a wrapper around PhantomJS, a headless browser, meaning it doesn't have any UI.
[00:42] It just launches in the background and can render a page and execute JavaScript. This Nightmare project makes it much easier to work with. I've already npm installed Nightmare and PhantomJS. I can say, import Nightmare from "nightmare".
[01:01] Then, I just create a new Nightmare, and then leverage the Nightmare API to achieve what I want to achieve. To scrape the temperature from weather.com, I'll just say, "goto." Then, I'll chain on an evaluate, and then tell that to execute with a run.
[01:20] I want to go to weather.com. Evaluate is going to take two functions, the first one being in the scope of the browser so I can actually access the document here. The second function is going to handle the result that I return from that scope of the browser.
[01:45] What I mean by that is if I return document.querySelector, my query is just going to be the temperature selector. I'll grab the innerText. That value is returned from the first function and then passed in as an argument to the second one, which I'll just call temperature, and then I log out temperature.
[02:12] I'll go ahead and run this. This will take a while, but it logs out 76 degrees. Nightmare is not doing anything too fancy for us. It's just giving us a convenient API or way to work with PhantomJS, where we go to a URL, we evaluate what's actually in the browser.
[02:31] Inside this function, we have the browser scope. Then in the second function, we can take what we return from this first one, and we're back into the scope of Node. Then, we run it.
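The split between the two scopes can be imitated in plain Node. The sketch below is a mock, not Nightmare's implementation: `evaluate` and `fakeDocument` are hypothetical names, and Node's built-in `vm` module stands in for the browser. It assumes the first function is shipped across as source text and that only a JSON-serializable return value comes back.

```javascript
const vm = require('vm');

// Stand-in for the page: a fake `document` that only knows one selector.
// (Hypothetical; a real browser provides the real DOM.)
const fakeDocument = {
  querySelector(selector) {
    return selector === '.temperature' ? { innerText: '76°' } : null;
  },
};

// Mock of the two-function pattern: the first function is sent to the
// "browser" as source text, and only its serialized return value comes back.
function evaluate(browserFn, nodeCallback) {
  const sandbox = vm.createContext({ document: fakeDocument });
  const json = vm.runInContext(`JSON.stringify((${browserFn.toString()})())`, sandbox);
  nodeCallback(JSON.parse(json));
}

let logged;
evaluate(
  function () {
    // "Browser" scope: `document` exists here, Node variables do not.
    return document.querySelector('.temperature').innerText;
  },
  function (temperature) {
    // Back in Node scope: we only receive the returned value.
    logged = temperature;
    console.log(temperature);
  }
);
```

Because the first function travels as text, it cannot close over variables defined in the Node script; anything it needs has to live in the page.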
It looks like Nightmare doesn't support HTTPS. Or how do I configure it for HTTPS?
Thanks!
When googling for answers, try looking for "PhantomJS https" (because Nightmare is just a wrapper around PhantomJS).
So add the following config when you run your script: "--ssl-protocol=any". And I tossed this together as a bonus (I'll be talking more about "cheerio" in future videos):
import Nightmare from "nightmare";
import cheerio from "cheerio";

new Nightmare()
  .goto('https://l3com.taleo.net/careersection/l3_ext_us/jobsearch.ftl')
  .evaluate(function () {
    return document.documentElement.innerHTML; // pass all of the HTML as text
  }, function (html) {
    let $ = cheerio.load(html); // use cheerio for jQuery-style selectors in Node
    let titles = $('#jobs .absolute>span>a').map(function () {
      return $(this).text();
    }).get();
    console.log(titles); // log out the array of job titles
  })
  .run();
How does it compare to CasperJS? I've spent a lot of time with Casper, and I'm curious if someone out there is familiar enough with both APIs to have an opinion on the two.
Thanks for the video. However, I could not get the current script to work. It looks like Nightmare has changed its syntax. Borrowing from their posted example, I came up with this, and it worked for me after installing 'vo'.
var Nightmare = require('nightmare');
var vo = require('vo');

vo(function* () {
  var nightmare = Nightmare({ show: true });
  var link = yield nightmare
    .goto('http://weather.com')
    .evaluate(function () {
      return document.querySelector('.temperature').innerText;
    });
  yield nightmare.end();
  return link;
})(function (err, result) {
  if (err) return console.log(err);
  console.log(result);
});
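For readers wondering what a runner like vo contributes here: it drives the generator, pausing at each yield until the asynchronous step finishes and then resuming with the result. The sketch below shows that general idea in plain Node. It is not vo's actual source; `run` and `fakeScrape` are hypothetical names, and it assumes each yielded value behaves like a promise.

```javascript
// Minimal generator runner: resumes the generator whenever a yielded promise settles.
function run(generatorFn) {
  return new Promise(function (resolve, reject) {
    const iterator = generatorFn();
    function step(method, value) {
      let result;
      try {
        result = iterator[method](value); // 'next' resumes, 'throw' raises inside the generator
      } catch (err) {
        return reject(err);
      }
      if (result.done) return resolve(result.value);
      Promise.resolve(result.value).then(
        function (v) { step('next', v); },
        function (e) { step('throw', e); }
      );
    }
    step('next');
  });
}

// Hypothetical stand-in for an asynchronous scrape.
function fakeScrape() {
  return new Promise(function (resolve) {
    setTimeout(function () { resolve('76°'); }, 10);
  });
}

const finished = run(function* () {
  const temperature = yield fakeScrape(); // reads like synchronous code
  return temperature;
});

finished.then(function (result) {
  console.log(result);
});
```

The generator body stays flat and sequential, while the runner handles the waiting in between.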
Thanks Baskin, your code worked for me. This video should be updated with clearer notes. I don't understand why vo is being used. Could you explain?
Hi, I need to send a custom header in goto. How can I achieve this?
Nice tutorial. Unfortunately I tried to use Nightmare to crawl an AJAX site and it didn't work.
var Nightmare = require('nightmare');

new Nightmare()
  .goto('https://l3com.taleo.net/careersection/l3_ext_us/jobsearch.ftl')
  .evaluate(function () {
    var links = document.querySelectorAll('th a');
    return links;
  }, function (links) {
    console.log(links);
  })
  .run();
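One possible cause (an assumption, since the comment does not say what was logged): the value returned from evaluate has to be serialized on its way back to Node, and DOM nodes do not survive that trip. A common workaround is to map the nodes to plain strings or objects inside the browser-side function. The sketch below shows that mapping against a hypothetical `fakeDoc` so it runs without a browser; `extractLinks` is an illustrative name, and its body would need to be inlined in the evaluate function, because that function cannot see Node-side helpers.

```javascript
// Intended for the browser scope: turn DOM nodes into plain, serializable data.
function extractLinks(doc) {
  const anchors = doc.querySelectorAll('th a');
  return Array.prototype.map.call(anchors, function (a) {
    return { text: a.textContent.trim(), href: a.href };
  });
}

// Hypothetical stand-in for `document`, so the sketch runs without a browser.
const fakeDoc = {
  querySelectorAll: function () {
    return [
      { textContent: ' Software Engineer ', href: 'https://example.com/jobs/1' },
    ];
  },
};

const links = extractLinks(fakeDoc);
console.log(JSON.stringify(links));
```

If the list is still empty on the real page, the content may simply not have loaded yet when evaluate runs, which is a separate issue from serialization.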