Node and x-ray make web scraping a really simple affair. This video introduces you to the process of scraping all of the "a" tags off of a URL and saving them to a .json file.
[00:01] I've already npm-installed x-ray, and I'm just requiring x-ray here. Make sure to include the dash when you install it, or else you'll get a different package entirely. I'll go ahead and create a new x-ray instance.
[00:13] To scrape something, I need a URL I want to scrape. I'll say http://google.com, if you've heard of that site before, and then a selector that I want to grab from that site. I'll go ahead and grab the title. X-ray allows me to write this out to a file. I'll call write with results.json, which is this file over here. When I run this, you can see that the title element of Google.com is Google.
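The step just described can be sketched roughly as follows. This is a reconstruction from the narration, not the exact code on screen: the function name scrapeTitle and the command-line handling are assumptions added for illustration, and it assumes x-ray has been installed with npm install x-ray.

```javascript
// Sketch only: assumes `npm install x-ray` has already been run.
// scrapeTitle is a hypothetical helper name, not part of x-ray.
function scrapeTitle(url, outFile) {
  // Required inside the function so nothing is loaded or fetched
  // until a scrape is actually requested. Note the dash in the name.
  const Xray = require('x-ray');
  const x = Xray();

  // x(url, selector) describes the scrape; .write(path) streams the
  // result to a file as JSON.
  return x(url, 'title').write(outFile);
}

// Run only when a URL is passed on the command line, for example:
//   node scrape-title.js http://google.com
if (process.argv[2]) {
  scrapeTitle(process.argv[2], 'results.json');
}
```

With the URL from the video, results.json should end up holding the page title, which the narration reports as Google.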
[00:45] To make this a little more interesting, let's go ahead and grab all the "a" tags instead of the title. I'll run this. You see it gives me Images. There's an Images "a" tag in there somewhere, but that's not all the information I want. I want all of the "a" tags, and I want them formatted nicely.
[01:03] We can do that with a third parameter here, which will be an array containing an object that describes what we want to pull out of each match. You can name the keys whatever you want. I'm going to name this key a and give it an empty string as its selector. That's going to give me the content of each of the "a" tags. I'll run that. You can see that now I have an array with objects inside of it, each with a key of a mapping to the content of one of those "a" tags.
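As a rough sketch of that three-argument form, based on the narration rather than the on-screen code: the names linkText and scrapeLinkText are hypothetical, and the key a is arbitrary, as the transcript notes.

```javascript
// Sketch only: assumes `npm install x-ray` has already been run.

// One object per matched <a>. The key name "a" is arbitrary; the empty
// string selector asks for the content of the matched element itself.
const linkText = [{ a: '' }];

// scrapeLinkText is a hypothetical helper name, not part of x-ray.
function scrapeLinkText(url, outFile) {
  const Xray = require('x-ray');
  const x = Xray();

  // x(url, scope, selector): the scope 'a' matches every link, and the
  // array wrapper collects one object for each match.
  return x(url, 'a', linkText).write(outFile);
}

// Example: node scrape-links.js http://google.com
if (process.argv[2]) {
  scrapeLinkText(process.argv[2], 'results.json');
}
```

Wrapping the object in an array is what turns a single result into a list. Without the array, only the first match would be expected to come back, as in the previous step.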
[01:31] Now, you probably also want the link inside of it. I'll add an href key, and the selector for it is going to be @href, an at sign followed by the attribute name, meaning just grab that attribute off of the "a" tag that we found. I'll run this. You can see that Images would go here. Maps would go there. Play would go there.
[01:51] Just to give us a little bit more information, we'll grab something like the CSS class. I'll say give me the class attribute, using @class. Run it again. You can see the various class names used on each "a" tag.
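Putting the last two steps together, the finished scrape might look something like the sketch below. The key names href and css, and the helper name scrapeLinks, are assumptions chosen for illustration; the @attribute selectors are x-ray's syntax for reading an attribute off the matched element.

```javascript
// Sketch only: assumes `npm install x-ray` has already been run.

// Fields to collect for every <a> on the page. Key names are arbitrary.
const linkFields = [{
  a: '',          // content of the link
  href: '@href',  // value of the href attribute
  css: '@class'   // value of the class attribute
}];

// scrapeLinks is a hypothetical helper name, not part of x-ray.
function scrapeLinks(url, outFile) {
  const Xray = require('x-ray');
  const x = Xray();

  // Scope to every <a>, collect the fields above, and save them as JSON.
  return x(url, 'a', linkFields).write(outFile);
}

// Example: node scrape-links.js http://google.com
if (process.argv[2]) {
  scrapeLinks(process.argv[2], 'results.json');
}
```

Each entry in results.json should then carry three keys, one for the link text, one for its URL, and one for its class names.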