Scrape a Webpage with a Serverless Function

Instructor: Lukas Ruebbelke

Published 6 months ago
Updated 5 months ago

Even webpages can be scraped using serverless functions!

We can create a function using the Serverless CLI and import Axios and Cheerio. Axios gets the HTML from a page, and Cheerio parses that HTML for the data we want, which we then turn into an array of pretty-printed strings.

Instructor: [0:00] In our quest to do wildly useful and wildly profitable things, our next example is going to be a screen scraper. The idea here is as follows. First, we'll pick a site: in this case, the college football rankings page on ncaa.com.

[0:18] Next, using a Serverless Function, we'll go in and parse the website with axios and Cheerio. Finally, we'll convert this particular table into a JSON structure and prettyPrint it so we can display it in the browser in a meaningful way.

[0:35] To get started, we are going to create a new serverless project. From the command line, we will use sls create. We're going to continue to use the AWS node template. We'll call this particular project, "Egghead NCAA Rankings." We'll go ahead and create this. Next, we'll step into the folder, and from here, we'll install the axios and Cheerio packages. Once these are installed, we can hop into our code and start to build things out.

[1:18] Let's jump into the handler.js file, where we'll clean up these two lines, close the left panel for extra working space, and then import axios and Cheerio. Axios is for our HTTP request. Cheerio is for parsing the DOM. It's like jQuery for Node. Next, we'll create a variable to store our URL.

[1:46] Then let's create a convenience object called props that maps the index of each column to the property it refers to. The first column is rank. The second column is school. The third column is points. The fourth is previous, and the fifth is record. Then we will provide a convenience method that allows you to pass in an index and get the property name off of the object.
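
As a rough sketch of the lookup just described (the names props and getProp and the lowercase property spellings are assumptions, since the narration only names the columns), the mapping might look like this:

```javascript
// Column index -> property name, in the order the columns are listed above.
const props = {
  0: 'rank',
  1: 'school',
  2: 'points',
  3: 'previous',
  4: 'record',
};

// Look up the property name for a zero-based column index.
const getProp = (index) => props[index];

console.log(getProp(1)); // prints "school"
```

Keeping the mapping in one object means the parsing code never hard-codes a column position.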

[2:12] Let's create a method to fetch the HTML. This is an asynchronous method that takes in a URL, and this is where axios earns its lunch money. We're going to create a try-catch block, and within the try block, we're going to make a GET request.

[2:31] Then, using object destructuring, we're going to pull the data off the response and, if that is successful, return the data. If something goes wrong, we'll do a console log and let the world know.
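
A minimal sketch of that fetch helper follows, assuming the name fetchHTML. The optional client parameter is not from the lesson; it is added here so the function can be tried with a stand-in instead of a live request, and it falls back to axios when omitted.

```javascript
// Fetch the raw HTML for a page.
// `client` is a hypothetical extra parameter; by default it is axios.
async function fetchHTML(url, client = require('axios')) {
  try {
    // axios resolves with a response object; the body is on `data`.
    const { data } = await client.get(url);
    return data;
  } catch (error) {
    // Log the failure; the caller then receives undefined.
    console.log(`Could not fetch ${url}: ${error.message}`);
  }
}
```

Because the catch block only logs, a failed request resolves to undefined, so a caller may want to check for that before parsing.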

[2:45] With that done, we are now going to create another method called scrape. This is where the real magic happens. Let's step through this slowly and see how this works. The first thing we're going to do is call fetchHTML and get the raw data from that page. In this case, the raw HTML.

[3:05] Next, we'll need to convert this into something we can parse. We're going to load it into Cheerio. Side note, the convention here is to use a dollar sign, like jQuery.

[3:19] Now that we have the ability to crawl this DOM and start to do things with it, let's step into the page and take a look at how this works. This is just old-fashioned DOM parsing: looking for ways to find the pieces that you need in the page that you're looking at.

[3:36] On this page, I'm going to right click, and then I'm going to click Inspect. We are looking for something in the DOM that we can use to start our query. I see that there is a div with an ID of block-bespin-content. This is a good starting point.

[4:01] If I scroll down, I can look for something that actually holds the table portion of the page. Here we have it. Notice that when I select the part of the page that I actually care about, it is highlighted.

[4:18] If I step into this, you'll see that it's full of a bunch of table rows, which is exactly what I want. Then if we open this up, you'll see that we have five cells in here, which is what we're going to parse over and what our convenience function and data structure are pointing to.

[4:39] Rank, school, points, previous, and record are the columns that we're going to convert into JavaScript objects and then prettyPrint. Let's hop back into our code and start to build out the parsing logic.

[4:59] I'm going to say, give me the results, and then I'm going to start with a query for block-bespin-content. Then from here, I'm going to do an additional query that says, go find the tbody and return all of the table rows.

[5:14] This returns a weird JavaScript object that I found to be hard to parse, but I was able to solve this by simply converting this into an array. Before we go any further, I want to tie this function off for a moment and work on two other functions that we will build out and then add into our scraping logic in just a moment. The first function that I want to build out is the logic around parsing the DOM itself.
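
The query and the array conversion might be sketched as below. The helper name findRows and the exact selector strings are assumptions reconstructed from the narration; $ is the function returned by cheerio.load(html).

```javascript
// `$` is what cheerio.load(html) returns, for example:
//   const cheerio = require('cheerio');
//   const $ = cheerio.load(html);
function findRows($) {
  return $('#block-bespin-content') // assumed ID of the content wrapper
    .find('tbody tr')               // every table row inside the table body
    .toArray();                     // Cheerio collection -> plain array of elements
}
```

Calling toArray() gives a plain array, so the rows can be processed with ordinary array methods such as map.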

[5:47] To that end, we are going to create a function called parseResult that takes in our Cheerio object and the element that we want to convert into a JavaScript structure. We'll start by creating an empty object, and then we will return it.

[6:08] We need to do something magical in the middle. We want to take this element, and using Cheerio, take the table row and say, within this table row, give me the table data element or TD. Then we're going to iterate over this particular array of table data elements.

[6:30] For each, we're going to loop over and assign a property on the result object using getProp. Then we'll get the value using Cheerio. This property is going to equal the element, or more specifically, the text inside of it.
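
Put together, the row parser might look roughly like this. It is a sketch: the name parseResult follows the narration, while the props mapping is repeated (with assumed lowercase names) only so the block runs on its own.

```javascript
// Repeated here so the sketch is self-contained.
const props = { 0: 'rank', 1: 'school', 2: 'points', 3: 'previous', 4: 'record' };
const getProp = (index) => props[index];

// `$` is the function returned by cheerio.load(html); `element` is one <tr>.
function parseResult($, element) {
  const result = {};
  $(element)
    .find('td')
    .each((index, cell) => {
      // Cheerio's each() passes the index first, then the raw element.
      result[getProp(index)] = $(cell).text();
    });
  return result;
}
```

Each row becomes an object with one property per column, keyed by the names in props.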

[6:49] We are essentially parsing over the element and turning this into a JavaScript object. Let's create one more method called prettyPrint(). This one is a little simpler. It just takes the result object itself.

[7:08] We're going to create a single prettyPrint string that combines result.school with result.record. We are ready to come back up and finish our scrape method. We're going to map over our array, parse the results, map over that array, and then prettyPrint the results.
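
A hedged sketch of the formatter and the two chained maps follows. The exact wording of the output string is an assumption, and formatRows is a hypothetical helper that takes the row parser as an argument so the block stays self-contained.

```javascript
// Format one parsed row. The exact wording of the string is an assumption.
const prettyPrint = (result) => `${result.school} (${result.record})`;

// Two chained maps: raw row -> plain object -> display string.
// `parse` stands in for the row parser, e.g. (row) => parseResult($, row).
const formatRows = (rows, parse) => rows.map(parse).map(prettyPrint);

// Placeholder values, not real rankings; the identity function stands in
// for the parser because this row is already an object.
const sample = [{ school: 'Example State', record: '0-0' }];
console.log(formatRows(sample, (row) => row));
```

Splitting the work into a parse step and a format step keeps each function small and lets either one change without touching the other.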

[7:34] This is functional programming for the win. We'll format this document just a bit. Then we'll be ready to update our handler. The first thing we'll do is call scrape and then save the results of that call to a JSON property, which we will use to update our call to JSON.stringify.
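
The updated handler might look something like the sketch below. The scrape function here is only a placeholder so the block runs on its own; in handler.js it would be the function built in the previous steps, and the exported name rankings follows the rename mentioned next.

```javascript
// Placeholder so the sketch runs on its own; in handler.js this is the
// scrape function built in the earlier steps.
const scrape = async () => [];

const rankings = async () => {
  const json = await scrape();
  return {
    statusCode: 200,
    // The extra arguments (null, 2) indent the JSON for readability.
    body: JSON.stringify(json, null, 2),
  };
};

module.exports = { rankings };
```

The handler itself stays tiny because all of the scraping logic lives in ordinary functions.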

[7:55] We are also going to update our handler name to rankings. Just to review, let's start at the top and work our way down through what we've done up to this point. We've imported axios and Cheerio and created a reference to the URL we want to scrape.

[8:13] We've created this props convenience object here, to map column indexes to the table that we're parsing. We have our fetchHTML function that goes and gets the page. Then we have our scrape method, which converts our page into a Cheerio structure.

[8:32] We've run queries on that structure and using functional programming, we have parsed the result. Then we use prettyPrint. After all of this, I think it's pretty amazing the serverless portion here is only about seven or eight lines of code. Moving on.

[8:49] Our next step is to deploy this as a serverless function. Let's hop into our serverless.yml file. I've taken the liberty of cleaning this up and deleting the extraneous comments. You're welcome. We can jump right in and change the function name here to rankings, along with the handler.

[9:12] Let's add in our API Gateway using events and httpApi. Then set the route as path, which in this case is going to be /api/rankings. Then finally, we're going to set the method to GET. With that saved, we'll jump into our terminal and we will sls deploy this project.
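
The lesson edits serverless.yml; the sketch below expresses what those edits appear to amount to as a plain JavaScript object (Serverless Framework can also read its configuration from a serverless.js module). The service name is an assumption, and other template settings such as the runtime are left out.

```javascript
// Same structure as the YAML described above, written as a JavaScript object.
const config = {
  service: 'egghead-ncaa-rankings', // assumed spelling of the project name
  provider: { name: 'aws' },
  functions: {
    rankings: {
      // <file name>.<exported function name>
      handler: 'handler.rankings',
      events: [
        // HTTP API route that triggers the function.
        { httpApi: { path: '/api/rankings', method: 'get' } },
      ],
    },
  },
};

module.exports = config;
```

The handler string is what ties the route to the code: it names the file and the export that should run for each request.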

[9:37] Once this is deployed, ideally, we'll be able to pull up the browser and see everything rendering. I have to be honest. I'm a little bit nervous here. I stand on the precipice of possibility, and I really hope this works. The moment of truth.

[10:01] We'll copy this endpoint URL here, hop into the browser, open a new tab, paste it in, and lo and behold, it worked. If I click on this tab, we have some HTML that we have successfully scraped into this array right here.

[10:23] We were able to do this with just a little bit of JavaScript and a very tiny amount of serverless, which I think is amazing. I hope you find this to be incredibly useful and profitable.