Scrape an HTML Table with split, map, and reduce

Instructor: John Lindquist
Published 8 years ago
Updated 5 years ago

This lesson covers a common need: grabbing information from an HTML table and converting it into plain JavaScript objects that you can use for your own needs.

[00:00] I want to grab the data from this table of popular programming languages and turn it into something I can use in JavaScript. To start, I'll open the Chrome dev tools, select an element inside the table, select the TBODY, and right-click and copy. Then I'll hop into my editor and say var tableString is equal to a string with single quotes, because there are double quotes inside of my table string.

[00:27] I'll paste it in here, get rid of the TBODY at the end because that's an easy task, hop back up to the first line, get rid of the TBODY there, and bring everything onto a single line. So now we have the entire contents of that table in a super long string, all on a single line. Let's turn that into something I can use. I'll start by saying var table is tableString, and the first thing I want to do is split the table string into its individual rows. A TR is a row, so I'll split on TR, log out table, and let it run.
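As a minimal sketch of this first step, the snippet below splits a single-line table string on the opening row tag. The two-row `tableString` is a hypothetical stand-in for the markup copied from the page (the real table is much longer and its cells may differ), and `const` is used where the lesson says var.

```javascript
// Hypothetical sample data: a short stand-in for the copied table rows.
const tableString =
  '<tr><td>1</td><td>2</td><td>up</td><td>Java</td></tr>' +
  '<tr><td>2</td><td>1</td><td>down</td><td>C</td></tr>';

// Split wherever an opening <tr> tag appears.
const table = tableString.split('<tr>');

console.log(table.length); // 3
console.log(table[0]);     // '' (empty string before the first <tr>)
```

The extra empty string appears because the string starts with the separator itself.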

[01:05] You can see I have an array of strings, one for each row. Now I want to get rid of the first result, because it's just an empty string: the split matched at the very beginning of the string. I can drop the first element of an array with slice(1). Because the first element is number 0, we start at 1 and take the rest, and we don't have to put in an end number. I'll run this again, and we got rid of that first one.
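A small sketch of dropping the leading empty string, again using a hypothetical two-row `tableString` rather than the real table:

```javascript
// Hypothetical sample data standing in for the copied table rows.
const tableString =
  '<tr><td>1</td><td>2</td><td>up</td><td>Java</td></tr>' +
  '<tr><td>2</td><td>1</td><td>down</td><td>C</td></tr>';

// slice(1) keeps everything from index 1 onward, discarding index 0.
const table = tableString.split('<tr>').slice(1);

console.log(table.length); // 2
```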

[01:29] Now that we have an array of strings, we want to turn this into something useful. Whenever you think, "I have an input and I want to output something else," you should think, "I need to map that." So we'll say map this row and output test. When I run this, you can see you get test in each result. Obviously that's not what we want. If we just return row and run this, you can see we get exactly the same thing as before.

[02:00] But what I actually want to do is split each row into its columns, so I'll go ahead and split on the TDs. We'll come up here and say .split and split on each TD. Now when we run this, you can see we get an array of arrays. We have the outer array and the inner arrays, and the inner arrays were split wherever a TD appeared.
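The mapping step might look like the sketch below. The sample `tableString` is hypothetical, and it assumes the cells are plain `<td>` tags with no attributes, since splitting on the literal text `<td>` would not match a tag such as `<td class="...">`.

```javascript
// Hypothetical sample data standing in for the copied table rows.
const tableString =
  '<tr><td>1</td><td>2</td><td>up</td><td>Java</td></tr>' +
  '<tr><td>2</td><td>1</td><td>down</td><td>C</td></tr>';

const table = tableString
  .split('<tr>')
  .slice(1)
  // Turn each row string into an array of column strings.
  .map(row => row.split('<td>'));

console.log(table[0]);
// [ '', '1</td>', '2</td>', 'up</td>', 'Java</td></tr>' ]
```

At this point each inner array still carries a leading empty string and the closing tags.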

[02:26] If you look back real quick at our data here, we have one being the rank of the language, two being what it previously was, the language over here, and you can see how that matches up to what we have in here. One being the rank, two being the previous rank, this being the language, and so on.

[02:43] Again, because we split at the very first instance of TD, we got an empty one, so we can go ahead and slice(1) just like we did before. Run this again, and you can see we got rid of that empty result. You might also notice that we still have the closing row tag here, this TR, which we want to get rid of before we start dealing with each individual row.

[03:09] Because we want to get rid of this before we split, while it's still a single string and not an array of strings, we'll come back before the split and say slice, and we want everything from the beginning up until the end minus five characters. Counting back one, two, three, four, five characters covers the closing TR tag, so the -5 says to leave that off. I'll run this again, and now you can see all these closing row tags are gone.
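Putting the two cleanups together, a sketch under the same assumptions (hypothetical sample data, no whitespace after each closing row tag) could be:

```javascript
// Hypothetical sample data standing in for the copied table rows.
const tableString =
  '<tr><td>1</td><td>2</td><td>up</td><td>Java</td></tr>' +
  '<tr><td>2</td><td>1</td><td>down</td><td>C</td></tr>';

const table = tableString
  .split('<tr>')
  .slice(1)
  .map(row =>
    row
      .slice(0, -5)   // drop the trailing '</tr>' (5 characters)
      .split('<td>')  // break the row into columns
      .slice(1)       // drop the empty string before the first <td>
  );

console.log(table[0]);
// [ '1</td>', '2</td>', 'up</td>', 'Java</td>' ]
```

The negative end index counts from the end of the string, so this only works if each row string ends exactly with the closing tag.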

[03:43] Now we're dealing with the individual columns, which are inside of an array. So you can think table, then row, then individual columns. I want to get rid of this closing column tag. Again, to turn each value into something new we map, and since we're working on the columns instead of the rows this time, we map over the result of the split, and now we're working with the columns.

[04:12] We're doing basically the same thing here as with the earlier slice, so I can copy and paste it, because we're just removing each of these closing column tags. Run this, and you can see we now have an array of arrays with all the data that we want to grab. Now, when you think, "I want to turn an array into some sort of object," or you need to do some really complex transformation, you should be thinking reduce.
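The nested map described here can be sketched as follows, with the same hypothetical sample data. The inner map runs once per column and strips the five-character closing column tag.

```javascript
// Hypothetical sample data standing in for the copied table rows.
const tableString =
  '<tr><td>1</td><td>2</td><td>up</td><td>Java</td></tr>' +
  '<tr><td>2</td><td>1</td><td>down</td><td>C</td></tr>';

const table = tableString
  .split('<tr>')
  .slice(1)
  .map(row =>
    row
      .slice(0, -5)                    // drop trailing '</tr>'
      .split('<td>')
      .slice(1)                        // drop leading empty string
      .map(col => col.slice(0, -5))    // drop trailing '</td>' from each column
  );

console.log(table);
// [ [ '1', '2', 'up', 'Java' ], [ '2', '1', 'down', 'C' ] ]
```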

[04:43] So I can take all of these results, and again I'm back out to where the whole table is, the outer array, and I want to reduce them into something else. The reducer is a function which takes an accumulator and the current value, basically the current item as you iterate through the array. Here the current value is going to be the row, so I'll go ahead and name it row. Now if I just reduce this into the word hello and run this, you can see that my entire table, all of this stuff, was simply reduced into one simple string.

[05:19] So whatever is returned from the reducer is what the whole thing turns into. What I want to do is use the accumulator, which is going to be an array, and this initial value is basically the starting point of what this is going to output. So if I return just the accumulator, acc for accumulator, and run this, I get this empty array. You can see the empty array is passed in over here, and then I return it here. But I want to do some fancy stuff where I grab some of the data off of each row, create an object from it, and then push it into an array.
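To illustrate how the return value of the reducer determines the result, here is a small sketch. The `rows` array is hypothetical sample data in the shape produced by the earlier steps.

```javascript
// Hypothetical sample data: rows already split into clean column values.
const rows = [
  ['1', '2', 'up', 'Java'],
  ['2', '1', 'down', 'C'],
];

// Returning a constant: the whole array collapses into that value.
const asString = rows.reduce((acc, row) => 'hello', []);

// Returning the accumulator untouched: the initial value comes back out.
const asArray = rows.reduce((acc, row) => acc, []);

console.log(asString); // 'hello'
console.log(asArray);  // []
```

The second argument to reduce is the initial accumulator, which is what the reducer receives as `acc` on the first pass.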

[05:57] So I'm still going to return the accumulator, it's just that on each pass I'm going to push something into it. If I push test and run this, you can see I still have my array, but each time through it just pushed in the string test. What I want to do instead is grab the rank, like var rank is row at position 0, since the first thing in each table row was the rank, and then push an object in there with a key of rank, which is assigned the value of the rank from the row.

[06:36] So I'll run this and you can see we get 1 through 20 because they are simply ranked in order. Now I can do basically the same thing with the name, say row at number 3, and then just add name which is name. Run this again, and now you can see the name and rank of each of the programming languages from that table.
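A sketch of this version of the reducer, using hypothetical sample rows where index 0 holds the rank and index 3 holds the language name:

```javascript
// Hypothetical sample data: rows already split into clean column values.
const rows = [
  ['1', '2', 'up', 'Java'],
  ['2', '1', 'down', 'C'],
];

const languages = rows.reduce((acc, row) => {
  const rank = row[0]; // first column: current rank
  const name = row[3]; // fourth column: language name
  acc.push({ rank: rank, name: name });
  return acc;          // always hand the accumulator to the next pass
}, []);

console.log(languages);
// [ { rank: '1', name: 'Java' }, { rank: '2', name: 'C' } ]
```

The values stay strings here because they come straight out of the markup.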

[06:59] Now, because of some newer JavaScript features, we can clean this up a little bit. If we want the key to be named the same thing as the variable holding the value we're passing in, I can simply delete this, delete this, and it will just grab the value of the variable and pass it in, so I'll get the same results. I can also destructure the row, because I know it's an array.

[07:24] Instead of saying row, I can say it's an array with a first value where I want to call the first thing rank, and I don't need the second value, I don't need a third value, I actually just need the fourth value which I'm going to call name. I can delete both of these lines and when I run this, everything will work just the same and I can even easily come in here and say something like previous rank and just add previous rank.
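The shorthand properties and array destructuring described here could be sketched like this. The sample rows are hypothetical, and the empty slot in the pattern skips the third column.

```javascript
// Hypothetical sample data: rows already split into clean column values.
const rows = [
  ['1', '2', 'up', 'Java'],
  ['2', '1', 'down', 'C'],
];

// Destructure the row in the parameter list; the gap skips index 2.
const languages = rows.reduce((acc, [rank, previousRank, , name]) => {
  // Shorthand: { rank } means { rank: rank }.
  acc.push({ rank, previousRank, name });
  return acc;
}, []);

console.log(languages[0]);
// { rank: '1', previousRank: '2', name: 'Java' }
```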

[07:53] Run it again, and now you can see I get the rank, the previous rank, and the name. Thanks to the power of reduce I could easily customize this, say something like if rank is less than previous rank, meaning that it's gone up in the standings, then do this stuff.

[08:12] Then and only then do I put something into my accumulator, so when I run this you can see I only get the languages that have gone up in rank since the previous month. You can easily customize whatever data you want to push in. As long as you return the accumulator, you don't have to push or add anything on every pass, so you can essentially filter out data this way.
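Filtering inside the reducer might look like the sketch below, with hypothetical sample rows. One assumption beyond what the lesson states: the ranks are converted with `Number` before comparing, because the cells are strings and a string comparison would put '9' after '10'.

```javascript
// Hypothetical sample data: rows already split into clean column values.
const rows = [
  ['1', '2', 'up', 'Java'],
  ['2', '1', 'down', 'C'],
  ['9', '10', 'up', 'Swift'],
];

const risers = rows.reduce((acc, [rank, previousRank, , name]) => {
  // Assumption: compare numerically, since the cells are strings.
  if (Number(rank) < Number(previousRank)) {
    acc.push({ rank, previousRank, name });
  }
  return acc; // returned on every pass, even when nothing was pushed
}, []);

console.log(risers.map(lang => lang.name)); // [ 'Java', 'Swift' ]
```

Rows that fail the condition simply leave the accumulator unchanged, which is what makes reduce behave like a filter here.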
