Use a Python Generator to Crawl the Star Wars API

Will Button
Published 6 years ago
Updated 5 years ago

In this lesson, you will be introduced to Python generators. You will see how a generator can replace a common function and learn the benefits of doing so. You will learn what role the yield keyword plays in a function and how it differs from return. Building on that knowledge, you will learn how to build a generator that recursively crawls an API (swapi.co) and returns Star Wars characters from "The Force Awakens".

Instructor: [00:01] Let's start with a little function to show you the type of things that a generator can help you with. We'll create a little countdown function. We'll have our results that starts off as an empty list, then while the number that we pass to the function is greater than zero, we'll just append that number to the results list. We'll subtract one from it, and when we're done, we'll return that list.

[00:29] We'll create a launch_timer that's going to use the result of our countdown function, and we'll start with the number five. We can say for val in launch_timer, because that's going to return a list that we can iterate over. We can print the value. To execute that, I'll come down here and do python, and call the name of our script. It's pretty straightforward, it just starts at five and counts down to one, then exits.
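As a rough sketch, the list-based function described so far might look like this. The names countdown, results, launch_timer, and val come from the narration; the exact layout is an assumption.

```python
def countdown(num):
    # Build the whole list in memory before returning anything.
    results = []
    while num > 0:
        results.append(num)
        num -= 1
    return results


launch_timer = countdown(5)
for val in launch_timer:
    print(val)  # prints 5, 4, 3, 2, 1 on separate lines
```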

[00:57] Where this can lead to problems is in this results list right here. You can imagine that if we had millions of items in that list, the memory it uses could grow quite large. We can use a generator function to produce the same results. Let's just comment this out, and we'll produce the same results with the generator.

[01:19] Again, we'll define it like a function, and it's going to receive its parameter num. We'll say while num > 0, we'll yield num, then subtract one from it. Let's run that again, and we get the same results. Let me show you how that worked. We started off by defining a function just like we did before, then the difference here is we used this yield statement.
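A minimal sketch of the generator version, assuming the same names as before. The only change from the list-based function is that the results list is gone and each number is yielded instead.

```python
def countdown(num):
    # No list is built; each value is handed out one at a time.
    while num > 0:
        yield num
        num -= 1


launch_timer = countdown(5)
for val in launch_timer:
    print(val)  # prints 5, 4, 3, 2, 1 on separate lines
```

Note that a generator can only be consumed once: after the loop finishes, launch_timer is exhausted, and you would call countdown(5) again to start over.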

[01:46] Here's what the yield statement does. We initialized our generator with the launch_timer statement here, then started to iterate over it with our for statement. A generator basically turns a function into an iterable. The function starts, then whenever it gets to this yield statement, the yield statement returns control to the line of code that called it, which is our for loop here. Then, it prints out the value that was yielded.

[02:16] The difference is that a return exits the function completely, whereas a yield statement just yields control back to the calling code. The next time it's called, which happens here because we're using a for loop, it knows where it left off. When it first started, it had the value of five.

[02:38] It returned control to the for loop, which printed that value. It iterated over it again. When it reached this yield statement again, it knew that the value had been reduced by one to four. That was the number that it yielded. This continues as long as this while statement is true. When that while statement is no longer true, the function exits, and our for loop terminates.
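To see the pause-and-resume behavior without a for loop, here is a small sketch that drives the generator by hand with the built-in next(). The variable names timer, first, second, and finished are made up for this illustration. A for loop does the same thing behind the scenes and stops quietly when StopIteration is raised.

```python
def countdown(num):
    while num > 0:
        yield num
        num -= 1


timer = countdown(2)

first = next(timer)   # runs the function body up to the first yield
second = next(timer)  # resumes after the yield, subtracts one, yields again

try:
    next(timer)       # the while condition is now false, so the function exits
    finished = False
except StopIteration:
    finished = True

print(first, second, finished)  # prints: 2 1 True
```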

[03:04] The big performance difference between the function we originally created and our generator function is that we didn't create this results list here. That means we never allocated that big, or potentially big, block of memory. Instead, the generator lazily produced each number as it was needed.
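One rough way to see the difference is to compare the size of the two objects with sys.getsizeof. This is only a sketch: countdown_list and countdown_gen are names invented here to keep both versions side by side, and getsizeof counts the list's own storage but not the integers it points to, so the real gap is larger than what it reports. Exact byte counts vary by Python version.

```python
import sys


def countdown_list(num):
    results = []
    while num > 0:
        results.append(num)
        num -= 1
    return results


def countdown_gen(num):
    while num > 0:
        yield num
        num -= 1


list_size = sys.getsizeof(countdown_list(1_000_000))  # grows with num
gen_size = sys.getsizeof(countdown_gen(1_000_000))    # stays small regardless of num

print(list_size, gen_size)
```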

[03:25] Let's use this knowledge of generators to crawl an API. I'm going to use SWAPI, the Star Wars API. It's a REST API that has all the details of the Star Wars universe, and I'm calling the people endpoint on it. In the response, you can see a results list here that contains the characters and details about each of them.

[03:51] I'm going to create our generator function, and it's going to be called crawl. Then, we'll have a response object, where we use the requests library to perform a GET request to our Star Wars API. We'll take the response and turn it into a Python object using the json module. The payload for our API is in response.content. If we take a look back at the payload real quick, we see that we've got this results list.

[04:19] Inside the results list, we have all of our characters, so we're going to iterate over that. To do so, we'll say for character in api_results, then grab that results object. We want to see a list of characters who appeared in "The Force Awakens," so we'll look for the URL for The Force Awakens, which is films/7, in the films list for each character. If it's found, we're going to yield that character's name.

[04:49] That's our generator, and we'll initialize it by saying force_awakens = crawl. We can iterate over that because calling the generator function returns an iterable. We'll say for result in force_awakens, print result. To execute that, we'll do python, then the name of our Python application. That returns three character names.
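Here is a hedged sketch of that first, single-page crawl. The lesson uses the requests library. To keep this example runnable without extra installs or network access, it swaps in urllib from the standard library and adds a hypothetical fetch parameter so a canned page can stand in for the API. API_URL, SAMPLE_PAGE, fake_fetch, and the placeholder character names are all assumptions made for illustration, not real API output.

```python
import json
from urllib.request import urlopen

API_URL = "https://swapi.co/api/people/"          # assumed people endpoint
FORCE_AWAKENS = "https://swapi.co/api/films/7/"   # films/7, as in the lesson


def fetch_json(link):
    """Download a URL and parse its JSON body (needs network access)."""
    with urlopen(link) as response:
        return json.load(response)


def crawl(fetch=fetch_json):
    api_results = fetch(API_URL)
    for character in api_results["results"]:
        if FORCE_AWAKENS in character["films"]:
            yield character["name"]


# Hand-written stand-in shaped like the payload described in the lesson.
SAMPLE_PAGE = {
    "next": None,
    "previous": None,
    "results": [
        {"name": "Character A", "films": ["https://swapi.co/api/films/1/", FORCE_AWAKENS]},
        {"name": "Character B", "films": ["https://swapi.co/api/films/2/"]},
    ],
}


def fake_fetch(link):
    # Ignores the link and always returns the canned page.
    return SAMPLE_PAGE


force_awakens = crawl(fetch=fake_fetch)
for result in force_awakens:
    print(result)  # prints: Character A
```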

[05:13] We know there were more than three characters in The Force Awakens. Let's dig in and see what happened here. If we go back to the API result from Postman, we can see that in the results list here, there were only 10 characters returned. The API didn't return a list of all the characters from the Star Wars universe; it only returned 10. This is actually pretty common with APIs. They only return a certain number of results, and then they provide the rest of them through pagination.

[05:42] That's what we see here. We have these next and previous keys that tell you how to paginate through the API results. If we're going to get a list of all the characters from the Star Wars API who were in The Force Awakens, we're going to have to paginate through our API results.

[05:57] To do that, we're going to add a parameter to our crawl generator, and just call that link. I'm going to cut the URL out of the function, and it's going to take the value passed to it. Then when we call that, we'll pass in our API endpoint. From there, we're going to expand on this a little bit, and we'll check to see if there's a key called next in our api_results. If there is, we need to make sure that key is not None.

[06:25] What I'm looking for there is I'm checking to make sure that this key exists, which is a really good practice whenever you're consuming apis that you don't control. You always want to make sure that the key you're looking for exists before you try to access it. Because if it doesn't exist, your app's going to blow up.

[06:42] The second part of that is we want to make sure that key actually contains a value. If I go to the last page of the results here, you can see that when we hit the last page, the value of this key is returned as null, which in Python we call None. We want to check to make sure that key exists and that it has a value.
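The two-part check can be sketched on its own like this. The helper name has_next_page and the sample dictionaries are invented for illustration.

```python
def has_next_page(api_results):
    # First make sure the key exists, then make sure it holds a value.
    return "next" in api_results and api_results["next"] is not None


middle_page = {"next": "https://swapi.co/api/people/?page=3", "results": []}
last_page = {"next": None, "results": []}
missing_key = {"results": []}

print(has_next_page(middle_page))  # True
print(has_next_page(last_page))    # False
print(has_next_page(missing_key))  # False
```

Because dict.get returns None when a key is absent, api_results.get("next") is not None gives the same answer in a single step.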

[07:00] If that's the case, we're going to create a variable called next_page. We're going to call the same function that we're in again, but this time, provide the value of the next_page link. We'll say crawl, and provide the value of the next link. Then we'll continue to iterate over that, and we'll yield a page.

[07:20] What's going on here is we call our generator, providing the Star Wars api link first, and it has results in it. We're going to iterate over those results, yielding each character name who appears in film number seven or The Force Awakens. Because this yield appears first, that's the one that's going to be executed until we run out of characters in our results.

[07:44] Then we're going to come down to this line of code, where we check to see if there's a next link. If so, we're going to grab that link, recursively call our generator again, then yield each result from that page. The recursive call produces a new list of characters, and we'll iterate over that, yielding each character name one at a time, until we run out of results.

[08:06] Let's execute that again and see if it works. There you have it, a list of all the characters from The Force Awakens as known by the Star Wars API.
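Putting the pieces together, a sketch of the recursive, paginated generator might look like the following. As an assumption for runnability, it uses urllib from the standard library rather than the requests library from the lesson, and it takes a hypothetical fetch parameter so that two canned pages can stand in for the API. The URLs in PAGES, the placeholder character names, and fake_fetch are invented for this illustration.

```python
import json
from urllib.request import urlopen

FORCE_AWAKENS = "https://swapi.co/api/films/7/"
FIRST_PAGE = "https://swapi.co/api/people/"           # assumed endpoint
SECOND_PAGE = "https://swapi.co/api/people/?page=2"   # assumed next link


def fetch_json(link):
    """Download a URL and parse its JSON body (needs network access)."""
    with urlopen(link) as response:
        return json.load(response)


def crawl(link, fetch=fetch_json):
    api_results = fetch(link)

    # This yield runs first, until the current page is out of characters.
    for character in api_results["results"]:
        if FORCE_AWAKENS in character["films"]:
            yield character["name"]

    # Then follow the next link, if the key exists and holds a value.
    if "next" in api_results and api_results["next"] is not None:
        next_page = crawl(api_results["next"], fetch)
        for page in next_page:
            yield page


# Hand-written stand-ins for two pages of results.
PAGES = {
    FIRST_PAGE: {
        "next": SECOND_PAGE,
        "results": [
            {"name": "Character A", "films": [FORCE_AWAKENS]},
            {"name": "Character B", "films": ["https://swapi.co/api/films/1/"]},
        ],
    },
    SECOND_PAGE: {
        "next": None,
        "results": [{"name": "Character C", "films": [FORCE_AWAKENS]}],
    },
}


def fake_fetch(link):
    return PAGES[link]


for result in crawl(FIRST_PAGE, fetch=fake_fetch):
    print(result)  # prints Character A, then Character C
```

On Python 3.3 and later, the inner for loop can be shortened to yield from crawl(api_results["next"], fetch), which behaves the same way here.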

Eisson Alipio
~ 6 years ago

that API is so cool and your explanation is really good! thank you Will

Will Button (instructor)
~ 6 years ago

My pleasure! I'm excited that you found it helpful!

Ondra Geršl
~ 5 years ago

Nice example of iterator over HTTP API, thanks!

Could you please review my custom solution of function crawl?

import requests

def crawl(link):
  while link:
    response = requests.get(link)
    api_results = response.json()

    for character in api_results['results']:
      if 'https://swapi.co/api/films/7/' in character['films']:
        yield character['name']

    link = api_results.get('next')