What do we do when one of our GraphQL query functions is running slow? Where do we start figuring out the problem?
Enabling all logs with AppSync and setting up X-Ray gives you a powerful bug hunting combination. You can use the logs to find your request logs and a trace ID that you can use with X-Ray to see a graph of your requests.
From there you can see exactly which request is having issues and investigate it further!
Tomasz Łakomy: [0:00] A single GraphQL query can trigger multiple resolvers. For instance, in this example, this MyAwesomeQuery is listing all the books and also getting a single book by its ID. Of course, this works fine, but here's a problem. What happens when one of those resolvers is slow? How do we identify and fix a performance issue like this?
[0:20] Let me cause a performance issue. I'm going to add a wait function to my getBookByIDLambda function. This function is essentially going to hang the execution of this Lambda function by a timeout specified over here. Before we get an item from DynamoDB, I'm going to add a, wait 2000, hanging this function for two seconds.
[0:41] Because wait function returns a promise, I also need to await it. Let's deploy that. Let's give it a shot. I'm going to send this MyAwesomeQuery once more. You will notice that it is much slower. This is something that may happen in a production API. For some reason, it gets very slow, and now we have to investigate what is going on.
[1:04] To do that, let us go back to our stack. I'm going to scroll up to our AppSync API, and we're going to add two more things to it. First of all, we're going to add a logConfig in order to enable logging in our API. I'm going to set the fieldLogLevel to be of AppSync.FieldLogLevel.All because, for the time being, we would like to get as many logs as possible.
[1:25] On top of that, I'm going to also enable X-Ray. X-Ray is an AWS solution for enabling traces in our API, and I'm going to set it to true. With all of that, let us do another deployment.
[1:39] Let us investigate. We have a slow GraphQL query in our API. How do we find it? First of all, I'm going to navigate to the AppSync console. Next, I'm going to go to Settings to verify that, first of all, logging was enabled for this API and the log level is set to All, just in case.
[1:55] Here we can see that the X-Ray was successfully enabled for this API. Take a look at this note. It says that you may be charged for using X-Ray, so please keep that in mind.
[2:05] Let's start our investigation. Go to Monitoring tab, and next click on View Logs in CloudWatch. I'm going to close this section over here and expand all of the logs. As you can see, there's quite a lot of them.
[2:17] This is where the request started. We can see that it started at 17th second of a given minute. If I scroll down a bit, this is quite a lot of logs, we can see over here that this request has ended nearly three seconds later. In fact, this latency, which is in nanoseconds, tells us that it was 2.6 seconds later, so we definitely have a performance issue in this API.
[2:40] If I go to the top of the logs for this particular request, I can see that this query was sent to MyAPI. This MyAwesomeQuery was fetching a list of all of the books and also a single book by its ID. If I scroll down a bit, we're going to see all sorts of logs for this particular execution.
[2:58] What if I wanted to visualize what exactly is going on and which component is actually introducing latency in MyAPI? To do that, let us check the RequestSummary once again. I'm going to scroll down to it.
[3:10] Here we have an Amzn-Trace-Id for this particular request. I can copy that and go to the aforementioned X-Ray service. Here, I can paste in this trace ID and hit Enter in order to visualize it. This is going to create an interactive map of my request.
[3:27] There's a number of things here. First of all, down here in the bottom we see a timeline of everything that happened in MyAPI. I was trying to call this API. I got the response of 200. I was in fact calling two Lambda functions.
[3:41] This listBooksLambda managed to complete in 637 milliseconds, which was not terrible. The second Lambda actually took over two seconds in order to execute.
[3:51] Apart from the Timeline View, here we also have an interactive map of our system. A Client was calling our GraphQLAPI, which in turn was calling two Lambda functions.
[4:01] This Lambda function is the listBooks handler, which, as I mentioned, took roughly half a second in order to execute. The second Lambda, which is a getBookByID handler, took over two and a half seconds in order to complete. Now we know that we have to investigate this Lambda.
[4:17] Then we would probably have noticed that this wait call is not exactly a good idea. We should remove that, deploy again, and now our API is going to be fast once more.