Joshua Comeau: When I'm working on a Gatsby project and I run into a new problem, I always like to start by checking to see if there are any community solutions for that problem. In the case of sitemaps, there is a Gatsby plugin for it.
I will copy the name of that plugin and will install it as a dependency. While that installs, I'm going to pop over to my Gatsby config and we'll add it as a plugin.
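As a rough sketch of this step, the simplest form of the config entry might look like the following. It assumes the plugin in question is the gatsby-plugin-sitemap package and that the rest of the config is otherwise empty; a real gatsby-config.js will have more in it.

```javascript
// gatsby-config.js (minimal sketch; other plugins and metadata omitted)
module.exports = {
  siteMetadata: {
    // Assumed placeholder; the sitemap plugin reads the site's root URL from here.
    siteUrl: 'https://www.example.com',
  },
  plugins: [
    // Short form: just the plugin name, no options.
    'gatsby-plugin-sitemap',
  ],
};
```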
One thing the documentation mentioned which might catch people off guard is that it only generates the sitemap when you're building for production. When you're in development mode, you won't get anything. Rather than run Gatsby dev, we're going to do a yarn build.
I also like to use the serve package which spins up an HTTP server that serves all the files in the directory that you give it.
All right. If we pop over and see what it gave us, I'm going to go to /sitemap.xml. If you're very lucky, then this will be exactly what you want and your work will be done. I suspect almost everyone will have some things to change.
The very first thing I saw is that I have certain routes that I don't really want to be indexed by Google. The admin route is for my admin panel and confirmed is for the newsletter. Happily, the plugin is configurable and we can solve for this.
I'm going to switch to the long form of the plugin entry, with a resolve as well as some options. It accepts an exclude array, which takes an array of paths. I'm going to pass my admin route, as well as any others that start with admin, as well as my confirmed route.
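The long form described here might look roughly like this. The exact paths are assumptions based on the narration, and the option is written as exclude because that is the name used in the video; the option name may differ between plugin versions, so check the README for the version you have installed.

```javascript
// gatsby-config.js (sketch; paths are illustrative assumptions)
module.exports = {
  plugins: [
    {
      // Long form: an object with `resolve` and `options`
      // instead of a bare plugin name.
      resolve: 'gatsby-plugin-sitemap',
      options: {
        exclude: [
          '/admin',    // the admin panel itself
          '/admin/*',  // glob: anything nested under /admin
          '/confirmed' // newsletter confirmation page
        ],
      },
    },
  ],
};
```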
There's one other thing I want to fix and it's a bit trickier. I noticed that it's repeating a lot of articles, so I have this blog post, "The Perils of Rehydration." It's putting it under both React and under Gatsby.
Both of these pages exist because I want it to be accessible from either set of pages, but Google really frowns when you have the same content hosted on multiple URLs. When you read about sitemaps, they recommend only putting the canonical URL, the primary URL. In this case, this one is the primary URL. I should exclude this one from the sitemap.
To solve this, the first thing I need is a way to figure out whether a given page is canonical. I'm going to hop over to my Gatsby node and scroll up to where these pages in particular are created.
Of course, the way you built your site, this might be different. In my case, I have the current category that is being read. I also have the first category, which is the way that I'm checking whether or not it's canonical.
I'm going to add a new field to my context, which I'm going to call isCanonical. To determine that, we're just going to compare the first category with the current category.
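Here is a minimal sketch of that comparison. The helper name buildArticlePage, the article shape (slug, categories), the path format, and the template path are all hypothetical; only the idea of comparing the current category with the first category, and passing the result through the page context, comes from the lesson.

```javascript
// Hypothetical helper that builds the argument object for Gatsby's createPage.
// `article.categories` and `article.slug` are assumed field names.
function buildArticlePage(article, currentCategory, component) {
  const firstCategory = article.categories[0];

  return {
    path: `/${currentCategory}/${article.slug}/`,
    component,
    context: {
      slug: article.slug,
      category: currentCategory,
      // Canonical only when this page lives under the article's first category.
      isCanonical: currentCategory === firstCategory,
    },
  };
}

// Stand-in data: one article listed under two categories.
const article = {
  slug: 'perils-of-rehydration',
  categories: ['react', 'gatsby'],
};

// In gatsby-node.js each of these objects would be passed to createPage(...).
const pages = article.categories.map(category =>
  buildArticlePage(article, category, 'src/templates/article.js')
);

console.log(pages.map(page => [page.path, page.context.isCanonical]));
```

With this stand-in data, the page under the first category is flagged as canonical and the page under the second category is not.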
In order to make use of this though, we have to do quite a bit more inside our gatsby-plugin-sitemap configuration. The first thing is to query the data that we want. This takes any GraphQL query. Critically though, it needs to know the root URL, as well as the data that we're going to need for every page. On the node, we can add the context, which includes the field we just added, isCanonical.
Next, we need to add a custom serialize function. This function takes the data that we just defined and expects you to return an array of the pages you want to create with the data it needs to know about them. We'll map over allSitePage, receiving an edge. If this edge.node.context.isCanonical is equal to false, then I'm just going to return null.
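A query along these lines would pull the root URL plus the context for every page. This is a sketch that assumes the schema shape implied by the narration (allSitePage with edges, and a context object on each node); field names can differ between Gatsby versions, so it is worth confirming them in GraphiQL first.

```javascript
// gatsby-config.js (sketch; only the query option is shown)
module.exports = {
  plugins: [
    {
      resolve: 'gatsby-plugin-sitemap',
      options: {
        query: `
          {
            site {
              siteMetadata {
                siteUrl
              }
            }
            allSitePage {
              edges {
                node {
                  path
                  context {
                    isCanonical
                  }
                }
              }
            }
          }
        `,
      },
    },
  ],
};
```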
Critically, I need to check if it's equal to false specifically because I will have other pages on my site that aren't articles, for which it'll be undefined. That's totally fine.
If we make it past that point, then I want to return an object with all the data that the plugin needs. It needs to know the URL, which will be this site.siteMetadata.siteUrl, plus the path for this specific node. I also need to tell Google how often this page changes. I'm going to put daily for all pages.
Finally, I need to give every page a priority. This is going to be a number between zero and one. You might think you can game the system by just making every page one, but the number is relative for all the pages on your site. For now, I'm just going to leave every page right in the middle. You can tweak this if you want to let Google know which pages are most important.
Finally, I need to filter out all of the nodes that I returned null for earlier. I'll just return if the node is truthy. With that done, I'll go ahead and redo my build and my server. Remember that this doesn't work when running in development mode. We'll go back, refresh our sitemap and now you'll see that The Perils of Rehydration article only appears once, only under the React category.
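Putting the map, the strict false check, and the filter together, the serialize function might look something like the sketch below. It assumes the function receives the whole query result as an object with site and allSitePage, as described in the lesson; the sample data and the example.com URL are made up so the function can be run on its own.

```javascript
// Sketch of a serialize function for the sitemap plugin options.
function serialize({ site, allSitePage }) {
  return allSitePage.edges
    .map(edge => {
      const { path, context } = edge.node;

      // Strict comparison: pages that have no isCanonical field
      // (undefined) are not articles and should be kept.
      if (context && context.isCanonical === false) {
        return null;
      }

      return {
        url: site.siteMetadata.siteUrl + path,
        changefreq: 'daily',
        priority: 0.5, // right in the middle of the 0 to 1 range
      };
    })
    // Drop the null entries returned for non-canonical pages.
    .filter(Boolean);
}

// Made-up query result for illustration only.
const sampleData = {
  site: { siteMetadata: { siteUrl: 'https://www.example.com' } },
  allSitePage: {
    edges: [
      { node: { path: '/react/perils-of-rehydration/', context: { isCanonical: true } } },
      { node: { path: '/gatsby/perils-of-rehydration/', context: { isCanonical: false } } },
      { node: { path: '/about/', context: {} } },
    ],
  },
};

const entries = serialize(sampleData);
console.log(entries.map(entry => entry.url));
```

In the real config this function would be passed as the serialize option next to query. Checking for false explicitly, rather than for any falsy value, is what keeps pages like /about/ in the output.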
To review, because this is quite a lot: the very first thing we did was exclude paths that could be matched with globs. If we see /admin/, we know we can exclude this from the sitemap.
Other pages, though, couldn't be excluded so simply because they depended on data available in our GraphQL schema. We had to write a query that pulled first the site URL, as well as the data on every page including the context that we specify over in Gatsby node.
Then, we had to write a serialize function, which takes the data received by the GraphQL query and returns an array of objects that are each used to describe the XML. If you'll notice, url, changefreq and priority are very similar to the fields put in the XML output.