Adding lastmod to sitemap based on git commits

John Reilly

This post demonstrates enriching an XML sitemap with lastmod timestamps based on git commits.

Reading git log in Node.js

In the last post I showed how to manipulate XML in Node.js, and filter our sitemap. In this post we'll build upon what we did last time, read the git log in Node.js and use that to power a lastmod property.

To read the git log in Node.js we'll use the simple-git package. It does plenty of other things too, but reading the log is what we care about today.

yarn add simple-git

To work with simple-git we need to create a Git instance. We can do that like so:

import path from 'path';
import { simpleGit, SimpleGit, SimpleGitOptions } from 'simple-git';

function getSimpleGit(): SimpleGit {
  // the git repository root is one level above the directory we run from
  const baseDir = path.resolve(process.cwd(), '..');

  const options: Partial<SimpleGitOptions> = {
    baseDir,
    binary: 'git',
    maxConcurrentProcesses: 6,
    trimmed: false,
  };

  const git = simpleGit(options);

  return git;
}

From sitemap to git log

It's worth pausing to consider what our sitemap looks like:

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://johnnyreilly.com/2012/01/07/standing-on-shoulders-of-giants</loc>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>https://johnnyreilly.com/2022/09/20/react-usesearchparamsstate</loc>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
  <!-- ... -->
</urlset>

If you look at the URL (loc) you can see that it's fairly easy to determine the path to the original markdown file. If we take https://johnnyreilly.com/2012/01/07/standing-on-shoulders-of-giants, we can see that the path to the markdown file is blog-website/blog/2012-01-07-standing-on-shoulders-of-giants/index.md.

As long as we don't have a custom slug in play (and I rarely do), we have a reliable way to get from blog post URL (loc) to markdown file. With that we can use simple-git to get the git log for that file. We can then use that to populate the lastmod property.

const dateBlogUrlRegEx = /(\d\d\d\d\/\d\d\/\d\d)\/(.+)/;

async function enrichUrlsWithLastmod(
  filteredUrls: SitemapUrl[],
): Promise<SitemapUrl[]> {
  const git = getSimpleGit();

  const urls: SitemapUrl[] = [];
  for (const url of filteredUrls) {
    if (urls.includes(url)) {
      continue;
    }

    try {
      // example url.loc: https://johnnyreilly.com/2012/01/07/standing-on-shoulders-of-giants
      const pathWithoutRootUrl = url.loc.replace(rootUrl + '/', ''); // eg 2012/01/07/standing-on-shoulders-of-giants

      const match = pathWithoutRootUrl.match(dateBlogUrlRegEx);

      if (!match || !match[1] || !match[2]) {
        urls.push(url);
        continue;
      }

      const date = match[1].replaceAll('/', '-'); // eg 2012-01-07
      const slug = match[2]; // eg standing-on-shoulders-of-giants

      const file = `blog-website/blog/${date}-${slug}/index.md`;
      const log = await git.log({
        file,
      });

      const lastmod = log.latest?.date.substring(0, 10);
      urls.push(lastmod ? { ...url, lastmod } : url);
      console.log(url.loc, lastmod);
    } catch (e) {
      console.log('file date not looked up', url.loc, e);
      urls.push(url);
    }
  }
  return urls;
}

Above we're using a regular expression to extract the date and slug from the URL, and from those we construct the path to the markdown file. We then ask simple-git for the log of that file and use the date of the latest commit to populate the lastmod property. URLs that don't match the blog post pattern, or whose lookup fails, are pushed onto the urls array unchanged.
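To make the URL-to-file step easy to check by hand, here is a minimal sketch that pulls it out into a pure function. The name locToMarkdownPath is hypothetical (it isn't in the code above), and the regular expression is a slightly stricter, anchored variant of the one used in the post. It assumes the blog-website/blog/yyyy-mm-dd-slug/index.md layout described earlier.

```typescript
// Hypothetical helper (not part of the original script): maps a blog post URL
// to the markdown file it was generated from.
// Returns undefined for URLs that don't look like dated blog posts.
function locToMarkdownPath(loc: string, rootUrl: string): string | undefined {
  // eg 2012/01/07/standing-on-shoulders-of-giants
  const pathWithoutRootUrl = loc.replace(rootUrl + '/', '');

  // anchored variant of the post's regex: yyyy/mm/dd/slug
  const match = pathWithoutRootUrl.match(/^(\d{4})\/(\d{2})\/(\d{2})\/(.+)$/);
  if (!match) {
    return undefined;
  }

  const [, year, month, day, slug] = match;
  return `blog-website/blog/${year}-${month}-${day}-${slug}/index.md`;
}

console.log(
  locToMarkdownPath(
    'https://johnnyreilly.com/2012/01/07/standing-on-shoulders-of-giants',
    'https://johnnyreilly.com',
  ),
); // blog-website/blog/2012-01-07-standing-on-shoulders-of-giants/index.md
```

As with the original, a post with a custom slug would map to a file that doesn't exist, so this only holds while the URL mirrors the folder name.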

Finally we return the urls array and assign it to the sitemap before we write the sitemap out:

sitemap.urlset.url = await enrichUrlsWithLastmod(filteredUrls);

Our new sitemap looks like this:

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://johnnyreilly.com/2012/01/07/standing-on-shoulders-of-giants</loc>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <lastmod>2021-12-19</lastmod>
  </url>
  <url>
    <loc>https://johnnyreilly.com/2012/01/14/jqgrid-its-just-far-better-grid</loc>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <lastmod>2022-11-03</lastmod>
  </url>
  <!-- ... -->
</urlset>

You can see that the lastmod property has been populated for each URL, based upon the most recent commit for that file. Yay!
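The lastmod values in the sitemap are just the first ten characters of the commit date. Here is a small sketch of that step with a guard added. The helper name toLastmod is hypothetical, and it assumes the commit date string starts with yyyy-mm-dd, as ISO 8601 style git dates do.

```typescript
// Hypothetical helper (not part of the original script): reduce a git commit
// date to a yyyy-mm-dd lastmod value.
function toLastmod(commitDate: string | undefined): string | undefined {
  if (!commitDate) {
    return undefined;
  }

  const day = commitDate.substring(0, 10);

  // only accept something shaped like yyyy-mm-dd; otherwise omit lastmod
  return /^\d{4}-\d{2}-\d{2}$/.test(day) ? day : undefined;
}

console.log(toLastmod('2021-12-19T10:15:00+00:00')); // 2021-12-19
console.log(toLastmod(undefined)); // undefined
```

Returning undefined rather than a malformed string means the URL would simply be written without a lastmod, which matches how the code above treats a missing commit.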

GitHub Actions - fetch-depth

You might think we were done (I thought we were done), but we're not. We're not done because we're using GitHub Actions to build the site.

When I tested this locally, it worked fine. However, when I ran it in GitHub Actions, it surfaced a latest.date which wasn't populated with the value you'd hope for. The reason was that the fetch-depth of the actions/checkout step was set to 1 (the default), which produces a shallow clone containing only the most recent commit. This meant that the git log couldn't provide the history we needed. Changing fetch-depth to 0 fetches the full history and resolves the situation.

- uses: actions/checkout@v3
  with:
    # Number of commits to fetch. 0 indicates all history for all branches and tags.
    # Default: 1
    fetch-depth: 0
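Since this problem only shows up in CI, it may be worth failing fast if the build ever runs against a shallow clone again. What follows is a sketch rather than something from the original script: assertFullHistory and the RevParse type are hypothetical names. It relies on git rev-parse --is-shallow-repository, which prints true or false in Git 2.15 and later. The function takes the rev-parse call as a parameter so that it doesn't depend on any particular git library.

```typescript
// Hypothetical guard (not part of the original script).
// `revParse` is any function that runs `git rev-parse <args>` and resolves
// with its standard output.
type RevParse = (args: string[]) => Promise<string>;

async function assertFullHistory(revParse: RevParse): Promise<void> {
  const output = await revParse(['--is-shallow-repository']);

  // git prints "true" for a shallow clone and "false" otherwise
  if (output.trim() === 'true') {
    throw new Error(
      'Shallow clone detected: set fetch-depth: 0 on actions/checkout',
    );
  }
}
```

Assuming simple-git's revparse method accepts an array of arguments, it could be wired up as assertFullHistory((args) => getSimpleGit().revparse(args)) before calling enrichUrlsWithLastmod.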