
Invalidate CloudFront cache for Hugo websites

13 Jul, 2023

Hugo is a great tool that lets you build websites quickly and easily. It renders a complete static HTML version of your website, which makes it an ideal candidate for hosting in an S3 bucket. You can then use CloudFront to serve the content across the globe.

CloudFront Cache

Caching can be a wonderful thing! But it can also be an annoying thing, so I decided to dive a little deeper into this topic for my own personal website. By default, when you use S3 as an origin, the TTL (Time To Live) is set to 24 hours. This means that when you visit your site, that version is then cached for 24 hours on the edge location.
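In CloudFormation terms, this default corresponds to the TTL settings on the distribution's cache behavior. A minimal sketch (the origin ID is a placeholder, and the surrounding distribution resource is omitted):

```yaml
# Fragment of an AWS::CloudFront::Distribution DistributionConfig.
DefaultCacheBehavior:
  TargetOriginId: my-s3-origin        # placeholder origin ID
  ViewerProtocolPolicy: redirect-to-https
  MinTTL: 0
  DefaultTTL: 86400                   # 24 hours, the default described above
  MaxTTL: 31536000
```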

As a result, if you update your website, the changes will not be visible to you, because you will most likely end up on the same edge server. You can invalidate the cache, but objects like photos and videos benefit from being cached.

Only invalidate what has changed

That brings us to the next logical step: invalidate only the things that have changed. I am using CodeCommit as a source repository, and you can ask git which files have changed:

git diff-tree --no-commit-id --name-only -r HEAD

This gives you a list of the files changed in your last commit. With this information you can do smart things:

  • Only select the files changed in the content folder.
  • Replace any _index part with index.
  • Replace the .md extension with .html.
  • Detect where we have post aggregation.

For this blog post, that results in:

$ git diff-tree --no-commit-id --name-only -r HEAD
content/blog/2023/07/invalidate-cloudfront-cache-for-hugo-websites/index.md

Hugo uses _index.md files to generate pages that list all child pages. These overview pages are also cached and also need to be invalidated.
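To make that concrete, here is a sketch of the content layout this post assumes, with the HTML endpoint each markdown file maps to:

```
content/
├── _index.md                 →  /index.html        (front page list page)
└── blog/
    ├── _index.md             →  /blog/index.html   (blog section list page)
    └── 2023/07/invalidate-cloudfront-cache-for-hugo-websites/
        └── index.md          →  /blog/2023/07/invalidate-cloudfront-cache-for-hugo-websites/index.html
```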

By using the following script I will include these pages in my “to be invalidated” list:

#!/usr/bin/env python3
from typing import List
import os
import subprocess

# Ask git which files changed in the last commit.
command = ['git', 'diff-tree', '--no-commit-id', '--name-only', '-r', 'HEAD']
output = subprocess.check_output(command, cwd=os.getcwd()).decode('utf-8')
changed_files = output.strip().split('\n')


def content_changed(path: str) -> bool:
    return path.startswith("content/")


def convert_html_endpoints(path: str) -> str:
    path = path.replace("content/", "")
    path = path.replace("_index", "index")
    path = path.replace(".md", ".html")

    if not path.startswith("/"):
        path = f"/{path}"

    return path


def find_list_pages(path: str) -> List[str]:
    """Collect every _index.md list page in the folders above the given path."""
    existing_path = ""
    list_pages = []

    for part in path.split("/"):
        list_page_path = os.path.join(existing_path, part, "_index.md")

        if os.path.exists(list_page_path):
            list_pages.append(list_page_path)

        existing_path = os.path.join(existing_path, part)

    return list_pages


def flatten_list(list_of_lists: List[List[str]]) -> List[str]:
    return [item for sublist in list_of_lists for item in sublist]


# Add the list pages that sit above each changed file.
changed_files += flatten_list(map(find_list_pages, changed_files))
# De-duplicate while preserving order.
changed_files = list(dict.fromkeys(changed_files))
# Keep only files in the content folder.
changed_files = filter(content_changed, changed_files)
# Map the markdown sources to the HTML paths that CloudFront serves.
changed_files = sorted(map(convert_html_endpoints, changed_files))
print("\n".join(changed_files))

Let’s examine what that gives us:

$ ./retrieve_files.py
/blog/2023/07/invalidate-cloudfront-cache-for-hugo-websites/index.html
/blog/index.html
/index.html

The blog post itself is listed, but so are the front page and the blog section of my website. Since these pages list my posts, I include them in the pages to invalidate.

Putting it together

I call this script in my build phase, since I have access to the commit history there. A simple command captures these files in a changelog.txt file:

./retrieve_files.py > public/changelog.txt

Because my pipeline runs in a separate AWS account, I could not use a Lambda function directly. Instead, I upload the changelog file to the S3 bucket and use an S3 Event Notification to trigger the create_invalidation call.
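A minimal sketch of what such a Lambda handler could look like. The boto3 create_invalidation call is real, but the distribution ID is a placeholder and error handling, permissions, and bucket wiring are left out:

```python
import time
from typing import Dict


def build_invalidation_batch(changelog: str) -> Dict:
    """Turn the changelog.txt contents into a CloudFront InvalidationBatch."""
    paths = [line.strip() for line in changelog.splitlines() if line.strip()]
    return {
        "Paths": {"Quantity": len(paths), "Items": paths},
        # CallerReference must be unique per invalidation request.
        "CallerReference": str(time.time()),
    }


def handler(event, context):
    """S3 Event Notification entry point (sketch, not production code)."""
    import boto3  # available in the Lambda runtime

    record = event["Records"][0]["s3"]
    body = boto3.client("s3").get_object(
        Bucket=record["bucket"]["name"],
        Key=record["object"]["key"],
    )["Body"].read().decode("utf-8")

    boto3.client("cloudfront").create_invalidation(
        DistributionId="E1234567890ABC",  # placeholder distribution ID
        InvalidationBatch=build_invalidation_batch(body),
    )
```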

Conclusion

You can use your existing systems to keep track of what has changed. You only need to expose it in the correct way so that it becomes usable in downstream actions.

This way you make use of a decoupled, event-driven pattern.

Photo by Magda Ehlers

Joris Conijn
Joris has been working with the AWS cloud since 2009, focusing on building event-driven architectures. Having worked with the cloud from (almost) the start, he has seen most of the services being launched. Joris strongly believes in automation and infrastructure as code and is open to learning new things and experimenting with them, because that is the way to learn and grow. In his spare time he enjoys running and runs a small micro brewery from his home.