Hugo is a great tool that lets you build websites quickly and easily. It renders a complete static HTML version of your website, which makes it an ideal candidate for hosting in an S3 bucket. You can then use CloudFront to serve the content across the globe.
CloudFront Cache
Caching can be a wonderful thing, but it can also be an annoying one, so I decided to dive a little deeper into the topic for my own personal website. When you use S3 as an origin, the default TTL (Time To Live) is 24 hours. This means that once you visit your site, that version is then cached for 24 hours at the edge location.
If you update your website, the changes will not be visible to you right away, because you will most likely end up on the same edge server. You can invalidate the entire cache, but objects like photos and videos benefit from staying cached.
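To make the effect of the TTL concrete, here is a toy model of a single edge location. It is a deliberately simplified sketch and not how CloudFront is implemented; the class `ToyEdgeCache` and its methods are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Tuple

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # the 24-hour default TTL mentioned above


class ToyEdgeCache:
    """Simplified model of one edge location (illustration only)."""

    def __init__(self, fetch_from_origin: Callable[[str], str],
                 ttl: int = DEFAULT_TTL_SECONDS) -> None:
        self.fetch_from_origin = fetch_from_origin
        self.ttl = ttl
        # Maps a path to (cached body, time at which the entry expires).
        self.entries: Dict[str, Tuple[str, int]] = {}

    def get(self, path: str, now: int) -> str:
        cached = self.entries.get(path)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh, the origin is not consulted
        body = self.fetch_from_origin(path)
        self.entries[path] = (body, now + self.ttl)
        return body

    def invalidate(self, path: str) -> None:
        # Drop the cached entry so the next request goes to the origin.
        self.entries.pop(path, None)


origin = {"/index.html": "v1"}
edge = ToyEdgeCache(lambda path: origin[path])

first = edge.get("/index.html", now=0)      # cache miss, fetched from origin
origin["/index.html"] = "v2"                # the site is redeployed
stale = edge.get("/index.html", now=3600)   # one hour later, served from cache
edge.invalidate("/index.html")
fresh = edge.get("/index.html", now=3601)   # after invalidation, refetched
print(first, stale, fresh)
```

In this model the redeployed page only becomes visible once the entry expires or is invalidated, which is the behaviour the rest of this post works around.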
Only invalidate what has changed
That brings us to the next logical step: invalidate only the things that have changed. I am using CodeCommit as a source repository, so I can ask git which files changed:
git diff-tree --no-commit-id --name-only -r HEAD
This gives you a list of the files changed in your last commit. With this information you can do smart things:
- Only select the files changed in the content folder.
- Replace the parts that contain any _index to index.
- Replace the parts that contain any .md to .html.
- Detect where we have post aggregation.
So for this blog post that would result in:
$ git diff-tree --no-commit-id --name-only -r HEAD
content/blog/2023/07/invalidate-cloudfront-cache-for-hugo-websites/index.md
Hugo uses _index.md files to generate pages that list all child pages. These overview pages are also cached and also need to be invalidated.
By using the following script I will include these pages in my “to be invalidated” list:
#!/usr/bin/env python3
from typing import List
import os
import subprocess

# Ask git which files were touched by the last commit.
command = ['git', 'diff-tree', '--no-commit-id', '--name-only', '-r', 'HEAD']
output = subprocess.check_output(command, cwd=os.getcwd()).decode('utf-8')
changed_files = output.strip().split('\n')


def content_changed(path: str) -> bool:
    return path.startswith("content/")


def convert_html_endpoints(path: str) -> str:
    # Map a source path under content/ to the path Hugo renders it to.
    path = path.replace("content/", "")
    path = path.replace("_index", "index")
    path = path.replace(".md", ".html")
    if not path.startswith("/"):
        path = f"/{path}"
    return path


def find_list_pages(path: str) -> List[str]:
    # Walk down the path and collect every _index.md (list page) on the way.
    existing_path = ""
    list_pages = []
    for part in path.split("/"):
        list_page_path = os.path.join(existing_path, part, "_index.md")
        if os.path.exists(list_page_path):
            list_pages.append(list_page_path)
        existing_path = os.path.join(existing_path, part)
    return list_pages


def flatten_list(list_of_lists: List[List[str]]) -> List[str]:
    return [item for sublist in list_of_lists for item in sublist]


changed_files += flatten_list(map(find_list_pages, changed_files))
changed_files = list(dict.fromkeys(changed_files))  # de-duplicate, keep order
changed_files = filter(content_changed, changed_files)
changed_files = sorted(map(convert_html_endpoints, changed_files))

for changed_file in changed_files:
    print(changed_file)
Let’s examine what that would bring us:
$ ./retrieve_files.py
/blog/2023/07/invalidate-cloudfront-cache-for-hugo-websites/index.html
/blog/index.html
/index.html
The blog post itself is listed, but so are the front page and the blog section of my website. Since these pages list my post, I include them in the pages to invalidate.
Putting it together
In my build phase I call this script, since I have access to the commit history there. A simple command captures these files in a changelog.txt file:
./retrieve_files.py > public/changelog.txt
Because my pipeline runs in a separate AWS account, I could not use a Lambda function. Instead, I upload the changelog file to the S3 bucket and use an S3 Event Notification to trigger the create_invalidation call.
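The post does not show the code on the receiving end of the notification, so the following is only a minimal sketch of what it could look like. It assumes the notification ends up invoking a small Python function that reads the uploaded changelog and passes its paths to CloudFront. `get_object` and `create_invalidation` are existing boto3 calls; `DISTRIBUTION_ID`, `parse_changelog`, `build_invalidation_batch`, `invalidate_from_event` and `handler` are hypothetical names chosen for this example.

```python
import time
import urllib.parse
from typing import Any, Dict, List

# Hypothetical placeholder: replace with the ID of your own distribution.
DISTRIBUTION_ID = "EXAMPLEDISTID"


def parse_changelog(body: str) -> List[str]:
    """Return the non-empty paths from changelog.txt, without duplicates."""
    paths = [line.strip() for line in body.splitlines() if line.strip()]
    return list(dict.fromkeys(paths))


def build_invalidation_batch(paths: List[str], caller_reference: str) -> Dict[str, Any]:
    """Build the InvalidationBatch structure passed to create_invalidation."""
    return {
        "Paths": {"Quantity": len(paths), "Items": paths},
        "CallerReference": caller_reference,
    }


def invalidate_from_event(event: Dict[str, Any], s3_client: Any,
                          cloudfront_client: Any,
                          distribution_id: str = DISTRIBUTION_ID) -> List[str]:
    """Read each changelog named in the S3 event and invalidate its paths."""
    invalidated: List[str] = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = s3_client.get_object(Bucket=bucket, Key=key)
        paths = parse_changelog(response["Body"].read().decode("utf-8"))
        if not paths:
            continue  # nothing under content/ changed, nothing to invalidate
        cloudfront_client.create_invalidation(
            DistributionId=distribution_id,
            # CallerReference must be unique for every invalidation request.
            InvalidationBatch=build_invalidation_batch(
                paths, f"{key}-{time.time_ns()}"),
        )
        invalidated.extend(paths)
    return invalidated


def handler(event: Dict[str, Any], context: Any) -> List[str]:
    # boto3 is imported here so the helpers above can be exercised
    # without AWS access or credentials.
    import boto3
    return invalidate_from_event(
        event, boto3.client("s3"), boto3.client("cloudfront"))
```

Passing the clients in as arguments keeps the parsing and batching logic testable with simple stand-ins. The sketch also assumes the notification is filtered so that it only fires for changelog.txt.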
Conclusion
You can use your existing systems to keep track of what has changed. You only need to expose it in the correct way so that it becomes usable in downstream actions.
This way you make use of a decoupled, event-driven pattern.
Photo by Magda Ehlers