
Maintaining conventions in dbt projects with dbt-bouncer

21 Nov, 2024

dbt (data build tool) has seen increasing use in recent years as a tool to transform data in data warehouses. At Xebia, we have seen a variety of dbt implementations and the challenges encountered when scaling their impact. Whether you have a single dbt project or a multitude of dbt projects, one common challenge is maintaining conventions as the number of contributors to your dbt project(s) grows.

Challenges of growing

Imagine the following scenario: you have a dbt project and are successfully delivering valuable data to your business stakeholders. To continue this work, you need to onboard new contributors to your dbt project. These contributors may come from your own team, a different analytics team, or a different engineering team. Regardless of their origin, they need to work with and maintain the conventions and standards you have set. But:

  • Will they name their staging models “stg_[source]__[entity]s” in alignment with dbt recommendations?
  • How do they know that you prefer boolean values for columns like “is_active” and “is_deleted”?
  • How can they adapt to naming conventions such as “updated_at_utc” and “created_at_cet”?
  • Will they feel comfortable approving their colleague’s pull requests if they themselves aren’t confident they remember all the discussions around the existing conventions?

In traditional software engineering projects, challenges like these are overcome with automated tooling: directory structures encourage a standardised file layout, pre-commit offers config-based formatting, and tools like flake8 offer linting capabilities.

But what about dbt?

Maintaining conventions in a dbt project

Most teams working in a dbt project will document their conventions. Sometimes this is in the README.md of the repository, while other times it is in an external tool like Confluence or Notion. Regardless of location, documentation is a great starting point: writing down the outcome of discussions allows new developers to get up to speed quickly. But when a dbt project grows and the number of developers increases, an automated approach is often the only scalable way forward.

In dbt Core projects it is common to see dbt-checkpoint implemented to help standardise conventions. dbt-checkpoint contains an extensive range of pre-commit hooks that query dbt artifacts like manifest.json to ensure that models have contracts implemented, that all columns in source tables are specified, and much more. Configuration options are provided via a .pre-commit-config.yaml file that lives in the dbt repository.

repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v2.0.6
    hooks:
      - id: check-model-has-tests
        args: ["--test-cnt", "2", "--"]

While dbt-checkpoint offers numerous useful hooks, it is constrained by its design as a pre-commit tool. This prevents running the hooks in dbt Cloud (as dbt Cloud can only run dbt commands) and makes developing new hooks difficult for those not already familiar with pre-commit’s inner workings.

Another tool in this area is dbt-project-evaluator, a dbt package from dbt Labs that materialises the contents of dbt’s graph context variable in the underlying database. dbt-project-evaluator powers the recommendations seen in dbt Explorer. This package can be executed from the dbt Cloud IDE, however it is not compatible with all dbt adapters (neither Fabric nor Synapse are supported). Tests can be added for models, documentation coverage and best practices like avoiding chained views. Configuration of these tests is via variables added to your dbt project’s dbt_project.yml file and via seeds, sometimes leading to a disjointed approach where it is unclear where to look for a particular configuration option.
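For example, coverage targets and thresholds are set as variables in dbt_project.yml. The variable names below come from the dbt-project-evaluator documentation at the time of writing, so verify them against your installed version:

# dbt_project.yml
vars:
  documentation_coverage_target: 90  # minimum percentage of documented models
  test_coverage_target: 100          # minimum percentage of tested models
  models_fanout_threshold: 10        # maximum direct children before a fanout warning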

In recent months Picnic open-sourced dbt-score, a Python package that uses the manifest.json to assign a score to individual models and sources. While this makes dbt-score compatible with any dbt adapter, it is constrained by its small number of rules (currently only 5) and the complexity required to apply different configurations to models and sources.

dbt-bouncer: A new approach

While the previously mentioned tools have advantages, they are all limited in one way or another. At Xebia we saw the benefit of taking the learnings from these tools and incorporating them into an approach that offers maximum flexibility and ease of use. This has resulted in dbt-bouncer, a Python package to help maintain conventions and standards in dbt projects.

dbt-bouncer runs checks against dbt artifacts, hence it is compatible with any dbt adapter. This also means that it does not require a direct connection to your database. How does dbt-bouncer work?

The first step is to install the Python package in your virtual environment:

pip install dbt-bouncer
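Because dbt-bouncer consumes dbt artifacts rather than querying your warehouse, those artifacts must exist before it runs. A plain `dbt parse` is enough to produce a manifest.json in the target directory; catalog.json and run_results.json are produced by `dbt docs generate` and commands such as `dbt build` respectively:

dbt parse  # writes target/manifest.json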

Next, create a configuration file for dbt-bouncer called dbt-bouncer.yml. Here is a basic example to get started with:

manifest_checks:
  - name: check_model_directories
    include: ^models
    permitted_sub_directories:
      - intermediate
      - marts
      - staging

This check will validate that all your models exist in one of the sub-directories specified in the permitted_sub_directories key.

Running dbt-bouncer is as simple as:

dbt-bouncer

The output printed to your console displays how many checks were run (in this example, one for each model in the dbt project):

Running dbt-bouncer (1.5.0)...
Loaded config from dbt-bouncer.yml...
Validating conf...
Parsed `manifest.json`, found `dbt_bouncer_test_project` project: 1 exposures, 3 macros, 8 nodes, 1 semantic models, 3 sources, 28 tests, 2 unit tests.
Running checks...
Assembled 8 checks, running...
Running checks... |################################| 8/8
`dbt-bouncer` failed. Please see below for more details or run `dbt-bouncer` with the `-v` flag.
Failed checks:
| Check name                                      | Severity   | Failure message                                                                                      |
|-------------------------------------------------|------------|------------------------------------------------------------------------------------------------------|
| check_model_directories:0:metricflow_time_spine | error      | AssertionError: `metricflow_time_spine` is located in `utilities`, this is not a valid sub-directory. |
Done. SUCCESS=7 WARN=0 ERROR=1

Uh oh! I forgot about my utilities sub-directory. Thankfully, the error message is descriptive enough to help me decide whether to add “utilities” to the configuration in dbt-bouncer.yml, or to take a closer look at the non-complying model to see if it should be moved.

What other checks can dbt-bouncer perform? There are many more, covering everything from naming conventions to documentation and test coverage.
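As a brief illustration, here is a sketch of two further manifest checks; the check names are taken from the dbt-bouncer documentation, but consult the docs for the full catalogue and exact parameters:

manifest_checks:
  # Staging models must follow the `stg_` naming convention
  - name: check_model_names
    include: ^models/staging
    model_name_pattern: ^stg_

  # Every model must have a populated description
  - name: check_model_description_populated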

Now that we’ve seen how to quickly set up and run dbt-bouncer, it’s time to take a look at some advantages it offers:

  • Since dbt-bouncer runs against dbt artifacts, it does not require a database connection. This simplifies the process of running dbt-bouncer in a continuous integration (CI) pipeline and ensures compatibility with all dbt adapters.
  • dbt-bouncer runs against all dbt artifacts: checks can target catalog.json, manifest.json and run_results.json, meaning they can be applied to macros, models, run results, sources and many more objects.
  • dbt-bouncer is written in Python, allowing new checks to be written in a familiar language and thoroughly unit tested.
  • dbt-bouncer is extendable: you can write your own checks and add them to your repository (see the sketch after this list).
  • dbt-bouncer is configured via a YML file, a format all dbt developers are already familiar with.
  • dbt-bouncer contains all the tests that can be run by dbt-checkpoint, dbt-project-evaluator and dbt-score.
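To give a feel for what an in-house check can look like, here is an illustrative sketch only: the import path, base class and model attributes below are assumptions, so follow the custom-check guide in the dbt-bouncer documentation for the real interface.

# Illustrative sketch: names and import paths are assumptions, not the
# documented dbt-bouncer API.
from typing import Literal

from dbt_bouncer.check_base import BaseCheck  # assumed import path


class CheckModelDescriptionIsNotEmpty(BaseCheck):
    """Fail any model whose description is missing or blank."""

    name: Literal["check_model_description_is_not_empty"] = (
        "check_model_description_is_not_empty"
    )

    def execute(self) -> None:
        # `self.model` is assumed to be the parsed model under test.
        assert (self.model.description or "").strip(), (
            f"`{self.model.name}` has an empty description."
        )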

But is it fast?

As with any new tool, a common question concerns speed. dbt-bouncer is competitive with similar tools. In our tests on a medium-sized, already parsed dbt project with 250 models connected to Google BigQuery, running a check/test for models that are not documented, we measured the following run durations:

| Tool                  | Run duration (seconds) |
|-----------------------|------------------------|
| dbt-bouncer           | 1.69                   |
| dbt-checkpoint        | 0.49                   |
| dbt-project-evaluator | 21.05                  |
| dbt-score             | 0.94                   |

While dbt-bouncer is not the fastest, in the context of a CI pipeline (its intended use case), it is not significantly slower than similar tools.

dbt-bouncer and dbt Cloud

dbt-bouncer is a Python package and, as such, cannot be run from the dbt Cloud IDE. However, it can run in a CI pipeline: when integrated with GitHub, the dbt-cloud-download-artifacts-action action can download the artifacts from the CI job run in dbt Cloud:

name: CI pipeline

on:
  pull_request:
    branches:
      - main

jobs:
  run-dbt-bouncer:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Download dbt artifacts
        uses: pgoslatara/dbt-cloud-download-artifacts-action@v1
        with:
          commit-sha: ${{ github.event.pull_request.head.sha }}
          dbt-cloud-api-token: ${{ secrets.DBT_CLOUD_API_TOKEN }}
          output-dir: target

      - name: Run dbt-bouncer
        uses: godatadriven/dbt-bouncer@v1
        env:
          GH_TOKEN: ${{ github.token }}
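If you run dbt Core rather than dbt Cloud, the download step is unnecessary: generating the artifacts inside the pipeline achieves the same result. A minimal sketch, assuming your project’s adapter and profile are available in CI (the adapter below is chosen for illustration):

      - name: Generate dbt artifacts
        run: |
          pip install dbt-core dbt-duckdb  # swap in your own adapter
          dbt parse                        # writes target/manifest.json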

[Screenshot: dbt-bouncer check results surfaced as GitHub commit statuses]

dbt-bouncer in a multi-project setup (dbt Mesh)

In some organisations dbt projects are allocated per team, some of which can read models from other dbt projects. This is sometimes referred to as a “dbt Mesh”, borrowing from the term Data Mesh. Though only natively supported by dbt Cloud, a dbt Mesh can also be set up using dbt Core.

One challenge in a dbt Mesh is the large number of developers working across multiple dbt projects, leading to inconsistent conventions being implemented. dbt-bouncer can help standardise these conventions via the use of per-project configuration files or even a shared configuration file (the dbt-bouncer documentation has more details).
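One possible arrangement, sketched below with a hypothetical central repository, is to keep a single dbt-bouncer.yml in a shared standards repository and have each project’s CI pipeline fetch it before running dbt-bouncer:

      - name: Fetch shared dbt-bouncer config
        run: |
          # <org>/dbt-standards is a placeholder for your central standards repository
          curl -sSLo dbt-bouncer.yml \
            https://raw.githubusercontent.com/<org>/dbt-standards/main/dbt-bouncer.yml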

Should you use dbt-bouncer?

Whether you are managing one dbt project or many, using dbt Core or dbt Cloud, ensuring that conventions and best practices are followed by all developers is a never-ending journey. With dbt-bouncer we believe in making this journey smoother, and hopefully even faster!

To get started, take a look at our GitHub repository today!


Are you part of an organisation looking into implementing best practices around dbt? Our analytics engineer consultants are here to help – just contact us and we’ll get back to you soon. Or are you an analyst, analytics engineer or data engineer interested in learning more about dbt? Check out our dbt Learn course at Xebia Academy or have a look at our job openings.

Pádraic Slattery
Pádraic is a technical-minded engineer passionate about helping organizations derive business value from data. With experience in data engineering, Business Intelligence development, and data analysis, he specializes in data ingestion pipelines and DataOps.