dbt (data build tool) has seen increasing use in recent years as a tool to transform data in data warehouses. At Xebia, we have seen a variety of dbt implementations and the challenges encountered when scaling their impact. Whether you have a single dbt project or a multitude of dbt projects, one common challenge is maintaining conventions as the number of contributors to your dbt project(s) grows.
## Challenges of growing
Imagine the following scenario: you have a dbt project and you are successfully delivering valuable data to your business stakeholders. To continue this work, it is important that you onboard new contributors to your dbt project. These contributors can come from your team, a different analytics team, or a different engineering team. Regardless of their origin, these contributors need to work with and maintain the conventions and standards you have set. But:
- Will they name their staging models “stg_[source]__[entity]s” in alignment with dbt recommendations?
- How do they know that you prefer boolean values for columns like “is_active” and “is_deleted”?
- How can they adapt to naming conventions such as “updated_at_utc” and “created_at_cet”?
- Will they feel comfortable approving their colleague’s pull requests if they themselves aren’t confident they remember all the discussions around the existing conventions?
In traditional software engineering projects, challenges like these are overcome with automated tooling: directory structures encourage a standardised file layout, pre-commit offers config-based formatting, and tools like flake8 offer linting capabilities.
But what about dbt?
## Maintaining conventions in a dbt project
Most teams working in a dbt project will document their conventions. Sometimes this is in the `README.md` of the repository, while other times it is in an external tool like Confluence or Notion. Regardless of location, documentation is a great starting point: writing down the outcome of discussions allows new developers to quickly get up to speed. But when the size of a dbt project grows and the number of developers increases, an automated approach is often the only scalable way forward.
In dbt Core projects it is common to see dbt-checkpoint implemented to help standardise conventions. dbt-checkpoint contains an extensive range of pre-commit hooks that query dbt artifacts like `manifest.json` to ensure models have contracts implemented, all columns in source tables are specified, and much more. Configuration options are provided via a `.pre-commit-config.yaml` file that lives in the dbt repository:
```yaml
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v2.0.6
    hooks:
      - id: check-model-has-tests
        args: ["--test-cnt", "2", "--"]
```
While dbt-checkpoint offers numerous useful hooks, it is limited by the fact that it is designed to work as a pre-commit hook. This prevents running the hooks in dbt Cloud (as dbt Cloud can only run dbt commands) and makes developing new hooks difficult for those not already familiar with pre-commit's inner workings.
Another tool in this area is dbt-project-evaluator, a dbt package from dbt Labs that materialises the contents of dbt's `graph` context variable in the underlying database. dbt-project-evaluator powers the recommendations seen in dbt Explorer. This package can be executed from the dbt Cloud IDE, however it is not compatible with all dbt adapters (neither Fabric nor Synapse are supported). Tests can be added for models, documentation coverage and best practices like avoiding chained views. These tests are configured via variables added to your dbt project's `dbt_project.yml` file and via seeds, sometimes leading to a disjointed approach where you do not know where to look for a particular configuration option.
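For illustration, overriding the package's variables looks roughly like the sketch below; the nesting under the `dbt_project_evaluator` scope and the specific variable names (`documentation_coverage_target`, `chained_views_threshold`) reflect the package's docs at the time of writing and may differ in your version:

```yaml
# dbt_project.yml (sketch): override dbt-project-evaluator defaults.
# Variable names and scoping should be verified against the package docs.
vars:
  dbt_project_evaluator:
    documentation_coverage_target: 75 # package default is 100
    chained_views_threshold: 5        # package default is 4
```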
In recent months Picnic open-sourced dbt-score, a python package that uses the `manifest.json` to assign a score to individual models and sources. While this makes dbt-score compatible with any dbt adapter, it is constrained by its small number of rules (currently only 5) and the complexity required to apply different configurations to models and sources.
## `dbt-bouncer`: A new approach
While the previously mentioned tools have advantages, they are all limited in one way or another. At Xebia we saw the benefit of taking the learnings from these tools and incorporating them into an approach that offers maximum flexibility and ease of use. The result is `dbt-bouncer`, a python package to help maintain conventions and standards in dbt projects.
`dbt-bouncer` runs checks against dbt artifacts, hence it is compatible with any dbt adapter. This also means that it does not require a direct connection to your database. How does `dbt-bouncer` work?
The first step is to install the python package in your virtual environment:

```bash
pip install dbt-bouncer
```
Next, create a configuration file for `dbt-bouncer` called `dbt-bouncer.yml`. Here is a basic example to get started with:
```yaml
manifest_checks:
  - name: check_model_directories
    include: ^models
    permitted_sub_directories:
      - intermediate
      - marts
      - staging
```
This check will validate that all your models exist in one of the sub-directories specified in the `permitted_sub_directories` key.
Running `dbt-bouncer` is as simple as:

```bash
dbt-bouncer
```
The output printed to your console displays how many checks were run (in this example, one for each model in the dbt project):
```
Running dbt-bouncer (1.5.0)...
Loaded config from dbt-bouncer-example.yml...
Validating conf...
Parsed `manifest.json`, found `dbt_bouncer_test_project` project: 1 exposures, 3 macros, 8 nodes, 1 semantic models, 3 sources, 28 tests, 2 unit tests.
Running checks...
Assembled 8 checks, running...
Running checks... |################################| 8/8
`dbt-bouncer` failed. Please see below for more details or run `dbt-bouncer` with the `-v` flag.
Failed checks:
| Check name                                       | Severity | Failure message                                                                                        |
|--------------------------------------------------|----------|--------------------------------------------------------------------------------------------------------|
| check_model_directories:0:metricflow_time_spine  | error    | AssertionError: `metricflow_time_spine` is located in `utilities`, this is not a valid sub-directory.  |
Done. SUCCESS=7 WARN=0 ERROR=1
```
Uh oh! I forgot about my `utilities` sub-directory. Thankfully the error message is descriptive enough to help me assess whether I should add `utilities` to the configuration in `dbt-bouncer.yml`, or take a closer look at the non-complying model to see if it should be adjusted.
What other checks can `dbt-bouncer` perform? It turns out there are many; some examples (a configuration sketch follows this list):

- `check_columns_are_all_documented` ensures that all columns in a model are included in the model's properties file, improving documentation coverage.
- `check_exposure_based_on_view` ensures exposures are not based on views, as this may result in poor performance for data consumers.
- `check_run_results_max_execution_time` ensures each node in dbt's DAG completes before the specified duration, helping to avoid developers creating long-running models.
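As a sketch of how these checks might be enabled, the snippet below assumes the check grouping (`catalog_checks`, `run_results_checks`) and the `max_execution_time_seconds` parameter found in recent dbt-bouncer versions; verify both against the docs for your installed version:

```yaml
# Sketch only: section names and parameters should be verified against
# the dbt-bouncer docs for your installed version.
catalog_checks:
  - name: check_columns_are_all_documented

manifest_checks:
  - name: check_exposure_based_on_view

run_results_checks:
  - name: check_run_results_max_execution_time
    max_execution_time_seconds: 60 # assumed parameter name
```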
Now that we've seen how to quickly set up and run `dbt-bouncer`, it's time to take a look at some advantages it offers:
- Since `dbt-bouncer` runs against dbt artifacts, it does not require a database connection. This simplifies the process of running `dbt-bouncer` in a continuous integration (CI) pipeline and ensures compatibility with all dbt adapters.
- Runs against all dbt artifacts. Checks can be run against `catalog.json`, `manifest.json` and `run_results.json`, meaning checks can be applied to macros, models, run results, sources and many more objects.
- `dbt-bouncer` is written in python, allowing new checks to be written in a familiar language that can be thoroughly unit tested.
- `dbt-bouncer` is extendable: you can write your own checks and add them to your repository (see the sketch after this list).
- Is configurable via a `YML` file, a format all dbt developers are already familiar with.
- Contains all the tests that can be run by `dbt-checkpoint`, `dbt-project-evaluator` and `dbt-score`.
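To illustrate the extendability point, here is a hypothetical sketch of a custom check. The `BaseCheck` import path, the `name` attribute and the `execute` signature are illustrative assumptions, not the documented dbt-bouncer API; the actual interface is described in the dbt-bouncer documentation.

```python
# Hypothetical sketch of a custom dbt-bouncer check. The import path,
# base class and method signature below are illustrative assumptions,
# not the documented dbt-bouncer API.
from typing import Literal

from dbt_bouncer.check_base import BaseCheck  # assumed import path


class CheckModelNameIsLowercase(BaseCheck):
    """Fail when a model name contains uppercase characters."""

    name: Literal["check_model_name_is_lowercase"]

    def execute(self, model, **kwargs) -> None:  # assumed signature
        assert model.name == model.name.lower(), (
            f"`{model.name}` contains uppercase characters."
        )
```

Once registered (the docs describe how custom checks are discovered), such a check could be enabled from `dbt-bouncer.yml` like any built-in check.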
## But is it fast?
As with any new tool, a common question concerns its speed. `dbt-bouncer` is competitive with similar tools. In our tests on a medium-sized, already-parsed dbt project with 250 models connected to Google BigQuery, running a check/test for models that are not documented, we observed the following run durations:
| Tool                  | Run duration (seconds) |
|-----------------------|------------------------|
| dbt-bouncer           | 1.69                   |
| dbt-checkpoint        | 0.49                   |
| dbt-project-evaluator | 21.05                  |
| dbt-score             | 0.94                   |
While `dbt-bouncer` is not the fastest, in the context of a CI pipeline (its intended use case) it is not significantly slower than similar tools.
## `dbt-bouncer` and dbt Cloud
`dbt-bouncer` is a python package and, as such, cannot be run from the dbt Cloud IDE. However, it can be run from a CI pipeline when integrated with GitHub by using the dbt-cloud-download-artifacts-action action to download the artifacts from the CI job run in dbt Cloud:
```yaml
name: CI pipeline

on:
  pull_request:
    branches:
      - main

jobs:
  run-dbt-bouncer:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Download dbt artifacts
        uses: pgoslatara/dbt-cloud-download-artifacts-action@v1
        with:
          commit-sha: ${{ github.event.pull_request.head.sha }}
          dbt-cloud-api-token: ${{ secrets.DBT_CLOUD_API_TOKEN }}
          output-dir: target

      - name: Run dbt-bouncer
        uses: godatadriven/dbt-bouncer@v1
        env:
          GH_TOKEN: ${{ github.token }}
```
## `dbt-bouncer` in a multi-project setup (dbt Mesh)
In some organisations dbt projects are allocated per team, some of which can read models from other dbt projects. This is sometimes referred to as a “dbt Mesh”, borrowing from the term Data Mesh. Though only natively supported by dbt Cloud, a dbt Mesh can also be set up using dbt Core.
One challenge in a dbt Mesh is the large number of developers working across multiple dbt projects, which can lead to inconsistent conventions. `dbt-bouncer` can help standardise these conventions via per-project configuration files or even a shared configuration file; more details here.
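For example, each project's CI pipeline could point at a single shared configuration file. The `--config-file` flag below is an assumption based on the dbt-bouncer docs, so verify it against your installed version:

```bash
# Assumed flag: point dbt-bouncer at a shared config outside the project.
dbt-bouncer --config-file ../shared-conventions/dbt-bouncer.yml
```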
## Should you use `dbt-bouncer`?
Whether you are managing one dbt project or many, using dbt Core or dbt Cloud, ensuring that conventions and best practices are followed by all developers is a never-ending journey. With `dbt-bouncer` we believe in making this journey smoother, and hopefully even faster!
To get started, take a look at our GitHub repository today!
Are you part of an organisation looking into implementing best practices around dbt? Our analytics engineer consultants are here to help – just contact us and we’ll get back to you soon. Or are you an analyst, analytics engineer or data engineer interested in learning more about dbt? Check out our dbt Learn course at Xebia Academy or have a look at our job openings.