Many companies use the Kimball dimensional modeling framework to model their data and will undoubtedly think of, or are already using, Slowly Changing Dimensions Type 2 (SCD2) models. It is one of the most common approaches to modeling historical data for analytical purposes. It is so common that today’s most popular analytics engineering tool, dbt (data build tool), has been able to create SCD2 models since 2016 (!). However, as always, there are many different ways to go about creating Slowly Changing Dimensions Type 2 in dbt. Moreover, how historical data is ingested and stored in your data warehouse largely determines how you can implement SCD2 models inside dbt.
This practical 2-part guide explains the concept of Slowly Changing Dimensions, the different patterns for storing historical data, and the different ways of creating SCD2 models in dbt.
In part 1, we will go over what Slowly Changing Dimensions are and, especially, how you extract and store data in your data platform so it is ready to be converted into SCD2 models. In part 2, we will get hands-on and go over dbt snapshots as well as creating custom SCD2 models yourself that are more flexible and scalable than dbt’s native solution.
Kimball & Slowly Changing Dimensions
Before covering anything else, let’s briefly discuss Kimball dimensional modeling and the need for Slowly Changing Dimensions.
Dimensional modeling
One of the most popular data modeling techniques, still used by a lot of companies, was introduced by Kimball in 1996. In his book, Kimball introduced the concepts of facts and dimensions, as well as a process to gather the requirements for creating these facts and dimensions. The idea is that if you use this technique, you model everything in a consistent way, which makes it easier to adjust data models and easier for analysts and BI tools to consume data.
Facts
Fact tables serve as ‘the repository of the numeric facts that are observed during the measurement event’ (link). In other words, for any business process you want to measure, you store all of those numeric measurements in a fact table. Fact tables are typically very narrow (a limited number of columns) but contain a lot of records. These tables can then be used to calculate metrics like revenue, quantity sold, etc.
transaction_id | customer_id | transaction_date | revenue | quantity |
---|---|---|---|---|
22 | 123 | 2024-01-01 | 19.99 | 2 |
Example of an order line fact table with 3 foreign keys and 2 facts (revenue and quantity).
Dimensions
Having just fact tables is not enough to create a valuable data warehouse. Besides fact tables, you also want to create dimensions. Dimensions “provide the who, what, where, when, why, and how context surrounding a business process event” (link). In other words, anything that describes a measurement in a fact will typically end up in a dimension. You can join these dimensions to facts using keys to add more context to the facts. For example, you can join a customer dimension to a transaction fact to analyze the number of transactions done by customers from a specific country.
Dimensions are typically very wide (lots of different columns) but don’t have many records. In its most basic form, a dimension shows the current state of the data. For example, what is true about a customer right now?
customer_id | customer_name | customer_address | customer_email | customer_created_at |
---|---|---|---|---|
123 | Adam Johnson | Main Street 1, London | a.johnson@email.com | 2024-01-01T14:00 |
Example of a customer dimension table with a number of attributes.
The need for Slowly Changing Dimensions
Whereas many business questions can be answered by just having basic facts and dimensions and joining them, this won’t be sufficient to answer any business questions related to changes of the ‘truth’ over time. Basic facts and dimensions will help answer questions such as:
- How many customers do we have now?
- How much revenue did we do last year?
- Which product category brought in the most revenue last month?
However, they won’t be able to answer questions like:
- Which products changed ingredients last year?
- Which customers moved?
- How often do customers change their e-mail address?
To answer these questions, you need not only the truth about customers, products, etc. right now, but also the truth about those customers or products at any point in time. Fortunately for us, Kimball introduced the concept of Slowly Changing Dimensions to accommodate this requirement. In the rest of this guide, we will refer to the truth as a state. Hence, a customer can have a lot of historical states if information about that customer changes.
Different ways to model historical states of data (SCD types)
Kimball describes 8 types of slowly changing dimensions that cater to capturing changes for different use cases. Since the purpose of this guide is not to explain the full Kimball methodology, we will only cover Type 1 and Type 2, which are the most popular and relevant types. The other types either limit the number of states that can be stored or are variations on Type 2.
To illustrate the different types of Slowly Changing Dimensions, we will use the customer source table of a Dutch Stroopwaffle shop with just 1 customer as an example throughout the rest of this guide. It is a very simple source table with the id, name and address of a customer, plus metadata columns that describe when a record was created and/or updated.
customer_id | customer_name | customer_address | created_at | updated_at |
---|---|---|---|---|
123 | Adam Johnson | Main Street 1, London | 2024-01-01T14:00 | 2024-01-01T14:00 |
Customer source table
Type 1
This is the ‘default’ type of dimension. It contains no history and only stores the current state of a dimension. If any change occurs to, for example, a customer, the related customer record will simply be overwritten.
customer_id | customer_name | customer_address | created_at | updated_at |
---|---|---|---|---|
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 |
Example of a customer Slowly Changing Dimension Type 1 with a number of attributes. It contains only the current truth and 1 record = 1 customer. Note the updated address and updated_at values.
Type 2
In a Slowly Changing Dimension Type 2, every time a customer is added or updated, a new record is inserted. To keep track of which state was true during which period in time, this type adds three columns: valid_from, valid_to and is_current_record. For example, if a customer changes address, a new record with the same information but the updated customer address is inserted, and the valid period of each record is made available in those three columns.
customer_id | customer_name | customer_address | created_at | valid_from | valid_to | is_current_record |
---|---|---|---|---|---|---|
123 | Adam Johnson | Main Street 1, London | 2024-01-01T14:00 | 2024-01-01T14:00 | 2024-01-01T14:59 | false |
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | 2024-01-05T15:59 | false |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-05T16:00 | 9999-12-31T23:59 | true |
Example of a customer Slowly Changing Dimension Type 2 where a customer changes address two times.
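With the valid_from and valid_to columns in place, answering point-in-time questions becomes a simple range filter. As a minimal sketch, assuming the SCD2 table above is available under the hypothetical name dim_customer_scd2, looking up the address that was valid on a given date could look like this:

```sql
-- A minimal sketch, assuming a hypothetical SCD2 table named dim_customer_scd2
-- with the valid_from and valid_to columns shown above.
-- Question: what was customer 123's address on 2024-01-03?
select
    customer_id,
    customer_address
from dim_customer_scd2
where customer_id = 123
  and timestamp('2024-01-03 12:00:00') between valid_from and valid_to;
```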
Slowly Changing Dimensions and deleted records
When working with historical data, you will almost always encounter situations where records in source data get deleted. Think of customers wanting to opt out (GDPR), products being deleted, etc. This means that while creating SCD2 tables, you will have to account for this.
Kimball addresses deleting records only very briefly in his book, but in practice, companies have to account for deleted records in source data, and how to handle deletions is a very important design choice. There are two ways of handling deletes.
Soft deletes
A soft delete means a record gets marked as deleted but physically remains in the database. This is usually done to make sure the data will not be used by downstream systems, while keeping it for auditing and analytical purposes.
As an example, consider the same sample data for a specific customer. This customer asks to be deleted from the company’s systems. In an SCD2 model, this could be modeled as follows:
customer_id | customer_name | customer_address | created_at | valid_from | valid_to | is_current_record | deleted_at |
---|---|---|---|---|---|---|---|
123 | Adam Johnson | Main Street 1, London | 2024-01-01T14:00 | 2024-01-01T14:00 | 2024-01-01T14:59 | false | 2024-01-06T12:00 |
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | 2024-01-05T15:59 | false | 2024-01-06T12:00 |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-05T16:00 | 2024-01-06T11:59 | false | 2024-01-06T12:00 |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-06T12:00 | 9999-12-31T23:59 | true | 2024-01-06T12:00 |
This way of modeling allows you to keep the soft-deleted customer in the historical states, while marking every record as deleted and including the deletion timestamp.
Hard deletes
Whenever a record gets physically deleted from a database, we call it a hard delete. This is typically done to comply with data retention rules or opt-out regulation (GDPR in Europe).
There is not much to illustrate here, as there is no data left to show once a customer gets hard deleted from a database. When creating SCD2 models, your data operations should include a process to hard delete records in all tables downstream from your source system. Otherwise, you would only delete the data in the source, but not in your SCD2 models, for example.
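As an illustration, and assuming there is a process that collects opt-out requests, a periodic clean-up job in the data warehouse could remove these customers from the downstream SCD2 table as well. Both table names below are hypothetical:

```sql
-- A minimal sketch, assuming a hypothetical table gdpr_deletion_requests that lists
-- the customer_ids that must be physically removed, and a downstream SCD2 table dim_customer_scd2.
delete from dim_customer_scd2
where customer_id in (
    select customer_id
    from gdpr_deletion_requests
);
```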
How to prepare data for SCD2?
When you need to create Slowly Changing Dimension Type 2 models, there are three relevant and commonly used patterns to extract source data, store states and ultimately create an SCD2 table.
- The first pattern is using Change Data Capture to capture every database transaction.
- The second pattern, full extracts, uses a snapshotting process to capture the full state of the source table periodically and append it to a history table.
- The third pattern, delta extracts, uses a snapshotting process to capture only the changed records in a source table.
In the following sections, we will go over the various approaches to extract and store your source data states and how that would translate into SCD2 models.
Change Data Capture (CDC)
The most detailed SCD2 models are possible when using CDC tables. With this type of source table, you have access to the full history of every record in a table. This means that you know exactly when a record was created, updated and/or deleted, including the state of that record at each step.
What is Change Data Capture?
Change Data Capture, or CDC for short, is a process in which you capture every change to data in the source table. The changes can then be loaded into another table on your data platform, either in batches or in real-time. CDC is supported by many SQL databases; however, the exact implementation and feature set varies per database. For this guide, we will keep it basic.
In a SQL database, each change to a record in a table is done using a SQL operation. The most common operations are INSERT, UPDATE and DELETE. If you create a customer, you INSERT a new record into the table; when you want to update a customer’s record, you perform an UPDATE; and when you want to remove a customer, you run a DELETE statement. Most SQL databases store these operations in a log, which can then be read and used to capture all data changes to a table.
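For example, the statements below would each end up as an entry in the database’s log; they follow the customer example used throughout this guide:

```sql
-- Each of these statements is recorded by the database and can be picked up by CDC.
insert into customer (customer_id, customer_name, customer_address, created_at, updated_at)
values (123, 'Adam Johnson', 'Main Street 1, London', '2024-01-01 14:00:00', '2024-01-01 14:00:00');

update customer
set customer_address = 'Main Street 3, London',
    updated_at = '2024-01-01 15:00:00'
where customer_id = 123;

delete from customer
where customer_id = 123;
```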
How can you create an SCD2 table using CDC?
Let’s use the customer table used earlier in this guide to illustrate how CDC gives more insight into the changes of customer information. When directly querying the customer source table, you would typically only see the latest state of the customer table.
customer_id | customer_name | customer_address | created_at | updated_at |
---|---|---|---|---|
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-05T16:00 |
Customer source table of Stroopwaffle Shop
However, in the background we have captured every change to this customer’s record in this table using the database’s CDC feature and loaded these changes to a new table:
customer_id | customer_name | customer_address | created_at | updated_at | operation | row_operation_at |
---|---|---|---|---|---|---|
123 | Adam Johnson | Main Street 1, London | 2024-01-01T14:00 | 2024-01-01T14:00 | INSERT | 2024-01-01T14:00 |
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | UPDATE | 2024-01-01T15:00 |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-05T16:00 | UPDATE | 2024-01-05T16:00 |
CDC table of customer source table
In the new table, you can see that there have been 3 table operations for this specific customer: 1 INSERT and 2 UPDATEs. Furthermore, you can see that the operation type and record operation timestamps have been captured and stored together with the records. From this table, it becomes clear the customer’s address has been updated on the same day the record was created and then adjusted again 4 days later.
Now that you have all the historical changes available, you can create an SCD2 model:
customer_id | customer_name | customer_address | created_at | scd2_valid_from | scd2_valid_until | is_current_record |
---|---|---|---|---|---|---|
123 | Adam Johnson | Main Street 1, London | 2024-01-01T14:00 | 2024-01-01T14:00 | 2024-01-01T14:59 | false |
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | 2024-01-05T15:59 | false |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-05T16:00 | NULL | true |
Resulting SCD2 Customer table using the CDC Customer table
In the resulting SCD2 table, you can see how the record operation timestamps are translated into the scd2_valid_from and scd2_valid_until columns.
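As a minimal sketch, assuming the CDC records land in a table with the hypothetical name raw.customer_cdc, this translation could be done in BigQuery SQL with a window function:

```sql
-- A minimal sketch, assuming the CDC records are available in raw.customer_cdc.
with changes as (
    select
        customer_id,
        customer_name,
        customer_address,
        created_at,
        row_operation_at as scd2_valid_from,
        -- A state is valid until the next change of the same customer.
        -- Subtracting one second reproduces the ...T15:59-style end timestamps shown above.
        timestamp_sub(
            lead(row_operation_at) over (partition by customer_id order by row_operation_at),
            interval 1 second
        ) as scd2_valid_until
    from raw.customer_cdc
    where operation in ('INSERT', 'UPDATE')
)

select
    *,
    scd2_valid_until is null as is_current_record
from changes
```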
How does CDC work with deleted records?
As mentioned already, one table operation is to DELETE records. This happens, for example, when customers opt out or when incorrect data is removed. CDC processes DELETE operations in the same way as INSERT and UPDATE operations: whenever a record in the source table gets deleted, the operation is registered in the log.
customer_id | customer_name | customer_address | created_at | updated_at | operation | row_operation_at |
---|---|---|---|---|---|---|
123 | Adam Johnson | Main Street 1, London | 2024-01-01T14:00 | 2024-01-01T14:00 | INSERT | 2024-01-01T14:00 |
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | UPDATE | 2024-01-01T15:00 |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-05T16:00 | UPDATE | 2024-01-05T16:00 |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-10T12:00 | DELETE | 2024-01-10T12:00 |
CDC table including DELETE operation
You can now use this DELETE operation record to soft- or hard delete this customer in the SCD2 table. More on delete operations later in this guide.
How do you set up CDC for your database?
As mentioned earlier, a lot of different databases support CDC, but with varying feature sets and implementations. It is out of scope for this guide to go into depth, but we can briefly mention a few options.
Databricks
Databricks supports CDC out of the box when you use the Delta Lake format. The feature is called Change Data Feed (CDF) and gives you everything you need. Be aware that CDF is forward-looking, meaning it will only start capturing data changes once you enable it for a table. To enable it, run:
```sql
CREATE TABLE customer (
  customer_id INT,
  customer_name STRING,
  customer_address STRING,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)
TBLPROPERTIES (delta.enableChangeDataFeed = true)
```
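Once enabled, you can query the captured changes with the table_changes function, for example starting from table version 0:

```sql
-- Read the change feed of the customer table, starting from table version 0.
-- Next to the table's own columns, the result contains metadata columns such as
-- _change_type, _commit_version and _commit_timestamp.
SELECT * FROM table_changes('customer', 0);
```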
Google Cloud
When on Google Cloud, you could consider Datastream. This fully managed database replication service allows you to replicate databases like Postgres, MySQL etc. by using a CDC pattern and ingesting the data in real-time into Google Cloud Storage or Google BigQuery.
Debezium
When you don’t want to use a managed service like Change Data Feed from Databricks or Datastream from Google Cloud, or such a service is simply not available for your platform, you can also go for the open-source tool Debezium. Debezium can connect to databases like MySQL, PostgreSQL or MongoDB, capture changes and make these changes available in real-time for other applications to consume.
Snapshot tables
When you don’t have the option to use CDC to capture all changes and your source data only gives you the latest state, you can snapshot the source tables you want to create SCD2 models for.
What is snapshotting?
Snapshotting source data means that you look at a source table at regular intervals and capture the state of that source table at that time. Ideally, you store every snapshot you take (in a data lake) so you can always recreate a history table from scratch using these snapshots. In an analytical database, you can then expose those snapshots by simply appending them to each other, resulting in 1 large historical snapshot table. In this way, you capture the state of the data frequently and can use it to create an SCD2 table.
Snapshot strategies
Full extracts
In its most basic form, you make a copy of the entire table, a snapshot, every day and append it to a history table. For example, given the customer table of the Stroopwaffle shop, if we were to run daily snapshots and store each snapshot in a history table, you would get the following history table:
customer_id | customer_name | customer_address | created_at | updated_at | snapshotted_at |
---|---|---|---|---|---|
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | 2024-01-02T00:00 |
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | 2024-01-03T00:00 |
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | 2024-01-04T00:00 |
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | 2024-01-05T00:00 |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-05T16:00 | 2024-01-06T00:00 |
History table built from daily full extracts of the customer source table of the Stroopwaffle Shop
In the resulting table, you can see that a record gets copied over regardless of whether the customer’s information changed.
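Such a history table can be produced with a very simple daily job: append the current state of the source table to the history table, stamped with the snapshot time. A minimal sketch (the source_db.customer reference is illustrative; raw.customer_snapshots is the history table name also used further below):

```sql
-- A minimal sketch of a daily full snapshot: append today's state of the source table
-- to the history table, stamped with the moment the snapshot was taken.
insert into raw.customer_snapshots
select
    customer_id,
    customer_name,
    customer_address,
    created_at,
    updated_at,
    current_timestamp() as snapshotted_at
from source_db.customer;
```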
The advantage of full extracts is that they are simple to set up and allow you to identify deleted records (records that are in one extract, but not in the next one). This is especially useful for soft deletes; for hard deletes, a separate process is required to delete records from previous full extracts as well. The downside of the full extract strategy is that you will run into scalability issues when the data volume is medium to high.
Delta extracts
A delta strategy requires a reliable updated_at column to check whether any record in a source table has been updated since the last snapshot was taken. This way, you only extract inserted or updated records and you won’t get a lot of duplicate records in the history table. Again, this does require a very reliable updated_at column: if the updated_at column is, for example, not updated when manual fixes are done, relying on it will lead to data quality issues. Going back to our example, using a delta extract strategy results in far fewer records.
customer_id | customer_name | customer_address | created_at | updated_at | deleted_at | snapshotted_at |
---|---|---|---|---|---|---|
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | NULL | 2024-01-02T00:00 |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-05T16:00 | NULL | 2024-01-06T00:00 |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-06T12:00 | 2024-01-06T12:00 | 2024-01-07T00:00 |
History table built from daily delta extracts of the customer source table of the Stroopwaffle Shop
The resulting table is a lot slimmer and no longer includes duplicate records.
If there is no reliable updated_at column available to identify updated records, you can also compare the record contents of the previous state to the new state and check whether any values changed. You can do this by checking each column for changes individually or by generating a full record hash (recommended).
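In BigQuery SQL, generating such a record hash could look like the sketch below; the set of hashed columns and the source_db.customer reference are illustrative. Comparing this hash to the hash of the previously extracted state tells you whether a record changed:

```sql
-- A minimal sketch: build one hash over the relevant columns of a record,
-- so that a change in any of those columns changes the hash.
select
    customer_id,
    to_hex(md5(to_json_string(struct(
        customer_name,
        customer_address
    )))) as record_hash
from source_db.customer
```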
Deleted records also have to be included, for example by adding a deleted_at timestamp the moment a record is deleted. However, this only works for soft deletes. For hard deletes, similar to full extracts, you will need to set up a separate process to delete records from all delta extracts.
Creating a SCD2 table based on snapshotting
Using either the full extract or the delta extract strategy results in the same SCD2 model:
customer_id | customer_name | customer_address | created_at | scd2_valid_from | scd2_valid_until | is_current_record |
---|---|---|---|---|---|---|
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T14:00 | 2024-01-05T15:59 | false |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-05T16:00 | NULL | true |
You can see that the duplicate records captured with the full extract strategy are gone and only the distinct states of the customer are kept in the SCD2 table. The delta strategy already results in a table that is very similar to an SCD2 table.
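As a sketch, deriving this SCD2 table from a full-extract history table named raw.customer_snapshots (the name used in the sketch above) could look like this in BigQuery SQL: deduplicate on updated_at, then derive the validity window from the next state. The first known state uses updated_at as its scd2_valid_from here; you could also choose created_at for that record:

```sql
-- A minimal sketch, assuming the full-extract history table raw.customer_snapshots.
with deduplicated as (
    -- Keep one row per state: the first snapshot in which each updated_at value was seen.
    select
        customer_id,
        customer_name,
        customer_address,
        created_at,
        updated_at
    from raw.customer_snapshots
    qualify row_number() over (partition by customer_id, updated_at order by snapshotted_at) = 1
),

with_validity as (
    select
        *,
        updated_at as scd2_valid_from,
        timestamp_sub(
            lead(updated_at) over (partition by customer_id order by updated_at),
            interval 1 second
        ) as scd2_valid_until
    from deduplicated
)

select
    *,
    scd2_valid_until is null as is_current_record
from with_validity
```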
Snapshotting accuracy limitation
There are various ways to perform a snapshot, but they all have in common that the level of detail depends on the frequency of snapshotting. In other words, if you snapshot your source table monthly, you only capture the state of the data once per month and don’t know exactly what happened in between. If we look at our customer example and compare the history table using CDC with the history table using snapshotting, you will see that we missed the first UPDATE operation (the second record below) that changed the address. This is because 2 operations happened on the same day, in between 2 snapshots.
customer_id | customer_name | customer_address | created_at | updated_at | operation | row_operation_at |
---|---|---|---|---|---|---|
123 | Adam Johnson | Main Street 1, London | 2024-01-01T14:00 | 2024-01-01T14:00 | INSERT | 2024-01-01T14:00 |
123 | Adam Johnson | Main Street 3, London | 2024-01-01T14:00 | 2024-01-01T15:00 | UPDATE | 2024-01-01T15:00 |
123 | Adam Johnson | Main Street 4, London | 2024-01-01T14:00 | 2024-01-05T16:00 | UPDATE | 2024-01-05T16:00 |
CDC source table of the Stroopwaffle Shop, where the second record (the first UPDATE) would not be captured by snapshotting
Hourly snapshots would have solved this. However, an increase in snapshotting frequency usually goes hand in hand with checking how frequently you can extract before systems stop scaling. For example, if an hourly snapshotting process actually takes more than 1 hour to complete, or the source data is very large, you might want to refrain from hourly snapshotting and explore less frequent snapshots.
Finally, with snapshotting, if for whatever reason a snapshot isn’t captured, there is no possibility of backfilling. If your snapshotting process fails for 5 days, there is no way of knowing or recovering what the state was on each of those 5 days. With CDC, this is not an issue.
How does snapshotting work with deleted records?
Processing deleted records with the snapshotting strategy is not straightforward. At some point, you need to identify which records were deleted. The exact point where you do that slightly differs per strategy.
Identifying deleted records with full extracts
When snapshotting and storing full extracts with every snapshot, you can compare the previous state to the new state and check whether any entity from the previous state is no longer present in the new state. To be more specific, you perform this check AFTER you have appended the entire source table to the history table in the data lake. You can use a simple window function to do this. For example:
```sql
-- BigQuery SQL
with customers as (
    select
        customer_id,
        customer_name,
        customer_address,
        created_at,
        updated_at,
        snapshotted_at
    from raw.customer_snapshots
),

identify_next_record as (
    select
        *,
        -- A record has a "next" record if the same customer appears again in a later snapshot.
        lead(customer_id) over (partition by customer_id order by snapshotted_at) is not null as has_next_record,
        -- This method is not very scalable since it's a window function. For larger tables,
        -- consider storing the max snapshotted_at in a pre-computed Jinja variable in dbt.
        max(snapshotted_at) over () as max_snapshotted_at
    from customers
),

identify_deleted_customers as (
    select
        *,
        -- A customer is deleted when its last appearance is in a snapshot older than the latest one.
        (snapshotted_at <> max_snapshotted_at and not has_next_record) as is_deleted_record
    from identify_next_record
)

select * from identify_deleted_customers
```
Identifying deleted records with delta extracts
Identifying deleted records with delta extracts is a bit less straightforward, as this logic can take place either when ingesting or when transforming data. The process that takes a snapshot of the source data should identify new, updated and deleted records all at the same time. It involves comparing the latest state in the data lake with the current state of the source system. Since the ingestion process usually varies a lot per organization, it’s not easy to give an example. However, the dbt snapshots feature is able to do this in the transformation phase (since the data already has to be available on the data platform for dbt snapshots to work); you will learn more about this in part 2 of this series.
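To give a flavour of what is coming in part 2: a dbt snapshot that uses the timestamp strategy and invalidates hard-deleted records could look roughly like the sketch below. The source reference and schema name are illustrative:

```sql
{% snapshot customer_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=True
    )
}}

-- The snapshotted query; the source name 'shop' is illustrative.
select * from {{ source('shop', 'customer') }}

{% endsnapshot %}
```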
Which pattern to use when?
Based on the descriptions of the CDC and snapshotting strategies, you might think that going for CDC is always the recommended path due to its level of detail and low latency. The truth, however, is that, as always, it depends.
Strategy | Accuracy | Scalability | Timeliness | Costs |
---|---|---|---|---|
CDC (real-time) | Transaction level accuracy | Suitable for large data volumes | Real-time updates | Relatively harder set-up and relatively higher costs to operate streaming pipelines. |
CDC (batch) | Transaction level accuracy | Suitable for large data volumes | Depends on batch frequency | Relatively harder set-up and relatively lower costs to operate batching pipelines. |
Snapshotting (full extracts) | Depends on frequency of snapshotting (e.g. daily) | Suitable for small data volumes | Depends on snapshot frequency | Relatively easy set-up and relatively low costs to operate. Slightly higher storage costs due to the inefficient storage pattern. |
Snapshotting (delta extracts) | Depends on frequency of snapshotting (e.g. daily) | Suitable for medium data volumes | Depends on snapshot frequency | Relatively easy set-up and relatively low costs to operate. |
Whereas CDC gives you the most granular level of detail to create a complete history table of every record, the cost of implementing, maintaining and operating such a process can outweigh the added value of having that detail available. Especially the streaming variant of CDC will lead to higher operational costs.
Implementing and maintaining CDC processes requires data engineering capacity, as it is not just a matter of turning a feature on. Especially if you require a real-time CDC solution, a lot more complexity comes into play. Moreover, minor detail: CDC needs to be supported by your source system as well.
On the other hand, snapshotting is very straightforward and relatively easy to set up. Depending on the availability of the updated_at column, you can even make this process quite lightweight, as you will see in part 2 of this series. For most use cases, you don’t need such detailed history and daily snapshots are more than enough.
One final, very important requirement that can make all of these considerations irrelevant is the need for auditability. If your company has to comply with audits, having transaction-level accuracy will be very important.
At Xebia, we’ve seen a lot of use cases at organizations that really don’t need all that complexity. Therefore, we recommend carefully evaluating the use case you want SCD2 models for and then deciding how you want to extract and store historical data.
How to create SCD2 models in dbt?
Now you know how you can extract and store historical data using various patterns.
In part 2 of this series, I will go more in-depth into practically creating SCD2 models based on historical data. Both the existing dbt snapshots feature (which belongs to the delta extract snapshotting pattern) and custom implementations of SCD2 models in dbt will be covered.
Stay tuned for part two!