Blog

Resolving Re-Used Business Keys in Data Vault 2.0 with the Business Vault

24 Jan, 2025
Xebia Background Header Wave

Introduction

In the world of data modeling, one of the challenges organizations face is managing tables with business keys (BKs) that are re-used over time. This issue typically occurs when a company, for technical limitations in the source system, reuses primary keys or "unique identifiers" over time. This practice can lead to ambiguity, especially when historical data is needed for analytical purposes. This blog post explores how to use Data Vault 2.0 to resolve re-used BKs by leveraging the Business Vault. 

Understanding the Data Vault 2.0 Architecture 

Data Vault 2.0 is often seen as a modeling technique, but it’s a full solution that covers methodology, architecture, implementation, and modeling. 

Let’s have a look at the below architecture: 

DATA VAULT.drawio

 

  1. The first step is to extract the data from the source systems and load the data in its raw format into a Staging Area. 
  2. Next you can apply hard rules if necessary, such as data type changes.  
  3. Then you load the data into the Raw Vault and model the data into Hubs, Links, Satellites etc. 
  4. If you need to apply business logic to your data across more than one use case, you can extend the Raw Vault by creating a Business Vault.
  5. The last step is to build the Information Marts (IMs) to be used for reporting/dashboarding purposes. The IM tables or views are built using the Raw Vault/Business Vault and can be modeled using Kimball dimensional modeling for example. Check this other blog as reference: https://xebia.com/blog/kimball-dimensional-modelling/. 

Let’s check an example 

Suppose we are an e-commerce business, and we want to analyze our orders information. The order code is re-used over time. In this scenario, we typically have one source table for the order header, and another for the order items.  

order header: 

order_code order_date customer _code ship_to_city 
111110/10/2024AAlmere
111211/10/2024BAmsterdam
111112/11/2024DBreda

 

order items: 

order_codeline_codeitem_codequantityamount
11111AB132
11112BC34.99
11113DF1056
11121BC570
11122XS15
11111DE20251.95

 

The expectation of the Business Intelligence (BI) team is that they should have two Information Mart tables, one for the order header and another one for the order items, and they need to be able to join them for reporting purposes. They also want to keep track of the full historical changes. 

To determine which order items are linked to each order header, we would typically check the foreign key to order_code in the order items table, allowing us to reconstruct the complete order. 

However, if we examine the order header table closely, we notice that the order_code “1111” was reused on a different date. This introduces the complexity of determining which item codes are associated with each order_code over time. 

The logical solution in this case is to create a primary key (PK) in the order header table that uniquely identifies each order over time. This same value should be inserted as a foreign key (FK) in the order items table. By including the order_date as part of the primary key, we can ensure uniqueness and achieve a table with the expected results, as shown below: 

im_order_header: 

pk_order_codeorder_codeorder_datecustomer_code ship_to_city 
1111 || 10/10/2024111110/10/2024AAlmere 
1112 || 11/10/2024111211/10/2024BAmsterdam
1111 || 12/11/2024111112/11/2024DBreda

 

im_order_items: 

pk_order_line_codefk_order_code order_codeline_codeItem_codequantityamount
1111 || 10/10/2024 || 11111 || 10/10/202411111AB132
1111 || 10/10/2024 || 21111 || 10/10/202411112BC234.99
1111 || 10/10/2024 || 31111 || 10/10/202411113DF1056
1112 || 11/10/2024 || 11112 || 11/10/202411121BC570
1112 || 11/10/2024 || 21112 || 11/10/202411122XS15
1111 || 12/11/2024 || 11111 || 12/11/202411111DE20251.95

 

 In the next sections, we cover how you can use Data Vault 2.0 to obtain this output. 

The Raw Vault 

Let’s assume that your data is already extracted and available in the Staging Area. Next step is to load the data into the Raw Vault. In this blog, we will not cover the details about the different structures that can be created in the Data Warehouse following the Data Vault 2.0 principles, such as hubs, links, and satellites. Below, you can find one possible solution to model the order data.

DV Sales order 1.drawio.drawio 1

 

After creating your satellites, you will see the information below. In Data Vault 2.0, best practices recommend using Hash Keys (HK) to uniquely identify records and ensure consistent joins between tables. Hash Keys provide better performance during joins compared to text-based data types. 

rv_sat__order_header: 

hk_order  order_code  order_date customer_code ship_to_city load_date end_date 
MD5(1111)111110/10/2024AAlmere10/10/202412/11/2024
MD5(1112)111211/10/2024BAmsterdam11/10/2024NULL
MD5(1111)111112/11/2024DBreda12/11/2024NULL

 rv_sat__order_items: 

hk_order hk_order_line order_code line_code item_code quantity amount load_date end_date 
MD5(1111)MD5(1111 || 1)11111AB13210/10/202412/11/2024
MD5(1111)MD5(1111 || 2)11112BC234.9910/10/202412/11/2024
MD5(1111)MD5(1111 || 3)1111 3DF105610/10/202412/11/2024
MD5(1112)MD5(1112 || 1)1112 1BC57011/10/2024NULL
MD5(1112)MD5(1112 || 2)1112 2XS1511/10/2024NULL
MD5(1111)MD5(1111 || 1)1111 1DE20251.9512/11/2024NULL

 

 So now, we need to consider the following: 

  1. The same order code is only active once in the satellite (with end_date equal to NULL). 
  2. If you use these satellites to create your Information Marts tables or views, it’s essential to include only the active records and exclude historical data. Failing to filter for active records will result in non-unique primary keys, making it impossible to join the header and items models correctly. 

The business requires analyzing historical data to make informed decisions. Looking at the expected result, we can see that making the primary key unique involves applying business logic. Both tables need to use a combination of order_code and order_date as the unique identifier. However, the raw data for order items lacks any date information. In this scenario, we have the following possibilities: 

  1. Ask the data provider to include the date information in the order items source table. This can be very time-consuming and sometimes not possible. 
  2. After you load the data to the raw vault, create business logic to include the order code + order date combination into the item’s information. This can be done by leveraging the Business Vault. 
  3. Joining the header and items table before loading the data into the Raw Vault is not an option, because it would violate the Data Vault 2.0 standards that in the Staging Layer you only should apply hard rules.  
  4. Joining the header and items data after loading the data into the Raw Vault in the Information Mart is also not ideal because it adds complexity and redundancy to the reporting layer. The business logic to combine order code + order date would need to be recreated for every use case, leading to inconsistencies. 

 The Business Vault 

The Business Vault is an extension of the Raw Vault, particularly useful when you need to apply business logic that is needed for several use cases. It needs to be implemented in a way where you will have the same surrogate key for both the order header and the order items. 

bv__order_header: 

The bv__order_header is an extension of the rv_sat__order_header. To create it, you need to generate a surrogate key and recalculate the end_date based on the previous load_date value for each surrogate key. This approach allows you to handle cases where two order_code values, such as “1111”, are valid simultaneously. Below is an example of the code and how the Business Vault order header table would look: 

with
  unique_key as (
    select
      order_code,
      order_date,
      concat (bk_order_code, ' || ', order_date) as bk_order_code,
      customer_code,
      ship_to_city,
      load_date
    from
      rv_sat__order_header
  ),
  end_date_calculation as (
    select
      *,
      lead (load_date) over (
        partition by
          bk_order_code
        order by
          load_date
      ) as end_date
    from
      unique_key
  )
select
  bk_order_code,
  order_date,
  customer_code,
  ship_to_city,
  load_date,
  coalesce(end_date, '9999-12-31') as end_date
from
  end_date_calculation

 

 order_code order_date bk_order_code customer_code ship_to_city load_date end_date 
1111 10/10/20241111 || 10/10/2024BAlmere10/10/202411/10/2024
1111 10/10/20241111 || 10/10/2024AAlmere11/10/202431/12/9999
1112 11/10/20241112 || 11/10/2024BAmsterdam11/10/202412/10/2024
1112 11/10/20241112 || 11/10/2024BAmsterdam12/10/202431/12/9999
1111 12/11/2024 1111 || 12/11/2024DBreda12/11/202431/12/9999

 

 bv__order_items: 

A similar logic applies for creating the business vault table for rv_sat__order_items, but first you need to add the same surrogate key from the business vault order header that was previously created. Again, a code example and how the Business Vault order items table would look like: 

with
  sat_order_items as (
    select
      order_code,
      line_code,
      item_code,
      quantity,
      amount,
      load_date
    from
      rv_sat__order_items
  ),
  bv_order_header as (
    select
      *
    from
      bv__order_header
  ),
  add_unique_key as (
    select
      sat_order_items.*,
      bv_order_header.bk_order_code,
      concat (
        bv_order_header.bk_order_code,
        "||",
        sat_order_items.line_code
      ) as bk_order_line_code
    from
      sat_order_items
      left join bv_order_header on sat_order_items.order_code = bv_order_header.bk_order_code
      and sat_order_items.load_date between bv_order_header.load_date and bv_order_header.end_date
  ),
  end_date_calculation as (
    select
      *,
      lead (load_date) over (
        partition by
          bk_order_line_code
        order by
          load_date
      ) as end_date
    from
      add_unique_key
  )
select
  *,
  coalesce(end_date, '9999-12-31') as end_date
from
  end_date_calculation

 

order_codebk_order_code  bk_order_line_codeline_codeitem_codequantityamountload_dateend_date
11111111 || 10/10/20241111 || 1 || 10/10/20241AB13211/10/202431/12/9999 
11111111 || 10/10/20241111 || 2 || 10/10/20242BC234.9911/10/2024 31/12/9999 
11111111 || 10/10/20241111 || 3 || 10/10/20243DF105611/10/202431/12/9999
11121112 || 11/10/20241112 || 1 || 11/10/20241BC57012/10/202431/12/9999
11121112 || 11/10/20241112 || 2 || 11/10/20242XS1512/10/202431/12/9999
11111111 || 12/11/20241111 || 1 || 12/11/20241DE20251.9512/11/202431/12/9999

 

With the Business Vault in place, you can now create Information Marts that fully complies with the BI requirements. 

By materializing the Information Mart as a view that pulls directly from the Business Vault, you can avoid duplicating data across multiple layers and reduce the need for computations at the Information Mart layer. This approach significantly improves the speed at which BI tools can access and analyze the data. 

It’s important to note that, since a new business key was created by applying business logic, you will also need to reconstruct the Order Hub and establish links between the new business key and the Customer and Item Hubs within the Business Vault. This ensures consistent integration and traceability across the model. 

Conclusion 

Managing re-used business keys in a Data Vault 2.0 implementation can be challenging, but the Business Vault provides a robust solution to maintain historical integrity and ensure data uniqueness. 

Do you need help tackling complex Data Vault 2.0 challenges? Our Data Vault 2.0 certified analytics engineers at Xebia are here to assist you in leveraging your data vault effectively. Contact us today to unlock the full potential of your data!

Camila Birocchi
I'm a technology enthusiast passionate about problem-solving and making companies data-driven. I thrive on turning challenges into solutions and transforming data into actionable insights.
Questions?

Get in touch with us to learn more about the subject and related solutions

Explore related posts