
Introduction
Data is a crucial yet sensitive asset for businesses. But how can we safeguard sensitive information while ensuring the right people have access?
Fine-grained authorization in Databricks Unity Catalog empowers organizations to implement robust, role-based data security policies. By leveraging features such as row-filters and column masks, organizations can achieve precise control over data visibility. This not only ensures compliance with strict data governance policies but also fosters secure collaboration across teams and departments within the data lakehouse.
As data security and privacy concerns continue to grow, implementing fine-grained authorization is becoming a cornerstone of modern data governance strategies. With tools like Unity Catalog, organizations can strike the perfect balance between protecting sensitive data and enabling data-driven decision-making.
In this post, I’ll dive into the implementation of a sophisticated data filtering model for the Databricks Unity Catalog. I’ll use row-filters and column-masks to meet the most complex access requirements. Come along!
Databricks Unity Catalog:
The Unity Catalog is the centralized governance layer within the Databricks Lakehouse Platform. It simplifies and strengthens data governance by providing a unified framework for managing data, AI models, and other resources. Its core capabilities include access controls, tools for data discoverability, the ability to annotate datasets with descriptive metadata, and lineage tracking to trace data from its source to the point where it’s consumed.
By centralizing access and governance, Unity Catalog enables secure collaboration across teams while enhancing transparency. This empowers organizations to unlock the full value of their data, maintain trust, and ensure compliance with regulatory requirements.
In today’s increasingly complex data landscape, managing access at a broad asset level (e.g., granting access to an entire dataset) is no longer sufficient. As organizations grow, so do their data security needs. Avoiding data silos and addressing diverse user roles requires a more granular approach to access control. This is where fine-grained authorization comes into play.
The need for Fine-Grained Authorization:
Having access controls in place for assets, such as who can read and write, is a fundamental requirement. But what if you don’t want to expose the entire contents of an asset? What if you only want to share a subset of it, such as specific rows or columns? Examples of sensitive data that could require this level of protection are purchase prices, profit margins, and customer details. In summary, all information that could easily be leveraged by competitors.
A straightforward but not ideal solution is to separate sensitive data into different datasets and grant access only to a privileged group. However, this approach becomes unsustainable at scale. Splitting datasets multiple times leads to challenges in maintainability and oversight.
A better solution is to make the access control fine-grained, that is, setting detailed permissions at row and column-level. In Databricks Unity Catalog, this involves applying row-level filters and column-level masks to tables.
Simple example:
As mentioned earlier, implementing fine-grained access controls involves creating row filters and column masks. These are defined as functions and applied directly to tables within the Unity Catalog. Below is an example of row-level filters:
CREATE OR REPLACE FUNCTION
governance.row_level_security.udf_row_filter ()
RETURNS BOOLEAN
RETURN IF(is_account_group_member("Xebia Data"), True, False);
;
ALTER TABLE catalog.schema.sample_table
SET ROW FILTER governance.row_level_security.udf_row_filter on ();
Within the above function: is_account_group_member() checks if the current_user() is part of the group, in this case: “Xebia Data”. The group must exist inside Databricks, it could be that it’s synced from e.g. Azure EntraID to Databricks.
Prerequisites and considerations:
Before exploring the possibilities of row-filters and column-masks, there are a few considerations to keep in mind:
Implement Fine-grained Authorization on the consuming end-point. For example, if data is consumed by a dashboarding tool, it may be more appropriate to handle authorization there. If the consumers are using the Databricks platform, then Unity Catalog is the ideal place to implement fine-grained authorization. Otherwise, you may need to re-apply the fine-grained authorization at other endpoints.
Understand technical limitations: Refer to the Azure Databricks row and column-filters (support and limitations) for details on potential constraints.
Advanced Example
Imagine a scenario where security policies enforce role-based access restrictions. For instance, account managers can view data related to their assigned clients but cannot access product purchase prices. Similarly, purchasers can only view sales data for their assigned products, including the purchase price, but are limited to seeing only the customer ID instead of detailed customer information. This separation ensures that sensitive company data, such as customer details and profit margins, is protected.
The table below demonstrates this scenario using sales order lines:
With row-level filters and column masks, we will make this requirement possible. To enforce fine-grained authorization, the first step is to define and store user-level access permissions. We achieve this through the creation of a mapping table that specifies which users have access to certain data.
Defining a mapping table for Fine-Grained Access
To control the row-filter and column-masks, I used a source that specifies the allowed access on a user level. I mocked this data and transformed it into a lookup table: ‘user_customer_product’ and stored it inside Databricks.
If you are interested in how to create the lookup table, click here to expand the code:
written in pyspark:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType, BooleanType
# mapping table: user_customer_product
user_customer_product = [
{
"user": "maurice@xebia.com",
"role": "Account Manager",
"customer_key": [11185, 11142],
"product_key": None,
"purchase_price": False
},
{
"user": "ruben@xebia.com",
"user": "maurice.veltman@xebia.com",
"role": "Purchaser",
"customer_key": None,
"product_key": [225, 484, 485, 479, 214, 528],
"purchase_price": True
},
{
"user": "ken@xebia.com",
"role": "Admin",
"is_admin": True
}
]
# Schema
schema = StructType([
StructField("user", StringType(), False),
StructField("role", StringType(), False),
StructField("customer_key", ArrayType(IntegerType()), True),
StructField("product_key", ArrayType(IntegerType()), True),
StructField("purchase_price", BooleanType(), True),
StructField("is_admin", BooleanType(), True)
])
# dataframe
df = spark.createDataFrame(data=user_customer_product, schema=schema)
# create table
table_name = "governance.row_level_security.user_customer_product"
table_location= 'abfss://governance@xdnldataplatform.dfs.core.windows.net/authorization/user_customer_product'
df.write.mode("overwrite").option("path", table_location).saveAsTable(table_name)
display(spark.table(table_name))
This information required to create a similar lookup table can probably be derived from an ERP system, what needs to be derived is: – the assigned account manager from the customer table – the responsible purchaser for a product from the product table
The table in this example looks like this:
The column user
corresponds to the current_user()
who will be retrieving the data. The customer_key
and product_key
columns are used for row filtering to return only approved records; if these are left empty, all records will be returned. The purchase_price
column determines whether to display the purchase price or not. The is_admin
column will override the row filters and column masks.
Once the mapping table is in place, the next step is to implement a row filter function. This function will ensure that only the rows a user is authorized to access are returned from the dataset.
Implementing Row-Level Filters:
To enforce row-level security, we need to create a function and apply it on a table. For example, sales_order_line
contains sales records that need to be secured. The code below contains a SQL function that returns TRUE
when row access is allowed and FALSE
when access is not allowed based on the contents of the mapping table above.
CREATE OR REPLACE FUNCTION
governance.row_level_security.udf_row_filter_adventureworks_sales_order_line (CustomerKey INT, ProductKey INT)
RETURNS BOOLEAN
LANGUAGE SQL
NOT DETERMINISTIC
READS SQL DATA
RETURN
EXISTS(
SELECT
NULL -- Selecting NULL as we only care about existence
FROM
governance.row_level_security.user_customer_product AS row_filter
WHERE
row_filter.user = current_user()
AND CASE
-- Check access conditions based on user permissions
-- Full access
WHEN row_filter.is_admin = TRUE THEN TRUE
-- Access for specific products, allowing all customers
WHEN array_contains(row_filter.product_key, ProductKey)
AND row_filter.customer_key IS NULL THEN TRUE
-- Access for specific customers, allowing all products
WHEN array_contains(row_filter.customer_key, CustomerKey)
AND row_filter.product_key IS NULL THEN TRUE
-- Access for specific products and customers
WHEN array_contains(row_filter.product_key, ProductKey)
AND array_contains(row_filter.customer_key, CustomerKey) THEN TRUE
-- No access if none of the conditions are met
ELSE FALSE
END
)
;
ALTER TABLE staging.adventureworks.sales_order_line
SET ROW FILTER governance.row_level_security.udf_row_filter_adventureworks_sales_order_line
ON (CustomerKey, ProductKey);
Looking at the code above, it may seem surprising to find the EXISTS
clause, but it is the optimal solution for checking row existence. The key advantage of EXISTS
is that it stops processing as soon as it finds a match, making it more efficient, especially when working with larger datasets or lookup tables. In comparison to only a select—a SELECT
statement requires a GROUP BY
or LIMIT 1
to return a single boolean. However, LIMIT 1
can behave unexpectedly when multiple rows exist for a user—something that shouldn’t happen but cannot be guaranteed. Furthermore, using a GROUP BY
to aggregate results into a single boolean value introduces additional overhead, as it generates an intermediate result set that requires sorting and deduplication. This makes it less efficient compared to the behavior of the EXISTS
clause.
While the row filter function restricts access to specific rows, we also need a way to control access to sensitive columns, such as the purchase price. This is where column masking comes into play.
Implementing Column masking
To enforce column masking, we create a function that hides sensitive data, such as the purchase price from unauthorized users. It’s important to note that the column itself will remain visible in the dataset, but its values will be masked for users without the necessary permissions.
Below is the code for implementing the column-masking function:
CREATE OR REPLACE FUNCTION
governance.row_level_security.udf_mask_purchase_price (column FLOAT)
LANGUAGE SQL
NOT DETERMINISTIC
READS SQL DATA
RETURN
IF (
EXISTS(
SELECT
NULL -- Selecting NULL as we only care about existence
FROM
governance.row_level_security.user_customer_product mask
WHERE
mask.user = current_user()
AND CASE
-- Check access conditions based on user permissions
-- Full access
when mask.is_admin = TRUE THEN TRUE
-- Access for purchase price
when mask.purchase_price = TRUE THEN TRUE
-- No access if none of the conditions are met
ELSE FALSE
END
),
column,
NULL
);
ALTER TABLE staging.adventureworks.sales_order_line
ALTER COLUMN purchasePrice SET MASK governance.row_level_security.udf_mask_purchase_price;
Note that there is no return type defined in the code above. This provides the flexibility to apply the masking function to multiple columns, such as integers, floats, or decimals, without needing to create separate functions for each type.
From this point on, the Fine-grained Authorization is applied on a user level. The users Maurice, Ruben, and Ken (specified in the mapping table) can see the data they are authorized for. All other users won’t see any records.
Validation
To validate the functionality, we will query the table: sales_order_line
under the user accounts of Maurice, Ruben, and Ken.
The query below will be used to test the implementation:
SELECT
CURRENT_USER() as user,
COLLECT_LIST(DISTINCT CustomerKey) AS Customer,
COLLECT_LIST(DISTINCT ProductKey) AS Product,
ROUND(AVG(purchasePrice),2) AS avg_price
FROM
staging.adventureworks.sales_order_line
GROUP BY CustomerKey
LIMIT 10
Case: Account manager
User: Maurice
Access: Customers [11185, 11142] without seeing purchase prices.
Result:
| user | Customer | Product | avg_price |
|--------|----------|-----------------------------------------|-----------|
| maurice| 11142 | [480, 528, 538, 490, 537, 487, 465, 472,| null |
| | | 463, 481, 477, 529, 536, 222, 217, 535, | |
| | | 541, 475, 214, 539, 234, 483, 530] | |
| maurice| 11185 | [530, 538, 217, 480, 471, 485, 237, 225,| null |
| | | 479, 528, 535, 541, 539, 475, 489, 537, | |
| | | 465, 488, 477, 486, 536, 491, 214, 484, | |
| | | 228, 529, 540, 222, 472] | |
Validation: Success
Maurice only sees the two assigned customers (11185 and 11142) with all associated products. As expected, the purchase price values are masked (null).
Case: Purchaser
User: Ruben
Access: Products [225, 484, 485, 479, 214, 528] with access to purchase prices.
Result:
| user | Customer | Product | avg_price |
|--------|----------|-----------------------------------------|-----------|
| ruben | 11142 | [528, 214] | 10.65 |
| ruben | 11185 | [225, 484, 485, 479, 214, 528] | 7.48 |
| ruben | 15277 | [528] | 1.87 |
| ruben | 22773 | [485, 528] | 5.04 |
| ruben | 11451 | [214, 485] | 10.65 |
Validation: Success
Ruben sees sales orders from different customers, but only for the products he is authorized to see (225, 484, 485, 479, 214, 528). Additionally, the purchase price is visible for these products, as expected.
For example:
– For Customer 11142, only products [528, 214] are included.
– For Customer 11185, only products [225, 484, 485, 479, 214, 528] are included.
Case: Admin
User: Ken
Access: Full access to all customers and all products, including purchase prices.
Result:
| user | Customer | Product | avg_price |
|--------|----------|-----------------------------------------|-----------|
| ken | 11142 | [480, 528, 538, 490, 537, 487, 465, 472,| 11.33 |
| | | 463, 481, 477, 529, 536, 222, 217, 535, | |
| | | 541, 475, 214, 539, 234, 483, 530] | |
| ken | 11185 | [530, 538, 217, 480, 471, 485, 237, 225,| 10.52 |
| | | 479, 528, 535, 541, 539, 475, 489, 537, | |
| | | 465, 488, 477, 486, 536, 491, 214, 484, | |
| | | 228, 529, 540, 222, 472] | |
| ken | 15277 | [465,485,477,214,478] | 7.21 |
| ken | 22773 | [485,528,536,486] | 5.04 |
| ken | 11451 | [575,478,472,214,217,359,344,477,485] | 523.31 |
Validation: Success
Ken, as an Admin, has unrestricted access to all customers, all products, and the purchase prices. The query results confirm that no row or column-level restrictions are applied for this user. For example:
– For Customer 11142, all associated products are visible, with the average purchase price calculated.
– For Customer 11185, the same unrestricted access is observed.
Conclusion:
The validation process confirms that fine-grained authorization is functioning correctly:
– Maurice (Account Manager): Access is restricted to specific customers (11185 and 11142), with purchase prices masked.
– Ruben (Purchaser): Access is restricted to specific products (225, 484, 485, 479, 214, 528), with purchase prices visible.
– Ken (Admin): Full access to all data without any restrictions.
This validation underscores the effectiveness of implementing row-level filters and column masking to enforce fine-grained access control. By tailoring data visibility to individual roles, organizations can ensure that sensitive information is protected while providing users with the data they need to perform their jobs.
What’s Next?
Looking ahead, the introduction of Attribute-Based Access Control (ABAC) promises even greater flexibility by enabling policies and governed tags. This capability will eliminate the need to apply functions to individual tables, as we can define policies at the catalog or schema level. For instance, we may enforce row filters or column masks when a table or column is marked with a governed tag (a new feature) such as company-secret, PII, or purchase_price for a specific group of users.
Additionally, I hope fine-grained access controls will also expand to volumes inside Unity Catalog, enabling the protection of paths for non-tabular data. This enhancement would further strengthen data security by extending to unstructured and semi-structured data, ensuring a comprehensive approach to data protection.