Blog

Your Data Lakehouse Architecture Questions Answered: Open-Source Data Catalogs vs Market Leaders

Marek Wiewiórka

April 8, 2026
5 minutes

The most revealing part of any technical webinar isn’t always the presentation; it’s the questions that linger in the audience's mind. In a recent deep dive on data lakehouses, Xebia experts discussed how catalogs store metadata, manage access control, and facilitate data discovery. 

While the experts laid out the battle between market leaders and open-source challengers, the live demos and subsequent Q&A cut to the heart of what data practitioners really want to know: "Will this data lakehouse catalog work for my company?" 

Curious about how a data lakehouse catalog works? Watch our on-demand webinar here. Or, if you would rather start with the general concept of the data lakehouse, check out the first webinar in the series. 

Let us take a look at the key questions from the audience, answered through the concepts and live demonstrations from the session. 

Read our first blog post based on the webinar: Market Leaders vs Challengers: the Ongoing Battle for Data Catalogs in Data Lakehouse.

Question #1 "Is it possible to mix open-source tools with my existing proprietary platform? Or am I locked in?" 

This was a central theme of the webinar, and the demos provided a powerful, twofold answer. 

  • The open-source stack demo answered: yes, it is possible to build a fully open alternative. The presenters showcased a working architecture with Lake Keeper (technical catalog), DataHub (business catalog), and PyIceberg/Spark, all tied together with Keycloak for authentication. The webinar walked through a fully automated ELT pipeline, with column-level lineage appearing automatically in DataHub. This proved that a modular, best-of-breed stack isn't just a nice diagram; it's a deployable reality that avoids vendor lock-in entirely. 
  • The "open-by-design" demo answered: yes, it is entirely possible to bend proprietary platforms toward openness. This was a real crowd-pleaser. For Databricks Unity Catalog, the presenters showed that enabling the UniForm feature makes a native Delta table instantly queryable via the Iceberg REST API by an external PyIceberg client. For Snowflake, they used the Open Data Catalog (managed Polaris) to create a table with PyIceberg, then synced it seamlessly into the internal Horizon catalog for SQL querying. 
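To give a flavour of what "open by design" looks like from the client side, here is a minimal sketch of connecting an external PyIceberg client to an Iceberg REST catalog endpoint. The endpoint URI, token, warehouse, and table identifier below are hypothetical placeholders, not values from the demo; substitute whatever your platform exposes (e.g. the Unity Catalog or Polaris Iceberg REST endpoint).

```python
# A minimal sketch: reading a table through the Iceberg REST Catalog API
# with PyIceberg. All endpoint details below are hypothetical placeholders.

# Catalog configuration as PyIceberg expects it (no network call is made
# at this point).
catalog_config = {
    "type": "rest",
    "uri": "https://example-platform.com/api/iceberg",  # hypothetical endpoint
    "token": "<access-token>",                          # vended credential
    "warehouse": "demo_warehouse",                      # hypothetical warehouse
}


def read_orders():
    """Connect and scan a table; requires a reachable Iceberg REST catalog."""
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("rest_demo", **catalog_config)
    table = catalog.load_table("sales.orders")  # hypothetical identifier
    # Scan a sample of rows into an Arrow table; any Iceberg-speaking
    # engine could do the same against the same endpoint.
    return table.scan(limit=10).to_arrow()
```

Because the table is exposed over a standard protocol, the same snippet works whether the metadata lives in an open-source catalog or in a proprietary platform that speaks the Iceberg REST API.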

Building openly, or using platform features to keep your data accessible via open standards, is now a real possibility, not just fancy marketing talk. 

Question #2 "The webinar talked quite a bit about 'technical' vs. 'business' catalogs. As a small team, do I need both? It does sound complex." 

This question highlights a common point of confusion for teams. The Q&A session from the webinar clarified that this is a spectrum, not a mandate. 

  • Technical/Operational Catalogs (like Lake Keeper, Apache Gravitino) focus on the engine: security, credential vending, and table maintenance so engines like Spark can run. 
  • Business Catalogs (like DataHub, OpenMetadata) focus on the users: data discovery, lineage, quality, and governance. 

If you are on a small team, your primary need might be to enable secure data access for pipelines; in that case, a technical catalog is sufficient. If your need is for analysts to find and trust data, a business catalog is instead the key. The webinar demo showed the elegant combination: Lake Keeper ran the engine, and DataHub automatically ingested its metadata to provide the user-friendly interface and lineage. For small teams, integrated open-source tools make this combination increasingly feasible. 
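To make that combination concrete, here is a hedged sketch of how a DataHub ingestion recipe for an Iceberg-based technical catalog might look, expressed as the Python dictionary that DataHub's Pipeline API accepts. The catalog name, endpoints, and server URL are hypothetical, and the exact source options depend on your DataHub version; treat this as a shape, not a verified configuration.

```python
# Sketch only: a DataHub ingestion recipe wiring a technical catalog
# (an Iceberg REST catalog such as Lake Keeper) into the business catalog
# (DataHub). All endpoints and names are hypothetical placeholders.
recipe = {
    "source": {
        "type": "iceberg",
        "config": {
            # Connection details for the technical catalog (hypothetical).
            "catalog": {
                "demo_catalog": {
                    "type": "rest",
                    "uri": "http://lakekeeper.internal:8181/catalog",
                }
            },
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://datahub.internal:8080"},  # hypothetical
    },
}


def run_ingestion():
    """Execute the recipe; requires the acryl-datahub package and
    live catalog/DataHub services to be reachable."""
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Run on a schedule, a recipe like this is what keeps the business catalog's discovery and lineage views in sync with the tables the technical catalog actually serves.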

Question #3 "The open-source options seem promising but immature. What's the real catch compared to Unity Catalog or Snowflake?" 

This pragmatic question was addressed head-on in the evaluation. The speakers acknowledged that while open-source challengers (Gravitino, Lake Keeper) excel at core technical capabilities (Iceberg API, access control), they often lack enterprise-grade production features out-of-the-box. 

The gaps they identified include: 

  • Detailed access auditing and automated user sync (CIAM integration). 
  • Configuration-as-code and robust data sharing mechanisms. 
  • Built-in, automated table maintenance (compaction, vacuuming). 
  • Turn-key high availability. 

What is the real takeaway here? Choosing an open-source catalog means trading off "out-of-the-box completeness" for "architectural freedom." You gain interoperability and avoid licensing costs, but you must be prepared for a higher operational investment and custom integration work. As one speaker noted, "you need to be prepared for building custom automations." 
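What might such a custom automation look like? Below is a hedged sketch, not taken from the webinar, of scheduling Iceberg table maintenance yourself using Spark's built-in Iceberg SQL procedures (`rewrite_data_files` for compaction, `expire_snapshots` for cleanup). The catalog and table names are hypothetical, and the snippet assumes a SparkSession already configured with the Iceberg runtime.

```python
# Sketch of the kind of custom automation an open-source catalog user must
# build themselves: periodic compaction and snapshot expiration via the
# Iceberg Spark SQL procedures. Catalog/table names are hypothetical.
from datetime import datetime, timedelta, timezone


def maintenance_sql(catalog: str, table: str, retain_days: int = 7) -> list[str]:
    """Build the maintenance statements for one table."""
    cutoff = (
        datetime.now(timezone.utc) - timedelta(days=retain_days)
    ).strftime("%Y-%m-%d %H:%M:%S")
    return [
        # Compact small files into larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # Expire snapshots older than the retention window to reclaim storage.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
    ]


def run_maintenance(spark, catalog: str, tables: list[str]) -> None:
    """Run maintenance for each table; `spark` is an active SparkSession
    with the Iceberg runtime and the catalog registered."""
    for table in tables:
        for stmt in maintenance_sql(catalog, table):
            spark.sql(stmt)
```

Managed platforms run this kind of housekeeping for you automatically; with an open-source catalog, wiring it into a scheduler is part of the operational investment the speakers described.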

Question #4 "Sure, 'open' might be a fancy buzzword. But what does it actually mean for a data catalog?" 

In the webinar, we proposed a crucial dual definition that resonated with the audience: 

  1. Open by License: The software is open-source (e.g., Apache-licensed like Lake Keeper). 
  2. Open by Design/Standard: The catalog adheres to open protocols, even if proprietary. This is where Unity's support for the Iceberg REST API and Snowflake's adoption of Apache Polaris shine. 

The demos proved that #2 is often as important as #1. A platform can be proprietary yet still be a "good citizen" in an open ecosystem by speaking standard protocols. This allows organizations to leverage platform power without sacrificing future flexibility. 

The webinar concluded that the market is evolving to offer real choices. Your path is defined by how you answer these audience questions for yourself: 

  • Is your priority out-of-the-box sophistication or architectural control?
  • Do you need to build a new open stack, or liberate your existing platform investments?

The most encouraging answer from the entire session was this: the technology, through open standards like the Iceberg REST Catalog API and OpenLineage, now exists to support either path your company might choose. The questions have moved from "is it possible?" to "which trade-off best serves our goals?", a great sign of a maturing, powerful ecosystem. 

Learn more about the Open Data Lakehouse concept, and stay tuned for our next blog post about data catalogs. 

Written by

Marek Wiewiórka

Contact

Let’s discuss how we can support your journey.