Market Leaders vs Challengers: the Ongoing Battle for Data Catalogs in Data Lakehouse

Marek Wiewiórka

April 8, 2026

In the rapidly evolving world of data lakehouse architecture, the data catalog has emerged as the central nervous system: a critical layer for governing, navigating, and securing data across hybrid environments. The contest between catalog offerings is not just about features; it’s about control, openness, and the future of data architecture itself.

A recent Xebia webinar dives deep into this landscape, revealing a fascinating battle between entrenched market leaders and a new wave of promising open-source challengers. This is part of a series of webinars dedicated to the Data Lakehouse architecture, exploring the foundations and future of big data storage and how it is shaping data architecture.

Find out more about how the data lakehouse works by watching our on-demand webinars: Towards Data Lakehouse Architecture.

In this blog post, we break down the concept of a data catalog, highlight emerging open-source projects that are gaining momentum, and explore whether they are ready to serve as the backbone of a modern data platform.

Data Catalogs. The Titans: Proprietary, Integrated, and Powerful

At the forefront are the well-funded, integrated offerings from cloud and platform giants. Databricks Unity Catalog stands as a reference implementation: a comprehensive, feature-rich catalog deeply embedded in the Databricks ecosystem. It supports Delta Lake natively and, notably, has recently added managed Iceberg support. With features like automated lineage, fine-grained access control, auditing, and its innovative “UniForm” layer (which allows Delta tables to be queried via the Iceberg REST API), Unity Catalog sets a high bar for technical and governance capabilities.

Similarly, Snowflake takes a dual approach with Horizon (for internal governance) and Open Catalog (a managed version of Apache Polaris), enabling both tight platform integration and open connectivity. The major cloud providers, namely AWS with Glue Data Catalog & Lake Formation, Google with BigLake Metastore (which now offers GA support for the Iceberg REST Catalog), and Microsoft with Fabric OneLake Catalog, each offer robust, managed catalog services tightly coupled with their respective ecosystems. These are "open by design" in the sense that they support open table formats (Iceberg, Delta, Hudi) and, increasingly, open standards, but they remain fundamentally proprietary services designed to reinforce platform lock-in.

Data Catalogs. The Challengers: Open-source, Modular, and Agile

The open-source frontier is where the most dynamic innovation is happening. These challengers are not trying to be all-encompassing platforms; instead, they focus on being the best-in-class, interoperable catalog layer for an open data stack.

Leading the pack is Apache Gravitino, which graduated to an Apache Top-Level Project in June 2025. It is a feature-rich project supporting tables, ML models, Kafka topics, and more. It boasts credential vending, access control, and nascent lineage and MCP (Model Context Protocol) support. However, its adoption is hampered by incomplete and confusing documentation, making it a powerful but challenging tool to implement.

Lakekeeper takes a different, minimalist approach. Written in Rust, it is lightweight, focused purely on being a high-performance Iceberg catalog with access control and credential vending. It comes with excellent documentation and a working Docker playground, making it incredibly easy to evaluate and deploy. Its simplicity is its strength, though it lacks the broader entity support of Gravitino.

Other notable contenders include the open-source version of Unity Catalog (a separate project that moves more slowly than its Databricks-managed namesake) and Apache Polaris (now in Apache incubation, and the open-source foundation of Snowflake's Open Catalog offering). DataHub also deserves mention; traditionally a business-oriented metadata catalog, it now includes Iceberg REST Catalog support as of version 1.0, blurring the lines between business and technical catalogs.
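
What lets different engines talk to the same catalog is the Iceberg REST Catalog API, which several of the projects above expose. As a rough illustration, the Python sketch below builds (but does not send) the request an engine would issue to load a table and ask for vended storage credentials. The route shape and the X-Iceberg-Access-Delegation header follow the Iceberg REST OpenAPI specification as we understand it; the base URL, prefix, namespace, and table name are made-up placeholders, and load_table_request is a hypothetical helper, not part of any client library.

```python
from dataclasses import dataclass, field
from urllib.parse import quote

# By default, multi-level namespaces are joined with the ASCII unit
# separator (0x1F) before URL-encoding, per the Iceberg REST spec.
NAMESPACE_SEPARATOR = "\x1f"


@dataclass
class RestRequest:
    """A request description; nothing is sent over the network."""
    method: str
    url: str
    headers: dict = field(default_factory=dict)


def load_table_request(base_url, prefix, namespace, table, vend_credentials=True):
    """Build a 'load table' request for an Iceberg REST catalog (hypothetical helper)."""
    ns = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    path = f"/v1/{quote(prefix, safe='')}/namespaces/{ns}/tables/{quote(table, safe='')}"
    headers = {"Accept": "application/json"}
    if vend_credentials:
        # Ask the catalog to return short-lived storage credentials
        # together with the table metadata.
        headers["X-Iceberg-Access-Delegation"] = "vended-credentials"
    return RestRequest("GET", base_url.rstrip("/") + path, headers)


# Placeholder endpoint and names, for illustration only.
req = load_table_request("http://localhost:8181/catalog", "demo", ["sales", "eu"], "orders")
print(req.method, req.url)
print(req.headers)
```

Because every compliant catalog answers the same routes, swapping one catalog for another is, at least in principle, a matter of changing the base URL and credentials rather than the engine code.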

The Great Divide: Technical vs Business Data Catalogs

There is a clear segmentation in the market between different organizational needs. Technical/Operational catalogs (like Lakekeeper, Gravitino, Polaris) focus on the mechanics of data access: security (RBAC, ABAC), credential vending, table maintenance, and multi-engine support. Business catalogs (like DataHub, OpenMetadata) prioritize data discovery, lineage, data quality, and data contracts for data mesh architectures.

The market leaders increasingly try to blend both, while the open-source world often requires combining a technical catalog (e.g., Lakekeeper) with a business catalog (e.g., DataHub) for a complete solution.
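
To make the RBAC and ABAC models mentioned above more concrete, here is a minimal Python sketch of how the two styles of access check differ. It does not reflect how any catalog named in this post implements authorization; the roles, privileges, tags, and user attributes are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)


@dataclass
class Table:
    name: str  # dotted path, e.g. "sales.eu.orders"
    tags: dict = field(default_factory=dict)


# RBAC: roles carry (privilege, scope) grants; a grant on a parent
# scope also covers the tables below it. Hypothetical grants.
ROLE_GRANTS = {
    "analyst": {("SELECT", "sales")},
    "engineer": {("SELECT", "sales"), ("MODIFY", "sales.eu")},
}


def rbac_allows(user, privilege, table):
    parts = table.name.split(".")
    # "sales.eu.orders" is covered by "sales", "sales.eu", or itself.
    scopes = {".".join(parts[:i]) for i in range(1, len(parts) + 1)}
    return any(
        priv == privilege and scope in scopes
        for role in user.roles
        for priv, scope in ROLE_GRANTS.get(role, set())
    )


def abac_allows(user, table):
    # ABAC: the decision compares user attributes with data tags.
    if table.tags.get("pii") and not user.attributes.get("pii_cleared"):
        return False
    return table.tags.get("region") in (None, user.attributes.get("region"))


orders = Table("sales.eu.orders", tags={"region": "eu", "pii": True})
alice = User("alice", roles={"analyst"}, attributes={"region": "eu", "pii_cleared": True})
bob = User("bob", roles={"engineer"}, attributes={"region": "us"})

print(rbac_allows(alice, "SELECT", orders), rbac_allows(alice, "MODIFY", orders))
print(abac_allows(alice, orders), abac_allows(bob, orders))
```

In the RBAC check the answer depends only on which roles a user holds, while in the ABAC check it depends on how the user's attributes line up with the tags on the data, which is why ABAC policies tend to scale better as the number of tables grows.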

Evaluation: Strengths and Glaring Gaps

When comparing the challengers to leaders like Unity Catalog, a clear pattern emerges. The open-source contenders excel at the basics: Iceberg REST API support, core access control, and credential vending. They are modular, avoid vendor lock-in, and evolve quickly through community input.

However, they consistently fall short on enterprise-grade production features. Critical capabilities like detailed access auditing, automated user/group synchronization via SCIM (System for Cross-domain Identity Management), configuration-as-code, built-in table maintenance (compaction, vacuuming), and robust data sharing mechanisms are often absent, incomplete, or require significant custom development. High availability and operational maturity are further concerns, and the burden of addressing them falls on the implementor.
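
One pragmatic way to use this list is as an evaluation checklist. The Python sketch below is only an illustration: the capability names are taken from the paragraphs above, while the candidate catalog and its feature statuses are hypothetical placeholders, not findings about any real project.

```python
from enum import Enum


class Status(Enum):
    PRESENT = "present"
    PARTIAL = "partial"
    ABSENT = "absent"


# Capabilities named in this post. The split into basics and
# enterprise features is one possible reading, not a standard.
BASICS = ["iceberg_rest_api", "access_control", "credential_vending"]
ENTERPRISE = [
    "access_auditing", "user_group_sync", "config_as_code",
    "table_maintenance", "data_sharing", "high_availability",
]


def gaps(catalog_features, required):
    """Return the required capabilities that are not fully present."""
    return [
        cap for cap in required
        if catalog_features.get(cap, Status.ABSENT) is not Status.PRESENT
    ]


# Hypothetical candidate: replace these placeholders with your own findings.
candidate = {
    "iceberg_rest_api": Status.PRESENT,
    "access_control": Status.PRESENT,
    "credential_vending": Status.PRESENT,
    "access_auditing": Status.PARTIAL,
    "config_as_code": Status.ABSENT,
}

print("basic gaps:", gaps(candidate, BASICS))
print("enterprise gaps:", gaps(candidate, ENTERPRISE))
```

Every item that such a check reports as a gap is work the implementing team has to plan for, whether by integrating another component or by building the feature in-house.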

The Verdict: the Future Looks Hybrid

The battle is not a simple zero-sum game; the future looks both hybrid and pragmatic.

For organizations all-in on a platform like Databricks or Snowflake, using the platform's native catalog is the most powerful and seamless choice. These platforms' growing support for open standards (Iceberg REST API, OpenLineage) mitigates some lock-in fears.

For teams building an open, multi-cloud lakehouse, the open-source challengers (particularly Apache Gravitino and Lakekeeper) are compelling, despite their rough edges. They offer freedom and flexibility but demand a higher operational investment and a willingness to integrate components and build missing features.

The "open by design" ethos seems to be the winning strategy. Even proprietary leaders are being forced to adopt open table formats and APIs. True openness, however, remains a combination of open-source licensing and adherence to open standards.

Ultimately, the choice between market leaders and challengers hinges on an essential trade-off: out-of-the-box sophistication and support versus architectural freedom and control. As the open-source projects mature and the giants continue to embrace openness, the lines will keep blurring, driving innovation that benefits everyone building the next generation of data architectures.

The competition is heating up, and the real winners will be the data engineers and architects who now have more powerful, flexible tools than ever before. Learn more about the Open Data Lakehouse concept, and stay tuned for the next blog post about data catalogs.

Watch our webinar about Data Catalogs: Clash of The Data Catalogs - Market Leaders vs. Challengers.
