
Data Lakehouse Explained: The Architecture Powering Modern Data & AI Platforms

Marek Wiewiórka

March 9, 2026
7 minutes

In the evolving data landscape, a new paradigm has emerged that sets out to bridge the well-known gap between the structured, reliable environment of the data warehouse and the flexible, scalable storage of the data lake. This architecture is known as the Data Lakehouse. But what exactly is it, and why is it becoming the go-to architecture for modern data platforms?

In this blog post, we will explore its origins, core principles, and transformative impact. This post was inspired by the webinar “Data Lakehouse (DL) - Is it the Holy Grail We Have Been Looking For?”, the first of a series of webinars ("Towards Data Lakehouse Architecture") on the topic. Find more information about it here.  

Bridging Two Worlds 

The term was first coined in 2017, in Jeremy Engle’s slides from a Redshift/Big Data meetup, though it wasn’t yet a formally defined concept. Initially, it was more an aspirational idea than an actual architecture: a vision in which organizations could efficiently handle diverse data types without maintaining separate systems. At its heart was the desire to combine the best features of data warehouses with the scalability and flexibility of data lakes, which excel at storing vast amounts of raw, unstructured data.

This vision was finally formalized in a 2021 research paper titled “Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics,” co-authored by researchers from UC Berkeley and engineers from Databricks. The manuscript laid out a clear set of requirements for what a Lakehouse should be, marking a pivotal shift from a vague concept to a concrete architectural blueprint. 

The Three Core Building Blocks 

A Data Lakehouse architecture is built on three foundational layers, each evolving from its predecessors to create a more integrated and open system. 

The Storage layer: open, unified, and transactional 

At its base, the Lakehouse leverages cost-effective, scalable object storage from cloud providers such as AWS, Google Cloud, or Microsoft Azure. On top of this, it introduces an open table format layer, such as Apache Iceberg, Delta Lake, or Apache Hudi. These formats add crucial data warehousing capabilities directly onto file-based storage:

  • ACID compliance: Ensures reliable transactions, consistency, and data integrity through transaction isolation. 
  • Time travel and versioning: Allows querying of data at historical points in time. 
  • Schema enforcement and evolution: Supports both schema-on-write and schema-on-read approaches. 

This combination means data is stored in open, standardized file formats (like Parquet or ORC), but managed with the governance and reliability once exclusive to proprietary data warehouses.  

Find out more about Xebia’s webinar, which also includes a live demo. 

The Compute Layer: Flexible and Decoupled 

In a Lakehouse, compute resources are fully decoupled from storage. This means you can independently scale your processing power up or down based on demand, without moving your data. A diverse set of processing engines can operate on the same data simultaneously: 

  • Traditional distributed engines: Apache Spark, Apache Flink, Trino (formerly PrestoSQL). 
  • Emerging single-node engines: DuckDB, Polars, and Daft leverage out-of-core processing to handle datasets larger than memory, offering impressive performance on a single machine and reducing overhead for small and medium-scale workloads. These engines now offer maturing Iceberg support: DuckDB added full read/write capabilities in 2025, Polars integrates via PyIceberg, and Daft supports Unity Catalog. While integration depth varies, they are increasingly viable for lakehouse workloads. 
  • Optimized and specialized engines: Projects such as Apache DataFusion and Apache Arrow provide building blocks like query optimizers and execution engines. With them, you can create your own domain-specific query engine or plug an existing engine into the data ecosystem. 
  • Power of streaming: The Streamhouse architecture, recently introduced, combines Apache Flink for stream processing with Apache Paimon as the streaming storage layer. This approach enables real-time analytics with high-speed data ingestion, change data capture (CDC), and seamless data catalog integrations. For the analytical query layer, StarRocks is one solution worth considering. 

This flexibility allows organizations to match the tool to the task, optimizing both cost and performance, and even to mix open-source and commercial tools. Large companies such as Meta are also investing in this space, since running queries at scale on open infrastructure can cut costs significantly. 
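The decoupling described above can be sketched in a few lines of Python. In this toy example (JSON part-files stand in for Parquet on object storage; the function names are invented for illustration), two independent "engines" scan the same immutable files directly, without copying or moving data between systems:

```python
import json
import tempfile
from pathlib import Path

# "Storage": immutable part-files in an open format (JSON standing in
# for Parquet), written once and readable by any engine.
storage = Path(tempfile.mkdtemp())
(storage / "part-0.json").write_text(json.dumps([{"amount": 10}, {"amount": 5}]))
(storage / "part-1.json").write_text(json.dumps([{"amount": 7}]))

def scan(root: Path):
    """Every engine scans the shared files in place; no data is moved."""
    for part in sorted(root.glob("part-*.json")):
        yield from json.loads(part.read_text())

# Two independent "compute engines" over the same storage:
def bi_engine_total(root: Path) -> int:
    """Stands in for a SQL aggregate run by a BI engine."""
    return sum(row["amount"] for row in scan(root))

def ml_engine_features(root: Path) -> list[int]:
    """Stands in for feature extraction run by an ML engine."""
    return [row["amount"] * 2 for row in scan(root)]

print(bi_engine_total(storage))     # 22
print(ml_engine_features(storage))  # [20, 10, 14]
```

Each "engine" can be scaled, replaced, or shut down independently of the files it reads, which is exactly the property that lets Spark, Trino, and DuckDB coexist on one lakehouse.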

Learn more in the webinar from the Towards Data Lakehouse Architecture series related to the compute layer.

The Metadata Layer: The Brain of the Lakehouse 

Perhaps the most significant evolution is in the metadata and catalog layer. In early data lakes, metadata was often an afterthought, leading to challenges in data discovery, governance, and lineage. Without a unified data catalog, features like centralized access control, schema management, and cross-engine interoperability remained out of reach. 

This changed significantly in mid-2024 with the open-sourcing of key catalog technologies — notably Databricks' Unity Catalog and the adoption of the Iceberg REST Catalog specification. These developments enabled a sophisticated, centralized data catalog that serves as a system of record for all data assets. Third-party tools and custom engines can now securely interact with governed data through open APIs, breaking down vendor lock-in and enabling truly composable architectures. No matter the client used, data can always be read and written efficiently. 
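The catalog's role as a system of record can be sketched with a minimal in-memory example. This is not the Unity Catalog or Iceberg REST Catalog API; all class and method names below are invented for illustration. It only shows the pattern those catalogs implement: mapping namespaced table names to metadata locations and gating access before any engine is handed a pointer to the data.

```python
class ToyCatalog:
    """Toy catalog sketch (illustrative; not any real catalog's API).

    Maps namespaced table names to metadata locations and enforces
    access control before an engine can resolve a table, which is the
    role a REST catalog plays for every client in a lakehouse.
    """

    def __init__(self):
        self._tables: dict[tuple[str, str], str] = {}
        self._grants: dict[str, set[tuple[str, str]]] = {}

    def register(self, namespace: str, table: str, metadata_location: str):
        """Record a table's metadata pointer in the system of record."""
        self._tables[(namespace, table)] = metadata_location

    def grant(self, principal: str, namespace: str, table: str):
        """Allow a principal to read a specific table."""
        self._grants.setdefault(principal, set()).add((namespace, table))

    def load_table(self, principal: str, namespace: str, table: str) -> str:
        """Return the metadata pointer an engine needs to plan a scan."""
        key = (namespace, table)
        if key not in self._grants.get(principal, set()):
            raise PermissionError(f"{principal} may not read {namespace}.{table}")
        return self._tables[key]

catalog = ToyCatalog()
catalog.register("sales", "orders", "s3://bucket/sales/orders/metadata.json")
catalog.grant("analyst", "sales", "orders")

print(catalog.load_table("analyst", "sales", "orders"))
```

Because every engine goes through the same `load_table` step, access policy is enforced once, centrally, regardless of whether the client is Spark, Trino, or a single-node engine.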

Watch our webinar: Clash of The Data Catalogs - Market Leaders vs. Challengers, where we highlight emerging open-source projects that are gaining momentum.

Key Advantages of the Lakehouse 

As a data management system based on low-cost and directly accessible storage, Lakehouses also provide traditional analytical DBMS management and performance features such as ACID transactions, data versioning, auditing, caching, and query optimization. They combine the key benefits of data lakes and data warehouses: low-cost storage in an open format accessible by a variety of systems from the former, and powerful management and optimization features from the latter. 

Here are the key advantages of a Data Lakehouse: 

  • Unification and Simplicity: a single platform replaces separate data lakes and warehouses, empowering all data users, from analysts to data scientists, and reducing complexity and data silos. 
  • Tailored to your needs: a data lakehouse is built around your data requirements while remaining extensible, and it fits naturally into your existing environment and workflows. 
  • Cost Efficiency and performance: decoupling storage from compute allows for independent, elastic scaling. Using open formats and optimized engines dramatically lowers costs while boosting query speeds. 
  • Openness and Flexibility: built on open standards, the Lakehouse avoids vendor lock-in. You can freely choose and combine components for storage, compute, and metadata to create a tailored, composable architecture. 
  • Advanced use case support: native support for a broad spectrum of modern workloads beyond BI, including Machine Learning, real-time streaming analytics, and AI, all under a centralized governance and security model. 

Is a Data Lakehouse Right for You? 

The Lakehouse is not a one-size-fits-all solution, but its applicability can be broad. 

Consider it if: 

  • Your data landscape includes a mix of structured and unstructured data. 
  • You need to support diverse workloads: BI, advanced analytics, ML, and real-time streaming. 
  • Cost control and avoiding vendor lock-in are key priorities. 
  • You have outgrown traditional warehouse limitations but need more governance than a raw data lake provides. 

For very small, simple BI-only use cases, a traditional warehouse might suffice. But for organizations with growing data, increasing complexity, and a need for agility, the Data Lakehouse presents a compelling, future-proof architecture that can be adapted to the needs of your organization. 

Is Open Data Lakehouse the Modern Holy Grail? 

The Data Lakehouse has evolved from a promising idea into a robust, production-ready architectural pattern. The recent push towards openness - spearheaded by open-sourced catalogs and standards - has transformed it into a truly flexible and powerful platform. It successfully unifies the scale and flexibility of data lakes with the reliability and performance of data warehouses, all while championing cost efficiency and openness. 

In essence, the Data Lakehouse represents the convergence between different worlds: a single, open platform that can confidently serve as the foundation for an organization’s entire data and AI strategy. It’s not just an incremental improvement; it’s a foundational shift towards a more integrated, efficient, and innovative data future. 
