Data Lakehouse Compute Layer: How to Choose the Right Engine

The rise of the data lakehouse architecture has changed how organizations store and process massive datasets. By merging the cost-effective, flexible storage of a data lake with the performance and management features of a data warehouse, the lakehouse promises the best of both worlds. But is that really so?
How should data teams choose the right compute engine for a data lakehouse?
Data teams should choose a data lakehouse compute engine based on workload type, latency needs, concurrency, cost model, and operational control. Serverless compute works well for elastic and variable workloads, customer-managed clusters fit high-control large-scale processing, and embedded engines are useful for local analytics, data science, and application-linked use cases.
Whether the lakehouse delivers on that promise depends on one central element that brings the architecture to life: the compute layer. Think of it as the engine that transforms static files in cloud storage into actionable insights. Choosing the right compute strategy is now one of the most critical decisions for data teams, directly impacting cost, performance, and agility. In a recent webinar, Xebia’s experts broke down the modern compute landscape, looking at established options and emerging technologies that are redefining the field.
Want to see how these compute patterns work in practice? Watch the on-demand webinar on data lakehouse compute, or start with the introductory session on the open data lakehouse concept.
The following sections answer the most important questions from the audience, using the concepts and live demonstrations from the session.
What data lakehouse compute options are available today?
Today’s data lakehouse ecosystem offers a diverse set of compute solutions, each designed for different workloads, operating models, and trade-offs.
Single-node compute
Single-node compute uses one powerful virtual machine: the classic “big machine” approach. It is often used with traditional MPP (Massively Parallel Processing) engines deployed in a single-node configuration for smaller workloads. While it can be simpler to manage conceptually, its scalability is limited by the CPU and memory ceiling of a single node.
Customer-managed distributed clusters
Customer-managed distributed clusters are built for large-scale data processing. Technologies such as Apache Spark, Trino, and Dremio can be deployed as clusters of ephemeral or long-lived nodes managed by the customer, or through managed services such as EMR, Dataproc, or EKS.
In this model, the data team controls provisioning, scaling, configuration, and maintenance. This offers maximum flexibility over engine versions, parameters, and security, but it also introduces significant operational overhead.
Serverless distributed compute
Serverless distributed compute abstracts away infrastructure management. Services such as Amazon Athena, BigQuery, Snowflake, and Databricks SQL Warehouses fall into this category. The user submits a query, and the service automatically provisions, scales, and tears down the compute resources required to run it.
This model can significantly reduce operational burden, and it changes the billing model: teams pay per query, per workload, or per volume of data scanned instead of paying for idle cluster time. It aligns infrastructure cost more closely with actual usage.
Embedded and application-linked compute
Embedded or application-linked compute is an emerging pattern built around lightweight query engines such as DuckDB or DataFusion. These are not standalone services, but libraries that run inside an application process, such as a Python script, Jupyter notebook, or microservice.
They can perform high-performance, in-process analysis on data sourced directly from the lake, including Parquet, Delta Lake, or Apache Iceberg files. This brings compute closer to where data is consumed, making it useful for data science, feature engineering, and middle-tier data applications.
How do latency, concurrency, cost, and control shape data lakehouse compute choices?
The choice between compute models does not simply come down to technical preference. It has practical implications for performance, scalability, cost management, and the day-to-day experience of the data team.
Latency and performance
Single-node and embedded engines, such as DuckDB, can offer very low latency for single-threaded or moderately parallel queries on subsets of data because they minimize infrastructure overhead. Serverless services prioritize on-demand availability and automated optimization, although they may introduce cold-start penalties. Customer-managed clusters can achieve high throughput for massive and complex ETL or ELT jobs, but they require expert tuning to reach that peak.
Concurrency and elasticity
Concurrency and elasticity are where serverless compute often shines. Services such as BigQuery or Snowflake can handle many concurrent queries without user intervention, scaling resources automatically. Customer-managed clusters require careful capacity planning and autoscaling configurations to handle concurrency spikes, which can lead to over-provisioning. Single-node approaches are usually the least resilient under high concurrency.
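The over-provisioning effect can be made concrete with a little arithmetic. The sketch below sizes a fixed cluster for the busiest hour and reports how much of that capacity is used on average. The hourly profile and the `queries_per_node` figure are invented for illustration, and the model ignores autoscaling entirely.

```python
import math


def fixed_cluster_utilization(concurrent_queries_by_hour, queries_per_node):
    """Size a fixed cluster for the peak hour and return (nodes, mean utilization)."""
    if not concurrent_queries_by_hour:
        raise ValueError("need at least one hourly sample")
    peak = max(concurrent_queries_by_hour)
    # Enough nodes to absorb the worst hour, never fewer than one.
    nodes = max(1, math.ceil(peak / queries_per_node))
    capacity = nodes * queries_per_node
    mean_load = sum(concurrent_queries_by_hour) / len(concurrent_queries_by_hour)
    return nodes, mean_load / capacity


# Hypothetical profile: quiet for most of the day, with one sharp spike.
profile = [2, 2, 2, 2, 4, 8, 40, 12, 4, 2, 2, 2]
nodes, utilization = fixed_cluster_utilization(profile, queries_per_node=5)
print(f"{nodes} nodes, {utilization:.0%} average utilization")
```

With this made-up profile, a cluster sized for the spike sits mostly idle, which is the gap that autoscaling or a serverless service is meant to close.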
Cost profile
Serverless compute typically follows a pay-per-query or usage-based model, making it a good fit for variable and unpredictable workloads. Customer-managed clusters incur cost while they are running, whether they are active or idle, which can favor predictable and constant workloads.
However, total cost of ownership should not only include infrastructure cost. For customer-managed clusters, it should also include the human cost of operational control: engineers tuning, securing, upgrading, and maintaining the platform.
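As a rough illustration of this trade-off, the sketch below compares a usage-based bill with the monthly cost of an always-on cluster, including engineering time. Every price, rate, and volume here is a placeholder chosen for the arithmetic, not a quote from any vendor, and real pricing models have more dimensions than this.

```python
def serverless_monthly_cost(tb_scanned, price_per_tb):
    """Usage-based bill: pay only for the data actually scanned."""
    return tb_scanned * price_per_tb


def cluster_monthly_cost(nodes, node_hour_price, hours_running,
                         ops_hours, engineer_hour_rate):
    """Always-on cluster: infrastructure plus the people who operate it."""
    infrastructure = nodes * node_hour_price * hours_running
    people = ops_hours * engineer_hour_rate
    return infrastructure + people


def break_even_tb(cluster_cost, price_per_tb):
    """Monthly scan volume at which both models cost the same."""
    return cluster_cost / price_per_tb


# Hypothetical inputs: 4 nodes running all month, 20 hours of upkeep.
cluster = cluster_monthly_cost(nodes=4, node_hour_price=0.5, hours_running=720,
                               ops_hours=20, engineer_hour_rate=80)
threshold = break_even_tb(cluster, price_per_tb=5.0)
print(f"cluster: {cluster:.0f} per month, break-even at {threshold:.0f} TB scanned")
```

Below the break-even volume the usage-based model is cheaper under these assumptions; above it, the steady cluster wins. Note how much of the cluster figure comes from operations time rather than infrastructure.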
Operational control and flexibility
Control is a double-edged sword. Customer-managed clusters offer maximum control over engine versions, networking, security plugins, and low-level configuration. This can be crucial for complex compliance requirements or highly specialized workloads.
Serverless compute trades some of that control for simplicity. Teams get a standardized, secure, and continuously updated service, but they must work within the service’s guardrails.
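Taken together, the four dimensions in this section can be read as a simple decision rule. The sketch below is one possible encoding of that reasoning, not a method from the webinar: the `Workload` fields, their names, and the order of the checks are all assumptions, and a real evaluation would weigh the factors rather than short-circuit on the first match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    runs_inside_application: bool  # notebook, script, or microservice
    fits_on_one_machine: bool      # data subset fits one node's CPU and memory
    demand_is_variable: bool       # spiky or unpredictable query volume
    needs_low_level_control: bool  # pinned versions, custom networking or security


def suggest_compute_model(w: Workload) -> str:
    """Return a starting point for discussion, not a final answer."""
    if w.runs_inside_application and w.fits_on_one_machine:
        return "embedded"
    if w.needs_low_level_control:
        return "customer-managed cluster"
    if w.demand_is_variable:
        return "serverless"
    if w.fits_on_one_machine:
        return "single-node"
    # Large, steady workloads keep a running cluster busy.
    return "customer-managed cluster"


notebook = Workload(True, True, False, False)
adhoc_bi = Workload(False, False, True, False)
nightly_etl = Workload(False, False, False, True)
for name, w in [("notebook", notebook), ("ad hoc BI", adhoc_bi), ("nightly ETL", nightly_etl)]:
    print(name, "->", suggest_compute_model(w))
```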
Which emerging data lakehouse compute technologies should teams watch?
The compute layer is not standing still. Several trends are pushing the boundaries of what data lakehouse architectures can support.
Native and vectorized backends for open formats
The rise of open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi has accelerated the development of native, vectorized query engines designed for these formats. Engines such as StarRocks, GreptimeDB, and Rockset bypass parts of the traditional JVM-based stack, such as Spark, to execute queries directly on columnar data using vectorized (SIMD) CPU instructions.
The goal is sub-second latency on lakehouse data, challenging the idea that only proprietary data warehouses can deliver OLAP-speed analytics. ClickHouse also fits this broader pattern as a high-performance OLAP engine increasingly used with lakehouse storage.
The growth of embeddable engines
DuckDB is the breakout example of embeddable compute, but it is part of a broader movement. These libraries turn every data application into a potential query endpoint. The pattern is powerful: pull filtered data from the cloud lake into a local process, then perform complex aggregations or joins at in-memory speed.
This can offload compute from central systems and give data scientists and application developers fast, programmatic access to lakehouse data. It also enables a hybrid compute model: distributed clusters handle heavy ETL, while downstream exploratory or application-specific workloads run through embedded engines.
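A minimal sketch of the embedded pattern with DuckDB's Python package might look like the following. The file glob and the `region`, `amount`, and `order_date` columns are hypothetical, the example reads local Parquet files rather than cloud storage (which would need extra configuration), and `duckdb` is imported lazily so the query builder can be used without the package installed.

```python
def build_query(parquet_glob: str) -> str:
    """Build an aggregate query over Parquet files; the date filter is bound later."""
    safe_glob = parquet_glob.replace("'", "''")  # escape quotes in the path literal
    return (
        "SELECT region, sum(amount) AS total_amount "
        f"FROM read_parquet('{safe_glob}') "
        "WHERE order_date >= CAST(? AS DATE) "
        "GROUP BY region "
        "ORDER BY total_amount DESC"
    )


def top_regions(parquet_glob: str, since: str):
    """Run the query in-process and return a list of (region, total_amount) rows."""
    import duckdb  # third-party package, assumed to be installed

    con = duckdb.connect()  # in-memory database living inside this process
    try:
        return con.execute(build_query(parquet_glob), [since]).fetchall()
    finally:
        con.close()


# Hypothetical usage, assuming Parquet files exist at this path:
# rows = top_regions("lake/orders/*.parquet", since="2024-01-01")
print(build_query("lake/orders/*.parquet"))
```

Because the filter and the column list are part of the query, the engine only needs to read the relevant columns and rows before aggregating, and there is no cluster or service to start first.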
Multi-engine patterns and polyglot compute
The era of a one-size-fits-all compute engine is ending. Forward-looking architectures increasingly use a polyglot compute strategy, matching each engine to the workload it serves best.
A single pipeline might use Spark for large-scale transformation and ingestion, Trino for interactive full-table SQL queries by analysts, and a native engine such as StarRocks for a low-latency dashboard. Tools such as lakeFS for Git-like data operations, together with workflow orchestrators such as Dagster or Prefect, are making this multi-engine coordination more manageable.
This reflects a broader shift: the compute layer is becoming a portfolio, not a single product.
Why data lakehouse compute strategy needs more than one engine
The compute layer of the data lakehouse is no longer an afterthought. It has become a strategic lever. The choice between serverless simplicity and managed-cluster control, or between a monolithic engine and a polyglot portfolio, helps define the platform’s cost profile, performance ceiling, and developer experience.
The trend is clear: abstraction is winning for broad usability through serverless models, while specialization is becoming the key both to extreme performance, through native vectorized engines, and to developer agility, through embeddable engines.
For most organizations, the winning strategy will be a thoughtful blend: serverless compute for elasticity and operational ease, customer-managed clusters for specific high-control workloads, and embeddable engines to push compute closer to applications and data consumers.
By understanding this dynamic landscape, data architects can design lakehouse systems that are powerful and cost-effective today, while remaining ready for the next wave of compute innovation. The goal is no longer to find one compute engine to rule them all, but to intelligently orchestrate a set of specialized engines, each playing its part in turning a data lake into a true engine of insight.
Learn more about the open data lakehouse concept, and follow the series for the next article on designing modern lakehouse architectures.
To sum up:
The data lakehouse compute layer determines how teams turn open-format storage into fast, reliable, and cost-effective analytics. Single-node, customer-managed, serverless, and embedded compute models each serve different workload patterns, from high-control ETL to elastic querying and application-linked analysis.
Choosing the right engine is therefore less about finding one universal solution and more about matching compute to latency, concurrency, cost, control, and developer experience. As native vectorized backends, embedded engines, and multi-engine architectures mature, the strongest data lakehouse strategies will combine several specialized engines instead of relying on a single platform for every workload.
Written by Marek Wiewiórka