From Parquet to Iceberg: How the Lakehouse Storage Layer Powers Modern Data Platforms

Marek Wiewiórka

March 9, 2026

5 minutes

When we think about the shift from traditional data warehouses to modern data lakehouses, the storage layer often plays a starring role. It’s where raw data becomes a strategic asset: structured, optimized, and ready for action. In a recent deep-dive seminar, we explored the key technologies and strategies that make today’s lakehouse storage fast, flexible, and open. The webinar covered the foundations and future of big data storage and how it is shaping data architecture.

Watch the on-demand webinar: Data Lakehouse Storage Layer - Openness, Interoperability and Performance.

And if you want to start with the general concept of the data lakehouse, check out the first post in this series: Data Lakehouse Explained: The Architecture Powering Modern Data & AI Platforms.

It’s All About the File Formats

At the foundation of any data system is the file format. If you have previously worked with CSV or JSON files, you’re familiar with simple, row-oriented storage. But for massive datasets, these can quickly become slow and inefficient. This is where columnar formats like Apache Parquet come in. Instead of storing data row by row, Parquet organizes it by column. This means that if your query only needs a few columns, for example “average sales per region”, the system loads only those columns and skips everything else, dramatically boosting performance.
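To make the row-versus-column idea concrete, here is a minimal, standard-library-only sketch. It is a toy model, not how Parquet is actually implemented (real files add encoding, compression, and row groups), and the sample records and function names are made up for illustration.

```python
# Toy illustration only: hypothetical sales records in row-oriented form.
rows = [
    {"order_id": 1, "region": "EU", "sales": 120.0, "customer": "a"},
    {"order_id": 2, "region": "US", "sales": 80.0, "customer": "b"},
    {"order_id": 3, "region": "EU", "sales": 60.0, "customer": "c"},
    {"order_id": 4, "region": "US", "sales": 110.0, "customer": "d"},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def average_sales_per_region(columns):
    """Read only the 'region' and 'sales' columns; the others are never touched."""
    totals, counts = {}, {}
    for region, sales in zip(columns["region"], columns["sales"]):
        totals[region] = totals.get(region, 0.0) + sales
        counts[region] = counts.get(region, 0) + 1
    return {region: totals[region] / counts[region] for region in totals}


columns = to_columns(rows)
averages = average_sales_per_region(columns)
print(averages)
```

In the columnar layout, the query never looks at `order_id` or `customer`. On disk, that translates into bytes that do not have to be read at all.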

But it’s not just about speed. Formats like Parquet and Avro are each designed for specific use cases. Parquet excels in analytical workloads, while Avro is tailored for streaming and message-based systems, like those using Apache Kafka. Then there’s Lance, a newer format optimized for AI and machine learning workloads, with built-in support for images, audio, and vector data, perfect for the era of generative AI and semantic search.

The Magic of Metadata and Open Table Formats

Raw files alone aren’t enough for a production-ready data platform. Traditional data warehouses offered features like ACID compliance (ensuring reliable transactions) and easy metadata management, which were lost when moving to simple file storage. This gap drove the rise of open table formats like Delta Lake, Apache Hudi, and Apache Iceberg.

Think of these as a “management layer” on top of your files. They add a transaction log - a record of changes, file locations, and table statistics - without locking you into a vendor. This brings back a few essential features:

Time travel: Query data as it looked at a past point in time.

ACID guarantees: Reliable writes and consistency, even with multiple users.

Efficient metadata access: No need to scan thousands of files to understand a table’s structure.

These formats keep data in open file types like Parquet, ensuring you’re never trapped in a proprietary system. They represent a major step toward true interoperability in the data world.
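The transaction log idea can be sketched in a few lines of plain Python. This is a simplified model under our own assumptions, not the actual metadata layout of Delta Lake, Hudi, or Iceberg, which differ in how they store this information. The class name `ToyTableLog` and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTableLog:
    """Append-only log of commits; each commit adds and/or removes data files."""

    commits: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Record one atomic change and return its version number."""
        self.commits.append({"add": tuple(add), "remove": tuple(remove)})
        return len(self.commits) - 1

    def files_at(self, version=None):
        """Replay the log up to `version` (inclusive) to find the live data files."""
        if version is None:
            version = len(self.commits) - 1
        live = set()
        for entry in self.commits[: version + 1]:
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return sorted(live)


log = ToyTableLog()
v0 = log.commit(add=["part-000.parquet"])
v1 = log.commit(add=["part-001.parquet"])
# A compaction swaps two small files for one larger file in a single commit.
v2 = log.commit(
    add=["part-002.parquet"],
    remove=["part-000.parquet", "part-001.parquet"],
)

print(log.files_at())    # current snapshot
print(log.files_at(v1))  # "time travel" to an earlier version
```

Readers only ever see the files listed by a complete commit, which is the intuition behind both consistent reads and time travel: old files stay in storage until they are cleaned up, and the log decides which ones belong to a given version.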

Fine-Tuning for the Best Performance

Choosing the right format is just the start. To get the most out of a lakehouse, you need to organize your data intelligently. Here are four key techniques:

Partitioning:
Think of it like organizing a bookshelf by genre. By splitting data into folders based on columns like date or country, queries can skip irrelevant chunks. But be careful: over-partitioning can create too many small files, hurting performance.

Z-Ordering:
A smarter way to organize data within files. Instead of just grouping by one column, Z-Ordering arranges data to keep related values close together across multiple columns. This improves filtering efficiency, especially for multi-column queries.

Liquid Clustering (in Delta Lake):
An advanced method that uses incremental clustering - unlike Z-Ordering, which requires full data rewrites - and automatically balances row count and file size. It reorganizes data to match real-world storage and access patterns, leading to more balanced performance.

Deletion Vectors:
They mark rows as deleted without immediately rewriting files, improving update efficiency.
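To make the Z-Ordering idea above more tangible, here is a small sketch based on the textbook Morton curve, which interleaves the bits of two column values into one sort key. It is an illustration under our own assumptions (a 4x4 grid of integer pairs, four rows per "file"), not the implementation any particular engine uses.

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-order) key: bit i of x goes to position 2i, bit i of y to 2i + 1."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def chunk(rows, size):
    """Split rows into consecutive 'files' of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def files_matching(files, column, value):
    """Count files whose min/max range on `column` could contain `value`."""
    hits = 0
    for rows in files:
        values = [row[column] for row in rows]
        if min(values) <= value <= max(values):
            hits += 1
    return hits


# Hypothetical two-column table: every (x, y) pair on a 4x4 grid.
points = [(x, y) for x in range(4) for y in range(4)]

linear_files = chunk(sorted(points), 4)  # sorted by x, then y
z_files = chunk(sorted(points, key=lambda p: interleave_bits(p[0], p[1])), 4)

# Files that must be read for the filters x == 3 and y == 3.
print(files_matching(linear_files, 0, 3), files_matching(linear_files, 1, 3))
print(files_matching(z_files, 0, 3), files_matching(z_files, 1, 3))
```

With a plain sort on the first column, a filter on `x` touches one file but a filter on `y` touches all four. With the Z-order key, either filter touches two files. That trade-off, slightly worse for the leading column but much better for the others, is why Z-Ordering tends to help multi-column queries.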

The seminar made it clear: there’s no one-size-fits-all solution. The best approach depends entirely on your data, query patterns, and workload.

Real-World Impact: Demo Recap

To illustrate these concepts, we ran a live demo using New York City taxi trip data. We compared:

Unsorted data (baseline)

Sorted data

Z-Ordered data

Across different file and row group sizes, the results were revealing. While Z-Ordering did improve performance, in some cases simple sorting with optimized file sizing performed even better. The key takeaway? Testing is essential. Small adjustments, such as reducing the Parquet row group size from 128 MB to 4 MB, can dramatically cut query times and reduce the amount of data scanned, because smaller row groups let the engine skip more records that cannot match the query’s predicates.
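The row group effect can be reproduced in miniature. The sketch below models row groups as chunks of a sorted column with min/max statistics, and counts how many rows a reader has to scan for a range predicate. It is not the demo code from the seminar: group sizes here are row counts rather than bytes, and the data is a made-up sequence of integers.

```python
def row_groups(values, group_size):
    """Split values into row groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), group_size):
        part = values[start:start + group_size]
        groups.append({"min": min(part), "max": max(part), "rows": len(part)})
    return groups


def rows_scanned(groups, low, high):
    """Rows in groups whose [min, max] range overlaps the predicate low <= v <= high."""
    return sum(g["rows"] for g in groups if g["max"] >= low and g["min"] <= high)


# Hypothetical sorted column of 1,000 values; the predicate selects 40 of them.
values = list(range(1000))
large = row_groups(values, group_size=500)
small = row_groups(values, group_size=50)

print(rows_scanned(large, 100, 139))  # the whole 500-row group must be read
print(rows_scanned(small, 100, 139))  # only one 50-row group overlaps
```

The same 40 matching rows cost 500 rows of scanning with large groups and 50 with small ones. The benefit depends on the data being sorted or clustered on the filtered column, and very small row groups add metadata overhead of their own, which is another reason to measure rather than guess.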

Looking Ahead: Openness and Interoperability

One of the most exciting trends is the push toward interoperability between table formats. Projects like Delta Lake UniForm and Apache XTable aim to let you write data in one format (like Delta Lake) and read it as another (like Iceberg). This reduces lock-in and helps build a more connected data ecosystem.

The storage layer today is no longer a passive layer; it is an active, intelligent part of the data stack. By leveraging open formats, smart metadata, and thoughtful optimization, organizations can build systems that are not only powerful and scalable, but also flexible and future-proof.

Whether you’re just starting your lakehouse journey or optimizing an existing platform, the message is clear: investing in your storage layer pays dividends. It’s where data turns into insight, and where openness paves the way for innovation.

Learn more about the Open Data Lakehouse concept, and stay tuned for the next blog post about data catalogs.
