Let’s get one thing straight: we, analytics engineers, love our jobs and solving problems with clients, but some questions we hear day in and day out are just plain exhausting. To save ourselves from answering the same thing five times a day, here’s a handy blog series that answers these questions thoroughly and to which we can start referring instead.
Bo Lemmers, Analytics Engineer here at Xebia, and Mike Kamysz, Data Engineer at The Data Institute, kick off the series with: “Why can’t I just query the raw data?” Oh, sweet summer child. We get it: you’re eager, you’re curious, and you want your answers now. But diving headfirst into raw data without a plan is like trying to find a needle in a haystack, blindfolded, in a blizzard. So, let’s talk about why this is a bad idea and why we have shiny, structured data models to save us from chaos.
The Appeal of Querying Raw Data: What People Get Right
Before we dive into the big no-no of querying raw data, let’s be fair. We get why people ask this question. It seems like a good idea. You’re sitting on a goldmine of data—why not access it directly, in its rawest form? Here’s the rationale behind the request:
- Granularity: Raw data is the most detailed version of your dataset, capturing every event, transaction, and interaction in its purest form. If you want a full, unfiltered view of the data—whether it’s transactional data, user clicks, or sensor logs—raw data is the ultimate source of detail.
- Flexibility: There’s no predefined structure restricting you. When you query raw data, it’s like having a blank canvas. You can create your own metrics, dimensions, and transformations on the fly. No predefined logic or business rules to get in the way—just you, the data, and your SQL editor.
- Speed to Insight: When you’ve got a burning question, waiting for the data team to add a new column or table to the warehouse can feel like an eternity. Why not cut out the middleman? Querying raw data gives the impression of an immediate path to insights.
The Hidden Downsides of Querying Raw Data
Now, let’s talk about why querying raw data isn’t all it’s cracked up to be. On the surface, it seems empowering, but behind the scenes, it creates all kinds of headaches—headaches that your friendly analytics engineers (us!) have to deal with. Here’s why:
Data Quality: Garbage In, Garbage Out
Raw data is unprocessed. It’s exactly what it sounds like—raw, uncooked, unrefined, full of noise, errors, and missing values. Before any serious analysis can be done, raw data typically needs to go through data cleaning and transformation processes. Here’s the reality:
- Inconsistent Formats: Dates in different formats, varying units (e.g., pounds vs. kilograms), and inconsistent categorizations (e.g., “NL” vs. “Netherlands”, or “NL” vs. “nl”) are just a few of the format issues that crop up in raw data. Every time you query raw data, you risk misinterpreting or misaggregating your results.
- Duplicate or Missing Records: Raw datasets often have duplicate entries or missing values. Without a proper process to clean and validate the data, you’ll spend most of your time handling these issues manually, leading to incomplete or false analyses.
When different teams query raw data without standardisation, they often handle these inconsistencies differently and reach contradictory results: one team might treat missing values as zero, another might ignore them entirely, and a third might carry the last known value forward, each building a different picture of the same data.
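To make that concrete, here’s a minimal sketch of the cleanup a single raw-data query might need, assuming a hypothetical raw_orders table with mixed date formats, imperial weights, and messy country labels (try_to_date and qualify are Snowflake syntax; other warehouses have equivalents):

```sql
-- Cleaning and deduplicating a hypothetical raw_orders table.
with cleaned as (
    select
        order_id,
        -- Normalise inconsistent country labels to one code
        case
            when upper(trim(country)) in ('NL', 'NETHERLANDS') then 'NL'
            else upper(trim(country))
        end as country_code,
        -- Convert pounds to kilograms so units line up
        weight_lbs * 0.45359237 as weight_kg,
        -- Parse dates that arrive in mixed formats
        coalesce(
            try_to_date(order_date, 'YYYY-MM-DD'),
            try_to_date(order_date, 'DD/MM/YYYY')
        ) as order_date_parsed
    from raw_orders
)

select *
from cleaned
-- Keep only the most recent row per order_id to drop duplicates
qualify row_number() over (
    partition by order_id
    order by order_date_parsed desc
) = 1;
```

Every analyst querying the raw table has to rediscover and re-implement rules like these, and missing just one makes their numbers silently drift away from everyone else’s.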
Performance and Cost: The Hidden Burden
While querying raw data seems faster, the hidden costs in terms of both performance and expense are significant, especially in cloud-based environments like Snowflake or BigQuery. Here’s why querying the raw data can be a performance and cost nightmare:
- Large, Unoptimised Tables: Raw datasets tend to be big. When you query unprocessed data, you often force the database to scan through billions of rows, many of which are irrelevant to your actual analysis.
- Expensive Cloud Costs: In cloud data warehouses, every query has a cost. Since raw data is rarely optimised for efficient querying, each query can consume far more compute than necessary. Aggregating raw data on the fly, over long periods, is resource-intensive, and those wasted cycles make your query bills skyrocket over time.
- Repeated Transformations: Because raw data lacks pre-built logic or calculations, every query requires you to re-execute complex transformations like joins, filters, and aggregations. Not only does this make each query slower and more expensive, but it also wastes valuable compute resources repeating the same operations over and over again.
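As an illustration, consider a monthly sales figure computed straight from a hypothetical raw_transactions table. Nothing stores the result, so this scan re-runs in full every time anyone needs the number:

```sql
-- Re-executed from scratch on every run, scanning the full raw table
-- each time. raw_transactions is a hypothetical table name.
select
    date_trunc('month', transaction_timestamp) as month,
    sum(amount) as monthly_sales
from raw_transactions
where status = 'completed'
group by 1;
```

In a modelled warehouse, that aggregation is computed once on a schedule, and every downstream query reads the small, pre-aggregated result instead.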
Reinventing the Wheel: Duplicating Efforts, Inconsistent Results
When multiple teams or individuals query raw data, each is effectively reinventing the wheel. This leads to duplicated efforts and inconsistent business logic across the organisation. Without a shared data model, each team has to build its own transformations, metrics, and calculations, which causes a number of issues:
- Duplicate Work: Every team ends up creating their own transformations to clean and aggregate raw data. For example, one analyst might spend hours creating a transformation to calculate monthly sales, only for another team to do the exact same thing independently. This duplication of effort wastes time and resources across the organisation.
- Inconsistent Business Logic: Without standardised metrics, different teams will often define key business metrics (e.g., “monthly active users” or “churn rate”) in slightly different ways. This leads to confusion and misaligned decision-making, as different reports tell conflicting stories about the same data.
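Here’s how that plays out in practice: two plausible definitions of “monthly active users”, each defensible on its own, that will never produce the same numbers (table and column names are hypothetical):

```sql
-- Team A: any event counts as activity.
select
    date_trunc('month', event_timestamp) as month,
    count(distinct user_id) as monthly_active_users
from raw_events
group by 1;

-- Team B: only logged-in sessions longer than a minute count.
select
    date_trunc('month', session_start) as month,
    count(distinct user_id) as monthly_active_users
from raw_sessions
where is_logged_in
  and session_duration_seconds > 60
group by 1;
```

Both queries are “correct”; the organisation just ends up with two different truths.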
The Case for a Well-Designed Data Warehouse
So, if querying raw data is such a bad idea, what’s the alternative? Enter the well-designed data warehouse, whether it follows Kimball, Data Vault, or another proven methodology. This is where raw data gets transformed into something clean, structured, and optimised for efficient querying. Rather than everyone reinventing the wheel, a data warehouse brings order to the chaos. Here’s why this approach is not just useful, but essential for anyone serious about analytics.
Single Source of Truth
A common data model ensures that there’s a single, standardised version of the truth. Metrics, key performance indicators (KPIs), and business definitions are all pre-defined and consistent across the organisation. No more arguing over whose version of the monthly revenue report is correct—everyone’s working off the same data.
- Predefined Metrics: In a Kimball-style model, important metrics (e.g., revenue, customer counts) are defined and calculated once. This eliminates the risk of contradicting results and ensures that everyone is using the same definitions.
- Centralised Data Logic: All the complex business logic—like how we define “active users” or “churn”—is encoded into the data model. This means analysts don’t have to recreate these calculations every time they query the data.
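For example, the agreed definition might live in a single view (or a dbt model) that everyone queries. The names below are illustrative, not a prescribed schema:

```sql
-- The one place where "monthly active users" is defined.
create or replace view analytics.monthly_active_users as
select
    date_trunc('month', event_timestamp) as month,
    count(distinct user_id) as monthly_active_users
from analytics.fct_events
where event_type in ('login', 'purchase', 'page_view')
group by 1;
```

Analysts then simply run `select * from analytics.monthly_active_users` and, by construction, get the same answer as everyone else.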
Pre-Aggregated and Optimised for Performance
In a data warehouse, raw data is transformed into tables that are optimised for querying. For instance, in a Kimball-style data warehouse, fact tables store transactional data (e.g., sales transactions), while dimension tables store descriptive attributes (e.g., customer demographics). This design supports fast, efficient queries.
- Pre-Aggregation: Data that’s commonly used in aggregate (e.g., daily or monthly sales) can be pre-aggregated, meaning analysts don’t have to compute sums or averages over billions of rows every time they run a query.
- Optimised for Queries: Dimension and fact tables are designed to minimise expensive table scans. Queries that would take hours on raw data can be completed in seconds.
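A typical star-schema query then looks something like this sketch (illustrative table names): a slim fact table joined to a small dimension, instead of a scan over wide raw logs:

```sql
-- Sales per customer segment from a star schema: the fact table carries
-- the numeric measures, the dimension carries descriptive attributes.
select
    d.customer_segment,
    sum(f.sales_amount) as total_sales
from fct_sales f
join dim_customer d
    on f.customer_key = d.customer_key
where f.order_date >= date '2024-01-01'
group by d.customer_segment;
```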
Data Quality Assurance
A foundational data model ensures data quality by acting as a gatekeeper for clean, well-structured data. Ideally, it is covered by automated tests that check both code quality and data accuracy. These tests help catch errors, ensure consistency, and maintain reliability, making sure the model performs well and the data remains trustworthy.
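What do such tests look like? One common pattern is assertion queries that must return zero rows, similar to what tools like dbt generate from declarative configs. A sketch with illustrative table names:

```sql
-- Uniqueness: no duplicate order ids may exist in the fact table.
select order_id, count(*) as occurrences
from fct_orders
group by order_id
having count(*) > 1;

-- Referential integrity: every order must point at a known customer.
select f.order_id
from fct_orders f
left join dim_customer d
    on f.customer_key = d.customer_key
where d.customer_key is null;
```

If either query returns rows, the pipeline fails before bad data ever reaches a dashboard.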
Conclusion
While there are valid cases for querying raw data (think data science, validation purposes, or other specific needs), for analytics it often leads to more problems than it solves: data quality issues, inconsistent metrics, performance bottlenecks, and unnecessary costs. The solution? A well-structured, governed data model that centralises business logic, ensures data consistency, and optimises performance. Trust us: once you see the benefits of a curated data model, you won’t want to go back to raw data.
Are you part of an organisation looking into implementing best practices around data modeling? Our analytics engineer consultants are here to help – just contact us and we’ll get back to you soon. Or are you an analyst, analytics engineer or data engineer interested in learning more about data modeling? Check out our Data Warehousing and Data Modeling course from Xebia Academy or have a look at our job openings.
Photo by Cristi Ursea on Unsplash