From Data to Wisdom: Asking the Right Questions
Designing a data ingestion pipeline is an essential part of turning data into insight. This article steps back from tooling and architectures to examine data onboarding through the lens of organizational intelligence, using the DIKW pyramid as a guiding framework. You’ll learn which questions matter before data enters your platform, how early decisions affect everything downstream, and why getting the foundations right is the fastest way to reliable, scalable outcomes.
Fitting Data Within Organizational Intelligence Systems
A few months ago, I was asked to design a data ingestion solution. The request seemed technically straightforward, but I found myself wondering: would this solution truly enable the organization to leverage new data sources effectively and support informed decision-making? Without clear answers, I realized I needed to step back and understand where data fits within the broader landscape of organizational intelligence.
Understanding the Journey
To structure my approach and define a strategy that would support better decision-making, I turned to the DIKW (Data, Information, Knowledge, Wisdom) pyramid—a framework that maps the journey from raw data to actionable wisdom.

Adaptation of “From Data to Wisdom” – Russell L. Ackoff, 1989
At the foundation is Data: raw, unprocessed facts and figures. This is where information first enters the data platform—individual transactions from a database, sensor readings from IoT devices, API responses, or clickstream events from a website. In this state, data lacks context and meaning.
The next level is Information: data that has been processed, structured, and organized into meaningful form. When we aggregate transactions by customer segment or analyze sensor readings over time to identify patterns, we transform data into information. Information answers "What happened?" and "When did it happen?"
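To make the Data → Information step concrete, here is a minimal sketch in Python using pandas, with a few hypothetical transaction records; the column names and segments are illustrative only.

```python
import pandas as pd

# Data: raw, individual transactions with no context (hypothetical records)
transactions = pd.DataFrame([
    {"customer_segment": "retail", "amount": 42.50, "ts": "2024-01-05"},
    {"customer_segment": "retail", "amount": 19.99, "ts": "2024-01-06"},
    {"customer_segment": "enterprise", "amount": 1200.00, "ts": "2024-01-06"},
])

# Information: aggregated by segment, answering "what happened?"
revenue_by_segment = (
    transactions.groupby("customer_segment")["amount"]
    .agg(total_revenue="sum", order_count="count")
    .reset_index()
)
print(revenue_by_segment)
```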
Building on information comes Knowledge: the application of information within a specific context, often by combining multiple sources to understand relationships and underlying principles. Knowledge helps us answer "how" and "why" things happen—for example, identifying that certain sensor patterns consistently indicate early equipment failure.
At the top sits Wisdom: the ability to make sound judgments and decisions based on knowledge, experience, and values. Wisdom involves applying knowledge strategically to make choices that align with organizational goals and drive action.
Here's what matters most about the DIKW pyramid: you cannot skip steps. Without properly onboarded, trustworthy data at the foundation, everything above it collapses. Investing sufficient time in the Data and Information layers—and asking the right questions early—creates a stable foundation that prevents costly troubleshooting later and accelerates the path to reliable insights.
Data Onboarding: Building the Foundation
Understanding where your data comes from is essential to successful data ingestion. The source directly determines the reliability, quality, and usefulness of the data entering the ecosystem. Knowing how data is generated provides the context needed to assess accuracy, identify potential gaps or biases, and choose the right ingestion strategy—whether batch, streaming, or API-driven.
Clear knowledge of data origin also enables proper governance and compliance, ensuring sensitive data is handled securely according to regulatory standards. It supports traceability and lineage, making it easier to troubleshoot issues and build trust in analytics outcomes. Without understanding the source, organizations risk building decisions on unstable foundations, leading to costly mistakes and unreliable insights.
Understanding Data Source and Origin
Before ingesting any data, understand the applications or systems that produce it. This foundational knowledge helps anticipate potential issues and plan integration strategy.
Key questions to ask:
- How can the data be accessed—through APIs, file transfers, or direct database connections?
- What format is the data initially available in?
- What is the data structure?
These answers shape the technical approach and influence downstream decisions about transformation and storage, helping us determine:
- Whether to use batch, streaming, API, or database replication ingestion
- Whether schema-on-read or schema-on-write is appropriate
- Whether to convert to Parquet or Avro for schema evolution
- Which service to use in the AWS ecosystem: Kinesis Data Firehose, Glue, DMS, or EventBridge (see the streaming sketch below)
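For instance, if the answers above point to streaming ingestion, a minimal sketch of pushing a single event to Kinesis Data Firehose with boto3 could look like the following; the delivery stream name and event fields are hypothetical.

```python
import json
import boto3

# Streaming ingestion sketch: send one clickstream event to a Kinesis Data
# Firehose delivery stream that buffers and delivers to S3.
# "clickstream-to-s3" is a hypothetical delivery stream name.
firehose = boto3.client("firehose")

event = {"user_id": "u-123", "action": "page_view", "path": "/pricing"}

firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```

A batch source would instead land files in S3 on a schedule, and a relational source might be better served by DMS replication; the point is that the access pattern drives the service choice.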
Evaluating Data Refresh and Scale
Data isn't static. Understanding change patterns is crucial for building reliable ingestion pipelines.
Critical considerations:
- How often are the data and schema updated—real-time, hourly, daily, or on-demand?
- How are changes communicated?
- What's the typical data volume, and at what pace does it grow?
Update frequency directly impacts architecture choices and resource allocation. Knowing schema evolution patterns helps prevent breaking changes like column or type modifications. Volume and growth patterns matter equally—whether you're dealing with megabytes or petabytes, and whether growth is linear or exponential, affects how you design scalable solutions.
For example, in AWS, a pipeline without frequent schema evolution might look like: Raw data → S3 → one-time Glue crawler → Data Catalog → ETL → processed data in S3 (flexible formats such as CSV, JSON, or Parquet)
With continuous schema evolution: Raw data → S3 → continuously scheduled Glue crawler → Data Catalog → ETL → semi-processed data in S3 (Parquet format)
This evaluation helps you assess accuracy, reliability, completeness, consistency, and change frequency.
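As a sketch of the continuous schema evolution flow above, a scheduled Glue crawler can keep the Data Catalog in sync as columns appear or change. This uses boto3; the crawler name, IAM role, database, and S3 path are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Scheduled crawler sketch for continuous schema evolution: re-scan the raw
# zone every hour and update the Data Catalog when columns are added or changed.
glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw_zone",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/events/"}]},
    Schedule="cron(0 * * * ? *)",  # hourly; omit Schedule for a one-time crawl
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # record new and changed columns
        "DeleteBehavior": "LOG",                 # never silently drop tables
    },
)
glue.start_crawler(Name="raw-events-crawler")
```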
Assessing Data Quality and Reliability
Quality expectations establish the foundation for trust in your data. What level of accuracy, completeness, and reliability can you expect? This assessment determines whether additional validation, cleansing, or enrichment steps are necessary during onboarding.
Questions to consider:
- What level of accuracy and completeness can be expected?
- Are there known data quality issues or gaps?
- How is data quality validated at the source?
Poor data quality isn't just a technical problem—it's a business risk. Understanding quality characteristics upfront allows us to implement appropriate checks and balances, establish monitoring thresholds, and set realistic stakeholder expectations.
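One lightweight way to turn these expectations into checks at onboarding time is a small validation step before data moves downstream. The sketch below uses pandas; the batch file, column names, and 5% null threshold are hypothetical.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame, key: str, required: list[str]) -> dict:
    """Lightweight onboarding checks: volume, duplicate keys, completeness."""
    return {
        "row_count": len(df),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "null_rates": {col: float(df[col].isna().mean()) for col in required},
    }

# Hypothetical batch and threshold: reject batches where any required column
# has more than 5% missing values.
batch = pd.read_parquet("batch.parquet")
report = basic_quality_report(batch, key="order_id",
                              required=["order_id", "customer_id", "amount"])
if any(rate > 0.05 for rate in report["null_rates"].values()):
    raise ValueError(f"Batch failed completeness check: {report}")
```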
Planning Data Management and Oversight
Once data enters your ecosystem, where will it live? Understanding storage destinations helps you plan for capacity, performance, and cost optimization. Different storage solutions—data lakes, warehouses, or operational databases—serve different purposes with distinct trade-offs.
- Where will the data be stored, and how will storage be optimized for access patterns?
- What monitoring and oversight processes will ensure ongoing pipeline health?
- Who owns this data, and who is accountable for its quality and availability?
Reliable ingestion also depends on ongoing monitoring and oversight: automated monitoring, alerting, and validation help catch issues early and keep the pipeline healthy.
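As one example of such oversight, the sketch below publishes a custom ingestion metric to Amazon CloudWatch and raises an alarm when nothing arrives for an hour; the namespace, metric, and pipeline names are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish a custom pipeline-health metric from the ingestion job.
cloudwatch.put_metric_data(
    Namespace="DataPlatform/Ingestion",
    MetricData=[{
        "MetricName": "RowsIngested",
        "Dimensions": [{"Name": "Pipeline", "Value": "orders"}],
        "Value": 12500,
        "Unit": "Count",
    }],
)

# Alarm when nothing arrives for an hour, so stalled pipelines surface early.
cloudwatch.put_metric_alarm(
    AlarmName="orders-ingestion-stalled",
    Namespace="DataPlatform/Ingestion",
    MetricName="RowsIngested",
    Dimensions=[{"Name": "Pipeline", "Value": "orders"}],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",  # no data at all counts as a failure
)
```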
Schema Considerations
Schema management is often an afterthought but should be a primary concern. Will your onboarding process focus on schema discovery, automatically detecting and adapting to source structures? Or will you need robust schema evolution capabilities to handle changes over time without breaking downstream processes? These decisions impact everything from initial development effort to long-term maintenance burden.
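A simple starting point is to compare each incoming batch against the schema downstream jobs were built for. The sketch below does this with plain Python; the expected schema and incoming example are hypothetical.

```python
# Minimal schema-drift check: compare an incoming batch's columns and types
# against the contract downstream jobs expect (expected_schema is hypothetical).
expected_schema = {"order_id": "string", "customer_id": "string", "amount": "double"}

def detect_drift(incoming_schema: dict) -> dict:
    return {
        "added": sorted(set(incoming_schema) - set(expected_schema)),
        "removed": sorted(set(expected_schema) - set(incoming_schema)),
        "type_changes": sorted(
            col for col in set(expected_schema) & set(incoming_schema)
            if expected_schema[col] != incoming_schema[col]
        ),
    }

drift = detect_drift({"order_id": "string", "customer_id": "string",
                      "amount": "string", "discount": "double"})
print(drift)  # {'added': ['discount'], 'removed': [], 'type_changes': ['amount']}
```

Depending on the result, the pipeline can adapt automatically (schema discovery), alert a human, or quarantine the batch (schema enforcement).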
ETL: The Transformation Engine
Extract, Transform, Load (ETL) processes are where the exciting work happens. Consider whether large-scale transformations are necessary, as complex transformations significantly impact processing time and resource requirements.
During processing, pay attention to dataset structure and organization:
- How will you maintain data consistency across transformations?
- What strategies will handle data duplication and deduplication?
- How will you handle data validation and error handling during transformation?
- What approach will you use for managing slowly changing dimensions?
- How will you track data lineage and transformation history?
- What partitioning strategy will optimize processing performance?
- How will you handle late-arriving or out-of-order data?
- What mechanisms will ensure idempotency in your transformation logic?
These considerations ensure data integrity throughout the pipeline; the sketch below illustrates two of them, deduplication and idempotent handling of late-arriving data.
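The following is a minimal pandas sketch of an idempotent merge that keeps one row per business key and lets late-arriving records land in the right order; the column names and records are hypothetical.

```python
import pandas as pd

def merge_batches(existing: pd.DataFrame, new_batch: pd.DataFrame) -> pd.DataFrame:
    """Idempotent merge: re-running over overlapping batches yields the same result."""
    combined = pd.concat([existing, new_batch], ignore_index=True)
    combined = combined.sort_values("event_time")  # ISO-8601 strings sort chronologically
    # One row per business key; the latest event wins, even if it arrived late.
    return combined.drop_duplicates(subset="order_id", keep="last")

existing = pd.DataFrame([
    {"order_id": 1, "status": "created", "event_time": "2024-01-01T10:00"},
])
late_batch = pd.DataFrame([
    {"order_id": 1, "status": "shipped",   "event_time": "2024-01-01T09:00"},  # late arrival
    {"order_id": 1, "status": "delivered", "event_time": "2024-01-02T08:00"},
])
print(merge_batches(existing, late_batch))  # one row: order 1, status "delivered"
```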
Storage Optimization
Storage isn't just about capacity—it's about organization and optimization. How will you organize data for efficient access and analysis? Proper strategies like partitioning by date or categorizing by business domain dramatically improve query performance.
How will you optimize storage to improve query performance? Techniques like indexing, compression, and choosing appropriate file formats can mean the difference between queries that run in seconds versus hours.
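As an illustration, the sketch below writes a small dataset as Parquet partitioned by date using pyarrow, so queries can prune irrelevant partitions; the columns and output path are hypothetical.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Partition-by-date sketch: Parquet laid out as year=/month=/day= directories
# so query engines only scan the partitions they need (paths are hypothetical).
df = pd.DataFrame({
    "order_id": [1, 2],
    "amount": [42.5, 19.9],
    "year": [2024, 2024], "month": [1, 1], "day": [5, 6],
})

pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="processed/orders",            # or an s3:// URI with s3fs installed
    partition_cols=["year", "month", "day"],
)
```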
Security and Compliance
Security must be baked into the onboarding process from day one. Start with data classification and sensitivity assessment—not all data requires the same protection level, and understanding what you're working with informs your security approach.
Implement robust access controls: who can access what data, and under what circumstances? Role-based access control, encryption at rest and in transit, and audit logging are essential components.
- What type of data is this, and what is its sensitivity level?
- Does it contain PII, PHI, or other regulated information?
- Who can access this data, and under what circumstances?
- How will access permissions be managed and audited?
- How will data be encrypted at rest and in transit? (See the sketch after these questions.)
- What key management strategy will be used?
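Picking up the encryption questions above, here is a minimal sketch of writing an object to S3 with SSE-KMS via boto3; the bucket, object key, and KMS key alias are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Encryption-at-rest sketch: store an object encrypted with a customer-managed
# KMS key. boto3 calls the HTTPS endpoint, covering encryption in transit
# for the upload itself.
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/orders/2024-01-05.json",
    Body=b'{"order_id": 1}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/data-lake-raw",
)
```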
Observability is your early warning system. Can you detect unusual access patterns or potential breaches? If third-party integrations are involved, ensure Data Processing Agreements (DPAs) are in place to formalize security and compliance obligations.
- What logging and audit trails will track data access?
- How will you detect unusual access patterns or potential breaches?
- What regulatory requirements govern this data?
- Are Data Processing Agreements (DPAs) in place for third-party integrations?
- What are the data retention and deletion requirements? (See the lifecycle sketch below.)
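For the retention and deletion question, one option is an S3 lifecycle rule applied to the raw zone; the sketch below uses boto3, and the bucket, prefix, and retention periods are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Retention sketch: move raw objects to cheaper storage after 90 days and
# delete them after 365, in line with a (hypothetical) retention policy.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "raw-zone-retention",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```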
Navigating Governance and Compliance
Data governance isn't optional in today's regulatory environment. What regulatory, compliance, or policy requirements govern this data? From GDPR and CCPA to industry-specific regulations, understanding compliance obligations is essential before data crosses system boundaries.
Ownership and accountability must be clearly established. Who holds responsibility for the data? Clear lines of accountability ensure someone is always answerable for data quality, security, and proper usage.
Conclusion
Successful data onboarding requires thoughtful planning across multiple dimensions—from understanding source characteristics to implementing robust security measures. By systematically addressing these considerations, you can build data pipelines that are not only functional but reliable, scalable, and trustworthy.
Remember: the questions you ask before onboarding data are just as important as the technical implementation that follows. Data is the fuel for AI—with better fuel, it performs better. Take the time to get it right.
Written by
Katerina Tashoska
AWS Cloud Architect