Introduction
In this blog post, we explore the evolution of data modeling from its early days of transactional databases to its role in enabling AI-driven insights today. We discuss the impact of technological advancements, cloud computing, and the rise of AI/ML applications on data modeling practices.
As we intend for large language models (LLMs) to query the data stored in our information systems, our data models need to enable them to generate queries that retrieve the right data for analytical purposes.
We investigate the challenges and opportunities that arise from integrating tabular data with contextual information, metadata, and AI/ML systems. The need for enriched data models becomes clear, as they enable more effective use of these AI/ML systems.
As we navigate the ever-changing landscape of data modeling, we highlight the importance of other topics like data quality, ethical considerations, and data governance in creating robust AI/ML applications.
Join us as we journey through the transformation of data models and prepare you to be at the forefront of the AI/ML revolution.
Early days of Data Modeling
Data modeling is not new. Even before computers existed, we were modeling how data is stored and how it relates to other data. However, data modeling has undergone significant changes over the years, driven by technological advancements and the increasing reliance on AI/ML applications today.
As a simplified definition, a data model is a representation of reality. This representation can take the form of schemas, drawings, code, or other artifacts that allow us to understand how data is stored in a data system. Since data can be stored and retrieved in different formats, we have multiple data models to choose from, and their differences help us decide which one best fits the use case at hand.
Until the 1990s, data processing in databases had significant limitations. Storage was expensive, and complex data operations could take a long time to complete. The data models of this era focused on maximizing computation speed and minimizing storage.
During this period, computational power had to be hosted on site, or “on-premises.” Companies were therefore limited by their physical constraints: they needed a room dedicated to this purpose and a team dedicated to keeping it running.
(Created with DALL-E 3)
The data warehouse was the most significant change during the first decade of the 2000s. As computing power became available to more companies and the explosion of internet use generated large quantities of data, the focus shifted from transactional databases to analytical workflows.
Data users needed to organize this information in an understandable way for the business while extracting knowledge from the raw data. This actionable insight would then be used to steer business decision-making faster than ever before. The data models of this era evolved to give people actionable insights through facts and dimensions geared towards an abstract but accurate representation of how the business worked using data as a medium.
The Cloud Era
The introduction of cloud computing disrupted the data ecosystem in the 2010s. Cloud environments allow data users to overcome the limitations of on-premises systems.
Storage costs decreased, and when data users needed more computing power, they could simply pay for additional machines without changing the physical infrastructure in their server rooms. The biggest change was the introduction of distributed computing to handle large quantities of data efficiently. Massively parallel processing (MPP) engines like Google BigQuery changed the perspective on handling and querying data.
(Image credits: Google)
As a result, data models changed again to become more flexible, often avoiding joins, since information can now be stored in different parts of the world; it is better if the data needed for an analytical workflow stays together.
In the cloud era, data users constantly evolve their use cases, adding or removing data sources. This makes it challenging to follow the traditional data modeling process of creating a conceptual model, refining it into a logical model, and finally implementing it as a physical data model.
Every company wants to derive value from the different types of information it has. Traditional relational models that rely on tables to store data are still widely used. Yet the need to include other file formats like PDF, video, image, and sound, as well as semistructured data types like JSON or XML, pushes data models into new forms.
AI/ML and the Future of Data Modeling
The speed of development in Artificial Intelligence (AI) and Machine Learning (ML) creates opportunities to include these unstructured or semistructured data types in the analytical workflow through AI/ML applications. In the last couple of years, Large Language Models, or LLMs, have disrupted the way we interact with technology in almost every field.
Today it is common to see applications with LLM-powered chatbots where users can ask whatever they like and get a response. This gives the false impression that the LLM knows everything and can answer any question.
However, LLMs are trained to deliver an answer no matter what, and sometimes these systems produce inaccurate answers called “hallucinations.” To avoid this, LLMs need the right context to produce accurate results. With that context, LLMs can extract information from unstructured data and even perform basic analytical workflows, such as generating code to query a data system.
(Created with DALL-E 3)
Let’s consider a simple example: suppose we want an LLM to analyze customer behavior based on their purchase history. We need to provide the model with relevant context, such as:
- Metadata about the customers (e.g., demographics, preferences)
- Labels indicating the type of products purchased
- Annotations describing the relationship between purchases and customer segments
By providing this context, we enable the LLM to identify patterns and relationships that might not be immediately apparent from individual data points in the tables of our data system.
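As a minimal sketch of what providing this context can look like in practice, the snippet below assembles customer metadata, product labels, and segment annotations into a single prompt. The data values, the build_prompt helper, and the prompt wording are illustrative assumptions, and the actual call to an LLM is left out.

# A minimal sketch: turning metadata, labels, and annotations into LLM context.
# All data and the build_prompt helper are hypothetical examples.
customer_metadata = {
    "customer_id": 123,
    "demographics": {"age_group": "30-39", "region": "EU"},
    "preferences": ["outdoor", "electronics"],
}
purchase_labels = [
    {"product_name": "Product A", "category": "electronics"},
    {"product_name": "Product B", "category": "outdoor"},
]
segment_annotations = {
    "segment": "frequent buyer",
    "note": "purchases cluster around seasonal promotions",
}

def build_prompt(metadata, labels, annotations, question):
    """Combine the contextual pieces into a single prompt string."""
    return (
        "You are analyzing customer purchase behavior.\n"
        f"Customer metadata: {metadata}\n"
        f"Purchased products and labels: {labels}\n"
        f"Segment annotations: {annotations}\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    customer_metadata,
    purchase_labels,
    segment_annotations,
    "Which product categories is this customer most likely to buy next?",
)
print(prompt)  # In a real application, this prompt would be sent to the LLM.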
In practice, this means using different data modeling approaches depending on the specific use case. For example:
- Denormalization: Combining related data into fewer, wider tables, duplicating values where needed, so that queries require fewer joins and are easier to read for humans or machines
- Normalization: Organizing data into separate, related tables to reduce redundancy and keep it consistent
Suppose we have this information in our database:
Normalized Data Model (Multiple Tables)
In a simplified store, we want information about the products that our customers order.
- Customers

| Customers |
| --- |
| customer_id |
| name |
| address |

- Orders

| Orders |
| --- |
| order_id |
| customer_id |
| order_date |

- Order Details

| Order Details |
| --- |
| order_id |
| product_name |
| quantity |
To retrieve data for a specific customer, we would need to join these three tables together:
-- All orders and order lines for customer 123; "Order Details" is quoted
-- because the table name contains a space.
SELECT c.name, o.order_date, od.product_name, od.quantity
FROM Customers c
JOIN Orders o ON c.customer_id = o.customer_id
JOIN "Order Details" od ON o.order_id = od.order_id
WHERE c.customer_id = 123;
Denormalized Data Model (Single Table)
In a denormalized data model, all the information we need for a specific customer is stored in a single table:
- Customer Orders
| customer_id | name | address | order_date | product_name | quantity |
| --- | --- | --- | --- | --- | --- |
| 123 | John Smith | 123 Main St | 2022-01-01 | Product A | 2 |
| 123 | John Smith | 123 Main St | 2022-01-15 | Product B | 3 |
| … | … | … | … | … | … |
To retrieve data for a specific customer, we can simply query this single table:
-- The table name is quoted because it contains a space.
SELECT * FROM "Customer Orders" WHERE customer_id = 123;
In the normalized model, the LLM would need to understand how to join multiple tables together to get the desired information. In contrast, the denormalized model provides all the necessary data in a single location, making it easier for the LLM to access and process.
We can choose the best strategy for our specific needs by understanding how these approaches impact LLM performance.
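To make the contrast concrete, here is a small Python sketch of how the denormalized table can be derived from the normalized one. The in-memory pandas DataFrames stand in for the Customers, Orders, and Order Details tables above; in a real system the same result would typically be materialized with a SQL join.

# A sketch of denormalization: joining the three normalized tables from the
# example above into a single "Customer Orders" table using pandas.
import pandas as pd

customers = pd.DataFrame(
    {"customer_id": [123], "name": ["John Smith"], "address": ["123 Main St"]}
)
orders = pd.DataFrame(
    {"order_id": [1, 2], "customer_id": [123, 123],
     "order_date": ["2022-01-01", "2022-01-15"]}
)
order_details = pd.DataFrame(
    {"order_id": [1, 2], "product_name": ["Product A", "Product B"],
     "quantity": [2, 3]}
)

# Join Customers -> Orders -> Order Details and keep the columns shown earlier.
customer_orders = (
    customers.merge(orders, on="customer_id")
             .merge(order_details, on="order_id")
             [["customer_id", "name", "address",
               "order_date", "product_name", "quantity"]]
)

# Retrieving data for a specific customer is now a single-table lookup.
print(customer_orders[customer_orders["customer_id"] == 123])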
As these new use cases and technologies emerge, we need to think about the best way to store and serve data for the end user; in other words, we need to consider whether our data model is the best option to get the results we expect.
Now it is our turn to answer how data models must adapt. For starters, our existing data models must capture complex relationships, adding contextual information and metadata that help create the context for these AI/ML applications.
The data models devised to allow interaction between machines and humans now also need to facilitate machine-to-machine integration. We are still trying to understand the best shape for our data models to smooth this interaction.
Enriching Data Models with Contextual Information and Metadata
Ensuring high performance and accuracy with LLMs requires proper preparation of data models. Without it, the ‘garbage in equals garbage out’ principle applies, leading to low-quality results. Data users can address these new challenges with the following considerations:
- Focus on data quality: Data quality is paramount, as it impacts the performance of AI/ML applications. This includes data cleansing, outlier detection, and handling missing values.
- Enrich data models with context: Once a data model is in place, enrich it with contextual information and metadata to provide meaning and facilitate analysis. This enables machines to find the correct data for LLM applications; a small sketch of this idea follows the list.
- Describe relationships using knowledge graphs and ontologies: By describing data relationships through knowledge graphs and ontology-based approaches, data users can maximize the application and accuracy of LLMs.
- Leverage additional tools and techniques: Data users can utilize data dictionaries, data catalogs, semantic models, knowledge graphs, text embeddings, and vector databases to enhance their data models for LLM applications.
- Implement guardrails and security measures: LLMs require security measures to prevent surfacing sensitive information. Proper labeling and governance rules must be in place to ensure data privacy and compliance with regulations.
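As a small sketch of enriching a data model with context, the snippet below attaches descriptions to the tables and columns from the store example, forming a lightweight data dictionary, and renders them as text that could be added to an LLM prompt. The structure and wording are illustrative assumptions rather than a specific tool or standard.

# A sketch of a lightweight data dictionary for the store example above.
# The descriptions are illustrative; in practice a data catalog or semantic
# layer would typically hold this information.
data_dictionary = {
    "Customers": {
        "description": "One row per customer of the store.",
        "columns": {
            "customer_id": "Unique identifier of the customer.",
            "name": "Full name of the customer.",
            "address": "Shipping address of the customer.",
        },
    },
    "Orders": {
        "description": "One row per order placed by a customer.",
        "columns": {
            "order_id": "Unique identifier of the order.",
            "customer_id": "References Customers.customer_id.",
            "order_date": "Date the order was placed (YYYY-MM-DD).",
        },
    },
}

def schema_context(dictionary):
    """Render the data dictionary as text that can be included in an LLM prompt."""
    lines = []
    for table, info in dictionary.items():
        lines.append(f"Table {table}: {info['description']}")
        for column, description in info["columns"].items():
            lines.append(f"  - {column}: {description}")
    return "\n".join(lines)

print(schema_context(data_dictionary))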
In summary, by following these steps, data users can create well-prepared data models that maximize the performance and accuracy of AI/ML applications. Analysts can gain a deeper understanding of their data by incorporating the right annotations, labeling, and metadata. Additionally, addressing security concerns and ethical implications will ensure compliance with evolving data and privacy regulations.
Conclusion
From transactional operations to AI-driven insights, data models have evolved to let data users turn raw data into useful information, transform it into actionable knowledge, and finally apply it as wisdom.
Data users who are capable of integrating domain-specific knowledge, creating the right context using semantic models and metadata, and understanding how these new technologies interact with our existing data models will be at the forefront of the AI/ML revolution happening now.
If you or your data users need some help implementing these changes, give us a call. At Xebia data, we are ready to help you make the most out of your data.