Why you should care about Data Centric AI

10 Mar, 2022

Data Centric AI is about iterating on data instead of models to improve machine learning predictions.

Model Centric vs Data Centric AI

Why is this trend relevant now? Is this yet another hype in data science? Or has something really changed?

Traditionally, data scientists tend to treat their datasets as static and focus on improving their machine learning models.

And that’s not so strange:
It’s what they do in academia
It’s what they learn in courses
It’s what most online competitions focus on
It’s what most tools are being built for

This narrow focus has a name: "model-itis"

But something is changing. Machine learning models are being commoditized. Look at Hugging Face: they’re crushing it with pretrained models for text, vision and audio. Your data scientists can’t compete with that. The competitive advantage of data scientists lies in everything that surrounds the model.

What should your data scientists do instead? Focus on the things that don’t scale outside your organization:

Define great metrics
Develop high quality datasets
Discover systematic errors in your models
Ship models to your users
…and more
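To make "discover systematic errors" a bit more concrete, here is a minimal sketch that compares error rates across slices of your data. The record fields (`source`, `label`, `prediction`) and the toy data are hypothetical; swap in whatever metadata you have about how each example was generated.

```python
from collections import defaultdict


def error_rate_by_slice(records, slice_key):
    """Return {slice value: error rate} for a list of prediction records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        group = record[slice_key]
        totals[group] += 1
        # Count an error whenever the prediction disagrees with the label.
        if record["label"] != record["prediction"]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical toy data: every field name here is an assumption.
records = [
    {"source": "web", "label": "spam", "prediction": "spam"},
    {"source": "web", "label": "ham", "prediction": "ham"},
    {"source": "mobile", "label": "spam", "prediction": "ham"},
    {"source": "mobile", "label": "ham", "prediction": "ham"},
]

rates = error_rate_by_slice(records, "source")
print(rates)
```

A slice with a much higher error rate than the rest is usually a hint to go look at that data, or to ask a domain expert how it was collected.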

Your data is unique to your use case. Improving your data can be easier and have more impact than focusing on the model.

Improving your label quality is something you can do straight away. The course Machine Learning Engineering showed that clean datasets need way less training data than noisy datasets to reach the same performance. Label quality makes a huge difference.

Clean datasets need less training data than noisy datasets.
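One simple place to start with label quality is looking for examples that appear more than once with different labels. The sketch below is just one possible check, not a complete label-cleaning method; the `(text, label)` format and the normalization step (strip and lowercase) are assumptions for illustration.

```python
from collections import defaultdict


def find_label_conflicts(examples):
    """Return texts that were given more than one distinct label."""
    labels_by_text = defaultdict(set)
    for text, label in examples:
        # Assumption: texts that match after trimming and lowercasing
        # should have received the same label.
        labels_by_text[text.strip().lower()].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


# Hypothetical toy dataset.
examples = [
    ("Great product", "positive"),
    ("great product ", "negative"),
    ("Terrible support", "negative"),
]

conflicts = find_label_conflicts(examples)
print(conflicts)
```

Conflicts like these are good material for a conversation with your annotators or domain experts: often the labeling instructions turn out to be ambiguous.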

Not all of this is new. The ecosystem around spaCy has had a focus on iterating on code and data for years. But the wider appreciation of this focus is new.

What’s my favorite benefit of Data Centric AI? It makes it easier to collaborate! Model Centric data scientists tend to get stuck in their ivory tower. Data Centric data scientists talk to domain experts and ask how the data got generated… and that’s where the real learning is.

Now if you’re a data scientist and think, "But I actually like working on the models," then don’t despair! We’re in dire need of better tools that make it easier to iterate on the data. And models play a role there too!


