A data platform is nothing more than a normal (cloud) platform with some additional functionality on top to make it specific to the requirements of the data domain. Instead of the applications that run on a “normal” platform, like (web) services and front-ends, it runs ELT/ETL pipelines and data applications.
It builds on top of common platform building blocks and gives the users everything they need to take full ownership of their data applications, like:
- Compute resources
Data applications and ELTs need to run somewhere, just like any other application, so just like on any other platform the necessary resources need to be available for them.
- Data storage: the data lake
The data in your data platform needs to be stored somewhere; this is known as the data lake. Because of the large volume of data, it needs to be optimized for low cost per unit of storage.
As all data applications can potentially access the data simultaneously, the storage needs to scale with these requirements. The large volumes also require the data to be read in parallel/concurrently, so the storage solution should be optimized for throughput rather than latency.
Because of these requirements, object stores like S3 on AWS or Cloud Storage on GCP are a good fit. They’re relatively cheap per GB, aren’t limited in maximum size and allow for decent scaling on parallel access.
To offer high performance and efficiency, the files are often stored in a columnar format (e.g. Parquet) in what’s generally called a data lake.
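As an illustration of why a columnar layout like Parquet suits analytical reads, here is a minimal pure-Python sketch; the data and field names are made up, and a real data lake would of course use an actual columnar file format rather than Python lists:

```python
# Minimal illustration of row-oriented vs column-oriented storage.
# All data and field names are illustrative.

rows = [
    {"user_id": 1, "country": "NL", "revenue": 10.0},
    {"user_id": 2, "country": "DE", "revenue": 25.5},
    {"user_id": 3, "country": "NL", "revenue": 7.25},
]

# Row layout: a sum over one column still walks every full record.
total_row_layout = sum(r["revenue"] for r in rows)

# Columnar layout: each column is stored contiguously, so an
# aggregation only touches the column it actually needs.
columns = {
    "user_id": [1, 2, 3],
    "country": ["NL", "DE", "NL"],
    "revenue": [10.0, 25.5, 7.25],
}
total_columnar = sum(columns["revenue"])

print(total_row_layout, total_columnar)  # → 42.75 42.75
```

The results are identical, but in the columnar case an engine can skip the `user_id` and `country` data entirely, which is what makes formats like Parquet efficient for scans over a few columns of a wide table.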
- CI/CD
Just like other applications, ELT/ETL and data applications need to be built, tested and deployed. This should be done using standard CI/CD solutions.
- Monitoring, dashboarding and alerting
To enable users full ownership of their projects, monitoring, dashboarding and alerting solutions need to be in place. This way, they can easily get insights into their projects and get informed when something happens.
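As a hedged sketch of what such an alerting rule could look like, the pure-Python example below checks metrics against thresholds; the metric names, rule names and thresholds are all illustrative and not tied to any specific monitoring tool:

```python
# Illustrative sketch of metric-based alerting: a rule fires when its
# metric strictly exceeds its threshold. Names/values are made up.

def evaluate_alerts(metrics, rules):
    """Return the names of all rules whose threshold is exceeded."""
    fired = []
    for name, (metric, threshold) in rules.items():
        if metrics.get(metric, 0) > threshold:
            fired.append(name)
    return fired

# A pipeline owner could track failures and load duration like this:
metrics = {"failed_tasks": 3, "load_seconds": 120}
rules = {
    "pipeline_failures": ("failed_tasks", 0),  # any failure alerts
    "slow_load": ("load_seconds", 600),        # fires above 10 minutes
}
print(evaluate_alerts(metrics, rules))  # → ['pipeline_failures']
```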
- Connectivity to other systems/data sources
To be able to extract data from different sources the extract process needs connectivity to these sources, just like a new application might need connectivity to a legacy backend.
Non-functional requirements like scalability, reliability, usability, maintainability, portability, extensibility and security apply to the fundamentals of a data platform just as they do to a normal platform.
...with more focus on some areas
Some functionality is the same between a normal platform and a data platform, but may focus more on specific areas or pose additional challenges because of the inherent characteristics of the data domain, such as the large volume of data or the existence of multiple copies of the same data. Good examples of this are providing compute resources optimized for accelerating certain data applications, like AI accelerators or GPUs, as well as compliance and governance. A GDPR request is simple to honor for a normal application with a single database, but can pose a significantly more challenging problem when it needs to be guaranteed to be applied to all data in a data lake and to the possibly scattered output of all data applications.
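To make the data-lake side of such a request concrete, here is a simplified sketch of propagating an erasure request across multiple datasets; the dataset names and key column are illustrative, and a real lake would rewrite files in place or use a table format with delete support (e.g. Delta Lake or Apache Iceberg):

```python
# Hedged sketch: a GDPR erasure request must reach every copy of the
# data, including derived datasets. All names here are illustrative.

def erase_user(datasets, user_id, key="user_id"):
    """Drop all records for user_id from every dataset; return counts."""
    removed = {}
    for name, records in datasets.items():
        before = len(records)
        datasets[name] = [r for r in records if r.get(key) != user_id]
        removed[name] = before - len(datasets[name])
    return removed

datasets = {
    "raw/orders": [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 5}],
    "derived/daily_revenue": [{"user_id": 1, "day": "2024-01-01"}],
}
report = erase_user(datasets, user_id=1)
print(report)  # → {'raw/orders': 1, 'derived/daily_revenue': 1}
```

The hard part in practice is not the deletion itself but knowing that `datasets` really lists every copy, which is why a complete inventory (e.g. via the data catalog) matters for compliance.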
Scalability, or having an elastic platform, can bring significant cost benefits to a data platform, because data applications often run in batches with large resource requirements.
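A back-of-the-envelope calculation shows the effect; all prices and sizes below are made-up assumptions for illustration only:

```python
# Assumed scenario: a nightly batch job needs 20 large instances
# for 2 hours per day. The hourly price is a made-up figure.

price_per_instance_hour = 1.50  # assumed on-demand price
instances = 20
hours_per_day = 2

# Elastic platform: pay only while the batch runs.
elastic_monthly = price_per_instance_hour * instances * hours_per_day * 30

# Static cluster: the same peak capacity provisioned 24/7.
static_monthly = price_per_instance_hour * instances * 24 * 30

print(elastic_monthly, static_monthly)  # → 1800.0 21600.0
```

Under these assumptions the elastic setup costs about a twelfth of the always-on cluster, which is exactly the ratio of hours actually used to hours provisioned.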
...and some additional functionality
And finally there’s functionality that’s unique to a data platform; all of it is built on the common building blocks offered by any platform.
Examples of this are:
- Workflow management system
Both ELT/ETL and data application pipelines need to be scheduled to load or process data. This generally happens on a periodic schedule and involves running multiple tasks in a certain order. An example of this is Apache Airflow.
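The core idea, running multiple tasks in dependency order, can be sketched with Python's standard-library `graphlib`; the task names are illustrative, and a real workflow manager like Airflow adds scheduling, retries and monitoring on top:

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Illustrative ELT pipeline: each task maps to the tasks it depends
# on, mirroring how a workflow manager models a DAG.
dag = {
    "extract_orders": [],
    "extract_users": [],
    "transform_join": ["extract_orders", "extract_users"],
    "load_warehouse": ["transform_join"],
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)  # both extracts, then transform_join, then load_warehouse
```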
- Data catalog
The data catalog allows users to easily find the data they are looking for. It serves as an inventory of the data that's available, extends it with collaborative metadata and makes it easily searchable. Examples of this are AWS Glue Data Catalog or Google Data Catalog.
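A data catalog can be thought of as a searchable inventory; the following minimal in-memory stand-in (with made-up dataset names and metadata fields) sketches that idea, whereas real catalogs like the ones above also track schemas, partitions and lineage:

```python
# Minimal in-memory stand-in for a data catalog: an inventory of
# datasets, extended with metadata, made searchable by keyword.
# All dataset names and metadata are illustrative.

catalog = [
    {"name": "sales.orders", "owner": "team-sales",
     "tags": ["orders", "revenue"], "description": "All customer orders"},
    {"name": "marketing.campaigns", "owner": "team-marketing",
     "tags": ["campaigns"], "description": "Campaign spend and reach"},
]

def search(catalog, term):
    """Case-insensitive keyword search over name, tags and description."""
    term = term.lower()
    return [d["name"] for d in catalog
            if term in d["name"].lower()
            or term in d["description"].lower()
            or any(term in tag for tag in d["tags"])]

print(search(catalog, "revenue"))  # → ['sales.orders']
```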
- SQL based data exploration stack
The data exploration stack allows users to get a quick view of the contents of the data they found using the data catalog. Its SQL interface allows creating simple combinations of data sources and makes it useful for analytics and dashboarding. Examples of this are Presto and AWS Athena.
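To sketch this kind of exploration without a full engine, the example below uses Python's built-in `sqlite3` as a local stand-in for Presto or Athena; the tables and columns are illustrative:

```python
import sqlite3

# sqlite3 as a local stand-in for a distributed SQL engine like
# Presto/Athena. Tables, columns and data are all made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
con.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "NL"), (2, "DE")])

# A typical exploratory query: join two sources and aggregate.
result = con.execute("""
    SELECT u.country, SUM(o.amount)
    FROM orders o JOIN users u ON o.user_id = u.user_id
    GROUP BY u.country ORDER BY u.country
""").fetchall()
print(result)  # → [('DE', 7.5), ('NL', 15.0)]
```

The point of the exploration stack is that this same query shape works against data-lake files at scale, without the user having to write a data application first.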
- Development and exploration stack for data applications
The data-science development stack allows users to develop and test their data applications. It consists of compute capacity and generally comes with a notebook-like interface like Jupyter.
Think of a data-platform as nothing more than a normal (cloud) platform with some additional functionality on top to make it specific to the requirements of the data domain. Having knowledge about normal platform technologies and building blocks helps you when working on your data platform strategy and starting your data platform initiative. Keep in mind the basic principles and add data specific functionalities where required. This should be your path to a successful data platform.
If you want to learn more about using your organization’s data in an optimal way, check out these resources: