A data pipeline component is nothing more than a normal application

Fig. 1. Example data pipeline: ELT

A data pipeline component goes through the same lifecycle as any other application: the component is developed and tested, an artifact is built, and finally it gets deployed. The artifact is then run as part of your data pipeline. Throughout this lifecycle, the component uses the same development processes and building blocks as any other application:
- Every step in the pipeline runs as a normal application, so each of them needs an artifact that is built and tested. Use the same process as for any other application: build a (Docker) image, run the tests, and deploy any resources that need deploying.
- Monitoring, dashboarding and alerting: make use of the existing monitoring and dashboarding infrastructure. Most Kubernetes clusters will have something like Prometheus in place to gather metrics. All of your data pipeline components can integrate with this to make their metrics available in the same way as all the other applications running on the cluster. This way you can use the same dashboarding and alerting solution as all the other teams in your organization.
- We prefer to run the pipeline steps as jobs on Kubernetes. This provides a lot of advantages from a platform perspective.
- Each of the pipeline steps is responsible for storing its own output. Store this output separately from the compute, generally on blob storage.
- Connectivity to other systems/data sources: leverage the platform’s existing connectivity to other on-premises or cloud-based systems. Don’t build your own custom connectivity solutions in your applications.
- Make use of Kubernetes service accounts bound to cloud provider service accounts and manage the permissions on the cloud provider service accounts.
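To make the points about Kubernetes jobs, separate output storage, and service accounts a bit more concrete, here is a minimal sketch that builds the manifest for one pipeline step as a plain Python dictionary. The manifest fields follow the Kubernetes `batch/v1` Job API; everything else (the function name `build_job_manifest`, the `BATCH_DATE` and `OUTPUT_URI` environment variables, the image and bucket names) is a hypothetical choice for illustration, not something prescribed by the approach described above.

```python
import json


def build_job_manifest(step_name, image, batch_date, service_account, output_uri):
    """Return a Kubernetes batch/v1 Job manifest for one pipeline step as a dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"{step_name}-{batch_date}",
            "labels": {"pipeline-step": step_name, "batch": batch_date},
        },
        "spec": {
            "backoffLimit": 2,  # retry a failed pod at most twice
            "template": {
                "spec": {
                    # Kubernetes service account; assumed to be bound to a cloud
                    # provider service account that carries the permissions.
                    "serviceAccountName": service_account,
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": step_name,
                            "image": image,
                            "env": [
                                # Hypothetical variable names read by the step.
                                {"name": "BATCH_DATE", "value": batch_date},
                                # Output lives on blob storage, not on the pod.
                                {"name": "OUTPUT_URI", "value": output_uri},
                            ],
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_job_manifest(
        step_name="transform",
        image="registry.example.com/transform:1.0",  # placeholder image
        batch_date="2024-01-31",
        service_account="transform-sa",
        output_uri="s3://example-bucket/transform/2024-01-31/",  # placeholder
    )
    # kubectl accepts JSON as well as YAML, so this output can be applied as-is.
    print(json.dumps(manifest, indent=2))
```

Because the step only receives an output location and a service account name, the container itself needs no credentials or storage of its own; both are supplied by the platform.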
...with more focus on some areas

Some of the building blocks and processes are the same between a normal application and a data-pipeline component. However, it might be necessary to focus more on specific areas. This is mainly because of the inherent challenges of the data domain, like the large volume of data and the fact that data applications are often run in batches and are all about (processing) the data instead of the application itself. Examples of this are:
- Make use of the platform’s scalability for your resource-heavy applications. Data applications that run in batches often have large resource requirements, so scaling capacity up and down with the workload can bring significant cost benefits to the data platform. Do make sure that important tasks are scheduled on nodes that won’t just disappear (for example, not on spot or preemptible nodes), so they can run safely without getting killed.
- Since the volume of data is quite large, make use of the platform’s blob storage to store the data. This is generally much more cost-efficient than other ways of storing these large volumes of data.
- If you value quality and consistency, make sure you validate the incoming data, for example against a schema that’s part of the application’s configuration.
- Write metrics about the data processed by the application to provide insights and allow monitoring of your pipeline. If your data pipelines run in batches, make sure these metrics include the batch the pipeline was run for.
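The last two points, validating incoming data against a schema and writing batch-aware metrics, can be sketched together in a few lines. This is a simplified illustration using only the standard library: the schema format, the field names, and the functions `validate_record`, `process_batch` and `render_metrics` are all hypothetical, and a real component would more likely use a schema library and the metrics client that matches your monitoring stack. The metric lines are rendered in the style of the Prometheus text format.

```python
# Hypothetical schema, as it might appear in the application's configuration:
# field name -> (expected Python type, required?)
SCHEMA = {
    "order_id": (int, True),
    "amount": (float, True),
    "country": (str, False),
}


def validate_record(record, schema):
    """Return a list of violations; an empty list means the record is valid."""
    errors = []
    for field, (expected_type, required) in schema.items():
        if record.get(field) is None:
            if required:
                errors.append(f"missing required field '{field}'")
            continue
        if not isinstance(record[field], expected_type):
            errors.append(
                f"field '{field}' expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors


def process_batch(records, schema, batch_id):
    """Split records into valid and rejected, and count both for this batch."""
    valid, rejected = [], []
    for record in records:
        errors = validate_record(record, schema)
        if errors:
            rejected.append((record, errors))
        else:
            valid.append(record)
    metrics = {
        "batch": batch_id,
        "records_read": len(records),
        "records_valid": len(valid),
        "records_rejected": len(rejected),
    }
    return valid, rejected, metrics


def render_metrics(metrics, pipeline):
    """Render the counts as metric lines labelled with pipeline and batch."""
    batch = metrics["batch"]
    labels = f'pipeline="{pipeline}",batch="{batch}"'
    return [
        f"pipeline_{name}{{{labels}}} {value}"
        for name, value in metrics.items()
        if name != "batch"
    ]


if __name__ == "__main__":
    incoming = [
        {"order_id": 1, "amount": 9.5, "country": "NL"},
        {"order_id": "2", "amount": 3.0},  # wrong type for order_id
        {"amount": 1.0},                   # order_id missing
    ]
    valid, rejected, metrics = process_batch(incoming, SCHEMA, "2024-01-31")
    for line in render_metrics(metrics, "orders"):
        print(line)
```

Keeping the rejected records together with the reasons they failed, rather than silently dropping them, makes it possible to alert on the rejection count per batch and to inspect what went wrong afterwards.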
...and some additional functionality

Finally, there’s functionality that’s unique to a data platform. We strive to keep this to a minimum, but there are some exceptions, like making sure that the data produced by every application is registered in the data catalog.
Final thoughts

Think of your data-pipeline components as nothing more than normal applications with some additional functionality on top to meet the requirements of the data domain. Knowledge about normal applications and their building blocks helps you when working on your data pipelines. Keep the basic principles in mind and add data-specific functionality where required. This should be your path to a successful data pipeline. If you want to learn more about using your organization’s data in an optimal way, check out our other resources: