In the last quarter of 2019, the GoDataDriven team again contributed to many open source projects. At GoDataDriven we love open source. Open source enables us to fix bugs ourselves, and to read the code when we can’t find an answer in the documentation (looking at you, Airflow). This removes blockers for our customers and enables them to continue their quest to become more data driven. At the same time, we try to make the world a better place by patching bugs, fixing security issues, and implementing features.
Takeoff
Takeoff is a deployment tool, initially developed at Schiphol Airport, and it is fully open source. It helps you automate and create fully reproducible deployments on your favorite cloud. For more information, please refer to the excellent blog post by the authors.
Daniel Heres, who recently joined us, took a swing at Takeoff, bumped into some typos in the documentation, and took the time to fix them:
- Rename Runway -> Takeoff
- Fix link in documentation
- Some renaming, changes in schema
- azure_service_principal -> service_principal
- Fix typo
- Fix for example code in takeoff plugins documentation
Daniel van der Ende also took the time to improve the docs:
Finally, he also found the time to squash a couple of bugs and add some functionality:
- Fix bug in latest docker tag
- Bugfix: use correct client for azure provider
- Fix base64 encoding of k8s templated values
- Add ability to pass custom values to k8s jinja
Data Build Tool (DBT)
While getting familiar with the DBT codebase, I encountered some minor code smells and decided to add annotations to make the code more readable:
- Use generated_at instead of datetime.utcnow()
- Remove the duplicate get_context_modules
- Add annotations to the code
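Adding annotations in the spirit of those PRs could look like this minimal sketch. The function names echo the PR titles, but the bodies are hypothetical, not actual dbt code:

```python
from typing import Dict, List, Optional


def get_context_modules(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Illustrative helper: the annotations make the expected shapes explicit."""
    modules: Dict[str, str] = {"datetime": "datetime", "re": "re"}
    if overrides is not None:
        modules.update(overrides)
    return modules


def module_names(modules: Dict[str, str]) -> List[str]:
    """With annotations, a reader (and mypy) knows this returns a sorted list of keys."""
    return sorted(modules)
```

The payoff is that a type checker can now catch a caller passing, say, a list where a dict is expected, without running the code.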
Furthermore, as Apache Spark lovers, we took the time to extend its Spark support:
- Add support for extracting statistics
- Add support for creating/dropping schemas
- Pull the owner from the DESCRIBE EXTENDED
- Apache Spark support
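Spark’s `DESCRIBE EXTENDED` returns key/value rows: the columns first, then a blank separator, then table metadata such as the owner. Pulling the owner out of that output can be sketched as follows; this is a simplified stand-in with hypothetical row data, not the actual dbt-spark implementation:

```python
from typing import List, Optional, Tuple

# Hypothetical snapshot of DESCRIBE EXTENDED output: column rows,
# a blank separator row, then table metadata rows.
DESCRIBE_EXTENDED_ROWS: List[Tuple[str, str]] = [
    ("id", "bigint"),
    ("name", "string"),
    ("", ""),  # separator between columns and table metadata
    ("# Detailed Table Information", ""),
    ("Database", "default"),
    ("Owner", "fokko"),
    ("Provider", "parquet"),
]


def pull_owner(rows: List[Tuple[str, str]]) -> Optional[str]:
    """Scan the rows for the Owner metadata entry."""
    for key, value in rows:
        if key == "Owner":
            return value
    return None
```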
Apache Airflow
Many of our clients use Airflow to orchestrate their data pipelines. Recently, there has been an effort to enable storing state at the task level. The first idea was to store this state in the xcom table. Xcom stands for cross-communication, and it is used for sharing state between tasks. With some minor tweaks we could also use it for intra-task state sharing. The main issue with state is that some executions become non-idempotent by definition. This is something that we’re still working on, and details can be found in the AIP.
- [AIRFLOW-5806] Simplify the xcom table
- [AIRFLOW-5804] Batch the xcom pull operation
- [AIRFLOW-5688] Merge alembic migrations
- [AIRFLOW-5792] Straighten out the migrations
- [AIRFLOW-5771] Straighten out alembic migrations
- AIRFLOW-5701: Don’t clear xcom explicitly before execution
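To make the idempotency concern concrete, here is a toy model of the xcom table (this is illustrative Python, not Airflow code): once a task reads state it wrote on a previous run, re-running it no longer produces the same result from the same inputs.

```python
from typing import Dict, Tuple

# Toy model of the xcom table: (dag_id, task_id, key) -> value.
xcom: Dict[Tuple[str, str, str], int] = {}


def run_counter_task(dag_id: str, task_id: str) -> int:
    """A task that reads its own previous state (intra-task xcom) and
    increments it. Each re-run yields a different result, which is
    exactly the non-idempotency issue described above."""
    key = (dag_id, task_id, "count")
    count = xcom.get(key, 0) + 1
    xcom[key] = count
    return count
```

Running `run_counter_task("my_dag", "t1")` twice returns 1 and then 2: clearing and re-running the task would not reproduce the original output.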
Apache Parquet
Parquet is the de facto file format for OLAP workloads on data lakes. We’re working on getting Parquet ready for Java 11, so some dependencies were updated:
Apache Avro
In Avro we discovered a regression that was uncovered by the integration tests of Apache Iceberg. It was introduced by the big refactor of the schema resolution logic. The bug took me around three days to find, and it was fixed in a single line:
After fixing that bug, we started the release process of Avro 1.9.2. This includes updating some of the dependencies to their latest versions, to make sure that we’re up to date:
- AVRO-2586: Bump spotless-maven-plugin from 1.24.1 to 1.25.1
- AVRO-2585: Bump jetty.version from 9.4.20.v20190813 to 9.4.21.v20190926
- AVRO-2584: Bump netty-codec-http2 from 4.1.39.Final to 4.1.42.Final
- AVRO-2582: Bump protobuf-java from 3.9.1 to 3.10.0
- AVRO-2583: Bump grpc.version from 1.23.0 to 1.24.0
Fix the CI:
Apache Spark
For Apache Spark, we’ve contributed some security patches:
- [SPARK-29445][CORE] Bump netty-all from 4.1.39.Final to 4.1.42.Final
- [SPARK-29483][BUILD] Bump Jackson to 2.10.0
- [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas. We picked up a stale PR to add support for reading Avro files with a custom schema.
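Those compatible-schema reads lean on Avro’s schema resolution rules. A greatly simplified model of the idea looks like this; the schema and record are hypothetical, and real Avro resolution also handles type promotion, aliases, and unions properly:

```python
from typing import Any, Dict

# Hypothetical reader schema in simplified Avro style: the reader
# added an optional "email" field with a default value.
READER_SCHEMA: Dict[str, Any] = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Simplified model of Avro schema resolution: fields known to the
    reader are taken from the writer's record when present, otherwise
    filled from the reader's default; a missing field with no default
    is a resolution error."""
    out: Dict[str, Any] = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"Field {name!r} missing and has no default")
    return out
```

This is why a reader schema can safely add fields with defaults: records written before the field existed still resolve cleanly.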
And GoDataDriven is now mentioned on the Powered By section of the Spark website:
Apache Iceberg (Incubating)
While playing around with Iceberg and getting familiar with it, I noticed that the docs were incomplete. This naturally resulted in a PR:
In addition, I fixed assorted issues that were reported by the code smell detector, in order to get more familiar with the codebase:
- Replace StringBuffer by StringBuilder
- Add missing overrides, Fix MissingOverride error and Fix MissingOverride error.
- Remove some of the unused variables
- Fix PreconditionsInvalidPlaceholder error
- Fix ObjectsHashCodePrimitive error
- Fix EqualsGetClass error
- Fix MutableConstantField error
- Fix ImmutableEnumChecker error
- Update docs to Gradle 5.4.1
- Add Java 11 to the testsuite
- Bump ORC from 1.5.5 to 1.5.6
- Bump Apache Parquet to 1.11.0
- Add Baseline to iceberg-parquet
- Apply Baseline to iceberg-pig
Other
And various other fixes:
- Bump spark and java versions
- Our Kris Geuzebroek took the time to fix an issue on his docker-kafka image.
- Fixed a link on the Databricks containers repository
- Add Fokko as committer
- Fix the CI by updating the db2 test
- Presto: Bump Apache Avro to 1.9.1, and Consolidate and bump the snappy-java version