GoDataDriven Open Source Contribution for Q4 2019

07 Feb, 2020

In the last quarter of 2019, the GoDataDriven team again contributed to many open source projects. At GoDataDriven we love open source. Open source enables us to fix bugs on our own, and to read through the code when we can’t find something in the documentation (looking at you, Airflow). This removes blockers for our customers and enables them to continue their quest to become more data driven. At the same time, we try to make the world a better place by patching bugs, fixing security issues, and implementing features.

Takeoff

Takeoff is a fully open source deployment tool, initially developed at Schiphol Airport. It helps you automate and create fully reproducible deployments on your favorite cloud. For more information, please refer to the excellent blog by the authors.

Daniel Heres, who recently joined us, took a swing at Takeoff, bumped into some typos in the documentation, and took the time to fix them:

  • Rename Runway -> Takeoff
  • Fix link in documentation
  • Some renaming, changes in schema
  • azure_service_principal -> service_principal
  • Fix typo
  • Fix for example code in takeoff plugins documentation

Daniel van der Ende also took the time to improve the docs:

  • Fix broken links in github pages docs
  • Add awesome ascii art
  • Fix incorrect takeoff config in docs

Finally, he also found the time to squash a couple of bugs and add some functionality:

  • Fix bug in latest docker tag
  • Bugfix: use correct client for azure provider
  • Fix base64 encoding of k8s templated values
  • Add ability to pass custom values to k8s jinja

Data Build Tool (DBT)

While getting familiar with the codebase of DBT, I encountered some minor code smells, and decided to add some annotations to make the code more readable:

  • Use generated_at instead of datetime.utcnow()
  • Remove the duplicate get_context_modules
  • Add annotations to the code

Furthermore, as Apache Spark lovers, we took the time to extend the support for it:

  • Add support for extracting statistics
  • Add support for creating/dropping schema’s
  • Pull the owner from the DESCRIBE EXTENDED
  • Apache Spark support

Apache Airflow

Many of our clients use Airflow to orchestrate their data pipelines. Recently, there has been an effort to enable storing state at the task level. The first idea was to store this in the xcom table. XCom stands for cross-communication, and is used for sharing inter-task state. With some minor tweaks we could also use this for intra-task state sharing. The main issue with state is that some executions become non-idempotent by definition. This is something that we’re still working on, and details can be found in the AIP.

  • [AIRFLOW-5806] Simplify the xcom table
  • [AIRFLOW-5804] Batch the xcom pull operation
  • [AIRFLOW-5688] Merge alembic migrations
  • [AIRFLOW-5792] Straighten out the migrations
  • [AIRFLOW-5771] Straighten out alembic migrations
  • AIRFLOW-5701: Don’t clear xcom explicitly before execution
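
To make the idempotency concern concrete, here is a minimal sketch in plain Python. It does not use Airflow itself: InMemoryXCom, run_task and the (dag_id, task_id, execution_date, key) layout are hypothetical stand-ins for illustration, not Airflow's actual schema or API. It only shows how a task that reads its own state from an earlier attempt behaves differently on a rerun once the store is no longer cleared before execution.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

# Hypothetical key layout, chosen for this sketch only.
XComKey = Tuple[str, str, str, str]  # (dag_id, task_id, execution_date, key)


@dataclass
class InMemoryXCom:
    """Toy stand-in for the xcom table; not Airflow's implementation."""

    rows: Dict[XComKey, Any] = field(default_factory=dict)

    def push(self, dag_id: str, task_id: str, execution_date: str,
             key: str, value: Any) -> None:
        self.rows[(dag_id, task_id, execution_date, key)] = value

    def pull(self, dag_id: str, task_id: str, execution_date: str,
             key: str, default: Any = None) -> Any:
        return self.rows.get((dag_id, task_id, execution_date, key), default)

    def clear(self, dag_id: str, task_id: str, execution_date: str) -> None:
        # Drop every value stored for this task instance.
        prefix = (dag_id, task_id, execution_date)
        for row_key in [k for k in self.rows if k[:3] == prefix]:
            del self.rows[row_key]


def run_task(xcom: InMemoryXCom, dag_id: str, task_id: str,
             execution_date: str, clear_before_run: bool) -> int:
    """Run a task that counts its own attempts via intra-task state."""
    if clear_before_run:
        xcom.clear(dag_id, task_id, execution_date)
    attempts = xcom.pull(dag_id, task_id, execution_date, "attempts", default=0) + 1
    xcom.push(dag_id, task_id, execution_date, "attempts", attempts)
    return attempts


if __name__ == "__main__":
    cleared = InMemoryXCom()
    print([run_task(cleared, "etl", "load", "2019-12-01", True) for _ in range(2)])

    kept = InMemoryXCom()
    print([run_task(kept, "etl", "load", "2019-12-01", False) for _ in range(2)])
```

When the store is cleared first, both runs return the same value. When it is kept, the second run sees what the first run left behind and returns a different value, which is what makes a rerun non-idempotent.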

Apache Parquet

Parquet is the de facto file format for OLAP workloads on data lakes. We’re working on getting ready for Java 11, so some dependencies were updated:

  • PARQUET-1496: Update Scala to 2.12
  • PARQUET-1673: Upgrade parquet-mr format version to 2.7.0

Apache Avro

In Avro, the integration tests of Apache Iceberg uncovered a regression. It was introduced after the big refactor of the schema resolution logic. The bug took me around three days to find, and it was fixed in a single line:

With that bug fixed, we’re starting the release process of Avro 1.9.2. This includes updating some of the dependencies to their latest versions, to make sure that we’re up to date:

Fix the CI:

Apache Spark

For Apache Spark, we’ve done some security patches:

And GoDataDriven is now mentioned on the Powered By section of the Spark website:

Apache Iceberg (Incubating)

While playing around with Iceberg and getting familiar with it, I noticed that the docs were incomplete. This naturally resulted in a PR:

In addition, I fixed assorted issues reported by the code smell detector in order to get more familiar with the codebase:

Other

And various other fixes:
