In the last quarter of 2019, the GoDataDriven team again contributed to many open source projects. At GoDataDriven we love open source. Open source enables us to fix bugs ourselves, and to read the code when we can’t find an answer in the documentation (looking at you, Airflow). This removes blockers for our customers and enables them to continue their quest to become more data driven. At the same time, we try to make the world a better place by patching bugs, fixing security issues, and implementing features.
Takeoff
Takeoff is a deployment tool, initially developed at Schiphol Airport, and it is fully open source. It helps you automate and create fully reproducible deployments on your favorite cloud. For more information, please refer to the excellent blog post by the authors.
Daniel Heres, who recently joined us, took a swing at Takeoff, bumped into some typos in the documentation, and took the time to fix them:
- Rename Runway -> Takeoff
- Fix link in documentation
- Some renaming, changes in schema
- azure_service_principal -> service_principal
- Fix typo
- Fix for example code in takeoff plugins documentation
Daniel van der Ende also took the time to improve the docs:
Finally, he also found the time to squash a couple of bugs and add some functionality:
- Fix bug in latest docker tag
- Bugfix: use correct client for azure provider
- Fix base64 encoding of k8s templated values
- Add ability to pass custom values to k8s jinja
Data Build Tool (DBT)
While getting familiar with the DBT codebase, I encountered some minor code smells and decided to add annotations to make the code more readable:
- Use generated_at instead of datetime.utcnow()
- Remove the duplicate get_context_modules
- Add annotations to the code
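Adding annotations in the spirit of those PRs could look like this minimal sketch. The function names echo the PR titles, but the bodies are hypothetical, not actual dbt code:

```python
from typing import Dict, List, Optional


def get_context_modules(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Illustrative helper: the annotations make the expected shapes explicit."""
    modules: Dict[str, str] = {"datetime": "datetime", "re": "re"}
    if overrides is not None:
        modules.update(overrides)
    return modules


def module_names(modules: Dict[str, str]) -> List[str]:
    """With annotations, a reader (and mypy) knows this returns a sorted list of keys."""
    return sorted(modules)
```

The payoff is that a type checker can now catch a caller passing, say, a list where a dict is expected, without running the code.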
Furthermore, as Apache Spark lovers, we took the time to extend its Spark support:
- Add support for extracting statistics
- Add support for creating/dropping schemas
- Pull the owner from the DESCRIBE EXTENDED
- Apache Spark support
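Spark’s `DESCRIBE EXTENDED` returns key/value rows: the columns first, then a blank separator, then table metadata such as the owner. Pulling the owner out of that output can be sketched as follows; this is a simplified stand-in with hypothetical row data, not the actual dbt-spark implementation:

```python
from typing import List, Optional, Tuple

# Hypothetical snapshot of DESCRIBE EXTENDED output: column rows,
# a blank separator row, then table metadata rows.
DESCRIBE_EXTENDED_ROWS: List[Tuple[str, str]] = [
    ("id", "bigint"),
    ("name", "string"),
    ("", ""),  # separator between columns and table metadata
    ("# Detailed Table Information", ""),
    ("Database", "default"),
    ("Owner", "fokko"),
    ("Provider", "parquet"),
]


def pull_owner(rows: List[Tuple[str, str]]) -> Optional[str]:
    """Scan the rows for the Owner metadata entry."""
    for key, value in rows:
        if key == "Owner":
            return value
    return None
```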
Apache Airflow
Many of our clients use Airflow to orchestrate their data pipelines. Recently, there has been an effort to enable storing state at the task level. The first idea was to store this state in the xcom table. Xcom stands for cross-communication, and it is used for sharing state between tasks. With some minor tweaks we could also use it for intra-task state sharing. The main issue with state is that some executions become non-idempotent by definition. This is something that we’re still working on, and details can be found in the AIP.
- [AIRFLOW-5806] Simplify the xcom table
- [AIRFLOW-5804] Batch the xcom pull operation
- [AIRFLOW-5688] Merge alembic migrations
- [AIRFLOW-5792] Straighten out the migrations
- [AIRFLOW-5771] Straighten out alembic migrations
- AIRFLOW-5701: Don’t clear xcom explicitly before execution
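To make the idempotency concern concrete, here is a toy model of the xcom table (this is illustrative Python, not Airflow code): once a task reads state it wrote on a previous run, re-running it no longer produces the same result from the same inputs.

```python
from typing import Dict, Tuple

# Toy model of the xcom table: (dag_id, task_id, key) -> value.
xcom: Dict[Tuple[str, str, str], int] = {}


def run_counter_task(dag_id: str, task_id: str) -> int:
    """A task that reads its own previous state (intra-task xcom) and
    increments it. Each re-run yields a different result, which is
    exactly the non-idempotency issue described above."""
    key = (dag_id, task_id, "count")
    count = xcom.get(key, 0) + 1
    xcom[key] = count
    return count
```

Running `run_counter_task("my_dag", "t1")` twice returns 1 and then 2: clearing and re-running the task would not reproduce the original output.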
Apache Parquet
Parquet is the de facto file format for OLAP workloads on data lakes. We’re working on getting Parquet ready for Java 11, so some dependencies were updated:
Apache Avro
In Avro we discovered a regression that was uncovered by the integration tests of Apache Iceberg. It was introduced by the big refactor of the schema resolution logic. The bug took me around three days to find, and it was fixed in a single line:
After fixing that bug, we started the release process of Avro 1.9.2. This includes updating some of the dependencies to their latest versions, to make sure that we’re up to date:
- AVRO-2586: Bump spotless-maven-plugin from 1.24.1 to 1.25.1
- AVRO-2585: Bump jetty.version from 9.4.20.v20190813 to 9.4.21.v20190926
- AVRO-2584: Bump netty-codec-http2 from 4.1.39.Final to 4.1.42.Final
- AVRO-2582: Bump protobuf-java from 3.9.1 to 3.10.0
- AVRO-2583: Bump grpc.version from 1.23.0 to 1.24.0
Fix the CI:
Apache Spark
For Apache Spark, we’ve contributed some security patches:
- [SPARK-29445][CORE] Bump netty-all from 4.1.39.Final to 4.1.42.Final
- [SPARK-29483][BUILD] Bump Jackson to 2.10.0
- [SPARK-27506][SQL] Allow deserialization of Avro data using compatible schemas. We picked up a stale PR to add support for reading Avro files with a custom schema.
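Those compatible-schema reads lean on Avro’s schema resolution rules. A greatly simplified model of the idea looks like this; the schema and record are hypothetical, and real Avro resolution also handles type promotion, aliases, and unions properly:

```python
from typing import Any, Dict

# Hypothetical reader schema in simplified Avro style: the reader
# added an optional "email" field with a default value.
READER_SCHEMA: Dict[str, Any] = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Simplified model of Avro schema resolution: fields known to the
    reader are taken from the writer's record when present, otherwise
    filled from the reader's default; a missing field with no default
    is a resolution error."""
    out: Dict[str, Any] = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"Field {name!r} missing and has no default")
    return out
```

This is why a reader schema can safely add fields with defaults: records written before the field existed still resolve cleanly.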
And GoDataDriven is now mentioned on the Powered By section of the Spark website:
Apache Iceberg (Incubating)
While playing around with Iceberg and getting familiar with it, I noticed that the docs were incomplete. This naturally resulted in a PR:
In addition, I fixed assorted issues that were reported by the code smell detector, in order to get more familiar with the codebase:
- Replace StringBuffer by StringBuilder
- Add missing overrides, Fix MissingOverride error and Fix MissingOverride error.
- Remove some of the unused variables
- Fix PreconditionsInvalidPlaceholder error
- Fix ObjectsHashCodePrimitive error
- Fix EqualsGetClass error
- Fix MutableConstantField error
- Fix ImmutableEnumChecker error
- Update docs to Gradle 5.4.1
- Add Java 11 to the testsuite
- Bump ORC from 1.5.5 to 1.5.6
- Bump Apache Parquet to 1.11.0
- Add Baseline to iceberg-parquet
- Apply Baseline to iceberg-pig
Other
And various other fixes:
- Bump spark and java versions
- Our Kris Geuzebroek took the time to fix an issue on his docker-kafka image.
- Fixed a link on the Databricks containers repository
- Add Fokko as committer
- Fix the CI by updating the db2 test
- Presto: Bump Apache Avro to 1.9.1, and Consolidate and bump the snappy-java version