Welcome to the Open Source at GoDataDriven, July 2018 edition.
We start with Tünde and Kris, who did a phenomenal job adding support for Hive partitioned
tables with partitions having different data formats in Spark. You can find the result of their
work in PR 21893. Their work is the result of three sessions they had during our GoDataDriven
Fridays. It involved determination, skill, and a bit of detective work throughout the Spark code
base (they touched 7 files, adding more than 500 lines of code in the end).
The Spark folks are, however, reluctant to merge it. If you also think the feature is important and
useful to you, let your voice be heard!
Vincent open sourced Asekuro, a tool that makes it easier to test Jupyter notebooks, including
notebooks that use the %load magic.[^1]
Fokko then directed his attention to the Event Hub-Spark connector by Microsoft, opening PRs
356, 359, and 360.
Henk discovered another tool by Microsoft, DoWhy, a library that makes it easy to estimate causal
effects. He immediately contributed PRs 3 and 4.
Julian, meanwhile, opened quite a large PR in Airflow, namely PR 3560. In it
he pushes snakebite out and brings hdfs3 in to increase Airflow’s compatibility with Python 3.
To close: I contributed PRs 270 and 286 to dask-ml, although the first one might never be
merged, even though it solves an open issue. Both PRs show a nice use of decorators, with the
latter also showing how to define context managers with yield.
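For readers who have not seen the pattern, here is a minimal sketch of a context manager defined with yield and then wrapped in a decorator. It is not the code from the dask-ml PRs: the names `temporary_setting`, `with_setting`, and `describe` are hypothetical and exist only for illustration. The only library pieces used are `contextlib.contextmanager` and `functools.wraps` from the Python standard library.

```python
import functools
from contextlib import contextmanager


@contextmanager
def temporary_setting(settings, key, value):
    """Temporarily set settings[key], restoring the previous state on exit.

    Hypothetical helper, for illustration only.
    """
    missing = object()  # sentinel to tell "key absent" apart from "key is None"
    old = settings.get(key, missing)
    settings[key] = value
    try:
        # Everything before the yield runs on entering the with block,
        # everything in the finally clause runs on leaving it.
        yield settings
    finally:
        if old is missing:
            del settings[key]
        else:
            settings[key] = old


def with_setting(key, value):
    """Decorator factory that runs the wrapped function inside the context manager."""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(settings, *args, **kwargs):
            with temporary_setting(settings, key, value):
                return func(settings, *args, **kwargs)
        return wrapper
    return decorator


@with_setting("verbose", True)
def describe(settings):
    return "verbose" if settings.get("verbose") else "quiet"


config = {"verbose": False}
result = describe(config)  # runs with verbose=True, then config is restored
```

The try/finally around the yield is what guarantees the old value comes back even if the wrapped function raises.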
That’s it for this edition! Don’t forget we’re hiring! Especially if you are a software engineer
who would like to move into the data space, get in touch, as we’re offering an apprenticeship
starting in October.
And if you want more rambling throughout the month, follow me on Twitter: I’m gglanzani there!
[^1]: The magic is used to make it easier to load the solutions to the exercises without much
replication.