As announced last month, we are trying to collect all the contributions we
make to the open source world, either to existing or to new projects.
This second edition starts with Fokko, who contributed to 5 different projects: Druid,
Docker-Druid, Airflow, Flink, and scalatra-sample-app.
- In Druid he updated the documentation with PR 3973, regarding ingesting the Parquet format into a
Druid cluster;
- There he also fixed a faulty log line with PR 3970;
- In Druid Docker he fixed two issues in PR 33 and 34 by creating and setting correct
directories and permissions;
- In Airflow PR 2042 he further extended the spark-submit operator/hook by adding YARN integration;
- In Flink PR 3280 he fixed the documentation by setting a correct reference;
- In the scalatra sample app, Fokko, being a Scala aficionado, got rid of TypeParamSupport. The
trait was deprecated because its functionality has been folded into the core. This resulted in PR
7 for the project.
Meanwhile, yours truly fixed a wrong description of the UnpackContent processor in NiFi
with PR 1558 and open sourced a project to provision Google Cloud Engine instances to ease the
deployment of classroom trainings[1]. We are in fact often faced with a lot of challenges when delivering
trainings where Spark is involved:
- If we use virtual machines (VMs), the users can never quite experience how powerful Spark is, as
their machines are always so slow that it’s not even funny. As an added chore, we need to create,
maintain, and distribute several GBs around, as these VMs are not small;
- If we use local mode, installing Spark in all configurations is incredibly cumbersome, especially
if you want HDFS support; the slowness still applies, albeit in a less severe form;
- If you create a cluster, it’s never nice to deploy it, install the packages, and make the keys
available to everybody.
Since Google Cloud Engine makes it extremely easy to create clusters, the project kind of assumes
that that’s what you’re using. That said, it should be easy enough to modify it. Personally I’m
working on getting Anaconda + JupyterHub integrated so that users don’t even need to have SSH
access to the machine.
That’s all for the second edition. As always, we’re hiring Data Scientists and Data Engineers. Head
over to our career page if you’re interested. You
get plenty of opportunities to give back to the community.
[1] Huge props to Ron as he created the first working Ansible implementation!