Optimizing Apache Spark & Tuning Best Practices
7 December, 2023 – Virtual
As data scales up, processing it efficiently becomes ever more crucial. Building on our experience as one of the world’s most significant Apache Spark users, this two-day course provides an in-depth overview of the do’s and don’ts of one of the most popular analytics engines available.
09:00 – 17:00
What will you learn?
After the training, you will be able to:
- Understand what Apache Spark does under the hood
- Use best practices to write performant code
- Tweak and debug your Spark applications
- Explain the Spark fundamentals, including the Driver/Executor execution model
- Work with caching, the shuffle service, and fair scheduling
- Troubleshoot optimization problems
Spark fundamentals
- Spark execution model: Driver/Executors
- Spark user interface for monitoring applications
- Understanding the RDD/DataFrame APIs and bindings
- Difference between Actions and Transformations
- How to read the Query plan (Physical/Logical)
- Spark Memory model
- Understanding persistence (caching)
- Catalyst optimizer, Tungsten project, and Adaptive Query Execution
- Shuffle service and how the shuffle operation is executed
- Concept of fair scheduling and pools
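One of the fundamentals above, the difference between transformations and actions, comes down to lazy evaluation: transformations only build an execution plan, and nothing runs until an action forces it. As a rough sketch (plain Python generators standing in for Spark; the function names here are illustrative, not the Spark API):

```python
# Sketch of Spark's lazy-evaluation model using plain Python generators.
# "Transformations" (map/filter) only build a pipeline; no element is
# processed until an "action" (collect) forces execution, mirroring how
# Spark defers all work until an action triggers a job.

def transform_map(data, fn):
    # Lazy: returns a generator; nothing is computed yet.
    return (fn(x) for x in data)

def transform_filter(data, pred):
    # Also lazy: just another layer of the plan.
    return (x for x in data if pred(x))

def action_collect(data):
    # Eager: materializes the whole pipeline in one pass.
    return list(data)

plan = transform_filter(transform_map(range(10), lambda x: x * 2),
                        lambda x: x > 10)
# At this point `plan` is only a recipe; no data has been touched.
result = action_collect(plan)   # the "action" runs the pipeline
print(result)                   # [12, 14, 16, 18]
```

In real Spark code the same pattern appears as chained `select`/`filter` calls that execute only when e.g. `count()` or `collect()` is called.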
Spark optimization: main problems and issues
- The most common memory problems
- The benefit of early filtering
- Understanding partition and predicate filtering
- Join optimization
- Dealing with data skewness (preprocessing, broadcasting, salting)
- Understanding shuffle partitions: how to tackle memory/disk spill
- The downside of using UDFs
- Executor idle timeout
- Data format examples, with an introduction to the Delta file format
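To give a feel for one of these techniques, salting spreads a heavily skewed key across partitions by appending a random suffix before a join or aggregation (the other side of a join is then replicated once per salt value). A minimal sketch of the key rewrite in plain Python, with illustrative names rather than Spark API calls:

```python
import random

# Sketch of key "salting" to combat data skew: hot keys get a random
# salt suffix so their rows hash to different partitions instead of
# piling up on one executor; cold keys pass through unchanged.

N_SALTS = 4  # number of salt buckets; tune to the degree of skew

def salt_key(key, hot_keys):
    if key in hot_keys:
        return f"{key}_{random.randrange(N_SALTS)}"
    return key

# A skewed dataset: one key dominates.
rows = ["user_42"] * 8 + ["user_7", "user_9"]
salted = [salt_key(k, hot_keys={"user_42"}) for k in rows]
# "user_42" is now spread over up to N_SALTS distinct keys, while
# "user_7" and "user_9" are left untouched.
```

In PySpark the same idea is typically implemented by adding a salt column (e.g. with `rand()`) to the skewed side and exploding the other side over the salt range.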
Moving to production
- Debugging / troubleshooting
- Productionizing your Spark application
- Dynamic allocation and dynamic partitioning
- JVM Profiler
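As an illustration of what dynamic allocation looks like in practice, these standard Spark configuration properties can be passed via `spark-submit` (the executor counts, timeout, and application file name below are placeholder values, not recommendations):

```shell
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.adaptive.enabled=true \
  my_app.py
```

With these settings, Spark scales the executor count between the configured bounds and releases executors that sit idle past the timeout; the external shuffle service keeps shuffle files available after an executor is removed.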
The trainer facilitates the content using notebooks hosted in a cloud environment. Each participant will have a Spark cluster to experiment with.
- Theory covering Spark basics and advanced topics
- Hands-on practice applying optimizations
Who is it for?
This training is excellent for you if you are a data or machine learning engineer transforming large volumes of data, in need of production-quality code, and wanting to optimize your Spark applications. The course is also great for expert data scientists wanting to learn simple tweaks to increase Spark performance dramatically.
General knowledge of and experience with Python and Spark (PySpark) are necessary.
Why should I follow this training?
Learn how Apache Spark works, apply best practices to write performant code, and tweak and debug your Spark applications.
Grasp the Spark fundamentals, including the Driver/Executor execution model, caching, the shuffle service, and fair scheduling.
Learn from and network with Apache Spark data experts.
What else should I know?
After registering for this training, you will receive a confirmation email with practical information. A week before the training, we will ask you about any dietary requirements and share literature if you need to prepare.
See you soon!
All literature and course materials are included in the price.
Also interesting for you
In partnership with dbt Labs, we offer you the dbt Learn training course. Upgrade your dbt (data build tool) skills now.
This MLOps on Azure training is a perfect next step if you’d like to take your Machine Learning models further.
7 Feb, 2024
Master the art of goal-setting with our OKR training – practical, insightful, and applicable for all backgrounds. Achieve better outcomes with clear objectives & key results!