Optimizing Apache Spark & Tuning Best Practices
7 December, 2023 – Virtual
As data scales up, processing it efficiently becomes ever more crucial. Building on our experience as one of the world's most significant Apache Spark users, this 2-day course provides an in-depth overview of the do's and don'ts of one of the most popular analytics engines available.
Looking to upskill your team(s) or organization?
Nico will gladly help you further with custom training solutions.
Get in touch

Duration: 2 days
Time: 09:00 – 17:00
Language: English
Lunch: Included
Certification: No
Level: Professional
What will you learn?
After the training, you will be able to:
- Understand what Apache Spark does under the hood
- Use best practices to write performant code
- Tweak and debug your Spark applications
- Explain the Spark fundamentals, including the execution model (Driver/Executors)
- Work with caching, the shuffle service, and fair scheduling
- Troubleshoot optimization problems and issues
Key takeaways
Fundamentals
- Spark execution model: Driver/Executors
- Spark user interface for monitoring applications
- Understanding the RDD and DataFrame APIs and their bindings
- Difference between Actions and Transformations
- How to read the query plan (logical/physical), as illustrated in the sketch after this list
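To make the last two items concrete, here is a minimal PySpark sketch (illustrative only, not the course materials): the session setup and sample data are assumptions, and it shows lazy transformations, an action, and how `explain()` exposes the logical and physical plans.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative local session; a real deployment would run on a cluster.
spark = SparkSession.builder.master("local[*]").appName("fundamentals").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Transformations are lazy: Spark only records them in a plan.
adults = df.filter(F.col("age") >= 30).select("name")

# explain(True) prints the parsed/analyzed/optimized logical plans
# and the physical plan that will actually run.
adults.explain(True)

# An action (show/count/collect/write) triggers execution of the plan.
adults.show()
```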
Spark Internals
- Spark memory model
- Understanding persistence (caching); see the sketch after this list
- Catalyst optimizer, Tungsten project, and Adaptive Query Execution
- Shuffle service and how the shuffle operation is executed
- Concept of fair scheduling and pools
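As a taste of the persistence topic, the sketch below caches a reused intermediate result and toggles Adaptive Query Execution. It assumes the SparkSession `spark` from the previous sketch; the DataFrame and the chosen storage level are illustrative.

```python
from pyspark import StorageLevel
from pyspark.sql import functions as F

df = spark.range(10_000_000).withColumn("bucket", F.col("id") % 100)

# Persist an intermediate result that several downstream actions reuse,
# spilling to disk when it does not fit in executor memory.
agg = df.groupBy("bucket").count().persist(StorageLevel.MEMORY_AND_DISK)

agg.count()  # the first action materializes the cache
agg.show(5)  # later actions read the cached data instead of recomputing

agg.unpersist()  # release memory once the data is no longer needed

# Adaptive Query Execution re-optimizes plans at runtime from shuffle
# statistics (enabled by default in recent Spark 3.x releases).
spark.conf.set("spark.sql.adaptive.enabled", "true")
```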
Spark optimization: main problems and issues
- The most common memory problems
- The benefit of early filtering
- Understanding partition pruning and predicate pushdown
- Join optimization
- Dealing with data skewness (preprocessing, broadcasting, salting); see the sketch after this list
- Understanding shuffle partitions: how to tackle memory/disk spill
- The downside of using UDFs
- Executor idle timeout
- Data format examples, with an introduction to the Delta file format
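To give a flavor of the join and skew topics, here is a hedged sketch of a broadcast join and of key salting. It assumes an existing SparkSession `spark`, and the table names, column names, and salt count are all illustrative.

```python
from pyspark.sql import functions as F

facts = spark.range(1_000_000).withColumn("key", (F.col("id") % 10).cast("string"))
dims = spark.createDataFrame(
    [(str(i), f"label_{i}") for i in range(10)], ["key", "label"]
)

# Broadcast join: ship the small dimension table to every executor so the
# large fact table is joined without shuffling it.
joined = facts.join(F.broadcast(dims), "key")

# Salting: split each hot key into N sub-keys so no single partition
# receives all rows for that key; the small side is exploded to match.
N = 8
salted_facts = facts.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("key"), (F.rand() * N).cast("int").cast("string")),
)
salted_dims = (
    dims.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))
    .withColumn("salted_key", F.concat_ws("_", F.col("key"), F.col("salt").cast("string")))
)

skew_safe = salted_facts.join(salted_dims, "salted_key")
skew_safe.count()  # action to trigger the join
```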
Moving to production
- Debugging / troubleshooting
- Productionizing your Spark application
- Dynamic allocation and dynamic partitioning; see the sketch after this list
- JVM Profiler
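As an indicative example of the dynamic allocation topic, the sketch below sets the relevant configuration at session build time. The property values are illustrative starting points for discussion, not recommendations.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("production-job")
    # Let Spark grow and shrink the executor pool with the workload.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # Release executors that sit idle longer than this timeout.
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    # Without an external shuffle service, shuffle tracking must be enabled.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)

# Dynamic partition overwrite: rewrite only the partitions a job touches.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
```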
Program
The trainer facilitates the content using notebooks hosted in a cloud environment. Each participant will have a Spark cluster to experiment with.
- Theory on Spark basics and advanced topics
- Applying optimizations in practice
Who is it for?
This training is excellent for you if you are a data or machine learning engineer who transforms large volumes of data, needs production-quality code, and wants to optimize Spark applications. The course is also great for expert data scientists wanting to learn simple tweaks that dramatically increase Spark performance.
Requirements
General knowledge of and experience with Python and Spark (PySpark) are necessary.
Why should I follow this training?
Learn what Apache Spark does under the hood, use best practices to write performant code, and tweak and debug your Spark applications.
Grasp the Spark fundamentals, including the execution model (Driver/Executors), caching, the shuffle service, and fair scheduling.
Learn from and network with Apache Spark data experts.
What else should I know?
After registering for this training, you will receive a confirmation email with practical information. A week before the training, we will ask about any dietary requirements and share any literature you may need to prepare.
See you soon!
All literature and course materials are included in the price.