Optimizing Apache Spark & Tuning Best Practices

7 December, 2023

2 days
Virtual
Data Engineering

As data volumes grow, processing data efficiently becomes more crucial. Building on our experience as one of the world’s most significant Apache Spark users, this 2-day course provides an in-depth overview of the do’s and don’ts of one of the most popular analytics engines available.

Duration: 2 days
Time: 09:00 – 17:00
Language: English
Lunch: Included
Certification: No
Level: Professional

What will you learn?

After the training, you will be able to:

  • Understand what Apache Spark does under the hood.
  • Use best practices to write performant code.
  • Tweak and debug your Spark applications.
  • Explain the Spark fundamentals, including the execution model: Driver/Executors.
  • Work with caching, shuffle service, and fair scheduling.
  • Troubleshoot optimization problems and issues.

Key takeaways

Fundamentals

  1. Spark execution model: Driver/Executors. 
  2. Spark user interface for monitoring applications. 
  3. Understanding the RDD and DataFrame APIs and their language bindings. 
  4. Difference between actions and transformations. 
  5. How to read the query plan (logical/physical). 

Spark Internals 

  1. Spark Memory model 
  2. Understanding persistence (caching) 
  3. Catalyst optimizer, Tungsten project, and Adaptive Query Execution 
  4. Shuffle service and how a shuffle operation is executed 
  5. Concept of fair scheduling and pools 

Spark optimization: main problems and issues 

  1. The most common memory problems 
  2. The benefit of early filtering 
  3. Understanding partition and predicate filtering 
  4. Join optimization 
  5. Dealing with data skewness (preprocessing, broadcasting, salting) 
  6. Understanding shuffle partitions: how to tackle memory/disk spill 
  7. The downside of using UDFs 
  8. Executor idle timeout 
  9. Data format examples, with an introduction to the Delta file format 

Moving to production 

  1. Debugging / troubleshooting 
  2. Productionizing your Spark application 
  3. Dynamic allocation and dynamic partitioning 
  4. JVM Profiler 

Program

The trainer delivers the content using notebooks hosted in a cloud environment. Each participant will have a Spark cluster to experiment with. 

  • Theory covering Spark basics and advanced topics 
  • Apply optimizations in practice 

Who is it for?

This training is a good fit if you are a data or machine learning engineer who transforms large volumes of data, needs production-quality code, and wants to optimize Spark applications. The course is also great for expert data scientists who want to learn simple tweaks that can increase Spark performance dramatically. 

Requirements

General knowledge of and experience with Python and Spark (PySpark) are required. 

Why should I follow this training?

Learn how to use best practices to write performant Apache Spark code, and how to tweak and debug Spark applications. 

Grasp the Spark fundamentals, including the execution model (driver and executors), caching, the shuffle service, and fair scheduling. 

Learn from and network with Apache Spark data experts. 

What else should I know?

After registering for this training, you will receive a confirmation email with practical information. A week before the training, we will ask you about any dietary requirements and share any literature you need to read in preparation.

See you soon!

All literature and course materials are included in the price. 
