Data Science training: Data Science with Spark

This Xebia Data training offers a 3-day deep dive into Apache Spark. Learn to master the tools Apache Spark offers, unlock its potential, and advance your Data Science skills.

Apache Spark is a powerful open-source processing engine built around speed, ease of use, and advanced analytics. Guided by our experienced consultants, you will learn to unlock its full potential and master this challenging tool yourself.

“I liked every aspect of this training and would like to thank the trainers. They did an excellent job of explaining how to use Spark for data science. This is the fourth Xebia Data training I’ve followed. All were great, but this was the best one so far.” —Data Scientist, Knab

This training is perfect for

Anyone working in an organization that uses Apache Spark and wants to get the most out of it. The training is not limited to Data Scientists who wish to scale their projects. Data Engineers, Data Analysts, Software Programmers, and Database Administrators who want to exploit Apache Spark will also benefit from this course. Prior experience with Python or software programming is required. Experience with SQL or pandas is helpful, but not required.

What will you learn during this training?

Gain the theoretical knowledge, hands-on experience, and best practices you need to get the most out of Apache Spark. After completing the training, you will be able to confidently use Apache Spark for data science at scale.


The program consists of both theory and hands-on exercises.

Day 1:

  • Spark basics
  • Advanced Spark 
  • DataFrames

Day 2:

  • Machine Learning with Spark

Day 3:

  • Spark structured streaming
  • Spark hands-on lab: install and run Spark locally, apply online statistics to Meetup data with Spark Streaming, and build a movie recommender using Spark ML, with the trainers on hand to assist.

You will learn:

Spark basics

  • Spark execution
  • SparkSession
  • Transformations vs. actions
  • Laziness and lineage: how Spark optimizes code
  • How to use the Spark UI

Advanced Spark

  • How to apply partitioning and how Spark reads and writes data
  • Shuffling, narrow vs. wide operations, and their impact on performance
  • The Catalyst optimizer
  • Scheduling and job execution
  • Caching and persistence levels

DataFrames

  • The basic concepts
  • Spark DataFrames and pandas DataFrames
  • How to load and save DataFrames
  • The functions API
  • How to join data
  • User-defined functions and pandas user-defined functions (with performance implications)
  • Window operations

Machine Learning with Spark

  • Pre-processing data and feature engineering
  • Model selection
  • Pipeline API
  • Advanced topics

Spark structured streaming

  • Structured streaming
  • Machine Learning & streaming
  • Sources and sinks
  • Windows & aggregations
  • Checkpointing & watermarking
  • Fault tolerance & Kafka
  • Kafka as a source and as a sink

Data Science Trainers

This Data Science training is brought to you by Xebia Data. Xebia Data is part of Xebia, just like Xebia Academy. Xebia Data works with experts in their field who are always on the lookout for the most innovative ways to get the most out of data. Your trainer is a data guru who enjoys sharing his or her experiences to help you work with the latest tools.

Data Science Learning Journey

Your Data Science Learning Journey starts with a Foundation training, like Certified Analytics Translation or the Certified Data Science with Python Foundation training. We also offer a Xebia Data Deep Learning Professional level course. If you are ready for the Expert level, register for this 3-day Data Science with Spark training and learn all about Data Science at scale.

Yes, I want to dive into Apache Spark!

After registering for this training, you will receive a confirmation email with practical information. A week before the training, we will ask you about any dietary requirements and share literature if preparation is needed. See you soon!

What else should I know?

  • Virtual or in-person training: This training can be delivered either in person or online. For in-person training, we provide lunch, snacks, and drinks for participants. Virtual training is discounted accordingly.
  • This training requires a laptop. The hands-on labs are run in an online environment, eliminating the need to install software.
  • This course is brought to you by Xebia Data.
  • Literature and a nice lunch are included in the price.
  • Travel & accommodation expenses are not included.

Get in touch

Our team is at your service

Call +31 (0)20 760 9844