Where to begin when joining your first Scala Spark project

27 Sep, 2019

Man, Apache Spark is some powerful stuff! Add a fancy and fun language called Scala and you feel like you can do a whole lot of cool things with a lot of flexibility! This is my current standpoint on the subject, after a short year of working with this setup. The first month was a completely different story. Let me share some tips that really helped me out when I started working on an existing Scala-Spark codebase.

The codebase

Scala is a really powerful language with lots of nice features. It’s fun to write and quite expressive compared to languages that require more boilerplate (like Java). This CAN improve the readability of the code; however, some people tend to write short, obscure one-liners. It’s a waste to invest time in deciphering that sort of code as long as you don’t have to.
Long story short: we don’t want to spend too much time getting to the bottom of overly complicated code. It can be more helpful to first read the unit tests to get a feel for what the code is supposed to do (given that they are up to date and follow a naming convention that explains the functionality under test).
One more note about the code: don’t sweat the small stuff. In Scala-Spark there are many ways to achieve the same thing. For instance, there is more than one way to refer to a column by name:

// Four equivalent ways to refer to a column named "col1"
val c1 = col("col1")       // org.apache.spark.sql.functions.col
val c2 = $"col1"           // requires import spark.implicits._
val c3 = 'col1             // Scala Symbol syntax, also via spark.implicits._
val c4 = dataset("col1")   // resolves the column against a specific Dataset

After a while, you’ll get used to the many synonyms Spark offers. Hopefully, your code does not contain too many different ones!

The data

When working on a Scala-Spark project, there is always data involved. Getting to know the structure of that data is of paramount importance. Even if you’ve never written any Scala-Spark code, the best way to get insight into the data is by using notebooks.
Notebooks are offered by Apache Spark vendors like Databricks. As long as you know how to read the data in the first place, you’re good to go! Play around with the data: try to come up with some metrics and extract them from the data.
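A first exploration in a notebook might look something like this (a sketch; the file path and column names are hypothetical, and `spark` is the session a notebook provides for you):

```scala
// Read the data once, then start asking questions of it
val orders = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/orders.csv")   // hypothetical location

orders.printSchema()          // what does the data look like?

// An example metric: order counts per country, largest first
orders.groupBy("country")
  .count()
  .orderBy($"count".desc)
  .show()
```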
Especially look at:

  • joins (on top of the classic ones, Spark offers many handy join types; one of my favorites is the left_anti join, and an equi-join can also be a convenient way of joining datasets)
  • SQL functions (especially when applied in windows over the data)
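To illustrate that left_anti join (a sketch; the dataset names are made up), this keeps only the customers that have no matching order:

```scala
// left_anti keeps rows from the left side that have NO match on the right,
// returning only the left side's columns
val customersWithoutOrders =
  customers.join(orders, Seq("customerId"), "left_anti")
```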

And I know it is tempting to just write SQL, because Spark supports that too. However, in the long run, a lot more can be achieved if you invest the effort to use Spark through the DSL (personally, I find it easier to read, because all of your data manipulations are very fine-grained).
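To make the comparison concrete (a sketch; the table and column names are made up), here is the same query in SQL and in the DSL:

```scala
// SQL: the whole query lives inside one string, opaque to the compiler
spark.sql("SELECT country, COUNT(*) AS cnt FROM people WHERE age > 18 GROUP BY country")

// DSL: each manipulation is a separate, fine-grained step
people.filter($"age" > 18)
  .groupBy("country")
  .agg(count("*").as("cnt"))
```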

Debugging your code

When you’ve reached the point where you’re contributing to the codebase and you’re stuck on a unit test (because somehow you can’t see what’s in your dataframe from your breakpoint), remember that Spark only evaluates when an action is performed.
If your code chains a lot of transformations and you want to know what the data looks like somewhere in the middle, you can evaluate all the code up to the last transform you’re interested in and simply call the collect() function on it. This puts your Spark instance to work and returns all of the transformed data as an Array[T].
Consider the following code:

case class Person(
    name: String,
    age: Int,
    country: String,
    email: String,
    hideEmail: Boolean
)
case class ContactInfo(email: Option[String], name: String)

def someTransform(input: Dataset[Person])(implicit spark: SparkSession): Dataset[ContactInfo] = {
    import spark.implicits._
    input.filter($"age" > 5)
         .filter($"name" =!= "John")
         .filter($"country" === "DE")
         // ... a lot of other transformations
         .map { person =>
             val email = if (person.hideEmail) None else Some(person.email)
             ContactInfo(email = email, name = person.name)
         }
}

If we suspect the third filter (.filter($"country" === "DE")), a breakpoint can be added anywhere in the method (since we’re only interested in the input value). Once the breakpoint fires, we can evaluate the expression:

input.filter($"age" > 5).filter($"name" =!= "John").collect()

This applies the first two transforms and returns all rows that pass as an Array[Person].
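When the dataset is large, collect() pulls everything back to the driver, which can be slow or run it out of memory. During debugging it is often enough to inspect a handful of rows (a sketch of the same expression, evaluated from the breakpoint):

```scala
// Print the first 10 rows instead of materializing them all
input.filter($"age" > 5).filter($"name" =!= "John").show(10)

// Or take at most 10 rows back as an Array[Person]
val sample = input.filter($"age" > 5).filter($"name" =!= "John").take(10)
```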

Operational support

When the project you’re working on is already in production, questions will be asked about the Scala-Spark code. If you have the advantage of other people on your project who are analyzing the system to answer such a question: pair along! Don’t let the guru answer it all by himself; ask that person how the answer was derived. Of course, this is obvious, but many times I’ve seen newer colleagues too shy to ask (or, of course, not interested).
Hopefully, these tips will help you in the case you’re starting to contribute to a Scala-Spark project!
Thanks for reading!

