Apply TDD to Hadoop jobs with Scoobi
8 March, 2012
Maarten Winkels
Map-Reduce is a programming model for writing algorithms that process large quantities of data in (relatively) little time. The building blocks for such programs are very simple map and reduce functions. As programs built from these simple functions perform more and more complex tasks on the data, they become harder to get right and thus require more thorough testing in the early stages. This blog outlines a simple method for testing the algorithm of a Map-Reduce program based on Scoobi.
Let's look at the simple example from the Scoobi website:
[scala highlight="11,12,13,14"]
import com.nicta.scoobi.Scoobi._
import com.nicta.scoobi.DList._
import com.nicta.scoobi.io.text.TextInput._
import com.nicta.scoobi.io.text.TextOutput._

object WordCount {

  def main(allArgs: Array[String]) = withHadoopArgs(allArgs) { args =>
    val lines: DList[String] = fromTextFile(args(0))

    val counts: DList[(String, Int)] = lines.flatMap(_.split(" "))
                                            .map(word => (word, 1))
                                            .groupByKey
                                            .combine(_+_)

    persist(toTextFile(counts, args(1)))
  }
}
[/scala]
The example program counts the words in the input file. The beauty of the Scoobi framework is that it allows developers to work with collection abstractions to express their Map-Reduce algorithm, without having to deal with the gory details of writing mappers and reducers directly. The details of the Hadoop framework that underlies Scoobi are nicely hidden behind the simple-looking statements on lines 9 and 16. Without going into too much detail, the workings of Scoobi are summarized in this quote from the Scoobi website:
...calling DList methods will not immediately result in data being generated in HDFS. This is because, behind the scenes, Scoobi implements a staging compiler. The purpose of DList methods are to construct a graph of data transformations. Then, the act of persisting a DList triggers the compilation of the graph into one or more MapReduce jobs and their execution.

Now my suggestion is to take this Scoobi approach one step further and not only develop the algorithm with a collection abstraction, but test it with simple collections as well (a sketch follows below). From a testing point of view, there are two assumptions that I would like to ensure by testing the code above:
- Running the code should produce and run correct Hadoop Mappers and Reducers.
- The algorithm should do the task it is designed for: counting the words in the input file.
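To make the second point concrete, here is a minimal sketch of the same word-counting algorithm expressed on an ordinary Scala List, where it can be checked against a small in-memory input without touching Hadoop at all. The object and method names below are my own, not part of the original example.
[scala]
// A collection-based sketch of the word-count algorithm: the same
// flatMap/map/group/reduce pipeline as the DList version, but on a plain List.
object WordCountAlgorithm {

  def countWords(lines: List[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .groupBy(_._1)
         .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val input = List("hello hadoop", "hello scoobi")
    // A plain assertion stands in for whatever test framework you prefer.
    assert(countWords(input) == Map("hello" -> 2, "hadoop" -> 1, "scoobi" -> 1))
    println("word count algorithm behaves as expected")
  }
}
[/scala]
The collection version follows the same sequence of transformations as the DList version, so asserting on its output exercises the second assumption; verifying the first one still requires running the job against Hadoop itself.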