Blog

Parsing Text with Scala

21 Oct, 2009
Xebia Background Header Wave

In my efforts to teach myself Scala, I tried solving a problem I’ve tackled in various languages, 6510 assembly code (didn’t get far…), pl/sql, Java (with and without Drools) and Groovy among them. Usually I get bogged down in some detail of the language so I never get to reap any actual benefits of my efforts in daily life. The plus side of this never ending task is that by now I don’t have to spend effort on defining a problem but instead can start coding right away.
So this story is about how to parse text in Scala and is part of THE software package that will automagically generate a menu for a week and the shopping list for that menu together with whatever else my family will need that week and send it to www.albert.nl and have the groceries delivered to my door.

Thinking big and starting small, I set about (again) parsing a set of recipes so I can extract the list of ingredients necessary for the shopping list. Since all of my practical knowledge of Scala came from reading the book by Martin Odersky, there was some evolution in my code. As we go, I’ll try and amuse you with some of the clunky contraptions I’ve created along the way while trying to shake of old habits.
The aim is to parse a list of recipes like the one below:
[text]
name ZucchiniWithMinceMeat
season summer
category easy
cookingtime 45 minutes
preheat oven at 180 degrees
fry 500 gram mincemeat
fry 1 piece Zucchini
add 1 can mashedTomatoes
add 1 bag groundCheese
bake for 30 minutes
[/text]
and turning it into a list of groceries like this:
[text]
mincemeat 500 gram
Zucchini 1 piece
mashedTomatoes 1 can
groundCheese 1 bag
[/text]
Scala offers a parser for text named StandardTokenParsers in the package scala.util.parsing.combinator.syntactical. The idea is that a specific parser extends StandardTokenParsers. The parser should provide

  • A list of delimiters
  • A list of reserved words
  • A ‘main’ method that accepts text input and starts the parser
  • A set of production rules that parse part of the input, building an object graph along the way.
    lexical.delimiters is the list of characters used to separate tokens in the input.
    lexical.reserved defines a set of key words, i.e. the while, if, for etc of your language. The snippet below shows my definitions (it is part of com.xebia.cooking.CookBookDSL.scala, which you can find in the zip file attached to this post.).
    [scala]
    lexical.delimiters ++= List(“(“, “)”, “,”, ” “, “\n”)
    lexical.reserved ++= List(“name”, “season”, “fry”, “pack”, “can”, “bag”, “piece”, “at”, “for”, “bottle”, “gram”, “oven”, “preheat”, “bake”, “serve”, “with”,”add”,”degrees”, “category”, “cookingtime”)
    [/scala]
    lexical is of type Lexical and is an inherited attribute of StandardTokenParsers.
    The parser is started like this:
    [scala]
    def parseCookBook(cookBook:String):CookBook = {
    cookbook(new lexical.Scanner(cookBook)) match {
    case Success(recipeList, _) => new CookBook(recipeList)
    case Failure(msg, _) => throw new ParseErrorException(msg)
    case Error(msg, _) => throw new ParseErrorException(msg)
    }
    }
    [/scala]
    The input is a string representation of a cookbook. parseCookBook returns an object of type CookBook which we will discuss later. The parsing is kicked off by calling the cookbook method with an instance of a Scanner as a parameter. The Scanner will emit tokens that are matched with the rules of the grammar as we’ll see below.
    Using Scala’s case construct the result of the parser is translated in either a CookBook instance or results in an exception.
    The parser consists of a set of rules in the form of method definitions that look like the one below:
    [scala]
    def recipeName: Parser[String] =
    “name” ~ ident ^^ { case “name” ~ name => name }
    [/scala]
    The rule defines the name of a recipe as a String. The text to the left of the ^^ operator defines the tokens that must appear in the input for the rule to match. In this case we expect the string “name” followed by an identifier. ident is defined in StdTokenParsers (along with numericLit which I’ll use later) and matches a string of alphanumeric characters. The tokens that are thus parsed are used to select a action to the right of the ^^ sign. “name” ~ name introduces a variable named name that is initialized with the value of the ident token. name is then returned as a result of the rule by placing it to the right of the => sign.
    A slightly more complex example is shown below:
    [scala]
    def ingredient: Parser[Ingredient] =
    opt(amount) ~ ident ^^
    { case Some(amount) ~ ingredient => (new Ingredient(ingredient,amount))
    case _ ~ ingredient => {new Ingredient(ingredient)}
    }
    [/scala]
    This rule matches text like “500 gram mincemeat”, “1 piece Zucchini” but also “salad” or “tomatoes”. The amount part of the definition is made optional. Instead of just one case clause like we saw in the first rule, this rule has two cases. The first matches an amount followed by an ingredient, the second matches just an ingredient without the amount part.
    One rule can build on the results of other rules as illustrated below:
    [scala]
    def step: Parser[Step] =
    (fryStep | preHeatStep | serveWithStep | addStep | bakeStep) ^^ { case step => step }
    [/scala]
    A step in a recipe is something like “preheat oven at 180 degrees” or “fry 500 gram mincemeat”. The clauses to the left are alternatives that are defined as separate rules. The result is always a Step instance. Steps are important because some of them define ingredients which, in some hypothetical future version of my magnum opus, will result in a shopping list (do I hear laughter?).
    One last sample rule shows repeating elements:
    [scala]
    def listOfSeasons: Parser[List[Seasons]] =
    repsep(season ,”,”) ^^ { listOfSeasons: List[Seasons] => listOfSeasons }
    }
    [/scala]
    This rule matches one or more occurrences of a season separated by a comma using the repsep clause. The result is a List of objects of type Seasons.
    The parser code uses classes that represent concepts in my domain, like Ingredient, Step and Recipe. I’ve defined these classes in a separate source file named Recipe.scala. Most of the classes are empty and just serve to define concepts. They may become more interesting later (you know, when I implement this wonderful…, etc, etc…). Others, like StepWithAnIngredient serve to collect ingredients that will be used to create a shopping list.
    Anyway, one thing I learned from writing the classes in the Recipe.scala file is the use the mkString method. Given my Java background I wrote toString methods like this:
    [scala]
    override def toString = {
    val result:StringBuilder = new StringBuilder(“Steps: “)
    list.foreach(step => result.append(step).append(“\n”))
    result.toString()
    }
    [/scala]
    I was feeling pretty happy about this (you know, remembering to use the foreach method) until I realized this kind of thing is way easier implemented like this:
    [scala]
    override def toString = “Steps: ” + steps.mkString(“\n”)
    [/scala]
    You can call mkString on any instance you like and it will return a string representation, which is especially convenient in the case of iterable objects.
    Another powerful Scala feature is the set of methods defined on lists. One such method is filter that accepts a closure executed on each element of the list. If the condition returns true the element is added to the result.
    In the example below, the _ represents a recipe, i.e. an element of the list of recipes.
    [scala]
    def findRecipesBySeason(season:Seasons):Seq[Recipe] = {
    recipes.filter(_.seasons.contains(season))
    }
    [/scala]
    Another thing that had me thoroughly confused at first was the syntax for constructor parameters. The example below shows the variables amount and unitName defined in the Amount class, the Java way:
    [java]
    public class Amount {
    private int amount;
    private String unitName;
    public Amount(int amount, String unitName) {
    this.amount=amount;
    this.unitName=unitName;
    }
    // getters and setters omitted for brevity…
    }
    [/java]
    And in Scala:
    [scala]
    class Amount (val amount:Int, val unitName:String) {
    override def toString = amount+”:”+unitName
    }
    [/scala]
    The parameters following the name of the class are turned into public val’s. With my Java background and my compulsion to hit ctrl-1 in Eclipse, it took me awhile to realize how easy and concise the Scala syntax is.
    Note that the Recipe.scala file contains a list of class definitions. In Scala there is no one-to-one mapping from public classes to source files. This allowed me to combine a list of related concepts in a single source file.
    Finally, I found that the best way to develop parsers is by taking very small steps at a time. It is pretty easy to mess things up beyond repair by changing too much at at once.
    This Scala parser is by far the most complete version I’ve built so far. I’m now starting to like the language, even though my lack of experience gets in the way at least 15 times an hour. Who knows, I might actually create something useful this time…
Jan Vermeir
Developing software and infrastructure in teams, doing whatever it takes to get stable, safe and efficient systems in production.
Questions?

Get in touch with us to learn more about the subject and related solutions

Explore related posts