I previously blogged about shell scripting JSON with Node.js. In this post, I’ll demonstrate how to achieve the same thing with jsawk. The way these things go, I didn’t find out about jsawk until after I wrote my post about shell scripting with Node. It’s good to know both :).
Jsawk, as the name implies, aims to be for json what awk is for structured plain text. It’s rather useful, as it saves a lot of setup/boilerplate compared to the node.js scripts I’ve shown before.
A simple example
Consider a csv file of stock quotes:
[sourcecode language="text"]
#Symbol,Price,Cap
GOOG,569.31,184.58B
AAPL,574.70,537.60B
[/sourcecode]
If we want to extract just the symbols from this file with awk, one per line, we can run the file through awk:
[sourcecode language="bash"]awk -F , '/^[^#]/{ print $1 }'[/sourcecode]
To get the sum of all stock prices in the portfolio (which is silly), we run the file through in a slightly different way:
[sourcecode language="bash"]awk -F , '/^[^#]/{ sum += $2 } END{ print sum }'[/sourcecode]
For those unfamiliar with awk: the -F , argument indicates that we use the comma as the field separator. The /^[^#]/ regex tells awk the subsequent block (between curlies) applies only to lines that don’t start with a pound sign. The END command indicates that the subsequent block is run once, at end of file. The dollar variables point to the numbered columns, where $0 contains the entire record and $1 the first column.
If we had the same information in a JSON document, it’s not unlikely that it’d look like this:
[sourcecode language="javascript"]
[
{
"Symbol":"GOOG",
"Price":"569.31",
"Cap":"184.58B"
},
{
"Symbol":"AAPL",
"Price":"574.70",
"Cap":"537.60B"
}
]
[/sourcecode]
Given this kind of input, the similarity between awk and jsawk is strong. We’ll see further on that it breaks down a bit if the root of the JSON is an object instead of an array, but let’s focus on the simple things first.
To extract the symbol names with jsawk, we can run the json data through a very simple
[sourcecode language="bash"]jsawk 'return this.Symbol'[/sourcecode] and get the names in a JSON array: ["AAPL","GOOG"]. This script is analogous to applying ECMAscript 5's map function: it returns a new array where each element is the result of applying the given function to the element in the original array. The body of this function is the only parameter to jsawk, we don't need any boilerplate.
If we want to get the names unquoted and one per line, like with the awk script we've seen earlier, we can run [sourcecode language="bash"]jsawk -n 'out(this.Symbol)'[/sourcecode]
Here, the -n flag indicates that we’re not interested in the tranformed JSON. The out(arg) function simply writes to standard out.
Both these scripts are easier to read than their awk counterparts, mostly because I included a header line in the CSV example. Given a headerless CSV file, the awk and jsawk scripts are on par. The script to return the sum of stock prices is, unfortunately, more complicated than its awk equivalent: [sourcecode language="bash"]jsawk -n -v sum -b ‘sum = 0.0’ ‘sum = sum + parseFloat(this.Price)’ -a ‘out(sum)'[/sourcecode]. Let’s break it down:
- We’ve already seen the -n parameter that tells jsawk to ignore the json tranformation.
- The -v sum argument defines a global variable called sum. You can optionally pass an initial value like -v sum=0, which initializes the variable to the string "0".
- The -b <script> argument defines a script to be run once, before processing the input.
- The -a <script> operator defines a script to be run once, after processing the input.
The before-script reinitializes the sum variable as a number. If we don’t do that, the plus-operator in the main script performs string concatenation rather than numeric addition. The main script adds each stock price to the sum variable. The after script prints the result.
What about objects then?
I mentioned earlier that the analogy of jsawk to awk suffers when the root element of the JSON document is an object. To say it suffers is perhaps unfair, it’s more accurate to say that an input consisting of just one record is simply less interesting than an input of multiple records, awk and jsawk alike.
Let’s revisit the sample data from my post about scripting with Node.js. You’ll recall that we had to tranform this CouchDB view:
[sourcecode language="javascript"]
{"total_rows":22,"offset":0,"rows":[
{"id":"123456","key":"123456","value":{"rev":"1-acf7f39495a2cd4465be504cd435629e"}},
{"id":"123457","key":"123457","value":{"rev":"1-b67df18954264dbb65bf341294e572a5"}}
...20 more...
]}
[/sourcecode]
…into this bulk-api message:
[sourcecode language="javascript"]
{ "docs":[
{"_id":"123456","_rev":"1-acf6f39495a2cd4465be504cd435629e","_deleted":true}
{"_id":"123457","_rev":"1-b67df18954264dbb65be341294e572a5","_deleted":true}
...20 more...
]}
[/sourcecode]
This looks simple enough, however, if you feed this input through jsawk you’ll find that the script is hit exactly once, for the entire view object. An easy way to work around this is to put jsawk into your shell pipeline twice:
[sourcecode language="bash"]jsawk ‘return this.rows’ | jsawk -n ‘out("record")'[/sourcecode]
There may be a better way to this and I suspect the put(record) function may be intended for this purpose, but I couldn’t figure out how it works.
Using the double jsawk technique, my way to transform the CouchDB view into the bulk-api message is this:
[sourcecode language="bash"]jsawk 'return this.rows' | \
jsawk 'return {_id:this.id,_rev:this.value.rev,_deleted:true}' \
-a 'return {docs:this}'[/sourcecode]
The main script of the second jsawk command constructs an object literal for each record in the input with the id and revision copied and a hard-coded deletion marker thrown in for good measure. The after-script wraps the generated JSON array inside an object literal.
This works well and it’s a lot less code to maintain.