Contrary to what you might have expected from the title, this is not an introduction to the marvelous UNIX hexdump tool. If that’s what you were searching for, then you’d better pull up the man page first, because I’m not going to spend a word on it. This article is about something better…
Before going into that, first a short message to anyone who feels like XML or JSON are the only data encoding formats we will ever need:
WAKE UP!!! Only a fraction of the data on your hard disk is encoded as XML or JSON. A lot of it is binary encoded data. How would you feel about not having images, mp3, videos, PDFs and Word documents (well that would be ok, I guess)? If you really think that text-based formats are all you will ever need, then you can kiss all of that goodbye. So, with that out of the way, now read on.
Binary data is here to stay. In fact, a large part of the world revolves around binary encoded data. (The Internet, for instance: TCP/IP packets are not XML documents, although there has been an RFC suggesting something like that in the past.) And with the rise of NoSQL, the importance of having better ways of dealing with binary encoded data has probably only increased. (Many of these databases store arrays of bytes, and leave it up to you to reconstruct that into something useful.)
The agony!
Unfortunately, dealing with binary encoded data is not all that easy. Even understanding what is getting decoded and why can be extremely hard. On the plus side, we do have tools such as hexdump and hexl-mode (Emacs), but the output of hexdump -C will only be meaningful to a few people. The rest of us still need to work really hard to understand what on earth those bytes represent.
Here is the top of a Java class file, presented with hexdump -C, as an example:
00000000  ca fe ba be 00 00 00 31  00 2b 0a 00 0a 00 1d 09  |.......1.+......|
00000010  00 09 00 1e 09 00 09 00  1f 07 00 20 0a 00 04 00  |........... ....|
00000020  1d 0a 00 04 00 21 0a 00  04 00 22 0a 00 04 00 23  |.....!...."....#|
00000030  07 00 24 07 00 25 01 00  03 66 6f 6f 01 00 12 4c  |..$..%...foo...L|
00000040  6a 61 76 61 2f 6c 61 6e  67 2f 53 74 72 69 6e 67  |java/lang/String|
00000050  3b 01 00 06 4c 4f 4e 44  4f 4e 01 00 0d 43 6f 6e  |;...LONDON...Con|
00000060  73 74 61 6e 74 56 61 6c  75 65 08 00 26 01 00 03  |stantValue..&...|
00000070  62 61 72 01 00 01 49 01  00 06 3c 69 6e 69 74 3e  |bar...I...<init>|
00000080  01 00 16 28 4c 6a 61 76  61 2f 6c 61 6e 67 2f 53  |...(Ljava/lang/S|
00000090  74 72 69 6e 67 3b 49 29  56 01 00 04 43 6f 64 65  |tring;I)V...Code|
As I said, it does provide some help, but not a lot of help.
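To give an idea of the manual work involved, here is a minimal sketch that decodes just the first ten bytes of that dump by hand. The field layout (a 4-byte magic number, followed by a 2-byte minor version, major version and constant pool count, all big-endian) comes from the class file format. The ClassFileHeader class and its parse method are names I made up for this illustration.

```java
import java.nio.ByteBuffer;

/** Hypothetical holder for the first four fields of a Java class file. */
final class ClassFileHeader {
    final long magic;            // u4, always 0xCAFEBABE
    final int minorVersion;      // u2
    final int majorVersion;      // u2
    final int constantPoolCount; // u2

    ClassFileHeader(long magic, int minorVersion, int majorVersion, int constantPoolCount) {
        this.magic = magic;
        this.minorVersion = minorVersion;
        this.majorVersion = majorVersion;
        this.constantPoolCount = constantPoolCount;
    }

    /** Decodes the header fields from the start of the given bytes. */
    static ClassFileHeader parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data); // ByteBuffer is big-endian by default
        long magic = buf.getInt() & 0xFFFFFFFFL; // mask to read the int as unsigned
        if (magic != 0xCAFEBABEL) {
            throw new IllegalArgumentException("Not a class file");
        }
        int minor = buf.getShort() & 0xFFFF;
        int major = buf.getShort() & 0xFFFF;
        int constantPoolCount = buf.getShort() & 0xFFFF;
        return new ClassFileHeader(magic, minor, major, constantPoolCount);
    }

    public static void main(String[] args) {
        // The first row of the hexdump shown above.
        byte[] firstRow = {
            (byte) 0xca, (byte) 0xfe, (byte) 0xba, (byte) 0xbe, 0x00, 0x00, 0x00, 0x31,
            0x00, 0x2b, 0x0a, 0x00, 0x0a, 0x00, 0x1d, 0x09
        };
        ClassFileHeader header = parse(firstRow);
        System.out.printf("magic=0x%08X minor=%d major=%d constant_pool_count=%d%n",
                header.magic, header.minorVersion, header.majorVersion,
                header.constantPoolCount);
    }
}
```

For the dump above, this gives major version 49 (Java 5) and a constant pool count of 43. And that still leaves the entire constant pool, which is where all of the remaining bytes shown belong, to be picked apart by hand.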
But what if
But wait, if the computer knows how to decode this into something useful, why isn’t it able to provide some more insight into what I’m looking at? I mean, it’s quite a lot of work to interpret the above. Why isn’t a computer capable of annotating it, in order to help me understand what it says here?
The good news is: now there is a way to get that. Last year at OOPSLA, I presented Preon (now a codehaus project), a framework that captures the mapping between the in-memory representation and the binary encoded representation in a declarative way. Preon is capable of giving you a decoder and encoder for free, and it was already able to generate documentation on the encoded representation.
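To make "declarative" a bit more concrete, here is a toy sketch of the general idea. It is not Preon's API: the @Unsigned annotation, its order and bytes attributes, and the TinyCodec and AnnotatedHeader classes are all invented for this illustration, and it only handles big-endian unsigned integers stored in long fields. The point is merely that the class describes the layout, and a generic decoder does the rest.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical annotation (not part of Preon): a big-endian unsigned integer. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Unsigned {
    int order(); // position of the field in the encoded form
    int bytes(); // number of bytes the field occupies (at most 7 here)
}

/** The mapping is stated once, on the data structure itself. */
final class AnnotatedHeader {
    @Unsigned(order = 0, bytes = 4) long magic;
    @Unsigned(order = 1, bytes = 2) long minorVersion;
    @Unsigned(order = 2, bytes = 2) long majorVersion;
    @Unsigned(order = 3, bytes = 2) long constantPoolCount;
}

/** Hypothetical generic decoder driven entirely by the annotations. */
final class TinyCodec {
    static <T> T decode(Class<T> type, byte[] data) throws ReflectiveOperationException {
        Constructor<T> constructor = type.getDeclaredConstructor();
        constructor.setAccessible(true);
        T target = constructor.newInstance();

        // Reflection does not guarantee field order, so sort by the declared order.
        List<Field> bound = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            if (field.isAnnotationPresent(Unsigned.class)) {
                bound.add(field);
            }
        }
        bound.sort(Comparator.comparingInt(f -> f.getAnnotation(Unsigned.class).order()));

        ByteBuffer buf = ByteBuffer.wrap(data);
        for (Field field : bound) {
            int length = field.getAnnotation(Unsigned.class).bytes();
            long value = 0;
            for (int i = 0; i < length; i++) {
                value = (value << 8) | (buf.get() & 0xFF);
            }
            field.setAccessible(true);
            field.setLong(target, value);
        }
        return target;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        byte[] start = {(byte) 0xca, (byte) 0xfe, (byte) 0xba, (byte) 0xbe,
                        0x00, 0x00, 0x00, 0x31, 0x00, 0x2b};
        AnnotatedHeader header = decode(AnnotatedHeader.class, start);
        System.out.println("major version: " + header.majorVersion);
    }
}
```

Because the layout lives in metadata rather than in hand-written parsing code, the same metadata could in principle drive an encoder or a documentation generator as well.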
But now, there is something new: in the next release, you will have the ability, while you are decoding data, to output an HTML document that clearly explains how the in-memory data structure is represented in its encoded representation! Here is an example for exactly the same Java class file as shown before (click to jump to the live page):
As you can see, you can just move your mouse around over the page. Once it hits a byte that it knows about, it will highlight the bytes that together make up a data element, and show metadata on the side, including the value decoded from these bytes.
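One way to picture what such a page needs is a list of byte ranges recorded while decoding, each with a name and a decoded value. The sketch below is only my illustration of that data model, not how Preon implements it; Span, AnnotatingReader, readUnsigned and spanAt are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical record of which bytes make up one decoded data element. */
final class Span {
    final int offset;
    final int length;
    final String name;
    final long value;

    Span(int offset, int length, String name, long value) {
        this.offset = offset;
        this.length = length;
        this.name = name;
        this.value = value;
    }
}

/** Hypothetical reader that remembers a Span for everything it decodes. */
final class AnnotatingReader {
    private final byte[] data;
    private int position = 0;
    final List<Span> spans = new ArrayList<>();

    AnnotatingReader(byte[] data) {
        this.data = data;
    }

    /** Reads a big-endian unsigned integer and records where it came from. */
    long readUnsigned(String name, int length) {
        long value = 0;
        for (int i = 0; i < length; i++) {
            value = (value << 8) | (data[position + i] & 0xFF);
        }
        spans.add(new Span(position, length, name, value));
        position += length;
        return value;
    }

    /** Returns the element covering the given byte offset, or null if none is known. */
    Span spanAt(int offset) {
        for (Span span : spans) {
            if (offset >= span.offset && offset < span.offset + span.length) {
                return span;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] start = {(byte) 0xca, (byte) 0xfe, (byte) 0xba, (byte) 0xbe,
                        0x00, 0x00, 0x00, 0x31, 0x00, 0x2b, 0x0a, 0x00};
        AnnotatingReader reader = new AnnotatingReader(start);
        reader.readUnsigned("magic", 4);
        reader.readUnsigned("minor_version", 2);
        reader.readUnsigned("major_version", 2);
        reader.readUnsigned("constant_pool_count", 2);

        // "Hovering" over byte 7 finds the element that byte belongs to.
        Span hit = reader.spanAt(7);
        System.out.println(hit.name + " = " + hit.value
                + " (bytes " + hit.offset + ".." + (hit.offset + hit.length - 1) + ")");
    }
}
```

Hovering over a byte then amounts to a lookup like spanAt, with the page highlighting the whole range of the span it finds and showing its name and value on the side.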
Done?
Now, this is far from done. There is plenty more that will be added, such as a better presentation of the metadata, and an explanation showing the dependencies between the decoded data and other parts of the encoded representation. But it does show one of Preon’s strengths: by capturing the metadata on the mapping between the encoded representation and the in-memory representation, Preon is capable of explaining what is getting decoded. None of this was present in Preon before. Everything has been added to the framework without changing a single existing line of code.
The version of Preon capable of doing this currently resides on GitHub.