In my previous posts I talked about data from a high-level perspective and introduced Apache Avro, a data serialization system. This time we are going to get hands-on with Apache Avro and look at schemas.
Introducing the domain
We need a domain, and because this is a blog, let's keep it simple. Let's create a domain that consists of a single person entity. The person entity will change over time, and we will have three versions. The initial version, v1.Person, has a single name field. In v2.Person we have added an age field. In v3.Person we have changed the ordering of the fields.
We can declare this domain in Scala as follows:
object v1 { case class Person(name: String = "") }
object v2 { case class Person(name: String = "", age: Int = 0) }
object v3 { case class Person(age: Int = 0, name: String = "") }
Avro Schemas
Apache Avro schema definitions are declared in JSON. When we generate schemas from these case classes, we get the following Avro schemas.
v1.Person:
{
"type" : "record",
"name" : "Person",
"namespace" : "binxio",
"fields" : [ {
"name" : "name",
"type" : "string",
"default" : ""
} ]
}
v2.Person:
{
"type" : "record",
"name" : "Person",
"namespace" : "binxio",
"fields" : [ {
"name" : "name",
"type" : "string",
"default" : ""
}, {
"name" : "age",
"type" : "int",
"default" : 0
} ]
}
v3.Person:
{
"type" : "record",
"name" : "Person",
"namespace" : "binxio",
"fields" : [ {
"name" : "age",
"type" : "int",
"default" : 0
}, {
"name" : "name",
"type" : "string",
"default": ""
} ]
}
Serializing
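When we serialize a record to Avro we get the following Avro Datums. We represent the Avro Datums as hexadecimal strings. You can see that the name ‘Dennis’ is encoded as ‘0C44656E6E6973’: the length prefix ‘0C’ (the length 6, zig-zag encoded) followed by the six UTF-8 bytes of the name. The number ‘42’ is encoded as ‘54’ due to zig-zag encoding. Note that the datum contains no field names or types, only the values in schema order.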
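The byte values described above can be reproduced by hand, without the Avro library. What follows is a minimal sketch, assuming only the zig-zag and variable-length integer rules of the Avro binary encoding. The names AvroEncodingSketch, zigZag, varInt, string and hex are made up for this illustration and are not part of any Avro API.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

object AvroEncodingSketch {
  // Zig-zag maps signed numbers to unsigned ones so that small
  // magnitudes stay small: 0 -> 0, -1 -> 1, 1 -> 2, 42 -> 84 (0x54).
  def zigZag(n: Long): Long = (n << 1) ^ (n >> 63)

  // Variable-length encoding: 7 bits per byte, least significant group
  // first, with the high bit set while more bytes follow.
  def varInt(n: Long): Array[Byte] = {
    var v = zigZag(n)
    val out = ArrayBuffer.empty[Byte]
    while ((v & ~0x7FL) != 0) {
      out += ((v & 0x7F) | 0x80).toByte
      v >>>= 7
    }
    out += v.toByte
    out.toArray
  }

  // A string is its UTF-8 byte length (encoded like any other long)
  // followed by the UTF-8 bytes themselves.
  def string(s: String): Array[Byte] = {
    val utf8 = s.getBytes(StandardCharsets.UTF_8)
    varInt(utf8.length) ++ utf8
  }

  // Upper-case hexadecimal rendering, as used in this post.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xFF}%02X").mkString
}
```

With these helpers, a record is simply the concatenation of its field values in schema order, for example hex(string("Dennis") ++ varInt(42)) for v2.Person and hex(varInt(42) ++ string("Dennis")) for v3.Person.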
v1.Person:
v1.Person("Dennis").toAvroBinary().hex
"0C44656E6E6973"
v2.Person:
v2.Person("Dennis", 42).toAvroBinary().hex
"0C44656E6E697354"
v3.Person:
v3.Person(42, "Dennis").toAvroBinary().hex
"540C44656E6E6973"
Deserializing
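When we deserialize an Avro Datum, we need to provide both the writer schema and the reader schema. In the examples, the first type parameter of parseAvroBinary is the reader and the second is the writer.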
v1.Person:
"0C44656E6E6973".parseAvroBinary[v1.Person, v1.Person]
v1.Person("Dennis")
v2.Person:
"0C44656E6E697354".parseAvroBinary[v2.Person, v2.Person]
v2.Person("Dennis", 42)
v3.Person:
"540C44656E6E6973".parseAvroBinary[v3.Person, v3.Person]
v3.Person(42, "Dennis")
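The writer schema is needed because the datum itself carries no field names or types, so the bytes only make sense when read in the writer's field order. The following is a minimal sketch of decoding by hand, assuming the same zig-zag and variable-length rules used when serializing. AvroDecodingSketch, Cursor, fromHex, readLong and readString are hypothetical names for this illustration, not Avro APIs.

```scala
import java.nio.charset.StandardCharsets

object AvroDecodingSketch {
  // Turn a hexadecimal string such as "0C44" into its bytes.
  def fromHex(hex: String): Array[Byte] =
    hex.grouped(2).map(h => Integer.parseInt(h, 16).toByte).toArray

  // A small cursor that reads values from the front of a datum.
  final class Cursor(bytes: Array[Byte]) {
    private var pos = 0

    // Read a variable-length integer, then undo the zig-zag mapping.
    def readLong(): Long = {
      var acc = 0L
      var shift = 0
      var more = true
      while (more) {
        val b = bytes(pos) & 0xFF
        pos += 1
        acc |= (b & 0x7FL) << shift
        shift += 7
        more = (b & 0x80) != 0
      }
      (acc >>> 1) ^ -(acc & 1)
    }

    // Read a length-prefixed UTF-8 string.
    def readString(): String = {
      val len = readLong().toInt
      val s = new String(bytes, pos, len, StandardCharsets.UTF_8)
      pos += len
      s
    }
  }

  // v2 writer order: name first, then age.
  def readV2(hex: String): (String, Int) = {
    val c = new Cursor(fromHex(hex))
    val name = c.readString()
    val age = c.readLong().toInt
    (name, age)
  }

  // v3 writer order: age first, then name.
  def readV3(hex: String): (Int, String) = {
    val c = new Cursor(fromHex(hex))
    val age = c.readLong().toInt
    val name = c.readString()
    (age, name)
  }
}
```

Reading a v3 datum with the v2 field order would misinterpret the bytes, which is why the reader always has to know which schema the data was written with.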
Schema Evolution
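The schemas that we have defined all have default values for their fields. This means that the schemas are fully compatible: data written with any version can be read with any other version. Each Avro Datum below is written with one version of Person. When we instruct the system that we want a different representation, Avro resolves the writer schema against the reader schema and provides the requested schema version for consumption. Fields are matched by name, fields missing from the datum get their default value, and fields unknown to the reader are skipped.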
Writer: v1.Person => Reader: v2.Person:
"0C44656E6E6973".parseAvroBinary[v2.Person, v1.Person]
v2.Person("Dennis", 0)
Writer: v1.Person => Reader: v3.Person:
"0C44656E6E6973".parseAvroBinary[v3.Person, v1.Person]
v3.Person(0, "Dennis")
Writer: v3.Person => Reader: v1.Person:
"540C44656E6E6973".parseAvroBinary[v1.Person, v3.Person]
v1.Person("Dennis")
Writer: v3.Person => Reader: v2.Person:
"540C44656E6E6973".parseAvroBinary[v2.Person, v3.Person]
v2.Person("Dennis", 42)
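Writer: v4.Person => Reader: v1.Person (v4.Person is not declared in this post; its datum carries extra fields after the name, which the v1 reader skips):
"540C44656E6E697302021C4C61617065727376656C64203237021248696C76657273756D020E31323133205642".parseAvroBinary[v1.Person, v4.Person]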
v1.Person("Dennis")
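The toAvroBinary and parseAvroBinary helpers used in this post hide the underlying library calls. As a rough sketch of what such a resolution can look like with the plain Avro Java API from Scala, assuming the org.apache.avro:avro library is on the classpath: SchemaResolutionSketch and its write, read and hex methods are names made up for this illustration, and this is not necessarily how the helpers above are implemented.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object SchemaResolutionSketch {
  private val nameField = """{"name":"name","type":"string","default":""}"""
  private val ageField  = """{"name":"age","type":"int","default":0}"""

  // A fresh parser per schema, because a single parser refuses to
  // define the same full name (binxio.Person) twice.
  private def personSchema(fields: String): Schema =
    new Schema.Parser().parse(
      s"""{"type":"record","name":"Person","namespace":"binxio","fields":[$fields]}""")

  val v1: Schema = personSchema(nameField)
  val v3: Schema = personSchema(s"$ageField,$nameField")

  // Serialize a generic record with its own (writer) schema.
  def write(schema: Schema, record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  // Deserialize with both schemas; Avro resolves writer against reader.
  def read(writer: Schema, reader: Schema, bytes: Array[Byte]): GenericRecord = {
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    new GenericDatumReader[GenericRecord](writer, reader).read(null, decoder)
  }

  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xFF}%02X").mkString

  // Written as v1, read as v3: age falls back to its default.
  val dennisV1 = new GenericData.Record(v1)
  dennisV1.put("name", "Dennis")
  val v1AsV3: GenericRecord = read(v1, v3, write(v1, dennisV1))

  // Written as v3, read as v1: the age field is skipped.
  val dennisV3 = new GenericData.Record(v3)
  dennisV3.put("age", Int.box(42))
  dennisV3.put("name", "Dennis")
  val v3AsV1: GenericRecord = read(v3, v1, write(v3, dennisV3))
}
```

Generic records are used here to keep the sketch short. Strings come back as Avro's Utf8 type, so call toString on the value when a Scala String is needed.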
Cross Domain Evolution
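Apache Avro can evolve schemas across domains if necessary. v1.Cat is not declared above; judging by the result, it is a different record type with a single name field, just like v1.Person.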
Writer: v1.Person => Reader: v1.Cat:
"0C44656E6E6973".parseAvroBinary[v1.Cat, v1.Person]
v1.Cat("Dennis")
Apparently I’m also a cat!
Conclusion
In this blog we have created a simple domain, created Avro schemas for the domain, serialized records to Avro Datums, evolved schemas and even performed a cross-domain evolution.
In this blog we’ve used Scala, a programming language for the JVM, to show the examples. Apache Avro has language binding support for C, C++, C#, Go, Haskell, Java, Perl, PHP, Python, Ruby, Scala, TypeScript and more.
The Avro Datums that we’ve generated in this blog can be read by other systems as well, because Apache Avro is an open data serialization system.
Apache Avro is used by high-volume, high-throughput data processing systems like Apache Kafka, Apache Hadoop and Apache Spark.