Separating schema definitions in Avro

Daniel

November 9, 2019

Introduction

In Apache Avro, a schema defines the fields and data types allowed for a record. Schemas can contain complex types, which are themselves represented as records. When the same record is used across multiple schemas, you can end up with multiple definitions of that record. This makes the schemas harder to maintain, because each change has to be made in several places. We used the Avro Maven plugin to generate classes from these schemas while keeping only one definition of each record.

The Problem

{
 "type": "record",
 "name": "Pet",
 "namespace": "com.intergral.example.avro",
 "fields": [
   {
     "name": "name",
     "type": "string"
   },
   {
     "name": "toys",
     "type": {
       "type": "array",
       "items": {
         "type": "record",
         "name": "Toy",
         "namespace": "com.intergral.example.avro",
         "fields": [
           {
             "name": "name",
             "type": "string"
           },
           {
             "name": "price",
             "type": "long"
           }
         ]
       },
       "java-class": "java.util.List"
     }
   }
 ]
}

In the schema above, we have a record called Pet. Pet declares that it contains a list of Toys, another complex type. This is convenient: when we generate the Pet class, the Toy class is generated for us as well.

However, if we now want a Shop schema that also contains a list of Toys, we will have defined Toy twice. Upon generation, the generated Toy classes will overwrite each other if they are in the same namespace. We now either have to use a new namespace for every Toy definition, or ensure that whenever Toy is updated, it is also updated in every schema that defines it.

Solution

To solve this, we can keep a single definition of Toy and reuse it wherever it is needed.

First, we split each record (Toy in this case) out into its own file. The Pet schema can then refer to Toy by its full name (namespace plus record name), as shown below:

{
 "type": "record",
 "name": "Pet",
 "namespace": "com.intergral.example.avro",
 "fields": [
   {
     "name": "name",
     "type": "string"
   },
   {
     "name": "toys",
     "type": {
       "type": "array",
       "items": "com.intergral.example.avro.Toy",
       "java-class": "java.util.List"
     }
   }
 ]
}
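For completeness, the standalone Toy definition could look like the sketch below. It simply lifts the inline record from the original Pet schema into its own file, assumed here to be src/main/avro/Toy.avsc (the path used in the plugin configuration).

```json
{
  "type": "record",
  "name": "Toy",
  "namespace": "com.intergral.example.avro",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "price",
      "type": "long"
    }
  ]
}
```

The namespace and name together form the full name com.intergral.example.avro.Toy, which is what other schemas use to refer to this record.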

Finally, we have to tell the Avro Maven plugin which order to process the files in, since Toy must already be known before Pet can be generated. We can do this by adding an imports list to the plugin configuration. Anything listed there will be processed first, so we can use it for common definitions.

<sourceDirectory>${project.basedir}/src/main/avro</sourceDirectory>
<imports>
  <import>${project.basedir}/src/main/avro/Toy.avsc</import>
</imports>

When we now run the build using Maven, we can generate both Pet and Shop, which both use Toy, while only having one definition of the Toy schema.
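As an illustration, a Shop schema that reuses the shared Toy definition might look like the following sketch. The post does not show the Shop schema, so its fields (a name and a list of toys) are assumptions made for this example. Only the reference to com.intergral.example.avro.Toy mirrors the Pet schema above.

```json
{
  "type": "record",
  "name": "Shop",
  "namespace": "com.intergral.example.avro",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "toys",
      "type": {
        "type": "array",
        "items": "com.intergral.example.avro.Toy",
        "java-class": "java.util.List"
      }
    }
  ]
}
```

Because Toy.avsc is listed as an import, this file needs no inline copy of Toy, and a change to Toy is picked up by both Pet and Shop on the next build.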

Conclusion

Using the Avro Maven plugin, we now reuse a single schema definition in multiple places. While this makes the schemas more maintainable, it makes the build more complicated, as we now have to define what to generate first. For us, this trade-off was worth it, as we can now ensure consistency between Avro-generated classes.

Daniel

I am a Software Engineer working as part of the nerd.vision team, mainly working on the backend systems and agents. When I'm not squashing bugs, I enjoy travelling, gaming and experiencing new foods and cultures.