Generate your Avro Schema – Automatically!

Avro Logo
In a previous blog post, I explained how StreamSets Data Collector (SDC) can work with Apache Kafka and Confluent Schema Registry to handle data drift via Avro schema evolution. In that blog post, I mentioned SDC's Schema Generator processor; today I'll explain how you can use the Schema Generator to automatically create Avro schemas.

We'll use our old friend the Taxi tutorial pipeline as a basis, modifying it to write Avro-formatted data rather than a delimited data format. We'll look at an initial naive implementation, just dropping the Schema Generator into the pipeline, then see how, with a little more work, we get a much better result.

Creating an Avro Schema

I'm starting with the basic Taxi tutorial pipeline. If you have not yet completed the SDC tutorial, I urge you to do so – it really is the quickest, easiest way to get up to speed creating dataflow pipelines.

For simplicity, let's swap the Hadoop FS destination for Local FS, and set the data format to Avro. You'll notice that we need to specify the Avro schema, somehow:

Taxi Transactions To Local FS

Let's insert the Schema Generator processor just before the Local FS destination, and give the schema a suitable name:

Add Schema Generator

Notice that the Schema Generator processor puts the schema in a header attribute named avroSchema. We can now configure the Local FS destination to use this generated schema:

Set Schema Location

We can use preview to get some insight into what will happen when the pipeline runs. Preview will read the first few records from the origin, process them in the pipeline, but not, by default, write them to the destination. Enabling ‘Show Record/Field Header' will allow us to see the Avro schema:

Preview Dialog

Selecting the Schema Generator and drilling into the first record, we can see the Avro schema:

Preview Avro Schema

Let's reformat the Avro schema so it's more readable. I’ve removed most of the fields so we can focus on the key points:

Converting Field Types

The Schema Generator has created an Avro schema, but it's likely not going to be very useful. Delimited input data (for example, data from CSV files) doesn't have any type information, so all the fields are strings. It would be way more useful to have those datetimes as the corresponding type, the amount and coordinate fields as decimals, and it looks like trip_time_in_secs and passenger_count can be integers.

We can use the Field Type Converter processor to do the job:

Field Type Converter

Previewing again, the schema looks much better, but we still have a little work to do. Notice that the Field Type Converter ‘guesses' the precision for the decimal fields based on the values in each individual record:

The precision attributes of the generated schemas will vary from record to record, but the schema needs to be uniform across all of the data. We can use an Expression Evaluator to set the field headers to override the generated precision attribute with sensible values for the entire data set:

Field Headers

One last preview, and we can see that the schema is in good shape!

Let's run the pipeline and take a look at the output. I used Avro Tools to verify the schema and records in the output file from the command line (here's a useful primer on Avro Tools).

As expected, that matches what we saw in the pipeline preview. Let's take a look at the data:

The strings and integers look fine, but what's happened to the datetime and amount fields? Avro defines Logical Types for timestamp-millis, decimal and other derived types, specifying the underlying Avro type for serialization and additional attributes. Timestamps are represented as a long number of milliseconds from the unix epoch, 1 January 1970 00:00:00.000 UTC, while decimals are encoded as a sequence of bytes containing the two's-complement representation of the unscaled integer value in big-endian byte order. The decimal fields in particular look a bit strange in their JSON representation, but rest assured that the data is stored in full fidelity in the actual Avro encoding!


The Schema Generator processor is a handy tool to save us having to write Avro schemas by hand, and a key component of the StreamSets Apache Sqoop Import Tool, but there is one caveat. The processor has no persistent state, so it can't track the schema between pipeline restarts to ensure that the evolving schema follows Avro's schema evolution rules. For this reason, you should not use the Schema Generator with drifting data – that is, when the incoming record structure may change over time.

Share This Article :

Related Posts

Receive Updates

Receive Updates

Join our mailing list to receive the latest news from StreamSets.

You have Successfully Subscribed!

Pin It on Pinterest