Schema Generator

The Schema Generator processor generates a schema based on the structure of a record and writes the schema into a record header attribute. The Schema Generator generates Avro schemas at this time.

You might use the Schema Generator in a pipeline to generate the latest version of the Avro schema before writing records to destination systems.

When you configure a Schema Generator processor, you can specify the namespace and description for the Avro schema. You can specify whether schema fields should allow nulls and whether schema fields should default to null. You can specify default values for most Avro primitive types, and you can allow the processor to use a larger data type for Avro types without a direct equivalent.

You can specify the names of the precision and scale attributes for decimal values. You can also configure a default precision and scale for decimal fields that lack that information or that have an invalid precision or scale.

When appropriate, you can configure the Schema Generator to cache a number of schemas, and to apply the schemas to records based on the expression defined in the Cache Key Expression property.

Using the avroSchema Header Attribute

The Schema Generator processor writes Avro schemas to an avroSchema record header attribute by default. All Avro-processing origins also write the Avro schema of incoming records to the avroSchema header attribute. Also, any destination that writes Avro data can use the schema in the avroSchema header attribute.

When processing Avro data, one logical workflow is to add the Schema Generator immediately before the destination in a pipeline. This allows the processor to generate a new Avro schema before writing the data to destination systems.

If you want to retain an earlier version of the avroSchema, you might use an Expression Evaluator processor before the Schema Generator to move the existing schema in the avroSchema header attribute to a different header attribute, such as avroSchema_previous.
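As a rough illustration of that step, the following Python sketch models record header attributes as a plain dictionary and copies the current schema to avroSchema_previous before a new schema is written. The dictionary and the function name preserve_previous_schema are assumptions made for illustration, not Data Collector APIs. In a pipeline, the Expression Evaluator would do this with a header attribute expression such as ${record:attribute('avroSchema')}.

```python
def preserve_previous_schema(headers, source="avroSchema", target="avroSchema_previous"):
    """Copy the existing schema header attribute so a later stage can overwrite it.

    `headers` is a plain dict standing in for record header attributes
    (an assumption for illustration, not a Data Collector API).
    """
    updated = dict(headers)  # leave the caller's dict untouched
    if source in updated:
        updated[target] = updated[source]
    return updated


headers = {"avroSchema": '{"type":"record","name":"Old","fields":[]}'}
headers = preserve_previous_schema(headers)

# A downstream Schema Generator would now overwrite avroSchema,
# while avroSchema_previous keeps the earlier version.
headers["avroSchema"] = '{"type":"record","name":"New","fields":[]}'
print(headers["avroSchema_previous"])
```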

Generated Avro Schema

The Avro schema that the Schema Generator creates includes the following information:
  • Schema type set to "record".
  • Schema name based on the Schema Name property.
  • Namespace based on the Namespace property, when configured.
  • Schema description in the doc field based on the Doc property, when configured.
  • A map of field names with related attributes based on the record schema and related properties defined in the stage, such as whether fields can include null values.

For example, the following Avro schema is generated when you set the Schema Name property to MyAvroSchema and omit the optional Namespace and Doc properties:

{"type":"record","name":"MyAvroSchema","namespace":"","doc":"","fields":[{"name":"name","type":["null","string"],"default":null},{"name":"id","type":["null","int"],"default":null},{"name":"instock","type":["null","boolean"],"default":null},{"name":"cost","type":["null",{"type":"bytes","logicalType":"decimal","precision":10,"scale":2}],"default":null}]}

The record described by this schema includes the following fields:
  • name - A string field.
  • id - An integer field.
  • instock - A boolean field.
  • cost - A decimal field.

In this example, the processor is configured to allow nulls in schema fields and to use null as the default value.
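To make the shape of such a schema concrete, here is a minimal Python sketch that derives an Avro record schema from a flat record, with nulls allowed and null as the default. It is not the processor's implementation: the function generate_schema, the AVRO_TYPES mapping, and the sample record are assumptions for illustration, and decimal fields are left out to keep the sketch short.

```python
import json

# Hypothetical mapping from Python value types to Avro primitive type names.
AVRO_TYPES = {bool: "boolean", int: "int", float: "double", str: "string"}


def generate_schema(record, name, nullable=True, default_to_null=True):
    """Build an Avro record schema (as a dict) from a flat record."""
    fields = []
    for field_name, value in record.items():
        avro_type = AVRO_TYPES[type(value)]
        field = {"name": field_name}
        if nullable:
            # A nullable field is a union of the null type and the field type.
            field["type"] = ["null", avro_type]
            if default_to_null:
                field["default"] = None
        else:
            field["type"] = avro_type
        fields.append(field)
    return {"type": "record", "name": name, "namespace": "", "doc": "", "fields": fields}


record = {"name": "widget", "id": 7, "instock": True}
schema = generate_schema(record, "MyAvroSchema")
print(json.dumps(schema, separators=(",", ":")))
```

Placing "null" first in the union matters: the Avro specification requires a union field's default value to match the first branch of the union, so a null default is only valid when null comes first.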

Caching Schemas

You can configure the Schema Generator to cache a number of schemas, and to apply the schemas to records based on the expression defined in the Cache Key Expression property.

Caching schemas can improve performance when a set of records can logically use the exact same schema, and when the records include a value that can be used to determine the schema to use.

For example, say your pipeline uses the JDBC Multitable Consumer to read from multiple database tables. The origin writes the names of the tables used to generate each record to the jdbc.tables record header attribute. Assume that all data in each record comes from a single table, so the attribute holds one table name.

To reuse the cached schema associated with each record's table, you can configure the Cache Key Expression property as follows: ${record:attribute('jdbc.tables')}.

Warning: Use schema caching with care - applying an incorrect schema to a record can cause errors when writing to destination systems.
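The caching behavior described above can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not the processor's implementation: the SchemaCache class, the oldest-first eviction, and the stand-in schema function are all hypothetical, and records are modeled as dictionaries with a "headers" entry that holds jdbc.tables.

```python
class SchemaCache:
    """Tiny schema cache keyed by the evaluated cache key (illustrative only)."""

    def __init__(self, max_size, generate):
        self.max_size = max_size
        self.generate = generate   # function: record -> schema
        self._schemas = {}         # dicts keep insertion order
        self.generated = 0         # how many times a schema was built

    def schema_for(self, cache_key, record):
        if cache_key in self._schemas:
            return self._schemas[cache_key]   # reuse without inspecting the record
        schema = self.generate(record)
        self.generated += 1
        if self._schemas and len(self._schemas) >= self.max_size:
            # Evict the oldest entry. This policy is an assumption of the sketch.
            del self._schemas[next(iter(self._schemas))]
        self._schemas[cache_key] = schema
        return schema


def field_names_schema(record):
    # Stand-in "schema": the sorted field names of the record.
    return tuple(sorted(record["fields"]))


cache = SchemaCache(max_size=10, generate=field_names_schema)
records = [
    {"headers": {"jdbc.tables": "orders"}, "fields": {"id": 1, "total": 9.5}},
    {"headers": {"jdbc.tables": "orders"}, "fields": {"id": 2, "total": 3.0}},
    {"headers": {"jdbc.tables": "customers"}, "fields": {"id": 1, "name": "Ann"}},
]
for rec in records:
    key = rec["headers"]["jdbc.tables"]   # value the cache key expression would return
    cache.schema_for(key, rec)

print(cache.generated)  # schemas built for three records
```

The sketch also shows why the warning applies: a cache hit returns the stored schema without looking at the record, so a record that carries the key "orders" but has different fields would still receive the schema cached for "orders".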

Configuring a Schema Generator

Configure a Schema Generator processor to generate a schema for each record and write the schema to a record header attribute.
  1. In the Properties panel, on the General tab, configure the following properties:
    General Property - Description
    Name - Stage name.
    Description - Optional description.
    Required Fields - Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions - Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error - Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline. Not valid for cluster pipelines.
  2. On the Schema tab, configure the following properties:
    Schema Property - Description
    Schema Type - Type of schema to generate. The processor generates Avro schemas at this time.
    Schema Name - Name to use for the resulting schema.
    Header Attribute - Header attribute to contain the resulting schema.

    Default is avroSchema.

    Destinations can use the avroSchema header attribute to write Avro data when you configure the destination's Avro Schema Location property to use the In Record Header option.

  3. On the Avro tab, configure the following properties:
    Avro Property - Description
    Namespace - Namespace to use in the Avro schema.
    Doc - Optional description for the Avro schema.
    Nullable Fields - Allows fields to include null values by creating a union of the field type and null type.

    By default, fields cannot include null values.

    Default to Nullable - When allowing null values in schema fields, uses null as a default value for all fields.
    Expand Types - Allows using a larger Data Collector data type for an Avro data type when an exact equivalent is not available.
    Default Values for Types - Optionally specify default values for Avro data types. Click Add to configure a default value.

    The default value applies to all fields of the specified data type.

    You can specify default values for the following Avro types:
    • Boolean
    • Integer
    • Long
    • Float
    • Double
    • String
  4. On the Types tab, optionally configure the following properties:
    Type Property - Description
    Precision Field Attribute - Name of the schema attribute that stores the precision for a decimal field.
    Scale Field Attribute - Name of the schema attribute that stores the scale for a decimal field.
    Default Precision - Default precision to use for decimal fields when the precision is not specified or is invalid.

    Use -1 to opt out of this option.

    Note: When decimal fields do not have a valid precision and scale, the stage sends the record to error.
    Default Scale - Default scale to use for decimal fields when the scale is not specified or is invalid.

    Use -1 to opt out of this option.

    Note: When decimal fields do not have a valid precision and scale, the stage sends the record to error.
  5. On the Advanced tab, optionally configure the following properties:
    Advanced Property - Description
    Enable Cache - Enables caching schemas. Can improve performance under specific conditions. For more information, see Caching Schemas.
    Cache Size - Maximum number of schemas to cache.
    Cache Key Expression - Expression that evaluates to a valid cache key.
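The decimal settings on the Types tab in step 4 interact in a way that a short sketch can make easier to follow. The Python below is a hedged model, not the processor's code: the function resolve_decimal, the attribute names "precision" and "scale", and the use of ValueError to stand in for sending a record to error are assumptions. The validity rules (precision of at least 1, scale between 0 and the precision) follow the Avro decimal logical type.

```python
def resolve_decimal(attributes, precision_attr="precision", scale_attr="scale",
                    default_precision=-1, default_scale=-1):
    """Return an Avro decimal type for a field, or raise ValueError.

    `attributes` is a dict standing in for the field attributes of a decimal
    field (an assumption for illustration).
    """

    def pick(attr_name, default, minimum):
        try:
            value = int(attributes.get(attr_name))
        except (TypeError, ValueError):
            value = None  # attribute missing or not a number
        if value is None or value < minimum:
            # Fall back to the configured default; -1 opts out of the default.
            value = None if default == -1 else default
        return value

    precision = pick(precision_attr, default_precision, 1)
    scale = pick(scale_attr, default_scale, 0)
    if precision is None or scale is None or precision < 1 or not 0 <= scale <= precision:
        # Models the stage sending the record to error.
        raise ValueError("decimal field has no valid precision and scale")
    return {"type": "bytes", "logicalType": "decimal",
            "precision": precision, "scale": scale}


# Attributes present and valid: the configured defaults are not needed.
print(resolve_decimal({"precision": "10", "scale": "2"}))

# Attributes missing: the configured defaults are used instead.
print(resolve_decimal({}, default_precision=38, default_scale=10))
```

With both defaults left at -1, a field without usable attributes raises the error, which corresponds to the note that such records are sent to error.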