Error Record Handling

You configure error record handling at a stage level and at a pipeline level. You also specify the version of the record to use as the basis for the error record.

When an error occurs as a stage processes a record, StreamSets Cloud handles the record based on the stage configuration. One of the stage options is to pass the record to the pipeline for error handling. For this option, StreamSets Cloud processes the record based on the pipeline error record handling configuration.

When you configure a pipeline, be aware that stage error handling takes precedence over pipeline error handling. That is, a pipeline might be configured to write error records to a destination such as Amazon S3, but if a stage is configured to discard error records, those records are discarded. You might use this functionality to reduce the types of error records that are saved for review and reprocessing.

Note that records missing required fields do not enter the stage. They are passed directly to the pipeline for error handling.
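
The precedence rules above can be sketched as a small decision function. This is an illustrative model only, not StreamSets Cloud code: the function name resolve_error_handling and the lowercase policy strings are assumptions chosen to mirror the option names described in this section.

```python
# Illustrative model only; function name and policy strings are hypothetical.
STAGE_POLICIES = {"discard", "send_to_error", "stop_pipeline"}


def resolve_error_handling(stage_policy, pipeline_policy, missing_required_fields=False):
    """Return which configuration decides the fate of a record that hit an error."""
    if missing_required_fields:
        # The record never enters the stage, so the stage setting is not consulted.
        return f"pipeline:{pipeline_policy}"
    if stage_policy not in STAGE_POLICIES:
        raise ValueError(f"unknown stage policy: {stage_policy}")
    if stage_policy == "send_to_error":
        # Only this option defers to the pipeline configuration.
        return f"pipeline:{pipeline_policy}"
    # Discard and Stop Pipeline are decided at the stage.
    return f"stage:{stage_policy}"


print(resolve_error_handling("discard", "write_to_s3"))        # stage:discard
print(resolve_error_handling("send_to_error", "write_to_s3"))  # pipeline:write_to_s3
print(resolve_error_handling("discard", "write_to_s3", missing_required_fields=True))  # pipeline:write_to_s3
```

In this sketch the pipeline setting only matters in two cases: the stage explicitly sends the record to the pipeline, or the record is missing required fields and never reaches the stage.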

Pipeline Error Record Handling

Pipeline error record handling determines how StreamSets Cloud processes error records that stages send to the pipeline for error handling. It also handles records deliberately dropped from the pipeline, such as records without required fields.

The pipeline handles error records based on the Error Records property on the Error Records tab. When StreamSets Cloud encounters an unexpected error, it stops the pipeline and logs the error.

Pipelines provide the following error record handling options:
Discard
The pipeline discards the error records. StreamSets Cloud includes the records in error record counts and metrics.
Write to Amazon S3
The pipeline writes error records and related details to Amazon S3. StreamSets Cloud includes the records in error record counts and metrics.
You define the Amazon S3 configuration properties.
Write to Google Cloud Storage
The pipeline writes error records and related details to Google Cloud Storage. StreamSets Cloud includes the records in error record counts and metrics.
You define the Google Cloud Storage configuration properties.
Write to Google Pub/Sub
The pipeline writes error records and related details to Google Pub/Sub. StreamSets Cloud includes the records in error record counts and metrics.
You define the Google Pub/Sub configuration properties.
Write to Kinesis
The pipeline writes error records and related details to Amazon Kinesis Streams. StreamSets Cloud includes the records in error record counts and metrics.
You define the configuration properties for the Kinesis stream to use.

Stage Error Record Handling

Most stages include error record handling options. When an error occurs during stage processing, StreamSets Cloud processes records based on the On Record Error property on the General tab of the stage.

Stages include the following error handling options:
Discard
The stage silently discards the record. StreamSets Cloud does not log information about the error or note the specific record that encountered an error. The discarded record is not included in error record counts or metrics.
Send to Error
The stage sends the record to the pipeline for error handling. The pipeline processes the record based on the pipeline error handling configuration.
When you monitor the pipeline, you can view the most recent error records and the issues they encountered on the Error Records tab for the stage. This information becomes unavailable after you stop the pipeline.
Stop Pipeline
StreamSets Cloud stops the pipeline and logs information about the error.

Example

An Amazon S3 origin reads JSON data with a maximum object length of 4096 characters. The stage encounters an object with 5000 characters. Based on the stage configuration, StreamSets Cloud either discards the record, stops the pipeline, or passes the record to the pipeline for error record handling.

When the stage is configured to send the record to the pipeline, one of the following occurs based on how you configure the pipeline error handling:
  • When the pipeline discards error records, StreamSets Cloud discards the record without noting the action or the cause.

    When you monitor the pipeline, you can view the most recent set of error records and information about the errors on the Error Records tab for the stage. However, this information becomes unavailable after you stop the pipeline.

  • When the pipeline writes error records to a destination, StreamSets Cloud writes the error record and additional error information to the destination. It also includes the error records in error record counts and metrics.
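
A minimal sketch of the example above in Python, under simplifying assumptions: the function handle_object, the policy strings, the error_sink list (standing in for an error destination), and the metrics dictionary are all hypothetical names used for illustration, not part of StreamSets Cloud.

```python
MAX_OBJECT_LENGTH = 4096  # limit taken from the example above


class PipelineStopped(Exception):
    """Raised when the stage is configured to stop the pipeline."""


def handle_object(text, stage_policy, pipeline_policy, error_sink, metrics):
    """Apply the example's error handling to one object and report the outcome."""
    if len(text) <= MAX_OBJECT_LENGTH:
        return "processed"
    reason = f"object length {len(text)} exceeds {MAX_OBJECT_LENGTH}"
    if stage_policy == "stop_pipeline":
        raise PipelineStopped(reason)
    if stage_policy == "discard":
        # Stage-level discard: the record is not counted in error metrics.
        return "discarded by stage"
    if stage_policy != "send_to_error":
        raise ValueError(f"unknown stage policy: {stage_policy}")
    # The record reached the pipeline, so it is counted as an error record.
    metrics["error_records"] += 1
    if pipeline_policy == "discard":
        return "discarded by pipeline"
    # Any other pipeline policy is treated as "write to a destination".
    error_sink.append({"record": text, "error": reason})
    return "written to error destination"


sink, metrics = [], {"error_records": 0}
oversized = "x" * 5000
print(handle_object(oversized, "send_to_error", "write", sink, metrics))  # written to error destination
print(metrics["error_records"], len(sink))                                # 1 1
```

The sketch keeps the two levels separate on purpose: only a record that the stage sends to the pipeline is counted and, depending on the pipeline setting, written out.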

Error Records and Version

When StreamSets Cloud creates an error record, it preserves the data and attributes from the record that triggered the error, and then adds error-related information as record header attributes. For a list of the error header attributes and other internal header attributes associated with a record, see Internal Attributes.

When you configure a pipeline, you can specify the version of the record that you want to use:
  • Original record - The record as originally generated by the origin. Use this record when you want the original record without any additional pipeline processing.
  • Current record - The record in the stage that generated the error. Depending on the type of error that occurred, this record can be unprocessed or partially processed by the error-generating stage.

    Use this record when you want to preserve any processing that the pipeline completed before the record caused an error.
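
The difference between the two versions can be illustrated with a short Python sketch. The dictionary layout, the function build_error_record, and the header attribute name errorMessage are assumptions made for this illustration; see Internal Attributes for the attributes that StreamSets Cloud actually sets.

```python
import copy


def build_error_record(original, current, message, use_original=True):
    """Build an error record from the chosen record version."""
    basis = original if use_original else current
    record = copy.deepcopy(basis)  # preserve the data and attributes of the basis record
    # Hypothetical attribute name, used only for illustration.
    record["header"]["errorMessage"] = message
    return record


# The record as generated by the origin.
original = {"header": {}, "fields": {"name": " ada "}}

# The same record after an earlier stage trimmed the value, just before the error.
current = copy.deepcopy(original)
current["fields"]["name"] = current["fields"]["name"].strip()

as_original = build_error_record(original, current, "type conversion failed", use_original=True)
as_current = build_error_record(original, current, "type conversion failed", use_original=False)
print(as_original["fields"])  # {'name': ' ada '}
print(as_current["fields"])   # {'name': 'ada'}
```

Both error records carry the same error information in the header; they differ only in whether the field data reflects processing completed before the error.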