Common Stage Concepts

Pipeline stages include origins, processors, destinations, and executors. Although each type of stage functions differently, all stages share some common concepts.

For example, all stages encrypt secret values in the same way, and most stages let you add configurations in either simple or bulk edit mode. This section describes concepts that are common to most pipeline stages.

Secrets

Pipeline stages communicate with external systems to read and write data. Many of these external systems require secrets such as user names or passwords to access the data.

When you configure a pipeline stage for one of these external systems, you define the secrets used to connect to the system. StreamSets Cloud encrypts the values using 256-bit AES encryption in GCM mode (GCM-128), and then stores the encrypted values with the pipeline. Once stored, secret values are never displayed. Secrets are decrypted only at pipeline runtime, when the stage retrieves the secret to communicate with the external system.

StreamSets Cloud encrypts values for any stage property that requires secrets or sensitive information. Stage properties that require secrets display a key icon next to the property name.

Simple and Bulk Edit Mode

Some pipeline and stage properties let you add multiple configurations. You can add the configurations in simple or bulk edit mode. By default, each property uses simple edit mode.

For example, in the Expression Evaluator processor, you can click Add in simple edit mode to add multiple field expressions. The stage displays multiple text boxes for you to enter the values.

Or, you can switch to bulk edit mode and enter the list of configurations in JSON format.
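As a sketch, the JSON for two Expression Evaluator field expressions might look like the following. The field names and values here are illustrative; the stage displays the correct field names for each property when you switch modes:

  [
    {
      "fieldToSet": "/netTotal",
      "expression": "${record:value('/total') - record:value('/tax')}"
    },
    {
      "fieldToSet": "/country",
      "expression": "${str:toUpper(record:value('/country'))}"
    }
  ]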

Bulk edit mode can be particularly useful when you have a large number of configurations to enter.

When you switch to bulk edit mode, the stage displays the correct field names, along with any values that you entered in simple edit mode. You can modify the values, but do not modify the field names. If you change a field name by mistake, the pipeline canvas displays a validation error message.

In bulk edit mode, you can add more configurations by editing the JSON text directly. Or, you can click Add to append another configuration with empty values, and then enter the values in the JSON text.
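Continuing the illustrative example above, clicking Add in bulk edit mode would append an entry with empty values for you to complete:

  {
    "fieldToSet": "",
    "expression": ""
  }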

Dropping Unwanted Records

You can drop unwanted records from the pipeline by defining required fields or preconditions that a record must meet to enter a stage.

Required Fields

A required field is a field that must exist in a record to allow the record into the stage for processing.

When a record does not include all required fields, it is processed based on the error handling configured for the pipeline. You can define required fields for any processor or executor and for most destinations.

Configure required fields as part of the overall pipeline logic or to minimize processing errors. For example, suppose a Field Hasher processor hashes Social Security data in the /SSN field. To ensure that all records include Social Security numbers, you can make /SSN a required field for the stage.
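As an illustration, with /SSN configured as a required field, the first of these two hypothetical records enters the stage, while the second is passed to the pipeline's error handling because it is missing the /SSN field:

  {"SSN": "123-45-6789", "NAME": "Ann"}
  {"NAME": "Bob"}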

Preconditions

Preconditions are conditions that a record must satisfy to enter the stage for processing.

A stage evaluates all preconditions before processing a record. When a record does not meet all of the configured preconditions, it is handled based on the error handling configured for the stage.

You can define preconditions for any processor or executor and for most destinations. Preconditions can include most functions, pipeline constants, and runtime properties.

Configure preconditions as part of the overall pipeline logic or to minimize processing errors. For example, you might use the following expression to exclude records that originate from outside the United States:
 ${record:value('/COUNTRY') == 'US'}
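Preconditions can also combine multiple conditions. For example, the following expression, which assumes hypothetical /COUNTRY and /ZIP fields, allows only US records that include a zip code into the stage:

 ${record:value('/COUNTRY') == 'US' && record:exists('/ZIP')}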

Error records include information about failed preconditions in the errorMessage record header attribute. You can also view error messages when monitoring recent error records while the pipeline is running.

Control Character Removal

Several stages can remove control characters, such as escape or end-of-transmission characters, from data. Remove control characters to avoid creating invalid records.

When StreamSets Cloud removes control characters, it removes ASCII character codes 0-31 and 127, with the following exceptions:
  • 9 - Tab
  • 10 - Line feed
  • 13 - Carriage return
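For example, if a field value contained a bell character (ASCII 7) and a tab (ASCII 9), removing control characters would drop the bell but keep the tab. Shown here with \uXXXX escapes for a hypothetical value:

  Before: id\u0007123\tactive
  After:  id123\tactive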
You can use the Ignore Control Characters property in the following stages to remove control characters:
  • Amazon S3 origin
  • Azure Data Lake Storage Gen1 origin
  • Azure Data Lake Storage Gen2 origin
  • Azure IoT/Event Hub Consumer origin
  • Google Cloud Storage origin
  • Google Pub/Sub Subscriber origin
  • HTTP Client origin
  • Kinesis Consumer origin
  • HTTP Client processor
  • JSON Parser processor

Advanced Options for Stages

Some pipeline stages include advanced options with default values that should work in most cases. By default, each stage hides the advanced options.

As you get started with these stages, StreamSets recommends configuring the basic properties and leaving the advanced options at their default values. Then, as you continue working with a stage, explore the advanced options for ways to fine-tune stage processing.

To show advanced options, select the stage in the pipeline canvas, and then select Show Advanced Options at the top of the properties pane. Advanced options can include individual properties or complete tabs.

For example, when advanced options are hidden, the properties pane for the MySQL Query Consumer origin displays four basic properties on the JDBC tab. These properties must be configured so that the origin can connect to your MySQL database.

When you select Show Advanced Options, the origin displays additional properties on the JDBC tab as well as additional tabs. These advanced properties and tabs are preconfigured with recommended default values. You likely won't need to change them until you begin fine-tuning how the origin processes your data.