Post Upgrade Tasks

In some situations, you must complete tasks within Data Collector or your Control Hub on-premises installation after you upgrade.

Update Control Hub On-Premises

By default, StreamSets Control Hub on-premises can work with registered Data Collectors from version 2.1.0.0 to the current version of Control Hub. If you use Control Hub on-premises and you upgrade registered Data Collectors to a version higher than your current version of Control Hub, you must modify the Data Collector version range within your Control Hub installation.

For example, if you upgrade Control Hub on-premises to version 3.6.0 and you upgrade registered Data Collectors to version 3.8.0, you must update the maximum Data Collector version that can work with Control Hub. As a best practice, configure the maximum Data Collector version to 3.8.10 so that Data Collectors upgraded to later 3.8.x versions, such as 3.8.1 or 3.8.2, continue to work with Control Hub.

To modify the Data Collector version range:

  1. Log in to Control Hub as the default system administrator - the admin@admin user account.
  2. In the Navigation panel, click Administration > Data Collectors.
  3. Click the Component Version Range icon.
  4. Enter the maximum Data Collector version that can work with Control Hub, such as 3.8.10.

Update Pipelines using Legacy Stage Libraries

Starting with version 3.0.0.0, several older stage libraries are no longer included with Data Collector and are now legacy stage libraries. Subsequent Data Collector releases have moved additional older stage libraries to the legacy stage libraries.

When you upgrade, review the complete list of legacy stage libraries. If your upgraded pipelines use these legacy stage libraries, the pipelines will not run until you perform one of the following tasks:
Use a current stage library
We strongly recommend that you upgrade your system and use a current stage library in the pipeline:
  1. Upgrade the system to a more current version.
  2. Install the stage library for the upgraded system.
  3. In the pipeline, edit the stage and select the appropriate stage library.
Install the legacy stage library
Though not recommended, you can still download and install the older stage libraries as custom stage libraries. For more information, see Legacy Stage Libraries.

Pipeline Export

With version 3.8.0, Data Collector has changed the behavior of the pipeline Export option. Data Collector now strips all plain text credentials from exported pipelines. Previously, Data Collector included plain text credentials in exported pipelines.

To use the previous behavior and include credentials in the export, choose the new Export with Plain Text Credentials option when exporting a pipeline.

Update TCP Server Pipelines

With version 3.7.2, the TCP Server origin has changed the valid values for the Read Timeout property. The property now allows a minimum of 1 second and a maximum of 3,600 seconds.

In previous versions, the Read Timeout property had no maximum value and could be set to 0 to keep the connection open regardless of whether the origin read any data.

If pipelines created in a previous version have the Read Timeout property set to a value less than 1 or greater than 3,600, the upgrade process sets the property to the maximum value of 3,600 seconds. Update the Read Timeout property as needed after the upgrade.

Update Cluster Pipelines

With version 3.7.0, Data Collector now requires that the Java temporary directory on the gateway node in the cluster be writable.

The Java temporary directory is specified by the Java system property java.io.tmpdir. On UNIX, the default value of this property is typically /tmp and is writable.

Previous Data Collector versions did not have this requirement. Before running upgraded cluster pipelines, verify that the Java temporary directory on the gateway node is writable.
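If the temporary directory on the gateway node is not writable, one option is to point the Data Collector JVM at a directory that is. The following is a minimal sketch, assuming an installation that reads Java options from the sdc-env.sh environment file; the directory path is a placeholder:
  # sdc-env.sh on the gateway node: append a writable temporary directory (placeholder path)
  export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -Djava.io.tmpdir=/data/sdc/tmp"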

Duplicate Data in Oracle CDC Client Pipelines

After upgrading to version 3.7.0 or later, pipelines that use the Oracle CDC Client origin can produce some duplicate data.

Due to a change in offset format, when the pipeline restarts, the Oracle CDC Client origin reprocesses all transactions with the commit SCN from the last offset to prevent skipping unread records. This issue occurs only for the last SCN that was processed before the upgrade, and only once, upon upgrading to Data Collector version 3.7.0 or later.

When possible, remove the duplicate records from the destination system.

Update Pipelines that Use Kafka Consumer or Kafka Multitopic Consumer Origins

With version 3.7.0, Data Collector no longer uses the auto.offset.reset value set in the Kafka Configuration property to determine the initial offset. Instead, Data Collector uses the new Auto Offset Reset property to determine the initial offset. With the default setting of the new property, the origin reads all existing messages in a topic. In previous versions, the origin read only new messages by default. Because running a pipeline sets an offset value, configuration of the initial offset only affects pipelines that have never run.

After upgrading from a version earlier than 3.7.0, update any pipelines that have not run and use the Kafka Consumer or Kafka Multitopic Consumer origins.

  1. On the Kafka tab, set the value of the Auto Offset Reset property:
    • Earliest - Select to have the origin read messages starting with the first message in the topic (same behavior as configuring auto.offset.reset to earliest in previous versions of Data Collector).
    • Latest - Select to have the origin read messages starting with the last message in the topic (same behavior as not configuring auto.offset.reset in previous versions of Data Collector).
    • Timestamp - Select to have the origin read messages starting with messages at a particular timestamp, which you specify in the Auto Offset Reset Timestamp property.
  2. If configured in the Kafka Configuration property, delete the auto.offset.reset property.
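    For example, if the Kafka Configuration property previously contained an entry such as the following (shown in key=value form with an illustrative value), delete the entry and select the corresponding Auto Offset Reset value - in this case, Earliest - instead:
    # remove this entry from Kafka Configuration; use the Auto Offset Reset property instead
    auto.offset.reset=earliest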

Update Spark Executor with Databricks Pipelines

With version 3.5.0, Data Collector introduces a new Databricks executor and has removed the ability to use the Spark executor with Databricks.

If you upgrade pipelines that include the Spark executor with Databricks, you must update the pipelines to use the Databricks executor after you upgrade.

Update Pipelines to Use Spark 2.1 or Later

With version 3.3.0, Data Collector removes support for Spark 1.x and introduces cluster streaming mode with support for Kafka security features such as SSL/TLS and Kerberos authentication using Spark 2.1 or later and Kafka 0.10.0.0 or later. For more information about these changes, see Upgrade to Spark 2.1 or Later.

After upgrading the Cloudera CDH distribution, Hortonworks Hadoop distribution, or Kafka system to the required version and then upgrading Data Collector, you must update pipelines to use Spark 2.1 or later. Pipelines that use the earlier systems will not run until you perform these tasks:

  1. Install the stage library for the upgraded system.
  2. In the pipeline, edit the stage and select the appropriate stage library.
  3. If the pipeline includes a Spark Evaluator processor and the Spark application was previously built with Spark 2.0 or earlier, rebuild it with Spark 2.1.

    Similarly, if you used Scala to write the custom Spark class and the application was compiled with Scala 2.10, recompile it with Scala 2.11.

  4. If the pipeline includes a Spark executor and the Spark application was previously built with Spark 2.0 or earlier, rebuild it with Spark 2.1 and Scala 2.11.

Update Value Replacer Pipelines

With version 3.1.0.0, Data Collector introduces a new Field Replacer processor and has deprecated the Value Replacer processor.

The Field Replacer processor lets you define more complex conditions to replace values. For example, unlike the Value Replacer, the Field Replacer can replace values that fall within a specified range.

You can continue to use the deprecated Value Replacer processor in pipelines. However, the processor will be removed in a future release - so we recommend that you update pipelines to use the Field Replacer as soon as possible.

To update your pipelines, replace the Value Replacer processor with the Field Replacer processor. The Field Replacer replaces values in fields with nulls or with new values. In the Field Replacer, use field path expressions to replace values based on a condition.

For example, let's say that your Value Replacer processor is configured to replace null values in the product_id field with "NA" and to replace the "0289" store ID with "0132".

In the Field Replacer processor, you can configure the same replacements using field path expressions.
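For example, the following is a rough sketch of equivalent replace rules in the Field Replacer. The rule layout and the field path expression syntax shown here - predicates of the form <field>[${condition}] using the f:value() function - are illustrative; check the field path expressions documentation for the exact syntax that your Data Collector version supports:
  Fields: /product_id[${f:value() == null}]     Replacement Value: NA
  Fields: /store_id[${f:value() == '0289'}]     Replacement Value: 0132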

Update Einstein Analytics Pipelines

With version 3.1.0.0, the Einstein Analytics destination introduces a new append operation that lets you combine data into a single dataset. Configuring the destination to use dataflows to combine data into a single dataset has been deprecated.

You can continue to configure the destination to use dataflows. However, dataflows will be removed in a future release - so we recommend that you update pipelines to use the append operation as soon as possible.

Disable Cloudera Navigator Integration

With version 3.0.0.0, the beta version of Cloudera Navigator integration is no longer available with Data Collector. Cloudera Navigator integration now requires a paid subscription. For more information about purchasing Cloudera Navigator integration, contact StreamSets.

When upgrading from a Data Collector version with Cloudera Navigator integration enabled to version 3.0.0.0 without a paid subscription, perform the following post-upgrade task:

Do not include the Cloudera Navigator properties when you configure the 3.0.0.0 Data Collector configuration file, sdc.properties. The properties to omit are:
  • lineage.publishers
  • lineage.publisher.navigator.def
  • All other properties with the lineage.publisher.navigator prefix
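For example, if your previous sdc.properties file contained entries such as the following, do not carry them over to the 3.0.0.0 file. The values shown are placeholders:
  lineage.publishers=navigator
  lineage.publisher.navigator.def=<navigator publisher definition>
  # ...plus any other properties with the lineage.publisher.navigator prefix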

JDBC Multitable Consumer Query Interval Change

With version 3.0.0.0, the Query Interval property is replaced by the new Queries per Second property.

Upgraded pipelines with the Query Interval specified using a constant or the default format and unit of time, ${10 * SECONDS}, have the new Queries per Second property calculated and defined as follows:
Queries per Second = Number of Threads / Query Interval (in seconds)
For example, say the origin uses three threads and Query Interval is configured for ${15 * SECONDS}. Then, the upgraded origin sets Queries per Second to 3 divided by 15, which is 0.2. This means the origin will run a maximum of two queries every 10 seconds.

The upgrade would occur the same way if Query Interval were set to 15.

Pipelines with a Query Interval configured to use other units of time, such as ${.1 * MINUTES}, or configured with a different expression format, such as ${SECONDS * 5}, are upgraded to use the default for Queries per Second, which is 10. This means the pipeline will run a maximum of 10 queries per second. The fact that these expressions are not upgraded correctly is noted in the Data Collector log.

Update the Queries per Second property as needed after the upgrade.

Update JDBC Query Consumer Pipelines used for SQL Server CDC Data

With version 3.0.0.0, the Microsoft SQL Server CDC functionality in the JDBC Query Consumer origin has been deprecated and will be removed in a future release.

For pipelines that use the JDBC Query Consumer origin to process Microsoft SQL Server CDC data, replace it with an origin designed for SQL Server change data, such as the SQL Server CDC Client origin or the SQL Server Change Tracking origin.

Update MongoDB Destination Upsert Pipelines

With version 3.0.0.0, the MongoDB destination supports the replace and update operation codes, and no longer supports the upsert operation code. You can use a new Upsert flag in conjunction with Replace and Update.

After upgrading from a version earlier than 3.0.0.0, update the pipeline as needed to ensure that records passed to the destination do not use the upsert operation code (sdc.operation.type = 4). Records that use the upsert operation code will be sent to error.

In previous releases, records flagged for upsert were treated in the MongoDB system as Replace operations with the Upsert flag set.

If you want to replicate the upsert behavior from earlier releases, perform the following steps:
  1. Configure the pipeline to use the Replace operation code.

    Make sure that the sdc.operation.type record header attribute is set to 7 for Replace instead of 4 for Upsert. See the sketch after these steps for one way to set the attribute.

  2. In the MongoDB destination, enable the new Upsert property.
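For example, one way to set the operation code described in step 1 is to use an Expression Evaluator processor ahead of the destination. This is only a sketch - your pipeline may already set the header attribute in an upstream stage - and the configuration shown reflects the Expression Evaluator header attribute expressions:
  Header Attribute: sdc.operation.type
  Header Attribute Expression: 7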

Time Zones in Stages

With version 3.0.0.0, time zones have been organized and updated to use JDK 8 names. This should make it easier to select time zones in stage properties.

In the rare case that an upgraded pipeline uses a format not supported by JDK 8, edit the pipeline to select a compatible time zone.

Update Kudu Pipelines

Consider the following upgrade tasks for Kudu pipelines, based on the version that you are upgrading from:

Upgrade from versions earlier than 3.0.0.0
With version 3.0.0.0, if the Kudu destination receives change data capture logs from one of the following source systems, you must specify the source system so that the destination can determine the format of the log: Microsoft SQL Server, Oracle CDC Client, MySQL Binary Log, or MongoDB Oplog.
Previously, the Kudu destination could not directly receive changed data from these source systems. You had to either include a scripting processor in the pipeline to modify the field paths in the record to a format that the destination could read, or add multiple Kudu destinations to the pipeline - one per operation type - and include a Stream Selector processor to route records to the appropriate destination by operation type.
If you implemented one of these workarounds, then after upgrading, modify the pipeline to remove the scripting processor or the Stream Selector processor and the multiple destinations. In the Kudu destination, set the Change Log Format to the appropriate format of the log: Microsoft SQL Server, Oracle CDC Client, MySQL Binary Log, or MongoDB Oplog.
Upgrade from versions earlier than 2.2.0.0
With version 2.2.0.0, Data Collector provides support for Apache Kudu version 1.0.x and no longer supports earlier Kudu versions. To upgrade pipelines that contain a Kudu destination from Data Collector versions earlier than 2.2.0.0, upgrade your Kudu cluster and then add a stage alias for the earlier Kudu version to the Data Collector configuration file, $SDC_CONF/sdc.properties.

The configuration file includes stage aliases to enable backward compatibility for pipelines created with earlier versions of Data Collector.

To update Kudu pipelines:

  1. Upgrade your Kudu cluster to version 1.0.x.

    For instructions, see the Apache Kudu documentation.

  2. Open the $SDC_CONF/sdc.properties file and locate the following comment:
    # Stage aliases for mapping to keep backward compatibility on pipelines when stages move libraries
  3. Below the comment, add a stage alias for the earlier Kudu version as follows:
    stage.alias.streamsets-datacollector-apache-kudu-<version>-lib, com_streamsets_pipeline_stage_destination_kudu_KuduDTarget = streamsets-datacollector-apache-kudu_1_0-lib, com_streamsets_pipeline_stage_destination_kudu_KuduDTarget
    Where <version> is the earlier Kudu version: 0_7, 0_8, or 0_9. For example, if you previously used Kudu version 0.9, add the following stage alias:
    stage.alias.streamsets-datacollector-apache-kudu-0_9-lib, com_streamsets_pipeline_stage_destination_kudu_KuduDTarget = streamsets-datacollector-apache-kudu_1_0-lib, com_streamsets_pipeline_stage_destination_kudu_KuduDTarget
  4. Restart Data Collector to enable the changes.

Update JDBC Multitable Consumer Pipelines

With version 2.7.1.1, the JDBC Multitable Consumer origin can read from views in addition to tables. The origin reads from all tables and all views that are included in the defined table configurations.

When upgrading pipelines that contain a JDBC Multitable Consumer origin from Data Collector versions earlier than 2.7.1.1, review the table configurations to determine if any views are included. If a table configuration includes views that you do not want to read, simply exclude them from the configuration.

Update Vault Pipelines

With version 2.7.0.0, Data Collector introduces a credential store API and credential expression language functions to access Hashicorp Vault secrets.

In addition, the Data Collector Vault integration now relies on Vault's App Role authentication backend.

Previously, Data Collector used Vault functions to access Vault secrets and relied on Vault's App ID authentication backend. StreamSets has deprecated the Vault functions, and Hashicorp has deprecated the App ID authentication backend.

After upgrading, update pipelines that use Vault functions in one of the following ways:

Use the new credential store expression language functions (recommended)
To use the new credential functions, install the Vault credential store stage library and define the configuration properties used to connect to Vault. Then, update each upgraded pipeline that includes stages using Vault functions to use the new credential functions to retrieve the credential values.
For details on using the Vault credential store system, see Hashicorp Vault.
Continue to use the deprecated Vault functions
You can continue to use the deprecated Vault functions in pipelines. However, the functions will be removed in a future release - so we recommend that you use the credential functions as soon as possible.
To continue to use the Vault functions, make the following changes after upgrading:
  • Uncomment the single Vault EL property in the $SDC_CONF/vault.properties file.
  • The remaining Vault configuration properties have been moved to the $SDC_CONF/credential-stores.properties file. The properties use the same name, with an added "credentialStore.vault.config" prefix. Copy any values that you customized in the previous vault.properties file into the same property names in the credential-stores.properties file, as illustrated after this list.
  • Define the Vault Role ID and Secret ID that Data Collector uses to authenticate with Vault in the credential-stores.properties file. Defining an App ID for Data Collector is deprecated and will be removed in a future release.
For details on using the Vault functions, see Accessing Hashicorp Vault Secrets with Vault Functions (Deprecated).
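For example, a property that you customized in the previous vault.properties file carries over to credential-stores.properties with the added prefix. The property name used here (addr) is illustrative - copy the names from your existing file - and the Role ID and Secret ID properties follow the same prefix:
  # previous $SDC_CONF/vault.properties
  addr=https://vault.example.com:8200
  # new $SDC_CONF/credential-stores.properties
  credentialStore.vault.config.addr=https://vault.example.com:8200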

Configure JDBC Producer Schema Names

With Data Collector version 2.5.0.0, you can use a Schema Name property to specify the database or schema name. In previous releases, you specified the database or schema name in the Table Name property.

Upgrading from a previous release does not require changing any existing configuration at this time. However, we recommend using the new Schema Name property, since the ability to specify a database or schema name with the table name might be deprecated in the future.

Evaluate Precondition Error Handling

With Data Collector version 2.5.0.0, precondition error handling has changed.

The Precondition stage property allows you to define conditions that must be met for a record to enter the stage. Previously, records that did not meet all specified preconditions were passed to the pipeline for error handling. That is, the records were processed based on the Error Records pipeline property.

With version 2.5.0.0, records that do not meet the specified preconditions are handled by the error handling configured for the stage. Stage error handling occurs based on the On Record Error property on the General tab of the stage.

Review pipelines that use preconditions to verify that this change does not adversely affect the behavior of the pipelines.

Authentication for Docker Image

With Data Collector version 2.4.1.0, the Docker image now uses the form type of file-based authentication by default. As a result, you must use a Data Collector user account to log in to the Data Collector. If you haven't set up custom user accounts, you can use the admin account shipped with the Data Collector. The default login is: admin / admin.

Earlier versions of the Docker image used no authentication.

Configure Pipeline Permissions

Data Collector version 2.4.0.0 is designed for multitenancy and enables you to share and grant permissions on pipelines. Permissions determine the access level that users and groups have on pipelines.

In earlier versions of Data Collector without pipeline permissions, pipeline access was determined by roles. For example, any user with the Creator role could edit any pipeline.

In version 2.4.0.0, roles are augmented with pipeline permissions. In addition to having the necessary role, users must also have the appropriate permissions to perform pipeline tasks.

For example, to edit a pipeline in 2.4.0.0, a user with the Creator role must also have read and write permission on the pipeline. Without write permission, the user cannot edit the pipeline. Without read permission, the user cannot see the pipeline at all. It does not display in the list of available pipelines.

Note: With pipeline permissions enabled, all upgraded pipelines are initially visible only to users with the Admin role and the pipeline owner - the user who created the pipeline. To enable other users to work with pipelines, have an Admin user configure the appropriate permissions for each pipeline.

In Data Collector version 2.5.0.0, pipeline permissions are disabled by default. To enable pipeline permissions, set the pipeline.access.control.enabled property to true in the Data Collector configuration file.
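For example, in the Data Collector configuration file, $SDC_CONF/sdc.properties:
  pipeline.access.control.enabled=true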

Tip: You can configure pipeline permissions when permissions are disabled. Then, you can enable the pipeline permissions property after pipeline permissions are properly configured.

For more information about roles and permissions, see Roles and Permissions. For details about configuring pipeline permissions, see Sharing Pipelines.

Update Elasticsearch Pipelines

Data Collector version 2.3.0.0 includes an enhanced Elasticsearch destination that uses the Elasticsearch HTTP API. To upgrade pipelines that use the Elasticsearch destination from Data Collector versions earlier than 2.3.0.0, you must review the value of the Default Operation property.

Review all upgraded Elasticsearch destinations to ensure that the Default Operation property is set to the correct operation. Upgraded Elasticsearch destinations have the Default Operation property set based on the configuration for the Enable Upsert property:

  • With upsert enabled, the default operation is set to INDEX.
  • With upsert not enabled, the default operation is set to CREATE, which requires a document ID.
Note: The Elasticsearch version 5 stage library is compatible with all versions of Elasticsearch. Earlier stage library versions have been removed.