What's New

What's New in 3.15.x

StreamSets Transformer version 3.15.x includes the following new features and enhancements:

New Cluster Type
  • Dataproc - You can now run pipelines on a Google Dataproc cluster. Transformer supports Dataproc image versions 1.3.x and 1.4.x.
Cluster Type Enhancements
  • Hadoop YARN cluster with Kerberos authentication - When configuring a pipeline to run on a Hadoop YARN cluster with Kerberos authentication, you can store the Kerberos keytab in a credential store, and then use a credential function to retrieve the keytab from the pipeline properties, as shown in the example after this list.

    When using a credential store, you can also require Transformer group access to the keytab, which lets you more securely use multiple keytabs with a single Transformer instance.

  • SQL Server 2019 Big Data Cluster version support - Transformer supports SQL Server 2019 Cumulative Update 4 and later.
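
For example, with a hypothetical credential store ID of streamsets, a Transformer group named devops, and a keytab stored as the secret sales/keytab, a pipeline property could retrieve the keytab with the following credential function call:

    ${credential:get("streamsets", "devops", "sales/keytab")}

Transformer returns the keytab only to users who belong to the devops group.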
New Stages
You can use the following new stages only in pipelines that run on Dataproc clusters:
  • Google Cloud Storage origin - You can use this origin to read fully written objects in Google Cloud Storage.
  • Google Cloud Storage destination - You can use this destination to write to Google Cloud Storage.
  • Google BigQuery destination - You can use this destination to write to Google BigQuery.
Stage Enhancements
  • Amazon S3 stages - You can specify the AWS region where the Amazon S3 bucket resides. Previously, the stages always used the default AWS region, US East (N. Virginia) us-east-1.
  • Amazon S3 destination - You can specify fields to partition the data by. You can also use the destination in a slowly changing dimension pipeline to write results to a file dimension on Amazon S3.
  • Azure SQL destination - The destination is no longer considered a Technology Preview feature and is approved for use in production.
  • Hive destination - You can enable the destination to compensate for data drift by creating columns in the destination table for new fields.
  • Slowly Changing Dimension processor - The processor no longer requires that the configured tracking fields exist in the master dimension data.
Additional Enhancements
  • File-based user authentication - When Transformer is configured for file-based authentication, you can use the Transformer UI to change your password.
  • Pipeline run history - The pipeline run history displays the input, output, and error record count for each pipeline run.
  • Sample pipelines - Transformer provides several sample pipelines that you can use to learn about Transformer features or as a basis for building your own pipelines.

What's New in 3.14.x

StreamSets Transformer version 3.14.x includes the following new features and enhancements:

Installation
  • RPM installation and service start - You can install Transformer from an RPM package and then start Transformer as a service.
  • Transformer downloads require registration - Transformer installation packages downloaded from the StreamSets website require that you either register and activate the Transformer instance with StreamSets or register the instance with StreamSets Control Hub.

    Your StreamSets account determines the maximum number of Spark executors allowed for each pipeline.

  • Cloud service provider marketplace offers - When you install Transformer through a cloud service provider marketplace, you select an offer based on the maximum number of Spark executors allowed for each pipeline.
New Stages
  • MapR Hive origin - You can use this origin to read from a MapR Hive table.
  • MapR Hive destination - You can use this destination to write to a MapR Hive table.
Stage Enhancements
  • Amazon Redshift destination - You can now merge data into an Amazon Redshift table instead of inserting it.
  • Field Renamer processor - The processor can change the case of field names to uppercase, lowercase, or title case.
  • Slowly Changing Dimension processor - The processor now supports processing multiple changes to the same record in the same batch when processing Type 1 data or Type 2 data that uses an Active Flag or Version Increment tracking field.
  • SQL Expression processor - Use the new Cast To Type property to specify the data type of the output field instead of using the inferred type.
Credential Store Enhancements
  • AWS Secrets Manager credential store - You can call secrets defined in AWS Secrets Manager from pipeline and stage properties.
  • Group secrets in credential stores - You can configure Transformer to validate a user’s group against a comma-separated list of groups allowed to access each secret.
  • Required group argument in credential functions - If you do not need to specify a particular group in the group argument of a credential function when working with Transformer and Control Hub, you can specify the default group using all or all@<organization ID>. StreamSets recommends using all so that you do not need to modify credential functions when migrating pipelines from Transformer to Control Hub. Previously, you had to use all@<organization ID>. See the example after this list.
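
For example, with a hypothetical store ID and secret name, both of the following calls retrieve the same secret for the default group, but only the first form works unchanged when pipelines move between Transformer and Control Hub:

    ${credential:get("aws", "all", "prod/redshift-password")}
    ${credential:get("aws", "all@<organization ID>", "prod/redshift-password")}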
Additional Enhancements
  • Cluster validation - You can validate pipelines using the cluster configured for the pipeline.
  • Default users and groups for cloud service provider installations - Transformer installed through a cloud service provider marketplace includes only a default admin user account and no default groups.
  • Proxy user for Hadoop YARN clusters with Kerberos authentication - Before pipelines can use a proxy user to launch the Spark application on a Hadoop YARN cluster that uses Kerberos authentication, you must install the required Kerberos client packages on the Transformer machine and then configure the environment variables used by the k5start program.

What's New in 3.13.x

StreamSets Transformer version 3.13.x includes the following new features and enhancements:

New Cluster Types
  • Amazon EMR support - You can now run pipelines on an Amazon EMR cluster, version 5.13.0 or later.
  • Kubernetes support - You can now run pipelines on a Kubernetes cluster.
  • MapR support - You can now run pipelines on a MapR version 6.1.0 Hadoop YARN cluster.
New Stages
  • Amazon Redshift destination - You can now write to an Amazon Redshift table.
  • Azure Event Hubs origin and destination - You can now read from and write to Azure Event Hubs.
  • MapR FS origin and destination - You can now read from and write to MapR FS.
  • MapR Hive origin and destination - You can now read from and write to Hive on a MapR cluster.
  • MapR Event Store origin and destination - You can now read from and write to MapR Event Store.
  • MySQL JDBC Table origin - You can now read from a MySQL database table.
General Enhancements
  • Azure credential store - You can now use an Azure credential store with Transformer.
  • Reverse proxy support - You can now register Transformer instances deployed behind a reverse proxy with Control Hub.
Pipeline Enhancements
  • Callback URL - You can now specify a custom callback URL that the Spark cluster uses to communicate with Transformer. This property can be configured on the Advanced tab of the pipeline properties.
  • Cluster preview - You can now preview pipelines using the cluster configured for the pipeline.
Stage Enhancements
  • Amazon stages - Amazon stages can now connect using Amazon Web Services libraries for Hadoop 2.7.7.
  • Delta Lake destination - When using the Upsert using Merge write mode, you can now specify a timestamp column to ensure that the latest record is upserted when a batch contains multiple versions of a record.
  • Glob patterns allowed in origin directory paths - You can now use glob patterns to specify the read directory for the following origins: ADLS Gen1, ADLS Gen2, Amazon, and File. See the example after this list.
  • Snowflake origin - You can now use the SELECT command as well as COPY UNLOAD to read from Snowflake.
  • Snowflake stages - You can now configure advanced Snowflake properties to pass to Snowflake. These properties can be configured on the Advanced tab of the stage properties.
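
For example, a File origin might use a hypothetical read directory path such as the following, where ? matches a single character, * matches any number of characters, and [1-3] matches one character in the listed range:

    /logs/server-??/2020-0[1-3]/*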

What's New in 3.12.x

StreamSets Transformer version 3.12.x includes the following new features and enhancements:
  • SQL Server 2019 Big Data Cluster support - You can run pipelines using SQL Server 2019 Big Data Cluster as a cluster manager.
  • Delta Lake additional storage support - You can now use ADLS Gen2 as a storage system for Delta Lake stages.

  • Enhanced file path validation - File path validation has been enhanced for HDFS-based stages, such as the Amazon S3, ADLS, Delta Lake, and File origins and destinations.

  • Hive external table creation - You can configure the Hive destination to create an external table at a specified location.

  • HTTPS self-signed certificates - You can more easily use self-signed certificates when enabling Transformer to use HTTPS.

  • JDBC stage quote character - You can specify a quote character for the JDBC origin, JDBC destination, and JDBC Lookup processor to use.

  • Kafka schema registry authentication - When a Kafka origin or destination uses Avro schemas in Confluent Schema Registry, you can specify basic authentication user information when required.

  • Scala processor empty batch handling - You can now configure the Scala processor to skip processing for empty batches. See the sketch at the end of this list.

  • Configuration property rename - In the Transformer configuration file, transformer.properties, the sdc.base.http.url property has been renamed to transformer.base.http.url.

    If you configured the sdc.base.http.url property in a previous version of Transformer, configure the new transformer.base.http.url property to use the same value when you upgrade. See the example at the end of this list.

  • Configuration property removal - In the Transformer configuration file, transformer.properties, the kerberos.client.enabled property has been removed since it is no longer used.

    When upgrading from a previous version of Transformer, do not add this property to the Transformer 3.12.0 configuration file.
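
As a minimal sketch of the manual guard that the new Scala processor option replaces, assuming the processor script receives incoming data as inputs(0) and returns results through the output variable (names treated here as assumptions):

    // Pass an empty batch through unchanged instead of running
    // the custom transformation logic against zero records.
    output = if (inputs(0).isEmpty) inputs(0)
             else inputs(0).filter("amount > 0")  // hypothetical logic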
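
For example, if a 3.11.x installation set the base URL in transformer.properties as follows (hypothetical host, with the default Transformer port):

    sdc.base.http.url=https://transformer.example.com:19630

then after upgrading, the renamed property should be set to the same value:

    transformer.base.http.url=https://transformer.example.com:19630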

What's New in 3.11.x

StreamSets Transformer version 3.11.x includes the following new features and enhancements:
  • Field Existence Selector processor - Use this new processor to route records to a downstream stage only when the specified fields exist in those records. Records that do not meet the criteria pass to a separate output for error handling.

  • Avro advanced read properties - When reading Avro data using Spark 2.4 or later, you can now specify the Avro schema to use in JSON format. You can also configure the origin to process all files, including those that do not use the .avro extension. See the schema example at the end of this list.

  • Delta Lake destination enhancement - You can now use the Delta Lake destination to perform updates, upserts, and deletes in addition to inserts.

  • JDBC Lookup processor enhancement - You can configure the processor to return the first matching row or multiple matching rows. You can determine if any matching rows exist or perform a count of matching rows. You can also specify how to order multiple matches.

  • Pipeline preprocessing script - You can define a pipeline preprocessing script that registers a user-defined function (UDF) before the pipeline starts, enabling the UDF to be called by a pipeline stage. For more information, see the sample script on the Advanced tab of the pipeline configuration properties, and the sketch at the end of this list.

  • Standalone Spark cluster - You can now configure a pipeline to run on a standalone Spark cluster.
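
For example, the following Avro schema in JSON format (with hypothetical record and field names) could be specified when reading Avro data:

    {
      "type": "record",
      "name": "transaction",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "double"}
      ]
    }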
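
As a minimal sketch of a preprocessing script, assuming the script can access the active Spark session through a spark variable, the following Scala code registers a hypothetical UDF named normalizeId:

    // Register a UDF that trims and lowercases an ID string
    // so that downstream stages can call it by name.
    spark.udf.register("normalizeId", (id: String) => id.trim.toLowerCase)

A downstream stage could then call normalizeId(<field name>) in a SQL expression.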