What is StreamSets Transformer?
StreamSets Transformer™ is an execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing framework. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform transformations that require heavy processing on the entire data set in batch or streaming mode.
Pipeline Processing on Spark
Transformer functions as a Spark client that launches distributed Spark applications.
Batch Case Study
Transformer can run pipelines in batch mode. A batch pipeline processes all available data in a single batch, and then stops.
Streaming Case Study
Transformer can run pipelines in streaming mode. A streaming pipeline maintains connections to origin systems and processes data at user-defined intervals. The pipeline runs continuously until you manually stop it.
Transformer for Data Collector Users
For users already familiar with StreamSets Data Collector pipelines, here's how Transformer pipelines are similar... and different.
Tutorials and Sample Pipelines
StreamSets provides tutorials and sample pipelines to help you learn about using Transformer.
Installation Requirements
Installing Transformer
The method that you use to install Transformer depends on the location of your Spark installation and where you choose to install Transformer.
Protecting Sensitive Information in Configuration Files
You can protect sensitive information required by Transformer configuration files by storing the data in a separate location, then using a function to retrieve the data.
Customization with Environment Variables
Transformer includes several environment variables that you can modify to customize the following areas:
Enabling HTTPS
Enable HTTPS for Transformer to secure the communication to the Transformer UI and REST API and to use Transformer with StreamSets Control Hub.
Using a Reverse Proxy
Credential Stores
You can configure Transformer to access sensitive information that is secured in a credential store.
Stage-Related Prerequisites
External Libraries
You can install a driver or other library as an external library to make it available to a Transformer stage.
User Authentication
When Transformer is not registered with Control Hub, Transformer can authenticate user accounts based on LDAP or files. Best practice is to use LDAP authentication, particularly for a production deployment of Transformer. By default, Transformer uses file-based authentication.
Starting and Logging in to Transformer
You can upgrade a previous version of Transformer to a new version. You can upgrade an installation from the tarball or from the RPM package.
Pre Upgrade Tasks
Upgrade an Installation from the Tarball
Upgrade an Installation from the RPM Package
When you upgrade an installation from the RPM package, the new version uses the default Transformer configuration, data, log, and resource directories. If the previous version used the default directories, the new version has access to the files created in the previous version.
Post Upgrade Tasks
In some situations, you must complete tasks within Transformer or your Control Hub on-premises installation after you upgrade.
Troubleshooting an Upgrade
Registration with Control Hub Overview
Register Transformer
To register Transformer with Control Hub, you generate an authentication token and modify the Transformer configuration files.
Unregister Transformer
You can unregister a Transformer from Control Hub when you no longer want to use that Transformer installation with Control Hub.
Disconnected Mode
Control Hub Configuration File
You can customize how a registered Transformer works with Control Hub by editing the Control Hub configuration file, $TRANSFORMER_CONF/dpm.properties, located in the Transformer installation.
Transformer pipelines run on Spark. Generally, you run Transformer pipelines on Spark deployed on a cluster to leverage the performance and scale that Spark offers. When needed, however, you can also run a local pipeline on the Transformer machine.
Amazon EMR
You can run Transformer pipelines using Spark deployed on an Amazon EMR cluster. Transformer supports Amazon EMR 5.13.0 or later.
Databricks
You can run Transformer pipelines using Spark deployed on a Databricks cluster. Transformer supports Databricks Runtime version 5.2 or later.
Google Dataproc
You can run Transformer pipelines using Spark deployed on a Google Dataproc cluster. Transformer supports Dataproc image versions 1.3 and 1.4.
Hadoop YARN
You can run Transformer pipelines using Spark deployed on a Hadoop YARN cluster.
SQL Server 2019 Big Data Cluster
You can run Transformer pipelines using Spark deployed on SQL Server 2019 Big Data Cluster (BDC). Transformer supports SQL Server 2019 Cumulative Update 4 or later. SQL Server 2019 BDC uses Apache Livy to submit Spark jobs.
What is a Transformer Pipeline?
A Transformer pipeline describes the flow of data from origin systems to destination systems and defines how to transform the data along the way.
Sample Pipelines
Transformer provides sample pipelines that you can use to learn about Transformer features or as a template for building your own pipelines.
Execution Mode
Transformer pipelines can run in batch or streaming mode.
Stage Library Match Requirement
Local Pipelines
Typically, you run a Transformer pipeline on a cluster. You can also run a pipeline on a Spark installation on the Transformer machine. This is known as a local pipeline.
Spark Executors
A Transformer pipeline runs on one or more Spark executors.
SQL Server 2019 JDBC Connection Information
Extra Spark Configuration
When you create a pipeline, you can define extra Spark configuration properties that determine how the pipeline runs on Spark. Transformer passes the configuration properties to Spark when it launches the Spark application.
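As a minimal sketch of what such properties might look like, the snippet below collects a few standard Spark properties in a dictionary and renders them as `--conf key=value` arguments, the form that `spark-submit` accepts. The property values and the `to_conf_args` helper are hypothetical, and how Transformer itself hands the properties to Spark is not described here.

```python
# Hypothetical extra Spark configuration for a pipeline.
# The property names are standard Spark settings; the values are examples only.
extra_spark_config = {
    "spark.executor.memory": "4g",
    "spark.executor.cores": "2",
    "spark.dynamicAllocation.enabled": "true",
}


def to_conf_args(properties):
    """Render properties as spark-submit style ``--conf key=value`` arguments."""
    args = []
    for key, value in sorted(properties.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


conf_args = to_conf_args(extra_spark_config)
print(" ".join(conf_args))
```

Keeping the properties in one mapping makes it easy to vary them per pipeline without touching the rest of the configuration.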
When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline data into partitions and performing operations on the partitions in parallel.
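The idea of splitting data into partitions and operating on them in parallel can be illustrated without Spark. The following is a plain-Python analogy, not Spark code: the `split_into_partitions` and `process_partition` helpers, the thread pool, and the uppercase transformation are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin across num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def process_partition(partition):
    """Apply the same transformation to every record in one partition."""
    return [record.upper() for record in partition]


records = ["a", "b", "c", "d", "e"]
partitions = split_into_partitions(records, num_partitions=2)

# Each partition is processed independently; map() returns results
# in the same order as the input partitions.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_partition, partitions))

print(results)
```

Because each partition is handled on its own, adding workers lets more partitions be processed at the same time, which is the property a cluster exploits at much larger scale.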
Offset Handling
Delivery Guarantee
Transformer's offset handling ensures that a Transformer pipeline does not lose data when a sudden failure occurs: it processes data at least once. Depending on when the failure occurs, up to one batch of data may be reprocessed. This is an at-least-once delivery guarantee.
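The general pattern behind an at-least-once guarantee can be sketched as "write the batch first, then commit the offset". The simulation below is an assumption-laden illustration, not Transformer's implementation: `run_pipeline`, the in-memory `offset_store`, and the simulated failure are all hypothetical.

```python
def run_pipeline(source, destination, offset_store, batch_size, fail_after_batch=None):
    """Process source from the last committed offset, committing after each batch.

    If fail_after_batch is set, simulate a crash after that batch has been
    written but before its offset has been committed.
    """
    offset = offset_store.get("offset", 0)
    batch_number = 0
    while offset < len(source):
        batch = source[offset:offset + batch_size]
        destination.extend(batch)            # 1. write the batch
        batch_number += 1
        if batch_number == fail_after_batch:
            raise RuntimeError("simulated sudden failure")
        offset += len(batch)
        offset_store["offset"] = offset      # 2. commit the offset


source = list(range(10))
destination, offset_store = [], {}

try:
    run_pipeline(source, destination, offset_store, batch_size=4, fail_after_batch=2)
except RuntimeError:
    pass  # the pipeline stopped unexpectedly

# Restart: the pipeline resumes from the last committed offset.
run_pipeline(source, destination, offset_store, batch_size=4)

duplicates = len(destination) - len(set(destination))
print(sorted(set(destination)) == source, duplicates)
```

In this run no record is lost, and the only duplicates are the records of the single batch that was written before the failure but not yet committed.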
Caching Data
You can configure most origins and processors to cache data. You might enable caching when a stage passes data to more than one downstream stage.
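One common reason to cache in Spark generally is that, without caching, a result consumed by several downstream steps may be recomputed for each of them. The toy example below illustrates that effect in plain Python; the `Stage` class and its counter are hypothetical and do not reflect how Transformer stages are implemented.

```python
class Stage:
    """A toy upstream stage that counts how often its output is computed."""

    def __init__(self, cache_enabled):
        self.cache_enabled = cache_enabled
        self.compute_count = 0
        self._cached = None

    def output(self):
        if self.cache_enabled and self._cached is not None:
            return self._cached
        self.compute_count += 1
        data = [value * 2 for value in range(5)]  # stand-in for expensive work
        if self.cache_enabled:
            self._cached = data
        return data


def run_two_downstream_stages(stage):
    """Two downstream stages each read the upstream stage's output."""
    total = sum(stage.output())
    maximum = max(stage.output())
    return total, maximum


uncached = Stage(cache_enabled=False)
cached = Stage(cache_enabled=True)
run_two_downstream_stages(uncached)
run_two_downstream_stages(cached)
print(uncached.compute_count, cached.compute_count)
```

With caching enabled, the upstream work is done once and reused by both consumers; without it, the work is repeated for each consumer.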
Performing Lookups
Expressions in Pipeline and Stage Properties
Simple and Bulk Edit Mode
Runtime Values
Runtime values are values that you define outside of the pipeline and use for stage and pipeline properties. You can change the values for each pipeline run without having to edit the pipeline.
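To illustrate the idea of defining values outside the pipeline and supplying them per run, the sketch below substitutes named values into property strings with Python's `string.Template`. The `${NAME}` placeholder style, the property names, and the `resolve` helper are assumptions for illustration and are not taken from the Transformer documentation.

```python
from string import Template

# Hypothetical stage properties that refer to runtime values by name.
stage_properties = {
    "directory": "${INPUT_DIR}/orders",
    "batch_size": "${BATCH_SIZE}",
}


def resolve(properties, runtime_values):
    """Substitute runtime values into every property string."""
    return {
        name: Template(value).substitute(runtime_values)
        for name, value in properties.items()
    }


dev_run = resolve(stage_properties, {"INPUT_DIR": "/data/dev", "BATCH_SIZE": "100"})
prod_run = resolve(stage_properties, {"INPUT_DIR": "/data/prod", "BATCH_SIZE": "5000"})
print(dev_run)
print(prod_run)
```

The property definitions stay the same across runs; only the mapping of runtime values changes.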
Ludicrous Processing Mode
You can configure Transformer to run a pipeline in ludicrous processing mode.
Cluster Callback URL
You can configure a cluster callback URL for a pipeline to use instead of the default Transformer URL.
Preprocessing Script
Data Types
Technology Preview Functionality
Transformer includes certain new features and stages with the Technology Preview designation. Technology Preview functionality is available for use in development and testing, but is not meant for use in production.
Configuring a Pipeline
Configure a pipeline to define the flow of data. After you configure the pipeline, you can start the pipeline.
An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline, or you can use multiple origin stages and then combine them using a Join or Union processor.
Schema Inference
When processing data formats that include schemas with the data, such as Avro, ORC, and Parquet, Transformer origins use those schemas to process the data.
Custom Schemas
When reading delimited or JSON data, you can configure an origin to use a custom schema to process the data. By default, origins infer the schema from the data.
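The effect of a custom schema on delimited data can be sketched in plain Python. In the example below, the sample data, the `custom_schema` mapping, and the `read_delimited` helper are hypothetical. Note one simplification: when no schema is given, this sketch keeps every value as a string rather than inferring types as an origin would.

```python
import csv
import io

raw = "id,price,name\n1,9.50,widget\n2,12.00,gadget\n"

# Hypothetical custom schema: column name -> conversion function.
custom_schema = {"id": int, "price": float, "name": str}


def read_delimited(text, schema=None):
    """Read delimited text; apply the schema if given, else keep strings."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if schema is not None:
            row = {column: schema[column](value) for column, value in row.items()}
        rows.append(dict(row))
    return rows


as_strings = read_delimited(raw)
typed = read_delimited(raw, custom_schema)
print(as_strings[0])
print(typed[0])
```

Supplying the schema explicitly makes the resulting field types predictable, regardless of what the data in a particular file happens to look like.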
Amazon S3
Azure Event Hubs
Delta Lake
Oracle JDBC Table
PostgreSQL JDBC Table
SQL Server JDBC Table
Whole Directory
You can preview data to help build or fine-tune a pipeline. You can preview complete or incomplete pipelines.
Preview Codes
In Preview mode, Transformer displays different colors for different types of data. Transformer uses other codes and formatting to highlight changed fields.
Processor Output Order
When previewing data for a processor, you can preview both the input and the output data. You can display the output records in the order that matches the input records or in the order produced by the processor.
Previewing a Pipeline
Editing Properties
When running preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the condition in a Stream Selector processor to see how the condition alters which records pass to the different output streams.
When Transformer runs a pipeline, you can view real-time statistics about the pipeline.
Pipeline and Stage Statistics
When you monitor a pipeline, you can view real-time summary statistics for the pipeline and for stages in the pipeline.
Cluster and Spark URLs
In monitor mode, the Monitoring panel provides URLs for the cluster or the Spark application that runs the pipeline.
Pipeline Run History
You can view the run history of a pipeline when you configure or monitor a pipeline. View the run history from either the Summary or History tab.
Viewing Transformer Configuration Properties
Viewing Transformer Directories
You can view the directories that Transformer uses. You might check which directories are in use when you need to access a file in one of them or to increase the amount of available space for a directory.
Viewing Transformer Metrics
Transformer Log Files
Shutting Down Transformer
You can shut down and then manually launch Transformer to apply changes to the Transformer configuration file, environment configuration file, or user logins.
Restarting Transformer
You can restart Transformer to apply changes to the Transformer configuration file, environment configuration file, or user logins. During the restart process, Transformer shuts down and then automatically restarts.
Opting Out of Usage Statistics Collection
You can help to improve Transformer by allowing StreamSets to collect usage statistics about Transformer system performance and features that you use. This information helps StreamSets to improve product performance and to make product development decisions.
Transformer installation packages downloaded from the StreamSets website require that you register and activate the Transformer instance with StreamSets or that you register the Transformer instance with StreamSets Control Hub. Only users with the Admin role can register and activate Transformer.
© 2020 StreamSets, Inc.