What is StreamSets Transformer?
StreamSets Transformer™ is an execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing framework. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform transformations that require heavy processing on the entire data set in batch or streaming mode.
Pipeline Processing on Spark
Transformer functions as a Spark client that launches distributed Spark applications.
Batch Case Study
Transformer can run pipelines in batch mode. A batch pipeline processes all available data in a single batch, and then stops.
Streaming Case Study
Transformer can run pipelines in streaming mode. A streaming pipeline maintains connections to origin systems and processes data at user-defined intervals. The pipeline runs continuously until you manually stop it.
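As a conceptual sketch of the streaming model (this is an illustration in plain Python, not Transformer's actual implementation, which delegates micro-batching to Spark; all function names here are hypothetical), a streaming pipeline polls the origin at a user-defined interval and processes whatever new data has arrived:

```python
import time

def run_streaming_pipeline(read_batch, transform, write, interval_secs, max_batches):
    # Poll the origin at a fixed interval; each poll yields one micro-batch.
    # A real streaming pipeline runs until stopped; max_batches bounds this demo.
    for _ in range(max_batches):
        records = read_batch()
        if records:  # skip polls that return no new data
            write([transform(r) for r in records])
        time.sleep(interval_secs)

# Hypothetical origin: three successive polls return these slices of data.
polls = iter([[1, 2], [], [3]])
out = []
run_streaming_pipeline(
    read_batch=lambda: next(polls),
    transform=lambda r: r * 10,
    write=out.extend,
    interval_secs=0.01,
    max_batches=3,
)
# out now holds [10, 20, 30]
```

The key contrast with batch mode is the loop: a batch pipeline would read everything once and stop, while the streaming loop keeps polling on the interval until the pipeline is stopped.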
Transformer for Data Collector Users
For users already familiar with StreamSets Data Collector pipelines, here is how Transformer pipelines are similar to, and different from, Data Collector pipelines.
Installation Overview
Installation Requirements
Installing Transformer
Transformer can work with Apache Spark that runs locally on a single machine or that runs on a cluster. The steps that you use to install Transformer depend on your Spark installation.
Customization with Environment Variables
Transformer includes several environment variables that you can modify to customize the following areas:
Enabling HTTPS
Enable HTTPS for Transformer to secure the communication to the Transformer UI and REST API and to use Transformer with StreamSets Control Hub.
Stage-Related Prerequisites
External Libraries
You can install a driver or other library as an external library to make it available to a Transformer stage.
Starting and Logging in to Transformer
Registration with Control Hub Overview
Register Transformer
To register Transformer with Control Hub, you generate an authentication token and modify the Transformer configuration files.
Unregister Transformer
You can unregister a Transformer from Control Hub when you no longer want to use that Transformer installation with Control Hub.
Disconnected Mode
Control Hub Configuration File
You can customize how a registered Transformer works with Control Hub by editing the Control Hub configuration file, $TRANSFORMER_CONF/dpm.properties, located in the Transformer installation.
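As an illustrative sketch only (the property names below are assumptions based on typical StreamSets configuration files; confirm the exact keys against the documentation for your Transformer version), the Control Hub configuration file uses standard Java properties syntax:

```properties
# Hypothetical keys - verify against your version's documentation.
# Whether this Transformer is registered with Control Hub.
dpm.enabled=true
# Base URL of the Control Hub instance this Transformer reports to.
dpm.base.url=https://cloud.streamsets.com
```

The authentication token generated during registration is typically stored in a separate token file under $TRANSFORMER_CONF rather than inline in this file.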
What is a Transformer Pipeline?
A Transformer pipeline describes the flow of data from origin systems to destination systems and defines how to transform the data along the way.
Execution Mode
Transformer pipelines can run in batch or streaming mode.
Stage Library Match Requirement
Local Pipelines
Local pipelines run on the local Spark installation on the Transformer machine.
Pipelines on Hadoop YARN
You can run Transformer pipelines using Spark deployed on a Hadoop YARN cluster. When a pipeline runs, Spark distributes the processing across nodes in the cluster.
Pipelines on Databricks
You can run Transformer pipelines using Spark deployed on a Databricks cluster. When a pipeline runs, Spark distributes the processing across nodes in the cluster.
Pipelines on SQL Server 2019 Big Data Cluster
You can run Transformer pipelines using Spark deployed on SQL Server 2019 Big Data Cluster (BDC). When a pipeline runs, Spark distributes the processing across nodes in the cluster.
SQL Server 2019 JDBC Connection Information
Extra Spark Configuration
When you create a pipeline, you can define extra Spark configuration properties that determine how the pipeline runs on Spark. Transformer passes the configuration properties to Spark when it launches the Spark application.
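For example, you might pass standard Spark configuration properties such as the following. These are ordinary Spark keys, not Transformer-specific ones; the values shown are placeholders to adjust for your cluster:

```properties
# Memory and cores allocated to each Spark executor
spark.executor.memory=4g
spark.executor.cores=2
# Let Spark scale the number of executors with the workload
spark.dynamicAllocation.enabled=true
```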
When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline data into partitions and performing operations on the partitions in parallel.
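The partition-and-parallelize model can be sketched in plain Python (a conceptual illustration only; Spark distributes partitions across executor processes on cluster nodes, not threads, and the function name here is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_partitions(records, num_partitions, operation):
    # Split the data into roughly equal partitions, as Spark splits a data set.
    partitions = [records[i::num_partitions] for i in range(num_partitions)]
    # Apply the same operation to every partition in parallel.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = pool.map(lambda part: [operation(r) for r in part], partitions)
    # Gather the per-partition results back into one collection.
    return [r for part in results for r in part]

doubled = process_in_partitions(list(range(6)), num_partitions=3,
                                operation=lambda x: x * 2)
```

Because each partition is processed independently, adding nodes (or here, threads) increases throughput without changing the operation itself.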
Offset Handling
Delivery Guarantee
Transformer's offset handling ensures that, in the event of a sudden failure, a Transformer pipeline does not lose data: it processes data at least once. If a sudden failure occurs, up to one batch of data may be reprocessed. This is an at-least-once delivery guarantee.
Caching Data
You can configure most origins and processors to cache data. You might enable caching when a stage passes data to more than one downstream stage.
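The benefit of caching when a stage feeds multiple downstream stages can be shown with a simple count of how often an expensive computation runs (a plain-Python illustration of the general idea; in Transformer the caching is handled by Spark):

```python
compute_calls = 0

def expensive_stage(record):
    # Stand-in for a costly upstream transformation.
    global compute_calls
    compute_calls += 1
    return record * 2

data = [1, 2, 3]

# Without caching, each downstream branch recomputes the upstream output.
branch_a = [expensive_stage(r) for r in data]
branch_b = [expensive_stage(r) for r in data]
# compute_calls is now 6: every record was computed twice.

# With caching, the upstream output is computed once and reused.
compute_calls = 0
cached = [expensive_stage(r) for r in data]
branch_a = [c + 1 for c in cached]
branch_b = [c - 1 for c in cached]
# compute_calls is now 3: one computation per record, shared by both branches.
```

The trade-off is memory: cached results must be held for as long as downstream stages need them.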
Performing Lookups
Ludicrous Processing Mode
You can configure Transformer to run a pipeline in ludicrous processing mode.
Preprocessing Script
Data Types
Technology Preview Functionality
Transformer includes certain new features and stages with the Technology Preview designation. Technology Preview functionality is available for use in development and testing, but is not meant for use in production.
Configuring a Pipeline
Configure a pipeline to define the flow of data. After you configure the pipeline, you can start the pipeline.
Origins Overview
An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline. Or, you can use multiple origin stages in a pipeline and then join the origins using the Join processor.
Schema Inference
When processing data formats that include schemas with the data, such as Avro, ORC, and Parquet, Transformer origins use those schemas to process the data.
Custom Schemas
When reading delimited or JSON data, you can configure an origin to use a custom schema to process the data. By default, origins infer the schema from the data.
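Schema inference can be sketched as scanning sample records and recording the first observed type for each field (a simplified plain-Python illustration; the function name is hypothetical, and Transformer's actual inference, performed by Spark, is more sophisticated):

```python
import json

def infer_schema(json_lines):
    # Build a field -> type-name mapping from sample JSON records.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # Keep the first type seen for each field.
            schema.setdefault(field, type(value).__name__)
    return schema

sample = [
    '{"id": 1, "name": "sf", "active": true}',
    '{"id": 2, "price": 9.5}',
]
schema = infer_schema(sample)
# {'id': 'int', 'name': 'str', 'active': 'bool', 'price': 'float'}
```

A custom schema replaces this guesswork with an explicit declaration, which matters when sample data is unrepresentative, for example when a numeric-looking field should be treated as a string.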
Amazon S3
Delta Lake
Whole Directory
Preview Overview
You can preview data to help build or fine-tune a pipeline. You can preview complete or incomplete pipelines.
Preview Codes
In Preview mode, Transformer displays different colors for different types of data. Transformer uses other codes and formatting to highlight changed fields.
Processor Output Order
When previewing data for a processor, you can preview both the input and the output data. You can display the output records in the order that matches the input records or in the order produced by the processor.
Previewing a Pipeline
Editing Properties
When running preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the condition in a Stream Selector processor to see how the condition alters which records pass to the different output streams.
Pipeline Monitoring Overview
When Transformer runs a pipeline, you can view real-time statistics about the pipeline.
Pipeline and Stage Statistics
When you monitor a pipeline, you can view real-time summary statistics for the pipeline and for stages in the pipeline.
Spark Web UI
As you monitor a pipeline, you can also access the Spark web UI for the application launched for the pipeline. Use the Spark web UI to monitor the Spark jobs executed for the launched application, just as you monitor any other Spark application.
Pipeline Run History
You can view the run history of a pipeline when you configure or monitor a pipeline. View the run history from either the Summary or History tab.
© 2020 StreamSets, Inc.