What is StreamSets Transformer?
StreamSets Transformer™ is an execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing framework. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform transformations that require heavy processing on the entire data set in batch or streaming mode.
Pipeline Processing on Spark
Transformer functions as a Spark client that launches distributed Spark applications.
Batch Case Study
Transformer can run pipelines in batch mode. A batch pipeline processes all available data in a single batch, and then stops.
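Transformer generates the Spark code for a batch pipeline itself, but the underlying behavior is roughly that of the following Spark sketch, runnable in spark-shell (where the spark session is predefined); the path and column name are placeholders:

    // Read all currently available data once (placeholder path).
    val orders = spark.read.parquet("/data/orders")

    // Transform and write the full result; the application then stops.
    orders.filter("amount > 0")
      .write.mode("overwrite")
      .parquet("/data/orders_clean")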
Streaming Case Study
Transformer can run pipelines in streaming mode. A streaming pipeline maintains connections to origin systems and processes data at user-defined intervals. The pipeline runs continuously until you manually stop it.
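Again as a rough Spark sketch rather than Transformer's own code, streaming behavior corresponds to a Structured Streaming query with a processing-time trigger. The snippet below is runnable in spark-shell when the spark-sql-kafka package is available; the broker, topic, and paths are placeholders:

    import org.apache.spark.sql.streaming.Trigger

    // Maintain a connection to the origin system ...
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // ... and process newly arrived data at a user-defined interval until stopped.
    events.writeStream
      .format("parquet")
      .option("path", "/data/events")
      .option("checkpointLocation", "/tmp/checkpoints/events")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()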
Transformer for Data Collector Users
For users already familiar with StreamSets Data Collector pipelines, here's how Transformer pipelines are similar... and different.
Installation Overview
Installation Requirements
Installing Transformer
Transformer can work with Apache Spark running locally on a single machine or on a cluster. The steps that you use to install Transformer depend on your Spark installation.
Enabling HTTPS
Enable HTTPS for Transformer to secure communication with the Transformer UI and REST API, and to use Transformer with StreamSets Control Hub.
Stage-Related Prerequisites
Starting and Logging in to Transformer
Registration with Control Hub Overview
Register Transformer
To register Transformer with Control Hub, you generate an authentication token and modify the Transformer configuration files.
Unregister Transformer
You can unregister a Transformer from Control Hub when you no longer want to use that Transformer installation with Control Hub.
Control Hub Configuration File
You can customize how a registered Transformer works with Control Hub by editing the Control Hub configuration file, $TRANSFORMER_CONF/dpm.properties, located in the Transformer installation.
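As an illustration only, the file typically holds entries such as the following; these property names follow the pattern used by other StreamSets engines and are an assumption, so verify them against the dpm.properties shipped with your Transformer installation:

    dpm.enabled=true
    dpm.base.url=https://cloud.streamsets.com
    dpm.appAuthToken=@application-token.txt@
    dpm.remote.control.job.labels=all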
What is a Transformer Pipeline?
A Transformer pipeline describes the flow of data from origin systems to destination systems and defines how to transform the data along the way.
Execution Mode
Transformer pipelines can run in batch or streaming mode.
Stage Library Match Requirement
Local Pipelines
Local pipelines run on the local Spark installation on the Transformer machine.
Cluster Pipelines on Hadoop YARN
Cluster pipelines run on a Hadoop YARN cluster, where Spark distributes the processing across the nodes in the cluster.
Cluster Pipelines on Databricks
Cluster pipelines run on a Databricks cluster, where Spark distributes the processing across the nodes in the cluster.
Extra Spark Configuration
When you create a pipeline, you can define extra Spark configuration properties that determine how the pipeline runs on Spark. Transformer passes the configuration properties to Spark when it launches the Spark application.
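For example, you might pass standard Apache Spark properties that control executor sizing and shuffle parallelism. The keys below are standard Spark configuration names; the values are arbitrary examples:

    spark.executor.memory=4g
    spark.executor.cores=2
    spark.sql.shuffle.partitions=200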
Partitioning
When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline data into partitions and performing operations on the partitions in parallel.
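The partition count is determined by Spark, not by Transformer. As a plain Spark illustration of the mechanism, runnable in spark-shell (the path is a placeholder):

    val df = spark.read.parquet("/data/orders")

    // Spark chooses an initial partition count from the input splits.
    println(df.rdd.getNumPartitions)

    // Repartitioning changes how many tasks later operations run in parallel.
    val repartitioned = df.repartition(40)
    println(repartitioned.rdd.getNumPartitions)   // 40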
Caching Data
You can configure most origins and processors to cache data. You might enable caching when a stage passes data to more than one downstream stage.
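In Spark terms, caching materializes a stage's output so that each downstream branch reuses it instead of recomputing it. A minimal Spark sketch of the effect, not the Transformer configuration itself (paths and columns are placeholders):

    val orders = spark.read.parquet("/data/orders")

    // Cache the output of the stage that feeds two downstream branches.
    val cleaned = orders.filter("amount > 0").cache()

    // Both branches reuse the cached data instead of recomputing the filter.
    cleaned.write.mode("overwrite").parquet("/out/all_orders")
    cleaned.groupBy("region").count()
      .write.mode("overwrite").parquet("/out/orders_by_region")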
Performing Lookups
Ludicrous Processing Mode
You can configure Transformer to run a pipeline in ludicrous processing mode.
Data Types
Technology Preview Functionality
Transformer includes certain new features and stages with the Technology Preview designation. Technology Preview functionality is available for use in development and testing, but is not meant for use in production.
Configuring a Pipeline
Configure a pipeline to define the flow of data. After you configure the pipeline, you can start it.
Origins Overview
An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline, or you can use multiple origin stages and then join their data using the Join processor.
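Joining multiple origins corresponds to a Spark DataFrame join on a shared key. As an illustration of the concept, runnable in spark-shell (paths and the key column are placeholders):

    val orders = spark.read.parquet("/data/orders")
    val customers = spark.read.json("/data/customers")

    // Combining two origins corresponds to a DataFrame join like this.
    val joined = orders.join(customers, Seq("customer_id"), "inner")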
Schema Inference
Custom Schemas
When reading delimited or JSON data, you can configure an origin to use a custom schema to process the data. By default, origins infer the schema from the data.
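A custom schema corresponds to an explicit Spark schema applied at read time in place of schema inference. A sketch of the underlying Spark mechanism, runnable in spark-shell (field names and the path are placeholders):

    import org.apache.spark.sql.types._

    // Explicit schema instead of letting Spark infer types from the data.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType),
      StructField("ordered_at", TimestampType)
    ))

    val orders = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/data/orders.csv")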
ADLS Gen1
ADLS Gen2
Amazon S3
Delta Lake
File
JDBC
Hive
Kafka
Whole Directory
Preview Overview
You can preview data to help build or fine-tune a pipeline. You can preview complete or incomplete pipelines.
Preview Codes
In Preview mode, Transformer displays different colors for different types of data. Transformer uses other codes and formatting to highlight changed fields.
Processor Output Order
When previewing data for a processor, you can preview both the input and the output data. You can display the output records in the order that matches the input records or in the order produced by the processor.
Previewing a Pipeline
Editing Properties
When running preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the condition in a Stream Selector processor to see how the condition alters which records pass to the different output streams.
Pipeline Monitoring Overview
When Transformer runs a pipeline, you can view real-time statistics about the pipeline.
Pipeline and Stage Statistics
When you monitor a pipeline, you can view real-time summary statistics for the pipeline and for stages in the pipeline.
Spark Web UI
As you monitor a pipeline, you can also access the Spark web UI for the application launched for the pipeline. Use the Spark web UI to monitor the Spark jobs executed for the launched application, just as you monitor any other Spark application.
Pipeline Run History
You can view the run history of a pipeline when you configure or monitor a pipeline. View the run history from either the Summary or History tab.