Configuring a Pipeline

Configure a pipeline to define the flow of data. After you configure the pipeline, you can start the pipeline.

A pipeline can include multiple origin, processor, and destination stages.

  1. From the Home page or Getting Started page, click Create New Pipeline.
    Tip: To get to the Home page, click the Home icon.
  2. In the New Pipeline window, configure the following properties:
    Pipeline Property Description
    Title Title of the pipeline.

    Transformer uses the alphanumeric characters entered for the pipeline title as a prefix for the generated pipeline ID. For example, if you enter My Pipeline *&%&^^ 123 as the pipeline title, then the pipeline ID has the following value: MyPipeline123tad9f592-5f02-4695-bb10-127b2e41561c.

    Description Optional description of the pipeline.
    Pipeline Label Optional labels to assign to the pipeline.

    Use labels to group similar pipelines. For example, you might want to group pipelines by database schema or by the test or production environment.

    You can use nested labels to create a hierarchy of pipeline groupings. Enter nested labels using the following format:
    <label1>/<label2>/<label3>
    For example, you might want to group pipelines in the test environment by the origin system. You add the labels Test/HDFS and Test/Elasticsearch to the appropriate pipelines.
  3. Click Save.
    The pipeline canvas displays the pipeline title, the generated pipeline ID, and an error icon. The error icon indicates that the pipeline is empty. The Properties panel displays the pipeline properties.
  4. In the Properties panel, on the General tab, configure the following properties:
    Pipeline Property Description
    Title Optionally edit the title of the pipeline.

    Because the generated pipeline ID is used to identify the pipeline, any changes to the pipeline title are not reflected in the pipeline ID.

    Description Optionally edit or add a description of the pipeline.
    Labels Optionally edit or add labels assigned to the pipeline.
    Execution Mode Execution mode of the pipeline:
    • Batch - Processes all available data in a single batch, and then the pipeline stops.
    • Streaming - Maintains connections to origin systems and processes data as it becomes available. The pipeline runs continuously until you manually stop it.
    Trigger Interval Milliseconds to wait between processing batches of data.

    For streaming execution mode only.

    Enable Ludicrous Mode Enables predicate and filter pushdown to optimize queries so unnecessary data is not processed.
    Collect Input Metrics Collects and displays pipeline input statistics for a pipeline running in ludicrous mode.
    By default, only pipeline output statistics display when a pipeline runs in ludicrous mode.
    Note: For data formats that do not include metrics in the metadata, such as Avro, CSV, and JSON, Transformer must re-read origin data to generate the input statistics. This can slow pipeline performance.
  5. On the Cluster tab, select one of the following options for the Cluster Manager Type property:
    • None (local) - Run the pipeline locally on the Transformer machine.
    • Apache Spark for HDInsight - Run a pipeline on an HDInsight cluster.
    • Databricks - Run the pipeline on a Databricks cluster.
    • Dataproc - Run the pipeline on a Dataproc cluster.
    • EMR - Run the pipeline on an EMR cluster.
    • Hadoop YARN - Run the pipeline on a Hadoop YARN cluster.
    • Kubernetes - Run the pipeline on a Kubernetes cluster. Most Spark vendors support Kubernetes for development workloads only.
    • Spark Standalone - Run the pipeline on a Spark standalone cluster. Most Spark vendors support standalone mode for development workloads only.
    • SQL Server 2019 Big Data Cluster - Run the pipeline on SQL Server 2019 BDC.
  6. Configure the remaining properties on the Cluster tab based on the selected cluster manager type.
    For all cluster manager types, configure the following properties:
    Cluster Property Description
    Application Name Name of the launched Spark application. Enter a name or enter a StreamSets expression that evaluates to the name.

    Press Ctrl + Space Bar to view the list of valid functions you can use in an expression.

    When the application is launched, Spark lowercases the name, removes spaces in the name, and appends the pipeline run number to the name. For example, if you enter the name My Application and then start the initial pipeline run, Spark launches the application with the following name:
    myapplication_run1

    Default is the expression ${pipeline:title()}, which uses the pipeline title as the application name.

    Log Level Log level to use for the launched Spark application.
    Extra Spark Configuration Additional Spark configuration properties to use.

    To add properties, click Add and define the property name and value. You can use simple or bulk edit mode to configure the properties.

    Use the property names and values as expected by Spark.
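
    For example, the following entries use standard Spark configuration property names, with hypothetical values, to increase executor memory and enable dynamic allocation:
    spark.executor.memory              4g
    spark.dynamicAllocation.enabled    true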

    For a pipeline that runs on a Databricks cluster, also configure the following properties:
    Databricks Property Description
    URL to Connect to Databricks Databricks URL for your account. Use the following format:

    https://<your_domain>.cloud.databricks.com

    Staging Directory Staging directory on Databricks File System (DBFS) where Transformer stores the StreamSets resources and files needed to run the pipeline as a Databricks job.

    When a pipeline runs on an existing interactive cluster, configure pipelines to use the same staging directory so that each job created within Databricks can reuse the common files stored in the directory. Pipelines that run on different clusters can use the same staging directory as long as the pipelines are run by the same Transformer instance. Pipelines that run on different instances of Transformer must use different staging directories.

    When a pipeline runs on a provisioned job cluster, using the same staging directory for pipelines is best practice, but not required.

    Default is /streamsets.

    Credential Type Type of credential used to connect to Databricks: Username/Password or Token.
    Username Databricks user name.
    Password Password for the account.
    Token Personal access token for the account.
    Provision a New Cluster Provisions a new Databricks job cluster to run the pipeline upon the initial run of the pipeline.

    Clear this option to run the pipeline on an existing interactive cluster.

    Cluster Configuration Configuration properties for a provisioned Databricks job cluster.

    Configure the listed properties and add additional Databricks cluster properties as needed, in JSON format; a sketch follows this table. Transformer uses the Databricks default values for Databricks properties that are not listed.

    Include the instance_pool_id property to provision a cluster that uses an existing instance pool.

    Use the property names and values as expected by Databricks.

    Terminate Cluster Terminates the provisioned job cluster when the pipeline stops.
    Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline.
    Cluster ID ID of an existing Databricks interactive cluster to run the pipeline. Specify a cluster ID when not provisioning a cluster to run the pipeline.
    Note: When using an existing interactive cluster, all Transformer pipelines that the cluster runs must be built by the same version of Transformer.
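
    The following is a minimal sketch of a Cluster Configuration value for a provisioned job cluster. The property names come from the Databricks cluster specification, and the values shown are placeholders only, so adjust them for your workspace:
    {
        "num_workers": 2,
        "spark_version": "<Databricks runtime version>",
        "node_type_id": "<node type>"
    }
    To provision a cluster that uses an existing instance pool, also include the instance_pool_id property as described above.
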
    For a pipeline that runs on a Dataproc cluster, also configure the following properties:
    Dataproc Property Description
    Project ID Google Cloud project ID.
    Region Region to create the cluster in. Select a region or select Custom and enter a region name.
    Custom Custom region to create the cluster in.
    Credentials Provider Credentials provider to use to connect to Dataproc.
    Credentials File Path (JSON) Path to the Google Cloud service account credentials file that the pipeline uses to connect. The credentials file must be a JSON file.

    Enter a path relative to the Transformer resources directory, $TRANSFORMER_RESOURCES, or enter an absolute path.

    Credentials File Content (JSON) Contents of a Google Cloud service account credentials JSON file used to connect. Enter JSON-formatted credential information in plain text, or use an expression to call the information from runtime resources or a credential store.
    GCS Staging URI Staging location in Google Cloud Storage (GCS) where Transformer stores the StreamSets resources and files needed to run the pipeline as a Dataproc job.

    When a pipeline runs on an existing cluster, configure pipelines to use the same staging directory so that each Spark job created within Dataproc can reuse the common files stored in the directory. Pipelines that run on different clusters can use the same staging directory as long as the pipelines are run by the same Transformer instance. Pipelines that run on different instances of Transformer must use different staging directories.

    When a pipeline runs on a provisioned cluster, using the same staging directory for pipelines is best practice, but not required.

    Default is /streamsets.

    Create Cluster Provisions a new Dataproc cluster to run the pipeline upon the initial run of the pipeline.

    Clear this option to run the pipeline on an existing cluster.

    Cluster Name Name of the existing cluster to run the pipeline. Use the full Dataproc cluster name.
    Cluster Prefix Optional prefix to add to the provisioned cluster name.
    Image Version Image version to use for the provisioned cluster. Transformer supports image versions 1.3 and 1.4.

    Specify the full image version name, such as 1.4-ubuntu18 or 1.3-debian10.

    When not specified, Transformer uses the default Dataproc image version.

    For a list of Dataproc image versions, see the Dataproc documentation.

    Master Machine Type Master machine type to use for the provisioned cluster.
    Worker Machine Type Worker machine type to use for the provisioned cluster.
    Network Type Network type to use for the provisioned cluster:
    • Auto - Uses a VPC network type in auto mode.
    • Custom - Uses a VPC network type with the specified subnet name.
    • Default VPC for project and region - Uses the default VPC for the project ID and region specified for the cluster.

    For more information about network types, see the Dataproc documentation.

    Subnet Name Subnet name for the custom VPC network.
    Network Tags Optional network tags to apply to the provisioned cluster.

    For more information, see the Dataproc documentation.

    Worker Count Number of workers to use for a provisioned cluster.

    Minimum is 2. Using an additional worker for each partition can improve pipeline performance.

    This property is ignored if you enable dynamic allocation using spark.dynamicAllocation.enabled as an extra Spark configuration property.

    Terminate Cluster Terminates the provisioned cluster when the pipeline stops.
    Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline.
    For a pipeline that runs on an EMR cluster, also configure the following property:
    EMR Property Description
    Staging Directory Staging directory on the Amazon S3 staging URI.

    Used to store the Transformer resources and files needed to run the pipeline as an EMR job. The specified staging directory is used with the S3 Staging URI defined on the EMR tab as follows: <S3 staging URI>/<staging directory>.

    When a pipeline runs on an existing cluster, you might configure pipelines to use the same S3 staging URI and staging directory. This allows EMR to reuse the common files stored in that location. Pipelines that run on different clusters can also use the same staging locations as long as the pipelines are run by the same Transformer instance. Pipelines that run on different Transformer instances must use different staging locations.

    Both the staging directory and S3 staging URI must exist before you run the pipeline.

    Default is /streamsets.
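
    For example, if the S3 Staging URI on the EMR tab is s3://mybucket/transformer, a hypothetical bucket and path, and you keep the default staging directory of /streamsets, the pipeline files are staged in the following location:
    s3://mybucket/transformer/streamsets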

    For a pipeline that runs on a Hadoop YARN cluster, also configure the following properties:
    Hadoop YARN Property Description
    Deployment Mode Deployment mode to use:
    • Client - Launches the Spark driver program locally.
    • Cluster - Launches the Spark driver program remotely on one of the nodes inside the cluster.

    For more information about deployment modes, see the Apache Spark documentation.

    Hadoop User Name Name of the Hadoop user that Transformer impersonates to launch the Spark application and to access files in the Hadoop system. When using this property, make sure impersonation is enabled for the Hadoop system.

    When not configured, Transformer impersonates the user who starts the pipeline.

    When Transformer uses Kerberos authentication or is configured to always impersonate the user who starts the pipeline, this property is ignored. For more information, see Kerberos Authentication and Hadoop Impersonation Mode.

    Use YARN Kerberos Keytab Use a Kerberos principal and keytab to launch the Spark application and to access files in the Hadoop system. Transformer includes the keytab file with the launched Spark application.

    When not selected, Transformer uses the user who starts the pipeline as the proxy user to launch the Spark application and to access files in the Hadoop system.

    Enable for long-running pipelines when Transformer is enabled for Kerberos authentication.

    Keytab Source Source to use for the pipeline keytab file:
    • Transformer Configuration File - Use the same Kerberos keytab and principal configured for Transformer in the Transformer configuration file, $TRANSFORMER_DIST/etc/transformer.properties.
    • Pipeline Configuration - File - Use a specific Kerberos keytab file and principal for this pipeline. Store the keytab file on the Transformer machine.
    • Pipeline Configuration - Credential Store - Use a specific Kerberos keytab file and principal for this pipeline. Store the Base64-encoded keytab file in a credential store.

    Available when using a Kerberos principal and keytab for the pipeline.

    YARN Kerberos Keytab Path Absolute path to the keytab file stored on the Transformer machine.

    Available when using Pipeline Configuration - File as the keytab source.

    Keytab Credential Function Credential function used to retrieve the Base64-encoded keytab from the credential store. Use the credential:get() or credential:getWithOptions() credential function.
    For example, the following expression retrieves a Base64-encoded keytab stored in the clusterkeytab secret within the azure credential store:
    ${credential:get("azure", "devopsgroup", "clusterkeytab")}
    Note: The user who starts the pipeline must be in the Transformer group specified in the credential function, devopsgroup in the example above. When Transformer requires a group secret, the user must also be in a group associated with the keytab.

    Available when using Pipeline Configuration - Credential Store as the keytab source.

    YARN Kerberos Principal Kerberos principal name that the pipeline runs as. The specified keytab file must contain the credentials for this Kerberos principal.

    Available when using either pipeline configuration as the keytab source.

    For a pipeline that runs locally, also configure the following property:
    Local Property Description
    Master URL Local master URL to use to connect to Spark. You can define any valid local master URL as described in the Spark Master URL documentation.

    Default is local[*], which runs the pipeline in the local Spark installation using the same number of worker threads as logical cores on the machine.
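
    For example, to limit a local pipeline to four worker threads, enter the following master URL:
    local[4]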

    For a pipeline that runs on SQL Server 2019 BDC, also configure the following properties:
    SQL Server 2019 Big Data Cluster Property Description
    Livy Endpoint SQL Server 2019 BDC Livy endpoint that enables submitting Spark jobs.

    For information about retrieving the Livy endpoint, see Retrieving Connection Information.

    User Name Controller user name to submit Spark jobs through the Livy endpoint.

    For more information, see Retrieving Connection Information.

    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Password Knox password for the controller user name that allows submitting Spark jobs through the Livy endpoint.

    For more information, see Retrieving Connection Information.

    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Staging Directory Staging directory on SQL Server 2019 BDC where Transformer stores the StreamSets resources and files needed to run the pipeline.

    Default is /streamsets.

    Pipelines that run on different clusters can use the same staging directory as long as the pipelines are run by the same Transformer instance. Pipelines that run on different instances of Transformer must use different staging directories.

  7. To define runtime parameters, on the Parameters tab, click the Add icon and define the name and the default value for each parameter.
    You can use simple or bulk edit mode to configure the parameters.
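    For example, you might define a parameter with a hypothetical name such as InputDir and a default value such as /files/input, and then call the parameter in a stage property with the following expression:
    ${InputDir}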
  8. When running the pipeline on an EMR cluster, on the EMR tab, configure the following properties:
    EMR Property Description
    Region AWS region that contains the EMR cluster. Select one of the available regions.

    If the region is not listed, select Other and then enter the name of the AWS region.

    AWS Region The custom region to use.

    Available when Region is set to Other.

    Use IAM Roles When enabled, Transformer connects to EMR using an IAM role assigned to the Transformer EC2 instance.

    When not enabled, Transformer connects using an AWS access key pair.

    Access Key ID Access key ID to use to connect to EMR.

    Required to use AWS keys to connect to EMR. Available when Use IAM Roles is not enabled.

    Secret Access Key Secret access key to use to connect to EMR.

    Required to use AWS keys to connect to EMR. Available when Use IAM Roles is not enabled.

    S3 Staging URI URI for an Amazon S3 directory to store the Transformer resources and files needed to run the pipeline.

    The specified URI is used with the staging directory defined on the Cluster tab as follows: <S3 staging URI>/<staging directory>.

    Both the S3 staging URI and staging directory must exist before you run the pipeline.

    Provision a New Cluster Provisions a new cluster to run the pipeline.
    Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline.
    Cluster ID ID of the existing cluster to run the pipeline. Available when not provisioning a cluster.
    EMR Version EMR cluster version to provision. Transformer supports version 5.13.0 or later.

    Available only for provisioned clusters.

    Cluster Name Prefix Prefix for the name of the provisioned EMR cluster.

    Available only for provisioned clusters.

    Terminate Cluster Terminates the provisioned cluster when the pipeline stops.

    When cleared, the cluster remains active after the pipeline stops.

    Available only for provisioned clusters.

    Logging Enabled Enables copying log data to a specified Amazon S3 location. Use to preserve log data that would otherwise become unavailable when the provisioned cluster terminates.

    Available only for provisioned clusters.

    S3 Log URI Location in Amazon S3 to store pipeline log data.
    Location must be unique for each pipeline. Use the following format:
    s3://<bucket>/<path>

    The bucket must exist before you start the pipeline.

    Available when you enable logging for a provisioned cluster.

    Service Role EMR role used by the Transformer EC2 instance to provision resources and perform other service-level tasks.

    Default is EMR_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    Job Flow Role EMR role that the EC2 instances within the cluster use to perform pipeline tasks.

    Default is EMR_EC2_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    Visible to All Users Enables all AWS Identity and Access Management (IAM) users under your account to access the provisioned cluster.

    Available only for provisioned clusters.

    EC2 Subnet ID EC2 subnet identifier to launch the provisioned cluster in.

    Available only for provisioned clusters.

    Master Security Group ID of the security group on the master node in the cluster.
    Note: Verify that the master security group allows Transformer to access the master node in the EMR cluster. For information on configuring security groups for EMR clusters, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    Slave Security Group Security group ID for the slave nodes in the cluster.

    Available only for provisioned clusters.

    Instance Count Number of EC2 instances to use. Each instance corresponds to a slave node in the EMR cluster.

    Minimum is 2. Using an additional instance for each partition can improve pipeline performance.

    Available only for provisioned clusters.

    Master Instance Type EC2 instance type for the master node in the EMR cluster.

    If an instance type does not display in the list, select Custom and then enter the instance type.

    Available only for provisioned clusters.

    Master Instance Type (Custom) Custom EC2 instance type for the master node. Available when you select Custom for the Master Instance Type property.
    Slave Instance Type EC2 instance type for the slave nodes in the EMR cluster.

    If an instance type does not display in the list, select Custom and then enter the instance type.

    Available only for provisioned clusters.

    Slave Instance Type (Custom) Custom EC2 instance type for the slave nodes. Available when you select Custom for the Slave Instance Type property.
  9. On the Advanced tab, optionally configure the following properties:
    Advanced Property Description
    Cluster Callback URL Callback URL for the Spark cluster to use to communicate with Transformer. Define a URL when the web browser and the Spark cluster must use a different URL to access Transformer.

    Overrides the Transformer URL configured in $TRANSFORMER_CONF/transformer.properties.

    Preprocessing Script Scala script to run before the pipeline starts.

    Develop the script using the Spark APIs for the version of Spark installed on your cluster, which must be compliant with Scala 2.11.x.
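
    The following is a minimal sketch of a preprocessing script. It assumes that the active Spark session is available to the script as spark and registers a hypothetical UDF named stringLength:
    // Register a simple UDF before the pipeline starts.
    // Assumes the SparkSession is exposed to the script as `spark`.
    spark.udf.register("stringLength", (s: String) => s.length)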

  10. Use the Stage Library panel to add an origin stage. In the Properties panel, configure the stage properties.

    For configuration details about origin stages, see Origins.

  11. Use the Stage Library panel to add the next stage that you want to use, connect the origin to the new stage, and configure the new stage.

    For configuration details about processors, see Processors.

    For configuration details about destinations, see Destinations.

  12. Add additional stages as necessary.
  13. At any point, use the Preview icon to preview data to help configure the pipeline.

    Preview becomes available in partial pipelines when all existing stages are connected and configured.

  14. When the pipeline is validated and complete, use the Start icon to run the pipeline.
    When Transformer starts the pipeline, monitor mode displays real-time statistics for the pipeline.