Installing Transformer

The method that you use to install Transformer depends on the location of your Spark installation and where you choose to install Transformer.

You can install Transformer using any of the following methods:
  • Local Spark installation - To get started with Transformer in a development environment, you can simply install both Transformer and Spark on the same machine and run Spark locally on that machine. You develop and run Transformer pipelines on the single machine.
  • Cluster Spark installation - In a production environment, use a Spark installation that runs on a cluster to leverage the performance and scale that Spark offers. Install Transformer on a machine that is configured to submit Spark jobs to the cluster. You develop Transformer pipelines locally on the machine where Transformer is installed. When you run Transformer pipelines, Spark distributes the processing across nodes in the cluster.
  • Cloud Spark installation - When you have Spark installed in the cloud, you can use AWS Marketplace or Azure Marketplace to install Transformer as a cloud service. When you run Transformer as a cloud service, you can easily run pipelines on cloud vendor Spark clusters such as Databricks, EMR, and Azure HDInsight.

When the Hadoop YARN cluster uses Kerberos authentication, you must enable Transformer to use Kerberos.

After installing Transformer, as a best practice, configure Transformer to use directories outside of the runtime directory.

Installing when Spark Runs Locally

To get started with Transformer in a development environment, install both Transformer and Spark on the same machine. This allows you to easily develop and test local pipelines, which run on the local Spark installation.

Before you start, ensure that the machine meets the minimum installation requirements.

  1. Download the Transformer tarball.
  2. Use the following command to extract the tarball to the desired location:
    tar xf streamsets-transformer-all-<version>.tgz -C <extraction directory>

    For example, to extract version 3.13.0, use the following command:

    tar xf streamsets-transformer-all-3.13.0.tgz -C /opt/streamsets-transformer/
  3. Download Apache Spark from the Apache Spark Download page to the same machine as the Transformer installation.

    Download a supported Spark version that is valid for the Transformer features that you want to use.

  4. Extract the downloaded file.
  5. Set the SPARK_HOME environment variable to the Spark installation path, as follows:
    export SPARK_HOME=<Spark path>
    For example:
    export SPARK_HOME=/opt/spark-2.4.0-bin-hadoop2.7/
    Modify environment variables using the method required by your installation type.
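Put together, the extract-and-configure steps above look like the following shell sketch. The version number and all paths are examples only, and a scratch tarball stands in for the real Transformer download so that the sketch is self-contained; substitute your actual download and installation locations.

```shell
# Steps 1-5 above as one sketch. The version number and all paths are
# examples; a scratch tarball stands in for the real Transformer download
# so this can run anywhere.
DOWNLOADS="$(mktemp -d)"
mkdir -p "$DOWNLOADS/streamsets-transformer-3.13.0"
tar czf "$DOWNLOADS/streamsets-transformer-all-3.13.0.tgz" \
    -C "$DOWNLOADS" streamsets-transformer-3.13.0

# Step 2: extract the tarball to the desired location.
EXTRACT_DIR="$(mktemp -d)"    # in practice: /opt/streamsets-transformer/
tar xf "$DOWNLOADS/streamsets-transformer-all-3.13.0.tgz" -C "$EXTRACT_DIR"

# Step 5: point SPARK_HOME at the Spark installation (example path).
export SPARK_HOME=/opt/spark-2.4.0-bin-hadoop2.7/

ls "$EXTRACT_DIR"
```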

Installing when Spark Runs on a Cluster

In a production environment, install Transformer on a machine that is configured to submit Spark jobs to a cluster.

Before you start, ensure that the machine meets the minimum installation requirements.

  1. Download the Transformer tarball.
  2. Use the following command to extract the tarball to the desired location:
    tar xf streamsets-transformer-all-<version>.tgz -C <extraction directory>

    For example, to extract version 3.13.0, use the following command:

    tar xf streamsets-transformer-all-3.13.0.tgz -C /opt/streamsets-transformer/
  3. Edit the Transformer configuration file, $TRANSFORMER_DIST/etc/transformer.properties, where $TRANSFORMER_DIST is the base Transformer runtime directory.
    1. To define a publicly accessible URL for Transformer, uncomment the transformer.base.http.url property and set it to the publicly accessible URL for Transformer.
      Transformer must use a publicly accessible URL so that the Spark driver in the cluster can send the status, metrics, and offsets for running pipelines to Transformer.
    2. To run pipelines on a MapR Hadoop YARN cluster, uncomment the hadoop.mapr.cluster property and set it to true.

      Before running a pipeline on a MapR cluster, complete the prerequisite tasks.

    3. For an unsecured Hadoop YARN cluster, to restrict Transformer to impersonating the user who starts the pipeline when submitting Spark jobs, uncomment the hadoop.always.impersonate.current.user property and set it to true.
      On an unsecured Hadoop YARN cluster with Hadoop impersonation enabled, this is a recommended security measure: it prevents a user from impersonating another user by entering a different user name in the Hadoop User Name pipeline property.
      On a secure, Kerberos-enabled cluster, this property is not used. When enabled for Kerberos, Transformer either uses the keytab specified in the pipeline or impersonates the user who starts the pipeline, regardless of how this property is set.
  4. Set the following environment variables on the machine where you installed Transformer:
    • JAVA_HOME - Path to the Java installation on the machine.
    • SPARK_HOME - Path to the Spark installation on the machine. Required for Hadoop YARN and Spark standalone clusters only.

      Clusters can include multiple Spark installations. Be sure to point to a supported Spark version that is valid for the Transformer features that you want to use.

      On Cloudera clusters, Spark is generally installed into the parcels directory. For example, for CDH 5.11, you might use: /opt/cloudera/parcels/SPARK2/lib/spark2.

      Tip: To verify the version of a Spark installation, you can run the spark-shell command. Then, use sc.getConf.get("spark.home") to return the installation location.
    • HADOOP_CONF_DIR or YARN_CONF_DIR - Directory that contains the client-side configuration files for the Hadoop cluster. Required for Hadoop YARN and Spark standalone clusters only.

    For more information about these environment variables, see the Apache Spark documentation.
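As an illustration, the transformer.properties edits from step 3 might end up looking like the following excerpt. The URL and the flag values shown here are placeholders for your environment, not required values.

```properties
# $TRANSFORMER_DIST/etc/transformer.properties (excerpt)

# Publicly accessible URL, so the Spark driver can reach Transformer
# (placeholder host name):
transformer.base.http.url=http://transformer.example.com:19630

# Uncomment only if running pipelines on a MapR Hadoop YARN cluster:
#hadoop.mapr.cluster=true

# Recommended on unsecured Hadoop YARN clusters with impersonation enabled:
hadoop.always.impersonate.current.user=true
```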

When you configure a pipeline, you specify cluster details in the pipeline properties. For information about each cluster type, see the Cluster Type chapter.

Enabling Kerberos for Hadoop YARN Clusters

When a Hadoop YARN cluster uses Kerberos authentication, you must enable Transformer to use Kerberos authentication to communicate securely with the cluster.

When using Kerberos, Transformer can start a pipeline using a Kerberos principal and keytab or using a proxy user. Spark recommends using a Kerberos principal and keytab. For more information about how to configure pipelines when Kerberos is enabled, see Kerberos Authentication.
  1. On Linux, install the following Kerberos client packages on the Transformer machine:
    • krb5-workstation
    • krb5-client
    • K5start, also known as kstart
  2. Copy the Kerberos configuration file, krb5.conf, to the Transformer machine. The default location is /etc/krb5.conf.

    The krb5.conf file contains Kerberos configuration information, including the locations of key distribution centers (KDCs) and admin servers for the Kerberos realms, defaults for the current realm, and mappings of host names onto Kerberos realms.

  3. Copy the keytab file that contains the credentials for the Kerberos principal to the Transformer machine. The default location is /etc/<keytab file>.
  4. Modify the Transformer configuration file, $TRANSFORMER_DIST/etc/transformer.properties, where $TRANSFORMER_DIST is the base Transformer runtime directory.
    Configure the following Kerberos properties in the file:
    • kerberos.client.principal - Kerberos principal to use. Enter a service principal.
    • kerberos.client.keytab - Absolute path to the Kerberos keytab file copied to the Transformer machine.
  5. Restart Transformer.
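For example, the Kerberos properties configured in step 4 might look like the following excerpt. The principal and keytab path are placeholders; substitute the service principal for your realm and the location where you copied the keytab file.

```properties
# $TRANSFORMER_DIST/etc/transformer.properties (excerpt)
# Placeholder values -- substitute your realm and keytab path:
kerberos.client.principal=transformer/host.example.com@EXAMPLE.COM
kerberos.client.keytab=/etc/transformer.keytab
```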

Installation with Cloud Service Providers

You can install Transformer as a service with the following cloud service providers:
  • Amazon Web Services (AWS)
  • Microsoft Azure

Installing on Amazon Web Services

You can install Transformer on Amazon Web Services (AWS). For details about Transformer on AWS, see the AWS Marketplace listing.

  1. Follow the Amazon documentation to launch an instance of StreamSets Transformer.
    1. When choosing an Amazon Machine Image (AMI), search for StreamSets Transformer in the AWS Marketplace and select StreamSets Transformer.
    2. When configuring the instance, set the following requirements:
      • Instance type - Select an appropriate instance type. The instance type that you choose can depend on whether Spark is installed locally or on a cluster. See the Transformer installation requirements for details.
      • Security group - Transformer requires a security group that opens port 19630. See the Amazon documentation for information on how to open a port.
    3. When launching the instance, note the instance ID on the Launch Status page.
      The password to Transformer matches the instance ID.

      AWS might require a few minutes to launch an instance.

  2. To access Transformer, enter the following URL in the address bar of your browser:
    http://<Public DNS of EC2 instance>:19630

    For example, if your public DNS is ec2-12-345-678-999.compute-1.amazonaws.com, enter:

    http://ec2-12-345-678-999.compute-1.amazonaws.com:19630
  3. To log in, enter admin as the user name and the instance ID as the password.
    Tip: If you are new to Transformer, here are the basics.

Installing on Azure

You can install Transformer on Microsoft Azure.

Transformer is installed on a CentOS 7.x virtual machine hosted on Microsoft Azure. You run the installed Transformer as a service.

  1. Log in to the Microsoft Azure portal.
  2. In the Navigation panel, click Create a resource.
  3. Search the Marketplace for StreamSets Transformer for Azure, and then click Create.
  4. On the Create virtual machine page, enter a name for the new virtual machine, the user name to log in to that virtual machine, and the authentication method to use for logins.
    Important: Do not use transformer as the user name to log in to the virtual machine. The transformer user account must be reserved as the system user account that runs Transformer as a service.

    You can create the virtual machine in a new or existing resource group.

    You can optionally change the virtual machine size, but the default size is sufficient in most cases. If you change the default, select a size that meets the minimum requirements.

    For example, the following configuration creates a Basic A3 size virtual machine named transformer with a user named txuser who can log in using password authentication. The virtual machine is created in a new resource group named transformer.

  5. Click Next.
  6. On the Disks page under Advanced, verify that Use managed disks is enabled.
  7. On the remaining pages, accept the defaults or configure the optional features.
    Note: The virtual machine is automatically configured to allow incoming connections on the default Transformer port of 19630 used for the HTTP protocol. If you change the default port or enable HTTPS after installation, you'll also need to configure the virtual machine to allow incoming connections on the changed port.
  8. Verify the details in the Review and Create page, and then click Create.
    It can take several minutes for the resource to deploy and for Transformer to start as a service.
  9. To access the Transformer UI, enter the following URL in the address bar of your browser:
    http://<virtual machine IP address>:19630
  10. Use the following default credentials to log in: admin/admin.
    Tip: If you are new to Transformer, here are the basics.

Configuring Directories (Manual Start)

As a best practice, after installing Transformer from a tarball, configure Transformer to use directories outside of the runtime directory. This allows those directories to remain in use after Transformer upgrades.

Configure the directories used to store configuration, data, log, and resource files so that they are outside of the $TRANSFORMER_DIST directory, the base Transformer runtime directory.

Note: You can use the default locations within the $TRANSFORMER_DIST runtime directory. However, if you use the default values, make sure the user who starts Transformer has write permission on the base Transformer runtime directory.
  1. Create directories outside of the $TRANSFORMER_DIST runtime directory for the configuration, data, log, and resource files.
  2. In the $TRANSFORMER_DIST/libexec/transformer-env.sh file, set the following environment variables to the newly created directories:
    • TRANSFORMER_CONF - The Transformer configuration directory.
    • TRANSFORMER_DATA - The Transformer directory for pipeline state and configuration information.
    • TRANSFORMER_LOG - The Transformer directory for logs.
    • TRANSFORMER_RESOURCES - The Transformer directory for runtime resource files.
    Modify environment variables using the method required by your installation type.
  3. Copy all files from $TRANSFORMER_DIST/etc to the newly created $TRANSFORMER_CONF directory.
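The three steps above can be sketched as the following shell session. The scratch base path and the stub etc/ directory keep the sketch self-contained and runnable anywhere; in practice you would use real locations such as /data/transformer and your actual $TRANSFORMER_DIST installation.

```shell
# Steps 1-3 above as one sketch. BASE stands in for real locations such
# as /data/transformer; the stub etc/ directory stands in for a real
# Transformer installation so the sketch is self-contained.
BASE="$(mktemp -d)"
TRANSFORMER_DIST="$BASE/streamsets-transformer"
mkdir -p "$TRANSFORMER_DIST/etc"
echo "transformer.base.http.url=http://localhost:19630" \
    > "$TRANSFORMER_DIST/etc/transformer.properties"

# Step 1: create directories outside of the runtime directory.
mkdir -p "$BASE/conf" "$BASE/data" "$BASE/log" "$BASE/resources"

# Step 2: these exports belong in $TRANSFORMER_DIST/libexec/transformer-env.sh.
export TRANSFORMER_CONF="$BASE/conf"
export TRANSFORMER_DATA="$BASE/data"
export TRANSFORMER_LOG="$BASE/log"
export TRANSFORMER_RESOURCES="$BASE/resources"

# Step 3: copy all files from $TRANSFORMER_DIST/etc to the new conf directory.
cp -R "$TRANSFORMER_DIST/etc/." "$TRANSFORMER_CONF/"
ls "$TRANSFORMER_CONF"
```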