Installation Requirements

Install Transformer on a machine that meets the following minimum requirements:
Spark
Apache Spark version 2.3.0 or any later 2.x.x version provides access to almost all Transformer features.
Apache Spark 2.4.0 or any later 2.x.x version is required to use the following features:
  • Avro data format processing, and Kafka message key processing with the JSON data format in the Kafka origin or destination
  • Pushdown optimization in the Snowflake origin
  • Slowly changing dimension write mode in the JDBC destination
Apache Spark 2.4.2 or any later 2.x.x version is required to use the following features:
  • Delta Lake stages
  • Partitioned columns in the Hive destination

The Spark installation must include Scala 2.11.x. For more information, see Spark-Scala Version Requirement.

Spark clusters must also have the Spark external shuffle service enabled.

Important: Although Transformer requires an Apache Spark installation, StreamSets does not provide support for the Spark installation.
Cluster Types
For Spark deployed to a cluster, use one of the following cluster types:
  • Apache Spark in Azure HDInsight.
  • Amazon EMR 5.20.0 or any later 5.x.x version.
  • Databricks Runtime versions 5.x or 6.x.
  • Google Dataproc image versions 1.3 or 1.4.
  • Hadoop YARN using a supported Hadoop distribution.

    Before using a Hadoop YARN cluster, create the required directories and update drivers on older distributions, as needed.

  • Kubernetes - Most Spark vendors support Kubernetes for development workloads only.
  • Microsoft SQL Server 2019 Big Data Cluster on SQL Server 2019 Cumulative Update 4 or later.
  • Spark standalone cluster - Most Spark vendors support standalone mode for development workloads only.
Java
Oracle Java 8 or OpenJDK 8
Operating system
One of the following operating systems and versions:
  • Mac OS X - Supported only when Spark runs locally on one machine.
  • CentOS 6.x or 7.x
  • Oracle Linux 6.x or 7.x
  • Red Hat Enterprise Linux 6.x or 7.x
  • Ubuntu 14.04 LTS or 16.04 LTS
Cores: 2
RAM: 1 GB
Disk space: 6 GB
Note: StreamSets does not recommend using NFS or NAS to store Transformer files.
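To quickly confirm that a Linux machine meets the core, memory, disk space, and Java requirements, you might run standard commands such as the following. This is only an illustration; the exact commands available depend on your operating system:
nproc
free -h
df -h
java -version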

Spark-Scala Version Requirement

Transformer requires a supported installation of Spark that includes Scala 2.11.x.

Transformer supports Apache Spark version 2.3.0 or any later 2.x.x version. All supported Spark versions can include Scala 2.11.x; however, some installations include Scala 2.12.x instead. Ensure that your Spark installation includes Scala 2.11.x.

Each Spark version has multiple installation packages, such as Spark with Hadoop 2.6, Spark with Hadoop 2.7, and Spark without Hadoop. For most Spark versions, almost all installation packages include Scala 2.11.x. You must simply avoid installations that explicitly include Scala 2.12.x, such as spark-2.4.3-bin-without-hadoop-scala-2.12.tgz.

However, Spark 2.4.2 is the opposite: most installation packages include Scala 2.12.x. As a result, you must use an installation that explicitly includes Scala 2.11.x, such as spark-2.4.2-bin-without-hadoop-scala-2.11.tgz.
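If you are not sure which Scala version your Spark installation includes, you can check the version banner that Spark prints. For example, running spark-submit --version displays a line similar to the following when the installation is built with Scala 2.11.x; the exact minor versions vary by installation:
spark-submit --version
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_232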

Spark Shuffle Service Requirement

To run a pipeline on a Spark cluster, Transformer requires that the Spark external shuffle service be enabled on the cluster.

Most Spark clusters have the external shuffle service enabled by default. However, Hortonworks clusters do not.

Before you run a pipeline on a Hortonworks cluster, enable the Spark external shuffle service on the cluster. Enable the shuffle service on other clusters as needed.

When you run a pipeline on a cluster without the Spark external shuffle service enabled, the following error is written to the Spark log:
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled. 

For more information about enabling the Spark external shuffle service, see the Spark documentation.
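As a rough guide, on a YARN-managed cluster such as Hortonworks, enabling the external shuffle service typically involves registering Spark's shuffle service as a NodeManager auxiliary service, placing the Spark YARN shuffle service JAR on the NodeManager classpath, and enabling the service in the Spark configuration. The following is a minimal sketch of the relevant settings under those assumptions; file locations and exact steps vary by distribution, so treat the Spark and distribution documentation as authoritative.
In yarn-site.xml on each NodeManager:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
In spark-defaults.conf:
spark.shuffle.service.enabled true
Restart the NodeManagers after making the changes.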

Default Port

Transformer uses either HTTP or HTTPS, both of which run over TCP. Configure network routes and firewalls so that web browsers and the Spark cluster can reach the Transformer IP address.

The default port number that Transformer uses depends on the configured protocol:
  • HTTP - Default is 19630.
  • HTTPS - Default depends on the configuration. For more information, see Enabling HTTPS.
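For example, if Transformer uses HTTP on the default port and the machine runs firewalld, the default firewall on CentOS 7 and Red Hat Enterprise Linux 7, you might open the port with commands like the following, adjusting the port and protocol for your configuration:
sudo firewall-cmd --permanent --add-port=19630/tcp
sudo firewall-cmd --reload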

Browser Requirements

Use the latest version of one of the following browsers to access the Transformer UI:
  • Chrome
  • Firefox
  • Safari

Hadoop YARN Directory Requirements

When using a Hadoop YARN cluster manager, the following directories must exist:
Spark node local directories
The yarn.nodemanager.local-dirs property in the yarn-site.xml file defines one or more directories that must exist on each Spark node.
The value of the property should be available in the cluster manager user interface. By default, the property is set to ${hadoop.tmp.dir}/nm-local-dir.
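For example, the property might be configured in yarn-site.xml as follows, where the directory path shown is only an example value:
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/yarn/nm-local-dir</value>
</property>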
The specified directories must meet all of the following requirements:
  • Exist on each node of the cluster.
  • Be owned by YARN.
  • Have read permission granted to the Transformer proxy user.
HDFS application resource directories
Spark stores resources for all Spark applications started by Transformer in the HDFS home directory of the Transformer proxy user. Home directories are named after the Transformer proxy user, as follows:
/user/<Transformer proxy user name>
Ensure that both of the following requirements are met:
  • Each resource directory exists on HDFS.
  • Each Transformer proxy user has read and write permission on their resource directory.
For example, you might use the following command to add a Transformer user, tx, to a spark user group:
usermod -aG spark tx
Then, you can use the following commands to create the /user/tx directory and ensure that the spark user group has the correct permissions to access the directory:
sudo -u hdfs hdfs dfs -mkdir /user/tx
sudo -u hdfs hdfs dfs -chown tx:spark /user/tx
sudo -u hdfs hdfs dfs -chmod -R 775 /user/tx
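You can then verify the ownership and permissions of the new directory with a command such as the following:
sudo -u hdfs hdfs dfs -ls /user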

Hadoop YARN Driver Requirement

When you run pipelines on clusters that use older Hadoop distributions, the cluster can have an older JDBC driver on the classpath that takes precedence over the JDBC driver required by the pipeline. This can be a problem with the PostgreSQL and SQL Server JDBC drivers.

When a pipeline encounters this issue, it generates a SQLFeatureNotSupportedException error, such as:

java.sql.SQLFeatureNotSupportedException: This operation is not supported.

To avoid this issue, update the PostgreSQL and SQL Server JDBC drivers on the cluster to the latest available versions.
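If you are unsure which driver versions are already present on the cluster, you might search the cluster nodes for existing driver JAR files. For example, the following command looks for PostgreSQL and SQL Server JDBC driver JARs; the file name patterns shown are typical but may differ in your distribution:
find / \( -name "postgresql*.jar" -o -name "mssql-jdbc*.jar" \) 2>/dev/null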

StreamSets Accounts Requirement

Users without an enterprise account must register with StreamSets Accounts before they can download and log in to Transformer or Data Collector. Users with a StreamSets enterprise account use the StreamSets Support portal for downloads.

You can use an existing Google or Microsoft account to register with StreamSets Accounts. Or, you can enter an email address and password, then check your email to confirm your registration. After registering, you can download the latest Transformer or Data Collector tarball or access Docker for the latest Transformer or Data Collector image.

After you install Transformer, you link it to StreamSets Accounts. StreamSets Accounts then provides single sign-on authentication for Transformer.

Note: Users with an enterprise account should not use StreamSets Accounts. Also, users who installed Transformer through a cloud service provider marketplace do not need to use StreamSets Accounts.

To register or log into StreamSets Accounts, go to https://accounts.streamsets.com.

For more information about StreamSets Accounts, see StreamSets Accounts.