Installation Requirements

Install Transformer on a machine that meets the following minimum requirements:
Spark
Apache Spark version 2.3.0 or any later 2.x.x version provides access to almost all Transformer features.
Apache Spark 2.4.0 or any later 2.x.x version is required to use the following features:
  • Avro data format in the Kafka origin or destination
  • Pushdown optimization in the Snowflake origin
  • Slowly changing dimension write mode in the JDBC destination
Apache Spark 2.4.2 or any later 2.x.x version is required to use the following features:
  • Delta Lake stages
  • Partitioned columns in the Hive destination
The Spark installation must include Scala 2.11.x. For more information, see Spark-Scala Version Requirement.
Important: Although Transformer requires an Apache Spark installation, StreamSets does not provide support for the Spark installation.

Cluster types
For Spark deployed to a cluster, use one of the following cluster types:
  • Apache Spark in Azure HDInsight.
  • Amazon EMR 5.13.0 or later.
  • Databricks Runtime version 5.2 or later.
  • Google Dataproc image versions 1.3 and 1.4.
  • Hadoop YARN using one of the following Hadoop distributions:
  • Kubernetes - Most Spark vendors support Kubernetes for development workloads only.
  • Microsoft SQL Server 2019 Big Data Cluster on SQL Server 2019 Cumulative Update 4 or later.
  • Spark standalone cluster - Most Spark vendors support standalone mode for development workloads only.

Java
Oracle Java 8 or OpenJDK 8

Operating system
One of the following operating systems and versions:
  • Mac OS X - Supported only when Spark runs locally on one machine.
  • CentOS 6.x or 7.x
  • Oracle Linux 6.x or 7.x
  • Red Hat Enterprise Linux 6.x or 7.x
  • Ubuntu 14.04 LTS or 16.04 LTS

Cores
2

Disk space
6 GB
Note: StreamSets does not recommend using NFS or NAS to store Transformer files.
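
The core and disk space minimums can be checked before installation with a short script. The following is a minimal sketch rather than a StreamSets-provided tool: the function names (core_count, free_disk_gb, at_least, preflight) are hypothetical, and the script assumes that getconf, df, and awk are available on the machine.

```shell
# Minimum values taken from the requirements above.
MIN_CORES=2
MIN_DISK_GB=6

# Prints the number of online CPU cores (Linux and macOS).
core_count() {
  getconf _NPROCESSORS_ONLN
}

# Prints the free space, in whole GB, on the file system holding directory $1.
free_disk_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeeds when integer $1 is greater than or equal to integer $2.
at_least() {
  [ "$1" -ge "$2" ]
}

# Hypothetical entry point: reports every unmet minimum for install directory $1.
preflight() {
  local dir="$1" status=0
  at_least "$(core_count)" "$MIN_CORES" ||
    { echo "fewer than $MIN_CORES cores" >&2; status=1; }
  at_least "$(free_disk_gb "$dir")" "$MIN_DISK_GB" ||
    { echo "less than $MIN_DISK_GB GB free in $dir" >&2; status=1; }
  return "$status"
}
```

For example, running preflight /opt (where /opt stands in for the intended installation directory) prints a message for each minimum that is not met and returns a non-zero status if any check fails.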

Spark-Scala Version Requirement

Transformer requires a supported installation of Spark that includes Scala 2.11.x.

Transformer supports Apache Spark version 2.3.0 or any later 2.x.x version. Every supported Spark version is available with Scala 2.11.x; however, some installations include Scala 2.12.x instead. Ensure that your Spark installation includes Scala 2.11.x.

Each Spark version has multiple installation packages, such as Spark with Hadoop 2.6, Spark with Hadoop 2.7, and Spark without Hadoop. For most Spark versions, almost all installation packages include Scala 2.11.x, so you only need to avoid packages that explicitly include Scala 2.12.x, such as spark-2.4.3-bin-without-hadoop-scala-2.12.tgz.

Spark 2.4.2 is the exception: most of its installation packages include Scala 2.12.x. With Spark 2.4.2, you must use an installation package that explicitly includes Scala 2.11.x, such as spark-2.4.2-bin-without-hadoop-scala-2.11.tgz.
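
One way to confirm which Scala version an existing Spark installation includes is to read the banner that spark-submit --version prints, which contains a line of the form "Using Scala version 2.11.12". The sketch below assumes that banner format; the function names scala_version_from_banner and is_scala_211 are hypothetical helpers, not part of Spark or Transformer.

```shell
# Reads `spark-submit --version` banner text on stdin and prints the
# Scala version it reports, for example 2.11.12.
scala_version_from_banner() {
  sed -n 's/.*Using Scala version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeeds only when version string $1 belongs to the Scala 2.11.x line.
is_scala_211() {
  case "$1" in
    2.11.*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A possible use, assuming spark-submit is on the PATH (the banner is written to standard error, so it is redirected):

```shell
version="$(spark-submit --version 2>&1 | scala_version_from_banner)"
is_scala_211 "$version" || echo "Unsupported Scala version: $version" >&2
```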

Default Port

Transformer communicates over HTTP or HTTPS, both of which run over TCP. Configure network routes and firewalls so that web browsers and the Spark cluster can reach the Transformer IP address and port.

The default port number that Transformer uses depends on the configured protocol:
  • HTTP - Default is 19630.
  • HTTPS - Default depends on the configuration. For more information, see Enabling HTTPS.
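
To check that a route or firewall rule allows access, you might probe the Transformer URL from a browser machine or a cluster node. The following sketch assumes that curl is installed; transformer_url and is_reachable are hypothetical helper names. Because the HTTPS default depends on the configuration, the sketch requires an explicit port for HTTPS.

```shell
# Default HTTP port stated above.
DEFAULT_HTTP_PORT=19630

# Prints the base URL for scheme $1 and host $2, with optional port $3.
# The port may be omitted only for http.
transformer_url() {
  local scheme="$1" host="$2" port="${3:-}"
  if [ -z "$port" ]; then
    if [ "$scheme" = "http" ]; then
      port="$DEFAULT_HTTP_PORT"
    else
      echo "port is required for scheme '$scheme'" >&2
      return 1
    fi
  fi
  echo "${scheme}://${host}:${port}"
}

# Succeeds if a server answers at URL $1 within 5 seconds.
# Any HTTP status counts as an answer; a TLS certificate error counts as a failure.
is_reachable() {
  curl --silent --output /dev/null --max-time 5 "$1"
}
```

For example, is_reachable "$(transformer_url http tx.example.com)" tests the default HTTP port on a placeholder host name.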

Browser Requirements

Use the latest version of one of the following browsers to access the Transformer UI:
  • Chrome
  • Firefox
  • Safari

Hadoop YARN Directory Requirements

When using a Hadoop YARN cluster manager, the following directories must exist:
Spark node local directories
The yarn.nodemanager.local-dirs configuration parameter in the yarn-site.xml file defines one or more directories that must exist on each Spark node.
The value of the configuration parameter should be available in the cluster manager user interface. By default, the property is set to ${hadoop.tmp.dir}/nm-local-dir.
The specified directories must meet all of the following requirements:
  • Exist on each node of the cluster.
  • Be owned by YARN.
  • Have read permission granted to the Transformer proxy user.
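These requirements can be checked on each node with a few shell helpers. This is a minimal sketch: owner_of, check_local_dir, and proxy_user_can_read are hypothetical names, the YARN owner is assumed to be a user named yarn, and the read check assumes that you are allowed to run commands as the proxy user with sudo.

```shell
# Prints the owning user of path $1.
owner_of() {
  ls -ld "$1" | awk '{ print $3 }'
}

# Verifies that directory $1 exists and is owned by user $2.
check_local_dir() {
  local dir="$1" owner="$2"
  [ -d "$dir" ] || { echo "missing: $dir" >&2; return 1; }
  [ "$(owner_of "$dir")" = "$owner" ] ||
    { echo "$dir is not owned by $owner" >&2; return 1; }
}

# Verifies that proxy user $2 can read directory $1 (requires sudo rights).
proxy_user_can_read() {
  sudo -u "$2" test -r "$1"
}
```

For example, with a local directory of /hadoop/yarn/local (a placeholder path) and a proxy user named tx, you might run check_local_dir /hadoop/yarn/local yarn followed by proxy_user_can_read /hadoop/yarn/local tx on each node.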
HDFS application resource directories
Spark stores resources for all Spark applications started by Transformer in the HDFS home directory of the Transformer proxy user. Home directories are named after the Transformer proxy user, as follows:
/user/<Transformer proxy user name>
Ensure that both of the following requirements are met:
  • Each resource directory exists on HDFS.
  • Each Transformer proxy user has read and write permission on their resource directory.
For example, you might use the following command to add a Transformer user, tx, to a spark user group:
usermod -aG spark tx
Then, you can use the following commands to create the /user/tx directory and ensure that the spark user group has the correct permissions to access the directory:
sudo -u hdfs hdfs dfs -mkdir /user/tx
sudo -u hdfs hdfs dfs -chown tx:spark /user/tx
sudo -u hdfs hdfs dfs -chmod -R 775 /user/tx
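To confirm the result, you might check that the home directory exists and then list its owner, group, and permissions. The sketch below uses the hdfs dfs -test -d and hdfs dfs -ls -d commands; the function names hdfs_home_for and verify_hdfs_home are hypothetical.

```shell
# Prints the HDFS home directory for Transformer proxy user $1.
hdfs_home_for() {
  echo "/user/$1"
}

# Succeeds if the HDFS home directory of proxy user $1 exists.
verify_hdfs_home() {
  hdfs dfs -test -d "$(hdfs_home_for "$1")"
}

# Lists the directory entry itself, showing owner, group, and permissions.
show_hdfs_home() {
  hdfs dfs -ls -d "$(hdfs_home_for "$1")"
}
```

For the tx user above, verify_hdfs_home tx && show_hdfs_home tx should print a single entry for /user/tx owned by tx with the spark group.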