Stage-Related Prerequisites

Some stages require that you complete prerequisite tasks before you use them in a pipeline. If you do not use a particular stage, you can skip its prerequisite tasks.

The following stages and features include prerequisite tasks:

Local Pipeline Prerequisites for Amazon S3 and ADLS

Transformer uses Hadoop APIs to connect to the following external systems:
  • Amazon S3
  • Microsoft Azure Data Lake Storage Gen1 and Gen2

To run pipelines that connect to these systems, Spark requires access to Hadoop client libraries and system-related client libraries.

Cluster pipelines require no additional setup because supported cluster managers include the required libraries. However, before you run a local pipeline that connects to these systems, you must complete the following prerequisite tasks.

Ensure Access to Hadoop Client Libraries

To run a local pipeline that connects to Amazon S3, or Azure Data Lake Storage Gen1 or Gen2, the Spark installation on the Transformer machine must be configured to use Hadoop 3.2.0 client libraries.

Some Spark installations include Hadoop by default, but the bundled Hadoop libraries are earlier than version 3.2.0. To set up Spark with the required client libraries, complete the following steps. A consolidated example follows the steps.

  1. Install Spark without Hadoop on the Transformer machine.

    You can download Spark without Hadoop from the Spark website. Select the version named Pre-built with user-provided Apache Hadoop.

  2. Install Hadoop 3.2.0.

    You can download Hadoop 3.2.0 from the Apache website.

  3. Configure the SPARK_DIST_CLASSPATH environment variable to include the Hadoop libraries.
    Spark recommends adding an entry to the conf/spark-env.sh file. For example:
    export SPARK_DIST_CLASSPATH=$(/<path to>/hadoop/bin/hadoop classpath)
    For more information, see the Spark documentation.
  4. If the Transformer machine includes multiple versions of Spark, configure the machine to use the Spark version without Hadoop.
    For example:
    export SPARK_HOME="/<path to>/spark-2.4.3-bin-without-hadoop"
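
For example, the following commands show one way to complete these steps on a Linux machine. The archive names, the /opt installation paths, and the Spark 2.4.3 version are examples only; adjust them for your environment:

    # Extract Spark without Hadoop and Hadoop 3.2.0 after downloading the archives (example paths).
    tar xzf spark-2.4.3-bin-without-hadoop.tgz -C /opt
    tar xzf hadoop-3.2.0.tar.gz -C /opt

    # Add the Hadoop 3.2.0 client libraries to the Spark classpath through conf/spark-env.sh.
    echo 'export SPARK_DIST_CLASSPATH=$(/opt/hadoop-3.2.0/bin/hadoop classpath)' >> /opt/spark-2.4.3-bin-without-hadoop/conf/spark-env.sh

    # Use the Spark installation without Hadoop on the Transformer machine.
    export SPARK_HOME=/opt/spark-2.4.3-bin-without-hadoop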

Use the Transformer Stage Library

Before running a local pipeline that connects to Amazon S3, or Azure Data Lake Storage Gen1 or Gen2, configure each related stage to use the Transformer stage library.

The Transformer stage library packages the required system-related client library with the pipeline, which enables a local Spark installation to run the pipeline.

  1. In the stage properties, on the General tab, configure the Stage Library property to use the Transformer stage library:
    • For Amazon S3 stages, select AWS Library from Transformer (for Hadoop 3.2.0).
    • For ADLS stages, select ADLS Library from Transformer (for Hadoop 3.2.0).
  2. Repeat for any stage that connects to Amazon S3, or Azure Data Lake Storage Gen1 or Gen2.

PySpark Processor Prerequisites

Before using the PySpark processor to develop custom PySpark code, you must complete several prerequisite tasks.
The tasks that you perform depend on where the pipeline runs:
  • Databricks cluster
  • Other cluster or local Transformer machine

Databricks Cluster

Before using the PySpark processor in pipelines that run on an existing Databricks cluster, set the required environment variables on the cluster.

When running the pipeline in a provisioned Databricks cluster, you configure the environment variables in the pipeline cluster configuration property. For more information, see Provisioned Databricks Cluster Requirements.

On an existing Databricks cluster, the PySpark processor requires the following environment variable settings:
PYTHONPATH=/databricks/spark/python/lib/py4j-0.10.7-src.zip:/databricks/spark/python:/databricks/spark/bin/pyspark
PYSPARK_PYTHON=/databricks/python3/bin/python3
PYSPARK_DRIVER_PYTHON=/databricks/python3/bin/python3
Note: You can configure environment variables in your Databricks Workspace by clicking Clusters > Advanced Options > Spark. Then, enter the environment variables in the Environment Variables property. Restart the cluster to enable your changes.

Other Cluster or Local Machine

Before using the PySpark processor in pipelines that run on a non-Databricks cluster or on a local Transformer machine, complete the following prerequisite tasks. A combined example follows the steps.
  1. Install Python 3 on all nodes in the Spark cluster, or on the Transformer machine for local pipelines.

    The processor can use any Python 3.x version. However, StreamSets recommends installing the latest version.

    You can use a package manager to install Python from the command line. Or you can download and install Python from the Python download page.

  2. Set the following Python environment variable on all nodes in the Spark cluster, or on the Transformer machine for local pipelines:
    PYTHONPATH: Lists one or more directory paths that contain Python modules available for import. Include the following paths:
    $SPARK_HOME/libexec/python/lib/py4j-<version>-src.zip
    $SPARK_HOME/libexec/python
    $SPARK_HOME/bin/pyspark
    For example:
    export PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/libexec/python:$SPARK_HOME/bin/pyspark:$PYTHONPATH

    For more information about this environment variable, see the Python documentation.

  3. Set the following Spark environment variables on all nodes in the Spark cluster, or on the Transformer machine for local pipelines:
    PYSPARK_PYTHON: Path to the Python binary executable to use for PySpark, for both the Spark driver and the workers.
    PYSPARK_DRIVER_PYTHON: Path to the Python binary executable to use for PySpark, for the Spark driver only. When not set, the Spark driver uses the path defined in PYSPARK_PYTHON.

    For more information about these environment variables, see the Apache Spark documentation.
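
For example, the following commands show one way to complete these steps on a Linux node. The apt-get package manager and the /usr/bin/python3 path are examples only; use the package manager and Python 3 path appropriate for your systems:

    # Install Python 3 with a package manager (Debian/Ubuntu shown).
    sudo apt-get install -y python3

    # Use the installed Python 3 binary for the Spark driver and workers (example path).
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3

To make the settings persistent, you can add the export statements, along with the PYTHONPATH example shown in step 2, to a file that is sourced on each node, such as a shell profile or the conf/spark-env.sh file of the Spark installation.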

Scala Processor and Preprocessing Script Prerequisites

Before you use a Scala processor or preprocessing script in a pipeline, complete the following prerequisite tasks:

Set the SPARK_HOME Environment Variable

When using a Scala processor or preprocessing script, set the SPARK_HOME environment variable on the cluster node where the Spark driver program runs.

The SPARK_HOME environment variable must point to a Spark version that includes Scala 2.11.x.

You can also use the Extra Spark Configuration pipeline property to set the environment variable for the duration of the pipeline run.
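
For example, on the machine where the Spark driver runs, you might set the variable as follows, where the path is an example only and must point to a Spark installation built with Scala 2.11.x:

    export SPARK_HOME=/opt/spark-2.4.3-bin-hadoop2.7

When you use the Extra Spark Configuration pipeline property instead, the property that sets the environment variable depends on the cluster manager. For example, Spark on YARN can apply spark.yarn.appMasterEnv.SPARK_HOME to the driver when the pipeline runs in cluster deploy mode; check the Spark documentation for your cluster manager.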