Extra Spark Configuration

When you create a pipeline, you can define extra Spark configuration properties that determine how the pipeline runs on Spark. Transformer passes the configuration properties to Spark when it launches the Spark application.

You can add any additional Spark configuration property, as described in the Spark configuration documentation.

You can also add the following extra configuration property provided by Transformer. This is not a Spark configuration property:
Configuration Property                Description
spark.home                            Overrides the SPARK_HOME environment variable set on the machine.

For example, suppose that multiple Spark versions are installed locally on the Transformer machine. You can add the spark.home configuration property to run the pipeline on a Spark version other than the one that the SPARK_HOME environment variable points to.
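The override rule can be sketched in a few lines of Python. This is only an illustration of the documented precedence, not Transformer code: the function name resolve_spark_home and both installation paths are hypothetical.

```python
import os


def resolve_spark_home(extra_conf, env=None):
    """Return the Spark installation directory that a pipeline would use.

    Illustrative only: mirrors the documented rule that spark.home,
    when present, overrides the SPARK_HOME environment variable.
    """
    env = os.environ if env is None else env
    # The pipeline-level property takes precedence over the machine-level variable.
    return extra_conf.get("spark.home") or env.get("SPARK_HOME")


# Hypothetical paths for two Spark versions installed on the same machine.
machine_env = {"SPARK_HOME": "/opt/spark-2.4.8"}
pipeline_conf = {"spark.home": "/opt/spark-3.3.2"}

print(resolve_spark_home(pipeline_conf, machine_env))  # uses the spark.home value
print(resolve_spark_home({}, machine_env))             # falls back to SPARK_HOME
```

A pipeline without the spark.home property keeps using whatever SPARK_HOME is set to, so other pipelines on the same machine are unaffected.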

Performance Tuning Properties

By default, Transformer adds several Spark configuration properties to each pipeline with suggested values. These properties override the default Spark values in the cluster.

The defaults for these properties should work in most cases. If you are an advanced user, you can tune the performance of a specific pipeline by modifying these properties or by adding other Spark configuration properties.

For more information about these configuration properties, see the Spark configuration documentation.
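As a rough sketch of what tuning a single pipeline amounts to, the following Python models the suggested properties as a dictionary and layers pipeline-specific changes on top. The memory and core values are placeholders chosen for illustration, not the values that Transformer actually suggests, and tune_pipeline_conf is a hypothetical helper. spark.sql.shuffle.partitions stands in for any other Spark property that you might add.

```python
# Placeholder values for illustration only; these are not the values
# that Transformer suggests.
suggested_conf = {
    "spark.driver.memory": "2g",
    "spark.driver.cores": "1",
    "spark.executor.memory": "2g",
    "spark.executor.cores": "1",
}


def tune_pipeline_conf(suggested, overrides):
    """Return a new property map in which overrides replace or extend the suggested values."""
    tuned = dict(suggested)  # copy so the suggested values stay untouched
    tuned.update(overrides)
    return tuned


tuned_conf = tune_pipeline_conf(
    suggested_conf,
    {
        "spark.executor.memory": "8g",         # modify a suggested property
        "spark.sql.shuffle.partitions": "64",  # add another Spark property
    },
)

for name in sorted(tuned_conf):
    print(f"{name}={tuned_conf[name]}")
```

Because the helper returns a copy, the changes apply to one pipeline only and the suggested values remain available for every other pipeline.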

Transformer adds the following Spark configuration properties to each pipeline:

Spark Configuration Property          Description
spark.driver.memory                   Maximum amount of memory that the Spark driver uses to run the pipeline.
spark.driver.cores                    Number of cores that the Spark driver uses to run the pipeline.
spark.executor.memory                 Maximum amount of memory that each Spark executor uses to run the pipeline.
spark.executor.cores                  Number of cores that each Spark executor uses to run the pipeline.
                                      Databricks and Dataproc do not allow overrides of this configuration property. This property is ignored when running the pipeline on a Databricks or Dataproc cluster.
spark.dynamicAllocation.enabled       Enables dynamic resource allocation. Spark uses as many executors as required to run the pipeline.
                                      Note: Local pipelines always run on one Spark executor.
spark.shuffle.service.enabled         Enables the external shuffle service.
                                      Must be true when dynamic allocation is enabled.
spark.dynamicAllocation.minExecutors  Minimum number of Spark executors that the pipeline runs on when dynamic allocation is enabled.
spark.dynamicAllocation.maxExecutors  Maximum number of Spark executors that the pipeline runs on when dynamic allocation is enabled.
                                      The maximum number of Spark executors allowed for each pipeline is determined by your StreamSets account. You can decrease this number to limit executor usage in the cluster, but you cannot increase the number.
                                      Note: When dynamic allocation is disabled, the spark.executor.instances configuration property determines the number of Spark executors used for the pipeline. The maximum value of the spark.executor.instances property is also determined by your StreamSets account.
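The constraints described for the executor properties can be checked before a pipeline runs. The following Python is a minimal sketch under stated assumptions: check_executor_settings and the account_max_executors parameter are hypothetical names, the account limit must be supplied by the caller, and a missing maxExecutors value is assumed to equal that limit. Transformer may validate these properties differently.

```python
def _is_true(conf, name):
    """Treat a property as enabled only when its value is the string 'true'."""
    return str(conf.get(name, "false")).strip().lower() == "true"


def check_executor_settings(conf, account_max_executors):
    """Return a list of problems found in a pipeline's executor properties.

    account_max_executors is a hypothetical stand-in for the executor
    limit that the StreamSets account imposes.
    """
    problems = []
    if _is_true(conf, "spark.dynamicAllocation.enabled"):
        # Dynamic allocation requires the external shuffle service.
        if not _is_true(conf, "spark.shuffle.service.enabled"):
            problems.append("spark.shuffle.service.enabled must be true")
        min_exec = int(conf.get("spark.dynamicAllocation.minExecutors", 0))
        # Assumption: an unset maximum equals the account limit.
        max_exec = int(conf.get("spark.dynamicAllocation.maxExecutors",
                                account_max_executors))
        if min_exec > max_exec:
            problems.append("minExecutors exceeds maxExecutors")
        if max_exec > account_max_executors:
            problems.append("maxExecutors exceeds the account limit")
    elif "spark.executor.instances" in conf:
        # With dynamic allocation disabled, the executor count is fixed.
        if int(conf["spark.executor.instances"]) > account_max_executors:
            problems.append("spark.executor.instances exceeds the account limit")
    return problems


valid_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "5",
}
print(check_executor_settings(valid_conf, account_max_executors=10))
```

An empty list means that the properties satisfy all of the rules above. Lowering spark.dynamicAllocation.maxExecutors below the account limit passes the check, while raising it above the limit is reported as a problem.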