Data Collector Environment Configuration

Data Collector includes several environment variables that you can modify to customize the following areas:
  • Data Collector directories
  • User and group used to start Data Collector as a service
  • Java configuration options
  • Security Manager that restricts the runtime permissions of user libraries
  • Path to JAR files to be added to the root classloader
  • Heap dump creation and file location
Note: Data Collector also includes a SPARK_KAFKA_VERSION environment variable that should not be modified. This variable is used only when you run cluster streaming mode pipelines on a Cloudera CDH cluster. For more information, see Kafka Cluster Requirements.

Modifying Environment Variables

The method that you use to modify environment variables depends on the Data Collector installation type:
Tarball installation started manually from the command line
When you start Data Collector manually from the command line on any operating system, edit the $SDC_DIST/libexec/sdc-env.sh file to modify environment variables.

Use a text editor to edit the sdc-env.sh file. Some of the environment variables in the file are commented out and do not reflect the default values. Be sure to uncomment the line when you change a variable value.

After you edit the file, restart Data Collector from the command prompt to enable the changes.

Note: Do not restart Data Collector from the user interface after modifying environment variables.
Tarball or RPM installation started as a service on operating systems that use the SysV init system
When you start Data Collector as a service on CentOS 6, Oracle Linux 6, Red Hat Enterprise Linux 6, or Ubuntu 14.04 LTS, edit the $SDC_DIST/libexec/sdcd-env.sh file to modify environment variables.

Use a text editor to edit the sdcd-env.sh file.

After you edit the file, restart Data Collector to enable the changes.

Tarball or RPM installation started as a service on operating systems that use the systemd init system
When you start Data Collector as a service on CentOS 7, Oracle Linux 7, Red Hat Enterprise Linux 7, or Ubuntu 16.04 LTS, edit the sdc.service file to modify environment variables.
The location of the sdc.service file depends on how you installed Data Collector:
  • From the RPM package - /usr/lib/systemd/system/sdc.service
  • From the tarball - /etc/systemd/system/sdc.service

Override the default values in the sdc.service file using the same procedure that you use to override unit configuration files on a systemd init system. For an example, see "Example 2. Overriding vendor settings" in the systemd.unit man page.

After overriding the default values, use the following command to reload the systemd manager configuration:
systemctl daemon-reload

Then restart Data Collector to enable the changes.
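
As a minimal sketch of the override procedure, the following script writes a systemd drop-in file that changes a single environment variable. It assumes that the unit sets variables with Environment= lines. The drop-in file name override.conf and the /mnt/logs/sdc path are hypothetical. By default the script writes to a scratch directory so that it is safe to try.

```shell
#!/bin/sh
# Sketch only: on a real host, set DROPIN_DIR=/etc/systemd/system/sdc.service.d
# (requires root). The default below is a scratch directory.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)/sdc.service.d}"
mkdir -p "$DROPIN_DIR"

# Drop-in files are applied after the main unit file, so this Environment=
# assignment overrides an earlier assignment of the same variable.
# The SDC_LOG value is a hypothetical example.
cat > "$DROPIN_DIR/override.conf" <<'EOF'
[Service]
Environment=SDC_LOG=/mnt/logs/sdc
EOF

cat "$DROPIN_DIR/override.conf"

# On a real host, reload systemd and restart the service afterwards:
#   systemctl daemon-reload
#   systemctl restart sdc
```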

Cloudera Manager installation
When you install Data Collector through Cloudera Manager, modify environment variables by configuring the StreamSets service through Cloudera Manager.

Data Collector Directories

Data Collector includes environment variables that define the directories used to store configuration, data, log, and resource files.

The SDC_DIST environment variable defines the Data Collector runtime directory. The runtime directory is the base Data Collector directory that stores the executables and related files. This environment variable is set during installation.

When you start Data Collector manually, the default values of the remaining directory variables are relative to the $SDC_DIST runtime directory. When you start Data Collector as a service, the default values of the remaining directory variables are absolute paths that are outside of the $SDC_DIST runtime directory.

Modify environment variables using the method required by your installation type.

You can configure the following environment variables that define directories:

Environment Variable Description
SDC_CONF

Defines the configuration directory for the Data Collector configuration file, sdc.properties, and related realm properties files and keystore files. Also includes the log4j properties file.

Default values:

  • Manual start: $SDC_DIST/etc
  • Service start: /etc/sdc
SDC_DATA

Defines the data directory for pipeline configuration and run details.

Default values:

  • Manual start: $SDC_DIST/data
  • Service start: /var/lib/sdc
SDC_LOG

Defines the log directory.

Default values:

  • Manual start: $SDC_DIST/log
  • Service start: /var/log/sdc
SDC_RESOURCES

Defines the directory for runtime resource files.

Default values:

  • Manual start: $SDC_DIST/resources
  • Service start: /var/lib/sdc-resources
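
The manual-start defaults above can be sketched with shell parameter expansion. This is an illustration, not the contents of sdc-env.sh. The /opt/streamsets-datacollector and /mnt/logs/sdc paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical runtime directory; normally set during installation.
SDC_DIST="${SDC_DIST:-/opt/streamsets-datacollector}"

# Override one directory explicitly (hypothetical path)...
SDC_LOG=/mnt/logs/sdc

# ...and let the others fall back to the manual-start defaults.
SDC_CONF="${SDC_CONF:-$SDC_DIST/etc}"
SDC_DATA="${SDC_DATA:-$SDC_DIST/data}"
SDC_LOG="${SDC_LOG:-$SDC_DIST/log}"
SDC_RESOURCES="${SDC_RESOURCES:-$SDC_DIST/resources}"
export SDC_DIST SDC_CONF SDC_DATA SDC_LOG SDC_RESOURCES

printf '%s\n' "$SDC_CONF" "$SDC_DATA" "$SDC_LOG" "$SDC_RESOURCES"
```

Because SDC_LOG is assigned before the defaults are applied, it keeps the custom value while the other variables resolve under $SDC_DIST.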

User and Group for Service Start

When you run Data Collector as a service, Data Collector runs as the system user account and group defined in environment variables. The default system user and group are named sdc.

You can modify the values of the environment variables to point to another system user or group.

Modify environment variables using the method required by your installation type.

If you change the system user, you must make the new system user the owner of all Data Collector directories:
  • $SDC_DIST
  • $SDC_CONF
  • $SDC_DATA
  • $SDC_LOG
  • $SDC_RESOURCES
For example, if you change the system user and group to myuser, use the following command to change the owner of the configuration directory, $SDC_CONF, and all files in the directory to myuser:myuser:
chown -R myuser:myuser /etc/sdc
Note: When you run Data Collector manually, Data Collector runs as the system user account logged into the command prompt when the launch command is run. To run as another user account, see Full Installation and Launch (Manual Start).
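
To cover all five directories, the same command can be repeated in a loop. The sketch below only prints the chown commands so that you can review them first. It assumes the service-start default locations, a hypothetical /opt/streamsets-datacollector runtime directory, and the myuser account from the example above.

```shell
#!/bin/sh
# Assumed values: service-start defaults plus a hypothetical SDC_DIST.
SDC_DIST="${SDC_DIST:-/opt/streamsets-datacollector}"
SDC_CONF="${SDC_CONF:-/etc/sdc}"
SDC_DATA="${SDC_DATA:-/var/lib/sdc}"
SDC_LOG="${SDC_LOG:-/var/log/sdc}"
SDC_RESOURCES="${SDC_RESOURCES:-/var/lib/sdc-resources}"
NEW_OWNER="myuser:myuser"

# Print one chown command per Data Collector directory.
print_chown_commands() {
  for dir in "$SDC_DIST" "$SDC_CONF" "$SDC_DATA" "$SDC_LOG" "$SDC_RESOURCES"; do
    printf 'chown -R %s %s\n' "$NEW_OWNER" "$dir"
  done
}

print_chown_commands
```

After reviewing the output, run the printed commands as a user with permission to change ownership, typically root.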

Java Configuration Options

You define Java configuration options used by Data Collector in environment variables.

For a tarball or RPM installation, define Java configuration options in the following environment variables:

  • SDC_JAVA_OPTS - Includes configuration options for Java.
  • SDC_JAVA8_OPTS - Includes configuration options specific to Java 8.

Data Collector loads the value of the version-specific environment variable and adds it to the SDC_JAVA_OPTS environment variable.

When defining Java configuration options, avoid defining duplicate options. If you do define duplicates, the last option passed to the JVM usually takes precedence.
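
Because options accumulate as values are appended to SDC_JAVA_OPTS, a quick check for repeated flags can help. The following sketch counts how many times a maximum heap option appears. The option values are examples only.

```shell
#!/bin/sh
# Example values only: a base setting plus a later addition that repeats -Xmx.
SDC_JAVA_OPTS="-Xmx1024m -Xms1024m -server"
SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -Xmx2048m"

# Split the options on whitespace (intentionally unquoted) and count -Xmx flags.
xmx_count=$(printf '%s\n' $SDC_JAVA_OPTS | grep -c '^-Xmx')

if [ "$xmx_count" -gt 1 ]; then
  echo "Warning: -Xmx is defined $xmx_count times in SDC_JAVA_OPTS"
fi
```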

For a Cloudera Manager installation, define Java configuration options by configuring the StreamSets service through Cloudera Manager.

Note: To modify Java configuration options for the Data Collector command line interface, use the SDC_CLI_JAVA_OPTS environment variable. For more information, see Java Configuration Options for the Cli Command.

Java Heap Size

Increase or decrease the Data Collector Java heap size as necessary, based on the resources available on the host machine. By default, the Java heap size is 1024 MB.

The Java heap size determines how much heap memory is allocated to Data Collector, which limits the memory available when it runs a pipeline. A pipeline can use up to 65% of the allocated heap.

Use the following Java options to define the Java heap size:
  • Xmx - Defines the maximum heap size.
  • Xms - Defines the minimum heap size.
Tip: To avoid constant recalculation of the allocated heap size, set both options to the same value. To define the unit of measure, use m for MB and g for GB.

Define the heap size based on your installation:

Tarball or RPM installation

Define the heap size in the SDC_JAVA_OPTS environment variable.

For example, to double the heap size, increase the Xmx and Xms settings as follows:

export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -Xmx2048m -Xms2048m -server"

Modify environment variables using the method required by your installation type.

Cloudera Manager installation
Define the heap size in the Data Collector Advanced Configuration Snippet (Safety Valve) for sdc-env.sh field for the StreamSets service in Cloudera Manager.
For example, to double the heap size, add the following to the sdc-env.sh safety valve:
export SDC_JAVA_OPTS="-Xmx2048m -Xms2048m"
With a heap size of 2048 MB, you can configure a pipeline to use up to 65% of the heap, or about 1331 MB of memory.
Note: In the pipeline properties, you can use the jvm:maxMemoryMB() function to help define the percentage of the heap size the pipeline uses.
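
The 65% limit can be checked with shell integer arithmetic. The variable names in this sketch are illustrative.

```shell
#!/bin/sh
heap_mb=2048           # value passed to -Xmx, in MB
pipeline_percent=65    # maximum share of the heap that a pipeline can use

# Integer division rounds down: 2048 * 65 / 100 = 1331
pipeline_mb=$(( heap_mb * pipeline_percent / 100 ))
echo "A pipeline can use up to ${pipeline_mb} MB of a ${heap_mb} MB heap"
```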

Remote Debugging

You can enable remote debugging to debug a Data Collector instance running on a remote machine.

Enable remote debugging based on your installation:
Tarball or RPM installation

Define debugging options in the SDC_JAVA_OPTS environment variable.

Add the following debugging options to the environment variable, where port_number is an open port number on the remote machine running Data Collector:
-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=<port_number>,suspend=n
For example, to debug Data Collector on a remote machine using port number 2005, define SDC_JAVA_OPTS as follows:
export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=2005,suspend=n"

Modify environment variables using the method required by your installation type.
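
Once Data Collector is running with these options, a debugger can attach to the port. As one possibility, the jdb command-line debugger included with the JDK supports socket attach. The sketch below prints the attach command rather than running it. The host name sdc-host.example.com is hypothetical.

```shell
#!/bin/sh
DEBUG_HOST="sdc-host.example.com"   # hypothetical remote machine
DEBUG_PORT=2005                     # must match address= in SDC_JAVA_OPTS

# Build the jdb attach command using the JDI socket-attach connector.
attach_cmd="jdb -connect com.sun.jdi.SocketAttach:hostname=${DEBUG_HOST},port=${DEBUG_PORT}"
echo "$attach_cmd"
```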

Cloudera Manager installation
Define the debugging options in the Java Options property for the StreamSets service in Cloudera Manager.
For example, to debug Data Collector on a remote machine using port number 2005, add the debugging options to the property as follows:
-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=2005,suspend=n

Garbage Collector

You can define the Java garbage collector that Data Collector uses. By default, Data Collector uses the Concurrent Mark Sweep (CMS) garbage collector.

For example, if you configure Data Collector to use a large heap size, you might want to use the G1 garbage collector. If you define another garbage collector, test and evaluate Data Collector performance before making the same change in a production environment. Garbage collector performance depends on each particular use case.

Define the garbage collector based on your installation:

Tarball or RPM installation
Define the garbage collector in the SDC_JAVA8_OPTS environment variable.

For example, the default garbage collector is defined as follows:

export SDC_JAVA8_OPTS=${SDC_JAVA8_OPTS:-"-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"}

To use the G1 garbage collector, set the option as follows:

export SDC_JAVA8_OPTS=${SDC_JAVA8_OPTS:-"-XX:+UseG1GC"}

Modify environment variables using the method required by your installation type.
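
Note that the ${VAR:-default} form used in these examples applies the default only when the variable is unset or empty, so a value that is already set elsewhere is kept. A short sketch of that shell behavior, using -XX:+UseParallelGC only as a stand-in for a previously set value:

```shell
#!/bin/sh
# Case 1: the variable is unset, so the default after ":-" is used.
unset SDC_JAVA8_OPTS
export SDC_JAVA8_OPTS=${SDC_JAVA8_OPTS:-"-XX:+UseG1GC"}
first="$SDC_JAVA8_OPTS"

# Case 2: the variable already has a value, so the default is ignored.
SDC_JAVA8_OPTS="-XX:+UseParallelGC"
export SDC_JAVA8_OPTS=${SDC_JAVA8_OPTS:-"-XX:+UseG1GC"}
second="$SDC_JAVA8_OPTS"

echo "unset  -> $first"
echo "preset -> $second"
```

To apply a garbage collector regardless of any earlier value, assign the option directly instead of using the default form.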

Cloudera Manager installation
Define the garbage collector in the Data Collector Advanced Configuration Snippet (Safety Valve) for sdc-env.sh field for the StreamSets service in Cloudera Manager.
For example, to use the G1 garbage collector, add the following to the sdc-env.sh safety valve:
export SDC_JAVA8_OPTS="-XX:+UseG1GC"

Logging

Data Collector enables garbage collector logging by default to facilitate troubleshooting. Log files are written to $SDC_LOG/gc.log. You can disable logging.

Disable garbage collector logging based on your installation:

Tarball or RPM installation
Set the SDC_GC_LOGGING environment variable to false. For example:
export SDC_GC_LOGGING=false

Modify environment variables using the method required by your installation type.

Cloudera Manager installation
Set the SDC_GC_LOGGING environment variable to false in the Data Collector Advanced Configuration Snippet (Safety Valve) for sdc-env.sh field for the StreamSets service in Cloudera Manager.
For example, add the following to the sdc-env.sh safety valve:
export SDC_GC_LOGGING=false

Security Manager

Data Collector includes a Java Security Manager that is enabled by default. For enhanced security, you can enable the Data Collector Security Manager, which prevents stages from accessing files in protected Data Collector directories.

Data Collector can use one of the following security managers:
Java Security Manager

By default, Data Collector uses the Java Security Manager. The Java Security Manager restricts the runtime permissions of user libraries. This allows administrators to control the actions of user libraries on production systems. For example, by default, user libraries cannot call out to network resources and potentially cause denial-of-service (DoS) attacks.

The security policy is defined in the $SDC_CONF/sdc-security.policy file. The file uses standard Java policy file syntax.

Data Collector Security Manager
For enhanced security, enable the Data Collector Security Manager. The Data Collector Security Manager prevents stages from accessing files in protected Data Collector directories, regardless of how the sdc-security.policy file is defined.
To enable the Data Collector Security Manager, uncomment the security_manager.sdc_manager.enable property in the Data Collector configuration file, $SDC_CONF/sdc.properties.
Note: If you use an older JVM version, the Data Collector Security Manager might encounter some JVM known issues.

If needed, you can configure Data Collector to use neither security manager by setting the SDC_SECURITY_MANAGER_ENABLED environment variable to false.

Modify environment variables using the method required by your installation type.

Protected Directories

When the Data Collector Security Manager is enabled, the following Data Collector directories are protected directories:
  • $SDC_CONF - Stages cannot access files in the configuration directory.
  • $SDC_DATA - Stages cannot access files in the data directory.
  • $SDC_RESOURCES - Stages can read files in the resources directory, but cannot write to files in the directory.

If needed, you can allow stages to access specific files in these protected directories by modifying Data Collector Security Manager exception properties in the Data Collector configuration file, $SDC_CONF/sdc.properties. However, use caution when configuring exceptions to these protected directories.

You can configure exceptions to protected directories as follows:
Exceptions for all stage libraries
To allow all stage libraries access to files in protected directories, modify the security_manager.sdc_dirs.exceptions property to define files that can be accessed.
Exceptions for specific stage libraries
To allow a specific stage library access to files in protected directories, add the following property and then define the files that the stage library can access:
security_manager.sdc_dirs.exceptions.lib.<stage_library_name>=<file_path>
For example, the default Data Collector configuration file includes an exception for the Java keystore credential store stage library defined as follows:
security_manager.sdc_dirs.exceptions.lib.streamsets-datacollector-jks-credentialstore-lib=$SDC_CONF/jks-credentialStore.pkcs12

When you configure a Security Manager exception property, use the appropriate directory environment variable in the file path: $SDC_CONF, $SDC_DATA, or $SDC_RESOURCES. You can enter multiple file paths separated by commas.
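
As an illustration of the multiple-path syntax, the sketch below composes an exception property for the Java keystore credential store stage library with two comma-separated paths. The second file name, extra-keystore.pkcs12, is hypothetical. Single quotes keep the shell from expanding $SDC_CONF, so the variable name is written literally, as in the default configuration file.

```shell
#!/bin/sh
# Property name taken from the default configuration file example.
prop_name='security_manager.sdc_dirs.exceptions.lib.streamsets-datacollector-jks-credentialstore-lib'

# Single quotes keep $SDC_CONF literal; the second path is hypothetical.
paths='$SDC_CONF/jks-credentialStore.pkcs12,$SDC_CONF/extra-keystore.pkcs12'

# Print the line to add to the sdc.properties file.
prop_line="${prop_name}=${paths}"
printf '%s\n' "$prop_line"
```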

Root Classloader

You can edit the SDC_ROOT_CLASSPATH environment variable to define the path to JAR files to be added to the Data Collector root classloader.

Use the variable for components that must be in the root classloader, such as Snappy. Default is $SDC_DIST/root-lib/'*'.

Modify environment variables using the method required by your installation type.

Heap Dump Creation

By default, when Data Collector encounters an out of memory error (OOME), it creates a heap dump.

By default, heap dump files are written to the directory defined in the SDC_LOG environment variable and use a naming convention that allows generating multiple heap dump files, as follows: $SDC_LOG/sdc_heapdump_${timestamp}.hprof.

You can change the name of the heap dump files, but we recommend using the ${timestamp} or similar variable to ensure that the heap dump name is unique.

Note that the Java Virtual Machine, and therefore Data Collector, does not overwrite existing heap dump files. For example, if you use $SDC_LOG/sdc_heapdump.hprof as the file name, after Data Collector creates the first heap dump file, it will not create another until you remove the existing file.

Note: Depending on the number and size of the generated heap dump files, you might want to increase the Data Collector Java heap size.
You can configure the following heap dump environment variables:
Heap Dump Environment Variable Description
SDC_HEAPDUMP_ON_OOM

Specifies whether Data Collector generates a heap dump upon encountering an out of memory error.

Default is true.

SDC_HEAPDUMP_PATH

Specifies the file name and location to use for heap dump files.

By default, heap dumps are written to $SDC_LOG/sdc_heapdump_${timestamp}.hprof.

To specify a different file name or location, uncomment the property and enter the location and file name to use.

Tip: To write multiple heap dump files to a directory, use a function or variable to ensure that the file name is unique. If a file of the same name exists in the directory, Data Collector does not create a new heap dump file.

Modify environment variables using the method required by your installation type.
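
For example, a unique heap dump name can be built from the current time. This sketch uses the date command in place of ${timestamp}, which is an assumption rather than the default implementation. The SDC_LOG fallback is the service-start default.

```shell
#!/bin/sh
SDC_LOG="${SDC_LOG:-/var/log/sdc}"

# Keep heap dump creation enabled (the documented default).
export SDC_HEAPDUMP_ON_OOM=true

# Use a second-resolution timestamp so that each start gets a new file name.
timestamp=$(date +%Y%m%d%H%M%S)
export SDC_HEAPDUMP_PATH="${SDC_LOG}/sdc_heapdump_${timestamp}.hprof"

echo "$SDC_HEAPDUMP_PATH"
```

In this sketch the timestamp is evaluated once, when the environment file is read at startup, so the name is unique per start rather than per heap dump.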