Data Collector Environment Configuration

Data Collector provides two environment configuration files:

  • $SDC_DIST/libexec/sdc-env.sh - Used when you start Data Collector manually from the command line.
  • $SDC_DIST/libexec/sdcd-env.sh - Used when you start Data Collector as a service.

Use a text editor to edit the appropriate environment configuration file. Some of the environment variables in the sdc-env.sh file are commented out and do not reflect the default values. Be sure to uncomment the line when you change a variable value.

After you edit the file, restart Data Collector to enable the changes.

Modify the Data Collector environment configuration files to customize the following areas:
  • Data Collector directories
  • User and group used to start Data Collector as a service
  • Java configuration options
  • Java Security Manager that restricts the runtime permissions of user libraries
  • Path to JAR files to be added to the root classloader

Data Collector Directories

You can edit the Data Collector environment configuration file to modify the directories used to store configuration, data, log, and resource files.

The $SDC_DIST environment variable defines the Data Collector runtime directory. The runtime directory is the base Data Collector directory that stores the executables and related files. This environment variable is set during installation.

When you start Data Collector manually, the default values of the remaining directory variables are relative to the $SDC_DIST runtime directory. When you start Data Collector as a service, the default values of the remaining directory variables are absolute paths that are outside of the $SDC_DIST runtime directory.

You can configure the following environment variables that define directories:

Environment Variable Description
SDC_CONF

Defines the configuration directory for the Data Collector configuration file, sdc.properties, and related realm properties files and keystore files. Also includes the Log4j properties file.

Default values:

  • Manual start: $SDC_DIST/etc
  • Service start: /etc/sdc
SDC_DATA

Defines the data directory for pipeline configuration and run details.

Default values:

  • Manual start: $SDC_DIST/data
  • Service start: /var/lib/sdc
SDC_LOG

Defines the log directory.

Default values:

  • Manual start: $SDC_DIST/log
  • Service start: /var/log/sdc
SDC_RESOURCES

Defines the directory for runtime resource files.

Default values:

  • Manual start: $SDC_DIST/resources
  • Service start: /var/lib/sdc-resources
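
As an illustration, the following sketch overrides the four directory variables in the environment configuration file. The /data/sdc paths are hypothetical and are not defaults; substitute directories that exist on your host and that the user running Data Collector can read and write.

```shell
# Hypothetical custom locations on a separate volume; these are not defaults.
export SDC_CONF=/data/sdc/conf
export SDC_DATA=/data/sdc/data
export SDC_LOG=/data/sdc/log
export SDC_RESOURCES=/data/sdc/resources

# Print the resulting layout for a quick sanity check.
for dir in "$SDC_CONF" "$SDC_DATA" "$SDC_LOG" "$SDC_RESOURCES"; do
  echo "$dir"
done
```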

User and Group for Service Start

When you run Data Collector as a service, you must create a system user and group named sdc, or you must edit the values of the SDC_USER and SDC_GROUP environment variables to point to an existing system user or group.

The $SDC_DIST/libexec/sdcd-env.sh file includes the following user and group environment variables:
  • SDC_USER - Defines the system user used to run Data Collector as a service. Default is sdc.
  • SDC_GROUP - Defines the system group used to run Data Collector as a service. Default is sdc.
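
For example, to run the service under an existing account instead of sdc, the two variables might be set as in the following sketch. The name etl is a hypothetical placeholder, and the final check is an optional addition that only reports whether the account exists on the host.

```shell
# Hypothetical existing account and group; replace with real names on your host.
export SDC_USER=etl
export SDC_GROUP=etl

# Optional check: warn if the chosen user is not defined on this machine.
if ! id -u "$SDC_USER" >/dev/null 2>&1; then
  echo "warning: user $SDC_USER does not exist on this host"
fi
```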

Java Configuration Options

You can define the Java configuration options used by Data Collector.

For a tarball or RPM installation, define Java configuration options in the following environment variables in the Data Collector environment configuration file:

  • SDC_JAVA_OPTS - Includes configuration options for Java.
  • SDC_JAVA8_OPTS - Includes configuration options specific to Java 8.

Data Collector loads the value of the version-specific environment variable and adds it to the SDC_JAVA_OPTS environment variable.
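
The effect of that merge can be sketched as follows. The option values are illustrative, and the append step is an assumption about how the start script combines the two variables, not a copy of its actual logic.

```shell
# Illustrative values only.
SDC_JAVA_OPTS="-Xmx1024m -Xms1024m -server"
SDC_JAVA8_OPTS="-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"

# Assumed merge step: append the version-specific options to the general ones.
SDC_JAVA_OPTS="${SDC_JAVA_OPTS} ${SDC_JAVA8_OPTS}"
echo "$SDC_JAVA_OPTS"
```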

For a Cloudera Manager installation, define Java configuration options in the Java Options property for the StreamSets service in Cloudera Manager.

Note: To modify Java configuration options for the Data Collector command line interface, use the SDC_CLI_JAVA_OPTS environment variable. For more information, see Java Configuration Options for the Cli Command.

Java Heap Size

Increase or decrease the Data Collector Java heap size as necessary, based on the resources available on the host machine. By default, the Java heap size is 1024 MB.

The Java heap size determines the heap size allocated to Data Collector and affects the amount of memory Data Collector can use when it runs a pipeline. Running a pipeline can use up to 65% of the allocated heap size.

Use the following Java options to define the Java heap size:
  • Xmx - Defines the maximum heap size.
  • Xms - Defines the minimum heap size.
Tip: To avoid constant recalculation of the allocated heap size, set both properties to the same value. To define the unit of measure, use m for MB and g for GB.

Define the heap size based on your installation:

Tarball or RPM installation

Define the heap size in the SDC_JAVA_OPTS environment variable.

If you start Data Collector as a service, set the environment variable in the $SDC_DIST/libexec/sdcd-env.sh file. If you start Data Collector manually, set the variable in the $SDC_DIST/libexec/sdc-env.sh file.

For example, to double the heap size, change the following default settings in the environment configuration file:
export SDC_JAVA_OPTS="-Xmx1024m -Xms1024m -server ${SDC_JAVA_OPTS}"

to increase the Xmx and Xms settings as follows:

export SDC_JAVA_OPTS="-Xmx2048m -Xms2048m -server ${SDC_JAVA_OPTS}"
Cloudera Manager installation
Define the heap size in the Java Options property for the StreamSets service in Cloudera Manager.
For example, to double the heap size, add the Xmx and Xms settings to the property as follows:
-Xmx2048m -Xms2048m
With a heap size of 2048 MB, you can configure a pipeline to use up to 65% of the heap, or about 1331 MB of memory.
Note: In the pipeline properties, you can use the jvm:maxMemoryMB() function to help define the percentage of the heap size the pipeline uses.
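
The 65% figure can be checked with shell integer arithmetic. This is only a worked calculation; the variable names are made up for the sketch and are not Data Collector settings.

```shell
heap_mb=2048        # value passed to -Xmx, in MB
pipeline_pct=65     # upper bound on pipeline memory, as a percentage of the heap

# Integer division rounds the result down to whole megabytes.
pipeline_mb=$(( heap_mb * pipeline_pct / 100 ))
echo "${pipeline_mb} MB"
```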

Remote Debugging

You can enable remote debugging to debug a Data Collector instance running on a remote machine.

Enable remote debugging based on your installation:
Tarball or RPM installation

Define debugging options in the SDC_JAVA_OPTS environment variable.

If you start Data Collector as a service, set the environment variable in the $SDC_DIST/libexec/sdcd-env.sh file. If you start Data Collector manually, set the variable in the $SDC_DIST/libexec/sdc-env.sh file.

Add the following debugging options to the environment variable, where port_number is an open port number on the remote machine running Data Collector:
-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=<port_number>,suspend=n
For example, to debug Data Collector on a remote machine using port number 2005, define SDC_JAVA_OPTS as follows:
export SDC_JAVA_OPTS="-Xmx1024m -Xms1024m -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=2005,suspend=n -server ${SDC_JAVA_OPTS}"
Cloudera Manager installation
Define the debugging options in the Java Options property for the StreamSets service in Cloudera Manager.
For example, to debug Data Collector on a remote machine using port number 2005, add the debugging options to the property as follows:
-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=2005,suspend=n
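
If you switch debugging ports often, the option string can be generated rather than typed by hand. The function sdc_debug_opts below is a hypothetical helper written for this sketch; it is not part of Data Collector.

```shell
# Hypothetical helper: prints the debugging options for the given port.
sdc_debug_opts() {
  port="$1"
  echo "-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=${port},suspend=n"
}

debug_opts="$(sdc_debug_opts 2005)"
echo "$debug_opts"
```

The resulting string can be pasted into SDC_JAVA_OPTS or into the Java Options property. After Data Collector restarts, a debugger on another machine can attach to the chosen port.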

Garbage Collector

You can define the Java garbage collector that Data Collector uses. By default, Data Collector uses the Concurrent Mark Sweep (CMS) garbage collector.

For example, if you configure Data Collector to use a large heap size, you might want to use the G1 garbage collector. If you define another garbage collector, test and evaluate Data Collector performance before making the same change in a production environment. Garbage collector performance depends on each particular use case.

Define the garbage collector based on your installation:

Tarball or RPM installation
Define the garbage collector in the SDC_JAVA8_OPTS environment variable.

If you start Data Collector as a service, set the environment variable in the $SDC_DIST/libexec/sdcd-env.sh file. If you start Data Collector manually, set the variable in the $SDC_DIST/libexec/sdc-env.sh file.

For example, the default garbage collector is defined as follows:

export SDC_JAVA8_OPTS=${SDC_JAVA8_OPTS:-"-XX:+UseConcMarkSweepGC -XX:+UseParNewGC"}

To use the G1 garbage collector, set the option as follows:

export SDC_JAVA8_OPTS=${SDC_JAVA8_OPTS:-"-XX:+UseG1GC"}
Cloudera Manager installation
Define the garbage collector in the Java Options property for the StreamSets service in Cloudera Manager.
For example, to use the G1 garbage collector, add the option to the property as follows:
-XX:+UseG1GC
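
Note that the ${SDC_JAVA8_OPTS:-"..."} form used in the tarball examples above is standard shell parameter expansion: the quoted value is only a fallback, applied when the variable is unset or empty. The following sketch shows both cases. The -XX:+UseParallelGC value simply stands in for a setting inherited from the calling environment.

```shell
# Case 1: the variable is unset, so the fallback after ":-" is used.
unset SDC_JAVA8_OPTS
export SDC_JAVA8_OPTS=${SDC_JAVA8_OPTS:-"-XX:+UseG1GC"}
case1="$SDC_JAVA8_OPTS"

# Case 2: the variable already has a value, so the fallback is ignored.
export SDC_JAVA8_OPTS="-XX:+UseParallelGC"
export SDC_JAVA8_OPTS=${SDC_JAVA8_OPTS:-"-XX:+UseG1GC"}
case2="$SDC_JAVA8_OPTS"

echo "unset beforehand: $case1"
echo "set beforehand:   $case2"
```

If a garbage collector setting in the environment file appears to have no effect, check whether SDC_JAVA8_OPTS is already exported in the shell or service environment that starts Data Collector.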

Logging

Data Collector enables garbage collector logging by default to facilitate troubleshooting. Log files are written to $SDC_LOG/gc.log. You can disable logging.

Disable garbage collector logging based on your installation:

Tarball or RPM installation
Set the SDC_GC_LOGGING environment variable to false in the Data Collector environment configuration file.

If you start Data Collector as a service, set the environment variable in the $SDC_DIST/libexec/sdcd-env.sh file. If you start Data Collector manually, set the variable in the $SDC_DIST/libexec/sdc-env.sh file.

For example:
export SDC_GC_LOGGING=false
Cloudera Manager installation
Set the SDC_GC_LOGGING environment variable to false in the sdc-env.sh safety valve for the StreamSets service in Cloudera Manager.
For example, add the following to the sdc-env.sh safety valve:
export SDC_GC_LOGGING=false

Java Security Manager

Data Collector includes a Java Security Manager that is enabled by default.

The Java Security Manager restricts the runtime permissions of user libraries. This allows administrators to control what user libraries can do on production systems. For example, by default, user libraries cannot call out to network resources and potentially cause denial-of-service attacks.

The security policy is defined in the $SDC_CONF/sdc-security.policy file. The file uses the standard Java policy file syntax.

To disable the Java Security Manager, edit the SDC_SECURITY_MANAGER_ENABLED environment variable in the Data Collector environment configuration file.
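
A minimal sketch of that change, assuming the variable accepts true and false in the same way as SDC_GC_LOGGING. Confirm the accepted values against the comments in your own environment configuration file before relying on this.

```shell
# Assumption: "false" disables the Java Security Manager.
export SDC_SECURITY_MANAGER_ENABLED=false
echo "SDC_SECURITY_MANAGER_ENABLED=$SDC_SECURITY_MANAGER_ENABLED"
```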

Root Classloader

You can edit the Data Collector environment configuration file to configure the path to JAR files to be added to the Data Collector root classloader.

The SDC_ROOT_CLASSPATH environment variable defines the path to JAR files to be added to the Data Collector root classloader. Use the variable for components that must be in the root classloader, such as Snappy. Default is $SDC_DIST/root-lib/'*'.
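
The quotes around the asterisk in the default keep the shell from expanding it, so the literal wildcard can be passed through for Java to resolve as a classpath wildcard. The following sketch shows that quoting. The SDC_DIST path is hypothetical and is set here only so the example runs on its own; do not redefine SDC_DIST in a real environment file.

```shell
# Hypothetical install path, set only so this sketch is self-contained.
SDC_DIST=/opt/streamsets-datacollector

# Single quotes keep the asterisk literal instead of letting the shell glob it.
export SDC_ROOT_CLASSPATH="${SDC_DIST}/root-lib/"'*'
echo "$SDC_ROOT_CLASSPATH"
```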