Data Collector Configuration

You can edit the Data Collector configuration file, $SDC_CONF/sdc.properties, to configure properties such as the host name and port number and account information for email alerts.

You can store sensitive information in separate files and reference them in the Data Collector configuration file. You can also reference information in an environment variable.

You can define runtime properties in the Data Collector configuration file or in a separate file. For more information, see Using Runtime Properties.

Enabling Kerberos Authentication

You can use Kerberos authentication to connect to external systems as well as YARN clusters.

By default, Data Collector uses the user account that started it to connect to external systems. When you enable Kerberos, Data Collector can use the Kerberos principal to connect to external systems instead.

You can configure Kerberos authentication for the following stages:
  • Hadoop FS origin
  • Kafka Consumer origin, Kafka version 0.9.0.0 or later
  • MapR FS origin
  • SDC RPC to Kafka, Kafka version 0.9.0.0 or later
  • UDP to Kafka, Kafka version 0.9.0.0 or later
  • HBase Lookup processor
  • Hive Metadata processor
  • Hadoop FS destination
  • HBase destination
  • Hive Metastore destination
  • Kafka Producer, Kafka version 0.9.0.0 or later
  • MapR DB destination
  • MapR FS destination
  • Solr destination
  • HDFS File Metadata executor
  • MapReduce executor
  • Spark executor
To enable Kerberos authentication, perform the following steps:
  1. Configure Data Collector to use Kerberos.
    In the Data Collector configuration file, $SDC_CONF/sdc.properties, configure the following Kerberos properties to enable Kerberos and define the principal and keytab (see the example after these steps):
    • kerberos.client.enabled
    • kerberos.client.principal
    • kerberos.client.keytab
    Important: For cluster pipelines, enter an absolute path to the keytab when configuring Data Collector. Standalone pipelines do not require an absolute path.
  2. Configure the stage to use Kerberos.
Note: Cluster pipelines automatically use Kerberos authentication if the YARN cluster requires it and if Data Collector has Kerberos enabled.
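
For example, the Kerberos properties in sdc.properties might look like the following. The principal and keytab file shown here are placeholders; use the values for your environment:
# Placeholder principal and keytab file; replace with your own values.
kerberos.client.enabled=true
kerberos.client.principal=sdc/sdc-host.example.com@EXAMPLE.COM
kerberos.client.keytab=sdc.keytab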

Sending Email

You can enable Data Collector to send email by configuring the email configuration properties. Data Collector can send email in the following ways:
  • Email alert - Sends a basic email when an email-enabled alert is triggered, such as when the error record threshold has been reached.
  • Pipeline notification - Sends a basic email when the pipeline state changes to a specified state. For example, you might use pipeline notification to send an email when a pipeline transitions to a Run_Error or Finished state.
  • Email executor - Sends a custom email upon receiving an event from an event-generating stage. Use in an event stream to send a user-defined email. You can include expressions to provide information about the pipeline or event in the email.

    For example, you might use an Email executor to send an email upon receiving a failed-query event from the Hive Query executor, and you can include the failed query in the message.

To enable sending email, in the Data Collector configuration file, configure the mail.transport.protocol property, and then configure the smtp/smtps properties and the xmail properties. For more information, see Configuring Data Collector.
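
For example, a basic SMTP configuration in sdc.properties might look like the following. The host name, user name, and email address are placeholders:
# Placeholder SMTP host, user name, and from address; replace with your own values.
mail.transport.protocol=smtp
mail.smtp.host=smtp.example.com
mail.smtp.port=25
mail.smtp.auth=true
xmail.username=sdc-alerts@example.com
xmail.password=${file("email-password.txt")}
xmail.from.address=sdc-alerts@example.com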

Referencing Sensitive Values in Files

For increased security, you can store sensitive information, such as a password, in separate files and reference those files in the Data Collector configuration file, $SDC_CONF/sdc.properties, as follows: ${file("<filename>")}.

Store a single piece of information in each file. When Data Collector starts, it reads the referenced files to resolve the configured values. By default, Data Collector expects the SMTP password and keystore password to be stored in files.

For example, if you configure xmail.username as follows, Data Collector checks the email_username.txt file for the user name:
xmail.username=${file("email_username.txt")}
To store and reference sensitive information:
  1. Create a text file for each configuration value that you want to safeguard.
  2. Include only one configuration value in each file.

    You can store any configuration value in a separate file, but do not include more than one configuration value in a file.

  3. Save the file in the Data Collector configuration directory, $SDC_CONF.
  4. In the Data Collector configuration file, $SDC_CONF/sdc.properties, set the property to reference the file using the following syntax:
    ${file("<filename>")}

Referencing Environment Variables

You can reference an environment variable in the Data Collector configuration file, $SDC_CONF/sdc.properties, as follows: ${env("<environment variable name>")}.

For example, you want to display text in the Data Collector console next to the StreamSets logo indicating whether Data Collector is running in a development, test, or production environment. You create an environment variable named ENVIRONMENT and populate the variable with the appropriate value: development, test, or production.

You set the value of the ui.header.title property in sdc.properties as follows:
ui.header.title=${env("ENVIRONMENT")}

Running Multiple Concurrent Pipelines

By default, Data Collector can run approximately 22 standalone pipelines concurrently. If you plan to run a larger number of pipelines at the same time, increase the thread pool size.

The runner.thread.pool.size property in the Data Collector configuration file, $SDC_CONF/sdc.properties, determines the number of threads in the pool that are available to run standalone pipelines. One running pipeline requires five threads, and pipelines share threads in the pool.

  1. Use a text editor to open sdc.properties.
  2. Calculate the approximate runner thread pool size by multiplying the number of running pipelines by 2.2.
  3. Set the runner.thread.pool.size property to your calculated value.
  4. Restart Data Collector to enable the changes.
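
For example, to run approximately 50 standalone pipelines at the same time, multiply 50 by 2.2 and set the property as follows:
runner.thread.pool.size=110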

HTTP Protocols

You can configure Data Collector to use HTTP or HTTPS. By default, Data Collector uses HTTP.

HTTPS requires an SSL/TLS certificate. Data Collector provides a self-signed certificate so you can use HTTPS immediately. You can also generate a certificate that is self-signed or signed by a certificate authority. Web browsers generally issue a warning for self-signed certificates.

When you configure Data Collector to use HTTPS, you can have Data Collector redirect HTTP requests to HTTPS or ignore them.

You use different properties to configure HTTPS based on whether you run standalone or cluster pipelines.

Configuring HTTPS for Standalone Pipelines

To configure HTTPS when you run standalone pipelines only, you can use the provided self-signed certificate. Or, you can generate an SSL/TLS certificate and specify the generated keystore file and keystore password file in the Data Collector configuration file, sdc.properties.

When you generate an SSL/TLS certificate to run standalone pipelines only, use the following files:
keystore file
A file that contains the private key and self-signed certificates for the web server. You can create your own keystore file or use the provided keystore file.
Store the keystore file on the Data Collector machine. Update the path and name of the keystore file in the https.keystore.path property. Enter an absolute path or a path relative to the $SDC_CONF directory.
The default keystore file is keystore.jks.
keystore password file
A file that contains the password to open the Java keystore file. You can create your own keystore password file or use the provided file.
Store the keystore password file on the Data Collector machine. Update the path and name of the keystore password file in the https.keystore.password property. Enter an absolute path or a path relative to the $SDC_CONF directory.
The default keystore password file is keystore-password.txt.
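
For example, to enable HTTPS for standalone pipelines with the default keystore files, you might uncomment and set the following properties. The port number is illustrative:
# The port number is illustrative; choose any available port.
https.port=18636
https.keystore.path=keystore.jks
https.keystore.password=${file("keystore-password.txt")}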

Configuring HTTPS for Cluster Pipelines

To configure HTTPS when you run cluster pipelines, you can use the provided self-signed certificate. Or, you can generate an SSL/TLS certificate for the gateway node and the worker nodes. You specify the generated keystore file and keystore password file for the gateway and worker nodes in the Data Collector configuration file, sdc.properties. You can optionally generate a truststore file for the gateway and worker nodes.

Note: If you run both standalone and cluster pipelines, configure HTTPS for cluster pipelines.

Keystore Files for Cluster Pipelines

When you generate an SSL/TLS certificate to run cluster pipelines, use the following files:

keystore file and keystore password file on the gateway node
The keystore file contains the private key and self-signed certificates for the gateway node to connect to the web server. The keystore password file contains the password to open the Java keystore file. You can create your own files or use the provided files.
Store both files on the gateway node. Update the path and name of the files in the following properties:
  • https.keystore.path
  • https.keystore.password
Enter an absolute path or a path relative to the $SDC_CONF directory.
keystore file and keystore password file on all worker nodes
You must create your own files for all worker nodes to connect to the web server. Store the files in the same absolute location on each worker node. Update the full path to the files in the following properties:
  • https.cluster.keystore.path
  • https.cluster.keystore.password
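
For example, the keystore properties for cluster pipelines might look like the following. The absolute paths are placeholders for the location you choose on the worker nodes:
# The /opt/sdc paths are illustrative; use the same absolute location on every worker node.
https.keystore.path=keystore.jks
https.keystore.password=${file("keystore-password.txt")}
https.cluster.keystore.path=/opt/sdc/keystore.jks
https.cluster.keystore.password=/opt/sdc/keystore-password.txt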

Truststore Files for Cluster Pipelines

Optionally, you can generate and use your own truststore file when Data Collector runs cluster pipelines. By default, Data Collector uses the truststore file from the following directory on all nodes in the cluster: $JAVA_HOME/jre/lib/security/cacerts.

When you generate a truststore file to run cluster pipelines, use the following files:

truststore file and truststore password file on the gateway node
The truststore file on the gateway node stores certificates to trust the identity of the worker nodes. The truststore password file contains the password to open the truststore file. Store both files on the gateway node. Uncomment the following properties and enter the path and name of the files:
  • https.truststore.path
  • https.truststore.password
Enter an absolute path or a path relative to the $SDC_CONF directory.
truststore file and truststore password file on all worker nodes
The truststore file on the worker nodes stores certificates to trust the identity of the gateway node. The truststore password file contains the password to open the truststore file. Store the files in the same absolute location on each worker node. Uncomment the following properties and enter the full path to the files:
  • https.cluster.truststore.path
  • https.cluster.truststore.password
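
For example, if you generate your own truststore files, you might uncomment and set the properties as follows. The file names and absolute paths are placeholders:
# The truststore file names and /opt/sdc paths are illustrative.
https.truststore.path=truststore.jks
https.truststore.password=truststore-password.txt
https.cluster.truststore.path=/opt/sdc/truststore.jks
https.cluster.truststore.password=/opt/sdc/truststore-password.txt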

Hadoop Impersonation Mode

You can configure how Data Collector impersonates a Hadoop user when performing tasks, such as reading or writing data, in Hadoop systems.

By default, Data Collector impersonates Hadoop users as follows:
  • As the user defined in stage properties - When configured, Data Collector uses the user defined in Hadoop-related stages.
  • As the currently logged in Data Collector user - When no user is defined in a Hadoop-related stage, Data Collector uses the currently logged in Data Collector user.
Note: In both cases, the Hadoop systems must be configured to allow the impersonation.

You can configure Data Collector to only use the currently logged in Data Collector user by enabling the stage.conf_hadoop.always.impersonate.current.user property in the Data Collector configuration file. When enabled, configuring a user within a stage is not allowed.

Configure Data Collector to always impersonate the currently logged in Data Collector user when you want to prevent stage-level user properties from being used to access data in Hadoop systems.

For example, say you use roles, groups, and pipeline permissions to ensure that only authorized operators can start pipelines. You expect that the operator user accounts are used to access all external systems. But a pipeline developer can specify an HDFS user in a Hadoop stage and bypass those restrictions. To close this loophole, configure Data Collector to always use the currently logged in Data Collector user to read from or write to Hadoop systems.

To always use the currently logged in Data Collector user, in the Data Collector configuration file, uncomment the stage.conf_hadoop.always.impersonate.current.user property and set it to true.
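
For example:
stage.conf_hadoop.always.impersonate.current.user=true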

With this property enabled, Data Collector prevents configuring an alternate user in the following Hadoop-related stages:
  • Hadoop FS origin and destination
  • MapR FS origin and destination
  • HBase lookup and destination
  • MapR DB destination
  • HDFS File Metadata executor
  • MapReduce executor

Blacklist and Whitelist for Stage Libraries

By default, almost all installed stage libraries are available for use in Data Collector. You can use blacklist and whitelist properties to limit the stage libraries that can be used.

To limit the stage libraries created by StreamSets, use one of the following properties:
  • system.stagelibs.whitelist
  • system.stagelibs.blacklist
To limit stage libraries created by other parties, use one of the following properties:
  • user.stagelibs.whitelist
  • user.stagelibs.blacklist
Warning: Use only the whitelist or blacklist for each set of libraries. Using both can cause Data Collector to fail to start.

The MapR stage libraries are blacklisted by default. To use one of the MapR stage libraries, run the MapR setup script as described in MapR Prerequisites.
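
For example, to allow only selected StreamSets stage libraries, you might configure the whitelist as follows. The comma-separated format and the library names shown are illustrative:
# Illustrative library names in a comma-separated list.
system.stagelibs.whitelist=streamsets-datacollector-basic-lib,streamsets-datacollector-jdbc-lib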

Configuring Data Collector

You can customize Data Collector by editing the Data Collector configuration file, sdc.properties.

Use a text editor to edit the Data Collector configuration file, $SDC_CONF/sdc.properties. To enable the changes, restart Data Collector.

The Data Collector configuration file includes the following general properties:
General Property Description
sdc.base.http.url Data Collector URL that is included in emails sent for metric and data alerts.

Default is http://<hostname>:<http.port> where <hostname> is the value defined in the http.bindHost property. If the host name is not defined in http.bindHost, Data Collector runs the following command to determine the host name: hostname -f

Be sure to uncomment the property if you change the value.

http.bindHost Host name or IP address that Data Collector binds to. You might want to configure a specific host or IP address when the machine that Data Collector is installed on has multiple network cards.

Default is 0.0.0.0, which means that Data Collector can bind to any host or IP address. Be sure to uncomment the property if you change the value.

http.maxThreads Maximum number of concurrent threads the Data Collector web server uses to serve UI requests.

Default is 200. Uncomment the property to change the value, but increasing this value is not recommended.

http.port Port number to use for the Data Collector console.

Default is 18630.

https.port Secure port number for the Data Collector console. Any value other than -1 enables the secure port.

If you configure both port properties, HTTP requests are redirected to the HTTPS port.

Default is -1.

http.enable.forwarded.requests Enables handling of the X-Forwarded-For, X-Forwarded-Proto, and X-Forwarded-Port HTTP request headers issued by a reverse proxy such as HAProxy, ELB, or NGINX.

Set to true when hosting Data Collector behind a reverse proxy or load balancer.

Default is false.

https.keystore.path Keystore path and file name. Enter an absolute path or a path relative to the $SDC_CONF directory. For cluster pipelines, enter the path to the file on the gateway node.

Default is keystore.jks in the $SDC_CONF directory.

https.keystore.password Path and name of the file that contains the password to the keystore file. Enter an absolute path or a path relative to the $SDC_CONF directory. For cluster pipelines, enter the path to the file on the gateway node.

Default is keystore-password.txt in the $SDC_CONF directory.

https.cluster.keystore.path For cluster pipelines, the absolute path and file name of the keystore file on worker nodes. The file must be in the same location on each worker node.
https.cluster.keystore.password For cluster pipelines, the absolute path and name of the file that contains the password to the keystore file on worker nodes. The file must be in the same location on each worker node.
https.truststore.path For cluster pipelines, the path and name of the truststore file on the gateway node. Enter an absolute path or a path relative to the $SDC_CONF directory.

Default is the file from the following directory: $JAVA_HOME/jre/lib/security/cacerts. Be sure to uncomment the property if you change the value.

https.truststore.password For cluster pipelines, the path and name of the file that contains the password to the truststore file on the gateway node. Enter an absolute path or a path relative to the $SDC_CONF directory.

Be sure to uncomment the property and provide a path to the password file if you use a generated truststore file instead of the default file from the following directory: $JAVA_HOME/jre/lib/security/cacerts.

https.cluster.truststore.path For cluster pipelines, the absolute path and file name of the truststore file on the worker nodes. The file must be in the same location on each worker node.

Default is the file from the following directory: $JAVA_HOME/jre/lib/security/cacerts. Be sure to uncomment the property if you change the value.

https.cluster.truststore.password For cluster pipelines, the absolute path and name of the file that contains the password to the truststore file on the worker nodes. The file must be in the same location on each worker node.

Be sure to uncomment the property and provide a path to the password file if you use a generated truststore file instead of the default file from the following directory: $JAVA_HOME/jre/lib/security/cacerts.

http.session.max.inactive.interval Maximum amount of time, in seconds, that the Data Collector console can remain inactive before the user is logged out. Use -1 to allow user sessions to remain inactive indefinitely.

Default is 86,400 seconds (24 hours).

http.authentication HTTP authentication. Use none, basic, digest, or form.

The HTTP authentication type determines how passwords are transferred from the browser to Data Collector over HTTP. Digest authentication encrypts the passwords. Basic and form authentication do not encrypt the passwords.

When using basic, digest, or form, with file-based authentication, use the associated realm.properties file to define user accounts. The realm.properties files are located in the $SDC_CONF directory.

Default is form.

http.authentication.login.module Indicates where user account information resides:
  • Set to file to use the realm.properties files.
  • Set to ldap to use an LDAP server.
http.digest.realm Realm used for HTTP authentication. Use basic-realm, digest-realm, or form-realm. The associated realm.properties file must be located in the $SDC_CONF directory.

Default is <http.authentication>-realm. Be sure to uncomment the property if you change the value.

http.realm.file.permission.check Checks the permissions for the realm.properties file in use:
  • Set to true to ensure that the file allows access only to the owner.
  • Set to false to skip the permission check.

Relevant when http.authentication.login.module is set to file.

http.authentication.ldap.role.mapping Maps groups defined by the LDAP server to Data Collector roles.
Enter a semicolon-separated list as follows:
<ldap group>:<SDC role>,<additional SDC role>;<ldap group>:<SDC role>,<additional SDC role>...

Relevant when http.authentication.login.module is set to ldap.

ldap.login.module.name Name of the JAAS configuration properties in the $SDC_CONF/ldap-login.conf file. Assumed to be "ldap" unless otherwise specified.
http.access.control.allow.origin List of domains allowed to access the Data Collector REST API for cross-origin resource sharing (CORS). To restrict access to specific domains, enter a comma-separated list as follows:
http://www.mysite.com, http://www.myothersite.com

Default is the asterisk wildcard (*), which means that any domain can access the Data Collector REST API.

http.access.control.allow.headers List of HTTP headers allowed during a cross-domain request.
http.access.control.allow.methods List of HTTP methods that can be called during a cross-domain request.
kerberos.client.enabled Enables Kerberos authentication for Data Collector. Must be enabled to allow stages to use Kerberos to access external systems.
kerberos.client.principal Kerberos principal to use. Enter a service principal.
kerberos.client.keytab Location of the Kerberos keytab file that contains the credentials for the Kerberos principal.

Use a fully-qualified directory or a directory relative to the $SDC_CONF directory.

preview.maxBatchSize Maximum number of records used to preview data.

Default is 10.

preview.maxBatches Maximum number of batches used to preview data.

Default is 10.

production.maxBatchSize Maximum number of records included in a batch when the pipeline runs.

Default is 1000.

parser.limit Maximum parser buffer size that origins can use to process data. Limits the size of the data that can be parsed and converted to a record.

By default, the parser buffer size is 1048576 bytes. To increase the size, uncomment and configure this property. For more information about how this property affects record sizes, see Maximum Record Size.

production.maxErrorRecordsPerStage Maximum number of error records to save in memory for each stage to display in monitor mode. When the limit is reached, older error records are discarded.

Default is 100.

production.maxPipelineErrors Maximum number of pipeline errors to save in memory to display in monitor mode. When the limit is reached, older errors are discarded.

Default is 100.

max.logtail.concurrent.requests Maximum number of external processes allowed to access the Data Collector log file at the same time through REST API calls.

Default is 5.

max.webSockets.concurrent.requests Maximum number of WebSocket calls allowed.
monitor.memory Displays the Stage Heap Memory Usage and the Heap Memory Usage statistics in Monitor mode for the pipeline or for a stage.

Default is false. Set to true only when testing real-world load usage in the test or production environment. When finished testing, set the property back to false.

pipeline.access.control.enabled Enables pipeline permissions and sharing pipelines. With pipeline permissions enabled, a user must have the appropriate permissions to view or work with a pipeline. Only Admin users and pipeline owners have full access to pipelines.

When pipeline permissions are disabled, access to pipelines is based on the roles assigned to the user and the user's groups. For more information about pipeline permissions, see Pipeline Permissions.

Default is false.

ui.header.title Optional custom header to display in the Data Collector console next to the StreamSets logo. You can create a header using HTML and include an additional image.

To use an image, place the image file in the following local directory: $SDC_DIST/sdc-static-web/

For example, to add custom text, you might use the following HTML:

<span class="navbar-brand">Dev Data Collector</span>
Or to use an image in the $SDC_DIST/sdc-static-web/ directory, you can use the following HTML:
<img src="<filename>.<extension>">

We recommend using an image no more than 48 pixels high.

ui.local.help.base.url Base URL for the online help installed with Data Collector.

Do not change this value.

ui.hosted.help.base.url Base URL for the online help hosted on the StreamSets website.

Do not change this value.

ui.refresh.interval.ms Interval in milliseconds that Data Collector waits before refreshing the Data Collector console.

Default is 2000.

ui.jvmMetrics.refresh.interval.ms Interval in milliseconds that the JVM metrics are refreshed.

Default is 4000.

ui.enable.usage.data.collection Allows anonymous Google Analytics information to be sent to StreamSets.

Default is true.

ui.enable.webSocket Enables Data Collector to use WebSocket to gather pipeline information.
ui.undo.limit Number of recent actions stored so you can undo them.
The Data Collector configuration file includes the following properties for sending email:
Email Property Description
mail.transport.protocol Use smtp or smtps.

Default is smtp.

mail.smtp.host SMTP host name.

Default is localhost.

mail.smtp.port SMTP port number.

Default is 25.

mail.smtp.auth Whether the SMTP host uses authentication. Use true or false.

Default is false.

mail.smtp.starttls.enable Whether the SMTP host uses STARTTLS encryption. Use true or false.

Default is false.

mail.smtps.host SMTPS host name.

Default is localhost.

mail.smtps.port SMTPS port number.

Default is 25.

mail.smtps.auth Whether the SMTPS host uses authentication. Use true or false.

Default is false.

xmail.username User name for the email account to send email.
xmail.password File that contains the password for the email account.

By default, the file name is email-password.txt and expected in the $SDC_CONF directory.

When appropriate, you can replace the default file name.

xmail.from.address Email address to use to send email.
The Data Collector configuration file includes the following advanced properties:
Advanced Property Description
runtime.conf.location Location of runtime properties. Use to declare where runtime properties are defined:
  • embedded - The runtime properties are defined in the file.

    Use the following naming convention to define runtime properties in the Data Collector configuration file: runtime.conf_<property name>.

  • <file path> - The directory and file name where runtime properties are defined. The runtime properties file must be relative to the $SDC_CONF directory.

    When you define runtime properties in a separate file, do not prefix the property names with runtime.conf_ .
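
For example, to define runtime properties directly in the Data Collector configuration file, you might add entries such as the following. HDFSDirTemplate is an illustrative runtime property name:
# HDFSDirTemplate is an illustrative runtime property name and value.
runtime.conf.location=embedded
runtime.conf_HDFSDirTemplate=/outputfiles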

The Data Collector configuration file includes the following stage-specific properties:
Stage-Specific Properties Description
stage.conf_hadoop.always.impersonate.current.user Ensures that Hadoop-related stages use the currently logged in Data Collector user to perform tasks, such as writing data, in Hadoop systems. With this property enabled, Data Collector prevents configuring an alternate user in Hadoop-related stages.

To use this property, uncomment the property and set it to true.

For more information and a list of affected stages, see Hadoop Impersonation Mode.

stage.conf_com.streamsets.pipeline.stage.executor.shell.impersonation_mode Uses the Data Collector user who starts the pipeline to execute shell scripts defined in Shell executors. When not enabled, the operating system user who started Data Collector is used to execute shell scripts.

To enable the secure use of shell scripts through the Shell executor, we highly recommend uncommenting this property.

Requires the user who starts the pipeline to have a matching user account in the operating system. For more information about the security ramifications, see Enabling Shell Impersonation Mode.

Used by Shell executors.

stage.conf_com.streamsets.pipeline.stage.executor.shell.shell Defines the relative or absolute path to the command line interpreter to use to execute scripts, such as /bin/bash.

Default is "sh".

Used by Shell executors.

stage.conf_com.streamsets.pipeline.stage.executor.shell.sudo Defines the relative or absolute path to the sudo command to use when executing scripts.

Default is sudo.

Used by Shell executors.
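
For example, to run Shell executor scripts with bash instead of the default sh:
stage.conf_com.streamsets.pipeline.stage.executor.shell.shell=/bin/bash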

The Data Collector configuration file includes the following observer properties, used to process data rules and alerts:
Observer Properties Description
observer.queue.size Maximum queue size for data rule evaluation requests. Each data rule generates an evaluation request for every batch that passes through the stream. When the number of requests outstrips the queue size, requests are dropped.

Default is 100.

observer.sampled.records.cache.size Maximum number of records to be cached for display for each rule. The exact number of records is specified in the data rule.

Default is 100. You can reduce this number as needed.

observer.queue.offer.max.wait.time.ms Maximum number of milliseconds to wait before dropping a data rule evaluation request when the observer queue is full.
The Data Collector configuration file includes the following miscellaneous properties:
Miscellaneous Property Description
max.stage.private.classloaders Maximum number of stage libraries Data Collector allows.

Default is 50.

runner.thread.pool.size Number of threads in the pool that are available to run standalone pipelines. One running pipeline requires five threads, and pipelines share threads in the pool. To calculate the approximate runner thread pool size, multiply the number of running pipelines by 2.2.

Increasing this value does not increase the parallelization of an individual pipeline.

Default is 50, which is sufficient to run approximately 22 standalone pipelines at the same time.

pipeline.max.runners.count Maximum number of pipeline runners to use for a multithreaded pipeline.

Default is 50.

bundle.upload.enabled Allows Data Collector to directly upload support bundles to the StreamSets support team.

To disable the direct upload, uncomment this property.

The configuration file includes stage and stage library aliases to enable backward compatibility for pipelines created with earlier versions of Data Collector, such as:
stage.alias.streamsets-datacollector-basic-lib,
com_streamsets_pipeline_stage_destination_jdbc_JdbcDTarget=
streamsets-datacollector-jdbc-lib,
com_streamsets_pipeline_stage_destination_jdbc_JdbcDTarget

library.alias.streamsets-datacollector-apache-kafka_0_8_1_1-lib=
streamsets-datacollector-apache-kafka_0_8_1-lib
Generally, you should not need to change or remove these aliases.
You can optionally add stage libraries to the blacklist or whitelist properties to limit the stage libraries that Data Collector uses:
Blacklist / Whitelist Property Description
system.stagelibs.whitelist

system.stagelibs.blacklist

Use one list to limit the StreamSets stage libraries that can be used in Data Collector. Do not use both.
user.stagelibs.whitelist

user.stagelibs.blacklist

Use one list to limit the third-party stage libraries that can be used in Data Collector. Do not use both.

The Data Collector configuration file includes the following property that specifies additional files to include in the Data Collector configuration:

Additional Files Property Description
config.includes Additional configuration files to include in the Data Collector configuration. The files must be stored in a directory relative to the $SDC_CONF directory.

You can enter multiple file names separated by commas. The files are loaded into the Data Collector configuration in the listed order. If the same configuration property is defined in multiple files, the value defined in the last loaded file takes precedence.

By default, the dpm.properties and vault.properties files are included in the Data Collector configuration.
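
For example, the default configuration might include these files as follows:
config.includes=dpm.properties,vault.properties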

The Data Collector configuration file includes record sampling properties that indicate the size of the sample set chosen from a total population of records. Data Collector uses the sampling properties when you run a pipeline that writes to a destination system using the SDC Record data format and then run another pipeline that reads from that same system using the SDC Record data format. Data Collector uses record sampling to calculate the time that a record stays in the intermediate destination.

By default, Data Collector uses 1 out of 10,000 records for sampling. If you modify the sampling size, simplify the fraction for better performance. For example, configure the sampling size as 1/40 records instead of 250/10000 records. The following properties specify the sampling size:

Record Sampling Property Description
sdc.record.sampling.sample.size Size of the sample set.

Default is 1.

sdc.record.sampling.population.size Size of the total number of records.

Default is 10,000.
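
For example, to sample 1 of every 40 records instead of the default 1 of every 10,000, you might set the properties as follows:
sdc.record.sampling.sample.size=1
sdc.record.sampling.population.size=40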

The Data Collector configuration file includes properties that define how Data Collector caches pipeline states. Data Collector can cache the state of pipelines for faster retrieval of those states in the Home page. If Data Collector does not cache pipeline states, it must retrieve pipeline states from the pipeline data files stored in the $SDC_DATA directory. You can configure the following properties that specify how Data Collector caches pipeline states:

Pipeline State Cache Property Description
store.pipeline.state.cache.maximum.size Maximum number of pipeline states that Data Collector caches. When the maximum number is reached, Data Collector evicts the oldest states from the cache.

Default is 100.

store.pipeline.state.cache.expire.after.access Amount of time in minutes that a pipeline state can remain in the cache after the entry's creation, the most recent replacement of its value, or its last access.

Default is 10 minutes.