Administration
StreamSets Accounts
StreamSets Accounts enables users without an enterprise account to download and use the latest version of Data Collector and Transformer. StreamSets Accounts also enables logging into linked Data Collectors and Transformers. Users with a StreamSets enterprise account use the StreamSets Support portal for downloads.
You can use an existing Google or Microsoft account to register with StreamSets Accounts. Or, you can enter an email address and password, then check your email to confirm your registration. After registering, you can download the latest Data Collector or Transformer tarball or access Docker for the latest Data Collector or Transformer image.
StreamSets Accounts provides downloads for the Data Collector common installation tarball. The common installation includes core Data Collector functionality as well as commonly-used stage libraries to enable easy pipeline creation. You can install additional stage libraries as needed.
After you install Data Collector, you link it to StreamSets Accounts. StreamSets Accounts then provides single sign on authentication for Data Collector.
To use StreamSets Accounts authentication with Data Collector, Data Collector must be able to connect to StreamSets Accounts. Be sure to whitelist the StreamSets Accounts URL as needed.
You can configure Data Collector to use a different authentication method. You might change the authentication method to allow other users to share access to Data Collector.
To register or log into StreamSets Accounts, go to https://accounts.streamsets.com.
Providing an Activation Code
Users with an enterprise account need to provide an activation code only when using a Docker Data Collector image that is not registered with Control Hub or linked with StreamSets Accounts.
Other users do not need to provide an activation code.
To request an activation code, submit a request through the StreamSets Support portal.
After you receive an email with the activation code, log in to Data Collector. On the registration page, click Enter a Code, then paste the code into the Activation window and click Activate.
Users with the Admin role can view activation details in Data Collector.
Viewing Data Collector Configuration Properties
You can view the configuration properties used by this Data Collector instance. For details about the configuration properties or to edit the configuration file, see Configuring Data Collector.
Viewing Data Collector Directories
You can view the directories that Data Collector uses. You might check the directories in use to locate a file or to increase the amount of available space for a directory.
Data Collector directories are defined in environment variables. For more information, see Data Collector Environment Configuration.
To view Data Collector directories, use the Administration menu in the Data Collector UI.
Directory | Includes | Environment Variable |
---|---|---|
Runtime | Base directory for Data Collector executables and related files. | SDC_DIST |
Configuration | The Data Collector configuration file, sdc.properties, and related realm properties files and keystore files. Also includes the log4j properties file. | SDC_CONF |
Data | Pipeline configuration and run details. | SDC_DATA |
Log | Data Collector log file, sdc.log. | SDC_LOG |
Resources | Directory for runtime resource files. | SDC_RESOURCES |
SDC Libraries Extra Directory | Directory to store external libraries. | STREAMSETS_LIBRARIES_EXTRA_DIR |
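For example, a minimal sketch of overriding two of these directories by exporting the environment variables before starting Data Collector manually; the paths are placeholders, and the sdc-env.sh file mentioned under Restarting Data Collector is one common place to set them:
export SDC_DATA=/opt/streamsets/sdc-data
export SDC_LOG=/var/log/sdc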
Viewing Data Collector Metrics
You can view metrics about Data Collector, such as the CPU usage or the number of pipeline runners in the thread pool.
Viewing Data Collector Logs
You can view and download log data. When you download log data, you can select the file to download.
Data Collector Log Format
Data Collector uses the Apache Log4j library to write log data. Each log entry includes a timestamp and message along with additional information relevant for the message.
- Timestamp
- User
- Pipeline
- Runner
- Thread
- Stage
- Severity
- Category
- Message
For example:
2019-03-19 09:34:26,236 [user:admin] [pipeline:Test/TestPipeline65f67dde-faad-426d-ac47-8a2cd707f224] [runner:] [thread:webserver-430] [stage:] INFO StandaloneAndClusterRunnerProviderImpl - Pipeline execution mode is: STANDALONE
For this message, the stage and runner are not relevant, and therefore not included in the log entry.
The information included in the downloaded file is set by the log4j.appender.stdout.layout.ConversionPattern property in the log configuration file, $SDC_CONF/sdc-log4j.properties. The default configuration sets this property to:
%d{ISO8601} [user:%X{s-user}] [pipeline:%X{s-entity}] [runner:%X{s-runner}] [thread:%t] [stage:%X{s-stage}] %-5p %c{1} - %m%n
The pattern uses the following variables to capture pipeline context:
- %X{s-entity} - Pipeline name and ID
- %X{s-runner} - Runner ID
- %X{s-stage} - Stage name
- %X{s-user} - User who initiated the operation
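As a hedged illustration of editing this property (a variation, not the shipped default), the following line in $SDC_CONF/sdc-log4j.properties drops the stage field and removes the padding on the severity column; other standard log4j conversion characters can be combined the same way:
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} [user:%X{s-user}] [pipeline:%X{s-entity}] [runner:%X{s-runner}] [thread:%t] %p %c{1} - %m%n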
Modifying the Log Level
If the Data Collector logs do not provide enough troubleshooting information, you can modify the log level to display messages at another severity level. You can use the following log levels:
- TRACE
- DEBUG
- INFO (Default)
- WARN
- ERROR
- FATAL
When you’ve finished troubleshooting, set the log level back to INFO to avoid having verbose log files.
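If you prefer to change the level directly in the log configuration file, the following is a minimal sketch. It assumes the root logger in $SDC_CONF/sdc-log4j.properties references an appender named stdout, as the ConversionPattern property above suggests; check the existing rootLogger line in your file and change only the level:
log4j.rootLogger=DEBUG, stdout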
Shutting Down Data Collector
Use one of the following methods to shut down Data Collector:
- User interface
- To use the Data Collector UI for shutdown:
- From the Administration menu, click Shut Down.
- When a confirmation dialog box appears, click Yes.
- Command line when started as a service
- To use the command line for shutdown when Data Collector is started as a service, use the required command for your operating
system:
- For CentOS 6, Oracle Linux 6, Red Hat Enterprise Linux 6, or Ubuntu 14.04 LTS, use:
service sdc stop
- For CentOS 7, Oracle Linux 7, Red Hat Enterprise Linux 7, or Ubuntu 16.04 LTS, use:
systemctl stop sdc
- Command line when started manually
- To use the command line for shutdown when Data Collector is started manually, use the Data Collector process ID in the following
command:
kill -15 <process ID>
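If you do not know the process ID, one hedged way to find it is to search the process list for the Data Collector command; the grep pattern below is an assumption about how the process appears on your system:
ps -ef | grep "[s]treamsets"
kill -15 <process ID>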
Restarting Data Collector
- Started manually
If you changed or added an environment variable in the sdc-env.sh file, then you must restart Data Collector from the command prompt. Press Ctrl+C to shut down Data Collector, and then enter bin/streamsets dc to restart it.
If you did not change or add an environment variable, then you can restart Data Collector from the command prompt or from the user interface. To restart from the user interface, open the Administration menu, expand StreamSets Data Collector was started manually, and then click Restart Data Collector.
- Started as a service
Run the appropriate command for your operating system:
- For CentOS 6, Oracle Linux 6, Red Hat Enterprise Linux 6, or Ubuntu 14.04 LTS, use:
service sdc start
- For CentOS 7, Oracle Linux 7, Red Hat Enterprise Linux 7, or Ubuntu 16.04 LTS, use:
systemctl start sdc
- Started from Cloudera Manager
Use Cloudera Manager to restart Data Collector. For information about how to restart a service through Cloudera Manager, see the Cloudera documentation.
- Started from Docker
Run the following Docker command:
docker restart <containerID>
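If you do not know the container ID, one way to find it is to list running containers filtered by image; this sketch assumes the container was started from the streamsets/datacollector image on Docker Hub, so adjust the image name if you built your own:
docker ps --filter "ancestor=streamsets/datacollector"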
Viewing Users and Groups
If you use file-based authentication, you can view all user accounts granted access to this Data Collector instance, including the roles and groups assigned to each user.
To view users and groups, use the Administration menu. Data Collector displays a read-only view of the users, groups, and roles.
You configure users, groups, and roles for file-based authentication in the associated realm.properties file located in the Data Collector configuration directory, $SDC_CONF. For more information, see Configuring File-Based Authentication.
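For reference, a hypothetical entry in a form realm.properties file might look like the following; the user name, MD5 hash, roles, and group are illustrative, and Configuring File-Based Authentication describes the exact syntax:
myuser: MD5:5f4dcc3b5aa765d61d8327deb882cf99,user,creator,group:ops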
Managing Usage Statistics Collection
You can help to improve Data Collector by allowing StreamSets to collect usage statistics about Data Collector system performance and the features that you use. This telemetry data helps StreamSets to improve product performance and to make feature development decisions.
You can configure whether to allow usage statistics collection.
Support Bundles
You can use Data Collector to generate a support bundle. A support bundle is a ZIP file that includes Data Collector logs, environment and configuration information, pipeline JSON files, resource files, and pipeline snapshots. You upload the generated file to a StreamSets Support ticket, and the Support team can use the information to troubleshoot your tickets. Alternatively, you can send the file to another StreamSets community member.
Data Collector uses several generators to create a support bundle. Each generator bundles different types of information. You can choose to use all or some of the generators.
Each generator automatically redacts all passwords entered in pipelines, configuration files, or resource files. The generators replace all passwords with the text REDACTED in the generated files. You can customize the generators to redact other sensitive information, such as machine names or user names.
Before uploading a generated ZIP file to a support ticket, we recommend verifying that the file does not include any sensitive information that you do not want to share.
Generators
Data Collector can use the following generators to create a support bundle:
Generator | Description |
---|---|
SDC Info | Includes general information about the Data Collector instance. |
Pipelines | Includes JSON files for each pipeline. By default, all Data Collector pipelines are included in the bundle. |
Logs | Includes the most recent content of the Data Collector log files. |
Snapshots | Includes snapshots created for each pipeline. |
In addition, Data Collector always generates the following files when you create a support bundle:
- metadata.properties - ID and version of the Data Collector that generated the bundle.
- generators.properties - List of generators used for the bundle.
Generating a Support Bundle
When you generate a support bundle, you choose the information to include in the bundle. Only users with the Admin role can generate support bundles.
You can download the bundle, and then verify its contents and upload it to a StreamSets Support ticket.
Customizing Generators
By default, the generators redact all passwords entered in pipelines, configuration files, or resource files. You can customize the generators to redact other sensitive information, such as machine names or user names.
To customize the generators, modify the support bundle redactor file, $SDC_CONF/support-bundle-redactor.json. The file contains rules that the generators use to redact sensitive information. Each rule contains the following information:
- description - Description of the rule.
- trigger - String constant that triggers a redaction. If a line contains this trigger string, then the redaction continues by applying the regular expression specified in the search property.
- search - Regular expression that defines the sub-string to redact.
- replace - String to replace the redacted information with.
For example, the following rule redacts custom StreamSets domain names:
{
"description": "Custom domain names",
"trigger": ".streamsets.com",
"search": "[a-z_-]+.streamsets.com",
"replace": "REDACTED.streamsets.com"
}
REST Response
You can view REST response JSON data for different aspects of the Data Collector, such as pipeline configuration information or monitoring details.
You can use the REST response information to provide Data Collector details to a REST-based monitoring system. Or you might use the information in conjunction with the Data Collector REST API.
- Pipeline Configuration - Provides information about the pipeline and each stage in the pipeline.
- Pipeline Rules - Provides information about metric and data rules and alerts.
- Definitions - Provides information about all available Data Collector stages.
- Preview Data - Provides information about the preview data moving through the pipeline. Also includes monitoring information that is not used in preview.
- Pipeline Monitoring - Provides monitoring information for the pipeline.
- Pipeline Status - Provides the current status of the pipeline.
- Data Collector Metrics - Provides metrics about Data Collector.
- Thread Dump - Lists all active Java threads used by Data Collector.
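For example, a monitoring system can retrieve the same JSON over HTTP. The sketch below is an assumption about a default, form-authenticated Data Collector at the default URL; substitute the REST response path shown in the UI for the placeholder:
curl -u <user>:<password> "http://localhost:18630/<REST response path>"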
Viewing REST Response Data
You can view REST response data from the location where the relevant information displays. For example, you can view Data Collector Metrics REST response data from the Data Collector Metrics page.
- Edit mode
- From the Properties panel, you can use the More icon to view the following REST response data:
- Pipeline Configuration
- Pipeline Rules
- Pipeline Status
- Definitions
- Preview mode
- From the Preview panel, you can use the More icon to view the Preview Data REST response data.
- Monitor mode
- From the Monitor panel, you can use the
More icon to view the following REST response
data:
- Pipeline Monitoring
- Pipeline Configuration
- Pipeline Rules
- Pipeline Status
- Definitions
- Data Collector Metrics page
- From the Data Collector Metrics page, you can use the More icon to view the following REST response data:
- Data Collector Metrics
- Thread Dump
Disabling the REST Response Menu
You can configure the Data Collector to disable the display of REST responses.
- To disable the REST Response menus, click the Help icon, and then click Settings.
- In the Settings window, select Hide the REST Response Menu.
Command Line Interface
Data Collector provides a command line interface that includes a basic cli command. Use the cli command to perform some of the same actions that you can complete from the Data Collector UI. Data Collector must be running before you can use the cli command.
You can use the following commands with the cli command:
- help
- Provides information about each command or subcommand.
- manager
- Provides the following subcommands:
- start - Starts a pipeline.
- status - Returns the status of a pipeline.
- stop - Stops a pipeline.
- reset-origin - Resets the origin when possible.
- get-committed-offsets - Returns the last-saved offset for pipeline failover.
- update-committed-offsets - Updates the last-saved offset for pipeline failover.
- store
- Provides the following subcommands:
- import - Imports a pipeline.
- list - Lists information for all available pipelines.
- system
- Provides the following subcommands:
- enableDPM - Registers the Data Collector with StreamSets Control Hub.
- disableDPM - Unregisters the Data Collector from Control Hub.
Java Configuration Options for the Cli Command
Use the SDC_CLI_JAVA_OPTS environment variable to modify Java configuration options for the cli command.
For example, to set the -Djavax.net.ssl.trustStore option for the cli command when using Data Collector with HTTPS, run the following command:
export SDC_CLI_JAVA_OPTS="-Djavax.net.ssl.trustStore=<path to truststore file> ${SDC_CLI_JAVA_OPTS}"
Using the Cli Command
Call the cli command from the $SDC_DIST directory.
Use the following syntax for cli commands:
bin/streamsets cli \
(-U <sdcURL> | --url <sdcURL>) \
[(-a <sdcAuthType> | --auth-type <sdcAuthType>)] \
[(-u <sdcUser> | --user <sdcUser>)] \
[(-p <sdcPassword> | --password <sdcPassword>)] \
[(-D <dpmURL> | --dpmURL <dpmURL>)] \
<command> <subcommand> [<args>]
The usage of the basic command options depends on whether or not the Data Collector is registered with Control Hub.
Not Registered with Control Hub
Option | Description |
---|---|
-U <sdcURL> or --url <sdcURL> | Required. URL of the Data Collector. The default URL is http://localhost:18630. |
-a <sdcAuthType> or --auth-type <sdcAuthType> | Optional. HTTP authentication type used by the Data Collector. |
-u <sdcUser> or --user <sdcUser> | Optional. User name to use to log in. The roles assigned to the user account determine the tasks that you can perform. If you omit this option, the Data Collector allows admin access. |
-p <sdcPassword> or --password <sdcPassword> | Optional. Required when you enter a user name. Password for the user account. |
-D <dpmURL> or --dpmURL <dpmURL> | Not applicable. Do not use when the Data Collector is not registered with Control Hub. |
<command> | Required. Command to perform. |
<subcommand> | Required for all commands except help. Subcommand to perform. |
<args> | Optional. Include arguments and options as needed. |
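For example, a minimal sketch of listing pipelines on a Data Collector that is not registered with Control Hub, assuming the default URL and the default admin credentials:
bin/streamsets cli -U http://localhost:18630 -u admin -p admin store list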
Registered with Control Hub
Option | Description |
---|---|
-U <sdcURL> or --url <sdcURL> | Required. URL of the Data Collector. The default URL is http://localhost:18630. |
-a <sdcAuthType> or --auth-type <sdcAuthType> | Required. Authentication type used by the Data Collector. Set to dpm. If you omit this option, Data Collector uses the Form authentication type, which causes the command to fail. |
-u <sdcUser> or --user <sdcUser> | Required. User account to log in. Enter your Control Hub user ID using the following format: <ID>@<organization ID>. The roles assigned to the Control Hub user account determine the tasks that you can perform. If you omit this option, Data Collector uses the admin user account, which causes the command to fail. |
-p <sdcPassword> or --password <sdcPassword> | Required. Enter the password for your Control Hub user account. |
-D <dpmURL> or --dpmURL <dpmURL> | Required. Set to: https://cloud.streamsets.com. |
<command> | Required. Command to perform. |
<subcommand> | Required for all commands except help. Subcommand to perform. |
<args> | Optional. Include arguments and options as needed. |
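For example, a hedged sketch of the same store list call against a Data Collector registered with Control Hub; the user ID and password are placeholders:
bin/streamsets cli -U http://localhost:18630 -a dpm -u <ID>@<organization ID> -p <password> -D https://cloud.streamsets.com store list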
Help Command
Use the help command to view additional information for the specified command. The help command uses the following syntax:
bin/streamsets cli \
(-U <sdcURL> | --url <sdcURL>) \
[(-a <sdcAuthType> | --auth-type <sdcAuthType>)] \
[(-u <sdcUser> | --user <sdcUser>)] \
[(-p <sdcPassword> | --password <sdcPassword>)] \
[(-D <dpmURL> | --dpmURL <dpmURL>)] \
help <command> [<subcommand>]
For example, the following command returns information about the manager command:
bin/streamsets cli -U http://localhost:18630 help manager
Manager Command
The manager command provides subcommands to start and stop a pipeline, view the status of a pipeline, and reset the origin for a pipeline. It can also be used to get and update the last-saved offset for a pipeline.
The manager command returns the pipeline status object after it successfully completes the specified subcommand. The following is a sample of the pipeline status object:
{
"user" : "admin",
"name" : "MyPipelinejf45e1f1-dfc1-402c-8587-918bc6e831db",
"pipelineID" : "MyPipelinejf45e1f1-dfc1-402c-8587-918bc6e831db",
"rev" : "0",
"status" : "STOPPING",
"message" : null,
"timeStamp" : 1447116703147,
"attributes" : { },
"executionMode" : "STANDALONE",
"metrics" : null,
"retryAttempt" : 0,
"nextRetryTimeStamp" : 0
}
Note that the timeStamp value uses the Long data type and represents epoch time in milliseconds.
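Because the value is in milliseconds, it can be converted to a readable date by dividing by 1000. For example, with GNU date the sample value above converts as follows (output shown is approximate):
date -u -d @$((1447116703147 / 1000))
Tue Nov 10 00:51:43 UTC 2015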
You can use the following manager
subcommands:
- start
- Starts a pipeline. Returns the pipeline status when successful.
- stop
- Stops a pipeline. Returns the pipeline status when successful.
- status
- Returns the status of a pipeline. Returns the pipeline status when successful.
- reset-origin
- Resets the origin of a pipeline. Use for pipeline origins that can be reset. Some pipeline origins cannot be reset. Returns the pipeline status when successful.
- get-committed-offsets
- Returns the last-saved offset for a pipeline with an origin that saves offsets. Some origins, such as the HTTP Server, have no need to save offsets.
- update-committed-offsets
- Updates the last-saved offset for a pipeline with an origin that saves offsets. Some origins, such as the HTTP Server, have no need to save offsets.
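As a hedged example of a manager call, the following checks a pipeline's status on a Data Collector that is not registered with Control Hub; the -n option identifying the pipeline is an assumption here, so run the help command (help manager status) to confirm the exact arguments for your version:
bin/streamsets cli -U http://localhost:18630 -u admin -p admin manager status -n <pipelineID>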
Store Command
The store command provides subcommands to view a list of all pipelines and to import a pipeline.
You can use the following subcommands with the store command:
- list
- Lists all available pipelines. The list subcommand uses the following syntax:
store list
- import
- Imports a pipeline. Use to import a pipeline JSON file, typically exported from a Data Collector. Returns a message when the import is successful.
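As a further hedged sketch, the following imports a pipeline from a JSON file; the -n and -f options are assumptions, so confirm the exact arguments with help store import:
bin/streamsets cli -U http://localhost:18630 -u admin -p admin store import -n <pipelineName> -f <path to pipeline JSON file>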
System Command
The system command provides subcommands to register and unregister the Data Collector with Control Hub.
You can use the following subcommands with the system command:
- enableDPM
- Registers the Data Collector with Control Hub. For a description of the syntax, see Registering from the Command Line Interface.
- disableDPM
- Unregisters the Data Collector from Control Hub. For a description of the syntax, see Unregistering from the Command Line Interface.