Getting Started

What is StreamSets Data Collector?

StreamSets Data Collector is a lightweight, powerful engine that streams data in real time. Use Data Collector to route and process data in your data streams.

To define the flow of data for Data Collector, you configure a pipeline. A pipeline consists of stages that represent the origin and destination of the pipeline, and any additional processing that you want to perform. After you configure the pipeline, you click Start and Data Collector goes to work.

Data Collector processes data when it arrives at the origin and waits quietly when not needed. You can view real-time statistics about your data, inspect data as it passes through the pipeline, or take a close look at a snapshot of data.

How should I use Data Collector?

Use StreamSets Data Collector like a pipe for a data stream. Throughout your enterprise data topology, you have streams of data that you need to move, collect, and process on the way to their destinations. Data Collector provides the crucial connection between hops in the stream.

To solve your ingest needs, you can use a single Data Collector to run one or more pipelines. Or you might install a series of Data Collectors to stream data across your enterprise data topology.

How does this really work?

Let's walk through it...

After you install and launch Data Collector, you access the Data Collector console, log in, and create your first pipeline.

What do you want it to do? Let's say you want to read XML files from a directory and remove the newline characters before moving the data into HDFS. To do this, you start with a Directory origin stage and configure it to point to the source file directory. (You can also have the stage archive processed files and write files that were not fully processed to a separate directory for review.)

To remove the newline characters, connect Directory to an Expression Evaluator processor and configure it to remove the newline character from the last field in the record.
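The exact configuration depends on your field names, but an Expression Evaluator entry for this step might look like the following sketch. The field name /text is hypothetical, and the expression assumes the str:replaceAll and record:value expression language functions available in Data Collector:

```
Output Field: /text
Expression:   ${str:replaceAll(record:value('/text'), '\\n', '')}
```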

To make the data available to HDFS, you connect the Expression Evaluator to a Hadoop FS destination stage. You configure the stage to write the data as a JSON object (though you can use other data formats as well).
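Outside of Data Collector, the transformation this three-stage pipeline performs can be pictured with a short plain-Python sketch (this is not Data Collector code; the record shape and field names are hypothetical):

```python
import json

def process_record(record):
    """Mimic the pipeline: strip newlines from the last field, then emit one JSON object."""
    last_field = list(record)[-1]                              # fields keep insertion order
    record[last_field] = record[last_field].replace("\n", "")  # Expression Evaluator step
    return json.dumps(record)                                  # Hadoop FS writes JSON objects

# One record as it might be parsed from an XML file in the source directory.
record = {"id": "42", "text": "first line\nsecond line\n"}
print(process_record(record))  # {"id": "42", "text": "first linesecond line"}
```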

You preview data to see how source data moves through the pipeline and notice that some fields have missing data. So you add a Value Replacer processor to replace null values in those fields.
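What the Value Replacer does here can be pictured as a simple null-for-default substitution. A minimal sketch in plain Python, where the field names and default values are hypothetical:

```python
def replace_nulls(record, defaults):
    """Replace None values with configured defaults, like a Value Replacer stage."""
    return {field: (defaults.get(field) if value is None else value)
            for field, value in record.items()}

record = {"id": "7", "city": None}
print(replace_nulls(record, {"city": "unknown"}))  # {'id': '7', 'city': 'unknown'}
```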

Now that the data flow is done, you configure the pipeline error record handling to write error records to a file, you create a data drift alert to let you know when field names change, and you configure an email alert to let you know when the pipeline generates more than 100 error records. Then, you start the pipeline and Data Collector goes to work.

The Data Collector console goes into Monitor mode and displays summary and error statistics immediately. To get a closer look at the activity, you take a snapshot of the pipeline so you can examine how a set of data passed through the pipeline. You see some unexpected data in the pipeline, so you create a data rule for a link between two stages to gather information about similar data and set an alert to notify you when the numbers get too high.

And what about those error records being written to file? They're saved with error details, so you can create an error pipeline to reprocess that data. Et voilà!

StreamSets Data Collector is a powerful tool, but we're making it as simple as possible to use. So give it a try, click the Help icon for information, and contact us if you need a hand.

Logging In and Creating a Pipeline

After you start Data Collector, you can log in to the Data Collector console and create your first pipeline.

You can customize the address and login that you use to access Data Collector. This procedure uses the default settings.

  1. To access the Data Collector console, enter the following URL in the address bar of your browser: http://<system-ip>:18630/

    If you changed the default Data Collector port number in the Data Collector configuration file, $SDC_CONF/, use that number instead.

  2. In the Login dialog box, use the following credentials to log in: admin / admin.
    If you created a custom login, feel free to use it.
  3. On the Get Started page, click Create New Pipeline.
  4. In the New Pipeline window, enter a name for the pipeline, optionally enter a description, and click Save.
    The Data Collector console appears. The Properties panel displays pipeline properties.
For the steps to configure a pipeline, continue with step 3 in "Configuring a Pipeline".
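As a reference point for step 1, the console port is defined in the Data Collector configuration file. Assuming a stock installation where that file is sdc.properties and the HTTP port default is 18630 (both are assumptions about a typical install), the relevant entry looks like this:

```properties
# HTTP port for the Data Collector console (assumed default)
http.port=18630
```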

Data Collector Console

Data Collector provides a console to configure pipelines, preview data, monitor pipelines, and review snapshots of data.

The Data Collector console includes the following general areas and icons:

• Pipeline canvas - Canvas for configuring or monitoring a pipeline.
• Properties panel / Preview panel / Monitor panel - When you configure a pipeline, the Properties panel displays the properties of the pipeline or selected stage. You can resize, minimize, and maximize the panel. When you preview data, the Preview panel displays the input and output data for the selected stage or group of stages. When you monitor a running pipeline, the Monitor panel displays real-time metrics and statistics.

Note: Some icons and options might not display in the console. The items that display are based on the task that you are performing and the roles assigned to your user account.

• Dataflow Performance Manager icon - Provides information about StreamSets Dataflow Performance Manager (DPM) and lets you register this Data Collector with DPM.
• Home icon - Displays a home page with a list of pipelines and their statuses.
• Package Manager icon - Displays the Package Manager, which allows you to install additional stage libraries for a core Data Collector installation.
• Notifications icon - Displays notifications.
• Administration icon - Provides access to Data Collector configuration properties, directories, and logs. Also allows you to shut down Data Collector.
• User icon - Displays the active user and the roles assigned to the user. Also allows you to log out of the console.
• Help icon - Provides access to the online help, the REST API, and the Data Collector version. Also allows you to define display settings and the online help version to use.
• More icon - Provides additional actions for the pipeline. Use it to import, export, duplicate, or delete a pipeline.

For information about pipeline configuration options, see Data Collector Console - Edit Mode.

For information about data preview options, see Data Collector Console - Preview Mode.

For information about pipeline monitoring options, see Data Collector Console - Monitor Mode.

For information about maintaining pipelines on the Home page, see Data Collector Console - All Pipelines on the Home Page.

Configuring Console Settings

You can configure how information in the Data Collector console displays, such as the online help version, information density in the panel, and the pipeline creation help bar.

  1. In the top right corner of the Data Collector console, click Help > Settings.
  2. In the Settings dialog box, you can configure the following options:
    • Timezone - Display time zone, used to display dates and times in the Data Collector console, such as datetime data in data preview or snapshot data. You can choose between the following options:
      • UTC
      • The browser time zone, which usually uses the operating system time zone.
      • The operating system time zone of the Data Collector machine, for when Data Collector runs on a different machine than the browser.
    • Display Density - Defines the density of the information that displays in the panel.
    • Help - Defines the help project that Data Collector uses:
      • Local help - Uses the help project installed with Data Collector.
      • Hosted help - Uses a help project hosted on the StreamSets website. Hosted help contains the latest available documentation and requires an internet connection.
      Default is hosted help. When internet access is not available, Data Collector uses local help. Both help projects provide context-sensitive help.
    • Hide Pipeline Creation Help Bar - Hides the pipeline configuration help bar that displays by default when a pipeline is incomplete.
    • Hide REST Response Menu - Hides the REST Response menu so that you cannot request REST API response information.
    • Run preview in background to display available fields - Runs preview in the background to display lists of available fields and the Select Fields with Preview Data option as you configure pipeline and stage properties. Running preview in the background can freeze the browser if the preview returns a large record set. To resolve this issue, clear the property.
    • Wrap long lines in properties - Wraps long lines of text that you enter in properties. For example, you might enter a long line of text when you configure the preconditions for a stage. When cleared, long lines of text display with a scroll bar.