What is StreamSets Data Collector?
StreamSets Data Collector™ is a lightweight, powerful design and execution engine that streams data in real time. Use Data Collector to route and process data in your data streams.
What is StreamSets Data Collector Edge?
StreamSets Data Collector Edge™ (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.
What is StreamSets Control Hub?
StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Use Control Hub to allow your teams to build and execute large numbers of complex dataflows at scale.
Register with StreamSets Accounts
Logging In and Creating a Pipeline in Data Collector
After you start Data Collector, you can log in to Data Collector and create your first pipeline.
Data Collector User Interface
Data Collector provides a web-based user interface (UI) to configure pipelines, preview data, monitor pipelines, and review snapshots of data.
Data Collector UI - Pipelines on the Home Page
Data Collector displays a list of all available pipelines and related information on the Home page. You can select a category of pipelines, such as Running Pipelines, to view a subset of all available pipelines.
Tutorials and Sample Pipelines
StreamSets provides multiple tutorials and sample pipelines to help you learn about using Data Collector.
You can install Data Collector and start it manually or run it as a service.
Full Installation and Launch (Manual Start)
Full Installation and Launch (Service Start)
Common Installation
All users can install the Data Collector common installation. The common installation includes all core Data Collector functionality and commonly used stage libraries in a tarball installation.
Core Installation
Users with a StreamSets enterprise account can use the Data Collector core installation.
Supported Systems and Versions
Install Additional Stage Libraries
Installation with Cloudera Manager
Run Data Collector from Docker
Installation with Cloud Service Providers
MapR Prerequisites
Creating Another Data Collector Instance
User Authentication
Roles and Permissions
Enabling HTTPS
Using a Reverse Proxy
Data Collector Configuration
You can edit the Data Collector configuration file, $SDC_CONF/sdc.properties, to configure properties such as the host name and port number and account information for email alerts.
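As an illustration, a few of the properties you might edit in $SDC_CONF/sdc.properties are sketched below; the values shown (port, bind host, mail server) are examples only, so verify property names and defaults against the configuration file shipped with your installation.

```properties
# HTTP endpoint for the Data Collector UI and REST API
http.port=18630
http.bindHost=0.0.0.0

# Mail server account information used for email alerts
mail.transport.protocol=smtp
mail.smtp.host=localhost
mail.smtp.port=25
```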
Data Collector Environment Configuration
Install External Libraries
Custom Stage Libraries
Credential Stores
Accessing Hashicorp Vault Secrets with Vault Functions (Deprecated)
Working with Data Governance Tools
You can configure Data Collector to integrate with data governance tools, giving you visibility into data movement: where the data came from, where it's going, and who is interacting with it.
Enabling External JMX Tools
Data Collector uses JMX metrics to generate the graphical display of the status of a running pipeline. You can provide the same JMX metrics to external tools if desired.
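One common way to expose those JMX metrics to external tools (such as JConsole or VisualVM) is through standard JVM remote-JMX system properties. The sketch below assumes they are passed via the SDC_JAVA_OPTS environment variable from the Data Collector environment configuration; the port number is an example, and disabling authentication and SSL is suitable for testing only.

```shell
# Standard JVM flags that open a remote JMX port for external monitoring tools.
# SDC_JAVA_OPTS is appended to, preserving any options already set.
# WARNING: authenticate=false and ssl=false are for local testing only.
export SDC_JAVA_OPTS="${SDC_JAVA_OPTS:-} \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=3333 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
```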
Pre Upgrade Tasks
In some situations, you must complete tasks before you upgrade.
Upgrade an Installation from the Tarball
Upgrade an Installation from the RPM Package
Upgrade an Installation with Cloudera Manager
Post Upgrade Tasks
In some situations, you must complete tasks within Data Collector or your Control Hub on-premises installation after you upgrade.
Working with Upgraded External Systems
When an external system is upgraded to a new version, you can continue to use existing Data Collector pipelines that connected to the previous version of the external system. You simply configure the pipelines to work with the upgraded system.
Troubleshooting an Upgrade
What is a Pipeline?
Data in Motion
Data passes through the pipeline in batches.
Designing the Data Flow
You can branch and merge streams in the pipeline.
Sample Pipelines
Data Collector provides sample pipelines that you can use to learn about Data Collector features or as a basis for building your own pipelines.
Dropping Unwanted Records
You can drop records from the pipeline at each stage by defining required fields or preconditions for a record to enter a stage.
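For example, a precondition is an expression that must evaluate to true for a record to enter the stage; records that fail it are handled as error records. The expression below, written in the StreamSets expression language, is an illustration only (the field path /COUNTRY and the value 'US' are hypothetical):

```
${record:value('/COUNTRY') == 'US'}
```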
Error Record Handling
Record Header Attributes
Field Attributes
Field attributes provide additional information about each field that you can use in pipeline logic, as needed.
Processing Changed Data
Control Character Removal
Development Stages
Shortcut Keys for Pipeline Design
Technology Preview Functionality
Data Collector includes certain new features and stages with the Technology Preview designation. Technology Preview functionality is available for use in development and testing, but is not meant for use in production.
Test Origin for Preview
A test origin can provide test data for data preview to aid in pipeline development. In Control Hub, you can also use test origins when developing pipeline fragments. Test origins are not used when running a pipeline.
Understanding Pipeline States
Data Collector UI - Edit Mode
Pipeline Types and Icons in Documentation
In Data Collector, you can configure pipelines that are run by Data Collector and pipelines that are run by Data Collector Edge.
Retrying the Pipeline
Rate Limit
Advanced Options
Pipelines and most pipeline stages include advanced options with default values that should work in most cases. By default, each pipeline and stage hides the advanced options. Advanced options can include individual properties or complete tabs.
Simple and Bulk Edit Mode
Runtime Values
Runtime values are values that you define outside of the pipeline and use for stage and pipeline properties. You can change the values for each pipeline run without having to edit the pipeline.
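A minimal sketch of one way to use a runtime value: define a runtime property outside the pipeline, then reference it in a stage property with the runtime:conf function. The property name and directory value below are examples only; check the runtime values documentation for where runtime properties can be defined in your installation.

```
# Runtime property defined outside the pipeline
# (example name and value):
runtime.conf_HDFSDirTemplate=/HDFS/DirectoryTemplate

# Referenced in a stage or pipeline property:
${runtime:conf('HDFSDirTemplate')}
```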
Event Generation
Pipeline Statistics
SSL/TLS Encryption
Many stages can use SSL/TLS encryption to securely connect to the external system.
Security in Amazon Stages
Security in Google Cloud Stages
Security in Kafka Stages
Kafka Message Keys
Implicit and Explicit Validation
Expression Configuration
Configuring a Pipeline
Data Formats Overview
Avro Data Format
Binary Data Format
Datagram Data Format
Delimited Data Format
Excel Data Format
Log Data Format
When you use an origin to read log data, you define the format of the log files to be read.
NetFlow Data Processing
Protobuf Data Format Prerequisites
SDC Record Data Format
Text Data Format with Custom Delimiters
Whole File Data Format
You can use the whole file data format to transfer entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file.
Reading and Processing XML Data
Writing XML Data
Solutions Overview
Converting Data to the Parquet Data Format
This solution describes how to convert Avro files to the columnar format, Parquet.
Automating Impala Metadata Updates for Drift Synchronization for Hive
This solution describes how to configure a Drift Synchronization Solution for Hive pipeline to automatically refresh the Impala metadata cache each time changes occur in the Hive metastore.
Managing Output Files
This solution describes how to design a pipeline that writes output files to a destination, moves the files to a different location, and then changes the permissions for the files.
Stopping a Pipeline After Processing All Available Data
This solution describes how to design a pipeline that stops automatically after it finishes processing all available data.
Offloading Data from Relational Sources to Hadoop
This solution describes how to offload data from relational database tables to Hadoop.
Sending Email During Pipeline Processing
This solution describes how to design a pipeline to send email notifications at different moments during pipeline processing.
Preserving an Audit Trail of Events
This solution describes how to design a pipeline that preserves an audit trail of pipeline and stage events that occur.
Loading Data into Databricks Delta Lake
You can use several solutions to load data into a Delta Lake table on Databricks.
Drift Synchronization Solution for Hive
The Drift Synchronization Solution for Hive detects drift in incoming data and updates corresponding Hive tables.
Drift Synchronization Solution for PostgreSQL
The Drift Synchronization Solution for PostgreSQL detects drift in incoming data and automatically creates or alters corresponding PostgreSQL tables as needed before the data is written.
Meet StreamSets Control Hub
StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Control Hub allows teams to build and execute large numbers of complex dataflows at scale.
Working with Control Hub
Request a Control Hub Organization and User Account
Register Data Collector with Control Hub
You must register a Data Collector to work with StreamSets Control Hub. When you register a Data Collector, Data Collector generates an authentication token that it uses to issue authenticated requests to Control Hub.
Pipeline Statistics
Pipeline Management with Control Hub
After you register a Data Collector with StreamSets Control Hub, you can manage how the pipelines work with Control Hub.
Control Hub Configuration File
Unregister Data Collector from Control Hub
You can unregister a Data Collector from StreamSets Control Hub when you no longer want to use that Data Collector installation with Control Hub.
Meet StreamSets Data Collector Edge
StreamSets Data Collector Edge™ (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices with limited resources. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.
Supported Platforms
Install SDC Edge
Download and install SDC Edge on each edge device where you want to run edge pipelines.
Getting Started with SDC Edge
Data Collector Edge (SDC Edge) includes several sample pipelines that make it easy to get started. You simply import one of the sample edge pipelines, create the appropriate Data Collector receiving pipeline, download and install SDC Edge on the edge device, and then run the sample edge pipeline.
Design Edge Pipelines
Edge pipelines run in edge execution mode. You design edge pipelines in Data Collector.
Design Data Collector Receiving Pipelines
Administer SDC Edge
Administering SDC Edge involves configuring, starting, shutting down, and viewing logs for the agent. When using StreamSets Control Hub, you can also use the SDC Edge command line interface to register SDC Edge with Control Hub.
Deploy Pipelines to SDC Edge
After designing edge pipelines in Data Collector, you deploy the edge pipelines to SDC Edge installed on an edge device. You run the edge pipelines on SDC Edge.
Downloading Pipelines from SDC Edge
Manage Pipelines on SDC Edge
After designing edge pipelines in Data Collector and then deploying the edge pipelines to SDC Edge, you can manage the pipelines on SDC Edge. Managing edge pipelines includes previewing, validating, starting, stopping, and monitoring the pipelines as well as resetting the origin for the pipelines.
Microservice Pipelines
A microservice pipeline is a pipeline that creates a fine-grained service to perform a specific task.
Stages for Microservice Pipelines
Sample Pipeline
Creating a Microservice Pipeline
SDC RPC Pipeline Overview
Data Collector Remote Protocol Call pipelines, called SDC RPC pipelines, are a set of StreamSets pipelines that pass data from one pipeline to another without writing to an intermediary system.
Deployment Architecture
When using SDC RPC pipelines, consider your needs and environment carefully as you design the deployment architecture.
Configuring the Delivery Guarantee
The delivery guarantee determines when a pipeline commits the offset. When configuring the delivery guarantee for SDC RPC pipelines, use the same option in origin and destination pipelines.
Defining the RPC ID
The RPC ID is a user-defined identifier that allows an SDC RPC origin and SDC RPC destination to recognize each other.
Enabling Encryption
You can enable SDC RPC pipelines to transfer data securely using SSL/TLS. To use SSL/TLS, enable TLS in both the SDC RPC destination and the SDC RPC origin.
Configuration Guidelines for SDC RPC Pipelines
Cluster Pipeline Overview
A cluster pipeline is a pipeline that runs in cluster execution mode. You can run a pipeline in standalone execution mode or cluster execution mode.
Kafka Cluster Requirements
MapR Requirements
HDFS Requirements
Amazon S3 Requirements
Cluster EMR batch and cluster batch mode pipelines can process data from Amazon S3.
Cluster Pipeline Limitations
Data Preview Overview
Data Collector UI - Preview Mode
You can use Data Collector to view how data passes through the pipeline.
Preview Codes
Data preview displays different colors for different types of data. Preview also uses other codes and formatting to highlight changed fields.
Previewing a Single Stage
Previewing Multiple Stages
You can preview data for a group of linked stages within a pipeline.
Editing Preview Data
You can edit preview data to view how a stage or group of stages processes the changed data. Edit preview data to test for data conditions that might not appear in the preview data set.
Editing Properties
In data preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the expression in an Expression Evaluator to see how the expression alters data.
Understanding Pipeline States
Starting Pipelines
Stopping Pipelines
Importing Pipelines
Sharing Pipelines
Adding Labels to Pipelines
Exporting Pipelines
Exporting Pipelines for Control Hub
Duplicating a Pipeline
Duplicate a pipeline when you want to keep the existing version of a pipeline while continuing to configure a duplicate version. A duplicate is an exact copy of the original pipeline.
Deleting Pipelines
StreamSets Accounts
Providing an Activation Code
Viewing Data Collector Configuration Properties
Viewing Data Collector Directories
You can view the directories that Data Collector uses. You might check these directories to access a file in one of them or to increase the amount of available space for a directory.
Viewing Data Collector Metrics
You can view metrics about Data Collector, such as the CPU usage or the number of pipeline runners in the thread pool.
Viewing Data Collector Logs
Shutting Down Data Collector
Restarting Data Collector
Viewing Users and Groups
If you use file-based authentication, you can view all user accounts granted access to this Data Collector instance, including the roles and groups assigned to each user.
Managing Usage Statistics Collection
Support Bundles
Health Inspector
REST Response
You can view REST response JSON data for different aspects of the Data Collector, such as pipeline configuration information or monitoring details.
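As an illustration, REST responses can also be retrieved from the command line with a tool like curl. The URL, credentials, and endpoint path below are examples assuming a local installation with the default port and file-based authentication; confirm the exact endpoints in the REST response view of your Data Collector instance.

```
# List pipelines as JSON from a locally running Data Collector
# (example URL, credentials, and endpoint path):
curl -u admin:admin http://localhost:18630/rest/v1/pipelines
```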
Command Line Interface
Data Collector provides a command line interface that includes a basic cli command. Use the command to perform some of the same actions that you can complete from the Data Collector UI. Data Collector must be running before you can use the cli command.
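A brief sketch of invoking the cli command against a running Data Collector; the URL and credentials are examples for a default local installation, and available subcommands may vary by version.

```
# Verify that Data Collector is reachable, then list stored pipelines
# (example URL and credentials):
bin/streamsets cli -U http://localhost:18630 -u admin -p admin ping
bin/streamsets cli -U http://localhost:18630 -u admin -p admin store list
```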
Tutorial Overview
Before You Begin
Basic Tutorial
The basic tutorial creates a pipeline that reads a file from a directory, processes the data in two branches, and writes all data to a file system. You'll use data preview to help configure the pipeline, and you'll create a data alert and run the pipeline.
Extended Tutorial
The extended tutorial builds on the basic tutorial, using an additional set of stages to perform some data transformations and write to the Trash development destination. You'll also use data preview to test stage configuration.
Licensed under the Apache License, Version 2.0.