What is StreamSets Data Collector?
StreamSets Data Collector™ is a lightweight, powerful design and execution engine that streams data in real time. Use Data Collector to route and process data in your data streams.
What is StreamSets Data Collector Edge?
StreamSets Data Collector Edge™ (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.
What is StreamSets Control Hub?
StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Use Control Hub to allow your teams to build and execute large numbers of complex dataflows at scale.
Logging In and Creating a Pipeline in Data Collector
After you start Data Collector, you can log in to Data Collector and create your first pipeline.
Data Collector User Interface
Data Collector provides a web-based user interface (UI) to configure pipelines, preview data, monitor pipelines, and review snapshots of data.
Data Collector UI - Pipelines on the Home Page
Data Collector displays a list of all available pipelines and related information on the Home page. You can select a category of pipelines, such as Running Pipelines, to view a subset of all available pipelines.
Installation
You can install Data Collector and start it manually or run it as a service.
Full Installation and Launch (Manual Start)
Full Installation and Launch (Service Start)
Core Installation
You can download and install a core version of Data Collector, and then install individual stage libraries as needed. Use the core installation to install only the stage libraries that you want to use. The core installation allows Data Collector to use less disk space.
Install Additional Stage Libraries
Installation with Cloudera Manager
Run Data Collector from Docker
Installation with Cloud Service Providers
You can install the full Data Collector using cloud service providers such as Microsoft Azure or Microsoft Azure HDInsight. When you install Data Collector using a cloud service provider, you install Data Collector as a service.
MapR Prerequisites
Due to licensing restrictions, StreamSets cannot distribute MapR libraries with Data Collector. As a result, you must perform additional steps to enable the Data Collector machine to connect to MapR. Until you complete these prerequisites, Data Collector does not display MapR origins and destinations in stage library lists, nor the MapR Streams statistics aggregator in the pipeline properties.
Creating Another Data Collector Instance
Uninstallation
User Authentication
Data Collector can authenticate user accounts based on LDAP or files. If your organization uses LDAP, LDAP authentication is the best practice. By default, Data Collector uses file-based authentication.
Roles and Permissions
Enabling HTTPS
Data Collector Configuration
You can edit the Data Collector configuration file, $SDC_CONF/sdc.properties, to configure properties such as the host name, the port number, and the account information used for email alerts.
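For example, a quick way to check a few of these properties from a script is to parse the file directly. This is only a sketch: the property names shown (http.port, sdc.base.http.url) and the fallback directory are assumptions to verify against your own sdc.properties.

```python
# Sketch: inspect selected properties in $SDC_CONF/sdc.properties.
# Property names and the fallback directory are assumptions; this also assumes
# simple key=value lines (no backslash line continuations).
import configparser
import os

conf_path = os.path.join(os.environ.get("SDC_CONF", "/etc/sdc"), "sdc.properties")

# Java .properties files have no section headers, so prepend a dummy one.
parser = configparser.ConfigParser(interpolation=None)
with open(conf_path) as f:
    parser.read_string("[sdc]\n" + f.read())

for key in ("http.port", "sdc.base.http.url"):
    print(key, "=", parser["sdc"].get(key, "<not set>"))
```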
Data Collector Environment Configuration
Install External Libraries
Custom Stage Libraries
Credential Stores
Data Collector pipeline stages communicate with external systems to read and write data. Many of these external systems require credentials, such as user names and passwords, to access the data. When you configure pipeline stages for these external systems, you define the credentials that the stage uses to connect to the system.
Accessing Vault Secrets with Vault Functions (Deprecated)
Working with Data Governance Tools
You can configure Data Collector to integrate with data governance tools, giving you visibility into data movement: where the data came from, where it is going, and who is interacting with it.
Enabling External JMX Tools
Data Collector uses JMX metrics to generate the graphical display of the status of a running pipeline. You can provide the same JMX metrics to external tools if desired.
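As one hedged example, an external script can poll the JMX metrics that Data Collector exposes over its REST API. The /rest/v1/system/jmx path, port 18630, and admin/admin credentials below are assumptions to adjust for your installation and version.

```python
# Sketch: poll Data Collector's JMX metrics over HTTP (endpoint, response shape,
# and credentials are assumptions; verify them for your version).
import requests

resp = requests.get(
    "http://localhost:18630/rest/v1/system/jmx",
    auth=("admin", "admin"),            # default file-based credentials (assumption)
    headers={"X-Requested-By": "sdc"},
    timeout=10,
)
resp.raise_for_status()

for bean in resp.json().get("beans", []):
    name = bean.get("name", "")
    if "pipeline" in name.lower():
        print(name)
```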
Upgrade
Pre Upgrade Tasks
In some situations, you must complete tasks before you upgrade.
Upgrade an Installation from the Tarball
Upgrade an Installation from the RPM Package
When you upgrade an installation from the RPM package, the new version uses the default configuration, data, log, and resource directories. If the previous version used the default directories, the new version has access to the files created in the previous version.
Upgrade an Installation with Cloudera Manager
Post Upgrade Tasks
In some situations, you must complete tasks within Data Collector or your Control Hub on-premises installation after you upgrade.
Working with Upgraded External Systems
When an external system is upgraded to a new version, you can continue to use existing Data Collector pipelines that connected to the previous version of the external system. You simply configure the pipelines to work with the upgraded system.
Troubleshooting an Upgrade
What is a Pipeline?
Data in Motion
Data passes through the pipeline in batches: the origin reads data into a batch, each stage processes the batch in turn, and Data Collector commits the offset based on the configured delivery guarantee.
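As a rough mental model, the sketch below walks batches through a pipeline: the origin produces records up to a batch size, a processor transforms each batch, the destination writes it, and only then does the offset advance. The batch size, stages, and data are illustrative; this is conceptual Python, not Data Collector code.

```python
# Conceptual sketch of batch flow through a pipeline (not Data Collector code).
BATCH_SIZE = 3                               # illustrative; batch size is configurable

source = ["a", "b", "c", "d", "e"]           # stand-in for an origin system
written = []                                 # stand-in for a destination system
offset = 0                                   # last committed read position

def uppercase_processor(batch):
    return [record.upper() for record in batch]

while offset < len(source):
    batch = source[offset:offset + BATCH_SIZE]   # origin creates a batch
    batch = uppercase_processor(batch)           # each processor transforms the batch
    written.extend(batch)                        # destination writes the batch
    offset += len(batch)                         # offset advances after the write

print(written)  # ['A', 'B', 'C', 'D', 'E']
```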
Designing the Data Flow
You can branch and merge streams in the pipeline.
Dropping Unwanted Records
You can drop records from the pipeline at each stage by defining required fields or preconditions for a record to enter a stage.
Error Record Handling
Record Header Attributes
Field Attributes
Field attributes provide additional information about each field that you can use in pipeline logic, as needed.
Processing Changed Data
Control Character Removal
Development Stages
Test Origin for Preview
A test origin can provide test data for data preview to aid in pipeline development. In Control Hub, you can also use test origins when developing pipeline fragments. Test origins are not used when running a pipeline.
Data Collector UI - Edit Mode
Retrying the Pipeline
Pipeline Memory
Rate Limit
Simple and Bulk Edit Mode
Runtime Values
Runtime values are values that you define outside of the pipeline and use for stage and pipeline properties. You can change the values for each pipeline run without having to edit the pipeline.
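For instance, when a pipeline uses runtime parameters, the values can be supplied at start time. The sketch below starts a pipeline through the Data Collector REST API and passes parameter values in the request body; the pipeline ID, parameter names, port, and credentials are placeholders and assumptions to adapt.

```python
# Sketch: start a pipeline and supply runtime parameter values via the REST API.
# Pipeline ID, parameter names, port, and credentials are placeholders.
import requests

SDC_URL = "http://localhost:18630"
PIPELINE_ID = "MyPipelineabcd1234"         # hypothetical pipeline ID

resp = requests.post(
    f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/start",
    json={"InputDir": "/data/incoming", "ErrorDir": "/data/errors"},  # runtime parameters
    auth=("admin", "admin"),                # default file-based credentials (assumption)
    headers={"X-Requested-By": "sdc"},      # header required for POST calls
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("status"))
```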
Event Generation
Webhooks
Notifications
SSL/TLS Configuration
Implicit and Explicit Validation
Expression Configuration
Configuring a Pipeline
Data Formats Overview
Delimited Data Root Field Type
Excel Data Format
Log Data Format
When you use an origin to read log data, you define the format of the log files to be read.
NetFlow Data Processing
Protobuf Data Format Prerequisites
SDC Record Data Format
Text Data Format with Custom Delimiters
Whole File Data Format
You can use the whole file data format to transfer entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file.
Reading and Processing XML Data
Writing XML Data
Origins
An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline.
Amazon S3
Amazon SQS Consumer
Azure IoT/Event Hub Consumer
CoAP Server
Constrained Application Protocol (CoAP) is a web transfer protocol designed for constrained, machine-to-machine devices. The CoAP Server origin is a multithreaded origin that listens on a CoAP endpoint and processes the contents of all authorized CoAP requests.
Directory
The Directory origin reads data from files in a directory. The origin can use multiple threads to enable the parallel processing of files.
Elasticsearch
The Elasticsearch origin is a multithreaded origin that reads data from an Elasticsearch cluster, including Elastic Cloud clusters (formerly Found clusters). The origin generates a record for each Elasticsearch document.
File Tail
The File Tail origin reads lines of data as they are written to an active file after reading related archived files in the same directory. File Tail generates a record for each line of data.
Google BigQuery
The Google BigQuery origin executes a query job and reads the result from Google BigQuery.
Google Cloud Storage
Google Pub/Sub Subscriber
The Google Pub/Sub Subscriber origin consumes messages from a Google Pub/Sub subscription.
Hadoop FS
The Hadoop FS origin reads data from the Hadoop Distributed File System (HDFS), Amazon S3, or other file systems using the Hadoop FileSystem interface.
Hadoop FS Standalone
The Hadoop FS Standalone origin reads files in HDFS. The origin can use multiple threads to enable the parallel processing of files. The files to be processed must all share a file name pattern and be fully written. You can also configure the origin to read from Azure HDInsight.
HTTP Client
HTTP Server
The HTTP Server origin is a multithreaded origin that listens on an HTTP endpoint and processes the contents of all authorized HTTP POST and PUT requests. Use the HTTP Server origin to read high volumes of HTTP POST and PUT requests using multiple threads.
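As a quick way to exercise such a pipeline, a client can POST records to the origin's listening port. In the sketch below, the host, port (8000), and the X-SDC-APPLICATION-ID header value are assumptions that must match the origin's configuration.

```python
# Sketch: send a test record to an HTTP Server origin (host, port, and
# application ID are assumptions; they must match the origin's configuration).
import json
import requests

resp = requests.post(
    "http://sdc-host:8000",
    data=json.dumps({"sensor": "temp-01", "value": 21.5}),
    headers={
        "X-SDC-APPLICATION-ID": "myAppId",   # must match the origin's application ID
        "Content-Type": "application/json",
    },
    timeout=10,
)
print(resp.status_code)
```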
HTTP to Kafka (Deprecated)
JDBC Multitable Consumer
JDBC Query Consumer
JMS Consumer
The JMS Consumer origin reads data from a Java Message Service (JMS) provider.
Kafka Consumer
Kafka Multitopic Consumer
Kinesis Consumer
The Kinesis Consumer origin reads data from Amazon Kinesis Streams.
MapR DB CDC
MapR DB JSON
The MapR DB JSON origin reads JSON documents from MapR DB JSON tables. The origin converts each document into a record.
MapR FS
The MapR FS origin reads files from MapR FS. Use this origin only in pipelines configured for cluster batch pipeline execution mode.
MapR FS Standalone
The MapR FS Standalone origin reads files in MapR. The origin can use multiple threads to enable the parallel processing of files. The files to be processed must all share a file name pattern and be fully written.
MapR Multitopic Streams Consumer
MapR Streams Consumer
The MapR Streams Consumer origin reads messages from MapR Streams.
MongoDB
MongoDB Oplog
MQTT Subscriber
The MQTT Subscriber origin subscribes to topics on an MQTT broker to read messages from the broker. The origin functions as an MQTT client that receives messages, generating a record for each message.
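To feed such a pipeline during testing, any MQTT client can publish to the subscribed topic. The sketch below uses the paho-mqtt Python package (not part of Data Collector); the broker host, port, and topic are placeholders.

```python
# Sketch: publish a test message that an MQTT Subscriber origin could consume.
# Broker host, port, and topic are placeholders; requires the paho-mqtt package.
import json
import paho.mqtt.publish as publish

publish.single(
    topic="factory/line1/temperature",
    payload=json.dumps({"sensor": "temp-01", "value": 21.5}),
    hostname="mqtt-broker.example.com",
    port=1883,
)
```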
MySQL Binary Log
Omniture
The Omniture origin processes JSON website usage reports generated by the Omniture reporting APIs. Omniture is also known as the Adobe Marketing Cloud.
OPC UA Client
Oracle CDC Client
PostgreSQL CDC Client
The PostgreSQL CDC Client origin processes Write-Ahead Logging (WAL) data to generate change data capture records for a PostgreSQL database. Use the PostgreSQL CDC Client origin to process WAL data from PostgreSQL 9.4 or later; earlier versions do not support the logical decoding needed to read WAL data.
RabbitMQ Consumer
RabbitMQ Consumer reads AMQP messages from a single RabbitMQ queue.
Redis Consumer
The Redis Consumer origin reads messages from Redis.
REST Service
Salesforce
SDC RPC
The SDC RPC origin enables connectivity between two SDC RPC pipelines. The SDC RPC origin reads data passed from an SDC RPC destination. Use the SDC RPC origin as part of an SDC RPC destination pipeline.
SDC RPC to Kafka (Deprecated)
SFTP/FTP Client
The SFTP/FTP Client origin reads files from a server using the Secure File Transfer Protocol (SFTP) or the File Transfer Protocol (FTP).
SQL Server CDC Client
SQL Server Change Tracking
System Metrics
The System Metrics origin reads system metrics from the edge device where StreamSets Data Collector Edge (SDC Edge) is installed. Use the System Metrics origin only in pipelines configured for edge execution mode.
TCP Server
UDP Multithreaded Source
UDP Source
UDP to Kafka (Deprecated)
WebSocket Client
The WebSocket Client origin reads data from a WebSocket server endpoint. Use the origin to read data from a WebSocket resource URL.
WebSocket Server
The WebSocket Server origin is a multithreaded origin that listens on a WebSocket endpoint and processes the contents of all authorized WebSocket client requests. Use the WebSocket Server origin to read high volumes of WebSocket client requests using multiple threads.
Windows Event Log
The Windows Event Log origin reads data from a Microsoft Windows event log located on a Windows machine. The origin generates a record for each event in the log.
Processors
Aggregator
Base64 Field Decoder
The Base64 Field Decoder decodes Base64-encoded data to binary data. Use the processor to decode Base64-encoded data before evaluating data in the field.
Base64 Field Encoder
The Base64 Field Encoder encodes binary data using Base64. Use the processor to encode binary data that must be sent over channels that expect ASCII data.
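The underlying transformation is ordinary Base64. The short sketch below illustrates, outside of Data Collector, the round trip that these two processors perform on a field value.

```python
# Illustration of the Base64 round trip performed by the encoder and decoder stages.
import base64

raw = b"\x00\x01binary payload\xff"        # binary field value
encoded = base64.b64encode(raw)            # what the Base64 Field Encoder produces
decoded = base64.b64decode(encoded)        # what the Base64 Field Decoder recovers

print(encoded.decode("ascii"))             # safe to send over ASCII-only channels
assert decoded == raw
```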
Data Parser
The Data Parser processor allows you to parse supported data formats embedded in a field. You can parse NetFlow embedded in a byte array field or syslog messages embedded in a string field.
Delay
Expression Evaluator
Field Flattener
Field Hasher
The Field Hasher uses a hash algorithm to encode data. Use the Field Hasher to encode highly sensitive data, such as Social Security or credit card numbers.
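Conceptually, the processor replaces a field value with a one-way hash. The sketch below shows the idea with SHA-256; the algorithm and field value are illustrative, and this is not the processor's implementation.

```python
# Illustration of one-way hashing of a sensitive field value (SHA-256 chosen
# for the example; the processor offers several algorithms).
import hashlib

credit_card = "4111 1111 1111 1111"
hashed = hashlib.sha256(credit_card.encode("utf-8")).hexdigest()
print(hashed)   # the original number cannot be recovered from this value
```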
Field Masker
The Field Masker masks string values based on the selected mask type. You can use variable-length, fixed-length, custom, or regular expression masks. Custom masks can reveal part of the string value.
Field Merger
The Field Merger merges one or more fields in a record to a different location in the record. Use the Field Merger only for records with a list or map structure.
Field Order
The Field Order processor orders fields in a map or list-map field and outputs the fields into a list-map or list root field.
Field Pivoter
Field Remover
Field Renamer
Use the Field Renamer to rename fields in a record. You can specify individual fields to rename or use regular expressions to rename sets of fields.
Field Replacer
The Field Replacer replaces values in fields with nulls or with new values. Use the Field Replacer to update values or to replace invalid values.
Field Splitter
The Field Splitter splits string data based on a regular expression and passes the separated data to new fields. Use the Field Splitter to split complex string values into logical components.
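The split itself behaves like a regular-expression split. The sketch below separates a hypothetical error string into severity and message components; the field value and pattern are illustrative only.

```python
# Illustration of splitting a string field on a regular expression.
import re

error_field = "WARN-Connection timed out"
severity, message = re.split(r"-", error_field, maxsplit=1)
print(severity)   # WARN
print(message)    # Connection timed out
```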
Field Type Converter
The Field Type Converter converts the data types of fields to compatible data types. You might use the Field Type Converter to convert the data types of fields before performing calculations. You can also use the Field Type Converter to change the scale of decimal data.
Field Zip
Geo IP
Groovy Evaluator
HBase Lookup
The HBase Lookup processor performs key-value lookups in HBase and passes the lookup values to fields. Use the HBase Lookup to enrich records with additional data.
Hive Metadata
HTTP Client
JavaScript Evaluator
JDBC Lookup
The JDBC Lookup processor uses a JDBC connection to perform lookups in a database table and pass the lookup values to fields. Use the JDBC Lookup to enrich records with additional data.
JDBC Tee
JSON Generator
JSON Parser
Jython Evaluator
Kudu Lookup
The Kudu Lookup processor performs lookups in a Kudu table and passes the lookup values to fields. Use the Kudu Lookup to enrich records with additional data.
Log Parser
PostgreSQL Metadata
Record Deduplicator
The Record Deduplicator evaluates records for duplicate data and routes data to two streams: one for unique records and one for duplicate records. Use the Record Deduplicator to discard duplicate data or to route duplicate data through different processing logic.
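Conceptually, the processor tracks what it has already seen and routes each record to a unique or duplicate stream. A minimal sketch of that routing follows, using a hypothetical "id" field as the comparison key; it is illustrative, not the processor's implementation.

```python
# Conceptual sketch of routing records to unique and duplicate streams
# (the "id" field used as the comparison key is a hypothetical example).
records = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]

seen = set()
unique_stream, duplicate_stream = [], []

for record in records:
    key = record["id"]
    if key in seen:
        duplicate_stream.append(record)
    else:
        seen.add(key)
        unique_stream.append(record)

print(len(unique_stream), len(duplicate_stream))   # 2 1
```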
Redis Lookup
The Redis Lookup processor performs key-value lookups in Redis and passes the lookup values to fields. Use the Redis Lookup to enrich records with additional data.
Salesforce Lookup
The Salesforce Lookup processor performs lookups in a Salesforce object and passes the lookup values to fields. Use the Salesforce Lookup to enrich records with additional data.
Schema Generator
Spark Evaluator
The Spark Evaluator performs custom processing within a pipeline based on a Spark application that you develop.
SQL Parser
Static Lookup
The Static Lookup processor performs lookups of key-value pairs that are stored in local memory and passes the lookup values to fields. Use the Static Lookup to store String values in memory that the pipeline can look up at runtime to enrich records with additional data.
Stream Selector
The Stream Selector passes data to streams based on conditions. Define a condition for each stream of data that you want to create. The Stream Selector uses a default stream to pass records that do not match user-defined conditions.
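The routing logic amounts to evaluating each user-defined condition in order and falling back to the default stream. A minimal sketch follows, with a hypothetical "country" condition; it is illustrative, not the processor's implementation.

```python
# Conceptual sketch of condition-based routing with a default stream
# (the "country" condition is a hypothetical example).
records = [{"country": "US"}, {"country": "DE"}, {"country": "FR"}]

us_stream, default_stream = [], []
for record in records:
    if record.get("country") == "US":      # user-defined condition for one stream
        us_stream.append(record)
    else:
        default_stream.append(record)      # everything else goes to the default stream

print(len(us_stream), len(default_stream))   # 1 2
```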
Value Replacer (Deprecated)
Whole File Transformer
The Whole File Transformer processor transforms fully written Avro files to highly efficient, columnar Parquet files. Use the Whole File Transformer in a pipeline that reads Avro files as whole files and writes the transformed Parquet files as whole files.
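For reference, a similar Avro-to-Parquet conversion can be sketched outside the pipeline with the fastavro and pyarrow packages; neither is part of Data Collector, and the file paths are placeholders.

```python
# Sketch of converting an Avro file to Parquet outside the pipeline, using the
# fastavro and pyarrow packages (not part of Data Collector); paths are placeholders.
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

with open("/tmp/input.avro", "rb") as fo:
    records = list(fastavro.reader(fo))          # read Avro records into dicts

table = pa.Table.from_pylist(records)            # build a columnar Arrow table
pq.write_table(table, "/tmp/output.parquet")     # write the Parquet file
```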
XML Flattener
XML Parser
Destinations
Aerospike
The Aerospike destination writes data to Aerospike.
Amazon S3
The Amazon S3 destination writes data to Amazon S3. To write data to an Amazon Kinesis Firehose delivery stream, use the Kinesis Firehose destination. To write data to Amazon Kinesis Streams, use the Kinesis Producer destination.
Azure Data Lake Store
Azure Event Hub Producer
Azure IoT Hub Producer
Cassandra
The Cassandra destination writes data to a Cassandra cluster.
CoAP Client
Constrained Application Protocol (CoAP) is a web transfer protocol designed for constrained, machine-to-machine devices. The CoAP Client destination writes data to a CoAP endpoint. Use the destination to send requests to a CoAP resource URL.
Couchbase
The Couchbase destination writes data to Couchbase Server. Couchbase Server is a distributed NoSQL document-oriented database.
Elasticsearch
The Elasticsearch destination writes data to an Elasticsearch cluster, including Elastic Cloud clusters (formerly Found clusters). The destination uses the Elasticsearch HTTP API to write each record to Elasticsearch as a document.
Einstein Analytics
The Einstein Analytics destination writes data to Salesforce Einstein Analytics. The destination connects to Einstein Analytics to upload external data to a dataset.
Flume
The Flume destination writes data to a Flume source. When you write data to Flume, you pass data to a Flume client. The Flume client passes data to hosts based on client configuration properties.
Google BigQuery
Google Bigtable
Google Cloud Storage
Google Pub/Sub Publisher
Hadoop FS
HBase
The HBase destination writes data to an HBase cluster. The destination can write data to HBase as text, binary data, or JSON strings. You can define the data format for each column written to HBase.
Hive Metastore
The Hive Metastore destination works with the Hive Metadata processor and the Hadoop FS or MapR FS destination as part of the Drift Synchronization Solution for Hive.
Hive Streaming
The Hive Streaming destination writes data to Hive tables stored in the ORC (Optimized Row Columnar) file format.
HTTP Client
The HTTP Client destination writes data to an HTTP endpoint. The destination sends requests to an HTTP resource URL. Use the HTTP Client destination to perform a range of standard requests or use an expression to determine the request for each record.
InfluxDB
The InfluxDB destination writes data to an InfluxDB database.
JDBC Producer
The JDBC Producer destination uses a JDBC connection to write data to a database table. You can also use the JDBC Producer to write change capture data from a Microsoft SQL Server change log.
JMS Producer
Kafka Producer
The Kafka Producer destination writes data to a Kafka cluster.
Kinesis Firehose
The Kinesis Firehose destination writes data to an Amazon Kinesis Firehose delivery stream. Firehose automatically delivers the data to the Amazon S3 bucket or Amazon Redshift table that you specify in the delivery stream.
Kinesis Producer
The Kinesis Producer destination writes data to Amazon Kinesis Streams. To write data to an Amazon Kinesis Firehose delivery stream, use the Kinesis Firehose destination. To write data to Amazon S3, use the Amazon S3 destination.
KineticaDB
Kudu
The Kudu destination writes data to a Kudu cluster.
Local FS
MapR DB
The MapR DB destination writes data to MapR DB binary tables. The destination can write data to MapR DB as text, binary data, or JSON strings. You can define the data format for each column written to MapR DB.
MapR DB JSON
MapR FS
The MapR FS destination writes files to MapR FS. You can write the data to MapR as flat files or Hadoop sequence files.
MapR Streams Producer
The MapR Streams Producer destination writes messages to MapR Streams.
MongoDB
MQTT Publisher
The MQTT Publisher destination publishes messages to a topic on an MQTT broker. The destination functions as an MQTT client that publishes messages, writing each record as a message.
Named Pipe
The Named Pipe destination writes data to a UNIX named pipe.
RabbitMQ Producer
RabbitMQ Producer writes AMQP messages to a single RabbitMQ queue.
Redis
The Redis destination writes data to Redis.
Salesforce
The Salesforce destination writes data to Salesforce objects.
SDC RPC
The SDC RPC destination enables connectivity between two SDC RPC pipelines. The SDC RPC destination passes data to one or more SDC RPC origins. Use the SDC RPC destination as part of an SDC RPC origin pipeline.
Send Response to Origin
Solr
The Solr destination writes data to a Solr node or cluster.
Splunk
The Splunk destination writes data to Splunk using the Splunk HTTP Event Collector (HEC).
To Error
Trash
WebSocket Client
The WebSocket Client destination writes data to a WebSocket endpoint. Use the destination to send data to a WebSocket resource URL.
Meet StreamSets Data Collector Edge
StreamSets Data Collector Edge™ (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices with limited resources. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.
Supported Platforms
Install SDC Edge
Download and install SDC Edge on each edge device where you want to run edge pipelines.
Getting Started with SDC Edge
Data Collector Edge (SDC Edge) includes several sample pipelines that make it easy to get started. You simply create the appropriate Data Collector receiving pipeline, download and install SDC Edge on the edge device, and then run the sample edge pipeline.
Design Edge Pipelines
Edge pipelines run in edge execution mode. You design edge pipelines in Data Collector.
Design Data Collector Receiving Pipelines
Administer SDC Edge
Administering SDC Edge involves configuring, starting, shutting down, and viewing logs for the agent.
Deploy Pipelines to SDC Edge
After designing edge pipelines in Data Collector, you deploy the edge pipelines to SDC Edge installed on an edge device. You run the edge pipelines on SDC Edge.
Downloading Pipelines from SDC Edge
Manage Pipelines on SDC Edge
After designing edge pipelines in Data Collector and then deploying the edge pipelines to SDC Edge, you can manage the pipelines on SDC Edge. Managing edge pipelines includes previewing, validating, starting, stopping, resetting the origin, and monitoring the pipelines.
Meet StreamSets Control Hub
StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Control Hub allows teams to build and execute large numbers of complex dataflows at scale.
Working with Control Hub
Request a Control Hub Organization and User Account
Register Data Collector with Control Hub
You must register a Data Collector to work with StreamSets Control Hub. When you register a Data Collector, Data Collector generates an authentication token that it uses to issue authenticated requests to Control Hub.
Pipeline Statistics
A Control Hub job defines the pipeline to run and the Data Collectors or Edge Data Collectors (SDC Edge) that run the pipeline. When you start a job, Control Hub remotely runs the pipeline on the group of Data Collectors or Edge Data Collectors. To monitor the job statistics and metrics within Control Hub, you must configure the pipeline to write statistics to Control Hub or to another system.
Pipeline Management with Control Hub
After you register a Data Collector with StreamSets Control Hub, you can manage how the pipelines work with Control Hub.
Control Hub Configuration File
Unregister Data Collector from Control Hub
You can unregister a Data Collector from StreamSets Control Hub when you no longer want to use that Data Collector installation with Control Hub.
Microservice Pipelines
A microservice pipeline is a pipeline that creates a fine-grained service to perform a specific task.
Stages for Microservice Pipelines
Sample Pipeline
When you create a microservice pipeline, a sample microservice pipeline displays in the configuration canvas. You can edit the pipeline to suit your needs. Or, you can create a standalone pipeline and use the microservice stages in a clean canvas.
Creating a Microservice Pipeline
SDC RPC Pipeline Overview
Data Collector Remote Protocol Call pipelines, also known as SDC RPC pipelines, are a set of StreamSets pipelines that pass data from one pipeline to another without writing to an intermediary system.
Deployment Architecture
When using SDC RPC pipelines, consider your needs and environment carefully as you design the deployment architecture.
Configuring the Delivery Guarantee
The delivery guarantee determines when a pipeline commits the offset. When configuring the delivery guarantee for SDC RPC pipelines, use the same option in both the origin and destination pipelines.
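The difference between the two options comes down to when the offset is committed relative to the write. The sketch below contrasts the commit ordering in conceptual Python; it is not Data Collector code, and the origin and destination objects are hypothetical stand-ins.

```python
# Conceptual contrast of the two delivery guarantee options (not Data Collector code).

def at_least_once(origin, destination):
    batch, offset = origin.read()
    destination.write(batch)       # write first ...
    origin.commit(offset)          # ... commit after; a crash in between may duplicate data

def at_most_once(origin, destination):
    batch, offset = origin.read()
    origin.commit(offset)          # commit first ...
    destination.write(batch)       # ... write after; a crash in between may lose data
```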
Defining the RPC ID
The RPC ID is a user-defined identifier that allows an SDC RPC origin and SDC RPC destination to recognize each other.
Enabling Encryption
You can enable SDC RPC pipelines to transfer data securely using SSL/TLS. To use SSL/TLS, enable TLS in both the SDC RPC destination and the SDC RPC origin.
Configuration Guidelines for SDC RPC Pipelines
Cluster Pipeline Overview
A cluster pipeline is a pipeline that runs in cluster execution mode. You can run a pipeline in standalone execution mode or cluster execution mode.
Kafka Cluster Requirements
MapR Requirements
HDFS Requirements
Amazon S3 Requirements
Cluster EMR batch and cluster batch mode pipelines can process data from Amazon S3.
Cluster Pipeline Limitations
Data Preview Overview
You can preview data to help build or fine-tune a pipeline. When using Control Hub, you can also use data preview when developing pipeline fragments.
Data Collector UI - Preview Mode
You can use the Data Collector preview mode to view how data passes through the pipeline.
Preview Codes
In Preview mode, Data Collector displays different colors for different types of data. Data Collector uses other codes and formatting to highlight changed fields.
Previewing a Single Stage
Previewing Multiple Stages
You can preview data for a group of linked stages within a pipeline.
Editing Preview Data
You can edit preview data to view how a stage or group of stages processes the changed data. Edit preview data to test for data conditions that might not appear in the preview data set.
Editing Properties
In data preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the expression in an Expression Evaluator to see how the expression alters data.
Understanding Pipeline States
Starting Pipelines
Stopping Pipelines
Stop pipelines when you want Data Collector to stop processing data for the pipelines.
Importing Pipelines
Sharing Pipelines
Adding Labels to Pipelines
Exporting Pipelines
Exporting Pipelines for Control Hub
Duplicating a Pipeline
Duplicate a pipeline when you want to keep the existing version of a pipeline while continuing to configure a duplicate version. A duplicate is an exact copy of the original pipeline.
Deleting Pipelines
Tutorial Overview
Before You Begin
Basic Tutorial
The basic tutorial creates a pipeline that reads a file from a directory, processes the data in two branches, and writes all data to a file system. You'll use data preview to help configure the pipeline, and you'll create a data alert and run the pipeline.
Extended Tutorial
The extended tutorial builds on the basic tutorial, using an additional set of stages to perform some data transformations and write to the Trash development destination. You'll also use data preview to test stage configuration.
Data Format Support
This appendix lists the data formats supported by origin and destination stages.