Data Collector User Guide
Getting Started
What is StreamSets Data Collector?
How should I use Data Collector?
How does this really work?
Logging In and Creating a Pipeline
Data Collector User Interface
Configuring the Display
Data Collector UI - Pipelines on the Home Page
What's New
What's New in 3.1.0.0
What's New in 3.0.3.0
What's New in 3.0.2.0
What's New in 3.0.1.0
What's New in 3.0.0.0
What's New in 2.7.2.0
What's New in 2.7.1.1
What's New in 2.7.1.0
What's New in 2.7.0.0
What's New in 2.6.0.1
What's New in 2.6.0.0
What's New in 2.5.1.0
What's New in 2.5.0.0
What's New in 2.4.1.0
What's New in 2.4.0.0
What's New in 2.3.0.1
What's New in 2.3.0.0
What's New in 2.2.1.0
What's New in 2.2.0.0
Installation
Installation
Installation Requirements
JCE for Oracle JVM
Configuring the Open File Limit
Full Installation and Launch (Manual Start)
Full Installation and Launch (Service Start)
Installing from the RPM Package
Installing from the Tarball for Systems Using SysV Init
Installing from the Tarball for Systems Using Systemd Init
Core Installation
Installing the Core RPM Package
Installing the Core Tarball
Install Additional Stage Libraries
Installing for RPM
Installing for Tarball Using the Package Manager
Installing for Tarball Using the Command Line
Available Stage Libraries
Legacy Stage Libraries
Run Data Collector from Docker
Installation with Cloudera Manager
Step 1. Install the StreamSets Custom Service Descriptor
Step 2. Manually Install the Parcel and Checksum Files (Optional)
Step 3. Distribute and Activate the StreamSets Parcel
Step 4. Configure the StreamSets Service
Configuring Data Collector with Cloudera Manager
MapR Prerequisites
Supported Versions
Step 1. Install Client Libraries
Step 2. Run the Command to Set Up MapR
Running the Command in Interactive Mode
Running the Command in Non-Interactive Mode
Step 3. Connect to a MapR Cluster Secured with Built-in Security
Step 4. Run Data Collector as a MapR Ticket User
Creating Another Data Collector Instance
Uninstallation
Uninstalling the Tarball (Manual Start)
Uninstalling the Tarball (Service Start)
Uninstalling the RPM Package
Uninstalling from Cloudera Manager
Configuration
User Authentication
Configuring LDAP Authentication
Step 1. Configure LDAP Connection Information
Example for OpenLDAP
Example for Active Directory
Step 2. Configure Secure Connections to LDAP (Optional)
Step 3. Map LDAP Groups to Data Collector Roles
Step 4. Configure Multiple LDAP Servers (Optional)
Step 5. Enable LDAP Authentication for MapR Stages
Configuring File-Based Authentication
Step 1. Configure Authentication Properties
Step 2. Configure Users, Groups, and Roles
Roles and Permissions
Roles
Pipeline Permissions
Roles and Permissions for Common Tasks
Transfer Pipeline Permissions
Transferring Permissions
Data Collector Configuration
Kerberos Authentication
Enabling Kerberos for RPM and Tarball
Enabling Kerberos with Cloudera Manager
Sending Email
Referencing Sensitive Values in Files
Referencing Environment Variables
Running Multiple Concurrent Pipelines
HTTP Protocols
Configuring HTTPS for Standalone Pipelines
Configuring HTTPS for Cluster Pipelines
Hadoop Impersonation Mode
Lowercasing User Names
Working with HDFS Encryption Zones
Blacklist and Whitelist for Stage Libraries
Configuring Data Collector
Data Collector Environment Configuration
Modifying Environment Variables
Data Collector Directories
User and Group for Service Start
Java Configuration Options
Java Heap Size
Remote Debugging
Garbage Collector
Java Security Manager
Root Classloader
Heap Dump Creation
Install External Libraries
Install Using the Package Manager
Step 1. Set Up an External Directory
Step 2. Install External Libraries
Install Manually
Installing Manually for RPM and Tarball
Installing Manually for Cloudera Manager
Custom Stage Libraries
Storing Custom Libraries for RPM and Tarball
Storing Custom Libraries for Cloudera Manager
Credential Stores
Group Access to Credentials
CyberArk Credential Store
Step 1. Install the Credential Store Stage Library
Step 2. Configure the Credential Store Properties
Step 3. Call the Credentials from the Pipeline
Java Keystore Credential Store
Step 1. Install the Credential Store Stage Library
Step 2. Configure the Credential Store Properties
Step 3. Add Credentials to the Credential Store
Step 4. Call the Credentials from the Pipeline
jks-cs Command
Vault Credential Store
Step 1. Install the Credential Store Stage Library
Step 2. Configure the Credential Store Properties
Step 3. Call the Credentials from the Pipeline
Accessing Vault Secrets with Vault Functions (Deprecated)
Step 1. Configure Vault Properties
Step 2. Call Vault from the Pipeline
Publishing Metadata to Cloudera Navigator
Prerequisites
Viewing Published Metadata
Supported Stages
Configuring Data Collector to Publish Metadata
Enabling External JMX Tools
Viewing JMX Metrics in External Tools
Custom Metrics
Upgrade
Upgrade
Pre-Upgrade Tasks
Verify Installation Requirements
Migrate to Java 8
Upgrade Cluster Streaming Pipelines
Upgrade an Installation from the Tarball
Step 1. Shut Down the Previous Version
Step 2. Back Up the Previous Version
Step 3. Install the New Version
Installing from the Tarball (Manual Start)
Installing from the Tarball for Systems Using SysV (Service Start)
Installing from the Tarball for Systems Using Systemd (Service Start)
Step 4. Update Environment Variables
Step 5. Update the Configuration Files
Step 6. Install Additional Libraries for the Core Installation
Step 7. Start the New Version of Data Collector
Upgrade an Installation from the RPM Package
Step 1. Shut Down the Previous Version
Step 2. Back Up the Previous Version
Step 3. Install the New Version
Step 4. Update Environment Variables
Step 5. Update the Configuration Files
Step 6. Install Additional Libraries for the Core Installation
Step 7. Uninstall Previous Libraries
Step 8. Start the New Version of Data Collector
Upgrade an Installation with Cloudera Manager
Step 1. Stop All Pipelines
Step 2. Back Up the Previous Version
Step 3. Install the StreamSets Custom Service Descriptor
Step 4. Manually Install the Parcel and Checksum Files (Optional)
Step 5. Distribute and Activate the New StreamSets Parcel
Step 6. Verify Modified Safety Valves
Step 7. Restart the StreamSets Service
Post-Upgrade Tasks
Update Value Replacer Pipelines
Update Einstein Analytics Pipelines
Update Control Hub On-premises
Update Pipelines Using Legacy Stage Libraries
Disable Cloudera Navigator Integration
JDBC Multitable Consumer Query Interval Change
Update JDBC Query Consumer Pipelines used for SQL Server CDC Data
Update MongoDB Destination Upsert Pipelines
Time Zones in Stages
Update Kudu Pipelines
Update JDBC Multitable Consumer Pipelines
Update Vault Pipelines
Configure JDBC Producer Schema Names
Evaluate Precondition Error Handling
Authentication for Docker Image
Configure Pipeline Permissions
Update Elasticsearch Pipelines
Working with Upgraded External Systems
Working with Kafka 0.11 or Later
Working with Cloudera CDH 5.11 or Later
Working with an Upgraded MapR System
Troubleshooting an Upgrade
Pipeline Concepts and Design
What is a Pipeline?
Data in Motion
Single and Multithreaded Pipelines
Delivery Guarantee
Data Collector Data Types
Designing the Data Flow
Branching Streams
Merging Streams
Dropping Unwanted Records
Required Fields
Preconditions
Error Record Handling
Pipeline Error Record Handling
Stage Error Record Handling
Example
Error Records and Version
Record Header Attributes
Working with Header Attributes
Internal Attributes
Header Attribute-Generating Stages
Record Header Attributes for Record-Based Writes
Generating Attributes for Record-Based Writes
Viewing Attributes in Data Preview
Field Attributes
Working with Field Attributes
Field Attribute-Generating Stages
Viewing Field Attributes in Data Preview
Processing Changed Data
CRUD Operation Header Attribute
Earlier Implementations
CDC-Enabled Origins
CRUD-Enabled Stages
Processing the Record
Use Cases
Control Character Removal
Development Stages
Pipeline Configuration
Data Collector UI - Edit Mode
Retrying the Pipeline
Pipeline Memory
Rate Limit
Simple and Bulk Edit Mode
Runtime Values
Using Runtime Parameters
Step 1. Define Runtime Parameters
Step 2. Call the Runtime Parameter
Step 3. Start the Pipeline with Parameters
Viewing Runtime Parameters
Using Runtime Properties
Step 1. Define Runtime Properties
Step 2. Call the Runtime Property
Using Runtime Resources
Step 1. Define Runtime Resources
Step 2. Call the Runtime Resource
Event Generation
Pipeline Event Records
Webhooks
Request Method
Payload and Parameters
Examples
Notifications
SSL/TLS Configuration
Keystore and Truststore Configuration
Transport Protocols
Cipher Suites
Implicit and Explicit Validation
Expression Configuration
Basic Syntax
Using Field Names in Expressions
Field Names with Special Characters
Referencing Field Names and Field Paths
Wildcard Use for Arrays and Maps
Field Path Expressions
Supported Stages
Field Path Expression Syntax
Expression Completion in Properties
Tips for Expression Completion
Data Type Coercion
Configuring a Pipeline
Data Formats
Data Formats Overview
Delimited Data Root Field Type
Log Data Format
NetFlow Data Processing
Caching NetFlow 9 Templates
NetFlow 5 Generated Records
NetFlow 9 Generated Records
Protobuf Data Format Prerequisites
SDC Record Data Format
Text Data Format with Custom Delimiters
Processing XML Data with Custom Delimiters
Whole File Data Format
Basic Pipeline
Whole File Records
Additional Processors
Defining the Transfer Rate
Writing Whole Files
Access Permissions
Including Checksums in Events
Reading and Processing XML Data
Creating Multiple Records with an XML Element
Using XML Elements with Namespaces
Creating Multiple Records with an XPath Expression
Using XPath Expressions with Namespaces
Simplified XPath Syntax
Sample XPath Expressions
Predicates in XPath Expressions
Predicate Examples
Including Field XPaths and Namespaces
XML Attributes and Namespace Declarations
Parsed XML
Writing XML Data
Record Structure Requirement
Origins
Origins
Comparing HTTP Origins
Comparing MapR Origins
Comparing UDP Source Origins
Comparing WebSocket Origins
Batch Size and Wait Time
Maximum Record Size
File Compression Formats
Previewing Raw Source Data
Amazon S3
AWS Credentials
Common Prefix, Prefix Pattern, and Wildcards
Record Header Attributes
Object Metadata in Record Header Attributes
Read Order
Buffer Limit and Error Handling
Server-Side Encryption
Event Generation
Event Records
Data Formats
Configuring an Amazon S3 Origin
Amazon SQS Consumer
AWS Credentials
Queue Name Prefix
Multithreaded Processing
Including SQS Message Attributes
Including Sender Attributes
Data Formats
Configuring an Amazon SQS Consumer
Azure IoT/Event Hub Consumer
Storage Account and Container Prerequisite
Resetting the Origin in Event Hub
Multithreaded Processing
Data Formats
Configuring an Azure IoT/Event Hub Consumer
CoAP Server
Prerequisites
Multithreaded Processing
Network Configuration Properties
Data Formats
Configuring a CoAP Server Origin
Directory
File Name Pattern and Mode
Read Order
Multithreaded Processing
Reading from Subdirectories
Post-Processing Subdirectories
First File for Processing
Late Directory
Record Header Attributes
Event Generation
Event Records
Buffer Limit and Error Handling
Data Formats
Configuring a Directory Origin
Elasticsearch
Batch and Incremental Mode
Query
Incremental Mode Query
Scroll Timeout
Multithreaded Processing
Configuring an Elasticsearch Origin
File Tail
File Processing and Archive File Names
Multiple Paths and File Sets
First File for Processing
Late Directories
Files Matching a Pattern - Pattern Constant
Record Header Attributes
Defining and Using a Tag
Multiple Line Processing
File Tail Output
Event Generation
Event Records
Data Formats
Configuring a File Tail Origin
Google BigQuery
Credentials
Default Credentials Provider
Service Account Credentials File (JSON)
BigQuery Data Types
Event Generation
Event Record
Configuring a Google BigQuery Origin
Google Cloud Storage
Credentials
Default Credentials Provider
Service Account Credentials (JSON)
Common Prefix, Prefix Pattern, and Wildcards
Event Generation
Event Records
Data Formats
Configuring a Google Cloud Storage Origin
Google Pub/Sub Subscriber
Credentials
Default Credentials Provider
Service Account Credentials File (JSON)
Multithreaded Processing
Record Header Attributes
Data Formats
Configuring a Google Pub/Sub Subscriber Origin
Hadoop FS
Reading from Other File Systems
Kerberos Authentication
Using a Hadoop User
Hadoop Properties and Configuration Files
Data Formats
Configuring a Hadoop FS Origin
HTTP Client
Processing Mode
Pagination
Result Field Path
Keep All Fields
HTTP Method
OAuth 2 Authorization
Example for Twitter
Example for Microsoft Azure AD
Example for Google
Data Formats
Response Header Fields in Header Attributes
Configuring an HTTP Client Origin
HTTP Server
Prerequisites
Send Data to the Listening Port
Include the Application ID in Requests
Multithreaded Processing
Data Formats
Record Header Attributes
Configuring an HTTP Server Origin
HTTP to Kafka
Prerequisites
Pipeline Configuration
Kafka Maximum Message Size
Enabling Kafka Security
Enabling SSL/TLS
Enabling Kerberos (SASL)
Enabling SSL/TLS and Kerberos
Configuring an HTTP to Kafka Origin
JDBC Multitable Consumer
Installing the JDBC Driver
Working with a MySQL JDBC Driver
Table Configuration
Table Name Pattern
Offset Column and Value
Reading from Views
Multithreaded Processing Modes
Multithreaded Table Processing
Multithreaded Partition Processing
Partition Processing Requirements
Multiple Offset Value Handling
Best Effort: Processing Non-Compliant Tables
Non-Incremental Processing
Batch Strategy
Process All Available Rows
Switch Tables
Initial Table Order Strategy
Understanding the Processing Queue
Multiple Tables, No Partition Processing
Multiple Partitions, No Table Processing
Both Partition and Table Processing
JDBC Header Attributes
Event Generation
Event Record
Configuring a JDBC Multitable Consumer
JDBC Query Consumer
Installing the JDBC Driver
Offset Column and Offset Value
Full and Incremental Mode
Recovery
SQL Query
SQL Query for Incremental Mode
SQL Query for Full Mode
Stored Procedures
JDBC Record Header Attributes
Header Attributes with the Drift Synchronization Solution
CDC for Microsoft SQL Server
CRUD Record Header Attribute
Group Rows by Transaction
Event Generation
Event Record
Configuring a JDBC Query Consumer
JMS Consumer
Installing JMS Drivers
Data Formats
Configuring a JMS Consumer Origin
Kafka Consumer
Initial and Subsequent Offsets
Processing Available Data
Additional Kafka Properties
Record Header Attributes
Enabling Security
Enabling SSL/TLS
Enabling Kerberos (SASL)
Enabling SSL/TLS and Kerberos
Data Formats
Configuring a Kafka Consumer
Kafka Multitopic Consumer
Initial and Subsequent Offsets
Processing All Unread Data
Multithreaded Processing
Additional Kafka Properties
Record Header Attributes
Enabling Security
Enabling SSL/TLS
Enabling Kerberos (SASL)
Enabling SSL/TLS and Kerberos
Data Formats
Configuring a Kafka Multitopic Consumer
Kinesis Consumer
Multithreaded Processing
AWS Credentials
Read Interval
Lease Table Tags
Resetting the Origin
Data Formats
Configuring a Kinesis Consumer Origin
MapR DB CDC
Multithreaded Processing
Handling the _id Field
CRUD Operation and CDC Header Attributes
Additional Properties
Configuring a MapR DB CDC Origin
MapR DB JSON
Handling the _id Field
Configuring a MapR DB JSON Origin
MapR FS
Kerberos Authentication
Using a Hadoop User
Hadoop Properties and Configuration Files
Data Formats
Configuring a MapR FS Origin
MapR Multitopic Streams Consumer
Initial and Subsequent Offsets
Processing All Unread Data
Multithreaded Processing
Additional Properties
Record Header Attributes
Data Formats
Configuring a MapR Multitopic Streams Consumer
MapR Streams Consumer
Processing All Unread Data
Data Formats
Additional Properties
Record Header Attributes
Configuring a MapR Streams Consumer
MongoDB
Credentials
Offset Field and Initial Offset
Read Preference
Event Generation
Event Records
BSON Timestamp
Enabling SSL/TLS
Configuring a MongoDB Origin
MongoDB Oplog
Credentials
Oplog Timestamp and Ordinal
Read Preference
Generated Records
CRUD Operation and CDC Header Attributes
Enabling SSL/TLS
Configuring a MongoDB Oplog Origin
MQTT Subscriber
Topics
Record Header Attributes
Data Formats
Configuring an MQTT Subscriber Origin
MySQL Binary Log
Prerequisites
Configure MySQL Server for Row-based Logging
Install the JDBC Driver
Initial Offset
Generated Records
Processing Generated Records
Tables to Include or Ignore
Configuring a MySQL Binary Log Origin
Omniture
Configuring an Omniture Origin
OPC UA Client
Processing Mode
Providing NodeIds
Security
Configuring an OPC UA Client Origin
Oracle CDC Client
LogMiner Dictionary Source
Oracle CDC Client Prerequisites
Task 1. Enable LogMiner
Task 2. Enable Supplemental Logging
Task 3. Create a User Account
Task 4. Extract a LogMiner Dictionary (Redo Logs)
Task 5. Install the Driver
Schema, Table Name and Exclusion Patterns
Initial Change
Choosing Buffers
Local Buffer Resource Requirements
Uncommitted Transaction Handling
Include Nulls
Unsupported Data Types
Conditional Data Type Support
Generated Records
CRUD Operation Header Attributes
CDC Header Attributes
Event Generation
Event Records
Working with the Drift Synchronization Solution
Data Preview with Oracle CDC Client
Configuring an Oracle CDC Client
RabbitMQ Consumer
Queue
Record Header Attributes
Data Formats
Configuring a RabbitMQ Consumer
Redis Consumer
Channels and Patterns
Data Formats
Configuring a Redis Consumer
Salesforce
Query Existing Data
Using the SOAP and Bulk API
Example
Using the Bulk API with PK Chunking
Example
Repeat Query
Subscribing to Notifications
Processing PushTopic Events
PushTopic Event Record Format
Processing Platform Events
Reading Custom Objects or Fields
Processing Deleted Records
Salesforce Attributes
Salesforce Header Attributes
CRUD Operation Header Attribute
Salesforce Field Attributes
Event Generation
Event Record
Changing the API Version
Configuring a Salesforce Origin
SDC RPC
Configuring an SDC RPC Origin
SDC RPC to Kafka
Pipeline Configuration
Concurrent Requests
Batch Request Size, Kafka Message Size, and Kafka Configuration
Additional Kafka Properties
Enabling Kafka Security
Enabling SSL/TLS
Enabling Kerberos (SASL)
Enabling SSL/TLS and Kerberos
Configuring an SDC RPC to Kafka Origin
SFTP/FTP Client
Read Order
First File for Processing
Credentials
Record Header Attributes
Event Generation
Event Records
Data Formats
Configuring an SFTP/FTP Client Origin
SQL Server CDC Client
Installing the JDBC Driver
Supported Operations
Multithreaded Processing
Batch Strategy
Table Configuration
Initial Table Order Strategy
Allow Late Table Processing
Checking for Schema Changes
Generated Record
Record Header Attributes
CRUD Operation Header Attributes
Event Generation
Event Record
Configuring a SQL Server CDC Origin
SQL Server Change Tracking
Permission Requirements
Installing the JDBC Driver
Multithreaded Processing
Batch Strategy
Table Configuration
Initial Table Order Strategy
Generated Record
Record Header Attributes
CRUD Operation Header Attributes
Event Generation
Event Record
Configuring a SQL Server Change Tracking Origin
TCP Server
Multithreaded Processing
Closing Connections for Invalid Data
Sending Acknowledgements
Using Expressions in Messages
TCP Modes
Data Formats
Configuring a TCP Server Origin
UDP Multithreaded Source
Processing Raw Data
Receiver and Worker Threads
Packet Queue
Multithreaded Pipelines
Metrics for Performance Tuning
Configuring a UDP Multithreaded Source
UDP Source
Processing Raw Data
Receiver Threads
Configuring a UDP Source
UDP to Kafka
Pipeline Configuration
Additional Kafka Properties
Enabling Kafka Security
Enabling SSL/TLS
Enabling Kerberos (SASL)
Enabling SSL/TLS and Kerberos
Configuring a UDP to Kafka Origin
WebSocket Client
Read REST Response Data from Data Collector
Data Formats
Configuring a WebSocket Client Origin
WebSocket Server
Prerequisites
Send Data to the Listening Port
Include the Application ID in Requests
Multithreaded Processing
Data Formats
Configuring a WebSocket Server Origin
Windows Event Log
Configuring a Windows Event Log Origin
Processors
Processors
Aggregator
Window Type, Time Windows, and Information Display
Rolling Windows
Sliding Windows
Calculation Components
Aggregate Functions
Event Generation
Event Record Root Field
Event Records
Sample Event Records
Monitoring Aggregations
Configuring an Aggregator
Base64 Field Decoder
Configuring a Base64 Field Decoder
Base64 Field Encoder
Configuring a Base64 Field Encoder
Data Parser
Data Formats
Configuring a Data Parser
Delay
Configuring a Delay Processor
Expression Evaluator
Output Fields and Attributes
Record Header Attribute Expressions
Field Attribute Expressions
Configuring an Expression Evaluator
Field Flattener
Flatten the Entire Record
Flatten Specific Fields
Configuring a Field Flattener
Field Hasher
Hash Methods
List, Map, and List-Map Fields
Configuring a Field Hasher
Field Masker
Mask Types
Configuring a Field Masker
Field Merger
Configuring a Field Merger
Field Order
Missing and Extra Fields
Configuring a Field Order Processor
Field Pivoter
Generated Records
Configuring a Field Pivoter
Field Remover
Configuring a Field Remover
Field Renamer
Renaming Sets of Fields
Configuring a Field Renamer
Field Replacer
Replacing Values with Nulls
Replacing Values with New Values
Data Types for Conditional Replacement
Configuring a Field Replacer
Field Splitter
Not Enough Splits
Too Many Splits
Example
Configuring a Field Splitter
Field Type Converter
Valid Type Conversions
Changing the Scale of Decimal Fields
Configuring a Field Type Converter
Field Zip
Merging List Data
Merging List-Map Data
Pivoting Merged Lists
Configuring a Field Zip Processor
Geo IP
Supported Databases
Database File Location
GeoIP Field Types
Full JSON Field Types
Configuring a Geo IP Processor
Groovy Evaluator
Processing Mode
Groovy Scripting Objects
Processing List-Map Data
Type Handling
Event Generation
Working with Record Header Attributes
Viewing Record Header Attributes
Accessing Whole File Format Records
Calling External Java Code
Granting Permissions on Groovy Scripts
Configuring a Groovy Evaluator
HBase Lookup
Lookup Key
Lookup Cache
Kerberos Authentication
Using an HBase User
HDFS Properties and Configuration File
Configuring an HBase Lookup
Hive Metadata
Output Streams
Metadata Records and Record Header Attributes
Custom Record Header Attributes
Database, Table, and Partition Expressions
Hive Names and Supported Characters
Decimal Field Expressions
Time Basis
Cache
Cache Size and Evictions
Kerberos Authentication
Hive Properties and Configuration Files
Configuring a Hive Metadata Processor
HTTP Client
HTTP Method
Expression Method
Parallel Requests
Response Header Fields
OAuth 2 Authorization
Example for Twitter
Example for Microsoft Azure AD
Example for Google
Data Formats
Configuring HTTP Client Processor
JavaScript Evaluator
Processing Mode
JavaScript Scripting Objects
Processing List-Map Data
Type Handling
Event Generation
Working with Record Header Attributes
Viewing Record Header Attributes
Accessing Whole File Format Records
Calling External Java Code
Configuring a JavaScript Evaluator
JDBC Lookup
Installing the JDBC Driver
Lookup Cache
Retry Lookups for Missing Values
Monitoring a JDBC Lookup
Configuring a JDBC Lookup
JDBC Tee
Installing the JDBC Driver
Define the CRUD Operation
Single and Multi-row Operations
Configuring a JDBC Tee
JSON Generator
Configuring a JSON Generator
JSON Parser
Configuring a JSON Parser
Jython Evaluator
Processing Mode
Jython Scripting Objects
Processing List-Map Data
Type Handling
Event Generation
Working with Record Header Attributes
Viewing Record Header Attributes
Accessing Whole File Format Records
Calling External Java Code
Configuring a Jython Evaluator
Kudu Lookup
Column Mappings
Kudu Data Types
Lookup Cache
Cache Table Information
Cache Lookup Values
Configuring a Kudu Lookup
Log Parser
Log Formats
Configuring a Log Parser
Postgres Metadata
Installing the JDBC Driver
Schema and Table Names
Decimal Precision and Scale Field Attributes
Caching Information
Configuring a Postgres Metadata Processor
Record Deduplicator
Comparison Window
Configuring a Record Deduplicator
Redis Lookup
Data Types
Lookup Cache
Configuring a Redis Lookup Processor
Salesforce Lookup
Lookup Mode
Lookup Cache
Salesforce Attributes
Changing the API Version
Configuring a Salesforce Lookup
Schema Generator
Using the avroSchema Header Attribute
Generated Avro Schema
Caching Schemas
Configuring a Schema Generator
Spark Evaluator
Spark Versions and Stage Libraries
Standalone Pipelines
Cluster Pipelines
Developing the Spark Application
Installing the Application
Configuring a Spark Evaluator
Static Lookup
Configuring a Static Lookup Processor
Stream Selector
Default Stream
Sample Conditions for Streams
Configuring the Stream Selector
Value Replacer (Deprecated)
Processing Order
Replacing Values with Nulls
Replacing Values with Constants
Data Types for Conditional Replacement
Configuring a Value Replacer
XML Flattener
Generated Records
Configuring an XML Flattener
XML Parser
Configuring an XML Parser
Destinations
Destinations
Aerospike
Configuring an Aerospike Destination
Amazon S3
AWS Credentials
Bucket
Partition Prefix
Time Basis and Data Time Zone for Time-Based Buckets and Partition Prefixes
Object Names
Whole File Names
Event Generation
Event Records
Server-Side Encryption
Data Formats
Configuring an Amazon S3 Destination
Azure Data Lake Store
Prerequisites
Step 1. Create a Data Collector Web Application
Step 2. Retrieve Information from Azure
Step 3. Grant Execute Permission
Directory Templates
Time Basis
Timeout to Close Idle Files
Event Generation
Event Records
Data Formats
Configuring an Azure Data Lake Store Destination
Azure Event Hub Producer
Data Formats
Configuring an Azure Event Hub Producer
Azure IoT Hub Producer
Register Data Collector as an IoT Hub Device
Data Formats
Configuring an Azure IoT Hub Producer Destination
Cassandra
Authentication
Kerberos (DSE) Authentication
Cassandra Data Types
Configuring a Cassandra Destination
CoAP Client
Data Formats
Configuring a CoAP Client Destination
Elasticsearch
Time Basis and Time-Based Indexes
Document IDs
Define the CRUD Operation
Configuring an Elasticsearch Destination
Einstein Analytics
Changing the API Version
Define the Operation
Metadata JSON
Dataflow (Deprecated)
Configuring an Einstein Analytics Destination
Flume
Data Formats
Configuring a Flume Destination
Google BigQuery
BigQuery Data Types
Credentials
Default Credentials Provider
Service Account Credentials File (JSON)
Configuring a Google BigQuery Destination
Google Bigtable
Prerequisites
Install the BoringSSL Library
Configure the Google Application Default Credentials
Row Key
Cloud Bigtable Data Types
Column Family and Field Mappings
Time Basis
Configuring a Google Bigtable Destination
Google Cloud Storage
Credentials
Default Credentials Provider
Service Account Credentials (JSON)
Partition Prefix
Time Basis, Data Time Zone, and Time-Based Partition Prefixes
Object Names
Whole File Names
Event Generation
Event Records
Data Formats
Configuring a Google Cloud Storage Destination
Google Pub/Sub Publisher
Credentials
Default Credentials Provider
Service Account Credentials File (JSON)
Data Formats
Configuring a Google Pub/Sub Publisher Destination
Hadoop FS
Directory Templates
Time Basis
Late Records and Late Record Handling
Timeout to Close Idle Files
Recovery
Data Formats
Writing to Azure HDInsight
Event Generation
Event Records
Kerberos Authentication
Using an HDFS User
HDFS Properties and Configuration Files
Configuring a Hadoop FS Destination
HBase
Field Mappings
Kerberos Authentication
Using an HBase User
Time Basis
HDFS Properties and Configuration File
Configuring an HBase Destination
Hive Metastore
Metadata Processing
Hive Table Generation
Cache
Cache Size and Evictions
Event Generation
Event Records
Kerberos Authentication
Hive Properties and Configuration Files
Configuring a Hive Metastore Destination
Hive Streaming
Hive Properties and Configuration Files
Configuring a Hive Streaming Destination
HTTP Client
HTTP Method
Expression Method
Number of Requests
OAuth 2 Authorization
Example for Twitter
Example for Microsoft Azure AD
Example for Google
Data Formats
Configuring an HTTP Client Destination
InfluxDB
Configuring an InfluxDB Destination
JDBC Producer
Installing the JDBC Driver
Define the CRUD Operation
Single and Multi-row Operations
Configuring a JDBC Producer
JMS Producer
Installing JMS Drivers
Data Formats
Configuring a JMS Producer
Kafka Producer
Broker List
Runtime Topic Resolution
Partition Strategy
Additional Kafka Properties
Enabling Security
Enabling SSL/TLS
Enabling Kerberos (SASL)
Enabling SSL/TLS and Kerberos
Data Formats
Configuring a Kafka Producer
Kinesis Firehose
AWS Credentials
Delivery Stream
Data Formats
Configuring a Kinesis Firehose Destination
Kinesis Producer
AWS Credentials
Data Formats
Configuring a Kinesis Producer Destination
KineticaDB
Multihead Ingest
Inserts and Updates
Configuring a KineticaDB Destination
Kudu
Define the CRUD Operation
Kudu Data Types
Configuring a Kudu Destination
Local FS
Directory Templates
Time Basis
Late Records and Late Record Handling
Timeout to Close Idle Files
Recovery
Event Generation
Event Records
Data Formats
Configuring a Local FS Destination
MapR DB
Field Mappings
Time Basis
Kerberos Authentication
Using an HBase User
HDFS Properties and Configuration File
Configuring a MapR DB Destination
MapR DB JSON
Row Key
Row Key Data Types
Writing to MapR DB JSON
Define the CRUD Operation
Insert and Set API Properties
Configuring a MapR DB JSON Destination
MapR FS
Directory Templates
Time Basis
Late Records and Late Record Handling
Timeout to Close Idle Files
Recovery
Event Generation
Event Records
Data Formats
Kerberos Authentication
Using an HDFS User
HDFS Properties and Configuration Files
Configuring a MapR FS Destination
MapR Streams Producer
Data Formats
Runtime Topic Resolution
Partition Strategy
Additional Properties
Configuring a MapR Streams Producer
MongoDB
Credentials
Define the CRUD Operation
Performing Upserts
Enabling SSL/TLS
Configuring a MongoDB Destination
MQTT Publisher
Topic
Data Formats
Configuring an MQTT Publisher Destination
Named Pipe
Prerequisites
Create the Named Pipe
Configure the Named Pipe Reader
Working with the Named Pipe Reader
Data Formats
Configuring a Named Pipe Destination
RabbitMQ Producer
Data Formats
Configuring a RabbitMQ Producer
Redis
Mode
Data Types for Batch Mode
Data Formats for Publish Mode
Define the CRUD Operation
Configuring a Redis Destination
Salesforce
Changing the API Version
Define the CRUD Operation
Field Mappings
Configuring a Salesforce Destination
SDC RPC
RPC Connections
Disabling Compression
Configuring an SDC RPC Destination
Solr
Index Mode
Kerberos Authentication
Configuring a Solr Destination
To Error
Trash
WebSocket Client
Data Formats
Configuring a WebSocket Client Destination
Executors
Executors
Amazon S3 Executor
AWS Credentials
Create New Objects
Tag Existing Objects
Configuring an Amazon S3 Executor
Email Executor
Prerequisite
Conditions
Using Expressions
Configuring an Email Executor
HDFS File Metadata Executor
Related Event Generating Stages
Changing Metadata
Specifying the File Path
Changing the File Name or Location
Defining the Owner, Group, Permissions, and ACLs
Creating an Empty File
Removing a File or Directory
Event Generation
Event Records
Kerberos Authentication
HDFS User
HDFS Properties and Configuration Files
Configuring an HDFS File Metadata Executor
Hive Query Executor
Related Event Generating Stages
Installing the Impala Driver
Hive and Impala Queries
Impala Queries for the Drift Synchronization Solution for Hive
Event Generation
Event Records
Configuring a Hive Query Executor
JDBC Query Executor
Installing the JDBC Driver
Configuring a JDBC Query Executor
MapR FS File Metadata Executor
Related Event Generating Stage
Changing Metadata
Specifying the File Path
Changing the File Name or Location
Defining the Owner, Group, Permissions, and ACLs
Creating an Empty File
Removing a File or Directory
Event Generation
Event Records
Kerberos Authentication
HDFS User
HDFS Properties and Configuration Files
Configuring a MapR FS File Metadata Executor
MapReduce Executor
Prerequisites
Related Event Generating Stages
MapReduce Jobs
Avro to Parquet Job
Event Generation
Event Records
Kerberos Authentication
Using a MapReduce User
Configuring a MapReduce Executor
Pipeline Finisher Executor
Recommended Implementation
Related Event Generating Stages
Notification Options
Configuring a Pipeline Finisher Executor
Shell Executor
Data Collector Shell Impersonation Mode
Script Configuration
Configuring a Shell Executor
Spark Executor
Spark Versions and Stage Libraries
Spark on YARN
YARN Prerequisite
Spark Home Requirement
Application Properties
Using a Proxy Hadoop User
Kerberos Authentication
Spark on Databricks
Databricks Prerequisites
Event Generation
Event Records
Monitoring
Configuring a Spark Executor
StreamSets Control Hub
Meet StreamSets Control Hub
Design the Complete Data Architecture
Collaboratively Build Pipelines
Execute Jobs at Scale
Map Jobs into a Topology
Measure Dataflow Quality
Monitor Dataflow Operations
Working with Control Hub
Request a Control Hub Organization and User Account
Register Data Collector with Control Hub
Registration Prerequisites
Registering from Data Collector
Registering from the Command Line Interface
Registering from Cloudera Manager
Disconnected Mode
Using an HTTP or HTTPS Proxy Server
Using a Publicly Accessible URL
Transfer Permissions to Control Hub Users
Pipeline Statistics
Pipeline Execution Mode
Write Statistics Directly to Control Hub
Write Statistics to SDC RPC
Best Practices for SDC RPC
Write Statistics to Kafka
Partition Strategy
Best Practices for a Kafka Cluster
Write Statistics to Kinesis Streams
AWS Credentials
Best Practices for Kinesis Streams
Write Statistics to MapR Streams
Partition Strategy
Best Practices for MapR Streams
Configuring a Pipeline to Write Statistics
Pipeline Management with Control Hub
Pipeline Types
Viewing Pipeline Types in Data Collector
Publishing Pipelines to Control Hub
Reverting Changes to Published Pipelines
Viewing Pipeline Commit History
Downloading Published Pipelines
Exporting Pipelines for Control Hub
Control Hub Configuration
Unregister Data Collector from Control Hub
Unregistering from Data Collector
Unregistering from the Command Line Interface
Dataflow Triggers
Dataflow Triggers Overview
Pipeline Event Generation
Using Pipeline Events
Pass to an Executor
Pass to Another Pipeline
Stage Event Generation
Using Stage Events
Task Execution Streams
Event Storage Streams
Executors
Logical Pairings
Event Records
Event Record Header Attributes
Viewing Events in Data Preview, Snapshot, and Monitor Mode
Viewing Stage Events in Data Preview and Snapshot
Viewing Stage Events in Monitor Mode
Executing Pipeline Events in Data Preview
Case Study: Parquet Conversion
Case Study: Impala Metadata Updates for DDS for Hive
Case Study: Output File Management
Case Study: Stop the Pipeline
Case Study: Offloading Data from Relational Sources to Hadoop
Case Study: Sending Email
Case Study: Event Storage
Summary
Drift Synchronization Solution for Hive
Drift Synchronization Solution for Hive
General Processing
Parquet Processing
Impala Support
Flatten Records
Basic Avro Implementation
Basic Parquet Implementation
Implementation Steps
Avro Case Study
The Hive Metadata Processor
The Hive Metastore Destination
The Data-Processing Destination
Processing Avro Data
Parquet Case Study
JDBC Query Consumer
The Hive Metadata Processor
The Hive Metastore Destination
The Data-Processing Destination
The MapReduce Executor
Processing Parquet Data
Hive Data Types
Drift Synchronization Solution for Postgres
Drift Synchronization Solution for Postgres
Basic Implementation and Processing
Flattening Records
Requirements
Implementation Steps
Case Study
The JDBC Multitable Consumer Origin
The Postgres Metadata Processor
The JDBC Producer Destination
Running the Pipeline
Multithreaded Pipelines
Multithreaded Pipeline Overview
How It Works
Origins for Multithreaded Pipelines
Processor Caching
Monitoring
Tuning Threads and Runners
Resource Usage
Multithreaded Pipeline Summary
Edge Pipelines
Edge Pipelines Overview
Example: IoT Preventative Maintenance
Supported Platforms
Edge Pipelines
Edge Sending Pipelines
Edge Receiving Pipelines
Error Record Handling
Data Formats
Edge Pipeline Limitations
Data Collector Receiving Pipelines
Getting Started with Sample Edge Pipelines
Step 1. Create and Start a Data Collector Receiving Pipeline
Step 2. Download and Install SDC Edge
Step 3. Start SDC Edge and the Edge Pipeline
Install SDC Edge
Downloading from Data Collector
Downloading from the StreamSets Website
Running from Docker
Administer SDC Edge
Configuring SDC Edge
Starting SDC Edge
Shutting Down SDC Edge
Logs
Export Pipelines to SDC Edge
Manage Pipelines on SDC Edge
Uninstalling SDC Edge
SDC RPC Pipelines
SDC RPC Pipeline Overview
Pipeline Types
Deployment Architecture
Configuring the Delivery Guarantee
Defining the RPC ID
Enabling Encryption
Configuration Guidelines for SDC RPC Pipelines
Cluster Pipelines
Cluster Pipeline Overview
Cluster Batch and Streaming Execution Modes
HTTP Protocols
Checkpoint Storage for Streaming Pipelines
Configuring the Location for Mesos
Error Handling Limitations
Monitoring and Snapshot
Kafka Cluster Requirements
Configuring Cluster YARN Streaming for Kafka
Configuring Cluster Mesos Streaming for Kafka
MapR Requirements
Configuring Cluster Batch Mode for MapR
Configuring Cluster Streaming Mode for MapR
HDFS Requirements
Cluster Pipeline Limitations
Data Preview
Data Preview Overview
Data Preview Availability
Source Data for Data Preview
Writing to Destinations
Notes
Data Collector UI - Preview Mode
Preview Codes
Previewing a Single Stage
Previewing Multiple Stages
Editing Preview Data
Editing Properties
Rules and Alerts
Rules and Alerts Overview
Metric Rules and Alerts
Default Metric Rules
Metric Types
Gauge
Counter
Histogram
Meter
Timer
Metric Conditions
Configuring a Metric Rule and Alert
Data Rules and Alerts
Configuring a Data Rule and Alert
Viewing Data Rule Metrics and Sample Data
Data Drift Rules and Alerts
Data Drift Alert Triggers
Configuring Data Drift Rules and Alerts
Alert Webhooks
Configuring an Alert Webhook
Configuring Email for Alerts
Pipeline Monitoring
Pipeline Monitoring Overview
Data Collector UI - Monitor Mode
Viewing Pipeline and Stage Statistics
Monitoring Errors
Stage-Related Errors
Snapshots
Failure Snapshots
Viewing a Failure Snapshot
Capturing and Viewing a Snapshot
Downloading a Snapshot
Deleting a Snapshot
Viewing the Run History
Viewing a Run Summary
Pipeline Maintenance
Understanding Pipeline States
State Transition Examples
Starting Pipelines
Starting Pipelines with Parameters
Resetting the Origin
Stopping Pipelines
Importing Pipelines
Importing a Single Pipeline
Importing a Set of Pipelines
Sharing Pipelines
Sharing a Pipeline
Changing the Pipeline Owner
Adding Labels to Pipelines
Exporting Pipelines
Duplicating a Pipeline
Deleting Pipelines
Administration
Viewing Data Collector Configuration Properties
Viewing Data Collector Directories
Viewing Data Collector Metrics
Viewing Data Collector Logs
Modifying the Log Level
Shutting Down Data Collector
Restarting Data Collector
Viewing Users and Groups
Support Bundles
Generators
Generating a Support Bundle
Customizing Generators
REST Response
Viewing REST Response Data
Disabling the REST Response Menu
Command Line Interface
Java Configuration Options for the Cli Command
Using the Cli Command
Not Registered with Control Hub
Registered with Control Hub
Help Command
Manager Command
Store Command
System Command
Tutorial
Tutorial Overview
Before You Begin
Basic Tutorial
Create a Pipeline and Define Pipeline Properties
Configure the Origin
Preview Data
Route Data with the Stream Selector
Use Jython for Card Typing
Mask Credit Card Numbers
Write to the Destination
Add a Corresponding Field with the Expression Evaluator
Create a Data Rule and Alert
Run the Basic Pipeline
Extended Tutorial
Convert Types with a Field Type Converter
Manipulate Data with the Expression Evaluator
Preview and Edit the Pipeline
Write to Trash
Run the Extended Pipeline
Troubleshooting
Accessing Error Messages
Pipeline Basics
Data Preview
General Validation Errors
Origins
Directory
Hadoop FS
JDBC Origins
Kafka Consumer
Oracle CDC Client
SQL Server CDC Client
Destinations
Cassandra
Hadoop FS
HBase
Kafka Producer
SDC RPC
Executors
Hive Query
JDBC Connections
No Suitable Driver
Cannot Connect to Database
MySQL JDBC Driver and Time Values
Performance
Cluster Execution Mode
Glossary
Glossary of Terms
Data Formats by Stage
Data Format Support
Origins
Destinations
Expression Language
Expression Language
Expression Examples
Functions
Record Functions
Delimited Data Record Functions
Error Record Functions
Base64 Functions
Credential Functions
Data Drift Functions
Field Functions
File Functions
Math Functions
Pipeline Functions
String Functions
Time Functions
Miscellaneous Functions
Constants
Datetime Variables
Literals
Operators
Operator Precedence
Reserved Words
Regular Expressions
Regular Expressions Overview
Regular Expressions in the Pipeline
Quick Reference
Regex Examples
Grok Patterns
Defining Grok Patterns
General Grok Patterns
Date and Time Grok Patterns
Java Grok Patterns
Log Grok Patterns
Networking Grok Patterns
Path Grok Patterns