What is StreamSets Cloud?
StreamSets Cloud™ is a cloud-native DataOps platform that you can use to build, run, and monitor your dataflow pipelines.
Requirements
To use StreamSets Cloud, ensure that you meet the following requirements.
Try StreamSets Cloud
This tutorial covers the steps needed to try StreamSets Cloud. Although the tutorial provides a simple use case, keep in mind that StreamSets Cloud is a powerful DataOps platform that enables you to build, run, and monitor large numbers of complex pipelines.
Community Help
For additional help with StreamSets Cloud, sign up for our StreamSets Community on Slack and join the #streamsets-cloud channel.
What is a Pipeline?
A pipeline describes how data flows from origin to destination systems and how the data is processed along the way.
Batches
Data passes through a pipeline in batches.
Designing the Data Flow
You can branch and merge streams of data in a pipeline.
Pipeline Validation
StreamSets Cloud automatically validates pipelines as you design them and when you preview them.
Database Connections
When you design pipelines that access databases, you must decide how the pipeline stages connect to your databases.
Common Stage Concepts
Pipeline stages include origins, processors, destinations, and executors. Although each pipeline stage functions differently, all pipeline stages share some concepts in common.
Processing Changed Data
Certain stages enable you to easily process change data capture (CDC) data or transactional data in a pipeline.
Error Record Handling
You configure error record handling at a stage level and at a pipeline level. You also specify the version of the record to use as the basis for the error record.
Expressions
Record Header Attributes
Record header attributes are attributes in record headers that you can use in pipeline logic, as needed.
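For example, a minimal sketch of reading a header attribute in an expression, assuming the record:attribute function (the sourceFile attribute name is illustrative, not a predefined attribute):
    ${record:attribute('sourceFile')}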
Field Attributes
Field attributes are attributes that provide additional information about each field that you can use in pipeline logic, as needed.
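For example, a minimal sketch of reading a field attribute, assuming the record:fieldAttribute function (the /amount field path and precision attribute name are illustrative):
    ${record:fieldAttribute('/amount', 'precision')}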
Development Stages
You can use several development stages to help develop and test pipelines.
Data Types
Overview
After you create a pipeline, you configure specific properties for the pipeline.
Delivery Guarantee
When you configure a pipeline, you define whether you want to prevent the loss of data or the duplication of data.
Event Generation
The event framework generates pipeline events when the pipeline starts and stops.
Retrying the Pipeline
By default, when StreamSets Cloud encounters a stage-level error that might cause a pipeline to fail, it retries the pipeline. That is, it waits a period of time, and then tries again to run the pipeline.
Rate Limit
You can limit the rate at which a pipeline processes records by defining the maximum number of records that the pipeline can read in a second.
Configuring a Pipeline
Configure a pipeline to define the stream of data. After you configure the pipeline, you can run the pipeline.
Origins
An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline.
Amazon S3
The Amazon S3 origin reads objects stored in Amazon S3. The objects must be fully written, and the object names must share a prefix pattern.
Azure Data Lake Storage Gen1
The Azure Data Lake Storage Gen1 origin reads data from Microsoft Azure Data Lake Storage Gen1.
Azure Data Lake Storage Gen2
The Azure Data Lake Storage Gen2 origin reads data from Microsoft Azure Data Lake Storage Gen2.
Azure IoT/Event Hub Consumer
The Azure IoT/Event Hub Consumer origin reads data from Microsoft Azure Event Hub.
Google BigQuery
The Google BigQuery origin executes a query job and reads the result from Google BigQuery.
Google Cloud Storage
The Google Cloud Storage origin reads objects stored in Google Cloud Storage. The objects must be fully written and reside in a single bucket. The object names must share a prefix pattern.
Google Pub/Sub Subscriber
The Google Pub/Sub Subscriber origin consumes messages from a Google Pub/Sub subscription.
HTTP Client
The HTTP Client origin reads data from an HTTP resource URL.
Kinesis Consumer
The Kinesis Consumer origin reads data from Amazon Kinesis Streams.
MySQL Binary Log
The MySQL Binary Log origin processes change data capture (CDC) information provided by MySQL in binary logs.
MySQL Multitable Consumer
The MySQL Multitable Consumer origin reads data from a MySQL database using table names. Use the origin to read from multiple tables in the same database. For example, you might use the origin to perform database replication.
MySQL Query Consumer
The MySQL Query Consumer origin reads data from a MySQL database using a user-defined SQL query. The origin returns data as a map with column names and field values.
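For example, a typical incremental query might look like the following sketch, assuming the origin supports an offset placeholder in the user-defined query (the invoice table and id offset column are illustrative):
    SELECT * FROM invoice WHERE id > ${OFFSET} ORDER BY id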
Oracle CDC Client
The Oracle CDC Client processes change data capture (CDC) information provided by Oracle LogMiner redo logs. Use the origin to process data from Oracle 11g or 12c.
Oracle Multitable Consumer
The Oracle Multitable Consumer origin reads data from an Oracle database using table names. Use the origin to read from multiple tables in the same database. For example, you might use the origin to perform database replication.
Oracle Query Consumer
The Oracle Query Consumer origin reads data from an Oracle database using a user-defined SQL query. The origin returns data as a map with column names and field values.
PostgreSQL Multitable Consumer
The PostgreSQL Multitable Consumer origin reads data from a PostgreSQL database using table names. Use the origin to read from multiple tables in the same database. For example, you might use the origin to perform database replication.
PostgreSQL Query Consumer
The PostgreSQL Query Consumer origin reads data from a PostgreSQL database using a user-defined SQL query. The origin returns data as a map with column names and field values.
Salesforce
The Salesforce origin reads data from Salesforce.
SQL Server Multitable Consumer
The SQL Server Multitable Consumer origin reads data from a Microsoft SQL Server database using table names. Use the origin to read from multiple tables in the same database. For example, you might use the origin to perform database replication.
SQL Server Query Consumer
The SQL Server Query Consumer origin reads data from a Microsoft SQL Server database using a user-defined SQL query. The origin returns data as a map with column names and field values.
Processors
A processor stage represents a type of data processing that you want to perform. You can use as many processors in a pipeline as you need.
Delay
The Delay processor delays passing a batch to the rest of the pipeline by the specified number of milliseconds. Use the Delay processor to delay pipeline processing.
Expression Evaluator
The Expression Evaluator performs calculations and writes the results to new or existing fields. You can also use the Expression Evaluator to add or modify record header attributes and field attributes.
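For example, expressions like the following sketches could populate new fields (the /price, /quantity, and /name field paths are illustrative):
    Line total: ${record:value('/price') * record:value('/quantity')}
    Uppercase a string field: ${str:toUpper(record:value('/name'))}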
Field Flattener
The Field Flattener processor flattens list and map fields. The processor can flatten the entire record to produce a record with no nested fields. Or it can flatten specific list or map fields.
Field Merger
The Field Merger merges one or more fields in a record to a different location in the record. Use the processor only with records that have a list or map structure.
Field Order
The Field Order processor orders fields in a map or list_map field and outputs the fields into a list_map or list root field.
Field Pivoter
The Field Pivoter processor pivots data in a list, map, or list_map field and creates a record for each item in the field.
Field Remover
The Field Remover processor removes fields from records. Use the processor to discard field data that you do not need in the pipeline.
Field Renamer
Use the Field Renamer to rename fields in a record. You can specify individual fields to rename or use regular expressions to rename sets of fields.
Field Replacer
The Field Replacer processor replaces values in fields with nulls or with new values.
Field Splitter
The Field Splitter processor splits string data based on a regular expression and passes the separated data to new fields.
Field Type Converter
The Field Type Converter processor converts the data types of fields to compatible data types.
Field Zip
The Field Zip processor merges list data from two fields into a single field.
HTTP Client
The HTTP Client processor sends requests to an HTTP resource URL and writes responses to records.
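For example, assuming the resource URL can include expressions, a per-record lookup request might use a URL like the following sketch (the host and the /id field path are illustrative):
    https://api.example.com/customers/${record:value('/id')}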
JSON Generator
The JSON Generator processor serializes data in a field to a JSON-encoded string. You can serialize data from list, map, or list_map fields.
JSON Parser
The JSON Parser processor parses a JSON object embedded in a string field and passes the parsed data to an output field in the record.
Record Deduplicator
The Record Deduplicator processor evaluates records for duplicate data and routes data to two streams - one for unique records and one for duplicate records.
Salesforce Lookup
The Salesforce Lookup processor performs lookups in a Salesforce object and passes the lookup values to fields.
Stream Selector
The Stream Selector passes data to streams based on conditions. Define a condition for each stream of data that you want to create. The Stream Selector uses a default stream to pass records that do not match user-defined conditions.
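For example, a condition like the following sketch routes matching records to one stream (the /country field path and 'US' value are illustrative):
    ${record:value('/country') == 'US'}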
Destinations
A destination stage represents the target for a pipeline. You can use one or more destinations in a pipeline.
Amazon S3
The Amazon S3 destination writes data to Amazon S3.
Azure Data Lake Storage Gen1
The Azure Data Lake Storage Gen1 destination writes data to Microsoft Azure Data Lake Storage Gen1.
Azure Data Lake Storage Gen2
The Azure Data Lake Storage Gen2 destination writes data to Microsoft Azure Data Lake Storage Gen2.
Azure Event Hub Producer
The Azure Event Hub Producer writes data to Microsoft Azure Event Hub.
Azure IoT Hub Producer
The Azure IoT Hub Producer destination writes data to Microsoft Azure IoT Hub.
Azure SQL Data Warehouse
The Azure SQL Data Warehouse destination loads data into one or more tables in Microsoft Azure SQL Data Warehouse.
Google BigQuery
The Google BigQuery destination streams data into Google BigQuery.
Google Cloud Storage
The Google Cloud Storage destination writes data to objects in Google Cloud Storage.
Google Pub/Sub Publisher
The Google Pub/Sub Publisher destination publishes messages to a Google Pub/Sub topic.
Kinesis Firehose
The Kinesis Firehose destination writes data to an Amazon Kinesis Firehose delivery stream.
Kinesis Producer
The Kinesis Producer destination writes data to Amazon Kinesis Streams.
MySQL Producer
The MySQL Producer destination writes data to a MySQL database table.
Oracle Producer
The Oracle Producer destination writes data to an Oracle database table.
PostgreSQL Producer
The PostgreSQL Producer destination writes data to a PostgreSQL database table.
Salesforce
The Salesforce destination writes data to Salesforce objects.
Snowflake
The Snowflake destination writes data to one or more tables in a Snowflake database. You can use the Snowflake destination to write to any accessible Snowflake database, including databases hosted on Amazon Web Services or Microsoft Azure and private Snowflake installations.
SQL Server Producer
The SQL Server Producer destination writes data to a Microsoft SQL Server database table.
To Error
The To Error destination passes records to the pipeline for error handling. Use the To Error destination to send a stream of records to pipeline error handling.
Trash
The Trash destination discards records. Use the Trash destination as a visual representation of records discarded from the pipeline. Or, you might use the Trash destination during development as a temporary placeholder.
Executors
An executor stage triggers a task when it receives an event.
ADLS Gen1 File Metadata
The ADLS Gen1 File Metadata executor changes file metadata, creates an empty file, or removes a file or directory in Azure Data Lake Storage Gen1 each time it receives an event.
ADLS Gen2 File Metadata
The ADLS Gen2 File Metadata executor changes file metadata, creates an empty file, or removes a file or directory in Azure Data Lake Storage Gen2 each time it receives an event.
Amazon S3
The Amazon S3 executor performs a task in Amazon S3 each time it receives an event.
Databricks Delta Lake
The Databricks Delta Lake executor runs one or more Spark SQL queries on a Delta Lake table on Databricks each time it receives an event record. Use the executor as part of an event stream in the pipeline.
Pipeline Finisher
When the Pipeline Finisher executor receives an event, the executor stops a pipeline and transitions it to a Finished state. This allows the pipeline to complete all expected processing before stopping.
Overview
You can preview pipeline data to help build or fine-tune a pipeline.
Preview Codes
A preview displays different colors for different types of data. It also uses other codes and formatting to highlight changed fields.
Previewing a Pipeline
Editing Preview Data
You can edit preview data to view how a stage or group of stages processes the changed data. Edit preview data to test for data conditions that might not appear in the preview data set.
Overview
Dataflow triggers are instructions for the event framework to kick off tasks in response to events that occur in the pipeline.
Pipeline Event Generation
The event framework generates pipeline events at specific points in the pipeline lifecycle. You can configure the pipeline properties to pass each event to an executor for more complex processing.
Stage Event Generation
You can configure certain stages to generate events. Event generation differs from stage to stage, based on the way the stage processes data.
Executors
Executors perform tasks when they receive event records.
Logical Pairings
Although you can use events in any way that works for your needs, certain event-generating stages pair logically with certain executors.
Event Records
Event records are records that are created when a stage or pipeline event occurs.
Executing Pipeline Events in Preview
You can enable pipeline event execution in preview to test event processing.
Summary
Overview
Avro Data Format
Binary Data Format
Datagram Data Format
Delimited Data Format
Excel Data Format
Log Data Format
When you use an origin to read log data, you define the format of the log files to be read.
NetFlow Data Processing
SDC Record Data Format
Text Data Format with Custom Delimiters
Whole File Data Format
You can use the whole file data format to transfer entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file.
Reading and Processing XML Data
Writing XML Data
Overview
Define rules for a pipeline to capture information about the pipeline as it runs. Rules trigger alerts to notify you when a specified condition occurs.
Metric Rules and Alerts
Metric rules and alerts provide notifications about real-time statistics for pipelines.
Data Rules and Alerts
Data rules define information that you want to see about the data that passes between stages. You can create data rules based on any link in the pipeline. You can also enable meters and create alerts for data rules.
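For example, a data rule condition might look like the following sketch (the /status field path and 'ERROR' value are illustrative):
    ${record:value('/status') == 'ERROR'}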
Data Drift Rules and Alerts
You can create data drift rules to indicate when the structure of data changes. You can create data drift rules on any link in the pipeline. You can also enable meters and create alerts for data drift rules.
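For example, assuming drift functions are available in drift rule conditions, a condition like the following sketch triggers an alert when the field names in the root map of a record change (the second argument ignores records where the field path does not exist):
    ${drift:names('/', true)}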
Alert Webhooks
You can configure webhooks that are sent when alerts are triggered.
Overview
After creating and configuring a pipeline, you run the pipeline to start the flow of data from the origin to destination systems. Each pipeline runs until you stop the pipeline.
Running a Pipeline
When you run a pipeline, you start the flow of data from the origin to destination systems. Each pipeline runs until you stop the pipeline.
Stopping a Pipeline
Stop a pipeline when you want to stop processing data for the pipeline.
Exporting a Pipeline
Export a pipeline to create a backup or to share the pipeline with another StreamSets Cloud user.
Importing a Pipeline
Import a pipeline to restore a backup file or to use a pipeline shared by another StreamSets Cloud user.
Deleting a Pipeline
You can delete a pipeline when you no longer need the pipeline.
Overview
You can monitor the health and performance of running pipelines.
Viewing Pipeline and Stage Statistics
When you monitor a running pipeline, you can view real-time statistics for the pipeline and for stages in the pipeline.
Monitoring Stage Errors
When you monitor a pipeline, you can view a sampling of the error records. You can also view error statistics for each stage.
Snapshots
A snapshot is a set of data captured as it moves through a running pipeline.
Viewing Pipeline Run History
You can view the run history of a pipeline when you configure or monitor the pipeline.
Pipeline Logs
StreamSets Cloud generates log data when you preview a pipeline or run a pipeline.
Stopping a Pipeline After Processing All Available Data
This solution describes how to design a pipeline that stops automatically after it finishes processing all available data.
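For example, assuming the origin generates a no-more-data event, a Pipeline Finisher executor with a precondition like the following sketch stops the pipeline only when that event arrives:
    ${record:eventType() == 'no-more-data'}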
Managing Output Files
This solution describes how to design a pipeline that writes output files to a destination, and then moves the files to a different location and changes the permissions for the files.
Preserving an Audit Trail of Events
This solution describes how to design a pipeline that preserves an audit trail of pipeline and stage events that occur.
Loading Data into Databricks Delta Lake
You can use several StreamSets Cloud solutions to load data into a Delta Lake table on Databricks.
Overview
The Administration page includes administrative information about your StreamSets Cloud account.
Viewing Active Sessions
You can view details about your active user session.
Metering
You can view metering details about your pipeline hour usage. StreamSets Cloud charges for usage based on pipeline hours.
Expression Language
Functions
Constants
The expression language provides constants for use in expressions.
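For example, assuming the typed NULL constants, an expression like the following sketch sets a field to an integer null:
    ${NULL_INTEGER}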
Datetime Variables
The expression language provides datetime variables for use in expressions.
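For example, assuming datetime variables can be used in a directory or object name template, a template like the following sketch builds a date-based path (the logs prefix is illustrative):
    logs/${YYYY()}-${MM()}-${DD()}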
Literals
The expression language provides literals for use in expressions.
Operators
Reserved Words
Certain words are reserved for the expression language and should not be used as identifiers.