What is StreamSets Cloud?
StreamSets Cloud™ is a cloud-native DataOps platform that you can use to build, run, and monitor your dataflow pipelines.
Requirements
To use StreamSets Cloud, ensure that you meet the following requirements.
Try StreamSets Cloud
This tutorial covers the steps needed to try StreamSets Cloud. Although the tutorial provides a simple use case, keep in mind that StreamSets Cloud is a powerful DataOps platform that enables you to build, run, and monitor large numbers of complex pipelines.
Community Help
For additional help with StreamSets Cloud, sign up for our StreamSets Community on Slack and join the #streamsets-cloud channel.
What is a Pipeline?
A pipeline describes how data flows from origin to destination systems and how the data is processed along the way.
Batches
Data passes through a pipeline in batches.
Designing the Data Flow
You can branch and merge streams of data in a pipeline.
Pipeline Validation
StreamSets Cloud automatically validates pipelines as you design them and when you preview them.
Common Stage Concepts
Pipeline stages include origins, processors, destinations, and executors. Although each pipeline stage functions differently, all pipeline stages share some concepts in common.
Error Record Handling
You configure error record handling at a stage level and at a pipeline level. You also specify the version of the record to use as the basis for the error record.
Expressions
Record Header Attributes
Record header attributes are attributes in record headers that you can use in pipeline logic, as needed.
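For example, an expression like the following reads a header attribute so that you can use it in pipeline logic. The attribute name is hypothetical, shown only as a sketch:

    ${record:attribute('sourceFile')}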
Field Attributes
Field attributes are attributes that provide additional information about each field that you can use in pipeline logic, as needed.
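For example, assuming the expression language provides a record:fieldAttribute function and that an upstream stage sets a precision attribute on the /price field, an expression like the following reads that field attribute:

    ${record:fieldAttribute('/price', 'precision')}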
Development Stages
You can use several development stages to help develop and test pipelines.
Data Types
Pipeline Configuration Overview
After you create a pipeline, you configure specific properties for the pipeline.
Delivery Guarantee
When you configure a pipeline, you define whether you want to prevent the loss of data or the duplication of data.
Test Origin for Preview
To aid in pipeline development, you can preview a pipeline and use a test origin that provides test data. Test origins are not used when running a pipeline.
Event Generation
The event framework generates pipeline events when the pipeline starts and stops.
Retrying the Pipeline
By default, when StreamSets Cloud encounters a stage-level error that might cause a pipeline to fail, it retries the pipeline. That is, it waits a period of time, and then tries again to run the pipeline.
Rate Limit
You can limit the rate at which a pipeline processes records by defining the maximum number of records that the pipeline can read in a second.
Configuring a Pipeline
Configure a pipeline to define the stream of data. After you configure the pipeline, you can run the pipeline.
Origins
An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline.
Amazon S3
The Amazon S3 origin reads objects stored in Amazon S3. The objects should be fully written, and the object names must share a prefix pattern.
Azure Data Lake Storage Gen1
The Azure Data Lake Storage Gen1 origin reads data from Microsoft Azure Data Lake Storage Gen1.
Azure Data Lake Storage Gen2
The Azure Data Lake Storage Gen2 origin reads data from Microsoft Azure Data Lake Storage Gen2.
Azure IoT/Event Hub Consumer
The Azure IoT/Event Hub Consumer origin reads data from Microsoft Azure Event Hub.
Google BigQuery
The Google BigQuery origin executes a query job and reads the result from Google BigQuery.
Google Cloud Storage
The Google Cloud Storage origin reads objects stored in Google Cloud Storage. The objects must be fully written and reside in a single bucket. The object names must share a prefix pattern.
Google Pub/Sub Subscriber
The Google Pub/Sub Subscriber origin consumes messages from a Google Pub/Sub subscription.
HTTP Client
The HTTP Client origin reads data from an HTTP resource URL.
Kinesis Consumer
The Kinesis Consumer origin reads data from Amazon Kinesis Streams.
MySQL Multitable Consumer
The MySQL Multitable Consumer origin reads data from a MySQL database using table names. Use the origin to read from multiple tables in the same database. For example, you might use the origin to perform database replication.
MySQL Query Consumer
The MySQL Query Consumer origin reads data from a MySQL database using a user-defined SQL query. The origin returns data as a map with column names and field values.
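For example, an incremental query typically orders rows by an offset column and includes an offset placeholder that the origin resolves at run time. The following sketch assumes that the origin supports an ${OFFSET} placeholder and uses a hypothetical invoice table with an id column:

    SELECT * FROM invoice WHERE id > ${OFFSET} ORDER BY id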
Oracle Multitable Consumer
The Oracle Multitable Consumer origin reads data from an Oracle database using table names. Use the origin to read from multiple tables in the same database. For example, you might use the origin to perform database replication.
Oracle Query Consumer
The Oracle Query Consumer origin reads data from an Oracle database using a user-defined SQL query. The origin returns data as a map with column names and field values.
PostgreSQL Multitable Consumer
The PostgreSQL Multitable Consumer origin reads data from a PostgreSQL database using table names. Use the origin to read from multiple tables in the same database. For example, you might use the origin to perform database replication.
PostgreSQL Query Consumer
The PostgreSQL Query Consumer origin reads data from a PostgreSQL database using a user-defined SQL query. The origin returns data as a map with column names and field values.
Salesforce
The Salesforce origin reads data from Salesforce.
SQL Server Multitable Consumer
The SQL Server Multitable Consumer origin reads data from a Microsoft SQL Server database using table names. Use the origin to read from multiple tables in the same database. For example, you might use the origin to perform database replication.
SQL Server Query Consumer
The SQL Server Query Consumer origin reads data from a Microsoft SQL Server database using a user-defined SQL query. The origin returns data as a map with column names and field values.
Processors
A processor stage represents a type of data processing that you want to perform. You can use as many processors in a pipeline as you need.
Delay
The Delay processor delays passing a batch to the rest of the pipeline by the specified number of milliseconds. Use the processor when you want to slow pipeline processing.
Expression Evaluator
The Expression Evaluator performs calculations and writes the results to new or existing fields. You can also use the Expression Evaluator to add or modify record header attributes and field attributes.
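For example, an Expression Evaluator might calculate a total from two existing fields and write the result to a new field. The field names are hypothetical, shown only as a sketch:

    ${record:value('/price') * record:value('/quantity')}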
Field Flattener
The Field Flattener processor flattens list and map fields. The processor can flatten the entire record to produce a record with no nested fields. Or it can flatten specific list or map fields.
Field Merger
The Field Merger merges one or more fields in a record to a different location in the record. Use the processor only for records with a list or map structure.
Field Order
The Field Order processor orders fields in a map or list-map field and outputs the fields into a list-map or list root field.
Field Remover
The Field Remover processor removes fields from records. Use the processor to discard field data that you do not need in the pipeline.
Field Renamer
Use the Field Renamer to rename fields in a record. You can specify individual fields to rename or use regular expressions to rename sets of fields.
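For example, a regular expression with a capture group can strip a suffix from a set of fields. The following configuration is a sketch with hypothetical field names:

    Source field expression: /(.*)_legacy
    Target field expression: /$1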
Field Replacer
The Field Replacer processor replaces values in fields with nulls or with new values.
Field Splitter
The Field Splitter processor splits string data based on a regular expression and passes the separated data to new fields.
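For example, the processor might split a single error field into more specific fields. The following configuration is a sketch with hypothetical field names and a hypothetical separator:

    Field to split: /error
    Separator: \s-\s
    New split fields: /errorCode, /errorMessage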
Field Type Converter
The Field Type Converter processor converts the data types of fields to compatible data types.
HTTP Client
The HTTP Client processor sends requests to an HTTP resource URL and writes responses to records.
JSON Generator
The JSON Generator processor serializes data in a field to a JSON-encoded string. You can serialize data from list, map, or list-map fields.
JSON Parser
The JSON Parser processor parses a JSON object embedded in a string field and passes the parsed data to an output field in the record.
Record Deduplicator
The Record Deduplicator processor evaluates records for duplicate data and routes data to two streams: one for unique records and one for duplicate records.
Salesforce Lookup
The Salesforce Lookup processor performs lookups in a Salesforce object and passes the lookup values to fields.
Stream Selector
The Stream Selector passes data to streams based on conditions. Define a condition for each stream of data that you want to create. The Stream Selector uses a default stream to pass records that do not match user-defined conditions.
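For example, a condition like the following routes matching records to one stream while the default stream catches all remaining records. The field name and value are hypothetical:

    ${record:value('/region') == 'EMEA'}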
Destinations
A destination stage represents the target for a pipeline. You can use one or more destinations in a pipeline.
Amazon S3
The Amazon S3 destination writes data to Amazon S3.
Azure Data Lake Storage Gen1
The Azure Data Lake Storage Gen1 destination writes data to Microsoft Azure Data Lake Storage Gen1.
Azure Data Lake Storage Gen2
The Azure Data Lake Storage Gen2 destination writes data to Microsoft Azure Data Lake Storage Gen2.
Azure Event Hub Producer
The Azure Event Hub Producer writes data to Microsoft Azure Event Hub.
Azure IoT Hub Producer
The Azure IoT Hub Producer destination writes data to Microsoft Azure IoT Hub.
Azure SQL Data Warehouse
The Azure SQL Data Warehouse destination loads data into one or more tables in Microsoft Azure SQL Data Warehouse.
Google BigQuery
The Google BigQuery destination streams data into Google BigQuery.
Google Cloud Storage
The Google Cloud Storage destination writes data to objects in Google Cloud Storage.
Google Pub/Sub Publisher
The Google Pub/Sub Publisher destination publishes messages to a Google Pub/Sub topic.
Kinesis Firehose
The Kinesis Firehose destination writes data to an Amazon Kinesis Firehose delivery stream.
Kinesis Producer
The Kinesis Producer destination writes data to Amazon Kinesis Streams.
MySQL Producer
The MySQL Producer destination writes data to a MySQL database table.
Oracle Producer
The Oracle Producer destination writes data to an Oracle database table.
PostgreSQL Producer
The PostgreSQL Producer destination writes data to a PostgreSQL database table.
Salesforce
The Salesforce destination writes data to Salesforce objects.
Snowflake
The Snowflake destination writes data to one or more tables in a Snowflake database. You can use the Snowflake destination to write to any accessible Snowflake database, including those hosted on Amazon S3, Microsoft Azure, and private Snowflake installations.
SQL Server Producer
The SQL Server Producer destination writes data to a Microsoft SQL Server database table.
To Error
The To Error destination passes records to the pipeline for error handling. Use the To Error destination to send a stream of records to pipeline error handling.
Trash
The Trash destination discards records. Use the Trash destination as a visual representation of records discarded from the pipeline. Or, you might use the Trash destination during development as a temporary placeholder.
Executors
An executor stage triggers a task when it receives an event.
ADLS Gen1 File Metadata
The ADLS Gen1 File Metadata executor changes file metadata, creates an empty file, or removes a file or directory in Azure Data Lake Storage Gen1 each time it receives an event.
ADLS Gen2 File Metadata
The ADLS Gen2 File Metadata executor changes file metadata, creates an empty file, or removes a file or directory in Azure Data Lake Storage Gen2 each time it receives an event.
Amazon S3 Executor
The Amazon S3 executor performs a task in Amazon S3 each time it receives an event.
Pipeline Finisher Executor
When the Pipeline Finisher executor receives an event, the executor stops a pipeline and transitions it to a Finished state. This allows the pipeline to complete all expected processing before stopping.
Dataflow Triggers Overview
Dataflow triggers are instructions for the event framework to kick off tasks in response to events that occur in the pipeline.
Pipeline Event Generation
The event framework generates pipeline events at specific points in the pipeline lifecycle. You can configure the pipeline properties to pass each event to an executor for more complex processing.
Stage Event Generation
You can configure certain stages to generate events. Event generation differs from stage to stage, based on the way the stage processes data.
Executors
Executors perform tasks when they receive event records.
Logical Pairings
You can use events in any way that works for your needs, but certain event-generating stages and executors pair together logically.
Event Records
Event records are records that are created when a stage or pipeline event occurs.
Executing Pipeline Events in Preview
You can enable pipeline event execution in preview to test event processing.
Case Study: Output File Management
Case Study: Stop the Pipeline
Case Study: Event Storage
Summary
Data Formats Overview
Avro Data Format
Binary Data Format
Datagram Data Format
Delimited Data Format
Excel Data Format
Log Data Format
When you use an origin to read log data, you define the format of the log files to be read.
NetFlow Data Processing
SDC Record Data Format
Text Data Format with Custom Delimiters
Whole File Data Format
You can use the whole file data format to transfer entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file.
Reading and Processing XML Data
Writing XML Data
Rules and Alerts Overview
Define rules for a pipeline to capture information about the pipeline as it runs. Rules trigger alerts to notify you when a specified condition occurs.
Metric Rules and Alerts
Metric rules and alerts provide notifications about real-time statistics for pipelines.
Data Rules and Alerts
Data rules define information that you want to see about the data that passes between stages. You can create data rules based on any link in the pipeline. You can also enable meters and create alerts for data rules.
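For example, a data rule condition might flag records that contain a particular value so that an alert can fire when the condition matches more records than you expect. The field name and value are hypothetical, shown only as a sketch:

    ${record:value('/payment_type') == 'credit'}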
Data Drift Rules and Alerts
You can create data drift rules to indicate when the structure of data changes. You can create data drift rules on any link in the pipeline. You can also enable meters and create alerts for data drift rules.
Alert Webhooks
You can configure webhooks that are sent when alerts are triggered.
Pipeline Maintenance Overview
After creating and configuring a pipeline, you run the pipeline to start the flow of data from the origin to destination systems. Each pipeline runs until you stop the pipeline.
Running a Pipeline
When you run a pipeline, you start the flow of data from the origin to destination systems. Each pipeline runs until you stop the pipeline.
Viewing Pipeline Run History
You can view the run history of a pipeline when you configure or monitor the pipeline.
Pipeline Logs
StreamSets Cloud generates log data when you preview a pipeline or run a pipeline.
Stopping a Pipeline
Stop a pipeline when you want to stop processing data for the pipeline.
Exporting a Pipeline
Export a pipeline to create a backup or to share the pipeline with another StreamSets Cloud user.
Importing a Pipeline
Import a pipeline to restore a backup file or to use a pipeline shared by another StreamSets Cloud user.
Deleting a Pipeline
You can delete a pipeline when you no longer need the pipeline.
Pipeline Hours
StreamSets Cloud charges for usage based on pipeline hours.
Expression Language
Functions
Constants
The expression language provides constants for use in expressions.
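For example, assuming the expression language provides typed null constants such as NULL_INTEGER, you might use an expression like the following in an Expression Evaluator to set an integer field to null:

    ${NULL_INTEGER}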
Datetime Variables
The expression language provides datetime variables for use in expressions.
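For example, datetime variables are commonly used to build time-based paths, such as an object prefix for a destination. The prefix below is hypothetical, shown only as a sketch:

    logs/${YYYY()}-${MM()}-${DD()}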
Literals
The expression language includes boolean, integer, floating point, string, and null literals for use in expressions.
Operators
Reserved Words
Certain words are reserved for the expression language and should not be used as identifiers in expressions.