The StreamSets Cloud Beta is Open for Participation!
Today we are opening the StreamSets Cloud Beta program, inviting you to experience and give feedback on the latest addition to the StreamSets product family. StreamSets Cloud is a cloud service for designing, deploying and operating smart data pipelines, combining the ease and scalability of the cloud with the flexibility to execute pipelines anywhere – on-premise, private cloud or public cloud.
StreamSets Cloud provides an integrated user interface to design, deploy, operate and monitor smart data pipelines as a cloud service managed by StreamSets. Beta program participants will be able to use StreamSets Cloud to create pipelines that move data to and from cloud data platforms.
In this short demo, I create a simple pipeline that reads data from MySQL and writes to Snowflake:
Read on to learn more about StreamSets Cloud, and sign up if you want to try it out and give feedback.
What do I use StreamSets Cloud for?
Currently, StreamSets Cloud is best used for ingesting data into Snowflake. For example, you may use it to migrate data from on-premise data sources or a data lake to Snowflake. Or you may use it for cloud-to-cloud pipelines, such as Salesforce to Snowflake. Support for additional design patterns and use cases will come in the future.
Who should use StreamSets Cloud?
Data engineers and data integration developers who want to accelerate the ingestion of data into cloud data platforms.
What is the StreamSets Cloud high-level architecture?
Enterprise data is often dispersed across on-premise subnets, cloud applications, hybrid clouds, and international regions. Large enterprises may also need multiple environments, for example, development, test, and production. To meet these requirements and ensure flexibility and scalability across platforms and deployment architectures, StreamSets Cloud has two core components: the StreamSets Cloud service itself and Execution Environments.
The StreamSets Cloud service provides an integrated, cloud-based user experience for designing, deploying and monitoring your pipelines across your entire organization, and for managing all of your Execution Environments.
An Execution Environment provides a scalable location for Pipeline Executors to run pipelines. StreamSets Cloud Beta provides a fully managed Execution Environment service, so you don’t have to install or set up the environment. You can just build your dataflow pipelines, and we run them on our hosted service.
What connectors are available today?
Currently, StreamSets Cloud pipelines can read data from:
- Amazon S3
- Azure Data Lake
- Azure IoT/Event Hub
- Google BigQuery
- Google Cloud Storage
- Google Pub/Sub
- Microsoft SQL Server
In this initial beta release, pipelines can write to Snowflake, but more destinations are coming soon!
What is the plan for making additional connectors available during beta?
We are optimizing existing connectors for cloud before we make them available. Expect new connectors to be added to the palette weekly throughout the beta period and beyond.
Can I transform data in the pipeline?
Yes! We are releasing an initial set of processors, such as Expression Evaluator, Field Remover and Stream Selector, that allow you to filter and enrich data as it passes through the pipeline. This video goes into a little more detail, explaining how to filter records and delete fields to ensure that the correct data reaches the destination:
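To make the idea concrete, here is a rough Python sketch of what filtering, enriching and removing fields means for records flowing through a pipeline. It is not the StreamSets Cloud API: in the product these steps are configured in the pipeline designer, and the function names, record shape and field names below are all hypothetical. It also simplifies Stream Selector to a single pass-through condition.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Set

# Hypothetical record shape: a flat mapping of field names to values.
Record = Dict[str, Any]


def stream_selector(records: Iterable[Record],
                    condition: Callable[[Record], bool]) -> Iterator[Record]:
    """Pass through only the records that satisfy the condition (filtering)."""
    return (r for r in records if condition(r))


def expression_evaluator(records: Iterable[Record], field: str,
                         expression: Callable[[Record], Any]) -> Iterator[Record]:
    """Add a derived field to each record (enrichment)."""
    for r in records:
        yield {**r, field: expression(r)}


def field_remover(records: Iterable[Record], fields: Set[str]) -> Iterator[Record]:
    """Drop the named fields from each record."""
    for r in records:
        yield {k: v for k, v in r.items() if k not in fields}


def run_pipeline(records: Iterable[Record]) -> List[Record]:
    """Chain the three steps: filter, enrich, then remove a sensitive field."""
    selected = stream_selector(records, lambda r: r["country"] == "US")
    enriched = expression_evaluator(
        selected, "full_name", lambda r: f"{r['first']} {r['last']}")
    cleaned = field_remover(enriched, {"ssn"})
    return list(cleaned)
```

Calling `run_pipeline` on a list of such dictionaries returns only the records whose `country` is `"US"`, each with a `full_name` field added and the `ssn` field removed, which is the kind of shaping you would want done before the data reaches the destination.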
My organization is not yet in the cloud. Will there be a fully on-premise version of StreamSets Cloud?
The StreamSets Cloud user interface is currently only available as a cloud service. Execution Environments, used for pipeline execution, data movement, data transformation and data source connectivity, can be installed on-premise or in your virtual private cloud. When Execution Environments run under your control, no data ever leaves your environment.
Our StreamSets DataOps Platform provides a suite of products that are more appropriate for customers requiring a fully on-premise solution.
StreamSets has traditionally been open source. Is StreamSets Cloud open source?
StreamSets Data Collector is open source. StreamSets continues to be committed to open source; the StreamSets Cloud Pipeline Executors and connectors will be open source under the Apache 2.0 license, so you have the flexibility to customize and the assurance of always having access to the core code.
The StreamSets Cloud connector open source repository will include all of the standard connectors and most of the ‘enterprise connectors’ that were previously closed source in Data Collector. Note that we may need to maintain certain connectors under a proprietary license if the associated third-party code does not permit open source licensing.
Unlike StreamSets Data Collector, the StreamSets Cloud pipeline design experience is delivered as a cloud service, rather than alongside the runtime engine, and will not be open source.
Where can I find out more about StreamSets Cloud?
The StreamSets Cloud documentation explains how to get started building, running and monitoring pipelines. You can also ask questions in the #streamsets-cloud channel on our community Slack team – sign up here for access.
Great! Where do I start?
Sign up for the StreamSets Cloud beta program here.