Frequently Asked Questions
About StreamSets DataOps Platform

What is DataOps?

DataOps is a practice encompassing people, process and technology to operationalize data management and integration to deliver rapidly and continuously with confidence. DataOps applies automation and monitoring across the entire data flow lifecycle, from design to deployment to operations, to make data movement highly efficient, drift-ready and continuous.

To put it another way, DataOps applies DevOps practices to data, so you can deliver with speed and confidence in today’s world of data fragmentation and ceaseless change.

What are the key features of the StreamSets DataOps platform?

The StreamSets DataOps Platform operationalizes data flows and enables continuous data delivery by addressing the entire design-deploy-operate lifecycle of data pipelines.

  • Design – You can design simple pipelines or complex dataflows comprising dozens of pipelines in a single topology using a cloud-based visual UI with pre-built origins, destinations and transformations. You can also add your own custom processing stages. Since it’s a shared environment with a pipeline repository, developers can collaborate on pipeline design and reuse. Pipelines can then be previewed, tracking how specific records flow and change, and then published.
  • Deploy – Pipelines are executed in memory on standalone systems or scalable distributed systems using YARN, Mesos or Kubernetes mechanisms. Published pipelines are connected to jobs that can then be executed against specific systems and scheduled or run continuously.
  • Operate – All pipelines, whether batch or streaming, are managed from a single control plane, where you can view pipelines end-to-end, define data performance SLAs, and apply data protection policies. With smart, fully instrumented pipelines, data drift is detected and addressed automatically so you can ensure continuous operations.
  • Automate – Pipelines and data processing jobs can be automated to execute when you specify. The platform can scale for batch jobs and to run on ephemeral cloud infrastructure. Reports and performance read-outs can also be automated.
  • Monitor – Pipelines are fully instrumented so that you have end-to-end visibility and control, with real-time monitoring of metrics such as throughput, latency and error rates. Topologies of multiple interconnected pipelines are visually displayed on a live interactive map where performance hotspots are displayed and operators can drill down to quickly pinpoint and resolve issues.

What are the popular use cases for the StreamSets DataOps platform?

A DataOps Platform can help bring value to many data and analytics projects and is critical for workloads like Data Lakes, Data Warehouse Modernization, Big Data/Hadoop Ingestion, Real-time Event Streaming, Change Data Capture, and IoT/edge device integration. Common business initiatives supported by these use cases include customer 360 engagement, cybersecurity, compliance/risk management, and product research and development.

What data sources and systems does the platform connect?

The DataOps Platform supports most popular relational databases, Big Data (Hadoop), NoSQL, enterprise data warehouses, and managed cloud data services. For a full list of connectors, visit here.

How is DataOps different from ETL?

ETL (extract-transform-load) is a specific design pattern optimized for batch loads in relatively static environments, and is focused on upfront developer productivity (set-and-forget). DataOps is a holistic practice that supports all design patterns including batch ETL and streaming. It is a broad practice encompassing the people, process and technology necessary to operationalize all data integration design patterns across the design-deploy-operate lifecycle to ensure continuous delivery.

What is data drift and how does the StreamSets DataOps platform handle it?

Data drift is the unexpected, unannounced and unending change to data structure, infrastructure, and semantics that occurs as data moves through enterprise systems. It is caused by changes to data structure or semantics as sources and systems are added or upgraded. In traditional data integration systems, which are brittle due to their reliance on full schema mapping, data drift can lead to dropped or corrupted data, or to outright pipeline failures.

The StreamSets platform has intelligent in-pipeline sensors that detect when data drift occurs. The platform can make intelligent recommendations for fixing data drift and help validate that it does not contaminate data.
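
Conceptually, the drift-handling idea can be sketched in a few lines of Python. This is an illustrative model only, not the platform's actual implementation: `detect_drift` and `absorb_drift` are hypothetical names showing how a pipeline stage might flag and absorb structural change instead of failing.

```python
def detect_drift(expected_fields, record):
    """Compare a record's fields against the expected schema and report
    the fields that appeared or disappeared -- the structural drift a
    smart pipeline would flag rather than silently drop."""
    actual, expected = set(record), set(expected_fields)
    return {
        "new_fields": sorted(actual - expected),
        "missing_fields": sorted(expected - actual),
    }

def absorb_drift(expected_fields, record):
    """Adapt instead of breaking: pass new fields through and fill
    missing ones with None so downstream stages keep running."""
    adapted = {f: record.get(f) for f in expected_fields}
    adapted.update({f: v for f, v in record.items() if f not in expected_fields})
    return adapted

schema = ["id", "name"]
record = {"id": 1, "email": "a@example.com"}  # 'name' dropped, 'email' added
drift = detect_drift(schema, record)
# drift == {'new_fields': ['email'], 'missing_fields': ['name']}
```

The key design point is that drift is treated as a signal to act on, not an error to crash on.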

What are smart data pipelines and how do I build them?

Traditional data integration relied on ETL mappings designed to process and move data from a source to a target. Those integrations were built for a static environment where changes are planned weeks and months out, so they were done in a set-it-and-forget-it manner, with no smarts built in, because it was assumed nothing would change. Smart data pipelines are designed for the world of today, where change is frequent and often unexpected. Smart data pipelines are instrumented throughout, and are able to sense and respond to changes and issues on the fly, even as they are running. Where traditional data integrations break, smart data pipelines ensure continuous operations and let you harness change for business innovation.

You design smart data pipelines in Control Hub using the Control Hub Pipeline Designer. A pipeline describes the flow of data from an origin system to destination systems and defines how to transform the data along the way. The Control Hub includes a comprehensive tutorial describing how to create pipelines.

Can pipelines be automated?

Yes – pipelines can be automated through a variety of mechanisms. External scripts and applications can use the CLI and REST API to start and stop pipelines and jobs (and, in fact, perform any action available from the UI). In addition, StreamSets Control Hub includes a scheduler that allows jobs to be run on a regular basis.
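
As a sketch of what a REST-driven automation script might look like, the snippet below builds a start/stop request for a job. The base URL and the endpoint path are illustrative assumptions, not the documented Control Hub API; consult the REST API reference for the real endpoints.

```python
def build_job_request(base_url, job_id, action):
    """Build a hypothetical REST request for starting or stopping a job.
    The path '/jobrunner/rest/v1/job/...' is an assumed, illustrative
    endpoint -- check the Control Hub API docs for the real one."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return {
        "method": "POST",
        "url": f"{base_url}/jobrunner/rest/v1/job/{job_id}/{action}",
        "headers": {"X-Requested-By": "automation-script"},
    }

req = build_job_request("https://cloud.streamsets.com", "job-123", "start")
# req["url"] ends with "/job/job-123/start"
```

An external scheduler or CI system could issue such calls to start and stop jobs on its own cadence.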

How do I manage pipeline deployment?

Control Hub’s labels allow you to control the relationship between jobs and data collector instances. When you start a job, it runs on those Data Collector instances with matching labels. You can label a job ‘dev’ to run only on development instances, then change the label to ‘prod’ to run in production.
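
The label-matching behavior can be sketched as a simple set intersection. This is a conceptual model of the mechanism described above, not Control Hub's actual code; the collector names are made up.

```python
def eligible_collectors(job_labels, collectors):
    """Return the Data Collector instances whose labels intersect the
    job's labels -- a sketch of how a job is routed to matching instances."""
    wanted = set(job_labels)
    return [name for name, labels in collectors.items() if wanted & set(labels)]

fleet = {
    "sdc-dev-1":  ["dev"],
    "sdc-prod-1": ["prod"],
    "sdc-prod-2": ["prod", "eu-west"],
}
eligible_collectors(["prod"], fleet)  # ['sdc-prod-1', 'sdc-prod-2']
```

Relabeling a job from 'dev' to 'prod' therefore moves it to a different set of execution instances without touching the pipeline itself.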

What are topologies, why are they important and how do I create them?

Topologies give you an end-to-end bird's-eye view of data movement within your organization. Create a topology from the Control Hub UI, then add the relevant jobs. Control Hub will automatically link jobs where it can match the destination of one job to the origin of another.
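
The auto-linking rule (match one job's destination to another's origin) can be sketched as a small graph-building function. The job records below are hypothetical examples of the idea, not Control Hub's internal data model.

```python
def link_jobs(jobs):
    """Sketch of topology auto-linking: connect job A to job B when
    A's destination system matches B's origin system."""
    edges = []
    for a in jobs:
        for b in jobs:
            if a is not b and a["destination"] == b["origin"]:
                edges.append((a["name"], b["name"]))
    return edges

jobs = [
    {"name": "ingest", "origin": "mysql", "destination": "kafka"},
    {"name": "enrich", "origin": "kafka", "destination": "s3"},
]
link_jobs(jobs)  # [('ingest', 'enrich')]
```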

How do I monitor dataflows?

You can monitor dataflows from Control Hub by selecting a topology; for more detail, you can drill down into an individual job, then again to see the pipeline running on a given Data Collector instance.

What are Data SLAs and how do I enact them?

A data SLA (service level agreement) defines the data processing rates that jobs within a topology must meet. In addition to measuring the performance of jobs within a topology, you can configure data SLAs to define the expected thresholds of the data throughput rate or the error record rate. Data SLAs trigger an alert when the specified threshold is reached. Data SLA alerts provide immediate feedback on the data processing rates expected by your team. They enable you to monitor your data operations and quickly investigate issues that arise, ensuring that data is delivered in a timely manner. You configure data SLAs at the job level.
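
The threshold logic behind a data SLA alert can be sketched as follows. The metric names and parameters here are illustrative, not the platform's configuration schema.

```python
def check_data_sla(metrics, min_throughput=None, max_error_rate=None):
    """Illustrative SLA check: return alerts when throughput falls below,
    or the error-record rate rises above, a configured threshold."""
    alerts = []
    if min_throughput is not None and metrics["records_per_sec"] < min_throughput:
        alerts.append(
            f"throughput {metrics['records_per_sec']}/s below SLA {min_throughput}/s")
    if max_error_rate is not None and metrics["error_rate"] > max_error_rate:
        alerts.append(
            f"error rate {metrics['error_rate']:.1%} above SLA {max_error_rate:.1%}")
    return alerts

check_data_sla({"records_per_sec": 800, "error_rate": 0.02},
               min_throughput=1000, max_error_rate=0.01)
# -> two alerts: throughput too low, error rate too high
```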

What are data protection policies and how do I implement them?

Each job uses a protection policy that protects sensitive data upon reading data, and another that protects sensitive data upon writing data. The users who configure jobs can specify any read or write policy that is available to them.

Protection policies depend on classification rules to identify sensitive data; Data Protector provides StreamSets classification rules to identify general sensitive data, such as birth dates and IP addresses, as well as international data, such as driver's license numbers for different countries. You create additional classification rules to identify sensitive data not recognized by the StreamSets rules, such as industry-specific or organization-specific data.

Classification rules operate at an organization level, so once you create and commit a classification rule, that rule is available throughout the organization. It can then be used by any policy to identify the data that the policy is meant to protect.
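
The classification idea can be sketched as pattern rules applied to record fields. These two regexes are deliberately simplistic stand-ins; Data Protector's actual rules are far richer and not regex snippets you configure by hand.

```python
import re

# Illustrative classification rules only -- toy patterns, not the
# StreamSets-provided rules.
CLASSIFICATION_RULES = {
    "ip_address": re.compile(r"^\d{1,3}(\.\d{1,3}){3}$"),
    "birth_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def classify(record):
    """Tag each field with the categories of sensitive data it matches."""
    tags = {}
    for field, value in record.items():
        matched = [name for name, rx in CLASSIFICATION_RULES.items()
                   if isinstance(value, str) and rx.match(value)]
        if matched:
            tags[field] = matched
    return tags

classify({"host": "10.0.0.1", "dob": "1990-04-01", "note": "hello"})
# {'host': ['ip_address'], 'dob': ['birth_date']}
```

A protection policy would then act on the tagged fields, e.g. masking or hashing them on read or write.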

Does the DataOps Platform support Kubernetes, and how does it manage high availability (HA)?

Yes, the DataOps Platform fully supports Kubernetes. Control Hub’s Provisioning Agent is a containerized application that runs in Kubernetes. The agent communicates with Control Hub to automatically provision Data Collector containers in the Kubernetes cluster in which it runs. Provisioning includes deploying, registering, starting, scaling, and stopping the Data Collector containers.

How are roles and users managed in the DataOps Platform?

Control Hub can authenticate users using a default user store, or delegate authentication to an enterprise identity provider via SAML. In both cases, users can be added to groups, and users and groups are assigned roles to govern the tasks that they can perform.
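
The group-and-role model can be sketched as two lookups: user to groups, and roles to permitted tasks. The group, role, and task names below are hypothetical examples of the pattern, not Control Hub's built-in roles.

```python
def can_perform(user, task, group_roles, role_tasks):
    """Sketch of group-based authorization: a user may perform a task
    if any role granted to any of their groups allows it."""
    roles = set()
    for group in user["groups"]:
        roles |= set(group_roles.get(group, []))
    return any(task in role_tasks.get(role, []) for role in roles)

group_roles = {"data-eng": ["pipeline-editor"], "ops": ["job-operator"]}
role_tasks = {"pipeline-editor": ["edit-pipeline"],
              "job-operator": ["start-job", "stop-job"]}
alice = {"name": "alice", "groups": ["ops"]}
can_perform(alice, "start-job", group_roles, role_tasks)      # True
can_perform(alice, "edit-pipeline", group_roles, role_tasks)  # False
```

With SAML delegation, the identity provider supplies the user and group memberships, while role assignment stays in Control Hub.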

What can I monitor with the platform?

There are several levels of monitoring available in the platform. At the topology level, you can monitor the throughput of jobs, and configure data SLAs to define the expected thresholds of the data throughput rate or the error record rate. At the job level, you can view real-time statistics, error information, logs, and alerts about all remote pipeline instances running as part of the job. Finally, at the pipeline level, you can drill down into individual pipeline stages to see where records are flowing, where processing time is being spent, and even take snapshots of live data to examine record field values as data is being processed in the pipeline.

Can I deploy to any of the public clouds? If so, which ones and how?

StreamSets Data Collector can be installed on any machine including ones hosted on Amazon, Azure, and Google Cloud as long as the machines meet the requirements outlined here.

Note that StreamSets Data Collector can also be installed as a Docker container. For details on that, click here. And StreamSets Data Collector is also available in AWS Marketplace and Azure Marketplace.

StreamSets Control Hub can be accessed in one of two ways:

  1. In the cloud at
  2. On-premises, installed on any machine, including ones hosted on Amazon, Azure, and Google Cloud, as long as the machines meet the requirements outlined here.

How do I scale out the platform/jobs/pipelines?

You can create StreamSets jobs to run pipelines on groups of Data Collectors that are either manually administered or automatically provisioned on a container orchestration framework such as Kubernetes. When you start a job on a group of Data Collectors, StreamSets Control Hub remotely runs the pipeline on each execution component in the group. This enables you to manage and orchestrate large scale dataflows across multiple Data Collectors.
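
The run-on-each-member model can be sketched in one function: starting a job launches one remote pipeline instance per Data Collector in the group. The naming scheme below is purely illustrative.

```python
def start_job_on_group(pipeline, group):
    """Sketch of scale-out: starting a job runs one remote instance of
    the pipeline on each Data Collector in the group."""
    return {collector: f"{pipeline}@{collector}" for collector in group}

start_job_on_group("cdc-to-kafka", ["sdc-1", "sdc-2", "sdc-3"])
# one pipeline instance per collector in the group
```

Combined with auto-provisioning on Kubernetes, growing the group grows the number of running instances without redesigning the pipeline.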

Does the platform support pipeline versioning? If so, how?

Yes, the pipeline repository in StreamSets Control Hub maintains the version history for each pipeline. StreamSets Control Hub also enables you to view and compare versions. When comparing versions, StreamSets Control Hub displays the versions side-by-side in the canvas, highlighting the differences between the versions. For more details, click here.
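
The kind of comparison shown side-by-side in the canvas can be sketched as a diff over two pipeline versions. The pipeline structure below (a stage-name to configuration mapping) is a simplified stand-in for the real pipeline format.

```python
def diff_pipeline_versions(v_old, v_new):
    """Sketch of version comparison: report stages added, removed, or
    reconfigured between two pipeline versions."""
    old, new = set(v_old), set(v_new)
    return {
        "added":   sorted(new - old),
        "removed": sorted(old - new),
        "changed": sorted(s for s in old & new if v_old[s] != v_new[s]),
    }

v1 = {"origin/jdbc": {"batch": 100}, "dest/s3": {"bucket": "raw"}}
v2 = {"origin/jdbc": {"batch": 500}, "dest/kafka": {"topic": "events"}}
diff_pipeline_versions(v1, v2)
# {'added': ['dest/kafka'], 'removed': ['dest/s3'], 'changed': ['origin/jdbc']}
```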

Can I interact with the platform programmatically? If so, how?

Yes, you can interact with StreamSets programmatically using the StreamSets SDK for Python (3.4+) and the REST APIs. For more details, click here.

Which StreamSets products does the platform include?

The platform includes StreamSets Control Hub, the cloud-based control plane for designing, deploying, and monitoring pipelines; StreamSets Data Collector, the engine that executes pipelines; and StreamSets Data Protector, which classifies and protects sensitive data.

Can I schedule jobs to run at specific times?

Yes. Control Hub’s scheduler can start, stop, or upgrade jobs on a regular basis.

Can I use my enterprise identity provider to log in to the platform?

Yes. Control Hub integrates with enterprise identity providers via the SAML protocol.

Is the platform open source? Can paying customers have access to the source code?

StreamSets operates on a Hybrid Open-Source Model (HOSS). StreamSets Data Collector is open source, released under the Apache 2.0 license. The remainder of the platform source code is proprietary and not available to customers.

