Frequently Asked Questions
About StreamSets DataOps Platform

What is DataOps?

DataOps is the application of DevOps practices to modern data management and integration. DataOps reduces the cycle time of data analytics, in close alignment with business objectives.

What are the key features of the DataOps platform?

  • Design – You can design simple pipelines or complex dataflows comprising dozens of pipelines in a single topology, using a cloud-based visual UI with pre-built origins, destinations, and transformations. You can also add your own custom processing stages. Because it is a shared environment with a pipeline repository, developers can collaborate on pipeline design and reuse. Pipelines can then be previewed, tracking how specific records flow and change, and then published.
  • Deploy – Pipelines are executed in memory on standalone systems or on scalable distributed systems using YARN, Mesos, or Kubernetes. Published pipelines are connected to jobs that can then be executed against specific systems and scheduled or run continuously.
  • Automate – Pipelines and data processing jobs can be automated to execute when you specify. The platform can scale for batch jobs and run on ephemeral cloud infrastructure. Reports and performance read-outs can also be automated.
  • Monitor – Once pipelines are running, their throughput, latency, and error rates can be monitored in real time. Topologies of multiple interconnected pipelines are visually displayed on a live interactive map where performance hotspots are highlighted and operators can drill down to quickly pinpoint and resolve issues.

What are the popular use cases for the platform?

A DataOps Platform can help bring value to many data and analytics projects and is critical for workloads like Cybersecurity, Customer 360, IoT Development, Data Warehouse Modernization, Big Data/Hadoop Ingestion, Real-time Event Streaming, Cloud Migrations, Change Data Capture, and GDPR/PCI DSS Compliance.

What data sources and systems will the platform connect?

The DataOps Platform supports most popular RDBMS, Big Data (Hadoop), NoSQL, EDW, and managed cloud data services. For a full list of connectors, visit here.

How is DataOps different than ETL?

ETL is a design-time paradigm (set-and-forget); DataOps is a holistic approach that spans both design time and run time (continuous).

What is data drift and how does the platform handle it?

Data drift is the set of unexpected, unannounced, and unending changes to data structure, infrastructure, and semantics that occur as data moves through enterprise systems. It is caused by changes to data structure or semantics as sources and systems are added or upgraded. In traditional data integration systems, which are brittle due to their reliance on full schema mapping, data drift can lead to dropped or corroded data, or to outright pipeline failures.
The StreamSets platform has intelligent sensors in-pipeline that detect when data drift occurs. The platform can make intelligent recommendations for fixing data drift and help validate that it does not contaminate data.

How do I build pipelines?

You design pipelines in Control Hub using the Control Hub Pipeline Designer. A pipeline describes the flow of data from an origin system to destination systems and defines how to transform the data along the way. The Control Hub includes a comprehensive tutorial describing how to create pipelines.
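If you prefer to script pipeline creation rather than use the visual designer, the StreamSets SDK for Python (covered later in this FAQ) exposes a pipeline builder. The following is a minimal sketch, assuming SDK 3.x-style interfaces, a Data Collector already registered with Control Hub, and illustrative credentials and stage names:

```python
# Minimal sketch: building and publishing a pipeline with the StreamSets SDK
# for Python. SDK 3.x-style interfaces are assumed; the URL, credentials, and
# stage names below are illustrative and may differ in your environment.
from streamsets.sdk import ControlHub

sch = ControlHub('https://cloud.streamsets.com',
                 username='user@example.com', password='password')

# Use a Data Collector already registered with Control Hub as the authoring
# engine for the pipeline builder.
sdc = sch.data_collectors[0]
builder = sch.get_pipeline_builder(data_collector=sdc)

# Wire a simple development origin straight to a Trash destination.
origin = builder.add_stage('Dev Raw Data Source')
trash = builder.add_stage('Trash')
origin >> trash

pipeline = builder.build('My First Pipeline')
sch.publish_pipeline(pipeline, commit_message='Initial version')
```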

Can pipelines be automated?

Yes – pipelines can be automated through a variety of mechanisms. External scripts and applications can use the CLI and REST API to start and stop pipelines and jobs (and, in fact, perform any action available from the UI). In addition, StreamSets Control Hub includes a scheduler that allows jobs to be run on a regular basis.
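As an illustration of the REST route, an external script can authenticate against Control Hub and then start a job. The endpoint paths, header names, and job ID in this sketch are assumptions based on the legacy Control Hub REST API and may differ in your deployment; verify them against the REST API reference for your version:

```python
# Minimal sketch: starting a Control Hub job from an external script via the
# REST API. The endpoint paths, header names, and job ID are assumptions
# based on the legacy Control Hub REST API; verify them against your
# version's REST API reference before relying on them.
import requests

BASE_URL = 'https://cloud.streamsets.com'

# Authenticate and capture the session token returned by Control Hub.
login = requests.post(
    f'{BASE_URL}/security/public-rest/v1/authentication/login',
    json={'userName': 'user@example.com', 'password': 'password'},
    headers={'X-Requested-By': 'automation-script'},
)
auth_token = login.headers['X-SS-User-Auth-Token']

# Start an existing job by its ID using the captured token.
job_id = 'replace-with-an-existing-job-id'
requests.post(
    f'{BASE_URL}/jobrunner/rest/v1/job/{job_id}/start',
    headers={
        'X-SS-User-Auth-Token': auth_token,
        'X-Requested-By': 'automation-script',
        'Content-Type': 'application/json',
    },
)
```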

How do I manage pipeline deployment?

Control Hub’s labels allow you to control the relationship between jobs and Data Collector instances. When you start a job, it runs on those Data Collector instances with matching labels. You can label a job ‘dev’ to run only on development instances, then change the label to ‘prod’ to run in production.
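If you manage deployments from scripts, the same label switch can be made programmatically. Here is a minimal sketch using the StreamSets SDK for Python, assuming SDK 3.x-style interfaces; the job name, credentials, and attribute/method names are illustrative and may differ by SDK version:

```python
# Minimal sketch: retargeting a job from 'dev' to 'prod' Data Collectors by
# swapping its labels. SDK 3.x-style interfaces are assumed; the job name and
# the attribute/method names are illustrative and may differ by SDK version.
from streamsets.sdk import ControlHub

sch = ControlHub('https://cloud.streamsets.com',
                 username='user@example.com', password='password')

job = sch.jobs.get(job_name='My Job')

# Point the job at Data Collectors carrying the 'prod' label instead of 'dev'.
job.data_collector_labels = ['prod']
sch.update_job(job)

# The job now runs only on Data Collector instances labeled 'prod'.
sch.start_job(job)
```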

What are topologies, why are they important and how do I create them?

Topologies give you an end-to-end, bird's-eye view of data movement within your organization. Create a topology from the Control Hub UI, then add the relevant jobs. Control Hub automatically links jobs where it can match the destination of one job to the origin of another.

How do I monitor dataflows?

You can monitor dataflows from Control Hub by selecting a topology; for more detail, you can drill down into an individual job, then again to see the pipeline running on a given Data Collector instance.

What are Data SLAs and how do I enact them?

A data SLA (service level agreement) defines the data processing rates that jobs within a topology must meet. You configure data SLAs to define the expected thresholds of the data throughput rate or the error record rate; a data SLA triggers an alert when the specified threshold is reached. Data SLA alerts provide immediate feedback on the data processing rates expected by your team. They enable you to monitor your data operations and quickly investigate issues that arise, ensuring that data is delivered in a timely manner. You configure data SLAs at the job level, within a topology.

What are data protection policies and how do I implement them?

Each job uses a protection policy that protects sensitive data upon reading data, and another that protects sensitive data upon writing data. The users who configure jobs can specify any read or write policy that is available to them.

Protection policies depend on classification rules to identify sensitive data; Data Protector provides StreamSets classification rules to identify general sensitive data, such as birth dates and IP addresses, as well as international data, such as driver's license numbers for different countries. You create additional classification rules to identify sensitive data not recognized by the StreamSets rules, such as industry-specific or organization-specific data.

Classification rules operate at an organization level, so once you create and commit a classification rule, that rule is available throughout the organization. It can then be used by any policy to identify the data that the policy is meant to protect.

Does the platform support Kubernetes, how does it manage high availability (HA)?

Yes, the DataOps Platform fully supports Kubernetes. Control Hub’s Provisioning Agent is a containerized application that runs in Kubernetes. The agent communicates with Control Hub to automatically provision Data Collector containers in the Kubernetes cluster in which it runs. Provisioning includes deploying, registering, starting, scaling, and stopping the Data Collector containers.

What role and user management does the platform have?

Control Hub can authenticate users using a default user store, or delegate authentication to an enterprise identity provider via SAML. In both cases, users can be added to groups, and users and groups are assigned roles to govern the tasks that they can perform.

What can I monitor with the platform?

There are several levels of monitoring available in the platform. At the topology level, you can monitor the throughput of jobs and configure data SLAs to define the expected thresholds of the data throughput rate or the error record rate. At the job level, you can view real-time statistics, error information, logs, and alerts about all remote pipeline instances running as part of the job. Finally, at the pipeline level, you can drill down into individual pipeline stages to see where records are flowing, where processing time is being spent, and even take snapshots of live data to examine record field values as data is being processed in the pipeline.

Can I deploy to any of the public clouds? If so, which ones and how?

StreamSets Data Collector (SDC) can be installed on any machine, including ones hosted on Amazon, Azure, and Google Cloud, as long as the machines meet the requirements outlined here.
SDC can also be installed as a Docker container; for details, click here. SDC is also available in the AWS Marketplace and Azure Marketplace.
StreamSets Control Hub (SCH) can be accessed in one of two ways:

  1. In the cloud at cloud.streamsets.com
  2. On premises, installed on any machine, including ones hosted on Amazon, Azure, and Google Cloud, as long as the machines meet the requirements outlined here.

How do I scale out the platform/jobs/pipelines?

You can create StreamSets jobs to run pipelines on groups of Data Collectors that are either manually administered or automatically provisioned on a container orchestration framework such as Kubernetes. When you start a job on a group of Data Collectors, StreamSets Control Hub remotely runs the pipeline on each execution component in the group. This enables you to manage and orchestrate large scale dataflows across multiple Data Collectors.

Does the platform support pipeline versioning? If so, how?

Yes, the pipeline repository in StreamSets Control Hub (SCH) maintains the version history for each pipeline. SCH also enables you to view and compare versions. When comparing versions, SCH displays them side by side in the canvas, highlighting the differences between the versions. For more details, click here.

Can I interact with the platform programmatically? If so, how?

Yes, you can interact with StreamSets programmatically using the StreamSets SDK for Python (3.4+) and the REST APIs. For more details, click here.
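For example, a short script can connect to Control Hub and inspect pipelines and jobs. This is a minimal sketch assuming SDK 3.x-style interfaces; the URL, credentials, and job name are illustrative placeholders:

```python
# Minimal sketch: interacting with Control Hub through the StreamSets SDK for
# Python. SDK 3.x-style interfaces are assumed; the URL, credentials, and job
# name are illustrative placeholders.
from streamsets.sdk import ControlHub

sch = ControlHub('https://cloud.streamsets.com',
                 username='user@example.com', password='password')

# Inspect the pipelines and jobs visible to this user.
for pipeline in sch.pipelines:
    print(pipeline.name)

for job in sch.jobs:
    print(job.job_name, job.status)

# Jobs can also be started and stopped programmatically.
job = sch.jobs.get(job_name='My Job')
sch.start_job(job)
sch.stop_job(job)
```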

Which StreamSets products does the platform include?

The StreamSets DataOps Platform comprises the following products:

  • StreamSets Data Collector
  • StreamSets Control Hub
  • StreamSets Dataflow Performance Manager (DPM)
  • StreamSets Data Protector

Can I schedule jobs to run at specific times?

Yes. Control Hub’s scheduler can start, stop, or upgrade jobs on a regular basis.

Can I use my enterprise identity provider to log in to the platform?

Yes. Control Hub integrates with enterprise identity providers via the SAML protocol.

Is the platform open source? Can paying customers have access to the source code?

StreamSets operates on a Hybrid Open-Source Model (HOSS). StreamSets Data Collector is open source, released under the Apache 2.0 license. The remainder of the platform source code is proprietary and not available to customers.

Learn more about StreamSets Data Collector

Still Have Questions?

Start a discussion with our team on Twitter @streamsets or via #streamsets.

Submit your questions on our sdc-user Google Group.

Send a question by email through our contact form.
