Meet StreamSets Control Hub
StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Control Hub allows teams to build and execute large numbers of complex dataflows at scale.
Data Collector Communication
StreamSets Control Hub works with Data Collector to design pipelines and to execute standalone and cluster pipelines.
SDC Edge Communication
StreamSets Control Hub works with Data Collector Edge (SDC Edge) to execute edge pipelines. SDC Edge is a lightweight agent that runs pipelines on edge devices with limited resources.
Provisioning Agent Communication
A Provisioning Agent is a containerized application that runs in your corporate network within a container orchestration framework, such as Kubernetes. The agent automatically provisions Data Collector containers in the Kubernetes cluster on which it runs.
Control Hub Requirements
Control Hub User Interface
StreamSets Control Hub provides a web-based user interface (UI) for managing the pipeline repository and the execution components - including Data Collectors, Edge Data Collectors, Provisioning Agents, and deployments. You can create jobs, topologies, and Control Hub users, as well as administer your organization.
Tutorial: Data Collectors, Pipelines, and Jobs
This tutorial covers the basic tasks required to get up and running with StreamSets Control Hub. In this tutorial, we create Control Hub user accounts, register a Data Collector with Control Hub, design and publish a pipeline, and create and start a job for the pipeline.
Tutorial: Topologies
This tutorial covers working with topologies in StreamSets Control Hub. A topology provides an interactive end-to-end view of data as it traverses multiple pipelines. In this tutorial, we create and publish an SDC RPC origin pipeline and an SDC RPC destination pipeline, create jobs for the pipelines, map the related jobs in a topology, measure the progress of the topology, and then define and monitor data SLAs for the topology.
Pipeline Repository Overview
The Control Hub pipeline repository provides access to the details and configuration for all pipelines and pipeline fragments in your Control Hub organization.
Design in Control Hub
You can design pipelines and pipeline fragments in Control Hub using the Control Hub Pipeline Designer. You can use Pipeline Designer to develop pipelines and fragments for Data Collector or Data Collector Edge.
Design in Data Collector
In addition to designing pipelines in the Control Hub Pipeline Designer, you can also log into an authoring Data Collector to design pipelines and then add them to the Control Hub pipeline repository. You cannot design pipeline fragments in Data Collector.
Viewing Published Pipelines and Fragments
Pipeline Labels
A pipeline label enables grouping similar pipelines or pipeline fragments. Use pipeline labels to easily search and filter pipelines and fragments when viewing them in the pipeline repository.
Version History
A typical pipeline or pipeline fragment development cycle involves iterative changes. The pipeline repository maintains the version history for each pipeline and fragment.
Duplicating a Pipeline or Fragment
Creating a Job for a Pipeline
Deleting Versions
Troubleshooting
Data Collectors Overview
Data Collectors work directly with Control Hub. You install Data Collectors in your corporate network, which can be on-premises or on a protected cloud computing platform, and then register them to work with Control Hub.
Manually Administered Data Collectors
Manually administering authoring and execution Data Collectors involves installing Data Collectors on-premises or on a protected cloud computing platform and then registering them to work with Control Hub.
Authoring Data Collectors
Authoring Data Collectors are used to design all types of pipelines. You install and register authoring Data Collectors just as you do execution Data Collectors.
Labels
Monitoring Data Collectors
When you view registered Data Collectors in the Execute view, you can monitor the performance of each Data Collector and the pipelines currently running on each Data Collector.
Disconnected Mode
Trusted Domains for Data Collectors
Using an HTTP or HTTPS Proxy Server
Deactivating a Registered Data Collector
Control Hub Configuration File
Troubleshooting
Provisioned Data Collectors Overview
Provisioning Requirements
Provision Data Collectors
Labels
Managing Provisioning Agents
After you create and deploy a Provisioning Agent as a containerized application to your Kubernetes cluster, you use Kubernetes to manage the application.
Managing Deployments
After you create a deployment, you can start, stop, scale, or delete the deployment. You must stop a deployment to specify a different Provisioning Agent for the deployment, specify a different Data Collector Docker image to deploy, or assign additional labels to the deployment.
Edge Data Collectors Overview
Supported Platforms
Getting Started with SDC Edge and Control Hub
To get started with SDC Edge and Control Hub, you download and install SDC Edge on the edge device where you want to run edge pipelines.
Install SDC Edge
Download and install SDC Edge on each edge device where you want to run edge pipelines.
Register SDC Edge with Control Hub
When you register a Data Collector Edge (SDC Edge) to work with Control Hub, you generate an authentication token for that SDC Edge.
Labels
Monitor Edge Data Collectors
When you view registered Edge Data Collectors in the Execute view, you can monitor the performance of each SDC Edge and the edge pipelines currently running on each SDC Edge.
Administer SDC Edge
Administering Data Collector Edge (SDC Edge) involves configuring, starting, and shutting down the agent on the edge device.
Uninstalling SDC Edge
Labels Overview
Use labels to group Data Collectors and Edge Data Collectors (SDC Edge) registered with Control Hub. For example, you can use labels to group Data Collectors and Edge Data Collectors by project, geographic region, environment, department, or any other classification you choose.
Default Label
Working with Multiple Labels
Using Labels to Support Pipeline Failover
Assigning Labels to a Data Collector
Assigning Labels to an SDC Edge
Assigning Labels to a Deployment
Jobs Overview
A job defines the pipeline to run and the Data Collectors or Edge Data Collectors (SDC Edge) that run the pipeline. Jobs are the unit of dataflow execution.
Number of Pipeline Instances
By default, when you start a job, Control Hub runs one pipeline instance on the available Data Collector or SDC Edge that is running the fewest pipelines. A Data Collector or SDC Edge is considered available for the job when it is assigned all of the labels specified for the job.
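To make the selection rule concrete, here is a minimal conceptual sketch in Python. It is not Control Hub's implementation; the component names, labels, and pipeline counts are invented for the example.

from dataclasses import dataclass

@dataclass
class ExecutionComponent:
    name: str
    labels: set
    running_pipelines: int = 0

def pick_component(job_labels, components):
    # A component qualifies only if it carries every label the job specifies.
    qualified = [c for c in components if job_labels <= c.labels]
    if not qualified:
        return None  # no registered Data Collector or SDC Edge matches the job
    # Among qualifying components, the one running the fewest pipelines wins.
    return min(qualified, key=lambda c: c.running_pipelines)

components = [
    ExecutionComponent("sdc-west-1", {"western-region", "prod"}, running_pipelines=4),
    ExecutionComponent("sdc-west-2", {"western-region", "prod"}, running_pipelines=1),
    ExecutionComponent("sdc-east-1", {"eastern-region", "prod"}, running_pipelines=0),
]

print(pick_component({"western-region", "prod"}, components).name)  # sdc-west-2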
Pipeline Failover
You can enable a job for pipeline failover. Enable pipeline failover to minimize downtime due to unexpected pipeline failures and to help you achieve high availability.
Runtime Parameters
Runtime parameters are parameters that you define for a pipeline and call from within that same pipeline. When you define a runtime parameter, you enter the default value to use. When you create or edit a job that includes a pipeline with runtime parameters, you can specify another value to override the default. When the job starts, the value replaces the default value of the runtime parameter.
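As a rough illustration of how job-level values interact with pipeline defaults, consider the following sketch. The parameter names and values are invented for the example.

# Defaults defined on the pipeline's runtime parameters.
pipeline_defaults = {"ORIGIN_DIR": "/data/incoming", "BATCH_SIZE": 1000}

# Values specified on the job for the same parameters.
job_overrides = {"ORIGIN_DIR": "/data/west/incoming"}

# When the job starts, job values replace the pipeline defaults; parameters
# the job does not override keep their defaults.
effective = {**pipeline_defaults, **job_overrides}
print(effective)  # {'ORIGIN_DIR': '/data/west/incoming', 'BATCH_SIZE': 1000}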
Job Templates
When you create a job for a pipeline that uses runtime parameters, you can enable the job to work as a job template. A job template lets you run multiple job instances with different runtime parameter values from a single job definition.
Creating Jobs and Job Templates
Starting Jobs and Job Templates
When you start a job that contains a standalone or cluster pipeline, Control Hub sends an instance of the pipeline to a Data Collector that is assigned all of the labels specified for the job. The Data Collector remotely runs the pipeline instance.
Monitoring Jobs
After you start a job, you can view real-time statistics, error information, logs, and alerts about all remote pipeline instances run from the job.
Managing Jobs
When a job is active, you can synchronize or stop the job. When a job is inactive, you can reset the origin for the job, use the latest pipeline version, edit the job, or delete the job. When a job is active or inactive, you can edit the latest pipeline version or schedule the job to start on a regular basis.
Troubleshooting
Understanding Data Protector
StreamSets Data Protector enables in-stream discovery of data in motion and provides a range of capabilities to implement complex data protection policies at the point of data ingestion.
Enabling Data Protector
Enable Data Protector to use classification rules and protection policies to identify and protect your organization's sensitive data.
Implementing Data Protection
To implement data protection, you create classification rules to identify and classify sensitive data and protection policies to alter and protect that data.
Data Protector Tutorial
Let's take a look at how you might configure classification rules and policies for a simple use case.
Troubleshooting Data Protector
Classification Rules Overview
Classification rules identify and classify data. Classification rules are used with protection policies to classify and protect sensitive data. For example, you might use a classification rule to identify user addresses, then use a protection policy to drop that information from all records.
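The following Python sketch shows the idea only; it is not Data Protector's rule syntax. The category name, field-name pattern, and record layout are illustrative.

import re

# "Classifier": tag fields whose names look like address fields with a category.
ADDRESS_FIELD = re.compile(r"(address|addr|street)", re.IGNORECASE)

def classify(record):
    return {name: "ADDRESS" for name in record if ADDRESS_FIELD.search(name)}

# "Protection policy" with a drop procedure: remove every field classified
# under the ADDRESS category.
def protect(record, categories):
    return {k: v for k, v in record.items() if categories.get(k) != "ADDRESS"}

record = {"name": "Ada", "street_address": "10 Main St", "city": "Springfield"}
print(protect(record, classify(record)))  # {'name': 'Ada', 'city': 'Springfield'}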
Classification Rule Components
Understanding Classifiers
Classifiers identify the data that the classification rule applies to. You define classifiers for a rule from within the rule. You cannot move or reuse classifiers from one rule to another.
Creating Classification Rules and Classifiers
Configuring Classifiers
Committing Classification Rules
Commit classification rules to make them available for use. After you create new rules or update existing rules, commit the changes so that they take effect.
StreamSets Classification Rules
Data Protector includes a wide range of general and country-specific classification rules.
Protection Policies Overview
Protection policies alter and protect data. When you configure a job, you specify a read policy to protect data when reading from an origin system. You also specify a write policy to protect data when writing to destination systems.
Policy Configuration Strategies
When configuring policies, consider the different types of policies that your organization requires. In addition to having both read and write policies, you might also need different flavors of read and write policies to address the different levels of security required for various endpoints or for various user groups.
Default Policies
Default policies are used when no policy has been assigned to a pipeline.
Protection Policy Components
When you configure a protection policy, you define several key components.
Understanding Procedures
Procedures define how a policy protects data. Typically, a procedure protects data based on classification categories, but you can also specify a known field path for protection when needed.
Protection Methods
Procedures provide a range of protection methods that can be used based on the type of the data being protected.
Creating a Protection Policy
Configuring a Procedure
Scheduler Overview
The scheduler manages long-running scheduled tasks. A scheduled task periodically triggers an action on a job or report at the specified frequency. For example, a scheduled task can start or stop a job or generate a data delivery report on a weekly or monthly basis.
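For a feel of how a cron expression maps to trigger times, here is an illustrative sketch using the third-party croniter library and a standard five-field expression. Control Hub's scheduler defines its own cron syntax, so treat this only as an analogy.

from datetime import datetime
from croniter import croniter

# 06:00 every Monday, starting from an arbitrary base time.
schedule = croniter("0 6 * * 1", datetime(2018, 7, 1))
for _ in range(3):
    print(schedule.get_next(datetime))
# 2018-07-02 06:00:00
# 2018-07-09 06:00:00
# 2018-07-16 06:00:00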
Scheduled Task Types
Cron Expression
Start and End Times
Missed Execution Handling
Creating Scheduled Tasks
Monitoring Scheduled Tasks
As a scheduled task runs, you can monitor the status of the task and view the number of times the task has triggered an action on a job or report. You can also view an audit of all changes made to the scheduled task.
Managing Scheduled Tasks
Topologies Overview
A topology provides an interactive end-to-end view of data as it traverses multiple pipelines that work together.
Map Jobs in a Topology
Create a topology in Control Hub to map multiple related jobs into a single view. You can map all dataflow activities that serve the needs of one business function in a single topology.
Measure Topology Performance
After you start the jobs included in a topology, you can measure the health and performance of all jobs and systems in the topology.
Monitor Topologies with Data SLAs
A data SLA (service level agreement) defines the data processing rates that jobs within a topology must meet.
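Conceptually, a data SLA pairs a threshold with an alert, as in this simplified sketch. The threshold, alert text, and throughput figures are invented; Control Hub evaluates SLAs against the metrics it collects for the jobs in the topology.

sla = {"min_records_per_second": 500, "alert_text": "Throughput below SLA"}

def evaluate(measured_records_per_second, sla):
    # Trigger the alert when the measured rate falls below the required rate.
    if measured_records_per_second < sla["min_records_per_second"]:
        return "TRIGGERED: " + sla["alert_text"]
    return "OK"

print(evaluate(350, sla))  # TRIGGERED: Throughput below SLA
print(evaluate(900, sla))  # OK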
Default Topology
Auto Fix an Invalid Topology
Topology History
Renaming a Topology
Duplicating a Topology
Deleting Topology Versions
Data Delivery Reports Overview
Data delivery reports provide data processing metrics for a given job or topology. For example, you can use reports to view the number of records that were processed by a job or topology the previous day.
Creating Report Definitions
Generating Reports
You can generate a data delivery report on demand, or you can use a scheduled task to generate the report on a regular basis.
Viewing Reports
To view a data delivery report, click the report name in the Reports view to display the report details, and then click Show Reports.
Troubleshooting
Subscriptions Overview
A subscription listens for Control Hub events and then completes an action when those events occur. For example, you can create a subscription that sends a message to a Slack channel or emails an administrator each time a job status changes.
Enabling Events to Trigger Subscriptions
Events
A subscription is triggered by a Control Hub event. You can create a subscription for a single event type or for multiple event types.
Actions
After an event triggers a subscription, the subscription performs an action - such as using a webhook to send an HTTP request to an external system or sending an email.
Parameters
Examples
Let's look at a few subscription examples. The first example sends a message to a Slack channel when an event occurs. The second example calls an external API when an event occurs. The third example sends an email when an event occurs.
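At its core, the webhook action in the first two examples boils down to an HTTP request sent when the event fires. The sketch below shows the Slack case; the webhook URL is a placeholder and the payload follows Slack's incoming-webhook format.

import json
import urllib.request

def notify_slack(webhook_url, message):
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# notify_slack("https://hooks.slack.com/services/<placeholder>",
#              "Job status changed for job 'Web Logs to Kafka'")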
Creating Subscriptions
Troubleshooting
Export and Import Overview
Pipeline, Fragment, and Topology Export and Import
Pipelines, pipeline fragments, and topologies are versioned objects with either a draft or published state. You can export and import only published pipelines, fragments, and topologies.
Job Export and Import
Exporting Objects
You can export pipelines, fragments, jobs, and topologies from Control Hub. You can export a single object, or you can select multiple objects of the same type to export a set of objects.
Importing Objects
You can import pipelines, fragments, jobs, and topologies that were exported from Control Hub. You can also import pipelines exported from a Data Collector that is not registered with Control Hub.
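Before importing, it can help to inspect what an export contains. The sketch below assumes the export is a zip archive of JSON definitions; the file name is a placeholder, so verify the layout against one of your own exports.

import json
import zipfile

def list_exported_objects(archive_path):
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                data = json.loads(archive.read(name))
                # Print whatever identifying field the definition exposes.
                print(name, "->", data.get("name", "<unnamed>"))

# list_exported_objects("pipelines.zip")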
Organization Security Overview
Organization
Authentication
Control Hub can authenticate users using the built-in Control Hub authentication method or a Security Assertion Markup Language (SAML) identity provider. For a Control Hub on-premises installation, Control Hub can also authenticate users using a Lightweight Directory Access Protocol (LDAP) provider.
Users and Groups
Create groups and assign related user accounts to them. Users can belong to one or more groups. Use groups to manage user accounts more efficiently.
Roles
Control Hub provides several types of roles that allow you to customize the tasks that users can perform.
Permissions
Permissions determine the access level that users have on objects. When permission enforcement is enabled for your organization, you can share and grant permissions on Data Collectors, Provisioning Agents, deployments, pipelines, fragments, jobs, scheduled tasks, subscriptions, topologies, protection policies, and data SLAs.
Requirements for Common Tasks
Completing Control Hub tasks requires a combination of roles and permissions. The following sections list the requirements to complete common Control Hub tasks.
© 2018 StreamSets, Inc.