Meet StreamSets Control Hub
StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Control Hub allows teams to build and execute large numbers of complex dataflows at scale.
Data Collector Communication
StreamSets Control Hub works with Data Collector to design pipelines and to execute standalone and cluster pipelines.
Transformer Communication
StreamSets Control Hub works with Transformer to design pipelines and to execute Transformer pipelines on Apache Spark, an open-source cluster-computing framework.
SDC Edge Communication
StreamSets Control Hub works with Data Collector Edge (SDC Edge) to execute edge pipelines. SDC Edge is a lightweight agent that runs pipelines on edge devices with limited resources.
Provisioning Agent Communication
A Provisioning Agent is a containerized application that runs in your corporate network within a container orchestration framework, such as Kubernetes. The agent automatically provisions Data Collector containers in the Kubernetes cluster on which it runs.
Control Hub Requirements
Control Hub User Interface
StreamSets Control Hub provides a web-based user interface (UI) for managing the pipeline repository and the execution components, including Data Collectors, Transformers, Edge Data Collectors, Provisioning Agents, and deployments. You can create connections, jobs, topologies, and Control Hub users, as well as administer your organization.
Tutorial: Data Collectors, Pipelines, and Jobs
This tutorial covers the basic tasks required to get up and running with StreamSets Control Hub. In this tutorial, we create Control Hub user accounts, register a Data Collector with Control Hub, design and publish a pipeline, and create and start a job for the pipeline.
Tutorial: Topologies
This tutorial covers working with topologies in StreamSets Control Hub. A topology provides an interactive end-to-end view of data as it traverses multiple pipelines. In this tutorial, we create and publish an SDC RPC origin pipeline and an SDC RPC destination pipeline, create jobs for the pipelines, map the related jobs in a topology, and then measure the progress of the topology.
Jobs Overview
A job defines the pipeline to run and the execution engine that runs the pipeline: Data Collector, Data Collector Edge, or Transformer. Jobs are the unit of execution for your dataflows.
Number of Pipeline Instances
For Data Collector and Data Collector Edge jobs, you can manually scale out pipeline processing by increasing the number of pipeline instances that Control Hub runs for a job.
Data Collector Pipeline Failover
You can enable a Data Collector or Data Collector Edge job for pipeline failover. Enable pipeline failover to minimize downtime due to unexpected pipeline failures and to help you achieve high availability.
Transformer Pipeline Failover
You can enable a Transformer job for pipeline failover for some cluster types. Enable pipeline failover to prevent Spark applications from failing due to an unexpected Transformer shutdown.
Runtime Parameters
Runtime parameters are parameters that you define for a pipeline and call from within that same pipeline. When you define a runtime parameter, you enter the default value to use. When you create or edit a job that includes a pipeline with runtime parameters, you can specify another value to override the default. When the job starts, the value replaces the default value of the runtime parameter.
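A pipeline calls a runtime parameter with an expression such as ${BATCH_SIZE}. The following sketch shows one way to override a parameter default from a job using the StreamSets SDK for Python; the connection details, job name, and the runtime_parameters and update_job names are assumptions to verify against your SDK version.

    # A minimal sketch, assuming the StreamSets SDK for Python (3.x-style API).
    # The URL, credentials, job name, and the runtime_parameters attribute
    # are illustrative assumptions; verify them against your SDK version.
    from streamsets.sdk import ControlHub

    sch = ControlHub('https://cloud.streamsets.com',
                     username='user@myorg', password='password')

    # Look up a job whose pipeline defines a BATCH_SIZE runtime parameter.
    job = sch.jobs.get(job_name='Load to Warehouse job')

    # Override the pipeline's default value for this job.
    job.runtime_parameters = {'BATCH_SIZE': 5000}
    sch.update_job(job)

    # When the job starts, 5000 replaces the default value of BATCH_SIZE.
    sch.start_job(job)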
Job Templates
When you create a job for a pipeline that uses runtime parameters, you can enable the job to work as a job template. A job template lets you run multiple job instances with different runtime parameter values from a single job definition.
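For example, the sketch below starts two instances of a job template, each with a different runtime parameter value. It assumes the SDK exposes a start_job_template method that accepts a list of parameter sets; verify the name and signature before use.

    # Hypothetical sketch: start two instances of a job template with
    # different runtime parameter values. start_job_template and its
    # runtime_parameters argument are assumptions to verify in your SDK.
    from streamsets.sdk import ControlHub

    sch = ControlHub('https://cloud.streamsets.com',
                     username='user@myorg', password='password')

    template = sch.jobs.get(job_name='Load to Warehouse template')
    sch.start_job_template(template,
                           runtime_parameters=[{'BATCH_SIZE': 1000},
                                               {'BATCH_SIZE': 5000}])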
Job Tags
A job tag identifies similar jobs or job templates. Use job tags to easily search and filter jobs and job templates when viewing them in the Jobs view. For example, you might want to group jobs by the origin system or by the test or production environment.
Managing Jobs
After publishing pipelines, create a job for each pipeline that you want to run. When you create a job that includes a pipeline with runtime parameters, you can enable the job to work as a job template.
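The following sketch creates and starts a job for a published pipeline with the StreamSets SDK for Python. The builder pattern shown (get_job_builder, build, add_job, start_job) reflects the SDK as we understand it; treat the names and the connection details as assumptions to confirm against your SDK documentation.

    # A minimal sketch: build a job from a published pipeline and start it.
    # Connection details and object names are illustrative.
    from streamsets.sdk import ControlHub

    sch = ControlHub('https://cloud.streamsets.com',
                     username='user@myorg', password='password')

    # The pipeline must already be published in Control Hub.
    pipeline = sch.pipelines.get(name='Load to Warehouse')

    job_builder = sch.get_job_builder()
    job = job_builder.build('Load to Warehouse job', pipeline=pipeline)
    sch.add_job(job)
    sch.start_job(job)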
Monitoring Jobs
After you start a job, you can view statistics, error information, logs, and alerts about all remote pipeline instances run from the job. During pipeline test runs or job runs for Data Collector, you can also capture and review a snapshot of the data being processed.
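As a minimal sketch, the SDK can also report on a running job; the status accessor and example values shown here are assumptions to verify against your SDK version.

    # Sketch: check the current status of a job's remote pipeline instances.
    # The status attribute and the example values are assumptions.
    from streamsets.sdk import ControlHub

    sch = ControlHub('https://cloud.streamsets.com',
                     username='user@myorg', password='password')

    job = sch.jobs.get(job_name='Load to Warehouse job')
    print(job.status)   # assumed accessor; for example, ACTIVE or INACTIVE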
Troubleshooting
Scheduler Overview
The scheduler manages long-running scheduled tasks. A scheduled task periodically triggers an action on a job or report at the specified frequency. For example, a scheduled task can start or stop a job or generate a data delivery report on a weekly or monthly basis.
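For example, the following sketch creates a scheduled task that starts a job every Monday at 1:00 AM UTC, using a Quartz-style cron expression. The builder method names and build() arguments are assumptions modeled on the SDK's builder pattern; verify them before use.

    # Sketch: schedule a weekly job start. get_scheduled_task_builder,
    # the build() arguments, and add_scheduled_task are assumptions.
    from streamsets.sdk import ControlHub

    sch = ControlHub('https://cloud.streamsets.com',
                     username='user@myorg', password='password')
    job = sch.jobs.get(job_name='Load to Warehouse job')

    builder = sch.get_scheduled_task_builder()
    task = builder.build(task_object=job,
                         action='START',
                         name='Weekly warehouse load',
                         cron_expression='0 0 1 ? * MON',  # 1:00 AM every Monday
                         time_zone='UTC')
    sch.add_scheduled_task(task)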
Scheduled Task Types
Time Zone
Each scheduled task has one time zone that applies to all the times configured for the task. By default, the time zone is UTC.
Cron Expression
Start and End Times
Missed Execution Handling
Creating Scheduled Tasks
Monitoring Scheduled Tasks
As a scheduled task runs, you can monitor the status of the task, view details about each time the task triggered an action on a job or report, and view an audit of all changes made to the scheduled task.
Managing Scheduled Tasks
Topologies Overview
A topology provides an interactive end-to-end view of data as it traverses multiple pipelines that work together.
Map Jobs in a Topology
Create a topology in Control Hub to map multiple related jobs into a single view. You can map all dataflow activities that serve the needs of one business function in a single topology.
Measure Topology Performance
After you start the jobs included in a topology, you can measure the health and performance of all jobs and systems in the topology.
Monitor Topologies with Data SLAs
A data SLA (service level agreement) defines the data processing rates that jobs within a topology must meet.
Default Topology
Auto Fix an Invalid Topology
Topology History
Renaming a Topology
Duplicating a Topology
Deleting Topology Versions
Reports View
Report Definitions View
The Report Definitions view allows you to define a custom data delivery report that provides data processing metrics for a given job or topology. For example, you can use reports to view the number of records that were processed by a job or topology the previous day.
Creating Report Definitions
Generating Data Delivery Reports
To generate your data delivery report, do one of the following: generate the report on demand from the Report Definitions view, or create a scheduled task that generates the report at a regular frequency.
Viewing Data Delivery Reports
To view a data delivery report, click the name of the report in the Report Definitions view. The report details display. Then click Show Reports.
Troubleshooting
Subscriptions Overview
A subscription listens for Control Hub events and then completes an action when those events occur. For example, you can create a subscription that sends a message to a Slack channel or emails an administrator each time a job status changes.
Enabling Events to Trigger Subscriptions
Before Control Hub can trigger subscriptions for your organization, your organization administrator must enable events for the organization.
Events
Actions
After an event triggers a subscription, the subscription performs an action - such as using a webhook to send an HTTP request to an external system or sending an email.
Parameters
Examples
Let's look at a few subscription examples: the first sends a message to a Slack channel when an event occurs, the second calls an external API, and the third sends an email.
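To make the first example concrete, the sketch below shows the kind of HTTP request that a webhook action sends to a Slack incoming webhook when a job status changes. The webhook URL is a placeholder, and in a real subscription Control Hub expands its own parameters into the message body rather than running Python code.

    # Sketch of the HTTP request a Slack webhook action issues. The webhook
    # URL is a placeholder; Slack incoming webhooks accept a JSON body with
    # a 'text' field.
    import requests

    slack_webhook_url = 'https://hooks.slack.com/services/T000/B000/XXXX'
    payload = {'text': 'Job "Load to Warehouse job" changed status to ACTIVE'}

    response = requests.post(slack_webhook_url, json=payload)
    response.raise_for_status()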
Creating Subscriptions
Create a subscription to listen for Control Hub events and then complete an action when those events occur.
Troubleshooting
If a subscription doesn't complete the expected action, your organization administrator can view the subscription audit entries to help troubleshoot the issue.
Data Collectors Overview
Data Collector is an execution engine that works directly with Control Hub. You install Data Collectors in your corporate network, which can be on-premises or on a protected cloud computing platform, and then register them to work with Control Hub.
Manually Administered Data Collectors
Manually administering authoring and execution Data Collectors involves installing Data Collectors on-premises or on a protected cloud computing platform and then registering them to work with Control Hub.
Authoring Data Collectors
You use authoring Data Collectors to design pipelines and to create connections. You install and register authoring Data Collectors just as you do execution Data Collectors.
Labels
Resource Thresholds
Monitoring Data Collectors
When you view registered Data Collectors in the Execute view, you can monitor the performance of each Data Collector and the pipelines currently running on each Data Collector.
Disconnected Mode
Trusted Domains for Data Collectors
Using an HTTP or HTTPS Proxy Server
Deactivating a Registered Data Collector
Control Hub Configuration File
Troubleshooting
Edge Data Collectors Overview
Supported Platforms
Getting Started with SDC Edge and Control Hub
To get started with SDC Edge and Control Hub, you download and install SDC Edge on the edge device where you want to run edge pipelines.
Install SDC Edge
Download and install SDC Edge on each edge device where you want to run edge pipelines.
Register SDC Edge
You must register each Data Collector Edge (SDC Edge) to work with Control Hub. You can register Edge Data Collectors from Control Hub or from the SDC Edge command line interface.
Labels
Resource Thresholds
Monitor Edge Data Collectors
When you view registered Edge Data Collectors in the Execute view, you can monitor the performance of each SDC Edge and the edge pipelines currently running on each SDC Edge.
Disconnected Mode
After an SDC Edge is registered with Control Hub, the SDC Edge communicates with Control Hub at regular intervals, typically every 30 seconds or less. If an SDC Edge cannot connect to Control Hub due to a network or system outage, the SDC Edge uses Control Hub disconnected mode.
Administer SDC Edge
Administering Data Collector Edge (SDC Edge) involves configuring, starting, and shutting down the agent on the edge device.
Uninstalling SDC Edge
Provisioned Data Collectors Overview
Provisioning Requirements
Provision Data Collectors
Labels
Managing Provisioning Agents
After you create and deploy a Provisioning Agent as a containerized application to your Kubernetes cluster, use Helm or Kubernetes to manage the application.
Managing Deployments
After you create a deployment, you can start, stop, scale, or delete the deployment. You must stop a deployment to specify a different Provisioning Agent for the deployment, specify a different Data Collector Docker image to deploy, or assign additional labels to the deployment.
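The sketch below stops a deployment, changes the number of provisioned Data Collector instances, and restarts it with the StreamSets SDK for Python. The deployments collection, stop_deployment, update_deployment, start_deployment, and the instance-count attribute are all assumptions to verify against your SDK version.

    # Hypothetical sketch: update a stopped deployment and restart it.
    # All method and attribute names below are assumptions.
    from streamsets.sdk import ControlHub

    sch = ControlHub('https://cloud.streamsets.com',
                     username='user@myorg', password='password')
    deployment = sch.deployments.get(name='dev-datacollectors')

    sch.stop_deployment(deployment)
    deployment.number_of_data_collector_instances = 3   # assumed attribute
    sch.update_deployment(deployment)
    sch.start_deployment(deployment)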
Export and Import Overview
You can export and import pipeline fragments, pipelines, jobs, and topologies from Control Hub.
Pipeline, Fragment, and Topology Export and Import
Pipelines, pipeline fragments, and topologies are versioned objects with either a draft or published state. You can export and import published pipelines, fragments, and topologies only.
Job Export and Import
Objects that Use Connections
When you export objects that use connections or objects with dependent objects that use connections, Control Hub exports the connection metadata only. The metadata includes the connection ID, name, and type. Control Hub does not export the connection credentials or other configured values.
Exporting Objects
You can export pipelines, fragments, jobs, and topologies from Control Hub. You can export a single object, a set of selected objects of the same type, or all objects of the same type that you have access to.
Importing Objects
You can import pipelines, fragments, jobs, and topologies that were exported from Control Hub.
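As a sketch of moving objects between organizations, the following uses assumed export_pipelines and import_pipelines methods of the StreamSets SDK for Python; verify the names, signatures, and return types before relying on them.

    # Sketch: export published pipelines from one Control Hub and import
    # them into another. export_pipelines/import_pipelines are assumptions.
    from streamsets.sdk import ControlHub

    source = ControlHub('https://sch.source.example.com',
                        username='user@org1', password='password')
    target = ControlHub('https://sch.target.example.com',
                        username='user@org2', password='password')

    pipelines = [source.pipelines.get(name='Load to Warehouse')]

    # Assumed to return the bytes of a zip archive.
    archive = source.export_pipelines(pipelines)
    with open('pipelines.zip', 'wb') as f:
        f.write(archive)

    # Assumed signature. Connection credentials are never included in an
    # export, so reconfigure connections after importing.
    with open('pipelines.zip', 'rb') as f:
        target.import_pipelines(archive=f)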
Organization Security Overview
Authentication
Users and Groups
Roles
Control Hub provides several types of roles that allow you to customize the tasks that users can perform.
Permissions
Permissions determine the access level that users have on objects. When permission enforcement is enabled for your organization, you can share and grant permissions on objects.
Requirements for Common Tasks
Completing Control Hub tasks requires a combination of roles and permissions. The following sections list the requirements to complete common Control Hub tasks.
Organization Administration
As a user with the Organization Administrator role, you can use the Administration view to review and manage the organization configuration, to monitor active user sessions, and to view audit entries for your organization.