Meet StreamSets Control Hub
StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Control Hub allows teams to build and execute large numbers of complex dataflows at scale.
Architecture
StreamSets Control Hub architecture includes applications that manage user requests and databases that store application metadata and time series data.
System Data Collector
Data Collector Communication
StreamSets Control Hub works with Data Collector to design pipelines and to execute standalone and cluster pipelines.
SDC Edge Communication
StreamSets Control Hub works with Data Collector Edge (SDC Edge) to execute edge pipelines. SDC Edge is a lightweight agent that runs pipelines on edge devices with limited resources.
Provisioning Agent Communication
A Provisioning Agent is a containerized application that runs in a container orchestration framework, such as Kubernetes. The agent automatically provisions Data Collector containers in the Kubernetes cluster on which it runs.
High Availability
Authentication
Control Hub User Interface
StreamSets Control Hub provides a web-based user interface (UI) for managing the pipeline repository and the execution components - including Data Collectors, Transformers, Edge Data Collectors, Provisioning Agents, and deployments. You can create jobs, topologies, and Control Hub users, as well as administer your organization.
Overview
Installation Requirements
Creating the Databases
Install the required database software and create the databases before installing StreamSets Control Hub.
Installing Control Hub
You can install Control Hub on the same machine as the required databases or on a separate machine. For best performance, we recommend installing Control Hub on a machine separate from the databases.
Enabling LDAP Authentication
If your company uses Lightweight Directory Access Protocol (LDAP), you can use the LDAP provider to authenticate Control Hub users. LDAP authenticates a user using the credentials stored in the LDAP server.
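At its core, LDAP authentication is a bind against the directory using the user's credentials. The following sketch illustrates that check in Python with the ldap3 library; the hostname, port, and distinguished name are hypothetical placeholders, and Control Hub performs this step internally based on its own LDAP configuration.

```python
# A minimal sketch of the credential check an LDAP provider performs:
# bind to the directory as the user. Host, port, and DN are placeholders.
from ldap3 import Server, Connection

server = Server('ldap.example.com', port=636, use_ssl=True)
conn = Connection(
    server,
    user='uid=jdoe,ou=people,dc=example,dc=com',  # the user's distinguished name
    password='secret',
)

# bind() returns True when the directory accepts the credentials
if conn.bind():
    print('credentials accepted by the LDAP server')
    conn.unbind()
else:
    print('bind failed:', conn.result)
```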
Enabling HTTPS
To secure communications to the Control Hub UI and REST API, enable both Control Hub and the separate Admin tool to use HTTPS. HTTPS requires an SSL/TLS certificate.
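After enabling HTTPS, clients of the UI and REST API must trust the certificate. As a quick sanity check you can verify the TLS handshake from Python; the host, port, and CA bundle path below are illustrative assumptions, not values from your installation.

```python
# Verify that the Control Hub endpoint presents a certificate signed by the
# expected CA. URL and CA bundle path are illustrative placeholders.
import requests

resp = requests.get(
    'https://sch.example.com:18631',          # assumed host and port
    verify='/etc/ssl/certs/company-ca.pem',   # CA bundle that signed the certificate
    timeout=10,
)
print(resp.status_code)  # an SSLError here means the certificate is not trusted
```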
Setting Up a Highly Available Environment
In a production environment, we recommend using multiple Control Hub instances and a load balancer to ensure that Control Hub is highly available.
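A load balancer routes requests based on a periodic health check against each instance. The sketch below shows the idea; the instance URLs and the use of the base URL as a health endpoint are assumptions, and in practice the load balancer itself (HAProxy, NGINX, a cloud load balancer) performs this check.

```python
# Poll each Control Hub instance and report which ones are reachable.
# Instance URLs are hypothetical placeholders.
import requests

INSTANCES = [
    'https://sch-1.example.com:18631',
    'https://sch-2.example.com:18631',
]

def healthy(url: str) -> bool:
    try:
        return requests.get(url, timeout=5).ok
    except requests.RequestException:
        return False

for url in INSTANCES:
    print(url, 'UP' if healthy(url) else 'DOWN')
```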
Uninstalling Control Hub
Overview
Preparing for the Upgrade
Before upgrading Control Hub, shut down and back up the previous Control Hub version and then back up the Control Hub databases. Depending on the version that you are upgrading from, create the new required databases.
Upgrading the Initial Control Hub Instance
If you are upgrading a development environment, follow these instructions to upgrade the single Control Hub instance.
Upgrading a Highly Available Environment
For a highly available production environment, upgrade the additional Control Hub instances and update the load balancer used by Control Hub.
Starting and Logging Into the Upgraded Control Hub
Completing Post-Upgrade Tasks
In some situations, you must complete tasks within Control Hub after you upgrade.
Overview
Organizations
An organization is a secure space provided to a set of Control Hub users. All Data Collectors, pipelines, jobs, topologies, and other objects added by any user in the organization belong to that organization. A user logs in to Control Hub as a member of an organization and can access only data that belongs to that organization.
Dashboards
Messaging View
System Pipeline Templates
Data Collector Version Range
Administer Control Hub Applications
Logs
Control Hub Configuration
You can edit Control Hub configuration files to configure properties such as the host name and port number and SMTP account information for emails. You can also customize Control Hub to display your company logo instead of the StreamSets logo in the user interface.
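Control Hub configuration files use the Java properties format (key=value pairs). The sketch below changes one property from a script; the file path and property name are assumptions for illustration, and in practice you typically edit the file directly and restart Control Hub.

```python
# Rewrite a single key in a Java-style properties file, preserving other lines.
# Path and property name are illustrative placeholders.
from pathlib import Path

path = Path('/opt/streamsets-dpm/etc/dpm.properties')  # assumed location
key, new_value = 'http.port', '18631'                  # assumed property name

out = []
for line in path.read_text().splitlines():
    is_comment = line.lstrip().startswith('#')
    if not is_comment and line.split('=', 1)[0].strip() == key:
        out.append(f'{key}={new_value}')   # replace the value, keep the key
    else:
        out.append(line)                   # leave every other line untouched
path.write_text('\n'.join(out) + '\n')
```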
Control Hub Environment Configuration
Starting Control Hub
Shutting Down Control Hub
Renewing the Control Hub License
Control Hub Admin Tool
Control Hub includes a separate Admin tool that you can use to monitor and troubleshoot Control Hub. Because the Admin tool runs independently, it remains available even when Control Hub itself becomes inaccessible, so you can still log in to the Admin tool to troubleshoot the issue.
Tutorial: Data Collectors, Pipelines, and Jobs
This tutorial covers the basic tasks required to get up and running with StreamSets Control Hub. In this tutorial, we create Control Hub user accounts, register a Data Collector with Control Hub, design and publish a pipeline, and create and start a job for the pipeline.
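The same steps can also be scripted. Below is a minimal sketch, assuming the StreamSets SDK for Python (the streamsets.sdk package) is installed; the URL, credentials, and job name are placeholders, and method names can vary across SDK versions.

```python
# Connect to Control Hub, look up a job by name, and start it.
# Credentials, URL, and job name are placeholders.
from streamsets.sdk import ControlHub

sch = ControlHub('https://cloud.streamsets.com',
                 username='user@my-org', password='password')

job = sch.jobs.get(job_name='Tutorial Job')  # the job created in the tutorial
sch.start_job(job)                           # equivalent to Start Job in the UI
print(job.status)
```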
Tutorial: Topologies
This tutorial covers working with topologies in StreamSets Control Hub. A topology provides an interactive end-to-end view of data as it traverses multiple pipelines. In this tutorial, we create and publish an SDC RPC origin pipeline and an SDC RPC destination pipeline, create jobs for the pipelines, map the related jobs in a topology, and then measure the progress of the topology.
Pipeline Repository Overview
The Control Hub pipeline repository provides access to the details and configuration for all pipelines and pipeline fragments in your Control Hub organization.
Design in Control Hub
You can design pipelines and pipeline fragments in Control Hub using the Control Hub Pipeline Designer. You can use Pipeline Designer to develop pipelines and fragments for Data Collector or Data Collector Edge.
Design in Data Collector
In addition to designing pipelines in the Control Hub Pipeline Designer, you can also log in to an authoring Data Collector to design pipelines and then add them to the Control Hub pipeline repository. You cannot design pipeline fragments in Data Collector.
Pipeline Labels
A pipeline label identifies similar pipelines or pipeline fragments. Use pipeline labels to easily search and filter pipelines and fragments when viewing them in the pipeline repository.
Pipeline Templates
Version History
A typical pipeline or pipeline fragment development cycle involves iterative changes. The pipeline repository maintains the version history for each pipeline and fragment.
Filtering Pipelines and Fragments
Viewing Published Pipelines and Fragments
Duplicating a Pipeline or Fragment
Creating a Job for a Pipeline
Deleting Versions
You can delete a specific pipeline or fragment version or all versions of a pipeline or fragment from the repository.
Troubleshooting
Data Collectors Overview
Data Collector is an execution engine that works directly with Control Hub. You install Data Collectors in your corporate network, which can be on-premises or on a protected cloud computing platform, and then register them to work with Control Hub.
Manually Administered Data Collectors
Manually administering authoring and execution Data Collectors involves installing Data Collectors on-premises or on a protected cloud computing platform and then registering them to work with Control Hub.
Authoring Data Collectors
You use authoring Data Collectors to design pipelines. You install and register authoring Data Collectors just as you do execution Data Collectors.
Labels
Resource Thresholds
Monitoring Data Collectors
When you view registered Data Collectors in the Execute view, you can monitor the performance of each Data Collector and the pipelines currently running on each Data Collector.
Disconnected Mode
Trusted Domains for Data Collectors
Using an HTTP or HTTPS Proxy Server
Deactivating a Registered Data Collector
Control Hub Configuration File
Troubleshooting
Provisioned Data Collectors Overview
Provisioning Requirements
Provision Data Collectors
Labels
Managing Provisioning Agents
After you create and deploy a Provisioning Agent as a containerized application to your Kubernetes cluster, use Helm or Kubernetes to manage the application.
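Because the agent is an ordinary Kubernetes workload, standard tooling applies. For example, here is a sketch using the official Kubernetes Python client to inspect the agent's Deployment; the namespace is an assumption about how you deployed it.

```python
# List Deployments in the namespace where the Provisioning Agent runs.
# The namespace 'streamsets' is an illustrative placeholder.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside the cluster
apps = client.AppsV1Api()

for dep in apps.list_namespaced_deployment('streamsets').items:
    print(dep.metadata.name,
          f'{dep.status.ready_replicas or 0}/{dep.spec.replicas} ready')
```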
Managing Deployments
After you create a deployment, you can start, stop, scale, or delete the deployment. You must stop a deployment to specify a different Provisioning Agent for the deployment, specify a different Data Collector Docker image to deploy, or assign additional labels to the deployment.
Edge Data Collectors Overview
Supported Platforms
Getting Started with SDC Edge and Control Hub
To get started with SDC Edge and Control Hub, you download and install SDC Edge on the edge device where you want to run edge pipelines.
Install SDC Edge
Download and install SDC Edge on each edge device where you want to run edge pipelines.
Register SDC Edge
You must register each Data Collector Edge (SDC Edge) to work with Control Hub. You can register Edge Data Collectors from Control Hub or from the SDC Edge command line interface.
Labels
Resource Thresholds
Monitor Edge Data Collectors
When you view registered Edge Data Collectors in the Execute view, you can monitor the performance of each SDC Edge and the edge pipelines currently running on each SDC Edge.
Disconnected Mode
After an SDC Edge is registered with Control Hub, the SDC Edge communicates with Control Hub at regular intervals, typically every 30 seconds or less. If an SDC Edge cannot connect to Control Hub due to a network or system outage, the SDC Edge uses Control Hub disconnected mode.
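Conceptually, disconnected mode is a heartbeat with a fallback: keep reporting in while Control Hub is reachable, and keep running pipelines locally when it is not. The loop below only illustrates that behavior, not SDC Edge's actual implementation; the URL and interval are placeholders.

```python
# Illustration of heartbeat-with-fallback behavior (not SDC Edge's code).
import time
import requests

CONTROL_HUB = 'https://sch.example.com:18631'  # placeholder URL
HEARTBEAT_SECONDS = 30                         # assumed interval

while True:
    try:
        requests.get(CONTROL_HUB, timeout=5).raise_for_status()
        connected = True     # report status, receive commands
    except requests.RequestException:
        connected = False    # disconnected mode: keep running local pipelines
    print('connected' if connected else 'disconnected: running locally')
    time.sleep(HEARTBEAT_SECONDS)
```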
Administer SDC Edge
Administering Data Collector Edge (SDC Edge) involves configuring, starting, and shutting down the agent on the edge device.
Uninstalling SDC Edge
Jobs Overview
A job defines the pipeline to run and the execution engine that runs the pipeline: Data Collector, Data Collector Edge, or Transformer. A job is the unit of dataflow execution.
Number of Pipeline Instances
For Data Collector and Data Collector Edge jobs, you can manually scale out pipeline processing by increasing the number of pipeline instances that Control Hub runs for a job.
Pipeline Failover
You can enable a Data Collector or Data Collector Edge job for pipeline failover. Enable pipeline failover to minimize downtime due to unexpected pipeline failures and to help you achieve high availability.
Runtime Parameters
Runtime parameters are parameters that you define for a pipeline and call from within that same pipeline. When you define a runtime parameter, you enter the default value to use. When you create or edit a job that includes a pipeline with runtime parameters, you can specify another value to override the default. When the job starts, the value replaces the default value of the runtime parameter.
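The override behavior is a simple precedence rule: values specified on the job win over the pipeline's defaults. A two-line illustration, with made-up parameter names:

```python
# Pipeline defaults vs. job-level overrides: the job's values take precedence.
defaults  = {'ORIGIN_DIR': '/data/in', 'BATCH_SIZE': 1000}  # defined in the pipeline
overrides = {'ORIGIN_DIR': '/data/prod'}                    # specified on the job

resolved = {**defaults, **overrides}
print(resolved)  # {'ORIGIN_DIR': '/data/prod', 'BATCH_SIZE': 1000}
```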
Job Templates
When you create a job for a pipeline that uses runtime parameters, you can enable the job to work as a job template. A job template lets you run multiple job instances with different runtime parameter values from a single job definition.
Job Tags
A job tag identifies similar jobs or job templates. Use job tags to easily search and filter jobs and job templates when viewing them in the Jobs view. For example, you might want to group jobs by the origin system or by the test or production environment.
Managing Jobs
After publishing pipelines, create a job for each pipeline that you want to run. When you create a job that includes a pipeline with runtime parameters, you can enable the job to work as a job template.
Monitoring Jobs
After you start a job, you can view statistics, error information, logs, and alerts about all remote pipeline instances run from the job. During pipeline test runs or job runs for Data Collector, you can also capture and review a snapshot of the data being processed.
Troubleshooting
Scheduler Overview
The scheduler manages long-running scheduled tasks. A scheduled task periodically triggers an action on a job or report at the specified frequency. For example, a scheduled task can start or stop a job or generate a data delivery report on a weekly or monthly basis.
Scheduled Task Types
Time Zone
Each scheduled task has one time zone that applies to all the times configured for the task. By default, the time zone is UTC.
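To see how the time zone changes when a schedule fires, here is a sketch using the croniter library with a generic five-field cron expression; Control Hub's scheduler has its own cron syntax, so this is an illustration of the concept only.

```python
# Next three firings of "09:00 every Monday", evaluated in two time zones.
from datetime import datetime
from zoneinfo import ZoneInfo
from croniter import croniter

for tz in ('UTC', 'America/Los_Angeles'):
    base = datetime.now(ZoneInfo(tz))
    it = croniter('0 9 * * 1', base)   # minute hour day-of-month month day-of-week
    print(tz, [it.get_next(datetime).isoformat() for _ in range(3)])
```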
Cron Expression
Start and End Times
Missed Execution Handling
Creating Scheduled Tasks
Monitoring Scheduled Tasks
As a scheduled task runs, you can monitor the status of the task, view details about each time the task triggered an action on a job or report, and view an audit of all changes made to the scheduled task.
Managing Scheduled Tasks
Topologies Overview
A topology provides an interactive end-to-end view of data as it traverses multiple pipelines that work together.
Map Jobs in a Topology
Create a topology in Control Hub to map multiple related jobs into a single view. You can map all dataflow activities that serve the needs of one business function in a single topology.
Measure Topology Performance
After you start the jobs included in a topology, you can measure the health and performance of all jobs and systems in the topology.
Monitor Topologies with Data SLAs
A data SLA (service level agreement) defines the data processing rates that jobs within a topology must meet.
Default Topology
Auto Fix an Invalid Topology
Topology History
Renaming a Topology
Duplicating a Topology
Deleting Topology Versions
Data Delivery Reports Overview
Data delivery reports provide data processing metrics for a given job or topology. For example, you can use reports to view the number of records that were processed by a job or topology the previous day.
Creating Report Definitions
Generating Reports
To generate a data delivery report, either generate it on demand from the report definition or create a scheduled task that generates the report at a regular frequency.
Viewing Reports
To view a data delivery report, click the name of the report in the Reports view. The report details display. Then click Show Reports to see the generated reports.
Troubleshooting
Subscriptions Overview
A subscription listens for Control Hub events and then completes an action when those events occur. For example, you can create a subscription that sends a message to a Slack channel or emails an administrator each time a job status changes.
Enabling Events to Trigger Subscriptions
Events
A subscription is triggered by a Control Hub event. You can create a subscription for a single event type or for multiple event types.
Actions
After an event triggers a subscription, the subscription performs an action - such as using a webhook to send an HTTP request to an external system or sending an email.
Parameters
Examples
Let's look at a few subscription examples. The first example sends a message to a Slack channel when an event occurs. The second example calls an external API when an event occurs. The third example sends an email when an event occurs.
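For the Slack example, the action reduces to an HTTP POST to a Slack incoming webhook with a JSON body. The sketch below shows the equivalent request; the webhook URL and message text are placeholders that Control Hub would fill in using subscription parameters.

```python
# The HTTP request behind a "message to Slack" subscription action.
# Webhook URL and message values are placeholders.
import requests

webhook_url = 'https://hooks.slack.com/services/T000/B000/XXXX'  # placeholder
payload = {'text': 'Job "Web Logs to Kafka" changed status to RED'}

resp = requests.post(webhook_url, json=payload, timeout=10)
resp.raise_for_status()
```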
Creating Subscriptions
Troubleshooting
Export and Import Overview
Pipeline, Fragment, and Topology Export and Import
Pipelines, pipeline fragments, and topologies are versioned objects with either a draft or published state. You can export and import published pipelines, fragments, and topologies only.
Job Export and Import
Exporting Objects
You can export pipelines, fragments, jobs, and topologies from Control Hub. You can export a single object, a set of selected objects of the same type, or all objects of the same type that you have access to.
Importing Objects
You can import pipelines, fragments, jobs, and topologies that were exported from Control Hub. You can also import pipelines exported from a Data Collector that is not registered with Control Hub.
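Exported objects are delivered as a zip archive of JSON definitions, which makes them easy to inspect before importing. A sketch, assuming a file exported from Control Hub (the file name is a placeholder):

```python
# Peek inside an exported archive before importing it into another organization.
import json
import zipfile

with zipfile.ZipFile('pipelines.zip') as archive:
    for name in archive.namelist():
        print(name)
        if name.endswith('.json'):
            data = json.loads(archive.read(name))
            if isinstance(data, dict):
                # top-level keys vary by object type; listing them is a safe peek
                print('  keys:', sorted(data)[:5])
```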
Organization Security Overview
Organization
Authentication
Users and Groups
Create groups and assign user accounts to them to organize related users. Users can belong to one or more groups. Use groups to manage user accounts more efficiently.
Roles
Control Hub provides several types of roles that allow you to customize the tasks that users can perform.
Permissions
Permissions determine the access level that users have on objects. When permission enforcement is enabled for your organization, you can share and grant permissions on execution engines, Provisioning Agents, deployments, pipelines, fragments, jobs, scheduled tasks, subscriptions, topologies, and data SLAs.
Requirements for Common Tasks
Completing Control Hub tasks requires a combination of roles and permissions. The following sections list the requirements to complete common Control Hub tasks.
© 2020 StreamSets, Inc.