Architecture

StreamSets Control Hub architecture includes applications that manage user requests and databases that store application metadata and time series data.

Control Hub applications manage different user requests, such as designing and publishing a pipeline, starting a job, or measuring topology performance. The applications are independent and isolated from each other. They use REST APIs to communicate and cannot access the database data owned by another application. The applications are internal to Control Hub: when you install Control Hub, all of the applications are installed with it.

Control Hub uses authentication tokens to authenticate each message or request sent by an application. When you install Control Hub, you generate a unique authentication token for each application. The application includes the authentication token when it issues authenticated messages or requests to other applications.
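The token flow described above can be illustrated with a minimal Python sketch. It is a conceptual model only, not Control Hub's implementation: the application identifiers, the header names (X-Example-App-Id, X-Example-App-Token), and all function names are assumptions made for illustration.

```python
import hmac
import secrets

# Hypothetical identifiers for three of the Control Hub applications.
APPLICATIONS = ["jobrunner", "pipelinestore", "security"]


def generate_tokens(app_names):
    """Generate one random token per application (done once, at install time)."""
    return {name: secrets.token_hex(32) for name in app_names}


def build_request(sender, token, path):
    """Attach the sender's identity and token to an outgoing request.

    The header names are illustrative, not Control Hub's actual headers.
    """
    return {
        "path": path,
        "headers": {"X-Example-App-Id": sender, "X-Example-App-Token": token},
    }


def is_authenticated(request, registered_tokens):
    """Return True only if the request carries the token registered for its sender."""
    headers = request["headers"]
    expected = registered_tokens.get(headers.get("X-Example-App-Id"))
    presented = headers.get("X-Example-App-Token")
    if expected is None or presented is None:
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(expected, presented)
```

In this sketch, a request built with the token registered for its sender passes the check, while a request with a wrong token or an unregistered sender is rejected.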

Control Hub applications store data in a relational database and a time series database. Before you install Control Hub, you must set up the required databases.

The following image displays the architecture of Control Hub:

Applications

The following table describes each Control Hub application:
Application - Description
Classification - Not used at this time.
Dynamic Preview - Supports dynamically generated pipelines.
Job Runner - Manages job creation and requests to start, stop, and synchronize jobs.
Messaging - Manages the messages sent between Control Hub, registered Data Collectors, registered Edge Data Collectors, and Provisioning Agents.
Notification - Manages and triggers subscriptions that listen for Control Hub events. Manages the display and acknowledgement of data SLA alerts configured for topologies in Control Hub. Also manages the display of all metric, data, and data drift alerts configured for pipelines.
Pipeline Store - Manages pipeline designing, publishing, versioning, and history.
Policy - Not used at this time.
Provisioning - Manages the automatic provisioning of Data Collectors to a container orchestration framework, such as Kubernetes. Provisioning includes deploying, registering, starting, stopping, and scaling Data Collector containers to work with Control Hub.
Reporting - Manages data delivery reports, including creating report definitions, generating reports, and displaying generated reports.
Scheduler - Manages the scheduling of jobs and data delivery reports.
Security - Ensures the integrity of your data by managing the following components:
  • Registration of Data Collectors and Edge Data Collectors.
  • Management of authentication tokens for Control Hub applications, Provisioning Agents, registered Data Collectors, and registered Edge Data Collectors.
  • Authentication of each message or request sent by Control Hub applications, Provisioning Agents, registered Data Collectors, and registered Edge Data Collectors.
  • Authentication of user accounts and authorization of users through roles and permissions.
SLA - Manages the creation and triggering of data SLA alerts for topologies.
Time Series - Reads and stores statistics collected by Data Collectors and Edge Data Collectors running remote pipeline instances. Displays the statistics as metrics when you monitor jobs and topologies.
Topology - Manages the creation and history of topologies.

Databases

Control Hub uses the following databases:
Relational
Control Hub applications store metadata in a relational database. Control Hub currently supports MariaDB, MySQL, or PostgreSQL for the relational database.
Important: Each application requires its own database within the relational database. Create a database for each application, even if you do not plan to use the functionality offered by that application.
The relational database must include a database for each of the following applications:
  • Classification
  • Dynamic Preview
  • Job Runner
  • Messaging
  • Notification
  • Pipeline Store
  • Policy
  • Provisioning
  • Reporting
  • Scheduler
  • Security
  • SLA
  • Time Series
  • Topology
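As a rough sketch of the one-database-per-application requirement, the following Python snippet generates a CREATE DATABASE statement for each application in the list above. The database names and the sch_ prefix are hypothetical; use whatever naming your installation requires.

```python
# The 14 applications that each need their own database.
APPLICATIONS = [
    "Classification", "Dynamic Preview", "Job Runner", "Messaging",
    "Notification", "Pipeline Store", "Policy", "Provisioning",
    "Reporting", "Scheduler", "Security", "SLA", "Time Series", "Topology",
]


def database_name(application, prefix="sch_"):
    """Derive a hypothetical database name, e.g. 'Job Runner' -> 'sch_job_runner'."""
    return prefix + application.lower().replace(" ", "_")


def create_statements(applications=APPLICATIONS):
    """Return one CREATE DATABASE statement per application."""
    return [f"CREATE DATABASE {database_name(app)};" for app in applications]
```

Calling create_statements() returns 14 statements, one per application, which could be reviewed and then run with the client for your relational database.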
Time Series
Control Hub applications store metric data in a time series database. Control Hub currently supports InfluxDB for the time series database.
The time series database must include the following databases:
  • Metrics - The Time Series application writes aggregated statistics collected by Data Collectors running remote pipeline instances to this database. Control Hub users view the metrics as time series data when they monitor jobs and topologies. In addition, data delivery reports are generated using these metrics.

    The Time Series application stores all historical metrics in the time series database. The Time Series application stores the current metrics only in the relational database.

  • Application Metrics - All Control Hub applications write metrics about their actions to this database. As a system administrator, you can view these application metrics as time series data in the Control Hub Admin tool to help you troubleshoot Control Hub issues.
Note: StreamSets recommends using a time series database so that users can more accurately analyze dataflow performance. However, the time series database is optional.

If you choose not to set up a time series database for Control Hub, the Time Series application does not store historical metrics. Instead, it stores only the current metrics in the relational database. Users can view the total record count and throughput for a job or topology, but they cannot view the data over a period of time or generate data delivery reports.
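The effect of omitting the time series database can be summarized in a small Python sketch. The function and key names are hypothetical and exist only to restate the behavior described above.

```python
def monitoring_capabilities(time_series_db_configured):
    """Summarize what users can do, depending on whether a time series database is set up."""
    return {
        # Current metrics are stored in the relational database either way.
        "view_total_record_count_and_throughput": True,
        # Historical metrics require the time series database.
        "view_metrics_over_time": time_series_db_configured,
        # Data delivery reports are generated from the historical metrics.
        "generate_data_delivery_reports": time_series_db_configured,
    }
```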