Data Collector Communication

StreamSets Control Hub works with Data Collector to design pipelines and to execute standalone and cluster pipelines.

Control Hub runs on a public cloud service hosted by StreamSets - you simply need an account to get started. You install Data Collectors in your corporate network, which can be on-premises or on a protected cloud computing platform, and then register them to work with Control Hub.

Each registered Data Collector serves one of the following purposes:
Authoring Data Collector
Use an authoring Data Collector to design pipelines. You can design pipelines in the Control Hub Pipeline Designer after selecting an available authoring Data Collector to use. The selected authoring Data Collector determines the stages, stage libraries, and functionality that display in Pipeline Designer.
Or, you can log directly into an authoring Data Collector and design pipelines using the Data Collector UI.
Execution Data Collector
Use an execution Data Collector to run standalone and cluster pipelines started from Control Hub jobs.

A single Data Collector can serve both purposes. However, StreamSets recommends dedicating each Data Collector as either an authoring or execution Data Collector.

Registered Data Collectors use encrypted REST APIs to communicate with Control Hub, initiating outbound connections to Control Hub over HTTPS on port 443.

The web browser that accesses Control Hub Pipeline Designer likewise uses encrypted REST APIs, initiating outbound connections to Control Hub over HTTPS on port 443.

The authoring Data Collector selected for Pipeline Designer accepts inbound connections from the web browser on the port number configured for the Data Collector. The connection must be HTTPS.
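For example, to accept HTTPS connections, an authoring Data Collector is configured in its $SDC_CONF/sdc.properties file. The fragment below is illustrative only; the port and keystore values are placeholders you would replace with your own:

```properties
# Disable plain HTTP and serve the Data Collector UI and REST API over HTTPS
# on the configured port (example value).
http.port=-1
https.port=18636

# Keystore holding the TLS certificate presented to connecting web browsers.
https.keystore.path=keystore.jks
https.keystore.password=${file("keystore-password.txt")}
```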

The following image shows how authoring and execution Data Collectors communicate with Control Hub:

Data Collector Requests

Registered Data Collectors send requests and information to Control Hub.

Control Hub does not directly send requests to Data Collectors. Instead, Control Hub sends requests using encrypted REST APIs to a messaging queue managed by Control Hub. Data Collectors periodically poll the queue to retrieve Control Hub requests.
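This outbound-only pattern is essentially a poll loop. The sketch below illustrates the flow with an in-memory queue standing in for the Control Hub messaging queue; all names and the request shape are hypothetical, not the actual Control Hub API:

```python
import queue

# In-memory stand-in for the Control Hub messaging queue. In the real
# system this queue lives inside Control Hub; the Data Collector never
# accepts inbound connections - it only polls outward.
message_queue = queue.Queue()

def control_hub_send(request):
    """Control Hub enqueues a request instead of calling the Data Collector."""
    message_queue.put(request)

def data_collector_poll():
    """On its next poll cycle, a Data Collector drains any pending requests."""
    retrieved = []
    while True:
        try:
            retrieved.append(message_queue.get_nowait())
        except queue.Empty:
            break
    return retrieved

# Control Hub queues a request; the Data Collector picks it up on its next poll.
control_hub_send({"type": "START_PIPELINE", "jobId": "job-123"})
requests = data_collector_poll()
```

Because the Data Collector initiates every connection, no inbound firewall ports need to be opened toward the corporate network.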

Data Collectors communicate with Control Hub in the following areas:

Pipeline management
When you use an authoring Data Collector to publish a pipeline to Control Hub, the Data Collector sends the request to Control Hub.
Security
When you enable Control Hub within a Data Collector or when a user logs into a registered Data Collector, the Data Collector makes an authentication request to Control Hub.
Metrics
Every minute, an execution Data Collector sends aggregated metrics for remotely running pipelines to Control Hub.
Messaging queue
Data Collectors send the following information to the messaging queue:
  • At startup, a Data Collector sends its version, its HTTP URL, and the labels configured in the Control Hub configuration file, $SDC_CONF/dpm.properties.
  • When you update permissions on local pipelines, the Data Collector sends the updated pipeline permissions.
  • Every five seconds, a Data Collector sends a heartbeat and any status changes for remote pipelines.
  • Every minute, a Data Collector sends the last-saved offsets of remotely running pipelines and the status of all running pipelines.
Every three seconds, Control Hub checks the messaging queue to retrieve pipeline status changes and last-saved offsets sent by the Data Collectors.
Every five seconds, Data Collectors check with the messaging queue to retrieve requests and information sent by Control Hub:
  • When you start, stop, or delete a job, Control Hub sends a pipeline request for specific execution Data Collectors to the messaging queue.
  • When Data Protector is enabled and you commit a classification rule or update a protection policy, Control Hub sends the updates to the messaging queue.
The messaging queue retains each request, rule, or policy until the receiving Data Collectors retrieve it.
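The labels and other startup information reported to the messaging queue come from the Control Hub configuration file. An illustrative $SDC_CONF/dpm.properties fragment might look like the following; the URL and label values are examples only:

```properties
# Enable Control Hub and point this Data Collector at it.
dpm.enabled=true
dpm.base.url=https://cloud.streamsets.com

# Labels reported at startup; Control Hub jobs use labels to select
# which execution Data Collectors run a pipeline (example values).
dpm.remote.control.job.labels=all,west-coast
```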

Cluster Pipeline Communication

When you run a cluster pipeline from a Control Hub job, the Data Collector installed on the gateway node initiates outbound connections to Control Hub.

When a pipeline runs in cluster batch or cluster streaming mode, the Data Collector installed on the gateway node submits jobs to YARN. The jobs run on worker nodes in the cluster as either Spark Streaming or MapReduce jobs. Data Collectors do not need to be installed on worker nodes in the cluster - the pipeline and all of its dependencies are included in the launched job.

The Data Collector worker processes running under YARN do not communicate with Control Hub. Instead, each Data Collector worker sends metrics for the running pipeline to the system configured to aggregate statistics for the pipeline - Kafka, Amazon Kinesis Streams, or MapR Streams. The gateway Data Collector reads and aggregates all pipeline statistics from the configured system and then sends the aggregated statistics to Control Hub.
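As a rough sketch of the aggregation step, the gateway reads per-worker metric records from the configured statistics stream and combines them into one set of pipeline statistics. The record shape and field names below are hypothetical; real statistics flow through Kafka, Amazon Kinesis Streams, or MapR Streams:

```python
from collections import Counter

# Per-worker metric records as they might arrive on the statistics stream
# (hypothetical shape for illustration).
worker_metrics = [
    {"worker": "node-1", "recordsProcessed": 1200, "errorRecords": 3},
    {"worker": "node-2", "recordsProcessed": 950,  "errorRecords": 0},
    {"worker": "node-3", "recordsProcessed": 1100, "errorRecords": 1},
]

def aggregate(metrics):
    """Sum per-worker counts into the single set of pipeline statistics
    that the gateway Data Collector reports to Control Hub."""
    totals = Counter()
    for record in metrics:
        totals["recordsProcessed"] += record["recordsProcessed"]
        totals["errorRecords"] += record["errorRecords"]
    return dict(totals)

aggregated = aggregate(worker_metrics)
# aggregated -> {'recordsProcessed': 3250, 'errorRecords': 4}
```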

Similarly, the Data Collector worker processes send a heartbeat back to the gateway Data Collector. The gateway Data Collector sends the URL of each Data Collector worker back to the Control Hub messaging queue.