Tutorial: Data Collectors, Pipelines, and Jobs

This tutorial covers the basic tasks required to get up and running with StreamSets Control Hub. In this tutorial, we create Control Hub user accounts, register a Data Collector with Control Hub, design and publish a pipeline, and create and start a job for the pipeline.

Although our tutorial provides a simple use case, keep in mind that Control Hub is a powerful tool that enables you to orchestrate and monitor large numbers of pipelines running on groups of Data Collectors.

Note: You can also use Control Hub to map multiple jobs into a topology. A topology provides an interactive end-to-end view of data as it traverses multiple pipelines. We'll save the concept of topologies for our next tutorial, described in Tutorial: Topologies.

To get started with Control Hub, we'll complete the following tasks:

  • Request an organization and user account.
  • Create user accounts for your organization.
  • Register a Data Collector with Control Hub.
  • Log in to Control Hub and assign labels to the Data Collector.
  • Design and publish a pipeline.
  • Add, start, and monitor a job for the pipeline.

Before You Begin

This tutorial assumes that you have a running StreamSets Data Collector and a user account to log in to Data Collector.

Data Collectors work directly with Control Hub - they execute standalone and cluster pipelines run from Control Hub jobs.

Before you begin this tutorial, you'll need a few things:

  1. A running Data Collector.
    If you haven't already, download and install Data Collector. For installation instructions, see Installation in the Data Collector documentation.
  2. A Data Collector login with the admin role or both the creator and manager roles.
    If you haven't set up custom user accounts, you can use the admin account shipped with Data Collector. The default login is: admin / admin.

Request an Organization and User Account

To use StreamSets Control Hub, you need a user account within an organization.

An organization groups all user accounts from an enterprise. All Data Collectors, pipelines, jobs, and topologies added by any user in the organization belong to that organization.

An organization typically includes multiple user accounts that perform different roles within Control Hub. For example, a data engineer designs and publishes pipelines, a DevOps or site reliability engineer runs jobs on registered Data Collectors, and a data architect manages topologies.

If you already have an organization defined for your enterprise, ask the organization administrator for a user account. If you don’t have access to an organization, contact StreamSets with a request for a new organization.

The initial user account created for your organization has all of the roles required to complete these tutorials.

Create User Accounts for Your Organization

If your organization currently uses Data Collector, you must create a Control Hub user account and group for each Data Collector user account and group before you register a Data Collector.

After a Data Collector is registered with Control Hub, the Data Collector uses Control Hub user authentication. Only Control Hub user accounts can log in to registered Data Collectors.

If your organization is not using Data Collector, you can skip this step and continue with Register a Data Collector. You can create new Control Hub user accounts and groups at any time.

Registered Data Collectors belong to the organization of the user who completes the registration. All Control Hub user accounts that belong to the same organization and have the appropriate roles can log in to the registered Data Collectors.

Tip: If you register the Data Collector from the Data Collector UI, you can skip this step and create Control Hub user accounts during the registration process, as described in Register a Data Collector.
  1. Locate the list of Data Collector users and groups and their assigned Data Collector roles, based on whether Data Collector uses file-based or LDAP authentication:
    • File-based authentication - In the Data Collector configuration directory, $SDC_CONF, open the authentication realm.properties file for the type of file-based authentication that Data Collector uses: basic, digest, or form.
    • LDAP authentication - Open the Data Collector configuration file, $SDC_CONF/sdc.properties, and locate the value of the http.authentication.ldap.role.mapping property. The property maps groups defined by the LDAP server to Data Collector roles. You'll need to create a unique Control Hub user account with the appropriate role for each user in the LDAP group.
    Next, let's log into Control Hub to create the corresponding Control Hub users and groups.
  2. Enter the following URL in the address bar of your browser:
    https://cloud.streamsets.com
  3. Enter your Control Hub user ID and click Continue.
    User IDs use the following format: <ID>@<organization ID>.
  4. Enter your password and click Log In.
  5. In the Navigation Panel, click Administration > Users.
  6. Click the Add New User icon.
  7. On the Add User window, configure the following properties:
    • User ID - User ID for the account. Use the following naming convention: <ID>@<organization ID>
    • Display Name - Display name for the account.
    • Email Address - Email address for the account.
    • SAML User Name - SAML identity provider user account to map to this Control Hub user account. Configure this property when SAML authentication is enabled for the organization. If not using SAML authentication, you can leave the default value, which is the same as the email address for the account.

  8. Assign the user the Organization User role.
  9. Assign the user the corresponding Engine role.

    Control Hub includes Engine roles that enable performing tasks in a registered Data Collector or Transformer. Each role provides the same access as the corresponding Data Collector or Transformer role:

    • Engine Administrator - admin
    • Engine Manager - manager
    • Engine Creator - creator
    • Engine Guest - guest
  10. Clear all other roles for now.
    At this point, we just want to make sure that Data Collector users can continue to log in to Data Collector after it is registered. After you are up and running with Control Hub, you can determine which Control Hub roles each user needs.
  11. Click Save.
  12. Repeat these steps to create a corresponding Control Hub user account for each Data Collector user.
  13. In the Navigation Panel, click Administration > Groups.
  14. To create a Control Hub group for each Data Collector group, click the Add New Group icon.
  15. On the Add Group window, configure the following properties:
    • Group ID - Group ID for the group. Use the following naming convention: <ID>@<organization ID>
    • Display Name - Display name for the group.
  16. Clear all default roles assigned to the group.
    For now, we're simply setting up users and groups to match the users and groups in Data Collector. And in Data Collector, you assign roles to users only.
    In Control Hub, you can assign roles to both users and groups. After you are up and running with Control Hub, consider assigning roles to groups to help you more efficiently manage user accounts, as described in Users and Groups.
  17. Repeat these steps to create a corresponding Control Hub group for each Data Collector group.
  18. After creating the groups, edit user accounts to add users to the appropriate group.
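
To make the mapping concrete, here's a minimal Python sketch (not part of Control Hub) that turns a hypothetical http.authentication.ldap.role.mapping value into Control Hub user IDs and Engine roles, using the <ID>@<organization ID> naming convention from step 7. The group names, members, and organization ID are invented for illustration:

```python
# Control Hub Engine roles corresponding to Data Collector roles
# (from the Engine role mapping shown earlier in this section).
SDC_TO_CH_ROLE = {
    "admin": "Engine Administrator",
    "manager": "Engine Manager",
    "creator": "Engine Creator",
    "guest": "Engine Guest",
}

def parse_role_mapping(value):
    """Parse an http.authentication.ldap.role.mapping value such as
    'engineering:admin;ops:manager' into {ldap_group: sdc_role}."""
    mapping = {}
    for pair in value.split(";"):
        group, _, role = pair.partition(":")
        mapping[group.strip()] = role.strip()
    return mapping

def control_hub_accounts(role_mapping, group_members, org_id):
    """Build (user_id, engine_role) pairs using the
    <ID>@<organization ID> naming convention."""
    accounts = []
    for group, sdc_role in role_mapping.items():
        for user in group_members.get(group, []):
            accounts.append((f"{user}@{org_id}", SDC_TO_CH_ROLE[sdc_role]))
    return accounts

# Hypothetical LDAP groups, members, and organization ID.
mapping = parse_role_mapping("engineering:admin;ops:manager")
members = {"engineering": ["jsmith"], "ops": ["mlee"]}
print(control_hub_accounts(mapping, members, "acme"))
# → [('jsmith@acme', 'Engine Administrator'), ('mlee@acme', 'Engine Manager')]
```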

Register a Data Collector

You can register Data Collectors from Control Hub or from Data Collector. In these instructions, we'll use Data Collector. StreamSets recommends registering the latest version of Data Collector to ensure that you can use the newest features.

For a Data Collector installation with Cloudera Manager, you must use Cloudera Manager to register the Data Collector.

When you register a Data Collector, Data Collector generates an authentication token that it uses to issue authenticated requests to Control Hub.
Tip: You can also register Data Collectors by automatically provisioning them on a container orchestration framework such as Kubernetes.
  1. Log in to Data Collector using your Data Collector user account.
    Control Hub requires that the statistics stage library be installed on each registered Data Collector. So let's verify that the stage library is installed before we register our Data Collector.
  2. Click the Package Manager icon to display the list of installed stage libraries.
  3. Verify that the Package Manager displays a check mark next to the statistics library, indicating that it is installed.
    If the statistics library is not installed, install the library before completing the remaining steps in this task. For instructions on installing additional stage libraries, see installing additional stage libraries in the Data Collector documentation.

    Next, let's register our Data Collector.

  4. Click Administration > Enable Control Hub.
  5. Enter the following information in the Enable Control Hub window:
    • Control Hub Base URL - URL to access Control Hub. Set to https://cloud.streamsets.com.
    • Control Hub User ID - Enter your Control Hub user ID using the following format: <ID>@<organization ID>
    • Control Hub User Password - Enter the password for your Control Hub user account.
    • Labels for this Data Collector - Let's leave the default label named all for now. We'll explore labels in more detail a little later.
  6. Click Enable Control Hub.
  7. If Data Collector uses file-based authentication and you skipped the previous task to create Control Hub user accounts, click Create Control Hub Groups and Users.
    The Create Control Hub Groups and Users window maps all existing Data Collector user accounts and groups to Control Hub user accounts and groups. You can remove users or groups and can edit the IDs, names, and email addresses as needed.

    When you have finished reviewing the users and groups to create, click Create. Each new user is assigned a default set of Control Hub roles. Groups are not assigned any roles. After you log in to Control Hub, change those role assignments as needed to secure the integrity of your organization and data.

  8. Open the Data Collector configuration file, $SDC_CONF/sdc.properties, and verify that the http.authentication.login.module property is set to file.
    Control Hub requires that each registered Data Collector be configured for file-based authentication. After a Data Collector is registered with Control Hub, the Data Collector uses the authentication method enabled for Control Hub.
  9. Restart the Data Collector.
    After you restart the Data Collector, the StreamSets Control Hub Enabled icon displays:

    After a Data Collector is registered with Control Hub, you use a Control Hub user account to log in to the Data Collector.
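
As a concrete illustration, the file-based authentication requirement from the steps above corresponds to a single line in $SDC_CONF/sdc.properties. A hypothetical excerpt (your file contains many other properties):

```properties
# Hypothetical excerpt from $SDC_CONF/sdc.properties.
# Control Hub requires file-based authentication on registered Data Collectors.
http.authentication.login.module=file
```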

Log In to Control Hub

We'll log in to Control Hub to view the Data Collector that we registered.

  1. Enter the following URL in the address bar of your browser:
    https://cloud.streamsets.com
  2. Enter your Control Hub user ID and click Continue.
    User IDs use the following format: <ID>@<organization ID>.
  3. Enter your password and click Log In.
    The Dashboards view displays, presenting a summary of your organization's activity in Control Hub.

    We can see that we have one registered Data Collector so far:

  4. In the Navigation Panel, click Execute > Data Collectors.
    In the Data Collectors view, we can see the Data Collector that we just registered with Control Hub:

    So far, we have only registered one Data Collector, but you will most likely register multiple Data Collectors. All registered Data Collectors are listed in this view.

  5. Click the last reported time for our registered Data Collector to expand the details.

    The Data Collector details include historical time series charts for the CPU load and memory usage of the Data Collector. The details also include the Data Collector ID, URL, last reported time, version, and labels - which is the next topic in our tutorial.

Assign Labels to the Data Collector

Use labels to group Data Collectors registered with Control Hub. You assign labels to each Data Collector, using the same label for Data Collectors that you want to function as a group.

When you create a job, you assign labels to the job so that Control Hub knows on which group of Data Collectors the job should start.

For example, your organization uses development and test environments to design and test pipelines before replicating the final pipelines in the production environment. You assign a Test label to execution Data Collectors used to run test pipelines and a Production label to execution Data Collectors used to run production pipelines. When you create jobs, you select the appropriate label to ensure that the jobs are started in the correct environment.

You can assign multiple labels to Data Collectors to group Data Collectors by a combination of projects, geographic regions, environments, departments, or any other classification you choose.
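
Conceptually, label matching is a set intersection between a job's labels and each Data Collector's labels. A minimal Python sketch of that idea, with invented Data Collector names and labels (Control Hub's actual scheduling is more involved):

```python
def matching_collectors(job_labels, collectors):
    """Return names of Data Collectors whose labels intersect
    the job's labels."""
    return [name for name, labels in collectors.items()
            if set(labels) & set(job_labels)]

# Hypothetical registered Data Collectors and their labels.
collectors = {
    "sdc-west": ["all", "western region"],
    "sdc-east": ["all", "eastern region"],
    "sdc-test": ["all", "Test"],
}
print(matching_collectors(["western region"], collectors))
# → ['sdc-west']
```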

You can assign labels to a Data Collector in the following ways:
Control Hub configuration file
Define labels for the dpm.remote.control.job.labels property in the Control Hub configuration file, $SDC_CONF/dpm.properties, located in the Data Collector installation.
By default, the dpm.properties file is configured to assign an all label to each Data Collector, which you can use to group all registered Data Collectors. For our tutorial, we'll leave the default label as is.
Control Hub UI
View the details of a registered Data Collector in the Control Hub UI, and then add the label.
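
For reference, a hypothetical excerpt of the label property in $SDC_CONF/dpm.properties, assuming a comma-separated list of labels (the all value is the documented default; western region is our invented addition):

```properties
# Hypothetical excerpt from $SDC_CONF/dpm.properties.
# The default assigns the all label; additional labels can be
# defined here (comma-separated) instead of through the UI.
dpm.remote.control.job.labels=all,western region
```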

Let's assume that our newly registered Data Collector is located on the West Coast and will be used to run pipelines for departments located in the west. So we'll use the Control Hub UI to assign a new label to designate the western region.

  1. In the Data Collectors view, click the last reported time for our registered Data Collector to expand the details.
  2. Click Enter New Labels and then type: western region.
  3. Press Enter, and then click Save.
    Our Data Collector is now assigned two labels. Note that the only label we can remove from the Control Hub UI is the label that we just added. We'd have to modify the $SDC_CONF/dpm.properties file in the Data Collector installation to remove the default all label:

Design a Pipeline

You design pipelines in Control Hub using the Control Hub Pipeline Designer. A pipeline describes the flow of data from an origin system to destination systems and defines how to transform the data along the way.

StreamSets Control Hub provides support for multiple origin and destination systems - such as relational databases, log files, cloud storage platforms, or the Hadoop Distributed File System (HDFS). For a complete list of supported systems, see Origins and Destinations.

For now, we'll design and configure statistics for a single test pipeline that uses development stages. If you already have a test pipeline in the Control Hub pipeline repository that you'd like to use, feel free to use it and activate at least one metric rule for it. Otherwise, create a simple pipeline using the development stages and metric rules as described below.

  1. In the Navigation panel, click Pipeline Repository > Pipelines.
  2. Click the Create New Pipeline icon.
  3. Enter the following name: Tutorial Pipeline.
  4. Keep the defaults to create a blank Data Collector pipeline, and then click Next.
    Note: You can also use Control Hub to create pipelines for Data Collector Edge (SDC Edge) and to create pipelines from templates. As you explore the Control Hub Pipeline Designer in more detail, you might want to review the pipeline templates to learn how to develop a similar pipeline.
  5. Select the system Data Collector as the authoring Data Collector, and then click Create.

    Optionally, select another accessible registered Data Collector.

  6. From the Pipeline Creation Help bar, click Select Origin > Dev Random Record Source.
  7. Use the Pipeline Creation Help bar to add the following additional development stages to the pipeline:
    • Dev Random Error processor
    • Trash destination

    The pipeline canvas should look like this:

    Next let's activate a metric rule for the pipeline so that we can see how Control Hub displays alerts. Metric rules and alerts provide notifications about real-time statistics for pipelines.

  8. Click a blank area in the canvas to display the pipeline properties.
  9. In the Properties panel, click the Rules tab.
  10. Select the High Incidence of Error Records rule and then click Activate.

    This metric rule triggers when the pipeline encounters more than 100 error records. We're sure that this rule will trigger since our tutorial pipeline includes the Dev Random Error processor.

    The Metric Rules tab should look like this:

  11. In the Properties panel, click Configuration > Error Records, and configure the pipeline to discard error records.

    Next, we'll configure statistics for the pipeline so that we can monitor statistics and metrics when we run the pipeline from a job.

  12. On the Statistics tab, select Write to Control Hub Directly so that the pipeline writes the statistics to Control Hub.
    Note: For our tutorial, writing statistics directly to Control Hub is sufficient because our job will run on a single Data Collector. When a job runs on multiple Data Collectors, a remote pipeline instance runs on each of the Data Collectors. To view aggregated statistics for the job within Control Hub, you must configure the pipeline to write the statistics to another system. For more information about pipeline statistics, see Pipeline Statistics.
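
Before moving on, note that the metric rule we activated boils down to a simple threshold check. A rough Python sketch of that logic, for illustration only (the function name and threshold handling are not Control Hub internals):

```python
def rule_triggered(error_record_count, threshold=100):
    """The High Incidence of Error Records rule fires once the
    pipeline sees more than `threshold` error records."""
    return error_record_count > threshold

print(rule_triggered(42))    # → False
print(rule_triggered(250))   # → True
```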

Publish the Pipeline

Next, we'll publish the pipeline to indicate that our design is complete and the pipeline is ready to be added to a job and run. When you publish a pipeline, you enter a commit message. Control Hub maintains the commit history of each pipeline.

  1. With the tutorial pipeline open in the Pipeline Designer canvas, click the Publish icon.
  2. Enter a commit message. For now, let's enter: Initial commit.
    As a best practice, state what changed in this pipeline version so that you can track the commit history of the pipeline.
  3. Click Publish.
    Notice how the top toolbar in the pipeline canvas indicates that this is version v1 of the pipeline and that the pipeline is in read-only mode:

    All published pipelines display in read-only mode. You must click the Edit icon to create another draft before you can modify the pipeline.
    Let's modify our published pipeline, and then go ahead and publish the pipeline a second time so that we can view the commit history of the pipeline.
  4. Click the Edit icon.

    Notice how the top toolbar in the pipeline canvas indicates that this is version v2-DRAFT of the pipeline and that the pipeline is in edit mode:

  5. On the General tab for the pipeline, enter the following text for the Description property: Test pipeline to try out Control Hub.

    Let's go ahead and publish the updated pipeline.

  6. Click the Publish icon, and then enter the following commit message: Another test commit.
  7. Click Publish.
  8. Click the History icon.
    The Pipeline Commits window displays details about each of our commits:

    For now, we haven't changed anything substantial in our two commits. But in a development cycle that involves many pipeline changes, maintaining the commit history helps us understand the changes. We can even retrieve an older version of a pipeline and continue editing that version. Let's try that now.
  9. In the Pipeline Commits window, click the link for version 1.
    Control Hub displays version 1 of our pipeline in the pipeline canvas. We can make edits to version 1 of the pipeline, and then publish a third version of the pipeline.

Now that we've registered a Data Collector, designed a pipeline, and published two versions of the test pipeline to Control Hub, let's add the pipeline to a job so that we can run it.

Add a Job for the Pipeline

A job defines the pipeline to run and the Data Collectors that run the pipeline.

When you add a job, you specify the published pipeline to run and you select Data Collector labels for the job. The labels indicate which group of Data Collectors should run the pipeline.

  1. With the tutorial pipeline open in the Pipeline Designer canvas, select the arrow next to the v1 label and then select v2 - our latest version of this pipeline.
    You can create a job for any published pipeline version. But we'll go ahead and create a job for version 2.
  2. Click the Create Job icon.
  3. Let's use the default name: Job for Tutorial Pipeline.
  4. Click under Data Collector Labels, and then select the western region label that we assigned to our Data Collector.
  5. Keep the default values for the remaining properties.
  6. Click Save.

    Control Hub displays our newly created job in the Jobs view.

Start and Monitor the Job

Let's start our job and monitor its progress in Control Hub.

  1. In the Jobs view, click the Start Job icon for our tutorial job.
    Control Hub indicates that the job is active.
  2. Click the name of the job.
    Control Hub displays the pipeline in the canvas and displays statistics for the job in the Monitor panel. If the statistics in the Summary tab don't display yet, click the Refresh icon. By default, each job is configured to refresh the statistics every 60 seconds.
    Note: Remember how we configured our pipeline to write statistics directly to Control Hub? That configuration enables Control Hub to display statistics and metrics for the job in the Monitor panel. If you configure a pipeline to discard statistics, then Control Hub displays the following warning message:
    Aggregated metrics for the job are not available as individual pipeline metrics are discarded.
    After a bit of time, our High Incidence of Error Records pipeline alert triggers. The alert displays in the monitoring view for the job as follows:

    Note: In addition, the Alerts icon in the top toolbar indicates that the alert has triggered. You can click the icon and then click View All to view the alert in the Alerts view.

    We can view statistics, configuration, and rules details for the entire pipeline using the tabs in the Monitor panel. Or, we can select a stage in the canvas to view statistics and configuration details for that stage.

    Let's stop our job so that the pipeline doesn't run indefinitely.

  3. Click the Stop Job icon for our active job.
    The Confirmation dialog box appears.
  4. To stop the job, click Yes.
    The job transitions from a deactivating to an inactive state.
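
The lifecycle we just walked through can be sketched as a tiny state machine. The transition table below is a simplified illustration for this tutorial, not the full Control Hub job state model:

```python
# Simplified job lifecycle: start a job, stop it, and wait for the
# remote pipeline instances to shut down.
TRANSITIONS = {
    ("INACTIVE", "start"): "ACTIVE",
    ("ACTIVE", "stop"): "DEACTIVATING",
    ("DEACTIVATING", "pipelines stopped"): "INACTIVE",
}

def next_state(state, event):
    """Look up the job state reached from `state` on `event`."""
    return TRANSITIONS[(state, event)]

state = "INACTIVE"
for event in ("start", "stop", "pipelines stopped"):
    state = next_state(state, event)
    print(state)
# → ACTIVE, then DEACTIVATING, then INACTIVE
```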

That's the end of our Control Hub tutorial on Data Collectors, pipelines, and jobs. Remember that our tutorial was simple by design to introduce the concepts of Control Hub. However, the true power of Control Hub is its ability to orchestrate and monitor many pipelines running across groups of Data Collectors.