Tutorial: Topologies

This tutorial covers working with topologies in StreamSets Control Hub. A topology provides an interactive end-to-end view of data as it traverses multiple pipelines. In this tutorial, we create and publish an SDC RPC origin pipeline and an SDC RPC destination pipeline, create jobs for the pipelines, map the related jobs in a topology, measure the progress of the topology, and then define and monitor data SLAs for the topology.

Although our topology tutorial provides a simple use case, keep in mind that Control Hub enables you to manage topologies for multiple complex pipelines. For example, you can group all data flow activities that serve the needs of one business function under a single topology.

Note: This topology tutorial assumes that you are familiar with the concepts introduced in the earlier tutorial, Tutorial: Data Collectors, Pipelines, and Jobs.

To get started with Control Hub topologies, we'll complete the following tasks:

  1. Design and Publish Related Pipelines
  2. Add Jobs for the Related Pipelines
  3. Map Jobs in a Topology
  4. Measure and Monitor the Topology
  5. Monitor the Topology with Data SLAs

Design and Publish Related Pipelines

A topology provides a view into multiple related pipelines. To create our related pipelines, we'll configure an SDC RPC origin pipeline that passes data to an SDC RPC destination pipeline.

We'll configure statistics for each pipeline so that we can monitor the pipeline statistics and metrics from a job. Then, we'll publish both pipelines to indicate that our design is complete and the pipelines are ready to be added to jobs and run.

  1. In the Navigation panel, click Pipeline Repository > Pipelines.
  2. Click the Create New Pipeline icon.
  3. Enter the following name: Origin Pipeline.
  4. Keep the defaults to create a blank Data Collector pipeline, and then click Next.
  5. Select the system Data Collector as the authoring Data Collector, and then click Create.

    Optionally, select another accessible, registered Data Collector instead.

  6. Use the Pipeline Creation Help bar to add the following stages to the pipeline:
    • Dev Random Record Source origin
    • Dev Random Error processor
    • SDC RPC destination
  7. In the SDC RPC destination, configure the following properties on the RPC tab, and use the defaults for properties that aren't listed:
    • SDC RPC Connection - Enter the host name of the machine running Data Collector and an available port number on the machine, using the following format: <host name>:<port number>
    • SDC RPC ID - Enter: TutorialID.
    These two properties pair the destination with the SDC RPC origin pipeline that we'll create next - see the sketch after these steps.

    At this point, the origin pipeline should look like this:

    Next, let's configure error handling and statistics for the pipeline.

  8. Click the canvas to return focus to the pipeline instead of the SDC RPC destination.
  9. Click Configuration > Error Records, and then configure the pipeline to discard error records.
  10. On the Statistics tab, select Write to Control Hub Directly.
  11. Click the Publish icon.
  12. Enter a commit message, and then click Publish.
    Now let's use similar steps to create our SDC RPC destination pipeline.
  13. Click the Navigation Panel icon, and then click Pipeline Repository > Pipelines.
  14. Create another pipeline titled "Destination Pipeline" and add the following stages to the pipeline:
    • SDC RPC origin
    • Dev Random Error processor
    • Trash destination
  15. In the SDC RPC origin, configure the following properties on the RPC tab, and use the defaults for properties that aren't listed:
    • SDC RPC Listening Port - Enter the same port number that you entered for the SDC RPC destination in step 7.
    • SDC RPC ID - Enter: TutorialID.

    At this point, the destination pipeline should look like this:

  16. Configure the pipeline to discard error records.
  17. On the Statistics tab, select Write to Control Hub Directly.
  18. Click the Publish icon.
  19. Enter a commit message, and then click Publish.
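
Before we move on, note the two settings that tie these pipelines together: the SDC RPC destination points at <host name>:<port number>, the SDC RPC origin listens on that same port, and both stages use the same SDC RPC ID. The following sketch is a hypothetical illustration of those pairing rules only - it is not the SDC RPC wire protocol, and the host name and port are made-up example values.

    # Hypothetical sketch of the pairing rules from steps 7 and 15: the
    # destination's connection string must point at the origin's listening
    # port, and both stages must share the same SDC RPC ID. The host name
    # and port below are made-up example values.

    def rpc_pair_is_valid(destination_connection: str,
                          origin_listening_port: int,
                          destination_rpc_id: str,
                          origin_rpc_id: str) -> bool:
        """Return True when the two SDC RPC stages are configured as a pair."""
        host, _, port = destination_connection.rpartition(":")
        return (bool(host)
                and port.isdigit()
                and int(port) == origin_listening_port
                and destination_rpc_id == origin_rpc_id)

    # The tutorial uses the same SDC RPC ID, "TutorialID", on both sides.
    print(rpc_pair_is_valid("sdc.example.com:20000", 20000,
                            "TutorialID", "TutorialID"))   # True
    print(rpc_pair_is_valid("sdc.example.com:20000", 20000,
                            "TutorialID", "WrongID"))      # False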

Now that we've designed and published related pipelines, let's add jobs for the related pipelines.

Add Jobs for the Related Pipelines

In a topology, you map jobs that contain related pipelines. Let's create jobs for the related pipelines that we just published.

  1. In the Navigation panel, click Pipeline Repository > Pipelines.
  2. Select both the destination and origin pipelines in the pipeline list, and then click the Create Job icon.
    The Add Jobs window displays two pages - one for each pipeline that we selected.
  3. For the name of the destination job, enter: Tutorial Destination Job.
  4. Click under Data Collector Labels, and then select the "western region" label.
    Labels determine which Data Collectors can run the job - see the sketch after these steps.
  5. Keep the default values for the remaining properties.
  6. Click Next.
    Now let's complete the same steps for our SDC RPC origin pipeline.
  7. For the name of the job, enter: Tutorial Origin Job.
  8. Select "western region" for the label, and then click Create.
    The Jobs view lists our two newly created jobs.
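
The "western region" label that we selected in step 4 is what links these jobs to the Data Collectors that will run them. The sketch below illustrates one simplified matching rule - a Data Collector qualifies when it carries every label on the job - using invented Data Collector names; it is not Control Hub's actual scheduling logic.

    # Hypothetical sketch of label matching between a job and registered Data
    # Collectors. Simplified rule for illustration only: a Data Collector
    # qualifies when it carries every label on the job. Names are invented.

    def matching_data_collectors(job_labels, registered):
        """Return the Data Collectors whose labels cover every job label."""
        required = set(job_labels)
        return [name for name, labels in registered.items()
                if required <= set(labels)]

    registered = {
        "sdc-01.example.com": {"western region"},
        "sdc-02.example.com": {"western region", "dev"},
        "sdc-03.example.com": {"eastern region"},
    }

    print(matching_data_collectors(["western region"], registered))
    # ['sdc-01.example.com', 'sdc-02.example.com']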

Map Jobs in a Topology

We'll add a topology and then map the jobs that work together to create a complete data flow.

  1. In the Navigation panel, click Topologies.
  2. Click the Add Topology icon.
  3. Let's enter the name "Tutorial Topology" and then click Save.
    Control Hub displays an empty topology in the canvas. Next, we'll add our related jobs.
  4. Click the Add Job icon, and select "Tutorial Origin Job".

    Control Hub adds our origin job to the topology canvas, using circular icons for the connecting systems:

    Our tutorial origin job connects to the following systems:

    • Dev Random Record - The origin of our pipeline. Since we are using one of the development stages, we don't have another related job that sends data into this origin.
    • SDC RPC 1 - The destination of our pipeline. We'll connect our related tutorial destination job to this system.
    • Error Records Discard - A visualization of the error record handling configured for the pipeline. Since we are discarding our error records, we don't have another related job that processes those records. However, if we had written the error records to another system - for example, Kafka or Elasticsearch - we could connect a job that reads the records from that system.

      You can delete a connecting system in a topology if you do not want to measure the flow of data through the system. For example, in a complex topology, you might want to delete the Error Records Discard system in the canvas. However, we'll leave it in our example since we're designing a simple topology. A sketch of how the jobs and connecting systems fit together as a simple graph follows these steps.

  5. Select the SDC RPC 1 system displayed in the canvas.
  6. Click the Add Job icon again, and select "Tutorial Destination Job".
    Control Hub adds our destination job to the topology canvas, connecting it to the SDC RPC 1 system. The canvas displays each job and connecting system as follows:

  7. Let's double-click our Tutorial Origin Job in the canvas.
    Control Hub displays details about the selected job in the detail pane.
  8. Expand the Origin Pipeline section in the detail pane to see how we can view the design of a pipeline from a topology.
    Control Hub displays the complete pipeline included in our tutorial origin job:

    You can view the design of each pipeline included in a topology, providing you with a central view into the complete data flow.
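
Stepping back, the canvas we just mapped is essentially a small directed graph: jobs and connecting systems are the nodes, and the edges follow the direction of data. The sketch below models our tutorial topology that way purely as a reading aid - it is not a Control Hub API, and the connecting systems listed are simply the ones our two pipelines produce.

    # Hypothetical model of the tutorial topology as a directed graph: jobs and
    # connecting systems are nodes, edges follow the data. A reading aid only,
    # not a Control Hub API.

    from collections import defaultdict

    edges = [
        ("Dev Random Record", "Tutorial Origin Job"),
        ("Tutorial Origin Job", "SDC RPC 1"),
        ("Tutorial Origin Job", "Error Records Discard"),
        ("SDC RPC 1", "Tutorial Destination Job"),
        ("Tutorial Destination Job", "Trash"),
        ("Tutorial Destination Job", "Error Records Discard"),
    ]

    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)

    def print_flow(node, depth=0):
        """Print the data flow reachable from the given node."""
        print("    " * depth + node)
        for child in downstream[node]:
            print_flow(child, depth + 1)

    print_flow("Dev Random Record")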

Measure and Monitor the Topology

You can measure and monitor the progress of all running pipelines included in a topology. Let's start the jobs and then measure the progress of the running pipelines from our topology view.

Note: We'll need to start the SDC RPC destination job first, because an SDC RPC origin pipeline fails to start if it cannot connect to a running SDC RPC destination pipeline.
  1. Select the Tutorial Destination Job in the topology canvas, and then click the Start Job icon.
    Control Hub notifies you that the job is active.
  2. Select the Tutorial Origin Job in the canvas, and click the Start Job icon.
  3. Click the topology canvas to return focus to the topology instead of the origin job.

    The detail pane provides a single view into the record count and throughput for all jobs in the topology:

    Tip: If the record count and throughput statistics do not display yet, click the Refresh icon.
    You can select any job or system in the topology canvas to monitor the progress of that job or system. For example, if we select our tutorial origin job, the detail pane displays the aggregated statistics and metrics for that job. A sketch of the kind of roll-up behind these numbers follows these steps.
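
To make the numbers in the detail pane concrete, the sketch below shows the kind of roll-up they represent: per-job record counts summed into topology totals, with throughput derived from counts over time. The field names and figures are invented for illustration and are not Control Hub's metrics schema.

    # Hypothetical sketch of a topology-level roll-up: sum each job's record
    # counts and derive a records-per-second figure. Field names and sample
    # numbers are invented; this is not Control Hub's metrics schema.

    from dataclasses import dataclass

    @dataclass
    class JobMetrics:
        job_name: str
        output_records: int
        error_records: int
        runtime_seconds: float

    def topology_rollup(job_metrics):
        """Aggregate per-job counts into topology totals and throughput."""
        output_total = sum(m.output_records for m in job_metrics)
        error_total = sum(m.error_records for m in job_metrics)
        runtime = max(m.runtime_seconds for m in job_metrics)
        return {
            "output_records": output_total,
            "error_records": error_total,
            "records_per_second": output_total / runtime if runtime else 0.0,
        }

    snapshot = [
        JobMetrics("Tutorial Origin Job", 96_000, 24_000, 120.0),
        JobMetrics("Tutorial Destination Job", 72_000, 24_000, 118.0),
    ]
    print(topology_rollup(snapshot))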

Monitor the Topology with Data SLAs

A data SLA (service level agreement) defines the data processing rates that jobs within a topology must meet. Data SLAs provide immediate feedback on the data processing rates expected by your team. They enable you to monitor your data operations and quickly investigate issues that arise.

Let's see how we can define data SLAs for our topology. We'll create a data SLA to define the threshold that the error record rate cannot exceed. The data SLA triggers an alert when the threshold is reached.
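
Before we click through the steps, here is the rule we're about to express, shown as a hypothetical sketch: measure a job's error records per second and raise the configured alert text once that rate exceeds the maximum value. The real evaluation happens inside Control Hub; the function below only illustrates the Max threshold semantics with the values we're about to enter.

    # Hypothetical sketch of the data SLA rule configured below: alert when the
    # measured error rate exceeds the Max Value. Control Hub performs the real
    # evaluation; this only illustrates the threshold semantics.

    def evaluate_data_sla(error_records_per_second, max_value, alert_text):
        """Return the alert text when the rate exceeds the threshold, else None."""
        if error_records_per_second > max_value:
            return alert_text
        return None

    # With a Max Value of 50, the roughly 800 error records per second produced
    # by the Dev Random Error processor (as we'll see shortly) trip the alert.
    print(evaluate_data_sla(
        800, 50, "More than 50 error records encountered per second."))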

  1. In the detail pane of our topology, locate the Data SLA section.
  2. Click the Add icon.
  3. In the Add Data SLA window, configure the following properties:
    • Label - Enter: Error record rate.
    • Job - Select: Tutorial Origin Job.
    • QOS Parameter - Select: Error Rate.
    • Function Type - Select: Max.
    • Max Value - Enter: 50.
    • Alert Text - Enter: More than 50 error records encountered per second.

    The configured data SLA should look like this:

  4. Click Add.
    Note that our data SLA is listed as inactive - let's go ahead and activate it.
  5. Select the data SLA in the list, and then click the Activate icon.
    Next, we'll wait for just a bit, and then we'll see the red Alerts icon in the top toolbar.
  6. To display the Alerts view, click the Alerts icon when it appears.
    The Alerts view lists all active and acknowledged alerts - we'll only see one alert for now.
  7. First, let's acknowledge our alert by hovering over the alert and clicking the Acknowledge icon.
  8. Next, click the alert message to open the tutorial topology again.
  9. Expand the Data SLA section in the detail pane.
  10. Then click our data SLA to display its details.
    The details display an error throughput graph, including a red line that represents our defined maximum threshold of 50 error records per second:

    It looks like our actual error rate is well above the maximum error rate that we defined.

    When we configured our data SLA, we forgot that our pipeline includes the Dev Random Error processor, which produces about 800 error records per second. That means our alert will always trigger. Let's fix our alert in real time so that it triggers only when more than 900 error records occur in a second.

  11. Change the Max Value for our error rate parameter to 900, update the alert text to mention 900 error records, and then click Save.
    Our modified data SLA takes effect immediately. Note how the error throughput graph reflects our changed maximum value:

    Now we'll be alerted only when the error record rate exceeds our expectations. If our tutorial pipeline encounters more than 900 error records per second, we'll know that there's an issue we need to investigate and resolve.

That's the end of our Control Hub tutorial on topologies. Remember that our tutorial included only two simple jobs to introduce the concept of topologies. However, you can use Control Hub to manage and monitor topologies for multiple complex jobs.