Design in Control Hub

You can design pipelines and pipeline fragments in Control Hub using the Control Hub Pipeline Designer. You can use Pipeline Designer to develop pipelines and fragments for Data Collector, Transformer, or Data Collector Edge.

Pipeline Designer enables you to configure pipelines, preview data, and publish pipelines. You can also design and publish pipeline fragments.

You can create new pipelines or edit previously published pipelines. When you create a pipeline in Pipeline Designer, you can start with a blank canvas or with a sample pipeline. Pipeline Designer provides several system sample pipelines. You can also create user-defined sample pipelines. Use the Pipelines view to create a new pipeline or to access existing pipelines in the pipeline repository.

You can also create or edit previously published fragments. When you create a fragment in Pipeline Designer, you start with a blank canvas. Use the Pipeline Fragments view to create a new fragment or access existing fragments.

When you configure a pipeline or pipeline fragment in Pipeline Designer, you specify the authoring engine to use - Data Collector or Transformer. Pipeline Designer displays stages, stage libraries, and functionality based on the selected authoring engine.

For more information about using Pipeline Designer, see Pipeline Designer UI and Pipeline Designer Tips.

Authoring Engine

When you create or edit a pipeline or pipeline fragment in Pipeline Designer, you select the authoring engine to use - Data Collector or Transformer. You can select an accessible authoring engine that is registered with your Control Hub organization and that meets all of the requirements.

Choose an authoring engine that is the same version as the execution engines that you intend to use to run the pipeline. Using a different version can result in pipelines that are invalid for the execution engines.

For example, if the authoring Data Collector is a more recent version than the execution Data Collector, pipelines might include a stage, stage library, or stage functionality that does not exist in the execution Data Collector.
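The version guidance above can be sketched as a simple compatibility check. This is an illustrative helper only, not a Control Hub or Data Collector API; the function names and the dotted version format are assumptions for the example:

```python
# Hypothetical helper: flag an authoring engine that is newer than the
# execution engine, since its pipelines may reference stages or stage
# functionality the execution engine lacks.
def parse_version(version: str) -> tuple:
    """Turn a dotted version string like '3.19.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def authoring_is_compatible(authoring: str, execution: str) -> bool:
    # Safe when the authoring engine is the same version as (or older
    # than) the execution engine; risky when it is more recent.
    return parse_version(authoring) <= parse_version(execution)

print(authoring_is_compatible("3.18.0", "3.18.0"))  # True: same version
print(authoring_is_compatible("3.19.0", "3.18.0"))  # False: authoring is newer
```

In practice, StreamSets recommends matching versions exactly; the check above only flags the "authoring newer than execution" case called out in this section.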

An authoring engine must meet all of the following requirements:
  • The Data Collector or Transformer runs a supported version. StreamSets recommends using the latest version of Data Collector or Transformer.

    The minimum supported Data Collector version is 3.0.0.0. To design pipeline fragments, the minimum supported Data Collector version is 3.2.0.0. To create and use connections, the minimum supported Data Collector version is 3.19.0.

  • The Data Collector or Transformer uses the HTTPS protocol because Control Hub also uses the HTTPS protocol.
    Note: StreamSets recommends using a certificate signed by a certifying authority for a Data Collector or Transformer that uses the HTTPS protocol. If you use a self-signed certificate, you must first use a browser to access the Data Collector or Transformer URL and accept the web browser warning message about the self-signed certificate before users can select the component as the authoring engine.
  • The Data Collector or Transformer URL is reachable from the Control Hub web browser.
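The HTTPS and reachability requirements above can be expressed as a rough pre-flight check. This is an illustrative sketch using only the Python standard library; the URL and function name are placeholders, and a self-signed certificate still requires the browser acceptance step described in the note:

```python
# Illustrative pre-flight check for an authoring engine URL: verify that it
# uses the HTTPS protocol and that it is reachable from this host.
from urllib.parse import urlparse
from urllib.request import urlopen

def check_authoring_url(url: str) -> list:
    """Return a list of problems; an empty list means the URL passes."""
    problems = []
    if urlparse(url).scheme != "https":
        problems.append("Control Hub requires the HTTPS protocol")
    try:
        urlopen(url, timeout=5)
    except OSError as exc:
        # Unreachable hosts land here; an unaccepted self-signed
        # certificate also surfaces as an error until it is trusted.
        problems.append(f"URL not reachable: {exc}")
    return problems

# Placeholder URL for illustration:
print(check_authoring_url("http://sdc.example.com:18630"))
```

Note that this checks reachability from wherever the script runs; Control Hub requires the URL to be reachable from the web browser used to access Pipeline Designer.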

For more information about how a registered Data Collector works as an authoring Data Collector, see Registered Data Collector.

Note: Control Hub includes a system Data Collector for exploration and light development. Administrators can enable or disable the system Data Collector for use as the default authoring Data Collector in Pipeline Designer. For more information about how the system Data Collector works as an authoring Data Collector, see System Data Collector.

For example, the following Select an Authoring Data Collector window displays the system Data Collector and two registered Data Collectors as choices for the authoring Data Collector. The second registered Data Collector uses the HTTP protocol, so it is not accessible and cannot be selected:

When you edit a pipeline or fragment in Pipeline Designer, you can change the authoring engine as long as you select the same engine type. For example, when editing a Data Collector pipeline, you can select another authoring Data Collector. You cannot select an authoring Transformer.

Click the Authoring icon in the top right corner of Pipeline Designer to view which authoring engine is being used and, optionally, to change the selection.

For example, the following image shows a pipeline that is currently using an authoring Data Collector version 3.18.0:

Data Collector Stage Libraries

The selected authoring Data Collector determines the stage libraries that are installed and available for use as you design pipelines.

The stage library panel in the pipeline canvas displays all Data Collector stages. Stages that are not installed on the selected authoring Data Collector appear disabled, or greyed out. For example, the stage library panel shown below indicates that the Elasticsearch and Google BigQuery origins are not installed:

When the selected authoring Data Collector is a tarball installation, you can install additional stage libraries, including enterprise stage libraries, from Pipeline Designer. To install an additional stage library, click on a disabled stage. Confirm that you want to install the library, and then restart Data Collector for the changes to take effect.

Important: Installing an additional stage library from Pipeline Designer installs the library only on the selected authoring Data Collector. You must install the additional library on any other authoring Data Collector used to design the pipeline and on all execution Data Collectors where you run the pipeline.

When the selected authoring Data Collector is a core RPM installation, you must install additional RPM stage libraries using the Data Collector command line program, as described in Install Additional Stage Libraries in the Data Collector documentation.

When the selected authoring Data Collector is an RPM or Cloudera Manager installation, you must install enterprise stage libraries as custom stage libraries as described in Enterprise Stage Libraries in the Data Collector documentation.

External Libraries

The selected authoring Data Collector or Transformer determines the external libraries available to stages as you design pipelines. For example, some stages, such as most JDBC stages, require installing a JDBC driver as an external library on Data Collector or Transformer.

As you design pipelines in Pipeline Designer, each stage requiring an external library displays the currently installed libraries in the External Libraries tab in the stage properties panel. For example, the following image shows that a MySQL JDBC driver is installed for the JDBC stage library on the selected authoring Data Collector. As a result, this external library is available to the JDBC Query Consumer origin during pipeline design:

When needed, you can install an additional external library for a stage from Pipeline Designer by clicking Upload External Library from the External Libraries tab. Control Hub installs the external library to one of the following locations:
  • $SDC_DIST/streamsets-libs-extras directory on the selected authoring Data Collector
  • $TRANSFORMER_DIST/streamsets-libs-extras directory on the selected authoring Transformer
Important: Installing an external library from Pipeline Designer installs the library only on the selected authoring engine. You must install the external library on any other authoring engine that you use to design the pipeline and on all execution engines where you run the pipeline.
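As a rough illustration of the install locations listed above, the following sketch maps an engine type to its external-library directory. The engine-type keys and helper name are made up for this example; the paths are the ones documented above, with `$SDC_DIST` and `$TRANSFORMER_DIST` referring to each engine's installation directory:

```python
import os

# Install locations that Control Hub uses for uploaded external libraries,
# keyed by illustrative engine-type names.
EXTERNAL_LIB_DIRS = {
    "data_collector": "$SDC_DIST/streamsets-libs-extras",
    "transformer": "$TRANSFORMER_DIST/streamsets-libs-extras",
}

def external_lib_dir(engine_type: str) -> str:
    # os.path.expandvars substitutes $SDC_DIST / $TRANSFORMER_DIST when the
    # variable is set; otherwise the literal placeholder is returned.
    return os.path.expandvars(EXTERNAL_LIB_DIRS[engine_type])

os.environ["SDC_DIST"] = "/opt/streamsets-datacollector"  # example value
print(external_lib_dir("data_collector"))
# → /opt/streamsets-datacollector/streamsets-libs-extras
```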

For more information about installing external libraries on Data Collector, see Install External Libraries in the Data Collector documentation.

For more information about installing external libraries on Transformer, see External Libraries in the Transformer documentation.

Creating a New Pipeline

You can use Pipeline Designer to create new pipelines.

Pipeline Designer provides system sample pipelines. You can also create user-defined sample pipelines. You can review sample pipelines to learn how you might develop a similar pipeline, or you might use the samples as a starting point for pipeline development.

When you create a pipeline, you specify the type of pipeline to create - Data Collector Edge, Data Collector, or Transformer - whether to start from a blank canvas or from a sample pipeline, and the authoring Data Collector or Transformer to use. You can change the authoring Data Collector or Transformer during development.

  1. To create a pipeline, click Pipeline Repository > Pipelines to access the Pipelines view.
  2. Click the Add icon.
  3. Enter the name and optional description.
  4. Select the type of pipeline to create: Data Collector Edge, Data Collector, or Transformer.
  5. Select how you want to create the pipeline, and then click Next.
    • Blank Pipeline - Use a blank canvas for pipeline development.
    • Sample Pipeline - Use an existing sample pipeline as the basis for pipeline development.
  6. If you selected Sample Pipeline, in the Select a Sample Pipeline dialog box, filter by the sample type, select the sample to use, then click Next.
  7. Select the authoring Data Collector or Transformer to use, then click Create.
    Pipeline Designer opens a blank canvas or the selected sample pipeline.
If needed, you can change the authoring Data Collector or Transformer for the pipeline using the Authoring icon.

Creating a New Fragment

You can use Pipeline Designer to create new pipeline fragments.

When you create a pipeline fragment, you specify the execution engine for the fragment - Data Collector Edge, Data Collector, or Transformer - and select the authoring Data Collector or Transformer to use.
  1. To create a pipeline fragment, click Pipeline Repository > Pipeline Fragments to access the Pipeline Fragments view.
  2. Click the Add icon.
  3. Enter the name and optional description.
  4. Select the authoring Data Collector or Transformer to use, then click Create.
Pipeline Designer displays a blank canvas.
If needed, you can change the authoring Data Collector or Transformer for the fragment using the Authoring icon.

Running a Test of a Draft Pipeline

As you design a pipeline, you can perform a test run of the draft pipeline in Pipeline Designer. Perform a test run of a draft pipeline to quickly test the pipeline logic.

You can perform a test run of a draft version of a fully configured pipeline. The Test Run menu becomes active when a draft pipeline is complete.

You cannot perform a test run of a published pipeline version. To run a published pipeline version, you must first create a job for the published pipeline version and then start the job.

Note: As you monitor a test run of a Data Collector pipeline, you can capture and review a snapshot of the data being processed.
  1. While viewing a draft version of a completed pipeline in Pipeline Designer, click Test Run in the top right corner of the toolbar, and then select one of the following options:
    • Start Pipeline - Start a test run of the pipeline.
    • Reset Origin and Start - Reset the origin and then start a test run of the pipeline.
    • Start with Parameters - Specify the parameter values to use and then start a test run of the pipeline.

    When the test run starts, Pipeline Designer displays statistics for the test run in the Monitor panel.

  2. Monitor the test run of the pipeline, including viewing real-time statistics and error information.

    To access the history of previous test runs for the pipeline, click the Test Run History tab. The test run history includes the start and end time of previous test runs and also the input, output, and error record count for each test run.

    Click View Summary for a specific run to view a summary of the metrics for that test run.

  3. When you've finished monitoring the test run, click the Stop Test Run icon.
    You can continue designing the pipeline and performing additional test runs until you decide that the pipeline is complete, and then publish the pipeline.

Publishing a Fragment or Pipeline

You can publish fragments and pipelines that are designed or updated in Pipeline Designer.

Publish a fragment to use the fragment in a pipeline. Pipelines can only use published fragments.

Publish a pipeline to create a job that runs the pipeline, to use the pipeline as a sample pipeline, or to retain the published pipeline version for historical reference. You can only use published pipelines in jobs or as sample pipelines.

You can only publish valid pipelines. Pipeline Designer performs explicit validation before publishing a pipeline.

  1. While viewing a fragment or pipeline in edit mode, click the Check In icon.

    The Check In window appears.

  2. Enter a commit message.

    As a best practice, state what changed in this version so that you can track the commit history of the fragment or pipeline.

  3. Click one of the following buttons:
    • Cancel - Cancels publishing the fragment or pipeline.
    • Publish and Update - Publishes the fragment or pipeline, and then displays the pipelines or jobs that currently use the published fragment or pipeline.

      Select the pipelines or jobs that you want to update to use the latest published version and click Update. Review the pipelines or jobs to be updated, and then click Close. When publishing a pipeline, you can also choose to create a new job for this pipeline version.

      Or, choose to skip the update and close the window. When publishing a pipeline, you can also choose to skip the update and create a new job for this pipeline version.
    • Publish and Close - Publishes the fragment or pipeline.