Try StreamSets Cloud

This tutorial covers the steps needed to try StreamSets Cloud. Although the tutorial provides a simple use case, keep in mind that StreamSets Cloud is a powerful DataOps platform that enables you to build, run, and monitor large numbers of complex pipelines.

To try StreamSets Cloud, complete the following steps:
  1. Sign Up
  2. Build a Pipeline
  3. Run the Pipeline
  4. Monitor the Pipeline

Sign Up

The first step to trying StreamSets Cloud is simply signing up as a new user.

When you sign up, you are automatically enrolled in a free trial plan. You can upgrade your plan at any time. For a description of each plan, see the StreamSets Cloud pricing page.

  1. In the address bar of your browser, enter the StreamSets Cloud URL.
  2. Click Sign Up.
  3. Use one of the following methods to sign up:
    • Sign up with a Google account.
    • Sign up with a Microsoft account.
    • Enter your email address, password, and contact details to create an account for StreamSets Cloud. After verifying your email address, you can log in.

  4. After you log in, accept the terms of service.

    StreamSets Cloud displays the Pipelines page.

Build a Pipeline

Build a pipeline to define how data flows from origin to destination systems and how the data is processed along the way.

This tutorial provides steps to build a simple pipeline that uses development stages. StreamSets Cloud supports many origin and destination systems, such as cloud storage platforms and relational databases. If you'd rather build your first pipeline with stages that connect to your own origin and destination systems, feel free to create your own pipeline.

For a list of supported systems, see Origins and Destinations.

  1. On the Pipelines page, click Create.
  2. Enter a name for the pipeline, such as TestPipeline, and then click Create.
    The pipeline opens in the pipeline canvas.
  3. From the stage type list, select Origins to filter the stages by origins.
  4. In the Search field, type dev and then select the Dev Data Generator origin.

    The origin is added to the canvas.

    The Dev Data Generator origin is useful for testing because it can generate records with various types of sample data, such as address information, phone numbers, or book titles.

  5. In the pipeline properties pane, click the Data Generator tab.

    By default, Fields to Generate lists one field to generate.

  6. Click Add twice to add two additional fields.
  7. Add the following fields to generate:
    Field Name    Field Type
    title         BOOK_TITLE
    author        BOOK_AUTHOR
    genre         BOOK_GENRE

    You can preview the sample data that this development origin generates. When you preview data, source data passes through the pipeline, allowing you to review how the data moves through and changes at each stage.

  8. In the toolbar above the pipeline canvas, click the Preview icon.
  9. In the Preview Configuration dialog box, accept the defaults and then click Confirm.

    It takes a few minutes for the preview to start.

    When the preview successfully starts, it shows that the development origin generates 10 output records containing sample book title, author, and genre data.

  10. Click the Preview icon again to close preview.
  11. From the stage type list, select Processors to filter the stages by processors.
  12. Type dev, and then select the Dev Random Error processor to add it to the canvas.
    The Dev Random Error processor generates error records, which you can use to test pipeline error handling.

    In most cases, you'll add processors that actually transform the data, such as renaming fields or performing calculations. For this simple pipeline, add just this single development processor.

  13. From the stage type list, select Destinations.
  14. In the Search field, enter trash and then select the Trash destination, which simply discards records.
  15. Use your cursor to connect the origin to the processor and the processor to the destination.

    The finished pipeline flows from the Dev Data Generator origin through the Dev Random Error processor to the Trash destination.

  16. Click the pipeline canvas and then click the Error Records tab.
    Notice that the pipeline discards error records by default. For most pipelines, you'll want to write error records to another system, such as Google Cloud Storage, so that you can review and handle them without stopping the pipeline. For now, keep the default configuration and discard the error records. For a conceptual sketch of how this pipeline handles records, see the example after these steps.
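
If it helps to picture what this pipeline does, the following Python sketch simulates the same data flow outside of StreamSets Cloud. This is a conceptual illustration only, not StreamSets code: the field names match the ones configured above, while the batch size and error rate are arbitrary assumptions (the Dev Random Error processor's actual behavior may differ).

    import random

    # Stand-in for the Dev Data Generator origin: emits records with the
    # three fields configured above (title, author, genre).
    def dev_data_generator(batch_size=10):  # batch size is assumed
        return [
            {"title": "<BOOK_TITLE>", "author": "<BOOK_AUTHOR>", "genre": "<BOOK_GENRE>"}
            for _ in range(batch_size)
        ]

    # Stand-in for the Dev Random Error processor: passes most records
    # through and turns a random subset into error records.
    def dev_random_error(records, error_rate=0.3):  # error rate is assumed
        passed, errors = [], []
        for record in records:
            (errors if random.random() < error_rate else passed).append(record)
        return passed, errors

    # Stand-in for the Trash destination: simply discards records.
    def trash(records):
        pass

    batch = dev_data_generator()
    passed, errors = dev_random_error(batch)
    trash(passed)
    # With the default Error Records setting, error records are discarded too.
    print(f"input={len(batch)} output={len(passed)} errors={len(errors)}")

Each run produces a different split between output and error records, which mirrors the random behavior you'll see when you run and monitor the real pipeline.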

Run the Pipeline

Run the pipeline to move the sample data from the origin to the destination.

  1. In the toolbar above the pipeline canvas, click the Run icon.
  2. In the Run Configuration dialog box, accept the defaults, and then click Run Pipeline.

    It takes a few minutes for the pipeline to be deployed and started.

    When the pipeline successfully starts, the monitor pane displays the Monitoring tab.

Monitor the Pipeline

As the pipeline runs, you can monitor the health and performance of the pipeline by viewing real-time statistics and errors as data moves through the pipeline.

  1. In the monitor pane, click the Monitoring tab.
    The Monitoring tab displays real-time statistics about the pipeline, including the total input, output, and error record counts and the number of records processed by each stage. For a sketch of how these counts relate, see the example after these steps.

  2. In the pipeline canvas, select the Dev Random Error processor to view statistics about the processor.

    By default, the Monitoring tab displays pipeline monitoring information. You can select a specific stage to view statistics about that stage.

  3. Click the Error tab.

    The Error tab displays error information about the Dev Random Error processor, including stage errors and error records.

    The Dev Random Error processor in this sample pipeline randomly generates the same error, which isn't particularly informative. If your own pipelines generate errors, you can view the messages on the Error tab to help troubleshoot and handle the errors.

    Just as with the Monitoring tab, you can select a specific stage to view errors related to that stage.

  4. When you've finished monitoring the sample pipeline, click Stop.
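
As a rough mental model of the statistics on the Monitoring tab: each stage's input records come out either as output records or as error records, so for every stage, input equals output plus errors. The following Python sketch tallies illustrative counters in that shape. The stage names, batch size, and error rate here are assumptions for illustration, not values reported by StreamSets Cloud.

    import random
    from collections import Counter

    stats = Counter()
    batch_size = 10  # assumed; a running pipeline processes records continuously

    # The origin only produces output records.
    stats["DevDataGenerator.output"] = batch_size

    # The processor splits its input between output and error records.
    for _ in range(batch_size):
        stats["DevRandomError.input"] += 1
        if random.random() < 0.3:  # assumed error rate
            stats["DevRandomError.errors"] += 1
        else:
            stats["DevRandomError.output"] += 1
            stats["Trash.input"] += 1

    # For every stage: input == output + errors.
    for counter, value in sorted(stats.items()):
        print(f"{counter}: {value}")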