Documentation

Dataflow Performance Manager (DPM) documentation is password protected. To access the full DPM online help, use your DPM credentials to log in here: https://cloud.streamsets.com/docs/

Without DPM credentials, you can still see our Getting Started Guide.

Want access to DPM? Contact us or request a live demo.

Working with DPM (Dataflow Performance Manager)


What’s New in Version 2.7.2.0

Data Collector 2.7.2.0, released on October 5, 2017, includes the following new features. For a list of bug fixes and known issues, see the Release Notes. For a list of all recent features and enhancements, see What's New.


What’s New in Version 2.7.1.0

Here are some of the new features in Data Collector 2.7.1.0, released on August 30, 2017. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.


What’s New in Version 2.7.0.0

Dataflow Performance Manager

Here are some of the new features and enhancements in 2.7.0.0, updated August 9, 2017. For a full list, see What's New.

  • Auth Token Administrator role – Role that enables users to register, unregister, and deactivate Data Collectors using DPM. Also enables users to regenerate authentication tokens.
  • Job history – History now includes all user actions on the job and the progress of all Data Collectors running pipelines for the job.
  • Number of pipeline instances – Default value for the number of pipeline instances for a job is now 1. This runs one pipeline instance on an available Data Collector running the fewest number of pipelines.
Data Collector

Here are some of the new features and enhancements in 2.7.0.0, released on August 18, 2017. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

  • A new credential store API that integrates with Java keystore and Hashicorp Vault.

  • Beta support for publishing metadata about running pipelines to Cloudera Navigator.

  • Pipeline event generation when the pipeline starts and stops, which you can use as dataflow triggers.

The following new origins:

  • Google BigQuery – An origin that executes a query job and reads the result from Google BigQuery.
  • Google Pub/Sub Subscriber – A multithreaded origin that consumes messages from a Google Pub/Sub subscription.
  • OPC UA Client – An origin that processes data from an OPC UA server.
  • SQL Server CDC Client – A multithreaded origin that reads data from Microsoft SQL Server CDC tables.
  • SQL Server Change Tracking – A multithreaded origin that reads data from Microsoft SQL Server change tracking tables and generates the latest version of each record.
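To illustrate what "generates the latest version of each record" means for the SQL Server Change Tracking origin, here is a minimal Python sketch, not StreamSets code: it collapses a stream of change rows to the most recent row per primary key. The column names (`id`, `version`) are illustrative; SQL Server change tracking exposes a comparable monotonically increasing SYS_CHANGE_VERSION value.

```python
# Collapse a stream of change-tracking rows to the latest version per key.
# Each row carries a primary key and a monotonically increasing version.

def latest_versions(rows):
    """Return the highest-version row for each primary key."""
    latest = {}
    for row in rows:
        key = row["id"]
        if key not in latest or row["version"] > latest[key]["version"]:
            latest[key] = row
    return list(latest.values())

changes = [
    {"id": 1, "version": 10, "name": "alice"},
    {"id": 2, "version": 11, "name": "bob"},
    {"id": 1, "version": 12, "name": "alicia"},  # later update wins
]

print(latest_versions(changes))
```

A consumer of this output sees exactly one row per key, reflecting the newest change, which is the shape of record the origin emits.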

The following new processors:

  • Data Parser – A processor that extracts NetFlow or syslog messages as well as other supported data formats that are embedded in a field.
  • JSON Generator – A processor that serializes data from a record field to a JSON-encoded string.
  • Kudu Lookup – A processor that performs lookups in Kudu to enrich records with additional data.
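As a concept sketch of what the JSON Generator processor does, the following Python snippet serializes one field of a record into a JSON-encoded string stored in another field. The field names (`address`, `address_json`) are illustrative, not stage defaults.

```python
import json

# Serialize one field of a record into a JSON-encoded string field,
# mirroring the behavior described for the JSON Generator processor.

def generate_json_field(record, source_field, target_field):
    record[target_field] = json.dumps(record[source_field], sort_keys=True)
    return record

record = {"name": "Ada", "address": {"city": "London", "zip": "NW1"}}
out = generate_json_field(record, "address", "address_json")
print(out["address_json"])  # → {"city": "London", "zip": "NW1"}
```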

The following new executor:

  • Amazon S3 executor – An executor that, each time it receives an event, can create new Amazon S3 objects for the specified content or add tags to existing objects.


What’s New in Version 2.6.0.0

Here are some of the new features and enhancements in Data Collector 2.6.0.0. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

New Parquet support for the Drift Synchronization Solution for Hive – See the documentation for implementation details and a Parquet case study.

New multithreaded origins to create multithreaded pipelines:

  • CoAP Server origin – An origin that listens on a CoAP endpoint and processes the contents of all authorized CoAP requests. 
  • TCP Server origin – An origin that listens at the specified ports, establishes TCP sessions with clients that initiate TCP connections, and then processes the incoming data. The origin can process NetFlow, syslog, and most Data Collector data formats as separated records.

New CoAP Client destination – A destination that writes to a CoAP endpoint.

New executors to use with dataflow triggers.

Support bundles – Automatically generate a ZIP file that includes Data Collector logs, environment and configuration information, pipeline JSON files, resource files, and pipeline snapshots to share with the StreamSets support team.
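The idea behind a support bundle can be sketched in a few lines of Python: collect the relevant text artifacts (logs, configuration, pipeline JSON) into a single ZIP archive. The file names below are illustrative, not the actual bundle layout.

```python
import io
import zipfile

# Minimal sketch of bundling diagnostic files into one ZIP, the way a
# support bundle packages logs, configuration, and pipeline JSON.

def build_support_bundle(files):
    """files: dict of archive path -> text content. Returns ZIP bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, content in files.items():
            zf.writestr(path, content)
    return buf.getvalue()

bundle = build_support_bundle({
    "logs/sdc.log": "2017-06-01 INFO Pipeline started",
    "conf/sdc.properties": "http.port=18630",
    "pipelines/my_pipeline.json": '{"title": "my_pipeline"}',
})
```

The resulting bytes can be written to disk or attached to a support ticket as a single file.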

DPM pipeline statistics – You can now configure a pipeline to write statistics directly to DPM. Do this when you run a job for the pipeline on a single Data Collector.


What’s New in Version 2.5.0.0

Here are some of the new features and enhancements in Data Collector 2.5.0.0. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

Multithreaded Pipelines – The multithreaded framework includes the following enhancements:

  • Origins for multithreaded pipelines – You can now use the following origins to create multithreaded pipelines:
    – Elasticsearch origin
    – JDBC Multitable Consumer origin
    – Kinesis Consumer origin
    – WebSocket Server origin
  • Maximum pipeline runners – You can now configure a maximum number of pipeline runners to use in a pipeline, which allows you to tune performance and resource usage. Previously, Data Collector generated pipeline runners based solely on the number of threads created by the origin. By default, Data Collector still generates runners based on the number of threads that the origin uses.
  • Record Deduplicator processor enhancement – The processor can now deduplicate records across all pipeline runners in a multithreaded pipeline.
  • Pipeline validation enhancement – The pipeline now displays duplicate errors generated by using multiple threads as one error message.
  • Log enhancement – Multithreaded pipelines now include the runner ID in log information.
  • Monitoring – Monitoring now displays a histogram of available pipeline runners, replacing the information previously included in the Runtime Statistics list.
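The cross-runner deduplication enhancement above can be sketched as a set of record hashes shared by all runner threads and guarded by a lock, so a duplicate seen by any runner is dropped. This is an illustrative Python model, not the processor's actual implementation.

```python
import threading

# Sketch of deduplicating records across all pipeline runners: runners
# (threads) consult one shared set of record keys under a lock.

class Deduplicator:
    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def is_duplicate(self, record):
        key = hash(frozenset(record.items()))
        with self._lock:
            if key in self._seen:
                return True
            self._seen.add(key)
            return False

dedup = Deduplicator()
records = [{"id": 1}, {"id": 2}, {"id": 1}]
unique = [r for r in records if not dedup.is_duplicate(r)]
print(unique)  # → [{'id': 1}, {'id': 2}]
```

Because the set is shared, the third record is recognized as a duplicate even if a different runner processed the first one.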

Stop pipeline execution – You can configure pipelines to transfer data and automatically stop execution based on an event, such as reaching the end of a table. The JDBC Multitable Consumer, JDBC Query Consumer, and Salesforce origins can generate events when they reach the end of available data, which the new Pipeline Finisher executor uses to stop the pipeline. See the documentation for a case study.
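The origin-event-to-finisher pattern can be sketched as follows: the origin yields data batches, emits an end-of-data event once the table is exhausted, and the finisher reacts to that event by ending the run. The event and function names here are illustrative, not StreamSets APIs.

```python
# Sketch of the dataflow-trigger pattern behind the Pipeline Finisher.

def read_table_in_batches(table, batch_size):
    """Origin: yield batches, then a single end-of-data event."""
    for start in range(0, len(table), batch_size):
        yield {"type": "batch", "records": table[start:start + batch_size]}
    yield {"type": "no-more-data"}  # end-of-data event

def run_pipeline(table):
    processed = 0
    for event in read_table_in_batches(table, batch_size=2):
        if event["type"] == "no-more-data":
            return processed, "FINISHED"  # finisher stops the run
        processed += len(event["records"])

print(run_pipeline([1, 2, 3, 4, 5]))  # → (5, 'FINISHED')
```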

Pipeline Runtime Parameters – You can define runtime parameters when you configure a pipeline, and then call the parameters from within that pipeline. When you start the pipeline from the user interface, the command line, or the REST API, you specify the values to use for those parameters. Use runtime parameters to represent any stage or pipeline property with a value that must change for each pipeline run, such as batch sizes, timeouts, directories, or URIs.
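The substitution step behind runtime parameters can be modeled in a few lines of Python: a pipeline definition references named parameters, and concrete values supplied at start time are substituted in. The `${NAME}` syntax mirrors the expression style used in pipeline configuration; the property and parameter names below are illustrative.

```python
import string

# Sketch of runtime-parameter resolution: ${NAME} references in a
# pipeline configuration are replaced with values given at start time.

PIPELINE_CONFIG = {
    "origin.directory": "${INPUT_DIR}",
    "origin.batch_size": "${BATCH_SIZE}",
}

def resolve_parameters(config, params):
    """Substitute ${NAME} references with start-time values."""
    return {key: string.Template(value).substitute(params)
            for key, value in config.items()}

resolved = resolve_parameters(
    PIPELINE_CONFIG,
    {"INPUT_DIR": "/data/incoming", "BATCH_SIZE": "1000"},
)
print(resolved)
```

Starting the same pipeline with different parameter values then reuses one definition across runs, which is the point of the feature.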

New endpoints.


What’s New in Version 2.4.0.0

Here are some of the new features and enhancements in Data Collector 2.4.0.0. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

  • Pipeline Permissions and Sharing  – Permissions determine the access level that users and groups have on pipelines. To create a multitenant environment, create groups of users and then share pipelines with the groups to grant different levels of access.
    This feature includes the following components:
    – Pipeline permissions – Pipelines now have read, write, and execute permissions. Pipeline permissions overlay existing Data Collector roles to provide greater security. For information about roles and permissions, see Roles and Permissions.
    – Pipeline sharing – The pipeline owner and users with the Admin role can configure pipeline permissions for users and groups.
    – Data Collector pipeline access control property – You can enable and disable the use of pipeline permissions with the pipeline.access.control.enabled configuration property. By default, this property is enabled.
    – Permissions transfer – You can transfer all pipeline permissions associated with a user or group to a different user or group. Use permissions transfer to easily migrate permissions after registering with DPM or after a user or group becomes obsolete.
  • Register Data Collectors with DPM – If Data Collector uses file-based authentication and if you register the Data Collector from the Data Collector UI, you can now create DPM user accounts and groups during the registration process.

What’s New in Version 2.3.0.0

Here are some of the new features and enhancements in Data Collector 2.3.0.0. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.


What’s New in Version 2.2.1.0

Here are the new features and enhancements in Data Collector 2.2.1.0. For a list of bug fixes and known issues, see the Release Notes.

  • Field Zip processor
  • Salesforce Lookup processor
  • Value Replacer enhancement
  • Whole file support in the Microsoft Azure Data Lake Store destination


What’s New in Version 2.2.0.0

Here are some of the great new features in Data Collector 2.2.0.0. For a full list of new features and enhancements, see What's New. For a list of bug fixes and known issues, see the Release Notes.

  • Event framework with new executor stages
  • Microsoft Azure Data Lake Store destination
  • Google Bigtable destination
  • Salesforce origin, Salesforce destination, and Wave Analytics destination
  • Spark Evaluator processor
  • MySQL Binary Log origin
  • Expression language functions for file paths, pipeline information, and datetime data


Want a quick introduction to Data Collector?

https://vimeo.com/141687653

For more introductory videos and cool case studies, check out our Videos page.

Loretta Chen