Documentation

What’s New in Version 2.6.0.0

Here are some of the new features and enhancements in 2.6.0.0. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

New Parquet support for the Drift Synchronization Solution for Hive – See the documentation for implementation details and a Parquet case study.

New multithreaded origins to create multithreaded pipelines:

  • CoAP Server origin – An origin that listens on a CoAP endpoint and processes the contents of all authorized CoAP requests. 
  • TCP Server origin – An origin that listens at the specified ports, establishes TCP sessions with clients that initiate TCP connections, and then processes the incoming data. The origin can process NetFlow, syslog, and most Data Collector data formats as separated records.
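
The "separated records" framing used by the TCP Server origin can be illustrated with a small stand-in in Python. This is a sketch of the general pattern only, not the origin's actual implementation: a listener accepts one TCP session, splits the incoming byte stream on a separator character, and treats each piece as one record. The separator and helper names here are illustrative assumptions.

```python
import socket
import threading

SEPARATOR = b"\n"  # illustrative record separator; real separators are configurable

def serve_once(host="127.0.0.1", port=0):
    """Accept one TCP session and return the separated records it carried."""
    records = []
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))
    srv.listen(1)
    bound_port = srv.getsockname()[1]

    def client():
        # Simulated client that initiates the TCP connection and sends
        # three separator-delimited records.
        c = socket.create_connection((host, bound_port))
        c.sendall(b"record-1\nrecord-2\nrecord-3\n")
        c.close()

    t = threading.Thread(target=client)
    t.start()
    conn, _ = srv.accept()
    buf = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:  # client closed the session
            break
        buf += chunk
        # Split the byte stream into records at each separator.
        while SEPARATOR in buf:
            rec, buf = buf.split(SEPARATOR, 1)
            records.append(rec.decode())
    conn.close()
    t.join()
    srv.close()
    return records
```

Calling `serve_once()` returns the three records from the simulated session, showing how a continuous byte stream becomes discrete records.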

New CoAP Client destination – A destination that writes to a CoAP endpoint.

New executors to use with dataflow triggers.

Support bundles – Automatically generate a ZIP file that includes Data Collector logs, environment and configuration information, pipeline JSON files, resource files, and pipeline snapshots to share with the StreamSets support team.

DPM pipeline statistics – You can now configure a pipeline to write statistics directly to DPM when you run a job for the pipeline on a single Data Collector.


What’s New in Version 2.5.0.0

Here are some of the new features and enhancements in 2.5.0.0. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

Multithreaded Pipelines – The multithreaded framework includes the following enhancements:

  • Origins for multithreaded pipelines – You can now use the following origins to create multithreaded pipelines:
    – Elasticsearch origin
    – JDBC Multitable Consumer origin
    – Kinesis Consumer origin
    – WebSocket Server origin
  • Maximum pipeline runners – You can now configure a maximum number of pipeline runners to use in a pipeline, which allows you to tune performance and resource usage. Previously, Data Collector always generated pipeline runners based on the number of threads created by the origin. By default, it still does.
  • Record Deduplicator processor enhancement – The processor can now deduplicate records across all pipeline runners in a multithreaded pipeline.
  • Pipeline validation enhancement – The pipeline now displays duplicate errors generated by using multiple threads as one error message.
  • Log enhancement – Multithreaded pipelines now include the runner ID in log information.
  • Monitoring – Monitoring now displays a histogram of available pipeline runners, replacing the information previously included in the Runtime Statistics list.

Stop pipeline execution – You can configure pipelines to transfer data and automatically stop execution based on an event such as reaching the end of a table. The JDBC Multitable Consumer, JDBC Query Consumer, and Salesforce origins can generate events when they reach the end of available data, which the new Pipeline Finisher executor uses to stop the pipeline. See the documentation for a case study.
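
The dataflow-trigger pattern described above can be sketched in a few lines of Python. This is an illustration of the general idea, not StreamSets internals: an origin yields data batches and then a "no-more-data" event once the source is exhausted, and a finisher executor reacts to that event by stopping the run. All names here are illustrative.

```python
# Illustrative sketch of the dataflow-trigger pattern (not StreamSets internals).

def origin(rows):
    """Yield data records, then a terminal event once the source is exhausted."""
    for row in rows:
        yield ("data", row)
    yield ("event", "no-more-data")

class PipelineFinisher:
    """Stand-in for an executor that stops the pipeline on a trigger event."""
    def __init__(self):
        self.stopped = False

    def on_event(self, event):
        if event == "no-more-data":
            self.stopped = True

def run_pipeline(rows):
    finisher = PipelineFinisher()
    processed = []
    for kind, payload in origin(rows):
        if kind == "event":
            finisher.on_event(payload)
            if finisher.stopped:
                break  # event-driven stop instead of running indefinitely
        else:
            processed.append(payload)
    return processed, finisher.stopped
```

Running `run_pipeline([1, 2, 3])` processes all three records and then stops, mirroring a pipeline that finishes after reaching the end of a table.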

Pipeline Runtime Parameters – You can define runtime parameters when you configure a pipeline, and then call the parameters from within that pipeline. When you start the pipeline from the user interface, the command line, or the REST API, you specify the values to use for those parameters. Use runtime parameters to represent any stage or pipeline property with a value that must change for each pipeline run, such as batch sizes, timeouts, directories, or URIs.
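
As a minimal sketch of starting a pipeline with parameter overrides over the REST API: the snippet below builds the request URL and JSON body. The host, pipeline ID, parameter names, and the exact endpoint path are assumptions for illustration, not taken from a real deployment; consult the REST API documentation for the authoritative endpoint.

```python
import json

# Hypothetical values for illustration only.
SDC_URL = "http://localhost:18630"
PIPELINE_ID = "MyPipeline"

# Runtime parameters defined on the pipeline; stages would reference them
# in properties using expressions such as ${BATCH_SIZE} or ${OUTPUT_DIR}.
runtime_parameters = {
    "BATCH_SIZE": 1000,
    "READ_TIMEOUT_MS": 5000,
    "OUTPUT_DIR": "/data/out",
}

def start_request(url, pipeline_id, params):
    """Build the (assumed) REST call that starts a pipeline with overrides."""
    endpoint = f"{url}/rest/v1/pipeline/{pipeline_id}/start"
    body = json.dumps(params)  # parameter values travel as a JSON object
    return endpoint, body
```

The returned endpoint and body would then be sent as an HTTP POST; only the parameter values change between runs, while the pipeline definition stays the same.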

New endpoints.


What’s New in Version 2.4.0.0

Here are some of the new features and enhancements in 2.4.0.0. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

  • Pipeline Permissions and Sharing – Permissions determine the access level that users and groups have on pipelines. To create a multitenant environment, create groups of users and then share pipelines with the groups to grant different levels of access.
    This feature includes the following components:
    – Pipeline permissions – Pipelines now have read, write, and execute permissions. Pipeline permissions overlay existing Data Collector roles to provide greater security. For information about roles and permissions, see Roles and Permissions.
    – Pipeline sharing – The pipeline owner and users with the Admin role can configure pipeline permissions for users and groups.
    – Data Collector pipeline access control property – You can enable and disable the use of pipeline permissions with the pipeline.access.control.enabled configuration property. By default, this property is enabled.
    – Permissions transfer – You can transfer all pipeline permissions associated with a user or group to a different user or group. Use permissions transfer to migrate permissions easily after registering with DPM or after a user or group becomes obsolete.
  • Register Data Collectors with DPM – If Data Collector uses file-based authentication and if you register the Data Collector from the Data Collector UI, you can now create DPM user accounts and groups during the registration process.
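
The overlay of pipeline permissions on top of Data Collector roles, described above, can be sketched as an intersection of two grants. This is an assumed model for illustration, not the actual implementation: an action is allowed only when both the user's role and the pipeline's permissions grant it. The ACL structure and function names are illustrative.

```python
# What each Data Collector role allows, mapped onto the pipeline actions.
ROLE_ALLOWS = {
    "admin":   {"read", "write", "execute"},
    "creator": {"read", "write"},
    "manager": {"read", "execute"},
    "guest":   {"read"},
}

def effective_access(role, user, groups, pipeline_acl):
    """Permissions overlay roles: both layers must grant an action.

    pipeline_acl maps a user or group name to the set of permissions
    granted on this pipeline (read, write, execute).
    """
    # Collect everything the pipeline grants to the user or their groups.
    granted = set(pipeline_acl.get(user, set()))
    for g in groups:
        granted |= pipeline_acl.get(g, set())
    # The role caps what the pipeline-level grants can confer.
    return ROLE_ALLOWS.get(role, set()) & granted
```

For example, a user with the Manager role in a group granted read and execute on a pipeline gets both actions, while a Guest in the same group gets only read; with no pipeline-level grant, even an Admin gets nothing on that pipeline, which is the "greater security" the overlay provides.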

What’s New in Version 2.3.0.0

Here are some of the new features and enhancements in 2.3.0.0. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.


What’s New in Version 2.2.1.0

Here are the new features and enhancements in 2.2.1.0. For a list of bug fixes and known issues, see the Release Notes.

Field Zip processor

Salesforce Lookup processor

Value Replacer enhancement

Whole file support in the Microsoft Azure Data Lake Store destination


What’s New in Version 2.2.0.0

Here are some of the great new features in 2.2.0.0. For a full list of new features and enhancements, see What's New. For a list of bug fixes and known issues, see the Release Notes.

Event framework  with new executor stages

Microsoft Azure Data Lake Store destination  

Google Bigtable destination

Salesforce origin, Salesforce destination, and Wave Analytics destination

Spark Evaluator processor

MySQL Binary Log origin

Expression language functions for file paths, pipeline information, and datetime data



Loretta Chen