Documentation


What’s New in Version 3.5.0

Control Hub Cloud

Here is the new feature included in StreamSets Control Hub Cloud 3.5.0, updated on October 12, 2018. For a full list, see What’s New.

Failover retries for jobs – When a job is enabled for pipeline failover, you can now define the maximum number of failover retries to perform.

Data Collector

Here are some of the new features included in StreamSets Data Collector 3.5.0, released on October 1, 2018. For the full list of features and enhancements, as well as the bug fixes and known issues, see the Release Notes. For a list of all recent features and enhancements, see What's New.

New Pulsar Consumer origin that reads messages from one or more topics in an Apache Pulsar cluster.

The following new processors:

  • Encrypt and Decrypt Fields – Encrypts or decrypts individual field values.
  • MongoDB Lookup – Performs lookups in MongoDB and passes the returned document to an output field.
  • HTTP Router – Passes records to streams based on the HTTP method and URL path in record header attributes.
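The HTTP Router idea can be sketched in plain Python. This is an illustration of the routing concept only, not the Data Collector implementation; the record layout and the `route` helper are hypothetical.

```python
# Illustrative sketch: route a record to a named stream based on the HTTP
# method and URL path carried in its header attributes (not SDC's code).

def route(record, routes):
    """Return the first stream whose (method, path-prefix) rule matches."""
    method = record["header"].get("method", "").upper()
    path = record["header"].get("path", "")
    for stream, (want_method, want_prefix) in routes.items():
        if method == want_method and path.startswith(want_prefix):
            return stream
    return "default"

routes = {
    "create_orders": ("POST", "/orders"),
    "read_orders": ("GET", "/orders"),
}

rec = {"header": {"method": "POST", "path": "/orders/42"}, "body": {"qty": 1}}
print(route(rec, routes))  # → create_orders
```

Records that match no rule fall through to a default stream, mirroring how a router stage typically keeps unmatched records flowing rather than dropping them.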

The following new destinations:

  • Pulsar Producer – Writes data to topics in an Apache Pulsar cluster.
  • Syslog – Writes data to a Syslog server.

New Databricks executor that starts a Databricks job each time it receives an event.

Data Collector now includes certain new features and stages with the Technology Preview designation. Technology Preview functionality is available for use in development and testing, but is not meant for use in production. The following Technology Preview stages are available in this release:

  • Databricks ML Evaluator – Uses Spark-trained machine learning models to generate evaluations, scoring, or classifications of data.
  • MLeap Evaluator – Uses machine learning models stored in MLeap format to generate evaluations, scoring, or classifications of data.
  • PMML Evaluator – Uses a machine learning model stored in the Predictive Model Markup Language (PMML) format to generate evaluations, scoring, or classifications of data.
  • TensorFlow Evaluator – Uses TensorFlow machine learning models to generate predictions or classifications of data.

Data Collector Edge (SDC Edge) enhancements:

Data governance tools enhancements:

  • Publish metadata to data governance tools for the following supported stages: Amazon S3 origin, Kafka Multitopic Consumer origin, SFTP/FTP Client origin, Kafka Producer destination.
  • Publish metadata to Cloudera Navigator running on Cloudera Manager versions 5.10 to 5.15.
  • Use TLS/SSL to securely connect to Cloudera Navigator.

Azure Key Vault support – Data Collector now provides a credential store implementation for Microsoft Azure Key Vault.
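Once a credential store is registered, stage properties can reference secrets through the credential EL function rather than embedding them in pipeline configuration. For example (the store ID `azure` and the secret name below are hypothetical placeholders):

```
${credential:get("azure", "all", "sqlserver-password")}
```

The first argument names the configured credential store, the second the group of users allowed to access the secret, and the third the secret to retrieve from the vault.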

What’s New in Version 3.4.0

Control Hub Cloud

Here are some of the new features included in StreamSets Control Hub Cloud 3.4.0, updated on September 28, 2018. For a full list, see What's New.

You can now use StreamSets Data Protector to perform global in-stream discovery and protection of data in motion with Control Hub. Data Protector provides StreamSets classification rules and enables creating custom classification rules to identify sensitive data. Custom protection policies provide rules-based data protection for every job that you run.

Pipeline Designer enhancements:

  • Expression completion in stage and pipeline properties to provide a list of available data types, runtime parameters, fields, and functions.
  • Manage pipeline and pipeline fragment versions when configuring a pipeline or pipeline fragment in Pipeline Designer.

Data Collector

Here are some of the new features included in StreamSets Data Collector 3.4.0, released on July 30, 2018. For the full list of features and enhancements, as well as the bug fixes and known issues, see the Release Notes. For a list of all recent features and enhancements, see What's New.

The following new origins:

  • PostgreSQL CDC Client – An origin that processes change data capture information for a PostgreSQL database.

  • Test – An origin that provides test data for data preview to aid in pipeline development.

New Whole File Transformer processor that converts fully written Avro files to Parquet in a whole file pipeline.

The following new destinations:

  • Couchbase – A destination that writes data to Couchbase.

  • Splunk – A destination that writes data to Splunk using the Splunk HTTP Event Collector (HEC).

Cluster EMR batch mode – Use to run cluster pipelines on an Amazon EMR cluster to process data from Amazon S3.

Microservice pipelines – Create microservice pipelines using the new REST Service origin and the new Send Response to Origin destination to build a freestanding microservice.

Excel data format – The following file-based origins can now process Microsoft Excel files: Amazon S3, Directory, Google Cloud Storage, and SFTP/FTP Client.

Data Collector Edge (SDC Edge) enhancements:

Documentation enhancement – The online help has a new look and feel. All of the previous documentation remains exactly where you expect it, but it is now easier to navigate and view on smaller devices like your tablet or mobile phone.

What’s New in Version 3.3.0

Control Hub Cloud

Here are some of the new features included in StreamSets Control Hub Cloud 3.3.0, updated on August 4, 2018. For a full list, see What’s New.

Pipelines and pipeline fragments:

  • Data preview enhancements including data preview support for pipeline fragments and the ability to edit preview data and stage properties, then run the preview with your changes.

  • Select multiple stages when designing pipelines and pipeline fragments.

Pipeline failover enhancement – Control Hub now prioritizes Data Collectors that have not previously failed the pipeline.

Monitor the CPU load and memory usage of each registered Data Collector and Data Collector Edge (SDC Edge).

Data delivery reports for jobs and topologies now contain statistics for destinations.

Documentation enhancement – The online help has a new look and feel. All of the previous documentation remains exactly where you expect it, but it is now easier to navigate and view on smaller devices like your tablet or mobile phone.

Data Collector

Here are some of the new features included in StreamSets Data Collector 3.3.0, released on May 24, 2018. For the full list of features and enhancements, as well as the bug fixes and known issues, see the Release Notes. For a list of all recent features and enhancements, see What's New.

  • Cluster streaming pipelines can now use Kafka security features such as SSL/TLS and Kerberos authentication when using Spark 2.1 or later and Kafka 0.10.0.0 or later. Note that this removes support for Spark 1.x; you must upgrade to Spark 2.1 or later.
  • WebSocket Client origin can now send an initial message or command after connecting to the WebSocket server.

What’s New in Version 3.2.0.0

Control Hub Cloud

Here are some of the new features included in StreamSets Control Hub Cloud 3.2.0, updated on May 11, 2018. For a full list, see What's New.

  • Pipeline fragments – Create pipeline fragments to easily add the same processing logic to multiple pipelines.
  • Scheduler – Schedule jobs or data delivery reports to run on a regular basis.
  • Data delivery reports – Create reports that show how much data was processed by a job or topology over a given period of time.
  • Subscription enhancements – Create a subscription that sends an email when Control Hub events occur. Create an expression to filter the events that you want to subscribe to.

Data Collector

Here are some of the new features included in StreamSets Data Collector 3.2.0.0, released on April 25, 2018. For the full list of features and enhancements, as well as the bug fixes and known issues, see the Release Notes. For a list of all recent features and enhancements, see What's New.

The following new origins:

  • Hadoop FS Standalone – Use this origin in standalone execution mode pipelines to read files in HDFS.
  • MapR FS Standalone – Use this origin in standalone execution mode pipelines to read files in MapR FS.

MapReduce executor can now use the new Avro to ORC job to convert Avro files to ORC files.

Data Collector Edge (SDC Edge) now supports the JavaScript Evaluator processor.

What's New in Version 3.1.0.0

Control Hub Cloud

Here are some of the new features included in StreamSets Control Hub Cloud 3.1.0, updated on March 30, 2018. For a full list, see What's New.

  • Export and import – Export and import pipelines, jobs, and topologies to migrate the objects from one organization to another.
  • Subscriptions – Create a subscription that listens for Control Hub events and then completes an action when those events occur.
  • Scale out active jobs – Control Hub can automatically scale out pipeline processing for an active job when you set the number of pipeline instances to -1.

Data Collector

Here are some of the new features included in StreamSets Data Collector 3.1.0.0, released on February 9, 2018. For the full list of features and enhancements, as well as the bug fixes and known issues, see the Release Notes. For a list of all recent features and enhancements, see What's New.

Drift Synchronization Solution for Postgres – Beta support for detecting drift in incoming data and automatically creating or altering corresponding PostgreSQL tables before the data is written.

SDC Edge pipelines support the Kafka Producer destination.

The following new processors:

The following new destinations:

  • Aerospike – A destination that writes data to Aerospike.
  • Named Pipe – A destination that writes data to a UNIX named pipe.

Support for processing the Common Event Format (CEF) and Log Event Extended Format (LEEF) log formats.
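The CEF layout itself is simple: a pipe-delimited header of seven fields followed by a key=value extension. The following is a minimal stand-alone illustration of that layout, not the Data Collector parser (it ignores CEF escaping of pipes inside values):

```python
# Minimal illustration of the CEF message layout (not SDC's log parser).
def parse_cef(line):
    """Split a CEF message into its seven header fields plus the extension."""
    if not line.startswith("CEF:"):
        raise ValueError("not a CEF message")
    parts = line[4:].split("|", 7)  # 7 header fields; the rest is the extension
    keys = ["version", "deviceVendor", "deviceProduct", "deviceVersion",
            "signatureId", "name", "severity", "extension"]
    return dict(zip(keys, parts))

msg = "CEF:0|Security|threatmanager|1.0|100|worm successfully stopped|10|src=10.0.0.1 dst=2.1.2.2"
parsed = parse_cef(msg)
print(parsed["name"], parsed["severity"])  # → worm successfully stopped 10
```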

What's New in Version 3.0.0.0

Control Hub Cloud

Here are some of the new features included in StreamSets Control Hub Cloud 3.0.0, updated on December 15, 2017. For a full list, see What's New.

Data Collector

Here are some of the many new features included in StreamSets Data Collector 3.0.0.0, released on November 27, 2017. For the full list of features and enhancements, as well as the latest bug fixes and known issues, see the Release Notes. For a list of all recent features and enhancements, see What's New.

What's New in Version 2.7.2.0

Data Collector 2.7.2.0, released on October 5, 2017, includes the following new features. For a list of bug fixes and known issues, see the Release Notes. For a list of all recent features and enhancements, see What's New.

What's New in Version 2.7.1.0

Here are some of the new features in Data Collector 2.7.1.0, released on August 30, 2017. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

What's New in Version 2.7.0.0

Dataflow Performance Manager

Here are some of the new features and enhancements in 2.7.0.0, updated August 9, 2017. For a full list, see What's New.

  • Auth Token Administrator role – Role that enables users to register, unregister, and deactivate Data Collectors using DPM. Also enables users to regenerate authentication tokens.
  • Job history – History now includes all user actions on the job and the progress of all Data Collectors running pipelines for the job.
  • Number of pipeline instances – Default value for the number of pipeline instances for a job is now 1. This runs one pipeline instance on an available Data Collector running the fewest number of pipelines.

Data Collector

Here are some of the new features and enhancements in 2.7.0.0, released on August 18, 2017. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

A new credential store API that integrates with Java keystore and HashiCorp Vault.

Beta support for publishing metadata about running pipelines to Cloudera Navigator.

Pipeline event generation when the pipeline stops and starts that you can use as dataflow triggers.

The following new origins:

  • Google BigQuery – An origin that executes a query job and reads the result from Google BigQuery.
  • Google Pub/Sub Subscriber – A multithreaded origin that consumes messages from a Google Pub/Sub subscription.
  • OPC UA Client – An origin that processes data from an OPC UA server.
  • SQL Server CDC Client – A multithreaded origin that reads data from Microsoft SQL Server CDC tables.
  • SQL Server Change Tracking – A multithreaded origin that reads data from Microsoft SQL Server change tracking tables and generates the latest version of each record.

The following new processors:

  • Data Parser – A processor that extracts NetFlow or syslog messages as well as other supported data formats that are embedded in a field.
  • JSON Generator – A processor that serializes data from a record field to a JSON-encoded string.
  • Kudu Lookup – A processor that performs lookups in Kudu to enrich records with additional data.
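The JSON Generator behavior, turning a structured field into a JSON-encoded string in another field, can be sketched with the standard library. This is a conceptual illustration only; the record structure and the `generate_json` helper are made up, not the processor's code:

```python
import json

# Serialize one field of a record into a JSON-encoded string field,
# mimicking what a JSON Generator-style processor does.
def generate_json(record, source_field, target_field):
    record[target_field] = json.dumps(record[source_field], sort_keys=True)
    return record

rec = {"payload": {"id": 7, "tags": ["a", "b"]}}
generate_json(rec, "payload", "payload_json")
print(rec["payload_json"])  # → {"id": 7, "tags": ["a", "b"]}
```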

The following new destinations:

A new Amazon S3 executor that can be triggered to create new Amazon S3 objects for the specified content or add tags to existing objects each time it receives an event.

What's New in Version 2.6.0.0

Here are some of the new features and enhancements in Data Collector 2.6.0.0. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

New Parquet support for the Drift Synchronization Solution for Hive – See the documentation for implementation details and a Parquet case study.

New multithreaded origins to create multithreaded pipelines:

  • CoAP Server origin – An origin that listens on a CoAP endpoint and processes the contents of all authorized CoAP requests. 
  • TCP Server origin – An origin that listens at the specified ports, establishes TCP sessions with clients that initiate TCP connections, and then processes the incoming data. The origin can process NetFlow, syslog, and most Data Collector data formats as separated records.

New CoAP Client destination – A destination that writes to a CoAP endpoint.

New executors to use with dataflow triggers:

Support bundles – Automatically generate a ZIP file that includes Data Collector logs, environment and configuration information, pipeline JSON files, resource files, and pipeline snapshots to share with the StreamSets support team.
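Assembling such a bundle is conceptually just archiving a set of diagnostic files into one ZIP. A rough stand-alone sketch follows; the file names and contents are hypothetical, and this is not how Data Collector itself builds the bundle:

```python
import io
import zipfile

# Toy illustration: bundle a few diagnostic artifacts into a single in-memory ZIP.
artifacts = {
    "sdc.log": "2018-10-01 12:00:00 INFO  Pipeline started",  # hypothetical contents
    "sdc.properties": "http.port=18630",
    "pipeline.json": '{"title": "example"}',
}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as bundle:
    for name, content in artifacts.items():
        bundle.writestr(name, content)

with zipfile.ZipFile(buf) as bundle:
    print(sorted(bundle.namelist()))  # → ['pipeline.json', 'sdc.log', 'sdc.properties']
```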

DPM pipeline statistics – You can now configure a pipeline to write statistics directly to DPM. Write statistics directly to DPM when you run a job for the pipeline on a single Data Collector.

Want a quick introduction to Data Collector?

https://vimeo.com/141687653

For more introductory videos and cool case studies, check out our Videos page.
