Dataflow Performance Blog

Announcing Data Collector v2.7.0.0

This release has been superseded by version 2.7.1.0. Please upgrade to v2.7.1.0 as soon as possible.

We are happy to release version 2.7.0.0 of StreamSets Data Collector.

You can download the latest open source release here.

This release has 134 new features and improvements and over 160 bug fixes. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

Enhanced Connectivity

Connectivity remains a big focus for us at StreamSets. We regularly look at which connectors we can add to the platform so that you can use StreamSets Data Collector to build dataflow pipelines that fit your unique needs. This release adds a number of new connectors, including:

  • Cloud: We’ve extended our list of cloud connectors with several for Google Cloud: a Google BigQuery origin, and a Google Pub/Sub origin and destination.
  • CDC: Customers who want to synchronize their transactional databases into systems such as Hadoop or cloud data stores for data science, analytics, and BI can now do so with the new origins for Microsoft SQL Server Change Tracking and CDC.
  • IoT: The new OPC UA Client origin enables you to browse, poll, and subscribe to OPC UA data streams like historians and SCADA networks.
  • Messaging: We’ve added a new JMS destination for writing to messaging systems like IBM MQ or Apache ActiveMQ.

3rd Party Support

In addition to connectivity, we’re also looking for ways to tighten integration with leading technology providers and make the best use of the capabilities in their products. In 2.7, we’re adding tighter integration with Cloudera Navigator to strengthen our overall data governance capabilities:

  • StreamSets Data Collector now supports publishing pipeline lineage and provenance data to Cloudera Navigator, Cloudera’s solution for managing governance with Apache Hadoop™. This is an early release of this integration and should only be used in test environments. We encourage you to send us feedback as you test this new feature so we can refine it before it becomes available in an upcoming release.
  • Looking to perform lightning fast lookups within the pipeline? The new Kudu Lookup processor allows in-stream enrichment of data from a Kudu store.

Other Enhancements

We’ve also made improvements to StreamSets Data Collector to make it more feature rich. Added in this release are:

  • In addition to the existing HashiCorp Vault integration, customers can now also use credentials stored in a Java keystore.
  • New pipeline lifecycle events let you trigger tasks when pipelines start or stop. This broadly useful feature lets you perform pre- and post-processing tasks for your dataflows, including those that replace Apache Sqoop™ jobs. Examples include truncating Hive tables before importing new data, deleting directories, updating HDFS metadata, and creating _SUCCESS marker files. Watch this space for an upcoming blog post on how to use this feature as a simple replacement for Apache Sqoop™ pipelines.
  • A new executor for Amazon S3 lets you create new objects, or tag existing objects after delivering data into an S3 bucket.
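
To give a feel for the kind of pre- and post-processing tasks that pipeline start and stop events can trigger, here is a minimal Python sketch. It is not Data Collector's API: the function names and directory layout are hypothetical, and it works on a local filesystem rather than HDFS or Hive.

```python
import shutil
from pathlib import Path


def pre_process(staging_dir: Path) -> None:
    """Hypothetical pipeline-start task: clear leftovers from a previous run."""
    if staging_dir.exists():
        shutil.rmtree(staging_dir)
    staging_dir.mkdir(parents=True)


def post_process(output_dir: Path) -> Path:
    """Hypothetical pipeline-stop task: write an empty _SUCCESS marker file."""
    marker = output_dir / "_SUCCESS"
    marker.touch()
    return marker
```

In a real deployment you would configure the equivalent actions in the pipeline itself; the sketch only shows the shape of the work involved.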

We’re constantly working with our customers and the community to identify which connectors and features to add. If you’d like to see more connectors or other features, please let us know via this quick survey.

We're excited to put this new functionality into your hands and look forward to hearing your feedback on it. To get started with StreamSets Data Collector 2.7, download the new version today.

Kirit Basu