The DataOps Blog

Where Change Is Welcome

Announcing StreamSets Data Collector

Posted in StreamSets News, January 27, 2016

We are very excited to announce the next version of the StreamSets Data Collector. This version covers over 250 JIRA issues and brings a host of new features, performance enhancements, and bug fixes.

Without further ado, here’s what’s new in

Versioning Scheme

To keep up with an incredible volume of feature requests from customers and the community, we are introducing a new versioning scheme. From here on, we will use a 4-digit scheme that looks like so: W.X.Y.Z. Please see the FAQ for an explanation of this new scheme.

Core Functionality

Alert ELs for Detecting Data Drift
One of the core problems the Data Collector is designed to solve is data drift. This new version adds yet another powerful feature to deal with drift. We have new expression language (EL) primitives that detect when fields change. This is particularly useful for alerting you when fields are added, deleted, or changed in the incoming datasets.
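
To illustrate the kind of check these ELs make possible, here is a minimal Python sketch that compares two consecutive records and reports which fields were added, removed, or changed type. It is not the Data Collector's EL syntax or implementation; the function name `detect_field_drift` and the dict-based record shape are assumptions made for this example.

```python
def detect_field_drift(previous, current):
    """Compare two records (plain dicts) and report structural drift.

    Hypothetical helper for illustration only; not a StreamSets API.
    """
    prev_keys, curr_keys = set(previous), set(current)
    # Fields present only in the newer record.
    added = sorted(curr_keys - prev_keys)
    # Fields present only in the older record.
    removed = sorted(prev_keys - curr_keys)
    # Fields present in both records whose value type changed.
    retyped = sorted(
        key for key in prev_keys & curr_keys
        if type(previous[key]) is not type(current[key])
    )
    return {"added": added, "removed": removed, "retyped": retyped}


def has_drift(previous, current):
    """Return True if any field was added, removed, or retyped."""
    return any(detect_field_drift(previous, current).values())
```

An alert rule could then fire whenever `has_drift` returns true for two consecutive records in a stream.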

Access Record Header Attributes via ELs
A much-requested feature was access to record header attributes through the expression language. This version adds it.

First Class Protocol Buffer Support
Adding to our existing support of serialization formats such as Avro, the Data Collector can now read and write protocol buffer data.

To further simplify the installation and management of Data Collector, we now offer a new packaging paradigm. Users can download a basic version of the SDC, and add support for additional stages on an individual basis as needed.

Note: We created this feature for customers who requested a smaller download. The full TAR, RPM, and Cloudera parcel files with all stages enabled are still available.

File Destination
Another highly requested feature, and one that's very useful for quick and dirty testing on local machines: we now support a local file system destination stage. You can write data to a local file system or to a mounted one.

Compressed File Support for Amazon S3 and File Directory
The Data Collector now natively reads and writes compressed files. We support a large number of compression formats including common ones such as zip and gzip. This enables easily reading compressed and archived files, such as rotated webserver logs.
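
To show what transparent decompression looks like when reading rotated logs, here is a minimal Python sketch using only the standard library. It is a conceptual illustration, not how the Data Collector implements the feature; the function name `read_lines` and the choice to detect gzip by the `.gz` suffix are assumptions made for this example.

```python
import gzip


def read_lines(path):
    """Yield text lines from a file, decompressing it first if it ends in .gz.

    Hypothetical helper for illustration only; not a StreamSets API.
    """
    # Pick the opener from the file suffix (an assumption of this sketch).
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

With this, a plain `access.log` and a rotated `access.log.1.gz` can be read by the same loop, for example `for line in read_lines("access.log.1.gz"): ...`.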

LDAP/AD Support
An often-requested feature from our enterprise customers is LDAP or AD support. In our continued commitment to the open source community, we have decided to create and open source LDAP/AD support.

Adding to our existing rich support for clustering, the Data Collector now has out-of-the-box support for Apache Spark on Mesos. Configure the Mesos Dispatcher URL and Checkpoint directory in the Data Collector and hit Start. That's all it takes to fire up StreamSets on Mesos.

CDH 5.5 Support + Cloudera Parcels
Cloudera CDH 5.5 is now supported. In a previous release we announced support for Cloudera parcels. Cloudera customers can now also use Cloudera Manager to create and maintain Kerberos principals and keytabs for StreamSets.

RabbitMQ Origin
RabbitMQ is a lightweight message queue system that is very popular with our community. We are happy to announce support for reading from RabbitMQ.

Kafka 0.9 Support
We now support the latest Kafka release, version 0.9. With this version, users can use Data Collector with the new security features within Kafka (SSL encryption for data over the wire, authentication using TLS client certificates, and authentication using Kerberos).
* Please note: The Kafka community considers these security features beta quality. Please use caution when using these features in a production setting.

Amazon S3 Destination
We now support writing data to Amazon S3 buckets. Drag and drop an Amazon S3 destination to your canvas, specify the API keys and that’s all you need to start sending data to the cloud.

First Class Support for Amazon Kinesis
We’ve updated our support for Amazon Kinesis. You can now read and write multiple data formats and configure a number of partitioning strategies, and the integration performs better overall.

We’ve created a new tutorials repository on GitHub with sample data, pipelines, and guides on how to use some of the basic features of the Data Collector.

We will keep adding to this repository over time. Please feel free to suggest topics or contribute.

What’s Next
We are planning a host of new integrations, new core features, and improvements to the project over the next few months. We continue to be humbled by the positive responses from our customers and the community. We’d love to hear your feedback on what else you’d like to see included and how you’d like to participate.

Please connect with us on GitHub or visit Community.

Download Data Collector
