StreamSets Data Collector – the First Four Years

Posted in Data Integration, August 16, 2018

June 2018 marked the fourth anniversary of StreamSets’ founding; here’s a look back at the past four years of StreamSets and the Data Collector product, from the early days in stealth-startup mode, to the recent release of StreamSets Data Collector 3.4.0.

Girish Pancha and Arvind Prabhakar founded StreamSets on June 27th, 2014. Girish had been at Informatica for many years, rising to the level of Chief Product Officer; Arvind had held technical roles at both Informatica and Cloudera. Between them, they had realized that the needs of enterprises were not being met by existing data integration products – for instance, the best practice for many customers ingesting data into Hadoop was laborious manual coding of data processing logic and orchestration using low-level frameworks like Apache Sqoop. They also realized that this would be true for other proliferating data platforms such as Kafka (Kafka Connect), Elastic (Logstash) and cloud infrastructure (AWS Glue, etc.). The main obstacle they identified was ‘data drift’, the inescapable evolution of data structure, semantics and infrastructure, which makes data integration difficult and solutions brittle. Data drift exists in the traditional data integration world, but grows and accelerates when dealing with modern data sources and platforms. The founding vision for StreamSets was to “solve the data drift problem for all modern data architectures”.

Girish and Arvind spent over a year talking to dozens of enterprises across the world to validate the problem, find product-market fit, and define the feature set for what would become StreamSets Data Collector. The initial version, 1.0.0, was released to a subset of these enterprises while StreamSets was still in stealth mode.

That first version of Data Collector defined the architecture for the product, processing continuous streams of data in micro-batches, and introduced Data Collector’s signature user interface, allowing users to easily create, run and monitor data pipelines by dragging and dropping pipeline elements, or stages. Under the covers, this implementation used dataflow sensors, which lie at the heart of the solution to data drift. These dataflow sensors enabled continuous monitoring and control over every aspect of dataflows while operating in production environments.
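To make the micro-batch idea concrete, here is a minimal Python sketch of a continuous stream being cut into small batches and passed through a chain of stages. It is an illustration only, not Data Collector's implementation: the names `micro_batches` and `run_pipeline` are hypothetical, and it closes a batch purely on record count, whereas a real engine would typically also close a batch after a wait time.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to batch_size records from stream."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def run_pipeline(
    stream: Iterable[T],
    stages: List[Callable[[List[T]], List[T]]],
    batch_size: int = 3,
) -> List[T]:
    """Pass each micro-batch through every stage in order and collect the output."""
    results: List[T] = []
    for batch in micro_batches(stream, batch_size):
        for stage in stages:
            batch = stage(batch)
        results.extend(batch)
    return results


def keep_even(batch: List[int]) -> List[int]:
    """Hypothetical stage: drop odd values."""
    return [value for value in batch if value % 2 == 0]


def double(batch: List[int]) -> List[int]:
    """Hypothetical stage: multiply every value by two."""
    return [value * 2 for value in batch]


if __name__ == "__main__":
    print(run_pipeline(range(7), [keep_even, double]))
```

Because each stage only ever sees one small batch at a time, memory use stays bounded no matter how long the stream runs.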

Running Pipeline

StreamSets Data Collector 1.0.0 shipped with an impressive array of connectors, including origins reading data from Amazon Kinesis, Apache Kafka, Directory, File Tail, HTTP, JDBC and MongoDB, and destinations targeting Amazon Kinesis, Apache Cassandra, Apache HBase, Apache Kafka, Elasticsearch, Local FS and JDBC.

In addition, multiple versions of both the Cloudera and Hortonworks Hadoop distributions were supported, allowing users to read and write data to the Hadoop distributed file system. StreamSets’ innovative architecture allowed data pipelines to not only target multiple versions of platforms such as Kafka and Hadoop, but to do so within the same pipeline, allowing users to build pipelines to migrate from one version to another, or even to write to two versions of a platform simultaneously.

Aside from connectivity, the initial release of Data Collector also allowed users to apply many transformations to data as it flowed through the pipeline. That initial set of processors included Expression Evaluator, Field Remover, Field Masker, Geo IP (IP address to geographic data), Record Deduplicator and Stream Selector.

Recognizing that not every use case can be solved with off-the-shelf components, version 1.0.0 included JavaScript and Jython (Python on the JVM) script evaluators, allowing users to implement custom logic without needing to go all the way to writing a custom processor in Java.
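As a rough illustration of what such a script looks like, the sketch below follows the general shape of a script evaluator: loop over the batch of records, change each one, and send it either onward or to error handling. The names `records`, `output` and `error` follow the bindings found in later Data Collector releases and are an assumption for version 1.0.0. The `Record` and `Sink` classes and the field names are hypothetical stand-ins so the example runs outside Data Collector.

```python
class Record:
    """Hypothetical stand-in for a pipeline record; fields live in .value."""

    def __init__(self, value):
        self.value = value


class Sink:
    """Hypothetical stand-in for the output and error bindings."""

    def __init__(self):
        self.items = []

    def write(self, record, message=None):
        self.items.append((record, message))


def evaluate(records, output, error):
    """Add a full_name field to each record; route failures to error."""
    for record in records:
        try:
            first = record.value["first"]
            last = record.value["last"]
            record.value["full_name"] = first + " " + last
            output.write(record)
        except Exception as exc:
            # A record missing a field goes to error handling, not downstream.
            error.write(record, str(exc))


if __name__ == "__main__":
    batch = [
        Record({"first": "Ada", "last": "Lovelace"}),
        Record({"first": "Grace"}),  # no "last" field, so this one fails
    ]
    out, err = Sink(), Sink()
    evaluate(batch, out, err)
    print(len(out.items), "written,", len(err.items), "sent to error")
```

Inside a real evaluator only the body of the loop would be written, since the engine supplies the batch and the two sinks.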

Even in that initial release, dataflow pipelines could run standalone, or on a cluster, leveraging YARN and MapReduce to read data from Hadoop FS, and Spark Streaming to read data in parallel from multiple Kafka partitions.

Finally, demonstrating StreamSets’ commitment to DataOps from its very beginnings, the RESTful API allowed data pipeline operations to be automated.
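As a sketch of what that automation might look like, the Python below builds an HTTP request to start a pipeline without sending it. The endpoint path `/rest/v1/pipeline/<id>/start`, the `X-Requested-By` header and port 18630 reflect later Data Collector releases as best I understand them, and should be treated as assumptions to check against the REST API documentation for your version. The function name, pipeline ID and credentials are hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_start_request(base_url, pipeline_id, user, password):
    """Build, but do not send, a POST request that asks for a pipeline to start.

    The endpoint path and the X-Requested-By header are assumptions; verify
    them against the REST API documentation for your Data Collector version.
    """
    quoted_id = urllib.parse.quote(pipeline_id, safe="")
    url = f"{base_url.rstrip('/')}/rest/v1/pipeline/{quoted_id}/start"
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url,
        data=b"",  # empty body
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "X-Requested-By": "sdc",
        },
    )


if __name__ == "__main__":
    request = build_start_request(
        "http://localhost:18630", "my pipeline", "admin", "admin"
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the call easy to inspect in a dry run before it is wired into a scheduler or deployment script.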

The big reveal was in September 2015, with a blog post, Start with Why: Data Drift, and StreamSets Data Collector 1.1.0, released as open source under the Apache 2.0 license.

This first public version tidied up the 1.0.0 code, adding the necessary license headers and acknowledgements, a command-line interface, and more pipeline stages, including an Amazon S3 origin, Field Merger, Field Renamer, Apache Hive Streaming destination and Java Message Service (JMS) origin.

Data Collector 1.1.0 was quickly picked up by data engineers, data scientists and application developers looking to solve their data ingest problems, and there was a burst of activity on the sdc-user Google Group as users got to work building data pipelines.

The rapidly growing StreamSets product team released new versions of Data Collector every few weeks – far too many to list here in detail. Here are just some of the highlights…


The past four years have seen over 2 million downloads of Data Collector and its components, and over 400,000 pulls of the Docker images. Because Data Collector is open source, it is difficult to ascertain exactly who has downloaded it, but we have been able to identify over 2,000 enterprises across all industries and government, including over ⅓ of the Fortune 500 and ⅔ of the Fortune 100.

Finally, bringing the story right up to date, StreamSets Data Collector 3.4.0, released in July 2018, includes several major new features:


Over the past four years, StreamSets Data Collector has evolved from a data ingest tool to the workhorse of the StreamSets DataOps Platform, joined by DPM, Control Hub, Data Collector Edge and, most recently, Data Protector. The StreamSets community has grown in parallel from a handful of users to many hundreds of data engineers, data scientists and developers across the sdc-user Google Group, the StreamSets Slack community channel, and the Ask StreamSets Q&A site.

Many thanks to our customers and community for their support over these past four years. If you’re among our users, join us in toasting this milestone; if not, take a look at Data Collector, and see what all the fuss is about!
