Dataflow Performance Blog

Start with Why: Data Drift

Today, after a year of working in stealth mode with a number of enterprise charter customers, we are excited to launch StreamSets. Arvind and I started StreamSets in June 2014 because, as they say in French, “plus ça change, plus c’est la même chose.” Or in other words, the more things change, the more they stay the same.

Arvind had come to realize over his four-year career at Cloudera that the best practice for most customers ingesting data into Hadoop was to manually code data processing logic and orchestrate it using open source frameworks. I was flabbergasted! As Chief Product Officer at Informatica, I had spent more than a dozen years delivering various technologies that automated the processing and moving of data into data warehouses. So why were people doing this manually for big data stores?

The answer, it turns out, doesn’t have to do with big data stores, but with source “data drift.” Source data for big data applications is subject to constant and unexpected change. This can be because of the inherent infrastructure drift in the data center, with data sources moving continuously between physical and virtual, and between on-premises and cloud. Or it can be structural or semantic in nature, where the shape of data changes because of patches and upgrades to the source system. Previous-generation ETL technologies were just too schema-centric to handle this gracefully. And frankly, so are most manually coded approaches.

We formed StreamSets to solve this data drift problem for all big data stores.

Before we started development, we spent many months last year talking to more than 30 enterprises across the world to validate the problem, find product-market fit, and define the Minimum Viable Product (MVP) for Version 1. We have since talked to many, many more. We are so confident of the value of StreamSets Data Collector to data teams everywhere that we are making all of it available under an Apache license. We are busy continuing to innovate back at our labs, but in the meantime, take the Data Collector for a spin and let us know what you think at

Wondering about the title of this blog post, ‘Start with Why’? Check out this fascinating TED talk by author Simon Sinek.

Girish Pancha