The answer, it turns out, has to do not with big data stores, but with source “data drift.” Source data for big data applications is subject to constant and unexpected change. Some of this change comes from infrastructure drift in the data center, with data sources moving continuously between physical and virtual, and between on-premises and cloud. Some of it is structural or semantic in nature, where the shape of the data changes because of patches and upgrades to the source system. Previous-generation ETL technologies were just too schema-centric to handle this gracefully. And frankly, so are most manually coded approaches.
We formed StreamSets to solve this data drift problem for all big data stores.
Before we started development, we spent many months last year talking to more than 30 enterprises across the world to validate the problem, find product-market fit, and define the Minimum Viable Product (MVP) for Version 1. We have since talked to many more. We are so confident of the value of StreamSets Data Collector to data teams everywhere that we are making all of it available under an Apache license. We are busy continuing to innovate back at our labs, but in the meantime, take the Data Collector for a spin and let us know what you think at streamsets.com/community.
Wondering about the title of this blog post, ‘Start with Why’? Check out this fascinating TED talk by author Simon Sinek.