Forward-looking, data-driven enterprises increasingly leverage Big Data platforms, such as Hadoop, Elasticsearch, and Amazon Web Services, to derive insights from non-transactional, machine-generated data. Many tools have emerged to power next-generation data pipelines and provide specialized analytic capabilities. To get value from these technologies, data must reside in intermediate data stores in a consumable form. However, existing data integration tools do not offer the means to continuously extract data from the exploding variety of machine data sources and load it into Big Data platforms in a consumption-ready manner.
Consequently, next-generation analytic applications end up implementing custom data ingestion processes in support of their specific use cases. Some enterprises have custom-built data ingestion infrastructure to keep information flowing into Big Data platforms. These homegrown implementations typically leverage open source frameworks such as Apache Kafka, Flume, Spark Streaming, and Sqoop.
However, the data produced in source systems changes structure and format without notice. This often causes silent data corruption, data loss, or both. With increasing rates of data production, manual oversight to safeguard against such problems is neither economically viable nor sustainable.
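To see how a structural change can corrupt data silently, consider a minimal sketch of a homegrown ingest step that maps source records to a fixed schema. The field names and the upstream rename are hypothetical, chosen only to illustrate the failure mode: the code raises no error, it just quietly emits nulls.

```python
import json

# A naive homegrown ingest step assuming a fixed source schema.
# Field names here are hypothetical, for illustration only.
def to_row(record: dict) -> dict:
    return {
        "user": record.get("user_id"),    # yields None if the field is ever renamed
        "amount": record.get("amount"),
    }

# Yesterday's events match the assumed schema.
old_event = json.loads('{"user_id": "u1", "amount": 9.99}')

# Today the producer renamed "user_id" to "uid" without notice.
new_event = json.loads('{"uid": "u1", "amount": 9.99}')

print(to_row(old_event))  # {'user': 'u1', 'amount': 9.99}
print(to_row(new_event))  # {'user': None, 'amount': 9.99}  <- silent data loss
```

Because `dict.get` returns `None` for missing keys rather than failing, downstream stores fill with nulls and nobody is alerted until an analyst notices the gap.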
These pressures lead to opaque, brittle, and ad hoc implementations of domain-specific logic for data movement. Such ingestion infrastructure is often constrained by the limitations of the chosen underlying frameworks and the requirements of then-known use cases. These problems make today's ingestion infrastructure a fragile part of otherwise robust Big Data environments, and it ends up becoming the bottleneck to delivering business value.
StreamSets Data Collector aims to address these shortcomings. We were inspired to re-envision data ingestion from the ground up. It's free and open source. Try it out at www.streamsets.com/opensource and let us know what you think.