In my previous post I discussed the causes and impacts of data drift, a natural consequence of Big Data that creates serious data quality and data pipeline operational issues. In this post I describe the features of StreamSets Data Collector, explain how they address data ingest in a “drifty” environment, and outline some common use cases.
StreamSets was founded to deliver a continuous ingest solution that allows enterprises to stop worrying about data drift once and for all. StreamSets Data Collector is a fully open source, Apache-licensed product that enables you to overcome data movement challenges in any environment. It’s built from the ground up to anticipate and handle drift, thereby dramatically improving both the quality of your data and the productivity of your data operation.
We are able to do this by building on a set of core design ideas: intent-driven data flows, in-stream sanitization, intelligent monitoring, and low-latency, in-memory, continuous transportation of data.
Intent-driven Data Flows
Intent-driven data flows are ingest streams that require minimal specification of structures and constraints at setup. Unlike traditional ETL processes, which require you to anticipate and codify every aspect of the incoming data structure, intent-driven data flows require you to specify only what is strictly necessary for proper consumption. For example, if you are expecting payment information in the data stream, it is sufficient to state the intent as the presence of a credit card number attribute. It is irrelevant what the other attributes are, how many there are, or whether they change over time.
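To make the credit card example concrete, here is a minimal Python sketch of an intent check. It is only an illustration of the idea, not StreamSets Data Collector code: the field name `credit_card_number` and the function `meets_intent` are hypothetical names chosen for this example.

```python
from typing import Any, Mapping

# Hypothetical field name chosen for this illustration.
REQUIRED_FIELD = "credit_card_number"


def meets_intent(record: Mapping[str, Any], required_field: str = REQUIRED_FIELD) -> bool:
    """Return True when the record carries a non-empty value for the required field."""
    value = record.get(required_field)
    return value is not None and str(value).strip() != ""


# The same check keeps working as unrelated attributes come and go.
original = {"credit_card_number": "4111111111111111", "amount": 12.50}
drifted = {"credit_card_number": "4111111111111111", "total": "12.50", "loyalty_tier": "gold"}
missing = {"amount": 7.00}

print(meets_intent(original), meets_intent(drifted), meets_intent(missing))
```

The check never looks at any attribute other than the one named in the intent, so renamed or newly added fields in `drifted` do not affect the result.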
While intent-driven data flows allow you to identify and route information as needed in real time, in-stream sanitization lets you prepare, refine and enrich your data for easy and immediate consumption upon arrival into your destination systems. Without this, every application requires one-off data preparation, slowing down consumption, wasting resources and complicating the assimilation of new data sources or applications.
StreamSets Data Collector provides a full range of mechanisms for in-stream sanitization, from data masking and anonymization to filtering and even sophisticated script-based data transformation.
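As an illustration of what a masking step does to a record in flight, the following Python sketch hides all but the last four digits of a card number. It is not the Data Collector's own masking implementation; `mask_card_number`, `sanitize`, and the field name are assumptions made for this example.

```python
from typing import Any, Dict, Mapping


def mask_card_number(number: str, visible: int = 4) -> str:
    """Replace all but the last `visible` digits with '*', dropping separators."""
    visible = max(visible, 0)
    digits = [ch for ch in number if ch.isdigit()]
    if len(digits) <= visible:
        # Too short to reveal anything safely: mask every digit.
        return "*" * len(digits)
    hidden = len(digits) - visible
    return "*" * hidden + "".join(digits[hidden:])


def sanitize(record: Mapping[str, Any], field: str = "credit_card_number") -> Dict[str, Any]:
    """Return a copy of the record with the card field masked, if present."""
    clean = dict(record)
    if clean.get(field) is not None:
        clean[field] = mask_card_number(str(clean[field]))
    return clean


record = {"credit_card_number": "4111-1111-1111-1111", "amount": 12.50}
print(sanitize(record))
```

Because `sanitize` returns a copy, the incoming record is left untouched and only the masked version reaches the destination.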
As data flows through ingest pipelines, flow performance and the data profile itself are sampled and inspected by the intelligent monitoring system built into the Data Collector. Sophisticated drift handling functions can be applied using an expression language to detect and trigger alerts when certain conditions are met or thresholds are breached. All the metrics collected by the Data Collector about the flow or data profile are made available via JMX as well as via a REST API to allow integration with external monitoring systems where needed.
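The threshold idea can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the Data Collector's expression language or metrics API: the class `MissingFieldMonitor`, its window size, and its threshold are all hypothetical. It tracks how often an expected field is absent over a sliding window of recent records and signals when that rate exceeds the threshold.

```python
from collections import deque
from typing import Any, Deque, Mapping


class MissingFieldMonitor:
    """Track how often an expected field is absent in the most recent records."""

    def __init__(self, field: str, window: int = 100, threshold: float = 0.2) -> None:
        self.field = field
        self.threshold = threshold
        # Each entry is True when the field was missing from that record.
        self._missing: Deque[bool] = deque(maxlen=window)

    def missing_rate(self) -> float:
        """Fraction of records in the window that lacked the field."""
        if not self._missing:
            return 0.0
        return sum(self._missing) / len(self._missing)

    def observe(self, record: Mapping[str, Any]) -> bool:
        """Record one observation; return True when an alert should fire."""
        self._missing.append(self.field not in record)
        return self.missing_rate() > self.threshold


monitor = MissingFieldMonitor("credit_card_number", window=4, threshold=0.5)
stream = [{"credit_card_number": "1"}, {"amount": 1}, {"amount": 2}]
alerts = [monitor.observe(r) for r in stream]
print(alerts)
```

A sliding window is used here so that an alert reflects recent behavior of the stream rather than its entire history; an external monitoring system could poll `missing_rate()` in the same spirit as reading a metric.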
The Data Collector runs as a continuous, 100% in-memory service to ensure timely data movement. All data that fails to meet the specified intent is routed to exception flows that you can easily automate. Coupled with the ease of modifying data flows, this allows you to adapt them rapidly to fit your changing environment. The Data Collector can be deployed on edge nodes, as a cluster-based Apache Spark streaming application, or on a YARN cluster to efficiently process large-scale batch jobs.
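The routing of non-conforming data to an exception flow can be pictured with the short Python sketch below. It is a simplified, in-memory illustration with hypothetical names (`route`, `has_card`), not a description of how the Data Collector implements error handling.

```python
from typing import Any, Callable, Iterable, List, Mapping, Tuple

Record = Mapping[str, Any]


def route(
    records: Iterable[Record],
    intent: Callable[[Record], bool],
) -> Tuple[List[Record], List[Record]]:
    """Split records into those that meet the intent and those that do not."""
    delivered: List[Record] = []
    exceptions: List[Record] = []
    for record in records:
        # Records failing the intent are kept, not dropped, so they can be
        # inspected or reprocessed by a separate exception flow.
        (delivered if intent(record) else exceptions).append(record)
    return delivered, exceptions


def has_card(record: Record) -> bool:
    """Hypothetical intent: the record must carry a credit card number."""
    return bool(record.get("credit_card_number"))


batch = [{"credit_card_number": "4111111111111111"}, {"amount": 7.00}]
delivered, exceptions = route(batch, has_card)
print(len(delivered), len(exceptions))
```

Keeping the failed records in their own list mirrors the point made above: nothing is silently discarded, and the exception side can be handled by whatever automated process suits your environment.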
Today, StreamSets Data Collector is used in large-scale production environments where it provides rapid and continuous data movement while curtailing the impact of data drift. It has reduced the overall latency to data consumption from days to minutes due to its ease of use and broad applicability.
While the Data Collector can be used for any form of data movement within the enterprise infrastructure, a few use cases that have emerged include:
- Apache Hadoop ingest – capturing data from various sources spread across the enterprise and loading it into Hadoop systems like HDFS, Apache Hive, Apache HBase, etc.;
- Search ingest – capturing data from various sources and loading into distributed search systems like Cloudera Search;
- Log shipping – collecting, aggregating and delivering consumption-ready log data from a diverse set of sources;
- Apache Kafka enablement – loading data into and draining data from Kafka clusters into the storage tier.
We invite you to try out StreamSets Data Collector yourself. You can download it directly, or use Cloudera Manager for easy installation. Feel free to check out our codebase on GitHub, or drop by our mailing list if you have any questions, comments or feedback. And join us for our joint webinar on March 1, 2016, How to Implement Simple, Reliable, Continuous Data Delivery.
Arvind Prabhakar, Founder and CTO, StreamSets Inc.
Arvind Prabhakar is a seasoned engineering leader who has worked on data integration challenges for over ten years. Before co-founding StreamSets, Arvind was an early employee of Cloudera, where he led teams working on integration technologies such as Flume and Sqoop. A member of the Apache Software Foundation, Arvind is heavily involved in the open-source community as the PMC Chair for Apache Flume, the first PMC Chair of Apache Sqoop, and a member of various other Apache projects.
Prior to Cloudera, Arvind was a software architect at Informatica, where he was responsible for architecting, designing, and implementing several core systems.