New Release of StreamSets Data Collector Takes on Data Drift
Company to Transform How Big Data Flows Are Managed
“Given the siloed, yet strategic nature of data, enterprises must develop a culture of data performance management,” said Girish Pancha, CEO, StreamSets, Inc. “Just as network operations and security operations matured from numerous siloed projects into centers of excellence, we believe it is time for data operations to make that same critical leap. StreamSets was founded to build the cornerstone infrastructure upon which enterprises can institute disciplined performance management for their data-in-motion.”
Data Drift Calls for a New Paradigm for Managing Data in Motion
Today, moving data reliably and with quality is blocked by the reality of data drift — the deluge of unpredictable, unannounced and unending mutations to data that occur at the source and blindside consuming applications. Data drift breaks ingest pipelines and causes undetected data loss and corrosion that pollutes data stores and downstream analysis. In a recent survey, StreamSets found that challenges with data drift were universal, with 96% of respondents experiencing data drift and nearly one-third citing it as a frequent occurrence.
Data drift cannot be addressed by the prevailing approach of custom coding with low-level frameworks such as Sqoop, Flume and Kafka. These approaches break in the face of unexpected schema changes and cannot detect semantic changes, since they cannot inspect data within the flow. This creates a chaotic environment that results in false or missed insights, loss of trust in the data, constant firefighting and endless janitorial data cleansing.
StreamSets Data Collector: Giving Enterprises Control, Efficiency and Agility
With its latest v1.2 release, StreamSets Data Collector automates data drift handling and now supports the three major Hadoop distributions, from Cloudera, MapR and Hortonworks. This version is also certified with the MapR Converged Data Platform, including extended support for MapR Streams. It also provides connectors for other popular big data technologies such as Elasticsearch, for NoSQL databases such as MongoDB and Cassandra, and for transient stores such as Apache Kafka, MapR Streams and JMS-compliant message queues.
StreamSets Data Collector gives enterprises the necessary control, efficiency and agility to effectively manage performance of their data flows.
- Data flow KPIs for real-time control: Uniquely, StreamSets Data Collector monitors, detects and acts on changes in data patterns, while also providing fine-grained metrics on data flow throughput, latency and error rates. Data drift-handling rules ensure that pipelines flow correctly even when the schema changes. Threshold rules, alerts and plug-in processors combine to identify, filter, re-route and sanitize anomalies in-stream to ensure that data lands ready for consumption.
- Adaptable pipelines for efficiency: StreamSets Data Collector provides a visual integrated development environment (IDE) for the design and execution of intent-driven data flows with minimal schema specification and custom code. It is a highly flexible environment, handling both batch and streaming data, and deploying on edge nodes, natively in clusters and as part of an application stack.
- Containerized architecture for agility: Built for continuous operations, StreamSets Data Collector addresses the issues of constant infrastructure upgrades and data flow evolution head-on. Each source, stage and destination in a pipeline is isolated, allowing you to maintain and modernize your data infrastructure while ensuring zero downtime.
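To make the idea of drift-handling and threshold rules concrete, the following is a minimal, generic sketch in Python. It is not StreamSets Data Collector code and does not use its API. The baseline field names, the `detect_drift` and `route` helpers, and the 25% error threshold are all hypothetical, chosen only to illustrate how records might be inspected in-stream, re-routed when they no longer match expectations, and flagged when an error-rate threshold is crossed.

```python
# Illustrative only: a generic drift check, NOT the StreamSets Data Collector API.
# EXPECTED_FIELDS and the 0.25 threshold are hypothetical values for this sketch.
EXPECTED_FIELDS = {"user_id", "amount", "currency"}


def detect_drift(record, expected=EXPECTED_FIELDS):
    """Return (missing, unexpected) field names relative to the baseline."""
    fields = set(record)
    return expected - fields, fields - expected


def route(records, expected=EXPECTED_FIELDS, max_error_rate=0.25):
    """Split records into clean and quarantined lists and evaluate a threshold rule.

    Records with extra fields are tolerated (the flow keeps running when the
    schema grows); records missing an expected field are re-routed to quarantine.
    """
    clean, quarantined = [], []
    for record in records:
        missing, _unexpected = detect_drift(record, expected)
        if missing:
            quarantined.append(record)
        else:
            clean.append(record)
    total = len(clean) + len(quarantined)
    error_rate = len(quarantined) / total if total else 0.0
    alert = error_rate > max_error_rate  # threshold rule
    return clean, quarantined, alert


if __name__ == "__main__":
    sample = [
        {"user_id": 1, "amount": 5, "currency": "USD"},
        {"user_id": 2, "amount": 7, "currency": "USD", "region": "EU"},  # new field
        {"user_id": 3, "amt": 9, "currency": "USD"},  # renamed field
    ]
    clean, quarantined, alert = route(sample)
    print(len(clean), len(quarantined), alert)
```

In this sketch, the second record carries a new field and still passes through, while the third, whose `amount` field was renamed at the source, is quarantined rather than silently polluting the destination.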
“The ability to ingest multiple streams of rapidly changing data and integrate it with historical batch data is already a problem for early adopters of big data processing technologies that will quickly become a mainstream problem for later adopters alike,” said Matt Aslett, research director, data platforms and analytics, 451 Research. “With StreamSets Data Collector, StreamSets has a differentiated offering that is designed specifically to address this problem.”
Enterprises Are Turning to StreamSets
Since exiting stealth mode in September 2015, StreamSets has gained tremendous traction in the marketplace, with thousands of downloads across hundreds of companies, including mission-critical commercial deployments at Cisco Systems, Lithium Technologies and Planet Labs.
StreamSets software delivers performance management for data flows that feed the next generation of big data applications. Its mission is to bring operational excellence to the management of data in motion, so that data arrives on time and with quality, accelerating analysis and decision making. StreamSets Data Collector is in use at hundreds of companies where it brings unprecedented visibility into and control over data as it moves between an expanding variety of sources and destinations.
Founded in 2014 by Girish Pancha, former chief product officer of Informatica, and Arvind Prabhakar, an early employee and engineering leader at Cloudera, StreamSets is headquartered in San Francisco and is backed by Accel Partners, Battery Ventures and New Enterprise Associates (NEA). For more information, visit streamsets.com.