Modern Complexity, Change and Data Drift

The growing complexity and constant change in modern data architectures drive a new challenge: data drift. Data drift is unexpected change to data structure and semantics (the meaning of data) that breaks pipelines or pollutes data. Addressing data drift in a world where sources, compute platforms and use cases are always evolving requires unprecedented agility in how you design and operate your data movement architectures. Data drift, if not systematically dealt with, will cripple your analytics agenda.

The Ever-Changing Data Supply Chain Creates Data Drift

Data drift is caused by commonplace updates made to data sources, data infrastructure and analytics requirements as a part of normal business operations. Data drift has always existed, but as the data supply chain becomes more complex and fluid, and more people and applications act on the data, it has grown from an annoyance into a constant and growing strain on companies striving to make intelligence pervasive.

Data Drift Causes Failed Pipelines and Polluted Data

Data drift leads to serious problems such as dropped or corrupted data that pollutes analysis, or outright pipeline failures. It shows up as data fields that are unexpectedly added, deleted or changed in meaning, brought on by upgrades to IoT device firmware, new versions of infrastructure such as Hadoop or Kafka, or migration to cloud analytics platforms.
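To make the idea concrete, here is a minimal, hypothetical sketch (not any particular product's implementation) of how a pipeline might flag drift: each incoming record is compared against an expected schema, and missing fields, type changes, and unexpected new fields are reported. The field names and types are invented for illustration.

```python
# Hypothetical expected schema for an IoT reading; names and types
# are illustrative only.
EXPECTED_FIELDS = {"device_id": str, "temperature": float, "ts": int}

def detect_drift(record: dict) -> list:
    """Return human-readable drift findings for one record."""
    findings = []
    # Fields that disappeared or changed type.
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            findings.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            findings.append(
                f"type change: {name} is {type(record[name]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    # Fields that appeared without warning.
    for name in record.keys() - EXPECTED_FIELDS.keys():
        findings.append(f"unexpected new field: {name}")
    return findings

# A firmware upgrade renames "temperature" to "temp_c" and adds "humidity":
drifted = {"device_id": "d1", "temp_c": 21.5, "ts": 1700000000, "humidity": 40}
print(detect_drift(drifted))
```

A check like this catches structural drift, but semantic drift (the same field silently changing meaning or units) requires monitoring the data values themselves, not just the schema.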

The fragmentation of governance across the modern data supply chain is one key cause of data drift. While it is normal for the owners of source systems and infrastructure to adjust and optimize each system in the data supply chain to meet “local” needs, these changes create unintended and potentially widespread problems for downstream data users, whether they are automated applications or data scientists.

Data Drift is a Quiet Killer

Data drift is the classic example of “death by a thousand cuts”. A single pipeline break or data integrity problem caused by data drift may by itself be a nuisance, but multiplied across dozens or hundreds of changes, these issues lead to chaos. The result is endless maintenance and fire-fighting of data pipelines that not only creates downtime for data consumers but also diverts developer resources and management attention away from higher-value work.

Even worse, over time the data availability and quality issues caused by data drift create a fundamental distrust in the integrity of the data amongst consumers such as analysts and application developers.

Traditional Approaches Fail at Handling Data Drift

The insidious and damaging effects of data drift cannot be addressed by traditional data integration methods which are architected for infrequent and centrally-planned changes to data pipelines. Nor can architectural complexity and data drift be countered using a hodgepodge of platform-specific data ingestion tools as these low-level point-to-point frameworks defy end-to-end visibility and control.

A new approach is needed, one that treats data movement as a logical layer unto itself that is managed as a continuous and critical process. We call this DataOps.

