Big Data or Bad Data? Survey Shows Enterprises Struggle to Manage Big Data Flows
Nearly 90% Report Polluting Their Data Stores; Highlights the Need for a Data Flow Operations Mindset
The survey reveals pervasive data pollution, which implies that analytic results may be wrong, leading to false insights that drive poor business decisions. Even when companies can detect their bad data, cleaning it after the fact wastes data scientists’ time and delays the data’s use, a serious liability in a world increasingly reliant on real-time analysis.
Despite Constant Cleansing, Bad Data Is Polluting Data Stores and Is Difficult to Detect
Respondents cited ensuring data quality as the most common challenge they face when managing big data flows (selected by 68 percent). In addition to bad data flowing into stores, 74 percent of organizations reported currently having bad data in their stores, despite cleansing data throughout the data lifecycle. While 69 percent of organizations consider the ability to detect diverging data values in flow as “valuable” or “very valuable,” only 34 percent rated themselves as “good” or “excellent” at detecting those changes.
Broad Challenges in Managing Data Flow Performance
While detecting bad data is a critical aspect of data flow performance, the survey showed that enterprise struggles are much broader. In fact, only 12 percent of respondents rate themselves as “good” or “excellent” across five key performance management areas, namely detecting the following events: pipeline down, throughput degradation, error rate increases, data value divergence and personally identifiable information (PII) violations.
Respondents felt weakest at detecting performance degradation (44% rated themselves “good” or “excellent”), error rate increases (44%) and divergent data (34%). Detecting a “pipeline down” event was the only area where a large majority felt positively about their capabilities (66%). In every key performance management area there was a very large gap between respondents’ self-reported capabilities and how valuable they considered that competency.
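As a rough illustration of what detecting three of these events could involve, the sketch below compares one monitoring window against a baseline window. It is not taken from the survey or from any StreamSets product; the `WindowStats` shape, the event names and both thresholds are hypothetical assumptions chosen only for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    """Counts for one pipeline over one monitoring window (hypothetical shape)."""
    records_in: int
    records_failed: int


def detect_events(current: WindowStats, baseline: WindowStats,
                  throughput_drop: float = 0.5,
                  error_rate_rise: float = 0.05) -> List[str]:
    """Return the names of events seen in `current` relative to `baseline`.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if current.records_in == 0:
        # No records arrived at all: treat the pipeline as down.
        return ["pipeline_down"]

    events = []
    # Throughput degradation: volume fell below a fraction of the baseline.
    if current.records_in < baseline.records_in * (1 - throughput_drop):
        events.append("throughput_degradation")

    # Error rate increase: failure ratio rose by more than the allowed margin.
    current_rate = current.records_failed / current.records_in
    baseline_rate = (baseline.records_failed / baseline.records_in
                     if baseline.records_in else 0.0)
    if current_rate - baseline_rate > error_rate_rise:
        events.append("error_rate_increase")
    return events
```

Data value divergence and PII violations require inspecting record contents rather than counts, so they are left out of this sketch.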
Fragile Hand Coding Plus Data Drift Is a Dangerous Combination
These quality and performance management issues may be driven by the reality of data drift (unexpected changes in data structure or semantics), combined with the continued use of outdated methods to design data flows, such as low-level coding or schema-driven ETL tools. Making frequent changes to pipelines using these inflexible approaches is not only highly inefficient but also error-prone. These tools also do not let you watch the data in motion, which means you are flying blind and cannot detect data quality or data flow issues.
- Constant tweaking of pipelines due to data drift: Eighty-five percent said that unexpected changes to data structure or semantics create a substantial operational impact. Over half (53%) reported that they have to alter each data flow pipeline several times a month, with 23% making changes several times a week or more.
- The prominence of hand coding and legacy ETL tools: Nearly two-thirds of respondents use ETL/data integration tools and 77 percent use hand coding to design their data pipelines.
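To make the structural side of data drift concrete, here is a minimal sketch that checks an incoming record against an expected set of fields and types. It is an illustration under stated assumptions, not a description of how any surveyed organization or StreamSets handles drift; the field names in `EXPECTED_FIELDS` and both function names are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical expected structure for an illustrative record type.
EXPECTED_FIELDS: Dict[str, type] = {
    "order_id": int,
    "amount": float,
    "currency": str,
}


def detect_structure_drift(record: Dict[str, Any],
                           expected: Dict[str, type] = EXPECTED_FIELDS
                           ) -> Dict[str, List[str]]:
    """Report fields that are missing, unexpected, or of a different type."""
    missing = sorted(f for f in expected if f not in record)
    unexpected = sorted(f for f in record if f not in expected)
    retyped = sorted(
        f for f, expected_type in expected.items()
        if f in record and not isinstance(record[f], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "retyped": retyped}


def has_drift(report: Dict[str, List[str]]) -> bool:
    """True if any category in the report contains at least one field."""
    return any(report.values())


# A record whose structure changed upstream without notice.
report = detect_structure_drift({"order_id": 1, "amount": "9.99", "region": "EU"})
print(report)
```

A check like this only catches changes in structure. Semantic drift, where a field keeps its name and type but its meaning changes, would need value-level checks such as range or distribution comparisons.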
The Need for a New Paradigm
With the emergence of fast data and streaming analytics, the operational risk has shifted from data at rest to data in motion. However, enterprises overwhelmingly report that they struggle to manage their data flows. What is required is a new organizational discipline around performance management of data flows with the goal of ensuring that next-generation applications are fed quality data continuously.
“In today’s world of real-time analytics, data flows are the lifeblood of an enterprise,” said Girish Pancha, CEO, StreamSets. “The industry has long been fixated on managing data at rest and this myopia creates a real risk for enterprises as they attempt to harness big and fast data. It is imperative that we shift our mindset towards building continuous data operations capabilities that are in tune with the time-sensitive, dynamic nature of today’s data.”
For more information and complete survey results, please visit https://streamsets.com/big-data-global-survey/
Founded in 2014, StreamSets provides data ingest technology for the next generation of big data applications. Its enterprise-grade infrastructure accelerates data analysis and decision-making by bringing unprecedented transparency and event processing to data in motion. The company was founded by Girish Pancha, a long-time executive and former chief product officer of Informatica, and Arvind Prabhakar, an early employee and engineering leader at Cloudera. StreamSets is headquartered in San Francisco, and backed by top-tier Silicon Valley venture capital firms and angel investors, including Accel Partners, Battery Ventures, Ignition Partners and New Enterprise Associates (NEA). For more information, visit streamsets.com.