A key impediment to harnessing the potential of big data is creating a repeatable process for continuously flowing data from source to use. Data-driven applications and business processes require a continuous supply of complete and accurate data, but legacy systems are ill-equipped to ensure this, which creates serious problems in terms of efficiency, data quality and IT agility.
Systems in use today were designed for a world where data movement was slow, simple and predictable. They were solving the developer or "design time" problem of how to efficiently build data movement pipelines. But this is only half the battle. The operational or runtime problem has been neglected because, in the past, data movement held little risk – it was not as time-sensitive or as dynamic as it is today. In the old world you could "fire and forget" ETL pipelines, but big and fast data requires you to stay on top of the day-to-day operation of your dataflows.
It is time to address the operational management of data in motion in a focused and professional manner. Otherwise we expose ourselves to serious problems in the areas of quality, efficiency and agility, such as:
- Data that arrives late, incomplete or corrupted.
- Data scientists and analysts who spend inordinate amounts of time cleaning data.
- Unreliable results from business processes and a loss of trust in data quality.
A new focus on continual data movement is required, a discipline we call "dataflow performance management". It is a holistic approach to monitoring and controlling dataflow performance in alignment with business needs. The key principles of a dataflow performance management practice are that it is business-driven, holistic, agile and sustainable. Its activities span the dataflow lifecycle: discover and track topologies, benchmark performance, monitor data SLAs and remediate violations.
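To make the "monitor data SLAs and remediate violations" activity concrete, here is a minimal sketch of what a data SLA check might look like. The `DataSLA` structure, threshold names and values are all hypothetical, invented for illustration; a real DPMS would evaluate such rules continuously against live dataflow metrics.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DataSLA:
    """Hypothetical data SLA: thresholds a dataflow run must meet."""
    max_latency: timedelta    # grace period after the delivery deadline
    min_completeness: float   # fraction of expected records that must arrive

def check_sla(sla: DataSLA, expected: int, received: int,
              last_arrival: datetime, deadline: datetime) -> list[str]:
    """Return a list of SLA violations for one dataflow run (empty = healthy)."""
    violations = []
    completeness = received / expected if expected else 1.0
    if completeness < sla.min_completeness:
        violations.append(
            f"completeness {completeness:.0%} below {sla.min_completeness:.0%}")
    if last_arrival > deadline + sla.max_latency:
        violations.append("data arrived late")
    return violations
```

In practice, a check like this would feed an incident-response workflow: a non-empty violation list triggers alerting and remediation rather than silently passing stale or partial data downstream.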
We propose that enterprises create a data operations center (DOC) as a corporate center of excellence for dataflow performance management. A DOC provides a single point of accountability and control, helping the enterprise move from silos to a well-governed and comprehensive operation. DOC functions include dataflow lifecycle management, architecture and technology selection, data SLA management, incident response, data hygiene, dataflow microservices and security management.
The DOC will require a Dataflow Performance Management System (DPMS) as its control center, responsible for the discovery and cataloging of dataflows, orchestration of dataflows across shared infrastructure, implementation of data SLAs and dataflow release management. StreamSets Dataflow Performance Manager is the first example of a DPMS.
Enterprises that implement a DOC supported by a DPMS can expect numerous benefits, including focused accountability, performance transparency, improved data quality, enhanced agility and more.
Feel free to contact us to see how StreamSets can help you adopt a dataflow performance management mindset.