A New Definition of DataOps
This is a short post, but a relevant one. Ever since the term DataOps was coined (about five years ago), it hasn’t had a well-adopted, common definition.
Wikipedia’s definition is partially OK:
DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics.
But it is not great and simply builds on the old DevOps definition. Wikipedia goes on to say: “While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics. DataOps applies to the entire data lifecycle from data preparation to reporting and recognizes the interconnected nature of the data analytics team and information technology operations.”
These additional comments at least add some color, but we still need a definition that doesn’t require an extra paragraph to clarify it. Here are a few more:
The DataOps approach seeks to apply the principles of agile software development and DevOps (combining development and operations) to data analytics, to break down silos and promote efficient, streamlined data handling across many segments.
DataOps (data operations) is an Agile approach to designing, implementing and maintaining a distributed data architecture that will support a wide range of open source tools and frameworks in production.
The best definition from my perspective is:
DataOps is the practice of operationalizing data management and integration to rapidly meet new business demands for data, and to deliver continuously with confidence, in a world of fragmented data and ceaseless change.
This definition comes from StreamSets – one of the industry leaders (if not the leader) in delivering breakthrough DataOps technology. Note that this definition DOESN’T mention DevOps; comparing DataOps to DevOps is the lazy answer. Two key phrases in this definition are “deliver continuously” and the context in which that delivery happens: a “world of fragmented data and ceaseless change”.
One of the aspects that differentiates DataOps from any past data delivery/integration method is the ability to continue delivering data even when there are changes in the source applications or data stores (referred to as Data Drift). None of the other definitions capture this capability, which is why I feel they miss the point about what differentiates DataOps.
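To make the Data Drift point concrete, here is a minimal illustrative sketch (my own, not StreamSets code) of a drift-tolerant ingest step in Python. A rigid pipeline hard-codes the source schema and breaks when a field is renamed or added; a DataOps-style step keeps delivering. All field names and the `aliases` table are hypothetical.

```python
# Illustrative sketch of a drift-tolerant ingest step (hypothetical fields).
def ingest(record, target_fields, aliases=None):
    """Map an incoming record onto target_fields, tolerating Data Drift.

    - Renamed fields are resolved via the aliases table.
    - Missing fields become None instead of raising an error.
    - Unexpected new fields are preserved under 'extras' for later review.
    """
    aliases = aliases or {}
    out = {}
    for field in target_fields:
        # Try the expected name first, then any known aliases for it.
        source_key = next(
            (k for k in [field] + aliases.get(field, []) if k in record),
            None,
        )
        out[field] = record.get(source_key) if source_key else None
    # Keep unrecognized fields rather than silently dropping them.
    out["extras"] = {
        k: v for k, v in record.items()
        if k not in target_fields
        and all(k not in a for a in aliases.values())
    }
    return out

# The upstream app renamed "cust_id" to "customer_id" and added "channel";
# the pipeline still delivers instead of failing.
row = {"customer_id": 42, "amount": 9.99, "channel": "web"}
print(ingest(row, ["cust_id", "amount"], aliases={"cust_id": ["customer_id"]}))
```

The design choice here mirrors the definition above: the step absorbs fragmentation and change at the edges so that downstream consumers keep receiving data continuously.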
What do you think? Do you agree with my view that we should adopt StreamSets’ version as the official definition?