Before we dive too quickly into the complexities of data operations, please allow me to set the table. As in art and music, the most appealing part of the design process is always the ideation: the canvases where you get to explore new ideas that usher in new value and opportunity. Consider The Space Travelers Lullaby from jazz artist Kamasi Washington, made almost completely as an improvisational jam. Kamasi is often quoted as saying “This process can be gradual, even frustrating”, but in the world of music these heroics are rewarded with fanfare and critical acclaim.
As companies target and acquire new data sources and aim to deliver new form factors of data analytics, a fair amount of creative capital is spent designing these new patterns and solutions. Not surprisingly, rich tools have arisen that give visual control over analytics and more recently data movement. Data engineers now have highly intuitive tools to control the integration of data across their business. A savvy engineer can boost their internal brand immensely by bringing a forward-thinking, new capability into reality. However, how can they find the time when they are so often mired in the task of keeping existing ideas healthy, modern, and in production?
The truth is that they often spend a great deal of time on data operations, keeping data pipelines running and meeting the requirements of downstream projects. This is widely understood among data engineers, but it is not broadly acknowledged as an activity that earns praise and accolades. In fact, it might be considered part and parcel of the ideation. But when things go wrong, the pain that data engineers feel is real. Whether it’s a data science model that an analyst convinced an engineer to support or the daily dashboard supporting the sales team, when these capabilities break, friendship and an impeccable record of new ideas will only win you so many graces.
However, data operations doesn’t have to be a zero-sum game. You can build a modern data integration system that provides resilience and flexibility even at scale. Thinking smart about how you scale and react to the operational needs of your data pipelines and data processing can be cumbersome at first but will pay dividends as workloads and data projects amass.
In a static data world, upfront developer productivity matters more than operations. In a continuous data world, operations is everything.
So how do we build operations for a continuous data world? The answer is DataOps. DataOps is the mechanism for running a data-driven business in a world of constant change. The following components are key to delivering DataOps functionality.
Visibility Across Data Operations
Think of a single data pipeline as a tab in your internet browser. At a small scale, toggling between tabs is relatively manageable (though not ideal). Depending on the size of your organization, you likely have multiple tools (legacy and modern) for building data pipelines. Each of these tools offers a different degree of control and granularity for understanding both the progress and the health of your data pipelines. Data teams often spend a good deal of their time working around the limitations of the tools.
But what about when you have 30, 100, or 1,000 pipelines? Digging through 100 browser tabs with differing degrees of helpfulness will likely only produce delays and cause more and more pain as these connections grow. With DataOps, the goal is to have a comprehensive and living map of all of your data movement and data processing jobs. When pipelines or stages of a pipeline break down, the errors aggregate into a single point of visual remediation. That way data engineers can understand the issue and react in a manner that doesn’t harm downstream projects or deal a blow to the engineer’s brand. This map should be as useful and as reactive for today’s data workloads as it is for tomorrow’s.
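To make the idea of a single point of remediation concrete, here is a minimal Python sketch, not a description of any particular product. It assumes each pipeline tool can report a name, a state, and an error message; the `PipelineStatus` shape, the state vocabulary, and the pipeline names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineStatus:
    """One pipeline's state as reported by the tool that runs it (hypothetical shape)."""
    name: str
    tool: str
    state: str        # assumed vocabulary: "running", "failed", "stopped"
    error: str = ""


def summarize(statuses):
    """Collapse per-tool reports into one view: counts by state plus the failures."""
    counts = Counter(s.state for s in statuses)
    failures = sorted(
        (s for s in statuses if s.state == "failed"),
        key=lambda s: s.name,
    )
    return {"counts": dict(counts), "failures": failures}


# Illustrative reports gathered from two different (made-up) tools.
reports = [
    PipelineStatus("orders_ingest", "legacy_etl", "running"),
    PipelineStatus("clickstream", "streaming_tool", "failed", "schema mismatch at stage 3"),
    PipelineStatus("sales_dashboard_feed", "legacy_etl", "failed", "source timeout"),
]

view = summarize(reports)
for failure in view["failures"]:
    print(f"{failure.name} ({failure.tool}): {failure.error}")
```

The point of the sketch is the shape of the data rather than the code itself: once every tool reports into one common structure, the 100 browser tabs collapse into one list of what is broken and where.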
Monitoring Data Operations
A big component of visibility is active monitoring. Many tools aim to monitor the operations inside their own product portfolio, but DataOps demands a level of monitoring that persists outside of a single system or workload. Continuous monitoring requires that systems share metadata and operational information so users can see a broader scope of challenges. These monitoring capabilities also help companies define, refine, and deliver on downstream data SLAs. This ensures that analytics can be delivered with confidence and that the company can evolve the art of self-service, which further removes the data engineer from risk. In a DataOps scenario, monitoring should not only alert a team when something breaks but also actively watch for the precursors to potential problems. It should also be comprehensive, so that no data system or silo becomes an operational black hole.
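As one illustration of watching for precursors rather than only breakages, the Python sketch below flags a pipeline whose data freshness is drifting toward its SLA before the SLA is actually missed. The function name, the 80 percent warning ratio, and the three-sample window are assumptions chosen for the example, not recommended values.

```python
from statistics import mean


def check_freshness(lag_minutes, sla_minutes, warn_ratio=0.8, window=3):
    """Classify recent data lag against an SLA.

    lag_minutes: lag samples in minutes, oldest first.
    sla_minutes: the maximum lag the downstream SLA allows.
    warn_ratio:  fraction of the SLA at which to raise an early warning (assumed).
    window:      number of most recent samples to average (assumed).
    """
    recent = lag_minutes[-window:]
    if not recent:
        return "unknown"
    average = mean(recent)
    if average > sla_minutes:
        return "breach"
    if average >= warn_ratio * sla_minutes:
        return "warning"  # the precursor: still within the SLA, but trending toward it
    return "ok"


print(check_freshness([10, 12, 11], sla_minutes=60))
print(check_freshness([30, 45, 50, 55], sla_minutes=60))
print(check_freshness([70, 80, 90], sla_minutes=60))
```

Averaging over a short window is a deliberately simple choice here; it keeps one slow run from paging anyone while still catching a steady drift.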
Automation of Data Operations
As companies scale their data practice, automation and integration with automation tooling become paramount. No aspect of self-service can be delivered without some level of automation. When engineers are able to work on automation tasks, the impact is amplified across multiple workloads. In the DataOps ecosystem, users will want to automate everything they can without sacrificing reliability. Proprietary systems often offer poor extensibility, which makes them complicated to automate and to integrate with automation tools. This is why DataOps often focuses on open solutions or platforms that provide API extensibility and programmable integration with infrastructure and computing platforms.
What is the cost of not modernizing your data operations? Will it cost you your competitive stance? Today’s solution landscape is evolving at a feverish pace, and the toll this takes on data professionals is almost criminal. Your data strategy must not only be competitive with today’s requirements but also be forward-looking enough to account for your company’s transformation over the next five to ten years.
Managing and evaluating this fast-moving landscape requires tools and platforms that abstract away reliance on a single data platform or analytics solution. The logic for designing pipelines should be transferable, no matter the source or destination, so that change is driven by business requirements rather than by the limitations of the system. In DataOps, change is a given. DataOps systems embrace change by allowing their users to easily adopt and understand complex new platforms in order to deliver the business functionality they need to remain competitive. DataOps embraces cloud data platforms and helps companies build hybrid cloud solutions that may someday live natively in the cloud.
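One way to picture transferable pipeline logic is to keep the transformation separate from the systems on either end. The Python sketch below assumes two small, hypothetical interfaces, `Source` and `Destination`, with in-memory stand-ins; a real deployment would put a database, message queue, or cloud store behind the same interfaces.

```python
from typing import Callable, Iterable, Protocol

Record = dict


class Source(Protocol):
    def read(self) -> Iterable[Record]: ...


class Destination(Protocol):
    def write(self, records: Iterable[Record]) -> int: ...


class ListSource:
    """In-memory stand-in for any system that can yield records."""
    def __init__(self, rows):
        self.rows = list(rows)

    def read(self):
        return iter(self.rows)


class ListDestination:
    """In-memory stand-in for any system that can accept records."""
    def __init__(self):
        self.rows = []

    def write(self, records):
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before  # number of records written


def run_pipeline(source: Source, transform: Callable[[Record], Record],
                 destination: Destination) -> int:
    """Apply the same transform regardless of which systems sit on either end."""
    return destination.write(transform(record) for record in source.read())


def normalize_email(record):
    """The business logic: it knows nothing about where data comes from or goes."""
    return {**record, "email": record["email"].strip().lower()}


source = ListSource([
    {"id": 1, "email": " Ada@Example.COM "},
    {"id": 2, "email": "bob@example.com"},
])
destination = ListDestination()
written = run_pipeline(source, normalize_email, destination)
print(written, destination.rows)
```

Swapping the destination for a different platform would mean writing one new class with a `write` method; `normalize_email` and `run_pipeline` would not change.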
DataOps Is Real
I leave you by assuring you that DataOps is real and is providing measurable business value for modern organizations. Consider this example from my own company (StreamSets). We have a customer that uses our software to manage 194 pipelines with real-time visibility and proactive monitoring. This customer also executes 58 Apache Spark-based ETL jobs. They run all of this with one engineer. The work of that single individual generates numerous operational efficiencies while decreasing data duplication and redundant processes. While I don’t particularly advocate relying on a single person inside any company to dictate its data success, the ability to streamline data operations frees more of those talented individuals to do the sexy ideation, much like the melodic sonnets of artist Kamasi Washington.