How DataOps is Adding Value to Data Lakes
For those of you who joined us on June 6th, you dialed into a forward-thinking conversation between three industry experts. They waxed poetic about topics including big data, DataOps, governance, data science, and more in an effort to help modern data architects and analytics professionals better understand the emerging practices and themes around DataOps. If you missed the videocast, you can still access the full recording here.
Arguably, we could have spent all day with our panel. Follow-up questions and themes were just starting to evolve during the discussion. Due to time constraints and platform complexities, we did not get a chance to answer all of your questions, so we have written this short blog post to answer them. But first, a little recap on our topic. What is DataOps?
Modern BI requires the free movement of data to all corners of the business supported by automation and monitoring to ensure constant uptime and resiliency to change. DataOps — the application of DevOps practices to modern data management and integration — helps in this regard, and when applied to your projects can help reduce the cycle time of data analytics in close alignment with business objectives.
DataOps includes four key themes that are considered in the design, deployment, and continuous operations phases.
- Continuous Design – The Continuous Design function enables delivering data solutions on an ongoing basis rather than as discrete project events.
- Continuous Operations – DataOps encourages a holistic view where operators are able to see a living map of all data working together to serve the data needs of higher order business functions.
- Continuous Governance – Continuous Governance is responsible for establishing an Information Governance framework, a methodology, and standards for enterprise information management.
- Continuous Data – Continuous Data is responsible for publishing data in a unified hub and for maintaining data services, service levels, and performance for both externally sourced and internally generated information, whether it is consumed by users or by application systems.
Now that we have given you a little primer on our subject, let’s dive into your questions.
Audience Submitted Questions
How fast do you think DataOps will be adopted across the market?
DataOps has seen early traction, including a feature in the November 2018 Gartner Hype Cycle. While it is still a growing category, the focus on and development of best practices over the last year have shown notable progress. As many companies begin to formalize their DataOps teams and processes, new roles and standardized architectures will further reinforce the enterprise adoption of DataOps. This year StreamSets will host the first-ever DataOps Summit in the hopes of bringing together practitioners and partners in the space.
What are the biggest pain points that DataOps addresses?
DataOps can provide many optimizations and accelerate value for a variety of analytics initiatives, but when scoping how to implement and measure DataOps, you should aim to achieve these five key things.
- Self Service: What aspects of the data lifecycle can be automated? Can we give end-users more power to find, understand, ingest, wrangle, and transform data without sacrificing security and governance? For a growing number of companies, DataOps is enabling better self-service across all points in the data journey.
- Operational Visibility: DataOps gives you visibility that spans systems and data sets: visibility into your pipelines, your transformations, and what other users are doing with data. DataOps helps you discover data more effectively and prevents governance black holes.
- Scale: As data connections proliferate, ensuring that all components across the data ecosystem can scale to meet the demands of more consumption becomes a guessing game. DataOps offers a set of principles to ensure that data systems scale to meet current and future needs. DataOps also helps data science and analytics teams scale to deliver more projects under shorter timelines.
- End to End Governance: Gone are the days of implementing governance with good intentions. Between increasingly costly regulations such as GDPR and CCPA and the reality that data must pass through multiple systems with varying degrees of visibility and governance, users must build continuous data governance that is enforced both in trusted analytics zones and while data is in motion.
- Automated Change Detection and Management: Things change all the time: new data sets and schemas are introduced, existing ones are changed or used in new and unexpected ways, ML models drift, data quality levels fluctuate, ownership and stewardship responsibilities change. The goal of DataOps is to detect and manage changes so things do not break. That requires a high degree of monitoring, sophisticated change detection and automated change handling.
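To make the change-detection idea a little more concrete, here is a minimal sketch of one small piece of it: comparing two snapshots of a data set's schema and flagging the differences. It is an illustration only, not how any particular product does it. The function names (`detect_schema_changes`, `is_breaking`), the `{field_name: type_name}` snapshot format, and the rule that removed or retyped fields count as "breaking" are all assumptions made for this example.

```python
from typing import Dict, List


def detect_schema_changes(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, List]:
    """Compare two {field_name: type_name} snapshots of a schema.

    Hypothetical helper for illustration; the snapshot format is an assumption.
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    # Fields present in both snapshots whose declared type has changed.
    retyped = sorted(
        (name, previous[name], current[name])
        for name in set(previous) & set(current)
        if previous[name] != current[name]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


def is_breaking(changes: Dict[str, List]) -> bool:
    """Assumed policy: removed or retyped fields may break downstream consumers,
    while newly added fields are treated as safe."""
    return bool(changes["removed"] or changes["retyped"])


if __name__ == "__main__":
    yesterday = {"customer_id": "int", "email": "string", "signup_date": "date"}
    today = {"customer_id": "string", "email": "string", "country": "string"}

    changes = detect_schema_changes(yesterday, today)
    print(changes)
    if is_breaking(changes):
        print("Schema drift detected: route to automated handling or alert the data owner")
```

In practice a check like this would run continuously against incoming data, and the "handling" step might pause a pipeline, apply a mapping, or notify a steward, depending on the policy you define.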
Across the data lifecycle, where are bottlenecks likely to occur?
While every company may experience a bottleneck in a different team or system, there are three main bottlenecks we commonly see alleviated by a DataOps practice. The first is a bottleneck in finding, understanding, and trusting data. According to some analyst reports, data users spend between 60% and 80% of their time looking for the right data and often settle for what they can find rather than what they need. The main reason is the discrepancy between how users think of data (in business terms) and how data is documented (cryptic, missing, or misleading technical names assigned to data by developers). Once data is understood, the problem moves to ingesting the data into core systems. This may manifest as an inability to stand up the pipeline to retrieve the data or a lack of power to do the processing required before the data lands. The second bottleneck is usually associated with access. Analytics teams cannot easily access the data they need, which delays projects. There may be several reasons for this, including rigid policies, data security and governance concerns, or data landing in a format that is unusable to the analyst. The third is a bottleneck in getting value out of raw data. Across analytics teams we see differing requirements for data, and it is often hard to address every need with a single approach.
How can we define DataOps in terms of bringing value to the business rather than a technology stack? How do you define and measure that?
DataOps value can be measured across two vectors. Early in adopting DataOps approaches, companies will see optimizations and decreases in analytics dark zones within the first 9 months. Then, as data is cataloged and automatically tagged with business terms, and regulated data is automatically detected and governed, new data connections are made so that best practices and automation can be applied to accelerate business objectives directly. In terms of metrics, we suggest defining measures of interest to business leaders. For example, in addition to tracking how many new data sources are added per month and the number of dataflows, measure how they are helping to increase sales, improve customer experience, reduce operational costs, and lower enterprise risks.
Do we need business analysts/SMEs as part of the DataOps ecosystem?
Yes. The biggest challenge in making good decisions is the disconnect between the people who need to make a decision or need an answer and the people who can work with data. Analytics self-service closes this gap, but only if data is properly tagged with business terms so that analysts can understand it. This allows design-time self-service: analysts can find the right data, prep it, and analyze it, and then turn it over to data engineers to productize. Data stewards are also critical for setting data policies such as access control, data masking, and data quality, as well as curating and approving policy violations and changes to existing data assets.
How can validation be incorporated into the DataOps process?
Validation is a critical part of DataOps. Waterline provides a tag-based rule engine that can automatically apply validation rules to all data sets in the enterprise. For example, it can associate a data quality rule with a business term or tag and apply it to any field that carries that tag, regardless of the data source where the data is stored, the data format, or the field name. Of course, this works best if the fields can be automatically tagged with business terms using Waterline’s fingerprinting and AI-driven automated tagging.
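The general idea of tag-based validation can be sketched in a few lines of Python. To be clear, this is not Waterline's API or implementation. The `RULES` registry, the `validate` function, the tag names, and the simplified regular expressions are all hypothetical and exist only to show how a rule attached to a business tag can be reused across data sets whose field names differ.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical rule registry: business tag -> predicate over a string value.
# The patterns are deliberately simple and are assumptions for this example.
RULES: Dict[str, Callable[[str], bool]] = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "us_zip": lambda v: re.fullmatch(r"\d{5}(-\d{4})?", v) is not None,
}


def validate(rows: List[dict], field_tags: Dict[str, str]) -> List[Tuple[int, str, object]]:
    """Return (row_index, field_name, value) for every value that fails
    the rule associated with its field's business tag."""
    failures = []
    for index, row in enumerate(rows):
        for field, tag in field_tags.items():
            rule = RULES.get(tag)
            if rule is None or field not in row:
                continue  # no rule for this tag, or field absent in this row
            if not rule(str(row[field])):
                failures.append((index, field, row[field]))
    return failures


if __name__ == "__main__":
    # Two data sets with different field names that share the "email" tag.
    crm_rows = [{"cust_eml": "a@example.com"}, {"cust_eml": "not-an-email"}]
    crm_tags = {"cust_eml": "email"}

    order_rows = [{"EMAIL_ADDR": "b@example.org", "zip": "9410"}]
    order_tags = {"EMAIL_ADDR": "email", "zip": "us_zip"}

    print(validate(crm_rows, crm_tags))
    print(validate(order_rows, order_tags))
```

The design point is that the rule is keyed on the tag, not on the column name or source system, so adding a new data set only requires tagging its fields.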
Who is Make Big Data Work?
We are a consortium of technology-enabled platforms that help bring fast and tangible value to big data projects. We cover critical aspects of ingestion, operations, discovery, cataloging, and data wrangling. Together we help accelerate the movement through the data value chain while unifying functionality around end-to-end visibility and governance. Learn a little more about the players in Make Big Data Work.
StreamSets [Ingest and Operate]:
Gone are the days of loading operational application data into a warehouse overnight. Decision makers require access to a variety of data types coming from a range of different sources. Businesses are also demanding lower-latency access to data in order to make decisions more quickly. Hence the need for DataOps, which brings DevOps disciplines and practices to data ingest.
Waterline Data [Find and Catalog]:
As enterprise organizations store more data across more platforms and add more users, they are challenged to easily find, govern, and secure data across their data estate. Organizations must be able to quickly access trusted data to ensure that they are making better decisions faster and with a higher level of confidence. Waterline Data has addressed these issues by creating an AI-driven data catalog that automates the discovery, classification, and governance of data at petabyte scale without compromising security.
Trifacta [Assess and Transform]:
Trifacta is the data wrangling market leader. Essentially, Trifacta allows business users, the people who know the data best, to prepare data for analysis in an intuitive, Excel-like interface with which they are already familiar. This replaces complex, outdated IT-led processes, accelerates time-to-analysis, and leads to the discovery of new patterns and unexpected opportunities.
Do you want to learn more from these great thinkers? You can catch them live at DataOps Summit, September 4-5 in San Francisco. The team will be getting back together to answer your questions!
Don’t miss out. Sign up today!