The DataOps Blog

Where Change Is Welcome

The Hidden Complexity of Data Operations

By March 11, 2020

Akin to art and music, the most appealing part of the design process is always the ideation, the canvases where you get to explore new ideas that usher in new value and opportunity. Consider The Space Travelers Lullaby from jazz…

A New Definition of DataOps

By July 18, 2019

This is a short post, but a relevant one. Ever since DataOps got started (about 5 years ago), it hasn’t had a common, well-adopted definition. Wikipedia is partially OK at: DataOps is an automated, process-oriented methodology, used by analytic and data teams, to…

Embrace Diversity in Your Data Architecture

By June 9, 2017

Many Roads Lead to Rome

Over the last ten years, the data management landscape has changed dramatically — on that, I think we can all agree. The rise of big data and the new data management ecosystem has created an abundance of new patterns and tools, each of which is more specialized than the last. With each new iteration, engineers and architects face pressure from all sides to simplify and consolidate.

But counter-intuitively, the best data architects embrace infrastructure diversity rather than fight it. The reality is that all of these tools and patterns have important uses in enterprise data architecture, and that today Kafka is no more the cure to all that ails than MapReduce was five years ago. The most sophisticated enterprises enable each business unit to use best-of-breed technology to succeed while facilitating seamless integration between them, creating agility and avoiding the chaos that often arises across legacy environments.

Making Sense of Stream Processing

By April 19, 2017

There has been an explosion of innovation in open source stream processing over the past few years. Frameworks such as Apache Spark and Apache Storm give developers stream abstractions on which they can develop applications; Apache Beam provides an API abstraction, enabling developers to write code independent of the underlying framework, while tools such as Apache NiFi and StreamSets Data Collector provide a user interface abstraction, allowing data engineers to define data flows from high-level building blocks with little or no coding.

In this article, I’ll propose a framework for organizing stream processing projects, and briefly describe each area. I’ll be focusing on organizing the projects into a conceptual model; there are many articles that compare the streaming frameworks for real-world applications – I list a few at the end.

Data in Motion Evolution: Where We’ve Been…Where We Need to Go

By January 17, 2017

Today we hear a lot about streaming data, fast data, and data in motion. But the truth is that we have always needed ways to move our data. Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to now, we have continued to adapt and create new data movement methods, even as the characteristics of the data and data processing architectures have dramatically changed.

Exerting firm control over data in motion is a critical competency which has become core to modern data operations. Based on more than 20 years in enterprise data, here is my take on the past, present and future of data in motion.

The Challenge of Fetching Data for Apache Spot (incubating)

By October 19, 2016

Reposted from the Cloudera Vision blog.

What do Sony, Target and the Democratic Party have in common?

Besides being well-respected brands, they’ve all been subject to some very public and embarrassing hacks over the past 24 months. Because cybercrime is no longer driven by angst-ridden teenagers but rather by professional criminal organizations and state-sponsored hacker groups, the halcyon days of looking for threat signatures are well behind us.

Creating a Post-Lambda World with Apache Kudu

By September 23, 2016

Apache Kudu and Open Source StreamSets Data Collector Simplify Batch and Real-Time Processing

As originally posted on the Cloudera VISION Blog.

At StreamSets, we come across dataflow challenges for a variety of applications. Our product, StreamSets Data Collector, is an open-source, any-to-any dataflow system that ensures that all your data is safely delivered to the various systems of your choice. At its core is the ability to handle data drift, which allows these dataflow pipelines to evolve with your changing data landscape without incurring redesign costs.

This position at the front of the data pipeline has given us visibility into various use cases, and we have found that many applications rely on patched-together architectures to achieve their objective.
