Akin to art and music, the most appealing part of the design process is always the ideation, the canvases where you get to explore new ideas that usher in new value and opportunity. Consider The Space Travelers Lullaby from jazz…
This is a short post, but a relevant one. Ever since DataOps got started (about 5 years ago), it hasn’t had a well-adopted, common definition. Wikipedia’s is partially OK: DataOps is an automated, process-oriented methodology, used by analytic and data teams, to…
StreamSets is proud to be hosting the first annual DataOps Summit in San Francisco, California at the Hilton San Francisco Financial District on September 3rd-5th. The summit will feature a full day of data operations training and two days of…
StreamSets is excited to announce the immediate availability of StreamSets for Snowflake, the first DataOps platform for Snowflake. Now users can extend their DataOps environments to the popular Snowflake service. StreamSets makes copying data from databases, streams, and event processing…
As fall approaches and there's a chill in the morning air, it inevitably comes time for the annual Cloudera Data Impact awards. We are thrilled to have three finalists in the hunt this year: Vodafone, one of the world's largest…
Over the last ten years, the data management landscape has changed dramatically — on that, I think we can all agree. The rise of big data and the new data management ecosystem has created an abundance of new patterns and tools, each of which is more specialized than the last. With each new iteration, engineers and architects face pressure from all sides to simplify and consolidate.
But counter-intuitively, the best data architects embrace infrastructure diversity rather than fight it. The reality is that all of these tools and patterns have important uses in enterprise data architecture, and that today Kafka is no more the cure for all that ails than MapReduce was five years ago. The most sophisticated enterprises enable each business unit to use best-of-breed technology to succeed while facilitating seamless integration between them, creating agility and avoiding the chaos that often arises across legacy environments.
There has been an explosion of innovation in open source stream processing over the past few years. Frameworks such as Apache Spark and Apache Storm give developers stream abstractions on which they can develop applications; Apache Beam provides an API abstraction, enabling developers to write code independent of the underlying framework, while tools such as Apache NiFi and StreamSets Data Collector provide a user interface abstraction, allowing data engineers to define data flows from high-level building blocks with little or no coding.
In this article, I’ll propose a framework for organizing stream processing projects, and briefly describe each area. I’ll be focusing on organizing the projects into a conceptual model; there are many articles that compare the streaming frameworks for real-world applications – I list a few at the end.
Today we hear a lot about streaming data, fast data, and data in motion. But the truth is that we have always needed ways to move our data. Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to now, we have continued to adapt and create new data movement methods, even as the characteristics of the data and data processing architectures have dramatically changed.
Exerting firm control over data in motion is a critical competency that has become core to modern data operations. Based on more than 20 years in enterprise data, here is my take on the past, present and future of data in motion.
What do Sony, Target and the Democratic Party have in common?
Besides being well-respected brands, they’ve all been subject to some very public and embarrassing hacks over the past 24 months. Because cybercrime is no longer driven by angst-ridden teenagers but rather by professional criminal organizations and state-sponsored hacker groups, the halcyon days of looking for threat signatures are well behind us.
Apache Kudu and Open Source StreamSets Data Collector Simplify Batch and Real-Time Processing
As originally posted on the Cloudera VISION Blog.
At StreamSets, we come across dataflow challenges for a variety of applications. Our product, StreamSets Data Collector, is an open-source, any-to-any dataflow system that ensures that all your data is safely delivered to the various systems of your choice. At its core is the ability to handle data drift, which allows these dataflow pipelines to evolve with your changing data landscape without incurring redesign costs.
This position at the front of the data pipeline has given us visibility into various use cases, and we have found that many applications rely on patched-together architectures to achieve their objective.