There has been an explosion of innovation in open source stream processing over the past few years. Frameworks such as Apache Spark and Apache Storm give developers stream abstractions on which they can develop applications; Apache Beam provides an API abstraction, enabling developers to write code independent of the underlying framework, while tools such as Apache NiFi and StreamSets Data Collector provide a user interface abstraction, allowing data engineers to define data flows from high-level building blocks with little or no coding.
In this article, I’ll propose a framework for organizing stream processing projects, and briefly describe each area. I’ll be focusing on organizing the projects into a conceptual model; there are many articles that compare the streaming frameworks for real-world applications – I list a few at the end.
Today we hear a lot about streaming data, fast data, and data in motion. But the truth is that we have always needed ways to move our data. Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to now, we have continued to adapt and create new data movement methods, even as the characteristics of the data and data processing architectures have dramatically changed.
Exerting firm control over data in motion is a critical competency which has become core to modern data operations. Based on more than 20 years in enterprise data, here is my take on the past, present and future of data in motion.
Girish Pancha | Data in Motion Evolution: Where We’ve Been…Where We Need to Go
What do Sony, Target and the Democratic Party have in common?
Besides being well-respected brands, they’ve all been subject to some very public and embarrassing hacks over the past 24 months. Because cybercrime is no longer driven by angst-ridden teenagers but by professional criminal organizations and state-sponsored hacker groups, the halcyon days of looking for threat signatures are well behind us.
Rick Bilodeau | The Challenge of Fetching Data for Apache Spot (incubating)
Apache Kudu and Open Source StreamSets Data Collector Simplify Batch and Real-Time Processing
As originally posted on the Cloudera VISION Blog.
At StreamSets, we come across dataflow challenges for a variety of applications. Our product, StreamSets Data Collector, is an open-source any-to-any dataflow system that ensures that all your data is safely delivered to the various systems of your choice. At its core is the ability to handle data drift, which allows these dataflow pipelines to evolve with your changing data landscape without incurring redesign costs.
This position at the front of the data pipeline has given us visibility into various use cases, and we have found that many applications rely on patched-together architectures to achieve their objective.
Arvind Prabhakar | Creating a Post-Lambda World with Apache Kudu
Today I am delighted to announce our new product, StreamSets Dataflow Performance Manager, or DPM, the industry’s first solution for managing operations of a company’s end-to-end dataflows within a single pane of glass. The result of a year’s worth of innovative engineering and collaboration with key customers, DPM will be generally available on or before September 27, in time for Strata. We invite you to come by our booth (#451) for a live demonstration.
DPM is a natural follow-on to our first product, StreamSets Data Collector, which is open source software for building and deploying any-to-any dataflow pipelines. That product has enjoyed a great deal of success in its first year in market, with an accelerating number of weekly downloads, which now total in the tens of thousands across hundreds of enterprises, and numerous production use cases in Fortune 500 companies across a variety of industries.
Girish Pancha | Introducing StreamSets DPM – Operational Control of Your Data in Motion
Last week we announced the results of a survey of over 300 enterprise data professionals conducted by Dimensional Research and sponsored by StreamSets. We were trying to understand the state of play in how enterprises manage their big data flows. What we discovered is an alarming issue: companies are struggling to detect bad data and keep it out of their stores.
There is a bad data problem within big data
When we asked data pros about their challenges with big data flows, the most-cited issue was ensuring the quality of the data in terms of accuracy, completeness and consistency, named by more than two-thirds of respondents. Security and operations were also listed by more than half. The fact that quality was rated as a more common challenge than even security and compliance is quite telling, as you can usually count on security to be voted the #1 challenge for most IT domains.
Rick Bilodeau | Survey Shows Enterprises Struggling with Bad Data
I am always eager to learn about new architectures and best big data practices. Recently I came across a paper from Trifacta discussing the role of data preparation and it got me thinking about the complementary nature of data ingestion and data preparation.
Data preparation, more colorfully known as data wrangling, is the activity performed by data-driven professionals, such as data or business analysts, to explore, clean, transform and blend data of all varieties to make it trustworthy for analysis or predictive modeling. It is a form of data manipulation that has traditionally been done in Excel or, for more technically advanced end users, in languages such as R, SAS or Python. But with the rise of enormous and dynamic data sets in Hadoop, these approaches are no longer feasible. Trifacta took the lead in creating a self-service web-based solution that enables business users to access and manipulate data stored in Hadoop without needing programming skills.
Rick Bilodeau | The Complementary Nature of Data Ingestion and Data Preparation
The Forward-Looking Threat Research team at Trend Micro were early adopters of StreamSets Data Collector. They use StreamSets to ingest data from a wide variety of sources to create a Threat Assessment Dashboard in Elasticsearch. In this interview, we talk with members of their team about how they evaluated StreamSets and implemented it in their production environment in a short period of time.
Kirit Basu | How Trend Micro Uses StreamSets – An Interview with the Threat Research Team