StreamSets Data Integration Blog

Where change is welcome.

Getting Started with Cloudera’s Cybersecurity Solution (feat. StreamSets, Arcadia Data and Centrify)

By October 23, 2017

This post was originally published on the Cloudera VISION blog by Sam Heywood. StreamSets configurations and images of Apache Spot Open Data Model ingest pipelines can be found here on GitHub.

A quick conversation with most Chief Information Security Officers (CISOs) reveals that they understand the need to modernize their security architecture, and that the correct answer is to adopt a machine learning and analytics platform as a fundamental and durable part of their data strategy. However, many CISOs fear that deploying an initial use case will be daunting. Cloudera has partnered with Arcadia Data and StreamSets to make it easier than ever for CISOs to take the first step and deploy basic use cases leveraging data sources common to many environments.

More Than One Third of the Fortune 100 Have Downloaded StreamSets Data Collector

By November 10, 2016

It’s been a little over a year (9/24/15) since we launched StreamSets Data Collector as an open source project. For those of you unfamiliar with the product, it’s any-to-any big data ingestion software through which you can build and place into production complex batch and streaming pipelines using built-in processors for all sorts of data transformations. 

We’re thrilled to announce that as of last month StreamSets Data Collector had been downloaded by over ⅓ of the Fortune 100! That’s several dozen of the largest companies in the U.S. And downloads of this award-winning software have been accelerating, with over 500% growth in the quarter ending in October versus the previous quarter.

The Challenge of Fetching Data for Apache Spot (incubating)

By October 19, 2016

Reposted from the Cloudera Vision blog.

What do Sony, Target and the Democratic Party have in common?

Besides being well-respected brands, they’ve all been subject to some very public and embarrassing hacks over the past 24 months. Because cybercrime is no longer driven by angst-ridden teenagers but rather by professional criminal organizations and state-sponsored hacker groups, the halcyon days of looking for threat signatures are well behind us.

Struggling with Bad Data? What to Do From the Enterprise

By June 28, 2016

Last week we announced the results of a survey of over 300 enterprise data professionals conducted by Dimensional Research and sponsored by StreamSets. We were trying to understand the market’s state of play for managing big data flows. What we discovered was an alarming issue: companies are struggling to detect bad data and keep it out of their stores.

There is a bad data problem within big data.

Bad Data: What to Do?

When we asked data pros about their challenges with big data flows, the most-cited issue was ensuring the quality of the data in terms of accuracy, completeness and consistency, getting votes from over ⅔ of respondents.  Security and operations were also listed by more than half.  The fact that quality was rated as a more common challenge than even security and compliance is quite telling, as you usually can count on security to be voted the #1 challenge for most IT domains.  

The Complementary Nature of Data Ingestion and Data Preparation

By May 25, 2016

I am always eager to learn about new architectures and best big data practices. Recently I came across a paper from Trifacta discussing the role of data preparation and it got me thinking about the complementary nature of data ingestion and data preparation.

Data preparation, more colorfully known as data wrangling, is the activity performed by data-driven professionals, such as data or business analysts, to explore, clean, transform and blend data of all varieties to make it trustworthy for analysis or predictive modeling. It is a form of data manipulation that has traditionally been done in Excel or, for more technically advanced end users, in languages such as R, SAS or Python. But with the rise of enormous and dynamic data sets in Hadoop, these approaches are no longer feasible. Trifacta took the lead in creating a self-service web-based solution that enables business users to access and manipulate data stored in Hadoop without needing programming skills.

Binlog Processing Using Maxwell, Kafka & StreamSets

By March 2, 2016

This is a nice example of Kafka enablement using Maxwell (a MySQL-to-Kafka binlog processor) and StreamSets Data Collector from the folks at B23. It includes a schema change listener for handling data drift. Enjoy! Innovate on Your Data - Maxwell…
