Dataflow Performance Blog

Announcing StreamSets Data Collector version 2.0

Last October, we publicly announced StreamSets Data Collector version 1.0. Over the last 12 months we have seen an awesome (a word we don't use lightly) amount of adoption of our first product – from individual developers simplifying their day-to-day work, to small startups building the next big thing, to the very largest companies building global-scale enterprise architectures with StreamSets Data Collector at their core.

Kirit Basu

MySQL Database Change Capture with MapR Streams, Apache Drill, and StreamSets

Today's post is from Raphaël Velfre, a senior data engineer at MapR. Raphaël has spent some time working with StreamSets Data Collector (SDC) and MapR's Converged Data Platform. In this blog entry, originally published on the MapR Converge blog, Raphaël explains how to use SDC to extract data from MySQL and write it to MapR Streams, and then move data from MapR Streams to MapR-FS via SDC, where it can be queried with Apache Drill.

Pat Patterson

Creating a Post-Lambda World with Apache Kudu

Apache Kudu and Open Source StreamSets Data Collector Simplify Batch and Real-Time Processing

As originally posted on the Cloudera VISION Blog.

At StreamSets, we come across dataflow challenges for a variety of applications. Our product, StreamSets Data Collector, is an open-source, any-to-any dataflow system that ensures that all your data is safely delivered to the various systems of your choice. At its core is the ability to handle data drift, which allows these dataflow pipelines to evolve with your changing data landscape without incurring redesign costs.

This position at the front of the data pipeline has given us visibility into various use cases, and we have found that many applications rely on patched-together architectures to achieve their objective.

Arvind Prabhakar

Introducing StreamSets DPM – Operational Control of Your Data in Motion

Friends of StreamSets,

Today I am delighted to announce our new product, StreamSets Dataflow Performance Manager, or DPM, the industry’s first solution for managing operations of a company’s end-to-end dataflows within a single pane of glass. The result of a year’s worth of innovative engineering and collaboration with key customers, DPM will be generally available on or before September 27, in time for Strata. We invite you to come by our booth (#451) for a live demonstration.

DPM is a natural follow-on to our first product, StreamSets Data Collector, which is open source software for building and deploying any-to-any dataflow pipelines. That product has enjoyed a great deal of success in its first year in market, with an accelerating number of weekly downloads, which now total in the tens of thousands across hundreds of enterprises, and numerous production use cases in Fortune 500 companies across a variety of industries.

Girish Pancha

StreamSets Data Collector in Action at IBM Ireland

After Guglielmo Iozzia, a big data infrastructure engineer on the Ethical Hacking Team at IBM Ireland, recently spoke about building data pipelines using StreamSets Data Collector at Hadoop User Group Ireland, I invited him to contribute a blog post outlining how he discovered StreamSets Data Collector (SDC) and the kinds of problems he and his team are solving with it. Read on to discover how SDC is saving time and making Guglielmo and his team's lives a whole lot easier…

Pat Patterson

Ingesting Drifting Data into Hive and Impala

Importing data into Apache Hive is one of the most common use cases in big data ingest, but gets tricky when data sources 'drift', changing the schema or semantics of incoming data. Introduced in StreamSets Data Collector (SDC) 1.5.0.0, the Hive Drift Solution monitors the structure of incoming data, detecting schema drift and updating the Hive Metastore accordingly, allowing data to keep flowing. In this blog entry, I'll give you an overview of the Hive Drift Solution and explain how you can try it out for yourself, today.

Pat Patterson

Announcing Data Collector ver 1.6.0.0

It's been a busy summer here at StreamSets: we've been enabling some exciting use cases for our customers, partners, and the community of open-source users all over the world. We are excited to announce the newest version of the StreamSets Data Collector.

This version has a host of new features and over 100 bug fixes.

Download it now.

Kirit Basu

Whole File Transfer with StreamSets Data Collector

A key aspect of StreamSets Data Collector (SDC) is its ability to parse incoming data, giving you unprecedented flexibility in processing data flows. Sometimes, though, you don't need to see 'inside' files – you just need to move them from a source to one or more destinations. Breaking news – the upcoming StreamSets Data Collector 1.6.0.0 release will include a new 'Whole File Transfer' feature to do just that. If you're keen to try it out right now (on test data, of course!), you can download a nightly build of SDC and give it a whirl. In this blog entry I'll explain everything you need to know to be able to get started with Whole File Transfer, today!

Pat Patterson

Dynamic Outlier Detection with StreamSets and Cassandra

This blog post concludes a short series building up an IoT sensor testbed with StreamSets Data Collector (SDC), a Raspberry Pi and Apache Cassandra. Previously, I covered:

To wrap up, I'll show you how I retrieved statistics from Cassandra, fed them into SDC, and was able to filter out 'outlier' values.

Pat Patterson

Standard Deviations on Cassandra – Rolling Your Own Aggregate Function

If you've been following the StreamSets blog over the past few weeks, you'll know that I've been building an Internet of Things testbed on the Raspberry Pi. First, I got StreamSets Data Collector (SDC) running on the Pi, ingesting sensor data and sending it to Apache Cassandra, and then I wrote a Python app to display SDC metrics on the PiTFT screen. In this blog entry I'll take the next step, querying Cassandra for statistics on my sensor data.

Pat Patterson