
StreamSets Data Integration Blog

Where change is welcome.

Contributing to the StreamSets Data Collector Community

By October 25, 2016

As you likely already know, StreamSets Data Collector (SDC) is open source, made available via the Apache 2.0 license. The entire source code for the product is hosted in a GitHub project and the binaries are always available for download.

As well as being part of our engineering culture, open source gives us a number of business advantages. Prospective users can freely download, install, evaluate and even put SDC into production, customers have access to the source code without a costly escrow process and, perhaps most importantly, our users can contribute fixes and enhancements to improve the product for the benefit of the whole community. In this post, I’d like to acknowledge some of those contributions, and invite you to contribute, too.

Creating a Custom Processor for StreamSets Data Collector

By October 19, 2016

Back in March, I wrote a tutorial showing how to create a custom destination for StreamSets Data Collector (SDC). Since then I’ve been looking for a good sample use case for a custom processor. It’s tricky to find one, since the set of out-of-the-box processors is pretty extensive now! In particular, the scripting processors make it easy to operate on records with Groovy, JavaScript or Jython, without needing to break out the Java compiler.

Looking at the Whole File data format, introduced last month in SDC 1.6.0.0, inspired me… Our latest tutorial, Creating a Custom StreamSets Processor, explains how to extract metadata tags from image files as they are ingested, adding them to records as fields.

Visualizing NetFlow Data with StreamSets Data Collector, Kudu, Impala and D3

By October 13, 2016

Sandish Kumar, a Solutions Engineer at phData, builds and manages solutions for phData customers. In this article, reposted from the phData blog, he explains how to generate simulated NetFlow data, read it into StreamSets Data Collector via the UDP origin, then buffer it in Apache Kafka before sending it to Apache Kudu. A true big data enthusiast, Sandish spends his spare time working to understand Kudu internals.

MySQL CDC with MapR Streams, Apache Drill, and StreamSets

By September 26, 2016

Today’s post is from Raphaël Velfre, a senior data engineer at MapR. Raphaël has spent some time working with StreamSets Data Collector (SDC) and MapR’s Converged Data Platform. In this blog entry, Raphaël explains how to use SDC for MySQL CDC to extract data from MySQL and write it to MapR Streams, and then move data from MapR Streams to MapR-FS via SDC, where it can be queried with Apache Drill.

Building a Data Science Pipeline at IBM Ireland with StreamSets

By September 12, 2016

Guglielmo Iozzia, a big data infrastructure engineer on the Ethical Hacking Team at IBM Ireland, recently spoke at Hadoop User Group Ireland about building a data science pipeline using StreamSets Data Collector Engine. Afterwards, I invited him to contribute a blog post outlining how he discovered StreamSets Data Collector (SDC) Engine and the kinds of problems he and his team are solving with it. Read on to discover how SDC is saving time and making Guglielmo and his team’s lives a whole lot easier.

Ingesting Drifting Data into Hive and Impala

By September 8, 2016

Importing data into Apache Hive is one of the most common use cases in big data ingest, but gets tricky when data sources ‘drift’, changing the schema or semantics of incoming data. Introduced in StreamSets Data Collector (SDC) 1.5.0.0, the Hive Drift Solution monitors the structure of incoming data, detecting schema drift and updating the Hive Metastore accordingly, allowing data to keep flowing. In this blog entry, I’ll give you an overview of the Hive Drift Solution and explain how you can try it out for yourself, today.

Whole File Transfer Data Ingestion

By August 22, 2016

A key aspect of StreamSets Data Collector Engine is its ability to parse incoming data, giving you unprecedented flexibility in processing data flows. Sometimes, though, you don’t need to see ‘inside’ files – you just need to move them from a source to one or more destinations. The StreamSets Platform will include a new ‘Whole File Transfer’ feature for data ingestion. In this blog entry, I’ll explain everything you need to know to get started with Whole File Transfer, today!

Dynamic Outlier Detection with StreamSets and Cassandra

By August 19, 2016

This blog post concludes a short series building up an IoT sensor testbed with StreamSets Data Collector (SDC), a Raspberry Pi and Apache Cassandra. Previously, I covered:

To wrap up, I’ll show you how I retrieved statistics from Cassandra, fed them into SDC, and was able to filter out ‘outlier’ values.
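
As a rough illustration of the idea in this teaser, here is a minimal Python sketch of filtering out values that lie far from the mean. It is not the pipeline from the post: the function name `filter_outliers`, the threshold of three standard deviations, and the sample readings are all assumptions made for the example, and the statistics are computed locally rather than retrieved from Cassandra.

```python
from statistics import mean, pstdev


def filter_outliers(readings, mean_value, std_dev, k=3.0):
    """Keep readings within k standard deviations of the mean.

    Hypothetical helper: k=3.0 is an assumed threshold, not one
    taken from the original post.
    """
    if std_dev == 0:
        # With no spread, only values equal to the mean are kept.
        return [r for r in readings if r == mean_value]
    return [r for r in readings if abs(r - mean_value) <= k * std_dev]


# Stand-in for statistics that would come from the data store.
history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]
mu = mean(history)
sigma = pstdev(history)

# Made-up incoming sensor readings, two of them far from the mean.
incoming = [20.1, 35.0, 19.7, 4.2]
kept = filter_outliers(incoming, mu, sigma)
print(kept)
```

Passing the mean and standard deviation in as arguments, rather than computing them inside the filter, mirrors the split described above: the statistics come from one place and are applied to new records elsewhere.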

Standard Deviations on Cassandra – Rolling Your Own Aggregate Function

By July 28, 2016

If you’ve been following the StreamSets blog over the past few weeks, you’ll know that I’ve been building an Internet of Things testbed on the Raspberry Pi. First, I got StreamSets Data Collector (SDC) running on the Pi, ingesting sensor data and sending it to Apache Cassandra, and then I wrote a Python app to display SDC metrics on the PiTFT screen. In this blog entry I’ll take the next step, querying Cassandra for statistics on my sensor data.

Chat with the StreamSets Team via Slack!

By July 15, 2016

Since its inception last October, the sdc-user Google group has been the primary medium for you, our community, to communicate with us, the StreamSets Team. We’ve seen over a thousand messages, and participated in discussions around installation, configuration, bugs, feature requests, and every other aspect of StreamSets Data Collector. While sdc-user works well for many interactions, it does lack the immediacy of chatting in real-time. So, this week, we added a new communication option: the ‘StreamSetters’ Slack team.
