Skip to content

StreamSets Data Integration Blog

Where change is welcome.

Whole File Transfer Data Ingestion

By August 22, 2016

A key aspect of StreamSets Data Collector Engine is its ability to parse incoming data, giving you unprecedented flexibility in processing data flows. Sometimes, though, you don’t need to see ‘inside’ files – you just need to move them from a source to one or more destinations. The StreamSets Platform will include a new ‘Whole File Transfer’ feature for data ingestion In this blog entry I’ll explain everything you need to know to be able to get started with Whole File Transfer, today!

Dynamic Outlier Detection with StreamSets and Cassandra

By August 19, 2016

This blog post concludes a short series building up a IoT sensor testbed with StreamSets Data Collector (SDC), a Raspberry Pi and Apache Cassandra. Previously, I covered:

To wrap up, I’ll show you how I retrieved statistics from Cassandra, fed them into SDC, and was able to filter out ‘outlier’ values.

Standard Deviations on Cassandra – Rolling Your Own Aggregate Function

By July 28, 2016

Cassandra logoIf you’ve been following the StreamSets blog over the past few weeks, you’ll know that I’ve been building an Internet of Things testbed on the Raspberry Pi. First, I got StreamSets Data Collector (SDC) running on the Pi, ingesting sensor data and sending it to Apache Cassandra, and then I wrote a Python app to display SDC metrics on the PiTFT screen. In this blog entry I’ll take the next step, querying Cassandra for statistics on my sensor data.

Chat with the StreamSets Team via Slack!

By July 15, 2016

SlackSince its inception last October, the sdc-user Google group has been the primary medium for you, our community, to communicate with us, the StreamSets Team. We’ve seen over a thousand messages, and participated in discussions around installation, configuration, bugs, feature requests, and every other aspect of StreamSets Data Collector. While sdc-user works well for many interactions, it does lack the immediacy of chatting in real-time. So, this week, we added a new communication option: the ‘StreamSetters’ Slack team.

Retrieving Metrics via the StreamSets Data Collector REST API

By July 8, 2016

PiTFT Displaying SDC MetricsLast week, I explained how I was able to run StreamSets Data Collector Engine on a Raspberry Pi 3, ingesting sensor data and writing it to Cassandra. With that working, I wanted to show pipeline metrics across data pipelines on Adafruit’s awesome PiTFT Plus 2.8″ screen. In this blog post, I’ll explain how I was able to write a Python app to retrieve pipeline metrics with StreamSets Data Collector REST API, showing them on the PiTFT Plus via pygame to better manage data pipelines.

Struggling with Bad Data? What to Do From the Enterprise

By June 28, 2016

Last week we announced the results of a survey of over 300 enterprise data professionals conducted by Dimensional Research and sponsored by StreamSets.  We were trying to understand the market’s state of play for managing their big data flows.  What we discovered was that there is an alarming issue at hand: companies are struggling to detect and keep bad data out of their stores.

There is a bad data problem within big data.

Bad Data: What to Do?

When we asked data pros about their challenges with big data flows, the most-cited issue was ensuring the quality of the data in terms of accuracy, completeness and consistency, getting votes from over ⅔ of respondents.  Security and operations were also listed by more than half.  The fact that quality was rated as a more common challenge than even security and compliance is quite telling, as you usually can count on security to be voted the #1 challenge for most IT domains.  

Ingesting Sensor Data on the Raspberry Pi with StreamSets Data Collector

By June 27, 2016

Raspberry Pi with SensorIn the unlikely event you’re not familiar with the Raspberry Pi, it’s an ARM-based computer about the same size as a deck of playing cards. The latest iteration, Raspberry Pi 3, has a 1.2GHz ARMv8 CPU, 1MB of memory, integrated Wi-Fi and Bluetooth, all for the same $35 price tag as the original Raspberry Pi released in 2012. Running a variety of operating systems, including Linux, it’s actually quite a capable little machine. How capable? Well, last week, I successfully ran StreamSets Data Collector (SDC) on one of mine (yes, I have several!), ingesting sensor data and sending them to Apache Cassandra for analysis.

Here’s how you can build your own Internet of Things testbed and ingest sensor data for around $50.

Visualize StreamSets Data Collector Metrics with Datadog

By June 20, 2016

Datadog LogoBack in January, Adam blogged about StreamSets Monitoring with Grafana, InfluxDB, and jmxtrans. While the Grafana/InfluxDB/jmxtrans open source stack works great, there’s quite a lot of setup and configuration to keep track of. Another tool that StreamSets customers are using to monitor their infrastructure is Datadog; in today’s blog post I’ll show you the basics of configuring Datadog with StreamSets Data Collector (SDC).

Back To Top