StreamSets Data Integration Blog

Where change is welcome.

Standard Deviations on Cassandra – Rolling Your Own Aggregate Function

By July 28, 2016

If you’ve been following the StreamSets blog over the past few weeks, you’ll know that I’ve been building an Internet of Things testbed on the Raspberry Pi. First, I got StreamSets Data Collector (SDC) running on the Pi, ingesting sensor data and sending it to Apache Cassandra, and then I wrote a Python app to display SDC metrics on the PiTFT screen. In this blog entry I’ll take the next step, querying Cassandra for statistics on my sensor data.

Chat with the StreamSets Team via Slack!

By July 15, 2016

Since its inception last October, the sdc-user Google group has been the primary medium for you, our community, to communicate with us, the StreamSets Team. We’ve seen over a thousand messages, and participated in discussions around installation, configuration, bugs, feature requests, and every other aspect of StreamSets Data Collector. While sdc-user works well for many interactions, it does lack the immediacy of chatting in real-time. So, this week, we added a new communication option: the ‘StreamSetters’ Slack team.

Struggling with Bad Data? What to Do From the Enterprise

By June 28, 2016

Last week we announced the results of a survey of over 300 enterprise data professionals conducted by Dimensional Research and sponsored by StreamSets. We were trying to understand the market’s state of play for managing big data flows. What we discovered was an alarming issue: companies are struggling to detect bad data and keep it out of their stores.

There is a bad data problem within big data.

Bad Data: What to Do?

When we asked data pros about their challenges with big data flows, the most-cited issue was ensuring the quality of the data in terms of accuracy, completeness and consistency, named by over ⅔ of respondents. Security and operations were also listed by more than half. The fact that quality was rated as a more common challenge than even security and compliance is quite telling, as you can usually count on security to be voted the #1 challenge for most IT domains.

Ingesting Sensor Data on the Raspberry Pi with StreamSets Data Collector

By June 27, 2016

In the unlikely event you’re not familiar with the Raspberry Pi, it’s an ARM-based computer about the same size as a deck of playing cards. The latest iteration, Raspberry Pi 3, has a 1.2GHz ARMv8 CPU, 1GB of memory, integrated Wi-Fi and Bluetooth, all for the same $35 price tag as the original Raspberry Pi released in 2012. Running a variety of operating systems, including Linux, it’s actually quite a capable little machine. How capable? Well, last week, I successfully ran StreamSets Data Collector (SDC) on one of mine (yes, I have several!), ingesting sensor data and sending it to Apache Cassandra for analysis.

Here’s how you can build your own Internet of Things testbed and ingest sensor data for around $50.

Visualize StreamSets Data Collector Metrics with Datadog

By June 20, 2016

Back in January, Adam blogged about StreamSets Monitoring with Grafana, InfluxDB, and jmxtrans. While the Grafana/InfluxDB/jmxtrans open source stack works great, there’s quite a lot of setup and configuration to keep track of. Another tool that StreamSets customers are using to monitor their infrastructure is Datadog; in today’s blog post I’ll show you the basics of configuring Datadog with StreamSets Data Collector (SDC).

Ingesting MQTT Traffic into Riak TS via RabbitMQ and StreamSets

By June 13, 2016

Riak KV is an open source, distributed, NoSQL key-value data store oriented towards high availability, fault tolerance and scalability. With its initial release in 2009, Riak KV is in use at companies such as AT&T, Comcast and GitHub. Last October, Basho, the vendor behind Riak KV, announced Riak TS. Riak TS, another distributed, NoSQL data store, is optimized for time series data generated by sources such as IoT sensors. The recent release of Riak TS 1.3 as an open source product under the Apache V2 license got me thinking – how would I get sensor data into Riak TS? There is a range of client libraries, which communicate with the data store via protocol buffers, but connected devices tend to use protocols such as MQTT, an extremely lightweight publish/subscribe messaging transport. StreamSets to the rescue!

May the 4th Be With You – Analyzing Star Wars Twitter Mentions in Minecraft

By May 4, 2016

A couple of weeks ago, as May the 4th approached, a lively Star Wars debate brewed at StreamSets:

  • “Do new school characters get as much play as old favorites like Darth Vader, Yoda and Han Solo?”
  • “Does the Dark Side of the Force dominate the Light?”
  • “Does Yoda prevail over Darth Vader?”

It occurred to us that, with the Twitter Streaming API and StreamSets Data Collector, we didn’t have to guess or debate. We built a data flow that ingested and analyzed tweets and then displayed them in… Minecraft!

Ingest Salesforce Data for Analysis Using StreamSets

By April 29, 2016

UPDATE – Salesforce origin and destination stages, as well as a destination for Salesforce Wave Analytics, were released in StreamSets Data Collector 2.2.0.0. Use the supported, shipping Salesforce stages rather than the unsupported code mentioned below!

As I’ve mentioned a couple of times, my previous gig was as a developer evangelist at Salesforce, with particular focus on integration. A few weeks ago, I wrote a custom destination allowing StreamSets Data Collector (SDC) to write data to Salesforce Wave Analytics; today, I’ll show you how to ingest data from Salesforce and write it to any destination supported by SDC.

New Tutorial: Creating a Custom StreamSets Destination

By March 23, 2016

One of the first things I hear after I explain the basics of StreamSets Data Collector is, “Cool, so can I ingest data from/send data to X?”, for varying values of X. The short answer is, “Yes, you can!”, while the longer answer involves checking the lists of origins (for ingesting data from X) and destinations (for writing data) included with the product, and writing custom code if X is not on the list.

“My X isn’t on the list! How do I get started writing that custom code?”, I hear you shout; well, I just wrote a detailed tutorial for creating your first custom StreamSets destination that explains all. Fire up your IDE, follow the steps, and you’ll build a sample destination that sends records to RequestBin, but could be adapted to send them pretty much anywhere.
