
StreamSets Data Integration Blog

Where change is welcome.

Struggling with Bad Data? What to Do From the Enterprise

By June 28, 2016

Last week we announced the results of a survey of over 300 enterprise data professionals conducted by Dimensional Research and sponsored by StreamSets.  We wanted to understand how the market is managing its big data flows.  What we discovered is an alarming issue: companies are struggling to detect bad data and keep it out of their stores.

There is a bad data problem within big data.

Bad Data: What to Do?

When we asked data pros about their challenges with big data flows, the most-cited issue was ensuring the quality of the data in terms of accuracy, completeness and consistency, getting votes from over ⅔ of respondents.  Security and operations were also listed by more than half.  The fact that quality was rated as a more common challenge than even security and compliance is quite telling, as you usually can count on security to be voted the #1 challenge for most IT domains.  

Ingesting Sensor Data on the Raspberry Pi with StreamSets Data Collector

By June 27, 2016

In the unlikely event you’re not familiar with the Raspberry Pi, it’s an ARM-based computer about the same size as a deck of playing cards. The latest iteration, Raspberry Pi 3, has a 1.2GHz ARMv8 CPU, 1GB of memory, integrated Wi-Fi and Bluetooth, all for the same $35 price tag as the original Raspberry Pi released in 2012. Running a variety of operating systems, including Linux, it’s actually quite a capable little machine. How capable? Well, last week, I successfully ran StreamSets Data Collector (SDC) on one of mine (yes, I have several!), ingesting sensor data and sending it to Apache Cassandra for analysis.

Here’s how you can build your own Internet of Things testbed and ingest sensor data for around $50.

Visualize StreamSets Data Collector Metrics with Datadog

By June 20, 2016

Back in January, Adam blogged about StreamSets Monitoring with Grafana, InfluxDB, and jmxtrans. While the Grafana/InfluxDB/jmxtrans open source stack works great, there’s quite a lot of setup and configuration to keep track of. Another tool that StreamSets customers are using to monitor their infrastructure is Datadog; in today’s blog post I’ll show you the basics of configuring Datadog with StreamSets Data Collector (SDC).

Ingesting MQTT Traffic into Riak TS via RabbitMQ and StreamSets

By June 13, 2016

Riak KV is an open source, distributed, NoSQL key-value data store oriented towards high availability, fault tolerance and scalability. With its initial release in 2009, Riak KV is in use at companies such as AT&T, Comcast and GitHub. Last October, Basho, the vendor behind Riak KV, announced Riak TS. Riak TS, another distributed, NoSQL data store, is optimized for time series data generated by sources such as IoT sensors. The recent release of Riak TS 1.3 as an open source product under the Apache V2 license got me thinking – how would I get sensor data into Riak TS? There is a range of client libraries, which communicate with the data store via protocol buffers, but connected devices tend to use protocols such as MQTT, an extremely lightweight publish/subscribe messaging transport. StreamSets to the rescue!

Analyzing Salesforce Data with StreamSets, Elasticsearch, and Kibana

By June 3, 2016

UPDATE – Salesforce origin and destination stages, as well as a destination for Salesforce Wave Analytics, were released in StreamSets Data Collector 2.2.0.0. Use the supported, shipping Salesforce stages rather than the unsupported code mentioned below!

After I published a proof-of-concept Salesforce Origin for StreamSets Data Collector (SDC), I noticed an article on the Elastic blog, Analyzing Salesforce Data with Logstash, Elasticsearch, and Kibana. In the blog entry, Elastic systems architect Russ Savage (now at Cask Data) explains the motivation for ingesting Salesforce data into Elasticsearch:

Working directly with sales and marketing operations, we outlined a number of challenges they had that might be solved with this solution. Those included:

  • Interactive time-series snapshot analysis across a number of dimensions. By sales rep, by region, by campaign and more.
  • Which sales reps moved the most pipeline the day before the end of month/quarter? What was the progression of Stage 1 opportunities over time.
  • Correlating data outside of Salesforce (like web traffic) to pipeline building and demand. By region/country/state/city and associated pipeline.

It’s very challenging to look back in time and see trends in the data. Many companies have configured Salesforce to save reporting snapshots, but if you’re like me, you want to see the data behind the aggregate report. I want the ability to drill down to any level of detail, for any timeframe, and find any metric. We found that Salesforce snapshots just aren’t flexible enough for that.

Since we have first-class support for Elasticsearch as a destination in SDC, I decided to recreate the use case with the Salesforce Origin and see if we could fulfill those same requirements while taking advantage of StreamSets’ interactive pipeline IDE and ability to continuously monitor origins for new data.
