2016 | Page 4 of 6 | StreamSets

Ingesting MQTT Traffic into Riak TS via RabbitMQ and StreamSets

By Pat Patterson June 13, 2016

Riak KV is an open source, distributed, NoSQL key-value data store oriented towards high availability, fault tolerance and scalability. First released in 2009, Riak KV is in use at companies such as AT&T, Comcast and GitHub. Last October, Basho, the vendor behind Riak KV, announced Riak TS. Riak TS, another distributed, NoSQL data store, is optimized for time series data generated by sources such as IoT sensors. The recent release of Riak TS 1.3 as an open source product under the Apache V2 license got me thinking – how would I get sensor data into Riak TS? There is a range of client libraries, which communicate with the data store via protocol buffers, but connected devices tend to use protocols such as MQTT, an extremely lightweight publish/subscribe messaging transport. StreamSets to the rescue!

Analyzing Salesforce Data with StreamSets, Elasticsearch, and Kibana

Advanced Analytics

Data Transformation

By Pat Patterson June 3, 2016

UPDATE – Salesforce origin and destination stages, as well as a destination for Salesforce Wave Analytics, were released in StreamSets Data Collector 2.2.0.0. Use the supported, shipping Salesforce stages rather than the unsupported code mentioned below!

After I published a proof-of-concept Salesforce Origin for StreamSets Data Collector (SDC), I noticed an article on the Elastic blog, Analyzing Salesforce Data with Logstash, Elasticsearch, and Kibana. In the blog entry, Elastic systems architect Russ Savage (now at Cask Data) explains the motivation for ingesting Salesforce data into Elasticsearch:

Working directly with sales and marketing operations, we outlined a number of challenges they had that might be solved with this solution. Those included:

Interactive time-series snapshot analysis across a number of dimensions. By sales rep, by region, by campaign and more.

Which sales reps moved the most pipeline the day before the end of month/quarter? What was the progression of Stage 1 opportunities over time?

Correlating data outside of Salesforce (like web traffic) to pipeline building and demand. By region/country/state/city and associated pipeline.

It’s very challenging to look back in time and see trends in the data. Many companies have configured Salesforce to save reporting snapshots, but if you’re like me, you want to see the data behind the aggregate report. I want the ability to drill down to any level of detail, for any timeframe, and find any metric. We found that Salesforce snapshots just aren’t flexible enough for that.

Since we have first-class support for Elasticsearch as a destination in SDC, I decided to recreate the use case with the Salesforce Origin and see if we could fulfill those same requirements while taking advantage of StreamSets’ interactive pipeline IDE and ability to continuously monitor origins for new data.
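To make the snapshot questions quoted above concrete, here is a minimal Python sketch of one of them: which sales reps moved the most pipeline between two daily snapshots. The record layout (`owner` and `amount` fields keyed by an opportunity id) and the function name `pipeline_moved` are assumptions made for illustration, not the Salesforce schema or anything taken from the original post.

```python
from collections import defaultdict


def pipeline_moved(before, after):
    """Sum the change in opportunity amount per sales rep between two snapshots.

    Each snapshot maps an opportunity id to a dict with hypothetical
    'owner' and 'amount' fields. Opportunities that appear only in the
    earlier snapshot are ignored in this sketch.
    """
    moved = defaultdict(float)
    for opp_id, now in after.items():
        # A brand-new opportunity counts its full amount as pipeline moved.
        previous_amount = before.get(opp_id, {}).get("amount", 0.0)
        moved[now["owner"]] += now["amount"] - previous_amount
    return dict(moved)


# Illustrative data only.
snapshot_day_1 = {
    "opp-1": {"owner": "alice", "amount": 1000.0},
    "opp-2": {"owner": "bob", "amount": 500.0},
}
snapshot_day_2 = {
    "opp-1": {"owner": "alice", "amount": 1500.0},
    "opp-2": {"owner": "bob", "amount": 500.0},
    "opp-3": {"owner": "bob", "amount": 250.0},
}

changes = pipeline_moved(snapshot_day_1, snapshot_day_2)
print(changes)
```

Keeping every snapshot as its own set of timestamped records, rather than only the aggregate report, is what makes this kind of drill-down possible for any pair of dates.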

Announcing Data Collector ver 1.4.0.0

By Kirit Basu, Head of Strategy June 2, 2016

We are excited to announce the release of the next version of StreamSets Data Collector. With this release we have a number of new features and enhancements and 60+ bug fixes. New Features: SFTP/FTP origin to read files from…

The Complementary Nature of Data Ingestion and Data Preparation

By Rick Bilodeau May 25, 2016

I am always eager to learn about new architectures and big data best practices. Recently I came across a paper from Trifacta discussing the role of data preparation and it got me thinking about the complementary nature of data ingestion and data preparation.

Data preparation, more colorfully known as data wrangling, is the activity performed by data-driven professionals, such as data or business analysts, to explore, clean, transform and blend data of all varieties to make it trustworthy for analysis or predictive modeling. It is a form of data manipulation that has traditionally been done using Excel or, for more technically advanced end users, languages such as R, SAS or Python. But with the rise of enormous and dynamic data sets in Hadoop, these approaches are no longer feasible. Trifacta took the lead in creating a self-service web-based solution that enables business users to access and manipulate data stored in Hadoop without needing programming skills.

May the 4th Be With You – Analyzing Star Wars Twitter Mentions in Minecraft

Data Integration

By Pat Patterson May 4, 2016

A couple of weeks ago, as May the 4th approached, a lively Star Wars debate brewed at StreamSets:

“Do new school characters get as much play as old favorites like Darth Vader, Yoda and Han Solo?”
“Does the Dark Side of the Force dominate the Light?”
“Does Yoda prevail over Darth Vader?”

It occurred to us that, with the Twitter Streaming API and StreamSets Data Collector, we didn’t have to guess or debate. We built a data flow that ingested and analyzed tweets and then displayed them in… Minecraft!
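As a rough illustration of the analysis step, the Python sketch below counts how many tweets mention each character. It is not the pipeline from the post: the character list, the sample tweets and the `count_mentions` helper are all assumptions, and a real flow would read from the Twitter Streaming API rather than a hard-coded list.

```python
import re
from collections import Counter

# Hypothetical character list; the list used in the original post is not shown here.
CHARACTERS = ["Darth Vader", "Yoda", "Han Solo", "Rey", "Kylo Ren"]


def count_mentions(tweets, characters=CHARACTERS):
    """Count how many tweets mention each character (case-insensitive)."""
    counts = Counter({name: 0 for name in characters})
    for text in tweets:
        for name in characters:
            # Word boundaries keep "Rey" from matching inside "Reynolds".
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                counts[name] += 1
    return counts


# Made-up sample tweets for illustration only.
sample_tweets = [
    "Yoda is wiser than Darth Vader, no contest",
    "han solo shot first",
    "Rey and Kylo Ren carry the new films",
    "YODA. Always Yoda.",
]

mentions = count_mentions(sample_tweets)
for name, n in mentions.most_common():
    print(f"{name}: {n}")
```

Each tweet is counted at most once per character, so a tweet that repeats a name does not inflate that character's total.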

Ingest Salesforce Data for Analysis Using StreamSets

Data Integration

Operational Analytics

By Pat Patterson April 29, 2016

UPDATE – Salesforce origin and destination stages, as well as a destination for Salesforce Wave Analytics, were released in StreamSets Data Collector 2.2.0.0. Use the supported, shipping Salesforce stages rather than the unsupported code mentioned below!

As I’ve mentioned a couple of times, my previous gig was as a developer evangelist at Salesforce, with particular focus on integration. A few weeks ago, I wrote a custom destination allowing StreamSets Data Collector (SDC) to write data to Salesforce Wave Analytics; today, I’ll show you how to ingest data from Salesforce and write it to any destination supported by SDC.

Ingesting JSON Data Into Apache Kudu with StreamSets Data Collector

By Pat Patterson April 15, 2016

At the Hadoop Summit in Dublin this week, Ted Malaska, Principal Solutions Architect at Cloudera, and I presented Ingest and Stream Processing – What Will You Choose?, looking at the big data streaming landscape with a focus on ingest. The session closed with a demo of StreamSets Data Collector, the open source graphical IDE for building ingest pipelines.

Announcing Data Collector ver 1.3.0.0

By Kirit Basu, Head of Strategy April 13, 2016

With this release we have a number of exciting new features and integrations. And as usual, we've fixed a number of bugs. Integrations: Want to send data to Amazon Redshift? Use the new Kinesis Firehose destination to do it. If you…

Data in Motion: Simplifying Security & Building Custom Integrations

By Pat Patterson April 8, 2016

At the Strata+Hadoop World conference last week, I met with Pratik Verma, Chief Product Officer at BlueTalon, a Bay Area startup focused on big data security. As Pratik and I were talking, he explained some of the problems that arise when organizations collect more…

Integrating StreamSets with Salesforce Wave Analytics

Advanced Analytics

Operational Analytics

By Pat Patterson April 4, 2016

UPDATE – Salesforce origin and destination stages, as well as a destination for Salesforce Wave Analytics, were released in StreamSets Data Collector 2.2.0.0. Use the supported, shipping Salesforce stages rather than the unsupported code mentioned below!

In my last blog entry I explained how you can write custom destinations to send data to systems not currently supported by StreamSets Data Collector. As you might know, my last gig was as a developer evangelist at Salesforce, so I put my experience to work writing a destination for Salesforce Wave Analytics. Now you can ingest data from any of a variety of origins, operate on it in a StreamSets pipeline, and write the results into a Wave dataset. Once uploaded, you can combine the dataset with CRM and other data in Salesforce for analysis.
