StreamSets Data Integration Blog

Where change is welcome.

Building a Data Science Pipeline at IBM Ireland with StreamSets

By September 12, 2016

Guglielmo Iozzia, a big data infrastructure engineer on the Ethical Hacking Team at IBM Ireland, recently spoke at Hadoop User Group Ireland about building a data science pipeline with StreamSets Data Collector Engine. Afterward, I invited him to contribute a blog post outlining how he discovered StreamSets Data Collector (SDC) Engine and the kinds of problems he and his team are solving with it. Read on to discover how SDC is saving time and making life a whole lot easier for Guglielmo and his team.

Analyzing Salesforce Data with StreamSets, Elasticsearch, and Kibana

By June 3, 2016

UPDATE – Salesforce origin and destination stages, as well as a destination for Salesforce Wave Analytics, were released in StreamSets Data Collector 2.2.0.0. Use the supported, shipping Salesforce stages rather than the unsupported code mentioned below!

After I published a proof-of-concept Salesforce Origin for StreamSets Data Collector (SDC), I noticed an article on the Elastic blog, Analyzing Salesforce Data with Logstash, Elasticsearch, and Kibana. In that article, Elastic systems architect Russ Savage (now at Cask Data) explains the motivation for ingesting Salesforce data into Elasticsearch:

Working directly with sales and marketing operations, we outlined a number of challenges they had that might be solved with this solution. Those included:

  • Interactive time-series snapshot analysis across a number of dimensions. By sales rep, by region, by campaign and more.
  • Which sales reps moved the most pipeline the day before the end of month/quarter? What was the progression of Stage 1 opportunities over time?
  • Correlating data outside of Salesforce (like web traffic) to pipeline building and demand. By region/country/state/city and associated pipeline.

It’s very challenging to look back in time and see trends in the data. Many companies have configured Salesforce to save reporting snapshots, but if you’re like me, you want to see the data behind the aggregate report. I want the ability to drill down to any level of detail, for any timeframe, and find any metric. We found that Salesforce snapshots just aren’t flexible enough for that.

Since we have first-class support for Elasticsearch as a destination in SDC, I decided to recreate the use case with the Salesforce Origin and see if we could fulfill those same requirements while taking advantage of StreamSets’ interactive pipeline IDE and ability to continuously monitor origins for new data.

Integrating StreamSets with Salesforce Wave Analytics

By April 4, 2016

UPDATE – Salesforce origin and destination stages, as well as a destination for Salesforce Wave Analytics, were released in StreamSets Data Collector 2.2.0.0. Use the supported, shipping Salesforce stages rather than the unsupported code mentioned below!

In my last blog entry I explained how you can write custom destinations to send data to systems not currently supported by StreamSets Data Collector. As you might know, my last gig was as a developer evangelist at Salesforce, so I put my experience to work writing a destination for Salesforce Wave Analytics. Now you can ingest data from any of a variety of origins, operate on it in a StreamSets pipeline, and write the results into a Wave dataset. Once the dataset is uploaded, you can combine it with CRM and other data in Salesforce for analysis.
