December 2016

Calling External Java Code from Script Evaluators

By Pat Patterson December 21, 2016

When you’re building a pipeline with StreamSets Data Collector (SDC), you can often implement the data transformations you require using a combination of ‘off-the-shelf’ processors. Sometimes, though, you need to write some code. The script evaluators included with SDC allow you to manipulate records in Groovy, JavaScript and Jython (an implementation of Python integrated with the Java platform). You can usually achieve your goal using built-in scripting functions, as in the credit card issuing network computation shown in the SDC tutorial, but, again, sometimes you need to go a little further. For example, a member of the StreamSets community Slack channel recently asked about computing SHA-3 digests in JavaScript. In this blog entry I’ll show you how to do just this from Groovy, JavaScript and Jython.

Building an Amazon SQS Custom Origin for StreamSets Data Collector

By Pat Patterson December 20, 2016

As I explained in my recent tutorial, Creating a Custom Origin for StreamSets Data Collector, it's straightforward to extend StreamSets Data Collector (SDC) to ingest data from pretty much any source. Yogesh Choudhary, a software engineer at consulting and services company…

Continuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks

Stream Data Processing

Data Integration

By Pat Patterson December 19, 2016

I’m frequently asked, ‘How does StreamSets Data Collector integrate with Spark Streaming? How about on Databricks?’ In this blog entry, I’ll explain how to use Data Collector to ingest data into a Spark Streaming app running on Databricks, but the principles apply to Spark apps running anywhere. This is one solution for continuous data integration that can be used in cloud data platforms.

Creating a Custom Origin for StreamSets Data Collector

Data Integration

By Pat Patterson December 12, 2016

Since writing tutorials for creating custom destinations and processors for StreamSets Data Collector (SDC), I’ve been looking for a good use case for a custom origin tutorial. It’s been trickier than I expected, partly because the list of out-of-the-box origins is so extensive, and partly because the HTTP Client origin can access most web service APIs, rendering a custom origin redundant. Then, last week, StreamSets software engineer Jeff Evans suggested Git. Creating a custom origin for StreamSets to read the Git commit log turned into the perfect tutorial.

Running Apache Spark Code in StreamSets Data Collector

Data Integration

Stream Data Processing

By Pat Patterson December 8, 2016

New in StreamSets Data Collector (SDC) 2.2.0.0 is the Spark Evaluator, a processor stage that allows you to run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. With the Spark Evaluator, you can build a pipeline to ingest data from any supported origin, apply transformations, such as filtering and lookups, using existing SDC processor stages, and have the Spark Evaluator hand off the data to your Java or Scala code as a Spark Resilient Distributed Dataset (RDD). Your Spark Transformer can then operate on the records, creating an output RDD, which is passed through the remainder of the pipeline to any supported destination.

Announcing Data Collector ver 2.2.0.0

By Kirit Basu, Head of Strategy December 2, 2016

And here it is, folks, the last release of 2016 – StreamSets Data Collector version 2.2.0.0. We’ve put in a host of important new features and resolved 120+ bugs.

We’re gearing up for a solid roadmap in 2017, enabling exciting new use cases and bringing in some great contributions from customers and our community.

Upgrading From Apache Flume to StreamSets Data Collector

By Pat Patterson December 1, 2016

Apache Flume “is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data”. The typical use case is collecting log data and pushing it to a destination such as the Hadoop Distributed File System. In this blog entry we’ll look at a couple of Flume use cases, and see how they can be implemented with an Apache Flume alternative, such as StreamSets Data Collector.
