Dataflow Performance Blog

Ingesting data into Couchbase using StreamSets Data Collector

Nick Cadenhead, a Senior Consultant at 9th BIT Consulting in Johannesburg, South Africa, uses Couchbase Server to power analytics solutions for his clients. In this blog entry, reposted from his article at LinkedIn, Nick explains why he selected StreamSets Data Collector for data ingest, and how he extended it with a custom destination to write data to Couchbase.

Pat Patterson

Ingest Data into Splunk with StreamSets Data Collector

Splunk indexes and correlates log and machine data, providing a rich set of search, analysis and visualization capabilities. In this blog post, I'll explain how to efficiently send high volumes of data to Splunk's HTTP Event Collector via the StreamSets Data Collector Jython Evaluator. I'll present a Jython script with which you'll be able to build pipelines to read records from just about anywhere and send them to Splunk for indexing, analysis and visualization.
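To give a flavor of the batching idea, here is a minimal Python sketch. It is not the Jython script from the post. It assumes records are plain dictionaries, and the function names, the `sourcetype` default and the `time_field` parameter are all hypothetical. It relies only on two properties of the HTTP Event Collector: each event is a JSON object with an `event` key, and several such objects can be stacked in one request body, authorized with a `Splunk <token>` header.

```python
import json


def build_hec_payload(records, sourcetype="sdc", time_field=None):
    """Serialize a batch of records as stacked HEC event objects.

    Assumption: each record is a JSON-serializable dict.
    """
    events = []
    for record in records:
        event = {"event": record, "sourcetype": sourcetype}
        # Optionally carry the record's own timestamp (epoch seconds).
        if time_field and time_field in record:
            event["time"] = record[time_field]
        events.append(json.dumps(event, sort_keys=True))
    # HEC accepts several event objects in one body, one after another.
    return "\n".join(events)


def build_hec_headers(token):
    """Return the authorization header HEC expects for a given token."""
    return {"Authorization": "Splunk " + token}


if __name__ == "__main__":
    batch = [{"host": "web01", "status": 200}, {"host": "web02", "status": 500}]
    print(build_hec_payload(batch))
    print(build_hec_headers("hypothetical-token"))
```

Sending one request per batch rather than one per record is what keeps the overhead down at high volumes; the actual HTTP call is left out here so the sketch runs without a Splunk instance.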

Pat Patterson

Data in Motion Evolution: Where We’ve Been…Where We Need to Go

Today we hear a lot about streaming data, fast data, and data in motion. But the truth is that we have always needed ways to move our data. Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to now, we have continued to adapt and create new data movement methods, even as the characteristics of the data and data processing architectures have dramatically changed.

Exerting firm control over data in motion is a critical competency which has become core to modern data operations. Based on more than 20 years in enterprise data, here is my take on the past, present and future of data in motion.

Girish Pancha

Calling External Java Code from Script Evaluators

When you're building a pipeline with StreamSets Data Collector (SDC), you can often implement the data transformations you require using a combination of 'off-the-shelf' processors. Sometimes, though, you need to write some code. The script evaluators included with SDC allow you to manipulate records in Groovy, JavaScript and Jython (an implementation of Python integrated with the Java platform). You can usually achieve your goal using built-in scripting functions, as in the credit card issuing network computation shown in the SDC tutorial, but, again, sometimes you need to go a little further. For example, a member of the StreamSets community Slack channel recently asked about computing SHA-3 digests in JavaScript. In this blog entry I'll show you how to do just this from Groovy, JavaScript and Jython.
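For readers who just want to see what a SHA-3 digest of a record field looks like, here is a minimal standalone sketch. It uses the `hashlib` module from CPython 3.6 or later, not the external Java code approach the post covers, and it will not run as-is inside the Jython evaluator. The record is assumed to be a plain dictionary, and the function and field names are hypothetical.

```python
import hashlib


def sha3_256_hex(value):
    """Return the SHA3-256 digest of a string as lowercase hex."""
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest()


def add_digest(record, source_field, target_field):
    """Return a copy of the record with a digest of one field added.

    Assumption: the record is a dict and source_field is present.
    """
    enriched = dict(record)
    enriched[target_field] = sha3_256_hex(str(record[source_field]))
    return enriched


if __name__ == "__main__":
    record = {"name": "Alice", "card": "4111111111111111"}
    print(add_digest(record, "card", "card_sha3"))
```

Returning a copy rather than mutating the input keeps the sketch easy to test; a script evaluator would more likely update the record in place.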

Pat Patterson

Building an Amazon SQS Custom Origin for StreamSets Data Collector

As I explained in my recent tutorial, Creating a Custom Origin for StreamSets Data Collector, it's straightforward to extend StreamSets Data Collector (SDC) to ingest data from pretty much any source. Yogesh Choudhary, a software engineer at consulting and services company Clairvoyant, just posted his own walkthrough of building a custom origin for Amazon Simple Queue Service (SQS). Yogesh does a great job of walking you through the process of creating a custom origin project from the Maven archetype, building it, and then adding the Amazon SQS functionality. Read more at Creating a Custom Origin for StreamSets.

Pat Patterson

Continuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks

I'm frequently asked, 'How does StreamSets Data Collector (SDC) integrate with Spark Streaming? How about on Databricks?' In this blog entry, I'll explain how to use SDC to ingest data into a Spark Streaming app running on Databricks, but the principles apply to Spark apps running anywhere.

Pat Patterson

Creating a Custom Origin for StreamSets Data Collector

Since writing tutorials for creating custom destinations and processors for StreamSets Data Collector (SDC), I've been looking for a good use case for a custom origin tutorial. It's been trickier than I expected, partly because the list of out-of-the-box origins is so extensive, and partly because the HTTP Client origin can access most web service APIs, rendering a custom origin redundant. Then, last week, StreamSets software engineer Jeff Evans suggested Git. Creating a custom origin to read the Git commit log turned into the perfect tutorial.

Pat Patterson

Running Apache Spark Code in StreamSets Data Collector

New in StreamSets Data Collector (SDC) 2.2.0.0 is the Spark Evaluator, a processor stage that allows you to run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. With the Spark Evaluator, you can build a pipeline to ingest data from any supported origin, apply transformations, such as filtering and lookups, using existing SDC processor stages, and have the Spark Evaluator hand off the data to your Java or Scala code as a Spark Resilient Distributed Dataset (RDD). Your Spark Transformer can then operate on the records, creating an output RDD, which is passed through the remainder of the pipeline to any supported destination.

Pat Patterson

Announcing Data Collector ver 2.2.0.0

And here it is, folks: the last release of 2016, StreamSets Data Collector version 2.2.0.0. We've put in a host of important new features and resolved 120+ bugs.

We're gearing up for a solid roadmap in 2017, enabling exciting new use cases and bringing in some great contributions from customers and our community.

Kirit Basu