StreamSets Data Integration Blog

Where change is welcome.

Calling External Java Code from Script Evaluators

StreamSets News

When you're building a pipeline with StreamSets Data Collector (SDC), you can often implement the data transformations you require using a combination of 'off-the-shelf' processors. Sometimes, though, you need to write some code. The script evaluators included with SDC allow you to manipulate records in Groovy, JavaScript and Jython (an implementation of Python integrated with the Java platform). You can…
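To make the idea concrete, here is a minimal standalone sketch of the kind of script a Jython Evaluator might run. Inside a real SDC pipeline, the `records` list and the `output` and `error` objects are bindings supplied by the stage; they are stubbed here (as hypothetical `_Record` and `_Output` classes) so the record-manipulation logic can run on its own.

```python
# Stand-ins for the bindings SDC provides to a Jython Script Evaluator.
# In a real pipeline these are injected by the stage, not defined by you.
class _Record(object):
    def __init__(self, value):
        self.value = value  # SDC exposes record data via record.value

class _Output(object):
    def __init__(self):
        self.written = []
    def write(self, record):
        self.written.append(record)

records = [_Record({'first': 'Ada', 'last': 'Lovelace'})]
output = _Output()

# The evaluator script itself: derive a full_name field on each record,
# then pass the record downstream.
for record in records:
    record.value['full_name'] = record.value['first'] + ' ' + record.value['last']
    output.write(record)
```

In a real stage the loop body is all you would write; records that raise an exception are typically routed to the error stream instead of `output`.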

December 21, 2016

Creating a Custom Origin for StreamSets Data Collector

Engineering, StreamSets News

Since writing tutorials for creating custom destinations and processors for StreamSets Data Collector (SDC), I've been looking for a good use case for a custom origin tutorial. It's been trickier than I expected, partly because the list of out-of-the-box origins is so extensive, and partly because the HTTP Client origin can access most web service APIs, rendering a custom…

December 12, 2016

Running Apache Spark Code in StreamSets Data Collector

Engineering, StreamSets News

New in StreamSets Data Collector (SDC) is the Spark Evaluator, a processor stage that allows you to run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. With the Spark Evaluator, you can build a pipeline to ingest data from any supported origin, apply transformations, such as filtering and lookups, using existing SDC processor stages, and have the…
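The contract of a Spark Transformer can be sketched simply: it receives a batch of records and returns the transformed results together with any records that failed. The real API is JVM-based (a Java or Scala class that the Spark Evaluator loads), so this Python stand-in is only a conceptual model of that contract, with a hypothetical `transform` function and plain dicts in place of SDC records.

```python
def transform(batch):
    """Model of a Spark Transformer's job: filter out records with a
    missing 'id' and uppercase the 'name' field on the rest."""
    results, errors = [], []
    for record in batch:
        if record.get('id') is None:
            # In SDC, failed records would be routed to the error stream.
            errors.append(record)
        else:
            record['name'] = record['name'].upper()
            results.append(record)
    return results, errors
```

In the actual Spark Evaluator, the batch arrives as a Spark RDD of records rather than a Python list, and the transformer returns both the result and error sets to the pipeline.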

December 8, 2016