Dataflow Performance Blog

Running Scala Code in StreamSets Data Collector

The Spark Evaluator, introduced in StreamSets Data Collector (SDC) version 2.2.0.0, lets you run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. Back in December, we released a tutorial walking you through the process of building a Transformer in Java. Since then, Maurin Lenglart, of Cuberon Labs, has contributed skeleton code for a Scala Transformer, paving the way for a new tutorial, Creating a StreamSets Spark Transformer in Scala.

With the Spark Evaluator, you can build a pipeline that ingests data from any supported origin, applies transformations such as filtering and lookups using existing SDC processor stages, and then hands the data off to your Transformer as a Spark Resilient Distributed Dataset (RDD). Your code can then operate on the records, creating an output RDD, which is passed through the remainder of the pipeline to any supported destination. Since Scala is Spark's 'native tongue', it's well suited to the task of creating a Transformer. The tutorial's skeleton CustomTransformer class is a bit briefer than the Java equivalent.
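To give a feel for that shape without needing Spark or SDC on the classpath, here is a rough, self-contained sketch. It is not the tutorial's code: `Record` and `Result` are hypothetical stand-ins for the SDC record and result types, and a plain `Seq` stands in for the RDD. The method names `init` and `transform` are likewise assumptions chosen for illustration.

```scala
object TransformerSketch {
  // Hypothetical stand-in for an SDC record: a flat map of field name to value.
  type Record = Map[String, String]

  // Hypothetical stand-in for what a Transformer hands back to the pipeline:
  // the output records, plus (record, error message) pairs for failures.
  final case class Result(records: Seq[Record], errors: Seq[(Record, String)])

  class CustomTransformer {
    // Called once before any data arrives, for example to read configuration.
    def init(params: Seq[String]): Unit = ()

    // Called for each batch. The skeleton passes every record through
    // unchanged and reports no errors.
    def transform(batch: Seq[Record]): Result = {
      val result = batch.map(record => record)
      Result(result, Seq.empty)
    }
  }
}
```

Replacing the identity function inside `map` with real logic is where a Transformer's custom behavior would go.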

The tutorial starts from this sample and extends it to compute the credit card issuing network (Visa, Mastercard, etc.) from a credit card number, validate that incoming records have a credit card field, and even allow card issuer prefixes to be configured from the SDC user interface rather than hardcoded in Scala.
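The prefix lookup itself is easy to picture in plain Scala. The sketch below is an illustration rather than the tutorial's implementation: the `CardIssuer` object, the `issuerOf` function and the prefix table are all assumptions, and the table is a small, non-exhaustive sample of well-known prefixes.

```scala
object CardIssuer {
  // Illustrative prefix table only; real issuer ranges are broader than this.
  val prefixes: Seq[(String, Seq[String])] = Seq(
    "Visa"       -> Seq("4"),
    "Mastercard" -> Seq("51", "52", "53", "54", "55"),
    "AMEX"       -> Seq("34", "37"),
    "Discover"   -> Seq("6011", "65")
  )

  // Returns the first network with a matching prefix, or "Other" if none matches.
  def issuerOf(cardNumber: String): String = {
    // Ignore spaces, dashes and any other non-digit separators.
    val digits = cardNumber.filter(_.isDigit)
    prefixes
      .collectFirst { case (network, ps) if ps.exists(p => digits.startsWith(p)) => network }
      .getOrElse("Other")
  }
}
```

Keeping the prefixes in a data structure rather than in branching code is what makes it straightforward to load them from configuration later, instead of compiling them into the class.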

If you've been looking to implement custom functionality in SDC, and you're a fan of Scala's brevity and expressiveness, work through the tutorial and let us know how you get on in the comments.

Pat Patterson