
Running Scala Code in StreamSets Data Collector

Posted in Data Integration, February 27, 2017

The Spark Evaluator, introduced in StreamSets Data Collector (SDC) version 2.2.0.0, lets you run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. Back in December, we released a tutorial walking you through the process of building a Transformer in Java. Since then, Maurin Lenglart, of Cuberon Labs, has contributed skeleton code for a Scala Transformer, paving the way for a new tutorial, Creating a StreamSets Spark Transformer in Scala.

With the Spark Evaluator, you can build a pipeline to ingest data from any supported origin, apply transformations, such as filtering and lookups, using existing SDC processor stages, and have the Spark Evaluator hand off the data to your Transformer as a Spark Resilient Distributed Dataset (RDD). Your code can then operate on the records, creating an output RDD, which is passed through the remainder of the pipeline to any supported destination. Since Scala is Spark’s ‘native tongue’, it’s well suited to the task of creating a Transformer. Here is the skeleton CustomTransformer class – you’ll notice it’s a bit briefer than the Java equivalent:

import java.util

import com.streamsets.pipeline.api.Record
import com.streamsets.pipeline.spark.api.{SparkTransformer, TransformResult}
import org.apache.spark.api.java.{JavaPairRDD, JavaRDD, JavaSparkContext}

class CustomTransformer extends SparkTransformer with Serializable {
  var emptyRDD: JavaRDD[(Record, String)] = _

  override def init(javaSparkContextInstance: JavaSparkContext, params: util.List[String]): Unit = {
    // Create an empty RDD of (record, error message) pairs to return as 'errors'
    emptyRDD = javaSparkContextInstance.emptyRDD
  }

  override def transform(recordRDD: JavaRDD[Record]): TransformResult = {
    val rdd = recordRDD.rdd

    val errors = emptyRDD

    // Identity transformation: pass each incoming record through unchanged
    val result = rdd.map(record => record)

    // Wrap the output RDD and the (empty) error RDD into a TransformResult
    new TransformResult(result.toJavaRDD(), new JavaPairRDD[Record, String](errors))
  }
}

The tutorial starts from this sample and extends it to compute the credit card issuing network (Visa, Mastercard, etc.) from a credit card number, validate that incoming records contain a credit card field, and even let you configure the card issuer prefixes from the SDC user interface rather than hardcoding them in Scala.
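For a flavor of where the tutorial ends up, here's a minimal sketch of that card-type logic, assuming a /credit_card input field, a /credit_card_type output field, and a hardcoded prefix map. The class name, field paths, and prefixes below are illustrative, not the tutorial's exact code; the tutorial itself parses the issuer prefixes from the params list that init() receives from the pipeline configuration:

import java.util

import com.streamsets.pipeline.api.{Field, Record}
import com.streamsets.pipeline.spark.api.{SparkTransformer, TransformResult}
import org.apache.spark.api.java.{JavaPairRDD, JavaRDD, JavaSparkContext}

class CardTypeTransformer extends SparkTransformer with Serializable {
  // Illustrative field paths - the tutorial's names may differ
  private val valuePath = "/credit_card"
  private val resultPath = "/credit_card_type"

  // Hardcoded issuer prefixes for brevity; the tutorial builds this
  // map from the params list configured in the Spark Evaluator's UI
  private val ccTypes = Map(
    "Visa" -> Seq("4"),
    "Mastercard" -> Seq("51", "52", "53", "54", "55")
  )

  override def init(context: JavaSparkContext, params: util.List[String]): Unit = {
    // Sketch only - nothing to initialize with hardcoded prefixes
  }

  override def transform(recordRDD: JavaRDD[Record]): TransformResult = {
    val rdd = recordRDD.rdd

    // Route records without a credit card field to the error stream
    val errors = rdd
      .filter(record => !record.has(valuePath))
      .map(record => (record, s"Missing field $valuePath"))

    // For the rest, match the card number's prefix against each issuer
    val result = rdd
      .filter(record => record.has(valuePath))
      .map { record =>
        val cc = record.get(valuePath).getValueAsString
        val ccType = ccTypes
          .collectFirst {
            case (issuer, prefixes) if prefixes.exists(p => cc.startsWith(p)) => issuer
          }
          .getOrElse("Other")
        record.set(resultPath, Field.create(ccType))
        record
      }

    new TransformResult(result.toJavaRDD(), new JavaPairRDD[Record, String](errors))
  }
}

Each pair in the error RDD couples a rejected record with a message, which the Spark Evaluator surfaces through SDC's normal error record handling, just like records rejected by built-in stages.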

If you’ve been looking to implement custom functionality in SDC, and you’re a fan of Scala’s brevity and expressiveness, work through the tutorial and let us know how you get on in the comments.
