Running Scala Code in StreamSets Data Collector

By Pat Patterson Posted in Data Integration February 27, 2017

Scala logo The Spark Evaluator, introduced in StreamSets Data Collector (SDC) version 2.2.0.0, lets you run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. Back in December, we released a tutorial walking you through the process of building a Transformer in Java. Since then, Maurin Lenglart, of Cuberon Labs, has contributed skeleton code for a Scala Transformer, paving the way for a new tutorial, Creating a StreamSets Spark Transformer in Scala.

With the Spark Evaluator, you can build a pipeline to ingest data from any supported origin, apply transformations, such as filtering and lookups, using existing SDC processor stages, and have the Spark Evaluator hand off the data to your Transformer as a Spark Resilient Distributed Dataset (RDD). Your code can then operate on the records, creating an output RDD, which is passed through the remainder of the pipeline to any supported destination. Since Scala is Spark’s ‘native tongue’, it’s well suited to the task of creating a Transformer. Here is the skeleton CustomTransformer class – you’ll notice it’s a bit briefer than the Java equivalent:

class CustomTransformer extends SparkTransformer with Serializable {
  var emptyRDD: JavaRDD[(Record, String)] = _

  override def init(javaSparkContextInstance: JavaSparkContext, params: util.List[String]): Unit = {
    // Create an empty JavaPairRDD to return as 'errors'
    emptyRDD = javaSparkContextInstance.emptyRDD
  }

  override def transform(recordRDD: JavaRDD[Record]): TransformResult = {
    val rdd = recordRDD.rdd

    val errors = emptyRDD

    // Apply a map to the incoming records
    val result = rdd.map((record)=> record)

    // return result
    new TransformResult(result.toJavaRDD(), new JavaPairRDD[Record, String](errors))
  }
}

The tutorial starts from this sample and extends it to compute the credit card issuing network (Visa, Mastercard, etc) from a credit card number, validate that incoming records have a credit card field, and even allow configuration of card issuer prefixes from the SDC user interface rather than being hardcoded in Scala.

If you’ve been looking to implement custom functionality in SDC, and you’re a fan of Scala’s brevity and expressiveness, work through the tutorial and let us know how you get on in the comments.

Related Resources

Webinar

Integration Roadmap: Navigating the Future of iPaaS with webMethods and StreamSets

Get introduced to the newest capabilities of webMethods.io and StreamSets. Plus get a sneak peek into Software AG’s vision for the iPaaS...

Watch Now

Whitepapers & Ebooks

The Data Integration Advantage: Building a Foundation for Scalable AI

Explore the state of AI in the enterprise including challenges of scaling and optimizing data flows.

Download Now

Report

Creating Order from Chaos: Governance in the Data Wild West

The Spark Evaluator, introduced in StreamSets Data Collector (SDC) version 2.2.0.0, lets you run an Apache Spark application, termed a...

Download Now

Running Scala Code in StreamSets Data Collector

Topics

Authors

Quick Links

Conduct Data Ingestion and Transformations In One Place