Dataflow Performance Blog

Creating a Custom Processor for StreamSets Data Collector

Back in March, I wrote a tutorial showing how to create a custom destination for StreamSets Data Collector (SDC). Since then I've been looking for a good sample use case for a custom processor. It's tricky to find one, since the set of out-of-the-box processors is pretty extensive now! In particular, the scripting processors make it easy to operate on records with Groovy, JavaScript or Jython, without needing to break out the Java compiler.

Looking at the Whole File data format, introduced last month in SDC 1.6.0.0, inspired me… Our latest tutorial, Creating a Custom StreamSets Processor, explains how to extract metadata tags from image files as they are ingested, adding them to records as fields.

With the help of Drew Noakes' excellent metadata-extractor, you can access Exif and other metadata in a wide variety of image file types. In the tutorial, I give a simple example of reading and writing record fields, then show you how to integrate the metadata-extractor library, access whole file content, and write the resulting metadata tags as record fields. Having this metadata in the SDC record is incredibly useful. Want to search your photos by their location? As I describe in the tutorial, you can easily write the image's GPS coordinates, as well as the filename, to a database table.
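To make the "metadata tags as record fields" step more concrete, here is a minimal sketch of the grouping logic only. It is not the tutorial's code. In a real processor the tags would come from metadata-extractor (for example, by iterating `Metadata.getDirectories()` and each `Directory.getTags()`), and the result would be written to the SDC record. Here `ImageTag` and `TagFieldMapper` are hypothetical stand-ins, plain Java maps take the place of record fields, and the tag values are made-up sample data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one metadata tag: the directory it belongs to
// (e.g. "GPS"), the tag name, and its human-readable description.
final class ImageTag {
    final String directory;
    final String name;
    final String description;

    ImageTag(String directory, String name, String description) {
        this.directory = directory;
        this.name = name;
        this.description = description;
    }
}

// Hypothetical helper: plain maps stand in for SDC record fields.
class TagFieldMapper {
    // Groups tags by directory, preserving insertion order, so each tag
    // can be addressed by a path such as /GPS/GPS Latitude.
    static Map<String, Map<String, String>> toFields(List<ImageTag> tags) {
        Map<String, Map<String, String>> fields = new LinkedHashMap<>();
        for (ImageTag tag : tags) {
            fields.computeIfAbsent(tag.directory, d -> new LinkedHashMap<>())
                  .put(tag.name, tag.description);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        List<ImageTag> tags = Arrays.asList(
            new ImageTag("Exif IFD0", "Make", "ExampleCam"),
            new ImageTag("GPS", "GPS Latitude", "37 deg 22' 38.52\""),
            new ImageTag("GPS", "GPS Longitude", "-122 deg 2' 10.1\""));

        // Print one line per tag, using a field-path style key.
        for (Map.Entry<String, Map<String, String>> dir : toFields(tags).entrySet()) {
            for (Map.Entry<String, String> tag : dir.getValue().entrySet()) {
                System.out.println("/" + dir.getKey() + "/" + tag.getKey() + " = " + tag.getValue());
            }
        }
    }
}
```

Grouping by directory keeps related tags together, so a downstream stage could pick out just the GPS entries alongside the filename when writing to a database table.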

Do you have a custom processor in mind for SDC? Follow the tutorial to get started, and let us know how it goes in the comments!

Pat Patterson
  • Sree Harsha Gudise

    I just want to write a custom processor for ingesting data into the Druid data store. Is there anyone who can help me?

  • Sree Harsha Gudise

    In which folder should I create the archetype: data-collector or data-collector-api? Also, when I create the archetype, it asks me to choose one archetype out of 1650 options. What should I do?

    • Pat Patterson (http://blog.superpat.com/)

      You should create the custom processor in a completely separate directory – not under datacollector or datacollector-api. What command line are you using? Did you copy all three lines of the command line in the tutorial? It needs to be ALL of this (I updated the version for you):

      mvn archetype:generate -DarchetypeGroupId=com.streamsets \
      -DarchetypeArtifactId=streamsets-datacollector-stage-lib-tutorial \
      -DarchetypeVersion=2.7.0.0 -DinteractiveMode=true

      • Sree Harsha Gudise

        Thanks, Patterson! I was creating it under the data-collector folder. Thanks for the mvn command!