Creating a Custom Processor for StreamSets Data Collector

Posted in Data Integration, October 19, 2016

Back in March, I wrote a tutorial showing how to create a custom destination for StreamSets Data Collector (SDC). Since then I’ve been looking for a good sample use case for a custom processor. It’s tricky to find one, since the set of out-of-the-box processors is pretty extensive now! In particular, the scripting processors make it easy to operate on records with Groovy, JavaScript or Jython, without needing to break out the Java compiler.

Looking at the Whole File data format, introduced last month in SDC 1.6.0.0, gave me the idea I needed. Our latest tutorial, Creating a Custom StreamSets Processor, explains how to extract metadata tags from image files as they are ingested and add them to records as fields.

With the help of Drew Noakes’ excellent metadata-extractor, you can access Exif and other metadata in a wide variety of image file types. In the tutorial, I give a simple example of reading and writing record fields, then show you how to integrate the metadata-extractor library, access whole file content, and write the resulting metadata tags as record fields. Once this metadata is in the SDC record, downstream stages can filter, route and store on it like any other field. Want to search your photos by their location? As I describe in the tutorial, you can easily write the image’s GPS coordinates, as well as the filename, to a database table.
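To make the location use case a little more concrete, here is a minimal sketch of one step you may need before the coordinates are easy to query. Exif stores latitude and longitude as degrees, minutes and seconds plus a hemisphere reference (N, S, E or W), while a database search is usually simpler against signed decimal degrees. The class and method names below (GpsCoordinates, toDecimalDegrees) are hypothetical and are not part of SDC or metadata-extractor; the snippet uses only plain Java and assumes you have already read the individual values from the image's GPS tags.

```java
// Hypothetical helper, not part of SDC or metadata-extractor.
// Converts Exif-style degrees/minutes/seconds into signed decimal degrees.
final class GpsCoordinates {

    private GpsCoordinates() {
        // Static utility class; not meant to be instantiated.
    }

    /**
     * @param degrees whole degrees, non-negative
     * @param minutes minutes, non-negative
     * @param seconds seconds, non-negative
     * @param ref     hemisphere reference: "N", "S", "E" or "W"
     * @return decimal degrees, negative for south and west
     */
    static double toDecimalDegrees(double degrees, double minutes, double seconds, String ref) {
        if (degrees < 0 || minutes < 0 || seconds < 0) {
            throw new IllegalArgumentException("Components must be non-negative; the sign comes from ref");
        }
        if (ref == null) {
            throw new IllegalArgumentException("Missing hemisphere reference");
        }

        // One degree is 60 minutes, and one minute is 60 seconds.
        double magnitude = degrees + minutes / 60.0 + seconds / 3600.0;

        switch (ref) {
            case "N":
            case "E":
                return magnitude;
            case "S":
            case "W":
                return -magnitude;
            default:
                throw new IllegalArgumentException("Unknown hemisphere reference: " + ref);
        }
    }

    public static void main(String[] args) {
        // Example values chosen for illustration only.
        double latitude = toDecimalDegrees(37, 46, 29.64, "N");
        double longitude = toDecimalDegrees(122, 25, 9.84, "W");
        System.out.println(latitude + ", " + longitude);
    }
}
```

In a processor, values like these could be written to the record as two numeric fields alongside the filename, so that a database destination can store them in ordinary numeric columns. Whether you convert in the processor or later in the pipeline is a design choice; this sketch only shows the arithmetic.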

Do you have a custom processor in mind for SDC? Follow the tutorial to get started, and let us know how it goes in the comments!
