Getting Started with StreamSets Data Collector

Hi, I'm Pat Patterson, newly minted ‘community champion’ here at StreamSets. As I get up to speed with big data in general and StreamSets Data Collector (SDC) in particular, I'll write up my exploits here on the StreamSets blog to help other novices as they get started with open source big data ingest.

I'm going to assume you know the basics of what StreamSets Data Collector can do, and you want to get started actually using it. If you do need some background, the product page and FAQs are great places to start.

Now, let's get hands on!

Installing StreamSets Data Collector

The first order of business was to install SDC on my shiny new MacBook Air. This was totally straightforward – I downloaded the TGZ file (‘tarball’) from https://streamsets.com/opensource/, followed the instructions there to extract it and run the Data Collector, and then opened http://localhost:18630 to bring up the SDC console in a browser:

SDC Login

A quick scan of the docs revealed that the default admin username and password are both admin. Fine for development use on your own laptop, but you'll want to change these for production!

Getting Started with a Pipeline

Skip straight to the Tutorial and follow the instructions – in the space of an hour or so you’ll have a good understanding of what SDC can do. You'll start by downloading a sample CSV file of cash/credit card taxi fare transaction data, then build a simple data pipeline:

Tutorial Pipeline

The flow is pretty straightforward; the stream selector on the left sends credit card transactions along the top path. If you look carefully, you can see a gray gauge between the stream selector and Jython evaluator. This is where credit card transactions are validated – any records with the credit card transaction type, but no credit card number, are sent to an error log.
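To make the routing concrete, here is a minimal sketch in plain Python of what the stream selector and the validation step do between them. It is not SDC configuration or SDC's API; the field names `payment_type` and `credit_card` and the `CRD` code are assumptions chosen for illustration.

```python
# Plain-Python model of the routing described above.
# Field names and the "CRD" code are assumptions, not taken from SDC itself.

def route(record):
    """Return 'credit', 'cash', or 'error' for a single transaction record."""
    if record.get("payment_type") == "CRD":
        # Validation: a credit card transaction must carry a card number.
        if not record.get("credit_card"):
            return "error"
        return "credit"
    # Everything else follows the lower (cash) path.
    return "cash"


records = [
    {"payment_type": "CRD", "credit_card": "4111111111111111"},
    {"payment_type": "CRD", "credit_card": ""},
    {"payment_type": "CSH", "credit_card": ""},
]
routes = [route(r) for r in records]
print(routes)
```

The second record has the credit card transaction type but no card number, so it is the one that would end up in the error log.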

Now the Jython evaluator can get to work on the valid credit card transactions, augmenting each with the card type (Visa, Mastercard, etc.) derived from the card number. The field masker replaces all but the last four digits of the credit card number with the letter ‘x’, and the resulting records are written to a CSV file by the Hadoop FS destination. Don't worry if you don't have Hadoop set up yet – you can configure the Hadoop FS destination to simply write to the local filesystem.
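The two steps on the credit card path can be sketched in plain Python as follows. This is an illustration of the idea rather than the tutorial's actual Jython script: the prefix table is deliberately simplified, and the field names `credit_card` and `credit_card_type` are assumptions.

```python
# Simplified prefix table (an assumption; real issuer ranges are more detailed).
CARD_PREFIXES = {
    "Visa": ("4",),
    "Mastercard": ("51", "52", "53", "54", "55"),
    "AMEX": ("34", "37"),
}


def card_type(number):
    """Derive the card type from the leading digits of the card number."""
    for name, prefixes in CARD_PREFIXES.items():
        if number.startswith(prefixes):  # startswith accepts a tuple of prefixes
            return name
    return "Other"


def mask(number, keep=4):
    """Replace all but the last `keep` characters with the letter 'x'."""
    if len(number) <= keep:
        return number
    return "x" * (len(number) - keep) + number[-keep:]


def process_credit(record):
    """Add the card type, then mask the card number; other fields are untouched."""
    out = dict(record)
    out["credit_card_type"] = card_type(record["credit_card"])
    out["credit_card"] = mask(record["credit_card"])
    return out


print(process_credit({"credit_card": "4111111111111111", "fare_amount": "12.5"}))
```

Note that the card type has to be derived before the number is masked, which matches the order of the stages in the pipeline.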

Cash transactions flow along the lower path, where the expression evaluator simply sets the credit card type to ‘n/a’, so that every emitted transaction has a value for this field, and passes through all other fields unchanged. Cash transaction records flow on to the same Hadoop FS destination, and are written to the same file on local storage.
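Here is a rough Python sketch of why the ‘n/a’ value matters: once every record has the same set of fields, a single writer can handle both paths. The field names are assumptions, and an in-memory buffer stands in for the Hadoop FS destination.

```python
import csv
import io


def process_cash(record):
    """Set the card type to 'n/a' and pass all other fields through unchanged."""
    out = dict(record)
    out["credit_card_type"] = "n/a"
    return out


# One record as it might arrive from the credit card path, one from the cash path.
rows = [
    {"payment_type": "CRD", "credit_card": "xxxxxxxxxxxx1111", "credit_card_type": "Visa"},
    process_cash({"payment_type": "CSH", "credit_card": ""}),
]

# Both paths now share the same columns, so they can be written to the same file.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["payment_type", "credit_card", "credit_card_type"]
)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```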

The resulting CSV data looks like this (I've hidden a lot of the columns, such as pickup and dropoff location, so you can see the relevant fields more easily):

Taxi Output CSV

You can imagine that all sorts of analysis would now be possible – for example, is there a correlation between pickup location and fare?

Now I understand the basics of SDC – awesome! Next, I'm going to write a flow to read JSON data, do some processing in JavaScript, and write the results to Hadoop so I can run queries.

Try StreamSets Data Collector for yourself: it's open source and completely free (as in beer!) to get started. Follow StreamSets on Twitter for more updates on my progress.

More About Me

I've been working with technical communities since Sun Microsystems open sourced their Access Manager product as OpenSSO, back in July 2005. After a brief interlude writing cloud storage infrastructure at Huawei, I joined Salesforce in October 2010 as a developer evangelist; if you want a taste of how much fun that gig was, check out my Salesforce/Minecraft integration! Now that I'm ‘community champion’ at StreamSets, you'll see me writing, tweeting and speaking about StreamSets Data Collector. In fact, I'll be presenting a session at the Glue Conference in Broomfield, CO in May – come say hi!

Outside work, I run regularly, but not quickly, enjoy craft beer, and hack on hardware and software with my two sons. I'm a social animal, so you can find me all over the web.
