October 2016

Contributing to the StreamSets Data Collector Community

By Pat Patterson October 25, 2016

As you likely already know, StreamSets Data Collector (SDC) is open source, made available via the Apache 2.0 license. The entire source code for the product is hosted in a GitHub project and the binaries are always available for download.

As well as being part of our engineering culture, open source gives us a number of business advantages. Prospective users can freely download, install, evaluate and even put SDC into production; customers have access to the source code without a costly escrow process; and, perhaps most importantly, our users can contribute fixes and enhancements to improve the product for the benefit of the whole community. In this post, I’d like to acknowledge some of those contributions, and invite you to contribute, too.

Creating a Custom Processor for StreamSets Data Collector

Data Integration

By Pat Patterson October 19, 2016

Back in March, I wrote a tutorial showing how to create a custom destination for StreamSets Data Collector (SDC). Since then I’ve been looking for a good sample use case for a custom processor. It’s tricky to find one, since the set of out-of-the-box processors is pretty extensive now! In particular, the scripting processors make it easy to operate on records with Groovy, JavaScript or Jython, without needing to break out the Java compiler.

Looking at the Whole File data format, introduced last month in SDC 1.6.0.0, inspired me… Our latest tutorial, Creating a Custom StreamSets Processor, explains how to extract metadata tags from image files as they are ingested, adding them to records as fields.

The Challenge of Fetching Data for Apache Spot (incubating)

By Rick Bilodeau October 19, 2016

Reposted from the Cloudera Vision blog.

What do Sony, Target and the Democratic Party have in common?

Besides being well-respected brands, they’ve all been subject to some very public and embarrassing hacks over the past 24 months. Because cybercrime is no longer driven by angst-ridden teenagers but rather by professional criminal organizations and state-sponsored hacker groups, the halcyon days of looking for threat signatures are well behind us.

Announcing Data Collector ver 2.1.0.0

By Kirit Basu, Head of Strategy October 13, 2016

We’re happy to announce a new release of the Data Collector. This minor release has more than 30 bug fixes, a number of improvements and a few new features:

Visualizing NetFlow Data with StreamSets Data Collector, Kudu, Impala and D3

Data Integration

By Pat Patterson October 13, 2016

Sandish Kumar, a Solutions Engineer at phData, builds and manages solutions for phData customers. In this article, reposted from the phData blog, he explains how to generate simulated NetFlow data, read it into StreamSets Data Collector via the UDP origin, then buffer it in Apache Kafka before sending it to Apache Kudu. A true big data enthusiast, Sandish spends his spare time working to understand Kudu internals.
