Dataflow Performance Blog

Building an Amazon SQS Custom Origin for StreamSets Data Collector

As I explained in my recent tutorial, Creating a Custom Origin for StreamSets Data Collector, it's straightforward to extend StreamSets Data Collector (SDC) to ingest data from pretty much any source. Yogesh Choudhary, a software engineer at consulting and services company Clairvoyant, just posted his own walkthrough of building a custom origin for Amazon Simple Queue Service (SQS). Yogesh does a great job of walking you through the process of creating a custom origin project from the Maven archetype, building it, and then adding the Amazon SQS functionality. Read more at Creating a Custom Origin for StreamSets.

Pat Patterson

Continuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks

I'm frequently asked, ‘How does StreamSets Data Collector (SDC) integrate with Spark Streaming? How about on Databricks?’ In this blog entry, I'll explain how to use SDC to ingest data into a Spark Streaming app running on Databricks, but the principles apply to Spark apps running anywhere.

Pat Patterson

Creating a Custom Origin for StreamSets Data Collector

Since writing tutorials for creating custom destinations and processors for StreamSets Data Collector (SDC), I've been looking for a good use case for a custom origin tutorial. It's been trickier than I expected, partly because the list of out-of-the-box origins is so extensive, and partly because the HTTP Client origin can access most web service APIs, rendering a custom origin redundant. Then, last week, StreamSets software engineer Jeff Evans suggested Git. Creating a custom origin to read the Git commit log turned into the perfect tutorial.

Pat Patterson

Running Apache Spark Code in StreamSets Data Collector

New in StreamSets Data Collector (SDC) 2.2.0.0 is the Spark Evaluator, a processor stage that allows you to run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. With the Spark Evaluator, you can build a pipeline to ingest data from any supported origin, apply transformations, such as filtering and lookups, using existing SDC processor stages, and have the Spark Evaluator hand off the data to your Java or Scala code as a Spark Resilient Distributed Dataset (RDD). Your Spark Transformer can then operate on the records, creating an output RDD, which is passed through the remainder of the pipeline to any supported destination.

Pat Patterson

Announcing Data Collector ver 2.2.0.0

And here it is, folks, the last release of 2016 – StreamSets Data Collector version 2.2.0.0. We've put in a host of important new features and resolved 120+ bugs.

We're gearing up for a solid roadmap in 2017, enabling exciting new use cases and bringing in some great contributions from customers and our community.

Kirit Basu

Upgrading From Apache Flume to StreamSets Data Collector

Apache Flume “is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data”. The typical use case is collecting log data and pushing it to a destination such as the Hadoop Distributed File System. In this blog entry, we'll look at a couple of Flume use cases and see how they can be implemented with StreamSets Data Collector.

Pat Patterson

More Than One Third of the Fortune 100 Have Downloaded StreamSets Data Collector

It’s been a little over a year (9/24/15) since we launched StreamSets Data Collector as an open source project. For those of you unfamiliar with the product, it’s any-to-any big data ingestion software through which you can build and place into production complex batch and streaming pipelines using built-in processors for all sorts of data transformations. The product features, plus video demos, tutorials, etc. can all be “ingested” through the SDC product page.

We’re thrilled to announce that as of last month StreamSets Data Collector had been downloaded by over ⅓ of the Fortune 100! That's several dozen of the largest companies in the U.S. And downloads of this award-winning software have been accelerating, with over 500% growth in the quarter ending in October versus the previous quarter.

Rick Bilodeau

Contributing to the StreamSets Data Collector Community

As you likely already know, StreamSets Data Collector (SDC) is open source, made available via the Apache 2.0 license. The entire source code for the product is hosted in a GitHub project and the binaries are always available for download.

As well as being part of our engineering culture, open source gives us a number of business advantages. Prospective users can freely download, install, evaluate and even put SDC into production; customers have access to the source code without a costly escrow process; and, perhaps most importantly, our users can contribute fixes and enhancements to improve the product for the benefit of the whole community. In this post, I'd like to acknowledge some of those contributions, and invite you to contribute, too.

Pat Patterson

The Challenge of Fetching Data for Apache Spot (incubating)

Reposted from the Cloudera Vision blog.

What do Sony, Target and the Democratic Party have in common?

Besides being well-respected brands, they’ve all been subject to some very public and embarrassing hacks over the past 24 months. Because cybercrime is no longer driven by angst-ridden teenagers but rather by professional criminal organizations and state-sponsored hacker groups, the halcyon days of looking for threat signatures are well behind us.

Rick Bilodeau

Creating a Custom Processor for StreamSets Data Collector

Back in March, I wrote a tutorial showing how to create a custom destination for StreamSets Data Collector (SDC). Since then I've been looking for a good sample use case for a custom processor. It's tricky to find one, since the set of out-of-the-box processors is pretty extensive now! In particular, the scripting processors make it easy to operate on records with Groovy, JavaScript or Jython, without needing to break out the Java compiler.

Looking at the Whole File data format, introduced last month in SDC 1.6.0.0, inspired me… Our latest tutorial, Creating a Custom StreamSets Processor, explains how to extract metadata tags from image files as they are ingested, adding them to records as fields.

Pat Patterson