Dataflow Performance Blog

Creating a Custom Multithreaded Origin for StreamSets Data Collector

Multithreaded PipelineMultithreaded Pipelines, introduced a couple of releases back, in StreamSets Data Collector (SDC), enable a single pipeline instance to process high volumes of data, taking full advantage of all available CPUs on the machine. In this blog entry I'll explain a little about how multithreaded pipelines work, and how you can implement your own multithreaded pipeline origin thanks to a new tutorial by Guglielmo Iozzia, Big Data Analytics Manager at Optum, part of UnitedHealth Group.

Pat PattersonCreating a Custom Multithreaded Origin for StreamSets Data Collector
Read More

Making Sense of Stream Processing

StreamThere has been an explosion of innovation in open source stream processing over the past few years. Frameworks such as Apache Spark and Apache Storm give developers stream abstractions on which they can develop applications; Apache Beam provides an API abstraction, enabling developers to write code independent of the underlying framework, while tools such as Apache NiFi and StreamSets Data Collector provide a user interface abstraction, allowing data engineers to define data flows from high-level building blocks with little or no coding.

In this article, I'll propose a framework for organizing stream processing projects, and briefly describe each area. I’ll be focusing on organizing the projects into a conceptual model; there are many articles that compare the streaming frameworks for real-world applications – I list a few at the end.

Pat PattersonMaking Sense of Stream Processing
Read More

StreamSets Data Collector v2.5 Adds IoT, Spark, Performance and Scale

We’re thrilled to announce version 2.5 of StreamSets Data Collector, a major release which includes important functionality related to the Internet of Things (IoT), high-performance database ingest, integration with Apache Spark and integration into your enterprise infrastructure.  You can download the latest open source release here.

This release has over 22 new features, 95 improvements and 150 bug fixes.

Kirit BasuStreamSets Data Collector v2.5 Adds IoT, Spark, Performance and Scale
Read More

Installing StreamSets Data Collector on Amazon Web Services EC2

Mike FullerMike Fuller, a consultant at Red Pill Analytics, recently wrote Stream Me Up (to the Cloud), Scotty, a tutorial on installing StreamSets Data Collector (SDC) on Amazon Web Services EC2. Mike's article takes you all the way from logging in to a fresh EC2 instance to seeing your first pipeline in action. We're reposting it here courtesy of Mike and Red Pill.

Pat PattersonInstalling StreamSets Data Collector on Amazon Web Services EC2
Read More

Transform Data in StreamSets Data Collector

I've written quite a bit over the past few months about the more advanced aspects of data manipulation in StreamSets Data Collector (SDC) – writing custom processors, calling Java libraries from JavaScript, Groovy & Python, and even using Java and Scala with the Spark Evaluator. As a developer, it's always great fun to break out the editor and get to work, but we should be careful not to jump the gun. Just because you can solve a problem with code, doesn't mean you should. Using SDC's built-in processor stages is not only easier than writing code, it typically results in better performance. In this blog entry, I'll look at some of these stages, and the problems you can solve with them.

Pat PattersonTransform Data in StreamSets Data Collector
Read More

Drift Synchronization with StreamSets Data Collector and Azure Data Lake

ADLS Drift PipelineOne of the great things about StreamSets Data Collector is that its record-oriented architecture allows great flexibility in creating data pipelines – you can plug together pretty much any combination of origins, processors and destinations to build a data flow. After I wrote the Ingesting Local Data into Azure Data Lake Store tutorial, it occurred to me that the Azure Data Lake Store destination should work with the Hive Metadata processor and Hive Metastore destination to allow me to replicate schema changes from a data source such as a relational database into Apache Hive running on HDInsight. Of course, there is a world of difference between should and does, so I was quite apprehensive as I duplicated the pipeline that I used for the Ingesting Drifting Data into Hive and Impala tutorial and replaced the Hadoop FS destination with the Azure Data Lake Store equivalent.

Pat PattersonDrift Synchronization with StreamSets Data Collector and Azure Data Lake
Read More

Read and Write JSON to MapR DB with StreamSets Data Collector

MapR DB logoMapR-DB is an enterprise-grade, high performance, NoSQL database management system. As a multi-model NoSQL database, it supports both JSON document models and wide column data models. MapR-DB stores JSON documents in tables; documents within a table in MapR-DB can have different structures. StreamSets Data Collector enables working with MapR-DB documents with its powerful schema-on-read and ingestion capability.

With StreamSets Data Collector, I’ll show you how easy it is to stream data from MongoDB into a MapR-DB table as well as stream data out of the MapR-DB table into MapR Streams.

Rupal ShahRead and Write JSON to MapR DB with StreamSets Data Collector
Read More

Announcing StreamSets Data Collector ver

We are happy to announce the newest version of StreamSets Data Collector is available for download. This short release has over 25 new features and improvements and over 50 bug fixes. This is an enterprise-focused release that addresses the needs of some of the world's largest organizations using StreamSets. Below is a short list of what's new, please check out the release notes for more details.

Kirit BasuAnnouncing StreamSets Data Collector ver
Read More

Running Scala Code in StreamSets Data Collector

Scala logoThe Spark Evaluator, introduced in StreamSets Data Collector (SDC) version, lets you run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. Back in December, we released a tutorial walking you through the process of building a Transformer in Java. Since then, Maurin Lenglart, of Cuberon Labs, has contributed skeleton code for a Scala Transformer, paving the way for a new tutorial, Creating a StreamSets Spark Transformer in Scala.

Pat PattersonRunning Scala Code in StreamSets Data Collector
Read More

Ingest Data into Azure Data Lake Store with StreamSets Data Collector

SDC and Power BIAzure Data Lake Store (ADLS) is Microsoft's cloud repository for big data analytic workloads, designed to capture data for operational and exploratory analytics. StreamSets Data Collector (SDC) version included an Azure Data Lake Store destination, so you can create pipelines to read data from any supported data source and write it to ADLS.

Since configuring the ADLS destination is a multi-step process; our new tutorial, Ingesting Local Data into Azure Data Lake Store, walks you through the process of adding SDC an an application in Azure Active Directory, creating a Data Lake Store, building a simple data ingest pipeline, and then configuring the ADLS destination with credentials to write to an ADLS directory.

Pat PattersonIngest Data into Azure Data Lake Store with StreamSets Data Collector
Read More