
StreamSets Data Integration Blog

Where change is welcome.

Calling External Java Code from Script Evaluators

December 21, 2016

When you’re building a pipeline with StreamSets Data Collector (SDC), you can often implement the data transformations you require using a combination of ‘off-the-shelf’ processors. Sometimes, though, you need to write some code. The script evaluators included with SDC allow you to manipulate records in Groovy, JavaScript and Jython (an implementation of Python integrated with the Java platform). You can usually achieve your goal using built-in scripting functions, as in the credit card issuing network computation shown in the SDC tutorial, but, again, sometimes you need to go a little further. For example, a member of the StreamSets community Slack channel recently asked about computing SHA-3 digests in JavaScript. In this blog entry I’ll show you how to do just this from Groovy, JavaScript and Jython.
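
As a taste of the technique, here is a minimal sketch of the external Java side, assuming the Bouncy Castle jar has been added to SDC’s classpath (Java 8 has no built-in SHA-3 support; the class name Sha3Util and the hex-encoding helper are my own invention, not code from the post):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.Security;
    import org.bouncycastle.jce.provider.BouncyCastleProvider;

    public class Sha3Util {
      static {
        // Register Bouncy Castle so "SHA3-256" resolves on pre-Java-9 JVMs
        Security.addProvider(new BouncyCastleProvider());
      }

      // Return the SHA3-256 digest of a string as lowercase hex
      public static String sha3Hex(String input) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA3-256", "BC");
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      }
    }

A Groovy evaluator can then call Sha3Util.sha3Hex(...) on a field value exactly as it would any other Java method, while JavaScript and Jython reach the same class through their respective Java bridges.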

Continuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks

December 19, 2016

I’m frequently asked, ‘How does StreamSets Data Collector integrate with Spark Streaming? How about on Databricks?’. In this blog entry, I’ll explain how to use Data Collector to ingest data into a Spark Streaming app running on Databricks, but the principles apply to Spark apps running anywhere. This is one solution for continuous data integration that can be used in cloud data platforms.
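
For the impatient, here is a minimal sketch of one common pattern for this kind of integration (not necessarily the exact one in the post): Data Collector writes records to a Kafka topic, and the Spark Streaming app consumes them via a direct stream. This assumes the Spark 1.6-era Kafka 0.8 integration, and the broker address and topic name are hypothetical:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class SdcKafkaConsumer {
      public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("sdc-spark-streaming");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092"); // hypothetical broker
        Set<String> topics = Collections.singleton("sdc-events"); // topic SDC writes to

        JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
            jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
            kafkaParams, topics);

        // Each value is a record serialized by SDC's Kafka Producer destination
        stream.map(tuple -> tuple._2()).print();

        jssc.start();
        jssc.awaitTermination();
      }
    }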

Creating a Custom Origin for StreamSets Data Collector

December 12, 2016

Since writing tutorials for creating custom destinations and processors for StreamSets Data Collector (SDC), I’ve been looking for a good use case for a custom origin tutorial. It’s been trickier than I expected, partly because the list of out-of-the-box origins is so extensive, and partly because the HTTP Client origin can access most web service APIs, rendering a custom origin redundant. Then, last week, StreamSets software engineer Jeff Evans suggested Git. Creating a custom origin for StreamSets to read the Git commit log turned into the perfect tutorial.
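
To give a flavor of what the tutorial builds, here is a heavily trimmed sketch of what the origin’s produce() method might look like, pairing SDC’s BaseSource with the JGit library. The class name, repo path and field names are mine; real code needs configuration, offset tracking and proper error handling, all of which the tutorial covers:

    import com.streamsets.pipeline.api.BatchMaker;
    import com.streamsets.pipeline.api.Field;
    import com.streamsets.pipeline.api.Record;
    import com.streamsets.pipeline.api.StageException;
    import com.streamsets.pipeline.api.base.BaseSource;
    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;
    import org.eclipse.jgit.api.Git;
    import org.eclipse.jgit.revwalk.RevCommit;

    public class GitCommitLogSource extends BaseSource {
      @Override
      public String produce(String lastSourceOffset, int maxBatchSize,
          BatchMaker batchMaker) throws StageException {
        String offset = lastSourceOffset;
        try (Git git = Git.open(new File("/path/to/repo"))) { // hypothetical repo path
          // A real origin would honor maxBatchSize and resume from lastSourceOffset
          for (RevCommit commit : git.log().call()) {
            Map<String, Field> fields = new HashMap<>();
            fields.put("hash", Field.create(commit.getName()));
            fields.put("author", Field.create(commit.getAuthorIdent().getName()));
            fields.put("message", Field.create(commit.getShortMessage()));

            Record record = getContext().createRecord("git::" + commit.getName());
            record.set(Field.create(fields));
            batchMaker.addRecord(record);
            offset = commit.getName();
          }
        } catch (Exception e) {
          // A real origin would wrap this in a StageException with a proper ErrorCode
          throw new RuntimeException("failed to read git log", e);
        }
        return offset; // SDC passes this back on the next produce() call
      }
    }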

Running Apache Spark Code in StreamSets Data Collector

December 8, 2016

New in StreamSets Data Collector (SDC) 2.2.0.0 is the Spark Evaluator, a processor stage that allows you to run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. With the Spark Evaluator, you can build a pipeline to ingest data from any supported origin, apply transformations, such as filtering and lookups, using existing SDC processor stages, and have the Spark Evaluator hand off the data to your Java or Scala code as a Spark Resilient Distributed Dataset (RDD). Your Spark Transformer can then operate on the records, creating an output RDD, which is passed through the remainder of the pipeline to any supported destination.
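
As a taste, here is a sketch of the kind of per-batch logic a Transformer encapsulates, written against a plain JavaRDD of SDC records rather than the full SparkTransformer interface; the field path and threshold are invented for illustration:

    import com.streamsets.pipeline.api.Record;
    import org.apache.spark.api.java.JavaRDD;

    public class LargeTransactionFilter {
      // Keep only records whose /amount field exceeds a threshold - the
      // filter-style logic a Spark Transformer's transform step would apply
      public static JavaRDD<Record> transform(JavaRDD<Record> records) {
        return records.filter(record ->
            record.has("/amount")
                && record.get("/amount").getValueAsDouble() > 1000.0);
      }
    }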

Creating a Custom Processor for StreamSets Data Collector

October 19, 2016

Back in March, I wrote a tutorial showing how to create a custom destination for StreamSets Data Collector (SDC). Since then I’ve been looking for a good sample use case for a custom processor. It’s tricky to find one, since the set of out-of-the-box processors is pretty extensive now! In particular, the scripting processors make it easy to operate on records with Groovy, JavaScript or Jython, without needing to break out the Java compiler.

Looking at the Whole File data format, introduced last month in SDC 1.6.0.0, inspired me… Our latest tutorial, Creating a Custom StreamSets Processor, explains how to extract metadata tags from image files as they are ingested, adding them to records as fields.
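
If you just want a feel for the extraction step itself, one way to pull tags from an image file is Drew Noakes’ metadata-extractor library; whether or not you follow the tutorial’s exact approach, the idea is the same (the helper class here is my own simplification):

    import com.drew.imaging.ImageMetadataReader;
    import com.drew.metadata.Directory;
    import com.drew.metadata.Metadata;
    import com.drew.metadata.Tag;
    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    public class ImageTagExtractor {
      // Flatten every metadata tag in the image into a name -> description map,
      // ready to be added to an SDC record as fields
      public static Map<String, String> extractTags(File imageFile) throws Exception {
        Map<String, String> tags = new HashMap<>();
        Metadata metadata = ImageMetadataReader.readMetadata(imageFile);
        for (Directory directory : metadata.getDirectories()) {
          for (Tag tag : directory.getTags()) {
            tags.put(directory.getName() + "/" + tag.getTagName(), tag.getDescription());
          }
        }
        return tags;
      }
    }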

Visualizing NetFlow Data with StreamSets Data Collector, Kudu, Impala and D3

October 13, 2016

Sandish Kumar, a Solutions Engineer at phData, builds and manages solutions for phData customers. In this article, reposted from the phData blog, he explains how to generate simulated NetFlow data, read it into StreamSets Data Collector via the UDP origin, then buffer it in Apache Kafka before sending it to Apache Kudu. A true big data enthusiast, Sandish spends his spare time working to understand Kudu internals.

Creating a Post-Lambda World with Apache Kudu

Arvind Prabhakar, September 23, 2016

Apache Kudu and Open Source StreamSets Data Collector Simplify Batch and Real-Time Processing

As originally posted on the Cloudera VISION Blog.

At StreamSets, we come across dataflow challenges for a variety of applications. Our product, StreamSets Data Collector, is an open-source any-to-any dataflow system that ensures all your data is safely delivered to the systems of your choice. At its core is the ability to handle data drift, which allows dataflow pipelines to evolve with your changing data landscape without incurring redesign costs.

This position at the front of the data pipeline has given us visibility into various use cases, and we have found that many applications rely on patched-together architectures to achieve their objective.

Ingesting Drifting Data into Hive and Impala

September 8, 2016

Importing data into Apache Hive is one of the most common use cases in big data ingest, but it gets tricky when data sources ‘drift’, changing the schema or semantics of incoming data. Introduced in StreamSets Data Collector (SDC) 1.5.0.0, the Hive Drift Solution monitors the structure of incoming data, detecting schema drift and updating the Hive Metastore accordingly, allowing data to keep flowing. In this blog entry, I’ll give you an overview of the Hive Drift Solution and explain how you can try it out for yourself today.
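
To make ‘updating the Hive Metastore’ concrete: when a new field appears in the incoming data, the solution applies the equivalent of the DDL you would otherwise run by hand. Here is a sketch of that manual step over Hive’s JDBC interface; the table and column names are invented, and this is emphatically not SDC’s internal code:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SchemaDriftByHand {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint - adjust host, port and database to taste
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
          // The kind of change the Hive Drift Solution applies automatically
          // when it sees a new 'loyalty_tier' field in incoming records
          stmt.execute("ALTER TABLE shipments ADD COLUMNS (loyalty_tier STRING)");
        }
      }
    }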

Whole File Transfer Data Ingestion

August 22, 2016

A key aspect of StreamSets Data Collector Engine is its ability to parse incoming data, giving you unprecedented flexibility in processing data flows. Sometimes, though, you don’t need to see ‘inside’ files – you just need to move them from a source to one or more destinations. The StreamSets Platform will include a new ‘Whole File Transfer’ feature for data ingestion. In this blog entry I’ll explain everything you need to know to get started with Whole File Transfer, today!

Dynamic Outlier Detection with StreamSets and Cassandra

August 19, 2016

This blog post concludes a short series on building an IoT sensor testbed with StreamSets Data Collector (SDC), a Raspberry Pi and Apache Cassandra.

To wrap up, I’ll show you how I retrieved statistics from Cassandra, fed them into SDC, and was able to filter out ‘outlier’ values.
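
As a preview of the approach, here is a minimal sketch of the statistics-plus-filter step using the DataStax Java driver. The keyspace, table and the three-sigma rule are my illustration of ‘filtering outliers’, not necessarily the exact scheme in the post:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class OutlierCheck {
      public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("sensors")) { // hypothetical keyspace
          // Hypothetical table of running statistics, one row per sensor;
          // assumes the stats row already exists
          Row stats = session.execute(
              "SELECT mean, stddev FROM sensor_stats WHERE sensor_id = ?", "pi-01").one();
          double mean = stats.getDouble("mean");
          double stddev = stats.getDouble("stddev");

          double reading = 42.7; // a value just read from the Raspberry Pi
          // Three-sigma rule: flag anything more than 3 standard deviations out
          if (Math.abs(reading - mean) > 3 * stddev) {
            System.out.println("outlier - route to an error stream");
          } else {
            System.out.println("reading looks normal");
          }
        }
      }
    }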
