
StreamSets Data Integration Blog

Where change is welcome.

Create a Custom Expression Language Function for StreamSets Data Collector

April 28, 2017

One of the most powerful features in StreamSets Data Collector Engine is support for Expression Language, or ‘EL’ for short. EL was introduced in JavaServer Pages (JSP) 2.0 as a mechanism for accessing Java code from JSP. The Expression Evaluator and Stream Selector stages rely heavily on EL, but you can use Expression Language in configuring almost every SDC stage. In this blog entry I’ll explain a little about EL and show you how to write your own EL functions.
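To give a flavor of what the post covers, here is a minimal sketch of the shape a custom EL function takes: a public static method annotated with Data Collector’s @ElFunction and @ElParam annotations. The reverse:string() function, its class name and its behavior are purely illustrative and not taken from the post.

```java
import com.streamsets.pipeline.api.ElFunction;
import com.streamsets.pipeline.api.ElParam;

// Hypothetical example: a reverse:string() EL function.
public class ReverseEL {

  @ElFunction(
      prefix = "reverse",
      name = "string",
      description = "Returns the input string reversed")
  public static String reverse(@ElParam("input") String input) {
    if (input == null) {
      return null;
    }
    return new StringBuilder(input).reverse().toString();
  }
}
```

Once built and made available to Data Collector as described in the post, a function like this can be called wherever EL is accepted, for example ${reverse:string(record:value('/name'))}.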

Creating a Custom Multithreaded Origin for StreamSets Data Collector

April 25, 2017

Multithreaded Pipelines, introduced a couple of releases back in StreamSets Data Collector (SDC) 2.3.0.0, enable a single pipeline instance to process high volumes of data, taking full advantage of all available CPUs on the machine. In this blog entry I’ll explain a little about how multithreaded pipelines work, and how you can implement your own multithreaded pipeline origin thanks to a new tutorial by Guglielmo Iozzia, Big Data Analytics Manager at Optum, part of UnitedHealth Group.
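The origin API itself is covered in the tutorial; purely as a conceptual illustration (not the SDC API), the Java sketch below shows the underlying idea: a fixed pool of worker threads, each producing and processing its own batches independently, so all cores stay busy.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Conceptual illustration only: several workers produce batches in parallel,
// mirroring how a multithreaded origin feeds one pipeline runner per thread.
public class ParallelBatchDemo {
  public static void main(String[] args) throws InterruptedException {
    int numThreads = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);

    for (int worker = 0; worker < numThreads; worker++) {
      final int id = worker;
      pool.submit(() -> {
        // Each worker handles its own batches, analogous to one origin thread
        // handing batches to its own pipeline runner.
        for (int batch = 0; batch < 3; batch++) {
          List<String> records = List.of(id + "-" + batch + "-a", id + "-" + batch + "-b");
          System.out.printf("worker %d processed batch %d: %s%n", id, batch, records);
        }
      });
    }

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
  }
}
```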

Drift Synchronization with StreamSets Data Collector and Azure Data Lake

March 6, 2017

One of the great things about StreamSets Data Collector is that its record-oriented architecture allows great flexibility in creating data pipelines – you can plug together pretty much any combination of origins, processors and destinations to build a data flow. After I wrote the Ingesting Local Data into Azure Data Lake Store tutorial, it occurred to me that the Azure Data Lake Store destination should work with the Hive Metadata processor and Hive Metastore destination to allow me to replicate schema changes from a data source such as a relational database into Apache Hive running on HDInsight. Of course, there is a world of difference between should and does, so I was quite apprehensive as I duplicated the pipeline that I used for the Ingesting Drifting Data into Hive and Impala tutorial and replaced the Hadoop FS destination with the Azure Data Lake Store equivalent.

Read and Write JSON to MapR DB with StreamSets Data Collector

March 5, 2017

MapR-DB is an enterprise-grade, high-performance NoSQL database management system. As a multi-model NoSQL database, it supports both JSON document and wide column data models. MapR-DB stores JSON documents in tables, and documents within a table can have different structures. StreamSets Data Collector’s powerful schema-on-read ingestion capability makes it easy to work with MapR-DB documents.

With StreamSets Data Collector, I’ll show you how easy it is to stream data from MongoDB into a MapR-DB table as well as stream data out of the MapR-DB table into MapR Streams.

Running Scala Code in StreamSets Data Collector

February 27, 2017

The Spark Evaluator, introduced in StreamSets Data Collector (SDC) version 2.2.0.0, lets you run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. Back in December, we released a tutorial walking you through the process of building a Transformer in Java. Since then, Maurin Lenglart, of Cuberon Labs, has contributed skeleton code for a Scala Transformer, paving the way for a new tutorial, Creating a StreamSets Spark Transformer in Scala.

Ingest Data into Azure Data Lake Store with StreamSets Data Collector

February 20, 2017

Azure Data Lake Store (ADLS) is Microsoft’s cloud repository for big data analytic workloads, designed to capture data for operational and exploratory analytics. StreamSets Data Collector (SDC) version 2.3.0.0 included an Azure Data Lake Store destination, so you can create pipelines to read data from any supported data source and write it to ADLS.

Since configuring the ADLS destination is a multi-step process, our new tutorial, Ingesting Local Data into Azure Data Lake Store, walks you through the process of adding SDC as an application in Azure Active Directory, creating a Data Lake Store, building a simple data ingest pipeline, and then configuring the ADLS destination with credentials to write to an ADLS directory.

Replicating Relational Databases with StreamSets Data Collector

February 3, 2017

StreamSets Data Collector Engine has long supported reading data from, and writing data to, relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines to read data from individual tables, ingesting records from an entire database was cumbersome, requiring a pipeline per table. StreamSets Data Collector Engine now introduces the JDBC Multitable Consumer, a new pipeline origin that can read data from multiple tables through a single database connection. In this blog entry, I’ll explain how the JDBC Multitable Consumer can implement a typical use case: replicating an entire relational database into Hadoop.
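To make the idea concrete, here is a plain-JDBC sketch of what “many tables over one connection” looks like. It illustrates the concept only, not the Multitable Consumer’s implementation, and the connection URL and credentials are placeholders.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Conceptual sketch: enumerate every table over a single JDBC connection
// and read rows from each, rather than configuring one pipeline per table.
public class MultiTableRead {
  public static void main(String[] args) throws SQLException {
    String url = "jdbc:mysql://localhost:3306/mydb";  // placeholder URL
    try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
      DatabaseMetaData meta = conn.getMetaData();
      try (ResultSet tables = meta.getTables(null, null, "%", new String[] {"TABLE"})) {
        while (tables.next()) {
          String table = tables.getString("TABLE_NAME");
          try (Statement stmt = conn.createStatement();
               ResultSet rows = stmt.executeQuery("SELECT * FROM " + table)) {
            int count = 0;
            while (rows.next()) {
              count++;  // a real pipeline would turn each row into a record here
            }
            System.out.printf("%s: %d rows%n", table, count);
          }
        }
      }
    }
  }
}
```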

Ingest Data into Splunk with StreamSets Data Collector

January 18, 2017

UPDATE – Data Collector’s HTTP Client destination can send a single request per batch of records, providing an easier way to send data to Splunk than the Jython script evaluator. See the blog post, Efficient Splunk Ingest for Cybersecurity, for an example.

Splunk indexes and correlates log and machine data, providing a rich set of search, analysis and visualization capabilities. In this blog post, I’ll explain how to efficiently send high volumes of data to Splunk’s HTTP Event Collector via the StreamSets Data Collector Jython Evaluator. I’ll present a Jython script with which you’ll be able to build pipelines to read records from just about anywhere and send them to Splunk for indexing, analysis and visualization.
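The post’s script is written in Jython, but the batching idea is language-agnostic: wrap each record in a HEC event envelope and send the whole batch in a single HTTP POST. Here is a minimal Java sketch of that pattern; the HEC URL, token and sample records are placeholders you would replace with your own.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.stream.Collectors;

// Sends a batch of events to Splunk's HTTP Event Collector in one request.
// HEC accepts multiple JSON event objects concatenated in a single POST body.
public class SplunkHecBatch {
  public static void main(String[] args) throws Exception {
    String hecUrl = "https://splunk.example.com:8088/services/collector/event"; // placeholder
    String hecToken = "YOUR-HEC-TOKEN";                                         // placeholder

    List<String> records = List.of("{\"msg\":\"login ok\"}", "{\"msg\":\"login failed\"}");

    // Wrap each record in a HEC event envelope and join the batch into one body.
    String body = records.stream()
        .map(r -> "{\"event\":" + r + "}")
        .collect(Collectors.joining("\n"));

    HttpRequest request = HttpRequest.newBuilder(URI.create(hecUrl))
        .header("Authorization", "Splunk " + hecToken)
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```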

Data in Motion Evolution: Where We’ve Been…Where We Need to Go

January 17, 2017

Today we hear a lot about streaming data, fast data, and data in motion. But the truth is that we have always needed ways to move our data. Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to now, we have continued to adapt and create new data movement methods, even as the characteristics of the data and data processing architectures have dramatically changed.

Exerting firm control over data in motion is a critical competency which has become core to modern data integration and operations. Based on more than 20 years in enterprise data, here is my take on the past, present and future of data in motion.
