Dataflow Performance Blog

Running Scala Code in StreamSets Data Collector

Scala logoThe Spark Evaluator, introduced in StreamSets Data Collector (SDC) version 2.2.0.0, lets you run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. Back in December, we released a tutorial walking you through the process of building a Transformer in Java. Since then, Maurin Lenglart, of Cuberon Labs, has contributed skeleton code for a Scala Transformer, paving the way for a new tutorial, Creating a StreamSets Spark Transformer in Scala.

Pat PattersonRunning Scala Code in StreamSets Data Collector
Read More

Ingest Data into Azure Data Lake Store with StreamSets Data Collector

SDC and Power BIAzure Data Lake Store (ADLS) is Microsoft's cloud repository for big data analytic workloads, designed to capture data for operational and exploratory analytics. StreamSets Data Collector (SDC) version 2.3.0.0 included an Azure Data Lake Store destination, so you can create pipelines to read data from any supported data source and write it to ADLS.

Since configuring the ADLS destination is a multi-step process; our new tutorial, Ingesting Local Data into Azure Data Lake Store, walks you through the process of adding SDC an an application in Azure Active Directory, creating a Data Lake Store, building a simple data ingest pipeline, and then configuring the ADLS destination with credentials to write to an ADLS directory.

Pat PattersonIngest Data into Azure Data Lake Store with StreamSets Data Collector
Read More

Replicating Relational Databases with StreamSets Data Collector

HiveDrift2StreamSets Data Collector has long supported both reading and writing data from and to relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines to read data from individual tables, ingesting records from an entire database was cumbersome, requiring a pipeline per table. StreamSets Data Collector (SDC) 2.3.0.0 introduces the JDBC Multitable Consumer, a new pipeline origin that can read data from multiple tables through a single database connection. In this blog entry, I'll explain how the JDBC Multitable Consumer can implement a typical use case – replicating an entire relational database into Hadoop.

Pat PattersonReplicating Relational Databases with StreamSets Data Collector
Read More

Ingesting data into Couchbase using StreamSets Data Collector

Nick Cadenhead, a Senior Consultant at 9th BIT Consulting in Johannesburg, South Africa, uses Couchbase Server to power analytics solutions for his clients. In this blog entry, reposted from his article at LinkedIn, Nick explains why he selected StreamSets Data Collector for data ingest, and how he extended it with a custom destination to write data to Couchbase.

Pat PattersonIngesting data into Couchbase using StreamSets Data Collector
Read More

Ingest Data into Splunk with StreamSets Data Collector

Splunk ChartSplunk indexes and correlates log and machine data, providing a rich set of search, analysis and visualization capabilities. In this blog post, I'll explain how to efficiently send high volumes of data to Splunk's HTTP Event Collector via the StreamSets Data Collector Jython Evaluator. I'll present a Jython script with which you'll be able to build pipelines to read records from just about anywhere and send them to Splunk for indexing, analysis and visualization.

Pat PattersonIngest Data into Splunk with StreamSets Data Collector
Read More

Data in Motion Evolution: Where We’ve Been…Where We Need to Go

Screen Shot 2017-01-17 at 1.40.07 PMToday we hear a lot about streaming data, fast data, and data in motion. But the truth is that we have always needed ways to move our data.  Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to now, we have continued to adapt and create new data movement methods, even as the characteristics of the data and data processing architectures have dramatically changed.

Exerting firm control over data in motion is a critical competency which has become core to modern data operations. Based on more than 20 years in enterprise data, here is my take on the past, present and future of data in motion.

Girish PanchaData in Motion Evolution: Where We’ve Been…Where We Need to Go
Read More

Calling External Java Code from Script Evaluators

groovy logoWhen you're building a pipeline with StreamSets Data Collector (SDC), you can often implement the data transformations you require using a combination of ‘off-the-shelf' processors. Sometimes, though, you need to write some code. The script evaluators included with SDC allow you to manipulate records in Groovy, JavaScript and Jython (an implementation of Python integrated with the Java platform). You can usually achieve your goal using built-in scripting functions, as in the credit card issuing network computation shown in the SDC tutorial, but, again, sometimes you need to go a little further. For example, a member of the StreamSets community Slack channel recently asked about computing SHA-3 digests in JavaScript. In this blog entry I'll show you how to do just this from Groovy, JavaScript and Jython.

Pat PattersonCalling External Java Code from Script Evaluators
Read More

Building an Amazon SQS Custom Origin for StreamSets Data Collector

sqsoriginAs I explained in my recent tutorial, Creating a Custom Origin for StreamSets Data Collector, it's straightforward to extend StreamSets Data Collector (SDC) to ingest data from pretty much any source. Yogesh Choudhary, a software engineer at consulting and services company Clairvoyant, just posted his own walkthrough of building a custom origin for Amazon Simple Queue Service (SQS). Yogesh does a great job of walking you through the process of creating a custom origin project from the Maven archetype, building it, and then adding the Amazon SQS functionality. Read more at Creating a Custom Origin for StreamSets.

Pat PattersonBuilding an Amazon SQS Custom Origin for StreamSets Data Collector
Read More

Continuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks

Databricks LogoI'm frequently asked, ‘How does StreamSets Data Collector (SDC) integrate with Spark Streaming? How about on Databricks?'. In this blog entry, I'll explain how to use SDC to ingest data into a Spark Streaming app running on Databricks, but the principles apply to Spark apps running anywhere.

Pat PattersonContinuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks
Read More