Pat Patterson

Pat is StreamSets' Community Champion. An outgoing personality and self-described 'articulate techie', Pat is an accomplished international public speaker, presenting at events such as Dreamforce, JavaOne (both San Francisco and Tokyo), the RSA Conference, Defrag and more.

Making Sense of Stream Processing

There has been an explosion of innovation in open source stream processing over the past few years. Frameworks such as Apache Spark and Apache Storm give developers stream abstractions on which they can develop applications; Apache Beam provides an API abstraction, enabling developers to write code independent of the underlying framework, while tools such as Apache NiFi and StreamSets Data Collector provide a user interface abstraction, allowing data engineers to define data flows from high-level building blocks with little or no coding.

In this article, I'll propose a framework for organizing stream processing projects, and briefly describe each area. I’ll be focusing on organizing the projects into a conceptual model; there are many articles that compare the streaming frameworks for real-world applications – I list a few at the end.

Installing StreamSets Data Collector on Amazon Web Services EC2

Mike Fuller, a consultant at Red Pill Analytics, recently wrote Stream Me Up (to the Cloud), Scotty, a tutorial on installing StreamSets Data Collector (SDC) on Amazon Web Services EC2. Mike's article takes you all the way from logging in to a fresh EC2 instance to seeing your first pipeline in action. We're reposting it here courtesy of Mike and Red Pill.

Transform Data in StreamSets Data Collector

I've written quite a bit over the past few months about the more advanced aspects of data manipulation in StreamSets Data Collector (SDC) – writing custom processors, calling Java libraries from JavaScript, Groovy & Python, and even using Java and Scala with the Spark Evaluator. As a developer, it's always great fun to break out the editor and get to work, but we should be careful not to jump the gun. Just because you can solve a problem with code doesn't mean you should. Using SDC's built-in processor stages is not only easier than writing code, it also typically results in better performance. In this blog entry, I'll look at some of these stages, and the problems you can solve with them.

Drift Synchronization with StreamSets Data Collector and Azure Data Lake

One of the great things about StreamSets Data Collector is that its record-oriented architecture allows great flexibility in creating data pipelines – you can plug together pretty much any combination of origins, processors and destinations to build a data flow. After I wrote the Ingesting Local Data into Azure Data Lake Store tutorial, it occurred to me that the Azure Data Lake Store destination should work with the Hive Metadata processor and Hive Metastore destination to allow me to replicate schema changes from a data source such as a relational database into Apache Hive running on HDInsight. Of course, there is a world of difference between should and does, so I was quite apprehensive as I duplicated the pipeline that I used for the Ingesting Drifting Data into Hive and Impala tutorial and replaced the Hadoop FS destination with the Azure Data Lake Store equivalent.
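
To give a feel for what replicating a schema change involves, here is a minimal sketch in plain Python. It is not how the Hive Metadata processor or Hive Metastore destination are implemented; the function names, the type mapping and the `customers` table are all hypothetical, chosen only to illustrate the idea of comparing an incoming record against a known schema and emitting the Hive DDL for any new columns.

```python
from typing import Dict

# Hypothetical mapping from Python value types to Hive column types.
HIVE_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}


def new_columns(known_schema: Dict[str, str], record: Dict[str, object]) -> Dict[str, str]:
    """Return the fields present in the record but missing from the known schema."""
    added = {}
    for field, value in record.items():
        if field not in known_schema:
            # Fall back to STRING for any type we have no mapping for.
            added[field] = HIVE_TYPES.get(type(value), "STRING")
    return added


def alter_statement(table: str, added: Dict[str, str]) -> str:
    """Build a Hive ALTER TABLE ... ADD COLUMNS statement for the new fields."""
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in added.items())
    return f"ALTER TABLE `{table}` ADD COLUMNS ({cols})"


if __name__ == "__main__":
    schema = {"id": "BIGINT", "name": "STRING"}
    record = {"id": 1, "name": "Alice", "email": "alice@example.com"}
    drift = new_columns(schema, record)
    if drift:
        print(alter_statement("customers", drift))
```

A real pipeline would also need to handle type changes and partitioning, which this sketch ignores.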

Running Scala Code in StreamSets Data Collector

The Spark Evaluator, introduced in StreamSets Data Collector (SDC) version 2.2.0.0, lets you run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. Back in December, we released a tutorial walking you through the process of building a Transformer in Java. Since then, Maurin Lenglart, of Cuberon Labs, has contributed skeleton code for a Scala Transformer, paving the way for a new tutorial, Creating a StreamSets Spark Transformer in Scala.

Ingest Data into Azure Data Lake Store with StreamSets Data Collector

Azure Data Lake Store (ADLS) is Microsoft's cloud repository for big data analytic workloads, designed to capture data for operational and exploratory analytics. StreamSets Data Collector (SDC) version 2.3.0.0 included an Azure Data Lake Store destination, so you can create pipelines to read data from any supported data source and write it to ADLS.

Since configuring the ADLS destination is a multi-step process, our new tutorial, Ingesting Local Data into Azure Data Lake Store, walks you through the process of adding SDC as an application in Azure Active Directory, creating a Data Lake Store, building a simple data ingest pipeline, and then configuring the ADLS destination with credentials to write to an ADLS directory.

Replicating Relational Databases with StreamSets Data Collector

StreamSets Data Collector has long supported both reading and writing data from and to relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines to read data from individual tables, ingesting records from an entire database was cumbersome, requiring a pipeline per table. StreamSets Data Collector (SDC) 2.3.0.0 introduces the JDBC Multitable Consumer, a new pipeline origin that can read data from multiple tables through a single database connection. In this blog entry, I'll explain how the JDBC Multitable Consumer can implement a typical use case – replicating an entire relational database into Hadoop.
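
The idea of reading every table through one connection can be sketched in a few lines of plain Python. This is only an illustration, not the JDBC Multitable Consumer itself: it assumes an SQLite database and Python's standard `sqlite3` module rather than JDBC, and the `read_all_tables` function name is hypothetical.

```python
import sqlite3
from typing import Dict, Iterator, Tuple


def read_all_tables(conn: sqlite3.Connection) -> Iterator[Tuple[str, Dict[str, object]]]:
    """Yield (table_name, row_as_dict) for every user table, using one connection."""
    conn.row_factory = sqlite3.Row  # lets us address columns by name
    # Discover the tables from SQLite's catalog, skipping internal ones.
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    for table in tables:
        # Table names come from the catalog above, not from user input.
        for row in conn.execute(f'SELECT * FROM "{table}"'):
            yield table, dict(row)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
    conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
    conn.execute("INSERT INTO orders VALUES (10, 1)")
    for table, record in read_all_tables(conn):
        print(table, record)
```

Tagging each record with its source table is what lets a downstream stage route it to the matching target table.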

Ingesting data into Couchbase using StreamSets Data Collector

Nick Cadenhead, a Senior Consultant at 9th BIT Consulting in Johannesburg, South Africa, uses Couchbase Server to power analytics solutions for his clients. In this blog entry, reposted from his article at LinkedIn, Nick explains why he selected StreamSets Data Collector for data ingest, and how he extended it with a custom destination to write data to Couchbase.

Ingest Data into Splunk with StreamSets Data Collector

Splunk indexes and correlates log and machine data, providing a rich set of search, analysis and visualization capabilities. In this blog post, I'll explain how to efficiently send high volumes of data to Splunk's HTTP Event Collector via the StreamSets Data Collector Jython Evaluator. I'll present a Jython script with which you'll be able to build pipelines to read records from just about anywhere and send them to Splunk for indexing, analysis and visualization.
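
One reason batching helps with high volumes is that the HTTP Event Collector accepts several events in a single request, stacked as consecutive JSON objects in the body. The sketch below is not the Jython script from the post; it is plain Python that only builds the request body and headers, and the function names and the `sdc` sourcetype are assumptions made for illustration.

```python
import json
from typing import Dict, Iterable


def build_hec_batch(records: Iterable[dict], sourcetype: str = "sdc") -> str:
    """Stack one JSON event object per record into a single HEC request body."""
    return "".join(
        json.dumps({"event": record, "sourcetype": sourcetype}) for record in records
    )


def hec_headers(token: str) -> Dict[str, str]:
    """HEC authenticates with an 'Authorization: Splunk <token>' header."""
    return {"Authorization": "Splunk " + token}


if __name__ == "__main__":
    batch = build_hec_batch([{"a": 1}, {"a": 2}])
    print(batch)
    # The body would then be POSTed to the collector's event endpoint,
    # for example /services/collector/event on the HEC port of your Splunk host.
```

Sending one request per batch rather than one per record avoids paying the HTTP round-trip cost for every event.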

Calling External Java Code from Script Evaluators

When you're building a pipeline with StreamSets Data Collector (SDC), you can often implement the data transformations you require using a combination of 'off-the-shelf' processors. Sometimes, though, you need to write some code. The script evaluators included with SDC allow you to manipulate records in Groovy, JavaScript and Jython (an implementation of Python integrated with the Java platform). You can usually achieve your goal using built-in scripting functions, as in the credit card issuing network computation shown in the SDC tutorial, but, again, sometimes you need to go a little further. For example, a member of the StreamSets community Slack channel recently asked about computing SHA-3 digests in JavaScript. In this blog entry I'll show you how to do just this from Groovy, JavaScript and Jython.
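
For reference, this is what the computation itself looks like in standard Python 3.6 or later, where `hashlib` includes SHA-3. It is not the approach from the post: Jython implements Python 2.7, whose `hashlib` predates SHA-3, which is why an SDC script would call out to a Java library instead. The `add_digest` helper and the field names are hypothetical.

```python
import hashlib
from typing import Optional


def sha3_256_hex(value: str) -> str:
    """Return the SHA3-256 digest of a string as 64 hex characters."""
    return hashlib.sha3_256(value.encode("utf-8")).hexdigest()


def add_digest(record: dict, field: str, target: Optional[str] = None) -> dict:
    """Return a copy of the record with the digest of one field added."""
    target = target or field + "_sha3"
    result = dict(record)
    result[target] = sha3_256_hex(str(record[field]))
    return result


if __name__ == "__main__":
    print(add_digest({"id": 1, "email": "alice@example.com"}, "email"))
```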
