Pat Patterson


Pat is StreamSets' Community Champion. An outgoing personality and self-described 'articulate techie', Pat is an accomplished international public speaker, presenting at events such as Dreamforce, JavaOne (both San Francisco and Tokyo), the RSA Conference, Defrag and more.

Visualizing and Analyzing Salesforce Data with Neo4j

Graph databases represent and store data in terms of nodes, edges and properties, allowing quick, easy retrieval of complex hierarchical structures that may be difficult to model in traditional relational databases. Neo4j is an open source graph database widely deployed in the community; in this blog entry I'll show you how to use StreamSets Data Collector (SDC) to read case data from Salesforce and load it into the graph database using Neo4j's JDBC driver.
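
To make the nodes, edges and properties model concrete, here is a minimal in-memory sketch in Python. It is not Neo4j, SDC, or the Salesforce API: the `Graph` class, the `CHILD_OF` relationship name and the sample case subjects are all hypothetical, chosen only to show how a case hierarchy can be walked by following edges rather than by repeatedly self-joining a table.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node: a label plus arbitrary key/value properties."""
    label: str
    properties: dict = field(default_factory=dict)


class Graph:
    """A tiny in-memory property graph (illustrative only, not Neo4j)."""

    def __init__(self):
        self.nodes = {}   # node id -> Node
        self.edges = []   # (source id, relationship type, target id)

    def add_node(self, node_id, label, **properties):
        self.nodes[node_id] = Node(label, properties)

    def add_edge(self, source, rel_type, target):
        self.edges.append((source, rel_type, target))

    def descendants(self, root, rel_type):
        """Return the ids of all nodes that reach `root` by following
        edges of `rel_type` (child -> parent), at any depth."""
        found = []
        seen = {root}          # guards against cycles
        stack = [root]
        while stack:
            current = stack.pop()
            for source, rel, target in self.edges:
                if rel == rel_type and target == current and source not in seen:
                    seen.add(source)
                    found.append(source)
                    stack.append(source)
        return found


def build_sample_graph():
    """Build a hypothetical three-level case hierarchy."""
    graph = Graph()
    graph.add_node("case-1", "Case", subject="Login fails")
    graph.add_node("case-2", "Case", subject="Password reset email missing")
    graph.add_node("case-3", "Case", subject="Email bounces")
    graph.add_edge("case-2", "CHILD_OF", "case-1")
    graph.add_edge("case-3", "CHILD_OF", "case-2")
    return graph


if __name__ == "__main__":
    graph = build_sample_graph()
    for case_id in graph.descendants("case-1", "CHILD_OF"):
        print(case_id, graph.nodes[case_id].properties["subject"])
```

A graph database performs this kind of traversal natively and at scale; the sketch only illustrates the shape of the data model.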

Calling External Libraries from the JavaScript Evaluator

The Script Evaluators in StreamSets Data Collector (SDC) allow you to manipulate data in pretty much any way you please. I've already written about how you can call external Java code from your scripts – compiled Java code has great performance, but sometimes the code you need isn't available in a JAR. Today I'll show you how to call an external JavaScript library from the JavaScript Evaluator.

Quick Tip: Resolving ‘minReplication’ Hadoop FS Error

I run StreamSets Data Collector on my MacBook Pro. In fact, I have about a dozen different versions installed – the latest, greatest, older versions, release candidates, and, of course, a development ‘master’ build that I hack on. Preparing for tonight's St Louis Hadoop User Group Meetup, I downloaded Cloudera's CDH 5.10 Quickstart VM so I could show our classic ‘Taxi Data Tutorial’ and Drift Synchronization with Hadoop FS and Apache Hive. Spinning up the tutorial pipeline, I was surprised to see an error: HADOOPFS_13 - Error while writing to HDFS: com.streamsets.pipeline.api.StageException: HADOOPFS_58 - Flush failed on file: '/sdc/taxi/_tmp_sdc-847321ce-0acb-4574-8d2c-ff63529f25b8_0' due to 'org.apache.hadoop.ipc.RemoteException( File /sdc/taxi/_tmp_sdc-847321ce-0acb-4574-8d2c-ff63529f25b8_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation. I'll explain what this means, and how to resolve it, in this blog post.

Create a Custom Expression Language Function for StreamSets Data Collector

One of the most powerful features in StreamSets Data Collector (SDC) is support for Expression Language, or ‘EL’ for short. EL was introduced in JavaServer Pages (JSP) 2.0 as a mechanism for accessing Java code from JSP. The Expression Evaluator and Stream Selector stages rely heavily on EL, but you can use EL in configuring almost every SDC stage. In this blog entry I'll explain a little about EL and show you how to write your own EL functions.

Creating a Custom Multithreaded Origin for StreamSets Data Collector

Multithreaded pipelines, introduced in StreamSets Data Collector (SDC) a couple of releases back, enable a single pipeline instance to process high volumes of data, taking full advantage of all available CPUs on the machine. In this blog entry I'll explain a little about how multithreaded pipelines work, and how you can implement your own multithreaded pipeline origin thanks to a new tutorial by Guglielmo Iozzia, Big Data Analytics Manager at Optum, part of UnitedHealth Group.

Making Sense of Stream Processing

There has been an explosion of innovation in open source stream processing over the past few years. Frameworks such as Apache Spark and Apache Storm give developers stream abstractions on which they can develop applications; Apache Beam provides an API abstraction, enabling developers to write code independent of the underlying framework, while tools such as Apache NiFi and StreamSets Data Collector provide a user interface abstraction, allowing data engineers to define data flows from high-level building blocks with little or no coding.

In this article, I'll propose a framework for organizing stream processing projects, and briefly describe each area. I’ll be focusing on organizing the projects into a conceptual model; there are many articles that compare the streaming frameworks for real-world applications – I list a few at the end.

Installing StreamSets Data Collector on Amazon Web Services EC2

Mike Fuller, a consultant at Red Pill Analytics, recently wrote Stream Me Up (to the Cloud), Scotty, a tutorial on installing StreamSets Data Collector (SDC) on Amazon Web Services EC2. Mike's article takes you all the way from logging in to a fresh EC2 instance to seeing your first pipeline in action. We're reposting it here courtesy of Mike and Red Pill.

Transform Data in StreamSets Data Collector

I've written quite a bit over the past few months about the more advanced aspects of data manipulation in StreamSets Data Collector (SDC) – writing custom processors, calling Java libraries from JavaScript, Groovy and Python, and even using Java and Scala with the Spark Evaluator. As a developer, it's always great fun to break out the editor and get to work, but we should be careful not to jump the gun. Just because you can solve a problem with code doesn't mean you should. Using SDC's built-in processor stages is not only easier than writing code, it typically results in better performance. In this blog entry, I'll look at some of these stages, and the problems you can solve with them.

Drift Synchronization with StreamSets Data Collector and Azure Data Lake

One of the great things about StreamSets Data Collector is that its record-oriented architecture allows great flexibility in creating data pipelines – you can plug together pretty much any combination of origins, processors and destinations to build a data flow. After I wrote the Ingesting Local Data into Azure Data Lake Store tutorial, it occurred to me that the Azure Data Lake Store destination should work with the Hive Metadata processor and Hive Metastore destination to allow me to replicate schema changes from a data source such as a relational database into Apache Hive running on HDInsight. Of course, there is a world of difference between should and does, so I was quite apprehensive as I duplicated the pipeline that I used for the Ingesting Drifting Data into Hive and Impala tutorial and replaced the Hadoop FS destination with the Azure Data Lake Store equivalent.

Running Scala Code in StreamSets Data Collector

The Spark Evaluator in StreamSets Data Collector (SDC) lets you run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. Back in December, we released a tutorial walking you through the process of building a Transformer in Java. Since then, Maurin Lenglart, of Cuberon Labs, has contributed skeleton code for a Scala Transformer, paving the way for a new tutorial, Creating a StreamSets Spark Transformer in Scala.
