May 2017

Visualizing and Analyzing Salesforce Data with Neo4j

By Pat Patterson May 16, 2017

Graph databases represent and store data in terms of nodes, edges and properties, allowing quick, easy retrieval of complex hierarchical structures that may be difficult to model in traditional relational databases. Neo4j is an open source graph database widely deployed in the community; in this blog entry I’ll show you how to use StreamSets Data Collector to read case data from Salesforce and load it into the graph database using Neo4j’s JDBC driver.

Calling External Libraries from the JavaScript Evaluator

Data Integration

By Pat Patterson May 11, 2017

The Script Evaluators in StreamSets Data Collector (SDC) allow you to manipulate data in pretty much any way you please. I’ve already written about how you can call external Java code from your scripts – compiled Java code has great performance, but sometimes the code you need isn’t available in a JAR. Today I’ll show you how to call an external JavaScript library from the JavaScript Evaluator.

Quick Tip: Resolving ‘minReplication’ Hadoop FS Error

Data Integration

By Pat Patterson May 3, 2017

I run StreamSets Data Collector on my MacBook Pro. In fact, I have about a dozen different versions installed – the latest, greatest 2.5.0.0, older versions, release candidates, and, of course, a development ‘master’ build that I hack on. Preparing for tonight’s St Louis Hadoop User Group Meetup, I downloaded Cloudera’s CDH 5.10 Quickstart VM so I could show our classic ‘Taxi Data Tutorial‘ and Drift Synchronization with Hadoop FS and Apache Hive. Spinning up the tutorial pipeline, I was surprised to see an error: HADOOPFS_13 - Error while writing to HDFS: com.streamsets.pipeline.api.StageException: HADOOPFS_58 - Flush failed on file: '/sdc/taxi/_tmp_sdc-847321ce-0acb-4574-8d2c-ff63529f25b8_0' due to 'org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /sdc/taxi/_tmp_sdc-847321ce-0acb-4574-8d2c-ff63529f25b8_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation. I’ll explain what this means, and how to resolve it, in this blog post.

StreamSets Data Integration Blog

Visualizing and Analyzing Salesforce Data with Neo4j

Calling External Libraries from the JavaScript Evaluator

Quick Tip: Resolving ‘minReplication’ Hadoop FS Error

Stay in Touch

Connect