skip to Main Content

StreamSets Data Integration Blog

Where change is welcome.

Visualizing and Analyzing Salesforce Data with Neo4j

By May 16, 2017

Cases in Neo4jGraph databases represent and store data in terms of nodes, edges and properties, allowing quick, easy retrieval of complex hierarchical structures that may be difficult to model in traditional relational databases. Neo4j is an open source graph database widely deployed in the community; in this blog entry I’ll show you how to use StreamSets Data Collector to read case data from Salesforce and load it into the graph database using Neo4j’s JDBC driver.

Calling External Libraries from the JavaScript Evaluator

By May 11, 2017

JavaScript logoThe Script Evaluators in StreamSets Data Collector (SDC) allow you to manipulate data in pretty much any way you please. I’ve already written about how you can call external Java code from your scripts – compiled Java code has great performance, but sometimes the code you need isn’t available in a JAR. Today I’ll show you how to call an external JavaScript library from the JavaScript Evaluator.

Quick Tip: Resolving ‘minReplication’ Hadoop FS Error

By May 3, 2017

minReplication ErrorI run StreamSets Data Collector on my MacBook Pro. In fact, I have about a dozen different versions installed – the latest, greatest 2.5.0.0, older versions, release candidates, and, of course, a development ‘master’ build that I hack on. Preparing for tonight’s St Louis Hadoop User Group Meetup, I downloaded Cloudera’s CDH 5.10 Quickstart VM so I could show our classic ‘Taxi Data Tutorial‘ and Drift Synchronization with Hadoop FS and Apache Hive. Spinning up the tutorial pipeline, I was surprised to see an error: HADOOPFS_13 - Error while writing to HDFS: com.streamsets.pipeline.api.StageException: HADOOPFS_58 - Flush failed on file: '/sdc/taxi/_tmp_sdc-847321ce-0acb-4574-8d2c-ff63529f25b8_0' due to 'org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /sdc/taxi/_tmp_sdc-847321ce-0acb-4574-8d2c-ff63529f25b8_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation. I’ll explain what this means, and how to resolve it, in this blog post.

Back To Top