Dataflow Performance Blog

Scaling out StreamSets with Kubernetes

StreamSets, Docker, KubernetesIn today’s microservice revolution, where software applications are designed as independent services that work together, two technologies stand out. Docker, the defacto standard for containerization, and Kubernetes, a container orchestration and management tool. In this blog I will explain how to run StreamSets Data Collector (SDC) Docker containers on Kubernetes.

Hari NayakScaling out StreamSets with Kubernetes
Read More

Cache Salesforce Data in Redis with StreamSets Data Collector

Redis LogoRedis is an open-source, in-memory, NoSQL database implementing a networked key-value store with optional persistence to disk. Perhaps the most popular key-value database, Redis is widely used for caching web pages, sessions and other objects that require blazingly fast access – lookups are typically in the millisecond range.

At RedisConf 2017 I presented a session, Cache All The Things! Data Integration via Jedis (slides), looking at how the open source Jedis library provides a small, sane, easy to use Java interface to Redis, and how a StreamSets Data Collector (SDC) pipeline can read data from a platform such as Salesforce, write it to Redis via Jedis, and keep Redis up-to-date by subscribing for notifications of changes in Salesforce, writing new and updated data to Redis. In this blog entry, I'll describe how I built the SDC pipeline I showed during my session.

Pat PattersonCache Salesforce Data in Redis with StreamSets Data Collector
Read More

Triggering Databricks Notebook Jobs from StreamSets Data Collector

S3 and DatabricksLast December, I covered Continuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks. In StreamSets Data Collector (SDC) version 2.5.0.0 we added the Spark Executor, allowing your pipelines to trigger a Spark application, running on Apache YARN or Databricks. I'm going to cover the latter in this blog post, showing you how to trigger a notebook job on Databricks from events in a pipeline, generating analyses and visualizations on demand.

Pat PattersonTriggering Databricks Notebook Jobs from StreamSets Data Collector
Read More

Introducing the Data Collector Support Bundle

Hi, my name is Wagner Camarao and I'm a Software Engineer at StreamSets focusing on the user-facing aspects of our products. Today I'm going to talk about a new feature in the StreamSets Data Collector to optimize the interactions with our support team.

In version 2.6.0.0 of Data Collector, we’ve added a feature called Support Bundle. It allows you to generate an archive file with the most common information required to troubleshoot various issues with Data Collector, such as precise build information, configuration, thread dump, pipeline definitions and history files, and most recent log files.

Wagner CamaraoIntroducing the Data Collector Support Bundle
Read More

Announcing Data Collector ver 2.6.0.0

We are excited to announce version 2.6 of StreamSets Data Collector. This release has important functionality focused on helping customers to modernize their enterprise data warehouses on Hadoop, CyberSecurity, IoT and Spark.

You can download the latest open source release here.

This release has 6 new features, 20 improvements and 72 bug fixes. For a full list, see What's New. For a list of bug fixes and known issues, see the Release Notes.

Kirit BasuAnnouncing Data Collector ver 2.6.0.0
Read More

Embrace Diversity in Your Data Architecture

Many Roads Lead to Rome

Over the last ten years, the data management landscape has changed dramatically — on that, I think we can all agree. The rise of big data and the new data management ecosystem has created an abundance of new patterns and tools, each of which is more specialized than the last. With each new iteration, engineers and architects face pressure from all sides to simplify and consolidate.

But counter-intuitively, the best data architects embrace infrastructure diversity rather than fight it. The reality is that all of these tools and patterns have important uses in enterprise data architecture, and that today Kafka is no more the cure to all that ails than MapReduce was five years ago. The most sophisticated enterprises enable each business unit to use best-of-breed technology to succeed while facilitating seamless integration between them, creating agility and avoiding the chaos that often arises across legacy environments.

Jonathan NatkinsEmbrace Diversity in Your Data Architecture
Read More

Visualizing and Analyzing Salesforce Data with Neo4j

Cases in Neo4jGraph databases represent and store data in terms of nodes, edges and properties, allowing quick, easy retrieval of complex hierarchical structures that may be difficult to model in traditional relational databases. Neo4j is an open source graph database widely deployed in the community; in this blog entry I'll show you how to use StreamSets Data Collector (SDC) to read case data from Salesforce and load it into the graph database using Neo4j's JDBC driver.

Pat PattersonVisualizing and Analyzing Salesforce Data with Neo4j
Read More

Calling External Libraries from the JavaScript Evaluator

JavaScript logoThe Script Evaluators in StreamSets Data Collector (SDC) allow you to manipulate data in pretty much any way you please. I've already written about how you can call external Java code from your scripts – compiled Java code has great performance, but sometimes the code you need isn't available in a JAR. Today I'll show you how to call an external JavaScript library from the JavaScript Evaluator.

Pat PattersonCalling External Libraries from the JavaScript Evaluator
Read More

Quick Tip: Resolving ‘minReplication’ Hadoop FS Error

minReplication ErrorI run StreamSets Data Collector on my MacBook Pro. In fact, I have about a dozen different versions installed – the latest, greatest 2.5.0.0, older versions, release candidates, and, of course, a development ‘master' build that I hack on. Preparing for tonight's St Louis Hadoop User Group Meetup, I downloaded Cloudera's CDH 5.10 Quickstart VM so I could show our classic ‘Taxi Data Tutorial‘ and Drift Synchronization with Hadoop FS and Apache Hive. Spinning up the tutorial pipeline, I was surprised to see an error: HADOOPFS_13 - Error while writing to HDFS: com.streamsets.pipeline.api.StageException: HADOOPFS_58 - Flush failed on file: '/sdc/taxi/_tmp_sdc-847321ce-0acb-4574-8d2c-ff63529f25b8_0' due to 'org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /sdc/taxi/_tmp_sdc-847321ce-0acb-4574-8d2c-ff63529f25b8_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation. I'll explain what this means, and how to resolve it, in this blog post.

Pat PattersonQuick Tip: Resolving ‘minReplication’ Hadoop FS Error
Read More

Create a Custom Expression Language Function for StreamSets Data Collector

Custom EL SnapshotOne of the most powerful features in StreamSets Data Collector (SDC) is support for Expression Language, or ‘EL' for short. EL was introduced in JavaServer Pages (JSP) 2.0 as a mechanism for accessing Java code from JSP. The Expression Evaluator and Stream Selector stages rely heavily on EL, but you can use EL in configuring almost every SDC stage. In this blog entry I'll explain a little about EL and show you how to write your own EL functions.

Pat PattersonCreate a Custom Expression Language Function for StreamSets Data Collector
Read More