Dataflow Performance Blog

StreamSets Data Collector v2.5 Adds IoT, Spark, Performance and Scale

We’re thrilled to announce version 2.5 of StreamSets Data Collector, a major release that adds important functionality for the Internet of Things (IoT), high-performance database ingest, integration with Apache Spark, and integration into your enterprise infrastructure. You can download the latest open source release here.

This release has over 22 new features, 95 improvements and 150 bug fixes.

Reduce the Cost of Internet of Things

This release includes new connectors for MQTT and WebSocket. Both are offered as origins and destinations, so you can use StreamSets Data Collector (SDC) as your IoT ingestion gateway, potentially saving you substantial money compared with proprietary solutions, particularly metered solutions from cloud providers. Also included in this release is an HTTP Client destination that can write a batch of data to an HTTP endpoint.

We will soon be adding other IoT gateway connectors such as OPC-UA and CoAP. If you'd like to see other IoT-related functionality, please feel free to tell us through this survey.

High-performance Database Ingest

We have added a multithreaded, multitable JDBC origin to this release to speed up ingest from databases. Reading several tables in parallel increases throughput and shortens the time it takes to re-platform your data warehouse to Hadoop.
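To make the idea concrete, here is a minimal sketch of parallel, per-table ingest in plain Python. It is not how the JDBC origin is implemented; it only illustrates the pattern of giving each table its own worker and connection and reading in offset-ordered batches. The table names, the `id` offset column, and the helper names (`read_table`, `ingest_tables`, `make_demo_db`) are all hypothetical, and SQLite stands in for a real JDBC source.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_table(db_path, table, batch_size=2):
    """Read one table in offset-ordered batches on its own connection."""
    conn = sqlite3.connect(db_path)
    try:
        last_id, rows = 0, []
        while True:
            # Table names come from a trusted list in this sketch.
            batch = conn.execute(
                f"SELECT id, payload FROM {table} WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, batch_size),
            ).fetchall()
            if not batch:
                return table, rows
            rows.extend(batch)
            last_id = batch[-1][0]  # remember the offset for the next batch
    finally:
        conn.close()


def ingest_tables(db_path, tables, threads=4):
    """Read several tables concurrently, one worker per table at a time."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(lambda t: read_table(db_path, t), tables))


def make_demo_db(path):
    """Create a small demo database (hypothetical tables for illustration)."""
    conn = sqlite3.connect(path)
    for table, n in (("orders", 5), ("customers", 3)):
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, payload TEXT)")
        conn.executemany(
            f"INSERT INTO {table} (payload) VALUES (?)",
            [(f"{table}-{i}",) for i in range(n)],
        )
    conn.commit()
    conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "demo.db")
        make_demo_db(db)
        for name, rows in ingest_tables(db, ["orders", "customers"]).items():
            print(name, len(rows))
```

Each worker tracks its own offset, so a slow table does not hold up the others; in the sketch this is the property that lets throughput grow with the number of threads.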

Scale Up Processing

In the last release we introduced a scale-up architecture that allows pipelines to run in multithreaded mode. With this release we’ve added a few more origins that can use this scale-up architecture: the Elasticsearch origin, the JDBC Multitable origin, the Kinesis Consumer, and the WebSocket origin.

Spark Executor

As part of our Dataflow Trigger framework we have added a Spark Executor, which allows you to kick off anything from simple Spark jobs, such as format conversions and file compression, all the way to sophisticated machine learning algorithms. The Spark job can be run on YARN or on the Databricks cloud.

StreamSets can greatly simplify developing ingest pipelines when using the Databricks cloud: the Spark Executor can trigger a Spark job, or even a Databricks notebook.

Cluster Mode Spark Evaluator

The Spark Evaluator in StreamSets lets you inject Spark within the pipeline. If you are writing algorithms in Spark for machine learning, you need not worry about the semantics or plumbing of reading and writing data from a myriad of sources and destinations; the pipeline will give your Spark job an RDD to work with, and will similarly read out an RDD and guarantee delivery to whatever destinations you choose. When the pipeline is run in standalone mode, it uses Spark local mode, but if you run the pipeline in cluster streaming mode, it runs on the Spark cluster and your Spark transformation code will be able to see the data available to the entire cluster.

Elastic Origin for Easy Offload

The new Elasticsearch origin lets you get data out of an Elasticsearch index. Customers use this in architectures where they want to migrate data off Elastic into other databases or lower-cost systems such as Hadoop, move to cloud-hosted systems, or archive data to the cloud.

Improved Integration

Two new features make it easier to integrate StreamSets Data Collector into your data workflow. First, we have added the ability to pass parameters to a pipeline via an API or CLI, making it possible to have fine-grained control of pipelines from external scripts. Second, we have added support for alerting via Webhooks, meaning that you can trigger actions on external systems based on system, data and drift alerts within the pipeline.
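As an illustration of driving a pipeline from an external script, here is a minimal sketch that builds an HTTP request to start a pipeline with runtime parameters. It rests on several assumptions: that the start endpoint is `/rest/v1/pipeline/<pipelineId>/start`, that parameters are sent as a JSON object in the request body, that the `X-Requested-By` header is required, and that basic authentication is enabled. Check these against the REST API documentation of your own Data Collector instance. The pipeline ID, parameter name, and credentials below are hypothetical.

```python
import base64
import json
import urllib.request


def build_start_request(base_url, pipeline_id, params, user, password):
    """Build (but do not send) a request that starts a pipeline with parameters.

    The endpoint path and headers are assumptions; verify them for your version.
    """
    url = f"{base_url.rstrip('/')}/rest/v1/pipeline/{pipeline_id}/start"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(params).encode("utf-8"),  # runtime parameters as JSON
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Requested-By": "sdc",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical pipeline ID, parameter, and credentials.
    req = build_start_request(
        "http://localhost:18630", "myPipeline", {"TABLE": "orders"}, "admin", "admin"
    )
    print(req.get_method(), req.full_url)
    # To actually start the pipeline, send the request:
    # urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the script easy to test and lets a scheduler or wrapper decide when the call is actually made.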

Please be sure to check out the Release Notes for detailed information about this release. And download the Data Collector now.

Kirit Basu