
StreamSets Data Integration Blog

Where change is welcome.

Drift Synchronization with StreamSets Data Collector and Azure Data Lake

By March 6, 2017

One of the great things about StreamSets Data Collector is that its record-oriented architecture allows great flexibility in creating data pipelines – you can plug together pretty much any combination of origins, processors and destinations to build a data flow. After I wrote the Ingesting Local Data into Azure Data Lake Store tutorial, it occurred to me that the Azure Data Lake Store destination should work with the Hive Metadata processor and Hive Metastore destination to allow me to replicate schema changes from a data source such as a relational database into Apache Hive running on HDInsight. Of course, there is a world of difference between should and does, so I was quite apprehensive as I duplicated the pipeline that I used for the Ingesting Drifting Data into Hive and Impala tutorial and replaced the Hadoop FS destination with the Azure Data Lake Store equivalent.

Ingest Data into Splunk with StreamSets Data Collector

By January 18, 2017


UPDATE – Data Collector’s HTTP Client destination can send a single request per batch of records, providing an easier way to send data to Splunk than the Jython script evaluator. See the blog post, Efficient Splunk Ingest for Cybersecurity, for an example.

Splunk indexes and correlates log and machine data, providing a rich set of search, analysis and visualization capabilities. In this blog post, I’ll explain how to efficiently send high volumes of data to Splunk’s HTTP Event Collector via the StreamSets Data Collector Jython Evaluator. I’ll present a Jython script with which you’ll be able to build pipelines to read records from just about anywhere and send them to Splunk for indexing, analysis and visualization.
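To make the approach concrete, here’s a minimal standalone sketch of the HEC call itself, written in plain Python 2 so it also runs under Jython. Inside the Jython Evaluator you would adapt it to the evaluator’s record bindings; the host, port and token below are placeholders for your own Splunk configuration.

```python
# Minimal sketch: batch-post events to a Splunk HTTP Event Collector (HEC).
# Python 2 / Jython 2.7 style (urllib2). URL and token are placeholders.
import json
import urllib2

HEC_URL = 'https://splunk.example.com:8088/services/collector/event'  # placeholder
HEC_TOKEN = 'YOUR-HEC-TOKEN'                                          # placeholder

def send_to_splunk(records):
    # HEC accepts several events per request, each wrapped in an {"event": ...}
    # envelope and concatenated (newline-separated) in the request body.
    payload = '\n'.join(json.dumps({'event': r}) for r in records)
    request = urllib2.Request(
        HEC_URL,
        data=payload,
        headers={'Authorization': 'Splunk ' + HEC_TOKEN,
                 'Content-Type': 'application/json'})
    # Note: certificate handling for self-signed HEC endpoints is omitted here.
    return urllib2.urlopen(request).read()

if __name__ == '__main__':
    print(send_to_splunk([{'source': 'sdc', 'level': 'INFO', 'msg': 'hello'}]))
```

Sending one request per batch, rather than one per record, is what keeps the ingest efficient at high volumes.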

Continuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks

By December 19, 2016

I’m frequently asked, ‘How does StreamSets Data Collector integrate with Spark Streaming? How about on Databricks?’. In this blog entry, I’ll explain how to use Data Collector to ingest data into a Spark Streaming app running on Databricks, but the principles apply to Spark apps running anywhere. This is one solution for continuous data integration that can be used in cloud data platforms.
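As a rough sketch of the Spark side, and assuming the handoff is a directory of JSON files that Data Collector writes and the streaming job watches (the path and batch interval below are placeholders, not the exact setup from the post), the job looks something like this:

```python
# Minimal sketch of the Spark Streaming side, assuming Data Collector drops
# JSON files into a directory (for example an S3 or DBFS mount) that the
# streaming job monitors. Path and batch interval are placeholders.
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='sdc-ingest-demo')
ssc = StreamingContext(sc, batchDuration=30)  # 30-second micro-batches

# Each new file that appears in this directory becomes part of the next batch.
lines = ssc.textFileStream('/mnt/ingest/from-sdc/')  # placeholder path

# Parse the JSON records and report a per-batch count.
records = lines.map(json.loads)
records.count().pprint()

ssc.start()
ssc.awaitTermination()
```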

Running Apache Spark Code in StreamSets Data Collector

By December 8, 2016

New in StreamSets Data Collector (SDC) 2.2.0.0 is the Spark Evaluator, a processor stage that allows you to run an Apache Spark application, termed a Spark Transformer, as part of an SDC pipeline. With the Spark Evaluator, you can build a pipeline to ingest data from any supported origin, apply transformations, such as filtering and lookups, using existing SDC processor stages, and have the Spark Evaluator hand off the data to your Java or Scala code as a Spark Resilient Distributed Dataset (RDD). Your Spark Transformer can then operate on the records, creating an output RDD, which is passed through the remainder of the pipeline to any supported destination.
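The Transformer itself is written in Java or Scala against the SDC Spark API; purely to show the shape of the work it does, here’s a hypothetical filter-and-lookup over an RDD of records, written in Python for brevity rather than against the actual Transformer interface:

```python
# Toy illustration (not the SDC Transformer API): the kind of filter + lookup
# a Spark Transformer might apply to an RDD of records.
from pyspark import SparkContext

sc = SparkContext('local[2]', 'transformer-shape-demo')

lookup = {'US': 'United States', 'GB': 'United Kingdom'}  # hypothetical reference data

records = sc.parallelize([
    {'id': 1, 'country': 'US', 'amount': 12.5},
    {'id': 2, 'country': 'GB', 'amount': -3.0},   # will be filtered out
    {'id': 3, 'country': 'US', 'amount': 7.25},
])

transformed = (records
               .filter(lambda r: r['amount'] > 0)                                # drop bad records
               .map(lambda r: dict(r, country_name=lookup.get(r['country']))))   # enrich

print(transformed.collect())
```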

Creating a Post-Lambda World with Apache Kudu

By Arvind Prabhakar, September 23, 2016

Apache Kudu and Open Source StreamSets Data Collector Simplify Batch and Real-Time Processing

As originally posted on the Cloudera VISION Blog.

At StreamSets, we come across dataflow challenges for a variety of applications. Our product, StreamSets Data Collector, is an open-source any-to-any dataflow system that ensures all your data is safely delivered to the various systems of your choice. At its core is the ability to handle data drift, which allows these dataflow pipelines to evolve with your changing data landscape without incurring redesign costs.

This position at the front of the data pipeline has given us visibility into various use cases, and we have found that many applications rely on patched-together architectures to achieve their objectives.

Ingesting Drifting Data into Hive and Impala

By September 8, 2016

Importing data into Apache Hive is one of the most common use cases in big data ingest, but gets tricky when data sources ‘drift’, changing the schema or semantics of incoming data. Introduced in StreamSets Data Collector (SDC) 1.5.0.0, the Hive Drift Solution monitors the structure of incoming data, detecting schema drift and updating the Hive Metastore accordingly, allowing data to keep flowing. In this blog entry, I’ll give you an overview of the Hive Drift Solution and explain how you can try it out for yourself, today.
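To see why this matters, here’s a toy sketch of the drift-detection idea (not the Hive Drift Solution’s internals): compare an incoming record against the columns already known for a table and derive the DDL a metastore update would need. The table name and type mapping are hypothetical.

```python
# Toy sketch of schema-drift detection: compare incoming record fields to the
# known table columns and emit the ALTER TABLE a metastore update would need.
known_columns = {'id': 'INT', 'name': 'STRING'}

def hive_type(value):
    # Crude Python-to-Hive type mapping, just for the illustration.
    if isinstance(value, bool):
        return 'BOOLEAN'
    if isinstance(value, int):
        return 'INT'
    if isinstance(value, float):
        return 'DOUBLE'
    return 'STRING'

def drift_ddl(table, record):
    new_cols = {k: hive_type(v) for k, v in record.items() if k not in known_columns}
    if not new_cols:
        return None  # no drift, nothing to do
    cols = ', '.join('%s %s' % (k, t) for k, t in sorted(new_cols.items()))
    return 'ALTER TABLE %s ADD COLUMNS (%s)' % (table, cols)

record = {'id': 42, 'name': 'widget', 'price': 9.99, 'in_stock': True}
print(drift_ddl('sales.products', record))
# -> ALTER TABLE sales.products ADD COLUMNS (in_stock BOOLEAN, price DOUBLE)
```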

Dynamic Outlier Detection with StreamSets and Cassandra

By August 19, 2016

This blog post concludes a short series building up an IoT sensor testbed with StreamSets Data Collector (SDC), a Raspberry Pi and Apache Cassandra. Previously, I covered ingesting sensor data on the Raspberry Pi with SDC and retrieving pipeline metrics via the SDC REST API.

To wrap up, I’ll show you how I retrieved statistics from Cassandra, fed them into SDC, and was able to filter out ‘outlier’ values.
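As a rough sketch of that check, assuming a hypothetical sensor_stats table holding a per-sensor mean and standard deviation (the keyspace, table and column names are placeholders), the filter boils down to something like this:

```python
# Rough sketch of the outlier check: pull per-sensor statistics from Cassandra
# and flag readings more than N standard deviations from the mean.
# Keyspace, table and column names are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # Cassandra contact point (placeholder)
session = cluster.connect('sensors')      # hypothetical keyspace

def is_outlier(sensor_id, reading, sigmas=3.0):
    row = session.execute(
        'SELECT mean, stddev FROM sensor_stats WHERE sensor_id = %s',
        (sensor_id,)).one()
    if row is None or row.stddev == 0:
        return False
    return abs(reading - row.mean) > sigmas * row.stddev

print(is_outlier('living-room', 42.7))
```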

Retrieving Metrics via the StreamSets Data Collector REST API

By July 8, 2016

Last week, I explained how I was able to run StreamSets Data Collector (SDC) on a Raspberry Pi 3, ingesting sensor data and writing it to Cassandra. With that working, I wanted to show pipeline metrics on Adafruit’s awesome PiTFT Plus 2.8″ screen. In this blog post, I’ll explain how I wrote a Python app to retrieve pipeline metrics via the StreamSets Data Collector REST API and display them on the PiTFT Plus with pygame.
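As a rough sketch, the app boils down to polling a metrics endpoint and rendering the result with pygame. The URL, credentials, pipeline ID and metric key below are assumptions to verify against your own Data Collector.

```python
# Minimal sketch: poll pipeline metrics over the Data Collector REST API and
# draw a counter on a pygame surface. URL, credentials and metric key are
# assumptions to check against your SDC version's API.
import requests
import pygame

SDC_URL = 'http://localhost:18630/rest/v1/pipeline/my_pipeline/metrics?rev=0'  # placeholder
AUTH = ('admin', 'admin')  # default SDC credentials

def output_record_count():
    metrics = requests.get(SDC_URL, auth=AUTH,
                           headers={'X-Requested-By': 'metrics-display'}).json()
    # Assumed counter name; inspect the JSON your SDC returns to confirm.
    return metrics['counters']['pipeline.batchOutputRecords.counter']['count']

pygame.init()
screen = pygame.display.set_mode((320, 240))   # PiTFT Plus 2.8" resolution
font = pygame.font.Font(None, 48)

count = output_record_count()
screen.fill((0, 0, 0))
screen.blit(font.render('Records out: %d' % count, True, (0, 255, 0)), (10, 100))
pygame.display.flip()
```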

Ingesting Sensor Data on the Raspberry Pi with StreamSets Data Collector

By June 27, 2016

In the unlikely event you’re not familiar with the Raspberry Pi, it’s an ARM-based computer about the same size as a deck of playing cards. The latest iteration, Raspberry Pi 3, has a 1.2GHz ARMv8 CPU, 1GB of memory, and integrated Wi-Fi and Bluetooth, all for the same $35 price tag as the original Raspberry Pi released in 2012. Running a variety of operating systems, including Linux, it’s actually quite a capable little machine. How capable? Well, last week, I successfully ran StreamSets Data Collector (SDC) on one of mine (yes, I have several!), ingesting sensor data and sending it to Apache Cassandra for analysis.

Here’s how you can build your own Internet of Things testbed and ingest sensor data for around $50.
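To give a feel for the sensor side, here’s a minimal sketch (with a simulated reading standing in for real sensor hardware) that appends one JSON record per line to a file a Data Collector file-based origin can pick up; the output path is a placeholder.

```python
# Rough sketch of the sensor-side script: write one JSON reading per line to a
# file that a Data Collector file-based origin (e.g. File Tail) can consume.
# The temperature value is simulated; a real deployment would read the sensor.
import json
import random
import time

OUTPUT_FILE = '/tmp/sensor-readings.json'  # placeholder path

while True:
    reading = {
        'sensor_id': 'pi3-01',
        'timestamp': int(time.time() * 1000),
        'temperature_c': round(20.0 + random.uniform(-0.5, 0.5), 2),  # simulated
    }
    with open(OUTPUT_FILE, 'a') as f:
        f.write(json.dumps(reading) + '\n')
    time.sleep(5)
```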

Ingesting MQTT Traffic into Riak TS via RabbitMQ and StreamSets

By June 13, 2016

Riak KV is an open source, distributed, NoSQL key-value data store oriented towards high availability, fault tolerance and scalability. First released in 2009, Riak KV is in use at companies such as AT&T, Comcast and GitHub. Last October, Basho, the vendor behind Riak KV, announced Riak TS. Riak TS, another distributed, NoSQL data store, is optimized for time series data generated by sources such as IoT sensors. The recent release of Riak TS 1.3 as an open source product under the Apache V2 license got me thinking – how would I get sensor data into Riak TS? There is a range of client libraries, which communicate with the data store via protocol buffers, but connected devices tend to use protocols such as MQTT, an extremely lightweight publish/subscribe messaging transport. StreamSets to the rescue!
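To give a feel for the device side, here’s a minimal publisher sketch using the paho-mqtt client, pointed at a broker such as RabbitMQ with its MQTT plugin enabled. The broker host and topic name are placeholders.

```python
# Minimal sketch of an MQTT publisher (paho-mqtt) sending simulated sensor
# readings to a broker such as RabbitMQ with the MQTT plugin enabled.
# Broker host and topic are placeholders.
import json
import time
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect('rabbitmq.example.com', 1883)   # MQTT listener (placeholder host)
client.loop_start()

for _ in range(10):
    payload = json.dumps({
        'sensor_id': 'pi3-01',
        'timestamp': int(time.time() * 1000),
        'temperature_c': 21.3,   # simulated reading
    })
    client.publish('sensors/temperature', payload, qos=1)
    time.sleep(1)

client.loop_stop()
client.disconnect()
```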
