
StreamSets Data Integration Blog

Where change is welcome.

Efficient Splunk Ingest for Cybersecurity

April 17, 2018

Many StreamSets customers use Splunk to mine insights from machine-generated data such as server logs, but one problem they encounter with the default tools is that they have no way to filter the data they are forwarding. While Splunk is a great tool for searching and analyzing machine-generated data, particularly in cybersecurity use cases, it’s easy to fill it with redundant or irrelevant data, driving up costs without adding value. In addition, Splunk may not natively offer the types of analytics you prefer, so you might also need to send that data elsewhere.

In this blog entry I’ll explain how, with StreamSets Control Hub, we can build a topology of pipelines for efficient Splunk data ingestion to support cybersecurity and other domains, by sending only necessary and unique data to Splunk and routing other data to less expensive and/or more analytics-rich platforms.
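To make the idea concrete, here’s a hypothetical Python sketch of the kind of filtering such a pipeline performs before events reach Splunk: drop irrelevant events, skip duplicates, and forward the rest to Splunk’s HTTP Event Collector. The HEC URL, token, and event fields are placeholders, and this is an illustration of the routing logic rather than an actual Control Hub pipeline.

```python
# Hypothetical pre-Splunk filter: suppress noise and duplicates, forward the rest
# to the HTTP Event Collector. URL, token, and event shape are placeholders.
import hashlib
import json
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

seen = set()

def forward_if_relevant(event):
    if event.get("severity") == "DEBUG":
        return                                   # irrelevant: don't pay to index it
    digest = hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()
    if digest in seen:
        return                                   # redundant: already forwarded
    seen.add(digest)
    requests.post(
        HEC_URL,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        json={"event": event},
        timeout=10,
    )

forward_if_relevant({"severity": "ERROR", "host": "web-01", "msg": "login failure"})
```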

Fun with FileRefs – Manipulating Whole File Data

November 2, 2017

As well as parsing incoming data into records, many StreamSets Data Collector (SDC) origins can be configured to ingest Whole Files. The blog entry Whole File Transfer with StreamSets Data Collector provides a basic introduction to the concept.

Although the initial release of the Whole File feature did not allow file content to be accessed in the pipeline, we soon added the ability for Script Evaluator processors to read the file, a feature exploited in the custom processor tutorial to read metadata from incoming image files. In this blog post, I’ll show you how a custom processor can both create new records with Whole File content, and replace the content in existing records.
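To give a flavor of the kind of transformation involved, here is a small, self-contained Python sketch of a ROT13 transformation applied to whole-file content read from a stream. It stands in for what the custom processor does; it is plain Python, not the SDC FileRef API, and the names are illustrative only.

```python
# Toy stand-in for the processor's work: read whole-file content from a stream,
# apply ROT13, and hand back new content. Not the SDC API.
import codecs
from io import BytesIO

def rot13_content(input_stream):
    """Read the whole file's bytes and return the ROT13-encoded text."""
    text = input_stream.read().decode("utf-8")
    return codecs.encode(text, "rot_13")

original = BytesIO(b"Hello, StreamSets!")
print(rot13_content(original))   # -> "Uryyb, FgernzFrgf!"
```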

Evolving Avro Schemas with Apache Kafka and StreamSets Data Collector

October 25, 2017

Apache Avro is widely used in the Hadoop ecosystem for efficiently serializing data so that it may be exchanged between applications written in a variety of programming languages. Avro allows data to be self-describing; when data is serialized via Avro, its schema is stored with it. Applications reading Avro-serialized data at a later time read the schema and use it when deserializing the data.

While StreamSets Data Collector (SDC) frees you from the tyranny of schema, it can also work with tools that take a more rigid approach. In this blog, I’ll explain a little of how Avro works and how SDC can integrate with Confluent Schema Registry’s distributed Avro schema storage layer when reading and writing data to Apache Kafka topics and other destinations.
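As a quick, SDC-independent illustration of how Avro stores the writer’s schema alongside the data, and how a newer “reader” schema with a defaulted field can still read old records, here is a minimal sketch using the fastavro library; the record shape is invented for the example.

```python
# Minimal sketch (not SDC code): Avro embeds the writer's schema with the data,
# and a reader schema with a defaulted field can still resolve old records.
from io import BytesIO
from fastavro import writer, reader, parse_schema

schema_v1 = parse_schema({
    "type": "record", "name": "Shipment",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "destination", "type": "string"},
    ],
})

# Serialize with the old schema; the schema travels with the data.
buf = BytesIO()
writer(buf, schema_v1, [{"id": 1, "destination": "Oakland"}])

# A later consumer adds a field with a default, so old data still deserializes.
schema_v2 = parse_schema({
    "type": "record", "name": "Shipment",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "destination", "type": "string"},
        {"name": "priority", "type": "string", "default": "standard"},
    ],
})

buf.seek(0)
for record in reader(buf, reader_schema=schema_v2):
    print(record)   # {'id': 1, 'destination': 'Oakland', 'priority': 'standard'}
```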

Transform Data in StreamSets DataOps Platform

April 11, 2017

Do you need to transform data? The good news is that I’ve written quite a bit over the past few months about the more advanced aspects of data manipulation in StreamSets Data Collector Engine (SDC) – writing custom processors, calling Java libraries from JavaScript, Groovy & Python, and even using Java and Scala with the Spark Evaluator. As a developer, it’s always great fun to break out the editor and get to work, but we should be careful not to jump the gun. Just because you can solve a problem with code doesn’t mean you should. Using SDC’s built-in processor stages is not only easier than writing code, it typically results in better performance. In this blog entry, I’ll look at some of these stages and the problems you can solve with them.

Read and Write JSON to MapR DB with StreamSets Data Collector

March 5, 2017

MapR-DB is an enterprise-grade, high performance, NoSQL database management system. As a multi-model NoSQL database, it supports both JSON document models and wide column data models. MapR-DB stores JSON documents in tables; documents within a table in MapR-DB can have different structures. StreamSets Data Collector enables working with MapR-DB documents with its powerful schema-on-read and ingestion capability.

With StreamSets Data Collector, I’ll show you how easy it is to stream data from MongoDB into a MapR-DB table as well as stream data out of the MapR-DB table into MapR Streams.
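To sketch the shape of that flow outside of SDC, here’s a hypothetical Python snippet that reads JSON documents from MongoDB with pymongo and hands them to a MapR-DB sink. The write_to_maprdb() helper is a placeholder rather than a real MapR-DB API call, and the connection details are invented.

```python
# Hypothetical sketch (not the SDC pipeline itself): read JSON documents from
# MongoDB and pass them to a MapR-DB sink. write_to_maprdb() is a placeholder.
from pymongo import MongoClient

def write_to_maprdb(doc):
    """Placeholder for a MapR-DB JSON table insert (e.g. via an OJAI client)."""
    print("would insert:", doc)

client = MongoClient("mongodb://localhost:27017")
for doc in client["retail"]["orders"].find():
    doc["_id"] = str(doc["_id"])   # MapR-DB JSON tables use _id as the row key
    write_to_maprdb(doc)
```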

Replicating Relational Databases with StreamSets Data Collector

February 3, 2017

StreamSets Data Collector Engine has long supported both reading and writing data from and to relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines to read data from individual tables, ingesting records from an entire database was cumbersome, requiring a pipeline per table. StreamSets Data Collector Engine now introduces the JDBC Multitable Consumer, a new pipeline origin that can read data from multiple tables through a single database connection. In this blog entry, I’ll explain how the JDBC Multitable Consumer can implement a typical use case: replicating an entire relational database into Hadoop.
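The origin’s behavior is easiest to picture as “one connection, many tables.” Here is a rough, standalone Python illustration of that idea using SQLAlchemy; it is not how SDC implements the origin, and the connection string and schema name are placeholders.

```python
# Rough illustration of the "one connection, many tables" idea behind the
# JDBC Multitable Consumer. Connection string and schema are placeholders.
from sqlalchemy import create_engine, inspect, text

engine = create_engine("postgresql://user:pass@dbhost:5432/sales")
inspector = inspect(engine)

with engine.connect() as conn:
    for table in inspector.get_table_names(schema="public"):
        rows = conn.execute(text(f'SELECT * FROM public."{table}"'))
        for row in rows:
            # hand each row off to the destination (e.g. land it in Hadoop)
            print(table, dict(row._mapping))
```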

Data in Motion Evolution: Where We’ve Been…Where We Need to Go

January 17, 2017

Today we hear a lot about streaming data, fast data, and data in motion. But the truth is that we have always needed ways to move our data. Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to now, we have continued to adapt and create new data movement methods, even as the characteristics of the data and data processing architectures have dramatically changed.

Exerting firm control over data in motion is a critical competency that has become core to modern data integration and operations. Based on more than 20 years in enterprise data, here is my take on the past, present and future of data in motion.

Dynamic Outlier Detection with StreamSets and Cassandra

August 19, 2016

This blog post concludes a short series building up an IoT sensor testbed with StreamSets Data Collector (SDC), a Raspberry Pi and Apache Cassandra. Previously, I covered the earlier stages of the project.

To wrap up, I’ll show you how I retrieved statistics from Cassandra, fed them into SDC, and was able to filter out ‘outlier’ values.
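The core of the filtering step is a simple standard-deviation test. Here is a minimal Python sketch of that check, assuming the per-sensor mean and standard deviation have already been pulled from Cassandra; the field names and threshold are illustrative.

```python
# Minimal outlier check: a reading more than N standard deviations from the
# sensor's mean is treated as an outlier. Stats are assumed to come from
# Cassandra; values here are illustrative.
def is_outlier(reading, mean, stddev, sigmas=3.0):
    return stddev > 0 and abs(reading - mean) > sigmas * stddev

stats = {"sensor-1": (21.4, 0.8)}           # sensor_id -> (mean, stddev)
readings = [("sensor-1", 21.9), ("sensor-1", 35.0)]

for sensor_id, value in readings:
    mean, stddev = stats[sensor_id]
    if is_outlier(value, mean, stddev):
        print(f"dropping outlier from {sensor_id}: {value}")
    else:
        print(f"keeping {sensor_id}: {value}")
```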

Struggling with Bad Data? What to Do From the Enterprise

June 28, 2016

Last week we announced the results of a survey of over 300 enterprise data professionals conducted by Dimensional Research and sponsored by StreamSets. We were trying to understand the state of play in how enterprises manage their big data flows. What we discovered was an alarming issue: companies are struggling to detect bad data and keep it out of their stores.

There is a bad data problem within big data.

Bad Data: What to Do?

When we asked data pros about their challenges with big data flows, the most-cited issue was ensuring the quality of the data in terms of accuracy, completeness and consistency, getting votes from over ⅔ of respondents. Security and operations were also listed by more than half. The fact that quality was rated as a more common challenge than even security and compliance is quite telling, as you can usually count on security being voted the #1 challenge in most IT domains.
