
StreamSets Data Integration Blog

Where change is welcome.

Using StreamSets and MapR Together in Docker

June 26, 2018

Today's guest blogger is Ian Downard, a Senior Developer Evangelist at MapR Technologies. Ian focuses on machine learning and data engineering, and recently documented how he brought together the MapR Persistent Application Client Container (PACC) with StreamSets Data Collector and Docker to build pipelines…
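Ian's post walks through the full MapR PACC setup; as a minimal, hypothetical starting point for the Docker side only, the sketch below uses the Docker SDK for Python to launch a plain StreamSets Data Collector container and expose its web UI. The image name and port are the standard ones for Data Collector, while the PACC-specific volume mounts and environment settings are covered in Ian's walkthrough and are not reproduced here.

    import docker  # pip install docker

    client = docker.from_env()

    # Start a vanilla StreamSets Data Collector container and publish its web UI port.
    # (The MapR PACC client libraries and their volume mounts are intentionally omitted.)
    container = client.containers.run(
        "streamsets/datacollector",     # official SDC image on Docker Hub
        detach=True,
        ports={"18630/tcp": 18630},     # SDC web UI at http://localhost:18630
        name="sdc",
    )
    print("Started", container.short_id)

Once the container is up, pipelines are built and run through the Data Collector web UI as usual.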

Streaming Extreme Data Made Simple with Kinetica and StreamSets

June 21, 2018

Kinetica, just one of dozens of origins and destinations supported by StreamSets Data Collector, is a distributed, in-memory, GPU-accelerated database designed for geospatial analysis, machine learning, predictive analytics, and other workloads requiring high-performance parallel processing. Mathew Hawkins, a Principal Solutions Architect at Kinetica, recently…

Ingest Game-Streaming Data from the Twitch API

May 25, 2018

Nikolay Petrachkov (Nik for short) is a BI developer in Amsterdam by day, but in his spare time, he combines his passion for games and data engineering by building a project to analyze game-streaming data from Twitch. Nik discovered StreamSets Data Collector when he was looking for a way to build data pipelines to deliver insights from gaming data without having to write a ton of code. In this guest post, reposted from the original with his kind permission, Nik explains how he used StreamSets Data Collector to extract data about streams and games via the Twitch API. It’s a great example of applying enterprise DataOps principles to a fun use case. Over to you, Nik…
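Nik's pipelines are built in Data Collector rather than in code, but as a rough, hypothetical illustration of the kind of request they issue, the Python sketch below pulls the current top live streams from the Twitch Helix API. The client ID and OAuth token are placeholders you would obtain from the Twitch developer console, and the exact authentication requirements depend on the current version of the Twitch API.

    import requests

    # Placeholders: register an application at dev.twitch.tv to obtain these.
    CLIENT_ID = "your-client-id"
    OAUTH_TOKEN = "your-oauth-token"

    def fetch_top_streams(limit=20):
        """Return the current top live streams from the Twitch Helix API."""
        resp = requests.get(
            "https://api.twitch.tv/helix/streams",
            headers={
                "Client-ID": CLIENT_ID,
                "Authorization": f"Bearer {OAUTH_TOKEN}",
            },
            params={"first": limit},
        )
        resp.raise_for_status()
        return resp.json()["data"]  # stream records: user_name, game_id, viewer_count, ...

    for stream in fetch_top_streams():
        print(stream["user_name"], stream["viewer_count"])

In Data Collector itself, a request like this would typically be configured in an HTTP Client origin and the response routed through the rest of the pipeline, rather than written by hand.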

DataOps: Applying DevOps to Data

May 18, 2018

The term DataOps is a contraction of ‘Data Operations’ and comes from applying DevOps to data. It seems to have been coined in a 2015 blog post by Tamr co-founder and CEO Andy Palmer. In this blog post, I’ll dive into what DataOps means today, and how enterprises can adopt its practice to create reliable, always-on dataflows using smart data pipelines to unlock the value of their data.

In his 2015 post, Palmer argued that the democratization of analytics and the implementation of “built-for-purpose” database engines created the need for DataOps. In addition to the two dynamics Palmer identified, a third has emerged: the need for analysis at the “speed of need”, which, depending on the use case, can be real-time, near-real-time or subject to some acceptable latency. Data must be made available broadly, via a more diverse set of data stores and analytic methods, and as quickly as the consuming user or application requires.

What’s driving these three dynamics is the strategic imperative that enterprises wield their data as a competitive weapon by making it available and consumable across numerous points of use; in short, that their data enables pervasive intelligence. The centralized discipline of SQL-driven business intelligence has been subsumed into a decentralized world of advanced analytics and machine learning. Pervasive intelligence lets “a thousand flowers bloom” in order to maximize business benefits from a company’s data, whether it be speeding product innovation, lowering costs through operational excellence or reducing corporate risk.

Efficient Splunk Ingest for Cybersecurity

April 17, 2018

Many StreamSets customers use Splunk to mine insights from machine-generated data such as server logs, but one problem they encounter is that the default forwarding tools offer no way to filter the data before it is sent. While Splunk is a great tool for searching and analyzing machine-generated data, particularly in cybersecurity use cases, it’s easy to fill it with redundant or irrelevant data, driving up costs without adding value. In addition, Splunk may not natively offer the types of analytics you prefer, so you might also need to send that data elsewhere.

In this blog entry I’ll explain how, with StreamSets Control Hub, we can build a topology of pipelines for efficient Splunk data ingestion in cybersecurity and other domains: send only necessary, unique data to Splunk and route everything else to less expensive and/or more analytics-rich platforms.
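The filtering and routing in the post is configured in Control Hub and Data Collector pipelines rather than written as code, but as a rough, hypothetical sketch of the underlying idea, the Python snippet below forwards only unique, security-relevant events to Splunk's HTTP Event Collector and hands everything else back for cheaper storage. The HEC URL, token and event-type rule are placeholders, not values from the post.

    import json
    import requests

    # Placeholders: point these at your Splunk HTTP Event Collector (HEC).
    HEC_URL = "https://splunk.example.com:8088/services/collector/event"
    HEC_TOKEN = "your-hec-token"

    # Example rule: only security-relevant event types justify Splunk's ingest cost.
    INTERESTING = {"failed_login", "privilege_escalation", "malware_alert"}

    def route(events):
        """Send unique, security-relevant events to Splunk; return the rest."""
        seen = set()
        elsewhere = []
        for event in events:
            key = (event.get("type"), event.get("host"), event.get("message"))
            if event.get("type") not in INTERESTING:
                elsewhere.append(event)   # route to less expensive or more analytics-rich platforms
            elif key not in seen:
                seen.add(key)             # drop duplicates before they reach Splunk
                requests.post(
                    HEC_URL,
                    headers={"Authorization": f"Splunk {HEC_TOKEN}"},
                    data=json.dumps({"event": event, "sourcetype": "_json"}),
                ).raise_for_status()
        return elsewhere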
