
StreamSets Data Integration Blog

Where change is welcome.

Ingest Game-Streaming Data from the Twitch API

May 25, 2018

Nikolay Petrachkov (Nik for short) is a BI developer in Amsterdam by day, but in his spare time, he combines his passion for games and data engineering by building a project to analyze game-streaming data from Twitch. Nik discovered StreamSets Data Collector when he was looking for a way to build data pipelines to deliver insights from gaming data without having to write a ton of code. In this guest post, reposted from the original with his kind permission, Nik explains how he used StreamSets Data Collector to extract data about streams and games via the Twitch API. It’s a great example of applying enterprise DataOps principles to a fun use case. Over to you, Nik…
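To give a flavor of the data involved, here is a minimal Python sketch of pulling live-stream records from the Twitch Helix API. The endpoint and field names follow Twitch's public Helix documentation; the client ID and token are placeholders, and in the post itself Data Collector's stages do this work without hand-written code. The parsing step is split out so it runs against a stubbed payload:

```python
import json
import urllib.request

TWITCH_STREAMS_URL = "https://api.twitch.tv/helix/streams"

def fetch_streams(client_id, token, first=20):
    """Fetch one page of live streams from the Twitch Helix API (needs real credentials)."""
    req = urllib.request.Request(
        f"{TWITCH_STREAMS_URL}?first={first}",
        headers={"Client-ID": client_id, "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def top_streams(payload):
    """Flatten a Helix /streams payload into (user_name, viewer_count) pairs, busiest first."""
    rows = [(s["user_name"], s["viewer_count"]) for s in payload.get("data", [])]
    return sorted(rows, key=lambda r: r[1], reverse=True)

# A stubbed payload in the Helix response shape, so the parsing logic runs offline:
sample = {"data": [
    {"user_name": "streamer_a", "game_id": "33214", "viewer_count": 1200},
    {"user_name": "streamer_b", "game_id": "509658", "viewer_count": 4800},
]}
print(top_streams(sample))
```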

DataOps: Applying DevOps to Data

May 18, 2018

The term DataOps is a contraction of ‘Data Operations’ and comes from applying DevOps to data. It seems to have been coined in a 2015 blog post by Tamr co-founder and CEO Andy Palmer. In this blog post, I’ll dive into what DataOps means today, and how enterprises can adopt its practices to create reliable, always-on dataflows using smart data pipelines to unlock the value of their data.

In his 2015 post, Palmer argued that the democratization of analytics and the implementation of “built-for-purpose” database engines created the need for DataOps. In addition to the two dynamics Palmer identified, a third has emerged: the need for analysis at the “speed of need”, which, depending on the use, can be real-time, near-real-time or with some acceptable latency. Data must be made available broadly, via a more diverse set of data stores and analytic methods, and as quickly as required by the consuming user or application.

What’s driving these three dynamics is the strategic imperative that enterprises wield their data as a competitive weapon by making it available and consumable across numerous points of use, in short, that their data enables pervasive intelligence. The centralized discipline of SQL-driven business intelligence has been subsumed into a decentralized world of advanced analytics and machine learning. Pervasive intelligence lets “a thousand flowers bloom” in order to maximize business benefits from a company’s data, whether it be speeding product innovation, lowering costs through operational excellence or reducing corporate risk.

Automating Pipeline Development with the StreamSets SDK for Python

May 15, 2018

When it comes to creating and managing your smart data pipelines, the graphical user interfaces of StreamSets Control Hub and StreamSets Data Collector Engine put the complete power of our robust Data Operations Platform at your fingertips. There are times, however, when a more programmatic approach may be needed, and those times will be significantly more enjoyable with the release of version 3.2.0 of the StreamSets SDK for Python. In this post, I’ll describe some of the SDK’s new functionality and show examples of how you can use it to enable your own data use cases.
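As a rough sketch of that programmatic approach, a trivial pipeline might be assembled as below. The URL, credentials, and stage labels are illustrative, and the exact calls may differ between SDK versions, so treat this as orientation rather than a reference; the `streamsets` import is deferred so the sketch loads even without the SDK installed:

```python
def build_hello_pipeline(sdc_url="http://localhost:18630",
                         username="admin", password="admin"):
    """Build and publish a trivial dev pipeline via the StreamSets SDK for Python.

    The URL, credentials, and stage labels here are placeholders; adjust them
    for your Data Collector instance and SDK version.
    """
    from streamsets.sdk import DataCollector  # deferred: requires `pip install streamsets`

    sdc = DataCollector(sdc_url, username=username, password=password)
    builder = sdc.get_pipeline_builder()

    # Wire a generator stage into a sink; `>>` connects one stage's output to the next.
    origin = builder.add_stage('Dev Raw Data Source')
    origin.raw_data = '{"greeting": "hello"}'
    destination = builder.add_stage('Trash')
    origin >> destination

    pipeline = builder.build(title='Hello SDK Pipeline')
    sdc.add_pipeline(pipeline)
    return pipeline

# Calling build_hello_pipeline() against a running Data Collector would
# register the pipeline; here we only define the function.
```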

StreamSets Announces Control Hub version 3.2

May 14, 2018

Today we are pleased to announce the general availability of StreamSets Control Hub version 3.2. StreamSets has built the industry’s only DataOps platform. We call it DataOps because our platform makes it easy to iteratively update dataflows when technology changes…

Using StreamSets Control Hub with Minikube

April 26, 2018

Hari Nayak's recent blog post provides a quickstart for using StreamSets Control Hub to deploy multiple instances of StreamSets Data Collector on Google's Kubernetes Engine (GKE). This post modifies the core scripts from that project to run on Minikube rather than GKE. As Minikube can run…

Efficient Splunk Ingest for Cybersecurity

April 17, 2018

Many StreamSets customers use Splunk to mine insights from machine-generated data such as server logs, but one problem they encounter with the default tools is that they have no way to filter the data that they are forwarding. While Splunk is a great tool for searching and analyzing machine-generated data, particularly in cybersecurity use cases, it’s easy to fill it with redundant or irrelevant data, driving up costs without adding value. In addition, Splunk may not natively offer the types of analytics you prefer, so you might also need to send that data elsewhere.

In this blog entry I’ll explain how, with StreamSets Control Hub, we can build a topology of pipelines for efficient Splunk data ingestion to support cybersecurity and other domains, by sending only necessary and unique data to Splunk and routing other data to less expensive and/or more analytics-rich platforms.
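The routing idea at the heart of that topology can be sketched in a few lines of plain Python. The field names, severity levels, and route labels below are invented for illustration; in Data Collector you would express the same logic with routing and deduplication stages rather than code:

```python
# A toy router: keep only security-relevant, previously unseen events for Splunk,
# and divert everything else to a cheaper store. Field names are illustrative.
seen_fingerprints = set()

def route_event(event):
    """Return 'splunk' for unique security-relevant events, 'archive' otherwise."""
    fingerprint = (event.get("host"), event.get("message"))
    if event.get("severity") in {"warn", "error", "critical"}:
        if fingerprint not in seen_fingerprints:
            seen_fingerprints.add(fingerprint)
            return "splunk"
    return "archive"

events = [
    {"host": "web01", "severity": "error", "message": "auth failure for root"},
    {"host": "web01", "severity": "error", "message": "auth failure for root"},  # duplicate
    {"host": "web02", "severity": "info",  "message": "health check ok"},
]
print([route_event(e) for e in events])  # ['splunk', 'archive', 'archive']
```

Only the first copy of the error reaches the expensive index; the duplicate and the routine health check go to the cheaper route.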

Kafka + TLS/Kerberos in Cluster Streaming Mode is here!

March 29, 2018

Spark Streaming + Data Collector + Secure Kafka

When we first introduced cluster streaming mode with Apache Spark Streaming 1.3 and Apache Kafka 0.8 several years ago, Kafka didn’t support security features such as TLS (transport encryption, authentication) and Kerberos (authentication). In Spark 2.1, an updated Kafka connector was introduced with support for these features when used with Kafka 0.10 or newer.
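For orientation, the security side of such a deployment comes down to a handful of standard Kafka 0.10+ client properties like the following. The paths, password, and service name are placeholders, and the Kerberos credentials themselves come from a JAAS configuration supplied to the client:

```properties
# Encrypt in transit with TLS and authenticate with Kerberos (SASL/GSSAPI)
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
# Trust the broker's certificate chain
ssl.truststore.location=/etc/security/kafka.client.truststore.jks
ssl.truststore.password=changeit
```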

A Fun Example of Streaming Data into Minecraft

March 27, 2018

Angel Alvarado is a senior software engineer at One Degree, a San Francisco-based non-profit, and also helps run the Molanco data engineering community. In his spare time, Angel enjoys playing Minecraft with his 11-year-old cousin. Recently, Angel found a fun way to combine his gaming with data engineering. This blog entry, reposted from the original with Angel’s kind permission, picks up the story…

Data engineering can get really complex really quickly, and keeping track of the hundreds of tools and data platforms in the industry can be overwhelming. The following project shows how to use three data engineering tools to visualize data in a video game; it aims to solve a common data engineering problem with a twist that makes it fun and entertaining.
