StreamSets Data Integration Blog

Where change is welcome.

Ingest Game-Streaming Data from the Twitch API

May 25, 2018

Nikolay Petrachkov (Nik for short) is a BI developer in Amsterdam by day, but in his spare time he combines his passion for games and data engineering by building a project to analyze game-streaming data from Twitch. Nik discovered StreamSets Data Collector when he was looking for a way to build data pipelines that deliver insights from gaming data without having to write a ton of code. In this guest post, reposted from the original with his kind permission, Nik explains how he used StreamSets Data Collector to extract data about streams and games via the Twitch API. It’s a great example of applying enterprise DataOps principles to a fun use case. Over to you, Nik…
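As a taste of what such a pipeline consumes, here is a minimal Python sketch of polling the Twitch Helix streams endpoint. This is an illustration, not Nik’s SDC pipeline; the client ID and OAuth token are placeholder assumptions you would obtain by registering an app with Twitch.

```python
import requests

# Placeholder credentials -- assumptions for this sketch.
CLIENT_ID = "your-client-id"
OAUTH_TOKEN = "your-app-access-token"

def fetch_top_streams(limit=20):
    """Fetch the most-watched live streams from the Twitch Helix API."""
    resp = requests.get(
        "https://api.twitch.tv/helix/streams",
        headers={
            "Client-ID": CLIENT_ID,
            "Authorization": f"Bearer {OAUTH_TOKEN}",
        },
        params={"first": limit},  # page size; Helix paginates with a cursor
    )
    resp.raise_for_status()
    return resp.json()["data"]  # stream objects: game_id, viewer_count, ...

for stream in fetch_top_streams():
    print(stream["user_name"], stream["viewer_count"])
```

In SDC itself, the HTTP Client origin plays this polling role, so no hand-written code is needed.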

Efficient Splunk Ingest for Cybersecurity

April 17, 2018

Many StreamSets customers use Splunk to mine insights from machine-generated data such as server logs, but one problem they encounter with the default tools is that they have no way to filter the data they are forwarding. While Splunk is a great tool for searching and analyzing machine-generated data, particularly in cybersecurity use cases, it’s easy to fill it with redundant or irrelevant data, driving up costs without adding value. In addition, Splunk may not natively offer the types of analytics you prefer, so you might also need to send that data elsewhere.

In this blog entry I’ll explain how, with StreamSets Control Hub, we can build a topology of pipelines for efficient Splunk data ingestion in cybersecurity and other domains, sending only necessary and unique data to Splunk and routing the rest to less expensive or more analytics-rich platforms.
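To make the idea concrete, here is a hedged Python sketch of the filter-and-deduplicate step in isolation, posting only relevant, previously unseen events to a Splunk HTTP Event Collector (HEC) endpoint. The URL, token, and the "auth_failure" filter rule are assumptions for illustration; in the post itself this logic lives in SDC pipelines managed by Control Hub.

```python
import hashlib
import json
import requests

# Assumed Splunk HTTP Event Collector endpoint and token.
HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

seen = set()  # fingerprints of events already forwarded

def forward_if_relevant(event: dict) -> bool:
    """Send an event to Splunk only if it is security-relevant and unseen."""
    # Example filter rule (an assumption): only forward auth failures.
    if event.get("action") != "auth_failure":
        return False
    fingerprint = hashlib.sha256(
        json.dumps(event, sort_keys=True).encode()
    ).hexdigest()
    if fingerprint in seen:
        return False  # duplicate; drop it or route it to cheaper storage
    seen.add(fingerprint)
    resp = requests.post(
        HEC_URL,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        json={"event": event, "sourcetype": "_json"},
    )
    resp.raise_for_status()
    return True
```

Everything filtered out here would, in the Control Hub topology, flow down a second pipeline branch to a less expensive store rather than being discarded.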

Kafka + TLS/Kerberos in Cluster Streaming Mode is here!

March 29, 2018

When we first introduced cluster streaming mode with Apache Spark Streaming 1.3 and Apache Kafka 0.8 several years ago, Kafka didn’t support security features such as TLS (transport encryption, authentication) and Kerberos (authentication). In Spark 2.1, an updated Kafka connector was introduced with support for these features when used with Kafka 0.10 or newer.
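For a sense of what those client options look like, here is a minimal sketch using the kafka-python library rather than SDC configuration. The broker host and CA file path are placeholders, and a valid Kerberos ticket (for example, obtained via kinit) is assumed.

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Placeholder broker address and certificate path -- assumptions for the sketch.
consumer = KafkaConsumer(
    "secure-topic",
    bootstrap_servers=["broker1.example.com:9093"],
    security_protocol="SASL_SSL",           # TLS transport + SASL authentication
    sasl_mechanism="GSSAPI",                # Kerberos
    sasl_kerberos_service_name="kafka",     # must match the broker's principal
    ssl_cafile="/etc/security/ca-cert.pem", # CA that signed the broker certs
)

for message in consumer:
    print(message.topic, message.offset, message.value)
```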

Streaming Data from Twitter for Analysis in Spark

January 10, 2018

Happy New Year! Our first blog entry of 2018 is a guest post from Josh Janzen, a data scientist based in Minnesota. Josh wanted to ingest tweets referencing NFL games into Spark, then run some analysis to look for a correlation between Twitter activity and game winners. Josh originally posted this entry on his personal blog, and kindly allowed us to repost it here. Over to you, Josh:

’Tis the season of NFL football, and one way to capture the excitement is Twitter data. I’ve tinkered around with Twitter’s Developer API before, but this time I wanted to use a streaming product I’ve heard good things about: StreamSets Data Collector.

Evolving Avro Schemas with Apache Kafka and StreamSets Data Collector

October 25, 2017

Apache Avro is widely used in the Hadoop ecosystem for efficiently serializing data so that it may be exchanged between applications written in a variety of programming languages. Avro allows data to be self-describing; when data is serialized via Avro, its schema is stored with it. Applications reading Avro-serialized data at a later time read the schema and use it when deserializing the data.

While StreamSets Data Collector (SDC) frees you from the tyranny of schema, it can also work with tools that take a more rigid approach. In this blog, I’ll explain a little of how Avro works and how SDC can integrate with Confluent Schema Registry’s distributed Avro schema storage layer when reading and writing data to Apache Kafka topics and other destinations.
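To see the self-describing property in action, here is a small Python sketch using the fastavro library (illustrating Avro itself, not SDC or Schema Registry): the writer embeds the schema in the output, and a reader recovers it from the bytes alone.

```python
import io
from fastavro import parse_schema, reader, writer  # pip install fastavro

# A toy schema -- any reader can recover it from the serialized bytes.
schema = parse_schema({
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
})

buf = io.BytesIO()
writer(buf, schema, [{"url": "/home", "ts": 1508889600}])  # schema stored with data

buf.seek(0)
avro_reader = reader(buf)         # no schema supplied by the reader...
print(avro_reader.writer_schema)  # ...yet it is recovered from the file itself
for record in avro_reader:
    print(record)
```

Confluent Schema Registry refines this: schemas live in a central service, and each Kafka message carries only a compact schema ID instead of the full schema, which keeps per-message overhead low.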

Fast, Easy Access to Secure Kafka Clusters

August 28, 2017

It’s simple to connect StreamSets Data Collector (SDC) to Apache Kafka through the Kafka Consumer origin and Kafka Producer destination connectors. And because those connectors support all Kafka client options, including the secure Kafka (SSL and SASL) options, connecting to an SSL-enabled secure Kafka cluster is just as easy. In this blog post I’ll walk through the steps required.
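As a companion to the Kerberos example above, here is what a plain TLS connection looks like with the same kafka-python library; the host and file paths are placeholders, and mutual TLS (client certificates) is assumed for this sketch.

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Placeholder paths -- in practice these come from your cluster's CA setup.
consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers=["broker1.example.com:9093"],   # the broker's SSL listener
    security_protocol="SSL",
    ssl_cafile="/etc/security/ca-cert.pem",           # CA for the broker certs
    ssl_certfile="/etc/security/client-cert.pem",     # client cert (mutual TLS)
    ssl_keyfile="/etc/security/client-key.pem",
)
```

In SDC, these same property names and values go into the Kafka stage’s configuration rather than into code.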

Making Sense of Stream Processing

April 19, 2017

There has been an explosion of innovation in open source stream processing over the past few years. Frameworks such as Apache Spark and Apache Storm give developers stream abstractions on which they can develop applications; Apache Beam provides an API abstraction, enabling developers to write code independent of the underlying framework, while tools such as Apache NiFi and StreamSets Data Collector provide a user interface abstraction, allowing data engineers to define data flows from high-level building blocks with little or no coding.

In this article, I’ll propose a framework for organizing stream processing projects, and briefly describe each area. I’ll be focusing on organizing the projects into a conceptual model; there are many articles that compare the streaming frameworks for real-world applications – I list a few at the end.
