Dataflow Performance Blog

Efficient Splunk Ingest for Cybersecurity

Splunk logoMany StreamSets customers use Splunk to mine insights from machine-generated data such as server logs, but one problem they encounter with the default tools is that they have no way to filter the data that they are forwarding. While Splunk is a great tool for searching and analyzing machine-generated data, particularly in cybersecurity use cases, it's easy to fill it with redundant or irrelevant data, driving up costs without adding value. In addition, Splunk may not natively offer the types of analytics you prefer, so you might want to route that data to additional analytics platforms.

In this blog entry I’ll explain how, with StreamSets Control Hub, we can build a topology of pipelines to support cybersecurity and other domains, sending only necessary and unique data to Splunk while also routing data to less expensive and/or more analytics-rich platforms.

Pat PattersonEfficient Splunk Ingest for Cybersecurity
Read More

Kafka + TLS/Kerberos in Cluster Streaming Mode is coming!

Spark Streaming + Data Collector + Secure Kafka

When we first introduced cluster streaming mode with Apache Spark Streaming 1.3 and Apache Kafka 0.8 several years ago, Kafka didn’t support security features such as TLS (transport encryption, authentication) and Kerberos (authentication). In Spark 2.1, an updated Kafka connector was introduced with support for these features when used with Kafka 0.10 or newer.

Adam KunickiKafka + TLS/Kerberos in Cluster Streaming Mode is coming!
Read More

A Fun Example of Streaming Data into Minecraft

Angel AlvaradoAngel Alvarado is a senior software engineer at One Degree, a San Francisco-based non-profit, and also helps run the Molanco data engineering community. In his spare time, Angel enjoys playing Minecraft with his 11 year-old-cousin. Recently, Angel, found a fun way to combine his gaming with data engineering. This blog entry, reposted from the original with Angel's kind permission, picks up the story…

Data Engineering can get really complex really quick and being aware of the hundreds of tools and data platforms in the industry can get very overwhelming. The following project is about how to use three data engineering tools to visualize data in a video game, it aims to solve a common data engineering problem with a twist to make it fun and entertaining.

Pat PattersonA Fun Example of Streaming Data into Minecraft
Read More

Introducing StreamSets Data Protector

Detect, Secure and Govern Sensitive Data Upon Ingestion

StreamSets is excited to announce a new product for protecting data in motion. StreamSets Data Protector, as the name would imply, extends the value of the StreamSets DataOps Platform to enable users to detect, secure and govern sensitive data as it flows around your business. Leveraging StreamSets' unique “Dataflow Sensors”, Data Protector can spot and act on personally identifiable information (PII) at the point of ingest, further strengthening your ability to meet new and changing regulatory requirements such as the General Data Protection Regulation (GDPR) and Health Insurance Portability and Accountability Act (HIPAA).

ClarkeIntroducing StreamSets Data Protector
Read More

Managing Data Operations on the Edge

lightweight agentTogether, StreamSets Control Hub (SCH) and StreamSets Data Collector Edge (SDC Edge) allow you to create, deploy and run dataflow pipelines in an unprecedented variety of environments. In this short series of videos, I'll show you how to install SDC Edge on a Raspberry Pi, how to get started building edge pipelines with SCH's Pipeline Designer, and how SDC Edge and its big brother, StreamSets Data Collector (SDC) work together to move data all the way from IoT sensors to the heart of your data infrastructure.

Pat PattersonManaging Data Operations on the Edge
Read More

Using StreamSets Control Hub for Scalable Deployment via Kubernetes

StreamSets, Docker, KubernetesIn my previous blog entry, I explained how to spin up Data Collectors as Kubernetes deployments along with Dataflow Performance Manager. I recommended using a deployment with one replica as the design environment and a deployment with many replicas for execution. We recently announced StreamSets Control Hub which makes the Kubernetes integration way smoother! StreamSets Control Hub adds a Control Agent for Kubernetes that supports creating and managing Data Collector deployments and a Pipeline Designer that allows designing pipelines without having to install Data Collectors. In this blog, I will demonstrate how to take advantage of these features.



Hari NayakUsing StreamSets Control Hub for Scalable Deployment via Kubernetes
Read More

Streaming Data from Twitter for Analysis in Spark

FootballHappy New Year! Our first blog entry of 2018 is a guest post from Josh Janzen, a data scientist based in Minnesota. Josh wanted to ingest tweets referencing NFL games into Spark, then run some analysis to look for a correlation between Twitter activity and game winners. Josh originally posted this entry on his personal blog, and kindly allowed us to repost it here. Over to you, Josh:

Tis the season of NFL football, and one way to capture excitement is Twitter data. I’ve tickered around with Twitter’s Developer API before, but this time I wanted to use a streaming product I’ve heard good things about: StreamSets Data Collector.

Pat PattersonStreaming Data from Twitter for Analysis in Spark
Read More

Speed up Hive Data Retrieval using Spark, StreamSets and Predera

Predera logoIn this guest blog, Predera‘s Kiran Krishna Innamuri (Data Engineer), and Nazeer Hussain (Head of Platform Engineering and Services) focus on building a data pipeline to perform lookups or run queries on Hive tables with the Spark execution engine using StreamSets Data Collector and Predera’s custom Hive-JDBC lookup processor.

Pat PattersonSpeed up Hive Data Retrieval using Spark, StreamSets and Predera
Read More

Introducing StreamSets Control Hub

Control. We always want it, regularly don’t get it, yet in business it’s a must have to ensure things run as expected. Control is particularly critical when it comes to moving data around your company. Without it, it’s difficult to know where data is coming from, where it’s going and how it’s been manipulated (and by whom!) along the way. At StreamSets, we specialize in helping customers effectively move data around their business, from any source to any destination. Over time, we’ve observed that most organizations lack proper controls to adequately ensure that dataflow pipelines are built, executed and operated in a manner that meets the needs of the application or business process which they support.

ClarkeIntroducing StreamSets Control Hub
Read More

Generate your Avro Schema – Automatically!

Avro Logo
In a previous blog post, I explained how StreamSets Data Collector (SDC) can work with Apache Kafka and Confluent Schema Registry to handle data drift via Avro schema evolution. In that blog post, I mentioned SDC's Schema Generator processor; today I'll explain how you can use the Schema Generator to automatically create Avro schemas.

Pat PattersonGenerate your Avro Schema – Automatically!
Read More