Engineering

Using StreamSets Control Hub for Scalable Deployment via Kubernetes

In my previous blog entry, I explained how to spin up Data Collectors as Kubernetes deployments along with Dataflow Performance Manager. I recommended using a deployment with one replica as the design environment and a deployment with many replicas for execution. We recently announced StreamSets Control Hub, which makes the Kubernetes integration much smoother! StreamSets Control Hub adds a Control Agent for Kubernetes that supports creating and managing Data Collector deployments, and a Pipeline Designer that allows designing pipelines without having to install Data Collectors. In this blog, I will demonstrate how to take advantage of these features.

Hari Nayak

Streaming Data from Twitter for Analysis in Spark

Happy New Year! Our first blog entry of 2018 is a guest post from Josh Janzen, a data scientist based in Minnesota. Josh wanted to ingest tweets referencing NFL games into Spark, then run some analysis to look for a correlation between Twitter activity and game winners. Josh originally posted this entry on his personal blog, and kindly allowed us to repost it here. Over to you, Josh:

’Tis the season of NFL football, and one way to capture excitement is Twitter data. I’ve tinkered around with Twitter’s Developer API before, but this time I wanted to use a streaming product I’ve heard good things about: StreamSets Data Collector.

Pat Patterson

Generate your Avro Schema – Automatically!

In a previous blog post, I explained how StreamSets Data Collector (SDC) can work with Apache Kafka and Confluent Schema Registry to handle data drift via Avro schema evolution. In that blog post, I mentioned SDC's Schema Generator processor; today I'll explain how you can use the Schema Generator to automatically create Avro schemas.

Pat Patterson

Fun with FileRefs – Manipulating Whole File Data

As well as parsing incoming data into records, many StreamSets Data Collector (SDC) origins can be configured to ingest Whole Files. The blog entry Whole File Transfer with StreamSets Data Collector provides a basic introduction to the concept.

Although the initial release of the Whole File feature did not allow file content to be accessed in the pipeline, we soon added the ability for Script Evaluator processors to read the file, a feature exploited in the custom processor tutorial to read metadata from incoming image files. In this blog post, I'll show you how a custom processor can both create new records with Whole File content, and replace the content in existing records.

Pat Patterson

Evolving Avro Schemas with Apache Kafka and StreamSets Data Collector

Apache Avro is widely used in the Hadoop ecosystem for efficiently serializing data so that it may be exchanged between applications written in a variety of programming languages. Avro allows data to be self-describing; when data is serialized via Avro, its schema is stored with it. Applications reading Avro-serialized data at a later time read the schema and use it when deserializing the data.

While StreamSets Data Collector (SDC) frees you from the tyranny of schema, it can also work with tools that take a more rigid approach. In this blog, I'll explain a little of how Avro works and how SDC can integrate with Confluent Schema Registry’s distributed Avro schema storage layer when reading and writing data to Apache Kafka topics and other destinations.
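
To make the idea of a schema concrete: Avro schemas are defined in JSON. The following is a minimal sketch, not taken from the post itself, that builds a record schema as a Python dict and prints it as JSON. The record name, namespace, and field names are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical record schema; the name, namespace, and fields are
# illustrative only and do not come from the original post.
tweet_schema = {
    "type": "record",
    "name": "Tweet",
    "namespace": "com.example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "text", "type": "string"},
        # A field added later with a default value: readers using this
        # schema can still read data that was written without the field.
        {"name": "lang", "type": ["null", "string"], "default": None},
    ],
}

# Avro schemas are plain JSON, so the standard library can serialize them.
schema_json = json.dumps(tweet_schema, indent=2)
print(schema_json)
```

A schema like this is what travels with Avro-serialized data, or what a schema registry stores on the data's behalf.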

Pat Patterson

Getting Started with Cloudera’s Cybersecurity Solution (feat. StreamSets, Arcadia Data and Centrify)

This post was originally published on the Cloudera VISION blog by Sam Heywood. StreamSets configurations and images of Apache Spot Open Data Model ingest pipelines can be found here on GitHub.

A quick conversation with most Chief Information Security Officers (CISOs) reveals they understand that they need to modernize their security architecture, and that the correct answer is to adopt a machine learning and analytics platform as a fundamental and durable part of their data strategy. However, many CISOs fear that deploying an initial use case will be daunting. Cloudera has partnered with Arcadia Data and StreamSets to make it easier than ever for CISOs to take the first step and deploy basic use cases leveraging data sources common to many environments.

Rick Bilodeau

Fast, Easy Access to Secure Kafka Clusters

It’s simple to connect StreamSets Data Collector (SDC) to Apache Kafka through the Kafka Consumer Origin and Kafka Producer Destination connectors. And because those connectors support all Kafka Client options, including the secure Kafka (SSL and SASL) options, connecting to an SSL-enabled secure Kafka cluster is just as easy. In this blog post I'll walk through the steps required.
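
As a rough illustration of the kind of Kafka client options involved, here is a minimal sketch that renders a few standard SSL-related client properties as key=value lines. The property names are standard Kafka client settings; the truststore path and password are placeholders, and `to_properties` is a hypothetical helper written for this example. Where these values are entered in SDC is covered in the full post, not here.

```python
def to_properties(options):
    """Render a dict of Kafka client options as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in options.items())


# Standard Kafka client SSL options. The path and password below are
# placeholders (assumptions); substitute the values for your cluster.
ssl_options = {
    "security.protocol": "SSL",
    "ssl.truststore.location": "/path/to/client.truststore.jks",
    "ssl.truststore.password": "changeit",
}

properties_text = to_properties(ssl_options)
print(properties_text)
```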

Hari Nayak

Scaling out StreamSets with Kubernetes

In today’s microservice revolution, where software applications are designed as independent services that work together, two technologies stand out: Docker, the de facto standard for containerization, and Kubernetes, a container orchestration and management tool. In this blog I will explain how to run StreamSets Data Collector (SDC) Docker containers on Kubernetes.

Hari Nayak

Cache Salesforce Data in Redis with StreamSets Data Collector

Redis is an open-source, in-memory, NoSQL database implementing a networked key-value store with optional persistence to disk. Perhaps the most popular key-value database, Redis is widely used for caching web pages, sessions and other objects that require blazingly fast access – lookups are typically in the millisecond range.

At RedisConf 2017 I presented a session, Cache All The Things! Data Integration via Jedis (slides), looking at how the open source Jedis library provides a small, sane, easy to use Java interface to Redis, and how a StreamSets Data Collector (SDC) pipeline can read data from a platform such as Salesforce, write it to Redis via Jedis, and keep Redis up-to-date by subscribing for notifications of changes in Salesforce, writing new and updated data to Redis. In this blog entry, I'll describe how I built the SDC pipeline I showed during my session.

Pat Patterson

Introducing the Data Collector Support Bundle

Hi, my name is Wagner Camarao and I'm a Software Engineer at StreamSets focusing on the user-facing aspects of our products. Today I'm going to talk about a new feature in the StreamSets Data Collector to optimize the interactions with our support team.

In version 2.6.0.0 of Data Collector, we’ve added a feature called Support Bundle. It allows you to generate an archive file with the most common information required to troubleshoot various issues with Data Collector, such as precise build information, configuration, thread dump, pipeline definitions and history files, and most recent log files.

Wagner Camarao