Dataflow Performance Blog

Using StreamSets Control Hub for Scalable Deployment via Kubernetes

In my previous blog entry, I explained how to spin up Data Collectors as Kubernetes deployments along with Dataflow Performance Manager. I recommended using a deployment with one replica as the design environment and a deployment with many replicas for execution. We recently announced StreamSets Control Hub, which makes the Kubernetes integration much smoother! StreamSets Control Hub adds a Control Agent for Kubernetes that supports creating and managing Data Collector deployments, and a Pipeline Designer that allows designing pipelines without having to install Data Collectors. In this blog, I will demonstrate how to take advantage of these features.
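
To make the deployment model concrete, here is a minimal sketch using the official Kubernetes Python client to stand up an execution deployment of Data Collectors. The image tag, labels, port and replica count are illustrative assumptions, and this is not Control Hub's Control Agent API:

    from kubernetes import client, config

    config.load_kube_config()  # use the current kubectl context

    sdc = client.V1Container(
        name="sdc",
        image="streamsets/datacollector:latest",               # public SDC image
        ports=[client.V1ContainerPort(container_port=18630)],  # SDC's default port
    )

    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="sdc-execution"),
        spec=client.V1DeploymentSpec(
            replicas=3,  # many replicas for execution; one for a design environment
            selector=client.V1LabelSelector(match_labels={"app": "sdc"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "sdc"}),
                spec=client.V1PodSpec(containers=[sdc]),
            ),
        ),
    )

    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)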

Hari Nayak
Read More

Streaming Data from Twitter for Analysis in Spark

Happy New Year! Our first blog entry of 2018 is a guest post from Josh Janzen, a data scientist based in Minnesota. Josh wanted to ingest tweets referencing NFL games into Spark, then run some analysis to look for a correlation between Twitter activity and game winners. Josh originally posted this entry on his personal blog, and kindly allowed us to repost it here. Over to you, Josh:

'Tis the season of NFL football, and one way to capture the excitement is through Twitter data. I've tinkered around with Twitter's Developer API before, but this time I wanted to use a streaming product I've heard good things about: StreamSets Data Collector.
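
As a taste of the kind of analysis the post works through, here is a minimal PySpark sketch that counts team mentions in tweets landed as JSON. The landing path, field names and team keywords are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("nfl-tweets").getOrCreate()

    # Tweets landed as JSON by the collector; path and fields are hypothetical.
    tweets = spark.read.json("hdfs:///data/nfl_tweets/")

    teams = ["vikings", "packers"]  # illustrative keywords
    counts = tweets.agg(
        *[F.sum(F.col("text").rlike("(?i)" + team).cast("int")).alias(team + "_tweets")
          for team in teams]
    )
    counts.show()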

Pat Patterson
Read More

Speed up Hive Data Retrieval using Spark, StreamSets and Predera

In this guest blog, Predera's Kiran Krishna Innamuri (Data Engineer) and Nazeer Hussain (Head of Platform Engineering and Services) focus on building a data pipeline to perform lookups or run queries on Hive tables with the Spark execution engine using StreamSets Data Collector and Predera's custom Hive-JDBC lookup processor.
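
To illustrate the underlying lookup idea (not Predera's processor itself), here is a minimal sketch that queries a Hive table from Python using PyHive; the host, table and column names are hypothetical, and routing the query through Spark is a server-side Hive setting:

    from pyhive import hive

    # Hypothetical connection details; Hive chooses the execution engine
    # (hive.execution.engine=spark) on the server side.
    conn = hive.connect(host="hive-server", port=10000, database="default")
    cursor = conn.cursor()

    def lookup_customer(customer_id):
        # Enrich a record by looking up extra columns for a key.
        cursor.execute(
            "SELECT name, segment FROM customers WHERE id = %(id)s",
            {"id": customer_id},
        )
        return cursor.fetchone()

    print(lookup_customer(42))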

Pat Patterson
Read More

Introducing StreamSets Control Hub

Control. We always want it, regularly don't get it, yet in business it's a must-have to ensure things run as expected. Control is particularly critical when it comes to moving data around your company. Without it, it's difficult to know where data is coming from, where it's going and how it's been manipulated (and by whom!) along the way. At StreamSets, we specialize in helping customers effectively move data around their business, from any source to any destination. Over time, we've observed that most organizations lack proper controls to adequately ensure that dataflow pipelines are built, executed and operated in a manner that meets the needs of the application or business process they support.

Clarke
Read More

Generate your Avro Schema – Automatically!

In a previous blog post, I explained how StreamSets Data Collector (SDC) can work with Apache Kafka and Confluent Schema Registry to handle data drift via Avro schema evolution. In that blog post, I mentioned SDC's Schema Generator processor; today I'll explain how you can use the Schema Generator to automatically create Avro schemas.
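
For a flavor of what automatic schema generation involves, here is a minimal pure-Python sketch that infers an Avro schema from a sample record; it illustrates the idea rather than the Schema Generator's implementation:

    import json

    # Map of Python types to Avro primitive types; anything else falls back
    # to string in this simplified sketch.
    AVRO_TYPES = {bool: "boolean", int: "long", float: "double", str: "string"}

    def infer_avro_schema(record, name="sdc_record"):
        fields = [
            {"name": key, "type": AVRO_TYPES.get(type(value), "string")}
            for key, value in record.items()
        ]
        return {"type": "record", "name": name, "fields": fields}

    sample = {"id": 1, "product": "widget", "price": 9.99, "in_stock": True}
    print(json.dumps(infer_avro_schema(sample), indent=2))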

Pat Patterson
Read More

Announcing StreamSets Data Collector Edge

Today, an increasing amount of data is generated outside the data center or cloud, and it isn't always easy to get this data out of source systems or perform analytics right where it's generated. Furthermore, getting this data into central big data systems managed by the enterprise is an arduous task involving a large number of disjointed, poorly instrumented and often hand-coded technologies.

Kirit Basu
Read More

Announcing StreamSets Data Collector version 3.0

Version 3.0 marks an important new milestone for StreamSets. With close to a million downloads and a strong community and customer base, we are very excited to offer a host of powerful new capabilities within the product. This release brings greater connectivity with cloud services, deeper integration with Hadoop distributions, new data aggregations and an exciting new technology for running pipelines on resource-constrained devices.

For those keeping count, SDC 3.0 has 27 new features, close to 100 improvements and almost 200 bug fixes.

This release also contains important new functionality that extends the reach of SDC out to devices on the edge. SDC Edge is a lightweight agent that can execute pipelines designed in SDC. These agents can run on Windows, Linux, macOS, Android and iOS. To learn more about SDC Edge, follow this link.

Kirit Basu
Read More

Bulk Loading Data into Snowflake Data Warehouse

Mike Fuller, a consultant at Red Pill Analytics, has been busy integrating an Oracle RDS database with Snowflake's cloud data warehouse via StreamSets Data Collector. His blog post on bulk loading data into Snowflake is a great description of a real-world data integration use case. We're reposting it here courtesy of Mike and Red Pill.
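
For a flavor of the bulk-load mechanics, here is a minimal sketch using the Snowflake Python connector to stage files and COPY them into a table; credentials, file paths and table names are hypothetical, and Mike's post covers the full SDC pipeline:

    import snowflake.connector

    # Hypothetical credentials and object names.
    conn = snowflake.connector.connect(
        account="my_account", user="loader", password="***",
        warehouse="LOAD_WH", database="DEMO_DB", schema="PUBLIC",
    )
    cursor = conn.cursor()

    # Upload local files to the table's internal stage, then bulk load them.
    cursor.execute("PUT file:///tmp/orders*.csv @%orders")
    cursor.execute("""
        COPY INTO orders FROM @%orders
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
    conn.close()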

Pat Patterson
Read More

Fun with FileRefs – Manipulating Whole File Data

As well as parsing incoming data into records, many StreamSets Data Collector (SDC) origins can be configured to ingest Whole Files. The blog entry Whole File Transfer with StreamSets Data Collector provides a basic introduction to the concept.

Although the initial release of the Whole File feature did not allow file content to be accessed in the pipeline, we soon added the ability for Script Evaluator processors to read the file, a feature exploited in the custom processor tutorial to read metadata from incoming image files. In this blog post, I'll show you how a custom processor can both create new records with Whole File content, and replace the content in existing records.
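
As a simple illustration of reading file content in a pipeline, here is a minimal sketch for SDC's Jython Evaluator; it assumes the documented pattern of calling getInputStream() on the /fileRef field, and the output field name is illustrative:

    # Runs inside SDC's Jython Evaluator; 'records' and 'output' are provided
    # by the processor. Whole file records carry a /fileRef field.
    for record in records:
        input_stream = record.value['fileRef'].getInputStream()
        try:
            # Read the first byte, e.g. to start sniffing a magic number.
            record.value['firstByte'] = input_stream.read()
        finally:
            input_stream.close()
        output.write(record)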

Pat Patterson
Read More

How to Convert Apache Sqoop™ Commands Into StreamSets Data Collector Pipelines

When it comes to loading data into Apache Hadoop™, the de facto choice for bulk loads of data from leading relational databases is Apache Sqoop™. After initially entering Apache Incubator status in 2011, it quickly saw widespread adoption and development, eventually graduating to a Top-Level Project (TLP) in 2012.

In StreamSets Data Collector (SDC) 2.7, we added capabilities that enable SDC to behave in a manner almost identical to Sqoop. Now customers can use SDC to modernize Sqoop-like workloads, performing the same load functions while getting the ease-of-use and flexibility benefits that SDC delivers.

Clarke
Read More