In my previous blog entry, I explained how to spin up Data Collectors as Kubernetes deployments along with Dataflow Performance Manager. I recommended using a deployment with one replica as the design environment and a deployment with many replicas for execution. We recently announced StreamSets Control Hub, which makes the Kubernetes integration way smoother! StreamSets Control Hub adds a Control Agent for Kubernetes that supports creating and managing Data Collector deployments, and a Pipeline Designer that allows designing pipelines without having to install Data Collectors. In this blog, I will demonstrate how to take advantage of these features.
Hari Nayak: Using StreamSets Control Hub for Scalable Deployment via Kubernetes
Happy New Year! Our first blog entry of 2018 is a guest post from Josh Janzen, a data scientist based in Minnesota. Josh wanted to ingest tweets referencing NFL games into Spark, then run some analysis to look for a correlation between Twitter activity and game winners. Josh originally posted this entry on his personal blog, and kindly allowed us to repost it here. Over to you, Josh:
’Tis the season of NFL football, and one way to capture excitement is Twitter data. I’ve tinkered around with Twitter’s Developer API before, but this time I wanted to use a streaming product I’ve heard good things about: StreamSets Data Collector.
Pat Patterson: Streaming Data from Twitter for Analysis in Spark
In this guest blog, Predera’s Kiran Krishna Innamuri (Data Engineer) and Nazeer Hussain (Head of Platform Engineering and Services) focus on building a data pipeline to perform lookups or run queries on Hive tables with the Spark execution engine using StreamSets Data Collector and Predera’s custom Hive-JDBC lookup processor.
Pat Patterson: Speed up Hive Data Retrieval using Spark, StreamSets and Predera
Control. We always want it, regularly don’t get it, yet in business it’s a must-have to ensure things run as expected. Control is particularly critical when it comes to moving data around your company. Without it, it’s difficult to know where data is coming from, where it’s going and how it’s been manipulated (and by whom!) along the way. At StreamSets, we specialize in helping customers effectively move data around their business, from any source to any destination. Over time, we’ve observed that most organizations lack the controls needed to ensure that dataflow pipelines are built, executed and operated in a manner that meets the needs of the application or business process they support.
Today an increasing amount of data is being generated outside the data center or cloud, and it isn’t always easy to get this data out of source systems or perform analytics right where it’s generated. Furthermore, getting this data into central big data systems managed by the enterprise is an arduous task involving a large number of disjointed, poorly instrumented and often hand-coded technologies.
Kirit Basu: Announcing StreamSets Data Collector Edge
Version 3.0 marks an important new milestone for StreamSets. With close to a million downloads and a strong community and customer base, we are very excited to offer a host of powerful new capabilities within the product. This release has greater connectivity with cloud services, deeper integration with Hadoop distributions, new data aggregations and an exciting new technology for running pipelines on resource constrained devices.
This release also contains important new functionality that extends the reach of SDC to devices on the edge. SDC Edge is a lightweight agent that can execute pipelines designed in SDC. These agents can run on Windows, Linux, Mac, Android and iOS. To learn more about SDC Edge, follow this link.
Kirit Basu: Announcing StreamSets Data Collector version 3.0
Although the initial release of the Whole File feature did not allow file content to be accessed in the pipeline, we soon added the ability for Script Evaluator processors to read the file, a feature exploited in the custom processor tutorial to read metadata from incoming image files. In this blog post, I'll show you how a custom processor can both create new records with Whole File content, and replace the content in existing records.
Pat Patterson: Fun with FileRefs – Manipulating Whole File Data
When it comes to loading data into Apache Hadoop™, the de facto choice for bulk loads of data from leading relational databases is Apache Sqoop™. After initially entering Apache Incubator status in 2011, it quickly saw widespread adoption and development, eventually graduating to a Top-Level Project (TLP) in 2012.
In StreamSets Data Collector (SDC) 2.7 we added capabilities that enable SDC to behave in a manner almost identical to Sqoop. Now customers can use SDC to modernize Sqoop-like workloads, performing the same load functions while getting the ease of use and flexibility benefits that SDC delivers.
Clarke: How to Convert Apache Sqoop™ Commands Into StreamSets Data Collector Pipelines