Kinetica, just one of dozens of origins and destinations supported by StreamSets Data Collector, is a distributed, in-memory, GPU database designed for geospatial analysis, machine learning, predictive analytics, and other workloads requiring high-performance parallel processing. Mathew Hawkins, a Principal Solutions Architect at Kinetica, recently wrote an excellent tutorial on integrating Data Collector with Kinetica. We repost it here with […]
In this guest blog, Predera’s Kiran Krishna Innamuri (Data Engineer) and Nazeer Hussain (Head of Platform Engineering and Services) focus on building a data pipeline to perform lookups or run queries on Hive tables with the Spark execution engine using StreamSets Data Collector and Predera’s custom Hive-JDBC lookup processor.
Getting Started with Cloudera’s Cybersecurity Solution (feat. StreamSets, Arcadia Data and Centrify)
This post was originally published on the Cloudera VISION blog by Sam Heywood. StreamSets configurations and images of Apache Spot Open Data Model ingest pipelines can be found here on GitHub. A quick conversation with most Chief Information Security Officers (CISOs) reveals they understand they need to modernize their security architecture and the correct answer […]
Mike Fuller, a consultant at Red Pill Analytics, recently wrote Stream Me Up (to the Cloud), Scotty, a tutorial on installing StreamSets Data Collector (SDC) on Amazon Web Services EC2. Mike’s article takes you all the way from logging in to a fresh EC2 instance to seeing your first pipeline in action. We’re reposting it here courtesy of […]
Sandish Kumar, a Solutions Engineer at phData, builds and manages solutions for phData customers. In this article, reposted from the phData blog, he explains how to generate simulated NetFlow data, read it into StreamSets Data Collector via the UDP origin, then buffer it in Apache Kafka before sending it to Apache Kudu. A true big data enthusiast, Sandish spends […]
Apache Kudu and Open Source StreamSets Data Collector Simplify Batch and Real-Time Processing
As originally posted on the Cloudera VISION Blog. At StreamSets, we come across dataflow challenges for a variety of applications. Our product, StreamSets Data Collector, is an open-source any-to-any dataflow system that ensures that all your data is safely delivered in the […]
Big data has come a long way, with adoption accelerating as CIOs recognize the business value of extracting insights from the troves of data collected by their companies and business partners. But, as is often the case with innovations, mainstream adoption of big data has exposed a new challenge: how to ingest data continuously from […]
A step-by-step walkthrough of how Mac Noland implemented StreamSets to move away from hand-coded ETL and scale out an increasingly complex ingestion pipeline. Mac is a Solution Architect for phData, a Twin Cities services firm focused on Hadoop. He has spent 17 years as a software engineer and architect for projects in the legal, accounting, risk and medical device industries.