Streaming Extreme Data Made Simple with Kinetica and StreamSets
Kinetica, just one of dozens of origins and destinations supported by StreamSets Data Collector, is a distributed, in-memory, GPU database designed for geospatial analysis, machine learning, predictive analytics, and other workloads requiring high-performance parallel processing. Mathew Hawkins, a Principal Solutions Architect at Kinetica, recently wrote an excellent tutorial on integrating Data Collector with Kinetica. We repost it here with Kinetica's kind permission.
One of the Kinetica engine’s unique and compelling differentiators is its ability to consume vast amounts of real-time data in parallel from multiple sources. Using our unique multihead ingest technology and the power of NVIDIA GPUs, Kinetica can consume real-time data feeds and analyze billions of rows of data at sub-second speeds – truly extreme data in motion.
It’s important for us to build strong relationships with popular data pipeline applications to give our customers the flexibility to use the best technology on the market. One of our key partners in this space is StreamSets, a popular platform for developing and operating continuous data pipelines. StreamSets allows you to productionize your real-time data streams and performantly move data into and across your infrastructure, harnessing the power of your Hadoop cluster, NoSQL databases and cloud data stores. You can download StreamSets Data Collector here.
The team at StreamSets realized that big and fast data was breaking the traditional data integration paradigm based on structured batch data and stable data movement patterns. This new kind of extreme data – marked by streaming data and unstructured data – required a new solution, so they created the StreamSets DataOps Platform, enabling the agile design and operation of continuous data movement architectures.
With StreamSets, you can seamlessly connect traditional and big data sources to analytics platforms and applications quickly with a minimum of hand-coding. To become a data-powered organization, you need to be able to wrangle data from all different kinds of data sources and platforms and analyze it in real-time. StreamSets is a key part of that equation. To show you how Kinetica and StreamSets can work together, I’ve put together a short technical tutorial that will give you a high level overview of our joint solution.
StreamSets Data Collector added a Kinetica destination in version 3.0.0, released earlier this year. At the end of this short tutorial, I’ll also provide a downloadable template to get you started using Data Collector to ingest data into Kinetica. The use case is self-monitoring: we’ll use Data Collector to process Kinetica’s own log file and build a dashboard using Kinetica’s dashboarding tool, Reveal.
First, we have to define our schema and create our table. Kinetica is SQL compliant. To make this easy, we’ll use KiSQL in Kinetica’s Admin interface:
create or replace table KINETICA_LOGS(
severity varchar(8, text_search) not null
,message string(text_search) not null
,pid_info varchar(32, text_search) not null
,host varchar(32, text_search) not null
,ts datetime not null
);
We’ve enabled our text search features to allow us to not only summarize the data, but search and monitor the detail of events too. Let’s browse the schema:
Now that we have our table created, we can download the pipeline template from http://files.kinetica.com/streamsets/ingest_kinetica_logs.json and import it into Data Collector:
Next, we update our credentials, connection and table name information in the Kinetica destination:
When we start the pipeline we should see activity, and data should start writing to our Kinetica table:
And that should complete the pipeline! We’re now writing real-time log data into Kinetica using StreamSets Data Collector. The pipeline demonstrates standard StreamSets functionality to “ETL” and control your data as it writes to Kinetica in a performant and scalable way.
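For readers who want to sanity-check records before they reach the destination, here is a minimal Python sketch of a check against the KINETICA_LOGS column definitions above. It is not part of the pipeline template. The function name `validate_record`, the sample values, and the choice to treat the varchar limits as character counts are all assumptions made for illustration.

```python
from datetime import datetime

# Column limits copied from the KINETICA_LOGS definition in this post.
# None means no length limit (the `string` column).
# Assumption: limits are treated as character counts.
MAX_LENGTHS = {"severity": 8, "message": None, "pid_info": 32, "host": 32}


def validate_record(record):
    """Return a list of problems; an empty list means the record fits the table."""
    problems = []
    for column, limit in MAX_LENGTHS.items():
        value = record.get(column)
        if value is None:
            problems.append(f"{column}: missing (column is NOT NULL)")
        elif not isinstance(value, str):
            problems.append(f"{column}: expected text")
        elif limit is not None and len(value) > limit:
            problems.append(f"{column}: longer than {limit} characters")
    if not isinstance(record.get("ts"), datetime):
        problems.append("ts: expected a datetime value")
    return problems


# Made-up sample values, for illustration only.
sample = {
    "severity": "WARN",
    "message": "example message",
    "pid_info": "1234",
    "host": "node-1",
    "ts": datetime(2018, 1, 1, 12, 0, 0),
}
print(validate_record(sample))
```

A record that passes returns an empty list; anything else comes back with one entry per offending column, which makes it easy to route bad records aside rather than fail the whole batch.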
To complete the self-monitoring Kinetica use case, we use Reveal to understand the log information we’re storing. The dashboard below can be downloaded from here. This dashboard provides a high-level view of real-time information on warnings, errors, normal operation and audit. We can summarize and drill into the detail all in a single interface:
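As a rough illustration of the kind of summary behind such a dashboard, the sketch below counts events per severity with a plain SQL GROUP BY. It runs against an in-memory SQLite table as a stand-in, not against Kinetica, and it is not how Reveal itself is configured. The column types are simplified to text and the rows are made up for the example.

```python
import sqlite3

# SQLite stand-in for the KINETICA_LOGS table; column types are simplified.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table KINETICA_LOGS ("
    "severity text not null, message text not null, "
    "pid_info text not null, host text not null, ts text not null)"
)

# Made-up rows, for illustration only.
rows = [
    ("INFO", "example message 1", "1234", "node-1", "2018-01-01 12:00:00"),
    ("INFO", "example message 2", "1234", "node-1", "2018-01-01 12:00:01"),
    ("ERROR", "example message 3", "1234", "node-2", "2018-01-01 12:00:02"),
    ("WARN", "example message 4", "5678", "node-2", "2018-01-01 12:00:03"),
    ("INFO", "example message 5", "5678", "node-1", "2018-01-01 12:00:04"),
    ("ERROR", "example message 6", "5678", "node-2", "2018-01-01 12:00:05"),
]
conn.executemany("insert into KINETICA_LOGS values (?, ?, ?, ?, ?)", rows)

# Count events per severity, most frequent first.
summary = conn.execute(
    "select severity, count(*) as events from KINETICA_LOGS "
    "group by severity order by events desc, severity"
).fetchall()
print(summary)
```

The same GROUP BY query should carry over to any SQL-compliant store; only the connection code is specific to the stand-in.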
I hope you’re now inspired to think of ways in which Kinetica and StreamSets can fit into your ecosystem. Stay tuned for more blogs about our technology partnerships (like this one on Docker) to show how Kinetica works with innovative companies throughout the modern infrastructure stack.
Looking to learn more about StreamSets? Listen to our Telekinesis podcast with StreamSets community champion, Pat Patterson.
Many thanks, Mathew, and thanks for the shoutout at the end! Are you working with a data store that you'd like to see supported by Data Collector? Let us know in the comments!