StreamSets Data Collector has long supported both reading and writing data from and to relational databases via Java Database Connectivity (JDBC). While it was straightforward to configure pipelines to read data from individual tables, ingesting records from an entire database was cumbersome, requiring a pipeline per table. StreamSets Data Collector (SDC) 22.214.171.124 introduces the JDBC Multitable Consumer, a new pipeline origin that can read data from multiple tables through a single database connection. In this blog entry, I'll explain how the JDBC Multitable Consumer can implement a typical use case – replicating an entire relational database into Hadoop.
Pat Patterson
Replicating Relational Databases with StreamSets Data Collector
Splunk indexes and correlates log and machine data, providing a rich set of search, analysis and visualization capabilities. In this blog post, I'll explain how to efficiently send high volumes of data to Splunk's HTTP Event Collector via the StreamSets Data Collector Jython Evaluator. I'll present a Jython script with which you'll be able to build pipelines to read records from just about anywhere and send them to Splunk for indexing, analysis and visualization.
Pat Patterson
Ingest Data into Splunk with StreamSets Data Collector
Today we hear a lot about streaming data, fast data, and data in motion. But the truth is that we have always needed ways to move our data. Historically, the industry has been pretty inventive about getting this done. From the early days of data warehousing and extract, transform, and load (ETL) to now, we have continued to adapt and create new data movement methods, even as the characteristics of the data and data processing architectures have dramatically changed.
Exerting firm control over data in motion is a critical competency which has become core to modern data operations. Based on more than 20 years in enterprise data, here is my take on the past, present and future of data in motion.
Girish Pancha
Data in Motion Evolution: Where We’ve Been…Where We Need to Go
Pat Patterson
Calling External Java Code from Script Evaluators
I'm frequently asked, 'How does StreamSets Data Collector (SDC) integrate with Spark Streaming? How about on Databricks?' In this blog entry, I'll explain how to use SDC to ingest data into a Spark Streaming app running on Databricks, but the principles apply to Spark apps running anywhere.
Pat Patterson
Continuous Data Integration with StreamSets Data Collector and Spark Streaming on Databricks
Since writing tutorials for creating custom destinations and processors for StreamSets Data Collector (SDC), I've been looking for a good use case for a custom origin tutorial. It's been trickier than I expected, partly because the list of out-of-the-box origins is so extensive, and partly because the HTTP Client origin can access most web service APIs, rendering a custom origin redundant. Then, last week, StreamSets software engineer Jeff Evans suggested Git. Creating a custom origin to read the Git commit log turned into the perfect tutorial.
Pat Patterson
Creating a Custom Origin for StreamSets Data Collector