Dataflow Performance Blog

Ingesting data into Couchbase using StreamSets Data Collector

Nick Cadenhead, a Senior Consultant at 9th BIT Consulting in Johannesburg, South Africa, uses Couchbase Server to power analytics solutions for his clients. In this blog entry, reposted from his article at LinkedIn, Nick explains why he selected StreamSets Data Collector for data ingest, and how he extended it with a custom destination to write data to Couchbase.

For some time, I have been working with the Couchbase NoSQL database solution, and it's been an interesting journey so far.

Historically, I'm not a database guy; designing, building and maintaining databases has never been my full-time job, though I do know the basics. This has actually helped me get into the "mindset" of NoSQL concepts such as flexible schemas, no transactions and denormalization of data, without much conflict with the paradigms of the structured world of SQL and relational databases.

During my sales engineering activities supporting Couchbase proof-of-concept (POC) engagements, there is always a requirement to ingest data into a Couchbase bucket (think of a bucket as roughly analogous to a relational database) in order to demonstrate and highlight the features and capabilities of Couchbase.

Data ingestion usually requires writing some code. Couchbase provides quite a few SDKs (Java, .NET, Node.js and more) to enable developers' applications to use Couchbase.
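To give a flavour of what that hand-written ingestion code looks like, here is a minimal sketch using the Couchbase Java SDK (the 2.x-era API current at the time of this post). The cluster address, bucket name, key and document contents are illustrative assumptions, and a running Couchbase server is required:

```java
// Sketch: upserting one JSON document via the Couchbase Java SDK (2.x API).
// Hostname, bucket name and document contents are illustrative only.
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class IngestExample {
    public static void main(String[] args) {
        // Connect to the cluster and open a bucket (assumed to exist)
        Cluster cluster = CouchbaseCluster.create("localhost");
        Bucket bucket = cluster.openBucket("demo");

        // Build a JSON document and upsert it under a key
        JsonObject content = JsonObject.create()
                .put("name", "Nick")
                .put("role", "consultant");
        bucket.upsert(JsonDocument.create("user::1", content));

        cluster.disconnect();
    }
}
```

Multiply this by every source format and every POC, and the motivation for a reusable tool becomes clear.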

So this got me thinking. Why can’t there be a standard way, or tool for that matter, to ingest data into Couchbase instead of writing code all the time?

Don't get me wrong. There's nothing wrong with writing code!

Then I came across StreamSets Data Collector (SDC).

SDC is an open source platform for the ingestion of streaming and batch data into big data stores. It features a graphical web-based console for configuring data “pipelines” to handle data flows from origins to destinations, monitoring runtime dataflow metrics and automating the handling of data drift.

Data pipelines are constructed in the web-based console via drag and drop. Pipelines read from origins (sources) and ingest data into destinations (targets). Between origins and destinations sit processor stages, which are essentially data transformation steps: masking fields, evaluating expressions on fields, looking up data in a database or in external cloud services such as Salesforce, routing records, evaluating and manipulating data with JavaScript, Jython or Groovy, and many more.

Couchbase integration with StreamSets
A StreamSets Data Collector pipeline ingesting data into a Couchbase Bucket

Thus SDC is a great option for my data ingestion needs. It's open source and available to download immediately. A large number of technologies are supported for data ingestion, ranging from databases to flat files, logs and HTTP services, big data platforms like Hadoop and MongoDB, and cloud platforms like Salesforce. But there was one problem: Couchbase was not on the list of data connectors available for SDC. No problem! I decided to write my own connector for Couchbase.

Leveraging the Java API that SDC makes available for the community to extend its integration capabilities, together with the online documentation and guides, I was able to implement a Couchbase connector very quickly. The initial build of the connector is very simple: it just ingests JSON data into a Couchbase bucket. Over time the connector will be expanded to query a Couchbase bucket, add better ingestion capabilities, and more. For now, it serves my needs.
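To show the shape of that extension point (this is a sketch, not Nick's actual code), a custom SDC destination is a class extending `BaseTarget` from the Data Collector API. The API types here (`BaseTarget`, `Batch`, `Record`, `StageException`) are real; the class name and the Couchbase-specific logic are assumptions:

```java
// Hedged sketch of a custom StreamSets destination. The write(Batch) method
// is called once per batch of records; the Couchbase upsert itself is
// omitted, since it requires the Couchbase SDK and a live cluster.
import java.util.Iterator;

import com.streamsets.pipeline.api.Batch;
import com.streamsets.pipeline.api.Record;
import com.streamsets.pipeline.api.StageException;
import com.streamsets.pipeline.api.base.BaseTarget;

public class CouchbaseTarget extends BaseTarget {
    @Override
    public void write(Batch batch) throws StageException {
        Iterator<Record> records = batch.getRecords();
        while (records.hasNext()) {
            Record record = records.next();
            // Serialize the record's fields to JSON and upsert the
            // resulting document into the target bucket here.
        }
    }
}
```

Packaged as a stage library, a class like this shows up in the console's palette alongside the built-in destinations.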

One of the added benefits of SDC is data pipeline analytics. The analytics features in the SDC console give users insight into how data flows from origins to destinations, with standard visualizations providing detailed analysis of pipeline performance. This analysis very quickly showed me how my data was being ingested into the Couchbase buckets and highlighted any errors that occurred at each stage of the pipeline.

So data pipelines in SDC allow me to ingest data into Couchbase very quickly, with little or no code at all.

The data connector is, of course, open source and available on GitHub.

Please feel free to contact me if you have any questions on StreamSets Data Collector, Couchbase or the code of the Couchbase connector.

Nick's Couchbase destination currently targets StreamSets Data Collector API version 1.2.2.0. We at StreamSets are working with Nick to update it for the upcoming 2.3.0.0 release and ultimately add it to StreamSets Data Collector as a supported integration.

Pat Patterson