Ingesting Data from Apache Kafka to TimescaleDB

The Glue Conference (better known as GlueCon) is always a treat for me. I've been speaking there since 2012, and this year I presented a session explaining how I use StreamSets Data Collector to ingest content delivery network (CDN) data from compressed CSV files in S3 to MySQL for analysis, using the Kickfire API to turn IP addresses into company data. The slides are here, and I'll write it up in a future blog post.

As well as speaking, I always enjoy the keynotes (shout out to Leah McGowen-Hare for an excellent presentation on inclusion!) and breakouts. In one of this year's breakouts, Diana Hsieh, director of product management at Timescale, focused on the TimescaleDB time series database.

Time Series Databases

Time series databases are optimized for handling data indexed by time, efficiently answering queries over a particular time range. There are several time series databases on the market (Data Collector has long been able to write to InfluxDB, for example), but what intrigued me about TimescaleDB is that it is built on PostgreSQL. Full disclosure: I spent five and a half years as a developer evangelist at Salesforce, where PostgreSQL was, and remains, a core part of Heroku's platform, and I've come to love PostgreSQL as a more robust alternative to MySQL.

Getting Started with TimescaleDB

While listening to Diana's presentation, I ran the TimescaleDB Docker image, mapping port 54321 on my laptop to port 5432 in the Docker container so it wouldn't clash with my existing PostgreSQL deployment. As soon as Diana left the stage, I ran through the 'Creating Hypertables' section of the TimescaleDB quick start, creating a PostgreSQL database, enabling it for TimescaleDB, and writing a row of data to it:
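
For reference, here's roughly what those steps amount to, run from Python with psycopg2 rather than psql. This is a sketch, not the exact commands I ran: the 'tutorial' database name and the 'conditions' schema come from the TimescaleDB tutorial, and authentication details are omitted.

```python
# Sketch of the 'Creating Hypertables' quick start steps, via psycopg2.
# Assumptions: database name 'tutorial', the tutorial's 'conditions' schema,
# and port 54321 (the host port I mapped to the container's 5432).
import psycopg2

# Create the database, connecting to the default 'postgres' database first
admin = psycopg2.connect(host='localhost', port=54321,
                         user='postgres', dbname='postgres')
admin.autocommit = True  # CREATE DATABASE can't run inside a transaction
admin.cursor().execute('CREATE DATABASE tutorial;')
admin.close()

conn = psycopg2.connect(host='localhost', port=54321,
                        user='postgres', dbname='tutorial')
conn.autocommit = True
cur = conn.cursor()

# Enable TimescaleDB on the new database
cur.execute('CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;')

# A regular PostgreSQL table for sensor readings...
cur.execute("""
    CREATE TABLE conditions (
        time        TIMESTAMPTZ      NOT NULL,
        location    TEXT             NOT NULL,
        temperature DOUBLE PRECISION NULL,
        humidity    DOUBLE PRECISION NULL
    );
""")

# ...promoted to a hypertable, partitioned on the time column
cur.execute("SELECT create_hypertable('conditions', 'time');")

# Write a single row of data
cur.execute("INSERT INTO conditions VALUES (NOW(), 'office', 70.0, 50.0);")

conn.close()
```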

My First TimescaleDB Pipeline

Since TimescaleDB is built on PostgreSQL, the standard PostgreSQL JDBC driver works with it out of the box. I already had the driver installed in Data Collector, so it took me about two minutes to build a simple test pipeline to write a second row of data to my shiny new TimescaleDB server.
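
The only TimescaleDB-specific setting is the JDBC connection string. Assuming the hypothetical 'tutorial' database from the sketch above and the remapped host port, it looks something like this:

```
jdbc:postgresql://localhost:54321/tutorial
```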

Gratifyingly, the pipeline worked the first time.

Ingesting IoT Data from Kafka to TimescaleDB

One of the primary use cases for a time series database is storing data from the Internet of Things. It took me a few minutes to code a simple Python Kafka client that would emulate a set of sensors producing more realistic temperature and humidity data than my test pipeline:
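
The original script isn't reproduced here, but a minimal sketch along these lines captures the behavior described below. The kafka-python package, a broker on localhost:9092, and the 'conditions' topic name are my assumptions.

```python
# Minimal sensor emulator sketch: integer location ids, no timestamps --
# the Data Collector pipeline fills both in downstream.
import json
import random

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# A quarter of a million readings, with plausible temperature/humidity values
for _ in range(250000):
    producer.send('conditions', {
        'location': random.randint(0, 3),                  # integer location id
        'temperature': round(random.uniform(60, 80), 2),   # degrees Fahrenheit
        'humidity': round(random.uniform(40, 60), 2),      # percent
    })

producer.flush()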

Notice that the emulator emits an integer value for location, and does not timestamp the data. As you can see, just for fun, I had the emulator generate a quarter million data points. This is enough to exercise TimescaleDB a little, without taking a significant amount of time to generate.

I replaced my pipeline's Dev Data Generator origin with a Kafka Consumer, and added a couple of processors to the pipeline:

The Expression Evaluator simply adds a timestamp to each record, using some Expression Language to create the correct format.
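
The exact expression isn't shown here, but a plausible configuration writes the current time into a new field using Data Collector's time functions; the /time field name and format string are my guesses:

```
/time -> ${time:extractStringFromDate(time:now(), 'yyyy-MM-dd HH:mm:ss')}
```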

The Static Lookup processor replaces the integer location field with a string to match the TimescaleDB table schema:
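
The Static Lookup's configuration is essentially a table of key-value pairs. A mapping along these lines (the location names are invented for illustration) turns the integer ids into the strings the conditions table expects:

```
0 -> office
1 -> basement
2 -> garage
3 -> attic
```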

This short video shows the result: the pipeline ingests 250,000 records in about 30 seconds. Note that this shouldn't be interpreted as a benchmark result; everything here was running on my laptop, and the Kafka topic had a single partition. A real-world IoT ingestion pipeline would run on more capable hardware, with multiple Kafka partitions and a matching number of Data Collector pipelines.

Conclusion

I was impressed by TimescaleDB. The unboxing experience was fast and pain-free. Although I only gave it the briefest of tire-kickings, everything worked the first time. The fact that TimescaleDB is built on PostgreSQL made it easy for me to write data with off-the-shelf tooling, and I was able to use familiar SQL commands to work with the data once it was in a hypertable.

If you're using TimescaleDB, download StreamSets Data Collector and give it a try for your data integration needs. Like the core of TimescaleDB, it is open source under the Apache 2.0 license and free to use for test, development, and production.
