Dataflow Performance Blog

Ingesting Sensor Data on the Raspberry Pi with StreamSets Data Collector

In the unlikely event you're not familiar with the Raspberry Pi, it's an ARM-based computer about the same size as a deck of playing cards. The latest iteration, the Raspberry Pi 3, has a 1.2GHz ARMv8 CPU, 1GB of memory, and integrated Wi-Fi and Bluetooth, all for the same $35 price tag as the original Raspberry Pi released in 2012. Running a variety of operating systems, including Linux, it's actually quite a capable little machine. How capable? Well, last week, I successfully ran StreamSets Data Collector (SDC) on one of mine (yes, I have several!), ingesting sensor readings and sending them to Apache Cassandra for analysis.

Here's how you can build your own Internet of Things testbed and ingest sensor data for around $50.

Sensing Environmental Data

I started with a Raspberry Pi 3 Model B with Raspbian Jessie Lite installed and Adafruit's BMP180 Barometric Pressure/Temperature/Altitude Sensor. I followed the instructions in Adafruit's BMP085/180 tutorial to configure the Raspberry Pi to talk to the sensor via I2C, and ran the example Python app to show sensor readings.

SDC's File Tail Origin seemed the easiest way to get sensor data from Python to a pipeline, so I modified Adafruit's Python app to produce JSON, one record per line, and wrote a simple script to run it every few seconds (code in Gist).
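The modified app isn't reproduced in full here, but a minimal sketch of the JSON-producing part might look like the following. The field names are my assumption, chosen to match the Cassandra columns used later; on the Pi, the real values would come from the Adafruit BMP180 library rather than being passed in by hand.

```python
import json
import time

def reading_to_json(temperature, pressure, altitude):
    """Format one BMP180 reading as a single JSON line for File Tail.

    On the Pi the arguments would come from the Adafruit sensor library;
    the field names here are assumptions matching the Cassandra table.
    """
    record = {
        "time": int(time.time() * 1000),  # milliseconds since the Unix epoch
        "temperature": temperature,       # degrees Celsius
        "pressure": pressure,             # Pascals
        "altitude": altitude,             # meters
    }
    return json.dumps(record)

# A wrapper script would append one line per reading to ~/sensor.log,
# with real values read from the sensor every few seconds:
print(reading_to_json(21.5, 101325.0, 30.2))
```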

Once everything was working, I set my script running, with its output directed to ~/sensor.log, to generate some data.

Installing StreamSets Data Collector on the Raspberry Pi

Now to get SDC running! First, we need Java. StreamSets supports the Oracle JDK rather than OpenJDK, so that's what I installed on the Pi.
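On Raspbian this is a one-liner; as a sketch, assuming the oracle-java8-jdk package that Raspbian Jessie ships in its repositories:

```shell
# Install Oracle's JDK 8 from the Raspbian repositories
sudo apt-get update
sudo apt-get install -y oracle-java8-jdk
```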

I used Jessie Lite to minimize the amount of storage and memory consumed on the Pi. One consequence was that there was no desktop to run a browser, so I browsed the StreamSets archive on my laptop, copied the relevant link, and downloaded SDC directly onto the Pi. To minimize SDC's footprint, I downloaded the core package, which includes only a minimal set of stages.
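As a sketch, assuming the URL pattern used by the StreamSets archives site; the version number here is illustrative, and the real link should be copied from the archive page:

```shell
# Version is an assumption for the example – copy the real link
# from the StreamSets archives page on another machine
SDC_VERSION=2.1.0.0
wget "https://archives.streamsets.com/datacollector/${SDC_VERSION}/tarball/streamsets-datacollector-core-${SDC_VERSION}.tgz"
tar xvzf "streamsets-datacollector-core-${SDC_VERSION}.tgz"
```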

Following the instructions on the StreamSets download page, I extracted the tarball and was ready to download the Cassandra destination.

At this point, I ran into a snag. The SDC package manager uses sha1sum to verify the integrity of stages as they are downloaded. Unfortunately, a bug, fixed in the upcoming release, caused the package manager to fail with an error.

There's an easy workaround, though – open libexec/_stagelibs in an editor, and correct the sha1sum invocation on line 99.


Now I could list the available stages.
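A sketch of the listing step, assuming SDC's stagelibs command-line interface:

```shell
# From the SDC installation directory, list available stage libraries
bin/streamsets stagelibs -list
```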

Since File Tail Origin is in the included 'basic' package of stages, I just needed to install the Cassandra Destination.
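Again as a sketch; the library id here is my assumption of the Cassandra stage library's name as it would appear in the -list output:

```shell
# Install the Cassandra stage library by its id (name is an assumption)
bin/streamsets stagelibs -install="streamsets-datacollector-cassandra_2-lib"
```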

Creating a Table in Cassandra

Over on my laptop, I installed Cassandra 2.2. This was the first release of Cassandra to support built-in aggregate functions, such as avg, max and min, and it works well with SDC's Cassandra destination, which uses the 2.1.5 driver.

In cqlsh, I created a keyspace and table to hold my data.
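An illustrative version of the DDL; the keyspace, table and column names are assumptions chosen to match the JSON fields produced by the sensor script:

```sql
-- Single-node keyspace for local experimentation
CREATE KEYSPACE sensor
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE sensor.sensor_readings (
  time        timestamp PRIMARY KEY,  -- converted from epoch milliseconds
  temperature double,                 -- degrees Celsius
  pressure    double,                 -- Pascals
  altitude    double                  -- meters
);
```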

Building a Pipeline to Ingest Sensor Data

Now I could build a pipeline! The File Tail Origin was straightforward to configure – I just needed to specify JSON as the Data Format and /home/pi/sensor.log as the Path.

File Tail

I passed the origin's data stream to a Field Converter Processor, since the time value from the Python app, in milliseconds since the Unix epoch (January 1, 1970), needed to be converted to a timestamp for consumption by Cassandra.

Field Converter

The Field Converter's output goes straight to the Cassandra Destination, which needed mappings from the property names in the JSON to Cassandra column names as well as the Cassandra host and port.

Cassandra Destination

Finally, I didn't need the File Tail Origin's metadata stream, so I just sent it to the trash.


In preview mode, everything looked just fine, but the first time I ran the pipeline, I hit a snag – it failed with a batch-size error from Cassandra.

After some research I discovered that the default Cassandra behavior is to log a warning on batch sizes greater than 5kB. Looking in the Cassandra logs, I could see that my SDC batch size of 1000 records crossed not only the warning threshold, but also the hard ceiling of 50kB!

Doing the math, it looked like Cassandra would be happy with 40 records, so I configured the origin's maximum batch size accordingly. Now my pipeline was able to ingest the 14,000 or so records I had already collected, and continue processing new readings as they were written to the sensor.log file.
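The arithmetic, as a sketch; the per-record size is my assumption, back-solved from the numbers in the post rather than measured:

```python
# Cassandra's default batch size warning threshold is 5kB
# (batch_size_warn_threshold_in_kb in cassandra.yaml);
# the hard failure ceiling is 50kB.
WARN_THRESHOLD_BYTES = 5 * 1024

# Assumed size of one mutation for these small sensor records –
# roughly 128 bytes, back-solved from the figures above.
BYTES_PER_RECORD = 128

max_batch = WARN_THRESHOLD_BYTES // BYTES_PER_RECORD
print(max_batch)  # 40 records stays under the warning threshold
```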

Ingesting Sensor Data

Analyzing the Data in Cassandra

Over in cqlsh, I was able to see the data and perform some simple analysis.
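Some illustrative queries; the table and column names are assumptions carried over from the earlier DDL sketch:

```sql
-- How many readings have landed so far?
SELECT count(*) FROM sensor.sensor_readings;

-- Cassandra 2.2's built-in aggregates: min, max, avg, sum, count
SELECT min(temperature), max(temperature), avg(temperature)
  FROM sensor.sensor_readings;
```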

Of course, as analysis goes, this is barely scratching the surface. Now that the data is flowing into Cassandra, you could build a visualization, or write data to any other destination supported by SDC.


StreamSets Data Collector is sufficiently lightweight to run well on the Raspberry Pi, ingesting data produced by local apps and writing directly to a wide range of destinations. What can you build with SDC and RPi? Let me know in the comments!

Pat Patterson