
Ingesting Sensor Data on the Raspberry Pi with StreamSets Data Collector

Posted in Data Integration, June 27, 2016

Raspberry Pi with Sensor

In the unlikely event you’re not familiar with the Raspberry Pi, it’s an ARM-based computer about the same size as a deck of playing cards. The latest iteration, the Raspberry Pi 3, has a 1.2GHz ARMv8 CPU, 1GB of memory, and integrated Wi-Fi and Bluetooth, all for the same $35 price tag as the original Raspberry Pi released in 2012. Running a variety of operating systems, including Linux, it’s actually quite a capable little machine. How capable? Well, last week, I successfully ran StreamSets Data Collector (SDC) on one of mine (yes, I have several!), ingesting sensor data and sending it to Apache Cassandra for analysis.

Here’s how you can build your own Internet of Things testbed and ingest sensor data for around $50.

Sensing Environmental Data

BMP180

I started with a Raspberry Pi 3 Model B with Raspbian Jessie Lite installed and Adafruit’s BMP180 Barometric Pressure/Temperature/Altitude Sensor. I followed the instructions in Adafruit’s BMP085/180 tutorial to configure the Raspberry Pi to talk to the sensor via I2C and ran the example Python app to show sensor readings.

SDC’s File Tail Origin seemed the easiest way to get sensor data from Python to a pipeline, so I modified Adafruit’s Python app to produce JSON and wrote a simple script to run it every few seconds (code in Gist). The resulting output looks like this:
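A representative line, with illustrative values (the field names match the Cassandra columns used later):

{"sensor_id": 1, "time": 1466805584000, "temperature": 27.7, "pressure": 101280, "altitude": 4.83}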


Once everything was working, I set my script running, with its output directed to ~/sensor.log, to generate some data.

./ > ~/sensor.log &
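For reference, a minimal sketch of what the polling script boils down to is shown below. It assumes the Adafruit_BMP library from the tutorial; the sensor_id value and the five-second interval are arbitrary choices for illustration, not necessarily what my script uses.

import json
import sys
import time

import Adafruit_BMP.BMP085 as BMP085  # the BMP085 driver also handles the BMP180

sensor = BMP085.BMP085()

while True:
    reading = {
        'sensor_id': 1,                            # arbitrary ID for this device
        'time': int(time.time() * 1000),           # milliseconds since the Unix epoch
        'temperature': sensor.read_temperature(),  # degrees Celsius
        'pressure': sensor.read_pressure(),        # Pascals
        'altitude': sensor.read_altitude(),        # metres
    }
    print(json.dumps(reading))
    sys.stdout.flush()                             # so File Tail sees each line promptly
    time.sleep(5)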

Installing StreamSets Data Collector on the Raspberry Pi

Now to get SDC running! First, we need Java. StreamSets supports the Oracle JDK, rather than OpenJDK, so I installed it on the Pi with:

$ sudo apt-get install oracle-java8-jdk

I used Jessie Lite to minimize the amount of storage and memory consumed on the Pi. One consequence was that there was no desktop to run a browser, so I browsed the StreamSets archive on my laptop, copied the relevant link, and downloaded SDC directly onto the Pi. To minimize SDC’s footprint, I downloaded the core package, which includes only a minimal set of stages:

$ wget

Following the instructions on the StreamSets download page, I extracted the tarball and was ready to download the Cassandra destination:

$ tar xvzf streamsets-datacollector-core-
$ cd streamsets-datacollector-core-

At this point, I ran into a snag. The SDC package manager uses sha1sum to verify the integrity of stages as they are downloaded. Unfortunately, a bug, fixed in the upcoming release, caused the package manager to fail with an error:

$ ./bin/streamsets stagelibs -list

sha1sum: invalid option -- 's'
Try 'sha1sum --help' for more information.

Failed! running sha1sum -s -c in /tmp/sdc-setup-927

There’s an easy workaround, though – open libexec/_stagelibs in an editor, and change line 99 from

run sha1sum -s -c ${FILE_NAME}.sha1

to

run sha1sum --status -c ${FILE_NAME}.sha1

Now I could list the available stages:

$ ./bin/streamsets stagelibs -list

StreamSets Data Collector

Stage Library Repository:

    ID                                                           Name                                     Installed
 streamsets-datacollector-apache-kafka_0_8_1-lib              Apache Kafka                         NO
 streamsets-datacollector-apache-kafka_0_8_2-lib              Apache Kafka                         NO
 streamsets-datacollector-apache-kafka_0_9-lib                Apache Kafka                         NO
 streamsets-datacollector-apache-kudu-0_7-lib                 Apache Kudu 0.7.1                            NO
 streamsets-datacollector-aws-lib                             Amazon Web Services 1.10.59                  NO
 streamsets-datacollector-basic-lib                           Basic                                       YES
 streamsets-datacollector-cassandra_2-lib                     Cassandra 2.1.5                              NO
 streamsets-datacollector-cdh_5_2-lib                         CDH 5.2.6                                    NO
 streamsets-datacollector-cdh_5_3-lib                         CDH 5.3.8                                    NO
 streamsets-datacollector-cdh_5_4-cluster-cdh_kafka_1_2-lib   CDH 5.4.8 Cluster Kafka 1.2.0 Lib            NO
 streamsets-datacollector-cdh_5_4-cluster-cdh_kafka_1_3-lib   CDH 5.4.8 Cluster Kafka 1.3.2 Lib            NO
 streamsets-datacollector-cdh_5_4-lib                         CDH 5.4.8                                    NO
 streamsets-datacollector-cdh_5_5-cluster-cdh_kafka_1_3-lib   CDH 5.5.0 Cluster Kafka 1.3.2 Lib            NO
 streamsets-datacollector-cdh_5_5-lib                         CDH 5.5.0                                    NO
 streamsets-datacollector-cdh_5_7-cluster-cdh_kafka_2_0-lib   CDH 5.7.0 Cluster Kafka 2.0.1 Lib            NO
 streamsets-datacollector-cdh_5_7-lib                         CDH 5.7.0                                    NO
 streamsets-datacollector-cdh_kafka_1_2-lib                   CDH Kafka 1.2.0 (                    NO
 streamsets-datacollector-cdh_kafka_1_3-lib                   CDH Kafka 1.3.2 (                    NO
 streamsets-datacollector-elasticsearch_1_4-lib               Elasticsearch 1.4.5                          NO
 streamsets-datacollector-elasticsearch_1_5-lib               Elasticsearch 1.5.2                          NO
 streamsets-datacollector-elasticsearch_1_6-lib               Elasticsearch 1.6.2                          NO
 streamsets-datacollector-elasticsearch_1_7-lib               Elasticsearch 1.7.4                          NO
 streamsets-datacollector-elasticsearch_2_0-lib               Elasticsearch 2.0.2                          NO
 streamsets-datacollector-elasticsearch_2_1-lib               Elasticsearch 2.1.2                          NO
 streamsets-datacollector-elasticsearch_2_2-lib               Elasticsearch 2.2.2                          NO
 streamsets-datacollector-elasticsearch_2_3-lib               Elasticsearch 2.3.1                          NO
 streamsets-datacollector-groovy_2_4-lib                      Groovy 2.4.6                                YES
 streamsets-datacollector-hdp_2_2-lib                         HDP 2.2.0                                    NO
 streamsets-datacollector-hdp_2_3-lib                         HDP 2.3.0                                    NO
 streamsets-datacollector-hdp_2_4-lib                         HDP 2.4.0                                    NO
 streamsets-datacollector-influxdb_0_9-lib                    InfluxDB 0.9+                                NO
 streamsets-datacollector-jdbc-lib                            JDBC                                         NO
 streamsets-datacollector-jms-lib                             JMS                                          NO
 streamsets-datacollector-jython_2_7-lib                      Jython 2.7.0                                YES
 streamsets-datacollector-mapr_5_0-lib                        MapR 5.0.9                                   NO
 streamsets-datacollector-mapr_5_1-lib                        MapR 5.1.0                                   NO
 streamsets-datacollector-mongodb_3-lib                       MongoDB 3.0.2                                NO
 streamsets-datacollector-omniture-lib                        Omniture                                     NO
 streamsets-datacollector-rabbitmq-lib                        RabbitMQ 3.5.6                               NO
 streamsets-datacollector-redis-lib                           Redis 2.8.x, 3.0.x                           NO
 streamsets-datacollector-stats-lib                           Statistics                                   NO

Since File Tail Origin is in the included ‘basic’ package of stages, I just needed to install the Cassandra Destination:

$ ./bin/streamsets stagelibs -install=streamsets-datacollector-cassandra_2-lib

Creating a Table in Cassandra

Over on my laptop, I installed Cassandra 2.2. This was the first release of Cassandra to support aggregate functions, such as mean, max and min, and it works well with SDC’s Cassandra destination, which uses the 2.1.5 driver.

In cqlsh, I created a keyspace and table to hold my data:

CREATE KEYSPACE mykeyspace
    WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE mykeyspace;
CREATE TABLE sensor_readings (
    sensor_id INT,
    time TIMESTAMP,
    temperature DOUBLE,
    pressure DOUBLE,
    altitude DOUBLE,
    PRIMARY KEY (sensor_id, time)
);

Building a Pipeline to Ingest Sensor Data

Now I could build a pipeline! The File Tail Origin was straightforward to configure – I just needed to specify JSON as the Data Format and /home/pi/sensor.log as the Path.

File Tail

I passed the origin’s data stream to a Field Converter Processor, since the time value from the Python app, expressed in milliseconds since the Unix epoch (1/1/1970), needed to be converted to a timestamp for consumption by Cassandra.

Field Converter
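Conceptually, the conversion is just epoch milliseconds to a UTC timestamp. In Python terms, with a hypothetical example value:

from datetime import datetime

millis = 1466805584000  # hypothetical value from the sensor script
print(datetime.utcfromtimestamp(millis / 1000.0))
# 2016-06-24 21:59:44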

The Field Converter’s output goes straight to the Cassandra Destination, which needed mappings from the property names in the JSON to Cassandra column names as well as the Cassandra host and port.

Cassandra Destination

Finally, I didn’t need the File Tail Origin’s metadata stream, so I just sent it to the trash.


In preview mode, everything looked just fine, but the first time I ran the pipeline, I hit a snag – the pipeline failed with the error:

RUNNING_ERROR com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large

After some research I discovered that the default Cassandra behavior is to log a warning on batch sizes greater than 5kB. Looking in the Cassandra logs, I could see that my SDC batch size of 1000 records crossed not only the warning threshold, but also the hard ceiling of 50kB!

ERROR [SharedPool-Worker-1] 2016-06-24 13:19:27,119 - Batch of prepared statements for [mykeyspace.sensor_readings] is of size 116380, exceeding specified threshold of 51200 by 65180. (see batch_size_fail_threshold_in_kb)

Doing the math, it looked like Cassandra would be happy with 40 records, so I configured the origin’s maximum batch size accordingly. Now my pipeline was able to ingest the 14,000 or so records I had already collected, and continue processing new readings as they were written to the sensor.log file.
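An alternative to shrinking the SDC batch would have been to raise Cassandra’s thresholds, which live in cassandra.yaml (the values below are the defaults, in kB):

batch_size_warn_threshold_in_kb: 5
batch_size_fail_threshold_in_kb: 50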

Ingesting Sensor Data

Analyzing the Data in Cassandra

Over in cqlsh, I was able to see the data, and perform some simple analysis:

cqlsh:mykeyspace> select count(*) from sensor_readings where sensor_id = 1;


(1 rows)
cqlsh:mykeyspace> select * from sensor_readings where sensor_id = 1 order by time desc limit 10;

 sensor_id | time | altitude | pressure | temperature
 1 | 2016-06-24 21:59:44+0000 | 4.83 | 1.0128e+05 | 27.7
 1 | 2016-06-24 21:59:38+0000 | 4.33 | 1.0128e+05 | 27.7
 1 | 2016-06-24 21:59:33+0000 | 3.91 | 1.0128e+05 | 27.7
 1 | 2016-06-24 21:59:28+0000 | 3.66 | 1.0127e+05 | 27.7
 1 | 2016-06-24 21:59:23+0000 | 4.33 | 1.0127e+05 | 27.7
 1 | 2016-06-24 21:59:18+0000 | 4.08 | 1.0128e+05 | 27.7
 1 | 2016-06-24 21:59:12+0000 | 4.08 | 1.0128e+05 | 27.7
 1 | 2016-06-24 21:59:07+0000 | 4.16 | 1.0128e+05 | 27.7
 1 | 2016-06-24 21:59:02+0000 | 4.41 | 1.0127e+05 | 27.7
 1 | 2016-06-24 21:58:57+0000 | 4.58 | 1.0128e+05 | 27.7

(10 rows)
cqlsh:mykeyspace> select count(temperature), avg(temperature), min(temperature), max(temperature) from sensor_readings where sensor_id = 1;

 system.count(temperature) | system.avg(temperature) | system.min(temperature) | system.max(temperature)
 14826 | 25.36766 | 24.1 | 28

(1 rows)
cqlsh:mykeyspace> select count(temperature) as count, avg(temperature) as mean, min(temperature) as min, max(temperature) as max from sensor_readings where sensor_id = 1;

 count | mean | min | max
 14830 | 25.36829 | 24.1 | 28

(1 rows)

Of course, as analysis goes, this is barely scratching the surface. Now that the data is flowing into Cassandra, you could build a visualization, or write data to any other destination supported by SDC.
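For example, you could pull readings straight into Python on the laptop for a quick chart. Here is a minimal sketch, assuming the DataStax cassandra-driver and matplotlib packages are installed:

from cassandra.cluster import Cluster
import matplotlib.pyplot as plt

# Connect to the local Cassandra node and the keyspace created earlier
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('mykeyspace')

rows = session.execute(
    "SELECT time, temperature FROM sensor_readings WHERE sensor_id = 1")

times, temps = zip(*[(row.time, row.temperature) for row in rows])

plt.plot(times, temps)
plt.xlabel('time')
plt.ylabel('temperature (C)')
plt.show()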


StreamSets Data Collector is sufficiently lightweight to run well on the Raspberry Pi, ingesting data produced by local apps and writing directly to a wide range of destinations. What can you build with SDC and RPi? Let me know in the comments!
