Dataflow Performance Blog

Standard Deviations on Cassandra – Rolling Your Own Aggregate Function

Cassandra logoIf you've been following the StreamSets blog over the past few weeks, you'll know that I've been building an Internet of Things testbed on the Raspberry Pi. First, I got StreamSets Data Collector (SDC) running on the Pi, ingesting sensor data and sending it to Apache Cassandra, and then I wrote a Python app to display SDC metrics on the PiTFT screen. In this blog entry I'll take the next step, querying Cassandra for statistics on my sensor data.

Detecting Outlier Values

Now that I have sensor data flowing into Cassandra, I want to analyze that data and then feed it back into SDC so I can detect outlier values. A common way of detecting outliers is to flag readings that fall outside some range expressed in terms of the mean +/- some number of standard deviations (also known as ‘sigma' or σ). For instance, assuming a normal distribution, 99.7% of values should fall within 3σ of the mean, while 99.95% of values should fall within 4σ. Wikipedia has a handy table with the values. Cassandra can give me the mean of the values in a column, with its avg function, but not standard deviation. Fortunately, though, it's possible to define your own ‘user-defined aggregate' (UDA) functions for Cassandra. Here's how I created a UDA for standard deviation.

Cassandra User Defined Aggregate Functions

Cassandra UDA's are defined in terms of two user defined functions (UDF's): a state function and a final function. The state function, called for each row in turn, takes a state parameter and a value as parameters and returns a new state. After all rows have been processed by the state function, the final function is called with the last state value as its parameter, and returns the aggregate value. The Cassandra docs on UDA's show how to calculate the mean in this way. Let's pull the two functions out of the example and format them as Java functions to better see how they work:

As you can see, the avgState function keeps track of the count of values and their total, while avgFinal simply divides the total by the count to get the mean. Elementary school math!

Standard deviation, however, is a bit more complicated. A measure of the ‘spread' of a set of values from their mean, standard deviation is found by “taking the square root of the average of the squared deviations of the values from their average value”. Helpfully, Wikipedia contains an ‘online algorithm‘ for computing variance (the square of standard deviation) in a single pass through the data – just what we need! Transliterating from Wikipedia's Python implementation to Java, we get:

I wrote a test harness to calculate the mean and standard deviation of the integers from 1 to 10 using the ‘online' algorithm, a simpler iterative algorithm, and also created an Excel spreadsheet to do the same calculation using the AVERAGE() and STDEV() functions. Happily, all three methods gave the same mean of 5.5 and standard deviation of 3.02765!

Excel mean and SD

One piece of housekeeping was necessary before I could actually define my own function: since I'm using Cassandra 2.2, I needed to add the following line to cassandra.yaml:

Cassandra 3.0 lets you define Java functions without explicitly enabling them. See the docs on UDF's for more details.

That done, I just needed to remove comments and line breaks from my Java functions, paste them into the Cassandra function definitions and feed them into cqlsh:

So far, so good… Let's find the mean and standard deviation of the integers from 1 to 10 in Cassandra:

Success! Now let's try getting statistics for the past day's temperature readings:

It's been hot here in San Jose, California!

Now I'm able to get statistics from Cassandra, the next trick is to feed them into SDC to be able to filter out outlier values for closer inspection. I'll cover that in my next blog entry. Watch this space!

Pat PattersonStandard Deviations on Cassandra – Rolling Your Own Aggregate Function