Sometimes it is easier to analyze data in Ruby than to pull it into Excel or OpenOffice (poor tools for statistical analysis) or the very robust R.
For example, I was recently looking at system resource utilization for several jobs in a large benchmarking project. Because I had stored the job information, along with a dozen or so statistical measures for each job, in a SQLite database, it was much simpler to write a Ruby program that retrieves the values and crunches them than to go through the laborious process of loading a hundred-plus data extract files into a spreadsheet and manipulating them. This was especially so as I wanted to filter the data to only consider observations within one standard deviation of the mean (also easy in Ruby). Thankfully, I have a small utility library on hand for computing a few statistical measures.
Perhaps you might find this useful too.
#
# Add statistical measures to Enumerable
#
# Copyright 2011 Trey Kinkead
#
module Enumerable
  def mean
    total = self.inject(0) { |acc, x| acc + x }
    total.to_f / self.size
  end

  def median
    self.size % 2 == 0 ?
      self.sort[self.size / 2 - 1, 2].mean : # even
      self.sort[self.size / 2].to_f # odd
  end

  def population_standard_deviation
    mean = self.mean
    sum_diff_squared = self.inject(0) { |accum, x| accum + ((x - mean)**2) }
    Math.sqrt(sum_diff_squared / self.size)
  end

  # Sample Variance
  def variance
    mean = self.mean
    sum_diff_squared = self.inject(0) { |accum, x| accum + ((x - mean)**2) }
    sum_diff_squared / (self.size - 1)
  end

  # Sample Standard Deviation
  def standard_deviation
    Math.sqrt(variance)
  end

  def mean_absolute_deviation
    mean = self.mean
    sum_abs_diff = self.inject(0) { |accum, x| accum + ((x - mean).abs) }
    sum_abs_diff / (self.size)
  end

  # computed by NIST method:
  # http://www.itl.nist.gov/div898/handbook/prc/section2/prc252.htm
  #
  # I believe this corresponds to R's type=6
  # http://stat.ethz.ch/R-manual/R-patched/library/stats/html/quantile.html
  #
  def percentile(p)
    sorted = self.sort
    n = p * (sorted.size + 1)
    # alternate method used by Excel:
    # n = p*(sorted.size-1)+1
    k = n.floor
    d = n - k
    if k == 0
      sorted.first
    elsif k >= sorted.size
      sorted.last
    else
      # note that our indexes are 0-based, rather than 1-based
      i = k - 1
      sorted[i] + d * (sorted[i + 1] - sorted[i])
    end
  end
end
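As a quick check of the percentile calculation, here is a minimal standalone sketch of the same NIST-style interpolation, written as a plain function so that it runs on its own. The name `nist_percentile` and the sample data are assumptions made for this illustration; they are not part of the library.

```ruby
# Standalone sketch of the NIST-style percentile interpolation.
# `nist_percentile` is a hypothetical name used only for this example.
def nist_percentile(values, p)
  sorted = values.sort
  n = p * (sorted.size + 1)  # fractional, 1-based rank
  k = n.floor                # integer part of the rank
  d = n - k                  # fractional part of the rank
  return sorted.first if k == 0             # rank falls below the first value
  return sorted.last  if k >= sorted.size   # rank falls at or past the last value
  i = k - 1                  # convert the 1-based rank to a 0-based index
  sorted[i] + d * (sorted[i + 1] - sorted[i])
end

data = [50, 48, 44, 56, 61, 52, 53, 55, 67, 51]  # illustrative sample data

[0.05, 0.25, 0.5, 0.95].each do |p|
  printf("%4.2f => %.2f\n", p, nist_percentile(data, p))
end
# 0.05 => 44.00  (clamped to the smallest value)
# 0.25 => 49.50
# 0.50 => 52.50  (same as the median of this data)
# 0.95 => 67.00  (clamped to the largest value)
```

For this data the 0.5 percentile agrees with the median, and ranks that fall off either end of the sorted list are clamped to the smallest or largest value.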
Using the mix-in is trivial. For example, if the Enumerable extension above is saved in the file "enumerable_statistics.rb", one can:
require_relative 'enumerable_statistics'
a = [50, 48, 44, 56, 61, 52, 53, 55, 67, 51]
measures = %w[size min max mean median
              variance standard_deviation mean_absolute_deviation]
measures.each do |m|
  printf("%30s: %f\n", m, a.send(m))
end
For an output of:
                          size: 10.000000
                           min: 44.000000
                           max: 67.000000
                          mean: 53.700000
                        median: 52.500000
                      variance: 43.122222
            standard_deviation: 6.566751
       mean_absolute_deviation: 4.840000
A few of these measures are already built in, of course (e.g., "size" on Array, or "count", "min", and "max" on Enumerable). Also, I'm invoking the methods via "send" only so I can craft the simple demonstration above; normally one would call them directly (e.g., a.mean).
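The filtering I mentioned at the start (keeping only observations within one standard deviation of the mean) can be sketched along these lines. This is a self-contained illustration rather than part of the library: the helper name `within_std_devs` and its `width` parameter are hypothetical, and it recomputes the mean and sample standard deviation itself so that it runs without the mix-in.

```ruby
# Hypothetical helper: keep only the observations that lie within `width`
# sample standard deviations of the mean.
def within_std_devs(values, width = 1.0)
  mean = values.inject(0) { |acc, x| acc + x }.to_f / values.size
  sum_diff_squared = values.inject(0) { |acc, x| acc + (x - mean)**2 }
  std_dev = Math.sqrt(sum_diff_squared / (values.size - 1))  # sample standard deviation
  values.select { |x| (x - mean).abs <= width * std_dev }
end

data = [50, 48, 44, 56, 61, 52, 53, 55, 67, 51]
filtered = within_std_devs(data)
p filtered
# prints [50, 48, 56, 52, 53, 55, 51]
```

With the mix-in loaded, the same idea reduces to computing `a.mean` and `a.standard_deviation` once and passing a block to `select`. Here 44, 61, and 67 are dropped because they fall outside 53.7 plus or minus roughly 6.57.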
Naturally, the usual caveats apply: there is no guarantee of correctness, and you should use it at your own risk.
| Attachment | Size |
|---|---|
| enumerable_statistics.zip | 4.56 KB |