Acme Sprockets


Simple Statistics with Ruby

Sometimes it is easier to analyze data in Ruby than to pull it into Excel or OpenOffice (poor tools for statistical analysis) or the very robust R.

For example, I was recently looking at system resource utilization for several jobs in a large benchmarking project. As I had stored the job information and a dozen or so statistical measures for each job in a SQLite database, it was much simpler to write a Ruby program that retrieves the values and crunches them than to go through the laborious process of loading a hundred or more data extract files into a spreadsheet and manipulating them. This was especially so as I wanted to filter the data to consider only observations within one standard deviation of the mean (also easy in Ruby). Thankfully I have a small utility library on hand for computing a few statistical measures.

Perhaps you might find this useful too.

#
# Add statistical measures to Enumerable
#
# Copyright 2011 Trey Kinkead
#

module Enumerable

  def mean
    # use distinct names so the block parameter does not shadow the outer variable
    total = self.inject(0) { |accum,x| accum+x }
    total.to_f / self.size
  end

  def median
    self.size%2==0 ?
      self.sort[self.size/2-1,2].mean : # even
      self.sort[self.size/2].to_f       # odd
  end

  def population_standard_deviation
    mean = self.mean
    sum_diff_squared=self.inject(0) { |accum,x| accum+((x-mean)**2)}
    Math.sqrt(sum_diff_squared/self.size)
  end

  # Sample Variance
  def variance
    mean = self.mean
    sum_diff_squared=self.inject(0) { |accum,x| accum+((x-mean)**2)}
    sum_diff_squared/(self.size-1)
  end

  # Sample Standard Deviation
  def standard_deviation
    Math.sqrt(variance)
  end

  def mean_absolute_deviation
    mean = self.mean
    sum_abs_diff=self.inject(0) { |accum,x| accum+((x-mean).abs)}
    sum_abs_diff/(self.size)
  end

  # computed by NIST method:
  # http://www.itl.nist.gov/div898/handbook/prc/section2/prc252.htm
  #
  # I believe this corresponds to R's type=6
  # http://stat.ethz.ch/R-manual/R-patched/library/stats/html/quantile.html
  #
  # p is a fraction between 0 and 1 (e.g., 0.9 for the 90th percentile)
  #
  def percentile(p)
    sorted = self.sort
    n = p*(sorted.size+1)
    # alternate method used by Excel:
    # n = p*(sorted.size-1)+1
    k = n.floor
    d = n-k
    if k == 0
      sorted.first
    elsif k >= sorted.size
      sorted.last
    else
      # note that our indexes are 0-based, rather than 1-based
      i=k-1
      sorted[i]+d*(sorted[i+1]-sorted[i])
    end
  end
end
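
The two rank formulas mentioned in the percentile comments (the NIST one and the alternate one attributed to Excel) can give different answers on small samples. Here is a minimal standalone sketch for comparing them. It is not part of the library: the method name percentile_of and the rank: keyword are hypothetical, made up for this illustration only.

```ruby
# Standalone sketch (not part of the mix-in): linear-interpolation percentile
# with a selectable rank formula. `percentile_of` and `rank:` are hypothetical
# names introduced only for this comparison.
def percentile_of(values, p, rank: :nist)
  sorted = values.sort
  # 1-based fractional rank of the requested percentile (p is between 0 and 1)
  n = case rank
      when :nist  then p * (sorted.size + 1)      # formula used by the module's percentile
      when :excel then p * (sorted.size - 1) + 1  # alternate formula from its comments
      else raise ArgumentError, "unknown rank formula: #{rank}"
      end
  k = n.floor   # integer part of the rank
  d = n - k     # fractional part, used as the interpolation weight
  return sorted.first if k <= 0
  return sorted.last  if k >= sorted.size
  i = k - 1     # convert the 1-based rank to a 0-based index
  sorted[i] + d * (sorted[i + 1] - sorted[i])
end

a = [50, 48, 44, 56, 61, 52, 53, 55, 67, 51]
puts percentile_of(a, 0.25)                # 49.5
puts percentile_of(a, 0.25, rank: :excel)  # 50.25
puts percentile_of(a, 0.5)                 # 52.5 (same as the median)
```

With only ten observations the two formulas disagree at the 25th percentile, so it is worth knowing which one a tool uses before comparing numbers across tools.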

Using the mix-in is trivial. For example, if the Enumerable module above is saved in the file "enumerable_statistics.rb", one can:

# Ruby 1.9.2+ no longer searches the current directory with plain require
require_relative 'enumerable_statistics'

a = [50, 48, 44, 56, 61, 52, 53, 55, 67, 51]

measures = %w[ size min max mean median
    variance standard_deviation mean_absolute_deviation]

measures.each{ |m|
  printf("%30s: %f\n", m, a.send(m))
}

For an output of:

                          size: 10.000000
                           min: 44.000000
                           max: 67.000000
                          mean: 53.700000
                        median: 52.500000
                      variance: 43.122222
            standard_deviation: 6.566751
       mean_absolute_deviation: 4.840000

A few of these measures are already built in, of course (e.g., "min", "max", and "count" come from Enumerable; "size" comes from Array). Also, I'm invoking the methods via "send" only so that I can loop over their names in the simple demonstration above.
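
The one-standard-deviation filter mentioned at the start takes only a few lines. This is a minimal standalone sketch, assuming the sample standard deviation is the one intended (the post does not say which). The helper name within_one_sd is hypothetical, and the data is the same array used in the demonstration.

```ruby
# Standalone sketch: keep only observations within one sample standard
# deviation of the mean. `within_one_sd` is a hypothetical helper name.
def within_one_sd(values)
  mean = values.inject(0) { |acc, x| acc + x }.to_f / values.size
  sum_diff_squared = values.inject(0) { |acc, x| acc + (x - mean)**2 }
  sd = Math.sqrt(sum_diff_squared / (values.size - 1))  # sample standard deviation
  values.select { |x| (x - mean).abs <= sd }
end

a = [50, 48, 44, 56, 61, 52, 53, 55, 67, 51]
p within_one_sd(a)  # [50, 48, 56, 52, 53, 55, 51]
```

With a mean of 53.7 and a standard deviation of about 6.57, the values 44, 61, and 67 fall outside the band and are dropped.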

Naturally the usual caveats apply: there is no guarantee of correctness, and you should use this at your own risk.

Attachment: enumerable_statistics.zip (4.56 KB)
