A task I had to deal
with this week was this one. I was given
a data set we want to test, like this:
2 3 4 4 5 5 6 6 7 9
We have an algorithm
to compute some basic statistics on any given data set (mean, median, variance,
etc…). Nothing special about that. And I had two data sets - the small one above,
used mostly to make sure the testing code would work, and another data set of
50,000+ floating point numbers:
-8322.458 -6199.002 -6002.999 and so on.
What I needed to do
was compare the results of those types of calculations across a variety of
different tools which also compute those basic stats. I chose Excel, R, some Python code I wrote
myself, Numpy/Scipy and Octave.
And that is where
the problems came in.
My original data
sets were simply a list of numbers, without commas, seperated by spaces, all on
one row. For the small data set, for all
the tools, I could just copy/paste or even retype to get the data into the format
the tool wanted. This is not a hard
problem to solve, just tedious. The
industry calls this "data munging" (getting the data from the format
you have into the format your tools needs) and is almost always the most time
consuming part of any analysis. Hit me
up for links to prove this if you want.
For instance, excel
prefers a single column to make entering the calculations easy, but can use a
row. Python's read csv files wants
commas to separate values along a row (you can specify spaces) but once the
data is imported, it is easiest to have one column of data. So I had to create a file of the 50,000+
values with each value on one line.
R was able to use
the same file as Python. Nice!
Octave wanted all the values on one row so I had to
re-layout the numbers with a comma in between each. Since this was a one-off task, I simply used
Word to edit the file. It took a little
under a minute to make the 50,000+ replacements.
Now I have the data
files in the format that all the tools want, and can use their results to help
ensure Tableau is getting expected answers for these basic statistics.
Questions, comments,
concerns and criticisms always welcome,
John