Thursday, March 30, 2017

Some simple data munging work this week


A task I had to deal with this week was this one.  I was given a data set we want to test, like this:
2 3 4 4 5 5 6 6 7 9
We have an algorithm to compute some basic statistics on any given data set (mean, median, variance, etc…).  Nothing special about that.  And I had two data sets - the small one above, used mostly to make sure the testing code would work, and another data set of 50,000+ floating point numbers:
-8322.458 -6199.002 -6002.999 and so on.

What I needed to do was compare the results of those types of calculations across a variety of different tools which also compute those basic stats.  I chose Excel, R, some Python code I wrote myself, Numpy/Scipy and Octave.

And that is where the problems came in.

My original data sets were simply a list of numbers, without commas, seperated by spaces, all on one row.  For the small data set, for all the tools, I could just copy/paste or even retype to get the data into the format the tool wanted.  This is not a hard problem to solve, just tedious.  The industry calls this "data munging" (getting the data from the format you have into the format your tools needs) and is almost always the most time consuming part of any analysis.  Hit me up for links to prove this if you want.

For instance, excel prefers a single column to make entering the calculations easy, but can use a row.  Python's read csv files wants commas to separate values along a row (you can specify spaces) but once the data is imported, it is easiest to have one column of data.  So I had to create a file of the 50,000+ values with each value on one line.

R was able to use the same file as Python.  Nice!

Octave  wanted all the values on one row so I had to re-layout the numbers with a comma in between each.  Since this was a one-off task, I simply used Word to edit the file.  It took a little under a minute to make the 50,000+ replacements.

Now I have the data files in the format that all the tools want, and can use their results to help ensure Tableau is getting expected answers for these basic statistics.

Questions, comments, concerns and criticisms always welcome,
John



No comments:

Post a Comment