Wednesday, March 8, 2017

Hackathon week

This is a hackathon week for us here at Tableau.  We get to work on self-directed projects, tackle things we "know need to get done", explore new avenues and so on.  Overall, it is a very productive time and most folks look forward to this regular activity.

This week, I am working on a simple (think V1) "fuzzing" tool to dirty the Superstore sample workbook we ship.  In this context, fuzzing means deliberately dirtying some of the fields in the data to mimic the bad data that may get sent to Tableau.
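As a rough illustration of the idea (not the actual hackathon tool), here is a minimal Python sketch of that kind of fuzzing.  It assumes the data has already been loaded as a list of dictionaries, one per row; the names `fuzz_value` and `fuzz_column`, the "State" field and the particular corruptions chosen are all hypothetical.

```python
import random


def fuzz_value(value, rng):
    """Return a dirtied copy of a string using one randomly chosen corruption."""
    if not value:
        return value
    choice = rng.choice(["swap", "drop", "case", "pad"])
    if choice == "swap" and len(value) > 1:
        # Transpose two adjacent characters, e.g. "Texas" -> "Teaxs".
        i = rng.randrange(len(value) - 1)
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    if choice == "drop" and len(value) > 1:
        # Delete a single character.
        i = rng.randrange(len(value))
        return value[:i] + value[i + 1:]
    if choice == "case":
        # Flip the case of every letter.
        return value.swapcase()
    # Fallback: pad with stray whitespace.
    return " " + value + " "


def fuzz_column(rows, field, rate, seed=0):
    """Return a copy of rows in which roughly `rate` of the `field` values are dirtied.

    The input rows are left untouched, and the seed makes a run repeatable.
    """
    rng = random.Random(seed)
    fuzzed = []
    for row in rows:
        new_row = dict(row)
        if rng.random() < rate:
            new_row[field] = fuzz_value(new_row[field], rng)
        fuzzed.append(new_row)
    return fuzzed
```

Calling `fuzz_column(rows, "State", 0.05)` would dirty about 5% of the state names.  A fixed seed is a deliberate choice here: a dirty data set is much more useful for testing if the same mess can be regenerated on demand.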

An example of this is a storm damage spreadsheet I encountered while taking a data cleaning class.  This spreadsheet (since it is still in use to my knowledge I won't link to it) was a hand created list of major storms in the USA dating back to the 1950s.  When I hear the phrase "hand created" I have learned to shudder.  There is simply too much potential for a typo to creep in and skew the data (at best) or result in completely invalid entries (much worse).  For instance, this particular spreadsheet had 54 states listed.

A fair question to ask would be "How many states would I expect to see?"  50 is an obvious answer, since we currently have 50 states.  That wasn't true when the data was first being collected, but we have 50 now, so 50 might be a reasonable first guess.  The District of Columbia is not a state, but is often included in data gathered by the government, so 51 "states" may also be a good expectation.

Since this is storm data, I naturally think of hurricanes, and that makes me think of the Caribbean.  Now I have to wonder if I need to include Puerto Rico in my analysis, which would raise the state count to 52.

Finally, now that I knew 50 was not necessarily the exact value to expect, I began to dig through the data.  In this case, DC was included, but Puerto Rico was not.  The other three "states" were typos - things like "Florida" being spelled "Flarida" and similar.  (Quick - how many ways can you type valid names or abbreviations for "Florida"?)  Once I had that cleaned up, I was down to 51 states (50 states + the District of Columbia) for my analysis.
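The check described above (count the distinct values, then look hard at anything outside the expected list) can be sketched in a few lines of Python.  This is only an illustration and not how the spreadsheet was actually cleaned: the function name `suspect_values`, the sample values and the 0.8 similarity cutoff are assumptions, and the matching relies on `difflib.get_close_matches` from the standard library.

```python
from difflib import get_close_matches


def suspect_values(values, known, cutoff=0.8):
    """Map each distinct value not found in `known` to its closest known name.

    A value with no sufficiently similar known name maps to None.
    """
    suspects = {}
    for value in dict.fromkeys(values):  # distinct values, first-seen order
        if value not in known:
            match = get_close_matches(value, known, n=1, cutoff=cutoff)
            suspects[value] = match[0] if match else None
    return suspects


# Hypothetical sample, not the real spreadsheet.
known_names = ["Florida", "Georgia", "Texas", "District of Columbia"]
observed = ["Florida", "Flarida", "Texas", "District of Columbia", "Florida"]

distinct_count = len(set(observed))
suspects = suspect_values(observed, known_names)
```

With this sample, `distinct_count` comes out higher than the number of real places in the data, and `suspects` points at "Flarida" with "Florida" as the likely intended value.  A fuzzy match like this only proposes a correction; someone still has to decide whether an unexpected value is a typo or a legitimate entry such as Puerto Rico.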

Anyway, remembering that lesson made me want to contribute to this effort.  Data is not always (never?) clean to start.  This hackathon I am trying to help solve that challenge.

Questions, comments, concerns and criticisms always welcome,
