This is a hackathon
week for us here at Tableau. We get to
work on self-directed projects, things we "know need to get done",
explore new avenues and so on. Overall, it
is a very productive time and most folks look forward to this regular activity.
For this week, I am
working on a simple (think V1) "fuzzing" tool to dirty the Superstore
sample workbook we ship. In this
context, fuzzing means to dirty some of the fields in the data to reflect bad
data that may get sent to Tableau.
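As a rough illustration of the idea, here is a minimal Python sketch of what dirtying a single field might look like. This is not the actual hackathon tool: the function names `fuzz_value` and `fuzz_column`, the `State` column name, and the `rate` parameter are all assumptions made up for this example. It introduces at most one typo (a swapped, dropped, or replaced character) into a fraction of the values.

```python
import random

def fuzz_value(value: str, rng: random.Random) -> str:
    """Return a copy of value with one random typo (hypothetical helper)."""
    if len(value) < 2:
        return value  # too short to fuzz meaningfully
    i = rng.randrange(len(value) - 1)
    kind = rng.choice(["swap", "drop", "replace"])
    if kind == "swap":
        # Transpose two adjacent characters.
        return value[:i] + value[i + 1] + value[i] + value[i + 2:]
    if kind == "drop":
        # Delete one character.
        return value[:i] + value[i + 1:]
    # Replace one character with a random lowercase letter.
    return value[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[i + 1:]

def fuzz_column(rows, column, rate, seed=0):
    """Return new rows where roughly `rate` of the values in `column` are dirtied."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    fuzzed = []
    for row in rows:
        new_row = dict(row)  # copy so the clean data is left untouched
        if rng.random() < rate:
            new_row[column] = fuzz_value(new_row[column], rng)
        fuzzed.append(new_row)
    return fuzzed

# Assumed sample rows; the real workbook has many more fields.
rows = [{"State": "Florida"}, {"State": "Texas"}, {"State": "Ohio"}]
dirty = fuzz_column(rows, "State", rate=0.5, seed=42)
```

A swap of two identical letters, or a replacement that happens to pick the same letter, leaves the value unchanged, so the effective rate of dirty values can be slightly lower than `rate`.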
An example of this
is a storm damage spreadsheet I encountered while taking a data cleaning
class. This spreadsheet (since it is
still in use to my knowledge I won't link to it) was a hand created list of
major storms in the USA dating back to the 1950s. When I hear the phrase "hand
created" I have learned to shudder.
There is simply too much potential for a typo to creep in and skew the data
(at best) or result in completely invalid entries (much worse). For instance, this particular spreadsheet had
54 states listed.
A fair question to
ask would be "How many states would I expect to see?" 50 is an obvious answer, since we currently
have 50 states. That wasn't true
when the data first started being collected, but it is true now, so 50 is a
reasonable first guess. The District of
Columbia is not a state, but it is often included in data gathered by the
government, so 51 "states" may also be a good expectation.
Since this is storm
data, I naturally think of hurricanes, and that makes me think of the
Caribbean. Now I have to wonder whether I
need to include Puerto Rico in my analysis, which would raise the "state" count
to 52.
Finally, now that I
expected that 50 was not necessarily the exact number I should see,
I began to dig through the data.
In this case, DC was included, but Puerto Rico was not. The other three "states" were typos
- things like "Florida" being spelled "Flarida" and
similar. (Quick - how many ways can you type valid names or abbreviations for "Florida"?) Once I had that cleaned up, I
was down to 51 states (50 states + the District of Columbia) for my analysis.
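The kind of check I did by hand can be sketched in a few lines of Python. This is only an illustration, not how the original cleanup was done: the `audit_states` function and the shortened `EXPECTED` set are made up for this example, and the fuzzy matching uses the standard library's `difflib.get_close_matches`. The idea is to compare the distinct values found against the values I expect and suggest a likely correction for anything left over.

```python
import difflib

# Abbreviated on purpose; a real check would list all 50 states plus DC.
EXPECTED = {"Florida", "Georgia", "Texas", "District of Columbia"}

def audit_states(values, expected, cutoff=0.8):
    """Map each unexpected value to its closest expected name, or None."""
    candidates = sorted(expected)
    suspects = {}
    for value in set(values) - set(expected):
        # Ask for the single best match above the similarity cutoff.
        matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
        suspects[value] = matches[0] if matches else None
    return suspects

found = ["Florida", "Flarida", "Texas", "Narnia"]
print(audit_states(found, EXPECTED))
```

Here "Flarida" is paired with "Florida", while "Narnia" has no close match and is reported as `None`, which flags it for a human to look at. The comparison is case-sensitive, so a real check would likely normalize case and whitespace first.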
Anyway, remembering
that lesson made me want to work on this tool. Data is not always (ever?) clean to
start with. This hackathon, I am trying to
help solve that challenge.
Questions, comments,
concerns and criticisms always welcome,
John