In statistical circles, there is a data set known as Anscombe's Quartet. The link goes to Robert
Kosara's blog, where he does a good job of showing off the data set and how radically
different-looking data can have similar statistical properties.
The data set was designed to give columns of data points that all have the same standard
deviation (as one example). For instance, the standard deviation of the first Y column is
1.937024, while that of the second Y column is 1.937109. In both cases, I am rounding the
numbers to the 6th decimal place.
The difference in the values is 1.937109 - 1.937024 = 0.000085. Now, this is very close, and to
a human eye viewing a plot of the data the difference is probably too small to be seen. For
making a point about the data - that similar statistical properties cannot be used to
determine the shape of a data set - this is good enough. But computers have enough precision
that a difference of 0.000085 is large enough to be detected, and the 2 columns of data would
be clearly distinguishable.
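To make that concrete, here is a minimal sketch in Python. The values are the first two Y columns of Anscombe's Quartet as they are commonly published, and the use of the population standard deviation (`statistics.pstdev`) is an assumption on my part: it is the variant that reproduces the rounded figures quoted above.

```python
from statistics import pstdev

# First two Y columns of Anscombe's Quartet, as commonly published.
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Assumption: population standard deviation (divide by n), because it
# matches the figures quoted in the post when rounded to 6 places.
sd1 = pstdev(y1)
sd2 = pstdev(y2)

print(round(sd1, 6), round(sd2, 6))   # the two rounded standard deviations
print("exactly equal:", sd1 == sd2)   # the software sees two different values
```

The two values agree to three decimal places, but an equality check in software treats them as different, so no tie is ever triggered.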
As a side note, I did test clustering with the data set just for fun. Folks around here kind
of grinned a bit, since this data set is pretty well known, and thought it would be fun to see
the results.
But as a test case, it really is not very good at all. The challenge here would be to come up
with 2 columns of data that have exactly the same standard deviation (subject to rounding
errors) and use them to validate any tie-breaking rules we might have for this condition. One
easy way to do this would be to reverse the signs of the numbers from column 1 when creating
column 2. Then make a quick check that the standard deviation is the same, to validate that
the rounding was the same for both the positive and the negative value calculations.
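The sign-reversal idea can be sketched as follows. This is only an illustration: the names `column1` and `column2` are made up for the example, the starting values are borrowed from the first Y column of the quartet, and the population standard deviation is again an assumption.

```python
from statistics import pstdev

# Hypothetical column 1; any list of numbers would do.
column1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# Column 2 is column 1 with every sign reversed.
column2 = [-value for value in column1]

sd1 = pstdev(column1)
sd2 = pstdev(column2)

# The quick check from the text: confirm the two results really are
# identical before relying on them as a tie-breaking test case.
print("standard deviations identical:", sd1 == sd2)
```

Negating a floating point number only flips its sign bit, so the squared deviations are the same in both columns, which is why this construction is a reasonable candidate for an exact tie. The check is still worth running on whatever library actually computes the statistic.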
Another way would be
to scale the data, but that can result in different rounding behavior.
Even though this
data set is not a good test case, the idea behind it is very valid. Ensure that you
test with values that are
exactly the same - and keep in mind that you need to validate "exactly the
same" with any given data set.
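One way to validate "exactly the same" is to separate exact ties from values that are merely close. The helper below is a hypothetical sketch, not part of any particular test tool, and the tolerance of 0.001 is an arbitrary choice for illustration.

```python
import math

def classify_pair(a, b, tolerance=1e-3):
    """Label two statistics as an exact tie, a near tie, or distinct."""
    if a == b:
        return "exact tie"       # bit-for-bit equal: exercises tie-breaking
    if math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance):
        return "near tie"        # looks the same to a person, not to software
    return "distinct"

# The rounded standard deviations quoted earlier in the post.
print(classify_pair(1.937024, 1.937109))  # near tie
print(classify_pair(1.937024, 1.937024))  # exact tie
```

Only the "exact tie" case tells you anything about how tie-breaking rules behave; a "near tie" such as the quartet's columns never reaches that code path.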
Since this is such a
well known data set, I wanted to share my thoughts on it and why it is not all
that
useful for testing computer software. The precision of the computer needs to be taken into account.
Questions, comments,
concerns and criticisms always welcome,
John