Wednesday, September 28, 2016

Anscombe's Quartet is not especially good for software testing


In statistical circles, there is a set of data known as Anscombe's Quartet.  Robert Kosara's blog 
does a good job of showing off the data set and how radically different-looking data can have 
similar statistical properties.

The data set was designed so that its columns share nearly the same summary statistics, with 
standard deviation being one example.  For instance, the standard deviation of the first Y column 
is 1.937024, while that of the second Y column is 1.937109.  In both cases, I am rounding the 
numbers to the 6th decimal place.

The difference in the values is 1.937109 - 1.937024 = 0.000085.  This is very close, and to a 
human eye viewing a plot of the data it is probably too small to be seen.  For making a point 
about the data - that similar statistical properties cannot be used to determine the shape of 
a data set - this is good enough.  But computers have enough precision that a difference of 
0.000085 is large enough to be detected, and the two columns of data would be distinctly 
differentiable.
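
A minimal Python sketch of that comparison follows.  The Y values are the published Anscombe 
columns, typed in here for illustration rather than taken from this post.  Using the population 
standard deviation (`statistics.pstdev`) is an assumption; it is the variant that reproduces the 
two figures quoted above.

```python
from statistics import pstdev

# First and second Y columns of Anscombe's Quartet (published values,
# entered here for illustration; they are not listed in the post itself).
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

sd1 = pstdev(Y1)  # population standard deviation (assumed variant)
sd2 = pstdev(Y2)
gap = sd2 - sd1   # small, but far larger than floating-point noise

print("column 1:", round(sd1, 6))
print("column 2:", round(sd2, 6))
print("gap:     ", gap)
print("exact tie?", sd1 == sd2)  # a strict comparison sees two different values
```

Since the strict comparison reports no tie, any tie-breaking code path would simply never run 
with this data.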

As a side note, I did test clustering with the data set just for fun.  Folks around here kind of grinned 
a bit since this data set is pretty well known, and they thought it would be fun to see the results.  
But as a test case, it really is not very good at all.  The challenge here would be to come up with two 
columns of data that have the exact same standard deviation (subject to rounding errors) and use 
them to validate the tie-breaking rules we might have for this condition.  One easy way to do this 
would be to reverse the signs of the numbers from column 1 when creating column 2.  Then make a 
quick check that the standard deviation is the same, to validate that the rounding was the same for 
both the positive and negative value calculations.
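
The sign-reversal idea can be sketched in a few lines of Python.  The names `column_1` and 
`column_2` and the sample values are hypothetical, and `statistics.pstdev` is just one assumed 
way of computing the standard deviation; the software under test may well use something else.

```python
from statistics import pstdev

# Hypothetical first column; any list of floats would do.
column_1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# Second column: same magnitudes, opposite signs.
column_2 = [-v for v in column_1]

sd_1 = pstdev(column_1)
sd_2 = pstdev(column_2)

# The quick check: confirm the tie is exact, not merely close.
is_exact_tie = (sd_1 == sd_2)
print(sd_1, sd_2, is_exact_tie)
```

If the check fails with the real implementation, the columns are not a usable tie-breaking test 
case, which is exactly what the check is there to catch.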

Another way would be to scale the data, but that can result in different rounding behavior.
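
One way to probe the rounding concern is sketched below, under the assumption that "scaling" 
means multiplying every value by a constant and expecting the standard deviation to scale by 
exactly that constant.  The helper `scaled_tie` and the sample values are hypothetical.

```python
from statistics import pstdev

# Hypothetical sample column.
base = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def scaled_tie(values, factor):
    """Return True if scaling every value by `factor` scales the
    standard deviation by exactly the same factor."""
    scaled = [v * factor for v in values]
    return pstdev(scaled) == pstdev(values) * factor

# A power of two only changes the floating-point exponent, so nothing is rounded.
print("factor 2:  ", scaled_tie(base, 2.0))
# Other factors can round each product differently; an exact match is not guaranteed.
print("factor 10: ", scaled_tie(base, 10.0))
print("factor 0.1:", scaled_tie(base, 0.1))
```

Whatever the last two lines print on a given machine, the point is that they have to be checked 
rather than assumed.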

Even though this data set is not a good test case, the idea behind it is very valid.  Ensure that you
 test with values that are exactly the same - and keep in mind that you need to validate "exactly the 
same" with any given data set.

Since this is such a well known data set, I wanted to share my thoughts on it and why it is not all that 
useful for testing computer software.  The precision of the computer needs to be taken into account.

Questions, comments, concerns and criticisms always welcome,
John
