Wednesday, September 28, 2016

Anscombe's Quartet is not especially good for software testing


In statistical circles, there is a set of data known as Anscombe's Quartet.  The link goes to Robert 
Kosara's blog, where he does a good job of showing off the data set and how radically different-looking 
data can have similar statistical properties.

The data set was designed so that each column of data points has (almost exactly) the same summary 
statistics - standard deviation being one example.  For instance, the standard deviation of the first Y 
column is 1.937024, while that of the second Y column is 1.937109.  In both cases, I am rounding the 
numbers to the 6th decimal place.

The difference in the values is 1.937109 - 1.937024 = 0.000085.  This is very close, and to a 
human eye looking at a plot of the data it is probably too small to be seen.  For making a point 
about the data - that similar statistical properties cannot be used to determine the shape of 
a data set - this is good enough.  But computers have enough precision that a difference of 
0.000085 is easily detected, and the two columns of data would be distinctly 
differentiable.

As a side note, I did test clustering with the data set just for fun.  Folks around here grinned 
a bit, since this data set is pretty well known, and thought it would be fun to see the results.   
But as a test case, it really is not very good at all.  The challenge here would be to come up with two 
columns of data that have the exact same standard deviation (subject to rounding errors) and use 
that to validate any tie-breaking rules we might have for this condition.  One easy way to do this would 
be to reverse the signs of the numbers from column 1 when creating column 2.  Then make a quick 
check that the standard deviation is the same, to validate that the rounding was the same for both 
positive and negative value calculations. 
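
As a rough sketch of that sign-flip idea (the column values and the use of Python's statistics module are my own assumptions for illustration, not the actual test code):

```python
import statistics

# Column 1: arbitrary made-up test values.
column_1 = [3.2, -1.5, 4.7, 0.0, 2.2, 8.003]

# Column 2: the same values with their signs reversed.  Negation is exact in
# floating point, so the two standard deviations should match exactly.
column_2 = [-x for x in column_1]

sd_1 = statistics.pstdev(column_1)
sd_2 = statistics.pstdev(column_2)

# Quick check that rounding behaved the same for positive and negative values.
assert sd_1 == sd_2, f"standard deviations differ: {sd_1} vs {sd_2}"
print(sd_1, sd_2)
```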

Another way would be to scale the data, but that can result in different rounding behavior.

Even though this data set is not a good test case, the idea behind it is very valid.  Ensure that you
 test with values that are exactly the same - and keep in mind that you need to validate "exactly the 
same" with any given data set.

Since this is such a well-known data set, I wanted to share my thoughts on it and why it is not all that 
useful for testing computer software.  The precision of the computer needs to be taken into account.

Questions, comments, concerns and criticisms always welcome,
John

Monday, September 19, 2016

A need for specific test cases


One of the tests we have to complete is validating our implementations for accuracy.  As I have mentioned, this can be tricky simply because we are using computers and have to contend with rounding errors.  Another reason this is hard is simply the nature of statistics.

Consider the meteorologist.  Given a set of weather statistics - temperature, barometric pressure, wind speed, etc… - the meteorologist can state "There is a 70% chance of rain tomorrow."

Tomorrow comes, and it rains.  Was the forecast correct?  Of course - there was a 70% chance of rain.  Now suppose tomorrow arrives, and it does NOT rain.  Again, was the forecast correct?  Of course - there was a 30% chance that it would not rain.  Such is the nature of statistics, and that also hits us for some of our work.

Recently, we added a Clustering algorithm to Tableau.  The basic idea can be viewed as "Given a set of data points like this, divide them into equal-sized groups:"

[Image: a scatter plot of dots drawn in three groups]
In this case, I tried to obviously draw three clusters, with about the same number of dots in each.  

But what about this case:

[Image: a second scatter plot with the same number of dots, arranged so the grouping is less obvious]
Same number of dots.  Most people would probably want two groups, which would look something like this:

[Image: the dots divided into two groups, with more dots in the group on the right]
The group on the right would have more dots, but visually this seems to make sense.  Using three groups would give this:

[Image: the dots divided into three groups of equal size, with the two groups on the right very close together]
Now they all have the same number of dots, but the groups on the right are very close together.


The bigger question to answer, just like the one the meteorologist faces, is "What is correct?"  And when predicting situations like this, that is a very difficult question to answer.  One of the challenges for the test team is creating data sets, like the dots above, that let us validate our algorithm in cases which may not be as straightforward as others.  In the test world, these are "boundary cases," and getting this data for our clustering testing was a task we faced.
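
For illustration, here is a minimal sketch of how a boundary-case data set like the dots above might be generated.  The coordinates, point counts, and the use of NumPy are my own assumptions, not the actual test data:

```python
import numpy as np

rng = np.random.default_rng(42)

def blob(center_x, center_y, count, spread=0.5):
    """Scatter `count` points around a center point."""
    return np.column_stack((
        rng.normal(center_x, spread, count),
        rng.normal(center_y, spread, count),
    ))

# One well-separated group on the left, plus two groups on the right that sit
# close enough together to make "two clusters or three?" ambiguous.
left = blob(0.0, 0.0, 20)
right_a = blob(6.0, 0.0, 20)
right_b = blob(7.5, 0.5, 20)

points = np.vstack((left, right_a, right_b))
print(points.shape)  # (60, 2): same number of dots in each group
```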

Questions, comments, concerns and criticisms always welcome,
John

Monday, September 12, 2016

Software Testing is not insurance. It is defensive driving


Let's take a break from diving into rounding errors and take a broader view of the testing role.  After all, computing a basic statistic - such as the average - of a set of numbers is a well-understood problem in the statistical world.  How much value can testers add to this operation?  Fair question.  Let me try to answer it with an analogy.

I used to repeat the mantra I heard from many engineers in the software world that "testing is insurance."  This led to the question, "How much insurance do you want or need for your project?" which in turn led to discussions about the relative amount of funding a test organization should have.  The logic was that if yours was a million-dollar project, you would want to devote some percentage of that million to testing as insurance that the project would succeed. 

The first analogy I want to draw is that insurance is not a guarantee - or even an influencer - of success.  Just because I have car insurance does not mean I won't get into a wreck.  Likewise, buying vacation insurance does not guarantee the weather will allow my flight to reach its destination.  Insurance only helps when things go wrong.  The software engineering world has a process for that circumstance called "Root Cause Analysis."  I'll go over that later, but for now, think of it as the inspection team looking at the wreck and trying to figure out what caused the crash.

That leads me to my second analogy: testing is like defensive driving.  Defensive driving does not eliminate the possibility of a crash.  Instead, it lessens the chance that you will get into an accident.  Hard stats are difficult to find, but most insurance companies in the USA will give you a 10% reduction in your premiums if you take such a class.  Other estimates range up to a 50% decrease in the likelihood of a wreck, so use any number you like between 10% and 50%.

Those results are achieved by teaching drivers to focus on the entire transportation system around them.  It is pretty easy to get locked into looking in only one direction and then be surprised by events that happen in another area (see where this analogy is going?).  If I am concentrating on the road directly in front of me, I may not notice a group of high-speed drivers coming up behind me until they are very close.  Likewise, if I am concentrating only on accurately computing an average, I may not notice that my code is not performant and may not work on a server that is in use by many people.  In both cases, having a wider view will make my driving - or code development - much smoother.

More on this coming up.

Questions, comments, concerns and criticisms always welcome,
John

Thursday, September 8, 2016

My simple epsilon guideline


The general rule I want to follow for validation is that for big numbers, a big epsilon is OK, and for small numbers, a small epsilon is desirable.  In other words, if we are looking at interstellar distances, an error of a few kilometers is probably acceptable, but for microscopic measurements, a few micrometers may be more appropriate.

So my rule - open to interpretation - is "8 bits past the most precise point in the test data."  Let's look at a case where we want a small epsilon - for example, when we are dealing with precise decimal values.

Suppose we have these data points for our test case:
1.1
2.7
8.003

The most precise data point is that last one - 8.003.  The epsilon factor will be based off that.

8 bits of precision means 1 / (2^8), or 1/256, which is approximately 0.0039.  Let's call that 0.004.  Scale this by the place value of the last digit of 8.003, which is the one-thousandths place (0.001): 0.004 × 0.001 = 0.000004.  This means anything that is within 0.000004 of the exact answer will be considered correct.

So if we need to average those three numbers:
1.1 + 2.7 + 8.003 = 11.803
11.803 / 3 = 3.934333333…

The exact answer for the average cannot be written out completely in this case (it is a repeating decimal).  I still need to verify we get close to the answer, so my routine to validate the result will look for the average to be in this range:
3.934333 - 0.000004 = 3.934329
3.934333 + 0.000004 = 3.934337

So if the average we compute is between 3.934329 and 3.934337, the test passes.
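
Here is a minimal sketch of how this guideline could be automated.  The helper names and the string-based way of finding the most precise digit are my own assumptions, not the actual validation code:

```python
def decimal_places(value_text: str) -> int:
    """Count the digits after the decimal point in the textual form of a value."""
    if "." not in value_text:
        return 0
    return len(value_text.split(".")[1])

def epsilon_for(value_texts):
    """'8 bits past the most precise point': 1/256 scaled by the place value
    of the most precise digit found in the test data."""
    most_precise = max(decimal_places(t) for t in value_texts)
    return (1 / 256) * (10 ** -most_precise)

value_texts = ["1.1", "2.7", "8.003"]
values = [float(t) for t in value_texts]

eps = epsilon_for(value_texts)   # about 0.0000039, close to the 0.000004 above
expected = 3.934333              # hand-computed average, truncated
actual = sum(values) / len(values)

# PASS if the computed average falls within expected +/- epsilon.
assert expected - eps <= actual <= expected + eps
```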

More on this, and how it can be implemented to enforce even greater accuracy will come up later.

Questions, comments, concerns and criticisms always welcome,
John






Monday, August 29, 2016

What we can do, part 2


I've (hopefully) laid out the challenge the test team has with validating models (algorithms) given the constraints of using computers to compute results.  There seem to be some large hurdles to jump, and indeed there are.

On one hand, we could just point to the standard (IEEE 754) and say "Well, you get the best computers can do."  That only goes so far, though.  The test team still needs to validate that the algorithm returns a value that is approximately correct, so we have to define "approximately." 

Step 1 is to define a range of values that we expect from an algorithm, rather than a single value.  This is a large step away from "typical" software validation.

For instance, if I send one piece of email with the subject "Lunar position", then I expect to receive one piece of email with the subject "Lunar position".  This is very straightforward and the basis for most test automation.  I expect (exactly) 1 piece of email, the subject needs to be (exactly) "Lunar position" and not "LUNAR POSITION", and the sender name needs to be (exactly) the account from which the test email was sent.

What we do in our validations is set an error factor, which we call epsilon (ε).  This error factor is added to the exact value a pure math algorithm would produce, and subtracted from that number as well, to give a range of values we expect.  To go back to the average example, we expect a value of exactly 50.  If we set epsilon to 0.0001, then we will allow the computer to pass the test if it gives us a number between 50 - 0.0001 and 50 + 0.0001.

50 - ε = 49.9999
50 + ε = 50.0001
The range of values we would allow to pass would be between 49.9999 and 50.0001. Anything outside of this range would fail.

If the output from a test is 49.99997, we pass the test.  If 50.03 is the result, we would fail the test.
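
A minimal sketch of that check in Python (the function name and hard-coded values are just illustrative):

```python
EPSILON = 0.0001
EXPECTED_AVERAGE = 50.0

def within_epsilon(actual, expected, epsilon=EPSILON):
    """PASS if the computed value lies in [expected - epsilon, expected + epsilon]."""
    return expected - epsilon <= actual <= expected + epsilon

print(within_epsilon(49.99997, EXPECTED_AVERAGE))  # True  -> PASS
print(within_epsilon(50.03, EXPECTED_AVERAGE))     # False -> FAIL
```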

This is pretty simple, with the key challenge being setting a reasonable epsilon.  I'll cover that in the future.

Questions, comments, concerns and criticisms always welcome,
John

Thursday, August 25, 2016

So what can we do about this?


Here is the challenge the test team has.  Given that even a simple algorithm can have unexpected results, and given that a computer cannot - by its nature - give exact results, how can we validate the output of a given model?

The answer is both simple and difficult.

On the simple side, we do exactly what you would expect.  Knowing that an expected output value should be (exactly) 50, we can set up an error factor of some small amount and validate the value of the output is within that range of the expected output.

In other words, we can set an error value equal to .00001, for instance.  Then, if the output of the algorithm is within .00001 of 50, we can log a PASS result for the test.  If it falls outside that range (50 +/- .00001), then we log a failure and investigate what happened.

That's the simple part.  In fact, having a small error factor is pretty standard practice.  If you want to read a bit more about this, just look up how computers determine the square root of a number.  This technique is usually taught during the first week or so of a computer science course.  (And then it is hardly mentioned again, since it is such a rabbit hole - until it becomes a real consideration in jobs like this one.)

The hard part is knowing how to set a reasonable range.  Obviously, a very large range will allow bogus results to be treated as passing results.  If we are trying to compute the average of 1-99 and allow a "correct" answer to be +/- 5 from 50 (45 to 55), the test will always pass.  But everyone will notice that 54 or 47 or whatever else is not correct.

And if we make it too small - like .0000000000000001 (that is a 1 at the 16th decimal place) - then the test will likely fail simply because of expected rounding errors in the computation.
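
To make that concrete, here is a small Python illustration (my own example, not from the original post) of ordinary rounding error exceeding a tolerance set at the 16th decimal place:

```python
# Summing 0.1 ten times "should" give exactly 1.0, but 0.1 cannot be
# represented exactly in binary floating point, so error accumulates.
total = sum([0.1] * 10)
print(total)              # 0.9999999999999999
print(abs(total - 1.0))   # about 1.1e-16, already bigger than an epsilon of 1e-16
```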

This is a large challenge for us and I'll outline what we do to handle this next.

Questions, comments, concerns and criticisms always welcome,
John

Monday, August 22, 2016

How much precision can we provide?


So there are some big challenges right out of the gate when testing statistical software, and so far I've looked at rounding errors.  The bigger question, given that computers have this innate limitation, is how accurate can and should we be?

On one hand, statistics gives probabilities, not truth.  So having one routine give an 86.4532% chance while a second routine gives 86.4533% seems like we are splitting hairs.  But we have some trust in our computers, and we need to get the answer as "right" as possible.

My favorite stat on this is NASA using 15 digits of accuracy for pi.  That is as much as they need to track everything they have ever launched, so that is one benchmark to consider.  For the software on which I work, I doubt many folks are tracking interstellar traffic, though.  It's still a good data point.

Financial markets are something we handle.  This gets tricky really quickly, though.  In the USA, we track currency to two decimal places (pennies) and have rules about rounding the third decimal place.  Other countries use more decimal places, so this doesn't help much.  (Side note: Ohio has a rounding methodology for sales tax that strongly skews to higher taxes: http://www.tax.ohio.gov/sales_and_use/information_releases/st200505.aspx)

There is also a performance (of the computer) vs. precision trade-off that we need to consider.  While we could get much more precision, that comes at the cost of run time.  One way of getting more precision would be to allocate more memory for each number we are using.  There are also libraries from the National Institute of Standards and Technology that really help, and companies like Intel provide similar tools.  They generally run more slowly than code that doesn't push for that level of precision.  More on this later.
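
As a rough illustration of that trade-off, here is a sketch using Python's built-in decimal module as a stand-in for the kinds of higher-precision libraries mentioned above (it is not the library the team actually uses):

```python
from decimal import Decimal, getcontext

values = ["1.1", "2.7", "8.003"]

# Standard 64-bit floats: fast, but limited to roughly 15-16 significant digits.
float_avg = sum(float(v) for v in values) / len(values)

# Arbitrary-precision decimals: each number uses more memory and more time,
# but we can ask for, say, 50 significant digits.
getcontext().prec = 50
decimal_avg = sum(Decimal(v) for v in values) / len(values)

print(float_avg)    # about 3.9343333333333335
print(decimal_avg)  # 3.93433333333... carried out to 50 significant digits
```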

Looking ahead, once we settle on a standard, then the test team has to ensure we meet that goal 100% of the time.  I'll finally get around to covering that aspect of the test team's goals as we go along.

Questions, comments, concerns and criticisms always welcome,
John