Monday, August 29, 2016

What we can do, part 2


I've (hopefully) laid out the challenge the test team has with validating models (algorithms) given the constraints of using computers to compute results.  There seem to be some large hurdles to jump and indeed, there are.

On one hand, we could just point to the standard (IEEE 754) and say "Well, you get the best computers can do."  That only goes so far, though.  The test team still needs to validate that the algorithm returns a value that is approximately correct, so we have to define "approximately".

Step 1 is to define a range of values that we expect from an algorithm, rather than a single value.  This is a large step away from "typical" software validation.

For instance, if I send one piece of email with the subject "Lunar position" then I expect to receive one piece of email with the subject "Lunar position".  This is very straightforward and the basis for most test automation.  I expect (exactly) 1 piece of email, the subject needs to be (exactly) "Lunar position" and not "LUNAR POSITION", and the sender name needs to be (exactly) the account from which the test email was sent.

What we do in our validations is set an error factor which we call epsilon (ε).  This error factor is added to, and subtracted from, the exact value a pure math algorithm would produce, to give a range of values we expect.  To go back to the average example, we expect a value of exactly 50.  If we set the epsilon to .0001, then we will allow the computer to pass the test if it gives us a number between 50 - .0001 and 50 + .0001.

50 - ε = 49.9999
50 + ε = 50.0001
The range of values we would allow to pass is between 49.9999 and 50.0001.  Anything outside of this range would fail.

If the output from a test is 49.99997, we pass the test.  If 50.03 is the result, we would fail the test.
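
A minimal sketch of that check in Python (the epsilon and expected value here are just the numbers from this example, not production settings):

epsilon = 0.0001   # the error factor described above
expected = 50.0    # what a pure math version of the algorithm returns

def passes(actual):
    # pass if the result lands within [expected - epsilon, expected + epsilon]
    return abs(actual - expected) <= epsilon

print(passes(49.99997))  # True - inside the allowed range
print(passes(50.03))     # False - outside the allowed range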

This is pretty simple, with the key challenge being setting a reasonable epsilon.  I'll cover that in the future.

Questions, comments, concerns and criticisms always welcome,
John

Thursday, August 25, 2016

So what can we do about this?


Here is the challenge the test team has.  Given that even a simple algorithm can have unexpected results, and given that a computer cannot - by its nature - give exact results, how can we validate the output of a given model?

The answer is both simple and difficult.

On the simple side, we do exactly what you would expect.  Knowing that an expected output value should be (exactly) 50, we can set up an error factor of some small amount and validate the value of the output is within that range of the expected output.

In other words, we can set a value of error equal to .00001, for instance.  Then if the output of the algorithm is within .00001 of 50, we can log a PASS result for the test.  If it is outside that range (50 +/- .00001) then we log a failure and investigate what happened.

That's the simple part.  In fact, having a small error factor is pretty standard practice.  If you want to read a bit more about this, just look up how computers determine the square root of a number.  This technique is usually taught during the first week or so of a computer science course.  (And then it is hardly mentioned again, since it is such a rabbit hole, until it becomes a real consideration in jobs like this one.)
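
For the curious, here is a rough sketch of that square root idea in Python: Newton's method, which refines a guess until it is within a chosen tolerance.  The tolerance value is just an example.

def approx_sqrt(n, tolerance=1e-10):
    # Newton's method: refine the guess until guess squared is
    # within the chosen tolerance of n.  Assumes n > 0.
    guess = n / 2.0
    while abs(guess * guess - n) > tolerance:
        guess = (guess + n / guess) / 2.0
    return guess

print(approx_sqrt(2))  # about 1.41421356, within the tolerance of the true value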

The hard part is knowing how to set a reasonable range.  Obviously, a very large range will allow bogus results to be treated as passing results.  If we are trying to compute the average of 1-99 and allow a "correct" answer to be +/- 10 from 50 (45 to 55), the test will always pass.  But everyone will notice that 54 or 47 or whatever else is not correct.

And if we make it too small - like .0000000000000001 (that is a 1 at the 16th decimal place) - then the test will likely fail due to expected rounding errors as we change the range of numbers we compute over.

This is a large challenge for us and I'll outline what we do to handle this next.

Questions, comments, concerns and criticisms always welcome,
John

Monday, August 22, 2016

How much precision can we provide?


So there are some big challenges right out of the gate when testing statistical software, and so far I've looked at rounding errors.  The bigger question, given that computers have this innate limitation, is how accurate can and should we be?

On one hand, statistics gives probabilities, not truth.  So having one routine give an 86.4532% chance compared to a second routine giving an 86.4533% chance seems like we are splitting hairs.  But we have some trust in our computers and we need to get the answer as "right" as possible.

My favorite stat on this is NASA using 15 digits of pi.  That is as much precision as they need to track everything they have ever launched, so that is one benchmark to consider.  For the software on which I work, I doubt many folks are tracking interstellar traffic, though.  It's still a good data point.

Financial markets are something we handle.  This gets tricky really quickly, though.  In the USA, we track currency to two decimal places (pennies) and have rules about rounding the third decimal place.  Other countries use more decimal places, so this doesn't help much.  (Side note: Ohio has a rounding methodology for sales tax that strongly skews to higher taxes: http://www.tax.ohio.gov/sales_and_use/information_releases/st200505.aspx)
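
To make the rounding-rule point concrete, here is a small Python illustration (the values are made up): the decimal module lets you pick the rounding rule explicitly, while plain floats can surprise you.

from decimal import Decimal, ROUND_HALF_UP

# floats cannot store 1.005 exactly (it is really 1.00499...),
# so rounding to two places gives 1.0, not the 1.01 you might expect
print(round(1.005, 2))  # 1.0

# decimals are exact and the rounding rule is stated explicitly
print(Decimal("1.005").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 1.01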

There is also a performance (of the computer) vs. precision factor that we need to consider.  While we could get much more precision, that comes at the cost of run time.  One way of having more precision would be to allocate more memory for each number we are using.  Also, there are some libraries that the National Institute of Standards and Technology makes available that really help, and companies like Intel also provide these tools.  They generally run more slowly than code that doesn't push for that level of precision.  More on this later.
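
The NIST and Intel libraries are beyond the scope of this post, but Python's standard decimal module gives a feel for the trade-off: ask for more digits of precision and the arithmetic slows down.  A rough sketch (the digit count and loop count are arbitrary):

import timeit
from decimal import Decimal, getcontext

# hardware floats: fast, fixed precision (about 15-16 significant digits)
float_time = timeit.timeit(lambda: 1.0 / 3.0, number=100000)

# arbitrary precision decimals: ask for 50 digits, pay for it in run time
getcontext().prec = 50
decimal_time = timeit.timeit(lambda: Decimal(1) / Decimal(3), number=100000)

print("float:  ", float_time)
print("decimal:", decimal_time)  # noticeably slower on most machines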

Looking ahead, once we settle on a standard, then the test team has to ensure we meet that goal 100% of the time.  I'll finally get around to covering that aspect of the test team's goals as we go along.

Questions, comments, concerns and criticisms always welcome,
John

Friday, August 19, 2016

Some background for understanding how we test the models


I received a comment from a reader last week (I have readers?  Wow.  I did very little "advertising").  Anyway, this person was a bit confused about modeling averages.  I can understand that - we all learn how to compute averages in junior high or so, and unless we took an advanced math class, we never were exposed to why the algorithm works or whether there are alternatives to the way we learned (add 'em all up and divide by the number of things you added).

I figured a little background reading might help.  Nothing too deep - I want to keep this blog away from research math and more accessible to everyone.

Averages are taken for granted nowadays, but that has certainly not always been the case.  In fact, in some ways, they were controversial when first introduced.  And even "when they were first introduced" is a tough question to answer.  http://www.york.ac.uk/depts/maths/histstat/eisenhart.pdf is a good starting point for digging into that aspect of averages.

The controversial part is pretty easy to understand, and we even joke about it a bit today.  "How many children on average does a family have?" is the standard question, which leads to answers like 2.5.  Obviously, there are no "half kids" running around anywhere, and we tend to laugh off these silly results.  Coincidentally, the US believes the ideal number of kids is 2.9.  The controversy came in initially - what value is an average if there is no actual, real world instance of a result having this value?  In other words, what use would it be to know that the average family has 2.5 children, yet no families have 2.5 children?

The controversy here was directed at the people that computed averages.  They came up with a number - 2.5 in this example - that is impossible to have in the real world.  And if you try to tell me your algorithm is a good algorithm yet it gives me impossible results, then I have to doubt it is "good."

(We will come back to the standard phrase "All models are wrong.  Some are useful." later).

Floating point math is a more difficult concept to cover.  I don't want to get into the details of why this happens in this blog since there is a huge amount of writing on this already.  If you want details, the Wikipedia article on floating-point arithmetic is a reasonable introduction, and the classic paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic" is a more hands on, technical overview.
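
A quick way to see the problem for yourself, in Python:

# 0.1 and 0.2 have no exact binary representation, so their sum
# is not exactly 0.3
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False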


Questions, comments, concerns and criticisms always welcome,
John

Monday, August 15, 2016

Problems with the models


So now we have 2 models that we can use to compute an average.  Both are mathematically correct, and we expect to avoid some large numbers with our second model (the model in which we first divide each number by the count of all the numbers).  This requires a suspension of disbelief, because the maximum number we can handle is about 4 billion, and I know the testers out there have already asked, "But what if the average is expected to be over 4 billion?"  We'll come back to that after we test the basic cases for the second model.

Let's start with some Python code here.  Nothing special for the first model we had in which we add all the numbers first and divide by the count of the numbers:

total = 0  # the running total (named total so we don't shadow Python's built-in sum)
k = 0      # this is how many items there are in total
for i in range(1, 100):  # this will go from 1 to 99 - range stops before 100
    total += i  # add each number to the running total: 1+2+3+4+5 ... +99
    k += 1      # count how many numbers there are - 99, since range skipped the 100
print(total / k)  # print the average

And when this is run Python prints out 50.0.  This is what we expected (except maybe we only expected 50 without the "point 0" at the end).


If we keep the k we had above (99) then we can add a second bit of code to get the second model working.  Remember, the second model is the one in which we divide each number by the count of all the numbers there are and then add to a running total.  That running total will be the average value when done.

total = 0  # reset the running total to zero

for i in range(1, 100):  # again, loop from 1 to 99
    total += i / k  # now add 1/99 + 2/99 + 3/99 + 4/99 + 5/99 ... + 99/99
print(total)

And we get an incorrect result:
49.99999999999999

This is a short example of floating point errors cropping up.  We haven't even gotten to the test to see if our numbers were getting too large and we already need to deal with this.
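
As an aside, Python's standard library has one mitigation worth knowing about: math.fsum tracks the partial sums exactly instead of rounding at every step.  A tiny illustration with the classic 0.1 example:

import math

values = [0.1] * 10        # ten copies of 0.1, which has no exact binary form
print(sum(values))         # 0.9999999999999999 - the errors accumulate
print(math.fsum(values))   # 1.0 - the partial sums are tracked without loss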

Questions, comments, concerns and criticisms always welcome,
John

Thursday, August 11, 2016

A good modeling class

One other note for today.  Thinking about models is a skill in itself.  I took this class online a few years ago and highly recommend it: https://www.coursera.org/learn/model-thinking  It's free if you want and there is a pay option also.  It starts Aug 8, 2016 so you may want to hop over there and check it out.  It is not math or computer heavy - instead, it is an introduction to why we have models, how to use them and think about problems from a modeling point of view.

Questions, comments, concerns and criticisms always welcome,

John

Creating the model


Last time I gave an example of a model to compute an average.  The method I used to compute the average was probably the most familiar method there is - basically, add up all the numbers and divide by the count of the numbers that were added.

This works well on paper but can lead to problems on computers.  One error that can creep in is adding up to a number bigger than the computer can handle.  Humans with pencil and paper can always make numbers bigger (just add more paper) but a computer will have the maximum number it can handle.

32 bit computers are still popular, and for this next bit let's assume the largest integer they can handle is 4,294,967,295.  This is one less than 2 raised to the 32nd power (2 because the machine is binary, and 32 from the 32 bit processor).  Four billion is a large number, but if we are trying to figure out average sales for a really large business, it won't work.  When we try to add numbers that total to more than 4 billion, the computer will not know what to do - it can't count that high.

This isn't new: this is a very old problem with computers.  No matter how you do it, there is a number that will be bigger than the computer can handle.  Just fill up all the computer memory with whatever number you want (all 1s, since it is binary) and then add 1 more to it.  The computer doesn't have enough memory to hold that new number. 
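
Python's own integers grow as needed, so to watch this happen you have to simulate a fixed-width machine.  A rough sketch, masking each sum down to 32 bits the way a 32 bit register would:

MASK_32 = 0xFFFFFFFF  # 2**32 - 1, the largest unsigned 32 bit value

def add_uint32(a, b):
    # keep only the low 32 bits of the sum; the extra bits fall off,
    # which is exactly the overflow problem described above
    return (a + b) & MASK_32

print(add_uint32(4294967295, 1))  # 0 - the total wrapped around to zero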

So how can we deal with this potential problem?

One way is to change our algorithm.  Our original algorithm is this:
  1. Add all the numbers in the original list
  2. Divide by the count of the numbers in the original list
  3. The answer I get is the result I want

But we can crash at step 1.

One workaround is to look really hard at what the algorithm does.  Our example had 5 grades, so we add them up and divide the total by 5.  What if we divided each number by 5 to begin with, and then added up those results?

Example:
88/5=17.6
91/5=18.2
83/5=16.6
88/5=17.6 
85/5=17

And 17.6 + 18.2 + 16.6 + 17.6 + 17 = 87.  This was the output of the original algorithm, so this new algorithm looks like it might be useful.

So the new algorithm would be:
  1. Count the number of items in the list
  2. Divide the first number by that count and add the result to a running total
  3. Divide the next number in the list by that count and add the result to the running total
  4. Repeat step 3 until all the items in the list have been processed

Seems reasonable and next up we will give this a try.

Questions, comments, concerns and criticisms always welcome,
John

Monday, August 8, 2016

Setting the stage


The first thing I want to cover is what we test on the stats team.  A good starting point is that we validate the behavior of statistical models we implement.

For non-statisticians, the first question I get is "What is a statistical model?"  My definition to get started is pretty simple: a statistical model is a set of equations, based on existing data, that helps make predictions about new data coming in.

A very familiar example is a model to compute an average.  We probably all saw this in school.  You take 5 tests, add up all your scores and divide by 5.  That sets your grade for the course.  But I added what may seem like a strange phrase to my definition: that a model helps us make predictions about new data coming in.  Let's walk through that viewpoint since it is key to the definition.

In this case, suppose there is a 6th test we have to take.  And suppose my grades had been 88, 91, 83, 88 and 85.  They add up to 435.  I divide 435 by 5 and get an average of 87.  Not only is this my grade, it also is a prediction of what score I would make on the 6th test. 
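
For reference, that computation in Python:

grades = [88, 91, 83, 88, 85]
average = sum(grades) / len(grades)  # 435 / 5
print(average)                       # 87.0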

Now, it obviously doesn't mean I would make an 87 exactly.  If that was the case, I wouldn't even need to take the test.  But if I made an 86, or 92 or even 81 I would not be surprised.  It can get interesting to investigate what happened if I made a 100 or a 53 and that will come up later.

So we have a statistical model at this point.  It computes an average given a set of numbers as an input.  It's our job as the test team to validate that the average is computed correctly in every possible case.  Seems simple, maybe tedious, but once we bring computers into the equation (pun intended) things will start to get more complicated very, very quickly.

I'll cover this simple model next.

Questions, comments, concerns and criticisms always welcome,
John

Monday, August 1, 2016

Welcome to Extraordinary Squares!

Welcome to my new blog about testing at Tableau!

Tableau is a company that helps you see and understand your data.  You can read all about us at www.tableau.com, get a free version of Tableau and create an account there.

I work on the Statistics team here.  Right off the bat, I was a little confused about this team.  Internally, I was thinking, "All of Tableau is about statistics, so how can there be just one team for that?"  Now that I have been here awhile, I see all that we do, from connecting to databases (to get data for statistical analysis), to loading the data (not at all easy), to drawing maps, creating a web client, hosting servers, etc., etc., etc.  There is a lot to Tableau!

The goal of this blog is to roughly mimic my former OneNote Testing blog and talk about the challenges the test team here faces, how we address them and work to ensure Tableau is the best in class application available.  I may (probably will) have an occasional tip or two as well.

One regret I had with my former blog was the name: specifically, the URL I used.  I wish I had not had my name in the URL.  So this time, I went for what I think is a cool and slightly punny name: Extraordinary Squares.  Ordinary (least) squares is a statistical method for fitting a line to data (more on that later), and Extra-ordinary because, well, we are extraordinary!

Squares is also a pun.  I was at a meeting with some of the people on my team and someone mentioned we were changing a ratio from 13:8 to 14:8.  Someone else said, "That messes up everything."
I said, "Yeah, that even messes up our Fibonacci Golden Ratio."
Everyone in the room laughed.

So yeah, squares.

Let's get started testing Tableau!

Questions, comments, concerns and criticisms always welcome,

John