Thursday, March 30, 2017

Some simple data munging work this week

A task I had to deal with this week went like this.  I was given a data set we want to test:
2 3 4 4 5 5 6 6 7 9
We have an algorithm to compute some basic statistics on any given data set (mean, median, variance, etc.).  Nothing special about that.  And I had two data sets - the small one above, used mostly to make sure the testing code would work, and another data set of 50,000+ floating point numbers:
-8322.458 -6199.002 -6002.999 and so on.

What I needed to do was compare the results of those calculations across a variety of tools that also compute those basic stats.  I chose Excel, R, some Python code I wrote myself, NumPy/SciPy and Octave.
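To show the kind of discrepancy worth watching for, here is a minimal sketch (standard library only, not the actual test code) that computes these statistics on the small data set.  Note that "variance" is itself ambiguous: Python's statistics.variance is the sample variance, statistics.pvariance is the population variance, and different tools default to different ones.

```python
import statistics

data = [2, 3, 4, 4, 5, 5, 6, 6, 7, 9]

mean = statistics.mean(data)        # 5.1
median = statistics.median(data)    # 5.0
svar = statistics.variance(data)    # sample variance (n - 1 divisor): 4.1
pvar = statistics.pvariance(data)   # population variance (n divisor): 3.69
```

Excel's VAR and R's var() use the sample divisor, while NumPy's np.var() defaults to the population divisor, so even "correct" tools can disagree unless you check which definition each uses.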

And that is where the problems came in.

My original data sets were simply a list of numbers, without commas, separated by spaces, all on one row.  For the small data set, for all the tools, I could just copy/paste or even retype to get the data into the format the tool wanted.  This is not a hard problem to solve, just tedious.  The industry calls this "data munging" (getting the data from the format you have into the format your tools need), and it is almost always the most time consuming part of any analysis.  Hit me up for links to prove this if you want.

For instance, Excel prefers a single column to make entering the calculations easy, but can use a row.  Python's csv reader wants commas to separate values along a row (though you can specify spaces), but once the data is imported, it is easiest to have one column of data.  So I had to create a file of the 50,000+ values with each value on its own line.

R was able to use the same file as Python.  Nice!

Octave wanted all the values on one row, so I had to re-lay out the numbers with a comma between each pair.  Since this was a one-off task, I simply used Word to edit the file.  It took a little under a minute to make the 50,000+ replacements.
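Word was fine for a one-off, but the same munging is easy to script.  Here is a sketch of the two conversions in Python; the input is a single space-separated row, like my raw files, and the file names in the usage note below are placeholders.

```python
def rewrite_formats(raw_text):
    """Split one row of space-separated values into the two layouts the tools want."""
    values = raw_text.split()
    column = "\n".join(values) + "\n"   # one value per line (Python's csv reader, R)
    row = ",".join(values) + "\n"       # comma-separated single row (Octave)
    return column, row

# Example with the small data set:
col, row = rewrite_formats("2 3 4 4 5 5 6 6 7 9")
```

For the 50,000+ value file, you would read the raw file (say, open("data_raw.txt").read()), pass it through this function, and write the two results out to separate files.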

Now I have the data files in the formats all the tools want, and I can use their results to help ensure Tableau is getting expected answers for these basic statistics.

Questions, comments, concerns and criticisms always welcome,

Monday, March 27, 2017

Getting our automation to 100%

One of the challenges we have taken on since last year has been driving our test passes to be 100% automated.  Historically, a test pass has had a manual testing component - a human sits in front of a computer running a build of Tableau that we intend to ship to customers and validates that the responses to test cases are the results we expect.  This can be a challenge because, as often happens, the manual testing phase can take more than a day.  If we get a new build of Tableau each day - and we can get them more often than that - it is possible that we never finish a manual test pass.

It also can become very difficult to predict when a manual test pass will conclude.  If there are changes, the test team needs to validate whether each change was expected, and while that may be obvious in some cases, it is not always.  Imagine that a new piece of hardware, such as a very high resolution monitor, comes out the day before the test pass.  It could be that no one on the team knew this would happen and no one has planned for how Tableau should work in that case.  This can slow down the evaluation of the results of the test pass.

To get around this, we embarked on getting enough automation in place that we can avoid all manual testing when a new build of Tableau is released for testing.  For us, this just means looking at automation results and verifying that no tests are failing.  That is my definition of 100% - that we can make our decision based on automation results alone.  It was a lot of work to get here, and it is even more work to remain at this level, but it makes testing much more reliable and predictable as time goes by.

This also speeds up the new monitor example I gave above.  At this point, all the test results are available, so the test pass is effectively done "all at once."  There is no back and forth of trying each feature one at a time on the new monitor - all the automation runs and gives a report to the humans, who can now start making decisions.  Without the automation, the decision process is slowed down by data gathering, and this is where automation can really speed up the cycle.

We did some tasks other than simply adding automation to help us get here and I'll cover those next.

Questions, comments, concerns and criticisms always welcome,

Friday, March 24, 2017

Some reading for my team

We are gathering books today that folks on the team have and use.  I made my contribution to the first pile of books:
The bottom most book - Doing Bayesian Data Analysis - is my contribution to the pile.  We actually had a learning group here last summer using that as our text and I really liked the way it was written.  Very useful, tutorial in style, and full of examples.

I also have a copy of the top book - Causality by Judea Pearl.  To me, that is the key attribute of what people want from statistics: did changing THIS cause THAT to happen?  I'm still working my way through it and am probably about 20% done so far.  It is much more dense than I expected, but that is just another way of saying it is filled with information.

Books are a great way of continually learning and I will write more about that next.

Questions, comments, concerns and criticisms always welcome,

Tuesday, March 14, 2017

Using Virtual Machines to help me test

Tableau gives me a couple of ways to maintain virtual machines for a test environment.  My preferred method is probably seen as "old school" at this point: I create machines locally in Hyper-V to run on my nice desktop machine.  This gives me the most control over my machines - they are always accessible and I can maintain them when and how I prefer.  Right now, since I am still setting them up, I have a 32-bit Windows 7 machine, a 64-bit Windows 7 machine and a Debian machine.

We also use OpenStack more globally, and I would be willing to bet most folks around Tableau use it by default.  It has its own advantages - I can just select what OS and such I want, and the server builds a machine for me.  I can log in when it is done, it doesn't consume any resources from my desktop, and I can otherwise use it as needed.  I have not had any problems with this system at all; I simply prefer to run my machines locally since that is the first way I was introduced to using VMs for testing and development.

Informally, I once did a little experiment to judge how much faster I could be as a tester with more machines.  At one point, when I was testing email clients and servers, I had 14 machines set up in my office running many different language versions of clients and servers to validate email sent and received in those situations.  I was able to go about 50% faster by having all these machines under my control than by relying on labs or others to set up and maintain them.  I could also make any changes I wanted whenever I wanted, since I was the only user.  The downside was learning to be a client, server and network administrator, since I had to do all the work myself.  That probably paid for itself in the long run - the better I know my product overall, the better my testing will be - but I did not factor that into my calculations of spending about 1-2 hours per day tweaking all the machines.  One last advantage of this many machines: while updating them took quite a bit of time, I updated them one after the other, so one machine was almost always ready for testing while the rest were being upgraded.

And now, back to work!

Questions, comments, concerns and criticisms always welcome,

Wednesday, March 8, 2017

Hackathon week

This is a hackathon week for us here at Tableau.  We get to work on self-directed projects, things we "know need to get done", explore new avenues and so on.  Overall, it is a very productive time and most folks look forward to this regular activity.

For this week, I am working on a simple (think V1) "fuzzing" tool to dirty the Superstore sample workbook we ship.  In this context, fuzzing means to dirty some of the fields in the data to reflect bad data that may get sent to Tableau.
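As a sketch of the idea - not the actual tool, and the example values here are hypothetical - a V1 fuzzer can be as simple as randomly corrupting one character in some fraction of a column's string fields:

```python
import random

def fuzz_field(value, rate=0.1, rng=None):
    """With probability `rate`, replace one character of `value` with a
    random lowercase letter to simulate a hand-entry typo."""
    rng = rng or random.Random()
    if not value or rng.random() >= rate:
        return value
    i = rng.randrange(len(value))
    return value[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[i + 1:]

# Dirty a column of state names; e.g. "Florida" might come out as "Flarida".
rng = random.Random(42)
states = ["Florida", "Texas", "Ohio"]
dirty = [fuzz_field(s, rate=0.5, rng=rng) for s in states]
```

Running the cleaning and statistics pipeline against the dirtied copy then shows how the product behaves when the input is less than pristine.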

An example of this is a storm damage spreadsheet I encountered while taking a data cleaning class.  This spreadsheet (since it is still in use, to my knowledge, I won't link to it) was a hand-created list of major storms in the USA dating back to the 1950s.  When I hear the phrase "hand created" I have learned to shudder.  There is simply the potential for a typo to creep in and skew the data (at best) or result in completely invalid entries (much worse).  For instance, this particular spreadsheet had 54 states listed.

A fair question to ask would be "How many states would I expect to see?"  50 is an obvious answer, since we currently have 50 states.  That wasn't true when the data first started being collected, but we have 50 now, so 50 might be a reasonable first guess.  The District of Columbia is not a state, but it is often included in data gathered by the government, so 51 "states" may also be a good expectation.

Since this is storm data, I naturally include hurricanes in this and that makes me think of the Caribbean.  Now I have to wonder if I need to include Puerto Rico in my analysis, which would raise the state count to 52.

Finally, with the expectation that 50 would not necessarily be the exact count, I began to dig through the data.  In this case, DC was included, but Puerto Rico was not.  The other three "states" were typos - things like "Florida" being spelled "Flarida" and similar.  (Quick - how many ways can you type valid names or abbreviations for "Florida"?)  Once I had that cleaned up, I was down to 51 states (50 states + the District of Columbia) for my analysis.
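That kind of audit is easy to script once you have a whitelist of expected values.  A sketch, with a deliberately abbreviated whitelist (a real check would list all 50 states plus the District of Columbia):

```python
# Abbreviated whitelist for illustration; a real check would include
# all 50 state names plus "District of Columbia".
VALID_STATES = {"Florida", "Texas", "Ohio", "District of Columbia"}

def suspect_states(rows):
    """Return the distinct state values that are not on the whitelist."""
    return sorted(set(rows) - VALID_STATES)

bad = suspect_states(["Florida", "Flarida", "Texas", "District of Columbia"])
# bad == ["Flarida"]
```

Anything the check flags is either a typo to fix or a value (like Puerto Rico) that forces you to revisit your expectations about the data.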

Anyway, remembering that lesson made me want to contribute to this effort.  Data is not always (never?) clean to start.  This hackathon I am trying to help solve that challenge.

Questions, comments, concerns and criticisms always welcome,