Monday, October 24, 2016

Automation fixes and generating test data


My automation changes are checked in and working fine, but I am putting that work on hold for now while our team builds a tool to generate test data for future testing.

What we need is a tool to create CSV files with good test data, and the most obvious first step is to define "good" and "test" as they apply to data. Let's talk about forecasting first. We need time-based data in order to test forecasting. There are a few statistical methods we could use to generate that data, and I want to cover two of them here.

The first is simply random: generate random times and a random value to go with each. Think something like:
Time                     | Value
Jan 1, 2016, 12:02:53 AM | -126.3
July 8, 88, 2:19:21 PM   | .000062

And so on. This creates data sets that have little real world meaning (88 AD? .000062?) but might make good test cases. I like the way the Value column can hold any number at any scale - that can really push an algorithm to its limits. Think of going from an atomic scale for length to a galactic scale for length - the precision of the algorithm will get stretched past a poorly designed breaking point for sure, and probably to the limit of a well designed breaking point. And one of the roles of test (well covered in my first few posts on this blog) is to verify that when the breaking point is hit, it is handled gracefully. Oh, and we document this as well.
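
Here is a minimal sketch of what that first generator might look like, in Python. The file name, row count, and value ranges are all assumptions on my part, not the actual tool:

    import csv
    import random
    from datetime import datetime, timedelta

    START = datetime.min   # year 1 - deliberately includes dicey pre-Gregorian dates
    END = datetime.max     # year 9999
    SPAN_SECONDS = (END - START).total_seconds()

    def random_time():
        # Uniformly random timestamp across the full supported range.
        return START + timedelta(seconds=random.uniform(0, SPAN_SECONDS))

    def random_value():
        # A mantissa scaled anywhere from 10^-8 to 10^8, so consecutive
        # rows can jump between atomic and galactic scales.
        return random.uniform(-10, 10) * 10.0 ** random.randint(-8, 8)

    with open("random_test_data.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Time", "Value"])
        for _ in range(1000):
            writer.writerow([random_time().isoformat(sep=" "), random_value()])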

The time column is a little more cumbersome. Going back past 1582 gets dicey, and right now Tableau only supports the Gregorian calendar. Also, date formats can lead to their own unique set of test cases that an application has to handle, and many applications have a whole team devoted to this area. Notice also that I did not include time zones - that facet alone has contributed to the end of application development in some cases.
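
To make the date format point concrete, here is a quick illustration of my own (not a case from our tool): the same string parses as two different dates depending on which convention the code assumes.

    from datetime import datetime

    s = "01/02/2016"
    print(datetime.strptime(s, "%m/%d/%Y").date())  # 2016-01-02 (US: January 2)
    print(datetime.strptime(s, "%d/%m/%Y").date())  # 2016-02-01 (EU: February 1)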

We might be tempted to impose a rule about the lowest and highest date/time values allowed in the Time column, but we need to test extreme values as well. Having a "bogus" value - for instance, a year of 12,322 AD - gives us a concrete starting point for a potential code fix, rather than just documenting the extreme values as a limitation. Random cases can be good tests, but they can also be noisy and point out the same known limitations over and over again. In some cases we want to avoid that and focus on more realistic data, so that we can validate the code works correctly in non-extreme cases.

A second method for generating the time series, one that would help here, is to follow a time-based generating process such as a Poisson process. Basically, this can be used to generate sample data for events that are driven by the length of time between occurrences, such as customers coming into a store:
Time     | Number of Customers
10:00 AM | 5
10:10 AM | 7
10:20 AM | 11
10:30 AM | 24
10:40 AM | 16


Etc… 
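
For the curious, here is a minimal sketch of how a table like that could be generated: simulate the Poisson process by drawing exponential times between arrivals, then count arrivals in each 10-minute bucket. The arrival rate, store hours, and file name are assumptions on my part:

    import csv
    import random
    from datetime import datetime, timedelta

    RATE_PER_MIN = 1.2    # assumed average arrivals per minute
    BUCKET_MIN = 10       # bucket width in minutes
    OPEN = datetime(2016, 10, 24, 10, 0)
    CLOSE = datetime(2016, 10, 24, 18, 0)
    total_min = (CLOSE - OPEN).total_seconds() / 60

    # Exponential gaps between arrivals give a Poisson process;
    # tally how many arrivals land in each bucket.
    counts = {}
    t = random.expovariate(RATE_PER_MIN)
    while t < total_min:
        bucket = int(t // BUCKET_MIN)
        counts[bucket] = counts.get(bucket, 0) + 1
        t += random.expovariate(RATE_PER_MIN)

    with open("customer_arrivals.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Time", "Number of Customers"])
        for b in range(int(total_min) // BUCKET_MIN):
            stamp = OPEN + timedelta(minutes=b * BUCKET_MIN)
            writer.writerow([stamp.strftime("%I:%M %p"), counts.get(b, 0)])

With these assumed numbers, each bucket should average about 12 customers (1.2 per minute times 10 minutes), with the kind of variation you see in the sample table.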

So our tool will have to fulfill both these needs as well as any others we may discover as we move forward.  Once we have a good starting set of needs, we can start designing the tool.

Questions, comments, concerns and criticisms always welcome,
John

