My automation changes are checked in and working fine, but I am putting this on hold right now as our team is working on a tool to generate test data for the future. What we need is a tool to create CSV files with good test data, and the most obvious first step is to define "good" and "test" as they apply to data.
Let's talk about forecasting first. We need time-based data in order to test forecasting. There are a few statistical methods we could use for generating the data, and I want to cover two of them here.
First is simply random: create a tool to generate random times and a random value to go with each. Think something like this (a quick sketch of such a generator follows the table):
Time                     | Value
Jan 1, 2016, 12:02:53 AM | -126.3
July 8, 88, 2:19:21 PM   | .000062
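Here is a minimal sketch of what such a generator might look like in Python. The date range, value scale, column names, and output file name are all my own assumptions for illustration, not part of any real tool:

```python
import csv
import random
from datetime import datetime, timedelta

def random_timestamp(start=datetime(1, 1, 1), end=datetime(9999, 12, 31)):
    # Pick a uniformly random timestamp between two assumed bounds.
    span = (end - start).total_seconds()
    return start + timedelta(seconds=random.uniform(0, span))

def random_value():
    # Pick a value at a random scale: a random mantissa times a random
    # power of ten, so the data jumps between tiny and huge magnitudes.
    mantissa = random.uniform(-10, 10)
    exponent = random.randint(-6, 6)
    return mantissa * (10 ** exponent)

def write_random_csv(path, rows=100):
    # Write a CSV of random Time/Value pairs for stress-testing forecasting code.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Time", "Value"])
        for _ in range(rows):
            writer.writerow([random_timestamp().isoformat(), random_value()])

write_random_csv("random_test_data.csv")
```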
And so on. This creates data sets that have little real-world meaning (88 AD? .000062?) but might be good test cases. I like the way the Value column can have any number at any scale - that can really push an algorithm to its limits. Think of going from an atomic scale for length to a galactic scale for length - the precision of the algorithm will get stretched past a poorly designed breaking point for sure, and probably to the limit of a well-designed breaking point. And one of the roles of test is to verify that the breaking point (well covered in my first few posts on this blog), when hit, is handled gracefully. Oh, and we document this as well.
The Time column is a little more cumbersome. Going back past 1582 gets dicey, and right now Tableau only supports the Gregorian calendar. Also, date formats can lead to their own unique set of test cases that an application has to handle, and most applications have a whole team dedicated to this area. Notice also that I did not include time zones - that facet alone has contributed to the end of application development efforts in some cases.
We might be tempted to impose a rule about the lowest and highest date/time values allowed in the Time column, but we need to test extreme values as well. Having a "bogus" value - for instance, a year of 12,322 AD - would give us a good starting point for working on a potential code fix, rather than simply documenting these extreme values. Random cases can be good tests, but they can also be noisy and point out the same known limitations over and over again. In some cases we want to avoid that and focus on more realistic data so that we can validate the code works correctly in non-extreme cases.
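For the extreme end, a short hand-picked list of boundary timestamps is probably more useful than hoping the random generator stumbles onto them. Something like the following sketch, where the specific picks are my own and are kept as strings so we can go past what a date library would even accept:

```python
# Hand-picked boundary and "bogus" timestamps for the Time column.
# The specific choices here are illustrative assumptions.
boundary_times = [
    "0001-01-01 00:00:00",   # earliest plausible year
    "1582-10-15 00:00:00",   # start of the Gregorian calendar
    "1970-01-01 00:00:00",   # Unix epoch
    "9999-12-31 23:59:59",   # latest year many date libraries accept
    "12322-01-01 00:00:00",  # deliberately bogus year from the example above
]
```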
A second method for the time series that would help here would be to follow a time-based generating process like a Poisson process. Basically, this can be used to generate sample data for events that are defined by the length of time between them, such as customers coming into a store (a sketch of such a generator follows the table below).
Time    | Number of Customers
10:00AM | 5
10:10   | 7
10:20   | 11
10:30   | 24
10:40   | 16
Etc…
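Here is a minimal sketch of a Poisson-style generator: simulate exponentially distributed inter-arrival times, then count how many arrivals land in each bucket. The fixed arrival rate and ten-minute buckets are my own illustrative choices:

```python
import random
from datetime import datetime, timedelta

def poisson_customer_counts(start, buckets=5, bucket_minutes=10, rate_per_minute=1.0):
    # Simulate arrivals with exponential inter-arrival times (a Poisson process)
    # and count how many fall into each time bucket.
    total_minutes = buckets * bucket_minutes
    counts = [0] * buckets
    t = 0.0
    while True:
        t += random.expovariate(rate_per_minute)  # minutes until the next arrival
        if t >= total_minutes:
            break
        counts[int(t // bucket_minutes)] += 1
    return [(start + timedelta(minutes=i * bucket_minutes), counts[i])
            for i in range(buckets)]

for bucket_start, n in poisson_customer_counts(datetime(2016, 1, 1, 10, 0)):
    print(bucket_start.strftime("%I:%M%p"), n)
```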
So our tool will have to fulfill both of these needs, as well as any others we may discover as we move forward. Once we have a good starting set of needs, we can start designing the tool.
Questions, comments, concerns and criticisms always welcome,
John