Shrinking fallback servers to primary servers

Imagine you are running a large system with some primary and some fallback servers. You have at least one, up to three primary servers and zero to four fallback servers. You run tests on a random configuration to make sure your system behaves as expected. One of those tests fails and you want to shrink it to a minimal counter example.

A minimal counter example often takes the configuration into account. Years ago we showed how to test elevator control software using QuickCheck. The property first selected a random number of floors and elevators and then run a sequence of random commands. As soon as such sequence failed, it would shrink the configuration to the minimal number of floors and elevators needed to provoke this error. The idea is that analyzing a failure in a small configuration or a default (i.e., well known) configuration is easier than analyzing it in a large, tricky configuration. Moreover, if we cannot shrink the configuration to for example 2 floors, then we know that we need at least 3 floors for the error to happen, giving us additional information about the possible causes.

At Basho they test their software against many random configurations of primary and fallback servers. The primary servers typically have id 1, 2 and 3 and the fallback servers typically have ids larger than 3. Thus the list [1, 3, 6, 8] represents a configuration with two primary servers (1 and 3) and two fallback servers (6 and 8).
A failing test case is easier to analyze if there are less servers involved and in particular if it can be provoked using only primary servers. We want a generator for such configurations that has good shrinking behaviour. That is, it may shrink [1, 3, 6, 8] to [1, 3, 2], clearly less nodes and only primary nodes.

Posting a challenge

At the first QuickCheck User workshop we had a free 30 minute slot and I posted the above sketched problem as a QuickCheck generator design challenge. Create a generator primary_fallback(N, M) for M >= N. The idea is that N is the maximum number of primary nodes and M is the maximum node number. The generator has to fullfil the following conditions:

  1. It generates a list of numbers between 1 and M
  2. The list contains at least one primary node, i.e., it contains an element less or equal to N.
  3. Each number occurs at most once
  4. For shrunk minimal counter examples we have:
    • Removing an element from the list should result in a passing test
    • Replacing a fallback by a primary node should result in a passing test

In order to verify that we shrink correctly, I proposed to use the QuickCheck primitives eqc:watch_shrinking() and eqc_gen:sampleshrink(Gen). The correct way, as @ulfnorell showed during the workshop is to create a property over shrinking paths, but we only had 30 minutes for this challenge.

If we want to watch shrinking, we better have a failing property. I came up with a rather simplistic property that would show some shrinking.

prop_pf() ->;
   ?FORALL(Nodes, primary_fallback(3, 7),
           lists:sum(Nodes) < 6).

Thus, if we have at most 3 primary nodes, always carrying node id 1, 2, or 3 and we have some fallback nodes with node id 4, 5, 6, or 7, then when we pick any set of nodes, the sum of the node ids is less than 6.

A clearly wrong solution would be to implement the primary_fallback generator as a subset of the list [1..M]:

primary_fallback(_, M) ->
   sublist(lists:seq(1,M)).

When testing with this generator, one would, for example see shrinking to a list without primary nodes as well as to lists that do keep the fallback nodes instead of replacing them by primary nodes. The first shrinking is wrong, since we need a primary node in each faling test case. The second shrinking is somehow correct, since replacing the fallback node 4 with a primary node makes the property pass. However, given the starting point [1,2,3,4] we might have liked to get the failing counter example [1,2,3] as minimal result:

4> eqc:quickcheck(prefs_eqc:prop_pf()).
...........Failed! After 12 tests.
[1,2,3,6]
Shrinking xxx..xx(2 times)
[6]
false
5> eqc:quickcheck(prefs_eqc:prop_pf()).
.................Failed! After 18 tests.
[1,2,3,4]
Shrinking xxx.xxx.xx(2 times)
[2,4]
false

By checking this property one observes the final shrinking result, but not the intermediate results. In order to do so, one uses watch_shrinking after one has found a failing test case:

6> eqc:watch_shrinking().              
Failed!
[1,2,3,4]

OK!
[]

OK!
[1]

OK!
[1,2]

Failed!
[3,4]

OK!
[4]

OK!
[3]

OK!
[1,4]

Failed!
[2,4]

OK!
[2,3]

OK!
[2]
ok

Shrinking never considered the case [1, 2, 3] as an alternative, smaller test case.
Your challenge: create a better generator.

A different property

The first thing that happens is that @rjmh comes up with a different property, quickly thereafter improved by @LeHoff. Instead of fixing the property, take an arbitrary property that fails for a specific given input. The property should fail sporadically in order to get interesting failures.

prop_randomfailure() ->
  ?FORALL({Nodes, P}, {primary_fallback(3, 7),
                       function1(weighted_default({70,true}, {10,false}))},
          P(Nodes)).

However, this property has as a disadvantage that you may have very few valid shrinking alternatives and that you do not know where a test case should shrink to. Therefore, we disregard this solution and stick to the prop_pf defined above.

Simple and reasonable effective

I present one generator that does a reasonable job. Compare this to one of your own solutions and see where your solution performs better. There’s a price to pay for making a super elegant solution compared to a working solution.

First thing we noticed from our naive approach is the generation of a list of nodes in which we do not have a primary node. We want at least one primary node! We could choose that first using elements(lists:seq(1, N)) and then choose a sublist of other nodes with sublist(lists:seq(1, M)). This will generate sublists that do contain the node chosen first and therefore we can eliminate duplicates from the concatenated list.

primary_fallback(N, M) ->
  ?LET({Primary, Fallback}, 
       {[elements(lists:seq(1, N))], sublist(lists:seq(1, M))},
       shuffle(lists:usort(Primary ++ Fallback))).

This generator produces lists of nodes that always contain a primary node.

7> eqc_gen:sample(prefs_eqc:primary_fallback(3,7)).
[1,2,3,6,7]
[1,2,3,4]
[4,5,7]
[5]
[4,6,7]
[2]
[2,3,5,6,7]
[1,2,3,4,5,6]
[4]
[1,2,3,5,6]
[2,3,4,5,6]
ok

We should also observe the shrinking behaviour and indeed, similar to before we shrink the length of the list eagerly.

8> eqc_gen:sampleshrink(prefs_eqc:primary_fallback(3,7)).
[2,5,7]
--> [[],[2],[2,5],[7],[5,7],[2,7],[1,5,7],[1,2,7],[2,3,7],[2,4,7],[1,2,5],[2,3,5],[2,5,6]]
ok

Using this generator for our property results in two failing cases [1,2,3] and [1,5]. Both are correct failing cases in the sense that they cannot be made smaller by shortening the list or replacing a fallback by a primary; when we replace 5 by 4, the property passes.

However, it is a bit unsatisfactory to get [1,5] so many times as a result of shrinking:
Failed! After 1 tests.

[3,1,6,2,7]
Shrinking .xxx...x.x(5 times)
[1,5]
false

We probably want to add a shrinking alternative that takes care of the cases in which the primary nodes alone could have caused the property to fail. We can do so by adding a shrinking alternative.

primary_fallback(N, M) ->
  ?LET({Primary, Fallback}, 
       {[elements(lists:seq(1, N))], sublist(lists:seq(1, M))},
       ?SHRINK(shuffle(lists:usort(Primary ++ Fallback)),
              [ [ E || E<-Primary++Fallback, E=<N] ])).

This only adds one shrinking alternative to the solution, viz. filtering out all primary nodes. One could add additional shrinking alternatives with subset of primary nodes, but I decided to not complicate this generator further.