Getting Better Statistics from State Machine Tests

One of the risks of property-based testing is that, if your tests pass, you never actually get to see your test data. You could be running a large number of trivial tests, and ending up with a false sense of security as a result. To counter this, it’s important to collect statistics from your tests, and check that the distribution of tests is reasonable. This blog post shows you how to use one of the new features of QuickCheck state machines to get better statistics than you could before.

A Simple Example

As our example, we’re going to use a QuickCheck model of the Erlang process registry, which you can find here. This model combines positive and negative testing of register, unregister, and whereis, and moreover kills processes randomly to test the registry’s behaviour in the presence of crashes.

Here is the property that tests the model:

prop_registry() ->
  eqc:numtests(1000,
  eqc_statem:show_states(
  ?FORALL(Cmds, commands(?MODULE),
          begin
            [catch unregister(N) || N <- ?names],
            {H, S, Res} = run_commands(?MODULE,Cmds),
            pretty_commands(?MODULE, Cmds, {H, S, Res},
                            aggregate(command_names(Cmds),
                                      ?IMPLIES(Res/=precondition_failed,
                                               Res == ok)))
          end))).

It specifies that we should run 1,000 tests, and that the model states should be included when a failing test is pretty-printed. It generates test cases from the model in this module, cleans up the registry before each test, runs the generated commands, and pretty-prints the result if the test fails. The next line specifies that we want to collect statistics on the names of the commands in the test case. The final two lines just check the result of the test.
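To try the property out, we run it through eqc:quickcheck/1 in the Erlang shell (assuming the model lives in a module called registry_eqc, as the statistics below suggest):

eqc:quickcheck(registry_eqc:prop_registry()).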

Here, though, we’re going to focus on the statistics, which are collected by this line of code:

aggregate(command_names(Cmds),

When we test this property, it passes, and we see the distribution of command names in the tests:

OK, passed 1000 tests

22.53% {registry_eqc,unregister,1}
22.18% {registry_eqc,whereis,1}
21.88% {registry_eqc,spawn,0}
16.75% {registry_eqc,register,2}
16.65% {registry_eqc,kill,1}

From this we can see that unregister, whereis and spawn (which is used to create processes that we can use as test data) were called roughly equally often, while register and kill were called a little less often, because neither can be called until we have spawned a process to pass to them.

Using Features

So, we called all of the functions in our test reasonably often—but we know nothing about how often we tested positive versus negative cases. For example, register can fail under a variety of circumstances. How often did that happen? A new extension to QuickCheck makes it easy to find out. We can now associate “features” with each call in a test case—in this case, we might associate the features success and failure with calls of register. Then we can collect statistics on the features seen during testing, instead of or as well as the statistics we already collect.

To associate features with calls of register, we just add a definition of the register_features callback to our code:

register_features(_S,_Args,Result) ->
  case Result of
    true ->
      [success];
    {'EXIT',_} ->
      [failure]
  end.

This callback takes the same parameters as a postcondition, such as register_post, and returns a list of features that we assign to the call. So features can depend on the model state, the arguments, or the result of a call. In this case, we just check to see whether register raised an exception, or returned true (indicating success). Adding this callback is enough to make QuickCheck collect these features as tests are run.
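Since features may depend on the state as well as the result, we could, for instance, also tag each call with the size of the registry at the time the call was made. This is a hypothetical variation, not used in what follows:

register_features(S,_Args,Result) ->
  %% hypothetical variation: record the registry size alongside success/failure
  Outcome = case Result of
              true       -> success;
              {'EXIT',_} -> failure
            end,
  [Outcome, {registry_size,length(S#state.regs)}].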

Then we replace the line in our property that collects statistics by this one:

aggregate(call_features(H),

telling QuickCheck to collect call features rather than command names. Notice that the call features are extracted from the test case history H, rather than from the generated list of commands, because they depend on the dynamic behaviour during the test. When we run the tests, instead of the distribution of command names we see

OK, passed 1000 tests

60.82% {{registry_eqc,register,2},failure}
39.18% {{registry_eqc,register,2},success}

We can see at once that only about 40% of the calls to register succeeded, which is quite OK. Notice that QuickCheck automatically tells us which function a feature refers to; we don’t need to track that explicitly.

Let’s collect the success rate of unregister and whereis too:

unregister_features(_,_,Result) ->
  case Result of
    true ->
      [success];
    {'EXIT',_} ->
      [failure]
  end.

whereis fails in a slightly different way: instead of raising an exception, it returns undefined:

whereis_features(_,_,Result) ->
  [Result /= undefined].
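A boolean works perfectly well as a feature, but if we wanted more descriptive labels in the statistics we could just as well return atoms. This is a hypothetical alternative; we keep the boolean version below:

whereis_features(_,_,Result) ->
  case Result of
    undefined -> [not_found];
    _Pid      -> [found]
  end.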

Now when we run our tests, we see:

OK, passed 1000 tests

33.46% {{registry_eqc,unregister,1},failure}
33.12% {{registry_eqc,whereis,1},false}
16.27% {{registry_eqc,register,2},failure}
11.09% {{registry_eqc,register,2},success}
3.13% {{registry_eqc,unregister,1},success}
2.93% {{registry_eqc,whereis,1},true}

Now we can see that most calls to unregister and whereis fail! In fact, only around 10% of the calls to each of these operations succeed. This is not necessarily a problem: in 1,000 tests, each consisting of many calls, we perform so many calls that we will certainly have called both unregister and whereis successfully many times. But these statistics do raise the question of whether we are using our test effort as effectively as we could. The point here, though, is that by assigning features to calls we can easily discover problems in the distribution of tests that would otherwise be quite hard to observe.

Improving the distribution

Let’s see how we can increase the proportion of successful calls to these two functions. First we must understand why so many calls fail. We’re choosing the name to unregister or look up uniformly from a list of five possibilities:

-define(names,[a,b,c,d,e]).

name() ->
  elements(?names).
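(For context: in the grouped-callback style this model evidently uses, the argument generators for these two operations presumably look something like the sketch below. This is an assumption about the model, not code taken from it.)

unregister_args(_S) ->
  [name()].

whereis_args(_S) ->
  [name()].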

So if only 10% of the calls are succeeding, then the registry must contain on average only 0.5 entries when we make these calls: the name is chosen uniformly from five possibilities, so the chance that it is currently registered is the number of registered names divided by five. We can easily confirm this by adding another measurement to the property we are testing: we add

measure(registered,length(S#state.regs),
        ...)

around the last two lines of the property. Here S is the state of our model at the end of the test, and S#state.regs is our model of the expected registry contents: a list of pairs of registered names and pids. This call measures the number of registered processes at the end of each test, which gives us a good approximation to the average number of processes in the registry overall.
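Spelled out, the tail of the property now nests like this (a sketch of the relevant lines only):

pretty_commands(?MODULE, Cmds, {H, S, Res},
                aggregate(call_features(H),
                          measure(registered, length(S#state.regs),
                                  ?IMPLIES(Res /= precondition_failed,
                                           Res == ok))))

Rerunning our tests, we see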

registered:    Count: 1000   Min: 0   Max: 4   Avg: 0.4350   Total: 435

which tells us that we measured “registered” 1,000 times, once per test, and the average number of processes we found in the registry was 0.435. No wonder calls to unregister and whereis usually fail!

Registering processes more often

The most obvious way to increase the average number of processes in the registry is to call register more often! We can specify a weight for each call via the weight callback, as follows:

weight(_,register) ->
  20;
weight(_,_) ->
  10.

This specifies that calls to register should be generated with a weight of 20, twice as often as any other function. (Weights must be integers, so we use 10 rather than 1 as the default so that we can later decrease the weights of other functions easily.) When we rerun our tests, we now see

OK, passed 1000 tests

28.72% {{registry_eqc,register,2},failure}
26.03% {{registry_eqc,whereis,1},false}
24.97% {{registry_eqc,unregister,1},failure}
13.61% {{registry_eqc,register,2},success}
3.40% {{registry_eqc,whereis,1},true}
3.27% {{registry_eqc,unregister,1},success}

registered:    Count: 1000   Min: 0   Max: 4   Avg: 0.5900   Total: 590

We increased the average number of registered processes at the end of a test to almost 0.6: an improvement, but a small one. Unsurprisingly, the proportion of successful calls to unregister and whereis did not change much.

Why didn’t this work? Well, looking at the collected call features, we see that most calls to register failed! To make this even clearer, we can add another line to our property to display statistics just for calls to register:

aggregate(call_features(register,H),
          ...)

call_features/2 lets us collect statistics about the calls to just one function, so we can easily see what proportion of register calls succeed, for example. Of course, we can aggregate call features several times, to analyze several functions independently, as well as or instead of using call_features/1 to see the overall picture.
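With this extra call in place, the reporting part of the property nests like this (again a sketch):

aggregate(call_features(H),
          aggregate(call_features(register, H),
                    measure(registered, length(S#state.regs),
                            ?IMPLIES(Res /= precondition_failed,
                                     Res == ok))))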

The resulting output tells its sorry tale very clearly:

68.33% {{registry_eqc,register,2},failure}
31.67% {{registry_eqc,register,2},success}

Before we adjusted its weight, 60% of calls to register failed. Now, almost 70% do! Our problem is not just that we call register too rarely—it is that most calls fail.

Why do calls to register fail?

Time to collect some more information. Looking into the model, we find a function that predicts whether or not a call to register is expected to succeed:

register_ok(S,[Name,Pid]) ->
  not lists:keymember(Name,1,S#state.regs)
    andalso not lists:keymember(Pid,2,S#state.regs)
    andalso not lists:member(Pid,S#state.dead).

There are three possible reasons for failure: because the name is already registered, because the pid is already registered, or because the pid is dead. We can investigate which of these is happening by redefining the features of a register call to tell us the reason for failure:

register_features(S,[Name,Pid],Result) ->
  [success || Result==true] ++
    [name_already_registered || lists:keymember(Name,1,S#state.regs)] ++
    [pid_already_registered || lists:keymember(Pid,2,S#state.regs)] ++
    [pid_is_dead || lists:member(Pid,S#state.dead)].

(A list comprehension with only a condition includes the given element in the resulting list exactly when the condition is true.)
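For example, in the Erlang shell:

1> [success || 1 == 1].
[success]
2> [success || 1 == 2].
[]

Running our tests again, we see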

36.45% {{registry_eqc,register,2},pid_is_dead}
28.43% {{registry_eqc,register,2},success}
21.41% {{registry_eqc,register,2},pid_already_registered}
13.71% {{registry_eqc,register,2},name_already_registered}

So most calls to register fail because the pid we try to register is dead! The next most common reason for failure is that the pid is already registered, and the least frequent problem is that the name is already registered.

So how can we make register succeed more often? Simple! We shall kill processes less often, so that fewer pids become dead, and we shall spawn processes more often, so that register is more likely to choose a fresh pid when it is called. Revising our weights like this:

weight(_,spawn)    -> 50;
weight(_,register) -> 20;
weight(_,kill)     -> 5;
weight(_,_)        -> 10.

we get the following results:

OK, passed 1000 tests

25.90% {{registry_eqc,register,2},success}
20.06% {{registry_eqc,unregister,1},failure}
18.90% {{registry_eqc,whereis,1},false}
11.61% {{registry_eqc,register,2},name_already_registered}
7.53% {{registry_eqc,register,2},pid_already_registered}
6.57% {{registry_eqc,whereis,1},true}
5.68% {{registry_eqc,unregister,1},success}
3.76% {{registry_eqc,register,2},pid_is_dead}

53.08% {{registry_eqc,register,2},success}
23.80% {{registry_eqc,register,2},name_already_registered}
15.43% {{registry_eqc,register,2},pid_already_registered}
7.70% {{registry_eqc,register,2},pid_is_dead}

registered:    Count: 1000   Min: 0   Max: 5   Avg: 1.1600   Total: 1160

We increased the average number of pids in the registry to over 1.1; we made over 50% of calls to register succeed; we increased the successful calls of unregister and whereis to more than one in five. Progress—but we’re not there yet!

Unregistering less often

Given that we are using five different names as test data, we might like the average number of processes in the registry to be around 2.5. We are still some way away from that, and one reason is that, as soon as a process is in the registry, we start trying to unregister it again! Somewhat paradoxically, if we want unregister to succeed more often, then we need to call it less often! So we shall reduce the weight of unregister:

weight(_,spawn)      -> 50;
weight(_,register)   -> 20;
weight(_,unregister) -> 1;
weight(_,kill)       -> 5;
weight(_,_)          -> 10.

If we also add two more calls to aggregate to collect the features of calls to unregister and whereis separately, then we see:

OK, passed 1000 tests

29.47% {{registry_eqc,register,2},success}
23.72% {{registry_eqc,whereis,1},false}
18.25% {{registry_eqc,register,2},name_already_registered}
10.38% {{registry_eqc,register,2},pid_already_registered}
9.24% {{registry_eqc,whereis,1},true}
5.28% {{registry_eqc,register,2},pid_is_dead}
2.56% {{registry_eqc,unregister,1},failure}
1.10% {{registry_eqc,unregister,1},success}

46.49% {{registry_eqc,register,2},success}
28.80% {{registry_eqc,register,2},name_already_registered}
16.38% {{registry_eqc,register,2},pid_already_registered}
8.33% {{registry_eqc,register,2},pid_is_dead}

69.9% {{registry_eqc,unregister,1},failure}
30.1% {{registry_eqc,unregister,1},success}

72.0% {{registry_eqc,whereis,1},false}
28.0% {{registry_eqc,whereis,1},true}

registered:    Count: 1000   Min: 0   Max: 5   Avg: 1.4310   Total: 1431

We’re up to 1.4 processes in the registry on average, and a 30% success rate for calls to unregister and whereis. As one final refinement, we’ll try generating longer test cases: it may be that we’re now generating so many calls to spawn that we don’t have time in one test to fill the registry up. We replace the ?FORALL line of the property by

  ?FORALL(Cmds, more_commands(3,commands(?MODULE)),

which generates test sequences three times as long as the default, and run our tests again:

OK, passed 1000 tests

26.34% {{registry_eqc,register,2},name_already_registered}
23.99% {{registry_eqc,register,2},success}
17.58% {{registry_eqc,whereis,1},false}
13.47% {{registry_eqc,whereis,1},true}
10.26% {{registry_eqc,register,2},pid_already_registered}
5.03% {{registry_eqc,register,2},pid_is_dead}
1.90% {{registry_eqc,unregister,1},failure}
1.43% {{registry_eqc,unregister,1},success}

40.14% {{registry_eqc,register,2},name_already_registered}
36.55% {{registry_eqc,register,2},success}
15.64% {{registry_eqc,register,2},pid_already_registered}
7.67% {{registry_eqc,register,2},pid_is_dead}

57.1% {{registry_eqc,unregister,1},failure}
42.9% {{registry_eqc,unregister,1},success}

56.60% {{registry_eqc,whereis,1},false}
43.40% {{registry_eqc,whereis,1},true}

registered:    Count: 1000   Min: 0   Max: 5   Avg: 2.0080   Total: 2008

Now we have on average two processes in the registry at the end of a test, and 40-50% of calls to unregister and whereis succeed—we’re calling unregister less often, but more calls are succeeding (in absolute, not just relative terms). This is a pretty good result. We could go further, and try to reduce the “wasted” calls to register, perhaps by reducing its weight if the registry is “nearly full”, or perhaps by generating the names we use more carefully—but this is enough for today.

The lesson is: assigning features to calls, and gathering statistics about them, can give us much more information about our tests, make problems obvious, and help us get better value from the machine time we spend on testing.