Getting Better Statistics from State Machine Tests
One of the risks of property-based testing is that, if your tests pass, then you never actually get to see your test data. You could potentially be running a large number of trivial tests, and ending up with a false sense of security as a result. To counter this, it’s important to collect statistics from your tests, and check that the distribution of tests is reasonable. This blog post shows you how to use one of the new features of QuickCheck state machines to get better statistics than you could before.
A Simple Example
As our example, we’re going to use a QuickCheck model of the Erlang process registry, which you can find here. This model combines positive and negative testing of register, unregister, and whereis, and moreover kills processes randomly to test the registry’s behaviour in the presence of crashes.
Here is the property that tests the model:
prop_registry() ->
  eqc:numtests(1000,
    eqc_statem:show_states(
      ?FORALL(Cmds, commands(?MODULE),
        begin
          [catch unregister(N) || N <- ?names],
          {H, S, Res} = run_commands(?MODULE, Cmds),
          pretty_commands(?MODULE, Cmds, {H, S, Res},
            aggregate(command_names(Cmds),
              ?IMPLIES(Res /= precondition_failed,
                       Res == ok)))
        end))).
It specifies that we should run 1,000 tests, and that the model states should be pretty-printed when a test fails. It generates test cases from the model in this module, cleans up the registry before running the test, runs the generated commands, and then pretty-prints the result in the case of a test failure. The next line specifies that we want to collect statistics on the names of the commands in the test case. The final two lines just check the result of the test.
We’re going to focus here on the statistics, though, collected by this line of code:
aggregate(command_names(Cmds),
When we test this property, we see that it passes, and we see the distribution of command names in the tests:
OK, passed 1000 tests
22.53% {registry_eqc,unregister,1}
22.18% {registry_eqc,whereis,1}
21.88% {registry_eqc,spawn,0}
16.75% {registry_eqc,register,2}
16.65% {registry_eqc,kill,1}
From this we can see that unregister, whereis and spawn (which is used to create processes that we can use as test data) were called equally often, while register and kill were called a little less often, because neither can be called until we have spawned a process to pass to them.
Using Features
So, we called all of the functions in our test reasonably often, but we know nothing about how often we tested positive versus negative cases. For example, register can fail under a variety of circumstances. How often did that happen? A new extension to QuickCheck makes it easy to find out. We can now associate “features” with each call in a test case; here, we might associate the features success and failure with calls of register. Then we can collect statistics on the features seen during testing, instead of or as well as the statistics we already collect.
To associate features with calls of register, we just add a definition of the register_features callback to our code:
register_features(_S,_Args,Result) ->
  case Result of
    true       -> [success];
    {'EXIT',_} -> [failure]
  end.
This callback takes the same parameters as a postcondition, such as register_post, and returns a list of features that we assign to the call. So features can depend on the model state, the arguments, or the result of a call. In this case, we just check to see whether register raised an exception, or returned true (indicating success). Adding this callback is enough to make QuickCheck collect these features as tests are run.
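The bookkeeping behind this is easy to picture: every call yields a list of features, and the features are tallied across all the calls in all the tests. Here is a rough Python sketch of that tallying, with made-up call results standing in for a real run (an illustration of the idea, not eqc’s implementation):

```python
from collections import Counter

def register_features(result):
    # Mimics the Erlang callback: one feature per call, success or failure.
    # (In the Erlang model, failure shows up as an {'EXIT',_} result.)
    return ["success"] if result is True else ["failure"]

# Made-up results of register calls observed over a run of tests.
results = [True, False, True, False, False]

tally = Counter(f for r in results for f in register_features(r))
total = sum(tally.values())
for feature, count in tally.most_common():
    print(f"{100 * count / total:.2f}% {feature}")
# prints: 60.00% failure
#         40.00% success
```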
Then we replace the line in our property that collects statistics by this one:
aggregate(call_features(H),
telling QuickCheck to collect call features rather than command names. Notice that the call features are extracted from the test case history H, rather than from the generated list of commands, because obviously they depend on the dynamic behaviour during the test. When we run the tests, then instead of the distribution of command names, we see
OK, passed 1000 tests
60.82% {{registry_eqc,register,2},failure}
39.18% {{registry_eqc,register,2},success}
We can see at once that only 40% of the calls to register succeeded—which is quite OK. Notice that QuickCheck automatically tells us which function a feature refers to—we don’t need to track that explicitly.
Let’s collect the success rate of unregister and whereis too:
unregister_features(_,_,Result) ->
  case Result of
    true       -> [success];
    {'EXIT',_} -> [failure]
  end.
whereis fails in a slightly different way: instead of raising an exception, it returns undefined:
whereis_features(_,_,Result) -> [Result /= undefined].
Now when we run our tests, we see:
OK, passed 1000 tests
33.46% {{registry_eqc,unregister,1},failure}
33.12% {{registry_eqc,whereis,1},false}
16.27% {{registry_eqc,register,2},failure}
11.09% {{registry_eqc,register,2},success}
3.13% {{registry_eqc,unregister,1},success}
2.93% {{registry_eqc,whereis,1},true}
Now we can see that most calls to unregister and whereis fail! In fact, only around 10% of the calls to either of these operations succeed. This is not necessarily a problem: in 1,000 tests, each consisting of many calls, we perform so many calls that we will certainly have called both unregister and whereis successfully many times. But these statistics do raise the question of whether we are using our test effort as effectively as we could. The point here, though, is that by assigning features to calls we can easily discover problems in the distribution of tests that would otherwise be quite hard to observe.
Improving the distribution
Let’s see how we can increase the proportion of successful calls to these two functions. First we must understand why so many calls fail. We’re choosing the name to unregister or look up uniformly from a list of five possibilities:
-define(names,[a,b,c,d,e]).

name() -> elements(?names).
So if only 10% of the calls are succeeding, then the registry must contain on average only 0.5 entries when we make these calls! We can easily confirm this by adding another measurement to the property we are testing: we add
measure(registered,length(S#state.regs), ...)
around the last two lines of the property. Here S is the state of our model at the end of the test, and S#state.regs is our model of the expected registry contents: a list of pairs of registered names and pids. This call measures the number of registered processes at the end of each test, which gives us a good approximation to the average number of processes in the registry overall. Rerunning our tests, we see
registered: Count: 1000 Min: 0 Max: 4 Avg: 0.4350 Total: 435
which tells us that we measured “registered” 1,000 times, once per test, and the average number of processes we found in the registry was 0.435. No wonder calls to unregister and whereis usually fail!
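The step from “10% success” to “0.5 registered names” is simple arithmetic: name() picks one of the five names uniformly, so a lookup succeeds with probability (average number of registered names)/5. A quick Python check of that arithmetic (nothing eqc-specific here):

```python
NUM_NAMES = 5  # name() chooses uniformly from [a,b,c,d,e]

def lookup_success_probability(avg_registered):
    # A uniformly chosen name is currently registered with this probability.
    return avg_registered / NUM_NAMES

print(lookup_success_probability(0.5))    # 0.1: matches the ~10% success rate
print(lookup_success_probability(0.435))  # 0.087: the measured average predicts ~9%
```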
Registering processes more often
The most obvious way to increase the average number of processes in the registry is to call register more often! We can specify a weight for each call via the weight callback, as follows:
weight(_,register) -> 20;
weight(_,_)        -> 10.
This specifies that calls to register should be generated with a weight of 20, twice as often as any other function. (Weights must be integers, so we use 10 rather than 1 as the default so that we can later decrease the weights of other functions easily.) When we rerun our tests, we now see
OK, passed 1000 tests
28.72% {{registry_eqc,register,2},failure}
26.03% {{registry_eqc,whereis,1},false}
24.97% {{registry_eqc,unregister,1},failure}
13.61% {{registry_eqc,register,2},success}
3.40% {{registry_eqc,whereis,1},true}
3.27% {{registry_eqc,unregister,1},success}

registered: Count: 1000 Min: 0 Max: 4 Avg: 0.5900 Total: 590
We increased the average number of registered processes at the end of a test to almost 0.6: an improvement, but a small one. Unsurprisingly, the proportion of successful calls to unregister and whereis did not change much.
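To see why the improvement is small, we can translate weights into generation frequencies: a command is generated with probability proportional to its weight. With five commands, doubling register’s weight only raises it from 1/5 to 1/3 of the generated calls, and since preconditions still filter out some calls at run time, the realized change is smaller still. A quick sketch of the arithmetic in Python:

```python
def generation_frequencies(weights):
    # Probability of generating each command, proportional to its weight.
    total = sum(weights.values())
    return {cmd: w / total for cmd, w in weights.items()}

default = {"spawn": 10, "register": 10, "unregister": 10, "whereis": 10, "kill": 10}
boosted = dict(default, register=20)

print(generation_frequencies(default)["register"])  # 0.2
print(generation_frequencies(boosted)["register"])  # about 0.33
```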
Why didn’t this work? Well, looking at the collected call features, we see that most calls to register failed! To make this even clearer, we can add another line to our property to display statistics just for calls to register:
aggregate(call_features(register,H), ...)
call_features/2 lets us collect statistics about the calls to just one function, so we can easily see what proportion of register calls succeed, for example. Of course, we can aggregate call features several times, to analyze several functions independently, as well as or instead of using call_features/1 to see the overall picture.
The resulting output tells its sorry tale very clearly:
68.33% {{registry_eqc,register,2},failure}
31.67% {{registry_eqc,register,2},success}
Before we adjusted its weight, 60% of calls to register failed. Now, almost 70% do! Our problem is not just that we call register too rarely; it is that most calls fail.
Why do calls to register fail?
Time to collect some more information. Looking into the model, we find a function that predicts whether or not a call to register is expected to fail:
register_ok(S,[Name,Pid]) ->
  not lists:keymember(Name,1,S#state.regs) andalso
  not lists:keymember(Pid,2,S#state.regs) andalso
  not lists:member(Pid,S#state.dead).
There are three possible reasons for failure: because the name is already registered, because the pid is already registered, or because the pid is dead. We can investigate which of these is happening by redefining the features of a register call to tell us the reason for failure:
register_features(S,[Name,Pid],Result) ->
  [success                 || Result==true] ++
  [name_already_registered || lists:keymember(Name,1,S#state.regs)] ++
  [pid_already_registered  || lists:keymember(Pid,2,S#state.regs)] ++
  [pid_is_dead             || lists:member(Pid,S#state.dead)].
(The list comprehensions with only a condition just include the given element in the resulting list when the condition is true). Running our tests again, we see
36.45% {{registry_eqc,register,2},pid_is_dead}
28.43% {{registry_eqc,register,2},success}
21.41% {{registry_eqc,register,2},pid_already_registered}
13.71% {{registry_eqc,register,2},name_already_registered}
So most calls to register fail because the pid we try to register is dead! The next most common reason for failure is that the pid is already registered, and the least frequent problem is that the name is already registered.
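The guard-only comprehensions in register_features are an Erlang idiom worth a note: a comprehension with no generator contributes its element exactly when its condition holds, so one failed call can carry several failure reasons at once. The same pattern is easy to mimic in Python (the model state here is hand-rolled for illustration; regs plays the role of the model’s list of name/pid pairs):

```python
def register_features(regs, dead, name, pid, result):
    # Each pair plays the role of one guard-only comprehension: the tag is
    # included exactly when its condition is true.
    checks = [
        ("success",                 result is True),
        ("name_already_registered", any(n == name for n, _ in regs)),
        ("pid_already_registered",  any(p == pid for _, p in regs)),
        ("pid_is_dead",             pid in dead),
    ]
    return [tag for tag, holds in checks if holds]

# b is registered to pid2, and pid1 has been killed: registering a under
# pid1 fails for exactly one reason.
print(register_features([("b", "pid2")], {"pid1"}, "a", "pid1", False))
# ['pid_is_dead']
```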
So how can we make register succeed more often? Simple! We shall kill processes less often, so that fewer pids become dead, and we shall spawn processes more often, so that register is more likely to choose a fresh pid when it is called. Revising our weights like this:
weight(_,spawn)    -> 50;
weight(_,register) -> 20;
weight(_,kill)     ->  5;
weight(_,_)        -> 10.
we get the following results:
OK, passed 1000 tests
25.90% {{registry_eqc,register,2},success}
20.06% {{registry_eqc,unregister,1},failure}
18.90% {{registry_eqc,whereis,1},false}
11.61% {{registry_eqc,register,2},name_already_registered}
7.53% {{registry_eqc,register,2},pid_already_registered}
6.57% {{registry_eqc,whereis,1},true}
5.68% {{registry_eqc,unregister,1},success}
3.76% {{registry_eqc,register,2},pid_is_dead}

53.08% {{registry_eqc,register,2},success}
23.80% {{registry_eqc,register,2},name_already_registered}
15.43% {{registry_eqc,register,2},pid_already_registered}
7.70% {{registry_eqc,register,2},pid_is_dead}

registered: Count: 1000 Min: 0 Max: 5 Avg: 1.1600 Total: 1160
We increased the average number of pids in the registry to over 1.1; we made over 50% of calls to register succeed; we increased the successful calls of unregister and whereis to more than one in five. Progress, but we’re not there yet!
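The “more than one in five” figure can be read straight off the aggregate: an operation’s success rate is its success percentage divided by the sum of its success and failure percentages. Checking the numbers from the run above in Python (recall that for whereis the two features are true and false):

```python
def success_rate(success_pct, failure_pct):
    # Fraction of an operation's calls that succeeded.
    return success_pct / (success_pct + failure_pct)

# Feature percentages taken from the aggregated output above.
unregister_rate = success_rate(5.68, 20.06)
whereis_rate    = success_rate(6.57, 18.90)
print(round(unregister_rate, 2), round(whereis_rate, 2))  # 0.22 0.26
```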
Unregistering less often
Given that we are using five different names as test data, we might like the average number of processes in the registry to be around 2.5. We are still some way away from that, and one reason is that, as soon as a process is in the registry, we start trying to unregister it again! Somewhat paradoxically, if we want unregister to succeed more often, then we need to call it less often! So we shall reduce the weight of unregister:
weight(_,spawn)      -> 50;
weight(_,register)   -> 20;
weight(_,unregister) ->  1;
weight(_,kill)       ->  5;
weight(_,_)          -> 10.
If we also add two more calls to aggregate to collect the features of calls to unregister and whereis separately, then we see:
OK, passed 1000 tests
29.47% {{registry_eqc,register,2},success}
23.72% {{registry_eqc,whereis,1},false}
18.25% {{registry_eqc,register,2},name_already_registered}
10.38% {{registry_eqc,register,2},pid_already_registered}
9.24% {{registry_eqc,whereis,1},true}
5.28% {{registry_eqc,register,2},pid_is_dead}
2.56% {{registry_eqc,unregister,1},failure}
1.10% {{registry_eqc,unregister,1},success}

46.49% {{registry_eqc,register,2},success}
28.80% {{registry_eqc,register,2},name_already_registered}
16.38% {{registry_eqc,register,2},pid_already_registered}
8.33% {{registry_eqc,register,2},pid_is_dead}

69.9% {{registry_eqc,unregister,1},failure}
30.1% {{registry_eqc,unregister,1},success}

72.0% {{registry_eqc,whereis,1},false}
28.0% {{registry_eqc,whereis,1},true}

registered: Count: 1000 Min: 0 Max: 5 Avg: 1.4310 Total: 1431
We’re up to 1.4 processes in the registry on average, and a 30% success rate for calls to unregister and whereis. As one final refinement, we’ll try generating longer test cases: it may be that we’re now generating so many calls to spawn that we don’t have time in one test to fill the registry up. We replace the ?FORALL line of the property by
?FORALL(Cmds, more_commands(3,commands(?MODULE)),
which generates test sequences three times as long as the default, and run our tests again:
OK, passed 1000 tests
26.34% {{registry_eqc,register,2},name_already_registered}
23.99% {{registry_eqc,register,2},success}
17.58% {{registry_eqc,whereis,1},false}
13.47% {{registry_eqc,whereis,1},true}
10.26% {{registry_eqc,register,2},pid_already_registered}
5.03% {{registry_eqc,register,2},pid_is_dead}
1.90% {{registry_eqc,unregister,1},failure}
1.43% {{registry_eqc,unregister,1},success}

40.14% {{registry_eqc,register,2},name_already_registered}
36.55% {{registry_eqc,register,2},success}
15.64% {{registry_eqc,register,2},pid_already_registered}
7.67% {{registry_eqc,register,2},pid_is_dead}

57.1% {{registry_eqc,unregister,1},failure}
42.9% {{registry_eqc,unregister,1},success}

56.60% {{registry_eqc,whereis,1},false}
43.40% {{registry_eqc,whereis,1},true}

registered: Count: 1000 Min: 0 Max: 5 Avg: 2.0080 Total: 2008
Now we have on average two processes in the registry at the end of a test, and 40-50% of calls to unregister and whereis succeed: we’re calling unregister less often, but more calls are succeeding (in absolute, not just relative terms). This is a pretty good result. We could go further, and try to reduce the “wasted” calls to register, perhaps by reducing its weight if the registry is “nearly full”, or perhaps by generating the names we use more carefully, but this is enough for today.
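The “nearly full” idea would exploit the fact that the weight callback also receives the model state, so register’s weight could drop once most names are taken. A hypothetical sketch of such a rule, written in Python terms (the threshold and the numbers here are invented for illustration, not part of the model above):

```python
NUM_NAMES = 5  # the model registers names drawn from [a,b,c,d,e]

def register_weight(num_registered):
    # Back off once the registry is nearly full: a uniformly chosen name
    # is then very likely to be taken already, so the call would be wasted.
    return 5 if num_registered >= NUM_NAMES - 1 else 20

print(register_weight(0))  # 20: registry empty, register eagerly
print(register_weight(4))  # 5: nearly full, most attempts would fail
```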
The lesson is: assigning features to calls, and gathering statistics about them, can give us much more information about our tests, make problems obvious, and help us get better value from the machine time we spend on testing.