Debugging race conditions

Race conditions are among the worst kind of problem to debug: they tend to appear only rarely and unrepeatably, often arise only in long running cases in production, and leave little evidence of what went wrong.

Mnesia is a database written in Erlang that is used by for example in telecommunication equipment and financial transaction databases. Erlang is not immune to race conditions, despite its excellent support for concurrency. Race conditions can give rise to rare intermittent failures in software such as mnesia. Mnesia was known to fail “once every month or two” in production, and race conditions are one likely cause.

We extended QuickCheck with features to generate concurrent unit tests to provoke race conditions, and shrink them to minimal examples. With less than 200 lines of QuickCheck code, we were able to generate concurrent tests of dets, the disk storage layer of mnesia, and provoke five separate race conditions in short order. The technique is easy to apply to any software specified by a QuickCheck state-machine.

This not only means that we can find these kind of concurrency errors easily, it also means we can find them quicker, by already addressing them at the unit testing level.