Using Hedgehog to Test Déjà Fu

Date
Tags concurrency, dejafu, haskell, hedgehog, programming, property-based testing
Target Audience People interested in dejafu or in property-based testing.
Attention Conservation Notice There is a fairly long preamble before I get to the Hedgehog stuff, which you can jump to directly if you don't really care about how the Déjà Fu side of this works.

Déjà Fu is a concurrency testing library, and one thing you definitely don’t want to do when testing concurrent programs is to try every possible interleaving of threads.

Trying every possible interleaving will give you, in general, an exponential blow-up in the number of executions you need to perform as your test case grows. The core testing algorithm we use, a variant of dynamic partial-order reduction (DPOR) [1], attempts to reduce this blow-up. DPOR identifies actions which are dependent, and only tries interleavings which permute dependent actions.
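To make the blow-up concrete: k threads each taking n independent steps can interleave in (k·n)! / (n!)^k ways, a multinomial coefficient that grows exponentially in n. A quick standalone sketch (not part of Déjà Fu; `interleavings` is a name invented here):

```haskell
-- Count the interleavings of k threads, each performing n steps:
-- the multinomial coefficient (k*n)! / (n!)^k.
interleavings :: Integer -> Integer -> Integer
interleavings k n = factorial (k * n) `div` (factorial n ^ k)
 where
  factorial m = product [1 .. m]
```

Just two threads of 10 steps each already admit 184,756 interleavings; DPOR aims to explore only the few that actually permute dependent actions.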

Two actions are dependent if the order in which they are performed matters. For example: two putMVars to the same empty MVar are dependent, because whichever runs first is the one that fills the MVar; putMVars to two different MVars are independent.

So the intuition behind DPOR is that most actions in a concurrent program are independent. DPOR won’t help you much if you have a single piece of shared state which every thread is hitting, but most concurrent programs aren’t like that. The worst case is still a terrible exponential blow-up, but the average case is much better.

The dependency relation is a core part of Déjà Fu today, affecting both performance and correctness. If it says two actions are dependent when they are not, then we may try unnecessary interleavings. If it says two actions are not dependent when they really are, then we may miss necessary interleavings.

Being such an important component, it must be well-tested, right? Well, sort of. The Déjà Fu testsuite mostly consists of small concurrent programs together with a list of expected outputs, testing that Déjà Fu finds all the nondeterminism in each program. This does exercise the dependency relation, but only very indirectly.

The Idea

There things would have remained, had I not experienced one of those coincidence-driven flashes of insight.

I had my idea: I could directly test the dependency relation like so:

  1. Execute a concurrent program.
  2. Normalise its execution trace in some way.
  3. “Replay” the normalised trace.
  4. Assert that the result is the same.

Normalising Traces

So, what is a good normal form for a trace? I tried out a few approaches here, but there was one I kept coming back to: we should shuffle around independent actions to keep the program on the main thread for as long as possible.

There are two reasons I think this works well. (1) The traces we get will be easier for a human to read, as the program will stay on its main thread and only execute another thread where necessary. (2) A Haskell program terminates when the main thread terminates, so by executing the main thread as much as possible, we may find that some actions don’t need to be executed at all.

So firstly we need to know when two actions commute. Let’s just use the dependency relation for that:

-- | Check if two actions commute.
independent
  :: DepState
  -> (ThreadId, ThreadAction)
  -> (ThreadId, ThreadAction)
  -> Bool
independent ds (tid1, ta1) (tid2, ta2) = not (dependent ds tid1 ta1 tid2 ta2)

The DepState parameter tracks information about the history of the execution, allowing us to make better decisions. For example: while in general it matters in which order two putMVars to the same MVar happen, it doesn’t matter if the MVar is already full, as both actions will block without achieving anything.
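To see why two puts to a full MVar commute, here is a small standalone illustration using plain Control.Concurrent.MVar rather than Déjà Fu, with tryPutMVar standing in for putMVar so that nothing blocks: both puts fail on a full MVar regardless of order, so swapping them cannot change the result.

```haskell
import Control.Concurrent.MVar

main :: IO ()
main = do
  var <- newMVar "already full"
  -- Neither put can succeed while the MVar is full, so their
  -- relative order is unobservable.
  a <- tryPutMVar var "hello"
  b <- tryPutMVar var "world"
  print (a, b)  -- prints (False,False)
```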

The approach works well in practice, but it has been the source of so many off-by-one errors. Even while writing this memo!

So now onto trace normalisation. The easiest way to do it is a bubble sort, with an additional constraint on when we can swap things:

  1. For every adjacent pair of items x and y in the trace:
    1. If x and y commute and thread_id y < thread_id x, swap x and y.
    2. Update the DepState and continue to the next pair.
  2. Repeat until there are no more changes.

And here’s the code:

-- | Rewrite a trace into a canonical form.
normalise
  :: [(ThreadId, ThreadAction)]
  -> [(ThreadId, ThreadAction)]
normalise trc0 = if changed then normalise trc' else trc'
 where
  (changed, trc') = bubble initialDepState False trc0

  bubble ds flag ((x@(tid1, _)):(y@(tid2, _)):trc)
    | independent ds x y && tid2 < tid1 = go ds True y (x : trc)
    | otherwise = go ds flag x (y : trc)
  bubble _ flag trc = (flag, trc)

  go ds flag t@(tid, ta) trc =
    second (t :) (bubble (updateDepState ds tid ta) flag trc)
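To see the bubbling in action without the rest of Déjà Fu, here is a stripped-down version over (Int, String) pairs, where (as a deliberate simplification, standing in for the real independence check) actions commute exactly when their thread ids differ:

```haskell
import Data.Bifunctor (second)

type Tid = Int

-- Stripped-down normalisation: swap adjacent actions of different
-- threads until lower thread ids come first. Same-thread order is
-- always preserved.
normaliseSimple :: [(Tid, String)] -> [(Tid, String)]
normaliseSimple trc0 = if changed then normaliseSimple trc' else trc'
 where
  (changed, trc') = bubble False trc0

  bubble flag (x@(tid1, _) : y@(tid2, _) : trc)
    | tid1 /= tid2 && tid2 < tid1 = second (y :) (bubble True (x : trc))
    | otherwise                   = second (x :) (bubble flag (y : trc))
  bubble flag trc = (flag, trc)
```

For instance, `normaliseSimple [(1,"a"),(0,"b"),(0,"c")]` bubbles the thread-1 action past both thread-0 actions, giving `[(0,"b"),(0,"c"),(1,"a")]`.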

Testing Normalised Traces

Now we need a scheduler which can play back a given list of scheduling decisions. This isn’t built in, but we can make one. Schedulers look like this:

-- from Test.DejaFu.Schedule
newtype Scheduler state = Scheduler
  { scheduleThread
    :: Maybe (ThreadId, ThreadAction)
    -> NonEmpty (ThreadId, Lookahead)
    -> state
    -> (Maybe ThreadId, state)
  }

A scheduler is a stateful function which takes the previously scheduled action and the list of runnable threads, and gives back a thread to execute. We don’t care about those parameters: we just want to play a fixed list of scheduling decisions. Here is how we do that:

-- | Execute a concurrent program by playing a list of scheduling decisions.
play
  :: MemType
  -> [ThreadId]
  -> ConcIO a
  -> IO (Either Failure a, [ThreadId], Trace)
play = runConcurrent (Scheduler sched)
 where
  sched _ _ (t:ts) = (Just t, ts)
  sched _ _ [] = (Nothing, [])

Now all the background is in place, so we can test what we want to test: that an execution, and the play-back of its normalised trace, give the same result. For reasons which will become apparent in the next section, I’m going to parameterise over the normalisation function:

-- | Execute a concurrent program with a random scheduler, normalise its trace,
-- execute the normalised trace, and return both results.
runNorm
  :: ([(ThreadId, ThreadAction)] -> [(ThreadId, ThreadAction)])
  -> Int
  -> MemType
  -> ConcIO a
  -> IO (Either Failure a, [ThreadId], Either Failure a, [ThreadId])
runNorm norm seed memtype conc = do
  let g = mkStdGen seed                                       -- 1
  (efa1, _, trc) <- runConcurrent randomSched memtype g conc
  let                                                         -- 2
    trc' = tail
      ( scanl
        (\(t, _) (d, _, a) -> (tidOf t d, a))
        (initialThread, undefined)
        trc
      )
  let tids1 = map fst trc'
  let tids2 = map fst (norm trc')                             -- 3
  (efa2, s, _) <- play memtype tids2 conc
  let truncated = take (length tids2 - length s) tids2        -- 4
  pure (efa1, tids1, efa2, truncated)

There’s a lot going on here, so let’s break it down:

  1. We execute the program with the built-in random scheduler, using the provided seed.

  2. The trace that runConcurrent gives us is in the form [(Decision, [(ThreadId, Lookahead)], ThreadAction)], whereas we want a [(ThreadId, ThreadAction)]. This scan just changes the format. It’s a scan rather than a map because converting a Decision into a ThreadId potentially requires knowing what the previous thread was.

  3. We normalise the trace, and run it again.

  4. If the entire normalised trace wasn’t used up, then it has some unnecessary suffix (because the main thread is now terminating sooner). So we make the normalised trace easier to read by chopping off any such suffix.
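Step 2 is easier to see with a toy model. The sketch below reimplements the Decision-to-ThreadId conversion over cut-down stand-ins for dejafu’s types (the real Decision and tidOf live in the library; `toTids` is a name invented here):

```haskell
type Tid = Int

-- A cut-down analogue of dejafu's Decision type.
data Decision = Start Tid | Continue | SwitchTo Tid

-- The newly scheduled thread, given the previous thread and a decision.
tidOf :: Tid -> Decision -> Tid
tidOf _   (Start t)    = t
tidOf tid Continue     = tid
tidOf _   (SwitchTo t) = t

-- Why a scan, not a map: Continue only makes sense relative to the
-- previously running thread, so we thread that through.
toTids :: [(Decision, String)] -> [(Tid, String)]
toTids = tail . scanl (\(t, _) (d, a) -> (tidOf t d, a)) (0, "")
```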

Finally, we can write a little function to test using the normalise function:

-- | Execute a concurrent program with a random scheduler, normalise its trace,
-- execute the normalised trace, and check that both give the same result.
testNormalise
  :: (Eq a, Show a)
  => Int
  -> MemType
  -> ConcIO a
  -> IO Bool
testNormalise seed memtype conc = do
  (efa1, tids1, efa2, tids2) <- runNorm normalise seed memtype conc
  unless (efa1 == efa2) $ do
    putStrLn   "Mismatched result!"
    putStrLn $ "      expected: " ++ show efa1
    putStrLn $ "       but got: " ++ show efa2
    putStrLn   ""
    putStrLn $ "rewritten from: " ++ show tids1
    putStrLn $ "            to: " ++ show tids2
  pure (efa1 == efa2)

And does it work? Let’s copy two example programs from the Test.DejaFu docs:

-- from Test.DejaFu
example1
  :: MonadConc m
  => m String
example1 = do
  var <- newEmptyMVar
  fork (putMVar var "hello")
  fork (putMVar var "world")
  readMVar var

example2
  :: MonadConc m
  => m (Bool, Bool)
example2 = do
  r1 <- newCRef False
  r2 <- newCRef False
  x <- spawn $ writeCRef r1 True >> readCRef r2
  y <- spawn $ writeCRef r2 True >> readCRef r1
  (,) <$> readMVar x <*> readMVar y

And then test them:

> testNormalise 0 TotalStoreOrder example1
True
> testNormalise 0 TotalStoreOrder example2
True

According to my very unscientific method, everything works perfectly!

Enter Hedgehog

You can probably see where this is going: supplying just one random seed and one memory model is a poor way to test things. Ah, if only we had some sort of tool to generate arbitrary values for us!

But that’s not all: if the dependency relation is correct, then any permutation of independent actions should give the same result, not just the one which normalise implements. So before we introduce Hedgehog and arbitrary values, let’s make something a little more chaotic:

-- | Shuffle independent actions in a trace according to the given list.
shuffle
  :: [Bool]
  -> [(ThreadId, ThreadAction)]
  -> [(ThreadId, ThreadAction)]
shuffle = go initialDepState
 where
  go ds (f:fs) (t1:t2:trc)
    | independent ds t1 t2 && f = go' ds fs t2 (t1 : trc)
    | otherwise = go' ds fs t1 (t2 : trc)
  go _ _ trc = trc

  go' ds fs t@(tid, ta) trc =
    t : go (updateDepState ds tid ta) fs trc

In normalise, two independent actions are always re-ordered if doing so gets us closer to the canonical form. In shuffle, two independent actions are either re-ordered or not, depending on the supplied list of Bools.

This is much better for testing our dependency relation, as we can now generate far more re-orderings, all of which should satisfy the same property: no matter how the independent actions in a trace are shuffled, we get the same result.
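As with normalisation, the idea is clearer in a stripped-down setting. In this sketch actions again commute exactly when their thread ids differ (a simplification standing in for the real independence check), and each Bool consumes one swap opportunity:

```haskell
type Tid = Int

-- Stripped-down shuffle: at each adjacent pair of different-thread
-- actions, the next Bool decides whether or not to swap them.
shuffleSimple :: [Bool] -> [(Tid, String)] -> [(Tid, String)]
shuffleSimple (f:fs) (t1@(tid1, _) : t2@(tid2, _) : trc)
  | tid1 /= tid2 && f = t2 : shuffleSimple fs (t1 : trc)
  | otherwise         = t1 : shuffleSimple fs (t2 : trc)
shuffleSimple _ trc = trc
```

So `shuffleSimple [True,False] [(0,"a"),(1,"b"),(1,"c")]` swaps the first pair but not the second, giving `[(1,"b"),(0,"a"),(1,"c")]`, while an all-False list leaves the trace alone.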

I think it’s about time to bring out Hedgehog:

-- | Execute a concurrent program with a random scheduler, arbitrarily permute
-- the independent actions in the trace, and check that we get the same result
-- out.
hog :: (Eq a, Show a) => ConcIO a -> IO Bool
hog conc = Hedgehog.check . property $ do
  mem <- forAll Gen.enumBounded                               -- 1
  seed <- forAll $ Gen.int (Range.linear 0 100)
  fs <- forAll $ Gen.list (Range.linear 0 100) Gen.bool

  (efa1, tids1, efa2, tids2) <- liftIO                        -- 2
    $ runNorm (shuffle fs) seed mem conc
  footnote ("            to: " ++ show tids2)                 -- 3
  footnote ("rewritten from: " ++ show tids1)
  efa1 === efa2

Let’s break that down:

  1. We’re telling Hedgehog that this property should hold for all memory models, all seeds, and all lists of Bools. Unlike most Haskell property-testing libraries, Hedgehog takes generator functions rather than using a typeclass. I think this is nicer.

  2. We run our program, normalise its trace, and get all the results just as before.

  3. We add some footnotes: messages which Hedgehog will display along with a failure. For some reason these get displayed in reverse order.

Alright, let’s see if Hedgehog finds any bugs for us:

> hog example1
  ✗ <interactive> failed after 3 tests and 1 shrink.

       ┏━━ extra.hs ━━┓
    82 ┃ hog :: (Eq a, Show a) => ConcIO a -> IO Bool
    83 ┃ hog conc = Hedgehog.check . property $ do
    84 ┃   mem <- forAll Gen.enumBounded
       ┃   │ SequentialConsistency
    85 ┃   seed <- forAll $ Gen.int (Range.linear 0 100)
       ┃   │ 0
    86 ┃   fs <- forAll $ Gen.list (Range.linear 0 100) Gen.bool
       ┃   │ [ False , True ]
    87 ┃
    88 ┃   (efa1, tids1, efa2, tids2) <- liftIO
    89 ┃     $ runNorm (shuffle fs) seed mem conc
    90 ┃   footnote ("            to: " ++ show tids2)
    91 ┃   footnote ("rewritten from: " ++ show tids1)
    92 ┃   efa1 === efa2
       ┃   ^^^^^^^^^^^^^
       ┃   │ Failed (- lhs =/= + rhs)
       ┃   │ - Right "hello"
       ┃   │ + Left InternalError

    rewritten from: [main,main,1,main,1,2,main,2,main]
                to: [main,1]

    This failure can be reproduced by running:
    > recheck (Size 2) (Seed 1824012233418733250 (-4876494268681827407)) <property>

False

It did! And look at that output! Magical! I must see if I can get Déjà Fu to give annotated source output like that.

Let’s look at example1 again:

do
  var <- newEmptyMVar
  fork (putMVar var "hello")
  fork (putMVar var "world")
  read­MVar var

Oh dear, our rewritten trace is trying to execute thread 1 immediately after the first action of the main thread. The first action of the main thread is newEmptyMVar: thread 1 doesn’t exist at that point!

Let’s change our independent function to say that an action is dependent with the fork which creates its thread:

independent ds (tid1, ta1) (tid2, ta2)
  | ta1 == Fork tid2 = False
  | ta2 == Fork tid1 = False
  | otherwise = not (dependent ds tid1 ta1 tid2 ta2)

How about now?

> hog example1
  ✗ <interactive> failed after 13 tests and 2 shrinks.

       ┏━━ extra.hs ━━┓
    82 ┃ hog :: (Eq a, Show a) => ConcIO a -> IO Bool
    83 ┃ hog conc = Hedgehog.check . property $ do
    84 ┃   mem <- forAll Gen.enumBounded
       ┃   │ SequentialConsistency
    85 ┃   seed <- forAll $ Gen.int (Range.linear 0 100)
       ┃   │ 0
    86 ┃   fs <- forAll $ Gen.list (Range.linear 0 100) Gen.bool
       ┃   │ [ True , True ]
    87 ┃
    88 ┃   (efa1, tids1, efa2, tids2) <- liftIO
    89 ┃     $ runNorm (shuffle fs) seed mem conc
    90 ┃   footnote ("            to: " ++ show tids2)
    91 ┃   footnote ("rewritten from: " ++ show tids1)
    92 ┃   efa1 === efa2
       ┃   ^^^^^^^^^^^^^
       ┃   │ Failed (- lhs =/= + rhs)
       ┃   │ - Right "hello"
       ┃   │ + Left InternalError

    rewritten from: [main,main,1,main,1,2,main,2,main]
                to: [main,1]

    This failure can be reproduced by running:
    > recheck (Size 12) (Seed 654387260079025817 (-6686572164463137223)) <property>

False

Well, that failing trace looks exactly like the previous error. But the parameters are different: the first error happened with the list [False, True]; this one requires [True, True]. So let’s think about what happens to the trace in this case.

  1. We start with: [(main, NewEmptyMVar 0), (main, Fork 1), (1, PutMVar 0)].

  2. The first two actions are independent, and the flag is True, so we swap them. We now have: [(main, Fork 1), (main, NewEmptyMVar 0), (1, PutMVar 0)].

  3. The second two actions are independent, and the flag is True, so we swap them. We now have: [(main, Fork 1), (1, PutMVar 0), (main, NewEmptyMVar 0)].

We can’t actually re-order actions of the same thread, so we should never have swapped the first two. I suppose there’s another problem here, that no action on an MVar commutes with creating that MVar, but we should never be in a situation where that could happen. So we need another case in independent:

independent ds (tid1, ta1) (tid2, ta2)
  | tid1 == tid2 = False
  | ta1 == Fork tid2 = False
  | ta2 == Fork tid1 = False
  | otherwise = not (dependent ds tid1 ta1 tid2 ta2)

Our first example program works fine now:

> hog example1
  ✓ <interactive> passed 100 tests.
True

The second is a little less happy:

> hog example2
  ✗ <interactive> failed after 48 tests and 9 shrinks.

       ┏━━ extra.hs ━━┓
    82 ┃ hog :: (Eq a, Show a) => ConcIO a -> IO Bool
    83 ┃ hog conc = Hedgehog.check . property $ do
    84 ┃   mem <- forAll Gen.enumBounded
       ┃   │ TotalStoreOrder
    85 ┃   seed <- forAll $ Gen.int (Range.linear 0 100)
       ┃   │ 0
    86 ┃   fs <- forAll $ Gen.list (Range.linear 0 100) Gen.bool
       ┃   │ [ False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , False
       ┃   │ , True
       ┃   │ ]
    87 ┃
    88 ┃   (efa1, tids1, efa2, tids2) <- liftIO
    89 ┃     $ runNorm (shuffle fs) seed mem conc
    90 ┃   footnote ("            to: " ++ show tids2)
    91 ┃   footnote ("rewritten from: " ++ show tids1)
    92 ┃   efa1 === efa2
       ┃   ^^^^^^^^^^^^^
       ┃   │ Failed (- lhs =/= + rhs)
       ┃   │ - Right ( False , True )
       ┃   │ + Left InternalError

    rewritten from: [main,main,main,main,main,1,-1,1,1,main,1,main,main,main,main,2,-1,2,2,main,main]
                to: [main,main,main,main,main,1,-1,1,1,main,1,main,main,main,2,main,-1]

    This failure can be reproduced by running:
    > recheck (Size 47) (Seed 2159662051602767058 (-7857629802164753123)) <property>

False

This is a little trickier. Here’s my diagnosis:

  1. It’s an InternalError again, which means we’re trying to execute a thread which isn’t runnable.

  2. The memory model is TotalStoreOrder, and the thread we’re trying to execute is thread -1, a “fake” thread used in the relaxed memory implementation. So this is a relaxed memory bug.

  3. The traces only differ in one place: where main, 2, -1 is changed to 2, main, -1. So the issue is caused by re-ordering main and thread 2.

  4. If the main action is a memory barrier, then thread -1 will not exist after it.

  5. So the main action is probably a memory barrier.

Let’s push along those lines and add a case for memory barriers to independent:

independent ds (tid1, ta1) (tid2, ta2)
  | tid1 == tid2 = False
  | ta1 == Fork tid2 = False
  | ta2 == Fork tid1 = False
  | otherwise = case (simplifyAction ta1, simplifyAction ta2) of
      (UnsynchronisedWrite _, a) | isBarrier a -> False
      (a, UnsynchronisedWrite _) | isBarrier a -> False
      _ -> not (dependent ds tid1 ta1 tid2 ta2)

Did we get it?

> hog example2
  ✓ <interactive> passed 100 tests.
True

Great!

Bugs?

So, we explored the dependency relation with Hedgehog, and found three missing cases:

  1. Two actions of the same thread are dependent.

  2. Any action of a thread is dependent with the fork which creates that thread.

  3. Unsynchronised writes are dependent with memory barriers.

But are these bugs? I’m not so sure:

  1. The dependency relation is only ever used to compare actions of different threads.

  2. This is technically correct, but it’s not interesting or useful.

  3. This could be a bug. The relaxed memory implementation is pretty hairy and I’ve had a lot of problems with it in the past. Honestly, I just need to rewrite it (or campaign for Haskell to become sequentially consistent [2] and rip it out).

But even if they aren’t bugs, these cases are definitely confusing. The dependency relation is currently just an internal thing, not exposed to users. However, I’m planning to expose a function to normalise traces, in which case providing an independent function is entirely reasonable.

So even if these changes don’t make it into dependent, they will be handled by independent.

Next steps: I’m going to get this into the test suite, to get a large number of extra example programs for free. My hacky and cobbled-together testing framework in dejafu-tests is capable of running every test case with a variety of different schedulers, so I just need to add another way it runs everything. I won’t need to touch the actual tests, just the layer of glue which runs them all, which is nice.

The only problem is that this glue is currently based on HUnit and test-framework, whereas the only integration I can find for Hedgehog is tasty-hedgehog, so I might need to switch to tasty first. As usual, the hardest part is getting different libraries to co-operate!

Hopefully I’ll find some bugs! Well, not exactly hopefully, but you know what I mean.


  1. For all the gory details, see:

    • Dynamic Partial Order Reduction for Relaxed Memory Models, N. Zhang, M. Kusano, and C. Wang (2015)

    • Bounded Partial-order Reduction, K. Coons, M. Musuvathi, and K. McKinley (2013)

    • Refining Dependencies Improves Partial-order Verification Methods (extended abstract), P. Godefroid and D. Pirottin (1993)

  2. SC-Haskell: Sequential Consistency in Languages That Minimize Mutable Shared Heap, M. Vollmer, R. G. Scott, M. Musuvathi, and R. R. Newton (2017)