← Back to blog

We're symbolic: finding the EPaxos bug with Specl BFS and symbolic modes

speclformal-verificationmodel-checkingdistributed-systemsepaxosconsensus

After I announced Specl, Pierre Sutra — pointed me to his 2019 paper On the correctness of Egalitarian Paxos:

About EPaxos, you might be interested with the following result. Basically, the TLA spec is wrong as well as the Go implementation.

He was right — it's a great case study for Specl. Here's how a single ~150-line spec can both find the bug and verify the fix.

What is EPaxos?

Egalitarian Paxos (Moraru et al., 2013) is a leaderless consensus protocol for state machine replication. Unlike Multi-Paxos, which funnels all commands through a single leader, EPaxos lets any replica propose commands. When commands don't conflict, they commit in a single round trip — the "fast path". When they do conflict, the protocol falls back to a slower two-round path to agree on a dependency ordering.

For a single command instance with 3 replicas, it works like this:

  1. Pre-Accept: The command leader (say, replica 0) proposes its command along with a dependency set. It sends this to a fast quorum (2 out of 3 replicas). Each member computes its own dependencies from local conflict knowledge — it may disagree with the leader.

  2. Fast Commit: If every fast quorum member agrees on the same dependencies at ballot 0, the command commits immediately. One round trip.

  3. Slow Path: If they disagree, the leader picks a merged dependency set, runs a classic Paxos Accept phase with a majority quorum, and commits. Two round trips.

  4. Recovery: If the command leader crashes mid-protocol, another replica takes over. It contacts a quorum, collects their state for this instance, and decides what value to adopt. The tricky part: recovery must detect whether the fast path might have already committed, and if so, adopt that value conservatively.

The bug

Sutra found that the original EPaxos protocol uses a single ballot number per replica per instance. This ballot serves two purposes: tracking which ballot a replica has promised to participate in, and tracking which ballot a value was accepted at.

The problem: when a replica processes a Prepare message (the first phase of recovery), its ballot number gets inflated — but its accepted value and acceptance status don't change. A replica that accepted value X at ballot 1, and later processed a Prepare for ballot 3, now reports ballot 3 — even though its acceptance happened at ballot 1.

When a subsequent recovery sees two accepted values and compares ballot numbers to decide which is more recent, the inflated ballot makes a stale acceptance look like the most recent one. Recovery picks the wrong value. Two different recoveries can commit different values. Agreement is broken.

The fix, described in Sutra's paper: split the single ballot into two variables. bal tracks the highest ballot a replica has joined (inflated by Prepares). vbal tracks the ballot at which the replica last accepted a value (only updated by actual Accept operations). Recovery compares vbal instead of bal. Ballot inflation becomes harmless.

Why wasn't this caught earlier?

The original TLA+ spec is 715 lines and does model recovery — it has SendPrepare, ReplyPrepare, and PrepareFinalize actions with full message passing. The bug is right there on line 476: recovery compares prev_ballot, which is derived from rec.ballot — the inflated ballot — rather than the ballot at which the value was actually accepted.

But the repo ships no TLC configuration file. I couldn't find evidence the spec was ever model-checked. Even if it had been, the state space with explicit message sets, 3 replicas, 2 commands, and multiple ballot values is likely intractable for exhaustive search — the message set grows combinatorially.

Sutra found the bug through logical reasoning. His counterexample is a hand-constructed 24-step trace that forces TLC through a specific execution path to demonstrate the violation. He didn't use TLC to find the bug — he used it to verify his manually constructed and figured-out counterexample.

This is exactly the kind of situation where a minimal, focused model can pay off. The original TLA+ spec is 715 lines of message-passing machinery. Our Specl model is ~150 lines with atomic actions and can be explored with BFS or symbolic model checking approaches. That being said, it must be noted our model was developed with the benefit of hindsight, and while it can find this bug, it doesn't prove the corrected version is actually correct.

The Specl model

The spec below models one EPaxos consensus instance with 3 replicas. Replica 0 is the command leader. The fast quorum is {0, 1}. A classic quorum is any 2 of 3.

The value being agreed on represents the dependency set: 0 = no dependencies, 1 = has dependencies. The UseFix constant toggles between the buggy and fixed recovery logic.

module EPaxos

const MaxBal: Int
const NB: Int // must equal MaxBal + 1
const UseFix: 0..1

// 3 replicas: 0, 1, 2
// Status: 0=none, 1=preaccepted, 2=accepted, 3=committed

var status: Dict[0..2, 0..3]
var bal: Dict[0..2, 0..MaxBal]   // highest ballot seen (inflated by prepares)
var vbal: Dict[0..2, 0..MaxBal]  // ballot of last acceptance
var val: Dict[0..2, 0..1]
var next_bal: 1..NB
var initial_val: 0..1 // command leader's original proposal

init {
    status = {r: 0 for r in 0..2}
    and bal = {r: 0 for r in 0..2}
    and vbal = {r: 0 for r in 0..2}
    and val = {r: 0 for r in 0..2}
    and next_bal = 1
    and initial_val = 0
}

// Command leader (r0) initiates pre-accept with value v.
action LeaderPreAccept(v: 0..1) {
    require status[0] == 0
    require bal[0] == 0
    status = status | {0: 1}
    and val = val | {0: v}
    and initial_val = v
}

// Fast quorum member (r1) responds to pre-accept.
// May reply with different deps due to local conflicts.
action FastQuorumPreAccept(v: 0..1) {
    require status[0] >= 1
    require status[1] == 0
    require bal[1] == 0
    status = status | {1: 1}
    and val = val | {1: v}
}

// Fast-path commit: fast quorum {0, 1} agrees at ballot 0.
action FastCommit(r: 0..2) {
    require r <= 1
    require status[r] >= 1
    require status[r] != 3
    require bal[r] == 0
    require status[0] >= 1 and status[1] >= 1
    require val[0] == val[1]
    require bal[0] == 0 and bal[1] == 0
    status = status | {r: 3}
}

// Process a Prepare from some recovery attempt.
// Inflates bal without changing vbal, status, or value.
action ReceivePrepare(r: 0..2) {
    require next_bal <= MaxBal
    require next_bal > bal[r]
    require status[r] != 3
    bal = bal | {r: next_bal}
    and next_bal = next_bal + 1
}

// Recovery: leader collects state from quorum {leader, other} and decides.
// BUG/FIX: "both accepted" comparison uses bal (UseFix=0) or vbal (UseFix=1).
action Recovery(leader: 0..2, other: 0..2, pick: 0..1) {
    require leader != other
    require next_bal <= MaxBal
    require next_bal > bal[leader]
    require next_bal > bal[other]
    require status[leader] != 3
    require status[leader] >= 1 or status[other] >= 1

    val = val | {leader:
        if status[other] == 3 then val[other]
        else if status[leader] == 2 and status[other] == 2 then
            if UseFix == 1 then
                if vbal[leader] > vbal[other] then val[leader]
                else if vbal[other] > vbal[leader] then val[other]
                else val[leader]
            else
                if bal[leader] > bal[other] then val[leader]
                else if bal[other] > bal[leader] then val[other]
                else val[leader]
        else if status[other] == 2 then val[other]
        else if status[leader] == 2 then val[leader]
        else if (status[leader] == 1 and val[leader] == initial_val)
                or (status[other] == 1 and val[other] == initial_val)
            then initial_val
        else if status[leader] == 1 and status[other] == 1 then
            if val[leader] == val[other] then val[leader] else pick
        else if status[leader] == 1 then val[leader]
        else if status[other] == 1 then val[other]
        else pick
    }
    and status = status | {leader: 2}
    and bal = bal | {leader: next_bal, other: next_bal}
    and vbal = vbal | {leader: next_bal}
    and next_bal = next_bal + 1
}

// Phase 2: propagate accepted value to another replica.
action Accept(src: 0..2, dst: 0..2) {
    require src != dst
    require status[src] == 2
    require status[dst] != 3
    require vbal[src] == bal[src]
    require bal[src] >= bal[dst]
    status = status | {dst: 2}
    and bal = bal | {dst: bal[src]}
    and vbal = vbal | {dst: vbal[src]}
    and val = val | {dst: val[src]}
}

// Slow-path commit: quorum of 2 accepted replicas agree.
action SlowCommit(r: 0..2) {
    require status[r] == 2
    require vbal[r] == bal[r]
    require any r2 in 0..2:
        r2 != r and status[r2] >= 2 and vbal[r2] == vbal[r] and val[r2] == val[r]
    status = status | {r: 3}
}

invariant Agreement {
    all r1 in 0..2: all r2 in 0..2:
        (status[r1] == 3 and status[r2] == 3) implies val[r1] == val[r2]
}

Checking it

Install Specl

git clone https://github.com/danwt/specl.git
cd specl/specl
cargo install --path crates/specl-cli

BFS vs symbolic

Specl supports two checking modes. Both find the bug in this spec.

BFS exhaustively enumerates every reachable state. It guarantees the shortest counterexample and produces concrete traces.

$ specl check examples/showcase/epaxos.specl -c MaxBal=8 -c NB=9 -c UseFix=0 --max-depth 20 --no-deadlock --no-auto

...

Result: INVARIANT VIOLATION
  Invariant: Agreement
  Trace (11 steps):
    0: init -> ...
    1: LeaderPreAccept -> ...
    2: FastQuorumPreAccept -> ...
    3: Recovery -> ...
    4: Recovery -> ...
    5: Recovery -> ...
    6: Accept -> ...
    7: Recovery -> ...
    8: SlowCommit -> ...
    9: Accept -> ...
    10: SlowCommit -> ...

Symbolic (via Z3) encodes the spec as SMT constraints. Use --smart to enable it — this tries induction, k-induction, IC3/PDR, and falls back to bounded model checking. Symbolic mode handles unbounded types (Int, Nat) that BFS can't explore, and can verify correctness without enumerating states:

$ specl check examples/showcase/epaxos.specl -c MaxBal=8 -c NB=9 -c UseFix=0 --max-depth=20

Result: INVARIANT VIOLATION

INFO BMC complete, no violations found depth=8
INFO starting symbolic BMC depth=16
INFO invariant violation found invariant="Agreement" depth=10
INFO trace reconstructed via BMC depth=16
...

Applying Pierre's fix

# bfs
specl check examples/showcase/epaxos.specl -c MaxBal=8 -c NB=9 -c UseFix=1 --max-depth 20 --no-deadlock --no-auto

...

Result: OK
  States explored: 586.5K
  Max depth: 15
  Time: 0.9s
  Throughput: 665.4K states/s

# symbolic
specl check examples/showcase/epaxos.specl -c MaxBal=8 -c NB=9 -c UseFix=1 --depth 20
...
Result: OK
  Method: smart(IC3)
  Depth: 20
  Time: 75.65s

Thanks to Pierre Sutra for the pointer, the excellent paper, and the fix. The spec is at examples/showcase/epaxos.specl in the Specl repo.