Why does Jepsen Work? - Daniel Tisdall

Why does the highly regarded testing tool Jepsen, which randomly partitions networks in distributed systems, find so many bugs?

The 2018 paper 'Why Is Random Testing Effective for Partition Tolerance Bugs?' by Rupak Majumdar, Filip Niksic shows that several notions of high test coverage can be achieved with a number of tests logarithmic in the number of cases to be tested. Moreover, the tests are chosen randomly. We explain the most general construction here. The paper contains more.

A lower bound

Consider a finite set $M$ and a set $T$ whose elements are subsets of $M$ . We want to find a subset $\mathcal{F}$ of $T$ which is small and where for each $x \in M :\exists t \in F$ st $x \in t$ . That is, we want to find a small set of tests from $T$ which cover all the cases in $M$ .

If we know a positive lower bound probability $p$ over all fixed elements $x \in M$ , for which a random $t \in T$ contains $x$ , then we can express the probability that a candidate $\mathcal{F}\_{candidate}$ satisfies the properties of $\mathcal{F}$ in terms of $\vert \mathcal{F}_{candidate} \vert$ and a factor logarithmic in $\vert M \vert$ .

To show this we need to remind ourselves of high school logarithms, and some straightforward probability. We'll say a set $S$ covers $x$ if $x \in S$ , and we'll abuse this language.

Proof:

Fix a subset $\mathcal{F} \subseteq T$ chosen uniformly randomly. We know that a an element $t \in \mathcal{F}$ covers any fixed element $x \in M$ with probability at least $p$ . The probability that $x$ is not covered is therefore $\leq 1-p$ . When taken over all elements of $\mathcal{F}$ we have $\mathcal{P}(\mathcal{F} \text{ does not cover } x) \leq (1-p)^{\vert \mathcal{F} \vert}$ .

The probability of $\mathcal{F}$ not covering an element of $M$ cannot be greater than the sum of probabilities of missing a single given element (union bound). So we have

\mathcal{P}(\mathcal{F} \text{ does not cover } M) \leq \vert M \vert (1-p)^{\vert \mathcal{F} \vert}

We want to bound this

\vert M \vert (1-p)^{\vert \mathcal{F} \vert} < \epsilon

in terms of $\vert \mathcal{F} \vert$ .

We have

(1-p)^{\vert \mathcal{F} \vert} < \frac{\epsilon}{\vert M \vert}

and we can take logarithm of both sides base $1-p$ , which is a decreasing function so we flip the inequality.

\log_{(1-p)}(\frac{\epsilon}{\vert M \vert}) < \vert \mathcal{F} \vert

Using the fact that for any basis $b$ and number $a$ we have $e^{\log(b)\log_{b}(a)} = e^{log(a)}$ we can get back an expression using the natural logarithm

\frac{\log(\epsilon) - \log(\vert M \vert)}{\log(1-p)} < \vert \mathcal{F} \vert

If $p$ is sufficiently small then $- \log(1-p) < p$ and we have the convenient

\frac{\log(\vert M \vert) - \log(\epsilon)}{p} < \vert \mathcal{F} \vert

We obtain a relation between our desired confidence $\epsilon$ and $\vert \mathcal{F} \vert$ , and this scales with $\vert M \vert$ .

Link to Jepsen

One way Jepsen can find bugs is by testing cases where a particular subset of network nodes of size $k$ is distributed evenly in a $k$ -partition of the whole network. In this case $M$ contains all subsets of nodes with size $k$ , and $T$ contains all $k$ -partitions. An element of $t \in T$ maps to an element of $x \in M$ if all nodes in $x$ are in a different block (partition element) of $t$ .

Takeaway

The paper says that it's difficult to derive bounds for more 'dynamic' notions of testing and coverage, those involving timing and so on. For now, we probably want to follows Ashish Darbari's advice and keep striving for better correctness than what we get from just leaving everything 'in the hands of random seeds'.

A lower bound

Link to Jepsen

Takeaway

Links