
Why does Jepsen Work?


Why does the highly regarded testing tool Jepsen, which randomly partitions networks in distributed systems, find so many bugs?

The 2018 paper 'Why Is Random Testing Effective for Partition Tolerance Bugs?' by Rupak Majumdar and Filip Niksic shows that several notions of high test coverage can be achieved with a number of randomly chosen tests that is only logarithmic in the number of cases to be covered. We explain the most general construction here; the paper contains more.

A lower bound

Consider a finite set $M$ and a set $T$ whose elements are subsets of $M$. We want to find a small subset $\mathcal{F}$ of $T$ such that for each $x \in M$ there exists $t \in \mathcal{F}$ with $x \in t$. That is, we want to find a small set of tests from $T$ which together cover all the cases in $M$.

Suppose we know a positive lower bound $p$ such that, for every fixed element $x \in M$, a uniformly random $t \in T$ contains $x$ with probability at least $p$. Then we can bound the probability that a candidate $\mathcal{F}_{candidate}$ fails the properties required of $\mathcal{F}$ in terms of $\vert \mathcal{F}_{candidate} \vert$ and a factor logarithmic in $\vert M \vert$.

To show this we need only high school logarithms and some straightforward probability. We'll say a set $S$ covers $x$ if $x \in S$, and abuse this language slightly: $\mathcal{F}$ covers $x$ if some $t \in \mathcal{F}$ does, and $\mathcal{F}$ covers $M$ if it covers every $x \in M$.
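These definitions translate directly into code. A minimal sketch with a toy $M$ and $T$ (the particular sets are made up for illustration):

```python
# Toy universe of cases and tests; each test is a subset of M.
M = {0, 1, 2, 3}
T = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]

def covers(s, x):
    """A set s covers an element x if x is in s."""
    return x in s

def family_covers(F, M):
    """A family F covers M if every x in M is covered by some t in F."""
    return all(any(covers(t, x) for t in F) for x in M)

print(family_covers([{0, 1}, {2, 3}], M))  # → True
print(family_covers([{0, 1}, {1, 2}], M))  # → False: 3 is never covered
```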

Proof:

Fix a subset $\mathcal{F} \subseteq T$ whose elements are chosen uniformly at random. Each element $t \in \mathcal{F}$ covers any fixed element $x \in M$ with probability at least $p$, so the probability that a single $t$ misses $x$ is at most $1-p$. Since the elements of $\mathcal{F}$ are drawn independently, $\mathcal{P}(\mathcal{F} \text{ does not cover } x) \leq (1-p)^{\vert \mathcal{F} \vert}$.
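A quick Monte Carlo sanity check of this step, under the simplest model where each of the $\vert \mathcal{F} \vert$ tests independently covers $x$ with probability exactly $p$ (the values $p = 0.5$ and $\vert \mathcal{F} \vert = 8$ are arbitrary choices for illustration):

```python
import random

random.seed(0)
p, size_F, trials = 0.5, 8, 100_000

# Count trials in which all |F| independent random tests miss the fixed x.
misses = sum(
    all(random.random() >= p for _ in range(size_F))
    for _ in range(trials)
)
print(f"empirical miss rate: {misses / trials:.4f}")
print(f"(1 - p)^|F| bound:   {(1 - p) ** size_F:.4f}")
```

The empirical miss rate comes out close to $(1-p)^{\vert \mathcal{F} \vert} = 0.5^8 \approx 0.0039$.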

The probability that $\mathcal{F}$ fails to cover some element of $M$ cannot be greater than the sum of the probabilities of missing each single element (the union bound). So we have

$$\mathcal{P}(\mathcal{F} \text{ does not cover } M) \leq \vert M \vert (1-p)^{\vert \mathcal{F} \vert}$$

We want to bound this by a chosen failure probability $\epsilon$,

$$\vert M \vert (1-p)^{\vert \mathcal{F} \vert} < \epsilon$$

in terms of $\vert \mathcal{F} \vert$.

We have

$$(1-p)^{\vert \mathcal{F} \vert} < \frac{\epsilon}{\vert M \vert}$$

and we can take the logarithm of both sides in base $1-p$. Since $0 < 1-p < 1$, this logarithm is a decreasing function, so the inequality flips:

$$\log_{1-p}\left(\frac{\epsilon}{\vert M \vert}\right) < \vert \mathcal{F} \vert$$

Using the change-of-base identity $\log_b(a) = \frac{\log(a)}{\log(b)}$, which holds for any base $b$ and positive $a$, we can get back an expression using the natural logarithm:

$$\frac{\log(\epsilon) - \log(\vert M \vert)}{\log(1-p)} < \vert \mathcal{F} \vert$$

Multiplying the numerator and denominator by $-1$, this is $\frac{\log(\vert M \vert) - \log(\epsilon)}{-\log(1-p)} < \vert \mathcal{F} \vert$. Since $\log(1-p) \leq -p$ for all $p \in (0,1)$, we have $-\log(1-p) \geq p$, so it suffices to satisfy the convenient

$$\frac{\log(\vert M \vert) - \log(\epsilon)}{p} < \vert \mathcal{F} \vert$$

We obtain a relation between our desired failure probability $\epsilon$ and $\vert \mathcal{F} \vert$, and it grows only logarithmically with $\vert M \vert$.

Link to Jepsen

One way Jepsen can find bugs is by testing cases where a particular subset of network nodes of size $k$ is split evenly across a $k$-partition of the whole network. Here $M$ contains all subsets of nodes of size $k$, and $T$ contains all $k$-partitions. A test $t \in T$ covers an element $x \in M$ if each node in $x$ lies in a different block (partition element) of $t$.
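To make this concrete, suppose (as a simplifying assumption, not necessarily the paper's exact distribution) that a random $k$-partition assigns each node independently and uniformly to one of $k$ blocks. Then a fixed $k$-subset is fully separated with probability $p = k!/k^k$, there are $\vert M \vert = \binom{n}{k}$ subsets to cover, and the bound above gives:

```python
import math

def jepsen_tests_needed(n, k, eps):
    """Random k-partitions needed so that, with probability at least
    1 - eps, every k-subset of n nodes is fully separated at least once.
    Assumes each node independently picks one of k blocks uniformly."""
    p = math.factorial(k) / k**k   # P(fixed k-subset lands in k distinct blocks)
    M_size = math.comb(n, k)       # number of k-subsets to cover
    return math.floor((math.log(M_size) - math.log(eps)) / p) + 1

# 20 nodes, 3-way partitions, 1% failure probability:
print(jepsen_tests_needed(20, 3, 0.01))  # → 53
```

All 1140 three-node subsets of a 20-node cluster are covered, with 99% probability, by a few dozen random partitions.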

Takeaway

The paper notes that it is difficult to derive bounds for more 'dynamic' notions of testing and coverage, such as those involving timing. For now, we probably want to follow Ashish Darbari's advice and keep striving for better correctness than what we get from leaving everything 'in the hands of random seeds'.

Links