Why does Jepsen Work?
Why does the highly regarded testing tool Jepsen, which randomly partitions networks in distributed systems, find so many bugs?
The 2018 paper 'Why Is Random Testing Effective for Partition Tolerance Bugs?' by Rupak Majumdar, Filip Niksic shows that several notions of high test coverage can be achieved with a number of tests logarithmic in the number of cases to be tested. Moreover, the tests are chosen randomly. We explain the most general construction here. The paper contains more.
A lower bound
Consider a finite set and a set whose elements are subsets of . We want to find a subset of which is small and where for each st . That is, we want to find a small set of tests from which cover all the cases in .
If we know a positive lower bound probability over all fixed elements , for which a random contains , then we can express the probability that a candidate satisfies the properties of in terms of and a factor logarithmic in .
To show this we need to remind ourselves of high school logarithms, and some straightforward probability. We'll say a set covers if , and we'll abuse this language.
Proof:
Fix a subset chosen uniformly randomly. We know that a an element covers any fixed element with probability at least . The probability that is not covered is therefore . When taken over all elements of we have .
The probability of not covering an element of cannot be greater than the sum of probabilities of missing a single given element (union bound). So we have
We want to bound this
in terms of .
We have
and we can take logarithm of both sides base , which is a decreasing function so we flip the inequality.
Using the fact that for any basis and number we have we can get back an expression using the natural logarithm
If is sufficiently small then and we have the convenient
We obtain a relation between our desired confidence and , and this scales with .
Link to Jepsen
One way Jepsen can find bugs is by testing cases where a particular subset of network nodes of size is distributed evenly in a -partition of the whole network. In this case contains all subsets of nodes with size , and contains all -partitions. An element of maps to an element of if all nodes in are in a different block (partition element) of .
Takeaway
The paper says that it's difficult to derive bounds for more 'dynamic' notions of testing and coverage, those involving timing and so on. For now, we probably want to follows Ashish Darbari's advice and keep striving for better correctness than what we get from just leaving everything 'in the hands of random seeds'.