Bayesian solution to the Lincoln index problem
Last year my occasional correspondent John D. Cook wrote an excellent blog post about the Lincoln index, which is a way to estimate the number of errors in a document (or program) by comparing results from two independent testers. Here's his presentation of the problem:
Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.
Then he presents the Lincoln index, an estimator "described by Frederick Charles Lincoln in 1930," where Wikipedia's use of "described" is a hint that the index is another example of Stigler's law of eponymy.
Suppose two testers independently search for bugs. Let k1 be the number of errors the first tester finds and k2 the number of errors the second tester finds. Let c be the number of errors both testers find. The Lincoln Index estimates the total number of errors as k1 k2 / c [I changed his notation to be consistent with mine].
So if the first tester finds 20 bugs, the second finds 15, and they find 3 in common, we estimate that there are about 100 bugs.
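To make the arithmetic concrete, here is that estimate as a line of Python (just the formula above, not part of the solution that follows):

k1, k2, c = 20, 15, 3
print(k1 * k2 / c)    # Lincoln index: 100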
Of course, whenever I see something like this, the idea that pops into my head is that there must be a (better) Bayesian solution! And there is. You can read and download my solution here.
I represent the data using the tuple (k1, k2, c), where k1 is the number of bugs found by the first tester, k2 is the number of bugs found by the second, and c is the number they find in common.
The hypotheses are a set of tuples (n, p1, p2), where n is the actual number of errors, p1 is the probability that the first tester finds any given bug, and p2 is the probability for the second tester.
Now all I need is a likelihood function:
class Lincoln(thinkbayes.Suite, thinkbayes.Joint):

    def Likelihood(self, data, hypo):
        """Computes the likelihood of the data under the hypothesis.

        hypo: n, p1, p2
        data: k1, k2, c
        """
        n, p1, p2 = hypo
        k1, k2, c = data

        # Probability that the first tester finds exactly k1 of the n bugs.
        part1 = choose(n, k1) * binom(k1, n, p1)

        # Probability that the second tester finds c of the k1 bugs the first
        # tester found, and k2-c of the remaining n-k1 bugs.
        part2 = choose(k1, c) * choose(n-k1, k2-c) * binom(k2, n, p2)

        return part1 * part2
Where choose evaluates the binomial coefficient, n! / (k! (n-k)!), and binom evaluates the rest of the binomial pmf:
def binom(k, n, p):
    return p**k * (1-p)**(n-k)
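If you want to run these snippets on their own, you also need a choose function. Here is one way to supply it, assuming Python 3.8 or later; the downloadable solution may define it differently:

# Stand-in for choose, assuming Python 3.8+ (math.comb).
from math import comb as choose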
And that's pretty much it. Here's the code that builds and updates the suite of hypotheses:
data = 20, 15, 3
probs = numpy.linspace(0, 1, 101)
hypos = []
for n in range(32, 350):
    for p1 in probs:
        for p2 in probs:
            hypos.append((n, p1, p2))
suite = Lincoln(hypos)
suite.Update(data)
The suite contains the joint posterior distribution for (n, p1, p2), but p1 and p2 are nuisance parameters; we only care about the marginal distribution of n. Lincoln inherits Marginal from Joint, so we can get the marginal distribution like this:
n_marginal = suite.Marginal(0)
Where 0 is the index of n in the tuple. And here's what the distribution looks like:
The lower bound is 32, which is the total number of bugs found by the two testers. I set the upper bound at 350, which chops off a little bit of the tail.
The maximum likelihood estimate in this distribution is 72; the mean is 106. So those are consistent with the Lincoln index, which is 100. But as usual, it is more useful to have the whole posterior distribution, not just a point estimate. For example, this posterior distribution can be used as part of a risk-benefit analysis to guide decisions about how much effort to allocate to finding additional bugs.
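For reference, here is roughly how those summaries can be read off the marginal; this is a sketch that assumes n_marginal is a thinkbayes.Pmf, which provides Mean() and Items():

# Sketch: point estimates from the marginal posterior of n.
mean_n = n_marginal.Mean()
mode_n, _ = max(n_marginal.Items(), key=lambda item: item[1])
print(mode_n, mean_n)    # about 72 and 106 for this data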
This solution generalizes to more than 2 testers, but figuring out the likelihood function, and evaluating it quickly, becomes increasingly difficult. Also, as the number of testers increases, so does the number of dimensions in the space of hypotheses. With two testers there are about 350 * 100 * 100 hypotheses. On my not-very-fast desktop computer, updating them takes about a minute. I could speed it up by evaluating the likelihood function more efficiently, but each additional tester would multiply the run time by another factor of 100.
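A quick back-of-the-envelope, using the grid sizes above (the tester probabilities on a 101-point grid and n running from 32 to 349), shows how fast the hypothesis space grows:

# Hypothesis count grows like n_range * grid_size ** num_testers.
n_range, grid_size = 350 - 32, 101
for num_testers in range(2, 5):
    print(num_testers, n_range * grid_size ** num_testers)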
The library I used in my solution is thinkbayes.py, which is described in my book, Think Bayes. Electronic copies are available under a Creative Commons license from Green Tea Press. Hard copies are published by O'Reilly Media and available from Amazon.com and other booksellers.
I believe that the approach to Bayesian statistics I present in Think Bayes is a good way to solve problems like this. As evidence, I offer the fact that this example took me about 3 hours, from the time I read John's article to the time I pressed "Publish" on this post.