## Proofreading as “mark and recapture”

Last week I saw two different articles (at Futility Closet and DataGenetics) about exactly the same topic: suppose that two reviewers each proofread the same document.  The first reviewer finds $A=10$ errors, and the second finds $B=12$ errors, of which $C=8$ are in common, i.e., 8 of those 12 errors had already been found by the first reviewer.  Can we estimate how many additional errors remain in the document that were missed by both reviewers?

Both articles essentially reproduce the argument given by Pólya (see reference below) that a reasonable estimate for the total number of errors (both found and missed) is given by the following simple formula:

$\hat{N} = \frac{A B}{C} = 15$

This is a “mark-and-recapture” estimation method similar to that used, for example, to estimate the number of fish in a lake.  Intuitively, the first reviewer identifies and “marks” a fraction $A/N$ of the errors in the document (where $N$ is unknown), which should approximately equal the fraction $C/B$ of errors found by the second reviewer that were already marked.  Setting $A/N \approx C/B$ and solving for $N$ gives the formula above.
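As a minimal sketch, the estimate can be written as a one-line Python function.  The function name `lincoln_petersen` is my own label for illustration, and I use exact rational arithmetic only to avoid floating-point noise.

```python
from fractions import Fraction

def lincoln_petersen(a, b, c):
    """Estimate the total number of errors from the two reviewers'
    counts a and b and the number c of errors found by both."""
    if c == 0:
        # No overlap means the ratio A*B/C is undefined.
        raise ValueError("estimate is undefined when no errors are found in common")
    return Fraction(a * b, c)

# The example from the text: A=10, B=12, C=8.
print(lincoln_petersen(10, 12, 8))  # 15
```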

However, neither article points out just how inaccurate this method of estimation can be, nor the fact that better alternatives are available.  For example, continuing with the example above as originally presented in the DataGenetics article, let us assume for the moment that

1. There really are a total of 15 errors in the document being reviewed.
2. The first reviewer really does find each error independently with probability 10/15=2/3.
3. The second reviewer really does find each error independently with probability 12/15=4/5.

Note that this example is arguably somewhat contrived to be “nice,” since the actual number $C=8$ of errors observed in common by both reviewers happens to equal the expected number of such errors.  This need not be the case; with this model of reviewer accuracy, the number of common errors may be as large as $N=15$… or as small as zero, in which case our estimator breaks down entirely.
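To make these claims concrete, here is a small sketch of the arithmetic under the three assumptions above.  Each error is found by both reviewers with probability $p_1 p_2$, so the number in common is binomial; the variable names are mine.

```python
from fractions import Fraction

N = 15                  # assumed true number of errors
p1 = Fraction(10, 15)   # first reviewer's detection probability
p2 = Fraction(12, 15)   # second reviewer's detection probability

# An error is found by both reviewers with probability p1*p2,
# independently of the other errors.
p_both = p1 * p2
expected_common = N * p_both

# Probability that the reviewers find no errors in common at all,
# which is the case where A*B/C is undefined.
p_no_common = (1 - p_both) ** N

print(expected_common)      # 8
print(float(p_no_common))   # on the order of 1e-5
```

So the breakdown case is possible but rare with these particular parameters; it becomes much more likely when the reviewers are less accurate or the document has fewer errors.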

Even if we condition against this unlikely difficulty, essentially asking the reviewers to both start over if they don’t find any errors in common (and to forget the errors they may have already found), there is still significant variance in the possible estimates that may result, as shown in the following figure.

*Figure: Distribution of the estimate of the number of errors ($N=15$, $p_1=2/3$, $p_2=4/5$).*
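One way to compute a distribution like this exactly, sketched here under the independence assumptions above, is to classify each error as found by both reviewers, by only the first, by only the second, or by neither, and enumerate the resulting multinomial outcomes.  The function name `estimate_distribution` is hypothetical, and this is not necessarily how the figure was produced.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb

def estimate_distribution(n, p1, p2):
    """Exact pmf of the estimate A*B/C, conditioned on C > 0."""
    q_both = p1 * p2
    q_first = p1 * (1 - p2)        # found only by the first reviewer
    q_second = (1 - p1) * p2       # found only by the second reviewer
    q_none = (1 - p1) * (1 - p2)   # missed by both
    pmf = defaultdict(Fraction)
    for c in range(1, n + 1):                  # skip c = 0 (conditioning)
        for x in range(n - c + 1):
            for y in range(n - c - x + 1):
                z = n - c - x - y
                ways = comb(n, c) * comb(n - c, x) * comb(n - c - x, y)
                prob = ways * q_both**c * q_first**x * q_second**y * q_none**z
                # A = c + x, B = c + y, C = c
                pmf[Fraction((c + x) * (c + y), c)] += prob
    total = sum(pmf.values())                  # equals P(C > 0)
    return {est: p / total for est, p in pmf.items()}

pmf = estimate_distribution(15, Fraction(2, 3), Fraction(4, 5))
mean = sum(est * p for est, p in pmf.items())
print(float(mean))
```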

(This is not a sample from a simulation; we can calculate this distribution exactly.)  The mean of the estimate, shown in red, is approximately 15.17, which is pretty good.  However, we can do better (only slightly better in this already-fortunate case, but a lot better in other cases) using a slightly different estimator due to Chapman:

$\hat{N} = \frac{(A+1)(B+1)}{C+1} - 1$

This estimate has several advantages over the simple $AB/C$ formula above, which is known as the Lincoln-Petersen method.  It has less bias and less variance, particularly in situations like this where the “population” of errors is relatively small.  Also, it still works even when $C=0$, i.e., when no errors are found in common by both reviewers.
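The comparison can be checked with a short sketch that computes the exact mean and variance of each estimator under the same assumed model ($N=15$, $p_1=2/3$, $p_2=4/5$).  The helper `exact_mean_and_variance` and its `min_common` parameter are my own names; both estimators are conditioned on $C \geq 1$ here so that they are compared over the same outcomes.

```python
from fractions import Fraction
from math import comb

def lincoln_petersen(a, b, c):
    return Fraction(a * b, c)

def chapman(a, b, c):
    return Fraction((a + 1) * (b + 1), c + 1) - 1

def exact_mean_and_variance(estimator, n, p1, p2, min_common=1):
    """Exact mean and variance of an estimator, conditioned on C >= min_common."""
    q = (p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2))
    total = m1 = m2 = Fraction(0)
    for c in range(min_common, n + 1):
        for x in range(n - c + 1):              # found only by first reviewer
            for y in range(n - c - x + 1):      # found only by second reviewer
                z = n - c - x - y               # missed by both
                ways = comb(n, c) * comb(n - c, x) * comb(n - c - x, y)
                prob = ways * q[0]**c * q[1]**x * q[2]**y * q[3]**z
                est = estimator(c + x, c + y, c)
                total += prob
                m1 += prob * est
                m2 += prob * est * est
    mean = m1 / total
    return mean, m2 / total - mean * mean

print(chapman(10, 12, 8))   # 134/9, about 14.89
print(chapman(10, 12, 0))   # 142: still defined with no common errors

p1, p2 = Fraction(2, 3), Fraction(4, 5)
for name, est in [("Lincoln-Petersen", lincoln_petersen), ("Chapman", chapman)]:
    mean, var = exact_mean_and_variance(est, 15, p1, p2)
    print(name, float(mean), float(var))
```

Passing `min_common=0` for `chapman` includes the outcomes with no common errors as well, which the Lincoln-Petersen formula cannot handle.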

Having said that, it’s not clear how useful either method really is in this particular context, given how widely the resulting estimate may vary from the true value.  These estimation methods work much better when at least one of the two sample sizes $A$ and $B$ is pre-determined (e.g., first catch exactly 100 fish, mark them, then catch 100 fish again), and only $C$ varies randomly.

Reference:

• Pólya, G., Probabilities in Proofreading, *American Mathematical Monthly*, 83(1), January 1976, p. 42