On a web site where users rate the quality of– well, just about anything, from consumer products to comments on news articles– with positive or negative “votes,” how should the rated items be sorted so as to present the higher-rated items near the top of the page and the lower-rated items near the bottom?
I recently learned how Reddit handles this problem when sorting user comments on a link post. (I was teased to the Reddit link by the title’s suggestion that Randall Munroe of xkcd fame had a hand in the algorithm.) There is some interesting mathematics involved… but there also appear to be some equally interesting bugs in the implementation, which I will discuss here.
[Edit: Two of the three problems discussed here have been fixed.]
First, some background, in chronological order: Evan Miller has a great write-up titled “How Not To Sort By Average Rating.” He shows why a few simple solutions do not work very well, and instead proposes using the lower bound of the Wilson score confidence interval to sort ratings, including Ruby source code implementing the formula.
Evan’s post was in turn referenced by a Reddit blog guest post by Randall, describing Reddit’s new “best” comment ranking algorithm around the time it was implemented. And finally, the most recent amix blog by Amir Salihefendic describes Reddit’s ranking algorithms in excellent detail, including inspection of the Reddit source code. (To be clear, Reddit uses different ranking algorithms for “stories” (i.e., link posts) vs. user comments. We are focusing on the latter here.)
I had initially planned to discuss confidence intervals in general, and some interesting issues with binomial proportion estimation in particular, of which the Wilson score interval is just one example. There frequently seems to be confusion about just what information is conveyed by a confidence interval. For example, Evan describes the Wilson score lower bound as the answer to the question, “Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what?” This is misleading, since no claim can be made about the probability that the “real” fraction of positive ratings lies within a particular computed interval. (Wikipedia actually does a great job of explaining this.)
But let’s skip that and instead focus on the actual implementation in the Reddit source code. I’ll take Amir’s approach of presenting a Python translation, since the original is in the less common Pyrex:
# From /r2/r2/lib/db/_sorts.pyx from math import sqrt def confidence(ups, downs): n = ups + downs if n == 0: return 0 z = 1.0 #1.0 = 85%, 1.6 = 95% phat = float(ups) / n return sqrt(phat+z*z/(2*n)-z*((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
Compare this with the formula for the endpoints of the Wilson score interval:
where is the fraction of observed votes that are positive, and is the appropriate quantile of the standard normal distribution (more on this shortly).
We can make several observations at this point. First, and most importantly, the square root is in the wrong place in the code! I am not sure how this error crept in, since the form of the expression and even the variable names are nearly identical to the correct Ruby example from Evan’s original post. At any rate, whatever Reddit is computing, it isn’t the Wilson score interval.
Second, it is worth clarifying for exactly what confidence interval we are trying to compute endpoints. Both Evan and Amir present the above formula with the plus/minus notation, and use in their -subscripts, which together strongly suggest that the two-sided Wilson score interval of the form is intended. However, the text in Evan’s post and the actual constants used for in the code comments suggest a one-sided interval of the form , where is the value returned by the function. Since the distinction only matters when we talk about the actual level of confidence in the resulting interval (e.g., is it 95% or 90%, 85% or 70%, etc.), to avoid further confusion I will use the code as a guide and refer to levels of confidence assuming one-sided intervals.
Finally, one interesting side effect of the current incorrect implementation is that the function always returns positive values, even for zero upvotes. More generally, the resulting interval frequently does not even contain the maximum likelihood estimate of the true fraction of positive ratings! That’s the bad news… but the good news is that the function “does the right thing” for zero upvotes anyway, by returning smaller values for larger numbers of downvotes. The true Wilson score interval lower bound does not have this nice property, since the lower bound is zero for , independent of .
(Actually, I am not sure if this is ever a problem, at least on Reddit, since every comment seems to begin its life with one “automatic” upvote.)
So, how should we fix things? I would recommend the following modification to the algorithm:
# From /r2/r2/lib/db/_sorts.pyx from math import sqrt def confidence_fixed(ups, downs): if ups == 0: return -downs n = ups + downs z = 1.64485 #1.0 = 85%, 1.6 = 95% phat = float(ups) / n return (phat+z*z/(2*n)-z*sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
This addresses each of the three issues described above:
First, we fix the square root in the formula, so that we are indeed computing the lower bound of the Wilson score interval (usually, anyway; see below).
Second, we change the constant to use the 95% confidence interval instead of the original 85%. This is really just a fine-tuning to keep the resulting ordering “closer,” in an eyeball-norm sense, to how the current algorithm performs.
Finally, we extend the special-case behavior from just returning zero for zero total votes, to returning negative values for zero upvotes. This maintains the current property that more downvotes will always push items farther down the list.