On a web site where users rate the quality of – well, just about anything, from consumer products to comments on news articles – with positive or negative “votes,” how should the rated items be sorted so as to present the higher-rated items near the top of the page and the lower-rated items near the bottom?

I recently learned how Reddit handles this problem when sorting user comments on a link post. (I was teased to the Reddit link by the title’s suggestion that Randall Munroe of xkcd fame had a hand in the algorithm.) There is some interesting mathematics involved… but there also appear to be some equally interesting bugs in the implementation, which I will discuss here.

[*Edit*: Two of the three problems discussed here have been fixed.]

First, some background, in chronological order: Evan Miller has a great write-up titled “How Not To Sort By Average Rating.” He shows why a few simple solutions do not work very well, and instead proposes using the lower bound of the Wilson score confidence interval to sort ratings, including Ruby source code implementing the formula.

Evan’s post was in turn referenced by a Reddit blog guest post by Randall, describing Reddit’s new “best” comment ranking algorithm around the time it was implemented. And finally, the most recent reference, an amix blog post by Amir Salihefendic, describes Reddit’s ranking algorithms in excellent detail, including inspection of the Reddit source code. (To be clear, Reddit uses different ranking algorithms for “stories” (i.e., link posts) vs. user comments. We are focusing on the latter here.)

I had initially planned to discuss confidence intervals in general, and some interesting issues with binomial proportion estimation in particular, of which the Wilson score interval is just one example. There frequently seems to be confusion about just what information is conveyed by a confidence interval. For example, Evan describes the Wilson score lower bound as the answer to the question, “*Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what?*” This is misleading, since no claim can be made about the probability that the “real” fraction of positive ratings lies within a particular computed interval. (Wikipedia actually does a great job of explaining this.)
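The distinction can be illustrated by simulation: the “95%” refers to the long-run behavior of the *procedure*, not to any single computed interval. A minimal sketch (the function name `wilson_lower`, the seed, and the parameters `p_true`, `votes`, and `trials` are all my own illustrative choices):

```python
# Over many repeated experiments, the one-sided Wilson lower bound falls
# below the true probability of a positive vote about 95% of the time.
import random
from math import sqrt

def wilson_lower(ups, n, z=1.64485):  # z for a one-sided 95% bound
    if n == 0:
        return 0.0
    phat = ups / n
    return (phat + z*z/(2*n) - z*sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

random.seed(1)
p_true, votes, trials = 0.7, 50, 10000
covered = sum(
    wilson_lower(sum(random.random() < p_true for _ in range(votes)), votes) <= p_true
    for _ in range(trials)
)
print(covered / trials)  # typically close to the nominal 0.95
```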

But let’s skip that and instead focus on the actual implementation in the Reddit source code. I’ll take Amir’s approach of presenting a Python translation, since the original is in the less common Pyrex:

```python
# From /r2/r2/lib/db/_sorts.pyx
from math import sqrt

def confidence(ups, downs):
    n = ups + downs
    if n == 0:
        return 0
    z = 1.0 #1.0 = 85%, 1.6 = 95%
    phat = float(ups) / n
    return sqrt(phat+z*z/(2*n)-z*((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
```

Compare this with the formula for the endpoints of the Wilson score interval:

$$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$

where $\hat{p}$ is the fraction of the $n$ observed votes that are positive, and $z$ is the appropriate quantile of the standard normal distribution (more on this shortly).

We can make several observations at this point. First, and most importantly, the square root is in the wrong place in the code! I am not sure how this error crept in, since the form of the expression and even the variable names are nearly identical to the correct Ruby example from Evan’s original post. At any rate, whatever Reddit is computing, it isn’t the Wilson score interval.

Second, it is worth clarifying exactly which confidence interval we are trying to compute endpoints for. Both Evan and Amir present the above formula with the plus/minus notation, and use $\alpha/2$ in their $z$-subscripts, which together strongly suggest that the *two-sided* Wilson score interval of the form $(w^-, w^+)$ is intended. However, the *text* in Evan’s post and the actual constants used for $z$ in the code comments suggest a *one-sided* interval of the form $(w^-, 1]$, where $w^-$ is the value returned by the function. Since the distinction only matters when we talk about the actual level of confidence in the resulting interval (e.g., is it 95% or 90%, 85% or 70%, etc.), to avoid further confusion I will use the code as a guide and refer to levels of confidence assuming *one-sided* intervals.
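For reference, the relevant normal quantiles can be computed with the Python standard library (my own sketch, not from the Reddit source; note that z = 1.0 in the original code comment actually corresponds to roughly 84%, not 85%, one-sided confidence):

```python
# One-sided vs. two-sided quantiles of the standard normal distribution,
# using only the Python standard library.
from statistics import NormalDist

z_one_sided_95 = NormalDist().inv_cdf(0.95)    # ~1.6449: one-sided 95%
z_two_sided_95 = NormalDist().inv_cdf(0.975)   # ~1.9600: two-sided 95%
one_sided_level_for_z1 = NormalDist().cdf(1.0) # ~0.8413: z = 1.0 is ~84% one-sided
```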

Finally, one interesting side effect of the current incorrect implementation is that the function always returns positive values, *even for zero upvotes*. More generally, the resulting interval frequently does not even contain the maximum likelihood estimate $\hat{p}$ of the true fraction of positive ratings! That’s the bad news… but the good news is that the function “does the right thing” for zero upvotes anyway, by returning smaller values for larger numbers of *downvotes*. The true Wilson score interval lower bound does not have this nice property, since the lower bound is zero for $\hat{p} = 0$, independent of $n$.
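To make this concrete, here is the buggy function quoted above evaluated at a couple of zero-upvote points (the specific arguments are my own illustrative choices):

```python
# Demonstration of the side effect described above: the buggy confidence()
# function (square root over the whole numerator) returns strictly positive
# scores even with zero upvotes, so the "interval" excludes phat = 0.
from math import sqrt

def confidence(ups, downs):  # the buggy version quoted above
    n = ups + downs
    if n == 0:
        return 0
    z = 1.0
    phat = float(ups) / n
    return sqrt(phat + z*z/(2*n) - z*((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

print(confidence(0, 1))    # 0.25 -- positive despite zero upvotes
print(confidence(0, 100))  # much smaller: many downvotes still rank lower
```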

(Actually, I am not sure if this is ever a problem, at least on Reddit, since every comment seems to begin its life with one “automatic” upvote.)

So, how should we fix things? I would recommend the following modification to the algorithm:

```python
# From /r2/r2/lib/db/_sorts.pyx
from math import sqrt

def confidence_fixed(ups, downs):
    if ups == 0:
        return -downs
    n = ups + downs
    z = 1.64485 #1.64485 = one-sided 95%
    phat = float(ups) / n
    return (phat+z*z/(2*n)-z*sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
```

This addresses each of the three issues described above:

First, we fix the square root in the formula, so that we are indeed computing the lower bound of the Wilson score interval (usually, anyway; see below).

Second, we change the constant to use the 95% confidence interval instead of the original 85%. This is really just a fine-tuning to keep the resulting ordering “closer,” in an eyeball-norm sense, to how the current algorithm performs.

Finally, we extend the special-case behavior from just returning zero for zero *total* votes, to returning *negative* values for zero *upvotes*. This maintains the current property that more downvotes will always push items farther down the list.
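A quick sanity check of the modified algorithm (restating `confidence_fixed` so the snippet stands alone; the test values are my own):

```python
from math import sqrt

def confidence_fixed(ups, downs):
    if ups == 0:
        return -downs  # zero upvotes: strictly worse with each downvote
    n = ups + downs
    z = 1.64485  # one-sided 95% bound
    phat = float(ups) / n
    return (phat + z*z/(2*n) - z*sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

# Zero-upvote comments are strictly ordered by downvotes...
print(confidence_fixed(0, 0), confidence_fixed(0, 1), confidence_fixed(0, 2))  # 0 -1 -2
# ...and any comment with at least one upvote ranks above all of them.
print(confidence_fixed(1, 0))  # ~0.27
```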

The June 2010 version of reddit’s code has the bug you highlight, but the April 2011 version has fixed it and clarified the code a bit!

See https://github.com/reddit/reddit/commits/master/r2/r2/lib/db/_sorts.pyx and click through to the two versions.

Thanks for pointing this out, John. I edited this post accordingly to note these fixes.

There remains the question of the third “problem” described in the post. That is, the fixed version always returns a confidence bound of 0 for 0 upvotes, independent of the number of downvotes. It seems like it would be desirable for more downvotes to (strictly) decrease the confidence score, even for upvotes=0. (Both the old buggy version, and my suggested fix, had this property.)

But maybe this situation never arises, since comments always seem to start with 1 upvote?

It’s possible to remove that 1 upvote from your own comment if you want, so it can happen.

Shouldn’t the z-score be 1.96 for a 95% interval? 1.64485 should be for 90%?

This has to do with the “two-sided” vs. “one-sided” issue discussed in the post. That is, do we want an interval with two endpoints (strictly between 0 and 1), so that 95% of the time the interval will contain the “true” probability of an up-vote? Or do we just want a *lower bound* that 95% of the time will be less than the true probability of an up-vote?

If it’s the former (as the other referenced blog posts suggest), then you’re right, typical practice is to split the 5% equally between the upper and lower tails. But if it’s the latter (as the actual Reddit code comments suggest), then we want an interval of the form (a,1], and the lower tail gets the whole 5%.

Of course, this is all simply an approximation that is meant to *sort* comments in a relative sense, not measure anything *absolutely* for any one comment. So it really doesn’t matter – that is, we can choose to interpret the “score” either way: it is either (1) the lower of two endpoints of an interval that contains the true probability 90% of the time, or (2) the value a such that the interval (a,1] contains the true probability 95% of the time.

Have you come across an approach that also takes into account the age of the positive and negative ratings?

@NK: If I understand your question correctly, the “hot” ranking of comments (see the same source code link in the post) attempts to do this, or at least, it generally degrades a comment’s ranking over time (relative to newer ones).

i have a question regarding how the confidence sort rank a message with 0 up and 0 down votes (no vote at all) compare to a message with 0 upvote and 1 downvote.

using the confidence sort, it will rank the message with 0 upvotes and 1 downvote higher than the message with 0 up and 0 down votes. this doesn’t seem to make sense to me.

am i missing something?

Hmmm. I think it depends on which version of the confidence scoring implementation you are looking at:

1. If you are looking at the original, buggy implementation quoted in this post (with the square root “in the wrong place”), then you are right; one symptom of this earlier incorrect implementation is that a comment with (0 up, 1 down) has a higher score than (0 up, 0 down).

2. Note, however, that the code has been fixed, so that this is the current implementation… in which a (0,0) comment always has the *same* score (of zero) as any other comment with 0 upvotes. This is arguably not an improvement, IMO.

3. My suggested fix, quoted in this post, strictly orders comments with 0 upvotes so that (0,0)>(0,1)>(0,2)>… That is, with zero upvotes, more downvotes are always worse.

If I understand you correctly, we agree that (3) is the “right” behavior, at least for comments with 0 upvotes. There is further argument for this being the “right” ordering in a follow-up post proposing what is IMO a less ad hoc ordering that yields this same behavior.

possiblywrong, thanks for your reply

I have also noticed that (1,X) will always rank higher than (0,0) even if X is a huge number, for example (1,1000). how did you resolve this? does this make sense?

@howard, this is a tricky one, since ranking a comment with absolutely no votes depends a lot on your assumptions about “prior” knowledge of the value of a comment. You’re right that Reddit’s current implementation ranks (0,0)<(1,x). But note that my proposed ranking in the follow-up post mentioned earlier handles it differently, ranking …(1,3)<(1,2)<(0,0)<(1,1)<(1,0). So there is definitely an argument for other interpretations/assumptions.

I used your adaptation of the algorithm in my own application. I’d like to quote you in the report that I am writing. Couldn’t find any info about your name, though. Do you want to remain anonymous or could you share your name, please?

How do you think the Wilson score interval would hold up without upvotes or downvotes, and instead, Medium-like highlights? If highlights are the only source of input, with the potential for multiple highlights from one user on one post/response, is that statistically relevant enough to determine post/response quality for ranking?