Reddit’s comment ranking algorithm revisited

Introduction

The “Bayesian/frequentist” coin puzzle discussed in the last couple of posts was really just an offshoot of some thoughts I have been mulling over about Reddit’s current default approach to ranking user comments on a post, based on the number of upvotes and downvotes each comment receives.  (Or more generally, the problem of ranking a collection of any items, whether comments, or consumer products, etc., based on varying numbers of “like/dislike” votes.)  Instead of trying to estimate the bias of a coin based on the observed number of heads and tails flipped, here each comment is like a coin, and each upvote (or downvote) is like an observation of a coin flip coming up heads (or tails).

If we assume that each comment has some fixed– but unknown– probability \theta that a random user will upvote the comment, then it would be convenient to simply sort all of the comments on a particular post by decreasing \theta, so that the “best” comments would appear near the top.  Unfortunately, we don’t actually know \theta; we can only estimate it from the observed pair (u,d) of upvotes and downvotes, respectively.

A natural first idea might be to “score” each comment using the maximum likelihood estimate

\hat{\theta} = \frac{u}{u+d}

and sort the comments by this score.  But this tends to unfairly compare comments with very different numbers of total votes; e.g., should a comment with votes (3,0) really be ranked higher than (99,1)?
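To make the problem concrete, here is a minimal Python sketch (with made-up vote counts) showing how sorting by the maximum likelihood estimate places a (3,0) comment above a (99,1) comment:

# Hypothetical vote counts (u, d), sorted by the MLE score u / (u + d).
comments = [(3, 0), (99, 1), (5, 2)]
by_mle = sorted(comments, key=lambda c: c[0] / (c[0] + c[1]), reverse=True)
print(by_mle)  # [(3, 0), (99, 1), (5, 2)] -- the 3-vote comment wins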

Wilson Score Interval

Evan Miller’s “How Not To Sort By Average Rating” does a good job of presenting this and other approaches, eventually arguing for sorting by the lower bound of the Wilson score interval, which is what Reddit currently does.  Briefly, the Wilson score interval is a confidence interval intended to “cover” (i.e., contain) the true– but unknown– value \theta with at least some guaranteed probability, described as the “confidence level.”  In general, the higher the confidence level, or the fewer the number of observations, the wider the corresponding confidence interval.  By scoring each comment with the lower bound of this confidence interval, we are effectively starting with a point estimate based on the fraction of upvotes, but then penalizing this score according to the total number of votes, with fewer votes receiving a greater penalty.

Reddit’s use of this scheme has evolved slightly over time, initially computing a 70% confidence interval, but then changing to the current wider 80% confidence interval, having the effect of imposing a slightly greater penalty on comments with fewer total votes.  This “fine-tuning” of the scoring algorithm raises the question of whether there is a more natural method for ranking user comments, one that does not require this sort of knob-turning.
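For reference, here is a sketch of the lower-bound score described above (not Reddit’s actual implementation), with the confidence level as the tunable knob and a two-sided normal quantile assumed:

from math import sqrt
from statistics import NormalDist

def wilson_lower_bound(u, d, confidence=0.8):
    # Lower bound of the Wilson score interval for the upvote probability,
    # given u upvotes, d downvotes, and a two-sided confidence level.
    n = u + d
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.28 for 80%
    p_hat = u / n
    center = p_hat + z * z / (2 * n)
    margin = z * sqrt((p_hat * (1 - p_hat) + z * z / (4 * n)) / n)
    return (center - margin) / (1 + z * z / n)

# The penalty for few votes: (3, 0) now scores well below (99, 1).
print(wilson_lower_bound(3, 0), wilson_lower_bound(99, 1))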

A Bayesian Alternative

Last year, James Neufeld proposed the interesting idea of sampling a random score for each comment by drawing from a corresponding beta distribution with parameters

(\alpha, \beta) = (u+1, d+1)

The idea is that this beta distribution is a natural way to express our uncertainty about the “true” value \theta of a comment, starting with an assumed prior uniform distribution on \theta (i.e., a comment is initially equally likely to be great, terrible, or anything in between), and updating based on the observation of (u,d) upvotes and downvotes, respectively.  For example, a comment with 30 upvotes and 10 downvotes yields a beta distribution with the following density:

[Figure: Probability density of the beta distribution with parameters (30+1, 10+1).]

A key point is that not every user necessarily sees the comments on a post in the same order.  Each time the post is viewed, the comments are re-scored by new random draws from the corresponding beta distributions, and sorted accordingly.  As a comment receives more and more upvotes and/or downvotes, it will “settle in” to a particular position among other comments… but comments with few votes, or even strongly downvoted comments, will still have some chance of appearing near the top of any particular user’s view of the page.
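A minimal sketch of this randomized scoring, using the standard library’s beta sampler (vote counts are made up):

import random

def random_page_order(comments):
    # Draw a fresh random score for each comment from Beta(u + 1, d + 1)
    # and sort by that score, highest first.
    scored = [(random.betavariate(u + 1, d + 1), (u, d)) for u, d in comments]
    return [c for score, c in sorted(scored, reverse=True)]

comments = [(30, 10), (3, 0), (99, 1), (0, 2)]
print(random_page_order(comments))  # order varies from page view to page view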

I really like this idea, but the non-deterministic ordering of comments presented to different users may be seen as a drawback.  Can we fix this?

Sorting by Expected Rank

I can think of two natural deterministic modifications of this approach.  The first is to sort comments by their expected ranking using the random scoring described above.  In other words, for each comment, compute the expected number of other comments that would appear higher than it on one of Neufeld’s randomly generated pages, and sort the comments by this expected value.

Although this method “fixes” the non-determinism of the original, unfortunately it suffers from a different undesirable property: the relative ranking of two comments may be affected by the presence or absence of other comments on the same post.  For example, consider the two comments identified by their upvote/downvote counts (0,1) and (1,3).  If these are the only two comments on a post, then (0,1) < (1,3).  However, if we introduce a third comment (7,3), then the resulting overall ranking is (1,3) < (0,1) < (7,3), reversing the ranking of the original two comments!
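The reversal is easy to check numerically.  The following sketch (assuming scipy is available) computes each comment’s expected rank as a sum of pairwise probabilities, evaluated by numerical integration:

from scipy import integrate, stats

def prob_exceeds(c_j, c_i):
    # P(X_j > X_i), where X ~ Beta(u + 1, d + 1) for each comment (u, d).
    f_i = stats.beta(c_i[0] + 1, c_i[1] + 1).pdf
    F_j = stats.beta(c_j[0] + 1, c_j[1] + 1).cdf
    val, _ = integrate.quad(lambda x: f_i(x) * (1 - F_j(x)), 0, 1)
    return val

def expected_ranks(comments):
    # Expected number of other comments ranked above each comment under
    # the randomized beta scoring (smaller is better).
    n = len(comments)
    return [sum(prob_exceeds(comments[j], comments[i])
                for j in range(n) if j != i)
            for i in range(n)]

print(expected_ranks([(0, 1), (1, 3)]))          # ~[0.524, 0.476]
print(expected_ranks([(0, 1), (1, 3), (7, 3)]))  # ~[1.396, 1.399, 0.205]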

Pairwise comparisons

Which brings me, finally, to the second alternative, which was my initial idea: sort the comments on a post according to the order relation

(u_1,d_1) < (u_2,d_2) \iff P(X_1 > X_2) < \frac{1}{2}

where

X_k \sim Beta(u_k+1,d_k+1)

More intuitively, we are simply ranking one comment higher than another if it is more likely than not to appear higher using Neufeld’s randomized ranking.

Note one interesting property of this approach that distinguishes it from all of the other methods mentioned so far: it does not involve assigning a real-valued “score” to each individual comment (and subsequently sorting by that score).  This is certainly possible in principle (see below), but as currently specified we can only compare two comments by performing a calculation involving parameters of both in a complex way.
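One way to sketch this in code is to sort with a pairwise comparator, for example via functools.cmp_to_key, with P(X_1 > X_2) evaluated by numerical integration (scipy assumed; note that pairs lying exactly at 1/2, discussed below, cannot be resolved reliably this way):

from functools import cmp_to_key
from scipy import integrate, stats

def prob_greater_numeric(c1, c2):
    # P(X1 > X2), where X_k ~ Beta(u_k + 1, d_k + 1).
    f1 = stats.beta(c1[0] + 1, c1[1] + 1).pdf
    F2 = stats.beta(c2[0] + 1, c2[1] + 1).cdf
    val, _ = integrate.quad(lambda x: f1(x) * F2(x), 0, 1)
    return val

def compare(c1, c2):
    # c1 "<" c2 exactly when P(X1 > X2) < 1/2.
    p = prob_greater_numeric(c1, c2)
    return -1 if p < 0.5 else (1 if p > 0.5 else 0)

comments = [(0, 1), (1, 3), (7, 3), (30, 10)]
# Ascending order is worst-to-best under the relation; reverse for best-first.
print(sorted(comments, key=cmp_to_key(compare), reverse=True))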

Open Questions

Unfortunately, there are quite a few holes to be patched up with this method, and I am hoping that someone can shed some light on how to address these.  First, the strict order defined above is not quite a total order, since there are some pairs of distinct comments where one comment’s randomized score is equally likely to be higher or lower than the other.  For example, all of the comments of the form (u,u), with an equal number of upvotes and downvotes, have this problem.  This is probably not a big deal, though, since I think it is possible to arbitrarily order these comments, for example by increasing total number of votes.

But there are other more interesting pairs of incomparable comments.  For example, consider (5,0) and (13,1).  The definition above is insufficient to rank these two… but it turns out that it had better be the case that (13,1) < (5,0), since we can find a third comment that lies between them:

(13,1) < (70,8) < (5,0)

This brings us to the next open question: is this order relation transitive (in other words, is it even a partial order)?  I have been unable to prove this, only to verify it computationally for comments with bounded numbers of votes.

The final problem is a more practical one: how efficiently can this order relation be computed?  Evaluating the probability that one beta-distributed random variable exceeds another involves a double integral that “simplifies” to an expression involving factorials and a hypergeometric function of the numbers of upvotes and downvotes.  If you want to experiment, the following Python code uses the mpmath library to compute the probability P(X_1 > X_2):

from mpmath import fac, hyp3f2

def prob_greater(u1, d1, u2, d2):
    # P(X1 > X2), where X1 ~ Beta(u1 + 1, d1 + 1) and X2 ~ Beta(u2 + 1, d2 + 1).
    return (hyp3f2(-d2, u2 + 1, u1 + u2 + 2, u2 + 2, u1 + u2 + d1 + 3, 1) *
            fac(u1 + u2 + 1) / (fac(u1) * fac(u2)) *
            fac(u1 + d1 + 1) * fac(u2 + d2 + 1) /
            ((u2 + 1) * fac(d2) * fac(u1 + u2 + d1 + 2)))

# The incomparable pair from above: the result is exactly 1/2.
print(prob_greater(5, 0, 13, 1))
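And a quick check of the chain claimed above, (13,1) < (70,8) < (5,0); both of the following values should come out just barely above 1/2:

print(prob_greater(70, 8, 13, 1))  # (70, 8) beats (13, 1) more often than not
print(prob_greater(5, 0, 70, 8))   # and (5, 0) beats (70, 8) more often than not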

John Cook has written a couple of interesting papers on this problem, in the medical context of evaluating clinical trials: one discusses various approximations, and another presents exact formulas and recurrences for some special cases.  The problem of computing the actual probability seems daunting… but perhaps it is simpler in this case not to compute the value at all, but just to determine whether or not it is greater than 1/2?

In summary, I think these difficulties can be rolled up into the following more abstract statement of the problem: can we impose a “natural,” efficiently computable total order on the set of all beta distributions with positive integer parameters, one that looks something like the order relation described above?

 

A coin puzzle revisited

This is a follow-up to some interesting discussion in the comments on my previous post, involving a coin-flipping probability puzzle, and a comparison of Bayesian and frequentist approaches to “solving” it.  For completeness, here is the original problem:

You have once again been captured by pirates, who threaten to make you walk the plank unless you can correctly predict the outcome of an experiment.  The pirates show you a single gold doubloon that, when flipped, has some fixed but unknown probability of coming up heads.  The coin is then flipped 7 times, of which you observe 5 to be heads and 2 to be tails.  At this point, you must now bet your life on whether or not, in two subsequent flips of the coin, both will come up heads.  If you predict correctly, you go free; if not, you walk the plank.  Which outcome would you choose?

A typical puzzle-solver would (rightly) point out that necessary information is missing; we cannot determine the optimal action without knowing how the coin (and thus its bias) was selected.  Instead of providing that information, I stirred up the Bayesian vs. frequentist debate by showing how each might reason without it, and come up with differing conclusions.

One of the reasons that I like this problem is that the “Bayesian vs. frequentist” perspective is a bit of a ruse.  The frequentist in the original post computes the maximum likelihood estimate of the probability of the coin coming up heads… and makes a betting decision based on that estimate.  The Bayesian performs a slightly more complex calculation, involving updating a prior beta distribution using the observed flips, doing some calculus… but then makes a similar “threshold” betting decision based on that calculation.

The key observation is that any deterministic betting strategy whatsoever, whether wearing a frequentist hat, a Bayesian hat, or a clown hat, may be specified as a function

f:\{0, 1, 2, ..., n\} \rightarrow \{0, 1\}

mapping the number of heads observed in n=7 total flips to 1 indicating a bet for two subsequent heads, and 0 indicating a bet against.  Neither the underlying statistical philosophy nor the complexity of implementation of this function matters; all that matters is the output.

Actually, we can simplify things even further if we only consider “monotonic” strategies of the form “bet for two heads if k or more heads are observed, otherwise bet against.”  That is,

f_k(h) = H[h-k]

where H[] is the unit step function.

As mendel points out in the comments on the previous post, the frequentist MLE strategy is equivalent to f_5 (i.e., bet on two heads with “5 or more” observed heads), and the Bayesian strategy is equivalent to f_6 (“6 or more”).  We can compare these strategies– along with the seven other monotonic strategies– by computing the probability of their success, as a function of the unknown probability p of heads for each single coin flip.  That is, the probability of surviving the game with strategy f_k is

\sum_{h=0}^n {n \choose h} p^h (1-p)^{n-h}(f_k(h)(2p^2-1) + 1-p^2)

The following figure shows the results for all nine strategies:

[Figure: Comparison of monotonic strategies as a function of the probability of heads in a single coin flip.  The frequentist MLE strategy is “5 or more,” and the Bayesian strategy is “6 or more.”]
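The curves are straightforward to compute directly from the formula above; here is a short sketch:

from math import comb

def survival_probability(k, p, n=7):
    # Probability of surviving the game with strategy f_k ("bet on two heads
    # iff at least k of the n observed flips were heads"), when each flip
    # comes up heads with probability p.
    total = 0.0
    for h in range(n + 1):
        win = p * p if h >= k else 1 - p * p
        total += comb(n, h) * p**h * (1 - p)**(n - h) * win
    return total

# Compare the MLE strategy ("5 or more") with the Bayesian strategy
# ("6 or more") at a few values of the unknown bias p.
for p in (0.6, 0.7, 0.8):
    print(p, survival_probability(5, p), survival_probability(6, p))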

The MLE strategy (green) and Bayesian strategy (blue) are certainly contenders for the best reasonable approach.  However, neither of these, nor any other single strategy, dominates all others for all possible values of the unknown probability of heads in a single coin flip.  In other words, whether the Bayesian or frequentist has a better chance of survival truly does depend on the information that we are explicitly not given.

 

A coin puzzle

Are you a Bayesian or a frequentist?  What do these terms mean, and what are the differences between the two?  For me, these questions have never been terribly interesting, despite many attempts at answers given in the literature (see the references below for useful and entertaining examples).

My problem has been that explanations typically focus on the different approaches to expressing uncertainty, as opposed to different approaches to actually making decisions.  That is, in my opinion, Bayesians and frequentists can argue all they want about what “the probability of an event” really means, and how much prior information the other camp has or hasn’t unjustifiably assumed… but when pressed to actually take an action, when money is on the table, everyone becomes a Bayesian.

Or do they?  Following is an interesting puzzle that seems to more clearly distinguish the Bayesian from the frequentist, by forcing them both to put money on the table, so to speak:

Problem: You have once again been captured by bloodthirsty logical pirates, who threaten to make you walk the plank unless you can correctly predict the outcome of an experiment.  The pirates show you a single irregularly-shaped gold doubloon selected from their booty, and tell you that when the coin is flipped, it has some fixed but unknown probability of coming up heads.  The coin is then flipped 7 times, of which you observe 5 to be heads and 2 to be tails.

At this point, you must now bet your life on whether or not, in two subsequent flips of the coin, both will come up heads.  If you predict correctly, you go free; if not, you walk the plank.  Which outcome would you choose?  (The pirates helpfully remind you that, if your choice is not to play, then you will walk the plank anyway.)

I think this is an interesting problem because two different but reasonable approaches yield two different answers.  For example, the maximum likelihood estimate of the unknown probability that a single flip of the coin will come up heads is 5/7 (i.e., the observed fraction of flips that came up heads), and thus the corresponding estimate of the probability that the next two flips will both come up heads is (5/7)*(5/7)=25/49, or slightly better than 1/2.  So perhaps a frequentist would bet on two heads.

On the other hand, a Bayesian might begin with an assumed prior distribution on the unknown probability for a single coin flip, and update that distribution based on the observation of h=5 heads and t=2 tails.  For example, using a “maximum entropy” uniform prior, the posterior distribution of the single-flip probability is a beta distribution with parameters (h+1, t+1), and so the probability of two consecutive heads is

\int_0^1 \frac{x^h (1-x)^t}{B(h+1, t+1)} x^2 dx = \frac{7}{15} < \frac{1}{2}

where B(h+1, t+1) is the beta function.  So perhaps a Bayesian would bet against two heads.
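For anyone who wants to check the arithmetic, here is a small sketch of both calculations in exact fractions (uniform prior assumed for the Bayesian case):

from fractions import Fraction

h, t = 5, 2

# Frequentist MLE: estimate the bias as h/(h+t), then square it.
mle_two_heads = Fraction(h, h + t) ** 2
print(mle_two_heads, float(mle_two_heads))      # 25/49, just above 1/2

# Bayesian, uniform prior: the posterior is Beta(h+1, t+1), and the
# probability of two heads is E[theta^2] = (h+1)(h+2) / ((h+t+2)(h+t+3)).
a, b = h + 1, t + 1
bayes_two_heads = Fraction(a * (a + 1), (a + b) * (a + b + 1))
print(bayes_two_heads, float(bayes_two_heads))  # 7/15, just below 1/2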

What would you do?

(A couple of comments: first, one might reasonably complain that observing just 7 coin flips is simply too small a sample to make a reasonably informed decision.  However, the dilemma does not go away with a larger sample: suppose instead that you initially observe 17 heads and 7 tails, and are again asked to bet on whether the next two flips will come up heads.  Still larger samples exist that present the same problem.

Second, a Bayesian might question the choice of a uniform prior, suggesting as another reasonable starting point the “non-informative” Jeffreys prior, which in this case is the beta distribution with parameters (1/2, 1/2).  This has a certain cynical appeal to it, since it effectively assumes that the pirates have selected a coin which is likely to be biased toward either heads or tails.  Unfortunately, this also does not resolve the issue.)
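Extending the sketch above to these two variations (the larger sample and the Jeffreys prior) confirms that the two approaches still land on opposite sides of 1/2:

from fractions import Fraction

def p_two_heads_mle(h, t):
    return Fraction(h, h + t) ** 2

def p_two_heads_bayes(h, t, prior_a=Fraction(1), prior_b=Fraction(1)):
    # E[theta^2] under the posterior Beta(prior_a + h, prior_b + t).
    a, b = prior_a + h, prior_b + t
    return a * (a + 1) / ((a + b) * (a + b + 1))

# Larger sample, uniform prior: roughly 0.502 vs. 0.487.
print(float(p_two_heads_mle(17, 7)), float(p_two_heads_bayes(17, 7)))

# Original sample, Jeffreys Beta(1/2, 1/2) prior: roughly 0.510 vs. 0.497.
print(float(p_two_heads_mle(5, 2)),
      float(p_two_heads_bayes(5, 2, Fraction(1, 2), Fraction(1, 2))))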

References:

1. Jaynes, E. T., Probability Theory: The Logic of Science. Cambridge: Cambridge University Press, 2003 [PDF]

2. Lindley, D. V. and Phillips, L. D., Inference for a Bernoulli Process (A Bayesian View), The American Statistician, 30:3 (August 1976), 112-119 [PDF]