Are you a Bayesian or a frequentist? What do these terms mean, and what are the differences between the two? For me, these questions have never been terribly interesting, despite many attempts at answers given in the literature (see the references below for useful and entertaining examples).

My problem has been that explanations typically focus on the different approaches to *expressing uncertainty*, as opposed to different approaches to actually *making decisions*. That is, in my opinion, Bayesians and frequentists can argue all they want about what “the probability of an event” really means, and how much prior information the other camp has or hasn’t unjustifiably assumed… but when pressed to actually *take an action*, when money is on the table, everyone becomes a Bayesian.

Or do they? Following is an interesting puzzle that seems to more clearly distinguish the Bayesian from the frequentist, by forcing them both to put money on the table, so to speak:

**Problem:** You have once again been captured by bloodthirsty logical pirates, who threaten to make you walk the plank unless you can correctly predict the outcome of an experiment. The pirates show you a single irregularly-shaped gold doubloon selected from their booty, and tell you that when the coin is flipped, it has some fixed but unknown probability of coming up heads. The coin is then flipped 7 times, of which you observe 5 to be heads and 2 to be tails.

At this point, you must now bet your life on whether or not, in *two subsequent* flips of the coin, *both* will come up heads. If you predict correctly, you go free; if not, you walk the plank. Which outcome would you choose? (The pirates helpfully remind you that, if your choice is not to play, then you will walk the plank anyway.)

I think this is an interesting problem because two different but reasonable approaches yield two different answers. For example, the maximum likelihood estimate of the unknown probability that a single flip of the coin will come up heads is 5/7 (i.e., the observed fraction of flips that came up heads), and thus the probability that the next *two* consecutive flips will both come up heads is (5/7)*(5/7)=25/49, or slightly better than 1/2. So perhaps a frequentist would bet on two heads.
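The frequentist arithmetic here can be checked with exact rationals; a minimal sketch:

```python
from fractions import Fraction

# Maximum likelihood estimate of the heads-probability from 5 heads in 7 flips
p_hat = Fraction(5, 7)

# Under the MLE, probability that the next two flips are both heads
p_two_heads = p_hat * p_hat
print(p_two_heads)  # 25/49, slightly better than 1/2
```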

On the other hand, a Bayesian might begin with an assumed prior distribution on the unknown probability for a single coin flip, and update that distribution based on the observation of heads and tails. For example, using a “maximum entropy” uniform prior, the posterior probability for a single flip has a beta distribution with parameters (6, 3), and so the probability of two consecutive heads is

B(8, 3) / B(6, 3) = 7/15 < 1/2

where B(·, ·) is the beta function. So perhaps a Bayesian would bet *against* two heads.
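In exact terms: the uniform prior Beta(1, 1) updated with 5 heads and 2 tails gives a Beta(6, 3) posterior, under which the probability of two consecutive heads is E[p²] = B(8, 3)/B(6, 3) = α(α+1)/((α+β)(α+β+1)) = 7/15. A sketch with exact rational arithmetic:

```python
from fractions import Fraction

def prob_two_heads(alpha, beta):
    """Posterior probability of two consecutive heads when p ~ Beta(alpha, beta):
    E[p^2] = B(alpha + 2, beta) / B(alpha, beta) = a(a+1) / ((a+b)(a+b+1))."""
    n = alpha + beta
    return Fraction(alpha * (alpha + 1), n * (n + 1))

# Uniform prior Beta(1, 1) updated with 5 heads, 2 tails -> posterior Beta(6, 3)
print(prob_two_heads(6, 3))  # 7/15, slightly less than 1/2
```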

What would you do?

(A couple of comments: first, one might reasonably complain that observing just 7 coin flips is simply too small a sample to make a reasonably informed decision. However, the dilemma does not go away with a larger sample: suppose instead that you initially observe 17 heads and 7 tails, and are again asked to bet on whether the next two flips will come up heads. Still larger samples exist that present the same problem.

Second, a Bayesian might question the choice of a uniform prior, suggesting as another reasonable starting point the “non-informative” Jeffreys prior, which in this case is the beta distribution with parameters (1/2, 1/2). This has a certain cynical appeal to it, since it effectively assumes that the pirates have selected a coin which is *likely* to be biased toward either heads or tails. Unfortunately, this also does not resolve the issue.)
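Repeating the posterior calculation with the Jeffreys prior Beta(1/2, 1/2), the posterior after 5 heads and 2 tails is Beta(11/2, 5/2); a sketch with exact rationals:

```python
from fractions import Fraction

# Jeffreys prior Beta(1/2, 1/2) updated with 5 heads, 2 tails
alpha = Fraction(1, 2) + 5   # 11/2
beta = Fraction(1, 2) + 2    # 5/2
n = alpha + beta             # 8

# Posterior probability of two consecutive heads: E[p^2]
p_two_heads = alpha * (alpha + 1) / (n * (n + 1))
print(p_two_heads)  # 143/288 -- still slightly less than 1/2
```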

**References:**

1. Jaynes, E. T., Probability Theory: The Logic of Science. Cambridge: Cambridge University Press, 2003 [PDF]

2. Lindley, D. V. and Phillips, L. D., Inference for a Bernoulli Process (A Bayesian View), The American Statistician, 30:3 (August 1976), 112-119 [PDF]

The hypothesis that the coin is fair (irregular shape or not) can’t be rejected with confidence in the 5/7 case, since for a fair coin, we’d still see 5 or more heads in 22.7% of all experiments.

In the 17/24 case, the confidence for the hypothesis that the coin is crooked is 96.8%.
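Both tail probabilities can be checked directly; a sketch assuming a one-sided binomial test under the fair-coin hypothesis p = 1/2:

```python
from math import comb

def tail_prob(n, k):
    """P(X >= k) for X ~ Binomial(n, 1/2): one-sided tail under a fair coin."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# 5-of-7 case: a fair coin gives 5+ heads 22.7% of the time, so fairness can't be rejected
print(round(tail_prob(7, 5), 3))        # 0.227

# 17-of-24 case: confidence that the coin is crooked
print(round(1 - tail_prob(24, 17), 3))  # 0.968
```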

How does a coin flip work? Assume the outcome depends on the angle of the coin when it strikes the surface (and that this angle is uniformly distributed, e.g. when the coin spins “slower” than it falls). For the coin to come up heads 71% of the time, the face normal of the “heads” side must cover 71% of a full turn, or about 256°. This means the coin must still flip over even when it is slanted (256-180)/2 = 38° toward the other side, which in turn means the center of gravity must sit about 0.39 diameters out from the plane of the edge. The edge of the coin may not lie in a plane, but I’m hoping it’s close; in fact, I’m going to hope we can model the coin as a hemisphere. The center of mass of a hollow hemisphere is 1/3 of a radius out from the edge, so the radius of the hemisphere must be 3*0.39 = 1.17 times the diameter out from the edge, and that’s one seriously misshapen coin. If the coin is a solid hemisphere, its center of gravity is still 3/8 of a radius out from the flat face [1], making the coin 1.04 times as “thick” as its diameter.

How likely is it that a “gold doubloon” has this shape?

If we pull a prior out of our ass (like the Bayesians do), we ought to base it on some constraints on what pirates call a coin. Since, in the 5/7 case, the assumption that pirates call the same things a coin that statisticians do can’t really be rejected, we’d best bet against two heads.

Now if the pirates weren’t using a coin flip, but rather an unknown random process…

[1] http://www.a-levelmathstutor.com/m-statics-rigid-bods.php

Your last comment is on the money: “irregularly-shaped gold doubloon” was my failed attempt at translating “Bernoulli process with unknown p.” If you like, instead of a coin, consider a “Magic 8-ball” with a polyhedron in a dark, viscous liquid… where you don’t know the number of faces on the polyhedron, or how many of them say “heads” vs. “tails”.

You’re right, the intended point of this problem was to “force” a decision, possibly involving an uncomfortable choice of prior, where two reasonable approaches give different answers. If you’re interested, following is a list of similar experiments with ever-larger sample sizes that demonstrate the same problem:

{29,12}

{34,14}

{46,19}

{58,24}

{75,31}

{87,36}

{99,41}

{104,43}

{116,48}

{128,53}
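Including the original {5, 2} and {17, 7} cases, each sample in this list can be verified to produce the same split decision; a sketch assuming the uniform-prior Bayesian update described in the post:

```python
from fractions import Fraction

# {heads, tails} samples from the post, including the original two cases
samples = [(5, 2), (17, 7), (29, 12), (34, 14), (46, 19), (58, 24),
           (75, 31), (87, 36), (99, 41), (104, 43), (116, 48), (128, 53)]

for h, t in samples:
    mle = Fraction(h, h + t) ** 2                         # frequentist: p_hat squared
    a, b = h + 1, t + 1                                   # uniform prior -> posterior Beta(h+1, t+1)
    bayes = Fraction(a * (a + 1), (a + b) * (a + b + 1))  # posterior probability of two heads
    assert mle > Fraction(1, 2) > bayes                   # the two approaches disagree
```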

Could the conclusion of the Bayesian route be achieved by simply applying the “regression toward the mean” phenomenon?

If I understand your question correctly, I think regression toward the mean would involve tailoring the prior beta distribution parameters to match known (or assumed) moments of the population: that is, how do the rest of the coins in the pirates’ treasure behave? We don’t know. (Of course, we were stuck with the same problem in the OP as well, assuming a uniform prior.)

Yes, we don’t know the mean. 🙂

I like the observation that the assumption of the “uniform prior” also implies that we know a mean. “Maximum likelihood” also assumes that we have a mean, but this assumption comes from the data we’re given, and we all know that the results we’ll get are very much uncertain in this respect.

Generally, the Bayesians’ a priori strikes me as a codification of prejudices or unwarranted assumptions – great for AI, when you want machines to imitate human behaviour, but not a good approach for science.

That’s a general problem I’ve had with Yudkowsky’s writing: that he puts a lot of unexamined assumptions in and then comes to a result by “rational” reasoning, which he then claims is “less wrong” than other results – and his followers ooh and aah over the reasoning and assume that different results must be more wrong than his.

I wrote, “I like the observation that the assumption of the ‘uniform prior’ also implies that we know a mean.”

More properly, we _expect_ a mean when we hear about coin flipping. However, “regression to the mean” doesn’t mean that the average regresses to the mean we expect, but rather to the mean that’s actually there. Experimentally, with an unknown mean, that is extremely unsatisfying, because all it says is that the average regresses to the mean that it regresses to. There is no knowledge to be gained this way.

If I sent you a computer program today (or pointed you to a web page) that gave you heads or tails, with reference to this discussion, and you made 48 experiments and got 34 heads and 14 tails, would you still assume a uniform prior? Or would you assume I had tailored the program for a mean of slightly more than sqrt(1/2)? How could you tell?

Would you be a Bayesian and use one of the canned acceptable a priori distributions?

Or would you assume that my choice was the one that made the observed data most likely? I had the choice of encoding a probability of slightly less than 0.7071, slightly more than 0.7071, or even exactly 0.7071 (sqrt 0.5). Of these, the presumed choice of “slightly more” makes the observed data most likely to appear. Would it be rational to assume I had done otherwise?

Would it be rational to assume the pirates had done otherwise?
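The “most likely” claim above can be checked numerically. A sketch assuming a binomial likelihood for 34 heads in 48 flips: since the MLE 34/48 ≈ 0.7083 lies just above sqrt(1/2) ≈ 0.7071, a p slightly above sqrt(1/2) does make the observed data more likely than one slightly below:

```python
from math import comb, sqrt

def likelihood(p, h=34, t=14):
    """Binomial likelihood of observing h heads and t tails with heads-probability p."""
    return comb(h + t, h) * p**h * (1 - p)**t

root = sqrt(0.5)  # ~0.7071
eps = 1e-3

# The likelihood is increasing below the MLE of 34/48, so "slightly more" wins
print(likelihood(root - eps) < likelihood(root) < likelihood(root + eps))  # True
```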

Is the maximum likelihood estimate really what a frequentist would do? Also, what happens if you choose a hyperprior for the beta parameters? Presumably a perfectly rational captive is going to use all the information available (i.e., the physics of coin tossing, the disposition of crooked pirates) to make his final decision, so hopefully the choice of prior won’t be that “uncomfortable.”

Whether computing an MLE is “naive” is a valid question… although it’s not clear to me what *else* a frequentist would do.

I think a hyperprior is really just a computational convenience. That is, we aren’t really doing anything fundamentally different (i.e., we’re still starting with a prior distribution, updating with a likelihood, yielding a posterior), the mechanics are just easier if we “add levels” to the calculation.

You caught me in the middle of a follow-on post, where I hope to answer some of these questions. More shortly…

I really like this approach, getting to grips with the Bayesian mindset. That’s generally something I love about this blog: how it makes a difficult mathematical concept accessible.

I’ve been thinking of making a simulation. I would take a pirate with frequentist leanings, whose unknown random process has p=0.75 (he’s using 4n-sided dice), and who captures both “frequentists” and Bayesians (have you ever heard a statistician call himself a “frequentist” when he was not talking to a Bayesian?), rolls his dice 7 times, and provides the results to the mathematician (without revealing the process). Now the mathematician is in the situation your blog post describes, so the Bayesian is going to correctly bet on two heads if the provided results show 6 heads or more (44% of the time), and everybody else bets on it if they see 5 heads or more (76% of the time).
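The 44% and 76% figures in this comment can be computed exactly; a sketch assuming the number of heads is Binomial(7, 3/4):

```python
from fractions import Fraction
from math import comb

def p_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = Fraction(3, 4)  # the frequentist pirate's true probability of heads

# Bayesian bets on two heads only after seeing 6+ of 7 heads; everyone else after 5+
print(float(p_at_least(7, 6, p)))  # ~0.445, i.e. about 44% of the time
print(float(p_at_least(7, 5, p)))  # ~0.756, i.e. about 76% of the time
```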

Obviously this is not a good situation for the Bayesians to be in, but I did say that pirate had frequentist leanings. Now there might be another pirate (a twin of the first, maybe) with Bayesian leanings, or maybe he doesn’t have the right kind of dice, because he uses 3n-sided dice to make p=2/3. Since the Bayesian is now guessing correctly in the 5 heads-case whereas everyone else isn’t, everything turns out in their favor.

— Ah, you cry out, so success is not a matter of mathematical method, but of experimental design! We must objectivize this by simulating not only single pirates, but every possible pirate: we shall run a large number of simulations, with pirates using all manner of p, and count up the results, and then we will know which approach is better!

— Pirates using all manner of p? Choosing that p at random, presumably? From a uniform distribution? Making the Bayesian’s guess about the a priori true, even though they can’t know that? No wonder they’ll do better then, if you set up the experiment with hidden parameters that the Bayesians will correctly guess at! It’s like they have access to some god-given truth that nobody else has!

So, to avoid rewarding one of the sides by making their guess true in advance (or punishing them by making it as false as possible), we need to set this experiment up without hidden parameters. We’re going to assume that we only have those two pirates (p=0.75 and p=2/3), and there’s a fixed chance that any mathematician will end up with each of them. So the problem then becomes that of correctly guessing from the observed data whose ship we’re on. (Observing the pirate doesn’t help, because they’re twins, remember?)

The statistician is of course going to do something with conditional probabilities, mumble about Type I and Type II errors, and come up with a decision.

If the Bayesian comes up with anything else, he’s going to do worse! So of course he won’t; he’ll be using a prior that takes into account that there are only these two pirates, and do essentially the same computations as everyone else.

So, Bayesians do well if their a priori assumption (guess) happens to be close to the actual facts; the further off it is, the worse they do. This is not surprising: if you guess lucky, you do well.

If the Bayesians are not guessing because all of the facts are known, then everybody else is just as good as they are.

You have hit on exactly the right idea. I’m working on a follow-up post that essentially describes the “experiment” you propose.

Pingback: A coin puzzle revisited | Possibly Wrong