This past week Warren Buffett announced that he is backing an offer of $1 billion to whoever picks a “perfect” bracket for this year’s NCAA basketball tournament. That is, you must correctly predict the winners of all 63 games in the 64-team, single-elimination tournament. (Or maybe it’s all 67 games among 68 teams, depending on the specific rules which have not yet been published.)

This news has typically been accompanied in the press by mention of the absurdly small probability of picking such a perfect bracket, usually quoted as “1 in 9,223,372,036,854,775,808,” or 1 in 9.2 quintillion, or 1 in 2^63. Intuitively, the idea is that there are 63 games over the course of the tournament, each with 2 possible outcomes. (I found it interesting that several other web sites seem to have latched onto a different figure of 1 in 4,294,967,296, or 1 in 2^32. This would correspond to picking just the 32 first round winners.)
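Both quoted figures are just powers of two, one per possible outcome of each game; a quick sanity check of the arithmetic:

```python
# Number of equally likely ways to fill out a bracket, treating
# each game as a coin flip with two possible outcomes.
games_full_bracket = 63   # 64-team single-elimination: 63 games total
games_first_round = 32    # first-round games only

full_bracket_outcomes = 2 ** games_full_bracket
first_round_outcomes = 2 ** games_first_round

print(full_bracket_outcomes)  # 9223372036854775808, i.e. ~9.2 quintillion
print(first_round_outcomes)   # 4294967296, i.e. ~4.3 billion
```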

To get a feel for how small this probability is, imagine being blindfolded and asked to solve each of *five* differently scrambled Rubik’s cubes. You have a slightly better chance of accidentally solving even *one* of the five cubes correctly than you have of picking a perfect bracket.
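This comparison can be checked with exact rational arithmetic (the cube-state count 43,252,003,274,489,856,000 is the standard figure for the 3×3×3 cube, not something stated in this post; exact fractions are needed because both probabilities are far below float rounding error near 1):

```python
from fractions import Fraction

# Standard count of reachable 3x3x3 Rubik's cube states (~4.3 * 10^19).
CUBE_STATES = 43252003274489856000

p_cube = Fraction(1, CUBE_STATES)   # chance one blind "solve" succeeds
p_bracket = Fraction(1, 2 ** 63)    # chance a random bracket is perfect

# Chance of solving at least one of five independently scrambled cubes.
# (1 - p_cube is ~1 - 2.3e-20, which would round to exactly 1.0 in floats.)
p_at_least_one = 1 - (1 - p_cube) ** 5

print(float(p_at_least_one))  # ~1.16e-19
print(float(p_bracket))       # ~1.08e-19
```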

Or do you? This back-of-the-envelope calculation essentially assumes that either (a) you know nothing about college basketball and simply pick the winner of each game at random; or (b) all teams are equally matched, so that the probability of either outcome of any game is 1/2. The latter is certainly not true; for example, over the past 29 years of the current tournament format, among the 116 first-round match-ups between 1st and 16th seeds, there has not been a single upset. More generally, the larger the *difference* between teams’ seeds, the more lopsided we might expect the outcome to be.

So the question that motivated this post is: can we improve on this calculation of the probability of picking a perfect NCAA tournament bracket? It seemed useful to start with some historical data about past tournament games. This turned out to be harder to find than I thought… at least in some conveniently machine-readable format. But after poking around on the web and writing a few quick-and-dirty parsing scripts, the following is some data that will help.

First, the following 16×16 matrix has elements *n(i, j)* indicating the total number of games played between teams seeded *i* and *j*, from 1985 through 2013. For example, the 116’s along the anti-diagonal correspond to the 29 years’ worth of first round games in each of the four regional brackets (1 vs 16, 2 vs 15, 3 vs 14, etc.):

 16  56  27  52  39  10   4  58  61   4   5  19   4   0   0 116
 56   2  47   7   4  29  67   5   1  42  12   1   0   0 116   0
 27  47   1   7   3  63   9   2   1  13  37   0   0 116   1   0
 52   7   7   1  62   4   2   6   2   2   0  30 116   0   0   0
 39   4   3  62   1   1   0   3   2   1   0 116  14   0   0   0
 10  29  63   4   1   0   6   1   0   6 116   0   0  14   0   0
  4  67   9   2   0   6   0   1   0 116   3   0   0   1   3   0
 58   5   2   6   3   1   1   0 116   0   1   1   1   0   0   0
 61   1   1   2   2   0   0 116   0   0   0   0   1   0   0   0
  4  42  13   2   1   6 116   0   0   0   1   0   0   1   4   0
  5  12  37   0   0 116   3   1   0   1   0   0   0   3   0   0
 19   1   0  30 116   0   0   1   0   0   0   0  11   0   0   0
  4   0   0 116  14   0   0   1   1   0   0  11   0   0   0   0
  0   0 116   0   0  14   1   0   0   1   3   0   0   0   0   0
  0 116   1   0   0   0   3   0   0   4   0   0   0   0   0   0
116   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

Next, the following 16×16 matrix has elements *w(i, j)* indicating the number of those games won by the team seeded *i*. (Note: the diagonal requires some careful handling, and is left as an exercise for the reader.)

 16  31  15  36  32   7   4  47  56   4   2  19   4   0   0 116
 25   2  29   3   0  22  50   2   0  25  11   1   0   0 109   0
 12  18   1   4   2  36   6   2   1   9  25   0   0  99   1   0
 16   4   3   1  34   2   2   2   2   2   0  18  91   0   0   0
  7   4   1  28   1   1   0   1   1   1   0  75  11   0   0   0
  3   7  27   2   0   0   3   0   0   4  77   0   0  12   0   0
  0  17   3   0   0   3   0   0   0  70   0   0   0   1   2   0
 11   3   0   4   2   1   1   0  56   0   1   0   1   0   0   0
  5   1   0   0   1   0   0  60   0   0   0   0   1   0   0   0
  0  17   4   0   0   2  46   0   0   0   0   0   0   1   4   0
  3   1  12   0   0  39   3   0   0   1   0   0   0   3   0   0
  0   0   0  12  41   0   0   1   0   0   0   0   8   0   0   0
  0   0   0  25   3   0   0   0   0   0   0   3   0   0   0   0
  0   0  17   0   0   2   0   0   0   0   0   0   0   0   0   0
  0   7   0   0   0   0   1   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

Using this data, it would be nice to be able to estimate the probability that a team seeded *i* will beat a team seeded *j*, as simply *w(i, j) / n(i, j)*… at least for those match-ups for which we *have* data. If we knew all of these probabilities, we could compute the desired overall probability that any given bracket correctly predicts all 63 games.
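For a few sample match-ups, with counts read from the matrices above, this estimate looks like:

```python
# Win/game counts for a few match-ups, read from the matrices above:
# (i, j) -> (w(i, j), n(i, j)).
record = {
    (1, 5): (32, 39),
    (1, 9): (56, 61),
    (2, 10): (25, 42),
    (1, 16): (116, 116),
}

def p_hat(i, j):
    """Empirical estimate of P(seed i beats seed j)."""
    wins, games = record[(i, j)]
    return wins / games

for (i, j) in sorted(record):
    print(f"P({i} beats {j}) ~ {p_hat(i, j):.3f}")
```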

Unfortunately, the result appears to be a bit of a mess if we plot these estimated probabilities as a function of the difference in seeds:

At first glance, the most “extreme” upsets shown here seem to be those involving zero wins by the favored team. For example, a #2 seed lost to a #9 seed in the only such match-up (indicated by the point (7, 0) in the figure), and a #2 seed has never beaten a #5 seed in any of *four* separate games (the point (3, 0) in the figure).

These are pretty small samples, though… and so the 95% confidence intervals *surrounding* these point estimates are extremely wide. Wide enough, in fact, that they contain the simple linear model (shown by the red line in the figure, which I will describe shortly) for all but three of the 68 past match-ups:

- #1 beat #5 in 32 of 39 games. The linear model predicts P(1 beats 5) = 0.629, which falls below the 95% confidence interval (0.665, 0.925).
- #1 beat #9 in 56 of 61 games. The linear model predicts P(1 beats 9) = 0.758, which falls below the 95% confidence interval (0.819, 0.973).
- #2 beat #10 in only 25 of 42 games. The linear model predicts P(2 beats 10) = 0.758, which falls *above* the 95% confidence interval (0.433, 0.744).
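The post doesn’t say how these intervals were computed; exact Clopper–Pearson intervals come out close to the quoted values. A standard-library-only sketch, finding each bound by bisection on the binomial tail probabilities:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(wins, games, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial
    proportion, via bisection instead of beta-distribution quantiles."""
    def solve(pred, lo, hi):
        # Find the crossing point where pred flips from True to False.
        for _ in range(100):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if pred(mid) else (lo, mid)
        return (lo + hi) / 2
    # Lower bound: the p at which P(X >= wins | p) rises to alpha/2.
    lower = 0.0 if wins == 0 else solve(
        lambda p: 1 - binom_cdf(wins - 1, games, p) < alpha / 2, 0.0, 1.0)
    # Upper bound: the p at which P(X <= wins | p) falls to alpha/2.
    upper = 1.0 if wins == games else solve(
        lambda p: binom_cdf(wins, games, p) >= alpha / 2, 0.0, 1.0)
    return lower, upper
```

For example, `clopper_pearson(32, 39)` lands near the quoted (0.665, 0.925), and in each of the three cases above the linear model’s prediction falls outside the resulting interval.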

If we really think that seed difference can be a useful estimator, then we can improve the picture somewhat by actually *grouping* all match-ups with the same seed difference (e.g., collect games between #1 and #5 with those between #2 and #6, #3 and #7, etc.), and plot the resulting point estimates and narrower confidence intervals:
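That grouping can be sketched directly from the two matrices above (transcribed here into Python lists; `N[i][j]` and `W[i][j]` are the games and wins for seeds `i+1` vs. `j+1`):

```python
from collections import defaultdict

# Games matrix n(i, j) and wins matrix w(i, j), transcribed from above.
N = [
    [ 16,  56,  27,  52,  39,  10,   4,  58,  61,   4,   5,  19,   4,   0,   0, 116],
    [ 56,   2,  47,   7,   4,  29,  67,   5,   1,  42,  12,   1,   0,   0, 116,   0],
    [ 27,  47,   1,   7,   3,  63,   9,   2,   1,  13,  37,   0,   0, 116,   1,   0],
    [ 52,   7,   7,   1,  62,   4,   2,   6,   2,   2,   0,  30, 116,   0,   0,   0],
    [ 39,   4,   3,  62,   1,   1,   0,   3,   2,   1,   0, 116,  14,   0,   0,   0],
    [ 10,  29,  63,   4,   1,   0,   6,   1,   0,   6, 116,   0,   0,  14,   0,   0],
    [  4,  67,   9,   2,   0,   6,   0,   1,   0, 116,   3,   0,   0,   1,   3,   0],
    [ 58,   5,   2,   6,   3,   1,   1,   0, 116,   0,   1,   1,   1,   0,   0,   0],
    [ 61,   1,   1,   2,   2,   0,   0, 116,   0,   0,   0,   0,   1,   0,   0,   0],
    [  4,  42,  13,   2,   1,   6, 116,   0,   0,   0,   1,   0,   0,   1,   4,   0],
    [  5,  12,  37,   0,   0, 116,   3,   1,   0,   1,   0,   0,   0,   3,   0,   0],
    [ 19,   1,   0,  30, 116,   0,   0,   1,   0,   0,   0,   0,  11,   0,   0,   0],
    [  4,   0,   0, 116,  14,   0,   0,   1,   1,   0,   0,  11,   0,   0,   0,   0],
    [  0,   0, 116,   0,   0,  14,   1,   0,   0,   1,   3,   0,   0,   0,   0,   0],
    [  0, 116,   1,   0,   0,   0,   3,   0,   0,   4,   0,   0,   0,   0,   0,   0],
    [116,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
]
W = [
    [ 16,  31,  15,  36,  32,   7,   4,  47,  56,   4,   2,  19,   4,   0,   0, 116],
    [ 25,   2,  29,   3,   0,  22,  50,   2,   0,  25,  11,   1,   0,   0, 109,   0],
    [ 12,  18,   1,   4,   2,  36,   6,   2,   1,   9,  25,   0,   0,  99,   1,   0],
    [ 16,   4,   3,   1,  34,   2,   2,   2,   2,   2,   0,  18,  91,   0,   0,   0],
    [  7,   4,   1,  28,   1,   1,   0,   1,   1,   1,   0,  75,  11,   0,   0,   0],
    [  3,   7,  27,   2,   0,   0,   3,   0,   0,   4,  77,   0,   0,  12,   0,   0],
    [  0,  17,   3,   0,   0,   3,   0,   0,   0,  70,   0,   0,   0,   1,   2,   0],
    [ 11,   3,   0,   4,   2,   1,   1,   0,  56,   0,   1,   0,   1,   0,   0,   0],
    [  5,   1,   0,   0,   1,   0,   0,  60,   0,   0,   0,   0,   1,   0,   0,   0],
    [  0,  17,   4,   0,   0,   2,  46,   0,   0,   0,   0,   0,   0,   1,   4,   0],
    [  3,   1,  12,   0,   0,  39,   3,   0,   0,   1,   0,   0,   0,   3,   0,   0],
    [  0,   0,   0,  12,  41,   0,   0,   1,   0,   0,   0,   0,   8,   0,   0,   0],
    [  0,   0,   0,  25,   3,   0,   0,   0,   0,   0,   0,   3,   0,   0,   0,   0],
    [  0,   0,  17,   0,   0,   2,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
    [  0,   7,   0,   0,   0,   0,   1,   0,   0,   0,   0,   0,   0,   0,   0,   0],
    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
]

# Pool all match-ups with the same seed difference d = j - i.
pooled = defaultdict(lambda: [0, 0])  # d -> [wins by better seed, games]
for i in range(16):
    for j in range(i + 1, 16):
        pooled[j - i][0] += W[i][j]
        pooled[j - i][1] += N[i][j]

for d in sorted(pooled):
    wins, games = pooled[d]
    if games:
        print(f"d = {d:2d}: {wins:3d}/{games:3d} = {wins / games:.3f}")
```

The pooled estimates behave much better than the raw per-match-up ones; for example, the seed-difference-15 group is exactly 116/116, and the difference-8 group pools 201 games instead of at most 61.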

At any rate, there is arguably still room for refinement, but this simple linear model seems like a reasonable starting point. If we estimate the probability that seed *i* beats seed *j* as

P(*i* beats *j*) = 1/2 + (*j* − *i*) / 31

as shown by the red line in the above figure, then using this model, we can compute the overall probability of, say, a bracket with *no* upsets (i.e., always picking the higher-seeded team to win). **The resulting chance of winning with this bracket is approximately 1 in 241 billion**, much better odds than “1 in 9 quintillion”… but still a long shot, to say the least. And more importantly, *any* other bracket, with *any* upsets, has an even smaller chance of winning!
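Here is a sketch of that computation. The linear model’s slope is inferred from the two predictions quoted above (0.5 + 4/31 ≈ 0.629 and 0.5 + 8/31 ≈ 0.758), so treat the exact constant as an assumption:

```python
def p_win(i, j):
    """Modeled probability that seed i beats seed j; the 1/31 slope is
    inferred from the predictions quoted in the post (an assumption)."""
    return 0.5 + (j - i) / 31

# Seed match-ups in one regional bracket when there are no upsets.
region = (
    [(1, 16), (2, 15), (3, 14), (4, 13),
     (5, 12), (6, 11), (7, 10), (8, 9)]   # round 1
    + [(1, 8), (2, 7), (3, 6), (4, 5)]    # round 2
    + [(1, 4), (2, 3)]                    # regional semifinals
    + [(1, 2)]                            # regional final
)

p = 1.0
for i, j in region:
    p *= p_win(i, j)
p **= 4          # four identical regional brackets
p *= 0.5 ** 3    # Final Four: three games between #1 seeds, each 1/2

print(f"1 in {1 / p:,.0f}")
```

This comes out at roughly 1 in 240 billion, consistent (up to rounding of the inferred slope) with the 1-in-241-billion figure.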

To put this in perspective, similar to the approach in the video shown here (which seems to be a quick correction of the earlier “1 in 9.2 quintillion” article linked above), suppose that we could get all 314 million people in the United States to:

- Submit a bracket,
- Coordinate their entries so that each of the 314 million brackets are distinct, and
- Select those 314 million distinct brackets that are the most likely to occur; i.e., with the fewest carefully selected “upsets”, involving the most reasonable match-ups (e.g., #8 losing to #9, as opposed to #1 losing to #16).

Even with this massively and optimally coordinated effort, the probability that *anyone* in the country has the perfect bracket (let alone that that person is *you*) is only about 0.001, or about one-tenth of one percent.
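That 0.001 figure agrees with a simple upper bound: no bracket is more likely than the no-upset bracket, so any 314 million distinct brackets cover at most 314 million times its probability. Using the 1-in-241-billion figure from above:

```python
num_brackets = 314_000_000            # one distinct bracket per US resident
p_best_bracket = 1 / 241_000_000_000  # the no-upset bracket, from above

# The 314 million brackets are distinct, so the chance that any one of
# them is perfect is the sum of their probabilities, which is at most:
upper_bound = num_brackets * p_best_bracket
print(f"{upper_bound:.4f}")  # 0.0013
```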

My wager: Buffett’s extra billion is safe.

For a billion dollars, how many of these games could you rig the outcome of? ;-P
