# The odds of picking a “perfect” NCAA tournament bracket

This past week Warren Buffett announced that he is backing an offer of \$1 billion to whoever picks a “perfect” bracket for this year’s NCAA basketball tournament.  That is, you must correctly predict the winners of all 63 games in the 64-team, single-elimination tournament.  (Or maybe it’s all 67 games among 68 teams, depending on the specific rules which have not yet been published.)

This news has typically been accompanied in the press by mention of the absurdly small probability of picking such a perfect bracket, usually quoted as “1 in 9,223,372,036,854,775,808,” or 1 in 9.2 quintillion, or $1/2^{63}$.  Intuitively, the idea is that there are 63 games over the course of the tournament, each with 2 possible outcomes.  (I found it interesting that several other web sites seem to have latched onto a different figure of 1 in 4,294,967,296, or 1 in $2^{32}$.  This would correspond to picking just the first round winners.)

To get a feel for how small this probability is, imagine being blindfolded, and asked to blindly solve each of five differently scrambled Rubik’s cubes.  You have a slightly better chance of accidentally solving even one of the five cubes correctly than you have of picking a perfect bracket.

Or do you?  This back-of-the-envelope calculation essentially assumes that either (a) you know nothing about college basketball, and simply pick winners of each game at random; or (b) all teams are equally matched, so that the probability of any game is equal to 1/2.  The latter is certainly not true; for example, over the past 29 years of the current tournament format, among 116 first-round match-ups between 1st and 16th seeds, there has not been a single upset.  More generally, the larger the difference between teams’ seeds, the more lopsided we might expect the outcome to be.

So the question that motivated this post is: can we improve on this calculation of the probability of picking a perfect NCAA tournament bracket?  It seemed like it would be useful to start with some historical data about past tournament games.  This turned out to be harder to find than I thought… at least, in some conveniently machine-readable format.  But after poking around on the web and writing a few quick-and-dirty parsing scripts, following is some data that will help.

First, the following 16×16 matrix with elements $G_{i,j}$ indicates the total number of games played between teams seeded i and j, from 1985 through 2013.  For example, the 116’s along the anti-diagonal correspond to the 29 years’ worth of first round games in each of the four regional brackets (1 vs 16, 2 vs 15, 3 vs 14, etc.):

  16  56  27  52  39  10   4  58  61   4   5  19   4   0   0 116
56   2  47   7   4  29  67   5   1  42  12   1   0   0 116   0
27  47   1   7   3  63   9   2   1  13  37   0   0 116   1   0
52   7   7   1  62   4   2   6   2   2   0  30 116   0   0   0
39   4   3  62   1   1   0   3   2   1   0 116  14   0   0   0
10  29  63   4   1   0   6   1   0   6 116   0   0  14   0   0
4  67   9   2   0   6   0   1   0 116   3   0   0   1   3   0
58   5   2   6   3   1   1   0 116   0   1   1   1   0   0   0
61   1   1   2   2   0   0 116   0   0   0   0   1   0   0   0
4  42  13   2   1   6 116   0   0   0   1   0   0   1   4   0
5  12  37   0   0 116   3   1   0   1   0   0   0   3   0   0
19   1   0  30 116   0   0   1   0   0   0   0  11   0   0   0
4   0   0 116  14   0   0   1   1   0   0  11   0   0   0   0
0   0 116   0   0  14   1   0   0   1   3   0   0   0   0   0
0 116   1   0   0   0   3   0   0   4   0   0   0   0   0   0
116   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0


Next, the following 16×16 matrix with elements $W_{i,j}$ indicates the number of those $G_{i,j}$ games won by the team seeded i.  (Note: the diagonal requires some careful handling, and is left as an exercise for the reader.)

  16  31  15  36  32   7   4  47  56   4   2  19   4   0   0 116
25   2  29   3   0  22  50   2   0  25  11   1   0   0 109   0
12  18   1   4   2  36   6   2   1   9  25   0   0  99   1   0
16   4   3   1  34   2   2   2   2   2   0  18  91   0   0   0
7   4   1  28   1   1   0   1   1   1   0  75  11   0   0   0
3   7  27   2   0   0   3   0   0   4  77   0   0  12   0   0
0  17   3   0   0   3   0   0   0  70   0   0   0   1   2   0
11   3   0   4   2   1   1   0  56   0   1   0   1   0   0   0
5   1   0   0   1   0   0  60   0   0   0   0   1   0   0   0
0  17   4   0   0   2  46   0   0   0   0   0   0   1   4   0
3   1  12   0   0  39   3   0   0   1   0   0   0   3   0   0
0   0   0  12  41   0   0   1   0   0   0   0   8   0   0   0
0   0   0  25   3   0   0   0   0   0   0   3   0   0   0   0
0   0  17   0   0   2   0   0   0   0   0   0   0   0   0   0
0   7   0   0   0   0   1   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0


Using this data, it would be nice to be able to estimate the probability $P_{i,j}$ that a team seeded i will beat a team seeded j, as simply $W_{i,j}/G_{i,j}$… at least for those match-ups for which we have data.  If we knew all of these probabilities, we could compute the desired overall probability that any given bracket correctly predicts all 63 games.

Unfortunately, the result appears to be a bit of a mess, if we plot these estimated probabilities as a function of the difference in seeds:

Probability of higher (i.e., favored) seed i beating seed j, vs. the difference in seeds j-i. Historical data shown in blue, simple linear model shown in red.

At first glance, the most “extreme” upsets shown here seem to be those involving zero wins by the favored team.  For example, a #2 seed lost to a #9 seed in the only such match-up (indicated by the point (7,0) in the figure), and a #2 seed has never beat a #5 seed in any of four separate games (the point (3,0) in the figure).

But these are pretty small samples, though… and so the 95% confidence intervals surrounding these point estimates are extremely wide.  Wide enough, in fact, that they contain the simple linear model (shown by the red line in the figure, which I will describe shortly) for all but three of the 68 past match-ups:

• #1 beat #5 in 32 of 39 games.  The linear model predicts P(1 beats 5)=0.629, which is less than the 95% confidence interval (0.665, 0.925).
• #1 beat #9 in 56 of 61 games.  The linear model predicts P(1 beats 9)=0.758, which is less than the 95% confidence interval (0.819, 0.973).
• #2 beat #10 in only 25 of 42 games.  The linear model predicts P(2 beats 10)=0.758, which is greater than the 95% confidence interval (0.433, 0.744).

If we really think that seed difference can be a useful estimator, then we can improve the picture somewhat by actually grouping all match-ups with the same seed difference (e.g., collect games between #1 and #5 with those between #2 and #6, #3 and #7, etc.), and plot the resulting point estimates and narrower confidence intervals:

Probability of favored team winning vs. the difference in seeds. Historical data aggregated over all match-ups with the given seed difference, with 95% confidence intervals.

At any rate, there is arguably still room for refinement, but this simple linear model seems like a reasonable starting point.  If we estimate the probability that seed i beats seed j as

$P_{i,j} = \frac{1}{2} + \frac{j-i}{31}$

as shown by the red line in the above figure, then using this model, we can compute the overall probability of, say, a bracket with no upsets (i.e., always picking the higher-seeded team to win).  The resulting chance of winning with this bracket is approximately 1 in 241 billion, much better odds than “1 in 9 quintillion”… but still a long shot to say the least.  And more importantly, any other bracket, with any upsets, has an even smaller chance of winning!

To put it in a perspective similar to that given in the video shown here (which seems to be a quick correction of the earlier “1 in 9.2 quintillion” article linked above), suppose that we could get all 314 million people in the United States to:

1. Submit a bracket,
2. Coordinate their entries so that each of the 314 million brackets are distinct, and
3. Select those 314 million distinct brackets that are the most likely to occur; i.e., with the fewest carefully selected “upsets”, involving the most reasonable match-ups (e.g., #8 losing to #9, as opposed to #1 losing to #16).

Even with this massively and optimally coordinated effort, the probability that anyone in the country has the perfect bracket– let alone whether that person is you— is only about 0.001, or about one-tenth of one percent.

My wager: Buffett’s extra billion is safe.

# Computing positions and eclipses of the Sun and Moon

Looking ahead to this new year, some of us here in the United States will have several opportunities to see the Earth and Moon get in each other’s way, so to speak.  There will be a lunar eclipse on 14-15 April 2014, and another in October.  There will also be a partial solar eclipse on 23 October 2014… and a total solar eclipse three years from now, on 21 August 2017.

For example, the following simple animation shows what the 2017 eclipse will look like from my dad’s house, which is slightly south of the path of totality.

Partial solar eclipse on 21 August 2017 as seen from my dad’s house. Times are UTC.

Predicting eclipses requires knowing where the Sun and Moon will be in relation to the Earth at any given time.  (Actually, I guess it really only requires knowing which web sites have prediction/animation tools.  NASA has a good site for predicting eclipses at arbitrary latitude/longitude locations, but without corresponding visualizations.  And timeanddate.com has some nice animated visualizations of eclipses, but only for canned locations.)

Computing the location of the Sun and/or Moon presents some interesting challenges.  In particular, the complexity of the solution depends a lot on how much accuracy is required.  The references at the end of this post provide equations with accuracies that are suitable for the kinds of applications I am interested in… the trick is simply stitching together those equations, coordinate frames, etc.

The result is a couple of functions that can be handy in many different exercises, from recreational to educational to work-related: predicting eclipses, predicting the visible illumination of satellite passes (e.g., the International Space Station), evaluating the accuracy of trying to use an analog watch as a compass, and extending the accuracy of ballistically propagated satellite orbits.

First, following is Python code for computing the position of the Sun, in ECEF coordinates, with an accuracy of “approximately 0.01 degree.”  Time is specified in seconds since 1970-01-01T00:00:00Z.

import math

def sun_position(t):
"""Return ECEF position (m) of the sun at time t (sec. since 1970).
"""
# Reference: Meeus, Jean, Astronomical Algorithms (2nd Ed.). Richmond:
# Willmann-Bell, Inc., 2009, Ch. 25.
d = t / 86400.0 - 10957.5
T = d / 36525
L = math.radians(280.46646 + T * (36000.76983 + 0.0003032 * T))
M = math.radians(357.52911 + T * (35999.05029 - 0.0001537 * T))
e = 0.016708634 - T * (0.000042037 + 0.0000001267 * T)
C = math.radians((1.914602 - T * (0.004817 + 0.000014 * T)) * math.sin(M) +
(0.019993 - 0.000101 * T) * math.sin(2 * M) +
0.000289 * math.sin(3 * M))
s = L + C
v = M + C
R = 149597870000 * (1.000001018 * (1 - e * e) / (1 + e * math.cos(v)))
omega = math.radians(125.04 - 1934.136 * T)
lon = math.radians(math.degrees(s) - 0.00569 - 0.00478 * math.sin(omega))
epsilon = math.radians(23.0 + (26.0 + (21.448 - T * (46.8150 + T *
(0.00059 - 0.001813 * T))) / 60) / 60 +
0.00256 * math.cos(omega))
x = R * math.cos(lon)
y = R * math.sin(lon) * math.cos(epsilon)
z = R * math.sin(lon) * math.sin(epsilon)
theta = math.radians(280.46061837 + 360.98564736629 * d + T * T *
(0.000387933 - T / 38710000))
s = math.sin(theta)
c = math.cos(theta)
return (x * c + y * s, -x * s + y * c, z)


Next is the corresponding function for the Moon, with an accuracy of “several arcminutes and about 500 kilometers in the lunar distance.”

import math

def moon_position(t):
"""Return ECEF position (m) of the moon at time t (sec. since 1970).
"""
# Reference: Montenbruck and Gill, Satellite Orbits. Berlin: Springer,
# 2005, Ch. 3.3.2.
d2000 = t / 86400.0 - 10957.5
T = d2000 / 36525
L0 = math.radians(218.31617 + 481267.88088 * T - 1.3972 * T)
l = math.radians(134.96292 + 477198.86753 * T)
lp = math.radians(357.52543 + 35999.04944 * T)
F = math.radians(93.27283 + 483202.01873 * T)
D = math.radians(297.85027 + 445267.11135 * T)

# Longitude with respect to the equinox and ecliptic of the year 2000.
(22640 * math.sin(l) + 769 * math.sin(2 * l)
- 4586 * math.sin(l - 2 * D)
+ 2370 * math.sin(2 * D) - 668 * math.sin(lp) - 412 * math.sin(2 * F)
- 212 * math.sin(2 * l - 2 * D) - 206 * math.sin(l + lp - 2 * D)
+ 192 * math.sin(l + 2 * D) - 165 * math.sin(lp - 2 * D)
+ 148 * math.sin(l - lp) - 125 * math.sin(D)
- 110 * math.sin(l + lp) - 55 * math.sin(2 * F - 2 * D)) / 3600)

# Lunar latitude.
(18520 * math.sin(F + lon - L0 + math.radians(
(412 * math.sin(2 * F) + 541 * math.sin(lp)) / 3600))
- 526 * math.sin(F - 2 * D) + 44 * math.sin(l + F - 2 * D)
- 31 * math.sin(-l + F - 2 * D) - 25 * math.sin(-2 * l + F)
- 23 * math.sin(lp + F - 2 * D) + 21 * math.sin(-l + F)
+ 11 * math.sin(-lp + F - 2 * D)) / 3600)

# Distance from the center of the Earth.
r = (385000 - 20905 * math.cos(l) - 3699 * math.cos(2 * D - l)
- 2956 * math.cos(2 * D) - 570 * math.cos(2 * l)
+ 246 * math.cos(2 * l - 2 * D)
- 205 * math.cos(lp - 2 * D) - 171 * math.cos(l + 2 * D)
- 152 * math.cos(l + lp - 2 * D)) * 1000

# Convert ecliptic to equatorial Cartesian coordinates.
lon = lon + math.radians(1.3972 * T)
x = r * math.cos(lon) * math.cos(beta)
y = r * math.sin(lon) * math.cos(beta)
z = r * math.sin(beta)
s = math.sin(-epsilon)
c = math.cos(-epsilon)
y, z = (y * c + z * s,
-y * s + z * c)

# Convert to ECEF.
theta = math.radians(280.46061837 + 360.98564736629 * d2000)
s = math.sin(theta)
c = math.cos(theta)
return (x * c + y * s, -x * s + y * c, z)


References:

1. Meeus, Jean, Astronomical Algorithms (2nd Ed.). Richmond: Willmann-Bell, Inc., 2009, Ch. 25.
2. Montenbruck and Gill, Satellite Orbits. Berlin: Springer, 2005, Ch. 3.3.2.