The clueless student

The following problem appeared in Marilyn vos Savant’s “Ask Marilyn” column in the 25 July 2004 issue of Parade magazine:

A high school student who hadn’t opened his American history book in weeks was dismayed to walk into class and be greeted with a pop quiz.  It was in the form of two lists, one naming the 24 presidents in office during the 19th century in alphabetical order and another list noting their terms in office, but scrambled.  The object was to match the presidents with their terms.  The completely clueless student had to guess every time.  On average, how many did he guess correctly? — Bob Hendel

This is a nice problem, with an equally nice answer, which Marilyn boiled down to just two words: “Only one!”  That the average number of correct guesses is exactly one, no matter how many questions (i.e., presidents and terms) in the quiz, is a great demonstration of the power of linearity of expectation.  But I wonder if Marilyn recognized that there are at least two reasonable interpretations of how the student might “guess every time”:

1. Match each president with a term, using each available term exactly once.  In other words, select a random permutation of the terms.
2. Match each president with a randomly selected term, where each term may be used more than once, or not at all.  In other words, select a random function mapping presidents to terms.

Fortunately, it doesn’t matter; the average number of correct guesses– one– is the same in either case.

But now let’s make the problem more interesting: suppose that this 24-question quiz is the last exam of the school year, and our clueless student needs just one correct answer to pass the class.  What should he do?  Using either of the two methods above, the probability that he gets every answer wrong, and fails the class, is about $1/e$, or about 0.37.  (The actual probabilities are different for the two methods, but both converge to $1/e$ in the limit as the number of questions grows large.)

There is a better strategy, though: if he instead selects one term of office– any term of office– and writes that same term for every president in the quiz, then he is guaranteed to get exactly one answer correct!

The motivation for this post is to generalize this problem: given a matching quiz with $n$ questions and $n$ corresponding possible answers, and a desired “target” score of $1 \leq m \leq n$, what guessing strategy maximizes the probability of at least $m$ correct answers?  How are such strategies characterized, and given such a strategy, how do we compute the corresponding probability of success?

Train tracks and graph theory

I think games and toys, even those for young children, can often be a great source of interesting mathematics.  Recently a friend of mine was helping his sons build a train track using the pieces shown below:

An unfinished train track, with all available track pieces shown.

As you can see, this particular track is “incomplete”: one track segment has an end that is not interlocked (in jigsaw-puzzle fashion) with any other piece.  Let’s say that a track is complete if it doesn’t have this problem; that is, every piece used has each of its puzzle-piece “ends” interlocked with some other piece in the track.

My friend wondered whether it was possible to build a complete track, using all of the available pieces shown in the photo?  It turns out that it is not possible… and so the next question is, how to prove it?

This problem has a flavor very similar to the so-called mutilated chessboard, in that (1) it seems difficult at first glance to prove that you can’t do something, no matter how hard you try; and yet (2) thinking about the problem in just the right (but not necessarily obvious) way yields a straightforward proof.

In this case, suppose that we view a complete track as a graph, consisting of vertices (drawn as points) and edges (drawn as line segments connecting pairs of vertices).  Each track piece is a vertex, and an edge between vertices indicates that the corresponding track pieces are interlocked, as shown below, with vertices in blue and edges in green:

The corresponding graph, with each vertex (blue) representing a track piece, and each edge (green) representing an interlocking connection between pieces. The dashed yellow edge would complete the track/graph.

If a track is complete, then every possible connection has an edge associated with it.  Let us temporarily suppose that the track shown above is actually complete, that the “three-way switch” piece is actually a four-way switch, and that the extra dashed yellow edge represents the final required connection.

Now, given such a graph representing a complete track, how many edges does it have?  We could just count up the green and yellow line segments, and find that there are 49 edges… but instead, let us examine each vertex $v$ in turn, computing its degree $d(v)$, or the number of edges emanating from it.  We will see why this is handy shortly.

Most of the vertices, corresponding to the simple track segments with two ends, have degree 2.  But some of the “switch” track pieces have degree 3, and one of our temporarily-augmented pieces now has degree 4.  If we add up the degrees of all of the vertices, we find that the total in this case is 98… exactly twice the number of edges!

This is no coincidence.  In general, given any undirected graph, the sum of its vertex degrees equals twice the number of edges.  To see why, consider that in the summation, each edge in the graph gets “counted” exactly twice, once for each of its two vertex endpoints.

An immediate corollary is that, in any such graph, the sum of the vertex degrees must be even.  And this leads to the desired result: if we expect to have any chance of using all available track pieces to build a complete track, then the sum of the degrees– i.e., total number of interlock points– of all of the pieces must be even.  But as shown in the original photo above, in addition to the simple track segments with degree 2, there are five pieces with degree 3, yielding an odd total.

Floating-point round trips revisited

After a couple of interesting conversations on the subject, I think I should try to clear up some potentially confusing comments I made in my last post about converting binary floating-point values to decimal strings… and back to binary floating point again.

Injection != round-trip identity

“… You need the representation to be one-to-one, so that distinct finite values always produce distinct output.  More precisely, we would like to be able to (1) convert a floating-point value $x$ to a string, then (2) convert the string back to floating-point, recovering exactly the original value $x$.”

In the above quote, “more precisely” is a misleading choice of words, since it suggests that these two properties (one-to-one-ness and round-tripping) are equivalent, when they are not.  For a simplest possible example, consider the correctly rounded floating-point conversion from single-bit binary to single-digit decimal.  This conversion is one-to-one: distinct powers of two always map to distinct single-digit multiples of powers of 10.  However, the round-trip conversion, from binary to decimal and back to correctly rounded decimal, is not the identity.  It’s an exercise for the reader to verify that the value $2^{27}$, used by Matula in Reference (2) below, is a minimal counterexample.

To see why this distinction matters, consider the more general problem of the correctly rounded conversion from $n$-digit base $\beta$ to $m$-digit base $\nu$… where $\beta$ and $\nu$ are not “variants of a common base” (more on this later).  Then as shown in Reference (1), the resulting function is one-to-one if and only if

$\nu^{m-1} \geq \beta^n - 1$

However, we want more than the ability to simply invert this one-to-one conversion function: instead, we would like to be able to compose this conversion function with another correctly rounded conversion back in the other direction, and yield the identity.  As shown in Reference (2), this composition yields the identity if and only if the following stronger condition holds:

$\nu^{m-1} > \beta^n$

Fortunately, in the case where $\beta=2$ and $\nu=10$, the single-bit-to-single-digit wrinkle mentioned above (i.e., $n=m=1$) is the only situation where these two conditions are not equivalent, and so in practice when using floats and doubles this distinction is less important.

Variants of a common base

It’s another exercise for the reader to verify that the second inequality above may be solved for a minimal number $m$ of decimal digits required for a successful round-trip conversion from $n$-bit binary, to yield the formula from the last post:

$m_{min} = \lceil{n \log_{10}(2)}\rceil + 1$

However, there is an important detail involved in turning that strict inequality into this ceiling expression, that is left out of many discussions that I see online (at least where an attempt is made to generalize beyond binary-decimal conversions): this only works since $n \log_{10}(2)$ is not an integer.  That is, base 2 and base 10 are not “variants [i.e., powers] of a common base,” as discussed in the references below.  For conversions between bases that are powers of some common base, these formulas do not hold in general.  (Consider various conversions between binary and hexadecimal, for example, again as discussed in the references below.)

References:

1. Matula, David W., The Base Conversion Theorem, Proceedings of the American Mathematical Society, 19(3) June 1968, p. 716-723 [PDF]
2. Matula, David W., In-and-Out Conversions, Communications of the ACM, 11(1) January 1968, p. 47-50 [ACM]

Floating-point round trips

I often need to display or record output of floating-point values, in a human-readable decimal format.  This is easy enough in most programming languages… unless you need the representation to be one-to-one, so that distinct finite values always produce distinct output.  More precisely, we would like to be able to (1) convert a floating-point value $x$ to a string, then (2) convert the string back to floating-point, recovering exactly the original value $x$.

For example, I use MATLAB more than I care to, and despite its primary application being numeric scientific computing, none of its built-in output formats support this.  The simplest way that I know of to do this is actually via MATLAB’s interface to Java:

>> java.lang.Double.toString(1.1 + 2.2 - 3.3)

In C++, you can set the precision of an output stream to a desired number of decimal digits.  In C++11, the constant max_digits10 is defined just for this purpose:

#include <sstream> // ostringstream
#include <iomanip> // setprecision
#include <limits> // numeric_limits

template<typename T>
std::string to_string(T x)
{
std::ostringstream os;
os << std::setprecision(std::numeric_limits<T>::max_digits10) << x;
return os.str();
}

Unfortunately, I learned the hard way that Visual Studio 2010 gets this wrong, at least for single-precision float, defining max_digits10 in that case to be 8 instead of 9 as it should be.

For compatibility with older compilers– either with bugs like VS 2010, or simply pre-C++11– the good news is that we can compute the required number of digits $d$, as a function of the number of significant bits $b$ in the underlying floating-point representation (which is in turn defined in the standard header <limits> as numeric_limits<T>::digits):

$d = \left \lceil{b \log_{10}(2)}\right \rceil + 1$

where $b=24$ for float and $b=53$ for double, yielding $d=9$ and $d=17$, respectively.  (Since I can’t seem to write a post without including an obligatory puzzle: why is the +1 needed in the above equation?)

Interestingly, the following alternative formula also works, with the added benefit that it can be evaluated at compile time:

max_digits10 = std::numeric_limits<T>::digits * 3010 / 10000;

I’m not sure why this particular approximation of $\log_{10}(2)$ is used, since 301/1000 works equally well; both are correct for any number of bits less than 196.

(Edit: The last time I wrote about floating-point issues, I mentioned Bruce Dawson’s Random ASCII blog, in particular the excellent series of articles about floating point.  I should have done so here as well.  In this case, the specific relevant article is here… but instead of the brute force enumeration approach described there explaining the “+1” in the formula above– and how often it is actually needed– my motivation for this post, in the form of the “puzzle,” was to suggest a pencil-and-paper derivation of that formula, which would in turn yield a direct calculation of how often that +1 is needed.)

Posted in Uncategorized | 3 Comments

The fourth-best for the job

The subject of this post is a puzzle, a variant of the so-called secretary problem: suppose that you need to hire one of $n$ candidates for a job.  You must interview one candidate at a time in random order, and after each interview you must decide whether to accept the candidate and offer him or her the position, or to reject the candidate and move on to the next interview.  You cannot return to a previously rejected candidate.

Suppose that the candidates’ interview results are not quantifiable, but they are comparable, so that at any point in the process you can rank in strict preference order the candidates you have interviewed so far.  What should be your strategy for selecting the “best” candidate?

In the classical statement of the problem, the objective is to maximize the probability of selecting the overall best candidate.  The “popular” nice result is that the corresponding optimal strategy has a very simple form, and a nearly fixed success rate, no matter the total number $n$ of candidates: interview the first $n/e$ candidates (about 36.8% of them), rejecting them outright, then accept the first subsequent candidate that is the best so far.  The resulting probability of selecting the best overall candidate is approximately $1/e$, or about 0.368.

I remember first encountering this problem, and thinking that the goal was unrealistic: for example, consider the extreme situation where you have interviewed the 99th of 100 candidates, and he or she is the second-best so far.  The optimal strategy dictates rejecting the candidate– which is at worst third-best overall– and rolling the dice on the final candidate.

Problem: Instead, suppose that the objective is to minimize the expected overall rank of the selected candidate.  That is, if $X$ is the random variable indicating the final overall rank of the selected candidate, then instead of maximizing $P(X=1)$, we want to minimize $E(X)$.  What is the optimal strategy and resulting expected rank for, say, one hundred candidates?  One million candidates?

(Note that this seemingly natural variant of the problem is missing from the Wikipedia page, despite the inclusion of several other similar excursions.)

I think this is a great problem, particularly as a programming exercise.  It can be solved with a relatively simple recursive implementation in less than a dozen lines of code.  But with one million candidates, that simple implementation bogs down, and requires some re-factoring.  The resulting more efficient solution still involves less than a dozen lines of code… and I think the answer is just as interesting as the $1/e$ result in the original problem.

NCAA bracket scoring systems

Introduction

During the 2015 NCAA men’s basketball tournament, I won our office pool by (1) picking then-undefeated Kentucky to lose– although earlier than their actual Final Four loss to Wisconsin– and (2) picking Duke to win the championship game.  It was a come-from-behind victory for my bracket, moving from 14th place to 7th to 1st… over the span of the last three games in the 63-game tournament.

But should I have won?  Our pool used the common bracket scoring system of assigning:

• 1 point for each correct pick in the first round of 64 teams,
• 2 points for each correct pick in the second round of 32 teams,
• 4 points for each correct pick in the third round of 16 teams,
• 8 points for each correct pick in the fourth round of 8 teams,
• 16 points for each correct pick in the two Final Four games,
• 32 points for correctly picking the champion.

This “doubling” system has several reasonable mathematical motivations.  For example, each round of games is potentially worth the same number of points (32).  Also, assuming all of the teams are evenly matched– or equivalently, assuming that you make all of your picks by flipping a fair coin– then the expected number of points scored decreases by exactly half with each round.

But teams aren’t evenly matched, and you don’t make your picks by flipping coins.  Intuitively, then, it seems like this doubling system might be over-weighting the importance of later rounds, and perhaps a better system involves less extreme increases in points per game from one round to the next.  One of the more amusing common suggestions is a progression based on the Fibonacci sequence, with games in each round worth 2, 3, 5, 8, 13, and 21 points, respectively.  My goal in this post is to describe a means of more accurately evaluating and comparing these and other bracket scoring systems.

Probability model for tournament games

First, we need a way to model the probability of correctly picking any particular game.  A reasonably simple starting point is to assume that all games are independent, with each outcome’s probability depending only on the teams’ seeds.  More precisely, let P be a 16×16 matrix with entries

$p_{i,j} = 1 - p_{j,i} = \frac{1}{2} + k(s_i - s_j)$

indicating the probability that seed i beats seed j, where $s_i$ is some measure of “strength” of seed i (decreasing in i), and k is a scaling factor that effectively determines the range of resulting probabilities.  For example, if $k=0$, then every game is a coin flip; at the other extreme, if $k=1/(2(s_1 - s_{16}))$, then a 16th seed has zero probability of a first-round upset against a 1st seed.  For this discussion, k will be chosen so that

$p_{1,16} = \frac{124+1}{124+2} = \frac{125}{126}$

based on the observation that, in 124 match-ups over the last 31 years of the current tournament format, a 1st seed has so far never lost to a 16th seed.  This probability is the expected value of the corresponding beta distribution.

I used a simple version of this model a year ago to estimate the probability of picking a “perfect bracket,” that is, picking all 63 games correctly, using a linear strength function:

$s_i = -i$

so that $p_{i,j}$ depends only on the difference between seeds.  Even this very simple model isn’t too bad, as shown in the following updated figure, with the linear prediction model in red, and the last 31 years of historical data shown in blue, with corresponding 95% confidence intervals in black.  As the often very wide confidence intervals suggest, 31 years is still not much data; for example, there have been only 7 match-ups between seeds differing by 10: 1st vs. 11th are split 3-3, and a single 2nd seed won over a 12th.

Probability of winning as a function of seed difference: point estimate (blue), 95% confidence interval (black), and linear prediction model (red).

As usual, it turns out that this was not a new idea; Schwertman et. al. (see References at the end of this post) considered this same model back in 1991, as well as another non-linear strength function that turns out to be a better historical fit:

$s_i = \Phi^{-1}(1 - \frac{4i}{n})$

where $\Phi^{-1}$ is the quantile function of the normal distribution, and $n=351$ is the total number of Division I men’s basketball teams.  The idea is that the “strengths” of all teams are normally distributed, with the 64 teams in the tournament comprising the “strongest” teams in the upper tail of this distribution.  I will use this strength function for the remainder of this discussion.

Computing probabilities of correct picks

Given whatever matrix P of probabilities we choose, we can use it to compute the resulting distribution of the seed winning any particular game in the tournament.  If $\mathbf{a}$ and $\mathbf{b}$ are 16-element column vectors with $a_i$ ($b_i$) indicating the probability that the home (visiting) team in a particular game is seeded i, then the distribution of the seed winning that game is given by

$\mathbf{a}\circ(P\mathbf{b}) + \mathbf{b}\circ(P\mathbf{a})$

where $\circ$ is the element-wise Hadamard product.  In the first round, each $\mathbf{a}$ and $\mathbf{b}$ is a basis vector.  Note that including both terms in the summation is really just a computational convenience, at least within a region, since for a given seed, only one of the two terms’ corresponding components will be non-zero.

By applying this formula iteratively for each game in each successive round, we can eventually compute the probability of each seed winning each game in the tournament.  For example, the following Python code computes the distribution of the winner of any one of the four regional championships (among 16 teams each):

import numpy as np

def bracket_seeds(num_teams):
"""Seed given number of teams in single-elimination tournament."""
seeds = np.array([1])
while len(seeds) < num_teams:
seeds = np.array([seeds,
2 * len(seeds) + 1 - seeds]).transpose().flatten()
return seeds

def dist_game(dist_a, dist_b, P):
"""Compute distribution of winner of a vs. b with probability model P."""
return dist_a * (P.dot(dist_b)) + dist_b * (P.dot(dist_a))

def dist_region(P):
"""Compute distribution of regional champion with probability model P."""
games = np.eye(16)[bracket_seeds(16) - 1]
for round in range(4):
games = [dist_game(games[k], games[k + 1], P)
for k in range(0, len(games), 2)]
return games

The resulting predicted probabilities are shown in the following figure in red– using the normal quantile strength function above– compared with the actual frequencies in blue.

Winner of regional championship: actual frequency (blue) and predicted probability (red).

Bracket scoring systems

Now that we have a means of computing the probability of any particular team winning any particular game, we can evaluate a completed bracket by computing the expected number of correct picks in each round.  For example, suppose that our bracket simply picks the favorite (i.e., the higher seed) to win every game.  Then the expected number of correct picks will be:

• 23.156 of 32 games in the first round,
• 9.847 of 16 games in the second round,
• 4.292 of 8 games in the third round,
• 1.792 of 4 games in the fourth round regional championships,
• 0.540 of 2 games in the Final Four,
• 0.156 of the final championship game.

At this point, we can compare various bracket scoring systems by comparing the expected number of points scored in each round using those systems.  For example, the following table shows the expected points per round for the two systems mentioned so far: the doubling system (1, 2, 4, 8, 16, 32) and the Fibonacci system (2, 3, 5, 8, 13, 21), normalized to 1 point per first-round game.

 Round Doubling Fibonacci/2 1 23.156 23.156 2 19.694 14.771 3 17.168 10.730 4 14.334 7.167 Final Four 8.636 3.508 Championship 4.978 1.633

Which of these or any other systems is “best” depends on what kind of pool you want.  With the doubling system (or even greater progressions), you can have an “exciting,” horse race-ish pool, with lead changes and multiple entries having a chance of winning throughout all six rounds.  With the Fibonacci system (or even more gradual progressions), you can have a pool that rewards research and accurate prediction of early-round upsets… but such a pool may be effectively over well before the Final Four.

References:

1. Schwertman, N., McCready, T., and Howard, L., Probability Models for the NCAA Regional Basketball Tournaments, The American Statistician, 45(1) February 1991, p. 35-38 [PDF]

Appendix: Historical data

The following matrices contain the record of all wins and losses, by round and seed match-up, for the 31 tournaments in the current format from 1985 through 2015.  First, the following 16×16 matrix indicates the number of regional games– that is, in the first through fourth rounds– in which seed i beat seed j.  Note that the round in which each game was played is also implicitly determined by the seed matchup (e.g., 1 vs. 16 is in the first round, etc.).

0  21  13  32  30   6   4  51  56   4   3  19   4   0   0 124
21   0  23   2   0  23  53   2   0  26  12   1   0   0 117   0
8  14   0   2   2  38   7   1   1   9  25   0   0 104   1   0
15   4   3   0  36   2   2   3   2   2   0  21  99   0   0   0
7   3   1  30   0   1   0   0   1   1   0  80  11   0   0   0
2   6  28   1   0   0   3   0   0   4  81   0   0  13   0   0
0  20   5   2   0   3   0   0   0  76   0   0   0   1   2   0
12   3   0   5   2   1   1   0  63   0   0   0   1   0   0   0
5   1   0   0   1   0   0  61   0   0   0   0   1   0   0   0
0  18   4   0   0   2  48   0   0   0   0   0   0   1   4   0
3   1  13   0   0  43   3   0   0   2   0   0   0   5   0   0
0   0   0  12  44   0   0   1   0   0   0   0   8   0   0   0
0   0   0  25   3   0   0   0   0   0   0   3   0   0   0   0
0   0  20   0   0   2   0   0   0   0   0   0   0   0   0   0
0   7   0   0   0   0   1   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

The following matrix, in the same format, is for (fifth round) Final Four games:

12   6   2   5   1   0   1   1   1   0   0   0   0   0   0   0
4   2   3   1   0   1   0   0   0   0   1   0   0   0   0   0
4   2   0   2   0   0   0   0   0   0   1   0   0   0   0   0
1   0   0   1   1   0   0   0   0   0   0   0   0   0   0   0
0   1   0   0   1   0   0   1   0   0   0   0   0   0   0   0
0   1   0   1   0   0   0   0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   2   0   0   0   0   0   0   0   0   1   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

And finally for championship games:

6   6   1   2   3   1   0   0   0   0   0   0   0   0   0   0
1   0   3   0   0   0   0   0   0   0   0   0   0   0   0   0
0   2   1   0   0   0   0   1   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

Introduction

I think a computer version of the paper-and-pencil game Hangman is a great programming project for students.  Actually, “Evil” Hangman is even better as a more advanced project: instead of the computer selecting a random dictionary word as usual at the start of the game– and remaining bound to that initial selection throughout the game– it merely selects a word length.  Then in response to each player guess, it “scores” the guess in a way that maximizes the number of possible dictionary words that are still consistent with the clues provided so far.  This is “evil” in that it is arguably cheating… but in a way that is indistinguishable from fair play from the perspective of the human guesser.

In any case, a necessary step in such a project is identifying a list of words to use as a dictionary.  A Scrabble dictionary seemed like a reasonable starting point… but considering playable words like zyzzyvas, I thought it might be useful to trim the dictionary down to a shorter list of “easier” words in more common use.  But how to determine which words are more common than others?  We can do that, with help from Google Books.

Word Lists

I started with the OTCWL2, the second edition of the Official Tournament and Club Word List used for Scrabble in the United States and Canada.  This is the latest “official” reference accessible in electronic form, containing 178,691 words of 2 to 15 letters.

The OSPD4 is the fourth edition of the Official Scrabble Players Dictionary, a less rigorously edited publication containing a subset of the words in the OTCWL2 with at most 8 letters (with a small tail of some longer inflections), and excluding words that are considered offensive or otherwise inappropriate for school and family games.  The figure below shows the distribution of word lengths for both of these word lists.

Distribution of word length in the OTCWL2 and OSPD4 Scrabble dictionaries.

To sort these lists of words by frequency of occurrence in natural language, I used the Google Books Ngrams data set (English Version 20120701), containing a list of numbers of occurrences of “1-grams”– roughly, whitespace-delimited tokens– in the more than 8 million books scanned by Google.  I aggregated these counts for all “words” containing only the case-insensitive alphabetic characters A-Z, with no other special characters, but with no other restriction on word length or frequency of occurrence.  (Peter Norvig did a very similar data reduction analysis recently, but (1) he discarded rarely-occurring but still “playable” words, and (2) I cannot reproduce his data; both I and another commenter observe identical word counts that are, strangely, roughly half what his analysis indicates.  I could use more eyes on this if someone wants to take a third look.)

The result is nearly 5 million distinct words, each from 1 to 133 letters in length, occurring over 378 billion times throughout the corpus.  The following figure shows the distribution of lengths of distinct words, truncated to the same 15-letter maximum for comparison with the figure above.

Distribution of lengths of distinct 1-grams in the Google Books corpus.

This distribution has a shape similar to that of the OTCWL2, with comparable average lengths of 8.86713 and 8.61226 letters for Scrabble words and 1-grams, respectively.  However, shorter words are more common in actual use: if we weight each 1-gram by its number of occurrences in the corpus, the resulting distribution shown below is skewed downward significantly, with an average length of only 4.84264 letters.

Distribution of length of 1-grams, weighted by number of occurrences.

Word Frequencies

Armed with frequencies of occurrence of all 1-grams in the Google Books corpus, we can now map Scrabble dictionary words to their corresponding frequencies, and sort.  The following table lists the top 150 most frequently occurring words, which are common to both OTCWL2 and OSPD4.  (See the usual location here to download the complete lists.)

the      26548583149
of       15482969531
and      11315969857
to        9673642739
in        8445476198
is        4192081707
that      4000373255
for       3272628244
it        2870028984
as        2850302594
was       2751347404
with      2591390604
be        2409421528
by        2351529575
on        2297245630
not       2261359962
he        2055218371
this      1913024043
are       1850210733
or        1833848095
his       1805683788
from      1734598284
at        1706718058
which     1570109342
but       1396171439
have      1388716420
an        1363116521
they      1231059987
you       1168867586
were      1135241275
their     1076488415
one       1074487479
all       1031379950
we        1028643905
can        832705210
her        816638617
has        816070042
there      811848203
been       809520620
if         782488523
more       773623990
when       759867879
will       742569455
would      736404476
who        732299236
so         724571145
no         700307542
she        696346928
other      691591710
its        684717118
may        659022046
these      652892753
what       611266211
them       599817013
than       592508982
some       591157776
him        590383891
time       583894803
into       572112004
only       567615117
do         558298911
such       550034776
my         541136893
new        536629127
out        524101460
also       519307893
two        517040036
any        491821918
up         487701411
first      461575987
could      443444946
our        437481406
then       436395729
most       430481826
see        423246243
me         409783516
should     407122235
after      401110215
said       378997761
very       373437241
between    372032027
many       366336877
over       357758573
like       354924476
those      345833904
did        344906403
now        339668198
even       338753297
well       337707194
where      335821693
must       331496692
people     329358030
through    323545720
how        315034600
same       312263785
work       310897825
being      308096733
because    306160671
man        304508612
life       303632580
before     302583542
each       300640666
much       300630798
under      292804847
years      281623235
way        279891710
great      278574742
state      276687623
both       271888302
use        267896038
upon       266216636
own        262133879
used       262032076
however    259603813
us         258486246
part       257902754
good       254969414
world      252934367
make       251401534
three      243893653
while      241720369
long       238665543
day        235502808
without    232791602
just       227081780
men        227046739
general    225306151
during     224086601
another    222347954
little     221427582
found      218308716
here       216154769
system     214324790
does       213337247
de         212139309
down       211277897
number     210832441
case       208128805
against    205222260
might      204780135
still      204131780
back       203273772
right      202859545
know       202770927
place      200698028
every      196653070

You may have noticed that the single-letter words a and I are missing, since Scrabble words always contain at least two letters.  After these two, which are 6th and 19th, respectively, the next most frequently occurring 1-grams that are not Scrabble-playable are (ignoring other single letters): American, York, IILondon, British, England, non, America, William, France, and vol.

Although Scrabble OTCWL2 dictionary words make up just about 3.3% of all 1-grams, they are not exclusively “common” words.  The following figure shows, among a given quantile x of most frequently occurring 1-grams in the corpus, the fraction y of the Scrabble dictionary appearing among those 1-grams.

Cumulative distribution of Scrabble-playable words among 1-grams (sorted by frequency).

If Scrabble words were the most common 1-grams, the red curve for example would increase linearly from (0,0) to (0.033,1) and clamp there.  The curves do increase sharply at the beginning, indicating that most of the dictionary is indeed among the most frequently occurring 1-grams.  But since the curves instead remain strictly increasing throughout, this indicates that there are Scrabble words of any desirable rarity: unpuckers, for example, is one of the more amusing of the 26 playable words that occurs the minimum of just 40 times in the entire corpus.

But 40 isn’t really the minimum; there are Scrabble words that do not appear even once as a 1-gram.  As indicated by the x=1 intercepts in the figure above, about 7% (12,867 to be exact) of the OTCWL2 words do not appear anywhere in the Google data set: zyzzyvas mentioned earlier is one such word.

Hangman Strategy

A final thought: as a programming project, Hangman is even more challenging with the computer acting as the guesser.  This DataGenetics blog post contains an interesting analysis of incremental improvements in strategy for the guesser in Hangman.  For example, the author rightly points out the distinction between guessing letters based on their frequency of occurrence in natural English text, and the better strategy of guessing based on the number of remaining possible words in the dictionary that contain a candidate letter.

However, this latter “greedy” strategy is still not optimal, even assuming uniformly random “non-evil” selection of the unknown secret word.  Rather than just maximizing the probability of a successful next guess, the player really wants to maximize the overall probability of winning— that is, of not exceeding the allowed total number of incorrect guesses.

For a realistic dictionary size, this is a much harder problem to solve.  But we can see that this is indeed strictly better than the greedy strategy with the following minimal counterexample: consider the dictionary with just four 2-letter words {am, an, at, me}, where the player is allowed zero incorrect guesses.  Using the greedy strategy described in the DataGenetics post, the probability of winning is 1/4; but an optimal strategy wins with probability 1/2.