Hangman, Scrabble, and Google Books


I think a computer version of the paper-and-pencil game Hangman is a great programming project for students.  Actually, “Evil” Hangman is even better as a more advanced project: instead of the computer selecting a random dictionary word as usual at the start of the game– and remaining bound to that initial selection throughout the game– it merely selects a word length.  Then in response to each player guess, it “scores” the guess in a way that maximizes the number of possible dictionary words that are still consistent with the clues provided so far.  This is “evil” in that it is arguably cheating… but in a way that is indistinguishable from fair play from the perspective of the human guesser.

In any case, a necessary step in such a project is identifying a list of words to use as a dictionary.  A Scrabble dictionary seemed like a reasonable starting point… but considering playable words like zyzzyvas, I thought it might be useful to trim the dictionary down to a shorter list of “easier” words in more common use.  But how to determine which words are more common than others?  We can do that, with help from Google Books.

Word Lists

I started with the OTCWL2, the second edition of the Official Tournament and Club Word List used for Scrabble in the United States and Canada.  This is the latest “official” reference accessible in electronic form, containing 178,691 words of 2 to 15 letters.

The OSPD4 is the fourth edition of the Official Scrabble Players Dictionary, a less rigorously edited publication containing a subset of the words in the OTCWL2 with at most 8 letters (with a small tail of some longer inflections), and excluding words that are considered offensive or otherwise inappropriate for school and family games.  The figure below shows the distribution of word lengths for both of these word lists.

Distribution of word length in the OTCWL2 and OSPD4 Scrabble dictionaries.

Distribution of word length in the OTCWL2 and OSPD4 Scrabble dictionaries.

To sort these lists of words by frequency of occurrence in natural language, I used the Google Books Ngrams data set (English Version 20120701), containing a list of numbers of occurrences of “1-grams”– roughly, whitespace-delimited tokens– in the more than 8 million books scanned by Google.  I aggregated these counts for all “words” containing only the case-insensitive alphabetic characters A-Z, with no other special characters, but with no other restriction on word length or frequency of occurrence.  (Peter Norvig did a very similar data reduction analysis recently, but (1) he discarded rarely-occurring but still “playable” words, and (2) I cannot reproduce his data; both I and another commenter observe identical word counts that are, strangely, roughly half what his analysis indicates.  I could use more eyes on this if someone wants to take a third look.)

The result is nearly 5 million distinct words, each from 1 to 133 letters in length, occurring over 378 billion times throughout the corpus.  The following figure shows the distribution of lengths of distinct words, truncated to the same 15-letter maximum for comparison with the figure above.

Distribution of lengths of distinct 1-grams in the Google Books corpus.

Distribution of lengths of distinct 1-grams in the Google Books corpus.

This distribution has a shape similar to that of the OTCWL2, with comparable average lengths of 8.86713 and 8.61226 letters for Scrabble words and 1-grams, respectively.  However, shorter words are more common in actual use: if we weight each 1-gram by its number of occurrences in the corpus, the resulting distribution shown below is skewed downward significantly, with an average length of only 4.84264 letters.

Distribution of length of 1-grams, weighted by number of occurrences.

Distribution of length of 1-grams, weighted by number of occurrences.

Word Frequencies

Armed with frequencies of occurrence of all 1-grams in the Google Books corpus, we can now map Scrabble dictionary words to their corresponding frequencies, and sort.  The following table lists the top 150 most frequently occurring words, which are common to both OTCWL2 and OSPD4.  (See the usual location here to download the complete lists.)

the      26548583149
of       15482969531
and      11315969857
to        9673642739
in        8445476198
is        4192081707
that      4000373255
for       3272628244
it        2870028984
as        2850302594
was       2751347404
with      2591390604
be        2409421528
by        2351529575
on        2297245630
not       2261359962
he        2055218371
this      1913024043
are       1850210733
or        1833848095
his       1805683788
from      1734598284
at        1706718058
which     1570109342
but       1396171439
have      1388716420
an        1363116521
had       1308007510
they      1231059987
you       1168867586
were      1135241275
their     1076488415
one       1074487479
all       1031379950
we        1028643905
can        832705210
her        816638617
has        816070042
there      811848203
been       809520620
if         782488523
more       773623990
when       759867879
will       742569455
would      736404476
who        732299236
so         724571145
no         700307542
she        696346928
other      691591710
its        684717118
may        659022046
these      652892753
what       611266211
them       599817013
than       592508982
some       591157776
him        590383891
time       583894803
into       572112004
only       567615117
do         558298911
such       550034776
my         541136893
new        536629127
about      535438145
out        524101460
also       519307893
two        517040036
any        491821918
up         487701411
first      461575987
could      443444946
our        437481406
then       436395729
most       430481826
see        423246243
me         409783516
should     407122235
after      401110215
said       378997761
your       375384324
very       373437241
between    372032027
made       368828362
many       366336877
over       357758573
like       354924476
those      345833904
did        344906403
now        339668198
even       338753297
well       337707194
where      335821693
must       331496692
people     329358030
through    323545720
how        315034600
same       312263785
work       310897825
being      308096733
because    306160671
man        304508612
life       303632580
before     302583542
each       300640666
much       300630798
under      292804847
years      281623235
way        279891710
great      278574742
state      276687623
both       271888302
use        267896038
upon       266216636
own        262133879
used       262032076
however    259603813
us         258486246
part       257902754
good       254969414
world      252934367
make       251401534
three      243893653
while      241720369
long       238665543
day        235502808
without    232791602
just       227081780
men        227046739
general    225306151
during     224086601
another    222347954
little     221427582
found      218308716
here       216154769
system     214324790
does       213337247
de         212139309
down       211277897
number     210832441
case       208128805
against    205222260
might      204780135
still      204131780
back       203273772
right      202859545
know       202770927
place      200698028
every      196653070

You may have noticed that the single-letter words a and I are missing, since Scrabble words always contain at least two letters.  After these two, which are 6th and 19th, respectively, the next most frequently occurring 1-grams that are not Scrabble-playable are (ignoring other single letters): American, York, IILondon, British, England, non, America, William, France, and vol.

Although Scrabble OTCWL2 dictionary words make up just about 3.3% of all 1-grams, they are not exclusively “common” words.  The following figure shows, among a given quantile x of most frequently occurring 1-grams in the corpus, the fraction y of the Scrabble dictionary appearing among those 1-grams.

Cumulative distribution of Scrabble-playable words among 1-grams (sorted by frequency).

Cumulative distribution of Scrabble-playable words among 1-grams (sorted by frequency).

If Scrabble words were the most common 1-grams, the red curve for example would increase linearly from (0,0) to (0.033,1) and clamp there.  The curves do increase sharply at the beginning, indicating that most of the dictionary is indeed among the most frequently occurring 1-grams.  But since the curves instead remain strictly increasing throughout, this indicates that there are Scrabble words of any desirable rarity: unpuckers, for example, is one of the more amusing of the 26 playable words that occurs the minimum of just 40 times in the entire corpus.

But 40 isn’t really the minimum; there are Scrabble words that do not appear even once as a 1-gram.  As indicated by the x=1 intercepts in the figure above, about 7% (12,867 to be exact) of the OTCWL2 words do not appear anywhere in the Google data set: zyzzyvas mentioned earlier is one such word.

Hangman Strategy

A final thought: as a programming project, Hangman is even more challenging with the computer acting as the guesser.  This DataGenetics blog post contains an interesting analysis of incremental improvements in strategy for the guesser in Hangman.  For example, the author rightly points out the distinction between guessing letters based on their frequency of occurrence in natural English text, and the better strategy of guessing based on the number of remaining possible words in the dictionary that contain a candidate letter.

However, this latter “greedy” strategy is still not optimal, even assuming uniformly random “non-evil” selection of the unknown secret word.  Rather than just maximizing the probability of a successful next guess, the player really wants to maximize the overall probability of winning— that is, of not exceeding the allowed total number of incorrect guesses.

For a realistic dictionary size, this is a much harder problem to solve.  But we can see that this is indeed strictly better than the greedy strategy with the following minimal counterexample: consider the dictionary with just four 2-letter words {am, an, at, me}, where the player is allowed zero incorrect guesses.  Using the greedy strategy described in the DataGenetics post, the probability of winning is 1/4; but an optimal strategy wins with probability 1/2.


This entry was posted in Uncategorized. Bookmark the permalink.

3 Responses to Hangman, Scrabble, and Google Books

  1. Pingback: Analysis of the game of HIPE | Possibly Wrong

  2. Pingback: Probability of a Scrabble bingo | Possibly Wrong

  3. Pingback: Anagrams | Possibly Wrong

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s