Introduction
I think a computer version of the paper-and-pencil game Hangman is a great programming project for students. Actually, “Evil” Hangman is even better as a more advanced project: instead of the computer selecting a random dictionary word as usual at the start of the game– and remaining bound to that initial selection throughout the game– it merely selects a word length. Then in response to each player guess, it “scores” the guess in a way that maximizes the number of possible dictionary words that are still consistent with the clues provided so far. This is “evil” in that it is arguably cheating… but in a way that is indistinguishable from fair play from the perspective of the human guesser.
In any case, a necessary step in such a project is identifying a list of words to use as a dictionary. A Scrabble dictionary seemed like a reasonable starting point… but considering playable words like zyzzyvas, I thought it might be useful to trim the dictionary down to a shorter list of “easier” words in more common use. But how to determine which words are more common than others? We can do that, with help from Google Books.
Word Lists
I started with the OTCWL2, the second edition of the Official Tournament and Club Word List used for Scrabble in the United States and Canada. This is the latest “official” reference accessible in electronic form, containing 178,691 words of 2 to 15 letters.
The OSPD4 is the fourth edition of the Official Scrabble Players Dictionary, a less rigorously edited publication containing a subset of the words in the OTCWL2 with at most 8 letters (with a small tail of some longer inflections), and excluding words that are considered offensive or otherwise inappropriate for school and family games. The figure below shows the distribution of word lengths for both of these word lists.
To sort these lists of words by frequency of occurrence in natural language, I used the Google Books Ngrams data set (English Version 20120701), containing a list of numbers of occurrences of “1-grams”– roughly, whitespace-delimited tokens– in the more than 8 million books scanned by Google. I aggregated these counts for all “words” containing only the case-insensitive alphabetic characters A-Z, with no other special characters, but with no other restriction on word length or frequency of occurrence. (Peter Norvig did a very similar data reduction analysis recently, but (1) he discarded rarely-occurring but still “playable” words, and (2) I cannot reproduce his data; both I and another commenter observe identical word counts that are, strangely, roughly half what his analysis indicates. I could use more eyes on this if someone wants to take a third look.)
The result is nearly 5 million distinct words, each from 1 to 133 letters in length, occurring over 378 billion times throughout the corpus. The following figure shows the distribution of lengths of distinct words, truncated to the same 15-letter maximum for comparison with the figure above.
This distribution has a shape similar to that of the OTCWL2, with comparable average lengths of 8.86713 and 8.61226 letters for Scrabble words and 1-grams, respectively. However, shorter words are more common in actual use: if we weight each 1-gram by its number of occurrences in the corpus, the resulting distribution shown below is skewed downward significantly, with an average length of only 4.84264 letters.
Word Frequencies
Armed with frequencies of occurrence of all 1-grams in the Google Books corpus, we can now map Scrabble dictionary words to their corresponding frequencies, and sort. The following table lists the top 150 most frequently occurring words, which are common to both OTCWL2 and OSPD4. (See the usual location here to download the complete lists.)
the 26548583149 of 15482969531 and 11315969857 to 9673642739 in 8445476198 is 4192081707 that 4000373255 for 3272628244 it 2870028984 as 2850302594 was 2751347404 with 2591390604 be 2409421528 by 2351529575 on 2297245630 not 2261359962 he 2055218371 this 1913024043 are 1850210733 or 1833848095 his 1805683788 from 1734598284 at 1706718058 which 1570109342 but 1396171439 have 1388716420 an 1363116521 had 1308007510 they 1231059987 you 1168867586 were 1135241275 their 1076488415 one 1074487479 all 1031379950 we 1028643905 can 832705210 her 816638617 has 816070042 there 811848203 been 809520620 if 782488523 more 773623990 when 759867879 will 742569455 would 736404476 who 732299236 so 724571145 no 700307542 she 696346928 other 691591710 its 684717118 may 659022046 these 652892753 what 611266211 them 599817013 than 592508982 some 591157776 him 590383891 time 583894803 into 572112004 only 567615117 do 558298911 such 550034776 my 541136893 new 536629127 about 535438145 out 524101460 also 519307893 two 517040036 any 491821918 up 487701411 first 461575987 could 443444946 our 437481406 then 436395729 most 430481826 see 423246243 me 409783516 should 407122235 after 401110215 said 378997761 your 375384324 very 373437241 between 372032027 made 368828362 many 366336877 over 357758573 like 354924476 those 345833904 did 344906403 now 339668198 even 338753297 well 337707194 where 335821693 must 331496692 people 329358030 through 323545720 how 315034600 same 312263785 work 310897825 being 308096733 because 306160671 man 304508612 life 303632580 before 302583542 each 300640666 much 300630798 under 292804847 years 281623235 way 279891710 great 278574742 state 276687623 both 271888302 use 267896038 upon 266216636 own 262133879 used 262032076 however 259603813 us 258486246 part 257902754 good 254969414 world 252934367 make 251401534 three 243893653 while 241720369 long 238665543 day 235502808 without 232791602 just 227081780 men 227046739 general 225306151 during 224086601 another 222347954 little 221427582 found 218308716 here 216154769 system 214324790 does 213337247 de 212139309 down 211277897 number 210832441 case 208128805 against 205222260 might 204780135 still 204131780 back 203273772 right 202859545 know 202770927 place 200698028 every 196653070
You may have noticed that the single-letter words a and I are missing, since Scrabble words always contain at least two letters. After these two, which are 6th and 19th, respectively, the next most frequently occurring 1-grams that are not Scrabble-playable are (ignoring other single letters): American, York, II, London, British, England, non, America, William, France, and vol.
Although Scrabble OTCWL2 dictionary words make up just about 3.3% of all 1-grams, they are not exclusively “common” words. The following figure shows, among a given quantile x of most frequently occurring 1-grams in the corpus, the fraction y of the Scrabble dictionary appearing among those 1-grams.
If Scrabble words were the most common 1-grams, the red curve for example would increase linearly from (0,0) to (0.033,1) and clamp there. The curves do increase sharply at the beginning, indicating that most of the dictionary is indeed among the most frequently occurring 1-grams. But since the curves instead remain strictly increasing throughout, this indicates that there are Scrabble words of any desirable rarity: unpuckers, for example, is one of the more amusing of the 26 playable words that occurs the minimum of just 40 times in the entire corpus.
But 40 isn’t really the minimum; there are Scrabble words that do not appear even once as a 1-gram. As indicated by the x=1 intercepts in the figure above, about 7% (12,867 to be exact) of the OTCWL2 words do not appear anywhere in the Google data set: zyzzyvas mentioned earlier is one such word.
Hangman Strategy
A final thought: as a programming project, Hangman is even more challenging with the computer acting as the guesser. This DataGenetics blog post contains an interesting analysis of incremental improvements in strategy for the guesser in Hangman. For example, the author rightly points out the distinction between guessing letters based on their frequency of occurrence in natural English text, and the better strategy of guessing based on the number of remaining possible words in the dictionary that contain a candidate letter.
However, this latter “greedy” strategy is still not optimal, even assuming uniformly random “non-evil” selection of the unknown secret word. Rather than just maximizing the probability of a successful next guess, the player really wants to maximize the overall probability of winning— that is, of not exceeding the allowed total number of incorrect guesses.
For a realistic dictionary size, this is a much harder problem to solve. But we can see that this is indeed strictly better than the greedy strategy with the following minimal counterexample: consider the dictionary with just four 2-letter words {am, an, at, me}, where the player is allowed zero incorrect guesses. Using the greedy strategy described in the DataGenetics post, the probability of winning is 1/4; but an optimal strategy wins with probability 1/2.
Pingback: Analysis of the game of HIPE | Possibly Wrong
Pingback: Probability of a Scrabble bingo | Possibly Wrong
Pingback: Anagrams | Possibly Wrong