I think I nerd-sniped myself. This started with the objective of writing a simple program to play Hangman, as a demonstration of a potential programming exercise for students. (Don’t assign the problem if you don’t know the solution.) But in the process of creating the word list for the game, I found that last year Google released an updated export of its Google Books Ngrams data, which I’ve used and discussed here several times before. So I thought it would be interesting to revisit that data reduction exercise, and see what’s changed since the last release.
First, let’s get the Hangman game out of the way. The Python code is on GitHub; it’s only about 30 lines (plus a list of 10,000 words)… but I think it’s an interesting 30 lines. This could be a great problem for students, either to implement the game themselves, or even just to play this version, and inspect the code to figure out what’s “evil” about this particular computer player. I think it’s an interesting problem to implement even this simple approach efficiently, let alone to significantly improve on the computer player’s performance (see Reference 3 below).
The word list for the game is selected from a longer list, also on GitHub, of 457,548 words and their corresponding frequency of occurrence in the Google Books dataset. I built this longer list in two steps, as described below.
Google Books 1-grams
First, I downloaded the entire list of Google Books 1-grams (roughly, whitespace-delimited tokens) from the new English version 20200217 release, and aggregated the total number of occurrences of each 1-gram consisting only of the letters ‘A’ through ‘Z’ (matched case-insensitively), with no digits, punctuation, or other special characters, but otherwise without any restrictions on word length or frequency of occurrence.
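As a rough illustration of this filtering step (a sketch, not the code actually used to build the list), the aggregation might look something like the following. It assumes each line of a 1-gram file has the form token, tab, then one or more tab-separated `year,match_count,volume_count` triples, which should be checked against the actual 20200217 files. Folding upper and lower case into a single lowercase key is also an assumption, and the sample lines are invented.

```python
import re
from collections import Counter

# Accept only tokens made up entirely of the letters A-Z, in either case.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")

def aggregate_counts(lines):
    """Sum match counts over all years for each letters-only token.

    Assumed line format: token<TAB>year,match_count,volume_count<TAB>...
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        token = fields[0]
        if not LETTERS_ONLY.fullmatch(token):
            continue  # skips digits, hyphens, tagged forms like "run_VERB", etc.
        for entry in fields[1:]:
            _year, match_count, _volume_count = entry.split(",")
            # Assumption: merge case variants under one lowercase key.
            totals[token.lower()] += int(match_count)
    return totals

# Invented sample lines, for illustration only.
sample = [
    "Hangman\t1999,10,3\t2000,5,2\n",
    "hangman\t2000,7,4\n",
    "hang-man\t2000,2,1\n",
    "hangman_NOUN\t2000,7,4\n",
]
print(aggregate_counts(sample))  # Counter({'hangman': 22})
```

In practice the input would be streamed from the downloaded files rather than held in a list, since the full 1-gram data is large.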
(Aside: This is the same filtering approach that I used for the previous 20120701 release, although the file organization and format changed in this new version. Peter Norvig did a similar analysis of the earlier data, but I was unable to reproduce his results; both I and at least one commenter on his site observed identical frequency counts that are, interestingly, almost-but-not-quite exactly half of his values.)
The result is 14,808,229 tokens and corresponding frequency counts. This is roughly triple the 4,999,714 tokens from the 2012 release, although it’s interesting that this new data set is not a superset of the old: there are 57,754 tokens from the 2012 release that are missing in the new one, three of which are valid Collins Scrabble words (more on this later): alcaicerias (a Spanish bazaar), initiatrices (female initiators), and nouritures (nourishment).
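A comparison like this comes down to set differences on the two token lists. Here is a minimal sketch, assuming each release has already been reduced to a dictionary mapping token to total count; the function name `compare_releases` and the toy counts are made up for illustration.

```python
def compare_releases(old_counts, new_counts):
    """Return (dropped, added): tokens only in the old or only in the new release."""
    old_tokens = set(old_counts)
    new_tokens = set(new_counts)
    dropped = old_tokens - new_tokens  # present in 2012, missing in 2020
    added = new_tokens - old_tokens    # new in 2020
    return dropped, added

# Toy data with invented counts, for illustration only.
old_counts = {"hangman": 120, "nouritures": 45, "the": 1000}
new_counts = {"hangman": 300, "the": 2500, "selfie": 80}

dropped, added = compare_releases(old_counts, new_counts)
print(sorted(dropped))  # ['nouritures']
print(sorted(added))    # ['selfie']

# Intersecting with a word list picks out the dropped tokens that are dictionary words.
scrabble_words = {"nouritures", "hangman"}
print(sorted(dropped & scrabble_words))  # ['nouritures']
```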
More interesting are the new words that have been added in the last decade or so since the 2012 release. Scanning the 250 most frequently occurring new tokens yields a technological trip down memory lane: instagram, blockchain, bitcoin, hadoop, brexit, icloud, crowdfunding, pinterest, wikileaks, obamacare, gamification, hashtag, github, selfie, airbnb, kinect, tumblr, crispr, sexting, whatsapp, snapchat, spotify, microservices, cryptocurrency, tensorflow, emoji, cisgender.
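Pulling out the most frequent new tokens can be sketched as below, again assuming token-to-count dictionaries for the two releases. The helper name `top_new_tokens` and the sample counts are hypothetical.

```python
import heapq

def top_new_tokens(old_counts, new_counts, n=250):
    """Return the n most frequent tokens that appear only in the new release."""
    new_only = {t: c for t, c in new_counts.items() if t not in old_counts}
    # nlargest avoids sorting all of the new tokens just to keep a few hundred.
    return heapq.nlargest(n, new_only, key=new_only.get)

# Invented counts, for illustration only.
old_counts = {"the": 1000}
new_counts = {"the": 2500, "selfie": 80, "bitcoin": 200, "emoji": 50}
print(top_new_tokens(old_counts, new_counts, n=2))  # ['bitcoin', 'selfie']
```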
An updated word frequency list
Armed with this list of nearly 15 million tokens and corresponding frequencies, the second step was to reduce the list of tokens to a more manageable “dictionary” of “words.” To do this, I used the union of the following four word lists:
- The ENABLE2k word list, containing 173,528 words.
- The North American Scrabble Players Association (NASPA) Word List 2018 (NWL2018), used in Scrabble tournaments in the United States and Canada, containing 192,111 words.
- The Collins Scrabble Words 2019 (CSW19) list, used in Scrabble tournaments pretty much everywhere else, containing 279,496 words.
- The Spell Checker Oriented Word List (SCOWL) by Kevin Atkinson, containing 430,590 words. (See the repository for details on the configurable parameters of this word list.)
The SCOWL is included as a sort of intentional overkill, a compromise between the size of the dataset and the hope that it will contain as a subset whatever dictionary you might want to use for your application. Note that this comes at a cost of including tokens that are definitely not words in any reasonable dictionary; for example, all 26 single-letter tokens are present, not just the two words a and I.
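Forming the union itself is simple once each list is loaded as a set. The sketch below assumes each word list is plain text with one word per line and folds everything to lowercase; the real files may need extra parsing (some Scrabble list distributions include definitions, for example), and the tiny lists here are stand-ins.

```python
def load_words(lines):
    """Build a set of lowercase words from an iterable of lines, skipping blanks."""
    return {line.strip().lower() for line in lines if line.strip()}

def build_dictionary(*word_lists):
    """Union any number of word sets into one dictionary set."""
    return set().union(*word_lists)

# Tiny stand-in lists, for illustration only.
enable = load_words(["aa\n", "hangman\n"])
nwl = load_words(["AA\n", "zen\n"])
scowl = load_words(["a\n", "i\n", "hangman\n", "\n"])

dictionary = build_dictionary(enable, nwl, scowl)
print(sorted(dictionary))  # ['a', 'aa', 'hangman', 'i', 'zen']
```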
The result is a single tab-separated text file with 457,548 rows, one for each word, and three columns: the word, followed by the number of occurrences in the 20120701 and 20200217 Google Books datasets, respectively. The entire list is sorted in decreasing order of frequency in the latest 20200217 dataset.
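For illustration, a file with this layout could be produced along the following lines. This is a sketch under stated assumptions, not the script that generated the actual file: the function name is made up, words absent from a release are given a count of 0, and ties are broken alphabetically.

```python
import csv
import io

def write_frequency_table(words, counts_2012, counts_2020, out):
    """Write word, 2012 count, 2020 count as tab-separated rows to out.

    Rows are sorted by decreasing 2020 count. Missing words default to 0
    and ties are broken alphabetically (both are assumptions).
    """
    rows = [(w, counts_2012.get(w, 0), counts_2020.get(w, 0)) for w in words]
    rows.sort(key=lambda row: (-row[2], row[0]))
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return rows

# Invented counts, for illustration only.
words = {"hangman", "the", "selfie"}
counts_2012 = {"hangman": 120, "the": 1000}
counts_2020 = {"hangman": 300, "the": 2500, "selfie": 80}

buffer = io.StringIO()
write_frequency_table(words, counts_2012, counts_2020, buffer)
print(buffer.getvalue(), end="")
```

Writing to any file-like object keeps the sketch easy to test; passing a handle from `open(path, "w", newline="")` instead of the `StringIO` buffer would write the table to disk.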