An updated Google Books word frequency list

Introduction

I think I nerd-sniped myself. This started with the objective of writing a simple program to play Hangman, as a demonstration of a potential programming exercise for students. (Don’t assign the problem if you don’t know the solution.) But in the process of creating the word list for the game, I found that last year Google released an updated export of its Google Books Ngrams data, which I’ve used and discussed here several times before. So I thought it would be interesting to revisit that data reduction exercise, and see what’s changed since the last release.

First, let’s get the Hangman game out of the way. The Python code is on GitHub; it’s only about 30 lines (plus a list of 10,000 words)… but I think it’s an interesting 30 lines. This could be a great problem for students, either to implement the game themselves, or even just to play this version, and inspect the code to figure out what’s “evil” about this particular computer player. I think it’s an interesting problem to implement even this simple approach efficiently, let alone to significantly improve on the computer player’s performance (see Reference 3 below).

The word list for the game is selected from a longer list– also on GitHub— of 457,548 words and their corresponding frequency of occurrence in the Google Books dataset. I built this longer list in two steps, as described below.

Google Books 1-grams

First, I downloaded the entire list of Google Books 1-grams (roughly, whitespace-delimited tokens) from the new English version 20200217 release, and aggregated the total number of occurrences of each 1-gram containing only the case-insensitive letters ‘A’ through ‘Z’, with no other special characters, but otherwise without any restrictions on word length or frequency of occurrence.

(Aside: This is the same filtering approach that I used for the previous 20120701 release, although the file organization and format changed in this new version. Peter Norvig did a similar analysis of the earlier data, but I was unable to reproduce his results; both I and at least one commenter on his site observed identical frequency counts that are, interestingly, almost-but-not-quite exactly half of his values.)

The result is 14,808,229 tokens and corresponding frequency counts. This is roughly triple the 4,999,714 tokens from the 2012 release, although it’s interesting that this new data set is not a proper superset of the old: there are 57,754 tokens missing in the new release, three of which are valid Collins Scrabble words (more on this later): alcaicerias (a Spanish bazaar), initiatrices (female initiators), and nouritures (nourishment).

More interesting are the new words that have been added in the last decade or so since the 2012 release. Scanning the 250 most frequently occurring new tokens yields a technological trip down memory lane: instagram, blockchain, bitcoin, hadoop, brexit, icloud, crowdfunding, pinterest, wikileaks, obamacare, gamification, hashtag, github, selfie, airbnb, kinect, tumblr, crispr, sexting, whatsapp, snapchat, spotify, microservices, cryptocurrency, tensorflow, emoji, cisgender.

An updated word frequency list

Armed with this list of nearly 15 million tokens and corresponding frequencies, the second step was to reduce the list of tokens to a more manageable “dictionary” of “words.” To do this, I used the union of the following four word lists:

  • The ENABLE2k word list, containing 173,528 words.
  • The North American Scrabble Players Association (NASPA) Word List 2018 (NWL2018), used in Scrabble tournaments in the United States and Canada, containing 192,111 words.
  • The Collins Scrabble Words 2019 (CSW19) list, used in Scrabble tournaments pretty much everywhere else, containing 279,496 words.
  • The Spell Checker Oriented Word List (SCOWL) by Kevin Atkinson, containing 430,590 words. (See the repository for details on the configurable parameters of this word list.)

The SCOWL is included as a sort of intentional overkill, a compromise between the size of the dataset and the hope that it will contain as a subset whatever dictionary you might want to use for your application. Note that this comes at a cost of including tokens that are definitely not words in any reasonable dictionary; for example, all 26 single-letter tokens are present, not just the two words a and I.

The result is a single tab-separated text file with 457,548 rows, one for each word, and three columns: the word, followed by the number of occurrences in the 20120701 and 20200217 Google Books datasets, respectively. The entire list is sorted in decreasing order of frequency in the latest 20200217 dataset.

References:

  1. Google Books Ngram Viewer Exports, English versions 20120701 and 20200217
  2. Atkinson, K., Spell Checker Oriented Word List (SCOWL) Copyright 2000-2018 by Kevin Atkinson
  3. Barbay, J. and Subercaseaux, B., The Computational Complexity of Evil Hangman, arXiv:2003.10000

This entry was posted in Uncategorized. Bookmark the permalink.

2 Responses to An updated Google Books word frequency list

  1. (Don’t assign the problem if you don’t know the solution.)

    Oh yes, I’ve learned this lesson the hard way. (“Whoops, you need to do X but you haven’t learned how to do X yet…”)

    inspect the code to figure out what’s “evil” about this particular computer player

    The word “evil” tipped me off, so I correctly guessed that the computer wasn’t just picking one word at the beginning like a human opponent. 🙂

    • Yeah, this was an interesting exercise. I had seen this as a suggested assignment before, but had never actually implemented it myself until now. It’s a nice problem, because it seems pretty easy to accidentally end up with an unacceptably *inefficient* implementation (e.g., quadratic in dictionary size).

      The linked paper is intriguing, too; this greedy “evil” approach isn’t just suboptimal, it can be arbitrarily bad, and optimality is coNP-hard. It would be interesting to measure (via Monte Carlo) for realistic dictionaries just how much more “evil” this approach is– or isn’t– against a simple computer guesser, compared to a “non-evil” approach of just randomly selecting a fixed word.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.