Anagrams

Introduction

This was a fun exercise, motivated by several interesting recent posts by Mark Dominus at The Universe of Discourse about finding anagrams of individual English words, such as (relationshipsrhinoplasties), and how to compute a “score” for such anagrams by some reasonable measure of the complexity of the rearrangement, so that (attentivenesstentativeness), with a common 8-letter suffix, may be viewed as less “interesting” than, say, the more thoroughly shuffled (microclimates, commercialist).

The proposed scoring metric is the size of a “minimum common string partition” (MCSP): what is the fewest number of blocks of consecutive letters in a partition of the first word that may be permuted and re-concatenated to yield the second word?  For example, the above word attentiveness may be partitioned into 3 blocks, at+tent+iveness, and transposing the first two blocks yields tent+at+iveness.  Thus, the score for this anagram is only 3.  Compare this with the score of 12 for (intolerances, crenelations), where all 12 letters must be rearranged.

Computing MCSP

I wanted to experiment with this idea in a couple of different ways.  First, as Mark points out, the task of finding the anagrams themselves is pretty straightforward, but computing the resulting MCSP scores is NP-complete.  Fortunately, there is a nice characterization of the solution– essentially the same “brute force” approach described by Mark– that allows concise and reasonably efficient implementation.

Consider an anagram of two words $(w_1, w_2)$ with $n$ letters each, where the necessary rearrangement of letters of $w_1$ to produce $w_2$ is specified by a permutation

$\pi:\{1,2,\ldots,n\} \rightarrow \{1,2,\ldots,n\}$

where the $i$-th letter in $w_2$ is the $\pi(i)$-th letter in $w_1$.  This permutation of individual letters corresponds to a permutation of blocks of consecutive letters, where the number of such blocks– the MCSP score– is

$s(\pi) = n - \left|\{i

Computing an MCSP is hard because this permutation transforming $w_1$ into $w_2$ is not necessarily unique; we need the permutation that minimizes $s(\pi)$.  The key observation is that each candidate permutation may be decomposed into $\pi = \pi_2 \pi_1^{-1}$, where $\pi_j$ transforms any canonical (e.g., sorted) ordering of letters into $w_j$.  So we can fix, say, $\pi_2$, and the enumeration of possible $\pi_1$ is easy to express, since we are using the sorted list of letters as our starting point.

The following Mathematica function implements this approach:

anagramScore[w1_String, w2_String] :=
Module[
{s1 = Characters[w1], s2 = Characters[w2], p1, p2, i},
p1 = Ordering@Ordering[s1];
p2 = Ordering@Ordering[s2];
Length[s1] - Max@Outer[
Count[
Differences[Ordering[ReplacePart[p1, {##} // Flatten]][[p2]]],
1] &,
Sequence @@ Map[
(i = Position[s1, #] // Flatten;
Thread[i -> #] & /@ Permutations[p1[[i]]]) &,
Union[s1]
], 1]]


Using this, we find, as Mark does, that an anagram with maximum MCSP score of 14 is (cinematographer, megachiropteran)… along with the almost-as-interesting (involuntariness, nonuniversalist), but also other fun ones farther down the list, such as (enthusiastic, unchastities) with a score of 9.

Scoring Anagrams Using MCSP and Frequency

From Mark’s post:

Clearly my chunk score is not the end of the story, because “notaries / senorita” should score better than “abets / baste” (which is boring) or “Acephali / Phacelia” (whatever those are), also 5-pointers. The length of the words should be worth something, and the familiarity of the words should be worth even more [my emphasis].

The problem is that an MCSP score alone is a pretty coarse metric, since it’s an integer bounded by the length of the words in the dictionary.  So the second idea was to refine the ordering of the list of anagrams as Mark suggests, with a lexicographic sort first by MCSP score, then by (average) frequency of occurrence in language, as estimated using the Google Books Ngrams data set (methodology described in more detail here).  The expectation was that this would make browsing a long list easier, with more “recognizable” anagrams appearing together near the beginning of each MCSP grouping.

However, because I wanted to try to reproduce Mark’s results, I also needed a larger dictionary that contained, for example, megachiropteran. (which, by the way, is a bat that can have a wing span of over 5 feet).  I used the American English version of the Spell Checker Oriented Word List (SCOWL), combined with the Scrabble and ENABLE2k word lists used in similar previous experiments– which, interestingly, alone contain many anagrams not found in the earlier list.  (The SCOWL was really only needed to “reach” megachiropteran; with the exception of it and nonuniversalist, all of the other examples in this post are valid Scrabble words!).  The resulting word lists and corresponding counts of occurrences in the Ngrams data set are available here.

The resulting list of anagrams are in the text file linked below, sorted by MCSP score, then by the average frequency of the pair of words in each anagram.  Interesting examples high on the list are (personality, antileprosy) with a score of 11, (industries, disuniters) with a score of 10, etc.

The full list of 82,776 anagrams sorted by MCSP and frequency

The individual word frequencies are included in the list, to allow investigation of other sorting methods.  For example, it might be useful to normalize MCSP score by word length.  Or instead of using the average frequency of the two words in an anagram, the ratio of frequencies would more heavily weight anagrams between a really common word and a relatively unknown one, such as (penalties, antisleep)– I have never heard of the latter, but both are Scrabble-playable.

References:

1. Dominus, M., I found the best anagram in English, The Universe of Discourse, 21 February 2017 [HTML]
2. Goldstein, A., Kolman, P., Zheng, J., Minimum Common String Partition Problem: Hardness and Approximations, Electronic Journal of Combinatorics, 12 (2005), #R50 [PDF]
This entry was posted in Uncategorized. Bookmark the permalink.