**Introduction**

This was a fun exercise, motivated by several interesting recent posts by Mark Dominus at The Universe of Discourse about finding anagrams of individual English words, such as (*relationships*, *rhinoplasties*), and how to compute a “score” for such anagrams by some reasonable measure of the complexity of the rearrangement, so that (*attentiveness*, *tentativeness*), with a common 8-letter suffix, may be viewed as less “interesting” than, say, the more thoroughly shuffled (*microclimates*, *commercialist*).

The proposed scoring metric is the size of a “minimum common string partition” (MCSP): what is the fewest number of blocks of consecutive letters in a partition of the first word that may be permuted and re-concatenated to yield the second word? For example, the above word *attentiveness* may be partitioned into 3 blocks, *at*+*tent*+*iveness*, and transposing the first two blocks yields *tent*+*at*+*iveness*. Thus, the score for this anagram is only 3. Compare this with the score of 12 for (*intolerances*, *crenelations*), where all 12 letters must be rearranged.

**Computing MCSP**

I wanted to experiment with this idea in a couple of different ways. First, as Mark points out, the task of *finding* the anagrams themselves is pretty straightforward, but computing the resulting MCSP scores is NP-complete. Fortunately, there is a nice characterization of the solution– essentially the same “brute force” approach described by Mark– that allows concise and reasonably efficient implementation.

Consider an anagram of two words with letters each, where the necessary rearrangement of letters of to produce is specified by a permutation

where the -th letter in is the -th letter in . This permutation of *individual* letters corresponds to a permutation of *blocks of consecutive* letters, where the number of such blocks– the MCSP score– is

Computing an MCSP is hard because this permutation transforming into is not necessarily unique; we need the permutation that minimizes . The key observation is that each candidate permutation may be decomposed into , where transforms any canonical (e.g., sorted) ordering of letters into . So we can fix, say, , and the enumeration of possible is easy to express, since we are using the *sorted* list of letters as our starting point.

The following Mathematica function implements this approach:

anagramScore[w1_String, w2_String] := Module[ {s1 = Characters[w1], s2 = Characters[w2], p1, p2, i}, p1 = Ordering@Ordering[s1]; p2 = Ordering@Ordering[s2]; Length[s1] - Max@Outer[ Count[ Differences[Ordering[ReplacePart[p1, {##} // Flatten]][[p2]]], 1] &, Sequence @@ Map[ (i = Position[s1, #] // Flatten; Thread[i -> #] & /@ Permutations[p1[[i]]]) &, Union[s1] ], 1]]

Using this, we find, as Mark does, that an anagram with maximum MCSP score of 14 is (*cinematographer*, *megachiropteran*)… along with the almost-as-interesting (*involuntariness*, *nonuniversalist*), but also other fun ones farther down the list, such as (*enthusiastic*, *unchastities*) with a score of 9.

**Scoring Anagrams Using MCSP and Frequency**

From Mark’s post:

Clearly my chunk score is not the end of the story, because “notaries / senorita” should score better than “abets / baste” (which is boring) or “Acephali / Phacelia” (whatever those are), also 5-pointers. The length of the words should be worth something,

and the familiarity of the words should be worth even more[my emphasis].

The problem is that an MCSP score alone is a pretty coarse metric, since it’s an integer bounded by the length of the words in the dictionary. So the second idea was to refine the ordering of the list of anagrams as Mark suggests, with a lexicographic sort first by MCSP score, then by (average) frequency of occurrence in language, as estimated using the Google Books Ngrams data set (methodology described in more detail here). The expectation was that this would make browsing a long list easier, with more “recognizable” anagrams appearing together near the beginning of each MCSP grouping.

However, because I wanted to try to reproduce Mark’s results, I also needed a larger dictionary that contained, for example, *megachiropteran*. (which, by the way, is a bat that can have a wing span of over 5 feet). I used the American English version of the Spell Checker Oriented Word List (SCOWL), combined with the Scrabble and ENABLE2k word lists used in similar previous experiments– which, interestingly, *alone* contain many anagrams *not* found in the earlier list. (The SCOWL was really only needed to “reach” *megachiropteran*; with the exception of it and *nonuniversalist*, all of the other examples in this post are valid Scrabble words!). The resulting word lists and corresponding counts of occurrences in the Ngrams data set are available here.

The resulting list of anagrams are in the text file linked below, sorted by MCSP score, then by the average frequency of the pair of words in each anagram. Interesting examples high on the list are (*personality*, *antileprosy*) with a score of 11, (*industries*, *disuniters*) with a score of 10, etc.

The full list of 82,776 anagrams sorted by MCSP and frequency

The individual word frequencies are included in the list, to allow investigation of other sorting methods. For example, it might be useful to normalize MCSP score by word length. Or instead of using the *average* frequency of the two words in an anagram, the *ratio* of frequencies would more heavily weight anagrams between a really common word and a relatively unknown one, such as (*penalties*, *antisleep*)– I have never heard of the latter, but both are Scrabble-playable.

**References:**