Beware the “natural” quaternion


Rotation math can be confusing. But it didn’t need to be this confusing.

I think the reason that 3D rotations can be tricky to work with is that there are so many choices of convention– in interpretation, notation, and implementation– and we often do not communicate or document all of those choices clearly and completely. Are we “actively” rotating vectors, or “passively” rotating frames? If the latter, are we rotating from the “body” to “world” frame, or from world to body? Are rotations applied as we write them left to right, or right to left? Are the rotations right-handed or left-handed? Etc.

This post is motivated by lost and delayed productivity, stemming from confusion and miscommunication about one of these choices that seems not widely known: there are two different quaternions out there in the wild. The unit quaternion, familiar to most of us as a way to represent rotations in many application domains, has an evil twin. If you didn’t already know this, then hopefully this post is a helpful warning to double-check your documentation and code. But even if this is old news for you, then I will try to argue that use of this evil twin quaternion, advocated by Shuster [1] as the “natural” convention, was an unnecessary choice… and certainly not any more “natural,” “physical,” or otherwise inherently preferable to the quaternion most of us are familiar with.

Rotation matrices

To try to show more clearly how silly this situation is, let’s think like a software developer, working out how to implement rotations for some application. Let’s start with the interface first, then we will flesh out the implementation. We can view the concept of rotation as a group acting on \mathbb{R}^3, with a particular rotation represented as a group element g, that transforms a vector v \in \mathbb{R}^3 into a new rotated vector g(v). For example, in Python, we could make a Rotation object callable:

class Rotation:
    def __call__(self, v):
        """Rotate vector v."""
        return ...

g = Rotation(...)
v = (1, 0, 0)
v_rotated = g(v)

But wait– we’ve already made several choices of convention here, at least implicitly. First, from this software implementation perspective, we are bypassing the discussion of interpretation altogether. That is, for example, whether the user wants to interpret an argument vector v \in \mathbb{R}^3 as being “actively” rotated within a fixed coordinate frame, or as a fixed point in space whose coordinates are being “passively” rotated from one frame to another, does not affect the eventual lines of code implementing the arithmetic operations realizing the rotation. (Or put another way, those interpretation details get pushed into the constructor, affecting how we want to interpret the parameters specifying creation of a Rotation. More on this later.)

However, we have made a definite choice about notation: namely, the left group action convention, where we write (from left to right) the rotation group element g first, followed by the vector v being acted upon.

This is okay; no one is complaining about this choice of convention. For example, if we now consider actually implementing a rotation as a 3×3 matrix R (we’ll get to quaternions shortly), pretty much everyone agrees with and has settled on the well-established convention of multiplying the matrix R on the left by the column vector v on the right, so that the group action g(v) corresponds to the matrix multiplication Rv:

import numpy as np

class Rotation:
    def __init__(self, matrix):
        """Create rotation from matrix."""
        self.r = np.array(matrix)
    def __call__(self, v):
        """Rotate vector v."""
        return self.r @ np.array(v) # matrix multiplication

g = Rotation(((0, -1, 0), (1, 0, 0), (0, 0, 1)))
v = (1, 0, 0)
v_rotated = g(v)

(As an aside, the only reasonably common application context I can think of where it’s more common to multiply a row vector on the left by a matrix on the right is a Markov chain, where we update the probability distribution as a row vector multiplied by the transition matrix. Maybe there are others that I’m not aware of?)

Composition of rotations

Our implementation isn’t quite complete. We still need to work out how to compose multiple rotations. Fortunately, there is again community-wide agreement on how this works with rotation matrices, and it’s consistent with the same left group action convention. Namely, composition (just like the group action on vectors) is denoted and implemented as the usual left-to-right matrix multiplication:

import numpy as np

class Rotation:
    def __init__(self, matrix):
        """Create rotation from matrix."""
        self.r = np.array(matrix)
    def __call__(self, v):
        """Rotate vector v."""
        return self.r @ np.array(v)

    def __matmul__(self, r2):
        """Compose rotations (this @ r2)."""
        return Rotation(self.r @ r2.r)

r1 = Rotation(((0, -1, 0), (1, 0, 0), (0, 0, 1)))
r2 = Rotation(((0, 0, 1), (0, 1, 0), (-1, 0, 0)))
v = (1, 0, 0)

v_rotated = (r1 @ r2)(v)
v_rotated = r1(r2(v)) # same result

Note a couple of observations before we move on. First, the example above highlights a potentially confusing side effect of this left group action convention: in the composition R_1 R_2, the rightmost rotation R_2 is the first to act on the input vector v.

Second, recall that matrix multiplication is defined so that each element in the matrix product C=AB is computed as c_{i,j}=\sum_k a_{i,k}b_{k,j}. This is technically another arbitrary choice of convention; we could in principle choose to abandon this convention, and define “our” multiplication as c_{i,j}=\sum_k a_{k,i}b_{j,k}. The reader can verify that the effect of changing this convention is simply to reverse the order in which we would write a “historically conventional” well-defined matrix product.

Maybe we’ve been using the “wrong” definition of matrix multiplication for the last two centuries or so, and the above redefinition is more “natural”?

Keep the interface, change the implementation

Now that we have a working software implementation using rotation matrices, that corresponds nicely with cocktail-napkin notation, suppose that we want to change that implementation to use quaternions instead of matrices under the hood… but without changing the interface exposed to the user.

First, let’s review how unit quaternions work as rotation operators. Just as the group SO(3) of rotation matrices acts on vectors by multiplication, the group SU(2) of unit quaternions of the form q=w+x\mathbf{i}+y\mathbf{j}+z\mathbf{k} acts on vectors by conjugation, with a unit quaternion q transforming a vector v into the rotated vector qvq^{-1}=qvq^{*}, where we temporarily convert vectors in \mathbb{R}^3 to and from corresponding quaternions with zero real part, and the non-commutative quaternion multiplication is defined by

\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{ijk} = -1

Note that this form of conjugation, qvq^{-1}, is also a choice of convention. We could have used q^{-1}vq instead… but here again, most everyone seems to agree the above is the more sensible choice, since it keeps us consistent with the left group action notational convention, allowing us to comfortably switch back and forth between thinking in terms of rotation matrices or quaternions as our underlying implementation.

Here is the result:

import numpy as np

def quat_product(q1, q2):
    """Return quaternion product (q1 q2)."""
    w1, xyz1 = q1
    w2, xyz2 = q2
    return (w1 * w2 -, xyz2),
            w1 * xyz2 + w2 * xyz1 + np.cross(xyz1, xyz2))

class Rotation:
    def __init__(self, quat):
        """Create rotation from quaternion."""
        self.q = (quat[0], np.array(quat[1]))

    def __call__(self, v):
        """Rotate vector v."""
        w, xyz = quat_product(quat_product(self.q, (0, np.array(v))),
                              (self.q[0], -self.q[1]))
        return xyz

    def __matmul__(self, r2):
        """Compose rotations (this @ r2)."""
        return Rotation(quat_product(self.q, r2.q))

c = np.sqrt(0.5)
r1 = Rotation((c, (0, 0, c)))
r2 = Rotation((c, (0, c, 0)))
v = (1, 0, 0)

v_rotated = (r1 @ r2)(v)
v_rotated = r1(r2(v)) # same result

Note that I’ve pulled out a helper function, quat_product, that multiplies two quaternions, using the product as defined above and originally set down by Hamilton nearly two centuries ago.

But this is yet another technically arbitrary choice of convention. There is another definition of the quaternion product, advocated by Shuster [1], and in wide use by the spacecraft dynamics community, that changes the product relations from the “Hamilton convention” of

\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{ijk} = -1

to what Sommer et. al. [2] call the “Shuster convention” (sometimes referred to as the “JPL convention,” denoted here as the operator \otimes) of

\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = -1, \mathbf{i}\otimes\mathbf{j}\otimes\mathbf{k} = +1

Mathematically speaking, there is nothing wrong with this. (Side puzzle: this doesn’t contradict the Frobenius theorem, right?) However– and finally we come to the point of this discussion– there is also nothing more inherently Right, or Natural, about it, either.

The effect, and indeed the motivation, of using Shuster’s convention is simply to reverse the order of the quaternion product. That is, given two quaternions q_1,q_2 with Hamilton product denoted by juxtaposition q_1 q_2, the reader can verify that their “Shuster product” q_1 \otimes q_2 = q_2 q_1.

In the quat_product implementation above, we could realize the Shuster product by flipping the plus sign on the cross product to a minus sign… and that’s it. Nothing else changes… except, possibly, how we might choose to interpret the parameterization of our rotations in the constructor.

Homomorphism vs. antihomomorphism

So, why mess with two centuries of notational inertia just to write products in reverse order? To see what I think is the root of the problem, note that I cheated somewhat in the quaternion implementation of the Rotation class above. I claimed not to change the interface at all, but I did: the original rotation matrix implementation’s constructor expected a matrix argument, while the quaternion version expects a quaternion argument.

One way to fix this would be to keep the original expectation of a matrix as the constructor argument, and convert it to the corresponding quaternion under the hood. Or conversely, in the matrix implementation, we would accept a quaternion as the constructor argument, and convert it to the corresponding matrix.

But what do we mean by conversion to the corresponding matrix (or quaternion)? We want the external interface to remain unchanged, so that we preserve behavior as observed by the user, no matter which internal implementation we use. In mathematical terms, we need a homomorphism: a conversion function \phi : SU(2) \to SO(3), with the property that for all unit quaternions q_1, q_2, converting them to rotation matrices preserves the behavior of composition, that is, \phi(q_1 q_2)=\phi(q_1)\phi(q_2).

So, consider the situation at this point. We have two choices of convention to make:

  1. What is the quaternion product: do we use Hamilton’s that has been around for nearly two centuries and is in common use in mathematical literature and software libraries, or Shuster’s that is mathematically equally workable but much more recent and much less widely known?
  2. What is the conversion function \phi described above?

The key observation is that, if we want our conversion function \phi to be a homomorphism, then fixing the choice of convention in either (1) or (2) determines the choice in the other. And this whole mess, that on the surface is Shuster’s “natural” quaternion product as choice (1), really stems from the prematurely fixed, equally arbitrary choice (2) for the correspondence between quaternions and matrices.

In other words, the complaint seems to be, “I’ve already fixed my assumed conversion function for some reason, and I want it to be a homomorphism, so I need the quaternion product definition to change accordingly.” But if we simply transpose Shuster’s assumed quaternion-to-matrix conversion– which he refers to in equation (4) of [1], but does not actually define until equation (61), without ever noting the implicit conventional choice in doing so– then the Hamilton product does indeed “satisfy… the natural order” of multiplication as desired.


Unfortunately, the damage is done. That is, rants like this aren’t going to fix all of the papers and lines of code that already exist that use both conventions, so we can do no better than double-check assumptions and conventions whenever we encounter a new paper, software library, etc.

There are even examples of more intermediate weirdness, so to speak, to be on the lookout for. For example, MATLAB’s Robotics Toolbox is reasonably well-behaved, consistently using the Hamilton product (robotics.utils.internal.quatMultiply) and the corresponding homomorphic conversion functions (quat2rotm and rotm2quat)… but in the Aerospace Toolbox, despite also using the Hamilton product (quatmultiply), the analogous conversion functions (quat2dcm and dcm2quat) are seemingly antihomomorphic– i.e., quat2dcm(q) returns the transpose of quat2rotm(q)— except for the added wrinkle that the provided rotation function quatrotate uses the conjugation q^{-1}vq consistent with the right group action convention (i.e., so that when composing rotations, matrix products look reversed compared to their corresponding quaternion products).


  1. Shuster, M., The nature of the quaternion, The Journal of the Astronautical Sciences, 56(3) 2008, 359–373
  2. Sommer, Gilitschenski, Bloesch, Weiss, Siegwart, Nieto, Why and How to Avoid the Flipped Quaternion Multiplication, arXiv:1801.07478
Posted in Uncategorized | Leave a comment

Analysis of Evil Hangman


This is a follow-up to the previous post that briefly mentioned the game of Hangman, in the context of developing a dictionary of words sorted based on their frequency of occurrence in the recently updated Google Books Ngram dataset.

My Python implementation of the game is on GitHub, where you are the guesser playing against the computer acting as the hangman. But this hangman is “evil:” that is, the computer doesn’t play “honestly,” selecting a secret word at the start of the game, and keeping responses to your guesses consistent with that initial selection throughout the game. Instead, the computer does not start with a secret word at all, merely a word length, and scores guesses using a greedy algorithm that maximizes the number of possible words in the dictionary that remain consistent with all scored guesses so far.

(Aside: this version is slightly more evil than the original, modified to break ties by minimizing the number of correct letters in each response, so that it will score a “miss” if possible, subject to the original constraint of maximizing the number of possible remaining words.)

It seems intuitive that Evil Hangman should be a much harder game to play than the normal honest version. But how much harder is it really? A recent paper by Barbay and Subercaseaux [1] shows that this simple children’s game hides a lot of complexity: determining a strategy that guarantees at least a specified number of missed guesses is coNP-hard, and the standard “homework assignment” greedy algorithm can be arbitrarily worse than optimal for appropriately crafted dictionaries… but what about “real” dictionaries? How much harder is Evil Hangman in practice with a typical English-language dictionary?

Modeling the guesser

To automate the analysis, we need to implement a model of a player’s sequence of guesses in each game. When I play Hangman, my initial guesses tend to be hard-coded, always starting with ETAOIN SHRDLU, until a first correct guess, only after which my strategy depends on the situation, the length of the word, which letters were correct and where, etc.

I do this knowing that it’s not optimal; there is a great Data Genetics blog post [2] that describes the problem with this strategy, and a better solution. Rather than prioritizing guesses based on frequency of occurrence in language (where even ETAOINSHRDLU isn’t quite right– a more accurate sequence is approximately ETAOINSRHLDCU), a better approach is instead to guess letters that occur in the largest number of remaining possible words in the dictionary.

The following table shows what this strategy looks like, indicating the sequence of guesses you should make for each possible word length… at least until you score the first correct guess, after which the subsequent strategy is too complex to fit in a simple table, since it depends on not only the word length, but which letter was correct and where, etc.

Word lengthGuesses until first correct
Table 1. Recommended guessing strategy leading to first correct guess, using NWL2018 Scrabble dictionary.

Note that this table differs slightly from the Data Genetics article, since the strategy depends on the dictionary. Here and throughout I am using the current North American (NWL2018) Scrabble dictionary of 192,111 words.


The following figure shows a comparison of performance of this “optimal” guessing strategy against various hangman strategies.

Figure 1. Number of misses (i.e., failed guesses) required vs. overall size of dictionary, for each possible word length. Left: typical “Honest” Hangman, where a word is selected uniformly at random from the dictionary. Middle: “Evil” Hangman. Right: Honest Hangman, but with the word selected to maximize number of misses against the implemented guesser.

There is a lot to unpack here. First, we consider a range of dictionary sizes on the x-axis, ranging from just the most commonly occurring words to the entire NWL2018 Scrabble dictionary. Second, each curve corresponds to a specific word length. This is perhaps the most interesting and unintuitive observation: the shortest words, with just 2-5 letters, are consistently the hardest ones to guess.

The leftmost plot shows the “normal” game, where an honest hangman selects a fixed secret word uniformly at random, with the y-axis indicating the expected number of misses (conditioned on that word length) before guessing the word.

The middle plot shows the performance against Evil Hangman. Interestingly, short words are not only harder to guess, they are much harder, with words of 3-5 letters requiring nearly twice as many misses as words with 6 letters. What are these “hardest” words? Recall that Evil Hangman does not select a specific word to guess; but when finally forced into a corner by this “optimal” guessing strategy, examples of the eventual lone remaining valid dictionary word are (from most to least common): hip, rib, tung, buzz, puff, jig, pugh, fil, jib, wuz, cuz, yuck, and guck.

However, as discussed in the paper motivating this post, this “greedily” Evil Hangman does not necessarily yield a maximum number of failed guesses. The rightmost plot shows the worst case performance against a normal honest hangman, i.e., where the secret word is selected to maximize the number of misses against this “optimal” guessing strategy. It is again perhaps unintuitive that merely crafting a particular selected word– but a fixed word– can yield so many more misses than the Evil Hangman, who actively avoids giving out information.

This behavior can be more clearly seen with a simpler example. Consider a dictionary with just five words: from, have, that, this, with. We can verify that our guessing strategy will win with zero misses against Evil Hangman, by choosing H as the first letter; but an honest hangman can guarantee that first guess will miss, by selecting the word from.

So what are these hardest words that an honest hangman could select from the actual Scrabble dictionary? Some of the more recognizable ones are junked (12 misses), jazzed and faffed (13 misses each)… and it only gets weirder from there; to require more than the 16 failed guesses squeezed out by the Evil Hangman, one could select words like: hin, rill, zin, zill, zax, yill, yins, zills, yills. I don’t recognize any of these, and I imagine that selecting one of them as a hangman secret word would probably start a fight.

Open problems

Of course, these examples highlight one of a couple of problems with this analysis: first, the strategy for both the guesser and the hangman, even in the evil case, are completely deterministic, and thus predictably repeatable. For example, having observed that the “optimal” guesser requires 12 misses to eventually guess the word pop, as hangman we could select pop again and again, and score 12 misses every single time.

Second, we aren’t even really playing the right game. In Hangman, the guesser does not necessarily continue until finally guessing the entire word. Instead, a maximum number of failed guesses is designated ahead of time, and the guesser loses as soon as that maximum is reached. Rather than minimizing the number of misses– which we haven’t even been guaranteeing anyway, hence my quotes around “optimal” strategy– the real objective is to maximize the probability of winning, i.e., of not reaching the maximum number of misses.

This is a very hard problem, with a mixed (that is, randomized) strategy Nash equilibrium that would be incredibly complex to even specify, let alone actually compute. Jon McLoone presents an interesting Wolfram blog post [3] that attempts to address these issues, albeit in a similarly simplified and not necessarily optimal way. (Many of the above words, including buzz and its inflected forms, jazzed, faffed, etc., also appear high on his list of “best” words for a hangman to choose.)


  1. Barbay, J. and Subercaseaux, B., The Computational Complexity of Evil Hangman, arXiv:2003.10000
  2. Berry, N., A Better Strategy for Hangman, Data Genetics (blog),
  3. McLoone, J., 25 Best Hangman Words, Wolfram News, Views, & Insights (blog),
Posted in Uncategorized | Leave a comment

An updated Google Books word frequency list


I think I nerd-sniped myself. This started with the objective of writing a simple program to play Hangman, as a demonstration of a potential programming exercise for students. (Don’t assign the problem if you don’t know the solution.) But in the process of creating the word list for the game, I found that last year Google released an updated export of its Google Books Ngrams data, which I’ve used and discussed here several times before. So I thought it would be interesting to revisit that data reduction exercise, and see what’s changed since the last release.

First, let’s get the Hangman game out of the way. The Python code is on GitHub; it’s only about 30 lines (plus a list of 10,000 words)… but I think it’s an interesting 30 lines. This could be a great problem for students, either to implement the game themselves, or even just to play this version, and inspect the code to figure out what’s “evil” about this particular computer player. I think it’s an interesting problem to implement even this simple approach efficiently, let alone to significantly improve on the computer player’s performance (see Reference 3 below).

The word list for the game is selected from a longer list– also on GitHub— of 457,548 words and their corresponding frequency of occurrence in the Google Books dataset. I built this longer list in two steps, as described below.

Google Books 1-grams

First, I downloaded the entire list of Google Books 1-grams (roughly, whitespace-delimited tokens) from the new English version 20200217 release, and aggregated the total number of occurrences of each 1-gram containing only the case-insensitive letters ‘A’ through ‘Z’, with no other special characters, but otherwise without any restrictions on word length or frequency of occurrence.

(Aside: This is the same filtering approach that I used for the previous 20120701 release, although the file organization and format changed in this new version. Peter Norvig did a similar analysis of the earlier data, but I was unable to reproduce his results; both I and at least one commenter on his site observed identical frequency counts that are, interestingly, almost-but-not-quite exactly half of his values.)

The result is 14,808,229 tokens and corresponding frequency counts. This is roughly triple the 4,999,714 tokens from the 2012 release, although it’s interesting that this new data set is not a proper superset of the old: there are 57,754 tokens missing in the new release, three of which are valid Collins Scrabble words (more on this later): alcaicerias (a Spanish bazaar), initiatrices (female initiators), and nouritures (nourishment).

More interesting are the new words that have been added in the last decade or so since the 2012 release. Scanning the 250 most frequently occurring new tokens yields a technological trip down memory lane: instagram, blockchain, bitcoin, hadoop, brexit, icloud, crowdfunding, pinterest, wikileaks, obamacare, gamification, hashtag, github, selfie, airbnb, kinect, tumblr, crispr, sexting, whatsapp, snapchat, spotify, microservices, cryptocurrency, tensorflow, emoji, cisgender.

An updated word frequency list

Armed with this list of nearly 15 million tokens and corresponding frequencies, the second step was to reduce the list of tokens to a more manageable “dictionary” of “words.” To do this, I used the union of the following four word lists:

  • The ENABLE2k word list, containing 173,528 words.
  • The North American Scrabble Players Association (NASPA) Word List 2018 (NWL2018), used in Scrabble tournaments in the United States and Canada, containing 192,111 words.
  • The Collins Scrabble Words 2019 (CSW19) list, used in Scrabble tournaments pretty much everywhere else, containing 279,496 words.
  • The Spell Checker Oriented Word List (SCOWL) by Kevin Atkinson, containing 430,590 words. (See the repository for details on the configurable parameters of this word list.)

The SCOWL is included as a sort of intentional overkill, a compromise between the size of the dataset and the hope that it will contain as a subset whatever dictionary you might want to use for your application. Note that this comes at a cost of including tokens that are definitely not words in any reasonable dictionary; for example, all 26 single-letter tokens are present, not just the two words a and I.

The result is a single tab-separated text file with 457,548 rows, one for each word, and three columns: the word, followed by the number of occurrences in the 20120701 and 20200217 Google Books datasets, respectively. The entire list is sorted in decreasing order of frequency in the latest 20200217 dataset.


  1. Google Books Ngram Viewer Exports, English versions 20120701 and 20200217
  2. Atkinson, K., Spell Checker Oriented Word List (SCOWL) Copyright 2000-2018 by Kevin Atkinson
  3. Barbay, J. and Subercaseaux, B., The Computational Complexity of Evil Hangman, arXiv:2003.10000

Posted in Uncategorized | 2 Comments

Balanced clock hands

Given an analog clock with sweeping hour, minute, and second hands, at what times are the hands most evenly “balanced”– that is, most equally separated in angle?

It’s a standard high school algebra problem to compute all of the times at which just the hour and minute hands are aligned on top of each other. And it’s relatively straightforward to show that all three hands, including the second hand, are only exactly aligned twice per day, at 12 o’clock. But this problem is different: here, we would like the three angles made by the three hands to be as equal– that is, as close to 120 degrees each– as possible.

As we will see shortly, there is no time at which all three angles are exactly 120 degrees. (This is another nice exercise to prove just using pencil and paper.) So, how close can we get? This post was an admittedly somewhat frivolous excuse to experiment with SymPy, the Python symbolic mathematics library bundled with SciPy. My approach was pretty brute-force, letting SymPy do all of the heavy lifting with very few lines of code, including evaluating multiple different metrics defining what we mean by “balanced.” (All of the code discussed here is available on GitHub.)

The problem is complicated by the floor functions that would be buried in the cost function if we were to use a single parameter to represent the continuously varying time over the entire 12-hour domain. Instead, let’s divide the domain into 12×60=720 one-minute intervals, each specified by a fixed integer hour and minute, and for each, only let the second hand sweep through the single revolution of that one minute of time, computing the resulting three angles between consecutive hands:

def hand_positions(hour, minute, second):
    """Return positions in [0, 360) of clock hands."""
    r_60 = sym.Rational(60, 1)
    return (360 * (hour + minute / r_60 + second / (r_60 * r_60)) / 12,
            360 * (minute + second / r_60) / r_60,
            360 * second / r_60)

def hand_angles(hands):
    """Return angles between clock hands."""
    x, y, z = sorted(hands)
    return (y - x, z - y, x - z + 360)

At any given time, what yardstick should we use to measure how “balanced” the clock hands are? There are at least a couple of reasonable alternatives, even if we restrict our attention to (piecewise) linear cost functions. The first that occurred to me was to “maximize the minimum” angle: by the pigeonhole principle, at least one of the angles is at most 120 degrees, so let’s try to make that smallest angle as large as possible, measuring the deficit:

def max_min(*time):
    """(120 minus) minimum angle between clock hands."""
    return 120 - min(hand_angles(hand_positions(*time)))

Rather than maximizing the minimum angle, another option is to “minimize the maximum” absolute deviation of each angle from the desired optimal 120 degrees. This cost is implemented below. (I had initially also implemented the Manhattan distance, i.e., the sum (or equivalently, the average) of absolute deviations from 120 degrees. But it’s another nice problem to show that this is unnecessary: the sum of absolute deviations yields the same ordering as the maximum of absolute deviations… but this would not generally be true if our clocks somehow had more than just three hands (why?).)

def min_max(*time):
    """Max. deviation from 120 deg. of angles between clock hands."""
    return max(abs(a - 120) for a in hand_angles(hand_positions(*time)))

At this point, the key observation is that the only possible candidate times at which these cost functions are optimized are at their critical points (or at endpoints of the domain). And because these costs are piecewise linear, the critical points are easy to enumerate: either two of the angles are equal (when maximizing the minimum angle), or one of the angles is exactly 120 degrees (when minimizing the absolute deviation). The following function lumps all of these possibilities together into one list:

h, m, s = [sym.Symbol(v) for v in 'hms']

def critical_points():
    """Generate possible critical positions of sweeping second hand."""
    yield 0
    for x, y, z in permutations(hand_positions(h, m, s)):
        a, b, c = (y - x, z - y, x - z + 360)
        for lhs, rhs in ((a, b), (a, c), (b, c), (a, 120), (b, 120), (c, 120)):
            yield from sym.solve(lhs - rhs, [s])

The resulting “most balanced” times are shown below. There are a couple of interesting observations. First, a solution won’t be unique; times come in “mirror image” pairs with the same cost. Second, although the two cost functions considered here do yield slightly different optimal times, they differ by less than 3 hundredths of a second, and the same four or five best mirror pairs are all clustered within about 8 hundredths of a second of each other– all at approximately 2:54:34, along with its mirror time 9:05:25.

The two “mirror” times with the same minimum maximum absolute deviation of clock hand angles from 120 degrees: 2:54:34 394/719 (at left) and 9:05:25 325/719 (at right).

Finally, what if the second hand doesn’t sweep, but ticks from one integer second to the next? (When I was a kid, this was the magic method to distinguish between the real and fake Rolex watch, neither of which any of us had actually ever seen.) In this case, the most balanced times– that also appear within about a tenth of a second of the nearby overall top ten times– are at 5:49:09, along with its mirror time 6:10:51, as shown below.

The two “mirror” times that are most balanced if we constrain the second hand to “tick” at integer numbers of seconds: 5:49:09 (at left) and 6:10:51 (at right).
Posted in Uncategorized | Leave a comment

What is the probability of a tie vote?

This article was discussed on Reddit.

It is time to vote for a new mathematics department chair. All n=20 voting faculty members are on the ballot as eligible candidates… but no one really cares about the job, so each faculty member simply votes randomly, selecting a name uniformly at random from the n names on the ballot. The candidate receiving the most votes wins the election. What is the probability of a tie vote, that is, what is the probability that more than one faculty member receives the same maximum number of votes, requiring a runoff?

I found this problem interesting for a couple of reasons. First, the original version of the problem as presented to me was slightly different, where each candidate is excluded from voting for themselves (not only do they not care about the job, they actively avoid it). This seems significantly more difficult– or at least, much more computationally expensive– to solve. See the messy details at the end of this post.

The second interesting aspect of this problem was the potentially weird limiting behavior as the number of voting candidates grows large, as shown in the figure below.

Probability of a tie vote in an election vs. number of voting candidates.

Does the probability of a tie approach a limit as n \to \infty, or does it continue to persistently meander around? The latter would be really interesting, but perhaps not entirely unexpected: van de Brug, Kager, and Meester (see reference below) analyze the “group Russian roulette” problem, where at each time step, the survivors from an initial group of n people each shoot and kill a random other person. The probability that eventually there are no survivors (that there is a deadly tie for the win, so to speak), does not approach a limit, but instead repeatedly– but ever more slowly– varies up and down as the group size increases.

Following are my notes on the solution to both voting problems. First, the probability p(n) of a tie vote where candidates may vote for themselves is

p(n) = 1-\frac{n}{n^n}\sum\limits_{v=2}^{n}[\frac{x^n}{n!}]\frac{x^v}{v!}\left(\sum\limits_{k=0}^{v-1}\frac{x^k}{k!}\right)^{n-1}

If we constrain each candidate to vote for someone other than themselves, the resulting probability is the even more unpleasant

p^*(n) = 1-\frac{n}{(n-1)^n}\sum\limits_{v=2}^{n-1}\sum\limits_{k=0}^{n}(-1)^k [\frac{x^{n-k}}{(n-k)!}](

{{n-1} \choose {k-1}}\frac{x^{v-1}}{(v-1)!}\left(\sum\limits_{j=0}^{v-2}\frac{x^j}{j!}\right)^{k-1}\left(\sum\limits_{j=0}^{v-1}\frac{x^j}{j!}\right)^{n-k}

+ {{n-1} \choose k}\frac{x^v}{v!} \left(\sum\limits_{j=0}^{v-2}\frac{x^j}{j!}\right)^k\left(\sum\limits_{j=0}^{v-1}\frac{x^j}{j!}\right)^{n-k-1})

It’s unclear to me whether these exact formulas may be simplified, or whether they even help with analysis of asymptotic behavior.


  1. T. van de Brug, W. Kager, R. Meester, The asymptotics of group Russian roulette. [arXiv]
Posted in Uncategorized | 4 Comments

Counting edge-matching puzzles

I recently re-discovered a puzzle that I had mostly forgotten from when I was a kid. The problem is simple to state: rearrange and rotate the seven hexagonal pieces shown below so that each of the twelve pairs of facing edges have matching labels. (Currently, only the 0s at 2 o’clock and the 3s at 5 o’clock match as required.)

Example hexagonal edge-matching puzzle.

The original puzzle wasn’t exactly the one above; the edge labels were different, but the basic idea was the same. What snagged my interest here, decades later, was not solving this puzzle, but counting them. That is, if hexagonal pieces are distinguishable only by their pattern (up to rotation) of edge labels 0 through 5, then how many different possible puzzles– sets of seven such pieces packaged and sold as a product– are there?

I think this question is not “nice” mathematically– or at least, I was unable to make much progress toward a reasonably concise solution– but it was interesting computationally, because the numbers involved are small enough to be tractable, but large enough to require some thought in design and implementation of even a “brute force” approach.

(My Python solution is on GitHub. What I learned from this exercise: I had planned to implement a lazy k-way merge using the priority queue in the heapq module, but I found that it was already built-in.)

There are several variants of the question that we can ask. First and easiest, let’s ignore solvability. There are 5!=120 different individual hexagonal pieces, and so there are {7+120-1 \choose 7}, or 84,431,259,000 distinguishable sets of seven such pieces.

However, most of these puzzles do not have a solution. It turns out there are 4,967,864,520 different solvable puzzles… but there are at least a couple of ways that we might reasonably reduce this number further. For example, over a billion of these solvable puzzles have multiple solutions– 1800 of which have twenty different solutions each. If we constrain a “marketable” puzzle to have a unique solution, then there are… well, still 3,899,636,160 different possible puzzles.

Of course, many of these puzzles are only cosmetically different, so to speak. For example, the puzzle shown above has four identical pieces with the same 0-through-5 counterclockwise labeling. If we arbitrarily distinguish this “identity” piece, then although some puzzles have none of these pieces, they are not really “different” in a useful way, since we could simply relabel all of the edges appropriately so that they do contain at least one identity piece. There are only 281,528,111 different puzzles containing at least one identity piece, of which 221,013,350 have a unique solution.

Posted in Uncategorized | Leave a comment

The Fisher-Yates shuffle is backward

Given a list of n elements, such as cards in a deck, what is the right way to shuffle the list? That is, what is the appropriate algorithm to (pseudo)randomly permute the elements in the list, so that each of the n! possible permutations is equally likely?

This is an interesting problem, in part because it is easy to get wrong. The standard, all-the-cool-kids-know-it response is the Fisher-Yates shuffle, consisting of a sequence of n-1 carefully specified random transpositions, with the following basic implementation in Python:

def fisher_yates_shuffle(a):
    """Shuffle list a[0..n-1] of n elements."""
    for i in range(len(a) - 1, 0, -1): # i from n-1 downto 1
        j = random.randint(0, i) # inclusive
        a[i], a[j] = a[j], a[i]

Note that the loop index i decreases from n-1 down to 1. Everywhere I have looked, this is how the algorithm is always presented. The motivation for this post is to wonder aloud why the following variant– which seems simpler, at least to me– is not the “standard” approach, with the only difference being that the loop runs “forward” instead of backward:

def forward_shuffle(a):
    "Shuffle list a[0..n-1] of n elements."""
    for i in range(1, len(a)): # i from 1 to n-1
        j = random.randint(0, i) # inclusive
        a[i], a[j] = a[j], a[i]

It’s worth emphasizing that this is different from what seems to be the usual “forward” version of the algorithm (e.g., this “equivalent version”), that seems to consistently insist on also “mirroring” the ranges of the random draws, so that they are decreasing with each loop iteration instead of the loop index:

def mirror_shuffle(a):
    "Shuffle list a[0..n-1] of n elements."""
    for i in range(0, len(a) - 1): # i from 0 to n-2
        j = random.randint(i, len(a) - 1) # inclusive
        a[i], a[j] = a[j], a[i]

There are a couple of ways to see and/or prove that forward_shuffle does indeed yield a uniform distribution on all possible permutations. One is by induction– the rather nice loop invariant is that, after each iteration i, the sublist a[0..i] is a uniformly random permutation of the original sublist a[0..i]. (Contrast this with the normal Fisher-Yates shuffle, where after each iteration indexed by i, the “suffix” sublist a[i..n-1] is essentially a uniformly permuted reservoir-sampled subset of the entire original list.)

Another way to see that forward_shuffle works as desired is to relate its behavior to that of the original fisher_yates_shuffle, which has already been proved correct. Consider the set of independent discrete random variables \{X_1, X_2, \ldots, X_{n-1}\}, with each X_i distributed uniformly between 0 and i, inclusive. These X_i are the random draws returned from random.randint(0, i).

Imagine generating the entire set of those n-1 independent random draws up front, then applying the sequence of corresponding transpositions (i, X_i). The original Fisher-Yates shuffle applies those transpositions in order of decreasing i, while forward_shuffle applies the same set of random transpositions, but in reverse order. Thus, the permutations resulting from fisher_yates_shuffle and forward_shuffle are inverses of each other… and if a random permutation is uniformly distributed, then so is its inverse.

There is nothing special here– indeed, this forward_shuffle is really just a less dressed-up implementation of what is usually referred to as the “inside-out” version of Fisher-Yates, that for some reason seems to be presented as only appropriate when shuffling a list generated from an external source (possibly of unknown length):

def forward_shuffle(source):
    "Return shuffled list from external source."""
    a = []
    for i, x in enumerate(source):
        j = random.randint(0, i) # inclusive
        a[i], a[j] = a[j], a[i]
    return a

I say “less dressed-up” because I’ve skipped what seems to be the usual special case j==i comparison that would eliminate the swapping. The above seems simpler to me, and I would be curious to know if these (branchless) swaps are really less efficient in practice.

Posted in Uncategorized | 3 Comments

Among Us: Morse code puzzle

In the online game Among Us, players who visit the Comms room hear a fuzzy audio recording of a series of high-pitched beeps that sound like Morse code. I first heard the recording here, but this more recent video also plays it at around 5:00, followed by a good explanation of the problem with trying to decipher the code.

The following figure shows a spectrogram of the audio clip, with time on the x-axis, and each vertical slice showing the Fourier transform of a short (roughly 50 ms) sliding window of the signal centered at the corresponding time. We can clearly see the “dots” and “dashes” at around 1 kHz, with the corresponding translation overlaid in yellow.

Spectrogram of the Comms room audio, with the translated Morse code also indicated in yellow.

Now that we have the Morse code extracted from the audio (which, for reference if you want to copy-paste and play with this problem, is “.-..--...-.---...-..-...“), we just need to decode it, right? The problem is that the dots and dashes are all uniformly spaced, without the required longer gaps between letters, let alone the still longer gaps that would be expected between words. Without knowing the intended locations of those gaps, the code is ambiguous: for example, the first dot could indicate the letter E, or the first dot and dash together could indicate an A, etc.

That turns out to be a big problem. The following figure shows the decoding trie for Morse code letters and digits; starting at the root, move to the left child vertex for each dot, or to the right child vertex for each dash. A red vertex indicates either an invalid code or other punctuation.

Morse code represented as a decoding trie; each left and right edge indicates a dot or dash, respectively.

If we ignore the digits in the lowest level of the trie, we see that not only are Morse code letters ambiguous (i.e., not prefix-free), they are nearly “maximally ambiguous,” in the sense that the trie of letters is nearly complete. That is, for almost any prefix of four dots and dashes we may encounter, the gap indicating the end of the first letter could be after any of those first four symbols.

This would make a nice programming exercise for students, to show that this particular sequence of 24 symbols may be decoded into a sequence of letters in exactly 3,457,592 possible ways. Granted, most of these decodings result in nonsense, like AEABKGEAEAEEE. But a more interesting and challenging problem is to efficiently search for reasonable decodings, that is, messages consisting of actual (English?) words, perhaps additionally constrained by grammatical connections between words.

Of course, it’s also possible– probable?– that this audio clip is simply made up, a random sequence of dots and dashes meant to sound like “real” Morse code. And even if it’s not, we might not be able to tell the difference. Which is the interesting question that motivated this post: if we generate a completely random, and thus intentionally unintelligible, sequence of 24 dots and dashes, what is the probability that it still yields a “reasonable” possible decoding, for sufficiently large values of “reasonable”?

Posted in Uncategorized | 6 Comments

Counting Hotel Key Cards


Suppose that you are the owner of a new hotel chain, and that you want to implement a mechanical key card locking system on all of the hotel room doors. Each key card will have a unique pattern of holes in it, so that when a card is inserted into the corresponding room’s door lock, a system of LEDs and detectors inside the lock will only recognize that unique pattern of holes as an indication to unlock the door.

(I have vague childhood memories of family vacations and my parents letting me use just such an exotic gadget to unlock our hotel room door.)

When you meet with a lock manufacturer, he shows you some examples of his innovative square key card design, with the “feature” that a key card may be safely inserted into the slot in a door lock in any of its eight possible orientations: any of the four edges of the square key card may be inserted first, with either side of the key card facing up. Each key card has a pattern of up to 36 holes aligned with a 6×6 grid of sensors in the lock that may “scan” the key card in any orientation.

Examples of hotel key cards each with a 6×6 grid of 36 possible holes.

The lock manufacturer agrees to provide locks and corresponding key cards for each room, with the following requirements:

  1. A manufacturer-provided key card will only open its assigned manufacturer-provided lock and no other; and
  2. A manufacturer-provided key card will open its assigned manufacturer-provided lock when inserted into the slot in any orientation.

How many distinct safely-locked rooms can the manufacturer support?

A simpler lock is a harder problem

The problem as stated above is a relatively straightforward application of Pólya counting, using the cycle index of the dihedral group of symmetries of the key card acting on (2-colorings of) the n \times n grid of possible holes in the card. When n is even, the cycle index (recently worked out in a similar problem here) is

Z(D_4) = \frac{1}{8}(x_1^{n^2}+2x_4^{\frac{n^2}{4}}+3x_2^{\frac{n^2}{2}}+2x_1^n x_2^{\frac{n^2-n}{2}})

Evaluating at n=6, x_i=2 yields a total of 8,590,557,312 distinct key cards– and corresponding hotel room door locks– that the manufacturer can provide.

However, these locks are expensive: the second requirement above means that each lock must contain not only the sensing hardware to scan the pattern of holes in a key card, but also the software to compare that detected pattern against the eight possibly distinct rotations and reflections of the pattern that unlocks the door. (For example, the key card on the left in the figure above “looks the same” to the sensor in any orientation; the key card in the middle, however, may present any of four distinct patterns of scanned holes; and the key card on the right “looks different” in each of its eight possible rotated or flipped orientations.)

Which leads to the problem that motivated this post: to reduce cost, let’s modify the second requirement above– but still retaining the first requirement– so that a manufacturer-provided key card will only open its assigned manufacturer-provided lock when inserted into the slot in a single correct orientation labeled on the key card. This way, the sensing hardware in the lock only needs to “look for” a single pattern of holes.

Now how many distinct key cards and corresponding room locks are possible?

Counting regular orbits

The idea is that, referring again to the figure above, key cards may only have patterns of holes like the example on the far right, without any rotation or reflection symmetries. In other words, given the (dihedral) group G of symmetries acting on colorings of the set X of possible key card hole positions, we are counting only regular orbits of this action– i.e., those orbits whose colorings are “fully asymmetric,” having a trivial stabilizer.

So how can we do this? My approach was to use inclusion-exclusion, counting those colorings fixed by none of the symmetries in G-\{e\}. To start, we represent each element of G as a list of lists of elements of X, corresponding to the disjoint cycles in the permutation of X. For a given subset S \subseteq G-\{e\} in the inclusion-exclusion summation, consider the equivalence relation on X relating two key card hole positions if we can move one to the other by a sequence of symmetries in S. Then the desired number of k-colorings fixed by S is k^{m_S}, where m_S is the number of equivalence classes.

We can compute this equivalence relation using union-find to incrementally “merge” the sets of disjoint cycles in each permutation in S (all of the code discussed here is available on GitHub):

def merge(s, p):
    """Merge union-find s with permutation p (as cycles)."""
    def find(x):
        while s[x] != x:
            x = s[x]
        return x
    def union(x, y):
        x = find(x)
        s[find(y)] = x
        return x
    for cycle in p:
        reduce(union, cycle)
    for x in range(len(s)):
        s[x] = find(x)
    return s

It remains to compute the inclusion-exclusion alternating sum of these (-1)^{|S|}k^{m_S} over all subsets S \subseteq G-\{e\}.

def cycle_index_term(s, k=2):
    """Convert union-find s to cycle index monomial at x[i]=k."""
    #return prod(x[i]**j for i, j in Counter(Counter(s).values()).items())
    return k ** sum(Counter(Counter(s).values()).values())

def asymmetric_colorings(group, k=2):
    """Number of k-colorings with no symmetries in the given group."""

    # Group G acts on (colorings of) X = {0, 1, 2, ..., n-1}.
    G = list(group)
    n = sum(len(cycle) for cycle in G[0])

    # Compute inclusion-exclusion sum over subsets of G-e.
    G = [g for g in G if len(g) < n]
    return sum((-1) ** len(subset) *
               cycle_index_term(reduce(merge, subset, list(range(n))), k)
               for subset in chain.from_iterable(combinations(G, r)
                                                 for r in range(len(G) + 1)))

Evaluating the result– and dividing by the size of each orbit |G|=8— yields 8,589,313,152 possible “fully asymmetric” key cards satisfying our requirements.


At first glance, this seems like a nice solution, with a concise implementation, that doesn’t require much detailed knowledge about the structure of the symmetry group involved in the action… but we get a bit lucky here. The time to compute the inclusion-exclusion summation is exponential in the order of the group, which just happens to be small in this case.

For a more complex example, imagine coloring each face of a fair die red or blue; how many of these colorings are “orientable,” so that if the die rests on a table and we pick it up, put it in a cup and shake it and roll the die to a random unknown orientation, we can inspect the face colors to unambiguously determine the die’s original resting orientation? We can use the same code above to answer this question for a cube or tetrahedron (0 ways) or octahedron (120/24=5 ways)… but the dodecahedron and icosahedron are beyond reach, with rotational symmetry groups of order 60.

Of course, in those particular cases, we can lean on the additional knowledge about the structure of the subgroup inclusion partial order to solve the problem with fewer than the 2^{60}-ish operations required here. But is there a way to improve the efficiency of this algorithm in a way that is still generally applicable to arbitrary group actions?

Posted in Uncategorized | 5 Comments

Exploiting advantage from too few shuffles


A few days ago a friend of mine referred me to an interesting podcast discussing card shuffling, framed as a friendly argument-turned-wager between a couple about how many times you should shuffle a deck of cards. A woman claims that the “rule” is that you riffle shuffle three times, then quit messing around and get to dealing. Her partner, on the other hand, feels like at least “four or more” riffle shuffles are needed for the cards to be sufficiently random.

A mathematician is brought into the discussion, who mentions the popular result that seven shuffles are needed… at least according to specific, but perhaps not necessarily practical, mathematical criteria for “randomness.” (There is some interesting preamble about the need to define exactly what is meant by “random,” which I was disappointed to hear defined as, “any card is equally likely to be in any position in the deck.” This isn’t really even close to good enough. For example, start with a brand new deck of cards in a known order, and simply cut the deck at a uniformly random position. Now each and every card is equally likely to be in any position in the deck, but the resulting arrangement of cards can hardly be called sufficiently random.)

A win for the man, right? But the woman’s side is vindicated in the end, by noting that even in casinos– where presumably this has been given a lot of thought– a standard poker deck is typically only shuffled three times. Several dealers are interviewed, each describing the process with the chant, “riffle, riffle, box, riffle, cut.”

The wash

A couple of observations occurred to me after listening to this discussion. First, it’s true that casino dealers don’t shuffle seven times… but they also don’t just shuffle three times. Particularly when presented with a brand new pack, before any riffle shuffling, they often start with a “wash,” consisting of spreading the cards haphazardly around the table, eventually collecting them back into a squared-up deck to begin the riffle-and-cut sequence.

Depending on how thorough it is, that initial wash alone is arguably sufficient to randomize the deck. If we think of a single riffle shuffle as applying a random selection of one of “only” 2^{52} possible permutations in a generating set, then the wash is roughly akin to making a single initial selection from a generating set of all 52! possible arrangements. If the wash is thorough enough that this selection is approximately uniform, then after that, any additional shuffling, riffle or otherwise, is just gravy.

When does it really matter?

The second observation is one made by a dealer interviewed in the podcast, who asks what I think is the critical practical question:

The real question is, what’s the goal of the shuffle? Is it to completely randomize the cards, or is it to make it so that it’s a fair game?

In other words, if we are going to argue that three, or any other number of shuffles, is not sufficient, then the burden is on us to show that this limited number of shuffles provides a practical advantage that we can actually exploit in whatever game we happen to be playing.

We have discussed some examples of this here before. For example, this wonderful card trick due to Charles Jordan involves finding a spectator’s secretly selected card, despite being buried in a thrice-shuffled deck. And even seven shuffles is insufficient to eliminate a huge advantage in the so-called New Age Solitaire wager.

But it’s an interesting question to consider whether there are “real” card games– not magic tricks or contrived wagers– where advantage may be gained by too few shuffles.

I struggled to think of such a practical example, and the following is the best I can come up with: let’s play a simplified version of the card game War (also discussed here recently). Start with a “brand new” deck of cards in the following order:

A “new deck order” of cards prior to shuffling for a simplified game of War.

Riffle shuffle the deck three times, and cut the deck. In fact, go ahead and cut the deck after each riffle shuffle. Then I will deal the cards into two equal piles of 26 cards, one for each of us. At each turn, we will simultaneously turn over the top card from our piles, and the higher card wins the “trick.” Let’s simplify the game by just playing through the deck one time, and instead of a “war” between cards of the same rank, let’s just discard the trick as a push. At the end of the game, whoever has taken the most tricks wins a dollar from the other player.

If three shuffles is really sufficient to make this a “fair” game, then the expected return for each player should be zero. Instead, I as the dealer will win over two out of three games, taking about 42 cents from you per game on average!

Of course, this is still contrived. Even the initial deck order above is cheating, since it isn’t the typical “new deck order” in most packs manufactured in the United States. And if we play the game repeatedly (with three shuffle-cuts in between), the advantage returns to near zero for reasonable methods of collecting the played cards back into the deck.

So, I wonder if there are better real, practical examples of this kind of exploitable advantage from too few shuffles? And can this advantage persist across multiple games, with the same too-few shuffles in between? It’s interesting to consider what types of games involve methods of collecting the played cards back into the deck to shuffle for the next round, that might retain some useful ordering; rummy-style games come to mind, for example, where we end up with “clumps” of cards of the same rank, or of consecutive ranks, etc.

Posted in Uncategorized | Leave a comment