Using a watch– or a stick– as a compass

A couple of years ago, I wrote about a commonly cited method of direction-finding using an analog watch and the sun.  Briefly, if you hold your watch face horizontally with the hour hand pointing toward the sun, then the ray halfway between the hour hand and 12 noon points approximately true south.  (This is for locations in the northern hemisphere; there is a slightly different version that works in the southern hemisphere.)
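For concreteness, the geometry of the watch method reduces to a simple formula: the hour hand sweeps 30 degrees per hour, so the bisector of the hour hand and 12 lies 15 degrees per hour away from the sun.  A minimal sketch (the function name and sign convention here are mine, not a standard):

```python
def watch_estimate_of_south(sun_azimuth_deg, local_hour):
    # With the hour hand pointed at the sun, the hour hand sits
    # 30 * (local_hour - 12) degrees clockwise of the 12 mark, so the
    # bisector lies 15 * (local_hour - 12) degrees back toward 12 from
    # the sun.  Azimuths are measured clockwise from north.
    return (sun_azimuth_deg - 15.0 * (local_hour - 12.0)) % 360.0
```

At local solar noon in the northern hemisphere the sun is due south, and the estimate is exact; the errors discussed below come from the sun not actually moving at a uniform 15 degrees of azimuth per hour.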

The punch line was that the method can be extremely inaccurate, with errors potentially exceeding 80 degrees depending on the location, month, and time of day.  I provided a couple of figures, each for a different “extreme” location in the United States, showing the range of error in estimated direction over the course of an entire year.

Unfortunately, I ended on that essentially negative note, without considering any potentially more accurate methods as an alternative.  This post is an attempt to remedy that.  In recent discussion in the comments, Steve H. suggested analysis of the use of the “shadow-stick” method: place a stick vertically in the ground, and mark the location on the ground of the tip of the stick’s shadow at two (or more) different times.  The line containing these points will be roughly west-to-east.
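In code, the shadow-stick estimate is just the direction of the line through the two marks (a sketch; the coordinate convention is my own assumption):

```python
import math

def shadow_stick_bearing(mark1, mark2):
    # mark1, mark2: (east, north) ground coordinates of the shadow tip,
    # marked in time order.  The tip moves roughly west to east, so the
    # direction from the first mark to the second approximates due east.
    de, dn = mark2[0] - mark1[0], mark2[1] - mark1[1]
    return math.degrees(math.atan2(de, dn)) % 360.0  # compass bearing
```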

Illustration of the shadow-stick method of direction-finding. With a stick placed vertically in the ground, the tip of the stick’s shadow moves roughly from west to east.

As the following analysis shows, this shadow-stick method of direction-finding is indeed generally more accurate than the watch method… most of the time, anyway.  But even when it is better, it can still be bad.  It turns out that both methods are plagued with some problems, with the not-so-surprising conclusion that if you need to find your way home, there is a tradeoff to be made between accuracy and convenience.

One of the problems with my original presentation was condensing the behavior of the watch method over an entire year into a single plot (in this case, at Lake of the Woods in Minnesota, at a northern latitude where the watch method’s accuracy is best).  This clearly shows the performance envelope, i.e. the maximum possible error over the whole year, but it hides the important trending behavior within each day, and how that daily trend changes very gradually over the year.  We can see this more clearly with an animation: the following shows the same daily behavior of error in estimated direction using the watch method (in blue), but also the shadow-stick method (in red), over the course of this year.

Accuracy of the watch method (blue) and shadow-stick method (red) of direction-finding, over the course of the year 2014 in Lake of the Woods, Minnesota. The shadow-stick method is more accurate 40.6% of the time.

For reference, following are links to a couple of other animations showing the same comparison at other locations.

  • Florida Keys (a southern extreme, where the watch method performs poorly, included in the original earlier post)
  • Durango, Colorado (discussed in the comments on the earlier post)

There are several things to note here.  First, this is an example where the shadow-stick method can actually perform significantly worse than the watch method.  Its worst-case behavior is near the solstices in June and December, with errors exceeding 30 degrees near sunrise and sunset.  This worst-case error increases with latitude, which is the opposite of how the watch method behaves, as shown by the Florida Keys example above.

However, note the symmetry in the error curve for the shadow-stick method.  It always passes from an extreme in the morning, to zero around noon, to the other extreme in the evening.  We can exploit this symmetry… if we are willing to wait around a while.  That is, we could improve our accuracy by making a direction measurement some time in the morning before noon, then making another measurement at the same time after noon, and using the average of the two as our final estimate.  (A slightly easier common refinement of the shadow-stick method is to (1) mark the tip of the shadow sometime in the morning, then (2) mark the shadow again later in the afternoon when the shadow is the same length.  The basic idea is the same in either case.)
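Averaging the two bearings needs a little care, since compass bearings wrap around at 360 degrees; a circular mean handles this (a sketch, not from the original analysis):

```python
import math

def average_bearing(b1, b2):
    # Circular mean of two compass bearings in degrees, robust to
    # wraparound (e.g., averaging 350 and 10 gives 0, not 180).
    x = math.cos(math.radians(b1)) + math.cos(math.radians(b2))
    y = math.sin(math.radians(b1)) + math.sin(math.radians(b2))
    return math.degrees(math.atan2(y, x)) % 360.0
```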

Finally, this issue of the length of time between measurements is likely an important consideration in the field.  A benefit of the watch method is that you get a result immediately; look at the sun, look at your watch, and you’re off.  The shadow-stick method, on the other hand, requires a pair of measurements, with some waiting time in between.  How long are you willing to wait for more accuracy?

Interestingly, the benefit of that additional waiting time isn’t linear– that is, all of the data shown here assumes just 15 minutes between marking the stick’s shadow.  Waiting longer can certainly reduce the effect of measurement error (i.e., the problem of using cylindrical sticks and spherical pebbles, etc., instead of mathematical line segments and points) by providing a longer baseline… but the inherent accuracy of the method only improves significantly when the two measurement times span apparent noon, as in the refinement above, which could take hours.

To wrap up, I still do not see a way to condense this information into a reasonably simple, easy-to-remember, expedient method for finding direction in the field without a compass.  The regular, symmetric behavior of the error in the shadow-stick method suggests that we could possibly devise an “immediate” method of eliminating most of that error, by considering the extent and sense of the error as a function of the season, and a “scale factor” as a function of the time until/since noon… but that starts to sound like anything but “simple and easy-to-remember.”


Posted in Uncategorized | 3 Comments

The Price Is Right puzzle solution

This is just a quick follow-up to capture my notes on an approach to solving the following problem from the last post:

Problem: Suppose you and n other players are playing the following game.  Each player in turn spins a wheel as many times as desired, with each spin adding to the player’s score a random value uniformly distributed between 0 and 1.  However, if a player’s cumulative score ever exceeds 1, she immediately loses her turn (and the game).

After all players have taken a turn, the player with the highest score (not exceeding 1) wins the game.  You get to go first– what should your strategy be, and what are your chances of winning?

Solution: Suppose that our strategy is to continue to spin until our score exceeds some target value t.  The first key observation is that the probability that we do so “safely,” i.e. reach a total score greater than t without exceeding 1, is

q(t) = e^t(1-t)

To see this, partition the successful outcomes by the number k of initial spins with total score at most t, with the final (k+1)st spin landing the total score in the interval (t, 1].  The probability of such an outcome is

\frac{t^k}{k!}(1-t)

(which may be shown by induction, or see also here and here), and the result follows from the Taylor series expansion for e^t.
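A quick Monte Carlo sanity check of q(t) = e^t(1-t) (this script is mine, not part of the original solution):

```python
import math
import random

def q_estimate(t, trials=200000, seed=12345):
    # Estimate the probability of "safely" passing target t: spin until
    # the total exceeds t, and count the runs that stay at or below 1.
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        total = 0.0
        while total <= t:
            total += rng.random()
        if total <= 1.0:
            wins += 1
    return wins / trials
```

For example, q_estimate(0.5) should agree with e^0.5 · 0.5 ≈ 0.824 to within sampling error.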

At this point, we can express the overall probability of winning using strategy t as

p_n(t) = q(t) \int_t^1 \frac{1}{1-t} (1 - q(s))^n ds

Intuitively, we must:

  1. Safely reach a score in the interval (t, 1], with probability q(t); and
  2. For each such score s, reached with uniform density 1/(1-t), each of the remaining n players must fail to beat our score.

The optimal strategy consists of choosing a target score t maximizing p_n(t).  Unfortunately, this does not have a closed form; however, after differentiating and some manipulation, the desired target score t can be shown to be the root in [0, 1] of the following equation, which has a nice interpretation:

\int_t^1 (1 - q(s))^n ds = (1 - q(t))^n

The idea is that we are choosing a target score t where the (left-hand side) probability of winning by spinning one more time exactly equals the (right-hand side) probability of winning by stopping with the current score t.

Handing the problem over to the computer, the following figure shows the optimal target score and corresponding probability of winning for the first few values of n.

Optimal target score (red) and corresponding probability of winning (blue), vs. number of additional players.

One final note: how does this translate into an optimal strategy for the players after the first?  At any point in the game, the current player has some number n of players following her.  Her optimal strategy is to target the maximum of the best score so far from the previous players, and the critical score computed above.

[Edit: Following is Python code using mpmath that implements the equations above.]

import mpmath

def q(t):
    return mpmath.exp(t) * (1 - t)

def p_stop(t, n):
    return (1 - q(t)) ** n

def p_spin(t, n):
    return mpmath.quad(lambda s: p_stop(s, n), [t, 1])

def t_opt(n):
    return mpmath.findroot(
        lambda t: p_spin(t, n) - p_stop(t, n), [0, 1], solver='ridder')

def p_win(t, n):
    return q(t) / (1 - t) * p_spin(t, n)

if __name__ == '__main__':
    for n in range(11):
        t = float(t_opt(n))
        p = float(p_win(t, n))
        print('{:4} {:.9f} {:.9f}'.format(n, t, p))



The Price Is Right… sort of

After a couple of recent conversations about the dice game puzzle proposed a few months ago, I spent some time experimenting with the following game that has a vaguely similar “feel,” and that also has the usual desirable feature of being approachable via either computer simulation or pencil and paper.

Suppose you and n other players are playing the following game.  Each player in turn spins a wheel as many times as desired, with each spin adding to the player’s score a random value uniformly distributed between 0 and 1.  However, if a player’s cumulative score ever exceeds 1, she immediately loses her turn (and the game).

After all players have taken a turn, the player with the highest score (not exceeding 1) wins the game.  You get to go first– what should your strategy be, and what are your chances of winning?

The obvious similarity to the dice game Pig is in the “jeopardy”-type challenge of balancing the risk of losing everything– in this case, by “busting,” or exceeding a total score of 1– with the benefit of further increasing your score, and thus decreasing the other players’ chances of beating that score.

I like this “continuous” version of the problem, for a couple of reasons.  First, it’s trickier to attack with a computer, resisting a straightforward dynamic programming approach.  But at the same time, I think we still need the computer, despite some nice pencil-and-paper mathematics involved in the solution.

We can construct an equally interesting discrete version of the game, though, as well: instead of each spin of the wheel yielding a random real value between 0 and 1, suppose that each spin yields a random integer between 1 and m (say, 20), inclusive, where each player’s total score must not exceed m.  The first player who reaches the maximum score not exceeding m wins the game.
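In the discrete game, the analog of the "safe passage" probability q(t) can be computed exactly by a short dynamic program; here is a sketch (function name and approach are mine, using exact rational arithmetic):

```python
from fractions import Fraction

def q_discrete(t, m):
    # Probability of reaching a total in (t, m] without busting, where
    # each spin is uniform on {1, ..., m} and we spin while total <= t.
    # p[s] = probability the running total ever equals s (for s <= t).
    p = [Fraction(0)] * (t + 1)
    p[0] = Fraction(1)
    for s in range(1, t + 1):
        p[s] = sum(p[s - j] for j in range(1, min(m, s) + 1)) / m
    # From any total s <= t, the final spin lands in (t, m] with
    # probability (m - t) / m, independent of s.
    return Fraction(m - t, m) * sum(p)
```

Note the pleasant parallel with the continuous version: sum(p) plays the role of e^t, and (m-t)/m the role of (1-t).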

This version of the game with n=3 and m=20 is very similar to the “Showcase Showdown” on the television game show The Price Is Right, where three players each get up to two spins of a wheel partitioned into dollar amounts from $.05 to $1.00, in steps of $.05.  The television game has been analyzed before (see here, for example), but as a computational problem I like this version better, since it eliminates both the limit on the number of spins, as well as the potential for ties.



Reddit’s comment ranking algorithm revisited


The “Bayesian/frequentist” coin puzzle discussed in the last couple of posts was really just an offshoot of some thoughts I have been mulling over about Reddit’s current default approach to ranking user comments on a post, based on the number of upvotes and downvotes each comment receives.  (Or more generally, the problem of ranking a collection of any items, whether comments, or consumer products, etc., based on varying numbers of “like/dislike” votes.)  Instead of trying to estimate the bias of a coin based on the observed number of heads and tails flipped, here each comment is like a coin, and each upvote (or downvote) is like an observation of a coin flip coming up heads (or tails).

If we assume that each comment has some fixed– but unknown– probability \theta that a random user will upvote the comment, then it would be convenient to simply sort all of the comments on a particular post by decreasing \theta, so that the “best” comments would appear near the top.  Unfortunately, we don’t actually know \theta; we can only estimate it somehow using the observed pair (u,d) of upvotes and downvotes, respectively.

A natural first idea might be to “score” each comment using the maximum likelihood estimate

\hat{\theta} = \frac{u}{u+d}

and sort the comments by this score.  But this tends to unfairly compare comments with very different numbers of total votes; e.g., should a comment with votes (3,0) really be ranked higher than (99,1)?

Wilson Score Interval

Evan Miller’s “How Not To Sort By Average Rating” does a good job of presenting this and other approaches, eventually arguing for sorting by the lower bound of the Wilson score interval, which is what Reddit currently does.  Briefly, the Wilson score interval is a confidence interval intended to “cover” (i.e., contain) the true– but unknown– value \theta with at least some guaranteed probability, described as the “confidence level.”  In general, the higher the confidence level, or the fewer the number of observations, the wider the corresponding confidence interval.  By scoring each comment with the lower bound of this confidence interval, we are effectively starting with a point estimate based on the fraction of upvotes, but then penalizing this score according to the total number of votes, with fewer votes receiving a greater penalty.

Reddit’s use of this scheme has evolved slightly over time, initially computing a 70% confidence interval, but then changing to the current wider 80% confidence interval, having the effect of imposing a slightly greater penalty on comments with fewer total votes.  This “fine-tuning” of the scoring algorithm raises the question of whether there might be a more natural method for ranking user comments that does not require this sort of knob-turning.
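For reference, the lower bound itself is a short computation; the following sketch uses z ≈ 1.28 for the 80% confidence level mentioned above, and should be read as an illustration rather than Reddit's exact implementation:

```python
import math

def wilson_lower_bound(u, d, z=1.2816):
    # Lower bound of the Wilson score interval for the upvote
    # probability, given u upvotes and d downvotes.
    n = u + d
    if n == 0:
        return 0.0
    phat = u / n
    center = phat + z * z / (2 * n)
    spread = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return (center - spread) / (1 + z * z / n)
```

This resolves the earlier example in the intuitive direction: (99,1) scores well above (3,0), since the interval for the latter is much wider.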

A Bayesian Alternative

Last year, James Neufeld proposed the interesting idea of sampling a random score for each comment by drawing from a corresponding beta distribution with parameters

(\alpha, \beta) = (u+1, d+1)

The idea is that this beta distribution is a natural way to express our uncertainty about the “true” value \theta of a comment, starting with an assumed prior uniform distribution on \theta (i.e., a comment is initially equally likely to be great, terrible, or anything in between), and updating based on the observation of (u,d) upvotes and downvotes, respectively.  For example, a comment with 30 upvotes and 10 downvotes yields a beta distribution with the following density:

Probability density of beta distribution with parameters (30+1,10+1).

A key point is that every user does not necessarily see the comments for a post in the same order.  Each time the post is viewed, the comments are re-scored by new random draws from the corresponding beta distributions, and sorted accordingly.  As a comment receives more and more upvotes and/or downvotes, it will “settle in” to a particular position among other comments… but comments with few votes, or even strongly downvoted comments, will still have some chance of appearing near the top of any particular user’s view of the page.
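Neufeld's scheme is only a few lines; a sketch (the function name is mine):

```python
import random

def randomized_ranking(comments, rng=random):
    # comments: list of (upvotes, downvotes) pairs.  Each comment gets a
    # fresh random score drawn from Beta(u + 1, d + 1), and the page is
    # sorted by that score, best first.
    return sorted(comments,
                  key=lambda ud: rng.betavariate(ud[0] + 1, ud[1] + 1),
                  reverse=True)
```

Each call produces a possibly different ordering, which is exactly the non-determinism discussed next.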

I really like this idea, but the non-deterministic ordering of comments presented to different users may be seen as a drawback.  Can we fix this?

Sorting by Expected Rank

I can think of two natural deterministic modifications of this approach.  The first is to sort comments by their expected ranking using the random scoring described above.  In other words, for each comment, compute the expected number of other comments that would appear higher than it on one of Neufeld’s randomly generated pages, and sort the comments by this expected value.

Although this method “fixes” the non-determinism of the original, unfortunately it suffers from a different undesirable property: the relative ranking of two comments may be affected by the presence or absence of other comments on the same post.  For example, consider the two comments identified by their upvote/downvote counts (0,1) and (1,3).  If these are the only two comments on a post, then (0,1) < (1,3).  However, if we introduce a third comment (7,3), then the resulting overall ranking is (1,3) < (0,1) < (7,3), reversing the ranking of the original two comments!
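This non-independence can be checked by simulation; the following sketch (my own code, not from the post) estimates expected ranks by repeatedly drawing Neufeld's random scores:

```python
import random

def expected_ranks(comments, trials=50000, seed=1):
    # comments: list of (upvotes, downvotes) pairs.  Returns the average
    # rank (0 = top of page) of each comment over many random orderings.
    rng = random.Random(seed)
    totals = [0] * len(comments)
    for _ in range(trials):
        scores = [rng.betavariate(u + 1, d + 1) for u, d in comments]
        order = sorted(range(len(comments)), key=lambda i: -scores[i])
        for position, i in enumerate(order):
            totals[i] += position
    return [total / trials for total in totals]
```

With just the two comments, (0,1) ends up with the worse (larger) expected rank; the exact pairwise probability P(X_{(0,1)} > X_{(1,3)}) works out to 10/21.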

Pairwise comparisons

Which brings me, finally, to my initial idea for the following second alternative: sort the comments on a post according to the order relation

(u_1,d_1) < (u_2,d_2) \iff P(X_1 > X_2) < \frac{1}{2}

where

X_k \sim \mathrm{Beta}(u_k+1, d_k+1)

More intuitively, we are simply ranking one comment higher than another if it is more likely than not to appear higher using Neufeld’s randomized ranking.

Note one interesting property of this approach that distinguishes it from all of the other methods mentioned so far: it does not involve assigning a real-valued “score” to each individual comment (and subsequently sorting by that score).  This is certainly possible in principle (see below), but as currently specified we can only compare two comments by performing a calculation involving parameters of both in a complex way.

Open Questions

Unfortunately, there are quite a few holes to be patched up with this method, and I am hoping that someone can shed some light on how to address these.  First, the strict order defined above is not quite a total order, since there are some pairs of distinct comments where one comment’s randomized score is equally likely to be higher or lower than the other.  For example, all of the comments of the form (u,u), with an equal number of upvotes and downvotes, have this problem.  This is probably not a big deal, though, since I think it is possible to arbitrarily order these comments, for example by increasing total number of votes.

But there are other more interesting pairs of incomparable comments.  For example, consider (5,0) and (13,1).  The definition above is insufficient to rank these two… but it turns out that it had better be the case that (13,1) < (5,0), since we can find a third comment that lies between them:

(13,1) < (70,8) < (5,0)

This brings us to the next open question: is this order relation transitive (in other words, is it even a partial order)?  I have been unable to prove this, only verify it computationally among comments with bounded numbers of votes.

The final problem is a more practical one: how efficiently can this order relation be computed?  Evaluating the probability that one beta-distributed random variable exceeds another involves a double integral that “simplifies” to an expression involving factorials and a hypergeometric function of the numbers of upvotes and downvotes.  If you want to experiment, following is Python code using the mpmath library to compute the probability P(X_1 > X_2):

from mpmath import fac, hyp3f2

def prob_greater(u1, d1, u2, d2):
    return (hyp3f2(-d2, u2 + 1, u1 + u2 + 2, u2 + 2, u1 + u2 + d1 + 3, 1) *
            fac(u1 + u2 + 1) / (fac(u1) * fac(u2)) *
            fac(u1 + d1 + 1) * fac(u2 + d2 + 1) /
            ((u2 + 1) * fac(d2) * fac(u1 + u2 + d1 + 2)))

print(prob_greater(5, 0, 13, 1))

John Cook has written a couple of interesting papers on this, in the medical context of evaluating clinical trials.  This one discusses various approximations, and this one presents exact formulas and recurrences for some special cases.  The problem of computing the actual probability seems daunting... but perhaps it is a simpler problem in this case to not actually compute the value, but just determine whether it is greater than 1/2 or not?

In summary, I think these difficulties can be rolled up into the following more abstract statement of the problem: can we impose a "natural," efficiently computable total order on the set of all beta distributions with positive integer parameters, that looks something like the order relation described above?



A coin puzzle revisited

This is a follow-up to some interesting discussion in the comments on my previous post, involving a coin-flipping probability puzzle, and a comparison of Bayesian and frequentist approaches to “solving” it.  For completeness, here is the original problem:

You have once again been captured by pirates, who threaten to make you walk the plank unless you can correctly predict the outcome of an experiment.  The pirates show you a single gold doubloon, that when flipped has some fixed but unknown probability of coming up heads.  The coin is then flipped 7 times, of which you observe 5 to be heads and 2 to be tails.  At this point, you must now bet your life on whether or not, in two subsequent flips of the coin, both will come up heads.  If you predict correctly, you go free; if not, you walk the plank.  Which outcome would you choose?

A typical puzzle-solver would (rightly) point out that necessary information is missing; we cannot determine the optimal action without knowing how the coin (and thus its bias) was selected.  Instead of providing that information, I stirred the Bayesian vs. frequentist debate by showing how each might reason without that information, and come up with differing conclusions.

One of the reasons that I like this problem is that the “Bayesian vs. frequentist” perspective is a bit of a ruse.  The frequentist in the original post computes the maximum likelihood estimate of the probability of the coin coming up heads… and makes a betting decision based on that estimate.  The Bayesian performs a slightly more complex calculation, involving updating a prior beta distribution using the observed flips, doing some calculus… but then makes a similar “threshold” betting decision based on that calculation.

The key observation is that any deterministic betting strategy whatsoever, whether wearing a frequentist hat, a Bayesian hat, or a clown hat, may be specified as a function

f:\{0, 1, 2, ..., n\} \rightarrow \{0, 1\}

mapping the number of heads observed in n=7 total flips to 1 indicating a bet for two subsequent heads, and 0 indicating a bet against.  Neither the underlying statistical philosophy nor the complexity of implementation of this function matters; all that matters is the output.

Actually, we can simplify things even further if we only consider “monotonic” strategies of the form “bet for two heads if k or more heads are observed, otherwise bet against.”  That is,

f_k(h) = H[h-k]

where H[] is the unit step function.

As mendel points out in the comments on the previous post, the frequentist MLE strategy is equivalent to f_5 (i.e., bet on two heads with “5 or more” observed heads), and the Bayesian strategy is equivalent to f_6 (“6 or more”).  We can compare these strategies– along with the seven other monotonic strategies– by computing the probability of their success, as a function of the unknown probability p of heads for each single coin flip.  That is, the probability of surviving the game with strategy f_k is

\sum_{h=0}^n {n \choose h} p^h (1-p)^{n-h}(f_k(h)(2p^2-1) + 1-p^2)
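This sum is easy to evaluate numerically; a sketch (the function name is mine):

```python
from math import comb

def survival_prob(p, k, n=7):
    # Probability of going free with threshold strategy f_k: bet on two
    # heads exactly when h >= k of the n observed flips are heads.
    # Betting for two heads wins with probability p^2; betting against
    # wins with probability 1 - p^2.
    total = 0.0
    for h in range(n + 1):
        p_h = comb(n, h) * p**h * (1 - p)**(n - h)
        f = 1 if h >= k else 0
        total += p_h * (f * (2 * p**2 - 1) + 1 - p**2)
    return total
```

As a check, the "never bet for two heads" strategy f_8 with a fair coin survives with probability 1 - (1/2)^2 = 3/4.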

The following figure shows the results for all nine strategies:

Comparison of monotonic strategies as a function of probability of heads in a single coin flip. The frequentist MLE strategy is “5 or more,” and the Bayesian strategy is “6 or more.”

The MLE strategy (green) and Bayesian strategy (blue) are certainly contenders for the best reasonable approach.  However, neither of these, nor any other single strategy, dominates all others for all possible values of the unknown probability of heads in a single coin flip.  In other words, whether the Bayesian or frequentist has a better chance of survival truly does depend on the information that we are explicitly not given.



A coin puzzle

Are you a Bayesian or a frequentist?  What do these terms mean, and what are the differences between the two?  For me, these questions have never been terribly interesting, despite many attempts at answers given in the literature (see the references below for useful and entertaining examples).

My problem has been that explanations typically focus on the different approaches to expressing uncertainty, as opposed to different approaches to actually making decisions.  That is, in my opinion, Bayesians and frequentists can argue all they want about what “the probability of an event” really means, and how much prior information the other camp has or hasn’t unjustifiably assumed… but when pressed to actually take an action, when money is on the table, everyone becomes a Bayesian.

Or do they?  Following is an interesting puzzle that seems to more clearly distinguish the Bayesian from the frequentist, by forcing them both to put money on the table, so to speak:

Problem: You have once again been captured by bloodthirsty logical pirates, who threaten to make you walk the plank unless you can correctly predict the outcome of an experiment.  The pirates show you a single irregularly-shaped gold doubloon selected from their booty, and tell you that when the coin is flipped, it has some fixed but unknown probability of coming up heads.  The coin is then flipped 7 times, of which you observe 5 to be heads and 2 to be tails.

At this point, you must now bet your life on whether or not, in two subsequent flips of the coin, both will come up heads.  If you predict correctly, you go free; if not, you walk the plank.  Which outcome would you choose?  (The pirates helpfully remind you that, if your choice is not to play, then you will walk the plank anyway.)

I think this is an interesting problem because two different but reasonable approaches yield two different answers.  For example, the maximum likelihood estimate of the unknown probability that a single flip of the coin will come up heads is 5/7 (i.e., the observed fraction of flips that came up heads), and thus the probability that the next two consecutive flips will both come up heads is (5/7)*(5/7)=25/49, or slightly better than 1/2.  So perhaps a frequentist would bet on two heads.

On the other hand, a Bayesian might begin with an assumed prior distribution on the unknown probability for a single coin flip, and update that distribution based on the observation of h=5 heads and t=2 tails.  For example, using a “maximum entropy” uniform prior, the posterior probability for a single flip has a beta distribution with parameters (h+1, t+1), and so the probability of two consecutive heads is

\int_0^1 \frac{x^h (1-x)^t}{B(h+1, t+1)} x^2 dx = \frac{7}{15} < \frac{1}{2}

where B(h+1, t+1) is the beta function.  So perhaps a Bayesian would bet against two heads.
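The 7/15 is easy to verify numerically: the integral above evaluates to a ratio of beta functions.  A sketch (helper names mine):

```python
from math import gamma

def beta_function(a, b):
    # B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def prob_next_two_heads(h, t):
    # E[x^2] under the posterior Beta(h + 1, t + 1):
    # B(h + 3, t + 1) / B(h + 1, t + 1)
    return beta_function(h + 3, t + 1) / beta_function(h + 1, t + 1)
```

The same function confirms the larger-sample dilemma mentioned below: with 17 heads and 7 tails, the posterior predictive probability of two heads is still below 1/2, while the maximum likelihood estimate (17/24)^2 is above it.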

What would you do?

(A couple of comments: first, one might reasonably complain that observing just 7 coin flips is simply too small a sample to make a reasonably informed decision.  However, the dilemma does not go away with a larger sample: suppose instead that you initially observe 17 heads and 7 tails, and are again asked to bet on whether the next two flips will come up heads.  Still larger samples exist that present the same problem.

Second, a Bayesian might question the choice of a uniform prior, suggesting as another reasonable starting point the “non-informative” Jeffreys prior, which in this case is the beta distribution with parameters (1/2, 1/2).  This has a certain cynical appeal to it, since it effectively assumes that the pirates have selected a coin which is likely to be biased toward either heads or tails.  Unfortunately, this also does not resolve the issue.)


1. Jaynes, E. T., Probability Theory: The Logic of Science. Cambridge: Cambridge University Press, 2003 [PDF]

2. Lindley, D. V. and Phillips, L. D., Inference for a Bernoulli Process (A Bayesian View), The American Statistician, 30:3 (August 1976), 112-119 [PDF]



allRGB: Hilbert curves and random spanning trees


Consider the following problem: create a visually appealing digital image with exactly one pixel for each of the 16.8 million possible colors in the 24-bit RGB color space.  Not a single color missing, and no color appearing more than once.

Check out the allRGB web site, which hosts dozens of such contributed images, illustrating all sorts of interesting approaches.  Most of them involve coloring the pixels in some systematic way, although a few start with an existing image or photograph, and transform it so that it contains all possible colors, while still “looking like” the original image.  I’ll come back to this latter idea later.

Hilbert’s curve

The motivation for this post is two algorithms that lend themselves naturally to this problem.  The first is the generation of Hilbert’s space-filling curve.  The idea fascinated me when I first encountered it, in the May 1985 issue of the short-lived Enter magazine.  How could the following code, that was so short, if indecipherable (by me, anyway), produce something so beautifully intricate?

 10 PI = 3.14159265: HCOLOR= 3
 20  HOME : VTAB 21
 60  HGR
 70 PX = 0:PY =  - 60
 80 H = 0:P = 1
 90  GOSUB 110
 100  GOTO 20
 110  IF L = 0 THEN  RETURN
 120 L = L - 1:L1 = P * 90
 130 H = H + L1:P = ( - P)
 140  GOSUB 110
 150 P = ( - P)
 160  GOSUB 280
 170 R1 = P * 90:H = H - R1
 180  GOSUB 110
 190  GOSUB 280
 200  GOSUB 110
 210 R1 = P * 90:H = H - R1
 220  GOSUB 280
 230 P = ( - P)
 240  GOSUB 110
 250 P = ( - P):L1 = P * 90
 260 H = H + L1:L = L + 1
 270  RETURN
 280 HX =  COS (H * PI / 180)
 290 HY =  SIN (H * PI / 180)
 300 NX = PX + HX * F1
 310 NY = PY + HY * F1
 320  IF NX > 139 OR NY > 79 THEN GOTO 370
 330  HPLOT 140 + PX,80 - PY TO 140 + NX,80 - NY
 340 PX = NX:PY = NY
 350 PY = NY
 360  RETURN
 380  FOR I = 1 TO 1000: NEXT I
 390  GOTO 20

Nearly 30 years later, my first thought when encountering the allRGB problem was to use the Hilbert curve, but use it twice in parallel: traverse the pixels of the image via a 2-dimensional (order 12) Hilbert curve, while at the same time traversing the RGB color cube via a 3-dimensional (order 8) Hilbert curve, assigning each pixel in turn the corresponding color.  The result looks like this (or see this video showing the image as it is constructed pixel-by-pixel):

allRGB image created by traversing image in 2D Hilbert curve order, assigning colors from the RGB cube in 3D Hilbert curve order.

(Aside: all of the images and videos shown here are reduced in size, using all 262,144 18-bit colors, for a more manageable 512x512-pixel image size.)

As usual and as expected, this was not a new idea.  Aldo Cortesi provides an excellent detailed write-up and source code for creating a similar image, with a full-size submission on the allRGB site.

However, the mathematician in me prefers a slightly different implementation.  Cortesi's code is based on algorithms in the paper by Chris Hamilton and Andrew Rau-Chaplin (see Reference (2) below), in which the computation of a point on a Hilbert curve depends on both the dimension n (i.e., filling a square, cube, etc.) and the order (i.e., depth of recursion) of the curve.  But we can remove this dependence on the order of the curve; it is possible to orient each successively larger-order curve so that it contains the smaller-order curves, effectively realizing a bijection between the non-negative integers \mathbb{N} and the non-negative integer lattice \mathbb{N}^n.  Furthermore, it seems natural that the first 2^n points on the curve (i.e., the order 1 curve) should be visited in the "canonical" reflected Gray code order.
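
To make that convention concrete: the first 2^n points of the curve are just the reflected Gray code sequence, unpacking each code word's bits into coordinates.  A minimal sketch:

```python
def gray_encode(i):
    # reflected binary Gray code: consecutive values differ in one bit
    return i ^ (i >> 1)

def vertex(code, n):
    # unpack an n-bit code into a coordinate tuple, bit k on axis k
    return tuple((code >> k) & 1 for k in range(n))

n = 3
order_1 = [vertex(gray_encode(i), n) for i in range(1 << n)]
# starts (0,0,0), (1,0,0), (1,1,0), (0,1,0), ... walking the cube's edges
```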

The result is the following Python code, with methods Hilbert.encode() and Hilbert.decode() for converting indices to and from coordinates of points on the n-dimensional Hilbert curve:

class Hilbert:
    """Multi-dimensional Hilbert space-filling curve.

    def __init__(self, n):
        """Create an n-dimensional Hilbert space-filling curve.
        self.n = n
        self.mask = (1 << n) - 1

    def encode(self, index):
        """Convert index to coordinates of a point on the Hilbert curve.

        # Compute base-n digits of index.
        digits = []
        while True:
            index, digit = divmod(index, self.mask + 1)
            digits.append(digit)
            if index == 0:
                break

        # Start with largest hypercube orientation that preserves
        # orientation of smaller order curves.
        vertex, edge = (0, -len(digits) % self.n)

        # Visit each base-n digit of index, most significant first.
        coords = [0] * self.n
        for digit in reversed(digits):

            # Compute position in current hypercube, distributing the n
            # bits across n coordinates.
            bits = self.subcube_encode(digit, vertex, edge)
            for bit in range(self.n):
                coords[bit] = (coords[bit] << 1) | (bits & 1)
                bits = bits >> 1

            # Compute orientation of next sub-cube.
            vertex, edge = self.rotate(digit, vertex, edge)
        return tuple(coords)

    def decode(self, coords):
        """Convert coordinates to index of a point on the Hilbert curve.

        # Convert n m-bit coordinates to m base-n digits.
        coords = list(coords)
        m = self.log2(max(coords)) + 1
        digits = []
        for i in range(m):
            digit = 0
            for bit in range(self.n - 1, -1, -1):
                digit = (digit << 1) | (coords[bit] & 1)
                coords[bit] = coords[bit] >> 1
            digits.append(digit)

        # Start with largest hypercube orientation that preserves
        # orientation of smaller order curves.
        vertex, edge = (0, -m % self.n)

        # Visit each base-n digit, most significant first.
        index = 0
        for digit in reversed(digits):

            # Compute index of position in current hypercube.
            bits = self.subcube_decode(digit, vertex, edge)
            index = (index << self.n) | bits

            # Compute orientation of next sub-cube.
            vertex, edge = self.rotate(bits, vertex, edge)
        return index

    def subcube_encode(self, index, vertex, edge):
        h = self.gray_encode(index)
        h = (h << (edge + 1)) | (h >> (self.n - edge - 1))
        return (h & self.mask) ^ vertex

    def subcube_decode(self, code, vertex, edge):
        k = code ^ vertex
        k = (k >> (edge + 1)) | (k << (self.n - edge - 1))
        return self.gray_decode(k & self.mask)

    def rotate(self, index, vertex, edge):
        v = self.subcube_encode(max((index - 1) & ~1, 0), vertex, edge)
        w = self.subcube_encode(min((index + 1) | 1, self.mask), vertex, edge)
        return (v, self.log2(v ^ w))

    def gray_encode(self, index):
        return index ^ (index >> 1)

    def gray_decode(self, code):
        index = code
        while code > 0:
            code = code >> 1
            index = index ^ code
        return index

    def log2(self, x):
        y = 0
        while x > 1:
            x = x >> 1
            y = y + 1
        return y

[Edit: The FXT algorithm library contains an implementation of the multi-dimensional Hilbert curve, with an accompanying write-up whose example output suggests that it uses exactly this same convention.]

Random spanning trees

At this point, a key observation is that the Hilbert curve image above is really just a special case of a more general approach: we are taking a "walk" through the pixels of the image, and another "walk" through the colors of the RGB cube, in parallel, assigning colors to pixels accordingly.  Those "walks" do not necessarily need to be along a Hilbert curve.  In general, any pair of "reasonably continuous" paths might yield an interesting image.

A Hilbert curve is an example of a Hamiltonian path through the corresponding grid graph, where the pixels are vertices, with edges between pixels that are adjacent horizontally or vertically.  In general, Hamiltonian paths are hard to find, but in the case of the two- and three-dimensional grid graphs we are considering here, they are easy to come by.  The arguably simplest example is a "serpentine" traversal of the pixels or colors (see Cortesi's illustration referred to as "zigzag" order here), but the visual results are not very interesting.  So, what else?
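
For reference, the serpentine traversal is only a few lines; this is my own minimal version, not Cortesi's:

```python
def serpentine(width, height):
    """Hamiltonian path through all (x, y) pixels, alternating row direction."""
    path = []
    for y in range(height):
        xs = range(width) if y % 2 == 0 else reversed(range(width))
        path.extend((x, y) for x in xs)
    return path
```

Pairing any two such paths, one through pixels and one through colors, is all the "parallel walk" machinery needed: something like `dict(zip(pixel_path, color_path))` maps each pixel to its color.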

We'll come back to Hamiltonian paths shortly.  But first, consider a "less continuous" traversal of the pixels and/or colors: construct a random spanning tree of the grid graph, and traverse the vertices in... well, any of several natural orders, such as breadth-first, depth-first, etc.

So how to construct a random spanning tree of a graph?  This is one of those algorithms that goes in the drawer labeled, "So cool because at first glance it seems like it has no business working as well as it does."

First, pick a random starting vertex (this will be the root), and imagine taking a random walk on the graph, at each step choosing uniformly from among the current vertex's neighbors.  Whenever you move to a new vertex for the first time, "mark" or add the traversed edge to the tree.  Once all vertices have been visited, the resulting marked edges form a spanning tree.  That's it.  The cool part is that the resulting spanning tree has a uniform distribution among all possible trees.
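
That single-random-walk construction (due to Aldous and Broder) is short enough to sketch directly; `grid_graph` is a hypothetical helper for building the pixel graph, not part of the original post's code:

```python
import random

def grid_graph(width, height):
    """4-connected pixel grid as a dict mapping vertex -> list of neighbors."""
    graph = {}
    for x in range(width):
        for y in range(height):
            graph[(x, y)] = [(x + dx, y + dy)
                             for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                             if 0 <= x + dx < width and 0 <= y + dy < height]
    return graph

def aldous_broder(graph):
    """Uniform random spanning tree via a single covering random walk."""
    v = random.choice(list(graph))
    parent = {v: None}           # v is the root
    while len(parent) < len(graph):
        w = random.choice(graph[v])
        if w not in parent:      # first visit: keep the entering edge
            parent[w] = v
        v = w
    return parent
```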

But we can do even better.  David Wilson (see Reference (3) below) describes the following improved algorithm that has the same uniform distribution on its output, but typically runs much faster: instead of taking a single meandering random walk to eventually "cover" all of the vertices, consider looping over the vertices, in any order, and for each vertex (that is not already in the tree) take a loop-erased random walk from that vertex until reaching the first vertex already in the tree.  Add the vertices and edges in that walk to the tree, and repeat.  Despite this "backward" growing of the tree, the resulting distribution is still uniform.

Here is the implementation in Python, where a graph is represented as a dictionary mapping vertices to lists of adjacent vertices:

import random

def random_spanning_tree(graph):
    """Return uniform random spanning tree of undirected graph.
    root = random.choice(graph.keys())
    parent = {root: None}
    tree = set([root])
    for vertex in graph:

        # Take random walk from a vertex to the tree.
        v = vertex
        while v not in tree:
            neighbor = random.choice(graph[v])
            parent[v] = neighbor
            v = neighbor

        # Erase any loops in the random walk.
        v = vertex
        while v not in tree:
            tree.add(v)
            v = parent[v]
    return parent

The following image shows one example of using this algorithm, visiting the image pixels in a breadth-first traversal of a random spanning tree, assigning colors according to a corresponding Hilbert curve traversal of the RGB color cube.

Breadth-first traversal of random spanning tree of pixels, assigning colors in Hilbert curve order.

My wife pointed out, correctly, I think, that watching the construction of these images is as interesting as the final result; this video shows another example of this same approach.  Note how the animation appears to slow down, as the number of pixels at a given depth in the tree increases.  A depth-first traversal, on the other hand, has a more constant speed, and a "snakier" look to it.
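
Given the parent map returned by the spanning-tree code above, both traversal orders are easy to derive; a sketch, where `children_map` is my own helper name:

```python
from collections import deque

def children_map(parent):
    """Invert a parent map (vertex -> parent, root -> None) into child lists."""
    children, root = {}, None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children.setdefault(p, []).append(v)
    return children, root

def bfs_order(parent):
    children, root = children_map(parent)
    order, queue = [], deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        queue.extend(children.get(v, ()))
    return order

def dfs_order(parent):
    children, root = children_map(parent)
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, ()))
    return order
```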

Expanding on this general idea, below are the results of all sixteen combinations of four different types of paths through the image pixels and through the colors: Hilbert curve, serpentine, or breadth-first or depth-first traversal of a random spanning tree.

Images resulting from different choices of paths through image pixels and colors.

Complete source code for generating larger versions of these images is available at the usual location here, in both Python, where I started, and C++, where I ended up, to manage the speed and space needed to create the full-size images.

Open questions

Let's come back now to the earlier mention of (a) transforming an existing image to contain all possible colors, and (b) using Hamiltonian paths in some way.  Reference (1) below describes a method of constructing a Hamiltonian cycle through the pixels of an image, using a minimum spanning tree of a "contraction" of the associated grid graph.  The idea is to group the pixels into 2x2 square blocks of "mini-Hamiltonian-cycles," resulting in a quarter-size grid graph, with edge weights between blocks corresponding to how "close" the pixel colors of the neighboring blocks are in the image.

Then, given a minimum spanning tree-- or really, any spanning tree-- in this smaller graph, we can construct a corresponding Hamiltonian cycle in the larger grid graph in a natural way, by connecting the 2x2 "mini-cycles" that are adjacent in the tree together into a larger cycle.  The resulting Hamiltonian cycle tends to have pixels that are close together on the cycle be "close" in color as well.
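
To make that construction concrete, here is a sketch for a grid of 2x2 pixel blocks: start with each block's 4-cycle of pixel edges, then for each spanning-tree edge splice the two adjacent mini-cycles together by swapping the facing pair of edges for a connecting pair.  (This is my own minimal rendering of the idea in Reference (1), ignoring the edge weights.)

```python
def block_cycle(bx, by):
    """The 4-cycle of pixel edges around the 2x2 block at (bx, by)."""
    x, y = 2 * bx, 2 * by
    c = [(x, y), (x + 1, y), (x + 1, y + 1), (x, y + 1)]
    return {frozenset((c[i], c[(i + 1) % 4])) for i in range(4)}

def hamiltonian_cycle(bw, bh, tree_edges):
    """Splice 2x2 mini-cycles together along spanning-tree edges of the block grid."""
    edges = set()
    for bx in range(bw):
        for by in range(bh):
            edges |= block_cycle(bx, by)
    for a, b in tree_edges:
        (ax, ay), (bx, by) = sorted((a, b))
        if ax < bx:  # horizontal neighbors: swap facing vertical edges
            x = 2 * ax + 1
            edges -= {frozenset(((x, 2 * ay), (x, 2 * ay + 1))),
                      frozenset(((x + 1, 2 * ay), (x + 1, 2 * ay + 1)))}
            edges |= {frozenset(((x, 2 * ay), (x + 1, 2 * ay))),
                      frozenset(((x, 2 * ay + 1), (x + 1, 2 * ay + 1)))}
        else:        # vertical neighbors: swap facing horizontal edges
            y = 2 * ay + 1
            edges -= {frozenset(((2 * ax, y), (2 * ax + 1, y))),
                      frozenset(((2 * ax, y + 1), (2 * ax + 1, y + 1)))}
            edges |= {frozenset(((2 * ax, y), (2 * ax, y + 1))),
                      frozenset(((2 * ax + 1, y), (2 * ax + 1, y + 1)))}
    return edges
```

Each splice merges two disjoint cycles into one while preserving every pixel's degree of 2, so a spanning tree of the block graph always yields a single Hamiltonian cycle.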

Now consider the following allRGB image: start with an existing image, and construct a Hamiltonian cycle from the minimum-weight "block" spanning tree.  Visit the pixels in cycle order, assigning colors according to some other corresponding "smooth" order, e.g. using a Hilbert curve.  How much of the original image would still be apparent in the result?

References

1. Dafner, R., Cohen-Or, D. and Matias, Y., Context-Based Space-Filling Curves, Eurographics 2000, 19:3 (2000) [PDF]

2. Hamilton, C. and Rau-Chaplin, A., Compact Hilbert Indices for Multi-Dimensional Data, Proceedings of the First International Conference on Complex, Intelligent and Software Intensive Systems, April 2007, 139-146 [PDF]

3. Wilson, D., Generating Random Spanning Trees More Quickly than the Cover Time, Proceedings of the 28th Annual ACM Symposium on the Theory of Computing, ACM (1996), 296-303 [PDF]

