As an undergraduate in computer science, and even in graduate school, I learned many elegant algorithms for solving common problems. The kind of algorithms that you might think *every* programmer should know.

But a few weeks ago, I was reminded of just how relatively *empty* my toolbox is, as I learned about an algorithm that solves such a common problem, with such a simple-yet-brilliant insight, and with such unexpected efficiency, that it made me think, “How have I never heard of this before?”

The problem is simple to state: given a uniform random number generator and a discrete probability distribution with a finite set of possible outcomes, find an efficient method for sampling from that distribution. An example of this situation has been discussed here before, involving simulation of a lottery with multiple payoffs. As in that example, the first approach that comes to mind is to simply pre-compute the cumulative distribution (using indices of the outcomes) and do a linear search to find the interval containing the sampled uniform random value.
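As a rough sketch of that first approach (in Python rather than the MATLAB used later; the function names are mine), the pre-computed-CDF-plus-linear-search method looks something like this:

```python
import random

def make_cdf(p):
    # Pre-compute the cumulative distribution of the outcome probabilities.
    cdf, total = [], 0.0
    for prob in p:
        total += prob
        cdf.append(total)
    return cdf

def sample_linear(cdf):
    # Linear search for the first interval containing the uniform draw.
    u = random.random()
    for i, threshold in enumerate(cdf):
        if u < threshold:
            return i
    return len(cdf) - 1  # guard against round-off in the final cumulative sum

cdf = make_cdf([0.1, 0.2, 0.3, 0.4])
```

Each call to sample_linear walks the table from the front, which is exactly where the linear cost discussed below comes from.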

This method requires O(n) time to execute, which may not sound like a big deal, particularly since each step is just an array lookup and arithmetic comparison. But it turns out that there is an algorithm called the Alias Method that can do the job in O(1) time!

There is already an excellent write-up by Keith Schwarz that describes the problem and various solutions in great detail, leading up to the Alias Method and the cool geometric insight into how the algorithm works. (As a brief teaser, instead of selecting a random point on an interval of the real line, extend to two dimensions and imagine throwing a dart at a rectangular dart board. The challenge is to associate *every* point in the rectangle with an outcome.)
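To make the construction concrete, here is a compact sketch in Python (my own transliteration of Vose’s variant; the variable names and the handling of leftover columns are mine):

```python
import random

def vose_init(p):
    # O(n) pre-computation of the "prob" and "alias" tables.
    n = len(p)
    scaled = [x * n / sum(p) for x in p]    # rescale so the average weight is 1
    prob, alias = [1.0] * n, [0] * n        # leftover columns keep prob == 1
    small = [i for i, x in enumerate(scaled) if x <= 1.0]
    large = [i for i, x in enumerate(scaled) if x > 1.0]
    while small and large:
        j, k = small.pop(), large.pop()
        prob[j], alias[j] = scaled[j], k
        scaled[k] += scaled[j] - 1.0        # column k donates mass to fill column j
        (small if scaled[k] <= 1.0 else large).append(k)
    return prob, alias

def vose_sample(prob, alias):
    # O(1) per sample: pick a column, then flip its biased coin.
    n = len(prob)
    u = n * random.random()
    j = int(u)
    return j if u - j <= prob[j] else alias[j]
```

The pre-computation is linear in the number of outcomes; each subsequent sample costs one random draw, a floor, and a comparison.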

**Edit**: Motivated by discussion in the comments, following is a MATLAB script implementing the alias method, including a speed comparison with the basic “roulette wheel” inverse CDF method. Two different situations are evaluated: (1) generating many random samples at once, where MATLAB’s vectorization of built-in functions makes the inverse CDF method faster until about n = 10,000; and (2) generating random samples one at a time, where the alias method wins by orders of magnitude even for small n.

% Create test probability distribution.
p = rand(1, 100);
p = p / sum(p);

% Initialize roulette wheel method: pre-compute {c}.
c = min([0, cumsum(p)], 1);

% Initialize Vose alias method: pre-compute {n, prob, alias}.
n = numel(p);
prob = nan(1, n);
alias = nan(1, n);
p = p * (n / sum(p));
small = nan(1, n);
large = nan(1, n);
ns = 0;
nl = 0;
for j = 1:n
    if p(j) > 1
        nl = nl + 1;
        large(nl) = j;
    else
        ns = ns + 1;
        small(ns) = j;
    end
end
while ns ~= 0 && nl ~= 0
    j = small(ns);
    ns = ns - 1;
    k = large(nl);
    nl = nl - 1;
    prob(j) = p(j);
    alias(j) = k;
    p(k) = p(k) + p(j) - 1;
    if p(k) > 1
        nl = nl + 1;
        large(nl) = k;
    else
        ns = ns + 1;
        small(ns) = k;
    end
end
while ns ~= 0
    prob(small(ns)) = 1;
    ns = ns - 1;
end
while nl ~= 0
    prob(large(nl)) = 1;
    nl = nl - 1;
end

% Measure vectorized execution time of both methods.
samples = 1e6;
disp('Vectorized roulette wheel method:');
tic;
[~, r] = histc(rand(1, samples), c);
toc
disp('Vectorized Vose alias method:');
tic;
u = n * rand(1, samples);
j = floor(u);
r = alias(j + 1);
idx = (u - j <= prob(j + 1));
r(idx) = j(idx) + 1;
toc

% Measure single-sample execution time of both methods.
samples = 1e6;
disp('Roulette wheel method:');
tic;
for i = 1:samples
    [~, r] = histc(rand(), c);
end
toc
disp('Vose alias method:');
tic;
for i = 1:samples
    u = n * rand();
    j = floor(u);
    if u - j <= prob(j + 1)
        r = j + 1;
    else
        r = alias(j + 1);
    end
end
toc

One final interesting note that came out of investigating this algorithm: Wolfram’s *Mathematica*, a tool known for its ability to perform not just extended precision but *arbitrary* precision calculations, has a rather strange bug when confined to “simple” double precision. One of the steps in the alias method, and many other algorithms, involves generating a uniform random integer in the set {0, 1, ..., n-1}. A common approach is to compute floor(n*u), where u is uniformly distributed in the half-open interval [0, 1). When doing this with the limited precision of IEEE-754 floating point arithmetic, it is natural to ask whether there is a danger of the result being n instead of n-1 for a value of u sufficiently close to 1.

Since the IEEE-754 standard guarantees that elementary arithmetic operations must yield a best possible (correctly rounded) result, it is not difficult to show that there *should* be no worries for all values of n up to 2^53, the largest of the consecutive representable integers in double precision.

However, *Mathematica* fails to get this right for many values of n. The smallest integer where the error occurs is n = 2049:

N[2049] * N[1 - 2^-53] // InputForm

The second term in the product is the largest double precision value less than 1. Evaluating this expression yields 2049, not 2048.9999999999995 as expected (at least on the Windows versions of *Mathematica* 7 and 8). It is not clear to me what is happening here, since this should presumably be a *hardware* operation, and is handled correctly in C++, Java, Python, etc. Anyway, you have been warned.
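For comparison, here is the same check in Python (a quick sanity test of my own), which rounds the product correctly on IEEE-754 hardware:

```python
import math

u_max = 1.0 - 2.0 ** -53        # largest double strictly less than 1
product = 2049 * u_max
print(product)                  # 2048.9999999999995, as expected

# The correctly rounded product is one ulp below 2049, so the floor
# stays in range:
assert product == 2049 - 2.0 ** -41
assert math.floor(product) == 2048

# More generally, floor(n * u) stays below n:
assert all(math.floor(n * u_max) < n for n in range(1, 100000))
```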

Reference: Vose, Michael D., “A Linear Algorithm for Generating Random Numbers With a Given Distribution,” *IEEE Transactions on Software Engineering*, **17**(9) (September 1991):972-975.

Very cool! I’m going to have to implement this in MATLAB (if it’s not already built-in).

One question: I’m not as familiar with conventions for complexity. Is there a difference between O(n) and \Theta(n)?

There is a difference, and I committed a common-but-confusing abuse of notation in the original post. I wrote that the basic “inverse-CDF” method “requires O(n) time to execute.” Although technically true, a more rigorous and stronger statement would be that the method’s worst-case running time is \Theta(n)… which *implies* that the method’s running time is O(n) for *any* choice of sequence of probability distributions as inputs.

Put roughly, \Theta(f(n)) is the set of functions that can be asymptotically bounded *above and below* by constant multiples of f(n), while O(f(n)) is the set of functions that can be asymptotically bounded *above* by a constant multiple of f(n). So \Theta(f(n)) is a subset of O(f(n)), and saying a function is \Theta(f(n)) is thus a stronger statement than saying it is O(f(n)). Armed with this distinction, you may find that people frequently say O(f(n)) when they probably mean \Theta(f(n)); in particular, when *worst-case* running time is evaluated, Big-Theta is usually the appropriate notation.

Though the operational complexity may be less, my implementation of the alias method in MATLAB doesn’t beat the simple CDF method (a.k.a. roulette wheel selection). In fact, the CDF method is about twice as fast as the alias method. This is largely due to the fact that the CDF method is more easily vectorized: using cumsum() to create the lookup table and histc() to perform the search, it’s implemented in a single line of code.

If you can find a way to vectorize the Vose algorithm, I’d be very interested in what you come up with.

You make an important point: don’t optimize what isn’t broken. I updated the post to include MATLAB code that compares the two methods. I saw similar behavior; for “small” values of n (or numel(prob) in your code snippet), the vectorized CDF method is about twice as fast.

However, as you might expect, the alias method catches up (or rather, the CDF method slows down) as n grows larger: when sampling from a distribution with 10,000 or so outcomes, the vectorized alias method beats out the linear-time CDF method. Big deal, right? When was the last time you needed to sample from a finite distribution with that many outcomes? (Of course, MATLAB is rather unique in this respect, where the “constant factors” vary so much between user code and built-in functions. It would be interesting to see where these curves cross with an implementation in, say, C++ or Java.)

But on the other hand, I think a more important thing to note is that, in practice, we are also probably not generating samples from our distribution “all at once” like this. Consider some complex Monte Carlo simulation, where each iteration involves just one or a few of these samples, which are then used as part of a larger calculation. The second speed comparison in the MATLAB script evaluates this “one-at-a-time” situation, and without the benefit of vectorization, the results are predictably in favor of the alias method.

I couldn’t resist doing a little more profiling myself. Similar to you, I found that the alias method wins (if pre-initialized) when you’ve got a large distribution, or when you’re taking a small number of draws.

It’s funny how two people can take an algorithm, and implement it so differently. I noticed in your implementation of the alias method, you only make one random draw (rather than two). In my version (below), I reasoned (without formal proof) that if I used the same uniform random draw to determine both the bin and the coin-flip, I’d be biasing towards the alias in the higher indexed bins. Does that make sense? I wonder if the same is true in your version… or rather, can you get yourself into trouble using the same random number for both supposedly independent events?

r = floor(rand(nSamples,1)*n) + 1;

isTails = prob(r) < rand(nSamples,1);

r(isTails) = alias(r(isTails));

The main benefit of getting both the bin and the coin flip out of a single random draw is that random number generators can be *expensive* (in execution time, I mean). Your question is whether this approach still yields the desired distribution of outcomes. I might be missing the potential problem that you are describing; from the purely mathematical perspective, it’s certainly ok: if u is a uniform random variable on [0,n), then u-floor(u) is uniform on [0,1), and is independent of floor(u).

From the perspective of actual implementation with limited (double) precision, look at it this way: in a single random draw we get about 53 random bits. The “multiply, floor, and remainder” operations essentially split those bits into two groups: the high order bits determine the bin, and the low order bits determine the coin flip.

It’s cool that you’ve discovered the Alias method… but I think you’re judging Mathematica on faulty assumptions. The function N[expr] gives the numerical value of expr with a decimal precision up to K digits, with K a (configurable) parameter of your Mathematica environment. If you explicitly want 54 digits, you would do N[expr, 54].

Unfortunately, I think this is wrong on almost all counts. You are correct that you can specify a specific precision as a second argument to N[], and the result will be represented “in software” as a value with the specified precision. But the parameter is given in decimal digits, not bits, so the closest approximation to double precision would be N[expr, 16], not N[expr, 54]. (Or N[expr, 15.954589770191003], to be exact.)

Also, the default precision implied by N[expr] without the second argument is MachinePrecision, which (1) is *not* a configurable parameter (try it; $MachinePrecision is protected) and (2) is indeed a “special case” in the sense that it is not the same as N[expr, 16], since the result is represented as a value on which arithmetic operations can be performed “in hardware.” On every platform that I know of, this representation is an IEEE-754 64-bit double.

Finally, I have communicated with Wolfram, and they also agree and acknowledge that this is a bug, which will be addressed in a future release.