**Introduction**

This past Monday marked the start of a 10-week promotion where you can buy bottles of Mountain Dew, each with a label showing one of the 50 United States. Collect all 50 state labels, and win $100 (in the form of a prepaid Visa card).

How many bottles of Mountain Dew should you expect to have to buy to win one of these $100 gift cards? The motivation for this post is to discuss this promotion as an instance of the classic coupon collector’s problem… but with a couple of interesting wrinkles, one of which appears to make this promotion vulnerable to exploitation by a determined participant, with an opportunity to net a significant positive return with almost no risk.

**Group drawings: buying six-packs**

The first wrinkle is to suppose that we plan to buy our bottles of Mountain Dew in six-packs, instead of one bottle at a time. Buying in bulk reduces the price per bottle: an individual 16.9-ounce bottle is easily a dollar or more, while I can find six-packs in my neighborhood for about $3.48, or 58 cents per bottle. I will assume this price for the remainder of this discussion.

So how many six-packs should we expect to have to buy to collect a complete set of 50 state labels? More generally, let be the number of bottles in each pack, where each bottle has a label chosen independently and uniformly from possible states, and let be the random variable indicating the number of packs we must buy until we first collect at least one of every state label. What is the distribution of ?

Stadje refers to this as the coupon collector’s problem “with group drawings” (see references at the end of this post). However, he addresses a slightly different variant, where the labels in a six-pack must be distinct, that is, sampled *without replacement* from the 50 possible labels (similar to the NFL drug testing procedure discussed in a recent post). In our case, each individual bottle’s label is an independent sample, so that a six-pack may contain duplicates. Inspection of a few six-packs at my local grocery store verified that this is indeed the more appropriate model for this problem.

The probability that we have collected all state labels after buying packs is given by

It’s a nice exercise to show that, from this, we can derive the expected number of packs as

which evaluated for and yields an average of about 37.91 six-packs, or about $131.92. So if we buy six-packs until we collect all 50 state labels, thus winning $100, we should expect to *lose* about $31.92 on average as a result of our venture, with a probability of only about 0.16 that we manage to net any positive return. Even worse, there is no upper bound on *how much* we might lose in that remaining 84% case. We will fix this shortly.

**The Double Dixie cup problem: decreasing risk**

The much more interesting wrinkle is that the terms of the promotion allow a single participant *to win more than once*: up to five times during the 10-week period, collecting a total of $500 worth of gift cards. This helps us a lot, because although we expect to need to buy about 38 six-packs to collect the *first* set of 50 state labels, we typically need less than half as many *additional* packs to collect the *second* and subsequent complete sets of labels.

Newman refers to this as the “double Dixie cup problem” (see references below). Let’s extend the notation described above, and define the random variable to indicate the number of packs that we must buy to collect complete sets of state labels. Then the cumulative distribution for is given by

Now, let’s suppose that we start buying six-packs, trying to collect complete sets of bottle labels, thus winning $500. At $3.48 per six-pack, we will lose money only if we have to buy more than 143 six-packs.

Evaluating yields a probability of less than 0.008 that we will lose money in our search for five complete sets of labels. We can bound the possible amount lost by stopping at 144 six-packs, no matter what, and collecting $100 gift cards for however many complete sets of bottle labels we have managed to collect at that point. The figure below shows the resulting distribution of possible net returns, with the expected return of $167.53 shown in red.

**Open questions**

A 99.2679% chance of making money, with an expected return of $167.53, sounds like a pretty good wager. But even this may not be the best we could possibly do. Although there is certainly no advantage to buying any *more* than 144 six-packs in total, it may be advantageous to stop *before* that point, even without having collected all five complete sets of bottle labels, if the expected *additional* gain from continuing to buy packs is negative. Such an optimal stopping strategy would likely be very complex, but it would be interesting to investigate how much the above approach could be improved.

Finally, I have already sunk about $300 into a bunch of Skittles that I didn’t eat; I have no plans to invest in another experiment buying gallons of Mountain Dew that I won’t drink. But there is a good reason why I’m more reluctant in this case: all of the above analysis assumes that each of the 50 states are equally likely to appear on any bottle. Perhaps they truly are uniformly distributed– the promotion is intended to “celebrate what makes every state epic,” after all, so any state that is more scarce than it should be would feel justifiably slighted.

We made the same uniformity assumption in the recent Skittles experiment, which turned out to be mostly correct. But in that case, any departure from uniformity would only have *helped* us, making it *more* likely to find an identical pair of packs. In this case, however, a non-uniform distribution of state labels makes it *harder* to collect complete sets.

**References:**

I would imagine, at least regionally, the distribution of states would have to be non-uniform. If we extend the double dixie cup problem for a large group of people pooing their money so that they could each claim a prize, that would mean eventually that every bottle in a 6 pack would end up contributing to a collection of 50. At 58 cents, that’s only $29 to get a full collection. You could start a hedge fund with a $29 share that guarantees a $100 payoff minus administration costs….

You’re right that one could in principle further decrease risk by targeting larger and larger numbers of complete sets of bottle labels, and potentially do so by pooling resources among multiple investors/participants. However, the program terms explicitly forbid this: “Labels cannot be sold, traded, bartered, assigned or transferred to or shared with a third party, auctioned through an online auction site, or otherwise or obtained through any other source other than the methods mentioned above.” In other words, any one participant cannot technically benefit from anyone else’s purchases, nor share his/her own with others.

The promotion organizers having thought to prohibit this potential pooling of resources, I think it doesn’t necessarily follow that they must also have skewed the distribution of state labels. That’s what I found interesting about this problem: it might perhaps be “obvious” or intuitive to the organizers that pooling resources among extremely large numbers of participants could effectively eliminate risk… but it might be less obvious that, even with a single participant acting alone, the risk is already so low targeting just five complete sets of labels (as opposed to, say, fifty, or a hundred, or whatever).

Fun problem, particularly the Double Dixie Cup variation. I banged it out using a Markov chain, got a match to your plot. Then I looked at some literature about analysis. My take is there is no exact formula for P(Xd <= n), but there are good approximations. I couldn't quite follow your formula — e.g., what is "x"? Instead I used this one implied by Newman's Theorem 1:

P(Xd = d),

and Zt is Poisson with mean t.

It gives a match to your plot.

The formula for P(X_d<=n) in the article is exact, but it involves extracting the coefficient of a generating function, and so is arguably not "closed form." That is, the "x" is the formal variable in a power series, and the notation [x^{mn}][g(x)] means "the coefficient of x^{mn} in the expansion of the polynomial g(x)."

Thanks, got it. And you’re right, I should have said “no closed-form formula.”

For whatever reason my formula didn’t show up in my post, so I’ve put it here (sagedrive dot com / possibly dot pdf):

http://sagedrive.com/possibly.pdf

I can’t find Hawaii or Iowa

I have 46-48 right now seriously searching only a few days. I’m in Michigan, Idaho and North Carolina are alluding me

Dude, you can scan photo of bottle off internet

I would say that this is for regular mountain dew drinkers to begin with. I did not change my purchasing habits for mountain dew and I was able to collect 47 States and then I simply went into a Walmart to get the last three states as individual bottles so there was no cost to me other than my normal purchasing habits. so I give kudos to mountain dew for a wonderful challenge that you could complete just being a regular mountain dew drinker.