The lead article in yesterday’s *Post* described some interesting analysis of one of Thomas Jefferson’s early drafts of the Declaration of Independence. Several words can be seen to have been crossed out and replaced with others, but in one instance, where Jefferson initially wrote the word *subjects*, he did not simply cross it out; he did his best to actually erase it, writing *citizens* in its place. (I assume the part in question is the following final version: “He has constrained our fellow *Citizens* taken Captive on the high Seas to bear Arms against their Country…”)

In the same spirit, on this holiday weekend I can’t resist mentioning another interesting application of mathematics, in this case to the problem of determining the authorship of the so-called “disputed” Federalist Papers. This isn’t new stuff; however, I like problems (and solutions) like this because the ideas can be explained relatively simply… while at the same time there is some meaty mathematics under the surface. It is the kind of problem that has the potential to excite and challenge students.

The Federalist Papers are a collection of essays written variously by Alexander Hamilton and James Madison, with a few by John Jay, in support of ratification of the U.S. Constitution. Although the authorship of most of these essays is relatively certain, there has been some debate about twelve in particular. These “disputed papers” are today generally all thought to have been written by Madison.

My first exposure to this problem was a 1998 paper by Bosch and Smith. (Unfortunately, this JSTOR link is not accessible without a journal subscription.) In it, the authors describe the idea of using “separating hyperplanes” to identify the author(s) of the disputed papers. They compute, for each of the essays, a point in 70-dimensional space, with each coordinate indicating the frequency of occurrence of a corresponding “function word.” Think of these function words and their frequencies of use as a “fingerprint” that is unique to a particular author.
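To make the “fingerprint” idea concrete, here is a minimal sketch of computing such a frequency vector. (The handful of function words below is made up for illustration; the actual list of 70 words is in the paper, and rates per 1000 words, rather than raw counts, are one natural way to keep essays of different lengths comparable.)

```python
from collections import Counter
import re

# A tiny, hypothetical stand-in for the 70 function words used in the paper.
FUNCTION_WORDS = ["upon", "whilst", "while", "on", "by", "to"]

def frequency_vector(text, words=FUNCTION_WORDS):
    """Rate of occurrence of each function word, per 1000 words of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = len(tokens)
    return [1000.0 * counts[w] / n for w in words]
```

Each essay then becomes a single point whose coordinates are these rates; with the full word list, that point lives in 70-dimensional space.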

Now, considering only the 65 points corresponding to the *undisputed* papers, the authors compute a “separating hyperplane,” or a hyperplane such that all of the points corresponding to Hamilton’s essays are on one side, and Madison’s on the other. (This is where the interesting mathematics comes in; how do you compute such a separating hyperplane? Under what conditions does a separating hyperplane even exist? In the likely case that there is an entire infinite family of possible separating hyperplanes, how much does it matter which one you choose?)

Anyway, given such a hyperplane separating the two authors of the *undisputed* papers, the authorship of the *disputed* papers may be determined by observing on which side of the hyperplane the corresponding points fall. It turns out that this approach yields the same conclusion: all twelve appear to have been written by Madison.
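For a concrete (if toy) illustration of both steps, here is a sketch using the classic perceptron rule, which finds *some* separating hyperplane whenever the two classes are linearly separable. This is not necessarily the hyperplane the authors computed, and the two-dimensional “fingerprints” below are made up, standing in for the 70-dimensional ones:

```python
def perceptron(points, labels, passes=100, lr=0.1):
    """Return (w, b) with labels[i] * (w . points[i] + b) > 0 for all i,
    assuming the labeled points are linearly separable."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(passes):
        updated = False
        for x, y in zip(points, labels):
            # If this point is on the wrong side (or on the hyperplane), nudge
            # the hyperplane toward classifying it correctly.
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                updated = True
        if not updated:  # every training point is now on its correct side
            break
    return w, b

# Hypothetical 2-D frequency fingerprints for the *undisputed* essays:
points = [(3.0, 0.1), (2.7, 0.3), (2.9, 0.2),   # "Hamilton" essays, label +1
          (0.2, 1.1), (0.4, 0.9), (0.1, 1.3)]   # "Madison" essays, label -1
labels = [1, 1, 1, -1, -1, -1]
w, b = perceptron(points, labels)

# A "disputed" essay is classified by which side of the hyperplane it falls on:
disputed = (0.3, 1.0)
side = sum(wi * xi for wi, xi in zip(w, disputed)) + b
# side < 0 here, i.e. this made-up point falls on the "Madison" side
```

The perceptron answers the existence question constructively for separable data; which of the infinitely many separating hyperplanes you should prefer is exactly the question taken up below.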

For those of us without access to the paper, does the algorithm give some certainty or confidence for each of the 12 papers? Did they fall close to the decision boundary, or were they well within the Madison side of the hyperplane?

Lately, I’ve been learning a lot about this field of mathematics (pattern recognition), and the thing that bugs me is that it’s really hard (if not impossible) to prove that a solution will generalize well to new data. Suppose a new undisputed paper is discovered, and when the hyperplane is re-computed with this new data point, one of the disputed papers switches over to Hamilton’s side. Well, now what?

Your suggestion of distance from the hyperplane as a measure of confidence in a particular classification is right on the money. Indeed, this is effectively how the hyperplane is selected: there is generally an entire family of possible hyperplanes that separate the “training” points, and each hyperplane has associated with it a minimum distance to one of the training points. The hyperplane we want is (the) one that maximizes this minimum distance… which is another way of saying that it maximizes the minimum “confidence” in its classification of a training point.
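As a sketch of that criterion, with made-up numbers and a hypothetical pair of candidate hyperplanes: the signed Euclidean distance from a point x to the hyperplane w·x + b = 0 is (w·x + b)/‖w‖, and the max-margin hyperplane is the one whose minimum such distance over the training points (taken with the correct sign for each class) is largest:

```python
import math

def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Two toy training points, one per author, for simplicity:
points = [(2.0, 0.0), (0.0, 2.0)]
labels = [1, -1]   # +1 = "Hamilton", -1 = "Madison" (hypothetical)

def min_margin(w, b):
    """Worst-case confidence: the smallest correctly-signed distance to a
    training point. Positive iff (w, b) separates the training points."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

wide = min_margin((1.0, -1.0), 0.0)     # symmetric separator x1 = x2
narrow = min_margin((1.0, -0.2), -1.0)  # a tilted separator; both separate,
                                        # but this one has a smaller margin
```

Both candidate hyperplanes classify the training points correctly, but the max-margin criterion prefers the first, because its worst-case (minimum) confidence is larger.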

For further reading, particularly without a JSTOR subscription, search for “support vector machine”; Wikipedia has a pretty good discussion.

As to the problem of the model misbehaving in the presence of additional data, I think that’s the nature of the game, particularly if you don’t have any statistical characterization of what that future new data might look like. I am reminded of the comment by George Box: “All models are wrong; some models are useful.”