I learned about an interesting application of CAPTCHAs this past week. Good timing, too, since CAPTCHAs make a nice follow-on to last week’s post about the Turing Test and interactive proof systems.
First, what is a CAPTCHA? Last week, I described the Turing Test as a thought experiment. A CAPTCHA is a practical realization of the Turing Test, with a slight twist, that is used thousands of times per day. You have seen them online, even if you have not heard of them: when you log in to some email accounts, or buy concert tickets, or do your online banking, you are sometimes confronted with a small, slightly distorted image of some text, usually just a word or two, and are asked to type in the words that you see. Wikipedia has some images of past and current examples of CAPTCHAs.
CAPTCHA is an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart.” As its name suggests, a CAPTCHA is basically a Turing Test, designed to ensure that it is in fact you, a human being, buying those concert tickets, and not a “bot,” or computer program trying to buy a large block of tickets to re-sell later.
The idea is that the test should be something that a human can respond to relatively easily and quickly, but that a computer (currently) cannot. As described above, this test is usually of the form “read the word(s) in this distorted image.” Humans are pretty good at doing this, while the current state of the art in image processing is not very good at “reading” images that have been distorted in some ways.
The “twist” mentioned earlier refers to the fact that in this case, it is a computer (the web server) that is performing the test, trying to identify another computer “acting” like a human. Because a computer is performing the test, the test must be automated; that is, the web server must be able to generate many different instances of the test automatically, to present to each of the many humans– and bots– accessing the site.
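To make the "automated" part concrete, here is a minimal sketch of how a server might generate and check test instances. It is only an illustration under my own assumptions: the names `new_challenge`, `check_response`, and the in-memory `_pending` store are hypothetical, and the hard part (rendering the text as a distorted image) is left out entirely.

```python
import secrets
import string

# Hypothetical in-memory store mapping challenge tokens to expected answers.
# A real site would keep this server-side in a session or database.
_pending = {}

def new_challenge(length=6):
    """Create a fresh test instance; returns (token, text_to_render)."""
    text = "".join(secrets.choice(string.ascii_lowercase) for _ in range(length))
    token = secrets.token_hex(16)
    _pending[token] = text  # the answer never leaves the server
    return token, text      # text would be drawn as a distorted image

def check_response(token, response):
    """Check a typed answer. Each token is single-use, right or wrong."""
    expected = _pending.pop(token, None)
    if expected is None:
        return False
    return response.strip().lower() == expected
```

The server sends only the token and the distorted image to the browser, so a bot cannot simply read the answer out of the page. Discarding the token after one attempt means a wrong guess forces a brand new test.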
OK, so far so good. The interesting new thing that I learned is how CAPTCHAs are being used not only to protect web applications, but also to digitize books… at the same time. Check out the reCAPTCHA web site; this is a particular flavor of CAPTCHA that seems to be becoming the de facto standard. It uses not just one but two words in the image. One word is the actual pass/fail challenge; the other word comes from a book that is being digitized or “read” by character recognition software. This second word was “confusing” in some way, and the software could not read it correctly. (The fact that character recognition algorithms can typically “know” when they can’t read something correctly is an interesting topic in its own right.)
The CAPTCHA uses the human test subject to help the reading process along. If the human (or bot) fails the “real” test, then the submission is rejected, fulfilling the main purpose of the CAPTCHA. If he/she/it gets it right, though, it is assumed that he/she/it probably read the other word correctly as well, and this response is used to resolve the confusion.
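The two-word logic described above can be sketched in a few lines. This is my own guess at the shape of it, not reCAPTCHA's actual implementation: the function names are hypothetical, and the idea of tallying votes and accepting a transcription after `min_votes` agreeing answers is an assumption about how "resolving the confusion" might work.

```python
from collections import Counter

def grade(control_answer, control_guess, unknown_guess, votes):
    """Pass or fail on the known word; on a pass, record the other word as a vote."""
    if control_guess.strip().lower() != control_answer.lower():
        return False  # failed the "real" test: reject and learn nothing
    votes[unknown_guess.strip().lower()] += 1
    return True

def resolved_word(votes, min_votes=3):
    """Return the leading transcription once enough people agree, else None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None
```

For example, starting with `votes = Counter()`, each user who types the known word correctly adds one vote for whatever they typed for the unknown word, and `resolved_word(votes)` stays `None` until some reading has been seen three times. Note that the user is never graded on the unknown word, since nobody knows the right answer yet.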