This past week I spent some time experimenting with hacking the CAPTCHAs at the login page for one of my online financial institutions. (See here for a discussion almost a year ago about reCAPTCHAs, a particularly interesting additional application of CAPTCHAs.)
I was motivated to do this because (1) it seemed like an interesting problem, since I don’t know much about image processing or OCR; and because (2) these particular CAPTCHAs seemed disturbingly simple… in more than one way, as we shall see.
Actually, I will skip over the image processing rather quickly. Although that was my initial interest in the problem, it is not really my focus here. The important observation is simply that it was easy. The CAPTCHA images are always the same size in pixels, they are always four capital letters, the letters are never stretched or skewed, and they always make up an English word. The only real challenge is the image background, which is a combination of a varying “bumpy” texture and some skewed cross-hatch lines through the letters.
Parsing the CAPTCHAs took just a few steps. First, color-quantize the image down to eight colors, then down to two (filtering on one of the eight) to get a black and white image. This almost completely removes the texture and cross-hatching, leaving just the letters. Then apply a couple of 3×3 box filters, the first to remove any “island” dots in the image left over from the background removal, and the second to smooth out any “toothiness” in the edges of the characters. Finally, I didn’t even have to bother with my own OCR. The open-source GOCR software did the job just fine.
The result is about 20 lines of Mathematica code, yielding an automated solution with a success rate of approximately 90%. (This is based on my manual sample of logins over the course of several days, since I don’t want to actually throw a bot at the site.)
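For illustration, here is a rough Python sketch of the filtering steps described above. It is not the original Mathematica code, and several details are assumptions made for the example: the eight-color quantization is done with one bit per RGB channel, the letters are assumed to land on the darkest of the eight colors, and the function names (`quantize_pixel`, `to_binary`, `remove_islands`, `smooth`, `clean_captcha`) are made up here. The OCR step is left out; the cleaned image would simply be handed to an OCR tool.

```python
def quantize_pixel(rgb, cutoff=128):
    """Reduce an (r, g, b) pixel to one of eight colors: one bit per channel."""
    return tuple(1 if channel >= cutoff else 0 for channel in rgb)


def to_binary(image, ink_color=(0, 0, 0)):
    """Keep only pixels whose quantized color equals ink_color.

    Returns a grid of 1 (ink) and 0 (background). Assumes the letters
    quantize to ink_color; a real CAPTCHA might need a different choice.
    """
    return [[1 if quantize_pixel(px) == ink_color else 0 for px in row]
            for row in image]


def window_sum(img, r, c):
    """Count ink pixels in the 3x3 window centered on (r, c), center included."""
    rows, cols = len(img), len(img[0])
    return sum(
        img[rr][cc]
        for rr in range(max(0, r - 1), min(rows, r + 2))
        for cc in range(max(0, c - 1), min(cols, c + 2))
    )


def remove_islands(img):
    """Clear any ink pixel that has no other ink pixel in its 3x3 window."""
    return [
        [1 if img[r][c] and window_sum(img, r, c) > 1 else 0
         for c in range(len(img[0]))]
        for r in range(len(img))
    ]


def smooth(img, votes=5):
    """Mark a pixel as ink when its 3x3 window holds at least `votes` ink pixels.

    This evens out jagged edges; windows at the image border are smaller,
    so border pixels are more likely to be cleared.
    """
    return [
        [1 if window_sum(img, r, c) >= votes else 0
         for c in range(len(img[0]))]
        for r in range(len(img))
    ]


def clean_captcha(image):
    """Run the whole sketch: quantize, isolate the ink color, despeckle, smooth."""
    return smooth(remove_islands(to_binary(image)))
```

Here `image` is a list of rows of `(r, g, b)` tuples. The two 3×3 passes are deliberately separate, mirroring the two filters in the description: the first only deletes isolated dots, while the second can both fill small notches and trim stray edge pixels.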
But even more interesting than the CAPTCHA images themselves is the protocol, or manner in which the CAPTCHAs are incorporated into the login process. Once a user reaches the page with the CAPTCHA, after entering a valid account password, a notice at the bottom of the page informs the user that he or she can refresh the page with a new CAPTCHA if the current one is too difficult to read. I had not really noticed or tried this “feature” before, but it turns out that refreshing the CAPTCHA does indeed load a new image… but the four-letter word does not change.
What does this mean for an automated attempt at solving the CAPTCHA? Recall the 90% success rate mentioned previously; in most of the 10% of cases where my algorithm fails, the problem has not been that it comes up with the wrong four-letter word. Instead, it knows that it failed, since one or more of the letters were unrecognizable. In those cases, subsequent refreshes of the CAPTCHA have eventually yielded a successful solution.
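To make the effect of the refresh concrete, here is a small Python sketch of the retry idea. It rests on two simplifying assumptions that go beyond what was measured: every failure is detectable (the text above only says most are), and each refreshed image gives an independent chance of a clean read. The names `fetch_image` and `recognize` are hypothetical placeholders for loading a refreshed CAPTCHA and running OCR on it, with `recognize` returning `None` when any letter is unreadable.

```python
def success_within(p_confident, refreshes):
    """Chance of at least one confident read in `refreshes` independent tries."""
    return 1 - (1 - p_confident) ** refreshes


def solve_with_refresh(fetch_image, recognize, max_refreshes=5):
    """Keep requesting new images of the same word until OCR is confident.

    fetch_image: hypothetical callable returning the next CAPTCHA image.
    recognize:   hypothetical callable returning the word, or None when
                 one or more letters could not be recognized.
    Returns (word, attempts_used), with word None if every attempt failed.
    """
    for attempt in range(1, max_refreshes + 1):
        word = recognize(fetch_image())
        if word is not None:
            return word, attempt
    return None, max_refreshes
```

Under those assumptions, a 90% per-image success rate leaves a 10% chance of failure per try, so three tries fail together only 0.1% of the time, and `success_within(0.9, 3)` comes out at about 99.9%. The refresh costs the attacker nothing, since no wrong answer is ever submitted.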
So what is the point of all of this? I think the important questions to ask are, “What type of attack are we trying to prevent?” and “What reduction in probability of successful attack does this additional complexity of protocol provide?” The moral here seems to be that increased complexity (“let’s add CAPTCHAs to protect against bots”) does not imply increased security… and that increasing convenience (“but we’ll let users refresh the image if the text is too hard to read”) almost always decreases security.