CAPTCHAs revisited

This past week I spent some time experimenting with hacking the CAPTCHAs at the login page for one of my online financial institutions.  (Almost a year ago I discussed reCAPTCHAs, a particularly interesting additional application of CAPTCHAs.)

I was motivated to do this because (1) it seemed like an interesting problem, since I don’t know much about image processing or OCR; and because (2) these particular CAPTCHAs seemed disturbingly simple… in more than one way, as we shall see.

Actually, I will skip over the image processing rather quickly.  Although that was my initial interest in the problem, it is not really my focus here.  The important observation is simply that it was easy.  The CAPTCHA images are always the same size in pixels, they are always four capital letters, the letters are never stretched or skewed, and they always make up an English word.  The only real challenge is the image background, which is a combination of a varying “bumpy” texture and some skewed cross-hatch lines through the letters.

Parsing the CAPTCHAs took just a few steps.  First, color-quantize the image down to eight colors, then down to two (filtering on one of the eight) to get a black and white image.  This almost completely removes the texture and cross-hatching, leaving just the letters.  Then apply a couple of 3×3 box filters, the first to remove any “island” dots in the image left over from the background removal, and the second to smooth out any “toothiness” in the edges of the characters.  Finally, I didn’t even have to bother with my own OCR.  The open-source GOCR software did the job just fine.
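The two 3×3 filtering passes can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Mathematica code itself: it assumes the image has already been reduced to a binary grid (1 for ink, 0 for background), and the function names and the thresholds `min_neighbors` and `majority` are my own guesses at reasonable values.

```python
def neighbor_count(img, r, c):
    """Count ink pixels (1s) among the up-to-8 neighbors of (r, c)."""
    rows, cols = len(img), len(img[0])
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                total += img[rr][cc]
    return total


def remove_islands(img, min_neighbors=1):
    """Clear ink pixels that have fewer than min_neighbors ink neighbors."""
    return [
        [1 if px and neighbor_count(img, r, c) >= min_neighbors else 0
         for c, px in enumerate(row)]
        for r, row in enumerate(img)
    ]


def smooth_edges(img, majority=5):
    """Majority vote over each 3x3 window (center pixel included)."""
    return [
        [1 if px + neighbor_count(img, r, c) >= majority else 0
         for c, px in enumerate(row)]
        for r, row in enumerate(img)
    ]


# A solid 3x3 blob plus one stray dot at the right edge of the image.
noisy = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_islands(noisy)
```

Here `cleaned` keeps the blob and drops the stray dot. The majority vote in `smooth_edges` fills single-pixel notches and shaves single-pixel teeth, at the cost of rounding sharp corners slightly.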

The result is about 20 lines of Mathematica code, yielding an automated solution with a success rate of approximately 90%.  (This is based on my manual sample of logins over the course of several days, since I don’t want to actually throw a bot at the site.)

But even more interesting than the CAPTCHA images themselves is the protocol, or manner in which the CAPTCHAs are incorporated into the login process.  Once a user reaches the page with the CAPTCHA (after entering a valid account password), a notice at the bottom of the page informs the user that he or she can refresh the page with a new CAPTCHA if the current one is too difficult to read.  I had not really noticed or tried this “feature” before, but it turns out that refreshing the CAPTCHA does indeed load a new image… but the four-letter word does not change.

What does this mean for an automated attempt at solving the CAPTCHA?  Recall the 90% success rate mentioned previously; in most of the 10% of cases where my algorithm fails, the problem has not been that it comes up with the wrong four-letter word.  Instead, it knows that it failed, since one or more of the letters were unrecognizable.  In those cases, subsequent refreshes of the CAPTCHA have eventually yielded a successful solution.
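To make the consequence concrete, here is a minimal sketch of the retry loop this protocol permits. The callables `fetch_image` and `recognize` are hypothetical stand-ins (for "refresh and download the CAPTCHA" and "clean up the image and run OCR"). The only assumption about the recognizer is that an unreadable letter shows up in its output as something other than A to Z, so that a failed read can be detected by checking for exactly four capital letters.

```python
import re

# The site's CAPTCHAs are always four capital letters.
FOUR_CAPS = re.compile(r"[A-Z]{4}")


def solve_with_refresh(fetch_image, recognize, max_refreshes=5):
    """Return (word, attempts) for the first clean read, else (None, attempts)."""
    for attempt in range(1, max_refreshes + 1):
        text = recognize(fetch_image()).strip()
        if FOUR_CAPS.fullmatch(text):
            return text, attempt
        # Detected failure: the next fetch renders the same word differently.
    return None, max_refreshes


# Simulated run: two unreadable renderings, then a clean one.
reads = iter(["W_RD", "WO_D", "WORD"])
word, attempts = solve_with_refresh(lambda: None, lambda image: next(reads))
```

If each rendering failed detectably and independently with probability 0.1 (independence being my assumption, not something I measured), the chance of getting no clean read in n refreshes would be about 0.1^n, or one in a thousand after three tries.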

So what is the point of all of this?  I think the important questions to ask are, “What type of attack are we trying to prevent?” and “What reduction in probability of successful attack does this additional complexity of protocol provide?”  The moral here seems to be that increased complexity (“let’s add CAPTCHAs to protect against bots”) does not imply increased security… and that increasing convenience (“but we’ll let users refresh the image if the text is too hard to read”) almost always decreases security.


4 Responses to CAPTCHAs revisited

  1. Brian Taylor says:

    Wow. That’s remarkably effective.

    About a year ago I wrote a Python script that just summed the channels, thresholded, and then did a couple steps of dilate and erode before feeding it to OCR, and I was seeing about 80% success with that braindead technique. I’ve now implemented yours and it is far more elegant and does a much better job of eliminating the background.

    • Yeah, you don’t mention what OCR you used, but if it’s “general purpose” GOCR like I tried, then these are much weaker attacks than would be feasible. The font is always the same, and it’s always the same size, and the letters are always in the same position in the image. A more special-purpose pattern match against 26 specific capital letters would nail this, I think.

      Which re-raises the second question in the OP, namely, why is this here at all, particularly behind the login password? I may be underestimating possible “bad guy” intentions, but it seems like anyone that gets past an account password is done with the automated attack, and is at that point physically at the keyboard ready to try to move money around.

      • Brian Taylor says:

        Yes, I used a general OCR (specifically tesseract-ocr).

        Good point on specializing to this. I’ve written a little template generator in Mathematica (code following) and am using ImageCorrelate to apply it to the background-removed image. Since the position is always the same in the image, you could just find the peak in a strategic y band of the correlated signal for each of the templates. The strongest peak should tell you which template is correct, and the x offset would tell you where that letter goes.

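        (* template generator; the Rasterize, ImageCrop, and ImagePad wrappers are a
           reconstruction inferred from the surviving options and bracket structure *)
        GenChar[val_String] := 1 - 2*ImageData[
          ImagePad[
            ImageCrop[
              Rasterize[
                Graphics[Style[Text[val], 30, FontFamily -> "Helvetica"]],
                RasterSize -> 30, ImageSize -> 20]],
            3, White]
          ][[All, All, 1]];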

  2. Cool! I bet you could probably even get away with not “searching” for the x-position for each letter, if you worked out the letters one at a time from left to right. In other words, the (left edge of) the first letter is always in the same spot, so you just need to figure out which template matches best there. Once that’s done, the proportional font spacing between letters tells you where the second letter goes, and you find a matching template there, then move on to the third, etc.
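The left-to-right idea can be sketched as follows. This is a toy illustration under stated assumptions, not a tested attack: the 3-row glyph bitmaps are made up rather than taken from the real font, the inter-letter `gap` is assumed constant, and the helper names are hypothetical. Pixels are mapped to +1/-1 before correlating, in the same spirit as the `1 - 2*ImageData[...]` in the template generator above, and each template is scored together with its trailing gap so that a narrow letter does not tie with the left edge of a wider one.

```python
def match_score(img, glyph, x0, gap=1):
    """Normalized +1/-1 correlation of glyph (plus trailing gap) at column x0."""
    height, width = len(glyph), len(glyph[0])
    score = 0
    for r in range(height):
        for c in range(width + gap):
            t = glyph[r][c] if c < width else 0           # gap columns are blank
            x = x0 + c
            px = img[r][x] if x < len(img[r]) else 0      # off-image counts as blank
            score += (2 * px - 1) * (2 * t - 1)
    return score / (height * (width + gap))


def read_left_to_right(img, templates, x_start=0, gap=1, n_letters=4):
    """Pick the best template at the current x, then advance by its width plus gap."""
    word, x = "", x_start
    for _ in range(n_letters):
        best = max(templates, key=lambda ch: match_score(img, templates[ch], x, gap))
        word += best
        x += len(templates[best][0]) + gap
    return word


def render(word, templates, gap=1):
    """Draw a word by placing glyph bitmaps side by side, gap columns apart."""
    height = len(next(iter(templates.values())))
    rows = [[] for _ in range(height)]
    for ch in word:
        for r in range(height):
            rows[r].extend(templates[ch][r] + [0] * gap)
    return rows


# Made-up proportional "font": I is 1 pixel wide, L is 2, T is 3.
TOY_FONT = {
    "I": [[1], [1], [1]],
    "L": [[1, 0], [1, 0], [1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}
decoded = read_left_to_right(render("TILT", TOY_FONT), TOY_FONT)
```

On this clean toy image `decoded` comes back as the rendered word. On real images the scores would be noisier, and an early wrong letter would shift every later position, so a small search around the predicted x might still be worthwhile.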
