Thursday, May 24, 2007

reCAPTCHA

After code reviewing a large number of files, I took a slight detour over to BoingBoing to read an article on an alternative CAPTCHA mechanism. A normal CAPTCHA has you type in known words that have been graphically altered so that a computer can’t interpret them. With reCAPTCHA, you instead type in words from scanned documents that OCR has failed to interpret, but the reCAPTCHA system has a pretty high likelihood of knowing what at least one of them is. You type in two words. If you get the “known” one right, you get in, and your submission for the other word is kept to help determine what that word is. It’s a pretty interesting idea. Plus, they’re using it to help digitize scanned public-domain documents that OCR has been hard pressed to convert, which is a good cause.
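
To make the two-word idea concrete, here is a minimal sketch in Python of how I imagine such a check could work. It is not reCAPTCHA’s actual code: the names `check_challenge`, `best_guess`, and `votes`, and the three-vote threshold, are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory store: for each unknown word image,
# count how often each reading has been submitted.
votes = defaultdict(Counter)


def check_challenge(known_answer, unknown_image_id, typed_known, typed_unknown):
    """Pass or fail on the known word only; record the other word as a vote."""
    if typed_known.strip().lower() != known_answer.strip().lower():
        return False  # wrong on the known word: reject and record nothing
    votes[unknown_image_id][typed_unknown.strip().lower()] += 1
    return True


def best_guess(unknown_image_id, min_votes=3):
    """Return the most common reading once it has enough votes, else None."""
    tally = votes[unknown_image_id]
    if not tally:
        return None
    word, count = tally.most_common(1)[0]
    return word if count >= min_votes else None
```

In a scheme like this, whatever you type for the unknown word cannot make you fail, since nothing exists yet to compare it against. It only counts as one vote toward the eventual reading.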

In my playing around with the example on the site, I noticed some oddities:

  • You can fudge at least one of the words. There was an instance of “1980” and another word. I guessed that the other word was the “known” word, typed that in verbatim, and passed in “1988” instead. The reCAPTCHA system said I was in like Flynn.
  • You are likely to get partial or multiple words instead. Since the OCR is already known to have misread the text, it may not even get the word breaks right, so when you are instructed to enter “two words,” you may have to enter three or four words, or partial words.
  • Sometimes there are odd symbols in the older texts. There was a word that had an ‘æ’ in it. I dutifully asked the International menu to show me my Keyboard Viewer (since I never remember the Option-key combination for those ligatures), found the character, and submitted the word exactly as written. It called me correct, but I wager that normal people won’t bother to use those high-falutin’, old-fangled characters, and will just type “ae”. It might be close enough, but is it right?
  • Use of alternative symbols might get you an error. There was a word that clearly had a right-side closing single quote character used for a possessive or a contraction. However, I dutifully entered the curly-quote version and was rebuffed! I guess the straight quote is what they wanted.
  • Older texts have spelling mistakes or perhaps older or alternative spellings. I wonder how many people are going to bother making sure they type that extra ‘t’ that doesn’t belong; will it be enough for the verification code to preserve the original text’s spelling?
  • Some of those OCR documents are hard to read as it is, and blurring them makes it harder. There were several times when I had to refresh the reCAPTCHA many times in a row because it was just impossible to read. I wonder what their mechanism is when many, many people have avoided making a claim, or made irreconcilable claims as to what that word is. Does some poor shmuck have to go back and look at it manually? Or do they start limiting the blurring? Hmm…
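
The ‘æ’ and curly-quote cases above made me wonder what a forgiving comparison might look like. Here is a rough sketch in Python, assuming the checker folds a few look-alike characters before comparing. I have no idea whether reCAPTCHA does anything of the sort; `FOLD_TABLE`, `normalize`, and `loose_match` are hypothetical names, and the character list is just the handful I ran into plus a few obvious neighbors.

```python
# Hypothetical folding table: map a few typographic characters to the
# plain-keyboard spellings most people would actually type.
FOLD_TABLE = str.maketrans({
    "æ": "ae", "Æ": "AE",
    "œ": "oe", "Œ": "OE",
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
})


def normalize(text):
    """Fold look-alike characters, trim whitespace, and ignore case."""
    return text.translate(FOLD_TABLE).strip().casefold()


def loose_match(typed, expected):
    """Treat two answers as equal if they agree after normalization."""
    return normalize(typed) == normalize(expected)


print(loose_match("Cæsar\u2019s", "Caesar's"))
```

Folding like this would accept both the curly and the straight quote, but it also throws away exactly the old-fangled spelling that a digitization project might want to keep, so it is a trade-off rather than an obvious win.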

All in all, though, it still seems to provide reasonable security while performing a nifty public service.