Since I spend an above-average amount of time looking at Google Books screens, the Making of America sites, and scans of colonial newspapers in the Archive of Americana, I was struck by Matthew Battles’s article in the Boston Globe earlier this month, headlined “Click to translate”:
Digital cameras at libraries worldwide are scanning millions of pages of old books, automatically “reading” the texts and turning them into computer files. But as books age, their typography smudges and flakes away. While human readers have little trouble comprehending even the most mangled words, sophisticated computer software still hangs up on them. Somewhere on the page, the dot of an i has disappeared, the smile of an e has gone gappy, the belly of a capital D has detached itself from its backbone. The computer thinks it’s seeing an ‘l,’ a ‘c,’ and a capital I followed by a parenthesis.
In a paper published last Friday in the journal Science, computer-science professor Luis von Ahn describes a new system to solve this problem. Taking advantage of humans’ natural ability to decipher messy text, von Ahn’s system places tiny bits of those unreadable lines as mystery words on websites around the world. As people solve the usual logon puzzle, they also decode a real word; the results are then collected and used to correct the text and produce clean copies of scanned books. . . .
Books are scanned twice and the two text streams are compared; any mismatched words become captchas. The mystery words are paired with known words on normal website security checks, and the user is asked to solve both words. If the user is right about the known word, his or her answer for the mystery word is kept and compared to solutions offered by others. Von Ahn finds that the system correctly decodes mystery words more than 99 percent of the time—results nearly identical to that of the scanning projects’ human reviewers.
According to the Science article, this system, dubbed “reCAPTCHA,” is now used on some 40,000 websites, where it has solved some 44 million words in one year of operation—the equivalent of about 17,600 books in von Ahn’s estimation.
It may be a telling fact about the Internet that von Ahn was not the first one to this idea: online pornographers trying to unlock captchas (and gather up millions of e-mail addresses) realized that they could solve them by the thousands through a neat trick. Whenever their bots run into a puzzle, they take a snapshot of the captcha and shoot it back to the porn site, where viewers have to solve it to move on to the next picture.
On the one hand, I like the idea of faster conversion of scanned pages into searchable text, and I’m impressed with the elegance of this solution. On the other hand, this system uses unwitting volunteer labor to replace “human reviewers” who appear to be slightly better at the task—but actually wish to be paid. An ethical dilemma. Feel free to comment (though, of course, that requires responding to a captcha).
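For anyone curious about the mechanics, the checking scheme described in the quoted article (scan twice, flag mismatches, pair each mystery word with a known word, and keep only answers from users who get the known word right) can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not von Ahn's actual code: the names `find_mystery_words` and `MysteryWord` are hypothetical, the two scans are assumed to line up word for word, and the `min_votes` agreement threshold is a made-up parameter, since the article does not say how many matching answers are required.

```python
from collections import Counter


def find_mystery_words(scan_a, scan_b):
    """Return the positions where two OCR passes over the same text disagree.

    Assumption: both scans are lists of words that align one-to-one.
    """
    return [i for i, (a, b) in enumerate(zip(scan_a, scan_b)) if a != b]


class MysteryWord:
    """Collects human answers for one word the software could not read."""

    def __init__(self, min_votes=3):
        # Hypothetical threshold: how many matching answers count as agreement.
        self.min_votes = min_votes
        self.votes = Counter()

    def submit(self, known_word, known_answer, mystery_answer):
        """Keep the mystery answer only if the user solved the known word."""
        if known_answer.strip().lower() != known_word.lower():
            return False  # failed the control word, so the answer is discarded
        self.votes[mystery_answer.strip().lower()] += 1
        return True

    def resolved(self):
        """Return the agreed reading once enough users concur, else None."""
        if not self.votes:
            return None
        answer, count = self.votes.most_common(1)[0]
        return answer if count >= self.min_votes else None


if __name__ == "__main__":
    first_pass = "the quick brown fox".split()
    second_pass = "the qnick brown fox".split()
    print(find_mystery_words(first_pass, second_pass))
```

In this sketch a word counts as solved only after enough independent users who passed the control word agree on it, which is the "compared to solutions offered by others" step in the article.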