Releasing the Kraken on Revolutionary Philadelphia
The American Philosophical Society, which we remember was co-founded by tinkerer Benjamin Franklin, has a project called “The Revolutionary City.”
Its goal is “to digitize all manuscript material to, from, or about Philadelphia during the American Revolutionary War (1774-1783).” The count of those manuscript pages is 46,000 and growing.
But posting digitized images of those manuscripts on the web has limited value, since only a small fraction of the public has experience reading eighteenth-century handwritten documents.
David Ragnar Nelson, a Digital Projects Specialist at the A.P.S., recently shared a précis of how the organization is using digital tools to create transcriptions. He writes:
Fortunately, massive steps forward have been made in recent years to crack the code for HTR [handwritten text recognition]. There are now a number of existing softwares that can help transcribe handwritten documents. Most of these models rely on some form of deep learning, a type of machine learning through which computers recognize and reproduce patterns.
The A.P.S. is using two software programs for this task. Kraken is an H.T.R. program originally designed to work with Arabic which “has shown great success in English cursive.” For training data, eScriptorium provides a wrapper and user interface for running Kraken.
While these technologies are often grouped under the moniker “artificial intelligence,” all they do is encode patterns as probabilities. The computer cannot “see” or “read” letters in the same way that a human being can, but associates the contours of visual material on the page with the probability of being a certain letter.
To “teach” the computer to recognize handwriting, we need to provide it with high-quality samples of correct output. We call these samples “training data.” The computer will then run an algorithm over the training data and create what is known as a “model.” This model encodes the probabilities the computer uses to decode the symbols on the page. Since the computer can only learn what is in its training data, we need lots of high-quality training data to produce an accurate model. Any and every possible variation of a letter must be accounted for in the training set.
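To make the idea of "patterns as probabilities" concrete, here is a toy Python sketch of one common way recognition systems turn per-slice letter probabilities into text: pick the likeliest symbol for each narrow slice of the line, merge repeats, and drop a "blank" symbol. This is an illustration under stated assumptions, not Kraken's actual code or API; the function name `greedy_decode`, the `BLANK` symbol, and all of the probabilities are made up for the example.

```python
# Toy illustration only: this is not Kraken's actual code or API.
BLANK = "-"  # hypothetical symbol meaning "no letter in this slice"


def greedy_decode(frames):
    """Pick the most probable symbol per slice, merge repeats, drop blanks."""
    best = [max(probs, key=probs.get) for probs in frames]
    letters = []
    previous = None
    for symbol in best:
        # Keep a symbol only when it differs from the previous slice
        # and is not the blank placeholder.
        if symbol != previous and symbol != BLANK:
            letters.append(symbol)
        previous = symbol
    return "".join(letters)


# Made-up probabilities for five narrow slices across a handwritten word.
frames = [
    {"t": 0.80, "l": 0.15, BLANK: 0.05},
    {"t": 0.60, "l": 0.10, BLANK: 0.30},
    {"e": 0.55, "c": 0.35, BLANK: 0.10},
    {BLANK: 0.70, "e": 0.20, "a": 0.10},
    {"a": 0.50, "o": 0.45, BLANK: 0.05},
]

decoded = greedy_decode(frames)
print(decoded)
```

Note how close some of the calls are: in the last slice "a" only narrowly beats "o". A hand the model has never seen shifts those numbers, which is why the training set has to cover so many variations of each letter.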
Nelson notes those programs’ advantages: “First, these programs are open source, which means that anyone with sufficient knowledge of programming can use them for free and contribute to them. Second, the community of users around these softwares has a strong commitment to an open and transparent practice of model generation.”
So how has the performance been so far? “The largest model we have trained incorporates over 45,000 lines of cleaned training data. This model achieves an accuracy of around 94%, though the accuracy drops quite a bit if the material does not resemble something in the training set.” (In other words, somebody else’s handwriting.) The results still require “a bit of manual correction” but arrive faster than all-manual transcriptions.
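The quoted passage does not say how the 94% figure is measured. Accuracy for handwriting recognition is often reported at the character level: count the edits needed to turn the machine's transcription into a correct one, then divide by the length of the correct text. The Python sketch below assumes that definition; the function names and the sample line are invented for illustration and do not come from the A.P.S. project.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: fewest insertions, deletions, or substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Share of the reference text the hypothesis got right, from 0.0 to 1.0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Invented example: a human-checked line and a machine reading of it.
reference = "Philadelphia, 4th July 1776"
hypothesis = "Philadelphia, 4th Iuly 1776"
accuracy = character_accuracy(reference, hypothesis)
print(f"{accuracy:.1%}")
```

By this measure, 94% accuracy would mean roughly six characters in every hundred still need a human correction, which fits Nelson's description of "a bit of manual correction."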