Rome Wasn't Digitized in a Day - Council on Library and Information ...
In ancient manuscripts, Sanskrit is written without spaces, and from our point of view, this is an
important graphical specificity, because it increases greatly the complexity of text comparison
algorithms. One may remark that Sanskrit is not the only language where spaces are missing in
the text: Roman epigraphy and European Middle Age manuscripts are also good examples of
that (Csernel and Patte 2009).
The solution that the authors ultimately proposed for creating a critical edition of a Sanskrit text
involved the lemmatization by hand of one of the two texts, specifically, the text of the edition.
Alignments between this lemmatized text and other texts then made use of the longest common
subsequence (LCS) algorithm. The authors are still experimenting with their methodology, but pointed
out that the absence of a Sanskrit lexicon limited their approach.
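The LCS step mentioned above can be illustrated with a minimal sketch. This is a generic dynamic-programming LCS, not the authors' implementation: the function name `lcs_align` and the sample strings are assumptions made for illustration, and the comparison is done character by character because the witness text is assumed to be unspaced.

```python
from typing import List, Sequence, Tuple


def lcs_align(a: Sequence[str], b: Sequence[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) with a[i] == b[j] along one longest common subsequence."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk forward through the table to recover one optimal alignment.
    pairs: List[Tuple[int, int]] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


# Illustrative input only: an unspaced edition reading and a hypothetical
# witness reading that lacks the diacritics.
edition = "dharmakṣetrekurukṣetre"
witness = "dharmaksetrekuruksetre"
matched = lcs_align(edition, witness)
unmatched = [ch for k, ch in enumerate(edition) if k not in {i for i, _ in matched}]
print(len(matched), "characters aligned; unmatched in edition:", unmatched)
```

Positions of the edition that receive no partner in the witness mark candidate variant readings. The same function accepts lists of lemmas instead of strings when both texts are already tokenized.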
The development of OCR tools that will process Sanskrit scripts is a highly sought-after goal. Very
little work has been done in this area, but Thomas Breuel recently reported not only on the use of
OCRopus to recognize the Devanagari script but also on its application both to primary texts in
classical languages and to secondary classical scholarship. As was discussed previously in Boschetti et
al. (2009), preliminary work with OCRopus produced promising results with Ancient Greek.
Breuel (2009) described OCRopus as an OCR system that is designed to be both omnilingual and
omniscript and that advances the state of the art in that new text-recognition and layout-analysis
modules can be easily plugged in and that it uses an adaptive and user-extensible character recognition
module. Breuel acknowledged that there are many challenges to recognizing Devanagari script,
including the large number of ligatures, complicated diacritics, and the “large and unusual vocabulary
used in academic and historical texts” (Breuel 2009). In addition to Sanskrit texts, Breuel made the
important point that historical scholarship about Sanskrit and other classical languages is frequently
multilingual and multiscript and can mix Devanagari and Latin as well as Greek. Breuel thus proposed
that OCRopus has a number of potential applications in the field of classical scholarship, including the
recognition of original documents (written records), original primary source texts (printed editions of
classical texts), and both modern and historical secondary scholarship, including commentaries and
textbooks, and reference works such as dictionaries and encyclopedias.
He also explained that OCRopus uses a “strictly feed-forward system,” an important feature that
supports the plug-in of other layout-analysis and text-recognition modules. Other features include the
use of only a small number of data types to support reuse, “weighted finite state transducers” (WFSTs)
to represent the output of text line recognition, and final output in the hOCR format, which “encodes
OCR information in completely standards-compliant HTML files.” This open-source system can be
hosted through a web service or run from the command line or shell scripts, and users can customize how
it performs by scripting “the OCR engine in Lua.”
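Because hOCR is ordinary HTML, its content can be read with a standard HTML parser. The sketch below assumes word-level elements marked with the hOCR class `ocrx_word` and a `bbox x0 y0 x1 y1` property in the `title` attribute, as the hOCR format defines them. The sample fragment, the class `HocrWordParser`, and the helper names are hypothetical, and whether a given OCRopus version emits word-level elements is not confirmed by the source.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]


def parse_bbox(title: str) -> Optional[BBox]:
    """Extract 'bbox x0 y0 x1 y1' from a semicolon-separated hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            x0, y0, x1, y1 = (int(p) for p in parts[1:])
            return (x0, y0, x1, y1)
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) for every element whose class list contains ocrx_word."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, Optional[BBox]]] = []
        self._depth = 0  # greater than 0 while inside an ocrx_word element
        self._bbox: Optional[BBox] = None
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1  # nested markup inside a word
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attr.get("title") or "")
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._chunks).strip(), self._bbox))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_words(hocr: str) -> List[Tuple[str, Optional[BBox]]]:
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment, written by hand for illustration.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 1200'>
 <span class='ocr_line' title='bbox 50 60 400 90'>
  <span class='ocrx_word' title='bbox 50 60 180 90; x_wconf 91'>dharma</span>
  <span class='ocrx_word' title='bbox 190 60 400 90'>kṣetre</span>
 </span>
</div>
"""

for text, bbox in extract_words(SAMPLE):
    print(text, bbox)
```

Keeping the recognized text and its page coordinates together in plain HTML is what lets downstream tools, such as viewers or alignment scripts, consume OCR output without a custom file format.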
The basic stages in using OCRopus are image preprocessing, layout analysis, text-line recognition, and
statistical language modeling. Each stage offers a variety of customization options that make it
particularly useful for historical languages. In terms of text-line recognition in historical texts, the fact
that OCRopus has both built-in text-line recognizers and the ability to add external text-line
recognizers for different scripts is very important, because as Breuel articulated:
Some historical texts may use different writing systems, since Devanagari is not the only script
in historical use for Sanskrit. Scholarly writing on Sanskrit almost always uses Latin script, and
Latin script is also used for writing Sanskrit itself, including extended passages. Sanskrit
written in Devanagari and Latin scripts also makes use of numerous diacritics that need to be