26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


In ancient manuscripts, Sanskrit is written without spaces, and from our point of view, this is an
important graphical specificity, because it increases greatly the complexity of text comparison
algorithms. One may remark that Sanskrit is not the only language where spaces are missing in
the text: Roman epigraphy and European Middle Age manuscripts are also good examples of
that (Csernel and Patte 2009).

The solution that the authors ultimately proposed for creating a critical edition of a Sanskrit text
involved the lemmatization by hand of one of the two texts, specifically, the text of the edition.
Alignments between this lemmatized text and other texts then made use of the longest common
subsequence (LCS) algorithm. The authors are still experimenting with their methodology, but pointed
out that the absence of a Sanskrit lexicon limited their approach.
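
The LCS-based alignment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `lcs_table` and `lcs_align` and the toy romanized strings are assumptions made for illustration, and the sketch aligns plain characters rather than lemmas.

```python
def lcs_table(a, b):
    """Fill the standard dynamic-programming table for the LCS of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def lcs_align(a, b):
    """Return index pairs (i, j) with a[i] == b[j] that together form one LCS."""
    table = lcs_table(a, b)
    pairs = []
    i, j = len(a), len(b)
    # Walk back from the bottom-right corner of the table.
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Hypothetical toy data: a hand-segmented edition and an unsegmented witness.
edition_words = ["tat", "tvam", "asi"]
edition_chars = "".join(edition_words)
manuscript = "tatvamasi"  # written without spaces, one letter missing

pairs = lcs_align(edition_chars, manuscript)
matched = {i for i, _ in pairs}
# Edition positions with no counterpart in the witness are candidate variants.
unmatched = [i for i in range(len(edition_chars)) if i not in matched]
```

Because the edition has been segmented by hand, each matched character position can be mapped back to a word of the edition, which is one way a segmented text can guide comparison with an unsegmented one.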

The development of OCR tools that will process Sanskrit scripts is a highly sought-after goal. Very
little work has been done in this area, but Thomas Breuel recently reported not only on the use of
OCRopus to recognize the Devanagari script but also on its application both to primary texts in
classical languages and to secondary classical scholarship. As was discussed previously in Boschetti et
al. (2009), preliminary work with OCRopus produced promising results with Ancient Greek.

Breuel (2009) described OCRopus as an OCR system that is designed to be both omnilingual and
omniscript. In his account it advances the state of the art in two ways: new text-recognition and
layout-analysis modules can be easily plugged in, and it uses an adaptive and user-extensible
character recognition module. Breuel acknowledged that there are many challenges to recognizing
Devanagari script, including the large number of ligatures, complicated diacritics, and the “large and
unusual vocabulary used in academic and historical texts” (Breuel 2009). Beyond Sanskrit texts
themselves, Breuel made the important point that historical scholarship about Sanskrit and other
classical languages is frequently multilingual and multiscript and can mix Devanagari and Latin as well
as Greek. Breuel thus proposed that OCRopus has a number of potential applications in the field of
classical scholarship, including the recognition of original documents (written records), original
primary source texts (printed editions of classical texts), and both modern and historical secondary
scholarship, including commentaries, textbooks, and reference works such as dictionaries and
encyclopedias.
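
To make the multiscript point concrete, the following sketch splits an already transcribed string into runs of Devanagari, Greek, and Latin characters. It works on Unicode text rather than page images, so it does not show how OCRopus itself identifies scripts; the helper names `script_of` and `script_runs` are hypothetical, and classifying by Unicode character name is only a rough approximation.

```python
import unicodedata

SCRIPTS = ("DEVANAGARI", "GREEK", "LATIN")


def script_of(ch):
    """Rough script label taken from the Unicode character name, or None."""
    name = unicodedata.name(ch, "")
    for script in SCRIPTS:
        if name.startswith(script):
            return script
    return None  # spaces, punctuation, digits, combining marks


def script_runs(text):
    """Split text into (script, substring) runs.

    Neutral characters (label None) stay attached to the run they occur in.
    """
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or runs[-1][0] in (None, script)):
            runs[-1][1] += ch
            runs[-1][0] = runs[-1][0] or script
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


# A line of the kind found in secondary scholarship: three scripts side by side.
mixed = "dharma धर्म λόγος"
runs = script_runs(mixed)
```

Each run could then, in principle, be handed to a recognizer or language model suited to its script, which is the kind of routing a multiscript system needs.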

He also explained that OCRopus uses a “strictly feed-forward system,” an important feature that
supports the plug-in of other layout-analysis and text-recognition modules. Other features include the
use of only a small number of data types to support reuse, “weighted finite state transducers” (WFSTs)
to represent the output of text line recognition, and final output in the hOCR format, which “encodes
OCR information in completely standards-compliant HTML files.” This open-source system can be
hosted through a web service or run from the command line or shell scripts, and users can customize
how it performs by scripting “the OCR engine in Lua.”
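
Because hOCR is ordinary HTML, its word-level output can be read with a generic HTML parser. The sketch below is illustrative only: the sample markup is hand-written rather than actual OCRopus output, the names `HocrWords` and `parse_bbox` are hypothetical, and it assumes each word is a `span` with class `ocrx_word` whose `title` attribute carries a `bbox` property, as in the hOCR conventions.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title such as 'bbox 1 2 3 4; x_wconf 91'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) for every span with class 'ocrx_word'.

    Simplification: assumes word spans do not contain nested spans.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


# Hand-written sample in the hOCR style (not real OCR output).
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 100'>
 <span class='ocr_line' title='bbox 10 10 300 40'>
  <span class='ocrx_word' title='bbox 10 10 120 40'>dharma</span>
  <span class='ocrx_word' title='bbox 130 10 300 40; x_wconf 91'>धर्म</span>
 </span>
</div>
"""

parser = HocrWords()
parser.feed(SAMPLE)
parser.close()
words = parser.words  # list of (text, bounding box) pairs in reading order
```

Keeping the coordinates next to the recognized text is what allows a transcription to be linked back to the region of the page image it came from.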

The basic stages in using OCRopus are image preprocessing, layout analysis, text-line recognition, and
statistical language modeling. Each stage offers a variety of customization options that make it
particularly useful for historical languages. In terms of text-line recognition in historical texts, the fact
that OCRopus has both built-in text-line recognizers and the ability to add external text-line
recognizers for different scripts is very important, because as Breuel articulated:

Some historical texts may use different writing systems, since Devanagari is not the only script
in historical use for Sanskrit. Scholarly writing on Sanskrit almost always uses Latin script, and
Latin script is also used for writing Sanskrit itself, including extended passages. Sanskrit
written in Devanagari and Latin scripts also makes use of numerous diacritics that need to be
