26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...



One major project that has recently been funded in this area is “New Technology for Digitization of Ancient Objects and Documents,” a joint project of the Archaeological Computing Research Group (ACRG) and the School of Electronics and Computer Science (ECS), Southampton; the Centre for the Study of Ancient Documents (CSAD), Oxford; the CDLI, Los Angeles-Philadelphia-Oxford-Berlin; and the Electronic Text Corpus of Sumerian Literature (ETCSL), Oxford. 42 This project has received a 12-month Arts and Humanities Research Council (AHRC) grant to develop a “Reflectance Transformation Imaging (RTI) System for Ancient Documentary Artefacts.” The team plans to develop two RTI systems that can be used to capture high-quality digital images of documentary texts and archaeological materials. The initial testing will be conducted on stylus tablets from Vindolanda, stone inscriptions, Linear B, and cuneiform tablets.

Other relevant research is being conducted by the IMPACT (Improving Access to Text) 43 project. Funded by the European Commission, this project is exploring how to develop advanced OCR methods for historical texts, particularly for use in mass digitization processes. 44 While this research is not specifically focused on developing techniques for classical languages, Latin was the major language of intellectual discourse in Europe for many centuries, so techniques adapted for either manuscripts or early printed books would be useful to classical scholarship and beyond.

Ancient Greek

Only a limited amount of work has considered using automatic techniques in the optical recognition of ancient or classical Greek. While some recent research has focused on the development of OCR for “Old Greek” historical manuscripts, 45 little work has explored developing techniques for either manuscripts or printed editions of Ancient Greek texts.

Some preliminary work in developing an automatic-recognition methodology for Ancient Greek is detailed by Stewart et al. (2007). In their initial survey of Greek editions, these authors found that on average almost 14 percent of the Greek words on a text page appeared in the notes or apparatus criticus. The authors first used a multi-tiered approach to OCR that applied two major post-processing techniques to the output of two commercial OCR packages, ABBYY FineReader (8.0) 46 and Anagnostis 4.1. During this experiment, they found that character accuracy on simple uncorrected text averaged about 98.57 percent. Other preliminary experiments with OCR-generated text revealed that the uncorrected OCR could serve as searchable corpora. Even when working with a mid-nineteenth-century edition of Aristotle in a nonstandard Greek font, searches of the OCR-generated text typically provided better recall than searches of texts that had been manually typed, because the OCR text included variant readings found in the notes. In a second experiment, the automatic correction of single texts was performed using a list of one million Greek words and the Morpheus Greek morphological analyzer that was developed by the PDL.
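The idea behind word-list correction in the second experiment can be illustrated with a minimal Python sketch. It is not the method of Stewart et al. (2007), whose details are not given here: the four-word LEXICON is a hypothetical stand-in for the one-million-word list, the Morpheus analyzer is not used at all, and the similarity measure (the ratio from Python's standard difflib module) and the 0.6 cutoff are assumptions made for illustration.

```python
import difflib
import unicodedata


def normalize(word: str) -> str:
    """Put a word in NFC so precomposed and combining accents compare equal."""
    return unicodedata.normalize("NFC", word)


# Hypothetical four-word sample standing in for a large Greek word list.
LEXICON = {normalize(w) for w in ["μῆνιν", "ἄειδε", "θεά", "Ἀχιλῆος"]}


def correct_token(token: str, lexicon: set[str], cutoff: float = 0.6) -> str:
    """Return the token if it is a known word, else its closest lexicon entry.

    A token with no sufficiently similar entry is returned unchanged, so
    unknown but valid words are not silently rewritten.
    """
    token = normalize(token)
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, sorted(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line: str, lexicon: set[str]) -> str:
    """Apply token-level correction to a whitespace-separated line."""
    return " ".join(correct_token(t, lexicon) for t in line.split())
```

Calling correct_line on a line of OCR output returns the line with each unrecognized token replaced by its nearest lexicon entry, where one exists. A real system would also need to handle punctuation, and a linear scan with difflib would be far too slow against a million-word list, so an indexed lookup would be required.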

For their third experiment, Stewart and colleagues used the OCR output of multiple editions of the same work to correct one another in a three-step process. First, different editions of a text were aligned by finding unique strings in each. Second, if an error word was found in one text, a fuzzy search was performed in the aligned parallel text to try to locate the correct form. Finally, once error words in a base text had been matched against potential ground truth counterparts in the parallel texts, rules

42 http://www.southampton.ac.uk/archaeology/news/news_2010/acrg_dedefi_main.shtml

43 http://www.impact-project.eu/home/

44 For a recent overview of some of the IMPACT project’s research, see Ploeger et al. (2009).

45 For an example, see Ntzios et al. (2007).

46 http://www.abbyy.com/
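The first two steps of the cross-edition correction described for the third experiment (alignment on unique strings, then a fuzzy search in the aligned parallel text) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: treating "unique strings" as words that occur exactly once in each edition, the window size, the difflib similarity cutoff, and all function names are hypothetical. The third step is omitted because the rules it relies on are not described in this passage.

```python
import difflib


def unique_words(tokens):
    """Return the set of words that occur exactly once in a token list."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return {t for t, c in counts.items() if c == 1}


def anchor_pairs(base, parallel):
    """Step 1: align two editions on words that are unique in both.

    Returns (base_index, parallel_index) pairs in base order, keeping only
    pairs whose parallel index increases so that the anchors never cross.
    """
    shared = unique_words(base) & unique_words(parallel)
    pos_parallel = {t: j for j, t in enumerate(parallel) if t in shared}
    kept, last_j = [], -1
    for i, t in enumerate(base):
        if t in shared and pos_parallel[t] > last_j:
            kept.append((i, pos_parallel[t]))
            last_j = pos_parallel[t]
    return kept


def find_counterpart(base, parallel, error_index, anchors, window=2, cutoff=0.6):
    """Step 2: fuzzy-search the parallel edition near the aligned position.

    The expected position is projected from the nearest anchor at or before
    the error word; only words within `window` positions of it are searched.
    Returns the best candidate, or None if nothing is similar enough.
    """
    i0, j0 = 0, 0
    for i, j in anchors:
        if i <= error_index:
            i0, j0 = i, j
    expected = j0 + (error_index - i0)
    lo = max(0, expected - window)
    hi = min(len(parallel), expected + window + 1)
    matches = difflib.get_close_matches(
        base[error_index], parallel[lo:hi], n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Hypothetical OCR output of two editions of the same line; the base text
# has lost the diaeresis in its fourth word.
base = ["μῆνιν", "ἄειδε", "θεὰ", "Πηληιάδεω", "Ἀχιλῆος"]
parallel = ["μῆνιν", "ἄειδε", "θεά", "Πηληϊάδεω", "Ἀχιλῆος"]

anchors = anchor_pairs(base, parallel)
suggestion = find_counterpart(base, parallel, 3, anchors)
```

Restricting the fuzzy search to a small window around the projected position keeps a similar-looking word elsewhere in the text from being picked up as a false match.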
