26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


when digitizing early modern books, a project needs to determine how much functionality users will require from a digital facsimile and how much human intervention will be required to create it. In analyzing these questions, Rydberg-Cox proposed five possible approaches: (1) image books with simple page images; (2) image books with minimal structural data; (3) image-front transcriptions (such as those found in the Making of America project; see note 58) with page images that have searchable uncorrected OCR; (4) carefully edited and tagged transcriptions (generally marked up in XML); and (5) scholarly and critical editions. Ultimately, the project decided to create sample texts in all of these genres except that of the scholarly critical edition because of the cost of creating such editions. The decision to digitize the text, rather than just provide page images with limited OCR, raised its own issues, including the need to manually photograph rather than scan pages and the question of how to handle characters and glyphs that could not be represented in Unicode. The project had to create a method that data entry contractors could use to represent such characters as they typed up texts. The first step was to create a catalog of all the brevigraphs that appeared in the printed books, assigning to each nonstandard character a unique entity identifier that data entry personnel could use to represent the glyph. In addition to this catalog, a number of computational tools were created to assist the data entry operators:

Because the expansion of these abbreviations is an extremely time-consuming and painstaking task, we developed three tools to facilitate the tagging process. These tools suggest possible expansions for Latin abbreviations and brevigraphs, help identify words that are divided across lines, and separate words that are joined as the results of irregular spacing. All three programs can return results in HTML for human readability or by XML in response to remote procedure call as part of a program to automatically expand abbreviations in these texts (Rydberg-Cox 2009).
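The three kinds of assistance described in the quotation can be illustrated with a minimal Python sketch. This is not the project's software: the entity names (`&qbar;`, `&pbar;`, `&ptilde;`), the tiny word list, and all function names are hypothetical, chosen only to show how an entity catalog and a lexicon lookup might support expansion suggestions, detection of words divided across lines, and splitting of wrongly joined words.

```python
import itertools
import re

# Hypothetical entity catalog: entity identifier -> candidate Latin expansions.
# These names and expansions are illustrative, not the project's actual catalog.
BREVIGRAPHS = {
    "&qbar;": ["que"],
    "&pbar;": ["per", "par"],
    "&ptilde;": ["prae", "pre"],
}

# Tiny illustrative word list; a real tool would consult a full Latin lexicon.
LEXICON = {"atque", "super", "terra", "in", "praeter", "dominus"}

ENTITY = re.compile(r"(&[a-z]+;)")


def suggest_expansions(token):
    """Return candidate expansions of a token containing entity identifiers.

    Candidates found in the lexicon are preferred; if none is found,
    every generated candidate is returned for a human editor to review.
    """
    parts = ENTITY.split(token)
    # Unknown entities and plain text are kept unchanged.
    options = [BREVIGRAPHS.get(part, [part]) for part in parts]
    candidates = ["".join(combo) for combo in itertools.product(*options)]
    known = [c for c in candidates if c in LEXICON]
    return known or candidates


def find_divided_words(lines):
    """Report (line index, joined word) where a word may span a line break."""
    hits = []
    for i in range(len(lines) - 1):
        left, right = lines[i].split(), lines[i + 1].split()
        if not left or not right:
            continue
        joined = left[-1].rstrip("-") + right[0]
        if joined in LEXICON:
            hits.append((i, joined))
    return hits


def split_joined_word(token):
    """Return every split of a token into two words that are both in the lexicon."""
    return [
        (token[:k], token[k:])
        for k in range(1, len(token))
        if token[:k] in LEXICON and token[k:] in LEXICON
    ]
```

Each function only proposes candidates; in a workflow like the one described, a human editor would still confirm or reject each suggestion.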

Another important point raised by Rydberg-Cox was that while the project needed to develop tools such as these, if such tools were shared in a larger infrastructure, they could be reused by the numerous projects digitizing Latin books. Ultimately, Rydberg-Cox concluded that this work showed that a large-scale project that created image-front editions (e.g., using uncorrected data that were manually typed to support searching rather than uncorrected OCR) could be affordably managed. In the workflow of his own project, Rydberg-Cox found that the most significant expense was having human editors tag abbreviations and a second editor proofread the work.

Nonetheless, Rydberg-Cox convincingly argued that a certain level of transcription is typically worth the cost because it provides better searchability and, even more important, supports automatic hypertext and linking to dictionaries and other reading support tools. Such tools can help students and scholars read texts in Greek and Latin without expert knowledge of such languages, and they are particularly important for early modern books, many of which have never been translated. Furthermore, Rydberg-Cox noted that larger collections of lightly edited text often reach far larger audiences than small collections of closely edited texts or critical editions. In addition, this model does not preclude the development of critical editions, for as long as the images and transcriptions are made available as open content they can be reused by scholars in support of their own editions.

In contrast to using digitized images and typed-in transcriptions, recent research reported by Simone Marinai (2009) explored the use of automatic text indexing and retrieval methods to support

58 http://moa.umdl.umich.edu/
