26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


information retrieval from early modern books. She tested her methods on the Latin Gutenberg Bible and reported the same problems as Schibel and Rydberg-Cox, namely, the high density of text on each page, the limited spacing between words, and, most important, the use of many abbreviations and ligatures. She noted that such issues limit not just automatic techniques but human reading as well. The Gutenberg Bible alone included 75 types of ligatures, with two dense columns of text per page, each containing 42 lines. The methodology proposed, Marinai hoped, would support information retrieval beyond this one text:

… our aim is not to deal only with the Gutenberg Bible, but to design tools that can process early printed books, that can adopt different ligatures and abbreviations. We therefore designed a text retrieval tool that deals with the text in a printed document in a different way, trying to identify occurrences of query words rather than recognizing the whole text (Marinai 2009).

Instead of segmenting words, Marinai’s technique extracted “character objects” from documents that were then clustered together using self-organizing maps so that “symbolic” classes could be assigned to indexed objects. User query terms were selected from “one-word” images in the collection that were then compared against “indexed character objects with a Dynamic Time Warping (DTW) based approach.” This “query by example” approach did face one major challenge in that it could not find occurrences of query words that were printed with different ligatures.
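The DTW comparison mentioned above can be illustrated with a minimal sketch. It is not Marinai's implementation: it assumes, purely for illustration, that each word image has already been reduced to a sequence of numeric features (for example, one number per character object), and the function names `dtw_distance` and `rank_by_dtw` are hypothetical. The sketch shows only the general idea of aligning a query sequence with candidate sequences of different lengths.

```python
from math import inf


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    Assumption: each sequence stands for one word image, already reduced
    to numeric features. The source does not specify this representation.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats an element
                cost[i][j - 1],      # b advances, a repeats an element
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def rank_by_dtw(query, candidates):
    """Return candidate indices ordered from best to worst DTW match."""
    return sorted(range(len(candidates)),
                  key=lambda k: dtw_distance(query, candidates[k]))


if __name__ == "__main__":
    query = [1, 2, 3]
    candidates = [[5, 5, 5], [1, 2, 2, 3], [2, 3, 4]]
    print(rank_by_dtw(query, candidates))
```

Because DTW lets one element of a sequence align with several elements of the other, a candidate with a stretched or repeated segment can still match the query exactly. It does not, however, bridge sequences whose elements differ outright, which is consistent with the limitation noted above for words printed with different ligatures.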

As this subsection indicates, the development of tools for the automatic recognition and processing of Latin is a research area that still has many challenges.

Sanskrit

The issues involved in the digitization of Sanskrit texts and the development of tools to study and present them online are so complicated that an annual international Sanskrit computational linguistics symposium was established in 2007. 59 This subsection provides an overview of some of the major digital Sanskrit projects and current issues in digitization in that language.

The major digital Sanskrit project online is the Sanskrit Library, a “digital library dedicated to facilitating education and research in Sanskrit by providing access to digitized primary texts in Sanskrit and computerized research and study tools to analyze and maximize the utility of digitized Sanskrit text.” 60 The Sanskrit Library is part of the International Digital Sanskrit Library Integration project, which seeks to connect various Sanskrit digital archives and tool projects as well as to establish encoding standards, enhance manuscript access, and develop OCR technology and display software for Devanagari text. On an individual basis, the Sanskrit Library supports philological research and education in Vedic and Classical Sanskrit language and literature and provides access to Sanskrit texts in digital form. The Sanskrit Library currently contains independent-study Sanskrit readers, grammatical literature, morphological software, instructional materials, and a digital version of W. D. Whitney’s The Roots, Verb-Forms, and Primary Derivatives of the Sanskrit Language. The Library’s current areas of research include “linguistic issues in encoding, computational phonology and morphology, OCR for Indic scripts, and markup of digitized Sanskrit lexica.” Free access to this library is provided, but users must register.

59 http://www.springerlink.com/content/p665684g40h7/p=967bbca4213c4cb6988c40c0e3ae3a95&pi=0

60 http://sanskritlibrary.org/
