26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...




provides English translations; and (4) it is the “only corpus of any Ancient Middle Eastern language that has been tagged and lemmatized” (Ebeling 2007). These literary texts also differ from administrative texts in that they spell out the morphology in detail and provide a source for cultural and religious vocabulary.

The ETSCL, like the CDLI and the BDNTS, contains transliterations of Sumerian, in which the original cuneiform has been converted into and is represented by a sequence of Roman characters. As noted above, it also contains English translations, and transliterations and translations are linked at the paragraph level, which supports a parallel reading of the original text and translation. In addition, the entire corpus is marked up in TEI (P4) with some extensions to accommodate textual variants and linguistic annotations, which, Ebeling admitted, had “sometimes stretched the descriptive apparatus to the limit.” One challenge of presenting the text and transliteration side by side in the ETSCL was that the transliteration was often put together from several fragmentary sources. This was solved by using a tag pair from the TEI; the “type” attribute is used to indicate whether a variant is primary or secondary. A special format was also developed for encoding broken and damaged texts.
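
The variant encoding described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the element names `variantStart` and `variantEnd` are hypothetical placeholders rather than the TEI elements the ETSCL actually uses, the sample line is invented, and only the idea of a “type” attribute distinguishing primary from secondary variants comes from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <variantStart> and <variantEnd> are made-up names,
# NOT the TEI elements used by the ETSCL. The words are invented toy data.
SAMPLE = """
<l n="1">
  <w>lugal</w>
  <variantStart type="primary" to="v1"/>
  <w>e2</w>
  <variantEnd id="v1"/>
  <variantStart type="secondary" to="v2"/>
  <w>e2-gal</w>
  <variantEnd id="v2"/>
  <w>mu-un-du3</w>
</l>
"""

def read_line(xml_text):
    """Return (main_text, secondary_variants) for one encoded line."""
    line = ET.fromstring(xml_text.strip())
    main, secondary = [], []
    current_type = None  # None means we are outside any variant span
    for child in line:
        if child.tag == "variantStart":
            current_type = child.get("type")
        elif child.tag == "variantEnd":
            current_type = None
        elif child.tag == "w":
            # Secondary variants are set aside; everything else is main text.
            if current_type == "secondary":
                secondary.append(child.text)
            else:
                main.append(child.text)
    return main, secondary

main, secondary = read_line(SAMPLE)
print(" ".join(main))
print(secondary)
```

Keeping the secondary readings in a separate list lets a display layer show the composite line by default and offer the alternative readings on demand.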

One major advance of the ETSCL for corpus studies is that the transliterations were lemmatized with an automatic morphological parser (developed by the PSD Project) and the output was then proofread. Although this process took a year, it supports lemmatized searching of the ETSCL; when a user clicks on an individual lemma in the ETSCL, it can launch a search in the PSD, and vice versa. In sum, the ETSCL serves as a “diachronic, annotated, transliterated, bilingual, parallel corpus of literature or as an all-in-one corpus” (Ebeling 2007). The further development of linguistic analysis and corpus search tools for the ETSCL has also been detailed by Tablan et al. (2006):

The main aim of our work is to create a set of tools for performing automatic morphological analysis of Sumerian. This essentially entails identifying the part of speech for each word in the corpus (technically, this only involves nouns and verbs which are the only categories that are inflected), separating the lemma part from the clitics and assigning a morphological function to each of the clitics (Tablan et al. 2006).
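
The two steps named in this passage, separating a lemma from its clitics and labelling each clitic, can be sketched in a few lines. This is a deliberately naive illustration and not the GATE-based tooling of Tablan et al.: the clitic table, the sample forms, and the function names `analyse` and `build_lemma_index` are all assumptions made for the example, and treating every hyphen as a morpheme boundary is a simplification, since hyphens in real transliterations separate signs.

```python
# Toy clitic table; the glosses are simplified and for illustration only.
CLITIC_FUNCTIONS = {
    "e": "ergative",
    "ra": "dative",
    "a": "locative",
}

def analyse(word):
    """Split a hyphenated form into a lemma and labelled clitics.

    Assumption: the first segment is the lemma and every later
    segment is a clitic, which real Sumerian forms do not guarantee.
    """
    lemma, *clitics = word.split("-")
    return {
        "form": word,
        "lemma": lemma,
        "clitics": [(c, CLITIC_FUNCTIONS.get(c, "unknown")) for c in clitics],
    }

def build_lemma_index(words):
    """Map each lemma to the positions where any of its forms occurs."""
    index = {}
    for position, word in enumerate(words):
        index.setdefault(analyse(word)["lemma"], []).append(position)
    return index

tokens = ["lugal-e", "e2-a", "lugal-ra"]
analysis = analyse(tokens[0])
lemma_index = build_lemma_index(tokens)
print(analysis)
print(lemma_index)
```

An index keyed by lemma is what makes lemmatized searching possible: a query for one lemma returns every inflected form, whatever clitics are attached.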

The authors used the open-source GATE (General Architecture for Text Engineering), 106 which was developed at the University of Sheffield, and found that one of the biggest problems in evaluating the success of their methods was that they lacked a morphological gold standard for Sumerian against which to evaluate their data. Many of the challenges faced by the ETSCL thus illustrate some of the common issues faced when creating corpora for historical languages, including a lack of lexical resources and gold standard training and evaluation data, the difficulties of automatic processing, and the need to represent physically fragmented sources.
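
To make concrete why a gold standard matters, the sketch below shows the kind of comparison it would enable: automatic analyses are scored against hand-checked ones. The data, the tag labels, and the function name `accuracy` are invented for illustration and do not come from the ETSCL or from Tablan et al.

```python
def accuracy(predicted, gold):
    """Share of tokens whose predicted analysis matches the gold analysis."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Invented (lemma, part-of-speech) pairs; a real gold standard would be
# produced by specialists proofreading each analysis.
gold = [("lugal", "N"), ("e2", "N"), ("du3", "V"), ("lugal", "N")]
predicted = [("lugal", "N"), ("e2", "N"), ("du3", "N"), ("lugal", "N")]

score = accuracy(predicted, gold)
print(f"accuracy: {score:.2f}")
```

Without the hand-checked list there is nothing to compare against, which is the gap the authors describe.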

A recently started literary text project is the SEAL (Sources of Early Akkadian Literature) 107 corpus, which is composed of Akkadian (Babylonian and Assyrian) literary texts from the third and second millennia BC that were documented on cuneiform tablets. The goal of this project, which is funded by the German Israeli Foundation for Scientific Research and Development (GIF), is to “compile a complete and indexed corpus of Akkadian literary texts from the 3rd and 2nd Millennia BCE.” The project's creators hope that this corpus will form the basis for both a history and a glossary of early Akkadian literature. Around 150 texts are available, and they are organized by genre classifications (such as epics, hymns,

106 http://gate.ac.uk/

107 http://www.seal.uni-leipzig.de/
