Rome Wasn't Digitized in a Day - Council on Library and Information ...
provides English translations; and (4) it is the “only corpus of any Ancient Middle Eastern language
that has been tagged and lemmatized” (Ebeling 2007). These literary texts also differ from
administrative texts in that they spell out the morphology in detail and provide a source for cultural and
religious vocabulary.
The ETSCL, like the CDLI and the BDNTS, contains transliterations of Sumerian, where the original
cuneiform has been converted into and is represented by a sequence of Roman characters. As noted
above, it also contains English translations, and transliterations and translations are linked at the
paragraph level. This supports a parallel reading of the original text and translation. In addition, the
entire corpus is marked up in TEI (P4) with some extensions in order to accommodate textual variants
and linguistic annotations, which, Ebeling admitted, had “sometimes stretched the descriptive
apparatus to the limit.” One challenge of presenting the text and transliteration side-by-side in the
ETSCL was that the transliteration was often put together from several fragmentary sources. This was
solved by using a tag pair from the TEI. The “type” attribute is used to
indicate whether a variant is primary or secondary. A special format was also developed for
encoding broken and damaged texts.
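The two ideas in this paragraph, paragraph-level linking of transliteration and translation and a composite text whose readings are marked primary or secondary, can be illustrated with a minimal Python sketch. It is not the ETSCL's TEI encoding: the class names, the manuscript labels "A" and "B", and the sample strings are all assumptions invented for illustration and are not taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    text: str    # transliterated word as it appears in one source
    kind: str    # "primary" or "secondary", mirroring the "type" attribute
    source: str  # hypothetical manuscript label


@dataclass
class ParallelParagraph:
    par_id: str        # shared identifier linking both versions
    translation: str   # English translation of the whole paragraph
    segments: list     # one list of Reading objects per word position

    def composite(self) -> str:
        """Build the composite transliteration from the primary readings."""
        parts = []
        for readings in self.segments:
            primary = next(r for r in readings if r.kind == "primary")
            parts.append(primary.text)
        return " ".join(parts)

    def secondary_variants(self) -> list:
        """Collect every reading marked as a secondary variant."""
        return [r for readings in self.segments
                for r in readings if r.kind == "secondary"]


# Toy paragraph (illustrative data only, not from the ETSCL).
para = ParallelParagraph(
    par_id="p1",
    translation="The king built the house.",
    segments=[
        [Reading("lugal-e", "primary", "A")],
        [Reading("e2", "primary", "A")],
        [Reading("mu-un-du3", "primary", "A"),
         Reading("mu-du3", "secondary", "B")],
    ],
)

print(para.composite())
print(para.translation)
for variant in para.secondary_variants():
    print(f"variant in {variant.source}: {variant.text}")
```

Keeping the variants attached to the word position they belong to, rather than in a separate list, is one way to let a reader see the composite line while still being able to recover which source supplied each alternative reading.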
One major advance of the ETSCL for corpus studies is that the transliterations were lemmatized with
an automatic morphological parser (developed by the PSD Project) and the output was then proofread.
Although this process took a year, it supports lemmatized searching of the ETSCL, and when a user
clicks on an individual lemma in the ETSCL it can launch a search in the PSD and vice versa. In sum,
the ETSCL serves as a “diachronic, annotated, transliterated, bilingual, parallel corpus of literature or
as an all-in-one corpus” (Ebeling 2007). The further development of linguistic analysis and corpus
search tools for the ETSCL has also been detailed by Tablan et al. (2006):
The main aim of our work is to create a set of tools for performing automatic morphological
analysis of Sumerian. This essentially entails identifying the part of speech for each word in the
corpus (technically, this only involves nouns and verbs which are the only categories that are
inflected), separating the lemma part from the clitics and assigning a morphological function to
each of the clitics (Tablan et al. 2006).
The authors used the open-source GATE (General Architecture for Text Engineering), 106 which was
developed at the University of Sheffield, and found that one of the biggest problems in evaluating the
success of their methods was that they lacked a morphological gold standard for Sumerian against
which to evaluate their data. Many of the challenges faced by the ETSCL thus illustrate some of the
common issues faced when creating corpora for historical languages, including a lack of lexical
resources and gold standard training and evaluation data, the difficulties of automatic processing, and
the need to represent physically fragmented sources.
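The analysis steps quoted above (identify the part of speech, separate the lemma from the clitics, assign a function to each clitic) and the lemmatized searching they enable can be sketched in a few lines of Python. This is a toy illustration only: it does not use GATE or the PSD parser, it handles only a handful of nouns, and the two-entry lexicon, the clitic table, and the function names are assumptions made for the example. Real Sumerian orthography is far less regular than a split on hyphens suggests.

```python
# Toy lexicon and clitic table: illustrative assumptions, not project data.
LEMMAS = {
    "lugal": ("N", "king"),
    "e2": ("N", "house"),
}
CLITICS = {
    "e": "ergative",
    "ra": "dative",
    "a": "locative",
}


def analyse(word):
    """Split a hyphenated transliterated word into lemma and clitics.

    Returns None when the lemma or any clitic is not in the toy tables.
    """
    head, *rest = word.split("-")
    if head not in LEMMAS:
        return None
    pos, gloss = LEMMAS[head]
    clitics = []
    for segment in rest:
        if segment not in CLITICS:
            return None
        clitics.append((segment, CLITICS[segment]))
    return {"lemma": head, "pos": pos, "gloss": gloss, "clitics": clitics}


def build_lemma_index(tokens):
    """Map each recognised lemma to the token positions where it occurs."""
    index = {}
    for position, token in enumerate(tokens):
        analysis = analyse(token)
        if analysis is not None:
            index.setdefault(analysis["lemma"], []).append(position)
    return index


# Hypothetical token sequence; unrecognised words are simply skipped.
tokens = ["lugal-e", "e2-a", "lugal-ra", "mu-un-du3"]
lemma_index = build_lemma_index(tokens)

print(analyse("lugal-ra"))
print(lemma_index)
```

A search for the lemma `lugal` in this index finds both inflected forms, which is the practical benefit of lemmatized searching. The sketch also shows why a gold standard matters: without hand-verified analyses, there is nothing against which to check whether a split such as the one produced by `analyse` is correct.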
A recently started literary text project is the SEAL (Sources of Early Akkadian Literature) 107 corpus,
which is composed of Akkadian (Babylonian and Assyrian) literary texts from the third and second
millennia BC that were documented on cuneiform tablets. The goal of this project, which is funded by
the German Israeli Foundation for Scientific Research and Development (GIF), is to “compile a
complete and indexed corpus of Akkadian literary texts from the 3rd and 2nd Millennia BCE.” They
hope that this corpus will form the basis for both a history and a glossary of early Akkadian literature.
Around 150 texts are available and they are organized by genre classifications (such as epics, hymns,
106 http://gate.ac.uk/
107 http://www.seal.uni-leipzig.de/