Rome Wasn't Digitized in a Day - Council on Library and Information ...
First, it utilizes a nearest neighbor framework that requires no hand-crafted rules, and provides
analogies to facilitate learning. Second, and perhaps more significantly, it exploits a large,
unlabelled corpus to improve the prediction of novel roots (Lee 2008).
Lee observed that many students of Ancient Greek memorized “paradigmatic” verbs that could be used
as analogies to identify the roots of unseen verbs. From this insight, Lee utilized a “nearest-neighbor”
machine-learning framework to model this process. When given a word in an inflected form, the
algorithm searched for the root form among its “neighbors” by making substitutions to its prefix and
suffix. Valid substitutions are harvested from pairs of inflected and root forms in a training set of data,
and these pairs are then used to serve as “analogies to reinforce learning.” Nonetheless, Ancient Greek
still posed some challenges that complicated a minimally supervised approach. Lee explained that
heavily inflected languages such as Greek suffer from “data sparseness” since many inflected forms
appear at most a few times and many root forms may not appear at all in a corpus. As a rule-based
system, Morpheus needed a priori knowledge of possible stems and affixes, all of which had to be
crafted by hand. To provide a more scalable alternative, Lee used a data-driven approach that
automatically determined stems and affixes from training data (morphology data for the Greek
Septuagint from the University of Pennsylvania) and then used the TLG as a source of unlabeled data
to guide prediction of novel roots.
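The general idea described above can be illustrated with a minimal sketch. It is not Lee's implementation: the transliterated word pairs, the function names, the use of the longest common substring as the stem, and the ranking heuristic (prefer candidates attested in the unlabelled vocabulary, then the most frequent rule) are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def harvest_rule(inflected, root):
    """Derive one (prefix, suffix) substitution from a training pair.

    Assumption: the longest common substring is treated as the shared stem.
    """
    matcher = SequenceMatcher(None, inflected, root, autojunk=False)
    m = matcher.find_longest_match(0, len(inflected), 0, len(root))
    prefix_rule = (inflected[:m.a], root[:m.b])
    suffix_rule = (inflected[m.a + m.size:], root[m.b + m.size:])
    return prefix_rule, suffix_rule


def candidates(word, rule_counts):
    """Apply every harvested rule that fits the word; sum the rule counts."""
    found = Counter()
    for ((pre_from, pre_to), (suf_from, suf_to)), count in rule_counts.items():
        if not (word.startswith(pre_from) and word.endswith(suf_from)):
            continue
        stem = word[len(pre_from):len(word) - len(suf_from)]
        if not stem:  # the rule would consume the whole word
            continue
        found[pre_to + stem + suf_to] += count
    return found


def predict_root(word, rule_counts, unlabelled_vocab):
    """Pick the best candidate root, or None if no rule applies.

    Assumed ranking: attested in the unlabelled vocabulary first,
    then supported by the most training pairs.
    """
    cands = candidates(word, rule_counts)
    if not cands:
        return None
    return max(cands, key=lambda c: (c in unlabelled_vocab, cands[c]))


# Hypothetical training pairs (inflected form, root), transliterated.
training_pairs = [
    ("luei", "luo"),
    ("graphei", "grapho"),
    ("poiei", "poieo"),
    ("elusa", "luo"),
    ("paideuomen", "paideuo"),
]
rule_counts = Counter(harvest_rule(i, r) for i, r in training_pairs)

# Hypothetical vocabulary taken from a large unlabelled corpus.
unlabelled_vocab = {"phileo", "paideuo", "pempo"}

print(predict_root("philei", rule_counts, unlabelled_vocab))
print(predict_root("philei", rule_counts, set()))
print(predict_root("epaideusa", rule_counts, unlabelled_vocab))
```

In this toy setup the training pairs alone favor the more frequent substitution "ei" to "o" for "philei", while the unlabelled vocabulary tips the choice toward the competing candidate that is actually attested. This mirrors, in a very simplified way, the role the source assigns to the unlabelled corpus.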
While Lee made use of machine learning and unlabeled corpora, Tambouratzis (2008) automated the
morphological segmentation of Greek by “coupling an iterative pattern-recognition algorithm with a
modest amount of linguistic knowledge, expressed via a set of interactions associated with weights.”
He used an “ant colony optimization (ACO) metaheuristic” to automatically determine optimal weight
values and found that in several cases the automatic system provided better results than those that had
been manually determined by scholars. In contrast to Lee, Tambouratzis used only a subset of the TLG
for training data (in this case, the speeches of several Greek orators).
In addition to the work done by Dik and Whaling for “Perseus Under PhiloLogic,” other research into
automatic morphological analysis of Latin has been conducted by Finkel and Stump (2009). These
authors reported on computational experiments to generate the morphology of Latin verbs.
Lexicons
Lexicons are reference tools that have long played an important role in classical scholarship and
particularly in the study of historical languages. 153 As previously noted, the lack of a computational
lexicon for Sanskrit is a major research challenge. This section explores some important lexicons for
classical languages and suggests new roles for these traditional reference works in a digital
environment.
The Comprehensive Aramaic Lexicon 154 (CAL) hopes to serve as a “new dictionary of the Aramaic
language.” Aramaic is a Semitic language, and numerous inscriptions and papyri, as well as Biblical
and other religious texts, are written in it. This project, currently in preparation by an international
team of scholars, is based at Hebrew Union College in Cincinnati. The goal is to create a
comprehensive lexicon that will take all of ancient Aramaic into account, be based on a compilation of
all Aramaic literature, and include extensive references to modern scholarly literature. Although a
153 This section focuses on larger projects that plan to create online or digital lexicons in addition to printed ones, but there are also a number of lexicons
for classical languages that have been placed online as PDFs or in other static formats, such as the Chicago Demotic Dictionary
(http://oi.uchicago.edu/research/projects/dem/); other projects have scanned historical dictionaries and provided online searching capabilities, such as
Sanskrit, Tamil and Pahlavi Dictionaries, http://webapps.uni-koeln.de/tamil/
154 http://cal1.cn.huc.edu/