26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...




First, it utilizes a nearest neighbor framework that requires no hand-crafted rules, and provides analogies to facilitate learning. Second, and perhaps more significantly, it exploits a large, unlabelled corpus to improve the prediction of novel roots (Lee 2008).

Lee observed that many students of Ancient Greek memorized “paradigmatic” verbs that could be used as analogies to identify the roots of unseen verbs. From this insight, Lee utilized a “nearest-neighbor” machine-learning framework to model this process. When given a word in an inflected form, the algorithm searched for the root form among its “neighbors” by making substitutions to its prefix and suffix. Valid substitutions were harvested from pairs of inflected and root forms in a training set of data, and these pairs then served as “analogies to reinforce learning.” Nonetheless, Ancient Greek still posed some challenges that complicated a minimally supervised approach. Lee explained that heavily inflected languages such as Greek suffer from “data sparseness,” since many inflected forms appear at most a few times and many root forms may not appear at all in a corpus. As a rule-based system, Morpheus needed a priori knowledge of possible stems and affixes, all of which had to be crafted by hand. For greater scalability, Lee took a data-driven approach that automatically determined stems and affixes from training data (morphology data for the Greek Septuagint from the University of Pennsylvania) and then used the TLG as a source of unlabeled data to guide prediction of novel roots.
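The substitution idea described above can be illustrated with a short Python sketch. This is not Lee's implementation: the transliterated word forms, the function names `harvest_rules`, `candidate_roots`, and `predict_root`, the use of the longest common substring as the stem, and the frequency-based ranking are all assumptions made purely for illustration. Rules are harvested from (inflected, root) training pairs, applied to an unseen form, and the resulting candidates are ranked by how often they occur in an unlabeled token list.

```python
from collections import Counter
from difflib import SequenceMatcher


def harvest_rules(pairs):
    """Count affix-substitution rules seen in (inflected, root) pairs.

    Illustrative assumption: the longest common substring is the stem;
    whatever surrounds it in each form becomes the rule.
    """
    rules = Counter()
    for inflected, root in pairs:
        m = SequenceMatcher(None, inflected, root).find_longest_match(
            0, len(inflected), 0, len(root))
        if m.size == 0:
            continue  # no shared stem, so no usable analogy
        rule = (inflected[:m.a], inflected[m.a + m.size:],  # affixes to strip
                root[:m.b], root[m.b + m.size:])            # affixes to add
        rules[rule] += 1
    return rules


def candidate_roots(word, rules):
    """Apply every matching rule to `word`; map candidate -> rule support."""
    candidates = Counter()
    for (pre_in, suf_in, pre_out, suf_out), count in rules.items():
        if (len(word) > len(pre_in) + len(suf_in)
                and word.startswith(pre_in) and word.endswith(suf_in)):
            stem = word[len(pre_in):len(word) - len(suf_in)]
            candidates[pre_out + stem + suf_out] += count
    return candidates


def predict_root(word, rules, corpus_counts):
    """Pick the candidate best attested in the unlabeled corpus,
    breaking ties by how often its rule was seen in training."""
    candidates = candidate_roots(word, rules)
    if not candidates:
        return None
    return max(candidates, key=lambda c: (corpus_counts[c], candidates[c]))


# Hypothetical toy data in Latin transliteration (not from any real corpus).
training_pairs = [("luei", "luo"), ("graphei", "grapho"),
                  ("elue", "luo"), ("age", "ago")]
unlabeled_tokens = ["paideuo", "paideuei", "paideuo", "luo"]

rules = harvest_rules(training_pairs)
corpus_counts = Counter(unlabeled_tokens)
print(predict_root("epaideue", rules, corpus_counts))  # paideuo
```

In this toy run two rules match the unseen form `epaideue`, yielding the candidates `paideuo` and `epaideuo`; only the first is attested in the unlabeled tokens, so it is preferred. That is the sense in which unlabeled text can guide the prediction of a root that never appeared in the training pairs.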

While Lee made use of machine learning and unlabeled corpora, Tambouratzis (2008) automated the morphological segmentation of Greek by “coupling an iterative pattern-recognition algorithm with a modest amount of linguistic knowledge, expressed via a set of interactions associated with weights.” He used an “ant colony optimization (ACO) metaheuristic” to automatically determine optimal weight values and found that in several cases the automatic system provided better results than those that had been manually determined by scholars. In contrast to Lee, Tambouratzis used only a subset of the TLG for training data (in this case, the speeches of several Greek orators).

In addition to the work done by Dik and Whaling for “Perseus Under PhiloLogic,” other research into automatic morphological analysis of Latin has been conducted by Finkel and Stump (2009). These authors reported on computational experiments to generate the morphology of Latin verbs.

Lexicons

Lexicons are reference tools that have long played an important role in classical scholarship and particularly in the study of historical languages. 153 As previously noted, the lack of a computational lexicon for Sanskrit is a major research challenge. This section explores some important lexicons for classical languages and suggests new roles for these traditional reference works in a digital environment.

The Comprehensive Aramaic Lexicon 154 (CAL) hopes to serve as a “new dictionary of the Aramaic language.” Aramaic is a Semitic language, and numerous inscriptions and papyri, as well as Biblical and other religious texts, are written in it. This project, currently in preparation by an international team of scholars, is based at Hebrew Union College in Cincinnati. The goal is to create a comprehensive lexicon that will take all of ancient Aramaic into account, be based on a compilation of all Aramaic literature, and include extensive references to modern scholarly literature. Although a

153 This section focuses on larger projects that plan to create online or digital lexicons in addition to printed ones, but there are also a number of lexicons for classical languages that have been placed online as PDFs or in other static formats, such as the Chicago Demotic Dictionary (http://oi.uchicago.edu/research/projects/dem/); other projects have scanned historical dictionaries and provided online searching capabilities, such as Sanskrit, Tamil and Pahlavi Dictionaries, http://webapps.uni-koeln.de/tamil/

154 http://cal1.cn.huc.edu/
