26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


Text Mining, Quotation Detection, and Authorship Attribution

A number of potential technologies could benefit both from automatic citation detection and from the broader use of more standardized citation encoding in digital corpora; these include text-mining applications, such as the study of text reuse, as well as quotation detection and authorship attribution. While the research presented in this section made use of various text-mining and NLP techniques with unlabeled corpora, digital texts with large numbers of citations either automatically or manually marked up could provide useful training data for this kind of work. Regardless of how the information is detected and extracted, the ability to examine text reuse, trace quotations,179 analyze individual authors, and study different patterns of authorship will be an increasingly important service expected not only by users of mass-digitization projects but also by users of classical digital libraries.

The eAQUA project,180 based in Germany, is broadly investigating how text-mining technologies might be used in the analysis of classical texts through six specific subprojects (reconstruction of the lost works of the Atthidographers, text reuse in Plato, papyri classification, extraction of templates for inscriptions, metrical analysis of Plautus, and text completion of fragmentary texts).181 “The main focus of this project is to break down research questions from the field of Classics in a reusable format fitting with NLP algorithms,” Büchler et al. (2008) submitted, “and to apply this type of approach to the data from the Ancient sources.” This approach of first determining how classical scholars actually conduct research and then attempting to match those processes with appropriate algorithms shows the importance of understanding the discipline for which you are designing tools. This essential point recurs throughout this review.

The basic vision of eAQUA is to present a unified approach consisting of “Data, Algorithms and Applications,” and this project specifically addresses both the development of applications (research questions) and algorithms (NLP, text mining, co-occurrence analysis, clustering, classification). Data or corpora from research partners will be imported through standardized data interfaces into an eAQUA portal that is being developed. This portal will also provide access to all the structured data that are extracted through a variety of web services that can be used by scholars.182

One area of active research in the eAQUA project is the use of citation detection and textual reuse in the TLG corpus to investigate “the reception of Plato as a case study of textual reuse on ancient Greek texts” (Büchler and Geßner 2009). In their work, they first extracted word-by-word citations by combining n-gram overlaps and significant terms for several works of Plato; second, they loosened the constraints on syntactic word order to find citations. The authors emphasized that developing appropriate visualization tools is essential to the study of textual reuse, since text-mining approaches to corpora typically produce a huge amount of data that simply cannot be explored manually. Their paper thus offers several intriguing visualizations, including highlighting the differences in citations to works of Plato across time (from the Neo-Platonists to the Middle

179 Preliminary research on quotation identification and tracking has been reported for Google Books (Schilit and Kolak 2008).

180 http://www.eaqua.net/en/index.php

181 The computational challenges of automatic metrical analysis and fragmentary texts have received some research attention. For metrical analysis, see Deufert et al. (2010), Eder (2007), Fusi (2008), and Papakitsos (2011). For fragmentary texts, see Berti et al. (2009) and Romanello et al. (2009b). The use of digital technology for inscriptions and papyri is covered in their respective sections.

182 According to the W3C, a web service can be defined as “a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards” (http://www.w3.org/TR/ws-arch/#whatis). Two important related standards are SOAP (Simple Object Access Protocol), a “lightweight protocol intended for exchanging structured information in a decentralized, distributed environment” (http://www.w3.org/TR/soap12-part1/), and WSDL (Web Services Description Language), an “XML format for describing network services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information” (http://www.w3.org/TR/wsdl).
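
The two-step reuse detection described above for the Plato case study (exact n-gram overlap first, then relaxed word order) can be illustrated with a minimal sketch. This is not the eAQUA implementation: the function names `tokenize`, `ngrams`, and `shared_ngrams`, the window size of three tokens, and the English sample sentences are all assumptions made for illustration, and the weighting by significant terms that Büchler and Geßner combine with n-gram overlap is omitted.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens.

    A simplification: real Greek text would also need normalization
    of accents, breathings, and editorial marks.
    """
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n, ordered=True):
    """Yield (position, key) for every window of n consecutive tokens.

    With ordered=False the tokens inside each window are sorted, so two
    windows containing the same words in a different order get the same key.
    """
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        yield i, tuple(window if ordered else sorted(window))


def shared_ngrams(source, target, n=3, ordered=True):
    """Return the (position, key) pairs of target n-grams also found in source."""
    source_keys = {key for _, key in ngrams(tokenize(source), n, ordered)}
    return [
        (i, key)
        for i, key in ngrams(tokenize(target), n, ordered)
        if key in source_keys
    ]


if __name__ == "__main__":
    # Hypothetical sample sentences, not drawn from the TLG corpus.
    source = "the unexamined life is not worth living"
    verbatim = "he said the unexamined life is not worth living for a man"
    reordered = "a life unexamined is not worth living"

    # Step 1: word-by-word reuse via exact n-gram overlap.
    print(shared_ngrams(source, verbatim, n=3))

    # Step 2: relaxing word order recovers matches that step 1 misses.
    print(len(shared_ngrams(source, reordered, n=3, ordered=True)))
    print(len(shared_ngrams(source, reordered, n=3, ordered=False)))
```

Sorting the tokens inside each window is only one simple way to relax word order; it treats any permutation of the same words as a match, which suits a sketch but would likely produce too many false positives on a large corpus without additional filtering.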
