26.12.2014 Views

Rome Wasn't Digitized in a Day - Council on Library and Information ...


the inscription, not the inscription itself. When a digital surrogate becomes available, I can point to that. In the meantime, a way of standardizing references to parts of a work would be useful,” he added.

Other recent research has examined some potential methods for resolving the issues of semantic encoding and linking. Romanello (2008) proposed the use of microformats 174 and the CTS to provide semantic linking between classics e-journals and the primary sources/canonical texts they referenced. One of the first challenges was simply to detect the canonical references themselves, for as Romanello demonstrated, references to ancient texts were often abridged, the abbreviations used for author and work names varied greatly, only some citations included the editors’ names, and the reference schemes could differ (e.g., for Aeschylus Persae, variant citations included A. Pers., Aesch. Pers., and Aeschyl. Pers.). For this reason, Romanello et al. (2009a) explored the use of machine learning to extract canonical references to primary classical sources from unstructured texts. Although references to primary sources within the secondary literature can vary greatly, as seen above, they noted that a number of similar patterns could often be detected. They thus trained conditional random fields (CRF) to identify references to primary source texts within larger unstructured texts. CRF was a particularly suitable algorithm because of its ability to consider a large number of token features when classifying training data as either “citations” or “not citations.” Preliminary results on a sample of 24 pages achieved a precision of 81 percent and a recall of 94.1 percent. 175
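
To make the idea of “token features” concrete, the following is a minimal sketch of the kind of per-token description a sequence classifier such as a CRF could consume. The feature names, the two regular expressions, and the sample sentence are all assumptions chosen for illustration; they are not the features reported by Romanello et al. (2009a).

```python
import re

# Hypothetical patterns, chosen only for illustration.
ABBREV = re.compile(r"^[A-Z][a-z]*\.$")             # e.g. "A.", "Aesch.", "Pers."
NUMERIC = re.compile(r"^\d+(?:[.,-]\d+)*[.,;)]?$")  # e.g. "472", "1.2-3"


def token_features(tokens, i):
    """Describe tokens[i] with surface features plus a little context."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "ends_with_period": tok.endswith("."),
        "looks_like_abbrev": bool(ABBREV.match(tok)),
        "looks_numeric": bool(NUMERIC.match(tok)),
    }
    # Context features: a citation is usually a run of abbreviations
    # followed by a passage number, so neighbours are informative.
    if i > 0:
        feats["prev_looks_like_abbrev"] = bool(ABBREV.match(tokens[i - 1]))
    if i + 1 < len(tokens):
        feats["next_looks_numeric"] = bool(NUMERIC.match(tokens[i + 1]))
    return feats


def sentence_features(sentence):
    """Split on whitespace and return (tokens, one feature dict per token)."""
    tokens = sentence.split()
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


tokens, feats = sentence_features(
    "as the chorus laments in Aesch. Pers. 472 the defeat"
)
```

Each token ends up with a small dictionary of observations. A trained sequence model would weigh these jointly across neighbouring tokens, which is what lets it recognize “Aesch. Pers. 472” as one reference even when the abbreviation itself has not been seen before.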

Even when references are successfully identified, the challenges of encoding and linking remain. Romanello (2008) stated that most references to primary texts within electronic secondary sources were hard linked “through a tightly coupled linking system” and were also rarely encoded in a machine-readable format. Other obstacles to semantic linking included the lack of shared standards or best practices in terms of encoding primary references in most corpora served as XHTML documents and the lack of common protocols to support interoperability among different text collections that would allow the linking of primary and secondary sources. To allow as much interoperability as possible, Romanello promoted using “a common protocol to access collections of texts and a shared format to encode canonical references within web online resources” (Romanello 2008). The other requirements of a semantic linking system were that it must be open ended, interoperable, and semantic- and language-neutral. Language-neutral and unique identifiers for authors and works (such as those of the TLG) were also recommended to support cross-linking across languages.
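
As a rough illustration of why language-neutral identifiers help, the sketch below resolves the three variant citations of Aeschylus’ Persae mentioned earlier to a single identifier string shaped like a CTS URN (`urn:cts:<namespace>:<textgroup>.<work>:<passage>`). The lookup tables, the `greekLit` namespace, the identifier values `tlg0085` and `tlg002`, the passage number, and the function name `to_urn` are all assumptions made for this example; none is taken from Romanello’s implementation.

```python
import re

# Hypothetical lookup tables: the identifier values are illustrative
# assumptions, not an authoritative extract of the TLG Canon.
AUTHOR_IDS = {"a": "tlg0085", "aesch": "tlg0085", "aeschyl": "tlg0085"}
WORK_IDS = {("tlg0085", "pers"): "tlg002"}

# "<author abbrev>. <work abbrev>. [passage]", e.g. "Aesch. Pers. 472"
CITATION = re.compile(
    r"^\s*(?P<author>[A-Za-z]+)\.\s*(?P<work>[A-Za-z]+)\.\s*"
    r"(?P<passage>\d+(?:-\d+)?)?\s*$"
)


def to_urn(citation, namespace="greekLit"):
    """Map an abbreviated citation to a CTS-style URN, or None if unknown."""
    m = CITATION.match(citation)
    if m is None:
        return None
    author_id = AUTHOR_IDS.get(m.group("author").lower())
    if author_id is None:
        return None
    work_id = WORK_IDS.get((author_id, m.group("work").lower()))
    if work_id is None:
        return None
    urn = f"urn:cts:{namespace}:{author_id}.{work_id}"
    passage = m.group("passage")
    return f"{urn}:{passage}" if passage else urn


variants = ["A. Pers. 472", "Aesch. Pers. 472", "Aeschyl. Pers. 472"]
urns = {to_urn(v) for v in variants}
```

All three spellings collapse to the same string, so an index keyed on that string would gather references regardless of the abbreviation style or language of the citing article. A real system would need a far larger authority list and would have to handle ambiguous abbreviations, which this sketch ignores.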

The basic system for semantic linking outlined by Romanello thus made use of the CTS URN scheme, which uses the TLG Canon 176 of Greek authors for identifiers, a series of microformats that he specifically developed to embed canonical references in HTML elements, and open protocols such as the CTS text-retrieval protocol to retrieve either whole texts or parts of texts in order to support various value-added services such as reference indexing. Romanello proposed three microformats for his system: ctauthor (references to canonical authors, or statements that can be made machine-readable through the CTS URN structure); ctwork (references to works without author names); and ctref, “a compound microformat to encode a complete canonical reference” that requires the use of ctauthor, ctwork, and a range property to specify the text sections that were referred to. While implementation of such microformat encoding and CTS protocols would enable a number of interesting value-added services such as semantic linking, granular text retrieval, and cross-lingual reference indexing (e.g.,

174 According to the microformats website, “microformats are a set of simple, open data formats built upon existing and widely adopted standards” that have been designed to be both human and machine readable (http://microformats.org/about).

175 Work by Romanello continues in this area through crefex (Canonical REFerences Extractor, http://code.google.com/p/crefex/) and was presented at the Digital Classicist/ICS Work in Progress Seminar in July 2010. See Matteo Romanello, “Towards a Tool for the Automatic Extraction of Canonical References.” http://www.digitalclassicist.org/wip/wip2010-04mr.pdf

176 http://www.tlg.uci.edu/canon/fontsel
