Rome Wasn't Digitized in a Day - Council on Library and Information ...
the inscription, not the inscription itself. When a digital surrogate becomes available, I can point to
that. In the meantime, a way of standardizing references to parts of a work would be useful,” he added.

Other recent research has examined some potential methods for resolving the issues of semantic
encoding and linking. Romanello (2008) proposed the use of microformats [174] and the CTS to provide
semantic linking between classics e-journals and the primary sources/canonical texts they referenced.
One of the first challenges was simply to detect the canonical references themselves, for, as Romanello
demonstrated, references to ancient texts were often abridged, the abbreviations used for author and
work names varied greatly, only some citations included the editors’ names, and the reference schemes
could differ (e.g., for Aeschylus Persae, variant citations included A. Pers., Aesch. Pers., and Aeschyl.
Pers.). For this reason, Romanello et al. (2009a) explored the use of machine learning to extract
canonical references to primary classical sources from unstructured texts. Although references to
primary sources within the secondary literature can vary greatly, as seen above, they noted that a
number of similar patterns could often be detected. They thus trained conditional random fields (CRF)
to identify references to primary source texts within larger unstructured texts. CRF was a particularly
suitable algorithm because of its ability to consider a large number of token features when classifying
training data as either “citations” or “not citations.” Preliminary results on a sample of 24 pages
achieved a precision of 81 percent and a recall of 94.1 percent. [175]
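
To make the idea of "token features" more concrete, the following is a minimal sketch of the kind of per-token feature dictionary that a CRF toolkit typically consumes. The feature names, the example sentence, and the line number in it are assumptions chosen for illustration; they are not the features Romanello et al. actually used.

```python
import re

def token_features(tokens, i):
    """Build a feature dict for tokens[i]. All feature names are illustrative."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        # Abbreviated author and work names usually end in a period ("Aesch.").
        "ends_with_period": tok.endswith("."),
        # Passage locators look like "465" or "465-470".
        "is_number": bool(re.fullmatch(r"\d+(?:[-–]\d+)?\.?", tok)),
        "length": len(tok),
    }
    # Neighbouring tokens give the model local context.
    feats["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

def sentence_features(tokens):
    """Feature dicts for every token in one tokenized sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]

# Hypothetical sentence from secondary literature containing a citation.
tokens = "as we read in Aesch. Pers. 465".split()
features = sentence_features(tokens)
```

A sequence labeller trained on such dictionaries, paired with "citation" / "not citation" labels for each token, could then pick up recurring patterns such as "capitalized abbreviation, capitalized abbreviation, number" regardless of which abbreviation variant appears.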

Even when references are successfully identified, the challenges of encoding and linking remain.
Romanello (2008) stated that most references to primary texts within electronic secondary sources
were hard linked “through a tightly coupled linking system” and were also rarely encoded in a
machine-readable format. Other obstacles to semantic linking included the lack of shared standards or
best practices in terms of encoding primary references in most corpora served as XHTML documents
and the lack of common protocols to support interoperability among different text collections that
would allow the linking of primary and secondary sources. To allow as much interoperability as
possible, Romanello promoted using “a common protocol to access collections of texts and a shared
format to encode canonical references within web online resources” (Romanello 2008). The other
requirements of a semantic linking system were that it must be open ended, interoperable, and
semantic- and language-neutral. Language-neutral and unique identifiers for authors and works (such
as those of the TLG) were also recommended to support cross-linking across languages.
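
As a rough illustration of why language-neutral identifiers help, the sketch below maps the variant abbreviations mentioned earlier onto a single identifier and assembles a URN in the general CTS style (`urn:cts:namespace:textgroup.work:passage`). The lookup tables are hypothetical toy data, the identifier values `tlg0085` and `tlg002` and the `greekLit` namespace are assumptions that should be checked against the TLG Canon and the relevant CTS inventory, and the passage number is invented.

```python
# Hypothetical lookup tables; a real system would derive these from an
# authority list such as the TLG Canon rather than hard-coding them.
AUTHOR_IDS = {"a": "tlg0085", "aesch": "tlg0085", "aeschyl": "tlg0085"}
WORK_IDS = {("tlg0085", "pers"): "tlg002"}

def normalize(abbrev):
    """Lower-case an abbreviation and drop its trailing period."""
    return abbrev.strip().rstrip(".").lower()

def to_cts_urn(citation, namespace="greekLit"):
    """Turn a citation such as 'Aesch. Pers. 353' into a CTS-style URN.

    Returns None when the author or work abbreviation is not in the tables.
    """
    parts = citation.split()
    if len(parts) < 2:
        return None
    author = AUTHOR_IDS.get(normalize(parts[0]))
    if author is None:
        return None
    work = WORK_IDS.get((author, normalize(parts[1])))
    if work is None:
        return None
    urn = f"urn:cts:{namespace}:{author}.{work}"
    if len(parts) > 2:
        # Append the passage locator (a single line or a range) if present.
        urn += ":" + parts[2].rstrip(".")
    return urn

variants = ["A. Pers.", "Aesch. Pers.", "Aeschyl. Pers."]
urns = {to_cts_urn(v) for v in variants}
```

Because all three abbreviations resolve to the same identifier, references written in different citation styles or languages can be indexed together; `to_cts_urn("Aesch. Pers. 353-360")` additionally carries the passage range into the URN.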

The basic system for semantic linking outlined by Romanello thus made use of the CTS URN scheme,
which uses the TLG Canon [176] of Greek authors for identifiers, a series of microformats that he
specifically developed to embed canonical references in HTML elements, and open protocols such as
the CTS text-retrieval protocol to retrieve either whole texts or parts of texts in order to support various
value-added services such as reference indexing. Romanello proposed three microformats for his
system: ctauthor (references to canonical authors, or statements that can be made machine-readable
through the CTS URN structure); ctwork (references to works without author names); and ctref, “a
compound microformat to encode a complete canonical reference” that requires the use of ctauthor,
ctwork, and a range property to specify the text sections that were referred to. While implementation of
such microformat encoding and CTS protocols would enable a number of interesting value-added
services such as semantic linking, granular text retrieval, and cross-lingual reference indexing (e.g.,

[174] According to the microformats website, “microformats are a set of simple, open data formats built upon existing and widely adopted standards” that
have been designed to be both human and machine readable (http://microformats.org/about).
[175] Work by Romanello continues in this area through crefex (Canonical REFerences Extractor, http://code.google.com/p/crefex/) and was presented at the
Digital Classicist/ICS Work in Progress Seminar in July 2010. See Matteo Romanello, “Towards a Tool for the Automatic Extraction of Canonical
References.” http://www.digitalclassicist.org/wip/wip2010-04mr.pdf
[176] http://www.tlg.uci.edu/canon/fontsel