PDF (Online Text) - EURAC


esource and the access to the resource. For this reason, these resources will remain unexplored, even if prices are modest. Although ELRA and LDC have their merits, for small languages, better solutions are available for the hosting of data (cf. Streiter 2005).

Project Gutenberg provides structured access to its 16,000 documents (comprising about thirty languages) through an XML-RDF. Unfortunately, information characterising text T1 as a translation of T2 is still not provided; that is, although parallel corpora are implicitly present, they are not identifiable as such. In theory, the documents of Project Gutenberg could be used to build up corpora in XNLRDF. Such a copying of resources, however, might only be justifiable for writing systems for which little corpus material is available. More important might be a mapping from the writing systems of XNLRDF to the documents of Project Gutenberg, thus translating the available XML-RDF in terms of XNLRDF.

Free monolingual and parallel corpora are available at a great number of sites, most prominently at http://www.unhchr.ch/udhr/navigate/alpha.htm (Universal Declaration of Human Rights in 330 languages), http://www.translatum.gr/bible/download.htm (The Bible), and The European Parliament Proceedings Parallel Corpus (http://people.csail.mit.edu/koehn/publications/europarl), among others. Those documents that support otherwise underrepresented writing systems will be integrated into XNLRDF in the form of corpora.

The Wikipedia project is interesting for XNLRDF for a number of reasons. First, it provides documents that can be used to build corpora without infringing upon copyrights. Second, as Wikipedia is available in more than one hundred languages, thousands of quasi-parallel texts become accessible. Third, the model of cooperation in Wiki projects, and the underlying software, indicate the direction XNLRDF will take. Thus, XNLRDF will gradually enlarge the community of researchers involved, to the point that the world’s linguists will be able to collect the data they need for their writing systems. This issue will be further discussed below.

5. Conceptual Design of XNLRDF<br />

The purpose of XNLRDF is to find adequate NLP resources to process a text document. To this end, the metadata of the document and the resource are matched. The better the match, the more suitable the resource is for the processing of the document. The metadata matched are those categories that make up the writing system.
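The matching idea described above can be illustrated with a minimal sketch. It is not the XNLRDF implementation: the category names in `CATEGORIES`, the scoring rule (fraction of agreeing categories) and the function names `match_score` and `rank_resources` are all assumptions made for illustration, since the text only states that the matched metadata are the categories that make up the writing system.

```python
# Hypothetical categories; the text only says that the matched metadata are
# "those categories that make up the writing system".
CATEGORIES = ("language", "locality", "script", "orthography")


def match_score(document_meta, resource_meta):
    """Return the fraction of categories on which document and resource agree.

    A category counts as agreeing only if the document specifies it and the
    resource carries the same value. This scoring rule is an assumption.
    """
    agreed = sum(
        1
        for category in CATEGORIES
        if document_meta.get(category) is not None
        and document_meta.get(category) == resource_meta.get(category)
    )
    return agreed / len(CATEGORIES)


def rank_resources(document_meta, resources):
    """Return (score, name) pairs, best match first, ties broken by name."""
    scored = [
        (match_score(document_meta, meta), name)
        for name, meta in resources.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    # Illustrative metadata values, not taken from XNLRDF.
    document = {"language": "deu", "locality": "IT", "script": "Latn"}
    resources = {
        "tokenizer_a": {"language": "deu", "locality": "IT", "script": "Latn"},
        "tokenizer_b": {"language": "deu", "locality": "DE", "script": "Latn"},
    }
    for score, name in rank_resources(document, resources):
        print(name, score)
```

Under these assumptions, a resource that agrees with the document on more categories is ranked higher, which mirrors the statement that a better match indicates a more suitable resource. A real system might weight categories differently or allow partial matches within a category.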
