05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh



Chapter 4. System Extension to a New Language

rely on expensive manually annotated training data. Therefore, non-recoverable engineering costs for extending and updating the classifier are kept to a minimum. Not only can the system be easily applied to new data from the same domain and language without a serious performance decrease, it can also be extended to a new language and produce similarly high scores. The performance could possibly be even better for languages with the same script that are less closely related than French and English or German and English.

The English inclusion classifier described in this and the previous chapter is designed particularly for languages composed of tokens separated by white space and punctuation and with Latin-based scripts. A system that tracks English inclusions occurring in languages with non-Latin-based scripts necessitates a different setup, as the inclusions tend to be transcribed in the alphabet of the base language of the text (e.g. in Russian). The English inclusion classifier is also not designed to deal with languages where words are not separated by white space. An entirely different approach would be required for such a scenario.
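
The two preconditions named above (tokens delimited by white space and punctuation, and a Latin-based script) can be illustrated with a minimal sketch. It is not taken from the classifier itself: the names `tokenize`, `is_latin_token` and `latin_ratio` are hypothetical, and using Unicode character names as a script test is an assumption made for illustration only.

```python
import re
import unicodedata

# Hypothetical helpers for illustration; not part of the thesis system.
# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text on white space and punctuation."""
    return TOKEN_RE.findall(text)


def is_latin_token(token):
    """True if the token has letters and all of them are Latin letters."""
    letters = [ch for ch in token if ch.isalpha()]
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("LATIN") for ch in letters
    )


def latin_ratio(text):
    """Fraction of letter-bearing tokens written entirely in Latin script."""
    words = [t for t in tokenize(text) if any(ch.isalpha() for ch in t)]
    if not words:
        return 0.0
    return sum(is_latin_token(t) for t in words) / len(words)


if __name__ == "__main__":
    for sample in (
        "Das Software-Update ist fertig.",  # Latin script, white-space tokens
        "Это новый софтвер",                # Cyrillic: inclusion is transcribed
        "我喜欢音乐",                        # no white space between words
    ):
        print(latin_ratio(sample), tokenize(sample))
```

Under these assumptions, a German sentence yields only Latin-script tokens, whereas a Russian sentence yields none, which is why an inclusion transcribed into Cyrillic could not be found by matching Latin-script word forms. The unsegmented example comes back as a single token, showing that white-space tokenisation alone gives no word boundaries for such languages.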
