05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh


Chapter 5. Parsing English Inclusions

cessing inclusions. Most of the time, they are unknown words and, as they originate from another language, standard methods for unknown word guessing (suffix stripping, etc.) are unlikely to be successful. Furthermore, the fact that inclusions are often multi-word expressions (e.g., named entities or code-switches) means that simply part-of-speech (POS) tagging them accurately is not sufficient: the parser positing a phrase boundary within an inclusion is likely to severely decrease accuracy.

After a brief summary of related work in Section 5.1, this chapter describes an extrinsic evaluation of this classifier for parsing. It is shown that recognising and dealing with English inclusions via a special annotation label improves the accuracy of parsing. In particular, this chapter demonstrates that detecting English inclusions in German text improves the performance of two German parsers, a treebank-induced parser as well as a parser based on a hand-crafted grammar (Sections 5.3 and 5.4). Crucially, the former parser requires modifications of its underlying grammar to deal with the inclusions, whereas the latter's grammar is already designed to deal with multi-word expressions signalled in the input. Both parsers and the necessary modifications are described in detail in Sections 5.3.1 and 5.4.1. The data used for all the parsing experiments is described in Section 5.2.
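As a rough illustration of why multi-word inclusions need to be handled as single units, the following Python sketch groups consecutive tokens that a classifier has flagged as English under one label, so that a downstream parser has no opportunity to posit a phrase boundary inside them. The function name, the `EN_INCL` label and the boolean flags are assumptions made for this example only; they are not the annotation scheme or the parser input format used in this thesis.

```python
from typing import List, Tuple

# Hypothetical label for illustration; not the label used in the thesis.
INCLUSION_LABEL = "EN_INCL"
OTHER_LABEL = "O"


def group_inclusions(tokens: List[str], is_english: List[bool]) -> List[Tuple[str, str]]:
    """Merge each maximal run of English-flagged tokens into one labelled unit."""
    if len(tokens) != len(is_english):
        raise ValueError("tokens and is_english must have the same length")

    units: List[Tuple[str, str]] = []
    run: List[str] = []  # tokens of the inclusion currently being collected

    for token, flag in zip(tokens, is_english):
        if flag:
            run.append(token)
            continue
        if run:
            # A non-English token closes the current inclusion.
            units.append((" ".join(run), INCLUSION_LABEL))
            run = []
        units.append((token, OTHER_LABEL))

    if run:
        # The sentence ended inside an inclusion.
        units.append((" ".join(run), INCLUSION_LABEL))
    return units
```

For example, calling `group_inclusions(["Das", "Know", "how", "fehlt"], [False, True, True, False])` yields three units, with "Know how" kept together as a single inclusion rather than two independently tagged words.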

5.1 Related Work<br />

Previous work on inclusion detection exists in the TTS literature (Pfister and Romsdorfer, 2003; Farrugia, 2005; Marcadet et al., 2005), which is reviewed in detail in Sections 2.2.1.1 and 2.2.1.2. Here, the aim is to design a system that recognises foreign inclusions on the word and sentence level and functions as the front-end to a polyglot TTS synthesiser. Similar initial efforts have been undertaken in the field of lexicography where the importance of recognising anglicisms from the perspective of lexicographers responsible for updating lexicons and dictionaries has been acknowledged (Andersen, 2005) (see also Section 2.2.1.4). In the context of parsing, however, there has been little focus on this issue. Although Forst and Kaplan (2006) have stated the need for dealing with foreign inclusions in parsing as they are detrimental to a parser's performance, they do not substantiate this claim using numeric results.

Previous work reported in this thesis has focused on devising a classifier that detects anglicisms and other English inclusions in text written in other languages, namely
