PhD thesis - School of Informatics - University of Edinburgh
Chapter 5. Parsing English Inclusions<br />
cessing inclusions. Most of the time, they are unknown words and, as they originate<br />
from another language, standard methods for unknown word guessing (suffix stripping,<br />
etc.) are unlikely to be successful. Furthermore, the fact that inclusions are often<br />
multi-word expressions (e.g., named entities or code-switches) means that simply<br />
part-of-speech (POS) tagging them accurately is not sufficient: the parser positing a phrase<br />
boundary within an inclusion is likely to severely decrease accuracy.<br />
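The point about phrase boundaries can be made concrete with a minimal sketch. It is an illustration only, not the evaluation procedure used in this chapter: it assumes token spans are half-open index pairs, that only phrasal constituents (not single-token preterminals) are passed in, and the names `splits_inclusion` and `broken_inclusions` are hypothetical.

```python
from typing import List, Tuple

# Half-open token span [start, end); an assumption made for this sketch.
Span = Tuple[int, int]


def splits_inclusion(constituent: Span, inclusion: Span) -> bool:
    """Return True if an edge of the constituent falls strictly inside the inclusion."""
    c_start, c_end = constituent
    i_start, i_end = inclusion
    # A phrase boundary lies within the inclusion when either constituent
    # edge sits strictly between the inclusion's first and last boundary.
    return i_start < c_start < i_end or i_start < c_end < i_end


def broken_inclusions(constituents: List[Span], inclusions: List[Span]) -> List[Span]:
    """List the inclusions that at least one phrasal constituent cuts apart."""
    return [
        inc for inc in inclusions
        if any(splits_inclusion(c, inc) for c in constituents)
    ]


# Hypothetical example: tokens 2..4 ("Rock and Roll") form one English inclusion.
tokens = ["Er", "liebt", "Rock", "and", "Roll"]
inclusion_spans = [(2, 5)]

# A parse that keeps the inclusion inside a single phrase.
intact_parse = [(0, 5), (1, 5), (2, 5)]
# A parse that places a phrase boundary after "Rock".
split_parse = [(0, 5), (1, 3), (3, 5)]

print(broken_inclusions(intact_parse, inclusion_spans))
print(broken_inclusions(split_parse, inclusion_spans))
```

In the second parse, the constituent covering "liebt Rock" ends inside the inclusion, which is the kind of boundary the paragraph above describes; correct POS tags on the individual tokens would not prevent it.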
After a brief summary of related work in Section 5.1, this chapter describes<br />
an extrinsic evaluation of this classifier for parsing. It is shown that recognising and<br />
dealing with English inclusions via a special annotation label improves the accuracy<br />
of parsing. In particular, this chapter demonstrates that detecting English inclusions<br />
in German text improves the performance of two German parsers, a treebank-induced<br />
parser as well as a parser based on a hand-crafted grammar (Sections 5.3 and 5.4).<br />
Crucially, the former parser requires modifications of its underlying grammar to deal with<br />
the inclusions, whereas the latter’s grammar is already designed to deal with multi-word<br />
expressions signalled in the input. Both parsers and the necessary modifications are described in<br />
detail in Sections 5.3.1 and 5.4.1. The data used for all the parsing experiments is<br />
described in Section 5.2.<br />
5.1 Related Work<br />
Previous work on inclusion detection exists in the TTS literature (Pfister and<br />
Romsdorfer, 2003; Farrugia, 2005; Marcadet et al., 2005), which is reviewed in detail in<br />
Sections 2.2.1.1 and 2.2.1.2. There, the aim is to design a system that recognises<br />
foreign inclusions on the word and sentence level and functions as the front-end to a<br />
polyglot TTS synthesiser. Similar initial efforts have been undertaken in the field of<br />
lexicography, where the importance of recognising anglicisms from the perspective of<br />
lexicographers responsible for updating lexicons and dictionaries has been<br />
acknowledged (Andersen, 2005) (see also Section 2.2.1.4). In the context of parsing, however,<br />
there has been little focus on this issue. Although Forst and Kaplan (2006) have stated<br />
the need for dealing with foreign inclusions in parsing as they are detrimental to a<br />
parser’s performance, they do not substantiate this claim using numeric results.<br />
Previous work reported in this thesis has focused on devising a classifier that<br />
detects anglicisms and other English inclusions in text written in other languages, namely