PhD thesis - School of Informatics - University of Edinburgh

Chapter 5. Parsing English Inclusions

[...] processing inclusions. Most of the time, they are unknown words and, as they originate from another language, standard methods for unknown word guessing (suffix stripping, etc.) are unlikely to be successful. Furthermore, the fact that inclusions are often multi-word expressions (e.g., named entities or code-switches) means that simply part-of-speech (POS) tagging them accurately is not sufficient: a parser that posits a phrase boundary within an inclusion is likely to suffer a severe drop in accuracy.

After a brief summary of related work in Section 5.1, this chapter describes an extrinsic evaluation of the English inclusion classifier for parsing. It is shown that recognising English inclusions and dealing with them via a special annotation label improves parsing accuracy. In particular, this chapter demonstrates that detecting English inclusions in German text improves the performance of two German parsers, a treebank-induced parser and a parser based on a hand-crafted grammar (Sections 5.3 and 5.4). Crucially, the former parser requires modifications of its underlying grammar to deal with the inclusions, whereas the latter’s grammar is already designed to deal with multi-word expressions signalled in the input. Both parsers and the necessary modifications are described in detail in Sections 5.3.1 and 5.4.1. The data used for all the parsing experiments is described in Section 5.2.

5.1 Related Work

Previous work on inclusion detection exists in the text-to-speech (TTS) literature (Pfister and Romsdorfer, 2003; Farrugia, 2005; Marcadet et al., 2005), which is reviewed in detail in Sections 2.2.1.1 and 2.2.1.2. The aim of that work is to design a system that recognises foreign inclusions on the word and sentence level and functions as the front-end to a polyglot TTS synthesiser.
Similar initial efforts have been undertaken in lexicography, where the importance of recognising anglicisms has been acknowledged from the perspective of lexicographers responsible for updating lexicons and dictionaries (Andersen, 2005; see also Section 2.2.1.4). In the context of parsing, however, there has been little focus on this issue. Although Forst and Kaplan (2006) have stated the need for dealing with foreign inclusions in parsing, as they are detrimental to a parser’s performance, they do not substantiate this claim with numeric results.

Previous work reported in this thesis has focused on devising a classifier that detects anglicisms and other English inclusions in text written in other languages, namely German and French. In Chapter 3, it was shown that the frequency of English inclusions varies considerably depending on the domain of a text collection, but that the classifier detects them equally well, with an F-score approaching 85 for each domain.

(S (NP-SB (ART-NK Das) (ADJA-NK schönste) (PN-NK (NE-PNC Road) (NE-PNC Movie))) (VVFIN-HD kam) (PP-MO (APPR-AC aus) (ART-NK der) (NE-NK Schweiz)))

Figure 5.1: Example parse tree of a German TIGER sentence containing an English inclusion. Translation: The nicest road movie came from Switzerland.

5.2 Data Preparation

The experiments described in this chapter involve applying the English inclusion classifier to the TIGER treebank (Brants et al., 2002),² a syntactically annotated corpus consisting of 40,020 sentences of German newspaper text, and evaluating it extrinsically on a standard NLP task, namely parsing. The aim is to investigate the occurrence of English inclusions in general newspaper text and to examine whether the detection of English inclusions can improve parsing performance. The English inclusion classifier was run once over the entire TIGER corpus. In total, the system detected English [...]

² All the following parsing experiments are conducted on TIGER data (Release 1). Some of them contain additional language knowledge output by the English inclusion classifier. The pre-processing module of the classifier always involves POS tagging with the TnT tagger trained on the NEGRA corpus (TnTNEGRA, see Section 3.5.1.1) and not the TIGER corpus.
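To illustrate why multi-word inclusions matter for parsing, the following minimal Python sketch groups adjacent tokens tagged as English into a single span, so that a downstream parser could treat "Road Movie" from Figure 5.1 as one unit. The per-token language tags, the `EN`/`DE` label names and the function `inclusion_spans` are assumptions made for illustration only; they are not the classifier's actual output format.

```python
from itertools import groupby


def inclusion_spans(tokens, langs, label="EN"):
    """Return (start, end, words) for each maximal run of tokens tagged `label`.

    `tokens` and `langs` are parallel lists; `end` is exclusive.
    """
    spans = []
    index = 0
    # groupby yields maximal runs of consecutive tokens sharing one language tag
    for lang, run in groupby(zip(tokens, langs), key=lambda pair: pair[1]):
        words = [word for word, _ in run]
        if lang == label:
            spans.append((index, index + len(words), words))
        index += len(words)
    return spans


# Example sentence from Figure 5.1 with hypothetical language tags
tokens = ["Das", "schönste", "Road", "Movie", "kam", "aus", "der", "Schweiz"]
langs = ["DE", "DE", "EN", "EN", "DE", "DE", "DE", "DE"]
print(inclusion_spans(tokens, langs))
```

Returning token offsets rather than merged strings keeps the original tokenisation intact, which would let the spans be mapped back onto treebank tokens.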