PhD thesis - School of Informatics - University of Edinburgh

Chapter 1. Introduction

reviewed in Chapter 2). None of their algorithms has been extensively evaluated on unseen data; the evaluations instead used fixed data sets that were also used during the algorithm design phase. Some of them rely on continued human input as the language evolves, either to annotate new training data or to generate new language rules. It is therefore unclear how these methods perform on unseen data from a new domain or how much effort is required to extend them to a new language. This diminishes the benefit of these algorithms, particularly in NLP applications, given that continued human interaction is necessary to ensure accurate processing.

This thesis examines the hypothesis that it is possible to create a self-evolving system that automatically detects English inclusions in other languages with minimal linguistic expert knowledge and little ongoing maintenance. It proposes a solution combining computationally inexpensive lexicon lookup and dynamic web-search procedures that will verify and optimise its output using post-processing and consistency checking. This novel approach to English inclusion detection will then be extensively evaluated on various data sets, including unseen data in a number of domains and in two different languages.

The thesis also presents extrinsic evaluation experiments to test the usefulness of English inclusion detection for parsing. It will show that by providing knowledge about automatically detected English multi-word inclusions in German to both a treebank-induced and a hand-crafted grammar parser, performance or coverage can be improved significantly. Successful demonstration of such an English inclusion classifier solves a significant problem faced by the NLP community in ensuring accurate and reliable output given the growing challenge of language mixing in an Internet-connected world.

This thesis consists of six chapters, each of which examines distinct aspects of this work. They are outlined in the following paragraphs:

Chapter 2: Background and Theory presents the linguistic background and theoretical knowledge that lie behind this thesis. It first introduces the linguistic phenomenon of language mixing due to the increasing influence of English on other languages, proceeding to provide an overview of different types and frequencies of English inclusions in German, French and a few other languages. The historical background and attitudes towards the influx of anglicisms are also discussed. The chapter then reviews related work on automatic language identification and discusses four alternative approaches to mixed-lingual text analysis.
Chapter 3: Tracking English Inclusions in German describes an English inclusion classifier developed for mixed-lingual input text with German as the base language. It focuses initially on evaluation data preparation and annotation issues, subsequently providing a complete system description. The chapter also presents an evaluation of the English inclusion classifier and its components, as well as its performance on two unseen datasets. The results show that the classifier performs well on new data in different domains and compares well to another state-of-the-art mixed-lingual language identification approach. The penultimate section describes and discusses parameter tuning experiments conducted to determine the optimal settings for the classifier. Finally, the English inclusion classifier is compared to a supervised machine learner.

Chapter 4: System Extension to a New Language describes the adaptation of the classifier to process French text containing English inclusions. The aim of this chapter is to illustrate the ease with which the system can be adapted to deal with a new base language. The chapter first describes data preparation and then explains the work involved in extending various system modules. Finally, a detailed evaluation on unseen test data and a comparison of the classifier’s performance across languages are presented and discussed. The results show that the English inclusion classifier not only performs well on new data in different domains but also successfully fulfils its purpose in different language scenarios.

Chapter 5: Parsing English Inclusions concentrates on applying the techniques developed in the previous two chapters to a real-world task. This chapter presents a series of experiments on English inclusions and a set of random test suites using a treebank-induced and a hand-crafted rule-based German grammar parser. The aim here is to investigate the difficulty that state-of-the-art parsers have with sentences containing foreign inclusions, thereby determining the reasons for inaccuracy by means of error analysis and identifying appropriate ways of improving parsing performance. The ultimate goal of this chapter is to highlight the oft-forgotten issue of English inclusions to researchers in the parsing community and motivate them to identify ways of dealing with inclusions by demonstrating the potential gains in parsing quality.
