PhD thesis - School of Informatics - University of Edinburgh

Chapter 3. Tracking English Inclusions in German

Language preference   German                         English
Counts                Actual (f)   Normalised (rf)   Actual (f)   Normalised (rf)
Anbieter              62.0 M       0.00116463        0.333 M      0.00000626
Provider              11.2 M       0.00001753        168.0 M      0.00026289

Table 3.6: Actual and normalised frequencies of the search engine module for one German and one English example.

In the unlikely event that both searches return zero hits, the token is classified as the base language, in this case German, by default. In the initial experiment, this happened only for two tokens: Orientierungsmotoren (navigation engines) and Reserveammoniak (spare ammonia). Word queries that return zero or a low number of hits can also be indicative of new expressions that have entered a language.

The search engine module lookup is carried out only for the sub-group of tokens not found in either lexicon in the preceding module in order to keep the computational cost to a minimum. This decision is also supported by the evaluation of the lexicon module (Section 3.4.2.1) which shows that it performs sufficiently accurately on tokens contained exclusively in the German or English lexicons. Besides, current search options granted by search engines are limited in that it is impossible to treat queries case- or POS-sensitively. Therefore, tokens found in both lexical databases would often be wrongly classified as English, particularly those that are frequently used (e.g. Rat). The evaluation results specific to the search engine module as a separate component are presented in Section 3.4.2.1.

3.3.5 Post-processing Module

The final system component is a post-processing module that resolves several language classification ambiguities and classifies some single-character tokens. The post-processing rules are derived following extensive error analysis on the core English inclusion classifier output of the German development data.
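Looking back at the search engine module described above, its decision step can be illustrated with a minimal sketch. It assumes that the normalised frequencies (rf) for both languages have already been computed and simply compares them, falling back to the base language when both searches return zero hits. The function name, the parameter names and the handling of ties (resolved in favour of the base language) are assumptions made for illustration and are not taken from the thesis.

```python
def classify_by_hits(rf_base: float, rf_english: float,
                     base_language: str = "German") -> str:
    """Pick a language from two normalised search engine frequencies.

    Hypothetical helper: names and tie handling are illustrative assumptions.
    """
    # Both searches returned zero hits: default to the base language.
    if rf_base == 0 and rf_english == 0:
        return base_language
    # Otherwise choose the language with the higher normalised frequency.
    # Assumption: a tie is resolved in favour of the base language.
    return "English" if rf_english > rf_base else base_language


# Normalised frequencies taken from Table 3.6.
anbieter = classify_by_hits(rf_base=0.00116463, rf_english=0.00000626)
provider = classify_by_hits(rf_base=0.00001753, rf_english=0.00026289)
```

With the values of Table 3.6, Anbieter receives the German label and Provider the English one, since in each case one normalised frequency clearly exceeds the other.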
In the remainder of the thesis, the English inclusion classifier without post-processing is referred to as the core system and with post-processing (and optional document consistency checking) as the full system. The different types of post-processing rules implemented in this module involve resolving the language classification of ambiguous words, single-letter tokens, currencies and units of measurement, function words, abbreviations and person names. Each type of post-processing is listed in Table 3.7 with an example and explained in more detail in the following. Individual contributions of each type are presented in Section 3.4.2.3. Most of the rules lead to improvements in performance for all three domains and none of them degrades the scores.

Post-processing type    Example
Ambiguous words         Space Station Crew
Single letters          E-mail
Currencies & Units      Euro
Function words          Friends of the Earth
Abbreviations           Europäische Union (EU)
Person names            Präsident Bush

Table 3.7: Different types of post-processing rules.

As only the token and its POS tag but not its surrounding context are considered in the lexicon module classification, it is difficult to identify the language of interlingual homographs, i.e. tokens with the same spelling in both languages (e.g. Station). Therefore, the majority of post-processing rules are designed to disambiguate such instances. For example, if a language-ambiguous token is preceded and followed by an English token, then it is also likely to be of English origin (e.g. Space Station Crew versus macht Station auf Sizilien). The post-processing module applies rules that disambiguate such interlingual homographs based on their POS tag and contextual information.

Moreover, the module contains rules designed to flag single-character tokens correctly. These occur because the tokeniser is set up to split hyphenated compounds like E-mail into three separate tokens (Section 3.3.2). The core system identifies the language of tokens with a length of more than one character and therefore only recognises mail as English in this example. The post-processing rule flags E as English as well.
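The two rule types just described can be sketched as follows. This is an illustrative simplification, not the module's actual rule set: the label names, the token representation and the exact form of the single-letter rule (a single letter followed by a hyphen and an English token) are assumptions, and the sketch ignores the POS tags that the real rules also consult.

```python
from typing import List, Tuple

# Hypothetical label names; AMBIG marks a token found in both lexicons.
EN, DE, AMBIG = "EN", "DE", "AMBIG"

Token = Tuple[str, str]  # (surface form, language label from the core system)


def post_process(tokens: List[Token]) -> List[Token]:
    """Apply two simplified post-processing rules to core system output."""
    labels = [label for _, label in tokens]
    out = list(labels)
    for i, (form, label) in enumerate(tokens):
        # Rule from the text: an ambiguous token preceded and followed
        # by an English token is itself treated as English.
        if label == AMBIG and 0 < i < len(tokens) - 1:
            if labels[i - 1] == EN and labels[i + 1] == EN:
                out[i] = EN
        # Assumed form of the single-letter rule: a single letter followed
        # by a hyphen and an English token is flagged as English.
        if len(form) == 1 and form.isalpha() and i + 2 < len(tokens):
            if tokens[i + 1][0] == "-" and labels[i + 2] == EN:
                out[i] = EN
    # Assumption: tokens that remain ambiguous default to the base language.
    return [(form, DE if label == AMBIG else label)
            for (form, _), label in zip(tokens, out)]


english_context = post_process([("Space", EN), ("Station", AMBIG), ("Crew", EN)])
german_context = post_process(
    [("macht", DE), ("Station", AMBIG), ("auf", DE), ("Sizilien", DE)])
email = post_process([("E", DE), ("-", DE), ("mail", EN)])
```

In this sketch, Station is labelled English in Space Station Crew but German in macht Station auf Sizilien, and the E of E-mail is flagged as English alongside mail. Single-character tokens that the core system leaves unclassified simply carry the base-language label here.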
Several additional rules deal with names of currencies and units of measurement and prevent them from being mistaken for English inclusions. Furthermore, some rules were designed to classify English function words as English. As the core system classifies each token individually, a further post-processing