PhD thesis - School of Informatics - University of Edinburgh

More documents

Recommendations

Info

Chapter 4. System Extension to a New Language 107 4.3.2 Lexicon Module The lexicon module (cf. Section 3.3.3) performs an initial language classification run based on a case-insensitive double lookup procedure using two lexicons: one for the base-language and one for the language of the inclusions. For French, the system queries Lexique, a lexical database which contains 128,919 word forms representing 54,158 lemmas (New et al., 2004). It is derived from 487 texts (31 million words) published between 1950 and 2000. In order to detect common English inclusions, the system searches the English database of CELEX holding 51,728 lemmas and their 365,530 corresponding word forms (Celex, 1993) . The lexicon module was adapted to French by exploiting distinctive characteristics of French orthography. For example, words containing diacritic characters typical for French are automatically excluded from being considered as English inclusions. 3 4.3.3 Search Engine Module Tokens which are not clearly identified by the lexicon module as English are passed to the back-off search engine module. As explained in Section 3.3.4, the search engine module performs language classification based on the maximum normalised score of the number of hits returned for two searches per token, one for each language L: r f Cweb(L)(t) (4.1) Extending the search engine module to French merely involved adjusting the language preferences in the search engine API and incorporating the relative frequencies of representative French tokens in a standard French corpus for estimating the size of the French Web corpus (Grefenstette and Nioche, 2000). The search engine Yahoo was used instead of Google as it outperformed the latter in a task-based evaluation for German (Section 3.5.2) and also allows a larger number of automatic queries per day. 3 Note that English also contains words with diacritic characters, e.g. née or naïve. However, they tend to be loan words from French or other languages. When appearing in French text, they would be either part of a French word or a word from a language other than English.
Chapter 4. System Extension to a New Language 108 4.3.4 Post-processing Module The final system component which required adapting to French is the post-processing module. It is designed to resolve language classification ambiguities and classify some single-character tokens. As for German, some time was invested in analysing the core system output of the French development data in order to generate these post- processing rules. The individual contribution of each of the following rules on the system performance on the French development data is discussed in Section 4.4. The most general rules are designed to disambiguate one-character tokens and in- terlingual homographs. They are flagged as English if followed by a hyphen and an English token (e-mail or joint-venture). Furthermore, typical English function words are flagged as English, including prepositions, pronouns, conjunctions and articles, as these belong to a closed class and are easily recognisable. This also avoids having to extend the core system to these categories which not only prevents some output errors but also improves the performance of the POS tagger which is often unable to process foreign function words correctly. In the post-processing step, their POS tags are therefore corrected. Any words in the closed class of English function words that are ambiguous with respect to their language such as an (in French year) or but (in French goal) are only flagged as English inclusions if their surrounding context is al- ready classified as English by the system. Similarly, the possessive marker ’s is flagged as English if it is preceded by an English token. Moreover, several rules are designed to automatically deal with names of currencies (e.g. Euro) and units of measurement (e.g. Km). Such instances are prevented from being identified as English inclusions. Similarly to the German post-processing module, abbreviation extraction (Schwartz and Hearst, 2003) is applied in order to relate language information between abbrevi- ations or acronyms and their definitions. Subsequently, post-processing is applied to guarantee that each pair and earlier plus later mentions of either the definition or the abbreviation/acronym are assigned the same language tag within a document. When analysing the errors which the system made in the development data, it was also observed that foreign person names (e.g. Graham Cluley) are frequently identified as English inclusions. In the experiments described in this thesis, the system is evaluated on identifying English inclusions. These are defined as any English words in the text except for person and location names. Therefore, the evaluation data does
Page 1 and 2:
Automatic Detection of English Incl
Page 3 and 4:
these parsers with the annotation-f
Page 5 and 6:
Declaration I declare that this the
Page 7 and 8:
3.3.5 Post-processing Module . . .
Page 9 and 10:
A.2.2 Kappa Coefficient . . . . . .
Page 11 and 12:
5.6 Average relative token frequenc
Page 13 and 14:
3.16 Most frequent English inclusio
Page 15 and 16:
Chapter 1. Introduction 2 siderable
Page 17 and 18:
Chapter 1. Introduction 4 Chapter 3
Page 19 and 20:
Chapter 1. Introduction 6 1.1 Relat
Page 21 and 22:
Chapter 2. Background and Theory 8
Page 23 and 24:
Page 25 and 26:
Page 27 and 28:
Page 29 and 30:
Page 31 and 32:
Page 33 and 34:
Page 35 and 36:
Page 37 and 38:
Page 39 and 40:
Page 41 and 42:
Page 43 and 44:
Page 45 and 46:
Page 47 and 48:
Page 49 and 50:
Page 51 and 52:
Page 53 and 54:
Page 55 and 56:
Page 57 and 58:
Page 59 and 60:
Chapter 3 Tracking English Inclusio
Page 61 and 62:
Chapter 3. Tracking English Inclusi
Page 63 and 64:
Page 65 and 66:
Page 67 and 68:
Page 69 and 70: Chapter 3. Tracking English Inclusi
Page 113 and 114: Chapter 4 System Extension to a New
Page 115 and 116: Chapter 4. System Extension to a Ne
Page 119: Chapter 4. System Extension to a Ne
Page 129 and 130: Chapter 5 Parsing English Inclusion
Page 131 and 132: Chapter 5. Parsing English Inclusio
Page 159 and 160: Chapter 6 Other Potential Applicati
Page 161 and 162: Chapter 6. Other Potential Applicat
Page 171 and 172:
Chapter 6. Other Potential Applicat
Page 173 and 174:
Page 175 and 176:
Page 177 and 178:
Page 179 and 180:
Page 181 and 182:
Page 183 and 184:
Page 185 and 186:
Page 187 and 188:
Chapter 7 Conclusions and Future Wo
Page 189 and 190:
Chapter 7. Conclusions and Future W
Page 191 and 192:
Appendix A. Evaluation Metrics and
Page 193 and 194:
Page 195 and 196:
Page 197 and 198:
Page 199 and 200:
Appendix B. Guidelines for Annotati
Page 201 and 202:
Page 203 and 204:
Page 205 and 206:
Appendix C TIGER Tags and Labels C.
Page 207 and 208:
Appendix C. TIGER Tags and Labels 1
Page 209 and 210:
Appendix C. TIGER Tags and Labels 1
Page 211 and 212:
Bibliography 198 Andersen, G. (2005
Page 213 and 214:
Bibliography 200 Bresnan, J. (2001)
Page 215 and 216:
Bibliography 202 Damashek, M. (1995
Page 217 and 218:
Bibliography 204 Finkel, J., Dingar
Page 219 and 220:
Bibliography 206 Hachey, B., Alex,
Page 221 and 222:
Bibliography 208 Kirkness, A. (1984
Page 223 and 224:
Bibliography 210 and Technology (In
Page 225 and 226:
Bibliography 212 Poplack, S. (1988)
Page 227 and 228:
Bibliography 214 Sokol, D. K. (2000
Page 229:
Bibliography 216 Yang, W. (1990). A
show all

PhD thesis - School of Informatics - University of Edinburgh

Create successful ePaper yourself

Delete template?

Save as template?