PhD thesis - School of Informatics - University of Edinburgh


Chapter 4. System Extension to a New Language

tools and therefore demanded more time and effort to be customised for French. The core system was adapted in approximately one person-week in total (Section 4.1). Figure 4.2 illustrates the system architecture after extending it to French. Note that the system now involves an additional document-based language identification step after pre-processing, in which the base language of the document is determined by TextCat (Cavnar and Trenkle, 1994). TextCat, a traditional language identification tool, performs well at identifying the language of sentences and larger passages. This enables running the English inclusion classifier over either German or French text without having to specify the base language of the text manually. The base-language-specific classifier components are therefore initiated purely on the basis of TextCat's language identification. For both the German and French newspaper articles, TextCat is always able to identify the language correctly.

4.3.1 Pre-processing Module

The pre-processing module involves tokenisation and POS tagging (cf. Section 3.3.2). First, the German tokeniser was adapted to French and a French part-of-speech (POS) tagger was integrated into the system. The French tokeniser consists of two rule-based tokenisation grammars. In the same way as the German version, it not only identifies tokens surrounded by white space and punctuation but also resolves typical abbreviations, numerals and URLs. Both grammars are applied by means of upgraded versions of the XML tools described in Thompson et al. (1997) and Grover et al. (2000). These tools process an XML input stream and rewrite it on the basis of the rules provided.

The French TreeTagger (see Section 3.5.1.2) is used for POS tagging. It is freely available for research and is also trained for a number of other languages, including German and English.
The TreeTagger functions on the basis of binary decision trees trained on a French corpus of 35,448 word forms and yields a tagging accuracy of over 94% on an evaluation data set comprising 10,000 word forms (Stein and Schmidt, 1995).[2]

[2] While trained models are available online, the tagged data set that was used to train and evaluate the French TreeTagger model is not part of the distribution. Otherwise, the data could have been used to train TnT, as that tagger resulted in better performance of the English inclusion classifier on the German development data (see Section 3.5.1).
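The tokenisation behaviour described in Section 4.3.1 (tokens delimited by white space and punctuation, with abbreviations, numerals and URLs kept whole) can be illustrated with a minimal sketch. This is not the thesis implementation, which applies two rule-based grammars through XML tools; a single Python regular expression is swapped in here for brevity. The abbreviation list, the pattern details and the function name `tokenise` are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list; the thesis does not enumerate its rules.
ABBREVIATIONS = ["M.", "Mme.", "etc.", "cf.", "p.ex."]

# Longest abbreviations first so that a longer entry is never shadowed.
abbrev_alternatives = "|".join(
    re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)
)

# Alternatives are tried in order at each position: URLs, abbreviations,
# numerals, words, and finally any single punctuation character.
TOKEN_RE = re.compile("|".join([
    r'(?:https?://|www\.)[^\s<>"]*[^\s<>".,;:!?)]',  # URL without trailing punctuation
    abbrev_alternatives,                              # abbreviation keeps its full stop
    r"\d+(?:[.,]\d+)*",                               # numerals such as 35,448 or 1,5
    r"\w+(?:-\w+)*",                                  # words, including hyphenated ones
    r"[^\w\s]",                                       # one punctuation mark per token
]))


def tokenise(text):
    """Split text into tokens; white space between tokens is discarded."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    sentence = "M. Dupont a lu 1,5 million de pages sur http://www.example.org/fr, etc."
    print(tokenise(sentence))
```

Ordering the alternatives from most to least specific is the design choice that matters here: without it, the generic word and punctuation rules would split "1,5" or the URL into several tokens.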
Figure 4.2: Extended system architecture. (Component labels in the figure: WWW, HTML, TIDY, BASE LANGUAGE LID, TOKENISER, POS TAGGER, LEXICON, LANGUAGE CLASSIFICATION, CONSISTENCY CHECK, POST-PROCESSING.)
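The document-level language identification step in the architecture relies on TextCat, which follows the character n-gram approach of Cavnar and Trenkle (1994). The sketch below re-implements the general idea (ranked n-gram profiles compared with an "out-of-place" distance) so that the dispatch to base-language-specific components is easier to picture. It does not call TextCat itself. The two training sentences, the profile size, the component labels and all function names are assumptions for illustration; real profiles are built from far larger samples.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Map the top_k most frequent character n-grams (n = 1..max_n) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so ranking is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ranked[:top_k])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, profiles):
    """Return the language whose profile is closest to the document's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


# Toy training samples (assumptions, far too small for real use).
GERMAN_SAMPLE = ("Die Zeitung berichtet heute über das neue Programm der "
                 "Regierung und die Folgen für die Wirtschaft")
FRENCH_SAMPLE = ("Le journal annonce aujourd'hui le nouveau programme du "
                 "gouvernement et les conséquences pour l'économie")

PROFILES = {
    "de": ngram_profile(GERMAN_SAMPLE),
    "fr": ngram_profile(FRENCH_SAMPLE),
}

# Hypothetical labels standing in for the base-language-specific components.
COMPONENTS = {"de": "german-components", "fr": "french-components"}


def select_components(document):
    """Pick the component set from the identified base language alone."""
    return COMPONENTS[identify(document, PROFILES)]


if __name__ == "__main__":
    print(select_components(FRENCH_SAMPLE))
```

Because the component set is chosen only from the identifier's output, no base language has to be supplied by hand, which mirrors the behaviour described for the extended system.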
