PhD thesis - School of Informatics - University of Edinburgh

More documents

Recommendations

Info

Chapter 4. System Extension to a New Language 103 the domain of internet and telecoms (IT). All French articles were manually annotated for English inclusions using an annotation tool based on NXT (Carletta et al., 2003). As with the experiments on German, the data is split into a development set and a test set of approximately 16,000 tokens each. While both sets are of similar domain, date of publishing and content, they do not overlap. The development set served both for the purpose of performance testing of individual system modules for French and error analysis. The test set, however, was treated as unseen and only employed for evaluation in the final run. Data Development set Test set Domain: IT Tokens % Types % TTR Tokens % Types % TTR French Total 16188 3233 0.20 16125 3437 0.21 English 986 6.1 339 10.5 0.34 1089 6.8 367 10.7 0.34 German Total 15919 4152 0.26 16219 4404 0.27 English 963 6.0 283 6.8 0.29 1034 6.4 258 5.9 0.25 Table 4.1: Corpus statistics including type-token-ratios (TTRs) of the French development and test sets compared to the German data. Table 4.1 lists the total number of tokens and types plus the number of English inclusions in the French development and test sets. The corresponding statistics of the German data are reproduced to facilitate a comparison. The French data sets have similar characteristics, particularly regarding their type-token-ratios (TTRs) for each entire set (0.20 versus 0.21) and for the English inclusions alone (0.34 each). The French test set contains slightly more English inclusions (+0.7%) than the development set. Comparing these figures with those of the previously annotated German IT data sets shows that the proportion of English tokens in this domain is extremely similarly at approximately 6%. However, the percentage of English types varies to some extent both for the development and test sets. They only amount to 6.8% and 5.9% in the German data, compared to 10.5% and 10.7% in the French data. Moreover, the TTRs of English inclusions are 0.5 and 0.9 points higher in the French data sets, signalling
Chapter 4. System Extension to a New Language 104 that they are less repetitive than those contained in the German articles. However, overall TTRs are 0.6 points lower for French than for German which means that the remaining vocabulary in the French articles is somewhat less heterogeneous than in the German data. This is due to the nature of the two languages. In German, lexical variety is higher due to compounding and case inflection for articles, adjectives and nouns. French internet data Token f Token f e 49 web 21 Google 44 spam 18 internet 35 mail 18 Microsoft 33 Firefox 16 mails 32 Windows 14 Table 4.2: Ten most frequent (f) English inclusions in the French data. The ten most frequent English inclusions found in the French data set are listed in Table 4.2. The majority of them are either common or proper nouns. The most frequent English token is actually the single-character token e as occurring in e-mail. Less frequent pseudo-anglicisms like le forcing in the sense of putting pressure on someone or le parking referring to a car park are also contained in the French data. Although such lexical items are either non-existent or scarcely used by native English speakers, they are also annotated as English inclusions. 4.3 System Module Conversion to French The extended system architecture of the English inclusion classifier consists of sev- eral pre-processing steps, followed by a lexicon module, a search engine module and a post-processing module. Converting the search engine module to a new language required little computational cost and time. Conversely, the pre- and post-processing as well as the lexicon modules necessitated some language knowledge resources or
Page 1 and 2:
Automatic Detection of English Incl
Page 3 and 4:
these parsers with the annotation-f
Page 5 and 6:
Declaration I declare that this the
Page 7 and 8:
3.3.5 Post-processing Module . . .
Page 9 and 10:
A.2.2 Kappa Coefficient . . . . . .
Page 11 and 12:
5.6 Average relative token frequenc
Page 13 and 14:
3.16 Most frequent English inclusio
Page 15 and 16:
Chapter 1. Introduction 2 siderable
Page 17 and 18:
Chapter 1. Introduction 4 Chapter 3
Page 19 and 20:
Chapter 1. Introduction 6 1.1 Relat
Page 21 and 22:
Chapter 2. Background and Theory 8
Page 23 and 24:
Page 25 and 26:
Page 27 and 28:
Page 29 and 30:
Page 31 and 32:
Page 33 and 34:
Page 35 and 36:
Page 37 and 38:
Page 39 and 40:
Page 41 and 42:
Page 43 and 44:
Page 45 and 46:
Page 47 and 48:
Page 49 and 50:
Page 51 and 52:
Page 53 and 54:
Page 55 and 56:
Page 57 and 58:
Page 59 and 60:
Chapter 3 Tracking English Inclusio
Page 61 and 62:
Chapter 3. Tracking English Inclusi
Page 63 and 64:
Chapter 3. Tracking English Inclusi
Page 65 and 66: Chapter 3. Tracking English Inclusi
Page 113 and 114: Chapter 4 System Extension to a New
Page 115: Chapter 4. System Extension to a Ne
Page 119 and 120: Chapter 4. System Extension to a Ne
Page 129 and 130: Chapter 5 Parsing English Inclusion
Page 131 and 132: Chapter 5. Parsing English Inclusio
Page 159 and 160: Chapter 6 Other Potential Applicati
Page 161 and 162: Chapter 6. Other Potential Applicat
Page 167 and 168:
Chapter 6. Other Potential Applicat
Page 169 and 170:
Page 171 and 172:
Page 173 and 174:
Page 175 and 176:
Page 177 and 178:
Page 179 and 180:
Page 181 and 182:
Page 183 and 184:
Page 185 and 186:
Page 187 and 188:
Chapter 7 Conclusions and Future Wo
Page 189 and 190:
Chapter 7. Conclusions and Future W
Page 191 and 192:
Appendix A. Evaluation Metrics and
Page 193 and 194:
Page 195 and 196:
Page 197 and 198:
Page 199 and 200:
Appendix B. Guidelines for Annotati
Page 201 and 202:
Page 203 and 204:
Page 205 and 206:
Appendix C TIGER Tags and Labels C.
Page 207 and 208:
Appendix C. TIGER Tags and Labels 1
Page 209 and 210:
Appendix C. TIGER Tags and Labels 1
Page 211 and 212:
Bibliography 198 Andersen, G. (2005
Page 213 and 214:
Bibliography 200 Bresnan, J. (2001)
Page 215 and 216:
Bibliography 202 Damashek, M. (1995
Page 217 and 218:
Bibliography 204 Finkel, J., Dingar
Page 219 and 220:
Bibliography 206 Hachey, B., Alex,
Page 221 and 222:
Bibliography 208 Kirkness, A. (1984
Page 223 and 224:
Bibliography 210 and Technology (In
Page 225 and 226:
Bibliography 212 Poplack, S. (1988)
Page 227 and 228:
Bibliography 214 Sokol, D. K. (2000
Page 229:
Bibliography 216 Yang, W. (1990). A
show all

PhD thesis - School of Informatics - University of Edinburgh

Create successful ePaper yourself

Delete template?

Save as template?