PhD thesis - School of Informatics - University of Edinburgh


Chapter 3. Tracking English Inclusions in German

[Figure 3.4: Learning curve of a supervised ML classifier versus the performance of the annotation-free English inclusion classifier. Axes: F-score against amount of training data (in tokens). Series: statistical tagger, English inclusion classifier.]

sion classifier does not rely on annotated data, it can be tested and evaluated once for this entire corpus. It yields an overall F-score of 85.43 (see Figure 3.4).

In order to determine the machine learner's performance over the entire data set, and at the same time investigate the effect of the quantity of annotated training data available, a 10-fold cross-validation test was conducted in which increasingly large sub-sets of the training data are provided when testing on each held-out fold. First, the pooled data is randomised and split into a training set comprising 90% of the data and a test set comprising the remaining 10%. This randomisation and split is done on the document level, i.e. the training set contains 131 newspaper articles and the test set 14. The training sub-sets are also increased on the document level, in batches of 6 newspaper articles at each step. The increasingly large sub-sets of the training data are then used to train the classifier, which is subsequently evaluated on the test set. This procedure is repeated for each of the 10 held-out folds and the scores are averaged.

Each point in the resulting learning curve presented in Figure 3.4 shows the average F-score of the ML classifier when trained on the selected sub-set of articles and evaluated on the held-out set. Average F-scores are plotted against the average number of tokens in the training data at each step in order to give a better representation of the amount of labelled data involved at each step.

The learning curve presented in Figure 3.4 shows that the performance of the ML classifier improves considerably as the amount of training data is increased. The graph shows a rapid growth in F-score which tails off as more data is added. Provided with 100% of the labelled training documents (amounting to approximately 86,500 tokens), the ML classifier reaches an F-score of 84.59. The graph shows that the English inclusion classifier has a real advantage over the supervised ML approach, which relies on expensive hand-annotated data. A large training set of 86,500 tokens is required to achieve a performance that approximates that of the annotation-free English inclusion classifier. Moreover, the latter system has been shown to perform similarly well on unseen texts in different domains (see Section 3.6.2).

3.7 Chapter Summary

This chapter first described a German newspaper corpus made up of articles on three different topics: internet & telecoms, space travel and European Union. The corpus was annotated in parallel by two different annotators for English inclusions and used for a large number of experiments aimed at English inclusion detection. The corpus analysis showed that, in specific domains, up to 6.4% of the tokens of German newspaper text can be made up of English inclusions. The inter-annotator agreement on identifying English inclusions is very high for a number of metrics, signalling almost perfect agreement and the fact that English inclusion detection is a highly manageable task for humans to carry out.

Subsequently, this chapter presented an annotation-free classifier that exploits linguistic knowledge resources, including lexicons and the World Wide Web, to detect English inclusions in German text in different domains. Its main advantage is that no annotated, and for that matter unannotated, training data is required. The English inclusion classifier can be successfully applied to new text and domains with little computational cost and extended to new languages as long as lexical resources are available. In the following chapter, the time and effort involved in extending the system to a new language will be examined. In this chapter, the classifier was examined as a whole and in terms of its individual components both on seen and unseen parts of
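
The learning-curve evaluation described before the chapter summary (document-level randomisation, 10 folds, training sub-sets grown in batches of 6 articles, scores averaged per step) can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `learning_curve`, the callback `train_and_score` (standing in for training the statistical tagger and computing its F-score), the interleaved fold assignment and the fixed random seed are all assumptions made for illustration. Each document is assumed to be a list of tokens.

```python
import random
from statistics import mean


def learning_curve(docs, train_and_score, n_folds=10, batch=6, seed=0):
    """Return [(avg_training_tokens, avg_score), ...], one pair per step.

    docs            -- list of documents, each a list of tokens (assumed format)
    train_and_score -- hypothetical callback: (train_docs, test_docs) -> score
    """
    docs = list(docs)
    random.Random(seed).shuffle(docs)  # randomise on the document level
    # Assumption: folds are formed by interleaving the shuffled documents.
    folds = [docs[i::n_folds] for i in range(n_folds)]

    per_step = {}  # step index -> list of (token count, score) over folds
    for i, test_docs in enumerate(folds):
        train_docs = [d for j, fold in enumerate(folds) if j != i for d in fold]
        # Grow the training sub-set by `batch` documents per step,
        # finishing with the full training set.
        sizes = list(range(batch, len(train_docs), batch)) + [len(train_docs)]
        for step, size in enumerate(sizes):
            subset = train_docs[:size]
            n_tokens = sum(len(d) for d in subset)
            score = train_and_score(subset, test_docs)
            per_step.setdefault(step, []).append((n_tokens, score))

    # Average token counts and scores over the folds at each step.
    return [
        (mean(t for t, _ in points), mean(s for _, s in points))
        for _, points in sorted(per_step.items())
    ]
```

Plotting the second element of each pair against the first would give a curve of the kind described for Figure 3.4. Averaging by step index rather than by exact sub-set size is a design choice of this sketch, so that folds whose training sets differ by one document still contribute to the same point.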