PhD thesis - School of Informatics - University of Edinburgh


Chapter 6. Other Potential Applications

clear strategy for making this decision. He argues that non-adapted anglicisms should be entered into general dictionaries or lexicons if they occur above a certain frequency in a large and balanced corpus. The English inclusion classifier would be a useful tool in this context. It could be run continuously over new documents, allowing lexicographers to identify new loan words, possibly trace them, and determine how the frequency of a given loan word changes over time. The classifier can consequently make lexicographers aware of a language mixing phenomenon that they might otherwise miss during their corpus analysis. In this way, the classifier would allow lexicographers to base their decisions to include a term in the dictionary on empirical evidence and, conversely, the lexicographers’ knowledge could be fed back into the classifier to improve its performance.

6.4 Chapter Summary

This chapter described in detail the usefulness of English inclusion detection for various applications and fields, including TTS, MT, linguistics and lexicography. As with parsing, input to TTS and MT systems is generally assumed to be monolingual, and so far there has been little focus on devising systems that are able to process mixed-lingual input sentences. In our increasingly globalised world, where English is infiltrating many other languages, automatic natural language processing must be able to deal with such language mixing. The English inclusion classifier could be used in a pre-processing stage to signal where language changes occur. Further processing of English inclusions then depends on various synthesis or translation strategies for specific cases.
This chapter reviewed previous work on deriving such strategies and presented some ideas for future work on extrinsically evaluating the benefit of English inclusion detection for both applications.

Regarding the fields of linguistics and lexicography, this chapter summarised the benefits of the English inclusion classifier as a tool for automating synchronic and diachronic language analysis. As such, the classifier could be beneficial to linguists who examine the frequency of certain expressions at a given point in time, or in different domains, and who track language changes over time. Moreover, it could be used to assist lexicographers in their decisions to include specific terms in lexicons or dictionaries.
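The lexicographic use described above (tracking how often a detected inclusion occurs over time and flagging terms that pass a frequency threshold) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `is_inclusion` stands in for the English inclusion classifier, whose actual interface is not specified in this text, and the function names, the per-million normalisation and the threshold value are hypothetical choices rather than the method used in the thesis.

```python
from collections import Counter


def relative_frequencies(docs_by_year, is_inclusion):
    """Return {year: {token: occurrences per million tokens}}.

    docs_by_year maps a year to a list of tokenised documents.
    is_inclusion is a placeholder for the classifier: any callable that
    returns True for tokens judged to be English inclusions.
    """
    result = {}
    for year, docs in docs_by_year.items():
        counts = Counter()
        total = 0
        for tokens in docs:
            total += len(tokens)
            counts.update(t.lower() for t in tokens if is_inclusion(t))
        if total:
            result[year] = {tok: c * 1_000_000 / total for tok, c in counts.items()}
        else:
            result[year] = {}
    return result


def dictionary_candidates(freqs, threshold_per_million):
    """Return tokens that meet the threshold in at least one year, sorted."""
    return sorted(
        {
            tok
            for year_freqs in freqs.values()
            for tok, freq in year_freqs.items()
            if freq >= threshold_per_million
        }
    )


# Toy stand-in for the classifier: a fixed lookup set (an assumption).
toy_lexicon = {"meeting", "download"}
toy_classifier = lambda token: token.lower() in toy_lexicon

corpus = {
    2005: [["Das", "Meeting", "war", "kurz"]],
    2006: [
        ["Das", "Meeting", "und", "der", "Download"],
        ["Ein", "Meeting", "pro", "Woche", "reicht"],
    ],
}

freqs = relative_frequencies(corpus, toy_classifier)
candidates = dictionary_candidates(freqs, threshold_per_million=150_000)
```

Normalising by corpus size per year keeps the figures comparable when the yearly samples differ in size. In practice the threshold would be set by the lexicographer against a large and balanced corpus, as the argument above requires.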
Chapter 7 Conclusions and Future Work

This thesis has shown that it is possible to create a self-evolving system that automatically detects English inclusions in other languages with minimal linguistic expert knowledge and no ongoing maintenance. This English inclusion classifier has three key advantages: it is annotation-free, dynamic and easily extensible.

The fact that the English inclusion classifier is annotation-free represents an advance over existing statistical NLP systems, which require annotated training data, a dependency that is referred to as the annotation bottleneck. When applied to a new problem or domain, statistical systems will fail without this annotation. This weakness has been demonstrated here with an experiment that applied a machine learning approach to English inclusion detection. A further experiment with the machine learner also determined that an annotated data pool of over 80,000 tokens is required to reach performance comparable to that of the English inclusion classifier developed here. By contrast, the classifier does not require any overhead in terms of extensive, and consequently expensive, manual annotation when introduced to a new domain. Therefore the English inclusion classifier is readily applicable to unseen data sets and has been experimentally shown to perform well under these circumstances.

The English inclusion classifier is dynamic because of its search engine component. As the Internet provides access to extremely large quantities of evolving data in different languages, search engines can be used to determine the estimated relative token frequencies for new and unseen words. This classifier therefore exploits the volume of data published online to perform mixed-lingual language identification. The thesis has also presented a corpus search experiment with various sizes of corpora