PhD thesis - School of Informatics - University of Edinburgh

that could have generated the observed token sequence. This is done by means of the Viterbi algorithm (e.g. Rabiner, 1989). The LID algorithm handles unknown words first by means of a dictionary lookup for each language involved. If an unknown token is present in a dictionary, four training samples are added with the corresponding language tag. If it is not found in the dictionary, one training sample is added. If a token is not found in any dictionary, the system backs off to a character n-gram language model based on a training corpus for each language (e.g. Dunning, 1994). Farrugia uses a parallel Maltese-English corpus of legislative documents for this purpose. Three samples are then added to the SMS training corpus for the most likely language guess. After biasing the training sample in this way, the HMM is rebuilt and the input text is tagged with language tags.

Farrugia’s algorithm is set up to distinguish between Maltese and English tokens. He reports an average LID accuracy of 95% over all tokens in three different test sets of 100 random SMS messages each, obtained via a three-fold cross-validation experiment. As the language distribution of each test set is not provided, it is unclear how well the system performs for each language in terms of precision, recall and F-score, and consequently how proficient it is at determining English inclusions. It is therefore difficult to say what improvement this LID system provides over simply assuming that the input text is monolingual Maltese or English. In fact, Farrugia (2005) does not clarify at what level code-switching takes place, i.e. whether SMS messages consist mostly of Maltese text containing embedded English expressions, whether language changes occur at the sentence level, or whether messages are written entirely in Maltese or English.
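The unknown-word handling described above can be illustrated with a minimal sketch. It is not Farrugia's implementation: the function and class names, the toy word lists, the trigram order and the add-one smoothing are all assumptions made for illustration. The sketch also assumes one particular reading of the description, namely that the single training sample is added for each language whose dictionary lacks the token.

```python
import math
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Add-one smoothed character n-gram model (a hypothetical stand-in)."""

    def __init__(self, words, n=3):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def log_prob(self, word):
        """Sum of smoothed log probabilities of the word's n-grams."""
        return sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in char_ngrams(word, self.n)
        )


def bias_samples(token, dictionaries, models):
    """Return the (token, language) samples to add for an unknown token.

    Assumed reading of the description: 4 samples for a language whose
    dictionary contains the token, 1 sample for a language whose dictionary
    does not, and 3 further samples for the n-gram model's best guess when
    no dictionary contains the token.
    """
    samples = []
    found = False
    for lang, lexicon in dictionaries.items():
        if token.lower() in lexicon:
            samples += [(token, lang)] * 4
            found = True
        else:
            samples += [(token, lang)]
    if not found:
        # Back off to the character n-gram models and pick the likeliest language.
        best = max(models, key=lambda lang: models[lang].log_prob(token))
        samples += [(token, best)] * 3
    return samples


# Toy data, invented purely for illustration.
dictionaries = {"en": {"phone", "message"}, "mt": {"illum", "grazzi"}}
models = {lang: CharNgramModel(words) for lang, words in dictionaries.items()}

for token in ["phone", "messages", "grazzijiet"]:
    print(token, Counter(lang for _, lang in bias_samples(token, dictionaries, models)))
```

In the described system, the samples returned here would be appended to the SMS training corpus before the HMM is rebuilt; that step is omitted from the sketch.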
Furthermore, it would be interesting to investigate how well the approach of Farrugia (2005) performs on running text in other domains and what each system component contributes to overall performance. Considering that languages are constantly evolving and new words enter the vocabulary every day, the dictionary and character n-gram based approach for dealing with unknown words is relatively static and may not perform well for languages that are closely related.

2.2.1.4 Lexicon Lookup, Chargrams and Regular Expression Matching: Andersen (2005)

Andersen (2005) notes the importance of recognising anglicisms to lexicographers. He tests several algorithms based on lexicon lookup, character n-grams and regular expression matching, and a combination thereof, to automatically extract anglicisms in Norwegian text. The test set, a random subset of 10,000 tokens from a neologism archive (Wangensteen, 2002), was manually annotated by the author for anglicisms. For this binary classification, anglicisms were defined as either English words or compounds containing at least one element of English origin. Based on this annotation, the test data contained 563 tokens classified as anglicisms.

Using lexicon lookup only, Andersen determines that exact matching against a lexicon undergenerates in detecting anglicisms, resulting in low recall (6.75%). Conversely, fuzzy matching overgenerates, resulting in low precision (8.39%). The character n-gram matching is based on a chargram list of 1,074 items consisting of 4-6 characters which frequently occur in the British National Corpus (BNC). As these are typical English letter sequences, any word in the test set containing such a chargram is classified as English. This method leads to a higher precision of 74.73% but a relatively low recall of 36.23%. Finally, regular expression matching based on English orthographic patterns results in a precision of 60.6% and a recall of 39.0%.

On the 10,000-word test set of the neologism archive (Wangensteen, 2002), the best method, a combination of character n-gram and regular expression matching, yields an accuracy of 96.32%. Simply assuming that the data does not contain any anglicisms yields an accuracy of 94.47%. Andersen’s reported accuracy score is therefore misleadingly high. In fact, the best F-score, which is calculated based on the number of recognised and target anglicisms only, amounts to only 59.4 (P = 75.8%, R = 48.8%). However, this result is unsurprisingly low as no differentiation is made between full-word anglicisms and tokens with mixed-lingual morphemes in the gold standard.
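The contrast between accuracy and F-score can be checked with a short calculation. The sketch below uses only the figures quoted above; the function names are hypothetical. The F-score is taken to be the balanced harmonic mean of precision and recall, which reproduces the reported value. The baseline computed from the 563 annotated anglicisms comes out at roughly 94.4%, within about a tenth of a percentage point of the reported figure.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean), on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def majority_baseline_accuracy(n_tokens, n_positive):
    """Accuracy (in %) of labelling every token as the negative class."""
    return 100.0 * (n_tokens - n_positive) / n_tokens


# Precision and recall of the best combined method, as quoted above.
best_f = f_score(75.8, 48.8)

# Test set size and number of annotated anglicisms, as quoted above.
baseline = majority_baseline_accuracy(10_000, 563)

print(f"F-score of best method: {best_f:.1f}")
print(f"Accuracy of the 'no anglicisms' baseline: {baseline:.2f}%")
```

Because fewer than 6% of the tokens are anglicisms, a classifier that never predicts an anglicism already scores an accuracy in the mid-nineties, so the gain of the best method over that baseline is under two percentage points. The F-score, which ignores correctly rejected non-anglicisms, gives a more informative picture.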
A shortcoming of Andersen’s work, and of other reviewed studies, is that the methods are not evaluated on unseen test data. Knowledge of previous evaluation results could have affected the design of later algorithms. This could easily be tested on another set of data that was not used during the development stage. It would also be interesting to investigate how the methods devised by Andersen perform on running text instead of a collection of neologisms extracted from text. While Andersen’s work is already applied in a language identification module as part of a classification tool for neologisms, language identification on running text could exploit knowledge of the surrounding text. Applied in such a way, anglicism detection would also allow lexicographers to examine the use of borrowings in context.