PhD thesis - School of Informatics - University of Edinburgh

Chapter 2. Background and Theory

S 461
| +- DS_G 460
| +- NG_G 6
| | +- NG_SIMP_G 5
| | +- PRON_G "er" [’?e:ˆ6] 1
| +- V_G 155
| | +- VST_G 153
| | | +- VSIMP_G 152
| | | +- VS_G 151
| | | +- VS_E "surf" [’s3:f] 1
| | +- VE_G "t#" [t#] 1
| +- AN_G 289
| +- PG_G 288
| +- PG_SIMP_G 287
| +- PREPC_G "im" [’?Im] 1
| +- NGN_G 284
| +- NG_E 34
| +- MOD_REP_E 21
| | +- MOD_REP_E 11
| | | +- MOD_E 8
| | | +- N_E 4
| | | +- NS_E "world" [’w3:ld] 1
| | | +- NE_E "#" [#] 1
| | +- MOD_E 8
| | +- ADJ_E 7
| | +- AJ_E 4
| | +- AS_E "wide" [’wa_Id] 1
| | +- ASE_E "#" [#] 1
| +- N_E 4
| +- NS_E "web" [’web] 1
| +- NE_E "#" [#] 1

Figure 2.7: Mixed-lingual analyser output (Pfister and Romsdorfer, 2003)

The interlingual homograph Lager could either refer to the German neuter noun, which has many different meanings including camp, or the English noun, i.e. a type of beer. Given the context, it is evident that we are dealing with an English inclusion. However, the analyser would give higher priority to the German variant. Pfister and Romsdorfer (2003) note that individual language grammars consist of more than 500 rules whereas the inclusion grammar contains around 20 rules. While they state that the morpho-syntactic analyser is precise at detecting the language of tokens and the sentence structure, they did not actually evaluate the performance of their system. The reason is that the various lexica and grammars used by the rule-based morpho-syntactic analyser are relatively small and coverage is thus very limited (correspondence with authors). Results would be largely dominated by the words that are not covered by the morphological analyser and would not measure the performance of the approach. It is therefore unclear how well the analyser performs on real mixed-lingual data.
Although Pfister and Romsdorfer (2003) have taken an interesting approach to dealing with mixed-lingual data, a system working with larger lexica and grammars will be very expensive in terms of computational overhead and may fail when rules contradict each other. This method is also very costly considering that linguistic experts have to write all the necessary grammars from scratch for each language scenario.
2.2.1.2 Combined Dictionary and N-gram Language Identification: Marcadet et al. (2005)

A further approach to LID for mixed-lingual text is that of Marcadet et al. (2005). Their LID system is specifically designed to function at the front end of a polyglot TTS synthesis system. They present experiments with a dictionary-based transformation-based learning (TBL) approach and a corpus-based n-gram approach and show that a combination of both methods yields the best results.

The dictionary TBL approach is based on the concept of starting with a simple algorithm and iteratively applying transformations to improve the current solution. It starts with an initial state annotator which classifies tokens as either English, French, German, Italian or Spanish based on dictionary lookup. The dictionary contains the most frequent words for each language and is severely reduced in size by applying over 27,000 morphological rules, including special character rules as well as suffix and prefix rules. Marcadet et al. (2005) do not give any details as to how these rules are created. After the initial lookup, all tokens which could not be assigned to one specific language are treated as ambiguous. Subsequently, the primary language of each sentence is determined. Finally, the language-ambiguous tokens are resolved by means of a rule tagger. This tagger is made up of 500 hand-written rules conditioning on the current, previous or next word and language tag. Even though the authors call this method their TBL approach, TBL is not actually carried out due to the lack of bilingual training data.

Their second method, the n-gram with context approach, is entirely corpus-based.
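The three stages of the dictionary-based method described above (initial lookup, primary language detection, rule-based resolution of ambiguous tokens) can be illustrated with a minimal sketch. It is not the implementation of Marcadet et al. (2005): the toy lexica, the function names and the single neighbour-agreement rule are all assumptions made for illustration, standing in for their reduced dictionaries and 500 hand-written rules.

```python
from collections import Counter

# Hypothetical toy lexica (assumption); the real system uses
# frequent-word dictionaries reduced by morphological rules.
LEXICA = {
    "en": {"the", "world", "wide", "web", "surf", "in", "chat"},
    "de": {"er", "im", "das", "in", "und", "surft"},
    "fr": {"le", "dans", "et", "chat"},
}


def initial_lookup(token):
    """Return the one language whose lexicon contains the token,
    or None if the token is unknown or found in several lexica."""
    langs = [lang for lang, words in LEXICA.items() if token.lower() in words]
    return langs[0] if len(langs) == 1 else None


def primary_language(tags):
    """Take the most frequent unambiguous tag as the sentence language."""
    counts = Counter(tag for tag in tags if tag is not None)
    return counts.most_common(1)[0][0] if counts else None


def resolve(tags, primary):
    """Resolve ambiguous tokens with one illustrative rule (assumption):
    if both neighbours carry the same tag, copy it; otherwise fall back
    to the primary language of the sentence."""
    resolved = list(tags)
    for i, tag in enumerate(tags):
        if tag is not None:
            continue
        prev_tag = tags[i - 1] if i > 0 else None
        next_tag = tags[i + 1] if i + 1 < len(tags) else None
        if prev_tag is not None and prev_tag == next_tag:
            resolved[i] = prev_tag
        else:
            resolved[i] = primary
    return resolved


def identify(tokens):
    """Run lookup, primary language detection and ambiguity resolution."""
    tags = [initial_lookup(token) for token in tokens]
    return resolve(tags, primary_language(tags))


if __name__ == "__main__":
    print(identify(["er", "surft", "in", "das", "web"]))
```

In this sketch the token "in" appears in both the English and the German toy lexicon, so the lookup leaves it ambiguous and the rule stage assigns it the tag of its German neighbours, while "web" keeps its English tag.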
A character n-gram language model is trained for each language and, during the LID stage, the most likely language tag $\hat{L}$ for a word $w$ is computed as:

$$\hat{L} = \arg\max_{L} P(L \mid w) \tag{2.2}$$

The language likelihood of a given word is calculated on the basis of the probability of its character n-gram sequence (7-grams) and weighted language likelihood scores of the previous and next token in order to account for context.

Marcadet et al. (2005) evaluate their system using three small mixed-lingual test scripts in different languages (Table 2.2). The proportion of foreign inclusions in each of the test scripts suggests that they are not a random selection of text but rather a col-