PhD thesis - School of Informatics - University of Edinburgh
Chapter 3. Tracking English Inclusions in German 57
to tokenise the XML documents. The first grammar pre-tokenises the text into tokens surrounded by white space and punctuation and the second grammar groups together various abbreviations, numerals and URLs. Grammar rules also split hyphenated tokens. The two grammars are applied with lxtransduce6, a transducer which adds or rewrites XML markup to an input stream based on the rules provided. lxtransduce is an improved version of fsgmatch, the core program of LT-TTT (Grover et al., 2000). The tokenised text is then POS-tagged using the statistical POS tagger TnT (Trigrams'n'Tags). The tagger is trained on the TIGER Treebank (Release 1) which consists of 700,000 tokens of German newspaper text (Brants et al., 2002) annotated with the Stuttgart-Tübingen Tagset (Schiller et al., 1995), henceforth referred to as STTS.
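The two-pass design above can be illustrated with a simplified sketch. This is not the actual lxtransduce grammar; the regular expressions and the abbreviation list are made-up stand-ins that mimic the described behaviour: pass 1 splits on white space and punctuation, pass 2 regroups abbreviations and URLs while hyphenated tokens remain split.

```python
import re

# Illustrative stand-in for the two lxtransduce grammars described above.
# Patterns and the abbreviation list are simplified examples, not the
# grammars used in the thesis.

PASS1 = re.compile(r"\w+|[^\w\s]")        # words vs. single punctuation marks
URL = re.compile(r"https?://\S+")
ABBREVIATIONS = {"z.B.", "d.h.", "Dr.", "etc."}  # illustrative only

def tokenise(text):
    # Protect URLs so the coarse pass 1 split does not shred them.
    urls = URL.findall(text)
    for i, u in enumerate(urls):
        text = text.replace(u, f" URLTOKEN{i} ")
    tokens = PASS1.findall(text)
    # Pass 2: regroup abbreviation pieces such as 'z', '.', 'B', '.'.
    out, i = [], 0
    while i < len(tokens):
        for n in (4, 2):
            cand = "".join(tokens[i:i + n])
            if cand in ABBREVIATIONS:
                out.append(cand)
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    # Restore the protected URLs.
    return [urls[int(t[8:])] if t.startswith("URLTOKEN") else t for t in out]

print(tokenise("Die S-Bahn fährt z.B. nach http://example.com ."))
```

Note that a hyphenated token such as "S-Bahn" comes out as three tokens ("S", "-", "Bahn"), matching the splitting behaviour of the grammar rules.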
3.3.3 Lexicon Lookup Module
The lexicon module performs an initial language classification run based on a case-insensitive lookup procedure using two lexicons, one for the base language of the text and one for the language of the inclusions. The system is designed to search CELEX Version 2 (Celex, 1993), a lexical database of German, English and Dutch. The German database holds 51,728 lemmas and their 365,530 word forms and the English database contains 52,446 lemmas representing 160,594 corresponding word forms. A CELEX lookup is only performed for tokens which TnT tags as NN (common noun), NE (named entity), ADJA or ADJD (attributive and adverbial or predicatively used adjectives) as well as FM (foreign material). Anglicisms representing other parts of speech are relatively infrequently used in German (Yeandle, 2001), which is the principal reason for focussing on the classification of noun and adjective phrases. Before the lexicon lookup is performed, distinctive characteristics of German orthography are exploited for classification: all tokens containing German umlauts are automatically recognised as German and are not processed further by the system.
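The pre-lookup filtering can be sketched as follows. The function and its return labels are hypothetical; the tag names are the STTS tags the text names, and the umlaut heuristic and POS filter are as described above.

```python
# Sketch of the pre-lookup filter: only tokens TnT tags as NN, NE, ADJA,
# ADJD or FM proceed to the CELEX lookup, and any token containing a
# German umlaut is classified as German immediately. Function name and
# return labels are illustrative, not from the thesis.

LOOKUP_TAGS = {"NN", "NE", "ADJA", "ADJD", "FM"}
GERMAN_UMLAUTS = set("äöüÄÖÜ")

def classify_before_lookup(token, pos_tag):
    """Return 'german', 'skip', or 'lookup' for a POS-tagged token."""
    if any(ch in GERMAN_UMLAUTS for ch in token):
        return "german"   # umlaut heuristic: no further processing needed
    if pos_tag not in LOOKUP_TAGS:
        return "skip"     # other parts of speech are not checked
    return "lookup"       # candidate for the CELEX lookup

print(classify_before_lookup("schön", "ADJD"))   # → german
print(classify_before_lookup("Computer", "NN"))  # → lookup
print(classify_before_lookup("und", "KON"))      # → skip
```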
The core lexicon lookup algorithm involves each token being looked up twice, in both the German and English CELEX databases. Each part of a hyphenated compound is checked individually. Moreover, the lookup in the English database is made case-insensitive in order to identify the capitalised English tokens in the corpus, the reason
6 http://www.ltg.ed.ac.uk/~richard/lxtransduce.html
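The dual-lookup step described in the paragraph above can be sketched as follows, assuming the two CELEX word-form lists have been loaded into sets. The sample entries are illustrative, not taken from CELEX, and the function name is hypothetical.

```python
# Minimal sketch of the dual CELEX lookup: each token is looked up in
# both the German and the English word-form sets, each part of a
# hyphenated compound is checked individually, and the lookups are
# case-insensitive. The word lists below are made-up examples.

german_wordforms = {"haus", "der", "software"}
english_wordforms = {"house", "update", "software"}

def lookup(token):
    """Return (part, in_german, in_english) for each hyphen-separated part."""
    parts = token.split("-")
    results = []
    for part in parts:
        key = part.lower()  # case-insensitive: catches capitalised anglicisms
        results.append((part,
                        key in german_wordforms,
                        key in english_wordforms))
    return results

print(lookup("Update"))         # → [('Update', False, True)]
print(lookup("Software-Haus"))  # each compound part is checked individually
```

The case-folding matters because English inclusions used as German nouns are conventionally capitalised, so an exact-case lookup against the English word-form list would miss them.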