himalayan linguistics - UCSB Linguistics - University of California ...

More documents

Recommendations

Info

Himalayan Linguistics, Vol 10(1) Error Type Count Percentage All tokens 8763 100.0 Error in tokenisation only 7 < 0.1 Error in POS tagging only 76 0.9 Error in lemmatisation only 255 2.9 Error in tokenisation, POS and lemmatisation 108 1.2 Error in POS tagging and lemmatisation 69 0.8 Total tokens with any error 515 5.9 Table 2. Error rates in manually-checked sample data. The overall accuracy rate of 94.1% for the annotations as a whole is higher than the 93% measured for the POS tagger alone prior to the new additions to the tokenisation; this is probably because many of the newly separated elements are highly frequent auxiliary verbs whose tagging (and lemmatisation) is straightforward (such as cha “is”). The accuracy of the lemmatiser is 95.1%, but if we try to separate the effects of the lemmatiser from the other components by considering only tokens where the input to the lemmatiser did not contain an error, then the lemmatiser’s accuracy rate becomes 97%. So as we would have predicted, errors in tokenisation or tagging tend to cascade through and result in erroneous lemmatisation as well. Moreover, it was noticeable that many of the errors involved the same words being tagged incorrectly every time they occur; this suggests that, with relatively little additional effort, the error rate could be reduced further by addressing these recurrent problems directly. In sum, then, although a small-scale test like this can never be relied on to produce wholly precise accuracy statistics, a performance in the region of 94-95% is roughly in keeping with the success rates most commonly reported in the literature for token-level corpus annotation. 6 Applications for the lemmatised NNC The resources described in the preceding sections have been applied to the NNC to produce a “second version” of the data, which is the version currently made available to researchers working with the corpus. The lemmatisation, and the extended tokenisation on which it depends, have numerous applications. At the most basic level, searches in the corpus are now much more flexible and comprehensive. For example, the use of lemmatisation to link variant spellings with their standard forms allows a complete set of instances to be retrieved by searching for the lemma – rather than searching for each variant form separately or using a complex regular-expression query, as would previously have been necessary. Similarly, many kinds of analyses can be undertaken at a more abstract level using the lemmatisation. For instance, searching for forms of the progressive aspect (such as garirahancha “(he/she) is doing”) would have required all possible forms of the auxiliary hunu “be” to be spelt out explicitly (in garirahancha, cha is the form of hunu). With the lemmatised corpus, search patterns can be specified in terms of lemmata, which is much more efficient. For example, using the CEQL 13 query language in the CQPweb software, together with POS tags and lemmata, forms of the progressive such as garirahanchu, garirahanthyo and so on can be retrieved 13 Common Elementary Query Language, created by Stefan Evert. 162
with this query: _ V* {rahanu} {hunu} Hardie, Lohani, and Yadava: Extending corpus annotation of Nepali (with, of course, Devanagari strings for the lemmata; note that in CEQL syntax, {braces} indicate a lemma query and _underscore indicates a POS tag query, so this query searches for “any verb, followed by a form of rahanu, followed by a form of hunu”). Such abstracted searches are a tool of convenience, but could in theory be accomplished without lemmatisation, albeit with some effort. However, some analyses absolutely depend on the lemmatisation. An extension of Hardie’s (2008) work on collocational patterns around Nepali postpositions looks at patterns around auxiliary verbs. 14 For corpus software to generate correct collocation statistics for a verb such as hunu, it needs to be able to identify, group and quantify the contexts of all the different forms that that verb might take. The lemmatisation is indispensible for this purpose. 7 Conclusion In this paper, we have illustrated how lemmatisation of a language such as Nepali, which contains many morphological challenges for automated analysis, can be approached. In particular, we have focused on the linguistic knowledge resources required by the tokenisation and lemmatisation modules of the Unitag software – namely tokenisation rules and lists of exceptions, a lemmatisation lexicon, and lemmatisation rules. Such a combination of rules on the one hand, and specific wordlists on the other, is liable to be required by any such system, although of course the precise format of these resources will depend on the nature of the software in question. We have illustrated an approach to resource creation that is partly automated, by using corpus-derived frequency lists and the output of analysis programs to generate the lists of items on which the analyst bases the resources. This approach yields resources with maximised (though not complete) coverage of the language. It is our hope that the approach to tokenisation and lemmatisation, and associated linguistic knowledge resources, that we have presented here may be illustrative and of use to researchers looking at other languages of the Himalayan region, most especially those that have similar morphological behaviour to Nepali. Appendix The linguistic resources used by the tokeniser and lemmatiser are provided as supplementary materials to this paper: • The tokenisation rule-list (see 4 above), n-t-rules-ext.txt • The tokenisation exceptions lexicon (see 4 above), n-t-exceptions-ext.txt • The lemmatisation lexicon (see 5.3 above), n-lemmalex.txt • The lemmatisation rules (see 5.4 above), n-lemmarules.txt The Nepali part-of-speech tagging software is available separately at www.lancs.ac.uk/staff/hardiea/nepali/postag.php. 14 The work on postpositions was funded by the UK Arts and Humanities Research Council (AHRC). 163
Page 1 and 2:
volume 10 number 1 June 2011 himala
Page 3:
Himalayan Linguistics Volume 10, Nu
Page 7:
Michael Noonan in Nepal, sitting wi
Page 10 and 11:
Himalayan Linguistics, Vol 10(1) Fi
Page 12 and 13:
Himalayan Linguistics, Vol 10(1) Mi
Page 14 and 15:
Himalayan Linguistics, Vol 10(1) Re
Page 16 and 17:
Himalayan Linguistics, Vol 10(1) 20
Page 18 and 19:
Page 21 and 22:
Works by David Watters Books Forthc
Page 23:
xxi Works by David Watters 2006.
Page 26 and 27:
Himalayan Linguistics, Vol 10(1) La
Page 28 and 29:
Himalayan Linguistics, Vol 10(1) Th
Page 30 and 31:
Himalayan Linguistics, Vol 10(1) Is
Page 33 and 34:
Himalayan Linguistics, Vol. 10(1).
Page 35 and 36:
DeLancey: Verb agreement prefixes i
Page 37 and 38:
Page 39 and 40:
(6) amh bəyʔ-ne-wʔ food give-NON
Page 41 and 42:
Singular Plural 1 maʔ-niŋ maʔ-nu
Page 43 and 44:
(20) mín-rhê-reŋ-áŋ-cê 1OBJ
Page 45 and 46:
Page 47 and 48:
Page 49 and 50:
Page 51 and 52:
Page 53 and 54:
Page 55 and 56:
Page 57 and 58:
Page 59 and 60:
Page 61:
Page 64 and 65:
Himalayan Linguistics, Vol 10(1) wh
Page 66 and 67:
Himalayan Linguistics, Vol 10(1) fo
Page 68 and 69:
Himalayan Linguistics, Vol 10(1) un
Page 70 and 71:
Page 73 and 74:
Page 75 and 76:
43 Eppele: Language use among the B
Page 77 and 78:
Figure 2. Language 9 areas by VDC 4
Page 79 and 80:
Page 81 and 82:
Page 83 and 84:
Page 85 and 86:
Page 87 and 88:
Page 89 and 90:
Genetti: Direct speech reports and
Page 91 and 92:
Page 93 and 94:
Page 95 and 96:
5.1 Intonation-unit boundaries Gene
Page 97 and 98:
Page 99 and 100:
Page 101 and 102:
Page 103 and 104:
(16) a. haŋ-ane // say-part Genett
Page 105 and 106:
Page 107 and 108:
Page 109 and 110:
Page 111 and 112:
Greninger: Another look at storylin
Page 113 and 114:
# Info Type Surface Marking Grening
Page 115 and 116:
down - place -for HES HES HES house
Page 117 and 118:
Page 119 and 120:
kja¹ mi² -nok" woɾu, tɑŋ² kʰ
Page 121 and 122:
9 Background Activities Greninger:
Page 123 and 124:
and.then + sun/day after (21) SICK
Page 125 and 126:
carry -IMS soldier PL =DAT friend n
Page 127 and 128:
Page 129 and 130:
Page 131:
Page 134 and 135:
Himalayan Linguistics, Vol 10(1) an
Page 136 and 137:
Himalayan Linguistics, Vol 10(1) (1
Page 138 and 139:
Page 140 and 141:
sing-ATT.NMZ young.man-PL village-C
Page 142 and 143:
Page 144 and 145: D.DEM-DEF-ERG little.bit-little.bit
Page 146 and 147: Himalayan Linguistics, Vol 10(1) (2
Page 148 and 149: Himalayan Linguistics, Vol 10(1) 3.
Page 152 and 153: (37) (a) *kurc-ya bɦormi stingy- S
Page 154 and 155: ‘He is always late.’ (S) (b) ho
Page 156 and 157: Himalayan Linguistics, Vol 10(1) S
Page 158 and 159: Himalayan Linguistics, Vol 10(1) No
Page 160 and 161: Himalayan Linguistics, Vol 10(1) 1
Page 164 and 165: Himalayan Linguistics, Vol 10(1) pi
Page 166 and 167: Himalayan Linguistics, Vol 10(1) In
Page 168 and 169: Himalayan Linguistics, Vol 10(1) Th
Page 170 and 171: Himalayan Linguistics, Vol 10(1) In
Page 174 and 175: Himalayan Linguistics, Vol 10(1) ch
Page 176 and 177: Himalayan Linguistics, Vol 10(1) ji
Page 180 and 181: Himalayan Linguistics, Vol 10(1) th
Page 182 and 183: Himalayan Linguistics, Vol 10(1) Pa
Page 184 and 185: Himalayan Linguistics, Vol 10(1) bo
Page 186 and 187: Himalayan Linguistics, Vol 10(1) ab
Page 188 and 189: Himalayan Linguistics, Vol 10(1) (i
Page 190 and 191: Himalayan Linguistics, Vol 10(1) th
Page 192 and 193: Himalayan Linguistics, Vol 10(1) le
Page 196 and 197: Himalayan Linguistics, Vol 10(1) Re
Page 199 and 200: Himalayan Linguistics, Vol. 10(1).
Page 201 and 202: 3 Lexical correspondences with regi
Page 203 and 204: 3.2 Distinguishing features of Gyal
Page 205 and 206: Hildebrandt and Perry: Preliminary
Page 207 and 208: Chart 2. Average Initial-Syllable V
Page 209 and 210: Chart 5. Average Vowel Intensity (d
Page 217: Verbs Hildebrandt and Perry: Prelim
Page 220 and 221: Himalayan Linguistics, Vol 10(1) Fi
Page 222 and 223: Himalayan Linguistics, Vol 10(1) na
Page 224 and 225: Himalayan Linguistics, Vol 10(1) ma
Page 226 and 227: Himalayan Linguistics, Vol 10(1) (w
Page 228 and 229: Himalayan Linguistics, Vol 10(1) Mo
Page 230 and 231: Himalayan Linguistics, Vol 10(1) 48
Page 232 and 233: Himalayan Linguistics, Vol 10(1) On
Page 234 and 235: Himalayan Linguistics, Vol 10(1) No
Page 238 and 239: Himalayan Linguistics, Vol 10(1) Wh
Page 240 and 241: Himalayan Linguistics, Vol 10(1) re
Page 242 and 243: Himalayan Linguistics, Vol 10(1) ap
Page 244 and 245:
Himalayan Linguistics, Vol 10(1) 5.
Page 246 and 247:
Himalayan Linguistics, Vol 10(1) In
Page 248 and 249:
Himalayan Linguistics, Vol 10(1) Ba
Page 250 and 251:
Page 252 and 253:
Himalayan Linguistics, Vol 10(1) ap
Page 254 and 255:
Page 256 and 257:
Himalayan Linguistics, Vol 10(1) Ap
Page 259 and 260:
Page 261 and 262:
229 Lee: Issues in Bahing orthograp
Page 263 and 264:
Page 265 and 266:
(5) [ˈb̤luʔƃa] ; [ˈjoʔƃa]
Page 267 and 268:
Page 269 and 270:
Page 271 and 272:
Page 273 and 274:
Page 275 and 276:
(28) Words with Contrastive Short a
Page 277 and 278:
Page 279 and 280:
Page 281 and 282:
Page 283 and 284:
4 Reflections 251 Lee: Issues in Ba
Page 285 and 286:
Page 287 and 288:
255 Opgenort: Tilung and its positi
Page 289 and 290:
Page 291 and 292:
PK Hayu Sunwar Bahing Jero Wambule
Page 293 and 294:
Bahing po, Wambule pa, Dumi poʔo T
Page 295 and 296:
Page 297 and 298:
Page 299 and 300:
hear Gloss lie down one PK Hayu Sun
Page 301 and 302:
Page 303:
Page 306 and 307:
Page 308 and 309:
Page 310 and 311:
Page 312 and 313:
Page 314 and 315:
Page 316 and 317:
Page 318 and 319:
Page 320 and 321:
Himalayan Linguistics, Vol 10(1) Fi
Page 322 and 323:
Himalayan Linguistics, Vol 10(1) Si
Page 324 and 325:
Himalayan Linguistics, Vol 10(1) re
Page 326 and 327:
Page 328 and 329:
Page 331 and 332:
Page 333 and 334:
301 Bergqvist: Review of Chelliah a
show all

himalayan linguistics - UCSB Linguistics - University of California ...

Create successful ePaper yourself

Delete template?

Save as template?