himalayan linguistics - UCSB Linguistics - University of California ...

More documents

Recommendations

Info

Himalayan Linguistics, Vol 10(1) lemma will be deduced by removing the ii (the feminine suffix) and adding o (the masculine suffix). Applied to a feminine adjective such as raamrii “good”, for instance, this would produce the correct lemma raamro. Although the element removed by the transformation may be (and often is) the same as the suffix sought in the word-template, it need not be. An example verbal rule is the following: #यौ VV# #यौ #नु #yau VV# #yau #nu The suffix yau is the marker for the past tense second person plural (or singular medialhonorific), for example garyau “you did” which would map to the infinitive garnu. A word ending in yau must have a tag beginning in VV for this rule to apply – in the Nelralec Tagset, all indicative finite verb tags begin with VV. The use of tag templates as well as word templates in lemmatisation rules constrains the lemmatiser from overgenerating – applying its rules in cases where they do not apply, such as nouns which happen to end in a form which is also a verbal suffix – in the absence of a list of exceptions such as was used in the tokeniser. Of course, the tokeniser cannot rely on tags to constrain it, as tokenisation is undertaken prior to tagging. Errors are still possible, as the tagger itself may be mistaken. The final rule list contains around 150 rules, all but 10 of which relate to verb inflections, both finite and non-finite. This is a shorter set of rules than were used for tokenisation, since the lexical material they contain (different morphological and orthographical forms of the various suffixes) is more general. Even so, however, a relatively extensive list is produced simply because a single “rule” in the sense of a regularity of Nepali grammar will map to multiple “rules” in the formalism used by the lemmatiser. An example of the final annotation output is given below; note that though a columnar format is used here, it can equally well be expressed as XML or a horizontally-tagged format, and that both POS tags and lemmata are present in the output. हो VVYN1 हुनु , YM PUNC अत्यन्त RR अत्यन्त दुःखद JX दुःखद हुन् V0 हुनु छ VVYN1 हुनु पिरचय NN पिरचय । YF PUNC 5.5. Evaluation 5.5 Evaluation Corpus annotation systems are typically evaluated quantitatively, that is, by means of Corpus statistical annotation measures of systems how closely are typically their output evaluated approximates quantitatively, the desired that output, is, by usually means of statistical measured measures on a of token-by-token how closely their basis. output It is far approximates from clear that the these desired are optimal output, measures usually measured of on a token-by-token performance; as basis. Abney It (1997: is far from 121) points clear that out, these in many are cases optimal we might measures be more of performance; interested in as Abney (1997: the proportion 121) points of sentences out, in many that are cases correctly we might tagged be throughout more interested – which in will the obviously proportion be of sentences much lower. Moreover, particularly for languages like Nepali where the tagger performs that are correctly tagged throughout – which will obviously be much lower. Moreover, particularly extensive tokenisation, a statistic that measures success in tokens is prone to being swayed one way or the other by errors at the tokenisation stage. However, token-based performance measures are the de facto standard form reported in the literature. Two pairs of measures are used: precision (percentage of correct analyses out 160 of the total analyses produced) and recall (percentage of correct analyses out of the total number of tokens); and accuracy (percentage of tokens given the correct analysis) and ambiguity (mean number of analyses per token).
Hardie, Lohani, and Yadava: Extending corpus annotation of Nepali for languages like Nepali where the tagger performs extensive tokenisation, a statistic that measures success in tokens is prone to being swayed one way or the other by errors at the tokenisation stage. However, token-based performance measures are the de facto standard form reported in the literature. Two pairs of measures are used: precision (percentage of correct analyses out of the total analyses produced) and recall (percentage of correct analyses out of the total number of tokens); and accuracy (percentage of tokens given the correct analysis) and ambiguity (mean number of analyses per token). The reason for the pairs of measures is to allow the performance of n-best annotation, where more than one analysis per token may be produced, to be quantified. As van Halteren (1999: 82) notes, the precision/recall pair is “geared towards the description of ambiguous taggings” and is commonly used in the context of research into information retrieval; accuracy/ambiguity are the more common measures in the corpus linguistic literature. When, as here, the software always produces exactly one analysis per token, ambiguity is always equal to 1 and accuracy is usually cited alone. To give an example, the accuracy rate of our existing POS tagger, which as mentioned above is 93%, was determined by a standard procedure: of the available manually-tagged data, a 10% subset was held apart as test data, and the rest used as training data. The tagger’s performance on an untagged version of the test data, compared to the gold-standard manually-tagged version, gives the accuracy rate. This is expressed as a percentage of tokens given the same tag by manual and automated taggers. Of course, the test and training data both represent edited, published written Nepali; for other types of text, or spoken data, a degradation in the accuracy rate is to be anticipated. The obvious approach, therefore, would be to evaluate the lemmatiser according to the same procedure. However, this was not directly possible. Since the lemmatisation resources were not generated from training data in the form of a gold-standard annotated corpus, there was no test data on which to base such an evaluation. Manual lemmatisation of a large gold-standard dataset, which would be an extremely labour-intensive and lengthy process, was beyond the scope of our current development project. And while it is possible to estimate the textual coverage of the lemmatisation lexicon and rules, this would not tell us how often the application of lexicon and rules was correct. Given these problems, we resolved to undertake a restricted-scale test of the combination of tokeniser, POS tagger and lemmatiser by evaluating its performance on a small sample of running text. Of course, the performance of the lemmatiser cannot be assessed independently of that of the tokeniser and tagger, as the output of the tagger is the input to the lemmatiser. The test data were drawn from the NNC Core Sample. Three samples of approximately 2,000 words each (prior to tokenisation) were selected, for a total of 475 sentences. Each sample represented a different broad genre: one fiction, one literary non-fiction, one non-literary nonfiction; all were texts that had not previously been analysed manually in the process of training the POS tagger. Each token of this sample was manually checked to determine whether or not it had been accurately analysed. All errors of whatever kind were recorded. The sample data totalled 8,763 tokens after retokenisation. The figures for the different kinds of errors, and their combinations, are shown in Table 2. 161
Page 1 and 2:
volume 10 number 1 June 2011 himala
Page 3:
Himalayan Linguistics Volume 10, Nu
Page 7:
Michael Noonan in Nepal, sitting wi
Page 10 and 11:
Himalayan Linguistics, Vol 10(1) Fi
Page 12 and 13:
Himalayan Linguistics, Vol 10(1) Mi
Page 14 and 15:
Himalayan Linguistics, Vol 10(1) Re
Page 16 and 17:
Himalayan Linguistics, Vol 10(1) 20
Page 18 and 19:
Page 21 and 22:
Works by David Watters Books Forthc
Page 23:
xxi Works by David Watters 2006.
Page 26 and 27:
Himalayan Linguistics, Vol 10(1) La
Page 28 and 29:
Himalayan Linguistics, Vol 10(1) Th
Page 30 and 31:
Himalayan Linguistics, Vol 10(1) Is
Page 33 and 34:
Himalayan Linguistics, Vol. 10(1).
Page 35 and 36:
DeLancey: Verb agreement prefixes i
Page 37 and 38:
Page 39 and 40:
(6) amh bəyʔ-ne-wʔ food give-NON
Page 41 and 42:
Singular Plural 1 maʔ-niŋ maʔ-nu
Page 43 and 44:
(20) mín-rhê-reŋ-áŋ-cê 1OBJ
Page 45 and 46:
Page 47 and 48:
Page 49 and 50:
Page 51 and 52:
Page 53 and 54:
Page 55 and 56:
Page 57 and 58:
Page 59 and 60:
Page 61:
Page 64 and 65:
Himalayan Linguistics, Vol 10(1) wh
Page 66 and 67:
Himalayan Linguistics, Vol 10(1) fo
Page 68 and 69:
Himalayan Linguistics, Vol 10(1) un
Page 70 and 71:
Page 73 and 74:
Page 75 and 76:
43 Eppele: Language use among the B
Page 77 and 78:
Figure 2. Language 9 areas by VDC 4
Page 79 and 80:
Page 81 and 82:
Page 83 and 84:
Page 85 and 86:
Page 87 and 88:
Page 89 and 90:
Genetti: Direct speech reports and
Page 91 and 92:
Page 93 and 94:
Page 95 and 96:
5.1 Intonation-unit boundaries Gene
Page 97 and 98:
Page 99 and 100:
Page 101 and 102:
Page 103 and 104:
(16) a. haŋ-ane // say-part Genett
Page 105 and 106:
Page 107 and 108:
Page 109 and 110:
Page 111 and 112:
Greninger: Another look at storylin
Page 113 and 114:
# Info Type Surface Marking Grening
Page 115 and 116:
down - place -for HES HES HES house
Page 117 and 118:
Page 119 and 120:
kja¹ mi² -nok" woɾu, tɑŋ² kʰ
Page 121 and 122:
9 Background Activities Greninger:
Page 123 and 124:
and.then + sun/day after (21) SICK
Page 125 and 126:
carry -IMS soldier PL =DAT friend n
Page 127 and 128:
Page 129 and 130:
Page 131:
Page 134 and 135:
Himalayan Linguistics, Vol 10(1) an
Page 136 and 137:
Himalayan Linguistics, Vol 10(1) (1
Page 138 and 139:
Page 140 and 141:
sing-ATT.NMZ young.man-PL village-C
Page 142 and 143: Himalayan Linguistics, Vol 10(1) (1
Page 144 and 145: D.DEM-DEF-ERG little.bit-little.bit
Page 148 and 149: Himalayan Linguistics, Vol 10(1) 3.
Page 152 and 153: (37) (a) *kurc-ya bɦormi stingy- S
Page 154 and 155: ‘He is always late.’ (S) (b) ho
Page 156 and 157: Himalayan Linguistics, Vol 10(1) S
Page 158 and 159: Himalayan Linguistics, Vol 10(1) No
Page 160 and 161: Himalayan Linguistics, Vol 10(1) 1
Page 164 and 165: Himalayan Linguistics, Vol 10(1) pi
Page 166 and 167: Himalayan Linguistics, Vol 10(1) In
Page 168 and 169: Himalayan Linguistics, Vol 10(1) Th
Page 170 and 171: Himalayan Linguistics, Vol 10(1) In
Page 174 and 175: Himalayan Linguistics, Vol 10(1) ch
Page 176 and 177: Himalayan Linguistics, Vol 10(1) ji
Page 180 and 181: Himalayan Linguistics, Vol 10(1) th
Page 182 and 183: Himalayan Linguistics, Vol 10(1) Pa
Page 184 and 185: Himalayan Linguistics, Vol 10(1) bo
Page 186 and 187: Himalayan Linguistics, Vol 10(1) ab
Page 188 and 189: Himalayan Linguistics, Vol 10(1) (i
Page 190 and 191: Himalayan Linguistics, Vol 10(1) th
Page 194 and 195: Himalayan Linguistics, Vol 10(1) Er
Page 196 and 197: Himalayan Linguistics, Vol 10(1) Re
Page 199 and 200: Himalayan Linguistics, Vol. 10(1).
Page 201 and 202: 3 Lexical correspondences with regi
Page 203 and 204: 3.2 Distinguishing features of Gyal
Page 205 and 206: Hildebrandt and Perry: Preliminary
Page 207 and 208: Chart 2. Average Initial-Syllable V
Page 209 and 210: Chart 5. Average Vowel Intensity (d
Page 217: Verbs Hildebrandt and Perry: Prelim
Page 220 and 221: Himalayan Linguistics, Vol 10(1) Fi
Page 222 and 223: Himalayan Linguistics, Vol 10(1) na
Page 224 and 225: Himalayan Linguistics, Vol 10(1) ma
Page 226 and 227: Himalayan Linguistics, Vol 10(1) (w
Page 228 and 229: Himalayan Linguistics, Vol 10(1) Mo
Page 230 and 231: Himalayan Linguistics, Vol 10(1) 48
Page 232 and 233: Himalayan Linguistics, Vol 10(1) On
Page 234 and 235: Himalayan Linguistics, Vol 10(1) No
Page 238 and 239: Himalayan Linguistics, Vol 10(1) Wh
Page 240 and 241: Himalayan Linguistics, Vol 10(1) re
Page 242 and 243:
Himalayan Linguistics, Vol 10(1) ap
Page 244 and 245:
Himalayan Linguistics, Vol 10(1) 5.
Page 246 and 247:
Himalayan Linguistics, Vol 10(1) In
Page 248 and 249:
Himalayan Linguistics, Vol 10(1) Ba
Page 250 and 251:
Page 252 and 253:
Himalayan Linguistics, Vol 10(1) ap
Page 254 and 255:
Page 256 and 257:
Himalayan Linguistics, Vol 10(1) Ap
Page 259 and 260:
Page 261 and 262:
229 Lee: Issues in Bahing orthograp
Page 263 and 264:
Page 265 and 266:
(5) [ˈb̤luʔƃa] ; [ˈjoʔƃa]
Page 267 and 268:
Page 269 and 270:
Page 271 and 272:
Page 273 and 274:
Page 275 and 276:
(28) Words with Contrastive Short a
Page 277 and 278:
Page 279 and 280:
Page 281 and 282:
Page 283 and 284:
4 Reflections 251 Lee: Issues in Ba
Page 285 and 286:
Page 287 and 288:
255 Opgenort: Tilung and its positi
Page 289 and 290:
Page 291 and 292:
PK Hayu Sunwar Bahing Jero Wambule
Page 293 and 294:
Bahing po, Wambule pa, Dumi poʔo T
Page 295 and 296:
Page 297 and 298:
Page 299 and 300:
hear Gloss lie down one PK Hayu Sun
Page 301 and 302:
Page 303:
Page 306 and 307:
Page 308 and 309:
Page 310 and 311:
Page 312 and 313:
Page 314 and 315:
Page 316 and 317:
Page 318 and 319:
Page 320 and 321:
Himalayan Linguistics, Vol 10(1) Fi
Page 322 and 323:
Himalayan Linguistics, Vol 10(1) Si
Page 324 and 325:
Himalayan Linguistics, Vol 10(1) re
Page 326 and 327:
Page 328 and 329:
Page 331 and 332:
Page 333 and 334:
301 Bergqvist: Review of Chelliah a
show all

himalayan linguistics - UCSB Linguistics - University of California ...

Create successful ePaper yourself

Delete template?

Save as template?