himalayan linguistics - UCSB Linguistics - University of California ...

More documents

Recommendations

Info

Himalayan Linguistics, Vol 10(1) above – especially postpositions – are to be written as a single word with the element they are attached to or not. For example, instances of maa attached to a preceding noun and instances with an intervening space may both be observed. The general standard, albeit one far from universally followed, has been to write them together; however, quite recently there appears to have been a move towards the practice of writing them separately. 3.2 Compromises in the original NNC part-of-speech annotation scheme In developing a POS tagging scheme for the first release of the NNC (Hardie et al., 2009), referred to as the Nelralec Tagset, certain compromises were made in order to make the task more tractable. In particular, full consistency between the treatment of nouns and verbs in the area of tokenisation was sacrificed. Tokenisation is a necessary first step to POS tagging: the very act of assigning grammatical tags to a text assumes that appropriate units have been identified to which tags can be given. This is not necessarily a straightforward process. Even in a language such as English, where the notion of the word or token as a linguistic unit corresponds very closely to the orthographic (whitespacebounded) word, some adjustment to the tokenisation is undertaken prior to POS analysis. For instance, the wordform don’t is typically tokenised as do n’t: the resulting two-token sequence can then be given the same tags as do not, enabling the difference between these two realisations of the negative particle to be abstracted away in corpus searches and frequency counts if necessary. In a language such as Nepali, with the extensive grammatical agglutination noted above, there are three possible ways to approach tokenisation: • Perform no tokenisation (apart from the basic separating-out of punctuation), and tag each multi-element token according to all the grammatical distinctions indicated by the elements. • Perform no tokenisation, but tag each multi-element token in a simplified manner based on a subset of the grammatical distinctions that are present. • Perform extensive tokenisation, splitting apart multi-element words into strings of tokens which can then be individually tagged. It might be questioned why the third option, of splitting apart multi-element words, is even under consideration in this context. The reasons are twofold. Firstly, and crucially, elements within a Nepali compound are often (inflectionally and in some cases functionally) identical to units that can be found elsewhere in texts. For example, the elements of verb garnecha “(he) will do”, namely garne and cha, both occur as independent words (a participle of do and a finite form of hunu “to be”, respectively). Secondly, and as noted above, the use of spaces in Nepali orthography is not fully consistent. That means that there are cases where these elements are already, in effect, “pre-split” before analysis begins; splitting the other cases then leads to consistency. Both these issues would reduce the precision/recall and accuracy of computerised corpus searches for, and frequency counts of, different linguistic phenomena – whereas consistency in encoding makes it possible for such analyses to be substantially more thorough and rigorous. So while this third strategy does involve making changes to the original word divisions of the corpus text, there are strong advantages in terms of the tractability of subsequent analysis. The other two strategies above have drawbacks based on the schemas of analysis that they 154
Hardie, Lohani, and Yadava: Extending corpus annotation of Nepali necessarily imply. The disadvantage of the first strategy in particular is that an attempt to capture in an annotation scheme all the grammatical categories indicated by a potentially long string of compounded elements leads to massive inflation of the tagset. The second strategy avoids this problem, but only at the cost of reducing the resolution of the annotation. So for postpositions in Nepali, a strategy of extensive tokenisation allows a more consistent and comprehensive analysis than the other two strategies. These considerations led us to adopt this third strategy for nouns (and categories with similar behaviour such as adjectives and determiners). It would, of course, be possible to treat compounded auxiliary verbs in the same manner – splitting them apart from the verbs they are attached to and tagging them separately. However, a retokenisation solution is not as straightforward for verbs as for nouns and postpositions, since there are some forms within some verbal compounds that are not identical to a freestanding element. For instance, the verb form huncha “is” is composed of hun and cha, but while cha is also a freestanding element, hun is not (the independent equivalent would be the root form hu “be”). Moreover, the number and the complexity of different elements which would need to be separated out is much greater in the case of auxiliary verbs than in the case of postpositions. For this reason, the retokenisation approach was avoided for verbs in the Nelralec Tagset. Instead, the second of the approaches noted above was adopted for verbs: each verb was tagged as a whole, but with simplifying assumptions to prevent the number of categories within the tagset growing massively. 5 While making the task of POS tagging suitably tractable, this introduced an unavoidable inconsistency between the analysis of verbs (no retokenisation, simplified analysis) and the analysis of nouns (retokenisation, full analysis) in the NNC. It also created an obstacle to adding lemmatisation to the annotation of the NNC, namely that while we would ideally annotate the lemmata of all elements of a compound to enable maximal recall in retrieving instances of those lemmata, this was not currently supported by the tokenisation. Thus, when we came to consider extending the annotation, we had a twofold goal: (a) to amend the tokenisation system, and its realisation in the NNC, to reverse this compromise and make it fully consistent by splitting verbal complex elements such as garnecha, huncha and garirahyo just as noun-plus-postposition elements are split; and (b) to use this improved tokenisation to enable subsequent lemmatisation. 4 Resources for (re)tokenisation The creation of our extended tokenisation system for Nepali was focused on the development of certain necessary linguistic knowledge resources. The software that implements the Nepali POS tagger is the Unitag system (Hardie 2004, 2005), a modular tagger which incorporates a languageindependent configurable tokeniser. This Unitokenise module operates in two passes across each text file. In the first pass, it inserts an explicit token break at every whitespace, at every punctuation mark, and at every XML element – in effect, applying “dumb” tokenisation. In the second pass, it moves across the tokenised text adjusting the token breaks in accordance with a list of tokenisation rules. These are a language specific resource, external to the program. They are defined as a pattern around a token break; if the pattern is matched then the rule is applied. The rule may be to “split” 5 Many thousands of different verb tags would be required to capture all the tense-aspect, mood, voice, person, number, gender, and honorificity distinctions made via compounding of auxiliary verbs in Nepali. The amount of manuallyannotated data required to train a POS tagging system based on tag n-gram frequencies, such as a Markov model (see El-Beze and Merialdo 1999), is proportional to the n th power of the size of the tagset, so such a large number of distinctions would clearly not be practical. 155
Page 1 and 2:
volume 10 number 1 June 2011 himala
Page 3:
Himalayan Linguistics Volume 10, Nu
Page 7:
Michael Noonan in Nepal, sitting wi
Page 10 and 11:
Himalayan Linguistics, Vol 10(1) Fi
Page 12 and 13:
Himalayan Linguistics, Vol 10(1) Mi
Page 14 and 15:
Himalayan Linguistics, Vol 10(1) Re
Page 16 and 17:
Himalayan Linguistics, Vol 10(1) 20
Page 18 and 19:
Page 21 and 22:
Works by David Watters Books Forthc
Page 23:
xxi Works by David Watters 2006.
Page 26 and 27:
Himalayan Linguistics, Vol 10(1) La
Page 28 and 29:
Himalayan Linguistics, Vol 10(1) Th
Page 30 and 31:
Himalayan Linguistics, Vol 10(1) Is
Page 33 and 34:
Himalayan Linguistics, Vol. 10(1).
Page 35 and 36:
DeLancey: Verb agreement prefixes i
Page 37 and 38:
Page 39 and 40:
(6) amh bəyʔ-ne-wʔ food give-NON
Page 41 and 42:
Singular Plural 1 maʔ-niŋ maʔ-nu
Page 43 and 44:
(20) mín-rhê-reŋ-áŋ-cê 1OBJ
Page 45 and 46:
Page 47 and 48:
Page 49 and 50:
Page 51 and 52:
Page 53 and 54:
Page 55 and 56:
Page 57 and 58:
Page 59 and 60:
Page 61:
Page 64 and 65:
Himalayan Linguistics, Vol 10(1) wh
Page 66 and 67:
Himalayan Linguistics, Vol 10(1) fo
Page 68 and 69:
Himalayan Linguistics, Vol 10(1) un
Page 70 and 71:
Page 73 and 74:
Page 75 and 76:
43 Eppele: Language use among the B
Page 77 and 78:
Figure 2. Language 9 areas by VDC 4
Page 79 and 80:
Page 81 and 82:
Page 83 and 84:
Page 85 and 86:
Page 87 and 88:
Page 89 and 90:
Genetti: Direct speech reports and
Page 91 and 92:
Page 93 and 94:
Page 95 and 96:
5.1 Intonation-unit boundaries Gene
Page 97 and 98:
Page 99 and 100:
Page 101 and 102:
Page 103 and 104:
(16) a. haŋ-ane // say-part Genett
Page 105 and 106:
Page 107 and 108:
Page 109 and 110:
Page 111 and 112:
Greninger: Another look at storylin
Page 113 and 114:
# Info Type Surface Marking Grening
Page 115 and 116:
down - place -for HES HES HES house
Page 117 and 118:
Page 119 and 120:
kja¹ mi² -nok" woɾu, tɑŋ² kʰ
Page 121 and 122:
9 Background Activities Greninger:
Page 123 and 124:
and.then + sun/day after (21) SICK
Page 125 and 126:
carry -IMS soldier PL =DAT friend n
Page 127 and 128:
Page 129 and 130:
Page 131:
Page 134 and 135:
Himalayan Linguistics, Vol 10(1) an
Page 136 and 137: Himalayan Linguistics, Vol 10(1) (1
Page 138 and 139: Himalayan Linguistics, Vol 10(1) Th
Page 140 and 141: sing-ATT.NMZ young.man-PL village-C
Page 144 and 145: D.DEM-DEF-ERG little.bit-little.bit
Page 148 and 149: Himalayan Linguistics, Vol 10(1) 3.
Page 152 and 153: (37) (a) *kurc-ya bɦormi stingy- S
Page 154 and 155: ‘He is always late.’ (S) (b) ho
Page 156 and 157: Himalayan Linguistics, Vol 10(1) S
Page 158 and 159: Himalayan Linguistics, Vol 10(1) No
Page 160 and 161: Himalayan Linguistics, Vol 10(1) 1
Page 164 and 165: Himalayan Linguistics, Vol 10(1) pi
Page 166 and 167: Himalayan Linguistics, Vol 10(1) In
Page 168 and 169: Himalayan Linguistics, Vol 10(1) Th
Page 170 and 171: Himalayan Linguistics, Vol 10(1) In
Page 174 and 175: Himalayan Linguistics, Vol 10(1) ch
Page 176 and 177: Himalayan Linguistics, Vol 10(1) ji
Page 180 and 181: Himalayan Linguistics, Vol 10(1) th
Page 182 and 183: Himalayan Linguistics, Vol 10(1) Pa
Page 184 and 185: Himalayan Linguistics, Vol 10(1) bo
Page 188 and 189: Himalayan Linguistics, Vol 10(1) (i
Page 190 and 191: Himalayan Linguistics, Vol 10(1) th
Page 192 and 193: Himalayan Linguistics, Vol 10(1) le
Page 194 and 195: Himalayan Linguistics, Vol 10(1) Er
Page 196 and 197: Himalayan Linguistics, Vol 10(1) Re
Page 199 and 200: Himalayan Linguistics, Vol. 10(1).
Page 201 and 202: 3 Lexical correspondences with regi
Page 203 and 204: 3.2 Distinguishing features of Gyal
Page 205 and 206: Hildebrandt and Perry: Preliminary
Page 207 and 208: Chart 2. Average Initial-Syllable V
Page 209 and 210: Chart 5. Average Vowel Intensity (d
Page 217: Verbs Hildebrandt and Perry: Prelim
Page 220 and 221: Himalayan Linguistics, Vol 10(1) Fi
Page 222 and 223: Himalayan Linguistics, Vol 10(1) na
Page 224 and 225: Himalayan Linguistics, Vol 10(1) ma
Page 226 and 227: Himalayan Linguistics, Vol 10(1) (w
Page 228 and 229: Himalayan Linguistics, Vol 10(1) Mo
Page 230 and 231: Himalayan Linguistics, Vol 10(1) 48
Page 232 and 233: Himalayan Linguistics, Vol 10(1) On
Page 234 and 235: Himalayan Linguistics, Vol 10(1) No
Page 236 and 237:
Himalayan Linguistics, Vol 10(1) (1
Page 238 and 239:
Himalayan Linguistics, Vol 10(1) Wh
Page 240 and 241:
Himalayan Linguistics, Vol 10(1) re
Page 242 and 243:
Himalayan Linguistics, Vol 10(1) ap
Page 244 and 245:
Himalayan Linguistics, Vol 10(1) 5.
Page 246 and 247:
Himalayan Linguistics, Vol 10(1) In
Page 248 and 249:
Himalayan Linguistics, Vol 10(1) Ba
Page 250 and 251:
Page 252 and 253:
Himalayan Linguistics, Vol 10(1) ap
Page 254 and 255:
Page 256 and 257:
Himalayan Linguistics, Vol 10(1) Ap
Page 259 and 260:
Page 261 and 262:
229 Lee: Issues in Bahing orthograp
Page 263 and 264:
Page 265 and 266:
(5) [ˈb̤luʔƃa] ; [ˈjoʔƃa]
Page 267 and 268:
Page 269 and 270:
Page 271 and 272:
Page 273 and 274:
Page 275 and 276:
(28) Words with Contrastive Short a
Page 277 and 278:
Page 279 and 280:
Page 281 and 282:
Page 283 and 284:
4 Reflections 251 Lee: Issues in Ba
Page 285 and 286:
Page 287 and 288:
255 Opgenort: Tilung and its positi
Page 289 and 290:
Page 291 and 292:
PK Hayu Sunwar Bahing Jero Wambule
Page 293 and 294:
Bahing po, Wambule pa, Dumi poʔo T
Page 295 and 296:
Page 297 and 298:
Page 299 and 300:
hear Gloss lie down one PK Hayu Sun
Page 301 and 302:
Page 303:
Page 306 and 307:
Page 308 and 309:
Page 310 and 311:
Page 312 and 313:
Page 314 and 315:
Page 316 and 317:
Page 318 and 319:
Page 320 and 321:
Himalayan Linguistics, Vol 10(1) Fi
Page 322 and 323:
Himalayan Linguistics, Vol 10(1) Si
Page 324 and 325:
Himalayan Linguistics, Vol 10(1) re
Page 326 and 327:
Page 328 and 329:
Page 331 and 332:
Page 333 and 334:
301 Bergqvist: Review of Chelliah a
show all

himalayan linguistics - UCSB Linguistics - University of California ...

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?