PDF (Online Text) - EURAC


forth in the neighbourhood of the noun candidates. We check for the existence of such patterns in parallel for singular forms and for their potential plural counterparts. The search is approximate, in that it checks for the presence of agreement-bearing elements within a window of up to three words to the left or right of the noun candidate. The rules can, in principle, be triggered by either singular or plural items (with the exception of class 9 versus class 10, where it is preferable to start from the plural).

Table 7 contains an example of a noun guessing query (simplified, as many potential agreement-bearing indicator items are left out), formulated in the notation of the CQP corpus query language, which underlies the CWB Corpus WorkBench (Christ et al. 1999), used in our experiments for corpus representation and as infrastructure. We show (parts of) the queries that extract nouns of classes 7 and 8.

Table 7: Sample Query for the Identification of Noun Candidates of Classes 7 + 8

(
  [word = 'sego|selo|sebatakgomo|...|setšhaba|seatla|sello']
  []{0,2}
  [word = 'sa|se|segolo|sekhwi|sengwe|seo|sona|...']
)
|
(
  [word = 'sa|se|segolo|...']
  []{0,2}
  [word = 'sego|selo|sebatakgomo|...']
);

Notes on the query:
First part of query: candidate se- words as a disjunction, followed at a distance of 0 to 2 by class 7 indicators, noted as a disjunction.
Or (second part of query): choice of indicators, followed at a distance of 0 to 2 by candidate words.
An analogous procedure applies to noun candidates created by replacing se- with di- (plus class 8 concords).

When applied to the 43,000-word corpus sample, the query returns, among others, the results displayed in Table 8.
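The logic of the Table 7 query can also be sketched procedurally. The following Python fragment is only an illustration under stated assumptions, not the authors' implementation (which is a CQP query): the function names are hypothetical, the class 8 indicator set is an assumed placeholder, the token sequence is invented filler rather than real Sepedi text, and the requirement that both the singular and the plural form be attested with an indicator nearby is our reading of the description above.

```python
# Indicator items for class 7, copied from the simplified query in Table 7.
CLASS7_INDICATORS = {"sa", "se", "segolo", "sekhwi", "sengwe", "seo", "sona"}
# ASSUMPTION: placeholder class 8 concords; the source does not list them.
CLASS8_INDICATORS = {"tše", "tša"}


def has_indicator_nearby(tokens, i, indicators, max_gap=2):
    """True if an indicator occurs with at most max_gap tokens between it
    and tokens[i], on either side (mirrors []{0,2} in the CQP query)."""
    lo = max(0, i - (max_gap + 1))
    hi = min(len(tokens), i + max_gap + 2)
    return any(tokens[j] in indicators for j in range(lo, hi) if j != i)


def attested(tokens, form, indicators):
    """True if form occurs at least once with an indicator in its window."""
    return any(
        tok == form and has_indicator_nearby(tokens, i, indicators)
        for i, tok in enumerate(tokens)
    )


def guess_noun_pairs(tokens, candidates, sg_prefix, pl_prefix,
                     sg_indicators, pl_indicators):
    """Keep a candidate only if both it and its hypothetical plural
    (prefix replaced) are attested next to agreement-bearing items."""
    pairs = []
    for sg in candidates:
        pl = pl_prefix + sg[len(sg_prefix):]
        if attested(tokens, sg, sg_indicators) and attested(tokens, pl, pl_indicators):
            pairs.append((sg, pl))
    return pairs


# Invented token sequence ("x" is filler), for illustration only.
tokens = ("selo se x x x x dilo x tše x x x x "
          "sepetše sa x x x x sekelela x x x x dikelela x").split()

pairs = guess_noun_pairs(
    tokens, ["selo", "sepetše", "sekelela"], "se", "di",
    CLASS7_INDICATORS, CLASS8_INDICATORS,
)
print(pairs)
```

In this toy input, sepetše is rejected because no form dipetše occurs, and sekelela is rejected because neither it nor dikelela has an indicator within the window, which parallels the two kinds of rejection shown in Table 8.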
Table 8: Sample Results of Noun Guessing for Classes 7 and 8

Class 7 cands.   Class 8 cands.   N?   Equivalent(s)
selo             dilo             +    thing, things
setšhaba         ditšhaba         +    nation, nations
sello            dillo            +    (out)cry, outcries
sepetše          *dipetše         -    walked
sekelela         dikelela         -    recommend, disappear

The checking tool is robust towards non-existent forms (cf. *dipetše) and towards forms that are not nominal, owing to the context constraint on agreement-bearing items (cf. sekelela versus dikelela). A first qualitative evaluation of the noun guessing routines on all candidates from the 43,000-word corpus sample suggests that the tool fails only on lexicalized irregular forms (e.g. mong - beng, ‘owner(s)’, instead of the hypothetical mong - *bang), and on nouns that, mostly for semantic reasons, do not have both a singular and a plural form (such as Sepedi ‘Pedi language and culture’, or leboa ‘North’). Like the verb guesser, the noun guesser can be, and for quantitative reasons has to be, applied to any new corpus to be annotated.

3.6 Rules for the Disambiguation of Closed Class Items

Given the high degree of ambiguity in closed class items (see section 2.3), there is a major need for disambiguation strategies for these items. Even though a statistical tagger is designed for this type of disambiguation, rule-based preprocessing that leads to at least a partial reduction of ambiguity seems necessary. We use context-based disambiguation rules, in the spirit of Gross and Silberztein’s local grammars (Silberztein 1993) and of rule-based tagging. As with the noun guessing queries, disambiguation rules are implemented as queries in the format of the CQP language. Some extraction rules rely exclusively on lexical contexts (cf. the topmost part of Table 9), while others involve lexemes and word class tagged items (middle row), or a combination of lexical, categorial and morphological constraints (including, for example, the presence of certain affixes; cf. the lower part of Table 9).

The examples in Table 9 all relate to the disambiguation of the form a, the most frequent and most ambiguous item in our sample (cf. Table 5).
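The three kinds of constraint mentioned above (lexical, categorial, morphological) can be combined in a simple first-match rule engine. The sketch below is a hypothetical Python illustration only: the paper implements its rules as CQP queries, and the rules of Table 9 are not reproduced here. All tag names (`TAG_A1`, `TAG_A2`, `TAG_A3`, `N`), context words (`w1`, `x`, `z`) and the prefix `pfx` are invented placeholders, not actual Northern Sotho rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    word: str
    tag: Optional[str] = None  # None: not yet disambiguated


@dataclass(frozen=True)
class Rule:
    target: str                            # ambiguous form the rule applies to
    assign: str                            # tag assigned when the rule fires
    next_words: frozenset = frozenset()    # lexical constraint (right neighbour)
    prev_tags: frozenset = frozenset()     # categorial constraint (left neighbour)
    next_prefix: str = ""                  # morphological constraint (right neighbour)

    def matches(self, tokens, i):
        if tokens[i].word != self.target:
            return False
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        prv = tokens[i - 1] if i > 0 else None
        # An empty constraint is treated as "no restriction".
        if self.next_words and (nxt is None or nxt.word not in self.next_words):
            return False
        if self.prev_tags and (prv is None or prv.tag not in self.prev_tags):
            return False
        if self.next_prefix and (nxt is None or not nxt.word.startswith(self.next_prefix)):
            return False
        return True


def disambiguate(tokens, rules):
    """Apply the first matching rule to each untagged token; tokens that
    match no rule stay untagged (partial reduction of ambiguity)."""
    for i, tok in enumerate(tokens):
        if tok.tag is None:
            for rule in rules:
                if rule.matches(tokens, i):
                    tok.tag = rule.assign
                    break
    return tokens


# PLACEHOLDER rules, ordered from most to least specific.
RULES = [
    Rule("a", "TAG_A1", next_words=frozenset({"w1"})),                   # lexical only
    Rule("a", "TAG_A3", prev_tags=frozenset({"N"}), next_prefix="pfx"),  # tag + affix
    Rule("a", "TAG_A2", prev_tags=frozenset({"N"})),                     # tag only
]

sentence = [Token("x", "N"), Token("a"), Token("pfxword"),
            Token("a"), Token("w1"), Token("a"), Token("z")]
tags = [t.tag for t in disambiguate(sentence, RULES)]
print(tags)
```

Leaving unmatched tokens untagged reflects the stated goal of a partial reduction of ambiguity, with the remaining cases handed on to a statistical tagger.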
