PDF (Online Text) - EURAC

this complexity precisely when we allow writing systems to refer to each other recursively. Thus Braille, like other transliteration systems, is represented as a writing system with its own independent locality, script, standard (e.g., contracted and non-contracted), and time period. The language of the transliteration and that of the referenced writing system are nevertheless the same, although XNLRDF would allow the language to change for the transliteration as well. A transliteration is thus marked by a reference to another writing system and, in the descriptive data, by mappings between the two systems in the form of a mapping table (e.g., between Bokmål and Bokmål Braille). Mappings between writing systems are a natural component in the description of all writing systems, even if they do not represent transliterations of each other (e.g., mappings between hànyǔ pīnyīn, Wade-Giles, Gwoyeu Romatzyh and bopomofo/zhùyīn fúhào). The Braille of Mandarin in the People’s Republic, incidentally, is a transliteration of hànyǔ pīnyīn.

To sum up, the metadata needed to identify the appropriate or best NLP resources for the processing of a text document are much more complex than what current standards have defined. In other words, relying on only one part of the metadata, such as the Unicode scripts or the language codes combined with locality codes, is not always accurate and thus not completely reliable for automated NLP tasks. If NLP technologies on the Web have, until now, not suffered from this important misconception (e.g., in the metadata specification in the HTTP or XML header), it is either because they target about two dozen common languages (applying default assumptions that prevent less frequently used writing systems from being correctly processed), or because a linguistically informed human mediates between documents and resources.

2. XNLRDF and Information Needs Beyond Unicode

Let us assume, for expository purposes, that an NLP application can correctly identify the writing system of a document to be processed, and that this writing system contains references to Unicode scripts or code points. In effect, little follows from this, as Unicode defines only a very limited number of operations, and defines them only for a script and not for a writing system. The task of XNLRDF is thus, first, to reformulate the operations defined in Unicode in terms of a writing system and, secondly, to enlarge the linguistic material so that more operations than those defined in Unicode become possible.
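
To make the notion of a writing system that refers to another writing system more concrete, the following is a minimal sketch in Python. It is not XNLRDF's actual schema: the class `WritingSystem`, its field names, the function `transliterate` and the attribute values are assumptions introduced for illustration only, and the mapping table is a three-letter toy excerpt.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class WritingSystem:
    """Hypothetical record; the field names are illustrative, not XNLRDF's."""
    language: str
    locality: str
    script: str
    standard: str
    period: str
    # Set only when this writing system transliterates another one.
    transliterates: Optional["WritingSystem"] = None
    # Mapping table from units of the referenced system to units of this one.
    mapping: Dict[str, str] = field(default_factory=dict)


def transliterate(text: str, target: WritingSystem) -> str:
    """Rewrite text of the referenced system with the target's mapping table."""
    if target.transliterates is None:
        raise ValueError("target is not marked as a transliteration")
    keys = sorted(target.mapping, key=len, reverse=True)  # longest match first
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(target.mapping[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # no mapping known: copy the unit unchanged
            i += 1
    return "".join(out)


# Assumed example values; only the Bokmål / Bokmål Braille pairing is from the text.
bokmal = WritingSystem("Bokmål", "Norway", "Latin", "standard", "contemporary")
bokmal_braille = WritingSystem(
    language=bokmal.language,      # same language as the referenced system
    locality=bokmal.locality,
    script="Braille",
    standard="non-contracted",
    period="contemporary",
    transliterates=bokmal,         # the reference that marks a transliteration
    mapping={"a": "\u2801", "b": "\u2803", "c": "\u2809"},  # toy excerpt: a, b, c
)

print(transliterate("abc", bokmal_braille))
```

Keeping the reference and the mapping table on the transliterating system, rather than on the referenced one, mirrors the description above: a transliteration is recognised by the fact that it points at another writing system.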
2.1 Unicode and Characters: Uppercasing, Lowercasing and Sorting

Contrary to a common-sense understanding of Unicode, the conceptual design of Unicode avoids the notion of character, since this is a language-specific notion, and languages are not covered by Unicode. Unicode refers instead to code elements, which frequently coincide with characters, but also include combining characters such as diacritics. Characters and code elements differ further if ligatures (Dutch ‘ij’, Spanish ‘ll’, ‘ch’, Belorussian Lacinka ‘dz’) are to be treated as one character in a language. Uppercasing of ligatures is thus essentially undefined: without knowledge of the requirements of the writing system, an implementation will uniformly produce either ‘Xy’ or ‘XY’ from ‘xy’. It is thus obvious that specifying the character set of writing systems and describing the mapping between the characters (e.g., for uppercasing and lowercasing) is one principal task in XNLRDF, just as lowercasing, for example, is an important step in the normalisation of a string (e.g., for a lexicon lookup or information retrieval).

Similarly, the sorting of characters, the second operation defined in Unicode (e.g., for the purpose of presenting dictionary entries or creating indices), depends on the writing system, and can only be approximately defined on the basis of the script. Thus, Unicode might successfully sort ‘a’ before ‘b’, but already the position of ‘á’ after ‘a’ or after ‘z’ is specific to each writing system. Another example is the Spanish ‘ll’. Although it is no longer considered a character, it maintains its specific position between ‘l’ and ‘m’ in a sorted list. Thus, sorting requires basic writing system-specific information, which XNLRDF sets out to describe.
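
The dependence of casing and sorting on the writing system can be sketched as follows. This is a minimal illustration, not XNLRDF code: the function names, the `whole_unit_upper` flag and the `SPANISH_TRADITIONAL` alphabet (without accented vowels) are assumptions made for the example, and the ordering follows the placement of ‘ll’ between ‘l’ and ‘m’ described above.

```python
def split_characters(word, multigraphs):
    """Split a word into writing-system characters, longest match first."""
    ordered = sorted(multigraphs, key=len, reverse=True)
    units, i = [], 0
    while i < len(word):
        for unit in ordered:
            if word.startswith(unit, i):
                units.append(unit)
                i += len(unit)
                break
        else:
            units.append(word[i])
            i += 1
    return units


def capitalise(word, multigraphs, whole_unit_upper):
    """Uppercase the first character as the writing system defines it."""
    units = split_characters(word.lower(), multigraphs)
    if not units:
        return word
    first = units[0]
    # 'xy' becomes 'XY' or 'Xy' depending on the writing system's rule.
    first = first.upper() if whole_unit_upper else first[0].upper() + first[1:]
    return first + "".join(units[1:])


# Illustrative alphabet (an assumption): 'ch' and 'll' occupy their own slots.
SPANISH_TRADITIONAL = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j",
                       "k", "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s",
                       "t", "u", "v", "w", "x", "y", "z"]


def collation_key(word, alphabet):
    """Sort key built from the position of each character in the alphabet."""
    rank = {unit: position for position, unit in enumerate(alphabet)}
    multigraphs = [unit for unit in alphabet if len(unit) > 1]
    units = split_characters(word.lower(), multigraphs)
    return [rank.get(unit, len(alphabet)) for unit in units]  # unknowns last


words = ["luz", "llama", "mano", "lado"]
print(capitalise("ijsland", ["ij"], whole_unit_upper=True))
print(capitalise("llama", ["ch", "ll"], whole_unit_upper=False))
print(sorted(words))  # code-point order
print(sorted(words, key=lambda w: collation_key(w, SPANISH_TRADITIONAL)))
```

Sorting by code point places ‘llama’ before ‘luz’, whereas the writing-system-specific key places it after; the only extra input needed is the list of characters and their order, which is exactly the kind of information a writing-system description would have to supply.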
The Spanish example also shows that the definition of collating sequences for the characters of a writing system is independent of the status of the character (base character, composed character, contracted character, contracted non-character, context-sensitive character, foreign character, swap character, etc.).

2.2 Linguistic Information: What Else?

The operations covered by Unicode are limited, and most NLP applications would require much more linguistic knowledge when processing documents in potentially unknown writing systems. First, an application might need to identify the encoding (e.g., KOI8-R), the script (Cyrillic), the language (Russian), the standard (civil script), and the orthography (after 1917) of a document. Part of this information might be drawn from the metadata available in the document, from the Unicode range, or from the URL of a document (in our example, http://xyz.xyz.ru), but filling in the remaining gaps (e.g., mapping from the encoding KOI8-R to the language Russian, from the language to potential scripts, or from a script to a language) requires background information about the legacy encodings and writing systems. This background information is
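
As an illustration of how such gaps might be filled, here is a minimal sketch in Python. The three lookup tables stand in for the background information discussed above and are assumptions made for the example, as are the function names; deriving a script label from the first word of a Unicode character name is a rough heuristic, not a Unicode-defined operation. The codec name `koi8_r` is Python's name for the KOI8-R encoding.

```python
import unicodedata
from urllib.parse import urlparse

# Hypothetical background knowledge (toy excerpts, not real XNLRDF data).
ENCODING_TO_LANGUAGES = {"koi8_r": {"Russian"}}
TLD_TO_LANGUAGES = {"ru": {"Russian"}}
LANGUAGE_TO_SCRIPTS = {"Russian": {"Cyrillic"}}


def dominant_script(text):
    """Guess the script from Unicode character names (heuristic)."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            script = unicodedata.name(ch, "UNKNOWN").split()[0].title()
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else None


def guess_metadata(raw, encoding, url=None):
    """Combine encoding, Unicode range and URL hints into candidate metadata."""
    text = raw.decode(encoding)
    script = dominant_script(text)
    candidates = set(ENCODING_TO_LANGUAGES.get(encoding, set()))
    if url:
        host = urlparse(url).hostname or ""
        from_url = TLD_TO_LANGUAGES.get(host.rsplit(".", 1)[-1], set())
        if candidates and from_url:
            candidates &= from_url   # both hints present: keep what they share
        else:
            candidates |= from_url   # only one hint present: use it
    # Keep only languages known to be written in the observed script.
    candidates = {lang for lang in candidates
                  if script in LANGUAGE_TO_SCRIPTS.get(lang, set())}
    return {"encoding": encoding, "script": script,
            "languages": sorted(candidates)}


sample = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("koi8_r")
print(guess_metadata(sample, "koi8_r", url="http://xyz.xyz.ru"))
```

The standard and the orthography are deliberately left out of the sketch: neither the encoding, nor the Unicode range, nor the URL determines them, so they would have to come from further descriptive data about the writing system.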
Page 1:
LULCL 2005 Proceedings of the Lesse