PDF (Online Text) - EURAC

• Researchers should be able to address linguistic problems with linguistic solutions. Often, a linguistic problem, or a problem initially described in linguistic terms (e.g. retranscribing the data using a different phoneset), has to be redescribed in programming terms before it can be addressed. Problems should be addressable in the terms in which they occur.

• The toolkit should be increasingly simple to maintain and develop. Over its lifetime, any toolkit increases in functionality: new problems occur and new tasks become possible. If extensions are increasingly difficult to implement, the toolkit eventually disintegrates (e.g. into a library of loosely related scripts), becomes impossible to maintain, and falls into disuse. A well-designed toolkit can avoid this fate.

• The toolkit should encourage its own use and development. Using the toolkit should be preferable to reverting to the bad old ways. Moreover, further use of the toolkit should stimulate researchers to confront it with new problems, and to think of new areas in which the toolkit might be used. If possible, the toolkit should be extensible by the researchers themselves, rather than having to rely on a separate maintainer. In this case, the design of the toolkit should promote the writing of readable, reusable code.

3. A Solution

3.1 Introduction

Our first (and current) attempt at a software artefact that meets these requirements is the SpeechCluster package (Uemlianin 2005a). SpeechCluster is a collection of small programs written in the Python programming language. Python has a very clear, readable syntax, and is especially suited to projects with several programmers, or with novice programmers. As such, it suited our aim of encouraging non-programmers to write their own tools. The SpeechCluster package consists of a main SpeechCluster module with the basic API, and a number of modules that can be used as command-line tools.
The tools are intended to be used as such, but they can also serve as examples, or as a basis for customisation or further programming with SpeechCluster. Table 1 lists the tools currently available as part of SpeechCluster. Below, we look at two of these in more detail before exploring the use of SpeechCluster as an API. Finally, we look at SpeechCluster in a larger system.
Table 1: SpeechCluster command-line tools

Tool        Function
segFake     ‘Fake autosegmentation’ of a speech audio file
segInter    Interpolates labels into a segmented but unlabelled segment tier
segMerge    Merges separate label files
segReplace  Converts labels in a label file
segSwitch   Converts label file format
splitAll    Splits audio/label file pairs

3.2 Using SpeechCluster I: The Tools

a) SegSwitch

SegSwitch is a label file format converter. It converts label files between any of the formats supported by SpeechCluster (currently Praat TextGrid, esps, and the various HTK formats, i.e. the simple .lab format and the multi-file .mlf format).

This kind of format conversion is a very common task. For example, HTK requires files to be in its own esps-like format, but our team prefers to hand-label files in Praat, which outputs its own TextGrid format. Festival uses an esps-like format that is slightly different from HTK’s. SegSwitch has a simple command-line interface (see Table 2), with which single files or whole directories can be converted easily and perfectly.

Table 2: segSwitch usage

Usage:
    segSwitch -i -o
    segSwitch -d -o
Examples:
    segSwitch -i example.lab -o example.TextGrid
    segSwitch -d labDir -o textGrid

A simple facility like this has a remarkable effect on the efficiency of a team. The team no longer has to worry about which file format they have to work in. They can concentrate on the research task, converting files in and out of particular formats as needed. In a sense, the two parts of the work (the research and the bookkeeping) have been separated, and the bookkeeping is done by the tools.

This division of labour is repeated between the tools and the SpeechCluster module itself. As much of the low-level data manipulation as possible is carried out by SpeechCluster, so that the tools themselves can be written in simple, task-oriented terms.
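To make the kind of conversion described above concrete, here is a minimal sketch of one direction only: simple HTK .lab text to an esps-like label file. It is not SpeechCluster's code and does not use its API; the names `Segment`, `read_htk_lab`, `write_esps` and `lab_to_esps` are hypothetical. It assumes that HTK label times are integers in units of 100 ns, and that an esps-like file consists of a `#` header line followed by `end-time colour label` lines, with the colour field set to a placeholder value.

```python
from dataclasses import dataclass
from typing import List

# Assumption: HTK label times are integers in units of 100 ns.
HTK_UNITS_PER_SECOND = 10_000_000


@dataclass
class Segment:
    """One labelled stretch of audio, with times in seconds (hypothetical type)."""
    start: float
    end: float
    label: str


def read_htk_lab(text: str) -> List[Segment]:
    """Parse simple HTK .lab text: one 'start end label' triple per line."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        start, end, label = int(fields[0]), int(fields[1]), fields[2]
        segments.append(
            Segment(start / HTK_UNITS_PER_SECOND, end / HTK_UNITS_PER_SECOND, label)
        )
    return segments


def write_esps(segments: List[Segment], colour: int = 125) -> str:
    """Write esps-like label text: a '#' header, then 'end colour label' lines.

    The colour value is a placeholder, not taken from the paper.
    """
    lines = ["#"]
    for seg in segments:
        lines.append(f"{seg.end:.6f} {colour} {seg.label}")
    return "\n".join(lines) + "\n"


def lab_to_esps(text: str) -> str:
    """Task-level conversion: all low-level work is delegated to the helpers."""
    return write_esps(read_htk_lab(text))
```

The split mirrors the division of labour described in the text: the reader and writer handle the low-level details of each format, so the task-level function is a single line. A fuller converter would add one reader and one writer per supported format around the same intermediate list of segments.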
Table 3 shows the main code for segSwitch (excluding the command-line parsing and the loop over files in a directory): all of the work of file format conversion is done by
Page 1:
LULCL 2005 Proceedings of the Lesse
