19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...


difference between corpora that can be seen as ‘finished’, i.e. static, and others that are ‘growing’, i.e. where there is a more or less continuous stream of material being added, often over or after several years. A good example of the first kind is the Corpus of Spoken Dutch (CGN), for which the exploitation and management environment, COREX, was developed by the TLA group. In such a corpus project, where the data collection can be carefully planned and executed in a relatively short time, a large degree of coherence and consistency can be achieved with respect to the data and metadata formats, allowing for efficient exploitation procedures and tool development. The corpus was compiled so as to provide a representative sample of spoken Dutch with many different monological and dialogical text types (CGN Design).

Another example housed at TLA is the Dutch Bilingualism Database (DBD), a curation project that combines several older Second Language Acquisition (SLA) corpora and integrates them in a new overall corpus structure using coherent metadata descriptions. Obviously, the design of this corpus follows a rather narrow focus on SLA research. Unfortunately, no comparable coherence was possible at the annotation level, and hence its usability may be limited.

At the other end of the spectrum are the language documentation projects (such as those in the DOBES program funded by the Volkswagen Foundation), where data collection takes place over longer periods, according to different procedures and without any agreed coding of linguistic phenomena. For such corpora it is a much bigger challenge to achieve any kind of (semantic) interoperability, e.g. for searching for specific events across all corpora. Still, the design with respect to text types and genres is driven by criteria similar to those for the Corpus of Spoken Dutch: the design of these corpora is essentially multi-purpose, because many of the languages will most probably no longer be spoken in a few generations, so that the documentation is the major (or even only) source of data for future studies, including studies in neighbouring fields such as ethnomusicology, ethnobotany, ethnohistory, and anthropology in general. As a result, again, ideally as many different types of communication events as possible are recorded, from traditional texts (of core importance in particular for the speech communities, as this is the most valuable knowledge, which often threatens to be lost first with the aging and death of the older generation) via explanations, descriptions and spontaneous stories to natural conversation. Unlike in the Corpus of Spoken Dutch, however, the content and its relevance for future studies in fields other than linguistics are an important criterion for the selection of recordings. Still, one of the major objectives is for the data to provide the basis for an extensive description (grammar) and for typological studies. The crucial point about fieldwork corpora is that the linguistic system of the languages being documented is often not well understood, so that details of the analysis underlying the annotation may change and improve over time.

In all cases it is important to note that the design and creation of the corpora, including the creation of the metadata content, is done by scientists, not by members of TLA. TLA is responsible for the technical aspects of these corpora, such as data (including metadata) formats and proper archiving.

3. Annotation (including Transcription)

One emergent technical data format for annotated speech is EAF (see above), which is by now the default for most annotation in corpora hosted at TLA. ELAN not only allows audio and video recordings to be annotated, it also allows the use of as many “tiers” (annotation containers without predefined disposition, each representing a layer or type of annotation) as necessary for any given speaker, and lets the data on these tiers be related in technically practical ways, so that annotations can be organized in hierarchical tier structures.
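
The idea of freely defined tiers arranged in a hierarchy can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class names, the field names and the tier names (`utterance@A`, `gloss@A`, and so on) are hypothetical and chosen for illustration; the code neither reads nor writes EAF files and is not ELAN's own data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Annotation:
    """One time-aligned annotation value (times in milliseconds)."""
    start_ms: int
    end_ms: int
    value: str


@dataclass
class Tier:
    """A generic annotation container; its meaning comes only from its name.

    `parent` holds the name of the tier this tier depends on, or None
    for a top-level tier. All names here are illustrative assumptions.
    """
    name: str
    speaker: str
    parent: Optional[str] = None
    annotations: List[Annotation] = field(default_factory=list)


def children_of(tiers: Dict[str, Tier], parent_name: Optional[str]) -> List[str]:
    """Return the names of all tiers whose parent is `parent_name`, sorted."""
    return sorted(t.name for t in tiers.values() if t.parent == parent_name)


def render_hierarchy(tiers: Dict[str, Tier],
                     parent_name: Optional[str] = None,
                     depth: int = 0) -> List[str]:
    """Walk the tier tree depth-first and return one indented line per tier."""
    lines: List[str] = []
    for name in children_of(tiers, parent_name):
        lines.append("  " * depth + name)
        lines.extend(render_hierarchy(tiers, name, depth + 1))
    return lines


def build_example() -> Dict[str, Tier]:
    """Build a hypothetical tier set for one speaker, 'A'."""
    example = [
        Tier("utterance@A", "A"),
        Tier("translation@A", "A", parent="utterance@A"),
        Tier("gloss@A", "A", parent="utterance@A"),
        Tier("gesture@A", "A"),  # independent top-level tier
    ]
    return {t.name: t for t in example}


if __name__ == "__main__":
    print("\n".join(render_hierarchy(build_example())))
```

Because a tier in this sketch carries no built-in type, adding a further layer for the same speaker amounts to adding one more `Tier` entry with a suitable `parent`.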

Thus, ELAN is a generic annotation tool that is applied in various types of research. There is no built-in tier type for speech, phonetic transcription, gesture or any other aspect of communicative events that could be annotated. The result is a flexible transcription environment to which tiers can be added easily and at will.

Being one of the first tools of its kind, ELAN enabled deeper analysis of sign language. It also fostered the study of “paralinguistic” phenomena such as gestures, which increasingly turn out to be intrinsically related to spoken language, making their study indispensable for the understanding of the latter. Today, ELAN is widely adopted even outside these domains and even outside linguistics.

By (linguistic) annotation of data we understand any symbolic representation of properties of the (speech) event represented in the primary data.² By this definition, a transcription of the original utterances in the original language (in phonetic, phonological or, most often, orthographic form, depending on the purpose of the corpus) is also annotation, and it is indeed the most basic and most frequent type of annotation in the corpora at TLA.

In the case of the language documentation corpora, the material is usually only interpretable, and thus useful to users other than members of the speech community, if a translation into a major language is also provided. This constitutes a second important type of annotation (representing semantic properties of the underlying speech events). Together, a transcription and at least one translation, possibly with one further layer of notes or comments, represent what in language documentation can be defined as basic annotation; this is indeed the minimum annotation required in the DOBES program.
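
The notion of basic annotation described above lends itself to a simple completeness check. The following is a minimal sketch, assuming that each recording's annotation layers are available as a mapping from a layer label to its list of values; the labels `transcription`, `translation` and `notes` are hypothetical and are not prescribed names from DOBES or ELAN.

```python
from typing import Dict, List, Set

# Assumption: the two layers that together make up "basic annotation".
# A notes/comments layer is optional and therefore not listed here.
REQUIRED_LAYERS: Set[str] = {"transcription", "translation"}


def missing_basic_layers(layer_values: Dict[str, List[str]]) -> Set[str]:
    """Return the required layers that are absent or contain only blank values."""
    present = {
        layer
        for layer, values in layer_values.items()
        if any(value.strip() for value in values)
    }
    return REQUIRED_LAYERS - present


def has_basic_annotation(layer_values: Dict[str, List[str]]) -> bool:
    """True if a transcription and at least one translation are present."""
    return not missing_basic_layers(layer_values)


if __name__ == "__main__":
    # Hypothetical session: transcribed, but the translation is still empty.
    session = {
        "transcription": ["first utterance", "second utterance"],
        "translation": ["", ""],
        "notes": ["speaker laughs"],
    }
    print(has_basic_annotation(session))
    print(missing_basic_layers(session))
```

Treating a layer that holds only blank values as missing is a design choice of this sketch; an archive could equally decide that an empty but existing layer counts as present.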

There are many other possible levels of linguistic, paralinguistic and non-linguistic annotation. One attempt at systematizing the linguistic levels of annotation has

² In linguistics, primary data are direct representations or results of a speech event, for instance a written text or, in particular, an audio/video recording of a speech event.
