19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...


been made in the pilot phase of DOBES (Lieb and Drude 2001). But only recently has the need for standards in categorizing and referring to different levels of annotation come to the attention of documentary linguists and technicians. The DOBES corpora do not have any agreed format or naming for the types of linguistic levels to be represented in the various annotation layers, and this now presents a major challenge for efforts to make the different corpora comparable and interoperable.

True, a very popular format of annotation of the original recordings in the language documentation corpora is basic glossing (footnote 3), in particular the interlinear glosses as formalized by the “Leipzig Glossing Rules”, by now an established (but still developing) standard in language description and typology. These interlinear glosses are often created using the Toolbox program. But even when in principle aiming to follow the Leipzig Glossing Rules, the details of basic glossing in language documentation corpora usually vary quite a bit from corpus to corpus, in particular with respect to the abbreviations applied for abstract functional units. The ISOcat data category repository (Kemps-Snijders et al., 2008) provides a means to clarify the nature of tiers and the meaning of individual glosses. The ISOcat Data Category Registry (ISO 12620) defines linguistic concepts in a way compliant with the ISO/IEC 11179 family of standards. It is hosted at and developed by the TLA. Thus, one can refer to a certain concept independently of its concrete label or abbreviation: “noun”, “N”, “Subs(tantive)” etc. can all refer to the same data category “Noun”, or to different categories which are connected by a relation of “is-roughly-equivalent-to”. Such relationships can be established between different ISOcat data categories with the new RELcat registry (Windhouwer 2012). RELcat is likewise developed by the TLA and is currently in the alpha phase.
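The idea of referring to a concept independently of its label can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `DataCategory`, the identifiers `DC-noun` and `DC-substantive`, and the function `same_concept` are hypothetical and do not reproduce the actual ISOcat or RELcat interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCategory:
    """A linguistic concept identified independently of any label.

    The identifier strings used below are made up for illustration.
    """
    identifier: str
    name: str


NOUN = DataCategory("DC-noun", "Noun")
SUBSTANTIVE = DataCategory("DC-substantive", "Substantive")
VERB = DataCategory("DC-verb", "Verb")

# Corpus-specific labels point to data categories.
LABELS_CORPUS_A = {"noun": NOUN, "N": NOUN, "V": VERB}
LABELS_CORPUS_B = {"Subs": SUBSTANTIVE}

# Relations between *different* categories, in the spirit of a
# relation registry: each entry is an unordered pair.
ROUGHLY_EQUIVALENT = {frozenset({NOUN, SUBSTANTIVE})}


def same_concept(cat_a: DataCategory, cat_b: DataCategory) -> bool:
    """True if both categories are identical or roughly equivalent."""
    return cat_a == cat_b or frozenset({cat_a, cat_b}) in ROUGHLY_EQUIVALENT
```

With these definitions, the labels “N” and “noun” resolve to the same category, while “N” and “Subs” resolve to different categories that are still matched through the equivalence relation.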

In ELAN, both tiers and annotations can refer to a data category. On the tier level, this reference indicates the more general type of the annotations on that tier, e.g. “part-of-speech”; on the annotation level, it indicates the more specific category, e.g. “verb”. The goal is to achieve interoperability between different annotations in different corpora despite the broad variation in annotation tiers, conventions and labels we observe in fieldwork and descriptive linguistics.
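A minimal sketch of how such references could support cross-corpus queries is given below. It does not reflect ELAN's file format or API: the classes `Tier` and `Annotation`, their fields, and the function `find_annotations` are assumptions made for this example, and plain strings stand in for data category references.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Annotation:
    value: str                      # corpus-specific label, e.g. "V"
    category: Optional[str] = None  # specific category, e.g. "verb"


@dataclass
class Tier:
    name: str                       # corpus-specific tier name
    category: Optional[str] = None  # general type, e.g. "part-of-speech"
    annotations: List[Annotation] = field(default_factory=list)


def find_annotations(tiers: List[Tier], tier_category: str,
                     annotation_category: str) -> List[Tuple[str, str]]:
    """Return (tier name, label) pairs selected by category, not by label."""
    return [
        (tier.name, annotation.value)
        for tier in tiers
        if tier.category == tier_category
        for annotation in tier.annotations
        if annotation.category == annotation_category
    ]


# Two corpora with different tier names and different labels.
tier_a = Tier("pos", "part-of-speech",
              [Annotation("V", "verb"), Annotation("N", "noun")])
tier_b = Tier("wordclass", "part-of-speech", [Annotation("vb", "verb")])

hits = find_annotations([tier_a, tier_b], "part-of-speech", "verb")
```

Here the query finds the verb annotations in both corpora although neither the tier names nor the labels agree.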

In addition to basic glossing, some (sub)corpora may have some advanced glossing, which covers one or several of the other linguistic levels, from phonetic via phonological, morphological, syntactic and semantic to pragmatic, or even paralinguistic and non-linguistic levels. Advanced glossings can include, for instance, a phonetic transcription and annotation of the intonation contour, or of the syntactic structure, of grammatical relations, etc. For instance, an emergent standard in the DOBES program is the GRAID annotation (Haig & Schnell 2011).

3 Under basic glossing we understand annotation that, in addition to basic annotation, also provides information on individual units (usually morphs, sometimes words), typically an individual gloss (indication of meaning or function) for each morph / word, and perhaps also categorical information such as a part-of-speech tag (or its equivalents on the morphological level).

Manual annotation of any kind, from segmenting to coding different linguistic and other information, is the most time-consuming step of many workflows. TLA has recently begun developing new annotation functionalities for ELAN that comprise automatic audio & video recognition and semi-automatic annotation, so that modules developed in an NLP context or at the MPI can be “plugged” into ELAN. Such modules include morpheme splitters, Toolbox/FLEX-like annotation support and traditional POS taggers.

4. Metadata & Data Management<br />

With respect to long-term availability (“archiving”) of speech data, TLA has also had a pioneering role. The solutions developed at the MPI are now one important basis for the construction of large infrastructures for digital language research data (for instance, in the CLARIN project).

In the early 2000s, the technical group at the MPI started developing IMDI, an XML-based metadata standard which is geared to observational multimedia data such as language acquisition and field work recordings. This standard was developed in close cooperation with researchers active in the early years of the DOBES research programme. Metadata are stored in separate XML files side by side with the bundle of resources they describe (multimedia files with recordings etc., annotation files in different formats such as Shoebox/Toolbox text files, Transcriber and EAF XML-based files, and a few others). These resources are linked to the metadata file by pointers in the metadata files; today, persistent identifiers (handles) are used in order to guarantee reliable access even if the location of files should change. The bundle of a metadata file together with the resources that are referenced in it and described by it is called a “session”. It may contain just one video, audio or text file, but possibly also dozens of closely related files and, for technical reasons, different versions / encodings etc. of the ‘same’ file.
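Why persistent identifiers keep a session's pointers valid when files move can be illustrated with a small Python sketch. It is not the IMDI schema or the handle system itself: the classes `Session` and `HandleResolver`, the handle string `hdl:example/0001` and the file paths are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Session:
    """A metadata record plus pointers to the resources it describes."""
    name: str
    resource_handles: List[str]  # persistent identifiers, not file paths


class HandleResolver:
    """Maps persistent identifiers to current locations (illustrative only)."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, handle: str, location: str) -> None:
        # Registering an existing handle again updates its location.
        self._locations[handle] = location

    def resolve(self, handle: str) -> str:
        return self._locations[handle]


def resource_locations(session: Session, resolver: HandleResolver) -> List[str]:
    """Look up where each resource of the session currently lives."""
    return [resolver.resolve(handle) for handle in session.resource_handles]


resolver = HandleResolver()
resolver.register("hdl:example/0001", "/archive/old/recording.wav")
session = Session("example_session", ["hdl:example/0001"])

before_move = resource_locations(session, resolver)

# The file is moved; only the resolver entry changes, not the metadata.
resolver.register("hdl:example/0001", "/archive/new/recording.wav")
after_move = resource_locations(session, resolver)
```

The session record is never edited in this example; only the resolver entry changes, which is the property the handles are meant to provide.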

The IMDI metadata schema contains several dedicated data fields for describing speakers and communicative events. This is the major point in which IMDI and, say, OLAC metadata diverge. The IMDI metadata schema has specializations for general speech corpora such as CGN and other TLA corpora such as DBD.

A virtual hierarchy or grouping of sessions in a tree-like structure is achieved by a second type of IMDI metadata file, each representing a node in the tree and pointing to other corpus nodes and / or session IMDI files. In this way, the same set of sessions can be organized by different criteria in parallel.
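The following Python sketch shows how node files that merely point to sessions allow several parallel groupings of the same material. The classes `CorpusNode` and `SessionRef`, the function `collect_sessions` and the file names are assumptions for illustration and are not taken from the IMDI specification.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union


@dataclass
class SessionRef:
    """Pointer to a session metadata file (the path is hypothetical)."""
    path: str


@dataclass
class CorpusNode:
    """A node that only points to other nodes and / or sessions."""
    name: str
    children: List[Union["CorpusNode", SessionRef]] = field(default_factory=list)


def collect_sessions(node: CorpusNode) -> Set[str]:
    """Gather the paths of all sessions reachable from a node."""
    found: Set[str] = set()
    for child in node.children:
        if isinstance(child, SessionRef):
            found.add(child.path)
        else:
            found |= collect_sessions(child)
    return found


s1 = SessionRef("sessions/story_speaker1.imdi")
s2 = SessionRef("sessions/song_speaker1.imdi")
s3 = SessionRef("sessions/story_speaker2.imdi")

# The same three sessions, grouped once by genre and once by speaker.
by_genre = CorpusNode("by genre", [
    CorpusNode("stories", [s1, s3]),
    CorpusNode("songs", [s2]),
])
by_speaker = CorpusNode("by speaker", [
    CorpusNode("speaker 1", [s1, s2]),
    CorpusNode("speaker 2", [s3]),
])
```

Both trees reach exactly the same session files, because the nodes store pointers rather than copies of the sessions.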

The advantage of such a system is that all resources, including metadata, are stored as separate files in the file system, without being stored inside some database or other encapsulated file. For quick access and administration a database is used at TLA, too, but this
