19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Best practices in the design, creation and dissemination of speech corpora at The Language Archive

Sebastian Drude, Daan Broeder, Peter Wittenburg, Han Sloetjes

Max Planck Institute for Psycholinguistics,

The Language Archive,

P.O. Box 310, 6500 AH Nijmegen, The Netherlands

E-mail: {Sebastian.Drude, Daan.Broeder, Peter.Wittenburg, Han.Sloetjes}@mpi.nl

Abstract

In the last 15 years, the Technical Group (now: “The Language Archive”, TLA) at the Max Planck Institute for Psycholinguistics (MPI) has been engaged in building corpora of natural speech and making them available for further research. The MPI has set standards with respect to archiving such resources, and has developed tools that are now widely used, or serve as a reference for good practice. We cover here core aspects of corpus design, annotation, metadata and data dissemination of the corpora hosted at TLA.

Keywords: annotation software, language documentation, speech corpora

1. Introduction

This paper summarizes the central facts concerning speech corpora at the Max Planck Institute for Psycholinguistics, now under the responsibility of a new unit called “The Language Archive” (TLA¹). This unit, besides maintaining the archive proper, also develops software relevant for creating, archiving and using language resources, and is involved in larger infrastructure projects aiming at integrating resources and making them reliably available. The TLA team is, however, not responsible for designing, collecting and creating the corpora, which is done by researchers. Therefore this paper covers mostly technical aspects, or reports on other aspects from an indirect and technical perspective. Most facts reported here are the (preliminary) result of ongoing, long-term investments and developments. As such, they are for the most part not new, unpublished results, but they still give a good overview of many of the solutions applied at TLA to relevant questions about speech corpora.

The speech corpora at The Language Archive at the Max Planck Institute for Psycholinguistics (MPI-PL) have so far mainly come from two disciplines not mentioned in the call for papers: language acquisition and linguistic fieldwork on small languages worldwide.

Due to their provenance, these corpora differ considerably from the corpora typically used in speech technology or in other areas traditionally concerned with linguistic corpora.

The language acquisition corpora at the MPI have mostly been annotated using a particular annotation format, CHAT (MacWhinney 2000), developed and used since the early 1980s in the CHILDES project and database. Although applicable to other areas of research, CHAT is tailored to reveal emergent grammatical properties and the adaptive solution of communicative needs by children or second language learners. CHAT can be considered an excellent standard for annotating acquisition corpora, and powerful statistical tools are available for this corpus.

¹ All underlined terms refer to entries in the references.

However, there are other areas of research that deal with audio and video data to be annotated, for instance corpora of natural or elicited speech from the many native languages around the world. As at other centres, such corpora were first collected at the MPI-PL in field research for purposes of description and comparison. Since the 1990s, however, when the threat of extinction facing the overwhelming part of linguistic diversity was recognized, it has become obvious that the documentation of endangered and other understudied languages is an important scientific goal in its own right, and research programs such as DOBES were established. This even gave rise to a new sub-discipline of linguistics which is primarily concerned with the building of multimedia corpora of speech, viz. “Language Documentation” (sometimes “documentary linguistics”).

Besides tools and web services for archiving language data, the technical group at the MPI-PL (now TLA) is engaged in developing a multi-purpose annotation tool for speech data, ELAN (Wittenburg et al. 2006). This tool was first applied in the documentation of endangered languages and other linguistic field research, but then proved to be useful in the annotation of sign language data and generally in the area of multimodal research. The data available in the ELAN annotation format (EAF), as generated by the ELAN tool, is suited for machine processing (it is based on an XML Schema), and thus it is now at the core of most developments in TLA and well supported in the TLA archive software (e.g. TROVA, ANNEX).
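To illustrate what "suited for machine processing" can mean in practice, the following is a minimal sketch that reads time-aligned annotations from an EAF-like XML document using only the Python standard library. The sample fragment is hand-written for illustration and is not taken from any TLA corpus; the element and attribute names (TIME_SLOT, TIER, ALIGNABLE_ANNOTATION, ANNOTATION_VALUE) follow the EAF layout as we understand it and should be checked against the published schema. The function name `read_alignable_annotations` is hypothetical and not part of ELAN or any TLA tool.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily simplified fragment modelled on the EAF layout.
# Assumption: real EAF files carry additional header, linguistic-type and
# reference-annotation information that is omitted here.
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
                            TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>world</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_alignable_annotations(eaf_xml):
    """Return (tier_id, start_ms, end_ms, text) tuples in document order.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(eaf_xml)

    # Map each time slot ID to its value in milliseconds. Slots without a
    # TIME_VALUE attribute are skipped and later resolve to None.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), start, end, text))
    return rows


annotations = read_alignable_annotations(EAF_SAMPLE)
for tier_id, start, end, text in annotations:
    print(tier_id, start, end, text)
```

Because annotations refer to shared time slots rather than storing times directly, the sketch first resolves the slot table and only then walks the tiers; this indirection is an assumption about the format that the sample is built to reflect.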

The current contribution focuses on speech corpora as archived at TLA, in particular corpora resulting from Language Documentation. We try to address as many topics relevant for speech corpora as possible from the point of view of an archive and software development group.

2. Corpus Design and Curation

Considering the design, but also the management and curation of (speech) corpora, there is an important
