� 66
<strong>Best</strong> practices <strong>in</strong> the design, creation and dissem<strong>in</strong>ation of speech corpora at The Language Archive Sebastian Drude, Daan Broeder, Peter Wittenburg, Han Sloetjes Max Planck Institute <strong>for</strong> Psychol<strong>in</strong>guistics, The Language Archive, P.O. Box 310, 6500 AH Nijmegen, The Netherlands E-mail: {Sebastian.Drude, Daan.Broeder, Peter.Wittenburg, Han.Sloetjes}@mpi.nl Abstract In the last 15 years, the Technical Group (now: “The Language Archive”, TLA) at the Max Planck Institute <strong>for</strong> Psychol<strong>in</strong>guistics (MPI) has been engaged <strong>in</strong> build<strong>in</strong>g corpora of natural speech and mak<strong>in</strong>g them available <strong>for</strong> further research. The MPI has set standards with respect to archiv<strong>in</strong>g such resources, and has developed tools that are now widely used, or serve as a reference <strong>for</strong> good practice. We cover here core aspects of corpus design, annotation, metadata and data dissem<strong>in</strong>ation of the corpora hosted at TLA. Keywords: annotation software, language documentation, speech corpora 1. Introduction This paper summarizes the central facts concern<strong>in</strong>g speech corpora at the Max Planck Institute <strong>for</strong> Psychol<strong>in</strong>guistics, now under the responsibility of a new unit called “The Language Archive” (TLA 1 ). This unit, besides ma<strong>in</strong>ta<strong>in</strong><strong>in</strong>g the archive proper, also develops software relevant <strong>for</strong> creat<strong>in</strong>g, archiv<strong>in</strong>g and us<strong>in</strong>g language resources, and is <strong>in</strong>volved <strong>in</strong> larger <strong>in</strong>frastructure projects aim<strong>in</strong>g at <strong>in</strong>tegrat<strong>in</strong>g resources and mak<strong>in</strong>g them reliably available. The TLA team is, however, not responsible <strong>for</strong> design<strong>in</strong>g, collect<strong>in</strong>g and creat<strong>in</strong>g the corpora, which is done by researchers. There<strong>for</strong>e this paper covers mostly technical aspects or reports on other aspects from an <strong>in</strong>direct and technical perspective. Most facts reported here are the (prelim<strong>in</strong>ary) result of on-go<strong>in</strong>g and long-term <strong>in</strong>vestments and developments. As such, they are mostly not new unpublished results, but still give a good overview over many of the solutions applied <strong>in</strong> TLA <strong>for</strong> relevant questions about speech corpora. The speech corpora at The Language Archive at the Max Planck Institute <strong>for</strong> Psychol<strong>in</strong>guistics (MPI-PL) have so far ma<strong>in</strong>ly come from two discipl<strong>in</strong>es not mentioned <strong>in</strong> the call <strong>for</strong> papers: language acquisition and l<strong>in</strong>guistic fieldwork on small languages worldwide. Due to their provenience, these corpora differ considerably from usual corpora applied <strong>in</strong> speech technology or other areas traditionally concerned with l<strong>in</strong>guistic corpora. The language acquisition corpora at the MPI have mostly been annotated us<strong>in</strong>g a particular annotation <strong>for</strong>mat, CHAT (MacWh<strong>in</strong>ney 2000), developed and used s<strong>in</strong>ce the early 1980ies <strong>in</strong> the CHILDES project and database. Although applicable to other areas of research, CHAT is tailored to reveal emergent grammatical properties and the adaptive solution of communicative needs by children or second language learners. CHAT can be considered an excellent standard <strong>for</strong> annotat<strong>in</strong>g acquisition corpora, and 1 All underl<strong>in</strong>ed terms refer to entries <strong>in</strong> the references. 67 there are powerful statistical tools <strong>for</strong> this corpus available. However, there are other areas of research that deal with audio and video data that are to be annotated, as <strong>for</strong> <strong>in</strong>stance corpora of natural or elicited speech from the many native languages around the world. As at other centres, also at the MPI-PL such corpora have first been collected <strong>in</strong> field-research <strong>for</strong> purposes of description and comparison. S<strong>in</strong>ce the 1990ies, however, when the threat of ext<strong>in</strong>ction of the overwhelm<strong>in</strong>g part of l<strong>in</strong>guistic diversity, it became obvious that the documentation of endangered and other understudied languages is an important scientific goal <strong>in</strong> its own right, and research programs such as DOBES were established. This even gave rise to a new sub-discipl<strong>in</strong>e of l<strong>in</strong>guistics which is primarily concerned with the build<strong>in</strong>g of multimedia corpora of speech, viz. “Language Documentation” (sometimes “documentary l<strong>in</strong>guistics”). Besides tools and web-services <strong>for</strong> archiv<strong>in</strong>g language data, the technical group at the MPI-PL (now TLA) is engaged <strong>in</strong> develop<strong>in</strong>g a multi-purpose annotation tool <strong>for</strong> speech data, ELAN (Wittenburg et.al. 2006). This tool was first applied <strong>in</strong> the documentation of endangered languages and other l<strong>in</strong>guistic field research, but then proved to be useful <strong>in</strong> the annotation of sign language data and generally <strong>in</strong> the area of multimodal research. The data available <strong>in</strong> the ELAN annotation <strong>for</strong>mat (EAF) as generated by the ELAN tool is suited <strong>for</strong> mach<strong>in</strong>e process<strong>in</strong>g (XML Schema based), and thus it is now at the core of most developments <strong>in</strong> TLA and well supported <strong>in</strong> the TLA archive software (e.g. TROVA, ANNEX). The current contribution focuses on speech corpora as archived at TLA, <strong>in</strong> particular corpora as the result of Language Documentation. We try to address as many topics relevant <strong>for</strong> speech corpora as possible from an archive’s and software development group po<strong>in</strong>t of view. 2. Corpus Design and Curation Consider<strong>in</strong>g the design, but also the management and curation of (speech) corpora, there is an important