Best Practices for Speech Corpora in Linguistic Research Workshop ...

Best practices in the design, creation and dissemination of speech corpora 

at The Language Archive 

Sebastian Drude, Daan Broeder, Peter Wittenburg, Han Sloetjes 

Max Planck Institute for Psycholinguistics, 

The Language Archive, 

P.O. Box 310, 6500 AH Nijmegen, The Netherlands 

E-mail: {Sebastian.Drude, Daan.Broeder, Peter.Wittenburg, Han.Sloetjes}@mpi.nl 

Abstract 

In the last 15 years, the Technical Group (now: “The Language Archive”, TLA) at the Max Planck Institute for Psycholinguistics (MPI) 

has been engaged in building corpora of natural speech and making them available for further research. The MPI has set standards with 

respect to archiving such resources, and has developed tools that are now widely used, or serve as a reference for good practice. We 

cover here core aspects of corpus design, annotation, metadata and data dissemination of the corpora hosted at TLA. 

Keywords: annotation software, language documentation, speech corpora 

1. Introduction 

This paper summarizes the central facts concerning 

speech corpora at the Max Planck Institute for 

Psycholinguistics, now under the responsibility of a new 

unit called “The Language Archive” (TLA¹). This unit, 

besides maintaining the archive proper, also develops 

software relevant for creating, archiving and using 

language resources, and is involved in larger 

infrastructure projects aiming at integrating resources and 

making them reliably available. The TLA team is, 

however, not responsible for designing, collecting and 

creating the corpora, which is done by researchers. 

Therefore this paper covers mostly technical aspects or 

reports on other aspects from an indirect and technical 

perspective. Most facts reported here are the (preliminary) 

result of on-going and long-term investments and 

developments. As such, they are mostly not new or 

previously unpublished results, but they still give a good overview of 

many of the solutions applied at TLA to relevant 

questions about speech corpora. 

The speech corpora at The Language Archive at the Max 

Planck Institute for Psycholinguistics (MPI-PL) have so 

far mainly come from two disciplines not mentioned in 

the call for papers: language acquisition and linguistic 

fieldwork on small languages worldwide. 

Owing to their provenance, these corpora differ 

considerably from the corpora usually used in speech 

technology or other areas traditionally concerned with 

linguistic corpora. 

The language acquisition corpora at the MPI have mostly 

been annotated using a particular annotation format, 

CHAT (MacWhinney 2000), developed and used since 

the early 1980s in the CHILDES project and database. 

Although applicable to other areas of research, CHAT is 

tailored to reveal emergent grammatical properties and 

the adaptive solution of communicative needs by children 

or second language learners. CHAT can be considered an 

excellent standard for annotating acquisition corpora, and 

powerful statistical tools are available for such 

corpora. 

1 All underlined terms refer to entries in the references. 

However, other areas of research also deal with 

audio and video data that need to be annotated, for 

instance corpora of natural or elicited speech from the 

many native languages around the world. At the MPI-PL, 

as at other centres, such corpora were first 

collected in field research for purposes of description and 

comparison. Since the 1990s, however, when the threat 

of extinction facing the overwhelming part of linguistic 

diversity became apparent, it has been obvious that the documentation of 

endangered and other understudied languages is an 

important scientific goal in its own right, and research 

programs such as DOBES were established. This even 

gave rise to a new sub-discipline of linguistics which is 

primarily concerned with the building of multimedia 

corpora of speech, viz. “Language Documentation” 

(sometimes “documentary linguistics”). 

Besides tools and web-services for archiving language 

data, the technical group at the MPI-PL (now TLA) is 

engaged in developing a multi-purpose annotation tool for 

speech data, ELAN (Wittenburg et al. 2006). This tool 

was first applied in the documentation of endangered 

languages and other linguistic field research, but then 

proved to be useful in the annotation of sign language data 

and generally in the area of multimodal research. Data 

in the ELAN annotation format (EAF), as 

generated by the ELAN tool, are well suited to machine 

processing (the format is XML Schema based), and EAF is therefore now at the 

core of most developments in TLA and well supported in 

the TLA archive software (e.g. TROVA, ANNEX). 
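
To illustrate why an XML-based format lends itself to machine processing, the following minimal Python sketch reads time-aligned annotations from a small EAF-like document. The sample document is invented and heavily simplified for illustration; the element and attribute names follow our understanding of the EAF schema and should be checked against the schema version actually in use. The function name read_aligned_annotations is hypothetical and is not part of ELAN or of any TLA tool.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified EAF-like sample (assumption: element and
# attribute names mirror the EAF schema; real files carry more metadata).
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
                            TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_aligned_annotations(eaf_text):
    """Return (tier_id, start, end, text) tuples for time-aligned annotations.

    Hypothetical helper: times are taken from TIME_VALUE as integers
    (assumed to be milliseconds).
    """
    root = ET.fromstring(eaf_text)

    # Map each time slot identifier to its time value; skip unaligned slots.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), start, end, text))
    return rows


for tier_id, start, end, text in read_aligned_annotations(SAMPLE_EAF):
    print(tier_id, start, end, text)
```

Each returned tuple pairs a tier name with start and end times and the annotation text, a form that can be passed on to search or statistics code without further manual conversion.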

The current contribution focuses on speech corpora as 

archived at TLA, in particular corpora resulting from 

Language Documentation. We try to address as many 

topics relevant to speech corpora as possible from the 

point of view of an archive and software development group. 

2. Corpus Design and Curation 

Considering the design, but also the management and 

curation of (speech) corpora, there is an important
