Best Practices for Speech Corpora in Linguistic Research Workshop ...

switching and code mixing, we have included an additional annotation layer for language alternation, annotated with the tag Wechsel. In addition, the translation of the passage is given in the comment tier of the respective speaker (cf. Fig. 1).

Figure 1: The annotation layer for language alternation in the GeWiss corpus
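The relationship between the alternation layer and the comment tier can be illustrated with a minimal Python sketch. It is not the EXMARaLDA file format and not the project's tooling: the class names `Event` and `SpeakerTiers`, the function `alternation_passages`, the speaker alias and the example utterances are all invented for illustration. The only element taken from the text is the tag Wechsel and the fact that the translation sits in the speaker's comment tier.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One annotation anchored to a time interval (in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class SpeakerTiers:
    """Hypothetical per-speaker container; NOT the EXMARaLDA data model."""
    speaker: str
    verbal: list = field(default_factory=list)       # transcribed speech
    alternation: list = field(default_factory=list)  # 'Wechsel' tags
    comment: list = field(default_factory=list)      # translations


def overlaps(a, b):
    """True if the two events share some stretch of the timeline."""
    return a.start < b.end and b.start < a.end


def alternation_passages(tiers):
    """Pair each 'Wechsel' interval with the overlapping speech and translation."""
    result = []
    for tag in tiers.alternation:
        if tag.text != "Wechsel":
            continue
        spoken = " ".join(e.text for e in tiers.verbal if overlaps(e, tag))
        translation = " ".join(e.text for e in tiers.comment if overlaps(e, tag))
        result.append((spoken, translation))
    return result


# Invented example data: a Polish passage inside a German contribution.
tiers = SpeakerTiers(
    speaker="SPK1",
    verbal=[Event(0.0, 1.5, "also ich denke"),
            Event(1.5, 3.0, "to jest bardzo ważne")],
    alternation=[Event(1.5, 3.0, "Wechsel")],
    comment=[Event(1.5, 3.0, "das ist sehr wichtig")],
)
passages = alternation_passages(tiers)
print(passages)
```

Keeping the alternation tag on its own tier, rather than inline in the transcription, means that all alternation passages can be retrieved with a simple interval lookup, without parsing the transcribed text itself.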

6. The Workflow of GeWiss Corpus Construction

Creating a well-designed corpus, particularly one comprising recorded and transcribed spoken data, is a labour- and cost-intensive project. Specifying the corpus design requires clear ideas about the kinds of research questions the corpus might help to answer and about the kinds of applications it may be used for. Building a consistent corpus within a given time limit, and keeping track of the different tasks involved, also requires a clearly defined workflow.

The workflow of data gathering and preparation in the GeWiss project comprises five complex, consecutive steps, all coordinated by a human corpus manager:

1. Data acquisition and recording: identifying recording opportunities, recruiting participants, requesting written consent, and finally making the recording itself. Recordings were conducted by research assistants who were also present at the speech events as participant observers in order to identify speakers and collect metadata.

2. Data preparation: transferring the recordings to the server, editing and masking them, assigning aliases, and masking the additional materials associated with the recording.

3. Entering the metadata into the EXMARaLDA Corpus Manager: this includes linking the masked recordings and additional materials to the speech event.

4. Transcription: this includes a three-stage correction phase carried out by the transcriber him- or herself and two other transcribers of the project.

5. Final processing of the transcript: additional masking of the recording (if needed), checking for segmentation errors, exporting a segmented transcription, and finally linking the transcription to the speech event in the corpus manager.
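To show how such a strictly ordered workflow can be tracked per speech event, the following Python sketch models the five steps as an ordered enumeration and refuses to complete a step out of sequence. It is an illustration under stated assumptions, not the tooling used in the project: the names `Stage`, `SpeechEvent`, `record_correction` and `complete`, the event identifier and the transcriber labels are all hypothetical. The rule that transcription needs three correction passes by three different people follows the description of step 4.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The five workflow steps, in the order given in the text."""
    ACQUISITION = 1
    PREPARATION = 2
    METADATA_ENTRY = 3
    TRANSCRIPTION = 4
    FINAL_PROCESSING = 5


# Three-stage correction: the transcriber plus two other transcribers.
REQUIRED_CORRECTIONS = 3


@dataclass
class SpeechEvent:
    """Hypothetical progress record for one recorded speech event."""
    event_id: str
    completed: list = field(default_factory=list)
    correctors: list = field(default_factory=list)

    def next_stage(self):
        """Return the next pending stage, or None when all are done."""
        if len(self.completed) == len(Stage):
            return None
        return Stage(len(self.completed) + 1)

    def record_correction(self, transcriber):
        """Register one correction pass; each pass needs a different person."""
        if transcriber in self.correctors:
            raise ValueError(f"{transcriber} has already corrected this transcript")
        self.correctors.append(transcriber)

    def complete(self, stage):
        """Mark a stage as done, enforcing order and the correction rule."""
        if stage != self.next_stage():
            raise ValueError(f"expected {self.next_stage()!r}, got {stage!r}")
        if stage == Stage.TRANSCRIPTION and len(self.correctors) < REQUIRED_CORRECTIONS:
            raise ValueError("transcription needs three correction passes")
        self.completed.append(stage)


# Usage with invented identifiers.
event = SpeechEvent("EV_001")
for stage in (Stage.ACQUISITION, Stage.PREPARATION, Stage.METADATA_ENTRY):
    event.complete(stage)
for person in ("T1", "T2", "T3"):
    event.record_correction(person)
event.complete(Stage.TRANSCRIPTION)
print(event.next_stage().name)
```

A record of this kind gives the corpus manager a single place to see which step each speech event has reached, which is the bookkeeping that the coordinated workflow relies on.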

At present, the final check and the segmentation of the GeWiss transcriptions are in progress, and the digital processing of the transcribed data has started. Next, the sub-corpora will be compiled and a web interface for online access will be implemented. Through this interface, the GeWiss corpus will be publicly available for research and pedagogical purposes after free registration. The release of the first version of GeWiss is planned for autumn 2012.

7. Conclusion 

We have described the creation of a comparable corpus of spoken academic language produced by native and non-native speakers and recorded in three academic contexts: the German, the English and the Polish context. We presented the design parameters of the GeWiss corpus, the types of metadata collected, the transcription conventions applied and the workflow from data gathering to corpus publication. The GeWiss corpus will be the first publicly available corpus of spoken academic German. Its specific design offers a valuable database for comparative investigations of various kinds. The successful completion of the data acquisition and transcription phase is an important prerequisite for creating a valuable corpus of spoken data for linguistic purposes. The time this phase requires, however, should not be underestimated when planning such corpora.

