Best Practices for Speech Corpora in Linguistic Research Workshop ...


switching and code mixing, we have included an additional annotation layer for language alternation, annotated with the tag Wechsel. In addition, the translation of the passage is given in the comment tier of the particular speaker (cf. fig. 1).

Figure 1: The annotation layer for language alternation in the GeWiss corpus

6. The Work Flow of the GeWiss Corpus Construction

Creating a well-designed corpus, and in particular one comprising (recorded and transcribed) spoken data, is a labour- and cost-intensive project. Specifying the corpus design requires clear ideas about the kinds of research questions the corpus might help to answer as well as the kinds of applications it may be used for. Building a consistent corpus within a given time limit, while keeping track of the different tasks involved, also requires a straightforward workflow.

The workflow of data gathering and preparation in the GeWiss project comprises five complex, consecutive steps, all coordinated by a human corpus manager:

1. Data acquisition and recording – including the enquiry of recording opportunities, the recruitment of participants, the request for written consent, and finally the recording itself, conducted by research assistants who were also present as participant observers in the speech events in order to identify speakers and collect metadata;

2. Data preparation – including the transfer of the recordings to the server, the editing and masking, the assignment of aliases and the masking of the additional materials associated with the recording;

3. Entering the metadata into the EXMARaLDA Corpus Manager – including the linking of the masked recordings and additional materials to the speech event;

4. Transcription – including a three-stage correction phase carried out by the transcriber him-/herself and two other transcribers of the project;

5. Final processing of the transcript – including the additional masking of the recording (if needed), the check for segmentation errors, the export of a segmented transcription and finally the linking of the transcription to the speech event in the corpus manager.
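
The five steps above form a fixed sequence for each recording, so their progress could be tracked with a small state model. The following Python sketch is only an illustration under that assumption and does not describe the project's actual tooling, which relies on the EXMARaLDA Corpus Manager. The names `Stage`, `Recording`, `advance` and the alias `EV_001` are hypothetical.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The five workflow steps, in the order listed above."""
    ACQUISITION = 1       # recording, written consent, metadata collection
    PREPARATION = 2       # transfer to server, editing, masking, aliases
    METADATA_ENTRY = 3    # entry into the corpus manager, linking materials
    TRANSCRIPTION = 4     # includes the three-stage correction phase
    FINAL_PROCESSING = 5  # segmentation check, export, linking transcript


@dataclass
class Recording:
    """Hypothetical record for one speech event (not a GeWiss data structure)."""
    alias: str
    completed: list = field(default_factory=list)

    def next_stage(self):
        """Return the next pending stage, or None when all five are done."""
        if len(self.completed) == len(Stage):
            return None
        return Stage(len(self.completed) + 1)

    def advance(self, stage):
        """Mark a stage as done; stages must be completed in order."""
        expected = self.next_stage()
        if stage != expected:
            raise ValueError(f"expected {expected!r}, got {stage!r}")
        self.completed.append(stage)


rec = Recording(alias="EV_001")  # invented alias, for illustration only
rec.advance(Stage.ACQUISITION)
rec.advance(Stage.PREPARATION)
print(rec.next_stage().name)  # METADATA_ENTRY
```

Enforcing the order in `advance` mirrors the dependency between the steps: a transcript cannot be linked to a speech event, for instance, before the metadata for that event has been entered.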

At present, the final check and the segmentation of the GeWiss transcriptions are in progress and the digital processing of the transcribed data has started. After that, the sub-corpora will be built and an interface for online access will be implemented. Through this web interface the GeWiss corpus will be publicly available for research and pedagogical purposes after free registration. The release of the first version of GeWiss is planned for autumn 2012.

7. Conclusion<br />

We have described the creation process of a comparable corpus of spoken academic language data produced by native and non-native speakers recorded in three different academic contexts, i.e. the German, English and Polish contexts. We presented the parameters for the design of the GeWiss corpus, the types of metadata collected, the transcription conventions applied and the workflow from data gathering to corpus publication. The GeWiss corpus will be the first publicly available corpus of spoken academic German. Its specific design offers a valuable database for comparative investigations of various kinds. The successful completion of the phase of data acquisition and transcription is an important prerequisite for the creation of a valuable corpus of spoken data for linguistic purposes. The associated expenditure of time, however, should not be underestimated in the planning stage of such corpora.

