19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Best Practices for Speech Corpora in Linguistic Research Workshop ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

2.2 Parameters of the Corpus Design<br />

There are two key parameters determ<strong>in</strong><strong>in</strong>g the structure of<br />

the GeWiss corpus: (1) the type of discourse genre chosen,<br />

and (2) the language constellation of the speakers<br />

recorded.<br />

The two discourse genres <strong>in</strong>cluded <strong>in</strong> the GeWiss corpus<br />

were selected because they were considered to be of great<br />

relevance <strong>in</strong> many academic sett<strong>in</strong>gs (<strong>in</strong> various<br />

discipl<strong>in</strong>es, academic traditions, and l<strong>in</strong>guistic<br />

communities), though, of course, they are far from be<strong>in</strong>g<br />

universal: Conference papers / student presentations were<br />

selected as a key monologic genre. We decided to <strong>in</strong>clude<br />

oral presentations / conference papers held by expert<br />

scholars as well as presentations held by students <strong>in</strong><br />

sem<strong>in</strong>ars to allow <strong>for</strong> the comparison of different levels of<br />

academic literacy. The record<strong>in</strong>gs also <strong>in</strong>clude the<br />

respective follow-up discussions as we regard them as<br />

<strong>in</strong>tegral parts of the communicative events as a whole.<br />

Oral exam<strong>in</strong>ations were chosen as a dialogic genre of<br />

prime importance <strong>for</strong> student success because <strong>in</strong> many<br />

discipl<strong>in</strong>es and countries they are a prerequisite <strong>for</strong><br />

graduation.<br />

With regard to the parameter ‘language constellation of<br />

the speakers’, we took record<strong>in</strong>gs of L1 productions of<br />

German, English and Polish native speakers on the one<br />

hand, and L2 productions <strong>in</strong> German on the other hand.<br />

S<strong>in</strong>ce our ma<strong>in</strong> research <strong>in</strong>terest lies <strong>in</strong> the study of<br />

academic discourse and academic discourse conventions,<br />

we broadly def<strong>in</strong>e the L1 of our participants as the<br />

language <strong>in</strong> which they (predom<strong>in</strong>antly) received their<br />

school education. This still means that <strong>in</strong> certa<strong>in</strong> cases,<br />

more than one L1 can be identified. S<strong>in</strong>ce aspects of the<br />

participants’ language biography are documented as part<br />

of the metadata questionnaire (see below), such more<br />

complex language constellations are made transparent <strong>in</strong><br />

the metadata data base. The L2 productions are recorded<br />

<strong>in</strong> three different academic sett<strong>in</strong>gs (German, British, and<br />

Polish), possibly also reflect<strong>in</strong>g <strong>in</strong>fluences of different<br />

discourse traditions <strong>in</strong> the sense proposed by Koch and<br />

Österreicher (cf. Koch & Österreicher, 2008, 208ff.).<br />

2.3 Corpus Size<br />

The first version of the GeWiss corpus will comprise a<br />

total of about 120 hours of record<strong>in</strong>g, i.e. 60 hours per<br />

genre and 40 hours of data orig<strong>in</strong>at<strong>in</strong>g from German,<br />

English, and Polish academic sett<strong>in</strong>gs respectively.<br />

Table 1 gives an overview of the actual amount of<br />

transcribed data.<br />

Table 1: The actual corpus size of the GeWiss corpus<br />

8<br />

At the moment, the GeWiss corpus counts a total of 447<br />

speakers – 128 from the German academic sett<strong>in</strong>g, 121<br />

from the British and 198 from the Polish.<br />

2.4 Related Work: GeWiss <strong>in</strong> Comparison to<br />

Exist<strong>in</strong>g <strong>Corpora</strong> of Spoken Academic<br />

Language<br />

There are currently no other corpora of spoken academic<br />

German available. The situation is much better with<br />

regard to English where there are three corpora of spoken<br />

academic English to which the GeWiss corpus may be<br />

compared: MICASE (Michigan Corpus of Academic<br />

Spoken English, cf. Simpson et al. 2002), BASE (British<br />

Academic Spoken English, cf. Thompson & Nesi 2001),<br />

and ELFA (English as a L<strong>in</strong>gua Franca <strong>in</strong> Academic<br />

Sett<strong>in</strong>gs, cf. Mauranen & Ranta 2008). The first two<br />

conta<strong>in</strong> ma<strong>in</strong>ly L1 data while ELFA comprises only L2<br />

record<strong>in</strong>gs. GeWiss comb<strong>in</strong>es the different perspectives<br />

taken by MICASE and ELFA with regard to the usage<br />

contexts of a <strong>for</strong>eign language. While MICASE conta<strong>in</strong>s<br />

English L2 data produced only <strong>in</strong> a native (American)<br />

English context and ELFA conta<strong>in</strong>s English L2 data<br />

recorded at F<strong>in</strong>nish universities, i.e. <strong>in</strong> one specific<br />

non-English academic environment, GeWiss comprises<br />

both types of L2 data, thus allow<strong>in</strong>g <strong>for</strong> a comparison of<br />

academic German used by L2 speakers of German both <strong>in</strong><br />

a German context and <strong>in</strong> two non-German academic<br />

contexts.<br />

With regard to the range of discipl<strong>in</strong>es and discourse<br />

genres covered, the GeWiss corpus is much more focussed<br />

(or restricted) than both MICASE and ELFA. The latter<br />

two both conta<strong>in</strong> record<strong>in</strong>gs of a wide range of spoken<br />

academic genres and discipl<strong>in</strong>es. In GeWiss, <strong>in</strong> contrast,<br />

the number of genres covered is conf<strong>in</strong>ed to two<br />

(presentations and oral exam<strong>in</strong>ations) and that of<br />

discipl<strong>in</strong>es to one (philology). With regard to the number<br />

of genres covered, GeWiss is comparable to BASE which<br />

conta<strong>in</strong>s only lectures and sem<strong>in</strong>ars (which were recorded<br />

<strong>in</strong> different discipl<strong>in</strong>es). The restriction regard<strong>in</strong>g genre<br />

and discipl<strong>in</strong>e <strong>in</strong> the GeWiss corpus is, however, a ga<strong>in</strong><br />

when it comes to research <strong>in</strong>terests that <strong>in</strong>clude pragmatic<br />

and textual dimensions. It is also a key feature <strong>for</strong> any<br />

cross-language comparison.<br />

As a comparable corpus, GeWiss differs from MICASE,<br />

BASE, and ELFA <strong>in</strong> that it conta<strong>in</strong>s L1 data from three<br />

different languages (recorded <strong>in</strong> all the selected discourse<br />

genres).<br />

When compar<strong>in</strong>g the GeWiss corpus size to that of<br />

MICASE (200 hours / 1.8 million tokens), BASE (196<br />

hours / 1.6 million tokens) and ELFA (131 hours / 1<br />

million tokens), it is noticeable that GeWiss is somewhat<br />

smaller (a total of 120 hours), but if one compares the<br />

ratio of data per genre the differences are not generally all<br />

that big (cf. Fandrych, Meißner & Slavcheva <strong>in</strong> pr<strong>in</strong>t).<br />

In sum, although the first version of GeWiss is somewhat<br />

smaller than publicly available corpora of English spoken<br />

academic discourse, its specific design offers a valuable<br />

database <strong>for</strong> comparative <strong>in</strong>vestigations.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!