19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Best Practices for Speech Corpora in Linguistic Research Workshop ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

is often fluid <strong>in</strong> terms of communicative goals. Corpus<br />

compilers have tackled classification <strong>in</strong> slightly differ<strong>in</strong>g<br />

manners and levels of granularity. While BNC lumps casual<br />

conversation a s<strong>in</strong>gle category, the Cambridge and<br />

Nott<strong>in</strong>gham Corpus of Discourse <strong>in</strong> English (CANCODE)<br />

differentiates such speech events along two axes: the<br />

relational and goal dimensions (McCarthy, 1998:5). One<br />

problem with this scheme is the rigid divide it proposes <strong>for</strong><br />

goal classification and <strong>in</strong> some cases <strong>for</strong> speaker relations.<br />

Conversations <strong>in</strong> real life may be either task or idea oriented<br />

at different specific times (Gu, 2010) or <strong>in</strong>clude differ<strong>in</strong>g<br />

social relations with<strong>in</strong> the same situated discourse. One way<br />

of resolv<strong>in</strong>g the multi-dimensionality and fluidity of<br />

discourse <strong>in</strong> corpus files would be to <strong>in</strong>clude social activity<br />

types <strong>in</strong> metadata.<br />

The discussion above has raised a number of issues <strong>in</strong><br />

spoken corpus compilation, focus<strong>in</strong>g on certa<strong>in</strong> design<br />

features. In the follow<strong>in</strong>g, we highlight aspects of speaker<br />

demographics, and the employment of doma<strong>in</strong> and social<br />

activity criteria <strong>for</strong> genre classification, which is<br />

supplemented with discourse topic and speech act<br />

annotation <strong>in</strong> metadata.<br />

3. Corpus Design, Pragmatic Metadata, and<br />

Sampl<strong>in</strong>g Procedures <strong>in</strong> STC and STCDC<br />

The construction of STC and STCDC as web-based corpora<br />

started at different but close times at the campuses of Middle<br />

East Technical University <strong>in</strong> Ankara and Güzelyurt (<strong>in</strong> 2008<br />

and 2010, respectively; http://www.stc.org.tr/en/;<br />

http://corpus.ncc.metu.edu.tr/? lang=en). STC will function<br />

as a general corpus <strong>for</strong> spoken Turkish <strong>in</strong> Turkey. It thus is a<br />

Variety corpus that <strong>in</strong>cludes demographic variation and<br />

different genres. STCDC is also a Variety corpus that<br />

<strong>in</strong>cludes different genres, but it has a regional dialect slant.<br />

Despite the dialectal focus of STCDC, the two corpora may<br />

be used <strong>for</strong> variation analysis because their sampl<strong>in</strong>g frames<br />

are similar and because they employ the same speaker<br />

metadata and genre classification parameters (see Andersen,<br />

2010: 557).<br />

With respect to speaker metadata, both STC and STCDC<br />

<strong>in</strong>clude standard demographic categories such as place of<br />

birth and residence, age, education, etc. In addition to these,<br />

speakers are described accord<strong>in</strong>g to length of residence <strong>in</strong><br />

different geographical locations (see Ruhi et al. (2010c) <strong>for</strong><br />

a fuller description of the metadata features <strong>in</strong> STC). Both<br />

corpora also document the language profiles of the speakers.<br />

This will allow sift<strong>in</strong>g of speakers accord<strong>in</strong>g to these<br />

parameters <strong>in</strong> future dialect research.<br />

STC and STCDC employ two parameters <strong>in</strong> genre<br />

classification: speaker relations and social activity type. The<br />

major doma<strong>in</strong>s are family, friend, family-friend, educational,<br />

service encounter, workplace, media discourse, legal,<br />

political, public, research, brief encounter, and unclassified.<br />

F<strong>in</strong>er granularity <strong>in</strong> metadata is achieved dur<strong>in</strong>g the corpus<br />

assignment and the various steps <strong>in</strong> the transcription stage <strong>in</strong><br />

STC through annotation of speech acts (e.g. criticiz<strong>in</strong>g), on<br />

18<br />

the one hand, and on the other hand, annotation <strong>for</strong><br />

conversational topics (e.g. child care), speech events (e.g.<br />

troubles talk), and ongo<strong>in</strong>g social activities (e.g., cook<strong>in</strong>g).<br />

In the current state of STC, speech acts are annotated as a<br />

separate category from topic annotation categories, but<br />

Topics as a super-category <strong>in</strong>cludes local conversational<br />

topics and discoursal topics, which, as evident <strong>in</strong> the<br />

examples given above, concern the meso-level of discourse<br />

<strong>in</strong> that they are more discourse activities than topics <strong>in</strong> the<br />

strict sense (see Lev<strong>in</strong>son (1979) on activity types).<br />

The <strong>in</strong>clusion of speech acts and topics as two further<br />

parameters <strong>in</strong> metadata is motivated by the assumption that<br />

comb<strong>in</strong><strong>in</strong>g generic metadata with more specific pragmatic<br />

annotation renders corpora more ‘visible’ <strong>for</strong> pragmatics<br />

research, broadly construed here to <strong>in</strong>clude variation<br />

analysis. Conflat<strong>in</strong>g local and global topics <strong>in</strong> a s<strong>in</strong>gle<br />

category may not be the ideal route. However, given that<br />

speech event, activities and discoursal topic categorizations<br />

are fuzzy notions, 1 we submit that separat<strong>in</strong>g them from<br />

local topics would not add to the worth of the annotation<br />

(<strong>Speech</strong> act and topic annotation will be implemented <strong>in</strong><br />

STCDC Project at a later stage.). Figure 1 is a partial display<br />

of the metadata <strong>for</strong> a file <strong>in</strong> STC:<br />

Figure 1. Partial metadata <strong>for</strong> a communication <strong>in</strong> STC<br />

This metadata annotation design is an improvement <strong>for</strong><br />

pragmatics-oriented research on spoken corpora, where<br />

scholars have underscored the difficulty of retrieval of<br />

speech act realizations through standard searches that more<br />

often than not rely on the extant literature on speech acts <strong>in</strong><br />

various languages (see, e.g., Jucker et al. (2008) <strong>for</strong> an<br />

<strong>in</strong>vestigation of compliments <strong>in</strong> BNC). Identify<strong>in</strong>g speech<br />

act realizations dur<strong>in</strong>g the corpus construction stage can<br />

enhance corpus-driven approaches to pragmatics (broadly<br />

construed here to <strong>in</strong>clude variation analysis) by provid<strong>in</strong>g<br />

start<strong>in</strong>g po<strong>in</strong>ts <strong>for</strong> researchers on where to look <strong>for</strong> the<br />

relevant realizations <strong>in</strong> the ‘jungle’ of corpus data. As has<br />

been noted scholars that critically assess the possibility of<br />

do<strong>in</strong>g corpus-oriented discourse analysis and pragmatics<br />

1<br />

See Adolphs, Knight & Carter (2011) <strong>for</strong> related remarks on<br />

classify<strong>in</strong>g activities <strong>in</strong> corpus.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!