Best Practices for Speech Corpora in Linguistic Research Workshop ...

The Corpus of Spoken Greek: Goals, Challenges, Perspectives

Theodossia-Soula Pavlidou
Aristotle University of Thessaloniki
GR-54124 Thessaloniki, Greece
pavlidou@lit.auth.gr

Abstract

The purpose of the present paper is to introduce the Corpus of Spoken Greek, which has been developed at the Institute of Modern Greek Studies (Manolis Triandaphyllidis Foundation), Aristotle University of Thessaloniki. More specifically, I would like to describe and account for the particularities of this corpus, which is mainly intended for qualitative research purposes from the perspective of Conversation Analysis. I will thus exemplify some of the issues and challenges involved in the development of corpora that consist of naturalistic speech data. As a consequence, I would like to conclude that the idea of “best practices” for speech corpora can only be understood as a function of the research goals, explicit or implicit, that are prominent when compiling a corpus. Moreover, standardization of speech corpora can only be pursued to an extent that allows, on the one hand, comparability with other corpora and usability by a large community of researchers, and, on the other, ensures maintenance of those characteristics that are indispensable for the kind of research that the corpus was originally conceived for.

Keywords: Modern Greek, speech, Conversation Analysis

1. Introduction

While the significance of studying language on the basis of speech has been undisputed for more than a century now and while the necessity for corpora has been long recognized, the international praxis of speech corpora is only slowly catching up with that of written corpora. This is no coincidence, if one takes into account that the compilation of spoken data is much more time-consuming and expensive as it relies to a greater extent on technical equipment.
Moreover, if ‘naturalness’ of the data is one of the goals, the so-called observer’s paradox (more precisely: the attempt to overcome this paradox) often has the consequence that speech corpora comprise public discourse genres, e.g. broadcast lectures, talk shows, news bulletins, etc. Other types of discourse, like everyday conversations or telephone calls, are much more sensitive to the kind of recording (audio vs. video) employed, as regards naturalness or authenticity; moreover, they are much less accessible (e.g. with respect to the participants’ consent to record their interaction and use the recorded material). It is no surprise then that speech corpora based on naturalistic data are relatively rare and small in comparison to written corpora.

In this paper, I would like to discuss some of the problems involved in the development of corpora that consist of naturalistic speech data by way of presenting the Corpus of Spoken Greek of the Institute of Modern Greek Studies (Manolis Triandaphyllidis Foundation), Aristotle University of Thessaloniki. More specifically, I would like to describe and account for the particularities of this corpus which are closely related to its aims, but also to the specific conditions under which it was developed, and point to some issues and challenges.

2. Background

The compilation of a corpus of naturally occurring talk-in-interaction is one of the aims of the research project Greek talk-in-interaction and Conversation Analysis,[1] which is carried out under the author’s direction at the Institute of Modern Greek Studies. The project additionally aims at the study of the Greek language from the perspective of Conversation Analysis and the training of researchers in the theory and practice of Conversation Analysis (in the following, CA for short), an ethnomethodologically informed approach to linguistic interaction (for the aims, principles, methodology, etc., of CA, cf. e.g. Lerner, 2004; Schegloff, 2007).
Given the CA orientation of the project, the Corpus of Spoken Greek is primarily intended for close qualitative rather than quantitative analyses, and it is this objective that lends the corpus its particular characteristics (cf. Section 3). However, as we shall see in Section 5, part of the corpus can also be used for quantitative analyses online.

In its current form (with respect to its conceptualization and the more or less stable composition of the team through the employment of part-time assistants), the project Greek talk-in-interaction and Conversation Analysis has been running for about four years. However, the Corpus of Spoken Greek did not arise out of nowhere, nor was it designed in one go. Rather, it utilized earlier data collections (and transcriptions) carried out by the author

[1] Website:
