The Corpus of Spoken Greek: Goals, Challenges, Perspectives

Theodossia-Soula Pavlidou
Aristotle University of Thessaloniki
GR-54124 Thessaloniki, Greece
pavlidou@lit.auth.gr

Abstract

The purpose of the present paper is to introduce the Corpus of Spoken Greek, which has been developed at the Institute of Modern Greek Studies (Manolis Triandaphyllidis Foundation), Aristotle University of Thessaloniki. More specifically, I would like to describe and account for the particularities of this corpus, which is mainly intended for qualitative research purposes from the perspective of Conversation Analysis. I will thus exemplify some of the issues and challenges involved in the development of corpora that consist of naturalistic speech data. As a consequence, I would like to conclude that the idea of “best practices” for speech corpora can only be understood as a function of the research goals, explicit or implicit, that are prominent when compiling a corpus. Moreover, standardization of speech corpora can only be pursued to an extent that allows, on the one hand, comparability with other corpora and usability by a large community of researchers, and, on the other, ensures maintenance of those characteristics that are indispensable for the kind of research that the corpus was originally conceived for.

Keywords: Modern Greek, speech, Conversation Analysis

1. Introduction

While the significance of studying language on the basis of speech has been undisputed for more than a century now and while the necessity for corpora has been long recognized, the international praxis of speech corpora is only slowly catching up with that of written corpora. This is no coincidence, if one takes into account that the compilation of spoken data is much more time-consuming and expensive as it relies to a greater extent on technical equipment. Moreover, if ‘naturalness’ of the data is one of the goals, the so-called observer’s paradox (more precisely: the attempt to overcome this paradox) often has the consequence that speech corpora comprise public discourse genres, e.g. broadcast lectures, talk shows, news bulletins, etc. Other types of discourse, like everyday conversations or telephone calls, are much more sensitive to the kind of recording (audio vs. video) employed, as regards naturalness or authenticity; moreover, they are much less accessible (e.g. with respect to the participants’ consent to record their interaction and use the recorded material). It is no surprise then that speech corpora based on naturalistic data are relatively rare and small in comparison to written corpora.

In this paper, I would like to discuss some of the problems involved in the development of corpora that consist of naturalistic speech data by way of presenting the Corpus of Spoken Greek of the Institute of Modern Greek Studies (Manolis Triandaphyllidis Foundation), Aristotle University of Thessaloniki.
More specifically, I would like to describe and account for the particularities of this corpus which are closely related to its aims, but also to the specific conditions under which it was developed, and point to some issues and challenges.

2. Background

The compilation of a corpus of naturally occurring talk-in-interaction is one of the aims of the research project Greek talk-in-interaction and Conversation Analysis,¹ which is carried out under the author’s direction at the Institute of Modern Greek Studies. The project additionally aims at the study of the Greek language from the perspective of Conversation Analysis and the training of researchers in the theory and practice of Conversation Analysis (in the following, CA for short), an ethnomethodologically informed approach to linguistic interaction (for the aims, principles, methodology, etc., of CA, cf. e.g. Lerner, 2004; Schegloff, 2007). Given the CA orientation of the project, the Corpus of Spoken Greek is primarily intended for close qualitative rather than quantitative analyses, and it is this objective that lends the corpus its particular characteristics (cf. Section 3). However, as we shall see in Section 5, part of the corpus can also be used for quantitative analyses online.

In its current form (with respect to its conceptualization and more or less stable composition of the team through the employment of part-time assistants), the project Greek talk-in-interaction and Conversation Analysis has been running for about four years. However, the Corpus of Spoken Greek did not arise out of nowhere, nor was it designed in one shot. Rather, it utilized earlier data collections (and transcriptions) carried out by the author

¹ Website: