19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...


in various research and/or student projects since the 1980s, mostly without any funds. In particular, an earlier corpus of approximately 250,000 words (cf. Pavlidou, 2002) has been incorporated into the current one. It is furthermore important to stress that the Corpus of Spoken Greek is not intended as a “closed” database but as a dynamic corpus: it is enriched with new recordings and transcriptions, while older transcriptions are re-examined and eventually improved.

3. Features

As has already been indicated, the Corpus of Spoken Greek is mainly intended for qualitative analyses of the Greek language and, more specifically, for the study of Greek talk-in-interaction from a CA perspective. Accordingly, in the compilation of the Corpus, particular emphasis has been placed on data from everyday conversations in face-to-face interactions or over the telephone. In addition to informal conversations among friends and relatives, though, the Corpus also includes other, more institutional discourse types, such as recordings of teacher-student interaction in high school classes, television news bulletins, and television panel discussions (cf. Table 2 for the size of data from each discourse type). Sections 3.1 to 3.3 provide further information on the collection of data, transcription, files and metadata.

3.1 Data Collection

The Corpus of Spoken Greek, as mentioned, utilized earlier data collections and transcriptions (cf. Section 2). In all cases, however, we are dealing with naturalistic data that were collected for the most part by MA or PhD students for their theses or by undergraduate students for their semester work. The informal conversations were tape- or video-recorded by one of the participants. In the case of classroom interaction, it was the teachers themselves who recorded their high school classes. The equipment used for the recordings has varied over time: in the beginning only simple tape- or video-recorders were available, while later on digital recorders could be employed.

Recordings of private conversations and classroom interactions were made openly, after the participants had been informed that they would be recorded and had given their consent. In the case of telephone conversations, this information was sometimes given after the calls had taken place. In all cases, the persons who made the recordings were asked to erase anything they wanted and to hand in only those conversations/interactions they would not mind being heard or seen by others. The issue of participants’ consent, though, has been handled with much more rigor and detail in the last 15 years, as there is both a growing sensitivity in Greek society and an official Data Protection Authority for the safeguarding of personal data in Greece. We have therefore been using a written consent form that is signed by all participants in those interactions which are not transmitted publicly.


3.2 Transcription

As is well known, CA lays great emphasis on the detailed representation of spoken discourse via the transcription of the recordings. For CA, transcription is not an automatic, mechanical procedure of the kind accomplished, for example, by various software packages, nor is it confined to the representation of content, as is usually done in the written form of interviews by journalists (cf. also Ochs, 1979). On the contrary, the ‘translation’ of sound into written text requires theoretical elaboration and analysis, presupposes training, and demands multiple examinations/corrections by several people.

The recordings collected for the Corpus of Spoken Greek are therefore meticulously transcribed according to the principles of CA (cf. e.g. Jefferson, 2004; Sacks, Schegloff and Jefferson, 1974; Schegloff, 2007) in several rounds by different people. This is basically an orthographic transcription, in which the marking of overlaps, repairs, pauses, intonational features, etc., is carried out in a relatively detailed manner. To this we add the marking of certain sandhi phenomena and dialectal features. For the disambiguation of prosodic features (e.g. a sudden rise of the voice), we employ the Praat software. Finally, transcriptions are produced as Word documents, using a table format that allows different columns for marking the participants’ names, the line numbers (when necessary), etc. An extract of such a transcription is illustrated in Table 1 (to which the English translation has been added):²

[Table 1: extract of a transcription (speakers: Dimos, Afrod., Yorgos); Greek text and English translation not recoverable]

2 ‘Afrod.’ stands for Afroditi, the name of a female participant. For the symbols used in the transcription, cf. the Appendix in Section 8.
