19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Best Practices for Speech Corpora in Linguistic Research Workshop ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

As previously mentioned, the degree of detail and quality<br />

of the transcription varies, depend<strong>in</strong>g on the number of<br />

people work<strong>in</strong>g on the data and their tra<strong>in</strong><strong>in</strong>g, but also<br />

accord<strong>in</strong>g to the research needs of the project and the<br />

necessities that have arisen (cf. e.g. the preparation of<br />

part of the Corpus <strong>for</strong> word search onl<strong>in</strong>e described <strong>in</strong><br />

the next Section). Consequently, not all transcribed texts<br />

appear<strong>in</strong>g <strong>in</strong> Table 2 are on the same level. In particular,<br />

most transcriptions of face-to-face conversations have<br />

been reworked on the average by three to four different<br />

persons. TV panel discussions, on the other hand, have<br />

only undergone transcription by one to two persons.<br />

The Corpus of Spoken Greek is, thus, unique both <strong>in</strong> its<br />

aims/conception but also <strong>in</strong> its make-up, as compared to<br />

the other four exist<strong>in</strong>g Greek corpora (of which only one<br />

conta<strong>in</strong>s spoken discourse files), s<strong>in</strong>ce a great part of it<br />

(ca. 44%, i.e. more than 750,000 words) comes from<br />

<strong>in</strong><strong>for</strong>mal (face-to-face or telephone) conversations.<br />

5. Extensions<br />

As already mentioned, the Corpus of Spoken Greek has<br />

been primarily designed <strong>for</strong> the qualitative analysis of<br />

language and l<strong>in</strong>guistic communication from a CA<br />

perspective. However, a tool has recently been<br />

developed <strong>for</strong> the search of words and phrases,<br />

concordances, statistics, etc., <strong>in</strong> a database consist<strong>in</strong>g of<br />

<strong>in</strong><strong>for</strong>mal conversations among friends and relatives that<br />

amounts to 200,000 words. This tool, which is available<br />

onl<strong>in</strong>e, 4 allows the user to track a word, irrespective of<br />

modifications of its <strong>for</strong>m due to the transcription<br />

conventions. Figure 1 illustrates the ma<strong>in</strong> page of the<br />

tool <strong>for</strong> quantitative analyses:<br />

Figure 1: Ma<strong>in</strong> page of the tool <strong>for</strong> quantitative analyses<br />

When search<strong>in</strong>g <strong>for</strong> a certa<strong>in</strong> word, e.g. � (‘good’,<br />

‘well’), the results are listed <strong>in</strong> contextual environments<br />

consist<strong>in</strong>g of three l<strong>in</strong>es of transcribed text. These l<strong>in</strong>es<br />

may conta<strong>in</strong> utterances belong<strong>in</strong>g to different speakers<br />

4 Website: <br />

26<br />

(symbolized <strong>in</strong> different color icons be<strong>for</strong>e every l<strong>in</strong>e).<br />

Figure 2 illustrates two such contextual environments <strong>for</strong><br />

the word � . It also shows the total number of its<br />

occurrences –617 <strong>in</strong> this case.<br />

Figure 2: Search results<br />

The search <strong>for</strong> a word, e.g. � (‘good’, ‘well’), will<br />

also yield all the words beg<strong>in</strong>n<strong>in</strong>g with the letter<br />

sequence -�- -�, e.g. �� (‘reed’), �<br />

(‘basket’). By us<strong>in</strong>g the asterisk (*), e.g. * � * or<br />

* � , the search will also yield all the words<br />

conta<strong>in</strong><strong>in</strong>g the letter sequence -�- -� or end<strong>in</strong>g <strong>in</strong> it.<br />

F<strong>in</strong>ally, the search <strong>for</strong> pairs of words, e.g. � � or<br />

� , is also possible.<br />

F<strong>in</strong>ally, some statistics can be provided, as <strong>for</strong> example<br />

the ten most frequent words <strong>in</strong> this part of the Corpus<br />

shown <strong>in</strong> Figure 3:<br />

Figure 3: The ten (10) most frequently used words<br />

In addition to the tool <strong>for</strong> quantitative purposes, we are<br />

currently <strong>in</strong> the process of assess<strong>in</strong>g alternative<br />

approaches to cod<strong>in</strong>g multimodal data (e.g. CLAN,<br />

ELAN, EXMARaLDA), so that the available video<br />

record<strong>in</strong>gs can be adequately transcribed and managed<br />

(cf. e.g. Mondada, 2007; Schmidt & Wörner, 2009) <strong>in</strong><br />

ways best suited to the purposes of the Corpus of Spoken<br />

Greek.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!