19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Best Practices for Speech Corpora in Linguistic Research Workshop ...

Best Practices for Speech Corpora in Linguistic Research Workshop ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

annotation. <strong>Research</strong>ers have often criticized corpus<br />

metadata models that present <strong>in</strong><strong>for</strong>mation only <strong>in</strong> file<br />

headers (e.g. Archer, Culpeper & Davies, 2008). By<br />

enabl<strong>in</strong>g access to metadata at the site of the tokens<br />

retrieved, EXAKT responds to this crucial need <strong>in</strong> variation<br />

analysis. Figure 3 illustrates a token of the word tamam<br />

‘alright, okay’ <strong>in</strong> STC, along with a select number of<br />

metadata categories (Metadata criteria can be selected<br />

accord<strong>in</strong>g to research questions.). Figure 4 displays a standoff<br />

annotation of another token of tamam <strong>for</strong> utterance and<br />

speech act function.<br />

Figure 3: Token and selected conversation features<br />

Figure 4: Utterance and speech act type<br />

5. Transcription Standardization <strong>in</strong> STC and<br />

STCDC<br />

Transcription of language Varieties has been handled <strong>in</strong> two<br />

ways <strong>in</strong> current corpora and <strong>in</strong> discipl<strong>in</strong>es such as<br />

conversation analysis and discourse analysis, namely,<br />

Either produc<strong>in</strong>g dialectal, orthographic<br />

transcription which is as close to the<br />

pronunciation as possible, or produc<strong>in</strong>g a<br />

transcription based on a written standard, but<br />

usually allow<strong>in</strong>g <strong>for</strong> some variation.<br />

Andersen (2010: 554)<br />

As is current practice <strong>in</strong> general corpora, STC follows the<br />

second approach described <strong>in</strong> Andersen (2010) (see, e.g.,<br />

BNC). Standard orthography is employed <strong>in</strong> the v-tier,<br />

except <strong>for</strong> a limited list of words consistently pronounced <strong>in</strong><br />

different ways from the so-called careful pronunciation. For<br />

<strong>in</strong>stance, the word ����������� ‘lady’ is usually pronounced<br />

with the deletion of ��e. Thus, the word is written either as<br />

����������� or hanfendi <strong>in</strong> the v-tier depend<strong>in</strong>g on speaker<br />

pronunciation (<strong>for</strong> documentation of spell<strong>in</strong>g variation <strong>in</strong><br />

STC see Ruhi et al. (2010b) and the corpus web site <strong>for</strong><br />

documents on technical details). Prom<strong>in</strong>ently marked<br />

dialectal and stylistic variants of words and morphological<br />

20<br />

realizations are transcribed <strong>in</strong> the c-tier (see Figure 5a). As<br />

stylistic variation can <strong>in</strong>terleave with dialect shifts,<br />

represent<strong>in</strong>g both the actual pronunciation and its standard<br />

<strong>for</strong>m provides crucial <strong>in</strong><strong>for</strong>mation <strong>for</strong> <strong>in</strong>vestigat<strong>in</strong>g<br />

pragmatic variation.<br />

In STCDC, standardized dialectal orthography is employed<br />

<strong>in</strong> the v-tier and standardized orthography of prom<strong>in</strong>ent<br />

words is written <strong>in</strong> the c-tier. STC and STCDC<br />

transcriptions are thus like mirror images of each other<br />

(compare, Figures 2a,b and 5a,b).<br />

STC and STCDC share the same transcription system<br />

adapted from HIAT (Ruhi et al., 2010b). A major ga<strong>in</strong> of<br />

this procedure will be the possibility to compare dialectal<br />

and standard <strong>for</strong>ms <strong>in</strong> a unitary fashion across Turkey and<br />

Northern Cyprus. STCDC will also <strong>in</strong><strong>for</strong>m standardization<br />

<strong>in</strong> the compilation of special, dialectal computerized corpora<br />

<strong>for</strong> Anatolian dialects, ow<strong>in</strong>g to the fact that dialects <strong>in</strong><br />

Cypriot Turkish share an extensive number of variation<br />

patterns despite variation especially <strong>in</strong> certa<strong>in</strong> <strong>in</strong>flectional<br />

���������� ������������ ������Figures 5a and 5b illustrate<br />

the transcription of the /t/-/d/ and /k/-/g/ variation <strong>in</strong> STCDC<br />

and STC, respectively.<br />

BUR000030 [v] ((0.6)) geç içeri. gel. ((laughs)) �<br />

MUS000031 [v] �� � !<br />

MUS000031 [c] ��<br />

IND000002 [v] ((XXX))<br />

[nn]<br />

Figure 5a: Transcription of /k/-/g/ variation <strong>in</strong> STC<br />

BUR000002 [v] �������������������������������������������<br />

burdan yoksa<br />

BUR000002 [c] ���������� ((lengthen<strong>in</strong>g))<br />

FIK000003 [v]<br />

FIK000003 [c]<br />

[nn] background))<br />

Figure 5b: Transcription of /t/-/d/ variation <strong>in</strong> STCDC<br />

For dialectal variation studies, the transcription<br />

methodology described above is supported through EXAKT.<br />

The tool automatically produces word lists from the v-tier,<br />

which can then be selected <strong>for</strong> search <strong>in</strong> the corpora. Such<br />

searches reveal <strong>in</strong>stances of variation <strong>in</strong> situ, as can be seen<br />

<strong>in</strong> Figures 2a,b and 5a,b.<br />

Commonality <strong>in</strong> standardization <strong>in</strong> STC and STCDC is not<br />

restricted to solely the transcription of the word. The two<br />

corpora also employ similar annotation conventions <strong>for</strong><br />

discursively significant paral<strong>in</strong>guistic features and nonlexical<br />

contributions to the conversation (see, e.g., the<br />

superscript dot <strong>in</strong> Figure 5a <strong>for</strong> laughter). Figure 5b<br />

illustrates the lengthen<strong>in</strong>g of phonetic units <strong>in</strong> words, which<br />

is described <strong>in</strong> the c-tier <strong>in</strong> both corpora. This methodology<br />

is expected to ease cross-varietal pragmatic research.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!