19.11.2012 Views

Best Practices for Speech Corpora in Linguistic Research Workshop ...



codes marked through special Unicode characters entered through combinations of the F1 and F2 function keys with other characters. This system is described at http://talkbank.org/CABank/codes.html and in MacWhinney and Wagner (2010).

2. Morphological and syntactic lines. The MOR and GRASP programs compute these two annotation lines automatically. The forms on these lines stand in a one-to-one relation with main line forms, excluding retraces and nonwords. This alignment, which is maintained in the XML, permits a wide variety of detailed morphosyntactic analyses. We also hope to use this alignment to provide methods for writing from the XML to a formatted display of interlinear aligned morphological analysis.

3. Phonological line. The %pho line stands in a one-to-one relation with all words on the main line, including retraces and nonwords. This line uses standard IPA coding to represent the phonological forms of words on the main line. To represent elision processes, main line forms may be grouped for correspondence to the %pho line. The Phon program developed by Yvan Rose and colleagues (Rose, Hedlund, Byrne, Wareham, & MacWhinney, 2007; Rose & MacWhinney, in press) is able to directly import and export valid TalkBank XML.

4. Error analysis. In earlier versions of the system, errors were coded on a separate line. However, we have found that it is more effective to code word-level errors directly on the main line, using a system specifically elaborated for aphasic speech at http://talkbank.org/AphasiaBank/errors.doc.

5. Gesture coding. Although programs such as ELAN and Anvil provide powerful methods for gesture coding, we have found that it is often difficult to use these programs to obtain an intuitive understanding of gesture sequences. Simply linking a series of gesture codes to the main line in TalkBank XML is similarly inadequate. To address this need, we have developed a new method of coding through nested coding files linked to particular stretches of the main line. These coding files can be nested indefinitely, but we have found that two levels of embedding are enough for current analysis needs. Examples of these gesture coding methods can be found at http://talkbank.org/CABank/gesture.zip.

6. Special coding lines. CLAN and TalkBank XML also support a wide variety of additional coding lines for speech act coding, analysis of written texts, situational background, and commentary. These coding tiers are aligned only to utterances and not to individual words.
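The two word-level alignment rules described in items 2 and 3 can be illustrated with a minimal Python sketch. It is not part of CLAN or the TalkBank XML tools: the `Token` class, the `kind` labels, the function names, and the sample morphological and phonological forms are all hypothetical simplifications chosen for illustration. The sketch assumes that each main line token has already been classified as a word, a retrace, or a nonword.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Token:
    """A main line token (hypothetical simplification, not the CHAT format)."""
    text: str
    kind: str = "word"  # one of "word", "retrace", "nonword"


def align_mor(tokens: List[Token], mor_forms: List[str]) -> List[Tuple[Token, str]]:
    """Pair morphological forms with main line forms, skipping retraces and nonwords."""
    targets = [t for t in tokens if t.kind == "word"]
    if len(targets) != len(mor_forms):
        raise ValueError("morphological tier is not one-to-one with main line words")
    return list(zip(targets, mor_forms))


def align_pho(tokens: List[Token], pho_forms: List[str]) -> List[Tuple[Token, str]]:
    """Pair phonological forms with every main line token, retraces and nonwords included."""
    if len(tokens) != len(pho_forms):
        raise ValueError("phonological tier is not one-to-one with main line tokens")
    return list(zip(tokens, pho_forms))


# Illustrative utterance: a retraced "the", a filler, then "the dog barks".
main_line = [
    Token("the", "retrace"),
    Token("um", "nonword"),
    Token("the"),
    Token("dog"),
    Token("barks"),
]
mor = ["det|the", "n|dog", "v|bark-3S"]      # made-up tags, one per counted word
pho = ["ðə", "ʌm", "ðə", "dɔg", "bɑrks"]     # made-up IPA, one per token

mor_pairs = align_mor(main_line, mor)
pho_pairs = align_pho(main_line, pho)
```

In this sketch the morphological tier carries three forms while the phonological tier carries five, because only the latter counts the retrace and the filler. Raising an error on a length mismatch reflects the one-to-one requirement stated above.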

8. Dissemination Platforms

The fundamental idea underlying the construction of TalkBank is the notion of data sharing. By pooling their hard-won data together, researchers can generate increasingly accurate and powerful answers to fundamental research questions. The CHILDES and TalkBank web sites are designed to maximize the dissemination of the data, programs, and related methods. Transcript data can be downloaded in .zip format. Media can be downloaded or played back over the web through QuickTime reference movie files. The TalkBank browser allows users to view any TalkBank transcript in the browser and listen to the corresponding audio or see the corresponding video in continuous playback mode, linked at the utterance level. We also provide methods for running CLAN analyses over the web, which we are now supplementing with analyses that use the XML database as served through the Mark Logic interface. To teach the use of the system, we have produced manuals, instructional videos, and PowerPoint demonstrations, which we use in a wide variety of workshops internationally.
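As a minimal sketch of working with a downloaded transcript archive, the following Python code reads every transcript file from a .zip archive held in memory. The function name `read_transcripts`, the assumption that transcripts use the `.cha` extension and UTF-8 encoding, and the toy archive contents are illustrative assumptions; the sketch does not contact the TalkBank site or depend on its directory layout.

```python
import io
import zipfile
from typing import Dict


def read_transcripts(zip_bytes: bytes, suffix: str = ".cha") -> Dict[str, str]:
    """Return {member name: text} for every archive member ending in `suffix`."""
    transcripts: Dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(suffix):
                # Assumes UTF-8 encoded transcripts.
                transcripts[name] = archive.read(name).decode("utf-8")
    return transcripts


def build_demo_archive() -> bytes:
    """Build a toy archive in memory so the sketch runs without a download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("corpus/sample.cha", "@Begin\n*CHI:\thello .\n@End\n")
        archive.writestr("corpus/0readme.txt", "not a transcript")
    return buffer.getvalue()


demo = read_transcripts(build_demo_archive())
```

With a real download, the bytes of the archive would be passed to `read_transcripts` in place of the toy archive built here.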

9. Conclusion<br />

Together, these various TalkBank facilities provide a comprehensive, interoperable set of best practices for the coding of spoken language corpora for research in linguistics, psycholinguistics, speech technology, and related disciplines. New methods and improvements to these practices are continually in development, as we expand the database to include a fuller representation of the many forms of spoken communication.

10. References<br />

Jefferson, G. (1984). Transcript notation. In J. Atkinson & J. Heritage (Eds.), Structures of social action: Studies in conversation analysis (pp. 134-162). Cambridge: Cambridge University Press.

MacWhinney, B. (2008). Enriching CHILDES for morphosyntactic analysis. In H. Behrens (Ed.), Trends in corpus research: Finding structure in data (pp. 165-198). Amsterdam: John Benjamins.

MacWhinney, B., Fromm, D., Forbes, M., & Holland, A. (2011). AphasiaBank: Methods for studying discourse. Aphasiology, 25, 1286-1307.

MacWhinney, B., & Wagner, J. (2010). Transcribing, searching and data sharing: The CLAN software and the TalkBank data repository. Gesprächsforschung, 2, 1-20.

Malvern, D. D., Richards, B. J., Chipere, N., & Durán, P. (2004). Lexical diversity and language development. New York: Palgrave Macmillan.

Rose, Y., & MacWhinney, B. (in press). The Phon and PhonBank initiatives.

Sagae, K., Davis, E., Lavie, A., MacWhinney, B., & Wintner, S. (2010). Morphosyntactic annotation of CHILDES transcripts. Journal of Child Language, 37, 705-729.

Sagae, K., Lavie, A., & MacWhinney, B. (2005). Automatic measurement of syntactic development in child language. Proceedings of the 43rd Meeting of the Association for Computational Linguistics (pp. 197-204). Ann Arbor: ACL.
