
Proceedings Fonetik 2009 - Institutionen för lingvistik

Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

ican) English (and the system is used increasingly for the prosodic annotation of other languages, too), and good inter-transcriber consistency can be achieved as long as the voice quality analyzed represents normal (modal) phonation.

Certain speech situations, however, seem to consistently produce voice qualities different from modal phonation, and the prosodic analysis of such speech data with traditional ToBI labeling may be problematic. Typical examples are breathy, creaky and harsh voice qualities. Pitch analysis algorithms, which are used to produce a record of the fundamental frequency (f0) contour of the utterance to aid the ToBI labeling, yield a messy or lacking f0 track on non-modal voice segments. Non-modal voice qualities may represent habitual speaking styles or idiosyncrasies of speakers but they are often prosodic characteristics of emotional discourse (sadness, anger, etc.). It is likely, for example, that the speech of a depressed subject is to a significant extent characterized by low f0 targets and creak. Therefore, some special (possibly emotion-specific) speech genres (observed and recorded in clinical settings) might be problematic for traditional ToBI labeling.

A potential modified system would be “4-Tone EVo” – a ToBI-based framework for transcribing the prosody of modal/non-modal voice in (emotional) English. As in the original ToBI system, intonation is transcribed as a sequence of pitch accents and boundary pitch movements (phrase accents and boundary tones). The original ToBI break index tier (with four strengths of boundaries) is also used. The fundamental difference between 4-Tone EVo and the original ToBI is that four main tones (H, L, h, l) are used instead of two (H, L). In 4-Tone EVo, H and L are high and low tones, respectively, as are “h” and “l”, but “h” is a high tone with non-modal phonation and “l” a low tone with non-modal phonation.
Basically, “h” is H without a clear pitch representation in the record of f0 contour, and “l” is a similar variant of L.

Preliminary tests for (emotional) English prosodic annotation have been made using the model, and the results seem promising (Toivanen, 2006). To assess the usefulness of 4-Tone EVo, informal interviews with British exchange students (speakers of southern British English) were used (with permission obtained from the subjects). The speakers described, among other things, their reactions to certain personal dilemmas (the emotional overtone was, predictably, rather low-keyed).

The discussions were recorded in a sound-treated room; the speakers’ speech data was recorded directly to hard disk (44.1 kHz, 16 bit) using a high-quality microphone. The interaction was visually recorded with a high-quality digital video recorder directly facing the speaker. The speech data consisted of 574 orthographic words (82 utterances) produced by three female students (20-27 years old). Five Finnish students of linguistics/phonetics listened to the tapes and watched the video data; the subjects transcribed the data prosodically using 4-Tone EVo. The transcribers had been given a full training course in 4-Tone EVo style labeling. The subjects transcribed the material independently of one another.

As in the evaluation studies of the original ToBI, a pairwise analysis was used to evaluate the consistency of the transcribers: the label of each transcriber was compared against the labels of every other transcriber for the particular aspect of the utterance. The 574 words were transcribed by the five subjects; thus a total of 5740 (574 × 10 pairs of transcribers) transcriber-pair-words were produced.
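
The pairwise analysis described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used in the study: the function names `count_pair_words` and `pairwise_agreement` are hypothetical, and the toy labels are invented for the example rather than taken from the recorded data. It assumes that every transcriber supplies exactly one label per word for the aspect being compared.

```python
from itertools import combinations
from math import comb


def count_pair_words(n_transcribers, n_words):
    """Number of transcriber-pair-words: (unordered pairs of transcribers) x words."""
    return comb(n_transcribers, 2) * n_words


def pairwise_agreement(labels_by_transcriber):
    """Compare each transcriber's label with every other transcriber's label.

    labels_by_transcriber: one label sequence per transcriber, all of the
    same length (one label per word). Returns (agreeing pair-words,
    total pair-words).
    """
    n_words = len(labels_by_transcriber[0])
    if any(len(seq) != n_words for seq in labels_by_transcriber):
        raise ValueError("every transcriber must label the same words")

    agree = 0
    total = 0
    # Each unordered pair of transcribers is visited exactly once.
    for seq_a, seq_b in combinations(labels_by_transcriber, 2):
        for label_a, label_b in zip(seq_a, seq_b):
            total += 1
            if label_a == label_b:
                agree += 1
    return agree, total


# Five transcribers and 574 words give 10 pairs x 574 = 5740 pair-words.
print(count_pair_words(5, 574))

# Invented toy labels (NOT study data): five transcribers, four words,
# using the four 4-Tone EVo tones H, L, h, l.
toy_labels = [
    ["H", "L", "h", "l"],
    ["H", "L", "h", "L"],
    ["H", "L", "H", "l"],
    ["H", "l", "h", "l"],
    ["H", "L", "h", "l"],
]
agree, total = pairwise_agreement(toy_labels)
print(f"{agree}/{total} pair-words agree ({100 * agree / total:.0f} %)")
```

A consistency rate of the kind reported for each aspect (presence or choice of pitch accent, phrase accent, boundary tone, break index) would then be the agreeing pair-words divided by the total pair-words for that aspect.
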
The following consistency rates were obtained: presence of pitch accent (73 %), choice of pitch accent (69 %), presence of phrase accent (82 %), presence of boundary tone (89 %), choice of phrase accent (78 %), choice of boundary tone (85 %), choice of break index (68 %).

The level of consistency achieved for 4-Tone EVo transcription was somewhat lower than that reported for the original ToBI system. However, the differences in the agreement levels seem quite insignificant bearing in mind that 4-Tone EVo uses four tones instead of two!

Gaze direction analysis

Our second proposal concerns the multimodality of a (clinical) situation, e.g. a patient interview, in which (emotional) speech is produced. It seems necessary to record the interactive situation as fully as possible, also visually. In a clinical situation, where the subject’s overall behavior is being (at least indirectly) assessed, it is essential that other modalities than speech be analyzed and annotated. Thus, as far as emotion expression and emotion evaluation in interaction are concerned, the coding of the visually observable behavior of the subject should be a standard procedure. We suggest that, after recording the discourse event with a video recorder, the gaze of the subject is annotated as follows. The gaze of the subject (patient) may
