Proceedings, FONETIK 2009, Dept. of Linguistics, Stockholm University

…normal and degraded audio. Even though the total results were no better with the AR view than with a normal face, the score for some sentences was higher when the tongue was visible. Grauwinkel et al. (2007) also showed that subjects who had received explicit training, in the form of a video that explained the intra-oral articulator movement for different consonants, performed better in the VCV recognition task in noise than the group who had not received the training and the one who saw a normal face.

An additional factor that may add to the unfamiliarity of the tongue movements is that they were generated with a rule-based visual speech synthesizer in Wik & Engwall (2008) and Grauwinkel et al. (2007). Badin et al. (2008), on the other hand, created the animations based on real movements, measured with Electromagnetic Articulography (EMA). In this study, we investigate whether the use of real movements instead of rule-generated ones has any effect on speech perception results.

It could be the case that rule-generated movements give better support for speech perception, since they are more exaggerated and display less variability. It could, however, also be the case that real movements give better support, because they may be closer to the listeners' conscious or subconscious notion of what the tongue looks like for different phonemes. Such an effect could, e.g., be explained by the direct realist theory of speech perception (Fowler, 2008), which states that articulatory gestures are the units of speech perception, so that perception may benefit from seeing the gestures. The theory is different from, but closely related to, and often confused with, the speech motor theory (Liberman et al., 1967; Liberman & Mattingly, 1985), which stipulates that speech is perceived in terms of gestures that translate to phonemes by a decoder linked to the listener's own speech production. The motor theory has often been criticized (e.g., Traunmüller, 2007) because of its inability to fully explain acoustic speech perception. For visual speech perception, there is on the other hand evidence (Skipper et al., 2007) that motor planning is indeed activated when seeing visual speech gestures: speech motor areas in the listener's brain are activated when seeing visemes, and the activity corresponds to the areas activated in the speaker when producing the same phonemes.

We here investigate audiovisual processing of the more unfamiliar visual gestures of the tongue, using a speech perception test and a classification test. The perception test analyzes the support given by audiovisual displays of the tongue, when they are generated from real measurements (AVR) or synthesized by rules (AVS). The classification test investigates whether subjects are aware of the differences between the two types of animations and whether there is any relation between scores in the perception test and the classification test.

Experiments

Both the perception test (PT) and the classification test (CT) were carried out on a computer with a graphical user interface consisting of one frame showing the animations of the speech gestures and one response frame in which the subjects gave their answers. The acoustic signal was presented over headphones.

The Augmented Reality display

Both tests used the augmented reality side-view of a talking head shown in Fig. 1.
Movements of the three-dimensional tongue and jaw have been made visible by making the skin at the cheek transparent and representing the palate by the midsagittal outline and the upper incisor. Speech movements are created in the talking head model using articulatory parameters, such as jaw opening, shift and thrust; lip rounding; upper lip raise and retraction; lower lip depression and retraction; tongue dorsum raise, body raise, tip raise, tip advance and width. The tongue model is based on a component analysis of data from Magnetic Resonance Imaging (MRI) of a Swedish subject producing static vowels and consonants (Engwall, 2003).

Creating tongue movements

The animations based on real tongue movements (AVR) were created directly from simultaneous and spatially aligned measurements of the face and the tongue for a female speaker of Swedish (Beskow et al., 2003). The Movetrack EMA system (Branderud, 1985) was employed to measure the intraoral movements, using three coils placed on the tongue, one on the jaw and one on the upper incisor. The movements of the face were measured with the Qualisys motion capture system, using 28 reflectors attached to the lower part of the speaker's face. The animations were created by adjusting the parameter values of the talking head to optimally fit the Qualisys-Movetrack data (Beskow et al., 2003).
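The last step described above, adjusting the talking head's parameter values to optimally fit the combined Qualisys-Movetrack data, is in essence a frame-by-frame optimization problem. The sketch below illustrates one way such a fit could be set up. The parameter names follow the list given earlier in this section, but the linear forward model, the SciPy least-squares solver and all dimensions are illustrative assumptions, not the actual procedure of Beskow et al. (2003).

import numpy as np
from scipy.optimize import least_squares

# Articulatory parameters of the talking head, as listed in the text.
PARAMETERS = [
    "jaw_opening", "jaw_shift", "jaw_thrust",
    "lip_rounding", "upper_lip_raise", "upper_lip_retraction",
    "lower_lip_depression", "lower_lip_retraction",
    "tongue_dorsum_raise", "tongue_body_raise", "tongue_tip_raise",
    "tongue_tip_advance", "tongue_width",
]

N_MARKERS = 33  # e.g. 28 face reflectors + 5 EMA coils (illustrative count)
rng = np.random.default_rng(0)

# Hypothetical linear forward model: marker positions = rest shape + basis @ params.
# A real talking head would use its own (generally non-linear) articulatory model here.
rest_shape = rng.normal(size=(N_MARKERS, 3))
basis = rng.normal(scale=0.1, size=(N_MARKERS * 3, len(PARAMETERS)))

def predict_markers(params: np.ndarray) -> np.ndarray:
    """Map articulatory parameter values to 3-D marker/coil positions."""
    return rest_shape + (basis @ params).reshape(N_MARKERS, 3)

def fit_frame(measured: np.ndarray, x0: np.ndarray) -> np.ndarray:
    """Find parameter values that best reproduce one frame of measured data."""
    residual = lambda p: (predict_markers(p) - measured).ravel()
    return least_squares(residual, x0, bounds=(-1.0, 1.0)).x

# Fit a short (fake) recording frame by frame, warm-starting each frame from the
# previous solution so that the resulting parameter trajectories stay smooth.
recording = [predict_markers(rng.uniform(-0.5, 0.5, len(PARAMETERS)))
             for _ in range(10)]
params = np.zeros(len(PARAMETERS))
trajectory = []
for frame in recording:
    params = fit_frame(frame, params)
    trajectory.append(params)

In practice the toy linear basis would be replaced by the talking head's own articulatory model, and the per-frame fits could additionally be regularized toward the previous frame to avoid jitter in the animation.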
