21.06.2013 Views

Models of speech perception [pdf] - Ling.cam.ac.uk

Models of speech perception [pdf] - Ling.cam.ac.uk

Models of speech perception [pdf] - Ling.cam.ac.uk

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Models</strong> <strong>of</strong> <strong>speech</strong> <strong>perception</strong><br />

Paper 9<br />

Foundations <strong>of</strong> Speech Communication<br />

Lent: Week 2<br />

Katharine Barden


<strong>Models</strong> <strong>of</strong> <strong>speech</strong> <strong>perception</strong><br />

• Abstr<strong>ac</strong>tionist vs exemplar appro<strong>ac</strong>hes<br />

• Prelexical units<br />

• Modality <strong>of</strong> representations<br />

• Balance <strong>of</strong> “top-down” vs “bottom-up”<br />

processing<br />

• “Speech is special”


Abstr<strong>ac</strong>tionist appro<strong>ac</strong>h<br />

• Focus on identifying phonological<br />

or lexical units<br />

• Motivation: - <strong>perception</strong> seems categorical<br />

- small set <strong>of</strong> linguistic units generates infinite set<br />

<strong>of</strong> utterances<br />

- phonological rules operating on linguistic units<br />

• Impose order by early abstr<strong>ac</strong>tion into units <strong>of</strong><br />

formal linguistics<br />

• ‘Essence’ + variation<br />

– preserve the essence (phonemes or features)<br />

– discard the variation<br />

Phonetic detail is not retained beyond the<br />

earliest stages <strong>of</strong> <strong>perception</strong>


Motor Theory (Liberman et al., 1967)<br />

• No simple relationship between the <strong>ac</strong>oustic<br />

signal and the perceived phonemes<br />

abandon <strong>ac</strong>oustic invariance<br />

• Invariant units = the speaker’s phonemic<br />

gestures<br />

• Gesture = a class <strong>of</strong> movements by one or more<br />

articulators<br />

– e.g. movements for the “voiced labial stop gesture” /b/<br />

• closing and opening <strong>of</strong> the lips<br />

• closing the velum<br />

• starting vocal fold vibration at same time as lips open


Motivation:<br />

Motor Theory<br />

• Different <strong>ac</strong>oustic properties can be the<br />

consequence <strong>of</strong> invariant gestures<br />

e.g. /d/ rising F2 transitions in /di/<br />

falling F2 transitions in /du/<br />

But both <strong>of</strong> these <strong>ac</strong>oustic patterns result from an<br />

alveolar closure gesture that is followed by movement<br />

to appropriate tongue position for the vowel.


Motivation:<br />

• Coarticulation<br />

Motor Theory<br />

– produces variability so that the <strong>ac</strong>oustic signal itself<br />

contain no invariant properties<br />

– the underlying phoneme gestures can be retrieved by<br />

compensating for the coarticulatory effects<br />

e.g. in the context <strong>of</strong> lip-rounding, listeners will<br />

perceptually compensate for the resulting lower <strong>of</strong><br />

frequencies in surrounding sounds


Motivation:<br />

Motor Theory<br />

• Categorical <strong>perception</strong><br />

• Sharp category boundary<br />

• Poor discrimination <strong>of</strong> <strong>ac</strong>oustic differences within<br />

categories<br />

• Good discrimination <strong>of</strong> <strong>ac</strong>oustic differences <strong>ac</strong>ross<br />

categories<br />

– When a sound is heard, gestures are identified;<br />

Sounds that could come from a single type <strong>of</strong><br />

articulation are responded to as functionally<br />

equivalent (e.g. as /b/ or /d/)


Motor Theory<br />

A unifying explanation?<br />

• Speech production and <strong>perception</strong> share a common<br />

representation <strong>of</strong> phonemes as motor gestures<br />

How does it work...?<br />

• Speech <strong>perception</strong> is automatically mediated by an<br />

innate, specialised <strong>speech</strong> module to which we have<br />

no conscious <strong>ac</strong>cess


Evaluation<br />

Motor Theory<br />

• Are gestures any less variable than the<br />

<strong>ac</strong>oustic signal?<br />

– Probably not...<br />

– The Revised Motor Theory (Liberman and<br />

Mattingly, 1985) addresses evidence that<br />

gestures are variable by suggesting that it is<br />

the intended gestures that are perceived<br />

rather than the <strong>ac</strong>tual movements a speaker<br />

makes


Evaluation<br />

Motor Theory<br />

• How does the transformation from the<br />

<strong>ac</strong>oustic signal to the motor gesture<br />

happen?<br />

– Motor Theory postulates an innate module for<br />

processing <strong>speech</strong><br />

• How could it be tested?<br />

– How would this innate module derive invariant<br />

gestures from a variable <strong>ac</strong>oustic signal?


Evaluation<br />

Motor Theory<br />

• Evidence that <strong>perception</strong> precedes<br />

production in native language <strong>ac</strong>quisition<br />

could call into question the idea that innate<br />

gestures underlie production


Evaluation<br />

Motor Theory<br />

• Is the model’s output (a string <strong>of</strong> abstr<strong>ac</strong>t<br />

intended phonemic gestures) able to<br />

<strong>ac</strong>count for all influences on <strong>speech</strong><br />

<strong>perception</strong>?<br />

• Internal category structure<br />

• Talker-familiarity and perceptual learning effects<br />

• Systematic phonetic detail<br />

• Top-down contextual influences<br />

• Error correction


Evaluation<br />

Motor Theory<br />

• Are there alternative ways <strong>of</strong> overcoming<br />

the ‘problem <strong>of</strong> invariance’?<br />

– E.g. more complex relationships between the<br />

<strong>ac</strong>oustic signal and linguistic units


Acoustic Invariance<br />

• Alternative response to ‘problem <strong>of</strong><br />

invariance’<br />

– From static templates to relational invariants<br />

– Lexical Access From Features (LAFF –<br />

Stevens, 1980s)<br />

• Change the unit - search for invariant <strong>ac</strong>oustic or<br />

auditory correlates <strong>of</strong> phonological features<br />

(rather than phonemes)


Lexical Access From Features<br />

(Stevens, 1980s)<br />

• E<strong>ac</strong>h word in the mental lexicon is a<br />

matrix <strong>of</strong> features which are distributed<br />

over time to mirror the way e<strong>ac</strong>h appears<br />

in time<br />

• Features that can change a lot due to<br />

coarticulatory effects are marked as<br />

modifiable (M)


Lexical Access From Features<br />

(Stevens, 1980s)<br />

General form <strong>of</strong> the lexical representation <strong>of</strong> the word tan <strong>ac</strong>cording to LAFF


Lexical Access From Features<br />

Evaluation<br />

• Difficult to define the <strong>ac</strong>oustic properties <strong>of</strong><br />

some features<br />

• Does not include intrinsic duration <strong>of</strong><br />

features (just their sequencing)


Lexical Access From Features<br />

Evaluation<br />

• M’ feature → some allowance for variation<br />

But does it take into <strong>ac</strong>count the role <strong>of</strong> context in<br />

determining <strong>ac</strong>ceptable variation?<br />

– Assimilation <strong>ac</strong>ross a word boundary<br />

e.g. tan badly<br />

– the pl<strong>ac</strong>e <strong>of</strong> constriction may be labelled as M<br />

(modifiable)<br />

– BUT the bilabial PoA is only <strong>ac</strong>ceptable in the<br />

appropriate conditioning phonological context i.e. before<br />

a bilabial consonant (not before an alveolar consonant)<br />

– M feature allows variation but does not specify<br />

appropriate context for such variation


Lexical Access From Features<br />

Evaluation<br />

• Output is features rather than phonemes<br />

⇒ greater detail can influence <strong>perception</strong><br />

• But there are still some aspects <strong>of</strong> phonetic<br />

detail that cannot be captured in a featural<br />

representation e.g. mistakes vs mistimes


Lexical Access From Features<br />

Evaluation<br />

• Output = distinctive features mapped onto<br />

words<br />

• How can this <strong>ac</strong>count for…<br />

– long-domain effects?<br />

– talker-familiarity effects?<br />

– systematic phonetic detail relating to structures<br />

higher than the word?<br />

– top-down influences?


Lexical Access From Features<br />

Evaluation<br />

• (How) is <strong>perception</strong> related to production?


Exemplar/episodic theories<br />

Attempt to <strong>ac</strong>count for evidence that listeners<br />

remember details relating to specific episodes<br />

– Listeners are more <strong>ac</strong>curate at recognizing that they’ve<br />

heard a word before if it is repeated by the same talker<br />

and at the same speaking rate (Bradlow et al., 1999)<br />

– F<strong>ac</strong>ilitatory effect <strong>of</strong> familiarity with a talker (see last<br />

week’s handout)<br />

⇒ retention <strong>of</strong> surf<strong>ac</strong>e form<br />

– Cannot be captured by an abstr<strong>ac</strong>tionist appro<strong>ac</strong>h<br />

where phonetic information is discarded early in<br />

processing


Exemplar theories<br />

• E<strong>ac</strong>h stimulus (e.g. a word) leaves a<br />

unique tr<strong>ac</strong>e in memory<br />

• When a new stimulus is presented, these<br />

memory tr<strong>ac</strong>es are <strong>ac</strong>tivated in proportion<br />

to their similarity to the stimulus<br />

• Conscious experience <strong>of</strong> <strong>perception</strong><br />

occurs as a result <strong>of</strong> the combined<br />

<strong>ac</strong>tivation <strong>of</strong> these previously stored<br />

exemplars


Exemplar theories<br />

• No explicit prelexical code<br />

• The general similarity between many exemplars<br />

⇒ Emergent categories<br />

• Categories are self-organising and can be <strong>of</strong><br />

many sizes<br />

⇒ Listeners can make use <strong>of</strong> systematic phonetic<br />

variation that reflects linguistic structure <strong>of</strong> many<br />

types<br />

• Close similarity between few exemplars<br />

⇒ Talker-specific details can influence<br />

<strong>perception</strong>


Exemplar theories<br />

• Can <strong>ac</strong>count for plasticity<br />

– categories are continually evolving as new<br />

experiences create new memory tr<strong>ac</strong>es


Exemplar theories<br />

• Top-down effects<br />

e.g. Ganong Effect:<br />

– memory tr<strong>ac</strong>es exist for the real word but not<br />

for the nonsense word<br />

– therefore <strong>ac</strong>tivation <strong>of</strong> real-word memory<br />

tr<strong>ac</strong>es will contribute to the percept<br />

– whether it is sufficient to alter the percept will<br />

depend on the similarity <strong>of</strong> the stimulus to the<br />

tr<strong>ac</strong>es


Multimodal exemplar theories<br />

• Multimodal memory<br />

tr<strong>ac</strong>es…<br />

• Experience <strong>of</strong> one modality<br />

can <strong>ac</strong>tivate all/some <strong>of</strong><br />

others through associated<br />

memories<br />

• Meaning is <strong>ac</strong>hieved<br />

through combined <strong>ac</strong>tivation<br />

<strong>of</strong> relevant multimodal<br />

memory tr<strong>ac</strong>es on exposure<br />

to a stimulus<br />

• Can <strong>ac</strong>count for multimodal<br />

effects on <strong>perception</strong> e.g.<br />

McGurk effect, gestures<br />

• A more holistic appro<strong>ac</strong>h to<br />

communication – from<br />

sound to meaning<br />

taste<br />

texture<br />

look<br />

sound<br />

smell


Problems...?<br />

Exemplar theories<br />

• What is an exemplar?<br />

– What are the temporal boundaries <strong>of</strong> an<br />

‘episode’? What size/s ‘units’ are compared?<br />

– At what level/s <strong>of</strong> processing are exemplars<br />

stored?<br />

– How much <strong>of</strong> the context is stored?<br />

• What is similarity? What counts as<br />

‘similar’?


Problems...?<br />

Exemplar theories<br />

• Plasticity vs stability<br />

• How to test... Implementations are<br />

necessarily limited in comparison to<br />

human experience<br />

• Neurologically plausible? Memory<br />

cap<strong>ac</strong>ity?


Exemplar theories<br />

Within the general exemplar appro<strong>ac</strong>h, there are various<br />

different models...<br />

• Reading on exemplar appro<strong>ac</strong>h:<br />

– Hawkins in Pickett, pp. 255-260.<br />

– David Pisoni, Keith Johnson, Stephen Goldinger<br />

<strong>Models</strong><br />

• Goldinger: MINERVA – a computational implementation<br />

<strong>of</strong> a very limited exemplar model<br />

• Jusczyk: WRAPSA – based on child language <strong>ac</strong>q<br />

• (?) Kuhl: Native Language Magnet Model (can be viewed<br />

as exemplar-based?)<br />

• S. Hawkins and Smith: Polysp<br />

• J. Hawkins. Memory-prediction framework – not<br />

specifically about <strong>speech</strong>


TRACE: a connectionist model<br />

(McClelland and Elman, 1986)<br />

Aimed to:<br />

– identify single words<br />

– <strong>ac</strong>count for categorical <strong>perception</strong>,<br />

Ganong effect and other traditional<br />

phonetic findings that were considered<br />

important in 1970s-1980s


TRACE: a connectionist model<br />

(McClelland and Elman, 1986)<br />

• TRACE is a good example <strong>of</strong> an<br />

inter<strong>ac</strong>tive model<br />

(making use <strong>of</strong> bottom-up and top-down<br />

information)<br />

• Probabilistic: Several words can be<br />

“<strong>ac</strong>tivated” simultaneously, and the word<br />

that is identified is the most probable<br />

one (the most highly <strong>ac</strong>tivated)


Structure <strong>of</strong> TRACE<br />

3 layers:<br />

words<br />

phonemes<br />

features<br />

cat<br />

deep<br />

dog<br />

dark<br />

b d s l o<br />

OUTPUT: one<br />

word (or probabilities for<br />

several words)<br />

seep<br />

lark<br />

voice bilab alv cons voc<br />

INPUT: featural information, one time-slice at a time


Structure <strong>of</strong> TRACE<br />

3 layers:<br />

words<br />

phonemes<br />

features<br />

cat<br />

deep<br />

dog<br />

dark<br />

b d s l o<br />

seep<br />

lark<br />

voice bilab alv cons voc<br />

E<strong>ac</strong>h unit connects directly with every unit in its own layer, and in the adj<strong>ac</strong>ent layer(s).<br />

Connections BETWEEN layers excite (<strong>ac</strong>tivate) units in other layers.<br />

Connections WITHIN a layer inhibit (lower <strong>ac</strong>tivation <strong>of</strong>) units in same layer.


Processes <strong>of</strong> TRACE<br />

• E<strong>ac</strong>h unit has a resting <strong>ac</strong>tivation level<br />

• E<strong>ac</strong>h connection has a numerical weight<br />

that determines how fast it conducts<br />

information and thus how much<br />

information is passed in a given time<br />

• Output was originally a single word, but<br />

modified to be probabilities for different<br />

words


cat<br />

deep<br />

dog<br />

dark<br />

b d s u o<br />

seep<br />

lark<br />

voice bilab alv cons voc


cat<br />

deep<br />

dog<br />

dark<br />

b d s u o<br />

seep<br />

lark<br />

voice bilab alv cons voc


cat<br />

deep<br />

dog<br />

dark<br />

b d s u o<br />

seep<br />

lark<br />

voice bilab alv cons voc


cat<br />

deep<br />

dog<br />

dark<br />

b d s u o<br />

seep<br />

lark<br />

voice bilab alv cons voc


cat<br />

deep<br />

dog<br />

dark<br />

b d s u o<br />

seep<br />

lark<br />

voice bilab alv cons voc


cat<br />

deep<br />

dog<br />

dark<br />

b d s u o<br />

seep<br />

lark<br />

voice bilab alv cons voc


Evaluation:<br />

TRACE<br />

• Top-down effects can <strong>ac</strong>count for e.g.<br />

Ganong Effect, but are sometimes too<br />

powerful<br />

– strong influence <strong>of</strong> word level on phoneme<br />

level can prevent detection <strong>of</strong> serious<br />

mispronunciations<br />

• Surf<strong>ac</strong>e phonetic variation is discarded<br />

• No higher level structures (incomplete)


Summary<br />

• Diverse appro<strong>ac</strong>hes to modelling <strong>speech</strong><br />

<strong>perception</strong><br />

• Varied in scope – identification <strong>of</strong><br />

phonemes, words, or all the way to<br />

meaning...<br />

• Assumptions: abstr<strong>ac</strong>tionist vs exemplar,<br />

motor vs <strong>ac</strong>oustic invariants, features vs<br />

phonemes, top-down vs bottom-up...


Next week…<br />

• Speech production<br />

– <strong>Models</strong> <strong>of</strong> coarticulation


Main sources:<br />

Reading<br />

Pickett, J. M. (1999). The Acoustics <strong>of</strong> Speech<br />

Communication: Fundamentals, Speech Perception<br />

Theory, and Technology. Needham Heights, MA: Allyn &<br />

B<strong>ac</strong>on. MML [L9A.P.15:1,2]. Chpts 13, 14 and 15 (up to<br />

p. 260).<br />

Hawkins, S. (2004). Puzzles and patterns in 50 years <strong>of</strong><br />

research on <strong>speech</strong> <strong>perception</strong>. In Slifka, J. Manuel, S.<br />

and Matthies, M. (eds.) From Sound to Sense: 50+<br />

Years <strong>of</strong> Discoveries in Speech Communication. CD. On<br />

Sarah’s website: http://www.ling.<strong>cam</strong>.<strong>ac</strong>.<strong>uk</strong>/sarah/<br />

pubs_conf.html


Reading<br />

Exemplar theories:<br />

Goldinger, S.D. (1996). Words and voices: episodic tr<strong>ac</strong>es<br />

in spoken word identification and recognition memory.<br />

Journal <strong>of</strong> Experimental Psychology: Learning, Memory<br />

and Cognition, 22, 5, 1166-1183<br />

Goldinger, S.D. (1998). Echo <strong>of</strong> echoes? An episodic<br />

theory <strong>of</strong> lexical <strong>ac</strong>cess. Psychological Review, 105, 2,<br />

251-279.<br />

*Johnson, K. and Mullenix, J.W., eds. (1997). Talker<br />

Variability in Speech Processing. Academic Press. Ch 2<br />

(Pisoni).<br />

Johnson, K., Strand, E.A. and D’Imperio, M. (1999).<br />

Auditory-visual integration <strong>of</strong> talker gender in vowel<br />

<strong>perception</strong>. Journal <strong>of</strong> Phonetics, 27, 359-384.


Reading<br />

Connectionism/neural networks<br />

Harley, T. (2001). The Psychology <strong>of</strong> Language: From Data to Theory.<br />

2nd edition. Hove: Psychology Press. ch. 8 and Appendix on<br />

connectionism (good short introduction to connectionism).<br />

TRACE<br />

McClelland, J.L. and Elman, J.L. (1986). The TRACE model <strong>of</strong> <strong>speech</strong><br />

<strong>perception</strong>. Cognitive Psychology, 18, 1-86.<br />

Further reading on specific models:<br />

Motor theory<br />

Liberman, A. M. et al. (1967). Perception <strong>of</strong> the <strong>speech</strong> code.<br />

Psychological Review, 74, 431-461.<br />

Liberman, A. M. and Mattingly, I. G. (1985). The motor theory <strong>of</strong> <strong>speech</strong><br />

<strong>perception</strong> revised. Cognition, 21, 1-36.<br />

WRAPSA<br />

Peter W. Jusczyk (1997). The Discovery <strong>of</strong> Spoken Language.<br />

Cambridge, Mass.: MIT Press. Chapter 8.


Reading<br />

Specific models:<br />

Native Language Model – expanded.<br />

Kuhl, P. et al. (2008). Phonetic learning as a pathway to language: new<br />

data and native language magnet theory expanded (NLM-e).<br />

Philosophical trans<strong>ac</strong>tions <strong>of</strong> the Royal Society, 363, 979-1000.<br />

Polysp<br />

Hawkins, S. and Smith, R. (2001). Polysp: a polysystematic,<br />

phonetically-rich appro<strong>ac</strong>h to <strong>speech</strong> understanding. Italian Journal<br />

<strong>of</strong> <strong>Ling</strong>uistics/Rivista di <strong>Ling</strong>uistica 13 (1), 99--189. Section 4.4<br />

(<strong>Ling</strong>uistic categories). [Advanced.] On Sarah’s website:<br />

http://www.ling.<strong>cam</strong>.<strong>ac</strong>.<strong>uk</strong>/sarah/pubs_peer.html<br />

Memory-prediction framework<br />

Hawkins, J. (2004). On Intelligence. New York: Henry Holt & Co. (an<br />

Owl Book) esp. chs 3-6.<br />

Direct realism<br />

Fowler, C.A. (1986). An event appro<strong>ac</strong>h to the study <strong>of</strong> <strong>speech</strong><br />

<strong>perception</strong> from a direct-realist perspective. Journal <strong>of</strong> Phonetics,<br />

14, 3-28.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!