Models of speech perception [pdf] - Ling.cam.ac.uk
Models of speech perception [pdf] - Ling.cam.ac.uk
Models of speech perception [pdf] - Ling.cam.ac.uk
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>Models</strong> <strong>of</strong> <strong>speech</strong> <strong>perception</strong><br />
Paper 9<br />
Foundations <strong>of</strong> Speech Communication<br />
Lent: Week 2<br />
Katharine Barden
<strong>Models</strong> <strong>of</strong> <strong>speech</strong> <strong>perception</strong><br />
• Abstr<strong>ac</strong>tionist vs exemplar appro<strong>ac</strong>hes<br />
• Prelexical units<br />
• Modality <strong>of</strong> representations<br />
• Balance <strong>of</strong> “top-down” vs “bottom-up”<br />
processing<br />
• “Speech is special”
Abstr<strong>ac</strong>tionist appro<strong>ac</strong>h<br />
• Focus on identifying phonological<br />
or lexical units<br />
• Motivation: - <strong>perception</strong> seems categorical<br />
- small set <strong>of</strong> linguistic units generates infinite set<br />
<strong>of</strong> utterances<br />
- phonological rules operating on linguistic units<br />
• Impose order by early abstr<strong>ac</strong>tion into units <strong>of</strong><br />
formal linguistics<br />
• ‘Essence’ + variation<br />
– preserve the essence (phonemes or features)<br />
– discard the variation<br />
Phonetic detail is not retained beyond the<br />
earliest stages <strong>of</strong> <strong>perception</strong>
Motor Theory (Liberman et al., 1967)<br />
• No simple relationship between the <strong>ac</strong>oustic<br />
signal and the perceived phonemes<br />
abandon <strong>ac</strong>oustic invariance<br />
• Invariant units = the speaker’s phonemic<br />
gestures<br />
• Gesture = a class <strong>of</strong> movements by one or more<br />
articulators<br />
– e.g. movements for the “voiced labial stop gesture” /b/<br />
• closing and opening <strong>of</strong> the lips<br />
• closing the velum<br />
• starting vocal fold vibration at same time as lips open
Motivation:<br />
Motor Theory<br />
• Different <strong>ac</strong>oustic properties can be the<br />
consequence <strong>of</strong> invariant gestures<br />
e.g. /d/ rising F2 transitions in /di/<br />
falling F2 transitions in /du/<br />
But both <strong>of</strong> these <strong>ac</strong>oustic patterns result from an<br />
alveolar closure gesture that is followed by movement<br />
to appropriate tongue position for the vowel.
Motivation:<br />
• Coarticulation<br />
Motor Theory<br />
– produces variability so that the <strong>ac</strong>oustic signal itself<br />
contain no invariant properties<br />
– the underlying phoneme gestures can be retrieved by<br />
compensating for the coarticulatory effects<br />
e.g. in the context <strong>of</strong> lip-rounding, listeners will<br />
perceptually compensate for the resulting lower <strong>of</strong><br />
frequencies in surrounding sounds
Motivation:<br />
Motor Theory<br />
• Categorical <strong>perception</strong><br />
• Sharp category boundary<br />
• Poor discrimination <strong>of</strong> <strong>ac</strong>oustic differences within<br />
categories<br />
• Good discrimination <strong>of</strong> <strong>ac</strong>oustic differences <strong>ac</strong>ross<br />
categories<br />
– When a sound is heard, gestures are identified;<br />
Sounds that could come from a single type <strong>of</strong><br />
articulation are responded to as functionally<br />
equivalent (e.g. as /b/ or /d/)
Motor Theory<br />
A unifying explanation?<br />
• Speech production and <strong>perception</strong> share a common<br />
representation <strong>of</strong> phonemes as motor gestures<br />
How does it work...?<br />
• Speech <strong>perception</strong> is automatically mediated by an<br />
innate, specialised <strong>speech</strong> module to which we have<br />
no conscious <strong>ac</strong>cess
Evaluation<br />
Motor Theory<br />
• Are gestures any less variable than the<br />
<strong>ac</strong>oustic signal?<br />
– Probably not...<br />
– The Revised Motor Theory (Liberman and<br />
Mattingly, 1985) addresses evidence that<br />
gestures are variable by suggesting that it is<br />
the intended gestures that are perceived<br />
rather than the <strong>ac</strong>tual movements a speaker<br />
makes
Evaluation<br />
Motor Theory<br />
• How does the transformation from the<br />
<strong>ac</strong>oustic signal to the motor gesture<br />
happen?<br />
– Motor Theory postulates an innate module for<br />
processing <strong>speech</strong><br />
• How could it be tested?<br />
– How would this innate module derive invariant<br />
gestures from a variable <strong>ac</strong>oustic signal?
Evaluation<br />
Motor Theory<br />
• Evidence that <strong>perception</strong> precedes<br />
production in native language <strong>ac</strong>quisition<br />
could call into question the idea that innate<br />
gestures underlie production
Evaluation<br />
Motor Theory<br />
• Is the model’s output (a string <strong>of</strong> abstr<strong>ac</strong>t<br />
intended phonemic gestures) able to<br />
<strong>ac</strong>count for all influences on <strong>speech</strong><br />
<strong>perception</strong>?<br />
• Internal category structure<br />
• Talker-familiarity and perceptual learning effects<br />
• Systematic phonetic detail<br />
• Top-down contextual influences<br />
• Error correction
Evaluation<br />
Motor Theory<br />
• Are there alternative ways <strong>of</strong> overcoming<br />
the ‘problem <strong>of</strong> invariance’?<br />
– E.g. more complex relationships between the<br />
<strong>ac</strong>oustic signal and linguistic units
Acoustic Invariance<br />
• Alternative response to ‘problem <strong>of</strong><br />
invariance’<br />
– From static templates to relational invariants<br />
– Lexical Access From Features (LAFF –<br />
Stevens, 1980s)<br />
• Change the unit - search for invariant <strong>ac</strong>oustic or<br />
auditory correlates <strong>of</strong> phonological features<br />
(rather than phonemes)
Lexical Access From Features<br />
(Stevens, 1980s)<br />
• E<strong>ac</strong>h word in the mental lexicon is a<br />
matrix <strong>of</strong> features which are distributed<br />
over time to mirror the way e<strong>ac</strong>h appears<br />
in time<br />
• Features that can change a lot due to<br />
coarticulatory effects are marked as<br />
modifiable (M)
Lexical Access From Features<br />
(Stevens, 1980s)<br />
General form <strong>of</strong> the lexical representation <strong>of</strong> the word tan <strong>ac</strong>cording to LAFF
Lexical Access From Features<br />
Evaluation<br />
• Difficult to define the <strong>ac</strong>oustic properties <strong>of</strong><br />
some features<br />
• Does not include intrinsic duration <strong>of</strong><br />
features (just their sequencing)
Lexical Access From Features<br />
Evaluation<br />
• M’ feature → some allowance for variation<br />
But does it take into <strong>ac</strong>count the role <strong>of</strong> context in<br />
determining <strong>ac</strong>ceptable variation?<br />
– Assimilation <strong>ac</strong>ross a word boundary<br />
e.g. tan badly<br />
– the pl<strong>ac</strong>e <strong>of</strong> constriction may be labelled as M<br />
(modifiable)<br />
– BUT the bilabial PoA is only <strong>ac</strong>ceptable in the<br />
appropriate conditioning phonological context i.e. before<br />
a bilabial consonant (not before an alveolar consonant)<br />
– M feature allows variation but does not specify<br />
appropriate context for such variation
Lexical Access From Features<br />
Evaluation<br />
• Output is features rather than phonemes<br />
⇒ greater detail can influence <strong>perception</strong><br />
• But there are still some aspects <strong>of</strong> phonetic<br />
detail that cannot be captured in a featural<br />
representation e.g. mistakes vs mistimes
Lexical Access From Features<br />
Evaluation<br />
• Output = distinctive features mapped onto<br />
words<br />
• How can this <strong>ac</strong>count for…<br />
– long-domain effects?<br />
– talker-familiarity effects?<br />
– systematic phonetic detail relating to structures<br />
higher than the word?<br />
– top-down influences?
Lexical Access From Features<br />
Evaluation<br />
• (How) is <strong>perception</strong> related to production?
Exemplar/episodic theories<br />
Attempt to <strong>ac</strong>count for evidence that listeners<br />
remember details relating to specific episodes<br />
– Listeners are more <strong>ac</strong>curate at recognizing that they’ve<br />
heard a word before if it is repeated by the same talker<br />
and at the same speaking rate (Bradlow et al., 1999)<br />
– F<strong>ac</strong>ilitatory effect <strong>of</strong> familiarity with a talker (see last<br />
week’s handout)<br />
⇒ retention <strong>of</strong> surf<strong>ac</strong>e form<br />
– Cannot be captured by an abstr<strong>ac</strong>tionist appro<strong>ac</strong>h<br />
where phonetic information is discarded early in<br />
processing
Exemplar theories<br />
• E<strong>ac</strong>h stimulus (e.g. a word) leaves a<br />
unique tr<strong>ac</strong>e in memory<br />
• When a new stimulus is presented, these<br />
memory tr<strong>ac</strong>es are <strong>ac</strong>tivated in proportion<br />
to their similarity to the stimulus<br />
• Conscious experience <strong>of</strong> <strong>perception</strong><br />
occurs as a result <strong>of</strong> the combined<br />
<strong>ac</strong>tivation <strong>of</strong> these previously stored<br />
exemplars
Exemplar theories<br />
• No explicit prelexical code<br />
• The general similarity between many exemplars<br />
⇒ Emergent categories<br />
• Categories are self-organising and can be <strong>of</strong><br />
many sizes<br />
⇒ Listeners can make use <strong>of</strong> systematic phonetic<br />
variation that reflects linguistic structure <strong>of</strong> many<br />
types<br />
• Close similarity between few exemplars<br />
⇒ Talker-specific details can influence<br />
<strong>perception</strong>
Exemplar theories<br />
• Can <strong>ac</strong>count for plasticity<br />
– categories are continually evolving as new<br />
experiences create new memory tr<strong>ac</strong>es
Exemplar theories<br />
• Top-down effects<br />
e.g. Ganong Effect:<br />
– memory tr<strong>ac</strong>es exist for the real word but not<br />
for the nonsense word<br />
– therefore <strong>ac</strong>tivation <strong>of</strong> real-word memory<br />
tr<strong>ac</strong>es will contribute to the percept<br />
– whether it is sufficient to alter the percept will<br />
depend on the similarity <strong>of</strong> the stimulus to the<br />
tr<strong>ac</strong>es
Multimodal exemplar theories<br />
• Multimodal memory<br />
tr<strong>ac</strong>es…<br />
• Experience <strong>of</strong> one modality<br />
can <strong>ac</strong>tivate all/some <strong>of</strong><br />
others through associated<br />
memories<br />
• Meaning is <strong>ac</strong>hieved<br />
through combined <strong>ac</strong>tivation<br />
<strong>of</strong> relevant multimodal<br />
memory tr<strong>ac</strong>es on exposure<br />
to a stimulus<br />
• Can <strong>ac</strong>count for multimodal<br />
effects on <strong>perception</strong> e.g.<br />
McGurk effect, gestures<br />
• A more holistic appro<strong>ac</strong>h to<br />
communication – from<br />
sound to meaning<br />
taste<br />
texture<br />
look<br />
sound<br />
smell
Problems...?<br />
Exemplar theories<br />
• What is an exemplar?<br />
– What are the temporal boundaries <strong>of</strong> an<br />
‘episode’? What size/s ‘units’ are compared?<br />
– At what level/s <strong>of</strong> processing are exemplars<br />
stored?<br />
– How much <strong>of</strong> the context is stored?<br />
• What is similarity? What counts as<br />
‘similar’?
Problems...?<br />
Exemplar theories<br />
• Plasticity vs stability<br />
• How to test... Implementations are<br />
necessarily limited in comparison to<br />
human experience<br />
• Neurologically plausible? Memory<br />
cap<strong>ac</strong>ity?
Exemplar theories<br />
Within the general exemplar appro<strong>ac</strong>h, there are various<br />
different models...<br />
• Reading on exemplar appro<strong>ac</strong>h:<br />
– Hawkins in Pickett, pp. 255-260.<br />
– David Pisoni, Keith Johnson, Stephen Goldinger<br />
<strong>Models</strong><br />
• Goldinger: MINERVA – a computational implementation<br />
<strong>of</strong> a very limited exemplar model<br />
• Jusczyk: WRAPSA – based on child language <strong>ac</strong>q<br />
• (?) Kuhl: Native Language Magnet Model (can be viewed<br />
as exemplar-based?)<br />
• S. Hawkins and Smith: Polysp<br />
• J. Hawkins. Memory-prediction framework – not<br />
specifically about <strong>speech</strong>
TRACE: a connectionist model<br />
(McClelland and Elman, 1986)<br />
Aimed to:<br />
– identify single words<br />
– <strong>ac</strong>count for categorical <strong>perception</strong>,<br />
Ganong effect and other traditional<br />
phonetic findings that were considered<br />
important in 1970s-1980s
TRACE: a connectionist model<br />
(McClelland and Elman, 1986)<br />
• TRACE is a good example <strong>of</strong> an<br />
inter<strong>ac</strong>tive model<br />
(making use <strong>of</strong> bottom-up and top-down<br />
information)<br />
• Probabilistic: Several words can be<br />
“<strong>ac</strong>tivated” simultaneously, and the word<br />
that is identified is the most probable<br />
one (the most highly <strong>ac</strong>tivated)
Structure <strong>of</strong> TRACE<br />
3 layers:<br />
words<br />
phonemes<br />
features<br />
cat<br />
deep<br />
dog<br />
dark<br />
b d s l o<br />
OUTPUT: one<br />
word (or probabilities for<br />
several words)<br />
seep<br />
lark<br />
voice bilab alv cons voc<br />
INPUT: featural information, one time-slice at a time
Structure <strong>of</strong> TRACE<br />
3 layers:<br />
words<br />
phonemes<br />
features<br />
cat<br />
deep<br />
dog<br />
dark<br />
b d s l o<br />
seep<br />
lark<br />
voice bilab alv cons voc<br />
E<strong>ac</strong>h unit connects directly with every unit in its own layer, and in the adj<strong>ac</strong>ent layer(s).<br />
Connections BETWEEN layers excite (<strong>ac</strong>tivate) units in other layers.<br />
Connections WITHIN a layer inhibit (lower <strong>ac</strong>tivation <strong>of</strong>) units in same layer.
Processes <strong>of</strong> TRACE<br />
• E<strong>ac</strong>h unit has a resting <strong>ac</strong>tivation level<br />
• E<strong>ac</strong>h connection has a numerical weight<br />
that determines how fast it conducts<br />
information and thus how much<br />
information is passed in a given time<br />
• Output was originally a single word, but<br />
modified to be probabilities for different<br />
words
cat<br />
deep<br />
dog<br />
dark<br />
b d s u o<br />
seep<br />
lark<br />
voice bilab alv cons voc
cat<br />
deep<br />
dog<br />
dark<br />
b d s u o<br />
seep<br />
lark<br />
voice bilab alv cons voc
cat<br />
deep<br />
dog<br />
dark<br />
b d s u o<br />
seep<br />
lark<br />
voice bilab alv cons voc
cat<br />
deep<br />
dog<br />
dark<br />
b d s u o<br />
seep<br />
lark<br />
voice bilab alv cons voc
cat<br />
deep<br />
dog<br />
dark<br />
b d s u o<br />
seep<br />
lark<br />
voice bilab alv cons voc
cat<br />
deep<br />
dog<br />
dark<br />
b d s u o<br />
seep<br />
lark<br />
voice bilab alv cons voc
Evaluation:<br />
TRACE<br />
• Top-down effects can <strong>ac</strong>count for e.g.<br />
Ganong Effect, but are sometimes too<br />
powerful<br />
– strong influence <strong>of</strong> word level on phoneme<br />
level can prevent detection <strong>of</strong> serious<br />
mispronunciations<br />
• Surf<strong>ac</strong>e phonetic variation is discarded<br />
• No higher level structures (incomplete)
Summary<br />
• Diverse appro<strong>ac</strong>hes to modelling <strong>speech</strong><br />
<strong>perception</strong><br />
• Varied in scope – identification <strong>of</strong><br />
phonemes, words, or all the way to<br />
meaning...<br />
• Assumptions: abstr<strong>ac</strong>tionist vs exemplar,<br />
motor vs <strong>ac</strong>oustic invariants, features vs<br />
phonemes, top-down vs bottom-up...
Next week…<br />
• Speech production<br />
– <strong>Models</strong> <strong>of</strong> coarticulation
Main sources:<br />
Reading<br />
Pickett, J. M. (1999). The Acoustics <strong>of</strong> Speech<br />
Communication: Fundamentals, Speech Perception<br />
Theory, and Technology. Needham Heights, MA: Allyn &<br />
B<strong>ac</strong>on. MML [L9A.P.15:1,2]. Chpts 13, 14 and 15 (up to<br />
p. 260).<br />
Hawkins, S. (2004). Puzzles and patterns in 50 years <strong>of</strong><br />
research on <strong>speech</strong> <strong>perception</strong>. In Slifka, J. Manuel, S.<br />
and Matthies, M. (eds.) From Sound to Sense: 50+<br />
Years <strong>of</strong> Discoveries in Speech Communication. CD. On<br />
Sarah’s website: http://www.ling.<strong>cam</strong>.<strong>ac</strong>.<strong>uk</strong>/sarah/<br />
pubs_conf.html
Reading<br />
Exemplar theories:<br />
Goldinger, S.D. (1996). Words and voices: episodic tr<strong>ac</strong>es<br />
in spoken word identification and recognition memory.<br />
Journal <strong>of</strong> Experimental Psychology: Learning, Memory<br />
and Cognition, 22, 5, 1166-1183<br />
Goldinger, S.D. (1998). Echo <strong>of</strong> echoes? An episodic<br />
theory <strong>of</strong> lexical <strong>ac</strong>cess. Psychological Review, 105, 2,<br />
251-279.<br />
*Johnson, K. and Mullenix, J.W., eds. (1997). Talker<br />
Variability in Speech Processing. Academic Press. Ch 2<br />
(Pisoni).<br />
Johnson, K., Strand, E.A. and D’Imperio, M. (1999).<br />
Auditory-visual integration <strong>of</strong> talker gender in vowel<br />
<strong>perception</strong>. Journal <strong>of</strong> Phonetics, 27, 359-384.
Reading<br />
Connectionism/neural networks<br />
Harley, T. (2001). The Psychology <strong>of</strong> Language: From Data to Theory.<br />
2nd edition. Hove: Psychology Press. ch. 8 and Appendix on<br />
connectionism (good short introduction to connectionism).<br />
TRACE<br />
McClelland, J.L. and Elman, J.L. (1986). The TRACE model <strong>of</strong> <strong>speech</strong><br />
<strong>perception</strong>. Cognitive Psychology, 18, 1-86.<br />
Further reading on specific models:<br />
Motor theory<br />
Liberman, A. M. et al. (1967). Perception <strong>of</strong> the <strong>speech</strong> code.<br />
Psychological Review, 74, 431-461.<br />
Liberman, A. M. and Mattingly, I. G. (1985). The motor theory <strong>of</strong> <strong>speech</strong><br />
<strong>perception</strong> revised. Cognition, 21, 1-36.<br />
WRAPSA<br />
Peter W. Jusczyk (1997). The Discovery <strong>of</strong> Spoken Language.<br />
Cambridge, Mass.: MIT Press. Chapter 8.
Reading<br />
Specific models:<br />
Native Language Model – expanded.<br />
Kuhl, P. et al. (2008). Phonetic learning as a pathway to language: new<br />
data and native language magnet theory expanded (NLM-e).<br />
Philosophical trans<strong>ac</strong>tions <strong>of</strong> the Royal Society, 363, 979-1000.<br />
Polysp<br />
Hawkins, S. and Smith, R. (2001). Polysp: a polysystematic,<br />
phonetically-rich appro<strong>ac</strong>h to <strong>speech</strong> understanding. Italian Journal<br />
<strong>of</strong> <strong>Ling</strong>uistics/Rivista di <strong>Ling</strong>uistica 13 (1), 99--189. Section 4.4<br />
(<strong>Ling</strong>uistic categories). [Advanced.] On Sarah’s website:<br />
http://www.ling.<strong>cam</strong>.<strong>ac</strong>.<strong>uk</strong>/sarah/pubs_peer.html<br />
Memory-prediction framework<br />
Hawkins, J. (2004). On Intelligence. New York: Henry Holt & Co. (an<br />
Owl Book) esp. chs 3-6.<br />
Direct realism<br />
Fowler, C.A. (1986). An event appro<strong>ac</strong>h to the study <strong>of</strong> <strong>speech</strong><br />
<strong>perception</strong> from a direct-realist perspective. Journal <strong>of</strong> Phonetics,<br />
14, 3-28.