Linguistic categories and speech perception
Linguistic categories and speech<br />
perception<br />
Paper 9<br />
Foundations of Speech Communication<br />
Sarah Hawkins<br />
9 November 2007
Aim<br />
• To consider how (or why) we seem to<br />
recognize discrete linguistic units from the<br />
speech signal.<br />
(Note the assumption that mental linguistic<br />
units are discrete.)
3 ways discrete linguistic-phonetic<br />
categories might be perceived<br />
1. they are really there in the acoustic signal<br />
acoustic invariance (or reliability)<br />
2. they result from the way the auditory system<br />
processes sound<br />
auditory invariance (or reliability)<br />
3. they result from the way the brain processes<br />
any sort of information, sensory or not<br />
‘cognitive invariance’
What is a category?<br />
A class or division in a<br />
system of classification
Structure of a category<br />
poor<br />
ok<br />
good<br />
best<br />
Quality of exemplars<br />
Boundaries
Thrush in summer<br />
<br />
Thrush in snow<br />
<br />
Sparrow in summer
Reminder:<br />
Ladefoged and Broadbent (1957)<br />
"Please say what this word is:" bit, bet, bat, but<br />
[Figure: the same test word is heard as "bet" when F1 of the CARRIER is 200-380 Hz and as "bit" when F1 of the CARRIER is 380-660 Hz]<br />
Ladefoged and Broadbent (1957) JASA 29, 98-104
How ‘categorical’ is Categorical<br />
Perception?<br />
• Category boundaries are not stable, but<br />
highly labile: they shift under the influence of<br />
many different factors.
[Figure: waveform (amplitude against Time (s)) of stimuli with VOT -40 ms, VOT +10 ms and VOT +100 ms]<br />
CP boundary shifts: range effects<br />
• identification expt e.g.<br />
• VOT continuum<br />
da..........ta<br />
• when stimuli are<br />
removed from one end,<br />
the 50% id boundary<br />
shifts towards the other<br />
[Figure: identification function plotting % /d/ responses (0-100) against VOT, from short VOT (d) to long VOT (t), with the boundary shift marked]
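A minimal sketch of how the 50% identification boundary can be read off an identification function, and how it moves when one end of the continuum is removed. The VOT values and response percentages are invented for illustration, and linear interpolation between neighbouring stimuli is an assumption, not the method of any particular study.

```python
def boundary_50(vot_ms, pct_d):
    """Return the VOT (ms) where % /d/ responses fall through 50%.

    Uses linear interpolation between the two neighbouring stimuli.
    """
    points = list(zip(vot_ms, pct_d))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 >= 50 >= y1 and y0 != y1:
            return x0 + (y0 - 50) * (x1 - x0) / (y0 - y1)
    raise ValueError("responses never cross 50%")


# Invented data: full da...ta continuum
full_vot = [0, 10, 20, 30, 40, 50, 60]
full_pct_d = [100, 98, 85, 40, 10, 2, 0]

# Invented data: the long-VOT stimuli have been removed
reduced_vot = [0, 10, 20, 30, 40]
reduced_pct_d = [100, 95, 70, 20, 5]

full_boundary = boundary_50(full_vot, full_pct_d)
reduced_boundary = boundary_50(reduced_vot, reduced_pct_d)
print(f"full continuum boundary: {full_boundary:.1f} ms")
print(f"reduced continuum boundary: {reduced_boundary:.1f} ms")
```

In this made-up data set the boundary moves towards the short-VOT end once the long-VOT stimuli are gone, which is the direction of shift the slide describes.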
CP boundary shifts: cue trading<br />
• Cue trading: more of one<br />
property compensates for<br />
less of another<br />
e.g. for stimuli whose VOT is<br />
ambiguous between /da/ and<br />
/ta/, decreasing burst<br />
amplitude causes more /da/s<br />
to be perceived<br />
[Figure: % /d/ responses (0-100) against VOT, from short VOT (d) to long VOT (t), for two burst amplitudes: high (fewer /d/ responses) and low (more /d/ responses)]
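Cue trading can be sketched as a decision rule in which two cues add up before a single category decision is made. The logistic form and all three weights below are hypothetical, chosen only so that longer VOT and higher burst amplitude both reduce /d/ responses.

```python
import math

# Hypothetical weights (assumptions for illustration only)
B0 = 3.0         # overall bias towards /d/
W_VOT = -0.1     # per ms of VOT: longer VOT means fewer /d/ responses
W_BURST = -0.05  # per dB of burst amplitude: louder burst means fewer /d/


def p_d(vot_ms, burst_db):
    """Probability of a /d/ response given both cues."""
    z = B0 + W_VOT * vot_ms + W_BURST * burst_db
    return 1.0 / (1.0 + math.exp(-z))


def boundary_vot(burst_db):
    """VOT (ms) at which /d/ and /t/ are equally likely for this burst level."""
    return -(B0 + W_BURST * burst_db) / W_VOT


low_burst_boundary = boundary_vot(0.0)
high_burst_boundary = boundary_vot(10.0)
print(low_burst_boundary, high_burst_boundary)
```

With these weights, a stimulus whose VOT is ambiguous with a weak burst attracts fewer /d/ responses once the burst is made stronger: more of one property compensates for less of the other.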
CP boundary shifts:<br />
meaningfulness<br />
• “Ganong effect”<br />
(word~nonword)<br />
more /d/ responses if real<br />
word begins with /d/ (dash-tash)<br />
• Similar effects for<br />
sentence meaning<br />
(if the task is appropriate)<br />
e.g. The farmer milked the<br />
[g/k]oat<br />
[Figure: % /d/ responses (0-100) against VOT, from short VOT (d) to long VOT (t), for a nonword-word continuum (dask-task) and a word-nonword continuum (dash-tash)]<br />
Ganong (1980) J. Exp. Psych: HPP 6, 110-125<br />
Borsky, Shapiro, Tuller (2000) J. Psycholinguistic Res. 29, 155-168
Perception adjusts to the<br />
distribution of stimuli<br />
and<br />
is more forgiving<br />
about unclear sounds<br />
if the message makes sense<br />
[Figure: % /d/ responses (0-100) against VOT, from short VOT (d) to long VOT (t)]
CP: category goodness<br />
Eye-tracking<br />
Task: click on picture corresponding to heard word<br />
Stimuli varied in VOT: bear--pear<br />
More looks to competitor picture as VOT approaches<br />
category boundary<br />
McMurray et al. (2003) J. Psycholing. Res. 32, 77-97
CP: category goodness<br />
Mediated Priming in lexical decision task<br />
A /t/ with a short VOT primes unrelated words<br />
via rhymes that have /d/ instead of /t/<br />
[Figure: reaction times for Related, Modified and Neutral primes]<br />
e.g. t*ime primes penny via dime<br />
Misiurski et al. (2005) Brain & Lang. 93, 64-78
Does speech convey invariant information<br />
about linguistic categories?<br />
Classical theories held/hold that:<br />
• We necessarily perceive phonemes or<br />
phonological distinctive features when we<br />
listen to speech…<br />
• …because each phoneme or feature bears an<br />
invariant relationship with some<br />
property/properties of the speech signal
Classical assumptions<br />
Some theorists suggested invariants could be<br />
modality-neutral, but most debate was<br />
polarised: invariant units “had to be” either<br />
• motoric<br />
– the Motor Theory of Speech Perception<br />
– direct realism/direct perception (Fowler, Best)<br />
• or acoustic or auditory<br />
– quantal theory, acoustic invariance (K. Stevens)<br />
– auditory enhancement theory (Kingston, Diehl,<br />
Kluender).<br />
Read about these theories in Pickett, chapters 13-15, esp. 14.<br />
If you are very interested, ask your supervisor for more recent developments
Classical assumptions<br />
Most of these theories implicitly assume(d):<br />
• there is one basic perceptual unit<br />
e.g. distinctive features, gestures<br />
• linguistic categorization is either inherent in the<br />
signal itself, or the automatic consequence of<br />
low-level perceptual processes.
Classical assumptions<br />
• Almost all theories were vague about how we<br />
move from identification of the basic phonetic<br />
or phonological unit to understanding meaning.<br />
• No influential classical theory considered that<br />
the same basic processes could be involved in<br />
all aspects of speech understanding.<br />
• Exception to both these statements: Klatt’s<br />
Lexical Access From Spectra (LAFS) model<br />
Klatt, D. H. (1979). Speech perception: A model of acoustic-phonetic<br />
analysis and lexical access. Journal of Phonetics, 7, 279-312.
Do linguistic units have invariant<br />
correlates?<br />
• There is no strong evidence that all linguistic<br />
units, even of a single type, are invariantly and<br />
reliably present in the speech signal.<br />
• Yet some acoustic-phonetic features are more<br />
robust across contexts and speakers than<br />
others, i.e. their properties are a good deal<br />
more invariant and reliable than others,<br />
especially if they are considered in relation to<br />
their surrounding context, and with respect to<br />
known properties of the auditory system.
Robust features: spectrogram of<br />
“My family lives in Oxford”<br />
[Figure: spectrogram labelled with robust feature classes: N V WF V N V V l V SF V N V sil SF WF V sil, with diphthongs and transients marked. Annotations: voicing in an obstruent; glottal stop (before voiceless stop).]
Robust features (e.g. Zue, 1985)<br />
• "Strong" fricative — "weak" fricative —<br />
nasal — periodic — silence — transient —<br />
vowel (high/low, front/back, spread/round).<br />
• These offer a set of "invariant" acoustic<br />
features from which to make preliminary<br />
decisions about what words were spoken.<br />
• Some Automatic Speech Recognition (ASR)<br />
techniques use such broad featural<br />
categories; less widely applied to human<br />
speech perception work.
Robust features<br />
• usually clearly visible in spectrograms<br />
• independent of one another (e.g. you can know it’s a “strong<br />
fricative” without knowing its exact place of articulation), BUT only<br />
work for a small set of features that have simple acoustic properties.<br />
They don’t tell us place or voicing of stops, for example.<br />
• originally proposed as potentially powerful when combined with<br />
higher-order knowledge, especially in poor listening conditions: word<br />
recognition from the interaction of gross acoustic analysis and<br />
top-down prediction based on knowledge of syllabic constituency,<br />
phonotactic rules, and word-sequencing probabilities.<br />
• current thinking might reformulate that ‘knowledge’ in terms of<br />
statistical distributions, built up from repeated experience of the way<br />
the information occurs in the speech signal, and relating to the<br />
identification of phones, syllables or words.
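A toy sketch of how gross acoustic analysis can narrow down word candidates before any fine phonetic detail is used. The phone-to-class table, the class labels and the six-word lexicon are all hypothetical simplifications (a stop is coded here as silence followed by a transient), not taken from Zue (1985).

```python
from collections import defaultdict

# Hypothetical mapping from phones to broad, robust feature classes
BROAD_CLASS = {
    "s": ("SF",), "z": ("SF",),          # strong fricatives
    "f": ("WF",), "v": ("WF",),          # weak fricatives
    "m": ("N",), "n": ("N",),            # nasals
    "p": ("sil", "T"), "t": ("sil", "T"),  # stop = silence + transient
    "i": ("V",), "a": ("V",),            # vowels
}

# Hypothetical toy lexicon with simplified transcriptions
LEXICON = {
    "sit": ["s", "i", "t"],
    "sip": ["s", "i", "p"],
    "zip": ["z", "i", "p"],
    "fit": ["f", "i", "t"],
    "man": ["m", "a", "n"],
    "nap": ["n", "a", "p"],
}


def broad_pattern(phones):
    """Convert a phone sequence into its sequence of broad classes."""
    pattern = []
    for phone in phones:
        pattern.extend(BROAD_CLASS[phone])
    return tuple(pattern)


def build_index(lexicon):
    """Group words by their broad-class pattern."""
    index = defaultdict(list)
    for word, phones in lexicon.items():
        index[broad_pattern(phones)].append(word)
    return index


INDEX = build_index(LEXICON)


def candidates(pattern):
    """Return the words compatible with an observed broad-class pattern."""
    return sorted(INDEX.get(tuple(pattern), []))


print(candidates(["SF", "V", "sil", "T"]))
print(candidates(["N", "V", "N"]))
```

The broad pattern alone cannot separate "sit", "sip" and "zip" (it says nothing about place or voicing of stops), so the remaining choice has to come from finer acoustic detail or from top-down knowledge.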
Islands of auditory reliability<br />
• Some sounds are distinguished, and others<br />
are grouped together, because of the way the<br />
auditory system responds to them e.g.<br />
dimensions of vowel quality; vocal tract<br />
normalisation (between speakers).
Islands of auditory reliability<br />
[Figure: two vowel plots, with axes labelled vowel height (high to low) and front to back]<br />
Syrdal and Gopal (1986, JASA 79, 1086-1100):<br />
Left panel: Scatterplot on a linear frequency scale of F1 frequency versus F2<br />
frequency for American English vowels spoken by men, women, and children<br />
(data from Peterson & Barney 1952). Right panel: the same data, replotted on a<br />
Bark scale in terms of F3-F2 Bark frequency versus F1-f0 Bark frequency.
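A minimal sketch of the Bark-difference dimensions in the right panel. It assumes the Zwicker and Terhardt (1980) approximation for converting Hz to Bark, which may not be the exact conversion used for the figure, and the formant values are illustrative figures for an adult male speaker, not data from the slides.

```python
import math


def hz_to_bark(f_hz):
    """Convert frequency in Hz to Bark (Zwicker & Terhardt 1980 approximation)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def bark_dimensions(f0, f1, f2, f3):
    """Return (F1 - f0, F3 - F2) in Bark.

    A smaller F1 - f0 corresponds to a higher vowel;
    a smaller F3 - F2 corresponds to a fronter vowel.
    """
    height = hz_to_bark(f1) - hz_to_bark(f0)
    frontness = hz_to_bark(f3) - hz_to_bark(f2)
    return height, frontness


# Illustrative values in Hz (assumed)
i_height, i_front = bark_dimensions(f0=136, f1=270, f2=2290, f3=3010)   # high front vowel
a_height, a_front = bark_dimensions(f0=124, f1=730, f2=1090, f3=2440)   # low back vowel

print(f"high front vowel: F1-f0 = {i_height:.2f} Bark, F3-F2 = {i_front:.2f} Bark")
print(f"low back vowel:   F1-f0 = {a_height:.2f} Bark, F3-F2 = {a_front:.2f} Bark")
```

Because both dimensions are differences between neighbouring components on an auditory scale, they are relational measures that depend less on the absolute size of a speaker's vocal tract than raw F1 and F2 values do.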
Islands of acoustic/auditory reliability<br />
• Some acoustic-phonetic features are more robust<br />
across contexts and speakers than others.<br />
Of these:<br />
– All are defined or definable in relational terms.<br />
– Some are static e.g. vowels with two formants<br />
close together in frequency<br />
(“landmarks” in<br />
Quantal Theory)<br />
– Some are dynamic (e.g. consonantal gestures)<br />
[Figure: frequency (Hz) against time]
Acoustic/auditory invariance theory<br />
• Dynamic Relational<br />
invariants: esp.<br />
spectral changes on<br />
either side of abrupt<br />
boundaries between<br />
acoustic segments<br />
[Figure: +strident versus -strident examples]<br />
Stevens (2002) JASA 111, 1872-1891
Acoustic/Auditory invariance theory<br />
Stevens & Blumstein (1978)<br />
……. Stevens (2002)<br />
• For each distinctive feature (DF) there is a binary<br />
response to an invariant acoustic<br />
or auditory property (recently<br />
modified to a (continuous)<br />
probability of response)<br />
• e.g. particular changes in spectral<br />
shape over short time periods at<br />
crucial parts of the signal<br />
– segment boundaries<br />
– vowel steady states<br />
[Figure: +consonantal (change) versus -consonantal (little change)]<br />
Stevens (2002) JASA 111, 1872-1891<br />
Stevens & Blumstein (1978) JASA 64, 1358-1368
Dynamic relational invariants for stop<br />
place of articulation (Stevens)<br />
Onset of vowel [ɛ]; same principles for all<br />
obstruent-sonorant boundaries<br />
Bilabial: burst flat or falling,<br />
low amp<br />
Alveolar: rising: burst > vowel spectrum<br />
at high freqs; burst and vowel<br />
peak freqs uncorrelated<br />
Velar: compact mid-freq<br />
peak near F2 & F3
Summary: Relationships<br />
between properties of the signal are critical<br />
• Current views are that relationships between<br />
successive acoustic (and visual) events define<br />
linguistic categories as much as or more than static<br />
properties.<br />
• i.e. listeners interpret sensory information (e.g.<br />
acoustic and visual input) in terms of relationships<br />
between properties that reflect the coordinated,<br />
dynamic behaviour of the vocal tract.<br />
• This conclusion does not necessarily entail that the<br />
basic perceptual units are motoric: they are more<br />
likely to be modality-neutral, or multi-modal.
Relational properties of speech sounds<br />
1. Relational properties are central to classical theories.<br />
2. Phonological theory is also based on relationships/contrasts.<br />
3. Timing and rhythm are essentially relational, and basic to speech:<br />
the “glue” of speech perception.<br />
4. Spectral relationships, e.g.:<br />
• Sine wave speech: reproduces the right relationships between<br />
spectral components.<br />
• Coarticulated vowels in context are identified no worse than<br />
isolated vowels and sometimes better, although the steady states<br />
of different coarticulated vowels are not as distinctive in F1-F2<br />
space as those of isolated vowels (Gottfried & Strange 1980,<br />
Strange et al. 1979, Strange et al. 1976, Assmann et al. 1982,<br />
Macchi 1980, summarised in Pickett p161-165).
Cognitive construction of categories: phonetic<br />
perceptual prototypes<br />
• Newborn babies have good discrimination of simply presented<br />
“foreign” phonemic contrasts<br />
• They lose this ability as their own language develops.<br />
By 10-12 months of age, they tend only to<br />
discriminate those contrasts that are phonemic in<br />
their native language(s).<br />
• Kuhl: By 6 months of age, babies respond to classes<br />
of sounds (e.g. vowels, fricatives) spoken by different<br />
people as if they are all the same.
Kuhl: by 6 months of age babies have also<br />
developed language-specific vowel categories<br />
Discrimination by 6-month-old babies:<br />
• American babies. Exemplars of American /i/s: poor around good exemplars, good around bad exemplars. Exemplars of Swedish /i/s: no difference between good and bad exemplars.<br />
• Swedish babies. Exemplars of American /i/s: no difference between good and bad exemplars. Exemplars of Swedish /i/s: poor around good exemplars, good around bad exemplars.
Development of prototypical representations, each<br />
acting as a perceptual magnet “pulling” similar sounds<br />
towards it in perceptual space so they become less<br />
discriminable<br />
Psychoacoustic space<br />
with no phonetic category:<br />
no magnet effect<br />
Psychoacoustic space<br />
with a phonetic category:<br />
perceptual magnet effect
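The magnet idea can be illustrated with a toy warping of a one-dimensional perceptual space. The Gaussian pull function and its parameters below are assumptions made purely for illustration; this is not Kuhl's model, only a sketch of how a prototype could shrink perceived distances in its neighbourhood.

```python
import math


def perceived(x, prototype=0.0, pull=0.5, width=1.0):
    """Map a stimulus position to a perceived position.

    Stimuli close to the prototype are pulled towards it; the pull
    fades with distance. pull=0.0 means no phonetic category exists.
    """
    weight = math.exp(-((x - prototype) ** 2) / (2 * width ** 2))
    return x + pull * weight * (prototype - x)


def perceived_distance(a, b, **kwargs):
    """Perceived distance between two stimuli under the same warping."""
    return abs(perceived(a, **kwargs) - perceived(b, **kwargs))


# Two pairs with the same physical separation (0.5 units)
near = perceived_distance(0.0, 0.5)                   # pair next to the prototype
far = perceived_distance(3.0, 3.5)                    # pair far from the prototype
no_category = perceived_distance(0.0, 0.5, pull=0.0)  # no magnet at all

print(f"near prototype: {near:.3f}")
print(f"far from prototype: {far:.3f}")
print(f"no category: {no_category:.3f}")
```

In this sketch the pair next to the prototype ends up closer together in perceived space than an equally spaced pair elsewhere, which corresponds to the poorer discrimination around good exemplars.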
This reasoning led to the Native Language Magnet model of speech<br />
perception (early 1990s onwards, see<br />
Pickett p249-255). Recent extension: Kuhl (2007)<br />
But: what is a “phonetic category”?<br />
Kuhl is inexplicit, but implies it’s a phoneme.<br />
• But phonemes can’t be directly related to psychoacoustic space…<br />
• …and phones vary a lot in different contexts.<br />
Barrett (1997): (PhD thesis, CU Linguistics Dept)<br />
• magnet effects are context-sensitive: /u lu ju/ have independent<br />
prototypes & magnet effects<br />
• magnet effects differ depending on function: musicians have<br />
enhanced discrimination around C major chord, non-musicians do<br />
not, but can be trained to.<br />
So phonetic prototypes, demonstrated by perceptual magnet effects,<br />
operate at an unknown level of abstraction, possibly more than one,<br />
and may serve various different purposes.<br />
• Do they involve memories of good/common patterns (cf. semantics)?<br />
• Should we consider them as task-dependent functional processes?
Neurological and neuropsychological<br />
evidence about the nature of phonetic<br />
categories
Brain activation for category boundaries<br />
• Many studies: Superior<br />
Temporal Gyrus (STG)<br />
is active when phonetic<br />
decisions are made<br />
(+ many other areas)<br />
• STG activation does<br />
not differ when the<br />
decisions are hard<br />
(other areas do e.g. frontal regions)<br />
Binder et al. (2004) Nat.Neurosci. 7, 295-301<br />
Blumstein et al. (2005) J. Cog. Neuroscience 17, 1353-1366
Brain activation for category boundaries:<br />
Ganong effect<br />
• STG is sensitive to change<br />
in category boundary due<br />
to lexical status:<br />
gift-kift vs. giss-kiss<br />
• Conclusion: lexical<br />
knowledge influences<br />
basic phonetic<br />
categorization processes<br />
Lateral view of left hemisphere:<br />
differential activation for the same<br />
physical stimulus dependent on<br />
whether it is in a word or a non-word<br />
Myers & Blumstein (2007) Cerebral Cortex
yet also.... simple ba-da continuum<br />
• brain activation differs for category centers & boundaries<br />
(adaptation fMRI)<br />
centers:<br />
boundaries:<br />
Primary auditory cortex<br />
left STG, left parietal,<br />
right cerebellum, ant. cingulate<br />
[Figure panels: lateral view of left hemisphere; lateral view of right hemisphere; coronal view (slice through top); medial view of right hemisphere]<br />
Raizada & Poldrack (CNS 2004)
Brain activation for<br />
native vs. non-native sounds<br />
• American and Japanese<br />
listeners heard /ra/ and /la/<br />
stimuli (non-phonemic in<br />
Japanese)<br />
• American listeners had more<br />
focal activation, for a shorter<br />
time<br />
• Japanese listeners had more<br />
distributed activation, lasting<br />
longer<br />
• (the brain is typically more active<br />
when it processes difficult material)<br />
Zhang et al. (2005, Neuroimage 26: 703-720)
Functional grouping in the brain<br />
• Neurological and neuropsychological evidence suggests that all<br />
sorts of categories are constructed by the brain from the<br />
statistical regularities amongst the salient properties of events<br />
each person experiences.<br />
• They are represented as modality-specific memories: the<br />
concept banana is stored in the brain as a cluster of different<br />
memories--of particular bananas’ taste, smell, texture, what<br />
they look like, whether you like them or not, etc.<br />
• Such memories are thought to cluster into functional<br />
groupings of brain cell activity. Thus cells from many different<br />
parts of the brain contribute to a single memory, and a single<br />
concept.
If you adopt this view, then linguistic<br />
categories are just like any other<br />
category:<br />
1. multimodal and distributed in many different parts of the brain<br />
(auditory, visual, tactile, emotional…)<br />
2. context-sensitive (or relational) and therefore dynamic and labile<br />
3. constructed by each individual from his or her own experience<br />
4. constantly updated by new experience that fits into the category<br />
(another influence on their lability)<br />
5. can be thought of as hierarchically organised: smaller functional<br />
groupings combine into higher-order ones:<br />
– mouse—small furry mammals—larger furry mammals—<br />
mammals—animals<br />
– sound of [p] in syllable onset—syllable onset—syllable—foot (=<br />
stress group)—intonational phrase
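Points 3 and 4 above (categories built from experience and constantly updated) can be sketched as a running summary of the exemplars a listener has met. The one-dimensional cue, the class name Category and the exemplar values are hypothetical; a running mean and variance is only one simple way to represent a statistical distribution.

```python
class Category:
    """Running summary of the exemplars experienced for one category.

    Uses Welford's online algorithm, so each new exemplar updates the
    mean and variance without storing earlier exemplars.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        """Fold one new exemplar (e.g. a VOT value in ms) into the category."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the exemplars seen so far."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Hypothetical exemplars along a single cue dimension
category = Category()
for exemplar in [10.0, 20.0, 30.0]:
    category.update(exemplar)
mean_before, variance_before = category.mean, category.variance

category.update(60.0)  # new experience shifts the category
mean_after = category.mean
print(mean_before, variance_before, mean_after)
```

Each listener would accumulate a different summary from his or her own input, and a single unusual exemplar moves the category centre, which is one way to picture why categories are labile.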
Some questions<br />
If this is so:<br />
• what determines how the categories develop?<br />
• what constrains the possible types of category,<br />
and the relationships between them?