Linguistic categories and speech perception


Paper 9 

Foundations of Speech Communication 

Sarah Hawkins 

9 November 2007

Aim 

• To consider how (or why) we seem to 

recognize discrete linguistic units from the 

speech signal. 

(Note the assumption that mental linguistic 

units are discrete.)

3 ways discrete linguistic-phonetic 

categories might be perceived 

1. they are really there in the acoustic signal 

acoustic invariance (or reliability) 

2. they result from the way the auditory system 

processes sound 

auditory invariance (or reliability) 

3. they result from the way the brain processes 

any sort of information, sensory or not 

‘cognitive invariance’

What is a category?

A class or division in a system of classification

Structure of a category 

[Figure: exemplars graded by quality (poor, ok, good, best), with category boundaries marked]

[Images: thrush in summer; thrush in snow; sparrow in summer]

Reminder: 

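Ladefoged and Broadbent (1957)

Carrier sentence: "Please say what this word is: ___" (test words: bit, bet, bat, but)

F1 range of carrier 200-380 Hz: test word heard as "bet"

F1 range of carrier 380-660 Hz: test word heard as "bit"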

Ladefoged and Broadbent (1957) JASA 29, 98-104

How ‘categorical’ is Categorical Perception?

• Category boundaries are not stable, but 

highly labile: they shift under the influence of 

many different factors.

[Figure: waveforms of three stimuli with VOT of -40 ms, +10 ms and +100 ms; x-axis: Time (s)]

CP boundary shifts: range effects 

• identification experiment, e.g. on a VOT continuum da..........ta

• when stimuli are removed from one end of the continuum, the 50% identification boundary shifts towards the other end

[Figure: % /d/ responses (0 to 100) against VOT, from short VOT (d) to long VOT (t), with the boundary shift marked]
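The 50% identification boundary can be estimated from identification data by interpolating between the two stimuli whose responses straddle 50%. A minimal Python sketch, assuming invented response percentages: the function name `boundary_50` and every number below are illustrative, not data from any study.

```python
def boundary_50(vot_ms, pct_d):
    """Estimate the VOT (ms) at which % /d/ responses fall through 50%.

    Uses linear interpolation between the two neighbouring stimuli
    whose response percentages straddle 50%.
    """
    points = list(zip(vot_ms, pct_d))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 >= 50 >= y1 and y0 != y1:
            return x0 + (y0 - 50) * (x1 - x0) / (y0 - y1)
    raise ValueError("responses never cross 50%")


# Hypothetical data: full da-ta continuum.
full_vot = [0, 10, 20, 30, 40, 50, 60]
full_pct_d = [100, 98, 90, 60, 20, 5, 0]

# Hypothetical data: long-VOT stimuli removed, so mid-range stimuli
# now attract fewer /d/ responses.
short_vot = [0, 10, 20, 30, 40]
short_pct_d = [100, 95, 70, 30, 10]

full_boundary = boundary_50(full_vot, full_pct_d)
short_boundary = boundary_50(short_vot, short_pct_d)
print(full_boundary, short_boundary)
```

With these invented numbers the boundary moves towards the short-VOT end once the long-VOT stimuli are removed, which is the direction of the range effect described on this slide.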

CP boundary shifts: cue trading 

• Cue trading: more of one property compensates for less of another

e.g. for stimuli whose VOT is ambiguous between /da/ and /ta/, decreasing burst amplitude causes more /da/s to be perceived

[Figure: % /d/ responses (0 to 100) for high burst amplitude (fewer /d/ responses) and low burst amplitude (more /d/ responses)]
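Cue trading can be pictured as one cue shifting the boundary defined on another. The sketch below is a toy logistic model, not a fitted one: the function `p_d`, the boundary, the slope, the trading ratio and the reference amplitude are all assumed values chosen only to show the direction of the effect.

```python
import math


def p_d(vot_ms, burst_db, boundary_ms=30.0, slope=0.3,
        trade_ms_per_db=0.5, ref_db=60.0):
    """Probability of a /d/ response in a toy two-cue logistic model.

    All parameter values are hypothetical. A burst weaker than ref_db
    moves the VOT boundary to longer VOTs, so more stimuli count as /d/.
    """
    shifted_boundary = boundary_ms + trade_ms_per_db * (ref_db - burst_db)
    return 1.0 / (1.0 + math.exp(slope * (vot_ms - shifted_boundary)))


# The same ambiguous VOT (30 ms) with three burst amplitudes.
for burst in (70.0, 60.0, 50.0):
    print(burst, round(p_d(30.0, burst), 3))
```

The stimulus with ambiguous VOT is heard as /d/ more often when the burst is weak and less often when it is strong, so less of one cue is offset by more of the other.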


CP boundary shifts: meaningfulness

• “Ganong effect” (word~nonword): more /d/ responses if the real word begins with /d/ (dash-tash)

• Similar effects for sentence meaning (if the task is appropriate), e.g. The farmer milked the [g/k]oat

[Figure: % /d/ responses (0 to 100) against VOT, from short VOT (d) to long VOT (t), for a nonword-word continuum (dask-task) and a word-nonword continuum (dash-tash)]

Ganong (1980) J. Exp. Psych: HPP 6, 110-125 

Borsky, Shapiro, Tuller (2000) J. Psycholinguistic Res. 29, 155-168

Perception adjusts to the distribution of stimuli and is more forgiving about unclear sounds if the message makes sense


CP: category goodness 

Eye-tracking 

Task: click on the picture corresponding to the heard word

Stimuli varied in VOT: bear-pear

More looks to the competitor picture as VOT approaches the category boundary

McMurray et al. (2003) J. Psycholing. Res. 32, 77-97

CP: category goodness 

Mediated Priming in lexical decision task 

A /t/ with a short VOT primes unrelated words via rhymes that have /d/ instead of /t/:

t*ime primes penny via dime

[Figure: reaction times for Related, Modified and Neutral primes]

Misiurski et al. (2005) Brain & Lang. 93, 64-78

Does speech convey invariant information about linguistic categories?

Classical theories held/hold that: 

• We necessarily perceive phonemes or 

phonological distinctive features when we 

listen to speech… 

• …because each phoneme or feature bears an 

invariant relationship with some 

property/properties of the speech signal

Classical assumptions 

Some theorists suggested invariants could be 

modality-neutral, but most debate was 

polarised: invariant units “had to be” either 

• motoric 

– the Motor Theory of Speech Perception 

– direct realism/direct perception (Fowler, Best) 

• or acoustic or auditory 

– quantal theory, acoustic invariance (K. Stevens) 

– auditory enhancement theory (Kingston, Diehl, 

Kluender). 

Read about these theories in Pickett, chapters 13-15, esp. 14. 

If you are very interested, ask your supervisor for more recent developments


Most of these theories implicitly assume(d): 

• there is one basic perceptual unit 

e.g. distinctive features, gestures 

• linguistic categorization is either inherent in the 

signal itself, or the automatic consequence of 

low-level perceptual processes.


• Almost all theories were vague about how we 

move from identification of the basic phonetic 

or phonological unit to understanding meaning. 

• No influential classical theory considered that 

the same basic processes could be involved in 

all aspects of speech understanding 

• Exception to both these statements: Klatt’s Lexical Access From Spectra (LAFS) model

Klatt, D. H. (1979). Speech perception: A model of acoustic-phonetic 

analysis and lexical access. Journal of Phonetics, 7, 279-312.

Do linguistic units have invariant correlates?

• There is no strong evidence that all linguistic 

units, even of a single type, are invariantly and 

reliably present in the speech signal 

• Yet some acoustic-phonetic features are more robust across contexts and speakers than others, i.e. their properties are a good deal more invariant and reliable, especially if they are considered in relation to their surrounding context and with respect to known properties of the auditory system.

Robust features: spectrogram of 

“My family lives in Oxford” 

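[Figure: spectrogram labelled with robust feature classes. Label sequence: N V WF V N V V l V SF V N V sil SF WF V sil (N = nasal, V = vowel, WF = weak fricative, SF = strong fricative, sil = silence). Further annotations mark diphthongs, transients, voicing in an obstruent, and a glottal stop before a voiceless stop.]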

Robust features (e.g. Zue, 1985) 

• "Strong" fricative — "weak" fricative — 

nasal — periodic — silence — transient — 

vowel (high/low, front/back, spread/round). 

• These offer a set of "invariant" acoustic 

features from which to make preliminary 

decisions about what words were spoken. 

• Some Automatic Speech Recognition (ASR) techniques use such broad featural categories; they are less widely applied in work on human speech perception.

Robust features 

• usually clearly visible in spectrograms 

• independent of one another (e.g. you can know it’s a “strong 

fricative” without knowing its exact place of articulation), BUT only 

work for a small set of features that have simple acoustic properties. 

They don’t tell us place or voicing of stops, for example. 

• originally proposed as potentially powerful when combined with 

higher-order knowledge, especially in poor listening conditions: word 

recognition from the interaction of gross acoustic analysis and 

top-down prediction based on knowledge of syllabic constituency, 

phonotactic rules, and word-sequencing probabilities. 

• current thinking might reformulate that ‘knowledge’ in terms of 

statistical distributions, built up from repeated experience of the way 

the information occurs in the speech signal, and relating to the 

identification of phones, syllables or words.

Islands of auditory reliability 

• Some sounds are distinguished, and others 

are grouped together, because of the way the 

auditory system responds to them e.g. 

dimensions of vowel quality; vocal tract 

normalisation (between speakers).

Islands of auditory reliability 

high 

vowel height 

low 

front back 

Syrdal and Gopal (1986, JASA 79, 1086-1100): 

Left panel: Scatterplot on a linear frequency scale of F1 frequency versus F2 

frequency for American English vowels spoken by men, women, and children 

(data from Peterson & Barney 1952). Right panel: the same data, replotted on a 

Bark scale in terms of F3-F2 Bark frequency versus F1-f0 Bark frequency.
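The replotting in the right panel can be sketched in a few lines of Python. The slide does not say which Hz-to-Bark conversion was used, so the Zwicker & Terhardt (1980) approximation below is an assumption, and the formant values are illustrative adult-male figures, not taken from the plotted data. The names `hz_to_bark` and `bark_differences` are made up for this example.

```python
import math


def hz_to_bark(f_hz):
    """Convert Hz to Bark (Zwicker & Terhardt 1980 approximation; assumed here)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def bark_differences(f0, f1, f2, f3):
    """Return (F1 - f0, F3 - F2) in Bark, the two axes of the right panel."""
    return (hz_to_bark(f1) - hz_to_bark(f0),
            hz_to_bark(f3) - hz_to_bark(f2))


# Illustrative values in Hz, for demonstration only.
vowel_i = dict(f0=136, f1=270, f2=2290, f3=3010)  # high front vowel
vowel_a = dict(f0=124, f1=730, f2=1090, f3=2440)  # low back vowel

height_i, front_i = bark_differences(**vowel_i)
height_a, front_a = bark_differences(**vowel_a)
print(round(height_i, 2), round(front_i, 2))
print(round(height_a, 2), round(front_a, 2))
```

In this sketch the high vowel has the smaller F1-f0 difference and the front vowel the smaller F3-F2 difference, which is the sense in which the two Bark-difference axes line up with vowel height and the front-back dimension.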

Islands of acoustic/auditory reliability 

• Some acoustic-phonetic features are more robust 

across contexts and speakers than others. 

Of these: 

– All are defined or definable in relational terms. 

– Some are static, e.g. vowels with two formants close together in frequency (“landmarks” in Quantal Theory)

– Some are dynamic (e.g. consonantal gestures)

Acoustic/auditory invariance theory 

• Dynamic relational invariants: esp. spectral changes on either side of abrupt boundaries between acoustic segments

[Figure: +strident vs. -strident examples]

Stevens (2002) JASA 111, 1872-1891

Acoustic/Auditory invariance theory 

Stevens & Blumstein (1978) ... Stevens (2002)

• For each distinctive feature (DF) there is a binary response to an invariant acoustic or auditory property (recently modified to a (continuous) probability of response)

• e.g. particular changes in spectral shape over short time periods at crucial parts of the signal

– segment boundaries

– vowel steady states

[Figure labels: +consonantal, -consonantal; change, little change]

Stevens (2002) JASA 111, 1872-1891 

Stevens & Blumstein (1978) JASA 64, 1358-1368

Dynamic relational invariants for stop 

place of articulation (Stevens) 

Spectrum at the onset of the vowel [ɛ] (same principles for all obstruent-sonorant boundaries):

Bilabial: burst flat or falling, low amp

Alveolar: rising: burst > vowel spectrum at high freqs; burst and vowel peak freqs uncorrelated

Velar: compact mid-freq peak near F2 & F3

Summary: Relationships 

between properties of the signal are critical 

• Current views are that relationships between 

successive acoustic (and visual) events define 

linguistic categories as much or more than static 

properties 

• i.e. listeners interpret sensory information (e.g. 

acoustic and visual input) in terms of relationships 

between properties that reflect the coordinated, 

dynamic behaviour of the vocal tract. 

• This conclusion does not necessarily entail that the 

basic perceptual units are motoric: they are more 

likely to be modality-neutral, or multi-modal.

Relational properties of speech sounds 

1. Relational properties are central to classical theories. 

2. Phonological theory is also based on relationships/contrasts. 

3. Timing and rhythm are essentially relational, and basic to speech: 

the “glue” of speech perception. 

4. Spectral relationships: e.g.: 

• Sine wave speech: reproduces the right relationships between 

spectral components. 

• Coarticulated vowels in context are identified no worse than 

isolated vowels and sometimes better, although the steady states 

of different coarticulated vowels are not as distinctive in F1-F2 

space as those of isolated vowels (Gottfried & Strange 1980, 

Strange et al. 1979, Strange et al. 1976, Assmann et al. 1982, 

Macchi 1980, summarised in Pickett p161-165).

Cognitive construction of categories: phonetic 

perceptual prototypes 

• Newborn babies have good discrimination of simply presented “foreign” phonemic contrasts

• They lose this ability as their own language develops. 

By 10-12 months of age, they tend only to 

discriminate those contrasts that are phonemic in 

their native language(s). 

• Kuhl: By 6 months of age, babies respond to classes 

of sounds (e.g. vowels, fricatives) spoken by different 

people as if they are all the same.

Kuhl: by 6 months of age babies have also 

developed language-specific vowel categories 

Discrimination by 6-month-old babies:

                   Exemplars of American /i/s    Exemplars of Swedish /i/s
                   good          bad             good          bad
American babies    poor          good            no difference
Swedish babies     no difference                 poor          good

Development of prototypical representations, each 

acting as a perceptual magnet “pulling” similar sounds 

towards it in perceptual space so they become less 

discriminable 

Psychoacoustic space 

with no phonetic category: 

no magnet effect 

Psychoacoustic space 

with a phonetic category: 

perceptual magnet effect
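The idea of a prototype “pulling” nearby sounds towards it can be illustrated with a toy warping of a one-dimensional perceptual space. This is only a sketch of the general idea, not Kuhl's model: the Gaussian pull and the `strength` and `width` values are assumptions made for the example.

```python
import math


def warp(x, prototype=0.0, strength=0.5, width=1.0):
    """Map a stimulus position to a 'perceived' position pulled toward the prototype.

    The pull is strongest at the prototype and fades with distance
    (hypothetical Gaussian weighting).
    """
    pull = strength * math.exp(-((x - prototype) ** 2) / (2 * width ** 2))
    return x + pull * (prototype - x)


def perceived_distance(a, b):
    """Distance between two stimuli after warping."""
    return abs(warp(a) - warp(b))


# Two pairs with the same physical separation (0.2 units).
near = perceived_distance(0.0, 0.2)  # pair next to the prototype
far = perceived_distance(3.0, 3.2)   # pair far from the prototype
print(round(near, 3), round(far, 3))
```

The pair next to the prototype ends up closer together in the warped space than the equally spaced pair far from it, i.e. it would be harder to discriminate.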

This reasoning led to the Native Language Magnet model of speech 

perception (early 1990s onwards, see 

Pickett p249-255). Recent extension: Kuhl (2007) 

But: what is a “phonetic category”?

Kuhl is inexplicit, but implies it’s a phoneme. 

• But phonemes can’t be directly related to psychoacoustic space… 

• …and phones vary a lot in different contexts. 

Barrett (1997): (PhD thesis, CU Linguistics Dept) 

• magnet effects are context-sensitive: /u lu ju/ have independent 

prototypes & magnet effects 

• magnet effects differ depending on function: musicians have 

enhanced discrimination around C major chord, non-musicians do 

not, but can be trained to. 

So phonetic prototypes, demonstrated by perceptual magnet effects, 

operate at unknown and possibly more than one level of abstraction, 

and may serve various different purposes. 

• Do they involve memories of good/common patterns (cf. semantics)?

• Should we consider them as task-dependent functional processes?

Neurological and neuropsychological 

evidence about the nature of phonetic 

categories

Brain activation for category boundaries 

• Many studies: Superior 

Temporal Gyrus (STG) 

is active when phonetic 

decisions are made 

(+ many other areas) 

• STG activation does 

not differ when the 

decisions are hard 

(other areas do e.g. frontal regions) 

Binder et al. (2004) Nat.Neurosci. 7, 295-301 

Blumstein et al. (2005) J. Cog. Neuroscience 17, 1353-1366

Brain activation for category boundaries: 

Ganong effect 

• STG is sensitive to change 

in category boundary due 

to lexical status: 

gift-kift vs. giss-kiss 

• Conclusion: lexical 

knowledge influences 

basic phonetic 

categorization processes 

Lateral view of left hemisphere: 

differential activation for the same 

physical stimulus dependent on 

whether it is in a word or a non-word 

Myers & Blumstein (2007) Cerebral Cortex

yet also.... simple ba-da continuum 

• brain activation differs for category centers & boundaries 

(adaptation fMRI) 

centers: primary auditory cortex

boundaries: left STG, left parietal, right cerebellum, ant. cingulate

[Figure panels: lateral view of left hemisphere; lateral view of right hemisphere; coronal view (slice through top); medial view of right hemisphere]

Raizada & Poldrack (CNS 2004)

Brain activation for 

native vs. non-native sounds 

• American and Japanese 

listeners heard /ra/ and /la/ 

stimuli (non-phonemic in 

Japanese) 

• American listeners had more 

focal activation, for a shorter 

time 

• Japanese listeners had more 

distributed activation, lasting 

longer 

• (the brain is typically more active 

when it processes difficult material) 

Zhang et al. (2005, Neuroimage 26: 703-720)

Functional grouping in the brain 

• Neurological and neuropsychological evidence suggests that all 

sorts of categories are constructed by the brain from the 

statistical regularities amongst the salient properties of events 

each person experiences. 

• They are represented as modality-specific memories: the 

concept banana is stored in the brain as a cluster of different 

memories--of particular bananas’ taste, smell, texture, what 

they look like, whether you like them or not, etc. 

• Such memories are thought to cluster into functional 

groupings of brain cell activity. Thus cells from many different 

parts of the brain contribute to a single memory, and a single 

concept.

If you adopt this view, then linguistic 

categories are just like any other 

category: 

1. multimodal and distributed in many different parts of the brain 

(auditory, visual, tactile, emotional…..) 

2. context-sensitive (or relational) and therefore dynamic and labile 

3. constructed by each individual from his or her own experience 

4. constantly updated by new experience that fits into the category 

(another influence on their lability) 

5. can be thought of as hierarchically organised: smaller functional 

groupings combine into higher-order ones: 

– mouse—small furry mammals—larger furry mammals— 

mammals—animals 

– sound of [p] in syllable onset—syllable onset—syllable—foot (= 

stress group)—intonational phrase

Some questions 

If this is so: 

• what determines how the categories develop?

• what constrains the possible types of category, and the relationships between them?
