
Linguistic categories and speech perception<br />

Paper 9<br />

Foundations of Speech Communication<br />

Sarah Hawkins<br />

9 November 2007


Aim<br />

• To consider how (or why) we seem to recognize discrete linguistic units from the speech signal.<br />

(Note the assumption that mental linguistic<br />

units are discrete.)


3 ways discrete linguistic-phonetic categories might be perceived<br />

1. they are really there in the acoustic signal<br />

acoustic invariance (or reliability)<br />

2. they result from the way the auditory system<br />

processes sound<br />

auditory invariance (or reliability)<br />

3. they result from the way the brain processes<br />

any sort of information, sensory or not<br />

‘cognitive invariance’


What is a category?<br />

A class or division in a<br />

system of classification


Structure of a category<br />

[Figure: quality of exemplars within a category, graded poor, ok, good, best, with the category boundaries marked]


Thrush in summer<br />

Thrush in snow<br />

Sparrow in summer


Reminder:<br />

Ladefoged and Broadbent (1957)<br />

"Please say what this word is:"<br />

bit bet bat but<br />

bet<br />

bit<br />

F1 of CARRIER<br />

200-380 Hz<br />

380-660 Hz<br />

Ladefoged and Broadbent (1957) JASA 29, 98-104


How ‘categorical’ is Categorical Perception?<br />

• Category boundaries are not stable, but<br />

highly labile: they shift under the influence of<br />

many different factors.


[Figure: waveforms of three stimuli with VOT -40 ms, VOT +10 ms and VOT +100 ms; x-axis: Time (s)]<br />

CP boundary shifts: range effects<br />

• identification experiment, e.g. VOT continuum da..........ta<br />

• when stimuli are removed from one end, the 50% identification boundary shifts towards the other<br />

[Figure: % /d/ responses (0 to 100) plotted against VOT, from short (d) to long (t), with the boundary shift marked]
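
The 50% identification boundary mentioned on this slide can be estimated from response percentages by interpolation. The following is a minimal sketch only: the function name `boundary_50` and all the numbers are hypothetical, chosen to illustrate a boundary that moves towards the short-VOT end when the long-VOT stimuli are removed. They are not data from any study.

```python
from typing import Sequence


def boundary_50(vot_ms: Sequence[float], pct_d: Sequence[float]) -> float:
    """Return the VOT (ms) at which % /d/ responses cross 50%.

    Uses linear interpolation between the two neighbouring stimuli
    whose response percentages bracket 50%.
    """
    points = list(zip(vot_ms, pct_d))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 >= 50.0 >= y1 and y0 != y1:
            return x0 + (y0 - 50.0) * (x1 - x0) / (y0 - y1)
    raise ValueError("responses never cross 50%")


# Hypothetical responses for the full continuum (illustrative numbers only).
full_vot = [0, 10, 20, 30, 40, 50, 60]
full_pct_d = [100, 98, 90, 50, 10, 2, 0]

# Hypothetical responses after the two longest-VOT stimuli are removed.
short_vot = [0, 10, 20, 30, 40]
short_pct_d = [100, 95, 70, 30, 5]

print(boundary_50(full_vot, full_pct_d))    # boundary for the full range
print(boundary_50(short_vot, short_pct_d))  # boundary for the truncated range
```

With these made-up numbers the boundary moves from 30 ms to 25 ms, i.e. towards the end of the continuum that was not truncated.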


CP boundary shifts: cue trading<br />

• Cue trading: more of one property compensates for less of another<br />

e.g. for stimuli whose VOT is ambiguous between /da/ and /ta/, decreasing burst amplitude causes more /da/s to be perceived<br />

[Figure: % /d/ responses (0 to 100) plotted against VOT, from short (d) to long (t), for two burst amplitudes: high (fewer /d/ responses) and low (more /d/ responses)]
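
One common way to picture cue trading is a logistic model in which two cues add together before a decision is made. The sketch below assumes such a model purely for illustration: the weights `B0`, `W_VOT` and `W_BURST`, and the idea of expressing burst amplitude in dB relative to an arbitrary reference, are hypothetical and not taken from the studies cited in these slides.

```python
import math

# Hypothetical weights, chosen only for illustration.
B0 = 6.0        # overall bias towards /d/
W_VOT = -0.2    # per ms: longer VOT gives fewer /d/ responses
W_BURST = -0.1  # per dB: higher burst amplitude gives fewer /d/ responses


def p_d(vot_ms: float, burst_db: float) -> float:
    """Probability of a /d/ response under the assumed two-cue logistic model."""
    z = B0 + W_VOT * vot_ms + W_BURST * burst_db
    return 1.0 / (1.0 + math.exp(-z))


def boundary_vot(burst_db: float) -> float:
    """VOT (ms) at which the model gives 50% /d/ for a given burst amplitude."""
    return -(B0 + W_BURST * burst_db) / W_VOT


# An ambiguous VOT (30 ms here) at three burst amplitudes (dB re. reference).
for burst in (-10.0, 0.0, 10.0):
    print(burst, round(p_d(30.0, burst), 3), round(boundary_vot(burst), 1))
```

In this sketch, lowering the burst amplitude raises the proportion of /d/ responses at the ambiguous VOT and moves the 50% boundary to a longer VOT, which is the trading relation described above.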


CP boundary shifts: meaningfulness<br />

• “Ganong effect” (word~nonword): more /d/ responses if the real word begins with /d/ (dash-tash)<br />

• Similar effects for sentence meaning (if the task is appropriate), e.g. The farmer milked the [g/k]oat<br />

[Figure: % /d/ responses (0 to 100) plotted against VOT, from short (d) to long (t), for a nonword-word continuum (dask-task) and a word-nonword continuum (dash-tash)]<br />

Ganong (1980) J. Exp. Psych: HPP 6, 110-125<br />

Borsky, Shapiro, Tuller (2000) J. Psycholinguistic Res. 29, 155-168


Perception adjusts to the distribution of stimuli and is more forgiving about unclear sounds if the message makes sense<br />

[Figure: % /d/ responses (0 to 100) plotted against VOT, from short (d) to long (t)]


CP: category goodness<br />

Eye-tracking<br />

Task: click on picture corresponding to heard word<br />

Stimuli varied in VOT: bear--pear<br />

More looks to competitor picture as VOT approaches category boundary<br />

McMurray et al. (2003) J. Psycholing. Res. 32, 77-97


CP: category goodness<br />

Mediated Priming in lexical decision task<br />

A /t/ with a short VOT primes unrelated words via rhymes that have /d/ instead of /t/, e.g. t*ime primes penny via dime<br />

[Figure: reaction times in the Related, Modified and Neutral conditions]<br />

Misiurski et al. (2005) Brain & Lang. 93, 64-78


Does speech convey invariant information about linguistic categories?<br />

Classical theories held/hold that:<br />

• We necessarily perceive phonemes or phonological distinctive features when we listen to speech…<br />

• …because each phoneme or feature bears an invariant relationship with some property/properties of the speech signal


Classical assumptions<br />

Some theorists suggested invariants could be<br />

modality-neutral, but most debate was<br />

polarised: invariant units “had to be” either<br />

• motoric<br />

– the Motor Theory of Speech Perception<br />

– direct realism/direct perception (Fowler, Best)<br />

• or acoustic or auditory<br />

– quantal theory, acoustic invariance (K. Stevens)<br />

– auditory enhancement theory (Kingston, Diehl,<br />

Kluender).<br />

Read about these theories in Pickett, chapters 13-15, esp. 14.<br />

If you are very interested, ask your supervisor for more recent developments


Classical assumptions<br />

Most of these theories implicitly assume(d):<br />

• there is one basic perceptual unit<br />

e.g. distinctive features, gestures<br />

• linguistic categorization is either inherent in the<br />

signal itself, or the automatic consequence of<br />

low-level perceptual processes.


Classical assumptions<br />

• Almost all theories were vague about how we move from identification of the basic phonetic or phonological unit to understanding meaning.<br />

• No influential classical theory considered that the same basic processes could be involved in all aspects of speech understanding.<br />

• Exception to both these statements: Klatt’s Lexical Access From Spectra (LAFS) model<br />

Klatt, D. H. (1979). Speech perception: A model of acoustic-phonetic analysis and lexical access. Journal of Phonetics, 7, 279-312.


Do linguistic units have invariant correlates?<br />

• There is no strong evidence that all linguistic units, even of a single type, are invariantly and reliably present in the speech signal.<br />

• Yet some acoustic-phonetic features are more robust across contexts and speakers than others, i.e. their properties are a good deal more invariant and reliable, especially if they are considered in relation to their surrounding context and with respect to known properties of the auditory system.


Robust features: spectrogram of “My family lives in Oxford”<br />

[Figure: spectrogram labelled with broad feature classes: N V WF V N V V l V SF V N V sil SF WF V sil; secondary labels: diph, diph, (l), N, diph, transient, transient; annotations: voicing in an obstruent; glottal stop (before voiceless stop)]


Robust features (e.g. Zue, 1985)<br />

• "Strong" fricative — "weak" fricative —<br />

nasal — periodic — silence — transient —<br />

vowel (high/low, front/back, spread/round).<br />

• These offer a set of "invariant" acoustic<br />

features from which to make preliminary<br />

decisions about what words were spoken.<br />

• Some Automatic Speech Recognition (ASR) techniques use such broad featural categories; less widely applied to human speech perception work.


Robust features<br />

• usually clearly visible in spectrograms<br />

• independent of one another (e.g. you can know it’s a “strong<br />

fricative” without knowing its exact place of articulation), BUT only<br />

work for a small set of features that have simple acoustic properties.<br />

They don’t tell us place or voicing of stops, for example.<br />

• originally proposed as potentially powerful when combined with higher-order knowledge, especially in poor listening conditions: word recognition from the interaction of gross acoustic analysis and top-down prediction based on knowledge of syllabic constituency, phonotactic rules, and word-sequencing probabilities.<br />

• current thinking might reformulate that ‘knowledge’ in terms of statistical distributions, built up from repeated experience of the way the information occurs in the speech signal, and relating to the identification of phones, syllables or words.


Islands of auditory reliability<br />

• Some sounds are distinguished, and others are grouped together, because of the way the auditory system responds to them, e.g. dimensions of vowel quality; vocal tract normalisation (between speakers).


Islands of auditory reliability<br />

[Figure: two vowel scatterplots, with vowel height running from high to low and the other dimension from front to back]<br />

Syrdal and Gopal (1986, JASA 79, 1086-1100):<br />

Left panel: Scatterplot on a linear frequency scale of F1 frequency versus F2 frequency for American English vowels spoken by men, women, and children (data from Peterson & Barney 1952). Right panel: the same data, replotted on a Bark scale in terms of F3-F2 Bark frequency versus F1-f0 Bark frequency.
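
The replotting described in this caption amounts to converting each frequency to Bark and then taking differences. The sketch below uses Traunmüller's (1990) approximation of the Bark scale, z = 26.81 f / (1960 + f) - 0.53, without its corrections at the extreme ends of the scale. This is an assumption made for illustration: the conversion used by Syrdal and Gopal may differ, and the formant values in the example are hypothetical, roughly of the kind expected for an adult male /i/.

```python
def hz_to_bark(f_hz: float) -> float:
    """Approximate Bark value for a frequency in Hz (Traunmueller 1990).

    End corrections for very low and very high frequencies are omitted.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_differences(f0: float, f1: float, f2: float, f3: float) -> tuple[float, float]:
    """Return (F3-F2, F1-f0) in Bark, the two dimensions named in the caption."""
    return (hz_to_bark(f3) - hz_to_bark(f2),
            hz_to_bark(f1) - hz_to_bark(f0))


# Hypothetical values in Hz for an /i/-like vowel (illustration only).
f3_minus_f2, f1_minus_f0 = bark_differences(f0=136.0, f1=270.0, f2=2290.0, f3=3010.0)
print(round(f3_minus_f2, 2), round(f1_minus_f0, 2))
```

Working with differences on an auditory scale, rather than with raw formant frequencies in Hz, is what makes the measures relational: each value is expressed relative to a neighbouring component of the same speaker's signal.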


Islands of acoustic/auditory reliability<br />

• Some acoustic-phonetic features are more robust across contexts and speakers than others.<br />

Of these:<br />

– All are defined or definable in relational terms.<br />

– Some are static, e.g. vowels with two formants close together in frequency (“landmarks” in Quantal Theory)<br />

– Some are dynamic (e.g. consonantal gestures)<br />

[Figure: frequency (Hz) plotted against time]


Acoustic/auditory invariance theory<br />

• Dynamic Relational<br />

invariants: esp.<br />

spectral changes on<br />

either side of abrupt<br />

boundaries between<br />

acoustic segments<br />

+strident -strident<br />

Stevens (2002) JASA 111, 1872-1891


Acoustic/Auditory invariance theory<br />

Stevens & Blumstein (1978)<br />

……. Stevens (2002)<br />

+consonantal -consonantal<br />

• For each distinctive feature (DF) there is a binary response to an invariant acoustic or auditory property (recently modified to a (continuous) probability of response)<br />

• e.g. particular changes in spectral<br />

shape over short time periods at<br />

crucial parts of the signal<br />

– segment boundaries<br />

– vowel steady states<br />

change<br />

little change<br />

Stevens (2002) JASA 111, 1872-1891<br />

Stevens & Blumstein (1978) JASA 64, 1358-1368


Dynamic relational invariants for stop<br />

place of articulation (Stevens)<br />

Bilabial: burst flat or falling, low amp<br />

Alveolar: rising: burst > vowel spectrum at high freqs; burst and vowel peak freqs uncorrelated<br />

Velar: compact mid-freq peak near F2 & F3<br />

same principles for all obstruent-sonorant boundaries<br />

[Figure: spectra at the onset of vowel [ɛ] for each place of articulation]


Summary: Relationships<br />

between properties of the signal are critical<br />

• Current views are that relationships between successive acoustic (and visual) events define linguistic categories as much as or more than static properties<br />

• i.e. listeners interpret sensory information (e.g. acoustic and visual input) in terms of relationships between properties that reflect the coordinated, dynamic behaviour of the vocal tract.<br />

• This conclusion does not necessarily entail that the basic perceptual units are motoric: they are more likely to be modality-neutral, or multi-modal.


Relational properties of speech sounds<br />

1. Relational properties are central to classical theories.<br />

2. Phonological theory is also based on relationships/contrasts.<br />

3. Timing and rhythm are essentially relational, and basic to speech: the “glue” of speech perception.<br />

4. Spectral relationships, e.g.:<br />

• Sine wave speech: reproduces the right relationships between spectral components.<br />

• Coarticulated vowels in context are identified no worse than isolated vowels and sometimes better, although the steady states of different coarticulated vowels are not as distinctive in F1-F2 space as those of isolated vowels (Gottfried & Strange 1980, Strange et al. 1979, Strange et al. 1976, Assmann et al. 1982, Macchi 1980, summarised in Pickett p161-165).


Cognitive construction of categories: phonetic perceptual prototypes<br />

• Newborn babies have good discrimination of simply presented “foreign” phonemic contrasts<br />

• They lose this ability as their own language develops.<br />

By 10-12 months of age, they tend only to<br />

discriminate those contrasts that are phonemic in<br />

their native language(s).<br />

• Kuhl: By 6 months of age, babies respond to classes<br />

of sounds (e.g. vowels, fricatives) spoken by different<br />

people as if they are all the same.


Kuhl: by 6 months of age babies have also developed language-specific vowel categories<br />

Discrimination by 6 month old babies:<br />

American babies: poor for good exemplars of American /i/s, good for bad exemplars; no difference for Swedish /i/s<br />

Swedish babies: no difference for American /i/s; poor for good exemplars of Swedish /i/s, good for bad exemplars


Development of prototypical representations, each<br />

acting as a perceptual magnet “pulling” similar sounds<br />

towards it in perceptual space so they become less<br />

discriminable<br />

Psychoacoustic space<br />

with no phonetic category:<br />

no magnet effect<br />

Psychoacoustic space<br />

with a phonetic category:<br />

perceptual magnet effect


This reasoning led to the Native Language Magnet model of speech perception (early 1990s onwards, see Pickett p249-255). Recent extension: Kuhl (2007)<br />

But: what is a “phonetic category”?<br />

Kuhl is inexplicit, but implies it’s a phoneme.<br />

• But phonemes can’t be directly related to psychoacoustic space…<br />

• …and phones vary a lot in different contexts.<br />

Barrett (1997) (PhD thesis, CU Linguistics Dept):<br />

• magnet effects are context-sensitive: /u lu ju/ have independent prototypes & magnet effects<br />

• magnet effects differ depending on function: musicians have enhanced discrimination around C major chord, non-musicians do not, but can be trained to.<br />

So phonetic prototypes, demonstrated by perceptual magnet effects, operate at unknown and possibly more than one level of abstraction, and may serve various different purposes.<br />

• Do they involve memories of good/common patterns (cf. semantics)?<br />

• Should we consider them as task-dependent functional processes?


Neurological and neuropsychological evidence about the nature of phonetic categories


Brain activation for category boundaries<br />

• Many studies: Superior<br />

Temporal Gyrus (STG)<br />

is active when phonetic<br />

decisions are made<br />

(+ many other areas)<br />

• STG activation does<br />

not differ when the<br />

decisions are hard<br />

(other areas do e.g. frontal regions)<br />

Binder et al. (2004) Nat.Neurosci. 7, 295-301<br />

Blumstein et al. (2005) J. Cog. Neuroscience 17, 1353-1366


Brain activation for category boundaries:<br />

Ganong effect<br />

• STG is sensitive to change<br />

in category boundary due<br />

to lexical status:<br />

gift-kift vs. giss-kiss<br />

• Conclusion: lexical<br />

knowledge influences<br />

basic phonetic<br />

categorization processes<br />

Lateral view of left hemisphere:<br />

differential activation for the same<br />

physical stimulus dependent on<br />

whether it is in a word or a non-word<br />

Myers & Blumstein (2007) Cerebral Cortex


yet also.... simple ba-da continuum<br />

• brain activation differs for category centers & boundaries<br />

(adaptation fMRI)<br />

centers:<br />

boundaries:<br />

Primary auditory cortex<br />

left STG, left parietal,<br />

right cerebellum, ant. cingulate<br />

[Figure: lateral views of the left and right hemispheres, coronal view (slice through top), and medial view of the right hemisphere]<br />

Raizada & Poldrack (CNS 2004)


Brain activation for<br />

native vs. non-native sounds<br />

• American and Japanese listeners heard /ra/ and /la/ stimuli (non-phonemic in Japanese)<br />

• American listeners had more<br />

focal activation, for a shorter<br />

time<br />

• Japanese listeners had more<br />

distributed activation, lasting<br />

longer<br />

• (the brain is typically more active<br />

when it processes difficult material)<br />

Zhang et al. (2005, Neuroimage 26: 703-720)


Functional grouping in the brain<br />

• Neurological and neuropsychological evidence suggests that all sorts of categories are constructed by the brain from the statistical regularities amongst the salient properties of events each person experiences.<br />

• They are represented as modality-specific memories: the concept banana is stored in the brain as a cluster of different memories (of particular bananas’ taste, smell, texture, what they look like, whether you like them or not, etc.).<br />

• Such memories are thought to cluster into functional groupings of brain cell activity. Thus cells from many different parts of the brain contribute to a single memory, and a single concept.


If you adopt this view, then linguistic categories are just like any other category:<br />

1. multimodal and distributed in many different parts of the brain (auditory, visual, tactile, emotional…)<br />

2. context-sensitive (or relational) and therefore dynamic and labile

3. constructed by each individual from his or her own experience<br />

4. constantly updated by new experience that fits into the category<br />

(another influence on their lability)<br />

5. can be thought of as hierarchically organised: smaller functional<br />

groupings combine into higher-order ones:<br />

– mouse—small furry mammals—larger furry mammals—<br />

mammals—animals<br />

– sound of [p] in syllable onset—syllable onset—syllable—foot (=<br />

stress group)—intonational phrase


Some questions<br />

If this is so:<br />

• what determines how the categories develop?<br />

• what constrains the possible types of category, and the relationships between them?
