
Linguistic categories and speech perception


<strong>Linguistic</strong> <strong>categories</strong> <strong>and</strong> <strong>speech</strong><br />

<strong>perception</strong><br />

Paper 9<br />

Foundations of Speech Communication<br />

Sarah Hawkins<br />

9 November 2007


Aim<br />

• To consider how (or why) we seem to<br />

recognize discrete linguistic units from the<br />

<strong>speech</strong> signal.<br />

(Note the assumption that mental linguistic<br />

units are discrete.)


3 ways discrete linguistic-phonetic<br />

<strong>categories</strong> might be perceived<br />

1. they are really there in the acoustic signal<br />

acoustic invariance (or reliability)<br />

2. they result from the way the auditory system<br />

processes sound<br />

auditory invariance (or reliability)<br />

3. they result from the way the brain processes<br />

any sort of information, sensory or not<br />

‘cognitive invariance’


What is a category?<br />

A class or division in a<br />

system of classification


Structure of a category<br />

poor<br />

ok<br />

good<br />

best<br />

Quality of exemplars<br />

Boundaries


Thrush in summer<br />

<br />

Thrush in snow<br />

<br />

Sparrow in summer


Reminder:<br />

Ladefoged and Broadbent (1957)<br />

Carrier phrase: “Please say what this word is:” followed by a test word (bit, bet, bat or but)<br />

F1 of CARRIER 200-380 Hz: test word heard as “bet”<br />

F1 of CARRIER 380-660 Hz: test word heard as “bit”<br />

Ladefoged and Broadbent (1957) JASA 29, 98-104


How ‘categorical’ is Categorical<br />

Perception?<br />

• Category boundaries are not stable, but<br />

highly labile: they shift under the influence of<br />

many different factors.


CP boundary shifts: range effects<br />

[Figure: waveforms of stimuli with VOT -40 ms, VOT +10 ms and VOT +100 ms; x-axis: Time (s)]<br />

• identification expt e.g.<br />

• VOT continuum<br />

da..........ta<br />

• when stimuli are<br />

removed from one end,<br />

the 50% id boundary<br />

shifts towards the other<br />

[Figure: % /d/ responses (0 to 100) against VOT, short (d) to long (t), with the boundary shift marked at the 50% point]
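A minimal sketch of how the 50% id boundary can be read off identification data, and how it might move when stimuli are removed from the long-VOT end. The response percentages are hypothetical, invented only to illustrate the direction of the shift, and `boundary_50` is an illustrative helper rather than a function from any published analysis.

```python
def boundary_50(vots_ms, pct_d):
    """VOT (ms) at which % /d/ responses fall through 50%, by linear interpolation."""
    points = list(zip(vots_ms, pct_d))
    for (v0, p0), (v1, p1) in zip(points, points[1:]):
        # Find the pair of neighbouring stimuli that straddles 50%
        if p0 >= 50 >= p1 and p0 != p1:
            return v0 + (p0 - 50) * (v1 - v0) / (p0 - p1)
    raise ValueError("responses never cross 50%")

# Hypothetical data: full da..........ta continuum
full_vots = [0, 10, 20, 30, 40, 50]
full_pct_d = [100, 95, 80, 40, 10, 0]

# Hypothetical data: the two longest-VOT stimuli removed
short_vots = [0, 10, 20, 30]
short_pct_d = [100, 90, 60, 20]

full_boundary = boundary_50(full_vots, full_pct_d)
short_boundary = boundary_50(short_vots, short_pct_d)
print(full_boundary, short_boundary)
```

With these invented numbers the boundary moves towards the short-VOT end once the long-VOT stimuli are dropped, which is the pattern the slide describes.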


CP boundary shifts: cue trading<br />

• Cue trading: more of one<br />

property compensates for<br />

less of another<br />

e.g. for stimuli whose VOT is<br />

ambiguous between /da/ and<br />

/ta/, decreasing burst<br />

amplitude causes more /da/s<br />

to be perceived<br />

[Figure: % /d/ responses (0 to 100) against VOT, short (d) to long (t), for high burst amplitude (fewer /d/ responses) and low burst amplitude (more /d/ responses)]
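Cue trading can be pictured as two cues feeding one decision. The sketch below assumes a logistic model in which both longer VOT and higher burst amplitude count against a /d/ response; the weights, the bias and the function names are hypothetical, chosen only so that the numbers are easy to follow.

```python
import math

def p_d(vot_ms, burst_db, bias=3.0, w_vot=-0.12, w_burst=-0.05):
    """Probability of a /d/ response from two cues (hypothetical weights)."""
    z = bias + w_vot * vot_ms + w_burst * burst_db
    return 1.0 / (1.0 + math.exp(-z))

def boundary_vot(burst_db, bias=3.0, w_vot=-0.12, w_burst=-0.05):
    """VOT (ms) at which p_d is exactly 0.5 for a given burst amplitude."""
    return -(bias + w_burst * burst_db) / w_vot

# Same ambiguous VOT, two burst amplitudes (relative dB, hypothetical)
print(p_d(25.0, 0.0), p_d(25.0, 10.0))
print(boundary_vot(0.0), boundary_vot(10.0))
```

In this toy model a lower burst amplitude raises the probability of /d/ at a fixed VOT, so the 50% boundary sits at a longer VOT: less of one property is compensated for by more of the other.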


CP boundary shifts:<br />

meaningfulness<br />

• “Ganong effect”<br />

(word~nonword)<br />

more /d/ responses if real<br />

word begins with /d/ (dash-tash)<br />

• Similar effects for<br />

sentence meaning<br />

(if the task is appropriate)<br />

e.g. The farmer milked the<br />

[g/k]oat<br />

[Figure: % /d/ responses (0 to 100) against VOT, short (d) to long (t), for the nonword-word continuum dask-task and the word-nonword continuum dash-tash]<br />

Ganong (1980) J. Exp. Psych: HPP 6, 110-125<br />

Borsky, Shapiro, Tuller (2000) J. Psycholinguistic Res. 29, 155-168
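One way to picture a lexical bias of this kind is as a prior that is combined with the acoustic evidence. The sketch below is an assumption made for illustration, not the analysis used in the cited studies: it treats VOT for /d/ and /t/ as two Gaussians with hypothetical means and spread, and lets lexical status set the prior probability of /d/.

```python
import math

def gauss(x, mean, sd):
    """Gaussian probability density."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_d(vot_ms, prior_d=0.5, mean_d=10.0, mean_t=50.0, sd=15.0):
    """Posterior probability of /d/ given VOT and a lexical prior (hypothetical values)."""
    evidence_d = gauss(vot_ms, mean_d, sd) * prior_d
    evidence_t = gauss(vot_ms, mean_t, sd) * (1.0 - prior_d)
    return evidence_d / (evidence_d + evidence_t)

ambiguous_vot = 30.0  # midway between the two hypothetical category means
print(posterior_d(ambiguous_vot, prior_d=0.5))  # neutral context
print(posterior_d(ambiguous_vot, prior_d=0.7))  # /d/ makes a real word (dash-tash)
print(posterior_d(ambiguous_vot, prior_d=0.3))  # /t/ makes a real word (dask-task)
```

The acoustic input is identical in all three calls; only the prior differs, which is enough to move responses for ambiguous stimuli towards the real word.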


Perception adjusts to the<br />

distribution of stimuli<br />

&<br />

is more forgiving<br />

about unclear sounds<br />

if the message makes sense<br />

[Figure: % /d/ responses (0 to 100) against VOT, short (d) to long (t)]


CP: category goodness<br />

Eye-tracking<br />

Task: click on picture corresponding to heard word<br />

Stimuli varied in VOT: bear--pear<br />

More looks to competitor picture as VOT approaches<br />

category boundary<br />

McMurray et al. (2003) J. Psycholing. Res. 32, 77-97


CP: category goodness<br />

Mediated Priming in lexical decision task<br />

A /t/ with a short VOT primes unrelated words<br />

via rhymes that have /d/ instead of /t/<br />

t*ime primes penny via dime<br />

[Figure: reaction times for Related, Modified and Neutral primes]<br />

Misiurski et al. (2005) Brain & Lang. 93, 64-78


Does <strong>speech</strong> convey invariant information<br />

about linguistic <strong>categories</strong>?<br />

Classical theories held/hold that:<br />

• We necessarily perceive phonemes or<br />

phonological distinctive features when we<br />

listen to <strong>speech</strong>…<br />

• …because each phoneme or feature bears an<br />

invariant relationship with some<br />

property/properties of the <strong>speech</strong> signal


Classical assumptions<br />

Some theorists suggested invariants could be<br />

modality-neutral, but most debate was<br />

polarised: invariant units “had to be” either<br />

• motoric<br />

– the Motor Theory of Speech Perception<br />

– direct realism/direct <strong>perception</strong> (Fowler, Best)<br />

• or acoustic or auditory<br />

– quantal theory, acoustic invariance (K. Stevens)<br />

– auditory enhancement theory (Kingston, Diehl,<br />

Kluender).<br />

Read about these theories in Pickett, chapters 13-15, esp. 14.<br />

If you are very interested, ask your supervisor for more recent developments


Classical assumptions<br />

Most of these theories implicitly assume(d):<br />

• there is one basic perceptual unit<br />

e.g. distinctive features, gestures<br />

• linguistic categorization is either inherent in the<br />

signal itself, or the automatic consequence of<br />

low-level perceptual processes.


Classical assumptions<br />

• Almost all theories were vague about how we<br />

move from identification of the basic phonetic<br />

or phonological unit to understanding meaning.<br />

• No influential classical theory considered that<br />

the same basic processes could be involved in<br />

all aspects of speech understanding.<br />

• Exception to both these statements: Klatt’s<br />

Lexical Access From Spectra (LAFS) model<br />

Klatt, D. H. (1979). Speech perception: A model of acoustic-phonetic<br />

analysis and lexical access. Journal of Phonetics, 7, 279-312.


Do linguistic units have invariant<br />

correlates?<br />

• There is no strong evidence that all linguistic<br />

units, even of a single type, are invariantly <strong>and</strong><br />

reliably present in the <strong>speech</strong> signal<br />

• Yet some acoustic-phonetic features are more<br />

robust across contexts <strong>and</strong> speakers than<br />

others i.e. their properties are a good deal<br />

more invariant <strong>and</strong> reliable than others,<br />

especially if they are considered in relation to<br />

their surrounding context, <strong>and</strong> with respect to<br />

known properties of the auditory system.


Robust features: spectrogram of<br />

“My family lives in Oxford”<br />

[Figure: spectrogram segmented into broad feature labels (N, V, WF, SF, l, sil, diph, transient), with annotations for voicing in an obstruent and for a glottal stop (before voiceless stop)]


Robust features (e.g. Zue, 1985)<br />

• "Strong" fricative — "weak" fricative —<br />

nasal — periodic — silence — transient —<br />

vowel (high/low, front/back, spread/round).<br />

• These offer a set of "invariant" acoustic<br />

features from which to make preliminary<br />

decisions about what words were spoken.<br />

• Some Automatic Speech Recognition (ASR)<br />

techniques use such broad featural<br />

<strong>categories</strong>; less widely applied to human<br />

<strong>speech</strong> <strong>perception</strong> work.
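The idea of making preliminary decisions from broad featural categories can be sketched as a lexicon lookup keyed on coarse class sequences. Everything in the example is an assumption for illustration: the toy lexicon, the phone symbols, and the class labels (SF strong fricative, WF weak fricative, N nasal, P periodic, T a stop collapsed to silence plus transient, V vowel). It is not taken from Zue (1985).

```python
from collections import defaultdict

# Hypothetical mapping from phone symbols to broad classes
BROAD_CLASS = {
    "s": "SF", "z": "SF", "sh": "SF",
    "f": "WF", "v": "WF", "th": "WF",
    "m": "N", "n": "N",
    "l": "P", "r": "P", "w": "P",
    "p": "T", "t": "T", "k": "T", "b": "T", "d": "T", "g": "T",
    "ih": "V", "ae": "V", "aa": "V",
}

# Hypothetical toy lexicon with rough transcriptions
TOY_LEXICON = {
    "sit": ["s", "ih", "t"],
    "sip": ["s", "ih", "p"],
    "zap": ["z", "ae", "p"],
    "fit": ["f", "ih", "t"],
    "man": ["m", "ae", "n"],
    "lip": ["l", "ih", "p"],
}

def broad_pattern(phones):
    """Collapse a phone sequence into its broad-class sequence."""
    return tuple(BROAD_CLASS[p] for p in phones)

def build_index(lexicon):
    """Group words that share the same broad-class sequence."""
    index = defaultdict(list)
    for word, phones in lexicon.items():
        index[broad_pattern(phones)].append(word)
    return index

index = build_index(TOY_LEXICON)
candidates = sorted(index[("SF", "V", "T")])
print(candidates)
```

A coarse pattern such as strong fricative, vowel, stop narrows the lexicon to a small candidate set without settling place or voicing; finer analysis or higher-order knowledge would then be needed to choose among the candidates.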


Robust features<br />

• usually clearly visible in spectrograms<br />

• independent of one another (e.g. you can know it’s a “strong<br />

fricative” without knowing its exact place of articulation), BUT only<br />

work for a small set of features that have simple acoustic properties.<br />

They don’t tell us place or voicing of stops, for example.<br />

• originally proposed as potentially powerful when combined with<br />

higher-order knowledge, especially in poor listening conditions: word<br />

recognition from the interaction of gross acoustic analysis <strong>and</strong><br />

top-down prediction based on knowledge of syllabic constituency,<br />

phonotactic rules, <strong>and</strong> word-sequencing probabilities.<br />

• current thinking might reformulate that ‘knowledge’ in terms of<br />

statistical distributions, built up from repeated experience of the way<br />

the information occurs in the <strong>speech</strong> signal, <strong>and</strong> relating to the<br />

identification of phones, syllables or words.


Islands of auditory reliability<br />

• Some sounds are distinguished, <strong>and</strong> others<br />

are grouped together, because of the way the<br />

auditory system responds to them e.g.<br />

dimensions of vowel quality; vocal tract<br />

normalisation (between speakers).


Islands of auditory reliability<br />

[Figure axes: vowel height (high to low) against front to back]<br />

Syrdal <strong>and</strong> Gopal (1986, JASA 79, 1086-1100):<br />

Left panel: Scatterplot on a linear frequency scale of F1 frequency versus F2<br />

frequency for American English vowels spoken by men, women, <strong>and</strong> children<br />

(data from Peterson & Barney 1952). Right panel: the same data, replotted on a<br />

Bark scale in terms of F3-F2 Bark frequency versus F1-f0 Bark frequency.
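The replotting described in the caption can be sketched as a Hz-to-Bark conversion followed by two differences. The conversion below uses the Zwicker and Terhardt (1980) approximation; whether this matches the exact variant behind the figure is an assumption, and the formant values are illustrative numbers roughly typical of an adult male and an adult female /i/, not data read from the slide.

```python
import math

def hz_to_bark(f_hz):
    """Zwicker & Terhardt (1980) approximation of critical-band rate in Bark."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def bark_difference_dims(f0, f1, f2, f3):
    """Return (F1 - f0, F3 - F2) in Bark, the two dimensions named in the caption."""
    height = hz_to_bark(f1) - hz_to_bark(f0)
    frontness = hz_to_bark(f3) - hz_to_bark(f2)
    return height, frontness

# Illustrative values in Hz: (f0, F1, F2, F3)
male_i = (136.0, 270.0, 2290.0, 3010.0)
female_i = (235.0, 310.0, 2790.0, 3310.0)

print(bark_difference_dims(*male_i))
print(bark_difference_dims(*female_i))
```

The point of the sketch is that speakers whose raw formant frequencies differ considerably can land closer together once the measures are expressed as Bark differences between neighbouring spectral components.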


Islands of acoustic/auditory reliability<br />

• Some acoustic-phonetic features are more robust<br />

across contexts <strong>and</strong> speakers than others.<br />

Of these:<br />

– All are defined or definable in relational terms.<br />

– Some are static e.g. vowels with two formants<br />

close together in frequency<br />

(“landmarks” in<br />

Quantal Theory)<br />

– Some are dynamic (e.g. consonantal gestures)


Acoustic/auditory invariance theory<br />

• Dynamic Relational<br />

invariants: esp.<br />

spectral changes on<br />

either side of abrupt<br />

boundaries between<br />

acoustic segments<br />

+strident -strident<br />

Stevens (2002) JASA 111, 1872-1891


Acoustic/Auditory invariance theory<br />

Stevens & Blumstein (1978)<br />

……. Stevens (2002)<br />

+consonantal -consonantal<br />

• For each DF there is a binary<br />

response to an invariant acoustic<br />

or auditory property (recently<br />

modified to a (continuous)<br />

probability of response)<br />

• e.g. particular changes in spectral<br />

shape over short time periods at<br />

crucial parts of the signal<br />

– segment boundaries<br />

– vowel steady states<br />

change<br />

little change<br />

Stevens (2002) JASA 111, 1872-1891<br />

Stevens & Blumstein (1978) JASA 64, 1358-1368


Dynamic relational invariants for stop<br />

place of articulation (Stevens)<br />

[Figure: spectra at the onset of vowel [ɛ] for each place of articulation]<br />

Bilabial: burst flat or falling, low amp<br />

Alveolar: rising: burst > vowel spectrum at high freqs; burst and vowel peak freqs uncorrelated<br />

Velar: compact mid-freq peak near F2 & F3<br />

same principles for all obstruent-sonorant boundaries


Summary: Relationships<br />

between properties of the signal are critical<br />

• Current views are that relationships between<br />

successive acoustic (<strong>and</strong> visual) events define<br />

linguistic <strong>categories</strong> as much or more than static<br />

properties<br />

• i.e. listeners interpret sensory information (e.g.<br />

acoustic <strong>and</strong> visual input) in terms of relationships<br />

between properties that reflect the coordinated,<br />

dynamic behaviour of the vocal tract.<br />

• This conclusion does not necessarily entail that the<br />

basic perceptual units are motoric: they are more<br />

likely to be modality-neutral, or multi-modal.


Relational properties of <strong>speech</strong> sounds<br />

1. Relational properties are central to classical theories.<br />

2. Phonological theory is also based on relationships/contrasts.<br />

3. Timing <strong>and</strong> rhythm are essentially relational, <strong>and</strong> basic to <strong>speech</strong>:<br />

the “glue” of <strong>speech</strong> <strong>perception</strong>.<br />

4. Spectral relationships: e.g.:<br />

• Sine wave <strong>speech</strong>: reproduces the right relationships between<br />

spectral components.<br />

• Coarticulated vowels in context are identified no worse than<br />

isolated vowels <strong>and</strong> sometimes better, although the steady states<br />

of different coarticulated vowels are not as distinctive in F1-F2<br />

space as those of isolated vowels (Gottfried & Strange 1980,<br />

Strange et al. 1979, Strange et al. 1976, Assmann et al. 1982,<br />

Macchi 1980, summarised in Pickett p161-165).


Cognitive construction of <strong>categories</strong>: phonetic<br />

perceptual prototypes<br />

• Newborn babies have good discrimination of simply presented<br />

“foreign” phonemic contrasts<br />

• They lose this ability as their own language develops.<br />

By 10-12 months of age, they tend only to<br />

discriminate those contrasts that are phonemic in<br />

their native language(s).<br />

• Kuhl: By 6 months of age, babies respond to classes<br />

of sounds (e.g. vowels, fricatives) spoken by different<br />

people as if they are all the same.


Kuhl: by 6 months of age babies have also<br />

developed language-specific vowel <strong>categories</strong><br />

Discrimination by 6 month old babies:<br />

American babies: American /i/s, good exemplars: poor; American /i/s, bad exemplars: good; Swedish /i/s: no difference<br />

Swedish babies: American /i/s: no difference; Swedish /i/s, good exemplars: poor; Swedish /i/s, bad exemplars: good


Development of prototypical representations, each<br />

acting as a perceptual magnet “pulling” similar sounds<br />

towards it in perceptual space so they become less<br />

discriminable<br />

Psychoacoustic space<br />

with no phonetic category:<br />

no magnet effect<br />

Psychoacoustic space<br />

with a phonetic category:<br />

perceptual magnet effect
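The magnet metaphor can be made concrete with a toy warping of a one-dimensional psychoacoustic space. This is only an illustrative assumption, not Kuhl's formal model: each stimulus is pulled towards a prototype by an amount that decays with distance, and the parameters `pull` and `width` are hypothetical.

```python
import math

def perceived(x, prototype=0.0, pull=0.5, width=2.0):
    """Map a stimulus position to a perceived position pulled towards the prototype."""
    attraction = pull * math.exp(-0.5 * ((x - prototype) / width) ** 2)
    return x + attraction * (prototype - x)

def perceptual_distance(a, b, **params):
    """Distance between two stimuli after warping."""
    return abs(perceived(a, **params) - perceived(b, **params))

# Two pairs of stimuli with the same physical separation (1 unit)
near = perceptual_distance(0.0, 1.0)  # pair next to the prototype
far = perceptual_distance(5.0, 6.0)   # pair far from the prototype
print(near, far)

# With no phonetic category (pull = 0) the space is not warped
print(perceptual_distance(0.0, 1.0, pull=0.0))
```

In this sketch the pair close to the prototype ends up nearer together than the equally spaced pair far from it, which corresponds to poorer discrimination around good exemplars.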


This reasoning led to the Native Language Magnet model of <strong>speech</strong><br />

<strong>perception</strong> (early 1990s onwards, see<br />

Pickett p249-255). Recent extension: Kuhl (2007)<br />

But: what is a “phonetic category”?

Kuhl is inexplicit, but implies it’s a phoneme.<br />

• But phonemes can’t be directly related to psychoacoustic space…<br />

• …<strong>and</strong> phones vary a lot in different contexts.<br />

Barrett (1997): (PhD thesis, CU Linguistics Dept)<br />

• magnet effects are context-sensitive: /u lu ju/ have independent<br />

prototypes & magnet effects<br />

• magnet effects differ depending on function: musicians have<br />

enhanced discrimination around C major chord, non-musicians do<br />

not, but can be trained to.<br />

So phonetic prototypes, demonstrated by perceptual magnet effects,<br />

operate at unknown <strong>and</strong> possibly more than one level of abstraction,<br />

<strong>and</strong> may serve various different purposes.<br />

• Do they involve memories of good/common patterns (cf. semantics)?

• Should we consider them as task-dependent functional processes?


Neurological <strong>and</strong> neuropsychological<br />

evidence about the nature of phonetic<br />

<strong>categories</strong>


Brain activation for category boundaries<br />

• Many studies: Superior<br />

Temporal Gyrus (STG)<br />

is active when phonetic<br />

decisions are made<br />

(+ many other areas)<br />

• STG activation does<br />

not differ when the<br />

decisions are hard<br />

(other areas do e.g. frontal regions)<br />

Binder et al. (2004) Nat.Neurosci. 7, 295-301<br />

Blumstein et al. (2005) J. Cog. Neuroscience 17, 1353-1366


Brain activation for category boundaries:<br />

Ganong effect<br />

• STG is sensitive to change<br />

in category boundary due<br />

to lexical status:<br />

gift-kift vs. giss-kiss<br />

• Conclusion: lexical<br />

knowledge influences<br />

basic phonetic<br />

categorization processes<br />

Lateral view of left hemisphere:<br />

differential activation for the same<br />

physical stimulus dependent on<br />

whether it is in a word or a non-word<br />

Myers & Blumstein (2007) Cerebral Cortex


yet also.... simple ba-da continuum<br />

• brain activation differs for category centers & boundaries<br />

(adaptation fMRI)<br />

centers: Primary auditory cortex<br />

boundaries: left STG, left parietal,<br />

right cerebellum, ant. cingulate<br />

[Figure: lateral views of left and right hemispheres, coronal view (slice through top), medial view of right hemisphere]<br />

Raizada & Poldrack (CNS 2004)


Brain activation for<br />

native vs. non-native sounds<br />

• American <strong>and</strong> Japanese<br />

listeners heard /ra/ <strong>and</strong> /la/<br />

stimuli (non-phonemic in<br />

Japanese)<br />

• American listeners had more<br />

focal activation, for a shorter<br />

time<br />

• Japanese listeners had more<br />

distributed activation, lasting<br />

longer<br />

• (the brain is typically more active<br />

when it processes difficult material)<br />

Zhang et al. (2005, Neuroimage 26: 703-720)


Functional grouping in the brain<br />

• Neurological <strong>and</strong> neuropsychological evidence suggests that all<br />

sorts of <strong>categories</strong> are constructed by the brain from the<br />

statistical regularities amongst the salient properties of events<br />

each person experiences.<br />

• They are represented as modality-specific memories: the<br />

concept banana is stored in the brain as a cluster of different<br />

memories--of particular bananas’ taste, smell, texture, what<br />

they look like, whether you like them or not, etc.<br />

• Such memories are thought to cluster into functional<br />

groupings of brain cell activity. Thus cells from many different<br />

parts of the brain contribute to a single memory, <strong>and</strong> a single<br />

concept.


If you adopt this view, then linguistic<br />

<strong>categories</strong> are just like any other<br />

category:<br />

1. multimodal <strong>and</strong> distributed in many different parts of the brain<br />

(auditory, visual, tactile, emotional…..)<br />

2. context-sensitive (or relational) <strong>and</strong> therefore dynamic <strong>and</strong> labile<br />

3. constructed by each individual from his or her own experience<br />

4. constantly updated by new experience that fits into the category<br />

(another influence on their lability)<br />

5. can be thought of as hierarchically organised: smaller functional<br />

groupings combine into higher-order ones:<br />

– mouse—small furry mammals—larger furry mammals—<br />

mammals—animals<br />

– sound of [p] in syllable onset—syllable onset—syllable—foot (=<br />

stress group)—intonational phrase


Some questions<br />

If this is so:<br />

• what determines how the <strong>categories</strong> develop?

• what constrains the possible types of category,<br />

<strong>and</strong> the relationships between them?
