Acoustic Theory of Speech Production; Speech Perception Speech ...

Digital Signal Processing 

Design — Lecture 4b 

Acoustic Theory of 

Speech Production; 

Speech Perception 

1

Basics 

• speech is composed of a sequence of sounds 

• sounds (and ( transitions between them) ) serve as a 

symbolic representation of information to be shared 

between humans (or humans and machines) 

• arrangement of sounds is governed by rules of 

language (constraints on sound sequences, word 

sequences, etc)--/spl/ t ) / l/ exists, i t /sbk/ / bk/ doesn’t d ’t exist i t 

• linguistics is the study of the rules of language 

• phonetics h ti iis th the study t d of f the th sounds d of f speechh 

can exploit knowledge about the structure of sounds and language language—and and how it 

is encoded in the signal—to do speech analysis, speech coding, speech 

synthesis, speech recognition, speaker recognition, etc. 

2

Mid-sagittal plane 

X-ray of human 

vocal apparatus 

Human Vocal Vocal Apparatus 

• vocal tract —dotted dotted lines in figure; begins at the 

glottis (the vocal cords) and ends at the lips 

• consists of the pharynx (the connection from 

the esophagus to the mouth) and the mouth 

itself (the oral cavity) 

• average male vocal tract length is 17.5 cm 

• cross sectional area, determined by 

positions of the tongue, lips, jaw and velum, 

varies from zero (complete closure) to 20 sq 

cm 

• nasal tract —begins at the velum and ends at the 

nostrils 

• velum —a trapdoor-like mechanism at the back of 

the mouth cavity; lowers to couple the nasal tract to 

the vocal tract to produce the nasal sounds like /m/ 

(mom), /n/ (night), /ng/ (sing) 

Vocal Tract MRI Sequences 

Mid-sagittal plane X-ray of human vocal apparatus 

3

MRI – Shri Shri Narayanan 

Narayanan 

4

Schematic View of of Vocal Vocal Tract 

Tract 

Acoustic Tube Models Demo 

Speech Production Mechanism: 

• air enters the lungs via normal breathing 

and no speech is produced (generally) on 

in-take 

• as air is expelled from the lungs lungs, via the 

trachea or windpipe, the tensed vocal 

cords within the larynx are caused to 

vibrate (Bernoulli oscillation) by the air 

flow 

• air is chopped up into quasi-periodic 

pulses which are modulated in frequency 

( (spectrally t ll shaped) h d) iin passing i th through h th the 

pharynx (the throat cavity), the mouth 

cavity, and possibly the nasal cavity; the 

positions p of the various articulators (j (jaw, , 

tongue, velum, lips, mouth) determine the 

sound that is produced 

5

Tube Models 

6

Tube Models 

7

Vocal Cords 

The vocal cords (folds) form a relaxation 

oscillator. Air pressure builds up and 

blows them apart. Air flows through the 

orifice and pressure drops allowing the 

vocal cords to close. close Then the cycle is 

repeated. 

8

Vocal Cord Cord Views and Operation 

Operation 

Bernoulli Oscillation Tensed Vocal Cords – 

RReady d t to Vib Vibrate t 

Lax Vocal Cords – 

Open for Breathing Breath ng 

9

Glottal Flow 

Flow 

Glottal volume velocity and resulting sound pressure at the mouth for the 

first 30 msec of a voiced sound 

• 15 msec buildup to periodicity => pitch detection issues at beginning 

and end of voicing; also voiced voiced-unvoiced unvoiced uncertainty for 15 msec 

10

Artificial Larynx 

Larynx 

Artificial Larynx Demo 

11

Schematic Production Mechanism 

Schematic representation of 

physiological mechanisms of 

speech production 

• lungs and associated muscles act as the source 

of air for exciting the vocal mechanism 

• muscle force pushes air out of the lungs 

(like a piston pushing air up within a 

cylinder) through bronchi and trachea 

• if vocal cords are tensed, air flow causes 

them to vibrate, producing voiced or quasi- 

periodic speech sounds (musical notes) 

• if vocal cords are relaxed, air flow 

continues ti th through h vocal l t tract t until til it it hits hit a 

constriction in the tract, causing it to 

become turbulent, thereby producing 

unvoiced sounds (like ( /s/, , /sh/), ), or it hits a 

point of total closure in the vocal tract, 

building up pressure until the closure is 

opened and the pressure is suddenly and 

abruptly released, released causing a brief transient 

sound, like at the beginning of /p/, /t/, or /k/ 

12

Abstractions of Physical Model 

Time-Varying 

Filter 

excitation speech 

voiced 

unvoiced 

mixed 

13

The Speech Speech Signal 

Signal 

• speech hiis a sequence of f ever changing h i sounds d 

• sound properties are highly dependent on 

context (i.e., the sounds which occur before and 

after the current sound) 

• the state of the vocal cords, the positions, shapes 

and sizes of the various articulators articulators—all all change 

slowly over time, thereby producing the desired 

speech sounds 

=> need to determine the physical properties of 

speech by observing and measuring the speech 

waveform (as ( well as signals g derived from the 

speech waveform—e.g., the signal spectrum) 

14

Speech Waveforms Waveforms and Spectra 

100 msec 

• 100 msec/line; 0.5 sec for 

utterance 

• S-silence silence-background 

background-no no speech 

• U-unvoiced, unvoiced, no vocal cord 

vibration b at o (aspiration, (asp at o , unvoiced u o ced 

sounds) 

• V-voiced voiced-quasi quasi-periodic periodic speech 

• speech is a slowly time time varying 

varying 

signal over 55-100 

100 msec intervals 

• over longer intervals (100 msec msec-5 5 

sec) sec), the speech characteristics 

change as rapidly as 10 10-20 20 

times/second 

=> no well well-defined defined or exact regions 

where individuals sounds begin 

and end 

15

• “Should Should we chase chase” 

– /sh/ sound 

– /ould/ sounds 

– /we/ sounds 

– /ch/ sound 

– /a/ sound 

– /s/ sound 

Speech Sounds 

• hard to distinguish weak sounds from silence 

• hard to segment with high precision => don’t do it when it can be avoided 

COOL EDIT demo—’should’, ‘every’ 

16

Estimate of Pitch Period -I TH 

IY 

IY 

V Z 

HH 

UW 

17

Range Range of Pitch Pitch Period 

Period 

Average Maximum Minimum 

(msec) ( ) (msec) ( ) (msec) ( ) 

Male 8 12.5 5 

Female 4.4 6.7 2.9 

Child 33 3.3 5 2 

Range of Pitch Pitch Frequency 

Average Minimum Maximum 

(Hz) ( ) (Hz) ( ) (Hz) ( ) 

Male 125 80 200 

Female 225 150 350 

Child 300 200 500 

18

Acoustic Theory of Speech 

P Production d i 

19

Source/System Model 

20

Making Speech “Visible” in 1947 

21

Spectrogram Properties 

Speech p Spectrogram p g —sound intensity y versus time and 

frequency 

• wideband spectrogram -spectral analysis on 15 msec 

sections of waveform using a broad (125 Hz) bandwidth 

analysis filter, with new analyzes every 1 msec 

– spectral intensity resolves individual periods of the speech and 

shows s o svertical etca st striations ato sdu during gvoiced ocedregions ego s 

• narrowband spectrogram -spectral analysis on 50 

msec sections of waveform using a narrow (40 Hz) 

bandwidth analysis filter filter, with new analyzes every 1 

msec 

– narrowband spectrogram resolves individual pitch harmonics 

and shows horizontal striations during voiced regions 

22

Wideband Spectrograms of s5.wav and s5_synthetic.wav 

23

Narrowband Spectrogram p g of f s5.wav 

24

Sound Spectrogram 

Spectrogram 

wavsurfer demo—’s5’, ‘s5_synthetic’ 

voicebox demo—’s5’, ‘s5_synthetic’ 

COLEA demo—’should’, , ‘every’ y 

Wav Surfer: 

www.wavsurfer.com 

VoiceBox: 

www.ee.ic.ac.uk/hp/staff/dmb/voic 

ebox/voicebox.htm 

COLEA UI: 

www.utdallas.edu/~loizou/speech/ 

colea colea.htm htm 

HMM Toolkit: 

25 

www.ai.mit.edu/~murphyk/Softwa 

re/HMM/hmm.html#hmm

Speech Sentence Waveform 

26

Speech Wideband Spectrogram 

two plus seven is less than ten 

27

Parametrization of Spectra 

• human vocal tract is essentially a tube of varying cross sectional 

area, or can be approximated as a concatentation of tubes of varying 

cross sectional areas 

• acoustic theory shows that the transfer function of energy from the 

excitation source to the output can be described in terms of the natural 

frequencies or resonances of the tube 

• resonances known as formants or formant frequencies for speech 

and d th they represent t the th ffrequencies i th that t pass the th most t acoustic ti energy 

from the source to the output 

• typically there are 3 significant formants below about 3500 Hz 

• formants are a highly efficient, efficient compact representation of speech 

28

Spectrogram Spectrogram and Formants 

Key Issue Issue: Issue Issue: : 

reliability in 

estimating 

formants from 

spectral data 

29

Acoustic Acoustic Theory Summary 

• basic speech processes — from ideas to 

speech (production) (production), from speech to ideas 

(perception) 

• basic vocal production mechanisms — vocal 

tract, nasal tract, velum 

• source of sound flow at the glottis; output of 

sound flow at the lips and nose 

• speech waveforms and properties — voiced, 

unvoiced unvoiced, silence silence, pitch 

• speech spectrograms and properties — 

wideband spectrograms, narrowband 

spectrograms, formants 

30

English Speech Speech Sounds 

Sounds 

ARPABET representation 

• 48 sounds 

• 18 vowels/diphthongs 

• 4vowel 4vowel-like 4 vowel like like consonants 

consonants 

• 21 standard consonants 

• 4 syllabic sounds 

• 1 glottal stop 

31

Phonemes 

Phonemes—Link Link Between 

O Orthography h h and d Speech S h 

Orthography sequence of sounds 

• Larry /l/ /ae/ /r/ /iy/ (/L/ /AE/ /R/ /IY/) 

S Speech h Waveform W f sequence of f sounds d 

• based on acoustic properties (temporal) of phonemes 

Spectrogram sequence of sounds 

• based on acoustic properties (spectral) of phonemes 

The bottom line is that we use a phonetic code as an intermediate 

representation of language, from either orthography or from 

waveforms a eforms or spectrograms spectrograms; no now we e ha have e to learn ho how to 

recognize sounds within speech utterances 

32

Phonetic Transcriptions 

Transcriptions 

• based on ideal (dictionary-based) ( y )p pronunciations of 

all words in sentence 

– ‘My name is Larry’-/M/ /AY/-/N/ /EY/ /M/-/IH/ /Z/-/L/ /AE/ 

/R/ /IY/ 

– ‘How old are you’-/H/ /AW/-/OW/ /L/ /D/-/AA/ /R/-/Y/ /UW/ 

– ‘Speech processing is fun’-/S/ /P/ /IY/ /CH/-/P/ /R/ /AH/ 

/S/ /EH/ /S/ /IH/ /NG/ /NG/-/IH/ /IH/ /Z/-/F/ /Z/ /F/ /AH/ /N/ 

• word ambiguity abounds 

– ‘lives’-/L/ /IH/ /V/ /Z/ (he lives here) versus /L/ /AY/ /V/ /Z/ 

(a cat has nine lives) 

– ‘record’-/R/ /EH/ /K/ /ER/ /D/ (he holds the world record) 

versus /R/ /IY/ /K/ /AW/ /D/ (please record my favorite 

show tonight) 

33

Phoneme Phoneme Classification Chart 

Chart 

EY 

Vocal Cords 

Vib Vibrating ti 

Noise-Like 

Excitation 

34

Reduced Reduced Set of English Sounds 

• 39 sounds 

– 11 vowels (front, mid, back) classification based on 

tongue hump position 

– 4 diphthongs (vowel-like (vowel like combinations) 

– 4 semi-vowels (liquids and glides) 

– 3 nasal consonants 

– 6 voiced and unvoiced stop consonants 

– 8 voiced and unvoiced fricative consonants 

– 2 affricate consonants 

– 1 whispered sound 

• look at each class of sounds to characterize 

their acoustic and spectral properties 

35

Vowels 

• longest duration sounds – least context sensitive 

• can be held indefinitely in singing and other musical 

works ( (opera) ) 

• carry very little linguistic information (some languages 

don’t display p y vowels in text-Hebrew, , Arabic) ) 

Text 1: all vowels deleted 

Th Th_y n_t_d t d s_gn_f_c_nt f t _mpr_v_m_nts t _n th_ th c_mp_ny’s ’ 

_m_g_, s_p_rv_s__n _nd m_n_g_m_nt. 

Text 2: all consonants deleted 

A__i_u_e_ _o_a__ _a_ __a_e_ e__e__ia___ __e _a_e, 

_i__ i __e e __i_e_ i e oo_ oo__u_a_io_a_ u a io a ee___o_ee_ o ee __i_____ i 

_e__ea_i__. 

36

Vowels and Consonants 

Text 1: all vowels deleted 

Th_y n_t_d s_gn_f_c_nt _mpr_v_m_nts _n th_ c_mp_ny’s 

_m_g_, m g s_p_rv_s__n s p rv s n _nd nd m_n_g_m_nt. 

m n g m nt 

(They noted significant improvements in the company’s 

image, supervision and management.) 

Text 2: all consonants deleted 

A__i_u_e_ _o_a__ _a_ __a_e_ e__e__ia___ __e _a_e, 

_i__ __e __i_e_ o_ o__u_a_io_a_ e___o_ee_ __i_____ 

_e__ea_i__. e ea i 

(Attitudes toward pay stayed essentially the same, with the 

scores of occupational employees slightly decreasing) 

37

Vowels 

• produced p using g fixed vocal tract shape p 

• sustained sounds 

• vocal cords are vibrating g => voiced sounds 

• cross-sectional area of vocal tract determines 

vowel resonance frequencies and vowel sound 

quality lit 

• tongue position (height, forward/back position) 

most important in determining vowel sound 

• usually relatively long in duration (can be held 

during g singing) g g) and are spectrally p y 

well formed 

38

Vowel Production 

Production 

39

Vowel Waveforms & 

Spectrograms 

Synthetic versions of the 

10 vowels 

40

Vowel Formants 

Formants 

Clear pattern of variability 

of vowel pronunciation 

among men, women and 

children hild 

Strong overlap for different 

vowel sounds by different 

ttalkers lk => no unique i 

identification of vowel 

strictly from resonances 

=> need context to define 

vowel sound 

41

Canonic Vowel Spectra 

10 Hz 33 Hz 

100 Hz 

IY IY 

AA 

AA 

UW UW 

100 Hz Fundamental 

42

Canonic Vowel Spectra 

IY IY 

AA 

UW 

100 100 Hz H F Fundamental d t l 300 Hz 

300 Hz Fundamental 

AA 

UW 

43

Eliminating Vowels and Consonants 

44

• Gliding speech sound 

that starts at or near the 

articulatory position for 

one vowel and moves to 

or toward the position p for 

another vowel 

– /AY/ in buy 

– /AW/ in down 

– /EY/ in bait 

– /OY/ in boy 

Diphthongs 

45

Distinctive Distinctive Features 

Features 

Classify non-vowel/non-diphthong sounds in terms of distinctive 

features 

– place of articulation 

• Bilabial (lips)—p,b,m,w 

• Labiodental (between lips and front of teeth)-f,v 

• Dental (teeth)-th,dh 

• Alveolar (front of palate)-t,d,s,z,n,l 

• Palatal (middle of palate)-sh,zh,r 

• Velar (at velum)-k,g,ng 

• Pharyngeal (at end of pharynx)-h 

– manner of articulation 

• Glide—smooth motion-w,l,r,y 

• Nasal—lowered velum-m,n,ng 

• Stop—constricted vocal tract-p,t,k,b,d,g 

• Fricative—turbulent source-f,th,s,sh,v,dh,z,zh,h 

• Voicing—voiced source-b,d,g,v,dh,z,zh,m,n,ng,w,l,r 

• Mixed source—both voicing and unvoiced-j,ch 

• Whispered--h 

46

Places of Articulation 

47

Semivowels Semivowels (Liquids and and Glides) 

Glides) 

• vowel-like vowel like in nature (called semivowels for 

this reason) 

• voiced sounds (w (w-l-r-y) l r y) 

• acoustic characteristics of these sounds 

are strongly t l influenced i fl d bby context—unlike 

t t lik 

most vowel sounds which are much less 

iinfluenced fl d bby context t t 

uh-{w,l,r,y}-a 

Manner: glides 

Place: bilabial (w), alveolar (l), 

palatal (r) 

48

Nasal Consonants 

• The nasal consonants consist of /M/, /N/, and /NG/ 

– nasals produced using glottal excitation => voiced sounds 

– vocal tract totally constricted at some point along the tract 

– velum lowered so sound is radiated at nostrils 

– constricted oral cavity serves as a resonant cavity that traps 

acoustic energy at certain natural frequencies (anti-resonances 

or zeros of transmission) 

– /M/ is produced with a constriction at the lips => low frequency 

zero 

– /N/ is produced with a constriction just behind the teeth => higher 

frequency zero 

– /NG/ is produced with a constriction just forward of the velum => 

even higher frequency zero 

uh-{m,n,ng}-a 

Manner: nasal 

Place: bilabial (m), alveolar (n), velar 

(ng) 

49

Nasal Production 

50

Nasal Sounds 

Hole in 

spectrum 

UH M AA UH N AA 

51

Unvoiced Fricatives 

• Consonant sounds /F/, /TH/, /S/, /SH/ 

– produced by exciting vocal tract by steady air flow 

which becomes turbulent in region of a constriction in 

the vocal tract 

• /F/ constriction near the lips 

• /TH/ constriction near the teeth 

• /S/ constriction near the middle of the vocal tract 

• /SH/ constriction near the back of the vocal tract 

– noise source at constriction => vocal tract is 

separated into two cavities 

– sound radiated from lips – front cavity 

– back cavity traps energy and produces antiresonances 

(zeros of transmission) 

uh-{f,th,s,sh}-a 

Manner: fricative 

Place: labiodental (f), dental (th), 

alveolar (s), palatal (sh) 

52

Unvoiced Fricative Production 

53

Unvoiced Fricatives 

UH F AA UH S AA UH SH AA 

54

Voiced Fricatives 

Fricatives 

• Sounds /V/,/DH/, /Z/, /ZH/ 

– place of constriction same as for unvoiced 

counterparts 

– two sources of excitation; vocal cords 

vibrating producing semi-periodic puffs of air 

to excite the tract; the resulting air flow 

becomes turbulent at the constriction giving a 

noise noise-like like component in addition to the 

voiced-like component 

uh-{v,dh,z,zh}-a 

Manner: fricative 

Place: labiodental (v), dental (dh), 

alveolar (z), palatal (zh) 

55

Voiced Fricatives 

UH V AA UH ZH AA 

56

Voiced and Unvoiced Stop 

Consonants 

• sounds-/B/, /D/, /G/ (voiced stop consonants) and /P/, /T/ 

/K/ (unvoiced ( i d stop t consonants) t ) 

– voiced stops are transient sounds produced by building up 

pressure behind a total constriction in the oral tract and then 

suddenly releasing the pressure, pressure resulting in a pop-like pop like sound 

• /B/ constriction at lips 

Manner: stop 

• /D/ constriction at back of teeth 

• /G/ constriction at velum 

uh-{b,d,g}-a 

– no sound diis radiated dit dffrom th the li lips dduring i constriction ti ti => 

sometimes sound is radiated from the throat during constriction 

(leakage through tract walls) allowing vocal cords to vibrate in 

spite of total constriction 

– stop sounds strongly influenced by surrounding sounds 

– unvoiced stops have no vocal cord vibration during period of 

closure => brief period of frication (due to sudden turbulence of 

escaping air) and aspiration (steady air flow from the glottis) 

before voiced excitation begins 

Place: bilabial (b,p), alveolar 

(d,t), velar (g, k) 

57

Stop Consonant Consonant Production 

Production 

58

Voiced Stop Consonant 

UH B AA 

59

Unvoiced Stop Consonants 

Stop p Gap p 

UH P AA UH T AA 

uh-{p,t,k}-a 

uh-{j,ch,h}-a 

60

Distinctive Distinctive Phoneme Phoneme Features 

Features 

• the brain recognizes sounds by doing a distinctive feature 

analysis from the information going to the brain 

• the distinctive features are somewhat insensitive to noise, 

background, reverberation => they are robust and reliable 

61

IY-beat 

IH IH-bit bit 

EH-bet 

AE-bat 

AA-bob 

ER-bird 

AH-but AH but 

AO-bought 

UW-boot 

UH UH-book book 

OW-boat 

AW-down 

AY-buy 

OY-boy 

EY-bait EY bait 

Review Exercises 

Exercises 

Write the transcription of the 

sentence “Good friends are hard to 

find” 

G-UH-D F-R-EH-N-D-Z AA-R HH- 

AA-R-D T-UH (UW) F-AY-N-D 

62

samples s offset 

0 

2000 

4000 

6000 

8000 


file: enjoy 1 0k, sampling rate: 10000, starting sample: 1 number of samples 8079 

EH N 

N JH OY 

OY 

OY 

0 200 400 600 800 1000 

sample number 

1200 1400 1600 1800 2000 

Enjoy: 

EH-N-JH-OY 

63

samples 

offset 

0 

2000 

4000 

6000 


Exercises 

file: simple 1 0k, sampling rate: 10000, starting sample: 1 number of samples 7152 

S IH M 

P AX AX-L L | EL 

0 200 400 600 800 1000 

sample l number 

b 

1200 1400 1600 1800 2000 

S 

Simple: 

S-IH-M-P- 

(AX-L | EL) 

64

TH-IH S IH Z UH T EH S T 

This is a test (16 kHz sampling rate) 

65

Word Choices: 

that, and, was, by, 

people, little, simple, 

between, very, enjoy, 

only only, other other, company company, 

those 

Ultimate Exercise Exercise—Identify Identify Words From Spectrogram 

Top Row: simple; l enjoy; those 

Second Row: was; between; company 

66


Word Choices: 


people people, little little, simple simple, 


only, other, company, 

those 

/was/ -- this word can be 

identified by the voiced 

initial portion with very 

low first and second 

formants (sounds like UW 

or W), followed by the AA 

sound and ending with the 

Z (S) sound. 

67


Word Choices: 



between between, very very, enjoy enjoy, 


those 

/enjoy/ – this word can be 

identified by the two two-syllable syllable 

nature nature, with the nasal sound 

N at the end of the first 

syllable syllable, and the fricative JH 

at the beginning of the second 

syllable, with the characteristic 

OY diphthong at the end of 

th the word 

d 

68


Word Choices: 

that, , and, , was, , by, y, 




those 

/company/ – this word can be 

identified by the three syllable 

nature nature, with the initial stop 

consonant K, the first syllable 

ending in the nasal MM, followed by 

the stop P, and with the second 

syllable ending with the nasal N 

followed by an IY vowel-like 

sound 

69


Word Choices: 





those 

/ /simple/ i l / – thi this word d can bbe 

identified by the two two-syllable syllable 

nature nature, with a strong initial 

fricative S beginning the first 

syllable ll bl and d th the nasal l M 

ending the first syllable, and 

with the stop consonant P 

beginning the second syllable 

70

Spectrograms of Cities 

A 

Los Angeles 

B 

Chicago 

C 

Denver 

D 

Seattle 

E 

Memphis p 

City Choices: Chicago, Denver, Los Angeles, Memphis, Seattle

Summary 

• sounds of the English language—phonemes, 

language phonemes, 

syllables, words 

• phonetic p transcriptions p of words and 

sentences — coarticulation across word 

boundaries 

• vowels and consonents — their roles, 

articulatory shapes, waveforms, spectrograms, 

formants 

72

Sound Sound Perception 

Perception 

73

Some Facts About Human 

H Hearing i 

• the range of human hearing is incredible 

– threshold of hearing — thermal limit of Brownian motion of air 

particles in the inner ear 

– threshold of pain — intensities of from 10**12 to 10**16 greater 

than the threshold of hearing 

• human hearing perceives both sound frequency and 

sound direction 

– can detect weak spectral components in strong broadband noise 

• masking is the phenomenon whereby one loud sound 

makes another softer sound inaudible 

– masking is most effective for frequencies around the masker 

frequency 

– masking is used to hide quantizer noise by methods of spectral 

shaping (similar grossly to Dolby noise reduction methods) 

74

Anechoic Chamber (no Echos) 

75

Sound Pressure Levels (dB) 

SPL (dB)—Sound Source SPL (dB)—Sound Source 

160 Jet Engine — close up 70 Busy Street; Noisy 

150 Firecracker; Artillery Fire 

Restaurant 

140 

130 

120 

110 

Rock Singer Screaming into 

Microphone: Jet Takeoff 

Threshold of Pain; .22 

Caliber Rifle 

Planes on Airport Runway; 

Rock Concert; Thunder 

Power Tools; Shouting in Ear 

60 

50 

40 

30 

Conversational Speech — 1 

foot 

Average Office Noise; Light 

Traffic; Rainfall 

Quiet Conversation; 

RRefrigerator; f i t Lib Library 

Quiet Office; Whisper 

100 SSubway b TTrains; i GGarbage b 

Truck 

90 Heavy Truck Traffic; Lawn 

Mower 

80 Home Stereo — 1 foot; Blow 

Dryer 

20 Quiet Living Room; Rustling 

Leaves 

10 Quiet Recording Studio; 

Breathing 

0 Threshold of Hearing 

76

Speech Communication 

77

Overview of Auditory Mechanism 

Model of signal processing for perception and 

understanding of speech 

78

The The Human Human Ear 

Ear 

OOuter t ear: pinna i and d external t l canall 

Middle ear: tympanic membrane or 

eardrum 

Inner ear: cochlea, neural connections 

79

Human Ear 

• Outer ear: funnels sound into ear canal 

• Middle ear: sound impinges on tympanic 

membrane; this causes motion 

– middle ear is a mechanical transducer, consisting of the 

hammer, anvil and stirrup; it converts acoustical sound 

wave to mechanical vibrations along the inner ear 

• Inner ear: the cochlea is a fluid-filled chamber 

partitioned by the basilar membrane 

– the auditory nerve is connected to the basilar membrane 

via inner hair cells 

– mechanical vibrations at the entrance to the cochlea 

create standing waves (of fluid inside the cochlea) 

causing basilar membrane to vibrate at frequencies 

commensurate with the input acoustic wave frequencies 

(formants) and at a place along the basilar membrane 

that is associated with these frequencies 

80

The The Outer Outer Ear 

Ear 

81

The Middle Middle Ear 

Ear 

The Hammer (Malleus), 

Anvil (Incus) and Stirrup 

(Stapes) are the three 

tiniest bones in the body. 

Together they form the 

coupling between the 

vibration of the eardrum 

and the forces exerted on 

the oval window of the 

inner ear ear. 

These bones can be 

thought of as a compound 

lever which achieves a 

multiplication of force—by 

a factor of about three 

under optimum conditions. 

(Th (They also l protect t t the th ear 

against loud sounds by 

82 

attenuating the sound.)

Tympanic y p 

Membrane 

The Cochlea 

Malleus Ossicles 

Incus 

Stapes 

(Middle Ear Bones) 

AAuditory dit nnerves 

Oval Window 

Cochlea 

RRound d Wi Window d 

Vestibule 

83

Stretched Cochlea & Basilar Membrane 

Scala 

Vestibuli 

Basilar 

Membrane 

1600 Hz 

800 Hz 

400 Hz 

200 Hz 

Cochlear Base 100 Hz 

(high 

frequency) Unrolled 

Cochlea 

50 Hz 

Relative 

amplitude 

25 Hz 

0 10 20 30 

Distance from Stapes (mm) 

Cochlear Apex 

(low ( ffrequency) q y) 

84

Basilar Membrane Motion 

85

Audience Noise Noise Suppression 

Suppression 

86

Basilar Membrane Membrane Mechanics 

Mechanics 

• characterized by a set of frequency responses at different points 

along the membrane 

• mechanical realization of a bank of filters 

• filters are roughly constant Q (center frequency/bandwidth) with 

logarithmically g y decreasing g bandwidth 

• distributed along the Basilar Membrane is a set of sensors called 

Inner Hair Cells (IHC) which act as mechanical motion-to-neural 

activity converters 

• mechanical motion along the BM is sensed by local IHC causing 

firing activity at nerve fibers that innervate bottom of each IHC 

• each IHC connected to about 10 nerve fibers, each of different 

diameter => thin fibers fire at high motion levels, thick fibers fire at 

llower motion ti llevels l 

• 30,000 nerve fibers link IHC to auditory nerve 

• electrical pulses run along auditory nerve, ultimately reach higher 

levels of auditory processing in brain brain, perceived as sound 

87

Hearing Thresholds 

Thresholds 

• Threshold of Audibilityy is the acoustic intensity y 

level of a pure tone that can barely be heard at a 

particular frequency 

– th threshold h ld of f audibility dibilit ≈ 0 dB at t 1000 HHz 

– threshold of feeling ≈ 120 dB 

– threshold of pain ≈ 140 dB 

– immediate damage ≈ 160 dB 

• Thresholds vary with frequency and from 

person-to-person 

• Maximum sensitivity is at about 3000 Hz 

88

Sound Prressure 

Level L 

Range Range of Human Human Hearing 

Hearing 

140 

120 

100 

80 

60 

40 

20 

0 

Music 

Threshold in Quiet 

Threshold of Pain 

Contour of Damage Risk 

140 

120 

100 

80 

Speech p 

60 

00.02 02 0.05 0 05 0.1 0 1 0.2 0 2 0.5 0 5 1 2 5 10 20 

Frequency (kHz) 

40 

20 

0 

Sound Inntensity 

Level L 

89

Piitch 

in mels m 

Pitch Pitch-The Pitch The The Mel Mel Scale 

Scale 

3000 

2000 

1000 

0 

20 40 100 200 400 1k 2k 4k 10k 

Frequency in Hz 

Pitch ( mels ) = 

3233log10( 

1+ 

f 

/ 1000) 

90

Sound pressure p e level (ddB) 

Auditory Auditory Masking 

Masking 

70 

50 

30 

10 

0 

0.02 

Tone masker @ 1kHz 

threshold in 

quiet q 

Inaudible range 

0.079 

Signal perceptible even in the 

presence p 

of the tone masker 

threshold when 

masker is 

present 

0.313 1.25 2.5 5 10 

F Frequency (KHz) (KH ) 

20 

Signal not perceptible due to 

the presence of the tone 

masker 

91

Exploiting Exploiting Masking Masking in in Coding 

Coding 

Level L (dB) ) 

Power Spectrum 

Predicted Masking Threshold 

Bit Assignment (Equivalent SNR) 

Frequency (Hz) 

92

Speech Perception Issues 

Issues 

• Ear performs p spectral p analysis y along g basilar membrane 

– both place (along basilar membrane) and frequency are 

significant 

– frequency scale is pseudo-logarithmic pseudo logarithmic (mel or Bark scale) 

– amplitude is logarithmic 

– maximal frequency sensitivity between 1 and 3 kHz 

• Tonal and noise masking hide tones and noise in bands 

close to masker 

– critical bandwidths as a function of frequency q y 

– masking restricted to critical band region 

• Neural processing is still not well understood 

– evidence for distinctive features (place and manner features of 

sounds) 93

Acoustic Theory of Speech Production; Speech Perception Speech ...

Create successful ePaper yourself

Delete template?

Save as template?