18.07.2013 Views

Acoustic Theory of Speech Production; Speech Perception Speech ...

Acoustic Theory of Speech Production; Speech Perception Speech ...

Acoustic Theory of Speech Production; Speech Perception Speech ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Digital Signal Processing<br />

Design — Lecture 4b<br />

<strong>Acoustic</strong> <strong>Theory</strong> <strong>of</strong><br />

<strong>Speech</strong> <strong>Production</strong>;<br />

<strong>Speech</strong> <strong>Perception</strong><br />

1


Basics<br />

• speech is composed <strong>of</strong> a sequence <strong>of</strong> sounds<br />

• sounds (and ( transitions between them) ) serve as a<br />

symbolic representation <strong>of</strong> information to be shared<br />

between humans (or humans and machines)<br />

• arrangement <strong>of</strong> sounds is governed by rules <strong>of</strong><br />

language (constraints on sound sequences, word<br />

sequences, etc)--/spl/ t ) / l/ exists, i t /sbk/ / bk/ doesn’t d ’t exist i t<br />

• linguistics is the study <strong>of</strong> the rules <strong>of</strong> language<br />

• phonetics h ti iis th the study t d <strong>of</strong> f the th sounds d <strong>of</strong> f speechh<br />

can exploit knowledge about the structure <strong>of</strong> sounds and language language—and and how it<br />

is encoded in the signal—to do speech analysis, speech coding, speech<br />

synthesis, speech recognition, speaker recognition, etc.<br />

2


Mid-sagittal plane<br />

X-ray <strong>of</strong> human<br />

vocal apparatus<br />

Human Vocal Vocal Apparatus<br />

• vocal tract —dotted dotted lines in figure; begins at the<br />

glottis (the vocal cords) and ends at the lips<br />

• consists <strong>of</strong> the pharynx (the connection from<br />

the esophagus to the mouth) and the mouth<br />

itself (the oral cavity)<br />

• average male vocal tract length is 17.5 cm<br />

• cross sectional area, determined by<br />

positions <strong>of</strong> the tongue, lips, jaw and velum,<br />

varies from zero (complete closure) to 20 sq<br />

cm<br />

• nasal tract —begins at the velum and ends at the<br />

nostrils<br />

• velum —a trapdoor-like mechanism at the back <strong>of</strong><br />

the mouth cavity; lowers to couple the nasal tract to<br />

the vocal tract to produce the nasal sounds like /m/<br />

(mom), /n/ (night), /ng/ (sing)<br />

Vocal Tract MRI Sequences<br />

Mid-sagittal plane X-ray <strong>of</strong> human vocal apparatus<br />

3


MRI – Shri Shri Narayanan<br />

Narayanan<br />

4


Schematic View <strong>of</strong> <strong>of</strong> Vocal Vocal Tract<br />

Tract<br />

<strong>Acoustic</strong> Tube Models Demo<br />

<strong>Speech</strong> <strong>Production</strong> Mechanism:<br />

• air enters the lungs via normal breathing<br />

and no speech is produced (generally) on<br />

in-take<br />

• as air is expelled from the lungs lungs, via the<br />

trachea or windpipe, the tensed vocal<br />

cords within the larynx are caused to<br />

vibrate (Bernoulli oscillation) by the air<br />

flow<br />

• air is chopped up into quasi-periodic<br />

pulses which are modulated in frequency<br />

( (spectrally t ll shaped) h d) iin passing i th through h th the<br />

pharynx (the throat cavity), the mouth<br />

cavity, and possibly the nasal cavity; the<br />

positions p <strong>of</strong> the various articulators (j (jaw, ,<br />

tongue, velum, lips, mouth) determine the<br />

sound that is produced<br />

5


Tube Models<br />

6


Tube Models<br />

7


Vocal Cords<br />

The vocal cords (folds) form a relaxation<br />

oscillator. Air pressure builds up and<br />

blows them apart. Air flows through the<br />

orifice and pressure drops allowing the<br />

vocal cords to close. close Then the cycle is<br />

repeated.<br />

8


Vocal Cord Cord Views and Operation<br />

Operation<br />

Bernoulli Oscillation Tensed Vocal Cords –<br />

RReady d t to Vib Vibrate t<br />

Lax Vocal Cords –<br />

Open for Breathing Breath ng<br />

9


Glottal Flow<br />

Flow<br />

Glottal volume velocity and resulting sound pressure at the mouth for the<br />

first 30 msec <strong>of</strong> a voiced sound<br />

• 15 msec buildup to periodicity => pitch detection issues at beginning<br />

and end <strong>of</strong> voicing; also voiced voiced-unvoiced unvoiced uncertainty for 15 msec<br />

10


Artificial Larynx<br />

Larynx<br />

Artificial Larynx Demo<br />

11


Schematic <strong>Production</strong> Mechanism<br />

Schematic representation <strong>of</strong><br />

physiological mechanisms <strong>of</strong><br />

speech production<br />

• lungs and associated muscles act as the source<br />

<strong>of</strong> air for exciting the vocal mechanism<br />

• muscle force pushes air out <strong>of</strong> the lungs<br />

(like a piston pushing air up within a<br />

cylinder) through bronchi and trachea<br />

• if vocal cords are tensed, air flow causes<br />

them to vibrate, producing voiced or quasi-<br />

periodic speech sounds (musical notes)<br />

• if vocal cords are relaxed, air flow<br />

continues ti th through h vocal l t tract t until til it it hits hit a<br />

constriction in the tract, causing it to<br />

become turbulent, thereby producing<br />

unvoiced sounds (like ( /s/, , /sh/), ), or it hits a<br />

point <strong>of</strong> total closure in the vocal tract,<br />

building up pressure until the closure is<br />

opened and the pressure is suddenly and<br />

abruptly released, released causing a brief transient<br />

sound, like at the beginning <strong>of</strong> /p/, /t/, or /k/<br />

12


Abstractions <strong>of</strong> Physical Model<br />

Time-Varying<br />

Filter<br />

excitation speech<br />

voiced<br />

unvoiced<br />

mixed<br />

13


The <strong>Speech</strong> <strong>Speech</strong> Signal<br />

Signal<br />

• speech hiis a sequence <strong>of</strong> f ever changing h i sounds d<br />

• sound properties are highly dependent on<br />

context (i.e., the sounds which occur before and<br />

after the current sound)<br />

• the state <strong>of</strong> the vocal cords, the positions, shapes<br />

and sizes <strong>of</strong> the various articulators articulators—all all change<br />

slowly over time, thereby producing the desired<br />

speech sounds<br />

=> need to determine the physical properties <strong>of</strong><br />

speech by observing and measuring the speech<br />

waveform (as ( well as signals g derived from the<br />

speech waveform—e.g., the signal spectrum)<br />

14


<strong>Speech</strong> Waveforms Waveforms and Spectra<br />

100 msec<br />

• 100 msec/line; 0.5 sec for<br />

utterance<br />

• S-silence silence-background<br />

background-no no speech<br />

• U-unvoiced, unvoiced, no vocal cord<br />

vibration b at o (aspiration, (asp at o , unvoiced u o ced<br />

sounds)<br />

• V-voiced voiced-quasi quasi-periodic periodic speech<br />

• speech is a slowly time time varying<br />

varying<br />

signal over 55-100<br />

100 msec intervals<br />

• over longer intervals (100 msec msec-5 5<br />

sec) sec), the speech characteristics<br />

change as rapidly as 10 10-20 20<br />

times/second<br />

=> no well well-defined defined or exact regions<br />

where individuals sounds begin<br />

and end<br />

15


• “Should Should we chase chase”<br />

– /sh/ sound<br />

– /ould/ sounds<br />

– /we/ sounds<br />

– /ch/ sound<br />

– /a/ sound<br />

– /s/ sound<br />

<strong>Speech</strong> Sounds<br />

• hard to distinguish weak sounds from silence<br />

• hard to segment with high precision => don’t do it when it can be avoided<br />

COOL EDIT demo—’should’, ‘every’<br />

16


Estimate <strong>of</strong> Pitch Period -I TH<br />

IY<br />

IY<br />

V Z<br />

HH<br />

UW<br />

17


Range Range <strong>of</strong> Pitch Pitch Period<br />

Period<br />

Average Maximum Minimum<br />

(msec) ( ) (msec) ( ) (msec) ( )<br />

Male 8 12.5 5<br />

Female 4.4 6.7 2.9<br />

Child 33 3.3 5 2<br />

Range <strong>of</strong> Pitch Pitch Frequency<br />

Average Minimum Maximum<br />

(Hz) ( ) (Hz) ( ) (Hz) ( )<br />

Male 125 80 200<br />

Female 225 150 350<br />

Child 300 200 500<br />

18


<strong>Acoustic</strong> <strong>Theory</strong> <strong>of</strong> <strong>Speech</strong><br />

P <strong>Production</strong> d i<br />

19


Source/System Model<br />

20


Making <strong>Speech</strong> “Visible” in 1947<br />

21


Spectrogram Properties<br />

<strong>Speech</strong> p Spectrogram p g —sound intensity y versus time and<br />

frequency<br />

• wideband spectrogram -spectral analysis on 15 msec<br />

sections <strong>of</strong> waveform using a broad (125 Hz) bandwidth<br />

analysis filter, with new analyzes every 1 msec<br />

– spectral intensity resolves individual periods <strong>of</strong> the speech and<br />

shows s o svertical etca st striations ato sdu during gvoiced ocedregions ego s<br />

• narrowband spectrogram -spectral analysis on 50<br />

msec sections <strong>of</strong> waveform using a narrow (40 Hz)<br />

bandwidth analysis filter filter, with new analyzes every 1<br />

msec<br />

– narrowband spectrogram resolves individual pitch harmonics<br />

and shows horizontal striations during voiced regions<br />

22


Wideband Spectrograms <strong>of</strong> s5.wav and s5_synthetic.wav<br />

23


Narrowband Spectrogram p g <strong>of</strong> f s5.wav<br />

24


Sound Spectrogram<br />

Spectrogram<br />

wavsurfer demo—’s5’, ‘s5_synthetic’<br />

voicebox demo—’s5’, ‘s5_synthetic’<br />

COLEA demo—’should’, , ‘every’ y<br />

Wav Surfer:<br />

www.wavsurfer.com<br />

VoiceBox:<br />

www.ee.ic.ac.uk/hp/staff/dmb/voic<br />

ebox/voicebox.htm<br />

COLEA UI:<br />

www.utdallas.edu/~loizou/speech/<br />

colea colea.htm htm<br />

HMM Toolkit:<br />

25<br />

www.ai.mit.edu/~murphyk/S<strong>of</strong>twa<br />

re/HMM/hmm.html#hmm


<strong>Speech</strong> Sentence Waveform<br />

26


<strong>Speech</strong> Wideband Spectrogram<br />

two plus seven is less than ten<br />

27


Parametrization <strong>of</strong> Spectra<br />

• human vocal tract is essentially a tube <strong>of</strong> varying cross sectional<br />

area, or can be approximated as a concatentation <strong>of</strong> tubes <strong>of</strong> varying<br />

cross sectional areas<br />

• acoustic theory shows that the transfer function <strong>of</strong> energy from the<br />

excitation source to the output can be described in terms <strong>of</strong> the natural<br />

frequencies or resonances <strong>of</strong> the tube<br />

• resonances known as formants or formant frequencies for speech<br />

and d th they represent t the th ffrequencies i th that t pass the th most t acoustic ti energy<br />

from the source to the output<br />

• typically there are 3 significant formants below about 3500 Hz<br />

• formants are a highly efficient, efficient compact representation <strong>of</strong> speech<br />

28


Spectrogram Spectrogram and Formants<br />

Key Issue Issue: Issue Issue: :<br />

reliability in<br />

estimating<br />

formants from<br />

spectral data<br />

29


<strong>Acoustic</strong> <strong>Acoustic</strong> <strong>Theory</strong> Summary<br />

• basic speech processes — from ideas to<br />

speech (production) (production), from speech to ideas<br />

(perception)<br />

• basic vocal production mechanisms — vocal<br />

tract, nasal tract, velum<br />

• source <strong>of</strong> sound flow at the glottis; output <strong>of</strong><br />

sound flow at the lips and nose<br />

• speech waveforms and properties — voiced,<br />

unvoiced unvoiced, silence silence, pitch<br />

• speech spectrograms and properties —<br />

wideband spectrograms, narrowband<br />

spectrograms, formants<br />

30


English <strong>Speech</strong> <strong>Speech</strong> Sounds<br />

Sounds<br />

ARPABET representation<br />

• 48 sounds<br />

• 18 vowels/diphthongs<br />

• 4vowel 4vowel-like 4 vowel like like consonants<br />

consonants<br />

• 21 standard consonants<br />

• 4 syllabic sounds<br />

• 1 glottal stop<br />

31


Phonemes<br />

Phonemes—Link Link Between<br />

O Orthography h h and d <strong>Speech</strong> S h<br />

Orthography sequence <strong>of</strong> sounds<br />

• Larry /l/ /ae/ /r/ /iy/ (/L/ /AE/ /R/ /IY/)<br />

S <strong>Speech</strong> h Waveform W f sequence <strong>of</strong> f sounds d<br />

• based on acoustic properties (temporal) <strong>of</strong> phonemes<br />

Spectrogram sequence <strong>of</strong> sounds<br />

• based on acoustic properties (spectral) <strong>of</strong> phonemes<br />

The bottom line is that we use a phonetic code as an intermediate<br />

representation <strong>of</strong> language, from either orthography or from<br />

waveforms a eforms or spectrograms spectrograms; no now we e ha have e to learn ho how to<br />

recognize sounds within speech utterances<br />

32


Phonetic Transcriptions<br />

Transcriptions<br />

• based on ideal (dictionary-based) ( y )p pronunciations <strong>of</strong><br />

all words in sentence<br />

– ‘My name is Larry’-/M/ /AY/-/N/ /EY/ /M/-/IH/ /Z/-/L/ /AE/<br />

/R/ /IY/<br />

– ‘How old are you’-/H/ /AW/-/OW/ /L/ /D/-/AA/ /R/-/Y/ /UW/<br />

– ‘<strong>Speech</strong> processing is fun’-/S/ /P/ /IY/ /CH/-/P/ /R/ /AH/<br />

/S/ /EH/ /S/ /IH/ /NG/ /NG/-/IH/ /IH/ /Z/-/F/ /Z/ /F/ /AH/ /N/<br />

• word ambiguity abounds<br />

– ‘lives’-/L/ /IH/ /V/ /Z/ (he lives here) versus /L/ /AY/ /V/ /Z/<br />

(a cat has nine lives)<br />

– ‘record’-/R/ /EH/ /K/ /ER/ /D/ (he holds the world record)<br />

versus /R/ /IY/ /K/ /AW/ /D/ (please record my favorite<br />

show tonight)<br />

33


Phoneme Phoneme Classification Chart<br />

Chart<br />

EY<br />

Vocal Cords<br />

Vib Vibrating ti<br />

Noise-Like<br />

Excitation<br />

34


Reduced Reduced Set <strong>of</strong> English Sounds<br />

• 39 sounds<br />

– 11 vowels (front, mid, back) classification based on<br />

tongue hump position<br />

– 4 diphthongs (vowel-like (vowel like combinations)<br />

– 4 semi-vowels (liquids and glides)<br />

– 3 nasal consonants<br />

– 6 voiced and unvoiced stop consonants<br />

– 8 voiced and unvoiced fricative consonants<br />

– 2 affricate consonants<br />

– 1 whispered sound<br />

• look at each class <strong>of</strong> sounds to characterize<br />

their acoustic and spectral properties<br />

35


Vowels<br />

• longest duration sounds – least context sensitive<br />

• can be held indefinitely in singing and other musical<br />

works ( (opera) )<br />

• carry very little linguistic information (some languages<br />

don’t display p y vowels in text-Hebrew, , Arabic) )<br />

Text 1: all vowels deleted<br />

Th Th_y n_t_d t d s_gn_f_c_nt f t _mpr_v_m_nts t _n th_ th c_mp_ny’s ’<br />

_m_g_, s_p_rv_s__n _nd m_n_g_m_nt.<br />

Text 2: all consonants deleted<br />

A__i_u_e_ _o_a__ _a_ __a_e_ e__e__ia___ __e _a_e,<br />

_i__ i __e e __i_e_ i e oo_ oo__u_a_io_a_ u a io a ee___o_ee_ o ee __i_____ i<br />

_e__ea_i__.<br />

36


Vowels and Consonants<br />

Text 1: all vowels deleted<br />

Th_y n_t_d s_gn_f_c_nt _mpr_v_m_nts _n th_ c_mp_ny’s<br />

_m_g_, m g s_p_rv_s__n s p rv s n _nd nd m_n_g_m_nt.<br />

m n g m nt<br />

(They noted significant improvements in the company’s<br />

image, supervision and management.)<br />

Text 2: all consonants deleted<br />

A__i_u_e_ _o_a__ _a_ __a_e_ e__e__ia___ __e _a_e,<br />

_i__ __e __i_e_ o_ o__u_a_io_a_ e___o_ee_ __i_____<br />

_e__ea_i__. e ea i<br />

(Attitudes toward pay stayed essentially the same, with the<br />

scores <strong>of</strong> occupational employees slightly decreasing)<br />

37


Vowels<br />

• produced p using g fixed vocal tract shape p<br />

• sustained sounds<br />

• vocal cords are vibrating g => voiced sounds<br />

• cross-sectional area <strong>of</strong> vocal tract determines<br />

vowel resonance frequencies and vowel sound<br />

quality lit<br />

• tongue position (height, forward/back position)<br />

most important in determining vowel sound<br />

• usually relatively long in duration (can be held<br />

during g singing) g g) and are spectrally p y<br />

well formed<br />

38


Vowel <strong>Production</strong><br />

<strong>Production</strong><br />

39


Vowel Waveforms &<br />

Spectrograms<br />

Synthetic versions <strong>of</strong> the<br />

10 vowels<br />

40


Vowel Formants<br />

Formants<br />

Clear pattern <strong>of</strong> variability<br />

<strong>of</strong> vowel pronunciation<br />

among men, women and<br />

children hild<br />

Strong overlap for different<br />

vowel sounds by different<br />

ttalkers lk => no unique i<br />

identification <strong>of</strong> vowel<br />

strictly from resonances<br />

=> need context to define<br />

vowel sound<br />

41


Canonic Vowel Spectra<br />

10 Hz 33 Hz<br />

100 Hz<br />

IY IY<br />

AA<br />

AA<br />

UW UW<br />

100 Hz Fundamental<br />

42


Canonic Vowel Spectra<br />

IY IY<br />

AA<br />

UW<br />

100 100 Hz H F Fundamental d t l 300 Hz<br />

300 Hz Fundamental<br />

AA<br />

UW<br />

43


Eliminating Vowels and Consonants<br />

44


• Gliding speech sound<br />

that starts at or near the<br />

articulatory position for<br />

one vowel and moves to<br />

or toward the position p for<br />

another vowel<br />

– /AY/ in buy<br />

– /AW/ in down<br />

– /EY/ in bait<br />

– /OY/ in boy<br />

Diphthongs<br />

45


Distinctive Distinctive Features<br />

Features<br />

Classify non-vowel/non-diphthong sounds in terms <strong>of</strong> distinctive<br />

features<br />

– place <strong>of</strong> articulation<br />

• Bilabial (lips)—p,b,m,w<br />

• Labiodental (between lips and front <strong>of</strong> teeth)-f,v<br />

• Dental (teeth)-th,dh<br />

• Alveolar (front <strong>of</strong> palate)-t,d,s,z,n,l<br />

• Palatal (middle <strong>of</strong> palate)-sh,zh,r<br />

• Velar (at velum)-k,g,ng<br />

• Pharyngeal (at end <strong>of</strong> pharynx)-h<br />

– manner <strong>of</strong> articulation<br />

• Glide—smooth motion-w,l,r,y<br />

• Nasal—lowered velum-m,n,ng<br />

• Stop—constricted vocal tract-p,t,k,b,d,g<br />

• Fricative—turbulent source-f,th,s,sh,v,dh,z,zh,h<br />

• Voicing—voiced source-b,d,g,v,dh,z,zh,m,n,ng,w,l,r<br />

• Mixed source—both voicing and unvoiced-j,ch<br />

• Whispered--h<br />

46


Places <strong>of</strong> Articulation<br />

47


Semivowels Semivowels (Liquids and and Glides)<br />

Glides)<br />

• vowel-like vowel like in nature (called semivowels for<br />

this reason)<br />

• voiced sounds (w (w-l-r-y) l r y)<br />

• acoustic characteristics <strong>of</strong> these sounds<br />

are strongly t l influenced i fl d bby context—unlike<br />

t t lik<br />

most vowel sounds which are much less<br />

iinfluenced fl d bby context t t<br />

uh-{w,l,r,y}-a<br />

Manner: glides<br />

Place: bilabial (w), alveolar (l),<br />

palatal (r)<br />

48


Nasal Consonants<br />

• The nasal consonants consist <strong>of</strong> /M/, /N/, and /NG/<br />

– nasals produced using glottal excitation => voiced sounds<br />

– vocal tract totally constricted at some point along the tract<br />

– velum lowered so sound is radiated at nostrils<br />

– constricted oral cavity serves as a resonant cavity that traps<br />

acoustic energy at certain natural frequencies (anti-resonances<br />

or zeros <strong>of</strong> transmission)<br />

– /M/ is produced with a constriction at the lips => low frequency<br />

zero<br />

– /N/ is produced with a constriction just behind the teeth => higher<br />

frequency zero<br />

– /NG/ is produced with a constriction just forward <strong>of</strong> the velum =><br />

even higher frequency zero<br />

uh-{m,n,ng}-a<br />

Manner: nasal<br />

Place: bilabial (m), alveolar (n), velar<br />

(ng)<br />

49


Nasal <strong>Production</strong><br />

50


Nasal Sounds<br />

Hole in<br />

spectrum<br />

UH M AA UH N AA<br />

51


Unvoiced Fricatives<br />

• Consonant sounds /F/, /TH/, /S/, /SH/<br />

– produced by exciting vocal tract by steady air flow<br />

which becomes turbulent in region <strong>of</strong> a constriction in<br />

the vocal tract<br />

• /F/ constriction near the lips<br />

• /TH/ constriction near the teeth<br />

• /S/ constriction near the middle <strong>of</strong> the vocal tract<br />

• /SH/ constriction near the back <strong>of</strong> the vocal tract<br />

– noise source at constriction => vocal tract is<br />

separated into two cavities<br />

– sound radiated from lips – front cavity<br />

– back cavity traps energy and produces antiresonances<br />

(zeros <strong>of</strong> transmission)<br />

uh-{f,th,s,sh}-a<br />

Manner: fricative<br />

Place: labiodental (f), dental (th),<br />

alveolar (s), palatal (sh)<br />

52


Unvoiced Fricative <strong>Production</strong><br />

53


Unvoiced Fricatives<br />

UH F AA UH S AA UH SH AA<br />

54


Voiced Fricatives<br />

Fricatives<br />

• Sounds /V/,/DH/, /Z/, /ZH/<br />

– place <strong>of</strong> constriction same as for unvoiced<br />

counterparts<br />

– two sources <strong>of</strong> excitation; vocal cords<br />

vibrating producing semi-periodic puffs <strong>of</strong> air<br />

to excite the tract; the resulting air flow<br />

becomes turbulent at the constriction giving a<br />

noise noise-like like component in addition to the<br />

voiced-like component<br />

uh-{v,dh,z,zh}-a<br />

Manner: fricative<br />

Place: labiodental (v), dental (dh),<br />

alveolar (z), palatal (zh)<br />

55


Voiced Fricatives<br />

UH V AA UH ZH AA<br />

56


Voiced and Unvoiced Stop<br />

Consonants<br />

• sounds-/B/, /D/, /G/ (voiced stop consonants) and /P/, /T/<br />

/K/ (unvoiced ( i d stop t consonants) t )<br />

– voiced stops are transient sounds produced by building up<br />

pressure behind a total constriction in the oral tract and then<br />

suddenly releasing the pressure, pressure resulting in a pop-like pop like sound<br />

• /B/ constriction at lips<br />

Manner: stop<br />

• /D/ constriction at back <strong>of</strong> teeth<br />

• /G/ constriction at velum<br />

uh-{b,d,g}-a<br />

– no sound diis radiated dit dffrom th the li lips dduring i constriction ti ti =><br />

sometimes sound is radiated from the throat during constriction<br />

(leakage through tract walls) allowing vocal cords to vibrate in<br />

spite <strong>of</strong> total constriction<br />

– stop sounds strongly influenced by surrounding sounds<br />

– unvoiced stops have no vocal cord vibration during period <strong>of</strong><br />

closure => brief period <strong>of</strong> frication (due to sudden turbulence <strong>of</strong><br />

escaping air) and aspiration (steady air flow from the glottis)<br />

before voiced excitation begins<br />

Place: bilabial (b,p), alveolar<br />

(d,t), velar (g, k)<br />

57


Stop Consonant Consonant <strong>Production</strong><br />

<strong>Production</strong><br />

58


Voiced Stop Consonant<br />

UH B AA<br />

59


Unvoiced Stop Consonants<br />

Stop p Gap p<br />

UH P AA UH T AA<br />

uh-{p,t,k}-a<br />

uh-{j,ch,h}-a<br />

60


Distinctive Distinctive Phoneme Phoneme Features<br />

Features<br />

• the brain recognizes sounds by doing a distinctive feature<br />

analysis from the information going to the brain<br />

• the distinctive features are somewhat insensitive to noise,<br />

background, reverberation => they are robust and reliable<br />

61


IY-beat<br />

IH IH-bit bit<br />

EH-bet<br />

AE-bat<br />

AA-bob<br />

ER-bird<br />

AH-but AH but<br />

AO-bought<br />

UW-boot<br />

UH UH-book book<br />

OW-boat<br />

AW-down<br />

AY-buy<br />

OY-boy<br />

EY-bait EY bait<br />

Review Exercises<br />

Exercises<br />

Write the transcription <strong>of</strong> the<br />

sentence “Good friends are hard to<br />

find”<br />

G-UH-D F-R-EH-N-D-Z AA-R HH-<br />

AA-R-D T-UH (UW) F-AY-N-D<br />

62


samples s <strong>of</strong>fset<br />

0<br />

2000<br />

4000<br />

6000<br />

8000<br />

Review Exercises<br />

file: enjoy 1 0k, sampling rate: 10000, starting sample: 1 number <strong>of</strong> samples 8079<br />

EH N<br />

N JH OY<br />

OY<br />

OY<br />

0 200 400 600 800 1000<br />

sample number<br />

1200 1400 1600 1800 2000<br />

Enjoy:<br />

EH-N-JH-OY<br />

63


samples<br />

<strong>of</strong>fset<br />

0<br />

2000<br />

4000<br />

6000<br />

Review Exercises<br />

Exercises<br />

file: simple 1 0k, sampling rate: 10000, starting sample: 1 number <strong>of</strong> samples 7152<br />

S IH M<br />

P AX AX-L L | EL<br />

0 200 400 600 800 1000<br />

sample l number<br />

b<br />

1200 1400 1600 1800 2000<br />

S<br />

Simple:<br />

S-IH-M-P-<br />

(AX-L | EL)<br />

64


TH-IH S IH Z UH T EH S T<br />

This is a test (16 kHz sampling rate)<br />

65


Word Choices:<br />

that, and, was, by,<br />

people, little, simple,<br />

between, very, enjoy,<br />

only only, other other, company company,<br />

those<br />

Ultimate Exercise Exercise—Identify Identify Words From Spectrogram<br />

Top Row: simple; l enjoy; those<br />

Second Row: was; between; company<br />

66


Ultimate Exercise Exercise—Identify Identify Words From Spectrogram<br />

Word Choices:<br />

that, and, was, by,<br />

people people, little little, simple simple,<br />

between, very, enjoy,<br />

only, other, company,<br />

those<br />

/was/ -- this word can be<br />

identified by the voiced<br />

initial portion with very<br />

low first and second<br />

formants (sounds like UW<br />

or W), followed by the AA<br />

sound and ending with the<br />

Z (S) sound.<br />

67


Ultimate Exercise Exercise—Identify Identify Words From Spectrogram<br />

Word Choices:<br />

that, and, was, by,<br />

people, little, simple,<br />

between between, very very, enjoy enjoy,<br />

only, other, company,<br />

those<br />

/enjoy/ – this word can be<br />

identified by the two two-syllable syllable<br />

nature nature, with the nasal sound<br />

N at the end <strong>of</strong> the first<br />

syllable syllable, and the fricative JH<br />

at the beginning <strong>of</strong> the second<br />

syllable, with the characteristic<br />

OY diphthong at the end <strong>of</strong><br />

th the word<br />

d<br />

68


Ultimate Exercise Exercise—Identify Identify Words From Spectrogram<br />

Word Choices:<br />

that, , and, , was, , by, y,<br />

people, little, simple,<br />

between, very, enjoy,<br />

only, other, company,<br />

those<br />

/company/ – this word can be<br />

identified by the three syllable<br />

nature nature, with the initial stop<br />

consonant K, the first syllable<br />

ending in the nasal MM, followed by<br />

the stop P, and with the second<br />

syllable ending with the nasal N<br />

followed by an IY vowel-like<br />

sound<br />

69


Ultimate Exercise Exercise—Identify Identify Words From Spectrogram<br />

Word Choices:<br />

that, and, was, by,<br />

people, little, simple,<br />

between, very, enjoy,<br />

only, other, company,<br />

those<br />

/ /simple/ i l / – thi this word d can bbe<br />

identified by the two two-syllable syllable<br />

nature nature, with a strong initial<br />

fricative S beginning the first<br />

syllable ll bl and d th the nasal l M<br />

ending the first syllable, and<br />

with the stop consonant P<br />

beginning the second syllable<br />

70


Spectrograms <strong>of</strong> Cities<br />

A<br />

Los Angeles<br />

B<br />

Chicago<br />

C<br />

Denver<br />

D<br />

Seattle<br />

E<br />

Memphis p<br />

City Choices: Chicago, Denver, Los Angeles, Memphis, Seattle


Summary<br />

• sounds <strong>of</strong> the English language—phonemes,<br />

language phonemes,<br />

syllables, words<br />

• phonetic p transcriptions p <strong>of</strong> words and<br />

sentences — coarticulation across word<br />

boundaries<br />

• vowels and consonents — their roles,<br />

articulatory shapes, waveforms, spectrograms,<br />

formants<br />

72


Sound Sound <strong>Perception</strong><br />

<strong>Perception</strong><br />

73


Some Facts About Human<br />

H Hearing i<br />

• the range <strong>of</strong> human hearing is incredible<br />

– threshold <strong>of</strong> hearing — thermal limit <strong>of</strong> Brownian motion <strong>of</strong> air<br />

particles in the inner ear<br />

– threshold <strong>of</strong> pain — intensities <strong>of</strong> from 10**12 to 10**16 greater<br />

than the threshold <strong>of</strong> hearing<br />

• human hearing perceives both sound frequency and<br />

sound direction<br />

– can detect weak spectral components in strong broadband noise<br />

• masking is the phenomenon whereby one loud sound<br />

makes another s<strong>of</strong>ter sound inaudible<br />

– masking is most effective for frequencies around the masker<br />

frequency<br />

– masking is used to hide quantizer noise by methods <strong>of</strong> spectral<br />

shaping (similar grossly to Dolby noise reduction methods)<br />

74


Anechoic Chamber (no Echos)<br />

75


Sound Pressure Levels (dB)<br />

SPL (dB)—Sound Source SPL (dB)—Sound Source<br />

160 Jet Engine — close up 70 Busy Street; Noisy<br />

150 Firecracker; Artillery Fire<br />

Restaurant<br />

140<br />

130<br />

120<br />

110<br />

Rock Singer Screaming into<br />

Microphone: Jet Take<strong>of</strong>f<br />

Threshold <strong>of</strong> Pain; .22<br />

Caliber Rifle<br />

Planes on Airport Runway;<br />

Rock Concert; Thunder<br />

Power Tools; Shouting in Ear<br />

60<br />

50<br />

40<br />

30<br />

Conversational <strong>Speech</strong> — 1<br />

foot<br />

Average Office Noise; Light<br />

Traffic; Rainfall<br />

Quiet Conversation;<br />

RRefrigerator; f i t Lib Library<br />

Quiet Office; Whisper<br />

100 SSubway b TTrains; i GGarbage b<br />

Truck<br />

90 Heavy Truck Traffic; Lawn<br />

Mower<br />

80 Home Stereo — 1 foot; Blow<br />

Dryer<br />

20 Quiet Living Room; Rustling<br />

Leaves<br />

10 Quiet Recording Studio;<br />

Breathing<br />

0 Threshold <strong>of</strong> Hearing<br />

76


<strong>Speech</strong> Communication<br />

77


Overview <strong>of</strong> Auditory Mechanism<br />

Model <strong>of</strong> signal processing for perception and<br />

understanding <strong>of</strong> speech<br />

78


The The Human Human Ear<br />

Ear<br />

OOuter t ear: pinna i and d external t l canall<br />

Middle ear: tympanic membrane or<br />

eardrum<br />

Inner ear: cochlea, neural connections<br />

79


Human Ear<br />

• Outer ear: funnels sound into ear canal<br />

• Middle ear: sound impinges on tympanic<br />

membrane; this causes motion<br />

– middle ear is a mechanical transducer, consisting <strong>of</strong> the<br />

hammer, anvil and stirrup; it converts acoustical sound<br />

wave to mechanical vibrations along the inner ear<br />

• Inner ear: the cochlea is a fluid-filled chamber<br />

partitioned by the basilar membrane<br />

– the auditory nerve is connected to the basilar membrane<br />

via inner hair cells<br />

– mechanical vibrations at the entrance to the cochlea<br />

create standing waves (<strong>of</strong> fluid inside the cochlea)<br />

causing basilar membrane to vibrate at frequencies<br />

commensurate with the input acoustic wave frequencies<br />

(formants) and at a place along the basilar membrane<br />

that is associated with these frequencies<br />

80


The The Outer Outer Ear<br />

Ear<br />

81


The Middle Middle Ear<br />

Ear<br />

The Hammer (Malleus),<br />

Anvil (Incus) and Stirrup<br />

(Stapes) are the three<br />

tiniest bones in the body.<br />

Together they form the<br />

coupling between the<br />

vibration <strong>of</strong> the eardrum<br />

and the forces exerted on<br />

the oval window <strong>of</strong> the<br />

inner ear ear.<br />

These bones can be<br />

thought <strong>of</strong> as a compound<br />

lever which achieves a<br />

multiplication <strong>of</strong> force—by<br />

a factor <strong>of</strong> about three<br />

under optimum conditions.<br />

(Th (They also l protect t t the th ear<br />

against loud sounds by<br />

82<br />

attenuating the sound.)


Tympanic y p<br />

Membrane<br />

The Cochlea<br />

Malleus Ossicles<br />

Incus<br />

Stapes<br />

(Middle Ear Bones)<br />

AAuditory dit nnerves<br />

Oval Window<br />

Cochlea<br />

RRound d Wi Window d<br />

Vestibule<br />

83


Stretched Cochlea & Basilar Membrane<br />

Scala<br />

Vestibuli<br />

Basilar<br />

Membrane<br />

1600 Hz<br />

800 Hz<br />

400 Hz<br />

200 Hz<br />

Cochlear Base 100 Hz<br />

(high<br />

frequency) Unrolled<br />

Cochlea<br />

50 Hz<br />

Relative<br />

amplitude<br />

25 Hz<br />

0 10 20 30<br />

Distance from Stapes (mm)<br />

Cochlear Apex<br />

(low ( ffrequency) q y)<br />

84


Basilar Membrane Motion<br />

85


Audience Noise Noise Suppression<br />

Suppression<br />

86


Basilar Membrane Membrane Mechanics<br />

Mechanics<br />

• characterized by a set <strong>of</strong> frequency responses at different points<br />

along the membrane<br />

• mechanical realization <strong>of</strong> a bank <strong>of</strong> filters<br />

• filters are roughly constant Q (center frequency/bandwidth) with<br />

logarithmically g y decreasing g bandwidth<br />

• distributed along the Basilar Membrane is a set <strong>of</strong> sensors called<br />

Inner Hair Cells (IHC) which act as mechanical motion-to-neural<br />

activity converters<br />

• mechanical motion along the BM is sensed by local IHC causing<br />

firing activity at nerve fibers that innervate bottom <strong>of</strong> each IHC<br />

• each IHC connected to about 10 nerve fibers, each <strong>of</strong> different<br />

diameter => thin fibers fire at high motion levels, thick fibers fire at<br />

llower motion ti llevels l<br />

• 30,000 nerve fibers link IHC to auditory nerve<br />

• electrical pulses run along auditory nerve, ultimately reach higher<br />

levels <strong>of</strong> auditory processing in brain brain, perceived as sound<br />

87


Hearing Thresholds<br />

Thresholds<br />

• Threshold <strong>of</strong> Audibilityy is the acoustic intensity y<br />

level <strong>of</strong> a pure tone that can barely be heard at a<br />

particular frequency<br />

– th threshold h ld <strong>of</strong> f audibility dibilit ≈ 0 dB at t 1000 HHz<br />

– threshold <strong>of</strong> feeling ≈ 120 dB<br />

– threshold <strong>of</strong> pain ≈ 140 dB<br />

– immediate damage ≈ 160 dB<br />

• Thresholds vary with frequency and from<br />

person-to-person<br />

• Maximum sensitivity is at about 3000 Hz<br />

88


Sound Prressure<br />

Level L<br />

Range Range <strong>of</strong> Human Human Hearing<br />

Hearing<br />

140<br />

120<br />

100<br />

80<br />

60<br />

40<br />

20<br />

0<br />

Music<br />

Threshold in Quiet<br />

Threshold <strong>of</strong> Pain<br />

Contour <strong>of</strong> Damage Risk<br />

140<br />

120<br />

100<br />

80<br />

<strong>Speech</strong> p<br />

60<br />

00.02 02 0.05 0 05 0.1 0 1 0.2 0 2 0.5 0 5 1 2 5 10 20<br />

Frequency (kHz)<br />

40<br />

20<br />

0<br />

Sound Inntensity<br />

Level L<br />

89


Piitch<br />

in mels m<br />

Pitch Pitch-The Pitch The The Mel Mel Scale<br />

Scale<br />

3000<br />

2000<br />

1000<br />

0<br />

20 40 100 200 400 1k 2k 4k 10k<br />

Frequency in Hz<br />

Pitch ( mels ) =<br />

3233log10(<br />

1+<br />

f<br />

/ 1000)<br />

90


Sound pressure p e level (ddB)<br />

Auditory Auditory Masking<br />

Masking<br />

70<br />

50<br />

30<br />

10<br />

0<br />

0.02<br />

Tone masker @ 1kHz<br />

threshold in<br />

quiet q<br />

Inaudible range<br />

0.079<br />

Signal perceptible even in the<br />

presence p<br />

<strong>of</strong> the tone masker<br />

threshold when<br />

masker is<br />

present<br />

0.313 1.25 2.5 5 10<br />

F Frequency (KHz) (KH )<br />

20<br />

Signal not perceptible due to<br />

the presence <strong>of</strong> the tone<br />

masker<br />

91


Exploiting Exploiting Masking Masking in in Coding<br />

Coding<br />

Level L (dB) )<br />

Power Spectrum<br />

Predicted Masking Threshold<br />

Bit Assignment (Equivalent SNR)<br />

Frequency (Hz)<br />

92


<strong>Speech</strong> <strong>Perception</strong> Issues<br />

Issues<br />

• Ear performs p spectral p analysis y along g basilar membrane<br />

– both place (along basilar membrane) and frequency are<br />

significant<br />

– frequency scale is pseudo-logarithmic pseudo logarithmic (mel or Bark scale)<br />

– amplitude is logarithmic<br />

– maximal frequency sensitivity between 1 and 3 kHz<br />

• Tonal and noise masking hide tones and noise in bands<br />

close to masker<br />

– critical bandwidths as a function <strong>of</strong> frequency q y<br />

– masking restricted to critical band region<br />

• Neural processing is still not well understood<br />

– evidence for distinctive features (place and manner features <strong>of</strong><br />

sounds) 93

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!