Acoustic Theory of Speech Production; Speech Perception Speech ...
Acoustic Theory of Speech Production; Speech Perception Speech ...
Acoustic Theory of Speech Production; Speech Perception Speech ...
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Digital Signal Processing<br />
Design — Lecture 4b<br />
<strong>Acoustic</strong> <strong>Theory</strong> <strong>of</strong><br />
<strong>Speech</strong> <strong>Production</strong>;<br />
<strong>Speech</strong> <strong>Perception</strong><br />
1
Basics<br />
• speech is composed <strong>of</strong> a sequence <strong>of</strong> sounds<br />
• sounds (and ( transitions between them) ) serve as a<br />
symbolic representation <strong>of</strong> information to be shared<br />
between humans (or humans and machines)<br />
• arrangement <strong>of</strong> sounds is governed by rules <strong>of</strong><br />
language (constraints on sound sequences, word<br />
sequences, etc)--/spl/ t ) / l/ exists, i t /sbk/ / bk/ doesn’t d ’t exist i t<br />
• linguistics is the study <strong>of</strong> the rules <strong>of</strong> language<br />
• phonetics h ti iis th the study t d <strong>of</strong> f the th sounds d <strong>of</strong> f speechh<br />
can exploit knowledge about the structure <strong>of</strong> sounds and language language—and and how it<br />
is encoded in the signal—to do speech analysis, speech coding, speech<br />
synthesis, speech recognition, speaker recognition, etc.<br />
2
Mid-sagittal plane<br />
X-ray <strong>of</strong> human<br />
vocal apparatus<br />
Human Vocal Vocal Apparatus<br />
• vocal tract —dotted dotted lines in figure; begins at the<br />
glottis (the vocal cords) and ends at the lips<br />
• consists <strong>of</strong> the pharynx (the connection from<br />
the esophagus to the mouth) and the mouth<br />
itself (the oral cavity)<br />
• average male vocal tract length is 17.5 cm<br />
• cross sectional area, determined by<br />
positions <strong>of</strong> the tongue, lips, jaw and velum,<br />
varies from zero (complete closure) to 20 sq<br />
cm<br />
• nasal tract —begins at the velum and ends at the<br />
nostrils<br />
• velum —a trapdoor-like mechanism at the back <strong>of</strong><br />
the mouth cavity; lowers to couple the nasal tract to<br />
the vocal tract to produce the nasal sounds like /m/<br />
(mom), /n/ (night), /ng/ (sing)<br />
Vocal Tract MRI Sequences<br />
Mid-sagittal plane X-ray <strong>of</strong> human vocal apparatus<br />
3
MRI – Shri Shri Narayanan<br />
Narayanan<br />
4
Schematic View <strong>of</strong> <strong>of</strong> Vocal Vocal Tract<br />
Tract<br />
<strong>Acoustic</strong> Tube Models Demo<br />
<strong>Speech</strong> <strong>Production</strong> Mechanism:<br />
• air enters the lungs via normal breathing<br />
and no speech is produced (generally) on<br />
in-take<br />
• as air is expelled from the lungs lungs, via the<br />
trachea or windpipe, the tensed vocal<br />
cords within the larynx are caused to<br />
vibrate (Bernoulli oscillation) by the air<br />
flow<br />
• air is chopped up into quasi-periodic<br />
pulses which are modulated in frequency<br />
( (spectrally t ll shaped) h d) iin passing i th through h th the<br />
pharynx (the throat cavity), the mouth<br />
cavity, and possibly the nasal cavity; the<br />
positions p <strong>of</strong> the various articulators (j (jaw, ,<br />
tongue, velum, lips, mouth) determine the<br />
sound that is produced<br />
5
Tube Models<br />
6
Tube Models<br />
7
Vocal Cords<br />
The vocal cords (folds) form a relaxation<br />
oscillator. Air pressure builds up and<br />
blows them apart. Air flows through the<br />
orifice and pressure drops allowing the<br />
vocal cords to close. close Then the cycle is<br />
repeated.<br />
8
Vocal Cord Cord Views and Operation<br />
Operation<br />
Bernoulli Oscillation Tensed Vocal Cords –<br />
RReady d t to Vib Vibrate t<br />
Lax Vocal Cords –<br />
Open for Breathing Breath ng<br />
9
Glottal Flow<br />
Flow<br />
Glottal volume velocity and resulting sound pressure at the mouth for the<br />
first 30 msec <strong>of</strong> a voiced sound<br />
• 15 msec buildup to periodicity => pitch detection issues at beginning<br />
and end <strong>of</strong> voicing; also voiced voiced-unvoiced unvoiced uncertainty for 15 msec<br />
10
Artificial Larynx<br />
Larynx<br />
Artificial Larynx Demo<br />
11
Schematic <strong>Production</strong> Mechanism<br />
Schematic representation <strong>of</strong><br />
physiological mechanisms <strong>of</strong><br />
speech production<br />
• lungs and associated muscles act as the source<br />
<strong>of</strong> air for exciting the vocal mechanism<br />
• muscle force pushes air out <strong>of</strong> the lungs<br />
(like a piston pushing air up within a<br />
cylinder) through bronchi and trachea<br />
• if vocal cords are tensed, air flow causes<br />
them to vibrate, producing voiced or quasi-<br />
periodic speech sounds (musical notes)<br />
• if vocal cords are relaxed, air flow<br />
continues ti th through h vocal l t tract t until til it it hits hit a<br />
constriction in the tract, causing it to<br />
become turbulent, thereby producing<br />
unvoiced sounds (like ( /s/, , /sh/), ), or it hits a<br />
point <strong>of</strong> total closure in the vocal tract,<br />
building up pressure until the closure is<br />
opened and the pressure is suddenly and<br />
abruptly released, released causing a brief transient<br />
sound, like at the beginning <strong>of</strong> /p/, /t/, or /k/<br />
12
Abstractions <strong>of</strong> Physical Model<br />
Time-Varying<br />
Filter<br />
excitation speech<br />
voiced<br />
unvoiced<br />
mixed<br />
13
The <strong>Speech</strong> <strong>Speech</strong> Signal<br />
Signal<br />
• speech hiis a sequence <strong>of</strong> f ever changing h i sounds d<br />
• sound properties are highly dependent on<br />
context (i.e., the sounds which occur before and<br />
after the current sound)<br />
• the state <strong>of</strong> the vocal cords, the positions, shapes<br />
and sizes <strong>of</strong> the various articulators articulators—all all change<br />
slowly over time, thereby producing the desired<br />
speech sounds<br />
=> need to determine the physical properties <strong>of</strong><br />
speech by observing and measuring the speech<br />
waveform (as ( well as signals g derived from the<br />
speech waveform—e.g., the signal spectrum)<br />
14
<strong>Speech</strong> Waveforms Waveforms and Spectra<br />
100 msec<br />
• 100 msec/line; 0.5 sec for<br />
utterance<br />
• S-silence silence-background<br />
background-no no speech<br />
• U-unvoiced, unvoiced, no vocal cord<br />
vibration b at o (aspiration, (asp at o , unvoiced u o ced<br />
sounds)<br />
• V-voiced voiced-quasi quasi-periodic periodic speech<br />
• speech is a slowly time time varying<br />
varying<br />
signal over 55-100<br />
100 msec intervals<br />
• over longer intervals (100 msec msec-5 5<br />
sec) sec), the speech characteristics<br />
change as rapidly as 10 10-20 20<br />
times/second<br />
=> no well well-defined defined or exact regions<br />
where individuals sounds begin<br />
and end<br />
15
• “Should Should we chase chase”<br />
– /sh/ sound<br />
– /ould/ sounds<br />
– /we/ sounds<br />
– /ch/ sound<br />
– /a/ sound<br />
– /s/ sound<br />
<strong>Speech</strong> Sounds<br />
• hard to distinguish weak sounds from silence<br />
• hard to segment with high precision => don’t do it when it can be avoided<br />
COOL EDIT demo—’should’, ‘every’<br />
16
Estimate <strong>of</strong> Pitch Period -I TH<br />
IY<br />
IY<br />
V Z<br />
HH<br />
UW<br />
17
Range Range <strong>of</strong> Pitch Pitch Period<br />
Period<br />
Average Maximum Minimum<br />
(msec) ( ) (msec) ( ) (msec) ( )<br />
Male 8 12.5 5<br />
Female 4.4 6.7 2.9<br />
Child 33 3.3 5 2<br />
Range <strong>of</strong> Pitch Pitch Frequency<br />
Average Minimum Maximum<br />
(Hz) ( ) (Hz) ( ) (Hz) ( )<br />
Male 125 80 200<br />
Female 225 150 350<br />
Child 300 200 500<br />
18
<strong>Acoustic</strong> <strong>Theory</strong> <strong>of</strong> <strong>Speech</strong><br />
P <strong>Production</strong> d i<br />
19
Source/System Model<br />
20
Making <strong>Speech</strong> “Visible” in 1947<br />
21
Spectrogram Properties<br />
<strong>Speech</strong> p Spectrogram p g —sound intensity y versus time and<br />
frequency<br />
• wideband spectrogram -spectral analysis on 15 msec<br />
sections <strong>of</strong> waveform using a broad (125 Hz) bandwidth<br />
analysis filter, with new analyzes every 1 msec<br />
– spectral intensity resolves individual periods <strong>of</strong> the speech and<br />
shows s o svertical etca st striations ato sdu during gvoiced ocedregions ego s<br />
• narrowband spectrogram -spectral analysis on 50<br />
msec sections <strong>of</strong> waveform using a narrow (40 Hz)<br />
bandwidth analysis filter filter, with new analyzes every 1<br />
msec<br />
– narrowband spectrogram resolves individual pitch harmonics<br />
and shows horizontal striations during voiced regions<br />
22
Wideband Spectrograms <strong>of</strong> s5.wav and s5_synthetic.wav<br />
23
Narrowband Spectrogram p g <strong>of</strong> f s5.wav<br />
24
Sound Spectrogram<br />
Spectrogram<br />
wavsurfer demo—’s5’, ‘s5_synthetic’<br />
voicebox demo—’s5’, ‘s5_synthetic’<br />
COLEA demo—’should’, , ‘every’ y<br />
Wav Surfer:<br />
www.wavsurfer.com<br />
VoiceBox:<br />
www.ee.ic.ac.uk/hp/staff/dmb/voic<br />
ebox/voicebox.htm<br />
COLEA UI:<br />
www.utdallas.edu/~loizou/speech/<br />
colea colea.htm htm<br />
HMM Toolkit:<br />
25<br />
www.ai.mit.edu/~murphyk/S<strong>of</strong>twa<br />
re/HMM/hmm.html#hmm
<strong>Speech</strong> Sentence Waveform<br />
26
<strong>Speech</strong> Wideband Spectrogram<br />
two plus seven is less than ten<br />
27
Parametrization <strong>of</strong> Spectra<br />
• human vocal tract is essentially a tube <strong>of</strong> varying cross sectional<br />
area, or can be approximated as a concatentation <strong>of</strong> tubes <strong>of</strong> varying<br />
cross sectional areas<br />
• acoustic theory shows that the transfer function <strong>of</strong> energy from the<br />
excitation source to the output can be described in terms <strong>of</strong> the natural<br />
frequencies or resonances <strong>of</strong> the tube<br />
• resonances known as formants or formant frequencies for speech<br />
and d th they represent t the th ffrequencies i th that t pass the th most t acoustic ti energy<br />
from the source to the output<br />
• typically there are 3 significant formants below about 3500 Hz<br />
• formants are a highly efficient, efficient compact representation <strong>of</strong> speech<br />
28
Spectrogram Spectrogram and Formants<br />
Key Issue Issue: Issue Issue: :<br />
reliability in<br />
estimating<br />
formants from<br />
spectral data<br />
29
<strong>Acoustic</strong> <strong>Acoustic</strong> <strong>Theory</strong> Summary<br />
• basic speech processes — from ideas to<br />
speech (production) (production), from speech to ideas<br />
(perception)<br />
• basic vocal production mechanisms — vocal<br />
tract, nasal tract, velum<br />
• source <strong>of</strong> sound flow at the glottis; output <strong>of</strong><br />
sound flow at the lips and nose<br />
• speech waveforms and properties — voiced,<br />
unvoiced unvoiced, silence silence, pitch<br />
• speech spectrograms and properties —<br />
wideband spectrograms, narrowband<br />
spectrograms, formants<br />
30
English <strong>Speech</strong> <strong>Speech</strong> Sounds<br />
Sounds<br />
ARPABET representation<br />
• 48 sounds<br />
• 18 vowels/diphthongs<br />
• 4vowel 4vowel-like 4 vowel like like consonants<br />
consonants<br />
• 21 standard consonants<br />
• 4 syllabic sounds<br />
• 1 glottal stop<br />
31
Phonemes<br />
Phonemes—Link Link Between<br />
O Orthography h h and d <strong>Speech</strong> S h<br />
Orthography sequence <strong>of</strong> sounds<br />
• Larry /l/ /ae/ /r/ /iy/ (/L/ /AE/ /R/ /IY/)<br />
S <strong>Speech</strong> h Waveform W f sequence <strong>of</strong> f sounds d<br />
• based on acoustic properties (temporal) <strong>of</strong> phonemes<br />
Spectrogram sequence <strong>of</strong> sounds<br />
• based on acoustic properties (spectral) <strong>of</strong> phonemes<br />
The bottom line is that we use a phonetic code as an intermediate<br />
representation <strong>of</strong> language, from either orthography or from<br />
waveforms a eforms or spectrograms spectrograms; no now we e ha have e to learn ho how to<br />
recognize sounds within speech utterances<br />
32
Phonetic Transcriptions<br />
Transcriptions<br />
• based on ideal (dictionary-based) ( y )p pronunciations <strong>of</strong><br />
all words in sentence<br />
– ‘My name is Larry’-/M/ /AY/-/N/ /EY/ /M/-/IH/ /Z/-/L/ /AE/<br />
/R/ /IY/<br />
– ‘How old are you’-/H/ /AW/-/OW/ /L/ /D/-/AA/ /R/-/Y/ /UW/<br />
– ‘<strong>Speech</strong> processing is fun’-/S/ /P/ /IY/ /CH/-/P/ /R/ /AH/<br />
/S/ /EH/ /S/ /IH/ /NG/ /NG/-/IH/ /IH/ /Z/-/F/ /Z/ /F/ /AH/ /N/<br />
• word ambiguity abounds<br />
– ‘lives’-/L/ /IH/ /V/ /Z/ (he lives here) versus /L/ /AY/ /V/ /Z/<br />
(a cat has nine lives)<br />
– ‘record’-/R/ /EH/ /K/ /ER/ /D/ (he holds the world record)<br />
versus /R/ /IY/ /K/ /AW/ /D/ (please record my favorite<br />
show tonight)<br />
33
Phoneme Phoneme Classification Chart<br />
Chart<br />
EY<br />
Vocal Cords<br />
Vib Vibrating ti<br />
Noise-Like<br />
Excitation<br />
34
Reduced Reduced Set <strong>of</strong> English Sounds<br />
• 39 sounds<br />
– 11 vowels (front, mid, back) classification based on<br />
tongue hump position<br />
– 4 diphthongs (vowel-like (vowel like combinations)<br />
– 4 semi-vowels (liquids and glides)<br />
– 3 nasal consonants<br />
– 6 voiced and unvoiced stop consonants<br />
– 8 voiced and unvoiced fricative consonants<br />
– 2 affricate consonants<br />
– 1 whispered sound<br />
• look at each class <strong>of</strong> sounds to characterize<br />
their acoustic and spectral properties<br />
35
Vowels<br />
• longest duration sounds – least context sensitive<br />
• can be held indefinitely in singing and other musical<br />
works ( (opera) )<br />
• carry very little linguistic information (some languages<br />
don’t display p y vowels in text-Hebrew, , Arabic) )<br />
Text 1: all vowels deleted<br />
Th Th_y n_t_d t d s_gn_f_c_nt f t _mpr_v_m_nts t _n th_ th c_mp_ny’s ’<br />
_m_g_, s_p_rv_s__n _nd m_n_g_m_nt.<br />
Text 2: all consonants deleted<br />
A__i_u_e_ _o_a__ _a_ __a_e_ e__e__ia___ __e _a_e,<br />
_i__ i __e e __i_e_ i e oo_ oo__u_a_io_a_ u a io a ee___o_ee_ o ee __i_____ i<br />
_e__ea_i__.<br />
36
Vowels and Consonants<br />
Text 1: all vowels deleted<br />
Th_y n_t_d s_gn_f_c_nt _mpr_v_m_nts _n th_ c_mp_ny’s<br />
_m_g_, m g s_p_rv_s__n s p rv s n _nd nd m_n_g_m_nt.<br />
m n g m nt<br />
(They noted significant improvements in the company’s<br />
image, supervision and management.)<br />
Text 2: all consonants deleted<br />
A__i_u_e_ _o_a__ _a_ __a_e_ e__e__ia___ __e _a_e,<br />
_i__ __e __i_e_ o_ o__u_a_io_a_ e___o_ee_ __i_____<br />
_e__ea_i__. e ea i<br />
(Attitudes toward pay stayed essentially the same, with the<br />
scores <strong>of</strong> occupational employees slightly decreasing)<br />
37
Vowels<br />
• produced p using g fixed vocal tract shape p<br />
• sustained sounds<br />
• vocal cords are vibrating g => voiced sounds<br />
• cross-sectional area <strong>of</strong> vocal tract determines<br />
vowel resonance frequencies and vowel sound<br />
quality lit<br />
• tongue position (height, forward/back position)<br />
most important in determining vowel sound<br />
• usually relatively long in duration (can be held<br />
during g singing) g g) and are spectrally p y<br />
well formed<br />
38
Vowel <strong>Production</strong><br />
<strong>Production</strong><br />
39
Vowel Waveforms &<br />
Spectrograms<br />
Synthetic versions <strong>of</strong> the<br />
10 vowels<br />
40
Vowel Formants<br />
Formants<br />
Clear pattern <strong>of</strong> variability<br />
<strong>of</strong> vowel pronunciation<br />
among men, women and<br />
children hild<br />
Strong overlap for different<br />
vowel sounds by different<br />
ttalkers lk => no unique i<br />
identification <strong>of</strong> vowel<br />
strictly from resonances<br />
=> need context to define<br />
vowel sound<br />
41
Canonic Vowel Spectra<br />
10 Hz 33 Hz<br />
100 Hz<br />
IY IY<br />
AA<br />
AA<br />
UW UW<br />
100 Hz Fundamental<br />
42
Canonic Vowel Spectra<br />
IY IY<br />
AA<br />
UW<br />
100 100 Hz H F Fundamental d t l 300 Hz<br />
300 Hz Fundamental<br />
AA<br />
UW<br />
43
Eliminating Vowels and Consonants<br />
44
• Gliding speech sound<br />
that starts at or near the<br />
articulatory position for<br />
one vowel and moves to<br />
or toward the position p for<br />
another vowel<br />
– /AY/ in buy<br />
– /AW/ in down<br />
– /EY/ in bait<br />
– /OY/ in boy<br />
Diphthongs<br />
45
Distinctive Distinctive Features<br />
Features<br />
Classify non-vowel/non-diphthong sounds in terms <strong>of</strong> distinctive<br />
features<br />
– place <strong>of</strong> articulation<br />
• Bilabial (lips)—p,b,m,w<br />
• Labiodental (between lips and front <strong>of</strong> teeth)-f,v<br />
• Dental (teeth)-th,dh<br />
• Alveolar (front <strong>of</strong> palate)-t,d,s,z,n,l<br />
• Palatal (middle <strong>of</strong> palate)-sh,zh,r<br />
• Velar (at velum)-k,g,ng<br />
• Pharyngeal (at end <strong>of</strong> pharynx)-h<br />
– manner <strong>of</strong> articulation<br />
• Glide—smooth motion-w,l,r,y<br />
• Nasal—lowered velum-m,n,ng<br />
• Stop—constricted vocal tract-p,t,k,b,d,g<br />
• Fricative—turbulent source-f,th,s,sh,v,dh,z,zh,h<br />
• Voicing—voiced source-b,d,g,v,dh,z,zh,m,n,ng,w,l,r<br />
• Mixed source—both voicing and unvoiced-j,ch<br />
• Whispered--h<br />
46
Places <strong>of</strong> Articulation<br />
47
Semivowels Semivowels (Liquids and and Glides)<br />
Glides)<br />
• vowel-like vowel like in nature (called semivowels for<br />
this reason)<br />
• voiced sounds (w (w-l-r-y) l r y)<br />
• acoustic characteristics <strong>of</strong> these sounds<br />
are strongly t l influenced i fl d bby context—unlike<br />
t t lik<br />
most vowel sounds which are much less<br />
iinfluenced fl d bby context t t<br />
uh-{w,l,r,y}-a<br />
Manner: glides<br />
Place: bilabial (w), alveolar (l),<br />
palatal (r)<br />
48
Nasal Consonants<br />
• The nasal consonants consist <strong>of</strong> /M/, /N/, and /NG/<br />
– nasals produced using glottal excitation => voiced sounds<br />
– vocal tract totally constricted at some point along the tract<br />
– velum lowered so sound is radiated at nostrils<br />
– constricted oral cavity serves as a resonant cavity that traps<br />
acoustic energy at certain natural frequencies (anti-resonances<br />
or zeros <strong>of</strong> transmission)<br />
– /M/ is produced with a constriction at the lips => low frequency<br />
zero<br />
– /N/ is produced with a constriction just behind the teeth => higher<br />
frequency zero<br />
– /NG/ is produced with a constriction just forward <strong>of</strong> the velum =><br />
even higher frequency zero<br />
uh-{m,n,ng}-a<br />
Manner: nasal<br />
Place: bilabial (m), alveolar (n), velar<br />
(ng)<br />
49
Nasal <strong>Production</strong><br />
50
Nasal Sounds<br />
Hole in<br />
spectrum<br />
UH M AA UH N AA<br />
51
Unvoiced Fricatives<br />
• Consonant sounds /F/, /TH/, /S/, /SH/<br />
– produced by exciting vocal tract by steady air flow<br />
which becomes turbulent in region <strong>of</strong> a constriction in<br />
the vocal tract<br />
• /F/ constriction near the lips<br />
• /TH/ constriction near the teeth<br />
• /S/ constriction near the middle <strong>of</strong> the vocal tract<br />
• /SH/ constriction near the back <strong>of</strong> the vocal tract<br />
– noise source at constriction => vocal tract is<br />
separated into two cavities<br />
– sound radiated from lips – front cavity<br />
– back cavity traps energy and produces antiresonances<br />
(zeros <strong>of</strong> transmission)<br />
uh-{f,th,s,sh}-a<br />
Manner: fricative<br />
Place: labiodental (f), dental (th),<br />
alveolar (s), palatal (sh)<br />
52
Unvoiced Fricative <strong>Production</strong><br />
53
Unvoiced Fricatives<br />
UH F AA UH S AA UH SH AA<br />
54
Voiced Fricatives<br />
Fricatives<br />
• Sounds /V/,/DH/, /Z/, /ZH/<br />
– place <strong>of</strong> constriction same as for unvoiced<br />
counterparts<br />
– two sources <strong>of</strong> excitation; vocal cords<br />
vibrating producing semi-periodic puffs <strong>of</strong> air<br />
to excite the tract; the resulting air flow<br />
becomes turbulent at the constriction giving a<br />
noise noise-like like component in addition to the<br />
voiced-like component<br />
uh-{v,dh,z,zh}-a<br />
Manner: fricative<br />
Place: labiodental (v), dental (dh),<br />
alveolar (z), palatal (zh)<br />
55
Voiced Fricatives<br />
UH V AA UH ZH AA<br />
56
Voiced and Unvoiced Stop<br />
Consonants<br />
• sounds-/B/, /D/, /G/ (voiced stop consonants) and /P/, /T/<br />
/K/ (unvoiced ( i d stop t consonants) t )<br />
– voiced stops are transient sounds produced by building up<br />
pressure behind a total constriction in the oral tract and then<br />
suddenly releasing the pressure, pressure resulting in a pop-like pop like sound<br />
• /B/ constriction at lips<br />
Manner: stop<br />
• /D/ constriction at back <strong>of</strong> teeth<br />
• /G/ constriction at velum<br />
uh-{b,d,g}-a<br />
– no sound diis radiated dit dffrom th the li lips dduring i constriction ti ti =><br />
sometimes sound is radiated from the throat during constriction<br />
(leakage through tract walls) allowing vocal cords to vibrate in<br />
spite <strong>of</strong> total constriction<br />
– stop sounds strongly influenced by surrounding sounds<br />
– unvoiced stops have no vocal cord vibration during period <strong>of</strong><br />
closure => brief period <strong>of</strong> frication (due to sudden turbulence <strong>of</strong><br />
escaping air) and aspiration (steady air flow from the glottis)<br />
before voiced excitation begins<br />
Place: bilabial (b,p), alveolar<br />
(d,t), velar (g, k)<br />
57
Stop Consonant Consonant <strong>Production</strong><br />
<strong>Production</strong><br />
58
Voiced Stop Consonant<br />
UH B AA<br />
59
Unvoiced Stop Consonants<br />
Stop p Gap p<br />
UH P AA UH T AA<br />
uh-{p,t,k}-a<br />
uh-{j,ch,h}-a<br />
60
Distinctive Distinctive Phoneme Phoneme Features<br />
Features<br />
• the brain recognizes sounds by doing a distinctive feature<br />
analysis from the information going to the brain<br />
• the distinctive features are somewhat insensitive to noise,<br />
background, reverberation => they are robust and reliable<br />
61
IY-beat<br />
IH IH-bit bit<br />
EH-bet<br />
AE-bat<br />
AA-bob<br />
ER-bird<br />
AH-but AH but<br />
AO-bought<br />
UW-boot<br />
UH UH-book book<br />
OW-boat<br />
AW-down<br />
AY-buy<br />
OY-boy<br />
EY-bait EY bait<br />
Review Exercises<br />
Exercises<br />
Write the transcription <strong>of</strong> the<br />
sentence “Good friends are hard to<br />
find”<br />
G-UH-D F-R-EH-N-D-Z AA-R HH-<br />
AA-R-D T-UH (UW) F-AY-N-D<br />
62
samples s <strong>of</strong>fset<br />
0<br />
2000<br />
4000<br />
6000<br />
8000<br />
Review Exercises<br />
file: enjoy 1 0k, sampling rate: 10000, starting sample: 1 number <strong>of</strong> samples 8079<br />
EH N<br />
N JH OY<br />
OY<br />
OY<br />
0 200 400 600 800 1000<br />
sample number<br />
1200 1400 1600 1800 2000<br />
Enjoy:<br />
EH-N-JH-OY<br />
63
samples<br />
<strong>of</strong>fset<br />
0<br />
2000<br />
4000<br />
6000<br />
Review Exercises<br />
Exercises<br />
file: simple 1 0k, sampling rate: 10000, starting sample: 1 number <strong>of</strong> samples 7152<br />
S IH M<br />
P AX AX-L L | EL<br />
0 200 400 600 800 1000<br />
sample l number<br />
b<br />
1200 1400 1600 1800 2000<br />
S<br />
Simple:<br />
S-IH-M-P-<br />
(AX-L | EL)<br />
64
TH-IH S IH Z UH T EH S T<br />
This is a test (16 kHz sampling rate)<br />
65
Word Choices:<br />
that, and, was, by,<br />
people, little, simple,<br />
between, very, enjoy,<br />
only only, other other, company company,<br />
those<br />
Ultimate Exercise Exercise—Identify Identify Words From Spectrogram<br />
Top Row: simple; l enjoy; those<br />
Second Row: was; between; company<br />
66
Ultimate Exercise Exercise—Identify Identify Words From Spectrogram<br />
Word Choices:<br />
that, and, was, by,<br />
people people, little little, simple simple,<br />
between, very, enjoy,<br />
only, other, company,<br />
those<br />
/was/ -- this word can be<br />
identified by the voiced<br />
initial portion with very<br />
low first and second<br />
formants (sounds like UW<br />
or W), followed by the AA<br />
sound and ending with the<br />
Z (S) sound.<br />
67
Ultimate Exercise Exercise—Identify Identify Words From Spectrogram<br />
Word Choices:<br />
that, and, was, by,<br />
people, little, simple,<br />
between between, very very, enjoy enjoy,<br />
only, other, company,<br />
those<br />
/enjoy/ – this word can be<br />
identified by the two two-syllable syllable<br />
nature nature, with the nasal sound<br />
N at the end <strong>of</strong> the first<br />
syllable syllable, and the fricative JH<br />
at the beginning <strong>of</strong> the second<br />
syllable, with the characteristic<br />
OY diphthong at the end <strong>of</strong><br />
th the word<br />
d<br />
68
Ultimate Exercise Exercise—Identify Identify Words From Spectrogram<br />
Word Choices:<br />
that, , and, , was, , by, y,<br />
people, little, simple,<br />
between, very, enjoy,<br />
only, other, company,<br />
those<br />
/company/ – this word can be<br />
identified by the three syllable<br />
nature nature, with the initial stop<br />
consonant K, the first syllable<br />
ending in the nasal MM, followed by<br />
the stop P, and with the second<br />
syllable ending with the nasal N<br />
followed by an IY vowel-like<br />
sound<br />
69
Ultimate Exercise Exercise—Identify Identify Words From Spectrogram<br />
Word Choices:<br />
that, and, was, by,<br />
people, little, simple,<br />
between, very, enjoy,<br />
only, other, company,<br />
those<br />
/ /simple/ i l / – thi this word d can bbe<br />
identified by the two two-syllable syllable<br />
nature nature, with a strong initial<br />
fricative S beginning the first<br />
syllable ll bl and d th the nasal l M<br />
ending the first syllable, and<br />
with the stop consonant P<br />
beginning the second syllable<br />
70
Spectrograms <strong>of</strong> Cities<br />
A<br />
Los Angeles<br />
B<br />
Chicago<br />
C<br />
Denver<br />
D<br />
Seattle<br />
E<br />
Memphis p<br />
City Choices: Chicago, Denver, Los Angeles, Memphis, Seattle
Summary<br />
• sounds <strong>of</strong> the English language—phonemes,<br />
language phonemes,<br />
syllables, words<br />
• phonetic p transcriptions p <strong>of</strong> words and<br />
sentences — coarticulation across word<br />
boundaries<br />
• vowels and consonents — their roles,<br />
articulatory shapes, waveforms, spectrograms,<br />
formants<br />
72
Sound Sound <strong>Perception</strong><br />
<strong>Perception</strong><br />
73
Some Facts About Human<br />
H Hearing i<br />
• the range <strong>of</strong> human hearing is incredible<br />
– threshold <strong>of</strong> hearing — thermal limit <strong>of</strong> Brownian motion <strong>of</strong> air<br />
particles in the inner ear<br />
– threshold <strong>of</strong> pain — intensities <strong>of</strong> from 10**12 to 10**16 greater<br />
than the threshold <strong>of</strong> hearing<br />
• human hearing perceives both sound frequency and<br />
sound direction<br />
– can detect weak spectral components in strong broadband noise<br />
• masking is the phenomenon whereby one loud sound<br />
makes another s<strong>of</strong>ter sound inaudible<br />
– masking is most effective for frequencies around the masker<br />
frequency<br />
– masking is used to hide quantizer noise by methods <strong>of</strong> spectral<br />
shaping (similar grossly to Dolby noise reduction methods)<br />
74
Anechoic Chamber (no Echos)<br />
75
Sound Pressure Levels (dB)<br />
SPL (dB)—Sound Source SPL (dB)—Sound Source<br />
160 Jet Engine — close up 70 Busy Street; Noisy<br />
150 Firecracker; Artillery Fire<br />
Restaurant<br />
140<br />
130<br />
120<br />
110<br />
Rock Singer Screaming into<br />
Microphone: Jet Take<strong>of</strong>f<br />
Threshold <strong>of</strong> Pain; .22<br />
Caliber Rifle<br />
Planes on Airport Runway;<br />
Rock Concert; Thunder<br />
Power Tools; Shouting in Ear<br />
60<br />
50<br />
40<br />
30<br />
Conversational <strong>Speech</strong> — 1<br />
foot<br />
Average Office Noise; Light<br />
Traffic; Rainfall<br />
Quiet Conversation;<br />
RRefrigerator; f i t Lib Library<br />
Quiet Office; Whisper<br />
100 SSubway b TTrains; i GGarbage b<br />
Truck<br />
90 Heavy Truck Traffic; Lawn<br />
Mower<br />
80 Home Stereo — 1 foot; Blow<br />
Dryer<br />
20 Quiet Living Room; Rustling<br />
Leaves<br />
10 Quiet Recording Studio;<br />
Breathing<br />
0 Threshold <strong>of</strong> Hearing<br />
76
<strong>Speech</strong> Communication<br />
77
Overview <strong>of</strong> Auditory Mechanism<br />
Model <strong>of</strong> signal processing for perception and<br />
understanding <strong>of</strong> speech<br />
78
The The Human Human Ear<br />
Ear<br />
OOuter t ear: pinna i and d external t l canall<br />
Middle ear: tympanic membrane or<br />
eardrum<br />
Inner ear: cochlea, neural connections<br />
79
Human Ear<br />
• Outer ear: funnels sound into ear canal<br />
• Middle ear: sound impinges on tympanic<br />
membrane; this causes motion<br />
– middle ear is a mechanical transducer, consisting <strong>of</strong> the<br />
hammer, anvil and stirrup; it converts acoustical sound<br />
wave to mechanical vibrations along the inner ear<br />
• Inner ear: the cochlea is a fluid-filled chamber<br />
partitioned by the basilar membrane<br />
– the auditory nerve is connected to the basilar membrane<br />
via inner hair cells<br />
– mechanical vibrations at the entrance to the cochlea<br />
create standing waves (<strong>of</strong> fluid inside the cochlea)<br />
causing basilar membrane to vibrate at frequencies<br />
commensurate with the input acoustic wave frequencies<br />
(formants) and at a place along the basilar membrane<br />
that is associated with these frequencies<br />
80
The The Outer Outer Ear<br />
Ear<br />
81
The Middle Middle Ear<br />
Ear<br />
The Hammer (Malleus),<br />
Anvil (Incus) and Stirrup<br />
(Stapes) are the three<br />
tiniest bones in the body.<br />
Together they form the<br />
coupling between the<br />
vibration <strong>of</strong> the eardrum<br />
and the forces exerted on<br />
the oval window <strong>of</strong> the<br />
inner ear ear.<br />
These bones can be<br />
thought <strong>of</strong> as a compound<br />
lever which achieves a<br />
multiplication <strong>of</strong> force—by<br />
a factor <strong>of</strong> about three<br />
under optimum conditions.<br />
(Th (They also l protect t t the th ear<br />
against loud sounds by<br />
82<br />
attenuating the sound.)
Tympanic y p<br />
Membrane<br />
The Cochlea<br />
Malleus Ossicles<br />
Incus<br />
Stapes<br />
(Middle Ear Bones)<br />
AAuditory dit nnerves<br />
Oval Window<br />
Cochlea<br />
RRound d Wi Window d<br />
Vestibule<br />
83
Stretched Cochlea & Basilar Membrane<br />
Scala<br />
Vestibuli<br />
Basilar<br />
Membrane<br />
1600 Hz<br />
800 Hz<br />
400 Hz<br />
200 Hz<br />
Cochlear Base 100 Hz<br />
(high<br />
frequency) Unrolled<br />
Cochlea<br />
50 Hz<br />
Relative<br />
amplitude<br />
25 Hz<br />
0 10 20 30<br />
Distance from Stapes (mm)<br />
Cochlear Apex<br />
(low ( ffrequency) q y)<br />
84
Basilar Membrane Motion<br />
85
Audience Noise Noise Suppression<br />
Suppression<br />
86
Basilar Membrane Membrane Mechanics<br />
Mechanics<br />
• characterized by a set <strong>of</strong> frequency responses at different points<br />
along the membrane<br />
• mechanical realization <strong>of</strong> a bank <strong>of</strong> filters<br />
• filters are roughly constant Q (center frequency/bandwidth) with<br />
logarithmically g y decreasing g bandwidth<br />
• distributed along the Basilar Membrane is a set <strong>of</strong> sensors called<br />
Inner Hair Cells (IHC) which act as mechanical motion-to-neural<br />
activity converters<br />
• mechanical motion along the BM is sensed by local IHC causing<br />
firing activity at nerve fibers that innervate bottom <strong>of</strong> each IHC<br />
• each IHC connected to about 10 nerve fibers, each <strong>of</strong> different<br />
diameter => thin fibers fire at high motion levels, thick fibers fire at<br />
llower motion ti llevels l<br />
• 30,000 nerve fibers link IHC to auditory nerve<br />
• electrical pulses run along auditory nerve, ultimately reach higher<br />
levels <strong>of</strong> auditory processing in brain brain, perceived as sound<br />
87
Hearing Thresholds<br />
Thresholds<br />
• Threshold <strong>of</strong> Audibilityy is the acoustic intensity y<br />
level <strong>of</strong> a pure tone that can barely be heard at a<br />
particular frequency<br />
– th threshold h ld <strong>of</strong> f audibility dibilit ≈ 0 dB at t 1000 HHz<br />
– threshold <strong>of</strong> feeling ≈ 120 dB<br />
– threshold <strong>of</strong> pain ≈ 140 dB<br />
– immediate damage ≈ 160 dB<br />
• Thresholds vary with frequency and from<br />
person-to-person<br />
• Maximum sensitivity is at about 3000 Hz<br />
88
Sound Prressure<br />
Level L<br />
Range Range <strong>of</strong> Human Human Hearing<br />
Hearing<br />
140<br />
120<br />
100<br />
80<br />
60<br />
40<br />
20<br />
0<br />
Music<br />
Threshold in Quiet<br />
Threshold <strong>of</strong> Pain<br />
Contour <strong>of</strong> Damage Risk<br />
140<br />
120<br />
100<br />
80<br />
<strong>Speech</strong> p<br />
60<br />
00.02 02 0.05 0 05 0.1 0 1 0.2 0 2 0.5 0 5 1 2 5 10 20<br />
Frequency (kHz)<br />
40<br />
20<br />
0<br />
Sound Inntensity<br />
Level L<br />
89
Piitch<br />
in mels m<br />
Pitch Pitch-The Pitch The The Mel Mel Scale<br />
Scale<br />
3000<br />
2000<br />
1000<br />
0<br />
20 40 100 200 400 1k 2k 4k 10k<br />
Frequency in Hz<br />
Pitch ( mels ) =<br />
3233log10(<br />
1+<br />
f<br />
/ 1000)<br />
90
Sound pressure p e level (ddB)<br />
Auditory Auditory Masking<br />
Masking<br />
70<br />
50<br />
30<br />
10<br />
0<br />
0.02<br />
Tone masker @ 1kHz<br />
threshold in<br />
quiet q<br />
Inaudible range<br />
0.079<br />
Signal perceptible even in the<br />
presence p<br />
<strong>of</strong> the tone masker<br />
threshold when<br />
masker is<br />
present<br />
0.313 1.25 2.5 5 10<br />
F Frequency (KHz) (KH )<br />
20<br />
Signal not perceptible due to<br />
the presence <strong>of</strong> the tone<br />
masker<br />
91
Exploiting Exploiting Masking Masking in in Coding<br />
Coding<br />
Level L (dB) )<br />
Power Spectrum<br />
Predicted Masking Threshold<br />
Bit Assignment (Equivalent SNR)<br />
Frequency (Hz)<br />
92
<strong>Speech</strong> <strong>Perception</strong> Issues<br />
Issues<br />
• Ear performs p spectral p analysis y along g basilar membrane<br />
– both place (along basilar membrane) and frequency are<br />
significant<br />
– frequency scale is pseudo-logarithmic pseudo logarithmic (mel or Bark scale)<br />
– amplitude is logarithmic<br />
– maximal frequency sensitivity between 1 and 3 kHz<br />
• Tonal and noise masking hide tones and noise in bands<br />
close to masker<br />
– critical bandwidths as a function <strong>of</strong> frequency q y<br />
– masking restricted to critical band region<br />
• Neural processing is still not well understood<br />
– evidence for distinctive features (place and manner features <strong>of</strong><br />
sounds) 93