
Digital Speech Processing: Lecture 20

The Hidden Markov Model (HMM)


Lecture Outline

• Theory of Markov Models
  – discrete Markov processes
  – hidden Markov processes
• Solutions to the Three Basic Problems of HMM's
  – computation of observation probability
  – determination of optimal state sequence
  – optimal training of model
• Variations of elements of the HMM
  – model types
  – densities
• Implementation Issues
  – scaling
  – multiple observation sequences
  – initial parameter estimates
  – insufficient training data
• Implementation of Isolated Word Recognizer Using HMM's


Stochastic Signal Modeling

• Reasons for Interest:
  – basis for a theoretical description of signal processing algorithms
  – can learn about signal source properties
  – models work well in practice in real-world applications
• Types of Signal Models:
  – deterministic, parametric models
  – stochastic models


Discrete Markov Processes

System of N distinct states, {S_1, S_2, ..., S_N}.

Markov Property:

Time (t):  1    2    3    4    5   ...
State:    q_1  q_2  q_3  q_4  q_5  ...

P[q_t = S_i | q_{t-1} = S_j, q_{t-2} = S_k, ...] = P[q_t = S_i | q_{t-1} = S_j]


Properties of State Transition Coefficients

Consider processes where the state transitions are time independent, i.e.,

a_ji = P[q_t = S_i | q_{t-1} = S_j],  1 ≤ i, j ≤ N

with

a_ji ≥ 0,  ∀ j, i

Σ_{i=1}^{N} a_ji = 1,  ∀ j


Example of Discrete Markov Process

Once each day (e.g., at noon), the weather is observed and classified as being one of the following:
– State 1: Rain (or Snow; e.g., precipitation)
– State 2: Cloudy
– State 3: Sunny

with state transition probabilities:

A = {a_ij} = ⎡0.4  0.3  0.3⎤
             ⎢0.2  0.6  0.2⎥
             ⎣0.1  0.1  0.8⎦


Discrete Markov Process

Problem: Given that the weather on day 1 is sunny, what is the probability (according to the model) that the weather for the next 7 days will be "sunny-sunny-rain-rain-sunny-cloudy-sunny"?

Solution: We define the observation sequence, O, as:

O = {S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3}

and we want to calculate P(O | Model). That is:

P(O | Model) = P[S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | Model]


Discrete Markov Process

P(O | Model) = P[S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | Model]
             = P[S_3] P[S_3|S_3] P[S_3|S_3] P[S_1|S_3] P[S_1|S_1] P[S_3|S_1] P[S_2|S_3] P[S_3|S_2]
             = π_3 (a_33)^2 a_31 a_11 a_13 a_32 a_23
             = (1.0)(0.8)^2 (0.1)(0.4)(0.3)(0.1)(0.2)
             = 1.536 × 10^-4

where π_i = P[q_1 = S_i], 1 ≤ i ≤ N.
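As a quick numerical check of the computation above, here is a minimal Python sketch (the function and variable names are illustrative, not from the lecture) that scores a state sequence under a first-order Markov chain:

```python
import numpy as np

# Weather model: 0 = rain, 1 = cloudy, 2 = sunny (0-based indices).
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def chain_probability(states, A, pi):
    """P(q_1, ..., q_T) = pi[q_1] * prod_t A[q_{t-1}, q_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

pi = np.array([0.0, 0.0, 1.0])      # day 1 is known to be sunny
O = [2, 2, 2, 0, 0, 2, 1, 2]        # sunny sunny sunny rain rain sunny cloudy sunny
print(chain_probability(O, A, pi))  # 1.536e-04
```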


Discrete Markov Process

Problem: Given that the model is in a known state, what is the probability it stays in that state for exactly d days?

Solution:

O = {S_i, S_i, S_i, ..., S_i, S_j ≠ S_i}
     1    2    3        d    d+1

P(O | Model, q_1 = S_i) = (a_ii)^{d-1} (1 - a_ii) = p_i(d)

d̄_i = Σ_{d=1}^{∞} d p_i(d) = 1 / (1 - a_ii)
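The geometric duration density and its closed-form mean are easy to verify numerically (a sketch; the value a_ii = 0.8 is an arbitrary illustration):

```python
# p_i(d) = a_ii**(d - 1) * (1 - a_ii): geometric state-duration density.
a_ii = 0.8
mean_d = sum(d * a_ii**(d - 1) * (1 - a_ii) for d in range(1, 10_000))
print(mean_d)          # ~5.0 (truncated sum)
print(1 / (1 - a_ii))  # 5.0, the closed-form expected duration
```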


Exercise

Given a single fair coin, i.e., P(H = Heads) = P(T = Tails) = 0.5, which you toss once and observe Tails:

a) what is the probability that the next 10 tosses will provide the sequence {H H T H T T H T T H}?

SOLUTION:

For a fair coin, with independent coin tosses, the probability of any specific observation sequence of length 10 (10 tosses) is (1/2)^10, since there are 2^10 such sequences and all are equally probable. Thus:

P(H H T H T T H T T H) = (1/2)^10


Exercise

b) what is the probability that the next 10 tosses will produce the sequence {H H H H H H H H H H}?

SOLUTION:

Similarly:

P(H H H H H H H H H H) = (1/2)^10

Thus a specified run of length 10 is equally as likely as a specified run of interlaced H's and T's.


Exercise

c) what is the probability that 5 of the next 10 tosses will be tails? What is the expected number of tails over the next 10 tosses?

SOLUTION:

The probability of 5 tails in the next 10 tosses is the number of observation sequences with 5 tails and 5 heads (in any order) times the probability of each such sequence:

P(5H, 5T) = C(10,5) (1/2)^10 = 252/1024 ≈ 0.25

since there are C(10,5) combinations (ways of getting 5H and 5T) for 10 coin tosses, and each sequence has probability (1/2)^10. The expected number of tails in 10 tosses is:

E(number of T in 10 coin tosses) = Σ_{d=0}^{10} d C(10,d) (1/2)^10 = 5

Thus, on average, there will be 5H and 5T in 10 tosses, but the probability of exactly 5H and 5T is only about 0.25.
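Both quantities can be confirmed in a few lines with the Python standard library:

```python
from math import comb

print(comb(10, 5) * 0.5**10)                               # 0.24609375 ~ 0.25
print(sum(d * comb(10, d) * 0.5**10 for d in range(11)))   # 5.0
```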


Coin Toss Models

A series of coin tossing experiments is performed. The number of coins is unknown; only the results of each coin toss are revealed. Thus a typical observation sequence is:

O = O_1 O_2 O_3 ... O_T = H H T T T H T T H ... H

Problem: Build an HMM to explain the observation sequence.

Issues:
1. What are the states in the model?
2. How many states should be used?
3. What are the state transition probabilities?


Coin Toss Models

(two figure slides)


Coin Toss Models

Problem: Consider an HMM representation (model λ) of a coin tossing experiment. Assume a 3-state model (corresponding to 3 different coins) with probabilities:

        State 1   State 2   State 3
P(H)     0.5       0.75      0.25
P(T)     0.5       0.25      0.75

and with all state transition probabilities equal to 1/3. (Assume initial state probabilities of 1/3.)

a) You observe the sequence O = H H H H T H T T T T. What state sequence is most likely? What is the probability of the observation sequence and this most likely state sequence?


Coin Toss Problem Solution

SOLUTION:

Given O = H H H H T H T T T T, the most likely state sequence is the one for which the probability of each individual observation is maximum. Thus for each H the most likely state is S_2, and for each T the most likely state is S_3. The most likely state sequence is therefore:

S = S_2 S_2 S_2 S_2 S_3 S_2 S_3 S_3 S_3 S_3

The probability of O and S (given the model) is:

P(O, S | λ) = (0.75)^10 (1/3)^10


Coin Toss Models

b) What is the probability that the observation sequence came entirely from state 1?

SOLUTION:

The probability of O given a state sequence of the form

Ŝ = S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1

is:

P(O, Ŝ | λ) = (0.50)^10 (1/3)^10

The ratio of P(O, S | λ) to P(O, Ŝ | λ) is:

R = P(O, S | λ) / P(O, Ŝ | λ) = (3/2)^10 = 57.67


Coin Toss Models

c) Consider the observation sequence O̅ = H T T H T H H T T H. How would your answers to parts a and b change?

SOLUTION:

Since O̅ has the same number of H's and T's, the answers to parts a and b would remain the same: the most likely states occur the same number of times in both cases.


Coin Toss Models

d) If the state transition probabilities were of the form:

a_11 = 0.9,  a_21 = 0.45, a_31 = 0.45
a_12 = 0.05, a_22 = 0.1,  a_32 = 0.45
a_13 = 0.05, a_23 = 0.45, a_33 = 0.1

i.e., a new model λ′, how would your answers to parts a-c change? What does this suggest about the type of sequences generated by the models?


Coin Toss Problem Solution

SOLUTION:

The new probability of O and S becomes:

P(O, S | λ′) = (0.75)^10 (1/3) (0.1)^6 (0.45)^3

The new probability of O and Ŝ becomes:

P(O, Ŝ | λ′) = (0.50)^10 (1/3) (0.9)^9

The ratio is:

R = (3/2)^10 (1/9)^6 (1/2)^3 = 1.36 × 10^-5


Coin Toss Problem Solution

Now, for the part-c sequence O̅ with its most likely per-symbol state sequence S̅, the probability of O̅ and S̅ is not the same as the probability of O and S. We now have:

P(O̅, S̅ | λ′) = (0.75)^10 (1/3) (0.45)^6 (0.1)^3

P(O̅, Ŝ | λ′) = (0.50)^10 (1/3) (0.9)^9

with the ratio:

R = (3/2)^10 (1/2)^6 (1/9)^3 = 1.24 × 10^-3

Model λ, the initial model, clearly favors long runs of H's or T's, whereas model λ′, the new model, clearly favors random sequences of H's and T's. Thus even a run of H's or T's is more likely to occur in state 1 for model λ′, and a random sequence of H's and T's is more likely to occur in states 2 and 3 for model λ.


Balls in Urns Model

(figure slide)


Elements of an HMM

1. N, the number of states in the model
   • states S = {S_1, S_2, ..., S_N}; the state at time t is q_t ∈ S
2. M, the number of distinct observation symbols per state
   • observation symbols V = {v_1, v_2, ..., v_M}; the observation at time t is O_t ∈ V
3. The state transition probability distribution, A = {a_ij}:
   a_ij = P(q_{t+1} = S_j | q_t = S_i),  1 ≤ i, j ≤ N
4. The observation symbol probability distribution in state j, B = {b_j(k)}:
   b_j(k) = P[v_k at t | q_t = S_j],  1 ≤ j ≤ N, 1 ≤ k ≤ M
5. The initial state distribution, Π = {π_i}:
   π_i = P[q_1 = S_i],  1 ≤ i ≤ N


HMM Generator of Observations

1. Choose an initial state, q_1 = S_i, according to the initial state distribution, Π.
2. Set t = 1.
3. Choose O_t = v_k according to the symbol probability distribution in state S_i, namely b_i(k).
4. Transit to a new state, q_{t+1} = S_j, according to the state transition probability distribution for state S_i, namely a_ij.
5. Set t = t + 1; return to step 3 if t ≤ T; otherwise terminate the procedure.

Notation: λ = (A, B, Π) denotes the HMM.

t:            1    2    3    4    5    6   ...   T
state:       q_1  q_2  q_3  q_4  q_5  q_6  ...  q_T
observation: O_1  O_2  O_3  O_4  O_5  O_6  ...  O_T
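The generator above translates directly into a short sketch (assuming discrete emissions and numpy arrays: A is N×N, B is N×M, pi has length N; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(A, B, pi, T):
    """Sample a (state sequence, observation sequence) pair of length T
    from the HMM lambda = (A, B, pi), following steps 1-5 above."""
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)                   # step 1: q_1 ~ Pi
    for _ in range(T):                              # steps 2-5
        states.append(q)
        obs.append(rng.choice(B.shape[1], p=B[q]))  # step 3: O_t ~ b_q(.)
        q = rng.choice(A.shape[0], p=A[q])          # step 4: q_{t+1} ~ a_q.
    return states, obs
```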


Three Basic HMM Problems

Problem 1: Given the observation sequence O = O_1 O_2 ... O_T and a model λ = (A, B, Π), how do we (efficiently) compute P(O | λ), the probability of the observation sequence?

Problem 2: Given the observation sequence O = O_1 O_2 ... O_T, how do we choose a state sequence Q = q_1 q_2 ... q_T which is optimal in some meaningful sense?

Problem 3: How do we adjust the model parameters λ = (A, B, Π) to maximize P(O | λ)?

Interpretation:

Problem 1: the evaluation or scoring problem.
Problem 2: the learn-structure problem.
Problem 3: the training problem.


Solution to Problem 1: P(O | λ)

Consider a fixed state sequence (there are N^T such sequences):

Q = q_1 q_2 ... q_T

Then

P(O | Q, λ) = b_{q_1}(O_1) b_{q_2}(O_2) ... b_{q_T}(O_T)

and

P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}

so that

P(O, Q | λ) = P(O | Q, λ) P(Q | λ)

Finally,

P(O | λ) = Σ_{all Q} P(O, Q | λ)
         = Σ_{q_1, q_2, ..., q_T} π_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) ... a_{q_{T-1} q_T} b_{q_T}(O_T)

Calculations required ≈ 2T N^T; for N = 5, T = 100 this is 2 · 100 · 5^100 ≈ 10^72 computations!


The "Forward" Procedure

Consider the forward variable, α_t(i), defined as the probability of the partial observation sequence (up to time t) and state S_i at time t, given the model, i.e.,

α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ)

Inductively solve for α_t(i) as:

1. Initialization:
   α_1(i) = π_i b_i(O_1),  1 ≤ i ≤ N
2. Induction:
   α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(O_{t+1}),  1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination:
   P(O | λ) = Σ_{i=1}^{N} P(O_1 O_2 ... O_T, q_T = S_i | λ) = Σ_{i=1}^{N} α_T(i)

Computation: N^2 T versus 2T N^T; for N = 5, T = 100: about 2500 versus 10^72.
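The forward procedure is a direct transcription into code (a sketch assuming discrete emissions; A is N×N, B is N×M, pi has length N, and obs is a list of symbol indices):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: returns alpha (T x N) and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization
    for t in range(1, T):                          # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                  # termination: P(O | lambda)
```

For long T the unscaled α's underflow; the scaling discussed under Implementation Issues later in the lecture fixes this.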


The "Forward" Procedure

(figure slide)


The "Backward" Algorithm

Consider the backward variable, β_t(i), defined as the probability of the partial observation sequence from t+1 to the end, given state S_i at time t and the model, i.e.,

β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ)

Inductive solution:

1. Initialization:
   β_T(i) = 1,  1 ≤ i ≤ N
2. Induction:
   β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j),  t = T-1, T-2, ..., 1, 1 ≤ i ≤ N

This requires ≈ N^2 T calculations, the same as in the forward case.
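The backward pass is the mirror image of the forward sketch above (same illustrative array conventions):

```python
import numpy as np

def backward(A, B, obs):
    """Backward procedure: returns beta (T x N)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                         # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                 # induction: t = T-1, ..., 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```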


Solution to Problem 2: Optimal State Sequence

1. Choose the states q_t which are individually most likely ⇒ maximizes the expected number of correct individual states.
2. Choose the states q_t which are pair-wise most likely ⇒ maximizes the expected number of correct state pairs.
3. Choose the states q_t which are triple-wise most likely ⇒ maximizes the expected number of correct state triples.
4. Choose the states q_t which are T-wise most likely ⇒ find the single best state sequence which maximizes P(Q, O | λ).

This last solution is often called the Viterbi state sequence because it is found using the Viterbi algorithm.


Maximize Individual States

We define γ_t(i) as the probability of being in state S_i at time t, given the observation sequence and the model, i.e.,

γ_t(i) = P(q_t = S_i | O, λ) = P(q_t = S_i, O | λ) / P(O | λ)

Then

γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)

with

Σ_{i=1}^{N} γ_t(i) = 1,  ∀ t

so that

q_t* = argmax_{1 ≤ i ≤ N} [γ_t(i)],  1 ≤ t ≤ T

Problem: q_t* need not obey the state transition constraints.


Best State Sequence: The Viterbi Algorithm

Define δ_t(i) as the highest probability along a single path, at time t, which accounts for the first t observations and ends in state i, i.e.,

δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} P[q_1 q_2 ... q_{t-1}, q_t = i, O_1 O_2 ... O_t | λ]

We must keep track of the state sequence which gave the best path, at time t, to state i. We do this in the array ψ_t(i).


The Viterbi Algorithm

Step 1: Initialization

  δ_1(i) = π_i b_i(O_1),  1 ≤ i ≤ N
  ψ_1(i) = 0,  1 ≤ i ≤ N

Step 2: Recursion

  δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t-1}(i) a_ij] b_j(O_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N
  ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ_{t-1}(i) a_ij],  2 ≤ t ≤ T, 1 ≤ j ≤ N

Step 3: Termination

  P* = max_{1 ≤ i ≤ N} [δ_T(i)]
  q_T* = argmax_{1 ≤ i ≤ N} [δ_T(i)]

Step 4: Path (State Sequence) Backtracking

  q_t* = ψ_{t+1}(q_{t+1}*),  t = T-1, T-2, ..., 1

Calculation: ≈ N^2 T operations (×, +).


Alternative Viterbi Implementation

Precompute logarithms of the model parameters:

  π̃_i = log(π_i),  1 ≤ i ≤ N
  b̃_i(O_t) = log[b_i(O_t)],  1 ≤ i ≤ N, 1 ≤ t ≤ T
  ã_ij = log[a_ij],  1 ≤ i, j ≤ N

Step 1: Initialization

  δ̃_1(i) = log(δ_1(i)) = π̃_i + b̃_i(O_1),  1 ≤ i ≤ N
  ψ_1(i) = 0,  1 ≤ i ≤ N

Step 2: Recursion

  δ̃_t(j) = log(δ_t(j)) = max_{1 ≤ i ≤ N} [δ̃_{t-1}(i) + ã_ij] + b̃_j(O_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N
  ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ̃_{t-1}(i) + ã_ij],  2 ≤ t ≤ T, 1 ≤ j ≤ N

Step 3: Termination

  P̃* = max_{1 ≤ i ≤ N} [δ̃_T(i)]
  q_T* = argmax_{1 ≤ i ≤ N} [δ̃_T(i)]

Step 4: Backtracking

  q_t* = ψ_{t+1}(q_{t+1}*),  t = T-1, T-2, ..., 1

Calculation: ≈ N^2 T additions.
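The log-domain recursion above translates to a compact sketch (same illustrative array conventions as the earlier forward sketch; zero-probability entries become -inf under np.log, which max/argmax handle correctly):

```python
import numpy as np

def viterbi_log(A, B, pi, obs):
    """Log-domain Viterbi: returns the best state path and its log probability."""
    T, N = len(obs), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = logpi + logB[:, obs[0]]             # step 1: initialization
    for t in range(1, T):                          # step 2: recursion
        scores = delta[t - 1][:, None] + logA      # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[-1].argmax())]               # step 3: termination
    for t in range(T - 1, 0, -1):                  # step 4: backtracking
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```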


Problem

Given the model of the coin toss experiment used earlier (i.e., 3 different coins) with probabilities:

        State 1   State 2   State 3
P(H)     0.5       0.75      0.25
P(T)     0.5       0.25      0.75

with all state transition probabilities equal to 1/3, and with initial state probabilities equal to 1/3: for the observation sequence O = H H H H T H T T T T, find the Viterbi path of maximum likelihood.


Problem Solution

Since all a_ij terms are equal to 1/3, we can omit these terms (as well as the initial state probability term), giving:

δ_1(1) = 0.5,  δ_1(2) = 0.75,  δ_1(3) = 0.25

The recursion for δ_t(j) gives (2 ≤ t ≤ 10):

δ_2(1) = (0.75)(0.5),     δ_2(2) = (0.75)^2,        δ_2(3) = (0.75)(0.25)
δ_3(1) = (0.75)^2 (0.5),  δ_3(2) = (0.75)^3,        δ_3(3) = (0.75)^2 (0.25)
δ_4(1) = (0.75)^3 (0.5),  δ_4(2) = (0.75)^4,        δ_4(3) = (0.75)^3 (0.25)
δ_5(1) = (0.75)^4 (0.5),  δ_5(2) = (0.75)^4 (0.25), δ_5(3) = (0.75)^5
δ_6(1) = (0.75)^5 (0.5),  δ_6(2) = (0.75)^6,        δ_6(3) = (0.75)^5 (0.25)
δ_7(1) = (0.75)^6 (0.5),  δ_7(2) = (0.75)^6 (0.25), δ_7(3) = (0.75)^7
δ_8(1) = (0.75)^7 (0.5),  δ_8(2) = (0.75)^7 (0.25), δ_8(3) = (0.75)^8
δ_9(1) = (0.75)^8 (0.5),  δ_9(2) = (0.75)^8 (0.25), δ_9(3) = (0.75)^9
δ_10(1) = (0.75)^9 (0.5), δ_10(2) = (0.75)^9 (0.25), δ_10(3) = (0.75)^10

This leads to a diagram (trellis) of the form shown in the accompanying figure.
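Running the log-domain viterbi_log sketch from earlier on this model reproduces the hand computation (indices are 0-based in the code: H = 0, T = 1):

```python
import numpy as np

A = np.full((3, 3), 1 / 3)
B = np.array([[0.50, 0.50],    # state 1: P(H), P(T)
              [0.75, 0.25],    # state 2
              [0.25, 0.75]])   # state 3
pi = np.full(3, 1 / 3)
O = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]   # H H H H T H T T T T

path, logp = viterbi_log(A, B, pi, O)
print([s + 1 for s in path])   # [2, 2, 2, 2, 3, 2, 3, 3, 3, 3]
print(np.exp(logp))            # equals (0.75)**10 * (1/3)**10
```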


Solution to Problem 3: The Training Problem

• no globally optimum solution is known
• all solutions yield local optima
  – can get a solution via gradient techniques
  – can use a re-estimation procedure such as the Baum-Welch (EM) method
• consider re-estimation procedures
  – basic idea: given a current model estimate, λ, compute expected values of model events, then refine the model based on the computed values:

    λ^(0) --E[model events]--> λ^(1) --E[model events]--> λ^(2) --> ···

Define ξ_t(i, j) as the probability of being in state S_i at time t and state S_j at time t+1, given the model and the observation sequence, i.e.,

ξ_t(i, j) = P[q_t = S_i, q_{t+1} = S_j | O, λ]


The Training Problem

ξ_t(i, j) = P[q_t = S_i, q_{t+1} = S_j | O, λ]

(figure slide)


The Training Problem

ξ_t(i, j) = P[q_t = S_i, q_{t+1} = S_j | O, λ]
          = P[q_t = S_i, q_{t+1} = S_j, O | λ] / P(O | λ)
          = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)
          = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j)

γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)

Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from S_i
Σ_{t=1}^{T-1} ξ_t(i, j) = expected number of transitions from S_i to S_j


Re-estimation Formulas

π̄_i = expected number of times in state S_i at t = 1
    = γ_1(i)

ā_ij = (expected number of transitions from state S_i to state S_j) / (expected number of transitions from state S_i)
     = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

b̄_j(k) = (expected number of times in state j with symbol v_k) / (expected number of times in state j)
        = Σ_{t=1, O_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
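Putting the pieces together, one Baum-Welch re-estimation iteration can be sketched as follows (unscaled, for a single short discrete observation sequence; real implementations add the scaling described under Implementation Issues; all names are illustrative):

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation iteration for a discrete HMM."""
    T, N = len(obs), len(pi)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                        # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    pO = alpha[-1].sum()                         # P(O | lambda)

    gamma = alpha * beta / pO                    # gamma_t(i)
    # xi[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / pO

    new_pi = gamma[0]                            # pi_i = gamma_1(i)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                  # numerator: sum over t with O_t = v_k
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi
```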


Re-estimation Formulas

If λ = (A, B, Π) is the initial model, and λ̄ = (Ā, B̄, Π̄) is the re-estimated model, then it can be proven that either:

1. the initial model, λ, defines a critical point of the likelihood function, in which case λ̄ = λ, or
2. model λ̄ is more likely than model λ in the sense that P(O | λ̄) > P(O | λ), i.e., we have found a new model λ̄ from which the observation sequence is more likely to have been produced.

Conclusion: Iteratively use λ̄ in place of λ and repeat the re-estimation until some limiting point is reached. The resulting model is called the maximum likelihood (ML) HMM.


Re-estimation Formulas

1. The re-estimation formulas can be derived by maximizing the auxiliary function Q(λ, λ̄) over λ̄, i.e.,

   Q(λ, λ̄) = Σ_q P(O, q | λ) log[P(O, q | λ̄)]

   It can be proved that:

   max_{λ̄} [Q(λ, λ̄)] ⇒ P(O | λ̄) ≥ P(O | λ)

   Eventually the likelihood function converges to a critical point.

2. Relation to the EM algorithm:
   • the E (expectation) step is the calculation of the auxiliary function, Q(λ, λ̄)
   • the M (modification) step is the maximization over λ̄


Notes on Re-estimation

1. The stochastic constraints on π̄_i, ā_ij, and b̄_j(k) are automatically met, i.e.,

   Σ_{i=1}^{N} π̄_i = 1,  Σ_{j=1}^{N} ā_ij = 1,  Σ_{k=1}^{M} b̄_j(k) = 1

2. At the critical points of P = P(O | λ):

   π̄_i = π_i (∂P/∂π_i) / Σ_{k=1}^{N} π_k (∂P/∂π_k) = π_i

   ā_ij = a_ij (∂P/∂a_ij) / Σ_{k=1}^{N} a_ik (∂P/∂a_ik) = a_ij

   b̄_j(k) = b_j(k) (∂P/∂b_j(k)) / Σ_{l=1}^{M} b_j(l) (∂P/∂b_j(l)) = b_j(k)

   ⇒ at the critical points, the re-estimation formulas are exactly correct.


Variations on HMM's

1. Types of HMM: model structures
2. Continuous observation density models: mixtures
3. Autoregressive HMM's: LPC links
4. Null transitions and tied states
5. Inclusion of explicit state duration density in HMM's
6. Optimization criterion: ML, MMI, MDI


Types of HMM

1. Ergodic models: no transient states.
2. Left-right models: all transient states (except the last state), with the constraints:

   π_i = 1 for i = 1, and π_i = 0 for i ≠ 1
   a_ij = 0,  j < i

   Controlled transitions additionally imply:

   a_ij = 0,  j > i + Δ  (Δ = 1, 2 typically)

3. Mixed forms of ergodic and left-right models (e.g., parallel branches).

Note: the constraints of left-right models don't affect the re-estimation formulas (i.e., a parameter initially set to 0 remains at 0 during re-estimation).


Types of HMM

(figure slide: ergodic model, left-right model, mixed model)


Continuous Observation Density HMM's

The most general form of pdf with a valid re-estimation procedure is a mixture density:

b_j(x) = Σ_{m=1}^{M} c_jm 𝒩[x, μ_jm, U_jm],  1 ≤ j ≤ N

where

x = observation vector = {x_1, x_2, ..., x_D}
M = number of mixture densities
c_jm = gain of the m-th mixture in state j
𝒩 = any log-concave or elliptically symmetric density (e.g., a Gaussian)
μ_jm = mean vector for mixture m, state j
U_jm = covariance matrix for mixture m, state j

with the constraints

c_jm ≥ 0,  1 ≤ j ≤ N, 1 ≤ m ≤ M
Σ_{m=1}^{M} c_jm = 1,  1 ≤ j ≤ N
∫_{-∞}^{∞} b_j(x) dx = 1,  1 ≤ j ≤ N
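For the common case of diagonal-covariance Gaussian mixtures, evaluating b_j(x) for one state is straightforward (a sketch; the arguments are one state's parameters, and the names are illustrative):

```python
import numpy as np

def gmm_density(x, c, mu, var):
    """b_j(x) for one state: mixture of diagonal-covariance Gaussians.
    c: (M,) mixture gains, mu: (M, D) means, var: (M, D) diagonal variances."""
    D = mu.shape[1]
    diff = x[None, :] - mu                                    # (M, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(var).sum(axis=1))
    log_comp = log_norm - 0.5 * (diff**2 / var).sum(axis=1)   # per-mixture log density
    return float(np.exp(log_comp) @ c)                        # sum_m c_m N(x; mu_m, var_m)
```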


State Equivalence Chart

(figure slide: equivalence of a state with a mixture density to a multi-state, single-mixture case)


Re-estimation for Mixture Densities

c̄_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k)

μ̄_jk = Σ_{t=1}^{T} γ_t(j, k) O_t / Σ_{t=1}^{T} γ_t(j, k)

Ū_jk = Σ_{t=1}^{T} γ_t(j, k) (O_t - μ_jk)(O_t - μ_jk)′ / Σ_{t=1}^{T} γ_t(j, k)

where γ_t(j, k) is the probability of being in state j at time t with the k-th mixture component accounting for O_t:

γ_t(j, k) = [α_t(j) β_t(j) / Σ_{j=1}^{N} α_t(j) β_t(j)] · [c_jk 𝒩(O_t, μ_jk, U_jk) / Σ_{m=1}^{M} c_jm 𝒩(O_t, μ_jm, U_jm)]


Autoregressive HMM

Consider an observation vector O = (x_0, x_1, ..., x_{K-1}), where each x_k is a waveform sample and O represents a frame of the signal (e.g., K = 256 samples). We assume x_k is related to previous samples of O by a Gaussian autoregressive process of order p, i.e.,

x_k = - Σ_{i=1}^{p} a_i x_{k-i} + e_k,  0 ≤ k ≤ K-1

where the e_k are Gaussian, independent, identically distributed random variables with zero mean and variance σ^2, and a_i, 1 ≤ i ≤ p, are the autoregressive or predictor coefficients. As K → ∞,

f(O) = (2πσ^2)^{-K/2} exp{ -δ(O, a) / (2σ^2) }

where

δ(O, a) = r_a(0) r(0) + 2 Σ_{i=1}^{p} r_a(i) r(i)


Autoregressive HMM

r_a(i) = Σ_{n=0}^{p-i} a_n a_{n+i},  (a_0 = 1), 1 ≤ i ≤ p

r(i) = Σ_{n=0}^{K-i-1} x_n x_{n+i},  0 ≤ i ≤ p

a′ = [1, a_1, a_2, ..., a_p]

The prediction residual is:

α = E[ Σ_{i=1}^{K} (e_i)^2 ] = K σ^2

Consider the normalized observation vector

Ô = O / √α = O / √(K σ^2)

for which

f(Ô) = (2π)^{-K/2} exp{ -(K/2) δ(Ô, a) }

In practice, K is replaced by K̂, the effective frame length, e.g., K̂ = K/3 for a frame overlap of 3 to 1.


Application of Autoregressive HMM

b_j(O) = Σ_{m=1}^{M} c_jm b_jm(O)

b_jm(O) = (2π)^{-K/2} exp{ -(K/2) δ(O, a_jm) }

Each mixture is characterized by a predictor vector, or by an autocorrelation vector from which the predictor vector can be derived. The re-estimation formula for r̄_jk is:

r̄_jk = Σ_{t=1}^{T} γ_t(j, k) r_t / Σ_{t=1}^{T} γ_t(j, k)

γ_t(j, k) = [α_t(j) β_t(j) / Σ_{j=1}^{N} α_t(j) β_t(j)] · [c_jk b_jk(O_t) / Σ_{m=1}^{M} c_jm b_jm(O_t)]


Null Transitions and Tied States

Null transitions: transitions which produce no output and take no time, denoted by φ.

Tied states: set up an equivalence relation between HMM parameters in different states:
– the number of independent parameters of the model is reduced
– parameter estimation becomes simpler
– useful in cases where there is insufficient training data for reliable estimation of all model parameters


Null Transitions

(figure slide)


Inclusion of Explicit State Duration Density

For standard HMM's, the duration density is:

p_i(d) = probability of exactly d observations in state S_i
       = (a_ii)^{d-1} (1 - a_ii)

With an arbitrary state duration density, p_i(d), observations are generated as follows:

1. an initial state, q_1 = S_i, is chosen according to the initial state distribution, π_i
2. a duration d_1 is chosen according to the state duration density p_{q_1}(d_1)
3. observations O_1 O_2 ... O_{d_1} are chosen according to the joint density b_{q_1}(O_1 O_2 ... O_{d_1}); generally we assume independence, so
   b_{q_1}(O_1 O_2 ... O_{d_1}) = Π_{t=1}^{d_1} b_{q_1}(O_t)
4. the next state, q_2 = S_j, is chosen according to the state transition probabilities, a_{q_1 q_2}, with the constraint that a_{q_1 q_1} = 0, i.e., no transition back to the same state can occur.


Explicit State Duration Density

(figure slide: standard HMM versus HMM with explicit state duration density)


Explicit State Duration Density

t:             1 ... d_1        d_1+1 ... d_1+d_2        d_1+d_2+1 ... d_1+d_2+d_3
state:         q_1              q_2                      q_3
duration:      d_1              d_2                      d_3
observations:  O_1 ... O_{d_1}  O_{d_1+1} ... O_{d_1+d_2}  O_{d_1+d_2+1} ... O_{d_1+d_2+d_3}

Assume:
1. the first state, q_1, begins at t = 1
2. the last state, q_r, ends at t = T

⇒ entire duration intervals are included within the observation sequence O_1 O_2 ... O_T.

Modified α:

α_t(i) = P(O_1 O_2 ... O_t, S_i ending at t | λ)

Assume r states in the first t observations, i.e.,

Q = {q_1 q_2 ... q_r} with q_r = S_i
D = {d_1 d_2 ... d_r} with Σ_{s=1}^{r} d_s = t


Explicit State Duration Density

Then we have:

α_t(i) = Σ_Q Σ_D π_{q_1} p_{q_1}(d_1) P(O_1 O_2 ... O_{d_1} | q_1)
         · a_{q_1 q_2} p_{q_2}(d_2) P(O_{d_1+1} ... O_{d_1+d_2} | q_2) · ...
         · a_{q_{r-1} q_r} p_{q_r}(d_r) P(O_{d_1+d_2+...+d_{r-1}+1} ... O_t | q_r)

By induction:

α_t(j) = Σ_{i=1}^{N} Σ_{d=1}^{D} α_{t-d}(i) a_ij p_j(d) Π_{s=t-d+1}^{t} b_j(O_s)

Initialization of α_t(i):

α_1(i) = π_i p_i(1) b_i(O_1)

α_2(i) = π_i p_i(2) Π_{s=1}^{2} b_i(O_s) + Σ_{j=1, j≠i}^{N} α_1(j) a_ji p_i(1) b_i(O_2)

α_3(i) = π_i p_i(3) Π_{s=1}^{3} b_i(O_s) + Σ_{d=1}^{2} Σ_{j=1, j≠i}^{N} α_{3-d}(j) a_ji p_i(d) Π_{s=4-d}^{3} b_i(O_s)

and finally

P(O | λ) = Σ_{i=1}^{N} α_T(i)


Explicit State Duration Density

• Re-estimation formulas for a_ij, b_i(k), and p_i(d) can be formulated and appropriately interpreted.
• Modifications to Viterbi scoring are required, i.e.,

  δ_t(i) = P(O_1 O_2 ... O_t, q_1 q_2 ... q_r = S_i ending at t | λ)

  with the basic recursion:

  δ_t(i) = max_{1 ≤ j ≤ N, j ≠ i} max_{1 ≤ d ≤ D} [δ_{t-d}(j) a_ji p_i(d) Π_{s=t-d+1}^{t} b_i(O_s)]

• Storage is required for δ_{t-1} ... δ_{t-D} ⇒ N · D locations.
• The maximization involves all terms, not just the old δ's and a_ji as in the previous case ⇒ a significantly larger computational load, ≈ (D^2/2) N^2 T computations involving b_i(O_s).

Example: N = 5, D = 20

              implicit duration   explicit duration
storage              5                  100
computation        2500               500,000


Issues with Explicit State Duration Density

1. the quality of the signal modeling is often improved significantly
2. significant increase in the number of parameters per state (D duration estimates)
3. significant increase in the computation associated with the probability calculation (by ≈ D^2/2)
4. insufficient data to give good p_i(d) estimates

Alternatives:

1. use a parametric state duration density:

   p_i(d) = 𝒩(d, μ_i, σ_i^2)  (Gaussian)

   p_i(d) = η_i^{ν_i} d^{ν_i - 1} e^{-η_i d} / Γ(ν_i)  (Gamma)

2. incorporate state duration information after the probability calculation, e.g., in a post-processor


Alternatives to ML Estimation

Assume we wish to design V different HMM's, λ_1, λ_2, ..., λ_V. Normally we design each HMM, λ_v, based on a training set of observations, O^v, using a maximum likelihood (ML) criterion, i.e.,

P_v* = max_{λ_v} P[O^v | λ_v]

Consider instead the mutual information, I_v, between the observation sequence, O^v, and the complete set of models λ = (λ_1, λ_2, ..., λ_V):

I_v = log P(O^v | λ_v) - log Σ_{w=1}^{V} P(O^v | λ_w)

Maximizing I_v over λ gives

I_v* = max_λ [log P(O^v | λ_v) - log Σ_{w=1}^{V} P(O^v | λ_w)]

i.e., choose λ so as to separate the correct model, λ_v, from all other models, as much as possible, for the training set O^v.


Alternatives to ML Estimation

Sum over all such training sets to design the models according to an MMI criterion, i.e.,

I* = max_λ Σ_{v=1}^{V} [log P(O^v | λ_v) - log Σ_{w=1}^{V} P(O^v | λ_w)]

with solution via steepest descent methods.


Comparison of HMM's

Problem: given two HMM's, λ_1 and λ_2, is it possible to give a measure of how similar the two models are?

Example (parameters p, q and r, s refer to the two 2-state models in the slide's figure): for equivalence, (A_1, B_1) ⇔ (A_2, B_2), we require P(O_t = v_k) to be the same for both models and for all symbols v_k. Thus we require:

pq + (1 - p)(1 - q) = rs + (1 - r)(1 - s)

which gives

s = (p + q - 2pq - r) / (1 - 2r)

Let p = 0.6, q = 0.7, r = 0.2; then s = 13/30 ≈ 0.433.


Comparison of HMM's

Thus the two models have very different A and B matrices, but are equivalent in the sense that all symbol probabilities (averaged over time) are the same.

We generalize the concept of model distance (dis-similarity) by defining a distance measure, D(λ_1, λ_2), between two Markov sources, λ_1 and λ_2, as:

D(λ_1, λ_2) = (1/T) [log P(O_T^(2) | λ_1) - log P(O_T^(2) | λ_2)]

where O_T^(2) is a sequence of observations generated by model λ_2 and scored by both models.

We symmetrize D by using the relation:

D_S(λ_1, λ_2) = [D(λ_1, λ_2) + D(λ_2, λ_1)] / 2


Implementation Issues for HMM's

1. Scaling: to prevent underflow and/or overflow.
2. Multiple observation sequences: to train left-right models.
3. Initial estimates of HMM parameters: to provide robust models.
4. Effects of insufficient training data.


Scaling

α_t(i) is a sum of a large number of terms, each of the form:

[Π_{s=1}^{t-1} a_{q_s q_{s+1}}] [Π_{s=1}^{t} b_{q_s}(O_s)]

Since each a and b term is less than 1, as t gets larger, α_t(i) exponentially heads to 0. Thus scaling is required to prevent underflow.

• Consider scaling α_t(i) by the factor

  c_t = 1 / Σ_{i=1}^{N} α_t(i),  independent of i

• We denote the scaled α's as:

  α̂_t(i) = c_t α_t(i) = α_t(i) / Σ_{i=1}^{N} α_t(i)

so that

  Σ_{i=1}^{N} α̂_t(i) = 1


Scaling

• For fixed t, we compute

  α_t(i) = Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(O_t)

• Scaling gives

  α̂_t(i) = Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(O_t) / Σ_{i=1}^{N} Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(O_t)

• By induction we get

  α̂_{t-1}(j) = [Π_{τ=1}^{t-1} c_τ] α_{t-1}(j)

• giving

  α̂_t(i) = Σ_{j=1}^{N} α_{t-1}(j) [Π_{τ=1}^{t-1} c_τ] a_ji b_i(O_t) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_{t-1}(j) [Π_{τ=1}^{t-1} c_τ] a_ji b_i(O_t)
         = α_t(i) / Σ_{i=1}^{N} α_t(i)

i.e., each α_t(i) is effectively scaled by the sum of α_t(i) over all states.


Scaling

• For scaling the β_t(i) terms we use the same scale factors as for the α_t(i) terms, i.e.,

  β̂_t(i) = c_t β_t(i)

  since the magnitudes of the α and β terms are comparable.

• The re-estimation formula for a_ij in terms of the scaled α's and β's is:

  ā_ij = Σ_{t=1}^{T-1} α̂_t(i) a_ij b_j(O_{t+1}) β̂_{t+1}(j) / Σ_{j=1}^{N} Σ_{t=1}^{T-1} α̂_t(i) a_ij b_j(O_{t+1}) β̂_{t+1}(j)

• We have:

  α̂_t(i) = [Π_{τ=1}^{t} c_τ] α_t(i) = C_t α_t(i)

  β̂_{t+1}(j) = [Π_{τ=t+1}^{T} c_τ] β_{t+1}(j) = D_{t+1} β_{t+1}(j)


Scaling

giving

ā_ij = Σ_{t=1}^{T-1} C_t α_t(i) a_ij b_j(O_{t+1}) D_{t+1} β_{t+1}(j) / Σ_{j=1}^{N} Σ_{t=1}^{T-1} C_t α_t(i) a_ij b_j(O_{t+1}) D_{t+1} β_{t+1}(j)

But

C_t D_{t+1} = Π_{τ=1}^{t} c_τ · Π_{τ=t+1}^{T} c_τ = Π_{τ=1}^{T} c_τ = C_T

independent of t, so these factors cancel out of the ratio.

Notes on Scaling:

1. the scaling procedure works equally well on the π or B coefficients
2. scaling need not be performed at each iteration; set c_t = 1 whenever scaling is skipped
3. we can solve for P(O | λ) from the scaled coefficients: since

   Π_{t=1}^{T} c_t · Σ_{i=1}^{N} α_T(i) = C_T Σ_{i=1}^{N} α_T(i) = 1

   we get

   P(O | λ) = Σ_{i=1}^{N} α_T(i) = 1 / Π_{t=1}^{T} c_t

   log P(O | λ) = - Σ_{t=1}^{T} log(c_t)
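Putting the notes together, a scaled forward pass that returns log P(O | λ) = -Σ_t log(c_t) looks like this (same illustrative array conventions as the earlier sketches):

```python
import numpy as np

def forward_scaled(A, B, pi, obs):
    """Scaled forward pass; returns log P(O | lambda) = -sum_t log(c_t)."""
    T = len(obs)
    alpha_hat = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(T):
        if t > 0:
            alpha_hat = (alpha_hat @ A) * B[:, obs[t]]
        c_t = 1.0 / alpha_hat.sum()          # scale factor, independent of i
        alpha_hat = c_t * alpha_hat          # scaled alphas now sum to 1
        log_p -= np.log(c_t)                 # accumulate log P(O | lambda)
    return log_p
```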


Multiple Observation Sequences

For left-right models, we need to use multiple sequences of observations for training. Assume a set of K observation sequences (i.e., training utterances):

O = [O^(1), O^(2), ..., O^(K)]

where

O^(k) = [O_1^(k) O_2^(k) ... O_{T_k}^(k)]

We wish to maximize the probability

P(O | λ) = Π_{k=1}^{K} P(O^(k) | λ) = Π_{k=1}^{K} P_k

The re-estimation formula for a_ij becomes:

ā_ij = Σ_{k=1}^{K} Σ_{t=1}^{T_k - 1} α_t^k(i) a_ij b_j(O_{t+1}^(k)) β_{t+1}^k(j) / Σ_{k=1}^{K} Σ_{t=1}^{T_k - 1} α_t^k(i) β_t^k(i)

Scaling requires:

ā_ij = Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k - 1} α̂_t^k(i) a_ij b_j(O_{t+1}^(k)) β̂_{t+1}^k(j) / Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k - 1} α̂_t^k(i) β̂_t^k(i)

where all the scaling factors cancel out.


Initial Estimates of HMM Parameters

N: choose based on physical considerations
M: choose based on model fits
π_i: random or uniform (π_i ≠ 0)
a_ij: random or uniform (a_ij ≠ 0)
b_j(k): random or uniform (b_j(k) ≥ ε)
b_j(O): need good initial estimates of the mean vectors; need reasonable estimates of the covariance matrices


Effects of Insufficient Training Data

Insufficient training data leads to poor estimates of the model parameters.

Possible solutions:

1. use more training data -- often this is impractical
2. reduce the size of the model -- often there are physical reasons for keeping a chosen model size
3. add extra constraints to the model parameters:

   b_j(k) ≥ ε
   U_jk(r, r) ≥ δ

   often the model performance is relatively insensitive to the exact choice of ε, δ
4. use the method of deleted interpolation:

   λ̄ = ε λ + (1 - ε) λ′


Methods for Insufficient Data

(figure slide: performance insensitivity to ε)


Deleted Interpolation

(figure slide)


Isolated Word Recognition Using HMM's

Assume a vocabulary of V words, with K occurrences of each spoken word in a training set. Observation vectors are spectral characterizations of the word. For isolated word recognition, we do the following:

1. For each word, v, in the vocabulary, we build an HMM, λ^v, i.e., we re-estimate the model parameters (A, B, Π) that optimize the likelihood of the training set observation vectors for the v-th word. (TRAINING)
2. For each unknown word which is to be recognized:
   a. measure the observation sequence O = [O_1 O_2 ... O_T]
   b. calculate the model likelihoods P(O | λ^v), 1 ≤ v ≤ V
   c. select the word whose model likelihood score is highest:

      v* = argmax_{1 ≤ v ≤ V} [P(O | λ^v)]

Computation on the order of V · N^2 T is required; for V = 100, N = 5, T = 40 this is about 10^5 computations.
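The recognition step itself is then just an argmax over per-word model scores (a sketch reusing the scaled forward pass from earlier; `models`, mapping each vocabulary word to its trained (A, B, pi), is an assumed illustrative structure):

```python
def recognize(obs, models):
    """Step 2 of the recognizer: score O against every word model, pick the best."""
    scores = {word: forward_scaled(A, B, pi, obs)   # log P(O | lambda_v)
              for word, (A, B, pi) in models.items()}
    best = max(scores, key=scores.get)              # v* = argmax_v log P(O | lambda_v)
    return best, scores
```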


Isolated Word HMM Recognizer

(figure slide)


Choice of Model Parameters

1. A left-right model is preferable to an ergodic model (speech is a left-right process).
2. Number of states in the range 2-40 (from sounds to frames):
   • on the order of the number of distinct sounds in the word
   • on the order of the average number of observations in the word
3. Observation vectors:
   • cepstral coefficients (and their second and third order derivatives) derived from LPC (1-9 mixtures), diagonal covariance matrices
   • vector-quantized discrete symbols (16-256 codebook sizes)
4. Constraints on the b_j(O) densities:
   • b_j(k) > ε for discrete densities
   • c_jm > δ, U_jm(r, r) > δ for continuous densities


Performance vs. Number of States in Model

(figure slide)


HMM Feature Vector Densities

(figure slide)


Segmental K-Means Segmentation into States

Motivation: derive good estimates of the b_j(O) densities, as required for rapid convergence of the re-estimation procedure.

Initially: a training set of multiple sequences of observations, and an initial model estimate.

Procedure: segment each observation sequence into states using a Viterbi procedure.

For discrete observation densities, code all observations in state j using the M-codeword codebook, giving:

b_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)

For continuous observation densities, cluster the observations in state j into a set of M clusters, giving:


Segmental K-Means Segmentation into States

c_jm = (number of vectors assigned to cluster m of state j) / (number of vectors in state j)

μ_jm = sample mean of the vectors assigned to cluster m of state j

U_jm = sample covariance of the vectors assigned to cluster m of state j

Use as the estimate of the state transition probabilities:

a_ii = (number of vectors in state i minus the number of observation sequences for the training word) / (number of vectors in state i)

a_{i,i+1} = 1 - a_ii

The segmenting HMM is then updated, and the procedure is iterated until a converged model is obtained.


Segmental K-Means Training

(figure slide)


HMM Segmentation for /SIX/

(figure slide)


Digit Recognition Using HMM's

(figure slide: log energy of an unknown digit, frame likelihood scores, frame cumulative scores, and state segmentation for the models "one" and "nine")


Digit Recognition Using HMM's

(figure slide: log energy of an unknown digit, frame likelihood scores, frame cumulative scores, and state segmentation for the models "seven" and "six")
