
Digital Speech Processing: Lecture 20

The Hidden Markov Model (HMM)


Lecture Outline

• Theory of Markov Models
  – discrete Markov processes
  – hidden Markov processes
• Solutions to the Three Basic Problems of HMM's
  – computation of observation probability
  – determination of optimal state sequence
  – optimal training of model
• Variations of elements of the HMM
  – model types
  – densities
• Implementation Issues
  – scaling
  – multiple observation sequences
  – initial parameter estimates
  – insufficient training data
• Implementation of Isolated Word Recognizer Using HMM's


Stochastic Signal Modeling

• Reasons for Interest:
  – basis for a theoretical description of signal processing algorithms
  – can learn about signal source properties
  – models work well in practice in real-world applications
• Types of Signal Models:
  – deterministic, parametric models
  – stochastic models


Discrete Markov Processes

System of N distinct states, {S_1, S_2, ..., S_N}.

Markov Property:

Time (t):  1    2    3    4    5   ...
State:    q_1  q_2  q_3  q_4  q_5  ...

P[q_t = S_i | q_{t-1} = S_j, q_{t-2} = S_k, ...] = P[q_t = S_i | q_{t-1} = S_j]


Properties of State Transition Coefficients

Consider processes where the state transitions are time independent, i.e.,

a_ji = P[q_t = S_i | q_{t-1} = S_j],  1 ≤ i, j ≤ N

with

a_ji ≥ 0,  ∀ j, i

Σ_{i=1}^{N} a_ji = 1,  ∀ j


Example of Discrete Markov Process

Once each day (e.g., at noon), the weather is observed and classified as being one of the following:
– State 1: Rain (or Snow; e.g., precipitation)
– State 2: Cloudy
– State 3: Sunny

with state transition probabilities:

A = {a_ij} = ⎡0.4  0.3  0.3⎤
             ⎢0.2  0.6  0.2⎥
             ⎣0.1  0.1  0.8⎦


Discrete Markov Process

Problem: Given that the weather on day 1 is sunny, what is the probability (according to the model) that the weather for the next 7 days will be "sunny-sunny-rain-rain-sunny-cloudy-sunny"?

Solution: We define the observation sequence, O, as:

O = {S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3}

and we want to calculate P(O | Model). That is:

P(O | Model) = P[S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | Model]


Discrete Markov Process

P(O | Model) = P[S_3, S_3, S_3, S_1, S_1, S_3, S_2, S_3 | Model]
             = P[S_3] P[S_3|S_3] P[S_3|S_3] P[S_1|S_3] P[S_1|S_1] P[S_3|S_1] P[S_2|S_3] P[S_3|S_2]
             = π_3 (a_33)^2 a_31 a_11 a_13 a_32 a_23
             = (1.0)(0.8)^2 (0.1)(0.4)(0.3)(0.1)(0.2)
             = 1.536 × 10^-4

where π_i = P[q_1 = S_i], 1 ≤ i ≤ N.
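As a quick numerical check of the computation above, here is a minimal Python sketch (the function and variable names are illustrative, not from the lecture) that scores a state sequence under a first-order Markov chain:

```python
import numpy as np

# Weather model: 0 = rain, 1 = cloudy, 2 = sunny (0-based indices).
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def chain_probability(states, A, pi):
    """P(q_1, ..., q_T) = pi[q_1] * prod_t A[q_{t-1}, q_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

pi = np.array([0.0, 0.0, 1.0])      # day 1 is known to be sunny
O = [2, 2, 2, 0, 0, 2, 1, 2]        # sunny sunny sunny rain rain sunny cloudy sunny
print(chain_probability(O, A, pi))  # 1.536e-04
```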


Discrete Markov Process

Problem: Given that the model is in a known state, what is the probability it stays in that state for exactly d days?

Solution:

O = {S_i, S_i, S_i, ..., S_i, S_j ≠ S_i}
     1    2    3        d    d+1

P(O | Model, q_1 = S_i) = (a_ii)^{d-1} (1 - a_ii) = p_i(d)

d̄_i = Σ_{d=1}^{∞} d p_i(d) = 1 / (1 - a_ii)
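The geometric duration density and its closed-form mean are easy to verify numerically (a sketch; the value a_ii = 0.8 is an arbitrary illustration):

```python
# p_i(d) = a_ii**(d - 1) * (1 - a_ii): geometric state-duration density.
a_ii = 0.8
mean_d = sum(d * a_ii**(d - 1) * (1 - a_ii) for d in range(1, 10_000))
print(mean_d)          # ~5.0 (truncated sum)
print(1 / (1 - a_ii))  # 5.0, the closed-form expected duration
```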


Exercise

Given a single fair coin, i.e., P(H = Heads) = P(T = Tails) = 0.5, which you toss once and observe Tails:

a) what is the probability that the next 10 tosses will provide the sequence {H H T H T T H T T H}?

SOLUTION:

For a fair coin, with independent coin tosses, the probability of any specific observation sequence of length 10 (10 tosses) is (1/2)^10, since there are 2^10 such sequences and all are equally probable. Thus:

P(H H T H T T H T T H) = (1/2)^10


Exercise

b) what is the probability that the next 10 tosses will produce the sequence {H H H H H H H H H H}?

SOLUTION:

Similarly:

P(H H H H H H H H H H) = (1/2)^10

Thus a specified run of length 10 is equally as likely as a specified run of interlaced H's and T's.


Exercise

c) what is the probability that 5 of the next 10 tosses will be tails? What is the expected number of tails over the next 10 tosses?

SOLUTION:

The probability of 5 tails in the next 10 tosses is the number of observation sequences with 5 tails and 5 heads (in any order) times the probability of each such sequence:

P(5H, 5T) = C(10,5) (1/2)^10 = 252/1024 ≈ 0.25

since there are C(10,5) combinations (ways of getting 5H and 5T) for 10 coin tosses, and each sequence has probability (1/2)^10. The expected number of tails in 10 tosses is:

E(number of T in 10 coin tosses) = Σ_{d=0}^{10} d C(10,d) (1/2)^10 = 5

Thus, on average, there will be 5H and 5T in 10 tosses, but the probability of exactly 5H and 5T is only about 0.25.
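Both quantities can be confirmed in a few lines with the Python standard library:

```python
from math import comb

print(comb(10, 5) * 0.5**10)                               # 0.24609375 ~ 0.25
print(sum(d * comb(10, d) * 0.5**10 for d in range(11)))   # 5.0
```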


Coin Toss Models

A series of coin tossing experiments is performed. The number of coins is unknown; only the results of each coin toss are revealed. Thus a typical observation sequence is:

O = O_1 O_2 O_3 ... O_T = H H T T T H T T H ... H

Problem: Build an HMM to explain the observation sequence.

Issues:
1. What are the states in the model?
2. How many states should be used?
3. What are the state transition probabilities?


Coin Toss Models

(two figure slides)


Coin Toss Models

Problem: Consider an HMM representation (model λ) of a coin tossing experiment. Assume a 3-state model (corresponding to 3 different coins) with probabilities:

        State 1   State 2   State 3
P(H)     0.5       0.75      0.25
P(T)     0.5       0.25      0.75

and with all state transition probabilities equal to 1/3. (Assume initial state probabilities of 1/3.)

a) You observe the sequence O = H H H H T H T T T T. What state sequence is most likely? What is the probability of the observation sequence and this most likely state sequence?


Coin Toss Problem Solution

SOLUTION:

Given O = H H H H T H T T T T, the most likely state sequence is the one for which the probability of each individual observation is maximum. Thus for each H the most likely state is S_2, and for each T the most likely state is S_3. The most likely state sequence is therefore:

S = S_2 S_2 S_2 S_2 S_3 S_2 S_3 S_3 S_3 S_3

The probability of O and S (given the model) is:

P(O, S | λ) = (0.75)^10 (1/3)^10


Coin Toss Models

b) What is the probability that the observation sequence came entirely from state 1?

SOLUTION:

The probability of O given a state sequence of the form

Ŝ = S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1 S_1

is:

P(O, Ŝ | λ) = (0.50)^10 (1/3)^10

The ratio of P(O, S | λ) to P(O, Ŝ | λ) is:

R = P(O, S | λ) / P(O, Ŝ | λ) = (3/2)^10 = 57.67


Coin Toss Models

c) Consider the observation sequence O̅ = H T T H T H H T T H. How would your answers to parts a and b change?

SOLUTION:

Since O̅ has the same number of H's and T's, the answers to parts a and b would remain the same: the most likely states occur the same number of times in both cases.


Coin Toss Models

d) If the state transition probabilities were of the form:

a_11 = 0.9,  a_21 = 0.45, a_31 = 0.45
a_12 = 0.05, a_22 = 0.1,  a_32 = 0.45
a_13 = 0.05, a_23 = 0.45, a_33 = 0.1

i.e., a new model λ′, how would your answers to parts a-c change? What does this suggest about the type of sequences generated by the models?


Coin Toss Problem Solution

SOLUTION:

The new probability of O and S becomes:

P(O, S | λ′) = (0.75)^10 (1/3) (0.1)^6 (0.45)^3

The new probability of O and Ŝ becomes:

P(O, Ŝ | λ′) = (0.50)^10 (1/3) (0.9)^9

The ratio is:

R = (3/2)^10 (1/9)^6 (1/2)^3 = 1.36 × 10^-5


Coin Toss Problem Solution

Now, for the part-c sequence O̅ with its most likely per-symbol state sequence S̅, the probability of O̅ and S̅ is not the same as the probability of O and S. We now have:

P(O̅, S̅ | λ′) = (0.75)^10 (1/3) (0.45)^6 (0.1)^3

P(O̅, Ŝ | λ′) = (0.50)^10 (1/3) (0.9)^9

with the ratio:

R = (3/2)^10 (1/2)^6 (1/9)^3 = 1.24 × 10^-3

Model λ, the initial model, clearly favors long runs of H's or T's, whereas model λ′, the new model, clearly favors random sequences of H's and T's. Thus even a run of H's or T's is more likely to occur in state 1 for model λ′, and a random sequence of H's and T's is more likely to occur in states 2 and 3 for model λ.


Balls in Urns Model

(figure slide)


Elements of an HMM

1. N, the number of states in the model
   • states S = {S_1, S_2, ..., S_N}; the state at time t is q_t ∈ S
2. M, the number of distinct observation symbols per state
   • observation symbols V = {v_1, v_2, ..., v_M}; the observation at time t is O_t ∈ V
3. The state transition probability distribution, A = {a_ij}:
   a_ij = P(q_{t+1} = S_j | q_t = S_i),  1 ≤ i, j ≤ N
4. The observation symbol probability distribution in state j, B = {b_j(k)}:
   b_j(k) = P[v_k at t | q_t = S_j],  1 ≤ j ≤ N, 1 ≤ k ≤ M
5. The initial state distribution, Π = {π_i}:
   π_i = P[q_1 = S_i],  1 ≤ i ≤ N


HMM Generator of Observations

1. Choose an initial state, q_1 = S_i, according to the initial state distribution, Π.
2. Set t = 1.
3. Choose O_t = v_k according to the symbol probability distribution in state S_i, namely b_i(k).
4. Transit to a new state, q_{t+1} = S_j, according to the state transition probability distribution for state S_i, namely a_ij.
5. Set t = t + 1; return to step 3 if t ≤ T; otherwise terminate the procedure.

Notation: λ = (A, B, Π) denotes the HMM.

t:            1    2    3    4    5    6   ...   T
state:       q_1  q_2  q_3  q_4  q_5  q_6  ...  q_T
observation: O_1  O_2  O_3  O_4  O_5  O_6  ...  O_T
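The generator above translates directly into a short sketch (assuming discrete emissions and numpy arrays: A is N×N, B is N×M, pi has length N; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(A, B, pi, T):
    """Sample a (state sequence, observation sequence) pair of length T
    from the HMM lambda = (A, B, pi), following steps 1-5 above."""
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)                   # step 1: q_1 ~ Pi
    for _ in range(T):                              # steps 2-5
        states.append(q)
        obs.append(rng.choice(B.shape[1], p=B[q]))  # step 3: O_t ~ b_q(.)
        q = rng.choice(A.shape[0], p=A[q])          # step 4: q_{t+1} ~ a_q.
    return states, obs
```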


Three Basic HMM Problems

Problem 1: Given the observation sequence O = O_1 O_2 ... O_T and a model λ = (A, B, Π), how do we (efficiently) compute P(O | λ), the probability of the observation sequence?

Problem 2: Given the observation sequence O = O_1 O_2 ... O_T, how do we choose a state sequence Q = q_1 q_2 ... q_T which is optimal in some meaningful sense?

Problem 3: How do we adjust the model parameters λ = (A, B, Π) to maximize P(O | λ)?

Interpretation:

Problem 1: the evaluation or scoring problem.
Problem 2: the learn-structure problem.
Problem 3: the training problem.


Solution to Problem 1: P(O | λ)

Consider a fixed state sequence (there are N^T such sequences):

Q = q_1 q_2 ... q_T

Then

P(O | Q, λ) = b_{q_1}(O_1) b_{q_2}(O_2) ... b_{q_T}(O_T)

and

P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}

so that

P(O, Q | λ) = P(O | Q, λ) P(Q | λ)

Finally,

P(O | λ) = Σ_{all Q} P(O, Q | λ)
         = Σ_{q_1, q_2, ..., q_T} π_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) ... a_{q_{T-1} q_T} b_{q_T}(O_T)

Calculations required ≈ 2T N^T; for N = 5, T = 100 this is 2 · 100 · 5^100 ≈ 10^72 computations!


The "Forward" Procedure

Consider the forward variable, α_t(i), defined as the probability of the partial observation sequence (up to time t) and state S_i at time t, given the model, i.e.,

α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ)

Inductively solve for α_t(i) as:

1. Initialization:
   α_1(i) = π_i b_i(O_1),  1 ≤ i ≤ N
2. Induction:
   α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(O_{t+1}),  1 ≤ t ≤ T-1, 1 ≤ j ≤ N
3. Termination:
   P(O | λ) = Σ_{i=1}^{N} P(O_1 O_2 ... O_T, q_T = S_i | λ) = Σ_{i=1}^{N} α_T(i)

Computation: N^2 T versus 2T N^T; for N = 5, T = 100: about 2500 versus 10^72.
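The forward procedure is a direct transcription into code (a sketch assuming discrete emissions; A is N×N, B is N×M, pi has length N, and obs is a list of symbol indices):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: returns alpha (T x N) and P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization
    for t in range(1, T):                          # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                  # termination: P(O | lambda)
```

For long T the unscaled α's underflow; the scaling discussed under Implementation Issues later in the lecture fixes this.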


The "Forward" Procedure

(figure slide)


The "Backward" Algorithm

Consider the backward variable, β_t(i), defined as the probability of the partial observation sequence from t+1 to the end, given state S_i at time t and the model, i.e.,

β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ)

Inductive solution:

1. Initialization:
   β_T(i) = 1,  1 ≤ i ≤ N
2. Induction:
   β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j),  t = T-1, T-2, ..., 1, 1 ≤ i ≤ N

This requires ≈ N^2 T calculations, the same as in the forward case.
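The backward pass is the mirror image of the forward sketch above (same illustrative array conventions):

```python
import numpy as np

def backward(A, B, obs):
    """Backward procedure: returns beta (T x N)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))                         # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                 # induction: t = T-1, ..., 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```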


Solution to Problem 2: Optimal State Sequence

1. Choose the states q_t which are individually most likely ⇒ maximizes the expected number of correct individual states.
2. Choose the states q_t which are pair-wise most likely ⇒ maximizes the expected number of correct state pairs.
3. Choose the states q_t which are triple-wise most likely ⇒ maximizes the expected number of correct state triples.
4. Choose the states q_t which are T-wise most likely ⇒ find the single best state sequence which maximizes P(Q, O | λ).

This last solution is often called the Viterbi state sequence because it is found using the Viterbi algorithm.


Maximize Individual States

We define γ_t(i) as the probability of being in state S_i at time t, given the observation sequence and the model, i.e.,

γ_t(i) = P(q_t = S_i | O, λ) = P(q_t = S_i, O | λ) / P(O | λ)

Then

γ_t(i) = α_t(i) β_t(i) / P(O | λ) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)

with

Σ_{i=1}^{N} γ_t(i) = 1,  ∀ t

so that

q_t* = argmax_{1 ≤ i ≤ N} [γ_t(i)],  1 ≤ t ≤ T

Problem: q_t* need not obey the state transition constraints.


Best State Sequence: The Viterbi Algorithm

Define δ_t(i) as the highest probability along a single path, at time t, which accounts for the first t observations and ends in state i, i.e.,

δ_t(i) = max_{q_1, q_2, ..., q_{t-1}} P[q_1 q_2 ... q_{t-1}, q_t = i, O_1 O_2 ... O_t | λ]

We must keep track of the state sequence which gave the best path, at time t, to state i. We do this in the array ψ_t(i).


The Viterbi Algorithm

Step 1: Initialization

  δ_1(i) = π_i b_i(O_1),  1 ≤ i ≤ N
  ψ_1(i) = 0,  1 ≤ i ≤ N

Step 2: Recursion

  δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t-1}(i) a_ij] b_j(O_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N
  ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ_{t-1}(i) a_ij],  2 ≤ t ≤ T, 1 ≤ j ≤ N

Step 3: Termination

  P* = max_{1 ≤ i ≤ N} [δ_T(i)]
  q_T* = argmax_{1 ≤ i ≤ N} [δ_T(i)]

Step 4: Path (State Sequence) Backtracking

  q_t* = ψ_{t+1}(q_{t+1}*),  t = T-1, T-2, ..., 1

Calculation: ≈ N^2 T operations (×, +).


Alternative Viterbi Implementation

Precompute logarithms of the model parameters:

  π̃_i = log(π_i),  1 ≤ i ≤ N
  b̃_i(O_t) = log[b_i(O_t)],  1 ≤ i ≤ N, 1 ≤ t ≤ T
  ã_ij = log[a_ij],  1 ≤ i, j ≤ N

Step 1: Initialization

  δ̃_1(i) = log(δ_1(i)) = π̃_i + b̃_i(O_1),  1 ≤ i ≤ N
  ψ_1(i) = 0,  1 ≤ i ≤ N

Step 2: Recursion

  δ̃_t(j) = log(δ_t(j)) = max_{1 ≤ i ≤ N} [δ̃_{t-1}(i) + ã_ij] + b̃_j(O_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N
  ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ̃_{t-1}(i) + ã_ij],  2 ≤ t ≤ T, 1 ≤ j ≤ N

Step 3: Termination

  P̃* = max_{1 ≤ i ≤ N} [δ̃_T(i)]
  q_T* = argmax_{1 ≤ i ≤ N} [δ̃_T(i)]

Step 4: Backtracking

  q_t* = ψ_{t+1}(q_{t+1}*),  t = T-1, T-2, ..., 1

Calculation: ≈ N^2 T additions.
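The log-domain recursion above translates to a compact sketch (same illustrative array conventions as the earlier forward sketch; zero-probability entries become -inf under np.log, which max/argmax handle correctly):

```python
import numpy as np

def viterbi_log(A, B, pi, obs):
    """Log-domain Viterbi: returns the best state path and its log probability."""
    T, N = len(obs), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = logpi + logB[:, obs[0]]             # step 1: initialization
    for t in range(1, T):                          # step 2: recursion
        scores = delta[t - 1][:, None] + logA      # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[-1].argmax())]               # step 3: termination
    for t in range(T - 1, 0, -1):                  # step 4: backtracking
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```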


Problem

Given the model of the coin toss experiment used earlier (i.e., 3 different coins) with probabilities:

        State 1   State 2   State 3
P(H)     0.5       0.75      0.25
P(T)     0.5       0.25      0.75

with all state transition probabilities equal to 1/3, and with initial state probabilities equal to 1/3: for the observation sequence O = H H H H T H T T T T, find the Viterbi path of maximum likelihood.


Problem Solution

Since all a_ij terms are equal to 1/3, we can omit these terms (as well as the initial state probability term), giving:

δ_1(1) = 0.5,  δ_1(2) = 0.75,  δ_1(3) = 0.25

The recursion for δ_t(j) gives (2 ≤ t ≤ 10):

δ_2(1) = (0.75)(0.5),     δ_2(2) = (0.75)^2,        δ_2(3) = (0.75)(0.25)
δ_3(1) = (0.75)^2 (0.5),  δ_3(2) = (0.75)^3,        δ_3(3) = (0.75)^2 (0.25)
δ_4(1) = (0.75)^3 (0.5),  δ_4(2) = (0.75)^4,        δ_4(3) = (0.75)^3 (0.25)
δ_5(1) = (0.75)^4 (0.5),  δ_5(2) = (0.75)^4 (0.25), δ_5(3) = (0.75)^5
δ_6(1) = (0.75)^5 (0.5),  δ_6(2) = (0.75)^6,        δ_6(3) = (0.75)^5 (0.25)
δ_7(1) = (0.75)^6 (0.5),  δ_7(2) = (0.75)^6 (0.25), δ_7(3) = (0.75)^7
δ_8(1) = (0.75)^7 (0.5),  δ_8(2) = (0.75)^7 (0.25), δ_8(3) = (0.75)^8
δ_9(1) = (0.75)^8 (0.5),  δ_9(2) = (0.75)^8 (0.25), δ_9(3) = (0.75)^9
δ_10(1) = (0.75)^9 (0.5), δ_10(2) = (0.75)^9 (0.25), δ_10(3) = (0.75)^10

This leads to a diagram (trellis) of the form shown in the accompanying figure.
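Running the log-domain viterbi_log sketch from earlier on this model reproduces the hand computation (indices are 0-based in the code: H = 0, T = 1):

```python
import numpy as np

A = np.full((3, 3), 1 / 3)
B = np.array([[0.50, 0.50],    # state 1: P(H), P(T)
              [0.75, 0.25],    # state 2
              [0.25, 0.75]])   # state 3
pi = np.full(3, 1 / 3)
O = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]   # H H H H T H T T T T

path, logp = viterbi_log(A, B, pi, O)
print([s + 1 for s in path])   # [2, 2, 2, 2, 3, 2, 3, 3, 3, 3]
print(np.exp(logp))            # equals (0.75)**10 * (1/3)**10
```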


Solution to Problem 3: The Training Problem

• no globally optimum solution is known
• all solutions yield local optima
  – can get a solution via gradient techniques
  – can use a re-estimation procedure such as the Baum-Welch (EM) method
• consider re-estimation procedures
  – basic idea: given a current model estimate, λ, compute expected values of model events, then refine the model based on the computed values:

    λ^(0) --E[model events]--> λ^(1) --E[model events]--> λ^(2) --> ···

Define ξ_t(i, j) as the probability of being in state S_i at time t and state S_j at time t+1, given the model and the observation sequence, i.e.,

ξ_t(i, j) = P[q_t = S_i, q_{t+1} = S_j | O, λ]


The Training Problem

ξ_t(i, j) = P[q_t = S_i, q_{t+1} = S_j | O, λ]

(figure slide)


The Training Problem

ξ_t(i, j) = P[q_t = S_i, q_{t+1} = S_j | O, λ]
          = P[q_t = S_i, q_{t+1} = S_j, O | λ] / P(O | λ)
          = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)
          = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j)

γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)

Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from S_i
Σ_{t=1}^{T-1} ξ_t(i, j) = expected number of transitions from S_i to S_j


Re-estimation Formulas

π̄_i = expected number of times in state S_i at t = 1
    = γ_1(i)

ā_ij = (expected number of transitions from state S_i to state S_j) / (expected number of transitions from state S_i)
     = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)

b̄_j(k) = (expected number of times in state j with symbol v_k) / (expected number of times in state j)
        = Σ_{t=1, O_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
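Putting the pieces together, one Baum-Welch re-estimation iteration can be sketched as follows (unscaled, for a single short discrete observation sequence; real implementations add the scaling described under Implementation Issues; all names are illustrative):

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation iteration for a discrete HMM."""
    T, N = len(obs), len(pi)
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                        # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    pO = alpha[-1].sum()                         # P(O | lambda)

    gamma = alpha * beta / pO                    # gamma_t(i)
    # xi[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / pO

    new_pi = gamma[0]                            # pi_i = gamma_1(i)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                  # numerator: sum over t with O_t = v_k
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi
```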


Re-estimation Formulas

If λ = (A, B, Π) is the initial model, and λ̄ = (Ā, B̄, Π̄) is the re-estimated model, then it can be proven that either:

1. the initial model, λ, defines a critical point of the likelihood function, in which case λ̄ = λ, or
2. model λ̄ is more likely than model λ in the sense that P(O | λ̄) > P(O | λ), i.e., we have found a new model λ̄ from which the observation sequence is more likely to have been produced.

Conclusion: Iteratively use λ̄ in place of λ and repeat the re-estimation until some limiting point is reached. The resulting model is called the maximum likelihood (ML) HMM.


Re-estimation Formulas

1. The re-estimation formulas can be derived by maximizing the auxiliary function Q(λ, λ̄) over λ̄, i.e.,

   Q(λ, λ̄) = Σ_q P(O, q | λ) log[P(O, q | λ̄)]

   It can be proved that:

   max_{λ̄} [Q(λ, λ̄)] ⇒ P(O | λ̄) ≥ P(O | λ)

   Eventually the likelihood function converges to a critical point.

2. Relation to the EM algorithm:
   • the E (expectation) step is the calculation of the auxiliary function, Q(λ, λ̄)
   • the M (modification) step is the maximization over λ̄


Notes on Re-estimation

1. The stochastic constraints on π̄_i, ā_ij, and b̄_j(k) are automatically met, i.e.,

   Σ_{i=1}^{N} π̄_i = 1,  Σ_{j=1}^{N} ā_ij = 1,  Σ_{k=1}^{M} b̄_j(k) = 1

2. At the critical points of P = P(O | λ):

   π̄_i = π_i (∂P/∂π_i) / Σ_{k=1}^{N} π_k (∂P/∂π_k) = π_i

   ā_ij = a_ij (∂P/∂a_ij) / Σ_{k=1}^{N} a_ik (∂P/∂a_ik) = a_ij

   b̄_j(k) = b_j(k) (∂P/∂b_j(k)) / Σ_{l=1}^{M} b_j(l) (∂P/∂b_j(l)) = b_j(k)

   ⇒ at the critical points, the re-estimation formulas are exactly correct.


Variations on HMM's

1. Types of HMM: model structures
2. Continuous observation density models: mixtures
3. Autoregressive HMM's: LPC links
4. Null transitions and tied states
5. Inclusion of explicit state duration density in HMM's
6. Optimization criterion: ML, MMI, MDI


Types of HMM

1. Ergodic models: no transient states.
2. Left-right models: all transient states (except the last state), with the constraints:

   π_i = 1 for i = 1, and π_i = 0 for i ≠ 1
   a_ij = 0,  j < i

   Controlled transitions additionally imply:

   a_ij = 0,  j > i + Δ  (Δ = 1, 2 typically)

3. Mixed forms of ergodic and left-right models (e.g., parallel branches).

Note: the constraints of left-right models don't affect the re-estimation formulas (i.e., a parameter initially set to 0 remains at 0 during re-estimation).


Types of HMM

(figure slide: ergodic model, left-right model, mixed model)


Continuous Observation Density HMM's

The most general form of pdf with a valid re-estimation procedure is a mixture density:

b_j(x) = Σ_{m=1}^{M} c_jm 𝒩[x, μ_jm, U_jm],  1 ≤ j ≤ N

where

x = observation vector = {x_1, x_2, ..., x_D}
M = number of mixture densities
c_jm = gain of the m-th mixture in state j
𝒩 = any log-concave or elliptically symmetric density (e.g., a Gaussian)
μ_jm = mean vector for mixture m, state j
U_jm = covariance matrix for mixture m, state j

with the constraints

c_jm ≥ 0,  1 ≤ j ≤ N, 1 ≤ m ≤ M
Σ_{m=1}^{M} c_jm = 1,  1 ≤ j ≤ N
∫_{-∞}^{∞} b_j(x) dx = 1,  1 ≤ j ≤ N
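For the common case of diagonal-covariance Gaussian mixtures, evaluating b_j(x) for one state is straightforward (a sketch; the arguments are one state's parameters, and the names are illustrative):

```python
import numpy as np

def gmm_density(x, c, mu, var):
    """b_j(x) for one state: mixture of diagonal-covariance Gaussians.
    c: (M,) mixture gains, mu: (M, D) means, var: (M, D) diagonal variances."""
    D = mu.shape[1]
    diff = x[None, :] - mu                                    # (M, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.log(var).sum(axis=1))
    log_comp = log_norm - 0.5 * (diff**2 / var).sum(axis=1)   # per-mixture log density
    return float(np.exp(log_comp) @ c)                        # sum_m c_m N(x; mu_m, var_m)
```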


State Equivalence Chart

(figure slide: equivalence of a state with a mixture density to a multi-state, single-mixture case)


Re-estimation for Mixture Densities

c̄_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k)

μ̄_jk = Σ_{t=1}^{T} γ_t(j, k) O_t / Σ_{t=1}^{T} γ_t(j, k)

Ū_jk = Σ_{t=1}^{T} γ_t(j, k) (O_t - μ_jk)(O_t - μ_jk)′ / Σ_{t=1}^{T} γ_t(j, k)

where γ_t(j, k) is the probability of being in state j at time t with the k-th mixture component accounting for O_t:

γ_t(j, k) = [α_t(j) β_t(j) / Σ_{j=1}^{N} α_t(j) β_t(j)] · [c_jk 𝒩(O_t, μ_jk, U_jk) / Σ_{m=1}^{M} c_jm 𝒩(O_t, μ_jm, U_jm)]


Autoregressive HMM

Consider an observation vector O = (x_0, x_1, ..., x_{K-1}), where each x_k is a waveform sample and O represents a frame of the signal (e.g., K = 256 samples). We assume x_k is related to previous samples of O by a Gaussian autoregressive process of order p, i.e.,

x_k = - Σ_{i=1}^{p} a_i x_{k-i} + e_k,  0 ≤ k ≤ K-1

where the e_k are Gaussian, independent, identically distributed random variables with zero mean and variance σ^2, and a_i, 1 ≤ i ≤ p, are the autoregressive or predictor coefficients. As K → ∞,

f(O) = (2πσ^2)^{-K/2} exp{ -δ(O, a) / (2σ^2) }

where

δ(O, a) = r_a(0) r(0) + 2 Σ_{i=1}^{p} r_a(i) r(i)


Autoregressive HMM

r_a(i) = Σ_{n=0}^{p-i} a_n a_{n+i},  (a_0 = 1), 1 ≤ i ≤ p

r(i) = Σ_{n=0}^{K-i-1} x_n x_{n+i},  0 ≤ i ≤ p

a′ = [1, a_1, a_2, ..., a_p]

The prediction residual is:

α = E[ Σ_{i=1}^{K} (e_i)^2 ] = K σ^2

Consider the normalized observation vector

Ô = O / √α = O / √(K σ^2)

for which

f(Ô) = (2π)^{-K/2} exp{ -(K/2) δ(Ô, a) }

In practice, K is replaced by K̂, the effective frame length, e.g., K̂ = K/3 for a frame overlap of 3 to 1.


Application of Autoregressive HMM

b_j(O) = Σ_{m=1}^{M} c_jm b_jm(O)

b_jm(O) = (2π)^{-K/2} exp{ -(K/2) δ(O, a_jm) }

Each mixture is characterized by a predictor vector, or by an autocorrelation vector from which the predictor vector can be derived. The re-estimation formula for r̄_jk is:

r̄_jk = Σ_{t=1}^{T} γ_t(j, k) r_t / Σ_{t=1}^{T} γ_t(j, k)

γ_t(j, k) = [α_t(j) β_t(j) / Σ_{j=1}^{N} α_t(j) β_t(j)] · [c_jk b_jk(O_t) / Σ_{m=1}^{M} c_jm b_jm(O_t)]


Null Transitions and Tied States

Null transitions: transitions which produce no output and take no time, denoted by φ.

Tied states: set up an equivalence relation between HMM parameters in different states:
– the number of independent parameters of the model is reduced
– parameter estimation becomes simpler
– useful in cases where there is insufficient training data for reliable estimation of all model parameters


Null Transitions

(figure slide)


Inclusion of Explicit State Duration Density

For standard HMM's, the duration density is:

p_i(d) = probability of exactly d observations in state S_i
       = (a_ii)^{d-1} (1 - a_ii)

With an arbitrary state duration density, p_i(d), observations are generated as follows:

1. an initial state, q_1 = S_i, is chosen according to the initial state distribution, π_i
2. a duration d_1 is chosen according to the state duration density p_{q_1}(d_1)
3. observations O_1 O_2 ... O_{d_1} are chosen according to the joint density b_{q_1}(O_1 O_2 ... O_{d_1}); generally we assume independence, so
   b_{q_1}(O_1 O_2 ... O_{d_1}) = Π_{t=1}^{d_1} b_{q_1}(O_t)
4. the next state, q_2 = S_j, is chosen according to the state transition probabilities, a_{q_1 q_2}, with the constraint that a_{q_1 q_1} = 0, i.e., no transition back to the same state can occur.


Explicit State Duration Density

(figure slide: standard HMM versus HMM with explicit state duration density)


Explicit State Duration Density

t:             1 ... d_1        d_1+1 ... d_1+d_2        d_1+d_2+1 ... d_1+d_2+d_3
state:         q_1              q_2                      q_3
duration:      d_1              d_2                      d_3
observations:  O_1 ... O_{d_1}  O_{d_1+1} ... O_{d_1+d_2}  O_{d_1+d_2+1} ... O_{d_1+d_2+d_3}

Assume:
1. the first state, q_1, begins at t = 1
2. the last state, q_r, ends at t = T

⇒ entire duration intervals are included within the observation sequence O_1 O_2 ... O_T.

Modified α:

α_t(i) = P(O_1 O_2 ... O_t, S_i ending at t | λ)

Assume r states in the first t observations, i.e.,

Q = {q_1 q_2 ... q_r} with q_r = S_i
D = {d_1 d_2 ... d_r} with Σ_{s=1}^{r} d_s = t


Explicit State Duration Density

Then we have:

α_t(i) = Σ_Q Σ_D π_{q_1} p_{q_1}(d_1) P(O_1 O_2 ... O_{d_1} | q_1)
         · a_{q_1 q_2} p_{q_2}(d_2) P(O_{d_1+1} ... O_{d_1+d_2} | q_2) · ...
         · a_{q_{r-1} q_r} p_{q_r}(d_r) P(O_{d_1+d_2+...+d_{r-1}+1} ... O_t | q_r)

By induction:

α_t(j) = Σ_{i=1}^{N} Σ_{d=1}^{D} α_{t-d}(i) a_ij p_j(d) Π_{s=t-d+1}^{t} b_j(O_s)

Initialization of α_t(i):

α_1(i) = π_i p_i(1) b_i(O_1)

α_2(i) = π_i p_i(2) Π_{s=1}^{2} b_i(O_s) + Σ_{j=1, j≠i}^{N} α_1(j) a_ji p_i(1) b_i(O_2)

α_3(i) = π_i p_i(3) Π_{s=1}^{3} b_i(O_s) + Σ_{d=1}^{2} Σ_{j=1, j≠i}^{N} α_{3-d}(j) a_ji p_i(d) Π_{s=4-d}^{3} b_i(O_s)

and finally

P(O | λ) = Σ_{i=1}^{N} α_T(i)


Explicit State Duration Density

• Re-estimation formulas for a_ij, b_i(k), and p_i(d) can be formulated and appropriately interpreted.
• Modifications to Viterbi scoring are required, i.e.,

  δ_t(i) = P(O_1 O_2 ... O_t, q_1 q_2 ... q_r = S_i ending at t | λ)

  with the basic recursion:

  δ_t(i) = max_{1 ≤ j ≤ N, j ≠ i} max_{1 ≤ d ≤ D} [δ_{t-d}(j) a_ji p_i(d) Π_{s=t-d+1}^{t} b_i(O_s)]

• Storage is required for δ_{t-1} ... δ_{t-D} ⇒ N · D locations.
• The maximization involves all terms, not just the old δ's and a_ji as in the previous case ⇒ a significantly larger computational load, ≈ (D^2/2) N^2 T computations involving b_i(O_s).

Example: N = 5, D = 20

              implicit duration   explicit duration
storage              5                  100
computation        2500               500,000


Issues with Explicit State Duration Density

1. the quality of the signal modeling is often improved significantly
2. significant increase in the number of parameters per state (D duration estimates)
3. significant increase in the computation associated with the probability calculation (by ≈ D^2/2)
4. insufficient data to give good p_i(d) estimates

Alternatives:

1. use a parametric state duration density:

   p_i(d) = 𝒩(d, μ_i, σ_i^2)  (Gaussian)

   p_i(d) = η_i^{ν_i} d^{ν_i - 1} e^{-η_i d} / Γ(ν_i)  (Gamma)

2. incorporate state duration information after the probability calculation, e.g., in a post-processor


Alternatives to ML Estimation

Assume we wish to design V different HMM's, λ_1, λ_2, ..., λ_V. Normally we design each HMM, λ_v, based on a training set of observations, O^v, using a maximum likelihood (ML) criterion, i.e.,

P_v* = max_{λ_v} P[O^v | λ_v]

Consider instead the mutual information, I_v, between the observation sequence, O^v, and the complete set of models λ = (λ_1, λ_2, ..., λ_V):

I_v = log P(O^v | λ_v) - log Σ_{w=1}^{V} P(O^v | λ_w)

Maximizing I_v over λ gives

I_v* = max_λ [log P(O^v | λ_v) - log Σ_{w=1}^{V} P(O^v | λ_w)]

i.e., choose λ so as to separate the correct model, λ_v, from all other models, as much as possible, for the training set O^v.


Alternatives to ML Estimation

Sum over all such training sets to design the models according to an MMI criterion, i.e.,

I* = max_λ Σ_{v=1}^{V} [log P(O^v | λ_v) - log Σ_{w=1}^{V} P(O^v | λ_w)]

with solution via steepest descent methods.


Comparison of HMM's

Problem: given two HMM's, λ_1 and λ_2, is it possible to give a measure of how similar the two models are?

Example (parameters p, q and r, s refer to the two 2-state models in the slide's figure): for equivalence, (A_1, B_1) ⇔ (A_2, B_2), we require P(O_t = v_k) to be the same for both models and for all symbols v_k. Thus we require:

pq + (1 - p)(1 - q) = rs + (1 - r)(1 - s)

which gives

s = (p + q - 2pq - r) / (1 - 2r)

Let p = 0.6, q = 0.7, r = 0.2; then s = 13/30 ≈ 0.433.


Comparison of HMM's

Thus the two models have very different A and B matrices, but are equivalent in the sense that all symbol probabilities (averaged over time) are the same.

We generalize the concept of model distance (dis-similarity) by defining a distance measure, D(λ_1, λ_2), between two Markov sources, λ_1 and λ_2, as:

D(λ_1, λ_2) = (1/T) [log P(O_T^(2) | λ_1) - log P(O_T^(2) | λ_2)]

where O_T^(2) is a sequence of observations generated by model λ_2 and scored by both models.

We symmetrize D by using the relation:

D_S(λ_1, λ_2) = [D(λ_1, λ_2) + D(λ_2, λ_1)] / 2


Implementation Issues for HMM's

1. Scaling: to prevent underflow and/or overflow.
2. Multiple observation sequences: to train left-right models.
3. Initial estimates of HMM parameters: to provide robust models.
4. Effects of insufficient training data.


Scaling

α_t(i) is a sum of a large number of terms, each of the form:

[Π_{s=1}^{t-1} a_{q_s q_{s+1}}] [Π_{s=1}^{t} b_{q_s}(O_s)]

Since each a and b term is less than 1, as t gets larger, α_t(i) exponentially heads to 0. Thus scaling is required to prevent underflow.

• Consider scaling α_t(i) by the factor

  c_t = 1 / Σ_{i=1}^{N} α_t(i),  independent of i

• We denote the scaled α's as:

  α̂_t(i) = c_t α_t(i) = α_t(i) / Σ_{i=1}^{N} α_t(i)

so that

  Σ_{i=1}^{N} α̂_t(i) = 1


Scaling

• For fixed t, we compute

  α_t(i) = Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(O_t)

• Scaling gives

  α̂_t(i) = Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(O_t) / Σ_{i=1}^{N} Σ_{j=1}^{N} α̂_{t-1}(j) a_ji b_i(O_t)

• By induction we get

  α̂_{t-1}(j) = [Π_{τ=1}^{t-1} c_τ] α_{t-1}(j)

• giving

  α̂_t(i) = Σ_{j=1}^{N} α_{t-1}(j) [Π_{τ=1}^{t-1} c_τ] a_ji b_i(O_t) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_{t-1}(j) [Π_{τ=1}^{t-1} c_τ] a_ji b_i(O_t)
         = α_t(i) / Σ_{i=1}^{N} α_t(i)

i.e., each α_t(i) is effectively scaled by the sum of α_t(i) over all states.


Scaling

• For scaling the β_t(i) terms we use the same scale factors as for the α_t(i) terms, i.e.,

  β̂_t(i) = c_t β_t(i)

  since the magnitudes of the α and β terms are comparable.

• The re-estimation formula for a_ij in terms of the scaled α's and β's is:

  ā_ij = Σ_{t=1}^{T-1} α̂_t(i) a_ij b_j(O_{t+1}) β̂_{t+1}(j) / Σ_{j=1}^{N} Σ_{t=1}^{T-1} α̂_t(i) a_ij b_j(O_{t+1}) β̂_{t+1}(j)

• We have:

  α̂_t(i) = [Π_{τ=1}^{t} c_τ] α_t(i) = C_t α_t(i)

  β̂_{t+1}(j) = [Π_{τ=t+1}^{T} c_τ] β_{t+1}(j) = D_{t+1} β_{t+1}(j)


Scaling

giving

ā_ij = Σ_{t=1}^{T-1} C_t α_t(i) a_ij b_j(O_{t+1}) D_{t+1} β_{t+1}(j) / Σ_{j=1}^{N} Σ_{t=1}^{T-1} C_t α_t(i) a_ij b_j(O_{t+1}) D_{t+1} β_{t+1}(j)

But

C_t D_{t+1} = Π_{τ=1}^{t} c_τ · Π_{τ=t+1}^{T} c_τ = Π_{τ=1}^{T} c_τ = C_T

independent of t, so these factors cancel out of the ratio.

Notes on Scaling:

1. the scaling procedure works equally well on the π or B coefficients
2. scaling need not be performed at each iteration; set c_t = 1 whenever scaling is skipped
3. we can solve for P(O | λ) from the scaled coefficients: since

   Π_{t=1}^{T} c_t · Σ_{i=1}^{N} α_T(i) = C_T Σ_{i=1}^{N} α_T(i) = 1

   we get

   P(O | λ) = Σ_{i=1}^{N} α_T(i) = 1 / Π_{t=1}^{T} c_t

   log P(O | λ) = - Σ_{t=1}^{T} log(c_t)
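Putting the notes together, a scaled forward pass that returns log P(O | λ) = -Σ_t log(c_t) looks like this (same illustrative array conventions as the earlier sketches):

```python
import numpy as np

def forward_scaled(A, B, pi, obs):
    """Scaled forward pass; returns log P(O | lambda) = -sum_t log(c_t)."""
    T = len(obs)
    alpha_hat = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(T):
        if t > 0:
            alpha_hat = (alpha_hat @ A) * B[:, obs[t]]
        c_t = 1.0 / alpha_hat.sum()          # scale factor, independent of i
        alpha_hat = c_t * alpha_hat          # scaled alphas now sum to 1
        log_p -= np.log(c_t)                 # accumulate log P(O | lambda)
    return log_p
```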


Multiple Observation Sequences

For left-right models, we need to use multiple sequences of observations for training. Assume a set of K observation sequences (i.e., training utterances):

O = [O^(1), O^(2), ..., O^(K)]

where

O^(k) = [O_1^(k) O_2^(k) ... O_{T_k}^(k)]

We wish to maximize the probability

P(O | λ) = Π_{k=1}^{K} P(O^(k) | λ) = Π_{k=1}^{K} P_k

The re-estimation formula for a_ij becomes:

ā_ij = Σ_{k=1}^{K} Σ_{t=1}^{T_k - 1} α_t^k(i) a_ij b_j(O_{t+1}^(k)) β_{t+1}^k(j) / Σ_{k=1}^{K} Σ_{t=1}^{T_k - 1} α_t^k(i) β_t^k(i)

Scaling requires:

ā_ij = Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k - 1} α̂_t^k(i) a_ij b_j(O_{t+1}^(k)) β̂_{t+1}^k(j) / Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k - 1} α̂_t^k(i) β̂_t^k(i)

where all the scaling factors cancel out.


Initial Estimates of HMM Parameters

N: choose based on physical considerations
M: choose based on model fits
π_i: random or uniform (π_i ≠ 0)
a_ij: random or uniform (a_ij ≠ 0)
b_j(k): random or uniform (b_j(k) ≥ ε)
b_j(O): need good initial estimates of the mean vectors; need reasonable estimates of the covariance matrices


Effects of Insufficient Training Data

Insufficient training data leads to poor estimates of the model parameters.

Possible solutions:

1. use more training data -- often this is impractical
2. reduce the size of the model -- often there are physical reasons for keeping a chosen model size
3. add extra constraints to the model parameters:

   b_j(k) ≥ ε
   U_jk(r, r) ≥ δ

   often the model performance is relatively insensitive to the exact choice of ε, δ
4. use the method of deleted interpolation:

   λ̄ = ε λ + (1 - ε) λ′


Methods for Insufficient Data

(figure slide: performance insensitivity to ε)


Deleted Interpolation

(figure slide)


Isolated Word Recognition Using HMM's

Assume a vocabulary of V words, with K occurrences of each spoken word in a training set. Observation vectors are spectral characterizations of the word. For isolated word recognition, we do the following:

1. For each word, v, in the vocabulary, we build an HMM, λ^v, i.e., we re-estimate the model parameters (A, B, Π) that optimize the likelihood of the training set observation vectors for the v-th word. (TRAINING)
2. For each unknown word which is to be recognized:
   a. measure the observation sequence O = [O_1 O_2 ... O_T]
   b. calculate the model likelihoods P(O | λ^v), 1 ≤ v ≤ V
   c. select the word whose model likelihood score is highest:

      v* = argmax_{1 ≤ v ≤ V} [P(O | λ^v)]

Computation on the order of V · N^2 T is required; for V = 100, N = 5, T = 40 this is about 10^5 computations.
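The recognition step itself is then just an argmax over per-word model scores (a sketch reusing the scaled forward pass from earlier; `models`, mapping each vocabulary word to its trained (A, B, pi), is an assumed illustrative structure):

```python
def recognize(obs, models):
    """Step 2 of the recognizer: score O against every word model, pick the best."""
    scores = {word: forward_scaled(A, B, pi, obs)   # log P(O | lambda_v)
              for word, (A, B, pi) in models.items()}
    best = max(scores, key=scores.get)              # v* = argmax_v log P(O | lambda_v)
    return best, scores
```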


Isolated Word HMM Recognizer

(figure slide)


Choice of Model Parameters

1. A left-right model is preferable to an ergodic model (speech is a left-right process).
2. Number of states in the range 2-40 (from sounds to frames):
   • on the order of the number of distinct sounds in the word
   • on the order of the average number of observations in the word
3. Observation vectors:
   • cepstral coefficients (and their second and third order derivatives) derived from LPC (1-9 mixtures), diagonal covariance matrices
   • vector-quantized discrete symbols (16-256 codebook sizes)
4. Constraints on the b_j(O) densities:
   • b_j(k) > ε for discrete densities
   • c_jm > δ, U_jm(r, r) > δ for continuous densities


Performance vs. Number of States in Model

(figure slide)


HMM Feature Vector Densities

(figure slide)


Segmental K-Means Segmentation into States

Motivation: derive good estimates of the b_j(O) densities, as required for rapid convergence of the re-estimation procedure.

Initially: a training set of multiple sequences of observations, and an initial model estimate.

Procedure: segment each observation sequence into states using a Viterbi procedure.

For discrete observation densities, code all observations in state j using the M-codeword codebook, giving:

b_j(k) = (number of vectors with codebook index k in state j) / (number of vectors in state j)

For continuous observation densities, cluster the observations in state j into a set of M clusters, giving:


Segmental K-Means Segmentation into States

c_jm = (number of vectors assigned to cluster m of state j) / (number of vectors in state j)

μ_jm = sample mean of the vectors assigned to cluster m of state j

U_jm = sample covariance of the vectors assigned to cluster m of state j

Use as the estimate of the state transition probabilities:

a_ii = (number of vectors in state i minus the number of observation sequences for the training word) / (number of vectors in state i)

a_{i,i+1} = 1 - a_ii

The segmenting HMM is then updated, and the procedure is iterated until a converged model is obtained.


Segmental K-Means Training

(figure slide)


HMM Segmentation for /SIX/

(figure slide)


Digit Recognition Using HMM's

(figure slide: log energy of an unknown digit, frame likelihood scores, frame cumulative scores, and state segmentation for the models "one" and "nine")


Digit Recognition Using HMM's

(figure slide: log energy of an unknown digit, frame likelihood scores, frame cumulative scores, and state segmentation for the models "seven" and "six")
