HMM Parameter Tying - University of Birmingham

THE UNIVERSITY OF BIRMINGHAM
SCHOOL OF ELECTRONIC & ELECTRICAL ENGINEERING
Digital Systems & Vision Processing

HMM Parameter Tying
Version 1 February 2002

Tying, 17-Feb-01, SLIDE 1


The Problem

• Good recognition accuracy requires context-sensitive phone-level models
• If there are 50 phones, the maximum number of triphone HMMs is $50^3$ = 125,000
• Most are ruled out by phonological constraints
  – most phone triples never occur in speech
• But many are legal


Model Parameters

• Each model has 3 emitting states
• Each state is modelled as, say, a 10-component Gaussian mixture
• Each feature vector is 40-dimensional
• Hence the number of parameters per model is:

$$3 \times (10 \times (40 + 40 + 1) + 9) = 2{,}457$$

  where, reading left to right, 3 is the number of states, 10 the number of mixture components, the first 40 the mean vector, the second 40 the variance vector, 1 the mixture weight, and 9 the transition probabilities


Acoustic model parameters

• So, even if we only have 1,000 acoustic models (instead of 125,000), total acoustic model parameters will be 2,457,000
• Too many to estimate with a practical quantity of data
• Most common solution is HMM parameter tying
• Some different HMMs share the same parameters
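The arithmetic on the preceding slides can be checked with a few lines of Python. This is only a sketch: the function and variable names are illustrative and do not come from the slides, and the transition term is placed inside the per-state bracket exactly as in the slide's formula.

```python
def params_per_model(n_states=3, n_mix=10, dim=40, n_trans=9):
    """Parameter count for one HMM, following the slide's formula term by term."""
    # Each mixture component has a mean vector, a variance vector and one weight.
    per_component = dim + dim + 1
    # The slide adds the 9 transition terms inside the per-state bracket.
    return n_states * (n_mix * per_component + n_trans)


n_phones = 50
max_triphones = n_phones ** 3          # every (left, centre, right) phone triple
per_model = params_per_model()
total_for_1000_models = 1000 * per_model

print(max_triphones)          # 125000
print(per_model)              # 2457
print(total_for_1000_models)  # 2457000
```

Changing `n_mix` or `dim` shows how quickly the total grows, which is the motivation for tying.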


Tied variance

• Variances are more costly to estimate than means
• Simple solution: divide the set of all HMMs into classes, so that within a class all HMM state PDFs have the same variance
• This is tied variance
• If all HMM state PDFs share the same variance, the variance is referred to as grand variance
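One way to picture tied variance is as a pooled estimate: each state keeps its own mean, but the squared deviations of all states in a class are accumulated into one shared variance. The sketch below assumes one diagonal Gaussian per state and hard assignment of feature vectors to states; the function name, state names and toy data are hypothetical.

```python
from collections import defaultdict


def tied_variances(samples_by_state, class_of_state):
    """Pool squared deviations over all states in a class; each state keeps its own mean."""
    sq_dev = {}                  # class -> per-dimension sum of squared deviations
    count = defaultdict(int)     # class -> number of pooled feature vectors
    for state, vectors in samples_by_state.items():
        dim = len(vectors[0])
        mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
        cls = class_of_state[state]
        acc = sq_dev.setdefault(cls, [0.0] * dim)
        for v in vectors:
            for d in range(dim):
                acc[d] += (v[d] - mean[d]) ** 2
        count[cls] += len(vectors)
    return {cls: [s / count[cls] for s in acc] for cls, acc in sq_dev.items()}


# Toy 1-dimensional data for two hypothetical states.
samples = {"state_a": [[0.0], [2.0]], "state_b": [[10.0], [14.0]]}

# Both states are placed in one class, so they share a single variance.
tied = tied_variances(samples, {"state_a": "class_1", "state_b": "class_1"})
print(tied)  # {'class_1': [2.5]}
```

Because every state here belongs to the same class, the result is also the grand variance for this toy system. Mapping the states to different classes would give one variance per class.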


Tied Mixtures

• Another common method is tied (or shared) mixtures
• In a normal Gaussian mixture HMM system, each state is associated with a Gaussian mixture PDF of the form:

$$b(y) = \sum_{m=1}^{M} w_m \, p_m(y)$$

• In a tied mixture system, all of the $p_m$s are chosen from a fixed, finite set of unimodal Gaussian PDFs


Tied Mixtures

• By controlling the number of shared mixture components, the total number of acoustic model parameters is controlled
• The set of shared Gaussian PDFs is like a vector quantiser codebook
• Tied mixture HMMs are also known as semi-continuous HMMs
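A minimal sketch of how a tied mixture system evaluates state output probabilities, assuming diagonal-covariance Gaussians. The codebook values, state names and function names are illustrative, not taken from the slides. Each shared density p_m(y) is computed once per feature vector and then reused by every state, which differs from the others only in its mixture weights.

```python
import math


def gaussian_pdf(y, mean, var):
    """Density of a diagonal-covariance Gaussian at feature vector y."""
    log_p = 0.0
    for yi, mi, vi in zip(y, mean, var):
        log_p -= 0.5 * (math.log(2.0 * math.pi * vi) + (yi - mi) ** 2 / vi)
    return math.exp(log_p)


# Shared codebook of unimodal Gaussians as (mean vector, variance vector); toy values.
CODEBOOK = [([0.0], [1.0]), ([3.0], [1.0])]

# Each state stores only its own mixture weights over the shared codebook.
STATE_WEIGHTS = {"state_1": [1.0, 0.0], "state_2": [0.5, 0.5]}


def state_likelihoods(y):
    """Evaluate b(y) = sum_m w_m p_m(y) for every state, computing each p_m(y) once."""
    densities = [gaussian_pdf(y, mean, var) for mean, var in CODEBOOK]
    return {
        state: sum(w * p for w, p in zip(weights, densities))
        for state, weights in STATE_WEIGHTS.items()
    }


likelihoods = state_likelihoods([0.0])
print(likelihoods)
```

With this layout, adding a state costs only one weight per codebook entry, so the total parameter count is governed mainly by the codebook size.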


Generalised Triphones

• Previous techniques share (or tie) particular model parameters
• An alternative is to share whole HMMs
• In other words, assume that some contexts induce the same effects on the acoustic realisation of a phone, and model their triphones using the same HMM
• These equivalence classes of triphone HMMs are called generalised triphones


Clustered triphones

• Suppose M and N are phone-level HMMs, both with the same number of states
• Can define a distance d(M,N) between M and N in several ways
• E.g. define $d_1(M,N)$ to be the difference between state 1 of M and state 1 of N, i.e. between their output PDFs $b_{1,M}$ and $b_{1,N}$
• Define $d(M,N) = d_1(M,N) + d_2(M,N) + d_3(M,N)$
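The slides leave the per-state "difference" open. As one possible choice, the sketch below assumes each state is a single diagonal Gaussian and uses the symmetric Kullback-Leibler divergence for the state distance; this choice, the function names and the toy models are assumptions for illustration only. The HMM distance is then the sum over corresponding states, as in the definition above.

```python
def state_distance(p, q):
    """Symmetric KL divergence between two diagonal Gaussians given as (mean, var)."""
    (mean_p, var_p), (mean_q, var_q) = p, q
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        diff_sq = (mp - mq) ** 2
        # KL(p||q) + KL(q||p); the log-variance terms cancel in the sum.
        total += 0.5 * ((vp + diff_sq) / vq + (vq + diff_sq) / vp - 2.0)
    return total


def hmm_distance(hmm_m, hmm_n):
    """d(M,N) = d_1(M,N) + d_2(M,N) + ...: sum of distances between matching states."""
    if len(hmm_m) != len(hmm_n):
        raise ValueError("both HMMs must have the same number of states")
    return sum(state_distance(p, q) for p, q in zip(hmm_m, hmm_n))


# Two toy 3-state models with 1-dimensional features; only state 1 differs.
hmm_m = [([0.0], [1.0]), ([1.0], [1.0]), ([2.0], [1.0])]
hmm_n = [([2.0], [1.0]), ([1.0], [1.0]), ([2.0], [1.0])]

print(hmm_distance(hmm_m, hmm_n))  # 4.0
```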


Clustered triphones

• Given a distance measure, we can treat HMMs as points in a high-dimensional space and cluster them together
• Each cluster can then be represented by a single triphone HMM
• Can control number of parameters by controlling number of clusters
• For medium vocabulary tasks (500-1,000 words) 300-500 clusters is sufficient
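The slides do not say which clustering algorithm is used. The sketch below assumes greedy agglomerative clustering with average linkage, which works with any distance function. To keep it short, each triphone model is summarised by a single made-up number and the distance is the absolute difference; the `p-e+t` style names are illustrative labels only.

```python
def cluster(items, distance, n_clusters):
    """Greedy agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[item] for item in items]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical one-number summaries standing in for triphone HMMs of /e/.
MODELS = {"p-e+t": 0.0, "t-e+t": 0.1, "s-e+t": 5.0, "f-e+t": 5.2}


def toy_distance(a, b):
    return abs(MODELS[a] - MODELS[b])


groups = cluster(list(MODELS), toy_distance, n_clusters=2)
print(groups)  # [['p-e+t', 't-e+t'], ['s-e+t', 'f-e+t']]
```

The `n_clusters` argument is the control knob mentioned above: fewer clusters means fewer HMMs and therefore fewer parameters to estimate.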


Two problems

• In the clustered triphone method we decide to combine triphone HMMs based on the similarity of their state output PDFs
• But the whole point is that we don’t have accurate estimates of these PDFs
• Suppose we want to model a new word, which contains a triphone which was not in the training set
• Which generalised triphone should we use?


Phone decision trees

• Most common approach to HMM tying, which addresses both problems, is decision tree clustering
• Decision tree clustering can be applied to individual states or to whole HMMs; we’ll consider whole HMMs
• Basic idea is to supplement data-driven methods (distances between PDFs) with knowledge about which phones are likely to induce similar contextual effects


Phonetic knowledge

• For example, we know that /f/ and /s/ are both unvoiced fricatives, produced in a similar manner
• Therefore we might hypothesise that, for example, an utterance of the vowel /e/ preceded by /f/ might be similar to one preceded by /s/
• This is the basic idea behind decision tree clustering


A phone decision tree for /e/

• A phone decision tree is just a binary tree, where each node of the tree is associated with:
  – A set of phones
  – A position (L or R)
• The root node of the tree corresponds to /e/
• The terminal nodes correspond to significantly different contextual variants of /e/


A decision tree node

[Figure: node (a), labelled {/p/, /t/, /k/}, L, with a "Y" branch to node (b) and an "N" branch to node (c)]

• Want to choose a model for /e/ in a particular context
• At node (a), ask the question: is the Left context one of the set {/p/, /t/, /k/}?
• If “yes” go to node (b), otherwise go to node (c)
• Continue until a terminal node is reached
• Choose the associated HMM
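The lookup procedure just described can be sketched as a short loop over a binary tree. The `Node` class, its field names and the HMM labels below are hypothetical; the only thing taken from the slide is the question at node (a).

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Node:
    phones: FrozenSet[str] = frozenset()   # phone set in the node's question
    position: str = ""                     # "L" (left context) or "R" (right context)
    yes: Optional["Node"] = None           # successor when the answer is "yes"
    no: Optional["Node"] = None            # successor when the answer is "no"
    hmm: Optional[str] = None              # set only on terminal nodes


def choose_hmm(node, left, right):
    """Walk from the root to a terminal node and return its associated HMM."""
    while node.hmm is None:
        context = left if node.position == "L" else right
        node = node.yes if context in node.phones else node.no
    return node.hmm


# Node (a) asks: is the left context one of {/p/, /t/, /k/}?
node_b = Node(hmm="e_variant_b")   # hypothetical HMM label
node_c = Node(hmm="e_variant_c")   # hypothetical HMM label
node_a = Node(phones=frozenset({"p", "t", "k"}), position="L", yes=node_b, no=node_c)

print(choose_hmm(node_a, left="t", right="s"))  # e_variant_b
print(choose_hmm(node_a, left="s", right="t"))  # e_variant_c
```

Because the walk depends only on the phone identities of the context, it also returns a model for a triphone that never occurred in training.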


Building a phone decision tree for /e/

• First choose a set of questions
  – These can be chosen using phonetic knowledge about sets of phones which are likely to induce similar contextual effects
  – …plus pragmatics!
• Also need the set E of acoustic patterns corresponding to /e/ in the training data
• Each question partitions E into two subsets
  – $E_Y$: the set of instances of /e/ for which the answer to the question is “Yes”
  – $E_N$: the set of instances of /e/ for which the answer to the question is “No”
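As a small illustration of how a question splits E, the sketch below represents each training instance of /e/ by its left and right context phones. The dictionary layout, the `id` field and the function name are assumptions made for the example.

```python
def partition(instances, phones, position):
    """Split instances into (E_Y, E_N) according to one (phone set, position) question."""
    e_yes, e_no = [], []
    for inst in instances:
        context = inst["left"] if position == "L" else inst["right"]
        (e_yes if context in phones else e_no).append(inst)
    return e_yes, e_no


# Toy set E: instances of /e/ identified only by their context phones.
E = [
    {"id": 1, "left": "p", "right": "s"},
    {"id": 2, "left": "s", "right": "t"},
    {"id": 3, "left": "k", "right": "f"},
]

# Question: is the left context one of {/p/, /t/, /k/}?
e_yes, e_no = partition(E, {"p", "t", "k"}, "L")
print([inst["id"] for inst in e_yes])  # [1, 3]
print([inst["id"] for inst in e_no])   # [2]
```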


Building a phone decision tree

• For each question Q, we can define a “quality measure” g(Q)
• g(Q) is a measure of how well the sets $E_Y$ and $E_N$ can be modelled by separate HMMs
• Intuitively, g(Q) is a measure of how compact or ‘homogeneous’ the sets $E_Y$ and $E_N$ are
• Choose the question Q for which g(Q) is biggest
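The slides do not give a formula for g(Q). One plausible stand-in, assumed here purely for illustration, is the total log-likelihood of the two subsets when each is modelled by its own single Gaussian (in place of a full HMM): compact subsets give small variances and therefore a high score. The scalar features, the variance floor and the question names are all made up for the example.

```python
import math

VAR_FLOOR = 1e-3  # assumed floor so that tiny subsets do not give zero variance


def gaussian_loglik(values):
    """Log-likelihood of scalar values under their own maximum-likelihood Gaussian."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    sum_sq = sum((v - mean) ** 2 for v in values)
    var = max(sum_sq / n, VAR_FLOOR)
    return -0.5 * (n * math.log(2.0 * math.pi * var) + sum_sq / var)


def quality(e_yes, e_no):
    """Stand-in for g(Q): how well the two subsets are modelled separately."""
    return gaussian_loglik(e_yes) + gaussian_loglik(e_no)


def best_question(instances, questions):
    """Score every question on the left context and return the best name plus all scores."""
    scores = {}
    for name, phones in questions.items():
        e_yes = [x for left, x in instances if left in phones]
        e_no = [x for left, x in instances if left not in phones]
        scores[name] = quality(e_yes, e_no)
    return max(scores, key=scores.get), scores


# Toy instances of /e/: (left context, scalar acoustic feature).
E = [("p", 1.0), ("t", 1.2), ("s", 5.0), ("f", 5.2)]
QUESTIONS = {"left in {p, t, k}": {"p", "t", "k"}, "left in {p, s}": {"p", "s"}}

best, scores = best_question(E, QUESTIONS)
print(best)  # left in {p, t, k}
```

The first question separates the data into two tight groups, so it scores higher than the second, which mixes low and high feature values in both subsets.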


Building a phone decision tree

• Training patterns in $E_Y$ (resp. $E_N$) are assigned to the “Y” (resp. “N”) successor nodes
• Whole process is repeated for each successor node
• Process stops when, for example, the number of samples associated with a node reaches a minimum
• An HMM is built for each terminal node
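Putting the steps together, the recursive construction might look like the sketch below. It reuses the same assumed stand-in for g(Q) (single-Gaussian log-likelihood on a scalar feature) and stops when no question leaves at least `MIN_SAMPLES` instances in both successors. The threshold, data and nested-dictionary tree layout are illustrative choices, and the terminal nodes simply hold the instances from which an HMM would then be trained.

```python
import math

MIN_SAMPLES = 2   # assumed minimum number of samples per successor node
VAR_FLOOR = 1e-3  # assumed variance floor


def loglik(values):
    """Log-likelihood of scalar values under their own maximum-likelihood Gaussian."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((v - mean) ** 2 for v in values)
    var = max(sum_sq / n, VAR_FLOOR)
    return -0.5 * (n * math.log(2.0 * math.pi * var) + sum_sq / var)


def split(instances, question):
    """Partition (left, right, feature) instances by a (phone set, position) question."""
    phones, position = question
    idx = 0 if position == "L" else 1
    yes = [x for x in instances if x[idx] in phones]
    no = [x for x in instances if x[idx] not in phones]
    return yes, no


def build(instances, questions):
    """Pick the best-scoring question, split, and recurse on both successors."""
    best = None
    for question in questions:
        yes, no = split(instances, question)
        if len(yes) < MIN_SAMPLES or len(no) < MIN_SAMPLES:
            continue  # this split would leave a node with too few samples
        g = loglik([x[2] for x in yes]) + loglik([x[2] for x in no])
        if best is None or g > best[0]:
            best = (g, question, yes, no)
    if best is None:
        return {"leaf": instances}  # terminal node: one HMM would be trained here
    _, question, yes, no = best
    return {"question": question,
            "yes": build(yes, questions),
            "no": build(no, questions)}


# Toy instances of /e/: (left context, right context, scalar feature).
E = [("p", "e", 1.0), ("t", "t", 1.2), ("s", "e", 5.0), ("f", "t", 5.2)]
QUESTIONS = [(frozenset({"p", "t", "k"}), "L"), (frozenset({"e", "i"}), "R")]

tree = build(E, QUESTIONS)
print(tree["question"][1])                 # L
print([x[0] for x in tree["yes"]["leaf"]])  # ['p', 't']
print([x[0] for x in tree["no"]["leaf"]])   # ['s', 'f']
```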


Phone Decision Tree

[Figure: phone decision tree for /e/. Node questions shown: {/p/, /t/, /k/; L}, {/e/, /i/, /A/; L}, {/s/, /f/; R}, {/s/, /f/; L}, {/#/; R}, {/e/, /i/; R}. Branches are labelled Y and N.]


PDT Concluding Remarks

• Phone decision trees can be applied at the state level, to construct a set of triphones with tied states
• State-level phone decision trees are supported by HTK
