HMM Parameter Tying - University of Birmingham
THE UNIVERSITY<br />
OF BIRMINGHAM<br />
SCHOOL OF<br />
ELECTRONIC &<br />
ELECTRICAL<br />
ENGINEERING<br />
HMM Parameter Tying<br />
Version 1 February 2002<br />
Digital Systems<br />
&<br />
Vision Processing<br />
Tying<br />
17-Feb-01<br />
SLIDE 1
The Problem<br />
• Good recognition accuracy requires context-sensitive phone-level models<br />
• If there are 50 phones, the maximum number of triphone HMMs is $50^3$ = 125,000<br />
• Most are ruled out by phonological constraints<br />
– most phone triples never occur in speech<br />
• But many are legal<br />
Model Parameters<br />
• Each model has 3 emitting states<br />
• Each state modelled as, say, a 10-component Gaussian mixture<br />
• Each feature vector is 40-dimensional<br />
• Hence number of parameters per model is:<br />
3 × (10 × (40 + 40 + 1) + 9) = 2,457<br />
– 3: number of states<br />
– 10: number of mixture components<br />
– 40: mean vector<br />
– 40: variance vector<br />
– 1: mixture weight<br />
– 9: transition probs<br />
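The count above can be reproduced with a short sketch. The function name and argument names are illustrative only; the arithmetic simply mirrors the slide's formula, including its figure of 9 transition probabilities per state.

```python
def params_per_model(n_states=3, n_mix=10, dim=40, transition_probs=9):
    """Parameter count for one HMM, following the slide's arithmetic."""
    # Each mixture component: mean vector + variance vector + one mixture weight
    per_component = dim + dim + 1
    # Each state: all of its components plus the transition term used on the slide
    per_state = n_mix * per_component + transition_probs
    return n_states * per_state


print(params_per_model())  # 2457
```

Changing `n_mix` or `dim` shows how quickly the count grows with model resolution.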
Acoustic model parameters<br />
• So, even if we only have 1,000 acoustic models (instead of 125,000), total acoustic model parameters will be 2,457,000<br />
• Too many to estimate with practical quantity of data<br />
• Most common solution is HMM parameter tying<br />
• Some different HMMs share same parameters<br />
Tied variance<br />
• Variances are more costly to estimate than means<br />
• Simple solution: divide set of all HMMs into classes, so that within a class all HMM state PDFs have same variance<br />
• This is tied variance<br />
• If all HMM state PDFs share the same variance, the variance is referred to as grand variance<br />
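One way to picture tied variance is as a pooled estimate. The sketch below assumes each state is a single diagonal-covariance Gaussian and that training frames have already been aligned to states; neither assumption comes from the slides. The state names and the function `tied_variance` are hypothetical.

```python
def tied_variance(frames_by_state):
    """Pooled per-dimension variance shared by every state in one class.

    frames_by_state maps a state name to the list of feature vectors
    aligned to it. Each state keeps its own mean; squared deviations
    from those means are pooled into a single variance vector.
    """
    dim = len(next(iter(frames_by_state.values()))[0])
    means = {}
    sq_dev = [0.0] * dim
    n_total = 0
    for state, frames in frames_by_state.items():
        n = len(frames)
        mean = [sum(f[d] for f in frames) / n for d in range(dim)]
        means[state] = mean
        for f in frames:
            for d in range(dim):
                sq_dev[d] += (f[d] - mean[d]) ** 2
        n_total += n
    variance = [s / n_total for s in sq_dev]
    return means, variance


# Hypothetical toy data: two states in the same class, 2-dimensional features
frames = {
    "a-e+t.s1": [[0.0, 1.0], [2.0, 3.0]],
    "p-e+t.s1": [[10.0, 0.0], [12.0, 4.0]],
}
means, variance = tied_variance(frames)
```

Passing every state in the system as a single class would give the grand variance.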
Tied Mixtures<br />
• Another common method is tied (or shared) mixtures<br />
• In a normal Gaussian mixture HMM system, each state is associated with a Gaussian mixture PDF of the form:<br />
$$b(y) = \sum_{m=1}^{M} w_m \, p_m(y)$$<br />
• In a tied mixture system, all of the $p_m$ are chosen from a fixed, finite set of unimodal Gaussian PDFs<br />
Tied Mixtures<br />
• By controlling the number of shared mixture components, the total number of acoustic model parameters is controlled<br />
• The set of shared Gaussian PDFs is like a vector quantiser codebook<br />
• Tied mixture HMMs are also known as semi-continuous HMMs<br />
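A minimal sketch of a tied mixture system, assuming diagonal-covariance Gaussians. The two-entry codebook, the state names and the weights are made up for illustration. Every state evaluates the same shared Gaussians and differs only in its mixture weights.

```python
import math


def gaussian_pdf(y, mean, var):
    """Diagonal-covariance Gaussian density at y."""
    log_p = 0.0
    for y_d, mu_d, v_d in zip(y, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * v_d) + (y_d - mu_d) ** 2 / v_d)
    return math.exp(log_p)


# Shared codebook: (mean, variance) pairs used by every state (toy values)
CODEBOOK = [
    ([0.0, 0.0], [1.0, 1.0]),
    ([3.0, 3.0], [1.0, 1.0]),
]

# Each state stores only its mixture weights over the codebook (toy values)
STATE_WEIGHTS = {
    "state_a": [0.9, 0.1],
    "state_b": [0.2, 0.8],
}


def state_output_pdf(state, y):
    """b(y) = sum over m of w_m * p_m(y), with the p_m taken from the codebook."""
    densities = [gaussian_pdf(y, mean, var) for mean, var in CODEBOOK]
    return sum(w * p for w, p in zip(STATE_WEIGHTS[state], densities))
```

Because the codebook densities do not depend on the state, they can be computed once per frame and reused across all states. Only the weights add parameters per state.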
Generalised Triphones<br />
• Previous techniques share (or tie) particular model parameters<br />
• An alternative is to share whole HMMs<br />
• In other words, assume that some contexts induce the same effects on the acoustic realisation of a phone, and model their triphones using the same HMM<br />
• These equivalence classes of triphone HMMs are called generalised triphones<br />
Clustered triphones<br />
• Suppose M and N are phone-level HMMs, both with same number of states<br />
• Can define a distance d(M,N) between M and N in several ways<br />
• E.g. define $d_1(M,N)$ to be the difference between state 1 of M and state 1 of N, i.e. between their output PDFs $b_{1,M}$ and $b_{1,N}$<br />
• Define $d(M,N) = d_1(M,N) + d_2(M,N) + d_3(M,N)$<br />
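The slides leave the per-state "difference" open. As one possible choice, the sketch below assumes each state is a single diagonal Gaussian and uses the symmetric Kullback-Leibler divergence between the two state PDFs. That choice and the function names are assumptions, not something the slides prescribe.

```python
def state_distance(p, q):
    """Symmetric KL divergence between two diagonal Gaussians given as (mean, var)."""
    (mean_p, var_p), (mean_q, var_q) = p, q
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += 0.5 * (vp / vq + vq / vp - 2.0
                        + (mp - mq) ** 2 * (1.0 / vp + 1.0 / vq))
    return total


def hmm_distance(model_m, model_n):
    """d(M,N) = d_1(M,N) + d_2(M,N) + d_3(M,N): sum over corresponding states."""
    assert len(model_m) == len(model_n), "models must have the same number of states"
    return sum(state_distance(p, q) for p, q in zip(model_m, model_n))


# Toy 3-state models with 1-dimensional features; they differ only in state 1
M = [([0.0], [1.0]), ([1.0], [1.0]), ([2.0], [1.0])]
N = [([1.0], [1.0]), ([1.0], [1.0]), ([2.0], [1.0])]
print(hmm_distance(M, N))  # 1.0
```

Any other symmetric measure between state PDFs could be substituted for `state_distance` without changing `hmm_distance`.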
Clustered triphones<br />
• Given a distance measure, we can treat HMMs as points in a high-dimensional space and cluster them together<br />
• Each cluster can then be represented by a single triphone HMM<br />
• Can control number of parameters by controlling number of clusters<br />
• For medium vocabulary tasks (500-1,000 words) 300-500 clusters is sufficient<br />
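The clustering step might look like the following sketch. It uses greedy agglomerative clustering with average linkage, which is only one of several possible schemes; the slides do not name one. The toy "models" are tuples of three state means with a stand-in distance, and the triphone names are hypothetical.

```python
def cluster(models, distance, n_clusters):
    """Greedy agglomerative clustering with average linkage.

    models: dict mapping a name to a model; distance: function(model, model).
    Returns a list of clusters, each a sorted list of model names.
    """
    clusters = [[name] for name in models]

    def linkage(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(distance(models[x], models[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (i < j) and merge j into i
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


def toy_distance(m, n):
    """Stand-in for d(M,N): sum of absolute differences between state means."""
    return sum(abs(a - b) for a, b in zip(m, n))


toy_models = {
    "p-e+t": (1.0, 2.0, 3.0),
    "t-e+t": (1.1, 2.1, 3.0),
    "s-e+t": (5.0, 6.0, 7.0),
    "f-e+t": (5.2, 6.0, 7.1),
}
clusters = cluster(toy_models, toy_distance, n_clusters=2)
```

The `n_clusters` argument is the knob that controls the total number of parameters.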
Two problems<br />
• In the clustered triphone method we decide to combine triphone HMMs based on the similarity of their state output PDFs<br />
• But the whole point is that we don’t have accurate estimates of these PDFs<br />
• Suppose we want to model a new word, which contains a triphone which was not in the training set<br />
• Which generalised triphone should we use?<br />
Phone decision trees<br />
• Most common approach to HMM tying, which addresses both problems, is decision tree clustering<br />
• Decision tree clustering can be applied to individual states or to whole HMMs; we’ll consider whole HMMs<br />
• Basic idea is to supplement data-driven methods (distances between PDFs) with knowledge about which phones are likely to induce similar contextual effects<br />
Phonetic knowledge<br />
• For example, we know that /f/ and /s/ are both unvoiced fricatives, produced in a similar manner<br />
• Therefore we might hypothesise that, for example, an utterance of the vowel /e/ preceded by /f/ might be similar to one preceded by /s/<br />
• This is the basic idea behind decision tree clustering<br />
A phone decision tree for /e/<br />
• A phone decision tree is just a binary tree, where each node of the tree is associated with:<br />
– A set of phones<br />
– A position (L or R)<br />
• The root node of the tree corresponds to /e/<br />
• The terminal nodes correspond to significantly different contextual variants of /e/<br />
A decision tree node<br />
[Figure: decision tree node (a) with question {/p/, /t/, /k/}, L; branch Y leads to node (b), branch N to node (c)]<br />
• Want to choose a model for /e/ in a particular context<br />
• At node (a), ask the question: is the Left context one of the set {/p/, /t/, /k/}?<br />
• If “yes” go to node (b), otherwise go to node (c)<br />
• Continue until a terminal node is reached<br />
• Choose associated HMM<br />
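The lookup described above can be sketched as a simple tree walk. The `Node` class, the model names and the one-question tree are illustrative assumptions; in particular, nodes (b) and (c) are made terminal here to keep the example short.

```python
class Node:
    """A question node (phones, position) or a terminal node holding a model name."""

    def __init__(self, phones=None, position=None, yes=None, no=None, model=None):
        self.phones = phones      # set of phones the question asks about
        self.position = position  # "L" (left context) or "R" (right context)
        self.yes = yes            # successor when the answer is "yes"
        self.no = no              # successor when the answer is "no"
        self.model = model        # set only on terminal nodes


def choose_model(node, left, right):
    """Walk from the given node to a terminal node for the context (left, right)."""
    while node.model is None:
        context = left if node.position == "L" else right
        node = node.yes if context in node.phones else node.no
    return node.model


# Node (a) from the slide, with hypothetical terminal nodes (b) and (c)
node_b = Node(model="e_after_stop")
node_c = Node(model="e_other")
node_a = Node(phones={"p", "t", "k"}, position="L", yes=node_b, no=node_c)

print(choose_model(node_a, left="t", right="n"))  # e_after_stop
```

Because the walk only needs the phonetic context, it also returns a model for a triphone that never occurred in the training set.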
Building a phone decision tree for /e/<br />
• First choose a set of questions<br />
– These can be chosen using phonetic knowledge about sets of phones which are likely to induce similar contextual effects<br />
– …plus pragmatics!<br />
• Also need the set E of acoustic patterns corresponding to /e/ in the training data<br />
• Each question partitions E into two subsets<br />
– $E_Y$: the set of instances of /e/ for which the answer to the question is “Yes”<br />
– $E_N$: the set of instances of /e/ for which the answer to the question is “No”<br />
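A sketch of how one question splits E into the two subsets. The record layout (a dict with `left`, `right` and `features` keys) and the toy values are assumptions made for the example.

```python
def partition(instances, phones, position):
    """Split instances of /e/ into (E_Y, E_N) for the question (phones, position)."""
    e_yes, e_no = [], []
    for inst in instances:
        context = inst["left"] if position == "L" else inst["right"]
        if context in phones:
            e_yes.append(inst)
        else:
            e_no.append(inst)
    return e_yes, e_no


# Toy set E: each instance of /e/ records its left and right context
E = [
    {"left": "p", "right": "n", "features": [1.0]},
    {"left": "s", "right": "t", "features": [4.0]},
    {"left": "k", "right": "s", "features": [1.2]},
]
E_Y, E_N = partition(E, phones={"p", "t", "k"}, position="L")
```

Every instance lands in exactly one of the two subsets, so the sizes of `E_Y` and `E_N` always sum to the size of `E`.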
Building a phone decision tree<br />
• For each question Q, we can define a “quality measure” g(Q)<br />
• g(Q) is a measure of how well the sets $E_Y$ and $E_N$ can be modelled by separate HMMs<br />
• Intuitively, g(Q) is a measure of how compact or ‘homogeneous’ the sets $E_Y$ and $E_N$ are<br />
• Choose the question Q for which g(Q) is biggest<br />
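The slides do not give a formula for g(Q). One common stand-in, used here purely as an assumption, models each set with a single maximum-likelihood diagonal Gaussian rather than a full HMM. It scores a question by the gain in log-likelihood from modelling the two subsets separately instead of together. The variance floor and all names are illustrative.

```python
import math


def log_likelihood(frames, var_floor=1e-3):
    """Log-likelihood of frames under a diagonal Gaussian fitted to those frames."""
    n = len(frames)
    if n == 0:
        return 0.0
    total = 0.0
    for d in range(len(frames[0])):
        mean = sum(f[d] for f in frames) / n
        sq_dev = sum((f[d] - mean) ** 2 for f in frames)
        var = max(sq_dev / n, var_floor)  # floor guards against zero variance
        total += -0.5 * (n * math.log(2 * math.pi * var) + sq_dev / var)
    return total


def quality(e_yes, e_no):
    """g(Q): gain from modelling E_Y and E_N separately rather than jointly."""
    return log_likelihood(e_yes) + log_likelihood(e_no) - log_likelihood(e_yes + e_no)


# Two hypothetical questions applied to the same four 1-dimensional frames
splits = {
    "stop_left": ([[0.0], [0.2]], [[5.0], [5.2]]),   # compact subsets
    "nasal_right": ([[0.0], [5.0]], [[0.2], [5.2]]), # mixed subsets
}
best = max(splits, key=lambda name: quality(*splits[name]))
```

In this toy case the question that separates the frames into two tight groups scores far higher than the one that leaves both subsets spread out.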
Building a phone decision tree<br />
• Training patterns in $E_Y$ (resp. $E_N$) are assigned to the “Y” (resp. “N”) successor nodes<br />
• Whole process is repeated for each successor node<br />
• Process stops when, for example, the number of samples associated with a node reaches a minimum<br />
• An HMM is built for each terminal node<br />
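Putting the steps together, a recursive build could be sketched as below. This is a toy under several assumptions: features are scalars, g(Q) is a single-Gaussian log-likelihood gain, and a node is not split if either successor would fall below `min_samples`. That is one reading of the stopping rule, and none of the names come from the slides.

```python
import math


def log_likelihood(values, var_floor=1e-3):
    """Log-likelihood of scalar values under a Gaussian fitted to them."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    sq_dev = sum((v - mean) ** 2 for v in values)
    var = max(sq_dev / n, var_floor)
    return -0.5 * (n * math.log(2 * math.pi * var) + sq_dev / var)


def split(instances, question):
    """Partition (left, right, feature) tuples by a (phones, position) question."""
    phones, position = question
    index = 0 if position == "L" else 1
    yes = [x for x in instances if x[index] in phones]
    no = [x for x in instances if x[index] not in phones]
    return yes, no


def gain(instances, question):
    """Stand-in for g(Q): log-likelihood gain of the split."""
    yes, no = split(instances, question)
    feats = lambda xs: [x[2] for x in xs]
    return (log_likelihood(feats(yes)) + log_likelihood(feats(no))
            - log_likelihood(feats(instances)))


def build_tree(instances, questions, min_samples):
    """Return a nested dict; each leaf holds the data for one terminal-node HMM."""
    limit = max(min_samples, 1)  # never allow an empty successor node
    candidates = [q for q in questions
                  if all(len(part) >= limit for part in split(instances, q))]
    if not candidates:
        return {"leaf": instances}
    best = max(candidates, key=lambda q: gain(instances, q))
    yes, no = split(instances, best)
    return {"question": best,
            "yes": build_tree(yes, questions, min_samples),
            "no": build_tree(no, questions, min_samples)}


# Toy data: (left context, right context, scalar feature) for instances of /e/
data = [("p", "n", 1.0), ("t", "n", 1.2), ("k", "s", 0.9), ("t", "s", 1.1),
        ("s", "n", 4.0), ("f", "n", 4.2), ("s", "s", 3.9), ("f", "s", 4.1)]
questions = [({"p", "t", "k"}, "L"), ({"s"}, "R")]
tree = build_tree(data, questions, min_samples=3)
```

With this toy data the root picks the left-context stop question, and both successors become terminal nodes because any further split would leave fewer than three samples on one side.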
Phone Decision Tree<br />
[Figure: phone decision tree for /e/, with node questions {/p/, /t/, /k/; L}, {/e/, /i/, /A/; L}, {/s/, /f/; R}, {/s/, /f/; L}, {/#/; R} and {/e/, /i/; R}, and branches labelled Y and N]<br />
PDT Concluding Remarks<br />
• Phone decision trees can be applied at the state level, to construct a set of triphones with tied states<br />
• State level phone decision trees supported by HTK<br />