The dissertation of Andreas Stolcke is approved: University of ...
CHAPTER 3. HIDDEN MARKOV MODELS

Description Length priors

We can use the MDL framework as discussed in Section 2.5.6 to derive simple priors for HMM structures from various coding schemes. For example, a natural way to encode the transitions and emissions in an HMM is to simply enumerate them. Each transition can be encoded using log(|Q| + 1) bits, since there are |Q| possible transitions, plus a special 'end' marker which allows us not to encode the missing transitions explicitly. The total description length for all transitions from state q is thus n_t(q) log(|Q| + 1). Similarly, all emissions from q can be coded using n_e(q) log(|Σ| + 1) bits.^4 The resulting prior

    P(M) ∝ ∏_{q ∈ Q} (|Q| + 1)^{-n_t(q)} (|Σ| + 1)^{-n_e(q)}        (3.8)

where n_t(q) and n_e(q) denote the number of transitions and emissions out of state q, has the property that small differences in the number of states matter little compared to differences in the total number of transitions and emissions.

We have seen in Section 2.5.7 that the preferred criterion for maximization is the posterior of the model structure, P(M_S | X), which requires integrating out the parameters θ_M. In Section 3.4 we give a solution for this computation that relies on the approximation of sample likelihoods by Viterbi paths.

3.3.4 Why are smaller HMMs preferred?

Intuitively, we want an HMM induction algorithm to prefer 'smaller' models over 'larger' ones, other things being equal.
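As an illustration, the description-length prior of (3.8) is straightforward to compute for a given structure. The sketch below (our own illustration, with hypothetical helper names, not code from the text) scores HMM structures in log space and shows that a structure with fewer transitions and emissions receives a shorter description length, and hence a higher prior.

```python
import math

def description_length(n_states, n_symbols, trans_per_state, emits_per_state):
    """Total code length in bits of an HMM structure under the enumeration
    scheme behind prior (3.8): each transition costs log2(|Q|+1) bits and
    each emission costs log2(|Sigma|+1) bits."""
    trans_bits = sum(nt * math.log2(n_states + 1) for nt in trans_per_state)
    emit_bits = sum(ne * math.log2(n_symbols + 1) for ne in emits_per_state)
    return trans_bits + emit_bits

def log_prior(n_states, n_symbols, trans_per_state, emits_per_state):
    """log2 P(M), up to the normalizing constant: minus the description length."""
    return -description_length(n_states, n_symbols, trans_per_state, emits_per_state)

# Two toy structures over |Q| = 3 states and |Sigma| = 2 symbols:
sparse = log_prior(3, 2, trans_per_state=[1, 1, 1], emits_per_state=[1, 1, 1])
dense  = log_prior(3, 2, trans_per_state=[3, 3, 3], emits_per_state=[2, 2, 2])
assert sparse > dense  # fewer transitions/emissions => higher prior
```

Note that adding a state to the sparse structure changes the per-transition cost only from log2(4) to log2(5) bits, whereas adding transitions grows the total linearly, which is exactly the qualitative property claimed for (3.8).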
This can be interpreted as a special case of 'Occam's razor,' or the scientific maxim that simpler explanations are to be preferred unless more complex explanations are required to explain the data.

Once the notions of model size (or explanation complexity) and goodness of explanation are quantified, this principle can be modified to include a trade-off between the criteria of simplicity and data fit. This is precisely what the Bayesian approach does, since in optimizing the product P(M) P(X | M) a compromise between simplicity (embodied in the prior) and fit to the data (high model likelihood) is found.

But how is it that the HMM priors discussed in the previous section lead to a preference for 'smaller' or 'simpler' models? Two answers present themselves: one has to do with the general phenomenon of 'Occam factors' found in Bayesian inference; the other is related, but specific to the way HMMs partition data for purposes of 'explaining' it. We will discuss each in turn.

3.3.4.1 Occam factors

Consider the following scenario. Two pundits, M_1 and M_2, are asked for their predictions regarding an upcoming election involving a number of candidates. Each pundit has his/her own 'model' of the political process. We will identify these models with their respective proponents, and try to evaluate each according

^4 The basic idea of encoding transitions and emissions by enumeration has various more sophisticated variants.
For example, one could base the enumeration of transitions on a canonical ordering of states, such that only log |Q| + log(|Q| − 1) + ··· bits are required. Or one could use the k-out-of-n-bit integer coding scheme described in Cover & Thomas (1991) and used for MDL inference in Quinlan & Rivest (1989). Any reasonable Bayesian inference procedure should not be sensitive to such minor differences in the prior, unless it is used with too little data. Our goal here is simply to suggest priors that have reasonable qualitative properties, and are at the same time computationally convenient.
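To see that these coding variants differ only marginally, the following sketch (our own illustration, not from the text) compares the bits needed to describe which n_t of |Q| possible transitions leave a state under three schemes: plain enumeration at log2(|Q| + 1) bits each, enumeration under a canonical ordering of the remaining states, and a k-out-of-n style code that first sends n_t and then the index of the chosen subset among C(|Q|, n_t) possibilities.

```python
import math

def bits_enumeration(n_states, n_trans):
    # Plain enumeration: log2(|Q|+1) bits per transition (the '+1' is the
    # alphabet slot reserved for the 'end' marker).
    return n_trans * math.log2(n_states + 1)

def bits_canonical(n_states, n_trans):
    # Canonical ordering: each successive target is drawn from a shrinking
    # set of not-yet-used states, so the per-choice cost decreases.
    return sum(math.log2(n_states - i) for i in range(n_trans))

def bits_subset(n_states, n_trans):
    # k-out-of-n style: send n_trans, then the subset index among
    # C(|Q|, n_trans) equally likely subsets.
    return math.log2(n_states + 1) + math.log2(math.comb(n_states, n_trans))

for nt in (1, 3, 5):
    print(nt,
          round(bits_enumeration(10, nt), 2),
          round(bits_canonical(10, nt), 2),
          round(bits_subset(10, nt), 2))
```

For |Q| = 10 the three schemes stay within a few bits of one another per state, which supports the point above: with any reasonable amount of data, such small prior differences are swamped by the likelihood term.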