12.07.2015 Views

The dissertation of Andreas Stolcke is approved: University of ...


CHAPTER 3. HIDDEN MARKOV MODELS

Description Length priors

We can use the MDL framework as discussed in Section 2.5.6 to derive simple priors for HMM structures from various coding schemes. For example, a natural way to encode the transitions and emissions in an HMM is to simply enumerate them. Each transition can be encoded using $\log(|Q| + 1)$ bits, since there are $|Q|$ possible transitions, plus a special ‘end’ marker which allows us not to encode the missing transitions explicitly. The total description length for all transitions from state $q$ is thus $n_t^{(q)} \log(|Q| + 1)$. Similarly, all $n_e^{(q)}$ emissions from $q$ can be coded using $n_e^{(q)} \log(|\Sigma| + 1)$ bits.⁴ The resulting prior

$$P(M_S^{(q)}) \propto (|Q| + 1)^{-n_t^{(q)}} \, (|\Sigma| + 1)^{-n_e^{(q)}} \qquad (3.8)$$

has the property that small differences in the number of states matter little compared to differences in the total number of transitions and emissions.

We have seen in Section 2.5.7 that the preferred criterion for maximization is the posterior of structure $P(M_S \mid X)$, which requires integrating out the parameters $\theta_M$. In Section 3.4 we give a solution for this computation that relies on the approximation of sample likelihoods by Viterbi paths.

3.3.4 Why are smaller HMMs preferred?

Intuitively, we want an HMM induction algorithm to prefer ‘smaller’ models over ‘larger’ ones, other things being equal. This can be interpreted as a special case of ‘Occam’s razor,’ or the scientific maxim that simpler explanations are to be preferred unless more complex explanations are required to explain the data.

Once the notions of model size (or explanation complexity) and goodness of explanation are quantified, this principle can be modified to include a trade-off between the criteria of simplicity and data fit. This is precisely what the Bayesian approach does, since in optimizing the product $P(M) P(X \mid M)$ a compromise between simplicity (embodied in the prior) and fit to the data (high model likelihood) is found.

But how is it that the HMM priors discussed in the previous section lead to a preference for ‘smaller’ or ‘simpler’ models? Two answers present themselves: one has to do with the general phenomenon of ‘Occam factors’ found in Bayesian inference; the other is related, but specific to the way HMMs partition data for purposes of ‘explaining’ it. We will discuss each in turn.

3.3.4.1 Occam factors

Consider the following scenario. Two pundits, $M_1$ and $M_2$, are asked for their predictions regarding an upcoming election involving a number of candidates. Each pundit has his/her own ‘model’ of the political process. We will identify these models with their respective proponents, and try to evaluate each according

⁴ The basic idea of encoding transitions and emissions by enumeration has various more sophisticated variants. For example, one could base the enumeration of transitions on a canonical ordering of states, such that only $\log(n + 1) + \log n + \cdots + \log(n - n_t + 1)$ bits are required. Or one could use the $k$-out-of-$n$-bit integer coding scheme described in Cover & Thomas (1991) and used for MDL inference in Quinlan & Rivest (1989). Any reasonable Bayesian inference procedure should not be sensitive to such minor differences in the prior, unless it is used with too little data. Our goal here is simply to suggest priors that have reasonable qualitative properties, and are at the same time computationally convenient.
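The enumeration-based description length behind Equation 3.8 can be illustrated with a minimal Python sketch. The function names, the dictionary representation of the HMM structure, and the choice to sum the per-state code lengths into one structure-wide score are assumptions made for this illustration; they are not taken from the dissertation. Logarithms are base 2, so lengths are in bits and the unnormalized prior of a state is 2 raised to the negative code length.

```python
import math


def state_description_length(n_states, n_symbols, n_transitions, n_emissions):
    """Bits needed to enumerate one state's transitions and emissions.

    Each transition costs log2(|Q| + 1) bits and each emission costs
    log2(|Sigma| + 1) bits; the '+ 1' accounts for the 'end' marker.
    """
    bits_per_transition = math.log2(n_states + 1)
    bits_per_emission = math.log2(n_symbols + 1)
    return n_transitions * bits_per_transition + n_emissions * bits_per_emission


def structure_log2_prior(transitions, emissions, n_symbols):
    """Unnormalized log2 prior of a structure (hypothetical helper).

    transitions: dict mapping each state to the set of its successor states
    emissions:   dict mapping each state to the set of symbols it can emit
    Assumption: per-state priors are combined by multiplication, so the
    per-state code lengths add up.
    """
    n_states = len(transitions)
    total_bits = sum(
        state_description_length(
            n_states, n_symbols, len(transitions[q]), len(emissions[q])
        )
        for q in transitions
    )
    return -total_bits


if __name__ == "__main__":
    # Toy structure: 3 states, alphabet of 3 symbols (illustrative values only).
    transitions = {"q0": {"q1", "q2"}, "q1": {"q2"}, "q2": set()}
    emissions = {"q0": {"a"}, "q1": {"a", "b"}, "q2": {"b"}}
    print(structure_log2_prior(transitions, emissions, n_symbols=3))
```

In this sketch, adding a transition or emission increases the code length by a full term, whereas adding a state changes only the argument of the logarithm, which mirrors the property noted for the prior above.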
