The dissertation of Andreas Stolcke is approved: University of ...


CHAPTER 2. FOUNDATIONS

• An explicit, probabilistic bias expressed by the prior $P(M)$.
• An implicit, heuristic bias as part of the choice of the topology of the search space (the search bias).

Even if the merging operator(s) are defined such that all models in the space are reachable from the initial model, this bias is still significant due to the local and sequential nature of the posterior probability maximization.

We will return to the question of how to relax the search bias by using less constrained search methods (Section 3.4.5).

The remaining sections of this chapter present some of the common mathematical tools needed in instantiating the Bayesian model merging approach in the domain of probabilistic grammars. The following Chapters 3, 4 and 5 each describe one such instantiation.

2.5.4 Minimum Description Length

Bayesian inference based on posterior probabilities has an alternative formulation in terms of information-theoretic concepts. The dualism between the two formulations is useful both for a deeper understanding of the underlying principles, and for the construction of prior distributions (see Section 2.5.6 below).

The maximization of

$$P(M, X) = P(M)\,P(X \mid M)$$

implicit in Bayesian model inference is equivalent to minimizing

$$-\log P(M, X) = -\log P(M) - \log P(X \mid M).$$

Information theory tells us that the negative logarithm of the probability of a discrete event $E$ is the optimal code word length for communicating an instance of $E$, so as to minimize the average code length of a representative message. Accordingly, the terms in the above equation can be interpreted as message or description lengths.

Specifically, $-\log P(M)$ is the description length of the model under the prior distribution; $-\log P(X \mid M)$ corresponds to a description of the data $X$ using $M$ as the model on which code lengths are based. The negative log of the joint probability can therefore be interpreted as the total description length of model and data.

Inference or estimation by minimum description length (MDL) (Rissanen 1983; Wallace & Freeman 1987) is thus equivalent to, and a useful alternative conceptualization of, posterior probability maximization.⁹

⁹ The picture becomes somewhat more complex when distributions over continuous spaces are involved. For those, the corresponding MDL formulation has to consider the optimal granularity of the discrete encoding. We can conveniently avoid this complication, as the only formal use of description lengths will be to devise priors for discrete objects, namely, grammar structures.
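The equivalence between posterior probability maximization and description length minimization can be checked numerically on a toy example. The following is a minimal sketch, not taken from the dissertation: the three candidate models, their prior probabilities, and their likelihoods are invented purely for illustration, and the helper names `description_length` and `total_description_length` are hypothetical. Code lengths are measured in bits (base-2 logarithms).

```python
import math


def description_length(prob: float) -> float:
    """Optimal code length in bits for an event of probability `prob`."""
    return -math.log2(prob)


# Hypothetical candidates: prior P(M) and likelihood P(X | M) for fixed data X.
# These numbers are assumptions chosen only for illustration.
models = {
    "M1": {"prior": 0.50, "likelihood": 0.01},
    "M2": {"prior": 0.25, "likelihood": 0.08},
    "M3": {"prior": 0.25, "likelihood": 0.02},
}


def total_description_length(m: dict) -> float:
    """Model description length plus data description length given the model."""
    return description_length(m["prior"]) + description_length(m["likelihood"])


# Maximizing the joint probability P(M) P(X | M) ...
map_model = max(models, key=lambda k: models[k]["prior"] * models[k]["likelihood"])

# ... selects the same model as minimizing the total description length.
mdl_model = min(models, key=lambda k: total_description_length(models[k]))

for name, m in models.items():
    print(f"{name}: joint = {m['prior'] * m['likelihood']:.4f}, "
          f"total length = {total_description_length(m):.3f} bits")
print("MAP choice:", map_model, "| MDL choice:", mdl_model)
```

Because the logarithm is monotonic, the two selection criteria necessarily agree; the sketch only makes the term-by-term correspondence between probabilities and code lengths concrete.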
