For example, for multinomials an unbiased, consistent estimator of the component probabilities $\theta_i$ is given by the observed averages

$$
\hat{\theta}_i = \frac{c_i}{n} \qquad (2.8)
$$

where $c_i$ is the number of times outcome $i$ occurs in a sample of size $n$.

Zero bias is not always desirable, however, as attested by the extensive literature on how to modify estimates for $n$-gram (and similar) models so as to keep estimates from taking on 0 values, specifically in linguistic domains. According to (2.8) this would be the case whenever some outcome has not been observed at all ($c_i = 0$). In many domains and for realistic sample sizes one expects many counts to remain zero even though the underlying parameters are not; hence one wants to bias the estimates away from zero. See Church & Gale (1991) for a survey and comparison of various methods for doing that. Another reason for introducing bias is to reduce the variance of an estimator (Geman et al. 1992).

2.2.6 Likelihood and cross-entropy

If the data $x$ is given and fixed, we can view $P(x \mid \theta)$ as a function of $\theta$, the likelihood (function). A large class of estimators, so-called maximum likelihood (ML) estimators, can be defined as the maxima of likelihood functions. For example, the simple estimator (2.8) for multinomials happens to also be the ML estimator, as setting the parameters $\theta_i$ to their empirical averages maximizes the probability of the observed data.

Intuitively, a high likelihood means that the model $\theta$ 'fits' the data well. This is because a model allocates a fixed probability mass (unity) over the space of possible strings (or sequences of strings). To maximize the probability of the observed strings, the unobserved ones have to receive as little probability as possible, within the constraints of the model. In fact, if a model class allows assigning probabilities to samples according to their relative frequencies as in (2.8), this is the best one can do.

An alternative measure of the fit or closeness of a model to a distribution is based on the concept of entropy. The relative entropy (also known as the Kullback-Leibler distance) between two distributions $p$ and $q$ is defined as

$$
D(p \parallel q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \qquad (2.9)
$$

This can be written as

$$
D(p \parallel q) = -\sum_x p(x) \log q(x) + \sum_x p(x) \log p(x)
$$

Up to sign, the second sum is the familiar entropy $H(p) = -\sum_x p(x) \log p(x)$ of the distribution $p$, whereas the first sum is an entropy-like term in which both distributions appear. We will call this first term the cross-entropy $H(p; q) = -\sum_x p(x) \log q(x)$ of the distribution $q$ relative to $p$, giving

$$
D(p \parallel q) = H(p; q) - H(p)
$$
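Returning to (2.8) and the zero-frequency problem: the following minimal Python sketch (not from the dissertation; the toy vocabulary, sample, and function names are invented for illustration) computes the relative-frequency ML estimates and contrasts them with add-one (Laplace) smoothing, one simple representative of the kind of zero-avoiding methods surveyed by Church & Gale (1991).

```python
from collections import Counter

def ml_estimates(sample, vocabulary):
    """Relative-frequency (ML) estimates per (2.8): theta_i = c_i / n."""
    counts = Counter(sample)
    n = len(sample)
    return {w: counts[w] / n for w in vocabulary}

def add_one_estimates(sample, vocabulary):
    """Add-one (Laplace) smoothing: a deliberately biased estimator that
    keeps every probability strictly positive (illustrative choice only)."""
    counts = Counter(sample)
    n, v = len(sample), len(vocabulary)
    return {w: (counts[w] + 1) / (n + v) for w in vocabulary}

vocab = ["a", "b", "c"]
sample = ["a", "a", "b"]                 # outcome "c" is never observed
print(ml_estimates(sample, vocab))       # c_a/n = 2/3, c_b/n = 1/3, c_c/n = 0
print(add_one_estimates(sample, vocab))  # 3/6, 2/6, 1/6 -- all strictly positive
```

The smoothed estimates trade bias for the guarantee that no outcome is assigned probability 0, which is exactly the motivation discussed above.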
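The decomposition $D(p \parallel q) = H(p; q) - H(p)$ can likewise be checked numerically. The sketch below (again illustrative; the two toy distributions are arbitrary) uses natural logarithms and assumes $q(x) > 0$ wherever $p(x) > 0$, since otherwise the relative entropy is infinite.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with 0 log 0 taken as 0."""
    return -sum(px * math.log(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    """H(p; q) = -sum_x p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log(q[x]) for x, px in p.items() if px > 0)

def relative_entropy(p, q):
    """D(p || q) per (2.9), computed via the decomposition H(p; q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.4, "b": 0.4, "c": 0.2}
print(relative_entropy(p, q))  # ~0.0253 nats; nonnegative, and 0 iff p == q
```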
