At the very least, the method chosen should be

- well-defined, i.e., correspond to some probabilistic model that represents a proper distribution over all strings;
- unbiased with respect to the methods being compared, to the extent possible.

Standard back-off models (where a second model is consulted if, and only if, the first one returns probability zero) do not yield consistent probabilities unless they are combined with 'discounting' of probabilities to ensure that the total probability mass sums to unity (Katz 1987). The discounting scheme, as well as various smoothing approaches (e.g., adding a fixed number of virtual 'Dirichlet' samples to the parameter estimates), tend to be specific to the model used, and are therefore inherently problematic when comparing different model-building methods.

To overcome these problems, we chose to use the mixture model approach described in Section 2.3.1. The target models to be evaluated are combined with a simple back-off model that guarantees non-zero probabilities, e.g., a bigram grammar with smoothed parameters. This back-off grammar is identical in structure for all target models. Unlike discrete back-off schemes, the target and the back-off model are both always consulted for the probability they assign to a given sample; these probabilities are then weighted and averaged according to a mixture proportion.

When comparing two model induction methods, we first let each induce a structure. Each is built into a mixture model, and both the component model parameters and the mixture proportions are estimated using the EM procedure for generic mixture distributions. To get meaningful estimates for the mixture proportions, the HMM structure is induced based on a subset of the training data, and the full training data is then used to estimate the parameters, including the mixture weights. This holding out of training data makes the mixture model approach similar to the deleted interpolation method (Jelinek & Mercer 1980). The main difference is that the component parameters are estimated jointly with the mixture proportions.¹⁸ In our experiments we always used half of the training data in the structure induction phase, adding the other half during the EM estimation phase. Also, to ensure that the back-off model receives a non-zero prior probability, we estimate the mixture proportions under a simple symmetrical Dirichlet prior with α₁ = α₂ = 1.5.
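To make the evaluation scheme concrete, the following is a minimal sketch of the two-component mixture and the EM re-estimation of its proportion. The scoring functions target_prob and backoff_prob are hypothetical stand-ins for the probabilities assigned by the target HMM and the back-off grammar, the iteration count is an illustrative placeholder, and the sketch re-estimates only the mixture weight, whereas the actual procedure re-estimates the component parameters jointly with it.

```python
def mixture_prob(x, lam, target_prob, backoff_prob):
    """Probability of sample x under the two-component mixture:
    lam * P_target(x) + (1 - lam) * P_backoff(x)."""
    return lam * target_prob(x) + (1.0 - lam) * backoff_prob(x)


def estimate_mixture_weight(heldout, target_prob, backoff_prob,
                            alpha=1.5, iterations=50):
    """EM (MAP) estimate of the mixture proportion on held-out samples,
    under a symmetrical Dirichlet(alpha, alpha) prior on the weights."""
    lam = 0.5  # uninformative starting point
    n = len(heldout)
    for _ in range(iterations):
        # E-step: posterior probability that the target model generated x
        resp = []
        for x in heldout:
            pt = lam * target_prob(x)
            pb = (1.0 - lam) * backoff_prob(x)
            resp.append(pt / (pt + pb))  # pb > 0, since the back-off is smoothed
        # M-step: the Dirichlet prior contributes (alpha - 1) virtual samples
        # per component, keeping both weights away from zero
        lam = (sum(resp) + alpha - 1.0) / (n + 2.0 * (alpha - 1.0))
    return lam
```

Because both components are always consulted, the mixture assigns non-zero probability to every sample even when the target HMM itself does not, which is what makes comparisons between differently induced structures well-defined.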
3.6.2.4 Results and discussion

HMM merging was evaluated in two variants, with and without the single-output constraint. In each version, three settings of the structure prior weight were tried: 0.25, 0.5, and 1.0. Similarly, for Baum-Welch training, the preset number of states in the fully parameterized HMM was set to 1.0, 1.5, and 1.75 times the longest sample length. For comparison purposes, we also included the performance of the unmerged maximum-likelihood HMM, and of a biphone grammar of the kind used in the mixture models built to evaluate the other model types. Table 3.1 summarizes the results of these experiments.

¹⁸ This difference can be traced to the different goals: in deleted interpolation the main goal is to gauge the reliability of parameter estimates, whereas here we want to assess the different structures.
