CHAPTER 3. HIDDEN MARKOV MODELS

The table also includes useful summary statistics of the model sizes obtained, and the time it took to compute the models. The latter figures are obviously only a very rough measure of computational demands, and their comparison suffers from the fact that the implementation of each of the methods may well be optimized in idiosyncratic ways. Nevertheless, these figures should give an approximate idea of what to expect in a realistic application of the induction methods involved.

One important general conclusion from these experiments is that both the merged models and those obtained by Baum-Welch training do significantly better than the two 'dumb' approaches, the bigram grammar and the ML HMM (which is essentially a list of observed samples). We can therefore conclude that it pays to try to generalize from the data, either using our Bayesian approach or Baum-Welch on an HMM of suitable size.

Overall, the differences in scores even between the simplest approach (bigram) and the best-scoring one (obtained by merging) are quite small, with phone perplexities ranging from 1.985 to 1.849. This is not surprising given the specialized nature and small size of the sample corpus. Unfortunately, this also leaves very little room for significant differences when comparing alternative methods. However, the advantage of the best model merging result (unconstrained outputs) is still significant compared to the best Baum-Welch result (size factor 1.5) (p = 0.041). Such small differences in log probabilities would probably be irrelevant when the resulting HMMs are embedded in a speech recognition system.

Perhaps the biggest advantage of the merging approach in this application is the compactness of the resulting models. The merged models are considerably smaller than the comparable Baum-Welch HMMs. This is important for any of the standard algorithms operating on HMMs, which typically scale linearly with the number of transitions (or quadratically with the number of states). Besides this advantage in production use, the training times for Baum-Welch grow quadratically with the number of states during the structure induction phase, since that phase requires fully parameterized HMMs, in which the number of transitions (and hence the per-iteration cost) is quadratic in the number of states. This scaling is clearly visible in the run times we observed.

Although we have not done a word-by-word comparison of the HMM structures derived by merging and Baum-Welch, the summary of model sizes seems to confirm our earlier finding (Section 3.6.1) that Baum-Welch training needs a certain redundancy in 'model real estate' to be effective in finding good-fitting models: smaller size factors give poor fits, whereas sufficiently large HMMs will tend to overfit the training data.
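To put the perplexity figures quoted above in perspective, and assuming the standard definition of test-set perplexity as the inverse geometric mean probability per phone, a model M evaluated on a test set containing N phones in total has

\[
\mathrm{PP}(M) \;=\; P(\mbox{test set} \mid M)^{-1/N}
\;=\; \exp\!\Bigl(-\tfrac{1}{N}\,\log P(\mbox{test set} \mid M)\Bigr),
\]

so the gap between perplexities of 1.985 and 1.849 corresponds to about log 1.985 - log 1.849 = 0.07 nats (roughly 0.10 bits) of log probability per phone.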
The choice of the weight given to the prior in HMM merging (Section 3.4.4) controls the model size in an indirect way: larger values lead to more generalization and smaller HMMs. For best results, this value can be set based on previous experience with representative data. This could effectively be done in a cross-validation-like procedure, in which generalization is successively increased, starting with small prior weights. Due to the nature of the merging algorithm, this can be done incrementally, i.e., the outcome of merging with a small weight can be submitted to further merging at a larger value, until further increases reduce generalization on the cross-validation data.
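To make this schedule concrete, the following is a minimal sketch of how the incremental procedure could be organized. It is not the implementation used for the experiments above: merge_hmm and heldout_log_prob are hypothetical stand-ins for a single merging pass at a given prior weight and for scoring the cross-validation data, and the specific weight values are illustrative only.

# Hypothetical sketch of the incremental prior-weight schedule described above.
# merge_hmm(model, data, prior_weight) performs one merging pass at the given
# prior weight; heldout_log_prob(model, data) scores held-out data. Both are
# assumed interfaces, not functions from this dissertation's implementation.

def tune_prior_weight(initial_model, train_data, heldout_data,
                      weights=(0.1, 0.25, 0.5, 1.0, 2.0)):
    """Increase the prior weight step by step, reusing the previously merged
    model as the starting point, and stop once generalization on the
    held-out data no longer improves."""
    best_model = initial_model
    best_score = heldout_log_prob(initial_model, heldout_data)
    model = initial_model
    for w in sorted(weights):          # small weights first; merging only ever collapses states
        model = merge_hmm(model, train_data, prior_weight=w)
        score = heldout_log_prob(model, heldout_data)
        if score < best_score:         # further generalization now hurts on held-out data
            break
        best_model, best_score = model, score
    return best_model

Reusing the previously merged model as the starting point is what makes the schedule cheap: since merging only collapses states, the model obtained with a small weight is a valid starting point for further merging at a larger one.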
