Statistical Language Models based on Neural Networks - Faculty of ...
main (static) n-gram model. As cache models provide truly significant improvements in perplexity (sometimes more than 20%), a large number of more refined techniques exist that can capture the same patterns as the basic cache models - for example, various topic models, latent semantic analysis based models [3], trigger models [39], or dynamically evaluated models [32] [49].

The advantage of cache (and similar) models is a large reduction of perplexity, which makes these techniques very popular in language modeling papers. Their implementation is also often quite easy. The problematic part is that new cache-like techniques are frequently compared to weak baselines, such as bigram or trigram models. It is unfair not to include at least a unigram cache model in the baseline, as doing so is very simple (for example, by using standard LM toolkits such as SRILM [72]).
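As a concrete illustration (not the implementation used in any of the cited papers), a unigram cache model can be sketched as a linear interpolation between a static model and word frequencies in a sliding window over recent text; the names `static_prob`, `lam`, and `cache_size` below are hypothetical:

```python
from collections import Counter

class UnigramCacheLM:
    """Minimal sketch of a unigram cache interpolated with a static model.

    `static_prob` (a hypothetical callable) returns a word's probability
    under the main static model; `lam` is the interpolation weight given
    to the cache; `cache_size` is the length of the sliding window.
    """

    def __init__(self, static_prob, lam=0.1, cache_size=1000):
        self.static_prob = static_prob
        self.lam = lam
        self.cache_size = cache_size
        self.history = []          # recent words, most recent last
        self.counts = Counter()    # word counts inside the cache window

    def prob(self, word):
        # P(w) = (1 - lam) * P_static(w) + lam * P_cache(w)
        if self.history:
            p_cache = self.counts[word] / len(self.history)
        else:
            p_cache = 0.0
        return (1.0 - self.lam) * self.static_prob(word) + self.lam * p_cache

    def update(self, word):
        # slide the cache window over the processed text
        self.history.append(word)
        self.counts[word] += 1
        if len(self.history) > self.cache_size:
            old = self.history.pop(0)
            self.counts[old] -= 1
```

After each word is processed, `update` is called, so recently seen words receive higher probability - which is exactly what makes the error-locking problem described below possible when the recognizer output itself is wrong.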
The main disadvantage is the questionable correlation between perplexity improvements and word error rate reductions. This has been explained in [24] as a consequence of errors being locked in the system: if the speech recognizer decodes a word incorrectly, that word is placed in the cache, which hurts further recognition by increasing the chance of making the same error again. When the recognizer output is corrected by the user, cache models are reported to work better; however, it is not practical to force users to correct the output manually. Advanced versions, such as trigger models or LSA-based models, were reported to provide interesting WER reductions, yet these models are not commonly used in practice.
Another explanation for the poor performance of cache models in speech recognition is that since the output of a speech recognizer is imperfect, the perplexity calculations that are normally performed on held-out data (correct sentences) are misleading. If the cache models used the highly ambiguous history of previous words coming from a speech recognizer, the perplexity improvements would be dramatically lower. It is thus important to be careful when drawing conclusions about techniques that access very long context information.
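For reference, the perplexity figures discussed above are computed on held-out text as the exponentiated average negative log-probability of the words; a minimal sketch (the function name `perplexity` is my own):

```python
import math

def perplexity(probs):
    """Perplexity of a word sequence, given the probability the model
    assigned to each word in context: exp(-(1/N) * sum(log P(w_i | h_i)))."""
    n = len(probs)
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / n)
```

For example, a model that assigns probability 0.25 to every word of a test sequence has perplexity 4. The key caveat from the paragraph above is that `probs` is normally computed with the *correct* history from held-out data, not the noisy history a recognizer would actually supply.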
2.3.2 Class Based Models
One way to fight data sparsity in higher-order n-grams is to introduce equivalence classes. In the simplest case, each word is mapped to a single class, which usually represents several words. An n-gram model is then trained on these classes. This allows better generalization to novel patterns that were not seen in the training data. Improvements