
main (static) n-gram model. As the cache models provide truly significant improvements in perplexity (sometimes even more than 20%), there exists a large number of more refined techniques that can capture the same patterns as the basic cache models: for example, various topic models, latent semantic analysis based models [3], trigger models [39], or dynamically evaluated models [32, 49].
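For concreteness, a cache component of this kind is typically combined with the static model by simple linear interpolation. The following is a generic statement of that combination (the weight \lambda and cache length K are placeholder symbols, not values taken from the text above):

    P(w \mid h) = (1 - \lambda)\, P_{\mathrm{static}}(w \mid h) + \lambda\, P_{\mathrm{cache}}(w \mid h),
    \qquad
    P_{\mathrm{cache}}(w \mid h) = \frac{1}{K} \sum_{i=1}^{K} \mathbb{1}\!\left[ h_{-i} = w \right],

where h_{-i} denotes the i-th most recent word in the history.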

The advantage of cache (or similar) models is the large reduction in perplexity, which makes these techniques very popular in language modeling papers. Also, their implementation is often quite easy. The problematic part is that new cache-like techniques are compared to weak baselines, such as bigram or trigram models. It is unfair not to include at least a unigram cache model in the baseline, as it is very simple to do so (for example, by using standard LM toolkits such as SRILM [72]).
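To illustrate how simple such a baseline is, the following is a minimal sketch of a unigram cache interpolated with a static model (this is not SRILM's implementation; static_prob, vocab_size, and the default parameter values are placeholders):

    from collections import Counter, deque

    class UnigramCacheLM:
        """Static LM linearly interpolated with a unigram cache of recent words."""

        def __init__(self, static_prob, vocab_size, cache_weight=0.1, cache_size=2000):
            self.static_prob = static_prob    # callable: P_static(word, history)
            self.vocab_size = vocab_size      # used for add-one smoothing of the cache
            self.cache_weight = cache_weight  # interpolation weight (lambda)
            self.cache = deque(maxlen=cache_size)
            self.counts = Counter()

        def prob(self, word, history):
            # Add-one smoothed unigram estimate from the cache, so the mixture
            # stays a proper distribution even when the cache is empty.
            cache_p = (self.counts[word] + 1.0) / (len(self.cache) + self.vocab_size)
            return ((1.0 - self.cache_weight) * self.static_prob(word, history)
                    + self.cache_weight * cache_p)

        def update(self, word):
            # A full deque evicts its oldest entry on append, so decrement
            # that entry's count before inserting the new word.
            if len(self.cache) == self.cache.maxlen:
                self.counts[self.cache[0]] -= 1
            self.cache.append(word)
            self.counts[word] += 1

Updating the cache after every predicted word during testing corresponds to the usual dynamic setup; in recognition, the updates would instead come from the (possibly erroneous) recognizer output, which is exactly the problem discussed next.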

The main disadvantage is the questionable correlation between perplexity improvements and word error rate reductions. This has been explained in [24] as a result of errors being locked in the system: if the speech recognizer decodes a word incorrectly, that word is placed in the cache, which hurts further recognition by increasing the chance of making the same error again. When the output from the recognizer is corrected by the user, cache models are reported to work better; however, it is not practical to force users to manually correct the output. Advanced versions, such as trigger models or LSA models, have been reported to provide interesting WER reductions, yet these models are not commonly used in practice.

Another explanation of the poor performance of cache models in speech recognition is that, since the output of a speech recognizer is imperfect, the perplexity calculations that are normally performed on held-out data (correct sentences) are misleading. If the cache models used the highly ambiguous history of previous words coming from a speech recognizer, the perplexity improvements would be dramatically lower. It is thus important to be careful when drawing conclusions about techniques that access very long context information.
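To make this evaluation mismatch concrete, a small sketch (reusing the hypothetical UnigramCacheLM above and assuming, as a simplification, that the reference and the recognizer 1-best output are aligned word by word) could measure perplexity while feeding the cache either the correct history or the recognized one:

    import math

    def cache_perplexity(lm, reference, history_source):
        # Perplexity of the reference words; the cache is updated from
        # history_source (the reference itself, or an equally long recognizer
        # 1-best), exposing how much of the cache gain survives errorful history.
        log_prob = 0.0
        for i, word in enumerate(reference):
            log_prob += math.log(lm.prob(word, history_source[:i]))
            lm.update(history_source[i])
        return math.exp(-log_prob / len(reference))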

2.3.2 Class Based Models

One way to fight the data sparsity in higher order n-grams is to introduce equivalence classes. In the simplest case, each word is mapped to a single class, which usually represents several words. Next, an n-gram model is trained on these classes. This allows better generalization to novel patterns that were not seen in the training data (the usual factorization is sketched below). Improvements
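For reference, in the common hard-clustering case, with c(w) denoting the class of word w, the class bigram model factorizes in the usual Brown-style way (a generic formula, not one quoted from the text above):

    P(w_i \mid w_{i-1}) = P\big( c(w_i) \mid c(w_{i-1}) \big)\, P\big( w_i \mid c(w_i) \big),

with the obvious extension to higher-order histories over classes.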
