Statistical Language Models based on Neural Networks - Faculty of ...
main (static) n-gram model. As cache models provide truly significant improvements in perplexity (sometimes more than 20%), a large number of more refined techniques exist that can capture the same patterns as the basic cache models - for example, various topic models, latent semantic analysis based models [3], trigger models [39], or dynamically evaluated models [32] [49].

The advantage of cache (and similar) models is a large reduction of perplexity, which makes these techniques very popular in language modeling papers. Their implementation is also often quite easy. The problematic part is that new cache-like techniques are frequently compared to weak baselines, such as bigram or trigram models. It is unfair not to include at least a unigram cache model in the baseline, as doing so is very simple (for example, by using standard LM toolkits such as SRILM [72]).
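As a concrete illustration (not the implementation used in any of the cited papers), a unigram cache model can be sketched as a linear interpolation between a static model and word frequencies in a sliding window over recent text; the names `static_prob`, `lam`, and `cache_size` below are hypothetical:

```python
from collections import Counter

class UnigramCacheLM:
    """Minimal sketch of a unigram cache interpolated with a static model.

    `static_prob` (a hypothetical callable) returns a word's probability
    under the main static model; `lam` is the interpolation weight given
    to the cache; `cache_size` is the length of the sliding window.
    """

    def __init__(self, static_prob, lam=0.1, cache_size=1000):
        self.static_prob = static_prob
        self.lam = lam
        self.cache_size = cache_size
        self.history = []          # recent words, most recent last
        self.counts = Counter()    # word counts inside the cache window

    def prob(self, word):
        # P(w) = (1 - lam) * P_static(w) + lam * P_cache(w)
        if self.history:
            p_cache = self.counts[word] / len(self.history)
        else:
            p_cache = 0.0
        return (1.0 - self.lam) * self.static_prob(word) + self.lam * p_cache

    def update(self, word):
        # slide the cache window over the processed text
        self.history.append(word)
        self.counts[word] += 1
        if len(self.history) > self.cache_size:
            old = self.history.pop(0)
            self.counts[old] -= 1
```

After each word is processed, `update` is called, so recently seen words receive higher probability - which is exactly what makes the error-locking problem described below possible when the recognizer output itself is wrong.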
The main disadvantage is the questionable correlation between perplexity improvements and word error rate reductions. This has been explained in [24] as a consequence of errors being locked in the system: if the speech recognizer decodes a word incorrectly, that word is placed in the cache, which hurts further recognition by increasing the chance of making the same error again. When the recognizer output is corrected by the user, cache models are reported to work better; however, it is not practical to force users to correct the output manually. Advanced versions, such as trigger models or LSA-based models, were reported to provide interesting WER reductions, yet these models are not commonly used in practice.
Another explanation for the poor performance of cache models in speech recognition is that since the output of a speech recognizer is imperfect, the perplexity calculations that are normally performed on held-out data (correct sentences) are misleading. If the cache models used the highly ambiguous history of previous words coming from a speech recognizer, the perplexity improvements would be dramatically lower. It is thus important to be careful when drawing conclusions about techniques that access very long context information.
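For reference, the perplexity figures discussed above are computed on held-out text as the exponentiated average negative log-probability of the words; a minimal sketch (the function name `perplexity` is my own):

```python
import math

def perplexity(probs):
    """Perplexity of a word sequence, given the probability the model
    assigned to each word in context: exp(-(1/N) * sum(log P(w_i | h_i)))."""
    n = len(probs)
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / n)
```

For example, a model that assigns probability 0.25 to every word of a test sequence has perplexity 4. The key caveat from the paragraph above is that `probs` is normally computed with the *correct* history from held-out data, not the noisy history a recognizer would actually supply.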
2.3.2 Class Based Models
One way to fight data sparsity in higher-order n-grams is to introduce equivalence classes. In the simplest case, each word is mapped to a single class, which usually represents several words. An n-gram model is then trained on these classes. This allows better generalization to novel patterns that were not seen in the training data. Improvements