Statistical Language Models based on Neural Networks - Faculty of ...
than a sentence) will be called a "long-span model". An example of a long-span model is
a cache model, or a topic model.

Comparing the performance of a short-span model (such as a 4-gram LM) against a
combination of a short-span and a long-span model (such as 4-gram + cache) is very
popular in the literature, as it leads to large improvements in perplexity. However, the
reduction of the word error rate in speech recognition obtained by using long-span models
is usually quite small. As was mentioned previously, this is caused by the fact that
perplexity is commonly evaluated while assuming a perfect history, which is a false
assumption, as the history in speech recognition is typically very noisy¹. Typical examples
of such experiments are novel ways of computing cache-like models. Joshua Goodman's
report [24] is a good reference for those who are interested in more insight into criticism
of typical language modeling research.
To avoid these mistakes, the performance of individual models is reported and compared
to a modified Kneser-Ney smoothed 5-gram (which is basically the state-of-the-art among
n-gram models), and further compared to a combination of a 5-gram model with a
unigram cache model. After that, we report the results after using all models together,
with an analysis of which models provide the most complementary information in the
combination, and which models discover patterns that can be better discovered by other
techniques.
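A unigram cache model of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the exact model used in the experiments: the function name, the `alpha` smoothing constant, and the interface are assumptions made for the sketch.

```python
from collections import Counter

def unigram_cache_prob(word, history, vocab_size, alpha=0.1):
    # Sketch of a unigram cache model: a word's probability is boosted
    # if it already appeared in the recent history (the "cache").
    # `alpha` smooths toward a uniform distribution over the vocabulary,
    # so unseen words still receive nonzero probability.
    counts = Counter(history)
    total = len(history)
    cache_p = counts[word] / total if total > 0 else 0.0
    return (1 - alpha) * cache_p + alpha / vocab_size
```

In practice such a cache model is not used alone; its value comes from interpolation with an n-gram model, since recently seen words tend to reappear.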
4.2 Penn Treebank Dataset<br />
One of the most widely used data sets for evaluating the performance of statistical language
models is the Penn Treebank portion of the WSJ corpus (denoted here as the Penn Treebank
Corpus). It has been previously used by many researchers, with exactly the same data
preprocessing (the same training, validation and test data, and the same vocabulary limited
to 10K words). This is quite rare in the language modeling field, and allows us to directly
compare the performance of different techniques and their combinations, as many researchers
were kind enough to provide us their results for the following comparison. Combination
of the models is further done by using linear interpolation; for a combination of two models
M1 and M2 this means
P_M12(w|h) = λ P_M1(w|h) + (1 − λ) P_M2(w|h)    (4.1)
¹ Thanks to Dietrich Klakow for pointing this out.
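Equation 4.1 can be illustrated with a short sketch that interpolates two probability distributions over the same vocabulary (function and variable names are illustrative; the example distributions are invented for the demonstration):

```python
def interpolate(dist1, dist2, lam):
    # Linear interpolation of two model distributions, as in Eq. 4.1:
    #   P_M12(w|h) = lam * P_M1(w|h) + (1 - lam) * P_M2(w|h)
    # Both inputs must be distributions over the same vocabulary.
    assert set(dist1) == set(dist2)
    return {w: lam * dist1[w] + (1 - lam) * dist2[w] for w in dist1}

# Toy distributions for a single history h:
ngram = {"cat": 0.7, "dog": 0.2, "fish": 0.1}
cache = {"cat": 0.2, "dog": 0.5, "fish": 0.3}
mix = interpolate(ngram, cache, lam=0.6)
# mix["cat"] = 0.6 * 0.7 + 0.4 * 0.2 = 0.50
```

Note that if both inputs are valid probability distributions, the interpolated result sums to 1 for any λ in [0, 1]; λ is typically tuned on validation data.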