Statistical Language Models based on Neural Networks - Faculty of ...
improvements actually increased - this can also be explained by the fact that the WSJ-JHU
setup is about 40x larger.

5.2 Kaldi WSJ Setup
Additional experiments on the Wall Street Journal task were performed using n-best lists
generated with the open-source speech recognition toolkit Kaldi [60], trained on the SI-84 data
further described in [62]. The acoustic models used in the following experiments were
based on triphones and GMMs. Several advantages of using Kaldi, such as better repeatability
of the performed experiments, were already mentioned at the beginning of this
chapter (although Kaldi is still being developed, it should be easy to repeat the following
experiments with slightly better results, as the RNN rescoring code is integrated in the Kaldi
toolkit). Note that this setup is also not the state of the art, as with more training data
and advanced acoustic modeling techniques it is possible to obtain better baseline results.
Rescoring experiments with RNN LMs on a state-of-the-art setup are the subject of the following
chapter.
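The n-best rescoring procedure used throughout these experiments can be sketched as follows. The log-linear combination of acoustic and language model scores is standard; the `lm_scale` and word insertion penalty values here are illustrative assumptions, not the settings actually tuned for this setup.

```python
import math

def rescore_nbest(nbest, lm_logprob, lm_scale=15.0, wip=0.0):
    """Pick the best hypothesis from an n-best list after LM rescoring.

    nbest:      list of (words, acoustic_logscore) pairs from the decoder
    lm_logprob: function scoring a word sequence under the new LM (log domain)
    lm_scale, wip: LM scale and word insertion penalty - illustrative values,
                   normally tuned on a development set
    """
    best, best_score = None, -math.inf
    for words, ac_score in nbest:
        # combine acoustic score with the rescoring LM score log-linearly
        score = ac_score + lm_scale * lm_logprob(words) + wip * len(words)
        if score > best_score:
            best, best_score = words, score
    return best
```

In practice the new LM score is often itself an interpolation of the RNN model with the baseline n-gram model before being plugged into this combination.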
I used 1000-best lists generated by Stefan Kombrink in the following experiments. The
test sets are the same as for the JHU setup. This time I trained RNNME models to save
time - it is possible to achieve very good results even with a tiny hidden layer. For
the ME part of the model, I used unigram, bigram, trigram and fourgram features, with
a hash size of 2G parameters. The vocabulary was limited to the 20K words used by the decoder.
The training data consisted of 37M tokens, of which 1% was used as heldout data. The
training data were shuffled to increase the speed of convergence during training; however, due
to the homogeneity of the corpus, the automatic sorting technique described in Chapter 6
was not used. The results are summarized in Table 5.3.
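The hashed n-gram features of the ME part can be sketched as below: each n-gram history is mapped by a hash function into a fixed-size parameter table, so no explicit n-gram inventory has to be stored. The particular hash function and per-order seeding here are illustrative assumptions, not the exact scheme of the toolkit used in the experiments.

```python
HASH_SIZE = 2_000_000_000  # 2G direct-connection parameters, as in the text

def history_feature_indices(history, max_order=4):
    """Return one hashed feature index per n-gram order (unigram..fourgram).

    history: list of word ids preceding the predicted word.
    The hash is built cumulatively, so each higher order extends the
    context of the previous one by one more history word.
    """
    indices = []
    h = 0
    for order in range(1, max_order + 1):
        if order == 1:
            indices.append(0)  # unigram feature maps to a shared slot
            continue
        if len(history) < order - 1:
            break  # not enough context for this order
        # fold the next-older history word into the running hash
        h = (h * 1000003 + history[-(order - 1)] + 1) % HASH_SIZE
        indices.append((h + order) % HASH_SIZE)
    return indices
```

Each returned index selects a weight in the direct input-to-output connections; hash collisions trade a small amount of accuracy for constant memory.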
It can be seen that RNNME models improve PPL and WER significantly even with a tiny
hidden layer, such as 10 neurons. However, to reach top performance, it
is useful to train models as large as possible. While training small RNNME models (such
as those with fewer than 100 neurons in the hidden layer) takes several hours, training the
largest models takes a few days. After combining all RNNME models, the performance
still improves; however, adding unsupervised adaptation resulted in a rather insignificant
improvement - note that Eval 92 contains 333 utterances and Eval 93 only 213, thus
there is noise in the WER results due to the small amount of test data.
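Combining several models as above is typically done by linear interpolation of their per-word probabilities. A minimal sketch, assuming equal interpolation weights (in practice the weights are tuned on heldout data):

```python
def interpolate(prob_dists, weights=None):
    """Linearly interpolate word probability distributions from several models.

    prob_dists: list of dicts mapping word -> probability, one per model.
    weights:    interpolation weights summing to 1; equal weights by default
                (an illustrative assumption - normally tuned on heldout data).
    """
    if weights is None:
        weights = [1.0 / len(prob_dists)] * len(prob_dists)
    combined = {}
    for w, dist in zip(weights, prob_dists):
        for word, p in dist.items():
            combined[word] = combined.get(word, 0.0) + w * p
    return combined
```

Since each component distribution sums to one and the weights sum to one, the interpolated distribution is again a valid probability distribution.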