Statistical Language Models based on Neural Networks - Faculty of ...
were: 400 classes, hidden layer size up to 800 neurons. Other hyper-parameters such as
interpolation weights were tuned on the WSJ'92 set (333 sentences), and the WSJ'93 set
used for evaluation consists of 465 sentences.

Note that this setup is quite simple, as the acoustic models that were used to generate
the n-best lists for this task were not state of the art. Also, the corresponding language
models used in the previous research were trained on only a limited amount of training
data (37M-70M words); better performance could be expected by using more training data,
which is easily affordable for this task. The same holds for the vocabulary: a
20K word list was used, although it would be simple to use a larger one. Thus, the experiments
on this setup are not meant to beat the state of the art, but to allow comparison with
other LM techniques and to provide more insight into the performance of the RNN LMs.
5.1.1 Results on the JHU Setup
Results with RNN models and competitive techniques are summarized in Table 5.1. The
best RNN models have a very high optimal weight when combined with the KN5 baseline model,
and in fact discarding the n-gram model completely does not significantly affect the
results. Interpolation of three RNN models gives the best results: the word error rate
is reduced by about 20% relative. Other techniques, such as a discriminatively trained
language model and a joint LM (structured model), provide smaller improvements, only
about 2-3% relative reduction of WER on the evaluation set.
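The interpolation described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the model probabilities, the 0.8/0.2 weight split, and the function names are hypothetical, and a real rescoring pass would obtain the per-word probabilities from the RNN LM and the KN5 model over hypotheses in the n-best lists.

```python
# Hedged sketch of linear interpolation of language model probabilities,
# as used when combining an RNN LM with an n-gram (KN5) baseline during
# n-best list rescoring. All numbers below are illustrative placeholders.

import math

def interpolate_logprob(word_probs, weights):
    """Interpolated log10 probability of a single word.

    word_probs: per-model probabilities P_m(w | history)
    weights:    interpolation weights lambda_m, summing to 1
    """
    p = sum(lam * p_m for lam, p_m in zip(weights, word_probs))
    return math.log10(p)

def hypothesis_score(per_word_probs, weights):
    """Total interpolated log-probability of one hypothesis, summed
    over its words; hypotheses are then re-ranked by this score
    (optionally combined with the acoustic score)."""
    return sum(interpolate_logprob(probs, weights) for probs in per_word_probs)

# Toy example: two models (RNN LM, KN5) scoring a three-word hypothesis.
probs = [(0.20, 0.10), (0.05, 0.02), (0.30, 0.25)]
weights = (0.8, 0.2)  # hypothetical: a high weight on the RNN model
score = hypothesis_score(probs, weights)
```

Because the optimal RNN weight is close to 1.0 here, setting it to exactly 1.0 (i.e. dropping the n-gram term) changes the scores, and hence the ranking, very little.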
The adapted RNN model is not evaluated as a dynamic RNN LM as described in the
previous chapters, but simply as a static model that is re-trained on the 1-best lists. This was
done due to performance issues: it becomes relatively slow to work with RNN models that
are continuously updated, especially in the n-best list rescoring framework. Adaptation
itself provides a relatively small improvement, especially with the large models.
5.1.2 Performance with Increasing Size of the Training Data
It was observed by Joshua Goodman that with an increasing amount of training data,
the improvements provided by many advanced language modeling techniques vanish, with
a possible conclusion that it might be sufficient to train basic n-gram models on huge
amounts of data to obtain good performance [24]. This is sometimes interpreted as an
argument against language modeling research; however, as was mentioned in the
introduction of this thesis, simple counting of words in different contexts is far from being close