Statistical Language Models based on Neural Networks - Faculty of ...
computation between the hidden and the output layers. It is also possible to parallelize the computation by training from multiple positions in the training data simultaneously.

Another approach is to divide the training data into K subsets and train a single neural net model on each of them separately. All K models must then be used during the test phase and their outputs averaged. However, neural network models profit from clustering of similar events, so such an approach would lead to suboptimal results. The test phase would also be more computationally complex, as K models would have to be evaluated instead of one.
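The K-subset scheme above can be sketched as follows. This is a toy illustration, not the thesis implementation: `SubsetLM` is a hypothetical stand-in for a neural LM trained on one data subset, and the averaging is a uniform linear interpolation of the K output distributions.

```python
import random

class SubsetLM:
    """Hypothetical stand-in for a neural LM trained on one of K data subsets."""
    def __init__(self, seed, vocab_size=5):
        rng = random.Random(seed)
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        self._dist = [w / total for w in weights]  # fixed toy distribution

    def predict(self, context):
        return self._dist  # a real model would condition on the context

def ensemble_predict(models, context):
    """Evaluate all K models at test time and average their next-word
    distributions (uniform-weight linear interpolation)."""
    dists = [m.predict(context) for m in models]
    k = len(dists)
    return [sum(col) / k for col in zip(*dists)]

models = [SubsetLM(seed=k) for k in range(4)]   # K = 4 subset models
avg = ensemble_predict(models, context=["the", "cat"])
```

The extra test-time cost is visible directly: every prediction calls all K models, which is exactly the K-fold evaluation overhead the text mentions.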
6.3 Experimental Setup
We performed recognition on the English Broadcast News (BN) NIST RT04 task using state-of-the-art acoustic models trained on the BN corpus (430 hours of audio), provided to us by IBM [31]. The acoustic model was discriminatively trained on about 430 hours of HUB4 and TDT4 data, with LDA+MLLT, VTLN, fMLLR-based SAT training, fMMI and mMMI. IBM also provided us with its state-of-the-art speech recognizer, Attila [71], and two Kneser-Ney smoothed backoff 4-gram LMs containing 4.7M n-grams and 54M n-grams, both trained on about 400M tokens. Additional details about the recognizer can be found in [31].
We followed IBM's multi-pass decoding recipe, using the 4.7M n-gram LM in the first pass followed by rescoring with the larger 54M n-gram LM. The development data consisted of DEV04f+RT03 data (25K tokens). For evaluation, we used the RT04 evaluation set (47K tokens). The size of the vocabulary is 82K words.
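The second pass of such a recipe can be sketched as n-best rescoring: the hypotheses produced with the small first-pass LM keep their acoustic scores, but the LM score is recomputed with the larger model. This is only an illustrative sketch, not the actual Attila recipe (which rescores lattices); the score values, the `lm_weight`, and the `big_lm_logprob` function are all made up for the example.

```python
def rescore_nbest(hypotheses, big_lm_logprob, lm_weight):
    """Pick the best hypothesis after replacing the first-pass LM score
    with the large-LM score. Each hypothesis is a tuple
    (words, acoustic_logscore, small_lm_logscore)."""
    rescored = []
    for words, am_score, _small_lm_score in hypotheses:
        total = am_score + lm_weight * big_lm_logprob(words)
        rescored.append((total, words))
    return max(rescored)[1]  # hypothesis with the highest combined score

def big_lm_logprob(words):
    # Hypothetical stand-in for the large 54M n-gram LM.
    return -1.0 if words == ("the", "cat", "sat") else -5.0

nbest = [
    (("the", "cat", "sat"), -10.0, -4.0),
    (("the", "cats", "at"), -9.5, -3.5),
]
best = rescore_nbest(nbest, big_lm_logprob, lm_weight=2.0)
```

Here the larger LM overturns the first-pass ranking: the second hypothesis has the better acoustic score, but the rescoring pass prefers the first.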
The training corpora for the language models are shown in Table 6.1. The baseline 4-gram model was trained using modified Kneser-Ney smoothing on all available data (about 403M tokens). For all the RNN models described later, we used the class-based architecture described in Section 3.4.2 with 400 classes.
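The class-based output layer factorizes the next-word probability as P(w | h) = P(class(w) | h) · P(w | class(w), h), so each step evaluates roughly 400 class outputs plus the words of a single class instead of all 82K vocabulary outputs. A minimal sketch of the factorization, with toy uniform distributions standing in for the trained RNN outputs:

```python
# Toy vocabulary of 4 words partitioned into 2 classes (the thesis uses
# 82K words and 400 classes; all probabilities below are placeholders).
word2class = {"the": 0, "a": 0, "cat": 1, "dog": 1}
class_words = {0: ["the", "a"], 1: ["cat", "dog"]}

def p_class(c, history):
    return 0.5                          # toy: uniform over the 2 classes

def p_word_in_class(word, c, history):
    return 1.0 / len(class_words[c])    # toy: uniform within the class

def class_based_prob(word, history):
    """P(w | h) = P(class(w) | h) * P(w | class(w), h)."""
    c = word2class[word]
    return p_class(c, history) * p_word_in_class(word, c, history)

# The factorization still yields a proper distribution over the vocabulary.
total = sum(class_based_prob(w, []) for w in word2class)
```

The speed-up comes from the normalization cost: a full softmax normalizes over the whole vocabulary, while the factored form normalizes once over the classes and once over the words of one class.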
6.4 Automatic Data Selection and Sorting
Neural networks are usually trained with stochastic gradient descent, which assumes that the order of the training data is randomized before the start of each training epoch. In the context of NN LMs, the randomization is usually performed at the level of sentences or paragraphs. However, an alternative view can be taken when it comes to training deep