
n-gram probabilities. However, it was shown in [40] that this simple technique degrades performance very significantly for small values of S, and even with small S such as 2000, the complexity induced by the H × V term is still very large.

More successful approaches are based on Goodman's trick for speeding up maximum entropy models using classes [25]. Each word from the vocabulary is assigned to a single class, and only the probability distribution over the classes is computed first. In the second step, the probability distribution over words that are members of a particular class is computed (we know this class from the predicted word whose probability we are trying to estimate). As the number of classes can be very small (several hundred), this is a much more effective approach than using shortlists, and the performance degradation is smaller. We have shown that meaningful classes can be formed very easily, by considering only unigram frequencies of words [50]. Similar approaches have been described in [40] and [57].
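To make the two-step computation concrete, the following is a minimal sketch in Python of a frequency-based class assignment and of the factorization P(w | h) = P(c | h) · P(w | c, h). The names (assign_classes, word_probability, U_class, U_word) are illustrative only and do not correspond to any particular implementation; the hidden-layer state s is assumed to be already computed for the current history h.

```python
import numpy as np

def assign_classes(unigram_counts, num_classes=100):
    """Assign words to classes by unigram frequency: sort words by count and
    cut the sorted list so that each class covers roughly 1/num_classes of
    the total unigram probability mass (a simple frequency-binning scheme)."""
    total = sum(unigram_counts.values())
    words = sorted(unigram_counts, key=unigram_counts.get, reverse=True)
    word2class, mass, cls = {}, 0.0, 0
    for w in words:
        word2class[w] = cls
        mass += unigram_counts[w] / total
        if mass > (cls + 1) / num_classes and cls < num_classes - 1:
            cls += 1
    return word2class

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_probability(s, w, word2class, class_members, U_class, U_word):
    """Two-step factorization: P(w | h) = P(c | h) * P(w | c, h).
    s               ... hidden-layer state for the current history h
    U_class         ... output weights of the class layer (num_classes x H)
    U_word[c]       ... output weights of the words in class c (|c| x H)
    class_members[c]... words of class c, ordered as the rows of U_word[c]"""
    c = word2class[w]
    p_class = softmax(U_class @ s)              # distribution over classes
    p_in_class = softmax(U_word[c] @ s)         # distribution within class c
    return p_class[c] * p_in_class[class_members[c].index(w)]
```

Instead of evaluating all V output units, only the class layer and the (usually small) set of words sharing the class of the predicted word are computed, which is the source of the speedup.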

6.2.4 Reduction of Size of the Hidden Layer

Another way to reduce H × V is to choose a small value of H. For example, in [8], H = 100 is used when the amount of training data is over 600M words. However, we will show that a small hidden layer is insufficient to obtain good performance when the amount of training data is large, as long as the usual neural net LM architecture is used. In Section 6.6, a novel architecture of neural net LM is described, denoted as RNNME (recurrent neural network trained jointly with a maximum entropy model). It allows small hidden layers to be used for models that are trained on huge amounts of data, with very good performance (much better than what can be achieved with the traditional architecture).
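As a rough illustration of the idea behind RNNME (the exact formulation and implementation of the maximum entropy part are given in Section 6.6), the sketch below assumes that the output scores of the recurrent part and of the direct n-gram connections are simply summed before the softmax; the names rnnme_output, V and D are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnnme_output(s, ngram_features, V, D):
    """Sketch of a joint RNN + maximum entropy output layer: the score of each
    word is the sum of the recurrent part (V @ s, with a small hidden state s)
    and a maximum entropy part given by direct weights D indexed by the sparse
    n-gram features of the current history.
    V ... output weights of the RNN part (vocab_size x H)
    D ... direct (maxent) weights        (vocab_size x num_features)"""
    scores = V @ s
    for f in ngram_features:      # only a few features are active per history
        scores += D[:, f]
    return softmax(scores)
```

With the direct connections carrying much of the modeling burden, the hidden layer can remain small even when the training set is huge.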

6.2.5 Parallelization

Computation in artificial neural network models can be parallelized quite easily. It is possible either to divide the matrix-times-vector computation between several CPUs, or to process several examples at once, which turns it into a matrix-times-matrix computation that can be optimized by existing libraries such as BLAS. In the context of NN LMs, Schwenk has reported a speedup of several times by exploiting parallelization [68].
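The effect of processing several examples at once can be shown with a short sketch: stacking B hidden-state vectors into a matrix replaces B matrix-times-vector products with a single matrix-times-matrix product that is executed by an optimized BLAS routine (here through numpy). The sizes H, V and B below are arbitrary and chosen only for illustration.

```python
import numpy as np
import time

H, V, B = 200, 50000, 128                        # hidden size, vocabulary size, batch size
U = np.random.randn(V, H).astype(np.float32)     # output weight matrix
S = np.random.randn(H, B).astype(np.float32)     # B hidden-state vectors, stacked as columns

# One example at a time: B separate matrix-times-vector products.
t0 = time.time()
scores_loop = np.stack([U @ S[:, b] for b in range(B)], axis=1)
t1 = time.time()

# All examples at once: a single matrix-times-matrix product (BLAS gemm).
scores_batch = U @ S
t2 = time.time()

assert np.allclose(scores_loop, scores_batch, atol=1e-3)
print(f"loop: {t1 - t0:.3f}s   batched: {t2 - t1:.3f}s")
```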

It might seem that recurrent networks are much harder to parallelize, as the state of the hidden layer depends on the previous state. However, one can parallelize just the

