3.4.3 Approximation of Complex Language Model by Backoff N-gram Model
In [15], we have shown that an NNLM can be partly approximated by a finite state machine. The conversion is done by sampling words from the probability distribution computed by the NNLM, and a common N-gram model is afterwards trained on the sampled text data. Given an infinite amount of sampled data and an infinite order N, this approximation technique is guaranteed to converge to a model equivalent to the one that was used for generating the words.
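As a rough illustration of the conversion, the following Python sketch samples text from a language model and collects n-gram counts from it. The next_word_probs() interface is a hypothetical stand-in for whatever API the neural model exposes, and a real setup would hand the sampled text to a standard n-gram toolkit for smoothing rather than use the raw counts:

    import collections
    import random

    def sample_text(model, num_words, bos="<s>", eos="</s>"):
        # model.next_word_probs(history) -> {word: prob} is a hypothetical
        # interface standing in for the NNLM's softmax output.
        words, history = [], [bos]
        while len(words) < num_words:
            probs = model.next_word_probs(history)
            w = random.choices(list(probs), weights=list(probs.values()))[0]
            if w == eos:
                history = [bos]          # start a new sampled sentence
            else:
                words.append(w)
                history.append(w)
        return words

    def count_ngrams(words, order):
        # Raw n-gram counts over the sampled text; in practice these would
        # be smoothed into a backoff model by a standard n-gram toolkit.
        counts = collections.Counter()
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        return counts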
Of course, this is not achievable in practice, as it is not possible to generate infinite amounts of data. However, we have shown that even for manageable amounts of sampled data (hundreds of millions of words), the approximated model provides some of the improvement over the baseline n-gram model that is provided by the full NNLM. Note that this approach is not limited to NNLMs or RNNLMs, but can be used to convert any complex model into a finite state representation. However, as the motivating examples in the introductory chapter have shown, representing certain patterns using FSMs is quite impractical; we therefore believe this technique is most useful for tasks with a limited amount of training data, where the size of the models is not so restrictive.
An important advantage of this approach is that the approximated model can be used directly during decoding, for standard lattice rescoring, etc. It is even possible to use (R)NNLMs for speech recognition without having a single line of neural network code in the system, as the complex patterns learned by the neural network are represented as a list of possible combinations in the n-gram model. The sampling approach thus gives the best possible speedup for the test phase, by trading computational complexity for space complexity.
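To make this trade-off concrete: a backoff query is just a few table lookups per word, with no matrix operations at test time. Below is a minimal sketch of an ARPA-style backoff lookup in Python, where the table layout (log10 probabilities and backoff weights keyed by tuples) is illustrative rather than any particular toolkit's format:

    def backoff_logprob(word, history, probs, backoffs):
        # probs maps an n-gram tuple to its log10 probability; backoffs
        # maps a context tuple to its backoff weight, as in an ARPA file.
        ngram = tuple(history) + (word,)
        if ngram in probs:
            return probs[ngram]
        if not history:
            return probs.get((word,), float("-inf"))  # OOV handling omitted
        bow = backoffs.get(tuple(history), 0.0)
        return bow + backoff_logprob(word, history[1:], probs, backoffs)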
Empirical results obtained by using this technique to approximate RNNLMs in speech recognition systems are described in [15] and [38], which are joint work with Anoop Deoras and Stefan Kombrink.
3.4.4 Dynamic Evaluation of the Model
From the artificial intelligence point of view, the usual statistical language models have another drawback besides their inability to represent longer-term patterns: their inability to learn new information. This is caused by the fact that LMs are commonly assumed to be static - the parameters of the models do not change during processing of the data.
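Dynamic evaluation removes this restriction by letting the parameters keep changing while the test data is processed. A minimal sketch of the idea, assuming a hypothetical model interface with predict() and train_step() methods:

    def dynamic_evaluation(model, test_words, learning_rate=0.1):
        # After each test word is predicted (and thus scored), the model
        # immediately takes a training step on it, so the parameters adapt
        # to the test data as it is being processed.
        total_logprob = 0.0
        for word in test_words:
            total_logprob += model.predict(word)   # log P(word | history)
            model.train_step(word, learning_rate)  # on-line parameter update
        return total_logprob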