
Figure 4.2: Linear interpolation of RNN models trained with different random initialization of the weights. (Plot: perplexity on the Penn corpus versus the number of RNN models, for an RNN mixture and an RNN mixture interpolated with KN5.)
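To make the interpolation in Figure 4.2 concrete, the following sketch (not taken from the thesis; function and variable names are illustrative) shows how linear interpolation of several language models can be computed: each model assigns a probability distribution over the next word, and the mixture is a weighted average of these distributions.

```python
import numpy as np

def interpolate(probs, weights=None):
    """Linearly interpolate next-word distributions from several models.

    probs   -- list of arrays, each a probability distribution over the
               vocabulary produced by one model (e.g. an RNN or a KN5 baseline)
    weights -- optional interpolation weights; equal weights are assumed
               by default, as when averaging randomly initialized RNNs
    """
    probs = np.asarray(probs)
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))
    weights = np.asarray(weights) / np.sum(weights)  # normalize to sum to 1
    return weights @ probs  # weighted average is again a valid distribution

# Example with three hypothetical models over a 4-word vocabulary
p_rnn1 = np.array([0.1, 0.2, 0.3, 0.4])
p_rnn2 = np.array([0.2, 0.2, 0.2, 0.4])
p_kn5  = np.array([0.25, 0.25, 0.25, 0.25])
p_mix = interpolate([p_rnn1, p_rnn2, p_kn5])
print(p_mix, p_mix.sum())  # the mixture still sums to 1
```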

4.4 Comparison of Different Neural Network Architectures

As neural network language models seem to work very well, we will describe them and compare their performance in more detail. As can be seen in Table 4.1, the performance of neural network based models with feedforward architectures is almost identical. The syntactical NNLM has the advantage of having more input features: it uses a syntactical parser to obtain part-of-speech tags for the words in a sentence. We can observe that a significant improvement was obtained by using these additional features. On the other hand, applying this technique to new languages and domains might not be straightforward, as it relies on a syntactical parser that has to be trained on (usually) hand-annotated data. Follow-up work by Emami [22] shows how different linguistic features affect the results, both in perplexity and in word error rate. The conclusion of those experiments seems to be that the linguistic features improve only perplexity, but do not reduce the word error rate.
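As a rough illustration of this idea (the exact architecture of the syntactical NNLM is not reproduced here; layer sizes, names, and the use of embedding tables are assumptions), additional features such as POS tags can be embedded and concatenated with the word representations at the input of a feedforward NNLM:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes, for illustration only
V, P = 10000, 45              # vocabulary size, number of POS tags
d_w, d_p, H = 100, 20, 200    # word / tag embedding sizes, hidden layer size
n = 3                         # number of previous words in the context

E_word = rng.normal(scale=0.1, size=(V, d_w))   # word embedding table
E_pos  = rng.normal(scale=0.1, size=(P, d_p))   # POS-tag embedding table
W_h = rng.normal(scale=0.1, size=(n * (d_w + d_p), H))
W_o = rng.normal(scale=0.1, size=(H, V))

def nnlm_forward(word_ids, pos_ids):
    """Next-word distribution given the previous n words and their POS tags."""
    # Concatenate word and tag embeddings for each context position
    x = np.concatenate([np.concatenate([E_word[w], E_pos[p]])
                        for w, p in zip(word_ids, pos_ids)])
    h = np.tanh(x @ W_h)                 # hidden layer
    logits = h @ W_o
    e = np.exp(logits - logits.max())    # softmax over the vocabulary
    return e / e.sum()

p = nnlm_forward([12, 345, 678], [3, 17, 5])
print(p.shape, p.sum())  # (10000,) 1.0
```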

Additional input features make interpretation of the results quite difficult: the PTB corpus is quite small, and thus having additional information from a parser that is trained on additional data (POS tags of words) can be a great boost to the results. However, with an increased amount of training data, such additional information would probably be less useful. Thus, it would be more convincing if the results obtained

