”Every time I fire a linguist out of my group, the accuracy goes up³.”
We may understand Jelinek’s statement as an observation that with decreased complexity of the system and increased generality of the approaches, the performance goes up. It is then not so surprising to see general-purpose algorithms beat the very specific ones, although the task-specific algorithms may clearly have better initial results.
Neural network language models will be described in more detail in Chapter 2. These models are today among the state-of-the-art techniques, and we will demonstrate their performance on several data sets, on each of which it is unmatched by other techniques.
The main advantage of NNLMs over n-grams is that the history is no longer seen as an exact sequence of n − 1 words H, but rather as a projection of H into some lower-dimensional space. This reduces the number of parameters in the model that have to be trained, and results in automatic clustering of similar histories. While this might sound the same as the motivation for class-based models, the main difference is that NNLMs project all words into the same low-dimensional space, and there can be many degrees of similarity between words.
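To make this concrete, the following minimal sketch (not the thesis implementation) shows how a history of n − 1 words can be projected into a shared low-dimensional space by a lookup matrix. The sizes, the function name project_history, and the word indices are illustrative assumptions, and the random matrix stands in for a projection that would in practice be learned:

```python
import numpy as np

V, D = 10000, 50                         # vocabulary size, projection dimension (assumed)
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(V, D))   # projection matrix; learned during training in a real NNLM

def project_history(word_ids):
    """Map a history of n-1 word indices to the concatenation of
    their D-dimensional projections, as in a feedforward NNLM."""
    return np.concatenate([P[w] for w in word_ids])

# Two histories differing in one word map to two points in the same
# continuous space; once P is trained, similar words receive nearby
# rows, so similar histories end up close together and share statistics.
h1 = project_history([12, 345, 678])
h2 = project_history([12, 346, 678])
print(np.linalg.norm(h1 - h2))
```

Because every word shares the same projection matrix, the distance between two projected histories is graded rather than all-or-nothing, which is exactly what a hard class assignment cannot express.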
The main weak point of these models is their very large computational complexity, which usually makes it prohibitive to train them on the full training set using the full vocabulary. I will deal with these issues in this work by proposing simple and effective speed-up techniques. Experiments and results obtained with neural network models trained on over 400M words with a large vocabulary will be reported, which is to my knowledge the largest data set that a proper NNLM has been trained on⁴.
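For typical configurations, the dominant cost lies in the output layer, whose size equals the vocabulary. The following back-of-the-envelope estimate, with purely illustrative sizes (a feedforward NNLM of order n, projection dimension D, hidden size H, vocabulary V), sketches why a full vocabulary is prohibitive:

```python
# Rough per-word cost of one forward pass in a feedforward NNLM.
# Values are assumptions for illustration, not the thesis setup.
n, D, H, V = 5, 100, 500, 200000

hidden_ops = (n - 1) * D * H   # projection-to-hidden multiplications
output_ops = H * V             # hidden-to-output (softmax) multiplications
print(f"hidden: {hidden_ops:,}  output: {output_ops:,}")
# With these sizes, the output layer costs roughly 500x more than the
# hidden layer, which is the term the speed-up techniques must attack.
```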
2.4 Introduction to Data Sets and Experimental Setups
In this work, I would like to avoid mistakes that are often mentioned in criticism of current research in statistical language modeling. It is usually claimed that new techniques are studied in very specific systems, using weak or ambiguous baselines. Comparability of the achieved results is thus very low, if any. This leads to much
³ Although Jelinek himself later claimed that the original statement was ”Every time a linguist leaves my group, the accuracy goes up”, the former one gained more popularity.
⁴ I am aware of experiments with even more training data (more than 600M words) [8], but the resulting model in that work uses a small hidden layer, which, as will be shown later, prohibits training a model with competitive performance on such an amount of training data.