
are usually achieved by combining the class-based model and the n-gram model. There exist many variations of class-based models, which often differ in the process of forming the classes. So-called soft classes allow one word to belong to multiple classes. A description of several variants of class-based models can be found in [24].
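For concreteness, one standard hard-class formulation (in the style of Brown et al.; [24] covers further variants, so this particular factorization is an illustrative choice rather than the only one) decomposes the bigram probability as

P(w_i | w_{i-1}) ≈ P(c(w_i) | c(w_{i-1})) · P(w_i | c(w_i)),

where c(w) denotes the class of word w. The combination with the n-gram model mentioned above is then typically a linear interpolation, P(w_i | h) = λ P_ngram(w_i | h) + (1 − λ) P_class(w_i | h), with the weight λ tuned on held-out data.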

While the perplexity improvements given by class-based models are usually moderate, these techniques have a noticeable effect on the word error rate in speech recognition, especially when only a small amount of training data is available. This makes class-based models quite attractive as opposed to cache models, which usually work well only in experiments concerning perplexity.

The disadvantages of class-based models include high computational complexity during inference (for statistical classes) or reliance on expert knowledge (for manually assigned classes). More seriously, the improvements tend to vanish with an increased amount of training data [24]. Thus, class-based models are more often found in research papers than in real applications.

From a critical point of view, there are several theoretical difficulties involving class-based models:

• The assumption that words belong to some higher-level classes is intuitive, but usually no special theoretical explanation is given of the process by which the classes are constructed; in the end, the number of classes is usually just a tunable parameter that is chosen based on performance on development data (see the sketch following this list)

• Most techniques attempt to cluster individual words in the vocabulary, but the idea is not extended to n-grams: by thinking about character-level models, it is obvious that with an increasing amount of training data, classes can be successful only if a longer context can be captured by a single class (several characters in this case)
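To make these points concrete, below is a minimal Python sketch of a hard-class bigram model. It is an assumption-laden illustration, not a method from [24]: the clustering step is a toy frequency-binning stand-in for a real algorithm such as Brown clustering, num_classes is exactly the kind of tunable parameter criticized above, and smoothing is omitted.

from collections import Counter, defaultdict

def frequency_classes(sentences, num_classes):
    # Toy class assignment by frequency rank; a stand-in for a real
    # clustering algorithm. num_classes is the tunable parameter
    # discussed in the first point above.
    counts = Counter(w for sent in sentences for w in sent)
    ranked = [w for w, _ in counts.most_common()]
    return {w: min(i * num_classes // len(ranked), num_classes - 1)
            for i, w in enumerate(ranked)}

def train_class_bigram(sentences, word2class):
    # Hard-class bigram model without smoothing (a simplification):
    # P(w_i | w_{i-1}) ~ P(c(w_i) | c(w_{i-1})) * P(w_i | c(w_i))
    class_bigram = defaultdict(Counter)  # class -> counts of following classes
    class_count = Counter()              # occurrences of each class
    word_count = Counter()               # occurrences of each word
    for sent in sentences:
        cs = [word2class[w] for w in sent]
        for c_prev, c_cur in zip(cs, cs[1:]):
            class_bigram[c_prev][c_cur] += 1
        class_count.update(cs)
        word_count.update(sent)

    def prob(w_prev, w_cur):
        c_prev, c_cur = word2class[w_prev], word2class[w_cur]
        total = sum(class_bigram[c_prev].values())
        p_class = class_bigram[c_prev][c_cur] / total if total else 0.0
        return p_class * word_count[w_cur] / class_count[c_cur]

    return prob

# Hand-assigned classes for the demo; frequency_classes(sents, 3) would
# also work, but frequency binning gives less interpretable clusters
# on a toy corpus.
sents = [["the", "cat", "runs"], ["a", "dog", "runs"]]
w2c = {"the": 0, "a": 0, "cat": 1, "dog": 1, "runs": 2}
prob = train_class_bigram(sents, w2c)
print(prob("the", "dog"))  # 0.5: unseen bigram, but its class transition is seen

The printed probability is nonzero although "the dog" never occurs in the training data, which illustrates why class-based sharing helps most when training data is scarce; conversely, once every word bigram has reliable counts of its own, such sharing adds little, matching the observation above that the improvements vanish with more data.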

2.3.3 Structured Language Models

Statistical language modeling was criticized heavily by linguists from the first days of its existence. The already mentioned Chomsky's statement that "the notion of probability of a sentence is completely useless one" can nowadays easily be seen as a big mistake, due to the indisputable success of applications that involve n-gram models. However, further objections from the linguistic community usually address the inability of n-gram models to represent longer-term patterns that clearly exist between words in a sentence.
