Statistical Language Models based on Neural Networks - Faculty of ...
computation between the hidden and the output layers. It is also possible to parallelize the computation by training from multiple positions in the training data simultaneously.

Another approach is to divide the training data into K subsets and train a single neural net model on each of them separately. All K models must then be used during the test phase and their outputs averaged. However, neural network models profit from clustering of similar events, so such an approach would lead to suboptimal results. The test phase would also be more computationally complex, as K models would have to be evaluated instead of one.
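The K-subset scheme above can be sketched as follows. This is a toy illustration, not the thesis implementation: `SubsetLM` is a hypothetical stand-in for a neural LM trained on one data subset, and the averaging is a uniform linear interpolation of the K output distributions.

```python
import random

class SubsetLM:
    """Hypothetical stand-in for a neural LM trained on one of K data subsets."""
    def __init__(self, seed, vocab_size=5):
        rng = random.Random(seed)
        weights = [rng.random() for _ in range(vocab_size)]
        total = sum(weights)
        self._dist = [w / total for w in weights]  # fixed toy distribution

    def predict(self, context):
        return self._dist  # a real model would condition on the context

def ensemble_predict(models, context):
    """Evaluate all K models at test time and average their next-word
    distributions (uniform-weight linear interpolation)."""
    dists = [m.predict(context) for m in models]
    k = len(dists)
    return [sum(col) / k for col in zip(*dists)]

models = [SubsetLM(seed=k) for k in range(4)]   # K = 4 subset models
avg = ensemble_predict(models, context=["the", "cat"])
```

The extra test-time cost is visible directly: every prediction calls all K models, which is exactly the K-fold evaluation overhead the text mentions.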
6.3 Experimental Setup
We performed recognition on the English Broadcast News (BN) NIST RT04 task using state-of-the-art acoustic models trained on the BN corpus (430 hours of audio), provided to us by IBM [31]. The acoustic model was discriminatively trained on about 430 hours of HUB4 and TDT4 data, with LDA+MLLT, VTLN, fMLLR-based SAT training, fMMI and mMMI. IBM also provided us with its state-of-the-art speech recognizer, Attila [71], and two Kneser-Ney smoothed backoff 4-gram LMs containing 4.7M n-grams and 54M n-grams, both trained on about 400M tokens. Additional details about the recognizer can be found in [31].
We followed IBM's multi-pass decoding recipe, using the 4.7M n-gram LM in the first pass followed by rescoring with the larger 54M n-gram LM. The development data consisted of DEV04f+RT03 data (25K tokens). For evaluation, we used the RT04 evaluation set (47K tokens). The size of the vocabulary is 82K words.
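The second pass of such a recipe can be sketched as n-best rescoring: the hypotheses produced with the small first-pass LM keep their acoustic scores, but the LM score is recomputed with the larger model. This is only an illustrative sketch, not the actual Attila recipe (which rescores lattices); the score values, the `lm_weight`, and the `big_lm_logprob` function are all made up for the example.

```python
def rescore_nbest(hypotheses, big_lm_logprob, lm_weight):
    """Pick the best hypothesis after replacing the first-pass LM score
    with the large-LM score. Each hypothesis is a tuple
    (words, acoustic_logscore, small_lm_logscore)."""
    rescored = []
    for words, am_score, _small_lm_score in hypotheses:
        total = am_score + lm_weight * big_lm_logprob(words)
        rescored.append((total, words))
    return max(rescored)[1]  # hypothesis with the highest combined score

def big_lm_logprob(words):
    # Hypothetical stand-in for the large 54M n-gram LM.
    return -1.0 if words == ("the", "cat", "sat") else -5.0

nbest = [
    (("the", "cat", "sat"), -10.0, -4.0),
    (("the", "cats", "at"), -9.5, -3.5),
]
best = rescore_nbest(nbest, big_lm_logprob, lm_weight=2.0)
```

Here the larger LM overturns the first-pass ranking: the second hypothesis has the better acoustic score, but the rescoring pass prefers the first.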
The training corpora for the language models are shown in Table 6.1. The baseline 4-gram model was trained using modified Kneser-Ney smoothing on all available data (about 403M tokens). For all the RNN models described later, we used the class-based architecture described in Section 3.4.2 with 400 classes.
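The class-based output layer factorizes the next-word probability as P(w | h) = P(class(w) | h) · P(w | class(w), h), so each step evaluates roughly 400 class outputs plus the words of a single class instead of all 82K vocabulary outputs. A minimal sketch of the factorization, with toy uniform distributions standing in for the trained RNN outputs:

```python
# Toy vocabulary of 4 words partitioned into 2 classes (the thesis uses
# 82K words and 400 classes; all probabilities below are placeholders).
word2class = {"the": 0, "a": 0, "cat": 1, "dog": 1}
class_words = {0: ["the", "a"], 1: ["cat", "dog"]}

def p_class(c, history):
    return 0.5                          # toy: uniform over the 2 classes

def p_word_in_class(word, c, history):
    return 1.0 / len(class_words[c])    # toy: uniform within the class

def class_based_prob(word, history):
    """P(w | h) = P(class(w) | h) * P(w | class(w), h)."""
    c = word2class[word]
    return p_class(c, history) * p_word_in_class(word, c, history)

# The factorization still yields a proper distribution over the vocabulary.
total = sum(class_based_prob(w, []) for w in word2class)
```

The speed-up comes from the normalization cost: a full softmax normalizes over the whole vocabulary, while the factored form normalizes once over the classes and once over the words of one class.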
6.4 Automatic Data Selection and Sorting
Neural networks are usually trained with stochastic gradient descent, which assumes that the order of the training data is randomized before the start of each training epoch. In the context of NN LMs, the randomization is usually performed at the level of sentences or paragraphs. However, an alternative view can be taken when it comes to training deep