
computation between the hidden and the output layers. It is also possible to parallelize the computation by training from multiple positions in the training data simultaneously.

Another approach is to divide the training data into K subsets and train a single neural net model on each of them separately. Then, all K models need to be used during the test phase, and their outputs averaged. However, neural network models profit from clustering of similar events, thus such an approach would lead to suboptimal results. Also, the test phase would be more computationally complex, as K models would need to be evaluated instead of one.
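To make the averaging step concrete, the following is a minimal sketch of combining the outputs of K separately trained models at test time. The model interface (next_word_distribution) and the uniform 1/K interpolation weights are illustrative assumptions, not part of the setup described above.

```python
import math

# Sketch of averaging the outputs of K separately trained NN LMs at test
# time. The next_word_distribution() interface and the uniform 1/K weights
# are illustrative assumptions standing in for whatever the models expose.

def averaged_word_prob(models, history, word):
    """P(word | history) averaged over K models with equal weights."""
    return sum(m.next_word_distribution(history)[word] for m in models) / len(models)

def averaged_log_prob(models, sentence):
    """Log-probability of a sentence under the averaged model."""
    logp, history = 0.0, []
    for word in sentence:
        logp += math.log(averaged_word_prob(models, history, word))
        history.append(word)
    return logp
```

Note that every word requires a forward pass through all K models, which is the increased test-time cost mentioned above.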

6.3 Experimental Setup

We performed recognition on the English Broadcast News (BN) NIST RT04 task using state-of-the-art acoustic models trained on the English Broadcast News (BN) corpus (430 hours of audio) provided to us by IBM [31]. The acoustic model was discriminatively trained on about 430 hours of HUB4 and TDT4 data, with LDA+MLLT, VTLN, fMLLR-based SAT training, fMMI and mMMI. IBM also provided us with its state-of-the-art speech recognizer, Attila [71], and two Kneser-Ney smoothed backoff 4-gram LMs containing 4.7M n-grams and 54M n-grams, both trained on about 400M tokens. Additional details about the recognizer can be found in [31].

We followed IBM's multi-pass decoding recipe, using the 4.7M n-gram LM in the first pass followed by rescoring with the larger 54M n-gram LM. The development data consisted of the DEV04f+RT03 data (25K tokens). For evaluation, we used the RT04 evaluation set (47K tokens). The size of the vocabulary is 82K words.
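As a rough illustration of the second pass, the sketch below re-ranks first-pass hypotheses with a larger LM. The hypothesis format, the sentence_log_prob interface and the scaling constants are hypothetical; the actual experiments follow IBM's recipe in Attila.

```python
# Illustrative n-best rescoring with a larger LM. The hypothesis fields,
# the big_lm.sentence_log_prob() interface and the default lm_weight are
# assumptions for this sketch, not IBM's actual recipe.

def rescore_nbest(nbest, big_lm, lm_weight=12.0, word_penalty=0.0):
    """nbest: list of dicts with 'words' (list of str) and
    'acoustic_score' (log domain). Returns hypotheses re-ranked by the
    combined score using the larger LM."""
    def combined(hyp):
        lm_score = big_lm.sentence_log_prob(hyp["words"])
        return (hyp["acoustic_score"]
                + lm_weight * lm_score
                + word_penalty * len(hyp["words"]))
    return sorted(nbest, key=combined, reverse=True)
```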

The training corpora for the language models are shown in Table 6.1. The baseline 4-gram model was trained using modified Kneser-Ney smoothing on all available data (about 403M tokens). For all the RNN models that will be described later, we have used the class-based architecture described in Section 3.4.2, with 400 classes.
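For reference, the class-based architecture factorizes the output as P(w | h) = P(c(w) | h) * P(w | c(w), h), so only the 400 class outputs plus the words within one class are evaluated instead of the full 82K-word softmax, reducing the per-word output cost from roughly |V| to |C| + |V|/|C|. The sketch below is a hedged illustration of this factorization; the weight matrices, class assignment and numpy layout are assumptions, not the toolkit's actual implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def class_based_word_prob(hidden, word, word2class, class_members, W_class, W_word):
    """P(word | history) = P(class | history) * P(word | class, history).

    hidden        : hidden-layer activation vector (shape [dim])
    word2class    : dict mapping word -> class id (400 classes here)
    class_members : dict mapping class id -> list of its words
    W_class       : [n_classes, dim] class output weights (placeholder)
    W_word[c]     : [len(class_members[c]), dim] weights for words in class c
    """
    c = word2class[word]
    p_class = softmax(W_class @ hidden)[c]
    members = class_members[c]
    p_in_class = softmax(W_word[c] @ hidden)[members.index(word)]
    return p_class * p_in_class
```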

6.4 Automatic Data Selection and Sorting

Usually, stochastic gradient descent is used for training neural networks. This assumes randomization of the order of the training data before the start of each training epoch. In the context of NN LMs, the randomization is usually performed at the level of sentences or paragraphs. However, an alternative view can be taken when it comes to training deep

