
Appendix A: RNNLM Toolkit

To further support research of advanced language modeling techniques, I implemented and released an open source toolkit for training recurrent neural network based language models. It is available at http://www.fit.vutbr.cz/~imikolov/rnnlm/. The main goals for the RNNLM toolkit are:

• promotion of research of advanced language modeling techniques
• easy usage
• simple portable code without any dependencies on external libraries
• computational efficiency

Basic Functionality

The toolkit supports several functions, mostly for the basic language modeling operations: training an RNN LM, training a hash-based maximum entropy model (ME LM), and training an RNNME LM. For evaluation, either perplexity can be computed on some test data, or n-best lists can be rescored to evaluate the impact of the models on the word error rate or the BLEU score. Additionally, the toolkit can be used for generating random sequences of words from the model, which can be useful for approximating the RNN models by n-gram models, at a cost of increased memory complexity [15].
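As an illustration, the evaluation and generation modes can be invoked from the command line roughly as shown below. The model and data file names are placeholders, and while -test, -nbest and -gen are the options relevant to these modes, their exact usage should be checked against the toolkit's documentation.

    # compute perplexity of a trained model on held-out test data
    rnnlm -rnnlm model -test test.txt

    # rescore an n-best list of hypotheses instead of plain text
    rnnlm -rnnlm model -test nbest.txt -nbest -debug 0 > scores.txt

    # sample 1000 random words from the model distribution
    rnnlm -rnnlm model -gen 1000 -debug 0 > sampled.txt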

Training Phase

The input data are expected to be in a simple ASCII text format, with a space between words and an end-of-line character at the end of each sentence. After specifying the training data set, a vocabulary is automatically constructed, and it is saved as part of the RNN model file. Note that if one wants to use a limited vocabulary (for example, for open-vocabulary experiments), the text data should be modified outside the toolkit, by first rewriting all words outside the vocabulary to <unk> or a similar special token.
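A minimal sketch of this preprocessing step, assuming a vocabulary file with one word per line and <unk> as the special token (the file names and the token are placeholders), could look as follows:

    awk 'NR==FNR { vocab[$1]; next }             # first file: load the vocabulary
         { for (i = 1; i <= NF; i++)             # second file: scan each word
             if (!($i in vocab)) $i = "<unk>"    # rewrite out-of-vocabulary words
           print }' vocab.txt train.txt > train.unk.txt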

After the vocabulary is learned, the training phase starts (optionally, the progress can be shown if the -debug 2 option is used). Implicitly, it is expected that some validation data are provided using the option -valid, to control the number of training epochs and the learning rate. However, it is also possible to train models without any validation data; the option -one-iter can be used for that purpose. The model is saved after each completed epoch (or also after processing a specified amount of words); the training process can be continued if interrupted.
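For concreteness, a typical training run might look like the sketch below; the hidden layer size and the file names are arbitrary placeholder values, not recommendations.

    # train an RNN LM with 100 hidden neurons, using validation data
    # to control the learning rate and the number of epochs
    rnnlm -train train.txt -valid valid.txt -rnnlm model -hidden 100 -debug 2

    # alternatively, train for exactly one epoch without validation data
    rnnlm -train train.txt -rnnlm model -hidden 100 -one-iter

If the run is interrupted, re-running the same command should continue training from the last saved model file.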

