
3.4.3 Approximation of Complex Language Model by Backoff N-gram model

In [15], we have shown that an NNLM can be partly approximated by a finite state machine. The conversion is done by sampling words from the probability distribution computed by the NNLM, and a common N-gram model is afterwards trained on the sampled text data. For an infinite amount of sampled data and infinite order N, this approximation technique is guaranteed to converge to a model equivalent to the one that was used for generating the words.
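As a concrete illustration of the conversion, the following is a minimal sketch only, assuming a hypothetical neural_lm object with a next_word_distribution(history) method; the actual experiments in [15] used an RNNLM together with standard n-gram tools rather than this toy code.

import random
from collections import defaultdict

def sample_corpus(neural_lm, vocab, num_words, start_token="<s>"):
    # Generate text by repeatedly sampling the next word from the NNLM's
    # predictive distribution P(w | history).
    history = [start_token]
    corpus = []
    for _ in range(num_words):
        probs = neural_lm.next_word_distribution(history)  # list aligned with vocab
        word = random.choices(vocab, weights=probs, k=1)[0]
        corpus.append(word)
        history.append(word)
    return corpus

def train_ngram_counts(corpus, order=3):
    # Collect counts for all n-gram orders up to `order` from the sampled text;
    # a real system would add smoothing and backoff weights (e.g. with SRILM).
    counts = defaultdict(int)
    for n in range(1, order + 1):
        for i in range(len(corpus) - n + 1):
            counts[tuple(corpus[i:i + n])] += 1
    return counts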

Of course, this is not achievable in practice, as it is not possible to generate infinite amounts of data. However, we have shown that even for manageable amounts of sampled data (hundreds of millions of words), the approximated model provides some of the improvement over the baseline n-gram model that is provided by the full NNLM. Note that this approach is not limited just to NNLMs or RNNLMs, but can be used to convert any complex model to a finite state representation. However, following the motivation examples that were shown in the introductory chapter, representing certain patterns using FSMs is quite impractical; thus, we believe this technique is most useful for tasks with a limited amount of training data, where the size of the models is not so restrictive.

An important advantage of this approach is the possibility of using the approximated model directly during decoding, for standard lattice rescoring, etc. It is even possible to use (R)NNLMs for speech recognition without actually having a single line of neural net code in the system, as the complex patterns learned by the neural net are represented as a list of possible combinations in the n-gram model. The sampling approach thus gives the best possible speedup for the test phase, by trading computational complexity for space complexity.
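To make this space-for-time trade concrete, the following toy lookup (continuing the sketch above, with naive backoff and no discounting) shows that scoring a word with the approximated model amounts to a count-table lookup rather than a neural network forward pass.

def ngram_prob(counts, history, word, order=3):
    # Maximum-likelihood estimate with a naive backoff to shorter histories;
    # a production system would use proper discounting (e.g. Kneser-Ney).
    for n in range(order, 1, -1):
        ctx = tuple(history[-(n - 1):])
        ctx_count = counts.get(ctx, 0)
        if ctx_count > 0:
            return counts.get(ctx + (word,), 0) / ctx_count
    unigram_total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
    return counts.get((word,), 0) / unigram_total if unigram_total else 0.0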

Empirical results obtained by using this technique for approximating RNNLMs in speech recognition systems are described in [15] and [38], which are joint work with Anoop Deoras and Stefan Kombrink.

3.4.4 Dynamic Evaluation of the Model

From the artificial intelligence point of view, the usual statistical language models have another drawback besides their inability to represent longer term patterns: the impossibility of learning new information. This is caused by the fact that LMs are commonly assumed to be static: the parameters of the models do not change during processing of the data.
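A minimal sketch of the contrast between static and dynamic evaluation is given below; it assumes a hypothetical model object with predict(history), returning a dictionary over next words, and update(history, word, lr), performing one gradient step, and only illustrates the idea that in dynamic evaluation the parameters keep adapting to the test data after each prediction.

import math

def evaluate(model, test_words, dynamic=False, lr=0.1):
    # Return the average per-word log-probability; when `dynamic` is True,
    # the model is updated on every word right after it has been predicted,
    # so it can learn new information from the test data itself.
    history, logprob = ["<s>"], 0.0
    for word in test_words:
        logprob += math.log(model.predict(history).get(word, 1e-10))
        if dynamic:
            model.update(history, word, lr)  # one gradient step on the observed word
        history.append(word)
    return logprob / len(test_words)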

