3.4.3 Approximation of Complex Language Model by Backoff N-gram Model
In [15], we have shown that an NNLM can be partly approximated by a finite state machine. The conversion is done by sampling words from the probability distribution computed by the NNLM, and a common N-gram model is afterwards trained on the sampled text data. Given an infinite amount of sampled data and an infinite order N, this approximation technique is guaranteed to converge to a model equivalent to the one that was used for generating the words.
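As a rough illustration of the conversion, the following Python sketch samples text from a language model and collects n-gram counts from it. The next_word_probs() interface is a hypothetical stand-in for whatever API the neural model exposes, and a real setup would hand the sampled text to a standard n-gram toolkit for smoothing rather than use the raw counts:

    import collections
    import random

    def sample_text(model, num_words, bos="<s>", eos="</s>"):
        # model.next_word_probs(history) -> {word: prob} is a hypothetical
        # interface standing in for the NNLM's softmax output.
        words, history = [], [bos]
        while len(words) < num_words:
            probs = model.next_word_probs(history)
            w = random.choices(list(probs), weights=list(probs.values()))[0]
            if w == eos:
                history = [bos]          # start a new sampled sentence
            else:
                words.append(w)
                history.append(w)
        return words

    def count_ngrams(words, order):
        # Raw n-gram counts over the sampled text; in practice these would
        # be smoothed into a backoff model by a standard n-gram toolkit.
        counts = collections.Counter()
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        return counts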
Of course, this is not achievable in practice, as it is not possible to generate infinite amounts of data. However, we have shown that even for manageable amounts of sampled data (hundreds of millions of words), the approximated model provides some of the improvement over the baseline n-gram model that is provided by the full NNLM. Note that this approach is not limited to NNLMs or RNNLMs, but can be used to convert any complex model into a finite state representation. However, as the motivating examples in the introductory chapter have shown, representing certain patterns using FSMs is quite impractical; we therefore believe this technique is most useful for tasks with a limited amount of training data, where the size of the models is not so restrictive.
An important advantage of this approach is that the approximated model can be used directly during decoding, for standard lattice rescoring, etc. It is even possible to use (R)NNLMs for speech recognition without having a single line of neural network code in the system, as the complex patterns learned by the neural network are represented as a list of possible combinations in the n-gram model. The sampling approach thus gives the best possible speedup for the test phase, by trading computational complexity for space complexity.
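To make this trade-off concrete: a backoff query is just a few table lookups per word, with no matrix operations at test time. Below is a minimal sketch of an ARPA-style backoff lookup in Python, where the table layout (log10 probabilities and backoff weights keyed by tuples) is illustrative rather than any particular toolkit's format:

    def backoff_logprob(word, history, probs, backoffs):
        # probs maps an n-gram tuple to its log10 probability; backoffs
        # maps a context tuple to its backoff weight, as in an ARPA file.
        ngram = tuple(history) + (word,)
        if ngram in probs:
            return probs[ngram]
        if not history:
            return probs.get((word,), float("-inf"))  # OOV handling omitted
        bow = backoffs.get(tuple(history), 0.0)
        return bow + backoff_logprob(word, history[1:], probs, backoffs)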
Empirical results obtained by using this technique to approximate RNNLMs in speech recognition systems are described in [15] and [38], which are joint work with Anoop Deoras and Stefan Kombrink.
3.4.4 Dynamic Evaluation of the Model
From the artificial intelligence point of view, the usual statistical language models have another drawback besides their inability to represent longer-term patterns: their inability to learn new information. This is caused by the fact that LMs are commonly assumed to be static - the parameters of the models do not change during processing of the data.
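Dynamic evaluation removes this restriction by letting the parameters keep changing while the test data is processed. A minimal sketch of the idea, assuming a hypothetical model interface with predict() and train_step() methods:

    def dynamic_evaluation(model, test_words, learning_rate=0.1):
        # After each test word is predicted (and thus scored), the model
        # immediately takes a training step on it, so the parameters adapt
        # to the test data as it is being processed.
        total_logprob = 0.0
        for word in test_words:
            total_logprob += model.predict(word)   # log P(word | history)
            model.train_step(word, learning_rate)  # on-line parameter update
        return total_logprob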