Statistical Language Models based on Neural Networks - Faculty of ...
trained together with a maximum entropy model, which can be seen as a part of the neural network, where the input layer is directly connected to the output layer. We introduce a hash-based implementation of a class-based maximum entropy model that allows us to easily control the trade-off between the memory complexity and the computational complexity.
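A minimal sketch of such a hash-based feature store (the hash function, array size, and all names here are illustrative assumptions, not the implementation from this thesis): n-gram features are hashed into a fixed-size weight array, so memory is bounded by the array size at the cost of occasional collisions.

```python
# Hash-based storage of n-gram maximum entropy features (illustrative sketch).
# A fixed-size array caps memory; colliding features share a weight.

HASH_SIZE = 1 << 20  # fixed memory budget; smaller -> more collisions

def feature_index(history, word, hash_size=HASH_SIZE):
    """Map an (n-gram history, word) pair to a slot in the weight array."""
    h = 0
    for token in (*history, word):
        h = (h * 1000003 + hash(token)) % hash_size
    return h

weights = [0.0] * HASH_SIZE  # one lambda per hash slot

def score(history, word):
    """Unnormalized log-score: sum of weights of all active n-gram features,
    from the unigram (empty history) up to the full history."""
    s = 0.0
    for order in range(len(history) + 1):
        s += weights[feature_index(history[len(history) - order:], word)]
    return s
```

Shrinking `HASH_SIZE` directly trades memory for collision noise, which is one concrete way to realize the memory/computation trade-off described above.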
In this chapter, we report results on the NIST RT04 Broadcast News speech recognition task. We use lattices generated from the IBM Attila decoder [71] that uses state-of-the-art discriminatively trained acoustic models¹. The language models for this task are trained on about 400M tokens. This highly competitive setup has been used in the 2010 Speech Recognition with Segmental Conditional Random Fields summer workshop at Johns Hopkins University² [82]. Some of the results reported in this chapter were recently published in [52].
6.1 Model Description
In this section, we will show that a maximum entropy model can be seen as a neural network model with no hidden layer. A maximum entropy model has the following form:

P(w|h) = \frac{e^{\sum_{i=1}^{N} \lambda_i f_i(h,w)}}{\sum_{w} e^{\sum_{i=1}^{N} \lambda_i f_i(h,w)}},   (6.1)
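A toy numerical evaluation of Eq. (6.1) may make the normalization concrete (the vocabulary, features, and weights below are made up for illustration, not taken from the chapter):

```python
import math

vocab = ["a", "b", "c"]

def features(h, w):
    # Two hypothetical binary features: f1 fires for the bigram ("x", "a"),
    # f2 fires for the unigram w == "b".
    return [1.0 if (h, w) == ("x", "a") else 0.0,
            1.0 if w == "b" else 0.0]

lam = [2.0, 1.0]  # the weights lambda_i

def p(w, h):
    """P(w|h) per Eq. (6.1): exponentiated feature score, normalized
    over the whole vocabulary."""
    num = math.exp(sum(l * f for l, f in zip(lam, features(h, w))))
    den = sum(math.exp(sum(l * f for l, f in zip(lam, features(h, v))))
              for v in vocab)
    return num / den
```

By construction the probabilities sum to one over the vocabulary, and words whose active features carry larger weights receive higher probability.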
where f is a set of features, λ is a set of weights and h is a history. Training of a maximum entropy model consists of learning the set of weights λ. Usual features are n-grams, but it is easy to integrate any information source into the model, for example triggers or syntactic features [64]. The choice of features is usually done manually, and significantly affects the overall performance of the model.
The standard neural network language model has a very similar form. The main difference is that the features for this model are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. We can describe the NN LM as follows:

P(w|h) = \frac{e^{\sum_{i=1}^{N} \lambda_i f_i(s,w)}}{\sum_{w} e^{\sum_{i=1}^{N} \lambda_i f_i(s,w)}},   (6.2)
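Under this reading, Eq. (6.2) is just a softmax output layer: the continuous state s supplies the feature values f_i(s,w), and the output weights play the role of the λ's. A small sketch (dimensions and values are illustrative assumptions):

```python
import math

V, H = 4, 3                      # vocab size, dimension of the state s
s = [0.2, -0.5, 1.0]             # continuous representation of the history
W = [[0.1 * (i + j) for j in range(H)] for i in range(V)]  # output weights

def softmax_output(s, W):
    """P(w|h) per Eq. (6.2): dot products of the state with each word's
    output weights, normalized with a numerically stable softmax."""
    scores = [sum(wi * si for wi, si in zip(row, s)) for row in W]
    m = max(scores)              # subtract the max before exponentiating
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax_output(s, W)
```

The contrast with Eq. (6.1) is then only in where the features come from: hand-chosen binary indicators of the history h versus a learned continuous function s of that history.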
¹ The lattice rescoring experiments reported in this chapter were performed by Anoop Deoras at JHU due to the license issues of the IBM recognizer.
² www.clsp.jhu.edu/workshops/archive/ws10/groups/speech-recognition-with-segmental-conditional-random-fields/