
trained together with a maximum entropy model, which can be seen as a part of the neural network, where the input layer is directly connected to the output layer. We introduce a hash-based implementation of a class-based maximum entropy model that allows us to easily control the trade-off between the memory complexity, the space complexity and the computational complexity.

In this chapter, we report results on the NIST RT04 Broadcast News speech recognition task. We use lattices generated from the IBM Attila decoder [71], which uses state-of-the-art discriminatively trained acoustic models¹. The language models for this task are trained on about 400M tokens. This highly competitive setup has been used in the 2010 Speech Recognition with Segmental Conditional Random Fields summer workshop at Johns Hopkins University² [82]. Some of the results reported in this chapter were recently published in [52].

6.1 Model Description

In this section, we will show that a maximum entropy model can be seen as a neural network model with no hidden layer. A maximum entropy model has the following form:

P(w|h) = \frac{e^{\sum_{i=1}^{N} \lambda_i f_i(h,w)}}{\sum_{w} e^{\sum_{i=1}^{N} \lambda_i f_i(h,w)}},    (6.1)

where f is a set of features, λ is a set of weights and h is a history. Training of a maximum entropy model consists of learning the set of weights λ. The usual features are n-grams, but it is easy to integrate any information source into the model, for example triggers or syntactic features [64]. The choice of features is usually done manually, and significantly affects the overall performance of the model.
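To make Equation 6.1 concrete, the short sketch below computes P(w|h) as a softmax over summed feature weights; the `features` function (returning the indices of the binary features that fire for a given history-word pair) and the `vocab` and `lambdas` arguments are assumptions for illustration.

```python
import math

def maxent_prob(history, word, vocab, features, lambdas):
    """P(w|h) following Eq. 6.1: exp of the summed weights of the active
    features, normalized over all words in the vocabulary.

    `features(h, w)` is an assumed callable returning the indices of binary
    features that fire for (h, w); `lambdas[i]` is the weight of feature i.
    """
    def score(w):
        return math.exp(sum(lambdas[i] for i in features(history, w)))
    normalizer = sum(score(w) for w in vocab)
    return score(word) / normalizer
```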

The standard neural network language model has a very similar form. The main difference is that the features for this model are automatically learned as a function of the history. Also, the usual features for the ME model are binary, while NN models use continuous-valued features. We can describe the NN LM as follows:

P(w|h) = \frac{e^{\sum_{i=1}^{N} \lambda_i f_i(s,w)}}{\sum_{w} e^{\sum_{i=1}^{N} \lambda_i f_i(s,w)}},    (6.2)
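The parallel between the two equations can be sketched in a few lines: under the NN view, s is a continuous feature vector computed from the history (for example, the hidden-layer state) and the λ weights form the output layer, so Equation 6.2 is the same softmax as Equation 6.1 with learned continuous features in place of hand-chosen binary ones. The dimensions and random values below are purely illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

V, D = 10, 5                     # toy vocabulary size and feature dimension
rng = np.random.default_rng(0)
output_weights = rng.normal(size=(V, D))   # lambda: one weight row per word

# NN LM view (Eq. 6.2): s is a continuous, learned feature vector of the history.
s = rng.normal(size=D)
p_nn = softmax(output_weights @ s)         # P(w|h) for every word w

# ME view (Eq. 6.1): the features are binary indicators of the history,
# so the same softmax is taken over a sum of the weights of active features.
binary_features = np.zeros(D)
binary_features[[1, 3]] = 1.0              # features that fire for this history
p_me = softmax(output_weights @ binary_features)
```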

¹ The lattice rescoring experiments reported in this chapter were performed by Anoop Deoras at JHU, due to licensing restrictions on the IBM recognizer.

² www.clsp.jhu.edu/workshops/archive/ws10/groups/speech-recognition-with-segmental-conditional-random-fields/

