
output layer changes to computation between the hidden and the class layer:

c_m(t) = g\left( \sum_j s_j(t) \, x_{mj} \right), \qquad (3.23)

and the hidden layer and a subset of the output layer:

y_{V'}(t) = g\left( \sum_j s_j(t) \, v_{V'j} \right). \qquad (3.24)

The probability of word w(t+1) is then computed as

P(w_{t+1} \mid s(t)) = P(c_i \mid s(t)) \, P(w_i \mid c_i, s(t)), \qquad (3.25)

where w_i is the index of the predicted word and c_i is its class. During training, the weights are accessed in the same way as during the forward pass: the gradient of the error vector is computed for the word part and for the class part, and is then backpropagated to the hidden layer, where the two gradients are added together. Thus, the hidden layer is trained to predict both the distribution over words and the distribution over classes.
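The factorized forward pass and the merged gradients can be sketched as follows. This is a minimal NumPy illustration under assumed names and shapes (X for hidden-to-class weights, V for hidden-to-word weights, words_in_class for the vocabulary partition); it is a sketch of the technique, not the toolkit's implementation.

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)                      # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

def forward(s, X, V, word_class, words_in_class, w):
    """Factorized output layer.
    s: hidden state (H,), X: class weights (C, H), V: word weights (|vocab|, H),
    word_class[w]: class of word w, words_in_class[c]: word indices in class c."""
    c = word_class[w]                          # class of the word to be predicted
    idx = words_in_class[c]                    # subset V' of the output layer
    class_probs = softmax(X @ s)               # eq. (3.23): distribution over classes
    word_probs = softmax(V[idx] @ s)           # eq. (3.24): distribution over words in the class
    w_local = idx.index(w)                     # position of w inside its class
    p = class_probs[c] * word_probs[w_local]   # eq. (3.25)
    return p, class_probs, word_probs, idx, c, w_local

def error_gradient_at_hidden(X, V, class_probs, word_probs, idx, c, w_local):
    """Cross-entropy error gradients of the class part and the word part,
    backpropagated to the hidden layer and added together."""
    d_class = class_probs.copy(); d_class[c] -= 1.0       # gradient at the class outputs
    d_word = word_probs.copy();   d_word[w_local] -= 1.0  # gradient at the word outputs in V'
    return X.T @ d_class + V[idx].T @ d_word              # summed contribution to the hidden layer
```

Per predicted word, only the class softmax over C outputs and the word softmax over |V'| outputs are evaluated, instead of a full softmax over the whole vocabulary, which is where the speed-up comes from.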

An alternative to simple frequency binning is a slightly modified approach that minimizes access to words and classes: instead of using the raw word frequencies for the equal binning algorithm, one can apply a square root function to the original frequencies and perform the binning on these modified values. This approach leads to an even larger speed-up.³
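A minimal sketch of the modified binning, assuming the word list is sorted by decreasing frequency; the square root is applied before the equal-mass split (function and variable names here are illustrative, not from the thesis):

```python
import math

def assign_classes(word_counts, num_classes):
    """Equal binning on square-rooted frequencies: each class covers roughly
    the same share of the total sqrt-frequency mass.
    word_counts: list of (word, count) pairs, sorted by decreasing count."""
    weights = [math.sqrt(cnt) for _, cnt in word_counts]
    total = sum(weights)
    classes, cum, cls = {}, 0.0, 0
    for (word, _), weight in zip(word_counts, weights):
        classes[word] = cls
        cum += weight
        # move to the next class once its share of the mass is filled
        if cum > (cls + 1) * total / num_classes and cls < num_classes - 1:
            cls += 1
    return classes
```

Compared to binning on raw frequencies, the square root flattens the distribution, so a few very frequent words no longer occupy tiny classes while the rare words pile into huge ones; the more balanced class sizes reduce the average amount of work per predicted word.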

Factorization of the computation between the hidden and output layers using simple classes can easily lead to a 15-30 times speed-up against a fair baseline, and for networks with huge output layers (more than 100K words), the speed-up may be even an order of magnitude larger. This trick is therefore essential for achieving reasonable performance on larger data sets. Additional techniques for reducing computational complexity will be discussed in more detail in Chapter 6.
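As a rough back-of-the-envelope illustration (an estimate under assumptions, not a figure from the text): with hidden layer size H, vocabulary size V, and C classes of comparable size, the per-word output cost drops from about H·V to about H·(C + V/C) multiplications, which is smallest for C ≈ √V:

```latex
\text{speed-up} \approx \frac{H\,V}{H\,(C + V/C)}
  \;\xrightarrow{\;C=\sqrt{V}\;}\; \frac{V}{2\sqrt{V}} = \frac{\sqrt{V}}{2},
\qquad V = 10^5 \;\Rightarrow\; \text{speed-up} \approx 158 .
```

Frequency-based classes are far from equally sized in practice, so observed speed-ups fall below this idealized bound, but it is consistent with the larger gains reported for vocabularies above 100K words.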

³ Thanks to Dan Povey, who suggested this modification.

