Statistical Language Models based on Neural Networks - Faculty of ...
[Figure 3.1: Simple recurrent neural network — input w(t), hidden layer s(t) with recurrent input s(t−1), output y(t); weight matrices U, W and V.]
The network is trained by stochastic gradient descent using either the usual backpropagation (BP) algorithm, or backpropagation through time (BPTT) [65]. The network is represented by input, hidden and output layers and corresponding weight matrices - matrices U and W between the input and the hidden layer, and matrix V between the hidden and the output layer. Output values in the layers are computed as follows:
s_j(t) = f\left( \sum_i w_i(t)\, u_{ji} + \sum_l s_l(t-1)\, w_{jl} \right)   (3.1)

y_k(t) = g\left( \sum_j s_j(t)\, v_{kj} \right)   (3.2)
where f(z) and g(z) are sigmoid and softmax activation functions (the softmax function in the output layer is used to ensure that the outputs form a valid probability distribution, i.e. all outputs are greater than 0 and their sum is 1):
f(z) = \frac{1}{1 + e^{-z}}, \qquad g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}   (3.3)
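The two activation functions in equation 3.3 can be sketched directly in NumPy. This is an illustrative sketch, not the thesis code; the max-subtraction in the softmax is a standard numerical-stability trick that is not part of equation 3.3 itself.

```python
import numpy as np

def f(z):
    # Sigmoid activation for the hidden layer (eq. 3.3).
    return 1.0 / (1.0 + np.exp(-z))

def g(z):
    # Softmax for the output layer (eq. 3.3); subtracting max(z)
    # avoids overflow in exp() and does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()
```

By construction, g returns strictly positive values that sum to 1, which is exactly the property needed to interpret the output layer as a probability distribution over words.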
Note that biases are not used in the neural network, as no significant improvement of performance was observed - following Occam's razor, the solution is as simple as it needs to be. Alternatively, equations 3.1 and 3.2 can be rewritten as a matrix-vector multiplication:
s(t) = f\left( U w(t) + W s(t-1) \right)   (3.4)
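A single forward time step in this matrix-vector form can be sketched as follows. The dimensions, random initialization, and one-hot input word are illustrative assumptions for the sketch; equation 3.4 gives the hidden-state update and equation 3.2 (in matrix form) gives the output distribution.

```python
import numpy as np

# Hypothetical sizes for illustration: vocabulary of 10 words, 5 hidden units.
V_SIZE, H_SIZE = 10, 5
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H_SIZE, V_SIZE))  # input -> hidden
W = rng.normal(scale=0.1, size=(H_SIZE, H_SIZE))  # hidden -> hidden (recurrent)
V = rng.normal(scale=0.1, size=(V_SIZE, H_SIZE))  # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward_step(w_t, s_prev):
    # Eq. 3.4: hidden state from current input word and previous hidden state.
    s_t = sigmoid(U @ w_t + W @ s_prev)
    # Eq. 3.2 in matrix form: probability distribution over the next word.
    y_t = softmax(V @ s_t)
    return s_t, y_t

# w(t) as a 1-of-N (one-hot) encoding of the current word.
w_t = np.zeros(V_SIZE)
w_t[3] = 1.0
s_prev = np.zeros(H_SIZE)  # initial hidden state
s_t, y_t = forward_step(w_t, s_prev)
```

Because w(t) is one-hot, the product U w(t) simply selects one column of U, so in practice the input multiplication reduces to a table lookup.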