Statistical Language Models based on Neural Networks - Faculty of ...
By considering popular machine learning techniques from the computational point of view, we can see that principal component analysis, logistic regression and neural networks all use small, fixed algorithms with a huge number of learnable parameters. Thus, these techniques are still somewhere between general learning systems and simple databases.
Based on the experimental results described in the previous chapters, we can observe that by allowing the function f to learn more complex patterns, the generalization ability increases. Thus, neural networks with one hidden layer (with a non-linear activation function) have the potential to work better than logistic regression. From the computational point of view, we can see the non-linearity in neural networks as the IF command: we cannot express a regularity in the data such as IF OBJECT IS SMALL AND RED, THEN OBJECT IS APPLE without using the IF command, as a weighted combination such as SMALL IS APPLE, RED IS APPLE would classify any small object as an apple.
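The SMALL-AND-RED argument can be illustrated with a minimal sketch (the features, weights and threshold below are hypothetical, chosen only for illustration, not taken from any model in this work): a pure weighted combination assigns a positive apple score to every small object, while a single unit with a threshold non-linearity, the IF command, fires only when both features are active.

```python
# Hedged sketch of the SMALL-AND-RED example; all weights and the
# threshold are illustrative values, not parameters from the thesis.

def linear_score(small, red, w_small=1.0, w_red=1.0):
    # Weighted combination "SMALL IS APPLE, RED IS APPLE": no IF command,
    # so any small object already receives a positive apple score.
    return w_small * small + w_red * red

def hidden_unit_and(small, red):
    # One non-linearity (a hard threshold) acts as the IF command:
    # the unit fires only when BOTH features are active.
    return 1 if (1.0 * small + 1.0 * red) > 1.5 else 0

objects = {
    "small red apple":     (1, 1),
    "small green pea":     (1, 0),
    "large red firetruck": (0, 1),
}

for name, (small, red) in objects.items():
    print(name, linear_score(small, red), hidden_unit_and(small, red))
```

The pea and the firetruck each get a non-zero linear score, but only the thresholded unit labels the apple alone; the non-linearity, not the weights, expresses the AND.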
We do not have to stop with just a single non-linearity; it can be seen that some patterns require two subsequent IF commands to be expressed efficiently. It is often mentioned that a single non-linearity in computational models is sufficient to approximate any function. While this is true, such an approximation can collapse to the basic database approach: the logic in the function f is then expressed in the form of a large number of rules such as IF (X=1.05 AND Z=2.11) THEN Y=1.14. In contrast, a deeper architecture can represent some patterns using much fewer rules, which leads to better generalization when new data are presented as inputs.
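As a hedged illustration of the fewer-rules-with-depth argument (the parity function below is our own example, not one from this work): the parity of n bits can be computed by a chain of n XOR steps, one non-linearity per step, while a flat rule table in the style of IF (X=...) THEN Y=... needs one rule per odd-parity input pattern, i.e. 2^(n-1) rules.

```python
# Hedged sketch: n-bit parity as a pattern that a deep composition
# expresses with O(n) steps, while a flat rule list needs 2**(n-1) rules.
from itertools import product

def parity_deep(bits):
    """Depth-n computation: a chain of XOR steps, one per bit."""
    acc = 0
    for b in bits:
        acc ^= b          # one IF-like non-linearity per step
    return acc

def parity_shallow_rules(n):
    """Flat rule table: enumerate every odd-parity input pattern."""
    return {p for p in product([0, 1], repeat=n) if sum(p) % 2 == 1}

bits = (1, 0, 1, 1)
rules = parity_shallow_rules(len(bits))
assert parity_deep(bits) == (1 if bits in rules else 0)
print(len(rules))  # 8 rules for n=4; the rule table doubles with each bit
```

The deep chain grows linearly with n while the rule table grows exponentially, which is the sense in which depth leads to fewer rules and better generalization.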
A possible direction for future work in machine learning is to use models with more non-linearities, such as deep neural networks that are composed of several layers of neurons with non-linear activation functions [6]. This way, more patterns can be expressed efficiently. Recurrent neural networks are another example of a deep architecture. On the other hand, while the increased depth of the computational models increases the number of patterns that can be expressed efficiently, it becomes more difficult to find good solutions using techniques such as simple gradient descent.
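One way to see why simple gradient descent struggles as depth grows (a toy sketch under our own assumptions, using a scalar chain of sigmoid units rather than any model from this work): the chain rule multiplies one local derivative per layer, and for the sigmoid each factor is at most 0.25, so the gradient with respect to the input shrinks geometrically with depth.

```python
# Hedged toy sketch of vanishing gradients in a scalar chain of
# sigmoid units; the weight and input values are illustrative.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient_magnitude(depth, w=1.0, x=0.0):
    """Derivative of the chain's output w.r.t. x, by the chain rule."""
    h, grad = x, 1.0
    for _ in range(depth):
        h = sigmoid(w * h)
        grad *= w * h * (1.0 - h)   # local derivative of sigmoid(w*h)
    return grad

for d in (1, 5, 20):
    print(d, gradient_magnitude(d))
```

Each printed gradient is several orders of magnitude smaller than the previous one, so the error signal reaching the early layers of a deep model becomes vanishingly weak under plain gradient descent.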
Even the deep neural network architectures are severely limited, as the number of hidden layers (computational steps) is defined by the programmer in advance, as a hyper-parameter of the model. However, certain patterns cannot be described efficiently using a constant number of computational steps (for example, any algorithm that involves a loop with a variable number of steps, or recursive functions). Additionally, feedforward neural networks do not have the ability to represent patterns such as memory; however, practice shows