Statistical Language Models based on Neural Networks - Faculty of ...
There is great variability in how one can choose the size of the population, the probabilities that a certain action will be taken, the way both crossover and mutation operators affect the individuals, how the population in a new epoch is exactly formed, etc. It is even possible to apply a genetic algorithm to this optimization process itself, to avoid the need to tune these constants; this is called meta-genetic programming.

Genetic programming can be successfully applied to small problems, and it has been reported that many novel algorithms were found using these or similar approaches. However, for large problems, GP seems to be very inefficient; the hope is that through the recombination of solutions using the crossover operator, GP will find basic atomic functions first, and these will later be used to compose more complex functions. However, it is sometimes mentioned that GP itself is not really guaranteed to work better than other simple search techniques, and clearly for difficult large problems, random search is very inefficient, as the space of possible solutions increases exponentially with the length of their description.
My own attempts to train recurrent neural network language models using genetic algorithms were not very promising; even on small problems, stochastic gradient descent is orders of magnitude faster. However, certain problems cannot be learned efficiently by SGD, such as storing some information for long time periods. For toy problems, RNNs trained by genetic algorithms were easily able to learn patterns that otherwise cannot be learned by SGD (such as storing a single bit of information for 100 time steps). Thus, an interesting direction for future research might be to combine GP and SGD.
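The bit-storage toy task can be sketched as follows. This is an illustrative reconstruction, not the experimental setup used in this thesis: a simple genetic algorithm (truncation selection plus Gaussian mutation, no crossover) evolves the four weights of a single-unit recurrent network so that an input bit presented at the first time step can be read back after 100 steps. All hyperparameter values are arbitrary.

```python
import math
import random

T = 100          # number of time steps the bit must be remembered
POP = 50         # population size (illustrative value)
GENERATIONS = 60
SIGMA = 0.5      # mutation noise

def run_rnn(w, bit):
    """Genome w = (w_x, w_h, b, w_o); h_{t+1} = tanh(w_x*x_t + w_h*h_t + b)."""
    w_x, w_h, b, w_o = w
    h = 0.0
    for t in range(T):
        x = float(bit) if t == 0 else 0.0   # the bit is only visible at t = 0
        h = math.tanh(w_x * x + w_h * h + b)
    return 1 if w_o * h > 0.0 else 0        # read the bit back after T steps

def fitness(w):
    # fraction of the two possible input bits reproduced correctly
    return sum(run_rnn(w, bit) == bit for bit in (0, 1)) / 2.0

def evolve(seed=0):
    rng = random.Random(seed)
    pop = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(POP)]
    for _ in range(GENERATIONS):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == 1.0:
            break
        parents = pop[:POP // 5]            # truncation selection (elitist)
        pop = parents + [
            [g + rng.gauss(0.0, SIGMA) for g in rng.choice(parents)]
            for _ in range(POP - len(parents))
        ]
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

No gradient is ever computed: the error signal across the 100-step gap, which vanishes under backpropagation through time, is irrelevant here because selection only sees the final fitness of each genome.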
8.3 Incremental Learning<br />
The way humans naturally approach a complex problem is through its decomposition into smaller problems that are solved separately. Such an incremental approach can also be used for solving many machine learning problems; even more, it might be crucial for learning complex algorithms that are represented either by deep architectures or by complex computer programs. It can be shown that certain complex problems can be solved exponentially faster if additional supervision in the form of simpler examples is provided.
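This speed-up can be sketched concretely (the secret value below is an arbitrary example): an oracle that only answers correct/incorrect for a complete guess forces a search over all 10^6 candidates, while an oracle that confirms each digit separately reduces the search to at most 6 × 10 queries.

```python
SECRET = "417209"  # an arbitrary six-digit example

def brute_force_queries(secret):
    """Oracle only answers yes/no for a complete six-digit guess."""
    for i in range(10**6):
        if f"{i:06d}" == secret:
            return i + 1  # number of oracle queries used

def incremental_queries(secret):
    """Oracle additionally confirms each digit separately."""
    queries = 0
    found = ""
    for pos in range(6):
        for d in "0123456789":
            queries += 1
            if secret[pos] == d:
                found += d
                break
    assert found == secret
    return queries

print(brute_force_queries(SECRET))   # 417210 queries
print(incremental_queries(SECRET))   # 29 queries
```

Brute force needs 10^6/2 queries on average, while the incremental search needs at most 6 × 10; the gap between the two grows exponentially with the number of digits.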
Assume the task is to find a six-digit number, with supervision giving information on whether a candidate number is correct or not. On average, we would need 10^6/2 attempts to find the number. However, if we can search for the number incrementally with additional
106