Statistical Language Models based on Neural Networks - Faculty of ...
2.3.4 Decision Trees and Random Forest Language Models
A decision tree can partition the data by asking a question about the history at every node. As these questions can be very general, decision trees were believed to have great potential: for example, it is possible to ask about the presence of a specific word anywhere in the last ten words of the history. In practice, however, it was found that finding good decision trees can be quite difficult; even when it can be proved that very good decision trees exist, standard training techniques usually find only suboptimal ones. This has motivated work on random forest models, which combine many randomly grown decision trees (linear interpolation is usually used to combine the trees into a forest). For more information, see [78].
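The scheme above can be illustrated with a toy sketch, not taken from the thesis: each tree node asks a randomly chosen binary question of the form "does word w occur exactly d positions back in the history?", leaves store next-word counts, and the forest averages the trees' smoothed distributions (linear interpolation with equal weights). The corpus, tree depth, and add-one smoothing are all invented for illustration.

```python
import random
from collections import Counter

random.seed(0)  # reproducible random tree growth

def asks(question, history):
    """Binary question: does `word` occur exactly `dist` positions back?"""
    dist, word = question
    return len(history) > dist and history[-(dist + 1)] == word

def grow_tree(events, depth):
    """Randomly partition (history, next_word) events; leaves hold counts."""
    if depth == 0 or len(events) < 4:
        return Counter(w for _, w in events)
    hist = random.choice(events)[0]          # sample a question from the data
    if not hist:
        return Counter(w for _, w in events)
    dist = random.randrange(len(hist))
    question = (dist, hist[-(dist + 1)])
    yes = [e for e in events if asks(question, e[0])]
    no = [e for e in events if not asks(question, e[0])]
    if not yes or not no:                    # useless split: make a leaf
        return Counter(w for _, w in events)
    return (question, grow_tree(yes, depth - 1), grow_tree(no, depth - 1))

def tree_prob(node, history, word, vocab):
    """Walk to a leaf, then return an add-one-smoothed probability."""
    while not isinstance(node, Counter):
        question, yes, no = node
        node = yes if asks(question, history) else no
    total = sum(node.values())
    return (node[word] + 1) / (total + len(vocab))

def forest_prob(trees, history, word, vocab):
    """Linear interpolation of the trees with equal weights."""
    return sum(tree_prob(t, history, word, vocab) for t in trees) / len(trees)

corpus = "the cat sat on the mat the dog sat on the rug".split()
order = 3                                    # history = up to 3 previous words
events = [(tuple(corpus[max(0, i - order):i]), corpus[i])
          for i in range(1, len(corpus))]
vocab = set(corpus)
trees = [grow_tree(events, depth=3) for _ in range(5)]

p = forest_prob(trees, ("sat", "on", "the"), "mat", vocab)
```

Because every tree reaches a single leaf for a given history and the leaf distribution is properly smoothed, each tree's probabilities sum to one over the vocabulary, and so does their average.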
As the questions in the decision trees can be very general, these models can potentially work well for languages with free word order as well as for inflectional languages, for example by asking questions about the morphology of the words in the history [59]. The drawback is again high computational complexity. Also, the improvements seem to diminish as the amount of training data grows. In some respects, these techniques thus behave similarly to class-based models.
2.3.5 Maximum Entropy Language Models
The maximum entropy (ME) model is an exponential model of the form
P(w|h) = \frac{e^{\sum_i \lambda_i f_i(w,h)}}{Z(h)}    (2.6)

where w is a word in a context h and Z(h) is used for normalizing the probability distribution:

Z(h) = \sum_{w_i \in V} e^{\sum_j \lambda_j f_j(w_i,h)}    (2.7)
thus it can be viewed as a model that combines many feature functions f_i(w, h). Training an ME model amounts to finding the weights λ_i of the features, and also to obtaining a good set of features, as these are not learned automatically from the data. Typical features are n-grams, skip-grams, etc.
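Equations (2.6) and (2.7) can be computed directly once a feature set and its weights are fixed. The following minimal sketch assumes an invented toy feature set (a bigram feature and a one-word skip-gram feature) and hand-picked weights; a real ME model would learn the weights, e.g. by maximum likelihood.

```python
import math

def features(word, history):
    """Toy binary feature functions (hypothetical): a bigram feature on the
    previous word and a skip-gram feature on the word two positions back."""
    feats = []
    if len(history) >= 1:
        feats.append(("bigram", history[-1], word))
    if len(history) >= 2:
        feats.append(("skip1", history[-2], word))
    return feats

def me_prob(word, history, weights, vocab):
    """P(w|h) = exp(sum_i lambda_i f_i(w,h)) / Z(h), as in eqs. (2.6)-(2.7)."""
    def score(w):
        return math.exp(sum(weights.get(f, 0.0) for f in features(w, history)))
    z = sum(score(w) for w in vocab)   # Z(h): sum over the whole vocabulary
    return score(word) / z

vocab = ["cat", "dog", "mat"]
weights = {("bigram", "the", "cat"): 1.2,   # invented example weights
           ("skip1", "on", "mat"): 0.7}
p = me_prob("cat", ("on", "the"), weights, vocab)
```

Note that Z(h) requires a sum over the entire vocabulary for every context, which is one source of the high computational cost of ME models.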
ME models have shown great potential, as they can easily incorporate arbitrary features. Rosenfeld [64] used triggers and word features to obtain a very large perplexity improvement, as well as a significant word error rate reduction. There has been a lot of work recently done