
2.3.4 Decision Trees and Random Forest Language Models

A decision tree can partition the data by asking a question about the history at every node. As these questions can be very general, decision trees were believed to have great potential: for example, it is possible to ask about the presence of a specific word among the last ten words of the history. However, in practice it was found that finding good decision trees can be quite difficult; even when it can be proved that very good decision trees exist, normal training techniques usually find only suboptimal ones. This has motivated work on random forest models, which combine many randomly grown decision trees (linear interpolation is usually used to combine the trees into a forest). For more information, see [78].
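To make the idea concrete, the following Python sketch (illustrative only, not the construction from [78]; the class names, the toy question and the toy leaf distributions are assumptions) shows how a single tree routes a history to a leaf distribution by asking questions about it, and how a forest estimate is obtained by linearly interpolating several trees:

class Node:
    def __init__(self, question=None, yes=None, no=None, distribution=None):
        self.question = question          # yes/no question about the history
        self.yes, self.no = yes, no       # child nodes (internal nodes only)
        self.distribution = distribution  # word -> probability (leaves only)

    def prob(self, word, history):
        if self.distribution is not None:              # leaf: stored estimate
            return self.distribution.get(word, 1e-10)  # small floor for unseen words
        branch = self.yes if self.question(history) else self.no
        return branch.prob(word, history)

def forest_prob(trees, weights, word, history):
    # Random forest estimate: linear interpolation of the individual trees.
    return sum(w * t.prob(word, history) for t, w in zip(trees, weights))

# Toy usage: one tree asking whether "stock" occurred among the last ten words.
leaf_yes = Node(distribution={"market": 0.3, "price": 0.2})
leaf_no = Node(distribution={"the": 0.1, "market": 0.01})
tree = Node(question=lambda h: "stock" in h[-10:], yes=leaf_yes, no=leaf_no)
print(forest_prob([tree], [1.0], "market", ["the", "stock"]))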

As the questions in the decision trees can be very general, these models have the potential to work well for languages with free word order as well as for inflectional languages, for example by asking questions about the morphology of the words in the history [59]. The drawback is again high computational complexity. Also, the improvements seem to diminish as the amount of training data grows. Thus, in some aspects these techniques behave similarly to the class based models.

2.3.5 Maximum Entropy Language Models

A maximum entropy (ME) model is an exponential model of the form

P(w|h) = \frac{e^{\sum_i \lambda_i f_i(w,h)}}{Z(h)}    (2.6)

where w is a word in a context h and Z(h) is used for normalizing the probability distribution:

Z(h) = \sum_{w_i \in V} e^{\sum_j \lambda_j f_j(w_i,h)}    (2.7)

thus it can be viewed as a model that combines many feature functions f_i(w, h). The problem of training an ME model is to find the weights λ_i of the features, and also to obtain a good set of these features, as the features themselves are not learned automatically from the data. Usual features are n-grams, skip-grams, etc.
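As a concrete illustration of equations (2.6) and (2.7), the following Python sketch (the toy features, weights and vocabulary are assumptions, and no training of the weights is performed) computes P(w|h) by exponentiating the weighted sum of feature functions and normalizing over the vocabulary:

import math

def me_probability(word, history, vocabulary, features, weights):
    # P(w|h) = exp(sum_i lambda_i f_i(w,h)) / Z(h), as in (2.6)
    def score(w):
        return math.exp(sum(lam * f(w, history) for f, lam in zip(features, weights)))
    z = sum(score(w) for w in vocabulary)   # normalization Z(h), as in (2.7)
    return score(word) / z

# Toy features: a bigram feature and a skip-gram-like feature (illustrative only).
features = [
    lambda w, h: 1.0 if h and h[-1] == "new" and w == "york" else 0.0,
    lambda w, h: 1.0 if len(h) >= 2 and h[-2] == "new" and w == "york" else 0.0,
]
weights = [2.0, 0.5]
vocab = ["york", "jersey", "the"]
print(me_probability("york", ["in", "new"], vocab, features, weights))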

ME models have shown big potential, as they can easily incorporate arbitrary features. Rosenfeld [64] used triggers and word features to obtain a very large perplexity improvement, as well as a significant word error rate reduction. There has been a lot of work recently done

