
Table 4.7: Results on Penn Treebank corpus (evaluation set) with different linear interpolation techniques.

Model                                                    PPL
Static LI of all models                                  83.5
Static LI of all models + dynamic RNNs with α = 0.5      80.5
Adaptive LI of all models + dynamic RNNs with α = 0.5    79.4

4.5.1 Adaptive Linear Combination

All the experiments above use fixed weights for the models in the combination, estimated on the PTB validation set. We have extended the usual linear combination of models to the case where the weights of all individual models are variable and are estimated during processing of the test data. The initial distribution of weights is uniform (every model has the same weight), and as the test data are processed, we compute optimal weights based on the performance of the models on the history of the last several words (the objective is to minimize perplexity). In theory, the weights can be estimated using the whole history. However, we found it beneficial to use multiple lengths of history: a combination whose interpolation weights are estimated from just a few preceding words can capture short-context characteristics that vary rapidly between individual sentences or paragraphs, while a combination whose weights depend on the whole history is the most robust.
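
As a concrete illustration, the following is a minimal sketch of such adaptive weighting, assuming each model exposes the probability it assigns to the current word given its history. The class name, the sliding-window EM re-estimation, and all parameter values are illustrative assumptions, not the exact procedure used in these experiments.

```python
from collections import deque

class AdaptiveInterpolation:
    """Linear interpolation of K language models whose weights are
    re-estimated on a sliding window of recently seen test words."""

    def __init__(self, n_models, window=100, em_iters=5):
        self.weights = [1.0 / n_models] * n_models  # uniform initial weights
        self.history = deque(maxlen=window)         # per-word model probabilities
        self.em_iters = em_iters

    def prob(self, model_probs):
        """Mixture probability of the current word; model_probs[k] is
        P_k(w | h), the probability assigned by model k."""
        return sum(w * p for w, p in zip(self.weights, model_probs))

    def update(self, model_probs):
        """Record the current word and re-estimate the weights by EM on
        the window, which minimizes the perplexity of the mixture on
        the recent history."""
        self.history.append(model_probs)
        for _ in range(self.em_iters):
            # E-step: responsibility of each model for each recent word
            counts = [0.0] * len(self.weights)
            for probs in self.history:
                denom = sum(w * p for w, p in zip(self.weights, probs))
                for k in range(len(self.weights)):
                    counts[k] += self.weights[k] * probs[k] / denom
            # M-step: new weights are the normalized responsibilities
            total = sum(counts)
            self.weights = [c / total for c in counts]
```

The multiple history lengths discussed above could then be realized by running several such interpolators with different window sizes and averaging their mixture probabilities.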

It should be noted that an important motivation for this approach is that a combination of adaptive and static RNN models with fixed weights is suboptimal. When the first word of the test data is processed, the static and adaptive models are identical. As more data is processed, the adaptive model is supposed to learn new information, and thus its optimal weight can change. If there is a sudden change of topic in the test data, the static model might perform better for several sentences, while if sentences or names of people repeat, the dynamic model can work better.

A further improvement was motivated by the observation that adaptation of RNN models with the learning rate α = 0.1 usually leads to the best individual results, but models in a combination are more complementary if some are processed with a larger learning rate. The results are summarized in Table 4.7. Overall, the adaptive linear combination provides a small improvement, and has an interesting advantage: it does not require any validation data for tuning the weights of the individual models.
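
For completeness, the "dynamic RNNs with α = 0.5" entries in Table 4.7 refer to RNN models that keep adapting on the test data itself. Below is a minimal sketch of that idea, assuming a PyTorch-style word-level RNN LM with the interface model(input, hidden) -> (logits, hidden); this interface and the use of plain SGD are assumptions for illustration, not the actual implementation used in the experiments.

```python
import math
import torch
import torch.nn.functional as F

def dynamic_evaluation(model, test_ids, alpha=0.5):
    """Score a test stream with a dynamically adapted RNN LM: each word
    is first scored, then immediately trained on with learning rate
    alpha, so the model keeps learning during evaluation."""
    opt = torch.optim.SGD(model.parameters(), lr=alpha)
    hidden = None
    total_nll, n = 0.0, 0
    for t in range(len(test_ids) - 1):
        x = test_ids[t].view(1, 1)            # current word
        y = test_ids[t + 1].view(1)           # next word to predict
        logits, hidden = model(x, hidden)
        loss = F.cross_entropy(logits.view(1, -1), y)
        total_nll += loss.item()              # score first...
        n += 1
        opt.zero_grad()
        loss.backward()                       # ...then adapt on the word
        opt.step()                            # that was just observed
        # detach the recurrent state so gradients do not flow back
        # through the whole test stream
        if isinstance(hidden, tuple):
            hidden = tuple(h.detach() for h in hidden)
        else:
            hidden = hidden.detach()
    return math.exp(total_nll / n)            # perplexity on the test set
```

Running several such models with different values of α (e.g., 0.1 and 0.5) would then produce the complementary set of models that is interpolated in Table 4.7.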

