Table 4.7: Results on Penn Treebank corpus (evaluation set) with different linear interpolation techniques.

Model                                                    PPL
Static LI of all models                                 83.5
Static LI of all models + dynamic RNNs with α = 0.5     80.5
Adaptive LI of all models + dynamic RNNs with α = 0.5   79.4
4.5.1 Adaptive Linear Combination
All the experiments above use fixed model weights in the combination, estimated on the PTB validation set. We have extended the usual linear combination of models to the case where the weights of all individual models are variable and are estimated during processing of the test data. The initial distribution of weights is uniform (every model has the same weight); as the test data are processed, we compute optimal weights based on the performance of the models on the history of the last several words, with the objective of minimizing perplexity. In principle, the weights can be estimated using the whole history. However, we found it useful to combine multiple history lengths: a combination whose interpolation weights are estimated on just a few preceding words can capture short-context characteristics that vary rapidly between individual sentences or paragraphs, while a combination whose weights depend on the whole history is the most robust.
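A minimal sketch of this scheme, in Python, might look as follows. The thesis only specifies uniform initial weights and perplexity-minimizing weights estimated on the last several words; the EM-style re-estimation, the window length, and the number of EM steps below are illustrative assumptions, not the thesis implementation.

import collections

class AdaptiveInterpolation:
    def __init__(self, num_models, window=50, em_steps=5):
        # Uniform initial distribution: every model gets the same weight.
        self.weights = [1.0 / num_models] * num_models
        self.em_steps = em_steps
        # Per-model probabilities of the last `window` observed words.
        self.history = collections.deque(maxlen=window)

    def predict(self, model_probs):
        """Interpolated probability of the next word; model_probs[m]
        is P_m(w | h) assigned by the m-th component model."""
        return sum(w * p for w, p in zip(self.weights, model_probs))

    def observe(self, model_probs):
        """After the word is seen, store what each model assigned to it
        and re-estimate weights to minimize perplexity on the window."""
        self.history.append(model_probs)
        for _ in range(self.em_steps):
            counts = [0.0] * len(self.weights)
            for probs in self.history:
                denom = max(sum(w * p for w, p in zip(self.weights, probs)), 1e-12)
                for m, p in enumerate(probs):
                    # Posterior responsibility of model m for this word.
                    counts[m] += self.weights[m] * p / denom
            total = sum(counts)
            self.weights = [c / total for c in counts]

To exploit multiple history lengths as described above, one could run several such interpolators with different window sizes (from a few words up to the whole history) and combine their predictions with a second interpolation.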
It should be noted that an important motivation for this approach is that a combination of adaptive and static RNN models with fixed weights is suboptimal. When the first word of the test data is processed, the static and adaptive models are identical. As more data is processed, the adaptive model is supposed to learn new information, and thus its optimal weight can change. If there is a sudden change of topic in the test data, the static model might perform better for several sentences, while if there are repeated sentences or names of people, the dynamic model can work better.
Further improvement was motivated by the observation that adaptation of RNN models with learning rate α = 0.1 usually leads to the best individual results, but models in a combination are more complementary if some of them are adapted with a larger learning rate. The results are summarized in Table 4.7. Overall, the adaptive linear interpolation provides a small improvement, and has an interesting advantage: it does not require any validation data for tuning the weights of the individual models.
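Since Table 4.7 and the paragraph above refer to dynamic RNN models adapted with learning rate α, a hypothetical sketch of such dynamic evaluation may help. This is not the thesis implementation: it assumes a PyTorch model with an LSTM-style interface (word id and hidden state in, logits and new hidden state out, accepting None as the initial state).

import math
import torch
import torch.nn.functional as F

def dynamic_evaluate(model, test_ids, alpha=0.5):
    """Score a stream of word ids while adapting the model on it."""
    optimizer = torch.optim.SGD(model.parameters(), lr=alpha)
    hidden = None
    total_log_prob = 0.0
    for cur, nxt in zip(test_ids[:-1], test_ids[1:]):
        logits, hidden = model(torch.tensor([[cur]]), hidden)
        loss = F.cross_entropy(logits.view(1, -1), torch.tensor([nxt]))
        total_log_prob -= loss.item()   # accumulate log P(next word)
        optimizer.zero_grad()
        loss.backward()                 # one adaptation step per word
        optimizer.step()
        # Detach the (h, c) state so gradients do not flow across words.
        hidden = tuple(h.detach() for h in hidden)
    return math.exp(-total_log_prob / (len(test_ids) - 1))  # perplexity

With a larger α (e.g., 0.5, as in Table 4.7), the individually adapted model is usually worse, but it diverges more from the static model and therefore complements it better in the combination.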