PhD thesis - School of Informatics - University of Edinburgh
Chapter 3. Tracking English Inclusions in German
The F-score for the space travel test data is almost 7 points lower than that obtained for the development set, a drop caused by both lower precision and lower recall. Although the classifier overgenerates on the development set for this particular domain, the fact that the scores on the unseen test data are relatively consistent across all three domains is a positive result. Moreover, each data set is relatively small, which makes it difficult to draw firm conclusions. In fact, the test and development data on space travel differ slightly in nature, as can be seen in Table 3.1. While both sets contain a similar percentage of English inclusions (2.8% versus 3%), those in the test set are repeated far less often than those in the development set, which is reflected in their type-token ratios of 0.33 and 0.15, respectively. The higher scores on the space travel development data could therefore be due to the larger number of repetitions of English inclusions in that set.
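The type-token ratio referred to above is simply the number of distinct inclusion types divided by the total number of inclusion tokens. A minimal sketch of the computation (the token lists below are invented for illustration, not drawn from the thesis data):

```python
def type_token_ratio(tokens):
    """Distinct types divided by total tokens; lower values mean more repetition."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Invented toy data: heavy repetition drives the ratio down, as in the
# space travel development set (0.15) versus its test set (0.33).
repetitive = ["Shuttle", "Shuttle", "Shuttle", "Crew", "Crew", "Countdown"]
varied = ["Shuttle", "Crew", "Countdown", "Launch", "Orbit", "Docking"]
print(type_token_ratio(repetitive))  # 3 types / 6 tokens = 0.5
print(type_token_ratio(varied))      # 6 types / 6 tokens = 1.0
```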
The full-system F-scores for the EU test data are considerably higher than for the development set (84.42 versus 65.22 points). This is not surprising, since the EU development data contains only 30 different English inclusions, fewer than 1% of all types, which made it an unusual data set on which to evaluate the classifier. Error analysis was therefore focused mainly on the output for the other two data sets. The EU test data, on the other hand, contains 86 different English inclusions, i.e. nearly three times as many types as the development data (see Table 3.1). Considering that the English inclusion classifier performs as well on the unseen EU test data as on the other two test sets, it can be concluded that the system design decisions and post-processing rules are general enough to apply to documents from different domains.
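The F-scores quoted throughout (e.g. 84.42 versus 65.22) are the standard balanced F-measure, the harmonic mean of precision and recall. A brief sketch of the metric (the function and example values are illustrative, not the thesis's evaluation code):

```python
def f_score(precision, recall, beta=1.0):
    """Balanced F-measure: the harmonic mean of precision and recall when beta == 1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values only: with precision 0.85 and recall 0.84,
# the balanced F-score lands between the two, at about 84.5 points.
print(round(100 * f_score(0.85, 0.84), 2))
```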
The best overall F-scores on all six data sets are obtained when combining the full system with a second consistency-checking run (Internet test data: F=84.78, space travel test data: F=84.66, EU test data: F=84.68). This second run essentially ensures that all English inclusions found in the first run are classified consistently within each document, by applying an automatically generated on-the-fly gazetteer; this setup was explained in more detail in Section 3.3.6. The results listed in Table 3.14 show that the improvement in F-score is always caused by an increase in recall that outweighs a smaller decrease in precision. While this improvement is essential for document classification, particularly when comparing different classifiers, it is unlikely to be beneficial when performing language classification on tokens in individual sentences, for example during the text analysis of a TTS synthesis system.
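A consistency-checking pass of this kind can be sketched as follows. The token/label representation and the "EN"/"DE" tags are assumptions made for illustration; the thesis's actual implementation (Section 3.3.6) may differ:

```python
def consistency_check(document):
    """Second pass: collect every token labelled English in the first pass
    into an on-the-fly, per-document gazetteer, then relabel all other
    occurrences of those tokens as English as well."""
    gazetteer = {token for token, lang in document if lang == "EN"}
    return [(token, "EN" if token in gazetteer else lang)
            for token, lang in document]

# An occurrence missed in the first pass is recovered (recall rises), at the
# risk of relabelling a genuine German homograph (precision dips slightly).
doc = [("Das", "DE"), ("Shuttle", "EN"), ("und", "DE"), ("Shuttle", "DE")]
print(consistency_check(doc))
```

Building the gazetteer per document, rather than globally, matches the description above: consistency is enforced within each document, so a token used as English in one text does not force that label onto every other text.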