
Chapter 3. Tracking English Inclusions in German

The F-score for the space travel test data is almost 7 points lower than that obtained for the development set. This performance drop is caused by lower precision and recall. Although the classifier overgenerates on the development set for this particular domain, the fact that the scores on the unseen test data are relatively consistent across all three domains is a positive result. Moreover, each data set is relatively small, which makes it difficult to draw firm conclusions. In fact, the test and development data on space travel differ slightly in nature, as can be seen in Table 3.1. While both sets contain a similar percentage of English inclusions (2.8% versus 3%), those in the test set are repeated much less often than those in the development set, which is reflected in their type-token ratios of 0.33 and 0.15, respectively. The higher scores on the space travel development data could therefore be due to the higher number of repetitions of English inclusions in that set.
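The type-token ratio referred to here is simply the number of distinct word types divided by the total number of tokens, so a lower value indicates more repetition. A minimal sketch (the function name is illustrative, not taken from the thesis):

```python
def type_token_ratio(tokens):
    """Number of distinct types divided by total token count.

    A value near 1.0 means most tokens are unique; a low value
    means the same types recur frequently.
    """
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

For instance, a set of English inclusions in which each item appears on average three times would have a type-token ratio of roughly 0.33, matching the figure reported for the space travel test data.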

The full-system F-scores for the EU test data are considerably higher than for the development set (84.42 versus 65.22 points). This is not surprising, since the EU development data contains only 30 different English inclusions, less than 1% of all types, which made it an unusual data set on which to evaluate the classifier. Error analysis therefore focused mainly on the output for the other two data sets. The EU test data, on the other hand, contains 86 different English inclusions, i.e. roughly three times as many types as the development data (see Table 3.1). Considering that the English inclusion classifier performs as well on the unseen EU test data as on the other two test sets, it can be concluded that the system design decisions and post-processing rules are general enough to apply to documents from different domains.
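The F-scores compared here combine precision and recall into a single figure; assuming the standard balanced F-measure (harmonic mean, i.e. β = 1), the computation can be sketched as:

```python
def f_score(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier can only score highly when precision and recall are both reasonably high.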

The best overall F-scores on all six data sets are obtained when combining the full system with a second consistency-checking run (Internet test data: F=84.78, space travel test data: F=84.66, EU test data: F=84.68). This second run essentially ensures that all English inclusions found in the first run are classified consistently within each document. This is done by applying an on-the-fly gazetteer which is generated automatically; this setup was explained in more detail in Section 3.3.6. The results listed in Table 3.14 show that the improvement in F-score is always caused by an increase in recall that outweighs a smaller decrease in precision. While this improvement is essential for document classification, particularly when comparing different classifiers, it is unlikely to be beneficial when performing language classification on tokens in individual sentences, for example during the text analysis stage of a TTS synthesis system.
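The consistency-checking run described above can be sketched roughly as follows; the data layout and the "EN" label are illustrative assumptions for this sketch, not the thesis's own representation:

```python
def consistency_pass(documents):
    """Second pass: enforce document-wide label consistency.

    documents: list of documents, each a list of (token, label) pairs,
    where the label "EN" marks an English inclusion found by the
    first classification run (labels here are hypothetical).
    """
    relabelled = []
    for doc in documents:
        # Build an on-the-fly gazetteer from this document's
        # first-run output: every token already labelled English.
        gazetteer = {tok.lower() for tok, lab in doc if lab == "EN"}
        # Relabel every occurrence of a gazetteer entry as English,
        # raising recall at a small potential cost in precision.
        relabelled.append([
            (tok, "EN" if tok.lower() in gazetteer else lab)
            for tok, lab in doc
        ])
    return relabelled
```

This mirrors the behaviour reported in Table 3.14: any inclusion missed in one position but caught elsewhere in the same document is recovered (higher recall), while a token wrongly labelled English once may propagate its label (slightly lower precision).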
