PhD thesis - School of Informatics - University of Edinburgh
Chapter 5. Parsing English Inclusions
5.3.5.1 Gold Standard Phrase Categories
Table 5.5 lists the different types of phrase categories surrounding the 109 multi-word
English inclusions in the error analysis sample and their frequency.⁶ The last column
lists a typical example for each category. The figures illustrate that the majority of
multi-word English inclusions are contained in a proper noun (PN) phrase, including
names of companies, political parties, organisations, films, books, newspapers, etc.
The components of PN phrases tend to be marked with the grammatical function PNC,
proper noun component (Brants et al., 2002). A less frequent phrasal category of
English inclusions is chunk (CH), which tends to be used for slogans, quotes or
expressions like Made in Germany. The components of CH phrases are annotated with a
grammatical function of type UC (unit component). Even in this small sample, annotating
an English inclusion as PN rather than CH, or vice versa, can be misleading.
For example, the organisation Friends of the Earth is annotated as PN, whereas
another organisation, International Union for the Conservation of Nature, is marked as
CH in the gold standard. The latter is believed to be an inconsistency in the
annotation; it should have been marked as PN as well.
The phrase category of an English inclusion with the syntactic function of a noun
phrase which is neither a PN nor a CH is annotated as NP (noun phrase). One example
is Peace Enforcement, which is not translated into German and is used rather like a
buzzword in a sentence on UN missions. In this case, the POS tag of its individual
tokens is NN (noun). The fact that this expression is not German is therefore lost in
the gold standard annotation. Another example of an English inclusion NP in the gold
standard is Framingham Heart Study, which could arguably be of phrase category PN.
Furthermore, the sample contains an example annotated with phrase category CH,
Shopping Mall, which is an English noun phrase. The least frequent type of phrase
category used for English inclusions is CNP. In this sample, this category marks
company names that contain a coordination, for example Botts and Company. The POS tags
of the coordinated sisters are NE (named entity) and the English coordinating
conjunction and is tagged as KON.
Finally, there are also two cases where the English inclusion itself is not contained in
a phrase category. One of them is Chief Executives, which is clearly an NP. These are
believed to be annotation errors.
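The category assignments discussed above can be summarised in a small sketch. It is only an illustration: the list below contains the examples named in this section, not the 109 inclusions of the error analysis sample, and the names `GOLD_EXAMPLES`, `COMPONENT_FUNCTION`, `tally_categories` and `component_function` are hypothetical helpers introduced here, not part of the TIGER tooling. A category of `None` stands for an inclusion that is not contained in any phrase category.

```python
from collections import Counter

# (inclusion, gold standard phrase category) pairs taken only from the
# examples named in this section; this is NOT the full 109-item sample.
# None means the inclusion is not contained in a phrase category.
GOLD_EXAMPLES = [
    ("Friends of the Earth", "PN"),
    ("International Union for the Conservation of Nature", "CH"),
    ("Made in Germany", "CH"),
    ("Peace Enforcement", "NP"),
    ("Framingham Heart Study", "NP"),
    ("Shopping Mall", "CH"),
    ("Botts and Company", "CNP"),
    ("Chief Executives", None),
]

# Grammatical function (edge label) that the text associates with the
# components of a phrase category. Only PN and CH are described here.
COMPONENT_FUNCTION = {"PN": "PNC", "CH": "UC"}


def tally_categories(examples):
    """Count how often each phrase category occurs in the examples."""
    return Counter(category for _, category in examples)


def component_function(category):
    """Return the component edge label for a category, or None if unknown."""
    return COMPONENT_FUNCTION.get(category)


counts = tally_categories(GOLD_EXAMPLES)
for category, count in counts.most_common():
    print(category, count, component_function(category))
```

A tally of this kind makes it easy to spot the cases questioned above, such as an organisation name filed under CH or an inclusion with no phrase category at all.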
⁶ All phrase category (node) labels and grammatical function (edge) labels occurring in the TIGER
treebank annotation are listed and defined in Appendix C.